## Capstone Project - detecting emotions in social media posts 

Reddit is a popular website for social news, web content and discussion of almost any topic. Posts are organized by subject into user-created boards called "communities" or "subreddits", which range from topics that affect the world, such as politics and science, to more personal content, such as hobbies and dealing with emotions. Registered members must follow strict posting rules, and each community has its own group of volunteer administrators who moderate the posts from time to time.

Since Covid struck at the end of 2019, many people around the world have been negatively affected. School-going children, working adults and retired seniors had to restrict their outdoor activities and stay home to keep safe, while adapting to doing most things online.

With so much time spent at home, the lines between work and rest have blurred, and many have reported burnout. Mental health has become the next buzzword, and everyone needs to prioritise it in order to function well. But how many resources are available to those in need? And how many of us actually reach out?

It seems that, as technology has become more advanced and efficient, many prefer to get help online rather than face-to-face. In fact, some simply vent online in the hope that it eases their woes.

For this project, I am interested in exploring whether joining subreddits about negative emotions (such as r/depressed) helps or worsens one's mental health.

**Data Collection**

I've chosen to scrape some 500 comments from the subreddit r/depressed to study what its users post about.

In [1]:
import requests
import pandas as pd
import time

In [2]:
# Stating the parameters
url = 'https://api.pushshift.io/reddit/search/comment'
params = {
    'subreddit': 'depressed',
    'size': 100,
    'before': 1641398400  # epoch seconds: 2022-01-05 16:00 UTC (6 Jan 2022, 12am local time)
}
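The `before` value is a Unix timestamp in seconds. As a minimal sketch of how such a cutoff can be derived rather than hard-coded, the snippet below converts a local datetime to epoch seconds. It assumes the local time zone is UTC+8, which is not stated in the notebook but is the offset at which 1641398400 falls exactly on midnight of 6 Jan 2022; the names `LOCAL_TZ`, `cutoff_local` and `before_epoch` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the author's local time zone is UTC+8 (not stated in the notebook).
LOCAL_TZ = timezone(timedelta(hours=8))

# Midnight on 6 Jan 2022, local time.
cutoff_local = datetime(2022, 1, 6, 0, 0, tzinfo=LOCAL_TZ)

# Convert to epoch seconds, the unit used by the 'before' parameter above.
before_epoch = int(cutoff_local.timestamp())
print(before_epoch)

# Round-trip back to UTC as a sanity check.
cutoff_utc = datetime.fromtimestamp(before_epoch, tz=timezone.utc)
print(cutoff_utc.isoformat())
```

Using a timezone-aware datetime avoids the result depending on the time zone of whichever machine runs the notebook.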

In [3]:
res = requests.get(url, params=params)
res.status_code  # HTTP status code; 200 means the request succeeded

200

In [4]:
data = res.json()
posts = data['data']
len(posts)  # Check that we extracted 100 comments

100

In [5]:
# Create dataframe for r/depressed comments.
df_depressed = pd.DataFrame(posts)

In [6]:
# Track the total time taken to scrape the remaining comments
depressed_total_time = 0

# Extract another 400 comments (4 requests of 100) to reach 500 rows in df_depressed
for i in range(4):
    start_time = time.time()
    # Page backwards: ask for comments older than the last one already collected
    params = {'subreddit': 'depressed', 'size': 100, 'before': posts[-1]['created_utc']}
    response = requests.get(url, params=params)
    data = response.json()
    posts = data['data']
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df_depressed = pd.concat([df_depressed, pd.DataFrame(posts)])
    end_time = time.time()
    exe_time = end_time - start_time
    depressed_total_time += exe_time
    time.sleep(2)  # pause between requests to avoid overloading the API

# Check that we now have 500 comments
df_depressed.shape

(500, 47)
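The loop above pages backwards in time by reusing the oldest `created_utc` seen so far as the next `before` cutoff. The sketch below isolates that pattern so it can be exercised without network access. It is not the notebook's code: `collect_comments`, `fetch_page` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that assumes the real endpoint returns comments strictly older than `before`, newest first.

```python
from typing import Callable, Dict, List


def collect_comments(
    fetch_page: Callable[[int], List[Dict]],
    start_before: int,
    target: int = 500,
) -> List[Dict]:
    """Page backwards in time until `target` comments are collected.

    `fetch_page(before)` is a hypothetical callable that returns comments
    created strictly before the given epoch second.
    """
    collected: List[Dict] = []
    seen_ids = set()
    before = start_before
    while len(collected) < target:
        page = fetch_page(before)
        if not page:
            break  # the source has no older comments left
        for comment in page:
            # Skip anything already collected on an earlier page.
            if comment["id"] not in seen_ids:
                seen_ids.add(comment["id"])
                collected.append(comment)
        # The oldest comment on this page becomes the next cutoff.
        before = min(c["created_utc"] for c in page)
    return collected[:target]


def fake_fetch(before: int, page_size: int = 100) -> List[Dict]:
    """Stand-in for the API: the comment with id t was 'created' at epoch t."""
    newest = before - 1
    oldest_exclusive = max(newest - page_size, 0)
    return [
        {"id": str(t), "created_utc": t, "body": "..."}
        for t in range(newest, oldest_exclusive, -1)
    ]


comments = collect_comments(fake_fetch, start_before=1000, target=500)
print(len(comments), comments[0]["created_utc"], comments[-1]["created_utc"])
```

In the notebook's setting, `fetch_page` could wrap the `requests.get` call shown earlier. Stopping on an empty page and de-duplicating by `id` guard against a short or overlapping response, which the fixed `range(4)` loop does not check for.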

In [7]:
# Taking a look at the column headings. 
df_depressed.columns

Index(['all_awardings', 'archived', 'associated_award', 'author',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext', 'author_flair_template_id',
       'author_flair_text', 'author_flair_text_color', 'author_flair_type',
       'author_fullname', 'author_patreon_flair', 'author_premium', 'body',
       'body_sha1', 'can_gild', 'collapsed', 'collapsed_because_crowd_control',
       'collapsed_reason', 'collapsed_reason_code', 'comment_type',
       'controversiality', 'created_utc', 'distinguished', 'gilded',
       'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow',
       'parent_id', 'permalink', 'retrieved_utc', 'score', 'score_hidden',
       'send_replies', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_name_prefixed', 'subreddit_type', 'top_awarded_type',
       'total_awards_received', 'treatment_tags', 'unrepliable_reason'],
      dtype='object')

In [8]:
# Checking one comment to understand what each column is.
df_depressed.iloc[0]

all_awardings                                                                     []
archived                                                                       False
associated_award                                                                None
author                                                              mushroomrisottoo
author_flair_background_color                                                   None
author_flair_css_class                                                          None
author_flair_richtext                                                             []
author_flair_template_id                                                        None
author_flair_text                                                               None
author_flair_text_color                                                         None
author_flair_type                                                               text
author_fullname                                                  

In [9]:
# Keeping only the columns I'm interested in.
# .copy() gives an independent frame, so later edits do not raise SettingWithCopyWarning.
df = df_depressed[['author', 'created_utc', 'body']].copy()

In [10]:
# Confirming that there are no missing values and dtypes are correct.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 0 to 99
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   author       500 non-null    object
 1   created_utc  500 non-null    int64 
 2   body         500 non-null    object
dtypes: int64(1), object(2)
memory usage: 15.6+ KB
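`df.info()` confirms there are no nulls, but it cannot tell whether a `body` holds real text. Reddit shows `[removed]` or `[deleted]` in place of comments that were moderated or withdrawn, and such rows count as non-null. The sketch below shows, on a small made-up frame, one way to drop those placeholders and exact duplicates and to make `created_utc` human-readable. The frame `toy` and its values are invented for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Made-up rows that mimic the three columns kept above.
toy = pd.DataFrame({
    "author": ["user_a", "user_b", "[deleted]", "user_a"],
    "created_utc": [1641398000, 1641397000, 1641396000, 1641398000],
    "body": ["I feel stuck.", "[removed]", "[deleted]", "I feel stuck."],
})

# Placeholder bodies are non-null strings, so info() does not flag them.
placeholders = {"[removed]", "[deleted]"}
is_placeholder = toy["body"].str.strip().isin(placeholders)

# Exact repeats of the same comment (same author, time and text).
is_duplicate = toy.duplicated(subset=["author", "created_utc", "body"])

clean = toy[~is_placeholder & ~is_duplicate].copy()

# Convert epoch seconds to timezone-aware timestamps for easier reading.
clean["created_at"] = pd.to_datetime(clean["created_utc"], unit="s", utc=True)
print(clean)
```

Whether placeholder rows should be dropped or kept depends on the analysis; the point here is only that they need a separate check.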


In [13]:
# Save data to csv
df.to_csv('./datasets/df_depressed.csv', index=False)