# Project 3

Done by: Goh Chun Shan, DSIF 7

## Overview of project notebooks:

**1 - Project Overview and Data Acquisition through Webscraping** (current notebook)

2 - Data Preprocessing and Exploratory Data Analysis

3 - Model Tuning and Insights

### Instructions:
1. Using [Pushshift's](https://github.com/pushshift/api) API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

### Project Description:

Reddit is a social news, content, and discussion website. Posts are organised by subject into user-created 'subreddits', which cover practically any topic imaginable. Members submit content (such as images, text, and links) to subreddits, which can then be voted up ('upvote') or down ('downvote') by other members.

The two subreddits that I have chosen for this project are: **Social Anxiety** & **OCD** (Obsessive-Compulsive Disorder). These are two distinct mental health conditions. The goal of this project is to create a classification model that predicts, as accurately as possible, which subreddit a given post belongs to.

###### A) Social Anxiety
Definition: Social anxiety disorder, also called social phobia, is a long-term and overwhelming fear of social situations. It's a common problem that usually starts during the teenage years. It can be very distressing and have a big impact on your life.

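###### B) Obsessive-compulsive disorder
Definition: Obsessive-compulsive disorder (OCD) is a mental health condition characterised by recurring, unwanted thoughts (obsessions) and repetitive behaviours (compulsions) that a person feels driven to perform. It is distinct from obsessive-compulsive personality disorder, which is characterised by excessive orderliness, perfectionism, and a need for control.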

##### Usefulness:
Awareness of mental health disorders has increased in recent years, helped by the popularity of the theme on social media. The younger generation is more comfortable using social media to post about their daily lives, including their struggles, and in severe cases a post may be a cry for help from someone who does not know where to seek it. A model that accurately sorts posts can support online surveillance of mental health conditions by picking out common combinations of keywords in posts using CountVectorizer. These users can then be pointed to the right resources.

Through this exercise, we can also raise awareness of mental health conditions and roll out campaigns that educate healthcare workers and the general public about the words that differentiate mental illnesses, reduce the stigma against them, and correct common misunderstandings. For example, 'OCD' is commonly used to describe people who like to keep their living spaces clean, which is not what the term actually means. As people become numb to this misuse, it may drown out the voices of those who are seeking help online.

##### Method:
In this project, I create a classification model that predicts, as accurately as possible, which subreddit a given post belongs to. To identify a production model, several preliminary models are tested and evaluated on their accuracy scores (i.e. the proportion of predictions they get right).
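
As a rough illustration of the accuracy metric mentioned above, the sketch below computes it by hand. The helper name `accuracy` and the example labels are made up for illustration and are not part of the project code.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five posts (made up for illustration)
y_true = ["OCD", "socialanxiety", "OCD", "OCD", "socialanxiety"]
y_pred = ["OCD", "socialanxiety", "socialanxiety", "OCD", "socialanxiety"]

print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```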

##### Possible Extension of project: 
If an indicator of case severity were available, we could also train a model to pick out the most severe cases and monitor social media posts for people who may be seeking help.

## Data Dictionary

The combined dataset:

|Feature|Type|Description|
|---|---|---|
|**subreddit**|*str*|which subreddit each post originates from| 
|**title**|*str*|title of each reddit post|
|**selftext**|*str*|text of each reddit post|
|**fulltext**|*str*|combination of title and text of each reddit post|
|**id**|*str*|unique ID of each reddit post|
|**score**|*float*|net score of each post (upvotes minus downvotes)|
|**upvote_ratio**|*float*|proportion of upvotes out of all votes the post received|

The **fulltext** column is engineered by concatenating **title** and **selftext**.

In [2]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
#from tqdm import tqdm

### A) Starter code - Introduction to pushshift

Follow-along code from the video "Introduction to Pushshift" to learn how to scrape data from subreddits.

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
params = {
    'subreddit' : 'eating_disorders',
    'size': 200
}

In [4]:
res = requests.get(url, params)

In [5]:
res.status_code

200

In [6]:
data = res.json()
posts = data['data']
df = pd.DataFrame(posts)

In [7]:
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,thumbnail_height,thumbnail_width,title,total_awards_received,treatment_tags,upvote_ratio,url,url_overridden_by_dest,whitelist_status,wls
0,[],False,Affectionate_Area830,,[],,,,text,t2_7cxchjel,...,,,Losing weight is triggering,0,[],1.0,https://www.reddit.com/r/eating_disorders/comm...,,no_ads,0.0
1,[],False,Latter_Nothing_2651,,[],,,,text,t2_pkjhlq9m,...,,,rant.,0,[],1.0,https://www.reddit.com/r/eating_disorders/comm...,,no_ads,0.0
2,[],False,Latter_Nothing_2651,,[],,,,text,t2_pkjhlq9m,...,,,I hate how I look but theres nothing I can do ...,0,[],1.0,https://www.reddit.com/r/eating_disorders/comm...,,no_ads,0.0
3,[],False,Jemmayeetyeet,,[],,,,text,t2_2mtig4cx,...,,,slippery slope,0,[],1.0,https://www.reddit.com/r/eating_disorders/comm...,,no_ads,0.0
4,[],False,ThemBones-,,[],,,,text,t2_mw2t9tfi,...,140.0,140.0,Making so much progress. Food is fuel. FEED TH...,0,[],1.0,https://www.reddit.com/gallery/z2n912,https://www.reddit.com/gallery/z2n912,no_ads,0.0


In [8]:
# Keep only the columns needed to inspect the posts
post_subset = df[['subreddit', 'selftext', 'title']]

In [9]:
post_subset.head()

Unnamed: 0,subreddit,selftext,title
0,eating_disorders,I'm not sure if this is the right place to pos...,Losing weight is triggering
1,eating_disorders,"I'm not anorexic, but I diligently count calor...",rant.
2,eating_disorders,I know I shouldn't be losing more but what els...,I hate how I look but theres nothing I can do ...
3,eating_disorders,i have a restrictive ed and i get a lot of anx...,slippery slope
4,eating_disorders,,Making so much progress. Food is fuel. FEED TH...


In [10]:
posts[0]['created_utc']

1669229787

In [11]:
post_subset['selftext']

0      I'm not sure if this is the right place to pos...
1      I'm not anorexic, but I diligently count calor...
2      I know I shouldn't be losing more but what els...
3      i have a restrictive ed and i get a lot of anx...
4                                                       
5      I'm 15F and I don't know anyone who understand...
6      I'm a 20 year old female. I'm a mom to one lit...
7      I’ve been eating terribly and I keep failing a...
8      Some context, I’m a Bodybuilder that tracks ca...
9      So i guess im just asking if its possible for ...
10     So I've been struggling w an eating disorder b...
11     Hello everyone \nMy girlfriend has bulimia. I ...
12                                             [removed]
13     (New here) \nI’ve had an Ed before and I found...
14     I try to restrict as much as possible, and hid...
15     I (23F) have an irrationally severe fear of ch...
16                                                      
17     I've recovered (physical

### B) Actual Data Extraction 

As both subreddits are extremely active, with hundreds of posts daily, we aim to collect up to 200 posts from each subreddit for each day over roughly the last 10 months. Unix timestamps are measured in seconds, so consecutive days are 86,400 seconds apart. We step back one day at a time for 300 days and use each timestamp as the `before` cutoff of a request, which gives sufficient rows of data.

In [4]:
# Consecutive days are 86,400 seconds apart in Unix (UTC) time
# We wish to get up to 200 rows of data for each of the past 300 days

start_time = 1668729600 # UTC Timing: 0000 hours on 18 Nov 2022
time_list = []

for x in range(300):
    time_list.append(start_time - x*86400) 
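
As a cross-check on the hard-coded `start_time`, the same list can be derived from a calendar date. This is a minimal sketch using only the standard library; the helper name `daily_timestamps` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def daily_timestamps(start, days):
    """Return `days` Unix timestamps, one per day, going backwards from `start`."""
    start_ts = int(start.timestamp())
    return [start_ts - i * SECONDS_PER_DAY for i in range(days)]


start = datetime(2022, 11, 18, tzinfo=timezone.utc)  # 0000 hours UTC on 18 Nov 2022
time_list = daily_timestamps(start, 300)

print(time_list[0])  # 1668729600, the same value as start_time above
# Date of the earliest cutoff in the list
print(datetime.fromtimestamp(time_list[-1], tz=timezone.utc).date())
```

Building the list from a timezone-aware `datetime` makes the intended start date explicit and avoids mistakes when converting dates to timestamps by hand.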

In [5]:
url = 'https://api.pushshift.io/reddit/search/submission'

For Social Anxiety subreddit

In [29]:
data_social_anxiety = []

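for timestamp in time_list:  # named `timestamp` to avoid shadowing the imported `time` module

    # Request up to 200 posts submitted before this timestamp
    params_sa = {
        'subreddit': 'socialanxiety',
        'size': 200,
        'before': timestamp
    }

    res_sa = requests.get(url, params=params_sa)

    # Parse the response and keep only the columns needed for this project
    data_sa = res_sa.json()
    posts_sa = data_sa['data']
    df_sa = pd.DataFrame(posts_sa)
    temp_post_subset = df_sa[['subreddit','selftext','title','id','score','upvote_ratio']]

    data_social_anxiety.append(temp_post_subset)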

In [30]:
df_sa = pd.concat(data_social_anxiety)

In [53]:
df_sa.to_csv('sa_raw_data.csv', index=False)

In [40]:
df_sa.head()

Unnamed: 0,subreddit,selftext,title,id,score,upvote_ratio
0,socialanxiety,Pretty much the title. When I'm not doing amaz...,Does anyone else fixate on times of day for in...,yy5k9q,1,1.0
1,socialanxiety,I often feel like people in public from my com...,Is this schizophrenic?,yy5j8s,1,1.0
2,socialanxiety,"Yeah, how was your day?","Hello, how was your day?",yy5cgr,1,1.0
3,socialanxiety,I just wasn’t happy when I was with her lately...,Broke up with my first gf an hour ago and I’m ...,yy58ea,1,1.0
4,socialanxiety,This classmate had me disoriented all day beca...,Classmate gifted me a bag of cookies today.,yy53jh,1,1.0


For OCD subreddit

In [42]:
data_obessivecompulsivedisorder = []

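for timestamp in time_list:  # named `timestamp` to avoid shadowing the imported `time` module

    # Request up to 200 posts submitted before this timestamp
    params_ocd = {
        'subreddit': 'OCD',
        'size': 200,
        'before': timestamp
    }

    res_ocd = requests.get(url, params=params_ocd)

    # Parse the response and keep only the columns needed for this project
    data_ocd = res_ocd.json()
    posts_ocd = data_ocd['data']
    df_ocd = pd.DataFrame(posts_ocd)
    temp_post_subset = df_ocd[['subreddit','selftext','title','id','score','upvote_ratio']]

    data_obessivecompulsivedisorder.append(temp_post_subset)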

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
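
The error above is raised when a response body is empty or is not valid JSON, so `.json()` has nothing to parse. One way to guard against it is sketched below. The helper `extract_posts` and the `FakeResponse` class are hypothetical names introduced for illustration; `FakeResponse` only stands in for a `requests` response so that the sketch runs without network access.

```python
import json


def extract_posts(response):
    """Return the list under 'data' in a Pushshift-style response, or [] if unusable."""
    if response.status_code != 200:
        return []
    try:
        payload = response.json()
    except ValueError:  # JSON decode errors are subclasses of ValueError
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("data", [])


class FakeResponse:
    """Stand-in for a requests response (illustration only)."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def json(self):
        return json.loads(self.text)


good = FakeResponse(200, '{"data": [{"id": "abc123", "title": "example"}]}')
empty = FakeResponse(200, "")  # an empty body cannot be parsed as JSON
failed = FakeResponse(500, "Server error")

print(len(extract_posts(good)))  # 1
print(extract_posts(empty))      # []
print(extract_posts(failed))     # []
```

Inside the loop, the result of `requests.get(...)` could be passed to `extract_posts`, and the iteration skipped when the returned list is empty, so that one bad response does not stop the whole collection run.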

In [44]:
df_ocd = pd.concat(data_obessivecompulsivedisorder)

In [45]:
df_ocd.to_csv('ocd_raw_data.csv', index=False)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df_combined = pd.concat([df_ocd, df_sa])

In [None]:
df_combined.shape

### C) Simple Data cleaning

Data cleaning steps:
1. Remove duplicates; there should be a minimum of 5,000 unique posts from each subreddit
2. Combine the title and text into a single string
3. Set a minimum word count of 30 for valid posts, and remove posts that are blank, removed, or picture-only
4. Check whether the remaining dataset has sufficient posts from each subreddit

#### Combine Title and Body into one column (fulltext)

In [17]:
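# Concatenate title and body; fillna('') stops a missing value from turning the whole string into NaN
df_sa['fulltext'] = df_sa['title'].fillna('') + ' ' + df_sa['selftext'].fillna('')
df_ocd['fulltext'] = df_ocd['title'].fillna('') + ' ' + df_ocd['selftext'].fillna('')
df_combined['fulltext'] = df_combined['title'].fillna('') + ' ' + df_combined['selftext'].fillna('')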

df_combined.head()

Unnamed: 0,subreddit,selftext,title,fulltext
0,OCD,"I'll go first, I'm 28F. I have a lot of childh...",What are your most disturbing/ disgusting intr...,What are your most disturbing/ disgusting intr...
1,OCD,How do you all cope with OCD and live on?,"life, coping with OCD","life, coping with OCD How do you all cope with..."
2,OCD,"I’ll try to make this as concise as possible, ...","Existential OCD, and fear of Psychosis OCD?","Existential OCD, and fear of Psychosis OCD? I’..."
3,OCD,I suffer from many different kinds of OCD and ...,OCD is ruining my life and can potentially rui...,OCD is ruining my life and can potentially rui...
4,OCD,Last night I went to see Black Panther 2 with ...,"Do the exposure, babe!","Do the exposure, babe! Last night I went to se..."


#### Remove duplicates

In [46]:
print(df_ocd.shape)
# Drop rows that are identical across all columns, keeping the first occurrence
df_ocd_unique = df_ocd.drop_duplicates()
print(df_ocd_unique.shape)

(14570, 6)
(9623, 6)


In [47]:
print(df_sa.shape)
# Drop rows that are identical across all columns, keeping the first occurrence
df_sa_unique = df_sa.drop_duplicates()
print(df_sa_unique.shape)

(59923, 6)
(26585, 6)


In [49]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df_combined_unique = pd.concat([df_sa_unique, df_ocd_unique])
df_combined_unique.shape

(36208, 6)
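
The duplicates arise because consecutive requests can return overlapping sets of posts. A possible refinement, assuming the `id` column uniquely identifies a post, is to deduplicate on `id` alone rather than on the full row. The toy data below is made up to show the difference; it is a sketch, not the result of the actual run.

```python
import pandas as pd

# Toy data (made up): two overlapping request windows both return post "b2"
window_1 = pd.DataFrame({"id": ["a1", "b2"], "title": ["first", "second"], "score": [1, 1]})
window_2 = pd.DataFrame({"id": ["b2", "c3"], "title": ["second", "third"], "score": [2, 1]})

combined = pd.concat([window_1, window_2], ignore_index=True)

# Full-row comparison keeps both copies of "b2" if any field differs between them
print(len(combined.drop_duplicates()))  # 4

# Comparing on the post id alone keeps one row per post
unique_posts = combined.drop_duplicates(subset="id", keep="first")
print(len(unique_posts))  # 3
```

Whether this changes the counts above depends on whether any field actually differs between repeated copies of a post in the collected data.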

In [50]:
df_ocd_unique.to_csv('ocd_unique_data.csv', index=False)

In [51]:
df_sa_unique.to_csv('sa_unique_data.csv', index=False)

In [52]:
df_combined_unique.to_csv('combined_unique_data.csv', index=False)

#### Filter for word count 30 and above only

In [None]:
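# Count whitespace-separated words in each post's combined title and body
# TODO: also omit special characters and punctuation from this count
df_combined['words'] = df_combined['fulltext'].fillna('').str.split().str.len()
# Keep only posts with at least 30 words, then drop the helper column
df_combined_filtered = df_combined[df_combined['words'] >= 30]
df_combined2 = df_combined_filtered.drop(columns=['words'])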
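
To leave special characters and punctuation out of the word count, one option is a small regex-based counter. The function name `count_words` and the token pattern are assumptions made for this sketch; other tokenisation rules would be equally reasonable.

```python
import re

# A token is a run of word characters, optionally joined by straight or curly apostrophes
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*")


def count_words(text):
    """Count word-like tokens, ignoring punctuation and special characters."""
    if not isinstance(text, str):
        return 0  # treat missing values (e.g. NaN) as empty posts
    return len(TOKEN_PATTERN.findall(text))


print(count_words("I’ve been eating, terribly!!! :("))  # 4
print(count_words("[removed]"))                         # 1
print(count_words(""))                                  # 0
```

It could then be applied per row, for example `df_combined['fulltext'].apply(count_words)`, in place of the whitespace-based count.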

In [38]:
#from nltk.corpus import stopwords  # requires nltk and its 'stopwords' corpus
#list_stopwords = stopwords.words('english')

### D) Export cleaned data to csv

In [64]:
# Export the filtered dataframe produced by the word-count step above
df_combined2.to_csv('clean_data.csv', index=False)