## Project 3 - NLP Binary Classification of Subreddits

### Subreddits Used: r/biology & r/biochemistry

#### Function to Pull Submission Data

In [1]:
#import libraries
import requests
import pandas as pd
import numpy as np

In [2]:
#the url for pulling submissions
url_submission = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
#maps to the params of pushshift api
params = {
    'subreddit': 'biology',
    'size': 2, 
}
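
For reference, `requests.get` turns the `params` dictionary into the query string of the URL. The sketch below reproduces that encoding with the standard library's `urllib.parse.urlencode`, so the final URL can be inspected without sending a request. `build_url` is a hypothetical helper written for this illustration and is not used elsewhere in the notebook.

```python
from urllib.parse import urlencode

def build_url(base_url, params):
    """Return base_url with params appended as a query string (illustrative helper)."""
    return f"{base_url}?{urlencode(params)}"

url_submission = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'biology', 'size': 2}

# the URL that a GET request with these params would target
full_url = build_url(url_submission, params)
print(full_url)
```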

In [167]:
#use the params in the get method
req = requests.get(url_submission,params)

In [168]:
#a status code of 200 confirms the request succeeded
req.status_code

200

In [169]:
#parse the json response into a python dictionary
data = req.json()

In [173]:
#create first part of bio df
bio_posts = data['data']

In [174]:
#size=2 was requested above, so two posts come back
#(the pushshift api returns 25 posts by default; the largest size you can pull is 100)
len(bio_posts)

2

In [175]:
#this shows the code to grab the utc from each row
bio_posts[1]['created_utc']

1610905071
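
The `created_utc` value is a Unix timestamp (seconds since 1970-01-01 UTC), which is why it can be passed straight to the `before` parameter later on. As a small sanity check, the sketch below converts the value printed above into a readable date using only the standard library. The variable names are my own.

```python
from datetime import datetime, timezone

created_utc = 1610905071  # value shown in the output above

# interpret the integer as seconds since the Unix epoch, in UTC
posted_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(posted_at.isoformat())
```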

In [176]:
#a list of dictionaries converts directly into a pandas dataframe
bio_df = pd.DataFrame(bio_posts)
bio_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls
0,[],False,Hjecom,,[],,text,t2_9sv5rfl9,False,False,...,2186774,public,self,Question about basic genetics,0,[],1.0,https://www.reddit.com/r/biology/comments/kzb0...,all_ads,6
1,[],False,SureInstruction5625,,[],,text,t2_91upnx5c,False,False,...,2186764,public,self,high school science online journal club,0,[],1.0,https://www.reddit.com/r/biology/comments/kzat...,all_ads,6


In [4]:
def subreddit_pull(df,subreddit):
    
    '''
    Arguments: take in starting df, and subreddit name as a string
    Output: returns a dataframe of submission data from the subreddit
    URL: contains url for pulling submission data
    '''
    while len(df) < 10_000:
        url = 'https://api.pushshift.io/reddit/search/submission'
        
        utc = df['created_utc'].min()
        params = {
        'subreddit': subreddit,
        'size': 100,
        'before': utc
        }
    
        req = requests.get(url,params)
        
        if req.status_code == 200 or req.status_code==300:
            data = req.json()
            new_posts = data['data']
            new_df = pd.DataFrame(new_posts)
            
            #cleaning: replace the [removed] and [deleted] placeholders with np.nan
            new_df.replace('[removed]',np.nan,inplace=True)
            new_df.replace('[deleted]',np.nan,inplace=True)
            
            #drop rows whose title is null; rows with a null selftext are dropped later
            if new_df['title'].isna().sum() > 0:
                new_df.dropna(axis=0,how='any',subset=['title'],inplace=True)
                    
            df = pd.concat([df,new_df],ignore_index=True,axis=0)
        else:
            continue
        
    return df

#### subreddit_pull Explained:

The function subreddit_pull pulls additional submissions from the designated subreddit and concatenates them onto an existing dataframe. Each request asks for the 100 posts created before the oldest post already collected. Responses with a status code other than 200 or 300 are skipped and the request is retried. As a pre-cleaning step, the function replaces the [removed] and [deleted] placeholders with nulls and drops rows whose title is null, so that as much useful data as possible is kept. It returns the dataframe once the number of rows reaches 10,000. On r/biology the function ran without errors and did not get stuck in an infinite loop. It did not work, however, when pulling from the r/biochemistry subreddit.
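
The pre-cleaning step can be tried in isolation without calling the API. The following is a minimal sketch, assuming a batch of posts is already in a dataframe; `drop_removed_posts` is a hypothetical helper and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_removed_posts(batch):
    """Return a copy of batch without rows whose title was removed or deleted."""
    # turn both placeholder strings into nulls anywhere in the dataframe
    cleaned = batch.replace(['[removed]', '[deleted]'], np.nan)
    # keep only rows that still have a title
    return cleaned.dropna(subset=['title']).reset_index(drop=True)

# made-up rows, not real subreddit data
sample = pd.DataFrame({
    'title': ['Question about enzymes', '[removed]', '[deleted]'],
    'selftext': ['How do they work?', 'text', '[removed]'],
    'created_utc': [3, 2, 1],
})

cleaned_sample = drop_removed_posts(sample)
print(cleaned_sample)
```

Returning a new dataframe instead of using `inplace=True` leaves the original batch untouched, which makes the step easier to check.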

In [5]:
def subreddit_pull_biochem(df,subreddit):
    '''
    Arguments: take in starting df, and subreddit name as a string
    Output: returns a dataframe of submission data from the subreddit
    URL: contains url for pulling submission data
    '''
    #count of failed requests
    fail_req = 0
    while len(df) < 10_000:
        url = 'https://api.pushshift.io/reddit/search/submission'
        
        utc = df['created_utc'].min()
        params = {
        'subreddit': subreddit,
        'size': 100,
        'before': utc
        }
    
        req = requests.get(url,params)
        
        if req.status_code == 200:
            data = req.json()
            new_posts = data['data']
            new_df = pd.DataFrame(new_posts)      
            df = pd.concat([df,new_df],ignore_index=True,axis=0)
        else:
            #making function more robust to errors on requests:
            #count the failure and stop once 100 requests have failed
            fail_req += 1
            if fail_req >= 100:
                return df
            continue
    
    return df

#### subreddit_pull_biochem Explained

I used the function above on the biochemistry subreddit. This subreddit was prone to failed requests, which left subreddit_pull stuck in an infinite loop. The subreddit_pull_biochem function therefore counts the failed requests and uses that count to break out of the loop, returning the rows collected so far.
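
The idea of capping failed requests can be sketched without touching the network by passing the request step in as a function. This is an illustration under my own assumptions, not the code used for the pull: `pull_with_retry_limit`, `fetch_page`, `always_failing`, and `fake_api` are all hypothetical names, and the stand-in functions return made-up posts.

```python
def pull_with_retry_limit(fetch_page, start_utc, target=10_000, max_failures=100):
    """Collect posts page by page until target posts or max_failures failed requests.

    fetch_page(before_utc) is a hypothetical callable that returns
    (status_code, list_of_post_dicts).
    """
    posts, failures, before = [], 0, start_utc
    while len(posts) < target:
        status, page = fetch_page(before)
        if status != 200:
            failures += 1
            if failures >= max_failures:
                break  # give up instead of looping forever
            continue
        if not page:
            break  # no older posts remain
        posts.extend(page)
        # the next request asks for posts older than the oldest one seen
        before = min(post['created_utc'] for post in page)
    return posts

def always_failing(before_utc):
    """Stand-in for an API that never answers successfully."""
    return 503, []

def fake_api(before_utc):
    """Stand-in that serves up to three older posts per request."""
    older = range(before_utc - 1, max(before_utc - 4, 0), -1)
    return 200, [{'created_utc': t} for t in older]

print(len(pull_with_retry_limit(always_failing, 100, max_failures=5)))
print(len(pull_with_retry_limit(fake_api, 10)))
```

The sketch also stops when a successful response contains no posts, a case the loop would otherwise repeat indefinitely.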

## Pulling r/biology Submission Data

In [15]:
#use function to pull 10_000 observations from the bio subreddit
final_bio_df = subreddit_pull(bio_df,'biology')

In [46]:
#important columns for the project
final_bio_df = final_bio_df[['subreddit','selftext','title']].copy()

In [78]:
#checking the null values of the dataframe
final_bio_df[['selftext','title']].isna().sum()

selftext    0
title       0
dtype: int64

In [49]:
#checking out initial value counts of the biology dataframe
final_bio_df['selftext'].value_counts()


In [50]:
#drop rows from the selftext and title columns that are null
final_bio_df.dropna(axis=0,how='any',subset=['selftext','title'],inplace=True)

In [53]:
#checking out the title value counts from the biology dataframe
final_bio_df['title'].value_counts()

Repetitive negative thinking is associated with amyloid, tau, and cognitive decline    13
AIs from AI Dungeon 2 to sexy to funny and one based wholly on Reddit!                 11
Question                                                                               10
Where Jaguars Are Killed, New Common Factor Emerges: Chinese Investment

In [54]:
#cataloging the final number of observations of the biology submissions dataframe
#this final dataframe will next go to cleaning
final_bio_df.shape

(10072, 3)

In [56]:
#save the bio submissions df to csv file
final_bio_df.to_csv('datasets/bio-submissions.csv',index=False)

## Pulling r/biochemistry Submission Data

In [6]:
#the url for pulling submissions
url_submission = 'https://api.pushshift.io/reddit/search/submission'

In [7]:
#create biochemistry params for requesting
params = {
    'subreddit': 'Biochemistry',
    'size':2,
}

In [8]:
#use req to pull api
#use the params in the get method
req = requests.get(url_submission,params)

In [9]:
#status of the request
req.status_code

200

In [10]:
#parse the json response into a python dictionary
data_biochem = req.json()

In [11]:
#create first part of biochem df
biochem_posts = data_biochem['data']
biochem_df = pd.DataFrame(biochem_posts)
biochem_df

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,thumbnail_height,thumbnail_width
0,[],False,yeager00,,[],,text,t2_6qe2u10t,False,False,...,self,What i must know to learn Biochemestry from 0?,0,[],1.0,https://www.reddit.com/r/Biochemistry/comments...,all_ads,6,,
1,[],False,SamSam3107,,[],,text,t2_7c0j5nm6,False,False,...,https://a.thumbs.redditmedia.com/_-Mu-BX2ynr-j...,Problems in binding kinetic analysis in SPR,0,[],1.0,https://www.reddit.com/r/Biochemistry/comments...,all_ads,6,61.0,140.0


In [12]:
#use subreddit_pull_biochem to create final_biochem_df
final_biochem_df = subreddit_pull_biochem(biochem_df,'Biochemistry')

In [13]:
#the initial shape of the dataframe
final_biochem_df.shape

(10098, 97)

In [14]:
#replace removed or deleted posts with np.nan
final_biochem_df.replace('[removed]',np.nan,inplace=True)
final_biochem_df.replace('[deleted]',np.nan,inplace=True)

#drop rows from the selftext and title columns that are null
final_biochem_df.dropna(axis=0,how='any',subset=['selftext','title'],inplace=True)

In [16]:
#the shape after dropping the null rows for removed and deleted
final_biochem_df.shape

(9069, 97)

In [25]:
#seeing the value counts of the selftext biochem dataframe
final_biochem_df['selftext'].value_counts()

                                                                                 2840
Hey guys I wanted to know how many ATP is produced by fructose in the liver??

In [28]:
#removing posts from the biochem df that are empty strings in the selftext column
final_biochem_df = final_biochem_df[final_biochem_df['selftext'] != '']

In [35]:
#finding the final shape of the biochem dataframe after initial cleaning
final_biochem_df.shape

(6229, 97)
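
The filter above removes selftext values that are exactly the empty string. A slightly stricter variant, shown here as a sketch, also treats whitespace-only posts as empty. It assumes null selftext rows have already been dropped, as in the cells above; `drop_blank_selftext` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd

def drop_blank_selftext(df):
    """Return rows whose selftext has at least one non-whitespace character."""
    has_text = df['selftext'].str.strip() != ''
    return df[has_text].reset_index(drop=True)

# made-up rows, not real subreddit data
sample = pd.DataFrame({
    'selftext': ['', '   ', 'How many ATP per fructose?'],
    'title': ['a', 'b', 'c'],
})

non_blank = drop_blank_selftext(sample)
print(non_blank)
```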

In [38]:
#creating the dataframe with the desired columns
#the dataframe will next go to cleaning
final_biochem_df = final_biochem_df[['subreddit','selftext','title']]

In [42]:
#save the biochem submissions df to csv file
final_biochem_df.to_csv('datasets/biochem-submissions.csv',index=False)