[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/hovikgas/nlp_reddit/blob/master/code/1_Data_Collection.ipynb)

# Classification of the /r/ADHD and /r/Anxiety SubReddits

### Data Collection

In [1]:
import requests, time, csv, json, re
import pandas as pd

First, I test out the Pushshift API. I start by creating URL and parameter variables for use with the `requests` package. Reddit calls posts "submissions", and the Pushshift API allows at most 100 posts to be pulled per request.

In [2]:
# URL based on Pushshift API documentation 
url = 'https://api.pushshift.io/reddit/search/submission'

# Testing out with just 2 parameters specifying which subreddit and how many posts to pull
params = {'subreddit':'anxiety',
          'size':100,
         }
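To see what these two variables turn into, the query string can be assembled by hand. This is a minimal sketch using only the standard library; `query_url` is an illustrative name and not part of the notebook.

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/search/submission'
params = {'subreddit': 'anxiety', 'size': 100}

# requests appends the encoded parameters to the URL after a '?'
query_url = f'{url}?{urlencode(params)}'
print(query_url)
```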

In [3]:
response = requests.get(url, params=params)

Checking the response's status code to see whether the request succeeded (200 means OK).

In [4]:
response.status_code

200

The API returns its data as JSON, like many web services. Here, I look at the first item under that JSON's 'data' key, which holds the data for each post.

In [5]:
response.json()['data'][0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'avocadojiang',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_2onvixzt',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1593139398,
 'domain': 'self.Anxiety',
 'full_link': 'https://www.reddit.com/r/Anxiety/comments/hfzjm6/webmd_is_trash_fire/',
 'gildings': {},
 'id': 'hfzjm6',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_css_class': 'venting',
 'link_flair_richtext': [],
 'link_flair_template_id': '19ddd408-76a1-11e5-acb1-0ea7f7edc7ed',
 'link_flair_text': 'Venting',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_o
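Each post is a plain dictionary, so individual fields can be read by key. The sketch below uses a trimmed-down copy of the post above (only four of its fields) and assumes, as the field name suggests, that `created_utc` is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone

# A trimmed-down post using values from the output above
post = {
    'author': 'avocadojiang',
    'created_utc': 1593139398,
    'id': 'hfzjm6',
    'link_flair_text': 'Venting',
}

# Convert the Unix timestamp to a timezone-aware UTC datetime
created = datetime.fromtimestamp(post['created_utc'], tz=timezone.utc)
print(post['id'], post['author'], created.isoformat())

# .get() avoids a KeyError for fields that a given post may lack
flair = post.get('link_flair_css_class', 'none')
```

Using `.get()` with a default is a cautious choice here because the set of keys can differ from post to post.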

---

#### Now that I've tested out the Pushshift API and successfully collected 2000 posts, I want to automate this functionality to make it easier to collect data from other subreddits, or to collect even more data, in the future.

Here I define a function that generates a unique, timestamped filename and records it in a log file, so that a future download doesn't overwrite my existing data.

In [7]:
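def filename_format_log(file_path, 
                        logfile = '../assets/file_log.txt', 
                        now = None, 
                        file_description = None): 
    # Default to the current time at call time, not at function definition time
    if now is None:
        now = round(time.time())
   
    # Require a file extension: a '.' that is not at the start of the path
    # and not next to another '.' (so the leading '..' does not count)
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.') 
    
    # Find where the lowercase 'word_word' file name starts and put the timestamp in front of it
    stamp = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'  
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description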
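The naming step on its own can also be sketched without regular expressions. This is an alternative illustration, not the notebook's method: `stamp_filename` is a hypothetical helper, it does no logging, and it uses `PurePosixPath` so the forward-slash paths behave the same on every platform.

```python
from pathlib import PurePosixPath

def stamp_filename(file_path, now):
    """Prefix the file name (not the directory) with a timestamp."""
    path = PurePosixPath(file_path)
    if not path.suffix:
        raise ValueError('Please enter a path with a file extension.')
    # with_name swaps only the final path component
    return str(path.with_name(f'{now}_{path.name}'))

print(stamp_filename('../data/raw_adhd.json', 1586893398))
```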

Here I define a custom function to run my subreddit queries, show status and progress as it runs, and write the data to a JSON file named by the `filename_format_log` function defined above.

In [8]:
def reddit_query(subreddit, n_samples=100, before=None, after=None):
    url = 'https://api.pushshift.io/reddit/search/submission'
    # Page backwards in time, starting from `before` (or from now if it is not given)
    last_comment = before if before is not None else round(time.time())
    comment_list = []
    
    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')
        
        params = {
          'subreddit':subreddit,
          'sort':'desc',
          'size':n_samples,
          'before':last_comment-1,
          'after':after,
         }
        
        try:
            response = requests.get(url, params = params)
        except requests.RequestException:
            return 'Error. Pull not completed.'
        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'
        
        try:
            posts = response.json()['data']
        except (ValueError, KeyError):
            return 'Error. Pull not completed.'
        
        if len(posts) == 0:
            # No older posts left; stop instead of repeating the same query forever
            break
        
        # The oldest post on this page becomes the cursor for the next query
        last_comment = posts[-1]['created_utc']
        comment_list.extend(posts)
        time.sleep(1) 
        run += 1
    
    if not comment_list:
        return 'Error. Pull not completed.'
    
    timestamp = last_comment
    formatted_name, now, file_description = filename_format_log(file_path =f'../data/raw_{subreddit}.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)
    
    print(f'Saved and completed query and returned {len(comment_list)} submissions.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
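The paging idea in this function (ask for posts older than a cursor, then move the cursor to the oldest post returned) can be tried offline. The sketch below is an illustration under assumptions: `paginate` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the API by serving 250 made-up posts with timestamps 1 to 250.

```python
def paginate(fetch_page, n_samples, before, page_size=100):
    """Collect up to n_samples posts, newest first, one page at a time."""
    collected = []
    while len(collected) < n_samples:
        page = fetch_page(before=before, size=page_size)
        if not page:
            break  # source exhausted
        collected.extend(page)
        # Move the cursor to the oldest post seen so far
        before = page[-1]['created_utc']
    return collected[:n_samples]

def fake_fetch(before, size):
    # Stand-in for the API: posts strictly older than `before`, newest first
    stamps = [t for t in range(250, 0, -1) if t < before]
    return [{'created_utc': t} for t in stamps[:size]]

posts = paginate(fake_fetch, n_samples=230, before=10**6)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

One limitation of a timestamp cursor is that posts sharing the exact timestamp of a page boundary can be skipped; for a collection of this kind that is usually an acceptable trade-off.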

Running the query for the ADHD subreddit

In [9]:
reddit_query(subreddit='adhd', n_samples=20000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 20000 submissions.
Reddit text is ready for processing.
Last timestamp was 1586893398.


Loading the saved JSON file into a list. The filename is prefixed with the last timestamp printed above.

In [10]:
with open('../data/1586893398_raw_adhd.json', 'r') as f:
    adhd_list = json.load(f)

Checking the length of the list to see if it matches the number of samples requested from the query function.

In [11]:
len(adhd_list)

20000

Checking to see if the data came through. Looks good!

In [12]:
adhd_list[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'doublePepperoni314',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_5ezrii3y',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1593139327,
 'domain': 'self.ADHD',
 'full_link': 'https://www.reddit.com/r/ADHD/comments/hfzj0q/imminent_failure_entering_a_coding_immersive/',
 'gildings': {},
 'id': 'hfzj0q',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitel

Here I define a custom function to automatically parse my data so that it only keeps what I want and makes it ready to be turned into a Pandas DataFrame.

In [28]:
def reddit_parse(sample, subreddit):
    # list of columns to keep, ignoring everything else
    col_list = ['author',
            'subreddit',
            'title',
            'selftext',
            'created_utc',
            'retrieved_on',
            'url',
            'pinned',
            'media_only'
            ]
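    # creating a dataframe from the inputted data, keeping only the columns above
    # (.copy() so the edits below don't act on a slice of the full dataframe)
    posts_df = pd.DataFrame(sample)[col_list].copy()
    
    # renaming the 'subreddit' column after the subreddit and encoding its rows as 1;
    # the name must match the API's capitalization (e.g. 'ADHD'), otherwise map() gives NaN
    posts_df = posts_df.rename(columns={'subreddit':subreddit})
    posts_df[subreddit] = posts_df[subreddit].map({subreddit:1})
    
    # rename() keeps the column order of col_list, so no reordering is needed
    return posts_df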

In [29]:
adhd_df = reddit_parse(adhd_list, 'ADHD')

In [31]:
adhd_df.ADHD.value_counts()

1    20000
Name: ADHD, dtype: int64
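The count of 20000 ones depends on the name passed to `reddit_parse` matching the stored subreddit name exactly, because `Series.map` with a dictionary returns NaN for any value that is not a key. A small sketch on made-up values:

```python
import pandas as pd

names = pd.Series(['ADHD', 'ADHD', 'ADHD'])

matched = names.map({'ADHD': 1})     # exact match: every row becomes 1
mismatched = names.map({'adhd': 1})  # wrong case: no key matches, every row is NaN

print(matched.tolist())
print(mismatched.isna().all())
```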

In [32]:
adhd_df.head()

| | author | ADHD | title | selftext | created_utc | retrieved_on | url | pinned | media_only |
|---|---|---|---|---|---|---|---|---|---|
| 0 | doublePepperoni314 | 1 | Imminent Failure? Entering a Coding Immersive | I'm 37(m) and starting a coding immersive on ... | 1593139327 | 1593139332 | https://www.reddit.com/r/ADHD/comments/hfzj0q/... | False | False |
| 1 | CookiePriest | 1 | Cant stop being annoying/awkward | I don't mind the poor motivation or focus that... | 1593139141 | 1593139145 | https://www.reddit.com/r/ADHD/comments/hfzhfc/... | False | False |
| 2 | uberhyped | 1 | Speech problems: ADHD or just me? | I often have trouble structuring the ideas I w... | 1593138922 | 1593138926 | https://www.reddit.com/r/ADHD/comments/hfzfk8/... | False | False |
| 3 | turtlezeverywhere | 1 | My psychiatrists seemingly undermines my concerns | One thing I’ve noticed with me On vs Off meds ... | 1593138736 | 1593138738 | https://www.reddit.com/r/ADHD/comments/hfzdz6/... | False | False |
| 4 | ToonlinkFTW890 | 1 | Meds are great, until they wear off! | I need to do work, but my ER and booster had w... | 1593138607 | 1593138608 | https://www.reddit.com/r/ADHD/comments/hfzcuj/... | False | False |


---

Repeating the above for the Anxiety subreddit.

In [21]:
reddit_query(subreddit='anxiety', n_samples=20000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 20000 submissions.
Reddit text is ready for processing.
Last timestamp was 1586828176.


In [23]:
with open('../data/1586828176_raw_anxiety.json', 'r') as f:
    anxiety_list = json.load(f)

In [24]:
len(anxiety_list)

20000

In [25]:
anxiety_list[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': '5uperstud83',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_5o7bt50w',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1593140513,
 'domain': 'self.Anxiety',
 'full_link': 'https://www.reddit.com/r/Anxiety/comments/hfzswf/im_curious_if_anyone_has_ever_had_severe_muscle/',
 'gildings': {},
 'id': 'hfzswf',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_css_class': 'daeflair',
 'link_flair_richtext': [],
 'link_flair_text': 'DAE Questions',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,


In [33]:
anxiety_df = reddit_parse(anxiety_list, 'Anxiety')

In [34]:
anxiety_df.Anxiety.value_counts()

1    20000
Name: Anxiety, dtype: int64

In [35]:
anxiety_df.head()

| | author | Anxiety | title | selftext | created_utc | retrieved_on | url | pinned | media_only |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5uperstud83 | 1 | I’m curious if anyone has ever had severe musc... | \nI’ve been in horrible health the last severa... | 1593140513 | 1593140518 | https://www.reddit.com/r/Anxiety/comments/hfzs... | False | False |
| 1 | Adept_Owl | 1 | Anyone else with vivid anxiety? (Possible trig... | By vivid I mean where scenarios run through yo... | 1593139976 | 1593139977 | https://www.reddit.com/r/Anxiety/comments/hfzo... | False | False |
| 2 | Ccerberus667 | 1 | I have anxiety about certain technology like A... | The root of my anxiety that I've had for a few... | 1593139906 | 1593139909 | https://www.reddit.com/r/Anxiety/comments/hfzn... | False | False |
| 3 | FinickyFlamingo1 | 1 | My parents are signing me up for disability an... | I want money, yes. But the fact that I'm recei... | 1593139760 | 1593139766 | https://www.reddit.com/r/Anxiety/comments/hfzm... | False | False |
| 4 | niajtima23 | 1 | Compulsive exercise (running) | when I first started exercising and began a re... | 1593139752 | 1593139757 | https://www.reddit.com/r/Anxiety/comments/hfzm... | False | False |


---

Now that I have both dataframes ready to go, I will merge them into one for easier cleaning of their contents.

In [36]:
merged_df = pd.merge(anxiety_df, adhd_df, how='outer')

In [40]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40000 entries, 0 to 39999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   author        40000 non-null  object 
 1   Anxiety       20000 non-null  float64
 2   title         40000 non-null  object 
 3   selftext      39640 non-null  object 
 4   created_utc   40000 non-null  int64  
 5   retrieved_on  40000 non-null  int64  
 6   url           40000 non-null  object 
 7   pinned        40000 non-null  object 
 8   media_only    40000 non-null  object 
 9   ADHD          20000 non-null  float64
dtypes: float64(2), int64(2), object(6)
memory usage: 3.4+ MB
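With no `on` argument, `pd.merge` joins on every column the two frames share, so an outer merge of frames with no matching rows simply stacks them. `pd.concat` expresses that intent more directly. The sketch below uses two made-up frames (`anxiety_toy` and `adhd_toy` are illustrative names) to show that each label column ends up missing for the other subreddit's rows.

```python
import pandas as pd

anxiety_toy = pd.DataFrame({'author': ['a1', 'a2'], 'title': ['t1', 't2'], 'Anxiety': [1, 1]})
adhd_toy = pd.DataFrame({'author': ['b1', 'b2'], 'title': ['t3', 't4'], 'ADHD': [1, 1]})

# Stack the rows; a column missing from one frame is filled with NaN for its rows
stacked = pd.concat([anxiety_toy, adhd_toy], ignore_index=True)
print(stacked.isna().sum())
```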


In [37]:
merged_df.isna().sum()

author              0
Anxiety         20000
title               0
selftext          360
created_utc         0
retrieved_on        0
url                 0
pinned              0
media_only          0
ADHD            20000
dtype: int64

As we can see, every ADHD post is missing a value in the `Anxiety` column, and vice versa, so I fill those in with zeroes; only one of the two label columns will be needed in the end. There are also 360 posts with no text in the body, meaning they only have titles. I plan to drop those, but note that the `fillna(0)` call below replaces their missing `selftext` with 0 rather than removing them, so they still need to be handled during cleaning.

In [42]:
merged_df.fillna(0, inplace=True)
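Because a blanket `fillna(0)` touches every column, one alternative ordering is to drop the title-only posts first and then fill only the label columns. This is a sketch on a made-up three-row frame, not what the notebook ran; `toy` and `label_cols` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'selftext': ['body text', None, 'more text'],
    'Anxiety': [1.0, None, None],
    'ADHD': [None, 1.0, 1.0],
})

# Drop title-only posts first, so their missing bodies are not turned into 0
toy = toy.dropna(subset=['selftext']).reset_index(drop=True)

# Then fill only the two label columns and store them as integers
label_cols = ['Anxiety', 'ADHD']
toy[label_cols] = toy[label_cols].fillna(0).astype(int)
print(toy)
```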

Checking for and dropping duplicate posts (i.e. posts that have the exact same title and author).

In [43]:
merged_df.duplicated(subset=['title', 'author']).value_counts()

False    39540
True       460
dtype: int64

Looks like we only have 460 duplicates, so I am going to drop those.

In [44]:
merged_df.drop_duplicates(subset=['title', 'author'], inplace=True)
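To make the `subset` behaviour concrete, here is a small sketch on made-up rows. A row counts as a duplicate only when an earlier row has the same title and the same author; the first occurrence is kept.

```python
import pandas as pd

toy = pd.DataFrame({
    'author': ['ann', 'ann', 'bob', 'ann'],
    'title': ['Help', 'Help', 'Help', 'Update'],
    'selftext': ['first', 'repost', 'other', 'new'],
})

# Flag rows whose (title, author) pair already appeared earlier
flags = toy.duplicated(subset=['title', 'author'])
# keep='first' is the default, so the earliest copy survives
deduped = toy.drop_duplicates(subset=['title', 'author'])

print(flags.tolist())
print(len(deduped))
```

Note that the same title from a different author ("bob") is not treated as a duplicate.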

Final check to make sure everything looks good.

In [45]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39540 entries, 0 to 39999
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   author        39540 non-null  object 
 1   Anxiety       39540 non-null  float64
 2   title         39540 non-null  object 
 3   selftext      39540 non-null  object 
 4   created_utc   39540 non-null  int64  
 5   retrieved_on  39540 non-null  int64  
 6   url           39540 non-null  object 
 7   pinned        39540 non-null  bool   
 8   media_only    39540 non-null  bool   
 9   ADHD          39540 non-null  float64
dtypes: bool(2), float64(2), int64(2), object(4)
memory usage: 2.8+ MB


Saving the merged data out to a CSV so I can do more rigorous cleaning in a separate notebook.

In [47]:
merged_df.to_csv('../data/merged.csv')
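By default `to_csv` also writes the index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether to keep it is a judgment call; the sketch below, on a made-up two-row frame and an in-memory buffer instead of a file, shows the difference `index=False` makes.

```python
from io import StringIO
import pandas as pd

toy = pd.DataFrame({'author': ['ann', 'bob'], 'ADHD': [1, 0]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(StringIO(toy.to_csv()))
# index=False: only the real columns are written
without_index = pd.read_csv(StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```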