# Project 3: Web APIs & NLP (01_Data_Collection)

## Executive Summary

According to its home page, Reddit describes itself as a place for community, conversation and connection, with millions of users worldwide creating content and sharing conversations across tens of thousands of topics.

Topics and ideas are constantly discussed and shared on Reddit and, more importantly, they are organised into communities. Users can join their favourite communities to build a constant, personalised feed of content such as news headlines, opinions, fun stories, startling discoveries, sports talk, games, viral pictures, top memes and videos.

There are several reasons why Reddit is such a sought-after search medium. Primarily, people from all walks of life use Reddit to seek, understand and participate, whether actively or passively, in discussions on a practically unlimited range of themes, concepts and circumstances. These include topics that appear controversial, indisputable or indefensible, or that simply have no answer at all.

With the above in mind, Reddit is considering categorising its content by issue.

One such issue is diet, which is further sub-categorised here into vegan diets as opposed to ketogenic diets, in relation to both lifestyle aspects and serious health aspects.

## Problem Statement

The goal of this project is to examine the popularity of diet advice by classifying posts from two subreddits, “Vegan” and “Keto”.

The focus is to scrape the two subreddits and build different types of classification models, such as a Naive Bayes classifier and a Logistic Regression classifier, and then compare them. The modelling process will determine which subreddit the most frequently observed text belongs to.
The results are then measured with the classification metrics of accuracy, misclassification rate, sensitivity and specificity obtained from the model scores.

Success will be evaluated by the test accuracy of each model and by the smallest gap between its training accuracy and test accuracy.
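
The four metrics named above all derive from the counts in a confusion matrix. The sketch below is illustrative only: the function name `classification_metrics`, the choice of which subreddit counts as the positive class, and the counts in the usage lines are assumptions, not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the four metrics from confusion-matrix counts.

    Hypothetical helper for illustration, not part of the project code.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# made-up counts purely for illustration
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```

Comparing the `accuracy` value computed on the training split with the one computed on the test split gives the gap used as the success criterion.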

The findings are addressed to plant-based producers and health professionals as the primary stakeholders, since they are the ones who care about environmental sustainability: do we eat less beef because we care about the environment, or do we eat healthily while also considering the health of planet Earth?

The secondary stakeholders are environmentalists (those who care about protecting the environment, the climate and the planet) and animal rights activists (those who care about the protection of animals).

## Data Collection

- Install the `timedelta` package, which is needed for the `timedelta` import below:

In [1]:
pip install timedelta

Note: you may need to restart the kernel to use updated packages.


### Import Libraries

In [2]:
import requests
import json
import time
import pandas as pd
import random
import string

# import the datetime library for date-time formatting and date-to-integer conversion
import datetime
import timedelta
from dateutil.relativedelta import relativedelta
import urllib
%matplotlib inline

pd.set_option('display.max_columns', None)

### Checking the response status using the Pushshift API

In [3]:
'''
Test the Pushshift connection by checking the status code of the response
#200 : ok
#4xx : client error
#5xx : server error
'''
URL = "https://api.pushshift.io/reddit/search/submission?subreddit=vegan"

# send a GET request to the URL above and store the response
response = requests.get(URL)
# check the status of the connection; 200 indicates the HTTP response came back without error
response.status_code

200
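
The status-code ranges listed in the cell above can be turned into a small check to run before a response is parsed. This is a minimal sketch; `status_class` is a hypothetical helper name and does not come from the `requests` library.

```python
def status_class(status_code):
    """Map an HTTP status code to a coarse class (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "ok"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


# a few example codes, including one of the overload codes a busy API may return
for code in (200, 404, 522):
    print(code, status_class(code))
```

With `requests`, the same check could be applied as `status_class(response.status_code)`. The library's built-in alternative is `response.raise_for_status()`, which raises an exception on 4xx and 5xx codes.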

In [4]:
# parse the Reddit data as JSON from the response object returned by requests
json_data = response.json()

# check how the data is organized in the JSON
json_data.keys()

dict_keys(['data'])

In [5]:
# check the contents in the key 'data' (raw data)
json_data['data']

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'Pharmbro6969',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_9jg3nc44',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1619062994,
  'domain': 'self.vegan',
  'full_link': 'https://www.reddit.com/r/vegan/comments/mvwkze/market_for_vegan_foods/',
  'gildings': {},
  'id': 'mvwkze',
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_css_class': 'Discussion',
  'link_flair_richtext': [],
  'link_flair_template_id': '0d26d52c-2ef1-11e5-8c17-0ec131dbf691',
  'link_flair_text': 'Discussion',
  'link_flair_text_color': 'dark',
  'link_flair_type'

In [6]:
df = pd.DataFrame(json_data['data'])

In [7]:
# Checking list of column variables for selective further analysis
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 73 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   all_awardings                25 non-null     object 
 1   allow_live_comments          25 non-null     bool   
 2   author                       25 non-null     object 
 3   author_flair_css_class       0 non-null      object 
 4   author_flair_richtext        25 non-null     object 
 5   author_flair_text            3 non-null      object 
 6   author_flair_type            25 non-null     object 
 7   author_fullname              25 non-null     object 
 8   author_patreon_flair         25 non-null     bool   
 9   author_premium               25 non-null     bool   
 10  awarders                     25 non-null     object 
 11  can_mod_post                 25 non-null     bool   
 12  contest_mode                 25 non-null     bool   
 13  created_utc           

In [8]:
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,removed_by_category,url_overridden_by_dest,author_flair_template_id,author_flair_text_color,post_hint,preview,thumbnail_height,thumbnail_width,is_gallery,media,media_embed,secure_media,secure_media_embed
0,[],False,Pharmbro6969,,[],,text,t2_9jg3nc44,False,False,[],False,False,1619062994,self.vegan,https://www.reddit.com/r/vegan/comments/mvwkze...,{},mvwkze,True,False,False,False,True,True,False,,Discussion,[],0d26d52c-2ef1-11e5-8c17-0ec131dbf691,Discussion,dark,text,False,False,True,0,0,False,all_ads,/r/vegan/comments/mvwkze/market_for_vegan_foods/,False,6,1619063006,1,I get this vibe that vegan foods are exploding...,True,False,False,vegan,t5_2qhpm,592689,public,self,Market for vegan foods,0,[],1.0,https://www.reddit.com/r/vegan/comments/mvwkze...,all_ads,6,,,,,,,,,,,,,
1,[],False,kayyy93,,[],,text,t2_acy0ngq2,False,False,[],False,False,1619062404,i.redd.it,https://www.reddit.com/r/vegan/comments/mvwfd1...,{},mvwfd1,False,False,False,True,False,False,False,,Activism,[],7371b396-1c1e-11e5-a3f0-0ef6ca535a4d,Activism,dark,text,False,False,True,0,0,False,all_ads,/r/vegan/comments/mvwfd1/cat_cow/,False,6,1619062415,1,,True,False,False,vegan,t5_2qhpm,592685,public,default,cat = cow,0,[],1.0,https://i.redd.it/l7wqipd4anu61.jpg,all_ads,6,automod_filtered,https://i.redd.it/l7wqipd4anu61.jpg,,,,,,,,,,,
2,[],False,ScullyIsTired,,[],vegan 5+ years,text,t2_5tpnniaf,False,False,[],False,False,1619062138,i.redd.it,https://www.reddit.com/r/vegan/comments/mvwctb...,{},mvwctb,False,False,False,True,False,False,False,,Funny,[],4983f5d2-e206-11e4-9f95-22000bb2c21d,Funny,dark,text,False,False,True,0,0,False,all_ads,/r/vegan/comments/mvwctb/im_tired_of_typing_ou...,False,6,1619062151,1,,True,False,False,vegan,t5_2qhpm,592681,public,https://b.thumbs.redditmedia.com/qX3u9Fuiqkgco...,I'm tired of typing out things wrong with this...,0,[],1.0,https://i.redd.it/vc0pr2sg9nu61.png,all_ads,6,automod_filtered,https://i.redd.it/vc0pr2sg9nu61.png,e7371f88-bba1-11e4-922d-22000b310327,dark,image,"{'enabled': True, 'images': [{'id': 'Es6o1nYWZ...",57.0,140.0,,,,,
3,[],False,bendancoh47,,[],,text,t2_8a62xzaq,False,False,[],False,False,1619061816,i.redd.it,https://www.reddit.com/r/vegan/comments/mvw9rt...,{},mvw9rt,False,False,False,True,False,False,False,,,[],,,dark,text,False,False,True,0,0,False,all_ads,/r/vegan/comments/mvw9rt/i_have_a_bit_of_miyok...,False,6,1619061827,1,,True,False,False,vegan,t5_2qhpm,592677,public,https://b.thumbs.redditmedia.com/bFpTnoXITS2A0...,I have a bit of Miyoko’s butter (palm oil-free...,0,[],1.0,https://i.redd.it/mmwbi4pc8nu61.jpg,all_ads,6,automod_filtered,https://i.redd.it/mmwbi4pc8nu61.jpg,,,image,"{'enabled': True, 'images': [{'id': 'M0xFY3dB7...",140.0,140.0,,,,,
4,[],False,SilverSquid1810,,[],vegan 1+ years,text,t2_1xtw5q6n,False,False,[],False,False,1619061092,wlwt.com,https://www.reddit.com/r/vegan/comments/mvw2n5...,{},mvw2n5,True,False,False,False,True,False,False,,Food,[],d9f9ddf0-f024-11e2-8ebb-12313b0c8c59,Food,dark,text,False,False,True,0,0,False,all_ads,/r/vegan/comments/mvw2n5/graeters_celebrating_...,False,6,1619061103,1,,True,False,False,vegan,t5_2qhpm,592674,public,https://b.thumbs.redditmedia.com/SBbwCQVdquqhq...,Graeter’s celebrating Earth Day with free scoo...,0,[],1.0,https://www.wlwt.com/article/graeter-s-celebra...,all_ads,6,,https://www.wlwt.com/article/graeter-s-celebra...,326e9422-bba2-11e4-a57f-22000b3e802b,dark,link,"{'enabled': False, 'images': [{'id': 'XwHau_71...",78.0,140.0,,,,,


- The 'selftext' column is inconsistent: some posts contain text while others are empty. The next step is to focus on the posts whose 'selftext' contains textual data.
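
One way to separate posts with usable text from empty or moderated ones is a small predicate on `selftext`. The sketch below is an assumption about how such a check could look; `has_text` is a hypothetical name, and the sample posts are invented for illustration.

```python
def has_text(post):
    """Return True when a post dictionary carries usable selftext."""
    selftext = post.get("selftext")
    if not selftext:                      # missing key, None or empty string
        return False
    if selftext.startswith("[removed]"):  # marked as removed
        return False
    return selftext != "[deleted]"        # marked as deleted


# invented sample posts, not real scraped data
sample_posts = [
    {"id": "a1", "selftext": "Any good vegan taco places?"},
    {"id": "a2", "selftext": ""},
    {"id": "a3", "selftext": "[removed]"},
    {"id": "a4", "selftext": "[deleted]"},
    {"id": "a5"},
]
kept = [post["id"] for post in sample_posts if has_text(post)]
print(kept)
```

Only the first sample post passes the check; the other four are empty, marked as removed or deleted, or have no `selftext` key at all.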

### Acquiring information for further analysis under key 'data'

1. The subreddit that the thread corresponds to
2. The id of the subreddit
3. The author of the post
4. The title of the post
5. The selftext of the post
6. The no. of subscribers for the subreddit
7. The created utc time (integer)
8. The created time (dd/MM/yyyy hh:mm)
9. The time elapsed since posting, in seconds, relative to the current time
10. The number of comments for the post
11. The score of the post


- Getting the data to focus on for an individual post:

In [9]:

# instantiate dictionary of the data for retrieving one sample post for validation
post_scrape = {}

# get one post to validate if the output is correct. the 1st post is used here.
demo_post = json_data['data'][0]

post_scrape['subreddit'] = demo_post['subreddit']
# get the rest of the data
post_scrape['id'] = demo_post['id']
post_scrape['author'] = demo_post['author']
post_scrape['title'] = demo_post['title']
post_scrape['selftext'] = demo_post['selftext']
post_scrape['subreddit_subscribers'] = demo_post['subreddit_subscribers']
post_scrape['created_utc'] = demo_post['created_utc']
created_time = demo_post['created_utc']
# note: fromtimestamp() without a tz argument returns the machine's local time
post_scrape['created_time'] = datetime.datetime.fromtimestamp(created_time)
# seconds elapsed between the post's creation and now
post_scrape['length_of_time(sec)'] = int(time.time() - created_time)
post_scrape['num_of_comments'] = demo_post['num_comments']
post_scrape['score'] = demo_post['score']

# output the dictionary to check that the post is in the desired format
post_scrape

{'subreddit': 'vegan',
 'id': 'mvwkze',
 'author': 'Pharmbro6969',
 'title': 'Market for vegan foods',
 'selftext': 'I get this vibe that vegan foods are exploding everywhere. Not from a concerned about animals point, but they can charge the same price for a taco that costs much less to make.\n\nHopefully in and out has a vegan option soon. Is there anywhere good in OC that’s really good? I cook myself so I’m looking for foods places that it’s hard to replicate (so no places that make the best tofu and rice please)',
 'subreddit_subscribers': 592689,
 'created_utc': 1619062994,
 'created_time': datetime.datetime(2021, 4, 22, 11, 43, 14),
 'length_of_time(sec)': 979,
 'num_of_comments': 0,
 'score': 1}
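
`datetime.datetime.fromtimestamp` without a `tz` argument converts to the local time zone of the machine running the notebook, so the `created_time` shown above depends on where the code runs. A minimal sketch of a time-zone-independent conversion, using the `created_utc` value of the sample post:

```python
import datetime

created_utc = 1619062994  # created_utc of the sample post above

# passing tz makes the result timezone-aware and independent of the local machine
created_time_utc = datetime.datetime.fromtimestamp(created_utc, tz=datetime.timezone.utc)
print(created_time_utc.isoformat())  # 2021-04-22T03:43:14+00:00
```

The local value `11:43:14` in the output above is eight hours ahead of this, which suggests the notebook ran on a machine set to UTC+8.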

### Function to convert JSON fields to the desired column headers

In [10]:
def scrape_post(post):
    """Extract the fields of interest from one raw Pushshift post."""
    created_time = int(post['created_utc'])
    post_scrape = {}
    post_scrape['subreddit'] = post['subreddit']
    post_scrape['id'] = post['id']
    post_scrape['author'] = post['author']
    post_scrape['title'] = post['title']
    # some posts have no 'selftext' key; fall back to a placeholder
    post_scrape['selftext'] = post.get('selftext', "[]")
    post_scrape['subreddit_subscribers'] = post['subreddit_subscribers']
    post_scrape['created_utc'] = post['created_utc']
    # note: fromtimestamp() without a tz argument returns the machine's local time
    post_scrape['created_time'] = datetime.datetime.fromtimestamp(created_time)
    # seconds elapsed between the post's creation and now
    post_scrape['length_of_time(sec)'] = int(time.time() - created_time)
    post_scrape['num_of_comments'] = post['num_comments']
    post_scrape['score'] = post['score']
    return post_scrape

### Function to retrieve data by subreddit and search type, saving it to the corresponding CSV file

In [11]:
'''
## Parameters ##
 - before:  retrieve records created before this time (compared with created_utc); the default is the current time
 - subreddit:  the subreddit from which the data is retrieved
 - searchtype: either submission or comment
 - Note that selftext is one of the key fields used for data analysis, so a post is treated as invalid
   if its selftext is an empty string, is marked '[deleted]' or '[removed]', or has no value;
   such records are dropped from the scraped data.
 - Only the most recent 1000 valid records, counted back from the current time, are scraped.

'''

def getPushshiftData(before=None, subreddit='', searchtype="submission"):
    # resolve the default at call time; int(time.time()) in the signature
    # would be frozen at the moment the function was defined
    if before is None:
        before = int(time.time())

    # list of valid posts scraped from the subreddit
    posts = []
    URL = "https://api.pushshift.io/reddit/search/" + searchtype

    while True:
        params = {"before": before,
                  "size": 100,
                  "subreddit": subreddit}
        res = requests.get(URL, params=params)

        if res.status_code != 200:
            print(res.status_code)
            if res.status_code in (502, 522, 525):
                # 522 and 525: the API gets too much traffic and gets overloaded
                print("Retry to retrieve the data")
                # pause before retrying so the retry does not add to the load
                time.sleep(random.randint(3, 5))
                continue
            break

        data = res.json()['data']
        # stop when the API has no older records left
        if not data:
            break
        # page backwards: the next request starts before the creation time
        # of the last record in this batch
        before = data[-1]['created_utc']

        for element in data:
            # keep only posts whose selftext has textual data
            selftext = element.get('selftext')
            if selftext and not selftext.startswith("[removed]") and selftext != "[deleted]":
                posts.append(element)

        # report the running total of valid posts scraped
        no_of_posts_scraped = len(posts)
        print(f"Current Total No. of Data Scraped: {no_of_posts_scraped}")

        # stop once 1000 valid posts have been collected
        if no_of_posts_scraped >= 1000:
            break
        # the API limits the request rate; wait 3 to 5 seconds between
        # requests to avoid putting excessive load on the servers
        time.sleep(random.randint(3, 5))

    print("Total No. of Data Records Retrieved : " + str(len(posts)))

    # convert at most 1000 posts into the desired format
    posts_infodicts_list = [scrape_post(post) for post in posts[:1000]]

    # turn the list of dictionaries into a DataFrame for easy export
    df = pd.DataFrame(posts_infodicts_list)
    # export to CSV in the local directory under the data folder
    df.to_csv('../data/' + subreddit + '_scraped.csv', mode='w', header=True, index=False)
    return posts_infodicts_list
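
The `before` cursor is what lets the loop above walk backwards through a subreddit's history: each request asks for records older than the last one already seen. The sketch below isolates that idea with no network access. `fetch_page` and `collect` are hypothetical stand-ins (an in-memory list plays the role of the API), and the assumption that each page arrives newest first follows the sample output earlier in this notebook.

```python
def fetch_page(records, before, size):
    """Stand-in for one API call: newest-first records created before `before`."""
    older = [r for r in records if r["created_utc"] < before]
    older.sort(key=lambda r: r["created_utc"], reverse=True)
    return older[:size]


def collect(records, before, size, limit):
    """Page backwards with a `before` cursor until `limit` records or no data."""
    collected = []
    while len(collected) < limit:
        page = fetch_page(records, before, size)
        if not page:  # nothing older is left
            break
        collected.extend(page)
        before = page[-1]["created_utc"]  # oldest record on this page
    return collected[:limit]


# invented records with timestamps 1..10
fake_records = [{"created_utc": t} for t in range(1, 11)]
result = collect(fake_records, before=11, size=3, limit=7)
print([r["created_utc"] for r in result])  # [10, 9, 8, 7, 6, 5, 4]
```

With a strict comparison, records that share the cursor's exact timestamp could be skipped; whether the real API treats `before` as strict is not verified here.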

In [12]:
# using function to scrape r/vegan data
vegan=getPushshiftData(before=int(time.time()),subreddit='vegan',searchtype='submission')
print("length of posts in vegan:"+str(len(vegan)))
vegan

Current Total No. of Data Scraped: 30
Current Total No. of Data Scraped: 54
Current Total No. of Data Scraped: 77
Current Total No. of Data Scraped: 95
Current Total No. of Data Scraped: 119
Current Total No. of Data Scraped: 145
Current Total No. of Data Scraped: 172
Current Total No. of Data Scraped: 190
Current Total No. of Data Scraped: 218
Current Total No. of Data Scraped: 233
Current Total No. of Data Scraped: 264
Current Total No. of Data Scraped: 286
Current Total No. of Data Scraped: 322
Current Total No. of Data Scraped: 348
Current Total No. of Data Scraped: 372
Current Total No. of Data Scraped: 409
Current Total No. of Data Scraped: 427
Current Total No. of Data Scraped: 457
Current Total No. of Data Scraped: 472
Current Total No. of Data Scraped: 495
Current Total No. of Data Scraped: 515
Current Total No. of Data Scraped: 534
Current Total No. of Data Scraped: 550
Current Total No. of Data Scraped: 576
Current Total No. of Data Scraped: 603
Current Total No. of Data Scr

[{'subreddit': 'vegan',
  'id': 'mvwkze',
  'author': 'Pharmbro6969',
  'title': 'Market for vegan foods',
  'selftext': 'I get this vibe that vegan foods are exploding everywhere. Not from a concerned about animals point, but they can charge the same price for a taco that costs much less to make.\n\nHopefully in and out has a vegan option soon. Is there anywhere good in OC that’s really good? I cook myself so I’m looking for foods places that it’s hard to replicate (so no places that make the best tofu and rice please)',
  'subreddit_subscribers': 592689,
  'created_utc': 1619062994,
  'created_time': datetime.datetime(2021, 4, 22, 11, 43, 14),
  'length_of_time(sec)': 1311,
  'num_of_comments': 0,
  'score': 1},
 {'subreddit': 'vegan',
  'id': 'mvvguj',
  'author': 'Tytheboss16',
  'title': '“If people still eat meat, they might as well get it from the right place” argument?',
  'selftext': 'So I’ve got a friend who runs a small-scale (about 30 head), local, organic cattle operation.

In [13]:
# using function to scrape r/keto data

keto=getPushshiftData(before=int(time.time()),subreddit='keto',searchtype='submission')
keto

Current Total No. of Data Scraped: 54
Current Total No. of Data Scraped: 107
Current Total No. of Data Scraped: 150
Current Total No. of Data Scraped: 195
Current Total No. of Data Scraped: 250
Current Total No. of Data Scraped: 301
Current Total No. of Data Scraped: 356
Current Total No. of Data Scraped: 410
Current Total No. of Data Scraped: 473
Current Total No. of Data Scraped: 527
Current Total No. of Data Scraped: 587
Current Total No. of Data Scraped: 640
Current Total No. of Data Scraped: 689
Current Total No. of Data Scraped: 745
Current Total No. of Data Scraped: 801
Current Total No. of Data Scraped: 858
Current Total No. of Data Scraped: 918
Current Total No. of Data Scraped: 980
Current Total No. of Data Scraped: 1032
Total No. of Data Records Retrieved : 1032


[{'subreddit': 'keto',
  'id': 'mvvh65',
  'author': 'freddyt55555',
  'title': 'Allulose is amazing!',
  'selftext': "Guys, I'm not kidding. It's like a fucking natural miracle drug.\n\nAbout a month ago, I posted about how allulose is like [Ex-Lax](https://www.reddit.com/r/keto/comments/mbiksm/allulose_better_than_exlax/?utm_source=share&amp;utm_medium=web2x&amp;context=3). I stopped using it for a while, but I recently found out that not only does it not spike your blood glucose, it can actually make your blood glucose [go down](https://www.reddit.com/r/diabetes_t2/comments/mtic7m/eating_fruits_with_allulose/?utm_source=share&amp;utm_medium=web2x&amp;context=3). I'm not the only one who's experienced [this](https://www.reddit.com/r/diabetes_t2/comments/mv9djm/allulose_what_the_heck_is_going_on/?utm_source=share&amp;utm_medium=web2x&amp;context=3).\n\nI found out that the Ex-Lax effect is basically the allulose blocking the metabolism of certain carbohydrates. Here's an excerpt from 

In [16]:
#checking non-null values and data collected
vegan_scraped = pd.DataFrame(vegan)
vegan_scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   subreddit              1000 non-null   object        
 1   id                     1000 non-null   object        
 2   author                 1000 non-null   object        
 3   title                  1000 non-null   object        
 4   selftext               1000 non-null   object        
 5   subreddit_subscribers  1000 non-null   int64         
 6   created_utc            1000 non-null   int64         
 7   created_time           1000 non-null   datetime64[ns]
 8   length_of_time(sec)    1000 non-null   int64         
 9   num_of_comments        1000 non-null   int64         
 10  score                  1000 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(5)
memory usage: 86.1+ KB


In [17]:
#checking non-null values and data collected
keto_scraped = pd.DataFrame(keto)
keto_scraped.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   subreddit              1000 non-null   object        
 1   id                     1000 non-null   object        
 2   author                 1000 non-null   object        
 3   title                  1000 non-null   object        
 4   selftext               1000 non-null   object        
 5   subreddit_subscribers  1000 non-null   int64         
 6   created_utc            1000 non-null   int64         
 7   created_time           1000 non-null   datetime64[ns]
 8   length_of_time(sec)    1000 non-null   int64         
 9   num_of_comments        1000 non-null   int64         
 10  score                  1000 non-null   int64         
dtypes: datetime64[ns](1), int64(5), object(5)
memory usage: 86.1+ KB
