## Project 3: Web APIs & NLP

### Project Goal:
For Project 3, the goal is two-fold:

1. Use [Pushshift's](https://github.com/pushshift/api) API to collect posts from any two subreddits.
2. Use NLP (Natural Language Processing) to train a classifier that predicts which subreddit a given post came from.

### Problem Statement: 
Can we use machine learning to build a classifier that accurately predicts which subreddit a Reddit post came from?

### Data Collection:
#### 1) Subreddit: r/Crypto_com
- a) https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com (Submission)
- b) https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com (Comments)

#### 2) Subreddit: r/bodyweightfitness
- a) https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness (Submission)
- b) https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness (Comments)
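
The four endpoints above share one pattern: a post type in the path and the subreddit in the query string. The sketch below shows one way to build them. `BASE_URL` and `build_url` are illustrative names introduced here, not part of the Pushshift API; only the URL pattern itself is taken from the links above.

```python
from urllib.parse import urlencode

# URL stem copied from the endpoints listed above
BASE_URL = "https://api.pushshift.io/reddit/search"


def build_url(post_type, subreddit, **params):
    """Return a search URL for 'submission' or 'comment' posts in one subreddit."""
    if post_type not in ("submission", "comment"):
        raise ValueError("post_type must be 'submission' or 'comment'")
    # the subreddit goes first, followed by any extra query parameters
    query = urlencode({"subreddit": subreddit, **params})
    return f"{BASE_URL}/{post_type}/?{query}"


for sub in ("Crypto_com", "bodyweightfitness"):
    for kind in ("submission", "comment"):
        print(build_url(kind, sub))
```

Extra keyword arguments such as `size=100` or `after="1d"` are appended to the query string in the order they are passed.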

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import requests
import time
import datetime as dt
import json

## Function to scrape data from pushshift.io

In [2]:
# reference : https://github.com/pushshift/api

def pushshift(subreddit, post_type='submission', loops=1, size=100, skip=30):
    """Collect posts from one subreddit through the Pushshift API.

    subreddit: name of the subreddit to search, in our case Crypto_com or bodyweightfitness.
    post_type: type of post to look for, either 'submission' or 'comment'.
    loops: number of requests to make.
    size: maximum number of posts to request per loop.
    skip: number of days to step further back on each loop.
    """

    #datafields to return for submissions
    sub_fields = ['author','author_fullname','created_utc','id','is_self','num_comments',
                  'permalink','score','selftext','subreddit','title','url']
    
    #datafields to return for comments
    com_fields = ['author','author_fullname','body','created_utc','id','parent_id','permalink',
                 'score','subreddit']
    
    # Instantiate list for posts data
    list_posts = []
    url_stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(post_type,subreddit,size)
    
    #skip min of 1 day
    after = 1
    
    # check before requesting data
    if post_type not in ['submission', 'comment']:
        print("post_type is invalid, please enter either 'submission' or 'comment'")
        return None
    
    for i in range(loops):
        # add parameters to url to skip posts (after could be used to match up to post at end of previous loop if skip = 0)
        url = '{}&after={}d'.format(url_stem, skip * i + after) 
        # monitor status as loops run
        print(i, url)
        # get data
        res = requests.get(url)
        # stop with a clear error if the request failed
        res.raise_for_status()
        # add dictionaries for posts to list_posts
        list_posts.extend(res.json()['data'])
        # be polite
        time.sleep(1) 

    # turn list_posts (a list of dictionaries where each dictionary contains data on one post) into a dataframe
    df_posts = pd.DataFrame.from_dict(list_posts) 

    # filter fields for submissions or comments
    # (post_type was validated before the requests, so one branch always applies)
    if post_type == 'submission':
        df_posts = df_posts[sub_fields]
    elif post_type == 'comment':
        df_posts = df_posts[com_fields]

    # drop any duplicates (reassign instead of modifying a column selection in place)
    df_posts = df_posts.drop_duplicates()
    # add a field identifying submissions or comments
    df_posts['post_type'] = post_type
    
    return df_posts
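
The function above steps back a fixed number of days per loop, so consecutive requests may overlap or leave gaps depending on how busy the subreddit is. An alternative, hinted at in the loop comment, is to use the timestamp of the last post received as the cursor for the next request. The sketch below illustrates that idea offline. `paginate`, `fetch_page` and `fake_fetch` are hypothetical names introduced for this example, and the sketch assumes each page arrives newest first; in real use `fetch_page` would wrap a request that passes the cursor as Pushshift's `before` parameter.

```python
def paginate(fetch_page, max_pages=3):
    """Collect posts page by page, walking backwards in time.

    fetch_page(before) is a hypothetical callable that returns a list of post
    dicts created before the epoch timestamp `before` (or the newest posts
    when `before` is None).
    """
    posts, before = [], None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # nothing older is left
            break
        posts.extend(page)
        # the oldest post on this page becomes the cursor for the next request
        before = min(p["created_utc"] for p in page)
    return posts


# Offline stand-in for the API so the sketch runs without network access.
FAKE_POSTS = [{"id": f"p{t}", "created_utc": t} for t in range(10, 0, -1)]


def fake_fetch(before, size=4):
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


collected = paginate(fake_fetch, max_pages=5)
print(len(collected), len({p["id"] for p in collected}))  # 10 10
```

Because every page starts strictly before the oldest post already seen, no post is fetched twice. Posts that share the exact same timestamp as the cursor could be missed, which is a known trade-off of this simple scheme.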
    


## Collecting Reddit Submissions for r/Crypto_com

In [8]:
crypto_subs = pushshift('Crypto_com',post_type = 'submission',loops=100,skip=10)
# Collect submissions in 100 requests, moving the `after` offset back by 10 days per loop (from 1 to 991 days).
print('shape',crypto_subs.shape)
#crypto_subs.to_csv('crypto_subs-pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=1d
1 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=11d
2 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=21d
3 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=31d
4 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=41d
5 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=51d
6 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=61d
7 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=71d
8 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=81d
9 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=91d
10 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_

87 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=871d
88 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=881d
89 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=891d
90 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=901d
91 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=911d
92 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=921d
93 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=931d
94 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=941d
95 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=951d
96 https://api.pushshift.io/reddit/search/submission/?subreddit=Crypto_com&size=100&after=961d
97 https://api.pushshift.io/reddit/search/submissi
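
The `after` values in the printed URLs follow directly from `skip * i + after` in the function body. The short sketch below reproduces that arithmetic. `window_offsets` is a helper name introduced only for this illustration, and `skip=5` is shown as a second example value.

```python
def window_offsets(loops, skip, after=1):
    """Return the day offsets sent as the `after` parameter, one per loop."""
    return [skip * i + after for i in range(loops)]


ten_day_offsets = window_offsets(loops=100, skip=10)
five_day_offsets = window_offsets(loops=100, skip=5)

print(ten_day_offsets[:3], ten_day_offsets[-1])    # [1, 11, 21] 991
print(five_day_offsets[:3], five_day_offsets[-1])  # [1, 6, 11] 496

# With size=100, 100 loops can return at most 100 * 100 rows before de-duplication.
max_rows = 100 * 100
print(max_rows)  # 10000
```

So 100 loops with `skip=10` reach back about 991 days rather than 100 days, and the number of rows collected cannot exceed `loops * size`.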

## Collecting Reddit Submissions for r/bodyweightfitness

In [5]:
fitness_subs = pushshift('bodyweightfitness',post_type = 'submission',loops=100,skip=5)
# Collect submissions in 100 requests, moving the `after` offset back by 5 days per loop (from 1 to 496 days).
print('shape',fitness_subs.shape)
#fitness_subs.to_csv('fitness_subs-pushshift.csv')

0 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=1d
1 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=6d
2 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=11d
3 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=16d
4 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=21d
5 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=26d
6 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=31d
7 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=36d
8 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=41d
9 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=46d
10

81 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=406d
82 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=411d
83 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=416d
84 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=421d
85 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=426d
86 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=431d
87 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=436d
88 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=441d
89 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&size=100&after=446d
90 https://api.pushshift.io/reddit/search/submission/?subreddit=bodyweightfitness&

In [11]:
crypto_subs.head()

| | author | author_fullname | created_utc | id | is_self | num_comments | permalink | score | selftext | subreddit | title | url | post_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NoBig894 | t2_94dqtx3h | 1646038109 | t3b5f8 | False | 0 | /r/Crypto_com/comments/t3b5f8/jamz_royal_egg_c... | 1 | | Crypto_com | Jamz - Royal Egg Club Drop | https://www.reddit.com/gallery/t3b5f8 | submission |
| 1 | Jordan7matias | t2_3otgm3s6 | 1646038920 | t3bcgk | True | 0 | /r/Crypto_com/comments/t3bcgk/crypto_exchange_... | 1 | Anyone here Canadian and buys USDC from main a... | Crypto_com | Crypto Exchange / .Com Canada | https://www.reddit.com/r/Crypto_com/comments/t... | submission |
| 2 | citytelegraph | t2_a8kya6is | 1646041824 | t3c15c | False | 0 | /r/Crypto_com/comments/t3c15c/cryptocom_price_... | 1 | | Crypto_com | Crypto.com Price Prediction — Backed by Matt D... | https://citytelegraph.com/crypto/26016/crypto-... | submission |
| 3 | BryanM_Crypto | t2_4ic3np86 | 1646042457 | t3c6ua | True | 0 | /r/Crypto_com/comments/t3c6ua/cryptocom_is_sup... | 1 | `&amp;#x200B;\n\nhttps://preview.redd.it/iuhzzt...` | Crypto_com | Crypto.com is supporting Ontology’s network up... | https://www.reddit.com/r/Crypto_com/comments/t... | submission |
| 4 | getschnotzt | t2_94h2tyx0 | 1646043887 | t3cit6 | True | 0 | /r/Crypto_com/comments/t3cit6/best_way_to_buy_... | 1 | Just wanted to make things clear, the best way... | Crypto_com | Best Way to buy via CDC? | https://www.reddit.com/r/Crypto_com/comments/t... | submission |
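
The `created_utc` column holds Unix epoch seconds, which are hard to read at a glance. The sketch below converts one value with the standard `datetime` module, using the timestamp of row 0 above. `utc_from_epoch` is a helper name introduced here for illustration.

```python
import datetime as dt


def utc_from_epoch(created_utc):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created_utc, tz=dt.timezone.utc)


first_post = utc_from_epoch(1646038109)  # created_utc of row 0 above
print(first_post.isoformat())  # 2022-02-28T08:48:29+00:00
```

For a whole column, `pd.to_datetime(crypto_subs['created_utc'], unit='s', utc=True)` should give the equivalent result in pandas.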


## Collecting Reddit Comments for r/Crypto_com

In [14]:
crypto_comm = pushshift('Crypto_com',post_type = 'comment',loops=100,skip=10)
# Collect comments in 100 requests, moving the `after` offset back by 10 days per loop (from 1 to 991 days).
print('shape',crypto_comm.shape)
#crypto_comm.to_csv('crypto_comm-pushshift.csv')

0 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=1d
1 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=11d
2 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=21d
3 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=31d
4 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=41d
5 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=51d
6 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=61d
7 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=71d
8 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=81d
9 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=91d
10 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=101d
11 https:

90 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=901d
91 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=911d
92 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=921d
93 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=931d
94 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=941d
95 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=951d
96 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=961d
97 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=971d
98 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=981d
99 https://api.pushshift.io/reddit/search/comment/?subreddit=Crypto_com&size=100&after=991d
shape (10000, 10)


## Collecting Reddit Comments for r/bodyweightfitness

In [17]:
fitness_comm = pushshift('bodyweightfitness',post_type = 'comment',loops=100,skip=5)
# Collect comments in 100 requests, moving the `after` offset back by 5 days per loop (from 1 to 496 days).
print('shape',fitness_comm.shape)
#fitness_comm.to_csv('fitness_comm-pushshift.csv')

0 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=1d
1 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=6d
2 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=11d
3 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=16d
4 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=21d
5 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=26d
6 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=31d
7 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=36d
8 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=41d
9 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=46d
10 https://api.pushshift.io/redd

84 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=421d
85 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=426d
86 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=431d
87 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=436d
88 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=441d
89 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=446d
90 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=451d
91 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=456d
92 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=461d
93 https://api.pushshift.io/reddit/search/comment/?subreddit=bodyweightfitness&size=100&after=466d
94 https:/

In [18]:
fitness_comm.head()

| | author | author_fullname | body | created_utc | id | parent_id | permalink | score | subreddit | post_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Exodus111 | t2_63cbq | Your hips are not large, you just have an incr... | 1646047340 | hyrl38c | t3_t31ghm | /r/bodyweightfitness/comments/t31ghm/not_happy... | 1 | bodyweightfitness | comment |
| 1 | AutoModerator | t2_6l4z3 | Removed due to low effort post. See the postin... | 1646047371 | hyrl4sj | t3_t3ddwk | /r/bodyweightfitness/comments/t3ddwk/how_to_ma... | 1 | bodyweightfitness | comment |
| 2 | SovArya | t2_1w2snh6 | `To progress\n\n1. Add reps\n2. Sets\n3. Reduce...` | 1646047442 | hyrl85b | t3_t3cd0h | /r/bodyweightfitness/comments/t3cd0h/progressi... | 1 | bodyweightfitness | comment |
| 3 | BrainwashingCauldron | t2_9zymyxdm | Honestly, your problem is not progressive over... | 1646047669 | hyrlj6f | t3_t3cd0h | /r/bodyweightfitness/comments/t3cd0h/progressi... | 1 | bodyweightfitness | comment |
| 4 | Weird-Original7430 | t2_893cq29l | As a man with a prominent butt that loves skat... | 1646047898 | hyrlueg | t3_t31ghm | /r/bodyweightfitness/comments/t31ghm/not_happy... | 1 | bodyweightfitness | comment |


In [19]:
crypto_comm.head()

| | author | author_fullname | body | created_utc | id | parent_id | permalink | score | subreddit | post_type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mantz22 | t2_7ehdzwo | `Lol. \n\n"AI iS GoInG tO PoStPoNe AlL oUr CaRd...` | 1646040452 | hyrchdk | t1_hyp0sne | /r/Crypto_com/comments/t2xguv/yea_interesting_... | 1 | Crypto_com | comment |
| 1 | changck007 | t2_a0cecz19 | Rose gold does not look Like how the website s... | 1646040474 | hyrcidt | t3_t3a87e | /r/Crypto_com/comments/t3a87e/icy_white_or_ros... | 1 | Crypto_com | comment |
| 2 | Jordan7matias | t2_3otgm3s6 | `Is cad -&gt; usdc not 1:1?` | 1646040501 | hyrcjk0 | t1_hyomgik | /r/Crypto_com/comments/t2mr79/why_cryptocom_sp... | 1 | Crypto_com | comment |
| 3 | uncl_ephil | t2_4dhgjm56 | Well said | 1646040588 | hyrcngw | t1_hyr9f0g | /r/Crypto_com/comments/t3a87e/icy_white_or_ros... | 1 | Crypto_com | comment |
| 4 | CROmance | t2_hz6rsrfj | Yeah I had my card stolen same thing. Annoying... | 1646040603 | hyrco4a | t3_t3a37n | /r/Crypto_com/comments/t3a37n/fraudulent_trans... | 1 | Crypto_com | comment |
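
Some of the text fields above still contain HTML entities, for example `-&gt;` in a comment body and `&amp;#x200B;` in a submission's `selftext`. The sketch below shows one possible cleaning step using only the standard library. `clean_text` is a name introduced here, and unescaping twice is an assumption based on the double-escaped `&amp;#x200B;` seen in the sample rows, not a documented Pushshift behavior.

```python
import html


def clean_text(text):
    """Decode HTML entities and remove zero-width spaces from a post body."""
    # Unescape twice because some fields appear double-escaped ('&amp;#x200B;').
    text = html.unescape(html.unescape(text))
    # '&#x200B;' decodes to a zero-width space, which carries no meaning for NLP.
    return text.replace("\u200b", "").strip()


print(clean_text("Is cad -&gt; usdc not 1:1?"))  # Is cad -> usdc not 1:1?
print(repr(clean_text("&amp;#x200B;\n\nhello")))  # 'hello'
```

A side effect of the second pass is that text which was deliberately written as an entity (such as a literal `&amp;gt;`) is decoded as well, which seems acceptable for a bag-of-words model.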


- **We concatenate the comments from r/Crypto_com and r/bodyweightfitness and save them as a CSV file for further analysis.**

In [26]:
df = pd.concat([crypto_comm[['body','subreddit']], fitness_comm[['body','subreddit']]], ignore_index=True)
df.to_csv('../datasets/combined_comm.csv',index= False)

**1) Pushshift was used for data collection. In total, 19,898 comments were collected (10,000 from r/Crypto_com and 9,898 from r/bodyweightfitness).**
<br>**2) On inspecting the posts in each subreddit, r/Crypto_com submissions were mostly images while r/bodyweightfitness submissions were mostly text. I therefore decided to analyze the comments from the two subreddits, as they are more comparable.**
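
The comment counts reported above also fix the baseline that any classifier has to beat: always predicting the larger class. The short calculation below is a sketch of that arithmetic using only the totals stated in this notebook. The variable names are introduced here for illustration.

```python
# comment counts reported above
counts = {"Crypto_com": 10_000, "bodyweightfitness": 9_898}

total = sum(counts.values())
# accuracy of always predicting the majority class
baseline_accuracy = max(counts.values()) / total

print(total, round(baseline_accuracy, 4))  # 19898 0.5026
```

The two classes are almost balanced, so accuracy is a reasonable first metric and a useful model should score well above roughly 50%.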