# Project 3 - Web APIs and NLP Classification

## Part 1: Web Scraping & Data Gathering

For Project 3, your goal is twofold:

* Using Pushshift's API, you'll collect posts from two subreddits of your choosing.
* You'll then use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.


Requirements
* Gather and prepare your data using the requests library.
* Create and compare two models. One of these must be a Random Forest classifier; the other can be any classifier of your choosing: logistic regression, KNN, SVM, etc.
* A Jupyter Notebook with your analysis for a peer audience of data scientists.
* An executive summary of your results.
* A short presentation outlining your process and findings for a semi-technical audience.
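
The model-comparison requirement above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the toy texts and labels are made up, the choice of `TfidfVectorizer` and logistic regression as the second model is an assumption, and real posts collected from the two subreddits would replace the toy data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in data (made up for illustration): post text plus a subreddit label
texts = [
    "best gpu for mining ether", "hashrate dropped after driver update",
    "mining rig power supply advice", "overclock settings for my gpu rig",
    "pool fees and payout threshold", "rig keeps crashing while mining",
    "sourdough starter not rising", "best flour for crusty bread",
    "how long to proof pizza dough", "my first loaf came out dense",
    "oven temperature for baguettes", "kneading dough by hand tips",
]
labels = [1] * 6 + [0] * 6  # 1 = first subreddit, 0 = second subreddit

# The required Random Forest plus one other classifier (assumed: logistic regression)
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    # Vectorize inside the pipeline so each CV fold fits its own vocabulary
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(pipe, texts, labels, cv=3).mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```

Keeping the vectorizer inside the pipeline means the test folds never influence the vocabulary, which gives a fairer comparison between the two models.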

For Project 3, the evaluation categories are as follows:
The Data Science Process

* Problem Statement
* Data Collection
* Data Cleaning & EDA
* Preprocessing & Modeling
* Evaluation and Conceptual Understanding
* Conclusion and Recommendations
* Organization and Professionalism

Organization
* Visualizations
* Python Syntax and Control Flow
* Presentation

### Model to classify data

In [None]:
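import time  # used for the pauses between requests

import requests
import pandas as pd
from datetime import datetime as dt
from os import path
from tqdm import trange  # progress bar used by the scraping loop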

In [None]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [None]:
params = {
    'subreddit' : 'EtherMining',
    'size': 500
}

In [None]:
res = requests.get(url, params)

In [None]:
res.status_code

In [None]:
data = res.json()

In [None]:
posts = data['data']

In [None]:
len(posts)

In [None]:
df = pd.DataFrame(posts)

In [None]:
df.head()

In [None]:
df[['subreddit', 'selftext', 'title']].head(10)

In [None]:
## Let's work on collecting 10,000 posts

In [None]:
#define parameters for collecting more submissions

def parameters(df, subreddit):
    
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': df.loc[(df.shape[0] - 1), 'created_utc']
    }

    return params
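
The `before` value passed to the API is a Unix timestamp in seconds, taken from the `created_utc` column. As a minimal sketch for sanity-checking a cutoff, the helpers below convert between epoch seconds and UTC datetimes; `epoch_to_utc` and `utc_to_epoch` are hypothetical names introduced here, and the sample value is the fixed cutoff used elsewhere in this notebook.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)


def utc_to_epoch(moment):
    """Convert a timezone-aware datetime back to whole Unix seconds."""
    return int(moment.timestamp())


cutoff = epoch_to_utc(1658758060)
print(cutoff.isoformat())  # 2022-07-25T14:07:40+00:00
```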

In [None]:
#def function to collect more submissions from subreddit
def get_posts(params):
    
    #url for searching subreddit with Pushshift.io
    url = "https://api.pushshift.io/reddit/search/submission"
    
    #scrape submissions data from reddit into json format
    res = requests.get(url, params=params)
    data = res.json()
    
    #return data in pandas dataframe format
    df = pd.DataFrame(data['data'])
    
    return df
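
The two helpers above amount to paging backwards through time: each request asks for posts older than the oldest one already collected. A minimal sketch of that loop is below. `collect_posts`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fetch function is passed in so the logic can be tried offline against fake data rather than the live API.

```python
def collect_posts(fetch_page, subreddit, start_before, max_pages=5, page_size=100):
    """Page backwards through a subreddit using the `before` cursor.

    fetch_page is any callable that takes a params dict and returns a list of
    post dicts, each with a 'created_utc' key (an assumed interface).
    """
    posts = []
    before = start_before
    for _ in range(max_pages):
        params = {"subreddit": subreddit, "size": page_size, "before": before}
        page = fetch_page(params)
        if not page:
            break  # no older posts left
        posts.extend(page)
        # the oldest post on this page becomes the cursor for the next request
        before = min(post["created_utc"] for post in page)
    return posts


# Offline stand-in for the API: 250 fake posts, one per second, newest first
fake_posts = [{"id": i, "created_utc": 1000 - i} for i in range(250)]


def fake_fetch(params):
    older = [p for p in fake_posts if p["created_utc"] < params["before"]]
    return older[: params["size"]]


collected = collect_posts(fake_fetch, "EtherMining", start_before=1001)
print(len(collected))  # 250
```

With the notebook's own helper, something like `lambda params: get_posts(params).to_dict("records")` could be passed as `fetch_page`, assuming the response contains a `created_utc` field for every post.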

In [None]:
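# Scrape up to 100 more submissions per pass. The original target was 199 passes
# for about 20,000 submissions in total; trange(1) below runs a single pass.

# assumes df_em starts from the EtherMining posts already loaded into df above
df_em = df.copy()

for i in trange(1):
    
    try:
        param = parameters(df_em, 'EtherMining')
        df_em = pd.concat([df_em, get_posts(param)], ignore_index=True)
    
    except Exception as error:
        # notifies us if there is an error during scraping
        print(f"Error occurred while scraping: {error}")
        
    # 10-second interval between requests to prevent server overload
    time.sleep(10)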

In [None]:
url = 'https://api.pushshift.io/reddit/search/submission'
file = '../data/EtherMining.csv'
subreddit = 'EtherMining'

In [None]:
for i in range(10):
    print(f'Loop {i}')
    # If the file does not exist, start pulling posts from a fixed cutoff timestamp (Unix epoch seconds);
    # otherwise resume from the created_utc of the last post saved in the file
    if not path.isfile(file):
        df=pd.DataFrame()
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': 1658758060
        }
    else:
        df = pd.read_csv(file)
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': df.loc[df.shape[0]-1,'created_utc']
        }
         
    success = False
    
    while not success:
        try:
            res = requests.get(url, params)
            status = res.status_code
            print(f'Get Status: {status}')
            if status == 200:
                success = True
            else:
                time.sleep(10)
        except Exception as error:
            print(error)
            # wait before retrying so a failing request does not hammer the server
            time.sleep(10)
    
    data = res.json()
    posts = data['data']
    temp_df = pd.DataFrame(posts)
    pd.concat([df, temp_df]).to_csv(file, index=False)
        
    time.sleep(10)
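
The `while not success` loop above retries indefinitely, which can hang the notebook if the API stays down. One alternative is a bounded retry, sketched below under stated assumptions: `fetch_with_retry` is a hypothetical helper, `fetch` is assumed to be a zero-argument callable returning a `(status_code, payload)` pair, and the sleep function is injectable so the example runs instantly offline.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, wait_seconds=10, sleep=time.sleep):
    """Call fetch() until it returns status 200 or the attempts run out.

    fetch is an assumed zero-argument callable returning (status_code, payload).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            status, payload = fetch()
            if status == 200:
                return payload
            print(f"Attempt {attempt}: status {status}")
        except Exception as error:
            print(f"Attempt {attempt}: {error}")
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError(f"No successful response after {max_attempts} attempts")


# Offline demonstration with a stand-in that fails twice, then succeeds
responses = iter([(503, None), (429, None), (200, {"data": []})])
waits = []
result = fetch_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)  # {'data': []} [10, 10]
```

In the notebook, the callable could wrap the real request, for example a small function that calls `requests.get(url, params=params)` and returns `(res.status_code, res.json())` on success.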