<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Subreddit Classifier - Tea or Coffee?
By Amira (DSI-28)

---
# Problem Statement

We are a team of data scientists working for Coffea Vibes, a beverage company. The company is venturing into e-commerce and will be launching its own website/application selling coffee and tea products to consumers. We have been tasked with building a classification model that can accurately determine whether a piece of text is about coffee or tea.

Our classification model will contribute to the following use cases:

1. Our web development team can optimise the recommender systems to accurately suggest related products and advertisements to potential consumers, who may have varying preferences for coffee or tea.
2. Our business insights team can use the classification model to correctly separate coffee-related from tea-related customer feedback (e.g. emails) and comments made on the company's social media pages, so that we can better understand customers' feedback and take appropriate action quickly, if necessary.

We evaluated the models based on the following criteria:

1. Accuracy scores (the higher, the better)
2. Delta between train and test scores (the smaller, the better)
3. Clear distinction of important features, i.e. words that distinguish coffee from tea
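
As a rough illustration of criteria 1 and 2, the sketch below computes accuracy on a train and a test split and the gap between them. The helper names (`accuracy`, `train_test_gap`) and the labels are made up for illustration only; they are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def train_test_gap(train_acc, test_acc):
    """Absolute difference between train and test accuracy."""
    return abs(train_acc - test_acc)


# made-up labels, purely for illustration
y_train_true = ['coffee', 'tea', 'tea', 'coffee']
y_train_pred = ['coffee', 'tea', 'tea', 'coffee']
y_test_true = ['coffee', 'tea', 'coffee', 'tea']
y_test_pred = ['coffee', 'tea', 'tea', 'tea']

train_acc = accuracy(y_train_true, y_train_pred)
test_acc = accuracy(y_test_true, y_test_pred)
gap = train_test_gap(train_acc, test_acc)

print(f'train accuracy: {train_acc}, test accuracy: {test_acc}, gap: {gap}')
```

A large gap with a high train score would suggest overfitting, which is why the gap is tracked alongside accuracy itself.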

---
# Structure

To keep my work organised, I have split this project into three notebooks:

* Notebook 1: Data Acquisition
* Notebook 2: Data Cleaning & Exploratory Data Analysis
* Notebook 3: Modelling & Model Evaluation

<span style='color:red'>**This is Notebook 1.**</span>

---
## Part 1: Data Acquisition

In [1]:
# import required libraries/packages

import requests
import pandas as pd
import time

import warnings
warnings.filterwarnings('ignore')

pd.options.display.max_rows = 500
pd.options.display.max_columns = 100

In [2]:
# define a function that takes a subreddit name as input.
# the function loops `loops` times, scraping up to 100 reddit submission posts per request.

def my_webscraper(subreddit, loops=20):

    frames = [] # collect each batch here; DataFrame.append was removed in pandas 2.0, so we concatenate once at the end
    url = 'https://api.pushshift.io/reddit/search/submission'

    for i in range(loops): # default is 20 iterations (up to 100 posts each); pass another value for `loops` to change this.
        if i == 0: # first request: no 'before' param, so the API returns the most recent posts as of when this code is run.
            params = {'subreddit': str(subreddit), 'size': 100}
        else: # subsequent requests: set 'before' to the 'created_utc' value of the last post in the previous batch.
            params = {'subreddit': str(subreddit), 'size': 100, 'before': bef}

        # the code below repeats for each iteration of i.
        res = requests.get(url, params=params)
        results = res.json()
        posts = pd.DataFrame(results['data'])
        if posts.empty: # stop early if no more posts are returned; posts.iloc[-1] would raise an IndexError otherwise
            break
        bef = posts.iloc[-1]['created_utc']
        frames.append(posts)
        rows = sum(len(frame) for frame in frames)
        print(f'This is iteration no. {i+1}, status code is {res.status_code}, accumulated no. of rows extracted is {rows}.')
        time.sleep(5) # pause for 5 seconds between requests.

    # combine all batches into a single dataframe (empty if nothing was scraped)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
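
The `before` cursor logic in the function above can be tried out without any network access. The sketch below is a simplified stand-in, not the scraper itself: `paginate_before`, `fake_fetch` and `FAKE_POSTS` are hypothetical names introduced here, and the in-memory list only mimics an API that returns posts from newest to oldest.

```python
import pandas as pd


def paginate_before(fetch_page, pages=3):
    """Collect pages of posts, walking backwards in time via a 'before' cursor."""
    frames = []
    before = None  # no cursor on the first request
    for _ in range(pages):
        batch = fetch_page(before)
        if not batch:  # nothing older is left
            break
        frames.append(pd.DataFrame(batch))
        before = batch[-1]['created_utc']  # last post of this page becomes the cursor
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# hypothetical in-memory posts, newest first (stands in for the real API)
FAKE_POSTS = [{'id': f'p{t}', 'created_utc': t} for t in range(10, 0, -1)]


def fake_fetch(before, size=4):
    """Return up to `size` posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]


demo = paginate_before(fake_fetch, pages=5)
print(demo)
```

Because each page starts strictly before the last timestamp of the previous one, no post is fetched twice, and the loop stops on its own once the source runs out of older posts.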

In [None]:
# scrape the 'coffee' subreddit using the default number of loops (20 requests of up to 100 posts each)

coffee = my_webscraper(subreddit='coffee')
print(coffee.shape)
print(coffee.info())

In [None]:
tea = my_webscraper(subreddit='tea')
print(tea.shape)
print(tea.info())

In [None]:
# export the dataframes to csv for use in Notebook 2
# the files are already saved in the data folder; the lines below are commented out so that re-running this notebook does not overwrite them with freshly scraped data.

# coffee.to_csv('./data/coffee.csv',index=False,na_rep=' ')
# tea.to_csv('./data/tea.csv',index=False,na_rep=' ')
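
An alternative to commenting the export out is to write the file only when it does not exist yet. The sketch below shows one possible way to do this; `save_if_absent` is a hypothetical helper, not part of pandas, and the demo writes to a temporary folder rather than `./data`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_if_absent(df, path):
    """Write df to csv only if the file does not exist; return True if written."""
    path = Path(path)
    if path.exists():
        return False  # keep the earlier extract untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False, na_rep=' ')
    return True


# demo in a temporary folder, so no real data files are touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'data' / 'demo.csv'
    first = save_if_absent(pd.DataFrame({'title': ['espresso']}), target)
    second = save_if_absent(pd.DataFrame({'title': ['matcha']}), target)
    kept = pd.read_csv(target)['title'].tolist()

print(first, second, kept)
```

With a guard like this, the export cell could stay active, since a second run would leave the original extract in place.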