# Problem Statement

An investment company wants to collate Reddit posts on investment-related trends so that it can recommend a type of investment to a client based on the client's comments. Because the comments come mostly from retail investors discussing stocks and cryptocurrency, the company wants to classify them by topic. It has commissioned you to build a classification model, trained on data from the subreddits r/CryptoCurrency and r/stocks, that assigns each comment to the correct topic. This would allow the company to suggest a suitable type of investment for a client based on their comments.

- Scrape sample comments from the subreddits r/CryptoCurrency and r/stocks
- Build a classification model that assigns each comment to the correct subreddit with minimal misclassifications

## Import Libraries

In [1]:
import requests
import numpy as np
import pandas as pd
import time

## Helper methods for Web Scraping off Reddit

In [5]:
def get_data(subreddit):
    """Scrape submissions with usable body text from a subreddit via Pushshift."""
    base_url = 'https://api.pushshift.io/reddit/search/submission'
    params = {
        'subreddit': subreddit,
        'size': 100,
    }
    comb_df = pd.DataFrame()

    # Keep requesting older pages until more than 1000 usable posts are collected
    while len(comb_df) <= 1000:
        res = requests.get(base_url, params=params)
        # Retry every 0.5 s until the API returns HTTP 200
        while res.status_code != 200:
            time.sleep(0.5)
            res = requests.get(base_url, params=params)
        posts = res.json()['data']
        if not posts:
            # No older submissions left; stop rather than fail on an empty page
            break
        tmp_df = pd.DataFrame(posts)

        # Drop posts whose body is missing, empty, removed, or deleted
        has_text = tmp_df['selftext'].notna() & ~tmp_df['selftext'].isin(['', '[removed]', '[deleted]'])
        comb_df = pd.concat([comb_df, tmp_df.loc[has_text, :]], ignore_index=True)

        # Page backwards in time from the oldest post in the raw page, so the
        # cursor still advances when every post in the page is filtered out
        params['before'] = tmp_df.loc[tmp_df.index[-1], 'created_utc']
        # print(comb_df.shape)  # uncomment to log progress

    return comb_df
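
The paging logic in `get_data` can be tried without network access. The sketch below is an illustration only: `FAKE_POSTS`, `fake_fetch`, and `paginate` are hypothetical names standing in for the Pushshift endpoint and the scraping loop, and the target of 30 posts is an arbitrary small number chosen for the demo. It shows the two ideas the helper relies on: filtering out unusable `selftext` values, and taking the `before` cursor from the raw page rather than from the filtered rows.

```python
import pandas as pd

# Hypothetical stand-in for the API: 250 fake submissions, newest first,
# where only one in five has usable body text.
FAKE_POSTS = [
    {
        'created_utc': 1_000_000 - i,
        'selftext': ['some text', '', '[removed]', '[deleted]', None][i % 5],
    }
    for i in range(250)
]

def fake_fetch(before=None, size=100):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

def paginate(target=30):
    """Walk backwards in time until more than `target` usable posts are kept."""
    frames = []
    n_kept = 0
    before = None
    while n_kept <= target:
        posts = fake_fetch(before=before)
        if not posts:
            break  # ran out of data before reaching the target
        page = pd.DataFrame(posts)
        # Same filter as get_data: body must be present and not a placeholder
        usable = page['selftext'].notna() & ~page['selftext'].isin(['', '[removed]', '[deleted]'])
        frames.append(page[usable])
        n_kept += int(usable.sum())
        # The cursor comes from the raw page, so it advances even if no row survived
        before = int(page['created_utc'].iloc[-1])
    kept = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return kept, len(frames)

kept, pages = paginate(target=30)
print(len(kept), pages)
```

Collecting the filtered pages in a list and concatenating once at the end is a design choice of this sketch; it avoids repeatedly copying a growing DataFrame, whereas `get_data` concatenates inside the loop.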

## Creation of filepath to store scraped data

In [6]:
import os

outdir = '../data'
# Create the output directory (and any missing parents) if it does not already exist
os.makedirs(outdir, exist_ok=True)


## Scraping and storing of files into .csv format

In [None]:
%%time
crypto_df = get_data('cryptocurrency')
crypto_df.to_csv('../data/cryptocurrency.csv', index=False)

In [7]:
%%time
stocks_df = get_data('stocks')
stocks_df.to_csv('../data/stocks.csv', index=False)

(40, 70)
(84, 70)
(130, 71)
(166, 71)
(202, 71)
(247, 71)
(282, 71)
(313, 71)
(354, 71)
(401, 71)
(440, 71)
(478, 71)
(509, 71)
(540, 71)
(578, 71)
(611, 71)
(650, 71)
(692, 71)
(733, 71)
(768, 71)
(804, 71)
(847, 71)
(878, 71)
(914, 71)
(949, 71)
(979, 71)
(1017, 71)
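
Once both subreddits have been scraped, the two frames need a target label before they can be used to train a classifier. The following is a minimal sketch of one way that could be done, not the notebook's actual next step: the function `label_and_combine` and the column name `is_crypto` are assumptions introduced here, and the tiny frames are toy stand-ins for `crypto_df` and `stocks_df`.

```python
import pandas as pd

# Toy stand-ins for the scraped frames; the real ones have about 70 columns
crypto_df = pd.DataFrame({'selftext': ['BTC is up today', 'ETH gas fees are high']})
stocks_df = pd.DataFrame({'selftext': ['AAPL earnings beat estimates']})

def label_and_combine(crypto, stocks):
    """Keep the text column, add a binary label, and stack the two frames."""
    crypto = crypto[['selftext']].assign(is_crypto=1)  # 1 = r/CryptoCurrency
    stocks = stocks[['selftext']].assign(is_crypto=0)  # 0 = r/stocks
    return pd.concat([crypto, stocks], ignore_index=True)

combined = label_and_combine(crypto_df, stocks_df)
print(combined['is_crypto'].value_counts())
```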
