<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Project 3 Web APIs and NLP

by Tan Jun Pin

# Part 1 of 3

Due to the large file size, please refer to the link below for the scraped data in CSV format.

[Google Drive Link](https://drive.google.com/drive/folders/1lFYM9on5DrxRIsT3K_eY7DVi6A_SbN_w?usp=sharing)

The scraped datasets are also stored in an AWS S3 bucket for extraction and use.

# Overview

With the recent global recession, inflation has hit almost every country. Personal finance is becoming crucial, and many people have recognized that investing is one way to beat inflation and still achieve personal financial success.

We are engaged by Tekla Investing Inc. for their newly launched product, a Regular Saving Plan on US Exchange Traded Funds (ETFs). This product is a type of long-term investment suited to value investors who do not expect high returns in a short period of time. Value investors usually set money aside in an investment for a long period (approximately 6 months and above), which in return carries a smaller risk of losing money.

Our client knows that long-term investment is not for everybody and that only specific groups are suitable for it. The target audience must not expect the investment to yield large returns in a very short period, as this product is only suitable for those with spare money that can be left untouched for at least a few years before seeing a return.

As part of the product launch, one of our client's marketing strategies is targeted ads based on posts on www.reddit.com. Targeting the correct audience with targeted ads is important for cost control. Our client has shortlisted 2 subreddits, `r/stocks` and `r/wallstreetbets`, for this study. Both subreddits are related to stock investment, but the nature of the investing discussed in each is very different. Our client understands that r/stocks is the more suitable subreddit for their new product, but it has only 4.6 million subscribers and 40 posts per day.

They wish to have their ads seen by a larger audience to increase the chance of converting viewers into customers. As such, they are exploring whether r/wallstreetbets could be a good subreddit for their targeted ads. r/wallstreetbets has 12.6 million subscribers and over 300 posts per day, making it far more active than r/stocks. However, before spending resources on this subreddit with almost 3 times as many subscribers, they would like to know whether r/wallstreetbets is a good fit for this product.

"IAS research found that content with positive and neutral sentiment tends to create greater engagement: 80% of consumers were receptive to these ads, and 93% more were favorable to the ads and associated brands." ([source](https://integralads.com/insider/contextual-advertising-sentiment-analysis/.))

According to their Marketing Department, apart from targeting the right audience, our client wishes to know the periods when subreddit sentiment is relatively positive. They are looking to place their ads on the days or at the times when sentiment is positive, as the study quoted above suggests that engagement is generally better when sentiment is positive. Investment-related topics are time-sensitive: a post or topic from the morning may no longer be relevant in the afternoon. Hence, it is important to recognize that the nature and sentiment of posts vary over time.


# Content
This notebook is split into 3 parts:

- Part 1
    - Problem Statement
    - Dataset Sources
    - Web Scraping

- Part 2
    - Data Cleaning
    - EDA
        - Sentiment Analysis
        - Sentiment Analysis Over Time
        - Most Common Words in Subreddit

- Part 3
    - Modeling & Analysis
    - Conclusions & Recommendations

# Problem Statement
Given our client's requirement to promote their newly launched investment product, our objectives for this project are as follows:

- To identify the most common words in r/stocks that are associated with long-term investing and the probability of their appearing in r/wallstreetbets
- To predict the subreddit based on post title and selftext for ad targeting
- To identify the time periods when the subreddits show positive sentiment for ad placement

Using modelling techniques such as Naive Bayes and Random Forest, the best model will be chosen based on F1-score and on the training and test scores.
F1-score is more appropriate for this project because of the imbalanced datasets, which will be elaborated on later. The training and test scores will be compared to judge whether a model is overfitting or underfitting.
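
To illustrate why F1-score can be more informative than accuracy on imbalanced classes, the sketch below computes both from a set of hypothetical confusion-matrix counts. The counts and the helper name `f1_from_counts` are made up for illustration and are not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 1,000 posts, of which only 100 belong to the minority class.
tp, fp, fn, tn = 10, 5, 90, 895

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = f1_from_counts(tp, fp, fn)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

In this toy case accuracy is above 0.9 even though only 10 of the 100 minority-class posts are found, while the F1-score stays low because recall is poor.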

# Dataset Sources

The posts from each subreddit are obtained by web scraping through the Pushshift API, specifically targeting submissions (posts made by authors) in the subreddit.

Because both subreddits are stock and investment related and their content changes over time, instead of fixing the number of posts to scrape from each subreddit, the posts are scraped and then filtered to a 1-year timeframe.
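
The 1-year filter could be sketched as below. This is a minimal illustration on toy data, assuming `created_utc` holds Unix epoch seconds; the helper name `filter_last_year` and the reference date are hypothetical and not taken from the notebook.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def filter_last_year(df, now=None, days=365):
    """Keep rows whose `created_utc` (Unix seconds) falls within `days` before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=days)).timestamp()
    return df[df["created_utc"] >= cutoff].reset_index(drop=True)


# Toy data: created_utc is assumed to hold Unix epoch seconds.
reference = datetime(2022, 9, 1, tzinfo=timezone.utc)
toy = pd.DataFrame({
    "title": ["recent post", "old post"],
    "created_utc": [
        int(datetime(2022, 6, 1, tzinfo=timezone.utc).timestamp()),
        int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp()),
    ],
})

recent = filter_last_year(toy, now=reference)
print(recent["title"].tolist())
```

Passing `now` explicitly keeps the cutoff reproducible; leaving it out uses the current time.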

Because the number of posts per request is limited (250 posts per request), a for loop is used to scrape the data continuously. To ensure no data is missed, the timestamp of the last post from each request is used as the starting point for the next request.

Based on the statistics from https://subredditstats.com/r/wallstreetbets , we are scraping 300,000 posts from r/wallstreetbets to ensure coverage of the 1-year timeframe, and 250,000 posts from r/stocks.
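
As a quick sanity check on these targets, the sketch below works out how many requests are needed at 250 posts per request. The helper name `requests_needed` is hypothetical, and the result is a lower bound, since a request may return fewer posts than the size asked for.

```python
import math

PAGE_SIZE = 250  # posts returned per request, as stated above


def requests_needed(target_posts, page_size=PAGE_SIZE):
    """Minimum number of requests required to collect `target_posts` posts."""
    return math.ceil(target_posts / page_size)


for name, target in [("wallstreetbets", 300_000), ("stocks", 250_000)]:
    print(f"r/{name}: at least {requests_needed(target)} requests")
```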

After scraping, the posts are stored as separate CSV files for use in the next parts of the notebook.

In [1]:
import requests
import pandas as pd
import re

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Number of pulls (freq) and number of posts requested per pull (sizes)
freq = 1000
sizes = 250

In [8]:
def web_scrap(subreddit):
    # Define empty list to store the scraped data
    posts = []

    # Scrape `freq` rounds of up to `sizes` submissions each
    for i in range(freq):
        # First pull: fetch the most recent submissions
        params = {
            'subreddit': subreddit,
            'size': sizes
        }
        # Later pulls: only fetch posts older than the last one collected
        if posts:
            params['before'] = posts[-1]['created_utc']

        # Send the request with the parameters above
        res = requests.get(url, params)
        print(f'status code: {res.status_code}')

        # Stop if the request fails, keeping what has been collected so far
        if res.status_code != 200:
            break

        # Parse the response body as JSON
        data = res.json()
        # print(f"number of data in this pull: {len(data['data'])}")

        # Append the newly scraped posts to the list
        posts.extend(data['data'])
        print(f'number of data after this pull: {len(posts)}')

    return posts

Note: `time.sleep(1)` (after `import time`) should be placed at the end of each loop iteration to avoid overloading the API.
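
One way to combine the pause with handling for failed requests is sketched below. It is a hedged variation, not the code that produced the datasets: the names `fetch_page` and `scrape` and the `retries`, `pause` and `get` parameters are assumptions introduced for illustration, and the endpoint is the same URL used in the notebook.

```python
import time

import requests

URL = "https://api.pushshift.io/reddit/search/submission"


def fetch_page(params, get=requests.get, retries=3, pause=1.0):
    """Return the list under 'data' for one request, or [] if every attempt fails."""
    for attempt in range(retries):
        try:
            res = get(URL, params=params, timeout=30)
            if res.status_code == 200:
                return res.json().get("data", [])
        except (requests.RequestException, ValueError):
            pass  # network error or invalid JSON: fall through and retry
        time.sleep(pause * (attempt + 1))  # wait a little longer after each failure
    return []


def scrape(subreddit, rounds, size=250, get=requests.get, pause=1.0):
    """Page backwards through a subreddit using the last post's timestamp."""
    posts = []
    for _ in range(rounds):
        params = {"subreddit": subreddit, "size": size}
        if posts:
            params["before"] = posts[-1]["created_utc"]
        page = fetch_page(params, get=get, pause=pause)
        if not page:
            break  # nothing returned: stop instead of repeating the same request
        posts.extend(page)
        time.sleep(pause)  # pause between requests to avoid overloading the API
    return posts
```

The `get` parameter exists only so the loop can be exercised with a stand-in function instead of the live API; in normal use it can be left at its default.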

In [7]:
%%time
df_wsb = pd.DataFrame(web_scrap('wallstreetbets'))
df_wsb.to_csv('./datasets/wallstreetbet_300k.csv', index=False)

status code: 200
number of data after this pull: 250
status code: 200
number of data after this pull: 500
status code: 200
number of data after this pull: 750
status code: 200
number of data after this pull: 1000
status code: 200
number of data after this pull: 1250
status code: 200
number of data after this pull: 1500
status code: 200
number of data after this pull: 1750
status code: 200
number of data after this pull: 2000
status code: 200
number of data after this pull: 2250
status code: 200
number of data after this pull: 2500
status code: 200
number of data after this pull: 2750
status code: 200
number of data after this pull: 3000
status code: 200
number of data after this pull: 3250
status code: 200
number of data after this pull: 3500
status code: 200
number of data after this pull: 3749
status code: 200
number of data after this pull: 3999
status code: 200
number of data after this pull: 4249
status code: 200
number of data after this pull: 4499
status code: 200
number of data

In [9]:
%%time
df_sto = pd.DataFrame(web_scrap('stocks'))
df_sto.to_csv('./datasets/stocks_300k.csv', index=False)

status code: 200
number of data after this pull: 250
status code: 200
number of data after this pull: 499
status code: 200
number of data after this pull: 749
status code: 200
number of data after this pull: 999
status code: 200
number of data after this pull: 1249
status code: 200
number of data after this pull: 1499
status code: 200
number of data after this pull: 1747
status code: 200
number of data after this pull: 1997
status code: 200
number of data after this pull: 2247
status code: 200
number of data after this pull: 2497
status code: 200
number of data after this pull: 2747
status code: 200
number of data after this pull: 2997
status code: 200
number of data after this pull: 3246
status code: 200
number of data after this pull: 3496
status code: 200
number of data after this pull: 3746
status code: 200
number of data after this pull: 3996
status code: 200
number of data after this pull: 4246
status code: 200
number of data after this pull: 4496
status code: 200
number of data 

Note: The notebook took approximately 3 hours to scrape the data.

The scraped data are then stored manually in an AWS S3 bucket.