<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 Web APIs and NLP

_Authors: Joel Quek (SG)_

# Problem Statement

Build an NLP model that classifies a post by its subreddit of origin: r/investing, r/stockmarket or r/wallstreetbets.

# Exploratory Data Analysis

## Import Libraries

In [124]:
#All libraries used in this project are listed here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

import re
from bs4 import BeautifulSoup 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             make_scorer, precision_score, recall_score, roc_auc_score)

# Open Scraped Datasets

The Jupyter notebooks used for scraping are 'reddit-scrape.ipynb' and 'wallstreetbets-scrape.ipynb'.

In [125]:
investing_df = pd.read_csv('datasets/investing.csv')
stockmarket_df = pd.read_csv('datasets/stockmarket.csv')
wallstreetbets_df = pd.read_csv('datasets/wallstreetbets.csv')
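
Each scraped file carries 75 to 84 columns, but only a few are used in this notebook. As a sketch of an alternative (not what the cell above does), `usecols` restricts the load to the columns that are needed. The in-memory CSV, its `score` column and the names `fake_csv`, `needed` and `small_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for one scraped CSV file, with one extra unused column.
fake_csv = io.StringIO(
    "subreddit,author,selftext,title,created_utc,score\n"
    "investing,user_a,some text,A title,1657271926,1\n"
)

# Only these columns are read; everything else in the file is skipped.
needed = ['subreddit', 'author', 'selftext', 'title', 'created_utc']
small_df = pd.read_csv(fake_csv, usecols=needed)
print(small_df.columns.tolist())
```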

## r/investing

In [126]:
investing_df.shape

(7995, 75)

In [127]:
investing_df.iloc[investing_df.shape[0]-1]['created_utc']

# GMT: Friday, July 8, 2022 9:18:46 AM

1657271926

In [128]:
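# Keep only the modelling columns; .copy() avoids a SettingWithCopyWarning on later column assignments
investing_df=investing_df[['subreddit', 'author', 'selftext', 'title']].copy()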
investing_df.head()

| | subreddit | author | selftext | title |
|---|---|---|---|---|
| 0 | investing | HomeInvading | Hey guys, I’m a 22 year old male, I grew up wi... | Help a young man out would ya? |
| 1 | investing | ocean-airseashell10 | [removed] | Treasury bonds is it a good idea to buy |
| 2 | investing | ocean-airseashell10 | [removed] | How to buy treasury bonds? Is treasury’s direc... |
| 3 | investing | iamjokingiamserious | [removed] | Early Exercise of Stock Options |
| 4 | investing | jamesterryburke01 | Hello Redditors 👋 \n\nI work as a Investment C... | Alternative Investments - |


## r/stockmarket

In [129]:
stockmarket_df.shape

(7494, 81)

In [130]:
stockmarket_df.iloc[stockmarket_df.shape[0]-1]['created_utc']

# GMT: Wednesday, July 13, 2022 2:13:58 AM

1657678438

In [131]:
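# Keep only the modelling columns; .copy() avoids a SettingWithCopyWarning on later column assignments
stockmarket_df=stockmarket_df[['subreddit', 'author', 'selftext', 'title']].copy()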
stockmarket_df.head()

| | subreddit | author | selftext | title |
|---|---|---|---|---|
| 0 | StockMarket | zitrored | | Looking for the next exogenous event that take... |
| 1 | StockMarket | CompetitiveMission1 | [Link to the full article (4 min read)](https:... | China stocks notch trillion-dollar gain on hop... |
| 2 | StockMarket | jaltrading21 | | Get ready for some economic news and company e... |
| 3 | StockMarket | ShabbyShamble | | Market Recap! Bear Market Blues! Palantir (PLT... |
| 4 | StockMarket | PriceActionHelp | | Why it's not smart to rely on the RSI divergence |


## r/wallstreetbets

In [132]:
wallstreetbets_df.shape

(5998, 84)

In [133]:
wallstreetbets_df.iloc[wallstreetbets_df.shape[0]-1]['created_utc']

# GMT: Thursday, October 27, 2022 12:02:36 AM

1666828956
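
The GMT comments above can be reproduced in pandas rather than with an external epoch converter. A minimal sketch, assuming `created_utc` holds Unix time in seconds; the dictionary names `last_created_utc` and `stamps` are illustrative only.

```python
import pandas as pd

# created_utc of the last row in each scraped file, copied from the outputs above.
last_created_utc = {
    'investing': 1657271926,
    'StockMarket': 1657678438,
    'wallstreetbets': 1666828956,
}

# unit='s' treats the integers as seconds since 1970-01-01; utc=True makes the result timezone-aware.
stamps = {name: pd.to_datetime(ts, unit='s', utc=True)
          for name, ts in last_created_utc.items()}

for name, stamp in stamps.items():
    print(name, stamp.day_name(), stamp.isoformat())
```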

In [134]:
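# Keep only the modelling columns; .copy() avoids a SettingWithCopyWarning on later column assignments
wallstreetbets_df=wallstreetbets_df[['subreddit', 'author', 'selftext', 'title']].copy()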
wallstreetbets_df.head()

| | subreddit | author | selftext | title |
|---|---|---|---|---|
| 0 | wallstreetbets | Pro-Gambler99 | | i wish i was indian |
| 1 | wallstreetbets | Fit_One4445 | | Be back in 3 months I guess… |
| 2 | wallstreetbets | Plastic-Ad-2191 | What do you think to become in a "introducing ... | INTRODUCING BROKER TICKMILL |
| 3 | wallstreetbets | banmereddit5775r4 | [removed] | Coin is up on God awful earnings. |
| 4 | wallstreetbets | universityofnonsense | | Which one of you apes is in front of me right ... |


# Final Cleaning 

## Handling Missing Values

In [135]:
investing_df['selftext']=investing_df['selftext'].fillna(' ')
stockmarket_df['selftext']=stockmarket_df['selftext'].fillna(' ')
wallstreetbets_df['selftext']=wallstreetbets_df['selftext'].fillna(' ')
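
The fill matters because a missing value propagates through string concatenation: one NaN part makes the whole combined string NaN. A small sketch on made-up data (the frame `toy` and its contents are hypothetical):

```python
import numpy as np
import pandas as pd

# Two toy posts; the second has no body text, like a link-only post.
toy = pd.DataFrame({
    'title': ['Post with body', 'Link-only post'],
    'selftext': ['some body text', np.nan],
})

# Without the fill, the NaN body turns the whole combined string into NaN.
unfilled = 'Title: ' + toy['title'] + ' Text: ' + toy['selftext']

# With the fill, every row yields a usable string.
filled = 'Title: ' + toy['title'] + ' Text: ' + toy['selftext'].fillna(' ')

print(unfilled.isna().tolist())
print(filled.isna().tolist())
```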

In [136]:
investing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7995 entries, 0 to 7994
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  7995 non-null   object
 1   author     7995 non-null   object
 2   selftext   7995 non-null   object
 3   title      7995 non-null   object
dtypes: object(4)
memory usage: 250.0+ KB


In [137]:
stockmarket_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7494 entries, 0 to 7493
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  7494 non-null   object
 1   author     7494 non-null   object
 2   selftext   7494 non-null   object
 3   title      7494 non-null   object
dtypes: object(4)
memory usage: 234.3+ KB


In [138]:
wallstreetbets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5998 entries, 0 to 5997
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  5998 non-null   object
 1   author     5998 non-null   object
 2   selftext   5998 non-null   object
 3   title      5998 non-null   object
dtypes: object(4)
memory usage: 187.6+ KB


## Feature Engineering

I will combine the 'author', 'title' and 'selftext' columns into a single 'Posts' text column, prefixing each part with a label.

In [139]:
investing_df['Posts']='Author: '+investing_df['author']+' Title: ' + investing_df['title']+' Text: '+investing_df['selftext']
stockmarket_df['Posts']='Author: '+stockmarket_df['author']+' Title: ' + stockmarket_df['title']+' Text: '+stockmarket_df['selftext']
wallstreetbets_df['Posts']='Author: '+wallstreetbets_df['author']+' Title: ' + wallstreetbets_df['title']+' Text: '+wallstreetbets_df['selftext']

In [140]:
investing_df=investing_df[['subreddit','Posts']]
stockmarket_df=stockmarket_df[['subreddit','Posts']]
wallstreetbets_df=wallstreetbets_df[['subreddit','Posts']]
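
The same two steps are repeated for each of the three dataframes. One way to avoid the repetition is a small helper, sketched here on made-up data; `build_posts` and `toy` are hypothetical names and are not used elsewhere in this notebook.

```python
import pandas as pd

def build_posts(frame):
    """Return a two-column frame: subreddit plus the labelled, combined text."""
    posts = ('Author: ' + frame['author']
             + ' Title: ' + frame['title']
             + ' Text: ' + frame['selftext'].fillna(' '))
    return pd.DataFrame({'subreddit': frame['subreddit'], 'Posts': posts})

# Made-up rows standing in for a scraped dataframe.
toy = pd.DataFrame({
    'subreddit': ['investing', 'investing'],
    'author': ['user_a', 'user_b'],
    'selftext': ['body text', None],
    'title': ['First title', 'Second title'],
})

out = build_posts(toy)
print(out['Posts'].tolist())
```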

In [141]:
investing_df.head(3)

| | subreddit | Posts |
|---|---|---|
| 0 | investing | Author: HomeInvading Title: Help a young man o... |
| 1 | investing | Author: ocean-airseashell10 Title: Treasury bo... |
| 2 | investing | Author: ocean-airseashell10 Title: How to buy ... |


In [142]:
stockmarket_df.head(3)

| | subreddit | Posts |
|---|---|---|
| 0 | StockMarket | Author: zitrored Title: Looking for the next e... |
| 1 | StockMarket | Author: CompetitiveMission1 Title: China stock... |
| 2 | StockMarket | Author: jaltrading21 Title: Get ready for some... |


In [143]:
wallstreetbets_df.head(3)

| | subreddit | Posts |
|---|---|---|
| 0 | wallstreetbets | Author: Pro-Gambler99 Title: i wish i was indi... |
| 1 | wallstreetbets | Author: Fit_One4445 Title: Be back in 3 months... |
| 2 | wallstreetbets | Author: Plastic-Ad-2191 Title: INTRODUCING BRO... |


## Concatenate all 3 Dataframes

In [144]:
df = pd.concat([investing_df,stockmarket_df,wallstreetbets_df],ignore_index=True)

In [145]:
df.shape

(21487, 2)
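
The row count is the sum of the three sources (7995 + 7494 + 5998 = 21487). `ignore_index=True` is what gives the combined frame a fresh 0..n-1 index instead of repeating each source's own index, as this toy sketch shows (the frames `a` and `b` are made up):

```python
import pandas as pd

a = pd.DataFrame({'subreddit': ['investing', 'investing']})
b = pd.DataFrame({'subreddit': ['StockMarket', 'StockMarket']})

# Default behaviour keeps each frame's original index, so labels repeat.
kept = pd.concat([a, b])

# ignore_index=True renumbers the rows from 0.
reset = pd.concat([a, b], ignore_index=True)

print(kept.index.tolist())
print(reset.index.tolist())
```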

In [146]:
df['subreddit'].value_counts()

investing         7995
StockMarket       7494
wallstreetbets    5998
Name: subreddit, dtype: int64

In [147]:
df.head()

| | subreddit | Posts |
|---|---|---|
| 0 | investing | Author: HomeInvading Title: Help a young man o... |
| 1 | investing | Author: ocean-airseashell10 Title: Treasury bo... |
| 2 | investing | Author: ocean-airseashell10 Title: How to buy ... |
| 3 | investing | Author: iamjokingiamserious Title: Early Exerc... |
| 4 | investing | Author: jamesterryburke01 Title: Alternative I... |


---

# NLP Classifier

## Encode Subreddit Labels

Convert 'investing', 'StockMarket' and 'wallstreetbets' into integer class labels (a label encoding, not a one-hot encoding):
- 0 for investing
- 1 for StockMarket
- 2 for wallstreetbets

In [148]:
df['subreddit'].value_counts()

investing         7995
StockMarket       7494
wallstreetbets    5998
Name: subreddit, dtype: int64

In [149]:
df['subreddit']=df['subreddit'].map({'investing': 0, 'StockMarket': 1, 'wallstreetbets': 2})
df.head()

| | subreddit | Posts |
|---|---|---|
| 0 | 0 | Author: HomeInvading Title: Help a young man o... |
| 1 | 0 | Author: ocean-airseashell10 Title: Treasury bo... |
| 2 | 0 | Author: ocean-airseashell10 Title: How to buy ... |
| 3 | 0 | Author: iamjokingiamserious Title: Early Exerc... |
| 4 | 0 | Author: jamesterryburke01 Title: Alternative I... |
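
One caveat with `Series.map` and a dictionary: keys are matched exactly, so any value missing from the dictionary (including a differently-cased 'stockmarket') becomes NaN without raising an error. A quick check on made-up values, with `label_map`, `toy` and `encoded` as illustrative names:

```python
import pandas as pd

label_map = {'investing': 0, 'StockMarket': 1, 'wallstreetbets': 2}

# The last value is deliberately mis-cased to show what an unmapped key does.
toy = pd.Series(['investing', 'StockMarket', 'stockmarket'])
encoded = toy.map(label_map)

# Counting NaNs after mapping is a cheap way to catch unmapped labels.
print(encoded.isna().sum())
```

Running the same `isna().sum()` check on `df['subreddit']` after the mapping would confirm that every row received a label.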


## Set Up Feature and Target Vectors for Modelling

In [150]:
X = df['Posts']
y=df['subreddit']

In [152]:
y.value_counts(normalize=True)

0    0.372085
1    0.348769
2    0.279146
Name: subreddit, dtype: float64
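
These proportions also give a baseline to beat: a model that always predicts the largest class (0, investing) would be right about 37.2% of the time. A minimal sketch using the class counts shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Class counts from the value_counts() output above (0 = investing, 1 = StockMarket, 2 = wallstreetbets).
class_counts = pd.Series({0: 7995, 1: 7494, 2: 5998})

shares = class_counts / class_counts.sum()

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = shares.max()
print(round(baseline_accuracy, 6))
```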

In [153]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y, # keep the proportions of classes 0, 1 and 2 the same in train and test
                                                    random_state=42)
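
The effect of `stratify` can be checked by comparing class shares before and after the split. This sketch rebuilds only the label vector from the class counts above and uses a dummy feature array, so it is a stand-in for the real `X` and `y` rather than the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the real data.
y_toy = np.repeat([0, 1, 2], [7995, 7494, 5998])
X_toy = np.arange(len(y_toy))  # dummy features, one per label

_, _, y_tr, y_te = train_test_split(X_toy, y_toy,
                                    test_size=0.2,
                                    stratify=y_toy,
                                    random_state=42)

# Class shares in the full data, the training set and the test set.
full_share = np.bincount(y_toy) / len(y_toy)
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print(np.round(full_share, 3), np.round(train_share, 3), np.round(test_share, 3))
```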

## Pre-Processing

---