<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 Web APIs and NLP

_Authors: Joel Quek (SG)_

# Problem Statement

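Build an NLP model that predicts which subreddit a post came from: r/investing, r/stockmarket or r/wallstreetbets. The models in this notebook are trained on posts from r/investing and r/stockmarket.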

# Exploratory Data Analysis

## Import Libraries

In [2]:
#All libraries used in this project are listed here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

import re
from bs4 import BeautifulSoup 

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, make_scorer, recall_score, precision_score,accuracy_score

# Open Scraped Datasets

The Jupyter notebooks used for scraping are 'reddit-scrape.ipynb' and 'wallstreetbets-scrape.ipynb'.

In [6]:
investing_df = pd.read_csv('datasets/investing.csv')
stockmarket_df = pd.read_csv('datasets/stockmarket.csv')


## r/investing

In [7]:
investing_df.shape

(7995, 75)

In [8]:
investing_df['created_utc'].iloc[-1]

# GMT: Friday, July 8, 2022 9:18:46 AM

1657271926
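
The `# GMT` comment above can be reproduced in code instead of with an external converter. A minimal sketch using only the standard library; the variable names are illustrative and not part of the original notebook:

```python
from datetime import datetime, timezone

# Last `created_utc` value shown above (seconds since the Unix epoch).
last_created_utc = 1657271926

# Interpret the epoch seconds as a timezone-aware UTC datetime.
last_post_time = datetime.fromtimestamp(last_created_utc, tz=timezone.utc)

print(last_post_time.isoformat())
```

Passing `tz=timezone.utc` avoids silently converting to the local timezone of the machine running the notebook.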

In [9]:
investing_df=investing_df[['subreddit', 'author', 'selftext', 'title']]
investing_df.head()

Unnamed: 0,subreddit,author,selftext,title
0,investing,HomeInvading,"Hey guys, I’m a 22 year old male, I grew up wi...",Help a young man out would ya?
1,investing,ocean-airseashell10,[removed],Treasury bonds is it a good idea to buy
2,investing,ocean-airseashell10,[removed],How to buy treasury bonds? Is treasury’s direc...
3,investing,iamjokingiamserious,[removed],Early Exercise of Stock Options
4,investing,jamesterryburke01,Hello Redditors 👋 \n\nI work as a Investment C...,Alternative Investments -


## r/stockmarket

In [10]:
stockmarket_df.shape

(7494, 81)

In [11]:
stockmarket_df['created_utc'].iloc[-1]

# GMT: Wednesday, July 13, 2022 2:13:58 AM

1657678438

In [12]:
stockmarket_df=stockmarket_df[['subreddit', 'author', 'selftext', 'title']]
stockmarket_df.head()

Unnamed: 0,subreddit,author,selftext,title
0,StockMarket,zitrored,,Looking for the next exogenous event that take...
1,StockMarket,CompetitiveMission1,[Link to the full article (4 min read)](https:...,China stocks notch trillion-dollar gain on hop...
2,StockMarket,jaltrading21,,Get ready for some economic news and company e...
3,StockMarket,ShabbyShamble,,Market Recap! Bear Market Blues! Palantir (PLT...
4,StockMarket,PriceActionHelp,,Why it's not smart to rely on the RSI divergence


# Final Cleaning 

## Handling Missing Values

In [13]:
investing_df['selftext']=investing_df['selftext'].fillna(' ')
stockmarket_df['selftext']=stockmarket_df['selftext'].fillna(' ')


In [14]:
investing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7995 entries, 0 to 7994
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  7995 non-null   object
 1   author     7995 non-null   object
 2   selftext   7995 non-null   object
 3   title      7995 non-null   object
dtypes: object(4)
memory usage: 250.0+ KB


In [15]:
stockmarket_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7494 entries, 0 to 7493
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  7494 non-null   object
 1   author     7494 non-null   object
 2   selftext   7494 non-null   object
 3   title      7494 non-null   object
dtypes: object(4)
memory usage: 234.3+ KB


## Feature Engineering

Combine the 'author', 'title' and 'selftext' columns into a single text column, 'Posts', which the vectorizers will use as the only input feature.

In [16]:
investing_df['Posts']='Author: '+investing_df['author']+' Title: ' + investing_df['title']+' Text: '+investing_df['selftext']
stockmarket_df['Posts']='Author: '+stockmarket_df['author']+' Title: ' + stockmarket_df['title']+' Text: '+stockmarket_df['selftext']
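
The `fillna(' ')` step under Handling Missing Values matters for this concatenation: in pandas, adding a string to a missing value yields a missing value, so a post without `selftext` would lose its entire `Posts` entry. A small sketch on made-up data (the toy frame and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: the second post has no selftext.
toy = pd.DataFrame({
    "title": ["First post", "Second post"],
    "selftext": ["some body text", np.nan],
})

# Without filling, the missing selftext propagates into the combined column.
unfilled = "Title: " + toy["title"] + " Text: " + toy["selftext"]

# Filling with a single space first keeps the title for every row.
filled = "Title: " + toy["title"] + " Text: " + toy["selftext"].fillna(" ")

print(unfilled.isna().tolist())
print(filled.tolist())
```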


In [17]:
investing_df=investing_df[['subreddit','Posts']]
stockmarket_df=stockmarket_df[['subreddit','Posts']]


In [18]:
investing_df.head(3)

Unnamed: 0,subreddit,Posts
0,investing,Author: HomeInvading Title: Help a young man o...
1,investing,Author: ocean-airseashell10 Title: Treasury bo...
2,investing,Author: ocean-airseashell10 Title: How to buy ...


In [19]:
stockmarket_df.head(3)

Unnamed: 0,subreddit,Posts
0,StockMarket,Author: zitrored Title: Looking for the next e...
1,StockMarket,Author: CompetitiveMission1 Title: China stock...
2,StockMarket,Author: jaltrading21 Title: Get ready for some...


## Concatenate both Dataframes

In [20]:
df = pd.concat([investing_df,stockmarket_df],ignore_index=True)

In [21]:
df.shape

(15489, 2)

In [22]:
df['subreddit'].value_counts()

investing      7995
StockMarket    7494
Name: subreddit, dtype: int64

In [23]:
df.head()

Unnamed: 0,subreddit,Posts
0,investing,Author: HomeInvading Title: Help a young man o...
1,investing,Author: ocean-airseashell10 Title: Treasury bo...
2,investing,Author: ocean-airseashell10 Title: How to buy ...
3,investing,Author: iamjokingiamserious Title: Early Exerc...
4,investing,Author: jamesterryburke01 Title: Alternative I...


---

# NLP Classifier

## Encode Subreddit Labels

Convert 'investing' and 'StockMarket' into binary labels:
- 0 for investing
- 1 for StockMarket


In [24]:
df['subreddit'].value_counts()

investing      7995
StockMarket    7494
Name: subreddit, dtype: int64

In [25]:
df['subreddit']=df['subreddit'].map({'investing': 0, 'StockMarket': 1})
df.head()

Unnamed: 0,subreddit,Posts
0,0,Author: HomeInvading Title: Help a young man o...
1,0,Author: ocean-airseashell10 Title: Treasury bo...
2,0,Author: ocean-airseashell10 Title: How to buy ...
3,0,Author: iamjokingiamserious Title: Early Exerc...
4,0,Author: jamesterryburke01 Title: Alternative I...


## Set Up Target Vector for Modelling

In [26]:
X = df['Posts']
y=df['subreddit']

In [27]:
X[110]

"Author: Queengenademedicieth Title: R-ruby and O-Opera Medici Etherton Code CodeX Leonardo da' Vinci. Words R-ruby and O-Opera: Ro-yal, C-ro-wn, Th-ro-ne, Or-b, C-ro-ss, C-or-onation, Ro-se, Ge-or-ge, Ge-or-gia, Ro-lls Ro-yce, Mic-ro-soft, Or-ca, Or-egon, New Y-or-k, Fl-or-ida, Calif-or-nia. Text: [removed]"

In [28]:
y.value_counts(normalize=True)

0    0.516173
1    0.483827
Name: subreddit, dtype: float64

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y, # keep the same proportion of 0s and 1s in the train and test sets
                                                    random_state=42)
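
To illustrate what `stratify=y` does, here is a hedged sketch on synthetic labels with roughly the 52% / 48% balance seen above. The arrays `X_toy` and `y_toy` are placeholders, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the same class balance as above.
y_toy = np.array([0] * 52 + [1] * 48)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# With stratify, the share of class 1 is nearly identical in both splits.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the two shares can drift apart by chance, which makes train and test accuracy harder to compare.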

---

## Pre-Processing

Refer to Lecture 5.06

---

# Models

## `CountVectorizer and Naive Bayes GridSearchCV`


In [26]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [27]:
# Search over the following values of hyperparameters:
# Maximum number of features to keep: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in to be kept: 2, 3
# Maximum share of documents a token may appear in before it is dropped: 90%, 95%
# N-gram range: individual tokens only, or individual tokens and 2-grams.

pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
# GridSearchCV tries every combination of these values and keeps the best-scoring one
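
To get a feel for the cost of this search, the number of candidate combinations can be counted with scikit-learn's `ParameterGrid`. A minimal sketch, assuming the 5-fold cross-validation used in this notebook; `n_candidates`, `n_folds` and `n_fits` are illustrative names:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}

n_candidates = len(ParameterGrid(pipe_params))  # 4 * 2 * 2 * 2
n_folds = 5  # assumed: the cv value passed to GridSearchCV
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV also fits the winning combination once more on the full training set.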

In [28]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [29]:
gs.fit(X_train, y_train)

In [30]:
print(gs.best_score_)

0.7209264846502388


In [31]:
gs.score(X_train, y_train)

0.7375514486320717

In [32]:
gs.score(X_test, y_test)

0.7301484828921885

## `TfidfVectorizer and Naive Bayes GridSearchCV`



In [33]:
# Set up a pipeline with tf-idf vectorizer and multinomial naive bayes

pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

In [34]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 2000, 3000, 4000, 5000
# No stop words and english stop words
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1,1), (1,2)]
}

In [35]:
# Instantiate GridSearchCV.

gs_tvec = GridSearchCV(pipe_tvec, # what object are we optimizing?
                        param_grid = pipe_tvec_params, # what parameters values are we searching?
                        cv=5) # 5-fold cross-validation.

In [36]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train, y_train)

In [37]:
print(gs_tvec.best_score_)

0.7549029279360673


In [38]:
# Score model on training set.
gs_tvec.score(X_train, y_train)

0.7939633605035913

In [39]:
# Score model on testing set.
gs_tvec.score(X_test, y_test)

0.762104583602324

---

## `CountVectorizer and Logistic Regression GridSearchCV`

In [3]:
pipe_count_logreg = Pipeline([('cvec', CountVectorizer()),
                 ('lr', LogisticRegression(solver='lbfgs'))
                ])

In [30]:
cross_val_score(pipe_count_logreg, X_train, y_train, cv=5)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.79265833, 0.79782082, 0.80589185, 0.7905569 , 0.80104923])

In [32]:
# ii. Fit into model
pipe_count_logreg.fit(X_train, y_train)

# Training score
print(pipe_count_logreg.score(X_train, y_train))

# Test score
print(pipe_count_logreg.score(X_test, y_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9326930836897749
0.8134280180761781
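
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` messages mean the lbfgs solver hit its default iteration cap (`max_iter=100`) before converging. One common remedy, which the warning itself suggests, is to raise `max_iter`. A hedged sketch on a made-up corpus: the value 1000 and the toy documents are assumptions, and the scores reported in this notebook were not produced with this setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train.
toy_docs = [
    "long term index fund portfolio",
    "retirement account bond allocation",
    "dividend etf for passive investors",
    "market recap stocks fell today",
    "earnings news moved the market",
    "stocks rally after fed news",
]
toy_labels = [0, 0, 0, 1, 1, 1]

pipe_more_iter = Pipeline([
    ('cvec', CountVectorizer()),
    # max_iter raised from the default of 100; 1000 is an arbitrary choice.
    ('lr', LogisticRegression(solver='lbfgs', max_iter=1000)),
])

pipe_more_iter.fit(toy_docs, toy_labels)
print(pipe_more_iter.predict(["bond fund for retirement"]))
```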


In [34]:
pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [.8,.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
# The scorers dictionary makes GridSearchCV record precision, recall and accuracy for every candidate.
# refit='accuracy_score' selects the best candidate by accuracy and refits it on the full training set.
gs_count_logreg = GridSearchCV(pipe_count_logreg, param_grid=pipe_params, scoring=scorers, refit='accuracy_score', cv=5)
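
With a dictionary of scorers, the per-metric results are stored in `cv_results_` under keys of the form `mean_test_<scorer name>`, while `best_score_` reports only the metric named in `refit`. A minimal sketch on a made-up corpus; the toy data, the reduced grid and `cv=2` are assumptions chosen to keep the example small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; each class has enough rows for 2-fold CV.
toy_docs = [
    "index fund retirement portfolio", "bond allocation for retirement",
    "passive etf portfolio advice", "long term fund investing advice",
    "market recap stocks fell", "earnings news moved stocks",
    "stocks rally on market news", "market crash news today",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(max_iter=1000)),
])
toy_scorers = {
    'precision_score': make_scorer(precision_score, zero_division=0),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score),
}

toy_gs = GridSearchCV(toy_pipe,
                      param_grid={'cvec__ngram_range': [(1, 1), (1, 2)]},
                      scoring=toy_scorers, refit='accuracy_score', cv=2)
toy_gs.fit(toy_docs, toy_labels)

# One mean score per candidate, for each metric.
for name in toy_scorers:
    print(name, toy_gs.cv_results_[f'mean_test_{name}'])
```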

In [39]:
gs_count_logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [36]:
print(gs_count_logreg.best_score_)
gs_count_logreg.best_params_

0.7843598902288506


{'cvec__max_df': 0.8,
 'cvec__max_features': 3500,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 1)}

In [37]:
gs_count_logreg.score(X_train, y_train)


0.8747478008231782

In [38]:
gs_count_logreg.score(X_test, y_test)


0.7985797288573273

## `TfidfVectorizer and Logistic Regression GridSearchCV`

In [40]:
pipe_tfid_logreg = Pipeline([('tfid', TfidfVectorizer()),
                 ('lr', LogisticRegression(solver='lbfgs'))
                ])

In [41]:
cross_val_score(pipe_tfid_logreg, X_train, y_train, cv=5)


array([0.81040742, 0.81315577, 0.80347054, 0.8071025 , 0.80064568])

In [42]:
# ii. Fit into model
pipe_tfid_logreg.fit(X_train, y_train)

# Training score
print(pipe_tfid_logreg.score(X_train, y_train))

# Test score
print(pipe_tfid_logreg.score(X_test, y_test))

0.8931482527641029
0.8143963847643642


In [43]:
pipe2_params = {
    'tfid__max_features': [2500, 3000, 3500],
    'tfid__min_df': [1, 2],
    'tfid__max_df': [.8,.9, .95],
    'tfid__ngram_range': [(1,1), (1,2)]
}


gs_tfid_logreg = GridSearchCV(pipe_tfid_logreg,param_grid=pipe2_params,scoring=scorers,refit='accuracy_score', cv=5)

In [44]:
gs_tfid_logreg.fit(X_train, y_train)
print(gs_tfid_logreg.best_score_)
gs_tfid_logreg.best_params_

0.7757239911300119


{'tfid__max_df': 0.8,
 'tfid__max_features': 3500,
 'tfid__min_df': 2,
 'tfid__ngram_range': (1, 2)}

In [45]:
gs_tfid_logreg.score(X_train, y_train)

0.8256799289807119

In [46]:
gs_tfid_logreg.score(X_test, y_test)


0.7876049063912202

## `Random Forest`

# Model(s) Evaluation

## Confusion Matrix
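
A minimal sketch of how `confusion_matrix` (imported at the top) lays out its counts for the binary labels used here, with 0 = investing and 1 = StockMarket. The label lists below are made up for illustration and are not output from any model in this notebook. Treating class 1 as the positive class, `fp` counts Type 1 errors and `fn` counts Type 2 errors.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

# For binary labels, ravel() returns the cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```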

## ROC AUC
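
A minimal sketch of `roc_auc_score` (imported at the top) on made-up values. For a fitted pipeline, the scores would typically be the predicted probability of class 1, for example from `predict_proba(X_test)[:, 1]`; the numbers below are assumptions for illustration only.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted scores for class 1.
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked in the right order.
auc = roc_auc_score(y_true_toy, y_score_toy)
print(auc)
```

Here three of the four positive/negative pairs are ranked correctly, so the score is 0.75.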

## Error Analysis [Type 1 and Type 2 Errors]