# Project 3: Reddit Predictor

## Table Of Contents
- [Cleaning](#Cleaning)
- [Exploratory Data Analysis](#EDA)
- [Model](#Model)
- [Resources](#Resources)

# Executive Summary

## Problem Statement

Reddit aggregates some of the best content in the world. If you see something on Facebook, Twitter, or LinkedIn, there is a good chance it appeared on Reddit first. The site isn't just a hodgepodge of communities discussing whatever is on their minds; fairly tight moderation keeps things under control and running smoothly. However, imagine a scenario where the Reddit servers go down, or Reddit gets hacked, and people's message history gets disorganized. Does Reddit have a plan in place to recover? How would they mitigate this potential disaster? Based on a post's title and its references, can we predict which subreddit a post comes from using Natural Language Processing (NLP) and classification models? Yes.

Looking ahead, information management and cyberthreats are among the top risks corporations are concerned about. A breach can mean a huge interruption to core operations. Even the very largest organizations and governments are not immune, and those that are hit can only hope to contain the damage. It gets worse when you consider that computing power, AI, machine learning, and mobile device usage are starting to outpace the protections companies currently have in place, so the risk of disruption, internal or external, keeps growing. Even if the likelihood of a scenario like this is small, a series of such disruptions could put a dent in Reddit's popularity. Reddit needs a way to quickly reorient itself if its information becomes disorganized. That is how you survive in the future.

## Solution 

To address this problem we are going to build a classification model on top of NLP. Classification is the right tool because we want to determine which class a particular post belongs to; in this case, which of two subreddits. As a test case we use r/Politics and r/Stocks, after pulling down the data with Beautiful Soup and placing it in a dataframe. The two subreddits are implicitly related at the intersection of government policy and economics. NLP turns our everyday language into numbers, allowing the computer to interpret and codify the information. The objective of the classification model is to identify, based on the data we feed it, which category a post falls under.

The types of classification models include: 
- Logistic Regression
- Naive Bayes (Gaussian, Bernoulli, and Multinomial)
- Decision Trees
- Random Forests
- K-Nearest Neighbors
- Support Vector Machines

Each has its own way of identifying the class that the data belongs to. I will be using Logistic Regression, Multinomial Naive Bayes, and Random Forests (each is described in the modeling section below). Logistic Regression is the least complex in terms of what it needs to run, while Random Forest needs a lot and is the most computationally expensive of the three. After modeling, it turned out that the least complex model was about 95% accurate in determining which subreddit a post comes from.

## Methodology

The algorithm that I created strips the title text of unnecessary words and characters. I also used grid search to find the parameters that give each model its best score. Once the best parameters were found, I used them to report the top score of the model, so every model presented here is the best one within the parameters and boundaries I provided. I kept the parameter grids fairly basic so the models weren't too demanding for the CPU. The logic behind the parameter choices (this applies to all models) is that I wanted a lower bound on the number of documents a word must appear in and an upper bound on the share of documents it may appear in. I also wanted to limit the maximum number of features for the models. Although more features should theoretically increase performance, this is not necessarily the case for Random Forest, as it can decrease the diversity of the trees. Increasing the number of features would also make the models slower to run.
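
As a rough illustration of what those bounds do, the sketch below applies a minimum document count, a maximum document fraction, and a feature cap to a toy corpus in plain Python. The helper name `select_vocabulary` and the sample titles are made up for illustration; in the notebook itself this filtering is handled by the vectorizer's `min_df`, `max_df`, and `max_features` parameters.

```python
from collections import Counter

def select_vocabulary(docs, min_df=2, max_df=0.9, max_features=None):
    """Toy version of document-frequency filtering (hypothetical helper)."""
    n_docs = len(docs)
    doc_freq = Counter()    # number of documents containing each word
    term_freq = Counter()   # total occurrences of each word
    for doc in docs:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # Keep words that appear in at least min_df documents
    # and in at most max_df (a fraction) of all documents.
    kept = [w for w, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df]

    # Optionally keep only the most frequent words overall.
    kept.sort(key=lambda w: (-term_freq[w], w))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Made-up titles, for illustration only
docs = [
    "the market is up",
    "the house votes today",
    "the market is down today",
    "the senate votes",
]
vocab = select_vocabulary(docs, min_df=2, max_df=0.9)
print(vocab)
```

Here "the" is dropped because it appears in every title, and words that appear in only one title are dropped because they fall below the minimum.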

## Conclusion 

In conclusion, the least complex model, Logistic Regression with count vectorization, beat out all of the rest on both the training and test sets, reaching about 95% accuracy on the test set. If I had more time, I would have explored more parameters and allowed the models to run through more combinations to find the true optimal model for each.

In [1]:
# Load data collection libraries, modeling libraries, and plotting libraries. Each one of these libraries plays a role
# in extracting, analyzing and modeling the data. 

# Basic libraries for cleaning and Exploratory Data Analysis
import pandas as pd
import numpy  as np
import re
import string

# Natural Language Tool Kit Library for parsing words
from nltk.tokenize import RegexpTokenizer
from nltk.stem     import WordNetLemmatizer

# Matplotlib library for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# sklearn library for modeling
from sklearn.model_selection         import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer 
from sklearn.linear_model            import LogisticRegression
from sklearn.pipeline                import Pipeline
from sklearn.naive_bayes             import MultinomialNB
from sklearn.ensemble                import RandomForestClassifier
from sklearn.metrics                 import accuracy_score


### Load Data in

The data has been extracted from Reddit. I used the Beautiful Soup library to pull posts 25 at a time until I had at least 1,000 per subreddit. I did the extraction in a previous notebook and saved the results to CSV files, which I am reading into this notebook. Lastly, I will house them in dataframes so we can work with the data easily.

In [2]:
subreddit_1 = pd.read_csv("./Additional Resources/Subreddit_1_Politics",usecols=['Text','subreddit'])
subreddit_1.head()

Unnamed: 0,Text,subreddit
0,"i’m alan s. inouye, a registered lobbyist, and...",politics
1,subpoena for mueller report and documents appr...,politics
2,sen. elizabeth warren will unveil a bill to ma...,politics
3,"mueller testimony before congress ‘inevitable,...",politics
4,american meritocracy is a myth - recent scanda...,politics


In [3]:
subreddit_1.tail()

Unnamed: 0,Text,subreddit
1050,trump’s long history of pushing wild misinform...,politics
1051,white house quietly expects ‘unfavorable thing...,politics
1052,pork industry soon will have more power over m...,politics
1053,trump tweets he's 'the best thing that ever to...,politics
1054,wilbur ross refuses second invitation to testi...,politics


In [4]:
subreddit_2 = pd.read_csv("./Additional Resources/Subreddit_2_Stocks", usecols=['Text','subreddit'])
subreddit_2.head()

Unnamed: 0,Text,subreddit
0,rate my portfolio - r/stocks quarterly thread ...,stocks
1,"r/stocks daily discussion wednesday - apr 03, ...",stocks
2,"$750,000 to invest",stocks
3,what i learned this winter...,stocks
4,amazon's giant 'dystopian' delivery-drone blim...,stocks


In [5]:
subreddit_2.tail()

Unnamed: 0,Text,subreddit
1064,buying stocks. should i buy in my local stock ...,stocks
1065,unless lululemon is able to establish itself a...,stocks
1066,global stocks open week lower this morning as ...,stocks
1067,will the end of the mueller investigation resu...,stocks
1068,is now a good time to finally short wayfair(w)?,stocks


## Cleaning

Just viewing the top and bottom five rows of the dataframes for the two subreddits, I notice some things I want to get rid of: dollar signs, punctuation, website-specific text, digits, and so on. They are not needed to analyze the information. In fact, although we are analyzing the text, we do not need every single bit of it. As you will soon see, I barely use the full string (sentence), because we will focus on the words most pertinent to understanding where a post title comes from.

### Create a document term matrix

Now I will turn this into a document-term matrix to prepare it for EDA. This involves cleaning, tokenizing, and finally building the matrix. I will use regular expressions for the data cleaning; they look for patterns within the text that we can target and remove.

### Common data cleaning steps on all text:
- Make Text all lower case
- Remove punctuation
- Remove numerical values
- Remove common nonsensical text (e.g. \n, r/)
- Tokenize Text
- Remove stopwords

### Additional data cleaning steps after tokenization:
- Stemming/lemmatization
- Parts of speech tagging
- Create bi-grams or tri-grams (from n-grams)
- Deal with typos

I'll do a first round of cleaning to remove the things I noticed right off the bat (dollar signs, punctuation, website-specific text, digits, etc.). Text is also easier to handle if it is all lowercase, so I will take care of that, too. Once this round is done, I'll put the result back in a dataframe and look at it again to see if there is anything I missed.

Putting it in a dataframe is really beneficial because you can view all the text in one place. You may notice some things that you would otherwise have missed. For instance, removing text may leave a row with no text at all, which can happen if the row contained only digits. So for good measure I wrote code to remove any rows that have zero letters in them.
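
One caveat: because the first cleaning pass replaces punctuation with a space, a title made up only of digits and punctuation can come out as a run of spaces rather than an empty string, and a pure length check would not catch it. A minimal sketch of a letter-based check, using a hypothetical helper `has_letters` and made-up cleaned titles:

```python
def has_letters(text):
    """Return True if the string contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)

# Made-up examples of titles after cleaning
cleaned_titles = ["to invest", "", "   ", "what learned this winter"]

# Keep only the titles that still contain at least one letter
kept = [t for t in cleaned_titles if has_letters(t)]
print(kept)
```

With pandas, the same idea could be expressed as a boolean mask such as `df.Text.str.contains('[a-z]')`, assuming the text has already been lowercased.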

A second round of cleaning will be necessary if any unusual text remains. After the second round I will move on to other aspects of the project, like Exploratory Data Analysis!

#### Cleaning: Round 1 -- Reddit 1

In [6]:
# code inspired by Alice Zhao

def clean_text_round1(text):
    '''Make text lowercase, drop "r/" prefixes and bracketed text, and remove punctuation and digits'''
    text = text.lower()
    text = re.sub(r'r/', '', text)  # strip the "r/" subreddit prefix
    text = re.sub(r'\[.*?\]', '', text)  # text inside square brackets is removed
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation replaced with a space
    text = re.sub(r'\w*\d\w*', '', text)  # any token containing a digit is removed
    return text

round1 = lambda x: clean_text_round1(x)
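
To see what this first pass does to a single title, here is a self-contained sketch that copies the same substitutions into a standalone function. The name `demo_clean` and the sample title are made up for illustration; they are not part of the notebook's data.

```python
import re
import string

def demo_clean(text):
    """Standalone copy of the round-1 steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'r/', '', text)        # drop the "r/" prefix
    text = re.sub(r'\[.*?\]', '', text)   # drop anything in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation -> space
    text = re.sub(r'\w*\d\w*', '', text)  # drop tokens containing digits
    return text

# Hypothetical title, not taken from the dataset
sample = "r/Stocks Daily Discussion [Megathread] - Apr 03, 2019!"
tokens = demo_clean(sample).split()
print(tokens)
```

The cleaned string still contains runs of spaces where punctuation and digits used to be, which is why the sketch splits on whitespace before printing.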

In [7]:
# Put into a data frame to apply to original dataframe
data_clean_1 = pd.DataFrame(subreddit_1.Text.apply(round1))
data_clean_1.head()

Unnamed: 0,Text
0,i’m alan s inouye a registered lobbyist and...
1,subpoena for mueller report and documents appr...
2,sen elizabeth warren will unveil a bill to ma...
3,mueller testimony before congress ‘inevitable ...
4,american meritocracy is a myth recent scanda...


In [8]:
# Reapply to original dataframe
subreddit_1["Text"] = data_clean_1["Text"]
subreddit_1.head()

Unnamed: 0,Text,subreddit
0,i’m alan s inouye a registered lobbyist and...,politics
1,subpoena for mueller report and documents appr...,politics
2,sen elizabeth warren will unveil a bill to ma...,politics
3,mueller testimony before congress ‘inevitable ...,politics
4,american meritocracy is a myth recent scanda...,politics


In [9]:
subreddit_1.shape

(1055, 2)

#### Remove empty strings

In [10]:
# Remove any rows with 0 length
subreddit_1 = subreddit_1[subreddit_1.Text.str.len() > 0]

In [11]:
subreddit_1.shape

(1055, 2)

#### Cleaning: Round 1 -- Reddit 2

In [12]:
# Put into a data frame to apply to original dataframe
data_clean_2 = pd.DataFrame(subreddit_2.Text.apply(round1))
data_clean_2.head()

Unnamed: 0,Text
0,rate my portfolio stocks quarterly thread ma...
1,stocks daily discussion wednesday apr
2,to invest
3,what i learned this winter
4,amazon s giant dystopian delivery drone blim...


In [13]:
subreddit_2["Text"] = data_clean_2["Text"]
subreddit_2.head()

Unnamed: 0,Text,subreddit
0,rate my portfolio stocks quarterly thread ma...,stocks
1,stocks daily discussion wednesday apr,stocks
2,to invest,stocks
3,what i learned this winter,stocks
4,amazon s giant dystopian delivery drone blim...,stocks


In [14]:
subreddit_2.shape

(1069, 2)

#### Remove empty strings

In [15]:
# Remove any rows with 0 length
subreddit_2 = subreddit_2[subreddit_2.Text.str.len() > 0]

In [16]:
subreddit_2.shape

(1068, 2)

#### Cleaning: Round 2 -- Reddit 1
There are still some punctuation errors and nonsensical text that need to be corrected, so I will do a second round of cleaning to catch them. We need the text to be clean so that we can properly tokenize it and remove stop words.

In [17]:
# Code inspired by Alice Zhao & Noah C.
def clean_text_round2(text):
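    text = re.sub(r'[^a-z]', ' ', text)     # anything that is not a lowercase letter becomes a space
    text = re.sub(r'\n', ' ', text)         # newlines become spaces
    text = re.sub(r'\s[a-z]\s', ' ', text)  # remove single letters that stand alone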
    return text

round2 = lambda x: clean_text_round2(x)

In [18]:
# Put into a data frame to apply to original dataframe
data_clean_1 = pd.DataFrame(data_clean_1.Text.apply(round2))
data_clean_1.head()

Unnamed: 0,Text
0,i alan inouye registered lobbyist and so am...
1,subpoena for mueller report and documents appr...
2,sen elizabeth warren will unveil bill to make...
3,mueller testimony before congress inevitable ...
4,american meritocracy is myth recent scandals...


In [19]:
# Apply to original dataframe
subreddit_1['Text'] = data_clean_1['Text']
subreddit_1.head()

Unnamed: 0,Text,subreddit
0,i alan inouye registered lobbyist and so am...,politics
1,subpoena for mueller report and documents appr...,politics
2,sen elizabeth warren will unveil bill to make...,politics
3,mueller testimony before congress inevitable ...,politics
4,american meritocracy is myth recent scandals...,politics


#### Cleaning: Round 2 -- Reddit 2

In [20]:
# Put into a data frame to apply to original dataframe
data_clean_2 = pd.DataFrame(data_clean_2.Text.apply(round2))
data_clean_2.head()

Unnamed: 0,Text
0,rate my portfolio stocks quarterly thread ma...
1,stocks daily discussion wednesday apr
2,to invest
3,what learned this winter
4,amazon giant dystopian delivery drone blimp ...


In [21]:
subreddit_2['Text'] = data_clean_2['Text']
subreddit_2.head()

Unnamed: 0,Text,subreddit
0,rate my portfolio stocks quarterly thread ma...,stocks
1,stocks daily discussion wednesday apr,stocks
2,to invest,stocks
3,what learned this winter,stocks
4,amazon giant dystopian delivery drone blimp ...,stocks


In [22]:
# Import NLP tool kit
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 

In [23]:
# Inspired by Sklearn Documentation
# Lemmatize the data through a class
class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

cv = CountVectorizer(tokenizer=LemmaTokenizer(),
                    max_features = 1000,
                    stop_words = 'english')  

Note: I am not entirely sure the text is being tokenized properly at this step; it needs a second look.

In [24]:
# Fit the count vectorizer on each subreddit and transform the text
data_cv_subreddit_1 = cv.fit_transform(subreddit_1.Text)
# data_dtm_sub_1 = pd.DataFrame(data_cv_subreddit_1.toarray(), columns=cv.get_feature_names())
# data_dtm_sub_1.head()

In [25]:
data_cv_subreddit_2 = cv.fit_transform(subreddit_2.Text)
data_dtm_sub_2 = pd.DataFrame(data_cv_subreddit_2.toarray(), columns = cv.get_feature_names())
data_dtm_sub_2.head()

Unnamed: 0,aapl,able,absolutely,acb,accept,access,according,account,acquires,acquisition,...,xmas,xxii,yahoo,year,yes,yesterday,yield,youtube,zero,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## EDA

Exploratory data analysis is supposed to give a sense of what our data looks like and what we should know about it. It is arguably one of the most important parts of the workflow, and it should surface interesting information that we can see directly. The number one thing I notice is that there is no overlap in the top 30 words between the two subreddits, which makes it easier for the models to predict where a post comes from.

#### Subreddit 1

In [26]:
# code originally written by boom
# Function to count words
def word_counter(title_df, stop_list=None):

    # Count Vectorize, keeping only the 30 most frequent words
    cvec = CountVectorizer(stop_words = stop_list, max_features=30)

    # Transform the corpus
    X_text = cvec.fit_transform(title_df['Text'])

    # Convert the sparse matrix to a dataframe with one column per word
    X_text = pd.DataFrame(X_text.toarray(), columns= cvec.get_feature_names())

    # Sum each column to get word counts, largest first
    word_counts = X_text.sum().sort_values(ascending=False)
    return word_counts

In [27]:
word_counter(subreddit_1, 'english')

trump         383
house         129
mueller       107
report         92
says           90
democrats      62
white          61
border         58
new            53
security       48
gop            47
health         43
biden          42
care           40
donald         39
subpoena       38
puerto         37
rico           34
campaign       33
sanders        32
subpoenas      31
congress       30
election       30
million        30
president      29
democratic     28
committee      27
clearance      26
vote           26
bernie         26
dtype: int64

#### Subreddit 2

In [29]:
word_counter(subreddit_2, 'english')

stocks        156
stock         121
market         60
today          47
thoughts       47
buy            42
news           41
mar            38
discussion     36
lyft           35
trading        33
daily          32
good           30
picks          27
price          27
ipo            27
invest         27
company        25
pre            25
buying         25
apple          25
shares         24
earnings       24
does           24
boeing         24
week           22
think          21
ladybaybee     21
global         20
just           20
dtype: int64
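
A quick way to check the overlap claim is with set intersection. The sketch below hard-codes the ten most frequent words from each of the two listings above; in the notebook the same check could be run on the index of each `word_counter` result.

```python
# Top ten words copied from the two word-count listings above
politics_top = ["trump", "house", "mueller", "report", "says",
                "democrats", "white", "border", "new", "security"]
stocks_top = ["stocks", "stock", "market", "today", "thoughts",
              "buy", "news", "mar", "discussion", "lyft"]

# Words that appear in both lists
shared = set(politics_top) & set(stocks_top)
print(f"Shared words: {sorted(shared)}")
```

Note that "new" and "news" are different tokens, so they do not count as an overlap.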

### Combine DataFrame into a Corpus

I am combining the dataframes because we need a single corpus to analyze, train-test split, and model in order to predict where the documents come from. So I am going to combine the two corpora using concat; both are already cleaned. The computer isn't able to read text the way we do. Since the words will be vectorized into numbers, we need to do the same for the target (the subreddit). The targets are Politics and Stocks, and they will be encoded as 0 and 1, respectively.

In [30]:
# Use concat to bring both the dataframes together.
df = pd.concat([subreddit_1,subreddit_2], ignore_index = True)

In [31]:
df.head()

Unnamed: 0,Text,subreddit
0,i alan inouye registered lobbyist and so am...,politics
1,subpoena for mueller report and documents appr...,politics
2,sen elizabeth warren will unveil bill to make...,politics
3,mueller testimony before congress inevitable ...,politics
4,american meritocracy is myth recent scandals...,politics


In [32]:
df.tail()

Unnamed: 0,Text,subreddit
2118,buying stocks should buy in my local stock ex...,stocks
2119,unless lululemon is able to establish itself a...,stocks
2120,global stocks open week lower this morning as ...,stocks
2121,will the end of the mueller investigation resu...,stocks
2122,is now good time to finally short wayfair,stocks


### Binarize subreddits

In [33]:
# Binarize the subreddits to prepare our y value
df['binarize'] = df['subreddit'].map({
    'politics':0,
    'stocks':1})

In [34]:
df.head()

Unnamed: 0,Text,subreddit,binarize
0,i alan inouye registered lobbyist and so am...,politics,0
1,subpoena for mueller report and documents appr...,politics,0
2,sen elizabeth warren will unveil bill to make...,politics,0
3,mueller testimony before congress inevitable ...,politics,0
4,american meritocracy is myth recent scandals...,politics,0


In [35]:
df.shape

(2123, 3)

#### Bag of Words Model
At this point I should have a bag-of-words model, which is basically my corpus with word order ignored. It's a fairly simple representation of my data, but it's a good start. Now that the text has been cleaned and tokenized, we need to put it into a matrix so the computer can store and read the data.
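
To make the idea concrete, here is a minimal plain-Python sketch of a document-term matrix built from two made-up titles. It only illustrates the concept; the notebook itself uses `CountVectorizer` for this step.

```python
from collections import Counter

# Two made-up, already-cleaned titles
docs = ["mueller report subpoena", "stock market stock picks"]

# Vocabulary: every distinct word, in sorted order (one column per word)
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document; each cell is how often that word occurs in it
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and the order of the words within a title is lost, which is what "bag of words" refers to.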

##### Use count vectorizer to put into a matrix

In [36]:
# Code inspired by Alice Zhao
# Convert data into a countvector
cv = CountVectorizer(stop_words='english',
                     tokenizer = LemmaTokenizer(),
                    max_features = 1000,
                    ngram_range = (1,2))
data_cv = cv.fit_transform(df.Text)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
data_dtm.shape

(2123, 1000)

##### Show the difference between a corpus and a Document Term Matrix
One has all of the information in one string and outlines what subreddit it comes from and its binarized version. The other breaks out all words into a matrix showing what words are in each document.

In [37]:
df.head()

Unnamed: 0,Text,subreddit,binarize
0,i alan inouye registered lobbyist and so am...,politics,0
1,subpoena for mueller report and documents appr...,politics,0
2,sen elizabeth warren will unveil bill to make...,politics,0
3,mueller testimony before congress inevitable ...,politics,0
4,american meritocracy is myth recent scandals...,politics,0


In [38]:
data_dtm.head()

Unnamed: 0,able,abortion,acb,access,account,accuse,accused,acquisition,act,action,...,world,worse,worth,wrong,wrongly,wrongly claim,yang,year,yield,york
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Model

I will start with a baseline model to determine the hurdle rate for my other models going forward. I want to create 3 models with 2 variations each (6 in total). The first will be a simple Logistic Regression, the second a Multinomial Naive Bayes model, and the third a Random Forest model. My hypothesis is that the Random Forest model will do the best, since bagging and the diversity of its trees are supposed to reduce the high variance seen in individual models.

Below are the definitions of the models:

- **Logistic Regression:** "Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function."
<br>
<br>
- **Multinomial Naive Bayes:** "Naive Bayes algorithm based on Bayes’ theorem with the assumption of independence between every pair of features. Naive Bayes classifiers work well in many real-world situations such as document classification and spam filtering."
<br>
<br>
- **Random Forests:** "Random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of datasets and uses average to improve the predictive accuracy of the model and controls over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement."

I chose these models because I wanted to see what happens as we go from a simple logistic regression to a more complex ensemble model like a random forest. My assumption is that the more complex model, Random Forest, will produce a better score because it trains many trees on different resamples of the data, chooses the classification by majority vote, and also controls for overfitting.

The two variations I will use are count vectorizing and TF-IDF. Count vectorizing does not distinguish between common and uncommon words; it simply counts the number of times a word appears in a document. TF-IDF does make that distinction and gives each word a weight, so the two should produce different scores. Since TF-IDF takes rare and common words into account, I am assuming it will give a better score.
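
The difference can be sketched by hand on a toy corpus. The snippet below weights each count by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is my understanding of the default form used by scikit-learn's `TfidfVectorizer`; the final row normalization is left out, and the three titles are made up.

```python
import math
from collections import Counter

# Made-up titles: "trump" is in every one, the other words are rare
docs = [
    "trump report",
    "trump subpoena",
    "trump border",
]
n_docs = len(docs)
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency (assumed form, see the text above)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

# Compare raw counts with TF-IDF weights for the first document
counts = Counter(tokenized[0])
tfidf = {word: counts[word] * idf(word) for word in counts}
print(dict(counts))
print(tfidf)
```

Both words have a raw count of 1 in the first title, but the rarer word "report" ends up with a larger TF-IDF weight than "trump", which appears in every title.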

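The grid search parameters used below are:
- `max_features`: keep only the N most frequent terms in the corpus.
- `min_df`: ignore terms that appear in fewer than this many documents.
- `max_df`: ignore terms that appear in more than this fraction of the documents.
- `ngram_range`: the sizes of word sequences used as features; `(1,1)` is single words, `(1,2)` is single words and word pairs, and `(3,3)` is three-word sequences only.
- `penalty`: the type of regularization (`l1` or `l2`) applied to Logistic Regression.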

In [39]:
#check to see if we have unbalanced classes
df.binarize.value_counts(normalize = True)
# The baseline Accuracy score is 50%

1    0.503062
0    0.496938
Name: binarize, dtype: float64
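
The 50% baseline noted in the comment above can be reproduced from the row counts of the two cleaned dataframes. This is just the arithmetic behind `value_counts(normalize = True)`:

```python
# Row counts after cleaning, taken from the shapes reported earlier
n_politics = 1055
n_stocks = 1068
total = n_politics + n_stocks

# A model that always predicts the larger class is right this often
baseline = max(n_politics, n_stocks) / total
print(f"{baseline:.6f}")
```

Any model worth keeping has to beat this majority-class accuracy on unseen data.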

#### Split your data

We use train-test split to separate the data. We need a training set to fit the predictive model, and then unseen test data to evaluate it.

In [40]:
# Generate the X and y target
X = df['Text']
y = df['binarize']

In [41]:
# do a train test split now
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   stratify = y,
                                                   test_size =0.33, 
                                                   random_state = 42)
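
As a sanity check on the split, the sizes of the two sets can be worked out by hand. The sketch assumes that the test set size is rounded up and that the remaining rows go to training, which is how I understand `train_test_split` to behave when only `test_size` is given.

```python
import math

n_samples = 2123   # rows in the combined dataframe
test_size = 0.33

# Assumption: the test set is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

These sizes line up with the accuracy scores reported further down: for example, a test accuracy of 0.9529 corresponds to 668 of 701 titles classified correctly. Passing `stratify = y` keeps the share of each subreddit roughly the same in both sets.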

##### Logistic Regression and Countvectorizing w/ Pipeline

In [42]:
# Create a pipeline
pipe_1 = Pipeline([
    ('cvec', CountVectorizer()),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('lr', LogisticRegression(solver='liblinear'))
])

In [43]:
# Find the parameters, gridsearch them, use the best parameters to generate a score
pipe_params_1 = {
    'lr__penalty': ['l1','l2'],
    'cvec__max_features': [100,300, 500],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.5,.9],
    'cvec__ngram_range': [(1,1),(1,2),(3,3)]
}

# Gridsearch to find the best parameters and fit to training data
gs_1 = GridSearchCV(pipe_1, param_grid=pipe_params_1,
                  cv=3, 
                  verbose = 1,
                  n_jobs=2)
gs_1.fit(X_train, y_train)
best_1 = gs_1.best_estimator_
best_1.fit(X_train,y_train)
y_test_preds_1 = best_1.predict(X_test)

Fitting 3 folds for each of 72 candidates, totalling 216 fits


[Parallel(n_jobs=2)]: Done  65 tasks      | elapsed:    7.1s
[Parallel(n_jobs=2)]: Done 213 out of 216 | elapsed:   15.1s remaining:    0.1s
[Parallel(n_jobs=2)]: Done 216 out of 216 | elapsed:   15.2s finished


In [44]:
# Develop a score and print
y_train_preds_1 = best_1.predict(X_train)
print(accuracy_score(y_train, y_train_preds_1))
print(accuracy_score(y_test,y_test_preds_1))
dfparams = pd.DataFrame(gs_1.best_params_)
dfparams = dfparams.drop(index = 0).T
dfparams = dfparams.rename(index=str, columns={1: "Best Params"})
dfparams

0.9781997187060478
0.9529243937232525


Unnamed: 0,Best Params
cvec__max_df,0.5
cvec__max_features,500
cvec__min_df,2
cvec__ngram_range,1
lr__penalty,l2
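
Note that `cvec__ngram_range` shows up as a single number in the table above even though the grid searched over tuples. That looks like a side effect of building a DataFrame from a dictionary that contains a tuple: the tuple is spread over two rows and only the second row is kept. A simpler way to inspect the winning settings is sketched below, with a hand-written dictionary standing in for `gs_1.best_params_` (the `(1, 1)` value is inferred from the table and the grid, not read from the run).

```python
# Stand-in for gs_1.best_params_; values copied or inferred from the table above
best_params = {
    "cvec__max_df": 0.5,
    "cvec__max_features": 500,
    "cvec__min_df": 2,
    "cvec__ngram_range": (1, 1),   # inferred: the table shows only one element
    "lr__penalty": "l2",
}

# Print each setting on its own line, keeping tuples intact
lines = [f"{name}: {value!r}" for name, value in sorted(best_params.items())]
print("\n".join(lines))
```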


##### Logistic Regression with TF-IDF w/ Pipeline

In [45]:
pipe_2 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    # liblinear supports both the 'l1' and 'l2' penalties searched below
    ('lr', LogisticRegression(solver='liblinear'))
])

In [46]:
# Find the parameters, gridsearch them, use the best features to generate a score
pipe_params_2 = {
    'lr__penalty': ['l1','l2'],
    'tfidf__max_features': [100,200,300, 400, 500],
    'tfidf__min_df': [2,3],
    'tfidf__max_df': [.9,.95],
    'tfidf__ngram_range': [(1,1),(1,2),(3,3)]
}

# Gridsearch to find the best parameters and fit to training data
gs_2 = GridSearchCV(pipe_2, 
                    param_grid=pipe_params_2, 
                    cv=3, 
                    verbose = 1, 
                    n_jobs=2)
gs_2.fit(X_train, y_train)
best_2 = gs_2.best_estimator_
best_2.fit(X_train,y_train)
y_test_preds_2 = best_2.predict(X_test)

Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=2)]: Done  59 tasks      | elapsed:    6.7s
[Parallel(n_jobs=2)]: Done 357 out of 360 | elapsed:   22.7s remaining:    0.1s
[Parallel(n_jobs=2)]: Done 360 out of 360 | elapsed:   22.9s finished


In [47]:
# Develop a score and print
y_train_preds_2 = best_2.predict(X_train)
print(accuracy_score(y_train, y_train_preds_2))
print(accuracy_score(y_test,y_test_preds_2))
dfparams_2 = pd.DataFrame(gs_2.best_params_)
dfparams_2 = dfparams_2.drop(index = 0).T
dfparams_2 = dfparams_2.rename(index=str, columns={1: "Best Params"})
dfparams_2

0.9690576652601969
0.9500713266761769


Unnamed: 0,Best Params
lr__penalty,l2
tfidf__max_df,0.9
tfidf__max_features,500
tfidf__min_df,3
tfidf__ngram_range,1


##### Multinomial Naive Bayes and Countvectorizing w/Pipeline

In [48]:
# Code Inspired by Siraj Raval
pipe_3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('mnb', MultinomialNB())
])

In [49]:
# Find the parameters, gridsearch them, use the best features to generate a score
pipe_params_3 = {
    'cvec__max_features': [100,500],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.9,.95],
    'cvec__ngram_range': [(1,1),(1,2),(3,3)]
}

# Gridsearch to find the best parameters and fit to training data
gs_3 = GridSearchCV(pipe_3, 
                   param_grid=pipe_params_3, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = 2)

gs_3.fit(X_train, y_train)
best_3 = gs_3.best_estimator_
best_3.fit(X_train,y_train)
y_test_preds_3 = best_3.predict(X_test)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


[Parallel(n_jobs=2)]: Done  59 tasks      | elapsed:    7.3s
[Parallel(n_jobs=2)]: Done  69 out of  72 | elapsed:    8.3s remaining:    0.3s
[Parallel(n_jobs=2)]: Done  72 out of  72 | elapsed:    8.5s finished


In [50]:
# Develop a score and print
y_train_preds_3 = best_3.predict(X_train)
print(accuracy_score(y_train, y_train_preds_3))
print(accuracy_score(y_test,y_test_preds_3))
dfparams_3 = pd.DataFrame(gs_3.best_params_)
dfparams_3 = dfparams_3.drop(index = 0).T
dfparams_3 = dfparams_3.rename(index=str, columns={1: "Best Params"})
dfparams_3

0.9648382559774965
0.9472182596291013


Unnamed: 0,Best Params
cvec__max_df,0.9
cvec__max_features,500.0
cvec__min_df,3.0
cvec__ngram_range,1.0


##### Multinomial Naive Bayes and TF-IDF w/Pipeline

In [51]:
# Code Inspired by Siraj Raval
pipe_4 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('mnb', MultinomialNB())
])

In [52]:
# Find the parameters, gridsearch them, use the best features to generate a score
pipe_params_4 = {
    'tfidf__max_features': [100,200,300, 400, 500],
    'tfidf__min_df': [2,3],
    'tfidf__max_df': [.9,.95],
    'tfidf__ngram_range': [(1,1),(1,2),(3,3)]
}

In [53]:
# Gridsearch to find the best parameters and fit to training data
gs_4 = GridSearchCV(pipe_4, 
                   param_grid=pipe_params_4, 
                   cv = 3,
                   verbose = 1,
                   n_jobs = 2)

gs_4.fit(X_train, y_train)
best_4 = gs_4.best_estimator_
best_4.fit(X_train,y_train)
y_test_preds_4 = best_4.predict(X_test)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=2)]: Done  50 tasks      | elapsed:    7.4s
[Parallel(n_jobs=2)]: Done 180 out of 180 | elapsed:   13.8s finished


In [54]:
# Develop a score and print
y_train_preds_4 = best_4.predict(X_train)
print(accuracy_score(y_train, y_train_preds_4))
print(accuracy_score(y_test,y_test_preds_4))
dfparams_4 = pd.DataFrame(gs_4.best_params_)
dfparams_4 = dfparams_4.drop(index = 0).T
dfparams_4 = dfparams_4.rename(index=str, columns={1: "Best Params"})
dfparams_4

0.9690576652601969
0.9443651925820257


Unnamed: 0,Best Params
tfidf__max_df,0.9
tfidf__max_features,500.0
tfidf__min_df,3.0
tfidf__ngram_range,1.0


##### Random Forests and Countvectorizing w/Pipeline

In [55]:
pipe_5 = Pipeline([
    ('cvec', CountVectorizer()),
    ('rfc', RandomForestClassifier())])

In [56]:
# Find the parameters, gridsearch them, use the best features to generate a score
pipe_params_5 = [{
    'cvec__max_features': [300, 400, 500],
    'cvec__min_df': [2,3],
    'cvec__max_df': [.9],
    'cvec__ngram_range': [(1,1),(1,2)],
    'rfc__bootstrap': [False, True],
    'rfc__n_estimators': [100, 110, 120],
    'rfc__max_features': [.5, .6, .7],
    'rfc__min_samples_leaf': [10,12, 14],
    'rfc__min_samples_split':[3,5,7]
}]

In [57]:
# Random forest adds more hyperparameters, so count the fits before running the search
num = 1
for values in pipe_params_5[0].values():
    num *= len(values)  # candidates = product of the option counts
print(f'Fits: {num*3}')  # 3 = number of cross-validation folds

Fits: 5832


In [58]:
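# Grid search over the pipeline parameters with 3-fold cross-validation, using all cores
gs_5 = GridSearchCV(pipe_5,
                    param_grid=pipe_params_5,
                    cv=3,
                    verbose=1,
                    n_jobs=-1)

gs_5.fit(X_train, y_train)
best_5 = gs_5.best_estimator_
# best_estimator_ is already refit (refit=True by default); fitting again retrains the
# forest and, with no random_state set, can yield a slightly different model
best_5.fit(X_train, y_train)
y_test_preds_5 = best_5.predict(X_test)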

Fitting 3 folds for each of 1944 candidates, totalling 5832 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   18.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   40.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  4.4min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  8.2min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed: 10.5min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed: 13.3min
[Parallel(n_jobs=-1)]: Done 5832 out of 5832 | elapsed: 15.8min finished


In [59]:
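# Score the tuned pipeline on the training and test sets
y_train_preds_5 = best_5.predict(X_train)
print(accuracy_score(y_train, y_train_preds_5))
print(accuracy_score(y_test, y_test_preds_5))
# Tabulate the best parameters. ngram_range is a 2-tuple, so the DataFrame has two rows;
# keeping only row 1 means the table shows just the upper bound of the n-gram range.
dfparams_5 = pd.DataFrame(gs_5.best_params_)
dfparams_5 = dfparams_5.drop(index=0).T
dfparams_5 = dfparams_5.rename(index=str, columns={1: "Best Params"})
dfparams_5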

0.8523206751054853
0.8601997146932953


| Parameter | Best Params |
|---|---|
| `cvec__max_df` | 0.9 |
| `cvec__max_features` | 500 |
| `cvec__min_df` | 2 |
| `cvec__ngram_range` | 2 |
| `rfc__bootstrap` | False |
| `rfc__max_features` | 0.5 |
| `rfc__min_samples_leaf` | 10 |
| `rfc__min_samples_split` | 5 |
| `rfc__n_estimators` | 120 |


##### Random Forests and TF-IDF w/Pipeline

In [60]:
pipe_6 = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rfc', RandomForestClassifier())])

In [61]:
# Hyperparameter grid to search; each key is prefixed with its pipeline step name
pipe_params_6 = [{
    'tfidf__max_features': [100, 300, 500],
    'tfidf__min_df': [2,3],
    'tfidf__max_df': [.5,.9],
    'tfidf__ngram_range': [(1,1),(1,2)],
    'rfc__bootstrap': [False, True],
    'rfc__n_estimators': [100, 110, 120],
    'rfc__max_features': [.5, .6, .7],
    'rfc__min_samples_leaf': [10,12, 14],
    'rfc__min_samples_split':[3,5,7]
}]

In [62]:
# Random forest adds more hyperparameters, so count the fits before running the search
num = 1
for values in pipe_params_6[0].values():
    num *= len(values)  # candidates = product of the option counts
print(f'Fits: {num*3}')  # 3 = number of cross-validation folds

Fits: 11664


In [63]:
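# Grid search over the pipeline parameters with 3-fold cross-validation, using all cores
gs_6 = GridSearchCV(pipe_6,
                    param_grid=pipe_params_6,
                    cv=3,
                    verbose=1,
                    n_jobs=-1)

gs_6.fit(X_train, y_train)
best_6 = gs_6.best_estimator_
# best_estimator_ is already refit (refit=True by default); fitting again retrains the
# forest and, with no random_state set, can yield a slightly different model
best_6.fit(X_train, y_train)
y_test_preds_6 = best_6.predict(X_test)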

Fitting 3 folds for each of 3888 candidates, totalling 11664 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   18.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   47.0s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 26.7min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 27.4min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 28.3min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 30.2min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 32.3min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed: 35.0min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed: 38.3min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed: 41.3min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed: 43.9min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed: 46.7min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed: 49.8min
[Parallel(n_jobs=-1)]: Done 11234 tasks      | elapsed: 53.5min
[Parallel(n_jobs=-1)]: Done 11664 out of 11664 | elapsed: 

In [64]:
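# Score the tuned pipeline on the training and test sets
y_train_preds_6 = best_6.predict(X_train)
print(accuracy_score(y_train, y_train_preds_6))
print(accuracy_score(y_test, y_test_preds_6))
# Tabulate the best parameters. ngram_range is a 2-tuple, so the DataFrame has two rows;
# keeping only row 1 means the table shows just the upper bound of the n-gram range.
dfparams_6 = pd.DataFrame(gs_6.best_params_)
dfparams_6 = dfparams_6.drop(index=0).T
dfparams_6 = dfparams_6.rename(index=str, columns={1: "Best Params"})
dfparams_6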

0.8839662447257384
0.8559201141226819


| Parameter | Best Params |
|---|---|
| `rfc__bootstrap` | False |
| `rfc__max_features` | 0.5 |
| `rfc__min_samples_leaf` | 10 |
| `rfc__min_samples_split` | 7 |
| `rfc__n_estimators` | 110 |
| `tfidf__max_df` | 0.9 |
| `tfidf__max_features` | 100 |
| `tfidf__min_df` | 3 |
| `tfidf__ngram_range` | 1 |


#### Summary Of Stats

In [65]:
# Print a summary of train and test accuracy for every model
print("Logistic Regression: Countvectorizing and TFIDF")
print()
print("CountVectorizer")
print(f'Grid Score Train: {gs_1.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_1.score(X_test, y_test)}')
print()
print("TF-IDF")
print(f'Grid Score Train: {gs_2.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_2.score(X_test, y_test)}')
print()
print("Multinomial Naive Bayes: Countvectorizing and TFIDF")
print()
print("CountVectorizer")
print(f'Grid Score Train: {gs_3.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_3.score(X_test, y_test)}')
print()
print("TF-IDF")
print(f'Grid Score Train: {gs_4.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_4.score(X_test, y_test)}')
print()
print("Random Forest: Countvectorizing and TFIDF")
print()
print("CountVectorizer")
print(f'Grid Score Train: {gs_5.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_5.score(X_test, y_test)}')
print()
print("TF-IDF")
print(f'Grid Score Train: {gs_6.score(X_train,y_train)}')
print(f'Grid Score Test: {gs_6.score(X_test, y_test)}')
print()

# I would like to clean this up in the future. Maybe create a for loop that can print for me

Logistic Regression: Countvectorizing and TFIDF

CountVectorizer
Grid Score Train: 0.9781997187060478
Grid Score Test: 0.9529243937232525

TF-IDF
Grid Score Train: 0.9690576652601969
Grid Score Test: 0.9500713266761769

Multinomial Naive Bayes: Countvectorizing and TFIDF

CountVectorizer
Grid Score Train: 0.9648382559774965
Grid Score Test: 0.9472182596291013

TF-IDF
Grid Score Train: 0.9690576652601969
Grid Score Test: 0.9443651925820257

Random Forest: Countvectorizing and TFIDF

CountVectorizer
Grid Score Train: 0.8523206751054853
Grid Score Test: 0.8601997146932953

TF-IDF
Grid Score Train: 0.8839662447257384
Grid Score Test: 0.8559201141226819
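
The summary cell above notes that a loop could replace the repeated print statements. One possible sketch follows. The function name `print_summary` and the `_ConstantScore` stand-in are hypothetical; in the notebook the dictionary would hold the fitted `gs_1` through `gs_6` objects, and their `score(X, y)` method is the only thing the loop relies on.

```python
def print_summary(groups, X_train, y_train, X_test, y_test):
    """Print train/test scores for fitted searches grouped by model family."""
    lines = []
    for model_name, searches in groups.items():
        lines.append(f'{model_name}: Countvectorizing and TFIDF')
        lines.append('')
        for vectorizer_name, search in searches.items():
            lines.append(vectorizer_name)
            lines.append(f'Grid Score Train: {search.score(X_train, y_train)}')
            lines.append(f'Grid Score Test: {search.score(X_test, y_test)}')
            lines.append('')
    print('\n'.join(lines))
    return lines


class _ConstantScore:
    """Stand-in for a fitted GridSearchCV so the sketch runs on its own."""

    def __init__(self, value):
        self.value = value

    def score(self, X, y):
        # A real search would evaluate its best estimator on (X, y)
        return self.value


# In the notebook this might be:
# {'Logistic Regression': {'CountVectorizer': gs_1, 'TF-IDF': gs_2}, ...}
demo_groups = {
    'Logistic Regression': {'CountVectorizer': _ConstantScore(0.5),
                            'TF-IDF': _ConstantScore(0.25)},
}
summary_lines = print_summary(demo_groups, None, None, None, None)
```

Keeping the models in a nested dictionary means a new model family only needs one new entry rather than another block of print calls.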



## Resources

1. http://localhost:8888/notebooks/DSI%20-%20Nash/GALessons/5_Week/5.06-lesson-nlp_ii/introduction-to-nlp.ipynb
2. http://localhost:8888/notebooks/DSI%20-%20Nash/GALessons/5_Week/5.07-lesson-naive_bayes/starter-code.ipynb
3. https://www.youtube.com/watch?v=iQ1bfDMCv_c -- Most of the credit goes to Alice Zhao on YouTube, who walks through Natural Language Processing from beginning to end.
4. https://www.analyticsindiamag.com/7-types-classification-algorithms/