# EDA and Modeling


This notebook will step through the collected data and prepare it for modeling.

I attempted to write a function that took a dirty string in and returned a clean, tokenized string. It worked on a single string, but I could not get it to work in a dataframe `.map()` context.

Since I have to pull the string out of the dataframe to put it into a matrix for vectorizing, I'll just do the cleaning then.
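For reference, one possible reason a single-string cleaner fails under `.map()` is a missing value: pandas stores it as a non-string (NaN or None), and `re.findall` only accepts strings. This is an assumption about the cause, not something confirmed in this notebook. A minimal sketch, where `clean_string` and `demo_df` are made-up names for illustration:

```python
import re
import pandas as pd

def clean_string(value):
    """Return the alphabetic tokens of one string, joined by spaces."""
    # Missing values are not strings, and re.findall rejects them, so guard first
    if not isinstance(value, str):
        return ''
    return ' '.join(re.findall(r'\b[^\d\W]+\b', value))

# Hypothetical sample data, including a missing title
demo_df = pd.DataFrame({'title': ['my first bike!', None, 'top 10 goals of the week']})
demo_df['clean_title'] = demo_df['title'].map(clean_string)
print(demo_df['clean_title'].tolist())
```

With a guard like this, the cleaning could happen inside the dataframe instead of at vectorizing time, if that ever becomes more convenient.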


### Contents
- [Data Load](#Data-Load)
- [r/motorcycle vs. r/soccer](#model-2)
- [r/soccer vs. r/MLS](#model-1)
- [r/MLS vs. r/SoundersFC](#model-3)
- [Multi-class Regression](#model-4)
- [*k*-Nearest Neighbors](#model-5)
- [r/TalesFromRetail vs. r/TalesFromYourServer](#model-6)
- [r/TalesFromYourServer vs. r/bartenders](#model-7)
- [GridSearchCV() on two different models](#model-8)

In [None]:
#imports
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import text
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

## Data Load

Load each of the subreddits' data into a dataframe and make a first cleaning pass using the built-in string methods.

In [None]:
sub_list = ['mc','fb','mls','ssfc','tfr','tfys', 'bar']

for sub in sub_list:
    if sub == 'mc':
        csvfilename = '../datasets/motorcycles.csv'
    elif sub == 'fb':
        csvfilename = '../datasets/soccer.csv'
    elif sub == 'mls':
        csvfilename = '../datasets/mls.csv'
    elif sub == 'ssfc':
        csvfilename = '../datasets/sounders.csv'
    elif sub == 'tfr':
        csvfilename = '../datasets/tfr.csv'
    elif sub == 'bar':
        csvfilename = '../datasets/bar.csv'
    else:
        csvfilename = '../datasets/tfys.csv'
        
    # Read the CSV in, drop the 'Unnamed: 0' column, drop the first row (which is blank), reset the index
    df = pd.read_csv(csvfilename)
    df.drop(columns=['Unnamed: 0'], inplace=True)
    df.drop(index=0, inplace=True)
    df.reset_index(drop=True, inplace=True)

    # Lowercase the title and the text (the author is left alone)
    df['title'] = df.title.str.lower()
    df['text'] = df.text.str.lower()

    # If the text (the actual post) is NaN, replace it with a space. Spaces won't affect the process
    df['text'] = df['text'].fillna(' ')

    # Put the data into the correct dataframe, one per sub, to facilitate analysis
    if sub == 'mc':
        mc_df = df
    elif sub == 'fb':
        fb_df = df
    elif sub == 'mls':
        mls_df = df
    elif sub == 'tfr':
        tfr_df = df
    elif sub == 'tfys':
        tfys_df = df
    elif sub == 'bar':
        bar_df = df
    else:
        ssfc_df = df

### Functions

I created two functions to streamline the code, since I was doing many of the same things over and over. Sort of the whole point of a function.

#### `string_clean(series)`
This takes a dirty series of documents, cleans and tokenizes each one, and returns a clean corpus.


#### `vector_scores(X,y,steps)`
This takes a clean X, a clean y, and a list of steps for a pipeline.

The data is split then run through a series of pipelines. Each pipeline contains at least a vectorizer and a logistic regression object. The pipelines are fit, transformed, and scored with the results being stored in a dataframe, which is returned. A confusion matrix is also created using the test data; this is returned as well. The metrics for a binary classification are also computed and returned as a dataframe.
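The binary metrics mentioned above boil down to a few ratios of the confusion-matrix counts. A minimal sketch with made-up counts; `binary_metrics` is a hypothetical helper for illustration, not a function defined in this notebook:

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute the five summary metrics from binary confusion-matrix counts."""
    n = tn + fp + fn + tp
    return {
        'Accuracy': (tp + tn) / n,
        'Misclassification': (fp + fn) / n,
        'Precision': tp / (tp + fp),
        'Sensitivity': tp / (tp + fn),
        'Specificity': tn / (tn + fp),
    }

# Made-up counts, in the same order that confusion_matrix(...).ravel() returns them
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

Accuracy and misclassification always sum to 1, which makes a handy sanity check on the counts.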

In [None]:
# This function takes in a dirty pandas Series and returns a clean corpus (a list of strings)
# with no numbers or punctuation

def string_clean(series):

    # Init the corpus
    corpus = []

    # Step through the series
    for _, val in series.items():

        # Create a list of all the clean words/tokens
        clean_list = re.findall(r'\b[^\d\W]+\b', val)

        # Create the clean string with the tokens separated by spaces
        s_clean = ' '.join(clean_list)

        # Append to the corpus
        corpus.append(s_clean)

    return corpus
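To see what the regex in `string_clean()` keeps and drops, here is a quick check on a made-up title. The thing to notice is that a token containing a digit anywhere is dropped entirely rather than trimmed, because the word boundaries never line up with the letters-only run:

```python
import re

TOKEN_PATTERN = r'\b[^\d\W]+\b'  # same pattern used in string_clean()

sample = '2019 honda cb500x: worth it?'  # made-up title
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
```

So a model name like `cb500x` never reaches the vectorizer, which may or may not matter depending on the sub.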

In [None]:
# Take a clean X and y, run them through three different vectorizers, and return the
# train/test scores as a dataframe. Also takes a list of pipeline steps and assumes the
# vectorizer is the first step. Returns the confusion matrix for the test data, plus a
# dataframe of metrics calculated for binary classification.

def vector_scores(X, y, steps):

    # Split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

    # The vectorizers to swap into the first pipeline step, and their names for the dataframe
    vectors = [('cvec', CountVectorizer(stop_words='english')),
               ('tvec', TfidfVectorizer(stop_words='english')),
               ('hvec', HashingVectorizer(stop_words='english'))]
    vector_list = ['CountVector', 'TfidfVector', 'HashVector']

    # Work on a copy so the caller's list of steps is left untouched
    steps = list(steps)

    # Init holding lists
    vec_list = []
    set_list = []
    score_list = []

    for vec, vec_name in zip(vectors, vector_list):

        # Set the first pipeline step to the current vectorizer
        steps[0] = vec
        pipe = Pipeline(steps)

        # Fit train
        pipe.fit(X_train, y_train)

        # Score train
        vec_list.append(vec_name)
        set_list.append('train')
        score_list.append(pipe.score(X_train, y_train))

        # Score test
        vec_list.append(vec_name)
        set_list.append('test')
        score_list.append(pipe.score(X_test, y_test))

    # Fill scores dataframe
    scores_df = pd.DataFrame({'vector': vec_list, 'set': set_list, 'score': score_list})

    # Reset to CountVectorizer(), as that's what I'm using for MultinomialNB.
    # The caller's remaining steps are kept so the confusion matrix matches the model scored above
    steps[0] = vectors[0]
    pipe = Pipeline(steps)
    pipe.fit(X_train, y_train)

    # Predict test and create the confusion matrix
    y_pred = pipe.predict(X_test)
    cf = confusion_matrix(y_test, y_pred)

    # Create a dataframe of metrics. If there are more than two classes, return an empty dataframe
    if cf.shape == (2, 2):
        tn, fp, fn, tp = cf.ravel()
        n = tn + fp + fn + tp
        accuracy = (tp + tn) / n
        sensitivity = tp / (tp + fn)
        precision = tp / (tp + fp)
        specificity = tn / (tn + fp)
        misclass = (fp + fn) / n
        value_list = [accuracy, misclass, precision, sensitivity, specificity]
        metric_df = pd.DataFrame({
            'metric': ['Accuracy', 'Misclassification', 'Precision', 'Sensitivity', 'Specificity'],
            'value': value_list})
    else:
        metric_df = pd.DataFrame(columns=['metric', 'value'])

    return scores_df, cf, metric_df

<a id='model-2'></a>

## r/motorcycle vs. r/soccer

For this model I will look at bikes vs. balls.

Can I reliably discern where the post came from, r/motorcycle or r/soccer?

I will also use the CountVectorizer(), TfidfVectorizer(), and HashingVectorizer().

In [None]:
# create combined df to get our X and y from r/motorcycle and r/soccer
data1_df = pd.concat([mc_df, fb_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mc']

First we do logistic regression

In [None]:
# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

In [None]:
scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

Now we do multinomial naive Bayes. I had tried to include this in the function, but an odd error that I couldn't solve cropped up. So I'm doing a separate pipeline of MultinomialNB, using the CountVectorizer.
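One plausible source of that error (an assumption, since the traceback isn't recorded here): `MultinomialNB` rejects negative feature values, and `HashingVectorizer` can produce them with its default `alternate_sign=True`. A minimal sketch on a made-up corpus showing the non-negative variant running through naive Bayes:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = motorcycle-like title, 0 = soccer-like title
docs = ['new helmet and gloves', 'chain and sprocket swap',
        'late goal wins the derby', 'keeper saves penalty kick']
labels = [1, 1, 0, 0]

# alternate_sign=False keeps every hashed feature value non-negative
hvec = HashingVectorizer(stop_words='english', alternate_sign=False)
min_value = hvec.transform(docs).min()

nb_pipe = Pipeline([('hvec', hvec), ('mnb', MultinomialNB())])
nb_pipe.fit(docs, labels)
predictions = nb_pipe.predict(['helmet and chain'])
print(min_value, predictions)
```

If that was the culprit, passing `alternate_sign=False` for the hashing step would let all three vectorizers share one function.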

In [None]:
# create combined df to get our X and y from r/motorcycle and r/soccer
data1_df = pd.concat([mc_df, fb_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mc']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

### Train had a 1.0 score?!?!?!?!

Let's go check that out!

In [None]:
# create combined df to get our X and y from r/motorcycle and r/soccer
data1_df = pd.concat([mc_df, fb_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mc']

# Get the scores
# scores_df = vector_scores(X,y, steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)

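# Candidate custom stop words. Note: my_stop_words is not passed to the vectorizer below,
# so it has no effect on these scores
mc_stop_words = ['motorcycle','motorcycles']

my_stop_words = text.ENGLISH_STOP_WORDS.union(mc_stop_words)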

cvec = CountVectorizer()
lr = LogisticRegression(solver='liblinear', multi_class='auto')

X_train_c = cvec.fit_transform(X_train)
X_test_c = cvec.transform(X_test)

lr.fit(X_train_c, y_train)

train_score = lr.score(X_train_c, y_train)
test_score = lr.score(X_test_c, y_test)
print(f'Train: {train_score}')
print(f'Test: {test_score}')
print(f'Cross: {cross_val_score(lr, X_train_c, y_train, cv=5).mean()}')



In [None]:
# get the features and their coefficients to look at
c_df = pd.DataFrame(lr.coef_, columns=cvec.get_feature_names_out()).T


In [None]:
top = c_df.sort_values(by=0, ascending=False).head(15)
bottom = c_df.sort_values(by=0, ascending=True).head(15)

In [None]:
top.rename(columns={0:'Coef'}, inplace=True)
top

In [None]:
bottom.rename(columns={0:'Coef'}, inplace=True)
bottom


#### Analysis
I find it very interesting that so many words are so heavily weighted. It makes a bit more sense why the model is overfitting the training data so badly. This is also why I went and collected data from several other subs.

In [None]:
# Total number of positive terms >= 1

pos = c_df[ c_df[0] >= 1].sort_values(by=0, ascending=False)
print(f'Mean of positive terms: {pos[0].mean()}')
print(f'Shape: {pos.shape}')

In [None]:
# Total number of negative terms <= -1

neg = c_df[ c_df[0] <= -1].sort_values(by=0, ascending=True)
print(f'Mean of negative terms: {neg[0].mean()}')
print(f'Shape: {neg.shape}')

<a id='model-1'></a>

## r/soccer vs. r/MLS

Since motorcycles and the language pertaining thereto aren't very similar to soccer-sports-ball, let's compare two more closely related subs to see what the results are. Namely, we'll look at the worldwide (although in practice fairly Euro-centric) r/soccer and the league-specific r/MLS.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([fb_df, mls_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mls']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

# Get the scores
scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

NaiveBayes Multi with CountVectorizer

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([fb_df, mls_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mls']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

#### Analysis
This was, as expected, not perfect. There is much more crossover in terms of specific vocabulary.

Let's do one more test: a team-specific sub vs. the league in which it plays. r/SoundersFC vs. r/MLS.

<a id='model-3'></a>

## r/MLS vs. r/SoundersFC

Further down the rabbit hole.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([mls_df, ssfc_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_ssfc']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

# Get the scores
scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

NaiveBayes Multi with CountVectorizer

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([mls_df, ssfc_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_ssfc']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

#### Analysis
It's a bit over-fit, but not much. I am honestly surprised at this.

<a id='model-4'></a>

## Multi-class regression

EVERYTHING!!!!!!!

Seriously, I'm going to do a multi-class regression against the whole shebang. And then I'll cook an egg on my poor laptop.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([mc_df, fb_df, mls_df, ssfc_df])

# instead of one-hot encoding the target apply a map to the 'sub'
sub_map = {'mc': 0,
           'fb': 1,
           'mls': 2,
           'ssfc':3}
            
data1_df['sub'] = data1_df['sub'].map(sub_map).astype(int)

#create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

# Get the scores
scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['mc','fb','mls','ssfc'], index=['mc','fb','mls','ssfc'])

NaiveBayes Multi with CountVectorizer

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([mc_df, fb_df, mls_df, ssfc_df])

# instead of one-hot encoding the target apply a map to the 'sub'
sub_map = {'mc': 0,
           'fb': 1,
           'mls': 2,
           'ssfc':3}
            
data1_df['sub'] = data1_df['sub'].map(sub_map).astype(int)

#create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['mc','fb','mls','ssfc'], index=['mc','fb','mls','ssfc'])

#### Analysis

Multinomial Bayes did a surprisingly good job compared to the others.

The other models are obviously quite overfit.

<a id='model-5'></a>

## KNN

Let's try a KNN.

Because.

I'm going to put it through the standard scaler to make sure everything is nice and tight.
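A note on why the scaler needs `with_mean=False`: the vectorizer output is a sparse matrix, and centering it would turn every zero into a non-zero value, so scikit-learn refuses. A minimal sketch on made-up titles:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

docs = ['red bike', 'blue bike', 'late goal', 'early goal']  # made-up titles
counts = CountVectorizer().fit_transform(docs)  # scipy sparse matrix

# Scaling to unit variance without centering keeps the matrix sparse
scaled = StandardScaler(with_mean=False).fit_transform(counts)

# Centering a sparse matrix is not allowed, so this raises ValueError
try:
    StandardScaler(with_mean=True).fit_transform(counts)
    centering_failed = False
except ValueError:
    centering_failed = True

print(scaled.shape, centering_failed)
```

So the scaler here only rescales each word column by its standard deviation; the counts are not centered on zero.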

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([mc_df, fb_df, mls_df, ssfc_df])

# instead of one-hot encoding the target apply a map to the 'sub'
sub_map = {'mc': 0,
           'fb': 1,
           'mls': 2,
           'ssfc':3}
            
# Make sure the target is an integer, just to keep it clean
data1_df['sub'] = data1_df['sub'].map(sub_map).astype(int)

#create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub']

# Init steps
steps = [
    ('cvec', CountVectorizer()),
    ('ss', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier())
]

scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['mc','fb','mls','ssfc'], index=['mc','fb','mls','ssfc'])

#### Analysis

BWHAHAHAHAHA!!!!!

Not only is it overfit, it's still really bad! 

Baseline is .50. So .54 isn't much of an improvement. 

Worth a shot? I guess?
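For context on that baseline: with more than two classes, the baseline is the share of the most common class, not 1 divided by the number of classes. A minimal sketch, where `majority_baseline` is a hypothetical helper and the label counts are made up (they are not the real dataset sizes):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Made-up label counts for the four subs
demo_labels = ['mc'] * 50 + ['fb'] * 20 + ['mls'] * 20 + ['ssfc'] * 10
baseline = majority_baseline(demo_labels)
print(baseline)
```

On the real data, the same number comes from `y.value_counts(normalize=True).max()`.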

#### Thoughts so far.

Motorcycle vs. soccer is the clear-cut case: the two subs are completely different. Makes sense.

The rest of them are refinements of the previous set. For example, r/MLS covers a subset of what is discussed in r/soccer, and r/SoundersFC covers a subset of r/MLS. It is reasonable, therefore, to expect the vocabularies to be subsets as well. This would explain the ease with which the model was able to differentiate the subreddits, even in a multi-class regression.

This leads us to a different experiment:

<a id='model-6'></a>

## r/TalesFromRetail vs. r/TalesFromYourServer

Since the soccer subs were subsets of one another, I'm going to take two subs that have a similar form: stories. The two subs even have a similar focus: customer service. One is food based, r/TalesFromYourServer, and the other is retail based, r/TalesFromRetail.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([tfr_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_tfys']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

scores_df, cf, metric_df = vector_scores(X,y,steps)

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

NaiveBayes Multi with CountVectorizer

In [None]:
# create combined df to get our X and y from r/TalesFromRetail and r/TalesFromYourServer
data1_df = pd.concat([tfr_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_tfys']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

#### Analysis
That was not what I expected. It's still quite overfit, which is honestly a bit surprising, but in keeping with this whole project.

Part of this is due to the unbalanced classes: r/TalesFromRetail provided almost twice as many examples as r/TalesFromYourServer.
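Because every split in this notebook uses `stratify=y`, the imbalance is carried into both the train and test sets at roughly the same ratio. A minimal sketch with a made-up 2:1 imbalance standing in for the two subs (the counts are illustrative, not the real ones):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up 2:1 imbalance: 80 posts of class 0, 40 posts of class 1
X_demo = [f'post {i}' for i in range(120)]
y_demo = [0] * 80 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo)

# Share of the minority class in each split
train_share = Counter(y_tr)[1] / len(y_tr)
test_share = Counter(y_te)[1] / len(y_te)
print(round(train_share, 3), round(test_share, 3))
```

With classes this lopsided, always guessing the larger sub already scores about two thirds, so accuracy should be read against that rather than against 0.5.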

<a id='model-7'></a>

## r/TalesFromYourServer vs. r/bartenders

Okay, let's try two food-service subs: r/TalesFromYourServer and r/bartenders.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([bar_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_tfys']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

scores_df, cf, metric_df = vector_scores(X,y,steps)

scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

NaiveBayes Multi with CountVectorizer

In [None]:
# create combined df to get our X and y from r/bartenders and r/TalesFromYourServer
data1_df = pd.concat([bar_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_tfys']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

#### Analysis

FINALLY! I have some issues! I mean the code does. Not that I don't, but...well, I'm not paying y'all for that kind of help.

**ANYWAY!**

TO THE BAT-CODE, ROBIN!!!!

I'm going to combine the text (post) with the title.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([bar_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# Create a series that combines the title and the text
t_t = data1_df['title'] + ' ' + data1_df['text']

# create our X matrix and y target
X = string_clean(t_t)
y = data1_df['sub_tfys']

# Create the steps
steps = [
    ('hold', 'hold'),
    ('lr', LogisticRegression(solver='liblinear', multi_class='auto' ))]

scores_df, cf, metric_df = vector_scores(X,y,steps)

scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

Let's do the MNB on the whole shebang.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([bar_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# Create a series that combines the title and the text
t_t = data1_df['title'] + ' ' + data1_df['text']

# create our X matrix and y target
X = string_clean(t_t)
y = data1_df['sub_tfys']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)
# create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english')),
    ('mnb', MultinomialNB())]

pipe = Pipeline(steps)

# init holding lists
vec_list = []
set_list = []
score_list = []
    
# Fit Train
pipe.fit(X_train, y_train)
    
# Score Train
vec_list.append('CountVector')
set_list.append('train')
score_list.append(pipe.score(X_train, y_train))
                      
# Score Test
vec_list.append('CountVector')
set_list.append('test')
score_list.append(pipe.score(X_test, y_test)) 
                 
# Fill scores dataframe
scores_df = pd.DataFrame(columns = ['vector','set','score'])                      
scores_df['vector'] = vec_list
scores_df['set'] = set_list
scores_df['score'] = score_list
    
# Create the confusion matrix
# Predict Test
y_pred = pipe.predict(X_test)
    
# Confusion Matrix
cf = confusion_matrix(y_test, y_pred)

tn, fp, fn, tp = cf.ravel()
n = tn+fp+fn+tp
accuracy = (tp+tn)/n
sensitivity = tp/(tp+fn)
precision = tp/(tp+fp)
specificity = tn/(tn+fp)
missclass = (fp+fn)/n
value_list = [accuracy, missclass, precision, sensitivity, specificity]
metric_df = pd.DataFrame(columns=['metric', 'value'])
metric_df.metric = ['Accuracy','Misclassification','Precision','Sensitivity','Specificity']
metric_df.value = value_list

In [None]:
scores_df

In [None]:
pd.DataFrame(cf, columns=['pred neg','pred pos'], index=['actual neg','actual pos'])

In [None]:
metric_df

That made a huge difference. Interesting, and not that surprising.

It's still overfit. So....

Let's play around with GridSearchCV to see if it can be improved.

<a id='model-8'></a>

## GridSearchCV

I'm going to use GridSearchCV to see if I can get the logistic regression less overfit.

I'll then take that and apply it to the first comparison: bikes v balls.

In [None]:
# create combined df to get our X and y from
data1_df = pd.concat([bar_df, tfys_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# Create a series that combines the title and the text
t_t = data1_df['title'] + ' ' + data1_df['text']

# create our X matrix and y target
X = string_clean(t_t)
y = data1_df['sub_tfys']

#typical split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)

## Create the steps
steps = [
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(multi_class='auto' ))]

# Instantiate the Pipe
pipe=Pipeline(steps)

In [None]:
# Create Parameter dict
params={
    'cvec__stop_words':['english'],
    'cvec__max_features': [2000],
    'cvec__ngram_range':[(1,1)],
    'cvec__max_df': [.5],
    'lr__penalty':['l2'],
    'lr__solver':['lbfgs'],
    'lr__max_iter':[10]
}

gs = GridSearchCV(pipe, param_grid = params, cv=3, n_jobs=-1, verbose=1)

gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

Best run found:

0.8882030178326474

{'cvec__max_df': 0.5, 'cvec__max_features': 2000, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': 'english', 'lr__max_iter': 10, 'lr__penalty': 'l2', 'lr__solver': 'lbfgs'}

In [None]:
gs.score(X_test, y_test)

Train (mean cross-validated score from the grid search): .8882

Test: .9012

And that, all in all, isn't too bad.
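One thing worth keeping straight when reading grid search numbers: `best_score_` is the mean score on the held-out cross-validation folds, while scoring the fitted search on the training data gives ordinary train accuracy. A minimal sketch on a tiny made-up corpus (the documents, labels, and parameter values are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = server-like post, 0 = bartender-like post
docs = ['table of eight left no tip', 'customer sent the steak back twice',
        'hostess double sat my section', 'kitchen lost the ticket again',
        'shaker tin froze shut', 'last call and nobody leaves',
        'free pour or jigger tonight', 'regular wants a stronger cocktail']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

demo_pipe = Pipeline([('cvec', CountVectorizer()),
                      ('lr', LogisticRegression(solver='liblinear'))])
demo_gs = GridSearchCV(demo_pipe, param_grid={'cvec__max_features': [20, 50]}, cv=2)
demo_gs.fit(docs, labels)

cv_score = demo_gs.best_score_             # mean accuracy on the held-out folds
train_score = demo_gs.score(docs, labels)  # refit best model scored on its own training data
print(f'Cross-validated: {cv_score:.3f}')
print(f'Train: {train_score:.3f}')
```

Comparing the test score against both numbers gives a fuller picture of how overfit the tuned model is.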

Let's take the same parameters and run them on the original dataset, motorcycle v soccer.

In [None]:
# create combined df to get our X and y from r/motorcycle and r/soccer
data1_df = pd.concat([mc_df, fb_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mc']

#typical split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)

In [None]:
# Create the steps
steps = [
    ('cvec', CountVectorizer(stop_words='english', max_features=2000, ngram_range=(1,1), max_df=.5)),
    ('lr', LogisticRegression(multi_class='auto', solver='lbfgs', max_iter=10 ))]

# Instantiate the Pipe
pipe=Pipeline(steps)

pipe.fit(X_train, y_train)
print(f'Train: {pipe.score(X_train, y_train)}')
print(f'Test: {pipe.score(X_test, y_test)}')

Now that's interesting. The same hyperparameters obviously don't carry over from one dataset to another.

Let's go play with Grid.

In [None]:
# create combined df to get our X and y from r/motorcycle and r/soccer
data1_df = pd.concat([mc_df, fb_df])

# one-hot encode the sub for our target
data1_df = pd.get_dummies(data1_df, columns=['sub'], drop_first=True)

# create our X matrix and y target
X = string_clean(data1_df['title'])
y = data1_df['sub_mc']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify=y)

In [None]:
# Create Parameter dict
mcstop=['motorcycle','motorcycles']
params={
    'cvec__stop_words':[None],
    'cvec__max_features': [3000],
    'cvec__ngram_range':[(1,1)],
    'cvec__max_df': np.linspace(.01, 1, 20),
    'lr__penalty':['l2'],
    'lr__solver':['sag'],
    'lr__max_iter':[100]
}

gs = GridSearchCV(pipe, param_grid = params, cv=5, n_jobs=-1, verbose=1)

gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

Fitting 5 folds for each of 1 candidates, totalling 3 fits

0.9318181818181818

{'cvec__max_df': 0.06210526315789474, 'cvec__max_features': 3000, 'cvec__ngram_range': (1, 1), 'cvec__stop_words': None, 'lr__max_iter': 100, 'lr__penalty': 'l2', 'lr__solver': 'sag'}


In [None]:
gs.score(X_test, y_test)

First Pass:

| | vector | set | score |
|---|---|---|---|
| 0 | CountVector | train | 1.000000 |
| 1 | CountVector | test | 0.929545 |
| 2 | TfidfVector | train | 0.998485 |
| 3 | TfidfVector | test | 0.956818 |
| 4 | HashVector | train | 0.992424 |
| 5 | HashVector | test | 0.938636 |


GridSearch

CountVector train (mean cross-validated score) .9318

CountVector test .9318

Much better. Interesting.

#### Analysis

Hyperparameters obviously play a large role in NLP. I can extrapolate from this that they play a large role pretty much anywhere, and GridSearchCV() is going to be a valuable tool moving forward.

I don't know exactly what it means, but I like the fact that the test dataset did better than the train on the bartenders run. Keep in mind that the "train" number there is GridSearchCV's mean cross-validated score, not a score on the full training set. Either way, it gives me some confidence in how well the model will perform on new data.

What I find very interesting is that the test and train scores are exactly the same on the motorcycle vs. soccer data.