# Capstone Project: The Persuasive Power of Words

*by Nee Bimin*

## Notebook 3: Modeling and Conclusion

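In this notebook, we build classification models that predict, from a talk's transcript, whether the talk is rated persuasive (at or above the median number of persuasive ratings).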

## Content

- [Pre-processing](#Pre-processing)
    * [Tokenizing and Lemmatizing](#Tokenizing-and-Lemmatizing)
- [Train/Test Split](#Train/Test-Split)
- [Grid Search CV](#Grid-Search-CV)
    * [Baseline Accuracy](#Baseline-Accuracy)
    * [Count Vectorizer](#Count-Vectorizer)
    * [Tfidf Vectorizer](#Tfidf-Vectorizer)
- [Optimising Tfidf Multinomial Naive Bayes](#Optimising-Tfidf-Multinomial-Naive-Bayes)
- [Optimising Tfidf Logistic Regression](#Optimising-Tfidf-Logistic-Regression)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import os
import sys
import operator
import graphviz

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from sklearn.tree import export_graphviz

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

%matplotlib inline





In [37]:
# Read in data
ted_model = pd.read_csv('../data/ted_model.csv')
transcripts = pd.read_csv('../data/transcripts_cleaned.csv')

## Pre-processing

### Tokenizing and Lemmatizing

In [14]:
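# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+')
# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))
tokens = []

for transcript in transcripts['transcript']:
    loop_tokens = tokenizer.tokenize(transcript.lower())

    for j, token in enumerate(loop_tokens):
        # Remove non-letters (re.sub returns a new string, so keep the result)
        token = re.sub('[^a-zA-Z]', '', token)

        # Blank out stopwords; all other tokens keep their cleaned form
        loop_tokens[j] = '' if token in stop_words else token

    tokens.append(loop_tokens)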

In [15]:
# Check token
tokens[2][:10]

['hello', 'voice', 'mail', '', 'old', 'friend', '', '', 'called', '']

In [19]:
# Instantiate Lemmatizer
lem = WordNetLemmatizer()

speech_token_lem = []
for speech in tokens:
    speech_lem = []
    
    for word in speech:
        word_lem = lem.lemmatize(word)  # get lemmatized word
        speech_lem.append(word_lem)  # add to this speech's list
    speech_token_lem.append(speech_lem)  # add the speech's list to the list of lemmatized speeches

# Check lemmatized token
speech_token_lem[2][:10]

['hello', 'voice', 'mail', '', 'old', 'friend', '', '', 'called', '']

In [20]:
# Format tokenized lemma for vectorizer i.e. change to a list of strings
lem_list = []

for speech in speech_token_lem:
    lem_list.append(' '.join(speech))
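
The cells above leave empty strings where stopwords were blanked out, so the joined strings contain runs of extra spaces. A minimal sketch of the same clean-up that drops empty tokens before joining is shown here. It uses only `re`, and the small `STOP_WORDS` set and the sample sentence are hypothetical stand-ins for NLTK's English stopword list and a real transcript.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption for this sketch)
STOP_WORDS = {"this", "is", "an", "my", "from", "who", "me", "in"}

def clean_tokens(text):
    """Lowercase, split on word characters, strip non-letters, drop stopwords and empties."""
    raw = re.findall(r"\w+", text.lower())  # same pattern as RegexpTokenizer(r'\w+')
    letters_only = (re.sub(r"[^a-z]", "", tok) for tok in raw)
    return [tok for tok in letters_only if tok and tok not in STOP_WORDS]

sample = "Hello, voice mail. This is an old friend who called me in 2009!"
sample_tokens = clean_tokens(sample)
sample_joined = " ".join(sample_tokens)

print(sample_tokens)
print(sample_joined)
```

Whether to drop the empty strings is a design choice: a vectorizer with the default token pattern ignores extra whitespace, so the main benefit is smaller and more readable intermediate data.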

In [38]:
persuasive_model = pd.DataFrame(lem_list, columns=['transcript'])
persuasive_model['persuasive'] = ted_model['persuasive']

In [40]:
persuasive_model['persuasive'].describe()

count     2467.000000
mean       226.397649
std        473.497627
min          0.000000
25%         39.000000
50%        101.000000
75%        234.000000
max      10704.000000
Name: persuasive, dtype: float64

In [45]:
# Create another column to label a talk as persuasive if the number of persuasive votes
# is greater than or equal to the median
persuasive_median = persuasive_model['persuasive'].median()
persuasive_model['persuasive_label'] = np.where(persuasive_model['persuasive'] >= persuasive_median, 1, 0)
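
As an illustration of the median split, the sketch below applies the same `np.where` rule to a small array of made-up vote counts (not taken from the TED data). It shows that because the rule uses `>=`, talks tied at the median are all labelled 1, so the positive class can end up larger than half.

```python
import numpy as np

# Hypothetical vote counts, chosen so that two values tie at the median
votes = np.array([0, 12, 39, 101, 101, 234, 480, 10704])

median_votes = np.median(votes)
toy_labels = np.where(votes >= median_votes, 1, 0)
positive_share = toy_labels.mean()

print(median_votes)
print(toy_labels)
print(positive_share)
```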

In [61]:
# Create another column to label a talk as inspiring 
# if the number of inspiring votes is greater than or equal to the median
inspiring_median = inspiring_model['inspiring'].median()
inspiring_model['inspiring_label'] = np.where(inspiring_model['inspiring'] >= inspiring_median, 1, 0)

NameError: name 'inspiring_model' is not defined

## Train/Test Split

In [47]:
X = persuasive_model['transcript']
y = persuasive_model['persuasive_label']

In [48]:
# Split into train and test sets (train_test_split defaults to a 75/25 split).
# stratify=y is not passed, so class proportions in the two sets are only approximately equal
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
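
For reference, stratification in `train_test_split` is requested with the `stratify` argument. The sketch below uses toy labels (60 zeros and 40 ones, an assumption for illustration) to show that a stratified split reproduces the 40% positive share in both sets, while a plain split only approximates it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with a 60/40 class balance (hypothetical, not the TED data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# Stratified split: class proportions are preserved in train and test
_, _, ys_train, ys_test = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Plain split: proportions depend on the random shuffle
_, _, yp_train, yp_test = train_test_split(X_toy, y_toy, random_state=42)

print('stratified:', ys_train.mean(), ys_test.mean())
print('plain:     ', yp_train.mean(), yp_test.mean())
```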

In [49]:
# Confirm that X and y have the same length within the train and test sets
print('Number of rows in X_train: {}'.format(len(X_train)))
print('Number of rows in y_train: {}'.format(len(y_train)))
print('Number of rows in X_test: {}'.format(len(X_test)))
print('Number of rows in y_test: {}'.format(len(y_test)))

Number of rows in X_train: 1850
Number of rows in y_train: 1850
Number of rows in X_test: 617
Number of rows in y_test: 617


## Grid Search CV

### Baseline Accuracy

In [50]:
# baseline accuracy
baseline = y_train.value_counts(normalize=True)[1]
baseline

0.5064864864864865

Here we find the baseline accuracy: the accuracy a model would achieve by labelling every transcript as persuasive. It equals the proportion of the training set with a target value of 1, which normalising the value counts gives directly: 50.6%.
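
A minimal sketch of the same idea in a more general form: the baseline is the accuracy of always predicting the most common label, whichever class that is. The helper name `majority_baseline` and the toy labels are hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels: five 1s and three 0s
y_toy = [1, 1, 1, 0, 0, 1, 0, 1]
toy_baseline = majority_baseline(y_toy)

print(toy_baseline)
```

Taking the maximum over classes guards against the case where class 0 is the majority, in which reading off the share of class 1 would understate the baseline.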

### Count Vectorizer

In [55]:
# Pipeline steps for each vectorizer/model combination
# Scale features for knn (distance-based) and logistic regression (helps the solver converge);
# with_mean=False because sparse count matrices cannot be centred
steps_list_cv = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]
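
The pipelines pass `with_mean=False` to `StandardScaler` because `CountVectorizer` produces a sparse matrix. The sketch below, using a small hand-made sparse matrix (an assumption for illustration), shows that scaling without centring keeps the matrix sparse, while asking for centring on sparse input raises a `ValueError`.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Small sparse count matrix standing in for CountVectorizer output (hypothetical values)
counts = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                     [0.0, 0.0, 4.0],
                                     [3.0, 0.0, 0.0]]))

# Scaling by the standard deviation only: the zeros stay zero, so sparsity is preserved
scaled = StandardScaler(with_mean=False).fit_transform(counts)
print(sparse.issparse(scaled))

# Centring would turn every zero into a non-zero value, so scikit-learn refuses
try:
    StandardScaler(with_mean=True).fit(counts)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print('ValueError:', err)
```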

In [56]:
steps_titles_cv = ['multi_nb + cv','knn + cv','logreg + cv']

In [57]:
pipe_params_cv = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)]}
]   
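
The keys in these grids follow the `<step name>__<parameter>` convention, which is how `GridSearchCV` routes a value to the right pipeline step. A minimal sketch on a toy corpus is given below; the documents, the labels and the extra `multi_nb__alpha` entry are assumptions added for illustration and are not part of the grids above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): four documents per class
docs = ["we must act now", "act now to change", "you should act today",
        "change must happen now", "this is a story", "a story about birds",
        "birds live in trees", "trees are tall"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("cv", CountVectorizer()), ("multi_nb", MultinomialNB())])

# 'cv__ngram_range' sets ngram_range on the step named 'cv';
# 'multi_nb__alpha' sets alpha on the step named 'multi_nb'
toy_grid = {"cv__ngram_range": [(1, 1), (1, 2)], "multi_nb__alpha": [0.5, 1.0]}

toy_gs = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_gs.fit(docs, toy_labels)

n_candidates = len(toy_gs.cv_results_["params"])  # 2 x 2 parameter combinations
print(n_candidates)
print(toy_gs.best_params_)
```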

In [58]:
# instantiate results DataFrame for CountVectorizer
gs_results_cv = pd.DataFrame(columns=['model','best_params','train_accuracy','test_accuracy',
                                      'baseline_accuracy','recall', 'precision', 'f1-score'])
gs_results_cv.head()

| model | best_params | train_accuracy | test_accuracy | baseline_accuracy | recall | precision | f1-score |
|---|---|---|---|---|---|---|---|


In [59]:
# Widen column display so the best_params dictionaries are not truncated
pd.set_option('display.max_colwidth', 200)

# Loop through each pipeline and its parameter grid
for i in range(len(steps_list_cv)):
    # instantiate pipeline
    pipe = Pipeline(steps=steps_list_cv[i])
    # set up GridSearchCV over the pipeline and its params (3-fold CV)
    gs = GridSearchCV(pipe, pipe_params_cv[i], cv=3)

    model_results = {}

    gs.fit(X_train, y_train)

    print('Model: ', steps_titles_cv[i])
    model_results['model'] = steps_titles_cv[i]

    print('Best Params: ', gs.best_params_)
    model_results['best_params'] = gs.best_params_

    # gs.score() on the training data is the train accuracy of the refit best estimator;
    # the mean cross-validated accuracy is available as gs.best_score_
    train_accuracy = gs.score(X_train, y_train)
    print('Train accuracy: ', train_accuracy, '\n')
    model_results['train_accuracy'] = train_accuracy

    test_accuracy = gs.score(X_test, y_test)
    print('Test accuracy: ', test_accuracy, '\n')
    model_results['test_accuracy'] = test_accuracy

    model_results['baseline_accuracy'] = baseline

    # Display the confusion matrix results showing true/false positive/negative
    tn, fp, fn, tp = confusion_matrix(y_test, gs.predict(X_test)).ravel()
    print("True Negatives: %s" % tn)
    print("False Positives: %s" % fp)
    print("False Negatives: %s" % fn)
    print("True Positives: %s" % tp, '\n')

    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    model_results['recall'] = recall
    model_results['precision'] = precision
    model_results['f1-score'] = 2 * precision * recall / (precision + recall)

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    gs_results_cv = pd.concat([gs_results_cv, pd.DataFrame([model_results])], ignore_index=True)

Model:  multi_nb + cv
Best Params:  {'cv__ngram_range': (1, 2), 'cv__stop_words': 'english'}
0.9983783783783784 

0.49108589951377635 

True Negatives: 124
False Positives: 196
False Negatives: 118
True Positives: 179 

Model:  knn + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.5064864864864865 

0.4813614262560778 

True Negatives: 0
False Positives: 320
False Negatives: 0
True Positives: 297 



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Model:  logreg + cv
Best Params:  {'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}
0.9989189189189189 

0.5024311183144247 

True Negatives: 135
False Positives: 185
False Negatives: 122
True Positives: 175 
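
As a worked check of the metric formulas used in the loop, the sketch below recomputes recall, precision, F1 and accuracy from the confusion-matrix counts printed above for `multi_nb + cv`. The helper name `recall_precision_f1` is hypothetical.

```python
def recall_precision_f1(tp, fp, fn):
    """Recall, precision and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Counts printed above for 'multi_nb + cv'
tn, fp, fn, tp = 124, 196, 118, 179

recall, precision, f1 = recall_precision_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(recall, 4))     # 0.6027
print(round(precision, 4))  # 0.4773
print(round(f1, 4))         # 0.5327
print(round(accuracy, 4))   # 0.4911
```

The accuracy recovered from the counts agrees with the second score printed for that model (0.4911), which confirms that the second number in each output block is the test accuracy.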



In [None]:
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
CV = 5
entries = []
# NOTE: `features` and `labels` are not defined in the cells above; they are assumed
# here to be a vectorized feature matrix and the matching target labels
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns