# Capstone Project: The Persuasive Power of Words

*by Nee Bimin*

## Notebook 3: Modeling and Conclusion

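In this notebook, we train an LSTM model on the talk transcripts to predict whether a talk's persuasive, inspiring, and unconvincing ratings fall at or above the median.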

## Content

- [Pre-processing](#Preprocessing)
    * [Tokenizing and Lemmatizing](#Tokenizing-and-Lemmatizing)
- [Train/Test Split](#Train/Test-Split)
- [Grid Search CV](#Grid-Search-CV)
    * [Baseline Accuracy](#Baseline-Accuracy)
    * [Count Vectorizer](#Count-Vectorizer)
    * [Tfidf Vectorizer](#Tfidf-Vectorizer)
- [Optimising Tfidf Multinomial Naive Bayes](#Optimising-Tfidf-Multinomial-Naive-Bayes)
- [Optimising Tfidf Logistic Regression](#Optimising-Tfidf-Logistic-Regression)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
from keras.layers import Dropout

from nltk.corpus import stopwords
from nltk import word_tokenize
STOPWORDS = set(stopwords.words('english'))
import plotly.graph_objs as go
import chart_studio.plotly as py
import cufflinks
from IPython.core.interactiveshell import InteractiveShell
import plotly.figure_factory as ff
InteractiveShell.ast_node_interactivity = 'all'
from plotly.offline import iplot
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

%matplotlib inline

In [15]:
# Read in data
ted_model = pd.read_csv('../data/ted_model.csv')
transcripts = pd.read_csv('../data/transcripts_cleaned.csv')

In [29]:
# Assign 1 to talks whose count of persuasive votes is equal to or higher than the median
# These will be the classes that we will train the model on
persuasive_median = ted_model['persuasive'].median()
ted_model['persuasive_label'] = np.where(ted_model['persuasive'] >= persuasive_median, 1, 0)

# Do the same for inspiring and unconvincing talks
inspiring_median = ted_model['inspiring'].median()
ted_model['inspiring_label'] = np.where(ted_model['inspiring'] >= inspiring_median, 1, 0)

unconvincing_median = ted_model['unconvincing'].median()
ted_model['unconvincing_label'] = np.where(ted_model['unconvincing'] >= unconvincing_median, 1, 0)
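
A quick sanity check on a median split is to look at the resulting class balance. The sketch below uses a small hypothetical DataFrame (`toy`, not the real `ted_model`) to show that ties at the median all fall into class 1 under `>=`, so the two classes are not guaranteed to be exactly balanced.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings; the real values live in ted_model.
toy = pd.DataFrame({'persuasive': [0.1, 0.4, 0.4, 0.4, 0.9]})

median = toy['persuasive'].median()
toy['persuasive_label'] = np.where(toy['persuasive'] >= median, 1, 0)

# Share of each class; every value tied with the median lands in class 1.
balance = toy['persuasive_label'].value_counts(normalize=True).sort_index()
print(balance)
```

Running the same `value_counts(normalize=True)` call on each label column of `ted_model` would show how close the real split is to 50/50.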

In [30]:
# Select columns of interest for modeling
# .copy() avoids a SettingWithCopyWarning when the transcript column is modified later
ted_model = ted_model[['comments', 'views', 'transcript', 'persuasive_label', 'inspiring_label', 'unconvincing_label']].copy()

In [18]:
STOPWORDS = set(stopwords.words('english'))

# Compile the patterns once instead of on every call
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')

def clean_text(text):
    """
    Lowercase a transcript, strip punctuation, and remove English stopwords.

    text: a string
    return: the cleaned string
    """
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace separators and brackets with a space
    text = BAD_SYMBOLS_RE.sub('', text) # drop anything that is not a digit, lowercase letter, space, #, + or _
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text

    return text

ted_model['transcript'] = ted_model['transcript'].apply(clean_text)
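
To see what the cleaning steps do to a single sentence, here is a minimal, self-contained sketch. It assumes a tiny stand-in stopword set (`DEMO_STOPWORDS`) instead of the NLTK list, and the function name `clean_text_demo` and the sample sentence are hypothetical.

```python
import re

# Small stand-in stopword set (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {'i', 'the', 'a', 'is', 'to', 'and', 'at'}

REPLACE_BY_SPACE = re.compile(r'[/(){}\[\]\|@,;]')
DROP_CHARS = re.compile(r'[^0-9a-z #+_]')

def clean_text_demo(text):
    text = text.lower()
    text = REPLACE_BY_SPACE.sub(' ', text)  # brackets and separators become spaces
    text = DROP_CHARS.sub('', text)         # apostrophes and other punctuation are deleted
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "I haven't blown away (yet); the talk is great!"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Because apostrophes are deleted before the stopword filter runs, a contraction such as "haven't" becomes "havent", which an apostrophe-based stopword entry would no longer match.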

In [25]:
ted_model['transcript'][0][:1000]

'good morning great hasnt ive blown away whole thing fact im leaving three themes running conference relevant want talk one extraordinary evidence human creativity presentations weve people variety range second put us place idea whats going happen terms future idea may play outi interest education actually find everybody interest education dont find interesting youre dinner party say work education actually youre often dinner parties frankly work education youre asked youre never asked back curiously thats strange say somebody know say say work education see blood run face theyre like oh god know one night week ask education pin wall one things goes deep people right like religion money things big interest education think huge vested interest partly education thats meant take us future cant grasp think children starting school year retiring 2065 nobody clue despite expertise thats parade past four days world look like five years time yet meant educating unpredictability think extraordi

In [42]:
# Maximum vocabulary size (only the most frequent words are kept)
max_words = 50000
# Maximum number of tokens kept per transcript
max_length = 10000
# Dimension of the word embedding vectors
embed = 100

tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~') # num_words caps the vocabulary used when converting texts to sequences
tokenizer.fit_on_texts(ted_model['transcript'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 98627 unique tokens.
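
The tokenizer reports every unique token it saw, but only the most frequent ones (those whose rank is below `num_words`) are used when texts are converted to sequences. The pure-Python sketch below imitates that frequency-ranked cut-off on three hypothetical sentences; it is an illustration of the idea, not Keras' implementation, and tie ordering may differ from the real `Tokenizer`.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the transcripts.
docs = ["ideas worth spreading", "ideas change the world", "the world of ideas"]

counts = Counter(word for doc in docs for word in doc.split())

# Rank words from 1 (most frequent); index 0 is left free for padding.
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

num_words = 4  # keep only words whose rank is below this value

def to_sequence(text):
    """Map words to ranks, silently dropping words outside the kept vocabulary."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

print(len(word_index), to_sequence("ideas change the world"))
```

Here the full index holds seven words, yet "change" is dropped from the sequence because its rank is not below `num_words`.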


In [43]:
# Set X variable
X = tokenizer.texts_to_sequences(ted_model['transcript'].values)
X = pad_sequences(X, maxlen=max_length) # pad or truncate every transcript to max_length tokens
print('Shape of data tensor:', X.shape)

Shape of data tensor: (2467, 10000)
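
By default, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The NumPy sketch below reproduces that behaviour on two hypothetical sequences; `pad_pre` is an illustrative helper, not part of Keras.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Zero-pad at the front and keep the last `maxlen` tokens of long sequences."""
    out = np.zeros((len(seqs), maxlen), dtype='int32')
    for i, seq in enumerate(seqs):
        if len(seq) == 0:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]           # drop tokens from the start if too long
        out[i, -len(trimmed):] = trimmed  # right-align so the zeros come first
    return out

demo = pad_pre([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(demo)
```

For transcripts this means a talk longer than `max_length` tokens would lose its opening words rather than its closing ones.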


In [44]:
# Set target variable
y = ted_model[['persuasive_label', 'inspiring_label', 'unconvincing_label']].values
print('Shape of label tensor:', y.shape)

Shape of label tensor: (2467, 3)
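
Each row of `y` holds three independent binary labels, so a single talk can have several labels set to 1 at once. The sketch below, using hypothetical labels and logits, shows why that matters when choosing an output activation: softmax scores must sum to 1, whereas sigmoid scores are independent per label.

```python
import numpy as np

# Hypothetical label rows: the first talk is both persuasive and inspiring.
y_demo = np.array([[1, 1, 0],
                   [0, 0, 0],
                   [1, 0, 1]])
labels_per_talk = y_demo.sum(axis=1)

# Hypothetical raw model outputs for one talk.
logits = np.array([2.0, 1.5, -3.0])
softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

print(labels_per_talk)
print(softmax.round(3), sigmoid.round(3))
```

With softmax, at most one label can score above 0.5; with sigmoid, the first two labels both do.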


In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.10, random_state = 42)
print('Train set shapes:', X_train.shape, y_train.shape)
print('Test set shapes:', X_test.shape, y_test.shape)

Train set shapes: (2220, 10000) (2220, 3)
Test set shapes: (247, 10000) (247, 3)
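
With median-split labels, a useful reference point is the accuracy of always predicting the majority class for each label. A minimal sketch on hypothetical labels follows; in the notebook, `y_train` would take the place of `y_demo`.

```python
import numpy as np

# Hypothetical label matrix with columns (persuasive, inspiring, unconvincing).
y_demo = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1],
                   [1, 1, 0]])

positive_share = y_demo.mean(axis=0)

# Always guessing the more common class gets this fraction right, per label.
baseline = np.maximum(positive_share, 1 - positive_share)
print(baseline)
```

A trained model is only informative for a label if its test accuracy clears the corresponding baseline.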


In [48]:
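# Instantiate model
model = Sequential()

model.add(Embedding(max_words, embed, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
# The three labels are independent binary targets (a talk can have several set to 1),
# so use one sigmoid output per label rather than a softmax over the three
model.add(Dense(3, activation='sigmoid'))

# Binary cross-entropy scores each label separately
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

print(model.summary())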

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 10000, 100)        5000000   
_________________________________________________________________
spatial_dropout1d_3 (Spatial (None, 10000, 100)        0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_3 (Dense)              (None, 3)                 303       
Total params: 5,080,703
Trainable params: 5,080,703
Non-trainable params: 0
_________________________________________________________________
None
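
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: an embedding layer has vocabulary size times embedding dimension weights, an LSTM has four gates each with input weights, recurrent weights, and a bias, and a dense layer has one weight per input-output pair plus a bias per output. The variable names are illustrative.

```python
max_words, embed, units, n_labels = 50000, 100, 100, 3

embedding_params = max_words * embed                 # one vector per vocabulary entry
lstm_params = 4 * ((embed + units) * units + units)  # 4 gates: input + recurrent weights + bias
dense_params = units * n_labels + n_labels           # weights + biases

total_params = embedding_params + lstm_params + dense_params
print(f"{embedding_params:,} + {lstm_params:,} + {dense_params:,} = {total_params:,}")
```

Almost all of the trainable parameters sit in the embedding layer, which is learned from only 2,220 training transcripts.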


In [49]:
history = model.fit(X_train, y_train, 
                    epochs=5, 
                    batch_size=64,
                    validation_split=0.1,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])

Instructions for updating:
Use tf.cast instead.
Train on 1998 samples, validate on 222 samples
Epoch 1/5

KeyboardInterrupt: 
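
The run above was interrupted during the first epoch. One plausible contributor (an assumption, not something the log confirms) is that every sample is padded to 10,000 timesteps, which the LSTM has to process one step at a time. A minimal sketch of picking a shorter `max_length` from the distribution of transcript lengths, using hypothetical token counts:

```python
import numpy as np

# Hypothetical token counts per transcript; in the notebook these could be
# computed as [len(seq) for seq in tokenizer.texts_to_sequences(...)] before padding.
lengths = np.array([120, 800, 950, 1100, 1300, 1500, 2100, 4000])

for q in (50, 90, 95, 100):
    print(q, int(np.percentile(lengths, q)))

# Candidate cut-off: the 90th percentile of transcript length.
candidate_max_length = int(np.percentile(lengths, 90))

# Fraction of transcripts that would be truncated at that cut-off.
truncated_share = float(np.mean(lengths > candidate_max_length))
print(candidate_max_length, truncated_share)
```

The trade-off is between training time per epoch and how many long transcripts lose part of their text.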