# Capstone Project: The Persuasive Power of Words

*by Nee Bimin*

## Notebook 3: Modeling and Conclusion

In this notebook, we will predict the number of ratings per view.

## Content
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    * [Ratings EDA](#Ratings-EDA)
    * [Occupation EDA](#Occupation-EDA)
- [Pre-processing](#Preprocessing)
    * [Tokenizing](#Tokenizing)
    * [Lemmatizing](#Lemmatizing)
    * [Stemming](#Stemming)

In [2]:
import matplotlib.pyplot as plt
import matplotlib

import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
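import settings
import re  # used to strip non-ASCII characters before sentiment scoring
import pandas as pd
import numpy as np
import operator
import graphviz
from textblob import TextBlob  # used for description and title sentiment
from sklearn.tree import export_graphviz

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor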

%matplotlib inline

ModuleNotFoundError: No module named 'settings'

In [15]:
# Read in data
ted = pd.read_csv('../data/ted_cleaned.csv')

# keep only the columns of interest
ted_model = ted.loc[:, ['num_speaker','duration','comments','languages','views', 'film_date', 'published_date']]

ted_model.head(5)

Unnamed: 0,num_speaker,duration,comments,languages,views,film_date,published_date
0,1,19.4,4553,60,47227110,1970-01-01 00:00:01.140825600,1970-01-01 00:00:01.151367060
1,1,16.283333,265,43,3200520,1970-01-01 00:00:01.140825600,1970-01-01 00:00:01.151367060
2,1,21.433333,124,26,1636292,1970-01-01 00:00:01.140739200,1970-01-01 00:00:01.151367060
3,1,18.6,200,35,1697550,1970-01-01 00:00:01.140912000,1970-01-01 00:00:01.151367060
4,1,19.833333,593,48,12005869,1970-01-01 00:00:01.140566400,1970-01-01 00:00:01.151440680
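
The `film_date` and `published_date` values above all fall about one second after 1970-01-01, which suggests (this is an assumption, not something the notebook confirms) that Unix timestamps in seconds were parsed as nanoseconds. A minimal sketch of that reading, using only the standard library and the digits shown in the first row:

```python
from datetime import datetime, timezone

# Digits after "1970-01-01 00:00:01." in the first row of the table above,
# read here as whole seconds since the Unix epoch (an assumption).
film_seconds = 1140825600
published_seconds = 1151367060

film_date = datetime.fromtimestamp(film_seconds, tz=timezone.utc)
published_date = datetime.fromtimestamp(published_seconds, tz=timezone.utc)

print(film_date.date(), published_date.date())
```

If that reading is right, the pandas equivalent would be `pd.to_datetime(column, unit='s')` applied to the raw integer columns before they are used as features.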


The dataframe above holds the numerical features, along with the two date columns.
The description and title of each talk can be analysed separately using sentiment analysis with the TextBlob library. The resulting polarity scores are then added as features.

In [29]:
# Strip non-ASCII characters, then score each description's polarity
ted_model['descr_sentiment'] = ted['description'].apply(
    lambda x: TextBlob(re.sub(r'[^\x00-\x7f]', r'', x)).sentiment.polarity)

# Create dataframe to display sentiment
sentiment = pd.DataFrame(ted['description'])
sentiment['description_sentiment'] = ted_model['descr_sentiment']
sentiment.head()

Unnamed: 0,description,description_sentiment
0,Sir Ken Robinson makes an entertaining and pro...,0.291667
1,With the same humor and humanity he exuded in ...,-0.115909
2,New York Times columnist David Pogue takes aim...,-0.081981
3,"In an emotionally charged talk, MacArthur-winn...",0.0
4,You've never seen data presented like this. Wi...,0.0
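
The cell above removes every non-ASCII character with `re.sub(r'[^\x00-\x7f]', r'', x)` before the text reaches TextBlob. A small standalone sketch of that cleaning step follows; the helper name `strip_non_ascii` and the sample sentence are hypothetical and used only for illustration.

```python
import re

# Matches any character outside the 7-bit ASCII range (code points 0-127).
NON_ASCII = re.compile(r'[^\x00-\x7f]')

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub('', text)

cleaned = strip_non_ascii('A “provocative” talk on café culture')
print(cleaned)
```

Characters are dropped rather than transliterated, so curly quotes disappear and "café" becomes "caf". That is acceptable for a rough polarity score but can alter words the sentiment lexicon would otherwise recognise.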


In [30]:
# Strip non-ASCII characters, then score each title's polarity
ted_model['title_sentiment'] = ted['title'].apply(
    lambda x: TextBlob(re.sub(r'[^\x00-\x7f]', r'', x)).sentiment.polarity)

# Include in the sentiment dataframe
sentiment['title'] = ted['title']
sentiment['title_sentiment'] = ted_model['title_sentiment']
sentiment.head()

Unnamed: 0,description,description_sentiment,title,title_sentiment
0,Sir Ken Robinson makes an entertaining and pro...,0.291667,Do schools kill creativity?,0.0
1,With the same humor and humanity he exuded in ...,-0.115909,Averting the climate crisis,0.0
2,New York Times columnist David Pogue takes aim...,-0.081981,Simplicity sells,0.0
3,"In an emotionally charged talk, MacArthur-winn...",0.0,Greening the ghetto,0.0
4,You've never seen data presented like this. Wi...,0.0,The best stats you've ever seen,1.0


### One-Hot Encode Tags and Event Locations

In [32]:
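import ast

# Parse each stringified tag list safely (literal_eval only accepts Python literals)
ted['tags'] = ted['tags'].apply(lambda x: ast.literal_eval(str(x)))

# Assign each distinct tag a column index
all_tags = {}
count = 0
for talk in ted['tags']:
    for tag in talk:
        if tag not in all_tags:
            all_tags[tag] = count
            count += 1

# Build one row per talk with a 1 in the column of each tag it carries
onehot = np.zeros((len(ted), count))
for i, talk in enumerate(ted['tags']):
    for tag in talk:
        onehot[i, all_tags[tag]] = 1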
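
The one-hot encoding above can be checked on a tiny example. The sketch below uses a hypothetical three-talk tag list in place of `ted['tags']` and allocates the matrix once instead of concatenating a row per talk; all names in it are illustrative.

```python
import numpy as np

# Hypothetical stand-in for ted['tags'] after parsing: one list of tags per talk.
talks = [['education', 'creativity'], ['climate'], ['education', 'climate']]

# Map each distinct tag to a column index, in order of first appearance.
tag_index = {}
for tags in talks:
    for tag in tags:
        tag_index.setdefault(tag, len(tag_index))

# Allocate the full matrix once, then set one cell per (talk, tag) pair.
onehot_demo = np.zeros((len(talks), len(tag_index)))
for row, tags in enumerate(talks):
    for tag in tags:
        onehot_demo[row, tag_index[tag]] = 1

print(tag_index)
print(onehot_demo)
```

Each row has a 1 in the column of every tag that talk carries, so a talk with two tags contributes two ones to its row.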

In [35]:
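ted_model_np = ted_model.values
# .values is an attribute, not a method; 'avgPerRating' must already be a column of ted
all_y = np.reshape(ted['avgPerRating'].values, (-1, 1))
all_X = np.concatenate((ted_model_np, onehot), 1)
combined = np.concatenate((all_X, all_y), 1)

# Shuffle the rows, then keep the first 75% for training and the rest for validation
np.random.shuffle(combined)
data_size = np.shape(all_y)[0]
train_size = int(data_size * 0.75)
feature_size = np.shape(all_X)[1]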

X_train = combined[0:train_size,0:feature_size]
y_train = np.reshape(combined[0:train_size,feature_size],(-1,1))
X_val = combined[train_size:data_size,0:feature_size]
y_val = np.reshape(combined[train_size:data_size,feature_size],(-1,1))

KeyError: 'avgPerRating'
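
The shuffle-and-split step in the cell above can be sketched on toy data while the target column is unavailable. This is a minimal illustration under stated assumptions: the arrays, the seed, and every `_demo` name are hypothetical, and only the 75/25 ratio comes from the notebook.

```python
import numpy as np

# Hypothetical toy data: 8 rows, 3 feature columns and 1 target column.
X_demo = np.arange(24, dtype=float).reshape(8, 3)
y_demo = np.arange(8, dtype=float).reshape(-1, 1)

# A seeded generator makes the shuffle repeatable across runs.
rng = np.random.default_rng(42)
order = rng.permutation(len(X_demo))

# First 75% of the shuffled rows go to training, the rest to validation.
n_train = int(len(X_demo) * 0.75)
train_idx, val_idx = order[:n_train], order[n_train:]

X_train_demo, y_train_demo = X_demo[train_idx], y_demo[train_idx]
X_val_demo, y_val_demo = X_demo[val_idx], y_demo[val_idx]

print(X_train_demo.shape, X_val_demo.shape)
```

Indexing features and target with the same permutation keeps rows aligned without first concatenating them into one array, and the fixed seed means the same talks land in the validation set on every run.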