### 2.2 Variables for text analysis

For the variables "title", "plot outline", "plot", "mpaa_reason", and "tagline_TMDB", we apply text analysis to extract keywords that may be informative for predicting genre.

To do this, we first perform a routine bag-of-words analysis: we filter out non-word items (punctuation and numbers) from the text and turn each string into frequency counts of its words.

For each variable, we then apply PCA to the bag-of-words matrix and keep the top PCs. The number of PCs will eventually be chosen based on the variance they explain; for now, I use 10 PCs as a demonstration. The resulting matrix becomes the new feature matrix for that variable.
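As a rough illustration of how the number of PCs could be chosen from the explained variance, the sketch below fits PCA with all components on a toy count matrix and picks the smallest number of PCs whose cumulative explained variance reaches a cutoff. The random matrix `toy_counts` and the 80% `threshold` are assumptions made for this example, not values from the analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for a bag-of-words count matrix (100 documents, 50 words).
# The real matrix would come from CountVectorizer; these counts are random.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=0.3, size=(100, 50))

# Fit PCA with all components, then accumulate the explained variance ratios.
pca_full = PCA().fit(toy_counts)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of PCs whose cumulative explained variance reaches the cutoff.
threshold = 0.80  # assumed cutoff, not taken from the analysis
n_pcs = int(np.searchsorted(cum_var, threshold) + 1)
print(n_pcs, cum_var[n_pcs - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which selects the number of components by explained variance directly.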

In [1]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("feature_multi_top100.txt")
df.shape

(100, 43)

In [3]:
col_words = []
colname = "plot outline"

df_col = df[colname]

In [54]:
col_words = []

for text in df_col:
    if isinstance(text, str):
        # Replace everything except letters with spaces
        letters_only = re.sub("[^a-zA-Z]", " ", text)
        # Lower-case, then split and re-join to collapse repeated whitespace
        words = " ".join(letters_only.lower().split())
    else:
        # Missing values (NaN) become the placeholder string "NA"
        words = "NA"

    col_words.append(words)
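To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same rules to a single string. The function name `clean_text` and the sample sentence are made up for illustration.

```python
import re


def clean_text(text):
    """Hypothetical helper mirroring the cleaning loop above."""
    if isinstance(text, str):
        # Keep letters only, lower-case, and collapse whitespace
        letters_only = re.sub("[^a-zA-Z]", " ", text)
        return " ".join(letters_only.lower().split())
    # Non-string entries (e.g. NaN) become the placeholder "NA"
    return "NA"


sample = "In 1977, a farm boy joins the Rebels!"  # invented example text
print(clean_text(sample))        # in a farm boy joins the rebels
print(clean_text(float("nan")))  # NA
```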

In [79]:
# run "pip install --user --install-option="--prefix=" -U scikit-learn" on terminal
from sklearn.feature_extraction.text import CountVectorizer
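vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# Document-term counts, converted from a sparse matrix to a dense DataFrame
col_data = vectorizer.fit_transform(col_words)
col_data = pd.DataFrame(col_data.toarray())

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocab = list(vectorizer.get_feature_names_out())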
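With `stop_words = None`, common function words stay in the vocabulary. As a possible variation (not what the cell above does), the sketch below compares the vocabulary with and without scikit-learn's built-in English stop-word list on two made-up sentences. The names `toy_docs`, `vec_all`, and `vec_stop` are assumptions for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up plot outlines, for illustration only
toy_docs = [
    "a young writer moves to new york",
    "the crew is able to save the world",
]

# Same analyzer as above; the second vectorizer adds the built-in English stop list
vec_all = CountVectorizer(analyzer="word", stop_words=None)
vec_stop = CountVectorizer(analyzer="word", stop_words="english")

vocab_all = set(vec_all.fit(toy_docs).vocabulary_)
vocab_stop = set(vec_stop.fit(toy_docs).vocabulary_)

# Words dropped by the stop-word filter
print(sorted(vocab_all - vocab_stop))
```

Whether removing stop words helps genre prediction here is an open question; the comparison only shows how the option changes the vocabulary.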

In [99]:
df_new = pd.concat([df["imdb_ids"], col_data], axis = 1)
col_names = ["imdb_ids"] + vocab
df_new.columns = col_names

In [101]:
df_new.head()

Unnamed: 0,imdb_ids,ability,able,about,absent,accident,accidentally,accomplices,accused,acting,...,works,world,wreak,wreaks,writer,year,years,york,young,zefram
0,113101,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,425473,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,76759,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,266543,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,411267,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0


In [102]:
df_new.shape

(100, 1085)

We can see that the text analysis on the first 100 rows generates a bag-of-words vocabulary of 1,084 words (1,085 columns including `imdb_ids`). The next step is to apply PCA to reduce the feature dimension.

In [112]:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
df_pca = pd.DataFrame(pca.fit_transform(col_data))

In [124]:
df_new_pca = pd.concat([df["imdb_ids"], df_pca], axis = 1)

col_names = ["imdb_ids"]
for i in range(10):
    i_name = "plot_outline_PC" + str(i)
    col_names.append(i_name)
    
df_new_pca.columns = col_names
# index=False keeps the row index from being written as an extra unnamed column
df_new_pca.to_csv("plot_outline_text_analysis.txt", index=False)

In [125]:
df_new_pca = pd.read_csv("plot_outline_text_analysis.txt")
df_new_pca.head()

Unnamed: 0,imdb_ids,plot_outline_PC0,plot_outline_PC1,plot_outline_PC2,plot_outline_PC3,plot_outline_PC4,plot_outline_PC5,plot_outline_PC6,plot_outline_PC7,plot_outline_PC8,plot_outline_PC9
0,113101,-1.315679,-0.934888,0.347586,0.373653,0.315814,-0.221987,-0.553449,0.063865,-0.543138,-0.41548
1,425473,-1.433989,-0.003703,-0.181678,0.771971,-0.805442,-0.060474,0.382522,-0.603814,0.145042,-0.0581
2,76759,1.662408,1.688841,0.339102,0.009727,-1.804681,-0.567208,-1.597238,-1.214306,1.035848,-0.069215
3,266543,-0.489223,1.600604,0.524656,0.294343,0.025439,-0.434347,0.237392,0.412804,-0.30785,-0.274183
4,411267,-0.073453,-0.043161,0.734877,-0.353448,-0.696221,1.357004,0.443975,1.99399,0.919077,0.473163
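Since the same steps are meant to be repeated for each text variable, they could be wrapped in a single function. The following is a minimal sketch under that assumption: the helper `text_column_to_pcs`, its parameters, and the toy DataFrame are hypothetical and only mirror the cleaning, counting, and PCA steps shown for "plot outline".

```python
import re

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer


def text_column_to_pcs(df, colname, n_components=10, id_col="imdb_ids"):
    """Hypothetical helper: bag-of-words + PCA features for one text column."""
    # Clean each entry: letters only, lower-case, "NA" for missing text
    cleaned = [
        " ".join(re.sub("[^a-zA-Z]", " ", t).lower().split())
        if isinstance(t, str) else "NA"
        for t in df[colname]
    ]
    vectorizer = CountVectorizer(analyzer="word", max_features=5000)
    counts = vectorizer.fit_transform(cleaned)
    pcs = PCA(n_components=n_components).fit_transform(counts.toarray())

    # Name the columns after the variable, e.g. plot_outline_PC0
    prefix = colname.replace(" ", "_")
    out = pd.DataFrame(
        pcs,
        columns=[f"{prefix}_PC{i}" for i in range(n_components)],
        index=df.index,
    )
    out.insert(0, id_col, df[id_col])
    return out


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "imdb_ids": [1, 2, 3, 4],
    "tagline_TMDB": [
        "One ring to rule them all",
        None,
        "In space no one can hear you scream",
        "Just when you thought it was safe",
    ],
})
features = text_column_to_pcs(toy, "tagline_TMDB", n_components=2)
print(features.columns.tolist())
```

On the real data, the same call could be looped over the five text variables and the resulting frames merged on `imdb_ids`.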
