# Film Genre Prediction
In this notebook I explore predicting the genre of almost 44,000 films from a text-based overview of each film. I experiment with several variations of three supervised classification algorithms: Logistic Regression, Multinomial Naive Bayes, and a Support Vector Machine (SVM) classifier.

The data used in this notebook is from The Movies Dataset on Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset).

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Seed numpy for consistency
np.random.seed(1)

### Read Data
I have already cleaned the raw data from Kaggle (parsed JSON, removed nonsense entries, merged data from separate files, etc.) and pickled the cleaned DataFrame.

In [3]:
movies = pd.read_pickle('data/movies_reduced.pkl')

In [4]:
print(len(movies))
movies.head()

43998


id,overview,num_genres,genres_list
862,"Led by Woody, Andy's toys live happily in his ...",3,"[animation, comedy, family]"
8844,When siblings Judy and Peter discover an encha...,3,"[adventure, fantasy, family]"
15602,A family wedding reignites the ancient feud be...,2,"[romance, comedy]"
31357,"Cheated on, mistreated and stepped on, the wom...",3,"[comedy, drama, romance]"
11862,Just when George Banks has recovered from his ...,1,[comedy]


In [5]:
movies.genres_list.apply(lambda row: len(row)).value_counts()

2    14154
1    13947
3     9429
4     3331
0     2128
5      829
6      154
7       23
8        3
Name: genres_list, dtype: int64

In [6]:
all_genres = []
for i, row in movies.iterrows():
    for genre in row.genres_list:
        if genre not in all_genres:
            all_genres.append(genre)

In [7]:
genre_counts = {genre:0 for genre in all_genres}
for i, row in movies.iterrows():
    for genre in row.genres_list:
        genre_counts[genre] += 1

In [8]:
genre_counts

{'animation': 1879,
 'comedy': 12689,
 'family': 2687,
 'adventure': 3442,
 'fantasy': 2269,
 'romance': 6607,
 'drama': 19833,
 'action': 6507,
 'crime': 4244,
 'thriller': 7532,
 'horror': 4616,
 'history': 1371,
 'science_fiction': 2995,
 'mystery': 2440,
 'war': 1304,
 'foreign': 1565,
 'music': 1556,
 'documentary': 3816,
 'western': 1029,
 'tv_movie': 739}

## Modeling
Many films naturally fall into several genres. However, for the purpose of modeling I will only predict the single most likely genre. To convert the text-based overview data into numerical data, I will use sklearn's TF-IDF (Term Frequency - Inverse Document Frequency) vectorizer. It turns each film overview into one row of a matrix with one column per vocabulary word; each entry is the word's frequency in that overview, weighted down the more overviews the word appears in across the whole collection.
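The weighting described above can be sketched in plain Python. This is a simplified illustration, not sklearn's implementation: it splits on whitespace, uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, and the function name `tfidf_rows` and the toy overviews are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights for tokenised docs."""
    n_docs = len(docs)
    vocab = sorted({word for doc in docs for word in doc})
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Smoothed inverse document frequency: rarer words get larger weights.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        # Scale each row to unit length so long and short overviews are comparable.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical toy overviews, not taken from the dataset.
toy_overviews = [
    "detective hunts the killer".split(),
    "the family adopts a dog".split(),
    "the killer shark returns".split(),
]
vocab, rows = tfidf_rows(toy_overviews)
weights = dict(zip(vocab, rows[0]))
print(weights["the"], weights["killer"], weights["detective"])
```

In the first toy overview, "the" (present in every overview) receives the smallest weight, "killer" (present in two) a larger one, and "detective" (present in one) the largest.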

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


### Build Design Matrices

In [10]:
# 80% train, 20% test (41870 total with a genre label)
genre_train = movies[movies.num_genres!=0].iloc[:33496]
genre_test = movies[movies.num_genres!=0].iloc[-8374:]

In [11]:
X_train = []
y_train = []
for i,row in genre_train.iterrows():
    for g in row.genres_list:
        X_train.append(row.overview)
        y_train.append(g)
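The loop above turns the multi-label data into single-label training rows by repeating each overview once per genre. A minimal sketch of the same expansion on two made-up films (the names `toy_films`, `X_toy`, and `y_toy` are hypothetical):

```python
# Hypothetical (overview, genres_list) pairs standing in for rows of genre_train.
toy_films = [
    ("Toys come to life when their owner leaves.", ["animation", "comedy", "family"]),
    ("A retired detective takes one last case.", ["crime"]),
]

X_toy, y_toy = [], []
for overview, genres in toy_films:
    # Repeat the overview once per genre so every label gets its own training row.
    for genre in genres:
        X_toy.append(overview)
        y_toy.append(genre)

print(len(X_toy), y_toy)  # 4 ['animation', 'comedy', 'family', 'crime']
```

One consequence is that the same text appears in the training set under several different labels.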

In [12]:
X_test = list(genre_test.overview)

In [13]:
predictions = pd.DataFrame(genre_test.genres_list)
predictions.columns = ['true_genres']

### Logistic Regression

#### One vs Rest

In [14]:
logistic_clf1 = make_pipeline(TfidfVectorizer(),LogisticRegression(multi_class='ovr'))
logistic_clf1.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [15]:
predictions['logistic_clf1'] = logistic_clf1.predict(X_test)

#### Cross Entropy Loss

In [16]:
logistic_clf2 = make_pipeline(TfidfVectorizer(), LogisticRegression(multi_class='multinomial', solver='lbfgs'))
logistic_clf2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [17]:
predictions['logistic_clf2'] = logistic_clf2.predict(X_test)

### Multinomial Naive Bayes


In [18]:
nb_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...   vocabulary=None)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [19]:
predictions['nb_clf'] = nb_clf.predict(X_test)

### Support Vector Machine
The SVM classifier fails to fit/converge with too many training instances, so I train it on only the first 5,000 training rows.

In [20]:
svm_clf = make_pipeline(TfidfVectorizer(), SVC())
svm_clf.fit(X_train[:5000], y_train[:5000])

Pipeline(memory=None,
     steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_i...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
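The cell above slices the first 5,000 rows, and since `X_train` is built in dataset order, that slice covers only the earliest films. One possible alternative, which is not what this notebook does, is a seeded random sample. The helper `sample_training_rows` and the demo lists below are hypothetical:

```python
import random

def sample_training_rows(X, y, n_rows, seed=1):
    """Return a random subset of paired (X, y) rows, keeping texts and labels aligned."""
    rng = random.Random(seed)
    n_rows = min(n_rows, len(X))
    # Sample row indices without replacement, then pick the same rows from X and y.
    chosen = rng.sample(range(len(X)), n_rows)
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Hypothetical stand-ins for the notebook's X_train and y_train lists.
X_demo = ["overview {}".format(i) for i in range(20)]
y_demo = ["drama" if i % 2 else "comedy" for i in range(20)]
X_sub, y_sub = sample_training_rows(X_demo, y_demo, n_rows=5)
print(len(X_sub), len(y_sub))
```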

In [21]:
predictions['svm_clf'] = svm_clf.predict(X_test)

### Comparison

In [22]:
predictions.head(6)

id,true_genres,logistic_clf1,logistic_clf2,nb_clf,svm_clf
46494,"[drama, crime]",crime,crime,drama,drama
50669,"[drama, thriller]",drama,drama,drama,drama
242354,[drama],drama,drama,drama,drama
39436,"[crime, horror, mystery, thriller]",crime,thriller,drama,drama
277968,"[comedy, drama]",drama,drama,drama,drama
127894,"[comedy, romance]",comedy,comedy,drama,drama


In [23]:
def model_accuracy(row, model_name):
    # A prediction counts as correct if it matches any of the film's true genres
    return row[model_name] in row.true_genres

In [24]:
model_names = list(predictions.columns)
model_names.remove('true_genres')

for model in model_names:
    predictions[model+"_correct"] = predictions.apply(lambda row: model_accuracy(row, model), axis=1)
    # The mean of a boolean column is the fraction of True values
    correct_percent = predictions[model+"_correct"].mean()
    print(model, "--> Accuracy: {0:.3f}".format(correct_percent))

logistic_clf1 --> Accuracy: 0.656
logistic_clf2 --> Accuracy: 0.660
nb_clf --> Accuracy: 0.447
svm_clf --> Accuracy: 0.408
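The accuracy reported above counts a prediction as correct when it matches any one of a film's true genres, which is more lenient than exact-match accuracy. A minimal sketch of that rule on made-up labels (the function `hit_rate` and the toy lists are hypothetical):

```python
def hit_rate(true_genre_lists, predicted_genres):
    """Fraction of films whose predicted genre appears in the film's true genre list."""
    hits = [pred in truth for truth, pred in zip(true_genre_lists, predicted_genres)]
    return sum(hits) / len(hits)

# Hypothetical true genre lists and single-genre predictions.
toy_true = [["drama", "crime"], ["comedy", "romance"], ["horror"], ["drama"]]
toy_pred = ["crime", "drama", "horror", "drama"]
print(hit_rate(toy_true, toy_pred))  # 0.75
```

Under this rule, films with more listed genres are easier to get right, because any one of their genres counts as a hit.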


In [25]:
percent_correct_genre = pd.DataFrame()
for model in model_names:  
    correct_genre = predictions.groupby([model, model+"_correct"]).true_genres.count()
    percent = correct_genre.groupby(level=0).apply(lambda x: x/float(np.sum(x))).unstack()[True]
    percent_correct_genre[model] = percent

percent_correct_genre.index.name = 'genre'
percent_correct_genre

genre,logistic_clf1,logistic_clf2,nb_clf,svm_clf
action,0.707965,0.691877,0.823529,
adventure,0.5625,0.5875,,
animation,0.972973,0.974359,,
comedy,0.665318,0.664009,0.771167,
crime,0.6,0.583333,,
documentary,0.878519,0.868383,1.0,
drama,0.595266,0.60734,0.428102,0.408407
family,0.764706,0.746032,,
fantasy,0.638889,0.666667,,
history,,,,


### Conclusion

Both the Naive Bayes and SVM classifiers, but especially the SVM classifier, overpredict the most common genre, 'drama'. Because of this, the 45% and 41% accuracies of these models are rather misleading. The SVM classifier predicted 'drama' for every film and still had an overall accuracy of 41%, which makes this the minimum baseline that all models should be compared against.

Clearly the two logistic regression classifiers outperform both the Naive Bayes and SVM classifiers. Both the one-vs-rest and cross-entropy logistic models outperformed the baseline model by about 25 percentage points (0.656 and 0.660 versus 0.408). This is a meaningful improvement and, as shown above, the logistic models do a relatively good job of predicting across most genres.
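The majority-class baseline discussed above can also be computed directly, without training a model. A minimal sketch, assuming the same "prediction is in the true genre list" scoring rule; the function `majority_baseline` and the toy data are hypothetical:

```python
from collections import Counter

def majority_baseline(train_labels, test_genre_lists):
    """Predict the most common training label for every film and score it by membership."""
    most_common_genre = Counter(train_labels).most_common(1)[0][0]
    # A film counts as a hit if the majority genre is among its true genres.
    hits = sum(most_common_genre in genres for genres in test_genre_lists)
    return most_common_genre, hits / len(test_genre_lists)

# Hypothetical training labels and test genre lists.
toy_train_labels = ["drama", "drama", "comedy", "drama", "horror", "comedy"]
toy_test_genres = [["drama", "crime"], ["comedy"], ["drama"],
                   ["horror", "thriller"], ["comedy", "drama"]]
genre, score = majority_baseline(toy_train_labels, toy_test_genres)
print(genre, score)  # drama 0.6
```

With this notebook's variables, a call such as `majority_baseline(y_train, predictions.true_genres)` would presumably reproduce the figure of roughly 0.41, though that has not been run here.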