## Do descriptions for Netflix Movies and TV Shows Differ?
Using a dataset of Netflix movies and TV shows, I want to see whether an item's type (movie or TV show) can be predicted from its description.

[Data Source](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-04-20/readme.md)

In [1]:
import pandas as pd
import numpy as np

### Data Exploration

In [2]:
data = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-20/netflix_titles.csv")
data.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [32]:
data.shape

(7787, 12)

In [6]:
data.groupby("type").size()

type
Movie      5377
TV Show    2410
dtype: int64

There are more than twice as many movies as TV shows in this dataset.

In [31]:
data.groupby(["country"]).size().sort_values(ascending=False)

country
United States                        2555
India                                 923
United Kingdom                        397
Japan                                 226
South Korea                           183
                                     ... 
Indonesia, South Korea, Singapore       1
Indonesia, United Kingdom               1
Indonesia, United States                1
Iran, France                            1
Zimbabwe                                1
Length: 681, dtype: int64

The United States accounts for more movies and TV shows than any other country.

I wonder if looking at just the United States produces a more balanced dataset for type.

In [23]:
# na=False treats rows with a missing country as non-matches
data_us = data[data["country"].str.contains("United States", na=False)]

In [24]:
data_us.groupby("type").size()

type
Movie      2431
TV Show     866
dtype: int64

The ratio of movies to TV shows is even larger among titles that list the United States as a country (alone or alongside other countries), so I will keep the entire dataset.

First, let's split the data into a train set and a test set. Before doing that, I need to encode the target as 0s and 1s.

In [34]:
from sklearn.model_selection import train_test_split

y = pd.get_dummies(data.type)["TV Show"] # create dummy variable where 1 = tv show, 0 = movie
X = data.description

# stratify=y keeps the proportion of 0s and 1s the same in the train and test sets as in y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=27, stratify=y)
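
To make the effect of `stratify` concrete, here is a minimal sketch on made-up labels. The arrays `y_toy` and `X_toy` are hypothetical and only mimic the roughly 31% share of TV shows in the real data; they are not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 31 positives (TV shows) and 69 negatives (movies).
y_toy = np.array([1] * 31 + [0] * 69)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=27, stratify=y_toy
)

# With stratification, both splits stay close to the overall share of 1s.
print(f"Share of 1s overall:  {y_toy.mean():.2f}")
print(f"Share of 1s in train: {y_tr.mean():.2f}")
print(f"Share of 1s in test:  {y_te.mean():.2f}")
```

Without `stratify`, a random split of an imbalanced target can drift away from these proportions, especially on small datasets.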

### Processing Text Data and Model Training

Next, I will tokenize the movie/TV show descriptions and build a sparse matrix that can be used for classification. I will not remove any stop words.

In [47]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)
X_train_transformed = vect.transform(X_train)

feature_names = vect.get_feature_names_out()

print(repr(X_train_transformed))
print(f"Number of vocabulary words: {len(feature_names)}")
print(f"Some of the features: {feature_names[::500]}")

<5840x15863 sparse matrix of type '<class 'numpy.int64'>'
	with 125662 stored elements in Compressed Sparse Row format>
Number of vocabulary words: 15863
Some of the features: ['000' 'aid' 'assailants' 'benjie' 'bucks' 'check' 'confessing' 'cyborgs'
 'dirty' 'eggy' 'experiments' 'fogged' 'gnasher' 'height' 'impacts' 'jeff'
 'lasting' 'maintaining' 'minecraft' 'netherlands' 'oven' 'piques'
 'providence' 'regarded' 'role' 'senses' 'snail' 'strangest' 'teddy'
 'treasures' 'valuable' 'wilbur']


There are 15,863 vocabulary words in total. Looking at every 500th feature name, we see a wide variety of words.

In [48]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(random_state=0, max_iter=1000), X_train_transformed, y_train, cv=5)
print(f"Mean cross validation score: {np.mean(scores)}")

Mean cross validation score: 0.7482876712328768


### Does removing stop words make the model any better?


I will try removing stop words using nltk's list of stop words.

In [51]:
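import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the stop word list if it is not already available locally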

vect = CountVectorizer(stop_words=stopwords.words("english")).fit(X_train)
X_train_transformed = vect.transform(X_train)

scores = cross_val_score(LogisticRegression(random_state=0, max_iter=1000), X_train_transformed, y_train, cv=5)
print(f"Mean cross validation score: {np.mean(scores)}")

Mean cross validation score: 0.7441780821917808


No, removing stop words does not improve the model (0.744 versus 0.748 with stop words kept).

### TF-IDF

I want to see whether TF-IDF (Term Frequency-Inverse Document Frequency) helps the model more than removing stop words did. TF-IDF gives less weight to vocabulary words that are more common (appearing in most of the descriptions).
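
To show how common words get down-weighted, here is a minimal hand-rolled sketch of the inverse document frequency part. It assumes the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is what scikit-learn documents for its default `smooth_idf=True`; the three tokenized descriptions in `docs` are hypothetical. The full vectorizer also multiplies by term counts and normalizes each row, which this sketch leaves out.

```python
import math

# Three hypothetical descriptions, already tokenized (illustration only).
docs = [
    ["the", "young", "detective", "hunts", "the", "killer"],
    ["the", "family", "moves", "to", "the", "farm"],
    ["the", "detective", "returns", "to", "the", "city"],
]
n_docs = len(docs)


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


# A word found in every document gets the lowest weight; rarer words get more.
for term in ["the", "detective", "farm"]:
    print(f"{term!r}: idf = {smoothed_idf(term):.3f}")
```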

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect_tfidf = TfidfVectorizer().fit(X_train)
X_train_tfidf = vect_tfidf.transform(X_train)

scores = cross_val_score(LogisticRegression(random_state=0, max_iter=1000), X_train_tfidf, y_train, cv=5)
print(f"Mean cross validation score: {np.mean(scores)}")

Mean cross validation score: 0.7453767123287671


We can see that using TF-IDF, removing stop words, or doing neither does not affect the validation score much.

### Using GridSearch for Hyperparameter Tuning

In [57]:
from sklearn.model_selection import GridSearchCV

vect = CountVectorizer().fit(X_train)
X_train_transformed = vect.transform(X_train)

grid = {"C": [0.001, 0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(random_state=0, max_iter=1000), param_grid=grid, cv=5)
grid_search.fit(X_train_transformed, y_train)

print(f"Best score: {grid_search.best_score_}")
print(f"Best parameter: {grid_search.best_params_}")

Best score: 0.7573630136986301
Best parameter: {'C': 0.1}


Again, there's not much improvement from tuning the C parameter (the inverse of regularization strength) in logistic regression: the best cross-validation score rises only from about 0.748 to 0.757, at C = 0.1.
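
One possible variation, not what this notebook does: in the cell above the vectorizer is fit on all of `X_train` before cross-validation, so each validation fold has already contributed to the vocabulary. Wrapping the vectorizer and the classifier in a `Pipeline` refits both inside every fold. The sketch below assumes a tiny made-up corpus (`toy_X`, `toy_y`) purely so that it runs on its own; the step names `vect` and `clf` are arbitrary choices.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical descriptions (1 = TV show, 0 = movie), only to make the sketch runnable.
toy_X = [
    "A series following detectives across several seasons.",
    "This docuseries explores kitchens around the world.",
    "Friends navigate life in this long-running sitcom series.",
    "Each episode of this series profiles a new chef.",
    "A retired boxer returns for one last fight.",
    "After a heist goes wrong, a thief must flee the city.",
    "A couple falls in love during a summer in Rome.",
    "In this documentary film, climbers attempt a deadly peak.",
]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(random_state=0, max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"clf__C": [0.001, 0.01, 0.1, 1, 10]}
toy_search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
toy_search.fit(toy_X, toy_y)

print(toy_search.best_params_)
```

For plain word counts the difference is likely small, since only the vocabulary is learned from the data, but the same structure also lets the grid search tune vectorizer settings alongside C.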

### Final Model
How well does the model perform on the test set?

In [63]:
X_test_transformed = vect.transform(X_test)

lg_model = grid_search.best_estimator_
final_score = lg_model.score(X_test_transformed, y_test)

print(f"Mean accuracy: {final_score}")

Mean accuracy: 0.7616846430405753


The model performs slightly better on the test set than it did in cross-validation on the train set (0.757), with an accuracy of 0.76.
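
For context, it can help to compare that accuracy with a majority-class baseline, that is, always predicting "Movie". The short calculation below uses only the class counts and the test accuracy reported earlier in this notebook, and assumes the stratified test set has about the same class proportions as the full dataset.

```python
# Class counts reported by data.groupby("type").size() earlier in the notebook.
n_movies, n_tv_shows = 5377, 2410

# Accuracy of always predicting the larger class.
majority_baseline = max(n_movies, n_tv_shows) / (n_movies + n_tv_shows)

# Test-set accuracy printed by the final model cell.
test_accuracy = 0.7616846430405753

print(f"Majority-class baseline: {majority_baseline:.3f}")
print(f"Gain over baseline:      {test_accuracy - majority_baseline:.3f}")
```

So the descriptions do carry some signal about type, but the gain over simply guessing "Movie" every time is modest.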