<a href="https://colab.research.google.com/github/magikarp01/SIFNetflix/blob/master/SIF_Netflix_Dataset_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook predicts the quality of a movie or series, measured by its IMDb score. It uses multiple regression and neural networks to handle different kinds of predictive input.



In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import math

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import preprocessing

In [4]:
# read dataset into dataframe
df = pd.read_excel("Netflix Dataset Latest 2021.xlsx")

In [31]:
# stratify data by movie or series, since they are judged very differently
movie_dataframe = df[df["Series or Movie"] == "Movie"].reset_index()
series_dataframe = df[df["Series or Movie"] == "Series"].reset_index()

First, a multiple regression of the IMDb score on the genres each movie/series belongs to

In [51]:
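# written as a function so it works with either movie_dataframe or series_dataframe
# returns x_data, y_data, genre_list
# x_data is a 2d np array of binary flags: whether each movie/series has a particular genre
# y_data is a 1d np array of IMDb scores, scaled to 0-1 by dividing by 10
# genre_list is the list of genre names, in the same order as the columns of x_data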
def preprocess_genre_regression(dataframe):
  genre_column = dataframe["Genre"]
  imdb_column = dataframe["IMDb Score"]
  # a list of the (string) genres
  genre_list = []

  # a list of imdb reviews for each entry
  imdb_reviews = []

  # a list of the genres for each entry
  genre_data = []
  
  # process data to handle empty/bad rows
  for i in range(len(genre_column)):
    genre_cell = genre_column[i]
    if genre_cell is not None and imdb_column[i] is not None and not math.isnan(imdb_column[i]):
        try:
            cell_genres = genre_cell.split(", ")
            genre_data.append(cell_genres)
            imdb_reviews.append(imdb_column[i])
        except AttributeError:
            continue

        for genre in cell_genres:
            if genre not in genre_list:
                genre_list.append(genre)

  print(genre_list)
  genredict = {k: v for v, k in enumerate(genre_list)}

  # x_data is a 2d array, each row is an array of binary entries
  # each binary entry corresponds to if movie/series is part of a genre 
  # 0 if no and 1 if yes
  x_data = []
  for entry in genre_data:
      entry_genres = [0]*len(genre_list)
      for genre in entry:
          entry_genres[genredict[genre]] = 1
      x_data.append(entry_genres)
  x_data = np.array(x_data)
  y_data = np.array(imdb_reviews)/10

  return x_data, y_data, genre_list
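
The multi-hot encoding built by hand above can also be sketched with pandas' `Series.str.get_dummies`. This is only an illustrative alternative, not the notebook's method: the helper name `genre_indicator_frame` and the toy rows are assumptions, and the resulting columns come out in alphabetical order rather than in first-seen order like `genre_list`.

```python
import pandas as pd

def genre_indicator_frame(dataframe, genre_col="Genre", score_col="IMDb Score"):
    """Hypothetical helper: one 0/1 column per genre, plus scores scaled to 0-1."""
    # drop rows where either the genre string or the score is missing
    clean = dataframe.dropna(subset=[genre_col, score_col])
    # split "Comedy, Drama" on ", " and build one indicator column per genre
    x = clean[genre_col].str.get_dummies(sep=", ")
    # same scaling as the notebook: IMDb score divided by 10
    y = clean[score_col] / 10
    return x, y

# toy data (made up for illustration, not taken from the Netflix dataset)
toy = pd.DataFrame({
    "Genre": ["Comedy, Drama", "Drama", None, "Action, Comedy"],
    "IMDb Score": [7.0, 8.0, 6.5, float("nan")],
})
x, y = genre_indicator_frame(toy)
print(x)
print(y)
```

Only the first two toy rows survive the cleaning step, so the indicator frame has the two columns `Comedy` and `Drama`.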


In [52]:
# Perform the multiple regression on the processed data from previous function
def train_genre_regression(x_data, y_data, test_size=0.2, random_state=101, print_output = False):
  X_train, X_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=test_size, random_state=random_state)

  # creating a regression model
  model = LinearRegression()

  # fitting the model
  model.fit(X_train, y_train)

  # making predictions
  predictions = model.predict(X_test)

  # model evaluation
  print('mean_squared_error : ', mean_squared_error(y_test, predictions))
  # print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))
  print(f'model R^2: {model.score(X_test, y_test)}')
  # print(f'model coefficients: {model.coef_}')

  if print_output:
      for i in range(len(y_test)):
        print(f"predicts {predictions[i]}, actual review is {y_test[i]}")

  return model

def model_output(model, genre_list):
  coef_dic = dict(zip(genre_list, model.coef_))
  for k, v in coef_dic.items():
      print(f"For genre {k}, the coefficient is {v}")  
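
Because the targets were divided by 10 during preprocessing, each coefficient is on the 0-1 scale; multiplying by 10 converts it back to IMDb points. A minimal sketch of ranking genres by that rescaled effect follows. The function name `rank_genre_effects` and the rounded toy coefficients are assumptions for illustration.

```python
import numpy as np

def rank_genre_effects(genre_list, coefficients, scale=10.0):
    """Hypothetical helper: sort genres by coefficient, expressed in IMDb points."""
    # undo the /10 scaling applied to the targets
    effects = np.asarray(coefficients, dtype=float) * scale
    # indices from the largest positive effect to the most negative one
    order = np.argsort(effects)[::-1]
    return [(genre_list[i], float(effects[i])) for i in order]

# toy coefficients (rounded, for illustration only)
ranked = rank_genre_effects(["Comedy", "Drama", "Thriller"], [-0.016, 0.024, -0.026])
for genre, effect in ranked:
    print(f"{genre}: {effect:+.2f} IMDb points")
```

With a fitted model this could be called as `rank_genre_effects(genre_list, model.coef_)`. Each value is the change in predicted score when a title carries that genre tag, holding the other tags fixed.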

In [53]:
# put it all together
def genre_regression(dataframe):
  x_data, y_data, genre_list = preprocess_genre_regression(dataframe)
  model = train_genre_regression(x_data, y_data)
  model_output(model, genre_list)
  return model

genre_regression(movie_dataframe)
# genre_regression(series_dataframe)

['Comedy', 'Romance', 'Drama', 'Crime', 'Fantasy', 'Mystery', 'Thriller', 'Short', 'Action', 'Adventure', 'Sci-Fi', 'Music', 'Family', 'Biography', 'Animation', 'War', 'History', 'Documentary', 'Horror', 'Film-Noir', 'Sport', 'Western', 'Musical', 'Reality-TV', 'Adult', 'News', 'Talk-Show']
mean_squared_error :  0.006606608489484805
model R^2: 0.16982544975353442
For genre Comedy, the coefficient is -0.01602164088416835
For genre Romance, the coefficient is -0.012631617961811675
For genre Drama, the coefficient is 0.02420954061188718
For genre Crime, the coefficient is 0.009489141835484157
For genre Fantasy, the coefficient is -0.0009438763731072829
For genre Mystery, the coefficient is 0.003640220577919371
For genre Thriller, the coefficient is -0.02632799540434819
For genre Short, the coefficient is 0.03233820605460993
For genre Action, the coefficient is -0.0168781347073299
For genre Adventure, the coefficient is -0.005170742785502887
For genre Sci-Fi, the coefficient is -0.00447804

LinearRegression()
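
The printed error is on the scaled targets, so it is easier to read after converting to IMDb points. Since R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the test-set mean, the error of a constant "predict the mean" baseline can be recovered from the two printed numbers alone. The sketch below does that arithmetic; the function name is an assumption.

```python
import math

def rmse_in_imdb_points(mse_scaled, r_squared, scale=10.0):
    """Hypothetical helper: model RMSE and mean-baseline RMSE, in IMDb points."""
    # RMSE of the model, rescaled from the 0-1 target range
    rmse = math.sqrt(mse_scaled) * scale
    # MSE / (1 - R^2) is the variance of the test targets around their mean,
    # which is the MSE of always predicting that mean
    baseline_rmse = math.sqrt(mse_scaled / (1.0 - r_squared)) * scale
    return rmse, baseline_rmse

# values copied from the output printed above
rmse, baseline = rmse_in_imdb_points(0.006606608489484805, 0.16982544975353442)
print(f"model RMSE: {rmse:.3f}, mean-baseline RMSE: {baseline:.3f}")
```

For the movie run above this works out to roughly 0.81 IMDb points for the genre model against roughly 0.89 for the constant baseline, so genre tags alone give a modest improvement.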

Another model, determining whether runtime has an effect on the quality of the movie

In [66]:
def runtime_stratification(dataframe):
  runtime_column = dataframe["Runtime"]
  imdb_column = dataframe["IMDb Score"]
  
  # runtime_data holds 4 groups, each with all IMDb scores for one runtime bucket
  # 0 corresponds to "< 30 minutes", 1 to "30-60 mins", 2 to "1-2 hour", 3 to "> 2 hrs"

  runtime_data = [[], [], [], []]
  # possible values in the runtime cell
  possible_runtimes = {"< 30 minutes": 0, "30-60 mins": 1, "1-2 hour": 2, "> 2 hrs": 3}
  # process data to handle empty/bad rows
  for i in range(len(runtime_column)):
    runtime_cell = runtime_column[i]
    if runtime_cell is not None and imdb_column[i] is not None and not math.isnan(imdb_column[i]):
        try:
            runtime_data[possible_runtimes[runtime_cell]].append(imdb_column[i])
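        except KeyError:
            # a missing (NaN) or unrecognized runtime label fails the dict lookup
            continue
  # the buckets have different sizes, so keep one 1d array per bucket
  # (np.array on ragged lists raises an error in recent NumPy versions)
  runtime_data = [np.array(scores) for scores in runtime_data]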

  runtime_names = list(possible_runtimes.keys())
  for i in range(4):
    print(f"For runtime {runtime_names[i]}, average IMDb score is {np.average(runtime_data[i])}, " +
          f"variance in IMDb score is {np.var(runtime_data[i])}")

  return runtime_data

runtime_stratification(movie_dataframe)
print()

For runtime < 30 minutes, average IMDb score is 7.076344086021504, variance in IMDb score is 0.46417158052954105
For runtime 30-60 mins, average IMDb score is 7.133088235294117, variance in IMDb score is 0.3467728157439447
For runtime 1-2 hour, average IMDb score is 6.632261768082665, variance in IMDb score is 0.7887142491200274
For runtime > 2 hrs, average IMDb score is 7.1072398190045245, variance in IMDb score is 0.5618221810130247
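
The same per-bucket summary can be sketched with a pandas `groupby`, which also reports how many titles fall in each bucket. This is an illustrative alternative rather than the notebook's code: the helper name `runtime_summary` and the toy rows are assumptions. `ddof=0` is passed so the variance matches `np.var`, which divides by n rather than n - 1.

```python
import pandas as pd

def runtime_summary(dataframe, runtime_col="Runtime", score_col="IMDb Score"):
    """Hypothetical helper: count, mean and population variance per runtime bucket."""
    # drop rows where the runtime label or the score is missing
    clean = dataframe.dropna(subset=[runtime_col, score_col])
    grouped = clean.groupby(runtime_col)[score_col]
    return pd.DataFrame({
        "count": grouped.count(),
        "mean": grouped.mean(),
        "variance": grouped.var(ddof=0),  # population variance, like np.var
    })

# toy data (made up for illustration, not taken from the Netflix dataset)
toy_runtime = pd.DataFrame({
    "Runtime": ["1-2 hour", "1-2 hour", "> 2 hrs", None],
    "IMDb Score": [6.0, 8.0, 7.0, 9.0],
})
summary = runtime_summary(toy_runtime)
print(summary)
```

The count column matters when reading the averages: a bucket with few titles can show a high mean without that difference being reliable.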



