# Prediction of Movie Preferences among Young Adults

The goal of this project is to analyze which characteristics of movies released since the year 2000 lead to higher ratings among adults between the ages of 18 and 30. The IMDb movie dataset used here is publicly available at https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset.

In [1]:
import numpy as np
import pandas as pd

# Load dataset
movie_data = pd.read_csv('imdb-extensive-dataset/IMDb movies.csv')
rating_data = pd.read_csv('imdb-extensive-dataset/IMDb ratings.csv')

In [2]:
# Explore movie data
movie_data.head()

Unnamed: 0,imdb_title_id,title,original_title,year,date_published,genre,duration,country,language,director,...,actors,description,avg_vote,votes,budget,usa_gross_income,worlwide_gross_income,metascore,reviews_from_users,reviews_from_critics
0,tt0000574,The Story of the Kelly Gang,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,Charles Tait,...,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,537,$ 2250,,,,7.0,7.0
1,tt0001892,Den sorte drøm,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,Urban Gad,...,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",Two men of high rank are both wooing the beaut...,5.9,171,,,,,4.0,2.0
2,tt0002101,Cleopatra,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,Charles L. Gaskill,...,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,420,$ 45000,,,,24.0,3.0
3,tt0002130,L'Inferno,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",...,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2019,,,,,28.0,14.0
4,tt0002199,"From the Manger to the Cross; or, Jesus of Naz...","From the Manger to the Cross; or, Jesus of Naz...",1912,1913,"Biography, Drama",60,USA,English,Sidney Olcott,...,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,438,,,,,12.0,5.0


In [3]:
# Explore rating data
rating_data.head()

Unnamed: 0,imdb_title_id,weighted_average_vote,total_votes,mean_vote,median_vote,votes_10,votes_9,votes_8,votes_7,votes_6,...,females_30age_avg_vote,females_30age_votes,females_45age_avg_vote,females_45age_votes,top1000_voters_rating,top1000_voters_votes,us_voters_rating,us_voters_votes,non_us_voters_rating,non_us_voters_votes
0,tt0000574,6.1,537,6.3,6.0,54,17,55,121,122,...,6.0,19.0,6.6,14.0,6.3,64.0,6.0,89.0,6.2,309.0
1,tt0001892,5.9,171,6.1,6.0,5,6,17,41,52,...,5.8,4.0,6.5,8.0,5.9,29.0,6.2,27.0,6.0,114.0
2,tt0002101,5.2,420,5.2,5.0,12,8,16,60,89,...,5.5,14.0,6.2,20.0,4.9,57.0,5.5,197.0,4.7,103.0
3,tt0002130,7.0,2019,6.9,7.0,194,208,386,571,308,...,7.3,74.0,7.4,75.0,7.0,126.0,7.1,452.0,7.0,1076.0
4,tt0002199,5.7,438,5.8,6.0,28,15,42,75,114,...,4.8,10.0,6.5,15.0,5.7,56.0,5.9,161.0,5.7,164.0


In [4]:
len(movie_data)

81273

## Data preprocessing

We want to consider only the columns that contain information relevant to our research question, so we remove the unnecessary columns from both datasets. Some of the remaining columns must be encoded differently before we can apply our models. This cleaning process is partly subjective, because it is difficult to define precisely which information is relevant to the project.

### Movie data

In [5]:
movie_data.columns

Index(['imdb_title_id', 'title', 'original_title', 'year', 'date_published',
       'genre', 'duration', 'country', 'language', 'director', 'writer',
       'production_company', 'actors', 'description', 'avg_vote', 'votes',
       'budget', 'usa_gross_income', 'worlwide_gross_income', 'metascore',
       'reviews_from_users', 'reviews_from_critics'],
      dtype='object')

The original movie dataset has the columns shown above. We want features that may correlate with the movie rating, so we keep only the following columns: `year`, `genre`, `duration`, `country`, `language`, `director`, `writer`, `production_company`, and `actors`. The title and description columns were removed because few words are likely to recur across movies, so a model might overestimate the importance of individual words and overfit. The budget column was dropped because the amounts are given in different currencies. The votes, gross income, and review columns were also removed because we want only information that is available before a movie's success is known; these columns would likely correlate with higher ratings but would not be helpful for prediction.

In [6]:
movie_data = movie_data[['year', 'genre', 'duration', 'country', 'language', 'director', 'writer', 'production_company', 'actors']]
movie_data.head()

Unnamed: 0,year,genre,duration,country,language,director,writer,production_company,actors
0,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be..."
1,1911,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse..."
2,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ..."
3,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L..."
4,1912,"Biography, Drama",60,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,..."


### Rating data

In [7]:
rating_data.columns

Index(['imdb_title_id', 'weighted_average_vote', 'total_votes', 'mean_vote',
       'median_vote', 'votes_10', 'votes_9', 'votes_8', 'votes_7', 'votes_6',
       'votes_5', 'votes_4', 'votes_3', 'votes_2', 'votes_1',
       'allgenders_0age_avg_vote', 'allgenders_0age_votes',
       'allgenders_18age_avg_vote', 'allgenders_18age_votes',
       'allgenders_30age_avg_vote', 'allgenders_30age_votes',
       'allgenders_45age_avg_vote', 'allgenders_45age_votes',
       'males_allages_avg_vote', 'males_allages_votes', 'males_0age_avg_vote',
       'males_0age_votes', 'males_18age_avg_vote', 'males_18age_votes',
       'males_30age_avg_vote', 'males_30age_votes', 'males_45age_avg_vote',
       'males_45age_votes', 'females_allages_avg_vote',
       'females_allages_votes', 'females_0age_avg_vote', 'females_0age_votes',
       'females_18age_avg_vote', 'females_18age_votes',
       'females_30age_avg_vote', 'females_30age_votes',
       'females_45age_avg_vote', 'females_45age_votes',
       

The original rating dataset has the columns shown above. We want to predict the movie preferences of young adults, so the following columns are relevant: `allgenders_18age_avg_vote`, `allgenders_18age_votes`, `males_18age_avg_vote`, `males_18age_votes`, `females_18age_avg_vote`, and `females_18age_votes`. These columns hold the average rating and the number of ratings given by all genders, males, and females between the ages of 18 and 30.

In [8]:
rating_data = rating_data[['allgenders_18age_avg_vote', 'allgenders_18age_votes', 'males_18age_avg_vote', 'males_18age_votes', 'females_18age_avg_vote', 'females_18age_votes']]
rating_data.head()

Unnamed: 0,allgenders_18age_avg_vote,allgenders_18age_votes,males_18age_avg_vote,males_18age_votes,females_18age_avg_vote,females_18age_votes
0,6.2,126.0,6.2,112.0,5.7,14.0
1,5.7,25.0,5.8,21.0,5.8,4.0
2,4.6,24.0,4.6,20.0,4.5,4.0
3,7.0,429.0,7.0,371.0,6.8,53.0
4,5.7,38.0,5.8,34.0,5.0,4.0
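
The selected columns all share the `_18age_` marker in their names, so the same kind of selection could also be written as a pattern match. The following is a minimal sketch on a hypothetical two-row frame (the values and the name `toy_ratings` are invented for illustration), using `DataFrame.filter`:

```python
import pandas as pd

# Toy frame with a few of the rating columns; the values are made up
toy_ratings = pd.DataFrame({
    'allgenders_18age_avg_vote': [6.2, 5.7],
    'allgenders_18age_votes': [126.0, 25.0],
    'allgenders_30age_avg_vote': [6.0, 5.9],
    'males_18age_avg_vote': [6.2, 5.8],
})

# Keep every column whose name contains the 18-30 age-group marker
young_adults = toy_ratings.filter(like='_18age_')
print(list(young_adults.columns))
```

The explicit list used in the notebook is easier to audit; the pattern is mainly convenient when many columns follow the same naming scheme.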


### Concatenate the data

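We currently have two datasets, `movie_data` and `rating_data`, and keeping them separate becomes hard to manage once we start removing movie entries to further clean our data. Thus, we combine them so that we can work with just one dataset. Because the `imdb_title_id` column has already been dropped from both, the concatenation aligns rows by index position, which assumes that both files list the movies in the same order.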

In [9]:
data = pd.concat([movie_data, rating_data], axis=1, sort=False)
data.head()

Unnamed: 0,year,genre,duration,country,language,director,writer,production_company,actors,allgenders_18age_avg_vote,allgenders_18age_votes,males_18age_avg_vote,males_18age_votes,females_18age_avg_vote,females_18age_votes
0,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",6.2,126.0,6.2,112.0,5.7,14.0
1,1911,Drama,53,"Germany, Denmark",,Urban Gad,"Urban Gad, Gebhard Schätzler-Perasini",Fotorama,"Asta Nielsen, Valdemar Psilander, Gunnar Helse...",5.7,25.0,5.8,21.0,5.8,4.0
2,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",4.6,24.0,4.6,20.0,4.5,4.0
3,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",7.0,429.0,7.0,371.0,6.8,53.0
4,1912,"Biography, Drama",60,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...",5.7,38.0,5.8,34.0,5.0,4.0
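
Concatenating side by side pairs rows by their index, so it is only correct if both files list the movies in the same order. An alternative that does not depend on row order is to join on the shared `imdb_title_id` column before dropping it. The sketch below uses two hypothetical miniature frames (`toy_movies` and `toy_ratings`, with invented ids and values) whose rows are deliberately in different orders:

```python
import pandas as pd

# Hypothetical miniature versions of the two files; row order differs on purpose
toy_movies = pd.DataFrame({
    'imdb_title_id': ['tt01', 'tt02', 'tt03'],
    'year': [2003, 2015, 2012],
})
toy_ratings = pd.DataFrame({
    'imdb_title_id': ['tt03', 'tt01', 'tt02'],
    'allgenders_18age_avg_vote': [6.2, 6.3, 4.2],
})

# Join on the shared identifier; validate raises an error if an id is duplicated
merged = toy_movies.merge(toy_ratings, on='imdb_title_id', how='inner',
                          validate='one_to_one')
print(merged)
```

In the notebook this would mean keeping `imdb_title_id` in both column selections until after the merge and dropping it afterwards.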


### Remove some entries

Before we move on to encoding the values of each feature, we need to drop the movie entries that were released before the year 2000 or that do not contain sufficient information. First, we are focusing on movies released in 2000 or later, so earlier movies are not relevant to our research. Second, the columns with the number of ratings show that some movies have very few votes, which may skew the average rating so that it does not represent preferences for that movie well. We handle this by keeping only movies that have at least 100 ratings from young adult males and at least 100 ratings from young adult females. We start with 81273 movie entries, so we can drop rows without worrying about having too little data left.

In [10]:
data = data.loc[data['year'] >= 2000]
data = data.loc[data['males_18age_votes'] >= 100.0]
data = data.loc[data['females_18age_votes'] >= 100.0]
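
The three filters can also be combined into one boolean mask, which keeps the threshold in a single place. This is a minimal sketch on a hypothetical four-row frame (the values and the name `min_votes` are invented for illustration):

```python
import pandas as pd

# Toy data: only the second row satisfies all three conditions
toy = pd.DataFrame({
    'year': [1999, 2005, 2010, 2018],
    'males_18age_votes': [500.0, 150.0, 80.0, 300.0],
    'females_18age_votes': [400.0, 120.0, 200.0, 50.0],
})

min_votes = 100.0  # same threshold for both groups, as in the text

# Each comparison yields a boolean Series; & combines them row by row
keep = (
    (toy['year'] >= 2000)
    & (toy['males_18age_votes'] >= min_votes)
    & (toy['females_18age_votes'] >= min_votes)
)
filtered = toy.loc[keep]
print(filtered)
```

Comparisons against a missing vote count evaluate to `False`, so rows without a vote count are removed by this kind of filter as well.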

Once we remove the movie entries that lack enough ratings, we can drop the columns `allgenders_18age_votes`, `males_18age_votes`, and `females_18age_votes` because we want to predict only the movie ratings. The columns that we choose to keep are `allgenders_18age_avg_vote`, `males_18age_avg_vote`, and `females_18age_avg_vote`.

In [11]:
data = data.drop(columns=['allgenders_18age_votes', 'males_18age_votes', 'females_18age_votes'])

We also notice that there are missing values for features of some movie entries, and we will deal with them by removing those rows from our dataset. After these necessary steps to remove some of our data, we are left with 9859 movie entries.

In [12]:
# Treat empty strings as missing values, then drop every row that has one
data = data.replace('', np.nan).dropna()

In [13]:
print(len(data))

9859
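
Before dropping rows, it can help to see which columns are responsible for the missing values. The sketch below counts them per column on a hypothetical four-row frame (all values are invented), treating empty strings as missing in the same way as the cleaning step:

```python
import numpy as np
import pandas as pd

# Toy data with an empty string, a None, and otherwise complete rows
toy = pd.DataFrame({
    'language': ['English', '', 'Italian', None],
    'director': ['A', 'B', '', 'D'],
    'duration': [100, 81, 88, 92],
})

cleaned = toy.replace('', np.nan)          # empty strings become NaN
missing_per_column = cleaned.isna().sum()  # number of missing values per column
complete_rows = cleaned.dropna()           # rows with no missing value at all

print(missing_per_column)
print(len(complete_rows))
```

If one column accounts for most of the loss, dropping that column instead of the rows might preserve more movies; whether that trade-off is worthwhile depends on how informative the column is.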


Unfortunately, due to the limited capacity of my computer, we can use only 1000 of the 9859 available movie entries for this project. This issue is mainly due to the one-hot feature encoding that we will need to perform in the next step, and my computer does not have enough memory to safely handle more data. As a result, we will need to select a random sample of 1000 movie entries to use for our analysis.

In [14]:
# Draw a random sample of 1000 rows without replacement
data = data.sample(n=1000)
data.head()

Unnamed: 0,year,genre,duration,country,language,director,writer,production_company,actors,allgenders_18age_avg_vote,males_18age_avg_vote,females_18age_avg_vote
31679,2003,"Comedy, Crime, Romance",100,USA,"English, French","Joel Coen, Ethan Coen","Robert Ramsey, Matthew Stone",Universal Pictures,"George Clooney, Catherine Zeta-Jones, Geoffrey...",6.3,6.3,6.1
66610,2015,"Horror, Thriller",81,"USA, UK",English,Paul Solet,Mike Le,Campfire,"Keir Gilchrist, Stella Maeve, Maestro Harrell,...",4.2,4.3,4.1
71322,2015,"Horror, Thriller",88,USA,English,Ben Jehoshua,"Barry Jay, Ben Jehoshua",Terror Films,"Kian Lawley, Elizabeth Keener, Angelica Cassid...",3.4,3.1,3.8
42618,2005,"Crime, Drama, Horror",92,USA,English,Alexander Bulkley,"Alexander Bulkley, Kelly Bulkley",Blackwater Films,"Justin Chambers, Robin Tunney, Rory Culkin, Wi...",5.5,5.4,5.8
56203,2012,"Crime, Drama, Mystery",82,USA,English,George Gallo,"George Gallo, Kevin Pollak",Oxymoron Entertainment,"Selma Blair, Amy Smart, Kevin Pollak, Jason An...",6.2,6.1,6.4
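
Because the sample is random, each run of the notebook selects different movies, and the results further down change with it. One way to make the selection repeatable is to fix a seed. This is a small sketch with `DataFrame.sample` on a hypothetical ten-row frame; the seed value 0 is an arbitrary choice:

```python
import pandas as pd

toy = pd.DataFrame({'year': range(2000, 2010)})

# The same seed yields the same rows on every run
sample_a = toy.sample(n=3, random_state=0)
sample_b = toy.sample(n=3, random_state=0)

print(sample_a.equals(sample_b))
```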


### Feature encoding

Of the kept columns, we use the following encoding scheme for each variable:

* `year`: standardize
* `genre`: one-hot
* `duration`: standardize
* `country`: one-hot
* `language`: one-hot
* `director`: one-hot
* `writer`: one-hot
* `production_company`: one-hot
* `actors`: one-hot

We start by standardizing the values in the `year` and `duration` columns.

In [15]:
data = data.apply(lambda x : (x - x.mean()) / x.std() if (x.name == 'year' or x.name == 'duration') else x)
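
Standardizing replaces each value $x$ with $z = (x - \bar{x}) / s$, where $\bar{x}$ is the column mean and $s$ is its standard deviation. Note that pandas computes the sample standard deviation (dividing by $n - 1$) by default. A worked sketch on a hypothetical three-row frame, where both columns have a standard deviation of exactly 10:

```python
import pandas as pd

# Toy data: year has mean 2010, duration has mean 100, both with std 10
toy = pd.DataFrame({'year': [2000, 2010, 2020], 'duration': [90, 100, 110]})
numeric_cols = ['year', 'duration']

# Subtract the column mean and divide by the column's sample standard deviation
standardized = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()
print(standardized)
```

Each column ends up with mean 0 and standard deviation 1, so `year` and `duration` are on comparable scales when they enter the models.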

We will one-hot encode the values in the other columns. Many of these columns contain multiple values separated by commas, so we need to extract each of them when we apply one-hot encoding.

In [16]:
# One-hot encode each multi-valued column, splitting its entries on ", "
for col in ['genre', 'country', 'language', 'director', 'writer', 'production_company', 'actors']:
    data = pd.concat([data.drop(col, axis=1), data[col].str.get_dummies(sep=", ")], axis=1)
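
To see what this encoding does to a comma-separated column, here is a small sketch applying `Series.str.get_dummies` to three genre strings taken from the sample rows shown earlier:

```python
import pandas as pd

genres = pd.Series([
    'Comedy, Crime, Romance',
    'Horror, Thriller',
    'Crime, Drama, Horror',
])

# Each distinct genre becomes its own 0/1 column; a movie can have several 1s
indicators = genres.str.get_dummies(sep=', ')
print(indicators)
```

Unlike a classic one-hot encoding, a row may contain more than one 1, because a movie can belong to several genres at once. The same holds for the people and company columns, which is why the number of columns grows so quickly.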

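We could also scale the values of the output ratings from a range of 0.0-10.0 to a range of 0.0-1.0. The line that does this is commented out in the cell below, so the ratings keep their original scale in the rest of the notebook.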

In [17]:
#data = data.apply(lambda x : x/10.0 if (x.name == 'allgenders_18age_avg_vote' or x.name == 'males_18age_avg_vote' or x.name == 'females_18age_avg_vote') else x)
data.head()

Unnamed: 0,year,duration,allgenders_18age_avg_vote,males_18age_avg_vote,females_18age_avg_vote,Action,Adventure,Animation,Biography,Comedy,...,Óscar Ladoire,Özay Fecht,Özgü Namal,Özgürcan Cevik,Özkan Mese,Øyvind Osmo Eriksen,Úrsula Corberó,Þröstur Leó Gunnarsson,Þór Jóhannesson,Þór Tulinius
31679,-1.35296,-0.349276,6.3,6.3,6.1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
66610,0.889206,-1.321336,4.2,4.3,4.1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
71322,0.889206,-0.963209,3.4,3.1,3.8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42618,-0.979266,-0.758565,5.5,5.4,5.8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56203,0.328664,-1.270175,6.2,6.1,6.4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
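
The encoded table is almost entirely zeros, which is what makes it so memory-hungry as a dense DataFrame. One possible alternative, not used in this notebook, is to store the indicator columns in a sparse matrix. The sketch below assumes scikit-learn's `MultiLabelBinarizer` with `sparse_output=True` and uses a few actor names from the sample rows:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

actors = pd.Series([
    'George Clooney, Catherine Zeta-Jones',
    'Keir Gilchrist, Stella Maeve',
    'George Clooney, Stella Maeve',
])

# Split each entry into a list of names, then encode into a SciPy sparse matrix
encoder = MultiLabelBinarizer(sparse_output=True)
actor_matrix = encoder.fit_transform(actors.str.split(', '))

print(actor_matrix.shape)     # rows x distinct actors
print(actor_matrix.nnz)       # number of stored non-zero entries
print(list(encoder.classes_)) # column order of the matrix
```

Only the non-zero entries are stored, so memory grows with the number of (movie, actor) pairs rather than with movies times actors. scikit-learn's `LinearRegression` accepts sparse input; whether the rest of the pipeline can consume it would need to be checked.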


## Training models

We have a regression problem because the target, a movie rating between 1.0 and 10.0, is continuous. We will use a linear regression model and a neural network model to predict the ratings given by young adult males, young adult females, and young adults of all genders.

In [18]:
X = data.drop(columns=['allgenders_18age_avg_vote', 'males_18age_avg_vote', 'females_18age_avg_vote'])
y_all = data[['allgenders_18age_avg_vote']]
y_male = data[['males_18age_avg_vote']]
y_female = data[['females_18age_avg_vote']]

### Linear regression

We will implement our linear regression model using scikit-learn, with 70% of our data for training and 30% for testing. We evaluate the model using mean squared error (MSE); in the runs shown below, the MSE for the three targets falls between about 0.93 and 0.96 on the unscaled rating scale.

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def linear_regression(X, y):
    # Hold out 30% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = LinearRegression()
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)

    # Mean squared error on the held-out rows (true values first, by convention)
    mse = mean_squared_error(y_test, prediction)
    return mse

In [20]:
linear_regression(X, y_all)

0.9306432084665556

In [21]:
linear_regression(X, y_male)

0.9568804466623241

In [22]:
linear_regression(X, y_female)

0.939281568298518
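
An MSE of about 0.93 corresponds to a root mean squared error of roughly 0.96 rating points. Whether that is good depends on how much the ratings vary in the first place. A simple reference point is a baseline that ignores all features and always predicts the mean training rating. The sketch below computes that baseline with NumPy on invented ratings (the arrays are illustrative, not taken from the dataset):

```python
import numpy as np

# Invented ratings for illustration only
y_train = np.array([6.3, 4.2, 3.4, 5.5, 6.2, 7.1])
y_test = np.array([5.0, 6.0, 7.0])

# Baseline: predict the mean training rating for every test movie
baseline_prediction = np.full_like(y_test, y_train.mean())

# MSE is the average squared difference; RMSE is in rating points
baseline_mse = np.mean((y_test - baseline_prediction) ** 2)
baseline_rmse = np.sqrt(baseline_mse)

print(baseline_mse, baseline_rmse)
```

A model is only useful to the extent that its test MSE is clearly below this baseline on the same split.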

### Neural network

We will implement our neural network model using TensorFlow and Keras. The code holds out 15% of the data for testing and then reserves 15% of the remaining rows for validation, which leaves 722 training, 128 validation, and 150 test samples out of 1000 (roughly a 72/13/15 split). We chose the architecture by trying different combinations of layers and keeping the one with the lowest mean squared error. The network is trained with stochastic gradient descent at a learning rate of 0.001 for 10 epochs.

In [23]:
!pip install tensorflow



In [24]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def nn(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

    model = keras.Sequential([layers.Dense(16, activation="relu",
                              input_shape=[len(X_train.keys())]),
                              layers.Dense(16, activation="relu"),
                              layers.Dense(1)])

    model.compile(optimizer=keras.optimizers.SGD(0.001),
                  loss=keras.losses.MeanSquaredError(),
                  metrics=["mse"])

    model.fit(X_train, y_train, epochs=10, validation_split=0.15)
    prediction = model.predict(X_test)
    
    mse = mean_squared_error(y_test, prediction)
    return mse
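
Because `validation_split` is applied to the rows that remain after the test split, the validation set is 15% of 85% of the data rather than 15% of the whole. The arithmetic can be sketched as follows, assuming that scikit-learn rounds the test count up and that Keras truncates the training count (our understanding of their behavior, and consistent with the sample counts printed in the training logs):

```python
import math

n_rows = 1000
test_fraction = 0.15
val_fraction = 0.15

# train_test_split: the test set gets ceil(fraction * n) rows (assumed rounding)
n_test = math.ceil(n_rows * test_fraction)
n_fit = n_rows - n_test

# Keras validation_split: the last part of the remaining rows is held out,
# with the training count truncated to an integer (assumed rounding)
n_train = int(n_fit * (1 - val_fraction))
n_val = n_fit - n_train

print(n_train, n_val, n_test)
```

An exact 70/15/15 split would instead need `validation_split` set to 0.15 / 0.85, or about 0.176.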

In [25]:
nn(X, y_all)

Train on 722 samples, validate on 128 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


1.2556587852609176

In [26]:
nn(X, y_male)

Train on 722 samples, validate on 128 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


1.3947771071030393

In [27]:
nn(X, y_female)

Train on 722 samples, validate on 128 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


1.2063002077136293