In this notebook we will try to predict the IMDb rating of movies.

In [None]:
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
## standard imports

conn = sqlite3.connect('moviedb')

In [58]:
df=pd.read_sql('select * from  basics b  join ratings r on b.tconst =r.tconst join principals p on p.tconst = b.tconst join name n on n.nconst=p.nconst', conn) 
## gets the info we will be using from our database
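Because the query uses `select *` across four joined tables, the result carries `tconst` three times and `nconst` twice, as the audit output further down shows. A minimal sketch of naming the columns explicitly is given here. It runs against a tiny in-memory database with made-up rows; the names `demo_conn` and `demo_df` and the two-table schema are assumptions for illustration, not the real `moviedb` file.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real database; the rows are invented.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
    CREATE TABLE basics  (tconst TEXT, primaryTitle TEXT, startYear TEXT,
                          runtimeMinutes TEXT, genres TEXT);
    CREATE TABLE ratings (tconst TEXT, averageRating REAL, numVotes INTEGER);
    INSERT INTO basics  VALUES ('tt1', 'Example A', '1999', '120', 'Drama,Crime');
    INSERT INTO basics  VALUES ('tt2', 'Example B', '\\N',  '95',  'Comedy');
    INSERT INTO ratings VALUES ('tt1', 7.5, 1000);
    INSERT INTO ratings VALUES ('tt2', 6.1, 250);
""")

# Listing the columns keeps a single copy of the join key.
query = """
    SELECT b.tconst, b.primaryTitle, b.startYear, b.runtimeMinutes,
           b.genres, r.averageRating, r.numVotes
    FROM basics b
    JOIN ratings r ON b.tconst = r.tconst
"""
demo_df = pd.read_sql(query, demo_conn)
print(demo_df.columns.tolist())
```

With a single `tconst` column, later steps such as dropping columns or deduplicating by title ID are less ambiguous.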

In [59]:
df.duplicated().sum()
df.isnull().sum()
## quick audit: only primaryProfession has nulls, and it will be dropped later anyway
## (missing values stored as the string '\N' are not counted as nulls here)

tconst                  0
titleType               0
primaryTitle            0
originalTitle           0
isAdult                 0
startYear               0
endYear                 0
runtimeMinutes          0
genres                  0
tconst                  0
averageRating           0
numVotes                0
tconst                  0
ordering                0
nconst                  0
category                0
job                     0
characters              0
nconst                  0
primaryName             0
birthYear               0
deathYear               0
primaryProfession    1495
knownForTitles          0
dtype: int64

In [60]:
df = df.drop(columns=[ 'isAdult',  'endYear', 'originalTitle', 'characters','primaryProfession', 'knownForTitles','job','birthYear','deathYear','tconst','titleType','ordering','nconst'])
## we don't need these columns

In [66]:
df['runtimeMinutes'] = df['runtimeMinutes'].replace(r'\N', np.nan)
df['startYear'] = df['startYear'].replace(r'\N', np.nan)
df['runtimeMinutes'] = pd.to_numeric(df['runtimeMinutes'])
df['startYear'] = pd.to_numeric(df['startYear'])
df = df.drop_duplicates(subset=['primaryTitle'])  ## keeps one row per title (different movies that share a title are also collapsed)
## changes the '\N' placeholders to NaN and converts the columns to numeric
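The replace-then-convert steps above can also be done in one pass. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a small made-up frame (`raw` and `clean` are illustrative names). Note that this turns any non-numeric string into NaN, not only `\N`, so it is slightly more permissive than the cell above.

```python
import pandas as pd

# Made-up sample that mimics the '\N' placeholder in the IMDb columns.
raw = pd.DataFrame({
    "runtimeMinutes": ["90", r"\N", "121"],
    "startYear": ["1994", "2001", r"\N"],
})

# errors="coerce" converts anything that is not a number to NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")
print(clean)
print(clean.isna().sum())
```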


In [67]:
df['main_genre'] = df['genres'].str.split(',').str[0] 
## the genres column is a comma-separated list; we treat the first genre listed as the main one
## this will make it easier to model

In [68]:
one_hot_encoded = pd.get_dummies(df['main_genre'])
df = pd.concat([df, one_hot_encoded], axis=1) ## changing to one hot for the model



In [74]:

df['main_genre'].value_counts()

main_genre
Drama          22247
Comedy         21674
Action          8718
Crime           6015
Adventure       4464
Animation       3648
Documentary     2909
\N              2306
Western         1909
Horror          1891
Biography       1776
Short           1224
Thriller         764
Adult            722
Family           536
Mystery          495
Romance          468
Musical          463
Fantasy          338
Sci-Fi           314
Music            249
History          115
War              101
Game-Show         51
Talk-Show         50
Film-Noir         26
News              22
Sport             18
Reality-TV         7
Name: count, dtype: int64

In [44]:
selected_columns = ['startYear', 'runtimeMinutes', 'averageRating']
correlation_matrix = df[selected_columns].corr()
print(correlation_matrix)


                startYear  runtimeMinutes  averageRating
startYear        1.000000        0.264581       0.012866
runtimeMinutes   0.264581        1.000000       0.036380
averageRating    0.012866        0.036380       1.000000


In [98]:
x = df[['startYear', 'runtimeMinutes']]  ## first we try without genres
y = df['averageRating']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)

                            OLS Regression Results                            
Dep. Variable:          averageRating   R-squared:                       0.002
Model:                            OLS   Adj. R-squared:                  0.002
Method:                 Least Squares   F-statistic:                     69.53
Date:                Wed, 01 Nov 2023   Prob (F-statistic):           6.77e-31
Time:                        17:11:04   Log-Likelihood:            -1.1987e+05
No. Observations:               79136   AIC:                         2.398e+05
Df Residuals:                   79133   BIC:                         2.398e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              8.1745      0.350     23.

In [110]:
subset = ['Action', 'Adult', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir',
       'Game-Show', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News',
       'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show',
       'Thriller', 'War', 'Western']
for column in subset:
    df[column] = df[column].replace({True: 1, False: 0})

    ## booleans weren't working with statsmodels, so we changed them to 1 and 0
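The boolean issue can also be avoided at encoding time. A small sketch, assuming a pandas version in which `get_dummies` accepts the `dtype` argument; the frame `demo` and its rows are made up:

```python
import pandas as pd

# Made-up genres column in the same comma-separated format.
demo = pd.DataFrame({"genres": ["Drama,Crime", "Comedy", "Action,Adventure"]})
demo["main_genre"] = demo["genres"].str.split(",").str[0]

# dtype=int yields 0/1 columns directly instead of True/False.
dummies = pd.get_dummies(demo["main_genre"], dtype=int)
print(dummies)
```

This produces integer columns from the start, so the replace loop over `subset` would not be needed.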


In [111]:
x = df[['startYear', 'runtimeMinutes', 'Action', 'Adult', 'Adventure', 'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Film-Noir',
       'Game-Show', 'History', 'Horror', 'Music', 'Musical', 'Mystery', 'News',
       'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport', 'Talk-Show',
       'Thriller', 'War', 'Western']] 
y = df['averageRating']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)

## including genres gives us a higher R-squared than the previous model (0.084 vs 0.002); some genres have very high p-values, so we can use backward elimination on them

                            OLS Regression Results                            
Dep. Variable:          averageRating   R-squared:                       0.084
Model:                            OLS   Adj. R-squared:                  0.083
Method:                 Least Squares   F-statistic:                     240.5
Date:                Wed, 01 Nov 2023   Prob (F-statistic):               0.00
Time:                        17:17:47   Log-Likelihood:            -1.1649e+05
No. Observations:               79136   AIC:                         2.330e+05
Df Residuals:                   79105   BIC:                         2.333e+05
Df Model:                          30                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              7.5852      0.345     21.

In [122]:
x = df[['startYear', 'runtimeMinutes', 'Action',   'Animation', 'Biography', 'Comedy',
       'Crime', 'Documentary', 'Drama', 'Family',  'Film-Noir',
        'History', 'Horror', 'Music',  'Mystery', 'News',
        'Sci-Fi', 'Short', 
       'Thriller', 'Western']] 
y = df['averageRating']

x = sm.add_constant(x)

model = sm.OLS(y, x).fit()
predictions = model.predict(x) 

print_model = model.summary()
print(print_model)

## Removing the non-significant genres doesn't change our R-squared

                            OLS Regression Results                            
Dep. Variable:          averageRating   R-squared:                       0.084
Model:                            OLS   Adj. R-squared:                  0.083
Method:                 Least Squares   F-statistic:                     360.4
Date:                Wed, 01 Nov 2023   Prob (F-statistic):               0.00
Time:                        17:31:42   Log-Likelihood:            -1.1649e+05
No. Observations:               79136   AIC:                         2.330e+05
Df Residuals:                   79115   BIC:                         2.332e+05
Df Model:                          20                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const              7.6335      0.341     22.

### Conclusion
- Although the p-values are small, the R-squared is also low: the model explains only about 8% of the variance in ratings. The most interesting finding is probably the genre effects, which are listed below.
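For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on toy numbers; the helper `r_squared` and the inputs are illustrative and are not taken from the model above.

```python
import numpy as np

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Predicting the mean for every row explains nothing.
print(r_squared([1, 2, 3], [2, 2, 2]))
# A perfect prediction explains everything.
print(r_squared([1, 2, 3], [1, 2, 3]))
```

An R-squared of 0.084 sits close to the first case: the predictions are only slightly better than guessing the overall mean rating for every movie.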


Here is the estimated effect of each genre on a movie's average rating. Each value is the model coefficient, in rating points, relative to the genres left out of the final model:

- Documentary: 0.9316
- Animation: 0.6585
- Biography: 0.6429
- Film-Noir: 0.5333
- Music: 0.5296
- Short: 0.4453
- History: 0.4068
- Drama: 0.2873
- Family: 0.2842
- Comedy: 0.0507
- Western: 0.0531
- Action: -0.2472
- Crime: 0.2037
- Mystery: -0.1459
- Thriller: -0.5288
- Sci-Fi: -0.7178
- Horror: -0.9391
- News: -1.0092
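Since each coefficient is measured against the same baseline, the predicted gap between two genres is the difference of their coefficients, with year and runtime held fixed. A small sketch using three values copied from the list above (the helper `rating_gap` is an illustrative name):

```python
# Coefficients copied from the list above (final model).
genre_coef = {"Documentary": 0.9316, "Drama": 0.2873, "Horror": -0.9391}

def rating_gap(genre_a, genre_b, coefs=genre_coef):
    """Predicted rating difference between two genres, all else equal."""
    return coefs[genre_a] - coefs[genre_b]

gap = rating_gap("Documentary", "Horror")
print(f"Documentary vs Horror: {gap:+.4f} rating points")
```

Under this model a documentary is predicted to score about 1.87 points higher than a horror title of the same year and length.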


Here we can see that newer movies are rated very slightly lower than older ones, and longer movies are rated very slightly higher:

- runtimeMinutes: 0.0022
- startYear: -0.0009
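To get a feel for how small these effects are, the coefficients can be scaled by a realistic change in each variable. The sketch below uses the two values listed above; the 60-minute and 50-year changes are arbitrary examples.

```python
RUNTIME_COEF = 0.0022  # rating points per extra minute of runtime
YEAR_COEF = -0.0009    # rating points per later release year

def linear_effect(coef, change):
    """Predicted rating change for a given change in one predictor."""
    return coef * change

extra_hour = linear_effect(RUNTIME_COEF, 60)
fifty_years = linear_effect(YEAR_COEF, 50)
print(f"+60 minutes: {extra_hour:+.3f}")
print(f"+50 years:   {fifty_years:+.3f}")
```

A full extra hour of runtime is worth roughly 0.13 rating points, and a 50-year gap in release date about 0.05 points in the other direction. Both are far smaller than the larger genre effects.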

So, if you are making a movie, these are some things to consider.