<a href="https://colab.research.google.com/github/ashishmathew0297/movie_rating_prediction_cis550_final_project/blob/main/Nikita/Copy_of_movie_success_prediction_system_nikita.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Success Prediction System
The main aim of this project is to predict the potential success or failure of a movie based on a wide range of features.

Our main dataset comes from IMDb, which makes its non-commercial datasets openly available at

https://developer.imdb.com/non-commercial-datasets/

This project walks through every step of the machine learning pipeline, from loading and cleaning the dataset to training and tuning a model that makes predictions on new data.

## Data Preparation

Our data comes directly from IMDb and covers movie information from as early as the 1800s through to upcoming releases.

We wrote a separate script that builds a single dataset from 6 of the 7 datasets provided by IMDb. Here we focus on loading that dataset and cleaning it further for use with our machine learning models.

In [1]:
import warnings, requests, gzip
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import *


from joblib import Parallel, delayed
from tqdm import tqdm
from math import floor, ceil
import os, pickle

from sklearn.cluster import KMeans
from sklearn.model_selection import *
from sklearn.metrics import *

from sklearn.preprocessing import StandardScaler, LabelEncoder, OrdinalEncoder, OneHotEncoder

%matplotlib inline

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [23]:
movie_dataset = pd.read_csv("/content/drive/MyDrive/drive-download-20231202T144505Z-001/imdb_movie_dataset.csv")

In [24]:
movie_dataset = movie_dataset.fillna(value=np.nan)
# Count missing values per column
movie_dataset.isna().sum()

Unnamed: 0                  0
tconst                      0
primaryTitle                0
isAdult                     0
releaseYear                 0
runtimeMinutes              0
Action                      0
Adult                       0
Adventure                   0
Animation                   0
Biography                   0
Comedy                      0
Crime                       0
Documentary                 0
Drama                       0
Family                      0
Fantasy                     0
Film-Noir                   0
Game-Show                   0
History                     0
Horror                      0
Music                       0
Musical                     0
Mystery                     0
News                        0
Reality-TV                  0
Romance                     0
Sci-Fi                      0
Sport                       0
Talk-Show                   0
Thriller                    0
War                         0
Western                     0
\N        

From the data that was created, we now drop the 'production_designer' column, as more than 400k of its values were null. We also impute the values in the 'Short' column from the runtime, using the general convention that a short movie runs up to 50 minutes. Next, we remove the '\N' placeholder string from the 'region' column, which we then split into multiple indicator columns.

In [25]:
# Drop the mostly-null production_designer column
movie_dataset.drop('production_designer', axis=1, inplace=True)
# Flag movies of up to 50 minutes as shorts
movie_dataset['Short'] = np.where(movie_dataset['runtimeMinutes'] > 50, 0, 1)
# Strip the literal '\N' placeholder from the comma-separated region lists
movie_dataset['region'] = movie_dataset['region'].convert_dtypes(convert_string=True)
movie_dataset['region'] = movie_dataset['region'].str.replace(r"\N,", '', regex=False)
movie_dataset['region'] = movie_dataset['region'].str.replace(r",\N", '', regex=False)
movie_dataset

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,isAdult,releaseYear,runtimeMinutes,Action,Adult,Adventure,Animation,...,actress,cinematographer,composer,director,editor,producer,self,writer,averageRating,numVotes
0,0,tt0000502,Bohemios,False,1905,100,0,0,0,0,...,,,,nm0063413,,,,"nm0675388,nm0063413,nm0657268",4.1,15
1,1,tt0000574,The Story of the Kelly Gang,False,1906,70,1,0,1,0,...,nm0846887,nm0675239,nm2421834,nm0846879,,"nm0317210,nm0425854,nm0846911",,nm0846879,6.0,855
2,2,tt0000591,The Prodigal Son,False,1907,90,0,0,0,0,...,"nm1323543,nm1759558",,,nm0141150,,,,nm0141150,5.0,21
3,3,tt0000615,Robbery Under Arms,False,1907,0,0,0,0,0,...,nm0218953,nm0167619,,nm0533958,,,,"nm0533958,nm0092809",4.3,25
4,4,tt0000630,Hamlet,False,1908,0,0,0,0,0,...,nm0624446,,,nm0143333,,nm0209738,,nm0000636,2.9,27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
440879,570470,tt9916538,Kuambil Lagi Hatiku,False,2019,123,0,0,0,0,...,"nm8678236,nm1417182,nm1266058",,nm4700236,nm4457074,,nm1290982,,"nm4900525,nm4843252,nm2679404",8.6,7
440880,570471,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,False,2015,57,0,0,0,0,...,,"nm9272492,nm9272489,nm8349149,nm9275317",,"nm9272491,nm9272490",,,"nm10538557,nm10538558,nm10538556","nm9272491,nm9272490",0.0,0
440881,570472,tt9916680,De la ilusión al desconcierto: cine colombiano...,False,2007,100,0,0,0,0,...,,"nm10538579,nm10538578,nm10538577",,nm0652213,nm4762061,,"nm0033355,nm0127882,nm0133349,nm10503634","nm0652213,nm10538576",0.0,0
440882,570474,tt9916730,6 Gunn,False,2017,116,0,0,0,0,...,,nm1957275,,nm10538612,nm9785908,"nm10538614,nm10538613",,nm10538612,7.6,11
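
The two cleaning rules in this cell can be tried on a small toy frame. The sketch below is illustrative only: the three rows and the `Short_known` column are made up for the example and are not part of the project dataset. It also surfaces a side effect worth checking: a runtime of 0 (as in rows 3 and 4 above, where it may stand for a missing runtime) is flagged as a short by the `> 50` rule.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the IMDb dataset)
toy = pd.DataFrame({
    "runtimeMinutes": [100, 45, 0],
    "region": [r"\N,US,AU", r"IN,\N", r"\N"],
})

# Same rule as the cell above: anything up to 50 minutes counts as a short
toy["Short"] = np.where(toy["runtimeMinutes"] > 50, 0, 1)

# Hypothetical variant: treat a runtime of 0 as unknown rather than short
toy["Short_known"] = np.where(toy["runtimeMinutes"].between(1, 50), 1, 0)

# Literal (non-regex) removal of the '\N' placeholder next to a comma
toy["region"] = toy["region"].str.replace(r"\N,", "", regex=False)
toy["region"] = toy["region"].str.replace(r",\N", "", regex=False)

print(toy)
```

Note that a region consisting only of `\N` has no adjacent comma, so the two replacements leave it unchanged.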


In [26]:
movie_dataset["region"] = movie_dataset["region"].fillna("")
movie_dataset["region"].isna().sum()

0

In [27]:
def check_region(region, target):
    try:
        return 1 if target in region else 0
    except TypeError:
        return 0

In [28]:
movie_dataset['region_US'] = movie_dataset["region"].apply(lambda x: check_region(x, 'US'))
movie_dataset['region_UK'] = movie_dataset["region"].apply(lambda x: check_region(x, 'UK'))
movie_dataset['region_AU'] = movie_dataset["region"].apply(lambda x: check_region(x, 'AU'))
movie_dataset['region_IN'] = movie_dataset["region"].apply(lambda x: check_region(x, 'IN'))
movie_dataset['region_JP'] = movie_dataset["region"].apply(lambda x: check_region(x, 'JP'))
movie_dataset['region_other'] = movie_dataset['region'].apply(lambda x: any(e not in ['US', 'UK','AU','IN','JP'] for e in x)).astype(int)
movie_dataset

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,isAdult,releaseYear,runtimeMinutes,Action,Adult,Adventure,Animation,...,self,writer,averageRating,numVotes,region_US,region_UK,region_AU,region_IN,region_JP,region_other
0,0,tt0000502,Bohemios,False,1905,100,0,0,0,0,...,,"nm0675388,nm0063413,nm0657268",4.1,15,0,0,0,0,0,1
1,1,tt0000574,The Story of the Kelly Gang,False,1906,70,1,0,1,0,...,,nm0846879,6.0,855,1,0,1,0,0,1
2,2,tt0000591,The Prodigal Son,False,1907,90,0,0,0,0,...,,nm0141150,5.0,21,1,0,0,0,0,1
3,3,tt0000615,Robbery Under Arms,False,1907,0,0,0,0,0,...,,"nm0533958,nm0092809",4.3,25,0,0,1,0,0,1
4,4,tt0000630,Hamlet,False,1908,0,0,0,0,0,...,,nm0000636,2.9,27,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
440879,570470,tt9916538,Kuambil Lagi Hatiku,False,2019,123,0,0,0,0,...,,"nm4900525,nm4843252,nm2679404",8.6,7,0,0,0,0,0,1
440880,570471,tt9916622,Rodolpho Teóphilo - O Legado de um Pioneiro,False,2015,57,0,0,0,0,...,"nm10538557,nm10538558,nm10538556","nm9272491,nm9272490",0.0,0,0,0,0,0,0,1
440881,570472,tt9916680,De la ilusión al desconcierto: cine colombiano...,False,2007,100,0,0,0,0,...,"nm0033355,nm0127882,nm0133349,nm10503634","nm0652213,nm10538576",0.0,0,0,0,0,0,0,1
440882,570474,tt9916730,6 Gunn,False,2017,116,0,0,0,0,...,,nm10538612,7.6,11,0,0,0,1,0,1
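
In the `region_other` line above, `x` is a string, so `for e in x` yields single characters rather than region codes. No single character equals a two-letter code, so the flag is 1 for every non-empty region, which is consistent with the all-1 column in the output. A minimal sketch of a token-level check, assuming regions are comma-separated codes (the helper name `has_other_region` and the sample values are hypothetical):

```python
import pandas as pd

MAIN_REGIONS = {"US", "UK", "AU", "IN", "JP"}
PLACEHOLDER = r"\N"  # a lone placeholder is not counted as a region

def has_other_region(region: str) -> int:
    """Return 1 if any comma-separated code lies outside MAIN_REGIONS."""
    codes = [c.strip() for c in region.split(",")]
    codes = [c for c in codes if c and c != PLACEHOLDER]
    return int(any(c not in MAIN_REGIONS for c in codes))

regions = pd.Series(["US,AU", "US,FR", "", "IN"])

# Token-level flag: only "US,FR" contains a code outside the main five
region_other = regions.apply(has_other_region)

# Character-level behaviour of the original expression, for comparison
char_level = regions.apply(
    lambda x: any(e not in ["US", "UK", "AU", "IN", "JP"] for e in x)
).astype(int)

print(pd.DataFrame({"region": regions,
                    "token_level": region_other,
                    "char_level": char_level}))
```

Splitting on commas also avoids treating an empty region as "other", since empty tokens are filtered out before the membership test.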


In [30]:
movie_dataset.isna().sum()
# movie_dataset.info()

Unnamed: 0              0
tconst                  0
primaryTitle            0
isAdult                 0
releaseYear             0
runtimeMinutes          0
Action                  0
Adult                   0
Adventure               0
Animation               0
Biography               0
Comedy                  0
Crime                   0
Documentary             0
Drama                   0
Family                  0
Fantasy                 0
Film-Noir               0
Game-Show               0
History                 0
Horror                  0
Music                   0
Musical                 0
Mystery                 0
News                    0
Reality-TV              0
Romance                 0
Sci-Fi                  0
Sport                   0
Talk-Show               0
Thriller                0
War                     0
Western                 0
\N                      0
Short                   0
region                  0
actor               96798
actress            137683
cinematograp

Next, we split the actor column into separate columns, one per person. We do the same for the actress, writer, cinematographer, self, producer, composer, and editor columns.

In [33]:
actor_split = movie_dataset['actor'].str.split(',', expand=True)
actor_split = actor_split.iloc[:, :2]
actor_split.rename(columns={0: 'actor_1', 1: 'actor_2'}, inplace=True)
actor_split.isna().sum()

actor_1     96798
actor_2    146007
dtype: int64

In [37]:
actress_split = movie_dataset['actress'].str.split(',', expand=True)
actress_split = actress_split.iloc[:, :2]
actress_split.rename(columns={0: 'actress_1', 1: 'actress_2'}, inplace=True)
actress_split.isna().sum()

actress_1    137683
actress_2    260123
dtype: int64

In [43]:
writer_split = movie_dataset['writer'].str.split(',', expand=True)
writer_split = writer_split.iloc[:, :2]
writer_split.rename(columns={0: 'writer_1', 1: 'writer_2'}, inplace=True)
writer_split.drop(['writer_2'], axis=1, inplace=True)
writer_split.isna().sum()

writer_1    0
dtype: int64

In [46]:
cinema_split = movie_dataset['cinematographer'].str.split(',', expand=True)
cinema_split = cinema_split.iloc[:, :1]
cinema_split.rename(columns={0: 'cinematographer_1'}, inplace=True)
cinema_split.isna().sum()

cinematographer_1    192493
dtype: int64

In [47]:
self_split = movie_dataset['self'].str.split(',', expand=True)
self_split = self_split.iloc[:, :1]
self_split.rename(columns={0: 'self_1'}, inplace=True)
self_split
# self_split.isna().sum()

Unnamed: 0,self_1
0,
1,
2,
3,
4,
...,...
440879,
440880,nm10538557
440881,nm0033355
440882,


In [48]:
producer_split = movie_dataset['producer'].str.split(',', expand=True)
producer_split = producer_split.iloc[:, :1]
producer_split.rename(columns={0: 'producer_1'}, inplace=True)
producer_split
#producer_split.isna().sum()

Unnamed: 0,producer_1
0,
1,nm0317210
2,
3,
4,nm0209738
...,...
440879,nm1290982
440880,
440881,
440882,nm10538614


In [49]:
composer_split = movie_dataset['composer'].str.split(',', expand=True)
composer_split = composer_split.iloc[:, :1]
composer_split.rename(columns={0: 'composer_1'}, inplace=True)
composer_split
# composer_split.isna().sum()

Unnamed: 0,composer_1
0,
1,nm2421834
2,
3,
4,
...,...
440879,nm4700236
440880,
440881,
440882,


In [50]:
editor_split = movie_dataset['editor'].str.split(',', expand=True)
editor_split = editor_split.iloc[:, :1]
editor_split.rename(columns={0: 'editor_1'}, inplace=True)
editor_split
# editor_split.isna().sum()

Unnamed: 0,editor_1
0,
1,
2,
3,
4,
...,...
440879,
440880,
440881,nm4762061
440882,nm9785908
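
The eight cells above repeat one pattern: split a comma-separated column, keep the first one or two names, and rename the result. A minimal sketch of that pattern as a single helper follows. The function name `split_top_n` and the toy rows are hypothetical, and the output columns are named after the source column (so `cinematographer` yields `cinematographer_1`).

```python
import pandas as pd

def split_top_n(df: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    """Split a comma-separated column and keep the first n names
    as <column>_1 .. <column>_n."""
    parts = df[column].str.split(",", expand=True)
    # reindex keeps the first n pieces and pads with missing values
    # if no row has that many names
    parts = parts.reindex(columns=range(n))
    parts.columns = [f"{column}_{i + 1}" for i in range(n)]
    return parts

# Toy data (hypothetical name IDs)
toy = pd.DataFrame({
    "actor": ["nm1,nm2,nm3", "nm4", None],
    "editor": ["nm5,nm6", None, "nm7"],
})

actor_split = split_top_n(toy, "actor", 2)
editor_split = split_top_n(toy, "editor", 1)

print(actor_split)
print(actor_split.isna().sum())
```

With a helper like this, the per-role cells reduce to one call each, which keeps the number of retained names per role in one visible place.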


In [53]:
# merge the 8 dataframes on index using inner join
# movie_dataset = pd.merge(movie_dataset, actor_split, left_index=True, right_index=True).merge(actress_split, left_index=True, right_index=True).merge(writer_split, left_index=True, right_index=True).merge(cinema_split, left_index=True, right_index=True).merge(producer_split, left_index=True, right_index=True).merge(composer_split, left_index=True, right_index=True).merge(editor_split, left_index=True, right_index=True).merge(self_split, left_index=True, right_index=True)
# movie_dataset  = movie_dataset.fillna(value=np.nan)
movie_dataset.drop(['actor', 'actress', 'writer', 'producer', 'self', 'composer', 'cinematographer', 'editor'], axis=1, inplace=True)
movie_dataset.isna().sum()

Unnamed: 0                0
tconst                    0
primaryTitle              0
isAdult                   0
releaseYear               0
runtimeMinutes            0
Action                    0
Adult                     0
Adventure                 0
Animation                 0
Biography                 0
Comedy                    0
Crime                     0
Documentary               0
Drama                     0
Family                    0
Fantasy                   0
Film-Noir                 0
Game-Show                 0
History                   0
Horror                    0
Music                     0
Musical                   0
Mystery                   0
News                      0
Reality-TV                0
Romance                   0
Sci-Fi                    0
Sport                     0
Talk-Show                 0
Thriller                  0
War                       0
Western                   0
\N                        0
Short                     0
region              
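
The commented-out line in this cell chains index-on-index merges to attach the split frames to the main table. When every frame shares the same index, as the split frames do because they are derived from `movie_dataset`, a single column-wise `pd.concat` gives the same result. The sketch below uses toy frames with hypothetical values to compare the two approaches; it is not the notebook's actual merge.

```python
import pandas as pd

# Toy frames sharing the same default index (hypothetical values)
base = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"]})
actor_split = pd.DataFrame({"actor_1": ["nm1", None, "nm3"]})
editor_split = pd.DataFrame({"editor_1": [None, "nm5", "nm6"]})

# Chained index merges, as in the commented-out line (default how='inner')
merged = (
    base.merge(actor_split, left_index=True, right_index=True)
        .merge(editor_split, left_index=True, right_index=True)
)

# One column-wise concat over frames that share the same index
combined = pd.concat([base, actor_split, editor_split], axis=1)

print(combined)
```

The two differ only when the indexes differ: the inner merge keeps index labels present in every frame, while `pd.concat(axis=1)` defaults to an outer join and fills the gaps with missing values.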

In [22]:
# movie_dataset  = movie_dataset.fillna(value=np.nan)
# movie_dataset.isna()

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,isAdult,releaseYear,runtimeMinutes,Action,Adult,Adventure,Animation,...,actress_1,actress_2,writer_1,writer_2,cinematographer_1,producer_1,producer_2,composer_1,editor_1,self_1
0,False,False,False,False,False,False,False,False,False,False,...,True,True,False,False,True,True,True,True,True,True
1,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,False,False,False,False,True,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,True,True,True,True
3,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,True,True,True,True,True
4,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,True,False,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
440879,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,True,False,True,True
440880,False,False,False,False,False,False,False,False,False,False,...,True,True,False,False,False,True,True,True,True,False
440881,False,False,False,False,False,False,False,False,False,False,...,True,True,False,False,False,True,True,True,False,False
440882,False,False,False,False,False,False,False,False,False,False,...,True,True,False,True,False,False,False,True,False,True


In [54]:
# Save the dataframe as CSV
movie_dataset.to_csv('/content/drive/MyDrive/drive-download-20231202T144505Z-001/movie_dataset1.csv')
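
The `Unnamed: 0` and `Unnamed: 0.1` columns in the tables above are consistent with a CSV being saved with its index, read back, and saved again. A small sketch of the effect, using an in-memory buffer and toy data instead of the Drive path:

```python
import io
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({"tconst": ["tt1", "tt2"], "numVotes": [15, 855]})

# Default: the index is written as an unnamed first column
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False: only the data columns are written
buf2 = io.StringIO()
df.to_csv(buf2, index=False)
buf2.seek(0)
without_index = pd.read_csv(buf2)

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index=False` to `to_csv`, or `index_col=0` to `read_csv`, is a common way to keep these extra columns from accumulating across save and load cycles.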