# Movie Success Prediction System
The main aim of this project is to predict the potential success or failure of a movie based on a wide range of features.

Our main dataset comes from IMDb, which makes its non-commercial datasets freely available at

https://developer.imdb.com/non-commercial-datasets/

In this project we will go through every step of the machine learning pipeline, from loading and cleaning our dataset to training and tuning a model that makes predictions on new data.

In [1]:
import warnings, requests, gzip
warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Preparation

Our data comes directly from IMDb and contains movie information from as early as the 1800s through to upcoming releases.

We used a separate script to build a single dataset from 6 of the 7 datasets provided by IMDb. We now focus on loading that dataset and cleaning it up further for use with our machine learning models.
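The preprocessing script itself is not shown in this notebook. As a rough sketch of what one step of it might look like (not the script we actually used), IMDb distributes each dataset as a gzipped, tab-separated file in which missing values are written as `\N`. The snippet below builds a tiny in-memory stand-in for `title.basics.tsv.gz` and one-hot encodes its comma-separated `genres` column. The sample rows and the helper name `load_imdb_tsv` are made up for illustration.

```python
import gzip
import io

import pandas as pd


def load_imdb_tsv(source):
    """Read a gzipped IMDb-style TSV, keeping the literal \\N marker as text."""
    return pd.read_csv(
        source,
        sep="\t",
        compression="gzip",
        dtype=str,              # read everything as text, convert later
        keep_default_na=False,  # do not silently turn markers into NaN
    )


# Tiny made-up sample in the same layout as title.basics.tsv.gz.
raw = (
    "tconst\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tExample A\t1905\tAction,Drama\n"
    "tt0000002\tExample B\t\\N\t\\N\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

titles = load_imdb_tsv(buffer)

# One indicator column per genre; a title without genres gets a "\N" column.
genre_flags = titles["genres"].str.get_dummies(sep=",")
print(genre_flags)
```

Keeping `\N` as text may explain the `\N` column that appears among the genre columns later on. Passing `na_values="\\N"` to `pd.read_csv` instead would turn these markers into missing values.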

We load the dataset we created previously from Google Drive. Make sure you change the path to match the folder structure in your own Google Drive.

In [3]:
movie_dataset = pd.read_csv("/content/drive/MyDrive/CIS550_final_project_datasets/imdb_movie_dataset.csv")

Next, we check the data types that pandas inferred for each column of the dataset.

In [4]:
movie_dataset.dtypes

Unnamed: 0               int64
tconst                  object
primaryTitle            object
isAdult                   bool
releaseYear              int64
runtimeMinutes           int64
Action                   int64
Adult                    int64
Adventure                int64
Animation                int64
Biography                int64
Comedy                   int64
Crime                    int64
Documentary              int64
Drama                    int64
Family                   int64
Fantasy                  int64
Film-Noir                int64
Game-Show                int64
History                  int64
Horror                   int64
Music                    int64
Musical                  int64
Mystery                  int64
News                     int64
Reality-TV               int64
Romance                  int64
Sci-Fi                   int64
Sport                    int64
Talk-Show                int64
Thriller                 int64
War                      int64
Western 

Many of these data types are wider than they need to be and will take up a large amount of memory when we work with the data. CSV does not preserve data types, unlike formats such as Parquet. On the flip side, CSV files are extremely easy to work with, even for large datasets.
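To illustrate how CSV loses type information, here is a small sketch using an in-memory CSV with made-up values. A round trip through CSV widens the compact types back to the pandas defaults, while passing an explicit `dtype` mapping to `pd.read_csv` restores them at load time.

```python
import io

import numpy as np
import pandas as pd

# Small made-up frame with deliberately compact dtypes.
original = pd.DataFrame(
    {
        "releaseYear": np.array([1905, 1906], dtype=np.int16),
        "Action": np.array([0, 1], dtype=np.uint8),
    }
)

# CSV stores only text, so the dtypes are not written to the file.
csv_text = original.to_csv(index=False)

# Default read: pandas infers int64 for both columns.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit read: the dtype mapping restores the compact types.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"releaseYear": np.int16, "Action": np.uint8},
)

print(inferred.dtypes)
print(typed.dtypes)
```

Writing with `index=False` also avoids storing the row index as an extra column, which is the usual source of columns named `Unnamed: 0` when a CSV is read back.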

We will now convert each column to a data type that suits our use case.

In [7]:
# Unparseable values become NaN; treat them as False instead of letting NaN cast to True.
movie_dataset["isAdult"] = pd.to_numeric(movie_dataset["isAdult"], errors="coerce").fillna(0).astype("bool")

# Missing or unparseable runtimes are stored as 0.
movie_dataset["runtimeMinutes"] = pd.to_numeric(movie_dataset["runtimeMinutes"], errors="coerce")
movie_dataset["runtimeMinutes"] = movie_dataset["runtimeMinutes"].fillna(0).astype(np.int32)

# Missing or unparseable release years are stored as 0.
movie_dataset["releaseYear"] = pd.to_numeric(movie_dataset["releaseYear"], errors="coerce")
movie_dataset["releaseYear"] = movie_dataset["releaseYear"].fillna(0).astype(np.int16)

movie_genres = [
  "Action",
  "Adult",
  "Adventure",
  "Animation",
  "Biography",
  "Comedy",
  "Crime",
  "Documentary",
  "Drama",
  "Family",
  "Fantasy",
  "Film-Noir",
  "Game-Show",
  "History",
  "Horror",
  "Music",
  "Musical",
  "Mystery",
  "News",
  "Reality-TV",
  "Romance",
  "Sci-Fi",
  "Sport",
  "Talk-Show",
  "Thriller",
  "War",
  "Western",
  "\\N",
]

# Genre columns are 0/1 indicators, so an unsigned 8-bit integer is enough.
movie_dataset[movie_genres] = movie_dataset[movie_genres].apply(pd.to_numeric, errors="coerce")
movie_dataset[movie_genres] = movie_dataset[movie_genres].fillna(0)
movie_dataset[movie_genres] = movie_dataset[movie_genres].astype(np.uint8)

people_and_locations = [
  "region",
  "actor",
  "actress",
  "cinematographer",
  "composer",
  "director",
  "editor",
  "producer",
  "production_designer",
  "self",
  "writer",
]

movie_dataset[people_and_locations] = movie_dataset[people_and_locations].fillna(" ")
movie_dataset[people_and_locations] = movie_dataset[people_and_locations].astype("string")

movie_dataset["tconst"] = movie_dataset["tconst"].astype("string")
movie_dataset["primaryTitle"] = movie_dataset["primaryTitle"].astype("string")
movie_dataset["numVotes"] = movie_dataset["numVotes"].astype(np.int64)
movie_dataset["averageRating"] = movie_dataset["averageRating"].astype(np.float32)

In [8]:
movie_dataset.dtypes

Unnamed: 0               int64
tconst                  string
primaryTitle            string
isAdult                   bool
releaseYear              int16
runtimeMinutes           int32
Action                   uint8
Adult                    uint8
Adventure                uint8
Animation                uint8
Biography                uint8
Comedy                   uint8
Crime                    uint8
Documentary              uint8
Drama                    uint8
Family                   uint8
Fantasy                  uint8
Film-Noir                uint8
Game-Show                uint8
History                  uint8
Horror                   uint8
Music                    uint8
Musical                  uint8
Mystery                  uint8
News                     uint8
Reality-TV               uint8
Romance                  uint8
Sci-Fi                   uint8
Sport                    uint8
Talk-Show                uint8
Thriller                 uint8
War                      uint8
Western 
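To get a feel for the space saved by the narrower types, the sketch below compares the memory footprint of a synthetic indicator column stored as `int64` and as `uint8`. The column is generated data, not the IMDb dataset, so only the ratio is meaningful.

```python
import numpy as np
import pandas as pd

# Synthetic 0/1 genre flag with one million rows (not the real dataset).
rng = np.random.default_rng(seed=0)
flags = pd.Series(rng.integers(0, 2, size=1_000_000), dtype=np.int64)

# Measure the values only, excluding the index.
bytes_int64 = flags.memory_usage(index=False)
bytes_uint8 = flags.astype(np.uint8).memory_usage(index=False)

print(f"int64: {bytes_int64:,} bytes")
print(f"uint8: {bytes_uint8:,} bytes")
print(f"ratio: {bytes_int64 / bytes_uint8:.0f}x")
```

On the real DataFrame, comparing `movie_dataset.memory_usage(deep=True).sum()` before and after the conversion gives the actual figure, including the string columns.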

In [None]:
movie_dataset.head()

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,isAdult,releaseYear,runtimeMinutes,Action,Adult,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,Film-Noir,Game-Show,History,Horror,Music,Musical,Mystery,News,Reality-TV,Romance,Sci-Fi,Sport,Talk-Show,Thriller,War,Western,\N,Short,region,actor,actress,cinematographer,composer,director,editor,producer,production_designer,self,writer,averageRating,numVotes
0,0,tt0000502,Bohemios,False,1905,100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,,"\N,ES","nm0215752,nm0252720",,,,nm0063413,,,,,"nm0675388,nm0063413,nm0657268",4.1,15
1,1,tt0000574,The Story of the Kelly Gang,False,1906,70,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,"HU,AU,RS,GB,\N,SG,US,DE,AU","nm0846894,nm1431224,nm3002376",nm0846887,nm0675239,nm2421834,nm0846879,,"nm0317210,nm0425854,nm0846911",,,nm0846879,6.0,855
2,2,tt0000591,The Prodigal Son,False,1907,90,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,"FR,\N,US","nm0906197,nm0332182","nm1323543,nm1759558",,,nm0141150,,,,,nm0141150,5.0,21
3,3,tt0000615,Robbery Under Arms,False,1907,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,"\N,AU","nm3071427,nm0581353,nm0888988,nm0240418,nm0346387",nm0218953,nm0167619,,nm0533958,,,,,"nm0533958,nm0092809",4.3,25
4,4,tt0000630,Hamlet,False,1908,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,"IT,\N,FI,US,FI",,nm0624446,,,nm0143333,,nm0209738,,,nm0000636,2.9,27


Only run the code below if you need the names of the actors and crew. The ML model does not need this information for training, but it is useful for putting names to the people who took part in making each movie.

Running the code below can crash the Python kernel on machines with limited RAM, so only load this dataset when needed.

In [None]:
# movie_dataset = pd.read_parquet("detailed_crew_information.pqt")
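Once name data is available, the comma-separated `nm…` identifiers in columns such as `actor` or `director` can be translated into readable names. The sketch below assumes a lookup table with `nconst` and `primaryName` columns, as in IMDb's `name.basics` dataset. The actual columns of `detailed_crew_information.pqt` are not shown in this notebook, and the IDs, names and helper `ids_to_names` are placeholders for illustration.

```python
import pandas as pd

# Placeholder lookup table; real names would come from the crew dataset.
names = pd.DataFrame(
    {
        "nconst": ["nm0000001", "nm0000002"],
        "primaryName": ["Person One", "Person Two"],
    }
)
name_by_id = dict(zip(names["nconst"], names["primaryName"]))


def ids_to_names(id_string, lookup):
    """Map a comma-separated string of nconst IDs to a list of names.

    Blank entries yield an empty list; unknown IDs are returned unchanged.
    """
    ids = [part.strip() for part in str(id_string).split(",") if part.strip()]
    return [lookup.get(person_id, person_id) for person_id in ids]


print(ids_to_names("nm0000001,nm0000002", name_by_id))
print(ids_to_names("nm0000001,nm9999999", name_by_id))
print(ids_to_names(" ", name_by_id))  # the blank filler used for missing people
```

Applied to the dataset, this could look like `movie_dataset["director"].apply(ids_to_names, lookup=name_by_id)`, which keeps the ID columns intact and produces the names only when they are needed.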