# TMDb Movie prediction

<img src="https://img5.goodfon.com/wallpaper/nbig/c/af/sssssss-aaaaaaaaaaa-ddddddddd-fffffffff-rrrrrrr.jpg"> 

***
# Introduction
This data set contains information about 10,000 movies collected from The
Movie Database (TMDb), including user ratings and revenue.

- Certain columns, like ‘cast’ and ‘genres’, contain multiple values
separated by pipe (|) characters.  
-  The final two columns ending with “_adj” show the budget and revenue of
the associated movie in terms of 2010 dollars, accounting for inflation over
time.
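
As a rough illustration of the pipe-separated layout, the sketch below splits such a column into lists and counts the individual values. The two toy rows and the `genre_list` column name are made up for the example and are not taken from the real file.

```python
import pandas as pd

# Toy rows that mimic the pipe-separated layout (illustrative values only).
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Thriller", "Comedy"],
})

# Split each string on the "|" character into a Python list.
toy["genre_list"] = toy["genres"].str.split("|")

# One row per (movie, genre) pair, which makes counting genres easy.
long_form = toy.explode("genre_list")
genre_counts = long_form["genre_list"].value_counts()
print(genre_counts)
```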

***
# Objectives
1- Filter and clean the columns and rows so that the data is tidy and can be
fed into a linear regression model:
   - remove unnecessary columns and rows;
   - handle NaN values with proper imputation techniques;
   - remove duplicate records;
   - apply feature scaling (normalization) to variables where necessary;
   - convert the categorical columns we use to numerical columns with one-hot
     encoding and label encoding;
   - check that all columns have proper data types.

2- Feed the cleaned data into a linear or polynomial regression model, using
all of our selected columns as the X variables and the net profit
(revenue_adj - budget_adj) as the Y variable.
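
To make the second objective concrete, here is a minimal sketch that compares a linear and a polynomial fit for a net-profit target. It uses a single synthetic feature and NumPy's `polyfit` rather than the full set of selected columns, and every number in it is invented for illustration.

```python
import numpy as np

# Synthetic stand-ins for the real columns (values are made up).
rng = np.random.default_rng(0)
popularity = rng.uniform(0, 10, size=200)
revenue_adj = 5.0 * popularity**2 + 20.0
budget_adj = 2.0 * popularity + 5.0

# Target: net profit = revenue_adj - budget_adj.
profit_adj = revenue_adj - budget_adj

# Degree-1 (linear) and degree-2 (polynomial) least-squares fits.
linear_coeffs = np.polyfit(popularity, profit_adj, deg=1)
poly_coeffs = np.polyfit(popularity, profit_adj, deg=2)


def mse(coeffs):
    """Mean squared error of a polynomial fit on the training data."""
    predictions = np.polyval(coeffs, popularity)
    return float(np.mean((profit_adj - predictions) ** 2))


print("linear MSE:   ", mse(linear_coeffs))
print("quadratic MSE:", mse(poly_coeffs))
```

Because the synthetic target is exactly quadratic in the feature, the degree-2 fit should recover it almost perfectly while the linear fit cannot. On the real data the gap, if any, has to be measured on held-out rows.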

*** 

# Data wrangling

### Importing necessary libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as mp 

### Reading data from the main csv file

In [2]:
df = pd.read_csv('tmdb-movies.csv')

### Displaying the first five rows of the dataset

In [3]:
df.head()

Unnamed: 0,id,imdb_id,popularity,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392446000.0
1,76341,tt1392190,28.419936,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999939.3,348161300.0
2,262500,tt2908446,13.112507,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101199955.5,271619000.0
3,140607,tt2488496,11.173104,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999919.0,1902723000.0
4,168259,tt2820852,9.335014,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799923.1,1385749000.0


In [4]:
df.shape

(10866, 19)

### Formatting

Displaying floats with one decimal place gives a more readable preview of the data, especially for the large budget_adj and revenue_adj values. This only changes how numbers are printed; the stored values are neither rounded nor normalized.

In [5]:
pd.set_option('display.float_format', lambda x: '%.1f' % x)
df.head()

Unnamed: 0,id,imdb_id,popularity,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,33.0,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392445893.0
1,76341,tt1392190,28.4,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999939.3,348161292.5
2,262500,tt2908446,13.1,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101199955.5,271619025.4
3,140607,tt2488496,11.2,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999919.0,1902723130.0
4,168259,tt2820852,9.3,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799923.1,1385748801.0
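
The following sketch separates two things that are easy to confuse: a display format, which only affects printing, and min-max normalization, which actually rescales the data. The two sample values and the variable names are assumptions chosen for the example.

```python
import pandas as pd

values = pd.Series([137999939.28, 6.54321])

# Temporarily render floats with one decimal place.
with pd.option_context("display.float_format", "{:.1f}".format):
    rendered = repr(values)
print(rendered)

# The stored numbers keep their full precision.
print(values.iloc[1] == 6.54321)

# Min-max normalization, by contrast, produces new values in [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized.tolist())
```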


Adding a new column "profit_adj" (revenue_adj minus budget_adj), which will serve as the target variable.

In [6]:
df["profit_adj"]=df["revenue_adj"]-df["budget_adj"]
df.head()

Unnamed: 0,id,imdb_id,popularity,original_title,cast,homepage,director,tagline,keywords,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,profit_adj
0,135397,tt0369610,33.0,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/2015,5562,6.5,2015,137999939.3,1392445893.0,1254445953.7
1,76341,tt1392190,28.4,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/2015,6185,7.1,2015,137999939.3,348161292.5,210161353.2
2,262500,tt2908446,13.1,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/2015,2480,6.3,2015,101199955.5,271619025.4,170419069.9
3,140607,tt2488496,11.2,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/2015,5292,7.5,2015,183999919.0,1902723130.0,1718723211.0
4,168259,tt2820852,9.3,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/2015,2947,7.3,2015,174799923.1,1385748801.0,1210948877.9


In [7]:
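# One-hot encode the pipe-separated genres and append the indicator columns.
# Keyword arguments replace the positional axis, which newer pandas versions reject.
geners_df = pd.concat([df.drop(columns='genres'), df['genres'].str.get_dummies(sep="|")], axis=1)
geners_df.head()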

Unnamed: 0,id,imdb_id,popularity,original_title,cast,homepage,director,tagline,keywords,overview,...,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
0,135397,tt0369610,33.0,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,monster|dna|tyrannosaurus rex|velociraptor|island,Twenty-two years after the events of Jurassic ...,...,0,0,0,0,0,1,0,1,0,0
1,76341,tt1392190,28.4,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,future|chase|post-apocalyptic|dystopia|australia,An apocalyptic story set in the furthest reach...,...,0,0,0,0,0,1,0,1,0,0
2,262500,tt2908446,13.1,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,based on novel|revolution|dystopia|sequel|dyst...,Beatrice Prior must confront her inner demons ...,...,0,0,0,0,0,1,0,1,0,0
3,140607,tt2488496,11.2,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,android|spaceship|jedi|space opera|3d,Thirty years after defeating the Galactic Empi...,...,0,0,0,0,0,1,0,0,0,0
4,168259,tt2820852,9.3,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,car race|speed|revenge|suspense|car,Deckard Shaw seeks revenge against Dominic Tor...,...,0,0,0,0,0,0,0,1,0,0
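
For readers unfamiliar with `str.get_dummies`, this small sketch shows what it produces on three invented rows: one 0/1 indicator column per distinct genre, so a movie with several genres gets several 1s.

```python
import pandas as pd

# Illustrative rows only; the titles are placeholders.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Thriller", "Comedy", "Action"],
})

# One indicator column per genre found in the pipe-separated strings.
indicators = toy["genres"].str.get_dummies(sep="|")

# Replace the original text column with the indicator columns.
encoded = pd.concat([toy.drop(columns="genres"), indicators], axis=1)
print(encoded)
```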


In [8]:
# Correlate numeric columns only; text columns cannot be correlated.
correlations_geners = geners_df.select_dtypes('number').corr()

In [9]:
print(correlations_geners['profit_adj'])

id                -0.1
popularity         0.6
runtime            0.1
vote_count         0.7
vote_average       0.2
release_year      -0.1
budget_adj         0.5
revenue_adj        1.0
profit_adj         1.0
Action             0.1
Adventure          0.2
Animation          0.1
Comedy            -0.0
Crime              0.0
Documentary       -0.1
Drama             -0.1
Family             0.1
Fantasy            0.1
Foreign           -0.0
History           -0.0
Horror            -0.1
Music             -0.0
Mystery           -0.0
Romance           -0.0
Science Fiction    0.1
TV Movie          -0.0
Thriller           0.0
War                0.0
Western           -0.0
Name: profit_adj, dtype: float64


None of the genre indicator columns correlates strongly with profit_adj (every coefficient lies between -0.1 and 0.2), so we will drop the genres column.
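
One way to turn this kind of judgment into a repeatable step is to list the columns whose absolute correlation with the target falls below a cutoff. The sketch below does that on toy data. The `threshold` of 0.2 and the `strong`/`weak` column names are hypothetical choices for the example, not values taken from this analysis.

```python
import pandas as pd

# Toy data: "strong" tracks the target exactly, "weak" is unrelated to it.
toy = pd.DataFrame({
    "profit_adj": [1.0, 2.0, 3.0, 4.0, 5.0],
    "strong": [2.0, 4.0, 6.0, 8.0, 10.0],
    "weak": [1, 0, 1, 0, 1],
    "title": list("abcde"),  # non-numeric, excluded from the correlation
})

threshold = 0.2  # hypothetical cutoff

# Pearson correlation of every numeric column with the target.
corr_with_target = (
    toy.select_dtypes("number").corr()["profit_adj"].drop("profit_adj")
)

# Columns whose absolute correlation is below the cutoff.
weak_cols = corr_with_target[corr_with_target.abs() < threshold].index.tolist()
print(weak_cols)
```

Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a column is useless.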

### Checking for NULL values.

In [10]:
df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   original_title        10866 non-null  object 
 4   cast                  10790 non-null  object 
 5   homepage              2936 non-null   object 
 6   director              10822 non-null  object 
 7   tagline               8042 non-null   object 
 8   keywords              9373 non-null   object 
 9   overview              10862 non-null  object 
 10  runtime               10866 non-null  int64  
 11  genres                10843 non-null  object 
 12  production_companies  9836 non-null   object 
 13  release_date          10866 non-null  object 
 14  vote_count            10866 non-null  int64  
 15  vote_average       

id                         0
imdb_id                   10
popularity                 0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
profit_adj                 0
dtype: int64
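
Raw null counts are easier to compare as a share of all rows. This sketch ranks columns by their fraction of missing values and flags the sparse ones. The toy frame and the `max_missing` cutoff of 0.5 are assumptions made for the example.

```python
import pandas as pd

# Illustrative frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "homepage": [None, None, None, "http://example.com"],
    "cast": ["A|B", None, "C", "D"],
    "runtime": [100, 120, 90, 110],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

max_missing = 0.5  # hypothetical cutoff
too_sparse = missing_share[missing_share > max_missing].index.tolist()
print(too_sparse)
```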

### Dropping rows and columns.

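Columns to be dropped: 
- **homepage, id, imdb_id, original_title**: they are unique to each movie.
- **tagline, keywords, overview, cast, director, production_companies**: free-text or multi-valued columns of little use to the model; homepage, tagline, keywords and production_companies also have a large number of null values (7930, 2824, 1493 and 1030 respectively).
- **genres**: its indicator columns showed only weak correlation with profit_adj.
- **release_date**: we will use the "release_year" as a more general approach instead.
- **budget_adj, revenue_adj**: profit_adj is calculated from them, so they serve no further purpose after that.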

In [11]:
# Columns that will not be used as model features.
colsToBeDropped = ["imdb_id", "homepage", "id", "keywords", "original_title",
                   "director", "production_companies", "genres", "budget_adj",
                   "revenue_adj", "cast", "tagline", "overview", "release_date"]
df.drop(colsToBeDropped, inplace=True, axis=1)
print("First 5 rows after dropping the columns")
df.head()

First 5 rows after dropping the columns


Unnamed: 0,popularity,runtime,vote_count,vote_average,release_year,profit_adj
0,33.0,124,5562,6.5,2015,1254445953.7
1,28.4,120,6185,7.1,2015,210161353.2
2,13.1,119,2480,6.3,2015,170419069.9
3,11.2,136,5292,7.5,2015,1718723211.0
4,9.3,137,2947,7.3,2015,1210948877.9


In [12]:
# The remaining columns contain no nulls, so no rows are removed here (still 10866 rows).
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10866 entries, 0 to 10865
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   popularity    10866 non-null  float64
 1   runtime       10866 non-null  int64  
 2   vote_count    10866 non-null  int64  
 3   vote_average  10866 non-null  float64
 4   release_year  10866 non-null  int64  
 5   profit_adj    10866 non-null  float64
dtypes: float64(3), int64(3)
memory usage: 594.2 KB


Rows to be dropped:
- Duplicate records.
- Rows with null values. Note: it is better to impute nulls than to drop the rows (for example, by filling them with 0s and 1s).
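
A minimal sketch of these two row-level steps on toy data is shown below. It removes exact duplicates with `drop_duplicates` and, instead of dropping the row with a null, fills it with the column median. The `runtime_missing` indicator column is a hypothetical addition that records where a value was imputed.

```python
import numpy as np
import pandas as pd

# Toy frame with one exact duplicate row and one missing runtime.
toy = pd.DataFrame({
    "runtime": [100.0, 100.0, np.nan, 120.0],
    "vote_average": [6.5, 6.5, 7.0, 5.5],
})

# Remove exact duplicate rows.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Flag imputed rows with a 0/1 indicator, then fill with the column median.
deduped["runtime_missing"] = deduped["runtime"].isnull().astype(int)
deduped["runtime"] = deduped["runtime"].fillna(deduped["runtime"].median())
print(deduped)
```

Median imputation is only one option. Which strategy suits a column depends on why its values are missing.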

In [13]:
df.shape

(10866, 6)