# TMDb Movie prediction

<img src="https://img5.goodfon.com/wallpaper/nbig/c/af/sssssss-aaaaaaaaaaa-ddddddddd-fffffffff-rrrrrrr.jpg"> 

***
# Introduction
This data set contains information about 10,000 movies collected from The
Movie Database (TMDb), including user ratings and revenue.

- Certain columns, like ‘cast’ and ‘genres’, contain multiple values
separated by pipe (|) characters.  
-  The final two columns ending with “_adj” show the budget and revenue of
the associated movie in terms of 2010 dollars, accounting for inflation over
time.
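
As a minimal sketch of how the pipe-separated columns can be unpacked, the snippet below splits a `genres`-style column into lists and then into one row per value. The `sample` frame and its values are made up for illustration and are not rows from the actual data set.

```python
import pandas as pd

# Toy rows mimicking the pipe-separated layout (illustrative values only).
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on the pipe character into a list of genres.
sample["genre_list"] = sample["genres"].str.split("|")

# Expand the lists so that every (movie, genre) pair gets its own row.
exploded = sample.explode("genre_list")
print(exploded[["title", "genre_list"]])
```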

***
# Objectives
1- Filter and clean the columns and rows so that the data is tidy and can be fed
into a linear regression model:
   - Remove unnecessary columns and rows.
   - Deal with NaN values using proper imputation techniques.
   - Remove duplicate records.
   - Apply feature scaling (normalization) to variables if necessary.
   - Convert the categorical columns we use to numerical columns with one-hot
     encoding and label encoding.
   - Check that all columns have proper datatypes.

2- Feed the cleaned data into a linear or polynomial regression model, using
all our selected columns as the X variables and the net profit
(revenue_adj - budget_adj) as the Y variable.
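
The two objectives can be sketched end to end on toy data. This is only an illustration under assumptions: the rows are invented, the `genre` and `runtime` feature choice is hypothetical, and an ordinary least-squares fit with NumPy stands in for whichever regression model is eventually used.

```python
import numpy as np
import pandas as pd

# Invented rows; only budget_adj and revenue_adj are column names from the data set.
toy = pd.DataFrame({
    "genre": ["Action", "Drama", "Action", "Comedy", "Drama", "Comedy"],
    "runtime": [120.0, 95.0, 130.0, 100.0, 110.0, 90.0],
    "budget_adj": [100.0, 20.0, 150.0, 30.0, 40.0, 25.0],
    "revenue_adj": [300.0, 25.0, 400.0, 90.0, 70.0, 60.0],
})

# Y variable: net profit = revenue_adj - budget_adj.
y = (toy["revenue_adj"] - toy["budget_adj"]).to_numpy()

# One-hot encode the categorical column; drop_first avoids perfectly collinear dummies.
X = pd.get_dummies(toy[["genre", "runtime"]], columns=["genre"],
                   drop_first=True, dtype=float)

# Min-max normalization of the numeric feature to the range [0, 1].
rt = X["runtime"]
X["runtime"] = (rt - rt.min()) / (rt.max() - rt.min())

# Linear regression via least squares, with an explicit intercept column.
A = np.column_stack([np.ones(len(X)), X.to_numpy()])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

Dropping the first dummy level is a design choice for plain least squares; a regularized model would not necessarily need it.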

*** 

# Data wrangling

### Importing necessary libraries

In [None]:
import pandas as pd
import numpy as np

### Reading data from the main csv file

In [None]:
df = pd.read_csv('tmdb-movies.csv')

### Displaying the first five rows of the dataset

In [None]:
df.head()

### Formatting

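Displaying float numbers with one decimal place to get a more readable preview of the data, especially for the large values in the budget_adj and revenue_adj columns. This only changes how numbers are printed; the stored values are not rounded or normalized.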

In [None]:
pd.set_option('display.float_format', lambda x: '%.1f' % x)
df.head()
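
A small check, on made-up numbers, that `display.float_format` affects only the printed representation. The `demo` frame is hypothetical, and `pd.option_context` is used here so the setting is limited to the `with` block instead of being applied globally.

```python
import pandas as pd

# Hypothetical values chosen to make the rounding visible.
demo = pd.DataFrame({"budget_adj": [1234567.891, 0.049]})

# Apply the one-decimal format only while rendering the frame as text.
with pd.option_context("display.float_format", lambda x: "%.1f" % x):
    shown = demo.to_string()

# The stored value keeps its full precision.
stored = demo["budget_adj"].iloc[0]
print(shown)
print(stored)
```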

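Adding a new column "profit_adj", the inflation-adjusted net profit (revenue_adj minus budget_adj), which will serve as the Y variable.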

In [None]:
df["profit_adj"]=df["revenue_adj"]-df["budget_adj"]
df.head()

### Checking for NULL values.

In [None]:
df.info()
df.isnull().sum()
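
Raw null counts are easier to compare as a share of all rows. The sketch below shows one way to compute that on a toy frame; the frame and its values are invented, and only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Invented rows with some missing entries.
toy = pd.DataFrame({
    "homepage": [None, None, "http://example.com", None],
    "runtime": [100.0, 90.0, np.nan, 120.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```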

### Dropping rows and columns.

Columns to be dropped:
- **homepage, id, imdb_id, original_title**: they are unique to each movie.
- **tagline, cast, director**: they are of little to no importance to the model and also contain a large number of null values.
- **release_date**: we will use "release_year" instead, as a more general approach.
- **budget_adj, revenue_adj**: they are only needed to calculate the profit; after that they serve no purpose.

In [None]:
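# release_date is included because release_year is used instead (see the list above)
colsToBeDropped = ["homepage", "id", "imdb_id", "original_title", "tagline",
                   "cast", "director", "release_date", "budget_adj", "revenue_adj"]
df.drop(colsToBeDropped, inplace=True, axis=1)
print("First 5 rows after dropping the columns")
df.head()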

Rows to be dropped:
- Duplicate rows.
- Rows with null values. Note: it would be better to impute them instead (filling with 0s and 1s, for example).

In [None]:
# Be careful: this reduces the number of rows significantly (from 10866 to 8701)
df.dropna(inplace=True)
df.info()
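
As an alternative to dropping rows, the note above suggests imputation. The sketch below shows one possible approach on a toy frame: the median for a numeric column and a placeholder label for a text column. The `"Unknown"` label and the choice of the median are assumptions for illustration, not decisions made in this notebook.

```python
import numpy as np
import pandas as pd

# Invented rows with missing entries.
toy = pd.DataFrame({
    "runtime": [100.0, np.nan, 120.0],
    "genres": ["Drama", None, "Action|Comedy"],
})

imputed = toy.copy()

# Numeric column: fill gaps with the column median.
imputed["runtime"] = imputed["runtime"].fillna(imputed["runtime"].median())

# Text column: fill gaps with a placeholder category (hypothetical label).
imputed["genres"] = imputed["genres"].fillna("Unknown")

print(imputed)
```

Unlike `dropna`, this keeps every row, at the cost of introducing values that were never observed.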