# **Read and merge datasets**

In [23]:
import pandas as pd

movies = pd.read_csv('/content/drive/MyDrive/tmdb_5000_movies.csv')

credits = pd.read_csv('/content/drive/MyDrive/tmdb_5000_credits.csv')


In [24]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [25]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


In [None]:
credits.head(5)

In [26]:
# Rename movie_id to id so both frames share the join key
credits.columns = ['id', 'title', 'cast', 'crew']
movies = movies.merge(credits, on='id')

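Note 1. The two CSV files are merged on the id column to obtain a single, comprehensive dataset. Because both files contain a title column, the merged frame carries both copies as title_x and title_y.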
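The merge step can be illustrated on toy data. This is a minimal sketch, not the notebook's own code: the frames `movies_demo` and `credits_demo` and their values are made up for illustration, and the `validate="one_to_one"` check is an added assumption that each id appears once in each file.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration).
movies_demo = pd.DataFrame(
    {"id": [1, 2, 3], "title": ["A", "B", "C"], "budget": [100, 200, 300]}
)
credits_demo = pd.DataFrame(
    {"movie_id": [1, 2, 3], "title": ["A", "B", "C"], "cast": ["[]", "[]", "[]"]}
)

# Rename the key so both frames share the column used for the join.
credits_demo = credits_demo.rename(columns={"movie_id": "id"})

# validate="one_to_one" raises a MergeError if either side has duplicate ids.
merged_demo = movies_demo.merge(
    credits_demo, on="id", how="inner", validate="one_to_one"
)

# The shared 'title' column is suffixed as title_x / title_y.
print(merged_demo.columns.tolist())
```

If the two title columns turn out to be identical, one of them could be dropped after the merge; the notebook keeps both.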

# **Data cleaning**

In [27]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

# **Handle null and duplicate rows**

In [28]:
movies.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
title_y                    0
cast                       0
crew                       0
dtype: int64

In [29]:
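# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
movies['homepage'] = movies['homepage'].fillna('No Homepage')
movies['tagline'] = movies['tagline'].fillna('No Tagline')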

In [30]:
movies.dropna(inplace=True)

In [31]:
movies.drop_duplicates(inplace=True)

In [32]:
movies.isnull().sum()

budget                  0
genres                  0
homepage                0
id                      0
keywords                0
original_language       0
original_title          0
overview                0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
tagline                 0
title_x                 0
vote_average            0
vote_count              0
title_y                 0
cast                    0
crew                    0
dtype: int64

Note 2. Two features, homepage (3,091 nulls) and tagline (844 nulls), have a large share of missing values relative to the total number of rows (4,803).
Dropping every row with a null in these columns would therefore discard a large part of the dataset.
Missing values can be handled in several ways, such as text analysis, imputation with a placeholder, feature engineering, or modeling with missing values; the right choice depends on the goal of the project.
Since the goal of this project is to find out whether a film was profitable, which relies mainly on the budget and revenue columns, the columns with many missing values can be deleted in later steps.
For the data-cleaning stage, I therefore applied imputation with a placeholder and filled the null values in these two columns with "No Homepage" and "No Tagline".
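The placeholder imputation described in this note can be sketched on a small frame. The frame `demo` and its values are hypothetical; only the column names and the placeholder strings come from the notebook. Passing a dict to `fillna` is one possible way to fill several columns at once.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the merged dataset.
demo = pd.DataFrame(
    {
        "homepage": ["http://example.com", np.nan, np.nan, np.nan],
        "tagline": ["A tagline", np.nan, "Another tagline", "Third tagline"],
        "budget": [100, 200, 300, 400],
    }
)

# Fraction of missing values per column, to judge whether dropping rows is viable.
null_share = demo.isnull().mean()
print(null_share)

# Fill each sparse text column with its own placeholder.
demo_filled = demo.fillna({"homepage": "No Homepage", "tagline": "No Tagline"})
print(demo_filled)
```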

Moreover, the other columns (overview, release_date, and runtime) contain only a few null values, so I dropped the rows that contain them.
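As a rough illustration of the row-dropping step, the sketch below removes rows with any remaining null and then removes exact duplicates, counting how many rows are lost. The frame `demo` is invented for the example, and the `reset_index` call is an optional addition that the notebook does not use.

```python
import numpy as np
import pandas as pd

# Invented example: one null overview, one null runtime, one duplicated row.
demo = pd.DataFrame(
    {
        "overview": ["text", None, "text", "text"],
        "runtime": [120.0, 90.0, np.nan, 120.0],
        "budget": [100, 200, 300, 100],
    }
)

rows_before = len(demo)

# Drop rows with any null, then exact duplicate rows, and renumber the index.
cleaned = demo.dropna().drop_duplicates().reset_index(drop=True)

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Counting the dropped rows is a cheap way to confirm that only a handful of records are lost, as expected from the null counts shown earlier.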