# IMDb Top 1000 Movies

## Load the dataset and import libraries:

In [1]:
import pandas as pd

# Load the dataset
file_path = 'imdb_movies_Scrape1000.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross
0,The Shawshank Redemption,1994,"Over the course of several years, two convicts...",Drama,9.3,142 min,9.3,82.0,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",2869913,$28.34M
1,The Dark Knight,2008,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152 min,9.0,84.0,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",2851842,$534.86M
2,Inception,2010,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148 min,8.8,74.0,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...",2532959,$292.58M
3,Fight Club,1999,An insomniac office worker and a devil-may-car...,Drama,8.8,139 min,8.8,67.0,David Fincher,"Brad Pitt, Edward Norton, Meat Loaf, Zach Grenier",2305364,$37.03M
4,Pulp Fiction,1994,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154 min,8.9,95.0,Quentin Tarantino,"John Travolta, Uma Thurman, Samuel L. Jackson,...",2204248,$107.93M


# Data Processing and Cleaning:
## Data Cleaning:
### Checking for missing values:

In [2]:
# Step 1: Check for Missing Values
data.isnull().sum()



Title            0
Release year     0
Plot summary     0
Genre            0
Rating           0
Runtime          0
IMDb rating      0
Metascore        5
Director         0
Stars            0
Votes            0
Gross           13
dtype: int64

What was done: Checked for the presence of missing values in each column of the dataset.
Why it was necessary: Identifying missing values is crucial to assess the quality of the data and decide on appropriate handling techniques.
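
As a small illustration of the same check, the sketch below also reports the share of missing values per column, which can help when deciding how to handle them. The `sample` frame and its values are made up for demonstration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Metascore": [82.0, None, 74.0, None],
    "Gross": ["$28.34M", "$534.86M", None, "$37.03M"],
})

# Count of missing values per column
missing_counts = sample.isnull().sum()

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_pct})
print(summary)
```

A column with only a few missing entries, as is the case for 'Metascore' and 'Gross' here, is usually a reasonable candidate for imputation rather than removal.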

### Data Type Conversion

In [3]:
# Step 2: Data Type Conversion
# Convert 'Release year' to numeric (unparseable entries become NaN)
data['Release year'] = pd.to_numeric(data['Release year'], errors='coerce')

# Convert 'IMDb rating' to float
data['IMDb rating'] = pd.to_numeric(data['IMDb rating'], errors='coerce')

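# Convert 'Votes' to integer (cast to str first so this also works if the column was already parsed as numbers)
data['Votes'] = data['Votes'].astype(str).str.replace(',', '', regex=False).astype(int)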



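What was done: Converted the 'Release year' and 'IMDb rating' columns to numeric values with `pd.to_numeric`, and the 'Votes' column to integer after removing thousands separators. Because `errors='coerce'` turns unparseable entries into NaN, 'Release year' is stored as a float (for example 1994.0) whenever any entry fails to parse.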
Why it was necessary: Ensuring that data types are consistent with the data they represent is important for analysis and computation. For example, the release year should be an integer, and ratings should be floating-point numbers.
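
The effect of `errors='coerce'` on the resulting dtype can be seen in a minimal sketch. The `years` values below are hypothetical, with `"(I) 2010"` standing in for a malformed entry, and the nullable `Int64` dtype is shown as one possible way to keep whole-number years, not as the approach used in this notebook.

```python
import pandas as pd

# Hypothetical raw values; "(I) 2010" stands in for a malformed year
years = pd.Series(["1994", "2008", "(I) 2010"])

# errors='coerce' replaces unparseable entries with NaN
coerced = pd.to_numeric(years, errors="coerce")
print(coerced.dtype)  # float64, because NaN cannot be stored in an int64 column

# The nullable Int64 dtype keeps integers and marks the gap as <NA>
nullable = coerced.astype("Int64")
print(nullable)
```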

### Remove Unnecessary Characters:

In [4]:
# Step 3: Remove Unnecessary Characters
# Remove ' min' from 'Runtime' and convert to float
data['Runtime'] = data['Runtime'].str.replace(' min', '', regex=False).astype(float)

# Remove '$', ',' and 'M' from 'Gross', convert to float, and multiply by 1 million
data['Gross'] = data['Gross'].str.replace(r'[$,M]', '', regex=True).astype(float) * 1e6


What was done: Removed non-numeric characters from the 'Runtime' and 'Gross' columns and converted them to appropriate numeric types.
Why it was necessary: Numeric columns containing non-numeric characters cannot be used for mathematical operations. Cleaning these columns allows for accurate calculations and analysis.
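
The same cleaning steps can be tried in isolation on a couple of strings. This is a minimal sketch: `runtime_raw` and `gross_raw` are hypothetical series that only mimic the format of the dataset's columns.

```python
import pandas as pd

# Hypothetical raw strings in the same format as the dataset's columns
runtime_raw = pd.Series(["142 min", "152 min"])
gross_raw = pd.Series(["$28.34M", "$534.86M"])

# Literal replacement, so no regular expression is needed
runtime = runtime_raw.str.replace(" min", "", regex=False).astype(float)

# Raw-string character class: strip '$', ',' and 'M', then scale millions to dollars
gross = gross_raw.str.replace(r"[$,M]", "", regex=True).astype(float) * 1e6

print(runtime)
print(gross)
```

Writing the pattern as a raw string (`r"[$,M]"`) avoids the invalid-escape warnings that `'\$'` and `'\,'` trigger in recent Python versions; inside a character class, `$` and `,` need no escaping.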

### Splitting Columns 

In [5]:
# Step 4: Splitting Columns 
data['Stars'] = data['Stars'].str.split(', ')


What was done: Split the 'Stars' column into a list of individual names.
Why it was necessary: The 'Stars' column contained multiple names in a single string, which is not ideal for analysis. Splitting the names into a list allows for easier manipulation and analysis of individual names.
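
To illustrate why a list per row is easier to work with, the sketch below counts how often each name appears. The `sample` frame and the actor names are placeholders, and `explode` is shown as one possible follow-up step rather than something this notebook performs.

```python
import pandas as pd

# Hypothetical toy frame in the same shape as the 'Stars' column
sample = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Stars": ["Actor One, Actor Two", "Actor Two, Actor Three"],
})

# Turn each comma-separated string into a list of names
sample["Stars"] = sample["Stars"].str.split(", ")

# explode() gives one row per (film, actor) pair, which makes counting easy
per_actor = sample.explode("Stars")
appearances = per_actor["Stars"].value_counts()
print(appearances)
```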

### Handling missing values

In [6]:
# Handling missing values

# For 'Metascore', we'll fill missing values with the median
metascore_median = data['Metascore'].median()
data['Metascore'] = data['Metascore'].fillna(metascore_median)

# For 'Gross', we'll fill missing values with the median
gross_median = data['Gross'].median()
data['Gross'] = data['Gross'].fillna(gross_median)

# Fill any NaN left in 'Release year' by the numeric conversion with the median
release_year_median = data['Release year'].median()
data['Release year'] = data['Release year'].fillna(release_year_median)

# Check if there are any missing values left
data.isnull().sum()


Title           0
Release year    0
Plot summary    0
Genre           0
Rating          0
Runtime         0
IMDb rating     0
Metascore       0
Director        0
Stars           0
Votes           0
Gross           0
dtype: int64

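What was done: Filled missing values in the 'Metascore', 'Gross', and 'Release year' columns with their respective median values. 'Release year' showed no missing values in Step 1, so any gaps there come from entries that `errors='coerce'` could not parse in Step 2.
Why it was necessary: Missing values can distort or break later analyses and computations. Filling them with the median, which is robust to outliers, keeps each column's central tendency intact, although it slightly reduces the column's variance.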
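
The effect of median imputation on a column's centre and spread can be checked on a toy example. This is a minimal sketch: the `scores` values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical scores with two gaps; not taken from the dataset
scores = pd.Series([60.0, 70.0, None, 80.0, None, 90.0])

median = scores.median()        # median of the observed values only
filled = scores.fillna(median)  # assignment-style fill, no inplace needed

# The centre is unchanged, but the spread shrinks after imputation
print(median, filled.median())
print(scores.std(), filled.std())
```

With only a handful of missing entries out of 1000 rows, the reduction in spread is likely to be small, but it is worth keeping in mind when a larger share of a column is imputed.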

### Cleaned data:

In [7]:
# Display the first few rows of the cleaned dataset
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross
0,The Shawshank Redemption,1994.0,"Over the course of several years, two convicts...",Drama,9.3,142.0,9.3,82.0,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",2869913,28340000.0
1,The Dark Knight,2008.0,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152.0,9.0,84.0,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",2851842,534860000.0
2,Inception,2010.0,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148.0,8.8,74.0,Christopher Nolan,"[Leonardo DiCaprio, Joseph Gordon-Levitt, Elli...",2532959,292580000.0
3,Fight Club,1999.0,An insomniac office worker and a devil-may-car...,Drama,8.8,139.0,8.8,67.0,David Fincher,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",2305364,37030000.0
4,Pulp Fiction,1994.0,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154.0,8.9,95.0,Quentin Tarantino,"[John Travolta, Uma Thurman, Samuel L. Jackson...",2204248,107930000.0
