# IMDb Top Movies

## Load the dataset and import libraries:

In [23]:
import pandas as pd

# Load the dataset
file_path = 'movies.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataset
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross
0,The Shawshank Redemption,1994,"Over the course of several years, two convicts...",Drama,9.3,142 min,9.3,82.0,Frank Darabont,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",2869913,$28.34M
1,The Dark Knight,2008,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152 min,9.0,84.0,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",2851842,$534.86M
2,Inception,2010,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148 min,8.8,74.0,Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellio...",2532959,$292.58M
3,Fight Club,1999,An insomniac office worker and a devil-may-car...,Drama,8.8,139 min,8.8,67.0,David Fincher,"Brad Pitt, Edward Norton, Meat Loaf, Zach Grenier",2305364,$37.03M
4,Pulp Fiction,1994,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154 min,8.9,95.0,Quentin Tarantino,"John Travolta, Uma Thurman, Samuel L. Jackson,...",2204248,$107.93M


# Data Processing and Cleaning:
## Data Cleaning:
### Handling missing values:

In [24]:
# Step 1: Check for Missing Values
data.isnull().sum()



Title            0
Release year     0
Plot summary     0
Genre            0
Rating           0
Runtime          0
IMDb rating      0
Metascore       27
Director         0
Stars            0
Votes            0
Gross           39
dtype: int64

What was done: Checked for the presence of missing values in each column of the dataset.
Why it was necessary: Identifying missing values is crucial to assess the quality of the data and decide on appropriate handling techniques.
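Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up frame (the `toy` data and the `summary` layout are illustrative assumptions, not part of the notebook) that reports both:

```python
import pandas as pd

# Toy frame standing in for the movie data (values are illustrative only)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Metascore": [82.0, None, 74.0, None],
    "Gross": ["$28.34M", "$534.86M", None, "$37.03M"],
})

# Count of missing values per column
missing_count = toy.isnull().sum()

# Share of missing values per column, as a percentage
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

A column with only a few percent missing is usually a candidate for dropping rows, while a column that is mostly empty may be better dropped as a whole.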

###  Data Type Conversion

In [25]:
# Step 2: Data Type Conversion
# Convert 'Release year' to a numeric type (unparseable values become NaN)
data['Release year'] = pd.to_numeric(data['Release year'], errors='coerce')

# Convert 'IMDb rating' to float
data['IMDb rating'] = pd.to_numeric(data['IMDb rating'], errors='coerce')

# Remove thousands separators from 'Votes' and convert to integer
data['Votes'] = data['Votes'].str.replace(',', '').astype(int)



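What was done: Converted the 'Release year' and 'IMDb rating' columns to numeric types and the 'Votes' column to integer.
Why it was necessary: Ensuring that data types are consistent with the data they represent is important for analysis and computation. For example, the release year should be a number rather than a string, and ratings should be floating-point numbers. Note that `pd.to_numeric` with `errors='coerce'` returns floats whenever it has to represent an unparseable value as NaN, which is presumably why 'Release year' is displayed as 1994.0 in later outputs.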
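To illustrate the effect of `errors='coerce'` on the resulting dtype, here is a small sketch on a made-up Series (the `years` values are assumptions for illustration). It also shows pandas' nullable `Int64` dtype as one possible way to keep whole-number years when gaps are present:

```python
import pandas as pd

# Illustrative input: two valid years and one unparseable entry
years = pd.Series(["1994", "2008", "unknown"])

# errors='coerce' turns the unparseable entry into NaN, so the result is float64
as_float = pd.to_numeric(years, errors="coerce")

# The nullable Int64 dtype keeps whole numbers and represents the gap as <NA>
as_nullable_int = as_float.astype("Int64")

print(as_float.dtype)
print(as_nullable_int.dtype)
```

Whether the nullable dtype is worth using depends on the later steps; in this notebook the rows with a missing release year are dropped anyway.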

### Remove Unnecessary Characters:

In [26]:
# Step 3: Remove Unnecessary Characters
# Remove ' min' from 'Runtime' and convert to float
data['Runtime'] = data['Runtime'].str.replace(' min', '', regex=False).astype(float)

# Remove '$', ',' and 'M' from 'Gross', convert to float, and multiply by 1 million
data['Gross'] = data['Gross'].str.replace(r'[$,M]', '', regex=True).astype(float) * 1e6


What was done: Removed non-numeric characters from the 'Runtime' and 'Gross' columns and converted them to appropriate numeric types.
Why it was necessary: Numeric columns containing non-numeric characters cannot be used for mathematical operations. Cleaning these columns allows for accurate calculations and analysis.
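The same cleaning can be tried in isolation on a couple of sample strings. This is a sketch under the assumption that every 'Gross' value is expressed in millions with an `M` suffix, as in the rows shown earlier; the `runtime` and `gross` Series are made up for illustration:

```python
import pandas as pd

# Illustrative inputs, including one missing 'Gross' value
runtime = pd.Series(["142 min", "152 min"])
gross = pd.Series(["$28.34M", "$534.86M", float("nan")])

# Literal (non-regex) replacement of the unit suffix
runtime_minutes = runtime.str.replace(" min", "", regex=False).astype(float)

# Inside a character class, '$', ',' and 'M' are matched literally;
# the raw string avoids invalid-escape warnings
gross_dollars = gross.str.replace(r"[$,M]", "", regex=True).astype(float) * 1e6

print(runtime_minutes.tolist())
print(gross_dollars.tolist())
```

Missing values pass through the string replacement unchanged and stay NaN after the conversion, so they can still be handled in a later step.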

### Splitting Columns 

In [27]:
# Step 4: Splitting Columns 
data['Stars'] = data['Stars'].str.split(', ')


What was done: Split the 'Stars' column into a list of individual names.
Why it was necessary: The 'Stars' column contained multiple names in a single string, which is not ideal for analysis. Splitting the names into a list allows for easier manipulation and analysis of individual names.
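One thing the list form makes possible is counting appearances per actor. A minimal sketch, using placeholder titles and actor names rather than the real data:

```python
import pandas as pd

# Placeholder data in the same shape as the 'Stars' column
toy = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Stars": ["Actor One, Actor Two", "Actor Two, Actor Three"],
})
toy["Stars"] = toy["Stars"].str.split(", ")

# explode() gives each list element its own row, repeating the other columns
per_actor = toy.explode("Stars")

# Number of films each actor appears in
appearances = per_actor["Stars"].value_counts()
print(appearances)
```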

### Handling missing values

In [28]:
# Handling missing values

# Drop rows with missing values in 'Metascore', 'Gross', or 'Release year'
data.dropna(subset=['Metascore', 'Gross', 'Release year'], inplace=True)

# Check if there are any missing values left
data.isnull().sum()



Title           0
Release year    0
Plot summary    0
Genre           0
Rating          0
Runtime         0
IMDb rating     0
Metascore       0
Director        0
Stars           0
Votes           0
Gross           0
dtype: int64

What was done:
Rows with missing values in the 'Metascore', 'Gross', or 'Release year' columns were dropped from the dataset.

Why it was necessary:
Dropping rows with missing values in these key columns leaves a complete dataset for the subsequent analysis or modeling and avoids the assumptions that imputation would introduce. The trade-off is a smaller dataset, which can be skewed if the values are not missing at random.
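Because dropping rows discards data, it can help to record how many rows are lost. The sketch below uses a made-up frame (`toy` and its values are assumptions) and the non-inplace form of `dropna` so the original stays available for comparison:

```python
import pandas as pd

# Illustrative frame with one gap in each of two columns
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Metascore": [82.0, None, 74.0, 67.0],
    "Gross": [28.34e6, 534.86e6, None, 37.03e6],
})

rows_before = len(toy)

# Keep only rows where every column listed in `subset` is present
cleaned = toy.dropna(subset=["Metascore", "Gross"])

rows_dropped = rows_before - len(cleaned)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```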

### Cleaned data:

In [29]:
# Display the cleaned dataset 
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross
0,The Shawshank Redemption,1994.0,"Over the course of several years, two convicts...",Drama,9.3,142.0,9.3,82.0,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",2869913,28340000.0
1,The Dark Knight,2008.0,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152.0,9.0,84.0,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",2851842,534860000.0
2,Inception,2010.0,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148.0,8.8,74.0,Christopher Nolan,"[Leonardo DiCaprio, Joseph Gordon-Levitt, Elli...",2532959,292580000.0
3,Fight Club,1999.0,An insomniac office worker and a devil-may-car...,Drama,8.8,139.0,8.8,67.0,David Fincher,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",2305364,37030000.0
4,Pulp Fiction,1994.0,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154.0,8.9,95.0,Quentin Tarantino,"[John Travolta, Uma Thurman, Samuel L. Jackson...",2204248,107930000.0


## Preprocessing: 
### Normalization:

In [30]:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object with the custom range (1 to 10)
custom_scaler = MinMaxScaler(feature_range=(1, 10))

# Normalize the specified columns in the original 'data' DataFrame
columns_to_normalize = ['Metascore', 'Votes', 'Gross']
data[columns_to_normalize] = custom_scaler.fit_transform(data[columns_to_normalize])
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross
0,The Shawshank Redemption,1994.0,"Over the course of several years, two convicts...",Drama,9.3,142.0,9.3,8.21978,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",10.0,1.272308
1,The Dark Knight,2008.0,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152.0,9.0,8.417582,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",9.941789,6.139261
2,Inception,2010.0,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148.0,8.8,7.428571,Christopher Nolan,"[Leonardo DiCaprio, Joseph Gordon-Levitt, Elli...",8.914598,3.811287
3,Fight Club,1999.0,An insomniac office worker and a devil-may-car...,Drama,8.8,139.0,8.8,6.736264,David Fincher,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",8.181464,1.355807
4,Pulp Fiction,1994.0,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154.0,8.9,9.505495,Quentin Tarantino,"[John Travolta, Uma Thurman, Samuel L. Jackson...",7.855748,2.037057


What was done: The 'Metascore', 'Votes', and 'Gross' columns were normalized to a scale of 1 to 10.

Why it was necessary: Normalization puts these columns on a consistent scale for comparison and visualization. IMDb ratings fall within the 1 to 10 range, so mapping 'Metascore', 'Votes', and 'Gross' onto the same range makes it easier to gauge the size of these variables relative to each other.
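Min-max scaling to a custom range maps each value x to low + (x - min) / (max - min) * (high - low). The sketch below applies that formula directly with pandas; `scale_to_range` is a hypothetical helper written for illustration, not a scikit-learn function, and it assumes the column is not constant (otherwise the denominator is zero):

```python
import pandas as pd


def scale_to_range(values: pd.Series, low: float = 1.0, high: float = 10.0) -> pd.Series:
    """Min-max scale a numeric Series into [low, high].

    Assumes values.max() != values.min().
    """
    span = values.max() - values.min()
    return low + (values - values.min()) / span * (high - low)


# Illustrative vote counts (not taken from the dataset)
votes = pd.Series([500_000, 1_000_000, 2_000_000])
scaled = scale_to_range(votes)
print(scaled.tolist())
```

The smallest value always lands on 1 and the largest on 10, so the scaled numbers describe a movie's position within this dataset rather than an absolute quantity.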

###  Discretization:

In [31]:


# Drop rows with NaN values in 'IMDb rating' column
data = data.dropna(subset=['IMDb rating'])

# Define the bins and labels for discretization
bins = [0, 6.9, 8.4, 10]
labels = ['Low', 'Medium', 'High']

# Discretize the 'IMDb rating' column
data['Rating Category'] = pd.cut(data['IMDb rating'], bins=bins, labels=labels, include_lowest=True)

# Display the updated dataset with the new 'Rating Category' column
data.head()

Unnamed: 0,Title,Release year,Plot summary,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross,Rating Category
0,The Shawshank Redemption,1994.0,"Over the course of several years, two convicts...",Drama,9.3,142.0,9.3,8.21978,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",10.0,1.272308,High
1,The Dark Knight,2008.0,When the menace known as the Joker wreaks havo...,"Action, Crime, Drama",9.0,152.0,9.0,8.417582,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",9.941789,6.139261,High
2,Inception,2010.0,A thief who steals corporate secrets through t...,"Action, Adventure, Sci-Fi",8.8,148.0,8.8,7.428571,Christopher Nolan,"[Leonardo DiCaprio, Joseph Gordon-Levitt, Elli...",8.914598,3.811287,High
3,Fight Club,1999.0,An insomniac office worker and a devil-may-car...,Drama,8.8,139.0,8.8,6.736264,David Fincher,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",8.181464,1.355807,High
4,Pulp Fiction,1994.0,"The lives of two mob hitmen, a boxer, a gangst...","Crime, Drama",8.9,154.0,8.9,9.505495,Quentin Tarantino,"[John Travolta, Uma Thurman, Samuel L. Jackson...",7.855748,2.037057,High


What was done: Discretized 'IMDb rating' into three categories: 'Low' for ratings below 7.0, 'Medium' for ratings between 7.0 and 8.4, and 'High' for ratings 8.5 and above (ratings are reported to one decimal place).
Why it was necessary: Categories give a simplified representation of the ratings, which makes analysis and comparison easier and lets us identify patterns within each category.
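The bin edges are easiest to check on values that sit right at the boundaries. A short sketch with made-up ratings, using the same `bins` and `labels` as the cell above:

```python
import pandas as pd

# Ratings chosen to sit on and around the bin edges
ratings = pd.Series([6.9, 7.0, 8.4, 8.5, 9.3])
bins = [0, 6.9, 8.4, 10]
labels = ["Low", "Medium", "High"]

# Intervals are closed on the right: [0, 6.9], (6.9, 8.4], (8.4, 10]
categories = pd.cut(ratings, bins=bins, labels=labels, include_lowest=True)

print(categories.tolist())
# ['Low', 'Medium', 'Medium', 'High', 'High']
```

Because each interval includes its right edge, 6.9 falls in 'Low' and 8.4 in 'Medium', which matches the category descriptions for ratings with one decimal place.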

###  Feature Selection:

In [32]:
data.drop(columns=['Plot summary'], inplace=True)
data.head()

Unnamed: 0,Title,Release year,Genre,Rating,Runtime,IMDb rating,Metascore,Director,Stars,Votes,Gross,Rating Category
0,The Shawshank Redemption,1994.0,Drama,9.3,142.0,9.3,8.21978,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",10.0,1.272308,High
1,The Dark Knight,2008.0,"Action, Crime, Drama",9.0,152.0,9.0,8.417582,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",9.941789,6.139261,High
2,Inception,2010.0,"Action, Adventure, Sci-Fi",8.8,148.0,8.8,7.428571,Christopher Nolan,"[Leonardo DiCaprio, Joseph Gordon-Levitt, Elli...",8.914598,3.811287,High
3,Fight Club,1999.0,Drama,8.8,139.0,8.8,6.736264,David Fincher,"[Brad Pitt, Edward Norton, Meat Loaf, Zach Gre...",8.181464,1.355807,High
4,Pulp Fiction,1994.0,"Crime, Drama",8.9,154.0,8.9,9.505495,Quentin Tarantino,"[John Travolta, Uma Thurman, Samuel L. Jackson...",7.855748,2.037057,High


What was done:
The 'Plot summary' column was dropped from the dataset.

Why it was necessary:
The 'Plot summary' column is free text that was deemed not helpful for the analysis or modeling objectives. Removing it simplifies the dataset and keeps the focus on more relevant features.
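One practical note for notebooks: re-running a cell that drops a column in place raises a `KeyError` the second time, because the column no longer exists. A possible way around this, sketched on a made-up frame, is to pass `errors='ignore'`:

```python
import pandas as pd

# Illustrative frame with the column to be removed
toy = pd.DataFrame({"Title": ["A"], "Plot summary": ["..."], "Genre": ["Drama"]})

# errors='ignore' makes the drop a no-op when the column is already gone
trimmed = toy.drop(columns=["Plot summary"], errors="ignore")
trimmed_again = trimmed.drop(columns=["Plot summary"], errors="ignore")

print(list(trimmed.columns))
```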