# Data Discovery: ML

The goal is to use features such as Director, Top 5 Casts, pg_val, Writer, Revenue, and others to predict the imdb_score of a movie. These features are spread across different files, so we need to find the join paths and join the tables using similarity techniques.

Process to be followed:
- Finding the join paths between the tables
- Joining the tables to create the final training data
- Feature engineering to create new features
- Data cleaning on feature columns
- Splitting the data into training and test sets
- Training a regression model (any model of your choice)
- Evaluating predictions through mean squared error



In [1]:
'''

These are the datasets I found that are useful for predicting the movie's rating.
These are some of the insights gained from observing the data.

I will be focusing on getting these values cleaned into my final table.
1) Movie name -> values.csv, values1.csv
2) Director -> information.csv
3) Top_5 casts -> information.csv
4) pg_val -> productions.csv
5) Revenue -> income.csv
6) Popularity -> values1.csv
7) runtime -> income.csv
8) IMDb score -> values.csv, values1.csv

income.csv -> revenue, budget, runtime, popularity, overview (description); revenue and budget need cleaning
information.csv -> info, Director, Top 5 Casts
productions.csv -> films, pg_val, tagline
values.csv -> names, imdb_score
values1.csv -> title, imdb_score, popularity

'''

import pandas as pd

income_df = pd.read_csv('data/income.csv')
info_df = pd.read_csv('data/information.csv')
prod_df = pd.read_csv('data/productions.csv')
val_df = pd.read_csv('data/values.csv')
val1_df = pd.read_csv('data/values1.csv')
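Before committing to a join path, it can help to check how much two candidate key columns overlap. The sketch below is not part of the original notebook: it uses a simple Jaccard overlap on normalized values (a cheaper technique than the TF-IDF matching used in the later cells), and the helper name `column_overlap` and the toy series are hypothetical.

```python
import pandas as pd

def column_overlap(a: pd.Series, b: pd.Series) -> float:
    """Jaccard overlap of two columns after lowercasing and trimming (hypothetical helper)."""
    set_a = set(a.dropna().astype(str).str.lower().str.strip())
    set_b = set(b.dropna().astype(str).str.lower().str.strip())
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Toy stand-ins for val_df['names'] and val1_df['title'] (made-up values)
names = pd.Series(["Top Gun", "Lightyear", "Spiderhead"])
titles = pd.Series(["top gun ", "lightyear", "delicatessen"])

overlap = column_overlap(names, titles)
print(f"Jaccard overlap: {overlap:.2f}")
```

A high overlap between two columns suggests they are a plausible join key pair; a low overlap suggests looking for another path.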

In [2]:
'''
I will create an empty dataframe -> IMDB_df
'''

IMDB_df = pd.DataFrame()

In [3]:
'''
Adding all the names and imdb_score columns from val_df to my IMDB_df data frame
'''
IMDB_df['Movie titles'] = val_df['names']
IMDB_df['score1'] = val_df['imdb_score']

IMDB_df

Unnamed: 0,Movie titles,score1
0,Top Gun: Maverick,8.6
1,Jurassic World Dominion,6
2,Top Gun,6.9
3,Lightyear,5.2
4,Spiderhead,5.4
...,...,...
24397,Delicatessen,7.6
24398,Bitch Ass,5.5
24399,Bullwhip,5.1
24400,The Freshman,6.4


In [4]:

'''
Compare 'Movie titles' in IMDB_df with 'title' in val1_df using TF-IDF cosine similarity. When the best match for a
title has a similarity score of (almost) 1, the 'imdb_score' and 'popularity' values from val1_df are joined to IMDB_df.

'''

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Normalize and prepare the data
IMDB_df['Movie titles'] = IMDB_df['Movie titles'].fillna('unknown').str.lower().str.strip()
val1_df['title'] = val1_df['title'].fillna('unknown').str.lower().str.strip()

# Vectorize the titles
vectorizer = TfidfVectorizer()
IMDB_titles_vectorized = vectorizer.fit_transform(IMDB_df['Movie titles'])
val1_titles_vectorized = vectorizer.transform(val1_df['title'])

# Using the cosine similarity
cosine_similarities = cosine_similarity(IMDB_titles_vectorized, val1_titles_vectorized)
mean_similarity = cosine_similarities.mean()

# Mean similarity over all pairs of 'Movie titles' and 'title'. Most pairs are unrelated titles, so this value
# is expected to be close to 0; the per-row maximum computed below is what decides whether a title has a match.
print(f"Average Cosine Similarity between 'Movie titles' and 'title': {mean_similarity:.4f}")

# Find the best match from val1_df for each title in IMDB_df based on similarity score
max_sim_indices = np.argmax(cosine_similarities, axis=1)
max_sim_scores = np.max(cosine_similarities, axis=1)

matches_df = pd.DataFrame({
    'IMDB_index': range(len(IMDB_df)),
    'val1_index': max_sim_indices,
    'similarity_score': max_sim_scores
})

# Filter matches with a similarity score above a threshold
threshold = 0.999
filtered_matches_df = matches_df[matches_df['similarity_score'] >= threshold]

# Merge to bring the 'title' from val1_df to matches_df
filtered_matches_df = filtered_matches_df.merge(val1_df[['title', 'imdb_score', 'popularity']], 
                                                left_on='val1_index', right_index=True, how='left')

# Merge the matches back to IMDB_df to add the 'imdb_score' and 'popularity'
IMDB_df = IMDB_df.merge(filtered_matches_df[['IMDB_index', 'imdb_score', 'popularity']], 
                        left_index=True, right_on='IMDB_index', how='left')

# Dropping Index column
IMDB_df.drop(['IMDB_index'], axis=1, inplace=True)

# Verify the results
print(IMDB_df.head())

Average Cosine Similarity between 'Movie titles' and 'title': 0.0039
                Movie titles score1  imdb_score  popularity
0.0        top gun: maverick    8.6        6.21     126.291
1.0  jurassic world dominion      6        2.95     113.897
2.0                  top gun    6.9        4.66      50.029
3.0                lightyear    5.2        6.82      53.066
4.0               spiderhead    5.4        4.26      26.153
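To make the matching step easier to follow, here is a minimal sketch on toy titles (the lists are made up, not taken from the data files). It shows why the mean over all pairs is small while the per-row maximum identifies the match, and that TF-IDF ignores punctuation and word order, so "top gun: maverick" and "top gun maverick" score as identical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles standing in for IMDB_df['Movie titles'] and val1_df['title']
left = ["top gun: maverick", "top gun", "lightyear"]
right = ["lightyear", "top gun maverick", "spiderhead", "top gun"]

vectorizer = TfidfVectorizer()
left_vec = vectorizer.fit_transform(left)   # vocabulary is learned from the left side only
right_vec = vectorizer.transform(right)     # unseen words ("spiderhead") become zero vectors

sims = cosine_similarity(left_vec, right_vec)  # shape (3, 4): one row per left title
best_idx = sims.argmax(axis=1)                 # position of the best match in `right`
best_score = sims.max(axis=1)                  # similarity of that best match

for title, idx, score in zip(left, best_idx, best_score):
    print(f"{title!r} -> {right[idx]!r} (score {score:.3f})")
print(f"mean over all pairs: {sims.mean():.3f}")
```

Because the threshold is applied to the per-row maximum, a low overall mean does not by itself say anything about match quality.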


In [5]:
''' 
I will take the average of both IMDb scores, keep it in a single column called IMDB_score, and
drop the two source columns because they are no longer needed.
'''

IMDB_df['score1'] = pd.to_numeric(IMDB_df['score1'], errors='coerce')
IMDB_df['imdb_score'] = pd.to_numeric(IMDB_df['imdb_score'], errors='coerce')

# Calculate the row-wise average of 'score1' and 'imdb_score'
IMDB_df['IMDB_score'] = IMDB_df[['score1', 'imdb_score']].mean(axis=1)

# Drop the 'score1' and 'imdb_score' columns
IMDB_df = IMDB_df.drop(columns=['score1', 'imdb_score'])

# Display the updated DataFrame
print(IMDB_df.head())

                Movie titles  popularity  IMDB_score
0.0        top gun: maverick     126.291       7.405
1.0  jurassic world dominion     113.897       4.475
2.0                  top gun      50.029       5.780
3.0                lightyear      53.066       6.010
4.0               spiderhead      26.153       4.830
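One detail of the averaging step is worth spelling out: `DataFrame.mean(axis=1)` skips missing values by default, so a row with only one usable score keeps that score instead of becoming NaN. A small sketch with made-up rows (the first row reuses the two Top Gun: Maverick scores shown above):

```python
import pandas as pd

scores = pd.DataFrame({
    "score1": ["8.6", "6", "n/a"],     # strings, as read from a CSV with mixed content
    "imdb_score": [6.21, None, 4.0],
})

# Unparseable entries such as "n/a" become NaN
scores["score1"] = pd.to_numeric(scores["score1"], errors="coerce")

# Row-wise mean; NaN entries are ignored (skipna=True is the default)
scores["IMDB_score"] = scores[["score1", "imdb_score"]].mean(axis=1)
print(scores)
```

The first row averages to 7.405, while the second and third rows fall back to the single available score.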


In [6]:
'''
Now, I will be joining Director and top_5 casts from info_df to IMDB_df based on comparing the similarity score between
'Movie titles' from IMDB_df and 'info' from info_df.

'''
# Normalize and prepare the data
IMDB_df['Movie titles'] = IMDB_df['Movie titles'].str.lower().str.strip()
info_df['info'] = info_df['info'].str.lower().str.strip()

# Vectorize the titles
vectorizer = TfidfVectorizer()
IMDB_titles_vectorized = vectorizer.fit_transform(IMDB_df['Movie titles'])
info_titles_vectorized = vectorizer.transform(info_df['info'])

# Calculate cosine similarities between all pairs of titles
cosine_similarities = cosine_similarity(IMDB_titles_vectorized, info_titles_vectorized)
mean_similarity = cosine_similarities.mean()
print(f"Average Cosine Similarity between 'Movie titles' and 'title': {mean_similarity:.4f}")

# Determine the best match from info_df for each title in IMDB_df based on the highest similarity score
max_sim_indices = np.argmax(cosine_similarities, axis=1)
max_sim_scores = np.max(cosine_similarities, axis=1)

# Create an intermediary DataFrame to hold the best matches with similarity scores
matches_df = pd.DataFrame({
    'IMDB_index': range(len(IMDB_df)),
    'info_index': max_sim_indices,
    'similarity_score': max_sim_scores
})

# Filter matches with a similarity score above a threshold
threshold = 0.999  # Adjust as necessary
filtered_matches_df = matches_df[matches_df['similarity_score'] >= threshold]

# Merge to bring the 'Director' and 'Top 5 Casts' from info_df to matches_df
filtered_matches_df = filtered_matches_df.merge(
    info_df[['Director', 'Top 5 Casts']], 
    left_on='info_index', 
    right_index=True, 
    how='left'
)

# Merge the matches back to IMDB_df to add the 'Director' and 'Top 5 Casts'
IMDB_df = IMDB_df.merge(
    filtered_matches_df[['IMDB_index', 'Director', 'Top 5 Casts']], 
    left_index=True, 
    right_on='IMDB_index', 
    how='left'
)

# Drop auxiliary index column if added by the merging process
IMDB_df.drop(['IMDB_index'], axis=1, inplace=True)

print(IMDB_df)

Average Cosine Similarity between 'Movie titles' and 'title': 0.0039
                    Movie titles  popularity  IMDB_score  \
0.0            top gun: maverick     126.291       7.405   
1.0      jurassic world dominion     113.897       4.475   
2.0                      top gun      50.029       5.780   
3.0                    lightyear      53.066       6.010   
4.0                   spiderhead      26.153       4.830   
...                          ...         ...         ...   
24397.0             delicatessen      14.407       6.835   
24398.0                bitch ass       3.706       7.625   
24399.0                 bullwhip       3.457       3.835   
24400.0             the freshman      15.218       6.075   
24401.0           guys and dolls       8.390       6.985   

                     Director  \
0.0           Joseph Kosinski   
1.0           Colin Trevorrow   
2.0                Tony Scott   
3.0             Angus MacLane   
4.0           Joseph Kosinski   
...         
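The float row labels (0.0, 1.0, ...) in the printed frames appear to be a side effect of merging with `left_index=True, right_on='IMDB_index'`: the result takes its labels from the right-hand frame, and unmatched rows end up with a missing label. As a hedged alternative, not what the notebook does, the matches can be indexed by their target row and attached with `DataFrame.join`, which keeps the left frame's own index. The frames below are toy data.

```python
import pandas as pd

base = pd.DataFrame({"Movie titles": ["top gun", "lightyear", "spiderhead"]})

# Matches found for rows 0 and 2 only; row 1 has no match above the threshold
matches = pd.DataFrame({
    "IMDB_index": [0, 2],
    "Director": ["Tony Scott", "Joseph Kosinski"],
})

# Index the matches by the row they belong to, then join on the index.
# join() performs a left join by default and preserves base's integer index.
joined = base.join(matches.set_index("IMDB_index"))
print(joined)
```

With an integer index preserved after every join, `range(len(IMDB_df))` in the following cells stays aligned with the row labels.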

In [7]:
''' 
Now I will split the 'Top 5 Casts' list into separate columns. For this, I convert the column to a string, strip
the brackets and quotes, and split on commas so that each name goes into its own column. Example value:
['Jack Epps Jr.', 'Peter Craig', 'Tom Cruise', 'Jennifer Connelly', 'Miles Teller']

'''
IMDB_df['Top 5 Casts'] = IMDB_df['Top 5 Casts'].astype(str)

# Remove square brackets and quotes, then split by comma followed by a space
IMDB_df['Top 5 Casts'] = IMDB_df['Top 5 Casts'].str.strip("[]").str.replace("'", "").str.split(', ')

# Ensure each list has exactly 5 elements: pad shorter lists with empty strings and truncate longer ones
IMDB_df['Top 5 Casts'] = IMDB_df['Top 5 Casts'].apply(lambda x: (x + [''] * 5)[:5])

# Expanding these lists into separate columns
columns = ['Top Cast 1', 'Top Cast 2', 'Top Cast 3', 'Top Cast 4', 'Top Cast 5']
IMDB_df[columns] = pd.DataFrame(IMDB_df['Top 5 Casts'].tolist(), index=IMDB_df.index)

IMDB_df.drop(['Top 5 Casts'], axis=1, inplace=True)
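Stripping brackets and removing every apostrophe works for most rows, but names that contain an apostrophe lose it and keep their surrounding double quotes. A possible alternative, sketched here under the assumption that each cell holds a Python-style list literal, is to parse the string with `ast.literal_eval`. The helper name `parse_cast` and the sample names are hypothetical.

```python
import ast

def parse_cast(raw, width=5):
    """Parse a list literal such as "['A', 'B']" into exactly `width` names (hypothetical helper)."""
    try:
        names = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        names = []                      # e.g. the string 'nan' for a missing value
    if not isinstance(names, list):
        names = []
    names = [str(name) for name in names]
    # Pad with empty strings, then cut to the requested width
    return (names + [""] * width)[:width]

print(parse_cast("['Tom Cruise', \"D'Arcy Example\"]"))
print(parse_cast("nan"))
```

The result could then be expanded into columns the same way as above, for example with `IMDB_df['Top 5 Casts'].apply(parse_cast)`.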


In [8]:
IMDB_df

Unnamed: 0,Movie titles,popularity,IMDB_score,Director,Top Cast 1,Top Cast 2,Top Cast 3,Top Cast 4,Top Cast 5
0.0,top gun: maverick,126.291,7.405,Joseph Kosinski,Jack Epps Jr.,Peter Craig,Tom Cruise,Jennifer Connelly,Miles Teller
1.0,jurassic world dominion,113.897,4.475,Colin Trevorrow,Colin Trevorrow,Derek Connolly,Chris Pratt,Bryce Dallas Howard,Laura Dern
2.0,top gun,50.029,5.780,Tony Scott,Jack Epps Jr.,Ehud Yonay,Tom Cruise,Tim Robbins,Kelly McGillis
3.0,lightyear,53.066,6.010,Angus MacLane,Jason Headley,Matthew Aldrich,Chris Evans,Keke Palmer,Peter Sohn
4.0,spiderhead,26.153,4.830,Joseph Kosinski,Rhett Reese,Paul Wernick,Chris Hemsworth,Miles Teller,Jurnee Smollett
...,...,...,...,...,...,...,...,...,...
24397.0,delicatessen,14.407,6.835,Marc Caro,Jean-Pierre Jeunet,Marc Caro,Gilles Adrien,Marie-Laure Dougnac,Dominique Pinon
24398.0,bitch ass,3.706,7.625,Bill Posley,Bill Posley,Teon Kelley,Tunde Laleye,"""Melisa Sellers""",Bill Posley
24399.0,bullwhip,3.457,3.835,Harmon Jones,Guy Madison,Rhonda Fleming,James Griffith,Harmon Jones,Adele Buffington
24400.0,the freshman,15.218,6.075,Andrew Bergman,Marlon Brando,Matthew Broderick,Bruno Kirby,Andrew Bergman,Andrew Bergman


In [9]:
'''
Now I will be joining pg_val and tagline from prod_df to IMDB_df by comparing columns which are 'films' from prod_df to 'Movie titles'
from IMDB_df using cosine similarity.
'''

In [10]:
# Normalize and prepare the data
IMDB_df['Movie titles'] = IMDB_df['Movie titles'].str.lower().str.strip()
prod_df['films'] = prod_df['films'].fillna('unknown').str.lower().str.strip()

# Vectorize the titles
vectorizer = TfidfVectorizer()
IMDB_titles_vectorized = vectorizer.fit_transform(IMDB_df['Movie titles'])
prod_films_vectorized = vectorizer.transform(prod_df['films'])

# Calculate cosine similarities between all pairs of titles
cosine_similarities = cosine_similarity(IMDB_titles_vectorized, prod_films_vectorized)

mean_similarity = cosine_similarities.mean()
print(f"Average Cosine Similarity between 'Movie titles' and 'title': {mean_similarity:.4f}")

# Determine the best match from prod_df for each title in IMDB_df based on the highest similarity score
max_sim_indices = np.argmax(cosine_similarities, axis=1)
max_sim_scores = np.max(cosine_similarities, axis=1)

# Create an intermediary DataFrame to hold the best matches with similarity scores
matches_df = pd.DataFrame({
    'IMDB_index': range(len(IMDB_df)),
    'prod_index': max_sim_indices,
    'similarity_score': max_sim_scores
})

# Filter matches with a similarity score above a threshold
threshold = 0.999  # Adjust as necessary
filtered_matches_df = matches_df[matches_df['similarity_score'] >= threshold]

# Merge to bring the 'pg_val' from prod_df to matches_df
filtered_matches_df = filtered_matches_df.merge(
    prod_df[['pg_val','tagline']], 
    left_on='prod_index', 
    right_index=True, 
    how='left'
)

# Merge the matches back to IMDB_df to add the 'pg_val'
IMDB_df = IMDB_df.merge(
    filtered_matches_df[['IMDB_index', 'pg_val','tagline']], 
    left_index=True, 
    right_on='IMDB_index', 
    how='left'
)

# Drop auxiliary index column if added by the merging process
IMDB_df.drop(['IMDB_index'], axis=1, inplace=True)

# Print the updated DataFrame to verify the results
print(IMDB_df.head())

Average Cosine Similarity between 'Movie titles' and 'title': 0.0039
                Movie titles  popularity  IMDB_score         Director  \
0.0        top gun: maverick     126.291       7.405  Joseph Kosinski   
1.0  jurassic world dominion     113.897       4.475  Colin Trevorrow   
2.0                  top gun      50.029       5.780       Tony Scott   
3.0                lightyear      53.066       6.010    Angus MacLane   
4.0               spiderhead      26.153       4.830  Joseph Kosinski   

          Top Cast 1       Top Cast 2       Top Cast 3           Top Cast 4  \
0.0    Jack Epps Jr.      Peter Craig       Tom Cruise    Jennifer Connelly   
1.0  Colin Trevorrow   Derek Connolly      Chris Pratt  Bryce Dallas Howard   
2.0    Jack Epps Jr.       Ehud Yonay       Tom Cruise          Tim Robbins   
3.0    Jason Headley  Matthew Aldrich      Chris Evans          Keke Palmer   
4.0      Rhett Reese     Paul Wernick  Chris Hemsworth         Miles Teller   

          Top Cas

In [11]:
# Standardize True/False values in the pg_val column
true_values = ['T', 'True']
false_values = ['F', 'False']

# Use a mapping function to convert; unrecognized values are left unchanged
IMDB_df['pg_val'] = IMDB_df['pg_val'].apply(lambda x: True if x in true_values else (False if x in false_values else x))

# Calculate the mode (most frequent value) in the column
mode_value = IMDB_df['pg_val'].mode()[0]

# Fill missing values with the mode (assigning back instead of calling fillna in place on a column)
IMDB_df['pg_val'] = IMDB_df['pg_val'].fillna(mode_value)
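The same standardization can be written with a dictionary and `Series.map`, shown here on a toy series. Note one difference from the cell above, which is an assumption of this sketch rather than the notebook's behavior: `map` turns any value that is not in the dictionary into NaN, so unrecognized values are also filled with the mode.

```python
import pandas as pd

raw = pd.Series(["T", "False", "F", None, "True", "F"])  # made-up pg_val values

mapping = {"T": True, "True": True, "F": False, "False": False}
clean = raw.map(mapping)              # values outside the mapping become NaN

mode_value = clean.mode()[0]          # most frequent non-missing value
clean = clean.fillna(mode_value).astype(bool)
print(clean.tolist())
```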

In [12]:
IMDB_df

Unnamed: 0,Movie titles,popularity,IMDB_score,Director,Top Cast 1,Top Cast 2,Top Cast 3,Top Cast 4,Top Cast 5,pg_val,tagline
0.0,top gun: maverick,126.291,7.405,Joseph Kosinski,Jack Epps Jr.,Peter Craig,Tom Cruise,Jennifer Connelly,Miles Teller,False,Feel the need... The need for speed.
1.0,jurassic world dominion,113.897,4.475,Colin Trevorrow,Colin Trevorrow,Derek Connolly,Chris Pratt,Bryce Dallas Howard,Laura Dern,False,The epic conclusion of the Jurassic era.
2.0,top gun,50.029,5.780,Tony Scott,Jack Epps Jr.,Ehud Yonay,Tom Cruise,Tim Robbins,Kelly McGillis,False,Up there with the best of the best.
3.0,lightyear,53.066,6.010,Angus MacLane,Jason Headley,Matthew Aldrich,Chris Evans,Keke Palmer,Peter Sohn,False,Infinity awaits.
4.0,spiderhead,26.153,4.830,Joseph Kosinski,Rhett Reese,Paul Wernick,Chris Hemsworth,Miles Teller,Jurnee Smollett,False,How far would you go to fix human nature?
...,...,...,...,...,...,...,...,...,...,...,...
24397.0,delicatessen,14.407,6.835,Marc Caro,Jean-Pierre Jeunet,Marc Caro,Gilles Adrien,Marie-Laure Dougnac,Dominique Pinon,False,A futuristic comic feast.
,bitch ass,3.706,7.625,Bill Posley,Bill Posley,Teon Kelley,Tunde Laleye,"""Melisa Sellers""",Bill Posley,False,
24399.0,bullwhip,3.457,3.835,Harmon Jones,Guy Madison,Rhonda Fleming,James Griffith,Harmon Jones,Adele Buffington,False,SADDLE TRAMP AND RED-HEADED HELLCAT...they rip...
24400.0,the freshman,15.218,6.075,Andrew Bergman,Marlon Brando,Matthew Broderick,Bruno Kirby,Andrew Bergman,Andrew Bergman,False,"He was on his way to the Dean's List, but he w..."


In [15]:
'''
Now, I will try to extract budget, revenue, and runtime.

This can be done by comparing the 'tagline' column from IMDB_df and the 'overview' column from income_df. If both
strings match, then join revenue, runtime, and budget to the IMDB_df.

'''

IMDB_df['tagline'] = IMDB_df['tagline'].fillna('unknown').str.lower().str.strip()
income_df['overview'] = income_df['overview'].fillna('unknown').str.lower().str.strip()

# Create a lookup from overview to revenue, budget, and runtime
overview_to_financials = income_df.set_index('overview')[['revenue', 'budget', 'runtime']]


# Attach revenue, budget, and runtime to IMDB_df wherever 'tagline' equals an 'overview' value
IMDB_df = IMDB_df.join(overview_to_financials, on='tagline')
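One thing to watch with this kind of lookup join: if the key on the right-hand side is not unique (for example, several rows filled with 'unknown'), every matching left row is repeated once per duplicate. The sketch below uses toy frames to show the effect and one possible guard, keeping the first row per key. Which duplicate to keep is a judgment call and an assumption here.

```python
import pandas as pd

left = pd.DataFrame({
    "Movie titles": ["a", "b", "c"],
    "tagline": ["x", "unknown", "unknown"],
})
right = pd.DataFrame({
    "overview": ["x", "unknown", "unknown"],
    "revenue": ["10M", "0", "5K"],
})

# Naive lookup: the two 'unknown' keys duplicate rows b and c
naive = left.join(right.set_index("overview"), on="tagline")

# Guarded lookup: keep one row per key before building the index
lookup = right.drop_duplicates(subset="overview").set_index("overview")
safe = left.join(lookup, on="tagline")

print(len(left), len(naive), len(safe))
```

Comparing `len(IMDB_df)` before and after the join is a quick way to check whether this happened on the real data.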

In [16]:
IMDB_df

Unnamed: 0,Movie titles,popularity,IMDB_score,Director,Top Cast 1,Top Cast 2,Top Cast 3,Top Cast 4,Top Cast 5,pg_val,tagline,revenue,budget,runtime
0.0,top gun: maverick,126.291,7.405,Joseph Kosinski,Jack Epps Jr.,Peter Craig,Tom Cruise,Jennifer Connelly,Miles Teller,False,feel the need... the need for speed.,1488732821,170000000,131
1.0,jurassic world dominion,113.897,4.475,Colin Trevorrow,Colin Trevorrow,Derek Connolly,Chris Pratt,Bryce Dallas Howard,Laura Dern,False,the epic conclusion of the jurassic era.,1001978080,165000K,147
2.0,top gun,50.029,5.780,Tony Scott,Jack Epps Jr.,Ehud Yonay,Tom Cruise,Tim Robbins,Kelly McGillis,False,up there with the best of the best.,356M,15M,110
3.0,lightyear,53.066,6.010,Angus MacLane,Jason Headley,Matthew Aldrich,Chris Evans,Keke Palmer,Peter Sohn,False,infinity awaits.,226M,200000000,105
4.0,spiderhead,26.153,4.830,Joseph Kosinski,Rhett Reese,Paul Wernick,Chris Hemsworth,Miles Teller,Jurnee Smollett,False,how far would you go to fix human nature?,0,0,106
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24397.0,delicatessen,14.407,6.835,Marc Caro,Jean-Pierre Jeunet,Marc Caro,Gilles Adrien,Marie-Laure Dougnac,Dominique Pinon,False,a futuristic comic feast.,1M,4000000,99
,bitch ass,3.706,7.625,Bill Posley,Bill Posley,Teon Kelley,Tunde Laleye,"""Melisa Sellers""",Bill Posley,False,unknown,0,0,0
24399.0,bullwhip,3.457,3.835,Harmon Jones,Guy Madison,Rhonda Fleming,James Griffith,Harmon Jones,Adele Buffington,False,saddle tramp and red-headed hellcat...they rip...,0,0,80
24400.0,the freshman,15.218,6.075,Andrew Bergman,Marlon Brando,Matthew Broderick,Bruno Kirby,Andrew Bergman,Andrew Bergman,False,"he was on his way to the dean's list, but he w...",0,0,102


In [17]:
'''
Cleaning of Dataset:

1) I will drop rows with NaN values and fill any remaining missing values
2) I will clean the revenue and budget columns. Some values carry K or M suffixes and will be converted to plain integers.
3) Cleaning feature columns
4) Dropping columns which are not required

Feature Engineering:
1) Will add another feature called 'profit' by subtracting budget from revenue
2) Will only consider movies with a profit greater than 0.
3) Adding profit margin - profit as a percentage of revenue
4) Adding budget to revenue ratio - budget as a percentage of revenue
'''

In [18]:
#Cleaning of Data

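def convert_financials(value):
    """Convert values such as '356M', '165000K', or '170000000' to a plain number."""
    if pd.isna(value):
        return 0
    # Values may arrive as strings or as numbers, so normalize to a string first
    value = str(value).strip()
    if 'M' in value:
        value = value.replace('M', '').strip()
        return float(value) * 1000000
    elif 'K' in value:
        value = value.replace('K', '').strip()
        return float(value) * 1000
    else:
        return float(value)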

IMDB_df = IMDB_df.dropna()
IMDB_df.fillna("Unknown", inplace=True)
IMDB_df.drop(columns=['tagline'], inplace=True)

# Clean the revenue and budget columns
IMDB_df['revenue'] = IMDB_df['revenue'].apply(convert_financials).astype(int)
IMDB_df['budget'] = IMDB_df['budget'].apply(convert_financials).astype(int)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IMDB_df.fillna("Unknown", inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IMDB_df.drop(columns=['tagline'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IMDB_df['revenue'] = IMDB_df['revenue'].apply(convert_financials).astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the docume
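The warnings above are raised because `IMDB_df` was produced by `dropna()` and pandas cannot tell whether later in-place changes are meant for the original frame or for the filtered result. A common way to avoid them, sketched here on toy data, is to take an explicit copy right after filtering. The column `runtime_hours` is made up for the example.

```python
import warnings
import pandas as pd

df = pd.DataFrame({
    "revenue": ["356M", None, "1K"],
    "runtime": [110, 90, 80],
})

with warnings.catch_warnings():
    warnings.simplefilter("error")   # turn any warning into an exception for this demo
    clean = df.dropna().copy()       # explicit copy: later assignments target this frame only
    clean["runtime_hours"] = clean["runtime"] / 60

print(clean)
```

The same pattern applies after boolean filtering, for example `IMDB_df = IMDB_df[IMDB_df['profit'] > 0].copy()`.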

In [19]:
#Feature Engineering

# Calculate profit
IMDB_df['profit'] = IMDB_df['revenue'] - IMDB_df['budget']

# Drop rows where profit is less than or equal to zero
IMDB_df = IMDB_df[IMDB_df['profit'] > 0]

#Calculate profit margin percentage
IMDB_df['profit_margin'] = IMDB_df.apply(lambda row: (row['profit'] / row['revenue']) * 100 if row['revenue'] > 0 else 0, axis=1)

#Calculate budget to revenue ratio percentage
IMDB_df['budget_revenue_ratio'] = IMDB_df.apply(lambda row: (row['budget'] / row['revenue']) * 100 if row['revenue'] > 0 else 0, axis=1)

# Resetting the index so that it is consecutive again after filtering
IMDB_df.reset_index(drop=True, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IMDB_df['profit'] = IMDB_df['revenue'] - IMDB_df['budget']
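The two ratio features can also be computed with vectorized column arithmetic instead of a row-wise `apply`. The sketch below uses the revenue and budget figures of the first two rows shown in the next output. It also makes visible that, by construction, `profit_margin + budget_revenue_ratio` equals 100 for every row, so the two columns carry the same information.

```python
import pandas as pd

df = pd.DataFrame({
    "revenue": [1488732821, 356000000],
    "budget": [170000000, 15000000],
})

df["profit"] = df["revenue"] - df["budget"]

# Safe here because only rows with profit > 0 (and therefore revenue > 0) are kept
df["profit_margin"] = df["profit"] / df["revenue"] * 100
df["budget_revenue_ratio"] = df["budget"] / df["revenue"] * 100

print(df)
```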


In [20]:
IMDB_df

Unnamed: 0,Movie titles,popularity,IMDB_score,Director,Top Cast 1,Top Cast 2,Top Cast 3,Top Cast 4,Top Cast 5,pg_val,revenue,budget,runtime,profit,profit_margin,budget_revenue_ratio
0,top gun: maverick,126.291,7.405,Joseph Kosinski,Jack Epps Jr.,Peter Craig,Tom Cruise,Jennifer Connelly,Miles Teller,False,1488732821,170000000,131,1318732821,88.580893,11.419107
1,jurassic world dominion,113.897,4.475,Colin Trevorrow,Colin Trevorrow,Derek Connolly,Chris Pratt,Bryce Dallas Howard,Laura Dern,False,1001978080,165000000,147,836978080,83.532574,16.467426
2,top gun,50.029,5.780,Tony Scott,Jack Epps Jr.,Ehud Yonay,Tom Cruise,Tim Robbins,Kelly McGillis,False,356000000,15000000,110,341000000,95.786517,4.213483
3,lightyear,53.066,6.010,Angus MacLane,Jason Headley,Matthew Aldrich,Chris Evans,Keke Palmer,Peter Sohn,False,226000000,200000000,105,26000000,11.504425,88.495575
4,everything everywhere all at once,64.278,8.050,Dan Kwan,Dan Kwan,Daniel Scheinert,Michelle Yeoh,Stephanie Hsu,Ke Huy Quan,False,139200000,25000000,140,114200000,82.040230,17.959770
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6485,body,3.615,4.835,Dan Berk,Dan Berk,Robert Olsen,Helen Rogers,Alexandra Turshen,Lauren Molina,False,2000,0,75,2000,100.000000,0.000000
6486,balls of fury,9.436,5.960,Robert Ben Garant,Robert Ben Garant,Dan Fogler,Christopher Walken,George Lopez,Robert Ben Garant,False,41098065,0,90,41098065,100.000000,0.000000
6487,the long riders,14.184,5.415,Walter Hill,Steven Smith,Stacy Keach,David Carradine,Stacy Keach,Dennis Quaid,False,15000000,10000000,99,5000000,33.333333,66.666667
6488,police academy 4: citizens on patrol,15.942,3.605,Jim Drake,Pat Proft,Gene Quintano,Steve Guttenberg,Bubba Smith,Michael Winslow,False,76800000,17000000,88,59800000,77.864583,22.135417


In [21]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


data = IMDB_df

#X contains the features used for prediction, and y contains the target variable (IMDB_score).

X = data[['popularity', 'pg_val', 'revenue', 'budget', 'runtime', 'profit', 'profit_margin', 'budget_revenue_ratio',
          'Director', 'Top Cast 1', 'Top Cast 2', 'Top Cast 3', 'Top Cast 4', 'Top Cast 5']]
y = data['IMDB_score']

# Encoding categorical columns using One-Hot Encoding
categorical_features = ['Director', 'Top Cast 1', 'Top Cast 2', 'Top Cast 3', 'Top Cast 4', 'Top Cast 5']
one_hot = OneHotEncoder(handle_unknown='ignore')
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder='passthrough')
X_transformed = transformer.fit_transform(X)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.20, random_state=42)

# Training the Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Predicting on the test set
y_pred = regressor.predict(X_test)

# Calculating the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Calculating the Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

Root Mean Squared Error: 1.3978168610573691
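In the cell above the one-hot encoder is fitted on the full feature table before the split. A possible variation, sketched here on synthetic data rather than on `IMDB_df`, wraps the encoder and the regressor in a scikit-learn `Pipeline` so that the encoder only sees the training rows, and reports both MSE and RMSE. All column values below are randomly generated and the relationship between popularity and score is invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the final table
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Director": rng.choice(["d1", "d2", "d3"], size=n),
    "runtime": rng.integers(80, 150, size=n),
    "popularity": rng.uniform(1, 100, size=n),
})
toy["IMDB_score"] = 5 + 0.01 * toy["popularity"] + rng.normal(0, 0.5, size=n)

X = toy[["Director", "runtime", "popularity"]]
y = toy["IMDB_score"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# The encoder is fitted inside the pipeline, on the training split only
model = Pipeline([
    ("encode", ColumnTransformer(
        [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Director"])],
        remainder="passthrough",
    )),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

With `handle_unknown='ignore'`, directors or cast members that appear only in the test split are encoded as all zeros instead of raising an error.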
