<a href="https://colab.research.google.com/github/hawa1983/DATA602Project/blob/main/Final_Project_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Team Members**


*   Souleymane Doumbia
*   Fomba Kassoh



# Exploratory Data Analysis

Preliminary exploration will focus on distributions of numerical features such as ratings and revenue, as well as categorical features like genres and certificates, directors, and stars. Summary statistics will provide insights into mean, median, and mode for critical scores and revenue.

## Column Overview

**Basic Info:** Includes attributes like title, genre, certificate, release_year, runtime, etc.

**Ratings:** Metrics such as metascore and votes.

**Revenue:** Includes gross_revenue, budget, opening_weekend_boxoffice_usa, gross_usa, cumulative_worldwide_gross.

**Genre:** Genre information such as Action, Comedy, Drama, etc.

**Awards:** Includes wins and nominations.

## Load the data into a data frame

Let us take a quick look at the dataset to see a few examples of the data we will be working with.

In [None]:
import pandas as pd

# Loading the dataset
file_path = 'https://raw.githubusercontent.com/hawa1983/DATA-602/main/movies.csv'
df = pd.read_csv(file_path)

# Displaying the first few rows of the dataset
df.head()


Unnamed: 0,ranking,title,release_year,certificate,runtime,genre,metascore,directors,stars,votes,...,wins,nominations,budget,opening_weekend_boxoffice_usa,gross_usa,cumulative_worldwide_gross,countries_of_origin,languages,production_companies,web_link
0,1,Avatar,-2009,PG-13,162 min,"Action, Adventure, Fantasy",83.0,"James Cameron, Sam Worthington, Zoe Saldana, S...","Sam Worthington, Zoe Saldana, Sigourney Weaver...",1385494,...,91.0,131.0,237000000,77025481,785221649,2923706026,United States,"English, Spanish","Twentieth Century Fox, Dune Entertainment, Lig...",https://www.imdb.com/title/tt0499549/?ref_=ttl...
1,2,Avengers: Endgame,-2019,PG-13,181 min,"Action, Adventure, Drama",78.0,"Anthony Russo, Joe Russo, Robert Downey Jr., C...","Anthony Russo, Joe Russo, Robert Downey Jr., C...",1263672,...,70.0,133.0,356000000,357115007,858373000,2799439100,United States,"English, Japanese, Xhosa, German","Marvel Studios, Walt Disney Pictures",https://www.imdb.com/title/tt4154796/?ref_=ttl...
2,3,Avatar: The Way of Water,-2022,PG-13,192 min,"Action, Adventure, Fantasy",67.0,"James Cameron, Sam Worthington, Zoe Saldana, S...","Sam Worthington, Zoe Saldana, Sigourney Weaver...",493336,...,75.0,150.0,350000000,134100226,684075767,2320250281,United States,English,"20th Century Studios, TSG Entertainment, Light...",https://www.imdb.com/title/tt1630029/?ref_=ttl...
3,4,Titanic,-1997,PG-13,194 min,"Drama, Romance",75.0,"James Cameron, Leonardo DiCaprio, Kate Winslet...","Leonardo DiCaprio, Kate Winslet, Billy Zane, K...",1278811,...,126.0,83.0,200000000,28638131,674292608,2264750694,"United States, Mexico","English, Swedish, Italian, French","Twentieth Century Fox, Paramount Pictures, Lig...",https://www.imdb.com/title/tt0120338/?ref_=ttl...
4,5,Star Wars: Episode VII - The Force Awakens,-2015,PG-13,138 min,"Action, Adventure, Sci-Fi",80.0,"J.J. Abrams, Daisy Ridley, John Boyega, Oscar ...","J.J. Abrams, Daisy Ridley, John Boyega, Oscar ...",973179,...,64.0,140.0,245000000,247966675,936662225,2071310218,United States,English,"Lucasfilm, Bad Robot",https://www.imdb.com/title/tt2488496/?ref_=ttl...


## Overview of the Data

Next we get an overview of the DataFrame, including the number of entries, the data types of columns, the count of non-null values, memory usage, and column names. This information will be crucial for assessing the dataset's structure, spotting missing data, and planning for necessary preprocessing steps.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   ranking                        1000 non-null   object 
 1   title                          1000 non-null   object 
 2   release_year                   1000 non-null   object 
 3   certificate                    977 non-null    object 
 4   runtime                        1000 non-null   object 
 5   genre                          1000 non-null   object 
 6   metascore                      964 non-null    float64
 7   directors                      1000 non-null   object 
 8   stars                          1000 non-null   object 
 9   votes                          1000 non-null   object 
 10  gross_revenue                  959 non-null    object 
 11  release_date                   1000 non-null   object 
 12  wins                           948 non-null    fl

## Data Type Conversion

  * **Numeric data** like ranking, votes, gross_revenue, budget, opening_weekend_boxoffice_usa, gross_usa, and cumulative_worldwide_gross are converted to appropriate numeric types (int or float) after potentially removing commas.

  * **The release_year** is extracted from a string and converted to an integer. The year is represented by four digits within the string.

  * **Date Conversion:** The release_date column is cleaned to extract a date format (month day, year) and converted into a pandas datetime object using a specified format. This ensures that the column is consistently formatted and useful for time series analysis.

  * **Type Enforcement:** The script explicitly sets the data type for several columns:

    - title, certificate, genre, stars, countries_of_origin, languages, production_companies, and web_link are confirmed to be stored as strings.

    - metascore, wins, and nominations are converted to pandas' nullable integer type (Int64), which supports missing values (NaN) while still allowing for integer operations.
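
One caveat on the string enforcement step: on a column with missing values, `astype(str)` may turn each missing entry into the literal text `'nan'`, so the column afterwards reports no nulls. That would be consistent with `certificate` going from 977 to 1000 non-null entries between the two `df.info()` outputs in this notebook. The sketch below uses a small made-up Series (not the movie data) and shows pandas' nullable `'string'` dtype as one possible alternative that keeps missing values visible; the exact behavior of `astype(str)` can differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a column such as `certificate`
certificate = pd.Series(['PG-13', np.nan, 'R'], dtype=object)

# Plain str conversion: inspect how the missing entry is represented
as_str = certificate.astype(str)
print(as_str.tolist())

# Nullable string dtype: the missing entry stays missing (<NA>)
as_string = certificate.astype('string')
print(as_string.isna().sum())
```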

In [None]:
import pandas as pd
#movies_data = df.copy()

# Helper function to convert strings with commas and decimals to integers
def to_int(val):
    if isinstance(val, float) or isinstance(val, int):
        return int(val)  # Direct conversion if already a float or int
    try:
        return int(val.replace(',', ''))
    except ValueError:
        return int(float(val.replace(',', '')))  # Converts strings like '1,000.00' to integer 1000

# Helper function to convert strings with commas to floats
def to_float(val):
    if isinstance(val, (int, float)):
        return float(val)  # Numbers (including NaN) pass through as floats
    # Strings like '1,000.50' become 1000.5 once the commas are removed
    return float(val.replace(',', ''))

# Apply conversions
df['ranking'] = df['ranking'].apply(to_int)

# Extract only the year part and convert to integer
df['release_year'] = df['release_year'].astype(str).str.extract(r'(\d{4})')[0].astype(int)

df['votes'] = df['votes'].apply(to_int)
df['gross_revenue'] = df['gross_revenue'].apply(to_float)
df['budget'] = df['budget'].apply(to_float)
df['opening_weekend_boxoffice_usa'] = df['opening_weekend_boxoffice_usa'].apply(to_float)
df['gross_usa'] = df['gross_usa'].apply(to_float)
df['cumulative_worldwide_gross'] = df['cumulative_worldwide_gross'].apply(to_float)

# Clean and convert date column
df['release_date'] = df['release_date'].str.extract(r'(\w+ \d+, \d+)')[0]
df['release_date'] = pd.to_datetime(df['release_date'], format='%B %d, %Y')

# Ensure other columns are in the correct type
df['title'] = df['title'].astype(str)
df['certificate'] = df['certificate'].astype(str)
df['genre'] = df['genre'].astype(str)
df['metascore'] = df['metascore'].astype('Int64')  # Using nullable integer type
df['stars'] = df['stars'].astype(str)
df['wins'] = df['wins'].astype('Int64')  # Using nullable integer type
df['nominations'] = df['nominations'].astype('Int64')  # Using nullable integer type
df['countries_of_origin'] = df['countries_of_origin'].astype(str)
df['web_link'] = df['web_link'].astype(str)

# Inspect the DataFrame after conversion
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype         
---  ------                         --------------  -----         
 0   ranking                        1000 non-null   int64         
 1   title                          1000 non-null   object        
 2   release_year                   1000 non-null   int64         
 3   certificate                    1000 non-null   object        
 4   runtime                        1000 non-null   object        
 5   genre                          1000 non-null   object        
 6   metascore                      964 non-null    Int64         
 7   directors                      1000 non-null   object        
 8   stars                          1000 non-null   object        
 9   votes                          1000 non-null   int64         
 10  gross_revenue                  959 non-null    float64       
 11  release_date      

## Missing value information
The code below calculates the proportion of missing values in our movie dataset. Effective handling of these missing values is necessary to ensure the robustness and accuracy of any conclusions or predictive models derived from the data.

Here is a breakdown of the main columns with missing values, based on the computation below:

  * Metascore: Missing in 3.6% of rows. This variable, which represents critical reviews, is crucial for understanding the critical reception of a movie. Missing values here could limit analysis related to the critical success of the movies.
  * Gross Revenue: Missing in 4.1% of rows. As one of the most critical measures of a movie's financial success, missing data for gross revenue can significantly impact analyses related to profitability or market success.
  * Wins: Missing in 5.2% of rows. Information about awards won by a movie can influence its perception and future revenue, and missing this data could skew interpretations of a movie's acclaim and success.
  * Nominations: Also missing in 5.2% of rows. Similar to wins, nominations are important for gauging the industry recognition a movie receives.
  * Budget: Missing in 3.6% of rows. Understanding the budget is vital for return on investment (ROI) calculations and economic impact assessments.
  * Opening Weekend Box Office (USA): Missing in 2.0% of rows. Early box office performance can be an indicator of a movie's market trajectory and viewer interest.
  * Gross USA: Missing in 1.9% of rows. This data is critical for evaluating the domestic success of a movie, which often constitutes a significant part of total earnings.

In [None]:
# Calculating the total number of rows in the DataFrame
total_rows = len(df)

# Calculating the number of missing values for each column
missing_values = df.isnull().sum()

# Calculating the percentage of missing values for each column and rounding to four decimal places
missing_percentage = ((missing_values / total_rows) * 100).round(4)

# Filtering out columns with no missing values and creating a DataFrame with the count and percentage of missing values
missing_stats = pd.DataFrame({
    'Missing Values': missing_values[missing_values > 0],
    'Percentage': missing_percentage[missing_values > 0]
})

missing_stats


Unnamed: 0,Missing Values,Percentage
metascore,36,3.6
gross_revenue,41,4.1
wins,52,5.2
nominations,52,5.2
budget,36,3.6
opening_weekend_boxoffice_usa,20,2.0
gross_usa,19,1.9
production_companies,2,0.2



## Missing Value Imputation

Our analysis will use some of the columns with missing values in machine learning models. We will therefore carry out data imputation with this in mind.

  * **Metascore:** Metascore has 36 missing values. Inspection of the data shows that their country of origin is mostly China where movies may not have Metascore ratings. Here we will use the `IterativeImputer` from the `sklearn.impute` module, which models each feature with missing values as a function of other features.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
import warnings
from sklearn.exceptions import ConvergenceWarning

# Suppress all convergence warnings
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

# Alternatively, suppress all warnings (not recommended in production)
warnings.filterwarnings(action='ignore')

imp = IterativeImputer(estimator=RandomForestRegressor(), random_state=0, max_iter=10, initial_strategy='median')

# Selecting numerical columns that can be predictors in the imputation model
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numerical_cols] = imp.fit_transform(df[numerical_cols])

# Check the imputation
print(df[numerical_cols].isnull().sum())  # This should ideally be zero now


ranking                          0
release_year                     0
metascore                        0
votes                            0
gross_revenue                    0
wins                             0
nominations                      0
budget                           0
opening_weekend_boxoffice_usa    0
gross_usa                        0
cumulative_worldwide_gross       0
dtype: int64


* **Financial Data:** The revenue variables have missing values. We will also use the `IterativeImputer` from the `sklearn.impute` module, which models each feature with missing values as a function of other features.

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Initialize the imputer
# RandomForestRegressor is used here as an estimator for the imputation model
imputer = IterativeImputer(estimator=RandomForestRegressor(), random_state=0, max_iter=10, initial_strategy='median')

# Selecting financial columns for imputation
financial_cols = ['budget', 'opening_weekend_boxoffice_usa', 'gross_usa', 'cumulative_worldwide_gross']
df[financial_cols] = imputer.fit_transform(df[financial_cols])

# Check the result of the imputation
print(df[financial_cols].isnull().sum())  # Should show zero for all columns


budget                           0
opening_weekend_boxoffice_usa    0
gross_usa                        0
cumulative_worldwide_gross       0
dtype: int64


## Summary Statistics

Below, we provide a statistical summary of numerical columns in the DataFrame, giving insights into the distribution of data values through metrics like mean, standard deviation, and quantiles. This helps us to identify patterns, outliers, and data skewness, informing data cleaning and preparation steps.

In [None]:
numerical_summary = df.describe()
numerical_summary


Unnamed: 0,ranking,release_year,metascore,votes,gross_revenue,release_date,wins,nominations,budget,opening_weekend_boxoffice_usa,gross_usa,cumulative_worldwide_gross
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,500.5,2008.416,59.75908,368700.5,159826800.0,2009-01-05 15:05:45.599999744,18.36095,37.15499,100808100.0,42207210.0,160583100.0,425785300.0
min,1.0,1937.0,12.0,561.0,3523.0,1938-02-04 00:00:00,1.0,1.0,15000.0,3238.0,6752.0,182206900.0
25%,250.75,2002.0,48.0,147544.2,93303610.0,2002-11-13 06:00:00,3.0,9.0,50000000.0,18719870.0,93245240.0,230834900.0
50%,500.5,2011.0,59.0,264204.5,132090800.0,2011-04-08 00:00:00,6.0,20.5,85000000.0,32506400.0,132288300.0,321700600.0
75%,750.25,2016.0,71.0,475013.2,190955900.0,2016-12-21 00:00:00,18.25,42.0,140000000.0,54429270.0,191266700.0,486270000.0
max,1000.0,2024.0,100.0,2867378.0,936662200.0,2024-03-29 00:00:00,344.0,422.0,2500000000.0,357115000.0,936662200.0,2923706000.0
std,288.819436,10.709885,15.643345,353276.2,115998900.0,,33.57066,51.410751,101352100.0,39094240.0,116393000.0,305302000.0


In [None]:
import pandas as pd


# Describe categorical variables
categorical_summary = df.describe(include=[object])

categorical_summary


Unnamed: 0,title,certificate,runtime,genre,directors,stars,countries_of_origin,languages,production_companies,web_link
count,1000,1000,1000,1000,1000,1000,1000,1000,998,1000
unique,989,11,107,159,988,988,206,326,821,1000
top,Teenage Mutant Ninja Turtles,PG-13,115 min,"Animation, Adventure, Comedy","Lana Wachowski, Lilly Wachowski, Keanu Reeves,...","Lana Wachowski, Lilly Wachowski, Keanu Reeves,...",United States,English,"Walt Disney Pictures, Pixar Animation Studios",https://www.imdb.com/title/tt0499549/?ref_=ttl...
freq,2,458,26,106,3,3,422,410,10,1


## Analysis of the distributions of numerical features


**Distribution of Budget and Revenue**

From the histograms below, here is an analysis of each distribution related to different financial metrics:

1. Distribution of Budget
* **Shape:** The histogram shows a right-skewed distribution. This indicates that most movies have lower budgets, with fewer movies having very high budgets.
* **Peaks:** There is a clear peak at the lower end of the budget scale, suggesting that a significant number of movies operate on a relatively small budget.
* **Tail:** The long tail to the right suggests that while fewer in number, some movies have significantly larger budgets, extending far beyond the common values.

2. Distribution of Gross Revenue, Opening Weekend Box Office in USA, Gross USA, and Cumulative Worldwide Gross
* **Shape:** These distributions are also right-skewed, similar to the budget distribution: most movies earn a modest amount, and only a few achieve extremely high revenues.
* **Peaks:** The peak at the lower end is quite pronounced, indicating that the majority of movies generate lower revenue figures.
* **Tail:** The tail extending to the right shows that there are outliers or exceptional cases where movies earn substantially more than the typical movie.
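
The visual impression of right skew can also be checked numerically. The following is a minimal sketch on synthetic, log-normally distributed values (a hypothetical stand-in for a column such as `budget`, not the actual data): a positive `Series.skew()` and a mean above the median both point to a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; the parameters are illustrative assumptions
rng = np.random.default_rng(0)
budget_like = pd.Series(rng.lognormal(mean=18.0, sigma=1.0, size=1000))

skewness = budget_like.skew()  # sample skewness; > 0 means right-skewed
mean_above_median = budget_like.mean() > budget_like.median()

print(f"skewness: {skewness:.2f}")
print(f"mean > median: {mean_above_median}")
```

On the real data, the same check could be run as `df[columns].skew()` for the columns plotted in the histograms.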

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd


# Define the columns to plot
columns = ['budget', 'gross_revenue', 'opening_weekend_boxoffice_usa', 'gross_usa']

# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=columns)

# Adding subplots with borders
for i, column in enumerate(columns):
    row = i // 2 + 1
    col = i % 2 + 1
    fig.add_trace(go.Histogram(x=df[column], name=column, marker=dict(line=dict(color='white', width=1))), row=row, col=col)

# Update layout
fig.update_layout(height=800, width=1200, title_text="Distribution of Budget and Revenues")
fig.show()


**Distribution of ratings and awards**

The histograms show the distributions of four movie metrics: Metascore, Votes, Wins, and Nominations. Each one is analyzed below based on its appearance and potential implications:

1. Metascore

  * Shape: This histogram displays a roughly bell-shaped but slightly right-skewed distribution, indicating most movies receive middle-range metascores around 40 to 80.
  * Peaks: The most frequent scores are in the 60-70 range, suggesting that many movies tend to receive moderately positive reviews.

2. Votes

  * Shape: The distribution of votes is heavily right-skewed, with most movies receiving fewer than 500,000 votes.
  * Peaks: There's a sharp peak at the lower end, indicating a very large number of movies receive relatively few votes.
  * Implication: This pattern is typical in movie voting behavior where only a few movies become extremely popular and receive a high number of votes, while the majority garner less attention.

3. Wins
  * Shape: The distribution is highly skewed to the right, with most movies winning very few awards, if any.
  * Peaks: There's an extreme concentration of movies with 0 to 10 wins.
  * Implication: Most movies do not win many awards, with only a few standout successes achieving higher numbers of wins. This skew could reflect the competitive nature of film awards where only a small number of films win multiple awards.

4. Nominations
  * Shape: Similar to wins, the nominations histogram is also right-skewed.
  * Peaks: The histogram peaks between 0 and 50 nominations.
  * Implication: Most movies receive a modest number of nominations. Like wins, only a few movies receive a high number of nominations, which may indicate that certain movies have broader appeal or receive more recognition from award committees.

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

# Define the columns to plot
columns = ['metascore', 'votes', 'wins', 'nominations']

# Create subplots
fig = make_subplots(rows=2, cols=2, subplot_titles=columns)

# Adding subplots with borders
for i, column in enumerate(columns):
    row = i // 2 + 1
    col = i % 2 + 1
    fig.add_trace(go.Histogram(x=df[column], name=column, marker=dict(line=dict(color='white', width=1))), row=row, col=col)

# Update layout
fig.update_layout(height=800, width=1200, title_text="Distribution of Ratings and Awards")
fig.show()


## Analysis of the distribution of categorical features
Here are the distributions for several categorical features:
* Count of Movies by Genre: Certain genres are more prevalent in the dataset, which might indicate popular genres for high-grossing or critically acclaimed movies.
* Count of Movies by Certificate: This shows the distribution of movie certificates, helping us understand the target audience demographics.
* Top 30 Directors by Movie Count: Highlights directors who are frequently involved in high-performing movies.
* Top 30 Languages by Appearance: Shows which languages appear most frequently across the movies.
* Top 30 Countries of Origin by Appearance: Shows which countries appear most frequently as a country of origin.
* Top 30 Production Companies by Appearance: Shows which production companies are credited most frequently.
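
Several of these columns store multiple values in one comma-separated string, so counting requires splitting each entry first. Here is a minimal sketch of that idea on a made-up three-row frame, using `str.split` with `explode` as one possible way to do it:

```python
import pandas as pd

# Hypothetical mini-dataset with comma-separated genres
toy = pd.DataFrame({'genre': ['Action, Adventure', 'Action, Drama', 'Drama']})

# Split each string into a list, put every genre on its own row, then count
genre_counts = toy['genre'].str.split(', ').explode().value_counts()
print(genre_counts)
```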



In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd


# Create subplots
fig = make_subplots(rows=3, cols=2, subplot_titles=[
    'Count of Movies by Genre',
    'Count of Movies by Certificate',
    'Top 30 Directors by Movie Count',
    'Top 30 Languages by Appearance',
    'Top 30 Countries of Origin by Appearance',
    'Top 30 Production Companies by Appearance'
])

# Count of unique genres
genres_series = df['genre'].str.split(', ', expand=True).stack()
genres_count = genres_series.value_counts().head(30)

# Count of unique certificates
certificates_count = df['certificate'].value_counts()

# Count of unique directors
directors_series = df['directors'].str.split(', ', expand=True).stack()
directors_count = directors_series.value_counts().head(30)

# Count of unique languages
languages_series = df['languages'].str.split(', ', expand=True).stack()
languages_count = languages_series.value_counts().head(30)

# Count of unique countries_of_origin
countries_of_origin_series = df['countries_of_origin'].str.split(', ', expand=True).stack()
countries_of_origin_count = countries_of_origin_series.value_counts().head(30)

# Count of unique production_companies
production_companies_series = df['production_companies'].str.split(', ', expand=True).stack()
production_companies_count = production_companies_series.value_counts().head(30)

# Plotting the categorical features
fig.add_trace(go.Bar(x=genres_count.values, y=genres_count.index, orientation='h', name='Genre', marker=dict(color='blue')), row=1, col=1)
fig.add_trace(go.Bar(x=certificates_count.values, y=certificates_count.index, orientation='h', name='Certificate', marker=dict(color='orange')), row=1, col=2)
fig.add_trace(go.Bar(x=directors_count.values, y=directors_count.index, orientation='h', name='Director', marker=dict(color='green')), row=2, col=1)
fig.add_trace(go.Bar(x=languages_count.values, y=languages_count.index, orientation='h', name='Language', marker=dict(color='red')), row=2, col=2)
fig.add_trace(go.Bar(x=countries_of_origin_count.values, y=countries_of_origin_count.index, orientation='h', name='Country of Origin', marker=dict(color='purple')), row=3, col=1)
fig.add_trace(go.Bar(x=production_companies_count.values, y=production_companies_count.index, orientation='h', name='Production Company', marker=dict(color='brown')), row=3, col=2)

# Update layout
fig.update_layout(
    height=1000,
    width=1200,
    title_text="Distribution of Categorical Features",
    showlegend=True,
    legend_title_text='Feature Categories'
)

fig.show()


## Summary Statistics for Selected Variables

The summary statistics show a metascore centered near 60 on its 100-point scale (mean about 59.7, median 59.0), with a standard deviation of about 15.6. Votes exhibit a broad spread, with a mean (about 368,700) well above the median (about 264,200), suggesting a right-skewed distribution where a smaller number of films have garnered a large number of votes. Gross revenue also displays considerable variability, with the mean (about 159.8 million) exceeding the median (about 132.3 million), reflecting that a select few films generate exceptionally high revenue, which skews the average upwards.

In [None]:
import pandas as pd


# Defining the columns for which to calculate summary statistics
summary_stats_columns = [
    'votes', 'metascore', 'budget', 'gross_revenue', 'opening_weekend_boxoffice_usa',
    'gross_usa', 'cumulative_worldwide_gross'
]

# Calculating mean, median, and standard deviation
mean_median_std_stats = df[summary_stats_columns].agg(['mean', 'median', 'std'])

# Calculating mode separately and ensuring it's in a compatible format
mode_stats = df[summary_stats_columns].mode().iloc[0]
mode_stats.name = 'mode'  # Naming the Series for easier concatenation

# Combining mean, median, standard deviation, and mode into a single DataFrame
summary_stats = pd.concat([mean_median_std_stats, mode_stats.to_frame().T])

summary_stats



Unnamed: 0,votes,metascore,budget,gross_revenue,opening_weekend_boxoffice_usa,gross_usa,cumulative_worldwide_gross
mean,368700.475,59.74607,100799600.0,159832400.0,42214630.0,160562300.0,425785300.0
median,264204.5,59.0,85000000.0,132280800.0,32506400.0,132489800.0,321700600.0
std,353276.233584,15.643986,101354200.0,116000800.0,39096380.0,116433500.0,305302000.0
mode,178161.0,58.0,150000000.0,3523.0,3238.0,6752.0,182206900.0
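
A note on the `mode` row: when every value in a column occurs only once, each value counts as a mode, and pandas returns the modes sorted in ascending order, so `.mode().iloc[0]` simply picks the smallest value. This would explain why the modes reported for `gross_revenue`, `opening_weekend_boxoffice_usa`, `gross_usa`, and `cumulative_worldwide_gross` coincide with the minimums in the earlier `describe()` output. A toy illustration with made-up numbers:

```python
import pandas as pd

# Made-up values with no repeats, so every value is a mode
revenue_like = pd.Series([500.0, 120.0, 980.0])

modes = revenue_like.mode()  # all three values, sorted ascending
first_mode = modes.iloc[0]   # equals the minimum of the series

print(modes.tolist())
print(first_mode == revenue_like.min())
```

For near-continuous variables like these, the mode is therefore of limited use as a summary statistic.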


# Machine Learning Modeling

We will develop several machine learning models to predict target features of interest.

## 1. Revenue Prediction Model (Regression)

In the following steps, we will build and train a `Gross Revenue` prediction model using regression techniques with scikit-learn:
  * Choose the appropriate features.
  * Preprocess the data.
  * Train the model.

**1. The Features**
  * ***Target variable:*** Gross revenue, `gross_revenue`
  * ***Feature variables:*** All features except `gross_revenue`, `title`, `web_link`, `opening_weekend_boxoffice_usa`, `gross_usa`, and `cumulative_worldwide_gross`.

**2. Preprocess the Data**

The following steps are used to preprocess the data:
  * Convert the target variable `gross_revenue` to a logarithmic scale to handle skewness.
  * Extract the integer values from the `runtime` column.
  * Transform `release_date` into datetime and extract year, month, and day of the week values.
  * Drop unnecessary columns from the dataset.

**3. Build a Linear Regression Model**

  The following steps are used to fit a linear regression model as a baseline, generate predictions, and evaluate the output:
  * Define the features and target variables.
  * Create separate lists of categorical and numeric features, then standardize the numeric features and one-hot encode the categorical features.
  * Split the data into training and testing sets.
  * Fit the model to the training dataset.
  * Predict on the testing dataset and evaluate the model using `mse` and `r_squared` as the metrics.

  **A.** Baseline Linear Model

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Make a copy of the original DataFrame
movies_data = df.copy()

# Define the correct target column name
target_column = 'gross_revenue'

# Convert 'gross_revenue' to a logarithmic scale to handle skewness
movies_data[target_column] = np.log1p(movies_data[target_column])

# Extract integer from the runtime column (raw string avoids an invalid-escape warning)
movies_data['runtime'] = movies_data['runtime'].astype(str).str.extract(r'(\d+)').astype(float)

# Transform 'release_date' into datetime and extract year, month, and day of the week
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])
movies_data['release_year'] = movies_data['release_date'].dt.year
movies_data['release_month'] = movies_data['release_date'].dt.month
movies_data['day_of_week'] = movies_data['release_date'].dt.dayofweek
movies_data.drop('release_date', axis=1, inplace=True)

# Drop unnecessary columns
columns_to_drop = ['title', 'web_link', 'stars']  # Only drop columns that exist in the dataset
movies_data = movies_data.drop(columns=columns_to_drop)

# Define the features and target variables
X = movies_data.drop(target_column, axis=1)
y = movies_data[target_column]

# Define categorical and numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Setup preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline setup
lin_reg = LinearRegression()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lin_reg)
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predictions and evaluation
y_pred = pipeline.predict(X_test)
print("Mean Squared Error:\n", mean_squared_error(y_test, y_pred))
print("R^2 Score:\n", r2_score(y_test, y_pred))


Mean Squared Error:
 0.7602030799572357
R^2 Score:
 0.6064572048215728


Analysis of the Model Performance:

  * **MSE:** The MSE value indicates that there is some degree of error in the predictions, which is expected in any predictive model. The error seems reasonably low given that MSE values depend on the scale of the target variable.
  * **R²:** An R² of 0.606 is moderate. It indicates that the model captures more than half of the variability in the target variable, which suggests a decent model fit but also leaves room for improvement.
  * **Understanding Key Drivers:** The model has identified and quantified the impact of various features (like runtime, budget, genres, etc.) on the gross_revenue of movies, albeit on a log-transformed scale.
  * **Model Usability:** Given the R² value, the model can be used to make reasonably accurate predictions and to understand the relationships between the features and the target variable. However, there is still a significant portion of the variance unexplained, suggesting that other important factors or more complex modeling techniques could potentially improve the predictions.
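
Because the target was transformed with `np.log1p`, the MSE is measured in squared log units. One rough way to read it, sketched below under the assumption that revenues are large enough for `log1p` to behave like a plain logarithm, is to take the square root (RMSE) and exponentiate it, which gives an approximate multiplicative factor for a typical prediction error.

```python
import math

# MSE reported by the baseline model above (log1p scale)
mse_log = 0.7602030799572357

rmse_log = math.sqrt(mse_log)      # typical error in log units
error_factor = math.exp(rmse_log)  # approximate multiplicative error

print(f"RMSE (log scale): {rmse_log:.3f}")
print(f"Approximate error factor: {error_factor:.2f}x")
```

By this reading, a typical prediction is off by a factor of roughly 2.4 in either direction on the original revenue scale, which helps put the MSE value in context.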

## Improving the Linear Model
To further improve the performance of our regression model, we will apply feature engineering, which involves creating new features from existing data to improve the model's predictive power. Here are two techniques:

1. Interaction Terms: Create new features that capture interactions between existing features. For example, the product of two features might capture a relationship that neither feature captures on its own.
2. Polynomial Features: Add polynomial terms of the features (e.g., squares, cubes) to allow the model to capture non-linear relationships.

**B.** Polynomial Interaction Features Only: Linear regression with polynomial interaction features.
  * We used PolynomialFeatures with interaction_only=True to create interaction terms between numeric features.
  * Update Preprocessor: The preprocessor now includes a pipeline for numeric features that scales the features and then adds interaction terms.
  * Pipeline Setup: The pipeline is updated to include the new preprocessor.
  * Model Training and Evaluation: The updated pipeline is fitted to the training data and evaluated on the test data.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px

# Make a copy of the original DataFrame
movies_data = df.copy()

# Define the correct target column name
target_column = 'gross_revenue'

# Convert 'gross_revenue' to a logarithmic scale to handle skewness
movies_data[target_column] = np.log1p(movies_data[target_column])

# Extract integer from the runtime column
movies_data['runtime'] = movies_data['runtime'].astype(str).str.extract(r'(\d+)').astype(float)

# Transform 'release_date' into datetime and extract year, month, and day of the week
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])
movies_data['release_year'] = movies_data['release_date'].dt.year
movies_data['release_month'] = movies_data['release_date'].dt.month
movies_data['day_of_week'] = movies_data['release_date'].dt.dayofweek
movies_data.drop('release_date', axis=1, inplace=True)

# Drop unnecessary columns
columns_to_drop = ['title', 'web_link', 'stars']  # Only drop columns that exist in the dataset
movies_data = movies_data.drop(columns=columns_to_drop)

# Define the features and target variables
X = movies_data.drop(target_column, axis=1)
y = movies_data[target_column]

# Define categorical and numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Setup preprocessing steps with PolynomialFeatures for interaction terms
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False))
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pipeline setup
lin_reg = LinearRegression()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lin_reg)
])

# Fit the model
pipeline.fit(X_train, y_train)

# Predictions and evaluation
y_pred = pipeline.predict(X_test)
print("Mean Squared Error:\n", mean_squared_error(y_test, y_pred))
print("R^2 Score:\n", r2_score(y_test, y_pred))

# Extract feature names after transformations
poly = pipeline.named_steps['preprocessor'].named_transformers_['num'].named_steps['poly']
numeric_feature_names = poly.get_feature_names_out(numeric_features)
categorical_feature_names = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Feature importance from Linear Regression (coefficients)
coefficients = pipeline.named_steps['regressor'].coef_

# Create a DataFrame for the coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})



Mean Squared Error:
 0.7155644179168559
R^2 Score:
 0.629565798164005


### Improvement with Polynomial Interaction Features:

  * Adding the interaction terms lowered the test MSE from 0.760 to 0.716 and raised the R² score from 0.606 to 0.630, so the model fits the held-out data somewhat better.
  * Because `interaction_only=True` is set, the added features are pairwise products of the numeric features, without squared terms. They let the effect of one feature depend on the level of another, which the purely additive linear model cannot represent.

**Implications:**

The reduction in MSE and increase in R² score indicate that the interaction terms are beneficial for this dataset and target variable. The gain is modest, which suggests that some features affect gross_revenue jointly rather than independently, while the additive terms still account for most of the explained variance.

**6. Cross Validation**

Next, we will perform cross validation on the model with the polynomial interaction terms. This should achieve the following:
  * Overfitting Detection: Each fold is scored on data the model did not see during fitting, so a model that merely memorizes the training data shows a poor validation score.
  * Robust Metric Estimates: Averaging over several folds gives a more reliable estimate of the model's performance, one that is less sensitive to how the data happens to be partitioned.
  * Better Model Selection: Cross-validation does not change the fitted model itself, but it gives a sounder basis for choosing between models and hyperparameters.

With these changes, our conclusions about the model's performance rest on several train/validation splits rather than a single one.
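
The mechanics of k-fold scoring can be sketched on synthetic data before applying them to the full pipeline. The dataset below comes from `make_regression` and is only a stand-in for the movie features. The names `X_demo`, `y_demo`, and `fold_mse` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the movie features and log-revenue target
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn maximizes scores, so MSE is returned with a negative sign
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=kfold, scoring="neg_mean_squared_error")
fold_mse = -neg_mse

print("Per-fold MSE:", np.round(fold_mse, 2))
print(f"Mean MSE: {fold_mse.mean():.2f} (std {fold_mse.std():.2f})")
```

The mean estimates out-of-sample error and the standard deviation shows how much that estimate depends on the split. Different values of k are alternative ways of estimating the same quantity, so a lower mean for one k does not by itself mean the model is better.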

**C.** Polynomial Interaction Features with Cross-Validation: Linear regression with polynomial interaction features and cross-validation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px

# Make a copy of the original DataFrame
movies_data = df.copy()

# Define the correct target column name
target_column = 'gross_revenue'

# Convert 'gross_revenue' to a logarithmic scale to handle skewness
movies_data[target_column] = np.log1p(movies_data[target_column])

# Extract integer from the runtime column
movies_data['runtime'] = movies_data['runtime'].astype(str).str.extract(r'(\d+)').astype(float)

# Transform 'release_date' into datetime and extract year, month, and day of the week
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])
movies_data['release_year'] = movies_data['release_date'].dt.year
movies_data['release_month'] = movies_data['release_date'].dt.month
movies_data['day_of_week'] = movies_data['release_date'].dt.dayofweek
movies_data.drop('release_date', axis=1, inplace=True)

# Drop unnecessary columns
columns_to_drop = ['title', 'web_link', 'stars']  # Only drop columns that exist in the dataset
movies_data = movies_data.drop(columns=columns_to_drop)

# Define the features and target variables
X = movies_data.drop(target_column, axis=1)
y = movies_data[target_column]

# Define categorical and numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Setup preprocessing steps with PolynomialFeatures for interaction terms
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False))
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Pipeline setup
lin_reg = LinearRegression()
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lin_reg)
])

# Perform cross-validation with different k values
k_values = [3, 5, 10]
best_k = None
best_score = float('inf')
best_pipeline = None

for k in k_values:
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=kfold, scoring='neg_mean_squared_error')
    avg_score = -scores.mean()
    print(f"Average MSE for k={k}: {avg_score}")
    if avg_score < best_score:
        best_score = avg_score
        best_k = k
        best_pipeline = pipeline

print(f"Best k value: {best_k}")
print(f"Best cross-validated MSE: {best_score}")

# The pipeline is identical for every k (k only changes the validation split),
# so refit it on the training portion of a held-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
best_pipeline.fit(X_train, y_train)

# Predictions and evaluation on the test set
y_pred = best_pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error on Test Set:", mse)
print("R^2 Score on Test Set:", r2)

# Extract feature names after transformations
poly = best_pipeline.named_steps['preprocessor'].named_transformers_['num'].named_steps['poly']
numeric_feature_names = poly.get_feature_names_out(numeric_features)
categorical_feature_names = best_pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Feature importance from Linear Regression (coefficients)
coefficients = best_pipeline.named_steps['regressor'].coef_

# Create a DataFrame for the coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})



Average MSE for k=3: 0.43514679042851295
Average MSE for k=5: 0.5269448531250813
Average MSE for k=10: 0.44648054538336607
Best k value: 3
Best cross-validated MSE: 0.43514679042851295
Mean Squared Error on Test Set: 0.7155644179168559
R^2 Score on Test Set: 0.629565798164005


**7. Regularization**

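We will use Lasso and Ridge regression to deal with multicollinearity among our input features, which the interaction terms make more likely. Ridge adds a penalty proportional to the sum of the squared coefficients (L2), which shrinks the coefficients and alleviates issues arising from high correlations among features. Lasso adds a penalty proportional to the sum of the absolute coefficient values (L1), which can set some coefficients exactly to zero and therefore also acts as a form of feature selection.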
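
The practical difference between the two penalties can be illustrated on synthetic data. This is a sketch under stated assumptions: the data comes from `make_regression`, and `alpha=5.0` is an arbitrary value chosen to make the effect visible, not the value tuned for the movie data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which influence the target
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5,
                                 noise=10.0, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

# Same penalty strength for both models (arbitrary illustrative value)
ridge = Ridge(alpha=5.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)

# Count coefficients that were driven exactly to zero
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients equal to zero:", ridge_zeros)
print("Lasso coefficients equal to zero:", lasso_zeros)
```

Ridge keeps every feature with a smaller weight, whereas Lasso drops features outright. The behavior of each depends strongly on alpha, so the penalty strength needs tuning before the two are compared.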

**D.** Lasso Regression with Polynomial Interaction Features and Cross-Validation

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px

# Make a copy of the original DataFrame
movies_data = df.copy()

# Define the correct target column name
target_column = 'gross_revenue'

# Convert 'gross_revenue' to a logarithmic scale to handle skewness
movies_data[target_column] = np.log1p(movies_data[target_column])

# Extract integer from the runtime column
movies_data['runtime'] = movies_data['runtime'].astype(str).str.extract(r'(\d+)').astype(float)

# Transform 'release_date' into datetime and extract year, month, and day of the week
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])
movies_data['release_year'] = movies_data['release_date'].dt.year
movies_data['release_month'] = movies_data['release_date'].dt.month
movies_data['day_of_week'] = movies_data['release_date'].dt.dayofweek
movies_data.drop('release_date', axis=1, inplace=True)

# Drop unnecessary columns
columns_to_drop = ['title', 'web_link', 'stars']  # Only drop columns that exist in the dataset
movies_data = movies_data.drop(columns=columns_to_drop)

# Define the features and target variables
X = movies_data.drop(target_column, axis=1)
y = movies_data[target_column]

# Define categorical and numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Setup preprocessing steps with PolynomialFeatures for interaction terms
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False))
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Pipeline setup
lasso = Lasso(random_state=42)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lasso)
])

# Perform cross-validation with different k values
k_values = [3, 5, 10]
best_k = None
best_score = float('inf')
best_pipeline = None

for k in k_values:
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=kfold, scoring='neg_mean_squared_error')
    avg_score = -scores.mean()
    print(f"Average MSE for k={k}: {avg_score}")
    if avg_score < best_score:
        best_score = avg_score
        best_k = k

print(f"Best k value: {best_k}")
print(f"Best cross-validated MSE: {best_score}")

# Hyperparameter tuning for Lasso
param_distributions = {
    'regressor__alpha': np.logspace(-4, 4, 50)
}

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=best_k,
    scoring='neg_mean_squared_error', random_state=42, n_jobs=-1
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Best parameters and score
best_params = random_search.best_params_
best_score = -random_search.best_score_

print(f"Best Lasso parameters: {best_params}")
print(f"Best Lasso cross-validated MSE: {best_score:.4f}")

# Perform final evaluation on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate MSE and R^2 on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error on Test Set:", mse)
print("R^2 Score on Test Set:", r2)

# Extract feature names after transformations
poly = best_model.named_steps['preprocessor'].named_transformers_['num'].named_steps['poly']
numeric_feature_names = poly.get_feature_names_out(numeric_features)
categorical_feature_names = best_model.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Feature importance from Lasso (coefficients)
coefficients = best_model.named_steps['regressor'].coef_

# Create a DataFrame for the coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})



Average MSE for k=3: 1.649700467388134
Average MSE for k=5: 1.6522538320480518
Average MSE for k=10: 1.6514673256546015
Best k value: 3
Best cross-validated MSE: 1.649700467388134
Best Lasso parameters: {'regressor__alpha': 0.00021209508879201905}
Best Lasso cross-validated MSE: 0.4342
Mean Squared Error on Test Set: 0.7293174090325268
R^2 Score on Test Set: 0.6224461340789428


**E.** Ridge Regression with Polynomial Interaction Features and Cross-Validation

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px

# Make a copy of the original DataFrame
movies_data = df.copy()

# Define the correct target column name
target_column = 'gross_revenue'

# Convert 'gross_revenue' to a logarithmic scale to handle skewness
movies_data[target_column] = np.log1p(movies_data[target_column])

# Extract integer from the runtime column
movies_data['runtime'] = movies_data['runtime'].astype(str).str.extract(r'(\d+)').astype(float)

# Transform 'release_date' into datetime and extract year, month, and day of the week
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'])
movies_data['release_year'] = movies_data['release_date'].dt.year
movies_data['release_month'] = movies_data['release_date'].dt.month
movies_data['day_of_week'] = movies_data['release_date'].dt.dayofweek
movies_data.drop('release_date', axis=1, inplace=True)

# Drop unnecessary columns
columns_to_drop = ['title', 'web_link', 'stars']  # Only drop columns that exist in the dataset
movies_data = movies_data.drop(columns=columns_to_drop)

# Define the features and target variables
X = movies_data.drop(target_column, axis=1)
y = movies_data[target_column]

# Define categorical and numeric features
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Setup preprocessing steps with PolynomialFeatures for interaction terms
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('scaler', StandardScaler()),
            ('poly', PolynomialFeatures(degree=2, interaction_only=True, include_bias=False))
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Pipeline setup
ridge = Ridge(random_state=42)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', ridge)
])

# Perform cross-validation with different k values
k_values = [3, 5, 10]
best_k = None
best_score = float('inf')
best_pipeline = None

for k in k_values:
    kfold = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(pipeline, X, y, cv=kfold, scoring='neg_mean_squared_error')
    avg_score = -scores.mean()
    print(f"Average MSE for k={k}: {avg_score}")
    if avg_score < best_score:
        best_score = avg_score
        best_k = k

print(f"Best k value: {best_k}")
print(f"Best cross-validated MSE: {best_score}")

# Hyperparameter tuning for Ridge
param_distributions = {
    'regressor__alpha': np.logspace(-4, 4, 50)
}

# Setup RandomizedSearchCV
random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=best_k,
    scoring='neg_mean_squared_error', random_state=42, n_jobs=-1
)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
random_search.fit(X_train, y_train)

# Best parameters and score
best_params = random_search.best_params_
best_score = -random_search.best_score_

print(f"Best Ridge parameters: {best_params}")
print(f"Best Ridge cross-validated MSE: {best_score:.4f}")

# Perform final evaluation on the test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate MSE and R^2 on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error on Test Set:", mse)
print("R^2 Score on Test Set:", r2)

# Extract feature names after transformations
poly = best_model.named_steps['preprocessor'].named_transformers_['num'].named_steps['poly']
numeric_feature_names = poly.get_feature_names_out(numeric_features)
categorical_feature_names = best_model.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Feature importance from Ridge (coefficients)
coefficients = best_model.named_steps['regressor'].coef_

# Create a DataFrame for the coefficients
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})



Average MSE for k=3: 0.4345313877877713
Average MSE for k=5: 0.514862010578945
Average MSE for k=10: 0.4354709385236693
Best k value: 3
Best cross-validated MSE: 0.4345313877877713
Best Ridge parameters: {'regressor__alpha': 0.5689866029018293}
Best Ridge cross-validated MSE: 0.5679
Mean Squared Error on Test Set: 0.704805347733858
R^2 Score on Test Set: 0.6351355658549963


Below, we compare the performance of the five models:
  1. Baseline Linear Model: Without polynomial interaction features or cross-validation.
  2. Polynomial Interaction Features Only: Linear regression with polynomial interaction features.
  3. Polynomial Interaction Features with Cross-Validation: Linear regression with polynomial interaction features and cross-validation.
  4. Lasso Regression with Polynomial Interaction Features and Cross-Validation
  5. Ridge Regression with Polynomial Interaction Features and Cross-Validation

**Baseline Linear Model:** This model serves as a baseline against which to compare the other models. With a test MSE of 0.760 and R² of 0.606, it shows the basic performance without any feature engineering or cross-validation.

**Polynomial Interaction Features Only:** Adding polynomial interaction features improves the model's performance compared to the baseline. The test MSE decreases to 0.716, and the R² increases to 0.630. This indicates that the interaction terms capture relationships in the data that the additive model misses, leading to better predictions.

**Polynomial Interaction Features with Cross-Validation:** Cross-validation evaluates the same pipeline across different splits of the data. The best average cross-validated MSE is 0.435 (k=3). This figure is computed on the full dataset with different partitions, so it is not directly comparable with the single test-set MSE. Because the pipeline itself is unchanged, the test-set MSE (0.716) and R² (0.630) are identical to those of the model with polynomial interaction features only. Cross-validation here provides a second estimate of performance rather than a better model.

**Lasso Regression with Polynomial Interaction Features and Cross-Validation:** Did not perform as well as expected. With the tuned alpha (about 0.0002), the test MSE is 0.729 and the R² is 0.622, which is better than the baseline but slightly worse than the unregularized interaction model and Ridge. L1 regularization did not provide the expected benefits in this specific case.

**Ridge Regression with Polynomial Interaction Features and Cross-Validation:** Performed best on the test set, with an MSE of 0.705 and R² of 0.635, a small improvement over the unregularized interaction model that indicates effective regularization.

**Linear Model Selection**

The best model based on the test set performance is the Ridge Regression with Polynomial Interaction Features and Cross-Validation, which achieved the lowest MSE (0.705) and highest R² (0.635) on the test set. The margin over the unregularized interaction model is small (about 0.01 in MSE), so the ranking should be read as indicative rather than decisive.
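
For a side-by-side view, the test-set metrics can be gathered into one table. The values below are copied (rounded to four decimals) from the outputs printed earlier in this notebook. The short model labels are abbreviations introduced here for readability.

```python
import pandas as pd

# Test-set metrics copied from the printed outputs of the five model cells
results = pd.DataFrame(
    [
        ("Baseline linear", 0.7602, 0.6065),
        ("Interaction terms", 0.7156, 0.6296),
        ("Interaction terms + CV", 0.7156, 0.6296),
        ("Lasso + interaction terms + CV", 0.7293, 0.6224),
        ("Ridge + interaction terms + CV", 0.7048, 0.6351),
    ],
    columns=["Model", "Test MSE", "Test R2"],
).set_index("Model")

# Rank models from lowest to highest test MSE
ranked = results.sort_values("Test MSE")
print(ranked)
print("Lowest test MSE:", ranked.index[0])
```

All five models were scored on the same 80/20 split (`random_state=42`), so the rows are directly comparable, although the differences between the top three are small.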

## Popularity Classification Model (Binary Classification)

**Goal:** Classify movies as 'Hit' or 'Flop', where a hit is a movie whose gross_revenue is above the median.

**Features:** votes, metascore, runtime, genre

**General Observations:**

  * Random Forest performed the best in terms of overall accuracy (0.74) and exhibited the highest recall for Class 1 (0.78), which matters most when correctly identifying hit movies is the priority.
  * SVM showed a good balance between precision and recall across both classes but had slightly lower accuracy than the Random Forest model.
  * Logistic Regression did not exceed the other models on overall accuracy and had the lowest recall for Class 1, indicating it may be less effective at identifying hits.
  * These metrics can guide model selection according to the criterion that matters most for the application: overall accuracy, recall, or precision.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import plotly.express as px

# Example data preprocessing
median_revenue = movies_data['gross_revenue'].median()
movies_data['target'] = (movies_data['gross_revenue'] > median_revenue).astype(int)  # Binary classification

# Features and target variable
X = movies_data.drop(['target', 'gross_revenue'], axis=1)
y = movies_data['target'].astype(int)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define categorical and numeric features
numeric_features = ['votes', 'metascore', 'runtime']
categorical_features = ['genre']

# Preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Fit the preprocessor to get the feature names
preprocessor.fit(X_train)

# Extract feature names after preprocessing
numeric_feature_names = preprocessor.named_transformers_['num'].get_feature_names_out(numeric_features)
categorical_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Initialize models
logreg = LogisticRegression(max_iter=1000, random_state=42)
svm = SVC(random_state=42)
rf = RandomForestClassifier(random_state=42)

# Define reduced parameter grids for hyperparameter tuning
param_grid_logreg = {
    'classifier__C': np.logspace(-2, 2, 10),
    'classifier__solver': ['liblinear']
}

param_grid_svm = {
    'classifier__C': np.logspace(-2, 2, 10),
    'classifier__gamma': ['scale', 'auto'],
    'classifier__kernel': ['linear', 'rbf']
}

param_grid_rf = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

# Pipelines for each model
pipe_logreg = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', logreg)])
pipe_svm = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', svm)])
pipe_rf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', rf)])

# RandomizedSearchCV for each model
cv = KFold(n_splits=3, shuffle=True, random_state=42)

random_search_logreg = RandomizedSearchCV(pipe_logreg, param_distributions=param_grid_logreg, n_iter=20, cv=cv,
                                          scoring='accuracy', random_state=42, n_jobs=-1)
random_search_svm = RandomizedSearchCV(pipe_svm, param_distributions=param_grid_svm, n_iter=20, cv=cv,
                                       scoring='accuracy', random_state=42, n_jobs=-1)
random_search_rf = RandomizedSearchCV(pipe_rf, param_distributions=param_grid_rf, n_iter=20, cv=cv,
                                      scoring='accuracy', random_state=42, n_jobs=-1)

# Fit the models
random_search_logreg.fit(X_train, y_train)
random_search_svm.fit(X_train, y_train)
random_search_rf.fit(X_train, y_train)

# Best estimators
best_logreg = random_search_logreg.best_estimator_
best_svm = random_search_svm.best_estimator_
best_rf = random_search_rf.best_estimator_

# Predictions and evaluations
models = {'Logistic Regression': best_logreg, 'SVM': best_svm, 'Random Forest': best_rf}
results = []

for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']
    })
    print(f"{name} Metrics:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy:.4f}")
    print("-" * 60)

# Convert results to DataFrame
results_df = pd.DataFrame(results).set_index('Model')
print(results_df)

# Extract feature importances for Random Forest
importances = best_rf.named_steps['classifier'].feature_importances_
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Plot feature importances
fig_importance = px.bar(feature_importance_df.sort_values(by='Importance', ascending=False),
                        x='Importance', y='Feature', orientation='h',
                        title="Feature Importances from Random Forest",
                        color='Importance', color_continuous_scale=px.colors.sequential.Viridis)
fig_importance.update_layout(xaxis_title="Importance", yaxis_title="Feature", height=800)
fig_importance.show()


Logistic Regression Metrics:
Confusion Matrix:
 [[82 19]
 [35 64]]
Classification Report:
               precision    recall  f1-score   support

           0       0.70      0.81      0.75       101
           1       0.77      0.65      0.70        99

    accuracy                           0.73       200
   macro avg       0.74      0.73      0.73       200
weighted avg       0.74      0.73      0.73       200

Accuracy: 0.7300
------------------------------------------------------------
SVM Metrics:
Confusion Matrix:
 [[77 24]
 [30 69]]
Classification Report:
               precision    recall  f1-score   support

           0       0.72      0.76      0.74       101
           1       0.74      0.70      0.72        99

    accuracy                           0.73       200
   macro avg       0.73      0.73      0.73       200
weighted avg       0.73      0.73      0.73       200

Accuracy: 0.7300
------------------------------------------------------------
Random Forest Metrics:
C

**Recommendation**

Based on the results:

  * Random Forest has the highest accuracy (0.79), precision (0.79), recall (0.79), and F1-score (0.79), making it the best-performing model among the three.
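  * SVM has lower metrics than Random Forest (accuracy 0.73) but is well balanced across the two classes, with a Class 1 recall of 0.70.
  * Logistic Regression matches SVM on accuracy (0.73) but has the lowest Class 1 recall (0.65), so it misses more hits than the other two models.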

**Conclusion**
The Random Forest model is recommended for classifying movies as 'Hit' or 'Flop' based on the given features, as it consistently outperforms the other models in terms of accuracy, precision, recall, and F1-score.
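
The figures in the classification reports follow directly from the confusion matrices. As a worked example, the sketch below recomputes the Class 1 metrics for Logistic Regression from the matrix printed above. The variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix from the output above:
# rows are the actual class (0, 1), columns are the predicted class (0, 1)
cm = np.array([[82, 19],
               [35, 64]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of the predicted hits, the share that were hits
recall_1 = tp / (tp + fn)      # of the actual hits, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"Accuracy:            {accuracy:.2f}")
print(f"Precision (Class 1): {precision_1:.2f}")
print(f"Recall (Class 1):    {recall_1:.2f}")
print(f"F1 (Class 1):        {f1_1:.2f}")
```

The results (0.73, 0.77, 0.65, 0.70) match the Class 1 row of the Logistic Regression report, and the 35 false negatives explain its comparatively low recall.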

## Award Prediction

To predict the number of awards a movie will win, we will create bins based on the number of wins and classify movies into these bins. By leveraging features such as genre, directors, votes, metascore, and release year, we will develop a classification model to predict the bin in which a movie falls. The process will involve defining categorical and numeric features, preprocessing the data, and training classification models. We will then evaluate these models to determine the most effective one for accurate award prediction.
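
The cell below reads a `wins_bin` column, but its creation is not shown in this part of the notebook. One way such bins could be built is with `pd.cut`. In the sketch that follows, the bin edges, the labels, and the choice to treat missing `wins` as zero are all hypothetical and are not the binning actually used for the reported results.

```python
import numpy as np
import pandas as pd

# Toy 'wins' values; in the notebook this would be movies_data['wins']
wins = pd.Series([0, 3, 12, 45, 91, np.nan], name="wins")

# Hypothetical bin edges and labels, chosen only for illustration
bin_edges = [-1, 5, 20, 50, np.inf]
bin_labels = ["0-5", "6-20", "21-50", "51+"]

# Assumption: a missing value means no recorded wins
wins_bin = pd.cut(wins.fillna(0), bins=bin_edges, labels=bin_labels)

print(wins_bin.tolist())  # ['0-5', '0-5', '6-20', '21-50', '51+', '0-5']
```

Bins with very few movies make classification harder, so the class counts (`wins_bin.value_counts()`) are worth checking before training.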

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import plotly.express as px

# Example data preprocessing
median_revenue = movies_data['gross_revenue'].median()
movies_data['target'] = (movies_data['gross_revenue'] > median_revenue).astype(int)  # Binary classification

# Features and target variable ('wins_bin' is assumed to have been created from 'wins' beforehand)
X = movies_data[['genre', 'directors', 'votes', 'metascore', 'release_year']]
y = movies_data['wins_bin']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define categorical and numeric features
numeric_features = ['votes', 'metascore', 'release_year']
categorical_features = ['genre', 'directors']

# Preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Fit the preprocessor to get the feature names
preprocessor.fit(X_train)

# Extract feature names after preprocessing
numeric_feature_names = preprocessor.named_transformers_['num'].get_feature_names_out(numeric_features)
categorical_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names = np.concatenate([numeric_feature_names, categorical_feature_names])

# Initialize models
logreg = LogisticRegression(max_iter=1000, random_state=42)
svm = SVC(random_state=42)
rf = RandomForestClassifier(random_state=42)

# Define parameter grids for hyperparameter tuning
param_grid_logreg = {
    'classifier__C': np.logspace(-2, 2, 10),
    'classifier__solver': ['liblinear']
}

param_grid_svm = {
    'classifier__C': np.logspace(-2, 2, 10),
    'classifier__gamma': ['scale', 'auto'],
    'classifier__kernel': ['linear', 'rbf']
}

param_grid_rf = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5],
    'classifier__min_samples_leaf': [1, 2]
}

# Pipelines for each model
pipe_logreg = Pipeline(steps=[('preprocessor', preprocessor),
                              ('classifier', logreg)])
pipe_svm = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', svm)])
pipe_rf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', rf)])

# RandomizedSearchCV for each model
cv = KFold(n_splits=3, shuffle=True, random_state=42)

random_search_logreg = RandomizedSearchCV(pipe_logreg, param_distributions=param_grid_logreg, n_iter=20, cv=cv,
                                          scoring='accuracy', random_state=42, n_jobs=-1)
random_search_svm = RandomizedSearchCV(pipe_svm, param_distributions=param_grid_svm, n_iter=20, cv=cv,
                                       scoring='accuracy', random_state=42, n_jobs=-1)
random_search_rf = RandomizedSearchCV(pipe_rf, param_distributions=param_grid_rf, n_iter=20, cv=cv,
                                      scoring='accuracy', random_state=42, n_jobs=-1)

# Fit the models
random_search_logreg.fit(X_train, y_train)
random_search_svm.fit(X_train, y_train)
random_search_rf.fit(X_train, y_train)

# Best estimators
best_logreg = random_search_logreg.best_estimator_
best_svm = random_search_svm.best_estimator_
best_rf = random_search_rf.best_estimator_

# Predictions and evaluations
models = {'Logistic Regression': best_logreg, 'SVM': best_svm, 'Random Forest': best_rf}
results = []

for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, output_dict=True)
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': report['weighted avg']['precision'],
        'Recall': report['weighted avg']['recall'],
        'F1-Score': report['weighted avg']['f1-score']
    })
    print(f"{name} Metrics:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy:.4f}")
    print("-" * 60)

# Convert results to DataFrame
results_df = pd.DataFrame(results).set_index('Model')
print(results_df)

# Ensure feature importances and feature names lengths match
importances = best_rf.named_steps['classifier'].feature_importances_
if len(importances) == len(feature_names):
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
    # Plot feature importances
    fig_importance = px.bar(feature_importance_df.sort_values(by='Importance', ascending=False),
                            x='Importance', y='Feature', orientation='h',
                            title="Feature Importances from Random Forest",
                            color='Importance', color_continuous_scale=px.colors.sequential.Viridis)
    fig_importance.update_layout(xaxis_title="Importance", yaxis_title="Feature", height=800)
    fig_importance.show()
else:
    print("Feature importances and feature names lengths do not match.")


KeyError: 'wins_bin'

**Recommendation**

Based on the results:

  * Random Forest has the highest metrics in most cases, making it a strong candidate for predicting which wins bin a movie falls into.
  * SVM also performs well, with metrics close to Random Forest.
  * Logistic Regression has the lowest metrics among the three models.
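
The Precision, Recall, and F1-Score compared here are read from `report['weighted avg']`, meaning each bin's score is weighted by that bin's share of the test set. The sketch below reproduces that averaging by hand for precision and recall. The labels are made up for illustration and are not results from this notebook, and `weighted_precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

# Hypothetical true and predicted bins for six movies (illustrative only).
y_true = ['few', 'few', 'few', 'many', 'many', 'none']
y_pred = ['few', 'few', 'many', 'many', 'few', 'none']


def weighted_precision_recall(y_true, y_pred):
    """Return support-weighted precision and recall across all true classes."""
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    # A class that is never predicted contributes a precision of 0.
    precision = sum(
        support[c] / n * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum(support[c] / n * (correct[c] / support[c]) for c in support)
    return precision, recall


precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

Support-weighted recall always equals accuracy, so the Recall and Accuracy columns carry the same information; precision and F1-score are the columns that can separate two models with similar accuracy.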

**Conclusion**

The Random Forest model is recommended for predicting which wins bin a movie falls into from the given features, as it outperforms the other models on most of the metrics considered (accuracy, precision, recall, and F1-score).