# Assignment 4: Pipelines and Hyperparameter Tuning (32 total marks)
### Due: November 22 at 11:59pm

### Name: Matthew De Filippo

### In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models and evaluate the results. More details for each step can be found below.

### You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd

## Step 1: Data Input (4 marks)

Import the dataset you will be using. You can download the dataset onto your computer and read it in using pandas, or download it directly from the website. Answer the questions below about the dataset you selected. 

To find a dataset, you can use the resources listed in the notes. The dataset can be numerical, categorical, text-based or mixed. If you want help finding a particular dataset related to your interests, please email the instructor.

**You cannot use a dataset that was used for a previous assignment or in class**

In [2]:
# Import dataset (1 mark)
df = pd.read_csv('imdb_top_1000.csv')

### Questions (3 marks)

1. (1 mark) What is the source of your dataset?
1. (1 mark) Why did you pick this particular dataset?
1. (1 mark) Was there anything challenging about finding a dataset that you wanted to use?

#### Question 1 Response
The source of my dataset is Kaggle. It can be found at the following link:
https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

#### Question 2 Response
I selected this dataset because I am a fan of movies and thought it would be interesting to develop a regression model that could predict the IMDB rating (i.e. aggregated user score) based on several attributes such as Meta score (i.e. aggregated critic rating), runtime, release year, number of votes, gross, and certificate (e.g. PG, G, PG-13).

#### Question 3 Response
Yes; I wanted to find a dataset that was interesting so I spent a decent amount of time browsing Kaggle. There were some datasets that seemed to be poorly put together that I had to filter through. I found that using the usability filter on Kaggle allowed me to browse through higher quality datasets.

## Step 2: Data Processing (5 marks)

The next step is to process your data. Implement the following steps as needed.

In [3]:
# Clean data (if needed)
df.head(3)

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |


In [4]:
# First, I want to get rid of some attributes that I know I will not need for my analysis.
df = df.drop(columns = ['Poster_Link', 'Series_Title', 'Genre', 'Overview', 'Director', 'Star1', 'Star2', 'Star3', 'Star4'])
df.head()

| | Released_Year | Certificate | Runtime | IMDB_Rating | Meta_score | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|
| 0 | 1994 | A | 142 min | 9.3 | 80.0 | 2343110 | 28341469 |
| 1 | 1972 | A | 175 min | 9.2 | 100.0 | 1620367 | 134966411 |
| 2 | 2008 | UA | 152 min | 9.0 | 84.0 | 2303232 | 534858444 |
| 3 | 1974 | A | 202 min | 9.0 | 90.0 | 1129952 | 57300000 |
| 4 | 1957 | U | 96 min | 9.0 | 96.0 | 689845 | 4360000 |


In [5]:
# Now, let's take a look to see if there are any null values.
df.isnull().sum()

Released_Year      0
Certificate      101
Runtime            0
IMDB_Rating        0
Meta_score       157
No_of_Votes        0
Gross            169
dtype: int64

In [6]:
# Here, we see that Certificate, Meta_score, and Gross contain over 100 null values. We will need to deal with these.
# Since it is unclear what these values should be and guessing a value will impact the integrity of the data, I am simply 
# going to drop all null values from the dataset.
df = df.dropna()

In [7]:
# Let's take a look at the datatypes that we have.
df.dtypes

Released_Year      int64
Certificate       object
Runtime           object
IMDB_Rating      float64
Meta_score       float64
No_of_Votes        int64
Gross             object
dtype: object

In [8]:
# Notice that Runtime is an object datatype. This is because it is stored as a string with the word "min" concatenated with it.
# Let's convert this to an integer so we can perform proper regression. We also need to update the 'Released_Year' and 'Gross' to
# integer variables as well.
df['Runtime'] = df['Runtime'].str.extract(r'(\d+)').astype(int)
df['Released_Year'] = df['Released_Year'].astype(int)
df['Gross'] = df['Gross'].str.replace(',', '').astype(int)
df.dtypes

Released_Year      int32
Certificate       object
Runtime            int32
IMDB_Rating      float64
Meta_score       float64
No_of_Votes        int64
Gross              int32
dtype: object
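
The string-to-integer conversions above can be checked in isolation. The following is a minimal sketch on a few hand-typed toy rows (not the real dataset), assuming that `Runtime` values look like `"142 min"` and that `Gross` values carry thousands separators, as the `str.replace(',', '')` call suggests. The names `raw` and `clean` are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the raw string columns; not the real dataset.
raw = pd.DataFrame({
    "Runtime": ["142 min", "175 min", "96 min"],
    "Gross": ["28,341,469", "134,966,411", "4,360,000"],  # assumed comma-separated format
})

clean = pd.DataFrame({
    # Capture the leading digits; expand=False returns a Series instead of a DataFrame.
    "Runtime": raw["Runtime"].str.extract(r"(\d+)", expand=False).astype(int),
    # Remove the thousands separators literally (regex=False) before casting.
    "Gross": raw["Gross"].str.replace(",", "", regex=False).astype(int),
})

print(clean.dtypes)
print(clean)
```

Using a raw string for the pattern avoids Python's invalid-escape warning for `\d`, and `regex=False` makes it explicit that the comma is a literal character.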

In [9]:
# Implement preprocessing steps. Remember to use ColumnTransformer if more than one preprocessing method is needed.
# The only preprocessing we are going to need here is to encode the movie certificates.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(df[['Certificate']])
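# Reuse df's index so the encoded columns line up row-by-row in pd.concat
# (the earlier dropna() left gaps in the index, and a fresh 0..n-1 index would misalign).
encoded_df = pd.DataFrame(enc.transform(df[['Certificate']]), columns=enc.get_feature_names_out(), index=df.index)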
df = pd.concat([df, encoded_df], axis=1).drop(columns=['Certificate'])

In [10]:
# Drop any rows that still contain NaN values after the concat.
df = df.dropna()

### Questions (2 marks)

1. (1 mark) Were there any missing/null values in your dataset? If yes, how did you replace them and why? If no, describe how you would've replaced them and why.
2. (1 mark) What type of data do you have? What preprocessing methods would you have to apply based on your data types?

#### Question 1
Yes. There were several missing/null values in my dataset. As per the above comments, I decided to remove all rows with null values in order to maintain the integrity of the dataset. I would not know what values to replace these with.
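
For comparison, the alternative to dropping rows would be imputation. The sketch below is only an illustration on toy values, assuming a median fill for a numeric column and an explicit `"Unknown"` label for a categorical one. Neither choice was used in this assignment, and `toy` and `num_imputer` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps similar in kind to those in Meta_score and Certificate.
toy = pd.DataFrame({
    "Meta_score": [80.0, np.nan, 84.0, 90.0],
    "Certificate": ["A", "UA", None, "A"],
})

# Numeric gap: fill with the column median (84.0 for the values 80, 84, 90).
num_imputer = SimpleImputer(strategy="median")
toy["Meta_score"] = num_imputer.fit_transform(toy[["Meta_score"]]).ravel()

# Categorical gap: keep the row but flag the missing label explicitly.
toy["Certificate"] = toy["Certificate"].fillna("Unknown")

print(toy)
```

Imputation keeps more rows for training, at the cost of introducing values that were never observed, which is the trade-off described in the response above.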

#### Question 2
I have a mixture of numerical data (Released_Year, Runtime, IMDB_Rating, Meta_score, No_of_Votes, Gross) and categorical data (Certificate). Since I have categorical data, I had to use one-hot encoding on the Certificate attribute.

## Step 3: Implement Machine Learning Model (11 marks)

In this section, you will implement three different supervised learning models (one linear and two non-linear) of your choice. You will use a pipeline to help you decide which model and hyperparameters work best. It is up to you to select what models to use and what hyperparameters to test. You can use the class examples for guidance. You must print out the best model parameters and results after the grid search.

In [11]:
# Implement pipeline and grid search here. Can add more code blocks if necessary

In [12]:
# Let's split the data into our training and validation sets.
from sklearn.model_selection import train_test_split
X = df.drop(columns = 'IMDB_Rating')
y = df['IMDB_Rating']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_squared_error, r2_score

# Linear Regression Pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearRegression())
])

# Random Forest Pipeline
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestRegressor())
])

# GBR Pipeline
gbr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', GradientBoostingRegressor())
])

In [26]:
# Define the parameter grids.
param_grid_lr = {} 


param_grid_rf = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [10, 20, None]
}

param_grid_gbr = {
    'classifier__n_estimators': [100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 5, 7]
}

# Define the scoring metrics.
scoring = {
    'mean_squared_error': make_scorer(mean_squared_error),
    'r2_score': make_scorer(r2_score)
}

# Create GridSearchCV instances for each algorithm with multiple scoring metrics
grid_search_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=5, scoring=scoring, refit='r2_score')
grid_search_rf = GridSearchCV(rf_pipeline, param_grid_rf, cv=5, scoring=scoring, refit='r2_score')
grid_search_gbr = GridSearchCV(gbr_pipeline, param_grid_gbr, cv=5, scoring=scoring, refit='r2_score')


In [27]:
# Fit the models
grid_search_lr.fit(X_train, y_train)
grid_search_rf.fit(X_train, y_train)
grid_search_gbr.fit(X_train, y_train)

# Get the best parameters based on R2 (the refit metric)
best_params_rf = grid_search_rf.best_params_
best_params_lr = grid_search_lr.best_params_
best_params_gbr = grid_search_gbr.best_params_

# Access the results for both scoring metrics
results_rf = grid_search_rf.cv_results_
results_lr = grid_search_lr.cv_results_
results_gbr = grid_search_gbr.cv_results_

In [38]:
# Print the results for r2 score and mean squared error.
print("\nLinear Regression Results:")
print("R2 scores:", results_lr['mean_test_r2_score'])
print("MSE scores:", results_lr['mean_test_mean_squared_error'])
print("Best Parameters for Linear Regression based on R2 Score:", best_params_lr, "\n")

print("Random Forest Results:")
print("R2 scores:", results_rf['mean_test_r2_score'])
print("MSE scores:", results_rf['mean_test_mean_squared_error'])
print("Best Parameters for Random Forest based on R2 Score:", best_params_rf, "\n")

print("\nGradient Boosted Regressor Results:")
print("R2 scores:", results_gbr['mean_test_r2_score'])
print("MSE scores:", results_gbr['mean_test_mean_squared_error'])
print("\nBest Parameters for GBR based on R2 Score:", best_params_gbr)


Linear Regression Results:
R2 scores: [0.54047612]
MSE scores: [0.03066882]
Best Parameters for Linear Regression based on R2 Score: {} 

Random Forest Results:
R2 scores: [0.60985009 0.60889271 0.61278754 0.61113792 0.60430742 0.61697171]
MSE scores: [0.02627654 0.02631146 0.02602027 0.02609873 0.02659183 0.02572411]
Best Parameters for Random Forest based on R2 Score: {'classifier__max_depth': None, 'classifier__n_estimators': 200} 


Gradient Boosted Regressor Results:
R2 scores: [0.40629032 0.51440837 0.46805492 0.55617581 0.49260742 0.55194381
 0.5768896  0.55748626 0.57907432 0.55965261 0.55613517 0.53853788
 0.5511261  0.53996379 0.54491047 0.54852363 0.53010136 0.56047557]
MSE scores: [0.03965134 0.03239641 0.03540282 0.02956137 0.03393822 0.03004674
 0.02819713 0.02946402 0.02793037 0.029177   0.02947279 0.03086858
 0.02978851 0.03059196 0.03046288 0.02995356 0.03141106 0.02931356]

Best Parameters for GBR based on R2 Score: {'classifier__learning_rate': 0.1, 'classifier__m

### Questions (5 marks)

1. (1 mark) Do you need regression or classification models for your dataset?
1. (2 marks) Which models did you select for testing and why?
1. (2 marks) Which model worked the best? Does this make sense based on the theory discussed in the course and the context of your dataset?

#### Question 1 Response
I need a regression model for my dataset because my target variable, 'IMDB_Rating', is a continuous numeric variable. I am trying to predict the value of this target variable using the remaining features.

#### Question 2 Response
I selected the following three (3) models: LinearRegression, RandomForestRegressor, and GradientBoostingRegressor. I selected these models because they represent a range of different model complexities. I thought it would be interesting to see how these models perform from the most straightforward to implement (LinearRegression) to the most complicated (GradientBoostingRegressor).
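
As an aside, the three model families can also be compared inside a single grid search by treating the estimator step itself as a hyperparameter. The sketch below assumes synthetic data from `make_regression` and deliberately small grids so it runs quickly. The step name `regressor` and the variables `X_demo`, `y_demo`, `pipe` and `search` are hypothetical and do not appear in the cells above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook uses the IMDB features instead.
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),  # placeholder, swapped out by the grid below
])

# One dict per model family; each dict replaces the 'regressor' step and tunes its parameters.
param_grid = [
    {"regressor": [LinearRegression()]},
    {"regressor": [RandomForestRegressor(random_state=0)],
     "regressor__n_estimators": [25, 50],
     "regressor__max_depth": [5, None]},
    {"regressor": [GradientBoostingRegressor(random_state=0)],
     "regressor__n_estimators": [25, 50],
     "regressor__learning_rate": [0.1, 0.2]},
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="r2")
search.fit(X_demo, y_demo)

print(type(search.best_estimator_.named_steps["regressor"]).__name__)
print(search.best_params_)
```

With this layout, `best_params_` reports both the winning model family and its settings, so the comparison does not have to be assembled by hand from three separate searches.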

#### Question 3 Response
Based on the R2 scores shown above, the best results were achieved with the RandomForestRegressor model, which reached a maximum R2 score of 0.62. This makes sense based on the theory discussed in class, since it is a more flexible model than LinearRegression and can capture non-linear relationships. While the GradientBoostingRegressor can theoretically offer better results, it requires careful parameter tuning; more tuning would perhaps be required for it to outperform the RandomForestRegressor model in this scenario.

## Step 4: Validate Model (6 marks)

Use the testing set to calculate the testing accuracy for the best model determined in Step 3.

In [45]:
# Calculate testing accuracy (1 mark)
from sklearn.model_selection import cross_validate

# Instantiate the RandomForestRegressor model. Note that max_depth=5 is used here,
# whereas the grid search in Step 3 selected max_depth=None with n_estimators=200.
model = RandomForestRegressor(max_depth=5, n_estimators=200)
model.fit(X_train, y_train)
scores = cross_validate(model, X_train, y_train, cv=5, scoring='r2', return_train_score=True)
print(f"Training Accuracy (R2 Score) = {scores['train_score'].mean()}" )
print(f"Validation Accuracy (R2 Score) = {scores['test_score'].mean()}" )




Training Accuracy (R2 Score) = 0.7852414913970109
Validation Accuracy (R2 Score) = 0.5635404375177003



### Questions (5 marks)

1. (1 mark) Which accuracy metric did you choose? 
1. (1 mark) How do these results compare to those in part 3? Did this model generalize well?
1. (3 marks) Based on your results and the context of your dataset, did the best model perform "well enough" to be used out in the real-world? Why or why not? Do you have any suggestions for how you could improve this analysis?

#### Question 1 Response
I used the R2 score as the accuracy metric so that I could directly compare the results to those achieved in the previous section.

#### Question 2 Response
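With max_depth=5, the model reached a mean training R2 of 0.79 but a mean cross-validated R2 of only 0.56, which is below the 0.62 obtained in Step 3 (where the grid search selected max_depth=None). The gap between the training and validation scores indicates that the model did not generalize especially well.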
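
Step 4 asks for a score on the held-out split, which `cross_validate` on the training data does not provide. A minimal sketch of that check follows, assuming synthetic data from `make_regression` in place of the IMDB features. The names `X_tr`, `X_te`, `model_demo`, `train_r2`, `test_r2` and `test_mse` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split 80/20 like the notebook's train/validation split.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit on the training portion only.
model_demo = RandomForestRegressor(n_estimators=100, random_state=0)
model_demo.fit(X_tr, y_tr)

# Score the same fitted model on data it has seen and on data it has not.
train_r2 = r2_score(y_tr, model_demo.predict(X_tr))
test_r2 = r2_score(y_te, model_demo.predict(X_te))
test_mse = mean_squared_error(y_te, model_demo.predict(X_te))

print(f"Train R2 = {train_r2:.3f}")
print(f"Held-out R2 = {test_r2:.3f}, held-out MSE = {test_mse:.1f}")
```

In the notebook itself, the equivalent call would score the fitted model on `X_val` and `y_val`, which are created in Step 3 but not used afterwards.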

#### Question 3 Response
No; this model did not perform well enough to be used in the real world. The model is exhibiting high variance, which means we are overfitting the training data. Perhaps the features are simply not strongly correlated with the rating, or the model requires further tuning. Some potential changes we could make to the model include:
- Further tuning of the RandomForestRegressor parameters (max_depth, n_estimators, etc.)
- Changing to a GradientBoostingRegressor model and adequately tuning the parameters.
- Sourcing more data (this dataset only has 1000 records and approximately 300 have been removed due to NaN values).
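
The first suggestion could look roughly like the following. This is a sketch on synthetic data, assuming that `min_samples_leaf` and `max_features` are the parameters being explored. They were not part of the grid in Step 3, and the values shown are illustrative rather than recommended.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass X_train and y_train.
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=20.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    # Larger leaves and fewer features per split both constrain the individual trees.
    param_grid={"min_samples_leaf": [1, 5, 15], "max_features": [0.5, 1.0]},
    cv=3,
    scoring="r2",
    return_train_score=True,  # needed to compare training and validation scores
)
search.fit(X_demo, y_demo)

# Report the train/validation gap for every parameter combination.
results = search.cv_results_
for params, train, val in zip(results["params"], results["mean_train_score"], results["mean_test_score"]):
    print(params, f"train={train:.3f}", f"val={val:.3f}", f"gap={train - val:.3f}")
```

Looking at the gap alongside the validation score helps separate settings that merely lower the training score from those that actually improve generalization.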

## Process Description (4 marks)
Please describe the process you used to create your code. Cite any websites or generative AI tools used. You can use the following questions as guidance:
1. Where did you source your code?
1. In what order did you complete the steps?
1. If you used generative AI, what prompts did you use? Did you need to modify the code at all? Why or why not?
1. Did you have any challenges? If yes, what were they? If not, what helped you to be successful?

#### Question 1 Response
I sourced my code by referencing previously completed assignments as well as Lab 6 (Scaler, Pipelines, Gridsearch). I adapted and modified the code where applicable to suit my specific dataset and problem.

#### Question 2 Response
I completed the steps in the suggested order 1->2->3->4. I felt that this was a logical way to complete the assignment and was well laid out.

#### Question 3 Response
I used ChatGPT to help me find the name of the function astype(), which I used in the data processing phase of the assignment to convert the datatype of certain feature columns to integers so that they could be used in my regression model.

#### Question 4 Response
Yes; I ran into issues with my dataset because certain columns were being read in as objects/strings instead of integer datatypes. I eventually figured out that I needed to convert these to integers in the data processing step to make my models work properly.


## Reflection (2 marks)
Include a sentence or two about:
- what you liked or disliked,
- found interesting, confusing, challenging, motivating
while working on this assignment.

I liked this assignment because it gave me the opportunity to put together what I have learned so far in the course and apply it to a dataset that I was interested in. I found the concept of grid search a bit confusing at first, but referencing the notes and the related lab exercise helped me complete the assignment.