![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have approached you to try out some regression models for this. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
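To make the target concrete, here is a minimal sketch of how MSE is computed and compared with the threshold of 3. The helper name `mean_squared_error_manual` and the sample rental durations are illustrative assumptions, not part of the provided data; the notebook itself uses scikit-learn's `mean_squared_error`.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical rental durations in days (illustrative, not from rental_info.csv)
actual_days = [3, 5, 2, 7]
predicted_days = [4, 5, 1, 5]

MSE_TARGET = 3  # threshold stated by the company

mse = mean_squared_error_manual(actual_days, predicted_days)
print(f"MSE = {mse:.2f}, meets target: {mse <= MSE_TARGET}")
```

Because the errors are squared, MSE is expressed in days squared, and a single large miss weighs more than several small ones.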

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
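The squared columns and the rating dummies can be illustrated with a small sketch in plain Python. The helper names `encode_rating` and `add_square` and the sample movie are hypothetical. The sketch also assumes that the dropped reference rating is `"G"`, which the description above does not state.

```python
# Rating columns present in the data, in the order listed above
RATING_COLUMNS = ["NC-17", "PG", "PG-13", "R"]


def encode_rating(rating):
    """Return 0/1 dummies; the dropped reference rating maps to all zeros."""
    return {col: int(rating == col) for col in RATING_COLUMNS}


def add_square(row, column):
    """Add a squared copy of `column`, named like the `_2` columns in the data."""
    row[f"{column}_2"] = row[column] ** 2
    return row


# Hypothetical movie (illustrative values, not taken from rental_info.csv)
movie = {"length": 90, "rating": "PG-13"}

features = add_square({"length": movie["length"]}, "length")
features.update(encode_rating(movie["rating"]))
print(features)

# Assumption: "G" is the dropped reference rating, so it encodes as all zeros
print(encode_rating("G"))
```

Dropping one category avoids perfectly collinear dummy columns in linear models, since the omitted rating is implied when all four dummies are 0.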

# Start your coding from below
import pandas as pd
import numpy as np

# Data splitting, cross-validation, and hyperparameter search
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

# Preprocessing and pipeline construction
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Regression models
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor

# Evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score

# Read in data
df_rental = pd.read_csv("rental_info.csv")

# Add information on rental duration
df_rental["rental_length"] = pd.to_datetime(df_rental["return_date"]) - pd.to_datetime(df_rental["rental_date"])
df_rental["rental_length_days"] = df_rental["rental_length"].dt.days

### Add dummy variables
# Add dummy for deleted scenes
df_rental["deleted_scenes"] =  np.where(df_rental["special_features"].str.contains("Deleted Scenes"), 1, 0)
# Add dummy for behind the scenes
df_rental["behind_the_scenes"] =  np.where(df_rental["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]

# Split into feature and target sets
X = df_rental.drop(cols_to_drop, axis=1)
y = df_rental["rental_length_days"]

# Further split into training and test data
X_train,X_test,y_train,y_test = train_test_split(X, 
                                                 y, 
                                                 test_size=0.2, 
                                                 random_state=9)

# Standardize the features, then fit each candidate regressor in its own pipeline
preprocessor = StandardScaler()
pipelines = {
    'Linear Regression': Pipeline([('preprocessor', preprocessor), ('Regressor', LinearRegression())]),
    'Lasso': Pipeline([('preprocessor', preprocessor), ('Regressor', Lasso())]),
    'Random Forest': Pipeline([('preprocessor', preprocessor), ('Regressor', RandomForestRegressor())]),
    'Ridge': Pipeline([('preprocessor', preprocessor), ('Regressor', Ridge())]),
}

# Baseline 5-fold cross-validation on the training set.
# 'neg_mean_squared_error' returns the negated MSE, so the value printed as
# "Mean Accuracy" is -MSE (closer to zero is better), not a classification accuracy.
for name, pipeline in pipelines.items():
    score = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    print(f'{name} - Mean Accuracy : {np.mean(score):.4f}')

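# Hyperparameter grids, keyed by the same names as `pipelines`
param_grids = {
    # n_jobs must be an integer or None and only controls parallelism, so it is
    # not a tuning parameter; fit_intercept is searched instead
    'Linear Regression': {'Regressor__fit_intercept': [True, False]},
    'Lasso': {'Regressor__alpha': [1, 5, 10]},
    # n_jobs is left out here too, since it does not change the fitted model
    'Random Forest': {'Regressor__n_estimators': [100, 150, 200]},
    'Ridge': {'Regressor__alpha': [1, 5, 10]},
}

# Tune each pipeline with 5-fold grid search, then score the refit best model on the test set
for name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, param_grid=param_grids[name],
                               cv=5, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_
    y_pred = best_model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r_score = r2_score(y_test, y_pred)
    print(f'{name} : MSE {mse} R2 score {r_score}')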

Linear Regression - Mean Accuracy : -2.8491
Lasso - Mean Accuracy : -5.8096
Random Forest - Mean Accuracy : -2.0680
Ridge - Mean Accuracy : -2.8491
Linear Regression : MSE 2.9417238646975976 R2 score 0.5856476313096709
Lasso : MSE 5.9495850986576295 R2 score 0.1619795766905221
Random Forest : MSE 2.0323902033962487 R2 score 0.7137305424937211
Ridge : MSE 2.9418707029214026 R2 score 0.5856269486186209


**Summary**

Among the four models, Random Forest performs best on the test set: it has the highest R² score (0.7137), meaning it explains the most variance in rental duration, and the lowest MSE (2.0324), well under the company's target of 3. Its mean cross-validation score (-2.0680) is also the best of the four. That score is negative because the `neg_mean_squared_error` scorer reports the negated MSE, so values closer to zero are better; it is not an accuracy.
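Since scikit-learn's `neg_mean_squared_error` scorer returns the negated MSE (so that higher is always better), the cross-validation figures are easier to read once the sign is flipped. The sketch below does this for the mean scores printed above and also derives an RMSE in days; the variable names are illustrative.

```python
import math

# Mean cross-validation scores as printed by the notebook (negated MSE)
cv_scores = {
    "Linear Regression": -2.8491,
    "Lasso": -5.8096,
    "Random Forest": -2.0680,
    "Ridge": -2.8491,
}

# Flip the sign to recover MSE, then take the square root for RMSE in days
cv_mse = {name: -score for name, score in cv_scores.items()}
cv_rmse = {name: math.sqrt(value) for name, value in cv_mse.items()}

best_model_name = min(cv_mse, key=cv_mse.get)
for name in cv_mse:
    print(f"{name}: MSE {cv_mse[name]:.4f}, RMSE {cv_rmse[name]:.2f} days")
print(f"Lowest cross-validated MSE: {best_model_name}")
```

RMSE is in the same unit as the target (days), which makes the typical error size easier to communicate than MSE.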

Linear Regression and Ridge Regression show moderate and nearly identical performance, with R² scores of about 0.5856 and MSE values of 2.9417 and 2.9419 respectively. Both fall just under the MSE target of 3.

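Lasso Regression performs worst, with the lowest R² score (0.1620) and the highest MSE (5.9496), and it is the only model that misses the MSE target of 3. The grid only searched `alpha` values of 1, 5, and 10, so the penalty may simply be too strong for these standardized features; smaller values were not tested.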
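As a consistency check on the reported numbers, recall that R² = 1 - MSE / Var(y), where Var(y) is the variance of the test-set target. Each model's MSE and R² should therefore imply the same variance. The sketch below uses only the test-set values printed above; the variable names are illustrative.

```python
# Test-set results as printed by the notebook: (MSE, R2)
results = {
    "Linear Regression": (2.9417238646975976, 0.5856476313096709),
    "Lasso": (5.9495850986576295, 0.1619795766905221),
    "Random Forest": (2.0323902033962487, 0.7137305424937211),
    "Ridge": (2.9418707029214026, 0.5856269486186209),
}

# R2 = 1 - MSE / Var(y), so Var(y) = MSE / (1 - R2)
implied_variance = {name: mse / (1 - r2) for name, (mse, r2) in results.items()}
for name, variance in implied_variance.items():
    print(f"{name}: implied test-set variance {variance:.4f}")

# Compare each test MSE with the company's target
MSE_TARGET = 3
meets_target = [name for name, (mse, _) in results.items() if mse <= MSE_TARGET]
print("Models meeting the MSE target:", meets_target)
```

All four models imply a test-set variance of roughly 7.10 days squared, so an MSE of 3 corresponds to an R² of about 0.58 on this split.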