![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the movie's rating. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
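
As a rough illustration of what "the reference dummy has already been dropped" means, the sketch below builds rating dummies with `pandas.get_dummies(..., drop_first=True)` on a small made-up sample. The rating values, including `"G"` as the dropped reference category, are assumptions for illustration; the source does not say how the columns were actually produced.

```python
import pandas as pd

# Hypothetical ratings; "G" as the fifth (reference) category is an assumption
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "G"], name="rating")

# drop_first=True drops the first category in sorted order ("G"),
# leaving one 0/1 column per remaining rating
rating_dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(list(rating_dummies.columns))
print(rating_dummies.sum(axis=1).tolist())
```

Under this encoding, a row with zeros in all four columns stands for the reference rating.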

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

In [22]:
df = pd.read_csv('rental_info.csv')

df['rental_length_days'] = (pd.to_datetime(df['return_date']) - pd.to_datetime(df['rental_date'])).dt.days
df['deleted_scenes'] = df['special_features'].apply(lambda x: 'Deleted Scenes' in x).astype('int')
df['behind_the_scenes'] = df['special_features'].apply(lambda x: 'Behind the Scenes' in x).astype('int')

df.sample(10)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
8866,2005-06-15 08:18:37+00:00,2005-06-22 07:36:37+00:00,6.99,2009.0,2.99,80.0,10.99,"{Commentaries,""Behind the Scenes""}",0,0,0,0,48.8601,6400.0,8.9401,6,0,1
14151,2005-08-18 12:07:25+00:00,2005-08-27 07:41:25+00:00,8.99,2009.0,4.99,120.0,28.99,"{""Behind the Scenes""}",0,0,1,0,80.8201,14400.0,24.9001,8,0,1
13273,2005-07-29 22:44:57+00:00,2005-08-03 22:25:57+00:00,3.99,2009.0,2.99,147.0,24.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,1,0,15.9201,21609.0,8.9401,4,1,1
15455,2005-08-22 07:57:08+00:00,2005-08-23 11:11:08+00:00,4.99,2006.0,4.99,148.0,17.99,"{Trailers,Commentaries}",0,0,1,0,24.9001,21904.0,24.9001,1,0,0
8181,2005-08-22 08:04:31+00:00,2005-08-31 06:30:31+00:00,3.99,2007.0,0.99,126.0,16.99,"{Commentaries,""Behind the Scenes""}",0,0,0,0,15.9201,15876.0,0.9801,8,0,1
3280,2005-08-01 16:00:02+00:00,2005-08-08 20:18:02+00:00,4.99,2009.0,2.99,57.0,12.99,{Commentaries},1,0,0,0,24.9001,3249.0,8.9401,7,0,0
5154,2005-07-28 11:56:00+00:00,2005-07-30 14:33:00+00:00,2.99,2009.0,2.99,179.0,16.99,"{Commentaries,""Deleted Scenes"",""Behind the Sce...",0,0,0,1,8.9401,32041.0,8.9401,2,1,1
14182,2005-06-18 16:59:23+00:00,2005-06-21 14:12:23+00:00,0.99,2008.0,0.99,118.0,15.99,"{Trailers,""Behind the Scenes""}",0,1,0,0,0.9801,13924.0,0.9801,2,0,1
14072,2005-07-28 19:22:27+00:00,2005-07-30 17:06:27+00:00,0.99,2010.0,0.99,69.0,29.99,"{Trailers,""Behind the Scenes""}",0,1,0,0,0.9801,4761.0,0.9801,1,0,1
8485,2005-07-11 06:37:51+00:00,2005-07-17 02:34:51+00:00,0.99,2007.0,0.99,113.0,20.99,"{Commentaries,""Deleted Scenes"",""Behind the Sce...",0,1,0,0,0.9801,12769.0,0.9801,5,1,1
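
Note that `.dt.days` keeps only the whole-day component of each timedelta, so a rental returned after 6 days and 23 hours counts as 6 days. A minimal sketch using two of the date pairs from the sample above (the variable names are illustrative, not from the notebook):

```python
import pandas as pd

# Two rental/return pairs copied from the sample output
rental = pd.to_datetime(pd.Series(["2005-06-15 08:18:37+00:00",
                                   "2005-08-22 07:57:08+00:00"]))
returned = pd.to_datetime(pd.Series(["2005-06-22 07:36:37+00:00",
                                     "2005-08-23 11:11:08+00:00"]))

duration = returned - rental

whole_days = duration.dt.days                          # whole-day component only
fractional_days = duration.dt.total_seconds() / 86400  # keeps the partial day

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```

Whether to truncate or keep the fractional part is a modelling choice; the notebook uses the truncated value as its target.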


In [25]:
X = df[['amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'NC-17', 'PG',
        'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2',
        'deleted_scenes', 'behind_the_scenes']]

y = df['rental_length_days']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

In [32]:
# Let's try a random forest
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=9)

In [33]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators': [100, 500],
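    # NOTE: 'auto' is rejected by scikit-learn >= 1.3, which is the likely cause of
    # the failed fits reported when this grid is trained; for regressors the
    # equivalent setting is max_features=1.0
    'max_features': ['auto', 'sqrt'],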
    'min_samples_leaf': [2, 30]
}

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

In [34]:
# Training
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


12 fits failed out of a total of 24.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/home/israel/miniconda3/envs/ds2/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/israel/miniconda3/envs/ds2/lib/python3.12/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/home/israel/miniconda3/envs/ds2/lib/python3.12/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/home/israel/miniconda3/envs/ds2/lib/python3.12/site-packages/sklearn/utils/_param_validatio
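
The traceback above is cut off, but 12 failures out of 24 fits matches the four candidates that use `max_features='auto'` (4 candidates × 3 folds), a value that scikit-learn 1.3 and later no longer accept. Below is a minimal sketch of a grid that avoids the failure, assuming `1.0` (use all features, the former meaning of `'auto'` for regressors) is an acceptable substitute. The synthetic data and the names `X_demo`, `y_demo`, `params_rf_fixed`, and `grid_demo` are placeholders, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem standing in for the rental data
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=0.5, random_state=9)

params_rf_fixed = {
    'n_estimators': [10, 20],       # kept small so the sketch runs quickly
    'max_features': [1.0, 'sqrt'],  # 1.0 replaces the removed 'auto'
    'min_samples_leaf': [2, 30],
}

grid_demo = GridSearchCV(
    estimator=RandomForestRegressor(random_state=9),
    param_grid=params_rf_fixed,
    scoring='neg_mean_squared_error',
    cv=3,
    error_score='raise',            # surface any fit error instead of storing NaN
)
grid_demo.fit(X_demo, y_demo)

# Every candidate should now have a finite cross-validated score
scores = grid_demo.cv_results_['mean_test_score']
print(len(scores), np.isnan(scores).any())
```

Setting `error_score='raise'`, as the warning itself suggests, makes an invalid parameter stop the search immediately rather than silently scoring half the grid as NaN.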

In [35]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute the test RMSE (square root of the MSE)
rmse_test = MSE(y_test, y_pred) ** (1/2)

# Print the test RMSE and the best hyperparameters
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
print('Best Parameters:', grid_rf.best_params_)

Test RMSE of best model: 1.423
Best Parameters: {'max_features': 'sqrt', 'min_samples_leaf': 2, 'n_estimators': 500}
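
The cell above reports RMSE, while the company's target is stated as an MSE of 3 or less. Since RMSE is the square root of MSE, squaring the reported value gives a figure that can be compared with the target. A small sketch using the rounded number printed above, so the result is approximate (`reported_rmse` and `mse_target` are illustrative names):

```python
reported_rmse = 1.423   # rounded value printed by the notebook
mse_target = 3.0        # threshold stated in the project brief

# RMSE = sqrt(MSE), so MSE = RMSE ** 2
approx_mse = reported_rmse ** 2
meets_target = approx_mse <= mse_target

print(f"Approximate test MSE: {approx_mse:.2f}")
print("Meets target:", meets_target)
```

In the notebook itself, the exact figure would come from calling `MSE(y_test, y_pred)` without taking the square root.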
