![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you to try out some regression models for this. The company wants a model that yields an MSE of 3 or less on a test set. The model you make will help the company plan its inventory more efficiently.
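
As a reminder of what the target metric measures, here is a minimal sketch of mean squared error in plain Python. The helper name `mean_squared_error_manual` and the sample values are made up for illustration; the notebook itself relies on scikit-learn's `mean_squared_error`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical rental lengths (in days) and predictions
actual_days = [3, 2, 7, 2, 4]
predicted_days = [4, 2, 5, 3, 4]

example_mse = mean_squared_error_manual(actual_days, predicted_days)
print(example_mse)  # (1 + 0 + 4 + 1 + 0) / 5
```

Because errors are squared, an MSE of 3 corresponds to a typical error of roughly 1.7 days (the square root of 3).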

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
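
To illustrate what "the reference dummy has already been dropped" means, here is a small sketch using `pd.get_dummies` with `drop_first=True`. The rating values below are hypothetical, and treating `"G"` as the reference level is an assumption; the dataset description does not name the dropped category.

```python
import pandas as pd

# Hypothetical ratings; "G" is ASSUMED to be the reference level
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13"], name="rating")

# drop_first=True removes the first category in sorted order ("G" here),
# so a row with all zeros stands for the reference rating
rating_dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)
print(rating_dummies)
```

Dropping one level avoids perfect collinearity among the dummies when a linear model also fits an intercept.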

In [334]:
import pandas as pd
import numpy as np
from scipy.stats import randint, uniform

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import RandomizedSearchCV

In [335]:
df = pd.read_csv('rental_info.csv')
print(df.head())

                 rental_date  ... rental_rate_2
0  2005-05-25 02:54:33+00:00  ...        8.9401
1  2005-06-15 23:19:16+00:00  ...        8.9401
2  2005-07-10 04:27:45+00:00  ...        8.9401
3  2005-07-31 12:06:41+00:00  ...        8.9401
4  2005-08-19 12:30:04+00:00  ...        8.9401

[5 rows x 15 columns]


In [336]:
# Convert 'return_date' and 'rental_date' to datetime format
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_date'] = pd.to_datetime(df['rental_date'])

# Get number of rental days
df['rental_length'] = df['return_date'] - df['rental_date']
df['rental_length_days'] = df['rental_length'].dt.days

print(df.head())

                rental_date  ... rental_length_days
0 2005-05-25 02:54:33+00:00  ...                  3
1 2005-06-15 23:19:16+00:00  ...                  2
2 2005-07-10 04:27:45+00:00  ...                  7
3 2005-07-31 12:06:41+00:00  ...                  2
4 2005-08-19 12:30:04+00:00  ...                  4

[5 rows x 17 columns]
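
Note that `.dt.days` keeps only the whole-day component of each duration, so a rental of 3 days and 20 hours counts as 3 days. The sketch below reproduces this for the first row using the standard library, whose `timedelta.days` attribute behaves the same way for positive durations. The variable names are illustrative only.

```python
from datetime import datetime

# Timestamps taken from the first row of the data
rental = datetime.fromisoformat("2005-05-25 02:54:33+00:00")
returned = datetime.fromisoformat("2005-05-28 23:40:33+00:00")

duration = returned - rental

# Whole-day component only (what the target column stores)
whole_days = duration.days

# Full duration expressed in days, including the fractional part
fractional_days = duration.total_seconds() / 86400

print(whole_days)
print(round(fractional_days, 2))
```

Whether truncating is the right choice depends on how the company counts a rental day; the notebook uses the truncated value as its target.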


In [337]:
print(df['special_features'].value_counts())

{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Deleted Scenes","Behind

In [338]:
# Add dummy variables
df['deleted_scenes'] = np.where(df['special_features'].str.contains('Deleted Scenes'), 1, 0)
df['trailers'] = np.where(df['special_features'].str.contains('Trailers'), 1, 0)
df['behind_the_scenes'] = np.where(df['special_features'].str.contains('Behind the Scenes'), 1, 0)
df['commentaries'] = np.where(df['special_features'].str.contains('Commentaries'), 1, 0)

df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length,rental_length_days,deleted_scenes,trailers,behind_the_scenes,commentaries
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3 days 20:46:00,3,0,1,1,0
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 20:05:00,2,0,1,1,0
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7 days 05:44:00,7,0,1,1,0
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2 days 02:24:00,2,0,1,1,0
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4 days 01:05:00,4,0,1,1,0
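
The four dummy columns above follow the same pattern, so they could also be built in a loop. This is only a sketch on toy data: the `feature_columns` mapping and the `toy` frame are hypothetical names, and `regex=False` is used so the labels are matched as literal text.

```python
import pandas as pd

# Toy data in the same format as the special_features column
toy = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes"}',
        '{Trailers,Commentaries,"Deleted Scenes"}',
    ]
})

# Hypothetical mapping from new column name to the label to search for
feature_columns = {
    "deleted_scenes": "Deleted Scenes",
    "trailers": "Trailers",
    "behind_the_scenes": "Behind the Scenes",
    "commentaries": "Commentaries",
}

for column, label in feature_columns.items():
    # 1 if the label appears in the string, 0 otherwise
    toy[column] = toy["special_features"].str.contains(label, regex=False).astype(int)

print(toy[list(feature_columns)])
```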


In [339]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   rental_date         15861 non-null  datetime64[ns, UTC]
 1   return_date         15861 non-null  datetime64[ns, UTC]
 2   amount              15861 non-null  float64            
 3   release_year        15861 non-null  float64            
 4   rental_rate         15861 non-null  float64            
 5   length              15861 non-null  float64            
 6   replacement_cost    15861 non-null  float64            
 7   special_features    15861 non-null  object             
 8   NC-17               15861 non-null  int64              
 9   PG                  15861 non-null  int64              
 10  PG-13               15861 non-null  int64              
 11  R                   15861 non-null  int64              
 12  amount_2            15861 non-nu

## Splitting train - test set

In [340]:
# Create dataframe of features for running regression models
X = df[['amount', 'rental_rate', 'length', 'replacement_cost', 
        'amount_2', 'length_2', 'rental_rate_2', 
        'deleted_scenes', 'trailers', 'behind_the_scenes', 
        'NC-17', 'PG', 'PG-13', 'R']]
y = df['rental_length_days']

print('Shape of X: ', X.shape)
print('Shape of y: ', y.shape)


Shape of X:  (15861, 14)
Shape of y:  (15861,)


In [341]:
# Set seed for reproducibility
SEED = 9

# Execute a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=SEED )
print('Shape of X_train: ', X_train.shape)
print('Shape of X_test: ', X_test.shape)

Shape of X_train:  (12688, 14)
Shape of X_test:  (3173, 14)
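
The split sizes can be checked by hand. The sketch below assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the training set.

```python
import math

n_samples = 15861
test_size = 0.2

# 0.2 * 15861 = 3172.2, rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```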


## Feature Selection

In [342]:
from sklearn.linear_model import Lasso

# Instantiate the model
lasso = Lasso(alpha = 0.01, random_state = SEED)

# Fit the model
lasso.fit(X_train, y_train)

# Access coefficients
lasso_coef = lasso.coef_
print('Lasso coefficients: ', lasso_coef)

# Keep features with strictly positive coefficients (note: `!= 0` would also keep the negative ones)
selected_features = X_train.columns[lasso_coef > 0]
print('Selected features: ', selected_features)

# Subset the data
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
print('Shape of X_train_selected: ', X_train_selected.shape)
print('Shape of X_test_selected: ', X_test_selected.shape)


Lasso coefficients:  [ 1.50966588e+00 -1.06342718e+00  2.95551524e-03 -3.49894264e-03
 -3.34266132e-02 -7.38984056e-06 -2.44358072e-02 -9.34567483e-02
 -5.35295570e-02  4.57871440e-02  2.85318450e-02  0.00000000e+00
  2.92265824e-03 -9.43248284e-02]
Selected features:  Index(['amount', 'length', 'behind_the_scenes', 'NC-17', 'PG-13'], dtype='object')
Shape of X_train_selected:  (12688, 5)
Shape of X_test_selected:  (3173, 5)
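
The filter `lasso_coef > 0` keeps only positive coefficients, which is a stricter rule than "non-zero". The sketch below compares the two rules using the coefficients printed above, with no model fitting involved. The variable names are illustrative.

```python
feature_names = ['amount', 'rental_rate', 'length', 'replacement_cost',
                 'amount_2', 'length_2', 'rental_rate_2',
                 'deleted_scenes', 'trailers', 'behind_the_scenes',
                 'NC-17', 'PG', 'PG-13', 'R']

# Coefficients copied from the Lasso output above
coefficients = [1.50966588e+00, -1.06342718e+00, 2.95551524e-03, -3.49894264e-03,
                -3.34266132e-02, -7.38984056e-06, -2.44358072e-02, -9.34567483e-02,
                -5.35295570e-02, 4.57871440e-02, 2.85318450e-02, 0.00000000e+00,
                2.92265824e-03, -9.43248284e-02]

# Rule used in the notebook: strictly positive coefficients
positive = [name for name, c in zip(feature_names, coefficients) if c > 0]

# Alternative rule: every coefficient Lasso did not shrink to exactly zero
non_zero = [name for name, c in zip(feature_names, coefficients) if c != 0]

print(len(positive), positive)
print(len(non_zero), non_zero)
```

With the non-zero rule, only `PG` would be dropped, so features with large negative coefficients such as `rental_rate` would remain available to the downstream models.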


## Choosing models and performing hyperparameter tuning

### Ridge Regression

In [343]:
from sklearn.linear_model import Ridge

# Instantiate model
ridge = Ridge()

# Define hyperparameters: 20 log-spaced alpha values between 1e-4 and 1e4
params = {'alpha': np.logspace(-4, 4, 20)}

# Instantiate Randomized Search (n_iter matches the 20 values in the grid)
rand_ridge = RandomizedSearchCV(estimator=ridge,
                                param_distributions=params,
                                n_iter=20,
                                scoring='neg_mean_squared_error',
                                cv=5, verbose=1, n_jobs=-1,
                                random_state=SEED)

rand_ridge.fit(X_train_selected, y_train)

# Extract the best hyperparameters
print('Best hyperparameters:\n', rand_ridge.best_params_)

# Extract the best model
best_model = rand_ridge.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test_selected)

# Calculate the test set MSE
best_mse = MSE(y_test, y_pred)
print(best_mse)


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters:
 {'alpha': 206.913808111479}
4.846495506603203
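
The log above reports 20 candidates because the `alpha` grid is a fixed list of 20 values, so the search cannot sample more distinct settings than that. A quick check of the grid, including a lookup of the reported best value:

```python
import numpy as np

# Same grid as in the cell above
alpha_grid = np.logspace(-4, 4, 20)

# Best alpha reported by the search
reported_alpha = 206.913808111479

# Position of the grid value closest to the reported one
closest_index = int(np.argmin(np.abs(alpha_grid - reported_alpha)))

print(len(alpha_grid))
print(closest_index, alpha_grid[closest_index])
```

Since the list is exhausted either way, `GridSearchCV` over the same 20 values would evaluate the same candidates.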


### Decision Tree

In [344]:
from sklearn.tree import DecisionTreeRegressor

# Instantiate model
dt = DecisionTreeRegressor(random_state = SEED)

# Define hyperparameters
dt_params = {
    'max_depth': [None] + list(range(4, 21, 2)),  # Unlimited depth, or 4 to 20 in steps of 2
    'min_samples_leaf': randint(1, 20),  # Integers from 1 to 19
    'min_samples_split': randint(2, 20),  # Integers from 2 to 19
    'max_features': ['log2', 'sqrt', None],  # 'auto' is rejected by recent scikit-learn releases; None uses all features
    'splitter': ['best', 'random'],
}

In [345]:
# Instantiate Random Search (seeded so the sampled candidates are reproducible)
tree = RandomizedSearchCV(estimator=dt,
                          param_distributions=dt_params,
                          cv=3,
                          scoring='neg_mean_squared_error',
                          verbose=1, n_jobs=-1,
                          random_state=SEED)

# Fit to the training set
tree.fit(X_train_selected, y_train)

# Extract the best hyperparameters
print('Best hyperparameters:\n', tree.best_params_)

# Extract the best model
best_model = tree.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test_selected)

# Calculate the test set MSE
best_mse = MSE(y_test, y_pred)
print(best_mse)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best hyperparameters:
 {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'best'}
2.940904085913742


### Random Forest

In [346]:
from sklearn.ensemble import RandomForestRegressor

# Set seed for reproducibility
SEED = 9

# Instantiate Random Forest
rf = RandomForestRegressor(random_state = SEED)

# Define hyperparameters
params = {
    'n_estimators': randint(100, 1000),  # Integers from 100 to 999
    'max_depth': [None] + list(range(4, 21, 2)),  # Unlimited depth, or 4 to 20 in steps of 2
    'min_samples_leaf': randint(1, 20),  # Integers from 1 to 19
    'min_samples_split': randint(2, 20),  # Integers from 2 to 19
    'max_features': ['log2', 'sqrt', None],  # 'auto' is rejected by recent scikit-learn releases; None uses all features
    'bootstrap': [True],  # max_samples requires bootstrap=True; bootstrap=False candidates fail to fit
    'max_samples': uniform(0.5, 0.5),  # Fraction of rows per tree, between 0.5 and 1.0
}

In [347]:

# Instantiate Random Search (seeded so the sampled candidates are reproducible)
rand = RandomizedSearchCV(estimator=rf,
                          param_distributions=params,
                          cv=3,
                          scoring='neg_mean_squared_error',
                          verbose=1, n_jobs=-1,
                          random_state=SEED)

# Fit to the training set
rand.fit(X_train_selected, y_train)

# Extract the best hyperparameters
print('Best hyperparameters:\n', rand.best_params_)

# Extract the best model
best_model = rand.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test_selected)

# Calculate the test set MSE
best_mse = MSE(y_test, y_pred)
print(best_mse)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best hyperparameters:
 {'bootstrap': True, 'max_depth': 14, 'max_features': 'log2', 'max_samples': 0.9234501589221612, 'min_samples_leaf': 5, 'min_samples_split': 18, 'n_estimators': 220}
3.3587243096773727


### XGBoost

In [348]:
import xgboost as xgb

# Define the hyperparameter search space
xgb_param = {
    'n_estimators': [25],
    'max_depth': range(2, 12),  # Depths 2 to 11
}

# n_estimators is set by the search space above
xg_reg = xgb.XGBRegressor(random_state=SEED)

rand_xgb = RandomizedSearchCV(estimator=xg_reg,
                              param_distributions=xgb_param,
                              scoring='neg_mean_squared_error',
                              n_iter=5, cv=4, verbose=1,
                              random_state=SEED)
                                    
rand_xgb.fit(X_train_selected, y_train)

# Extract the best hyperparameters
print('Best hyperparameters:\n', rand_xgb.best_params_)

# Extract the best model
best_model = rand_xgb.best_estimator_

# Predict the test set labels
y_pred = best_model.predict(X_test_selected)

# Calculate the test set MSE
best_mse = MSE(y_test, y_pred)
print(best_mse)

Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best hyperparameters:
 {'n_estimators': 25, 'max_depth': 11}
2.9541590335515613
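
To compare the four runs against the company's requirement, the reported test MSEs can be collected in one place. This is a small sketch that only reuses the numbers printed above; the dictionary and variable names are illustrative.

```python
# Test-set MSE values copied from the outputs above
reported_mse = {
    "Ridge": 4.846495506603203,
    "Decision tree": 2.940904085913742,
    "Random forest": 3.3587243096773727,
    "XGBoost": 2.9541590335515613,
}

# Requirement from the brief: test MSE of 3 or less
MSE_TARGET = 3

meets_target = {name: mse for name, mse in reported_mse.items() if mse <= MSE_TARGET}
best_name = min(reported_mse, key=reported_mse.get)

for name, mse in sorted(reported_mse.items(), key=lambda item: item[1]):
    status = "meets target" if mse <= MSE_TARGET else "above target"
    print(f"{name}: {mse:.3f} ({status})")

print("Lowest MSE:", best_name)
```

On these runs the decision tree and XGBoost fall under the threshold. Keeping each model and its score under separate names, rather than overwriting `best_model` and `best_mse` in every cell, would make this comparison possible directly from the fitted objects.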
