![dvd_image](../images/movie_rental_prediction/dvd_image.jpg)

A DVD rental company needs your help! They want to figure out how many days a customer will rent a DVD for based on some features and has approached you for help. They want you to try out some regression models which will help predict the number of days a customer will rent a DVD for. The company wants a model which yeilds a MSE of 3 or less on a test set. The model you make will help the company become more efficient inventory planning.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Lenght of the movie being rented, in minuites.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables of the rating of the movie. It takes the value 1 if the move is rated as the column name and 0 otherwise. For your convinience, the reference dummy has already been dropped.

# Importing Libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso, Ridge, LinearRegression
from sklearn.preprocessing import StandardScaler
# Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

## Exploring Data

In [3]:
df = pd.read_csv('../datasets/movie_rental_prediction/rental_info.csv')
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [4]:
df.dtypes

rental_date          object
return_date          object
amount              float64
release_year        float64
rental_rate         float64
length              float64
replacement_cost    float64
special_features     object
NC-17                 int64
PG                    int64
PG-13                 int64
R                     int64
amount_2            float64
length_2            float64
rental_rate_2       float64
dtype: object

In [5]:
df.isna().sum()

rental_date         0
return_date         0
amount              0
release_year        0
rental_rate         0
length              0
replacement_cost    0
special_features    0
NC-17               0
PG                  0
PG-13               0
R                   0
amount_2            0
length_2            0
rental_rate_2       0
dtype: int64

## Manipulating Columns

In [6]:
# converting 'return_date' and 'rental_date' to datetime so we can preform operations
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_date'] = pd.to_datetime(df['rental_date'])

# creating a column of the difference between the rental date and the return date
df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.days

In [7]:
# the main types are 'Behind the Scenes' and 'Deleted Scenes'
df['special_features'].unique()

array(['{Trailers,"Behind the Scenes"}', '{Trailers}',
       '{Commentaries,"Behind the Scenes"}', '{Trailers,Commentaries}',
       '{"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes","Behind the Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes"}',
       '{"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes"}', '{Commentaries}',
       '{Trailers,Commentaries,"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes"}', '{"Deleted Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}'],
      dtype=object)

In [8]:
# creating dummy column for the different main groups of special feature
df['deleted_scenes']=np.where(df['special_features'].str.contains('Behind the Scenes'), 0,1)

## Splitting Data

In [9]:
cols_to_drop = ["special_features", "rental_length_days", "rental_date", "return_date"]


X = df.drop(columns = cols_to_drop,axis=1)
y = df['rental_length_days']

In [10]:
# splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

## Building Models

In [11]:
# first, we will evaluate plain models without scaling features and check their preformance

# fitting a plain linear regression model
lr = LinearRegression()
lr.fit(X_train,y_train)
y_preds_lr = lr.predict(X_test)
# MSE for Linear Regression
mean_squared_error(y_test, y_preds_lr)

2.943688440000023

In [12]:
# fitting a plain ridge regression model running through alpha values in a for-loop
alpha_value_list = np.linspace(0.5,25,5)
for alpha in alpha_value_list:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train,y_train)
    y_preds_ridge = ridge.predict(X_test)
    # MSE for Ridge
    score = mean_squared_error(y_test, y_preds_ridge)
    print(f"The score for Ridge regression with an alpha value of {alpha} is {score}")

The score for Ridge regression with an alpha value of 0.5 is 2.943706059079038
The score for Ridge regression with an alpha value of 6.625 is 2.9439292330114917
The score for Ridge regression with an alpha value of 12.75 is 2.944165617420569
The score for Ridge regression with an alpha value of 18.875 is 2.9444147071009708
The score for Ridge regression with an alpha value of 25.0 is 2.9446760174063282


In [13]:
# fitting a plain Lasso regression model
alpha_value_list = np.linspace(0,1,5)

for alpha in alpha_value_list:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train,y_train)
    y_preds_lasso = lasso.predict(X_test)
    # MSE for Ridge
    score = mean_squared_error(y_test, y_preds_lasso)
    print(f"The score for Lasso regression with an alpha value of {alpha} is {score}")

  lasso.fit(X_train,y_train)
  model = cd_fast.enet_coordinate_descent(
  model = cd_fast.enet_coordinate_descent(


The score for Lasso regression with an alpha value of 0.0 is 2.943688440000025
The score for Lasso regression with an alpha value of 0.25 is 3.2483186945081934
The score for Lasso regression with an alpha value of 0.5 is 3.7137682414670485
The score for Lasso regression with an alpha value of 0.75 is 3.796281721558295
The score for Lasso regression with an alpha value of 1.0 is 3.8056884092652106


In [14]:
# now trying out the best regression model with scaled data

# scaling data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [15]:
# final model will be Linear Regression
model = LinearRegression()

model.fit(X_train_scaled,y_train)

y_preds = model.predict(X_test_scaled)

score = mean_squared_error(y_test,y_preds)

score

2.9458048013468945

This score still isn't below 3, so we are going to try out some different models

### Random Forest Model and Tuning

In [16]:
# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1,101,1),
          'max_depth':np.arange(1,11,1)}

# creating model
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train_scaled, y_train)

# Create a variable for the best hyper param
hyper_params = rand_search.best_params_

In [17]:
# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)
rf.fit(X_train_scaled,y_train)
rf_pred = rf.predict(X_test_scaled)
mse_random_forest= mean_squared_error(y_test, rf_pred)
mse_random_forest

2.249519864530372

In [18]:
# The lowest score out of all the models has been with Random Forest
best_model = rf
best_mse = mse_random_forest