# Predicting Movie Rental Durations

A DVD rental company wants to predict how many days a customer will rent a DVD for, based on a set of features, and has approached you for help. They want you to try out several regression models for this prediction. The company needs a model that yields a mean squared error (MSE) of 3 or less on a test set. The resulting model will help the company plan its inventory more efficiently.

The data they provided is in the CSV file rental_info.csv. It has the following features:

"rental_date": The date (and time) the customer rents the DVD.

"return_date": The date (and time) the customer returns the DVD.

"amount": The amount paid by the customer for renting the DVD.

"amount_2": The square of "amount".

"rental_rate": The rate at which the DVD is rented.

"rental_rate_2": The square of "rental_rate".

"release_year": The year the movie being rented was released.

"length": Length of the movie being rented, in minutes.

"length_2": The square of "length".

"replacement_cost": The amount it will cost the company to replace the DVD.

"special_features": Any special features the DVD has, for example trailers or deleted scenes.

"NC-17", "PG", "PG-13", "R": These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.

In [5]:
# Importing modules
import pandas as pd
import numpy as np

# Lasso
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Run OLS
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Random forest
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV


# Data importing and initial information about the data
rental = pd.read_csv('datasets/rental_info.csv')


display(rental.head())
print('\n')
print(rental.info())

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB
None


The "rental_date" and "return_date" columns are converted to datetime format. The difference between them then gives the number of days each customer kept the movie.
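One detail worth knowing when a duration is derived from two timestamps: the `.dt.days` accessor returns only the whole-day component of each timedelta, so partial days are truncated. The sketch below illustrates this using the timestamps of the first two rows in the preview above; the names `toy`, `whole_days`, and `fractional_days` are made up for this illustration, and the fractional variant is shown only for comparison.

```python
import pandas as pd

# Timestamps copied from the first two preview rows; `toy` is an illustrative frame
toy = pd.DataFrame({
    "rental_date": pd.to_datetime(["2005-05-25 02:54:33+00:00",
                                   "2005-06-15 23:19:16+00:00"]),
    "return_date": pd.to_datetime(["2005-05-28 23:40:33+00:00",
                                   "2005-06-18 19:24:16+00:00"]),
})

duration = toy["return_date"] - toy["rental_date"]

# Whole-day component only: 3 days 20:46:00 becomes 3
whole_days = duration.dt.days

# Fractional days, for comparison (86400 seconds per day)
fractional_days = duration.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```

Whether truncated or fractional days is the better target depends on how the company counts a rental day; the truncated version is the one that reproduces the "rental_days" values shown in this notebook.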

In [6]:
# Calculate the rental duration in days
rental['return_date'] = pd.to_datetime(rental['return_date'])
rental['rental_date'] = pd.to_datetime(rental['rental_date'])
rental['rental_days'] = (rental['return_date']-rental['rental_date']).dt.days

display(rental.head())
print('\n')
print(rental.info())

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   rental_date       15861 non-null  datetime64[ns, UTC]
 1   return_date       15861 non-null  datetime64[ns, UTC]
 2   amount            15861 non-null  float64            
 3   release_year      15861 non-null  float64            
 4   rental_rate       15861 non-null  float64            
 5   length            15861 non-null  float64            
 6   replacement_cost  15861 non-null  float64            
 7   special_features  15861 non-null  object             
 8   NC-17             15861 non-null  int64              
 9   PG                15861 non-null  int64              
 10  PG-13             15861 non-null  int64              
 11  R                 15861 non-null  int64              
 12  amount_2          15861 non-null  float64            
 13 

This exercise requires two dummy variables derived from "special_features": one equal to 1 when "Deleted Scenes" is present, and one equal to 1 when "Behind the Scenes" is present. Afterwards, the columns that are not needed as features are dropped before the data is split into training and test sets.

In [8]:
# Add dummy variables
rental['deleted_scenes'] = np.where(rental['special_features'].str.contains('Deleted Scenes'),1,0)
rental['behind_the_scenes'] = np.where(rental['special_features'].str.contains('Behind the Scenes'),1,0)

# Choose columns to drop
cols_to_drop = ["special_features", "rental_days", "rental_date", "return_date"]

X = rental.drop(cols_to_drop, axis=1)
y = rental['rental_days']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=9)
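To make the dummy construction easier to verify, here is a minimal sketch on a toy frame. The strings mimic the format seen in the preview, but the rows themselves are invented for illustration. It uses `str.contains(..., regex=False)` followed by `astype(int)`, which should give the same 0/1 result as the `np.where` version above for plain substring matches.

```python
import pandas as pd

# Invented rows in the same "{...}" format as the special_features column
toy = pd.DataFrame({
    "special_features": [
        '{Trailers,"Behind the Scenes"}',
        '{"Deleted Scenes"}',
        '{Trailers,Commentaries}',
        '{"Deleted Scenes","Behind the Scenes"}',
    ]
})

# regex=False treats the pattern as a literal substring
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", regex=False).astype(int)
)
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", regex=False).astype(int)
)

print(toy[["deleted_scenes", "behind_the_scenes"]])
```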

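Feature selection is performed with Lasso regression: only the features with positive Lasso coefficients are kept, so features with zero or negative coefficients are dropped. The selected features are used for the Linear Regression, while the Random Forest Regression is trained on the full feature set. The best model is then selected using the mean squared error (MSE) on the test set.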

In [12]:
# Lasso
# Create the model
lasso = Lasso(random_state=9, alpha=0.3)

# Train the model
lasso.fit(X_train, y_train)
lasso_coef = lasso.coef_
# Boolean mask of strictly positive coefficients
# (despite the variable name, negative coefficients are excluded too)
non_zero_lasso = lasso_coef > 0

# Perform feature selection by keeping the columns with positive coefficients
X_train_non_zero = X_train.iloc[:, non_zero_lasso]
X_test_non_zero = X_test.iloc[:, non_zero_lasso]
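The difference between a `> 0` mask and a `!= 0` mask can be seen on synthetic data. The sketch below is an illustration only: the data, the column names `f0` to `f3`, and the true coefficients are all made up. The target depends positively on `f0`, negatively on `f1`, and not at all on the other two columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data (assumption for illustration): y = 3*f0 - 2*f1 + noise
rng = np.random.default_rng(9)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"])
y = 3 * X["f0"] - 2 * X["f1"] + rng.normal(scale=0.5, size=500)

lasso = Lasso(alpha=0.3, random_state=9)
lasso.fit(X, y)

positive_mask = lasso.coef_ > 0   # keeps only positively weighted features
nonzero_mask = lasso.coef_ != 0   # keeps every feature Lasso did not zero out

print("coefficients:", lasso.coef_.round(2))
print("coef > 0 :", list(X.columns[positive_mask]))
print("coef != 0:", list(X.columns[nonzero_mask]))
```

Here `f1` is informative but negatively weighted, so the `> 0` mask discards it while the `!= 0` mask keeps it. Which mask is appropriate depends on the intent of the selection step.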

In [13]:
# Linear Regression:

linear = LinearRegression()
linear.fit(X_train_non_zero,y_train)
y_pred_LR = linear.predict(X_test_non_zero)
mse_LR = mean_squared_error(y_test, y_pred_LR)

# Random Forest Regression
# Hyperparameter search space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor
rf = RandomForestRegressor()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf,
                                 param_distributions=param_dist,
                                 cv=5,
                                 random_state=9)

# Fit the random search object to the training data
rand_search.fit(X_train, y_train)

# Store the best hyperparameters
hyper_params = rand_search.best_params_

# Refit a random forest with the chosen hyperparameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"],
                           max_depth=hyper_params["max_depth"],
                           random_state=9)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_RF = mean_squared_error(y_test, rf_pred)
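As a possible variation on the search above, `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so `best_estimator_` can be used for prediction directly. Its default score for a regressor is R², and passing `scoring="neg_mean_squared_error"` aligns the search with the MSE target. The sketch below uses synthetic data from `make_regression` and a deliberately small search space so it runs quickly; none of these values come from the rental data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# Small search space so the example stays fast
param_dist = {"n_estimators": np.arange(10, 51, 10),
              "max_depth": np.arange(2, 7)}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),  # seeded so the search is repeatable
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring="neg_mean_squared_error",       # select on MSE instead of R^2
    random_state=9,
)
search.fit(X_train, y_train)

# Already refit on all of X_train with the best hyperparameters
best_rf = search.best_estimator_
mse = mean_squared_error(y_test, best_rf.predict(X_test))

print(search.best_params_, round(mse, 2))
```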

In [20]:
mse_dict = {'Linear Regression': mse_LR,
            'Random Forest Regression': mse_RF}

# Get the model with the minimum MSE
best_model = min(mse_dict, key=mse_dict.get)
best_mse = mse_dict[best_model]

print(f"The best model is: {best_model} with MSE: {best_mse}")

The best model is: Random Forest Regression with MSE: 2.225667528098759
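Because MSE is expressed in squared days, taking the square root (RMSE) gives a figure in days that is easier to interpret. The short calculation below uses the MSE printed above and the company's threshold of 3; the variable names are illustrative.

```python
import math

reported_mse = 2.225667528098759  # MSE printed above for the random forest
target_mse = 3.0                  # threshold requested by the company

rmse = math.sqrt(reported_mse)
target_rmse = math.sqrt(target_mse)
meets_target = reported_mse <= target_mse

print(f"RMSE: {rmse:.2f} days (target RMSE: {target_rmse:.2f} days)")
print(f"Meets target: {meets_target}")
```

In other words, the random forest's predictions are off by roughly a day and a half in the root-mean-square sense, which is inside the requested limit.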
