# Solution for Lasso Regression, DecisionTreeRegressor, RandomForestRegressor
## Problem Statement:
A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the number of days a customer will rent a DVD for. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

## Results
The best model is the tuned random forest, which achieves the lowest test-set MSE (mean squared error) of 2.23. This is below the company's target of 3.

# CODE

In [5]:
# Import required packages
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

In [6]:
pwd

'C:\\Users\\OK'

In [7]:
# Import and explore data
# Read in the csv file 
rentals_df = pd.read_csv(r"C:\Users\OK\Documents\Self\I know Python\rental_info.csv")
rentals_df.info()
rentals_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [8]:
# Clean the data
## Calculate the rental duration in days
rentals_df["rental_length"] = pd.to_datetime(rentals_df["return_date"]) - pd.to_datetime(rentals_df["rental_date"])
rentals_df["rental_length_days"] = rentals_df["rental_length"].dt.days
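
As a small check of the duration step, the sketch below applies the same subtraction to two rows copied from the `head()` output above. The `sample` frame is only an illustration, not part of the notebook. It shows that `.dt.days` keeps the whole-day component of the timedelta and discards the remaining hours rather than rounding.

```python
import pandas as pd

# Two rows copied from the rentals_df.head() output (illustrative subset only)
sample = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-07-10 04:27:45+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-07-17 10:11:45+00:00"],
})

# Subtracting two datetime columns yields a timedelta column
duration = pd.to_datetime(sample["return_date"]) - pd.to_datetime(sample["rental_date"])

# .dt.days keeps only the whole days: 3 days 20:46:00 -> 3, 7 days 05:44:00 -> 7
sample["rental_length_days"] = duration.dt.days
print(sample["rental_length_days"].tolist())  # [3, 7]
```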

In [9]:
# Preprocess the data
# Create dummy variables for the "Deleted Scenes" and "Behind the Scenes" levels of "special_features"
rentals_df["special_features"].unique()
rentals_df["deleted_scenes"] = np.where(rentals_df["special_features"].str.contains("Deleted Scenes"), 1, 0)
rentals_df["behind_the_scenes"] = np.where(rentals_df["special_features"].str.contains("Behind the Scenes"), 1, 0)
rentals_df.head()
rentals_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype          
---  ------              --------------  -----          
 0   rental_date         15861 non-null  object         
 1   return_date         15861 non-null  object         
 2   amount              15861 non-null  float64        
 3   release_year        15861 non-null  float64        
 4   rental_rate         15861 non-null  float64        
 5   length              15861 non-null  float64        
 6   replacement_cost    15861 non-null  float64        
 7   special_features    15861 non-null  object         
 8   NC-17               15861 non-null  int64          
 9   PG                  15861 non-null  int64          
 10  PG-13               15861 non-null  int64          
 11  R                   15861 non-null  int64          
 12  amount_2            15861 non-null  float64        
 13  length_2            15861 non-n
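
The dummy-variable step can be illustrated on a few strings. The values in `features` below are hypothetical examples written in the style of the `special_features` column; only the first one appears in the output above. Passing `regex=False` is an assumption added here so that the pattern is treated as a literal substring.

```python
import numpy as np
import pandas as pd

# Hypothetical values in the style of the special_features column
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers}',
])

# str.contains returns a boolean Series; np.where maps True/False to 1/0
deleted_scenes = np.where(features.str.contains("Deleted Scenes", regex=False), 1, 0)
behind_the_scenes = np.where(features.str.contains("Behind the Scenes", regex=False), 1, 0)

print(deleted_scenes.tolist())     # [0, 1, 0]
print(behind_the_scenes.tolist())  # [1, 0, 0]
```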

In [10]:
# Perform the train-test split, dropping the target, the columns it was derived from, and the raw "special_features" text
leak_cols = ["special_features", "rental_length", "rental_length_days", "rental_date", "return_date"]
X = rentals_df.drop(leak_cols, axis=1)
y = rentals_df["rental_length_days"]
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=9)
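
To make the split sizes concrete, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on a placeholder array that has the same number of rows as `rentals_df` (15861). The array `rows` is a stand-in, not the real feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with one entry per row of rentals_df (15861 rows)
rows = np.arange(15861)

train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=9)

# The test set gets ceil(0.2 * 15861) rows; the training set gets the rest
print(len(train_rows), len(test_rows))  # 12688 3173
```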

In [11]:
# Perform Lasso Regression
## Instantiate a lasso regression model
model_lasso = Lasso(alpha=0.3, random_state=9)
model_lasso.fit(X_train, y_train)
## Extract coefficients
lasso_coef = model_lasso.coef_
print(lasso_coef)
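## Keep only the features with a positive coefficient
## (this filter also drops the one negative, nonzero coefficient; the models below are fit on the full X_train)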
X_lasso_train, X_lasso_test = X_train.iloc[:, lasso_coef > 0], X_test.iloc[:, lasso_coef > 0]

[ 5.84104424e-01  0.00000000e+00 -0.00000000e+00  0.00000000e+00
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00  4.36220109e-02  3.01167812e-06 -1.52983561e-01
 -0.00000000e+00  0.00000000e+00]
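
The printed array is easier to read when each coefficient is paired with its column name. The sketch below does this on synthetic data: `X_demo`, `y_demo`, and the column names `a`, `b`, and `noise` are made up for illustration. It also contrasts the `> 0` filter used above with a `!= 0` filter, which additionally keeps features whose coefficient is negative; which of the two is appropriate here is a judgment call, not something the notebook settles.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: y depends positively on "a", negatively on "b", not on "noise"
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.3, random_state=9).fit(X_demo, y_demo)

# Pair each coefficient with its feature name
coef = pd.Series(lasso.coef_, index=X_demo.columns)

positive = coef[coef > 0].index.tolist()   # same rule as the notebook
nonzero = coef[coef != 0].index.tolist()   # also keeps negative coefficients
print(positive)
print(nonzero)
```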


In [12]:
# Comparison: Linear Regression model
lr_model = LinearRegression()
lr_model_fit = lr_model.fit(X_train, y_train)
y_lr_pred = lr_model_fit.predict(X_test)
mse_lr_lasso = mean_squared_error(y_test, y_lr_pred)
print('Test set MSE of lr_model: {:.2f}'.format(mse_lr_lasso)) # 2.94 for lr

Test set MSE of lr_model: 2.94
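
For reference, the MSE is the mean of the squared differences between the actual and predicted values. The sketch below computes it by hand on made-up numbers (`y_true` and `y_pred` are illustrative, not taken from the rental data) and checks the result against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up rental lengths in days
y_true = np.array([3, 7, 2, 4])
y_pred = np.array([4, 5, 2, 5])

# Squared errors are 1, 4, 0, 1, so the mean is 6 / 4 = 1.5
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, sklearn_mse)
```

Taking the square root gives the error in days: an MSE of 2.94 corresponds to a root mean squared error of about 1.71 days.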


In [13]:
# Comparison: DecisionTreeRegressor model
dt_model = DecisionTreeRegressor(max_depth=8,
            min_samples_leaf=0.13,
            random_state=3)
dt_model.fit(X_train, y_train)
y_dt_pred = dt_model.predict(X_test)
mse_dt_lasso = mean_squared_error(y_test, y_dt_pred)
print("Test set MSE of dt_model: {:.2f}".format(mse_dt_lasso)) # 3.32

Test set MSE of dt_model: 3.32


In [14]:
## Comparison: RandomForestRegressor with hyperparameter tuning
### Define the search space for the number of estimators and the maximum tree depth
rf_param_distr = {"n_estimators": np.arange(1, 101, 1),
                 "max_depth": np.arange(1,11,1)}
rf_model = RandomForestRegressor()
### Find the best hyperparameters with a randomized search (5-fold cross-validation)
rand_search = RandomizedSearchCV(rf_model, param_distributions = rf_param_distr, cv = 5, random_state = 9)
rand_search.fit(X_train, y_train)
hyper_params = rand_search.best_params_
rf_tuned_model = RandomForestRegressor(n_estimators = hyper_params["n_estimators"], max_depth = hyper_params["max_depth"], random_state=9)
rf_tuned_model.fit(X_train, y_train)
y_rf_tuned_pred = rf_tuned_model.predict(X_test)
mse_rf_tuned_lasso = mean_squared_error(y_test, y_rf_tuned_pred)
print("Test set MSE of rf_tuned_model: {:.2f}".format(mse_rf_tuned_lasso)) # 2.23

Test set MSE of rf_tuned_model: 2.23
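
A possible variation on the cell above, shown on synthetic data from `make_regression` rather than the rental data: with the default `refit=True`, `RandomizedSearchCV` already refits the best configuration on the full training set and exposes it as `best_estimator_`, so a second manual fit is optional. Giving the base estimator a `random_state` and scoring with `neg_mean_squared_error` are assumptions added here; the notebook uses the default scorer and an unseeded base estimator. The small `n_iter`, `cv`, and parameter ranges are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=9)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=9),  # seeded so the search is repeatable
    param_distributions={
        "n_estimators": np.arange(5, 21),
        "max_depth": np.arange(1, 6),
    },
    n_iter=4,
    cv=3,
    scoring="neg_mean_squared_error",  # scores are negated MSE values
    random_state=9,
)
search.fit(X_demo, y_demo)

# The refitted best model and its cross-validated MSE
best_rf = search.best_estimator_
cv_mse = -search.best_score_
print(search.best_params_)
```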


In [22]:
# Best performing model recommendation 
best_model = "tuned_RandomForest"
best_mse = 2.23
print(best_model)
print(best_mse)

tuned_RandomForest
2.23
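
Instead of typing the winner by hand, the recommendation can be derived from the test-set MSE values printed above. The dictionary `results` and the variable `target_mse` below are illustrative names; the numbers are the ones reported in the earlier cells, and the threshold of 3 comes from the problem statement.

```python
# Test-set MSE values reported by the cells above
results = {
    "LinearRegression": 2.94,
    "DecisionTreeRegressor": 3.32,
    "tuned_RandomForest": 2.23,
}
target_mse = 3  # requirement from the problem statement

# Pick the model with the lowest MSE
best_model = min(results, key=results.get)
best_mse = results[best_model]

# Check each model against the company's target
meets_target = {name: mse <= target_mse for name, mse in results.items()}

print(best_model, best_mse)  # tuned_RandomForest 2.23
print(meets_target)
```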
