![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models for this prediction. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating in the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
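
To make the "reference dummy" idea concrete, here is a minimal sketch of how such rating columns could be produced. The `rating` column and the choice of `"G"` as the dropped reference level are assumptions for illustration; the dataset itself only ships the four dummy columns.

```python
import pandas as pd

# Hypothetical raw ratings; "G" is ASSUMED to be the reference level that was dropped
ratings = pd.DataFrame({"rating": ["G", "PG", "R", "NC-17", "PG-13"]})

# drop_first=True removes the first category in sorted order ("G" here),
# so a "G" movie is encoded as 0 in every remaining dummy column
dummies = pd.get_dummies(ratings["rating"], drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids perfectly collinear dummy columns in linear models, since the dropped rating is implied when all four columns are 0.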

In [272]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

df = pd.read_csv('rental_info.csv')

**Let's see how the first 5 rows of the dataset look**

In [273]:
df.head(5)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [274]:
df.shape

(15861, 15)

In [275]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB


In [276]:
df.isnull().sum()

rental_date         0
return_date         0
amount              0
release_year        0
rental_rate         0
length              0
replacement_cost    0
special_features    0
NC-17               0
PG                  0
PG-13               0
R                   0
amount_2            0
length_2            0
rental_rate_2       0
dtype: int64

In [277]:
df.columns

Index(['rental_date', 'return_date', 'amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'special_features', 'NC-17', 'PG',
       'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2'],
      dtype='object')

**Changing the data types to the correct ones**

The two date columns are read in as plain strings (`object`). Converting them to datetimes lets pandas do date arithmetic on them, which is needed to compute the rental duration, and avoids errors in later analysis and modelling.

In [278]:

df['rental_date'] = pd.to_datetime(df['rental_date'], format='%Y-%m-%d %H:%M:%S%z')
df['return_date'] = pd.to_datetime(df['return_date'], format='%Y-%m-%d %H:%M:%S%z')


In [279]:
df.dtypes

rental_date         datetime64[ns, UTC]
return_date         datetime64[ns, UTC]
amount                          float64
release_year                    float64
rental_rate                     float64
length                          float64
replacement_cost                float64
special_features                 object
NC-17                             int64
PG                                int64
PG-13                             int64
R                                 int64
amount_2                        float64
length_2                        float64
rental_rate_2                   float64
dtype: object

**Determining the number of days it takes customers to return a movie**

In [280]:
df['rental_length_days'] = (df['return_date']-df['rental_date']).dt.days

In [281]:
df.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4
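
As a small check on what `.dt.days` does, the sketch below recomputes the duration for the first row shown above. It illustrates that `.dt.days` keeps only the whole-day component of the timedelta; the `fractional_days` variable is an addition for comparison and is not used elsewhere in this notebook.

```python
import pandas as pd

# Timestamps copied from the first row of the dataset
rental = pd.to_datetime(pd.Series(["2005-05-25 02:54:33+00:00"]))
returned = pd.to_datetime(pd.Series(["2005-05-28 23:40:33+00:00"]))

delta = returned - rental                           # 3 days 20:46:00
whole_days = delta.dt.days                          # whole-day component only
fractional_days = delta.dt.total_seconds() / 86400  # duration including the partial day

print(whole_days.iloc[0], round(fractional_days.iloc[0], 3))
```

The target used here is the whole-day count, so a rental returned after 3 days and 20 hours counts as 3 days.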


In [282]:
df['special_features'].value_counts()

{Trailers,Commentaries,"Behind the Scenes"}                     1308
{Trailers}                                                      1139
{Trailers,Commentaries}                                         1129
{Trailers,"Behind the Scenes"}                                  1122
{"Behind the Scenes"}                                           1108
{Commentaries,"Deleted Scenes","Behind the Scenes"}             1101
{Commentaries}                                                  1089
{Commentaries,"Behind the Scenes"}                              1078
{Trailers,"Deleted Scenes"}                                     1047
{"Deleted Scenes","Behind the Scenes"}                          1035
{"Deleted Scenes"}                                              1023
{Commentaries,"Deleted Scenes"}                                 1011
{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}     983
{Trailers,Commentaries,"Deleted Scenes"}                         916
{Trailers,"Deleted Scenes","Behind

In [283]:
df['deleted_scenes'] = np.where(df['special_features'].str.contains("Deleted Scenes"), 1, 0)
df['behind_the_scenes'] = np.where(df['special_features'].str.contains("Behind the Scenes"), 1, 0)
df.sample(10)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
4778,2005-07-29 01:42:08+00:00,2005-08-04 02:59:08+00:00,4.99,2006.0,2.99,80.0,24.99,"{""Deleted Scenes""}",0,1,0,0,24.9001,6400.0,8.9401,6,1,0
2487,2005-07-30 16:09:56+00:00,2005-08-06 15:50:56+00:00,7.99,2007.0,4.99,145.0,10.99,"{Commentaries,""Deleted Scenes"",""Behind the Sce...",1,0,0,0,63.8401,21025.0,24.9001,6,1,1
14795,2005-06-21 22:07:07+00:00,2005-06-28 02:59:07+00:00,2.99,2009.0,2.99,178.0,10.99,"{Trailers,""Behind the Scenes""}",0,0,1,0,8.9401,31684.0,8.9401,6,0,1
684,2005-07-27 23:05:40+00:00,2005-08-05 18:47:40+00:00,2.99,2006.0,0.99,63.0,22.99,"{Commentaries,""Deleted Scenes"",""Behind the Sce...",0,0,0,0,8.9401,3969.0,0.9801,8,1,1
10855,2005-08-20 19:30:51+00:00,2005-08-26 17:34:51+00:00,5.99,2010.0,2.99,71.0,29.99,"{Trailers,""Deleted Scenes""}",1,0,0,0,35.8801,5041.0,8.9401,5,1,0
1881,2005-07-07 10:39:43+00:00,2005-07-10 10:30:43+00:00,4.99,2006.0,4.99,135.0,15.99,"{Trailers,""Deleted Scenes""}",1,0,0,0,24.9001,18225.0,24.9001,2,1,0
7057,2005-08-23 09:04:33+00:00,2005-08-27 13:58:33+00:00,2.99,2009.0,2.99,172.0,29.99,"{Trailers,Commentaries,""Behind the Scenes""}",0,0,1,0,8.9401,29584.0,8.9401,4,0,1
1350,2005-08-23 10:31:24+00:00,2005-08-28 10:53:24+00:00,1.99,2004.0,0.99,54.0,14.99,"{Trailers,Commentaries}",0,0,1,0,3.9601,2916.0,0.9801,5,0,0
2795,2005-07-12 02:14:57+00:00,2005-07-15 20:41:57+00:00,2.99,2009.0,2.99,71.0,29.99,"{""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,8.9401,5041.0,8.9401,3,1,1
9514,2005-08-20 10:02:02+00:00,2005-08-27 08:51:02+00:00,5.99,2010.0,2.99,92.0,19.99,{Trailers},1,0,0,0,35.8801,8464.0,8.9401,6,0,0
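
The flag columns above rely on substring matching. A minimal sketch on three values taken from `special_features` shows the same idea with `regex=False` (a plain substring test) and `astype(int)` in place of `np.where`; this is an alternative spelling, not the code the notebook ran.

```python
import pandas as pd

# Three values as they appear in the special_features column
features = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    '{Trailers}',
])

# regex=False treats the pattern as a literal substring
deleted_scenes = features.str.contains("Deleted Scenes", regex=False).astype(int)
behind_the_scenes = features.str.contains("Behind the Scenes", regex=False).astype(int)

print(deleted_scenes.tolist(), behind_the_scenes.tolist())
```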


In [284]:
df.columns

Index(['rental_date', 'return_date', 'amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'special_features', 'NC-17', 'PG',
       'PG-13', 'R', 'amount_2', 'length_2', 'rental_rate_2',
       'rental_length_days', 'deleted_scenes', 'behind_the_scenes'],
      dtype='object')

In [285]:
df.shape

(15861, 18)

In [286]:
df.describe()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
count,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0,15861.0
mean,4.217161,2006.885379,2.944101,114.994578,20.224727,0.204842,0.200303,0.223378,0.198726,23.355504,14832.841876,11.389287,4.525944,0.49732,0.536347
std,2.360383,2.025027,1.649766,40.114715,6.083784,0.403599,0.400239,0.416523,0.399054,23.503164,9393.431996,10.005293,2.635108,0.500009,0.498693
min,0.99,2004.0,0.99,46.0,9.99,0.0,0.0,0.0,0.0,0.9801,2116.0,0.9801,0.0,0.0,0.0
25%,2.99,2005.0,0.99,81.0,14.99,0.0,0.0,0.0,0.0,8.9401,6561.0,0.9801,2.0,0.0,0.0
50%,3.99,2007.0,2.99,114.0,20.99,0.0,0.0,0.0,0.0,15.9201,12996.0,8.9401,5.0,0.0,1.0
75%,4.99,2009.0,4.99,148.0,25.99,0.0,0.0,0.0,0.0,24.9001,21904.0,24.9001,7.0,1.0,1.0
max,11.99,2010.0,4.99,185.0,29.99,1.0,1.0,1.0,1.0,143.7601,34225.0,24.9001,9.0,1.0,1.0


**Splitting the data**

In [287]:
X = df[['amount', 'release_year', 'rental_rate',
       'length', 'replacement_cost', 'NC-17', 'PG',
       'PG-13', 'R', 'deleted_scenes', 'behind_the_scenes']]
y = df['rental_length_days']

In [288]:
X_train,X_test,y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=9)

In [289]:
# Import Lasso
from sklearn.linear_model import Lasso
import matplotlib.pyplot as plt

# Instantiate a lasso regression model
lasso = Lasso(alpha = 0.3, random_state=9)

# Fit the model to the data
lasso.fit(X_train, y_train)

# Compute and print the coefficients
lasso_coef = lasso.coef_ 
print(lasso_coef)
print(X_train.columns)

[ 9.62167821e-01  0.00000000e+00 -8.41179857e-01  4.94571646e-04
 -0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -0.00000000e+00 -0.00000000e+00  0.00000000e+00]
Index(['amount', 'release_year', 'rental_rate', 'length', 'replacement_cost',
       'NC-17', 'PG', 'PG-13', 'R', 'deleted_scenes', 'behind_the_scenes'],
      dtype='object')


In [290]:
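# Keep the features with a positive Lasso coefficient. Note that `> 0` also drops
# `rental_rate`, whose coefficient is negative (-0.84); `!= 0` would keep every feature Lasso retained
relevant_features = X_train[X_train.columns[lasso.coef_ > 0]]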
relevant_features.columns

Index(['amount', 'length'], dtype='object')
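
The difference between selecting positive and selecting nonzero coefficients can be checked directly on the coefficients printed above. The sketch below copies those values by hand, so it does not refit the model.

```python
import numpy as np
import pandas as pd

columns = pd.Index(['amount', 'release_year', 'rental_rate', 'length', 'replacement_cost',
                    'NC-17', 'PG', 'PG-13', 'R', 'deleted_scenes', 'behind_the_scenes'])

# Coefficients copied from the Lasso output above
coef = np.array([9.62167821e-01, 0.0, -8.41179857e-01, 4.94571646e-04,
                 -0.0, 0.0, 0.0, 0.0, -0.0, -0.0, 0.0])

positive_only = columns[coef > 0]   # what the cell above selects
nonzero = columns[coef != 0]        # every feature Lasso did not shrink to zero

print(list(positive_only))
print(list(nonzero))
```

With `!= 0`, `rental_rate` is kept alongside `amount` and `length`, since Lasso assigned it a nonzero (negative) weight.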

**RandomForestRegressor**

In [291]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE

# Instantiate a RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 200, min_samples_leaf=0.12, random_state=9)

# Fit rf to the training set
rf.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred_rf = rf.predict(X_test)

# Calculate rmse
rf_rmse = MSE(y_test,y_pred_rf)**(1/2)
print(f'RandomForestRegressor RMSE : {rf_rmse}')

RandomForestRegressor RMSE : 2.0792213932429116


**AdaBoostRegressor**

In [292]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
# Instantiate an AdaBoostRegressor with a decision stump as the base estimator
dr = DecisionTreeRegressor(max_depth=1,min_samples_leaf=0.12, criterion = 'friedman_mse')
ada_lr = AdaBoostRegressor(dr,n_estimators=200, random_state=9)

# Fit ada_lr to the training set
ada_lr.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred_ada = ada_lr.predict(X_test)

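# Calculate rmse from the AdaBoost predictions (`y_pred_ada`; `y_pred` is not defined in this notebook)
# The output recorded below equals the random forest RMSE digit for digit,
# so it was most likely computed from a stale `y_pred` rather than from AdaBoost
ada_rmse = MSE(y_test, y_pred_ada)**(1/2)
print(f'AdaBoostRegressor RMSE : {ada_rmse}')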

AdaBoostClassifier RMSE : 2.0792213932429116


**GradientBoostingRegressor**

In [293]:
from sklearn.ensemble import GradientBoostingRegressor

# Instantiate a GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators = 200, max_depth=1, random_state=9, max_features=0.5,subsample=0.8)

# Fit gbr to the training set
gbr.fit(X_train, y_train)

# Evaluate the test set predictions
y_pred_gb = gbr.predict(X_test) 

# Calculate rmse
gb_rmse = MSE(y_test,y_pred_gb)**(1/2)
print(f'GradientBoostingRegressor RMSE : {gb_rmse}')

GradientBoostingRegressor RMSE : 1.7056417342753065


**GradientBoostingRegressor + GridSearchCV**

In [294]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

params = {
    'n_estimators': np.arange(140,200,20),
    'max_depth' : [1,2],
    'max_features': [0.5,0.6,0.7,0.8],
    'subsample': [0.7,0.8,0.9]
}

grid_gb = GridSearchCV(GradientBoostingRegressor(),params,cv=4,scoring='neg_mean_squared_error')
grid_gb.fit(X_train,y_train)

print(grid_gb.best_params_)
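# Take the square root of the CV MSE; `**1/2` parses as `(x**1)/2` and only halves it,
# so the value recorded below is half the CV MSE (about 2.41), not an RMSE
print((-grid_gb.best_score_)**(1/2))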

{'max_depth': 2, 'max_features': 0.8, 'n_estimators': 180, 'subsample': 0.7}
1.2056569987611818
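
The operator-precedence point behind the score printed above can be verified with plain arithmetic. The sketch below assumes the recorded value `1.2056569987611818` was produced by `(-best_score_)**1/2`, which makes the underlying cross-validated MSE twice that number.

```python
import math

recorded = 1.2056569987611818   # value printed by the grid search cell
cv_mse = 2 * recorded           # ASSUMES the recorded value is MSE / 2

halved = cv_mse ** 1 / 2        # ** binds tighter than /, so this is (cv_mse ** 1) / 2
cv_rmse = cv_mse ** (1 / 2)     # the actual square root

print(halved, cv_rmse)
```

Under that assumption the cross-validated RMSE is roughly 1.55, which is in line with the test-set RMSE of the best model.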


**Best model**

In [295]:
best_model = grid_gb.best_estimator_
y_pred_bm = best_model.predict(X_test)
bm_rmse = MSE(y_test,y_pred_bm)**(1/2)
bm_rmse

1.5786003166485583
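
The company's target is stated as an MSE of 3 or less, while the cells above report RMSE. Squaring the recorded RMSE values converts them back; the sketch below uses the numbers printed in this notebook and leaves out AdaBoost, whose recorded score is not reliable. The `target_mse` name is introduced here for illustration.

```python
target_mse = 3  # threshold from the problem statement

# RMSE values copied from the outputs above
rmse_by_model = {
    "RandomForestRegressor": 2.0792213932429116,
    "GradientBoostingRegressor": 1.7056417342753065,
    "GradientBoostingRegressor + GridSearchCV": 1.5786003166485583,
}

# MSE is the square of RMSE
mse_by_model = {name: rmse ** 2 for name, rmse in rmse_by_model.items()}

for name, mse in mse_by_model.items():
    print(f"{name}: MSE = {mse:.3f}, meets target: {mse <= target_mse}")
```

On these numbers the tuned gradient boosting model has a test MSE of about 2.49 and the untuned one is also below 3, while the random forest is not.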