![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have asked you to try out some regression models for the task. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.
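
To make the target concrete, here is a minimal sketch of how MSE is computed. The `y_true` values are the first five rental lengths that appear later in this notebook; the `y_hat` predictions are made up purely for illustration.

```python
import numpy as np

# First five rental lengths (in days) from the notebook output further down.
y_true = np.array([3, 2, 7, 2, 4])
# Hypothetical predictions, invented for this illustration only.
y_hat = np.array([4.0, 3.0, 5.0, 2.0, 4.0])

# MSE is the mean of the squared prediction errors.
mse = np.mean((y_true - y_hat) ** 2)
# RMSE brings the error back to the unit of the target (days).
rmse = np.sqrt(mse)

print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f} days")
```

Because MSE is in squared days, the threshold of 3 corresponds to a root-mean-squared error of about 1.73 days.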

The data they provided is in the CSV file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating given by the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
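
The engineered columns described above can be sketched on a toy frame. The values below are invented and do not come from `rental_info.csv`, and treating `G` as the dropped reference rating is an assumption; the sketch only shows the mechanics of squared features and dummy variables.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from rental_info.csv).
toy = pd.DataFrame({
    "amount": [2.99, 4.99, 0.99],
    "rating": ["G", "PG-13", "R"],
})

# Squared feature, analogous to "amount_2".
toy["amount_2"] = toy["amount"] ** 2

# One 0/1 column per rating, then drop the reference category.
# Assumption: "G" is the reference rating that was dropped.
dummies = pd.get_dummies(toy["rating"], dtype=int).drop(columns="G")
toy = pd.concat([toy.drop(columns="rating"), dummies], axis=1)

print(toy)
```

A row with all rating dummies equal to 0 then represents the reference rating.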

In [56]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

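# Load the rental data from the CSV file
df = pd.read_csv("rental_info.csv")
print(df)
# Exploratory data analysis: column dtypes and non-null counts
# (df.info() prints its report itself, so it is not wrapped in print())
df.info()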

                     rental_date                return_date  amount  \
0      2005-05-25 02:54:33+00:00  2005-05-28 23:40:33+00:00    2.99   
1      2005-06-15 23:19:16+00:00  2005-06-18 19:24:16+00:00    2.99   
2      2005-07-10 04:27:45+00:00  2005-07-17 10:11:45+00:00    2.99   
3      2005-07-31 12:06:41+00:00  2005-08-02 14:30:41+00:00    2.99   
4      2005-08-19 12:30:04+00:00  2005-08-23 13:35:04+00:00    2.99   
...                          ...                        ...     ...   
15856  2005-08-22 10:49:15+00:00  2005-08-29 09:52:15+00:00    6.99   
15857  2005-07-31 09:48:49+00:00  2005-08-04 10:53:49+00:00    4.99   
15858  2005-08-20 10:35:30+00:00  2005-08-29 13:03:30+00:00    8.99   
15859  2005-07-31 13:10:20+00:00  2005-08-08 14:07:20+00:00    7.99   
15860  2005-08-18 06:33:55+00:00  2005-08-24 07:14:55+00:00    5.99   

       release_year  rental_rate  length  replacement_cost  \
0            2005.0         2.99   126.0             16.99   
1            2005.0    

In [58]:
# Create the target column: rental duration in whole days
df['rental_date'] = pd.to_datetime(df['rental_date'])
df['return_date'] = pd.to_datetime(df['return_date'])
df['rental_length_days'] = (df['return_date'] - df['rental_date']).dt.days
# Inspect the special_features column
print(df['special_features'].dtype)
print(df['special_features'].unique())
# Flag DVDs whose special features include deleted scenes or behind-the-scenes footage
df['deleted_scenes'] = np.where(df['special_features'].str.contains("Deleted Scenes"), 1, 0)
df['behind_the_scenes'] = np.where(df['special_features'].str.contains("Behind the Scenes"), 1, 0)
df

object
['{Trailers,"Behind the Scenes"}' '{Trailers}'
 '{Commentaries,"Behind the Scenes"}' '{Trailers,Commentaries}'
 '{"Deleted Scenes","Behind the Scenes"}'
 '{Commentaries,"Deleted Scenes","Behind the Scenes"}'
 '{Trailers,Commentaries,"Deleted Scenes"}' '{"Behind the Scenes"}'
 '{Trailers,"Deleted Scenes","Behind the Scenes"}'
 '{Commentaries,"Deleted Scenes"}' '{Commentaries}'
 '{Trailers,Commentaries,"Behind the Scenes"}'
 '{Trailers,"Deleted Scenes"}' '{"Deleted Scenes"}'
 '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}']


Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7,0,1
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,4,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15856,2005-08-22 10:49:15+00:00,2005-08-29 09:52:15+00:00,6.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,48.8601,7744.0,24.9001,6,1,1
15857,2005-07-31 09:48:49+00:00,2005-08-04 10:53:49+00:00,4.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,24.9001,7744.0,24.9001,4,1,1
15858,2005-08-20 10:35:30+00:00,2005-08-29 13:03:30+00:00,8.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,80.8201,7744.0,24.9001,9,1,1
15859,2005-07-31 13:10:20+00:00,2005-08-08 14:07:20+00:00,7.99,2009.0,4.99,88.0,11.99,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",0,0,0,1,63.8401,7744.0,24.9001,8,1,1
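
One detail of the cell above is worth spelling out: `.dt.days` keeps only the whole-day part of each duration. The sketch below reuses the dates and `special_features` value of row 1 from the output; the `rental_length_fractional` column is a hypothetical addition for comparison and is not part of the original notebook.

```python
import pandas as pd

# Row 1 of the notebook output, reproduced as a one-row frame.
toy = pd.DataFrame({
    "rental_date": ["2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-06-18 19:24:16+00:00"],
    "special_features": ['{Trailers,"Behind the Scenes"}'],
})
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

delta = toy["return_date"] - toy["rental_date"]
# Whole days only: 2 days 20:05:00 becomes 2.
toy["rental_length_days"] = delta.dt.days
# Hypothetical alternative: keep the fractional part of the day.
toy["rental_length_fractional"] = delta.dt.total_seconds() / 86400

# Literal substring match (regex=False) for the feature flag.
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", regex=False).astype(int)
)

print(toy[["rental_length_days", "rental_length_fractional", "behind_the_scenes"]])
```

This matches the value 2 shown for row 1 above, even though the rental lasted almost three days.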


In [59]:
# Create the train and test sets
# Drop the target, the raw date columns it was derived from, and the raw special_features text
cols_to_drop = ["special_features", "rental_length_days", "rental_date", "return_date"]
X = df.drop(cols_to_drop, axis=1)
y = df["rental_length_days"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=9, stratify=y, test_size=0.3)


In [60]:
# Candidate regression models: LinearRegression, BaggingRegressor, DecisionTreeRegressor,
# RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
# Note: the key 'logreg' refers to LinearRegression, not logistic regression
models = {
    'logreg': LinearRegression(),
    'br': BaggingRegressor(random_state=9),
    'tr': DecisionTreeRegressor(random_state=9),
    'rfr': RandomForestRegressor(random_state=9),
    'abr': AdaBoostRegressor(random_state=9),
    'gbr': GradientBoostingRegressor(random_state=9),
}
# Fit every model on the training set and record its test-set MSE
score = {}
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    score[name] = MSE(y_test, y_pred)
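
The comparison above relies on a single train/test split. A possible alternative, sketched here on synthetic data, is to compare candidates with k-fold cross-validation. The data, the names `X_demo`, `y_demo` and `candidates`, and the choice of two models are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.5 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.3, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=9)
candidates = {
    "linreg": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=9),
}

cv_mse = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MSE so that higher is better; flip the sign.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=kf,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

print(cv_mse)
```

Averaging over folds makes the ranking less sensitive to one particular split, at the cost of fitting each model several times.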

In [61]:
# Select the three best-performing models (lowest test MSE)
print(score)
score_df = pd.DataFrame(score.items(), columns=['model', 'MSE_score'])
# RandomForestRegressor, BaggingRegressor and DecisionTreeRegressor are the best models
score_df.sort_values('MSE_score', inplace=True)
# Keep the top three rows of the sorted table
best_models = score_df.iloc[:3, :]
print(best_models)
score_df

{'logreg': 2.8764651270231685, 'br': 2.171557249512647, 'tr': 2.323657446667534, 'rfr': 2.138729585667425, 'abr': 3.094641758225065, 'gbr': 2.3311736000047283}
  model  MSE_score
3   rfr   2.138730
1    br   2.171557
2    tr   2.323657


| | model | MSE_score |
|---|---|---|
| 3 | rfr | 2.13873 |
| 1 | br | 2.171557 |
| 2 | tr | 2.323657 |
| 5 | gbr | 2.331174 |
| 0 | logreg | 2.876465 |
| 4 | abr | 3.094642 |


In [62]:
# Tune DecisionTreeRegressor with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
# List the hyperparameters DecisionTreeRegressor exposes for tuning
print("Params to be tuned in TreeRegressor \n", models['tr'].get_params())
tr_param_grid = {'max_depth': range(1, 5, 1), 'min_samples_leaf': np.arange(0.1, 0.7, 0.1)}
rcv_search_tr = RandomizedSearchCV(estimator=models['tr'], param_distributions=tr_param_grid, n_iter=20, cv=5, scoring='neg_mean_squared_error', random_state=9, n_jobs=-1)
rcv_search_tr.fit(X_train, y_train)
print(rcv_search_tr.best_score_)
# The randomized search (cross-validated MSE of about 3.20) does worse than the untuned model (test MSE of about 2.32)

Params to be tuned in TreeRegressor 
 {'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'random_state': 9, 'splitter': 'best'}
-3.2040873821061546
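
The printed `best_score_` is negative because the `neg_mean_squared_error` scorer returns the negated MSE, averaged over the cross-validation folds of the training set. It is therefore not directly comparable to the test-set MSE computed earlier. The sketch below, on synthetic data with a hypothetical parameter grid, shows one way to put both numbers on the same footing.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data and an illustrative grid (not the notebook's data or grid).
rng = np.random.default_rng(9)
X_demo = rng.normal(size=(300, 4))
y_demo = 2 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=9)

search = RandomizedSearchCV(
    estimator=DecisionTreeRegressor(random_state=9),
    param_distributions={"max_depth": range(1, 5), "min_samples_leaf": [1, 5, 10, 20]},
    n_iter=8,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=9,
)
search.fit(X_tr, y_tr)

# Flip the sign to read the cross-validated score as an MSE.
cv_mse = -search.best_score_
# Evaluate the refitted best estimator on the held-out test set.
test_mse = mean_squared_error(y_te, search.best_estimator_.predict(X_te))

print(search.best_params_)
print(f"CV MSE = {cv_mse:.3f}, test MSE = {test_mse:.3f}")
```

Comparing `test_mse` of the tuned estimator with the untuned model's test MSE is the like-for-like comparison.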


In [63]:
# Tune BaggingRegressor with RandomizedSearchCV
# List the hyperparameters BaggingRegressor exposes for tuning
print("Params to be tuned in BaggingRegressor \n", models['br'].get_params())
br_param_grid = {'max_features': np.linspace(0.1, 1), 'max_samples': np.linspace(0.1, 1)}
rcv_search_br = RandomizedSearchCV(estimator=models['br'], param_distributions=br_param_grid, n_iter=20, cv=5, scoring='neg_mean_squared_error', random_state=9, n_jobs=-1)
rcv_search_br.fit(X_train, y_train)
print(rcv_search_br.best_score_)
# The randomized search (cross-validated MSE of about 2.20) does slightly worse than the untuned model (test MSE of about 2.17)

Params to be tuned in BaggingRegressor 
 {'bootstrap': True, 'bootstrap_features': False, 'estimator': None, 'max_features': 1.0, 'max_samples': 1.0, 'n_estimators': 10, 'n_jobs': None, 'oob_score': False, 'random_state': 9, 'verbose': 0, 'warm_start': False}
-2.2000903128652762


In [64]:
# Tune RandomForestRegressor with RandomizedSearchCV
# List the hyperparameters RandomForestRegressor exposes for tuning
print("Params to be tuned in RandomForestRegressor \n", models['rfr'].get_params())
rfr_param_grid = {'max_depth': range(1, 6), 'min_samples_leaf': np.linspace(0.1, 0.9, 6), 'min_samples_split': np.linspace(0.1, 1, 6)}
rcv_search_rfr = RandomizedSearchCV(estimator=models['rfr'], param_distributions=rfr_param_grid, n_iter=20, cv=5, scoring='neg_mean_squared_error', random_state=9, n_jobs=-1)
rcv_search_rfr.fit(X_train, y_train)
print(rcv_search_rfr.best_score_)
# The randomized search (cross-validated MSE of about 4.24) does worse than the untuned model (test MSE of about 2.14)

Params to be tuned in RandomForestRegressor 
 {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'monotonic_cst': None, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 9, 'verbose': 0, 'warm_start': False}
-4.237553660670148


In [65]:
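# Save the best model: the untuned RandomForestRegressor
best_model = models['rfr']
print(best_model)
# Look up its test MSE in the top-three table (this returns a one-element Series)
best_mse = best_models.loc[best_models['model'] == 'rfr', 'MSE_score']
print(best_mse)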


RandomForestRegressor(random_state=9)
3    2.13873
Name: MSE_score, dtype: float64
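
As the output shows, `best_mse` is a one-element Series rather than a plain number. A minimal sketch of extracting the scalar and checking it against the company's target of an MSE of 3 or less follows. The frame `best_models_demo` is rebuilt here from the scores printed earlier so that the block runs on its own; the names `best_models_demo`, `best_mse_value` and `meets_target` are illustrative.

```python
import pandas as pd

# Top-three table rebuilt from the scores printed earlier in the notebook.
best_models_demo = pd.DataFrame(
    {"model": ["rfr", "br", "tr"], "MSE_score": [2.138730, 2.171557, 2.323657]},
    index=[3, 1, 2],
)

# .iloc[0] turns the one-element Series into a scalar.
best_mse_value = best_models_demo.loc[
    best_models_demo["model"] == "rfr", "MSE_score"
].iloc[0]

# Check the scalar against the target MSE of 3 or less.
meets_target = bool(best_mse_value <= 3)

print(f"best MSE = {best_mse_value:.4f}, meets target: {meets_target}")
```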
