In this project, you will use regression models to predict the number of days a customer rents DVDs for.

As with most data science projects, you will first need to preprocess the data provided, in this case a CSV file called `rental_info.csv`. Specifically, you need to:

- Read in the CSV file `rental_info.csv` using `pandas`.
- Create a column named `"rental_length_days"` using the columns `"return_date"` and `"rental_date"`, and add it to the `pandas` DataFrame. This column should contain information on how many days a DVD has been rented by a customer.
- Create two columns of dummy variables from `"special_features"`, each taking the value `1` when:
    - The value is `"Deleted Scenes"`, stored in a column called `"deleted_scenes"`.
    - The value is `"Behind the Scenes"`, stored in a column called `"behind_the_scenes"`.
- Make a `pandas` DataFrame called `X` containing all the appropriate features you can use to run the regression models, avoiding columns that leak data about the target.
- Choose `"rental_length_days"` as the target column and save it as a `pandas` Series called `y`.
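
The first three preprocessing steps can be sketched on a tiny made-up frame. The two rows below are illustrative only and are not taken from `rental_info.csv`; the brace-and-comma layout of the `special_features` strings is an assumption about how the file stores them, and `toy` is a hypothetical name.

```python
import pandas as pd

# Toy rows for illustration only; the real file has many more columns.
toy = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33", "2005-06-15 23:19:16"],
    "return_date": ["2005-05-28 23:40:33", "2005-06-18 19:24:16"],
    # Assumed string layout for special_features.
    "special_features": ['{Trailers,"Behind the Scenes"}', '{"Deleted Scenes"}'],
})

# Parse the date strings so they can be subtracted.
toy["rental_date"] = pd.to_datetime(toy["rental_date"])
toy["return_date"] = pd.to_datetime(toy["return_date"])

# .dt.days keeps only the whole-day part of each Timedelta.
toy["rental_length_days"] = (toy["return_date"] - toy["rental_date"]).dt.days

# Literal substring match (regex=False), converted from bool to 0/1.
toy["deleted_scenes"] = (
    toy["special_features"].str.contains("Deleted Scenes", regex=False).astype(int)
)
toy["behind_the_scenes"] = (
    toy["special_features"].str.contains("Behind the Scenes", regex=False).astype(int)
)

print(toy[["rental_length_days", "deleted_scenes", "behind_the_scenes"]])
```

Note that `.dt.days` truncates: a rental of 3 days and 20 hours counts as 3 days.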

Following the preprocessing, you will need to:

- Split the data into train and test sets named `X_train`, `y_train`, `X_test`, and `y_test`, avoiding any features that leak data about the target variable, and include 20% of the total data in the test set.
- **Set `random_state` to `9`** whenever you use a function/method involving randomness, for example, when doing a train-test split.
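
To see what fixing `random_state` buys you, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical names, not part of the project data). Splitting twice with the same seed should give identical partitions, and `test_size=0.2` on ten rows leaves two rows in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=9)

X_train_a, X_test_a, y_train_a, y_test_a = split_a

# True when every returned array matches between the two calls.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(X_train_a.shape, X_test_a.shape, same)
```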

**Recommend a model yielding a mean squared error (MSE) *less than 3* on the *test set*.**

- Save the model you would recommend as a variable named `best_model`, and save its MSE on the test set as `best_mse`.
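
One possible way to arrive at a recommendation is to fit several candidates and keep the one with the lowest test MSE. The sketch below uses synthetic data from `make_regression` rather than the rental data, and the two candidate models, along with the names `candidates`, `demo_best_model`, and `demo_best_mse`, are illustrative assumptions rather than the project's required approach.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem, seeded for reproducibility.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=9)

# Hypothetical candidate models to compare.
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=9),
}

# Fit each candidate and record its MSE on the held-out test set.
test_mse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))

# Keep the candidate with the smallest test MSE.
best_name = min(test_mse, key=test_mse.get)
demo_best_model = candidates[best_name]
demo_best_mse = test_mse[best_name]
print(best_name, demo_best_mse)
```

Choosing between models on the test set makes the reported MSE slightly optimistic; a separate validation split or cross-validation on the training data would be the more careful option if the project allowed it.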

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Reading csv file
rental_df = pd.read_csv('../data/rental_info.csv')
rental_df.head()

# Dataset Info
rental_df.info()

# Create "rental_length_days" column
# But first convert rental_date and return_date columns to datetime object
rental_df['rental_date'] = pd.to_datetime(rental_df['rental_date'])
rental_df['return_date'] = pd.to_datetime(rental_df['return_date'])

rental_df['rental_length_days'] = (rental_df['return_date'] - rental_df['rental_date']).dt.days
print(rental_df['rental_length_days'])

# Creating two columns of dummy variables from "special_features"
rental_df['deleted_scenes'] = np.where(rental_df['special_features'].str.contains('Deleted Scenes'), 1, 0)
rental_df['behind_the_scenes'] = np.where(rental_df['special_features'].str.contains('Behind the Scenes'), 1, 0)
rental_df.head()

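# Preparing data for modelling
# rental_date and return_date together determine the target, so they would leak it;
# special_features is replaced by the two dummy columns created above
columns_to_drop = ['rental_date', 'return_date', 'rental_length_days', 'special_features']

# Split into feature and target sets
X = rental_df.drop(columns=columns_to_drop)
y = rental_df['rental_length_days']

# Train-test split (20% of the rows go to the test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9)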

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Random forest hyperparameter space
param_dist = {'n_estimators': np.arange(1, 101, 1),
              'max_depth': np.arange(1, 11, 1)}

# Create a random forest regressor (seeded, since forest construction is random)
rf = RandomForestRegressor(random_state=9)

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions=param_dist, 
                                 cv=5, 
                                 random_state=9)

# Fit the random search object to the data
rand_search.fit(X_train, y_train)

# Create a variable for the best hyper param
hyper_params = rand_search.best_params_

# Run the random forest on the chosen hyper parameters
rf = RandomForestRegressor(n_estimators=hyper_params["n_estimators"], 
                           max_depth=hyper_params["max_depth"], 
                           random_state=9)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
mse_random_forest = mean_squared_error(y_test, rf_pred)

# The tuned random forest is the only model evaluated here, so recommend it:
best_model = rf
best_mse = mse_random_forest

print(best_model)
print(best_mse)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB
0        3
1        2
2       