![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on a set of features, and have asked you to try out some regression models for this. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features the DVD also has, for example trailers or deleted scenes.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie has the rating named by the column and 0 otherwise. For your convenience, the reference dummy has already been dropped.
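
As a rough illustration of how dummy columns with a dropped reference category arise, the sketch below encodes a hypothetical `rating` column with `pd.get_dummies`. The `rating` values and the choice of `"G"` as the reference are assumptions; the provided file contains only the already-encoded columns and does not say which category was dropped.

```python
import pandas as pd

# Hypothetical raw ratings (assumed); rental_info.csv ships only the dummies.
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "G"], name="rating")

# drop_first=True removes the first category in sorted order ("G" here),
# which then acts as the reference level encoded by all-zero rows.
dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(dummies)
```

A row of all zeros therefore stands for the reference rating rather than for a missing value.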

In [244]:
# Import required packages
import numpy as np
import pandas as pd
import datetime
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Read the "rental_info.csv" CSV file as a pandas DataFrame and assign the result to the rental_info_df variable
rental_info_df = pd.read_csv("rental_info.csv")

# Print the head of the rental_info_df DataFrame
print(rental_info_df.head())

                 rental_date  ... rental_rate_2
0  2005-05-25 02:54:33+00:00  ...        8.9401
1  2005-06-15 23:19:16+00:00  ...        8.9401
2  2005-07-10 04:27:45+00:00  ...        8.9401
3  2005-07-31 12:06:41+00:00  ...        8.9401
4  2005-08-19 12:30:04+00:00  ...        8.9401

[5 rows x 15 columns]


In [245]:
# Summarize the information of the rental_info_df DataFrame
print(rental_info_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15861 entries, 0 to 15860
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   rental_date       15861 non-null  object 
 1   return_date       15861 non-null  object 
 2   amount            15861 non-null  float64
 3   release_year      15861 non-null  float64
 4   rental_rate       15861 non-null  float64
 5   length            15861 non-null  float64
 6   replacement_cost  15861 non-null  float64
 7   special_features  15861 non-null  object 
 8   NC-17             15861 non-null  int64  
 9   PG                15861 non-null  int64  
 10  PG-13             15861 non-null  int64  
 11  R                 15861 non-null  int64  
 12  amount_2          15861 non-null  float64
 13  length_2          15861 non-null  float64
 14  rental_rate_2     15861 non-null  float64
dtypes: float64(8), int64(4), object(3)
memory usage: 1.8+ MB
None


In [246]:
# Print the data types of the rental_info_df DataFrame
print(rental_info_df.dtypes)

rental_date          object
return_date          object
amount              float64
release_year        float64
rental_rate         float64
length              float64
replacement_cost    float64
special_features     object
NC-17                 int64
PG                    int64
PG-13                 int64
R                     int64
amount_2            float64
length_2            float64
rental_rate_2       float64
dtype: object


In [247]:
# Create a "rental_length_days" column in the rental_info_df DataFrame representing the number of days elapsed between the rental date and the return date
rental_info_df["rental_length_days"] = (pd.to_datetime(rental_info_df["return_date"]) - pd.to_datetime(rental_info_df["rental_date"])) // datetime.timedelta(days = 1)

# Convert the "rental_date" and the "return_date" columns of the rental_info_df DataFrame to POSIX timestamps (seconds since the epoch, in UTC) stored as floats
rental_info_df["rental_date"] = rental_info_df["rental_date"].apply(lambda x: pd.to_datetime(x).tz_convert('UTC').timestamp())
rental_info_df["return_date"] = rental_info_df["return_date"].apply(lambda x: pd.to_datetime(x).tz_convert('UTC').timestamp())

# Print the head of the "rental_date", "return_date" and "rental_length_days" columns of the rental_info_df DataFrame
print(rental_info_df[["rental_date", "return_date", "rental_length_days"]].head())

    rental_date   return_date  rental_length_days
0  1.116990e+09  1.117324e+09                   3
1  1.118878e+09  1.119123e+09                   2
2  1.120970e+09  1.121595e+09                   7
3  1.122812e+09  1.122993e+09                   2
4  1.124455e+09  1.124804e+09                   4
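
An alternative way to obtain whole rental days is the `.dt.days` accessor on the timedelta Series. The sketch below uses made-up timestamps in the same format as the CSV; it is an illustration, not the notebook's own method.

```python
import pandas as pd

# Made-up timestamps in the same ISO format as the CSV (values are assumed).
df = pd.DataFrame({
    "rental_date": ["2005-05-25 02:54:33+00:00", "2005-06-15 23:19:16+00:00"],
    "return_date": ["2005-05-28 23:40:33+00:00", "2005-06-18 19:24:16+00:00"],
})

rental = pd.to_datetime(df["rental_date"], utc=True)
returned = pd.to_datetime(df["return_date"], utc=True)

# .dt.days keeps only the whole-day component of each Timedelta,
# so 3 days 20 hours counts as 3.
df["rental_length_days"] = (returned - rental).dt.days

print(df["rental_length_days"].tolist())
```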


In [248]:
# Manually create two columns of dummy variables named "deleted_scenes" and "behind_the_scenes" from the "special_features" column of the rental_info_df DataFrame, each taking values 1 or 0 according to whether or not they contain "Deleted Scenes" or "Behind the Scenes", respectively
rental_info_df["deleted_scenes"] = np.where(rental_info_df["special_features"].str.contains("Deleted Scenes"), 1, 0)
rental_info_df["behind_the_scenes"] = np.where(rental_info_df["special_features"].str.contains("Behind the Scenes"), 1, 0)

# Print the head of the "deleted_scenes" and "behind_the_scenes" columns of the rental_info_df DataFrame
print(rental_info_df[["deleted_scenes", "behind_the_scenes"]].head())

   deleted_scenes  behind_the_scenes
0               0                  1
1               0                  1
2               0                  1
3               0                  1
4               0                  1
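
The same indicator columns can also be built by casting the boolean mask directly. The sketch below uses made-up `special_features` strings; the brace-delimited format is an assumption about how the column is stored.

```python
import pandas as pd

# Made-up values; the exact string format of the real column is assumed.
special = pd.Series([
    '{Trailers,"Behind the Scenes"}',
    '{"Deleted Scenes"}',
    "{Trailers}",
])

# regex=False treats the pattern as a literal substring.
deleted_scenes = special.str.contains("Deleted Scenes", regex=False).astype(int)
behind_the_scenes = special.str.contains("Behind the Scenes", regex=False).astype(int)

print(deleted_scenes.tolist(), behind_the_scenes.tolist())
```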


In [249]:
# Create an array containing all appropriate features on which regression models should be run, and assign the result to the features variable
features = rental_info_df.columns.drop(["amount_2", "length_2", "rental_length_days", "special_features"])

# Assign the name of the target feature for which regression models should make predictions to the target variable
target = "rental_length_days"

# Create a pandas DataFrame containing all columns of the rental_info_df DataFrame corresponding to the appropriate features and assign the result to the X variable
X = rental_info_df[features]

# Create a pandas Series containing the column of the rental_info_df DataFrame corresponding to the target feature and assign the result to the y variable
y = rental_info_df[target]

# Execute a train-test split, setting test_size = 0.2 and random_state = 9 and assign the result to the (X_train, X_test, y_train, y_test) tuple of variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 9)
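
One point worth checking about the feature set above: it still contains `rental_date` and `return_date`, and the difference between those two columns determines `rental_length_days`. The sketch below, on made-up rows, shows one way such leakage could be detected and how a leakage-free feature list might be built. The name `leak_free_features` is hypothetical and does not appear in the notebook.

```python
import pandas as pd

# Made-up rows using the notebook's column names (timestamps in seconds).
df = pd.DataFrame({
    "rental_date": [1.0e9, 1.0e9 + 500.0, 1.0e9 + 900.0],
    "return_date": [
        1.0e9 + 3 * 86400 + 100,
        1.0e9 + 500.0 + 2 * 86400 + 5000,
        1.0e9 + 900.0 + 7 * 86400,
    ],
    "amount": [2.99, 4.99, 0.99],
    "rental_length_days": [3, 2, 7],
})

# Whole days implied by the two timestamp columns alone.
implied_days = ((df["return_date"] - df["rental_date"]) // 86400).astype(int)

# If this is True, the date columns reproduce the target exactly.
leaks_target = bool((implied_days == df["rental_length_days"]).all())

# Hypothetical leakage-free feature list: drop both dates and the target.
leak_free_features = df.columns.drop(
    ["rental_date", "return_date", "rental_length_days"]
)

print(leaks_target, list(leak_free_features))
```

A model trained with both date columns can recover the target almost directly, which is unlikely to be possible when predicting a rental whose return date is not yet known.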

In [250]:
# Instantiate five regression models of interest: LinearRegression, Ridge, Lasso, DecisionTreeRegressor and RandomForestRegressor, assigning each to the lr, rr, la, dt and rf variables, respectively, setting random_state = 9 whenever necessary
lr = LinearRegression()
rr = Ridge()
la = Lasso()
dt = DecisionTreeRegressor(random_state = 9)
rf = RandomForestRegressor(random_state = 9)

# Define hyperparameter grids for the four models rr, la, dt and rf, and assign them as dictionaries to the params_rr, params_la, params_dt and params_rf variables, respectively
params_rr = {
    "alpha" : np.arange(0, 1.1, 0.25),
}
params_la = {
    "alpha" : np.arange(0, 1.1, 0.25),
}
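params_dt = {
    "splitter":["best"],
    "max_depth" : [1,3,5],
    "min_samples_leaf" : [2,4,6],
    # min_weight_fraction_leaf must lie in [0, 0.5]; larger values raise an error
    "min_weight_fraction_leaf" : [0.1,0.3,0.5],
    # "auto" (all features, for regressors) was removed in scikit-learn 1.3; use 1.0 on newer versions
    "max_features" : ["auto","log2","sqrt"],
    "max_leaf_nodes" : [10,20,30,40]
}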
params_rf = {
    "n_estimators" : [100, 350, 500],
    # "auto" requires scikit-learn < 1.3; 1.0 is the equivalent setting for regressors
    "max_features" : ["log2", "auto", "sqrt"],
    "min_samples_leaf" : [2, 10, 30]
}

# Create an array containing the five instances of regression models and store the result in the models variable
models = [lr, rr, la, dt, rf]

# Create an array of hyperparameter grids in the same order as the models array, using 0 as a placeholder for lr, and assign the result to the hyperparameter_grids variable
hyperparameter_grids = [0, params_rr, params_la, params_dt, params_rf]

# Initialize two empty lists, best_models and scores, to hold each model's best estimator and its mean squared error
best_models = []
scores = []

# Assign 5 to the n variable
n = 5

In [251]:
# Iterate over the five models in order to detect the best model and its best score
for i in range(n):
    m = models[i]
    if i == 0:
        # LinearRegression has no grid to search: fit once and score on the test set
        m.fit(X_train, y_train)
        y_pred = m.predict(X_test)
        best_models.append(m)
        scores.append(mean_squared_error(y_test, y_pred))
    else:
        # n_iter = 1 samples a single candidate from each grid
        rcv = RandomizedSearchCV(estimator=m, param_distributions=hyperparameter_grids[i], random_state = 9, cv = 5, n_iter = 1, scoring = "neg_mean_squared_error")
        rcv.fit(X_train, y_train)
        best_models.append(rcv.best_estimator_)
        # best_score_ is the negated mean cross-validated MSE on the training folds, not a test-set MSE
        scores.append(-rcv.best_score_)

# Print the best_models and scores arrays, in order to detect the best model with the lowest mean squared error
print(f"Best models: {best_models} \n", f"Best scores: {scores}")

Best models: [LinearRegression(), Ridge(alpha=0.75), Lasso(alpha=0.75), DecisionTreeRegressor(max_depth=3, max_features='auto', max_leaf_nodes=30,
                      min_samples_leaf=2, min_weight_fraction_leaf=0.1,
                      random_state=9), RandomForestRegressor(max_features='log2', min_samples_leaf=10,
                      n_estimators=500, random_state=9)] 
 Best scores: [0.14533043915252705, 0.14539797860278186, 0.1452175948349666, 3.489104239301919, 1.7182256309416961]
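
The scores above mix two quantities: a test-set MSE for `LinearRegression` and a cross-validated MSE for the other four models. A minimal sketch of one way to compare every model on the same held-out test set is shown below. It runs on synthetic data from `make_regression`, and the names `candidates` and `test_mse`, the alpha range, and `n_iter=5` are illustrative assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the rental data (assumed, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9
)

# Each entry pairs an estimator with its search grid (None means no search).
candidates = {
    "lr": (LinearRegression(), None),
    "la": (Lasso(), {"alpha": np.linspace(0.05, 1.0, 20)}),
}

test_mse = {}
for name, (model, grid) in candidates.items():
    if grid is None:
        fitted = model.fit(X_train, y_train)
    else:
        # Sample several candidates so the search can actually compare them.
        search = RandomizedSearchCV(
            model, grid, n_iter=5, cv=5,
            scoring="neg_mean_squared_error", random_state=9,
        )
        # best_estimator_ is refit on the full training set by default.
        fitted = search.fit(X_train, y_train).best_estimator_
    # Score every model on the same held-out test set.
    test_mse[name] = mean_squared_error(y_test, fitted.predict(X_test))

print(test_mse)
```

Scoring all candidates on the same test set makes the values directly comparable with the company's test-set MSE target.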


In [252]:
# Determine the model with the lowest mean squared error and assign the model to the best_model variable, as well as its corresponding mean_square_error to the best_mse variable
best_mse = np.min(scores)
best_model = best_models[scores.index(best_mse)]
print(f"Best model: {best_model}", f"Best MSE score: {best_mse}")

Best model: Lasso(alpha=0.75) Best MSE score: 0.1452175948349666
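
To relate a score to the company's requirement, the MSE can be computed by hand and compared with the threshold of 3. The sketch below uses made-up predictions; `mse_target` and `meets_target` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted rental lengths, in days.
y_true = np.array([3, 2, 7, 2, 4])
y_pred = np.array([3, 3, 6, 2, 5])

# MSE is the mean of the squared differences between truth and prediction.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# The brief asks for an MSE of 3 or less on a test set.
mse_target = 3
meets_target = bool(mse_sklearn <= mse_target)

print(mse_manual, mse_sklearn, meets_target)
```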
