![dvd_image](dvd_image.jpg)

A DVD rental company needs your help! They want to predict how many days a customer will rent a DVD for, based on some features, and have approached you for help. They want you to try out some regression models that predict the rental duration in days. The company wants a model that yields an MSE of 3 or less on a test set. The model you build will help the company plan its inventory more efficiently.

The data they provided is in the csv file `rental_info.csv`. It has the following features:
- `"rental_date"`: The date (and time) the customer rents the DVD.
- `"return_date"`: The date (and time) the customer returns the DVD.
- `"amount"`: The amount paid by the customer for renting the DVD.
- `"amount_2"`: The square of `"amount"`.
- `"rental_rate"`: The rate at which the DVD is rented for.
- `"rental_rate_2"`: The square of `"rental_rate"`.
- `"release_year"`: The year the movie being rented was released.
- `"length"`: Length of the movie being rented, in minutes.
- `"length_2"`: The square of `"length"`.
- `"replacement_cost"`: The amount it will cost the company to replace the DVD.
- `"special_features"`: Any special features, for example trailers/deleted scenes that the DVD also has.
- `"NC-17"`, `"PG"`, `"PG-13"`, `"R"`: These columns are dummy variables for the rating of the movie. Each takes the value 1 if the movie is rated as the column name and 0 otherwise. For your convenience, the reference dummy has already been dropped.
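
To illustrate what "the reference dummy has already been dropped" means, here is a minimal sketch using `pd.get_dummies`. The `ratings` series is hypothetical, and treating `"G"` as the dropped reference level is an assumption; the data description does not name it.

```python
import pandas as pd

# Hypothetical rating values; the original (pre-encoding) column is not in the file.
ratings = pd.Series(["G", "PG", "R", "NC-17", "PG-13", "R"], name="rating")

# drop_first=True drops the first category in sorted order ("G" here),
# which then acts as the reference level: a row of all zeros means "G".
dummies = pd.get_dummies(ratings, drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

Under that assumption, the remaining columns match the four rating columns in the file, and a row with zeros in all four would correspond to the reference rating.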

In [197]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import any additional modules and start coding below

In [198]:
movies = pd.read_csv("rental_info.csv")
movies.head()

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
3,2005-07-31 12:06:41+00:00,2005-08-02 14:30:41+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401
4,2005-08-19 12:30:04+00:00,2005-08-23 13:35:04+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401


In [199]:
# Shape of our data
print(f"{movies.shape = }")

movies.shape = (15861, 15)


In [200]:
movies.dtypes

rental_date          object
return_date          object
amount              float64
release_year        float64
rental_rate         float64
length              float64
replacement_cost    float64
special_features     object
NC-17                 int64
PG                    int64
PG-13                 int64
R                     int64
amount_2            float64
length_2            float64
rental_rate_2       float64
dtype: object

## Create a column named `"rental_length_days"` using the columns `"return_date"` and `"rental_date"`

In [201]:
# Convert rental_date and return_date into datetime objects
movies["rental_date"] = pd.to_datetime(movies["rental_date"])
movies["return_date"] = pd.to_datetime(movies["return_date"])

In [202]:
row1: pd.core.series.Series = movies.iloc[0]
delta = row1["return_date"] - row1["rental_date"]
print(f"{delta = }")
print(f"{delta.days = }")

delta = Timedelta('3 days 20:46:00')
delta.days = 3
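
Note that `.days` keeps only the whole-day component of a `Timedelta` and discards the remaining hours. The sketch below reuses the two timestamps from the first row to contrast it with a fractional duration; the variable names are illustrative only.

```python
import pandas as pd

# Timestamps copied from the first row of the data shown above.
rental = pd.Timestamp("2005-05-25 02:54:33+00:00")
returned = pd.Timestamp("2005-05-28 23:40:33+00:00")
delta = returned - rental

# .days is the whole-day component only; the extra 20:46:00 is dropped.
whole_days = delta.days

# total_seconds() keeps the full duration, so dividing by the number of
# seconds in a day gives a fractional day count.
fractional_days = delta.total_seconds() / 86_400

print(f"{whole_days = }")
print(f"{fractional_days = :.3f}")
```

The notebook uses the whole-day count as the target, so a rental of almost four days is recorded as 3.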


In [203]:
movies["rental_length_days"] = (movies["return_date"] - movies["rental_date"]).dt.days
movies[["rental_date", "return_date", "rental_length_days"]].head(3)

Unnamed: 0,rental_date,return_date,rental_length_days
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,3
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,7


## Create two columns of dummy variables from the `special_features` column
Each column takes the value 1 when:
- The value contains "Deleted Scenes", stored as a column called "deleted_scenes"
- The value contains "Behind the Scenes", stored as a column called "behind_the_scenes"

In [204]:
movies["special_features"].unique()

array(['{Trailers,"Behind the Scenes"}', '{Trailers}',
       '{Commentaries,"Behind the Scenes"}', '{Trailers,Commentaries}',
       '{"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes","Behind the Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes"}',
       '{"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes","Behind the Scenes"}',
       '{Commentaries,"Deleted Scenes"}', '{Commentaries}',
       '{Trailers,Commentaries,"Behind the Scenes"}',
       '{Trailers,"Deleted Scenes"}', '{"Deleted Scenes"}',
       '{Trailers,Commentaries,"Deleted Scenes","Behind the Scenes"}'],
      dtype=object)

In [205]:
print(f"{movies.iloc[0]['special_features'] = }")
print(f"{type(movies.iloc[0]['special_features']) = }")

movies.iloc[0]['special_features'] = '{Trailers,"Behind the Scenes"}'
type(movies.iloc[0]['special_features']) = <class 'str'>


In [206]:
"Behind the Scenes" in movies.iloc[0]["special_features"]

True

In [207]:
movies["special_features"]

0                         {Trailers,"Behind the Scenes"}
1                         {Trailers,"Behind the Scenes"}
2                         {Trailers,"Behind the Scenes"}
3                         {Trailers,"Behind the Scenes"}
4                         {Trailers,"Behind the Scenes"}
                              ...                       
15856    {Trailers,"Deleted Scenes","Behind the Scenes"}
15857    {Trailers,"Deleted Scenes","Behind the Scenes"}
15858    {Trailers,"Deleted Scenes","Behind the Scenes"}
15859    {Trailers,"Deleted Scenes","Behind the Scenes"}
15860    {Trailers,"Deleted Scenes","Behind the Scenes"}
Name: special_features, Length: 15861, dtype: object

In [208]:
movies["deleted_scenes"] = movies["special_features"].apply(lambda x: 1 if "Deleted Scenes" in x else 0)
movies["behind_the_scenes"] = movies["special_features"].apply(lambda x: 1 if "Behind the Scenes" in x else 0)
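
As an alternative to `apply` with a lambda, the same substring test can be sketched with the vectorized `Series.str.contains`. The small `features` series below is a hypothetical stand-in for `movies["special_features"]`, with values copied from the unique list above.

```python
import pandas as pd

# Hypothetical stand-in for movies["special_features"].
features = pd.Series(
    [
        '{Trailers,"Behind the Scenes"}',
        '{Trailers,Commentaries,"Deleted Scenes"}',
        "{Commentaries}",
    ],
    name="special_features",
)

# regex=False makes this a plain substring match, like the `in` operator.
deleted = features.str.contains("Deleted Scenes", regex=False).astype(int)
behind = features.str.contains("Behind the Scenes", regex=False).astype(int)

print(deleted.tolist())
print(behind.tolist())
```

Both approaches should give the same 0/1 columns; the vectorized form is mainly a readability choice.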

In [209]:
print(f"{movies['deleted_scenes'].unique() = }")
print(f"{movies['behind_the_scenes'].unique() = }")

movies['deleted_scenes'].unique() = array([0, 1])
movies['behind_the_scenes'].unique() = array([1, 0])


In [210]:
movies[["special_features", "deleted_scenes", "behind_the_scenes"]].sample(3)

Unnamed: 0,special_features,deleted_scenes,behind_the_scenes
9967,"{Commentaries,""Behind the Scenes""}",0,1
6707,"{Trailers,""Deleted Scenes"",""Behind the Scenes""}",1,1
8612,"{Trailers,Commentaries,""Deleted Scenes""}",1,0


## Make a `pandas` DataFrame called `X` containing all the appropriate features

In [211]:
movies.head(3)

Unnamed: 0,rental_date,return_date,amount,release_year,rental_rate,length,replacement_cost,special_features,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,rental_length_days,deleted_scenes,behind_the_scenes
0,2005-05-25 02:54:33+00:00,2005-05-28 23:40:33+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,3,0,1
1,2005-06-15 23:19:16+00:00,2005-06-18 19:24:16+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,2,0,1
2,2005-07-10 04:27:45+00:00,2005-07-17 10:11:45+00:00,2.99,2005.0,2.99,126.0,16.99,"{Trailers,""Behind the Scenes""}",0,0,0,1,8.9401,15876.0,8.9401,7,0,1


In [212]:
X: pd.DataFrame = movies.drop(columns=["rental_date", "return_date", "special_features", "rental_length_days"])

In [213]:
X.head()

Unnamed: 0,amount,release_year,rental_rate,length,replacement_cost,NC-17,PG,PG-13,R,amount_2,length_2,rental_rate_2,deleted_scenes,behind_the_scenes
0,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
1,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
2,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
3,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1
4,2.99,2005.0,2.99,126.0,16.99,0,0,0,1,8.9401,15876.0,8.9401,0,1


In [214]:
y: pd.core.series.Series = movies["rental_length_days"]
y.head()

0    3
1    2
2    7
3    2
4    4
Name: rental_length_days, dtype: int64

## Split data

In [216]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20, random_state=9)
print(f"{X_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_train.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (12688, 14)
X_test.shape = (3173, 14)
y_train.shape = (12688,)
y_test.shape = (3173,)
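
Passing `stratify=y` is more common for classification, but it also works here because the target takes a small number of integer values. A minimal sketch on synthetic data (the arrays and the values 2 and 7 are made up for illustration) shows that the share of each target value is preserved in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, with target value 2 for 75 rows and 7 for 25 rows.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([2] * 75 + [7] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.20, random_state=9
)

# With stratification, the share of each target value is kept in both splits.
share_train = (y_tr == 7).mean()
share_test = (y_te == 7).mean()

print(f"{len(y_tr) = }, {len(y_te) = }")
print(f"{share_train = }, {share_test = }")
```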


## Train a model

In [217]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

import multiprocessing

# Create the pipeline
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor())
])

# Define the hyperparameter search space
param_distributions = {
    'regressor__n_estimators': [10, 50, 100, 200],
    'regressor__max_depth': [None, 5, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4, 8],
    'regressor__bootstrap': [True, False]
}

n_modelos = 10  # number of hyperparameter combinations to sample

# Run RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=pipeline, 
    param_distributions=param_distributions, 
    n_iter=n_modelos, 
    cv=5, 
    scoring='neg_mean_squared_error', 
    random_state=42,
    n_jobs=multiprocessing.cpu_count(),
)

random_search
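
The `regressor__` prefix in the search space follows scikit-learn's `<step name>__<parameter>` convention for addressing parameters of a pipeline step. A small sketch, using a separate `demo_pipeline` so the notebook's own objects are untouched:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same step names as the notebook's pipeline, built separately for illustration.
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor()),
])

# "regressor__n_estimators" means: parameter n_estimators of the step named "regressor".
demo_pipeline.set_params(regressor__n_estimators=50, regressor__max_depth=10)

reg = demo_pipeline.named_steps["regressor"]
print(reg.n_estimators, reg.max_depth)
```

The search applies each sampled combination through the same naming scheme, which is why the keys of `param_distributions` must match the step name exactly.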

In [218]:
# Train the model
import time

t1 = time.perf_counter()
random_search.fit(X_train, y_train)
t2 = time.perf_counter()

In [219]:
print(f"Time to find the best model with a randomized search of {n_modelos} iterations: {(t2-t1)/60:.2f} minutes")

Time to find the best model with a randomized search of 10 iterations: 0.26 minutes


In [220]:
# Show the best hyperparameters
print("Best hyperparameters found:")
print(random_search.best_params_)

# Evaluate the model on the test set (score is the negative MSE, per the `scoring` argument)
print("Model score on the test set:", random_search.score(X_test, y_test))

Best hyperparameters found:
{'regressor__n_estimators': 10, 'regressor__min_samples_split': 5, 'regressor__min_samples_leaf': 4, 'regressor__max_depth': 20, 'regressor__bootstrap': True}
Model score on the test set: -2.155959757084717
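
The score is negative because the search was configured with `scoring='neg_mean_squared_error'`: scikit-learn scorers follow a "higher is better" convention, so the MSE is negated. A minimal sketch on synthetic data, with a `LinearRegression` chosen only to keep the example fast:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# The named scorer returns the MSE with its sign flipped.
scorer = get_scorer("neg_mean_squared_error")
score = scorer(model, X_demo, y_demo)
mse_demo = mean_squared_error(y_demo, model.predict(X_demo))

print(f"{score = :.4f}")
print(f"{mse_demo = :.4f}")
```

The two numbers differ only in sign, which is consistent with the test score above and the MSE computed in the evaluation step.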


## Evaluate model

In [221]:
from sklearn.metrics import mean_squared_error as MSE

preds = random_search.predict(X_test)
mse = MSE(y_test, preds)

print(f"{mse = }")

mse = 2.155959757084717
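
Because MSE is in squared days, taking its square root gives an error on the same scale as the target. The short sketch below copies the reported value and the threshold from the brief; the variable names are illustrative.

```python
import math

mse_reported = 2.155959757084717  # test MSE printed above
target_mse = 3.0                  # threshold stated in the brief

# RMSE is in days, the same unit as rental_length_days.
rmse = math.sqrt(mse_reported)
meets_target = mse_reported <= target_mse

print(f"{rmse = :.3f} days")
print(f"{meets_target = }")
```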


## Save model

In [222]:
# Save model
best_model = random_search
best_mse = mse
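
The cell above only keeps the fitted search object in memory. If the model also needs to be written to disk, one common option is `joblib`; the sketch below is an assumption about how that could look, using a small stand-in model and a hypothetical file name rather than the notebook's `best_model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=40)
demo_model = RandomForestRegressor(n_estimators=5, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.joblib")  # hypothetical file name
    joblib.dump(demo_model, path)   # serialize the fitted model
    restored = joblib.load(path)    # load it back

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(demo_model.predict(X_demo), restored.predict(X_demo))
print(f"{same_predictions = }")
```

The same two calls should work for a fitted `RandomizedSearchCV` object, provided the loading environment has a compatible scikit-learn version.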