### Lab | Making predictions with logistic regression

In this lab, you will be using the Sakila database of movie rentals.

To optimize our inventory, we would like to know which films will be rented. We are asked to build a model that predicts this, using the information we have from May 2005.

Instructions
1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features (X).
2. Create a query to get the list of all unique film titles and a boolean indicating whether each film was rented (rental_date) in May 2005. (Create a new column called 'rented_in_may'.) This will be our TARGET (y) variable.
3. Read the data into a Pandas dataframe. At this point you should have 1000 rows. Number of columns depends on the number of features you chose.
4. Analyze extracted features (X) and transform them. You may need to encode some categorical variables, or scale numerical variables.
5. Create a logistic regression model to predict 'rented_in_may' from the cleaned data.
6. Evaluate the results.

In [1]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [2]:
import pymysql
from sqlalchemy import create_engine

import getpass  # To get the password without showing the input
password = getpass.getpass()

········


In [3]:
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)

### 1. Creating a query for features

In [4]:
query = "SELECT f.film_id, f.title, f.rental_duration, f.rental_rate, f.length, f.rating, f.special_features, \
COUNT(i.inventory_id) AS 'nr_of_inventories', c.name AS 'category_name' \
FROM film f \
LEFT JOIN inventory i \
ON f.film_id = i.film_id \
LEFT JOIN film_category fc \
ON f.film_id = fc.film_id \
LEFT JOIN category c \
ON fc.category_id = c.category_id \
GROUP BY f.film_id, c.name"


In [5]:
df = pd.read_sql_query(query, engine)
df.shape

(1000, 9)

In [6]:
df.head()

Unnamed: 0,film_id,title,rental_duration,rental_rate,length,rating,special_features,nr_of_inventories,category_name
0,1,ACADEMY DINOSAUR,6,0.99,86,PG,"Deleted Scenes,Behind the Scenes",8,Documentary
1,2,ACE GOLDFINGER,3,4.99,48,G,"Trailers,Deleted Scenes",3,Horror
2,3,ADAPTATION HOLES,7,2.99,50,NC-17,"Trailers,Deleted Scenes",4,Documentary
3,4,AFFAIR PREJUDICE,5,2.99,117,G,"Commentaries,Behind the Scenes",7,Horror
4,5,AFRICAN EGG,6,2.99,130,G,Deleted Scenes,3,Family


### 2. Query to get all unique titles

In [7]:
query_title = '''SELECT DISTINCT title
from sakila.film
;'''

df_t = pd.read_sql_query(query_title, engine)
df_t.shape

(1000, 1)

In [8]:
df_t.head()

Unnamed: 0,title
0,ACADEMY DINOSAUR
1,ACE GOLDFINGER
2,ADAPTATION HOLES
3,AFFAIR PREJUDICE
4,AFRICAN EGG


In [9]:
query_rentals_may = '''SELECT DISTINCT f.title, COUNT(r.rental_date) AS "nr_rentals_may"
from sakila.film f
LEFT JOIN inventory i
ON f.film_id = i.film_id
left JOIN rental r ON i.inventory_id = r.inventory_id
WHERE r.rental_date BETWEEN '2005-05-01 00:00:00' AND '2005-05-31 23:59:59'
group by f.film_id
;'''

In [10]:
df_rentals_may = pd.read_sql_query(query_rentals_may, engine)
df_rentals_may.shape

(686, 2)

In [11]:
df_rentals_may.head()

Unnamed: 0,title,nr_rentals_may
0,ACADEMY DINOSAUR,2
1,ADAPTATION HOLES,1
2,AFFAIR PREJUDICE,2
3,AFRICAN EGG,1
4,AGENT TRUMAN,2
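
The query above returns only 686 of the 1000 films because the date filter sits in the `WHERE` clause: rows with no May rental are dropped after the join, which is why a merge with the full title list is needed afterwards. As a rough sketch of the alternative, the filter can be moved into the `ON` clause so that every film is kept. The example below uses an in-memory SQLite database with three made-up films as a stand-in for Sakila (table contents and titles are hypothetical), so it only illustrates the join behaviour, not the real data.

```python
import sqlite3

# Tiny in-memory stand-in for the Sakila tables (hypothetical rows, not real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER, rental_date TEXT);
INSERT INTO film VALUES (1, 'FILM A'), (2, 'FILM B'), (3, 'FILM C');
INSERT INTO inventory VALUES (10, 1), (20, 2);
INSERT INTO rental VALUES
    (100, 10, '2005-05-25 10:00:00'),
    (101, 10, '2005-05-30 18:30:00'),
    (102, 20, '2005-06-02 09:15:00');
""")

# Date filter in WHERE: rows whose rental_date is NULL or outside May are
# discarded, so films without May rentals vanish from the result.
where_rows = conn.execute("""
SELECT f.title, COUNT(r.rental_id) AS nr_rentals_may
FROM film f
LEFT JOIN inventory i ON f.film_id = i.film_id
LEFT JOIN rental r ON i.inventory_id = r.inventory_id
WHERE r.rental_date >= '2005-05-01' AND r.rental_date < '2005-06-01'
GROUP BY f.film_id, f.title
ORDER BY f.film_id
""").fetchall()

# Date filter in the ON clause: every film is kept, with a count of 0
# (and rented_in_may = 0) when no May rental matches.
on_rows = conn.execute("""
SELECT f.title,
       COUNT(r.rental_id) AS nr_rentals_may,
       COUNT(r.rental_id) > 0 AS rented_in_may
FROM film f
LEFT JOIN inventory i ON f.film_id = i.film_id
LEFT JOIN rental r
       ON i.inventory_id = r.inventory_id
      AND r.rental_date >= '2005-05-01' AND r.rental_date < '2005-06-01'
GROUP BY f.film_id, f.title
ORDER BY f.film_id
""").fetchall()

print(where_rows)
print(on_rows)
```

With the filter in `ON`, the target column comes straight out of the query and no NaN handling is needed. Whether that is preferable to the merge used in this notebook is a matter of taste.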


In [12]:
# Merging two tables to create our target
df_new = pd.merge(df_t, df_rentals_may, how='left', on='title')

In [13]:
df_new.shape

(1000, 2)

In [14]:
df_new.tail()

Unnamed: 0,title,nr_rentals_may
995,YOUNG LANGUAGE,
996,YOUTH KICK,
997,ZHIVAGO CORE,1.0
998,ZOOLANDER FICTION,1.0
999,ZORRO ARK,3.0


In [15]:
df_new.isna().sum()

title               0
nr_rentals_may    314
dtype: int64

In [16]:
df_new["nr_rentals_may"].unique()

array([ 2., nan,  1.,  3.,  4.,  5.])

In [17]:
df_new["nr_rentals_may"] = df_new["nr_rentals_may"].fillna(0)

In [18]:
# Checking that NaN was replaced with 0
df_new["nr_rentals_may"].unique()

array([2., 0., 1., 3., 4., 5.])

In [19]:
df_new.isna().sum()

title             0
nr_rentals_may    0
dtype: int64

In [20]:
# Creating a boolean mask
# 1 if the film was rented in May
# 0 if the film was not rented in May

In [21]:
# Creating a new column with boolean mask
df_new["rented_in_may"] = df_new["nr_rentals_may"]>0
df_new.tail()

Unnamed: 0,title,nr_rentals_may,rented_in_may
995,YOUNG LANGUAGE,0.0,False
996,YOUTH KICK,0.0,False
997,ZHIVAGO CORE,1.0,True
998,ZOOLANDER FICTION,1.0,True
999,ZORRO ARK,3.0,True
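
The merge, NaN fill, and boolean mask above can be condensed into a few lines. The sketch below uses two small made-up frames (`all_titles` and `may_rentals` are hypothetical stand-ins for `df_t` and `df_rentals_may`) and casts the result to 0/1 integers directly.

```python
import pandas as pd

# Hypothetical stand-ins for df_t (all titles) and df_rentals_may (titles rented in May).
all_titles = pd.DataFrame({"title": ["FILM A", "FILM B", "FILM C"]})
may_rentals = pd.DataFrame({"title": ["FILM A", "FILM C"], "nr_rentals_may": [2, 1]})

# Left merge keeps every title; titles without May rentals get NaN, then 0.
target = all_titles.merge(may_rentals, how="left", on="title")
target["nr_rentals_may"] = target["nr_rentals_may"].fillna(0).astype(int)

# The target is 1 when the film was rented at least once in May, else 0.
target["rented_in_may"] = (target["nr_rentals_may"] > 0).astype(int)
print(target)
```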


### 3. Creating a data frame with all features and boolean mask

In [22]:
df2 = pd.merge(df, df_new, how='left', on='title')
df2

Unnamed: 0,film_id,title,rental_duration,rental_rate,length,rating,special_features,nr_of_inventories,category_name,nr_rentals_may,rented_in_may
0,1,ACADEMY DINOSAUR,6,0.99,86,PG,"Deleted Scenes,Behind the Scenes",8,Documentary,2.0,True
1,2,ACE GOLDFINGER,3,4.99,48,G,"Trailers,Deleted Scenes",3,Horror,0.0,False
2,3,ADAPTATION HOLES,7,2.99,50,NC-17,"Trailers,Deleted Scenes",4,Documentary,1.0,True
3,4,AFFAIR PREJUDICE,5,2.99,117,G,"Commentaries,Behind the Scenes",7,Horror,2.0,True
4,5,AFRICAN EGG,6,2.99,130,G,Deleted Scenes,3,Family,1.0,True
...,...,...,...,...,...,...,...,...,...,...,...
995,996,YOUNG LANGUAGE,6,0.99,183,G,"Trailers,Behind the Scenes",2,Documentary,0.0,False
996,997,YOUTH KICK,4,0.99,179,NC-17,"Trailers,Behind the Scenes",2,Music,0.0,False
997,998,ZHIVAGO CORE,6,0.99,105,NC-17,Deleted Scenes,2,Horror,1.0,True
998,999,ZOOLANDER FICTION,5,2.99,101,R,"Trailers,Deleted Scenes",5,Children,1.0,True


### 4. Analyze extracted features (X) and transform them. You may need to encode some categorical variables, or scale numerical variables.

Before analyzing further, I decided to drop some features:
- film_id (these are all unique numbers and do not give us any new information)
- title (also unique for every row)
- special_features

In [23]:
df2 = df2.drop(["film_id", "title", "special_features"], axis=1)

In [24]:
df2.head()

Unnamed: 0,rental_duration,rental_rate,length,rating,nr_of_inventories,category_name,nr_rentals_may,rented_in_may
0,6,0.99,86,PG,8,Documentary,2.0,True
1,3,4.99,48,G,3,Horror,0.0,False
2,7,2.99,50,NC-17,4,Documentary,1.0,True
3,5,2.99,117,G,7,Horror,2.0,True
4,6,2.99,130,G,3,Family,1.0,True


Next steps:

- X/y split (feature/target) : X, y
- train/test split           : X_train, X_test, y_train, y_test
- num/cat split              : X_train_num, X_train_cat, X_test_num, X_test_cat

- fit transformer/scaler on X_train_num
- run transformer on X_train_num     : X_train_normalized
- run same transformer on X_test_num : X_test_normalized
- fit encoder on X_train_cat
- run encoder on X_train_cat         : X_train_encoded
- run same encoder on X_test_cat     : X_test_encoded

- concat X_train_normalized and X_train_encoded : X_train_transformed
- choose model (LinearRegression on numeric target, LogisticRegression(=classification!) on categorical target)
- fit (train) model on X_train_transformed      : model
- concat X_test_normalized and X_test_encoded   : X_test_transformed
- make predictions using X_test_transformed     : model.predict -> predictions
- compute score using predictions and y_test
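
The steps listed above are carried out by hand in the rest of this notebook. As a point of comparison, the same flow can be sketched with scikit-learn's `ColumnTransformer` and `Pipeline`, which fit the scaler and encoder on the training data only and reuse them on the test data. The data below is synthetic and the labels are random, so the score means nothing. The names `demo`, `target`, `preprocess`, and `model` are illustrative and not part of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (hypothetical values, not the Sakila tables).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "rental_rate": rng.choice([0.99, 2.99, 4.99], size=n),
    "length": rng.integers(46, 186, size=n),
    "rating": rng.choice(["G", "PG", "R"], size=n),
})
target = (rng.random(n) < 0.7).astype(int)  # random labels, for illustration only

num_cols = ["rental_rate", "length"]
cat_cols = ["rating"]

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), num_cols),
    ("cat", OneHotEncoder(drop="first"), cat_cols),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaler, encoder, and classifier from the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(demo, target, random_state=0)
model.fit(X_tr, y_tr)

# predict() and score() apply the already fitted transformers to the test split.
preds = model.predict(X_te)
print(preds.shape, model.score(X_te, y_te))
```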

#### X, y split

Before splitting, I will convert the True/False values in rented_in_may to 1/0.

In [25]:
# Cast the boolean column to integers (True -> 1, False -> 0)
df2['rented_in_may'] = df2['rented_in_may'].astype(int)

In [26]:
df2.head()

Unnamed: 0,rental_duration,rental_rate,length,rating,nr_of_inventories,category_name,nr_rentals_may,rented_in_may
0,6,0.99,86,PG,8,Documentary,2.0,1
1,3,4.99,48,G,3,Horror,0.0,0
2,7,2.99,50,NC-17,4,Documentary,1.0,1
3,5,2.99,117,G,7,Horror,2.0,1
4,6,2.99,130,G,3,Family,1.0,1


In [27]:
# Splitting data in y (target) and X (features)
y = df2["rented_in_may"]
X = df2.drop(["rented_in_may"], axis=1)

#### train/test split : X_train, X_test, y_train, y_test

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [29]:
X_train.shape

(750, 7)

#### num/cat split : X_train_num, X_train_cat, X_test_num, X_test_cat

In [30]:
# numerical/categorical on train set
X_train_num = X_train.select_dtypes(include = np.number)
X_train_cat = X_train.select_dtypes(include = object)

# numerical/categorical on test set
X_test_num = X_test.select_dtypes(include = np.number)
X_test_cat = X_test.select_dtypes(include = object)

In [31]:
display(X_train_num.shape)
display(X_train_cat.shape)

(750, 5)

(750, 2)

In [32]:
X_train_num.head()

Unnamed: 0,rental_duration,rental_rate,length,nr_of_inventories,nr_rentals_may
253,4,2.99,159,5,2.0
667,7,4.99,80,5,2.0
85,6,4.99,121,8,4.0
969,7,0.99,52,7,1.0
75,4,0.99,103,3,0.0


#### fit transformer/scaler on X_train_num

In [33]:
# Creating a transformer.
# Fitting transformer ONLY with training set. 
transformer = MinMaxScaler().fit(X_train_num)


#### run transformer on X_train_num : X_train_normalized

In [34]:
X_train_normalized = transformer.transform(X_train_num)
df_X_train_normalized = pd.DataFrame(X_train_normalized, columns=X_train_num.columns)
df_X_train_normalized.head()

Unnamed: 0,rental_duration,rental_rate,length,nr_of_inventories,nr_rentals_may
0,0.25,0.5,0.81295,0.625,0.4
1,1.0,1.0,0.244604,0.625,0.4
2,0.75,1.0,0.539568,1.0,0.8
3,1.0,0.0,0.043165,0.875,0.2
4,0.25,0.0,0.410072,0.375,0.0


#### run same transformer on X_test_num : X_test_normalized
Important note:
- Do not fit the transformer on the test set! Reuse the transformer fitted on the training set.
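
A small sketch of what this means in practice: a `MinMaxScaler` fitted on the training values keeps the training minimum and maximum, so test values outside that range land outside [0, 1]. That is expected and is not a reason to refit. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data: the test set contains values outside the training range.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [7.0]])

# Fit on the training data only; the scaler remembers min=1 and max=5.
scaler = MinMaxScaler().fit(train)

train_scaled = scaler.transform(train)  # values in [0, 1]
test_scaled = scaler.transform(test)    # may fall below 0 or above 1

print(train_scaled.ravel())
print(test_scaled.ravel())
```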

In [35]:
X_test_normalized = transformer.transform(X_test_num)
df_X_test_normalized = pd.DataFrame(X_test_normalized, columns=X_test_num.columns)
df_X_test_normalized.head()

Unnamed: 0,rental_duration,rental_rate,length,nr_of_inventories,nr_rentals_may
0,0.75,1.0,0.388489,0.375,0.4
1,0.25,1.0,0.338129,0.0,0.0
2,0.25,0.0,0.705036,0.5,0.4
3,0.75,0.5,0.81295,0.875,0.6
4,0.0,0.0,0.517986,0.5,0.2


#### fit encoder on X_train_cat

In [36]:
X_train_cat.head()

Unnamed: 0,rating,category_name
253,PG-13,Sports
667,PG,Sports
85,R,Music
969,NC-17,Classics
75,NC-17,Music


In [37]:
encoder = OneHotEncoder(drop='first').fit(X_train_cat)

#### run encoder on X_train_cat : X_train_encoded
#### run same encoder on X_test_cat : X_test_encoded

In [38]:
# Getting names of columns to be able to label features
cols = encoder.get_feature_names_out(input_features=X_train_cat.columns)

# Running the encoder on X_train_cat
X_train_cat_encoded = encoder.transform(X_train_cat).toarray()
df_X_train_cat_encoded = pd.DataFrame(X_train_cat_encoded, columns=cols)
df_X_train_cat_encoded.head()

Unnamed: 0,rating_NC-17,rating_PG,rating_PG-13,rating_R,category_name_Animation,category_name_Children,category_name_Classics,category_name_Comedy,category_name_Documentary,category_name_Drama,category_name_Family,category_name_Foreign,category_name_Games,category_name_Horror,category_name_Music,category_name_New,category_name_Sci-Fi,category_name_Sports,category_name_Travel
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [39]:
# Running the encoder on X_test_cat
X_test_cat_encoded = encoder.transform(X_test_cat).toarray()
df_X_test_cat_encoded = pd.DataFrame(X_test_cat_encoded, columns=cols)
df_X_test_cat_encoded.head()

Unnamed: 0,rating_NC-17,rating_PG,rating_PG-13,rating_R,category_name_Animation,category_name_Children,category_name_Classics,category_name_Comedy,category_name_Documentary,category_name_Drama,category_name_Family,category_name_Foreign,category_name_Games,category_name_Horror,category_name_Music,category_name_New,category_name_Sci-Fi,category_name_Sports,category_name_Travel
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


#### concat X_train_normalized and X_train_encoded : X_train_transformed

In [40]:
X_train_transformed = np.concatenate([X_train_normalized, X_train_cat_encoded], axis=1)

### 5. Create a logistic regression model to predict 'rented_in_may' from the cleaned data.

#### fit (train) model on X_train_transformed : model

In [41]:
# Creating a logistic model
LR = LogisticRegression(random_state=0, solver='lbfgs')  # random_state has no effect with the lbfgs solver and could be removed
LR.fit(X_train_transformed, y_train)

#### concat X_test_normalized and X_test_encoded : X_test_transformed

In [42]:
X_test_transformed = np.concatenate([X_test_normalized, X_test_cat_encoded], axis=1)

#### make predictions using X_test_transformed : model.predict -> predictions


In [43]:
pred = LR.predict(X_test_transformed)

### 6. Evaluate the results.

#### compute score using predictions and y_test

In [44]:
LR.score(X_test_transformed, y_test)

1.0

In [45]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,pred)

array([[ 76,   0],
       [  0, 174]], dtype=int64)

#### Conclusion:

I have a model that reaches 100% accuracy on the test set. A perfect score like this means that one of the features leaks the target.
After checking the features again, I see that 'nr_rentals_may', the number of rentals in May, is the culprit: the target 'rented_in_may' was derived directly from it (nr_rentals_may > 0), so the model only had to learn that threshold.

We want to make predictions for a new month (June), and perhaps also predict what kind of movies (for example comedies, or films with a particular rating or special features) would be worth adding to the store. To do so, the model needs to be retrained on all features except 'nr_rentals_may'. That will give a realistic picture of the problem and a model that is actually useful for it.
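
One simple way to catch a feature like 'nr_rentals_may' before training is to check whether a single threshold on any numeric column separates the two classes exactly. The sketch below does that on nine rows copied from the merged frame shown earlier. The helper `perfectly_separating` is a hypothetical name introduced here, and the check is only a rough heuristic, not a complete leakage test.

```python
import pandas as pd

# Nine rows copied from the merged frame displayed earlier in this notebook.
sample = pd.DataFrame({
    "rental_duration":   [6, 3, 7, 5, 6, 6, 4, 6, 5],
    "rental_rate":       [0.99, 4.99, 2.99, 2.99, 2.99, 0.99, 0.99, 0.99, 2.99],
    "length":            [86, 48, 50, 117, 130, 183, 179, 105, 101],
    "nr_of_inventories": [8, 3, 4, 7, 3, 2, 2, 2, 5],
    "nr_rentals_may":    [2, 0, 1, 2, 1, 0, 0, 1, 1],
    "rented_in_may":     [1, 0, 1, 1, 1, 0, 0, 1, 1],
})

def perfectly_separating(df, target):
    """Return numeric columns where one threshold splits the two classes exactly."""
    flagged = []
    neg, pos = df[df[target] == 0], df[df[target] == 1]
    for col in df.columns.drop(target):
        # The classes are separable if their value ranges do not overlap.
        if neg[col].max() < pos[col].min() or pos[col].max() < neg[col].min():
            flagged.append(col)
    return flagged

leaky = perfectly_separating(sample, "rented_in_may")
print(leaky)

# Feature matrix without the leaking column(s) and without the target.
features = sample.drop(columns=leaky + ["rented_in_may"])
print(list(features.columns))
```

On the full data set the same idea applies: drop the flagged column from X before the train/test split and rerun the remaining steps unchanged.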

In [46]:
print("precision: ",precision_score(y_test,pred))
print("recall: ",recall_score(y_test,pred))
print("f1: ",f1_score(y_test,pred))

precision:  1.0
recall:  1.0
f1:  1.0
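
For reference, these three scores follow directly from the confusion matrix printed above, which scikit-learn lays out as [[TN, FP], [FN, TP]]. A short worked calculation with the values from that matrix:

```python
# Values read from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 76, 0, 0, 174

precision = tp / (tp + fp)                           # share of predicted positives that are correct
recall = tp / (tp + fn)                              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all predictions that are correct

print(precision, recall, f1, accuracy)
```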
