# Lab | Making predictions with logistic regression

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals.

To optimize our inventory, we would like to know which films will be rented. We have been asked to build a model that predicts this, using the rental information we have from May 2005.

### Instructions

1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features (X).
2. Create a query to get the list of all unique film titles and a boolean indicating whether each film was rented (based on `rental_date`) in May 2005. Name the new column `rented_in_may`. This will be our **TARGET** (y) variable.
3. Read the data into a Pandas dataframe. At this point you should have 1000 rows; the number of columns depends on the features you chose.
4. Analyze extracted features (X) and transform them. You may need to encode some categorical variables, or scale numerical variables.
5. Create a logistic regression model to predict 'rented_in_may' from the cleaned data.
6. Evaluate the results.

In [1]:
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input
password = getpass.getpass()

········


In [60]:
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)
query = '''SELECT
    f.release_year, f.language_id, f.rental_duration, f.rental_rate, f.length, f.rating, f.last_update,
    -- 1 if the film has at least one rental dated May 2005, otherwise 0
    MAX(IF(MONTH(r.rental_date) = 5 AND YEAR(r.rental_date) = 2005, 1, 0)) AS rented_in_may
FROM
    film AS f
LEFT JOIN
    inventory AS i ON f.film_id = i.film_id
LEFT JOIN
    rental AS r ON i.inventory_id = r.inventory_id
-- group by the primary key so that every selected film column is well-defined
GROUP BY
    f.film_id;'''
data = pd.read_sql_query(query, engine)
data.head(10)

Unnamed: 0,release_year,language_id,rental_duration,rental_rate,length,rating,last_update,rented_in_may
0,2006,1,6,0.99,86,PG,2006-02-15 05:03:42,1
1,2006,1,3,4.99,48,G,2006-02-15 05:03:42,0
2,2006,1,7,2.99,50,NC-17,2006-02-15 05:03:42,1
3,2006,1,5,2.99,117,G,2006-02-15 05:03:42,1
4,2006,1,6,2.99,130,G,2006-02-15 05:03:42,1
5,2006,1,3,2.99,169,PG,2006-02-15 05:03:42,1
6,2006,1,6,4.99,62,PG-13,2006-02-15 05:03:42,0
7,2006,1,6,4.99,54,R,2006-02-15 05:03:42,1
8,2006,1,3,2.99,114,PG-13,2006-02-15 05:03:42,0
9,2006,1,6,4.99,63,NC-17,2006-02-15 05:03:42,0
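
The `LEFT JOIN` plus `MAX(...)` pattern in the query can be tried out without a MySQL server. The sketch below rebuilds a tiny version of the `film`, `inventory` and `rental` tables in an in-memory SQLite database. The rows are made up for illustration (they are not Sakila data), and `CASE WHEN` with `strftime` stands in for MySQL's `IF`, `MONTH` and `YEAR`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER, rental_date TEXT);

INSERT INTO film VALUES (1, 'FILM A'), (2, 'FILM B'), (3, 'FILM C');
-- FILM A has two copies, FILM B has one, FILM C has no inventory at all
INSERT INTO inventory VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO rental VALUES
    (100, 10, '2005-05-25 10:00:00'),
    (101, 11, '2005-06-15 09:30:00'),
    (102, 20, '2005-07-01 18:45:00');
""")

# One row per film: the flag is 1 if any joined rental falls in May 2005.
# Films without inventory or rentals survive the LEFT JOINs and get 0.
rows = conn.execute("""
SELECT f.title,
       MAX(CASE WHEN strftime('%Y-%m', r.rental_date) = '2005-05' THEN 1 ELSE 0 END)
           AS rented_in_may
FROM film AS f
LEFT JOIN inventory AS i ON f.film_id = i.film_id
LEFT JOIN rental AS r ON i.inventory_id = r.inventory_id
GROUP BY f.film_id
ORDER BY f.film_id
""").fetchall()
conn.close()

print(rows)
```

Using inner joins here would silently drop films such as the hypothetical `FILM C`, which is why the lab query keeps `LEFT JOIN` to reach the expected 1000 rows.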


In [6]:
# data.isna().sum()

In [61]:
data.dtypes

release_year                int64
language_id                 int64
rental_duration             int64
rental_rate               float64
length                      int64
rating                     object
last_update        datetime64[ns]
rented_in_may               int64
dtype: object
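
In the ten rows printed earlier, `release_year`, `language_id` and `last_update` each show a single value. A column with only one distinct value cannot help a model separate the classes, so it is worth checking for such columns before modelling. A minimal sketch on a toy frame (four rows copied from the head of the data; `constant_cols` and `reduced` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({
    "release_year": [2006, 2006, 2006, 2006],
    "language_id": [1, 1, 1, 1],
    "rental_duration": [6, 3, 7, 5],
    "rating": ["PG", "G", "NC-17", "G"],
})

# Count distinct values per column and collect the columns that never vary
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop the constant columns; the remaining ones carry some variation
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```

Whether the same columns are constant across all 1000 films has to be confirmed on the real dataframe, for example with `data.nunique()`.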

In [62]:
y = data['rented_in_may']
X = data.drop('rented_in_may', axis=1)

In [63]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)
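
The split above is purely random, so the share of rented films can differ between the training and test sets. One optional variation is to pass `stratify=y`, which keeps the class proportions the same in both parts. The sketch below uses an invented 70/30 class balance, not the real one from the lab data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 70 positives and 30 negatives (an assumed balance)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1337, stratify=y_toy
)

# Both parts keep the share of positives found in the full toy data
print(y_tr.mean(), y_te.mean())
```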

In [64]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# from sklearn.preprocessing import StandardScaler

X_train_num = X_train.select_dtypes(include = np.number)

# Scaling data
transformer = MinMaxScaler().fit(X_train_num) # need to keep transformer
X_train_normalized = transformer.transform(X_train_num)
X_train_norm = pd.DataFrame(X_train_normalized, columns=X_train_num.columns)
X_train_norm

Unnamed: 0,release_year,language_id,rental_duration,rental_rate,length
0,0.0,0.0,0.50,1.0,0.769784
1,0.0,0.0,0.75,0.0,0.151079
2,0.0,0.0,0.00,0.5,0.258993
3,0.0,0.0,0.50,1.0,0.223022
4,0.0,0.0,0.25,0.0,0.733813
...,...,...,...,...,...
795,0.0,0.0,0.50,0.0,0.151079
796,0.0,0.0,0.00,0.0,0.683453
797,0.0,0.0,0.00,1.0,0.467626
798,0.0,0.0,0.50,1.0,0.906475
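
`MinMaxScaler` rescales each column as `(x - min) / (max - min)`, where `min` and `max` are learned from the training data only. A column that never varies in the training data, such as `release_year` and `language_id` here, comes out as all zeros. The sketch below reproduces the formula by hand on made-up film lengths and shows that test values above the training maximum scale to more than 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up film lengths in minutes; the last test value exceeds the training maximum
train_len = np.array([[46.0], [100.0], [185.0]])
test_len = np.array([[46.0], [185.0], [200.0]])

scaler = MinMaxScaler().fit(train_len)   # learns min and max from train only
scaled = scaler.transform(test_len)      # applies the train statistics to test

# The same computation written out by hand
train_min = train_len.min(axis=0)
train_max = train_len.max(axis=0)
manual = (test_len - train_min) / (train_max - train_min)

print(scaled.ravel())
```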


In [65]:
X_train_categorical = X_train.select_dtypes(include = object)
X_train_cat = pd.get_dummies(X_train_categorical, 
                             columns=['rating'],
                             drop_first=True)
X_train_cat

Unnamed: 0,rating_NC-17,rating_PG,rating_PG-13,rating_R
46,True,False,False,False
789,False,False,True,False
722,False,False,True,False
283,True,False,False,False
39,False,False,False,True
...,...,...,...,...
167,False,False,False,True
232,False,True,False,False
860,False,False,False,True
189,True,False,False,False


In [66]:
X_train_transformed = np.concatenate([X_train_norm, X_train_cat], axis=1)
X_train_transformed.shape

(800, 9)

In [67]:
from sklearn.linear_model import LogisticRegression
classification = LogisticRegression(random_state=0, solver='lbfgs',
                  multi_class='multinomial').fit(X_train_transformed, y_train)
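
For a two-class target, logistic regression turns a linear score `z = w·x + b` into a probability with the sigmoid function `1 / (1 + exp(-z))` and predicts class 1 when that probability exceeds 0.5. The sketch below checks this against scikit-learn on synthetic data. It uses `LogisticRegression()` with default settings rather than the exact arguments of the cell above, and `X_demo`, `y_demo` are generated data, not the film features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem with four numeric features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear score from the fitted coefficients and intercept
z = X_demo @ model.coef_.ravel() + model.intercept_[0]

# Sigmoid of the score gives the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = model.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 gives the predicted label
labels_manual = (p_manual > 0.5).astype(int)

print(np.allclose(p_manual, p_sklearn))
```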

In [68]:
X_test_num = X_test.select_dtypes(include = np.number)

# Scaling data
# we use the transformer that was trained on the training data
X_test_normalized = transformer.transform(X_test_num)
X_test_norm = pd.DataFrame(X_test_normalized, columns=X_test_num.columns)
X_test_norm

Unnamed: 0,release_year,language_id,rental_duration,rental_rate,length
0,0.0,0.0,0.00,0.0,0.374101
1,0.0,0.0,0.75,0.5,0.964029
2,0.0,0.0,0.25,0.5,0.316547
3,0.0,0.0,1.00,0.5,0.618705
4,0.0,0.0,0.25,0.0,0.453237
...,...,...,...,...,...
195,0.0,0.0,0.25,0.0,0.402878
196,0.0,0.0,0.25,0.5,0.446043
197,0.0,0.0,1.00,0.5,0.273381
198,0.0,0.0,0.25,0.0,0.294964


In [69]:
X_test_categorical = X_test.select_dtypes(include = object)
X_test_cat = pd.get_dummies(X_test_categorical, 
                            columns=['rating'],
                            drop_first=True)
X_test_cat

Unnamed: 0,rating_NC-17,rating_PG,rating_PG-13,rating_R
977,False,False,False,True
15,True,False,False,False
56,False,False,True,False
801,False,True,False,False
747,False,False,True,False
...,...,...,...,...
736,False,False,False,False
369,False,False,False,True
470,True,False,False,False
806,True,False,False,False


In [70]:
list(X_train_cat.columns)==list(X_test_cat.columns)

True
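
The check above passes because every rating happens to appear in both splits. When `pd.get_dummies` is called separately on train and test, a category missing from the test set changes the columns, and `drop_first=True` may then drop a different category. One possible safeguard, sketched below on made-up ratings, is to encode the test set without `drop_first` and reindex it to the training columns.

```python
import pandas as pd

train_cat = pd.DataFrame({"rating": ["G", "PG", "PG-13", "R", "NC-17"]})
test_cat = pd.DataFrame({"rating": ["PG", "R", "R"]})  # three ratings are missing

train_dummies = pd.get_dummies(train_cat, columns=["rating"], drop_first=True)

# Naive encoding: only 'PG' and 'R' are seen, so 'PG' is dropped as the baseline
test_dummies_raw = pd.get_dummies(test_cat, columns=["rating"], drop_first=True)

# Aligned encoding: keep all dummies, then match the training columns exactly.
# Columns absent from the test data are filled with 0.
test_dummies = (
    pd.get_dummies(test_cat, columns=["rating"])
    .reindex(columns=train_dummies.columns, fill_value=0)
)

print(list(test_dummies_raw.columns))
print(list(test_dummies.columns))
```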

In [71]:
X_test_transformed = np.concatenate([X_test_norm, X_test_cat], axis=1)
X_test_transformed.shape

(200, 9)
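
The manual steps so far (scale the numeric columns, encode `rating`, concatenate, repeat for the test set) can also be bundled into a single scikit-learn `Pipeline`. This is an alternative sketch, not the approach used in this notebook. The first ten rows of the toy frame are copied from the head of the data and the last two are invented; the constant columns are left out, and the list of ratings is passed explicitly so the encoder does not depend on which ratings land in the training split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({
    "rental_duration": [6, 3, 7, 5, 6, 3, 6, 6, 3, 6, 4, 5],
    "rental_rate": [0.99, 4.99, 2.99, 2.99, 2.99, 2.99, 4.99, 4.99, 2.99, 4.99, 0.99, 0.99],
    "length": [86, 48, 50, 117, 130, 169, 62, 54, 114, 63, 90, 101],
    "rating": ["PG", "G", "NC-17", "G", "G", "PG", "PG-13", "R", "PG-13", "NC-17", "R", "PG"],
    "rented_in_may": [1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0],
})

X_toy = toy.drop(columns="rented_in_may")
y_toy = toy["rented_in_may"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=1337)

numeric_cols = ["rental_duration", "rental_rate", "length"]
ratings = ["G", "NC-17", "PG", "PG-13", "R"]

# Scale numeric columns and one-hot encode the rating, dropping the first category
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(categories=[ratings], drop="first"), ["rating"]),
])

# fit() learns the scaling and the model from the training rows only;
# score() and predict() reuse those fitted steps on the test rows
pipe = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
pipe.fit(X_tr, y_tr)

n_features = pipe.named_steps["preprocess"].transform(X_te).shape[1]
test_accuracy = pipe.score(X_te, y_te)
print(n_features, test_accuracy)
```

With only twelve rows the accuracy printed here says nothing about the real problem; the point is that train and test pass through identical preprocessing.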

In [73]:
predictions = classification.predict(X_test_transformed)
classification.score(X_test_transformed, y_test)

0.69
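
A single accuracy figure is hard to judge without knowing the class balance: a model that always predicts the majority class already reaches an accuracy equal to that class's share. A fuller evaluation would compare against that baseline and look at the confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative, not this model's output); on the real data the same calls would take `y_test` and `predictions`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative labels: 6 rented films, 4 not rented
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1, 0, 0])

acc = accuracy_score(y_true, y_pred)

# Accuracy of always predicting the most frequent class
majority_share = max(y_true.mean(), 1 - y_true.mean())

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)

print("accuracy:", acc)
print("majority baseline:", majority_share)
print(cm)
print(classification_report(y_true, y_pred))
```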