# Lab | Making predictions with logistic regression

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals.

In order to optimize our inventory, we would like to predict if a film will have more monthly rentals in July than in June. Create a model to predict it.

### Instructions

1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.
2. Create a query to get the total number of rentals in June for each film.
3. Do the same for July.
4. Create a new column indicating (Yes/No), for each film, whether the number of monthly rentals in **July was greater than in June**. Your objective will be to predict this new column.
5. Read the data into a Pandas dataframe.
6. Analyze the extracted features and transform them. You may need to encode some categorical variables or scale numerical variables.
7. Create a logistic regression model to predict this new column from the cleaned data.
8. Evaluate the results.
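The target column described in the instructions can be derived once the June and July counts are available as dataframes. The following is a minimal sketch, not the lab's reference solution: the two small dataframes and the column names `rentals_june`, `rentals_july` and `more_in_july` are made-up placeholders for the real query results.

```python
import pandas as pd

# Hypothetical per-film rental counts, standing in for the results of the
# June and July queries.
june = pd.DataFrame({"film_id": [1, 2, 3], "rentals_june": [3, 5, 2]})
july = pd.DataFrame({"film_id": [1, 2, 4], "rentals_july": [4, 5, 1]})

# An outer merge keeps films that were rented in only one of the two months;
# a missing count is treated as zero rentals in that month.
counts = (
    june.merge(july, on="film_id", how="outer")
        .fillna(0)
        .sort_values("film_id")
        .reset_index(drop=True)
)

# Yes/No target: was the film rented more often in July than in June?
counts["more_in_july"] = (
    counts["rentals_july"] > counts["rentals_june"]
).map({True: "Yes", False: "No"})

print(counts)
```

Treating a missing month as zero rentals is a design choice; an inner merge would instead drop films that were not rented in both months.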


In [1]:
import pandas as pd
import numpy as np
import pymysql
from sqlalchemy import create_engine
import getpass

#### Establish database connection

In [5]:
password = getpass.getpass()
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)

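The connection string above concatenates the raw password into the URL. If the password contains characters such as `@` or `/`, the URL may be parsed incorrectly. One common remedy is to URL-encode the credentials first. The sketch below uses the standard library's `quote_plus`; the helper name `build_connection_string` and the example password are hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database):
    """Build a MySQL/PyMySQL URL with URL-encoded credentials (hypothetical helper)."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Example with a made-up password containing special characters.
example = build_connection_string("root", "p@ss/word", "localhost", "sakila")
print(example)  # mysql+pymysql://root:p%40ss%2Fword@localhost/sakila
```

The resulting string can be passed to `create_engine` in the same way as the original `connection_string`.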


#### First, let's take a look at the Sakila database to decide which film features are worth a closer look.
#### I decided to use the following ones:

- genre (category.name)
- title (film.title)
- length (film.length)
- rental price (film.rental_rate)
- rating (film.rating)
- features (film.special_features)
- store (store.store_id)
- language (language.name)
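The features listed above live in several Sakila tables, so one way to collect them is a single query with explicit column names. This is a sketch under assumptions rather than the notebook's own query: the aliases `genre` and `language` and the variable name `features_query` are illustrative, and the result contains one row per inventory copy, so films stocked in several copies or stores appear more than once.

```python
# Sketch: select only the chosen features, with explicit column names.
# film -> language gives the language name, film -> film_category -> category
# gives the genre, and inventory gives the store that stocks each copy.
features_query = '''
    SELECT f.film_id,
           f.title,
           c.name AS genre,
           f.length,
           f.rental_rate,
           f.rating,
           f.special_features,
           l.name AS language,
           i.store_id
    FROM film AS f
        JOIN language AS l
            ON l.language_id = f.language_id
        JOIN film_category AS fc
            ON fc.film_id = f.film_id
        JOIN category AS c
            ON c.category_id = fc.category_id
        JOIN inventory AS i
            ON i.film_id = f.film_id;'''

# Assumed usage with the engine created earlier:
# features = pd.read_sql_query(features_query, engine)
```

Naming the columns explicitly avoids the repeated column names that `SELECT *` produces when several joined tables share a column such as `film_id` or `last_update`.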

In [7]:
query_june_2005 = '''
    SELECT * FROM inventory as i
        JOIN film as f
            ON i.film_id = f.film_id
        JOIN rental as r
            ON r.inventory_id = i.inventory_id
        WHERE left(r.rental_date, 7) = '2005-06';'''


# Execute query with pandas (SELECT ONLY)
data = pd.read_sql_query(query_june_2005, engine)
data.head()
data.columns.tolist()

['inventory_id',
 'film_id',
 'store_id',
 'last_update',
 'film_id',
 'title',
 'description',
 'release_year',
 'language_id',
 'original_language_id',
 'rental_duration',
 'rental_rate',
 'length',
 'replacement_cost',
 'rating',
 'special_features',
 'last_update',
 'rental_id',
 'rental_date',
 'inventory_id',
 'customer_id',
 'return_date',
 'staff_id',
 'last_update']
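The column list above contains repeated names (`film_id`, `inventory_id`, `last_update`) because `SELECT *` returns every column of all three joined tables. For the per-film rental counts asked for in the instructions, an aggregated query may be easier to work with. The following is a sketch, not the notebook's own solution: it counts June and July rentals in one pass using conditional aggregation, and the names `monthly_counts_query`, `rentals_june` and `rentals_july` are illustrative.

```python
# Sketch: per-film rental counts for June and July 2005 in a single query.
# Half-open date ranges are used instead of LEFT(rental_date, 7) so the
# comparison is made on the date value itself.
monthly_counts_query = '''
    SELECT i.film_id,
           SUM(CASE WHEN r.rental_date >= '2005-06-01'
                     AND r.rental_date <  '2005-07-01'
                    THEN 1 ELSE 0 END) AS rentals_june,
           SUM(CASE WHEN r.rental_date >= '2005-07-01'
                     AND r.rental_date <  '2005-08-01'
                    THEN 1 ELSE 0 END) AS rentals_july
    FROM inventory AS i
        JOIN rental AS r
            ON r.inventory_id = i.inventory_id
    WHERE r.rental_date >= '2005-06-01'
      AND r.rental_date <  '2005-08-01'
    GROUP BY i.film_id;'''

# Assumed usage with the engine created earlier:
# monthly_counts = pd.read_sql_query(monthly_counts_query, engine)
```

Films with no rentals in either month do not appear in this result, so they would need to be added with a count of zero if they are to be part of the model.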