<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/01-Pandas/B3-Pandas-vs-SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Python/Pandas together with SQL

In [None]:
!sudo pip3 install -U -q PyMySQL sqlalchemy

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Change the graph defaults
plt.rcParams['figure.figsize'] = (6, 2)  # Default figure size of 6x2 inches
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.color'] = 'lightgray'
plt.rcParams['font.size'] = 12  # Default font size of 12 points
plt.rcParams['lines.linewidth'] = 1  # Default line width of 1 point
plt.rcParams['lines.markersize'] = 3  # Default marker size of 3 points
plt.rcParams['legend.fontsize'] = 10  # Default legend font size of 10 points


## Importing SQL results into DataFrames using read_sql



The `read_sql` function of Pandas allows us to create a dataframe directly from a SQL query. To execute the query, we first set up the connection to the database using the SQLAlchemy library.

In [None]:
from sqlalchemy import create_engine
from sqlalchemy import text

In [None]:
# Build the connection URL; the character set is passed as a URL parameter
conn_string_imdb = 'mysql+pymysql://{user}:{password}@{host}:{port}/{db}?charset=utf8'.format(
    user='student',
    password='dwdstudent2015',
    host='db.ipeirotis.org',
    port=3306,
    db='imdb'
)
engine_imdb = create_engine(conn_string_imdb)

Let's start with a simple example. We issue an SQL query, and get back the results loaded in a dataframe.

In [None]:
query = '''
SELECT * FROM actors LIMIT 10
'''

In [None]:
with engine_imdb.connect() as connection:
  df_actors = pd.read_sql(text(query), con=connection)

In [None]:
df_actors

In [None]:
len(df_actors)
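
If the remote server is unreachable, the same `read_sql` pattern can be tried offline. The sketch below is not part of the original setup: it assumes an in-memory SQLite database with a made-up `actors` table (the column names and rows are placeholders), and it relies on the fact that pandas accepts a `sqlite3` connection directly.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the remote "actors" table (placeholder rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actors (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO actors (id, first_name, last_name) VALUES (?, ?, ?)",
    [(1, "Ada", "Example"), (2, "Ben", "Sample"), (3, "Cy", "Placeholder")],
)
conn.commit()

# The query result comes back as a dataframe, one column per SELECTed field
df_demo = pd.read_sql("SELECT * FROM actors LIMIT 2", con=conn)
conn.close()

print(df_demo)
```

For databases other than SQLite, pandas expects a SQLAlchemy connection, as in the cells above.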

## Aggregation Calculations: Pandas or SQL?



Now let's work on a slightly more advanced example. We want to analyze the number of movies over time.

## Basic Option: Fetch all data, analyze in Pandas

Let's do the simple thing first. We will fetch all the data from the movies table and then do a pivot table on top. Since we care about efficiency, we will also time the operation.

In [None]:
%%time
query = '''SELECT * FROM movies'''
with engine_imdb.connect() as conn:
  df_basic = pd.read_sql(text(query), con=conn)

In [None]:
len(df_basic)

Notice that it takes 2-3 seconds to fetch the data from SQL and create the dataframe, since we need to transfer almost 400K records.

Once we have the records, we can then compute a pivot table:

In [None]:
%%time
# Counting movie IDs returns all the movies within the year
# Counting movie ratings returns all the movies that have
# a non-empty "rating" value (i.e., they have been rated)
pivot = df_basic.pivot_table(
    index = 'year',
    aggfunc = 'count',
    values = ['id', 'rating']
)
# Rename the columns
pivot.columns = ['all_movies', 'rated_movies']

In [None]:
# And let's check a few lines of the table
pivot.sample(5)

And we can then plot the results:

In [None]:
pivot.plot()

## Better option: Aggregation in SQL, fetch only necessary data

Now let's push the computation on the SQL server instead, using a GROUP BY and COUNT aggregates in SQL.

In [None]:
%%time
query = '''
SELECT year, COUNT(*) AS all_movies, COUNT(rating) AS rated_movies
FROM movies
GROUP BY year
ORDER BY year
'''
with engine_imdb.connect() as conn:
  df_movies = pd.read_sql(text(query), con=conn)

In [None]:
len(df_movies)

In [None]:
df_movies.sample(5)

Notice that the aggregation query ran in a few (4-5) **milliseconds**, whereas the earlier `SELECT *` query took **seconds**. In fact, the **pivot** table calculation alone, executed after fetching all the data, took longer than executing the `GROUP BY`/`COUNT` SQL query and fetching its results.

While in this example the absolute difference is only a couple of seconds, once you deal with datasets that have millions or tens of millions of rows, the savings become significant.
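
To check that the two approaches give the same numbers, here is a minimal sketch on a tiny, made-up `movies` table in an in-memory SQLite database (the rows are placeholders, not IMDb data). It uses `groupby` with named aggregation instead of `pivot_table`, which is an equivalent way to count per year in pandas.

```python
import sqlite3

import pandas as pd

# Hypothetical miniature "movies" table; None becomes SQL NULL (an unrated movie)
rows = [
    (1, 1999, 7.1),
    (2, 1999, None),
    (3, 2000, 6.4),
    (4, 2000, 8.0),
    (5, 2000, None),
    (6, 2001, None),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, year INTEGER, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)
conn.commit()

# Approach 1: fetch every row, then aggregate in pandas
df_all = pd.read_sql("SELECT * FROM movies", con=conn)
in_pandas = df_all.groupby("year").agg(
    all_movies=("id", "count"),        # counts every movie in the year
    rated_movies=("rating", "count"),  # count() skips missing ratings
)

# Approach 2: aggregate in the database, fetch one row per year
in_sql = pd.read_sql(
    """
    SELECT year, COUNT(*) AS all_movies, COUNT(rating) AS rated_movies
    FROM movies
    GROUP BY year
    ORDER BY year
    """,
    con=conn,
).set_index("year")
conn.close()

print(in_pandas)
print(in_sql)
```

Both dataframes hold one row per year. The difference is how much data crosses the connection: six rows in the first approach, three in the second.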

### Plotting: The importance of index

Let's try to plot the results. In pandas, the simple `plot()` command will use the index as the x-axis, and will plot all the numeric columns, as a line plot.

In [None]:
# The plot() command takes the index (the first "column") of the dataframe
# and makes that the x-axis.
# Then it plots *ALL* the numeric columns as a line
df_movies.plot()

We do not want to plot the `year` variable as a line. So, we select just the other two columns and plot.

In [None]:
# First step: We can eliminate the "year" line by selecting
# the columns that we want to plot
# To select columns, we pass a list of the column names that
# we want to keep in square brackets
df_movies[ ["all_movies", "rated_movies"] ].plot()
# The x-axis still does not show the year

A bit better. `year` no longer appears as a line, but we still do not have `year` as the x-axis.

To make `year` the x-axis, we need to make it the index of the dataframe:

In [None]:
df_movies_2 = df_movies.set_index('year')
df_movies_2.sample(5)

Now the plot has the year as the x-axis, and the labels are proper.

In [None]:
df_movies_2.plot()
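
As a side note, `read_sql` also takes an `index_col` argument, so the index can be set while loading instead of calling `set_index` afterwards. The sketch below assumes a made-up `yearly` table in an in-memory SQLite database. The table name and numbers are placeholders.

```python
import sqlite3

import pandas as pd

# Hypothetical table of yearly counts (placeholder numbers)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yearly (year INTEGER, all_movies INTEGER, rated_movies INTEGER)"
)
conn.executemany(
    "INSERT INTO yearly VALUES (?, ?, ?)",
    [(1999, 120, 80), (2000, 135, 90), (2001, 150, 95)],
)
conn.commit()

query = "SELECT year, all_movies, rated_movies FROM yearly ORDER BY year"

# Two steps: load, then promote the column to the index
two_step = pd.read_sql(query, con=conn).set_index("year")

# One step: ask read_sql to build the index while loading
one_step = pd.read_sql(query, con=conn, index_col="year")
conn.close()

print(one_step.index.name)
print(list(one_step.columns))
```

Either way, `year` leaves the list of columns, so a later `plot()` call would use it as the x-axis rather than drawing it as a line.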

### (Optional, but useful) Changing data types: Int vs Datetime

In our index above, the "year" variable is an integer:

In [None]:
df_movies_2.index.dtype

This is mostly fine, but we can leverage the time series processing capabilities of Pandas by converting `year` to a date.

In [None]:
# We first convert the index into datetime.
df_movies_2.index = pd.to_datetime(df_movies_2.index, format='%Y')

In [None]:
df_movies_2.sample(5)

Now we can `resample` the dates in the index. For example, we can compute totals over ten-year periods:

In [None]:
# Note: recent pandas versions (2.2+) prefer the alias '10YE' over '10Y'
df_movies_2.resample('10Y').sum()
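
One caveat: the ten-year bins that `resample` builds are anchored on the dates in the data, so they may not line up with calendar decades (1990s, 2000s, and so on). If calendar decades are what you want, an alternative is to group by a decade label computed from the index. The sketch below uses made-up yearly counts; `df_years` and its values are placeholders.

```python
import pandas as pd

# Hypothetical yearly counts for 1995-2014, ten movies per year
years = [str(y) for y in range(1995, 2015)]
df_years = pd.DataFrame(
    {"all_movies": [10] * len(years)},
    index=pd.to_datetime(years, format="%Y"),
)

# Map each date to the first year of its calendar decade, e.g. 1997 -> 1990
decade = (df_years.index.year // 10) * 10

# Sum the yearly counts within each calendar decade
by_decade = df_years.groupby(decade).sum()

print(by_decade)
```

Here the 1990s and 2010s groups each cover five years of data and the 2000s group covers ten, so the partial decades at the edges are easy to spot.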

## Exercise

* Connect to the Facebook database, and use the `MemberSince` variable from the `Profiles` table to plot the growth of Facebook users. Use the following information:
>    user='student',
>    password='dwdstudent2015',
>    host = 'db.ipeirotis.org',
>    port=3306,
>    db='facebook'
* (_Learn something new_) Use the [cumsum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cumsum.html) function of Pandas and plot the total number of registered users over time.

### Solution

In [None]:
# Same connection settings as before, pointing at the facebook database
conn_string_fb = 'mysql+pymysql://{user}:{password}@{host}:{port}/{db}?charset=utf8'.format(
    user='student',
    password='dwdstudent2015',
    host='db.ipeirotis.org',
    port=3306,
    db='facebook'
)
engine_fb = create_engine(conn_string_fb)

In [None]:
%%time
# Naive approach, fetch all the data first
query = 'SELECT * FROM Profiles'
with engine_fb.connect() as conn:
  df = pd.read_sql(text(query), con=conn)

pivot = df.pivot_table(
    index='MemberSince',
    values='ProfileID',
    aggfunc='count'
)
# Calculate weekly signups
weekly_signups = pivot.resample('1W').sum()

In [None]:
%%time
# Push calculations into SQL
query = '''
  SELECT MemberSince, COUNT(ProfileID) as signups
  FROM Profiles
  GROUP BY MemberSince
  ORDER BY MemberSince
'''
with engine_fb.connect() as conn:
  df = pd.read_sql(text(query), con=conn)
df.set_index("MemberSince", inplace=True)


In [None]:
df.plot()

In [None]:
df.cumsum().plot()

In [None]:
weekly_signups = df.resample('1W').sum()
weekly_signups.plot()
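
To see what `cumsum()` and the weekly `resample` compute without connecting to the database, here is a small sketch on made-up signup counts (the dates and numbers are placeholders, not Facebook data).

```python
import pandas as pd

# Hypothetical signup counts, indexed by signup date
signups = pd.DataFrame(
    {"signups": [3, 1, 4, 1, 5]},
    index=pd.to_datetime(
        ["2005-01-03", "2005-01-04", "2005-01-10", "2005-01-12", "2005-01-17"]
    ),
)

# Running total: each row holds the sum of all rows up to and including it
total_users = signups.cumsum()

# Weekly totals: rows are grouped into weeks ending on Sunday, then summed
weekly = signups.resample("1W").sum()

print(total_users["signups"].tolist())  # [3, 4, 8, 9, 14]
print(weekly["signups"].tolist())       # [4, 5, 5]
```

The last value of the running total equals the sum of the weekly totals, since both count every signup exactly once.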