<a href="https://colab.research.google.com/github/nhwhite212/DealingwithDataSpring2021/blob/master/6-Pandas/B-Pandas_and_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Using Pandas together with SQL

### First install the relevant Linux libraries

In [None]:
!sudo apt-get install -y python-dev libmysqlclient-dev && sudo pip install mysqlclient

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

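# Make the graphs a bit prettier, and bigger
# (Matplotlib 3.6+ renamed these styles to 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-whitegrid')
matplotlib.style.use(['seaborn-talk', 'seaborn-ticks', 'seaborn-whitegrid'])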
plt.rcParams['figure.figsize'] = (15, 5)

In [None]:
# Install the SQLAlchemy library if it is not installed
!sudo -H pip3 install -U sqlalchemy

### Importing into DataFrames using read_sql

The `read_sql` function of Pandas allows us to create a dataframe directly from a SQL query. To execute the query, we first set up the connection to the database using the SQLAlchemy library.
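
To try `read_sql` without access to the course server, here is a minimal sketch that uses an in-memory SQLite database as a stand-in. The toy `actors` table and its rows are made up for illustration; only the `pd.read_sql(query, con=...)` call mirrors what we do against MySQL.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the MySQL server,
# and the `actors` table below is a made-up toy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (id INTEGER, first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO actors VALUES (?, ?, ?)",
    [(1, "Ada", "Smith"), (2, "Ben", "Jones"), (3, "Cy", "Lee")],
)

# read_sql runs the query and returns the result set as a DataFrame
df_demo = pd.read_sql("SELECT * FROM actors ORDER BY id LIMIT 2", con=conn)
print(df_demo)
conn.close()
```

The column names of the dataframe come straight from the query result, so renaming a column in SQL (with `AS`) also renames it in Pandas.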

In [None]:
from sqlalchemy import create_engine

In [None]:
conn_string_imdb = 'mysql://{user}:{password}@{host}:{port}/{db}'.format(
    user='student', 
    password='dwdstudent2015', 
    host = 'db.ipeirotis.org', 
    port=3306, 
    db='imdb'
)
engine_imdb = create_engine(conn_string_imdb)
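
The connection string is a URL, so a password containing characters such as `@` or `/` would be misread unless it is escaped first. The sketch below shows one way to do that with the standard library; `build_mysql_url`, the host name, and the credentials are hypothetical and not part of the course setup.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, db=""):
    """Hypothetical helper: build a 'mysql://' URL with user and password escaped."""
    return "mysql://{user}:{password}@{host}:{port}/{db}".format(
        user=quote_plus(user),
        password=quote_plus(password),
        host=host,
        port=port,
        db=db,
    )


# Made-up credentials: '@' and '/' in the password would otherwise break the URL
demo_url = build_mysql_url("student", "p@ss/word", "db.example.org", 3306, "imdb")
print(demo_url)
```

The passwords used in this notebook do not contain such characters, so the plain `.format()` call above works as written.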

#### Retrieve the first 10 rows from the actors table

In [None]:
query = '''
SELECT * FROM actors LIMIT 10
'''

#### Now issue that query with the pandas read_sql function to create a dataframe from the table

In [None]:
df_actors = pd.read_sql(query, con=engine_imdb)

In [None]:
df_actors

#### Query to retrieve the number of movies per year

In [None]:
query = '''
SELECT year, COUNT(*) AS num_movies, COUNT(rating) AS rated_movies
FROM movies 
GROUP BY year
ORDER BY year;
'''

#### Issue the query using the pandas read_sql function. It returns a pandas dataframe. Note how we have used MySQL to do the aggregation and sorting for us.

In [None]:
df_movies = pd.read_sql(query, con=engine_imdb)

In [None]:
df_movies.head(5)
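
The query uses both `COUNT(*)` and `COUNT(rating)`. The first counts all rows in a group, the second only the rows where `rating` is not NULL. A small sketch on a made-up `movies` table (in-memory SQLite as a stand-in for MySQL) makes the difference visible:

```python
import sqlite3

import pandas as pd

# Assumption: a toy `movies` table in in-memory SQLite, with one missing rating
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [("A", 2000, 7.1), ("B", 2000, None), ("C", 2001, 6.4)],
)

counts = pd.read_sql(
    """
    SELECT year, COUNT(*) AS num_movies, COUNT(rating) AS rated_movies
    FROM movies
    GROUP BY year
    ORDER BY year
    """,
    con=conn,
)
print(counts)
conn.close()
```

For the year 2000 the two counts differ, because one of the two toy movies has no rating.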

#### Let's try to plot the results.  The pandas plot method will pick some default settings for matplotlib.

In [None]:
df_movies.plot()

The plot has a couple of issues: the year is drawn as a line of its own, and it is not used as the label of the x-axis. To fix both, we convert the year into a proper datetime variable and then make it the index of the dataframe.

In [None]:
df_movies['year'] = pd.to_datetime(df_movies['year'], format='%Y')
df_movies2 = df_movies.set_index('year')

In [None]:
df_movies2.plot()

### Exercise

* Connect to the Facebook database, and use the `MemberSince` variable from the `Profiles` table to plot the growth of Facebook users.
* (_Learn something new_) Use the [cumsum()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cumsum.html) function of Pandas and plot the total number of registered users over time.
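
As a hint for the second part, here is a minimal sketch of what `cumsum()` does, on made-up monthly sign-up counts (the numbers are not from the Facebook database):

```python
import pandas as pd

# Made-up monthly sign-up counts (not taken from the Facebook database)
signups = pd.Series(
    [5, 3, 8, 2],
    index=pd.to_datetime(["2005-01-01", "2005-02-01", "2005-03-01", "2005-04-01"]),
)

# cumsum() turns per-period counts into a running total
total_users = signups.cumsum()
print(total_users)
```

Calling `total_users.plot()` would then draw the running total over time, since the datetime index becomes the x-axis.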

In [None]:
# your code here

### Further Examples with SQL and Pandas

Now let's run a query to get the political views of Facebook users, broken down by gender.

In [None]:
conn_string_fb = 'mysql://{user}:{password}@{host}:{port}/{db}'.format(
    user='student', 
    password='dwdstudent2015', 
    host = 'db.ipeirotis.org', 
    port=3306, 
    db='facebook'
)
engine_fb = create_engine(conn_string_fb)

In [None]:
describeprofiles = '''
describe Profiles
'''
df=pd.read_sql(describeprofiles,con=engine_fb)
df

In [None]:
polviews_by_gender = '''
SELECT Sex,  PoliticalViews, COUNT(*) AS cnt 
FROM Profiles 
WHERE Sex IS NOT NULL AND PoliticalViews IS NOT NULL 
GROUP BY Sex, PoliticalViews  
ORDER BY  PoliticalViews, Sex
'''


And let's get the dataframe:

In [None]:
df = pd.read_sql(polviews_by_gender, con=engine_fb)
df

In [None]:
# Let's plot this!
# Bleh, this is really ugly...
# Remember that the index of the dataframe becomes the default x-axis
df.plot(kind='bar')

In [None]:
# Pivot, baby!
# Now the index contains the Political Views, which will be our x-axis
dfp = df.pivot_table(index='PoliticalViews', columns='Sex', values='cnt')
dfp
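
To see what the pivot does in isolation, here is a small sketch on made-up counts that have the same long format as the query result (one row per combination of `Sex` and `PoliticalViews`):

```python
import pandas as pd

# Made-up counts in the same long format as the query result
long_df = pd.DataFrame({
    "Sex": ["Female", "Male", "Female", "Male"],
    "PoliticalViews": ["Liberal", "Liberal", "Moderate", "Moderate"],
    "cnt": [30, 20, 10, 15],
})

# One row per PoliticalViews value, one column per Sex value.
# pivot_table aggregates with the mean by default; each (view, sex) pair
# occurs once here, so the mean is just the count itself.
wide_df = long_df.pivot_table(index="PoliticalViews", columns="Sex", values="cnt")
print(wide_df)
```

Because the SQL query already groups by both columns, every cell of the pivoted table corresponds to exactly one row of the original result.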

In [None]:
dfp.plot(kind='bar')

In [None]:
# Let's normalize the columns: we have more females than males, so in the raw counts it seems that there are always more women
dfp = dfp / dfp.sum()
dfp
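
The expression `dfp / dfp.sum()` relies on index alignment: `sum()` returns one total per column, and the division matches those totals to the columns by name. A minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up counts: rows are political views, columns are sexes
counts = pd.DataFrame(
    {"Female": [30, 10], "Male": [20, 20]},
    index=["Liberal", "Moderate"],
)

# counts.sum() is a Series of column totals; the division aligns on column names,
# so every column is divided by its own total
shares = counts / counts.sum()
print(shares)
```

After this step each column sums to 1, so the bars compare the distribution within each sex rather than the raw headcounts.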

In [None]:
dfp.plot(kind='bar')

In [None]:
# OK, now let's try to re-order the list of results according to the logical structure
neworder = ['Very Liberal', 'Liberal', 'Moderate', 'Conservative', 'Very Conservative', 'Libertarian', 'Apathetic', 'Other']
newindex = sorted(dfp.index, key=lambda x: neworder.index(x))
dfp = dfp.reindex(newindex)
dfp
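
One caveat with sorting by `neworder.index(x)`: if the data contains a label that is missing from `neworder`, `list.index` raises a `ValueError`. The sketch below shows a more forgiving variant on made-up data; the label `'Unknown'` and the shortened `preferred` list are invented for the example.

```python
import pandas as pd

# Made-up shares; 'Unknown' is a label that is NOT in the preferred order
shares = pd.Series(
    [0.2, 0.5, 0.1, 0.2],
    index=["Moderate", "Liberal", "Unknown", "Conservative"],
)
preferred = ["Very Liberal", "Liberal", "Moderate", "Conservative"]

# Keep the preferred labels that are actually present, then append the rest
present = [label for label in preferred if label in shares.index]
leftover = [label for label in shares.index if label not in preferred]
ordered = shares.reindex(present + leftover)
print(ordered)
```

Labels that appear in the preferred order but not in the data (here `'Very Liberal'`) are simply skipped, and unexpected labels end up at the bottom instead of stopping the cell with an error.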

In [None]:
dfp.plot(kind='bar')

### Facebook, Favorite Books, and Political Views

In [None]:
Fbooks= '''
describe FavoriteBooks
'''
df_fbooks = pd.read_sql(Fbooks, con=engine_fb)
df_fbooks.head(10)

In [None]:
books = '''
SELECT F.Book, P.PoliticalViews , COUNT(*) AS cnt 
FROM Profiles P JOIN FavoriteBooks F ON F.ProfileID = P.ProfileId  
WHERE PoliticalViews IS NOT NULL AND F.Book IS NOT NULL 
      AND (PoliticalViews = 'Liberal' OR PoliticalViews = 'Conservative')
GROUP BY F.Book, P.PoliticalViews
'''

In [None]:
df_books = pd.read_sql(books, con=engine_fb)
df_books.head(10)

In [None]:
dfp = df_books.pivot_table(index='Book', columns='PoliticalViews', values='cnt')
dfp.head(10)

In [None]:
# If we compute the sums, we will see that we have a very different
# number of likes per political view, due to imbalance in the population
dfp.sum()

In [None]:
# Normalize the values, so that each column sums up to 1.0
dfp = dfp / dfp.sum()
dfp.head(20)

In [None]:
dfp["Liberal_To_Conservative"] = dfp["Liberal"]  / dfp["Conservative"] 
dfp["Conservative_To_Liberal"] = dfp["Conservative"]  / dfp["Liberal"] 
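
A book that was listed by only one of the two groups has a missing value (NaN) after the pivot, so its ratio is NaN as well. The sketch below, on made-up like counts, shows the effect and one possible way to handle it (dropping such books before ranking); other choices, such as filling in a small count, are equally defensible.

```python
import pandas as pd

# Made-up like counts; "Book C" has no Conservative likes, so pivoting leaves NaN
likes = pd.DataFrame({
    "Book": ["Book A", "Book A", "Book B", "Book B", "Book C"],
    "PoliticalViews": ["Liberal", "Conservative", "Liberal", "Conservative", "Liberal"],
    "cnt": [60, 10, 20, 30, 20],
})
table = likes.pivot_table(index="Book", columns="PoliticalViews", values="cnt")

# Normalize each column to shares (sum() skips NaN), then compare the shares
table = table / table.sum()
table["Liberal_To_Conservative"] = table["Liberal"] / table["Conservative"]

# Rows with a NaN ratio are books that only one group listed; drop them before ranking
ranked = table.dropna().sort_values("Liberal_To_Conservative", ascending=False)
print(ranked)
```

In this toy data "Book A" accounts for 60% of the Liberal likes but only 25% of the Conservative likes, which gives a ratio of 2.4.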

In [None]:
liberal_books = dfp[["Liberal_To_Conservative"]].sort_values("Liberal_To_Conservative", ascending=False).head(10)
liberal_books

In [None]:
conservative_books = dfp[["Conservative_To_Liberal"]].sort_values("Conservative_To_Liberal", ascending=False).head(10)
conservative_books

In [None]:
conservative_books.plot(kind='bar')

# Inserting Data in a Database using Pandas

### We need to switch to the bigdata server for this

We also need to download a copy of the NYC Open Data Restaurant Inspections dataset.

In [None]:
!curl http://people.stern.nyu.edu/nwhite/DealingwithDataSpring2021/data/restaurant.csv.gz  -o restaurant.csv.gz

In [None]:
# Read the CSV file
restaurants = pd.read_csv('restaurant.csv.gz', encoding="utf-8", dtype=str)
restaurants.describe()

In [None]:
# Usual bookkeeping regarding datatypes
restaurants["GRADE DATE"] = pd.to_datetime(restaurants["GRADE DATE"], format="%m/%d/%Y")
restaurants["RECORD DATE"] = pd.to_datetime(restaurants["RECORD DATE"], format="%m/%d/%Y")
restaurants["INSPECTION DATE"] = pd.to_datetime(restaurants["INSPECTION DATE"], format="%m/%d/%Y")
restaurants["SCORE"] = pd.to_numeric(restaurants["SCORE"])
restaurants["BORO"] =  pd.Categorical(restaurants["BORO"], ordered=False)
restaurants["GRADE"] =  pd.Categorical(restaurants["GRADE"], categories = ['A', 'B', 'C'], ordered=True)
restaurants["VIOLATION CODE"] =  pd.Categorical(restaurants["VIOLATION CODE"], ordered=False)
restaurants["CRITICAL FLAG"] =  pd.Categorical(restaurants["CRITICAL FLAG"], ordered=False)
restaurants["ACTION"] =  pd.Categorical(restaurants["ACTION"], ordered=False)
restaurants["CUISINE DESCRIPTION"] =  pd.Categorical(restaurants["CUISINE DESCRIPTION"], ordered=False)

In [None]:
# Connect to the MySQL server, but without selecting a database
# I will use the class userid and password for testing

conn_string = 'mysql://{user}:{password}@{host}:{port}/'.format(
    user='DealingS21', password='DealingS21!!', 
    host = 'bigdata.stern.nyu.edu', port=3306)
engine = create_engine(conn_string)

In [None]:
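# Create the database where we want to store the data
# Do not worry about the Warning if the database already exists
# (engine.execute() was removed in SQLAlchemy 2.0, so we run the statement on a connection)
from sqlalchemy import text
with engine.begin() as conn:
    conn.execute(text('CREATE DATABASE IF NOT EXISTS nyc_restaurant_inspections'))
# Reconnect with the database selected, so that every later query uses it
engine = create_engine(conn_string + 'nyc_restaurant_inspections')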

In [None]:
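# We drop the table if it is already there (engine.execute() was removed in SQLAlchemy 2.0)
from sqlalchemy import text
with engine.begin() as conn:
    conn.execute(text('DROP TABLE IF EXISTS nyc_restaurant_inspections.inspections1000'))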


## BEWARE: the pandas to_sql method is VERY SLOW!

I ran this code and created an `inspections` table, which I will keep. It took 14 minutes to load all 394,000 rows, so for class I will load only 1,000 as an example. I am keeping the full `inspections` table for example queries.

In [None]:
# Let's create a small table to test, since insertions are very slow
# It takes over 10 minutes to insert all of the restaurant data

restaurants1000=restaurants.head(1000)

In [None]:
# Store the dataframe as a SQL table, using the to_sql command
# This command is very slow, so this can take a while
restaurants1000.to_sql(name='inspections1000', if_exists='replace', index=False, con=engine, chunksize=500)

In [None]:
# And then we can just retrieve it from the database
df = pd.read_sql("SELECT * FROM inspections1000 LIMIT 100", con=engine)
df.head(5)

In [None]:
### How many rows do we have?
df1= pd.read_sql('select count(*) from inspections1000',con=engine)
df1

In [None]:
### How about in the full table? 
df1=pd.read_sql('select count(*) from inspections',con=engine)
df1

# In-class exercise

Connect to the MySQL server on bigdata.stern.nyu.edu using your team userid and password.

* Copy the connection cell above, and use your team userid and password (`DealingS21GBx` and `DealingS21GBx!!`).
* Use your team database (`DealingS21GBx`).
* Create a table named after your NetID in your team database. It should have two attributes, `a` and `b`, both ints, e.g. `create table nhw1 (a int, b int);`
* Insert the values 1,2 and 3,4 as the first two rows, e.g. `insert into nhw1(a,b) values(1,2);`
* Select * from the table into a dataframe and print it.
* Drop the table.
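
The steps above can be rehearsed offline. The sketch below runs the same create, insert, select, and drop sequence against an in-memory SQLite database, which is only a stand-in for the team MySQL database; `demo_netid` is a placeholder table name.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite stands in for the team MySQL database,
# and `demo_netid` is a placeholder table name
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_netid (a INT, b INT)")
conn.executemany("INSERT INTO demo_netid (a, b) VALUES (?, ?)", [(1, 2), (3, 4)])

# Read the table into a dataframe and print it
demo_rows = pd.read_sql("SELECT * FROM demo_netid", con=conn)
print(demo_rows)

# Clean up
conn.execute("DROP TABLE demo_netid")
conn.close()
```

Against MySQL the SQL statements stay the same; only the connection object changes to the SQLAlchemy engine for your team account.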

In [None]:
# create a small data frame
data=[[1,2],[3,4],[5,6]]
mydf=pd.DataFrame(data,columns=['a','b'])
mydf


In [None]:
# Your code here (change the user and password)
# Connect to the MySQL server, but without selecting a database
# I will use team 10's userid and password for testing

conn_string = 'mysql://{user}:{password}@{host}:{port}/'.format(
    user='DealingS21GB10', password='DealingS21GB10!!', 
    host = 'bigdata.stern.nyu.edu', port=3306)
engine = create_engine(conn_string)

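from sqlalchemy import text  # engine.execute() was removed in SQLAlchemy 2.0

# Reconnect with the team database selected
engine = create_engine(conn_string + 'DealingS21GB10')

with engine.begin() as conn:
    conn.execute(text("DROP TABLE IF EXISTS nhw1"))

# Create a table for the data frame
mydf.to_sql(name='nhw1', index=False, con=engine, chunksize=500)

# List the tables to confirm that nhw1 now exists
pd.read_sql("SHOW TABLES", con=engine)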

In [None]:
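from sqlalchemy import text  # engine.execute() was removed in SQLAlchemy 2.0

# Read the table back, list the tables, then drop the table, all on one connection
with engine.begin() as conn:
    conn.execute(text("USE DealingS21GB10"))
    df = pd.read_sql(text("SELECT * FROM nhw1 LIMIT 100"), con=conn)
    res = pd.read_sql(text("SHOW TABLES"), con=conn)
    conn.execute(text("DROP TABLE nhw1"))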

In [None]:
df

In [None]:
res

# That should get you started on using MySQL and Pandas