<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/01-Pandas/A2-Introduction_Basic_Data_Manipulation_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas: Basic Data Manipulation Techniques

## Setup and preliminaries

Install the libraries needed to connect to MySQL:

In [None]:
!pip3 install -U -q PyMySQL sqlalchemy

To read and process files, we will use pandas, a powerful and widely used Python library. Our next step is to import pandas, along with a few related libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

The code below simply changes the visual style of the plots. (It is optional, and for now you do not need to understand exactly what is happening.)

In [None]:
# Render our plots with high resolution
%config InlineBackend.figure_format = 'retina'

In [None]:
# Make the graphs a bit bigger
matplotlib.style.use(["seaborn-v0_8-talk", "seaborn-v0_8-ticks", "seaborn-v0_8-whitegrid"])

## Reading data using SQL from a MySQL Server

We will use a dataset with [restaurant inspection results in NYC](https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j). The dataset that we are going to use has been cleaned up, normalized, and stored in our MySQL database, under the `doh_restaurants` database.

In [None]:
import os
from sqlalchemy import create_engine
from sqlalchemy import text

conn_string = 'mysql+pymysql://{user}:{password}@{host}/{db}?charset=utf8mb4'.format(
    host = 'db.ipeirotis.org',
    user = 'student',
    password = 'dwdstudent2015',
    db = 'doh_restaurants')

engine = create_engine(conn_string)

We fetch the results of the query using the `read_sql` command.

In [None]:
# This query returns the restaurants in the DOH database
sql = '''
    SELECT R.CAMIS, R.DBA, R.BUILDING, R.STREET, R.ZIPCODE, R.BORO,
           R.CUISINE_DESCRIPTION, R.LATITUDE, R.LONGITUDE, R.NTA
    FROM doh_restaurants.restaurants R
'''

with engine.connect() as connection:
	restaurants = pd.read_sql(text(sql), con=connection)

In [None]:
# This query returns the results of the inspections of each restaurant
sql = '''
    SELECT R.CAMIS, R.DBA, R.ZIPCODE, R.BORO, R.CUISINE_DESCRIPTION, R.NTA,
           I.INSPECTION_DATE, I.INSPECTION_ID,
           I.INSPECTION_TYPE, I.SCORE, I.GRADE
    FROM restaurants R
        JOIN inspections I ON I.CAMIS = R.CAMIS
'''

with engine.connect() as connection:
	inspections = pd.read_sql(text(sql), con=connection)

In [None]:
# This query returns the results and violations captured in the
# latest inspection of each restaurant
sql = '''
    WITH latest_inspection AS (
        SELECT CAMIS, MAX(INSPECTION_DATE) AS INSPECTION_DATE FROM inspections
        GROUP BY CAMIS
    )
    SELECT R.CAMIS, R.DBA, R.ZIPCODE, R.BORO,
           I.INSPECTION_DATE, I.INSPECTION_ID, I.INSPECTION_TYPE,
           V.VIOLATION_CODE, I.SCORE, I.GRADE
    FROM restaurants R
        JOIN latest_inspection L ON R.CAMIS = L.CAMIS
        JOIN inspections I ON I.CAMIS = L.CAMIS AND L.INSPECTION_DATE = I.INSPECTION_DATE
        JOIN violations V ON I.INSPECTION_ID = V.INSPECTION_ID
'''

with engine.connect() as connection:
	violations = pd.read_sql(text(sql), con=connection)
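As a minimal sketch of how `read_sql` turns a query result into a dataframe, the example below uses an in-memory SQLite database instead of the MySQL server. The table contents are made up for illustration, and the `?` placeholder style is specific to SQLite; the course database is not involved.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with a tiny "restaurants" table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (CAMIS INTEGER, DBA TEXT, ZIPCODE TEXT)")
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?)",
    [(1, "CAFE A", "10012"), (2, "DINER B", "10003")],
)
conn.commit()

# read_sql runs the query and returns the result as a dataframe;
# the value for the "?" placeholder is passed through params
toy = pd.read_sql(
    "SELECT CAMIS, DBA FROM restaurants WHERE ZIPCODE = ?",
    con=conn,
    params=("10012",),
)
conn.close()
print(toy)
```

Passing values through `params` rather than pasting them into the SQL string keeps the query text fixed and lets the database driver handle quoting.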

# Selecting a subset of the columns -- `filter()`

In a dataframe, we can specify the column(s) that we want to keep, and get back another dataframe with just the subset of the columns that we want to keep.

In [None]:
inspections

In [None]:
inspections.filter(
    items = ["DBA", "GRADE", "INSPECTION_DATE"]
)

In [None]:
columns = ["CAMIS", "DBA", "GRADE", "INSPECTION_DATE", "SCORE"]

# Notice the use of "chain notation" below
# Chain notation means putting parentheses around
# the command and then having each operation in its
# own line
(
  inspections
  .filter( items = columns )
  .head(10)
)


We can also use the `like` option to find all the column names that include a certain string. For example, to get all the columns that include the string `DATE`:

In [None]:
inspections.filter(
    like = 'DATE'
)

We can expand the functionality and also use regular expressions:

In [None]:
restaurants.filter(
    regex = r'^C' # all the columns that start with C
)
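The three options of `filter()` can be compared side by side on a small dataframe. The sketch below uses made-up values and a hypothetical `GRADE_DATE` column; it also illustrates that `items` silently skips names that do not exist, which is worth remembering when a column name is misspelled.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "CAMIS": [1, 2],
    "DBA": ["CAFE A", "DINER B"],
    "CUISINE_DESCRIPTION": ["Coffee/Tea", "American"],
    "INSPECTION_DATE": pd.to_datetime(["2023-01-05", "2023-02-10"]),
    "GRADE_DATE": pd.to_datetime(["2023-01-06", "2023-02-11"]),
})

by_items = toy.filter(items=["DBA", "CAMIS"])   # exact column names
by_like = toy.filter(like="DATE")               # names containing "DATE"
by_regex = toy.filter(regex=r"^C")              # names starting with "C"

# A name that does not exist is ignored rather than raising an error
missing = toy.filter(items=["DBA", "NO_SUCH_COLUMN"])

print(list(by_like.columns))
print(list(by_regex.columns))
print(list(missing.columns))
```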

### Exercise

Keep the columns `DBA`, `SCORE`, `CUISINE_DESCRIPTION`, and `ZIPCODE` from the `inspections` dataframe.

In [None]:
# your code here

# Renaming Columns -- `rename()`

To do the equivalent of `SELECT attr AS alias` in Pandas, we use the `rename` command, and pass a dictionary specifying which columns we want to rename:



In [None]:
restaurants.rename(
    columns = {
      "CAMIS": "RESTID",
      "DBA": "REST_NAME",
      "BUILDING": "STREET_NUM",
      "BORO": "BOROUGH"
    }
)
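Note that `rename()` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a variable if we want to keep it. A minimal sketch on a toy dataframe with made-up values:

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({"CAMIS": [1, 2], "DBA": ["CAFE A", "DINER B"]})

# rename() returns a renamed copy; "toy" keeps its original column names
renamed = toy.rename(columns={"CAMIS": "RESTID", "DBA": "REST_NAME"})

print(list(toy.columns))
print(list(renamed.columns))
```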

# Selecting rows -- `query()`

To select rows, we can write basic queries using the `query()` command:

In [None]:
# Find all restaurants whose DBA is STARBUCKS
restaurants.query(' DBA == "STARBUCKS" ')

In [None]:
# Find all violations with code 04L (i.e., "has mice")
violations.query(' VIOLATION_CODE == "04L" ')

In [None]:
# We can store the result in a dataframe called  has_mice
has_mice = violations.query(' VIOLATION_CODE == "04L" ')
has_mice

In [None]:
# List the most frequent DBA values in the has_mice dataframe
has_mice["DBA"].value_counts().head(20)

In [None]:
# For comparison, the most frequent DBA names overall across restaurants
restaurants["DBA"].value_counts().head(20)

And we can use more complex conditions.

In [None]:
has_mice_10012 = (
    violations
    .query('  VIOLATION_CODE == "04L" and ZIPCODE == "10012" ')
    .filter( items = ['DBA', 'INSPECTION_DATE'] )
)

has_mice_10012
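Instead of hard-coding the values inside the query string, `query()` can also refer to Python variables by prefixing their names with `@`. The sketch below uses a toy dataframe with made-up values and compares the result with the equivalent boolean-mask selection.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "DBA": ["CAFE A", "DINER B", "CAFE A"],
    "ZIPCODE": ["10012", "10003", "10012"],
    "VIOLATION_CODE": ["04L", "04L", "10F"],
})

code = "04L"
zipcode = "10012"

# "@name" inside the query string refers to a Python variable
result = toy.query("VIOLATION_CODE == @code and ZIPCODE == @zipcode")

# The same selection written with a boolean mask
same = toy[(toy["VIOLATION_CODE"] == code) & (toy["ZIPCODE"] == zipcode)]

print(result)
```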

...and just to have a bit more fun:

In [None]:
# all restaurants with mice
mice = has_mice["DBA"].value_counts()
mice.head(5) # show the top-5

In [None]:
# top-25 most popular restaurant names
topK = 25
top_restaurants = restaurants["DBA"].value_counts().head(topK)
top_restaurants.head(5) # show the top-5

In [None]:
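# Now calculate the ratio of mice violations to number of locations for the top chains
# The division aligns the two series on the DBA name; dropna() then removes
# the names that do not appear in both series (e.g., those not in top_restaurants)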
(mice / top_restaurants).dropna()
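The division above works because pandas aligns the two Series on their index labels before dividing. A small sketch with made-up counts shows the effect: labels present in only one of the two Series produce `NaN`, which `dropna()` then removes.

```python
import pandas as pd

# Made-up counts, indexed by restaurant name
mice_counts = pd.Series({"CHAIN A": 3, "CHAIN B": 1, "LOCAL DINER": 2})
location_counts = pd.Series({"CHAIN A": 30, "CHAIN B": 20, "CHAIN C": 10})

# Division matches values by index label, not by position
ratio = mice_counts / location_counts

# Labels missing from either side become NaN and are dropped here
ratio_clean = ratio.dropna()

print(ratio)
print(ratio_clean)
```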

### Exercise

The following command reads the table `violation_codes`. In addition to the `04L`, check the violation descriptions for the codes `04K`, `04M`, `04N`, and `04O`. Then create an analysis for the restaurants in the area that have these violations.

[This StackOverflow post](https://stackoverflow.com/questions/33990955/combine-pandas-dataframe-query-method-with-isin) explains how to use the `IN` construct with Pandas.

In [None]:
with engine.connect() as connection:
  sql = "SELECT * FROM doh_restaurants.violation_codes"
  codes = pd.read_sql(text(sql), con=connection)

#### Solution

In [None]:
filthy_near_NYU = (
    violations
    .query('  VIOLATION_CODE in ["04K", "04L", "04M", "04N", "04O"]  ' )
    .query('  ZIPCODE in ["10012", "10003", "10014"] ')
    .query('  INSPECTION_DATE > "2023-01-01" ')
    .filter( items = ['DBA', 'INSPECTION_DATE'] )
    .sort_values("INSPECTION_DATE", ascending=False)
    .drop_duplicates()
)

filthy_near_NYU.head(20)
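The `in` operator inside `query()` plays the role of SQL's `IN`. As a sketch on a toy dataframe with made-up values, the example below shows that it selects the same rows as the `isin()` method.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "DBA": ["CAFE A", "DINER B", "BAKERY C", "PIZZA D"],
    "VIOLATION_CODE": ["04K", "04L", "10F", "04N"],
})

codes = ["04K", "04L", "04M", "04N", "04O"]

# SQL-like IN inside query(), referring to the Python list with "@"
via_query = toy.query("VIOLATION_CODE in @codes")

# The same selection with isin() and a boolean mask
via_isin = toy[toy["VIOLATION_CODE"].isin(codes)]

print(via_query)
```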

# Selecting distinct values -- `drop_duplicates()`

We can do the equivalent of `SELECT DISTINCT` in Pandas with `drop_duplicates()`:

In [None]:
(
    restaurants
    .query(' CUISINE_DESCRIPTION == "Coffee/Tea"  and ZIPCODE == "10012" ')
    .filter( items = ['DBA'])
    .drop_duplicates()
)
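By default, `drop_duplicates()` compares entire rows, which is why the example above first keeps only the `DBA` column. A minimal sketch on made-up data illustrates the difference, along with the `subset` parameter, which checks for duplicates on selected columns while keeping all columns in the result.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "DBA": ["CAFE A", "CAFE A", "DINER B"],
    "ZIPCODE": ["10012", "10003", "10012"],
})

# Distinct names: keep one column, then drop repeated rows
distinct_names = toy.filter(items=["DBA"]).drop_duplicates()

# No two full rows are identical, so nothing is dropped here
distinct_rows = toy.drop_duplicates()

# Compare only on DBA; the first row for each name is kept
first_per_name = toy.drop_duplicates(subset=["DBA"])

print(distinct_names)
print(first_per_name)
```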

# Sorting values -- `sort_values()`

And we can do the equivalent of `ORDER BY` by using the `.sort_values()` command:

In [None]:
(
    has_mice_10012
    .sort_values("INSPECTION_DATE", ascending=False)
    .head(15)
)

In [None]:
(
    has_mice_10012
    .sort_values(["INSPECTION_DATE","DBA"], ascending=[False,True])
    .head(15)
)
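When several sort keys are given, the later keys only break ties among rows that are equal on the earlier keys. The sketch below, on a toy dataframe with made-up values, sorts by date (newest first) and then by name (alphabetically).

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "DBA": ["DINER B", "CAFE A", "BAKERY C"],
    "INSPECTION_DATE": pd.to_datetime(["2023-03-01", "2023-03-01", "2023-01-15"]),
})

# Newest date first; rows with the same date are ordered by name
ordered = toy.sort_values(["INSPECTION_DATE", "DBA"], ascending=[False, True])

print(ordered)
```

The original row labels travel with the rows, so the index of the result is no longer in increasing order.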

# Join two tables -- `pd.merge()`

In [None]:
# Fetch data about population of NYC neighborhoods (NTAs)
nyc_population_url = 'https://data.cityofnewyork.us/api/views/rnsn-acs2/rows.csv?accessType=DOWNLOAD'
nyc_pop = pd.read_csv(nyc_population_url)
nyc_pop

In [None]:
nyc_pop.columns

In [None]:
# Change the name of the columns
nyc_pop.columns = ['BOROUGH', 'FIPS_COUNTY', 'NTA_CODE',
       'NTA_NAME', 'POPULATION_2000', 'POPULATION_2010',
       'POP_DIFF_NUMBER', 'POP_DIFF_PCT']

# Drop unnecessary columns
# nyc_pop = nyc_pop.drop(['POPULATION_2000', 'POP_DIFF_NUMBER', 'POP_DIFF_PCT'], axis='columns')

# Drop rows with missing values
nyc_pop = nyc_pop.dropna()

## Merging two dataframes

In [None]:
merged = pd.merge(
  left = inspections,
  right = nyc_pop,
  left_on = 'NTA',
  right_on = 'NTA_CODE'
)

merged
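By default, `pd.merge()` performs an inner join, so rows without a matching key disappear from the result. As a sketch on toy dataframes (the neighborhood codes and numbers are made up), a left join with `indicator=True` makes it easy to see which rows found no match.

```python
import pandas as pd

# Toy dataframes with made-up values
inspections_toy = pd.DataFrame({
    "CAMIS": [1, 2, 3],
    "NTA": ["MN23", "BK09", None],   # the third restaurant has no NTA
    "SCORE": [10, 25, 7],
})
pop_toy = pd.DataFrame({
    "NTA_CODE": ["MN23", "BK09", "QN01"],
    "POPULATION_2010": [50000, 30000, 40000],
})

# Default (inner) join: only rows with a match on both sides survive
inner = pd.merge(left=inspections_toy, right=pop_toy,
                 left_on="NTA", right_on="NTA_CODE")

# Left join: keep every inspection; the _merge column reports the match status
left = pd.merge(left=inspections_toy, right=pop_toy,
                left_on="NTA", right_on="NTA_CODE",
                how="left", indicator=True)

print(len(inner))
print(left[["CAMIS", "NTA", "_merge"]])
```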

In [None]:
# How would you improve the plot below?

merged.plot(
    kind='scatter',
    y = 'SCORE',
    x = 'POPULATION_2010',
    s = 1
)

# Calculating aggregates per groups -- `groupby()`

In [None]:
# Calculate the average of the "SCORE" variable, grouped by neighborhood name
merged.groupby('NTA_NAME')["SCORE"].mean()

In [None]:
merged.groupby('NTA_NAME')["SCORE"].mean().sort_values()
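To see what `groupby(...).mean()` computes, here is a minimal sketch on a toy dataframe with made-up values. Note that missing scores are skipped when computing the mean of each group.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values; one score is missing
toy = pd.DataFrame({
    "NTA_NAME": ["East Village", "East Village", "SoHo", "SoHo"],
    "SCORE": [10.0, 20.0, 12.0, np.nan],
})

# One row per group; NaN values are ignored by mean()
avg = toy.groupby("NTA_NAME")["SCORE"].mean()

# Sort the groups by their average score, lowest first
avg_sorted = avg.sort_values()

print(avg_sorted)
```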

In [None]:
# Calculate the average score per population
merged.groupby('POPULATION_2010')["SCORE"].mean()

In [None]:
# Calculate the average score per population value (a proxy for neighborhood)

# reset_index() converts the Series returned by the aggregation into a DataFrame
grouped_df = merged.groupby('POPULATION_2010')["SCORE"].mean().reset_index()

grouped_df.plot(
    kind='scatter',
    y = 'SCORE',
    x = 'POPULATION_2010',
    s = 5, figsize = (5,5)
)
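The effect of `reset_index()` is easier to see on a small example. In the sketch below (made-up values), the grouped result is a Series whose index holds the group keys; `reset_index()` moves those keys into a regular column, which is what the plotting call needs.

```python
import pandas as pd

# Toy dataframe with made-up values
toy = pd.DataFrame({
    "POPULATION_2010": [1000, 1000, 2000],
    "SCORE": [10.0, 20.0, 12.0],
})

# The result of the aggregation is a Series indexed by POPULATION_2010
as_series = toy.groupby("POPULATION_2010")["SCORE"].mean()

# reset_index() turns the index into an ordinary column
as_frame = as_series.reset_index()

print(type(as_series).__name__)
print(as_frame)
```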

In [None]:
# We can use "seaborn", a visualization library, to create a better version
# of the plot, with a regression line added
# https://seaborn.pydata.org/generated/seaborn.lmplot.html
sns.lmplot(
    data = grouped_df,
    x='POPULATION_2010',
    y = 'SCORE'
)

In [None]:
# count how many inspections per neighborhood
grouped_df = merged.groupby('NTA_NAME')['CAMIS'].count()
grouped_df

#### Multiple aggregations per group -- agg() function

In [None]:
(
  merged
  .groupby('NTA_NAME')
  .agg(
    score_mean = ('SCORE', 'mean'), # calculate the mean of the score
    inspections = ('CAMIS', 'count'), # count the number of inspections
    graded_restaurants = ('CAMIS', 'nunique') # count unique restaurant IDs
  )
  .sort_values('inspections', ascending=False) # sort in descending order of inspections
  .tail(20) # show the last 20 lines
)
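In the `agg()` call above, each keyword argument names an output column and pairs an input column with an aggregation function. A minimal sketch on made-up data shows the difference between `count` (non-missing values) and `nunique` (distinct values); the output name `scored` is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values; restaurant 1 was inspected twice
toy = pd.DataFrame({
    "NTA_NAME": ["A", "A", "A", "B"],
    "CAMIS": [1, 1, 2, 3],
    "SCORE": [10.0, 14.0, 30.0, np.nan],
})

summary = toy.groupby("NTA_NAME").agg(
    score_mean=("SCORE", "mean"),      # average score, ignoring NaN
    inspections=("CAMIS", "count"),    # number of rows with a CAMIS value
    restaurants=("CAMIS", "nunique"),  # number of distinct restaurants
    scored=("SCORE", "count"),         # number of non-missing scores
)

print(summary)
```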

In [None]:
(
  inspections
  .groupby('INSPECTION_DATE')
  .agg(
    score_mean = ('SCORE', 'mean'), # calculate the mean of the score
    graded_restaurants = ('CAMIS', 'nunique') # count unique restaurant IDs
  )
  .tail(20) # show the last 20 lines
)

In [None]:
(
  inspections
  .groupby('INSPECTION_DATE')
  .agg(
    score_mean = ('SCORE', 'mean'), # calculate the average score for the date
    graded_restaurants = ('CAMIS', 'nunique') # and the number of restaurants
  )
  .query('graded_restaurants>10') # keep only days with more than 10 graded restaurants
  .filter(items=['score_mean']) # we only want to plot the score
  .plot()
)

In [None]:
(
  inspections
  .groupby('INSPECTION_DATE')
  .agg(
    score_mean = ('SCORE', 'mean'), # calculate the average score for the date
    graded_restaurants = ('CAMIS', 'nunique') # and the number of restaurants
  )
  .query('graded_restaurants>10') # keep only days with more than 10 graded restaurants
  .filter(items=['score_mean']) # we only want to plot the score
  .resample('1M').mean() # change the frequency to 1 month, and show avg score per month
  .plot(
    style='--o', # use a dashed line and circles as markers
    linewidth=2, # the line should be 2 points wide
    markersize=8, # the marker size set to 8
  )
)
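The `resample()` step requires a date-based index, which is what grouping by `INSPECTION_DATE` provides. The sketch below illustrates the idea on made-up daily values. It uses the `'MS'` (month start) alias as an assumption for the example; the cell above uses month-end labels instead, and recent pandas versions spell the month-end alias `'ME'`.

```python
import pandas as pd

# Made-up daily averages, indexed by date
dates = pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10", "2023-02-25"])
daily = pd.DataFrame({"score_mean": [10.0, 20.0, 12.0, 16.0]}, index=dates)

# Group the rows into calendar months and average within each month;
# with "MS" each month is labeled by its first day
monthly = daily.resample("MS").mean()

print(monthly)
```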

# Aggregation functions -- `agg()`

In [None]:
inspections['SCORE'].agg('mean')

In [None]:
inspections['SCORE'].agg(['mean','std','count','nunique'])

In [None]:
inspections.agg(
    {
        'SCORE': ['mean','std','count','nunique'],
        'CAMIS':  ['nunique','count']
    }
)

In [None]:
inspections.agg(
        num_scored_inspections = ('SCORE', 'count'),
        mean_score = ('SCORE', 'mean'),
        std_score  = ('SCORE', 'std'),
        num_entries = ('CAMIS',  'count'),
        num_restaurants = ('CAMIS',  'nunique'),
  )