# PANDAS 4EVER

Import 

- pandas under the alias pd 
- datetime under the alias dt
- matplotlib.pyplot under the alias plt

Run 
- %matplotlib inline

Read in as data
- the csv `FoodServiceData_23_0` in the data folder and assign to the variable `food`

In [123]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

food = pd.read_csv('data/FoodServiceData_23_0.csv')

## Data Exploration and Cleaning

The first question to ask yourself of a dataset: "what is this dataset treating as an observation?"

Think of an "observation" as an "event" or a "subject".  For example, an observation could be a:

- specific subject, like an individual person, with features about that person's characteristics or behaviors (medical data like `blood pressure` or `test results`, economic or sociological data like `yearly income` or `crime rate of neighborhood in which they live`, behavioral data like what products they purchased)

- aggregated subject, like in the Boston housing dataset, where each row was a suburb/town.  Features can be aggregated statistics about things within the region - like `crime rate` or `median house value` - or it can be about the specific region itself, such as `distance to Hahvahd Yahd`

- event, where each row isn't tied to a specific identity but instead tied to a specific action that occurred. Often, these types of datasets will have a number of features that act as keys that distinguish events from each other, as well as features containing data about the event. For example, a store with multiple locations might have a dataset of "transactions", where the key features for each row are `Store`, `Time` and `Transaction ID`, with other features `Item Purchased`, `Payment Method`, `Coupons Used`, etc.  Notice that the same type of data - purchasing items - can be organized as either features of a "person" or an "event".

Figuring out which "observation" makes a row is an important part of figuring out how to analyze a dataset.  
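
To make the "person" versus "event" point concrete, here is a minimal sketch using made-up purchase data (the column names and values are hypothetical and have nothing to do with the food dataset). It shows the same information organized first as one row per event and then as one row per person.

```python
import pandas as pd

# Hypothetical transactions: one row per purchase event
events = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "Customer": ["ann", "bob", "ann", "ann"],
    "Item": ["milk", "eggs", "bread", "milk"],
    "Price": [2.0, 3.0, 2.5, 2.0],
})

# Same data reorganized so that each row is a person (subject)
people = (
    events.groupby("Customer")
    .agg(n_purchases=("TransactionID", "count"), total_spent=("Price", "sum"))
    .reset_index()
)
print(people)
```

In the event layout the key is `TransactionID`; in the subject layout the key becomes `Customer` and the purchases collapse into aggregated features.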

Take a look at the first five rows.  How does this dataset appear to be organized?  What is an "observation"?  What are the features?

In [124]:
#Your code here
food.head()
# Each row looks like one inspection of an establishment: InspectionID differs on every row,
# while EstablishmentID 41292 appears twice with different inspection dates.

Unnamed: 0,EstablishmentID,InspectionID,EstablishmentName,PlaceName,Address,Address2,City,State,Zip,TypeDescription,Latitude,Longitude,InspectionDate,Score,Grade,NameSearch,Intersection
0,73002,1386666,JU-LI CREATIONS,,8621 HI VIEW LN,,LOUISVILLE,KY,40272,CATERERS,38.1212,0.0,2018-06-08 00:00:00,100.0,A,JU-LI CREATIONS,
1,41292,1386726,OLE HICKORY PIT BAR B Q,,6106 OLD SHEPHERDSVILLE RD,,LOUISVILLE,KY,40228,FOOD SERVICE,38.1628,-85.6604,2018-05-21 00:00:00,,,OLE HICKORY PIT BAR B Q,
2,41292,1386727,OLE HICKORY PIT BAR B Q,,6106 OLD SHEPHERDSVILLE RD,,LOUISVILLE,KY,40228,FOOD SERVICE,38.1628,-85.6604,2018-06-05 00:00:00,,,OLE HICKORY PIT BAR B Q,
3,90821,1386729,SONIC DRIVE-IN,,8600 AMBROSSE LN,,LOUISVILLE,KY,40299,FOOD SERVICE,38.1961,-85.603,2018-06-07 00:00:00,,,SONIC DRIVE-IN,
4,75560,1386745,PROOF ON MAIN,,702 W MAIN ST,,LOUISVILLE,KY,40202,FOOD SERVICE,38.257,-85.7618,2018-06-12 00:00:00,100.0,A,PROOF ON MAIN,


Which have nulls in them?

In [125]:
#Your code here
food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9809 entries, 0 to 9808
Data columns (total 17 columns):
EstablishmentID      9809 non-null int64
InspectionID         9809 non-null int64
EstablishmentName    9809 non-null object
PlaceName            0 non-null float64
Address              9809 non-null object
Address2             0 non-null float64
City                 9809 non-null object
State                9809 non-null object
Zip                  9809 non-null int64
TypeDescription      9809 non-null object
Latitude             9809 non-null float64
Longitude            9809 non-null float64
InspectionDate       9809 non-null object
Score                8090 non-null float64
Grade                6271 non-null object
NameSearch           9809 non-null object
Intersection         0 non-null float64
dtypes: float64(6), int64(3), object(8)
memory usage: 1.3+ MB


In [126]:
# PlaceName, Address2, Intersection are ALL-null columns
# Score and Grade have some null rows. 

#### There are 3 features that are all nulls, let's get rid of them.

**First**, use a method to drop a specific column.  **Then**, for the other two, use a method that will drop all columns that are completely null. 

Check that only those columns were dropped.

In [127]:
#Your code here
# First drop PlaceName:
food.drop('PlaceName', axis = 1, inplace = True)

In [128]:
# Now drop the other 2 columns that are completely null:
food.dropna(axis = 1, how = 'all', inplace = True)

In [129]:
food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9809 entries, 0 to 9808
Data columns (total 14 columns):
EstablishmentID      9809 non-null int64
InspectionID         9809 non-null int64
EstablishmentName    9809 non-null object
Address              9809 non-null object
City                 9809 non-null object
State                9809 non-null object
Zip                  9809 non-null int64
TypeDescription      9809 non-null object
Latitude             9809 non-null float64
Longitude            9809 non-null float64
InspectionDate       9809 non-null object
Score                8090 non-null float64
Grade                6271 non-null object
NameSearch           9809 non-null object
dtypes: float64(3), int64(3), object(8)
memory usage: 1.0+ MB


#### For now, let's only look at rows w/ values in the `Score` column

Drop all rows w/ nulls for `Score`.  Make sure you print out how many rows there are pre-drop, how many you dropped, and how many there are after dropping!

In [115]:
#Your code here
# Pre-drop:
food.shape

(9809, 14)

In [133]:
# Rows to drop:
rows_to_drop = food[food['Score'].isna()].index
rows_to_drop.shape

(1719,)

In [134]:
# Rows after dropping:
food.drop(rows_to_drop, axis = 0, inplace = True)
food.shape

(8090, 14)

#### Looks like there might be a relationship in nulls b/t `Score` and `Grade`

Do all the nulls of `Score` also have nulls for `Grade`? Vice versa?

In [135]:
#Your code here
# food.info()
score_nullGrade = food['Score'].notnull() & food['Grade'].isnull()

# score_nullGrade
grade_nullScore = food['Grade'].notnull() & food['Score'].isnull()

# grade_nullScore

print(f'Rows w/ Score and null Grade: {len(food[score_nullGrade])}')
print(f'Rows w/ Grade and null Score: {len(food[grade_nullScore])}')

Rows w/ Score and null Grade: 1819
Rows w/ Grade and null Score: 0


#### Let's see if we can fill in those `Grade` values from `Score`

How does `Grade` map onto `Score`?  Let's find the rows that have both `Grade` and `Score` values, group by `Grade`, and see the min, max and mean for `Score` for each `Grade`

In [136]:
#Your code here
food_grade = food[food['Grade'].notna()]
food_grade.groupby('Grade')['Score'].agg(['count', 'min', 'max', 'mean'])
food.shape

(8090, 14)

#### Whelp.  Let's just drop `Grade` then

In [137]:
#Your code here
food.drop('Grade', axis = 1, inplace = True)

In [138]:
food.shape

(8090, 13)

#### Let's familiarize ourselves with the levels of the categories for the features that are object types

In [None]:
#Your code here

#### Do you see some columns that might be duplicated?

Test to see if they're identical

In [None]:
#Your code here
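
A minimal sketch of one way to test two columns for equality, using `Series.equals` on a toy frame. The pair of columns compared here is only a guess based on the five rows shown earlier, and the values are copied from that preview.

```python
import pandas as pd

toy = pd.DataFrame({
    "EstablishmentName": ["JU-LI CREATIONS", "SONIC DRIVE-IN"],
    "NameSearch": ["JU-LI CREATIONS", "SONIC DRIVE-IN"],
    "City": ["LOUISVILLE", "LOUISVILLE"],
})

# Series.equals compares values position by position and treats NaNs
# in the same position as equal
same = toy["EstablishmentName"].equals(toy["NameSearch"])
different = toy["EstablishmentName"].equals(toy["City"])
print(same, different)
```

Unlike `(a == b).all()`, `equals` does not report a mismatch where both columns hold NaN in the same row.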

#### Of the two identical columns, drop the one that comes second

In [None]:
#Your code here

#### Let's inspect the `InspectionDate` column

What type is it?

In [None]:
#Your code here

Convert the column to datetime object

In [None]:
#Your code here
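
A short sketch of the conversion with `pd.to_datetime`, shown on a toy column whose strings follow the format seen in the preview above.

```python
import pandas as pd

toy = pd.DataFrame(
    {"InspectionDate": ["2018-06-08 00:00:00", "2018-05-21 00:00:00"]}
)

# Parse the strings into timestamps and overwrite the column
toy["InspectionDate"] = pd.to_datetime(toy["InspectionDate"])
print(toy["InspectionDate"].dtype)
```

Once converted, the `.dt` accessor exposes parts of the date such as `.dt.month` or `.dt.day_name()`.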

## Data Manipulation

#### Let's keep working with that `InspectionDate` column

Create a column that shows the day of inspection

In [231]:
#Your code here

#### Get mean score per day

In [None]:
#Your code here
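
A sketch of the day column and the per-day mean on toy data. It assumes "day" means the calendar date; if the weekday is wanted instead, `.dt.day_name()` would be the grouping key. The column name `InspectionDay` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "InspectionDate": pd.to_datetime(
        ["2018-06-08 09:00", "2018-06-08 14:00", "2018-06-12 10:00"]
    ),
    "Score": [100.0, 90.0, 96.0],
})

# Normalize timestamps to midnight so inspections on the same day share a key
toy["InspectionDay"] = toy["InspectionDate"].dt.normalize()

# Average score for each calendar day
mean_by_day = toy.groupby("InspectionDay")["Score"].mean()
print(mean_by_day)
```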

#### Graph!

Give it a title "Average Inspection Score by Date"

Label the axes "Date" and "Avg Inspection Score"

In [None]:
#Your code here

#### Let's say we wanted to compare it to a city that had scores that dropped down to 80

Re-set the scale of the y-axis so it starts at 75 and ends at 100.  Re-graph.

In [None]:
#Your code here
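
The plotting steps can be sketched as follows, on a made-up series of daily means. The `Agg` backend line is only there so the sketch runs outside a notebook; inside the notebook, `%matplotlib inline` already handles display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily averages standing in for the real per-day means
mean_by_day = pd.Series(
    [95.0, 96.0, 92.5],
    index=pd.to_datetime(["2018-06-08", "2018-06-12", "2018-06-13"]),
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(mean_by_day.index, mean_by_day.values)
ax.set_title("Average Inspection Score by Date")
ax.set_xlabel("Date")
ax.set_ylabel("Avg Inspection Score")
ax.set_ylim(75, 100)  # fixed y-range for comparison with another city
fig.autofmt_xdate()   # tilt the date labels so they do not overlap
```

Fixing the y-limits on both plots keeps the two cities visually comparable, at the cost of compressing small day-to-day variation.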

Let's see how `Score` breaks down by `TypeDescription`.

Create two columns, one whose value is the mean `Score` of the `TypeDescription` value for that row, one whose value is the std of `Score`
- Groupby `TypeDescription` and calc the mean and std of `Score` 
- Merge with `food` on `TypeDescription` value

In [None]:
#Your code here
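
A sketch of the group-then-merge pattern on a toy frame. The output column names `TypeDescription_Mean` and `TypeDescription_Std` follow the names used later in this notebook; the data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "TypeDescription": ["FOOD SERVICE", "FOOD SERVICE", "CATERERS", "CATERERS"],
    "Score": [100.0, 90.0, 98.0, 94.0],
})

# One row per TypeDescription with the mean and sample std of Score
type_stats = (
    toy.groupby("TypeDescription")["Score"]
    .agg(TypeDescription_Mean="mean", TypeDescription_Std="std")
    .reset_index()
)

# Attach the group statistics to every inspection row
merged = toy.merge(type_stats, on="TypeDescription", how="left")
print(merged)
```

`groupby(...).transform("mean")` would give the same per-row values without a merge; the merge version is shown because the exercise asks for it.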

Calculate a new column that's the difference between an inspection's `Score` and its `TypeDescription_Mean` in units of `TypeDescription_Std`

In [None]:
#Your code here

Find the values of `EstablishmentName` of the 20 inspections whose `Score` most exceeds its `TypeDescription_Mean`

In [None]:
#Your code here
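
A sketch of the standardized difference and the top-N lookup on toy data. The column name `Score_Z` is hypothetical, and "most exceeds" is read here as largest in units of standard deviation; ranking by the raw difference would be an equally reasonable reading.

```python
import pandas as pd

# Toy rows with the group statistics already attached (values are made up)
toy = pd.DataFrame({
    "EstablishmentName": ["A", "B", "C", "D"],
    "Score": [100.0, 90.0, 98.0, 94.0],
    "TypeDescription_Mean": [95.0, 95.0, 96.0, 96.0],
    "TypeDescription_Std": [5.0, 5.0, 4.0, 4.0],
})

# Difference from the group mean, expressed in group standard deviations
toy["Score_Z"] = (
    toy["Score"] - toy["TypeDescription_Mean"]
) / toy["TypeDescription_Std"]

# Names of the inspections with the largest standardized difference
top = toy.nlargest(2, "Score_Z")["EstablishmentName"]
print(top.tolist())
```

On the full dataset, `nlargest(20, ...)` would replace `nlargest(2, ...)`.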

# Import Libraries

In [4]:
# SQL Connection and Querying
import sqlite3

# Data manipulation
import pandas as pd

# API Connection
import requests

# Visualization
import matplotlib.pyplot as plt

# SQL

![](index_files/schema.png)

Open a connection to ```chinook.db```

In [2]:
# Your code here


## 1.

>Select all columns and rows from the genres table

In [1]:
# Your code here
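
A sketch of opening a connection and reading a whole table into a DataFrame. To stay self-contained it uses an in-memory database with a made-up `genres` table; in the notebook the connection would instead point at `chinook.db` (the exact path is an assumption).

```python
import sqlite3
import pandas as pd

# In the notebook this would be sqlite3.connect("chinook.db"); an in-memory
# stand-in with a made-up genres table keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genres (GenreId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO genres (Name) VALUES (?)", [("Rock",), ("Jazz",), ("Blues",)]
)

# SELECT * returns every column and every row of the table
genres = pd.read_sql("SELECT * FROM genres", conn)
print(genres)
conn.close()
```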


## 2.

1. Select the ```City``` column from the ```customers``` table 
2. Select the ```Name``` column from the ```genres``` table, aliased as "Genre".
3. Create a column that counts the number of purchases made from each city for Blues music.
4. Sort the results by that count in descending order.
5. Return the top ten cities.

In [3]:
# Your code here
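
The steps above can be sketched as a chain of joins. The schema image is not reproduced here, so the table layout and join keys below (`CustomerId`, `InvoiceId`, `TrackId`, `GenreId`, plus the `invoices` and `tracks` tables) are assumptions based on the usual Chinook layout, and the rows are made up. Check the real schema before reusing the query.

```python
import sqlite3

# Minimal stand-in for the assumed Chinook tables, with made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (CustomerId INTEGER, City TEXT);
CREATE TABLE invoices (InvoiceId INTEGER, CustomerId INTEGER);
CREATE TABLE invoice_items (InvoiceId INTEGER, TrackId INTEGER);
CREATE TABLE tracks (TrackId INTEGER, GenreId INTEGER);
CREATE TABLE genres (GenreId INTEGER, Name TEXT);
INSERT INTO customers VALUES (1, 'Prague'), (2, 'Paris');
INSERT INTO invoices VALUES (10, 1), (11, 2), (12, 1);
INSERT INTO invoice_items VALUES (10, 100), (11, 100), (12, 100), (12, 101);
INSERT INTO tracks VALUES (100, 6), (101, 1);
INSERT INTO genres VALUES (6, 'Blues'), (1, 'Rock');
""")

# Count Blues purchases per city, largest first, top ten only
query = """
SELECT c.City, g.Name AS Genre, COUNT(*) AS Purchases
FROM customers AS c
JOIN invoices AS i ON i.CustomerId = c.CustomerId
JOIN invoice_items AS ii ON ii.InvoiceId = i.InvoiceId
JOIN tracks AS t ON t.TrackId = ii.TrackId
JOIN genres AS g ON g.GenreId = t.GenreId
WHERE g.Name = 'Blues'
GROUP BY c.City, g.Name
ORDER BY Purchases DESC
LIMIT 10;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```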


## 3.

1. Select the ```FirstName``` column from the ```customers``` table
2. Select the ```LastName``` column from the ```customers``` table
3. Select the ```Email``` column from the ```customers``` table
4. Create a new column that is the multiplication of the ```UnitPrice``` and ```Quantity``` columns from the ```invoice_items``` table. 
    - Alias this column as ```Total```.
5. Use ```GROUP BY```  to return the sum total for each customer
6. Sort by ```Total``` in descending order
7. Return the top 20 highest spending customers.

In [None]:
# Your code here
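
A sketch of the aggregation on a made-up stand-in database. As before, the join keys and the intermediate `invoices` table are assumptions about the Chinook layout, and the customers are invented.

```python
import sqlite3

# Minimal stand-in tables with made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (CustomerId INTEGER, FirstName TEXT, LastName TEXT, Email TEXT);
CREATE TABLE invoices (InvoiceId INTEGER, CustomerId INTEGER);
CREATE TABLE invoice_items (InvoiceId INTEGER, UnitPrice REAL, Quantity INTEGER);
INSERT INTO customers VALUES (1, 'Ada', 'Lovelace', 'ada@example.com'),
                             (2, 'Alan', 'Turing', 'alan@example.com');
INSERT INTO invoices VALUES (10, 1), (11, 2);
INSERT INTO invoice_items VALUES (10, 1.5, 2), (10, 1.0, 1), (11, 2.0, 1);
""")

# Sum UnitPrice * Quantity per customer, highest spenders first
query = """
SELECT c.FirstName, c.LastName, c.Email,
       SUM(ii.UnitPrice * ii.Quantity) AS Total
FROM customers AS c
JOIN invoices AS i ON i.CustomerId = c.CustomerId
JOIN invoice_items AS ii ON ii.InvoiceId = i.InvoiceId
GROUP BY c.CustomerId
ORDER BY Total DESC
LIMIT 20;
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

Grouping by `CustomerId` rather than by name avoids merging two customers who happen to share a name.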


# API


>For this review, we will take a look at three separate APIs and work through the process of writing requests based on each API's documentation.

## Public Holiday API

>This API provides public holiday information for more than 90 countries. 

>The API's Documentation can be found [here](https://date.nager.at/swagger/index.html)



**Write a request to return all available countries**

In [4]:
# Your code here


**Convert the results of our request to a DataFrame**

In [19]:
# Your code here


**What is the key for the United States?**

In [26]:
# Your code here
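
A sketch of how the country lookup might go, without touching the network. The endpoint paths are my recollection of the v3 routes and should be checked against the Swagger page linked above; the sample payload, its field names (`countryCode`, `name`), and the year are illustrative assumptions.

```python
import pandas as pd

# Assumed base path; verify against the API documentation
BASE_URL = "https://date.nager.at/api/v3"

def available_countries_url():
    """URL assumed to list every supported country."""
    return f"{BASE_URL}/AvailableCountries"

def public_holidays_url(year, country_code):
    """URL assumed to list public holidays for one country and year."""
    return f"{BASE_URL}/PublicHolidays/{year}/{country_code}"

# Illustrative payload in the shape the countries endpoint is assumed to return
sample = [
    {"countryCode": "DE", "name": "Germany"},
    {"countryCode": "US", "name": "United States"},
]
countries = pd.DataFrame(sample)

# Look up the key by country name
us_key = countries.loc[countries["name"] == "United States", "countryCode"].iloc[0]
print(us_key, public_holidays_url(2024, us_key))
```

In the notebook, `requests.get(available_countries_url()).json()` would take the place of `sample`.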

**Make a request to the API that returns the public holidays for the United States and assign the result to ```us```**

In [27]:
# Your code here

**Convert ```us``` to a DataFrame**

In [None]:
# Your code here


## iTunes API

Documentation for this API can be found [here](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/)

Submit a request to the iTunes API that returns data on Harry Potter Audio Books

In [None]:
# Your code here
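
A sketch of building the search URL from query parameters. The `/search` endpoint and the `term` and `media` parameters are taken from the iTunes Search API as I understand it; confirm them against the linked documentation. The URL is built with the standard library so the sketch runs without network access.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://itunes.apple.com/search"

# Search term plus a media filter restricting results to audiobooks
params = {"term": "harry potter", "media": "audiobook"}

# urlencode replaces the space in the term with "+"
url = f"{SEARCH_URL}?{urlencode(params)}"
print(url)
```

With `requests`, the equivalent call would be `requests.get(SEARCH_URL, params=params).json()`, which handles the encoding itself.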


### Level Up

Using the data from the Harry Potter Audio Books request, collect the artistId for each entry and use those IDs to make a single ```https://itunes.apple.com/lookup?id={}&entity=audiobook&sort=recent``` request. 

To do this:
- Every id should be added to a string
- Each id should be followed by a comma, i.e. ```id1,id2,id3,id4```
    - The final id should not be followed by a comma
- No id should be added to the string more than once.

In [None]:
# Your code here
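
The rules above can be sketched with `dict.fromkeys` for de-duplication and `str.join` for the commas. The `results` list below is a hypothetical stand-in for the search response, with made-up ids; `ARTIST_IDS` matches the name used in the cell that follows.

```python
# Hypothetical stand-in for the "results" list of the search response
results = [
    {"artistId": 111, "collectionName": "Book 1"},
    {"artistId": 222, "collectionName": "Book 2"},
    {"artistId": 111, "collectionName": "Book 3"},
]

# dict.fromkeys keeps first-seen order while dropping repeated ids
unique_ids = list(dict.fromkeys(entry["artistId"] for entry in results))

# join puts a comma between ids, so there is no trailing comma
ARTIST_IDS = ",".join(str(artist_id) for artist_id in unique_ids)
print(ARTIST_IDS)
```

A `set` would also remove duplicates, but it does not preserve the order in which the ids appeared.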


In [142]:
# Run this cell!
REQUEST = 'https://itunes.apple.com/lookup?id={}&entity=audiobook&sort=recent'.format(ARTIST_IDS)
req = requests.get(REQUEST).json()

number_of_results = req['resultCount']
print('Number of results:', number_of_results)

Number of results: 123
