# Discussion 03: HW2 Recap, Data Cleaning, and EDA

In this discussion, we will recap HW2 and review the key EDA concepts to keep in mind when working with data.

Specifically, we'll discuss:

- Question 3b (`zip_counts`, `groupby`)
- Question 7a (`scores_pairs_by_business`, `groupby` with `filter`: https://pandas.pydata.org/pandas-docs/stable/groupby.html#filtration)
- Question 8c (`num_vio`, `pd.merge` vs. `DataFrame.join`)

Then, we'll recap EDA:

- Structure
- Granularity
- Scope
- Temporality
- Faithfulness

## HW2 Recap

Here are the big takeaways from HW2:

1. Data are messy and often really hard to deal with! E.g., zip codes and locations of businesses.
2. Restaurants don't often improve their inspection score after their first inspection.
3. It's not obvious how exactly the inspection score is influenced by the violations.

See https://github.com/hopelessoptimism/happy-healthy-hungry/blob/master/h3.ipynb for another analysis on the same dataset.

In terms of writing pandas code, here are our recommendations:

1. Write code very incrementally. As soon as you try to chain together >3 functions in a row, you'll start having a really hard time debugging.
1. The API reference is often not the most helpful place to start. Use the pages linked in the sidebar of the pandas docs instead.
1. There are always 5 different ways to do things in pandas. Start with what we've taught in class, then branch out.
1. You'll very rarely have to do something more complicated than some combination of `.apply`, `.groupby`, and `pd.merge` because they're very flexible. (You don't need to learn every pandas function there is.)
1. When grouping, select out your columns before calling `.groupby`.
1. When joining, use `pd.merge`.
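1. Avoid mutation! Prefer operations that return a new DataFrame over ones that modify an existing DataFrame in place.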
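
As a rough illustration of the recommendations above, here is a minimal sketch on a made-up table. The `toy` DataFrame and its column values are hypothetical and are not the HW2 data. It selects columns before grouping, keeps each step as its own named intermediate so it can be inspected, and derives a new table instead of mutating the original.

```python
import pandas as pd

# Hypothetical toy table standing in for inspection records (not the HW2 data).
toy = pd.DataFrame({
    "business_id": [1, 1, 2, 2, 3],
    "score": [90, 94, 80, 85, 100],
    "type": ["routine"] * 5,
})

# Step 1: select only the columns needed before grouping.
subset = toy.loc[:, ["business_id", "score"]]

# Step 2: group and aggregate, then inspect the result before going further.
mean_scores = subset.groupby("business_id").mean()

# Step 3: build a new table with assign() rather than adding a column to toy.
with_flag = toy.assign(high_score=toy["score"] >= 90)
```

Because each step has a name, you can display `subset`, `mean_scores`, and `with_flag` one at a time while debugging, and `toy` itself is left unchanged.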

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

dsDir = "data/"

bus = pd.read_csv(os.path.join(dsDir, "businesses.csv"), encoding='ISO-8859-1')
ins = pd.read_csv(os.path.join(dsDir, "inspections.csv"))
vio = pd.read_csv(os.path.join(dsDir, "violations.csv"))

#### Question 3b

To explore the zip code values, it makes sense to examine counts, i.e., the number of records that have the same zip code value. This essentially answers the question: How many restaurants are in each zip code?

Please generate a dataframe with `postal_code` as the index and a column called `count` which denotes the number of restaurants for each zip code. If the zip code is missing, be sure to replace it with `MISSING` (e.g., by using `fillna`).

Notes:

- Lots of people had trouble with `fillna`, so explain what it does.
- Lots of people had trouble using `groupby()`/`value_counts()` and getting a DataFrame back, so talk about the `.to_frame()` method.
- **We recommend using `.loc[]` to slice out the columns you want to group first, then using `.groupby`.**
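
To make these notes concrete, here is a small sketch on invented data. The `toy_bus` table is hypothetical and is not `businesses.csv`. It shows what `fillna` returns, why `value_counts` and `groupby(...).size()` give back a Series, and how `.to_frame()` turns that Series into a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; one postal code is missing (NaN).
toy_bus = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "postal_code": ["94110", np.nan, "94110", "94133"],
})

# fillna returns a new Series in which every missing value is replaced.
codes = toy_bus["postal_code"].fillna("MISSING")

# value_counts returns a Series: the index holds the distinct values,
# the values hold how many times each one appears.
counts_series = codes.value_counts()

# to_frame converts the Series into a one-column DataFrame.
counts_df = counts_series.to_frame(name="count")

# The groupby route: slice out the column first, then group and count.
counts_df2 = (
    toy_bus.loc[:, ["postal_code"]]
    .fillna("MISSING")
    .groupby("postal_code")
    .size()
    .to_frame(name="count")
)
```

Both routes give a DataFrame indexed by postal code with a single `count` column; they differ only in row order, since `value_counts` sorts by count and `groupby` sorts by key.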

In [None]:
# SOLUTION 1
zip_counts = ...

# SOLUTION 2
zip_counts = ...

In [None]:
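# Keep the first five characters of each postal code and recode 94602 as 94102.
bus['zip_code'] = bus['postal_code'].str[:5].str.replace("94602", "94102")
# Copy the 2016 slice so that adding columns to it later does not act on a view of ins.
ins2016 = ins[pd.to_datetime(ins['date'], format='%Y%m%d').dt.year == 2016].copy()
# Parse the integer-style dates (YYYYMMDD) before filtering violations to 2016.
vio['new_date'] = pd.to_datetime(vio['date'], format='%Y%m%d')
vio2016 = vio[vio['new_date'].dt.year == 2016]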


#### Question 7a

What's the relationship between the first and second scores for the businesses with 2 inspections in a year? Do they typically improve?

First, make a dataframe called `scores_pairs_by_business`, indexed by `business_id` (containing only businesses with exactly 2 inspections in 2016). This dataframe contains the field `score_pair` consisting of the score pairs ordered chronologically: `[first_score, second_score]`.

Plot these scores. That is, make a scatter plot to display these pairs of scores. Include on the plot a reference line with slope 1. 

You may find the functions `sort_values`, `groupby`, `filter` and `agg` helpful, though not all of them are necessary.

The first few rows of the resulting table should look something like:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>score_pair</th>
    </tr>
    <tr>
      <th>business_id</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>24</th>
      <td>[96, 98]</td>
    </tr>
    <tr>
      <th>45</th>
      <td>[78, 84]</td>
    </tr>
    <tr>
      <th>66</th>
      <td>[98, 100]</td>
    </tr>
    <tr>
      <th>67</th>
      <td>[87, 94]</td>
    </tr>
    <tr>
      <th>76</th>
      <td>[100, 98]</td>
    </tr>
  </tbody>
</table>

Notes:

- Turns out that `.filter` for `groupby` is different from `.filter` for DataFrames. Make sure to point this out.
- To groupby and then convert to a list, use `.agg` with a user-defined function. For more advanced users, use `.apply` on the groupby: https://stackoverflow.com/a/22221675
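
The sketch below illustrates both notes on invented data. The `toy_ins` table is hypothetical and is not `inspections.csv`. It contrasts `DataFrame.filter`, which selects by label, with `GroupBy.filter`, which keeps or drops whole groups, and then collects each group's scores into a list with `.agg`.

```python
import pandas as pd

# Hypothetical toy inspections: business 24 has 2 rows, 45 has 3, 66 has 1.
toy_ins = pd.DataFrame({
    "business_id": [24, 24, 45, 45, 45, 66],
    "date": [20160105, 20160610, 20160201, 20160301, 20160901, 20160420],
    "score": [96, 98, 78, 84, 90, 100],
})

# DataFrame.filter selects rows or columns by label; it never looks at values.
only_score_cols = toy_ins.filter(like="score")

# GroupBy.filter keeps or drops entire groups based on a function of each group.
exactly_two = (
    toy_ins.sort_values("date")
    .groupby("business_id")
    .filter(lambda group: len(group) == 2)
)

# Collect each remaining group's scores into a list, in chronological order
# because the rows were sorted by date before grouping.
pairs = (
    exactly_two.groupby("business_id")["score"]
    .agg(list)
    .to_frame(name="score_pair")
)
```

Passing the built-in `list` to `.agg` is one of several ways to do this; a user-defined function that returns a list is another.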

In [None]:
#SOLUTION CELL
scores_pairs_by_business = ...
scores_pairs_by_business

#### Question 8c
Derive a variable, `num_vio`, that contains the number of violations in a restaurant inspection.

Notes:

- Again, there was confusion about how to use `groupby` with `size` to get a dataframe, so you should talk about how `reset_index` works for Series.
- There was lots of confusion about how `pd.merge` works. You should discuss that in more detail, along with the difference between `DataFrame.join` and `pd.merge` (note that `join` is a DataFrame method; there is no top-level `pd.join`): https://stackoverflow.com/a/37891437
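
Here is a small sketch of both points on invented data. The `toy_ins` and `toy_vio` tables are hypothetical and are not the HW2 files, and matching on `business_id` plus `date` is an assumption made for this example. It shows `reset_index` turning a grouped Series into a DataFrame, then attaches the counts once with `pd.merge` and once with `DataFrame.join`.

```python
import pandas as pd

# Hypothetical toy tables (not the HW2 data).
toy_ins = pd.DataFrame({
    "business_id": [1, 1, 2],
    "date": [20160105, 20160610, 20160201],
    "score": [96, 98, 78],
})
toy_vio = pd.DataFrame({
    "business_id": [1, 1, 1, 2],
    "date": [20160105, 20160105, 20160610, 20160201],
    "description": ["a", "b", "c", "d"],
})

# size() returns a Series whose index is the (business_id, date) pairs.
sizes = toy_vio.groupby(["business_id", "date"]).size()

# reset_index on a Series moves the index levels into ordinary columns and
# returns a DataFrame; name= labels the column that holds the values.
counts = sizes.reset_index(name="num_vio")

# pd.merge matches rows using columns of both tables.
merged = pd.merge(toy_ins, counts, on=["business_id", "date"], how="left")

# DataFrame.join matches columns of the caller against the index of the other table.
joined = toy_ins.join(sizes.to_frame(name="num_vio"), on=["business_id", "date"])
```

The two results contain the same `num_vio` values. The practical difference is where the keys live: `pd.merge` looks in columns by default, while `join` looks in the other table's index.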

In [None]:
num_vios = ...
ins2016['num_vio'] = ...

<img src="qmark.svg" width="20px" style="display:inline"> Discussion question (10 mins): In the Berkeley Police Calls dataset that Joey discussed in lecture:

1. What is the structure of the dataset?
2. What is the granularity of the dataset?
3. What is the scope of the dataset?
4. What is the temporality of the dataset?
5. What is the faithfulness of the dataset?
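
A few quick checks can help ground answers to questions like these. The sketch below uses an invented table: the `calls` DataFrame and its column names are hypothetical and are not the actual Berkeley Police Calls schema. It shows one possible check for each concept, not a complete EDA procedure.

```python
import pandas as pd

# Hypothetical toy records; column names are invented for illustration only.
calls = pd.DataFrame({
    "case_id": [101, 102, 103, 103],
    "event_date": ["2017-01-03", "2017-01-05", "2017-02-11", "2017-02-11"],
    "offense": ["BURGLARY", "THEFT", None, "THEFT"],
})

# Structure: how many rows and columns, and what type is each column?
n_rows, n_cols = calls.shape
column_types = calls.dtypes

# Granularity: does each row correspond to exactly one case?
one_row_per_case = calls["case_id"].is_unique

# Scope and temporality: what time span do the records cover?
dates = pd.to_datetime(calls["event_date"])
first_date, last_date = dates.min(), dates.max()

# Faithfulness: look for missing values and repeated identifiers.
missing_per_column = calls.isna().sum()
duplicate_ids = calls.loc[calls["case_id"].duplicated(keep=False), "case_id"].unique()
```

Checks like these only surface candidates for concern; deciding whether a duplicate or a missing value is an error still requires knowing how the data were collected.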

<img src="qmark.svg" width="20px" style="display:inline"> Discussion question (10 mins): Go over vitamin questions, if people had any.

If time allows:

Pick a dataset with your partner! Here are some suggestions of places to look:

- Berkeley Open Data: https://data.cityofberkeley.info/browse?limitTo=datasets&utf8=&page=1 
- Transparent California: http://transparentcalifornia.com/
- Big list of Public Datasets: https://github.com/caesar0301/awesome-public-datasets

<img src="qmark.svg" width="20px" style="display:inline"> Discussion question (10 mins): After picking your dataset, download it, load it in, and answer the following questions with your partner:

1. What is the structure of the dataset?
2. What is the granularity of the dataset?
3. What is the scope of the dataset?
4. What is the temporality of the dataset?
5. What is the faithfulness of the dataset?

<img src="qmark.svg" width="20px" style="display:inline"> Discussion question (10 mins): Use `seaborn` to create a visualization showing something you didn't know before about the data.