# Pandas for Exploratory Data Analysis II 

Recall Pandas is the most useful Python library for data manipulation and exploration. We have so much more to see!

In this lesson, we'll continue exploring Pandas for EDA. Specifically, we will:

- Identify and handle missing values with Pandas.
- Implement groupby statements for specific segmented analysis.
- Use apply functions to clean data with Pandas.

We'll implicitly review many functions from our first Pandas lesson along the way!

## Remember the Iowa Liquor Dataset?

- **Invoice/Item Number** - Concatenated invoice and line number associated with the liquor order. This provides a unique identifier for the individual liquor products included in the store order
- **Date** - Date of order 
- **Store Number** - Unique number assigned to the store who ordered the liquor.
- **Store Name** - Name of store who ordered the liquor.
- **Address** - Address of the store that ordered the liquor
- **City** - City where the store who ordered the liquor is located
- **Zip Code** - Zip Code of where the store that ordered is located 
- **Store Location** - Location of store who ordered the liquor. The Address, City, State and Zip Code are geocoded to provide geographic coordinates. Accuracy of geocoding is dependent on how well the address is interpreted and the completeness of the reference data used.
- **County Number** - Iowa county number for the county where store who ordered the liquor is located
- **County** - County where the store who ordered the liquor is located
- **Category** - Category code associated with the liquor ordered
- **Category Name** - Category of the liquor ordered.
- **Vendor Number** - The vendor number of the company for the brand of liquor ordered
- **Vendor Name** - The vendor name of the company for the brand of liquor ordered
- **Item Number** - Item number for the individual liquor product ordered.
- **Item Description** - Description of the individual liquor product ordered.
- **Pack** - The number of bottles in a case for the liquor ordered
- **Bottle Volume (mL)** - Volume of each liquor bottle ordered in milliliters.
- **State Bottle Cost** - The amount that Alcoholic Beverages Division paid for each bottle of liquor ordered
- **State Bottle Retail** - The amount the store paid for each bottle of liquor ordered
- **Bottles Sold** - The number of bottles of liquor ordered by the store
- **Sale (Dollars)** - Total cost of liquor order (number of bottles multiplied by the state bottle retail)
- **Volume Sold (Liters)** - Total volume of liquor ordered in liters. (i.e. (Bottle Volume (ml) x Bottles Sold)/1,000)
- **Volume Sold (Gallons)** - Total volume of liquor ordered in gallons. (i.e. (Bottle Volume (ml) x Bottles Sold)/3785.411784)


### Our Modified Iowa Liquor Dataset

Because the full dataset (of all liquor sales from 2012 to present) is greater than 13 million rows (13,948,103+ at the time of writing), **we will work with a modified dataset.**

Our modified dataset has a few key changes:
- Only sales from May 2017 and May 2018 are present
- A number of values have been deliberately deleted (to practice working with missing data!)


### Import Pandas

In [None]:
import pandas as pd
import numpy as np # used for linear algebra and random sampling
%matplotlib inline

### Read in the dataset

We are using the `read_csv()` method (and using a special encoding to handle our file's Excel roots).

In [None]:
liq = pd.read_csv("../data/iowa_liquor_may_17_18.csv", encoding='cp1252')

In [None]:
# remember checking the top five rows
liq.head()

In [None]:
liq.shape

### Rename our columns (like last time)

Let's rename our columns so our data is easier to work with.

In [None]:
# declare a list of strings - these strings will become the new column names
cols = ['date', 'store_number', 'store_name', 'city', 
        'zip_code', 'location', 'county', 'category_name',
        'vendor_name', 'item_number', 'item_description', 'pack', 
       'bottle_vol_ml', 'state_bottle_cost', 'state_bottle_retail', 'bottles_sold',
       'sale', 'volume_sold_l', 'volume_sold_gal', 'is_may_2017', 'is_may_2018']

In [None]:
liq.columns = cols

In [None]:
liq.columns

## Handling missing data

Recall missing data is a systemic, challenging problem for data scientists. Imagine conducting a US election poll, but losing all female voter responses in the process!

"Handling missing data" itself is a broad topic. We'll focus on three components:

- Using Pandas to identify we have missing data
- Strategies to fill in missing data
- Filling in missing data with Pandas


***Create missing data*** 😮

> For the purposes of education... Run the below cell to *create* missing data in our DataFrame.

In [None]:
# create random places to drop data
to_drop1 = np.random.randint(1,427923,72746)
to_drop2 = np.random.randint(1,427923,29954)
to_drop2 = np.append(to_drop2, 2) # np.append returns a new array, so reassign it to make sure index number 2 is dropped


# drop the data!!!
liq.iloc[to_drop1,15] = np.nan
liq.iloc[to_drop2,16] = np.nan


### Identifying missing data

Before *handling*, we must identify we're missing data at all! (In this given dataset, we have eliminated datapoints for the purposes of these exercises.)

We have a few ways to explore missing data, and they are reminiscent of our Boolean filters.

In [None]:
# True when data isn't missing
liq.notnull() 

In [None]:
# True when data is missing
liq.isnull() 

Now, we may want to see null values in aggregate. We can use `sum()` to sum down each column, which counts the `True` values (the missing entries) per column.

In [None]:
# see number of missing values per column
liq.isnull().sum()

Look! We've found missing values!

How could this missing data be problematic for our analysis?

### Understanding missing data

Finding missing data is the easy part! Determining what to do next is more complicated.

Typically, we are most interested in knowing **why** we are missing data. Once we know what 'type of missingness' we have (the source of missing data), we can proceed effectively.

Let's first quantify how much data we are missing.

In [None]:
# use a boolean filter to only show rows where bottles_sold is missing
liq[liq.bottles_sold.isnull()]

In [None]:
# obtain just the number of rows
liq[liq.bottles_sold.isnull()].shape[0]

In [None]:
# divide this by the overall DataFrame to get a percent of missing values
liq[liq.bottles_sold.isnull()].shape[0] / liq.shape[0]

Let's do the same for `sale`.

In [None]:
liq[liq.sale.isnull()].shape[0] / liq.shape[0]

Collectively, we are missing about 16% of data on the number of bottles sold in a given daily transaction, and about 7% of the data on total sale value for a given number of items in a single day.
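
As a shortcut, the mean of a Boolean mask is the fraction of `True` values, so `isnull().mean()` gives the share of missing values for every column in one step. Here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative assumptions, not rows from the liquor data):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: two missing values in bottles_sold, one in sale
toy = pd.DataFrame({
    'bottles_sold': [12, np.nan, 6, np.nan],
    'sale': ['$81.00', '$40.50', None, '$20.25'],
})

# the mean of a Boolean column is the fraction of True values,
# i.e. the share of missing entries per column
missing_share = toy.isnull().mean()
print(missing_share)
```

The same call on `liq` should agree with the row-count ratios computed above.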

### Filling in missing data

How we fill in data depends largely on why it is missing (types of missingness) and what sampling we have available to us.

We may:

- Delete missing data altogether
- Fill in missing data with:
    - The average of the column
    - The median of the column
    - A predicted amount based on other factors
- Collect more data:
    - Resample the population
    - Follow up with the authority providing the data that is missing
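
The first two strategies in the list above can be sketched on a tiny made-up column. The Series `s` and its values are hypothetical, chosen so that one large outlier shows how differently the mean and the median behave as fill values:

```python
import numpy as np
import pandas as pd

# hypothetical toy column: one missing value and one large outlier
s = pd.Series([1.0, 2.0, 3.0, np.nan, 100.0])

# strategy 1: delete the missing entries altogether
dropped = s.dropna()

# strategy 2a: fill with the average of the observed values (26.5 here)
mean_filled = s.fillna(s.mean())

# strategy 2b: fill with the median of the observed values (2.5 here)
median_filled = s.fillna(s.median())

print(len(dropped), mean_filled[3], median_filled[3])
```

None of these calls modifies `s` itself; each returns a new Series.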


In our case, let's focus on handling `bottles_sold`.

In [None]:
# Can we identify a pattern of missingness (no)
liq[liq.bottles_sold.isnull()]

In [None]:
# Do the missing values have a significantly different five number summary than non-missing?
liq[liq.bottles_sold.isnull()].describe()

In [None]:
# full dataset 5-number summary
liq.describe()

In [None]:
# check the difference between the two
liq.describe() - liq[liq.bottles_sold.isnull()].describe()

It appears the two samples do not have *significant* differences! (We could run statistical tests, but...another day.)

Now, this makes sense! We did randomly drop values, after all.

Option 1: Drop the missing values.

In [None]:
# drops every row that has a missing value in any column - this does not happen *in place*, so we are not actually changing liq
liq.dropna()
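
If we only care about missing values in particular columns, `dropna()` accepts a `subset` argument. The sketch below uses a hypothetical three-row frame to contrast the two behaviours and to show that the original frame is left untouched:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: row 1 is missing bottles_sold, row 2 is missing sale
toy = pd.DataFrame({
    'bottles_sold': [12, np.nan, 6],
    'sale': ['$81.00', '$40.50', None],
})

# drop a row if ANY column is missing
any_missing = toy.dropna()

# drop a row only if bottles_sold is missing
only_bottles = toy.dropna(subset=['bottles_sold'])

# toy still has all three rows because dropna returned new DataFrames
print(len(any_missing), len(only_bottles), len(toy))
```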

Option 2: Fill in missing values

Traditionally, we fill missing data with a median, average, or modelled value. Let's see the five-number-summary of the column of interest to decide.

In [None]:
liq.bottles_sold.describe()

In this given case, we may opt to fill our data in with the *median* (50%) rather than the *mean* because we see such a positive skew. The most commonly processed transaction is on bottles that are single order.

In [None]:
# get the 50th percentile
liq.bottles_sold.quantile()

In [None]:
# fill in missing data with the 50th percentile -- assign the result back so the change persists
liq['bottles_sold'] = liq.bottles_sold.fillna(value=liq.bottles_sold.quantile())

In [None]:
# check total number of missing values
liq.isnull().sum()

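The missing values in `bottles_sold` are gone! (The `sale` column still has its missing values, since we have not filled those in.)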

Now, to be fair, we may want to investigate our missing values *even more*! What if counties with larger orders, on balance, are more likely to be missing from our dataset? This would skew our data unfairly.

Even determining how to fill in missing data requires careful exploratory data analysis!
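
One way to check whether some groups are more likely to be missing is to compute the missing share per group. This is a rough sketch on made-up data; the `toy` frame is hypothetical and the county names are used only as labels:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: one of two Polk rows is missing bottles_sold
toy = pd.DataFrame({
    'county': ['Polk', 'Polk', 'Linn', 'Linn'],
    'bottles_sold': [12, np.nan, 6, 3],
})

# Boolean mask of missing values, averaged within each county
missing_by_county = toy['bottles_sold'].isnull().groupby(toy['county']).mean()
print(missing_by_county)
```

Note that this check has to run *before* the missing values are filled in. Large differences between groups would suggest the values are not missing at random.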

## Groupby Statements

In Pandas, groupby statements are similar to pivot tables in that they allow us to segment our population to a specific subset.

For example, if we want to know the average number of bottles sold and pack sizes per city, a groupby statement would make this task much more straightforward.


To understand how a groupby statement works, think of it in three steps:

- **Split:** Separate our DataFrame by a specific attribute
- **Apply:** Determine how categories are going to be mathematically incorporated. For example, if there are multiple store locations in one city, do we want the average amount across all stores, the total amount for the stores, or perhaps even the highest amount for a single store per city?
- **Combine:** Put our DataFrame back together.

![](http://i.imgur.com/yjNkiwL.png)
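
To make the three steps concrete, here is a sketch that performs split, apply, and combine by hand on a made-up frame, then compares the result with the built-in one-liner. The `toy` frame and its values are assumptions for illustration only:

```python
import pandas as pd

# hypothetical toy data: two transactions in each of two cities
toy = pd.DataFrame({
    'city': ['Ames', 'Ames', 'Des Moines', 'Des Moines'],
    'bottles_sold': [2, 4, 10, 20],
})

# split: iterating over a groupby yields one (key, sub-DataFrame) pair per city
# apply: take the mean of bottles_sold within each piece
pieces = {city: group['bottles_sold'].mean() for city, group in toy.groupby('city')}

# combine: put the per-city results back together
manual = pd.Series(pieces)

# the same computation as a single groupby statement
built_in = toy.groupby('city')['bottles_sold'].mean()
print(manual, built_in, sep='\n')
```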

Let's try it out!

In [None]:
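# groupby city - take the average for each numeric column when combining back together
# (numeric_only=True skips text columns, which newer Pandas versions refuse to average)
liq.groupby('city').mean(numeric_only=True)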

In [None]:
# perhaps we want *just* bottles sold from the above
liq.groupby('city').bottles_sold.mean()

In [None]:
# or maybe, we want the biggest single transaction per city
liq.groupby('city').bottles_sold.max()

In [None]:
# in fact, we can 'apply' a mean and max at once- plus count and min!
liq.groupby('city').bottles_sold.agg(['count', 'mean', 'min', 'max'])

In [None]:
# sort by largest average; grab top 10 cities in Iowa by average number of bottles sold per transaction
liq.groupby('city').bottles_sold.agg(['count', 'mean', 'min', 'max']).sort_values(by='mean', ascending=False).head(10)

In [None]:
# groupby creates a groupby object - it needs to be told how to aggregate things together
liq.groupby('city').bottles_sold

In [None]:
liq.groupby('city').bottles_sold.count()

In [None]:
# top 10 cities by number of transactions (count tallies rows, not distinct stores)
liq.groupby('city').bottles_sold.count().sort_values(ascending=False).head(10)

## Apply functions for column operations

Apply functions allow us to perform a complex operation across an entire column efficiently.

For example, recall our `sale` data is formatted in an unhelpful way (strings, not floats):                                                      

In [None]:
liq.dtypes

In [None]:
# first sale value
liq.sale[0]

We need to convert this value to a float, and without the dollar sign.

**Apply functions** allow us to write a function that cleans a single value, and then we *apply* that function to a whole column. (It's like a for loop, but way more efficient as an operation!)

Writing them follows a familiar three steps:

1. Write a function that creates the desired output on a single value
2. Test that function on one value of interest
3. Apply that function to the whole column

To start, let's write a function that converts an inputted value with a dollar sign to a float, and returns that float.

In [None]:
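def dollars_to_float(value):
    
    # try to strip the dollar sign and convert the inputted value to a float
    try:
        return float(value.strip('$'))
    
    # a null value (NaN) has no .strip() method, and a malformed string cannot
    # be converted; in either case, we simply return a null
    except (AttributeError, ValueError):
        return np.nan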

Let's try our function on a value of interest or two.

In [None]:
liq.sale[0]

In [None]:
liq.sale[2]

In [None]:
dollars_to_float(liq.sale[0])

In [None]:
dollars_to_float(liq.sale[2])

Now, we apply this function to the whole column with the following syntax. Notice: we are going to create a new column (out of thin air!) called `sale_clean`.

In [None]:
liq['sale_clean'] = liq.sale.apply(dollars_to_float)

Voila! Our first apply function.
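
For a simple clean-up like this one, Pandas' built-in string methods offer an alternative to `apply`. The sketch below runs both approaches on a hypothetical three-value Series (not the real `sale` column) so the results can be compared. It assumes the values contain only a leading dollar sign and no thousands separators:

```python
import numpy as np
import pandas as pd

def dollars_to_float(value):
    # same idea as the function above: strip '$', fall back to NaN
    try:
        return float(value.strip('$'))
    except (AttributeError, ValueError):
        return np.nan

# hypothetical toy column with one missing value
sale = pd.Series(['$81.00', '$40.50', np.nan])

# approach 1: apply our single-value function to every element
via_apply = sale.apply(dollars_to_float)

# approach 2: strip the '$' with the .str accessor, then convert;
# errors='coerce' turns anything unparseable into NaN
via_str = pd.to_numeric(sale.str.lstrip('$'), errors='coerce')

print(via_apply.tolist(), via_str.tolist())
```

`apply` remains the more general tool, since the function can contain any logic we like.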

**Your turn:** Identify one other column where we may want to write a new apply function, or use the one we just created for the purposes of cleaning up our dataset.

In [None]:
# identify a column to fix


In [None]:
# write a function to fix a single value in that column


In [None]:
# apply that function across the whole column


## Wrap up

We've covered even more useful information! Here are the key takeaways:

- **Missing data** comes in many shapes and sizes. Before deciding how to handle it, we identify it exists. We then derive how the missingness is affecting our dataset, and make a determination about how to fill in values.

```python
# pro tip for identifying missing data
df.isnull().sum()
```

- **Groupby** statements are particularly useful for a subsection-of-interest analysis. Specifically, zooming in on one condition, and determining relevant statistics.

```python
# group by 
df.groupby('column').agg(['count', 'mean', 'max', 'min'])
```

- **Apply functions** help us clean values across an entire DataFrame column. They are *like* a for loop for cleaning, but many times more efficient. They follow a common pattern:
1. Write a function that works on a single value
2. Test that function on a single value
3. Apply that function to a whole column

(The most confusing part of apply functions is that we write them with *a single value* in mind, and then apply them to many single values at once.)