# Exploratory Data Analysis with Pandas

![panda](http://res.freestockphotos.biz/thumbs/3/3173-illustration-of-a-giant-panda-eating-bamboo-th.png)

In [None]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

# Objectives

- Use lambda functions and DataFrame methods to transform data
- Handle missing data

# More Pandas

Suppose you were interested in opening an animal shelter. To inform your planning, it would be useful to analyze data from other shelters to understand their operations. In this lecture, we'll analyze animal outcome data from the Austin Animal Center.  

## Loading the Data

Let's take a moment to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). 

We can also ingest the data directly from the web. The code below loads JSON data for the last 1000 animals to leave the center from this [JSON file](https://data.austintexas.gov/resource/9t4d-g238.json). 

In [None]:
json_url = 'https://data.austintexas.gov/resource/9t4d-g238.json'
print(json_url)

In [None]:
json_url = 'https://data.austintexas.gov/resource/9t4d-g238.json'
animals = pd.read_json(json_url)

In [None]:
type(animals)

# Exploratory Data Analysis (EDA)

Exploring a new dataset is essential for understanding what it contains. This will generate ideas for processing the data and questions to try to answer in further analysis.

## Inspecting the Data

Let's take a look at a few rows of data.

In [None]:
animals.head(10)

The `info()` and `describe()` methods provide a useful overview of the data.

In [None]:
animals.info()

> We can see that we have some missing data, specifically in the `outcome_type`, `outcome_subtype`, and `name` columns.
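
To make the amount of missing data explicit, one option is to count missing values per column. The sketch below uses a small made-up DataFrame (`pets` and its values are hypothetical, not taken from the shelter data); the same calls should work on `animals`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the shelter DataFrame
pets = pd.DataFrame({
    'name': ['Rex', np.nan, 'Max', np.nan],
    'outcome_type': ['Adoption', 'Transfer', np.nan, 'Adoption'],
})

# isna() marks each missing cell as True; summing counts them per column
missing_counts = pets.isna().sum()

# Taking the mean instead gives the fraction of each column that is missing
missing_share = pets.isna().mean()

print(missing_counts)
print(missing_share)
```

The fraction is often easier to interpret than the raw count when deciding how to handle a column.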

In [None]:
animals.describe()

In [None]:
# Use value counts to check a categorical feature's distribution

animals['color'].value_counts()

Now that we have a sense of the data available to us, we can focus in on some more specific questions to dig into. These questions may or may not be directly relevant to your goal (e.g. helping plan a new shelter), but will always help you gain a better understanding of your data.

In your EDA notebooks, **markdown** will be especially helpful in tracking these questions and your methods of answering the questions.

## Question 1: What animal types are in the dataset?

We can then begin thinking about what parts of the DataFrame we need to answer the question.

* What features do we need?
 - "animal_type"
* What type of logic and calculation do we perform?
 - Let's use `.value_counts()` to count the different animal types
* What type of visualization would help us answer the question?
 - A bar chart would be good for this purpose

In [None]:
animals.columns

In [None]:
animals['animal_type'].value_counts()

In [None]:
fig, ax = plt.subplots()

animal_type_values = animals['animal_type'].value_counts()

ax.barh(
    y=animal_type_values.index,
    width=animal_type_values.values
)
ax.set_xlabel('count');

In [None]:
animals['animal_type'].hist();

Questions lead to other questions. For the above example, the visualization raises the question...

## Question 2: What "Other" animals are in the dataset?

To find out, we need to know whether the type of animal for "Other" is in our dataset - and if so, where to find it.   

**Discussion**: Where might we look to find animal types within the Other category?

<details>
    <summary>
        Answer
    </summary>
        The breed column.
</details>

In [None]:
# Your exploration here
animals.head()

Let's use that column to answer our question.

In [None]:
mask_other_animals = animals['animal_type'] == 'Other'
animals[mask_other_animals]['breed'].value_counts()

In [None]:
animals[mask_other_animals]

## Question 3: How old are the animals in our dataset?

Let's try to answer this with the `age_upon_outcome` variable to learn some new `pandas` tools.

In [None]:
animals['age_upon_outcome'].value_counts()

### `Series.map()`

The `.map()` method applies a transformation to every entry in the Series. This transformation "maps" each value from the Series to a new value. A transformation can be defined by a function, a Series, or a dictionary. Usually we'll use functions.

The `.apply()` method is similar to the `.map()` method for Series, but can only use functions. It has more powerful uses when working with DataFrames.
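
As a minimal sketch of the three kinds of transformation `.map()` accepts, here is a toy Series (the `sizes` data and the lookup values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data, not from the shelter dataset
sizes = pd.Series(['S', 'M', 'L', 'M'])

# 1. A function: called once for each value
with_function = sizes.map(lambda s: s.lower())

# 2. A dictionary: each value is looked up as a key
with_dict = sizes.map({'S': 'small', 'M': 'medium', 'L': 'large'})

# 3. A Series: each value is looked up in the Series' index
lookup = pd.Series({'S': 1, 'M': 2, 'L': 3})
with_series = sizes.map(lookup)

# Values with no match in a dictionary (or Series) become NaN
unmatched = sizes.map({'S': 'small'})

print(with_dict.tolist())
print(unmatched.isna().tolist())
```

Note the last case: a dictionary that doesn't cover every value silently introduces missing data, which is one reason functions with an explicit fallback are often preferred.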

In [None]:
def one_year(age):
    if age == '1 year':
        return '1 years'
    else:
        return age

In [None]:
animals['new_age1'] = animals['age_upon_outcome'].map(one_year)
animals['new_age1'].value_counts()
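
The same idea can be pushed further to make the ages numeric. The following is a sketch only: it assumes every value looks like `'<number> <unit>'`, uses rough conversion factors (30 days per month, 365 per year), and does not handle missing values. The sample `ages` Series and the helper `age_to_days` are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the format of 'age_upon_outcome'
ages = pd.Series(['2 years', '1 year', '3 months', '2 weeks', '5 days'])

# Approximate number of days per unit (assumed conversion factors)
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(age):
    """Convert a string like '3 months' to an approximate number of days."""
    number, unit = age.split()
    unit = unit.rstrip('s')  # 'years' -> 'year', 'year' stays 'year'
    return int(number) * DAYS_PER_UNIT[unit]

age_in_days = ages.map(age_to_days)
print(age_in_days.tolist())  # [730, 365, 90, 14, 5]
```

With numeric ages, methods such as `.describe()` or `.hist()` become meaningful for this column.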

### More Sophisticated Mapping

Let's use `.map()` to turn `sex_upon_outcome` into a category with three values (called **ternary**): male, female, or unknown. 

First, explore the unique values:

In [None]:
animals['sex_upon_outcome'].unique()

In [None]:
def sex_mapper(status):
    if status in ['Neutered Male', 'Intact Male']:
        return 'Male'
    elif status in ['Spayed Female', 'Intact Female']:
        return 'Female'
    else:
        return 'Unknown'

In [None]:
# `.apply()` behaves like `.map()` here because `sex_mapper` is a function
animals['new_sex1'] = animals['sex_upon_outcome'].apply(sex_mapper)
animals.loc[:, ['sex_upon_outcome', 'new_sex1']]

### Lambda Functions

Simple functions can be defined inline, right where you would otherwise pass a named function. These are called **lambda functions**. They are **anonymous**: unless you assign one to a name, it is discarded after use.
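
To illustrate, here is a small sketch showing that a lambda does the same job as a one-line named function. The `names` Series and `capitalize_name` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data
names = pd.Series(['rex', 'bella', 'max'])

# A named function...
def capitalize_name(name):
    return name.capitalize()

with_named = names.map(capitalize_name)

# ...and the equivalent lambda, written right where it is used
with_lambda = names.map(lambda name: name.capitalize())

print(with_lambda.tolist())  # ['Rex', 'Bella', 'Max']
```

A lambda is limited to a single expression, so anything with several steps is usually clearer as a named function.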

Let's use a lambda function to replace 'Other' in the `animal_type` column with a missing value (`np.nan`).

In [None]:
animals[animals['animal_type'] == 'Other']

In [None]:
animals['animal_type'].value_counts()

In [None]:
type(np.nan)

In [None]:
animals['animal_type'].map(lambda x: np.nan if x == 'Other' else x).value_counts()

In [None]:
animals['animal_type'].value_counts()

In [None]:
#animals['animal_type'] = animals['animal_type'].map(lambda x: np.nan if x == 'Other' else x)

# Handling Missing Data

Data sets often have missing information, which can get in the way of the analysis we're trying to do.

So far, we've been doing some preprocessing/cleaning to answer questions. Now we're going to handle the missing values in our data.

There are a few strategies we can choose from, and each has its own use case.

> Before making changes, it's safer to work on a copy instead of overwriting the original data. We'll keep all our changes in `animals_clean`, which will be a [copy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html) of the original DataFrame.

In [None]:
animals_clean = animals.copy()

## Fill with a Relevant Value

Often we already know how we want to mark a value as missing, and we can replace the "empty" value with one that makes more sense.

For example, it might make sense to fill the value as "MISSING" or "UNKNOWN". This way it's clearer when we do more analysis.

> We can use Pandas' [`fillna()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) to replace missing values with something specific
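
As a minimal sketch of how `fillna()` behaves, the toy example below (the `pets` DataFrame is hypothetical) fills with a single value and then with a per-column dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    'name': ['Rex', np.nan, 'Max'],
    'color': ['Black', 'White', np.nan],
})

# A single value fills every missing entry in the DataFrame
filled_everywhere = pets.fillna('UNKNOWN')

# A dictionary of {column_name: fill_value} fills only the listed columns
filled_by_column = pets.fillna({'name': 'UNKNOWN'})

print(filled_by_column)
```

In both cases `fillna()` returns a new DataFrame, so `pets` itself still contains its missing values.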

In [None]:
# Note this creates a copy of `animals_clean` with the missing names replaced
animals_name_filled = animals_clean.fillna({'name': 'UNKNOWN'}) # {col_name: new_value}
animals_name_filled.head(10)

In [None]:
# `animals_clean` DataFrame is left untouched
animals_clean.head()

In [None]:
# Alternative way to fill missing values by specifying column(s) first
animals_only_names = animals[['name']].fillna(value='UNKNOWN')
animals_only_names.head(10)

In [None]:
# To keep changes in DataFrame, overwrite the column
animals_clean[['name']] = animals_only_names
animals_clean.head()

## Fill with a Reasonable Value

Other times we don't know what the missing value was but we might have a reasonable guess. This allows us to still use the data point (row) in our analysis.

> Beware that filling in missing values can lead you to draw incorrect conclusions. If most of the data from a column are missing, it's going to appear that the value you filled in is more common than it actually was!

We'll often use the _mean_ or _median_ for numerical values. Sometimes a value like $0$ makes sense in the context of how the data was collected.

With categorical values, you might choose to fill the missing values with the most common value (the *mode*).

> Similar to the previous subsection, we can use the `fillna()` method after specifying the value to fill
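
A small sketch of both ideas together, using a made-up DataFrame (`pets`, `weight_kg`, and the values are hypothetical): the median fills a numeric column and the mode fills a categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    'weight_kg': [4.0, np.nan, 10.0, 6.0],
    'animal_type': ['Dog', 'Cat', np.nan, 'Dog'],
})

# Median of the non-missing weights (4.0, 6.0, 10.0) is 6.0
median_weight = pets['weight_kg'].median()

# mode() returns a Series (there can be ties), so take the first entry
most_common_type = pets['animal_type'].mode()[0]

pets_filled = pets.fillna({
    'weight_kg': median_weight,
    'animal_type': most_common_type,
})
print(pets_filled)
```

The median is often preferred over the mean when a column has outliers, since a few extreme values barely move it.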

In [None]:
## Let's find the most common value for `outcome_subtype`
outcome_subtype_counts = animals['outcome_subtype'].value_counts()
outcome_subtype_counts

In [None]:
# This gets us just the values in order of most frequent to least frequent
outcome_subtype_ordered = outcome_subtype_counts.index
print(outcome_subtype_ordered)

# Get the first one
most_common_outcome_subtype = outcome_subtype_ordered[0]

In [None]:
most_common_outcome_subtype

In [None]:
animals['outcome_subtype'].mode()

In [None]:
# Using the built-in mode() method
# Note this returns a Series, so we have to get the first element (which is the value)
most_common_outcome_subtype = animals['outcome_subtype'].mode()[0]
most_common_outcome_subtype

In [None]:
# Similar to the previous subsection, we can use fillna() and update the DF
animals_clean['outcome_subtype'] = animals['outcome_subtype']\
.fillna(most_common_outcome_subtype)
animals_clean.head()

## Specify That the Data Were Missing

Even after filling in missing values, it might make sense to specify that there were missing data. You can document that the data was missing by creating a new column that represents whether the data was originally missing or not.

This can be helpful when you suspect that the fact the data was missing could be important for an analysis.

> Since we already filled in some missing values in `animals_clean`, we're going to reference back to the original `animals` DataFrame. (Good thing we didn't overwrite it! 😉)
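
The order of operations matters here: the indicator has to be computed while the values are still missing. A minimal sketch on hypothetical toy data (the fill value `'Partner'` is just an assumed placeholder):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    'outcome_subtype': ['Foster', np.nan, 'Partner', np.nan],
})

# Record which rows are missing BEFORE filling them in
pets['outcome_subtype_missing'] = pets['outcome_subtype'].isna()

# Now fill; the indicator column still remembers the original gaps
pets['outcome_subtype'] = pets['outcome_subtype'].fillna('Partner')

print(pets)
```

If the fill were done first, `isna()` would return `False` everywhere and the information would be lost.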

In [None]:
# Let's specify which values were originally missing in "outcome_subtype"
missing_outcome_subtypes = animals['outcome_subtype'].isna()
missing_outcome_subtypes

In [None]:
# Create new column for missing outcome subtypes matched w/ replaced values
animals_clean['outcome_subtype_missing'] = missing_outcome_subtypes
animals_clean.head()

## Drop Missing Data

You should try to keep as much relevant data as possible, but sometimes the other methods don't make as much sense and it's better to remove or **drop** the missing data.

We typically drop missing data if very little data would be lost and/or trying to fill in the values wouldn't make sense for our use case. For example, if you're trying to predict the outcome based on the other features/columns it might not make sense to fill in those missing values with something you can't confirm.

> We noticed that `outcome_type` had only a few missing values. It might not be worth trying to handle those few missing values. We can pretend that the `outcome_type` was an important feature and without it the rest of the row's data is of little importance to us.
>
> So we'll decide to drop the row if a value from `outcome_type` is missing. We'll use Pandas' [`dropna()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).
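
To see what `subset` changes, here is a sketch on a made-up DataFrame (`pets` and its values are hypothetical) comparing `dropna()` with and without it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({
    'name': ['Rex', np.nan, 'Max', 'Luna'],
    'outcome_type': ['Adoption', 'Transfer', np.nan, 'Adoption'],
})

# Without subset: a row is dropped if ANY column is missing (rows 1 and 2)
dropped_any = pets.dropna()

# With subset: only 'outcome_type' is checked, so just row 2 is dropped
dropped_subset = pets.dropna(subset=['outcome_type'])

# Dropping leaves gaps in the index; reset_index(drop=True) renumbers from 0
renumbered = dropped_subset.reset_index(drop=True)

print(dropped_subset.index.tolist())  # [0, 1, 3]
print(renumbered.index.tolist())      # [0, 1, 2]
```

Using `subset` keeps rows that are missing values only in columns we don't consider essential.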

In [None]:
animals_clean['outcome_type'].value_counts().sum()

In [None]:
# This will drop any row (axis=0) or column (axis=1) that has missing values
animals_clean = animals_clean.dropna(   # Note we're overwriting animals_clean
                                axis=0, # This is the default & will drop rows; axis=1 for cols
                                subset=['outcome_type'] # Specific labels to consider (defaults to all)
)
animals_clean.head()

In [None]:
animals_clean.shape

## Comparing Before and After

We can now see all the work we did!

In [None]:
# Original data
animals.info()

In [None]:
# Missing data cleaned
animals_clean.info()

In [None]:
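# Renumber the rows after dropping; this returns a new DataFrame and keeps the old index as a column
animals_clean.reset_index()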