# More Pandas

### Introduction
You have decided to start your own animal shelter, and you want a clearer idea of what that will entail so that you can plan. In this lecture, we'll explore a real data set collected by the Austin Animal Center over several years, using our pandas skills from the last lecture and learning some new ones along the way.

#### Our goals today are to be able to:

- Apply and use `.map()` and `.applymap()` from the Pandas library
- Explain what a groupby object is and split a DataFrame using `.groupby()`
- Explain lambda functions and use them on a DataFrame
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting
- Use one-hot encoding to make use of categorical variables

#### Getting started

Let's take a moment to download and to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). What kinds of questions can we ask this data and what kinds of information can we get back?

Let's take a look at the data:

In [None]:
import numpy as np
import pandas as pd
animals = pd.read_csv('Austin_Animal_Center_Outcomes.csv')
animals.head()

What do we notice about this dataset?

In [None]:
animals.isnull()

In [None]:
animals.isnull().sum()

In [None]:
animals.fillna('null')

In [None]:
animals.fillna(np.nan)

### 1. Applying and using map and applymap from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

#### DataFrame.applymap() and Series.map()

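The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame. (In pandas 2.1 and later, `.applymap()` is deprecated in favor of `DataFrame.map()`, which behaves the same way.)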

In [None]:
animals.applymap(str).head()
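
To see element-wise application on something small, here is a minimal sketch on a toy DataFrame. The data and the names `toy` and `elementwise` are made up for illustration. The fallback line is only there so the same cell should run on both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical toy data, not taken from the shelter data set.
toy = pd.DataFrame({'name': ['rex', 'tom'], 'kind': ['dog', 'cat']})

# DataFrame.map is the newer name for DataFrame.applymap (pandas 2.1+),
# so fall back to applymap on older installations.
elementwise = toy.map if hasattr(toy, 'map') else toy.applymap

# str.upper is called once per cell; the shape of the frame is unchanged.
upper = elementwise(str.upper)
print(upper)
```

Note that the function receives each individual cell, not a row or a column.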

The `.map()` method takes a function as input and applies it to every entry in the Series.

In [None]:
# This line of code will split the IDs into two parts and add the parts as new columns.

animals[['Animal ID Prefix', 'Animal ID Num']] =\
animals['Animal ID'].str.split('A', expand=True)

In [None]:
# Now: How can we convert the Animal ID Num column to integers?

animals['Animal ID Num'] = animals['Animal ID Num'].map(int)

Or we could have just used the `.astype()` method:

In [None]:
animals['Animal ID Num'] = animals['Animal ID Num'].astype(int)
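
As a small aside, `.map()` accepts more than functions: it also takes a dictionary, which it uses as a lookup table. The sketch below uses a made-up Series (`sizes`) rather than the shelter data.

```python
import pandas as pd

# Hypothetical toy Series for illustration.
sizes = pd.Series(['small', 'large', 'small', 'medium'])

# A function is applied element by element.
lengths = sizes.map(len)

# A dict works as a lookup table; values with no matching key become NaN.
codes = sizes.map({'small': 0, 'large': 2})

print(lengths.tolist())
print(codes.tolist())
```

Because 'medium' has no key in the dictionary, its entry comes back as a missing value, which is worth checking for after any dictionary-based mapping.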

#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [None]:
animals['Animal ID Num'].map(lambda x: x*2)[:4]
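
To make the idea concrete, the sketch below compares a named function with its anonymous equivalent. The Series `weights` and the function `to_pounds` are hypothetical, and the conversion factor is rounded.

```python
import pandas as pd

# Hypothetical weights in kilograms.
weights = pd.Series([4.0, 12.5, 30.0])

# A named function defined with def ...
def to_pounds(kg):
    return kg * 2.2

named = weights.map(to_pounds)

# ... and the same logic written inline as a lambda.
anonymous = weights.map(lambda kg: kg * 2.2)

# Both approaches produce identical Series.
print(named.equals(anonymous))
```

A lambda is handy when the function is short and used only once; anything longer is usually easier to read as a named function.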

**Exercise: Use an anonymous function to add 'approximately' in front of the entries in Age upon Outcome**

In [None]:
# Your code here!



What went wrong? How can we fix it?

### 2. Methods for Re-Organizing DataFrames: .groupby()

Those of you familiar with SQL have probably used the `GROUP BY` command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method is especially useful for applying aggregate functions to data grouped in particular ways.

In [None]:
animals.groupby('Animal Type')

#### .groups and .get_group()

In [None]:
animals.groupby('Animal Type').groups

In [None]:
animals.groupby('Animal Type').get_group('Livestock')

#### Aggregating

In [None]:
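# numeric_only=True restricts the aggregation to numeric columns; recent pandas versions raise an error on text columns otherwise.
animals.groupby('Animal Type').std(numeric_only=True)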
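
A groupby object is easiest to understand as "split, apply, combine". Here is a minimal sketch of those steps on a toy DataFrame; the frame `pets` and its columns are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data.
pets = pd.DataFrame({
    'kind': ['dog', 'cat', 'dog', 'cat', 'bird'],
    'age_years': [3, 5, 7, 1, 2],
})

grouped = pets.groupby('kind')

# Split: iterating yields (group key, sub-DataFrame) pairs.
for kind, frame in grouped:
    print(kind, len(frame))

# Apply and combine: one aggregate value per group, indexed by group key.
mean_age = grouped['age_years'].mean()

# .agg() computes several aggregates at once.
summary = grouped['age_years'].agg(['min', 'max', 'count'])

print(mean_age)
print(summary)
```

Nothing is computed when `.groupby()` is called; the work happens only once an aggregation such as `.mean()` or `.agg()` is requested.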

#### Datetime Objects

'Datetime' is a special data type for dates and times. We can convert an appropriately formatted column to the datetime type simply by calling `pd.to_datetime()`.

In [None]:
pd.to_datetime(animals['Date of Birth'])
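
Once a column holds datetimes, the `.dt` accessor exposes its parts and date arithmetic becomes possible. The sketch below uses made-up date strings; the month/day/year format is an assumption for this example, so check how your own column is formatted.

```python
import pandas as pd

# Hypothetical date strings in month/day/year order.
raw = pd.Series(['05/02/2013', '11/30/2015', '01/15/2017'])

# An explicit format avoids ambiguity between month-first and day-first dates.
dates = pd.to_datetime(raw, format='%m/%d/%Y')

# The .dt accessor pulls out components of each date.
years = dates.dt.year
months = dates.dt.month

# Subtracting two datetimes gives a Timedelta.
span = dates.iloc[2] - dates.iloc[0]

print(years.tolist())
print(months.tolist())
print(span.days)
```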

**Exercise: Find the latest date of birth per animal type.**

In [None]:
# First redefine Date of Birth as a series of datetime objects.
# Then group by Animal Type and calculate the max.




### 3. Reshaping a DataFrame

#### .pivot()

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

In [None]:
animals.pivot(values='Age upon Outcome', columns='Animal Type').head()
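
Pivoting is easier to follow on a tiny table where every cell can be checked by eye. The sketch below uses a hypothetical long-format frame (`long_counts`) with one row per shelter and animal kind.

```python
import pandas as pd

# Hypothetical long-format data: one row per (shelter, kind) pair.
long_counts = pd.DataFrame({
    'shelter': ['north', 'north', 'south', 'south'],
    'kind': ['dog', 'cat', 'dog', 'cat'],
    'count': [12, 7, 9, 14],
})

# index becomes the row labels, columns the column labels, values the cells.
wide = long_counts.pivot(index='shelter', columns='kind', values='count')
print(wide)
```

`.pivot()` requires each index/column combination to appear only once. When there are duplicates that need aggregating, `.pivot_table()` is the tool to reach for.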

### 4. Methods for Combining and Reshaping DataFrames: .join(), .merge(), pd.concat(), pd.melt()

#### .join()

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

In [None]:
toy1

In [None]:
toy2

In [None]:
toy1.set_index('age').join(toy2.set_index('age'))

For more on this method, check out the [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)!

#### .merge()

In [None]:
ds_chars = pd.read_csv('ds_chars.csv', index_col=0)
ds_chars

In [None]:
states = pd.read_csv('states.csv', index_col=0)
states

In [None]:
ds_chars.merge(states, left_on='home_state', right_on='state', how='inner')
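
The `how` parameter controls which rows survive the merge. The sketch below contrasts an inner and a left merge on two small made-up tables; the column names `home_state` and `state` mirror the cell above, but the data is hypothetical.

```python
import pandas as pd

# Hypothetical tables for illustration.
people = pd.DataFrame({
    'name': ['Ada', 'Grace', 'Linus'],
    'home_state': ['WA', 'NY', 'OR'],
})
regions = pd.DataFrame({
    'state': ['WA', 'NY', 'TX'],
    'region': ['West', 'Northeast', 'South'],
})

# Inner merge: keep only rows whose key appears in both tables.
inner = people.merge(regions, left_on='home_state', right_on='state', how='inner')

# Left merge: keep every row of the left table, filling gaps with NaN.
left = people.merge(regions, left_on='home_state', right_on='state', how='left')

print(inner)
print(left)
```

Linus has no matching state in `regions`, so he disappears from the inner merge and gets missing values in the left merge.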

#### pd.concat()

This function takes a *list* of pandas objects as its first argument.

N.B. The cell below will likely produce a **Deprecation Warning**.

In [None]:
ds_full = pd.concat([ds_chars, states])
ds_full

`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this particular function we need to specify whether we want to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`; pass `axis=1` to place the DataFrames side by side instead.
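
Here is a minimal sketch of the difference between the two axes, using two tiny hypothetical DataFrames (`a` and `b`) instead of the files loaded above.

```python
import pandas as pd

# Hypothetical frames with different column names.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

# axis=0 (default): stack rows; columns missing from one frame are filled with NaN.
rows = pd.concat([a, b])

# axis=1: place frames side by side, aligning on the index.
cols = pd.concat([a, b], axis=1)

print(rows)
print(cols)
```

With `axis=0` the result has four rows and several missing values, while `axis=1` gives two complete rows, which is usually what we want when the frames describe the same observations.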

#### pd.melt()

Melting unpivots a DataFrame from wide to long format: each column is turned into rows, with the column name stored in a 'variable' column and the entry stored in a 'value' column.

In [None]:
pd.melt(ds_full)
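
In practice we often want to keep one or more identifier columns fixed while melting the rest. The sketch below does this on a made-up table; the names `wide_pets`, `measure`, and `amount` are chosen only for this example.

```python
import pandas as pd

# Hypothetical wide-format data: one row per animal.
wide_pets = pd.DataFrame({
    'name': ['rex', 'tom'],
    'weight_kg': [30, 4],
    'age_years': [7, 3],
})

# id_vars stays as-is; every other column is unpivoted into rows.
melted = pd.melt(wide_pets, id_vars='name',
                 var_name='measure', value_name='amount')
print(melted)
```

Each animal now appears once per measurement, which is the layout many plotting and grouping tools expect.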

### 5. Making Use of Categories: One-Hot Encoding

Pandas has a one-hot encoder called `get_dummies()`, which is good for exploratory data analysis (EDA).

This might be good to use if we're in the **data-understanding** stage (Stage 2) of our CRISP-DM process.

We can call it on a DataFrame as a whole or on a Series (column).

In [None]:
pd.get_dummies(animals['Animal Type'])
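
To see exactly what one-hot encoding produces, here is a small sketch on a hypothetical Series (`kinds`). It also shows the `drop_first` option, which drops one of the resulting columns.

```python
import pandas as pd

# Hypothetical categorical data.
kinds = pd.Series(['dog', 'cat', 'dog', 'bird'], name='kind')

# One column per category; each row has exactly one "hot" entry.
dummies = pd.get_dummies(kinds)
print(dummies)

# drop_first=True removes the first category, which is then implied
# by all remaining columns being zero.
reduced = pd.get_dummies(kinds, drop_first=True)
print(list(reduced.columns))
```

The columns are ordered by category label, and the exact dtype of the entries (boolean or integer) depends on the pandas version.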

If, however, we're in a later stage of the process and we're interested, say, in preparing a data pipeline, `pandas.get_dummies()` will prove inferior to other tools.

In practice, we will **not** use `pandas.get_dummies()`. The library Scikit-Learn (`sklearn`, included with your Anaconda installation) has a `OneHotEncoder` class that creates an object that persists. This makes it much more apt for production environments, and so it's good to get in the habit of using it.

Ultimately, we will use **many** tools from sklearn.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe = OneHotEncoder()

In [None]:
ohe.fit(animals[['Animal Type']])

Now that the `OneHotEncoder` has been fitted to our data, it has newly available attributes and methods. In particular, it has access to the different categories that we're replacing:

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
ohe.get_feature_names_out()

We'll have much more to say about `sklearn` syntax and about Python's object structure. But let's now transform our data to see what the new table looks like:

In [None]:
ohe.transform(animals[['Animal Type']])

For the sake of saving storage space, the return is a **sparse matrix**, but we can "re-inflate" it if we want to see it in tabular form:

In [None]:
types_encoded = ohe.transform(animals[['Animal Type']]).todense()
types_encoded

Let's put it into a DataFrame:

In [None]:
pd.DataFrame(types_encoded, columns=ohe.get_feature_names_out()).head()
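
To illustrate why a persistent, fitted encoder matters, the sketch below fits on one toy table and then transforms a second table that contains a category the encoder has never seen. The frames `train` and `new` are hypothetical, and `handle_unknown='ignore'` is one possible choice for this situation, not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 'ferret' appears only in the new table.
train = pd.DataFrame({'kind': ['dog', 'cat', 'bird']})
new = pd.DataFrame({'kind': ['cat', 'ferret']})

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The fitted encoder remembers the categories it learned ...
print(encoder.categories_)

# ... and applies the same column layout to any later data.
encoded = encoder.transform(new).toarray()
print(encoded)
```

Calling `pd.get_dummies()` separately on each table would instead produce different columns for each one, which is exactly the mismatch a fitted encoder avoids.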