# More Pandas

### Introduction
Suppose you have decided to start your own animal shelter and want a sense of what that will entail before you begin planning. In this lecture, we'll explore a real data set collected by the Austin Animal Center over several years, using the pandas skills from the last lecture and picking up some new ones along the way.

#### Our goals today are to be able to:

- Apply and use `.map()` and `.applymap()` from the Pandas library
- Explain what a groupby object is and split a DataFrame using `.groupby()`
- Explain lambda functions and use them on a DataFrame
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting
- Use one-hot encoding to make use of categorical variables

#### Getting started

Let's take a moment to download and to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). What kinds of questions can we ask this data and what kinds of information can we get back?

Let's take a look at the data:

In [1]:
import numpy as np
import pandas as pd
animals = pd.read_csv('Austin_Animal_Center_Outcomes.csv')
animals.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black


What do we notice about this dataset?

In [3]:
animals.isnull()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,False,False,False,False,False,False,True,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,False,False
5,False,False,False,False,False,False,True,False,False,False,False,False
6,False,True,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,True,False,False,False,False,False
8,False,True,False,False,False,False,False,False,False,False,False,False
9,False,False,False,False,False,False,False,False,False,False,False,False


In [4]:
animals.isnull().sum()

Animal ID               0
Name                36214
DateTime                0
MonthYear               0
Date of Birth           0
Outcome Type            6
Outcome Subtype     63374
Animal Type             0
Sex upon Outcome        3
Age upon Outcome       28
Breed                   0
Color                   0
dtype: int64

In [5]:
animals.fillna('null')

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black
5,A680969,*Zeus,08/05/2014 04:59:00 PM,08/05/2014 04:59:00 PM,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby
6,A684617,,07/27/2014 09:00:00 AM,07/27/2014 09:00:00 AM,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix,Black
7,A742354,Artemis,01/22/2017 11:56:00 AM,01/22/2017 11:56:00 AM,01/20/2010,Return to Owner,,Cat,Neutered Male,7 years,Domestic Shorthair Mix,Blue/White
8,A681036,,06/11/2014 05:11:00 PM,06/11/2014 05:11:00 PM,06/09/2014,Transfer,Partner,Cat,Intact Male,2 days,Domestic Shorthair Mix,Brown Tabby
9,A803149,*Birch,08/31/2019 04:26:00 PM,08/31/2019 04:26:00 PM,08/08/2019,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair,Brown Tabby


In [6]:
animals.fillna(np.nan)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black
5,A680969,*Zeus,08/05/2014 04:59:00 PM,08/05/2014 04:59:00 PM,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby
6,A684617,,07/27/2014 09:00:00 AM,07/27/2014 09:00:00 AM,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix,Black
7,A742354,Artemis,01/22/2017 11:56:00 AM,01/22/2017 11:56:00 AM,01/20/2010,Return to Owner,,Cat,Neutered Male,7 years,Domestic Shorthair Mix,Blue/White
8,A681036,,06/11/2014 05:11:00 PM,06/11/2014 05:11:00 PM,06/09/2014,Transfer,Partner,Cat,Intact Male,2 days,Domestic Shorthair Mix,Brown Tabby
9,A803149,*Birch,08/31/2019 04:26:00 PM,08/31/2019 04:26:00 PM,08/08/2019,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair,Brown Tabby
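One detail worth keeping in mind about the cells above: `.fillna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The following is a minimal sketch on a made-up three-row DataFrame (the `toy` data and the `'unknown'` placeholder are illustrative assumptions, not part of the shelter data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing name.
toy = pd.DataFrame({'Name': ['Chunk', np.nan, 'Moose'],
                    'Age': [2, 1, 4]})

# fillna() builds and returns a new DataFrame.
filled = toy.fillna('unknown')

# The original still has its missing value; the copy does not.
n_missing_before = toy['Name'].isnull().sum()
n_missing_after = filled['Name'].isnull().sum()
print(n_missing_before, n_missing_after)
```

To keep the filled values, assign the result to a name, as `filled` does here.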


### 1. Applying and using map and applymap from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

#### DataFrame.applymap() and Series.map()

The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame. (In pandas 2.1 and later, `DataFrame.applymap()` is deprecated in favor of the equivalent `DataFrame.map()`.)

In [7]:
animals.applymap(str).head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black


The `.map()` method takes a function as input and applies it to every entry in the Series.
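As a small illustration of `.map()`, here is a sketch on a made-up Series standing in for the `Animal Type` column (the values and the sound lookup are assumptions for the example). Besides a function, `.map()` also accepts a dictionary, in which case each entry is looked up and entries without a key become `NaN`:

```python
import pandas as pd

# Hypothetical toy Series; not read from the shelter file.
types = pd.Series(['Cat', 'Dog', 'Cat', 'Bird'])

# Passing a function: it is called once per element.
lengths = types.map(len)

# Passing a dict: each value is looked up; missing keys give NaN.
sounds = types.map({'Cat': 'meow', 'Dog': 'woof'})

print(lengths.tolist())
print(sounds.tolist())
```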

In [11]:
# This line of code splits each ID on the letter 'A' and adds the two parts as new columns.
# The part before the 'A' is an empty string; the part after it is the numeric portion.

animals[['Animal ID Prefix', 'Animal ID Num']] = animals['Animal ID'].str.split('A', expand=True)

In [12]:
animals

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Animal ID Prefix,Animal ID Num
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,,794011
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,,776359
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,,720371
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,,674754
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,,689724
5,A680969,*Zeus,08/05/2014 04:59:00 PM,08/05/2014 04:59:00 PM,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby,,680969
6,A684617,,07/27/2014 09:00:00 AM,07/27/2014 09:00:00 AM,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix,Black,,684617
7,A742354,Artemis,01/22/2017 11:56:00 AM,01/22/2017 11:56:00 AM,01/20/2010,Return to Owner,,Cat,Neutered Male,7 years,Domestic Shorthair Mix,Blue/White,,742354
8,A681036,,06/11/2014 05:11:00 PM,06/11/2014 05:11:00 PM,06/09/2014,Transfer,Partner,Cat,Intact Male,2 days,Domestic Shorthair Mix,Brown Tabby,,681036
9,A803149,*Birch,08/31/2019 04:26:00 PM,08/31/2019 04:26:00 PM,08/08/2019,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair,Brown Tabby,,803149


In [13]:
# Now: How can we convert the Animal ID Num column to integers?

animals['Animal ID Num'] = animals['Animal ID Num'].map(int)

In [15]:
animals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115604 entries, 0 to 115603
Data columns (total 14 columns):
Animal ID           115604 non-null object
Name                79390 non-null object
DateTime            115604 non-null object
MonthYear           115604 non-null object
Date of Birth       115604 non-null object
Outcome Type        115598 non-null object
Outcome Subtype     52230 non-null object
Animal Type         115604 non-null object
Sex upon Outcome    115601 non-null object
Age upon Outcome    115576 non-null object
Breed               115604 non-null object
Color               115604 non-null object
Animal ID Prefix    115604 non-null object
Animal ID Num       115604 non-null int64
dtypes: int64(1), object(13)
memory usage: 12.3+ MB


Or we could have just used the `.astype()` method:

In [16]:
animals['Animal ID Num'] = animals['Animal ID Num'].astype(int)

#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined inline, right where they are passed as an argument, using the `lambda` keyword. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [19]:
animals['Animal ID Num'].map(lambda x: x*2)[:4]

0    1588022
1    1552718
2    1440742
3    1349508
Name: Animal ID Num, dtype: int64
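To make the connection with ordinary functions explicit, the sketch below defines the same doubling logic twice, once with `def` and once with `lambda`, and checks that both give the same result. The Series `nums` and the function name `double` are made up for the example:

```python
import pandas as pd

nums = pd.Series([1, 2, 3])

def double(x):
    """Named function: return twice the input."""
    return x * 2

with_def = nums.map(double)
with_lambda = nums.map(lambda x: x * 2)   # same logic, defined inline

print(with_def.equals(with_lambda))
```

A lambda is convenient for one-off logic; a named function is usually easier to read and test once the logic grows beyond a single expression.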

**Exercise: Use an anonymous function to add 'approximately' in front of the entries in Age upon Outcome**

In [None]:
# Your code here!



What went wrong? How can we fix it?

### 2. Methods for Re-Organizing DataFrames: .groupby()

Those of you familiar with SQL have probably used the `GROUP BY` command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method splits a DataFrame into groups that share a value in one or more columns. It is especially useful for applying aggregate functions to each group.
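A groupby object does not compute anything until we ask it to. As a rough sketch of the split-then-aggregate idea, the example below uses a made-up DataFrame (the `Weight` column and its values are assumptions, not part of the shelter data):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'Animal Type': ['Cat', 'Dog', 'Cat', 'Dog', 'Bird'],
                    'Weight': [4.0, 20.0, 5.0, 30.0, 0.5]})

grouped = toy.groupby('Animal Type')

# Split: iterating yields (group key, sub-DataFrame) pairs.
sizes = {key: len(sub_df) for key, sub_df in grouped}

# Apply and combine: an aggregate reduces each group to one value.
mean_weight = grouped['Weight'].mean()

print(sizes)
print(mean_weight)
```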

In [20]:
animals.groupby('Animal Type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11c743860>

#### .groups and .get_group()

In [21]:
animals.groupby('Animal Type').groups

{'Bird': Int64Index([   178,    460,    843,    880,   1106,   1938,   1973,   2193,
               2260,   2352,
             ...
             112768, 112848, 113406, 113590, 113640, 113650, 114487, 114615,
             114909, 114912],
            dtype='int64', length=539),
 'Cat': Int64Index([     0,      3,      4,      5,      6,      7,      8,      9,
                 10,     11,
             ...
             115522, 115523, 115526, 115537, 115538, 115552, 115554, 115555,
             115574, 115577],
            dtype='int64', length=43482),
 'Dog': Int64Index([     1,      2,     14,     15,     16,     18,     21,     22,
                 23,     24,
             ...
             115594, 115595, 115596, 115597, 115598, 115599, 115600, 115601,
             115602, 115603],
            dtype='int64', length=65651),
 'Livestock': Int64Index([  2087,   8935,  11581,  12595,  22997,  27928,  33483,  44421,
              45189,  49504,  67426,  72863,  85749,  90668,  93594, 10244

In [22]:
animals.groupby('Animal Type').get_group('Livestock')

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Animal ID Prefix,Animal ID Num
2087,A668167,,11/30/2013 12:18:00 PM,11/30/2013 12:18:00 PM,05/28/2013,Return to Owner,,Livestock,Intact Female,6 months,Pig Mix,Black/White,,668167
8935,A803469,,09/08/2019 08:00:00 AM,09/08/2019 08:00:00 AM,09/01/2017,Return to Owner,,Livestock,Intact Female,2 years,Pygmy,Tan/Black,,803469
11581,A788252,Bacon,01/29/2019 01:25:00 PM,01/29/2019 01:25:00 PM,09/29/2018,Return to Owner,,Livestock,Unknown,4 months,Pig Mix,Gray,,788252
12595,A764822,,02/12/2018 04:51:00 PM,02/12/2018 04:51:00 PM,01/06/2016,Adoption,,Livestock,Intact Female,2 years,Potbelly Pig Mix,White,,764822
22997,A742204,,03/15/2017 12:49:00 PM,03/15/2017 12:49:00 PM,07/17/2016,Transfer,Partner,Livestock,Intact Female,7 months,Potbelly Pig Mix,Black/White,,742204
27928,A673651,,03/11/2014 02:39:00 PM,03/11/2014 02:39:00 PM,02/28/2013,Adoption,Foster,Livestock,Neutered Male,1 year,Pig Mix,Black/White,,673651
33483,A718910,,01/27/2016 12:00:00 AM,01/27/2016 12:00:00 AM,01/09/2015,Transfer,Partner,Livestock,Intact Male,1 year,Pig Mix,White,,718910
44421,A782370,,10/14/2018 02:12:00 PM,10/14/2018 02:12:00 PM,09/29/2018,Died,At Vet,Livestock,Unknown,2 weeks,Pig,Black,,782370
45189,A778971,,04/02/2019 11:35:00 AM,04/02/2019 11:35:00 AM,04/23/2018,Adoption,Foster,Livestock,Intact Female,11 months,Pig Mix,Gray/Black,,778971
49504,A811675,,01/08/2020 09:41:00 AM,01/08/2020 09:41:00 AM,01/07/2018,,,Livestock,Intact Female,,Goat,Black/White,,811675


#### Aggregating

In [23]:
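# Select the numeric column explicitly: recent pandas versions raise an error
# when std() is applied to text columns.
animals.groupby('Animal Type')[['Animal ID Num']].std()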

Animal Type,Animal ID Num
Bird,46031.824323
Cat,49606.981267
Dog,59746.226032
Livestock,52471.760994
Other,42326.302901
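Several aggregates can be computed at once with `.agg()`. This is a minimal sketch on made-up data (the numeric `Age` column is an assumption; in the shelter data, ages are stored as text such as "2 years"):

```python
import pandas as pd

# Hypothetical toy data with ages already in numeric form.
toy = pd.DataFrame({'Animal Type': ['Cat', 'Dog', 'Cat', 'Dog'],
                    'Age': [2, 1, 4, 7]})

# One row per group, one column per aggregate.
summary = toy.groupby('Animal Type')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```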


#### Datetime Objects

'Datetime' is a special data type for dates and times. We can convert an appropriately formatted column to the datetime type by calling `pd.to_datetime()`.

In [24]:
pd.to_datetime(animals['Date of Birth'])

0        2017-05-02
1        2017-07-12
2        2015-10-08
3        2014-03-12
4        2014-08-01
5        2014-06-03
6        2012-07-26
7        2010-01-20
8        2014-06-09
9        2019-08-08
10       2014-06-05
11       2018-05-05
12       2016-04-15
13       2016-05-18
14       2009-01-18
15       2007-10-23
16       2017-02-25
17       2017-03-18
18       2009-07-22
19       2019-05-06
20       2017-02-15
21       2013-12-10
22       2015-09-08
23       2019-02-15
24       2016-06-06
25       2017-01-25
26       2013-04-03
27       2015-10-30
28       2010-03-15
29       2015-04-17
            ...    
115574   2015-07-15
115575   2017-10-07
115576   2018-02-19
115577   2013-06-11
115578   2018-11-19
115579   2018-02-09
115580   2019-07-22
115581   2019-04-08
115582   2017-02-09
115583   2014-03-22
115584   2019-02-09
115585   2019-02-11
115586   2019-02-11
115587   2018-12-23
115588   2018-02-07
115589   2019-02-14
115590   2019-05-13
115591   2018-02-13
115592   2019-12-22
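Once a column has the datetime type, its parts are available through the `.dt` accessor, and dates can be subtracted from one another. The sketch below uses three date strings in the same month/day/year layout as the `Date of Birth` column; passing an explicit `format` is optional but removes any ambiguity between month-first and day-first dates:

```python
import pandas as pd

# Strings laid out like the 'Date of Birth' column (month/day/year).
births = pd.Series(['05/02/2017', '07/12/2017', '10/08/2015'])

# Convert to datetime with an explicit format string.
birth_dates = pd.to_datetime(births, format='%m/%d/%Y')

# .dt exposes the parts of each date.
birth_years = birth_dates.dt.year

# Subtracting two dates gives a Timedelta.
span = birth_dates.max() - birth_dates.min()

print(birth_years.tolist())
print(span)
```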


**Exercise: Find the latest date of birth per animal type.**

In [25]:
# First redefine Date of Birth as a series of datetime objects.
# Then group by Animal Type and calculate the max.




### 3. Reshaping a DataFrame

#### .pivot()

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

In [26]:
animals.pivot(values='Age upon Outcome', columns='Animal Type').head()

Animal Type,Bird,Cat,Dog,Livestock,Other
0,,2 years,,,
1,,,1 year,,
2,,,4 months,,
3,,6 days,,,
4,,2 months,,,
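In the cell above no `index` is passed, so the existing row index is kept and each row has a value in only one column. Passing `index` as well reshapes a long table into a compact wide one. The following sketch uses made-up yearly counts (the `year` and `count` columns and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, animal type).
long_df = pd.DataFrame({'year': [2018, 2018, 2019, 2019],
                        'Animal Type': ['Cat', 'Dog', 'Cat', 'Dog'],
                        'count': [10, 20, 12, 25]})

# Rows come from 'year', columns from 'Animal Type', cells from 'count'.
wide_df = long_df.pivot(index='year', columns='Animal Type', values='count')
print(wide_df)
```

Note that `.pivot()` requires each index/column pair to appear only once; `pd.pivot_table()` handles duplicates by aggregating them.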


### 4. Methods for Combining DataFrames: .join(), .merge(), .concat(), .melt()

#### .join()

In [27]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

In [28]:
toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [29]:
toy2

Unnamed: 0,age,MP
0,63,100
1,33,200


In [30]:
toy1.set_index('age').join(toy2.set_index('age'))

age,HP,MP
63,142,100
33,47,200


For more on this method, check out the [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)!

#### .merge()

In [31]:
ds_chars = pd.read_csv('ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [32]:
states = pd.read_csv('states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


In [33]:
ds_chars.merge(states, left_on='home_state', right_on='state', how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington
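The `how` argument controls which rows survive the merge. As a sketch, the example below uses small made-up tables in which one character's state has no match and one state has no character ('pat' and 'NY' are hypothetical additions, not rows from the files above):

```python
import pandas as pd

# Hypothetical toy tables.
chars = pd.DataFrame({'name': ['greg', 'alan', 'pat'],
                      'home_state': ['WA', 'TX', 'NY']})
states = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                       'capital': ['Olympia', 'Austin', 'Columbus']})

keys = dict(left_on='home_state', right_on='state')

inner = chars.merge(states, how='inner', **keys)  # only matching rows
left = chars.merge(states, how='left', **keys)    # every row of chars
outer = chars.merge(states, how='outer', **keys)  # every row of both

print(len(inner), len(left), len(outer))
```

Rows without a match are kept by `'left'` and `'outer'` with `NaN` in the columns that come from the other table.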


#### pd.concat()

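This function takes a *list* of pandas objects as its first argument.

N.B. In older versions of pandas, the cell below produces a warning about column sorting. Passing `sort=False` to `pd.concat()` accepts the future behavior and silences the warning.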

In [34]:
ds_full = pd.concat([ds_chars, states])
ds_full

...of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.


Unnamed: 0,HP,capital,home_state,name,nickname,state
0,200.0,,WA,greg,,
1,200.0,,WA,miles,,
2,170.0,,TX,alan,,
3,300.0,,DC,alison,,
4,200.0,,TX,rachel,,
0,,Olympia,,,evergreen,WA
1,,Austin,,,alamo,TX
2,,Washington,,,district,DC
3,,Columbus,,,buckeye,OH
4,,Salem,,,beaver,OR


`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this particular function we need to specify whether we want to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`; pass `axis=1` to override it.
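The difference between the two axes can be sketched on two small made-up DataFrames (the names `left` and `right` and their contents are assumptions that loosely mirror `ds_chars` and `states`):

```python
import pandas as pd

# Hypothetical toy tables with the same row index (0, 1).
left = pd.DataFrame({'name': ['greg', 'miles'], 'HP': [200, 200]})
right = pd.DataFrame({'nickname': ['evergreen', 'alamo'],
                      'capital': ['Olympia', 'Austin']})

# axis=0: rows are appended; columns missing from a table are filled with NaN.
stacked = pd.concat([left, right], axis=0, sort=False)

# axis=1: columns are placed side by side; rows are aligned on the index.
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape, side_by_side.shape)
```

Column-wise concatenation lines rows up by index label only, so it is appropriate when the rows of the two tables really do correspond; otherwise a merge on a key column is the safer choice.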

#### pd.melt()

Melting unpivots a DataFrame from wide to long format: every column name goes into a 'variable' column and the corresponding entry goes into a 'value' column.

In [35]:
pd.melt(ds_full)

Unnamed: 0,variable,value
0,HP,200
1,HP,200
2,HP,170
3,HP,300
4,HP,200
5,HP,
6,HP,
7,HP,
8,HP,
9,HP,
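Melting every column, as above, loses track of which row each value came from. In practice it is common to keep one or more identifier columns fixed with `id_vars`. A minimal sketch on made-up data (the `MP` values here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical wide-format data: one row per character.
wide_df = pd.DataFrame({'name': ['greg', 'alan'],
                        'HP': [200, 170],
                        'MP': [100, 50]})

# Keep 'name' as an identifier; unpivot 'HP' and 'MP' into two columns.
long_df = pd.melt(wide_df, id_vars='name', value_vars=['HP', 'MP'],
                  var_name='stat', value_name='points')
print(long_df)
```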


### 5. Making Use of Categories: One-Hot Encoding

Pandas has a one-hot encoder called `get_dummies()`, which is good for exploratory data analysis (EDA).

This might be good to use if we're in the **data-understanding** stage (Stage 2) of our CRISP-DM process.

We can call it on a DataFrame as a whole or on a Series (column).

In [37]:
pd.get_dummies(animals['Animal Type'])

Unnamed: 0,Bird,Cat,Dog,Livestock,Other
0,0,1,0,0,0
1,0,0,1,0,0
2,0,0,1,0,0
3,0,1,0,0,0
4,0,1,0,0,0
5,0,1,0,0,0
6,0,1,0,0,0
7,0,1,0,0,0
8,0,1,0,0,0
9,0,1,0,0,0
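The cell above encodes a single Series. When `get_dummies()` is called on a whole DataFrame, the `columns` argument selects which columns to encode and the rest pass through unchanged. A sketch on made-up data (the numeric `Age` column is an assumption):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'Animal Type': ['Cat', 'Dog', 'Cat'],
                    'Age': [2, 1, 4]})

# Encode only 'Animal Type'; new columns are prefixed with the column name.
encoded = pd.get_dummies(toy, columns=['Animal Type'])
print(encoded.columns.tolist())
```

Depending on the pandas version, the indicator columns hold 0/1 integers or `True`/`False` booleans.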


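If, however, we're in a later stage of the process and we're interested, say, in preparing a data pipeline, `pandas.get_dummies()` will prove inferior to other tools: it does not remember the categories it has seen, so new data may end up encoded with different columns.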

In practice, we will **not** use `pandas.get_dummies()`. The library Scikit-Learn (`sklearn`, included with your Anaconda installation) has a `OneHotEncoder` class that creates an object that persists. This makes it much more apt for production environments, and so it's good to get in the habit of using it.

Ultimately, we will use **many** tools from sklearn.

In [38]:
from sklearn.preprocessing import OneHotEncoder

In [41]:
ohe = OneHotEncoder()

In [42]:
ohe.fit(animals[['Animal Type']])

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=True)

Now that the `OneHotEncoder` has been fitted to our data, it has newly available attributes and methods. In particular, it has access to the different categories that we're replacing:

In [43]:
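# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead.
ohe.get_feature_names()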

array(['x0_Bird', 'x0_Cat', 'x0_Dog', 'x0_Livestock', 'x0_Other'],
      dtype=object)

We'll have much more to say about `sklearn` syntax and about Python's object structure. But let's now transform our data to see what the new table looks like:

In [44]:
ohe.transform(animals[['Animal Type']])

<115604x5 sparse matrix of type '<class 'numpy.float64'>'
	with 115604 stored elements in Compressed Sparse Row format>

For the sake of saving storage space, the return value is a **sparse matrix**, but we can "re-inflate" it if we want to see it in tabular form:

In [45]:
types_encoded = ohe.transform(animals[['Animal Type']]).todense()
types_encoded

matrix([[0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.],
        ...,
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 1., 0., 0.]])

Let's put it into a DataFrame:

In [46]:
pd.DataFrame(types_encoded, columns=ohe.get_feature_names()).head()

Unnamed: 0,x0_Bird,x0_Cat,x0_Dog,x0_Livestock,x0_Other
0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0
4,0.0,1.0,0.0,0.0,0.0
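The practical benefit of an encoder object that persists is that it can be fitted once and then reused on new data with the same column layout. The sketch below assumes scikit-learn is installed and uses two made-up DataFrames, `train` and `new`; setting `handle_unknown='ignore'` is one possible choice for categories that were not seen during fitting, not the only one:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; 'Livestock' does not appear in train.
train = pd.DataFrame({'Animal Type': ['Cat', 'Dog', 'Bird']})
new = pd.DataFrame({'Animal Type': ['Dog', 'Livestock']})

# With handle_unknown='ignore', unseen categories become all-zero rows
# instead of raising an error.
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train)

# The fitted categories are stored on the object and reused here.
new_encoded = ohe.transform(new).toarray()

print(ohe.categories_[0])
print(new_encoded)
```

The encoded output always has one column per category learned during `fit`, whatever values the new data happens to contain.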
