# Module 1 - Manipulating data with Pandas (continued)
## Pandas Part 2

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)

## Scenario:
You have decided to start your own animal shelter, but first you want an idea of what that will entail and more information for planning. In this lecture, we continue to look at a real data set collected by the Austin Animal Center over several years, using our pandas skills from the last lecture and learning some new ones in order to explore the data further.

#### _Our goals today are to be able to_: <br/>

Use the pandas library to:

- Get summary info about a dataset and its variables
  - Apply and use info, describe and dtypes
  - Use mean, min, max, and value_counts 
- Use apply and applymap to transform columns and create new values

- Explain lambda functions and use them with `apply` on a DataFrame
- Explain what a groupby object is and split a DataFrame using a groupby
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting


In [1]:
import pandas as pd
import numpy as np

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. For comparison, recall how selection works in core Python:
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [3]:
l = [1, 2, 3, 4, 5]
l[[0, 2, 4]]

TypeError: list indices must be integers or slices, not list
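
The error above is the limitation described in the bullets. As a minimal sketch of the contrast, a plain list needs a comprehension to pick several positions, while a pandas Series accepts a list of positions through `.iloc`. The names `picked_from_list` and `picked_from_series` are introduced here only for illustration.

```python
import pandas as pd

l = [1, 2, 3, 4, 5]

# A plain list cannot take a list of positions, so loop over them instead.
picked_from_list = [l[i] for i in [0, 2, 4]]

# A pandas Series accepts a list of integer positions via .iloc.
s = pd.Series(l)
picked_from_series = s.iloc[[0, 2, 4]]

print(picked_from_list)
print(picked_from_series)
```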

### DataFrames can be indexed by column name (label), by row name (index label), or by position.
#### The `.loc` indexer selects by label.
#### The `.iloc` indexer selects by integer position.

In [4]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New York']
}

students_df = pd.DataFrame(student_dict)

In [6]:
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [7]:
students_df.size

9

In [10]:
students_df.loc[:, 'name']

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help you remember what it does.
#### `.iloc[row, column]`

In [14]:
# returns the first row
students_df.iloc[0, :]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [15]:
# returns the first column
students_df.iloc[:, 0]

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

In [16]:
# returns the first two rows; notice that .iloc follows regular Python slicing (the end position is excluded).
students_df.iloc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [17]:
# returns the first two columns
students_df.iloc[:, 0:2]

Unnamed: 0,name,age
0,Samantha,35
1,Alex,17
2,Dante,26


In [18]:
# returns the first row and the first two columns (positions 0 and 1)
students_df.iloc[0:1, 0:2]

Unnamed: 0,name,age
0,Samantha,35


### How would we use `.iloc` to return the last row?


In [21]:
# return the last row using iloc
students_df.iloc[-1]

name       Dante
age           26
city    New York
Name: 2, dtype: object

### How would we use `.iloc` to return the last column?


In [22]:
# return the last column using iloc
students_df.iloc[:, -1]

0     Houston
1     Seattle
2    New York
Name: city, dtype: object

In [23]:
# return the last item in the last row (last row, last column) using iloc
students_df.iloc[-1, -1]

'New York'

### What if we only want certain columns or rows?

In [24]:
# Pass a list of row positions. students_df.iloc[0, 2] would instead return the single value at row 0, column 2.
students_df.iloc[[0, 2]]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
2,Dante,26,New York


In [25]:
students_df.iloc[[0, 2], [0, 2]]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New York


### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [26]:
# We will use loc to return rows and columns based on labels. Let's look at the students_df DataFrame again.
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [27]:
# returns the student information associated with index 0
students_df.loc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [28]:
# returns the student information for row labels 0 to 2, inclusive.
# note: .loc slices include the end label, whereas .iloc[0:2] follows normal Python slicing and excludes position 2.
students_df.loc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [29]:
# returns the column labeled 'age'
students_df.loc[:, 'age']

0    35
1    17
2    26
Name: age, dtype: object

In [30]:
# returns the values in the column labeled 'age'
# for the rows with index labels 1 to 2 (inclusive)
students_df.loc[1:2, 'age']

1    17
2    26
Name: age, dtype: object

In [31]:
# returns the rows with index labels 1 to 2 (inclusive)
# and the columns labeled 'age' through 'city' (inclusive)
students_df.loc[1:2, 'age':'city']

Unnamed: 0,age,city
1,17,Seattle
2,26,New York


In [35]:
# What should we get?
students_df.loc[1:2, ['name', 'city']]

Unnamed: 0,name,city
1,Alex,Seattle
2,Dante,New York


In [36]:
# How about?
students_df.loc[[0, 2], ['name', 'city']]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New York


In [37]:
# What if the index holds custom labels instead of the default 0, 1, 2?
school_ids = ['5', '11', '3']
students_df = pd.DataFrame(student_dict, index=school_ids)

In [38]:
students_df

Unnamed: 0,name,age,city
5,Samantha,35,Houston
11,Alex,17,Seattle
3,Dante,26,New York


In [40]:
# What should we get now?
students_df.loc[[0, 2], ['name', 'city']]

KeyError: "None of [Int64Index([0, 2], dtype='int64')] are in the [index]"

In [43]:
# What should we get now?
students_df.loc[['5', '11'], ['name', 'city']]

Unnamed: 0,name,city
5,Samantha,Houston
11,Alex,Seattle


In [44]:
students_df.loc['5':'11']

Unnamed: 0,name,age,city
5,Samantha,35,Houston
11,Alex,17,Seattle


In [45]:
students_df = students_df.set_index("name")
students_df

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New York


In [46]:
students_df.loc[['Samantha']]

name,age,city
Samantha,35,Houston


In [47]:
# Subsetting nonconsecutive rows
students_df.loc[['Samantha', 'Dante']]

name,age,city
Samantha,35,Houston
Dante,26,New York


In [48]:
# Samantha to the end
students_df.loc['Samantha':]

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New York


In [53]:
# return the first and last rows using one loc command
students_df.loc[['Samantha', 'Dante']]

name,age,city
Samantha,35,Houston
Dante,26,New York


### Boolean Subsetting

In [54]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New York', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

In [55]:
students_df['name'] == 'Samantha'

0     True
1    False
2    False
3     True
Name: name, dtype: bool

In [56]:
# The expression students_df['name'] == 'Samantha' produces a pandas Series with a True/False value for every row
# of students_df: True where the name is 'Samantha', False elsewhere.
# Boolean Series like this can be passed directly to the .loc indexer.
students_df.loc[students_df['name'] == 'Samantha']

Unnamed: 0,name,age,city,state
0,Samantha,35,Houston,Texas
3,Samantha,21,Atlanta,Georgia


In [57]:
# What about if we only want the city and state of the selected students with the name Samantha?
students_df.loc[students_df['name'] == 'Samantha', ['city', 'state']]

Unnamed: 0,city,state
0,Houston,Texas
3,Atlanta,Georgia


In [58]:
# What about if we want to select a student of a specific age? (Ages are stored as strings here.)
students_df.loc[students_df['age'] == '21']

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [62]:
(students_df['age'] == '21') & (students_df['city'] == 'Atlanta')

0    False
1    False
2    False
3     True
dtype: bool

In [63]:
# What about if we want to select a student of a specific age who also lives in a specific city?
students_df.loc[(students_df['age'] == '21') &
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [64]:
(students_df['age'] == '35') & (students_df['city'] == 'Atlanta')

0    False
1    False
2    False
3    False
dtype: bool

In [65]:
# What should be returned?
students_df.loc[(students_df['age'] == '35') &
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state


In [66]:
_.size

0
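
The comparisons above match `age` as text because the column was built from strings. The following is a minimal sketch, assuming we are free to convert the column to integers, of numeric comparisons together with the `|` (or) and `~` (not) operators and `.isin()`. The DataFrame name `students_num` and the result names are introduced here for illustration only.

```python
import pandas as pd

students_num = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New York', 'Atlanta'],
})

# Convert the string ages to integers so that < and > compare numbers.
students_num['age'] = students_num['age'].astype(int)

# Students older than 25
over_25 = students_num.loc[students_num['age'] > 25]

# | means "or" and ~ means "not"; each condition needs its own parentheses.
young_or_houston = students_num.loc[(students_num['age'] < 20) |
                                    (students_num['city'] == 'Houston')]
not_samantha = students_num.loc[~(students_num['name'] == 'Samantha')]

# .isin() tests membership in a list of values.
in_cities = students_num.loc[students_num['city'].isin(['Seattle', 'Atlanta'])]
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.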

## Getting started

Before we look back at the animal shelter data, let's practice on a simpler dataset.
Read about this dataset here: https://www.kaggle.com/ronitf/heart-disease-uci
![heart-data](images/heartbloodpres.jpeg)

The dataset is most often used to practice classification algorithms. Can one develop a model to predict the likelihood of heart disease based on other measurable characteristics? We will return to that specific question in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

### 1. Get summary info about a dataset and its variables

Applying and using `info`, `describe`, `mean`, `min`, `max`, `apply`, and `applymap` from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

In [68]:
!pwd
!ls -al data

/Users/enkeboll/code/fis/dc-ds-010620/module-1/day-6-pandas-2
total 40
drwxr-xr-x  5 enkeboll  staff    160 Jan 10 15:43 .
drwxr-xr-x  7 enkeboll  staff    224 Jan 13 10:14 ..
-rw-r--r--  1 enkeboll  staff     95 Apr 24  2019 ds_chars.csv
-rwxr-xr-x  1 enkeboll  staff  11328 Dec 30 11:09 heart.csv
-rw-r--r--  1 enkeboll  staff    130 Dec 30 11:09 states.csv


In [69]:
import pandas as pd
uci = pd.read_csv('data/heart.csv')

In [70]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


#### The `.columns` and `.shape` Attributes

In [71]:
uci.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [72]:
uci.shape

(303, 14)

In [73]:
uci.size

4242

#### The `.info()` and `.describe()` methods and the `.dtypes` attribute

Pandas DataFrames have many useful methods and attributes! Let's look at `.info()`, `.describe()`, and `.dtypes`.

In [74]:
# Call the .info() method on our dataset. What do you observe?

uci.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [75]:
# Call the .describe() method on our dataset. What do you observe?

uci.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [None]:
# Run the code below. How does the output differ from .info()?
uci.dtypes

#### `.mean()`, `.min()`, `.max()`, `.sum()`

The methods `.mean()`, `.min()`, `.max()`, and `.sum()` do what their names suggest.

Note that these are methods both for Series and for DataFrames.

In [76]:
uci.ca.mean()

0.7293729372937293

In [79]:
uci.mean(axis=1)

0      42.878571
1      43.892857
2      39.600000
3      42.771429
4      50.114286
         ...    
298    40.514286
299    40.085714
300    39.885714
301    31.585714
302    43.000000
Length: 303, dtype: float64

#### The Axis Variable

In [81]:
uci.sum(axis=1) # Try [shift] + [tab] here!

0      600.3
1      614.5
2      554.4
3      598.8
4      701.6
       ...  
298    567.2
299    561.2
300    558.4
301    442.2
302    602.0
Length: 303, dtype: float64
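
The `axis` argument decides in which direction an aggregation collapses the data. Here is a small sketch on a made-up DataFrame (`toy` is an illustrative name, not part of the lesson data):

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0 (the default) collapses the rows: one result per column.
col_sums = toy.sum(axis=0)

# axis=1 collapses the columns: one result per row.
row_sums = toy.sum(axis=1)

print(col_sums)
print(row_sums)
```

Note that summing across a row of `uci` adds quantities measured in different units, so the `axis=1` calls above are best read as a demonstration of the mechanics rather than a meaningful statistic.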

#### `.value_counts()`

For a _Series_ (such as a single DataFrame column), the `.value_counts()` method tells you how many times each value occurs.

In [82]:
uci['age'].value_counts()[:10]

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
Name: age, dtype: int64

Exercise: What are the different values for restecg?

In [87]:
# Your code here!
uci.restecg.value_counts()

1    152
0    147
2      4
Name: restecg, dtype: int64

In [84]:
uci.restecg.unique()

array([0, 1, 2])
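
The two cells above answer slightly different questions: `.value_counts()` reports how often each value occurs, while `.unique()` only lists the distinct values. This is a short sketch on a made-up Series (`codes` is an illustrative name) that also shows the `normalize=True` option and `.nunique()`:

```python
import pandas as pd

codes = pd.Series([1, 0, 1, 2, 1, 0], name='restecg_demo')

counts = codes.value_counts()                # counts, most frequent first
shares = codes.value_counts(normalize=True)  # proportions instead of counts
distinct = codes.unique()                    # distinct values, in order of first appearance
n_distinct = codes.nunique()                 # how many distinct values there are
```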

### Apply to Animal Shelter Data
Using `.info()` and `.describe()` and `dtypes` what observations can we make about the data?

What are the breed value counts?

How about age counts for dogs?

In [88]:
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [89]:
# Breed value counts
animal_outcomes.Breed.value_counts()

Domestic Shorthair Mix                            30591
Pit Bull Mix                                       8207
Labrador Retriever Mix                             6518
Chihuahua Shorthair Mix                            6101
Domestic Shorthair                                 4533
                                                  ...  
Standard Schnauzer/Soft Coated Wheaten Terrier        1
Pit Bull/Queensland Heeler                            1
Finnish Spitz/German Shepherd                         1
Airedale Terrier/Miniature Schnauzer                  1
Hovawart                                              1
Name: Breed, Length: 2536, dtype: int64

In [93]:
# Age counts for dogs
animal_outcomes.loc[animal_outcomes['Animal Type'] == 'Dog', 'Age upon Outcome'].value_counts()

1 year       12571
2 years      11797
3 years       5272
2 months      5081
4 years       3254
5 years       2985
1 month       2026
6 years       2022
7 years       1691
8 years       1620
4 months      1564
3 months      1513
5 months      1508
6 months      1423
10 months     1283
8 months      1281
10 years      1232
7 months       945
9 years        930
9 months       850
12 years       590
11 months      559
11 years       509
13 years       361
14 years       246
4 weeks        234
15 years       190
2 weeks        178
1 weeks        169
3 weeks        131
16 years        91
2 days          89
1 week          81
1 day           64
0 years         55
6 days          46
17 years        42
3 days          32
5 weeks         27
5 days          23
18 years        22
4 days          20
19 years        14
-1 years         4
20 years         3
24 years         1
-3 years         1
Name: Age upon Outcome, dtype: int64

In [95]:
animal_outcomes.loc[animal_outcomes['Age upon Outcome'] == '-3 years']

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
44700,A687107,Montopolis,02/25/2016 06:04:00 PM,02/25/2016 06:04:00 PM,03/17/2019,Return to Owner,,Dog,Neutered Male,-3 years,Rhod Ridgeback,Red/Brown


What are the breed `value_counts` for dogs?

What's the top breed for adopted dogs?

How about outcome counts for dogs?




In [97]:
# Breed value_counts for dogs?
animal_outcomes.loc[animal_outcomes['Animal Type'] == 'Dog', 'Breed'].value_counts()

Pit Bull Mix                         8207
Labrador Retriever Mix               6518
Chihuahua Shorthair Mix              6101
German Shepherd Mix                  2825
Australian Cattle Dog Mix            1428
                                     ... 
Smooth Fox Terrier/Basenji              1
Basset Hound/English Pointer            1
Chihuahua Shorthair/Affenpinscher       1
Shih Tzu/Pug                            1
Hovawart                                1
Name: Breed, Length: 2242, dtype: int64

In [100]:
# Top breed for adopted dogs?
animal_outcomes.loc[(animal_outcomes['Animal Type'] == 'Dog') &
                    (animal_outcomes['Outcome Type'] == 'Adoption'), 'Breed'].value_counts()[:1]

Labrador Retriever Mix    3332
Name: Breed, dtype: int64

In [101]:
# Outcome counts for dogs?
animal_outcomes.loc[animal_outcomes['Animal Type'] == 'Dog', 'Outcome Type'].value_counts()

Adoption           30226
Return to Owner    17980
Transfer           14001
Euthanasia          1693
Rto-Adopt            455
Died                 229
Missing               29
Disposal              21
Name: Outcome Type, dtype: int64

In [106]:
animal_outcomes.loc[:, 'Name':'Breed'].iloc[2:7]

Unnamed: 0,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed
2,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix
3,*Zeus,08/05/2014 04:59:00 PM,08/05/2014 04:59:00 PM,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix
4,,07/27/2014 09:00:00 AM,07/27/2014 09:00:00 AM,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix
5,Artemis,01/22/2017 11:56:00 AM,01/22/2017 11:56:00 AM,01/20/2010,Return to Owner,,Cat,Neutered Male,7 years,Domestic Shorthair Mix
6,,06/11/2014 05:11:00 PM,06/11/2014 05:11:00 PM,06/09/2014,Transfer,Partner,Cat,Intact Male,2 days,Domestic Shorthair Mix


### 2.  Changing data

#### DataFrame.applymap() and Series.map()

The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame.

In [116]:
def successor(x):
    return x + 1

In [114]:
uci.applymap(successor).head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2
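
In pandas 2.1 and later, `DataFrame.applymap()` is deprecated in favor of `DataFrame.map()`, which performs the same element-wise operation. A minimal sketch that works on either side of that change, using a small made-up DataFrame (`small` and `elementwise` are illustrative names):

```python
import pandas as pd

def successor(x):
    return x + 1

small = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

# Use DataFrame.map where it exists (pandas 2.1+); otherwise fall back to applymap.
elementwise = small.map if hasattr(small, 'map') else small.applymap
result = elementwise(successor)

print(result)
```

For simple arithmetic like this, `small + 1` gives the same result without calling a Python function per cell; element-wise mapping is most useful when the function has no vectorized equivalent.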


The `.map()` method takes a function as input that it will then apply to every entry in the Series.

In [120]:
uci['age_plus_one'] = uci['age'].map(successor)

In [121]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,age_plus_one
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,64
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,38
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,42
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,57
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,58


#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".

In [122]:
def my_round(x):
    return round(x)

In [127]:
uci['oldpeak'].map(lambda x: x**2)[:4]

0     5.29
1    12.25
2     1.96
3     0.64
Name: oldpeak, dtype: float64

Exercise: Use an anonymous function to turn the entries in age to strings

In [132]:
uci.age.map(lambda my_age: f'My age is {my_age}')[:4]

0    My age is 63
1    My age is 37
2    My age is 41
3    My age is 56
Name: age, dtype: object

In [133]:
def print_age(my_age):
    return f'My age is {my_age}'

In [134]:
def print_age(my_age): return f'My age is {my_age}'

In [135]:
print_age(37)

'My age is 37'

### Apply to Animal Shelter Data

Use an `apply` to change the dates from strings to datetime objects. Similarly, use an apply to change the ages of the animals from strings to floats.

In [None]:
# Your code here
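
One possible approach, sketched on two made-up rows that copy the string formats seen in the outcomes data. The names `sample`, `DAYS_PER_UNIT`, and `age_in_years` are hypothetical, and the days-per-unit conversion factors are rough assumptions. The real columns may contain missing values, which would need handling before calling `.split()`.

```python
import pandas as pd

sample = pd.DataFrame({
    'DateTime': ['10/18/2014 06:52:00 PM', '01/22/2017 11:56:00 AM'],
    'Age upon Outcome': ['2 months', '7 years'],
})

# Parse each timestamp string; the format matches the values shown above.
sample['outcome_dt'] = sample['DateTime'].apply(
    lambda s: pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p'))

# Approximate number of days per unit (an assumption for illustration).
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_in_years(age_string):
    """Turn a string such as '2 months' into a float number of years."""
    number, unit = age_string.split()
    unit = unit.rstrip('s')  # 'years' -> 'year', 'months' -> 'month'
    return int(number) * DAYS_PER_UNIT[unit] / 365

sample['age_years'] = sample['Age upon Outcome'].apply(age_in_years)
```

Calling `pd.to_datetime` on the whole column at once is usually faster than applying it row by row; the `apply` form is used here because that is what the exercise asks for.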

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.

In [None]:
uci.groupby('sex')

#### `.groups` and `.get_group()`

In [None]:
uci.groupby('sex').groups

In [None]:
uci.groupby('sex').get_group(0) # .tail()

### Aggregating

In [None]:
uci.groupby('sex').std()
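
A groupby object by itself holds only the instructions for splitting the data; results appear once an aggregation is applied. This is a minimal sketch of that split-apply-combine pattern on a made-up table (`patients` and its values are illustrative, not taken from `heart.csv`):

```python
import pandas as pd

patients = pd.DataFrame({
    'sex': [1, 0, 1, 0],
    'chol': [233, 204, 250, 354],
    'target': [1, 1, 0, 1],
})

grouped = patients.groupby('sex')  # a DataFrameGroupBy object; nothing is computed yet

# Aggregate one column per group.
mean_chol = grouped['chol'].mean()

# Apply several aggregations at once.
summary = grouped['chol'].agg(['mean', 'max'])

# .size() counts the rows in each group.
group_sizes = grouped.size()
```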

Exercise: Tell me the average cholesterol level for those with heart disease.

In [None]:
# Your code here!


### Apply to Animal Shelter Data

#### Task 1
- Use a groupby to show the average age of the different kinds of animal types.
- What about by animal types **and** gender?
 

#### Task 2:
- Create new columns `year` and `month` by applying lambda functions (for example `lambda x: x.year`) to the date column
- Use `groupby` and `.size()` to count how many animals are adopted in each month

In [None]:
# Your code here

## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has a similar functionality.

In [None]:
uci.pivot(values='sex', columns='target').head()
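
Pivoting is easiest to see on data in "long" form, where one column holds the names of the measurements. A small sketch on made-up data (`long_df` and `wide_df` are illustrative names):

```python
import pandas as pd

long_df = pd.DataFrame({
    'animal': ['Rex', 'Rex', 'Tom', 'Tom'],
    'measure': ['age', 'weight', 'age', 'weight'],
    'value': [3, 30, 5, 4],
})

# Each distinct 'measure' becomes a column; each distinct 'animal' becomes a row.
wide_df = long_df.pivot(index='animal', columns='measure', values='value')

print(wide_df)
```

`.pivot()` raises an error when an index/column pair occurs more than once; `pd.pivot_table()` handles that case by aggregating the duplicates.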

### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns = ['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns = ['age', 'HP'])

In [None]:
toy1.join(toy2.set_index('age'),
          on = 'age',
          lsuffix = '_A',
          rsuffix = '_B').head()

### `.merge()`

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col = 0)

In [None]:
states = pd.read_csv('data/states.csv', index_col = 0)

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on = 'state',
               how = 'inner')
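
Since the contents of the two CSV files are not shown here, the following sketch uses made-up tables (`people` and `state_info`, with invented values) to illustrate how the `how` argument changes the result of a merge on differently named key columns:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'home_state': ['TX', 'WA', 'ZZ'],
})
state_info = pd.DataFrame({
    'state': ['TX', 'WA', 'GA'],
    'capital': ['Austin', 'Olympia', 'Atlanta'],
})

# inner: keep only rows whose key appears in both tables.
inner = people.merge(state_info, left_on='home_state', right_on='state', how='inner')

# left: keep every row of the left table; unmatched rows get NaN on the right side.
left = people.merge(state_info, left_on='home_state', right_on='state', how='left')
```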

### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [None]:
pd.concat([ds_chars, states], sort=False)
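
The `axis` argument of `pd.concat` controls whether the inputs are stacked as rows or placed side by side and aligned on their index. A minimal sketch on made-up frames (`top`, `other`, and the result names are illustrative):

```python
import pandas as pd

top = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
other = pd.DataFrame({'y': [3, 4]}, index=['a', 'b'])

# axis=0 (the default): rows are stacked, and columns missing from an input become NaN.
stacked = pd.concat([top, other], sort=False)

# axis=1: columns are placed side by side, aligned on the index labels.
side_by_side = pd.concat([top, other], axis=1)
```

Comparing `stacked.shape` with `side_by_side.shape` is a quick way to check which direction gives the row count an exercise asks for.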

### `pd.melt()`

Melting removes the structure from your DataFrame and puts the data in a 'variable' and 'value' format.

In [None]:
ds_chars.head()

In [None]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
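
To make the 'variable' and 'value' layout concrete without relying on the CSV file, here is a sketch with a made-up table that uses the same column names (`chars` and its values are invented for illustration):

```python
import pandas as pd

chars = pd.DataFrame({
    'name': ['Ana', 'Ben'],
    'HP': [142, 47],
    'home_state': ['TX', 'WA'],
})

# Each (name, column) pair becomes its own row.
melted = pd.melt(chars, id_vars=['name'], value_vars=['HP', 'home_state'])

print(melted)
```

The two rows and two value columns of `chars` become four rows, one per combination, which is the reverse of what `.pivot()` does.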

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

_Hints_ :
- import and clean the intake dataset first
- use apply/applymap/lambda to change the variables to their proper format in the intake data
- rename the columns in the intake dataset *before* joining
- create a new days-in-shelter variable
- Notice that some values in the "days_in_shelter" column are NaN or less than 0 (remove these rows using the `<` operator and `~` combined with `.isna()`)
- Use `groupby` to get some interesting information about the dataset

Make sure to export and save your cleaned dataset. We will use it in a later lecture!

Use `df.to_csv()` to write the DataFrame `df` to a CSV file. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
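
The hints above can be sketched on a tiny made-up example. Everything here is hypothetical (the table contents, the column names `intake_date` and `outcome_date`, and the variable names); the real datasets need cleaning and renaming first, and an animal ID that appears more than once would produce several matches per animal.

```python
import pandas as pd

intakes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'intake_date': pd.to_datetime(['2016-01-01', '2016-02-10', '2016-03-05']),
})
outcomes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A4'],
    'outcome_date': pd.to_datetime(['2016-01-11', '2016-02-01', '2016-04-01']),
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption'],
})

# Join the two tables on the shared ID column.
stays = intakes.merge(outcomes, on='Animal ID', how='left')

# Subtracting datetimes gives a timedelta; .dt.days turns it into a number of days.
stays['days_in_shelter'] = (stays['outcome_date'] - stays['intake_date']).dt.days

# Keep rows where the stay is known and not negative.
clean = stays.loc[~stays['days_in_shelter'].isna() & ~(stays['days_in_shelter'] < 0)]

# Average stay per outcome type.
mean_stay = clean.groupby('Outcome Type')['days_in_shelter'].mean()
```

A call such as `clean.to_csv('cleaned_shelter.csv', index=False)` would then save the result; the file name is only a placeholder.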

In [None]:
#code here