### Apply to Animal Shelter Data

Use `apply` to convert the dates from strings to datetime objects. Similarly, use `apply` to convert the animals' ages from strings to floats.

In [1]:
import pandas as pd
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [2]:
# Inspect the column dtypes first; the dates are still strings (object) at this point
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114105 entries, 0 to 114104
Data columns (total 12 columns):
Animal ID           114105 non-null object
Name                78240 non-null object
DateTime            114105 non-null object
MonthYear           114105 non-null object
Date of Birth       114105 non-null object
Outcome Type        114098 non-null object
Outcome Subtype     51688 non-null object
Animal Type         114105 non-null object
Sex upon Outcome    114101 non-null object
Age upon Outcome    114077 non-null object
Breed               114105 non-null object
Color               114105 non-null object
dtypes: object(12)
memory usage: 10.4+ MB


In [54]:
import datetime
animal_outcomes.DateTime = animal_outcomes.DateTime.apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))

In [6]:
pd.to_datetime(animal_outcomes.DateTime)

0        2016-02-13 17:59:00
1        2014-03-18 11:47:00
2        2014-10-18 18:52:00
3        2014-08-05 16:59:00
4        2014-07-27 09:00:00
                 ...        
114100   2017-10-18 13:27:00
114101   2018-03-01 18:28:00
114102   2018-06-23 11:59:00
114103   2018-05-21 12:59:00
114104   2018-03-12 13:27:00
Name: DateTime, Length: 114105, dtype: datetime64[ns]
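
The row-wise `apply` with `strptime` works, but `pd.to_datetime` can parse a whole column in one call. A minimal sketch, assuming a hand-made sample in the same layout as the `DateTime` column (the sample values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample strings in the 'MM/DD/YYYY hh:mm:ss AM/PM' layout
sample = pd.Series(['02/13/2016 05:59:00 PM', '03/18/2014 11:47:00 AM'])

# Vectorized parse; an explicit format avoids per-element format guessing
parsed = pd.to_datetime(sample, format='%m/%d/%Y %I:%M:%S %p')

print(parsed.dt.hour.tolist())  # [17, 11]
```

Passing `format` explicitly also makes a malformed string fail loudly instead of being parsed with a guessed layout.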

In [7]:
# Next: convert the ages of the animals from strings to floats
animal_outcomes['Age upon Outcome'].head()

0    4 months
1      6 days
2    2 months
3    2 months
4     2 years
Name: Age upon Outcome, dtype: object

In [9]:
animal_outcomes['Age upon Outcome'].value_counts()

1 year       20529
2 years      16881
2 months     13769
3 years       6949
3 months      5350
1 month       5053
4 years       4109
5 years       3804
4 months      3641
5 months      2809
6 months      2748
6 years       2539
8 years       2178
7 years       2165
3 weeks       2017
2 weeks       1912
8 months      1813
10 years      1731
4 weeks       1718
10 months     1645
7 months      1462
9 years       1189
9 months      1187
12 years       851
1 weeks        768
11 months      708
11 years       674
1 week         626
13 years       542
14 years       361
2 days         330
3 days         321
15 years       302
1 day          240
6 days         228
4 days         201
0 years        164
5 days         145
16 years       132
5 weeks        109
17 years        77
18 years        48
19 years        23
20 years        17
-1 years         4
22 years         4
24 years         1
-3 years         1
25 years         1
21 years         1
Name: Age upon Outcome, dtype: int64

In [14]:
animal_outcomes['Age upon Outcome'].isna().sum()

28

In [16]:
x = 5
y = 5

x == y

True

In [17]:
x is y

True

In [18]:
id(5)

4543827856

In [21]:
x=6
id(x)

4543827888

In [20]:
id(y)

4543827856

In [22]:
import numpy as np  # np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
x = np.nan
y = np.nan

In [23]:
x is y

True

In [24]:
id(x) == id(y)

True

In [25]:
x == y

False
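
The cells above show that two names bound to the same NaN object pass an `is` check but still fail `==`. The identity check stops working as soon as a NaN comes from somewhere else. A small sketch of why `pd.isna` is the more dependable test (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

a = np.nan
b = float('nan')  # a separate NaN object created at runtime

print(a is b)                  # False: identity depends on which object you hold
print(a == b)                  # False: NaN never compares equal, even to itself
print(pd.isna(a), pd.isna(b))  # True True: works for any NaN value
```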

In [27]:
header = 'Andy Enkeboll'
print(header)
print('='*len(header))

Andy Enkeboll


In [56]:
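import numpy as np
def age_string_to_days_old(age):
    """Convert an age string such as '2 years' to an approximate age in days."""
    # pd.isna is safer than `age is np.NaN`, which only matches one specific NaN object
    if pd.isna(age):
        return np.nan
    qty, unit = age.split(' ')
    qty = int(qty)
    if 'day' in unit:
        return qty
    elif 'week' in unit:
        return qty * 7
    elif 'month' in unit:
        return qty * 30   # approximation: 30 days per month
    elif 'year' in unit:
        return qty * 365  # approximation: ignores leap years

    return np.nan  # unrecognized unit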

animal_outcomes['age'] = animal_outcomes['Age upon Outcome'].apply(age_string_to_days_old)
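
To check a converter like this without downloading the data, it helps to run it on a few hand-picked strings. The sketch below is a compact variant of the same idea, not a replacement for the function above: `DAYS_PER_UNIT` and `age_to_days` are hypothetical names, and the sample values are modelled on the `value_counts()` output shown earlier.

```python
import numpy as np
import pandas as pd

# Same approximations as above: 30-day months, 365-day years
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_days(age):
    """Convert strings like '2 years' to an approximate number of days."""
    if pd.isna(age):
        return np.nan
    qty, unit = age.split(' ')
    # Strip a trailing plural 's' so '1 week' and '1 weeks' are treated alike
    return int(qty) * DAYS_PER_UNIT.get(unit.rstrip('s'), np.nan)

ages = pd.Series(['4 months', '6 days', '1 weeks', '2 years', np.nan])
days = ages.apply(age_to_days)
print(days.tolist())
```

Negative entries such as `-1 years` pass through as negative day counts, so they still need to be filtered or investigated separately.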

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the `GROUP BY` clause. pandas offers the same operation through `.groupby()`.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
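
As a rough illustration of the SQL analogy, the sketch below groups a made-up frame (not the heart dataset) and applies two aggregates at once. The column names mirror the `uci` data, but the values are invented.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({'sex': [0, 1, 1, 0, 1],
                    'chol': [200, 250, 230, 180, 260]})

# Roughly: SELECT sex, AVG(chol), COUNT(*) FROM toy GROUP BY sex
summary = toy.groupby('sex')['chol'].agg(['mean', 'count'])
print(summary)
```

Selecting the column before aggregating keeps the result small and avoids aggregating columns you do not care about.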

In [28]:
uci = pd.read_csv('data/heart.csv')

In [30]:
uci.sex.value_counts()

1    207
0     96
Name: sex, dtype: int64

In [34]:
uci.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [35]:
uci.groupby('age')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12891cf10>

#### `.groups` and `.get_group()`

In [38]:
uci.groupby('age').groups

{29: Int64Index([72], dtype='int64'),
 34: Int64Index([58, 125], dtype='int64'),
 35: Int64Index([65, 157, 227, 239], dtype='int64'),
 37: Int64Index([1, 115], dtype='int64'),
 38: Int64Index([163, 164, 259], dtype='int64'),
 39: Int64Index([44, 124, 154, 212], dtype='int64'),
 40: Int64Index([24, 175, 283], dtype='int64'),
 41: Int64Index([2, 30, 63, 80, 116, 122, 133, 134, 162, 189], dtype='int64'),
 42: Int64Index([22, 84, 100, 103, 132, 142, 149, 280], dtype='int64'),
 43: Int64Index([18, 74, 98, 113, 141, 178, 215, 251], dtype='int64'),
 44: Int64Index([7, 21, 32, 46, 53, 68, 146, 148, 185, 200, 294], dtype='int64'),
 45: Int64Index([42, 57, 67, 81, 94, 107, 255, 299], dtype='int64'),
 46: Int64Index([35, 87, 118, 119, 196, 270, 285], dtype='int64'),
 47: Int64Index([47, 126, 156, 230, 274], dtype='int64'),
 48: Int64Index([11, 41, 56, 90, 171, 219, 245], dtype='int64'),
 49: Int64Index([12, 131, 135, 208, 267], dtype='int64'),
 50: Int64Index([15, 104, 108, 109, 184, 188, 257], d

In [40]:
uci.groupby('age').get_group(60) # .tail()

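| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 60 | 0 | 2 | 102 | 318 | 0 | 1 | 160 | 0 | 0.0 | 2 | 1 | 2 | 1 |
| 136 | 60 | 0 | 2 | 120 | 178 | 1 | 1 | 96 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 147 | 60 | 0 | 3 | 150 | 240 | 0 | 1 | 171 | 0 | 0.9 | 2 | 0 | 2 | 1 |
| 174 | 60 | 1 | 0 | 130 | 206 | 0 | 0 | 132 | 1 | 2.4 | 1 | 2 | 3 | 0 |
| 176 | 60 | 1 | 0 | 117 | 230 | 1 | 1 | 160 | 1 | 1.4 | 2 | 2 | 3 | 0 |
| 186 | 60 | 1 | 0 | 130 | 253 | 0 | 1 | 144 | 1 | 1.4 | 2 | 1 | 3 | 0 |
| 193 | 60 | 1 | 0 | 145 | 282 | 0 | 0 | 142 | 1 | 2.8 | 1 | 2 | 3 | 0 |
| 194 | 60 | 1 | 2 | 140 | 185 | 0 | 0 | 155 | 0 | 3.0 | 1 | 0 | 2 | 0 |
| 201 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 207 | 60 | 0 | 0 | 150 | 258 | 0 | 0 | 157 | 0 | 2.6 | 1 | 2 | 3 | 0 |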

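The relationship between `.groups`, `.get_group()`, and ordinary boolean filtering can be checked on a small frame. This is a sketch with invented values; only the column names are borrowed from the heart data.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({'age': [60, 41, 60, 41, 29],
                    'chol': [318, 204, 178, 235, 204]})
grouped = toy.groupby('age')

# .groups maps each key to the index labels of the rows in that group
labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(labels)

# .get_group(key) returns the same rows as a boolean mask on the key column
same = grouped.get_group(60).equals(toy[toy['age'] == 60])
print(same)
```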

### Aggregating

In [46]:
uci.groupby('age').median()

| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 29 | 1.0 | 1.0 | 130.0 | 204.0 | 0.0 | 0.0 | 202.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 1.0 |
| 34 | 0.5 | 2.0 | 118.0 | 196.0 | 0.0 | 0.5 | 183.0 | 0.0 | 0.35 | 2.0 | 0.0 | 2.0 | 1.0 |
| 35 | 1.0 | 0.0 | 124.0 | 195.0 | 0.0 | 1.0 | 165.0 | 0.5 | 0.7 | 2.0 | 0.0 | 2.5 | 0.5 |
| 37 | 0.5 | 2.0 | 125.0 | 232.5 | 0.0 | 1.0 | 178.5 | 0.0 | 1.75 | 1.0 | 0.0 | 2.0 | 1.0 |
| 38 | 1.0 | 2.0 | 138.0 | 175.0 | 0.0 | 1.0 | 173.0 | 0.0 | 0.0 | 2.0 | 4.0 | 2.0 | 1.0 |
| 39 | 0.5 | 2.0 | 128.0 | 219.5 | 0.0 | 1.0 | 165.5 | 0.0 | 0.0 | 1.5 | 0.0 | 2.0 | 1.0 |
| 40 | 1.0 | 0.0 | 140.0 | 199.0 | 0.0 | 1.0 | 178.0 | 1.0 | 1.4 | 2.0 | 0.0 | 3.0 | 0.0 |
| 41 | 1.0 | 1.0 | 116.0 | 209.0 | 0.0 | 1.0 | 168.0 | 0.0 | 0.0 | 2.0 | 0.0 | 2.0 | 1.0 |
| 42 | 1.0 | 1.5 | 125.0 | 242.0 | 0.0 | 1.0 | 167.5 | 0.0 | 0.3 | 1.5 | 0.0 | 2.0 | 1.0 |
| 43 | 1.0 | 0.0 | 126.0 | 247.0 | 0.0 | 1.0 | 161.5 | 0.0 | 1.35 | 1.0 | 0.0 | 2.5 | 1.0 |


Exercise: Tell me the average cholesterol level for those with heart disease.

In [48]:
# Your code here!
uci.groupby('target').get_group(1).chol.mean()

242.23030303030302

In [52]:
uci.loc[uci.target == 1].chol.mean()

242.23030303030302

### Apply to Animal Shelter Data

#### Task 1
- Use a groupby to show the average age for each animal type.
- What about by animal type **and** sex (the `Sex upon Outcome` column)?
 

In [57]:
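# Mean age in days; selecting 'age' first avoids averaging the text columns, which raises a TypeError in pandas 2.x
animal_outcomes.groupby('Animal Type')[['age']].mean()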

| Animal Type | age |
|---|---|
| Bird | 502.923507 |
| Cat | 506.68357 |
| Dog | 965.442621 |
| Livestock | 411.411765 |
| Other | 459.662369 |


In [65]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114105 entries, 0 to 114104
Data columns (total 15 columns):
Animal ID           114105 non-null object
Name                78240 non-null object
DateTime            114105 non-null datetime64[ns]
MonthYear           114105 non-null object
Date of Birth       114105 non-null object
Outcome Type        114098 non-null object
Outcome Subtype     51688 non-null object
Animal Type         114105 non-null object
Sex upon Outcome    114101 non-null object
Age upon Outcome    114077 non-null object
Breed               114105 non-null object
Color               114105 non-null object
age                 114077 non-null float64
year                114105 non-null int64
month               114105 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(2), object(11)
memory usage: 13.1+ MB


In [67]:
animal_outcomes.groupby(['Animal Type', 'Sex upon Outcome'])['age'].mean()

Animal Type  Sex upon Outcome
Bird         Intact Female        803.405405
             Intact Male          541.086093
             Unknown              412.897106
Cat          Intact Female        320.782064
             Intact Male          219.976081
             Neutered Male        676.079787
             Spayed Female        685.390752
             Unknown              186.176802
Dog          Intact Female        810.742839
             Intact Male          867.618011
             Neutered Male       1021.148468
             Spayed Female        985.552200
             Unknown              368.940618
Livestock    Intact Female        458.571429
             Intact Male          577.000000
             Neutered Male        365.000000
             Unknown              133.500000
Other        Intact Female        551.289377
             Intact Male          497.184136
             Neutered Male        550.568862
             Spayed Female        545.179856
             Unknown     

#### Task 2:
- Create new columns `year` and `month` by applying lambda functions (`lambda x: x.year`, `lambda x: x.month`) to the `DateTime` column
- Use `groupby` and `.size()` to tell me how many animals are adopted each month

In [60]:
# Your code here
animal_outcomes['year'] = animal_outcomes.DateTime.apply(lambda x: x.year)
animal_outcomes['month'] = animal_outcomes.DateTime.apply(lambda x: x.month)

In [68]:
animal_outcomes.groupby('month').count()['Animal ID']

month
1      8209
2      6882
3      8230
4      8208
5      9999
6     10888
7     11050
8     10451
9      9701
10    10995
11     9701
12     9791
Name: Animal ID, dtype: int64
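
The count above covers every outcome, not only adoptions. The sketch below shows one way to answer the task as worded, using `.size()` after filtering. It runs on a hypothetical miniature table, and it assumes adoptions are labelled `'Adoption'` in `Outcome Type`, which the output above does not confirm.

```python
import pandas as pd

# Hypothetical miniature of the outcomes table
toy = pd.DataFrame({
    'DateTime': pd.to_datetime(['2016-02-13', '2016-02-20', '2017-07-04',
                                '2017-07-09', '2018-07-01']),
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption', 'Adoption', 'Adoption'],
})

# .dt.month is the vectorized equivalent of apply(lambda x: x.month)
toy['month'] = toy['DateTime'].dt.month

# Keep adoptions only (assumed label), then count rows per month
adoptions_by_month = toy[toy['Outcome Type'] == 'Adoption'].groupby('month').size()
print(adoptions_by_month)
```

Unlike `.count()`, `.size()` counts rows regardless of missing values and returns a single Series rather than one count per column.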

## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used pivot tables. pandas has similar functionality.

In [73]:
uci.pivot(values=['sex', 'age'], columns='target')

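| | sex (target 0) | sex (target 1) | age (target 0) | age (target 1) |
|---|---|---|---|---|
| 0 | NaN | 1.0 | NaN | 63.0 |
| 1 | NaN | 1.0 | NaN | 37.0 |
| 2 | NaN | 0.0 | NaN | 41.0 |
| 3 | NaN | 1.0 | NaN | 56.0 |
| 4 | NaN | 0.0 | NaN | 57.0 |
| ... | ... | ... | ... | ... |
| 298 | 0.0 | NaN | 57.0 | NaN |
| 299 | 1.0 | NaN | 45.0 | NaN |
| 300 | 1.0 | NaN | 68.0 | NaN |
| 301 | 1.0 | NaN | 57.0 | NaN |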

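Because the call above passes no `index`, every original row stays on its own row and half of the cells are NaN. For an Excel-style summary, where cells are aggregated, `pivot_table` is the closer match. The sketch below swaps in `pivot_table` on invented values; it is not what the cell above does.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the heart data
toy = pd.DataFrame({'sex': [0, 0, 1, 1, 1],
                    'target': [0, 1, 0, 1, 1],
                    'chol': [200, 220, 240, 260, 280]})

# One row per sex, one column per target, each cell holding the mean cholesterol
table = toy.pivot_table(index='sex', columns='target', values='chol', aggfunc='mean')
print(table)
```

`pivot` only reshapes and raises an error when an index/column pair appears more than once, while `pivot_table` resolves such duplicates with the aggregation function.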

### Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [74]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])

In [75]:
toy1

| | age | HP |
|---|---|---|
| 0 | 63 | 142 |
| 1 | 33 | 47 |


In [81]:
toy2.set_index('age').loc[63]

HP    100
Name: 63, dtype: int64

In [79]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B').head()

| | age | HP_A | HP_B |
|---|---|---|---|
| 0 | 63 | 142 | 100 |
| 1 | 33 | 47 | 200 |


### `.merge()`

In [82]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)

In [83]:
states = pd.read_csv('data/states.csv', index_col=0)

In [84]:
ds_chars

| | name | HP | home_state |
|---|---|---|---|
| 0 | greg | 200 | WA |
| 1 | miles | 200 | WA |
| 2 | alan | 170 | TX |
| 3 | alison | 300 | DC |
| 4 | rachel | 200 | TX |


In [85]:
states

| | state | nickname | capital |
|---|---|---|---|
| 0 | WA | evergreen | Olympia |
| 1 | TX | alamo | Austin |
| 2 | DC | district | Washington |
| 3 | OH | buckeye | Columbus |
| 4 | OR | beaver | Salem |


In [88]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='right')

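| | name | HP | home_state | state | nickname | capital |
|---|---|---|---|---|---|---|
| 0 | greg | 200.0 | WA | WA | evergreen | Olympia |
| 1 | miles | 200.0 | WA | WA | evergreen | Olympia |
| 2 | alan | 170.0 | TX | TX | alamo | Austin |
| 3 | rachel | 200.0 | TX | TX | alamo | Austin |
| 4 | alison | 300.0 | DC | DC | district | Washington |
| 5 | NaN | NaN | NaN | OH | buckeye | Columbus |
| 6 | NaN | NaN | NaN | OR | beaver | Salem |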

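The `how` argument decides which keys survive the merge. The sketch below compares the four common settings on two small made-up frames (`chars` and `toy_states` are illustrative names) and uses `indicator=True` to show where each row came from.

```python
import pandas as pd

# Hypothetical toy frames: DC has no match on the right, OH has none on the left
chars = pd.DataFrame({'name': ['greg', 'alan', 'alison'],
                      'home_state': ['WA', 'TX', 'DC']})
toy_states = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                           'capital': ['Olympia', 'Austin', 'Columbus']})

row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = chars.merge(toy_states, left_on='home_state', right_on='state', how=how)
    row_counts[how] = len(merged)
print(row_counts)

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
audit = chars.merge(toy_states, left_on='home_state', right_on='state',
                    how='outer', indicator=True)
print(audit['_merge'].value_counts())
```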

### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [93]:
# rename() without columns= maps index labels, so 'home_state' is left unchanged here
pd.concat([ds_chars.rename({'home_state': 'state'}), states], sort=True).reset_index()

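| | index | HP | capital | home_state | name | nickname | state |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 200.0 | NaN | WA | greg | NaN | NaN |
| 1 | 1 | 200.0 | NaN | WA | miles | NaN | NaN |
| 2 | 2 | 170.0 | NaN | TX | alan | NaN | NaN |
| 3 | 3 | 300.0 | NaN | DC | alison | NaN | NaN |
| 4 | 4 | 200.0 | NaN | TX | rachel | NaN | NaN |
| 5 | 0 | NaN | Olympia | NaN | NaN | evergreen | WA |
| 6 | 1 | NaN | Austin | NaN | NaN | alamo | TX |
| 7 | 2 | NaN | Washington | NaN | NaN | district | DC |
| 8 | 3 | NaN | Columbus | NaN | NaN | buckeye | OH |
| 9 | 4 | NaN | Salem | NaN | NaN | beaver | OR |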

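The result above has ten rows because `pd.concat` stacks along the row axis by default. One way to keep five rows, sketched here as a possible reading of the exercise rather than the intended answer, is to concatenate along the columns with `axis=1`. The two frames are rebuilt by hand from the tables shown earlier.

```python
import pandas as pd

# Rebuilt from the ds_chars and states tables shown above
ds_chars = pd.DataFrame({'name': ['greg', 'miles', 'alan', 'alison', 'rachel'],
                         'HP': [200, 200, 170, 300, 200],
                         'home_state': ['WA', 'WA', 'TX', 'DC', 'TX']})
states = pd.DataFrame({'state': ['WA', 'TX', 'DC', 'OH', 'OR'],
                       'nickname': ['evergreen', 'alamo', 'district', 'buckeye', 'beaver'],
                       'capital': ['Olympia', 'Austin', 'Washington', 'Columbus', 'Salem']})

# axis=1 places the frames side by side, aligning rows on the index labels 0..4
side_by_side = pd.concat([ds_chars, states], axis=1)
print(side_by_side.shape)  # (5, 6)
```

Rows are paired by index label only, so `miles` (WA) ends up next to the TX row. When rows should be matched on the state value, `.merge()` is the appropriate tool.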

### `pd.melt()`

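Melting unpivots a DataFrame from wide to long format: the columns named in `value_vars` are stacked into two columns, `variable` (the original column name) and `value` (the cell contents), while the `id_vars` columns are repeated to identify each row.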

In [None]:
ds_chars.head()

In [94]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])

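| | name | variable | value |
|---|---|---|---|
| 0 | greg | HP | 200 |
| 1 | miles | HP | 200 |
| 2 | alan | HP | 170 |
| 3 | alison | HP | 300 |
| 4 | rachel | HP | 200 |
| 5 | greg | home_state | WA |
| 6 | miles | home_state | WA |
| 7 | alan | home_state | TX |
| 8 | alison | home_state | DC |
| 9 | rachel | home_state | TX |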

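The generic `variable` and `value` labels can be replaced with `var_name` and `value_name`, and a melt can be undone with `pivot`. A small sketch on an invented frame, where the `MP` column is made up for illustration:

```python
import pandas as pd

# Hypothetical wide frame; 'MP' does not exist in ds_chars
wide = pd.DataFrame({'name': ['greg', 'alan'],
                     'HP': [200, 170],
                     'MP': [50, 80]})

# Wide to long, with descriptive labels instead of 'variable' and 'value'
long = pd.melt(wide, id_vars=['name'], value_vars=['HP', 'MP'],
               var_name='stat', value_name='points')
print(long)

# Long back to wide: one row per name, one column per stat
restored = long.pivot(index='name', columns='stat', values='points')
print(restored)
```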

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

The URL for the Intake Dataset is here: https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD

_Hints_ :
- import and clean the intake dataset first
- use apply/applymap/lambda to change the variables to their proper format in the intake data
- rename the columns in the intake dataset *before* joining
- create a new days-in-shelter variable
- Notice that some values in the `days_in_shelter` column are NaN or negative (remove these rows using the `<` operator and `.isna()`, negated with `~`)
- Use `groupby` to get some interesting information about the dataset

Make sure to export and save your cleaned dataset. We will use it in a later lecture!

Use `df.to_csv()` to write `df` to a CSV file. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).
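
The shape of the workflow in the hints can be rehearsed on a tiny example before touching the real files. Everything in this sketch is hypothetical: the three-row tables, the `intake_date` and `outcome_date` names, and the assumption that the intake file shares the `Animal ID` and `DateTime` columns and date layout of the outcomes file.

```python
import pandas as pd

# Hypothetical miniature tables; the real intake columns may differ
intakes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'DateTime': ['01/01/2018 10:00:00 AM', '03/05/2018 02:30:00 PM',
                 '06/10/2018 09:00:00 AM'],
})
outcomes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'DateTime': ['01/11/2018 10:00:00 AM', '03/01/2018 02:30:00 PM',
                 '06/15/2018 09:00:00 AM'],
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption'],
})

fmt = '%m/%d/%Y %I:%M:%S %p'
intakes['DateTime'] = pd.to_datetime(intakes['DateTime'], format=fmt)
outcomes['DateTime'] = pd.to_datetime(outcomes['DateTime'], format=fmt)

# Rename before joining so the two date columns stay distinguishable
intakes = intakes.rename(columns={'DateTime': 'intake_date'})
outcomes = outcomes.rename(columns={'DateTime': 'outcome_date'})

stays = intakes.merge(outcomes, on='Animal ID', how='inner')
stays['days_in_shelter'] = (stays['outcome_date'] - stays['intake_date']).dt.days

# Drop missing and negative durations (A2 has an outcome before its intake)
keep = stays['days_in_shelter'].notna() & ~(stays['days_in_shelter'] < 0)
clean = stays[keep]

print(clean.groupby('Outcome Type')['days_in_shelter'].mean())
```

If an animal appears more than once in either table, merging on `Animal ID` alone pairs every intake with every outcome for that animal, so the real data may need an extra step to match each intake with the correct outcome.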

In [None]:
#code here