# Module 1 - Reshaping Data with Pandas
## Pandas Part 3

In [1]:
import pandas as pd
uci = pd.read_csv('data/heart.csv')

In [2]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

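If you are familiar with SQL, you have probably used the `GROUP BY` clause. pandas offers the same operation through `.groupby()`.

The `.groupby()` method splits the rows into groups that share a value in one or more columns, so that an aggregate function (mean, count, quantile, and so on) can be applied to each group separately.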
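As a minimal sketch of that split-then-aggregate idea, the block below groups a small made-up table (not the heart dataset) by `sex` and computes two aggregates of `chol` per group. The `toy` DataFrame and its values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the heart-disease dataset used in this notebook.
toy = pd.DataFrame({
    'sex':  [0, 0, 1, 1, 1],
    'chol': [200, 260, 210, 240, 270],
})

# Split the rows by 'sex', then compute the mean and max of 'chol' per group.
summary = toy.groupby('sex')['chol'].agg(['mean', 'max'])
print(summary)
```

The result has one row per distinct value of `sex` and one column per aggregate.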

In [4]:
gb = uci.groupby('sex')

#### `.groups` and `.get_group()`

In [5]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [6]:
uci.groupby('sex').get_group(0)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
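To make the two attributes above more concrete, here is a small sketch on made-up data: `.groups` maps each key to the row labels in that group, and `.get_group()` should return the same rows as an ordinary boolean filter. The `toy` table is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'sex': [0, 1, 0, 1], 'age': [41, 63, 57, 37]})
gb = toy.groupby('sex')

# .groups maps each group key to the row labels that belong to that group.
labels = {int(key): [int(i) for i in idx] for key, idx in gb.groups.items()}
print(labels)

# .get_group(0) selects the rows whose key is 0, like a boolean filter does.
same = gb.get_group(0).equals(toy[toy['sex'] == 0])
print(same)
```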


### Aggregating

In [7]:
uci.groupby('sex').mean()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,55.677083,1.041667,133.083333,261.302083,0.125,0.572917,151.125,0.229167,0.876042,1.427083,0.552083,2.125,0.75
1,53.758454,0.932367,130.94686,239.289855,0.15942,0.507246,148.961353,0.371981,1.115459,1.386473,0.811594,2.400966,0.449275


Exercise: Tell me the average cholesterol level for those with heart disease (`target == 1`).

In [10]:
# Your code here!
uci.groupby('target').mean().loc[1, 'chol']

242.23030303030302

In [14]:
uci.groupby('target').get_group(1).mean()['chol']

242.23030303030302
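Both solutions above aggregate every column and then pick out `chol`. A slightly more direct variant, sketched here on made-up data, selects the column before aggregating. This also sidesteps non-numeric columns, which newer pandas releases refuse to average (they raise a `TypeError` rather than silently dropping them). The `toy` table and its `thal` strings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column.
toy = pd.DataFrame({
    'target': [1, 1, 0, 0],
    'chol':   [230, 250, 300, 280],
    'thal':   ['a', 'b', 'a', 'b'],
})

# Selecting 'chol' first means only that column is aggregated.
chol_by_target = toy.groupby('target')['chol'].mean()
print(chol_by_target.loc[1])
```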

In [16]:
uci.groupby('cp').std().loc[3, 'slope']

0.6887004431501819

In [None]:
gb.quantile()

In [28]:
uci.groupby(['sex', 'cp']).quantile(.05)

sex,cp,age,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,0,42.9,105.6,181.1,0.0,0.0,116.7,0.0,0.0,0.0,0.0,2.0,0.0
0,1,39.95,105.0,189.75,0.0,0.0,135.45,0.0,0.0,1.0,0.0,2.0,0.0
0,2,39.0,106.2,177.7,0.0,0.0,109.6,0.0,0.0,1.0,0.0,2.0,1.0
0,3,58.3,141.5,227.95,0.0,0.15,119.55,0.0,0.915,0.3,0.0,2.0,1.0
1,0,41.15,110.0,174.3,0.0,0.0,99.6,0.0,0.0,0.0,0.0,1.0,0.0
1,1,38.3,109.1,194.75,0.0,0.0,126.6,0.0,0.0,0.55,0.0,1.55,0.0
1,2,38.55,106.65,171.85,0.0,0.0,128.75,0.0,0.0,0.0,0.0,2.0,0.0
1,3,37.6,110.0,185.6,0.0,0.0,125.0,0.0,0.0,0.0,0.0,1.0,0.0
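Grouping by two columns, as in the cell above, produces a result indexed by both keys. The sketch below, on made-up data, shows one way to read a single group out of such a result and one way to flatten it back into ordinary columns. The `toy` table is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'sex': [0, 0, 0, 1, 1, 1],
    'cp':  [0, 0, 0, 1, 1, 1],
    'age': [40, 50, 60, 30, 50, 70],
})

# Median (the 0.5 quantile) of 'age' within each (sex, cp) pair.
q = toy.groupby(['sex', 'cp'])['age'].quantile(0.5)

# A tuple selects one group from the two-level index.
print(q.loc[(0, 0)])

# reset_index() turns the index levels back into ordinary columns.
flat = q.reset_index()
print(flat)
```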


### Apply to Animal Shelter Data 

In [4]:
animal_intakes = pd.read_csv('https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD')

In [5]:
animal_intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [9]:
animal_intakes.shape

(117740, 12)

#### Task 1
- Use a groupby to show the average age of the different kinds of animal types.
- What about by animal type **and** sex (`Sex upon Intake`)?
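One possible approach to Task 1, sketched on made-up rows: `Age upon Intake` is stored as text such as `2 years` or `4 weeks`, so it has to be converted to a number before it can be averaged. The helper `age_to_years`, the `DAYS_PER_UNIT` table, and the `age_years` column are hypothetical names, and the 30-day month is an approximation. The real column may contain values this helper does not handle.

```python
import pandas as pd

# Hypothetical lookup table; a month is approximated as 30 days.
DAYS_PER_UNIT = {'day': 1, 'week': 7, 'month': 30, 'year': 365}

def age_to_years(text):
    """Convert a string such as '11 months' to an approximate age in years."""
    number, unit = text.split()
    unit = unit.rstrip('s')  # 'years' -> 'year'
    return int(number) * DAYS_PER_UNIT[unit] / 365

# Hypothetical rows shaped like the intake data.
toy = pd.DataFrame({
    'Animal Type': ['Dog', 'Dog', 'Cat', 'Cat'],
    'Sex upon Intake': ['Neutered Male', 'Spayed Female',
                        'Intact Female', 'Intact Female'],
    'Age upon Intake': ['2 years', '8 years', '4 weeks', '1 year'],
})
toy['age_years'] = toy['Age upon Intake'].map(age_to_years)

# Average age by animal type, then by animal type and sex.
by_type = toy.groupby('Animal Type')['age_years'].mean()
by_type_and_sex = toy.groupby(['Animal Type', 'Sex upon Intake'])['age_years'].mean()
print(by_type)
print(by_type_and_sex)
```

Passing a list of column names to `groupby` is what produces the second, two-key breakdown.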

In [6]:
import datetime

In [11]:
animal_intakes['DateTime'] = pd.to_datetime(animal_intakes['DateTime'])

In [12]:
animal_intakes['DateTime']

0        2019-01-03 16:19:00
1        2015-07-05 12:59:00
2        2016-04-14 18:43:00
3        2013-10-21 07:59:00
4        2014-06-29 10:38:00
                 ...        
117735   2020-06-08 23:35:00
117736   2020-06-08 14:16:00
117737   2019-11-14 18:00:00
117738   2020-06-09 09:26:00
117739   2020-06-09 09:28:00
Name: DateTime, Length: 117740, dtype: datetime64[ns]

In [13]:
animal_intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117740 entries, 0 to 117739
Data columns (total 12 columns):
Animal ID           117740 non-null object
Name                80680 non-null object
DateTime            117740 non-null datetime64[ns]
MonthYear           117740 non-null object
Found Location      117740 non-null object
Intake Type         117740 non-null object
Intake Condition    117740 non-null object
Animal Type         117740 non-null object
Sex upon Intake     117739 non-null object
Age upon Intake     117740 non-null object
Breed               117740 non-null object
Color               117740 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 10.8+ MB


In [20]:
animal_intakes['years_since_intake'] = animal_intakes['DateTime'].map(lambda x: round((datetime.datetime.now() - x).days / 365, 2))

In [21]:
animal_intakes['years_since_intake']

0         1.43
1         4.93
2         4.15
3         6.64
4         5.95
          ... 
117735    0.00
117736    0.00
117737    0.57
117738    0.00
117739    0.00
Name: years_since_intake, Length: 117740, dtype: float64
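The same calculation can be written without a lambda by subtracting the whole column at once. The sketch below assumes a fixed reference timestamp so that the result is reproducible (the cell above uses the current time), and it assumes that the `time_since_intake` column seen in later output is the raw difference, since the cell that created it is not shown.

```python
import pandas as pd

# Two intake timestamps copied from the output above.
toy = pd.DataFrame({'DateTime': pd.to_datetime(['2019-01-03 16:19:00',
                                                '2015-07-05 12:59:00'])})

# Assumed fixed reference time; the notebook itself uses datetime.now().
reference = pd.Timestamp('2020-06-09')

# Subtracting a datetime column gives a timedelta column.
toy['time_since_intake'] = reference - toy['DateTime']

# .dt.days extracts whole days, which are then converted to years.
toy['years_since_intake'] = (toy['time_since_intake'].dt.days / 365).round(2)
print(toy)
```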

In [33]:
# Store the years elapsed since `DateTime` as a float in a new `age` column

animal_outcomes['age'] = pd.to_datetime(animal_outcomes.DateTime).map(lambda x: round((datetime.datetime.now() - x).days / 365, 2))

In [27]:
animal_intakes['DateTime'][0].month

1

In [22]:
animal_intakes.head(2)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,time_since_intake,years_since_intake
0,A786884,*Brock,2019-01-03 16:19:00,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,522 days 18:24:05.411996,1.43
1,A706918,Belle,2015-07-05 12:59:00,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,1800 days 21:44:05.412076,4.93


In [28]:
animal_intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117740 entries, 0 to 117739
Data columns (total 14 columns):
Animal ID             117740 non-null object
Name                  80680 non-null object
DateTime              117740 non-null datetime64[ns]
MonthYear             117740 non-null object
Found Location        117740 non-null object
Intake Type           117740 non-null object
Intake Condition      117740 non-null object
Animal Type           117740 non-null object
Sex upon Intake       117739 non-null object
Age upon Intake       117740 non-null object
Breed                 117740 non-null object
Color                 117740 non-null object
time_since_intake     117740 non-null timedelta64[ns]
years_since_intake    117740 non-null float64
dtypes: datetime64[ns](1), float64(1), object(11), timedelta64[ns](1)
memory usage: 12.6+ MB


In [32]:
animal_intakes['time_since_intake'][0]

Timedelta('522 days 18:24:05.411996')

In [42]:
pd.__version__

'0.25.1'

In [43]:
animal_outcomes.groupby('Animal Type').mean()

Animal Type,age,year
Bird,3.079106,2016.864964
Cat,3.425261,2016.491886
Dog,3.431489,2016.510738
Livestock,2.779524,2017.190476
Other,3.538561,2016.443995


In [38]:
animal_outcomes.loc[:, ['Animal Type', 'age']].groupby('Animal Type').mean()

Animal Type,age
Bird,3.079106
Cat,3.425261
Dog,3.431489
Livestock,2.779524
Other,3.538561


In [40]:
animal_outcomes.loc[:, ['Animal Type', 'Sex upon Intake', 'age']].groupby('Animal Type').mean()

Animal Type,age
Bird,3.079106
Cat,3.425261
Dog,3.431489
Livestock,2.779524
Other,3.538561


#### Task 2:
- Create new columns `year` and `month` by mapping lambda functions (`lambda x: x.year`, `lambda x: x.month`) over `DateTime`
- Use `groupby` and `.size()` to tell me how many animals are adopted by month

In [41]:
# Your code here
animal_outcomes['year'] = pd.to_datetime(animal_outcomes.DateTime).map(lambda x: x.year)

In [44]:
animal_outcomes['month'] = pd.to_datetime(animal_outcomes.DateTime).map(lambda x: x.month)

In [45]:
animal_outcomes.head(2)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,age,year,month
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,1.43,2019,1
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,4.93,2015,7


In [47]:
animal_outcomes.groupby('month').count().loc[:, 'Animal ID']

month
1      8479
2      8027
3      9318
4      9442
5     12337
6     11464
7     10203
8      9752
9      9972
10    10975
11     9303
12     8466
Name: Animal ID, dtype: int64
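The task mentions `.size()`, while the solution above uses `.count()` on one column. A minimal sketch of the `.size()` route on made-up rows follows; it also uses the `.dt` accessor as an alternative to mapping a lambda. The `toy` table is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the shelter data.
toy = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3', 'A4'],
    'DateTime': pd.to_datetime(['2019-01-03', '2019-01-20',
                                '2015-07-05', '2016-07-14']),
})

# The .dt accessor exposes datetime parts for a whole column at once.
toy['year'] = toy['DateTime'].dt.year
toy['month'] = toy['DateTime'].dt.month

# .size() counts the rows in each group, whether or not values are missing.
per_month = toy.groupby('month').size()
print(per_month)
```

The difference matters when a column has missing values: `.count()` reports non-null entries per column, while `.size()` reports rows.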

## 4. Reshaping a DataFrame

### `.pivot()`

If you are familiar with Excel, you have probably used Pivot Tables. pandas has similar functionality: `.pivot()` spreads the distinct values of one column out into new columns.

In [48]:
uci

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [52]:
uci.pivot(values=['cp', 'sex'], columns='target')

Unnamed: 0_level_0,cp,cp,sex,sex
target,0,1,0,1
0,,3.0,,1.0
1,,2.0,,1.0
2,,1.0,,0.0
3,,1.0,,1.0
4,,0.0,,0.0
...,...,...,...,...
298,0.0,,0.0,
299,3.0,,1.0,
300,0.0,,1.0,
301,0.0,,1.0,
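The output above is mostly empty because each row has only one `target` value. A case where pivoting is more natural is sketched below on made-up long-format data, together with `pivot_table()`, which aggregates when an index/column pair occurs more than once. The `long` table and its `level` attribute are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (name, variable) pair.
long = pd.DataFrame({
    'name':     ['greg', 'greg', 'alan', 'alan'],
    'variable': ['HP', 'level', 'HP', 'level'],
    'value':    [200, 3, 170, 5],
})

# Each distinct 'variable' becomes a column; each 'name' becomes a row.
wide = long.pivot(index='name', columns='variable', values='value')
print(wide)

# pivot() raises an error on duplicate pairs; pivot_table() aggregates them.
dupes = pd.concat([long, long])
wide_mean = dupes.pivot_table(index='name', columns='variable',
                              values='value', aggfunc='mean')
print(wide_mean)
```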


### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [69]:
toy1 = pd.DataFrame([[63, 142], [33, 47], [17, 44]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200], [18, 45]], columns=['age', 'HP'])

In [70]:
toy1

Unnamed: 0,age,HP
0,63,142
1,33,47
2,17,44


In [71]:
toy2

Unnamed: 0,age,HP
0,63,100
1,33,200
2,18,45


In [59]:
toy2.set_index('age')

age,HP
63,100
33,200
18,45


In [62]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B')

Unnamed: 0,age,HP_A,HP_B
0,63,142,100.0
1,33,47,200.0
2,17,44,


In [72]:
toy1.set_index('age').join(toy2.set_index('age'),
                           rsuffix='_2')

age,HP,HP_2
63,142,100.0
33,47,200.0
17,44,
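`.join()` performs a left join by default, which is why age 17 is kept with a missing `HP_2`. The `how` parameter changes that behaviour. The sketch below reuses the `toy1` and `toy2` definitions from this section.

```python
import pandas as pd

toy1 = pd.DataFrame([[63, 142], [33, 47], [17, 44]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200], [18, 45]], columns=['age', 'HP'])

left = toy1.set_index('age')
right = toy2.set_index('age')

# 'inner' keeps only the ages present in both frames.
inner = left.join(right, how='inner', rsuffix='_2')

# 'outer' keeps every age from either frame and fills the gaps with NaN.
outer = left.join(right, how='outer', rsuffix='_2')

print(inner)
print(outer)
```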


### `.merge()`

In [75]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [76]:
states = pd.read_csv('data/states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


In [77]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington
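The inner merge above keeps only rows whose state appears in both tables. A short sketch of the other `how` options follows, using made-up rows so that some keys fail to match. The `sam`/`NY` row is hypothetical and is included only to show an unmatched key.

```python
import pandas as pd

# Hypothetical data; 'NY' has no match in states, 'OH' has none in chars.
chars = pd.DataFrame({'name': ['greg', 'alan', 'sam'],
                      'home_state': ['WA', 'TX', 'NY']})
states = pd.DataFrame({'state': ['WA', 'TX', 'OH'],
                       'capital': ['Olympia', 'Austin', 'Columbus']})

# how='left' keeps every row of chars; unmatched rows get NaN.
left = chars.merge(states, left_on='home_state', right_on='state', how='left')

# how='outer' keeps rows from both sides; indicator=True adds a '_merge'
# column recording where each row came from.
outer = chars.merge(states, left_on='home_state', right_on='state',
                    how='outer', indicator=True)

print(left)
print(outer)
```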


### `pd.concat()`

In [79]:
pd.concat([ds_chars.rename(columns={'home_state':'state'}), states], sort=False)

Unnamed: 0,name,HP,state,nickname,capital
0,greg,200.0,WA,,
1,miles,200.0,WA,,
2,alan,170.0,TX,,
3,alison,300.0,DC,,
4,rachel,200.0,TX,,
0,,,WA,evergreen,Olympia
1,,,TX,alamo,Austin
2,,,DC,district,Washington
3,,,OH,buckeye,Columbus
4,,,OR,beaver,Salem
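Notice that the row labels 0 to 4 repeat in the output above, because `pd.concat()` stacks the frames and keeps each frame's original index. A minimal sketch of two common options, on made-up data: `ignore_index=True` renumbers the rows, and `axis=1` places frames side by side instead.

```python
import pandas as pd

# Hypothetical small frames.
a = pd.DataFrame({'name': ['greg', 'miles'], 'HP': [200, 200]})
b = pd.DataFrame({'name': ['alan'], 'HP': [170]})

# Stack rows and renumber the index from 0.
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)

# axis=1 aligns on the row labels and adds columns instead of rows.
extra = pd.DataFrame({'home_state': ['WA', 'WA']})
side_by_side = pd.concat([a, extra], axis=1)
print(side_by_side)
```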


### `pd.melt()`

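Melting reshapes a DataFrame from wide to long format: the columns listed in `id_vars` are kept as identifiers, and each column in `value_vars` is unpivoted into rows, with its name in a `variable` column and its contents in a `value` column.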

In [80]:
ds_chars.head()

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [81]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])

Unnamed: 0,name,variable,value
0,greg,HP,200
1,miles,HP,200
2,alan,HP,170
3,alison,HP,300
4,rachel,HP,200
5,greg,home_state,WA
6,miles,home_state,WA
7,alan,home_state,TX
8,alison,home_state,DC
9,rachel,home_state,TX
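Melting is roughly the inverse of pivoting. The sketch below, on a two-row version of the table, names the output columns with `var_name` and `value_name` and then pivots the long table back to wide. The names `attribute` and `val` are arbitrary choices for illustration.

```python
import pandas as pd

# Two rows shaped like ds_chars.
wide = pd.DataFrame({'name': ['greg', 'alan'],
                     'HP': [200, 170],
                     'home_state': ['WA', 'TX']})

# var_name and value_name replace the default 'variable' and 'value' headers.
long = pd.melt(wide, id_vars=['name'], value_vars=['HP', 'home_state'],
               var_name='attribute', value_name='val')
print(long)

# Pivoting on the same columns restores one row per name.
back = long.pivot(index='name', columns='attribute', values='val')
print(back)
```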


In [85]:
melted = pd.melt(animal_outcomes, id_vars=['Animal ID'], value_vars=['Age upon Intake', 'age'])

In [86]:
melted.loc[melted['Animal ID'] == 'A786884']

Unnamed: 0,Animal ID,variable,value
0,A786884,Age upon Intake,2 years
117738,A786884,age,1.43


## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

_Hints_ :
- import and clean the intake dataset first
- use `apply`/`applymap`/`lambda` to change the variables to their proper format in the intake data
- rename the columns in the intake dataset *before* joining
- create a new `days_in_shelter` column
- Notice that some values in `days_in_shelter` are `NaN` or negative; remove these rows using the `<` operator and `isna()` or `dropna()`
- Use `groupby` to get aggregate information about the dataset (your choice)
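The hints above can be sketched end to end on a few made-up rows. Everything in this block is an assumption for illustration: the values are invented, and `Outcome Type`, `intake_date`, and `outcome_date` are hypothetical column names, so check them against the real files before reusing any of it.

```python
import pandas as pd

# Hypothetical miniature intake and outcome tables; all values are made up.
intakes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'DateTime': ['01/03/2019 04:19:00 PM', '07/05/2015 12:59:00 PM',
                 '04/14/2016 06:43:00 PM'],
})
outcomes = pd.DataFrame({
    'Animal ID': ['A1', 'A2'],
    'DateTime': ['01/08/2019 03:00:00 PM', '07/01/2015 09:00:00 AM'],
    'Outcome Type': ['Adoption', 'Transfer'],
})

# Parse the date strings and give each side its own column name.
fmt = '%m/%d/%Y %I:%M:%S %p'
intakes['intake_date'] = pd.to_datetime(intakes['DateTime'], format=fmt)
outcomes['outcome_date'] = pd.to_datetime(outcomes['DateTime'], format=fmt)

# Join on Animal ID, keeping every intake row.
merged = intakes[['Animal ID', 'intake_date']].merge(
    outcomes[['Animal ID', 'outcome_date', 'Outcome Type']],
    on='Animal ID', how='left')

merged['days_in_shelter'] = (merged['outcome_date'] - merged['intake_date']).dt.days

# Drop rows with no outcome (NaN) or an outcome dated before the intake.
clean = merged.dropna(subset=['days_in_shelter'])
clean = clean[clean['days_in_shelter'] >= 0]

summary = clean.groupby('Outcome Type')['days_in_shelter'].mean()
print(summary)
```

In this toy example only `A1` survives the cleaning step: `A3` has no outcome, and the outcome for `A2` is dated before its intake.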

To save your dataset:
Use `df.to_csv()` to write `df` to a CSV file, or `df.to_excel()` to write it to an Excel file. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
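A minimal sketch of saving and reloading a DataFrame follows. The file name `ds_chars_copy.csv` and the temporary directory are placeholders. Passing `index=False` leaves the row labels out of the file, which avoids an extra `Unnamed: 0` column when the file is read back.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['greg', 'alan'], 'HP': [200, 170]})

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'ds_chars_copy.csv')

# index=False omits the row labels from the file.
df.to_csv(path, index=False)

round_trip = pd.read_csv(path)
print(round_trip)
```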

In [None]:
#code here

## 5. Pandas Practice

### Introduction

In [1]:
# find and import the World Cup data held in data/ folder

### Practice Questions <a id="practice"></a>

1. Subset the DataFrame to only non-null rows.

In [None]:
#Your code here.

2. How many of the matches were in Montevideo?  

In [None]:
#Your code here.

2b. If you haven't already, investigate why this code returns zero:

```python
print(len(df[df.City=="Montevideo"]))
```

In [None]:
#Your code here.

3. How many matches did USA play in 2014?  

Hint: they could have been home or away.  

You can combine conditions like this:  
```python
# Returns rows where either condition is true
df[(condition1) | (condition2)]

# Returns rows where both conditions are true  
df[(condition1) & (condition2)]
```
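As a worked illustration of combining conditions, the sketch below filters a made-up table. The column names `Year`, `Home`, and `Away` are hypothetical and may not match the World Cup file, so adapt them to the actual columns.

```python
import pandas as pd

# Hypothetical match data; column names are illustrative only.
matches = pd.DataFrame({
    'Year': [2014, 2014, 2010, 2014],
    'Home': ['USA', 'Ghana', 'USA', 'Brazil'],
    'Away': ['Portugal', 'USA', 'England', 'Chile'],
})

# | means "or": the team appears on either side of the fixture.
is_usa = (matches['Home'] == 'USA') | (matches['Away'] == 'USA')

# & means "and": also restrict to a single year.
usa_2014 = matches[is_usa & (matches['Year'] == 2014)]
print(len(usa_2014))
```

Each condition needs its own parentheses, because `|` and `&` bind more tightly than `==` in Python.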

In [None]:
#Your code here.

4. How many teams played in 1986?

In [None]:
#Your code here.

5. How many matches were there with 5 or more total goals?

In [None]:
#Your code here.

6. Come up with, and answer, two other questions that can be addressed by filtering or subsetting this DataFrame.

In [None]:
#6a Question:

In [None]:
#6a Solution (with code):

In [None]:
#6b Question:

In [None]:
#6b Solution (with code):