# Module 1 - Reshaping Data with Pandas
## Pandas Part 3

In [1]:
import pandas as pd
uci = pd.read_csv('data/heart.csv')

In [2]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

If you have used SQL, you have probably written a `GROUP BY` clause. Pandas offers the same operation through `.groupby()`.

The method splits the rows into groups that share a value in one or more columns, so that an aggregate function such as `.mean()` or `.min()` can then be computed separately for each group.
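
As a minimal sketch of that split, apply, combine pattern, the toy DataFrame below (hypothetical values, not rows from `heart.csv`) groups on `sex` and averages `chol` within each group:

```python
import pandas as pd

# Hypothetical toy data, not taken from heart.csv
toy = pd.DataFrame({
    'sex':  [0, 1, 0, 1, 1],
    'chol': [200, 240, 260, 220, 200],
})

# Split the rows by 'sex', apply mean() to 'chol' in each group,
# and combine the results into one Series indexed by 'sex'
mean_chol = toy.groupby('sex')['chol'].mean()
print(mean_chol)
```

Selecting the column before aggregating keeps the result to the one statistic of interest instead of averaging every numeric column.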

In [3]:
uci.groupby('sex')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023C1B9E8C08>

#### `.groups` and `.get_group()`

In [4]:
uci.groupby('sex').groups

{0: Int64Index([  2,   4,   6,  11,  14,  15,  16,  17,  19,  25,  28,  30,  35,
              36,  38,  39,  40,  43,  48,  49,  50,  53,  54,  59,  60,  65,
              67,  69,  74,  75,  82,  84,  85,  88,  89,  93,  94,  96, 102,
             105, 107, 108, 109, 110, 112, 115, 118, 119, 120, 122, 123, 124,
             125, 127, 128, 129, 130, 131, 134, 135, 136, 140, 142, 143, 144,
             146, 147, 151, 153, 154, 155, 161, 167, 181, 182, 190, 204, 207,
             213, 215, 216, 220, 223, 241, 246, 252, 258, 260, 263, 266, 278,
             289, 292, 296, 298, 302],
            dtype='int64'),
 1: Int64Index([  0,   1,   3,   5,   7,   8,   9,  10,  12,  13,
             ...
             288, 290, 291, 293, 294, 295, 297, 299, 300, 301],
            dtype='int64', length=207)}

In [5]:
uci.groupby('sex').get_group(0)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
6,56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2,1
14,58,0,3,150,283,1,0,162,0,1.0,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,55,0,0,128,205,0,2,130,1,2.0,1,1,3,0
292,58,0,0,170,225,1,0,146,1,2.8,1,2,1,0
296,63,0,0,124,197,0,1,136,1,0.0,1,0,2,0
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0


### Aggregating

In [6]:
uci.groupby('sex').mean()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,55.677083,1.041667,133.083333,261.302083,0.125,0.572917,151.125,0.229167,0.876042,1.427083,0.552083,2.125,0.75
1,53.758454,0.932367,130.94686,239.289855,0.15942,0.507246,148.961353,0.371981,1.115459,1.386473,0.811594,2.400966,0.449275


Exercise: Tell me the average cholesterol level for those with heart disease.

In [7]:
# 'target' is incidence of heart disease
uci.groupby('target').mean().loc[1,'chol']

242.23030303030302

In [8]:
uci.groupby('target').get_group(1).mean()

age          52.496970
sex           0.563636
cp            1.375758
trestbps    129.303030
chol        242.230303
fbs           0.139394
restecg       0.593939
thalach     158.466667
exang         0.139394
oldpeak       0.583030
slope         1.593939
ca            0.363636
thal          2.121212
target        1.000000
dtype: float64

In [9]:
uci.groupby('cp').std().loc[3, 'slope']

0.6887004431501819

In [10]:
uci.groupby('sex').min()

sex,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,34,0,94,141,0,0,96,0,0.0,0,0,0,0
1,29,0,94,126,0,0,71,0,0.0,0,0,0,0


In [11]:
# Multi-index
uci.groupby(['sex','cp']).min()

sex,cp,age,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,0,35,100,149,0,0,106,0,0.0,0,0,1,0
0,1,34,105,160,0,0,121,0,0.0,1,0,2,0
0,2,37,94,141,0,0,96,0,0.0,0,0,0,0
0,3,58,140,226,0,0,114,0,0.9,0,0,2,1
1,0,35,100,131,0,0,71,0,0.0,0,0,0,0
1,1,29,101,157,0,0,103,0,0.0,0,0,1,0
1,2,37,94,126,0,0,112,0,0.0,0,0,1,0
1,3,34,110,182,0,0,125,0,0.0,0,0,1,0
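
Grouping on two columns produces a result with a two-level (multi-) index. The sketch below uses hypothetical toy values to show a few common ways of reading such a result, plus `.agg()` for computing several aggregates at once:

```python
import pandas as pd

# Hypothetical toy data, not taken from heart.csv
toy = pd.DataFrame({
    'sex': [0, 0, 0, 1, 1, 1],
    'cp':  [0, 0, 1, 0, 1, 1],
    'age': [50, 40, 60, 35, 45, 55],
})

min_age = toy.groupby(['sex', 'cp'])['age'].min()

# A tuple selects one (sex, cp) combination
print(min_age.loc[(0, 0)])

# A single key selects every row under that outer level
print(min_age.loc[1])

# reset_index() turns the index levels back into ordinary columns
flat = min_age.reset_index()

# .agg() accepts a list of aggregate names and returns one column per aggregate
summary = toy.groupby('sex')['age'].agg(['min', 'max', 'mean'])
print(summary)
```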


### Apply to Animal Shelter Data 

In [12]:
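# Note: this URL serves the Austin Animal Center *intakes* data (see the column names below),
# even though the variable is named `animal_outcomes`
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD')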

In [13]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117772 entries, 0 to 117771
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         117772 non-null  object
 1   Name              80695 non-null   object
 2   DateTime          117772 non-null  object
 3   MonthYear         117772 non-null  object
 4   Found Location    117772 non-null  object
 5   Intake Type       117772 non-null  object
 6   Intake Condition  117772 non-null  object
 7   Animal Type       117772 non-null  object
 8   Sex upon Intake   117771 non-null  object
 9   Age upon Intake   117772 non-null  object
 10  Breed             117772 non-null  object
 11  Color             117772 non-null  object
dtypes: object(12)
memory usage: 10.8+ MB


In [14]:
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,01/03/2019 04:19:00 PM,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,07/05/2015 12:59:00 PM,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,04/14/2016 06:43:00 PM,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,10/21/2013 07:59:00 AM,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,06/29/2014 10:38:00 AM,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


#### Task 1
- Use a `groupby` to show the average age for each animal type.
- What about by animal type **and** sex?

In [15]:
# pd.to_datetime converts a whole column at once; no element-wise lambda is needed
animal_outcomes['DateTime'] = pd.to_datetime(animal_outcomes['DateTime'])
animal_outcomes['MonthYear'] = pd.to_datetime(animal_outcomes['MonthYear'])
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117772 entries, 0 to 117771
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         117772 non-null  object        
 1   Name              80695 non-null   object        
 2   DateTime          117772 non-null  datetime64[ns]
 3   MonthYear         117772 non-null  datetime64[ns]
 4   Found Location    117772 non-null  object        
 5   Intake Type       117772 non-null  object        
 6   Intake Condition  117772 non-null  object        
 7   Animal Type       117772 non-null  object        
 8   Sex upon Intake   117771 non-null  object        
 9   Age upon Intake   117772 non-null  object        
 10  Breed             117772 non-null  object        
 11  Color             117772 non-null  object        
dtypes: datetime64[ns](2), object(10)
memory usage: 10.8+ MB


In [16]:
# Convert "Age upon Intake" into a common unit (days)
def turn_into_days(string):
    num, unit = string.split(" ")
    num = int(num)
    if 'year' in unit:
        return num * 365
    if 'month' in unit:
        return num * 30
    if 'week' in unit:
        return num * 7
    if 'day' in unit:
        return num
    return float('nan')  # unrecognized unit: mark as missing instead of storing a string

animal_outcomes['Days upon Intake'] = animal_outcomes['Age upon Intake'].map(turn_into_days)
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Days upon Intake
0,A786884,*Brock,2019-01-03 16:19:00,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,730
1,A706918,Belle,2015-07-05 12:59:00,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2920
2,A724273,Runster,2016-04-14 18:43:00,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,330
3,A665644,,2013-10-21 07:59:00,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,28
4,A682524,Rio,2014-06-29 10:38:00,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,1460


In [17]:
# Current age in years = (days since intake + age in days at intake) / 365
days_since_intake = (pd.Timestamp.now() - animal_outcomes['DateTime']).dt.days
animal_outcomes['Age'] = ((days_since_intake + animal_outcomes['Days upon Intake']) / 365).round(2)
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Days upon Intake,Age
0,A786884,*Brock,2019-01-03 16:19:00,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,730,3.43
1,A706918,Belle,2015-07-05 12:59:00,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2920,12.93
2,A724273,Runster,2016-04-14 18:43:00,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,330,5.06
3,A665644,,2013-10-21 07:59:00,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,28,6.72
4,A682524,Rio,2014-06-29 10:38:00,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,1460,9.95


In [18]:
animal_outcomes.groupby('Animal Type')['Age'].mean()

Animal Type
Bird         4.445493
Cat          4.789296
Dog          6.037395
Livestock    3.506667
Other        4.809147
Name: Age, dtype: float64

In [19]:
animal_outcomes.groupby(['Animal Type', 'Sex upon Intake'])['Age'].mean()

Animal Type  Sex upon Intake
Bird         Intact Female      4.852838
             Intact Male        5.016797
             Unknown            4.079283
Cat          Intact Female      4.079412
             Intact Male        3.931873
             Neutered Male      7.697362
             Spayed Female      7.873273
             Unknown            3.950686
Dog          Intact Female      4.974086
             Intact Male        5.191976
             Neutered Male      7.601011
             Spayed Female      7.779381
             Unknown            3.790644
Livestock    Intact Female      3.982500
             Intact Male        3.992500
             Neutered Male      7.280000
             Unknown            2.316250
Other        Intact Female      4.734297
             Intact Male        4.421200
             Neutered Male      5.077015
             Spayed Female      4.775741
             Unknown            4.848917
Name: Age, dtype: float64

#### Task 2:
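- Create new columns `year` and `month` from `DateTime`, for example with a lambda function that returns `x.year` or `x.month` for each date
- Use `groupby` and `.size()` to count the animals per month (the file loaded above is the intake dataset, so these counts are intakes rather than adoptions)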
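
One way to approach this task is sketched below on a hypothetical four-row DataFrame. It shows both the lambda approach named in the task and the vectorized `.dt` accessor, which works on any datetime column:

```python
import pandas as pd

# Hypothetical dates standing in for the shelter's DateTime column
toy = pd.DataFrame({
    'DateTime': pd.to_datetime(['2019-01-03', '2019-01-20', '2015-07-05', '2016-04-14']),
})

# Lambda approach: pull the year out of each Timestamp
toy['year'] = toy['DateTime'].map(lambda x: x.year)

# Vectorized equivalent for the month, using the .dt accessor
toy['month'] = toy['DateTime'].dt.month

# .size() counts the rows in each group
per_month = toy.groupby('month').size()
print(per_month)
```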

In [None]:
# Assumes the `month` column from the first bullet has already been created
animal_outcomes.groupby('month').size()

## 4. Reshaping a DataFrame

### `.pivot()`

If you have used Excel, you have probably built a pivot table. Pandas offers similar functionality: `.pivot()` reshapes a DataFrame by turning the distinct values of one column into new columns.
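
A minimal sketch of `.pivot()` on a hypothetical long-format table (the `patient`, `measure`, and `value` names are made up for illustration):

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, measure) pair
long_df = pd.DataFrame({
    'patient': ['a', 'a', 'b', 'b'],
    'measure': ['chol', 'trestbps', 'chol', 'trestbps'],
    'value':   [233, 145, 250, 130],
})

# Each distinct 'measure' becomes a column; each 'patient' becomes a row
wide = long_df.pivot(index='patient', columns='measure', values='value')
print(wide)
```

`.pivot()` only rearranges values and raises a `ValueError` if an index/column pair occurs more than once; `pd.pivot_table()` is the variant that aggregates duplicates.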

In [None]:
uci.pivot(values='sex', columns='target')

### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy1

In [None]:
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'HP'])
toy2

In [None]:
toy1.join(toy2.set_index('age'),
          on='age',
          lsuffix='_A',
          rsuffix='_B')

In [None]:
toy1.set_index('age').join(toy2.set_index('age'),
                           lsuffix='_A',
                           rsuffix='_B')

### `.merge()`

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars.head()

In [None]:
states = pd.read_csv('data/states.csv', index_col=0)
states.head()

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')
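
Because the contents of `data/ds_chars.csv` and `data/states.csv` are not shown here, the sketch below uses two hypothetical stand-in tables to illustrate how the `how` argument changes the result of a merge on differently named key columns:

```python
import pandas as pd

# Hypothetical stand-ins for ds_chars and states
chars = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy'],
    'home_state': ['WA', 'TX', 'ZZ'],   # 'ZZ' has no match in the states table
})
states_toy = pd.DataFrame({
    'state': ['WA', 'TX', 'NY'],
    'capital': ['Olympia', 'Austin', 'Albany'],
})

# Inner merge keeps only rows whose key appears in both tables
inner = chars.merge(states_toy, left_on='home_state', right_on='state', how='inner')

# Left merge keeps every row of the left table and fills unmatched columns with NaN
left = chars.merge(states_toy, left_on='home_state', right_on='state', how='left')

print(inner)
print(left)
```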

### `pd.concat()`

In [None]:
pd.concat([ds_chars, states], sort=False)

In [None]:
pd.concat([ds_chars.rename(columns={'home_state':'state'}), states], sort=False)

### `pd.melt()`

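Melting unpivots a DataFrame from wide to long format: the columns listed in `id_vars` are kept as identifiers, and each column listed in `value_vars` is stacked into two columns, `variable` (the original column name) and `value` (its contents).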

In [None]:
ds_chars.head()

In [None]:
# Turns a wide-form table into a long-form table
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
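
To make the shape change visible without the CSV file, here is the same call on a hypothetical two-row table that reuses the column names from the cell above:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the melt call above
wide = pd.DataFrame({
    'name': ['Ann', 'Bo'],
    'HP': [142, 47],
    'home_state': ['WA', 'TX'],
})

long = pd.melt(wide, id_vars=['name'], value_vars=['HP', 'home_state'])
print(long)
```

Two rows with two melted columns become four rows: one per `name` and original column.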

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

_Hints_ :
- import and clean the intake dataset first
- use `apply`/`applymap`/`lambda` to change the variables to their proper format in the intake data (from pandas 2.1 on, `DataFrame.applymap` is named `DataFrame.map`)
- rename the columns in the intake dataset *before* joining
- create a new `days_in_shelter` column
- notice that some values in `days_in_shelter` are `NaN` or negative; remove these rows using the `<` operator together with `isna()` or `dropna()`
- Use `groupby` to get aggregate information about the dataset (your choice)
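
The hints above can be sketched end to end on toy data. Everything below is hypothetical: the rows, the `intake_date` and `outcome_date` column names, and the choice of a left merge are assumptions for illustration, not the exercise's required solution.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned, renamed datasets
intakes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3', 'A4'],
    'intake_date': pd.to_datetime(['2019-01-01', '2019-02-10', '2019-03-05', '2019-04-01']),
})
outcomes = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A3'],
    'outcome_date': pd.to_datetime(['2019-01-11', '2019-02-05', '2019-03-25']),
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption'],
})

# A left merge keeps animals with no recorded outcome (their outcome_date is NaT)
merged = intakes.merge(outcomes, on='Animal ID', how='left')

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
merged['days_in_shelter'] = (merged['outcome_date'] - merged['intake_date']).dt.days

# Drop missing durations, then negative ones
clean = merged.dropna(subset=['days_in_shelter'])
clean = clean[clean['days_in_shelter'] >= 0]

avg_stay = clean.groupby('Outcome Type')['days_in_shelter'].mean()
print(avg_stay)
```

In the real data an animal may appear more than once in each file, so merging on `Animal ID` alone pairs every intake with every outcome for that animal; this is one possible source of negative durations.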

To save your dataset:
Use `df.to_csv()` to write `df` to a CSV file, or `df.to_excel()` to write it to an Excel file. See the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).
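
A short round-trip sketch, assuming a hypothetical two-row DataFrame and a made-up file name written to a temporary directory:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data and file name, for illustration only
df = pd.DataFrame({'Animal ID': ['A1', 'A2'], 'days_in_shelter': [10, 20]})
path = os.path.join(tempfile.mkdtemp(), 'shelter_stays.csv')

# index=False leaves out the row index, so no extra 'Unnamed: 0' column appears on re-read
df.to_csv(path, index=False)

reloaded = pd.read_csv(path)
print(reloaded)
```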

In [None]:
#code here
animal_intakes = pd.read_csv('https://data.austintexas.gov/api/views/wter-evkm/rows.csv?accessType=DOWNLOAD')
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [None]:
animal_intakes.info()

In [None]:
animal_intakes.head()

In [None]:
animal_outcomes.info()

In [None]:
animal_outcomes.loc[animal_outcomes['Animal ID'] == 'A682524']

## 5. Pandas Practice

### Introduction

In [None]:
# find and import the World Cup data held in data/ folder

### Practice Questions <a id="practice"></a>

1. Subset the DataFrame to only non-null rows.

In [None]:
#Your code here.

2. How many of the matches were in Montevideo?  

In [None]:
#Your code here.

2. b If you haven't already, investigate why this code returns zero:

```python
print(len(df[df.City=="Montevideo"]))
```

In [None]:
#Your code here.

3. How many matches did USA play in 2014?  

Hint: they could have been home or away.  

You can combine conditions like this:  
```python
# Returns rows where either condition is true
df[(condition1) | (condition2)]

# Returns rows where both conditions are true  
df[(condition1) & (condition2)]
```
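
The pattern above can be tried on a small table. The sketch below uses hypothetical rows and hypothetical column names (`Year`, `Home`, `Away`), which may not match the World Cup file:

```python
import pandas as pd

# Hypothetical match data; column names are assumptions
matches = pd.DataFrame({
    'Year': [2014, 2014, 2010, 2014],
    'Home': ['USA', 'Ghana', 'USA', 'Brazil'],
    'Away': ['Ghana', 'Brazil', 'Spain', 'USA'],
})

# Each comparison yields a boolean Series; | and & combine them element-wise
involves_usa = (matches['Home'] == 'USA') | (matches['Away'] == 'USA')
in_2014 = matches['Year'] == 2014

usa_2014 = matches[involves_usa & in_2014]
print(len(usa_2014))
```

The parentheses around each comparison are required because `|` and `&` bind more tightly than `==` in Python.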

In [None]:
#Your code here.

4. How many teams played in 1986?

In [None]:
#Your code here.

5. How many matches were there with 5 or more total goals?

In [None]:
#Your code here.

6. Come up with, and answer, two other questions that can be addressed by filtering or subsetting this DataFrame.

In [None]:
#6a Question:

In [None]:
#6a Solution (with code):

In [None]:
#6b Question:

In [None]:
#6b Solution (with code):