In [4]:
!open .

# Module 1 - Manipulating data with Pandas (continued)
## Pandas Part 2

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)

## Scenario:
You have decided to start your own animal shelter, but first you want to get an idea of what that will entail and gather more information for planning. In this lecture, we continue to look at a real data set collected by [Austin Animal Center](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) over several years, using our Pandas skills from last class and learning some new ones in order to explore this data further.

#### *Our goals today are to be able to*:  

Use the pandas library to:

- Get summary info about a dataset and its variables
  - Apply and use `info`, `describe`, and `dtypes`
  - Use `mean`, `min`, `max`, and `value_counts`
- Use `apply` and `applymap` to transform columns and create new values
- Explain lambda functions and use them with `apply` on a DataFrame
- Explain what a `groupby` object is and split a DataFrame using `groupby`
- Reshape a DataFrame using joins, merges, pivoting, stacking, and melting


## Getting started

Before we look at the animal shelter data, let's practice on a simpler dataset.
Read about this dataset here: https://www.kaggle.com/ronitf/heart-disease-uci
![heart-data](images/heartbloodpres.jpeg)

The dataset is most often used to practice classification algorithms. Can one develop a model to predict the likelihood of heart disease based on other measurable characteristics? We will return to that specific question in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

### 1. Get summary info about a dataset and its variables

Applying and using `info`, `describe`, `mean`, `min`, `max`, `apply`, and `applymap` from the Pandas library

The Pandas library has several useful tools built in. Let's explore some of them.

In [1]:
!pwd
!ls -al

'pwd' is not recognized as an internal or external command,
operable program or batch file.
'ls' is not recognized as an internal or external command,
operable program or batch file.


In [3]:
pwd

'C:\\Users\\jongs\\Desktop\\DS_2020\\Course_Materials\\code\\hbs-ds-060120\\module-1\\day-6-pandas-2'

In [4]:
ls -al

 Volume in drive C has no label.
 Volume Serial Number is 06B9-E2F5

 Directory of C:\Users\jongs\Desktop\DS_2020\Course_Materials\code\hbs-ds-060120\module-1\day-6-pandas-2



File Not Found


In [5]:
import pandas as pd
uci = pd.read_csv('data/heart.csv')

In [6]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


#### The `.columns` and `.shape` Attributes

In [7]:
uci.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [10]:
uci.shape

(303, 14)

In [11]:
uci.info  # without the parentheses, this returns the bound method instead of the summary

<bound method DataFrame.info of      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0     63    1   3       145   233    1        0      150      0      2.3   
1     37    1   2       130   250    0        1      187      0      3.5   
2     41    0   1       130   204    0        0      172      0      1.4   
3     56    1   1       120   236    0        1      178      0      0.8   
4     57    0   0       120   354    0        1      163      1      0.6   
..   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
298   57    0   0       140   241    0        1      123      1      0.2   
299   45    1   3       110   264    0        1      132      0      1.2   
300   68    1   0       144   193    1        1      141      0      3.4   
301   57    1   0       130   131    0        1      115      1      1.2   
302   57    0   1       130   236    0        0      174      0      0.0   

     slope  ca  thal  target  
0        0   0     1    

#### The `.info()` and `.describe()` methods and the `.dtypes` attribute

Pandas DataFrames have many useful methods! Let's look at `.info()`, `.describe()`, and `.dtypes`.

In [12]:
# Call the .info() method on our dataset. What do you observe?

uci.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [13]:
# Call the .describe() method on our dataset. What do you observe?

uci.describe()

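# count: non-null values per column, so a count below the number of rows signals missing data
# mean, std, min, quartiles (50% is the median), and max
# NumPy functions such as np.mean can also be applied to the columns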

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [14]:
# Use the code below. How does the output differ from info() ?
uci.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

#### `.mean()`, `.min()`, `.max()`, `.sum()`

The methods `.mean()`, `.min()`, and `.max()` will perform just the way you think they will!

Note that these are methods both for Series and for DataFrames.
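As a minimal sketch of that difference, the toy frame below (hypothetical values, not the heart dataset) calls the same methods once on a single column and once on the whole DataFrame. The names `toy`, `age_mean`, `col_means`, and `col_max` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

age_mean = toy["age"].mean()  # called on a Series: returns a single number
col_means = toy.mean()        # called on a DataFrame: returns one value per column
col_max = toy.max()           # likewise, one maximum per column

print(age_mean)
print(col_means)
print(col_max)
```

On a Series the result is a scalar; on a DataFrame the result is itself a Series indexed by column name.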

In [15]:
uci.ca.mean()

0.7293729372937293

In [18]:
uci.ca.hasnans()  # raises TypeError: hasnans is a property, not a method, so no parentheses are needed

TypeError: 'bool' object is not callable

In [16]:
uci.age.mean()


54.366336633663366

#### The Axis Variable

In [23]:
uci.sum(axis=0) # Try [shift] + [tab] here!

age         16473.0
sex           207.0
cp            293.0
trestbps    39882.0
chol        74618.0
fbs            45.0
restecg       160.0
thalach     45343.0
exang          99.0
oldpeak       315.0
slope         424.0
ca            221.0
thal          701.0
target        165.0
dtype: float64

In [24]:
uci.sum(axis=1)  # axis=0 (index) sums down each column; axis=1 (columns) sums across each row

0      600.3
1      614.5
2      554.4
3      598.8
4      701.6
       ...  
298    567.2
299    561.2
300    558.4
301    442.2
302    602.0
Length: 303, dtype: float64
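The effect of `axis` is easier to check by hand on a tiny frame. The sketch below uses a hypothetical two-column DataFrame named `toy`; nothing in it comes from the heart dataset.

```python
import pandas as pd

# Hypothetical toy data small enough to verify by hand
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = toy.sum(axis=0)  # collapse the rows: one total per column
row_totals = toy.sum(axis=1)  # collapse the columns: one total per row

print(col_totals)
print(row_totals)
```

A useful way to remember it: the axis you pass is the one that disappears from the result.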

#### `.value_counts()`

For a DataFrame _Series_, the `.value_counts()` method will tell you how many of each value you've got.

In [25]:
uci['age'].value_counts()[:10]   # frequency count of each age (the data behind a histogram), top ten only

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
Name: age, dtype: int64

In [26]:
uci['age'].value_counts()

58    19
57    17
54    16
59    14
52    13
51    12
62    11
44    11
60    11
56    11
64    10
41    10
63     9
67     9
55     8
45     8
42     8
53     8
61     8
65     8
43     8
66     7
50     7
48     7
46     7
49     5
47     5
39     4
35     4
68     4
70     4
40     3
71     3
69     3
38     3
34     2
37     2
77     1
76     1
74     1
29     1
Name: age, dtype: int64

In [32]:
uci.age.value_counts(ascending=True)[:15]  # dot notation (uci.age) is often more convenient than brackets; [shift] + [tab] shows the available arguments, such as ascending

29    1
74    1
76    1
77    1
37    2
34    2
38    3
69    3
71    3
40    3
70    4
68    4
35    4
39    4
47    5
Name: age, dtype: int64

Exercise: What are the different values for restecg?

In [36]:
# Your code here!

uci.restecg.value_counts()

1    152
0    147
2      4
Name: restecg, dtype: int64

In [41]:
type(uci.restecg.value_counts())  # uci.restecg is itself a Series (one column of the DataFrame), and so is the result of value_counts()

pandas.core.series.Series

In [42]:
uci.restecg.value_counts().index

Int64Index([1, 0, 2], dtype='int64')

In [38]:
uci.restecg.unique()

array([0, 1, 2], dtype=int64)

In [39]:
type(uci.restecg.unique())

numpy.ndarray
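To contrast the two approaches side by side, here is a small sketch on a hypothetical Series (`s` is made up, not the real `restecg` column). It also uses the `normalize=True` option of `value_counts`, which returns proportions instead of raw counts.

```python
import pandas as pd

# Hypothetical toy Series standing in for a categorical column
s = pd.Series([1, 0, 1, 2, 1, 0], name="restecg_toy")

counts = s.value_counts()                # a Series: values in the index, counts as data
shares = s.value_counts(normalize=True)  # same, but as fractions of the total
distinct = s.unique()                    # a NumPy array, in order of first appearance

print(counts)
print(shares)
print(distinct)
```

`value_counts` answers "how many of each?", while `unique` only answers "which values occur?".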

In [47]:
# creating a new dataframe from the counts
age_counts = pd.DataFrame(uci.age.value_counts())  #get age value counts
age_counts.columns = ['count']                  # set column to 'count'
age_counts = age_counts.reset_index()           # reset the index
age_counts = age_counts.rename(columns={'index':'age'}) # rename the column
age_counts

Unnamed: 0,age,count
0,58,19
1,57,17
2,54,16
3,59,14
4,52,13
5,51,12
6,62,11
7,44,11
8,60,11
9,56,11
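The same counts table can be built in one chained expression. This is a sketch of an alternative, not the approach used in the cell above: it relies on `rename_axis` to name the index and on the `name` argument of `Series.reset_index` to name the counts column. The Series `ages` is hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy ages, not the heart dataset
ages = pd.Series([58, 57, 58, 54, 58, 57], name="age")

age_counts = (
    ages.value_counts()          # counts, with the ages in the index
        .rename_axis("age")      # give the index an explicit name
        .reset_index(name="count")  # move the index into a column, name the counts
)

print(age_counts)
```

Naming both pieces explicitly avoids depending on the default column names, which differ between pandas versions.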


### Apply to Animal Shelter Data
Using `.info()`, `.describe()`, and `.dtypes`, what observations can we make about the data?

What are the breed value counts?

How about age counts for dogs?

In [43]:
animal_outcomes = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')

In [48]:
animal_outcomes.info

<bound method DataFrame.info of        Animal ID          Name                DateTime  \
0        A794011         Chunk  05/08/2019 06:20:00 PM   
1        A776359         Gizmo  07/18/2018 04:02:00 PM   
2        A720371         Moose  02/13/2016 05:59:00 PM   
3        A674754           NaN  03/18/2014 11:47:00 AM   
4        A689724    *Donatello  10/18/2014 06:52:00 PM   
...          ...           ...                     ...   
117995   A818200   *Baby Spice  06/07/2020 04:34:00 PM   
117996   A800898  Chile Pepper  07/28/2019 05:48:00 PM   
117997   A817995       *Topeka  06/07/2020 06:01:00 PM   
117998   A818380           NaN  06/07/2020 07:24:00 PM   
117999   A818054           NaN  06/07/2020 07:17:00 PM   

                     MonthYear Date of Birth     Outcome Type Outcome Subtype  \
0       05/08/2019 06:20:00 PM    05/02/2017        Rto-Adopt             NaN   
1       07/18/2018 04:02:00 PM    07/12/2017         Adoption             NaN   
2       02/13/2016 05:59:00 

In [None]:
# what do you notice? missing data (NaN)
# space in column names
# combined info in some columns

In [51]:
animal_outcomes.describe()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
count,118000,81031,118000,118000,118000,117992,53642,118000,117996,117948,118000,118000
unique,105483,18947,97241,97241,6852,9,22,5,5,50,2571,585
top,A721033,Max,04/18/2016 12:00:00 AM,04/18/2016 12:00:00 AM,04/21/2014,Adoption,Partner,Dog,Neutered Male,1 year,Domestic Shorthair Mix,Black/White
freq,33,531,39,39,117,51976,29271,67130,41427,21236,30780,12401


In [50]:
animal_outcomes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

In [52]:
animal_outcomes.Name.value_counts()[:10]  #top ten names

Max         531
Bella       499
Luna        461
Rocky       358
Daisy       345
Princess    328
Charlie     316
Coco        310
Lucy        304
Blue        297
Name: Name, dtype: int64

In [54]:
animal_outcomes.Breed.value_counts()

Domestic Shorthair Mix                  30780
Pit Bull Mix                             8305
Labrador Retriever Mix                   6645
Chihuahua Shorthair Mix                  6174
Domestic Shorthair                       5229
                                        ...  
Dutch Shepherd/Anatol Shepherd              1
Cavalier Span/Toy Poodle                    1
Norwich Terrier/Pug                         1
Chihuahua Shorthair/Shiba Inu               1
Australian Shepherd/Golden Retriever        1
Name: Breed, Length: 2571, dtype: int64

In [55]:
animal_outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black


In [56]:
animal_outcomes['Animal Type'].unique()

array(['Cat', 'Dog', 'Other', 'Bird', 'Livestock'], dtype=object)

In [60]:
animal_outcomes.loc[animal_outcomes['Animal Type'] == 'Dog'].head()
# .loc takes up to two selectors: .loc[rows, columns]
# Passing a boolean (True/False) Series as the row selector keeps only the rows that satisfy the condition

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
14,A765349,Einstein,06/08/2018 01:04:00 PM,06/08/2018 01:04:00 PM,01/18/2009,Adoption,Foster,Dog,Neutered Male,9 years,Chihuahua Shorthair Mix,Tricolor
15,A760697,Star,10/26/2017 03:22:00 PM,10/26/2017 03:22:00 PM,10/23/2007,Transfer,Partner,Dog,Intact Male,10 years,Yorkshire Terrier Mix,Brown/Black
16,A767231,Millie,02/25/2018 05:19:00 PM,02/25/2018 05:19:00 PM,02/25/2017,Return to Owner,,Dog,Spayed Female,1 year,Jack Russell Terrier/Chihuahua Shorthair,White/Tan


In [61]:
animal_outcomes.loc[animal_outcomes['Animal Type'] == 'Dog', 'Breed'].head()  # show the breed only

1                      Chihuahua Shorthair Mix
2           Anatol Shepherd/Labrador Retriever
14                     Chihuahua Shorthair Mix
15                       Yorkshire Terrier Mix
16    Jack Russell Terrier/Chihuahua Shorthair
Name: Breed, dtype: object

In [66]:
animal_outcomes['Animal Type'].unique()

array(['Cat', 'Dog', 'Other', 'Bird', 'Livestock'], dtype=object)

What are the breed `value_counts`?
What's the top breed for adopted dogs?

How about outcome counts for dogs?




In [65]:
animal_outcomes.loc[(animal_outcomes['Animal Type'] == 'Dog') & (animal_outcomes['Outcome Type'] == 'Adoption'), 'Breed'].value_counts()[0:5]

Labrador Retriever Mix       3417
Pit Bull Mix                 3234
Chihuahua Shorthair Mix      2954
German Shepherd Mix          1488
Australian Cattle Dog Mix     826
Name: Breed, dtype: int64
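The long filter above can be easier to read when each condition gets its own name. The sketch below shows the pattern on a hypothetical four-row frame called `pets`; the column names are illustrative and only loosely mirror the shelter data.

```python
import pandas as pd

# Hypothetical toy data, not the Austin Animal Center records
pets = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Dog"],
    "outcome_type": ["Adoption", "Adoption", "Transfer", "Adoption"],
    "breed": ["Pit Bull Mix", "Domestic Shorthair",
              "Labrador Retriever Mix", "Pit Bull Mix"],
})

# Each comparison yields a boolean Series with one True/False per row
is_dog = pets["animal_type"] == "Dog"
is_adopted = pets["outcome_type"] == "Adoption"

# & combines the masks elementwise; .loc[rows, column] then selects the breeds
adopted_dog_breeds = pets.loc[is_dog & is_adopted, "breed"]
top = adopted_dog_breeds.value_counts()

print(top)
```

When the comparisons are written inline instead, each one needs its own parentheses, because `&` binds more tightly than `==` in Python.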

### 2.  Changing data

#### DataFrame.applymap() and Series.map()

The `.applymap()` method takes a function as input and applies it to every entry in the DataFrame. (In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of the equivalent `DataFrame.map`.)

In [74]:
import numpy as np

def successor(x):  # defining a sample function
    return x + 1

successor(np.pi)

4.141592653589793

In [76]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [82]:
type(successor)

function

In [84]:
type(successor(1))

int

In [77]:
uci.applymap(successor).head()  # applies the function elementwise, to every single value in the DataFrame
# Transforming the entire DataFrame at once like this is rarely needed in practice.

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,64,2,4,146,234,2,1,151,1,3.3,1,1,2,2
1,38,2,3,131,251,1,2,188,1,4.5,1,1,3,2
2,42,1,2,131,205,1,1,173,1,2.4,3,1,3,2
3,57,2,2,121,237,1,2,179,1,1.8,3,1,3,2
4,58,1,1,121,355,1,2,164,2,1.6,3,1,3,2
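Because newer pandas releases (2.1 and later) provide `DataFrame.map` as the replacement for `DataFrame.applymap`, a notebook that has to run on both old and new versions can pick whichever exists. The sketch below assumes a hypothetical toy frame and reuses the `successor` function from above; it also shows that for simple arithmetic the vectorized expression `toy + 1` gives the same values.

```python
import pandas as pd


def successor(x):
    return x + 1


# Hypothetical toy data
toy = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# Prefer DataFrame.map when it exists; fall back to applymap on older pandas
if hasattr(pd.DataFrame, "map"):
    bumped = toy.map(successor)
else:
    bumped = toy.applymap(successor)

# For plain arithmetic, the vectorized form is simpler and faster
vectorized = toy + 1

print(bumped)
```

Elementwise mapping is mainly worth reaching for when the function cannot be expressed as a vectorized operation.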


The `.map()` method takes a function as input that it will then apply to every entry in the Series.

Both `.map()` and `.apply()` take a function and apply it to each entry of the chosen Series.

In [86]:
uci['age'].map(successor).head(10)

0    64
1    38
2    42
3    57
4    58
5    58
6    57
7    45
8    53
9    58
Name: age, dtype: int64

In [87]:
uci.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [None]:
# Signature reminder: uci.age.map(arg), where arg is a function, a collections.abc.Mapping subclass (such as a dict), or a Series

In [88]:
uci.sex.map({0: 'male', 1: 'female'}).head()  # map also accepts a dictionary; the dataset's Kaggle description lists 1 = male and 0 = female, so double-check these labels

0    female
1    female
2      male
3    female
4      male
Name: sex, dtype: object
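One detail worth knowing about dictionary mapping: values with no matching key become `NaN` rather than raising an error. The sketch below demonstrates this on a hypothetical Series of codes; the labels `"low"` and `"high"` are invented for the example.

```python
import pandas as pd

# Hypothetical codes; note that 2 has no entry in the dictionary below
codes = pd.Series([0, 1, 2, 1])

labels = codes.map({0: "low", 1: "high"})

print(labels)
print(labels.isna().sum())  # how many codes were left unmapped
```

Checking `isna()` after a dictionary map is a cheap way to catch codes the dictionary forgot.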

In [89]:
uci.sex

0      1
1      1
2      0
3      1
4      0
      ..
298    0
299    1
300    1
301    1
302    0
Name: sex, Length: 303, dtype: int64

In [91]:
def decode_sex(x):
    if x == 1:
        return 'female'
    return 'male'

In [92]:
uci['sex'].map(decode_sex).head()

0    female
1    female
2      male
3    female
4      male
Name: sex, dtype: object

In [94]:
# map returns a new Series; to keep the result, assign it back to the same column or to a new column
uci['sex_name'] = uci['sex'].map(decode_sex)
uci

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,sex_name
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,female
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,female
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,male
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,female
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,male
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,male
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0,female
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0,female
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0,female


#### Anonymous Functions (Lambda Abstraction)

Simple functions can be defined right in the function call. This is called 'lambda abstraction'; the function thus defined has no name and hence is "anonymous".
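As a quick illustration, the sketch below writes the same transformation twice, once with a named function and once with a lambda. The function `add_ten` and the Series `s` are hypothetical.

```python
import pandas as pd


def add_ten(x):
    return x + 10


# Hypothetical toy data
s = pd.Series([1, 2, 3])

with_def = s.map(add_ten)              # named function passed by name
with_lambda = s.map(lambda x: x + 10)  # same logic, defined inline

print(with_lambda)
```

Both calls produce the same Series; the lambda simply saves a separate definition when the logic fits in one expression.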

In [97]:
uci.oldpeak

0      2.3
1      3.5
2      1.4
3      0.8
4      0.6
      ... 
298    0.2
299    1.2
300    3.4
301    1.2
302    0.0
Name: oldpeak, Length: 303, dtype: float64

In [96]:
def new_round_func(x):
    return round(x)

# this does the same thing as the lambda version below

In [95]:
uci['oldpeak'].map(lambda x: round(x))[:4]  # a lambda defines the function inline, in one line, without a separate def; x is its argument

0    2
1    4
2    1
3    1
Name: oldpeak, dtype: int64

In [98]:
uci['oldpeak'].map(lambda x: f'my new peak is {x**2: .2f}!!')   # 2 decimal places

0       my new peak is  5.29!!
1      my new peak is  12.25!!
2       my new peak is  1.96!!
3       my new peak is  0.64!!
4       my new peak is  0.36!!
                ...           
298     my new peak is  0.04!!
299     my new peak is  1.44!!
300    my new peak is  11.56!!
301     my new peak is  1.44!!
302     my new peak is  0.00!!
Name: oldpeak, Length: 303, dtype: object

In [101]:
uci['newpeak'] = uci['oldpeak'].map(lambda x: f'my new peak is {x**2: .2f}!!') 
uci

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,sex_name,newpeak
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,female,my new peak is 5.29!!
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,female,my new peak is 12.25!!
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,male,my new peak is 1.96!!
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,female,my new peak is 0.64!!
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,male,my new peak is 0.36!!
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0,male,my new peak is 0.04!!
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0,female,my new peak is 1.44!!
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0,female,my new peak is 11.56!!
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0,female,my new peak is 1.44!!


Exercise: Use an anonymous function to turn the entries in age to strings

In [102]:
uci.age.map(lambda y: str(y))  # datatype is now object

0      63
1      37
2      41
3      56
4      57
       ..
298    57
299    45
300    68
301    57
302    57
Name: age, Length: 303, dtype: object

In [105]:
uci.age.map(lambda y: f'age: {y}')  # lambdas suit one-line expressions; more complicated logic belongs in a separate, named multi-line function

0      age: 63
1      age: 37
2      age: 41
3      age: 56
4      age: 57
        ...   
298    age: 57
299    age: 45
300    age: 68
301    age: 57
302    age: 57
Name: age, Length: 303, dtype: object

In [106]:
squared = lambda x: x**2

In [107]:
squared(2)   # works, but PEP 8 recommends a def statement over assigning a lambda to a name

4

### Apply to Animal Shelter Data

Use an `apply` to change the dates from strings to datetime objects. Similarly, use an apply to change the ages of the animals from strings to floats.

In [108]:
# Your code here
animal_outcomes

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black
...,...,...,...,...,...,...,...,...,...,...,...,...
117995,A818200,*Baby Spice,06/07/2020 04:34:00 PM,06/07/2020 04:34:00 PM,04/03/2020,Adoption,,Cat,Spayed Female,2 months,Domestic Shorthair,Black/White
117996,A800898,Chile Pepper,07/28/2019 05:48:00 PM,07/28/2019 05:48:00 PM,07/28/2006,Return to Owner,,Dog,Intact Female,13 years,Chihuahua Shorthair,White/Tan
117997,A817995,*Topeka,06/07/2020 06:01:00 PM,06/07/2020 06:01:00 PM,05/30/2019,Adoption,,Dog,Spayed Female,1 year,Labrador Retriever Mix,Brown/White
117998,A818380,,06/07/2020 07:24:00 PM,06/07/2020 07:24:00 PM,05/23/2020,Euthanasia,Rabies Risk,Other,Unknown,2 weeks,Bat,Brown


In [109]:
animal_outcomes.info()  # the date columns are stored as strings (dtype object)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118000 entries, 0 to 117999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Animal ID         118000 non-null  object
 1   Name              81031 non-null   object
 2   DateTime          118000 non-null  object
 3   MonthYear         118000 non-null  object
 4   Date of Birth     118000 non-null  object
 5   Outcome Type      117992 non-null  object
 6   Outcome Subtype   53642 non-null   object
 7   Animal Type       118000 non-null  object
 8   Sex upon Outcome  117996 non-null  object
 9   Age upon Outcome  117948 non-null  object
 10  Breed             118000 non-null  object
 11  Color             118000 non-null  object
dtypes: object(12)
memory usage: 10.8+ MB


In [111]:
animal_outcomes['DateTime'] = pd.to_datetime(animal_outcomes.DateTime)
animal_outcomes['MonthYear'] = pd.to_datetime(animal_outcomes.MonthYear)

Avoid explicit `for` loops over rows in pandas; they are slow. Vectorized functions such as `pd.to_datetime` convert a whole column at once.
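When the date strings all share one layout, passing an explicit `format` to `pd.to_datetime` documents the expected layout and spares pandas from having to infer it. The sketch below assumes the `MM/DD/YYYY HH:MM:SS AM/PM` layout seen in the outputs above; the Series `raw` is a hypothetical two-row sample.

```python
import pandas as pd

# Hypothetical sample in the same layout as the DateTime column
raw = pd.Series(["05/08/2019 06:20:00 PM", "03/18/2014 11:47:00 AM"])

# %I is the 12-hour clock and %p the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

print(parsed)
print(parsed.dt.year)  # the .dt accessor exposes date parts once parsing is done
```

If some rows might not match the format, `errors="coerce"` turns those into `NaT` instead of raising.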

In [114]:
animal_outcomes.head(10)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,2019-05-08 18:20:00,2019-05-08 18:20:00,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,2018-07-18 16:02:00,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,2016-02-13 17:59:00,2016-02-13 17:59:00,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,2014-03-18 11:47:00,2014-03-18 11:47:00,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,2014-10-18 18:52:00,2014-10-18 18:52:00,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black
5,A680969,*Zeus,2014-08-05 16:59:00,2014-08-05 16:59:00,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby
6,A684617,,2014-07-27 09:00:00,2014-07-27 09:00:00,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix,Black
7,A742354,Artemis,2017-01-22 11:56:00,2017-01-22 11:56:00,01/20/2010,Return to Owner,,Cat,Neutered Male,7 years,Domestic Shorthair Mix,Blue/White
8,A681036,,2014-06-11 17:11:00,2014-06-11 17:11:00,06/09/2014,Transfer,Partner,Cat,Intact Male,2 days,Domestic Shorthair Mix,Brown Tabby
9,A803149,*Birch,2019-08-31 16:26:00,2019-08-31 16:26:00,08/08/2019,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair,Brown Tabby


In [112]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118000 entries, 0 to 117999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         118000 non-null  object        
 1   Name              81031 non-null   object        
 2   DateTime          118000 non-null  datetime64[ns]
 3   MonthYear         118000 non-null  datetime64[ns]
 4   Date of Birth     118000 non-null  object        
 5   Outcome Type      117992 non-null  object        
 6   Outcome Subtype   53642 non-null   object        
 7   Animal Type       118000 non-null  object        
 8   Sex upon Outcome  117996 non-null  object        
 9   Age upon Outcome  117948 non-null  object        
 10  Breed             118000 non-null  object        
 11  Color             118000 non-null  object        
dtypes: datetime64[ns](2), object(10)
memory usage: 10.8+ MB


In [117]:
#animal age.

animal_outcomes['Date of Birth'] = pd.to_datetime(animal_outcomes['Date of Birth'])  # bracket notation is required because the column name contains spaces

In [118]:
animal_outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118000 entries, 0 to 117999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         118000 non-null  object        
 1   Name              81031 non-null   object        
 2   DateTime          118000 non-null  datetime64[ns]
 3   MonthYear         118000 non-null  datetime64[ns]
 4   Date of Birth     118000 non-null  datetime64[ns]
 5   Outcome Type      117992 non-null  object        
 6   Outcome Subtype   53642 non-null   object        
 7   Animal Type       118000 non-null  object        
 8   Sex upon Outcome  117996 non-null  object        
 9   Age upon Outcome  117948 non-null  object        
 10  Breed             118000 non-null  object        
 11  Color             118000 non-null  object        
dtypes: datetime64[ns](3), object(9)
memory usage: 10.8+ MB


In [119]:
import datetime

In [125]:
diff = datetime.date.today() - datetime.date(2010, 5, 15)  # getting time deltas

In [129]:
diff.days /365

10.073972602739726
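The same subtraction works on an entire datetime column at once, without a per-row function. The sketch below is one possible approach under stated assumptions: `births` is a hypothetical two-row Series, and `as_of` is a fixed, made-up reference date used instead of today's date so that the result is reproducible.

```python
import pandas as pd

# Hypothetical birth dates
births = pd.to_datetime(pd.Series(["2017-05-02", "2015-10-08"]))

# Fixed reference date (an assumption for reproducibility)
as_of = pd.Timestamp("2020-06-08")

age_days = (as_of - births).dt.days  # timedelta Series -> whole days
age_years = age_days / 365           # same rough 365-day year as above

print(age_days)
print(age_years.round(1))
```

Dividing by 365 ignores leap years, which is usually acceptable for a rough age in years.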

In [140]:
def calculate_age(val):
    # now() must be called; without the parentheses, datetime.datetime.now is
    # the function object itself and cannot be subtracted from a date
    return round((datetime.datetime.now() - val).days / 365)


In [141]:
calculate_age(datetime.datetime(1776, 7 , 4))

In [138]:
animal_outcomes['Date of Birth'].map(calculate_age)

1. Clean up age upon outcome into a common unit
1. Make all column titles lower case and replace spaces with underscores
1. Remove null values from certain columns
  - outcome type
  - age upon outcome
  - sex upon outcome
1. Verify that DateTime == MonthYear
  - if true, drop MonthYear
 


In [142]:
# make all column titles lower case, and remove spaces

animal_outcomes.columns

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')

In [143]:
[x for x in animal_outcomes.columns]  # the column names as a plain list, unchanged

['Animal ID',
 'Name',
 'DateTime',
 'MonthYear',
 'Date of Birth',
 'Outcome Type',
 'Outcome Subtype',
 'Animal Type',
 'Sex upon Outcome',
 'Age upon Outcome',
 'Breed',
 'Color']

In [144]:
[x.lower() for x in animal_outcomes.columns]

['animal id',
 'name',
 'datetime',
 'monthyear',
 'date of birth',
 'outcome type',
 'outcome subtype',
 'animal type',
 'sex upon outcome',
 'age upon outcome',
 'breed',
 'color']

In [145]:
[x.lower().replace(' ', '_') for x in animal_outcomes.columns]

['animal_id',
 'name',
 'datetime',
 'monthyear',
 'date_of_birth',
 'outcome_type',
 'outcome_subtype',
 'animal_type',
 'sex_upon_outcome',
 'age_upon_outcome',
 'breed',
 'color']

In [146]:
[x.replace(' ', '_').lower() for x in animal_outcomes.columns] #same as above

['animal_id',
 'name',
 'datetime',
 'monthyear',
 'date_of_birth',
 'outcome_type',
 'outcome_subtype',
 'animal_type',
 'sex_upon_outcome',
 'age_upon_outcome',
 'breed',
 'color']

In [148]:
animal_outcomes.columns = [x.replace(' ', '_').lower() for x in animal_outcomes.columns]
animal_outcomes.columns

Index(['animal_id', 'name', 'datetime', 'monthyear', 'date_of_birth',
       'outcome_type', 'outcome_subtype', 'animal_type', 'sex_upon_outcome',
       'age_upon_outcome', 'breed', 'color'],
      dtype='object')
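The list comprehension can also be written with the `.str` accessor that pandas provides on an Index. This is a sketch of an equivalent alternative on a hypothetical empty frame named `toy`, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical frame with only column labels, no rows
toy = pd.DataFrame(columns=["Animal ID", "Date of Birth", "Breed"])

# .str methods work on the column Index just as they do on a Series of strings
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

print(list(toy.columns))
```

Passing `regex=False` makes it explicit that the space is a literal character, not a pattern.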

In [None]:
# remove null values from certain columns


In [149]:
animal_outcomes.dropna   # without parentheses this only displays the bound method; dropna() would drop every row that has a NaN

<bound method DataFrame.dropna of        animal_id          name            datetime           monthyear  \
0        A794011         Chunk 2019-05-08 18:20:00 2019-05-08 18:20:00   
1        A776359         Gizmo 2018-07-18 16:02:00 2018-07-18 16:02:00   
2        A720371         Moose 2016-02-13 17:59:00 2016-02-13 17:59:00   
3        A674754           NaN 2014-03-18 11:47:00 2014-03-18 11:47:00   
4        A689724    *Donatello 2014-10-18 18:52:00 2014-10-18 18:52:00   
...          ...           ...                 ...                 ...   
117995   A818200   *Baby Spice 2020-06-07 16:34:00 2020-06-07 16:34:00   
117996   A800898  Chile Pepper 2019-07-28 17:48:00 2019-07-28 17:48:00   
117997   A817995       *Topeka 2020-06-07 18:01:00 2020-06-07 18:01:00   
117998   A818380           NaN 2020-06-07 19:24:00 2020-06-07 19:24:00   
117999   A818054           NaN 2020-06-07 19:17:00 2020-06-07 19:17:00   

       date_of_birth     outcome_type outcome_subtype animal_type  \
0       

In [152]:
animal_outcomes = animal_outcomes.dropna(subset=['outcome_type', 'sex_upon_outcome', 'age_upon_outcome'])
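To see why `subset` matters, compare it with a plain `dropna()` on a small example. The frame `toy` below is hypothetical: one row is missing a name, another is missing an outcome type.

```python
import pandas as pd

# Hypothetical toy data with missing values in different columns
toy = pd.DataFrame({
    "name": ["Chunk", None, "Moose"],
    "outcome_type": ["Adoption", "Transfer", None],
})

all_dropped = toy.dropna()                            # drops any row with a NaN anywhere
subset_dropped = toy.dropna(subset=["outcome_type"])  # only checks the listed columns

print(len(all_dropped), len(subset_dropped))
```

With `subset`, unnamed animals stay in the data as long as the columns that matter for the analysis are filled in.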

In [154]:
#verify datetime == monthyear

(animal_outcomes.datetime == animal_outcomes.monthyear).sum()

117942

In [155]:
animal_outcomes.shape

(117942, 12)

In [158]:
animal_outcomes = animal_outcomes.drop(columns = 'monthyear')

In [161]:
#clean up age upon outcome into a common unit

animal_outcomes.age_upon_outcome.str.split(' ')   #split function!

0          [2, years]
1           [1, year]
2         [4, months]
3           [6, days]
4         [2, months]
             ...     
117995    [2, months]
117996    [13, years]
117997      [1, year]
117998     [2, weeks]
117999      [1, year]
Name: age_upon_outcome, Length: 117942, dtype: object

Now you have a Series of lists, each made up of a number and a unit.
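If separate columns are easier to work with than lists, `str.split` accepts `expand=True`. A minimal sketch on a hypothetical Series of age strings (the column names `number` and `unit` are our own choice):

```python
import pandas as pd

# Hypothetical age strings in the same "<number> <unit>" format
ages = pd.Series(['2 years', '1 year', '4 months', '6 days'])

# expand=True returns a DataFrame with one column per piece
parts = ages.str.split(' ', expand=True)
parts.columns = ['number', 'unit']
parts['number'] = parts['number'].astype(int)
print(parts)
```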

In [170]:
animal_outcomes.age_upon_outcome.value_counts()

1 year       21233
2 years      17705
2 months     14128
3 years       7219
3 months      5491
1 month       5149
4 years       4260
5 years       3944
4 months      3747
5 months      2901
6 months      2828
6 years       2617
8 years       2259
7 years       2237
3 weeks       2053
2 weeks       1956
8 months      1874
10 years      1790
4 weeks       1742
10 months     1704
7 months      1514
9 years       1231
9 months      1226
12 years       879
1 weeks        794
11 months      747
11 years       703
1 week         638
13 years       563
14 years       377
3 days         337
2 days         334
15 years       319
1 day          242
6 days         232
4 days         211
0 years        177
5 days         154
16 years       135
5 weeks        112
17 years        78
18 years        48
19 years        25
20 years        17
-1 years         4
22 years         4
-3 years         1
21 years         1
24 years         1
25 years         1
Name: age_upon_outcome, dtype: int64

age_list = animal_outcomes.age_upon_outcome.str.split(' ')  # Series of [number, unit] lists
age_list

In [171]:
def convert_to_days_old(val):
    """Convert an age string such as '2 years' into an approximate number of days."""
    number, unit = val.split(' ')  # assign the split number and unit to variables
    number = int(number)
    if 'year' in unit:
        return 365 * number
    if 'month' in unit:
        return 30 * number  # approximation: treats every month as 30 days
    if 'week' in unit:
        return 7 * number
    if 'day' in unit:
        return number

    return 'unknown'
    

In [172]:
animal_outcomes['days_upon_outcome'] = animal_outcomes.age_upon_outcome.map(convert_to_days_old)
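As an alternative to mapping a Python function over every row, the conversion can be done with vectorized string methods and a lookup dictionary. This is only a sketch on hypothetical age strings, using the same approximations as above (365 days per year, 30 days per month); the names `days_per_unit` and `days_old` are our own:

```python
import pandas as pd

# Hypothetical age strings in the same "<number> <unit>" format
ages = pd.Series(['2 years', '1 year', '4 months', '3 weeks', '6 days'])

parts = ages.str.split(' ', expand=True)
number = parts[0].astype(int)
unit = parts[1].str.rstrip('s')   # 'years' becomes 'year', 'days' becomes 'day'

# Approximate number of days in each unit
days_per_unit = {'year': 365, 'month': 30, 'week': 7, 'day': 1}
days_old = number * unit.map(days_per_unit)
print(days_old.tolist())
```

A unit that is missing from the dictionary maps to `NaN`, which makes unexpected values easy to find with `isna()`.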

## 3. Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
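A minimal sketch of the split-then-aggregate idea on a hypothetical four-row frame (the values are made up; only the column names echo the heart dataset):

```python
import pandas as pd

# Hypothetical data: two rows per value of 'sex'
toy = pd.DataFrame({
    'sex': [0, 1, 1, 0],
    'chol': [200, 240, 260, 220],
})

# Split the rows by 'sex', then average 'chol' within each group
mean_chol = toy.groupby('sex')['chol'].mean()
print(mean_chol)
```

The result is a Series indexed by the group labels, one aggregated value per group.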

In [None]:
uci.groupby('sex')

#### `.groups` and `.get_group()`

In [None]:
uci.groupby('sex').groups

In [None]:
uci.groupby('sex').get_group(0) # .tail()
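To make these two attributes concrete, here is a sketch on a hypothetical four-row frame: `.groups` maps each group label to the row labels that belong to it, and `.get_group()` returns the rows of one group as a DataFrame.

```python
import pandas as pd

# Hypothetical data with two groups
toy = pd.DataFrame({'sex': [0, 1, 1, 0], 'age': [63, 37, 41, 56]})
grouped = toy.groupby('sex')

# Convert the index objects to plain lists for easier reading
group_labels = {key: list(idx) for key, idx in grouped.groups.items()}
group_zero = grouped.get_group(0)

print(group_labels)
print(group_zero)
```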

### Aggregating

In [None]:
uci.groupby('sex').std()

Exercise: Tell me the average cholesterol level for those with heart disease.

In [None]:
# Your code here!


### Apply to Animal Shelter Data

#### Task 1:
- Use a `groupby` to show the average age for each animal type.
- What about by animal type **and** sex?
 

#### Task 2:
- Create new columns `year` and `month` by applying lambda functions (for example, `lambda x: x.year`) to the `datetime` column
- Use `groupby` and `.size()` to count how many animals are adopted each month

In [None]:
# Your code here
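If the pattern for Task 2 is unclear, the sketch below shows it on a hypothetical three-row frame rather than the shelter data. It assumes the date column already has a datetime dtype, so each element is a Timestamp with `.year` and `.month` attributes:

```python
import pandas as pd

# Hypothetical outcomes; the real column would first need pd.to_datetime
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2019-05-08', '2019-05-20', '2018-07-18']),
    'outcome_type': ['Adoption', 'Adoption', 'Transfer'],
})

toy['year'] = toy['datetime'].map(lambda x: x.year)
toy['month'] = toy['datetime'].map(lambda x: x.month)

# Keep only adoptions, then count rows per month
adoptions = toy[toy['outcome_type'] == 'Adoption']
per_month = adoptions.groupby('month').size()
print(per_month)
```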

## 4. Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

In [None]:
uci.pivot(values='sex', columns='target').head()
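Pivoting is often easier to follow when `index`, `columns`, and `values` are all given explicitly. A sketch on hypothetical long-format data (the column names and counts are made up):

```python
import pandas as pd

# Hypothetical long-format data: one row per (animal, year) pair
long_df = pd.DataFrame({
    'animal': ['cat', 'cat', 'dog', 'dog'],
    'year': [2019, 2020, 2019, 2020],
    'intakes': [10, 12, 20, 18],
})

# One row per animal, one column per year, cells filled from 'intakes'
wide_df = long_df.pivot(index='animal', columns='year', values='intakes')
print(wide_df)
```

Note that `.pivot()` raises an error if an index/column pair appears more than once; `pivot_table` with an aggregation function handles that case.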

### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

### `.join()`

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns = ['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns = ['age', 'HP'])

In [None]:
toy1.join(toy2.set_index('age'),
          on = 'age',
          lsuffix = '_A',
          rsuffix = '_B').head()

### `.merge()`

In [None]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col = 0)

In [None]:
states = pd.read_csv('data/states.csv', index_col = 0)

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on = 'state',
               how = 'inner')
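Since the contents of the two CSV files are not shown here, the sketch below uses two hypothetical frames (`chars` and `states_toy`, with made-up values) to show how `left_on`, `right_on`, and `how` interact:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables being merged
chars = pd.DataFrame({'name': ['Greg', 'Miles'], 'home_state': ['WA', 'TX']})
states_toy = pd.DataFrame({'state': ['WA', 'OR'],
                           'nickname': ['Evergreen State', 'Beaver State']})

# inner: keep only rows whose key appears in both tables
inner = chars.merge(states_toy, left_on='home_state', right_on='state', how='inner')

# left: keep every row of chars, filling NaN where no match exists
left = chars.merge(states_toy, left_on='home_state', right_on='state', how='left')

print(inner)
print(left)
```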

### `pd.concat()`

Exercise: Look up the documentation on pd.concat (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate ds_chars and states.
<br/>
Your result should still have only five rows!

In [None]:
pd.concat([ds_chars, states], sort=False)
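The number of rows in the result depends on the `axis` argument. A sketch with two hypothetical two-row frames shows the difference between stacking rows (the default, `axis=0`) and placing frames side by side (`axis=1`):

```python
import pandas as pd

# Hypothetical frames with different columns but the same index
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

stacked = pd.concat([a, b], sort=False)    # rows of b go under rows of a
side_by_side = pd.concat([a, b], axis=1)   # columns of b go next to columns of a

print(stacked.shape, side_by_side.shape)
```

With `axis=0`, columns missing from one frame are filled with `NaN`; with `axis=1`, rows are aligned on the index.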

### `pd.melt()`

Melting removes the structure from your DataFrame and puts the data in a 'variable' and 'value' format.

In [None]:
ds_chars.head()

In [None]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])
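Because `ds_chars` is loaded from a file not shown here, the following sketch repeats the same call on a hypothetical frame with made-up values so the shape of the result is visible:

```python
import pandas as pd

# Hypothetical wide-format frame with the same column names
wide = pd.DataFrame({'name': ['Greg', 'Miles'],
                     'HP': [200, 150],
                     'home_state': ['WA', 'TX']})

# Each (name, column) pair becomes its own row
melted = pd.melt(wide, id_vars=['name'], value_vars=['HP', 'home_state'])
print(melted)
```

Two rows and two value columns produce four rows, each holding one name, one variable label, and one value.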

## Bringing it all together with the Animal Shelter Data

Join the data from the [Austin Animal Shelter Intake dataset](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) to the outcomes dataset by Animal ID.

Use the dates from each dataset to see how long animals spend in the shelter. Does it differ by time of year? By outcome?

_Hints_ :
- import and clean the intake dataset first
- use `apply`/`applymap`/`lambda` to change the variables to their proper format in the intake data
- rename the columns in the intake dataset *before* joining
- create a new `days_in_shelter` column
- notice that some values in `days_in_shelter` are `NaN` or negative (remove these rows using the `<` operator and `isna()` or `dropna()`)
- Use `groupby` to get aggregate information about the dataset (your choice)
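The hints above can be sketched end to end on hypothetical data. The frames, column names (`intake_datetime`, `outcome_datetime`), and values below are assumptions for illustration, and the sketch assumes each animal appears only once in each table, which may not hold in the real data:

```python
import pandas as pd

# Hypothetical, already-cleaned intake and outcome tables
intakes = pd.DataFrame({
    'animal_id': ['A1', 'A2', 'A3'],
    'intake_datetime': pd.to_datetime(['2019-05-01', '2019-06-10', '2019-07-04']),
})
outcomes = pd.DataFrame({
    'animal_id': ['A1', 'A2'],
    'outcome_datetime': pd.to_datetime(['2019-05-08', '2019-06-05']),
    'outcome_type': ['Adoption', 'Transfer'],
})

# Join on the shared ID, then subtract the dates to get whole days
merged = intakes.merge(outcomes, on='animal_id', how='left')
merged['days_in_shelter'] = (merged['outcome_datetime']
                             - merged['intake_datetime']).dt.days

# Drop rows with no outcome (NaN) and rows with a negative stay
clean = merged.dropna(subset=['days_in_shelter'])
clean = clean[clean['days_in_shelter'] >= 0]

mean_stay = clean.groupby('outcome_type')['days_in_shelter'].mean()
print(mean_stay)
```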

To save your dataset:
Use `df.to_csv()` or `df.to_excel()` to write `df` to a CSV or Excel file. Read more in the `to_csv()` documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html).
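A quick way to check what `to_csv()` will write is to call it without a path, in which case it returns the CSV text as a string. A sketch on a hypothetical frame (in practice you would pass a file name of your choosing):

```python
import io
import pandas as pd

# Hypothetical result frame
df = pd.DataFrame({'animal_id': ['A1', 'A2'], 'days_in_shelter': [7, 12]})

# index=False leaves the row index out of the file
csv_text = df.to_csv(index=False)

# Read the text back to confirm nothing was lost
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```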

In [None]:
#code here