# Importing, Reading and Manipulating Data with ACTUAL LITERAL PANDAS

![I have no idea what I'm doing panda](https://cdn-images-1.medium.com/max/1600/1*oBx032ncOwLmCFX3Epo3Zg.jpeg)

Just kidding - but Pandas is a great library for working with tabular data (rows and columns, like a spreadsheet or a database table).

[Check out the documentation!](https://pandas.pydata.org/pandas-docs/stable/) (always a great idea)

Note! Here's something cool - Pandas is built on top of Numpy! That means they work really well together, and that Pandas has some math functionality already built in.

If you'd like to read more about Numpy and Pandas, [here is an interesting blog post](https://cloudxlab.com/blog/numpy-pandas-introduction/) discussing them.

Let's dive into some data from the Austin Animal Shelter. 

Data source: [intakes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and [outcomes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

Today we'll be working with the intakes data, which I've already downloaded and included in the repository.

In [1]:
# Import
import pandas as pd

Before reading in the data, we need to know what format the data is in and where exactly the data can be found, so we can tell Pandas what to do.

In [2]:
# Where is our data?
!ls

PandasPractice_solution.ipynb  data


In [3]:
!ls data/

Austin_Animal_Center_Intakes_080421.csv
Austin_Animal_Center_Outcomes_080421.csv


In [4]:
# Read in the comma-separated-value (csv) document as df
df = pd.read_csv('data/Austin_Animal_Center_Intakes_080421.csv',
                 parse_dates=['DateTime'], infer_datetime_format=True)

What options do we have when we read in a csv? Let's look at the documentation!

[Convenient link](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

I happen to know that there is a column in the data named 'DateTime' - the cell above uses the `parse_dates` argument to read it in as a datetime column. Let's look at the result, then discuss.

In [5]:
df['DateTime']

0        2019-01-03 16:19:00
1        2015-07-05 12:59:00
2        2016-04-14 18:43:00
3        2013-10-21 07:59:00
4        2014-06-29 10:38:00
                 ...        
129510   2021-08-03 22:55:00
129511   2021-08-04 07:17:00
129512   2021-08-04 07:23:00
129513   2021-06-05 23:01:00
129514   2021-08-02 15:23:00
Name: DateTime, Length: 129515, dtype: datetime64[ns]
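
To see what `parse_dates` changes, here is a minimal sketch on a tiny made-up CSV (the two rows and the `csv_text` variable are assumptions for illustration, not the real file). It leaves out `infer_datetime_format`, which is deprecated as of pandas 2.0.

```python
import io
import pandas as pd

# Hypothetical two-row CSV that mimics the 'DateTime' column of the intakes file
csv_text = """Animal ID,DateTime
A786884,01/03/2019 04:19:00 PM
A706918,07/05/2015 12:59:00 PM
"""

# Without parse_dates, the column is read in as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to a datetime dtype
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['DateTime'])

print(raw['DateTime'].dtype)
print(parsed['DateTime'].dtype)
```

Only the parsed version supports datetime operations such as pulling out the year, which is why it's worth doing at read time.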

### Initial Exploration of a Dataframe

Questions to ask yourself:

- How big is the data?
- Are there any empty cells? 
- What are the datatypes of the columns of data?

In [6]:
# What does this dataframe look like?
# Check out the first 5 rows
df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [7]:
# Check out the shape of the df
df.shape

(129515, 12)

In [8]:
# And then the size
df.size

1554180

In [9]:
# And then look at some info on the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129515 entries, 0 to 129514
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         129515 non-null  object        
 1   Name              89389 non-null   object        
 2   DateTime          129515 non-null  datetime64[ns]
 3   MonthYear         129515 non-null  object        
 4   Found Location    129515 non-null  object        
 5   Intake Type       129515 non-null  object        
 6   Intake Condition  129515 non-null  object        
 7   Animal Type       129515 non-null  object        
 8   Sex upon Intake   129514 non-null  object        
 9   Age upon Intake   129515 non-null  object        
 10  Breed             129515 non-null  object        
 11  Color             129515 non-null  object        
dtypes: datetime64[ns](1), object(11)
memory usage: 11.9+ MB


In [10]:
# Describe the columns
df.describe()


Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
count,129515,89389,129515,129515,129515,129515,129515,129515,129514,129515,129515,129515
unique,115786,21080,91255,91255,55429,6,11,5,5,52,2663,603
top,A721033,Max,2016-09-23 12:00:00,09/23/2016 12:00:00 PM,Austin (TX),Stray,Normal,Dog,Intact Male,1 year,Domestic Shorthair Mix,Black/White
freq,33,580,64,64,23974,89880,111865,73042,42281,22444,31375,13571
first,,,2013-10-01 07:51:00,,,,,,,,,
last,,,2021-08-04 07:23:00,,,,,,,,,


**A note on `.describe()`:** this method behaves differently depending on whether we feed in object (text) columns or numeric columns. We can explore this more later.
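
As a rough illustration of that difference, the sketch below calls `.describe()` on one numeric and one text column of a small made-up dataframe (`toy` and its values are assumptions, not shelter data).

```python
import pandas as pd

# Hypothetical toy data: one text column, one numeric column
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Bird'],
    'intake_year': [2019, 2015, 2016, 2019],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = toy['intake_year'].describe()

# Text column: count, unique, top (most common value), freq (how often it appears)
object_summary = toy['animal_type'].describe()

print(numeric_summary.index.tolist())
print(object_summary.index.tolist())
```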

**And a question:** You see that some of the ways we dealt with our dataframe required `()` and some did not - why is that?

- Methods vs. attributes
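
A small sketch of the distinction, using a made-up dataframe (`toy` is an assumption for illustration): attributes are looked up like values, while methods are functions that need `()` to run.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'name': ['Belle', 'Rio', 'Odin'], 'age_years': [8, 4, 2]})

# Attributes: no parentheses
print(toy.shape)  # (number of rows, number of columns)
print(toy.size)   # rows * columns

# Methods: parentheses call them, and they can take arguments
print(toy.head(2))

# Without parentheses you get the method object itself, not its result
print(callable(toy.head))   # a method is callable
print(callable(toy.shape))  # a tuple is not
```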


### Accessing Columns

Use brackets and the exact column name to access a particular column.

In [11]:
df['Name']

0          *Brock
1           Belle
2         Runster
3             NaN
4             Rio
           ...   
129510        NaN
129511        NaN
129512        NaN
129513    *Woogie
129514    *Woogie
Name: Name, Length: 129515, dtype: object

### Dealing with Datetime Objects

You can access parts of a datetime object using `.dt` - an attribute of the column, not a method!

In [12]:
df['DateTime']

0        2019-01-03 16:19:00
1        2015-07-05 12:59:00
2        2016-04-14 18:43:00
3        2013-10-21 07:59:00
4        2014-06-29 10:38:00
                 ...        
129510   2021-08-03 22:55:00
129511   2021-08-04 07:17:00
129512   2021-08-04 07:23:00
129513   2021-06-05 23:01:00
129514   2021-08-02 15:23:00
Name: DateTime, Length: 129515, dtype: datetime64[ns]

In [13]:
# Let's check out the intake year
df['DateTime'].dt.year

0         2019
1         2015
2         2016
3         2013
4         2014
          ... 
129510    2021
129511    2021
129512    2021
129513    2021
129514    2021
Name: DateTime, Length: 129515, dtype: int64

In [14]:
# How do we create a new column?
# Let's create a new column for intake year
df['Intake Year'] = df['DateTime'].dt.year

In [15]:
# Check our work
df.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Intake Year
0,A786884,*Brock,2019-01-03 16:19:00,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014


In [16]:
# What datatype is the data in our new column?
df['Intake Year'].dtype

dtype('int64')
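
The year is only one of the pieces `.dt` exposes. Here is a short sketch on two made-up timestamps (the `intake_times` series is an assumption for illustration). Note that some pieces are attributes and some, like `day_name()`, are methods.

```python
import pandas as pd

# Hypothetical series of two intake timestamps
intake_times = pd.Series(pd.to_datetime(['2019-01-03 16:19:00',
                                         '2015-07-05 12:59:00']))

years = intake_times.dt.year           # attribute
months = intake_times.dt.month         # attribute
hours = intake_times.dt.hour           # attribute
weekdays = intake_times.dt.day_name()  # method, so it needs ()

print(pd.DataFrame({'year': years, 'month': months,
                    'hour': hours, 'weekday': weekdays}))
```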

### Checking for Null Values

You can use `.isna()` or `.isnull()` - they are aliases and return the same thing!

In [17]:
# Check it - is the result what you expect?
df.isna()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Intake Year
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
129510,False,True,False,False,False,False,False,False,False,False,False,False,False
129511,False,True,False,False,False,False,False,False,False,False,False,False,False
129512,False,True,False,False,False,False,False,False,False,False,False,False,False
129513,False,False,False,False,False,False,False,False,False,False,False,False,False


In [18]:
# How can you make that result more usable?
df.isna().sum()

Animal ID               0
Name                40126
DateTime                0
MonthYear               0
Found Location          0
Intake Type             0
Intake Condition        0
Animal Type             0
Sex upon Intake         1
Age upon Intake         0
Breed                   0
Color                   0
Intake Year             0
dtype: int64
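
The same trick works with `.mean()` to get the share of missing values per column. A minimal sketch on made-up data (`toy` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing names
toy = pd.DataFrame({
    'name': ['Belle', np.nan, 'Rio', np.nan],
    'animal_type': ['Dog', 'Cat', 'Dog', 'Cat'],
})

# .isna() and .isnull() give identical True/False tables
same = toy.isna().equals(toy.isnull())

# True counts as 1, so sum() counts missing cells per column...
missing_counts = toy.isna().sum()

# ...and mean() gives the fraction missing per column
missing_share = toy.isna().mean()

print(same)
print(missing_counts)
print(missing_share)
```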

### Checking for Duplicate Rows

In [19]:
# The method is called duplicated - check the documentation!
df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
129510    False
129511    False
129512    False
129513    False
129514    False
Length: 129515, dtype: bool

In [20]:
# Can use same trick as above on duplicated
df.duplicated().sum()

19
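
By default `.duplicated()` only flags the second and later copies of a row. The sketch below, on made-up data (`toy` is an assumption), shows the `keep` argument and one way to remove the repeats with `.drop_duplicates()`.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    'animal_id': ['A1', 'A2', 'A1', 'A3'],
    'intake_type': ['Stray', 'Stray', 'Stray', 'Owner Surrender'],
})

# Default keep='first': the first copy is NOT flagged, later copies are
first_flags = toy.duplicated()

# keep=False flags every row that has a twin
all_flags = toy.duplicated(keep=False)

# drop_duplicates returns a new dataframe without the repeated rows
deduped = toy.drop_duplicates()

print(first_flags.tolist())
print(all_flags.tolist())
print(len(deduped))
```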

### Dropping Columns or Rows

There are several different methods depending on what we're doing - but the one to discuss right now is `.drop`.

In [21]:
# Let's drop the MonthYear column, which is the same as our DateTime
df = df.drop(columns=['MonthYear'])

In [22]:
# Check our work here...
df.head()

Unnamed: 0,Animal ID,Name,DateTime,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color,Intake Year
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014


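Why won't my changes save ???

Fun thing about pandas - most methods, `.drop` included, return a new dataframe instead of changing the original. To keep a change, either reassign the variable (as in `df = df.drop(...)` above) or pass `inplace=True`.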
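
A minimal sketch of both options on made-up data (`toy` and `other` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'name': ['Belle', 'Rio'],
                    'MonthYear': ['07/05/2015', '06/29/2014']})

# Calling .drop without saving the result leaves toy unchanged
toy.drop(columns=['MonthYear'])
still_there = 'MonthYear' in toy.columns

# Option 1: reassign the variable to the returned dataframe
toy = toy.drop(columns=['MonthYear'])

# Option 2: inplace=True changes the dataframe directly and returns None
other = pd.DataFrame({'name': ['Odin'], 'MonthYear': ['02/18/2017']})
result = other.drop(columns=['MonthYear'], inplace=True)

print(still_there)
print(list(toy.columns))
print(result, list(other.columns))
```

Reassigning tends to be easier to read and to chain, so it's the option used in the cells above.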

### Renaming Columns

[Documentation for `.rename`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)

In [23]:
# Let's remove spaces from the columns, and make all column names lowercase to be easier
old_col_names = list(df.columns)

In [25]:
old_col_names

['Animal ID',
 'Name',
 'DateTime',
 'Found Location',
 'Intake Type',
 'Intake Condition',
 'Animal Type',
 'Sex upon Intake',
 'Age upon Intake',
 'Breed',
 'Color',
 'Intake Year']

In [28]:
# Build the new names with a list comprehension - we'll zip them into a dictionary to rename
new_col_names = [x.lower().replace(" ", "_") for x in old_col_names]

In [29]:
new_col_names

['animal_id',
 'name',
 'datetime',
 'found_location',
 'intake_type',
 'intake_condition',
 'animal_type',
 'sex_upon_intake',
 'age_upon_intake',
 'breed',
 'color',
 'intake_year']

In [30]:
col_rename_dict = dict(zip(old_col_names, new_col_names))

In [31]:
df = df.rename(columns=col_rename_dict)

In [32]:
# Check your work
df.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014


In [33]:
# Can also use a lambda function
df.rename(columns=lambda x: x.replace(" ", "_").lower())

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014
...,...,...,...,...,...,...,...,...,...,...,...,...
129510,A840205,,2021-08-03 22:55:00,1806 Belford in Austin (TX),Stray,Nursing,Cat,Unknown,4 weeks,Domestic Shorthair,Brown Tabby,2021
129511,A840209,,2021-08-04 07:17:00,1300 Crossing Place in Austin (TX),Stray,Nursing,Cat,Unknown,4 weeks,Russian Blue,Blue,2021
129512,A840210,,2021-08-04 07:23:00,Oak Hill in Austin (TX),Stray,Injured,Dog,Intact Female,1 year,Maltese Mix,White,2021
129513,A836102,*Woogie,2021-06-05 23:01:00,2803 Parker Lane in Austin (TX),Owner Surrender,Normal,Dog,Intact Male,3 months,Pit Bull Mix,White,2021


### Slicing and Dicing

Perhaps your biggest tool for exploring your dataframes will be `.loc` (and its companion, `.iloc`). `.loc` selects by label or by a True/False condition, while `.iloc` selects by integer position. This allows you to use conditionals to explore your data!

In [34]:
# Example: look only at animals with intake type 'Stray'
df.loc[df['intake_type'] == 'Stray'].head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014


In [35]:
# Second example: animals where the animal type is not dog
df.loc[df['animal_type'] != 'Dog'].head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
8,A818975,,2020-06-18 14:53:00,Braker Lane And Metric in Travis (TX),Stray,Normal,Cat,Intact Male,4 weeks,Domestic Shorthair,Cream Tabby,2020
9,A774147,,2018-06-11 07:45:00,6600 Elm Creek in Austin (TX),Stray,Injured,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Black/White,2018
10,A731435,*Casey,2016-08-08 17:52:00,Austin (TX),Owner Surrender,Normal,Cat,Neutered Male,5 months,Domestic Shorthair Mix,Cream Tabby,2016
14,A790209,Ziggy,2019-03-06 14:31:00,4424 S Mopac Expwy in Austin (TX),Public Assist,Normal,Cat,Intact Female,4 years,Domestic Shorthair Mix,Brown Tabby/White,2019


In [36]:
# And a third - animals found before 2018
df.loc[df['intake_year'] < 2018].head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,2013
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate,2017


## Let's Start to Answer Questions!

#### Question 1: What is the most common Animal Type?

In [37]:
# Let's explore the Animal Type column to find out
df['animal_type'].value_counts()

Dog          73042
Cat          48925
Other         6915
Bird           610
Livestock       23
Name: animal_type, dtype: int64

In [38]:
# Another way - look above at describe, or run another describe
# 'Top' for an object column means 'most common'
df['animal_type'].describe()

count     129515
unique         5
top          Dog
freq       73042
Name: animal_type, dtype: object

In [39]:
# Can also check the proportion (multiply by 100 for a percentage)
df['animal_type'].value_counts(normalize=True)

Dog          0.563966
Cat          0.377755
Other        0.053391
Bird         0.004710
Livestock    0.000178
Name: animal_type, dtype: float64
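
If you want the answer itself rather than a table to read, here is one possible sketch on made-up data (`animal_type` below is an assumption, not the real column): `.idxmax()` pulls out the label with the highest count, and multiplying the normalized counts by 100 gives percentages.

```python
import pandas as pd

# Hypothetical toy column
animal_type = pd.Series(['Dog', 'Cat', 'Dog', 'Bird', 'Dog'], name='animal_type')

counts = animal_type.value_counts()

# The index label with the largest count is the most common value
most_common = counts.idxmax()

# normalize=True gives proportions between 0 and 1; scale to percentages
percentages = (animal_type.value_counts(normalize=True) * 100).round(1)

print(most_common)
print(percentages)
```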

#### Question 2: What is the most common dog breed to come into the shelter?

In [40]:
# Let's create a new df, dogs, for all dogs in the original data
dogs = df.loc[df['animal_type'] == 'Dog']

In [41]:
dogs.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color,intake_year
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,2019
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2015
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White,2016
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,2014
5,A743852,Odin,2017-02-18 12:46:00,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,2 years,Labrador Retriever Mix,Chocolate,2017


In [43]:
# Now it's easier to look at common dog breeds
dogs['breed'].value_counts().head(10)

Pit Bull Mix                 8613
Labrador Retriever Mix       7051
Chihuahua Shorthair Mix      6294
German Shepherd Mix          3108
Pit Bull                     1558
Australian Cattle Dog Mix    1557
Labrador Retriever           1165
Chihuahua Shorthair          1142
Dachshund Mix                1057
German Shepherd              1013
Name: breed, dtype: int64

#### Question 3: What percentage of animals have come into the shelter in a condition other than "Normal"?

In [44]:
# Need to explore the proper column
df['intake_condition'].value_counts()

Normal      111865
Injured       7136
Sick          5507
Nursing       3978
Aged           444
Other          236
Medical        116
Feral          114
Pregnant        83
Behavior        32
Space            4
Name: intake_condition, dtype: int64

In [52]:
df['intake_condition'].value_counts(normalize=True)

Normal      0.863722
Injured     0.055098
Sick        0.042520
Nursing     0.030715
Aged        0.003428
Other       0.001822
Medical     0.000896
Feral       0.000880
Pregnant    0.000641
Behavior    0.000247
Space       0.000031
Name: intake_condition, dtype: float64

In [45]:
# Want to use pandas to calculate, not inputting number manually
df['intake_condition'].value_counts()['Normal']

111865

In [46]:
num_normal = df['intake_condition'].value_counts()['Normal']

In [47]:
total_num = len(df)

In [48]:
# Calculate percentage
((total_num - num_normal) / total_num) * 100

13.627765123730843

In [49]:
# Other way to calculate

not_normal = df.loc[df['intake_condition'] != 'Normal']

In [50]:
len(not_normal)

17650

In [51]:
total_num - num_normal

17650
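
A more compact way to get the same kind of answer, sketched on made-up data (`intake_condition` below is an assumption, not the real column): a True/False series can be summed to count and averaged to get a fraction.

```python
import pandas as pd

# Hypothetical toy column: 2 of the 5 animals are not 'Normal'
intake_condition = pd.Series(['Normal', 'Injured', 'Normal', 'Sick', 'Normal'])

# True where the condition is anything other than 'Normal'
is_not_normal = intake_condition != 'Normal'

# True counts as 1, so sum() is the count and mean() is the fraction
count_not_normal = is_not_normal.sum()
pct_not_normal = is_not_normal.mean() * 100

print(count_not_normal)
print(round(pct_not_normal, 2))
```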

## Now - Outcomes Data!

Let's explore together if we have time! If not - extra credit!