# Module 1 - Introduction to Pandas
## Pandas Part 1

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided to start your own animal shelter, and you want to understand what that will entail and gather information for planning. Austin has one of the largest no-kill animal shelters in the country, and it keeps meticulous records of the animals that have been taken in and released. However, the file is large, the online visualization tools provided are terrible, the data is stored as strings, and the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

#### _Big questions for this lesson_: <br/>
- Why use Pandas? 
 
 (a) Provides methods to analyze data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) Makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Together with Python visualization packages, allows for visual analysis of tabular data.
 

- When do we want to use NumPy versus Pandas?
- What are the [advantages of using Pandas?](https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/)
- What are the [disadvantages of using Pandas?](https://wesmckinney.com/blog/apache-arrow-pandas-internals/)
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=600>  




- The data manipulation capabilities of Pandas are built on top of the NumPy library.
- A Pandas DataFrame object represents a spreadsheet-like table with cell values, column names, and row index labels.
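
The three parts of a DataFrame named above can be pulled apart directly. The following is a minimal sketch using a small made-up table (the column names, labels, and values are illustrative assumptions, not taken from the shelter data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the three parts of a DataFrame.
toy_df = pd.DataFrame(
    {"weight_kg": [4.0, 30.5], "age_years": [2.0, 7.0]},
    index=["cat_1", "dog_1"],
)

cell_values = toy_df.to_numpy()      # the cell values as a NumPy array
column_names = list(toy_df.columns)  # the column names
row_labels = list(toy_df.index)      # the row index labels

print(type(cell_values))
print(column_names)
print(row_labels)
```

A plain NumPy array would hold the same numbers, but without the column names and row labels attached.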

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.



In [2]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd
!ls -la data

/Users/enkeboll/code/fis/dc-ds-010620/module-1/day-5-pandas-1
total 16
drwxr-xr-x  4 enkeboll  staff  128 Dec 30 11:09 .
drwxr-xr-x  6 enkeboll  staff  192 Jan 10 11:22 ..
-rw-r--r--  1 enkeboll  staff   62 Dec 30 11:09 example1.csv
-rw-r--r--  1 enkeboll  staff  238 Dec 30 11:09 made_up_jobs.csv


In [3]:
!head data/example1.csv

Title1,Title2,Title3
one,two,three
example1,example2,example3


In [4]:
import pandas as pd

# The file is comma-separated, so sep=',' (the default) splits it into three columns.
example_csv = pd.read_csv('data/example1.csv', sep=',')

In [6]:
my_string = "Hello!\nMy name is \t\t ANDY!"

In [9]:
my_string

'Hello!\nMy name is \t\t ANDY!'

In [8]:
print(my_string)

Hello!
My name is 		 ANDY!


In [5]:
example_csv

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


In [None]:
# Type pd.read_ and press Tab to list the available read_ functions.

There are also `read_excel`, `read_html`, and many other pandas `read_` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
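
The `sep` argument decides how each line is split into columns. As a rough sketch, the snippet below parses the same text as `data/example1.csv` from an in-memory string (using `io.StringIO` is an assumption made here so the example runs without the file) and compares a comma separator with a tab separator:

```python
import io

import pandas as pd

# Same contents as data/example1.csv shown above, held in memory for the demo.
csv_text = "Title1,Title2,Title3\none,two,three\nexample1,example2,example3\n"

comma_df = pd.read_csv(io.StringIO(csv_text), sep=",")
tab_df = pd.read_csv(io.StringIO(csv_text), sep="\t")

print(comma_df.shape)  # two rows, three columns
print(tab_df.shape)    # two rows, one column holding each whole line
```

With the wrong separator pandas does not raise an error; it simply returns a single column, so it is worth checking `shape` or `head()` after reading.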

In [10]:
example_csv.describe()

Unnamed: 0,Title1,Title2,Title3
count,2,2,2
unique,2,2,2
top,one,two,example3
freq,1,1,1


In [11]:
example_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
Title1    2 non-null object
Title2    2 non-null object
Title3    2 non-null object
dtypes: object(3)
memory usage: 128.0+ bytes


In [17]:
None
import numpy as np

np.nan  # the np.NaN alias was removed in NumPy 2.0

nan
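
Both `None` and `np.nan` are treated as missing values by pandas. A small sketch, using a made-up Series, of how they show up:

```python
import numpy as np
import pandas as pd

# Hypothetical values: one real name and two kinds of "missing".
names = pd.Series(["Moose", None, np.nan])

# NaN never compares equal to itself, so use isna() rather than ==.
nan_equals_itself = np.nan == np.nan
missing_mask = names.isna()
non_missing_count = names.count()

print(nan_equals_itself)
print(missing_mask.tolist())
print(non_missing_count)
```

This is the same idea behind the "non-null" counts reported by `info()`.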

Try loading in the example file in the `data` directory called `made_up_jobs.csv` using pandas.

In [18]:
!head data/made_up_jobs.csv

ID,Name,Job,Years Employed
0,Bob Bobberty,Underwater Basket Weaver,13
1,Susan Smells,Salad Spinner,5
2,Alex Lastname,Productivity Manager,2
3,Rudy P.,Being cool,55
4,Rudy G.,Being compared to Rudy P,50
5,Sir Wellington,Cheese Stacker, 10


In [21]:
# read in your csv here!
muj = pd.read_csv('data/made_up_jobs.csv', index_col='ID')

# remember that it's nice to be able to look at your data, so let's do that here, too.
muj.head()

ID,Name,Job,Years Employed
0,Bob Bobberty,Underwater Basket Weaver,13
1,Susan Smells,Salad Spinner,5
2,Alex Lastname,Productivity Manager,2
3,Rudy P.,Being cool,55
4,Rudy G.,Being compared to Rudy P,50


In [20]:
muj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
ID                6 non-null int64
Name              6 non-null object
Job               6 non-null object
Years Employed    6 non-null int64
dtypes: int64(2), object(2)
memory usage: 272.0+ bytes
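
The `info()` output above lists `ID` as an ordinary column with a `RangeIndex`, while the `head()` output shows `ID` as the index. That difference is what `index_col` controls. A minimal sketch, using the first rows of the file held in an in-memory string (an assumption so the example runs on its own):

```python
import io

import pandas as pd

# First rows of data/made_up_jobs.csv as shown above, held in memory.
jobs_text = (
    "ID,Name,Job,Years Employed\n"
    "0,Bob Bobberty,Underwater Basket Weaver,13\n"
    "1,Susan Smells,Salad Spinner,5\n"
    "2,Alex Lastname,Productivity Manager,2\n"
)

default_index_df = pd.read_csv(io.StringIO(jobs_text))
id_index_df = pd.read_csv(io.StringIO(jobs_text), index_col="ID")

print(default_index_df.columns.tolist())  # ID is a regular column
print(id_index_df.columns.tolist())       # ID has moved into the index
print(id_index_df.loc[1, "Job"])          # look up a row by its ID
```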


You can also load data directly from the internet using its URL.

In [22]:
shelter_data = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')
# this link is copied directly from the download option for CSV

shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
1,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
2,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black
3,A680969,*Zeus,08/05/2014 04:59:00 PM,08/05/2014 04:59:00 PM,06/03/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby
4,A684617,,07/27/2014 09:00:00 AM,07/27/2014 09:00:00 AM,07/26/2012,Transfer,SCRP,Cat,Intact Female,2 years,Domestic Shorthair Mix,Black


In [29]:
shelter_data.head(2)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
1,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [30]:
type(shelter_data)

pandas.core.frame.DataFrame

In [34]:
# Now that the data is read in, let's look at its shape
print(shelter_data.shape)

(113950, 12)


In [32]:
print(shelter_data.size)

1367400
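
`shape` and `size` are related: `size` is the number of rows times the number of columns (here 113950 × 12 = 1367400). A quick sketch on a made-up table:

```python
import pandas as pd

# Hypothetical toy table with 3 rows and 2 columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy_df.shape
total_cells = toy_df.size

print(n_rows, n_cols, total_cells)
print(len(toy_df))  # len() gives the number of rows
```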


In [35]:
# What are the names of the columns?
print(shelter_data.columns)

Index(['Animal ID', 'Name', 'DateTime', 'MonthYear', 'Date of Birth',
       'Outcome Type', 'Outcome Subtype', 'Animal Type', 'Sex upon Outcome',
       'Age upon Outcome', 'Breed', 'Color'],
      dtype='object')


In [23]:
# What are the different data types present in our data?
print(shelter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113950 entries, 0 to 113949
Data columns (total 12 columns):
Animal ID           113950 non-null object
Name                78118 non-null object
DateTime            113950 non-null object
MonthYear           113950 non-null object
Date of Birth       113950 non-null object
Outcome Type        113942 non-null object
Outcome Subtype     51633 non-null object
Animal Type         113950 non-null object
Sex upon Outcome    113946 non-null object
Age upon Outcome    113920 non-null object
Breed               113950 non-null object
Color               113950 non-null object
dtypes: object(12)
memory usage: 10.4+ MB
None


In [36]:
# We can find the type of a particular column in a DataFrame this way.
ID_series = shelter_data['Animal ID']
type(ID_series)

pandas.core.series.Series

In [38]:
ID_series.dtype

dtype('O')

In [43]:
shelter_data['RealDateTime'] = pd.to_datetime(shelter_data['DateTime'])

In [44]:
shelter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113950 entries, 0 to 113949
Data columns (total 13 columns):
Animal ID           113950 non-null object
Name                78118 non-null object
DateTime            113950 non-null object
MonthYear           113950 non-null object
Date of Birth       113950 non-null object
Outcome Type        113942 non-null object
Outcome Subtype     51633 non-null object
Animal Type         113950 non-null object
Sex upon Outcome    113946 non-null object
Age upon Outcome    113920 non-null object
Breed               113950 non-null object
Color               113950 non-null object
RealDateTime        113950 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(12)
memory usage: 11.3+ MB
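
Converting to `datetime64` matters because date strings sort alphabetically, not chronologically. A small sketch with two timestamps copied from the `DateTime` column; the explicit `format` string is an assumption that every value follows the same month/day/year, 12-hour layout:

```python
import pandas as pd

# Two timestamps copied from the DateTime column shown earlier.
raw = pd.Series(["02/13/2016 05:59:00 PM", "03/18/2014 11:47:00 AM"])

# Assumption: all values use month/day/year with a 12-hour clock and AM/PM.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

earliest_as_string = raw.sort_values().iloc[0]       # alphabetical order
earliest_as_datetime = parsed.sort_values().iloc[0]  # chronological order

print(earliest_as_string)
print(earliest_as_datetime)
print(parsed.dt.year.tolist(), parsed.dt.hour.tolist())
```

Once parsed, the `.dt` accessor exposes parts such as year and hour without any string slicing.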


In [45]:
shelter_data.sort_values('RealDateTime')

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,RealDateTime
39234,A659834,*Dudley,10/01/2013 09:31:00 AM,10/01/2013 09:31:00 AM,07/23/2013,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever Mix,Black,2013-10-01 09:31:00
32288,A664235,,10/01/2013 10:39:00 AM,10/01/2013 10:39:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01 10:39:00
65689,A664237,,10/01/2013 10:44:00 AM,10/01/2013 10:44:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01 10:44:00
58062,A664236,,10/01/2013 10:44:00 AM,10/01/2013 10:44:00 AM,09/24/2013,Transfer,Partner,Cat,Unknown,1 week,Domestic Shorthair Mix,Orange/White,2013-10-01 10:44:00
7866,A664223,Moby,10/01/2013 11:03:00 AM,10/01/2013 11:03:00 AM,09/30/2009,Return to Owner,,Dog,Neutered Male,4 years,Bulldog Mix,White,2013-10-01 11:03:00
20699,A663646,,10/01/2013 11:12:00 AM,10/01/2013 11:12:00 AM,09/22/2010,Transfer,Partner,Dog,Neutered Male,3 years,Toy Poodle Mix,White,2013-10-01 11:12:00
69129,A663888,,10/01/2013 11:13:00 AM,10/01/2013 11:13:00 AM,09/25/2011,Transfer,Partner,Dog,Spayed Female,2 years,Boxer Mix,Red/White,2013-10-01 11:13:00
18433,A663572,*Starla,10/01/2013 11:42:00 AM,10/01/2013 11:42:00 AM,09/21/2010,Adoption,,Dog,Spayed Female,3 years,Anatol Shepherd Mix,White/Brown,2013-10-01 11:42:00
19385,A663833,Baby Girl,10/01/2013 11:50:00 AM,10/01/2013 11:50:00 AM,09/24/2004,Return to Owner,,Dog,Spayed Female,9 years,Labrador Retriever Mix,Black,2013-10-01 11:50:00
6037,A661795,Blakie,10/01/2013 11:53:00 AM,10/01/2013 11:53:00 AM,03/25/2013,Adoption,,Cat,Spayed Female,6 months,Domestic Shorthair Mix,Tortie,2013-10-01 11:53:00


In [48]:
shelter_data['Outcome Type'].value_counts()

Adoption           50093
Transfer           34014
Return to Owner    20067
Euthanasia          7626
Died                1047
Rto-Adopt            556
Disposal             452
Missing               67
Relocate              20
Name: Outcome Type, dtype: int64
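
The counts above add up to 113942, which matches the non-null count for `Outcome Type` reported by `info()`; by default `value_counts()` leaves missing values out. A sketch on a made-up Series showing the `dropna` and `normalize` options:

```python
import pandas as pd

# Hypothetical outcomes, including one missing value.
outcomes = pd.Series(["Adoption", "Transfer", "Adoption", None, "Adoption"])

counts = outcomes.value_counts()                      # missing values excluded
counts_with_missing = outcomes.value_counts(dropna=False)
shares = outcomes.value_counts(normalize=True)        # fractions instead of counts

print(counts)
print(counts_with_missing)
print(shares)
```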

In [50]:
columns_to_check = [x for x in shelter_data.columns if x != 'Animal ID']

In [51]:
columns_to_check

['Name',
 'DateTime',
 'MonthYear',
 'Date of Birth',
 'Outcome Type',
 'Outcome Subtype',
 'Animal Type',
 'Sex upon Outcome',
 'Age upon Outcome',
 'Breed',
 'Color',
 'RealDateTime']

In [54]:
shelter_data_dedup = shelter_data.drop_duplicates(subset=columns_to_check).sort_values('RealDateTime')

In [55]:
shelter_data.shape

(113950, 13)

In [56]:
shelter_data_dedup.shape

(112010, 13)
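
The `subset` argument tells `drop_duplicates` which columns to compare. A minimal sketch on a made-up table (IDs and values are illustrative) showing why leaving `Animal ID` out of the comparison changes the result:

```python
import pandas as pd

# Hypothetical rows: A1 and A2 differ only in their ID.
toy_df = pd.DataFrame({
    "Animal ID": ["A1", "A2", "A3"],
    "Breed": ["Boxer Mix", "Boxer Mix", "Bulldog Mix"],
    "Color": ["White", "White", "White"],
})

columns_to_check = [c for c in toy_df.columns if c != "Animal ID"]

all_columns_dedup = toy_df.drop_duplicates()                       # IDs differ, nothing dropped
subset_dedup = toy_df.drop_duplicates(subset=columns_to_check)     # first of each match kept

print(all_columns_dedup.shape)
print(subset_dedup["Animal ID"].tolist())
```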

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

To begin, let's start with this list of fruits.

In [57]:
fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

print(fruits)

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']


Using our list of fruits, we can create a pandas object called a Series, which is much like an array, a vector, or a column from a spreadsheet.

In [58]:
fruits_series = pd.Series(fruits)

print(fruits_series)
type(fruits_series)

0         Apple
1        Orange
2    Watermelon
3         Lemon
4         Mango
dtype: object


pandas.core.series.Series

One difference between Python **list objects** and pandas **Series objects** is that you can define the index manually for a **Series object**.

In [59]:
ind = ['a', 'b', 'c', 'd', 'e']

fruits_series = pd.Series(fruits, index=ind)

print(fruits_series)

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object
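
With a custom index, values can be looked up by label as well as by position. A short sketch using the same fruits and labels:

```python
import pandas as pd

fruits_series = pd.Series(
    ["Apple", "Orange", "Watermelon", "Lemon", "Mango"],
    index=["a", "b", "c", "d", "e"],
)

by_label = fruits_series.loc["c"]         # lookup by index label
by_position = fruits_series.iloc[2]       # lookup by integer position
label_slice = fruits_series.loc["b":"d"]  # label slices include both endpoints

print(by_label, by_position)
print(label_slice.tolist())
```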


With a partner, create your own custom series from a list of items.

In [60]:
my_list = ['Chalupa', 'Crunchwrap Supreme', 'Burrito Supreme']

# create custom indices for your series
ind = ['item1', 'item2', 'item3']

# create the series using your list objects
# You can use either a for loop or also pd.Series
my_series = pd.Series(my_list, index=ind)

# print your objects
print(my_series)
type(my_series)

item1               Chalupa
item2    Crunchwrap Supreme
item3       Burrito Supreme
dtype: object


pandas.core.series.Series

We can do a similar thing with Python dictionaries. This time, however, we will create a DataFrame object from a Python dictionary.

In [71]:
# Dictionary with list objects in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New York']
}

students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [77]:
# List of dictionaries, one dictionary per row
student_list = [
    {'name': 'Samantha', 'age': '35', 'city': 'Houston'},
    {'name': 'Alex', 'age': '17', 'city': 'Seattle'},
    {'name': 'Dante', 'age': '26', 'city': 'New York'}
]

students_df = pd.DataFrame(student_list, columns=['name', 'age', 'city'])

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New York


In [64]:
#to find data types of columns
students_df.dtypes

age     object
city    object
name    object
dtype: object

Let's change the data type of ages to int.

In [66]:
students_df.age.astype(int)

0    35
1    17
2    26
Name: age, dtype: int64

In [70]:
students_df.age

0    35
1    17
2    26
Name: age, dtype: object
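
As the two cells above show, `astype` returns a new Series and leaves the column untouched until the result is assigned back. A minimal sketch with the same student data:

```python
import pandas as pd

students_df = pd.DataFrame({
    "name": ["Samantha", "Alex", "Dante"],
    "age": ["35", "17", "26"],
})

converted = students_df["age"].astype(int)  # new Series; students_df is unchanged
still_text = students_df["age"].tolist()

students_df["age"] = converted              # assign back to keep the change

print(still_text)
print(students_df["age"].tolist())
```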

In [76]:
# We can also change a column's type, but the change has to make sense.
# students_df.age = students_df.age.astype(int)

# Note: the line below overwrites the name column with the ages cast to float,
# which is why name shows as float64 in the output. Converting the actual names,
# e.g. students_df.name.astype(float), raises a ValueError.
students_df.name = students_df.age.astype(float)

# What about converting numeric to string?
# students_df.age = students_df.age.astype(str)

students_df.dtypes

name    float64
age      object
city     object
dtype: object
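
Casting text that does not look like a number fails. The sketch below catches that error and also shows `pd.to_numeric` with `errors="coerce"` as one possible way to turn unparseable entries into missing values (the `"unknown"` entry is a made-up example):

```python
import pandas as pd

names = pd.Series(["Samantha", "Alex", "Dante"])

try:
    names.astype(float)
    conversion_failed = False
except ValueError as err:
    # Names cannot be interpreted as numbers.
    conversion_failed = True
    print("Conversion failed:", err)

# Hypothetical column mixing numeric strings with a placeholder.
mixed = pd.Series(["35", "17", "unknown"])
coerced = pd.to_numeric(mixed, errors="coerce")  # unparseable values become NaN

print(coerced.tolist())
```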

We can also use a custom index for these items. For example, we might want them to be the individual student ID numbers.

In [78]:
school_ids = ['1111', '1145', '0096']

# Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
students_df = pd.DataFrame(student_dict, index=school_ids)

students_df.head()

Unnamed: 0,name,age,city
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York
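
Because the IDs are strings, they keep their leading zeros and can be used as labels with `.loc`. A small sketch with the same data:

```python
import pandas as pd

student_dict = {
    "name": ["Samantha", "Alex", "Dante"],
    "age": ["35", "17", "26"],
    "city": ["Houston", "Seattle", "New York"],
}
school_ids = ["1111", "1145", "0096"]

students_df = pd.DataFrame(student_dict, index=school_ids)

dante_city = students_df.loc["0096", "city"]    # row label, then column name
short_id_found = "96" in students_df.index      # "96" is a different label from "0096"

print(dante_city)
print(short_id_found)
```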


Using Pandas, we can also rename column names.

In [84]:
# Assign a new list of column names (note that 'city' becomes 'HOME').
students_df.columns = ['NAME', 'AGE', 'HOME']
students_df.head()

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York


Or, we can also change the column names using the rename function.

In [85]:
students_df.rename(columns={"AGE": "YEARS"})

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York


In [86]:
# Notice what happens when we print students_df

students_df

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York


In [87]:
# To modify the DataFrame itself instead of getting a new copy, use the option `inplace=True`.
students_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
students_df.head()

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York
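
By default `rename` returns a new DataFrame and leaves the original alone. Reassigning the result is an alternative to `inplace=True`; a minimal sketch with a one-row version of the table:

```python
import pandas as pd

students_df = pd.DataFrame({"NAME": ["Samantha"], "AGE": ["35"], "HOME": ["Houston"]})

renamed = students_df.rename(columns={"AGE": "YEARS"})
original_columns = students_df.columns.tolist()  # unchanged by the call above

# Reassign to keep the renamed version.
students_df = students_df.rename(columns={"AGE": "YEARS"})

print(original_columns)
print(students_df.columns.tolist())
```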


Similarly, the `drop` method removes rows and columns from your DataFrame.

In [88]:
students_df.drop(columns=['YEARS', 'HOME'])

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [89]:
#Notice again what happens if we print students_df 
students_df

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New York


In [90]:
students_df = students_df.drop(columns=['YEARS', 'HOME'])
students_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


Alternatively, to modify the DataFrame in place instead of reassigning it, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html).
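
As a taste of those options, here is a small sketch of three of them: dropping a row by index label, dropping a single column, and `errors="ignore"` for labels that may not exist. The table is a trimmed copy of the student data:

```python
import pandas as pd

students_df = pd.DataFrame(
    {"NAME": ["Samantha", "Alex", "Dante"], "YEARS": ["35", "17", "26"]},
    index=["1111", "1145", "0096"],
)

without_alex = students_df.drop(index="1145")        # remove a row by its label
without_years = students_df.drop(columns="YEARS")    # remove a column by name

# 'HOME' is not in this table; errors="ignore" skips it instead of raising KeyError.
unchanged = students_df.drop(columns="HOME", errors="ignore")

print(without_alex.index.tolist())
print(without_years.columns.tolist())
print(unchanged.shape)
```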