# Module 1 - Introduction to Pandas
## Pandas Part 1

### Introduction

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)
You have decided that you want to start your own animal shelter, but first you want an idea of what that will entail and more information for planning. You have found out that Austin has one of the largest no-kill animal shelters in the country, and that it keeps meticulous track of the animals that have been taken in and released. However, the file is large, the online visualization tools provided are terrible, the data is stored as strings, and the file holds an overwhelming amount of information. Is there an easy way to look at this data? Can we do this with base Python? Is there a better way?


#### _Our goals today are to be able to_: <br/>

- Import/read data using Pandas
- Identify Pandas objects and manipulate Pandas objects by index and columns
- Filter data using Pandas

#### _Big questions for this lesson_: <br/>
- Why use Pandas? 
 
 (a) It provides methods to analyze data stored in the formats data scientists most often encounter (.csv, .tsv, or .xlsx). 
 
 (b) It makes it very convenient to load, process, and analyze data in those formats. 
 
 (c) Along with Python visualization packages, it allows for the visual analysis of tabular data.
 

- When do we want to use NumPy versus Pandas?
- What are the advantages of using Pandas?    
https://stackabuse.com/beginners-tutorial-on-the-pandas-python-library/
- What are the disadvantages of using Pandas?                      
https://wesmckinney.com/blog/apache-arrow-pandas-internals/

- The data structures in Pandas are implemented using series and dataframe classes.  
- A series is a one-dimensional indexed array of some fixed data type.  
- A dataframe is a two-dimensional data structure, like a table, where each column contains data of the same type.  
- DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
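
As a minimal sketch of the two classes described above (the names `ages` and `people` and their values are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional, labeled, with a single dtype.
ages = pd.Series([35, 17, 26], index=['a', 'b', 'c'], name='age')

# A DataFrame: a table whose columns are Series sharing one index.
people = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': [35, 17, 26],
})

print(ages.dtype)            # dtype of the whole Series
print(people.dtypes)         # one dtype per column
print(type(people['age']))   # selecting one column gives back a Series
```

Each column keeps its own dtype, which is why a table can mix text and numbers while a single Series cannot.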

### Activation:

<img src="https://cdn-images-1.medium.com/max/1600/1*9IU5fBzJisilYjRAi-f55Q.png" width=700, height=700>  




- The data manipulation capabilities of Pandas are built on top of the NumPy library.
- A Pandas DataFrame object represents a spreadsheet with cell values, column names, and row index labels.
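
To illustrate the relationship with NumPy, here is a small sketch (the `prices` and `qty` data are invented). The values of a numeric Series can be pulled out as a NumPy array, and arithmetic between two Series lines rows up by index label rather than by position.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.5, 2.0, 0.5], index=['apple', 'mango', 'lemon'])
qty = pd.Series([10, 4], index=['lemon', 'apple'])

# The underlying values are available as a NumPy array.
values = prices.to_numpy()
print(type(values), values.mean())

# Arithmetic matches rows by label; labels missing on one side give NaN.
total = prices * qty
print(total)
```

This label alignment is one practical reason to reach for Pandas over plain NumPy when rows carry meaningful names.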

### 1. Importing and reading data with Pandas!

#### Let's use pandas to read some csv files so we can interact with them.



In [3]:
# First, let's check which directory we are in so the files we expect to see are there.
!pwd
!ls -la data

/Users/enkeboll/code/fis/dc-ds-08-26-19/module-1/week-1/day-5-pandas-1
total 16
drwxr-xr-x  4 enkeboll  staff  128 Aug 30 10:24 .
drwxr-xr-x  6 enkeboll  staff  192 Aug 30 10:31 ..
-rw-r--r--@ 1 enkeboll  staff   62 Jun  5 16:07 example1.csv
-rw-r--r--@ 1 enkeboll  staff  238 Jun  5 16:07 made_up_jobs.csv


In [4]:
import pandas as pd

example_csv = pd.read_csv('data/example1.csv')

There are also `read_excel`, `read_html`, and many other pandas `read_` functions.  
http://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
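
A hedged sketch of a few commonly used `read_csv` options follows. It uses in-memory text in place of a file so it runs anywhere; the sample text is invented, while `sep`, `index_col`, and `usecols` are standard `read_csv` parameters.

```python
import io
import pandas as pd

csv_text = "ID,Name,Years Employed\n0,Bob,13\n1,Susan,5\n"
tsv_text = "ID\tName\tYears Employed\n0\tBob\t13\n1\tSusan\t5\n"

# Use the ID column as the row index instead of the default 0..n-1.
by_id = pd.read_csv(io.StringIO(csv_text), index_col='ID')

# Tab-separated data only needs a different separator.
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# Read only the columns you need.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=['Name'])

print(by_id)
print(from_tsv.shape, list(names_only.columns))
```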

In [6]:
!head data/example1.csv

Title1,Title2,Title3
one,two,three
example1,example2,example3


In [5]:
example_csv.head()

Unnamed: 0,Title1,Title2,Title3
0,one,two,three
1,example1,example2,example3


Try loading in the example file in the `data` directory called `made_up_jobs.csv` using pandas.

In [7]:
!head data/made_up_jobs.csv

ID,Name,Job,Years Employed
0,Bob Bobberty,Underwater Basket Weaver,13
1,Susan Smells,Salad Spinner,5
2,Alex Lastname,Productivity Manager,2
3,Rudy P.,Being cool,55
4,Rudy G.,Being compared to Rudy P,50
5,Sir Wellington,Cheese Stacker, 10


In [8]:
#read in your csv here!
jobs = pd.read_csv('data/made_up_jobs.csv')

#remember that it's nice to be able to look at your data, so let's do that here, too.
jobs.head()

Unnamed: 0,ID,Name,Job,Years Employed
0,0,Bob Bobberty,Underwater Basket Weaver,13
1,1,Susan Smells,Salad Spinner,5
2,2,Alex Lastname,Productivity Manager,2
3,3,Rudy P.,Being cool,55
4,4,Rudy G.,Being compared to Rudy P,50


In [19]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
ID                6 non-null int64
Name              6 non-null object
Job               6 non-null object
Years Employed    6 non-null int64
dtypes: int64(2), object(2)
memory usage: 272.0+ bytes


You can also load in data by using the url of an associated dataset.

In [11]:
shelter_data = pd.read_csv('https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')
# this link is copied directly from the download option for CSV

shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A790578,*Tux,08/30/2019 08:00:00 AM,08/30/2019 08:00:00 AM,09/12/2017,Adoption,Foster,Dog,Neutered Male,1 year,Boxer/Pointer,Black/White
1,A798506,*Drake,08/30/2019 07:26:00 AM,08/30/2019 07:26:00 AM,06/16/2019,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever/Bearded Collie,Black/White
2,A798504,*Travis,08/30/2019 06:53:00 AM,08/30/2019 06:53:00 AM,06/16/2019,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever/Bearded Collie,Black/Brown
3,A798503,*Annie,08/30/2019 06:50:00 AM,08/30/2019 06:50:00 AM,06/16/2019,Adoption,Foster,Dog,Spayed Female,2 months,Labrador Retriever/Bearded Collie,Black/Brown
4,A798500,*Monroe,08/30/2019 06:45:00 AM,08/30/2019 06:45:00 AM,06/16/2019,Adoption,Foster,Dog,Spayed Female,2 months,Labrador Retriever/Bearded Collie,Tricolor/White


Now that we can read in data, let's get more comfortable with our Pandas data structures.

In [10]:
type(shelter_data)

pandas.core.frame.DataFrame

In [16]:
# Now that the data is read, let's look at its shape
print(shelter_data.shape)

(106877, 12)


In [13]:
shelter_data.size

1282524

In [23]:
shelter_data.index

RangeIndex(start=0, stop=106877, step=1)

In [22]:
# What are the names of the columns
list(shelter_data.columns)

['Animal ID',
 'Name',
 'DateTime',
 'MonthYear',
 'Date of Birth',
 'Outcome Type',
 'Outcome Subtype',
 'Animal Type',
 'Sex upon Outcome',
 'Age upon Outcome',
 'Breed',
 'Color']

In [18]:
# What are the different data types present in our data
print(shelter_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106877 entries, 0 to 106876
Data columns (total 12 columns):
Animal ID           106877 non-null object
Name                73287 non-null object
DateTime            106877 non-null object
MonthYear           106877 non-null object
Date of Birth       106877 non-null object
Outcome Type        106870 non-null object
Outcome Subtype     48695 non-null object
Animal Type         106877 non-null object
Sex upon Outcome    106874 non-null object
Age upon Outcome    106865 non-null object
Breed               106877 non-null object
Color               106877 non-null object
dtypes: object(12)
memory usage: 9.8+ MB
None
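
Every column above has dtype `object`, which here means the values were read as strings. As a sketch of how date-like strings could be converted, the two-row frame below is invented and only mimics the `DateTime` and `Date of Birth` formats shown in `head()`; the `format` strings are assumptions based on those samples.

```python
import pandas as pd

sample = pd.DataFrame({
    'DateTime': ['08/30/2019 08:00:00 AM', '08/30/2019 07:26:00 AM'],
    'Date of Birth': ['09/12/2017', '06/16/2019'],
})

# Parse the strings into datetime64 columns.
sample['DateTime'] = pd.to_datetime(sample['DateTime'],
                                    format='%m/%d/%Y %I:%M:%S %p')
sample['Date of Birth'] = pd.to_datetime(sample['Date of Birth'],
                                         format='%m/%d/%Y')

# Datetime columns support arithmetic, e.g. age at outcome in whole days.
sample['age_days'] = (sample['DateTime'] - sample['Date of Birth']).dt.days
print(sample.dtypes)
print(sample['age_days'])
```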


In [30]:
# A single column is a Series; its underlying values are stored in a NumPy array.
ID_series = shelter_data['Animal ID']
type(ID_series.values)

numpy.ndarray

In [27]:
shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A790578,*Tux,08/30/2019 08:00:00 AM,08/30/2019 08:00:00 AM,09/12/2017,Adoption,Foster,Dog,Neutered Male,1 year,Boxer/Pointer,Black/White
1,A798506,*Drake,08/30/2019 07:26:00 AM,08/30/2019 07:26:00 AM,06/16/2019,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever/Bearded Collie,Black/White
2,A798504,*Travis,08/30/2019 06:53:00 AM,08/30/2019 06:53:00 AM,06/16/2019,Adoption,Foster,Dog,Neutered Male,2 months,Labrador Retriever/Bearded Collie,Black/Brown
3,A798503,*Annie,08/30/2019 06:50:00 AM,08/30/2019 06:50:00 AM,06/16/2019,Adoption,Foster,Dog,Spayed Female,2 months,Labrador Retriever/Bearded Collie,Black/Brown
4,A798500,*Monroe,08/30/2019 06:45:00 AM,08/30/2019 06:45:00 AM,06/16/2019,Adoption,Foster,Dog,Spayed Female,2 months,Labrador Retriever/Bearded Collie,Tricolor/White


In [34]:
shelter_data.columns = [x.replace(' ', '') for x in shelter_data.columns]

In [41]:
shelter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106877 entries, 0 to 106876
Data columns (total 12 columns):
AnimalID          106877 non-null object
Name              73287 non-null object
DateTime          106877 non-null object
MonthYear         106877 non-null object
DateofBirth       106877 non-null object
OutcomeType       106870 non-null object
OutcomeSubtype    48695 non-null object
AnimalType        106877 non-null object
SexuponOutcome    106874 non-null object
AgeuponOutcome    106865 non-null object
Breed             106877 non-null object
Color             106877 non-null object
dtypes: object(12)
memory usage: 9.8+ MB
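
The list comprehension above removes spaces entirely. An alternative sketch, using the vectorized string methods on the column index, replaces spaces with underscores and lowercases the names. The one-row frame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Animal ID': ['A1'], 'Outcome Type': ['Adoption']})

# .str applies string methods to every column label at once.
df.columns = df.columns.str.replace(' ', '_').str.lower()

print(list(df.columns))
print(df.animal_id)   # attribute access works because there are no spaces
```

Either approach works; underscores simply keep multi-word names easier to read.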



In [36]:
shelter_data.AnimalID

0         A790578
1         A798506
2         A798504
3         A798503
4         A798500
5         A798502
6         A798499
7         A803079
8         A803077
9         A803095
10        A803110
11        A802958
12        A802872
13        A802881
14        A578861
15        A801437
16        A802606
17        A802679
18        A798430
19        A768899
20        A802833
21        A802884
22        A803224
23        A733193
24        A802893
25        A802964
26        A802946
27        A802948
28        A802947
29        A802491
           ...   
106847    A663887
106848    A663935
106849    A663201
106850    A663955
106851    A664171
106852    A663938
106853    A664261
106854    A664260
106855    A664262
106856    A664263
106857    A664271
106858    A663495
106859    A664219
106860    A613553
106861    A648744
106862    A664258
106863    A663546
106864    A664225
106865    A663342
106866    A656894
106867    A661795
106868    A663833
106869    A663572
106870    A663888
106871    

### 2. Utilizing and identifying Pandas objects

- What is a DataFrame object and what is a Series object? 
- How are they different from Python lists?

These are the questions we will cover in this section. To begin, let's start with this list of fruits.

In [42]:
fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']

print(fruits)

['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']


Using our list of fruits, we can create a pandas object called a 'series' which is much like an array or a vector.

In [47]:
fruits_series = pd.Series(fruits)

fruits_series
# type(fruits_series)

0         Apple
1        Orange
2    Watermelon
3         Lemon
4         Mango
dtype: object

In [49]:
fruits_series.index = ['a', 'b', 'c', 'd', 'e']

In [50]:
fruits_series

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object

One difference between Python **list objects** and pandas **series objects** is that you can define the index manually for a **series object**.

In [46]:
ind = ['a', 'b', 'c', 'd', 'e']

fruits_series = pd.Series(fruits, index=ind)

print(fruits_series)

a         Apple
b        Orange
c    Watermelon
d         Lemon
e         Mango
dtype: object
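
A short sketch of what the custom index buys you, reusing the same fruit data: a list can only be accessed by position, while a Series can be accessed by label as well.

```python
import pandas as pd

fruits = ['Apple', 'Orange', 'Watermelon', 'Lemon', 'Mango']
fruits_series = pd.Series(fruits, index=['a', 'b', 'c', 'd', 'e'])

print(fruits[2])                  # a list only knows positions
print(fruits_series['c'])         # a Series can be looked up by label
print(fruits_series.iloc[2])      # ...and still by position via .iloc
print(fruits_series[['a', 'e']])  # a list of labels selects several items
```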


With a partner, create your own custom series from a list of lists.

In [51]:
list_of_lists = [['cat'], ['dog'], ['horse'], ['cow'], ['macaw']]

# create custom indices for your series
ind = ['tacocat', 'airbud', 'ed', 'bessie', 'polly']

# create the series using your list objects
# You can use either a for loop or also pd.Series
list_of_lists_series = pd.Series(list_of_lists, index=ind)

# print your series
print(list_of_lists_series)
type(list_of_lists_series)

tacocat      [cat]
airbud       [dog]
ed         [horse]
bessie       [cow]
polly      [macaw]
dtype: object


pandas.core.series.Series

We can do a similar thing with Python dictionaries. This time, however, we will create a DataFrame object from a Python dictionary.

In [71]:
# Dictionary with list object in values
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [53]:
# List with dictionary objects in values
student_dict = [
    {'name': 'Samantha', 'age': 35, 'city': 'Houston'},
    {'name': 'Alex', 'age': 17, 'city': 'Seattle'},
]
students_df = pd.DataFrame(student_dict)

students_df.head()

Unnamed: 0,age,city,name
0,35,Houston,Samantha
1,17,Seattle,Alex


In [55]:
students_df.age

0    35
1    17
Name: age, dtype: int64

In [56]:
#to find data types of columns
students_df.dtypes

age      int64
city    object
name    object
dtype: object

Let's change the data type of the `age` column.

In [65]:
# We can also change a column's type, but the change has to make sense.
# students_df.age = students_df.age.astype(int)

#Uncomment line below and observe what happens when trying to convert student's name to int or float
# students_df.name = students_df.name.astype(int)

#How about what happens converting numeric to string
students_df.age = students_df.age.astype(str)

students_df.dtypes

age     object
city    object
name    object
dtype: object
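
As a sketch of what "has to make sense" means in practice (the small frame and the `'unknown'` value are invented): digit strings convert to integers, names do not, and `pd.to_numeric` with `errors='coerce'` is one way to handle values that cannot be parsed.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Samantha', 'Alex'], 'age': ['35', '17']})

# Digit strings convert cleanly to integers.
df['age'] = df['age'].astype(int)

# Names cannot be interpreted as numbers, so astype raises ValueError.
try:
    df['name'].astype(int)
    conversion_failed = False
except ValueError as err:
    conversion_failed = True
    print('conversion failed:', err)

# errors='coerce' turns unparseable values into NaN instead of raising.
mixed = pd.to_numeric(pd.Series(['35', 'unknown']), errors='coerce')
print(mixed)
```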

In [69]:
students_df.astype(str).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
age     2 non-null object
city    2 non-null object
name    2 non-null object
dtypes: object(3)
memory usage: 128.0+ bytes


We can also use a custom index for these items. For example, we might want them to be the individual student ID numbers.

In [72]:
school_ids = ['1111', '1145', '0096']

# Notice here we use pd.DataFrame not pd.Series as we did for a pandas series.
students_df = pd.DataFrame(student_dict, index=school_ids)

students_df.head()

Unnamed: 0,name,age,city
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Using Pandas, we can also rename column names.

In [78]:
students_df.columns = ['NAME', 'AGE', 'HOME']
# Alternative that only uppercases the existing names (gives 'CITY', not 'HOME'):
# students_df.columns = [x.upper() for x in students_df.columns]
students_df.head()

Unnamed: 0,NAME,AGE,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Or, we can also change the column names using the rename function.

In [81]:
students_df.rename(columns={'AGE': 'YEARS'}, inplace=True)

In [82]:
# Notice what happens when we print students_df

students_df

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [83]:
# To modify the DataFrame itself rather than getting back a renamed copy, use the option `inplace=True`.
students_df.rename(columns={'AGE': 'YEARS'}, inplace=True)
students_df.head()

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


Similarly, the `drop` method removes rows and columns from your DataFrame.

In [85]:
students_df.drop(['YEARS', 'HOME'], axis=1)

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [86]:
#Notice again what happens if we print students_df 
students_df

Unnamed: 0,NAME,YEARS,HOME
1111,Samantha,35,Houston
1145,Alex,17,Seattle
0096,Dante,26,New york


In [87]:
students_df.drop(['YEARS', 'HOME'], inplace=True, axis=1)
students_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


In [88]:
students_df

Unnamed: 0,NAME
1111,Samantha
1145,Alex
0096,Dante


To modify the DataFrame itself rather than getting back a modified copy, use the option `inplace=True`.

Every function has options. Let's read more about `drop` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
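
A minimal sketch of the alternative to `inplace=True`, which is to keep the returned copy under a new name. The frame below is a made-up stand-in for `students_df`; `columns=` and `index=` are keyword forms of `drop` that avoid the `axis` argument.

```python
import pandas as pd

df = pd.DataFrame({'NAME': ['Samantha', 'Alex'],
                   'YEARS': [35, 17],
                   'HOME': ['Houston', 'Seattle']})

# Without inplace=True, drop returns a new DataFrame and leaves df alone.
trimmed = df.drop(columns=['YEARS', 'HOME'])
print(list(df.columns))       # original columns are still there
print(list(trimmed.columns))

# Dropping rows works the same way, by index label.
without_first = df.drop(index=0)
print(without_first)
```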

### 3. Filtering Data Using Pandas
There are several ways to grab particular data from a DataFrame. 
- Python lists allow for selection of data only through integer location. 
- You can use a single integer or slice notation to make the selection but NOT a list of integers.
- Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [92]:
l = [1, 2, 3, 4, 5]
l[4]

5
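
The limits of lists and dictionaries described above can be checked directly. This sketch uses only base Python; the dictionary `d` is invented for illustration.

```python
l = [1, 2, 3, 4, 5]
d = {'a': 1, 'b': 2, 'c': 3}

print(l[4])      # single integer position
print(l[1:3])    # slice of positions

# A list of positions is not accepted by a plain list.
try:
    l[[0, 2]]
    list_error = None
except TypeError as err:
    list_error = err
    print('list:', err)

# A dict takes one key at a time; a list of keys is unhashable.
try:
    d[['a', 'c']]
    dict_error = None
except TypeError as err:
    dict_error = err
    print('dict:', err)
```

Pandas indexers accept single values, slices, and lists, which is what the rest of this section explores.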

### DataFrames can be indexed by column name (label), by row name (index label), or by position.   
#### The `.loc` indexer is used for indexing by label.  
#### The `.iloc` indexer is used for indexing by integer position.

In [93]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante'],
    'age': ['35', '17', '26'],
    'city': ['Houston', 'Seattle', 'New york']
}

students_df = pd.DataFrame(student_dict)

In [94]:
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


### Let's take a look at `.iloc`
#### `.iloc` takes slices based on index position.
#### `.iloc` stands for integer location, which should help with remembering what it does.
#### `.iloc[row, column]`

In [99]:
# returns the first row
students_df.iloc[0, :]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [98]:
# returns the first column
students_df.iloc[:, 0]

0    Samantha
1        Alex
2       Dante
Name: name, dtype: object

In [102]:
# returns the first two rows; notice that iloc performs regular Python slicing.
students_df.iloc[0:2, :]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle


In [103]:
# returns the first two columns
students_df.iloc[:, 0:2]

Unnamed: 0,name,age
0,Samantha,35
1,Alex,17
2,Dante,26


In [104]:
# returns the first row, restricted to the first two columns (positions 0 and 1)
students_df.iloc[0, 0:2]

name    Samantha
age           35
Name: 0, dtype: object

### How would we use `.iloc` to return the last item in the last row?


In [106]:
# return the last item in the last row using iloc
students_df.iloc[-1, -1]

'New york'

### How would we use `.iloc` to return the last item in the last column?


In [107]:
# return the last item in the last column using iloc
students_df.iloc[-1, -1]

'New york'

### What if we only want certain columns or rows?

In [108]:
# students_df.iloc[0, 2] would return the single value at row 0, column 2; pass a list to select rows 0 and 2.
students_df.iloc[[0, 2]]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
2,Dante,26,New york


In [109]:
students_df.iloc[[0, 2], [0, 2]]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New york


### Let's take a look at `.loc`
#### Label based method. 
#### Names or labels of the index are used when taking slices.
#### Also supports boolean subsetting.

In [110]:
# We will use loc to return rows and columns based on labels. Let's look at the students_df DataFrame again.
students_df

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [111]:
# returns the student information associated with index 0
students_df.loc[0]

name    Samantha
age           35
city     Houston
Name: 0, dtype: object

In [114]:
# returns the student information for row labels 0 through 2, inclusive.
# note that iloc uses normal Python slicing, which would exclude 2, as demonstrated above.
students_df.loc[0:2]

Unnamed: 0,name,age,city
0,Samantha,35,Houston
1,Alex,17,Seattle
2,Dante,26,New york


In [115]:
# returns the column labeled 'age'
students_df.loc[:, 'age']

0    35
1    17
2    26
Name: age, dtype: object

In [116]:
# returns the 'age' column for the rows labeled 1 through 2 (inclusive).
students_df.loc[1:2, 'age']

1    17
2    26
Name: age, dtype: object

In [118]:
# returns the rows labeled 1 through 2 (inclusive)
# and the columns labeled 'name' through 'city' (inclusive).
students_df.loc[1:2, 'name':'city']

Unnamed: 0,name,age,city
1,Alex,17,Seattle
2,Dante,26,New york


In [119]:
# What should we get?
students_df.loc[1:2, ['name', 'city']]

Unnamed: 0,name,city
1,Alex,Seattle
2,Dante,New york


In [120]:
# How about?
students_df.loc[[0, 2], ['name', 'city']]

Unnamed: 0,name,city
0,Samantha,Houston
2,Dante,New york


In [121]:
# if index rearranged
school_ids = ['5', '11', '3']
students_df = pd.DataFrame(student_dict, index=school_ids)
students_df

Unnamed: 0,name,age,city
5,Samantha,35,Houston
11,Alex,17,Seattle
3,Dante,26,New york


In [127]:
students_df.reset_index()

Unnamed: 0,index,name,age,city
0,5,Samantha,35,Houston
1,11,Alex,17,Seattle
2,3,Dante,26,New york


In [128]:
students_df

Unnamed: 0,name,age,city
5,Samantha,35,Houston
11,Alex,17,Seattle
3,Dante,26,New york


In [129]:
# What should we get now?
students_df.loc[["5"], ['name', 'city']]

Unnamed: 0,name,city
5,Samantha,Houston


In [130]:
# What should we get now?
students_df.loc[['5', '11'], ['name', 'city']]

Unnamed: 0,name,city
5,Samantha,Houston
11,Alex,Seattle


In [131]:
students_df.set_index("name", inplace=True)
students_df

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


In [132]:
students_df.loc[['Samantha']]

name,age,city
Samantha,35,Houston


In [133]:
# Subsetting nonconsecutive rows
students_df.loc[['Samantha', 'Dante']]

name,age,city
Samantha,35,Houston
Dante,26,New york


In [134]:
# Samantha to the end
students_df.loc['Samantha':]

name,age,city
Samantha,35,Houston
Alex,17,Seattle
Dante,26,New york


In [137]:
# return the first and last rows using one loc command
students_df.loc[['Samantha', 'Dante']]

name,age,city
Samantha,35,Houston
Dante,26,New york


### Boolean Subsetting

In [138]:
student_dict = {
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
    'state': ['Texas', 'Washington', 'New York', 'Georgia']
}

students_df = pd.DataFrame(student_dict)

In [139]:
# The expression students_df['name'] == 'Samantha' produces a Pandas Series with a True/False value for every row
# of students_df, with True for the rows where the name is 'Samantha'.
# This kind of boolean Series can be passed directly to the .loc indexer.
students_df.loc[students_df['name'] == 'Samantha']

Unnamed: 0,name,age,city,state
0,Samantha,35,Houston,Texas
3,Samantha,21,Atlanta,Georgia


In [140]:
# What about if we only want the city and state of the selected students with the name Samantha?
students_df.loc[students_df['name'] == 'Samantha', ['city', 'state']]

Unnamed: 0,city,state
0,Houston,Texas
3,Atlanta,Georgia


In [141]:
# What about if we want to select a student of a specific age?
students_df.loc[students_df['age'] == '21']

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [142]:
# What about if we want to select a student of a specific age who also lives in a specific city?
students_df.loc[(students_df['age'] == '21') &
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
3,Samantha,21,Atlanta,Georgia


In [143]:
# What should be returned?
students_df.loc[(students_df['age'] == '35') &
                (students_df['city'] == 'Atlanta')]

Unnamed: 0,name,age,city,state
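
The examples above compare `age` as a string because the ages were entered as strings. As a sketch under that observation (the frame repeats the lesson's data without the `state` column), converting to integers allows range comparisons, and the `|`, `~`, and `isin` forms cover "or", "not", and membership tests.

```python
import pandas as pd

students_df = pd.DataFrame({
    'name': ['Samantha', 'Alex', 'Dante', 'Samantha'],
    'age': ['35', '17', '26', '21'],
    'city': ['Houston', 'Seattle', 'New york', 'Atlanta'],
})

# Ages are strings here; convert them so numeric comparisons behave as expected.
students_df['age'] = students_df['age'].astype(int)

adults = students_df.loc[students_df['age'] >= 21]

# | means "or", ~ means "not"; each condition needs its own parentheses.
either = students_df.loc[(students_df['age'] < 18) |
                         (students_df['city'] == 'Atlanta')]
not_samantha = students_df.loc[~(students_df['name'] == 'Samantha')]

# isin tests membership in a list of values.
in_cities = students_df.loc[students_df['city'].isin(['Houston', 'Seattle'])]

print(adults, either, not_samantha, in_cities, sep='\n\n')
```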


### Lesson Recap
Pandas combines the power of python lists (selection via integer location) and dictionaries (selection by label)

`.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

`.iloc` will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).

`.loc` is primarily label based, but may also be used with a boolean array.

#### Warning: Note that, contrary to usual Python slices, both the start and the stop are included when slicing with `.loc`.

`.loc` will raise a `KeyError` when any items are not found.
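
The error behavior summarized in this recap can be reproduced with a small sketch; the frame and its labels are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [35, 17, 26]}, index=['sam', 'alex', 'dante'])

# Out-of-bounds integer position: IndexError.
try:
    df.iloc[10]
    iloc_error = None
except IndexError as err:
    iloc_error = err

# Out-of-bounds slice: allowed, simply returns what exists.
tail = df.iloc[1:10]

# Missing label: KeyError.
try:
    df.loc['nobody']
    loc_error = None
except KeyError as err:
    loc_error = err

# Label slices include both endpoints.
both_ends = df.loc['sam':'alex']

print(type(iloc_error).__name__, type(loc_error).__name__)
print(len(tail), len(both_ends))
```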



### CLASS ASSIGNMENT
Now that we have all of these new tools in our tool belt, use these tools on the shelter data set! 
- Use `shelter_data.columns` to get the list of column names.
- Subset the data by `Outcome Type`.
- Subset the data by `Outcome Type`: `Adoption` and only return the `Animal Type` column. 
- Subset the data by `Outcome Type`: `Adoption` and only return the `Animal Type` column with only `Cat`. 
- Play around with your new tools on the data set.
- For extra credit: What are the data types returned from the different subsetting? Is what is returned a series or a dataframe?

In [144]:
import pandas as pd
shelter_data = pd.read_csv(
    'https://data.austintexas.gov/api/views/9t4d-g238/rows.csv?accessType=DOWNLOAD')
shelter_data.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A803058,,08/30/2019 10:49:00 AM,08/30/2019 10:49:00 AM,08/26/2017,Transfer,Snr,Cat,Intact Male,2 years,Domestic Shorthair,Cream Tabby/White
1,A803057,,08/30/2019 10:49:00 AM,08/30/2019 10:49:00 AM,08/26/2017,Transfer,Snr,Cat,Intact Male,2 years,Domestic Shorthair,Cream Tabby/White
2,A803060,,08/30/2019 10:49:00 AM,08/30/2019 10:49:00 AM,04/26/2019,Transfer,Snr,Cat,Intact Female,4 months,Domestic Shorthair,Gray Tabby
3,A803061,,08/30/2019 10:49:00 AM,08/30/2019 10:49:00 AM,04/26/2019,Transfer,Snr,Cat,Intact Male,4 months,Domestic Shorthair,Gray Tabby
4,A803020,,08/30/2019 10:48:00 AM,08/30/2019 10:48:00 AM,08/26/2017,Transfer,Snr,Cat,Unknown,2 years,Domestic Shorthair,Brown Tabby/White


In [148]:
# Use shelter_data.columns to get the list of column names.
print(*shelter_data.columns, sep='\n')

Animal ID
Name
DateTime
MonthYear
Date of Birth
Outcome Type
Outcome Subtype
Animal Type
Sex upon Outcome
Age upon Outcome
Breed
Color


In [152]:
# Subset the data by Outcome Type.
shelter_data['Outcome Type'].value_counts()

Adoption           46484
Transfer           31941
Return to Owner    19100
Euthanasia          7409
Died                1004
Rto-Adopt            455
Disposal             407
Missing               65
Relocate              19
Name: Outcome Type, dtype: int64

In [153]:
# Subset the data by Outcome Type: Adoption and only return the Animal Type column.
shelter_data.loc[shelter_data['Outcome Type'] == 'Adoption', 'Animal Type']

6         Cat
7         Cat
8         Cat
9         Cat
10        Dog
13        Dog
14        Dog
15        Dog
16        Dog
17        Dog
18        Dog
19        Dog
25        Dog
26        Cat
27        Dog
28        Cat
29        Dog
30        Dog
31        Cat
33        Dog
34        Dog
43        Cat
44        Dog
57        Dog
58        Cat
59        Cat
60        Cat
65        Cat
66        Cat
67        Cat
         ... 
106757    Cat
106758    Cat
106764    Dog
106781    Cat
106784    Dog
106786    Dog
106789    Dog
106790    Dog
106792    Dog
106793    Dog
106797    Dog
106798    Cat
106812    Dog
106815    Cat
106825    Dog
106826    Dog
106827    Dog
106829    Cat
106838    Dog
106840    Cat
106843    Dog
106847    Dog
106848    Dog
106849    Cat
106852    Cat
106857    Dog
106880    Cat
106881    Cat
106883    Dog
106890    Dog
Name: Animal Type, Length: 46484, dtype: object

In [156]:
# Subset the data by Outcome Type: Adoption and only return the Animal Type column with only Cat.
shelter_data.loc[(shelter_data['Outcome Type'] == 'Adoption') &
                 (shelter_data['Animal Type'] == 'Cat'),
                 'Animal Type']

6         Cat
7         Cat
8         Cat
9         Cat
26        Cat
28        Cat
31        Cat
43        Cat
58        Cat
59        Cat
60        Cat
65        Cat
66        Cat
67        Cat
68        Cat
69        Cat
70        Cat
78        Cat
99        Cat
120       Cat
123       Cat
127       Cat
130       Cat
143       Cat
144       Cat
153       Cat
154       Cat
155       Cat
156       Cat
157       Cat
         ... 
106644    Cat
106646    Cat
106653    Cat
106659    Cat
106663    Cat
106665    Cat
106672    Cat
106677    Cat
106680    Cat
106684    Cat
106685    Cat
106686    Cat
106688    Cat
106725    Cat
106729    Cat
106731    Cat
106733    Cat
106735    Cat
106745    Cat
106757    Cat
106758    Cat
106781    Cat
106798    Cat
106815    Cat
106829    Cat
106840    Cat
106849    Cat
106852    Cat
106880    Cat
106881    Cat
Name: Animal Type, Length: 17759, dtype: object

In [158]:
# What are the data types returned from the different subsetting? Is what is returned a series or a dataframe?
type(shelter_data.loc[(shelter_data['Outcome Type'] == 'Adoption') &
                 (shelter_data['Animal Type'] == 'Cat'),
                 'Animal Type'])

pandas.core.series.Series
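
For the extra-credit question, the returned type depends on how the column is requested. The sketch below uses a few invented rows that reuse the shelter column names, so it runs without downloading the data.

```python
import pandas as pd

# Invented rows that reuse the column names from the shelter data.
toy = pd.DataFrame({
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption'],
    'Animal Type': ['Cat', 'Cat', 'Dog'],
})
mask = toy['Outcome Type'] == 'Adoption'

as_series = toy.loc[mask, 'Animal Type']      # single label -> Series
as_frame = toy.loc[mask, ['Animal Type']]     # list of labels -> DataFrame
one_cell = toy.loc[0, 'Animal Type']          # two scalars -> plain value

print(type(as_series).__name__, type(as_frame).__name__, type(one_cell).__name__)
```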