# Pandas
We will now begin our discussion of pandas.

## What is Pandas?
We can think of Pandas as "numpy" with labels.
What does this mean?
Let's see an example.

Recall the list of my friends' heights.

In [1]:
heights = [73,72,69,70]
names = ['Jason','Alex','Fayzan','Ethan']

In [2]:
import numpy as np
heights_arr = np.array(heights)
heights_arr

array([73, 72, 69, 70])

This is what we are used to. This is good, but what if we wanted some way to keep track of who those heights belong to?

A plain NumPy array has no built-in way to attach labels to its entries.

Enter pandas.

In [3]:
#literally enter pandas
import pandas as pd

There is a pandas analog for both the 1-d and the 2-d array. Past 2 dimensions there is no pandas analog, but 2 dimensions are sufficient for many datasets.

Let's look at the 1-d array analog, the `Series`.

# The Series

In [4]:
#instead of calling np.array(heights) we'll call pd.Series(heights)
heights_ser = pd.Series(heights)
heights_ser

0    73
1    72
2    69
3    70
dtype: int64

In [5]:
#The same
heights_ser = pd.Series(heights_arr)
heights_ser

0    73
1    72
2    69
3    70
dtype: int64

All the rules about math and broadcasting are the same. Like a numpy array, a Series holds elements of a single data type.

What is different here? The column on the left. This is the index. It doesn't look like much yet, because by default it is numbered exactly the way numpy positions are. There are two ways to get elements by index.
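As a minimal sketch of the math and broadcasting claim (the variable names `heights_in`, `heights_cm` and `doubled` are illustrative and not part of the notebook), a scalar is broadcast across a Series and two Series combine element by element, just like numpy arrays:

```python
import pandas as pd

# Illustrative data mirroring the heights list used above
heights_in = pd.Series([73, 72, 69, 70])

# A scalar is broadcast to every element
heights_cm = heights_in * 2.54

# Arithmetic between two Series works element by element
doubled = heights_in + heights_in

print(heights_cm)
print(doubled)
```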

In [6]:
heights_ser[2]

69

In [7]:
heights_ser.loc[2]

#These are the same

69

In [8]:
heights_ser[1:3]

1    72
2    69
dtype: int64

In [9]:
heights_ser.loc[1:3]
#With .loc the end label (3) is included

1    72
2    69
3    70
dtype: int64
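The difference between the last two cells can be sketched side by side. This example uses `.iloc` as the explicit position-based form and `.loc` as the label-based form; the names `ser`, `positional` and `by_label` are only for illustration:

```python
import pandas as pd

ser = pd.Series([73, 72, 69, 70])

positional = ser.iloc[1:3]  # positions 1 and 2; the stop position is excluded
by_label = ser.loc[1:3]     # labels 1, 2 and 3; the stop label is included

print(len(positional), len(by_label))
```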

In [10]:
#Logical indexing
heights_ser[heights_ser>70]

heights_ser.loc[heights_ser>70]

0    73
1    72
dtype: int64

In [11]:
#Modification
heights_ser[3]=2
heights_ser

0    73
1    72
2    69
3     2
dtype: int64

There is a bit more convenience because the index is displayed and we don't have to manually count positions each time, but this isn't the power of Pandas. What if we want to label each point with a name from a list of names?
We must modify the 'index', which is just a list of labels. Right now our index looks like this.

In [12]:
heights_ser.index
#It's essentially just a python "range"

RangeIndex(start=0, stop=4, step=1)

We can simply change our index to the list of names. First, recall names.

In [13]:
names

['Jason', 'Alex', 'Fayzan', 'Ethan']

In [14]:
heights_ser.index = names

In [15]:
heights_ser

Jason     73
Alex      72
Fayzan    69
Ethan      2
dtype: int64

Now the index column is the list of names, and we can use those names to index. Alternatively, we could have passed the index to the constructor.

In [16]:
heights_ser = pd.Series(heights,index = names)
heights_ser

Jason     73
Alex      72
Fayzan    69
Ethan     70
dtype: int64

In [17]:
heights_ser['Jason']

73

We can also slice.

In [18]:
heights_ser['Jason':'Fayzan']

Jason     73
Alex      72
Fayzan    69
dtype: int64

What if we wanted to index by position, in the typical numpy way?
Use the `iloc` indexer.

In [19]:
heights_ser.iloc[1:3]

Alex      72
Fayzan    69
dtype: int64
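To make the contrast concrete, here is a small sketch that selects the same two friends once by label and once by position (the names `by_label` and `by_position` are illustrative). Note that the label slice includes its end label while the position slice excludes its stop position:

```python
import pandas as pd

heights_ser = pd.Series([73, 72, 69, 70],
                        index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

by_label = heights_ser.loc['Alex':'Fayzan']  # label slice: both ends included
by_position = heights_ser.iloc[1:3]          # position slice: stop excluded

# Both select Alex and Fayzan
print(by_label.equals(by_position))
```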

What if we wanted to access the numpy array hiding underneath? Use the `values` attribute.

In [20]:
heights_ser.values

array([73, 72, 69, 70])

# Addendum: The Series From Dictionary

In addition to thinking of the Series as numpy with labels we also can think of it as a dictionary with extra methods.

In [22]:
heights_dict = {'Jason':73,'Alex':72,'Fayzan':69,'Ethan':70}
heights_dict

{'Jason': 73, 'Alex': 72, 'Fayzan': 69, 'Ethan': 70}

In [23]:
pd.Series(heights_dict)

Jason     73
Alex      72
Fayzan    69
Ethan     70
dtype: int64
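A short sketch of the "dictionary with extra methods" view. The label `'Sam'` is a made-up name that is deliberately absent from the data, used only to show what happens on a missing key:

```python
import pandas as pd

heights_ser = pd.Series({'Jason': 73, 'Alex': 72, 'Fayzan': 69, 'Ethan': 70})

# Membership tests check the index labels, like dictionary keys
has_jason = 'Jason' in heights_ser

# .get returns a default instead of raising when the label is missing
missing = heights_ser.get('Sam', -1)

# .items() yields (label, value) pairs, like dict.items()
pairs = list(heights_ser.items())

print(has_jason, missing, pairs[0])
```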

Don't worry too much about this.

# The DataFrame

Most of the time we will work with data that has more than 1 feature. In the past week, I've mentioned that we'd use a 2-d array or matrix for this. In reality, we'll use a DataFrame. We can think of this as a numpy matrix with labels on both axes. Let's take a look.

In [24]:
heights

[73, 72, 69, 70]

In [25]:
weights = [175,160,150,180]

In [26]:
friends_matrix = np.array([heights,weights]).T
friends_matrix

array([[ 73, 175],
       [ 72, 160],
       [ 69, 150],
       [ 70, 180]])

In [28]:
#Make the dataframe
friends_frame = pd.DataFrame(friends_matrix)
friends_frame

    0    1
0  73  175
1  72  160
2  69  150
3  70  180


But we are missing our labels :(

Let's again change the index to the names.

In [29]:
friends_frame.index 

RangeIndex(start=0, stop=4, step=1)

In [30]:
friends_frame.index = names

In [31]:
friends_frame

          0    1
Jason    73  175
Alex     72  160
Fayzan   69  150
Ethan    70  180


What about the column names? We'd like them to read "Height" and "Weight". How do we check the current column names?

In [32]:
friends_frame.columns

RangeIndex(start=0, stop=2, step=1)

In [33]:
friends_frame.columns = ['Height','Weight']

In [34]:
friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


In [35]:
friends_frame = pd.DataFrame(friends_matrix,index=names,columns=['Height','Weight'])

In [36]:
friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


We also could have constructed the dataframe from a list of lists.

In [None]:
pd.DataFrame([heights,weights])

In [None]:
pd.DataFrame([heights,weights]).T

That's better.
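Why the transpose was needed, as a sketch: each inner list becomes a row. One alternative that avoids the transpose is to pass a dictionary mapping column names to lists (the names `by_rows` and `by_columns` are illustrative):

```python
import pandas as pd

heights = [73, 72, 69, 70]
weights = [175, 160, 150, 180]
names = ['Jason', 'Alex', 'Fayzan', 'Ethan']

# Each inner list becomes a ROW, so this frame is 2 rows by 4 columns
by_rows = pd.DataFrame([heights, weights])

# A dictionary maps column name -> column values, so no transpose is needed
by_columns = pd.DataFrame({'Height': heights, 'Weight': weights}, index=names)

print(by_rows.shape)
print(by_columns.shape)
```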

In [37]:
friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


# Note:
The columns of a DataFrame do not all need to have the same data type. Only the values within each column must share a data type. We'll see this soon.

# Indexing the DataFrame

Indexing the dataframe is similar to indexing the Series.

How would we get the information about Jason?

In [38]:
friends_frame['Jason']
#Doesn't work

KeyError: 'Jason'

We can only use this style of indexing to get a column as a Series.

In [39]:
friends_frame['Height']

Jason     73
Alex      72
Fayzan    69
Ethan     70
Name: Height, dtype: int64

In [40]:
#Another way
friends_frame.Height

Jason     73
Alex      72
Fayzan    69
Ethan     70
Name: Height, dtype: int64

This doesn't work if your column name contains a space, or if the column name is not a string (for example an int or a float).
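A small sketch of that limitation. The frame below is made up for illustration: one column name contains a space and the other is the integer `0`. Bracket indexing reaches both, while attribute access cannot even be written as valid Python:

```python
import pandas as pd

# Hypothetical columns: one name contains a space, the other is an integer
frame = pd.DataFrame({'Shoe Size': [10, 9], 0: [1.5, 2.5]},
                     index=['Jason', 'Alex'])

# Bracket indexing works for any column label
shoe = frame['Shoe Size']
zero = frame[0]

# frame.Shoe Size and frame.0 are syntax errors, so attribute access
# cannot reach these columns
print(shoe)
print(zero)
```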

How would we get information about Jason?

In [41]:
friends_frame.loc['Jason']
#returns a series

Height     73
Weight    175
Name: Jason, dtype: int64

How would we get Jason's height?

In [43]:
friends_frame.loc['Jason','Height']
#The line above evaluates to 73, but a cell only displays its last expression

friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


Again, for numpy style indexing, use `iloc`

In [44]:
friends_frame.iloc[0,0]

73

### Mixed indexing
If you look up documentation you may see the `.ix` indexer used for mixed style indexing (label indexing on one axis and position indexing on the other). `.ix` was deprecated and has since been removed from pandas. Here are two ways to do mixed indexing.

In [None]:
#first select column by text then person by number
friends_frame.Weight.iloc[3]

In [None]:
#first select row by text then feature by number
friends_frame.loc['Ethan'].iloc[0]
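Another option, sketched here as an alternative rather than the notebook's own approach, is to translate one side so that a single `.loc` or `.iloc` call handles both axes. `Index.get_loc` converts a label into its position, and indexing `friends_frame.index` converts a position into its label:

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Translate the position into a label, then use .loc for both axes
weight_by_loc = friends_frame.loc[friends_frame.index[3], 'Weight']

# Translate the label into a position, then use .iloc for both axes
row_pos = friends_frame.index.get_loc('Ethan')
height_by_iloc = friends_frame.iloc[row_pos, 0]

print(weight_by_loc, height_by_iloc)
```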

## Exercise

In [None]:
## Use 4 ways to select Fayzan's height.

# Multiple Indexing: Slicing

All of the indexing methods we have just talked about are compatible with slicing and logical indexing.

In [45]:
friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


In [46]:
friends_frame.iloc[0:2,0:]

       Height  Weight
Jason      73     175
Alex       72     160


In [47]:
friends_frame.loc['Jason':'Alex','Height':'Weight']

       Height  Weight
Jason      73     175
Alex       72     160


# Logical Indexing

In [48]:
friends_frame.loc[friends_frame.Height>70]

       Height  Weight
Jason      73     175
Alex       72     160


In [None]:
friends_frame[friends_frame.Height>70]

In [None]:
friends_frame['Jason':'Fayzan']

# An Odd Rule

The plain brackets (with no `loc` or `iloc`) seem to have odd behavior.

Here is the rule: when we select with a single label, the plain brackets choose columns.

When we do logical indexing or slicing, the plain brackets select rows.

My advice: try to avoid the plain brackets.
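Following that advice, here is a sketch of explicit `.loc` equivalents for the three plain-bracket forms seen above (the result names `one_column`, `tall_rows` and `row_slice` are illustrative):

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# A single label: choose a column explicitly
one_column = friends_frame.loc[:, 'Height']

# A boolean mask: choose rows explicitly
tall_rows = friends_frame.loc[friends_frame['Height'] > 70]

# A label slice: choose rows explicitly (end label included)
row_slice = friends_frame.loc['Jason':'Fayzan']

print(one_column.shape, tall_rows.shape, row_slice.shape)
```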

# A new tool for Logical Indexing

Up to now, we've done logical indexing based on the data. What if we want to select certain rows or columns based on the index labels themselves?

In [49]:
friends_frame

        Height  Weight
Jason       73     175
Alex        72     160
Fayzan      69     150
Ethan       70     180


In [50]:
good_friends = ['Alex','Ethan','Fayzan']

In [51]:
friends_frame.index

Index(['Jason', 'Alex', 'Fayzan', 'Ethan'], dtype='object')

We will now talk about the `.isin` method for indexes and Series.

In [52]:
friends_frame.index.isin(good_friends)
#What did this do?

array([False,  True,  True,  True])

In [53]:
friends_frame.loc[friends_frame.index.isin(good_friends)]

        Height  Weight
Alex        72     160
Fayzan      69     150
Ethan       70     180


In [58]:
#We can also do this for columns
friends_frame.loc[:,friends_frame.columns.isin(good_friends)]
#Why was no data returned?

Empty DataFrame
Columns: []
Index: [Jason, Alex, Fayzan, Ethan]


The `isin` method is one of the most useful tools for this kind of selection, and we'll talk about more ways to do stuff like this later.
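Two small variations on the same idea, sketched under the assumption that the frame and `good_friends` list are as above: the `~` operator flips a boolean mask, and `isin` can also be applied to the values of a column rather than to the index. The heights `[69, 70]` are arbitrary values picked for the demonstration:

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])
good_friends = ['Alex', 'Ethan', 'Fayzan']

# ~ flips the boolean mask: rows whose label is NOT in the list
others = friends_frame.loc[~friends_frame.index.isin(good_friends)]

# isin also works on a column's values (a Series)
height_match = friends_frame.loc[friends_frame['Height'].isin([69, 70])]

print(others)
print(height_match)
```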

# Math in Pandas: Numpy Functions

In [61]:
#check this out
friends_frame
np.sin(friends_frame)

          Height    Weight
Jason  -0.676772 -0.801135
Alex    0.253823  0.219425
Fayzan -0.114785 -0.714876
Ethan   0.773891 -0.801153


We can use numpy's elementwise math functions directly on a DataFrame: they operate only on the underlying data and preserve the indices!
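A quick way to check this claim is sketched below with `np.sqrt` (chosen only as another elementwise function): the result is still a DataFrame and carries the same row and column labels.

```python
import numpy as np
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

roots = np.sqrt(friends_frame)

# Still a DataFrame, with the same labels on both axes
print(type(roots).__name__)
print(roots.index.equals(friends_frame.index))
print(roots.columns.equals(friends_frame.columns))
```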

# Summary Operators

In [63]:
#Even without axis argument we get this
friends_frame.mean()

Height     71.00
Weight    166.25
dtype: float64

By default pandas averages down each column (`axis=0`), which is usually what is most useful for us.

In [64]:
#If we actually wanted the mean of height and weight for each person we could do
friends_frame.mean(axis=1)

Jason     124.0
Alex      116.0
Fayzan    109.5
Ethan     125.0
dtype: float64
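The two directions can be compared in one place. In this sketch `axis=0` (the default) collapses the rows and gives one value per column, while `axis=1` collapses the columns and gives one value per row; the names `col_means` and `row_means` are illustrative:

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

col_means = friends_frame.mean(axis=0)  # one mean per column (the default)
row_means = friends_frame.mean(axis=1)  # one mean per row

print(col_means)
print(row_means)
```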

# Adding Columns

In [65]:
bmi = friends_frame.Weight/(friends_frame.Height**2)*703

In [66]:
#Keeps indices
bmi

Jason     23.085945
Alex      21.697531
Fayzan    22.148708
Ethan     25.824490
dtype: float64

How would we add this to our dataframe? Easy

In [67]:
friends_frame['BMI']=bmi

In [68]:
friends_frame

        Height  Weight        BMI
Jason       73     175  23.085945
Alex        72     160  21.697531
Fayzan      69     150  22.148708
Ethan       70     180  25.824490
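Bracket assignment changes the frame in place. If we would rather keep the original untouched, one alternative (a sketch, not the approach used in this notebook) is `DataFrame.assign`, which returns a new frame with the extra column:

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# assign returns a copy with the new column; friends_frame itself is unchanged
with_bmi = friends_frame.assign(
    BMI=friends_frame['Weight'] / friends_frame['Height'] ** 2 * 703
)

print(with_bmi.columns.tolist())
print(friends_frame.columns.tolist())
```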


# Getting rid of rows/columns

In [71]:
#drop a row
friends_frame.drop('Alex')

        Height  Weight        BMI
Jason       73     175  23.085945
Fayzan      69     150  22.148708
Ethan       70     180  25.824490


In [70]:
#returns a copy
friends_frame

        Height  Weight        BMI
Jason       73     175  23.085945
Alex        72     160  21.697531
Fayzan      69     150  22.148708
Ethan       70     180  25.824490


In [None]:
#to actually do it, assign the result back (not run here, so Jason stays in the frame)
friends_frame = friends_frame.drop('Jason')

In [72]:
#drop rows
friends_frame.drop(['Jason','Ethan'])

        Height  Weight        BMI
Alex        72     160  21.697531
Fayzan      69     150  22.148708


In [73]:
#drop columns, the axis argument
friends_frame.drop('Weight',axis=1)

        Height        BMI
Jason       73  23.085945
Alex        72  21.697531
Fayzan      69  22.148708
Ethan       70  25.824490
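If the `axis` argument is hard to remember, `drop` also accepts the `index=` and `columns=` keywords, which say directly what is being removed. A minimal sketch (the result names are illustrative):

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70],
                              'Weight': [175, 160, 150, 180]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

no_weight = friends_frame.drop(columns='Weight')  # same as drop('Weight', axis=1)
no_alex = friends_frame.drop(index='Alex')        # same as drop('Alex')

# Both calls return copies; the original frame keeps every row and column
print(no_weight.columns.tolist())
print(no_alex.index.tolist())
print(friends_frame.shape)
```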


# Missing Data

In [75]:
eye_color = pd.Series(['Brown','Brown','Blue'],index=['Jason','Fayzan','Ethan'])
eye_color

Jason     Brown
Fayzan    Brown
Ethan      Blue
dtype: object

In [76]:
friends_frame['Eye_Color']=eye_color

In [77]:
friends_frame

        Height  Weight        BMI Eye_Color
Jason       73     175  23.085945     Brown
Alex        72     160  21.697531       NaN
Fayzan      69     150  22.148708     Brown
Ethan       70     180  25.824490      Blue


In [None]:
#Notice the NaN in Alex's eye color. This is because we do not have the data there so pandas reserves a spot for it.
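The reason the spot is reserved is that pandas lines values up by index label, not by position. A small sketch with the Series deliberately written in a different order than the frame (the ordering is changed only for this demonstration):

```python
import pandas as pd

friends_frame = pd.DataFrame({'Height': [73, 72, 69, 70]},
                             index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Same people as above but in a different order, and still no entry for Alex
eye_color = pd.Series(['Blue', 'Brown', 'Brown'],
                      index=['Ethan', 'Jason', 'Fayzan'])

# Assignment lines values up by label, not by position
friends_frame['Eye_Color'] = eye_color

print(friends_frame)
```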

What do we do with missing data? If we know his eye color, we can just fill in that spot.

In [78]:
friends_frame.loc['Alex','Eye_Color']='Blue'
friends_frame

        Height  Weight        BMI Eye_Color
Jason       73     175  23.085945     Brown
Alex        72     160  21.697531      Blue
Fayzan      69     150  22.148708     Brown
Ethan       70     180  25.824490      Blue


What if we didn't know the value? There are a few things we can do. It is very task dependent.

In [80]:
#let's say we didn't know Fayzan's height for some reason
#You will almost never blank out a value on purpose like this
friends_frame.loc['Fayzan','Height']=None
friends_frame

        Height  Weight        BMI Eye_Color
Jason     73.0     175  23.085945     Brown
Alex      72.0     160  21.697531      Blue
Fayzan     NaN     150  22.148708     Brown
Ethan     70.0     180  25.824490      Blue


In [82]:
#Maybe we just want to get rid of the Fayzan row
#makes a copy
friends_frame.dropna()

       Height  Weight        BMI Eye_Color
Jason    73.0     175  23.085945     Brown
Alex     72.0     160  21.697531      Blue
Ethan    70.0     180  25.824490      Blue


In [83]:
#Maybe we just want to drop the height column
#makes a copy
friends_frame.dropna(axis=1)

        Weight        BMI Eye_Color
Jason      175  23.085945     Brown
Alex       160  21.697531      Blue
Fayzan     150  22.148708     Brown
Ethan      180  25.824490      Blue


In [84]:
#Fill every missing value with a constant (here 1)
friends_frame.fillna(1)

        Height  Weight        BMI Eye_Color
Jason     73.0     175  23.085945     Brown
Alex      72.0     160  21.697531      Blue
Fayzan     1.0     150  22.148708     Brown
Ethan     70.0     180  25.824490      Blue


In [86]:
#You can broadcast a fillna call. This is a very common way to fill NaNs
#let's say we also didn't know Jason's weight
friends_frame.loc['Jason','Weight']=None
friends_frame

        Height  Weight        BMI Eye_Color
Jason     73.0     NaN  23.085945     Brown
Alex      72.0   160.0  21.697531      Blue
Fayzan     NaN   150.0  22.148708     Brown
Ethan     70.0   180.0  25.824490      Blue


In [87]:
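#Fill the NaNs with the mean of each column
#numeric_only=True skips the text column (newer pandas raises an error without it)
friends_frame.mean(numeric_only=True)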

Height     71.666667
Weight    163.333333
BMI        23.189168
dtype: float64

In [88]:
friends_frame.fillna(friends_frame.mean(numeric_only=True))

           Height      Weight        BMI Eye_Color
Jason   73.000000  163.333333  23.085945     Brown
Alex    72.000000  160.000000  21.697531      Blue
Fayzan  71.666667  150.000000  22.148708     Brown
Ethan   70.000000  180.000000  25.824490      Blue


Things I've done?

Filled with 0s, or with column means or row means. Used ML or other linear algebra techniques to attempt to predict the value. Dropped those rows/columns. In the case of time series, back filled and forward filled.
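The time series option can be sketched with `ffill` and `bfill`. The dates and readings below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two gaps
readings = pd.Series([1.0, np.nan, 3.0, np.nan],
                     index=pd.date_range('2024-01-01', periods=4))

forward = readings.ffill()   # carry the last known value forward
backward = readings.bfill()  # pull the next known value backward

print(forward)
print(backward)
```

Note that back filling leaves the final gap in place here, because there is no later value to pull from.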


# Detecting NaNs

In [91]:
friends_frame.isnull()

        Height  Weight    BMI  Eye_Color
Jason    False    True  False      False
Alex     False   False  False      False
Fayzan    True   False  False      False
Ethan    False   False  False      False


In [96]:
friends_frame.isnull().sum()

Height       1
Weight       1
BMI          0
Eye_Color    0
dtype: int64

# Operations on Missing Data

In [97]:
friends_frame.sum()
#NaNs are simply ignored

Height                      215
Weight                      490
BMI                     92.7567
Eye_Color    BrownBlueBrownBlue
dtype: object

In [98]:
friends_frame.mean(numeric_only=True)
#NaNs are ignored in both the sum and the count

Height     71.666667
Weight    163.333333
BMI        23.189168
dtype: float64
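Skipping NaNs is the default, controlled by the `skipna` argument. A minimal sketch on a standalone Series (the variable names are illustrative) showing what changes when it is turned off:

```python
import numpy as np
import pandas as pd

heights = pd.Series([73.0, 72.0, np.nan, 70.0])

total_skipping = heights.sum()            # NaN skipped: 73 + 72 + 70
total_strict = heights.sum(skipna=False)  # NaN propagates into the result
mean_skipping = heights.mean()            # divides by 3 values, not 4

print(total_skipping, total_strict, mean_skipping)
```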

In [99]:
friends_frame

        Height  Weight        BMI Eye_Color
Jason     73.0     NaN  23.085945     Brown
Alex      72.0   160.0  21.697531      Blue
Fayzan     NaN   150.0  22.148708     Brown
Ethan     70.0   180.0  25.824490      Blue
