# Pandas
We will now begin our discussion of pandas.

## What is Pandas?
We can think of pandas as "numpy" with labels.
What does this mean?
Let's see an example.

Recall the list of my friends' heights.

In [None]:
heights = [73,72,69,70]
names = ['Jason','Alex','Fayzan','Ethan']

In [None]:
import numpy as np
heights_arr = np.array(heights)
heights_arr

This is what we are used to. This is good, but what if we wanted some way to keep track of who each height belongs to?

A plain NumPy array has no built-in way to attach labels like this.

Enter pandas.

In [None]:
#literally enter pandas
import pandas as pd

There is a pandas analog for both the 1-d and the 2-d array. Past 2 dimensions there is no pandas analog, but 2 dimensions is sufficient for many datasets.

Let's look at the 1-d array analog, the `Series`.

# The Series

In [None]:
#instead of calling np.array(heights) we'll call pd.Series(heights)
heights_ser = pd.Series(heights)
heights_ser

In [None]:
#The same
heights_ser = pd.Series(heights_arr)
heights_ser

All the rules about math and broadcasting are the same. Like the numpy array, a Series has a single data type shared by all of its elements.
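As a minimal sketch of that claim, the block below applies the same arithmetic to a NumPy array and to a Series built from it. The conversion factor and the variable names `in_cm_arr`, `in_cm_ser`, and `doubled` are only illustrative.

```python
import numpy as np
import pandas as pd

heights_arr = np.array([73, 72, 69, 70])
heights_ser = pd.Series(heights_arr)

# A scalar is broadcast across every element, just as in NumPy
in_cm_arr = heights_arr * 2.54
in_cm_ser = heights_ser * 2.54

# Elementwise arithmetic between two Series of the same length
doubled = heights_ser + heights_ser

# A Series reports one dtype for all of its elements
print(heights_ser.dtype)
print(in_cm_ser.dtype)
```

Multiplying the integer Series by a float converts the whole result to a float dtype, the same way it does for the array.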

What is different here? The column on the left. This is the index. It doesn't look like much yet, since it works exactly the same way as numpy indexing. There are two ways to get elements by index.

In [None]:
heights_ser[2]

In [None]:
heights_ser.loc[2]

#These are the same

In [None]:
heights_ser[1:3]

In [None]:
heights_ser.loc[1:3]
#With .loc the end label is included, so this returns labels 1, 2 and 3
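The difference between the two slices is easy to miss, so here is a small sketch that puts them side by side. It uses `iloc` for the positional slice so the intent is explicit. As far as I know, the plain-bracket slice above behaves the same way as `iloc` on this default index.

```python
import pandas as pd

heights_ser = pd.Series([73, 72, 69, 70])

# Positional slice: the stop position is excluded, so positions 1 and 2
by_position = heights_ser.iloc[1:3]

# Label slice: the stop label is included, so labels 1, 2 and 3
by_label = heights_ser.loc[1:3]

print(by_position.tolist())
print(by_label.tolist())
```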

In [None]:
#Logical indexing
heights_ser[heights_ser>70]

heights_ser.loc[heights_ser>70]

In [None]:
#Modification
heights_ser[3]=2
heights_ser

There is a bit more convenience because we don't have to manually count each time, but this isn't the power of pandas. What if we want to label each point with a name from a list?
We must modify the index, which is just a list of labels. Right now our index looks like this.

In [None]:
heights_ser.index
#It's essentially just a python "range"

We can simply change our index to the list of names. First, recall names.

In [None]:
names

In [None]:
heights_ser.index = names

In [None]:
heights_ser

Now the index column is the list of names, and we can use those names to index. Alternatively, we could have passed the index to the constructor.

In [None]:
heights_ser = pd.Series(heights,index = names)
heights_ser

In [None]:
heights_ser['Jason']

We can also slice.

In [None]:
heights_ser['Jason':'Fayzan']

What if we wanted to index in the typical numpy way, by position?
Use the `iloc` indexer.

In [None]:
heights_ser.iloc[1:3]

What if we wanted to access the numpy array hiding underneath? Use the `values` attribute.

In [None]:
heights_ser.values
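To tie the three access styles together, here is a short sketch that selects the same two friends by label and by position, then pulls out the underlying array. `to_numpy()` is used as an alternative to `.values`. The names `by_label`, `by_position`, and `raw` are only illustrative.

```python
import numpy as np
import pandas as pd

heights_ser = pd.Series([73, 72, 69, 70],
                        index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Label slice: both endpoints are included
by_label = heights_ser.loc['Alex':'Fayzan']

# Positional slice: the stop position is excluded
by_position = heights_ser.iloc[1:3]

# The underlying numpy array, without the labels
raw = heights_ser.to_numpy()

print(by_label)
print(raw)
```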

# Addendum: The Series From Dictionary

In addition to thinking of the Series as numpy with labels, we can also think of it as a dictionary with extra methods.

In [None]:
heights_dict = {'Jason':73,'Alex':72,'Fayzan':69,'Ethan':70}
heights_dict

In [None]:
pd.Series(heights_dict)

Don't worry too much about this.
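For the curious, here is a rough sketch of what "a dictionary with extra methods" looks like in practice. The lookup key `'Nobody'` and the default `-1` are made up for illustration.

```python
import pandas as pd

heights_dict = {'Jason': 73, 'Alex': 72, 'Fayzan': 69, 'Ethan': 70}
heights_ser = pd.Series(heights_dict)

# Dictionary-style behavior: membership tests look at the labels, not the values
has_alex = 'Alex' in heights_ser
has_73 = 73 in heights_ser

# Dictionary-style lookup, with a default for a missing key
fayzan = heights_ser['Fayzan']
missing = heights_ser.get('Nobody', -1)
labels = list(heights_ser.keys())

# The extra methods a plain dict does not have
average = heights_ser.mean()
tallest = heights_ser.idxmax()

print(has_alex, has_73, fayzan, missing, average, tallest)
```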

# The DataFrame

Most of the time we will work with data that has more than 1 feature. In the past week, I've mentioned that we'd use a 2-d array or matrix for this. In reality, we'll use a DataFrame. We can think of this as a numpy matrix with labels on both axes. Let's take a look.

In [None]:
heights

In [None]:
weights = [175,160,150,180]

In [None]:
friends_matrix = np.array([heights,weights]).T
friends_matrix

In [None]:
#Make the dataframe
friends_frame = pd.DataFrame(friends_matrix)
friends_frame

But we are missing our labels :(

Let's again change the index to the names.

In [None]:
friends_frame.index 

In [None]:
friends_frame.index = names

In [None]:
friends_frame

What about the column names? We'd like them to read "Height" and "Weight". How do we check the current column names?

In [None]:
friends_frame.columns

In [None]:
friends_frame.columns = ['Height','Weight']

In [None]:
friends_frame

In [None]:
friends_frame = pd.DataFrame(friends_matrix,index=names,columns=['Height','Weight'])

In [None]:
friends_frame

We also could have constructed the dataframe from a list of lists.

In [None]:
pd.DataFrame([heights,weights])

In [None]:
pd.DataFrame([heights,weights]).T

That's better.
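Another option, sketched here as an alternative to the list-of-lists approach, is to pass a dictionary of columns. Each key becomes a column name, so the transpose and the separate column-renaming step are not needed.

```python
import pandas as pd

heights = [73, 72, 69, 70]
weights = [175, 160, 150, 180]
names = ['Jason', 'Alex', 'Fayzan', 'Ethan']

# Each dict key becomes a column name and each list becomes that column
friends_frame = pd.DataFrame({'Height': heights, 'Weight': weights},
                             index=names)
print(friends_frame)
```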

In [None]:
friends_frame

# Note:
The columns of a DataFrame do not all need to share one data type. Only the values within each column must have the same data type. We'll see this soon.
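As a quick preview, the sketch below builds a small frame with an integer column, a float column, and a text column, then checks the type of each. The weights are written as floats here purely to get a third dtype into the example.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70],                      # integers
     'Weight': [175.0, 160.0, 150.0, 180.0],          # floats
     'Eye_Color': ['Brown', 'Blue', 'Brown', 'Blue']},  # text
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# One dtype per column, but the columns differ from each other
print(friends_frame.dtypes)
```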

# Indexing the DataFrame

Indexing the dataframe is similar to indexing the Series.

How would we get the information about Jason?

In [None]:
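friends_frame['Jason']
#Doesn't work: this raises a KeyError because plain brackets look for a column named 'Jason'

With a single label, this style of indexing can only select a column, which comes back as a Series.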

In [None]:
friends_frame['Height']

In [None]:
#Another way
friends_frame.Height

This doesn't work if your column name has a space anywhere. It also doesn't work if the column name is not a string (an int or float, for example).

How would we get information about Jason?

In [None]:
friends_frame.loc['Jason']
#returns a series

How would we get Jason's height?

In [None]:
friends_frame.loc['Jason','Height']

friends_frame

Again, for numpy style indexing, use `iloc`

In [None]:
friends_frame.iloc[0,0]

### Mixed indexing. 
If you look up older documentation you may see the `.ix` indexer used for mixed-style indexing (text indexing for one axis and numbered indexing for the other). `.ix` was deprecated and has since been removed from pandas, so don't rely on it. Here are two ways to do mixed indexing.

In [None]:
#first select column by text then person by number
friends_frame.Weight.iloc[3]

In [None]:
#first select row by text then feature by number
friends_frame.loc['Ethan'].iloc[0]
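A third option, offered here as a sketch and not as the one recommended way, is to translate between labels and positions so that a single `loc` or `iloc` call does the whole lookup. It relies on `columns[...]` to turn a position into a label and `get_loc` to turn a label into a position.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Row by label, column by position: turn the position into a label for .loc
ethan_height = friends_frame.loc['Ethan', friends_frame.columns[0]]

# Row by position, column by label: turn the label into a position for .iloc
last_weight = friends_frame.iloc[3, friends_frame.columns.get_loc('Weight')]

print(ethan_height, last_weight)
```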

## Exercise

In [None]:
## Use 4 ways to select Fayzan's height.

# Multiple Indexing: Slicing

All of the indexing methods we have just talked about are compatible with slicing and logical indexing.

In [None]:
friends_frame

In [None]:
friends_frame.iloc[0:2,0:]

In [None]:
friends_frame.loc['Jason':'Alex','Height':'Weight']

# Logical Indexing

In [None]:
friends_frame.loc[friends_frame.Height>70]

In [None]:
friends_frame[friends_frame.Height>70]

In [None]:
friends_frame['Jason':'Fayzan']

# An Odd Rule

The brackets with no methods seem to have odd behavior.

Here is the rule: when we select with a single label, the brackets with no method choose a column.

When we do logical indexing or slicing, the brackets with no method select rows.

My advice: try to avoid the brackets with no method.
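To make the rule concrete, the sketch below runs each plain-bracket form next to an explicit `.loc` version and checks that the two agree. The variable names are only illustrative.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Single label: plain brackets pick a COLUMN
col_plain = friends_frame['Height']
col_loc = friends_frame.loc[:, 'Height']

# Slice: plain brackets pick ROWS
rows_plain = friends_frame['Jason':'Alex']
rows_loc = friends_frame.loc['Jason':'Alex']

# Boolean mask: plain brackets pick ROWS
mask = friends_frame['Height'] > 70
tall_plain = friends_frame[mask]
tall_loc = friends_frame.loc[mask]

print(col_plain.equals(col_loc),
      rows_plain.equals(rows_loc),
      tall_plain.equals(tall_loc))
```

The `.loc` versions say which axis is meant, which is the main reason to prefer them.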

# A new tool for Logical Indexing

Up to now, we've done logical indexing based on the data. What if we want to select certain rows or columns based on the index themselves?

In [None]:
friends_frame

In [None]:
good_friends = ['Alex','Ethan','Fayzan']

In [None]:
friends_frame.index

We will now talk about the `.isin` method, which is available on both indexes and Series.

In [None]:
friends_frame.index.isin(good_friends)
#What did this do?

In [None]:
friends_frame.loc[friends_frame.index.isin(good_friends)]

In [None]:
#We can also do this for columns
friends_frame.loc[:,friends_frame.columns.isin(good_friends)]
#Why was no data returned?

The `isin` method is one of the most useful tools for this kind of selection, and we'll talk about more ways to do things like this later.
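Here is a small sketch of what `isin` returns in each case: a boolean mask with one entry per label or value, which is then handed to `.loc`. The lists `['Height']` and `[72, 73]` are made up for illustration.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])
good_friends = ['Alex', 'Ethan', 'Fayzan']

# One True/False per row label
row_mask = friends_frame.index.isin(good_friends)

# One True/False per column label: no column is named after a friend,
# so every entry is False and the selection has zero columns
col_mask = friends_frame.columns.isin(good_friends)
no_columns = friends_frame.loc[:, col_mask]

# With a list that matches real column names, columns do come back
height_only = friends_frame.loc[:, friends_frame.columns.isin(['Height'])]

# isin also works on the values of a Series
is_72_or_73 = friends_frame['Height'].isin([72, 73])

print(row_mask)
print(col_mask)
```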

# Math in Pandas: Numpy Functions

In [None]:
#check this out
friends_frame
np.sin(friends_frame)

We can use any numpy math function on a DataFrame. It operates only on the underlying data and preserves the index and column labels!
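A quick way to convince yourself of this is to compare the result against the same function applied to the bare array. This sketch uses `np.sqrt` in place of `np.sin`, only because the output is easier to read.

```python
import numpy as np
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# The numpy function returns a DataFrame, not a bare array
rooted = np.sqrt(friends_frame)

# Same numbers as applying the function to the underlying array
same_numbers = np.allclose(rooted.to_numpy(),
                           np.sqrt(friends_frame.to_numpy()))

print(type(rooted).__name__, same_numbers)
print(rooted)
```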

# Summary Operators

In [None]:
#Even without an axis argument we get the mean of each column
friends_frame.mean()

Pandas is simply doing what it thinks is the most useful for us.

In [None]:
#If we actually wanted the mean across weight and height for each person we could do
friends_frame.mean(axis=1)
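The two directions are easy to mix up, so here is a sketch that computes both on the same small frame. Averaging a height with a weight is not meaningful, it just shows which way `axis=1` runs.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Default (axis=0): collapse the rows, one result per column
per_column = friends_frame.mean()

# axis=1: collapse the columns, one result per row
per_row = friends_frame.mean(axis=1)

print(per_column)
print(per_row)
```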

# Adding Columns

In [None]:
bmi = friends_frame.Weight/(friends_frame.Height**2)*703

In [None]:
#Keeps indices
bmi

How would we add this to our dataframe? Easy

In [None]:
friends_frame['BMI']=bmi

In [None]:
friends_frame

# Getting rid of rows/columns

In [None]:
#drop a row
friends_frame.drop('Alex')

In [None]:
#drop returns a copy, so the original is unchanged
friends_frame

In [None]:
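#to actually keep the result, assign it to a name
#(a new name here, so friends_frame still has Jason for the cells below)
friends_without_jason = friends_frame.drop('Jason')
friends_without_jason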

In [None]:
#drop rows
friends_frame.drop(['Jason','Ethan'])

In [None]:
#drop columns, the axis argument
friends_frame.drop('Weight',axis=1)
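The sketch below collects the `drop` variants in one place and confirms that the original frame is left alone. It also shows the `columns=` keyword, which is an alternative spelling of `axis=1`.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Dropping rows returns a new frame
without_alex = friends_frame.drop('Alex')
without_two = friends_frame.drop(['Jason', 'Ethan'])

# Two equivalent ways to drop a column
no_weight_axis = friends_frame.drop('Weight', axis=1)
no_weight_kw = friends_frame.drop(columns='Weight')

# The original still has every row and column
print(friends_frame.shape)
```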

# Missing Data

In [None]:
eye_color = pd.Series(['Brown','Brown','Blue'],index=['Jason','Fayzan','Ethan'])
eye_color

In [None]:
friends_frame['Eye_Color']=eye_color

In [None]:
friends_frame

In [None]:
#Notice the NaN in Alex's eye color. This is because we do not have the data there so pandas reserves a spot for it.
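What happened above is label alignment: pandas matches the new Series to the frame by index label, not by position. The sketch below makes this visible by deliberately listing the eye colors in a different order from the frame.

```python
import pandas as pd

friends_frame = pd.DataFrame(
    {'Height': [73, 72, 69, 70], 'Weight': [175, 160, 150, 180]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Same data as before, but listed in a different order and without Alex
eye_color = pd.Series(['Blue', 'Brown', 'Brown'],
                      index=['Ethan', 'Jason', 'Fayzan'])

# Values are matched to rows by label; Alex has no match and gets a missing value
friends_frame['Eye_Color'] = eye_color
print(friends_frame)
```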

What do we do with missing data? If we knew his eye color, we could just fill in that spot.

In [None]:
friends_frame.loc['Alex','Eye_Color']='Blue'
friends_frame

What if we didn't know the value? There are a few things we can do. It is very task dependent.

In [None]:
#let's say we didn't know Fayzan's height for some reason
#You will almost never do this
friends_frame.loc['Fayzan','Height']=None
friends_frame

In [None]:
#Maybe we just want to get rid of the Fayzan row
#makes a copy
friends_frame.dropna()

In [None]:
#Maybe we just want to drop the height column
#makes a copy
#axis is passed by keyword: recent pandas versions reject a positional argument here
friends_frame.dropna(axis=1)

In [None]:
friends_frame.fillna(1)

In [None]:
#You can broadcast a fillna call. This is a very common way to fill NaNs
#let's say we also didn't know Jason's weight
friends_frame.loc['Jason','Weight']=None
friends_frame

In [None]:
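#Fill the NaNs with the mean of each column
#numeric_only=True skips the text column (Eye_Color), which has no mean
friends_frame.mean(numeric_only=True)

In [None]:
friends_frame.fillna(friends_frame.mean(numeric_only=True))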

Things I've done:

Filled with 0s, filled with column means or row means. Used ML or other linear algebra techniques to attempt to predict the value. Dropped those rows or columns. In the case of time series, back-filled and forward-filled.
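A few of these strategies can be sketched in a handful of lines. The `readings` series and its dates are made up purely to show forward and backward filling. The prediction-based approaches are not shown.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'Height': [73.0, 72.0, np.nan, 70.0],
     'Weight': [np.nan, 160.0, 150.0, 180.0]},
    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

# Fill every gap with 0
with_zeros = frame.fillna(0)

# Fill each gap with the mean of its own column
with_col_means = frame.fillna(frame.mean())

# Hypothetical daily time series with a two-day gap
readings = pd.Series([1.0, np.nan, np.nan, 4.0],
                     index=pd.date_range('2024-01-01', periods=4, freq='D'))

# Forward fill carries the last known value ahead,
# backward fill pulls the next known value back
forward = readings.ffill()
backward = readings.bfill()

print(with_col_means)
print(forward.tolist(), backward.tolist())
```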


# Detecting NaNs

In [None]:
friends_frame.isnull()

In [None]:
friends_frame.isnull().sum()

# Operations on Missing Data

In [None]:
friends_frame.sum()
#NaNs are simply ignored

In [None]:
friends_frame.mean(numeric_only=True)
#NaNs are ignored both when summing and when counting how many values to divide by

In [None]:
friends_frame
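To see exactly what "ignored" means, the sketch below works through one column with a single missing value. The `skipna=False` call is included to show what happens when the NaN is not skipped.

```python
import numpy as np
import pandas as pd

heights = pd.Series([73.0, 72.0, np.nan, 70.0],
                    index=['Jason', 'Alex', 'Fayzan', 'Ethan'])

n_missing = heights.isnull().sum()   # how many values are missing
n_present = heights.count()          # count() only counts non-missing values

total = heights.sum()                # 73 + 72 + 70, the NaN is skipped
average = heights.mean()             # total divided by 3, not by 4

# Opting out of the skipping makes the NaN propagate
strict_total = heights.sum(skipna=False)

print(n_missing, n_present, total, average, strict_total)
```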