# Pandas P1 Continued

![more_pandas](https://media.giphy.com/media/KyBX9ektgXWve/giphy.gif)

In [None]:
# You will get very used to these imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


Learning Goals:

1. Learn to interact with and manipulate dataframe columns
2. Learn to interact with and manipulate dataframe row indices
3. Identify and deal with N/A values
4. Visualize data using built-in dataframe methods and MPL

There are several well-worn datasets you will come to know: the iris dataset, the Boston housing dataset, and the heart dataset. In this notebook, we will look at the Titanic dataset. As a teaching tool it is a bit macabre, since the task is predicting survival on the ill-fated ship, but it is still very useful.

![leo_titanic](https://media.giphy.com/media/XOY5y7YXjTD7q/giphy.gif)

In [None]:
# The data is in the csv file called titanic.csv
# create a dataframe object using it, and look at the head to start getting familiar with its structure

df = pd.read_csv('titanic.csv')


## 1. Learn to interact with and manipulate dataframe columns

Let's take a look at the head of the data frame and the shape, just to get a quick overview.

In [None]:
df.head()

In [None]:
df.shape

### Quick knowledge check
We always want to be aware of what a row represents. 

What does each row in the dataframe represent? 

In [None]:
# Type answer here

As with most things in code, there are several ways to view columns.

The first way is to look at the columns attribute of the dataframe.

In [None]:
# We are getting familiar with dataframe attributes: .shape and now .columns
df.columns

In [None]:
# We can confirm that the number of columns matches the second element of the shape attribute

len(df.columns) == df.shape[1]

A second way to see the columns is to pass the dataframe to the built-in list() function:

In [None]:
list(df)

Consider the situation where you want to rename a column in the dataframe. Let's say you are getting tired of remembering that SibSp refers to siblings and spouses. We can rename it like so:

In [None]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1) # Axis tells the rename method to look for SibSp along the columns axis

Great. Now print out the head of the df

In [None]:
df.head()

Looks like the change did not register: the column is still named SibSp.  
A finicky thing about Pandas is the use of inplace. By default, rename returns a new dataframe and leaves the original untouched.  
In order for the object to be transformed in memory, we need to set the inplace parameter to True.

In [None]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1, inplace=True)

In [None]:
df.head()
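To make the inplace behavior concrete, here is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the Titanic file). It shows that rename without inplace hands back a modified copy, and that reassigning the result is an alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Fare": [7.25, 71.28, 8.05]})

# Without inplace=True, rename returns a new DataFrame and leaves toy untouched
renamed = toy.rename({"SibSp": "siblings_and_spouses"}, axis=1)
original_names = list(toy)
renamed_names = list(renamed)

print(original_names)  # still the old names
print(renamed_names)   # the renamed copy

# Reassigning the result is an alternative to inplace=True
toy = toy.rename(columns={"SibSp": "siblings_and_spouses"})
print(list(toy))
```

Either style works; the important part is to pick one and not forget that a bare call on its own line changes nothing.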

We can also change multiple columns at once with a dictionary:

In [None]:
# Parch counts parents and children aboard; it is a count, not a ratio
df.rename(columns = {'Parch': 'parents_and_children', 'Pclass': 'ticket_class'}, inplace=True)

In [None]:
df.head()

We can also interact directly with the .columns attribute


In [None]:
df.columns = [name.lower() for name in list(df)]
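The list comprehension above is one way to rewrite column names in bulk. As an alternative sketch, the columns Index also exposes vectorized string methods through `.str`. The toy frame and its column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with mixed-case names and a space
toy = pd.DataFrame({"PassengerId": [1, 2], "Ticket Class": [3, 1]})

# Lowercase every name, then swap spaces for underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_")
print(list(toy))
```

Chaining `.str` calls reads well when several clean-up steps are needed; the comprehension is just as valid for a single step.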

If we find a column is not useful, we can drop columns with the drop method.



In [None]:
df.drop('name', axis=1, inplace=True)

### Exercise
Write code below to drop three columns at once.  Don't include `inplace=True`.

In [None]:
# your answer here

## 2. Learn to interact with and manipulate dataframe row indices


Row indices are an attribute of a dataframe just as columns are.

In [None]:
# This is a RangeIndex object, which can be iterated over
df.index

The index can be set in the same way as columns:

In [None]:
# Note they are the same length
df.index = range(1000, 1891)
df.index
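The note about matching lengths matters because pandas rejects an index whose length differs from the number of rows. A small sketch on a made-up three-row frame (the `toy` data and the `mismatch_error` name are only for illustration):

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({"age": [22.0, 38.0, 26.0]})

# Three rows, three labels: the assignment succeeds
toy.index = range(1000, 1003)
print(toy.index)

# Three rows, two labels: pandas raises a ValueError
mismatch_error = None
try:
    toy.index = range(2)
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```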

We can also reset the index:

In [None]:
df.reset_index(inplace=True, drop=True)

In [None]:
df.head()
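The `drop=True` argument used above decides what happens to the old index labels. A minimal sketch on a made-up frame, comparing the two settings:

```python
import pandas as pd

# Hypothetical frame with a custom index
toy = pd.DataFrame({"fare": [7.25, 71.28]}, index=[1000, 1001])

kept = toy.reset_index()              # old labels move into a new column named "index"
dropped = toy.reset_index(drop=True)  # old labels are discarded

print(list(kept))
print(list(dropped))
```

Leaving `drop` at its default is handy when the old index holds real information, such as an ID, that you want to keep as a column.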

### Exercise
Use the set_index() dataframe method to set passengerid to the index of the dataframe

In [None]:
# Your code here

## 3. Identify and deal with N/A values

NA (not available) values are a constant annoyance.  They can mess up our code and our analysis.  One of the first steps of EDA you will perform is checking whether your data has NAs.  

Fittingly for the event it describes, the Titanic dataset has many NA values. 

We can see that in a few ways, first using info().

In [None]:
df.info()

## Knowledge Check: From the above info() output, which columns have na's? How can you tell?


Your answer here  

Another way to see na's is with the **isna()** method

In [None]:
df.isna()

More usefully, we can sum the values which are na:

In [None]:
df.isna().sum()
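Counts are easier to judge next to the total number of rows. Because `isna()` returns booleans, taking the mean instead of the sum gives the share of missing values per column. A sketch on a made-up frame (values are illustrative, not taken from titanic.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": [None, None, None, "C85"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# True counts as 1 and False as 0, so the mean is the fraction missing
missing_share = toy.isna().mean()
print(missing_share.sort_values(ascending=False))
```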

## Dealing with na's


One way to deal with na's is by dropping rows that have them:


In [None]:
df.dropna()

Let's explore what happened there. Since we didn't include inplace=True, we can run the same code with some additions to see the difference:

In [None]:
df.dropna().info()

### Knowledge check
How did dropna affect the dataframe?  Why did it remove so many rows?

In [None]:
# your answer here

Calling dropna without parameters reduced our data significantly, which is a problem: when we get to modeling, our model performance will rely heavily on having enough data.

Let's add a parameter to dropna:

In [None]:
list(df)

In [None]:
df.dropna(subset=['embarked'], inplace=True)

In [None]:
# Now there are only two columns with na values
df.info()
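To see how the dropna parameters change the outcome, here is a sketch on a made-up four-row frame. It compares the default call with `subset` and with `thresh`, which keeps rows that have at least that many non-NA values. The data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the second row is complete
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "cabin": [None, "C85", None, None],
    "embarked": ["S", "C", None, "S"],
})

n_default = len(toy.dropna())                     # drop rows with an NA in any column
n_subset = len(toy.dropna(subset=["embarked"]))   # only look at embarked
n_thresh = len(toy.dropna(thresh=2))              # keep rows with at least 2 non-NA values

print(n_default, n_subset, n_thresh)
```

The default is the strictest option, which is why one sparse column such as cabin can wipe out most of a dataset.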

You will find that data preprocessing presents you with many paths to follow, and you will have to choose among them.

For now let's make the choice to drop cabin, since it has so many nulls:

In [None]:
df.drop('cabin', axis=1, inplace=True)

With age, let's be a bit more creative, and impute the mean. This is a common method.

In [None]:
df.age.mean()

##  Short Exercise
Using the fillna() method, write code below to fill the na's in age with the mean of age.

In [None]:
# Your code here

In [None]:
# Run df.info() to check that you have no more na's.
df.info()

## 4. Visualize data using built-in dataframe methods and MPL

Dataframes have some built-in methods for visualization.  


## Hist

For example, a very useful one is hist(), which will display histograms of each numeric field.


In [None]:
df.hist(figsize=(10,10));

## Boxplot

Another very useful method is boxplot.  One use of boxplot is to quickly see whether there are outliers.

In [None]:
df.boxplot()


These methods use matplotlib under the hood, so we can use the same methods we learned before to alter our plots.  Let's use our knowledge of matplotlib to rotate the xticks 45 degrees.

In [None]:
# Your code here
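One way to convince yourself that matplotlib is doing the drawing: with default arguments, `boxplot()` returns a matplotlib Axes object, so the usual Axes methods are available on it. The sketch below uses a made-up frame and sets a title and a y label, which is a different tweak from the one the exercise asks for.

```python
import pandas as pd

# Hypothetical numeric frame
toy = pd.DataFrame({"age": [22, 38, 26, 80], "fare": [7.25, 71.28, 7.92, 512.33]})

# With default arguments, boxplot hands back the matplotlib Axes it drew on
ax = toy.boxplot()
ax.set_title("Toy boxplot")
ax.set_ylabel("value")
print(type(ax))
```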

## Dataframes and series with Matplotlib and Seaborn

Of course, we can also pass the data held in our dataframes and series to Seaborn.

In [None]:
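# Correlation Heatmap
# numeric_only=True (pandas 1.5+) skips text columns such as sex and ticket,
# which otherwise raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True))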
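The heatmap is only a colored view of a correlation matrix, so it helps to look at the matrix itself on a small example. This sketch uses a made-up frame and selects the numeric columns explicitly with `select_dtypes`, which is one way to keep text columns out of the calculation regardless of pandas version.

```python
import pandas as pd

# Hypothetical frame with one text column
toy = pd.DataFrame({
    "age": [10, 20, 30, 40],
    "fare": [5.0, 10.0, 15.0, 20.0],
    "survived": [1, 1, 0, 0],
    "sex": ["f", "m", "f", "m"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Each cell holds the correlation between the row column and the column column, so the diagonal is always 1 and the matrix is symmetric.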

In [None]:
sns.pairplot(df)

In [None]:
# If we want to zoom in a bit on one of the scatters in pairplot, we can use matplotlib and seaborn.
# Here we can visualize a slight positive correlation between age and fare price
fig, (ax1, ax2) = plt.subplots(1,2, sharey=True, sharex=True)

ax1.scatter(df.age, df.fare)
# Recent seaborn versions require x and y to be passed as keywords
sns.regplot(x=df.age, y=df.fare, ax=ax2, color='r', scatter=False)

In [None]:
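# Use histplot to look at the distribution of the ages
# (distplot is deprecated in recent seaborn; kde=True with stat='density' gives a similar picture)
sns.histplot(df.age, kde=True, stat='density')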

In [None]:
# Use histplot to look at the distribution of ages on subsets of survived vs perished
# (distplot is deprecated in recent seaborn)

sns.histplot(df[df['survived']==0].age, label='Perished', kde=True, stat='density')
sns.histplot(df[df['survived']==1].age, label='Survived', kde=True, stat='density')
plt.title("People near the median age had a lower rate\n of survival while infants and children had a higher one")
plt.legend()

In [None]:
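# histplot replaces the deprecated distplot here as well
sns.histplot(df[df['survived']==0].fare, label='Perished', kde=True, stat='density')
sns.histplot(df[df['survived']==1].fare, label='Survived', kde=True, stat='density')
plt.title("People with lower fares were more likely to perish")
plt.xlim(0,100)
plt.legend()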

In [None]:
# With the rest of your time, play around with visualizing some of the Titanic data. It can be pretty fun.