# Homework 1: Exploring & Visualizing Data

## Part 1: Data Exploration and Preparation

## Setup

Make sure you have seaborn and missingno installed. Run `pip3 install seaborn` and `pip3 install missingno` in your container/shell if you don't.

In this homework, we will explore data visualization and data manipulation more rigorously with a couple of datasets. Please fill in the cells marked `## YOUR CODE HERE`, following the directions given above each cell.

In [None]:
# removes the need to call plt.show() every time
%matplotlib inline

Seaborn is a powerful data visualization library built on top of matplotlib. We will be using seaborn for this homework, since it offers a higher-level interface for statistical plots and is worth knowing well. Seaborn also comes with *much* better default aesthetics (invoked with the `set()` function call). **Import seaborn as sns.**

In [None]:
## YOUR CODE HERE
sns.set() # Sets aesthetic parameters in one step

First load the `titanic` dataset directly from seaborn. The `load_dataset` function will return a pandas dataframe. Documentation: https://seaborn.pydata.org/generated/seaborn.load_dataset.html

In [None]:
titanic = ## YOUR CODE HERE

## Preparing a new dataset (Pandas)

Import `numpy` and `pandas` (remember to abbreviate them accordingly!)

In [None]:
## YOUR CODE HERE

Now use some pandas functions to get a quick overview/statistics on the dataset. Take a quick glance at the overview you create.
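
As a rough illustration of the kind of overview meant here, the sketch below runs a few common pandas summary calls on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "survived": [0, 1, 1, 1],
})

print(toy.head())         # first rows of the frame
toy.info()                # column dtypes and non-null counts
summary = toy.describe()  # count, mean, std, min, quartiles, max per numeric column
print(summary)
```

Note that `describe()` skips missing values, so the `count` row also hints at which columns have gaps.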

In [None]:
## YOUR CODE HERE


In [None]:
## YOUR CODE HERE


With your created overview, you should be able to answer these questions:

* What was the age of the oldest person on board?  ## YOUR ANSWER HERE
* What was the survival rate of people on board?  ## YOUR ANSWER HERE
* What was the average fare of people on board?  ## YOUR ANSWER HERE

For overviews, pandas also has a `groupby` method that is quite nice to use. For example, the survival rate broken down by sex and embarkation town:

In [None]:
titanic.groupby(['sex','embark_town'])['survived'].mean()

Now we have an overview of our dataset. The next thing we should do is clean it.

One quick observation is that there are a couple of redundant columns. One example is the `embark_town` and `embarked` columns, which convey the same information. As we are preparing this dataset to be used in a model, we only want to keep one column from each such pair. **Find the other redundant pair of columns and drop the non-numerical one.**

In [None]:
titanic.drop('embark_town', axis=1, inplace=True) 
 ## YOUR CODE HERE

Next, check for missing values and deal with them appropriately. `missingno` makes it easy to see where the missing values are in our dataset with a single command. <b>Import missingno as msno and use its matrix function on titanic.</b>
https://www.residentmar.io/2016/03/28/missingno.html

In [None]:
## YOUR CODE HERE

## YOUR CODE HERE

The white lines show us the missing data. One quick observation is that the `deck` column has a lot of missing data. Let's go ahead and **drop the deck column from the dataset**, since it's not that relevant. Make sure to set `inplace=True`. Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
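
The matrix gives a visual impression; a numeric check can back it up. The sketch below counts missing values per column on a hypothetical toy frame (made-up values, not the Titanic data) and then drops the sparsest column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "deck" is mostly missing, "age" partly missing.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_counts = toy.isna().sum()   # number of NaN values per column
missing_share = toy.isna().mean()   # fraction of NaN values per column
print(missing_share)

# Drop one column by name; axis=1 means "columns", inplace=True edits toy itself.
toy.drop("deck", axis=1, inplace=True)
```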

In [None]:
 ## YOUR CODE HERE

Now let's rerun the matrix and see. The large white block from `deck` is gone! Nice.

In [None]:
## YOUR CODE HERE

We still have a bunch of <b>missing values for the age field.</b> We can't just drop the age column, since it is a pretty important feature. One way to deal with this is simply to remove the records with missing information using `dropna()`, but this would remove a significant amount of our data.

What do we do now? We can explore a technique called *missing value imputation*: finding a reasonable way to *replace* the unknown data with workable values.

There is a lot of theory on how to do this properly ([for the curious, look here](http://www.stat.columbia.edu/~gelman/arm/missing.pdf)). The simplest option is to fill in the average age for the missing ages. But this really isn't so great: it puts an artificial spike at the mean and shrinks the variance of the column.

If we assume that the data is missing *at random* (which is rarely the case and very hard to prove), we can fit a model to predict the missing value from the other available features. One popular way to do this is KNN (where you look at the datapoints nearest to a given point to infer its missing value), but deep neural networks can also be used for this task.
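
To make the KNN idea concrete, here is a minimal sketch written with numpy and pandas only. The helper `knn_impute`, the toy data, and the choice of `k=2` are all assumptions made for illustration; the sketch also assumes the feature columns are numeric and have no missing values.

```python
import numpy as np
import pandas as pd

def knn_impute(df, target, features, k=2):
    """Fill NaNs in df[target] with the mean target of the k nearest complete rows.

    Hypothetical helper: distances are plain Euclidean distances over `features`.
    """
    out = df.copy()
    known = out[out[target].notna()]
    known_features = known[features].to_numpy(dtype=float)
    known_target = known[target].to_numpy(dtype=float)
    for idx in out.index[out[target].isna()]:
        # Distance from this row to every row whose target value is known.
        row = out.loc[idx, features].to_numpy(dtype=float)
        dists = np.sqrt(((known_features - row) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        out.loc[idx, target] = known_target[nearest].mean()
    return out

toy = pd.DataFrame({
    "fare": [7.0, 8.0, 70.0, 75.0, 72.0],
    "pclass": [3, 3, 1, 1, 1],
    "age": [20.0, 24.0, 40.0, 50.0, np.nan],
})
filled = knn_impute(toy, target="age", features=["fare", "pclass"], k=2)
print(filled)
```

In this sketch `fare` dominates the distance because it has the largest scale, so in practice the features would usually be rescaled first. scikit-learn also provides a ready-made `KNNImputer` if that library is available.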

You must now make your own decision on how to deal with the missing data. You may choose any of the methods discussed above. The easiest is to fill in the average value, although this will distort our visualizations. If you use pandas correctly, you can do this in one line (try looking at the pandas documentation!). After writing your code, verify the result by rerunning the matrix: you should not see any white lines.
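
As one possible middle ground between a single global average and a full model, the sketch below compares a global mean fill with a per-group median fill. The toy data and the choice of `pclass` as the grouping column are assumptions for illustration, not a recommended answer.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Option 1: one global mean for every missing age.
global_fill = toy["age"].fillna(toy["age"].mean())

# Option 2: the median age of each passenger class, aligned row by row.
class_median = toy.groupby("pclass")["age"].transform("median")
group_fill = toy["age"].fillna(class_median)

print(pd.DataFrame({"global": global_fill, "by_class": group_fill}))
```

The global fill gives both missing rows the same value, while the per-class fill keeps the difference between the two groups.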

In [None]:
## YOUR CODE HERE
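# Assign the result back: chained fillna(..., inplace=True) on a column is deprecated in recent pandas.
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())
# Drop any remaining rows that still contain missing values.
titanic.dropna(inplace=True)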

If we were to feed our dataset into a model, the model could not understand values such as 'True', 'yes', or 'male'. You have to turn the data into binary format: 1 for positive and 0 for negative. **Fill in toBinary and then use it to binarize the 'alive', 'sex', 'alone', and 'adult_male' columns.**
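
One way to think about this step: compare each value with the "positive" label and convert the resulting booleans to integers. The sketch below does that on a hypothetical toy frame; `to_binary_sketch` is an illustrative name and only one of several possible approaches.

```python
import pandas as pd

def to_binary_sketch(column, positive):
    """Return 1 where the value equals `positive`, else 0 (hypothetical helper)."""
    return (column == positive).astype(int)

toy = pd.DataFrame({"alive": ["yes", "no", "yes"], "alone": [True, False, False]})
toy["alive"] = to_binary_sketch(toy["alive"], "yes")
toy["alone"] = to_binary_sketch(toy["alone"], True)
print(toy)
```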

In [None]:
def toBinary(data, positive):
    ## YOUR CODE HERE
    
    pass
        
 ## YOUR CODE HERE
 ## YOUR CODE HERE
 ## YOUR CODE HERE
 ## YOUR CODE HERE

As we learned in lecture, we have to normalize our numerical features. Now you must **define the function 'Standardize' that standardizes the column passed in**, and then pass in the 'age' and 'fare' columns.
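
Standardizing usually means computing a z-score: subtract the column mean and divide by the column standard deviation. A minimal sketch on a made-up series follows; the function name `standardize_sketch` is hypothetical.

```python
import pandas as pd

def standardize_sketch(column):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    return (column - column.mean()) / column.std()

fares = pd.Series([10.0, 20.0, 30.0])
z = standardize_sketch(fares)
print(z)
```

pandas computes the sample standard deviation (`ddof=1`) by default, so the result can differ slightly from tools that use the population standard deviation.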

In [None]:
def Standardize(data):
    ## YOUR CODE HERE
    
    pass

titanic['age'] =  ## YOUR CODE HERE
titanic['fare'] = ## YOUR CODE HERE

Finally, we have to deal with the categorical columns that have more than two categories. For this situation we will use one-hot encoding: http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example. pandas has a good function, `get_dummies`, that makes this task much easier. Read the documentation and fill in the code to **one-hot encode the 'pclass' and 'embarked' columns. Use `pd.concat` to attach the output of `get_dummies` to the dataframe, then drop the original columns.**
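
The overall pattern (encode, concatenate, drop the original) can be sketched on a hypothetical toy frame as follows. The `prefix` argument is an optional choice made here to keep the new column names readable.

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "fare": [7.25, 71.28, 7.75, 8.05],
})

# One indicator column per category, named embarked_C, embarked_Q, embarked_S.
dummies = pd.get_dummies(toy["embarked"], prefix="embarked")

# Attach the indicator columns side by side, then remove the original column.
toy = pd.concat([toy, dummies], axis=1).drop("embarked", axis=1)
print(toy)
```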

In [None]:
 ## YOUR CODE HERE
titanic.drop('embarked', axis=1, inplace=True)

 ## YOUR CODE HERE
titanic.drop('pclass', axis=1, inplace=True)

In [None]:
titanic.head()

## Intro to Seaborn

Seaborn can handle categorical data and NaN values directly, so we no longer need the cleaned version. **Reload the titanic dataset to restore its original version.**

In [None]:
## YOUR CODE HERE


There are two types of data in any dataset: categorical and numerical. We will first explore categorical data.

One really easy way to show categorical data is through bar plots. Let's explore how to make some in seaborn.
We want to investigate the difference in rates at which males vs females survived the accident. Using the [documentation here](https://seaborn.pydata.org/generated/seaborn.barplot.html) and [example here](http://seaborn.pydata.org/examples/color_palettes.html), create a `barplot` to depict this. It should be a really simple one-liner.

We will show you how to do this so you can get an idea of how to use the API.

In [None]:
sns.barplot(x='sex', y='survived', data=titanic)

Notice how easy it was to create the plot! You simply passed in the entire dataset and specified the `x` and `y` fields that you wanted shown in the barplot. Behind the scenes, seaborn ignored `NaN` values for you and automatically calculated the survival rate to plot. The black line on each bar is a 95% confidence interval, which seaborn estimates by bootstrapping.
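
To see roughly what happens behind the scenes, the sketch below reproduces the two ingredients by hand on a hypothetical toy frame: the bar height as a group mean, and an error bar from a percentile bootstrap. The data, the seed, and the 1000 resamples are assumptions for illustration; seaborn's exact numbers will differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "survived": [0, 0, 1, 0, 1, 1, 0, 1],
})

# Bar heights: mean of `survived` within each `sex` group.
rates = toy.groupby("sex")["survived"].mean()
print(rates)

# Error bar for one group: resample with replacement and take percentiles of the means.
rng = np.random.default_rng(0)
values = toy.loc[toy["sex"] == "female", "survived"].to_numpy()
boot_means = rng.choice(values, size=(1000, values.size), replace=True).mean(axis=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ci_low, ci_high)
```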

So we see that females were much more likely to make it out alive. What other factors do you think could have an impact on survival rate? **Plot a couple more barplots below.** Make sure to use *categorical* values, not something numerical like age or fare.

In [None]:
## YOUR CODE HERE


In [None]:
## YOUR CODE HERE


What if we wanted to add a further breakdown by sex for the categories chosen above? Go back and add a `hue='sex'` parameter to the plots you just created, and seaborn will split each bar into a male/female comparison.

Now we want to compare the embarking town vs the age of the individuals. We don't simply want to use a barplot, since that will just give the average age; rather, we would like more insight into the relative and numeric *distribution* of ages.

A good tool to help us here is [`swarmplot`](https://seaborn.pydata.org/generated/seaborn.swarmplot.html). Use this function to view `embark_town` vs `age`, again using `sex` as the `hue`.

In [None]:
## YOUR CODE HERE


Cool! This gives us much more information. What if we didn't care about the number of individuals in each category at all, but just wanted to see the *distribution* in each category? [`violinplot`](https://seaborn.pydata.org/generated/seaborn.violinplot.html) plots a kernel density estimate for each category. Plot that. Keep the `hue`.

In [None]:
## YOUR CODE HERE


Go back and clean up the violinplot by adding the `split=True` parameter.

Now take a few seconds to look at the graphs you've created of this data. What are some observations? Jot a couple down here.

#### Your observations Here
----
* 
* 
* 

As mentioned earlier, data is either categorical or numerical. We already started getting into numerical data with the swarmplot and violinplot. We will now explore a couple more examples.

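Let's look at the distribution of ages. Use [`distplot`](https://seaborn.pydata.org/generated/seaborn.distplot.html) to make a histogram of just the ages. (`distplot` is deprecated as of seaborn 0.11; on newer versions, `histplot` with `kde=True` produces a similar plot.)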

In [None]:
## YOUR CODE HERE

If you had imputed missing ages with the average value and had not reloaded the dataset, the histogram would show a large artificial spike at the mean age. This is why we don't normally just fill in an average. As a quick fix in that situation, you can filter out the age values that equal the mean before passing the column to `distplot`.

A histogram represents numerical data by breaking the numerical range into bins and counting the values in each, which makes the data easier to visualize. As you might notice from above, seaborn also automatically plots a Gaussian kernel density estimate.
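
Both ideas can be sketched with numpy alone. The sample below is randomly generated (hypothetical, not real passenger ages), and the fixed bandwidth of 4.0 is an assumption; seaborn chooses a bandwidth by its own rule.

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(loc=30.0, scale=10.0, size=200)  # hypothetical sample

# Histogram: count how many values fall into each of 10 equal-width bins.
counts, edges = np.histogram(ages, bins=10)

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample point."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    norm = sample.size * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

# Evaluate the density on a grid that extends well past the data range.
grid = np.linspace(ages.min() - 30, ages.max() + 30, 400)
density = gaussian_kde_sketch(ages, grid, bandwidth=4.0)
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual data points more closely.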

Do the same thing for fares - do you notice something odd about that histogram? What does that skew mean?

In [None]:
 ## YOUR CODE HERE

Now, using the [`jointplot`](https://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot) function, make a scatterplot of the `age` and `fare` variables to see if there is any relationship between the two.

In [None]:
 ## YOUR CODE HERE

Scatterplots allow one to easily see trends and correlations in data. As you can see here, there seems to be very little correlation. Also observe that seaborn automatically plots histograms of each variable along the margins.
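
A visual impression of correlation can be checked numerically. The sketch below computes a Pearson correlation coefficient on a hypothetical toy frame; the values are made up and say nothing about the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
})

# Series.corr uses the Pearson coefficient by default; the result lies in [-1, 1].
r = toy["age"].corr(toy["fare"])
print(round(r, 3))
```

Values near 0 suggest little linear relationship, while values near 1 or -1 suggest a strong one.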

Now, use a seaborn function we haven't used yet to plot something. The [API](http://seaborn.pydata.org/api.html) has a list of all the methods.

In [None]:
## YOUR CODE HERE
