# Module 2: Data wrangling using `pandas`

## Overview: Explore the Titanic dataset
Many real-world datasets are messy (more on that later), but we will start with a simple, clean dataset that imports into `pandas` without any issues and is easy to explore.

If you have questions about this notebook, ask them on the [GEOL 557 Slack](https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA) <a href="https://join.slack.com/t/minesgeo/shared_invite/zt-cqawm4lu-Zcfpf4mBLwjnksY6_umlKA">
<img src="https://cdn.brandfolder.io/5H442O3W/as/pl546j-7le8zk-ex8w65/Slack_RGB.svg" alt="Go to the GEOL 557 Slack" width="100">
</a>

## Instructions
Work through this notebook. In several places you will need to fill in a blank or write some code in an open cell. When you are finished, use the Colab menu (not the browser-level menu) to do the following:
- Expand all the sections (in the Colab menu, choose View --> Expand sections)
- Save the notebook as a PDF (in the Colab menu, choose File --> Print --> Save as PDF)

--- 
## Course
**GEOL 557 Earth Resource Data Science I: Fundamentals**. GEOL 557 forms part 2 of the four-part course series for the "Earth Resource Data Science" online graduate certificate at Mines. [Learn more about the certificate here](https://online.mines.edu/er/).

Notebook created by **Zane Jobe** and **Thomas Martin**, [CoRE research group](https://core.mines.edu), Colorado School of Mines

[![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ZaneJobe.svg?style=social&label=Follow%20%40ZaneJobe)](https://twitter.com/ZaneJobe)
and [![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/ThomasM_geo.svg?style=social&label=Follow%20%40ThomasM_geo)](https://twitter.com/ThomasM_geo) on Twitter 

In [1]:
import pandas as pd # this imports pandas to this notebook
import numpy as np

## Let's load the dataset

The original location of the Titanic dataset is [here](https://www.openml.org/d/40945). 

We will use [this version](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv) that has been cleaned up a bit. For more information, go [here](https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html) to familiarize yourself with the data labels.


In [2]:
df = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
df.head()

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |


In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.Age.hist(bins=40); # plotting simple things is easy in pandas - it uses matplotlib in the background

## OK, now it's your turn
For a few of the required tasks, you will be asked to use common pandas methods. If you need more ideas, here are some [code examples](https://www.kaggle.com/grroverpr/pandas-cheatsheet) for ways to slice and analyze this dataset.

Now, you try a few things:

First, group the data into male and female passengers, and display how many men vs women survived (hint - use the `size()` method after you `groupby`) 
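
To see how `groupby` followed by `size()` behaves before trying it on the Titanic data, here is a minimal sketch on a small made-up dataframe. The `Team` and `Won` columns are hypothetical and are not part of the Titanic dataset; the point is only that grouping by two columns gives one row count per pair of values.

```python
import pandas as pd

# Toy data (hypothetical, not the Titanic dataset): one row per game played
toy = pd.DataFrame({
    'Team': ['red', 'red', 'blue', 'blue', 'blue', 'red'],
    'Won':  [1, 0, 1, 0, 0, 1],
})

# size() counts the rows in each group; two grouping columns give one count per pair
counts = toy.groupby(['Team', 'Won']).size()
print(counts)

# Individual counts can be looked up by their (Team, Won) label
print(counts.loc[('red', 1)], 'wins for red and', counts.loc[('blue', 1)], 'wins for blue')
```

The result is a Series indexed by both grouping columns, so a single count is selected with a `(Team, Won)` tuple.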

In [None]:
# for example: count the passengers in each 'Survived' group once, then reuse the result
survival_counts = df.groupby('Survived').size()
print(survival_counts[0], 'people on the Titanic perished and', survival_counts[1], 'survived')

In [None]:
# your code here

Now, make a new dataframe with only 1st class passengers, and then display how much their tickets cost using `describe()`
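
One common way to build a subset like this is a boolean mask, sketched here on made-up data. The `Section` and `Price` columns are hypothetical stand-ins, and this is only one possible approach, so adapt the column names to the Titanic dataframe.

```python
import pandas as pd

# Toy data (hypothetical): ticket prices in three seating sections
toy = pd.DataFrame({
    'Section': [1, 2, 1, 3, 1, 2],
    'Price':   [80.0, 40.0, 100.0, 15.0, 90.0, 35.0],
})

# A boolean mask keeps only the matching rows; copy() makes the subset an independent dataframe
section_1 = toy[toy['Section'] == 1].copy()

# describe() on a single column returns count, mean, std, min, quartiles, and max
summary = section_1['Price'].describe()
print(summary)
```

Calling `.copy()` is optional here, but it avoids pandas warnings if you later add or modify columns in the subset.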

In [None]:
# your code here


Now, compare the survival rate of people with kids on board against people with no kids on board (see the `Parents/Children Aboard` column).

Use `print()` statements like the example above to show your results

You should get something that looks like:

`51.17 percent of people with 1 or more kid on board survived`

`34.57 percent of people with ZERO kids on board survived`

Hint: you can use `.isin([1,2,3])` to select rows with 1, 2, 3 kids on board, etc.
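
As a rough illustration of the hint, the sketch below uses `.isin()` on a made-up dataframe and turns a 0/1 column into a percentage. The `Pets` and `HasCar` columns are hypothetical and only mimic the shape of the problem.

```python
import pandas as pd

# Toy data (hypothetical): number of pets per household and whether the household owns a car
toy = pd.DataFrame({
    'Pets':   [0, 1, 2, 0, 3, 0, 1, 0],
    'HasCar': [1, 1, 0, 0, 1, 1, 1, 0],
})

# isin() is True for rows whose value appears in the list
with_pets = toy[toy['Pets'].isin([1, 2, 3])]
no_pets = toy[toy['Pets'] == 0]

# The mean of a 0/1 column is the fraction of ones; multiply by 100 for a percentage
pct_with = with_pets['HasCar'].mean() * 100
pct_without = no_pets['HasCar'].mean() * 100

print(round(pct_with, 2), 'percent of households with 1 or more pets own a car')
print(round(pct_without, 2), 'percent of households with ZERO pets own a car')
```

Note that the list passed to `.isin()` must cover every value you want to include, so check which values actually occur in the column first.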


In [None]:
# your code here


Lastly, use your concept map to perform some data exploration on the Titanic data - what did you think would be an interesting thing to look at? How can you subset or group the data to do this? 
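
If you are unsure where to start, one possible pattern is to bin a numeric column and compare group averages. The sketch below does this on made-up data; the `Age` and `Outcome` values, the bin edges, and the labels are all arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): ages and a 0/1 outcome
toy = pd.DataFrame({
    'Age':     [4, 15, 23, 37, 45, 62, 8, 71],
    'Outcome': [1, 1, 0, 0, 1, 0, 1, 0],
})

# pd.cut assigns each age to a labeled bin; the bin edges here are arbitrary choices
toy['AgeGroup'] = pd.cut(toy['Age'], bins=[0, 18, 60, 100],
                         labels=['child', 'adult', 'senior'])

# Mean of the 0/1 outcome within each bin gives the fraction of ones per age group
rate_by_group = toy.groupby('AgeGroup', observed=True)['Outcome'].mean()
print(rate_by_group)
```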

In [None]:
# your code here

![That's all folks!](https://media1.tenor.com/images/aaca7edcdf6fcb030945615c8469bf5c/tenor.gif?itemid=15403579)