<center><img src='img/ms_logo.jpeg' width=30% height=30%></center>
<center><h1>Descriptive Statistics</h1></center>

In this lesson, we'll clean and analyze the Titanic data set, and see if descriptive statistics can tell us anything interesting about it.  Along the way, we'll gain hands-on experience with industry-standard tools such as the `pandas` library in Python.

#### About the Data Set.  

The Titanic sank after hitting an iceberg on the night of April 14th, 1912.  1503 people died and only 705 survived (it could have been 706, but some people are door hogs).

<img src='http://4.bp.blogspot.com/-QZUM4Q23E3c/UJ7WFXdJABI/AAAAAAAAAXk/ityxfFzjpPE/w1200-h630-p-k-no-nu/titanic1.jpg' height=50% width=50%>


The Titanic dataset is often used as an introduction to data analysis, especially for those interested in machine learning. Because we know who survived and who didn't, we can use the data on passengers to explore, look for trends that affect survival rate, and maybe even make some predictions on which passengers survived by looking at their data.   Today, we'll explore the Titanic dataset and get some real-world practice with cleaning and manipulating data.  

#### Pandas--Favorite Tool of Data Scientists 

<img src='https://media.giphy.com/media/aUhEBE0T8XNHa/giphy.gif' height=25% width=25%>

For data processing in Python, the `pandas` library is the standard choice.  Pandas is built around the dataframe, a table of labeled rows and columns that is reminiscent of a spreadsheet but can be manipulated with code.  Pandas provides a clean, easy-to-use API for reading, manipulating, sorting, and slicing data.
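To make the idea of a dataframe concrete, here is a minimal sketch. The names and values in `toy` are hypothetical and are not taken from the Titanic file; they only illustrate sorting, selecting a column, and slicing rows.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Titanic file
toy = pd.DataFrame({
    "name": ["Ada", "Ben", "Cy"],
    "age": [36, 22, 41],
    "fare": [71.28, 7.25, 13.00],
})

oldest_first = toy.sort_values("age", ascending=False)  # sort rows by a column
fares = toy["fare"]                                     # select one column (a Series)
first_two = toy.iloc[:2]                                # slice rows by position

print(oldest_first)
```

We'll use the same kinds of operations on the real data set later in this lesson.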

#### Learning Goals

In this exercise, we have the following goals:

1.  Use pandas to read in and manipulate data.
1.  Explore strategies for detecting outliers and dealing with missing (NaN) values.
1.  Answer questions about our data set using descriptive statistics.
1.  Use pandas to slice our dataframe into smaller dataframes based on conditional logic (for instance, all female survivors under a certain age).

**Let's get started!**

(To execute any cell, press SHIFT + ENTER or press the 'Run' button on the toolbar at the top.)

We'll start by importing pandas and setting an alias:

In [1]:
# We'll be using pandas a lot, so let's alias it to the name 'pd' to save ourselves some keystrokes
import pandas as pd

Next, we'll want to read in our dataset from the 'datasets' folder in this directory.  Did you know Jupyter notebooks can also run terminal-style commands in code cells?  Commands prefixed with a '%' sign are IPython "magic" commands, and several of them, such as `%ls`, `%cd`, and `%pwd`, mirror common terminal commands.  (To run an arbitrary shell command, prefix it with '!' instead.)  For instance, if you wanted to list all the items contained in the current directory, you would just type `%ls`!  Try it below:

In [None]:
# Type %ls and run this cell


Our data set is contained within the `datasets` folder, and is called `titanic.csv`.  You may have some experience reading data in manually using the built-in `open` function in Python.  With large or complex datasets, this method can get tedious very quickly.  However, `pandas` makes this a simple task!

**TASK: Read in the titanic.csv file using the `pd.read_csv()` function in pandas.**

HINT:  You'll still need to pass the function the correct arguments. To get this right, you'll need the documentation for this function.  You can find that [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).  

Alternatively, you can just type the method name with a question mark instead of parentheses to open a method's docs right here in the notebook!

In [2]:
# Need the docs? try typing pd.read_csv? 

path = "datasets/titanic.csv"

df = pd.read_csv(path)   # store the newly created dataframe in this variable
df   # this will display the contents of df (the dataframe containing all our data, truncated for readability)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
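If you'd like to experiment with `pd.read_csv` without touching the real file, one option is to read from an in-memory string. The sketch below assumes a hypothetical three-row CSV (`csv_text`) that merely imitates a few Titanic columns; `io.StringIO` lets pandas treat the string like a file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for datasets/titanic.csv
csv_text = """PassengerId,Survived,Age
1,0,22
2,1,38
3,1,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # only the first two rows
print(sample.dtypes)   # column types inferred by pandas
```

Notice that the empty Age field in the last row is read in as `NaN`, which is how missing values show up in the real data set as well.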


Great! With the pandas library, importing a large dataset takes a single line of code.  We can also use it to examine and manipulate the dataset very easily. Before we can make sense of the data, we should have a feel for what each column means.  Here's the **Data Dictionary** explaining what each column and value means:


<center><h2>Data Dictionary</h2></center>


| Variable | Definition                                        | Key                                            |
|----------|---------------------------------------------------|------------------------------------------------|
| Survived | Survival                                          | 0 = No, 1 = Yes                                |
| Pclass   | Ticket Class (proxy for socio-economic status)    | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                               |                                                |
| Age      | Age (in years)                                    |                                                |
| SibSp    | # of siblings and/or spouses also aboard          |                                                |
| Parch    | # of parents / children also aboard               |                                                |
| Ticket   | Ticket number                                     |                                                |
| Fare     | Passenger fare (how much their ticket cost)       |                                                |
| Cabin    | Cabin number                                      |                                                |
| Embarked | Port of Embarkation (where they boarded the ship) | C = Cherbourg, Q = Queenstown, S = Southampton |

Let's get the basic descriptive statistics to see what we can figure out about this dataset.  

**TASK: Run the dataframe's `.describe()` method.**

In [8]:
# Run df.describe().  What does each cell in the table mean?

df.describe()


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
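One detail in the table above is worth a closer look: the `count` for Age is 714, while every other column shows 891. `.describe()` counts only non-missing values, so 177 passengers have no recorded age. The sketch below shows the same effect on a small hypothetical Series (`ages`), not on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan], name="Age")

print(len(ages))          # total rows, including missing ones
print(ages.count())       # non-missing values only, as reported by describe()
print(ages.isna().sum())  # number of missing values
print(ages.mean())        # NaN entries are skipped by default
```

Comparing `len()` with `.count()`, or calling `.isna().sum()`, is a quick way to spot columns that need cleaning.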


We can also call `.describe()` on individual columns (a single column of data is called a pandas _Series_).  For instance, if we wanted to see the summary statistics on the _SibSp_ column, we would type `df["SibSp"].describe()`.
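On a text column, `.describe()` reports different statistics than it does on a numeric one. Here is a minimal sketch using a hypothetical five-value Series (`sex`); `.value_counts()` is included as one alternative way to count categories.

```python
import pandas as pd

# Hypothetical text column, NOT the real Titanic values
sex = pd.Series(["male", "female", "male", "male", "female"], name="Sex")

summary = sex.describe()     # count, unique, top (most common value), freq (its count)
print(summary)

counts = sex.value_counts()  # number of rows per category
print(counts)
```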

See if you can answer the following questions:  

1.  How old is the oldest passenger on the Titanic?
1.  How young is the youngest passenger?
1.  What is the average price paid for a ticket?
1.  How much did the most expensive ticket cost?
1.  How many passengers in the dataset are female?

Answers:

1.  
1.  
1.  
1.  
1.  

In [11]:
# Use the .describe() method on the appropriate columns to answer the questions listed above.

df['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

<center><h3>Slicing</h3></center>

We can also use pandas to 'slice' dataframes, just as we would with a list in Python.  However, unlike list slicing in vanilla Python, we can slice using conditional logic (pandas calls this boolean indexing).  For instance, what if we want to examine a dataframe that only contains the passengers that survived?  Easy! We just type:

<center>`survived_df = df[df["Survived"] == 1]`</center>

The syntax for slicing can feel a bit clunky at first, but it will become intuitive with practice.  
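It can help to split a conditional slice into two steps. The comparison produces a Series of True/False values (often called a mask), and indexing with that mask keeps only the rows marked True. The sketch below uses a hypothetical four-row dataframe (`toy`) rather than the real data.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic values
toy = pd.DataFrame({"Survived": [1, 0, 1, 0], "Fare": [10.0, 7.5, 30.0, 8.0]})

mask = toy["Survived"] == 1  # Series of True/False, one entry per row
print(mask.tolist())

survivors = toy[mask]        # keeps only the rows where mask is True
print(survivors)

print(mask.sum())            # True counts as 1, so this counts the matching rows
```

The rows that remain keep their original index labels, so the index of a sliced dataframe usually has gaps.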

If you want to slice on multiple conditions, you can do that too!  Just make sure each condition is wrapped in a set of parentheses.  For instance, if we wanted to grab a dataframe containing only the men that survived, we would use:
<br>
<br>
<center>`df[(df['Survived'] == 1) & (df['Sex'] == 'male')]`</center>
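Conditions are combined with `&` (and), `|` (or), and `~` (not) rather than the Python keywords `and`, `or`, and `not`. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `==`. A minimal sketch on a hypothetical four-row dataframe (`toy`):

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic values
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0],
    "Sex": ["male", "female", "male", "female"],
    "Age": [30.0, 25.0, 40.0, 19.0],
})

# AND: both conditions must hold
male_survivors = toy[(toy["Survived"] == 1) & (toy["Sex"] == "male")]

# OR: at least one condition must hold
young_or_female = toy[(toy["Age"] < 20) | (toy["Sex"] == "female")]

# NOT: invert a condition
not_survivors = toy[~(toy["Survived"] == 1)]

print(len(male_survivors), len(young_or_female), len(not_survivors))
```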


**TASK: Use your knowledge of conditional slicing to answer the following questions:**

(HINT: Don't forget about the `.describe()` method!)

1.  How many men survived?
1.  What is the average age of male passengers that survived?
1.  How many female passengers under the age of 30 did not survive?
1.  What was the most expensive ticket bought by a passenger from Cherbourg that did NOT survive?
1.  Of all surviving passengers from Southampton, how many passengers paid between \$216 and \$676.50 for their ticket? 
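For a range question like the last one, one option is the `.between()` method on a Series, which includes both endpoints by default. The sketch below uses hypothetical fares (`fares`) and shows that it matches two comparisons joined with `&`.

```python
import pandas as pd

# Hypothetical fares, NOT the real Titanic values
fares = pd.Series([5.0, 20.0, 50.0, 100.0], name="Fare")

in_range = fares.between(20, 50)            # both endpoints included by default
same_thing = (fares >= 20) & (fares <= 50)  # equivalent pair of comparisons

print(in_range.sum())                       # number of fares inside the range
```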

**Answers:**

1.  
1.  
1.  
1.  
1.  

In [None]:
# Use your knowledge of conditional slicing to answer the questions above!


In [15]:
# Keep only the rows for male passengers who survived
male_survivors_df = df[(df["Survived"] == 1) & (df["Sex"] == "male")]
male_survivors_df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.00,0,0,248698,13.0000,D56,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.00,0,0,113788,35.5000,A6,S
36,37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
55,56,1,1,"Woolner, Mr. Hugh",male,,0,0,19947,35.5000,C52,S
65,66,1,3,"Moubarek, Master. Gerios",male,,1,1,2661,15.2458,,C
74,75,1,3,"Bing, Mr. Lee",male,32.00,0,0,1601,56.4958,,S
78,79,1,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0000,,S
81,82,1,3,"Sheerlinck, Mr. Jan Baptist",male,29.00,0,0,345779,9.5000,,S
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.00,0,1,PC 17759,63.3583,D10 D12,C
