## 1. Introduction

In this mission, we'll clean and analyze data on passenger survival from the [Titanic](https://en.wikipedia.org/wiki/RMS_Titanic). Each row contains information for a specific Titanic passenger.

Here are the first few rows of the dataset:

|  | pclass | survived | name                                            | sex    | age     | sibsp | parch | ticket | fare     | cabin   | embarked | boat | body | home.dest                       |
|---|--------|----------|-------------------------------------------------|--------|---------|-------|-------|--------|----------|---------|----------|------|------|---------------------------------|
| 0 | 1      | 1        | Allen, Miss. Elisabeth Walton                   | female | 29.0000 | 0     | 0     | 24160  | 211.3375 | B5      | S        | 2    |      | St Louis, MO                    |
| 1 | 1      | 1        | Allison, Master. Hudson Trevor                  | male   | 0.9167  | 1     | 2     | 113781 | 151.5500 | C22 C26 | S        | 11   |      | Montreal, PQ / Chesterville, ON |
| 2 | 1      | 0        | Allison, Miss. Helen Loraine                    | female | 2.0000  | 1     | 2     | 113781 | 151.5500 | C22 C26 | S        |      |      | Montreal, PQ / Chesterville, ON |
| 3 | 1      | 0        | Allison, Mr. Hudson Joshua Creighton            | male   | 30.0000 | 1     | 2     | 113781 | 151.5500 | C22 C26 | S        |      | 135  | Montreal, PQ / Chesterville, ON |
| 4 | 1      | 0        | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1     | 2     | 113781 | 151.5500 | C22 C26 | S        |      |      | Montreal, PQ / Chesterville, ON |


Let's take a closer look at a few of the key columns:

- <span style="background-color: #F9EBEA; color:##C0392B">pclass</span> -- The passenger's cabin class from **1** to **3** where **1** was the highest class
- <span style="background-color: #F9EBEA; color:##C0392B">survived</span> -- **1** if the passenger survived, and **0** if they did not.
- <span style="background-color: #F9EBEA; color:##C0392B">sex</span> -- The passenger's gender
- <span style="background-color: #F9EBEA; color:##C0392B">age</span> -- The passenger's age
- <span style="background-color: #F9EBEA; color:##C0392B">fare</span> -- The amount the passenger paid for their ticket
- <span style="background-color: #F9EBEA; color:##C0392B">embarked</span> -- Either **C**, **Q**, or **S**, to indicate which port the passenger boarded the ship from.


Many of the columns, such as <span style="background-color: #F9EBEA; color:##C0392B">sex</span> and <span style="background-color: #F9EBEA; color:##C0392B">age</span>, have missing values.

Because missing values can cause errors in numerical functions, we'll need to deal with them before we can analyze the data. For instance, summing a column that contains a missing value and dividing by its length gives a missing value rather than a number, because a missing value can't be averaged. Addressing missing values will let us perform calculations on the entire data set.

### 1.1 Importing the data

Let's import the data set into pandas. You may notice at the start of the code, we import pandas differently from how we have previously.

>```python
import pandas as pd
```

This gives the pandas library the alias <span style="background-color: #F9EBEA; color:##C0392B">pd</span>, so that instead of typing pandas every time we want to use a function, we can instead type <span style="background-color: #F9EBEA; color:##C0392B">pd</span>, for example <span style="background-color: #F9EBEA; color:##C0392B">pd.read_csv()</span>.

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Read the file <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival.csv</span> into a dataframe called <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span>.

## 2. Finding the missing data

Missing data can take a few different forms:

- In Python, the None keyword and type indicates no value.
- The Pandas library uses <span style="background-color: #F9EBEA; color:##C0392B">NaN</span>, which stands for **"not a number"**, to indicate a missing value.
- In general terms, both <span style="background-color: #F9EBEA; color:##C0392B">NaN</span> and None can be called null values.
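
As a small illustration of the forms listed above, the sketch below builds a toy Series and checks how pandas treats `None` and `NaN`. The values and the names `toy_ages` and `toy_is_null` are made up for the example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): one real number, one None, one NaN
toy_ages = pd.Series([29.0, None, np.nan])

# In a float Series, pandas stores None as NaN
print(toy_ages)

# pd.isnull() flags both kinds of null value
toy_is_null = pd.isnull(toy_ages)
print(toy_is_null.tolist())  # [False, True, True]

# It also works on single values
print(pd.isnull(None), pd.isnull(np.nan), pd.isnull(29.0))  # True True False
```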

If we want to see which values are <span style="background-color: #F9EBEA; color:##C0392B">NaN</span>, we can use the [pandas.isnull()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) function which takes a pandas series and returns a series of <span style="background-color: #F9EBEA; color:##C0392B">True</span> and <span style="background-color: #F9EBEA; color:##C0392B">False</span> values, the same way that NumPy did when we compared arrays.


>```python
sex = titanic_survival["sex"]
sex_is_null = pd.isnull(sex)
```

We can use this resultant series to select only the rows that have null values.

>```python
sex_null_true = sex[sex_is_null]
```
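
The mask-then-select pattern above can be tried on a toy Series. This is only a sketch: the values and the `toy_` names are hypothetical, chosen so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries
toy_sex = pd.Series(["female", np.nan, "male", None, "male"])

# Boolean mask: True where the value is null
toy_sex_is_null = pd.isnull(toy_sex)

# Keep only the positions where the mask is True
toy_sex_null_true = toy_sex[toy_sex_is_null]

# The length of the selection is the number of nulls
toy_null_count = len(toy_sex_null_true)
print(toy_null_count)  # 2

# The selected entries keep their original row labels
print(toy_sex_null_true.index.tolist())  # [1, 3]
```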

We'll use this structure to look at the null values for the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column.

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Count how many values in the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column have null values:
    - Use **pandas.isnull()** on the **"age"** column to create a Series of **True** and **False** values.
    - Use the resulting series to select only the elements in age that are null, and assign the result to <span style="background-color: #F9EBEA; color:##C0392B">age_null_true</span>
    - Assign the length of <span style="background-color: #F9EBEA; color:##C0392B">age_null_true</span> to <span style="background-color: #F9EBEA; color:##C0392B">age_null_count</span>.
2. Print <span style="background-color: #F9EBEA; color:##C0392B">age_null_count</span> to see how many null values are in the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column.

## 3. What's The Big Deal With Missing Data?

So, we know that quite a few values are missing from the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column, and other columns are missing data too. But why is this a problem?

Let's look at a typical approach to calculate the average for the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column:

>```python
mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
```

The result of this is that <span style="background-color: #F9EBEA; color:##C0392B">mean_age</span> would be <span style="background-color: #F9EBEA; color:##C0392B">nan</span>. This is because any calculations we do with a null value also result in a null value. This makes sense when you think about it -- how can you add a null value to a known value?

Instead, we have to filter out the missing values before we calculate the mean.
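
To see the problem on a small scale, here is a sketch with three made-up ages (not real passenger data). It assumes the `~` operator is used to invert the boolean mask, which is one common way to keep only the non-null values.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values); one is missing
toy_ages = pd.Series([20.0, np.nan, 40.0])

# Naive approach: the NaN propagates through the sum
naive_mean = sum(toy_ages) / len(toy_ages)
print(naive_mean)  # nan

# Filter first: ~ inverts the mask, so only non-null values are kept
toy_is_null = pd.isnull(toy_ages)
known_ages = toy_ages[~toy_is_null]
filtered_mean = sum(known_ages) / len(known_ages)
print(filtered_mean)  # 30.0
```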

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Use <span style="background-color: #F9EBEA; color:##C0392B">age_is_null</span> to create a vector that only contains values from the <span style="background-color: #F9EBEA; color:##C0392B">"age"</span> column that aren't <span style="background-color: #F9EBEA; color:##C0392B">NaN</span>.
>```python
age_is_null = pd.isnull(titanic_survival["age"])
```
2. Calculate the mean of the new vector, and assign the result to <span style="background-color: #F9EBEA; color:##C0392B">correct_mean_age</span>.


## 4. Easier Ways To Do Math

Luckily, missing data is so common that many pandas methods automatically filter for it. For example, if we use the [Series.mean()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) method to calculate the mean of a column, missing values will not be included in the calculation.

To calculate the mean age that we did earlier, we can replace all of our code with one line:

>```python
correct_mean_age = titanic_survival["age"].mean()
```

Using the built-in method is much easier, but it's important to understand what is happening behind the scenes.
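
As a rough check of what happens behind the scenes, the sketch below compares `Series.mean()` with a manual filter on toy values (hypothetical, not from the data set). It also uses the `skipna` parameter of `Series.mean()` to switch the automatic filtering off.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values); one is missing
toy_ages = pd.Series([20.0, np.nan, 40.0])

# Default behaviour (skipna=True): the NaN is ignored
mean_default = toy_ages.mean()
print(mean_default)  # 30.0

# Manual equivalent: drop the nulls, then average
mean_manual = toy_ages[~pd.isnull(toy_ages)].mean()
print(mean_manual)  # 30.0

# With skipna=False the NaN propagates into the result
mean_strict = toy_ages.mean(skipna=False)
print(mean_strict)  # nan
```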

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Assign the mean of the <span style="background-color: #F9EBEA; color:##C0392B">"fare"</span> column to <span style="background-color: #F9EBEA; color:##C0392B">correct_mean_fare</span>.

## 5. Calculating Summary Statistics


Let's calculate more summary statistics for the data. The pclass column indicates the cabin class for each passenger, which was either **first class (1)**, **second class (2)**, or **third class (3)**. You'll use the list <span style="background-color: #F9EBEA; color:##C0392B">passenger_classes</span>, which contains these values, in the following exercise.

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Use a for loop to iterate over <span style="background-color: #F9EBEA; color:##C0392B">passenger_classes</span>. 
>```python
passenger_classes = [1, 2, 3]
fares_by_class = {}
```
Within the for loop:
    - Select just the rows in <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span> where the **pclass** value is equivalent to the current iterator value (class).
    - Select just the **fare** column for the current subset of rows.
    - Use the **Series.mean** method to calculate the mean of this subset.
    - Add the mean of the class to the <span style="background-color: #F9EBEA; color:##C0392B">fares_by_class</span> dictionary with class as the key.
2. Once the loop completes, the dictionary <span style="background-color: #F9EBEA; color:##C0392B">fares_by_class</span> should have **1, 2, and 3** as keys, with the average fares as the corresponding values.


## 6. Making Pivot Tables

[Pivot tables](https://en.wikipedia.org/wiki/Pivot_table) provide an easy way to subset by one column and then apply a calculation like a sum or a mean. The concept of pivot tables was popularized with the introduction of the 'PivotTable' feature in Microsoft Excel in the mid-1990s.

Pivot tables first group and then apply a calculation. In the previous screen, we actually made a pivot table manually by grouping by the column <span style="background-color: #F9EBEA; color:##C0392B">"pclass"</span> and then calculating the mean of the <span style="background-color: #F9EBEA; color:##C0392B">"fare"</span> column for each class.

Luckily, we can use the [DataFrame.pivot_table()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) method instead, which simplifies the kind of work we did on the last screen. To produce the same data, we could use one line.




>```python
import numpy as np  # needed for np.mean below
passenger_class_fares = titanic_survival.pivot_table(index="pclass", values="fare", aggfunc=np.mean)
```



The first parameter of the method, **index**, tells the method which column to group by. The second parameter, **values**, is the column that we want to apply the calculation to, and **aggfunc** specifies the calculation we want to perform. The default for the **aggfunc** parameter is actually the mean, so we can omit it when a mean is what we want.
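
To make the group-then-calculate idea concrete, here is a sketch that computes the same per-class means twice on a toy DataFrame: once with a manual loop and once with `pivot_table()`. The fares and the names `toy`, `manual` and `pivot` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not the real Titanic fares)
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "fare": [100.0, 200.0, 30.0, 50.0, 10.0],
})

# Manual version: filter the rows for each class, then average
manual = {}
for pclass in [1, 2, 3]:
    class_rows = toy[toy["pclass"] == pclass]
    manual[pclass] = class_rows["fare"].mean()

# Pivot table version: group by "pclass", average "fare"
# (aggfunc defaults to the mean, so it is omitted here)
pivot = toy.pivot_table(index="pclass", values="fare")

print(manual)
print(pivot)
```

Both versions give 150.0, 40.0 and 10.0 for classes 1, 2 and 3. The pivot table returns them as a DataFrame indexed by class rather than as a dictionary.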

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Use the <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.pivot_table()</span> method to calculate the **mean** age for each passenger class (**"pclass"**).
2. Assign the result to <span style="background-color: #F9EBEA; color:##C0392B">passenger_age</span>.
3. Display the <span style="background-color: #F9EBEA; color:##C0392B">passenger_age</span> pivot table using the **print()** function.

## 7. More complex pivot tables

We can use the <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.pivot_table()</span> method to perform even more advanced tasks. If we pass a list of column names to the values parameter instead of a single value, we can perform calculations on multiple columns at once.

We can also specify a custom calculation to be made. For instance, if we pass <span style="background-color: #F9EBEA; color:##C0392B">np.sum</span> to the **aggfunc** parameter it will total the values in each column.
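
A minimal sketch of both ideas on toy data follows. The values are made up, and the example passes the string `"sum"` to **aggfunc**, which pandas also accepts, instead of `np.sum`, so that the block does not need NumPy.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female", "male"],
    "sibsp": [1, 0, 2, 1, 0],
    "parch": [0, 0, 1, 2, 0],
})

# A list for `values` aggregates several columns at once;
# aggfunc="sum" totals each column within each group
family_totals = toy.pivot_table(index="sex", values=["sibsp", "parch"], aggfunc="sum")
print(family_totals)
```

The result has one row per group and one column per entry in the `values` list.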

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Make a pivot table that calculates the total fares collected (<span style="background-color: #F9EBEA; color:##C0392B">"fare"</span>) and total number of survivors (<span style="background-color: #F9EBEA; color:##C0392B">"survived"</span>) for each embarkation port (<span style="background-color: #F9EBEA; color:##C0392B">"embarked"</span>).
2. Assign the result to <span style="background-color: #F9EBEA; color:##C0392B">port_stats</span>.
3. Display <span style="background-color: #F9EBEA; color:##C0392B">port_stats</span> using the **print()** function.

## 8. Dropping missing values

We learned how to remove the missing values in a vector of data, but how about in a matrix?

We can use the [DataFrame.dropna()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) method on pandas **DataFrames** to do this. The method will drop any rows that contain missing values.

The <span style="background-color: #F9EBEA; color:##C0392B">dropna()</span> method takes an axis parameter, which indicates whether you would like to drop rows or columns. Specifying **axis=0** or **axis='index'** will drop any rows that have null values, while specifying **axis=1** or **axis='columns'** will drop any columns that have null values. We will use **0** and **1** since they're more commonly used, but you can use either.

The code below will drop all rows in <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span> that have null values.

>```python
drop_na_rows = titanic_survival.dropna(axis=0)
```
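
The effect of the **axis** parameter is easier to see on a small table. The sketch below uses a three-row toy DataFrame with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two columns
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [29.0, np.nan, 2.0],
    "cabin": ["B5", "C22", None],
})

# axis=0: drop every row that contains at least one null
rows_dropped = toy.dropna(axis=0)
print(rows_dropped.shape)  # (1, 3)

# axis=1: drop every column that contains at least one null
cols_dropped = toy.dropna(axis=1)
print(cols_dropped.columns.tolist())  # ['name']

# dropna() returns a new DataFrame; the original is unchanged
print(toy.shape)  # (3, 3)
```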

There is also a parameter that allows you to specify a list of columns or rows to look at when using <span style="background-color: #F9EBEA; color:##C0392B">dropna()</span>. You will need to use this in the next exercise - take a look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) to work out the name of this parameter and how it works.


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Drop all columns in <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span> that have missing values and assign the result to <span style="background-color: #F9EBEA; color:##C0392B">drop_na_columns</span>.
2. Drop all rows in <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span> where the columns **"age"** or **"sex"** have missing values and assign the result to <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span>.


## 9.  Using Iloc To Access Rows By Position

In previous missions, we have used row labels to select data in pandas using <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.loc[]</span>. Row labels work just like column labels, and can be values like numbers, characters, and strings.

Sometimes your dataset will have row labels that are not numbers, or that are not in order. We have sorted the <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> dataframe by the **"age"** column from highest to lowest. Here is a preview of a few of the columns for the first five rows of the data, which hold the five oldest passengers on board.


|   | pclass | survived | name                                              | sex    | age  |
|------|--------|----------|---------------------------------------------------|--------|------|
| 14   | 1.0    | 1.0      | Barkworth, Mr. Algernon Henry Wilson              | male   | 80.0 |
| 61   | 1.0    | 1.0      | Cavendish, Mrs. Tyrell William (Julia Florence... | female | 76.0 |
| 1235 | 3.0    | 0.0      | Svensson, Mr. Johan                               | male   | 74.0 |
| 135  | 1.0    | 0.0      | Goldschmidt, Mr. George B                         | male   | 71.0 |
| 9    | 1.0    | 0.0      | Artagaveytia, Mr. Ramon                           | male   | 71.0 |


You can see that the row labels for the first **5** rows are **14**, **61**, **1235**, **135** and **9**. To select the first five rows, we can use <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.iloc[]</span> to select by position. An easy way to remember which is which: <span style="background-color: #F9EBEA; color:##C0392B">iloc[]</span> stands for integer location, because you use integer positions rather than labels to select the data.

The following code will select the first 5 rows as shown above:

>```python
first_five_rows = new_titanic_survival.iloc[0:5]
```
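
The difference between labels and positions can be sketched on a toy DataFrame. The names and ages below are hypothetical; the point is only that sorting keeps each row's original label.

```python
import pandas as pd

# Toy data (hypothetical values) with the default labels 0, 1, 2, 3
toy = pd.DataFrame({"name": ["A", "B", "C", "D"], "age": [30.0, 80.0, 2.0, 71.0]})

# Sorting keeps each row's original label
by_age = toy.sort_values("age", ascending=False)
print(by_age.index.tolist())  # [1, 3, 0, 2]

# iloc selects by position: the first row after sorting
oldest = by_age.iloc[0]
print(oldest["name"])  # B

# loc selects by label: the row labelled 0, wherever it now sits
label_zero = by_age.loc[0]
print(label_zero["name"])  # A

# Slicing with iloc excludes the end position, like a Python list
first_two = by_age.iloc[0:2]
print(first_two["name"].tolist())  # ['B', 'D']
```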

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Assign the first ten rows from <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">first_ten_rows</span>.
2. Assign the fifth row from <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">row_position_fifth</span>.
3. Assign the row with index label <span style="background-color: #F9EBEA; color:##C0392B">25</span> from  <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">row_index_25</span>.

## 10. Using Column Indexes

We can also index columns using both <span style="background-color: #F9EBEA; color:##C0392B">loc[]</span> and <span style="background-color: #F9EBEA; color:##C0392B">iloc[]</span>. With <span style="background-color: #F9EBEA; color:##C0392B">loc[]</span>, we specify the column label strings as we have in the earlier exercises in this mission. With <span style="background-color: #F9EBEA; color:##C0392B">iloc[]</span>, we simply use the integer position of the column, starting from the left-most column, which is **0**. Similar to indexing with NumPy arrays, you separate the row and column selections with a comma, and can use a colon to specify a range or as a wildcard.

>```python
first_row_first_column = new_titanic_survival.iloc[0,0]
all_rows_first_three_columns = new_titanic_survival.iloc[:,0:3]
row_index_83_age = new_titanic_survival.loc[83,"age"]
row_index_766_pclass = new_titanic_survival.loc[766,"pclass"]
```
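
The lines above depend on the sorted Titanic data. For a version that runs on its own, here is a sketch on a toy DataFrame whose row labels are deliberately non-sequential; all values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with non-sequential row labels
toy = pd.DataFrame(
    {"pclass": [1, 3, 2], "survived": [1, 0, 1], "age": [80.0, 74.0, 71.0]},
    index=[14, 1235, 135],
)

# iloc[row_position, column_position]
print(toy.iloc[0, 2])  # 80.0

# loc[row_label, column_label]
print(toy.loc[1235, "age"])  # 74.0

# A bare colon means "everything" on that axis
first_two_cols = toy.iloc[:, 0:2]
print(first_two_cols.columns.tolist())  # ['pclass', 'survived']

# Ranges work on both axes at once
corner = toy.iloc[0:2, 0:2]
print(corner.shape)  # (2, 2)
```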

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Assign the value at row index label **1100**, column index label **"age"** from <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">row_index_1100_age</span>.
2. Assign the value at row index label **25**, column index label **"survived"** from <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">row_index_25_survived</span>.
3. Assign the first **5** rows and first three columns from <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> to <span style="background-color: #F9EBEA; color:##C0392B">five_rows_three_cols</span>.


## 11. Reindexing Rows

After we sorted <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> by **age**, the row indexes were no longer sequential. Each row retained its original index from <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span>.

Sometimes it's useful to reindex, starting from **0**. We can use the <span style="background-color: #F9EBEA; color:##C0392B">[DataFrame.reset_index()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)</span> method to do this. By default, the method retains the old index by adding an extra column to the dataframe with the old index values.
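
The default behavior can be seen on a toy DataFrame with hypothetical values. Because the toy index has no name, pandas calls the added column `index`.

```python
import pandas as pd

# Toy data (hypothetical values) with non-sequential row labels
toy = pd.DataFrame({"age": [80.0, 76.0, 74.0]}, index=[14, 61, 1235])

# Default behaviour: new labels 0..n-1, old labels kept in an "index" column
toy_reset = toy.reset_index()
print(toy_reset.columns.tolist())   # ['index', 'age']
print(toy_reset.index.tolist())     # [0, 1, 2]
print(toy_reset["index"].tolist())  # [14, 61, 1235]
```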

In this exercise, we don't want to retain the index. Check the documentation to see what parameter you need to add so that we don't retain the old index.


<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Reindex the <span style="background-color: #F9EBEA; color:##C0392B">new_titanic_survival</span> dataframe so the row indexes start from **0**, and the old index is **dropped**.
2. Assign the final result to <span style="background-color: #F9EBEA; color:##C0392B">titanic_reindexed</span>.
3. Print the first 5 rows and the first 3 columns of <span style="background-color: #F9EBEA; color:##C0392B">titanic_reindexed</span>.

## 12. Apply Functions Over A DataFrame

To perform a complex calculation across pandas objects, we'll need to learn about the <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.apply()</span> method. By default, <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.apply()</span> will iterate through each column in a DataFrame and call our function on each one. When we create our function, we give it one parameter; the <span style="background-color: #F9EBEA; color:##C0392B">apply()</span> method passes each column to that parameter as a pandas series.

The result from the function will be combined with all of the other results, and placed into a new series. The function results will have the same position as the column or row we generated them from. Let's look at a simple example:


>```python
# This function returns the hundredth item from a series
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item
# Return the hundredth item from each column
hundredth_row_var = titanic_survival.apply(hundredth_row)
```
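
The example above needs the full data set. The sketch below shows the same column-by-column behavior on a toy DataFrame; the function `value_range` and the values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"age": [29.0, 2.0, 30.0], "fare": [211.0, 151.5, 151.5]})

# The function receives one column (a Series) at a time
def value_range(column):
    return column.max() - column.min()

# One result per column, labelled with the column name
ranges = toy.apply(value_range)
print(ranges)
```

The result is a Series whose index holds the column names, here `age` and `fare`, with values 28.0 and 59.5.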

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Write a function that counts the number of null elements in a Series.
2. Use the <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.apply()</span> method along with your function to run across all the columns in <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span>.
3. Assign the result to <span style="background-color: #F9EBEA; color:##C0392B">column_null_count</span>.


>```python
import pandas as pd

data = pd.read_csv("titanic_survival.csv")
data.head(2)
```

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 |  | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.55 | C22 C26 | S | 11 |  | Montreal, PQ / Chesterville, ON |


>```python
def count_nulls(column):
    # Count the null values in one column (a Series)
    return column.isnull().sum()

# Count the null values in each column
column_null_count = data.apply(count_nulls)
column_null_count
```

```
pclass          1
survived        1
name            1
sex             1
age           264
sibsp           1
parch           1
ticket          1
fare            2
cabin        1015
embarked        3
boat          824
body         1189
home.dest     565
dtype: int64
```

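For comparison, the <span style="background-color: #F9EBEA; color:##C0392B">hundredth_row()</span> example from the start of this section returns the hundredth item from each column:

>```python
hundredth_row_var
```

```
pclass                                                       1
survived                                                     1
name         Duff Gordon, Lady. (Lucille Christiana Sutherl...
sex                                                     female
age                                                         48
sibsp                                                        1
parch                                                        0
ticket                                                   11755
fare                                                      39.6
cabin                                                      A16
embarked                                                     C
boat                                                         1
body                                                       NaN
home.dest                                       London / Paris
dtype: object
```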

## 13. Applying A Function To A Row

By passing in the **axis=1** argument, we can use the <span style="background-color: #F9EBEA; color:##C0392B">DataFrame.apply()</span> method to iterate over rows instead of columns.

We can use this to calculate some summary information about the ages of the passengers on the Titanic. You will need to use an **if/elif/else** statement in your function. The **elif** statement just means **else if**. Below is an example of how these statements work.

>```python
def which_class(row):
    pclass = row['pclass']
    if pd.isnull(pclass):
        return "Unknown"
    elif pclass == 1:
        return "First Class"
    elif pclass == 2:
        return "Second Class"
    else:
        return "Third Class"
classes = titanic_survival.apply(which_class, axis=1)
```

When the function is called, the conditions are tested from top to bottom until one is true, and only that branch runs. If none of the **if** or **elif** conditions is true, the **else** branch runs.
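
The order of the tests matters. In the sketch below (toy fares and hypothetical thresholds, chosen only for illustration), the null check comes first because every comparison with `NaN` is false, so a missing value would otherwise fall through to the **else** branch.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); one fare is missing
toy = pd.DataFrame({"fare": [5.0, 50.0, np.nan, 500.0]})

# Hypothetical thresholds, chosen only for the example
def fare_band(row):
    fare = row["fare"]
    if pd.isnull(fare):
        return "unknown"
    elif fare < 10:
        return "low"
    elif fare < 100:
        return "medium"
    else:
        return "high"

# axis=1 passes each row to the function
bands = toy.apply(fare_band, axis=1)
print(bands.tolist())  # ['low', 'medium', 'unknown', 'high']
```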

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Create a function that returns the string <span style="background-color: #F9EBEA; color:##C0392B">"minor"</span> if someone is under **18**, <span style="background-color: #F9EBEA; color:##C0392B">"adult"</span> if they are equal to or over **18**, and <span style="background-color: #F9EBEA; color:##C0392B">"unknown"</span> if their age is **null**.
2. Then, use the function along with <span style="background-color: #F9EBEA; color:##C0392B">.apply()</span> to find the correct label for everyone in the <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span> dataframe.
3. Assign the result to <span style="background-color: #F9EBEA; color:##C0392B">age_labels</span>.
4. You can use **pd.isnull** to check if a value is **null** or not.

>```python
data["agecat"] = pd.cut(data.age,
      bins=[0,5,10,18,30,50,65,100],
      labels=["Infant","Child","Teenager","Young adult",
             "Adult","Senior adult","Senior"])
```

>```python
data.pivot_table(index=["agecat","survived"],
                 values="age",
                 aggfunc="count").sum()
```

```
age    1046
dtype: int64
```

>```python
data.pivot_table(index=["agecat","survived"],
                 values="age",
                 aggfunc=lambda x: len(x)/len(data[~data.age.isnull()]))
```

| agecat      | survived | age      |
|-------------|----------|----------|
| Infant      | 0.0      | 0.018164 |
| Infant      | 1.0      | 0.035373 |
| Child       | 0.0      | 0.016252 |
| Child       | 1.0      | 0.012428 |
| Teenager    | 0.0      | 0.059273 |
| Teenager    | 1.0      | 0.043021 |
| Young adult | 0.0      | 0.251434 |
| Young adult | 1.0      | 0.146272 |
| Adult       | 0.0      | 0.192161 |
| Adult       | 1.0      | 0.134799 |


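>```python
# Assumption: aux is the "count" pivot table from above; its definition is not shown in the original
aux = data.pivot_table(index=["agecat","survived"], values="age", aggfunc="count")
aux["age_prop"] = aux.age/aux.sum().age
aux
```

| agecat      | survived | age | age_prop |
|-------------|----------|-----|----------|
| Infant      | 0.0      | 19  | 0.018164 |
| Infant      | 1.0      | 37  | 0.035373 |
| Child       | 0.0      | 17  | 0.016252 |
| Child       | 1.0      | 13  | 0.012428 |
| Teenager    | 0.0      | 62  | 0.059273 |
| Teenager    | 1.0      | 45  | 0.043021 |
| Young adult | 0.0      | 263 | 0.251434 |
| Young adult | 1.0      | 153 | 0.146272 |
| Adult       | 0.0      | 201 | 0.192161 |
| Adult       | 1.0      | 141 | 0.134799 |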


>```python
def which_class(row):
    age = row['age']
    if pd.isnull(age):
        return "Unknown"
    elif age < 5:
        return "Infant"
    elif age < 10:
        return "Child"
    elif age < 18:
        return "Teenager"
    elif age < 30:
        return "Young adult"
    elif age < 50:
        return "Adult"
    elif age < 65:
        return "Senior adult"
    else:
        return "Senior"
data["agecat"] = data.apply(which_class, axis=1)
```

>```python
data
```

|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | pclass_name | agecat |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|-------------|--------|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 |  | St Louis, MO | First Class | Young adult |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 |  | Montreal, PQ / Chesterville, ON | First Class | Infant |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON | First Class | Infant |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON | First Class | Adult |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON | First Class | Young adult |
| 5 | 1.0 | 1.0 | Anderson, Mr. Harry | male | 48.0000 | 0.0 | 0.0 | 19952 | 26.5500 | E12 | S | 3 |  | New York, NY | First Class | Adult |
| 6 | 1.0 | 1.0 | Andrews, Miss. Kornelia Theodosia | female | 63.0000 | 1.0 | 0.0 | 13502 | 77.9583 | D7 | S | 10 |  | Hudson, NY | First Class | Senior adult |
| 7 | 1.0 | 0.0 | Andrews, Mr. Thomas Jr | male | 39.0000 | 0.0 | 0.0 | 112050 | 0.0000 | A36 | S |  |  | Belfast, NI | First Class | Adult |
| 8 | 1.0 | 1.0 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0000 | 2.0 | 0.0 | 11769 | 51.4792 | C101 | S | D |  | Bayside, Queens, NY | First Class | Senior adult |
| 9 | 1.0 | 0.0 | Artagaveytia, Mr. Ramon | male | 71.0000 | 0.0 | 0.0 | PC 17609 | 49.5042 |  | C |  | 22.0 | Montevideo, Uruguay | First Class | Senior |


## 14. Calculating Survival Percentage By Age Group

Now that we have age labels for everyone, let's make a pivot table to find the probability of survival for each age group.

We have added an <span style="background-color: #F9EBEA; color:##C0392B">"age_labels"</span> column to the dataframe containing the <span style="background-color: #F9EBEA; color:##C0392B">age_labels</span> variable from the previous step.

<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Description**:

1. Create a pivot table that calculates the mean survival chance (<span style="background-color: #F9EBEA; color:##C0392B">"survived"</span>) for each age group (<span style="background-color: #F9EBEA; color:##C0392B">"age_labels"</span>) of the dataframe <span style="background-color: #F9EBEA; color:##C0392B">titanic_survival</span>.
2. Assign the resulting pivot table to <span style="background-color: #F9EBEA; color:##C0392B">age_group_survival</span>.
