# Lesson 13: Making Sense of Data

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
titanic = pd.read_csv('data/titanic.csv')
titanic.head()

What data type is the `Embarked` Series?

What is the pandas [`category`](https://pbpython.com/pandas_dtypes_cat.html#:~:text=The%20category%20data%20type%20in,more%20efficiently%20store%20the%20data.) type?

Let's change `Embarked` to type `category`.

In [None]:
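titanic.Embarked = titanic.Embarked.astype("category")
# pandas sorts the categories, so the new names must follow that sorted order
titanic.Embarked = titanic.Embarked.cat.rename_categories(["Cherbourg", "Queenstown", "Southampton"])
titanic.Embarked[0:2]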
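
As a minimal sketch of the same conversion, the cell below uses a small made-up Series (`ports` is a hypothetical stand-in, not the Titanic data). It assumes the original codes are `C`, `Q`, and `S`, and it renames them with a dictionary so that the result does not depend on the order in which pandas stores the categories.

```python
import pandas as pd

# Hypothetical toy data standing in for the Embarked column
ports = pd.Series(["S", "C", "S", "Q", "C"])

# Convert from object dtype to category dtype
ports = ports.astype("category")
print(ports.cat.categories)  # the categories are stored in sorted order

# Map each old code to a full name; a dict avoids relying on position
ports = ports.cat.rename_categories(
    {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
)
print(ports.head(2))
```

Renaming with a list works too, but the names must then be given in the same order as the existing categories.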

Change the `Sex` and `Survived` columns to type `category`.

In [None]:
titanic.Sex = ...
titanic.Sex = titanic.Sex.cat.rename_categories(...)
titanic.Survived = ...
titanic.Survived = titanic.Survived.cat.rename_categories(...)

If you correctly changed the `Sex` and `Survived` columns to `category`, running the next cell should display:

```
(Index(['Male', 'Female'], dtype='object'),
 Index(['Died', 'Survived'], dtype='object'))
```

In [None]:
titanic.Sex.cat.categories, titanic.Survived.cat.categories

Use the `.value_counts()` method to display the counts for the `Embarked`, `Sex`, and `Survived` columns.
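
For reference, here is a small sketch of `.value_counts()` on a made-up categorical Series (`sex` is hypothetical toy data, not the Titanic column). The `normalize=True` argument, which returns proportions instead of counts, is shown as an optional extra.

```python
import pandas as pd

# Hypothetical toy data: four passengers
sex = pd.Series(["Male", "Female", "Male", "Male"], dtype="category")

# Number of rows in each category
counts = sex.value_counts()
print(counts)

# Share of rows in each category
proportions = sex.value_counts(normalize=True)
print(proportions)
```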

## 1. Contingency Tables

### Two-Way

The `pd.crosstab()` function will make a two-way contingency table that gets returned as a dataframe.

In [None]:
pd.crosstab(titanic.Sex, titanic.Survived)

We can add marginal totals.

In [None]:
pd.crosstab(titanic.Sex, titanic.Survived, margins=True)

We can change the row names and the column names. 

In [None]:
pd.crosstab(titanic.Sex, titanic.Survived, rownames=["Sex"], colnames=["Survival"])
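
The three variations above can be sketched on a tiny made-up dataframe. Everything in `toy` is hypothetical, as are the labels `"Passenger sex"` and `"Outcome"`; only the `pd.crosstab()` arguments `margins`, `rownames`, and `colnames` come from pandas.

```python
import pandas as pd

# Hypothetical toy data: six passengers
toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Survived": ["Died", "Survived", "Survived", "Died", "Died", "Survived"],
})

# Plain two-way table: rows are Sex, columns are Survived
two_way = pd.crosstab(toy["Sex"], toy["Survived"])
print(two_way)

# Same table with an extra "All" row and column of marginal totals
with_totals = pd.crosstab(toy["Sex"], toy["Survived"], margins=True)
print(with_totals)

# Same table with custom names for the row and column axes
renamed = pd.crosstab(
    toy["Sex"], toy["Survived"],
    rownames=["Passenger sex"], colnames=["Outcome"],
)
print(renamed)
```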

### Three-Way

We can make three-way contingency tables by passing a list of columns as the row (or column) argument of `pd.crosstab()`.
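
As a rough sketch, a three-way table can be built by passing two Series for the rows and one for the columns. The dataframe `toy` below is made up for illustration; the axis names `Embarked`, `Survival`, and `Sex` mirror the labels in the example outputs later in this lesson.

```python
import pandas as pd

# Hypothetical toy data: six passengers
toy = pd.DataFrame({
    "Embarked": ["Cherbourg", "Cherbourg", "Southampton",
                 "Southampton", "Southampton", "Cherbourg"],
    "Survived": ["Died", "Survived", "Died", "Died", "Survived", "Died"],
    "Sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Two row variables give a two-level row index; margins adds the totals
three_way = pd.crosstab(
    [toy["Embarked"], toy["Survived"]],
    toy["Sex"],
    rownames=["Embarked", "Survival"],
    colnames=["Sex"],
    margins=True,
)
print(three_way)
```

The result is an ordinary dataframe whose row index has two levels, so it can be sliced with `.iloc` and `.loc`.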

Save a three-way contingency table of `Embarked` and `Survived` (rows) by `Sex` (columns), with marginal totals, to an object named `tbl`.

In [None]:
tbl = ...

Now let's look at some bivariate information by slicing our three-way contingency table. 

**Example 1.** 

```
Embarked     Survival
Cherbourg    Died         66
             Survived     29
Queenstown   Died         38
             Survived      3
Southampton  Died        364
             Survived     77
All                      577
Name: Female, dtype: int64
```

In [None]:
tbl.iloc[:, 1]

**Example 2.** 

```
Embarked     Survival
Cherbourg    Died          9
             Survived     64
Queenstown   Died          9
             Survived     27
Southampton  Died         63
             Survived    140
All                      312
Name: Male, dtype: int64
```

In [None]:
tbl.loc[:, "Male"]

**Example 3.** 


|               |**Sex**     |**Male**|**Female**|**All**|
|---------------|------------|-------|--------|----------|
|**Embarked**   |**Survival**|       |        |          |
|**Cherbourg**  |**Died**    |  9    | 66     |  75      |
|**Queenstown** |**Died**    |  9    | 38     |  47      |
|**Southampton**|**Died**    |  63   | 364    |  427     |

In [None]:
tbl.iloc[[0, 2, 4]]

**Example 4.** 


|               |**Sex**     |**Male**|**Female**|**All**|
|---------------|------------|-------|--------|----------|
|**Embarked**   |**Survival**|       |        |          |
|**Cherbourg**  |**Died**    |  9    | 66     |  75      |
|**Queenstown** |**Died**    |  9    | 38     |  47      |
|**Southampton**|**Died**    |  63   | 364    |  427     |

In [None]:
tbl.loc[tbl.index.get_level_values("Survival") == "Died"]

Now let's look at some univariate information by slicing our three-way contingency table. 

**Example 5.** 

```
Sex
Male       9
Female    66
All       75
Name: (Cherbourg, Died), dtype: int64
```

In [None]:
tbl.iloc[0]

**Example 6.** 

```
Sex
Male       9
Female    66
All       75
Name: (Cherbourg, Died), dtype: int64
```

In [None]:
tbl.loc[("Cherbourg", "Died"), :]
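
The slicing patterns used in Examples 1 through 6 can be tried on a small table. This is only a sketch: `toy` and `small_tbl` are hypothetical, and the table is built without marginal totals to keep it short.

```python
import pandas as pd

# Hypothetical toy data: six passengers
toy = pd.DataFrame({
    "Embarked": ["Cherbourg", "Cherbourg", "Southampton",
                 "Southampton", "Southampton", "Cherbourg"],
    "Survived": ["Died", "Survived", "Died", "Died", "Survived", "Died"],
    "Sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
})
small_tbl = pd.crosstab(
    [toy["Embarked"], toy["Survived"]], toy["Sex"],
    rownames=["Embarked", "Survival"], colnames=["Sex"],
)

# One column, by position and by label
by_position = small_tbl.iloc[:, 0]
by_label = small_tbl.loc[:, "Male"]

# One row, by position and by label (a tuple names both index levels)
first_row = small_tbl.iloc[0]
same_row = small_tbl.loc[("Cherbourg", "Died"), :]

# All rows with Survival equal to "Died", using a boolean mask
died = small_tbl.loc[small_tbl.index.get_level_values("Survival") == "Died"]
print(died)
```

A boolean mask on one index level is used for the last slice because it does not depend on how the row index is sorted.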

## 2. Measures of Center and Spread

The `.describe()` method computes and displays summary statistics for a pandas dataframe.

In [None]:
titanic.describe()

In [None]:
titanic[["Age", "Fare"]].describe()

- The `percentiles` parameter is optional and takes a list-like of numbers, each between 0 and 1.

- It sets the percentiles to include in the output. The default is `[.25, .5, .75]`, which returns the 25th, 50th, and 75th percentiles.

In [None]:
titanic[["Age", "Fare"]].describe(percentiles = [0.05, 0.25])
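
A short sketch of the `percentiles` parameter on made-up numbers (the `ages` dataframe is hypothetical): each requested percentile appears as a row labelled with a percent sign.

```python
import pandas as pd

# Hypothetical toy data: seven ages
ages = pd.DataFrame({"Age": [2, 18, 25, 31, 40, 58, 71]})

# Ask for the 5th and 25th percentiles instead of the defaults
summary = ages.describe(percentiles=[0.05, 0.25])
print(summary)

# Rows of the summary are looked up by their labels
print(summary.loc["5%", "Age"])
```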

- The `.quantile()` method returns the values at the given quantile(s).

- The default is `q = 0.5`, the median (50th percentile).

In [None]:
titanic.Age.quantile()

In [None]:
titanic.Age.quantile(q = [0.2, 0.25, 0.5, 0.95])
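
To see what `.quantile()` returns, here is a sketch on five made-up values (the Series `ages` is hypothetical). With pandas' default linear interpolation, the quartiles of these evenly spaced values fall exactly on data points.

```python
import pandas as pd

# Hypothetical toy data: five evenly spaced values
ages = pd.Series([10, 20, 30, 40, 50])

# No argument: q defaults to 0.5, so this is the median
median = ages.quantile()
print(median)

# A list of quantiles returns a Series indexed by the requested q values
several = ages.quantile(q=[0.25, 0.5, 0.75])
print(several)
```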

## 3. Measures of Linear Relationship

The `.corr()` method is used to find the pairwise correlation of all numeric columns in the dataframe.

In [None]:
titanic.corr(numeric_only=True)

In [None]:
titanic[["Age", "Fare"]].corr()
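
The sketch below runs `.corr()` on made-up data (the dataframe `toy` and its `Deck` column are hypothetical). It assumes pandas 1.5 or later, where the `numeric_only` argument is available; passing `numeric_only=True` skips the text column.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Fare": [12.0, 18.0, 35.0, 41.0],
    "Deck": ["A", "B", "C", "D"],
})

# Pairwise Pearson correlations between the numeric columns only
corr_matrix = toy.corr(numeric_only=True)
print(corr_matrix)

# The same number for a single pair of columns
age_fare = toy["Age"].corr(toy["Fare"])
print(age_fare)
```

The matrix is symmetric with ones on the diagonal, so the off-diagonal entry is the only new information for two columns.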