# Exploring the Titanic Dataset from Kaggle Using Pandas

In this notebook, we will explore the [Titanic dataset](https://www.kaggle.com/c/titanic/data) from Kaggle, which lists the passengers who were on board the Titanic at the time of the tragedy in April 1912, along with various attributes (age, gender, passenger class, etc.).

As a first step let's get some summary statistics.

In [3]:
import pandas as pd

Let's load the data.

In [5]:
titanic = pd.read_csv('../kaggle/data/train.csv')
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### What is the average age of Titanic passengers?

In [6]:
titanic["Age"].mean()

29.69911764705882
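
One detail worth keeping in mind: the `Age` column contains missing values (the `describe` output further down reports a count of 714 for `Age` against 891 for `Fare`), and `mean` skips them by default. The sketch below illustrates that behaviour on a small made-up `Series`; the values are hypothetical toy data, not rows from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real Titanic file): one age is missing.
ages = pd.Series([22.0, 38.0, np.nan, 35.0], name="Age")

# Default behaviour (skipna=True): the NaN is ignored, so this is (22 + 38 + 35) / 3.
mean_skipping_nan = ages.mean()

# With skipna=False the missing value propagates and the result is NaN.
mean_with_nan = ages.mean(skipna=False)

# count() reports how many non-missing values went into the default mean.
n_present = ages.count()

print(mean_skipping_nan, mean_with_nan, n_present)
```

So the average reported above is the average over passengers whose age is known, not over all rows.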

### What is the median age and ticket fare of the Titanic passengers?

In [8]:
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

We can also call `describe` on this two-column `DataFrame` to get more summary statistics.

In [9]:
titanic[["Age", "Fare"]].describe()

| | Age | Fare |
|---|---|---|
| count | 714.0 | 891.0 |
| mean | 29.699118 | 32.204208 |
| std | 14.526497 | 49.693429 |
| min | 0.42 | 0.0 |
| 25% | 20.125 | 7.9104 |
| 50% | 28.0 | 14.4542 |
| 75% | 38.0 | 31.0 |
| max | 80.0 | 512.3292 |


Instead of relying on the predefined statistics, we can request a specific combination of aggregates for each column using `DataFrame.agg`:

In [10]:
titanic.agg({
    'Age' : ['min', 'max', 'median', 'skew'],
    'Fare' : ['min', 'max', 'mean', 'median']
})

| | Age | Fare |
|---|---|---|
| max | 80.0 | 512.3292 |
| mean | NaN | 32.204208 |
| median | 28.0 | 14.4542 |
| min | 0.42 | 0.0 |
| skew | 0.389108 | NaN |
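
The empty cells in the result above are not missing data in the source: they appear because a statistic was requested for one column but not the other. A minimal sketch of this on hypothetical toy data (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the Age and Fare columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Each column gets its own list of aggregations.
summary = toy.agg({
    "Age": ["min", "max"],
    "Fare": ["max", "mean"],
})

# The result has one row per requested statistic (min, max, mean).
# "mean" was not requested for Age and "min" was not requested for Fare,
# so those two cells are NaN.
print(summary)
```

If a gap-free result is preferred, one option is to request the same list of statistics for every column.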


### What is the average age for male versus female Titanic passengers?

In [11]:
titanic[["Sex", "Age"]].groupby("Sex").mean()

| Sex | Age |
|---|---|
| female | 27.915709 |
| male | 30.726645 |


Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the `Sex` column) is a common pattern, and the `groupby` method supports this type of operation. More generally, this fits the split-apply-combine pattern:

- Split the data into groups
- Apply a function to each group independently
- Combine the results into a data structure

The apply and combine steps are typically done together in pandas.
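
To make the three steps concrete, here is a rough sketch that spells them out by hand and compares the result with the one-line `groupby` call. The `toy` frame is hypothetical sample data, not the Titanic file.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean age of each sub-DataFrame independently.
per_group = {sex: group["Age"].mean() for sex, group in toy.groupby("Sex")}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(per_group, name="Age")

# The usual one-liner performs the same three steps internally.
builtin = toy.groupby("Sex")["Age"].mean()

print(manual)
print(builtin)
```

Both approaches give the same numbers; the one-liner is simply shorter and usually faster.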

We selected two specific columns above, but `groupby` can also be applied to the whole `DataFrame` to get the `mean` of every numerical column:

In [14]:
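# numeric_only=True skips text columns such as Name; pandas 2.0+ raises a TypeError without it
titanic.groupby("Sex").mean(numeric_only=True)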

| Sex | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| female | 431.028662 | 0.742038 | 2.159236 | 27.915709 | 0.694268 | 0.649682 | 44.479818 |
| male | 454.147314 | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893 |


### What is the mean ticket fare for each combination of sex and cabin class?

In [15]:
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
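
Grouping by two columns returns a `Series` with a two-level index (`Sex`, `Pclass`). If a table with one row per sex and one column per class is easier to read, `unstack` can pivot the inner level into columns. A small sketch on hypothetical toy data (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic file.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Fare": [80.0, 8.0, 50.0, 7.0, 9.0],
})

# Mean fare per (Sex, Pclass) pair: a Series with a two-level MultiIndex.
mean_fare = toy.groupby(["Sex", "Pclass"])["Fare"].mean()

# Move the Pclass level from the index into the columns.
fare_table = mean_fare.unstack("Pclass")

print(fare_table)
```

Individual cells can then be read with `fare_table.loc["male", 3]`.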

### What is the number of passengers in each of the cabin classes?

In [16]:
titanic["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

The `value_counts()` method counts the number of records for each category in a column.
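
As a rough illustration of how `value_counts` relates to the grouping operations used earlier, the sketch below compares it with an explicit `groupby` count and also shows the `normalize` option. The `toy` series is hypothetical sample data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Pclass column.
toy = pd.Series([3, 1, 3, 2, 3, 1], name="Pclass")

# Counts per category, sorted by frequency (most common first).
counts = toy.value_counts()

# normalize=True returns proportions instead of raw counts.
shares = toy.value_counts(normalize=True)

# Sorting by the index orders the result by class instead of by frequency.
by_class = counts.sort_index()

# Equivalent counts via groupby: group the values by themselves and count.
via_groupby = toy.groupby(toy).count()

print(counts)
print(shares)
print(via_groupby)
```

In this sense `value_counts` acts as a shortcut for a group-and-count operation, with the result sorted by frequency by default.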

To be continued...