*Optional step* Install Python and pandas from [Anaconda](https://www.anaconda.com/products/individual)
Download `titanic2.zip` data from [nextcloud](https://nextcloud.profinit.eu/index.php/s/tBkoFo8xEJwtKnJ?path=%2Fdata)

In [1]:
import pandas as pd

pd.set_option("display.precision", 2)

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg.
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

(c) [Kaggle](https://www.kaggle.com/c/titanic/overview)

In [3]:
df = pd.read_csv('titanic_train.csv')

df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.28,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.92,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


1. How many rows does the dataframe have?
2. How many columns does the dataframe have?
3. What is the percentage of non-null values in the Age column?
4. How many text columns does the dataframe have?
5. How many men and women are in the dataset?
6. What is the average age of men and women? Who is younger on average?
7. What is the percentage of passengers travelling in 3rd class cabins? (Pclass variable)
8. How much did the most expensive ticket cost? (Fare variable)
9. Describe the average age, proportion of females and average Fare per Pclass.
10. Who is more likely to travel alone, men or women? (Consider a passenger as travelling alone if they have no siblings, spouse, parents or children aboard, i.e. SibSp and Parch are both 0.)
11. What is the most popular last name? First name? (Name column)

In [44]:
# How many rows does the dataframe have?

df.shape[0]

891

In [45]:
# How many columns does the dataframe have?

df.shape[1]

12

In [47]:
# What is the percentage of non-null values in the Age column?

# Alternative: 1 - df['Age'].isna().sum() / df.shape[0]
df['Age'].count() / df.shape[0]

0.8013468013468014
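
The same ratio can also be read off a boolean mask, since the mean of a boolean Series is the fraction of `True` values. A minimal sketch on a small made-up frame (`toy` is hypothetical data, not the Titanic file):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
toy = pd.DataFrame({"Age": [22.0, None, 26.0, 35.0, None]})

# count() skips NaN, so count / number of rows is the non-null share.
share_via_count = toy["Age"].count() / len(toy)

# notna() returns a boolean mask; its mean is the same share.
share_via_mask = toy["Age"].notna().mean()

# isna() gives the complementary (null) share.
null_share = toy["Age"].isna().mean()

print(share_via_count, share_via_mask, null_share)
```

Multiplying the result by 100 gives the value as a percentage rather than a fraction.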

In [50]:
# How many text columns does the dataframe have?

(df.dtypes == 'object').sum()

5
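
Comparing `dtypes` to `'object'` relies on text columns being stored with the `object` dtype, which may not hold in every pandas version or configuration. A possible alternative is to count the non-numeric columns with `select_dtypes`. This sketch assumes that every non-numeric column is text, and it uses a hypothetical `toy` frame built from two rows of the preview above:

```python
import pandas as pd

# Hypothetical toy frame: two text columns, two numeric columns.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Age": [22.0, 26.0],
    "Pclass": [3, 3],
})

# Keep every column that is not numeric (assumed here to be text).
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(text_cols, len(text_cols))
```

Listing the column names, not only their count, makes it easy to check that nothing unexpected (for example a boolean or datetime column) was included.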

In [52]:
# How many men and women are in the dataset?

df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
# What is the average age of men and women? On average who is younger?

# df.groupby('Sex')['Age'].describe()
df.groupby('Sex').agg(avg_age=('Age', 'mean'))

Sex,avg_age
female,27.92
male,30.73


In [11]:
# What is the percentage of passengers travelling in 3rd class cabins? (Pclass variable)

# df.groupby('Pclass').agg(pass_count=('PassengerId', 'count')) / df.shape[0]
df['Pclass'].value_counts(normalize=True).sort_index()

1    0.24
2    0.21
3    0.55
Name: Pclass, dtype: float64

In [57]:
# How much did the most expensive ticket cost? (Fare variable)

df.sort_values('Fare', ascending=False).head(1)

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
258,259,1,1,"Ward, Miss. Anna",female,35.0,0,0,PC 17755,512.33,,C
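
Sorting and taking the first row returns one passenger, so any ties at the top fare stay hidden. If only the price is needed, `Series.max()` answers the question directly, and a boolean filter lists every passenger who paid it. A minimal sketch on a hypothetical `toy` frame (fares copied from the rounded values displayed above):

```python
import pandas as pd

# Hypothetical toy frame; fares are the rounded values shown in the outputs.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Ward, Miss. Anna", "Allen, Mr. William Henry"],
    "Fare": [7.25, 512.33, 8.05],
})

# The highest fare as a plain number.
max_fare = toy["Fare"].max()

# The row of the first passenger who paid it.
top_row = toy.loc[toy["Fare"].idxmax()]

# All passengers who paid the highest fare (more than one row if there are ties).
tied = toy[toy["Fare"] == max_fare]

print(max_fare, top_row["Name"], len(tied))
```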


In [58]:
# Describe average age, proportion of females and average Fare per Pclass

(
    df
    .assign(is_female=lambda _: _['Sex'] == 'female')
    .groupby('Pclass').agg(
        avg_age=('Age', 'mean'),
        prop_female=('is_female', 'mean'),
        avg_fare=('Fare', 'mean')
    )
)

Pclass,avg_age,prop_female,avg_fare
1,38.23,0.44,84.15
2,29.88,0.41,20.66
3,25.14,0.29,13.68


In [59]:
# Who is more likely to travel alone, men or women? (A passenger travels alone if SibSp and Parch are both 0, i.e. no siblings, spouse, parents or children aboard)

(
    df
    .assign(travel_alone=lambda _: (_['Parch'] == 0) & (_['SibSp'] == 0))
    .groupby('Sex').agg(prop_singles=('travel_alone', 'mean'))
)

Sex,prop_singles
female,0.4
male,0.71
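
A row-normalised cross-tabulation is another way to look at the same question, because it shows both shares (alone and with family) per sex in one table. The sketch below uses a hypothetical `toy` frame with the `Sex`, `SibSp` and `Parch` values of the five preview rows. The labels `alone` and `with_family` are names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy frame mirroring the five rows of the preview.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# A passenger is alone when they have no relatives aboard at all.
travel = (
    (toy["SibSp"] + toy["Parch"])
    .eq(0)
    .map({True: "alone", False: "with_family"})
    .rename("travel")
)

# normalize="index" makes each row (each sex) sum to 1.
table = pd.crosstab(toy["Sex"], travel, normalize="index")

print(table)
```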


In [63]:
# What is the most popular last name? First name? (Name column)

df['Name'].str.split(',').str[0].value_counts().head(5)

Andersson    9
Sage         7
Johnson      6
Carter       6
Skoog        6
Name: Name, dtype: int64

In [69]:
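# Most popular first name: every word of the name is counted, so titles (mr, miss, mrs, master) and surnames appear too;
# the first entry that is not a title is the answer

df['Name'].str.lower().str.replace(r'[,.()"]', '', regex=True).str.split().explode().value_counts().head(10)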

mr         521
miss       182
mrs        129
william     64
john        44
master      40
henry       35
george      24
james       24
charles     23
Name: Name, dtype: int64
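
Counting every token mixes titles, surnames and middle names into the ranking. A possible refinement is to split each name into its parts with `str.extract`, assuming the names follow the pattern `Last, Title. First ...` seen in the preview. The regular expression and the group names `last`, `title` and `first` are illustrative choices, and the sample names are copied from the outputs above. Names that do not fit the pattern produce `NaN`, and for married women the extracted first name may be the husband's.

```python
import pandas as pd

# Sample names copied from the outputs above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
    "Ward, Miss. Anna",
])

# Assumed layout: "<last>, <title>. <first> ..."
pattern = r"^(?P<last>[^,]+),\s*(?P<title>[^.]+)\.\s*(?P<first>[A-Za-z]+)"

# One column per named group.
parts = names.str.extract(pattern)

# Rank first names only, without titles or surnames.
first_name_counts = parts["first"].value_counts()

print(parts)
```

On the full dataset the same `value_counts()` call can be applied to `parts["last"]` or `parts["title"]` as well.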