# Titanic

**Background**: The dataset contains data for 887 real Titanic passengers. Each row represents one person. The columns describe attributes of that person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X).

In [0]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [0]:
df = pd.read_csv('titanic.csv')

## 1. What is the dimension (row, col) of the data frame?

In [3]:
df.shape

(887, 8)

The data frame has 887 rows and 8 columns.
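As a small illustration, the tuple returned by `shape` can be unpacked directly into a row count and a column count. The frame `toy` below is a made-up stand-in, not the Titanic data.

```python
import pandas as pd

# Hypothetical three-row frame standing in for titanic.csv
toy = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22.0, 38.0, 26.0]})

# shape is always (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```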

## 2. How to know the data type of each variable?

In [4]:
df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

There are 8 variables. According to `df.dtypes`, their types are:
1. Survived: integer
2. Pclass (passenger class): integer
3. Name: string (stored as `object`)
4. Sex: string (stored as `object`)
5. Age: float
6. Siblings/Spouses Aboard: integer
7. Parents/Children Aboard: integer
8. Fare: float
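When the goal is to pick columns by type rather than just list the types, `select_dtypes` may be handy. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with one integer, one text and one float column
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["A", "B"],
    "Age": [22.0, 38.0],
})

# Columns holding numbers (int64 and float64 here)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, i.e. the text column
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, other_cols)
```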

## 3. How many passengers survived (Survived=1) and not-survived (Survived=0)?

In [5]:
df.groupby('Survived').size()

Survived
0    545
1    342
dtype: int64

342 passengers survived and 545 did not.
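An alternative to `groupby(...).size()` is `value_counts`, which can also return shares instead of counts. The sketch below uses a made-up five-value series, not the real column.

```python
import pandas as pd

# Hypothetical outcome flags: three non-survivors, two survivors
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Absolute counts per outcome
counts = survived.value_counts()

# Same breakdown as proportions that sum to 1
shares = survived.value_counts(normalize=True)

print(counts)
print(shares)
```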

## 4. How to drop column ‘Name’ from the data frame?

In [6]:
df.head()

| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.05 |


In [0]:
df.drop(columns=['Name'], inplace=True)

In [8]:
df.head()

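| | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |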
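One thing to keep in mind with `inplace=True` in a notebook: re-running the cell raises a `KeyError`, because the column is already gone. A possible alternative, sketched on a made-up frame, is to assign the result of `drop` and pass `errors="ignore"` so that a repeated drop is harmless.

```python
import pandas as pd

# Hypothetical frame with a 'Name' column to remove
toy = pd.DataFrame({
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# drop returns a new frame; the original 'toy' is left untouched
trimmed = toy.drop(columns=["Name"])

# errors="ignore" turns a drop of an absent column into a no-op
trimmed_again = trimmed.drop(columns=["Name"], errors="ignore")

print(trimmed.columns.tolist())
```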


## 5. Add a new column called ‘family’ to represent the number of family members aboard (hint: family = sibsp + parch)

In [9]:
df['family'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard']
df.head()

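| | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | family |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 1 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | 1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 |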
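The same column can also be built with a row-wise sum over a list of columns, which may scale more comfortably if further family-related columns are added later. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical counts of relatives for three passengers
toy = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 0, 3],
    "Parents/Children Aboard": [0, 0, 2],
})

# axis=1 sums across the selected columns for each row
relative_cols = ["Siblings/Spouses Aboard", "Parents/Children Aboard"]
toy["family"] = toy[relative_cols].sum(axis=1)

print(toy)
```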


## 6. As shown, column ‘Age’ contains missing values. Please add a new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for a missing value and ‘NO’ for a non-missing value).


In [11]:
#Checking missing values
df.isnull().sum()

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
family                     0
dtype: int64

No column in this version of the dataset has missing values, so the flag created next is 'No' for every row.

In [0]:
# Flag missing ages with isna(); comparing with 0 never matches NaN
age_miss = np.where(df['Age'].isna(), 'Yes', 'No')

In [13]:
df['age_miss'] = age_miss
df.head()

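| | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | family | age_miss |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | 1 | No |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | 1 | No |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | 0 | No |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | 1 | No |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | 0 | No |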
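Since the real column has no gaps, the flag is easier to check on a frame that does. The sketch below uses made-up ages, one of them `NaN` and one an actual 0, to show that a missing value and a zero are different things and that `isna()` picks out only the former.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: a normal value, a missing value, and a real zero
toy = pd.DataFrame({"Age": [22.0, np.nan, 0.0]})

# NaN is not equal to 0 (or to anything else, itself included)
nan_equals_zero = bool(np.nan == 0)

# isna() is True only for the missing entry
toy["age_miss"] = np.where(toy["Age"].isna(), "Yes", "No")

print(nan_equals_zero)
print(toy)
```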


## 7. Please fill missing Age values with the mean of the existing Age values

In [0]:
df['Age'] = df['Age'].fillna(df['Age'].mean())
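To see the imputation take effect, it helps to run it on data that has a gap. In this sketch (made-up ages), `Series.mean()` skips `NaN` by default, so the fill value is the mean of the two known ages.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# mean() ignores NaN, so this is (20 + 40) / 2
mean_age = toy["Age"].mean()

# Replace the missing entry with that mean
toy["Age"] = toy["Age"].fillna(mean_age)

print(toy)
```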

## 8. What is the maximum age among the passengers who survived the tragedy?

In [15]:
df[df['Survived']==1]['Age'].max()

80.0
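If the maximum age is wanted for both outcomes at once, a `groupby` can replace the boolean filter. A small sketch on made-up rows:

```python
import pandas as pd

# Hypothetical passengers: two survivors, two non-survivors
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [70.0, 30.0, 45.0, 20.0],
})

# One maximum per value of 'Survived'
max_age_by_outcome = toy.groupby("Survived")["Age"].max()

print(max_age_by_outcome)
```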

## 9. How many passengers survived from each ‘PClass’? 

In [16]:
df[df['Survived']==1].groupby('Pclass')['Survived'].count()

Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

- 136 passengers survived in Class 1
- 87 passengers survived in Class 2
- 119 passengers survived in Class 3
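Because `Survived` is a 0/1 flag, its sum per class gives the survivor count and its mean gives the survival rate, which puts the raw counts in context when the classes differ in size. A sketch on made-up rows (column values are assumptions):

```python
import pandas as pd

# Hypothetical passengers spread over three classes
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# sum of the 0/1 flag = number of survivors, mean = share of survivors
by_class = toy.groupby("Pclass")["Survived"].agg(survivors="sum", rate="mean")

print(by_class)
```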

## 10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportions of 0.7 for titanic1 and 0.3 for titanic2?


In [17]:
titanic1 = df.sample(frac=0.7).reset_index()
titanic1.head()

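| | index | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare | family | age_miss |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249 | 0 | 3 | male | 18.0 | 0 | 0 | 7.25 | 0 | No |
| 1 | 613 | 0 | 3 | male | 34.0 | 1 | 1 | 14.4 | 2 | No |
| 2 | 498 | 0 | 3 | female | 21.0 | 0 | 0 | 7.75 | 0 | No |
| 3 | 828 | 0 | 3 | male | 30.0 | 0 | 0 | 7.2292 | 0 | No |
| 4 | 812 | 0 | 3 | female | 23.0 | 0 | 0 | 7.925 | 0 | No |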


In [18]:
# reset_index() moved the original row labels into the 'index' column,
# so drop by those labels rather than by titanic1's new 0..n-1 index
titanic2 = df.drop(titanic1['index'])
titanic2.head()

`titanic2` holds the remaining 266 rows (30%). Which rows these are depends on the random sample, so pass `random_state` to `df.sample` to make the split reproducible.
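A minimal sketch of a reproducible 70/30 split on a made-up ten-row frame. The seed value 42 is arbitrary, and the sampled part keeps its original index so that `drop` removes exactly the sampled rows.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the Titanic data
toy = pd.DataFrame({"value": range(10)})

# Fixing random_state makes the same rows come out on every run
part1 = toy.sample(frac=0.7, random_state=42)

# part1 still carries the original labels, so this is the exact complement
part2 = toy.drop(part1.index)

print(len(part1), len(part2))
```

The two parts are disjoint and together cover every row of the original frame, which is a quick sanity check worth running on any split.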
