# Business Understanding
On April 15, 1912, the Titanic, the largest passenger liner in service at the time, sank after colliding with an iceberg during her maiden voyage. The sinking killed 1502 of the 2224 passengers and crew. The tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others.

# Data Understanding
Background: The dataset contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe attributes of the person, including whether they survived (S), their age (A), their passenger class (C), their sex (G) and the fare they paid (X). The table below shows the Data Dictionary. <br>
**Download Dataset Here**
https://drive.google.com/open?id=15IyT1ODuDKgZb8WN6iG64hFJRZWJTSAz

**Then answer the questions**
1. What is the dimension (col, row) of the data frame?
2. How to know data type of each variable?
3. How many passengers survived (Survived=1) and not-survived (Survived=0)?
4. How to drop column ‘Name’ from the data frame?
5. Add one new column called ‘family’ to represent number of family-member aboard (hint: family = sibsp + parch)
6. As shown, columns ‘Age’ contains missing values. Please add new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for missing value and ‘NO’ for non-missing value). 
7. Please fill Age missing value with means of existing Age values
8. What is the maximum passenger Age who survived from the tragedy? 
9. How many passengers survived from each ‘PClass’? 
10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for titanic1 and 0.3 for titanic2?

## Import Library and Sneak Peek at the Dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('data/titanic.csv')

In [3]:
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [4]:
# show column names, non-null counts, and dtypes of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 8 columns):
Survived                   887 non-null int64
Pclass                     887 non-null int64
Name                       887 non-null object
Sex                        887 non-null object
Age                        887 non-null float64
Siblings/Spouses Aboard    887 non-null int64
Parents/Children Aboard    887 non-null int64
Fare                       887 non-null float64
dtypes: float64(2), int64(4), object(2)
memory usage: 55.6+ KB


In [5]:
# show summary statistics for the numeric columns
df.describe()

Unnamed: 0,Survived,Pclass,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
# missing value identification
df.isnull().sum()

Survived                   0
Pclass                     0
Name                       0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64

Every column has zero null entries, so the **dataframe does not have any missing values**.

#### *1. What is the dimension (col, row) of the data frame?*

In [7]:
df.shape

(887, 8)

`df.shape` returns `(rows, columns)`, so this dataframe has **8 columns** and **887 rows**.
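
Because the question asks for (col, row) while `shape` reports rows first, it can help to unpack the tuple into named variables. The snippet below is a minimal sketch on a made-up toy frame (`toy` and its values are illustrative assumptions, not the Titanic data).

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({"Survived": [0, 1, 1], "Pclass": [3, 1, 3]})

# DataFrame.shape is a (rows, columns) tuple, so unpack in that order
n_rows, n_cols = toy.shape
print(f"columns={n_cols}, rows={n_rows}")
```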

#### *2. How to know data type of each variable?*

In [8]:
df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns.<br>
This dataframe contains the following data types: <br>
- int: 4
- object: 2
- float: 2
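
The per-type tally above can also be computed rather than counted by hand. The following is a small sketch on a hypothetical toy frame; the column names mirror the dataset but the values are invented for illustration.

```python
import pandas as pd

# Toy frame with one integer, one text, and one float column (illustrative values)
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

# Count how many columns share each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List only the numeric columns
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```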

#### *3. How many passengers survived (Survived=1) and not-survived (Survived=0)?*

In [9]:
df.groupby('Survived').size()

Survived
0    545
1    342
dtype: int64

Only 342 passengers survived, while 545 did not.
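
An alternative to `groupby(...).size()` is `value_counts()`, which can also report shares instead of raw counts. This is a sketch on a short made-up series, not the real column.

```python
import pandas as pd

# Illustrative survival flags (1 = survived, 0 = did not survive)
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# Raw counts per category
counts = survived.value_counts()

# Fraction of rows per category
shares = survived.value_counts(normalize=True)

print(counts)
print(shares)
```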

#### *4. How to drop column ‘Name’ from the data frame?*

In [10]:
df = df.drop(columns='Name')
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05


#### *5. Add one new column called ‘family’ to represent number of family-member aboard (hint: family = sibsp + parch)*

In [11]:
# create the 'Family' column by adding 'Siblings/Spouses Aboard' and 'Parents/Children Aboard'
df['Family'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard']

In [12]:
# confirm that the new 'Family' column was created
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family
0,0,3,male,22.0,1,0,7.25,1
1,1,1,female,38.0,1,0,71.2833,1
2,1,3,female,26.0,0,0,7.925,0
3,1,1,female,35.0,1,0,53.1,1
4,0,3,male,35.0,0,0,8.05,0


#### *6. As shown, columns ‘Age’ contains missing values. Please add new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for missing value and ‘NO’ for non-missing value).*
The sneak peek at the dataset showed that **the dataframe has no missing values**, so the new column 'Age_miss' can be added directly and is expected to be 'No' for every row.

In [13]:
# create a list to store the flag for each row
age_miss = []
# for each row in the dataframe
for x in range(len(df)):
    # a missing Age is stored as NaN, so test with pd.isna rather than comparing to 0
    if pd.isna(df['Age'][x]):
        # Age is missing
        age_miss.append('Yes')
    else:
        # Age is present
        age_miss.append('No')

In [14]:
# create a column from the list
df['Age_miss'] = age_miss
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
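
Since this dataset has no missing ages, the flag is easier to see on data that does. The sketch below uses a toy frame with one deliberately missing value and a vectorized `np.where` in place of the loop; the 'YES'/'NO' labels follow the wording of the question, and the frame itself is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (illustrative values)
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0]})

# Flag every row at once: 'YES' where Age is NaN, 'NO' otherwise
toy["Age_miss"] = np.where(toy["Age"].isna(), "YES", "NO")
print(toy)
```

The flag has to be created before any imputation step, because once the gaps are filled there is nothing left to detect.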


#### *7. Please fill Age missing value with means of existing Age values*

In [15]:
# Age is numeric (float), so any missing values can be filled with the mean of the existing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
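
On this dataset `fillna` changes nothing because no ages are missing. To show the effect, here is a minimal sketch on a toy frame with one gap (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age (illustrative values)
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})

# Series.mean() skips NaN by default, so this is the mean of the existing values
mean_age = toy["Age"].mean()

# Replace the missing entry with that mean
toy["Age"] = toy["Age"].fillna(mean_age)
print(toy)
```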


#### *8. What is the maximum passenger Age who survived from the tragedy?*

In [16]:
maxage_survived = df[df['Survived']==1]['Age'].max()
dfmax = {'Survived':[1],
        'Maximum Age':[maxage_survived]}
dfmax = pd.DataFrame(dfmax)
dfmax

Unnamed: 0,Survived,Maximum Age
0,1,80.0


The oldest passenger who survived was 80 years old.
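
The same question can be answered for both groups at once with `groupby`, and `idxmax` retrieves the full row of the oldest survivor. This is a sketch on a hypothetical four-row frame, not the real data.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0],
    "Age": [80.0, 74.0, 35.0, 22.0],
})

# Maximum age within each survival group
max_by_group = toy.groupby("Survived")["Age"].max()
print(max_by_group)

# Full row of the oldest survivor: find the index label of the max age among survivors
oldest_idx = toy.loc[toy["Survived"] == 1, "Age"].idxmax()
oldest_survivor = toy.loc[oldest_idx]
print(oldest_survivor)
```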

#### *9. How many passengers survived from each ‘PClass’?*

In [17]:
survived = df[df['Survived']==1]
pclass_survived = survived.groupby('Pclass')['Survived'].count().reset_index()
pclass_survived

Unnamed: 0,Pclass,Survived
0,1,136
1,2,87
2,3,119
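
Because `Survived` is coded as 0/1, summing it per class gives the survivor count without filtering first, and the mean gives the survival rate. The sketch below assumes a small invented frame.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# Sum of 0/1 flags = number of survivors per class
survivors = toy.groupby("Pclass")["Survived"].sum()

# Mean of 0/1 flags = survival rate per class
rate = toy.groupby("Pclass")["Survived"].mean()

print(survivors)
print(rate)
```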


#### *10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for titanic1 and 0.3 for titanic2 ?* 

In [18]:
# sample 70% of the rows; reset_index() moves the original row labels into a new 'index' column
titanic1 = df.sample(frac=0.7).reset_index()
titanic1.head()

Unnamed: 0,index,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
0,323,1,1,female,36.0,0,0,135.6333,0,No
1,873,0,3,male,19.0,0,0,7.8958,0,No
2,770,1,2,female,54.0,1,3,23.0,4,No
3,837,0,2,male,16.0,0,0,10.5,0,No
4,721,0,3,male,20.0,0,0,8.6625,0,No


In [19]:
# number of titanic1
titanic1.shape

(621, 10)

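**70% of the data is 621 rows**. The shape shows 10 columns rather than 9 because `reset_index()` added the original row labels as an `index` column.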

In [20]:
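# proportion 0.3
# caution: titanic1's index was reset to 0..620, so this drops the first 621 rows of df
# rather than the sampled rows; df.drop(titanic1['index']) would drop the sampled rows instead
titanic2 = df.drop(titanic1.index)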
titanic2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,Family,Age_miss
621,0,3,male,21.0,0,0,16.1,0,No
622,0,1,male,61.0,0,0,32.3208,0,No
623,0,2,male,57.0,0,0,12.35,0,No
624,1,1,female,21.0,0,0,77.9583,0,No
625,0,3,male,26.0,0,0,7.8958,0,No


In [21]:
# number of titanic2
titanic2.shape

(266, 9)

**30% of the data is 266 rows**.
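
For the two parts to be disjoint and together cover every row, the second part has to be built from the original index labels of the sampled rows. The following is a minimal sketch on a toy frame; `part1`, `part2`, and the seed value are illustrative assumptions, and `random_state` is added only so the split is repeatable.

```python
import pandas as pd

# Toy frame with 10 rows (illustrative data)
toy = pd.DataFrame({"id": range(10)})

# Sample 70% of the rows; the seed is arbitrary and only makes the result repeatable
part1 = toy.sample(frac=0.7, random_state=42)

# Drop the sampled rows by their original index labels (no reset_index before this step)
part2 = toy.drop(part1.index)

print(part1.shape, part2.shape)

# Reset the indexes afterwards if clean 0..n-1 labels are wanted
part1 = part1.reset_index(drop=True)
part2 = part2.reset_index(drop=True)
```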