# QUESTION

Background: The dataset contains data for 887 of the real Titanic passengers. Each row represents one person. The columns describe different attributes of the person, including whether they survived (S), their age (A), their passenger class (C), their sex (G), and the fare they paid (X). The table below shows the Data Dictionary.

Objective: Create a Python script that answers the questions below. Add comments to the script to show which lines of code answer which question. Upload the script after you finish.

Questions :
1. What is the dimension (col, row) of the data frame?
2. How to know data type of each variable?
3. How many passengers survived (Survived=1) and not-survived (Survived=0)?
4. How to drop column ‘Name’ from the data frame?
5. Add one new column called ‘family’ to represent number of family-member aboard (hint: family = sibsp + parch)
6. As shown, the ‘Age’ column contains missing values. Please add a new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for a missing value and ‘NO’ for a non-missing value).
7. Please fill the missing Age values with the mean of the existing Age values
8. What is the maximum passenger Age who survived from the tragedy? 
9. How many passengers survived from each ‘PClass’? 
10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for titanic1 and 0.3 for titanic2?

## IMPORT LIBRARY

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

## LOAD DATASET

In [2]:
df = pd.read_csv('titanic.csv')

In [3]:
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


## 1. What is the dimension (col, row) of the data frame?

In [4]:
df.shape

(887, 8)

- `df.shape` returns (rows, columns): the dataset has 887 rows and 8 columns.
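Because `shape` is ordered (rows, columns) while the question asks for (col, row), the tuple can be unpacked and reordered. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy frame with 3 rows and 2 columns
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is (rows, columns); unpack it and swap the order
n_rows, n_cols = toy.shape
dimension = (n_cols, n_rows)
print(dimension)
```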

## 2. How to know data type of each variable?

In [5]:
df.dtypes

Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

- The dataset has 3 data types: integer (int64), float (float64), and string (stored by pandas as object).

## 3. How many passengers survived (Survived=1) and not-survived (Survived=0)?

In [9]:
df.groupby('Survived').size()

Survived
0    545
1    342
dtype: int64

- 342 passengers survived.
- 545 passengers did not survive.
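The same kind of count can also be obtained with `value_counts()`. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 0]})

# Count the rows for each distinct value of 'Survived'
counts = toy['Survived'].value_counts()

# Look the counts up by label (0 = not survived, 1 = survived)
print(counts.loc[0], counts.loc[1])
```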

## 4. How to drop column ‘Name’ from the data frame?

In [10]:
df.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [11]:
# using drop syntax
df.drop(columns=['Name'], inplace=True)

In [12]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05


## 5. Add one new column called ‘family’ to represent number of family-member aboard (hint: family = sibsp + parch)

In [13]:
df['family'] = df['Siblings/Spouses Aboard'] + df['Parents/Children Aboard']
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family
0,0,3,male,22.0,1,0,7.25,1
1,1,1,female,38.0,1,0,71.2833,1
2,1,3,female,26.0,0,0,7.925,0
3,1,1,female,35.0,1,0,53.1,1
4,0,3,male,35.0,0,0,8.05,0


## 6. As shown, the ‘Age’ column contains missing values. Please add a new column named ‘Age_miss’ to indicate whether Age is missing or not (Age_miss = ‘YES’ for a missing value and ‘NO’ for a non-missing value).


In [14]:
# check number of missing values
df.isnull().sum()

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
family                     0
dtype: int64

In [15]:
# create list to save Age_miss values
Age_miss = []
for i in range(len(df)):
    # a missing Age is stored as NaN, so test with pd.isnull rather than comparing to 0
    if pd.isnull(df['Age'][i]):
        Age_miss.append('Yes')
    else:
        Age_miss.append('No')

In [17]:
# insert value into Age_miss column
df['Age_miss'] = Age_miss
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
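A vectorized alternative to the loop flags missing ages with `isna()` and `np.where`, using the ‘YES’/‘NO’ labels from the question. This is only a sketch on a made-up frame (the `toy` name and its values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0]})

# isna() is True where Age is NaN; np.where maps True/False to the two labels
toy['Age_miss'] = np.where(toy['Age'].isna(), 'YES', 'NO')
print(toy)
```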


## 7. Please fill the missing Age values with the mean of the existing Age values

In [18]:
# using fillna
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [19]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
0,0,3,male,22.0,1,0,7.25,1,No
1,1,1,female,38.0,1,0,71.2833,1,No
2,1,3,female,26.0,0,0,7.925,0,No
3,1,1,female,35.0,1,0,53.1,1,No
4,0,3,male,35.0,0,0,8.05,0,No
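Since `isnull().sum()` reported no missing ages in this file, the effect of `fillna` is not visible in the output above. A minimal sketch on a made-up frame with one real gap (the `toy` name and its values are illustrative) shows what the call does:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age
toy = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})

# mean() skips NaN by default, so the mean of the existing values is 30.0
age_mean = toy['Age'].mean()

# Replace only the missing entries with that mean
toy['Age'] = toy['Age'].fillna(age_mean)
print(toy)
```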


## 8. What is the maximum passenger Age who survived from the tragedy? 

In [20]:
# Survived represented as 1
df[df['Survived']==1]['Age'].max()

80.0

- The maximum age among the passengers who survived is 80 years.

## 9. How many passengers survived from each ‘PClass’? 

In [21]:
# filter survived, show them by using groupby Pclass field, count the filtered survived 
df[df['Survived']==1].groupby('Pclass')['Survived'].count()

Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64
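Because `Survived` is coded 0/1, summing it per class gives the same survivor counts, and a class with no survivors still appears with 0 instead of dropping out. A minimal sketch on a made-up frame (the `toy` name and its values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical toy data: class 2 has no survivors
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3],
    'Survived': [1, 0, 0, 1, 1],
})

# Sum of a 0/1 column per group equals the number of survivors in that group
per_class = toy.groupby('Pclass')['Survived'].sum()
print(per_class)
```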

## 10. How to randomly split the data frame into 2 parts (titanic1 and titanic2) with proportion of 0.7 for titanic1 and 0.3 for titanic2?


In [22]:
# save data into titanic1 and create new index
titanic1 = df.sample(frac=0.7).reset_index()
titanic1.head()

Unnamed: 0,index,Survived,Pclass,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare,family,Age_miss
0,92,0,3,male,26.0,1,2,20.575,3,No
1,297,1,1,female,50.0,0,1,247.5208,1,No
2,303,1,1,male,0.92,1,2,151.55,3,No
3,264,0,3,male,16.0,4,1,39.6875,5,No
4,169,0,1,male,61.0,0,0,33.5,0,No


In [23]:
# save the remaining rows into titanic2
# after reset_index(), titanic1.index is just 0..620; the original row labels are in titanic1['index']
titanic2 = df.drop(titanic1['index'])
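A split like this can be checked for size and overlap. The sketch below uses a made-up frame and `random_state=42`; both are assumptions added for illustration and reproducibility, not part of the original notebook. The key point is that the second part is dropped by the sampled rows' original labels, before any index reset:

```python
import pandas as pd

# Hypothetical toy frame with 10 rows
toy = pd.DataFrame({'value': range(10)})

# Sample 70% of the rows; random_state (an assumption) makes the draw repeatable
part1 = toy.sample(frac=0.7, random_state=42)

# Drop the sampled rows by their original labels to get the remaining 30%
part2 = toy.drop(part1.index)

# The two parts share no rows and together cover the whole frame
overlap = part1.index.intersection(part2.index)
print(len(part1), len(part2), len(overlap))
```

If a fresh 0..n-1 index is wanted, `reset_index(drop=True)` can be applied to each part after the split.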


----