1. Introduction

本项目使用了kaggle提供的Titanic乘客的数据集。数据集包括了乘客的生还情况，乘客之间的关系，和他们所在船舱位置的基本情况等等。本可视化旨在通过交互式的可视化探究乘客生还情况与他们的社会和经济地位情况之间的关系。在本项目中使用d3.js和dimple.js进行交互式可视化，使用了python进行数据预处理。

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the data from csv and load it into data frame
titanic_data = pd.read_csv("Titanic_data.csv")
titanic_data.head(n=10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


上述代码罗列了本数据集前10行数据的信息，首先我需要先解释一下每一列数据的具体含义，参照下表：
Variable	Definition	                       Key
survival	Survival	                     0 = No, 1 = Yes
pclass	Ticket class	                   1 = 1st, 2 = 2nd, 3 = 3rd
sex	       Sex	
Age	    Age in years	
sibsp	# of siblings / spouses aboard the Titanic	
parch	# of parents / children aboard the Titanic	
ticket	Ticket number	
fare	Passenger fare	
cabin	Cabin number	
embarked	Port of Embarkation	         C = Cherbourg, Q = Queenstown, S = Southampton
Notes:
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them. 

In [3]:
# Take a look at the dataset suammry 

titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


我们可以看出，本数据集中一共有891条数据记录，其中Age, Cabin和Embarked三列有缺失值NaN, 其中Cabin的缺失程度最高，Age其次，Embarked最少。由于我们的分析并不涉及Cabin Number的方面，船舱号对我们最后的可视化并无实际意义
我将在数据集中完全删除Cabin这一列。关于AGE，Age将是我们后来可视化环节的一部分，但是我们并不关心统计上的问题，所以此处可以将缺失AGE的行删除。Emarked只缺失两个数据，故而我们也将其中包含空值的行删除。

In [4]:
# Drop the Cabin Column
titanic_data = titanic_data.drop(columns=['Cabin'])
titanic_data = titanic_data.dropna(axis=0, how="any")
titanic_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Name           712 non-null object
Sex            712 non-null object
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Ticket         712 non-null object
Fare           712 non-null float64
Embarked       712 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 66.8+ KB


根据汇总信息可以看出，我们已经将空值清理出数据。接下来，为了方便读者阅读分析报告，我们将部分代表布尔值的0,1替换成熟悉的说明文字。例如该乘客是否存活的0,1状态。
以及三种不同船舱等级。此处我们将根据数据说明信息，将船舱分为Upper,Middel 和Low三种等级。将登船地点还原为地点的全称。

In [5]:
# Map the data for better readbilities


def map_status(data, key, dict):
    data[key] = data[key].map(dict)
    return data

survived_map = {0: 'Perished', 1: 'Survived'}
Pclass_map = {1: 'Upper Class', 2: 'Middle Class', 3: 'Lower Class'}
embark_dict = {'S': 'Southampton', 'C': 'Cherbourg','Q': 'Queenstown'}
titanic_data = map_status(titanic_data, "Pclass", Pclass_map)
titanic_data = map_status(titanic_data, "Embarked", embark_dict)

titanic_data["Survived_Status"] = titanic_data["Survived"].map(survived_map)

titanic_data.head()       

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived_Status
0,1,0,Lower Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Southampton,Perished
1,2,1,Upper Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,Cherbourg,Survived
2,3,1,Lower Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Southampton,Survived
3,4,1,Upper Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,Southampton,Survived
4,5,0,Lower Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Southampton,Perished


我们再看数据框中的，年龄属性，可以发现年龄分布非常广泛，为了更好的研究各个年龄段的存活情况，我们将年龄分为几个有效的年龄组。
此处我们按照0-16岁，16-40,40-60,60-60。最后我们将更新后的数据存入了新的csv文件。

In [14]:
# Min and Max of Age
print(min(titanic_data["Age"])) 
print(max(titanic_data["Age"]))


def divide_age(age):
    passenger_age = age
    if passenger_age in range(0,16):
        return "Child"
    elif passenger_age in range(16,40):
        return "Youth"
    elif passenger_age in range(40,60):
        return "MiddleAge"
    else:
        return "Senior"
    

titanic_data["Age_Group"] = titanic_data["Age"].apply(divide_age)


print((titanic_data[u"Age_Group"]).value_counts())

titanic_data.head(n=10)  





0.42
80.0
Youth        456
MiddleAge    132
Child         75
Senior        49
Name: Age_Group, dtype: int64


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Survived_Status,Age_Group
0,1,0,Lower Class,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,Southampton,Perished,Youth
1,2,1,Upper Class,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,Cherbourg,Survived,Youth
2,3,1,Lower Class,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,Southampton,Survived,Youth
3,4,1,Upper Class,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,Southampton,Survived,Youth
4,5,0,Lower Class,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,Southampton,Perished,Youth
6,7,0,Upper Class,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,Southampton,Perished,MiddleAge
7,8,0,Lower Class,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,Southampton,Perished,Child
8,9,1,Lower Class,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,Southampton,Survived,Youth
9,10,1,Middle Class,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,Cherbourg,Survived,Child
10,11,1,Lower Class,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,Southampton,Survived,Child


In [None]:
# Save the processed data set to csv file
titanic_data.to_csv("Processed_Titantic.csv", index=False, sep=',')
