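# Project: Predicting Titanic Survival with Logistic Regression

## Analysis Goal

This report fits a logistic regression of survival on passenger attributes such as sex and ticket class. The resulting model can then be used to predict, from those attributes, whether passengers with unknown outcomes survived the sinking.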

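## Introduction

> The RMS Titanic was an Olympic-class ocean liner that sank after striking an iceberg on her maiden voyage in April 1912. She was the second of three superliners of her class and, together with her sister ships Olympic and Britannic, provided transatlantic travel for White Star Line passengers.

> Titanic was built by the Harland and Wolff shipyard in Belfast, Northern Ireland, and was the largest passenger ship of her time. Because her size was comparable to that of a modern aircraft carrier, she was billed as a giant liner that "even God could not sink". On her maiden voyage she departed from Southampton, England, called at Cherbourg-Octeville in France and Queenstown in Ireland, and was due to cross the Atlantic to New York City. Owing to human error, however, she struck an iceberg at 11:40 p.m. ship's time on 14 April 1912; 2 hours and 40 minutes later, at 2:20 a.m. on 15 April, she broke in two and sank into the Atlantic. More than 1,500 people died, in what has been called the greatest maritime disaster of the 20th century and is one of the best-known shipwrecks in history.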

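The dataset consists of two tables: `titanic_train.csv` and `titanic_test.csv`.

`titanic_train.csv` records whether each of more than 800 Titanic passengers survived the sinking, along with passenger details such as ticket class, sex, age, number of siblings/spouses aboard, and number of parents/children aboard.

`titanic_test.csv` contains only passenger details (for passengers who are not in `titanic_train.csv`); this file can be used to predict whether those passengers survived.

The columns of `titanic_train.csv` are:
- PassengerId: passenger ID
- survival: whether the passenger survived
   - 0  no
   - 1  yes
- pclass: ticket class
   - 1  first class
   - 2  second class
   - 3  third class
- sex: sex
- Age: age
- sibsp: number of siblings/spouses aboard
- parch: number of parents/children aboard
- ticket: ticket number
- fare: ticket price
- cabin: cabin number
- embarked: port of embarkation
   - C  Cherbourg
   - Q  Queenstown
   - S  Southampton


The columns of `titanic_test.csv` have the same meaning, except that the survival variable (whether the passenger survived) is not included.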

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
raw_data = pd.read_csv("titanic_train.csv")
raw_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Assessing the Data

### Tidiness

In [3]:
raw_data.sample(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
221,222,0,2,"Bracken, Mr. James H",male,27.0,0,0,220367,13.0,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.225,,C
63,64,0,3,"Skoog, Master. Harald",male,4.0,3,2,347088,27.9,,S
291,292,1,1,"Bishop, Mrs. Dickinson H (Helen Walton)",female,19.0,1,0,11967,91.0792,B49,C
402,403,0,3,"Jussila, Miss. Mari Aina",female,21.0,1,0,4137,9.825,,S
482,483,0,3,"Rouse, Mr. Richard Henry",male,50.0,0,0,A/5 3594,8.05,,S
569,570,1,3,"Jonsson, Mr. Carl",male,32.0,0,0,350417,7.8542,,S
36,37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C
562,563,0,2,"Norman, Mr. Robert Douglas",male,28.0,0,0,218629,13.5,,S
849,850,1,1,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",female,,1,0,17453,89.1042,C92,C


### Cleanliness

In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The output shows 891 rows in total, with missing values in Age, Cabin, and Embarked. PassengerId should be stored as text, and Survived, Pclass, Sex, and Embarked should be categorical.

First, convert PassengerId, Survived, Pclass, Sex, and Embarked to the appropriate types.

In [5]:
raw_data["PassengerId"] = raw_data["PassengerId"].astype('str')
raw_data["Survived"] = raw_data["Survived"].astype('category')
raw_data["Pclass"] = raw_data["Pclass"].astype('category')
raw_data["Sex"] = raw_data["Sex"].astype('category')
raw_data["Embarked"] = raw_data["Embarked"].astype('category')
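The same conversion can be written as a single `astype` call with a column-to-dtype mapping. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and checks which columns ended up categorical:

```python
import pandas as pd

# Toy frame with the same column names as the training data (made-up values).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# Convert several columns at once by passing a column -> dtype mapping.
toy = toy.astype({
    "PassengerId": "str",
    "Survived": "category",
    "Pclass": "category",
    "Sex": "category",
    "Embarked": "category",
})

category_columns = toy.select_dtypes(include="category").columns.tolist()
print(category_columns)
# Missing values stay missing after the conversion.
print(toy["Embarked"].isna().sum())
```

Listing the categorical columns afterwards is a quick way to confirm that no column was missed.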

#### Missing Data

Missing values in Age

In [6]:
raw_data[raw_data["Age"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
859,860,0,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C
863,864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S
868,869,0,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S
878,879,0,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S


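Age is missing in 177 rows, about 20% of the data. Dropping that many rows would discard a large share of the sample, so we fill the missing values with the overall mean age instead.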

In [7]:
raw_data["Age"] = raw_data["Age"].fillna(raw_data["Age"].mean())
raw_data["Age"].isnull().sum()

0
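Mean imputation keeps the column mean unchanged but reduces its spread, because every filled value sits exactly at the centre. The sketch below shows this on a short made-up series; the median variant is included only as a common alternative, not as something this analysis uses:

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 4.0])

filled_mean = ages.fillna(ages.mean())      # what the cell above does
filled_median = ages.fillna(ages.median())  # alternative, less sensitive to outliers

# The mean is preserved, but the standard deviation shrinks.
print(ages.mean(), filled_mean.mean())
print(ages.std(), filled_mean.std())
```

The same effect is visible later in the `describe()` output, where the median of Age equals the imputed mean 29.699118.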

Missing values in Cabin

In [8]:
raw_data[raw_data["Cabin"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.000000,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.000000,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.000000,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,,S


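We do not expect the cabin number to matter for the logistic regression model (the column is dropped before fitting), so these missing values can be left as they are.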

Missing values in Embarked

In [9]:
raw_data[raw_data["Embarked"].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


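We do not expect the port of embarkation to matter for the logistic regression model (the column is dropped before fitting), so these missing values can be left as they are.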

#### Duplicate Data

By definition, each PassengerId should correspond to exactly one passenger, so we check it for duplicates.

In [10]:
raw_data["PassengerId"].duplicated().sum()

0

The output shows that there are no duplicated IDs.

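#### Inconsistent Data

Inconsistent data can appear in any categorical variable, so we check each one for different values that refer to the same thing.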

In [11]:
raw_data['Survived'].value_counts()

Survived
0    549
1    342
Name: count, dtype: int64

In [12]:
raw_data['Pclass'].value_counts()

Pclass
3    491
1    216
2    184
Name: count, dtype: int64

In [13]:
raw_data['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [14]:
raw_data['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

The output shows no cases where different values refer to the same category.
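With only four categorical columns, reading each `value_counts()` by eye is enough. For wider tables, one possible automated check is to compare the number of distinct labels before and after normalising case and whitespace. This is a sketch on made-up data, and the normalisation rule is an assumption about what kind of inconsistency to look for:

```python
import pandas as pd

# Made-up labels: "Male " is an inconsistent spelling of "male".
toy = pd.DataFrame({
    "Sex": ["male", "Male ", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

suspicious = []
for column in ["Sex", "Embarked"]:
    raw_count = toy[column].nunique()
    normalized_count = toy[column].str.strip().str.lower().nunique()
    # Fewer labels after normalisation means some labels differed only in form.
    if normalized_count < raw_count:
        suspicious.append(column)

print(suspicious)
```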

#### Invalid or Incorrect Data

In [15]:
raw_data.describe()

Unnamed: 0,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,891.0
mean,29.699118,0.523008,0.381594,32.204208
std,13.002015,1.102743,0.806057,49.693429
min,0.42,0.0,0.0,0.0
25%,22.0,0.0,0.0,7.9104
50%,29.699118,0.0,0.0,14.4542
75%,35.0,1.0,0.0,31.0
max,80.0,8.0,6.0,512.3292


No values fall outside a realistic range.
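The visual check of `describe()` can also be expressed as explicit range rules. In the sketch below the toy rows reuse the minimum and maximum values from the table above, while the bounds themselves (for example an upper age limit of 120) are assumptions chosen for illustration:

```python
import pandas as pd

# Rows built from the min / max values reported by describe() above,
# plus one made-up row in between.
toy = pd.DataFrame({
    "Age": [0.42, 29.7, 80.0],
    "SibSp": [0, 1, 8],
    "Parch": [0, 0, 6],
    "Fare": [0.0, 14.4542, 512.3292],
})

# Each rule yields a boolean Series: True where the value is plausible.
checks = {
    "Age": toy["Age"].between(0, 120),   # assumed plausible age range
    "SibSp": toy["SibSp"] >= 0,
    "Parch": toy["Parch"] >= 0,
    "Fare": toy["Fare"] >= 0,
}

violations = {name: int((~ok).sum()) for name, ok in checks.items()}
print(violations)
```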

## Logistic Regression

In [16]:
import statsmodels.api as sm

In [17]:
raw_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.000000,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.000000,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.000000,0,0,111369,30.0000,C148,C


SibSp (siblings/spouses aboard) and Parch (parents/children aboard) both count accompanying family members, so we merge them into a single column, Family_sum.

In [18]:
raw_data["Family_sum"] = raw_data["SibSp"] + raw_data["Parch"]
raw_data = raw_data.drop(["SibSp","Parch"], axis = 1)

Remove the variables that are unlikely to affect the probability of survival

In [19]:
clean_data = raw_data.drop(["Name","Ticket","PassengerId","Cabin","Embarked"], axis = 1)

In [20]:
clean_data

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Family_sum
0,0,3,male,22.000000,7.2500,1
1,1,1,female,38.000000,71.2833,1
2,1,3,female,26.000000,7.9250,0
3,1,1,female,35.000000,53.1000,1
4,0,3,male,35.000000,8.0500,0
...,...,...,...,...,...,...
886,0,2,male,27.000000,13.0000,0
887,1,1,female,19.000000,30.0000,0
888,0,3,female,29.699118,23.4500,3
889,1,1,male,26.000000,30.0000,0


Convert the categorical variables to dummy variables

In [21]:
clean_data = pd.get_dummies(clean_data, dtype= int, drop_first=True, columns=['Sex', 'Pclass'])
clean_data

Unnamed: 0,Survived,Age,Fare,Family_sum,Sex_male,Pclass_2,Pclass_3
0,0,22.000000,7.2500,1,1,0,1
1,1,38.000000,71.2833,1,0,0,0
2,1,26.000000,7.9250,0,0,0,1
3,1,35.000000,53.1000,1,0,0,0
4,0,35.000000,8.0500,0,1,0,1
...,...,...,...,...,...,...,...
886,0,27.000000,13.0000,0,1,1,0
887,1,19.000000,30.0000,0,0,0,0
888,0,29.699118,23.4500,3,0,0,1
889,1,26.000000,30.0000,0,1,0,0
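With `drop_first=True`, the first level of each categorical variable gets no column of its own and becomes the reference level: here `female` for Sex and class 1 for Pclass. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up passengers; the second one is a first-class woman.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 2],
}).astype("category")

dummies = pd.get_dummies(toy, dtype=int, drop_first=True, columns=["Sex", "Pclass"])
print(dummies)

# The reference combination (female, first class) is the all-zero row.
reference_row = dummies.iloc[1].tolist()
```

This is why the coefficients of Sex_male, Pclass_2, and Pclass_3 are read as comparisons against women and against first class.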


Check the correlations between the variables

In [22]:
clean_data.corr().abs() >0.8

Unnamed: 0,Survived,Age,Fare,Family_sum,Sex_male,Pclass_2,Pclass_3
Survived,True,False,False,False,False,False,False
Age,False,True,False,False,False,False,False
Fare,False,False,True,False,False,False,False
Family_sum,False,False,False,True,False,False,False
Sex_male,False,False,False,False,True,False,False
Pclass_2,False,False,False,False,False,True,False
Pclass_3,False,False,False,False,False,False,True


Apart from each variable with itself, no pair has an absolute correlation above 0.8.
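Scanning a boolean matrix works for seven columns, but listing only the offending pairs scales better. The sketch below uses synthetic data with hypothetical column names `x1`, `x2`, `x3`, where `x2` is built to be almost a multiple of `x1`:

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),                      # independent noise
})

corr = toy.corr().abs()

# Visit each unordered pair once, which skips the diagonal automatically.
high_pairs = [
    (a, b) for a, b in combinations(corr.columns, 2) if corr.loc[a, b] > 0.8
]
print(high_pairs)
```

An empty list corresponds to the all-`False` off-diagonal result seen above.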

Define the dependent variable y and the independent variables x


In [23]:
y = clean_data["Survived"]

In [24]:
x = clean_data.drop("Survived", axis = 1)

Add the intercept

In [25]:
x = sm.add_constant(x)

Fit the logistic regression model

In [26]:
model = sm.Logit(y, x).fit()

Optimization terminated successfully.
         Current function value: 0.443547
         Iterations 6


Inspect the results

In [27]:
model.summary()

0,1,2,3
Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,884.0
Method:,MLE,Df Model:,6.0
Date:,"Thu, 11 Jul 2024",Pseudo R-squ.:,0.3339
Time:,18:29:58,Log-Likelihood:,-395.2
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,1.786e-82

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.8097,0.445,8.568,0.000,2.938,4.681
Age,-0.0388,0.008,-4.963,0.000,-0.054,-0.023
Fare,0.0032,0.002,1.311,0.190,-0.002,0.008
Family_sum,-0.2430,0.068,-3.594,0.000,-0.376,-0.110
Sex_male,-2.7759,0.199,-13.980,0.000,-3.165,-2.387
Pclass_2,-1.0003,0.293,-3.416,0.001,-1.574,-0.426
Pclass_3,-2.1324,0.289,-7.373,0.000,-2.699,-1.566


At a significance level of 0.05, Fare (ticket price) has no significant effect in the model (p = 0.190), so we drop it and refit.
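The selection rule can be written down explicitly. The sketch below copies the p-values from the summary table above into a Series by hand; with a fitted statsmodels result the same Series is available as `model.pvalues`:

```python
import pandas as pd

# p-values as printed in the first model summary (rounded to three decimals).
p_values = pd.Series({
    "const": 0.000,
    "Age": 0.000,
    "Fare": 0.190,
    "Family_sum": 0.000,
    "Sex_male": 0.000,
    "Pclass_2": 0.001,
    "Pclass_3": 0.000,
})

alpha = 0.05
# The intercept is kept regardless of its p-value.
candidates = p_values.drop("const")
not_significant = candidates[candidates > alpha].index.tolist()
print(not_significant)
```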

In [28]:
x = x.drop("Fare", axis = 1)
model = sm.Logit(y, x).fit()
model.summary()

Optimization terminated successfully.
         Current function value: 0.444623
         Iterations 6


0,1,2,3
Dep. Variable:,Survived,No. Observations:,891.0
Model:,Logit,Df Residuals:,885.0
Method:,MLE,Df Model:,5.0
Date:,"Thu, 11 Jul 2024",Pseudo R-squ.:,0.3323
Time:,18:29:58,Log-Likelihood:,-396.16
converged:,True,LL-Null:,-593.33
Covariance Type:,nonrobust,LLR p-value:,4.927e-83

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.0620,0.404,10.049,0.000,3.270,4.854
Age,-0.0395,0.008,-5.065,0.000,-0.055,-0.024
Family_sum,-0.2186,0.065,-3.383,0.001,-0.345,-0.092
Sex_male,-2.7854,0.198,-14.069,0.000,-3.173,-2.397
Pclass_2,-1.1798,0.261,-4.518,0.000,-1.692,-0.668
Pclass_3,-2.3458,0.242,-9.676,0.000,-2.821,-1.871


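From the summary table, the following variables have a significant effect on survival: Age, Family_sum (number of family members aboard), Sex_male (being male), and Pclass_2 and Pclass_3 (ticket class).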
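The cells that follow exponentiate each coefficient. The reason is that logistic regression is linear in the log-odds, not in the probability. Writing $p$ for the probability of survival and using $\beta_0, \dots, \beta_5$ as our own notation for the fitted coefficients, the model is:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Family\_sum} + \beta_3\,\text{Sex\_male} + \beta_4\,\text{Pclass\_2} + \beta_5\,\text{Pclass\_3}
$$

Increasing one variable $x_j$ by one unit while holding the others fixed adds $\beta_j$ to the log-odds, so the odds $p/(1-p)$ are multiplied by a constant factor:

$$
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
$$

The quantity $e^{\beta_j}$ is the odds ratio. It describes a relative change in the odds; the corresponding change in the probability itself depends on the values of the other variables.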

In [29]:
# Age
np.exp(-0.0395)

0.9612699539905982

For each additional year of age, the odds of survival fall by about 4%, holding the other variables fixed.

In [30]:
# Family_sum
np.exp(-0.2186)

0.803643111115195

Each additional family member aboard lowers the odds of survival by about 20%.

In [31]:
# Sex_male
np.exp(-2.7854)

0.061704402333015156

The odds of survival for men are about 94% lower than for women.

In [32]:
# Pclass_2
np.exp(-1.1798)

0.30734020049483596

The odds of survival in second class are about 70% lower than in first class.

In [33]:
# Pclass_3
np.exp(-2.3458)

0.09577055503172162

The odds of survival in third class are about 90% lower than in first class.
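All five odds ratios can be computed in one step, and the difference between a change in odds and a change in probability can be made concrete. The sketch below copies the coefficients from the second summary table; the passenger profile (a 30-year-old travelling alone in first class) and the helper `survival_probability` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Coefficients as printed in the second model summary.
coefs = pd.Series({
    "const": 4.0620,
    "Age": -0.0395,
    "Family_sum": -0.2186,
    "Sex_male": -2.7854,
    "Pclass_2": -1.1798,
    "Pclass_3": -2.3458,
})

odds_ratios = np.exp(coefs.drop("const"))
pct_change_in_odds = (odds_ratios - 1) * 100
print(pct_change_in_odds.round(1))

def survival_probability(age, family_sum, sex_male, pclass_2, pclass_3):
    """Apply the logistic function to the linear predictor."""
    z = (coefs["const"] + coefs["Age"] * age + coefs["Family_sum"] * family_sum
         + coefs["Sex_male"] * sex_male + coefs["Pclass_2"] * pclass_2
         + coefs["Pclass_3"] * pclass_3)
    return 1 / (1 + np.exp(-z))

# Same hypothetical profile, differing only in sex.
p_female = survival_probability(30, 0, 0, 0, 0)
p_male = survival_probability(30, 0, 1, 0, 0)
observed_ratio = (p_male / (1 - p_male)) / (p_female / (1 - p_female))
print(p_female, p_male, observed_ratio)
```

For this profile the odds ratio between the man and the woman equals `exp(-2.7854)`, yet the predicted probability falls by far less than 94% in relative terms, which is why the effects above are stated in terms of odds.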

Next, we use the fitted model to predict, from their attributes, whether passengers with unknown outcomes survived the sinking.

In [34]:
titanic_test = pd.read_csv("titanic_test.csv")
titanic_test.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Prepare the data

In [35]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


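The output shows missing values in Age, Fare, and Cabin. Of these, only Age is used by the model, so only Age needs to be filled. We also merge SibSp and Parch into Family_sum and treat Pclass and Sex as categorical variables. The columns that do not enter the model are left out when the prediction matrix is built.

In [36]:
# Fill missing ages with the mean age of the test set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic_test["Age"].mean())
titanic_test["Family_sum"] = titanic_test["SibSp"] + titanic_test["Parch"]
# Pclass is stored as integers, so the categories must be integers too;
# string categories such as '1' match nothing and turn every value into NaN
titanic_test["Pclass"] = pd.Categorical(titanic_test["Pclass"], categories=[1, 2, 3])
titanic_test["Sex"] = pd.Categorical(titanic_test["Sex"], categories=['female','male'])
titanic_test["Embarked"] = pd.Categorical(titanic_test["Embarked"], categories=['C','Q','S'])
titanic_test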

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_sum
0,892,,"Kelly, Mr. James",male,34.50000,0,0,330911,7.8292,,Q,0
1,893,,"Wilkes, Mrs. James (Ellen Needs)",female,47.00000,1,0,363272,7.0000,,S,1
2,894,,"Myles, Mr. Thomas Francis",male,62.00000,0,0,240276,9.6875,,Q,0
3,895,,"Wirz, Mr. Albert",male,27.00000,0,0,315154,8.6625,,S,0
4,896,,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.00000,1,1,3101298,12.2875,,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,"Spector, Mr. Woolf",male,30.27259,0,0,A.5. 3236,8.0500,,S,0
414,1306,,"Oliva y Ocana, Dona. Fermina",female,39.00000,0,0,PC 17758,108.9000,C105,C,0
415,1307,,"Saether, Mr. Simon Sivertsen",male,38.50000,0,0,SOTON/O.Q. 3101262,7.2500,,S,0
416,1308,,"Ware, Mr. Frederick",male,30.27259,0,0,359309,8.0500,,S,0
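`pd.Categorical` keeps only the values that appear in `categories`, and an integer class code such as `3` does not match the string label `'3'`. An entirely empty Pclass column after the conversion is the symptom of such a mismatch. A minimal sketch with made-up class codes:

```python
import pandas as pd

# Made-up integer class codes; note that class 2 does not occur.
pclass = pd.Series([3, 1, 3], name="Pclass")

wrong = pd.Categorical(pclass, categories=["1", "2", "3"])  # string labels
right = pd.Categorical(pclass, categories=[1, 2, 3])        # integer labels

print(wrong.isna().all())  # every value was replaced by NaN
print(list(right))

# Fixing the category list also guarantees a dummy column for every level,
# even one that is absent from the data at hand.
dummies = pd.get_dummies(pd.Series(right, name="Pclass"),
                         prefix="Pclass", drop_first=True, dtype=int)
print(dummies.columns.tolist())
```

Declaring the full category list is what keeps the test-set dummy columns identical to the ones the model was trained on.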


Convert the categorical variables in titanic_test to dummy variables

In [37]:
titanic_test = pd.get_dummies(titanic_test, dtype = int, drop_first= True, columns=["Pclass","Sex"])
titanic_test

Unnamed: 0,PassengerId,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family_sum,Pclass_2,Pclass_3,Sex_male
0,892,"Kelly, Mr. James",34.50000,0,0,330911,7.8292,,Q,0,0,0,1
1,893,"Wilkes, Mrs. James (Ellen Needs)",47.00000,1,0,363272,7.0000,,S,1,0,0,0
2,894,"Myles, Mr. Thomas Francis",62.00000,0,0,240276,9.6875,,Q,0,0,0,1
3,895,"Wirz, Mr. Albert",27.00000,0,0,315154,8.6625,,S,0,0,0,1
4,896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.00000,1,1,3101298,12.2875,,S,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,"Spector, Mr. Woolf",30.27259,0,0,A.5. 3236,8.0500,,S,0,0,0,1
414,1306,"Oliva y Ocana, Dona. Fermina",39.00000,0,0,PC 17758,108.9000,C105,C,0,0,0,0
415,1307,"Saether, Mr. Simon Sivertsen",38.50000,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,0,0,1
416,1308,"Ware, Mr. Frederick",30.27259,0,0,359309,8.0500,,S,0,0,0,1


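Feed the test data into the regression model

In [38]:
# predict() matches columns by position, not by name, so they must follow
# the training order: const, Age, Family_sum, Sex_male, Pclass_2, Pclass_3
x_test = titanic_test[["Age", "Family_sum", "Sex_male", "Pclass_2", "Pclass_3"]]
x_test = sm.add_constant(x_test)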
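As far as we can tell, a statsmodels model fitted from a DataFrame (rather than a formula) multiplies the prediction matrix by the coefficient vector position by position, so the column order of the prediction matrix has to match the training matrix. The sketch below imitates that positional multiplication with NumPy; the coefficients are copied from the second summary table, while the single passenger and the helper `predict_positional` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Coefficients from the second model summary, in training column order.
params = pd.Series({
    "const": 4.0620,
    "Age": -0.0395,
    "Family_sum": -0.2186,
    "Sex_male": -2.7854,
    "Pclass_2": -1.1798,
    "Pclass_3": -2.3458,
})

# One made-up third-class man, with the dummy columns in a different order.
x_new = pd.DataFrame({
    "const": [1.0], "Age": [34.5], "Family_sum": [0],
    "Pclass_2": [0], "Pclass_3": [1], "Sex_male": [1],
})

def predict_positional(frame):
    """Logistic prediction that ignores column names, as a plain array would."""
    z = frame.to_numpy(dtype=float) @ params.to_numpy()
    return 1 / (1 + np.exp(-z))

misaligned = predict_positional(x_new)[0]
# Reordering by the coefficient index restores the intended pairing.
aligned = predict_positional(x_new[params.index])[0]
print(misaligned, aligned)
```

A mismatched order raises no error and still yields plausible-looking probabilities, which makes the problem easy to miss. With a fitted result, one defensive option is to reorder with `x_test[model.params.index]` before predicting.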

In [39]:
result_test = model.predict(x_test)
result_test

0      0.587485
1      0.879434
2      0.324638
3      0.656963
4      0.940242
         ...   
413    0.627274
414    0.925647
415    0.548744
416    0.627274
417    0.520809
Length: 418, dtype: float64

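This gives a predicted survival probability for each passenger. We treat probabilities above 0.5 as survived and all others as not survived.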

In [40]:
result_test > 0.5

0       True
1       True
2      False
3       True
4       True
       ...  
413     True
414     True
415     True
416     True
417     True
Length: 418, dtype: bool
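To keep the predictions together with the passengers they belong to, the boolean result can be converted to 0/1 and stored next to PassengerId. The sketch below uses made-up IDs and probabilities; the 0.5 cut-off is the one chosen above, and the output column name `Survived` is an assumption that mirrors the training data:

```python
import pandas as pd

# Made-up IDs and probabilities standing in for titanic_test and result_test.
passenger_ids = pd.Series([892, 893, 894])
probabilities = pd.Series([0.62, 0.91, 0.18])

threshold = 0.5
predictions = pd.DataFrame({
    "PassengerId": passenger_ids,
    # True / False -> 1 / 0, matching the coding of Survived in the training data.
    "Survived": (probabilities > threshold).astype(int),
})
print(predictions)
```

Lowering or raising `threshold` trades false negatives against false positives; 0.5 is simply the conventional default.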