# Background

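- Titanic survival prediction: https://www.kaggle.com/competitions/titanic/
    - The training set provides passenger features together with a survival label.
    - The task is to build a model on the training set and use it to predict survival for the test-set passengers from their features.
    - The label takes two values, so this is a binary classification problem.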

In [1]:
import pandas as pd
import numpy as np
import catboost as cbt
from sklearn.utils import shuffle 

# Process

## 1. Load and inspect the data

- In this step, inspecting the data determines the later choices for data cleaning, feature engineering, and model selection.

In [2]:
ori_train = pd.read_csv(r"E:\B_dsproject\dataset\titanic\train.csv", header=0)

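- Print the column names
- Print the size of the training set
- Print a summary and check for missing values
- Print the ratio of positive to negative samples

Conclusions:
- The training set is small, so an overly complex model is not appropriate
- There are missing values to handle
- The positive-to-negative ratio is roughly 1:1.6 (342 survivors vs. 549 non-survivors); class imbalance could be addressed specifically, or set aside for now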

In [3]:
print("Columns\n", ori_train.columns)
print("\nShape\n", ori_train.shape)
print(ori_train.info())
print("\n Label distribution \n", ori_train["Survived"].value_counts())

Columns
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Shape
 (891, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

 Label distribution 
 0    549
1    342
Name: Survived, dtype: int64
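
The imbalance noted in the conclusions can be quantified directly from the printed label counts. The following is a minimal sketch that uses only those two counts; feeding the resulting ratio to CatBoost as a positive-class weight (for example through its `scale_pos_weight` parameter) is a possible follow-up and not something this notebook does.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
label_counts = pd.Series({0: 549, 1: 342}, name="Survived")

# Share of each class in the training set.
class_share = label_counts / label_counts.sum()

# Negatives per positive; a common starting point for a positive-class weight.
neg_per_pos = label_counts[0] / label_counts[1]

print(class_share.round(3))
print(f"negatives per positive: {neg_per_pos:.2f}")
```

A ratio this close to 1 is mild, which is consistent with the decision to set imbalance handling aside for now.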

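Print some rows to see what the data looks like

Conclusions:
- PassengerId should not be used as a training feature
- Name should not be used as a training feature (the titles it contains, such as Mr or Miss, might be useful, but this is set aside for now)
- Categorical features: Pclass, Sex, Cabin, Embarked
- Numerical feature: Fare
- SibSp, Age, and Parch can be treated as either categorical or numerical, depending on the model
- Ticket should be processed and then used as a categorical feature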
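
The title idea mentioned for Name can be sketched with a regular expression. This is only an illustration: `extract_title` is a hypothetical helper (the notebook later references a `title_gen` function in a commented-out line, but its code is not shown), and the two names are the ones visible in the printed rows.

```python
import re

import pandas as pd


def extract_title(name: str) -> str:
    """Return the honorific between the comma and the first period, or 'Unknown'."""
    match = re.search(r",\s*([^.,]+)\.", str(name))
    return match.group(1).strip() if match else "Unknown"


# Names as they appear in the printed rows (the second one is truncated there).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
])
titles = names.apply(extract_title)
print(titles.tolist())
```

Rare titles would probably need to be merged into a shared bucket before being used as a categorical feature.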

In [4]:
print(ori_train[:20])

    PassengerId  Survived  Pclass  \
0             1         0       3   
1             2         1       1   
2             3         1       3   
3             4         1       1   
4             5         0       3   
5             6         0       3   
6             7         0       1   
7             8         0       3   
8             9         1       3   
9            10         1       2   
10           11         1       3   
11           12         1       1   
12           13         0       3   
13           14         0       3   
14           15         0       3   
15           16         1       2   
16           17         0       3   
17           18         1       2   
18           19         0       3   
19           20         1       3   

                                                 Name     Sex   Age  SibSp  \
0                             Braund, Mr. Owen Harris    male  22.0      1   
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.

- Print the missing-value ratio per column for a clearer view

In [5]:
col_null = ori_train.isnull().sum(axis=0) / len(ori_train)
print(col_null)

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64


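- Print the distributions of Age, Parch, and SibSp to see whether they can serve as categorical features

Conclusions:
- Age must be bucketed before it can be used as a categorical feature.
- SibSp and Parch can be used directly as categorical features (or as numerical ones).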

In [6]:
print("\n Age distribution \n", ori_train["Age"].value_counts())
print("\n SibSp distribution \n", ori_train["SibSp"].value_counts())
print("\n Parch distribution \n", ori_train["Parch"].value_counts())


 Age distribution 
 24.00    30
22.00    27
18.00    26
19.00    25
30.00    25
         ..
55.50     1
70.50     1
66.00     1
23.50     1
0.42      1
Name: Age, Length: 88, dtype: int64

 SibSp distribution 
 0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

 Parch distribution 
 0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
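
Age has 88 distinct values, so using it as a categorical feature requires bucketing first. A minimal sketch with `pd.cut` follows; the bucket edges and labels are assumptions chosen for illustration, and the sample ages are a few values from the distribution above plus one missing value.

```python
import numpy as np
import pandas as pd

# A few ages taken from the distribution above, plus one missing value.
ages = pd.Series([0.42, 18.0, 24.0, 30.0, 55.5, 70.5, np.nan], name="Age")

# Hypothetical bucket edges and labels; tune them for the actual data.
bin_edges = [0, 12, 18, 40, 60, np.inf]
bin_labels = ["child", "teen", "adult", "middle_aged", "senior"]

# Right-closed intervals: (0, 12], (12, 18], (18, 40], (40, 60], (60, inf].
age_bucket = pd.cut(ages, bins=bin_edges, labels=bin_labels, right=True)

# Keep missing ages as their own category instead of dropping them.
age_bucket = age_bucket.cat.add_categories("missing").fillna("missing")
print(age_bucket.tolist())
```

Giving missing ages their own bucket keeps the roughly 20% of rows without an age usable as a category.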


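## 2. Summary of findings and model selection

Findings about the dataset:

- The training set is small, so an overly complex model is not appropriate
- There are missing values to handle
- The positive-to-negative ratio is roughly 1:1.6; class imbalance could be addressed specifically, or set aside for now
- Categorical and continuous features coexist

Model selection:
- Simple candidates include logistic regression, SVM, GBDT-family models, FM, FFM, and so on
- Because there are missing values and the model should not be too complex, a GBDT-family model is preferred: XGBoost, LightGBM, or CatBoost
- Because the dataset has many categorical features and feature combinations are needed, **we choose CatBoost**:
    - It handles missing values (natively for numerical features)
    - It is optimized for categorical features
    - It can combine features while building trees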

In [7]:
# Roughly classify tickets into a categorical feature: 1 if purely numeric, 0 otherwise
def transformation_ticket(data):
    data = str(data).strip()
    if data.isnumeric(): 
        return 1
    else:
        return 0

- Based on the chosen model:
    - Handle missing values
    - Split the samples

In [8]:
ori_train = shuffle(ori_train, random_state=2023)
ori_train = ori_train.fillna({'Embarked': "NaN"})
ori_train["Ticket"] = ori_train["Ticket"].apply(transformation_ticket)
# ori_train["Title"] = ori_train["Name"].apply(title_gen)
train = ori_train[: int(0.8 * len(ori_train))]
print(train["Survived"].value_counts())
validation = ori_train[int(0.8 * len(ori_train)):]
print(validation["Survived"].value_counts())

0    427
1    285
Name: Survived, dtype: int64
0    122
1     57
Name: Survived, dtype: int64
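
With the shuffle-and-slice split above, survivors make up about 40.0% of the training part (285 of 712) but only about 31.8% of the validation part (57 of 179). A stratified split keeps the two shares close. The sketch below is one possible way to do it with pandas alone, on a toy frame that only reproduces the label counts; it is not the split used in this notebook.

```python
import pandas as pd

# Toy frame with the same label counts as the training set (549 vs. 342).
df = pd.DataFrame({"Survived": [0] * 549 + [1] * 342})

# Sample 20% of each class for validation, then use the rest for training.
validation = df.groupby("Survived", group_keys=False).sample(frac=0.2, random_state=2023)
train = df.drop(validation.index)

# The positive share should now be nearly identical in both parts.
print(train["Survived"].mean(), validation["Survived"].mean())
```

`sklearn.model_selection.train_test_split` with its `stratify` argument would be an equivalent alternative.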


- Set the model parameters, evaluation metric, and training strategy

In [9]:
model = cbt.CatBoostClassifier(iterations=500,
                           depth=6,
                           learning_rate=0.005, l2_leaf_reg=0.1,min_data_in_leaf=7, random_state=2023, feature_weights=[1, 1],
                           loss_function='Logloss', eval_metric="F1", early_stopping_rounds=50, nan_mode="Max",
                           verbose=True, use_best_model=True,cat_features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Ticket"])
# model = cbt.CatBoostClassifier(iterations=500,
#                            depth=6,
#                            learning_rate=0.005, l2_leaf_reg=0.1,min_data_in_leaf=7, random_state=2023, feature_weights=[1, 1],
#                            loss_function='Logloss', eval_metric="F1", early_stopping_rounds=20, nan_mode="Max",
#                                per_float_feature_quantization='5:border_count=1000',
#                            verbose=True, use_best_model=True,cat_features = ["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Ticket"])

- Train

In [10]:
model.fit(train[['Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked', "Ticket"]], train["Survived"], eval_set=(validation[['Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked', "Ticket"]], validation["Survived"]))

0:	learn: 0.7131783	test: 0.7128713	best: 0.7128713 (0)	total: 196ms	remaining: 1m 37s
1:	learn: 0.7142857	test: 0.7128713	best: 0.7128713 (0)	total: 222ms	remaining: 55.2s
2:	learn: 0.7214953	test: 0.7058824	best: 0.7128713 (0)	total: 256ms	remaining: 42.4s
3:	learn: 0.7429644	test: 0.7378641	best: 0.7378641 (3)	total: 291ms	remaining: 36s
4:	learn: 0.7537879	test: 0.7428571	best: 0.7428571 (4)	total: 325ms	remaining: 32.2s
5:	learn: 0.7562380	test: 0.7378641	best: 0.7428571 (4)	total: 362ms	remaining: 29.8s
6:	learn: 0.7571702	test: 0.7307692	best: 0.7428571 (4)	total: 399ms	remaining: 28.1s
7:	learn: 0.7537879	test: 0.7184466	best: 0.7428571 (4)	total: 436ms	remaining: 26.8s
8:	learn: 0.7557252	test: 0.7184466	best: 0.7428571 (4)	total: 471ms	remaining: 25.7s
9:	learn: 0.7509579	test: 0.7184466	best: 0.7428571 (4)	total: 498ms	remaining: 24.4s
10:	learn: 0.7509728	test: 0.7184466	best: 0.7428571 (4)	total: 531ms	remaining: 23.6s
11:	learn: 0.7480620	test: 0.6930693	best: 0.7428571 (

101:	learn: 0.7674419	test: 0.7647059	best: 0.7647059 (58)	total: 3.25s	remaining: 12.7s
102:	learn: 0.7674419	test: 0.7647059	best: 0.7647059 (58)	total: 3.28s	remaining: 12.6s
103:	learn: 0.7659574	test: 0.7647059	best: 0.7647059 (58)	total: 3.31s	remaining: 12.6s
104:	learn: 0.7674419	test: 0.7647059	best: 0.7647059 (58)	total: 3.34s	remaining: 12.6s
105:	learn: 0.7674419	test: 0.7647059	best: 0.7647059 (58)	total: 3.35s	remaining: 12.5s
106:	learn: 0.7674419	test: 0.7647059	best: 0.7647059 (58)	total: 3.38s	remaining: 12.4s
107:	learn: 0.7698259	test: 0.7647059	best: 0.7647059 (58)	total: 3.41s	remaining: 12.4s
108:	learn: 0.7698259	test: 0.7647059	best: 0.7647059 (58)	total: 3.45s	remaining: 12.4s
Stopped by overfitting detector  (50 iterations wait)

bestTest = 0.7647058824
bestIteration = 58

Shrink model to first 59 iterations.
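
The `learn`, `test`, and `best` columns in the log report the metric set by `eval_metric="F1"`. As a reminder of what that number means, here is a small sketch that computes F1 for the positive class by hand; the labels and predictions are made up and are not the model's output.

```python
import numpy as np


def f1_score_binary(y_true, y_pred) -> float:
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score_binary(y_true, y_pred))
```

Because F1 ignores true negatives, it is a reasonable choice when the positive class is the minority, as it is here.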


<catboost.core.CatBoostClassifier at 0x1a56e8a3588>

- Inspect the feature importances

In [11]:
print(model.get_feature_importance())

[23.50720916 54.48406627  8.61342942  5.75702136  0.          4.44378882
  0.13805352  3.05643145]
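
The raw array is easier to read once each value is paired with its feature name. The sketch below assumes the importances come back in the same column order that was passed to `model.fit()`; the values are copied from the output above.

```python
import pandas as pd

# Feature order passed to model.fit() in the training cell.
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]

# Values copied from the printed output above.
importances = [23.50720916, 54.48406627, 8.61342942, 5.75702136,
               0.0, 4.44378882, 0.13805352, 3.05643145]

# Pair names with values and sort from most to least important.
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked.round(2))
```

Under that assumption, Sex and Pclass account for most of the importance, while Parch contributes nothing in this run. Calling `model.get_feature_importance(prettified=True)` should give a similar labelled table directly.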


## 3. Prediction

In [12]:
test = pd.read_csv(r"E:\B_dsproject\dataset\titanic\test.csv", header=0)
# Apply the same preprocessing as for the training set,
# reusing transformation_ticket instead of redefining it.
test = test.fillna({'Embarked': "NaN"})
test["Ticket"] = test["Ticket"].apply(transformation_ticket)
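
The training and test cells apply the same two steps: fill missing Embarked values and binarize Ticket. One way to keep the two in sync is a single helper applied to both frames. In the sketch below, `preprocess` and `FEATURES` are hypothetical names that do not appear in the notebook, and the small frame is made-up data standing in for the CSV files.

```python
import pandas as pd

FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Ticket"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the notebook's two preprocessing steps and return a copy."""
    out = df.copy()
    # Missing ports become the literal string "NaN", as in the cells above.
    out["Embarked"] = out["Embarked"].fillna("NaN")
    # 1 if the ticket is purely numeric, 0 otherwise.
    out["Ticket"] = out["Ticket"].astype(str).str.strip().str.isnumeric().astype(int)
    return out


# Tiny made-up frame standing in for train.csv / test.csv.
sample = pd.DataFrame({
    "Pclass": [3, 1], "Sex": ["male", "female"], "Age": [22.0, None],
    "SibSp": [1, 0], "Parch": [0, 0], "Fare": [7.25, 71.2833],
    "Embarked": ["S", None], "Ticket": ["A/5 21171", "113803"],
})
print(preprocess(sample)[FEATURES])
```

Returning a copy leaves the original frame untouched, so the helper can be rerun safely in a notebook.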

In [14]:
test["Survived"] = model.predict(test[['Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked', "Ticket"]])

In [15]:
test[["PassengerId", "Survived"]].to_csv(r"E:\B_dsproject\dataset\titanic\submission.csv", index=False)

In [16]:
model.save_model("best")

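## 4. Next steps

- Features:
    - Manually build higher-order feature interactions
    - Try bucketing Age, Fare, and similar features
    - Classify Ticket and similar fields at a finer granularity
    - Process names to derive relationships, such as family ties
- Samples:
    - Try measures against class imbalance
    - After settling on a good set of parameters, retrain on all samples so that every sample is used
- Model:
    - Try an FM model to see whether it improves results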
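
Two of the feature ideas above, a manual interaction and fare bucketing, can be sketched in a few lines of pandas. The rows are made up, and the new column names (`Sex_Pclass`, `FareBucket`) and the number of bins are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows with the same column names as the Titanic data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# Manual second-order interaction: one category per (Sex, Pclass) pair.
df["Sex_Pclass"] = df["Sex"] + "_" + df["Pclass"].astype(str)

# Quantile bucketing of fares into two equally sized bins (hypothetical bin count).
df["FareBucket"] = pd.qcut(df["Fare"], q=2, labels=["low", "high"])

print(df)
```

Both new columns would then be added to `cat_features` so that the model treats them as categories.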