![logo](python_logo.jpg)

In [4]:
import pandas as pd

In [5]:
from pandas import Series

In [10]:
data = pd.read_csv('test_full_data.csv')



# Inspecting the data

In [11]:
data.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,dataset_name
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171,train
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599,train
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282,train
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803,train
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450,train


Inspect the data types

In [13]:
dtype = data.dtypes
dtype

Age             float64
Cabin            object
Embarked         object
Fare            float64
Name             object
Parch             int64
PassengerId       int64
Pclass            int64
Sex              object
SibSp             int64
Survived        float64
Ticket           object
dataset_name     object
dtype: object

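Observe the data, form hypotheses, verify the hypotheses

Even a handful of rows reveals quite a few characteristics of the data and lets us plan the next data cleaning and processing steps.

1. PassengerId is probably an identifier in one-to-one correspondence with passengers; its numeric value carries no particular meaning.
2. Survived and Pclass are recognized as numeric, but they may just be categorical variables whose magnitude carries no particular meaning. They can be read as strings.
3. Name contains a surname, a title, and given names, and may carry extra information in parentheses.
4. Ticket contains a text prefix and a number. The prefix is optional; when present, it is separated from the number by a space.
5. Cabin, when not empty, consists of a leading letter followed by a number.
6. Others (please add your own)...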
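
One way to test hypotheses like points 3 to 5 against a whole column, rather than five rows, is to measure what share of the non-null values matches a pattern. The sketch below is illustrative only: the helper `share_matching`, the regular expressions, and the tiny inline sample are assumptions, not part of the original notebook.

```python
import pandas as pd


def share_matching(values: pd.Series, pattern: str) -> float:
    """Return the fraction of non-null values that fully match `pattern`."""
    non_null = values.dropna().astype(str)
    if non_null.empty:
        return float("nan")
    return float(non_null.str.fullmatch(pattern).mean())


# Tiny illustrative sample; in the notebook this would be the full `data` frame.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "Ticket": ["A/5 21171", "113803"],
    "Cabin": [None, "C123"],
})

# Hypothetical patterns encoding hypotheses 3, 4 and 5.
patterns = {
    "Name": r"[^,]+,\s*[^.]+\.\s*.+",   # surname, title. given names
    "Ticket": r"(\S+ )?\d+",            # optional prefix, space, number
    "Cabin": r"[A-Z]\d+",               # one letter followed by digits
}

shares = {col: share_matching(sample[col], pat) for col, pat in patterns.items()}
print(shares)
```

A share well below 1.0, as multi-cabin values such as `B96 B98` would produce for the Cabin pattern, signals a hypothesis that needs refining.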



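Of course, patterns observed in 5 rows do not necessarily hold for the whole dataset, but they suggest what to do next.

For points 1 and 2, to decide whether each column is numeric or categorical, the strategy I came up with is:

Count the distinct values in each column. In general, a categorical feature has far fewer distinct values than there are samples.

A unique ID, by contrast, usually maps one-to-one to the rows, so its distinct count equals the number of samples.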

In [3]:
for column_name in data.select_dtypes(include=['number']).columns: 
    print(column_name)
    print(data[column_name].nunique())
    print('--------------------')

Age
98
--------------------
Fare
281
--------------------
Parch
8
--------------------
PassengerId
1309
--------------------
Pclass
3
--------------------
SibSp
7
--------------------
Survived
2
--------------------


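Here we can see that Survived, Pclass, SibSp, and Parch may merely use different numbers to denote different categories, so the magnitude of the values carries no particular meaning. An auto-generated ID generally has no analytical value.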
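
The distinct-count heuristic can be wrapped into a small helper. This is a minimal sketch, assuming a cutoff of 10 distinct values for "categorical"; the function name `classify_columns`, the `max_categories` threshold, and the toy frame are illustrative choices, not something the original notebook defines.

```python
import pandas as pd


def classify_columns(df: pd.DataFrame, max_categories: int = 10) -> dict:
    """Label each numeric column as 'id', 'categorical' or 'numeric'."""
    labels = {}
    n_rows = len(df)
    for name in df.select_dtypes(include=["number"]).columns:
        n_unique = df[name].nunique()
        if n_unique == n_rows:
            labels[name] = "id"            # one distinct value per row
        elif n_unique <= max_categories:
            labels[name] = "categorical"   # few distinct values
        else:
            labels[name] = "numeric"
    return labels


# Toy frame standing in for the real data.
toy = pd.DataFrame({
    "PassengerId": range(1, 21),
    "Pclass": [1, 2, 3, 3] * 5,
    "Fare": [7.25 + 0.5 * (i % 15) for i in range(20)],
})
print(classify_columns(toy))
```

The threshold is a judgment call: with the counts printed above, a cutoff of 10 would separate Parch (8) and SibSp (7) from Age (98) and Fare (281), but other datasets may need a different value.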

We can re-read the data with explicit data types.

In [4]:
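# Read the category-like columns as strings; SibSp is kept as an integer.
# (The original dict listed 'SibSp' twice; only the last entry, int, took effect.)
data = pd.read_csv('test_full_data.csv', dtype={'Survived': str, 'Pclass': str, 'Parch': str, 'SibSp': int})
data.head()

def to_int_01(x):
    """Convert a label such as '1.0' to an int; return None if it is missing or invalid."""
    try:
        return int(float(x))
    except (TypeError, ValueError):
        return None

data['Survived'] = data['Survived'].apply(to_int_01)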

Looks the same? Check the dtypes.

In [5]:
data.dtypes

Age             float64
Cabin            object
Embarked         object
Fare            float64
Name             object
Parch            object
PassengerId       int64
Pclass           object
Sex              object
SibSp             int32
Survived        float64
Ticket           object
dataset_name     object
dtype: object
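
Survived is still float64 in the listing above. A likely reason is that `to_int_01` returns None for rows without a label, and pandas stores missing values in a numeric column as NaN, which forces a float dtype. If integer labels are wanted, one option is pandas' nullable `Int64` dtype. A minimal sketch on made-up values (the series below is illustrative, not taken from the dataset):

```python
import pandas as pd

# Labels as they might look after reading the column as strings.
raw = pd.Series(["0.0", "1.0", None, "1.0"])

# to_numeric turns the missing entry into NaN, so the result is float64.
as_float = pd.to_numeric(raw)

# The nullable integer dtype keeps integers and represents the gap as <NA>.
as_nullable_int = as_float.astype("Int64")

print(as_float.dtype, as_nullable_int.dtype)
```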

Next, process Ticket by analyzing each of its values.

In [6]:
s = "21171 dasdas dasdad"


def ticket_analysis(s, unexpects):
    try:
        if s.startswith('STON/O 2.'):
            # take the last 7 characters as the ticket number
            return 'STON/O 2.', s[-7:]
        
        if s.startswith('LINE'):
            return 'LINE', None
        
        if s.startswith('SC/AH Basle'):
            return 'SC/AH Basle', None
        
        splitted = s.split(' ')

        ticket_type = None
        ticket_number = None
        if len(splitted) == 1:
            ticket_number = int(splitted[0])
        elif len(splitted) == 2:
            # first attempt: both tokens end up in ticket_type
            ticket_type = [splitted[0], splitted[1]]
        else:
            raise ValueError
        return ticket_type, ticket_number
    except Exception:
        # collect values that do not fit any expected format
        unexpects.append(s)
        return None, None

In [7]:
unexpects = []
data['Ticket'].apply(lambda s:ticket_analysis(s, unexpects))

0                ([A/5, 21171], None)
1                 ([PC, 17599], None)
2         ([STON/O2., 3101282], None)
3                      (None, 113803)
4                      (None, 373450)
5                      (None, 330877)
6                       (None, 17463)
7                      (None, 349909)
8                      (None, 347742)
9                      (None, 237736)
10                 ([PP, 9549], None)
11                     (None, 113783)
12               ([A/5., 2151], None)
13                     (None, 347082)
14                     (None, 350406)
15                     (None, 248706)
16                     (None, 382652)
17                     (None, 244373)
18                     (None, 345763)
19                       (None, 2649)
20                     (None, 239865)
21                     (None, 248698)
22                     (None, 330923)
23                     (None, 113788)
24                     (None, 349909)
25                     (None, 347077)
26          

In [8]:
def ticket_analysis(s, unexpects):
    try:
        if ',' in s:
            splitted = s.split(',')
            return splitted[0], splitted[1]
        
        if s.startswith('STON/O 2.'):
            # take the last 7 characters as the ticket number
            return 'STON/O 2.', s[-7:]
        
        if s.startswith('LINE'):
            return 'LINE', None
        
        if s.startswith('SC/AH Basle'):
            return 'SC/AH Basle', None
        
        splitted = s.split(' ')

        ticket_type = None
        ticket_number = None
        if len(splitted) == 1:
            ticket_number = int(splitted[0])
        elif len(splitted) == 2:
            # prefix and number are now returned separately
            ticket_type = splitted[0]
            ticket_number = splitted[1]
        else:
            raise ValueError
        return ticket_type, ticket_number
    except Exception:
        # collect values that do not fit any expected format
        unexpects.append(s)
        return None, None
    


In [9]:
unexpects = []
data[['tickket_type', 'tickket_number']] = data['Ticket'].apply(lambda s:Series(ticket_analysis(s, unexpects)))
data

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,dataset_name,tickket_type,tickket_number
0,22.0,,S,7.2500,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171,train,A/5,21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599,train,PC,17599
2,26.0,,S,7.9250,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282,train,STON/O2.,3101282
3,35.0,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803,train,,113803
4,35.0,,S,8.0500,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450,train,,373450
5,,,Q,8.4583,"Moran, Mr. James",0,6,3,male,0,0.0,330877,train,,330877
6,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463,train,,17463
7,2.0,,S,21.0750,"Palsson, Master. Gosta Leonard",1,8,3,male,3,0.0,349909,train,,349909
8,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,9,3,female,0,1.0,347742,train,,347742
9,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,10,2,female,1,1.0,237736,train,,237736
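
As an alternative to the hand-written ticket parser above, the "optional prefix, then a number" hypothesis can be expressed as a single regular expression and applied with `Series.str.extract`. This is a sketch under assumptions: the pattern and the sample tickets are illustrative, and values without a trailing number (such as `LINE`) simply come back as missing instead of being collected in `unexpects`.

```python
import pandas as pd

# Illustrative ticket values, not the full column.
tickets = pd.Series(["A/5 21171", "PC 17599", "113803", "STON/O 2. 3101282", "LINE"])

# Optional prefix (anything ending in a non-space character), whitespace, then digits.
pattern = r"^(?:(?P<ticket_type>.*\S)\s+)?(?P<ticket_number>\d+)$"

# Named groups become the column names of the resulting DataFrame.
parsed = tickets.str.extract(pattern)
print(parsed)
```

Because the prefix group is greedy, a prefix that itself contains a space (`STON/O 2.`) is kept whole without a special case. Note that `ticket_number` is returned as a string here.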


In [10]:
def name_analysis(s, unexpects):
    initial_part = None
    title = None
    name_part = None
    other_name_part = None
    nick_name = None
    
    try:
        quote_location = s.find('"')
        if quote_location > 0:
            # nickname: text between the first and the last double quote
            end_quote = s.rfind('"')
            nick_name = s[(quote_location + 1):end_quote]
            s = s[:quote_location] + s[(end_quote + 1):]
        
        comma_location = s.find(',')
        if comma_location > 0:
            # surname: everything before the first comma
            initial_part = s[:comma_location]
            s = s[(comma_location + 1):]

        left_brace_location = s.find('(')
        if left_brace_location >= 0:
            # text inside the parentheses, without the closing ')'
            other_name_part = s[(left_brace_location + 1):].rstrip().rstrip(')')
            s = s[:left_brace_location]

        # what remains should be "<title>. <given names>"
        splitted = s.split('.')
        if len(splitted) == 2:
            title = splitted[0].strip()
            name_part = splitted[1].strip()
        else:
            raise ValueError('Unexpected: {}'.format(s))
    except Exception:
        unexpects.append(s)
        
    return initial_part, title, name_part, other_name_part, nick_name


unexpects = []
data[['name_initial_part', 'name_title', 'name', 'other_name', 'nick_name']] = data['Name'].apply(lambda s:Series(name_analysis(s, unexpects)))
data

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,dataset_name,tickket_type,tickket_number,name_initial_part,name_title,name,other_name,nick_name
0,22.0,,S,7.2500,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171,train,A/5,21171,Braund,Mr,Owen Harris,,
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599,train,PC,17599,Cumings,Mrs,John Bradley,Florence Briggs Thayer,
2,26.0,,S,7.9250,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282,train,STON/O2.,3101282,Heikkinen,Miss,Laina,,
3,35.0,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803,train,,113803,Futrelle,Mrs,Jacques Heath,Lily May Peel,
4,35.0,,S,8.0500,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450,train,,373450,Allen,Mr,William Henry,,
5,,,Q,8.4583,"Moran, Mr. James",0,6,3,male,0,0.0,330877,train,,330877,Moran,Mr,James,,
6,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463,train,,17463,McCarthy,Mr,Timothy J,,
7,2.0,,S,21.0750,"Palsson, Master. Gosta Leonard",1,8,3,male,3,0.0,349909,train,,349909,Palsson,Master,Gosta Leonard,,
8,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,9,3,female,0,1.0,347742,train,,347742,Johnson,Mrs,Oscar W,Elisabeth Vilhelmina Berg,
9,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,10,2,female,1,1.0,237736,train,,237736,Nasser,Mrs,Nicholas,Adele Achem,


In [11]:
# Possibly assigned arbitrarily, not down to the individual passenger
data[data['Cabin'] == 'B96 B98']

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,dataset_name,tickket_type,tickket_number,name_initial_part,name_title,name,other_name,nick_name
390,36.0,B96 B98,S,120.0,"Carter, Mr. William Ernest",2,391,1,male,1,1.0,113760,train,,113760,Carter,Mr,William Ernest,,
435,14.0,B96 B98,S,120.0,"Carter, Miss. Lucile Polk",2,436,1,female,1,1.0,113760,train,,113760,Carter,Miss,Lucile Polk,,
763,36.0,B96 B98,S,120.0,"Carter, Mrs. William Ernest (Lucile Polk)",2,764,1,female,1,1.0,113760,test,,113760,Carter,Mrs,William Ernest,Lucile Polk,
802,11.0,B96 B98,S,120.0,"Carter, Master. William Thornton II",2,803,1,male,1,1.0,113760,test,,113760,Carter,Master,William Thornton II,,


In [12]:
[col for col in list(data.Cabin.dropna().unique()) if ' ' in col]

['C23 C25 C27',
 'F G73',
 'D10 D12',
 'B58 B60',
 'F E69',
 'C22 C26',
 'B57 B59 B63 B66',
 'B96 B98',
 'B51 B53 B55',
 'F G63',
 'C62 C64',
 'B82 B84',
 'C55 C57',
 'F E46',
 'F E57',
 'E39 E41',
 'B52 B54 B56']

In [13]:
from numpy import nan  # the `NaN` alias was removed in NumPy 2.0

In [14]:
cabin_exploded = data.dropna(subset=['Cabin']).groupby('PassengerId')['Cabin'].apply(lambda s:Series(s.iat[0].split(' '))).reset_index().drop('level_1', axis=1)

def analysis_cabin(s):
    cabin_type = None
    cabin_number = None

    cabin_type = s[0]
    
    try:
        cabin_number = int(s[1:])
    except ValueError:
        pass

    return Series([cabin_type, cabin_number])
    
    
    
cabin_exploded[['cabin_type', 'cabin_number']] = cabin_exploded['Cabin'].apply(analysis_cabin)
cabin_exploded

cabin_exploded.groupby('cabin_type')['cabin_number'].apply(Series.value_counts)

cabin_type      
A           34.0    3
            18.0    1
            11.0    1
            5.0     1
            7.0     1
            32.0    1
            31.0    1
            19.0    1
            14.0    1
            16.0    1
            10.0    1
            20.0    1
            23.0    1
            26.0    1
            36.0    1
            24.0    1
            21.0    1
            9.0     1
            29.0    1
            6.0     1
B           63.0    5
            59.0    5
            66.0    5
            57.0    5
            98.0    4
            96.0    4
            51.0    3
            55.0    3
            53.0    3
            60.0    3
                   ..
E           8.0     2
            24.0    2
            50.0    2
            25.0    2
            44.0    2
            69.0    1
            36.0    1
            38.0    1
            40.0    1
            10.0    1
            41.0    1
            12.0    1
            39.0    1
            68.
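
The step that builds `cabin_exploded` (one row per cabin, then a deck letter and a number) can also be written with `Series.str.split` and `DataFrame.explode`, available since pandas 0.25. This is a sketch of that alternative on a few made-up rows, not the notebook's own implementation; the passenger IDs and cabin strings below are illustrative.

```python
import pandas as pd

# Illustrative rows: a single cabin, several cabins, a bare deck letter, a missing value.
cabins = pd.DataFrame({
    "PassengerId": [2, 28, 76, 1],
    "Cabin": ["C85", "C23 C25 C27", "F G73", None],
})

exploded = (
    cabins.dropna(subset=["Cabin"])
    .assign(Cabin=lambda df: df["Cabin"].str.split(" "))  # string -> list of cabins
    .explode("Cabin")                                      # one row per cabin
    .reset_index(drop=True)
)

# First character is the deck letter; the rest is the number (NaN if absent).
exploded["cabin_type"] = exploded["Cabin"].str[0]
exploded["cabin_number"] = pd.to_numeric(exploded["Cabin"].str[1:], errors="coerce")
print(exploded)
```

Using `errors="coerce"` plays the role of the `try`/`except ValueError` in `analysis_cabin`: a bare letter such as `F` yields a missing number rather than an error.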

In [15]:
def aggregate_cabin(s):
    m = len(s)
    most_frequent = s.value_counts().index[0]
    
    return Series([m, most_frequent], index = ['cabin_count', 'most_possible_cabin'])

cabin_features = cabin_exploded.groupby('PassengerId')['cabin_type'].apply(aggregate_cabin).unstack(level=1)
cabin_features

PassengerId,cabin_count,most_possible_cabin
2,1,C
4,1,C
7,1,E
11,1,G
12,1,C
22,1,D
24,1,A
28,3,C
32,1,B
53,1,D
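
The per-passenger aggregation can also be expressed with named aggregation, which returns the two columns directly and avoids the `unstack` step. A minimal sketch, assuming an exploded frame shaped like `cabin_exploded`; the rows below are made up, and `mode()` is used here instead of `value_counts()` so that ties resolve to the alphabetically first deck.

```python
import pandas as pd

# Illustrative exploded data: one row per (passenger, cabin).
exploded = pd.DataFrame({
    "PassengerId": [2, 28, 28, 28, 76, 76],
    "cabin_type": ["C", "C", "C", "C", "F", "G"],
})

cabin_features = exploded.groupby("PassengerId")["cabin_type"].agg(
    cabin_count="size",                              # number of cabins listed
    most_possible_cabin=lambda s: s.mode().iat[0],   # most frequent deck letter
)
print(cabin_features)
```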


In [16]:
drops = ['Name', 'Ticket', 'Cabin', 'name_initial_part','name_title','name','other_name','nick_name', 'tickket_type','tickket_number']
data = data.set_index('PassengerId').join(cabin_features, how='outer').drop(drops, axis=1)
data

PassengerId,Age,Embarked,Fare,Parch,Pclass,Sex,SibSp,Survived,dataset_name,cabin_count,most_possible_cabin
1,22.0,S,7.2500,0,3,male,1,0.0,train,,
2,38.0,C,71.2833,0,1,female,1,1.0,train,1,C
3,26.0,S,7.9250,0,3,female,0,1.0,train,,
4,35.0,S,53.1000,0,1,female,1,1.0,train,1,C
5,35.0,S,8.0500,0,3,male,0,0.0,train,,
6,,Q,8.4583,0,3,male,0,0.0,train,,
7,54.0,S,51.8625,0,1,male,0,0.0,train,1,E
8,2.0,S,21.0750,1,3,male,3,0.0,train,,
9,27.0,S,11.1333,2,3,female,0,1.0,train,,
10,14.0,C,30.0708,0,2,female,1,1.0,train,,


In [17]:
data_full = data

In [18]:
from model_lab.quick_score_card import QuickScoreCard, DataGenerator, build_data_generator

In [19]:
data_full.to_excel('test_data.xlsx', index=False)

In [20]:
data = pd.read_excel('test_data.xlsx')
data['Survived'] = data['Survived']


label_col_name = 'Survived'
dataset_col_name = 'dataset_name'

train_dataset_name = 'train'
oot_dataset_name = 'oot'
test_dataset_names = ['test']

data_generator = build_data_generator(
    data,
    label_col_name,
    dataset_col_name,
    train_dataset_name,
    oot_dataset_name,
    test_dataset_names
)


trainer = QuickScoreCard(data_generator)
trainer.run()


Doing task encoder_training ...


RuntimeWarning: divide by zero encountered in log