<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-1">Logistic Regression</a></span></li><li><span><a href="#データの読み取り" data-toc-modified-id="データの読み取り-2">Reading the Data</a></span></li><li><span><a href="#データの前処理" data-toc-modified-id="データの前処理-3">Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#欠損値処理" data-toc-modified-id="欠損値処理-3.1">Handling Missing Values</a></span></li><li><span><a href="#カテゴリデータの処理" data-toc-modified-id="カテゴリデータの処理-3.2">Handling Categorical Data</a></span></li></ul></li><li><span><a href="#データセット分割" data-toc-modified-id="データセット分割-4">Splitting the Dataset</a></span></li><li><span><a href="#標準化" data-toc-modified-id="標準化-5">Standardization</a></span></li><li><span><a href="#L1正則化で学習" data-toc-modified-id="L1正則化で学習-6">Training with L1 Regularization</a></span></li><li><span><a href="#L2正則化で学習" data-toc-modified-id="L2正則化で学習-7">Training with L2 Regularization</a></span></li></ul></div>

# Logistic Regression

# Reading the Data

In [2]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [3]:
print(pd.options.display.max_rows, 'default:60')
print(pd.options.display.max_columns, 'default:20')
pd.options.display.max_columns = 50
print(pd.options.display.max_rows, 'default:60')
print(pd.options.display.max_columns, 'default:20')

60 default:60
20 default:20
60 default:60
50 default:20


In [4]:
# Data Dictionary
# Variable	Definition	Key
# survival 	Survival 	0 = No, 1 = Yes
# pclass 	Ticket class 	1 = 1st, 2 = 2nd, 3 = 3rd
# sex 	Sex 	
# Age 	Age in years 	
# sibsp 	# of siblings / spouses aboard the Titanic 	
# parch 	# of parents / children aboard the Titanic 	
# ticket 	Ticket number 	
# fare 	Passenger fare 	
# cabin 	Cabin number 	
# embarked 	Port of Embarkation 	C = Cherbourg, Q = Queenstown, S = Southampton
# Variable Notes
# 
# pclass: A proxy for socio-economic status (SES)
# 1st = Upper
# 2nd = Middle
# 3rd = Lower
# 
# age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
# 
# sibsp: The dataset defines family relations in this way...
# Sibling = brother, sister, stepbrother, stepsister
# Spouse = husband, wife (mistresses and fiancés were ignored)
# 
# parch: The dataset defines family relations in this way...
# Parent = mother, father
# Child = daughter, son, stepdaughter, stepson
# Some children travelled only with a nanny, therefore parch=0 for them.

In [5]:
path_to_train_csv = '../data/train.csv'
df = pd.read_csv(path_to_train_csv,
                 index_col='PassengerId')
print(df.shape)
print("Passengers:", len(df.index))
print("Columns:", len(df.columns))
df.head()

(891, 11)
Passengers: 891
Columns: 11


| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Data Preprocessing

## Handling Missing Values

In [6]:
df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

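The counts above show that two columns have many missing values:
- Age (177 of 891 rows, about 20%)
- Cabin (687 of 891 rows, about 77%)

Both features are therefore dropped.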
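
To make the comparison between columns easier, the missing counts can also be expressed as ratios. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data) and relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean mask; its column-wise mean is the missing ratio
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

On the real data the same call would be `df.isnull().mean()`. Where to draw the line for dropping a column is a judgment call.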

In [7]:
df_2 = df.drop(labels=['Age', 'Cabin'],
               axis=1)
print(df_2.shape)
df_2.head(2)

(891, 9)


| PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.25 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 | C |


In [8]:
df_2.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    2
dtype: int64

The result above shows that only two rows are missing "Embarked", so those two rows are dropped.

In [9]:
df_3 = df_2.dropna(axis=0,
                   subset=['Embarked'])
print(df_3.shape)
df_3.head(2)

(889, 9)


| PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.25 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 | C |


In [10]:
df_3.isnull().sum()

Survived    0
Pclass      0
Name        0
Sex         0
SibSp       0
Parch       0
Ticket      0
Fare        0
Embarked    0
dtype: int64

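The result above confirms that no missing values remain.

However, 'Name' looks too complex to use as a feature as-is, so it is excluded this time.<br>
(Whether passengers share a family name might well correlate with survival, though.)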

In [11]:
df_4 = df_3.drop(labels='Name',
                 axis=1)
print(df_4.shape)
df_4.head(3)

(889, 8)


| PassengerId | Survived | Pclass | Sex | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 1 | 0 | A/5 21171 | 7.25 | S |
| 2 | 1 | 1 | female | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | female | 0 | 0 | STON/O2. 3101282 | 7.925 | S |


In [12]:
#df_4

## Handling Categorical Data

In [13]:
#print(df_4.columns[1:])
#X = df_4[df_4.columns[1:]].values
#print(X.shape)
#print(X)
#ohe = OneHotEncoder()
#ohe.fit_transform(X).toarray()

In [14]:
df_5 = pd.get_dummies(df_4[df_4.columns[1:]])
print(df_5.shape)
df_5.columns

(889, 689)


Index(['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male',
       'Ticket_110152', 'Ticket_110413', 'Ticket_110465', 'Ticket_110564',
       ...
       'Ticket_W./C. 14263', 'Ticket_W./C. 6607', 'Ticket_W./C. 6608',
       'Ticket_W./C. 6609', 'Ticket_W.E.P. 5734', 'Ticket_W/C 14208',
       'Ticket_WE/P 5735', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object', length=689)

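The result above shows that "Ticket" has far too many distinct values: it accounts for 680 of the 689 dummy columns!<br>
"Ticket" is therefore also removed from the features.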
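
A cheaper way to spot such a column before encoding is to count distinct values per column, since each distinct value becomes one dummy column. The following is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, chosen only for illustration).

```python
import pandas as pd

# Hypothetical toy frame; every column here is categorical
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803'],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# Distinct values per column = dummy columns each one would produce
cardinality = toy.nunique()
print(cardinality)

# The encoded width equals the sum of the cardinalities
encoded = pd.get_dummies(toy)
print(encoded.shape)
```

A column whose cardinality is close to the number of rows, as with "Ticket" here, carries almost no shared information between rows once it is one-hot encoded.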

In [15]:
df_6 = df_4.drop(labels='Ticket',
                 axis=1)
print(df_6.shape)
print(df_6.columns)
df_6.head(3)

(889, 7)
Index(['Survived', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')


| PassengerId | Survived | Pclass | Sex | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 1 | 0 | 7.25 | S |
| 2 | 1 | 1 | female | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 0 | 0 | 7.925 | S |


In [16]:
# Redo the one-hot encoding, this time without Ticket
df_7 = pd.get_dummies(df_6[df_6.columns[1:]])
print(df_7.shape)
df_7.head(3)
#df_7

(889, 9)


| PassengerId | Pclass | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 1 | 0 | 7.25 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 3 | 3 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 1 |
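
Note that `Sex_female` and `Sex_male` always sum to 1, so one of them is redundant; the same holds for the three `Embarked_*` columns. This notebook keeps all of them, but one common alternative is `drop_first=True`, sketched here on a made-up frame (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy frame with the same two categorical columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each column dropped

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With a regularized model the redundant columns do not prevent fitting, but they do change how the weight is shared between the paired columns.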


# Splitting the Dataset

In [17]:
X = df_7.values
y = df_6[df_6.columns[0]].values
print(X.shape, y.shape)

(889, 9) (889,)


In [18]:
X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(622, 9)
(622,)
(267, 9)
(267,)


# Standardization

In [19]:
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
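
The cell above fits the scaler on the training split only and reuses those statistics for the test split. As a small check of what that means, the sketch below compares `StandardScaler` against the formula z = (x - mean) / std computed by hand; the arrays `X_tr` and `X_te` are made-up values, not the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test arrays for illustration
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te = np.array([[4.0, 30.0]])

scaler = StandardScaler().fit(X_tr)

# Statistics come from the training data only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), which StandardScaler uses

manual_te = (X_te - mu) / sigma
print(np.allclose(scaler.transform(X_te), manual_te))
```

Fitting on the training split alone keeps information about the test split out of the preprocessing step.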

# Training with L1 Regularization

In [20]:
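# liblinear supports the L1 penalty; the lbfgs default in newer scikit-learn versions does not
lr = LogisticRegression(penalty='l1',
                        solver='liblinear',
                        C=0.1)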
lr.fit(X_train_std, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [21]:
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

Training accuracy: 0.8038585209
Test accuracy: 0.767790262172


In [22]:
# Loop names chosen so they do not overwrite the label array y
for name, coef in zip(df_7.columns, lr.coef_[0]):
    print(name, '\t', coef)
print(df_7.columns)
print(lr.coef_)

Pclass 	 -0.615616872775
SibSp 	 -0.126416026146
Parch 	 0.0
Fare 	 0.0
Sex_female 	 0.279105784943
Sex_male 	 -0.87905539988
Embarked_C 	 0.0
Embarked_Q 	 0.0
Embarked_S 	 -0.220677148803
Index(['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male',
       'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')
[[-0.61561687 -0.12641603  0.          0.          0.27910578 -0.8790554
   0.          0.         -0.22067715]]


# Training with L2 Regularization

In [23]:
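# Pin liblinear, the solver shown in the output below, so results do not depend on the library default
lr = LogisticRegression(penalty='l2',
                        solver='liblinear',
                        C=0.1)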
lr.fit(X_train_std, y_train)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [24]:
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy:', lr.score(X_test_std, y_test))

Training accuracy: 0.8038585209
Test accuracy: 0.771535580524


In [25]:
# Loop names chosen so they do not overwrite the label array y
for name, coef in zip(df_7.columns, lr.coef_[0]):
    print(name, '\t', coef)
print(df_7.columns)
print(lr.coef_)

Pclass 	 -0.604657561265
SibSp 	 -0.218631213922
Parch 	 -0.0728829070586
Fare 	 0.116100895296
Sex_female 	 0.622831892021
Sex_male 	 -0.622831892021
Embarked_C 	 0.103656918971
Embarked_Q 	 0.0971859868429
Embarked_S 	 -0.152302636626
Index(['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex_female', 'Sex_male',
       'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')
[[-0.60465756 -0.21863121 -0.07288291  0.1161009   0.62283189 -0.62283189
   0.10365692  0.09718599 -0.15230264]]
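
In the outputs above, the L1 model sets Parch, Fare, Embarked_C and Embarked_Q to exactly 0.0, while the L2 model keeps every coefficient nonzero. The sketch below illustrates the same tendency on synthetic data; `make_classification`, the sample sizes and the two `C` values are assumptions picked for the demonstration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Count nonzero coefficients for each penalty at a strong and a weak regularization level
nonzero = {}
for penalty in ('l1', 'l2'):
    for C in (1e-4, 1.0):
        clf = LogisticRegression(penalty=penalty, solver='liblinear', C=C)
        clf.fit(X_demo, y_demo)
        nonzero[(penalty, C)] = int(np.count_nonzero(clf.coef_))
        print(penalty, C, nonzero[(penalty, C)])
```

Smaller `C` means stronger regularization. Under L1 a sufficiently small `C` drives every coefficient to exactly zero, whereas L2 only shrinks them toward zero.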


In [26]:
i = 3
print(lr.predict([X_test_std[i]]) , y_test[i])

[0] 0
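
`predict` returns only the class label. When the confidence behind that label matters, `predict_proba` can be used instead. The following is a minimal sketch on made-up one-dimensional data (`X_demo` and `y_demo` are hypothetical); it also shows slicing with double brackets as an alternative to wrapping a row in a list.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-dimensional data for illustration
X_demo = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(C=1.0).fit(X_demo, y_demo)

sample = X_demo[[3]]               # double brackets keep the 2-D shape (1, n_features)
proba = clf.predict_proba(sample)  # columns are ordered as in clf.classes_
label = clf.predict(sample)

print(proba, label)
```

In the binary case the predicted label is the class with the larger probability.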
