# Spaceship Titanic

- Second attempt
    - Trained a logistic regression model using the `LogisticRegression` class provided by scikit-learn.

## Import modules

In [1]:
import os
from datetime import datetime
from zipfile import ZipFile
from io import BytesIO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

from scipy.special import expit

## Set envs

In [2]:
PATH_INPUT = './data/in/'
PATH_OUTPUT = './data/out/'
NOW_STR = datetime.now().strftime('%Y%m%d_%H%M%S')
PATH_OUTPUT_NOW = f'{PATH_OUTPUT}{NOW_STR}/'

## Get Data

In [3]:
df_train = pd.read_csv('./data/out/preprocessed_data/train_int.csv')
df_test = pd.read_csv('./data/out/preprocessed_data/test_int.csv')

## Train

### Set Input data

In [4]:
exception_cols = ['PassengerId', 'Name']  # identifier columns, not used as features
dependants = ['Transported']              # target column
independents = [c for c in df_train.columns if c not in dependants and c not in exception_cols]
# Resulting features: 'CryoSleep', 'Cabin', 'Age', 'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars', 'Destination_55 Cancri e', 'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e'

X = df_train[independents]
Y = df_train[dependants]

In [5]:
X.shape, Y.shape

((8693, 15), (8693, 1))

### Build model

In [6]:
model = LogisticRegression(verbose=1)

### Train model

In [7]:
clf = model.fit(X, Y)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =           16     M =           10

At X0         0 variables are exactly at the bounds

At iterate    0    f=  6.02553D+03    |proj g|=  1.05949D+06

At iterate   50    f=  3.99461D+03    |proj g|=  8.50638D+02

At iterate  100    f=  3.90601D+03    |proj g|=  1.24506D+04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
   16    100    111      1     0     0   1.245D+04   3.906D+03
  F =   3906.0079469560947     

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT                 


  y = column_or_1d(y, warn=True)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
 This problem is unconstrained.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished
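
The log above stops at the iteration limit (`STOP: TOTAL NO. of ITERATIONS REACHED LIMIT`), and the warning suggests raising `max_iter` or scaling the data. The `column_or_1d` line indicates that `Y` was passed as a column vector rather than a 1-D array. The following is a minimal sketch of one way to address both points. It assumes a `StandardScaler` pipeline and `max_iter=1000`, neither of which was used in the run above, and it uses synthetic data (`X_demo`, `Y_demo`) as a stand-in for `df_train`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_train (hypothetical data, not the competition file):
# one small-valued feature and one large-valued "spending" feature.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Age": rng.uniform(0, 80, size=500),
    "Spa": rng.exponential(1000.0, size=500),
})
Y_demo = pd.DataFrame({"Transported": (X_demo["Spa"] < 700).astype(int)})

# Assumption: standardize features first and allow more solver iterations.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# .values.ravel() turns the (n, 1) target column into a 1-D array,
# which avoids the column-vector warning.
scaled_clf = scaled_model.fit(X_demo, Y_demo.values.ravel())

# Number of iterations the solver actually needed
n_iter = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

Because the scaler is part of the pipeline, the same transformation is applied automatically when `scaled_clf.predict` is called on new data.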


## Result

### Save result data

In [8]:
data_out_path = PATH_OUTPUT_NOW
model_dir = os.path.join(data_out_path, "models")

# Create the output directory if it does not exist yet
os.makedirs(model_dir, exist_ok=True)

# Serialize the fitted classifier with pickle
with open(os.path.join(model_dir, "model_scikit_logistic_regression.pkl"), "wb") as f:
    pickle.dump(clf, f)
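
A saved model is only useful if it can be restored. The sketch below shows a pickle round trip on a tiny hypothetical model (`demo_clf`, standing in for `clf`) written to a temporary directory; the data and variable names are illustrative assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical model standing in for `clf`
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_clf = LogisticRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_scikit_logistic_regression.pkl")

    # Save the fitted model
    with open(model_path, "wb") as f:
        pickle.dump(demo_clf, f)

    # Load it back
    with open(model_path, "rb") as f:
        restored_clf = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print("round trip ok:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and preferably with the same scikit-learn version that wrote them.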

## Validation

### Predict

In [9]:
df_test.head()

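| Unnamed: 0 | PassengerId | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Cabin_Deck | Cabin_Num | Cabin_Side | HomePlanet_Earth | HomePlanet_Europa | HomePlanet_Mars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | 1 | 3 | 27.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 3.0 | 2 | 1 | 0 | 0 |
| 1 | 0018_01 | 0 | 3 | 19.0 | 0 | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | 5 | 4.0 | 2 | 1 | 0 | 0 |
| 2 | 0019_01 | 1 | 0 | 31.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 0.0 | 2 | 0 | 1 | 0 |
| 3 | 0021_01 | 0 | 3 | 38.0 | 0 | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | 2 | 1.0 | 2 | 0 | 1 | 0 |
| 4 | 0023_01 | 0 | 3 | 20.0 | 0 | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | 5 | 5.0 | 2 | 1 | 0 | 0 |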


In [10]:
X_test = df_test[independents]
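
Selecting columns with a list, as in `df_test[independents]`, raises a `KeyError` when any listed column is absent from the frame. A quick check before selecting makes such a mismatch between training features and test columns explicit. The sketch below uses hypothetical miniature data (`feature_cols`, `df_test_demo`); the column names are only illustrative.

```python
import pandas as pd

# Hypothetical feature list and test frame, for illustration only
feature_cols = ["CryoSleep", "Age", "Cabin"]
df_test_demo = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01"],
    "CryoSleep": [1, 0],
    "Age": [27.0, 19.0],
    "Cabin_Deck": [6, 5],
})

# Collect every training feature that the test frame does not provide
missing_cols = [c for c in feature_cols if c not in df_test_demo.columns]

if missing_cols:
    print("missing from test data:", missing_cols)
else:
    # Safe to select: all feature columns exist, in training order
    X_test_demo = df_test_demo[feature_cols]
```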

In [11]:
predictions = clf.predict(X_test)
predictions = list(predictions)
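
For a binary problem, `clf.predict` returns class labels directly. The underlying probability is the logistic function applied to the linear score, $p = \sigma(w \cdot x + b)$, which scikit-learn exposes through `predict_proba`. The sketch below checks that relationship with `scipy.special.expit` on synthetic data; `X_demo`, `y_demo`, and `demo_clf` are hypothetical stand-ins for the notebook's data and `clf`.

```python
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score w.x + b for each row
scores = demo_clf.decision_function(X_demo)

# Probability of the positive class, computed by hand and by scikit-learn
proba_manual = expit(scores)
proba_sklearn = demo_clf.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 gives the same labels as predict()
labels_from_proba = (proba_sklearn > 0.5).astype(int)

print(np.allclose(proba_manual, proba_sklearn))
print(np.array_equal(labels_from_proba, demo_clf.predict(X_demo)))
```

Working with `predict_proba` instead of `predict` keeps the option of choosing a threshold other than 0.5.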

#### Export prediction to csv file

In [12]:
# Values close to 1 -> True, values close to 0 -> False.
# clf.predict already returns class labels, so this only converts them to Python bools.
predictions = [bool(v > 0.5) for v in predictions]

In [13]:
output = pd.DataFrame({ "PassengerId": df_test['PassengerId'].to_list(), "Transported": predictions })
output_dir = os.path.join(PATH_OUTPUT_NOW, "predict")
os.makedirs(output_dir, exist_ok=True)

output.to_csv(os.path.join(output_dir, 'predict_scikit_logistic_regression.csv'), index=False)
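
The test frame shown above has no `Transported` column, so these predictions cannot be scored locally. One common option is to hold out part of the training data and evaluate on it. The sketch below assumes an 80/20 stratified split and uses `log_loss` and `accuracy_score` on synthetic data; the split ratio, the metrics, and all `*_demo` names are assumptions for illustration, not part of the original run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features and target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
noise = rng.normal(scale=0.7, size=400)
y_demo = (X_demo[:, 0] - X_demo[:, 2] + noise > 0).astype(int)

# Hold out 20% of the rows, keeping the class ratio in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

val_clf = LogisticRegression().fit(X_tr, y_tr)

# Log loss uses predicted probabilities; accuracy uses predicted labels
val_loss = log_loss(y_val, val_clf.predict_proba(X_val))
val_acc = accuracy_score(y_val, val_clf.predict(X_val))

print(f"validation log loss: {val_loss:.4f}")
print(f"validation accuracy: {val_acc:.4f}")
```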