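# Basics of Building a Machine Learning Model
- Build a model that predicts Titanic survivors
- Dataset: Titanic dataset (source: Kaggle.com)
- Understand why preprocessing is needed and learn the basics of preprocessing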

In [3]:
import numpy as np
import pandas as pd

In [4]:
train = pd.read_csv('./data/train.csv')

## Dataset: Titanic dataset

In [5]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

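* PassengerId: passenger ID
* Survived: survival, 1: survived, 0: died
* Pclass: ticket class
* Name: name
* Sex: sex
* Age: age
* SibSp: number of siblings and spouses
* Parch: number of parents and children
* Ticket: ticket number
* Fare: fare
* Cabin: cabin number
* Embarked: port of embarkation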

## Preprocessing: splitting into train / validation sets

1. Define the features (X) and the label (y)
2. Once the features and label are defined, split the data into train / validation sets at a suitable ratio

In [7]:
# list of feature (X) columns
#feature = ['Pclass', 'Sex', 'Age', 'Fare']
feature = ['Pclass', 'Age', 'Fare']
X = train[feature]

In [8]:
# label (y) column
label = ['Survived']
y = train[label]

## Splitting into training data and test data
from sklearn.model_selection import train_test_split

In [9]:
from sklearn.model_selection import train_test_split

* **test_size**: fraction of the data assigned to the validation set (20% -> 0.2)
* **shuffle**: whether to shuffle before splitting (default True)
* **random_state**: random seed
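As a rough illustration of these options, the sketch below splits a small made-up array (`X_demo` and `y_demo` are hypothetical stand-ins, not the Titanic data). It suggests that a fixed `random_state` reproduces the same split, and that `shuffle=False` simply takes the last rows as the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with one feature each
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# Same seed twice: the two splits should be identical
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)
same_split = np.array_equal(a_test, b_test)

# shuffle=False: no shuffling, so the last 20% of the rows become the validation set
_, ns_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

print(same_split)
print(ns_test.ravel())
```

Without a `random_state`, each run may produce a different split, which makes results harder to compare between runs.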

In [19]:
# keep the order of the returned values; try random_state=10 and compare shuffle=True with shuffle=False
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [20]:
# check the shape of the training data
X_train.shape, y_train.shape

((712, 3), (712, 1))

In [21]:
# check the shape of the test data
X_test.shape, y_test.shape

((179, 3), (179, 1))

## Training
from sklearn.linear_model import SGDClassifier

In [22]:
# import the module used to train the model
from sklearn.linear_model import SGDClassifier

In [23]:
# create the model object, SGD (Stochastic Gradient Descent)
model_sgd = SGDClassifier(random_state=0)
model_sgd

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge',
              max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty='l2',
              power_t=0.5, random_state=0, shuffle=True, tol=0.001,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [24]:
# train the model: fitting fails if the data contains NaN values
model_sgd.fit(X_train, y_train)

ValueError: ignored
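The failure comes from the missing `Age` values in `X_train`. A minimal sketch on made-up rows (the `X_demo` and `y_demo` values are hypothetical) reproduces the same `ValueError` and shows that fitting goes through once the NaN is filled. Mean filling is only one possible choice here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Hypothetical toy rows; one Age value is missing
X_demo = pd.DataFrame({
    'Age':  [22.0, np.nan, 26.0, 35.0, 54.0, 2.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
})
y_demo = pd.Series([0, 1, 1, 1, 0, 0], name='Survived')  # 1-D label

model = SGDClassifier(random_state=0)

# Fitting on data that contains NaN raises ValueError
try:
    model.fit(X_demo, y_demo)
    fit_failed = False
except ValueError:
    fit_failed = True

# After filling the NaN (here with the column mean), fitting succeeds
X_filled = X_demo.fillna(X_demo.mean())
model.fit(X_filled, y_demo)
pred = model.predict(X_filled)
```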

## Predicting
- Without a fitted model, there can be no prediction

In [25]:
# predict on the test data; this fails because the model was never fitted
pred_y = model_sgd.predict(X_test)
pred_y

NotFittedError: ignored

In [26]:
#y_test['Survived']
y_test

Unnamed: 0,Survived
280,0
428,0
125,1
505,0
89,0
...,...
589,0
393,1
708,1
443,1


In [28]:
# fraction of predictions that match the actual values; this fails because pred_y does not exist
(pred_y == y_test['Survived']).mean()

NameError: ignored
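Once a prediction exists, the expression above is simply the accuracy. The sketch below uses made-up `y_true` and `y_pred` arrays (hypothetical values, not model output) to suggest that the manual mean and scikit-learn's `accuracy_score` agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual labels and predictions
y_true = np.array([0, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0])

# Element-wise comparison gives True/False; the mean is the share of matches
manual_acc = (y_pred == y_true).mean()

# The same quantity from scikit-learn
sk_acc = accuracy_score(y_true, y_pred)

print(manual_acc, sk_acc)
```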

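# Data preprocessing
Cleaning and transforming the data into the form needed before training a machine learning model
1. Handle missing values
2. Convert string data to numeric data
 - label encoding
 - one hot encoding
3. Bring the features to a common scale
 - Normalize
 - Standard Scaling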


## 1. Handling missing values

In [29]:
train = pd.read_csv('./data/train.csv')

In [30]:
# check the dataset summary
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Use pandas **isnull()** to check for missing values

Count the NaN values in each column: **isnull().sum()**

In [31]:
# train.isnull()
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Checking the missing values of a single column

In [None]:
#train['Age'].notnull()
#train['Age'].isnull()
train['Age'].isnull().sum()

177

#### 1) Handling missing values in numerical columns

In [None]:
train['Age'].fillna(0).describe()

count    891.000000
mean      23.799293
std       17.596074
min        0.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
Name: Age, dtype: float64

In [None]:
train['Age'].fillna(train['Age'].mean()).describe()

count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
Name: Age, dtype: float64

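##### SimpleImputer(): useful for handling missing values in two or more features (columns) at once.  
fit() learns the fill value for each column (for example, its mean), and transform() uses it to fill the NaN values.  
[See the imputation documentation](https://scikit-learn.org/stable/modules/impute.html)  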
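To make the two steps concrete, here is a small sketch on a made-up frame (`demo` is hypothetical). After `fit()`, the learned per-column values can be inspected through the imputer's `statistics_` attribute, and `transform()` then writes them into the gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing Age
demo = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Pclass': [3, 1, 2]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(demo)

# Per-column means learned by fit(); NaN is ignored when computing them
learned = imputer.statistics_

# transform() returns a NumPy array with the NaN replaced by the learned mean
filled = imputer.transform(demo)

print(learned)
print(filled)
```

Note that the result is a plain NumPy array, so column names have to be restored by assigning it back to the DataFrame columns.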


[Method 1] Run fit() and transform() separately

In [32]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [33]:
from sklearn.impute import SimpleImputer

In [34]:
# create a SimpleImputer object; strategy: which value to fill in
s_imputer = SimpleImputer(strategy='mean')

In [35]:
s_imputer.fit(train[['Age','Pclass']])

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

transform(): the function that replaces the NaN values

In [36]:
result = s_imputer.transform(train[['Age','Pclass']])

In [37]:
result

array([[22.        ,  3.        ],
       [38.        ,  1.        ],
       [26.        ,  3.        ],
       ...,
       [29.69911765,  3.        ],
       [26.        ,  1.        ],
       [32.        ,  3.        ]])

In [None]:
train[['Age', 'Pclass']] = result

In [None]:
#train[['Age', 'Pclass']].notnull().sum()
train[['Age', 'Pclass']].isnull().sum()

Age       0
Pclass    0
dtype: int64

In [None]:
train[['Age', 'Pclass']].describe()

Unnamed: 0,Age,Pclass
count,891.0,891.0
mean,29.699118,2.308642
std,13.002015,0.836071
min,0.42,1.0
25%,22.0,2.0
50%,29.699118,3.0
75%,35.0,3.0
max,80.0,3.0


[Method 2] fit_transform(): runs fit() and transform() in one call

In [None]:
train = pd.read_csv('./data/train.csv')

In [None]:
train[['Age', 'Pclass']].isnull().sum()

Age       177
Pclass      0
dtype: int64

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
s_imputer = SimpleImputer(strategy='mean')

In [None]:
num_result = s_imputer.fit_transform(train[['Age', 'Pclass']])

In [None]:
# write the result back to the Age and Pclass columns
train[['Age', 'Pclass']] = num_result

In [None]:
train[['Age', 'Pclass']].isnull().sum()

Age       0
Pclass    0
dtype: int64

In [None]:
train[['Age', 'Pclass']].describe()

Unnamed: 0,Age,Pclass
count,891.0,891.0
mean,29.699118,2.308642
std,13.002015,0.836071
min,0.42,1.0
25%,22.0,2.0
50%,29.699118,3.0
75%,35.0,3.0
max,80.0,3.0
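Because the fill value is learned in `fit()`, one common pattern (an assumption here, not something this notebook does) is to fit the imputer on the training split only and reuse it on the validation split. The sketch below uses two made-up frames, `train_part` and `valid_part`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and validation splits
train_part = pd.DataFrame({'Age': [10.0, 30.0, np.nan, 50.0]})
valid_part = pd.DataFrame({'Age': [np.nan, 80.0]})

imputer = SimpleImputer(strategy='mean')

# Learn the mean from the training split and fill it there
train_filled = imputer.fit_transform(train_part)

# Reuse the training mean on the validation split (no second fit)
valid_filled = imputer.transform(valid_part)

print(train_filled.ravel())
print(valid_filled.ravel())
```

The validation NaN is filled with the training mean, so the value 80.0 in the validation split does not influence the fill value.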


#### 2) Handling missing values in categorical columns

In [38]:
train = pd.read_csv('./data/train.csv')

[Method 1] Handling one column at a time

In [39]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [40]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [41]:
train['Embarked'].fillna('S') # not applied yet: the result is not assigned back

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object
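As a small sketch of why the assignment matters, the made-up Series below (`embarked` is hypothetical) stays unchanged after a bare `fillna()` call and only changes once the result is assigned back. It also looks up the most frequent value with `mode()` instead of hard-coding `'S'`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
embarked = pd.Series(['S', 'C', np.nan, 'S', 'Q'], name='Embarked')

# fillna() returns a new Series; the original is untouched
embarked.fillna('S')
still_missing = embarked.isnull().sum()

# Most frequent value, looked up instead of hard-coded
most_common = embarked.mode()[0]

# Assign the result back to actually apply it
embarked = embarked.fillna(most_common)

print(still_missing, most_common, embarked.isnull().sum())
```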

In [42]:
train['Cabin'].value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
F33            3
E101           3
              ..
A26            1
C110           1
E34            1
E77            1
B50            1
Name: Cabin, Length: 147, dtype: int64

##### Using SimpleImputer(): handling missing values in two or more categorical columns

In [43]:
s_imputer = SimpleImputer(strategy='most_frequent')

In [44]:
cat_result = s_imputer.fit_transform(train[['Cabin', 'Embarked']])

In [45]:
train[['Cabin', 'Embarked']] = cat_result

In [46]:
train[['Cabin', 'Embarked']].isnull().sum()

Cabin       0
Embarked    0
dtype: int64

## Encoding: converting strings (categorical) to numbers (numerical)
- Only numeric data can be fed to a machine learning model for training   
- Every column needed for training must be converted to numbers

In [47]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        891 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### 1) Label Encoding

In [48]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [49]:
mapping = {
    'male':1,
    'female':0
}

In [50]:
#train['Sex'].map(lambda x: 1 if x =='male' else 0)
train['Sex'].map(mapping)

0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex, Length: 891, dtype: int64
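One thing to watch with a dictionary mapping: `map()` returns a new Series, and any value missing from the dictionary becomes NaN. The sketch below uses a made-up Series (`sex_demo`, with a hypothetical `'unknown'` entry) to show this.

```python
import pandas as pd

# Hypothetical toy column; 'unknown' is not a key in the mapping
sex_demo = pd.Series(['male', 'female', 'female', 'unknown'])
mapping = {'male': 1, 'female': 0}

# Values without a matching key are mapped to NaN
encoded = sex_demo.map(mapping)

print(encoded)
print(encoded.isnull().sum())
```

As with `fillna()`, the original column changes only if the result is assigned back, for example to a new column.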

In [51]:
train['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [52]:
from sklearn.preprocessing import LabelEncoder

In [53]:
le = LabelEncoder()

In [54]:
train['Sex_num'] = le.fit_transform(train['Sex'])

In [55]:
train['Sex_num'].value_counts()

1    577
0    314
Name: Sex_num, dtype: int64

In [56]:
# check the original values from before label encoding
le.classes_  

array(['female', 'male'], dtype=object)

In [57]:
# map encoded values back to the original labels
le.inverse_transform([0, 1, 1, 0, 0, 1, 1])

array(['female', 'male', 'male', 'female', 'female', 'male', 'male'],
      dtype=object)

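If a column contains NaN values, `LabelEncoder` does not work properly, so missing values should be filled first. (`Embarked` was already filled with `SimpleImputer` above, which is why the next call succeeds.)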

In [58]:
le.fit_transform(train['Embarked'])

array([2, 0, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2, 2,
       1, 2, 2, 2, 0, 2, 1, 2, 0, 0, 1, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0,
       1, 2, 1, 1, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 0, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 0, 2, 1, 2, 0, 2, 2, 2, 0, 2, 2, 0, 1, 2, 0, 2, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2,
       2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 1, 2,
       1, 2, 2, 2, 2, 0, 2, 2, 2, 0, 1, 0, 2, 2, 2, 2, 1, 0, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1,
       2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 1, 2, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 0,
       2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 1,

In [59]:
train['Embarked'] = train['Embarked'].fillna('S')

In [60]:
le.fit_transform(train['Embarked'])

array([2, 0, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2, 2,
       1, 2, 2, 2, 0, 2, 1, 2, 0, 0, 1, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0,
       1, 2, 1, 1, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 0, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 0, 2, 1, 2, 0, 2, 2, 2, 0, 2, 2, 0, 1, 2, 0, 2, 0, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2,
       2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 1, 2,
       1, 2, 2, 2, 2, 0, 2, 2, 2, 0, 1, 0, 2, 2, 2, 2, 1, 0, 2, 2, 0, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1,
       2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 1, 2, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 0,
       2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 1,

### 2) One Hot Encoding

In [61]:
train = pd.read_csv('./data/train.csv')

In [62]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's look at `Embarked`.

In [63]:
train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [64]:
train['Embarked'] = train['Embarked'].fillna('S')

In [65]:
train['Embarked'].value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

In [66]:
train['Embarked_num'] = LabelEncoder().fit_transform(train['Embarked'])

In [67]:
train['Embarked_num'].value_counts()

2    646
0    168
1     77
Name: Embarked_num, dtype: int64

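Embarked holds the first letter of the port of embarkation, so it has to be converted to numbers, here with LabelEncoder.  

When a model is trained on data, it learns relationships among the values in that data. 

That is, with 'S' = 2 and 'Q' = 1, the model can learn a spurious relation such as `Q` + `Q` = `S`.

Therefore, independent categories are split into separate columns, with only the matching column set to **True** and the rest set to **False**.   
This is called one-hot encoding.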
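As a minimal sketch of that idea, `pd.get_dummies` can also be applied directly to the string column, which skips the intermediate integer codes. The Series `embarked_demo` below is made up, and the `prefix` argument is an optional choice for naming the new columns.

```python
import pandas as pd

# Hypothetical toy column
embarked_demo = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

# One column per category; exactly one of them is set in each row
one_hot_demo = pd.get_dummies(embarked_demo, prefix='Embarked')

print(one_hot_demo.columns.tolist())
print(one_hot_demo.sum(axis=1).tolist())
```

Depending on the pandas version, the dummy columns hold either 0/1 integers or True/False values; both carry the same information.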

In [68]:
train['Embarked'][:6]

0    S
1    C
2    S
3    S
4    S
5    Q
Name: Embarked, dtype: object

In [69]:
train['Embarked_num'][:6]

0    2
1    0
2    2
3    2
4    2
5    1
Name: Embarked_num, dtype: int64

In [70]:
pd.get_dummies(train['Embarked_num'][:6])

Unnamed: 0,0,1,2
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
5,0,1,0


In [71]:
#one_hot = pd.get_dummies(train['Embarked_num'][:6])
one_hot = pd.get_dummies(train['Embarked_num'])

In [72]:
one_hot

Unnamed: 0,0,1,2
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0


In [73]:
one_hot.columns = ['C','Q','S']

In [74]:
train[['C','Q','S']] = one_hot

In [75]:
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_num,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,2,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,2,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,2,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,2,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,2,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,2,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0,1,0,0


In [76]:
train.drop(['C', 'Q', 'S'], axis=1, inplace=True)
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Embarked_num
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,2
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,2
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,2
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,0


In [77]:
train.to_csv('./data/train_clean.csv',index=False)

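##### One Hot Encoding summary:
  - Breaks the artificial numeric relationship that appears when categories are converted to numbers, so that each category is learned independently.  
  - Apply it to categorical features (season, sex, occupation, food, etc.)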
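The notebook uses `pd.get_dummies`; as an alternative that is not part of the original walkthrough, scikit-learn's `OneHotEncoder` does the same job and remembers the categories it saw during `fit`. The frame `demo` below is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame
demo = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

encoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(demo[['Embarked']]).toarray()

# Categories found during fit, in sorted order
categories = encoder.categories_[0].tolist()

print(categories)
print(encoded)
```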

## Feature scaling
- Also called data scaling
- Brings the value ranges of the features close together so that differences in units can be ignored  
[Normalization and standardization reference](https://bskyvision.com/849)


### Normalize (normalization):    
When columns have different **min** and **max** values, normalization rescales each column so that its minimum is 0 and its maximum is 1.  

X' = (X - Xmin) / (Xmax - Xmin)

 
* Netflix movie ratings (0 to 10 points): [2, 4, 6, 8, 10]
* CGV movie ratings (0 to 5 points): [1, 2, 3, 4, 5]
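The formula can be applied by hand before reaching for a scaler. The sketch below rebuilds the two rating columns in a frame named `ratings` (a hypothetical name) and applies X' = (X - Xmin) / (Xmax - Xmin) column by column.

```python
import pandas as pd

# The two rating scales from the example above
ratings = pd.DataFrame({'netflix': [2, 4, 6, 8, 10],
                        'cgv': [1, 2, 3, 4, 5]})

# Column-wise min-max normalization: (X - Xmin) / (Xmax - Xmin)
normalized = (ratings - ratings.min()) / (ratings.max() - ratings.min())

print(normalized)
```

After normalization both columns hold the same values, so ratings from the two scales can be compared directly.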

In [78]:
movie = {'netflix': [2, 4, 6, 8, 10], 
         'cgv': [1, 2, 3, 4, 5]
         }

In [79]:
movie = pd.DataFrame(data=movie)
movie

Unnamed: 0,netflix,cgv
0,2,1
1,4,2
2,6,3
3,8,4
4,10,5


In [80]:
from sklearn.preprocessing import MinMaxScaler

In [81]:
min_max_scaler = MinMaxScaler()

In [82]:
min_max_movie = min_max_scaler.fit_transform(movie)

In [83]:
# reuse the original column names so the labels match the data
pd.DataFrame(min_max_movie, columns=movie.columns)

Unnamed: 0,netflix,cgv
0,0.0,0.0
1,0.25,0.25
2,0.5,0.5
3,0.75,0.75
4,1.0,1.0


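### Standard Scaling (standardization)

X' = (X - μ) / σ  
μ: mean of the feature  
σ: standard deviation of the feature
- Assuming a bell-shaped distribution, it transforms the values to have a mean of 0 and a standard deviation of 1. 
- Unlike normalization, standardization does not confine the feature values to the range 0 to 1.
- If the range of a feature's scale is too large, noisy data or overfitting becomes more likely.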
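The formula can be checked by hand with NumPy. The sketch below uses a made-up array `x_demo` containing one large outlier and compares the manual result with `StandardScaler`. Both use the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with one large outlier
x_demo = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 1000], dtype=float)

# Manual standardization: (X - mean) / std, with NumPy's default ddof=0
z_manual = (x_demo - x_demo.mean()) / x_demo.std()

# StandardScaler expects a 2-D array, so reshape to one column and flatten back
z_sklearn = StandardScaler().fit_transform(x_demo.reshape(-1, 1)).ravel()

matches = np.allclose(z_manual, z_sklearn)
print(matches)
```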

In [84]:
from sklearn.preprocessing import StandardScaler

Create sample data

In [85]:
x = np.arange(10)
# add an outlier
x[9] = 1000

In [86]:
x.mean(), x.std()

(103.6, 298.8100399919654)

In [87]:
x=x.reshape(-1,1)
x # reshaped into a 2-D array with a single column

array([[   0],
       [   1],
       [   2],
       [   3],
       [   4],
       [   5],
       [   6],
       [   7],
       [   8],
       [1000]])

In [88]:
#pd.Series(x)
pd.DataFrame(x)

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,1000


In [89]:
standard_scaler = StandardScaler()

In [90]:
scaled = standard_scaler.fit_transform(x)

In [91]:
x.mean(), x.std()

(103.6, 298.8100399919654)

In [92]:
scaled.mean(), scaled.std()

(4.4408920985006264e-17, 1.0)

In [93]:
round(scaled.mean(), 2), scaled.std()

(0.0, 1.0)