# Scaling

The process of putting features on a common scale

> e.g. suppose the training features have the following ranges  
> $X_1$: 0 ~ 1  
> $X_2$: 100 ~ 1000  
> $X_3$: 10000 ~ 100000  

In this case, the model may treat $X_1$ and $X_2$ as unimportant, simply because their values are so much smaller than those of $X_3$
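
A distance-based model (k-nearest neighbors, for instance; the notes do not name a model, so this is an assumption) makes the effect easy to see. The sketch below uses made-up points `a`, `b`, `c` inside the ranges above and a hypothetical helper `min_max` to compare Euclidean distances before and after rescaling.

```python
import math

# Feature ranges from the example above: (min, max) for X1, X2, X3
ranges = [(0.0, 1.0), (100.0, 1000.0), (10000.0, 100000.0)]

# Hypothetical samples, chosen only for illustration
a = (0.0, 100.0, 50000.0)
b = (1.0, 1000.0, 50000.0)   # opposite end of X1 and X2, same X3
c = (0.0, 100.0, 59000.0)    # same X1 and X2, X3 shifted by 10% of its range


def min_max(point):
    """Rescale each feature to [0, 1] using the known ranges."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges))


raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)
scaled_ab = math.dist(min_max(a), min_max(b))
scaled_ac = math.dist(min_max(a), min_max(c))

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values, `c` ends up about ten times farther from `a` than `b` is, even though `b` differs across the full range of two features. After rescaling, the ordering flips and `b` is the more distant point.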

<br>

Also, suppose new data arrives at the scales below, and the model was trained on data at the same scale as scale 1

|scale|data|
|-----|----|
|1    |[1, 2, 3, 4, 5]|
|2    |[10, 20, 30, 40, 50]|
|3    |[100, 200, 300, 400, 500]|

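In this case, if future test data falls in the range of scale 1, the model should produce the outputs it learned  
However, the model has never seen data at scale 2 or scale 3  
So it will likely perform very poorly on them

However, scale 2 and scale 3 can be brought to the same scale as scale 1 by dividing by 10 and 100, respectively  
In that case, the model should behave correctly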

Scaling therefore also matters for generalization
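
The rescaling described above can be checked directly. This is a minimal sketch using the three rows of the table; the helper name `rescale` is hypothetical, and in practice the factor is learned from the training data by a scaler rather than known in advance.

```python
scale1 = [1, 2, 3, 4, 5]
scale2 = [10, 20, 30, 40, 50]
scale3 = [100, 200, 300, 400, 500]


def rescale(values, factor):
    """Divide every value by a constant factor."""
    return [v / factor for v in values]


rescaled2 = rescale(scale2, 10)
rescaled3 = rescale(scale3, 100)

print(rescaled2)
print(rescaled3)
print(rescaled2 == scale1 and rescaled3 == scale1)
```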

<br>

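<span style="font-size: 15pt;"> 1. Normalization (min-max scaling) </span>
- Rescales a feature so that its minimum becomes 0 and its maximum becomes 1
- Very sensitive to outliers, since a single extreme value sets the minimum or maximum
- Equation: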

$$ X_i^{\prime} = \frac {X_i -  X_{min}}{X_{max} - X_{min}}$$

- Use sklearn's MinMaxScaler
- Usage

    > ```python
    > from sklearn.preprocessing import MinMaxScaler
    > minmax_scaler = MinMaxScaler()
    >
    > minmax_scaler.fit(data)             # learn each feature's min and max from data
    > minmax_scaler.transform(data)       # scale data with the fitted scaler
    >
    > minmax_scaler.fit_transform(data)   # fit on data and scale it in one step
    > ```
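
To make the formula and the outlier sensitivity concrete, here is a small pure-Python sketch. It is not how sklearn implements the scaler; `min_max_scale` is a hypothetical helper that applies the equation above to a single feature, and the data is made up.

```python
def min_max_scale(values):
    """Apply X' = (X - X_min) / (X_max - X_min) to a list of numbers."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


clean = [10.0, 20.0, 30.0, 40.0, 50.0]
with_outlier = clean + [1000.0]   # one hypothetical extreme value

scaled_clean = min_max_scale(clean)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_clean)
print([round(v, 3) for v in scaled_outlier])
```

Without the outlier, the five values spread evenly over [0, 1]. With it, the same five values are squeezed into the bottom few percent of the range, which is what "sensitive to outliers" means here.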

<br>

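<span style="font-size: 15pt;"> 2. Standardization </span>
- Rescales a feature to mean 0 and variance 1
- This only shifts and rescales the data; the shape of the distribution is unchanged, so the result is standard normal only if the feature was normally distributed to begin with
- Also sensitive to outliers, since they affect both the mean and the standard deviation
- Equation: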

$$ X_i^{\prime} = \frac {X_i -  \mu}{\sigma}$$

- Use sklearn's StandardScaler
- Usage

    > ```python
    > from sklearn.preprocessing import StandardScaler
    > standard_scaler = StandardScaler()
    >
    > standard_scaler.fit(data)             # learn each feature's mean and standard deviation from data
    > standard_scaler.transform(data)       # standardize data with the fitted scaler
    >
    > standard_scaler.fit_transform(data)   # fit on data and standardize it in one step
    > ```
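
The equation above can be checked by hand with the standard library. This sketch assumes the population standard deviation (dividing by N), and `standardize` is a hypothetical helper, not the sklearn implementation. The sample is made up and deliberately skewed.

```python
import statistics


def standardize(values):
    """Apply X' = (X - mu) / sigma using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical right-skewed sample
skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 15.0]
z = standardize(skewed)

print(round(statistics.fmean(z), 6))    # mean of the scaled values
print(round(statistics.pstdev(z), 6))   # standard deviation of the scaled values
print([round(v, 2) for v in z])
```

The scaled values have mean 0 and standard deviation 1, but the large value is still far from the rest: the skew of the original sample is preserved.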

<br>

<span style="font-size: 15pt;"> 3. Robust Scaling </span>
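- Scales using the median and quartiles instead of the mean and variance
    - Median: the middle value after sorting the data
    - Quartiles: the values at the 1/4 and 3/4 positions after sorting the data
- Reduces the influence of outliers
- Equation: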

$$ X_i^{\prime} = \frac {X_i -  X_{median}}{Q_3 - Q_1}$$

- Use sklearn's RobustScaler
- Usage

    > ```python
    > from sklearn.preprocessing import RobustScaler
    > robust_scaler = RobustScaler()
    >
    > robust_scaler.fit(data)             # learn each feature's median and interquartile range from data
    > robust_scaler.transform(data)       # scale data with the fitted scaler
    >
    > robust_scaler.fit_transform(data)   # fit on data and scale it in one step
    > ```
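
A hand-rolled version of the equation above shows why the outlier has less influence. This is a sketch with made-up data; `robust_scale` is a hypothetical helper, and the quartiles are computed with linear interpolation between data points, which may differ from sklearn's quantile computation in edge cases.

```python
import statistics


def robust_scale(values):
    """Apply X' = (X - median) / (Q3 - Q1)."""
    med = statistics.median(values)
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]


clean = [10.0, 20.0, 30.0, 40.0, 50.0]
with_outlier = clean + [1000.0]   # one hypothetical extreme value

robust_clean = robust_scale(clean)
robust_outlier = robust_scale(with_outlier)

print(robust_clean)
print(robust_outlier)
```

Adding the extreme value shifts the median and the interquartile range only slightly, so the five ordinary values keep roughly the same spread. The outlier itself is mapped to a large value instead of compressing everything else.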

<br>

![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*y0esOCH8O2NV1c_8iY3ouA.png)

In [9]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [25]:
data = pd.read_csv('./국민건강보험공단_건강검진정보_2023.csv',
                   encoding='cp949',
                   usecols=['신장(5cm단위)', '허리둘레', '혈청지피티(ALT)', '감마지티피'])

In [26]:
data = data.dropna()

In [27]:
data

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
0,155,92.0,24.0,50.0
1,160,86.0,11.0,31.0
2,150,96.0,29.0,24.0
3,160,85.0,21.0,27.0
4,165,84.5,33.0,49.0
...,...,...,...,...
999995,170,78.0,13.0,22.0
999996,165,96.1,65.0,160.0
999997,155,87.0,26.0,25.0
999998,160,69.0,20.0,16.0


In [36]:
min_max_scaler = MinMaxScaler()

In [30]:
min_max_scaler.fit(data)


In [37]:
min_max_scaler.__dict__

{'feature_range': (0, 1), 'copy': True, 'clip': False}

In [32]:
data.loc[:] = min_max_scaler.transform(data)



In [34]:
data.loc[:] = min_max_scaler.inverse_transform(data)

In [35]:
data

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
0,155.0,92.0,24.0,50.0
1,160.0,86.0,11.0,31.0
2,150.0,96.0,29.0,24.0
3,160.0,85.0,21.0,27.0
4,165.0,84.5,33.0,49.0
...,...,...,...,...
999995,170.0,78.0,13.0,22.0
999996,165.0,96.1,65.0,160.0
999997,155.0,87.0,26.0,25.0
999998,160.0,69.0,20.0,16.0
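
The round trip in the cells above returns the original values because min-max scaling is an invertible linear map: the inverse is $X_i = X_i^{\prime}(X_{max} - X_{min}) + X_{min}$. The sketch below illustrates this on the first five 허리둘레 values shown in the output; the helper names are hypothetical and the min and max are taken from these five values only, not from the full dataset.

```python
def fit_min_max(values):
    """Return the (min, max) pair that a min-max scaler would learn."""
    return min(values), max(values)


def transform(values, lo, hi):
    """Scale values to [0, 1] with the learned min and max."""
    return [(v - lo) / (hi - lo) for v in values]


def inverse_transform(scaled, lo, hi):
    """Undo the scaling: X = X' * (max - min) + min."""
    return [s * (hi - lo) + lo for s in scaled]


# First five 허리둘레 values from the output above
waist = [92.0, 86.0, 96.0, 85.0, 84.5]
lo, hi = fit_min_max(waist)
scaled = transform(waist, lo, hi)
restored = inverse_transform(scaled, lo, hi)

print(scaled)
print(restored)
```

The restored values match the originals up to floating-point rounding.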


In [40]:
min_max_scaler = MinMaxScaler()

In [41]:
min_max_scaler.__dict__

{'feature_range': (0, 1), 'copy': True, 'clip': False}

In [43]:
data.loc[:] = min_max_scaler.fit_transform(data)

In [45]:
data.loc[:] = min_max_scaler.inverse_transform(data)

In [47]:
train, valid, test = data.iloc[:300000], data.iloc[300000:600000], data.iloc[600000:]  # valid is the data used for evaluation during training

In [48]:
data = pd.read_csv('./국민건강보험공단_건강검진정보_2023.csv',
                   encoding='cp949',
                   usecols=['신장(5cm단위)', '허리둘레', '혈청지피티(ALT)', '감마지티피'])
data = data.dropna()
data

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
0,155,92.0,24.0,50.0
1,160,86.0,11.0,31.0
2,150,96.0,29.0,24.0
3,160,85.0,21.0,27.0
4,165,84.5,33.0,49.0
...,...,...,...,...
999995,170,78.0,13.0,22.0
999996,165,96.1,65.0,160.0
999997,155,87.0,26.0,25.0
999998,160,69.0,20.0,16.0


In [49]:
data.describe()

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
count,993774.0,993774.0,993774.0,993774.0
mean,162.733796,81.306499,26.36773,35.390904
std,9.330001,10.861462,26.104448,62.448236
min,130.0,7.5,1.0,1.0
25%,155.0,74.0,15.0,15.0
50%,165.0,81.0,21.0,22.0
75%,170.0,88.0,30.0,37.0
max,195.0,999.0,6297.0,9999.0


In [52]:
data = data.query('`허리둘레` < 120 and `혈청지피티(ALT)` < 50 and `감마지티피` < 50')  # remove outliers
data

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
1,160,86.0,11.0,31.0
2,150,96.0,29.0,24.0
3,160,85.0,21.0,27.0
4,165,84.5,33.0,49.0
5,170,69.2,11.0,12.0
...,...,...,...,...
999994,160,71.1,20.0,16.0
999995,170,78.0,13.0,22.0
999997,155,87.0,26.0,25.0
999998,160,69.0,20.0,16.0


In [53]:
data.describe()  # check after removing outliers

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
count,798955.0,798955.0,798955.0,798955.0
mean,161.857326,79.591591,20.368272,21.6803
std,9.251037,10.025588,8.84219,9.709675
min,130.0,7.5,1.0,1.0
25%,155.0,72.0,14.0,14.0
50%,160.0,79.8,19.0,19.0
75%,170.0,86.0,25.0,27.0
max,195.0,119.9,49.0,49.0
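
The two `describe()` outputs show why the outliers were removed before min-max scaling. As a rough worked example, the sketch below plugs the 허리둘레 min, median, and max from each output into the normalization formula. It assumes the scaler is fit on the whole column, which is a simplification of what the notebook does, and `min_max_value` is a hypothetical helper.

```python
def min_max_value(x, lo, hi):
    """Min-max scale a single value given the feature's min and max."""
    return (x - lo) / (hi - lo)


# 허리둘레 statistics copied from the two describe() outputs above
before = {"min": 7.5, "median": 81.0, "max": 999.0}
after = {"min": 7.5, "median": 79.8, "max": 119.9}

median_before = min_max_value(before["median"], before["min"], before["max"])
median_after = min_max_value(after["median"], after["min"], after["max"])

print(f"median before filtering: {median_before:.3f}")
print(f"median after filtering:  {median_after:.3f}")
```

With the maximum of 999.0 in place, the median maps to about 0.07, so most of the data is crowded near 0. After filtering, the median maps to about 0.64 and the values use much more of the [0, 1] range.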


In [54]:
min_max_scaler = MinMaxScaler()

In [55]:
train, valid, test = data.iloc[:300000].copy(), data.iloc[300000:600000].copy(), data.iloc[600000:].copy()  # copy so each split can be modified safely

In [56]:
train.loc[:] = min_max_scaler.fit_transform(train)
valid.loc[:] = min_max_scaler.transform(valid)
test.loc[:] = min_max_scaler.transform(test)



In [57]:
display(train)
display(valid)
display(test)

Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
1,0.461538,0.655704,0.208333,0.625000
2,0.307692,0.758479,0.583333,0.479167
3,0.461538,0.645427,0.416667,0.541667
4,0.538462,0.640288,0.666667,1.000000
5,0.615385,0.483042,0.208333,0.229167
...,...,...,...,...
375478,0.307692,0.544707,0.375000,0.479167
375480,0.384615,0.522097,0.416667,0.291667
375482,0.461538,0.439877,0.250000,0.375000
375484,0.615385,0.696814,0.291667,0.458333


Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
375487,0.153846,0.696814,0.562500,0.687500
375488,0.230769,0.583762,0.375000,0.416667
375491,0.461538,0.501542,0.375000,0.333333
375492,0.384615,0.556012,0.541667,0.312500
375493,0.615385,0.573484,0.250000,0.437500
...,...,...,...,...
751130,0.461538,0.429599,0.458333,0.479167
751132,0.769231,0.676259,0.375000,0.458333
751133,0.307692,0.687564,0.729167,0.500000
751134,0.307692,0.573484,0.354167,0.229167


Unnamed: 0,신장(5cm단위),허리둘레,혈청지피티(ALT),감마지티피
751138,0.615385,0.696814,0.333333,0.312500
751139,0.538462,0.423433,0.354167,0.395833
751140,0.538462,0.433710,0.375000,0.208333
751142,0.307692,0.511819,0.458333,0.520833
751143,0.461538,0.645427,0.354167,0.333333
...,...,...,...,...
999994,0.461538,0.502569,0.395833,0.312500
999995,0.615385,0.573484,0.250000,0.437500
999997,0.384615,0.665982,0.520833,0.500000
999998,0.461538,0.480987,0.395833,0.312500
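
The cells above call `fit_transform` on `train` only and plain `transform` on `valid` and `test`, so the min and max come from the training split alone. The following sketch shows what that implies for a single feature. `TinyMinMax` is a hypothetical stand-in written for illustration, not the sklearn class, and the numbers are made up.

```python
class TinyMinMax:
    """Hypothetical stand-in for a min-max scaler on a single feature."""

    def fit(self, values):
        # Learn the statistics from the data passed in (train only)
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        # Reuse the learned statistics; nothing is re-estimated here
        return [(v - self.lo) / (self.hi - self.lo) for v in values]


train_heights = [150.0, 160.0, 170.0, 180.0]   # made-up training values
test_heights = [145.0, 165.0, 190.0]           # includes values outside the train range

scaler = TinyMinMax().fit(train_heights)
train_scaled = scaler.transform(train_heights)
test_scaled = scaler.transform(test_heights)

print(train_scaled)
print(test_scaled)
```

The training values land exactly in [0, 1], while test values outside the training range fall below 0 or above 1. That is expected: refitting on the test split would hide the mismatch and leak information about the test data into preprocessing.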


