# Data Normalization

A technique for transforming data to a specific range or scale before it is processed or analyzed.

The goal of normalization is to bring data with different units or ranges onto a common scale,
which can improve the performance of data analysis and machine learning models.

## #01. Setup
### Importing packages

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler


### Loading the data


In [5]:
df = pd.read_excel("https://data.hossam.kr/D05/gradeuate.xlsx")
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.00,1
3,1,640,3.19,4
4,0,520,2.93,4
...,...,...,...,...
395,0,620,4.00,2
396,0,560,3.04,3
397,0,460,2.63,2
398,0,700,3.65,2


## #02. Min-Max Normalization

Rescales all values into the range 0 to 1.  
> -> The minimum value maps to 0 and the maximum value maps to 1.

$X_{norm} = (X - X_{min}) / (X_{max} - X_{min})$
> -> Useful for shrinking data into a fixed range while preserving the shape of its distribution.
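As a quick check of the formula above, here is a minimal sketch using NumPy. The array `x` holds made-up values chosen only for illustration; it is not part of the dataset.

```python
import numpy as np

# Hypothetical scores, used only to illustrate the formula
x = np.array([200.0, 350.0, 500.0, 800.0])

# (X - X_min) / (X_max - X_min)
x_norm = (x - x.min()) / (x.max() - x.min())

# The smallest value becomes 0.0 and the largest becomes 1.0
print(x_norm)
```

The values in between keep their relative spacing, which is why the shape of the distribution is preserved.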

### Manual calculation



In [6]:
Xmin = df['필기점수'].min()
Xmax = df['필기점수'].max()
df['필기점수정규화'] = (df['필기점수']-Xmin) / (Xmax-Xmin)
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화
0,0,380,3.61,3,0.275862
1,1,660,3.67,3,0.758621
2,1,800,4.00,1,1.000000
3,1,640,3.19,4,0.724138
4,0,520,2.93,4,0.517241
...,...,...,...,...,...
395,0,620,4.00,2,0.689655
396,0,560,3.04,3,0.586207
397,0,460,2.63,2,0.413793
398,0,700,3.65,2,0.827586


### Using Python

In [8]:
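# Create a scaler object that performs min-max normalization
scaler = MinMaxScaler()

# fit_transform expects 2D input, hence the double brackets
df['필기점수정규화2'] = scaler.fit_transform(df[['필기점수']])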
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화,필기점수정규화2
0,0,380,3.61,3,0.275862,0.275862
1,1,660,3.67,3,0.758621,0.758621
2,1,800,4.00,1,1.000000,1.000000
3,1,640,3.19,4,0.724138,0.724138
4,0,520,2.93,4,0.517241,0.517241
...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655
396,0,560,3.04,3,0.586207,0.586207
397,0,460,2.63,2,0.413793,0.413793
398,0,700,3.65,2,0.827586,0.827586
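A fitted `MinMaxScaler` also keeps the minimum and maximum it learned, and can map scaled values back to the original units. The sketch below shows this on a small hypothetical array (not the dataset above), using the `data_min_` and `data_max_` attributes and the `inverse_transform` method.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column data; scikit-learn expects shape (n_samples, n_features)
X = np.array([[200.0], [350.0], [500.0], [800.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# The fitted scaler remembers the minimum and maximum it saw
print(scaler.data_min_, scaler.data_max_)

# inverse_transform maps the scaled values back to the original units
X_restored = scaler.inverse_transform(X_scaled)
print(np.allclose(X, X_restored))
```

Keeping the fitted scaler around is useful when new data must later be scaled with the same minimum and maximum.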


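## #03. Standardization (StandardScaler): z-score standardization
$z = (X - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ is the standard deviation
> Rescales the data to mean 0 and standard deviation 1. This does not change the shape of the distribution (it does not make the data normally distributed). Because the result does not hinge on the single minimum and maximum values, it is usually less sensitive to outliers than Min-Max normalization, although outliers still affect the mean and standard deviation.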

- If the values are in similar units: MinMax (normalization)
- If the values are in different units: Standard (standardization)
- If unsure: Standard (standardization)
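To illustrate why standardization suits features in different units, the sketch below applies the z-score formula column by column to a few rows copied from the table above (written score and GPA). It uses plain NumPy rather than `StandardScaler`, and is only meant as an illustration.

```python
import numpy as np

# A few (written score, GPA) rows copied from the table above
X = np.array([
    [380.0, 3.61],
    [660.0, 3.67],
    [800.0, 4.00],
    [520.0, 2.93],
])

# z = (X - mean) / std, computed separately for each column
z = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization both columns have mean ~0 and standard deviation ~1
print(z.mean(axis=0))
print(z.std(axis=0))
```

Although the raw columns differ by two orders of magnitude, the standardized columns are directly comparable.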

### Manual calculation

In [9]:
mean_uni = df['학부성적'].mean()
# Note: pandas .std() returns the sample standard deviation (ddof=1) by default
std_uni = df['학부성적'].std()
df['StandardScale'] = (df['학부성적'] - mean_uni) / std_uni
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화,필기점수정규화2,StandardScale
0,0,380,3.61,3,0.275862,0.275862,0.578348
1,1,660,3.67,3,0.758621,0.758621,0.736008
2,1,800,4.00,1,1.000000,1.000000,1.603135
3,1,640,3.19,4,0.724138,0.724138,-0.525269
4,0,520,2.93,4,0.517241,0.517241,-1.208461
...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135
396,0,560,3.04,3,0.586207,0.586207,-0.919418
397,0,460,2.63,2,0.413793,0.413793,-1.996758
398,0,700,3.65,2,0.827586,0.827586,0.683455


### Using Python

Be careful not to apply `transform` to data that has already been transformed: the values would be scaled a second time.
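The pitfall can be reproduced on a small hypothetical array (not the dataset): feeding the output of `transform` back into `transform` applies the learned mean and standard deviation a second time.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values, used only for illustration
X = np.array([[2.5], [3.0], [3.5], [4.0]])

scaler = StandardScaler().fit(X)   # learns the mean (3.25) and standard deviation of X
once = scaler.transform(X)         # correctly standardized
twice = scaler.transform(once)     # mistake: the same mean/std are applied again

print(once.ravel())
print(twice.ravel())
```

The second result is no longer centered on 0, because the original mean was subtracted from values that had already been centered.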

In [11]:
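## Create a scaler object that performs standardization
scaler = StandardScaler()

## fit learns the mean and standard deviation from the data
scaler.fit(df[['학부성적']])
## transform applies the learned scaling to the data
df['StandardScale(2)'] = scaler.transform(df[['학부성적']])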
df


Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화,필기점수정규화2,StandardScale,StandardScale(2)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974
...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310
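The `StandardScale` and `StandardScale(2)` columns differ slightly (for example 0.578348 versus 0.579072). This is consistent with the two methods using different standard deviations: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below checks this on a small hypothetical series rather than the dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values, used only for illustration
s = pd.Series([2.5, 3.0, 3.5, 4.0], name="gpa")

manual_sample = (s - s.mean()) / s.std()             # pandas default: ddof=1
manual_population = (s - s.mean()) / s.std(ddof=0)   # population standard deviation
sk = StandardScaler().fit_transform(s.to_frame()).ravel()

# The ddof=0 version matches scikit-learn; the default ddof=1 version does not
print(np.allclose(manual_population, sk))
print(np.allclose(manual_sample, sk))
```

Passing `ddof=0` to `.std()` in the manual calculation should therefore reproduce the scikit-learn result.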


## #04. RobustScaler 
> - Used when the data contains outliers.

> - Minimizes the influence of outliers by scaling with the median and the quartiles.


$X_{robust} = (X - median) / IQR$, where $IQR = Q_3 - Q_1$
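The effect of an outlier can be seen in a small worked example. The array below is made up for illustration (the value 100 plays the role of the outlier), and min-max scaling is computed alongside for comparison.

```python
import numpy as np

# Hypothetical data; 100 is an outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Robust scaling: (X - median) / IQR
x_robust = (x - median) / iqr

# Min-max scaling for comparison: (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_robust)
print(x_minmax)
```

With robust scaling the four ordinary values stay evenly spread around 0, whereas min-max scaling squeezes them all below about 0.03 because the outlier sets the maximum.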

### Manual calculation

In [12]:
medi = df['병원경력'].median()
q1 = df['병원경력'].quantile(0.25)
q3 = df['병원경력'].quantile(0.75)
# Interquartile range: the spread of the middle 50% of the data
IQR = q3 - q1

df['RobustScale(1)'] = (df['병원경력'] - medi) / IQR
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화,필기점수정규화2,StandardScale,StandardScale(2),RobustScale(1)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072,1.0
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929,1.0
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143,-1.0
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927,2.0
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974,2.0
...,...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143,0.0
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570,1.0
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259,0.0
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310,0.0


### Using Python

In [13]:
scaler = RobustScaler()
df['RobustScale(2)'] = scaler.fit_transform(df[['병원경력']])
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수정규화,필기점수정규화2,StandardScale,StandardScale(2),RobustScale(1),RobustScale(2)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072,1.0,1.0
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929,1.0,1.0
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143,-1.0,-1.0
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927,2.0,2.0
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143,0.0,0.0
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570,1.0,1.0
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259,0.0,0.0
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310,0.0,0.0
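The manual and scikit-learn columns agree because `RobustScaler`, with its default settings, centers on the median and scales by the 25th to 75th percentile range. A minimal sketch on hypothetical data (not the dataset) inspects the fitted `center_` and `scale_` attributes to confirm this.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single-column data; 100 is an outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.center_)  # median learned during fit
print(scaler.scale_)   # interquartile range learned during fit
print(X_scaled.ravel())
```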
