<a href="https://colab.research.google.com/github/johyunkang/adp_certificate/blob/main/stats_textbook_07.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

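# Chapter 7 Statistics and Machine Learning

## 7.1 Machine Learning Basics

## 7.2 Regularization, Ridge Regression, and Lasso Regression

### 7.2.1 Regularization

- Regularization: a technique that adds a penalty term to the loss function when estimating parameters, which keeps the coefficients from taking large values
- In statistics it is also called **shrinkage estimation** of the parameters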

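### 7.2.2 Ridge Regression

- A regression model that uses the sum of squared coefficients as the regularization term
- Also called L2 regularization
- Ridge regression estimates the coefficients that minimize the penalized residual sum of squares shown below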

![ridge](https://user-images.githubusercontent.com/291782/212477117-e75e7619-6a92-49c6-a309-1080a0c7e3ac.png)


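- In the expression above, α (alpha) is the parameter that sets the regularization strength. The larger α is, the stronger the penalty, so the absolute values of the coefficients become smaller.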
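
The effect of α can be checked numerically. The following is a minimal sketch, assuming synthetic data (not the chapter's dataset) and scikit-learn's `linear_model.Ridge`; the variable names and the list of α values are chosen only for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration, not the chapter's dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=0.5, size=50)

# Fit ridge regression for increasing alpha and record the coefficient size
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = linear_model.Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.sum(model.coef_ ** 2))

for alpha, norm in zip(alphas, coef_norms):
    print(f"alpha={alpha:>8}: sum of squared coefficients = {norm:.4f}")
```

The printed sum of squared coefficients should fall as α grows, which is the shrinkage described above.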


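### 7.2.3 Lasso Regression

- A regression model that uses the sum of the absolute values of the coefficients as the regularization term
- Also called L1 regularization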

![lasso](https://user-images.githubusercontent.com/291782/212477264-ba23268e-b074-4950-ad38-9b95ef6b5329.png)

- Identical to ridge regression except that the penalty term is the sum of absolute values

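### 7.2.4 Choosing the Regularization Strength Parameter

- The α in the regularization term cannot be settled by the same minimization that estimates the coefficients
- If α were also treated as a quantity to optimize, it would always be driven to α=0, and the fit would then simply minimize the residual sum of squares
- In that case the method is no different from ordinary least squares
- Cross-validation is the recommended way to choose α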
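
The points above can be sketched as follows, assuming synthetic data and scikit-learn's `linear_model.Ridge` and `linear_model.RidgeCV`; the α grid and the choice of 5 folds are illustrative assumptions, not values from the text. The training residual sum of squares is used here as a stand-in for "optimizing α on the fitting data".

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): 60 samples, 20 features,
# only the first three features actually affect y
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=60)

alphas = np.logspace(-3, 3, 13)

# Training residual sum of squares for each alpha: it does not decrease as
# alpha grows, so minimizing it over alpha just pushes alpha toward 0
train_rss = []
for alpha in alphas:
    model = linear_model.Ridge(alpha=alpha).fit(X, y)
    train_rss.append(np.sum((y - model.predict(X)) ** 2))

# Cross-validation instead scores each alpha on held-out folds
ridge_cv = linear_model.RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("alpha with the smallest training RSS:", alphas[int(np.argmin(train_rss))])
print("alpha chosen by 5-fold cross-validation:", ridge_cv.alpha_)
```

The training criterion always prefers the smallest α in the grid, whereas cross-validation evaluates each candidate on data the model was not fitted to.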

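### 7.2.5 Standardizing the Independent Variables

- Before running ridge or lasso regression, the independent variables need to be standardized to mean 0 and standard deviation 1
- This is because changing the units of the data changes the absolute size of the regression coefficients, and with it the size of the penalty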
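
A small sketch of the unit problem, assuming synthetic data in which one feature is a distance recorded either in kilometers or in meters. The helper `standardize` and all variable names are hypothetical, introduced only for this example.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): one feature measured in km
rng = np.random.default_rng(2)
dist_km = rng.uniform(1.0, 10.0, size=40)
other = rng.normal(size=40)
y = 2.0 * dist_km + other + rng.normal(scale=0.5, size=40)

def standardize(X):
    """Hypothetical helper: subtract column means, divide by sample std (ddof=1)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

X_km = np.column_stack([dist_km, other])
X_m = np.column_stack([dist_km * 1000.0, other])  # same distance, in meters

# Without standardization the coefficient depends on the unit,
# so the same alpha penalizes the two versions differently
raw_km = linear_model.Ridge(alpha=10.0).fit(X_km, y).coef_
raw_m = linear_model.Ridge(alpha=10.0).fit(X_m, y).coef_

# After standardization both versions give the same coefficients
std_km = linear_model.Ridge(alpha=10.0).fit(standardize(X_km), y).coef_
std_m = linear_model.Ridge(alpha=10.0).fit(standardize(X_m), y).coef_

print("raw, km:", raw_km)
print("raw, m: ", raw_m)
print("standardized, km:", std_km)
print("standardized, m: ", std_m)
```

The raw coefficients differ between the two unit choices, while the standardized fits agree, so the penalty no longer depends on how the data happened to be recorded.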

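### 7.2.6 How Ridge and Lasso Estimates Differ

- Ridge regression tends to give regression coefficients whose absolute values are small overall
- Lasso regression tends to set most parameters to exactly 0, leaving only a few with nonzero values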
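
This difference can be seen by counting zero coefficients. The sketch below assumes synthetic data and scikit-learn's `linear_model.Ridge` and `linear_model.Lasso`; the α values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): 100 samples, 30 features,
# of which only the first three have nonzero true coefficients
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 30))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(size=100)

ridge = linear_model.Ridge(alpha=10.0).fit(X, y)
lasso = linear_model.Lasso(alpha=0.3).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("coefficients exactly 0 (ridge):", n_zero_ridge, "of", X.shape[1])
print("coefficients exactly 0 (lasso):", n_zero_lasso, "of", X.shape[1])
```

Note that scikit-learn scales the loss differently in `Ridge` and `Lasso`, so the two α values here are not directly comparable; only the pattern of zeros is of interest.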

The figure below compares the size of the ridge and lasso penalties.

![ridge-lasso](https://user-images.githubusercontent.com/291782/212478620-ebf719a1-3547-4a72-9e28-27256a8fb46d.png)



## 7.3 Ridge and Lasso Regression in Python

In [1]:
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

import statsmodels.formula.api as smf
import statsmodels.api as sm

from sklearn import linear_model

%precision 3
%matplotlib inline

In [2]:
PATH = '/content/drive/MyDrive/Colab Notebooks/adp/파이썬으로배우는통계학교과서/data/'
x = pd.read_csv(PATH + '7-3-1-large-data.csv')
x.head()

|  | X_1 | X_2 | X_3 | X_4 | X_5 | X_6 | X_7 | X_8 | X_9 | X_10 | ... | X_91 | X_92 | X_93 | X_94 | X_95 | X_96 | X_97 | X_98 | X_99 | X_100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.5 | 0.3333 | 0.25 | 0.2 | 0.1667 | 0.1429 | 0.125 | 0.1111 | 0.1 | ... | 0.011 | 0.0109 | 0.0108 | 0.0106 | 0.0105 | 0.0104 | 0.0103 | 0.0102 | 0.0101 | 0.01 |
| 1 | 0.5 | 0.3333 | 0.25 | 0.2 | 0.1667 | 0.1429 | 0.125 | 0.1111 | 0.1 | 0.0909 | ... | 0.0109 | 0.0108 | 0.0106 | 0.0105 | 0.0104 | 0.0103 | 0.0102 | 0.0101 | 0.01 | 0.0099 |
| 2 | 0.3333 | 0.25 | 0.2 | 0.1667 | 0.1429 | 0.125 | 0.1111 | 0.1 | 0.0909 | 0.0833 | ... | 0.0108 | 0.0106 | 0.0105 | 0.0104 | 0.0103 | 0.0102 | 0.0101 | 0.01 | 0.0099 | 0.0098 |
| 3 | 0.25 | 0.2 | 0.1667 | 0.1429 | 0.125 | 0.1111 | 0.1 | 0.0909 | 0.0833 | 0.0769 | ... | 0.0106 | 0.0105 | 0.0104 | 0.0103 | 0.0102 | 0.0101 | 0.01 | 0.0099 | 0.0098 | 0.0097 |
| 4 | 0.2 | 0.1667 | 0.1429 | 0.125 | 0.1111 | 0.1 | 0.0909 | 0.0833 | 0.0769 | 0.0714 | ... | 0.0105 | 0.0104 | 0.0103 | 0.0102 | 0.0101 | 0.01 | 0.0099 | 0.0098 | 0.0097 | 0.0096 |


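The sample size is 150, and the data is fairly complex, with 100 columns from X_1 to X_100.

### 7.3.3 Standardization

Standardization subtracts the mean from each variable and then divides by its standard deviation.

After this step each variable has mean 0 and standard deviation 1.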

In [4]:
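# Mean of the first column before standardization
np.mean(x.X_1)

# Subtract each column's mean, then divide by its sample standard deviation
x -= np.mean(x, axis=0)
x /= np.std(x, ddof=1, axis=0)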

In [8]:
# Check that the means are now 0
print('Check that the means are 0:\n ', np.mean(x, axis=0).head(3).round(3))

# Check that the standard deviations are now 1
print('\n\nCheck that the standard deviations are 1: \n', np.std(x, ddof=1, axis=0).head(3))

Check that the means are 0:
  X_1   -0.0
X_2   -0.0
X_3    0.0
dtype: float64


Check that the standard deviations are 1:
 X_1    1.0
X_2    1.0
X_3    1.0
dtype: float64
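
As a side note, the manual standardization above uses the sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, which assumes a small made-up DataFrame rather than the chapter's data, shows the difference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up DataFrame (an assumption for illustration)
df = pd.DataFrame({"X_1": [1.0, 0.5, 0.25, 0.2], "X_2": [0.5, 0.25, 0.2, 0.1]})

# Same approach as the cells above: sample standard deviation (ddof=1)
manual = (df - df.mean(axis=0)) / df.std(axis=0, ddof=1)

# StandardScaler divides by the population standard deviation (ddof=0)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print("manual, std with ddof=1:", manual.std(ddof=1).round(3).tolist())
print("scaler, std with ddof=0:", scaled.std(ddof=0).round(3).tolist())
print("scaler, std with ddof=1:", scaled.std(ddof=1).round(3).tolist())
```

With 150 rows the two conventions differ only slightly, but it helps to know which one is in use when comparing results across tools.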
