# Logistic Regression

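## The Logistic Regression Model
> **`Logistic Regression`** is a **supervised** machine learning model for **classification**.  
When the **dependent variable is binary**, an ordinary linear model has trouble fitting it.  
It is used for classification problems with a binary dependent variable, and in practice it can also be applied to problems with more than two classes.  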

![logistic](https://drive.google.com/uc?id=10GmOoYLDCvGf7atGImyfyuyeZg2qbU72)

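As the figure above shows, logistic regression turns the linear model into a nonlinear one in order to handle the binary classification problems that an ordinary linear model struggles with.  
The idea behind the change: if the output of the model (function) can be mapped into the range between 0 and 1, that is, if the problem is treated as one of probability, then the model can be used for classification.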

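## Odds
> The key concept behind the probabilistic view of the logistic regression model
$$ p : \text{the probability that an event occurs} $$  
$$ Odds = {p \over {1-p}} $$
>> 0 < p < 1  
0 < 1-p < 1  
As p approaches 0, the odds approach 0  
As p approaches 1, the odds grow toward infinity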
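
A quick numeric check of the two limits above. This is a minimal sketch; the helper name `odds` and the sample probabilities are chosen here purely for illustration.

```python
def odds(p: float) -> float:
    """Return the odds p / (1 - p) for a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


# The odds start near 0 for small p, equal 1 at p = 0.5,
# and grow without bound as p gets close to 1.
for p in (0.01, 0.2, 0.5, 0.8, 0.99):
    print(f"p={p:.2f}  odds={odds(p):.4f}")
```

Note that the odds are not symmetric around p = 0.5 (0.2 gives 0.25 while 0.8 gives 4), which is the motivation for taking the logarithm next.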

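Rather than using the odds directly, we take their logarithm. The log odds are symmetric about 0 and give an expression that is easier to work with.  
Putting the log odds in place of y in the linear regression equation gives

$$ \ln\left({y \over {1-y}}\right) = \beta_0 + \beta_1x $$

Solving this for y gives the sigmoid function.

$$ y = {1 \over {1+e^{-(\beta_0 + \beta_1x)}}} $$  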
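
The step from the log odds to the sigmoid can be filled in as follows, writing $z = \beta_0 + \beta_1x$ as a shorthand (the symbol $z$ is introduced here only for brevity):

$$
\begin{aligned}
\ln\left({y \over {1-y}}\right) &= z \\
{y \over {1-y}} &= e^{z} \\
y &= e^{z}(1-y) \\
y\,(1+e^{z}) &= e^{z} \\
y &= {e^{z} \over {1+e^{z}}} = {1 \over {1+e^{-z}}}
\end{aligned}
$$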

![sigmoid](https://drive.google.com/uc?id=1Es8gzBJUKirvRLUc17qXdHCrNLX0gghx)

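In the end, fitting a logistic regression model also comes down to estimating $\beta_0$ and $\beta_1$.
>> 0 < sigmoid(x) < 1  
sigmoid(0) = 0.5  
So a value passed through the sigmoid function can be interpreted like a probability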
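
The properties listed above can be checked with a few lines of plain Python. This is only a sketch: the coefficients `beta0` and `beta1` are hypothetical values picked for illustration, not estimates from any data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p: float) -> float:
    """Log odds of a probability p, the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


beta0, beta1 = -1.0, 2.0  # hypothetical coefficients, for illustration only

for x in (-3.0, 0.0, 0.5, 3.0):
    z = beta0 + beta1 * x   # linear part of the model
    y = sigmoid(z)          # probability-like output
    # Taking the log odds of y recovers the linear part again.
    print(f"x={x:+.1f}  z={z:+.1f}  y={y:.4f}  logit(y)={logit(y):+.4f}")
```

At x = 0.5 the linear part is exactly 0, so the output is 0.5, the usual decision boundary between the two classes.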

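## Model Evaluation
> For a prediction (regression) model such as **`linear regression`**, the model was evaluated through the least squares criterion.  
A classification model such as **`logistic regression`** is evaluated with classification metrics.  
The most common tools are the **confusion matrix** and the **classification report**.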

### Confusion matrix
![conf1](https://drive.google.com/uc?id=1I4gkLs1Kji1UCseSU6rsxfi8Sp5Q0MOe)

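TP - True Positive (actual 1, predicted 1: number of samples classified correctly)  
FN - False Negative (actual 1, predicted 0: number of samples misclassified)  
FP - False Positive (actual 0, predicted 1: number of samples misclassified)  
TN - True Negative (actual 0, predicted 0: number of samples classified correctly)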
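
The four counts can be read off with scikit-learn's `confusion_matrix`. The sketch below uses ten made-up labels, not the data from the hands-on section. Be aware that for 0/1 labels scikit-learn puts actual values in rows and predictions in columns with class 0 first, so the layout is `[[TN, FP], [FN, TP]]`, which may differ from the order drawn in a figure.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are actual classes, columns are predicted classes, class 0 first,
# so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```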

### Accuracy - the fraction of all samples that the model classified correctly
![conf2](https://drive.google.com/uc?id=1veqNRPag_-PkvGWxDc-1ZPh20L4q9CNB)  
$${TP + TN \over TP + FN + FP + TN}$$

### Precision - of the samples the model classified as positive, the fraction that are actually positive
![conf3](https://drive.google.com/uc?id=1_JVlZ1KGklpCQF_uiZnp4Wli7leJdPvK)  
$${TP \over TP + FP}$$

### Recall - of the samples that are actually positive, the fraction the model classified as positive
![conf4](https://drive.google.com/uc?id=1dkUFhBtLyivJayOOppjUsU07a10Rh0Fi)  
$${TP \over TP + FN}$$

### f1-score - the harmonic mean of precision and recall
![conf5](https://drive.google.com/uc?id=1tB56v7-P5S5_sFOcxrzEthq3-qyDB7hH)  
$${2 \times precision \times recall \over precision + recall}$$  
Used when the classes of the classification problem are imbalanced, where accuracy alone can be misleading.
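
To see why the f1-score matters for imbalanced classes, the formulas above can be applied to a set of counts by hand. This is a sketch with made-up counts; the function name `classification_metrics` is hypothetical, and it assumes no denominator is zero.

```python
def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and f1 from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up, heavily imbalanced counts: only 10 of 1000 samples are positive,
# and the model finds just one of them.
metrics = classification_metrics(tp=1, fn=9, fp=1, tn=989)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy comes out at 0.99 even though nine of the ten positives were missed, while the f1-score of roughly 0.167 reflects the poor recall.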

## Logistic Regression Hands-on

In [1]:
# Import the required modules
import pandas as pd
from sklearn.datasets import load_breast_cancer

In [2]:
# Load the data
cancer = load_breast_cancer()
data = cancer.data
label = cancer.target
columns = cancer.feature_names

In [3]:
# Build a DataFrame
df = pd.DataFrame(data, columns=columns)
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [4]:
df.shape

(569, 30)

In [5]:
df.describe()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [7]:
# Split the data
from sklearn.model_selection import train_test_split
# stratify=label keeps the label distribution the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, label, test_size=0.2, stratify=label)
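
One way to confirm what `stratify` does is to compare the share of class 1 in the full data, the training set and the test set. The sketch below is self-contained and reloads the dataset; `random_state=0` is an addition made here only so the check is repeatable and is not part of the cell above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# random_state is an assumption added for reproducibility of this check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With 0/1 labels, the mean of the labels is the share of class 1.
for name, labels in (("all", y), ("train", y_train), ("test", y_test)):
    print(f"{name:>5}: n={len(labels):3d}  share of class 1 = {labels.mean():.3f}")
```

The three shares should agree to within rounding; without `stratify` they can drift apart from split to split.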

![solver](./image/solver.png)

In [8]:
from sklearn.linear_model import LogisticRegression
lr_model = LogisticRegression()

In [13]:
from sklearn.linear_model import LogisticRegression
# Define the model
lr_model = LogisticRegression(
    penalty='l1',  # the l1 penalty is supported by the liblinear and saga solvers
    C=0.1,
    class_weight='balanced',
    solver='liblinear',
    # solver='saga',
    # solver='lbfgs'  # lbfgs does not support the l1 penalty
)

In [9]:
# Train the model
lr_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()
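
The warning above suggests either raising `max_iter` or scaling the data. A possible way to follow the second suggestion is to put a `StandardScaler` in front of the model with a pipeline. This is a sketch under assumptions, not the notebook's own method: the variable name `scaled_lr` and `random_state=0` are introduced here for illustration, and the dataset is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # random_state added for repeatability
)

# The scaler is fit on the training data only; the pipeline then applies
# the same transformation to whatever is passed to predict/score.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_train, y_train)

# n_iter_ reports how many solver iterations were actually used.
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_
print("iterations used:", n_iter)
print("test accuracy:", scaled_lr.score(X_test, y_test))
```

With features on a comparable scale the default solver typically needs far fewer iterations, so the warning usually no longer appears.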

In [10]:
lr_model_y_pred = lr_model.predict(X_test)

In [11]:
lr_model_y_pred

array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0])

In [13]:
lr_model.predict_proba(X_test)

array([[6.20617909e-03, 9.93793821e-01],
       [9.99998985e-01, 1.01542020e-06],
       [9.86341555e-02, 9.01365844e-01],
       [1.00000000e+00, 1.91377224e-12],
       [9.99999856e-01, 1.43900760e-07],
       [1.00000000e+00, 9.41733684e-17],
       [4.69258838e-01, 5.30741162e-01],
       [9.99999442e-01, 5.58172277e-07],
       [1.00000000e+00, 2.59973000e-11],
       [1.25105466e-02, 9.87489453e-01],
       [3.45316790e-03, 9.96546832e-01],
       [1.33287692e-01, 8.66712308e-01],
       [1.12851483e-01, 8.87148517e-01],
       [9.99999998e-01, 2.28310458e-09],
       [4.30900269e-03, 9.95690997e-01],
       [9.81215744e-01, 1.87842560e-02],
       [9.64798921e-01, 3.52010794e-02],
       [5.69865471e-02, 9.43013453e-01],
       [9.99999493e-01, 5.07307970e-07],
       [1.26408272e-02, 9.87359173e-01],
       [3.24638508e-04, 9.99675361e-01],
       [3.28496931e-03, 9.96715031e-01],
       [1.36512106e-03, 9.98634879e-01],
       [4.29289755e-02, 9.57071025e-01],
       [9.999969
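
Each row of `predict_proba` holds the estimated probability of class 0 and class 1, and `predict` returns the class with the larger probability. The sketch below checks both facts on a freshly fit model; it reloads the data and uses a scaled pipeline and `random_state=0`, which are assumptions made here so the block runs cleanly on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # added for repeatability
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)  # shape (n_samples, 2): P(class 0), P(class 1)
pred = model.predict(X_test)

# Every row is a probability distribution over the two classes.
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))

# Picking the more probable class reproduces predict().
from_proba = model.classes_[np.argmax(proba, axis=1)]
print("argmax matches predict:", np.array_equal(pred, from_proba))
```

For a binary problem this is the same as predicting class 1 whenever the second column exceeds 0.5.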

In [14]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
print(accuracy_score(y_test, lr_model_y_pred))
print(recall_score(y_test, lr_model_y_pred))
print(precision_score(y_test, lr_model_y_pred))
print(f1_score(y_test, lr_model_y_pred))

0.9298245614035088
0.9583333333333334
0.9324324324324325
0.9452054794520548


In [15]:
from sklearn.metrics import classification_report
print('Classification report')
print(classification_report(y_test, lr_model_y_pred))

Classification report
              precision    recall  f1-score   support

           0       0.93      0.88      0.90        42
           1       0.93      0.96      0.95        72

    accuracy                           0.93       114
   macro avg       0.93      0.92      0.92       114
weighted avg       0.93      0.93      0.93       114

