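# Logistic Regression
- An algorithm for classification (it predicts a categorical target, not a continuous one)
- Fits the best sigmoid curve to the data and classifies based on the function's output
- Mainly used for binary classification
- Predicts 1 if the predicted probability is above 0.5, and 0 otherwise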
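
The decision rule in the last bullet can be checked directly. The following is a minimal sketch, assuming scikit-learn's `LogisticRegression` and the breast cancer dataset used later in this notebook; the names `proba_pos` and `manual_preds` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Column 1 of predict_proba is the probability of the positive class (label 1)
proba_pos = clf.predict_proba(X)[:, 1]

# Apply the 0.5 threshold by hand
manual_preds = (proba_pos > 0.5).astype(int)

# Compare the hand-thresholded labels with what predict() returns
print(np.array_equal(manual_preds, clf.predict(X)))
```

A different threshold can be applied the same way when false positives and false negatives carry different costs.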

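## Hyperparameters
- `solver`: the optimization algorithm; options include lbfgs, liblinear, newton-cg, sag, saga
- `penalty`: the regularization type; L1 or L2 can be set (which ones are allowed depends on the solver)
- `C`: controls regularization strength as the inverse of alpha (C = 1 / alpha), so a smaller C means stronger regularization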
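
To make the role of `C` concrete, here is a small sketch that fits the same model with three values of `C` and prints the size of the learned coefficients. The specific `C` values and the `coef_norms` dictionary are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Default penalty is L2; a smaller C shrinks the coefficients harder
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = np.linalg.norm(clf.coef_)
    print('C: {0}, L2 norm of coefficients: {1:.3f}'.format(C, coef_norms[C]))
```

The coefficient norm should grow as `C` grows, which reflects the weaker penalty.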

## Sigmoid Function
$$f(x) = \frac{1}{1 + e^{-x}}$$

<img alt="Sigmoid curve" src="./images/sigmoid.png" width="300px">
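
The formula translates directly into NumPy. This is a minimal sketch; the helper name `sigmoid` is illustrative and is not something the notebook defines.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-5.0, 0.0, 5.0])
# Values fall strictly between 0 and 1, with sigmoid(0) equal to 0.5
print(sigmoid(xs))
```

For inputs with a very large magnitude, `scipy.special.expit` computes the same function in a numerically stable way.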


### Wisconsin Breast Cancer Prediction

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

In [3]:
cancer = load_breast_cancer()

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [5]:
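# Standardize every feature to zero mean and unit variance
# (note: the scaler is fit on the full dataset, before the train/test split)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(cancer.data)

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, cancer.target, test_size=0.3, random_state=0)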
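
Because the scaler above is fit on all rows before the split, the test rows influence the scaling statistics. One common alternative, sketched here as an assumption and not as the notebook's method, is to split first and let a `Pipeline` fit the scaler on the training rows only. The names `pipe` and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()

# Split the raw features first so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)

# fit() learns the scaling from X_train only; score() reuses it on X_test
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print('accuracy: {0:.3f}'.format(test_accuracy))
```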

In [6]:
from sklearn.metrics import accuracy_score, roc_auc_score

In [7]:
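# Fit with the default settings and predict class labels for the test set
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
lr_preds = lr_clf.predict(X_test)

# Note: roc_auc_score receives hard 0/1 labels here, not predicted probabilities
print('accuracy: {0:.3f}, roc_auc: {1:.3f}'.format(accuracy_score(y_test, lr_preds),
                                                   roc_auc_score(y_test, lr_preds)))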

accuracy: 0.977, roc_auc: 0.972
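
ROC AUC is normally computed from scores or probabilities, since it measures how well the model ranks positives above negatives across all thresholds. The sketch below repeats the setup above and computes the metric both ways; the variable names `auc_from_labels` and `auc_from_proba` are illustrative, and no particular values are claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    data_scaled, cancer.target, test_size=0.3, random_state=0)

lr_clf = LogisticRegression().fit(X_train, y_train)

# AUC from hard 0/1 labels (a single operating point)
auc_from_labels = roc_auc_score(y_test, lr_clf.predict(X_test))

# AUC from the predicted probability of the positive class (all thresholds)
auc_from_proba = roc_auc_score(y_test, lr_clf.predict_proba(X_test)[:, 1])

print('roc_auc (labels): {0:.3f}, roc_auc (probabilities): {1:.3f}'.format(
    auc_from_labels, auc_from_proba))
```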


#### Predicting with different solvers

In [8]:
solvers = ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']

# Fit one model per solver; max_iter is raised above the default of 100
for solver in solvers:
    lr_clf = LogisticRegression(solver=solver, max_iter=600)
    lr_clf.fit(X_train, y_train)
    lr_preds = lr_clf.predict(X_test)
    
    print('solver: {0}, accuracy: {1:.3f}, roc_auc: {2:.3f}'.format(solver,
                                                                    accuracy_score(y_test, lr_preds),
                                                                    roc_auc_score(y_test, lr_preds)))

solver: lbfgs, accuracy: 0.977, roc_auc: 0.972
solver: liblinear, accuracy: 0.982, roc_auc: 0.979
solver: newton-cg, accuracy: 0.977, roc_auc: 0.972
solver: sag, accuracy: 0.982, roc_auc: 0.979
solver: saga, accuracy: 0.982, roc_auc: 0.979


#### Finding the best parameters with GridSearchCV

In [9]:
from sklearn.model_selection import GridSearchCV

In [10]:
params = {
    'solver': ['liblinear', 'lbfgs'],
    'penalty': ['l2', 'l1'],
    'C': [0.01, 0.1, 1, 1, 5, 10]  # note: 1 is listed twice, so that value is searched twice
}

lr_clf = LogisticRegression()

grid_clf = GridSearchCV(lr_clf, param_grid=params, scoring='accuracy', cv=3)
grid_clf.fit(data_scaled, cancer.target)
print('best hyperparameters: {0}, best mean accuracy: {1:.3f}'.format(grid_clf.best_params_,
                                                   grid_clf.best_score_))


best hyperparameters: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}, best mean accuracy: 0.979
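
The grid above pairs `lbfgs` with the `l1` penalty, a combination `lbfgs` rejects, which is what the failure report below refers to. One way to avoid those failed fits is to pass `GridSearchCV` a list of grids so that each solver is only combined with penalties it supports. This is a sketch under assumptions: the `C_values` list (with the duplicate removed) and `max_iter=1000` are choices made here for illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)

C_values = [0.01, 0.1, 1, 5, 10]

# One dict per solver, listing only the penalties that solver accepts
params = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': C_values},
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': C_values},
]

grid_clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=params,
                        scoring='accuracy', cv=3)
grid_clf.fit(data_scaled, cancer.target)

# Every candidate should now have a valid (non-NaN) mean score
mean_scores = grid_clf.cv_results_['mean_test_score']
print('candidates: {0}, failed: {1}'.format(len(mean_scores),
                                            int(np.isnan(mean_scores).sum())))
print('best hyperparameters: {0}, best mean accuracy: {1:.3f}'.format(
    grid_clf.best_params_, grid_clf.best_score_))
```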


18 fits failed out of a total of 72.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
18 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/hakchangs/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/hakchangs/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/hakchangs/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penaltie