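# Logistic Regression
Logistic regression fits the best sigmoid curve to the data, treats the sigmoid function's return value as a probability, and decides the class according to that probability.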

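Sigmoid Function?
- In many natural and social phenomena, the probability associated with a variable follows a sigmoid (S-shaped) curve rather than a straight line.
- For any value of x, y = 1 / (1 + e^(-x)) returns a value between 0 and 1.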
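
The sigmoid described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `sigmoid`, the sample inputs, and the 0.5 decision threshold are assumptions made for the example, not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs chosen for illustration.
xs = np.array([-5.0, 0.0, 5.0])
probs = sigmoid(xs)

# Treat the sigmoid output as a probability and classify with a 0.5 threshold
# (an assumed, conventional cut-off).
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```

Large negative inputs give values near 0, large positive inputs give values near 1, and an input of 0 gives exactly 0.5.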

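LogisticRegression class parameter: solver
- lbfgs: memory-efficient; CPU computation can run in parallel
- liblinear: works well on small datasets
- newton-cg: Newton-based method for more precise optimization
- sag: Stochastic Average Gradient (a gradient-descent-based method)
- saga: a variant of sag that also supports L1 regularization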

### Wisconsin Breast Cancer Dataset

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

scaler = StandardScaler()
data_scaled = scaler.fit_transform(cancer.data)

X_train, X_test, y_train, y_test = train_test_split(data_scaled, cancer.target,
                                                    test_size=0.3, random_state=0)

In [4]:
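# Fit LogisticRegression with default settings and evaluate on the test set
from sklearn.metrics import accuracy_score, roc_auc_score

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)
lr_prds = lr_clf.predict(X_test)

# Note: roc_auc_score is computed here from hard class labels (lr_prds),
# not from predicted probabilities.
print("accuracy: {0:.3f}, roc_auc: {1:.3f}".format(accuracy_score(y_test, lr_prds),
                                                   roc_auc_score(y_test, lr_prds)))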

accuracy: 0.977, roc_auc: 0.972
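
ROC AUC is usually computed from continuous scores rather than hard labels, since hard labels collapse the ROC curve to a single threshold. The following is a minimal sketch, assuming the same data preparation as the cells above, that passes the positive-class probability from `predict_proba` to `roc_auc_score`. The variable names `pos_proba`, `auc_from_labels`, and `auc_from_proba` are introduced for this example only, and no output is shown because it was not run as part of the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same preparation as the notebook: scale all features, then split 70/30.
cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    data_scaled, cancer.target, test_size=0.3, random_state=0)

lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
pos_proba = lr_clf.predict_proba(X_test)[:, 1]

auc_from_labels = roc_auc_score(y_test, lr_clf.predict(X_test))
auc_from_proba = roc_auc_score(y_test, pos_proba)
print("roc_auc (labels): {0:.3f}, roc_auc (probabilities): {1:.3f}".format(
    auc_from_labels, auc_from_proba))
```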


In [5]:
# Compare the performance of each solver
solvers = ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]

# Train and evaluate one LogisticRegression model per solver
for solver in solvers:
    lr_clf = LogisticRegression(solver=solver, max_iter=600)
    lr_clf.fit(X_train, y_train)
    lr_prds = lr_clf.predict(X_test)

    print("solver: {0}, accuracy: {1:.3f}, roc_auc:{2:.3f}".format(
        solver, accuracy_score(y_test, lr_prds), roc_auc_score(y_test, lr_prds)))

solver: lbfgs, accuracy: 0.977, roc_auc:0.972
solver: liblinear, accuracy: 0.982, roc_auc:0.979
solver: newton-cg, accuracy: 0.977, roc_auc:0.972
solver: sag, accuracy: 0.982, roc_auc:0.979
solver: saga, accuracy: 0.982, roc_auc:0.979


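The main hyperparameters of the LogisticRegression class are penalty (the type of regularization) and C (the regularization strength). C is the inverse of the regularization strength, so a smaller C means stronger regularization.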
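
To see how C behaves, the sketch below fits the default L2-penalized model at a few C values and records the size of the learned coefficient vector. The C values, the `max_iter` setting, and the `coef_norms` dictionary are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)
y = cancer.target

# Smaller C means stronger regularization, which shrinks the coefficients.
coef_norms = {}
for C in [0.01, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=5000)  # default penalty is L2
    clf.fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print("C: {0}, coefficient L2 norm: {1:.3f}".format(C, norm))
```

The coefficient norm should grow as C increases, because the penalty on large weights becomes weaker.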

In [7]:
from sklearn.model_selection import GridSearchCV

# Search over solver, penalty type, and C.
# Note: lbfgs does not support the l1 penalty, so the lbfgs + l1 candidates
# fail to fit and are scored as nan (see the warning in the output).
params = {
    "solver": ["liblinear", "lbfgs"],
    "penalty": ["l2", "l1"],
    "C": [0.01, 0.1, 1, 5, 10]
}

lr_clf = LogisticRegression()

grid_clf = GridSearchCV(lr_clf, param_grid=params, scoring="accuracy", cv=3)
grid_clf.fit(data_scaled, cancer.target)
print("best hyper parameter: {0}, best mean accuracy: {1:.3f}".format(grid_clf.best_params_,
                                                                      grid_clf.best_score_))

best hyper parameter: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}, best mean accuracy: 0.979


15 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 1091, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py", line 61, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

 0.96131997        nan 0.9753
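
The 15 failed fits come from pairing the lbfgs solver with the l1 penalty (5 values of C times 3 folds). One way to avoid them, sketched here as an assumption rather than as the notebook's own fix, is to pass `param_grid` as a list of dictionaries so that each solver is only combined with penalties it supports. The names `C_values`, `param_grid`, and `n_candidates`, and the `max_iter=1000` setting, are introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
data_scaled = StandardScaler().fit_transform(cancer.data)

C_values = [0.01, 0.1, 1, 5, 10]

# Each dictionary is expanded separately, so lbfgs is never paired with l1.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": C_values},
]
n_candidates = len(ParameterGrid(param_grid))  # 10 liblinear + 5 lbfgs candidates

grid_clf = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=param_grid,
                        scoring="accuracy", cv=3)
grid_clf.fit(data_scaled, cancer.target)

mean_scores = grid_clf.cv_results_["mean_test_score"]
print("candidates: {0}, any nan score: {1}".format(n_candidates,
                                                   bool(np.isnan(mean_scores).any())))
print("best hyper parameter: {0}, best mean accuracy: {1:.3f}".format(
    grid_clf.best_params_, grid_clf.best_score_))
```

With this grid every candidate is a valid solver and penalty combination, so no fold should be scored as nan.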