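# Multinomial logistic regression: hands-on practice
- It is one form of multiclass classification.
- Regression analysis and multinomial logistic regression overlap substantially with machine learning.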

__[Example]__   
Fit a multinomial logistic regression that classifies Species in the iris data, then build a confusion matrix.

In [2]:
import pandas as pd 
iris = pd.read_csv('../data/iris.csv')

X = iris.drop(['target'],axis=1)
y = iris.target

In [3]:
y.value_counts()

Iris-versicolor    50
Iris-virginica     50
Iris-setosa        50
Name: target, dtype: int64

In [4]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, train_size=0.7,
                                                   test_size = 0.3, random_state=123)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(105, 4) (45, 4) (105,) (45,)


In [5]:
y_test.value_counts()

Iris-setosa        15
Iris-versicolor    15
Iris-virginica     15
Name: target, dtype: int64

In [6]:
y_train.value_counts()

Iris-versicolor    35
Iris-virginica     35
Iris-setosa        35
Name: target, dtype: int64

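## Building the sklearn model

* The sklearn model applies an L2 penalty by default, which mitigates the multicollinearity problem that affects traditional (unpenalized) statistical models.
* When independent variables are highly correlated, the L2 penalty shrinks their coefficients toward zero and spreads the weight across them. It does not remove variables outright; setting coefficients exactly to zero is what an L1 penalty does.
- If interpretability matters most, use a statistical model; if predictive accuracy and convenience matter most, use the machine-learning implementation.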
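
The effect of the L2 penalty can be checked with a small experiment. The sketch below is an illustration, not part of the original notebook: it assumes the iris data bundled with scikit-learn (`load_iris`) as a stand-in for `../data/iris.csv`, and the values of `C` are arbitrary. `C` is the inverse of the regularization strength, so a small `C` means a strong penalty.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data used in place of ../data/iris.csv
X, y = load_iris(return_X_y=True)

# Small C = strong L2 penalty, large C = weak L2 penalty
strong = LogisticRegression(C=0.01, max_iter=5000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=5000).fit(X, y)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print(f"C=0.01 -> ||coef|| = {strong_norm:.3f}")
print(f"C=100  -> ||coef|| = {weak_norm:.3f}")

# L2 shrinks coefficients but does not set them exactly to zero
n_zero = int(np.sum(strong.coef_ == 0.0))
print("coefficients exactly zero under the strong L2 penalty:", n_zero)
```

The coefficient norm under the strong penalty should come out smaller, while no coefficient is dropped entirely.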

In [8]:
# Fit the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
print(model)

LogisticRegression()


## 모델 평가

In [9]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
# Predict on the test set
predicted = model.predict(X_test)
# sklearn's predict returns the class label directly, not a probability
# In statsmodels, predict returns probabilities

# Build the confusion matrix
cm = confusion_matrix(y_test, predicted) # arguments in the order (true, predicted)
cmtb = pd.DataFrame(cm, columns=['predicted_setosa', 'predicted_versicolor', 'predicted_virginica'],
                   index = ['setosa', 'versicolor', 'virginica'])

cmtb

,predicted_setosa,predicted_versicolor,predicted_virginica
setosa,15,0,0
versicolor,0,14,1
virginica,0,0,15


In [18]:
cm

array([[15,  0,  0],
       [ 0, 14,  1],
       [ 0,  0, 15]], dtype=int64)

In [10]:
model.predict(X_test)

array(['Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-setosa', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor'], dtype=object)

In [11]:
model.predict_proba(X_test)

array([[1.19428521e-03, 5.20355761e-01, 4.78449954e-01],
       [1.00882843e-05, 4.96972622e-02, 9.50292650e-01],
       [4.31641319e-02, 9.33403591e-01, 2.34322773e-02],
       [3.50768604e-03, 7.32081065e-01, 2.64411249e-01],
       [2.47257801e-07, 4.75204086e-03, 9.95247712e-01],
       [9.76994721e-01, 2.30051833e-02, 9.53043602e-08],
       [9.84979643e-01, 1.50203266e-02, 3.08227323e-08],
       [3.08264488e-03, 7.21555472e-01, 2.75361883e-01],
       [9.81966082e-01, 1.80338598e-02, 5.83304056e-08],
       [9.72866408e-01, 2.71335022e-02, 8.95164800e-08],
       [8.95440205e-08, 4.79957133e-03, 9.95200339e-01],
       [9.79252886e-01, 2.07470171e-02, 9.67583640e-08],
       [1.08298656e-04, 1.26063161e-01, 8.73828540e-01],
       [9.94240696e-01, 5.75929656e-03, 7.17413452e-09],
       [7.01638054e-03, 7.80942135e-01, 2.12041485e-01],
       [9.64959750e-01, 3.50401015e-02, 1.48432756e-07],
       [4.04729419e-04, 2.37485626e-01, 7.62109644e-01],
       [4.78721495e-04, 1.85373
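
The relationship between `predict` and `predict_proba` can be sketched as follows: each row of `predict_proba` sums to 1, and `predict` corresponds to the class with the largest probability. This is a minimal illustration that assumes the iris data bundled with scikit-learn rather than the CSV file used above; `clf` is a separate model fitted only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data used in place of ../data/iris.csv
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes)

# Columns follow clf.classes_, so argmax maps back to a class label
labels_from_proba = clf.classes_[proba.argmax(axis=1)]
same_labels = np.array_equal(labels_from_proba, clf.predict(X))

print(proba.shape)
print(np.allclose(proba.sum(axis=1), 1.0))  # rows are probability vectors
print(same_labels)
```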

In [12]:
# Accuracy
print('Accuracy Score: ',accuracy_score(y_test, predicted))

print('\n')
# Build the classification report
class_report = classification_report(y_test, predicted)
print(class_report)

Accuracy Score:  0.9777777777777777


                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        15
Iris-versicolor       1.00      0.93      0.97        15
 Iris-virginica       0.94      1.00      0.97        15

       accuracy                           0.98        45
      macro avg       0.98      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
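
The figures in the report can be derived by hand from the confusion matrix printed earlier. The sketch below copies that matrix as a literal (rows are true classes, columns are predicted classes) and recomputes recall, precision, F1, and accuracy with NumPy.

```python
import numpy as np

# Confusion matrix from the output above (rows = true, columns = predicted)
cm = np.array([[15, 0, 0],
               [0, 14, 1],
               [0, 0, 15]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / number of true samples per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / number of predictions per class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()        # 44 correct out of 45

print("recall   ", np.round(recall, 2))
print("precision", np.round(precision, 2))
print("f1       ", np.round(f1, 2))
print("accuracy ", accuracy)
```

The single misclassified versicolor sample lowers versicolor recall to 14/15 and virginica precision to 15/16, which is where the 0.93 and 0.94 in the report come from.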



In [13]:
roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovr') # One vs Rest

0.9985185185185186
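
What `multi_class='ovr'` computes can be sketched by hand: one binary AUC per class (that class against all others), averaged without weights under the default `average='macro'`. The example assumes the iris data bundled with scikit-learn and a freshly fitted model `clf`, so the numbers need not match the output above exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Assumption: bundled iris data used in place of ../data/iris.csv
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=123
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# One-vs-rest by hand: binary AUC for each class, then an unweighted mean
y_bin = label_binarize(y_te, classes=clf.classes_)
per_class = [roc_auc_score(y_bin[:, k], proba[:, k]) for k in range(len(clf.classes_))]
manual_ovr = float(np.mean(per_class))

builtin_ovr = roc_auc_score(y_te, proba, multi_class="ovr")
builtin_ovo = roc_auc_score(y_te, proba, multi_class="ovo")
print(manual_ovr, builtin_ovr, builtin_ovo)
```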

In [14]:
help(roc_auc_score)
# Rule of thumb used in this note (not a scikit-learn requirement):
# 3-4 classes: ovr (One vs Rest)
# 5 or more classes: ovo (One vs One)

Help on function roc_auc_score in module sklearn.metrics._ranking:

roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)
    Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
    from prediction scores.
    
    Note: this implementation can be used with binary, multiclass and
    multilabel classification, but some restrictions apply (see Parameters).
    
    Read more in the :ref:`User Guide <roc_metrics>`.
    
    Parameters
    ----------
    y_true : array-like of shape (n_samples,) or (n_samples, n_classes)
        True labels or binary label indicators. The binary and multiclass cases
        expect labels with shape (n_samples,) while the multilabel case expects
        binary label indicators with shape (n_samples, n_classes).
    
    y_score : array-like of shape (n_samples,) or (n_samples, n_classes)
        Target scores.
    
        * In the binary case, it corresponds to an array 

## Interpreting multinomial logistic regression coefficients


In [15]:
# Inspect the regression coefficients
print('Intercept: \n', model.intercept_)
print('Coefficient: \n', model.coef_)

Intercept: 
 [  9.43048325   2.10041055 -11.5308938 ]
Coefficient: 
 [[-0.45764038  0.87256659 -2.30838912 -0.96065436]
 [ 0.37582283 -0.19465588 -0.16297912 -0.75280996]
 [ 0.08181755 -0.67791071  2.47136824  1.71346432]]
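
To see how `intercept_` and `coef_` produce the predicted probabilities, the sketch below rebuilds `predict_proba` by hand. It assumes the multinomial (softmax) formulation, which recent scikit-learn releases use for multiclass problems with the default solver, and it uses the bundled iris data with a separately fitted model `clf` rather than the notebook's `model`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: bundled iris data used in place of ../data/iris.csv
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score per class: z_k = intercept_k + coef_k . x
scores = X @ clf.coef_.T + clf.intercept_

# Softmax turns scores into probabilities (row maximum subtracted for numerical stability)
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
manual_proba = exp_scores / exp_scores.sum(axis=1, keepdims=True)

matches = np.allclose(manual_proba, clf.predict_proba(X))
print(matches)
```

Each class has its own row of coefficients, so a coefficient only has meaning relative to the coefficients of the other classes.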


In [16]:
# Compute the odds ratios
import numpy as np
np.exp(model.coef_)

array([[ 0.63277499,  2.39304495,  0.09942128,  0.38264242],
       [ 1.45618911,  0.82311786,  0.84960892,  0.47104108],
       [ 1.08525779,  0.50767657, 11.83863389,  5.54814881]])

In [17]:
pd.DataFrame(np.exp(model.coef_), columns=X_train.columns, index = model.classes_)

,sepal length,sepal width,petal length,petal width
Iris-setosa,0.632775,2.393045,0.099421,0.382642
Iris-versicolor,1.456189,0.823118,0.849609,0.471041
Iris-virginica,1.085258,0.507677,11.838634,5.548149



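*<b> Holding the other variables constant, a one-unit increase in sepal width multiplies the unnormalized (softmax) weight of Iris-setosa by 2.393045. This is a change on the odds scale, not a 2.39-fold increase in the predicted probability. Because the model has one coefficient vector per class rather than a reference class, the odds of Iris-setosa relative to Iris-versicolor are multiplied by 2.393045 / 0.823118 ≈ 2.907. </b>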
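
For a class-versus-class reading of the coefficients, the ratio P(setosa) / P(other class) is multiplied by exp(beta_setosa - beta_other) for each one-unit increase in a feature, with the other features held fixed. The sketch below is a worked illustration under the softmax assumption: it copies the `model.coef_` values printed above, and `SEPAL_WIDTH` is a helper name introduced here for the column index.

```python
import numpy as np

# Coefficients copied from the model.coef_ output above
# rows: setosa, versicolor, virginica
# columns: sepal length, sepal width, petal length, petal width
coef = np.array([
    [-0.45764038,  0.87256659, -2.30838912, -0.96065436],
    [ 0.37582283, -0.19465588, -0.16297912, -0.75280996],
    [ 0.08181755, -0.67791071,  2.47136824,  1.71346432],
])
SEPAL_WIDTH = 1  # hypothetical helper: column index of sepal width

# Odds of setosa relative to each other class, per one-unit increase in sepal width
or_vs_versicolor = np.exp(coef[0, SEPAL_WIDTH] - coef[1, SEPAL_WIDTH])
or_vs_virginica = np.exp(coef[0, SEPAL_WIDTH] - coef[2, SEPAL_WIDTH])

print(round(float(or_vs_versicolor), 3))
print(round(float(or_vs_virginica), 3))
```

These pairwise ratios equal the quotients of the corresponding entries in the `np.exp(model.coef_)` table.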