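# Performance Evaluation
##### Prediction, regression: MSE, RMSE, MAE, MAPE
##### Classification: accuracy, precision, recall, F1-score, ROC AUC

### Accuracy: the fraction of all samples that the model classifies correctly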

In [15]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

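# Dummy classifier: learns nothing and always predicts the negative class (0/False)
class MyFakeClassifier(BaseEstimator):
    def fit(self,X,y):
        pass
    
    def predict(self,X):
        return np.zeros((len(X),1),dtype=bool)

# Turn the digits data into an imbalanced binary problem: 1 if the digit is 7, else 0
digits=load_digits()
y=(digits.target==7).astype(int)
X_train,X_test,y_train,y_test=train_test_split(digits.data,y,random_state=11)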

In [19]:
print('Test set label shape:',y_test.shape)
print('Distribution of labels 0 and 1 in the test set')
print(pd.Series(y_test).value_counts())

fakeclf=MyFakeClassifier()
fakeclf.fit(X_train,y_train)
fakepred=fakeclf.predict(X_test)
print(accuracy_score(y_test,fakepred))

Test set label shape: (450,)
Distribution of labels 0 and 1 in the test set
0    405
1     45
dtype: int64
0.9
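The 0.9 accuracy above is produced by a classifier that never predicts the positive class, which is why accuracy alone can mislead on imbalanced data. The following is a minimal sketch of that effect; it uses synthetic labels that only mirror the 405/45 split printed above (an assumption for illustration, not the actual digits test set).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels that only mirror the 405/45 class split shown above (assumption).
y_true = np.array([0] * 405 + [1] * 45)
# A "classifier" that always predicts the negative class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)   # share of correct predictions
rec = recall_score(y_true, y_pred)     # share of actual positives that were found
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print('accuracy:', acc)
print('recall:', rec)
print(cm)
```

Accuracy stays high because the negatives dominate, while recall shows that none of the 45 positives were detected.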


### Precision: TP/(TP+FP), the fraction of samples the model classified as positive that are actually positive
### Recall: TP/(TP+FN), the fraction of actually positive samples that the model classified as positive
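As a quick check of the two formulas, the sketch below computes precision and recall by hand from confusion-matrix counts and compares them with scikit-learn's own functions. The labels are toy values chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (not taken from the notebook's data).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)  # TP / (TP + FP)
recall_manual = tp / (tp + fn)     # TP / (TP + FN)

print('precision:', precision_manual, precision_score(y_true, y_pred))
print('recall:', recall_manual, recall_score(y_true, y_pred))
```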

In [20]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix

def get_clf_eval(y_test,pred):
    confusion=confusion_matrix(y_test,pred)
    accuracy=accuracy_score(y_test,pred)
    precision=precision_score(y_test,pred)
    recall=recall_score(y_test,pred)
    print('Confusion matrix')
    print(confusion)
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'.format(accuracy,precision,recall))

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

def fillna(df):
    # Assign the result back; chained inplace fillna on a column is deprecated in recent pandas
    df['Age']=df['Age'].fillna(df['Age'].mean())
    df['Cabin']=df['Cabin'].fillna('N')
    df['Embarked']=df['Embarked'].fillna('N')
    df['Fare']=df['Fare'].fillna(0)
    return df

def drop_features(df):
    df.drop(['Name','Ticket'],axis=1,inplace=True)
    return df

def format_features(df):
    df['Cabin']=df['Cabin'].str[:1]
    features=['Cabin','Sex','Embarked']
    for feature in features:
        le=LabelEncoder()
        le=le.fit(df[feature])
        df[feature]=le.transform(df[feature])
    return df

def transform_features(df):
    df=fillna(df)
    df=drop_features(df)
    df=format_features(df)
    return df


titanic_df=pd.read_csv('titanic_train.csv')
y_titanic_df=titanic_df['Survived']
X_titanic_df=titanic_df.drop(['Survived'],axis=1)
X_titanic_df=transform_features(X_titanic_df)

In [42]:
X_train,X_test,y_train,y_test=train_test_split(X_titanic_df,y_titanic_df,test_size=0.2,random_state=1)

lr_clf=LogisticRegression()
lr_clf.fit(X_train,y_train)
pred=lr_clf.predict(X_test)
get_clf_eval(y_test,pred)

Confusion matrix
[[92 14]
 [27 46]]
Accuracy: 0.7709, Precision: 0.7667, Recall: 0.6301


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


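- Recall is low relative to precision; the balance can be adjusted through the classification threshold.
- Lowering the threshold makes the model predict positive more often, so recall rises.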
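The effect of the threshold can be sketched on a small example. The probabilities and labels below are hypothetical values chosen for illustration; in this notebook the probabilities would come from `predict_proba`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and positive-class probabilities (illustration only).
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
proba = np.array([0.10, 0.30, 0.45, 0.48, 0.55, 0.62, 0.80, 0.95])

results = {}
for threshold in (0.6, 0.5, 0.4):
    # Predict positive when the probability exceeds the threshold.
    y_pred = (proba > threshold).astype(int)
    results[threshold] = (precision_score(y_true, y_pred),
                          recall_score(y_true, y_pred))
    print('threshold:', threshold,
          'precision: {0:.4f}, recall: {1:.4f}'.format(*results[threshold]))
```

On this toy data, recall never decreases as the threshold falls, while precision drops because more negatives are also labeled positive. That trade-off depends on the data, so real results can behave differently.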

In [45]:
from sklearn.preprocessing import Binarizer
X=[[1,-1,2],
  [2,0,0],
  [0,1.1,1.2]]

binarizer=Binarizer(threshold=1.1)
print(binarizer.fit_transform(X))

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]
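Note that the value 1.1 maps to 0 in the output above: `Binarizer` sets only values strictly greater than the threshold to 1. A minimal sketch of the equivalent NumPy comparison, using the same `X` as the cell above:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1, -1, 2],
              [2, 0, 0],
              [0, 1.1, 1.2]])

via_binarizer = Binarizer(threshold=1.1).fit_transform(X)
# Strict comparison: values equal to the threshold become 0.
via_numpy = (X > 1.1).astype(float)

print(np.array_equal(via_binarizer, via_numpy))
```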


### F1 score
- The harmonic mean of precision and recall; it is relatively high when the two are balanced rather than skewed toward one side.
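The sketch below shows why the harmonic mean rewards balance. The helper name `f1_from` is hypothetical, introduced only for this illustration. The last line uses the counts from the threshold-0.5 confusion matrix printed in this notebook (TP=46, FP=14, FN=27).

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (hypothetical helper for illustration)."""
    return 2 * precision * recall / (precision + recall)

# Both pairs have the same arithmetic mean (0.5), but very different F1.
balanced = f1_from(0.5, 0.5)
skewed = f1_from(0.9, 0.1)
print('balanced: {0:.4f}, skewed: {1:.4f}'.format(balanced, skewed))

# Counts from the notebook's confusion matrix at threshold 0.5: TP=46, FP=14, FN=27.
f1_notebook = f1_from(46 / (46 + 14), 46 / (46 + 27))
print('F1 from notebook counts: {0:.4f}'.format(f1_notebook))
```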

In [48]:
from sklearn.metrics import f1_score 

def get_eval_by_threshold(y_test , pred_proba_c1, thresholds):
    # Iterate over the values in the thresholds list and evaluate at each one.
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold=custom_threshold).fit(pred_proba_c1) 
        custom_predict = binarizer.transform(pred_proba_c1)
        print('Threshold:',custom_threshold)
        get_clf_eval(y_test , custom_predict)
        
def get_clf_eval(y_test , pred):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    # Add the F1 score
    f1 = f1_score(y_test,pred)
    print('Confusion matrix')
    print(confusion)
    # Print the F1 score as well
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1:{3:.4f}'.format(accuracy, precision, recall, f1))

thresholds = [0.4 , 0.45 , 0.50 , 0.55 , 0.60]
pred_proba = lr_clf.predict_proba(X_test)
get_eval_by_threshold(y_test, pred_proba[:,1].reshape(-1,1), thresholds)

Threshold: 0.4
Confusion matrix
[[92 14]
 [23 50]]
Accuracy: 0.7933, Precision: 0.7812, Recall: 0.6849, F1:0.7299
Threshold: 0.45
Confusion matrix
[[92 14]
 [25 48]]
Accuracy: 0.7821, Precision: 0.7742, Recall: 0.6575, F1:0.7111
Threshold: 0.5
Confusion matrix
[[92 14]
 [27 46]]
Accuracy: 0.7709, Precision: 0.7667, Recall: 0.6301, F1:0.6917
Threshold: 0.55
Confusion matrix
[[94 12]
 [30 43]]
Accuracy: 0.7654, Precision: 0.7818, Recall: 0.5890, F1:0.6719
Threshold: 0.6
Confusion matrix
[[96 10]
 [35 38]]
Accuracy: 0.7486, Precision: 0.7917, Recall: 0.5205, F1:0.6281
