# Deciding on a primary metric
## Confusion Matrix
![confusion_matrix](https://t1.daumcdn.net/cfile/tistory/99DC064C5BE056CE10?original)
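Predicted True -> Positive
True Positive (TP): the actual label is True and the model predicts True (correct)
False Positive (FP): the actual label is False and the model predicts True (incorrect)
Predicted False -> Negative
False Negative (FN): the actual label is True and the model predicts False (incorrect)
True Negative (TN): the actual label is False and the model predicts False (correct)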
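
The four cells can be counted directly from paired labels and predictions. The following is a minimal sketch in plain Python; the label lists and the helper name `confusion_counts` are hypothetical and used only for illustration. For comparison, scikit-learn's `confusion_matrix` arranges the same counts as `[[TN, FP], [FN, TP]]` for 0/1 labels.

```python
# Hypothetical labels: 1 = positive (True), 0 = negative (False).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actual True, predicted True
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actual False, predicted True
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual True, predicted False
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actual False, predicted False
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```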

## Precision
### := Of everything the model classified as True, the proportion that is actually True
![Precision](https://t1.daumcdn.net/cfile/tistory/99F66B345BE0596109)

## Recall
### := Of everything that is actually True, the proportion the model predicted as True
![Recall](https://t1.daumcdn.net/cfile/tistory/997188435BE05B0628)

## Accuracy
### := The proportion of all predictions that are correct (True predicted as True, False predicted as False)
![Accuracy](https://t1.daumcdn.net/cfile/tistory/99745F3F5BE0613D1A)

## F1 score
### := The harmonic mean of Precision and Recall
![F1 Score](https://t1.daumcdn.net/cfile/tistory/993482335BE0641515)
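
The four definitions above can be written out as arithmetic on the confusion-matrix counts. This is a small sketch in plain Python; the counts (30, 10, 20, 40) are made up for illustration, and the function names are hypothetical helpers rather than a library API.

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn, tn = 30, 10, 20, 40


def precision(tp, fp):
    # Share of predicted positives that are actually positive.
    return tp / (tp + fp)


def recall(tp, fn):
    # Share of actual positives that the model found.
    return tp / (tp + fn)


def accuracy(tp, fp, fn, tn):
    # Share of all predictions that are correct.
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


print("precision:", precision(tp, fp))
print("recall:   ", recall(tp, fn))
print("accuracy: ", accuracy(tp, fp, fn, tn))
print("f1:       ", round(f1_score(tp, fp, fn), 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, F1 stays low whenever either precision or recall is low.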

In [None]:
# A model predicting the presence of cancer as the positive class.
# -> This model should minimize the number of false negatives, so recall is a more appropriate metric.

# A classifier predicting the positive class of a computer program containing malware.
# -> To avoid installing malware, false negatives should be minimized, hence recall or F1-score are better metrics for this model.

# A model predicting if a customer is a high-value lead for a sales team with limited capacity.
# -> With limited capacity, the sales team needs the model to return the highest proportion of true positives among all predicted positives, minimizing wasted effort, so precision is the more appropriate metric.
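
To see why the choice of metric matters when false negatives are costly, consider a sketch with a made-up, heavily imbalanced test set (10 positives out of 1000 samples) and a degenerate model that always predicts the negative class. All numbers here are assumptions chosen for illustration.

```python
# Hypothetical imbalanced test set: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
# A degenerate "model" that predicts negative for every sample.
y_pred = [0] * 1000

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print("accuracy:", accuracy)  # looks excellent
print("recall:  ", recall)    # every positive case was missed
```

Accuracy alone would rate this model highly even though it misses every positive case, which is why recall is preferred in the cancer and malware scenarios.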

In [None]:
# Import the classifier and the evaluation helpers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# The support is the number of occurrences of each class in y_true.

### Predict probability

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# Predict probabilities
# logreg.predict_proba(X_test) returns one column of probabilities per class, ordered as in logreg.classes_.
# Column 1 is the positive class, here the probability of a diabetes diagnosis.
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])
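
Predicted probabilities become class labels only once a cutoff is chosen. The sketch below uses made-up probabilities and a hypothetical helper `apply_threshold` to show how lowering the cutoff turns more samples into predicted positives; a binary classifier's default class predictions correspond roughly to a 0.5 cutoff.

```python
# Hypothetical positive-class probabilities, standing in for y_pred_probs.
probs = [0.05, 0.30, 0.45, 0.55, 0.70, 0.95]


def apply_threshold(probs, threshold=0.5):
    """Label a sample positive (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


default_labels = apply_threshold(probs)        # 0.5 cutoff
lenient_labels = apply_threshold(probs, 0.3)   # lower cutoff, more positives

print(default_labels)
print(lenient_labels)
```

A lower threshold trades precision for recall, which is the trade-off the ROC curve summarizes across all thresholds.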

### ROC curve

In [None]:
# Import roc_curve and the plotting library
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--') # Randomly guessing the class of each observation.

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
# The ROC curve is above the dotted line, so the model performs better than randomly guessing the class of each observation.
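
Each point on a ROC curve is the (false positive rate, true positive rate) pair obtained at one threshold. The following plain-Python sketch computes a few such points by hand; the labels, scores, and the helper `roc_point` are hypothetical and only illustrate the idea behind `roc_curve`, not its implementation.

```python
# Hypothetical true labels and positive-class scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]


def roc_point(y_true, scores, threshold):
    """Return (fpr, tpr) when samples with score >= threshold are labelled positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos


# Sweep the threshold from strict to lenient.
points = [roc_point(y_true, scores, thr) for thr in [0.9, 0.8, 0.4, 0.35, 0.1]]
for fpr_value, tpr_value in points:
    print(f"FPR={fpr_value:.2f}  TPR={tpr_value:.2f}")
```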

In [None]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report

# Calculate roc_auc_score
# roc_auc_score(y_true, y_score)
# y_score :=
# In the binary case, the scores of the positive class: probability estimates from predict_proba(...)[:, 1] or non-thresholded decision values from decision_function.
# In the multilabel case, an array of shape (n_samples, n_classes).
print(roc_auc_score(y_test, y_pred_probs))

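# Calculate the confusion matrix
# (y_pred must hold the class predictions of the model being evaluated, e.g. from its predict method)
print(confusion_matrix(y_test, y_pred))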

# Calculate the classification report
print(classification_report(y_test, y_pred))
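
One way to read the ROC AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that pairwise quantity in plain Python on made-up data; `pairwise_auc` is a hypothetical helper, not a scikit-learn function.

```python
# Hypothetical true labels and positive-class scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive is ranked higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


auc = pairwise_auc(y_true, scores)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.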

## Hyperparameter tuning with GridSearchCV

In [None]:
# Import numpy and GridSearchCV
import numpy as np
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(start=0.00001, stop=1, num=20)}

# Instantiate lasso_cv
# (lasso and kf are assumed to be defined earlier, e.g. a Lasso estimator and a KFold splitter)
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))
# Note: the best hyperparameters found by the search do not guarantee a high-performing model; the tuned score can still be low.
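
The cell above relies on objects defined elsewhere in the notebook. As a self-contained sketch of the same grid search, the version below builds everything it needs; the synthetic data from `make_regression`, the 5-fold `KFold` settings, and `max_iter=10000` are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic regression data (an assumption; the notebook uses its own dataset).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Cross-validation splitter and the grid of alpha values to try.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Every alpha in the grid is evaluated with 5-fold cross-validation.
lasso_cv = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=kf)
lasso_cv.fit(X_train, y_train)

print("Best alpha:", lasso_cv.best_params_["alpha"])
print("Best cross-validated R^2:", lasso_cv.best_score_)
print("R^2 on held-out test data:", lasso_cv.score(X_test, y_test))
```

With 20 candidate values and 5 folds, the search fits 100 models plus one final refit on the full training set, so the cost grows quickly as the grid grows.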

## Hyperparameter tuning with RandomizedSearchCV
RandomizedSearchCV tests a fixed number of hyperparameter settings (n_iter, 10 by default) sampled from the specified lists or probability distributions.

In [None]:
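# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter space
# Note: the "l1" penalty is not supported by LogisticRegression's default lbfgs solver;
# candidates that sample it fail to fit unless logreg uses a solver such as "liblinear" or "saga".
params = {"penalty": ["l1", "l2"],
          "tol": np.linspace(0.0001, 1.0, 50),
          "C": np.linspace(0.1, 1.0, 50),
          "class_weight": ["balanced", {0: 0.8, 1: 0.2}]}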

# Instantiate the RandomizedSearchCV object
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))
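
For a version that runs on its own, the sketch below repeats the randomized search on synthetic data. The dataset from `make_classification`, the reduced parameter space (no `penalty`, to stay independent of the solver), `max_iter=1000`, and the fixed `random_state` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

# Synthetic binary classification data (an assumption; not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Candidate values to sample from; 50 * 50 * 2 = 5000 possible combinations.
params = {
    "tol": np.linspace(0.0001, 1.0, 50),
    "C": np.linspace(0.1, 1.0, 50),
    "class_weight": ["balanced", None],
}

# Only n_iter of the possible combinations are sampled and evaluated.
logreg_cv = RandomizedSearchCV(
    LogisticRegression(max_iter=1000), params, n_iter=10, cv=kf, random_state=42
)
logreg_cv.fit(X_train, y_train)

print("Sampled settings:", len(logreg_cv.cv_results_["params"]))
print("Best parameters:", logreg_cv.best_params_)
print("Best cross-validated accuracy:", logreg_cv.best_score_)
```

Sampling 10 settings out of 5000 keeps the cost fixed regardless of how large the parameter space is, at the price of possibly missing the best combination.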