### Dummy Classifier

*Source:* 
- https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html?highlight=dummy+classifier#
- Exercise 8


**Without SMOTE**

In [None]:
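# Importing the baseline model and the cross-validation helper
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Defining a Dummy Classifier model
dummy_clf = DummyClassifier()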

# Fitting the model to the training data
dummy_clf.fit(X_train, y_train)

# Predicting y, the binary class label for stroke, for the test data.
y_dummy_clf_pred = dummy_clf.predict(X_test)

# 10-fold cross-validation on the training data
accuracies = cross_val_score(estimator=dummy_clf, X=X_train, y=y_train, cv=10)

# Reporting accuracy of the dummy classifier on the training and test set.
print("---------- WITHOUT SMOTE ----------")
print("Accuracy on training set: {:.3f}".format(dummy_clf.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(dummy_clf.score(X_test, y_test)))
# Printing the mean and standard deviation of the cross-validation scores
print("%0.2f accuracy with a standard deviation of %0.2f" % (accuracies.mean(), accuracies.std()))

**Interpretation:** The dummy classifier reaches a very high accuracy. This is likely caused by the highly imbalanced dataset, in which only 5% of all instances are stroke cases (y=1). The dummy classifier therefore shows high accuracy even though it always predicts "No Stroke" (y=0).
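To make this effect concrete, here is a minimal sketch on hypothetical toy labels (95 negatives and 5 positives, not the stroke data). The names `X_toy`, `y_toy`, and `baseline` are illustrative assumptions, and `strategy="most_frequent"` is set explicitly so the behaviour does not depend on the library default. It compares accuracy with recall and balanced accuracy, which are less flattering to a constant prediction.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical toy labels: 95 "no stroke" (0) and 5 "stroke" (1)
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
y_toy_pred = baseline.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, y_toy_pred)
toy_recall = recall_score(y_toy, y_toy_pred)  # recall for the positive class (1)
toy_balanced_accuracy = balanced_accuracy_score(y_toy, y_toy_pred)

print(f"Accuracy:          {toy_accuracy:.2f}")
print(f"Recall (stroke):   {toy_recall:.2f}")
print(f"Balanced accuracy: {toy_balanced_accuracy:.2f}")
```

On these toy labels, accuracy equals the majority share (0.95), while recall for the positive class is 0 and balanced accuracy is 0.5.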

In [None]:
# Plotting the confusion matrix from the actual (y_test) and predicted (y_dummy_clf_pred) labels
plt.figure(figsize=(4,3))
ax = plt.subplot()
sns.heatmap(
    confusion_matrix(
        y_test,
        y_dummy_clf_pred
    ),
    annot=True,
    fmt = "d",
    linewidths=2
)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("CONFUSION MATRIX: Dummy Classifier - Baseline Model (wo. SMOTE)",fontsize=14)
ax.xaxis.set_ticklabels(['No stroke', 'Stroke'])
ax.yaxis.set_ticklabels(['No stroke', 'Stroke'])
plt.show()

**Interpretation:** The confusion matrix shows that the dummy classifier predicts y=0 every time and still reaches about 95% accuracy because of the imbalanced target variable. Next, we inspect how the dummy classifier performs when it is trained on the resampled, balanced dataset.
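As a small aid for reading such a matrix, the sketch below unpacks a binary confusion matrix into its four cells. The labels `y_true_toy` and `y_pred_toy` are hypothetical toy values (not the stroke data), chosen to mimic a constant "No stroke" prediction.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels (not the stroke data): 19 negatives, 1 positive
y_true_toy = [0] * 19 + [1]
# A constant "No stroke" prediction, as a majority-class baseline would produce
y_pred_toy = [0] * 20

# labels=[0, 1] fixes the order: rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The second column (predicted "Stroke") is empty in this toy case, so every positive instance ends up as a false negative.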

**With SMOTE**

In [None]:
# ---------- WITH SMOTE ----------
# Defining Dummy Classifier model with the resampled dataset
dummy_clf_res = DummyClassifier()

# Fitting the model to the training data
dummy_clf_res.fit(X_train_res, y_train_res)

# Predicting y, the binary class label for stroke, for the test data.
y_dummy_clf_pred_res = dummy_clf_res.predict(X_test)

# 10-fold cross-validation on the resampled training data
accuracies_res = cross_val_score(estimator=dummy_clf_res, X=X_train_res, y=y_train_res, cv=10)

# Reporting accuracy of the dummy classifier on the training and test set.
print("---------- WITH SMOTE ----------")
print("Accuracy on training set: {:.3f}".format(dummy_clf_res.score(X_train_res, y_train_res)))
print("Accuracy on test set: {:.3f}".format(dummy_clf_res.score(X_test, y_test)))
# Printing the mean and standard deviation of the cross-validation scores
print("%0.2f accuracy with a standard deviation of %0.2f" % (accuracies_res.mean(), accuracies_res.std()))

In [None]:
# Plotting the confusion matrix from the actual (y_test) and predicted (y_dummy_clf_pred_res) labels
plt.figure(figsize=(4,3))
ax = plt.subplot()
sns.heatmap(
    confusion_matrix(
        y_test,
        y_dummy_clf_pred_res
    ),
    annot=True,
    fmt = "d",
    linewidths=2
)
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("CONFUSION MATRIX: Dummy Classifier - Baseline Model (w. SMOTE)",fontsize=14)
ax.xaxis.set_ticklabels(['No stroke', 'Stroke'])
ax.yaxis.set_ticklabels(['No stroke', 'Stroke'])
plt.show()

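**Interpretation:** The score on the balanced dataset looks more realistic. With both classes equally represented in the training data, always predicting a single class no longer yields a high training accuracy.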
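The sketch below illustrates this on hypothetical toy labels. A hand-made 50/50 label array stands in for the SMOTE output (no resampling library is called), and an imbalanced array stands in for the untouched test set. All names (`y_train_bal`, `y_test_imb`, `baseline_bal`) and `strategy="most_frequent"` are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for a resampled training set: 50 negatives, 50 positives
y_train_bal = np.array([0] * 50 + [1] * 50)
X_train_bal = np.zeros((len(y_train_bal), 1))  # features are ignored by DummyClassifier

# Hypothetical stand-in for a test set that was not resampled: 95 negatives, 5 positives
y_test_imb = np.array([0] * 95 + [1] * 5)
X_test_imb = np.zeros((len(y_test_imb), 1))

baseline_bal = DummyClassifier(strategy="most_frequent")
baseline_bal.fit(X_train_bal, y_train_bal)

# Accuracy on the balanced training labels and on the imbalanced test labels
train_acc = baseline_bal.score(X_train_bal, y_train_bal)
test_acc = baseline_bal.score(X_test_imb, y_test_imb)
predicted_classes = np.unique(baseline_bal.predict(X_test_imb))

print(f"Training accuracy (balanced): {train_acc:.2f}")
print(f"Test accuracy (imbalanced):   {test_acc:.2f}")
print(f"Classes predicted on test:    {predicted_classes}")
```

Whichever class the constant prediction lands on, training accuracy on the balanced toy set is 0.5, while the test score still depends on the class mix of the test set. In the cells above, `X_test` and `y_test` are likewise not resampled, so the training and test accuracies there can differ noticeably.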