# Intro to Data Science

# Week 8: Classification Models


# Table of Contents
1. [What is Supervised Learning?](#What-is-Supervised-Learning-?)
2. [Supervised Learning Analogy: Sorting Fruits](#Supervised-Learning-Analogy:-Sorting-Fruits)
3. [Supervised Learning Process](#Supervised-Learning-Process:-two-steps)
4. [What are Classification Models?](#What-are-Classification-Models-?)
5. [Dataset Overview](#Dataset-Overview)
6. [Data Exploration and Visualization](#Data-Exploration-and-Visualization)
7. [Training Classification Models](#Training-Classification-Models)
8. [Evaluation Metrics](#Evaluation-Metrics)
9. [Decision Boundary Visualization](#Decision-Boundary-Visualization)
10. [Model Comparison](#Model-Comparison)
11. [Practice Exercises](#Practice-Exercises)
12. [Summary](#Summary)


This module introduces classification models in machine learning, focusing on Logistic Regression, k-Nearest Neighbors (k-NN), Decision Trees, and Naive Bayes.
It covers data preprocessing, training models with Python's Scikit-Learn, and evaluating model performance using various metrics. 
It also includes hands-on examples and visualization of decision boundaries using the Breast Cancer dataset.

### What is Supervised Learning ?

Supervised learning is a type of machine learning where the model is trained on a labeled dataset. In this context, a labeled dataset is one where each example consists of an input vector and a corresponding output value (the label). 

The goal of a supervised learning algorithm is to learn a function or mapping from inputs to outputs, based on the provided input-output pairs. Once this function is learned, the model can be used to predict the output for new, unseen input data.

Supervised learning can be further divided into two types of problems:

1. **Classification**: The output variable is a category or a class. Examples include email spam detection (spam or not spam), or disease diagnosis (disease present or not present).

2. **Regression**: The output variable is a real or continuous value. Examples include predicting house prices based on features like location, size, and age, or predicting a person's weight based on their height.


### Supervised Learning Analogy: Sorting Fruits

Imagine you're showing a child how to sort different types of fruits into separate baskets. You have apples, oranges, and bananas. You take each fruit, one by one, and tell the child, "This is an apple, it goes in the apple basket," "This is an orange, it goes in the orange basket," and so on. You do this until you've gone through all the fruits. This is your "training" phase - you're showing the child examples of correct sorting.

After that, you give the child a new basket of fruits to sort on their own. The child uses the knowledge they gained from your training to sort these new fruits. They've never seen these specific fruits before, but they can recognize them based on your instructions and sort them into the correct baskets. This is like the "prediction" phase - the child is predicting where each fruit goes based on their training.

In this analogy, the child is the machine learning model, the training fruits are your labeled dataset (with the fruit type being the label), and the new basket of fruits represents new, unseen data. 

That's the essence of supervised learning - learning from labeled examples to make predictions about new data.

### Supervised Learning Process: two steps

- **Learning (training):** Learn a model from the training data.
- **Testing:** Apply the model to unseen test data to assess its performance.

Performance on the test data is measured with metrics such as accuracy, precision, and recall, which are built from counts of true and false positives and negatives.

![image.png](attachment:023fa261-d1b9-4c58-a042-a5b121dd9e4e.png)

### What are Classification Models ?


- Classification models are a type of supervised learning algorithm.
- They predict discrete categorical labels based on input features.
- These models are trained on labeled data (i.e., data with known output labels).
- The goal is to accurately map input features to output labels.
- Examples of classification models include Logistic Regression and k-Nearest Neighbors.
- Once trained, these models can classify unseen instances.
- Classification models have wide applications in fields like healthcare, finance, and marketing.

### What are some use cases of Classification Models?

1. **Healthcare**: Predicting whether a tumor is malignant or benign based on medical imagery and patient data.

2. **Finance**: Determining if a transaction is fraudulent or not based on transaction details.

3. **Marketing**: Predicting customer churn (i.e., whether a customer will stop doing business with a company) based on customer behavior data.

4. **Email Systems**: Classifying emails as "spam" or "not spam" based on email content.

5. **Social Media**: Detecting hate speech, offensive language, or bullying in social media posts.

6. **Natural Language Processing**: Sentiment analysis, i.e., classifying a text (such as a product review or a tweet) as expressing positive, negative, or neutral sentiment.

7. **Autonomous Vehicles**: Identifying objects around the vehicle (like cars, pedestrians, bicycles, etc.) to navigate safely.

8. **Face Recognition**: Identifying if the person in the image matches the identity on record.

9. **Loan Approval**: Determining whether a bank should approve a loan based on the customer's credit history, income, loan amount, etc.


### Steps for defining Classification Model Tasks

The process of building and using a classification model involves several key steps:

1. **Define the Problem**: Clearly define the problem you're trying to solve. It's important to ensure that it's a classification problem, i.e., the target variable consists of discrete classes or categories.

2. **Gather Data**: Collect a labeled dataset relevant to the problem. Each instance in the dataset consists of a set of features and a target class. 

3. **Preprocess Data**: Clean and transform the data to a form that can be easily understood by the model. This might involve handling missing values, encoding categorical variables, or scaling numerical features.

4. **Split Data**: Divide the dataset into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate the model's performance on unseen data.

5. **Choose a Model**: Select a suitable classification model for your problem. This choice might be based on the nature of your data, the complexity of the problem, or other considerations.

6. **Train the Model**: Use the training set to train your chosen model. This involves showing the model the features and corresponding classes, allowing the model to learn the relationship between them.

7. **Evaluate the Model**: Test the trained model on the testing set to evaluate its performance. This will give you an idea of how well the model will perform when making predictions on new, unseen data.

8. **Tune the Model**: Based on the evaluation, you might decide to adjust the model's parameters to improve its performance. This step might involve using techniques like cross-validation or grid search.

9. **Predict**: Once you're satisfied with the model's performance, you can use it to make predictions on new data. You provide the features of an instance, and the model will predict which class the instance most likely belongs to.

10. **Monitor and Update the Model**: Over time, as you collect more data or as the data distribution changes, you might need to retrain or update your model to ensure that it maintains a good performance.

Remember that building a classification model is often an iterative process. You might need to go back and forth between these steps until you have a model that performs well for your specific problem.
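
The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, assuming the built-in breast cancer dataset and a logistic regression model. The candidate values for `C` are arbitrary choices for demonstration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a binary classification problem with labeled data
X, y = load_breast_cancer(return_X_y=True)

# Step 4: hold out a test set (stratify keeps class proportions similar)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model bundled into one pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Steps 6 and 8: train and tune with 5-fold cross-validated grid search
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}  # illustrative values
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate the tuned model on the held-out test set
test_accuracy = search.score(X_test, y_test)
print("Best C:", search.best_params_["logisticregression__C"])
print(f"Test accuracy: {test_accuracy:.3f}")

# Step 9: predict the class of a single instance
first_prediction = search.predict(X_test[:1])[0]
print("Predicted class for first test instance:", first_prediction)
```

Bundling the scaler and the model in a pipeline means the scaler is refit on the training portion of each cross-validation fold, so the validation folds never influence the preprocessing.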

### Types of Classification Models

Classification models can be broadly categorized by the algorithm they employ. Here are some of the common types of classification models:

- **Logistic Regression**
- **K-Nearest Neighbors**
- **Decision Trees**
- **Naive Bayes**

## Logistic Regression

Logistic regression is a statistical method for predicting the outcome of a categorical dependent variable from a set of predictor (independent) variables. In machine learning, it is most commonly used for binary classification.

The dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts the probability of 'success' as a function of the input variables.

Unlike linear regression, logistic regression can directly predict probabilities (values that are restricted to the (0,1) interval); furthermore, those probabilities are well-calibrated when compared to the probabilities predicted by some other classifiers, such as Naive Bayes.
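
To make the probability interpretation concrete, the sketch below applies the logistic (sigmoid) function, 1 / (1 + e^(-z)), to a linear combination of one input. The weight `w`, the bias `b`, and the "hours studied" inputs are made-up numbers for illustration only, not values fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of an already-trained one-feature model
w, b = 1.5, -4.0

hours = np.array([0.0, 2.0, 4.0, 6.0])   # made-up inputs
probabilities = sigmoid(w * hours + b)    # P(y = 1 | hours)

# Turn probabilities into class labels with the usual 0.5 threshold
predicted_class = (probabilities >= 0.5).astype(int)

for h, p, c in zip(hours, probabilities, predicted_class):
    print(f"hours={h:.0f}  P(success)={p:.3f}  predicted class={c}")
```

The 0.5 threshold is a convention; it can be moved when false positives and false negatives have different costs.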


<!-- An example graph of a logistic regression curve fitted to data. The curve shows the probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See § Example for worked details.

### Key Concepts:

- **Logistic Function**: Logistic regression uses the logistic function, also known as the sigmoid function, which takes any real-valued number and maps it into a value between 0 and 1. The function is S-shaped and is defined as:

  f(x) = 1 / (1 + e^-x)

- **Decision Boundary**: A decision boundary is a surface that separates the data points into different classes. This could be a straight line (for two features) or a multi-dimensional surface for multiple features.

- **Maximum Likelihood Estimation (MLE)**: Logistic regression uses MLE to estimate the weights/parameters. It chooses the parameters that maximize the likelihood of observing the sample values rather than those that minimize the sum of squared errors (like in ordinary least squares used in linear regression). -->

### Assumptions:

1. Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal.

2. Logistic regression requires the observations to be independent of each other.

3. Logistic regression requires there to be little or no multicollinearity among the independent variables.

4. Logistic regression assumes linearity of independent variables and log odds.

5. Logistic regression typically requires a large sample size.

### Implementing Logistic Regression using Scikit-learn

#### Dataset: Breast Cancer Wisconsin (Diagnostic) Data Set 

The Breast Cancer Wisconsin (Diagnostic) Data Set, also known as the 'breast cancer dataset', is a real-valued multivariate dataset that is popularly used in machine learning for classification problems. The dataset is publicly available and is part of the sklearn.datasets module in the Scikit-Learn library.

The data was originally collected and made available by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital at Madison, Wisconsin, USA. The dataset was created in the mid-1990s and has been used in numerous academic papers and machine learning tutorials since then.

The dataset contains measurements from a digitized image of a fine needle aspirate (FNA) of a breast mass. It includes information about the characteristics of the cell nuclei present in the image. The dataset has 569 instances, each with 30 features, including things like the mean radius of the cells, mean texture, mean perimeter, mean area, mean smoothness, etc.

The goal in analyzing this dataset is to predict whether the tumor is malignant (harmful) or benign (not harmful) based on these features. This is a binary classification problem, as there are only two possible outcomes. 

Remember that this dataset, like many others used in machine learning, has been preprocessed to be easy to work with. Real-world data often requires a lot more cleaning and preprocessing before it can be used to train a machine learning model.

#### Import scikit-learn
Scikit-learn, often referred to as sklearn, is an open-source Python library that provides a range of supervised and unsupervised learning algorithms. It is built on NumPy and SciPy. Its tools are robust and efficient, which has made it a popular choice among data scientists and machine learning practitioners.

In [None]:
!pip install scikit-learn

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#### Load the data

We load the dataset using the sklearn module. This dataset can also be downloaded and imported from UCI data repository: https://archive.ics.uci.edu/dataset/14/breast+cancer

In [None]:
import pandas as pd

cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

df['target'] = cancer.target

In [None]:
df

### Exploratory Data Analysis

In [None]:
#installing matplotlib
!pip install matplotlib

In [None]:
#visualize the output class
import matplotlib.pyplot as plt

df['target'].value_counts().plot(kind='bar')
plt.show()

### Data Pre-processing

#### Split the dataframe into features (independent variables) and target (dependent variable)

In [None]:
X = df.drop('target', axis=1)
y = df['target']

#### Standardize the input features

Standardization is a preprocessing technique used in machine learning and statistics to transform all features/variables to the same scale. It does this by subtracting the mean (average) and then dividing by the standard deviation. The result of this process is that the features will have a mean of 0 and a standard deviation of 1.

Here are some points to keep in mind about standardization:

- Standardization shifts and rescales each feature but does not change the shape of its distribution. A skewed feature stays skewed; it simply ends up with mean 0 and standard deviation 1.
- Standardization is important for some machine learning algorithms that are sensitive to the scale of the features. These include algorithms that use a form of distance measure (like k-nearest neighbors and support vector machines) and algorithms that use gradient descent for optimization.
- By standardizing features, we prevent features with large numeric ranges from dominating those with small ranges, which often improves the stability and performance of the model.
- Standardization is different from normalization, which scales features between a specific minimum and maximum value, often 0 and 1.

In scikit-learn, `StandardScaler` performs standardization and `MinMaxScaler` performs min-max normalization; both are available in `sklearn.preprocessing`. Remember, however, that not all algorithms require feature scaling. Tree-based algorithms like Decision Trees and Random Forest, for example, are not affected by the scale of the features.
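
As a small sketch of the arithmetic, the example below standardizes a made-up two-column array by hand and checks that `StandardScaler` gives the same result. `MinMaxScaler` is shown alongside for contrast. The sample values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two made-up features on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0],
                 [4.0, 700.0]])

# Standardization by hand: z = (x - mean) / std, column by column.
# np.std uses the population standard deviation (ddof=0), as StandardScaler does.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

standardized = StandardScaler().fit_transform(data)
normalized = MinMaxScaler().fit_transform(data)   # rescales each column to [0, 1]

print("Matches manual computation:", np.allclose(manual, standardized))
print("Column means after standardization:", standardized.mean(axis=0))
print("Column stds after standardization: ", standardized.std(axis=0))
print("Column ranges after min-max scaling:", normalized.min(axis=0), normalized.max(axis=0))
```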

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(X)

In [None]:
df_scaled

#### Split the data into training and testing sets

Here we split the dataset into training and test sets using an 80/20 split. Splits of 80/20 or 75/25 are typical.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, test_size=0.2, random_state=42)
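
One caveat with the cells above: the scaler was fitted on the full dataset before splitting, so statistics from the test rows influence the scaling applied to the training rows. On this dataset the effect is probably small, but a common alternative is to split first and fit the scaler on the training set only. A minimal sketch of that variant follows; the `_alt` variable names are introduced here so they do not overwrite the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split first, so the test rows stay completely unseen
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Learn the mean and standard deviation from the training rows only
scaler_alt = StandardScaler()
X_train_alt_scaled = scaler_alt.fit_transform(X_train_alt)

# Reuse the training statistics on the test rows (transform, not fit_transform)
X_test_alt_scaled = scaler_alt.transform(X_test_alt)

print("Training set shape:", X_train_alt_scaled.shape)
print("Test set shape:    ", X_test_alt_scaled.shape)
```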

#### Create and train the model

In [None]:
model_logistic = LogisticRegression()
model_logistic.fit(X_train, y_train)

#### Make predictions

In [None]:
predictions_logistic = model_logistic.predict(X_test)

#### Evaluate the model

In [None]:
accuracy = metrics.accuracy_score(y_test, predictions_logistic)
print(f'Accuracy: {accuracy*100:.2f}%')

### K-Nearest Neighbors

The K-Nearest Neighbors (KNN) algorithm is a type of instance-based learning method used for both regression and classification problems. The "K" in KNN is a parameter that refers to the number of nearest neighbors to include in the majority voting process.

**Key Concepts**

* *Instance-Based*: KNN is an instance-based learning algorithm, meaning it does not learn a model. Instead, it memorizes the instances from the training data and uses these instances to make predictions.

* *Distance Metric*: KNN calculates the distance between the query instance and all the instances in the training data to find the K-nearest neighbors. Common distance metrics used are Euclidean, Manhattan, and Minkowski.

* *Majority Voting*: For a classification problem, once the K-nearest neighbors are found, KNN algorithm uses a majority voting method to make the final prediction. The class that has the majority vote will be the prediction for the query instance.
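
The three concepts above fit in a few lines of code. The following is a from-scratch sketch for illustration only, assuming Euclidean distance and a tiny made-up dataset. The function name `knn_predict` is hypothetical, and this is not how scikit-learn implements KNN internally.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Distance metric: Euclidean distance from the query to every stored instance
    distances = np.linalg.norm(X_train - query, axis=1)
    # Instance-based: no model is fitted, we just look up the k closest instances
    nearest = np.argsort(distances)[:k]
    # Majority voting among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up 2-D points forming two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([2.0, 2.0])))  # query near the first group
print(knn_predict(X_toy, y_toy, np.array([6.0, 7.0])))  # query near the second group
```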

![image.png](attachment:7da05240-4938-4863-bb18-22b16bef35cd.png)

source: https://datascientest.com/en/knn-what-is-the-knn-algorithm

The most important parameter in KNN is the number of neighbors (K). Too small a value, like 1 or 2, can lead to predictions that are highly sensitive to noise in the data. Too large a value can lead to predictions that are overly generalized. 

Cross-validation can be used to find the optimal K value. The goal is to find a balance between overfitting and underfitting. In practice, starting with K=5 is common.
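
A minimal sketch of choosing K by cross-validation is shown below. It assumes the breast cancer dataset, 5-fold cross-validation, and an arbitrary list of candidate K values picked for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_breast_cancer(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11, 15]  # illustrative candidates
mean_scores = {}

for k in candidate_ks:
    # Scaling lives inside the pipeline so it is refit on each training fold
    knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(knn_pipeline, X_cv, y_cv, cv=5)
    mean_scores[k] = scores.mean()
    print(f"K={k:2d}  mean CV accuracy={mean_scores[k]:.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best K by mean CV accuracy:", best_k)
```

Differences between neighboring K values are often small, so a nearby value with a similar score can be a reasonable choice as well.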

**Pros and Cons of KNN**

*Pros*:
- Simple to understand and implement.
- No assumptions about the data.
- Can be used for both classification and regression problems.

*Cons*:
- Computationally intensive, particularly with a large number of predictors and observations.
- Less effective on high-dimensional data (datasets with many columns/features), where distances become less informative.
- Sensitive to irrelevant features and the scale of the data.

### Implementing K-Nearest Neighbors in Python using scikit-learn

Using the breast cancer dataset, let's train a model using K-NN.

In [None]:
# Load the data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

df['target'] = cancer.target

In [None]:
df

In [None]:
#visualize the output class
import matplotlib.pyplot as plt

df['target'].value_counts().plot(kind='bar')
plt.show()

In [None]:
#Split the dataset into features (independent variables) and target (dependent variables)
X = df.drop('target', axis=1)
y = df['target']

In [None]:
#Standardize the features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(X)

In [None]:
# Split the standardized features into training and testing sets
# (KNN is distance-based, so it should be trained on the scaled data)
X_train, X_test, y_train, y_test = train_test_split(df_scaled, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create a KNN model
model_knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
from sklearn.metrics import accuracy_score


# Train the model
model_knn.fit(X_train, y_train)

# Make predictions on the test set
predictions_knn = model_knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions_knn)

print("Model accuracy is: ", accuracy)

### Decision Trees

A decision tree is a supervised machine learning algorithm used for both classification and regression problems. It works with both categorical and continuous input and output variables. The technique splits the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

It is represented as a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

![image.png](attachment:8a785ba6-7837-4b85-a5d0-aa9c7527df8b.png)


source: https://www.datacamp.com/tutorial/decision-tree-classification-python


**Key Concepts**

* *Root Node*: It represents the entire population or sample, and it further gets divided into two or more homogeneous sets.

* *Splitting*: It is a process of dividing a node into two or more sub-nodes.

* *Decision Node*: When a sub-node splits into further sub-nodes, then it is called the decision node.

* *Leaf / Terminal Node*: Nodes that do not split are called Leaf or Terminal nodes.
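
To see these node types in an actual tree, the sketch below fits a deliberately shallow tree on the breast cancer dataset and prints its rules as text. Limiting the tree to `max_depth=2` is an assumption made here purely to keep the printout short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data_tree = load_breast_cancer()

# A very shallow tree: one root split, then at most one more level of splits
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
shallow_tree.fit(data_tree.data, data_tree.target)

# The first condition printed is the root node, nested conditions are
# decision nodes, and lines containing "class:" are leaf nodes
rules = export_text(shallow_tree, feature_names=list(data_tree.feature_names))
print(rules)

print("Tree depth:      ", shallow_tree.get_depth())
print("Number of leaves:", shallow_tree.get_n_leaves())
```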


**Pros and Cons of Decision Trees**

*Pros*:
- Simple to understand and interpret, and the decision-making process is transparent.
- Can handle both numerical and categorical data.
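- Requires little data preprocessing: feature scaling is not needed. (Note that scikit-learn's `DecisionTreeClassifier` still expects categorical features to be encoded as numbers.)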

*Cons*:
- A small change in data can cause a large change in the structure of the decision tree causing instability.
- Decision trees can easily overfit or underfit the dataset if not properly tuned.

### Implementing Decision Trees in Python using scikit-learn

Using the breast cancer dataset, let's implement a decision tree model in Python using scikit-learn.

In [None]:
# Load the data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

df['target'] = cancer.target

In [None]:
df

In [None]:
#visualize the output class
import matplotlib.pyplot as plt

df['target'].value_counts().plot(kind='bar')
plt.show()

In [None]:
#Split the dataset into features (independent variables) and target (dependent variables)
X = df.drop('target', axis=1)
y = df['target']

In [None]:
# #Standardize the features
# scaler = StandardScaler()
# df_scaled = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create a Decision Tree model

from sklearn.tree import DecisionTreeClassifier

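# random_state fixes the tie-breaking between equally good splits, so results are reproducible
model_decison = DecisionTreeClassifier(random_state=42)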

In [None]:
# Train the model
model_decison.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
predictions_decision = model_decison.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions_decision)

print("Model accuracy is: ", accuracy)

## Naive Bayes Classifier

Naive Bayes is a probabilistic machine learning algorithm based on the Bayes Theorem, used for classification tasks. The "naive" assumption is that each feature is independent of the other features. While this is rarely the case in the real world, the algorithm can still be very effective for classification tasks.

**Bayes Theorem**

The Bayes Theorem describes the probability of an event based on prior knowledge of conditions related to the event. Mathematically, it is expressed as:

P(A|B) = [P(B|A) * P(A)] / P(B)

Where:
- P(A|B) is the probability of event A given event B is true
- P(B|A) is the probability of event B given event A is true
- P(A) and P(B) are the probabilities of events A and B respectively
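
A short worked example may help. The numbers below are hypothetical: a condition with 1% prevalence, and a test that detects it 90% of the time but also gives a positive result for 5% of healthy people. Here A is "has the condition" and B is "tests positive".

```python
# Hypothetical inputs, chosen only to illustrate the formula
p_disease = 0.01             # P(A): prior probability of the condition
p_pos_given_disease = 0.90   # P(B|A): positive test given the condition
p_pos_given_healthy = 0.05   # positive test given no condition

# P(B): overall probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive test)           = {p_pos:.4f}")
print(f"P(condition | positive)    = {p_disease_given_pos:.3f}")
```

With these inputs the posterior is about 0.154: even after a positive test, the condition remains unlikely because the prior P(A) is so small.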

**Working of Naive Bayes**

In the context of classification, the idea is to find the probability of a label given some observed features, which we can express as P(L|features). Using Bayes theorem, we can express this in terms of quantities we can compute more directly:

P(L|features) = [P(features|L) * P(L)] / P(features)

If we are trying to decide between two labels—let's call them L1 and L2—then one way to make this decision is to compute the ratio of the posterior probabilities for each label:

P(L1|features) / P(L2|features) = [P(features|L1) * P(L1)] / [P(features|L2) * P(L2)]

All we need now is some model by which we can compute P(features|Li) for each label. Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data. Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier. The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

This is where the "naive" in "Naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.

**Applications of Naive Bayes**

Naive Bayes, despite its simplicity, is widely used for text classification, spam filtering, sentiment analysis, and recommender systems. It's favored for its efficiency and ease of implementation.

**Implementing Naive Bayes in Python**

You can use Python's Scikit-Learn library which provides the Naive Bayes classifiers in `sklearn.naive_bayes`. Methods include:

- `GaussianNB`: Implements the Gaussian Naive Bayes algorithm for classification. Assumes that features follow a normal distribution.
- `MultinomialNB`: Implements the Naive Bayes algorithm for multinomially distributed data.
- `BernoulliNB`: Implements the Naive Bayes algorithm for data that is distributed according to multivariate Bernoulli distributions.

**Advantages and Disadvantages of Naive Bayes**

*Advantages*:
- Easy and fast to predict the class of the test dataset.
- Performs well in multi-class prediction.
- Performs well with categorical input variables compared to numerical variables. For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

*Disadvantages*:
- The assumption of independent features. In practice, it is almost impossible to obtain a set of predictors that are entirely independent.
- If a categorical variable has a category in the test dataset, which was not observed in the training dataset, then the model will assign a zero probability. It will be unable to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation.
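
The zero-frequency problem and Laplace (add-one) smoothing can be illustrated with plain arithmetic. The counts below are made up, and `smoothed_probability` is a hypothetical helper written for this sketch. In scikit-learn, the `alpha` parameter of classifiers such as `MultinomialNB` plays the role of the smoothing constant.

```python
def smoothed_probability(count, total, n_categories, alpha=1.0):
    """Estimate P(category | class) with additive (Laplace) smoothing."""
    return (count + alpha) / (total + alpha * n_categories)

# Made-up counts of a categorical feature among training examples of one class;
# "blue" was never observed for this class
color_counts = {"red": 6, "green": 4, "blue": 0}
total = sum(color_counts.values())
n_categories = len(color_counts)

smoothed = {}
for color, count in color_counts.items():
    unsmoothed = count / total
    smoothed[color] = smoothed_probability(count, total, n_categories)
    print(f"{color:5s}  unsmoothed={unsmoothed:.3f}  smoothed={smoothed[color]:.3f}")
```

Without smoothing, the zero for "blue" would wipe out the whole product of feature probabilities. With smoothing, every category keeps a small non-zero probability and the estimates still sum to 1.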

### Implementing Naive Bayes in Python using scikit-learn

Using the breast cancer dataset, let's implement a Naive Bayes model in Python using scikit-learn.

In [None]:
# Load the data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

df['target'] = cancer.target

In [None]:
df

In [None]:
#visualize the output class
import matplotlib.pyplot as plt

df['target'].value_counts().plot(kind='bar')
plt.show()

In [None]:
#Split the dataset into features (independent variables) and target (dependent variables)
X = df.drop('target', axis=1)
y = df['target']

In [None]:
#Standardize the features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Instantiate a Gaussian Classifier
from sklearn.naive_bayes import GaussianNB

model_naive = GaussianNB()

In [None]:
# Train the model
model_naive.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_predictions_naive = model_naive.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_predictions_naive)

print("Model accuracy is: ", accuracy)

### Evaluating Models

Evaluating a model is a critical step in the process of machine learning. After we've trained our model, we need to determine how well it can generalize to unseen data. This is where evaluation metrics come into play. There are several different metrics we can use to evaluate a model, and the best one to use often depends on the specific application and the type of problem we're solving (classification, regression, clustering, etc.). Since we are discussing classification models, here are some commonly used evaluation metrics:

**Accuracy:**
- Simplest classification metric.
- Number of correct predictions divided by total predictions.
- Can be misleading with imbalanced datasets.

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

    where:

        TP = True Positives
        TN = True Negatives
        FP = False Positives
        FN = False Negatives
    

**Precision:**
- Number of true positives divided by the sum of true positives and false positives.
- Measures how many of the samples predicted as positive are actually positive.

    **Precision = TP / (TP + FP)**

**Recall (Sensitivity or True Positive Rate):**
- Number of true positives divided by the sum of true positives and false negatives.
    
    **Recall = TP / (TP + FN)**
    
- Shows how well the model can find all the positive examples.

**F1-Score:**
- Harmonic mean of precision and recall.
- Useful balance between precision and recall, especially on uneven class distribution.

    **F1 Score = 2 * (Precision * Recall) / (Precision + Recall)**
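
The formulas above can be checked by hand on a small example. The labels below are made up for illustration. The sketch counts TP, TN, FP, and FN directly, applies the formulas, and compares the results with scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels (1 = positive class, 0 = negative class)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * (precision_manual * recall_manual) / (precision_manual + recall_manual)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy:  manual={accuracy_manual:.2f}  sklearn={accuracy_score(y_true_demo, y_pred_demo):.2f}")
print(f"Precision: manual={precision_manual:.2f}  sklearn={precision_score(y_true_demo, y_pred_demo):.2f}")
print(f"Recall:    manual={recall_manual:.2f}  sklearn={recall_score(y_true_demo, y_pred_demo):.2f}")
print(f"F1-Score:  manual={f1_manual:.2f}  sklearn={f1_score(y_true_demo, y_pred_demo):.2f}")
```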

**Confusion Matrix:**
- Descriptive table for the performance of a classification model on test data.
- Shows both the errors made and the types of errors (TP, FP, FN, TN).

**ROC Curve:**
- Graphical representation of the performance of a binary classifier.
- Plots true positive rate (TPR) against false positive rate (FPR).

**Area Under the ROC Curve (AUC-ROC):**
- Quantifies the two-dimensional area under the entire ROC curve.
- Represents the model's ability to correctly distinguish observations from two classes.
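
As a rough sketch of how the ROC curve and its area are obtained, the example below trains a Gaussian Naive Bayes model on the breast cancer dataset and computes both from predicted probabilities. The choice of model and the `_roc` variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_roc, y_roc = load_breast_cancer(return_X_y=True)
X_train_roc, X_test_roc, y_train_roc, y_test_roc = train_test_split(
    X_roc, y_roc, test_size=0.2, random_state=42
)

clf_roc = GaussianNB().fit(X_train_roc, y_train_roc)

# ROC analysis needs scores, not hard labels: use P(class 1) for each test row
scores_roc = clf_roc.predict_proba(X_test_roc)[:, 1]

# Each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_test_roc, scores_roc)
auc_value = roc_auc_score(y_test_roc, scores_roc)

print("Number of points on the curve:", len(fpr))
print(f"AUC: {auc_value:.3f}")
```

Plotting `tpr` against `fpr` (for example with `plt.plot(fpr, tpr)`) gives the ROC curve itself. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation.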

Python's `sklearn` library provides functionality to calculate these metrics:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, roc_auc_score

# Let's use the model generated in our previous Naive Bayes classification task to produce evaluation results.


# Accuracy
accuracy = accuracy_score(y_test, y_predictions_naive)
print(f"Accuracy: {accuracy}")

# Precision
precision = precision_score(y_test, y_predictions_naive)
print(f"Precision: {precision}")

# Recall
recall = recall_score(y_test, y_predictions_naive)
print(f"Recall: {recall}")

# F1-Score
f1 = f1_score(y_test, y_predictions_naive)
print(f"F1-Score: {f1}")

# Confusion Matrix
conf_mat = confusion_matrix(y_test, y_predictions_naive)
print(f"Confusion Matrix: \n{conf_mat}")

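# ROC-AUC Score (computed from predicted probabilities of the positive class rather than hard labels)
y_proba_naive = model_naive.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_proba_naive)
print(f"ROC-AUC Score: {roc_auc}")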

Remember that no single metric can tell the complete story. Looking at a combination of these metrics together can give you a more complete picture of your model's performance.

### Overfitting and Underfitting

**Overfitting and Underfitting** are two common problems that occur in machine learning and can lead to poor model performance.

**Overfitting:**
Overfitting occurs when a machine learning model captures noise along with the underlying pattern in data. It's like studying for an exam by memorizing the exact answers to the questions instead of understanding the concepts. An overfitted model:

- Performs exceptionally well on training data but poorly on unseen data (test data)
- Tends to capture random fluctuations in the training data
- Is a result of excessively complex models, such as decision trees that are too deep or not pruned, or polynomial regression models of high degree
- Can be mitigated by techniques like cross-validation, regularization, pruning, and increasing the amount of training data

**Underfitting:**
Underfitting, on the other hand, occurs when a machine learning model cannot capture the underlying trend of the data. It's like studying for an exam by only learning the most basic concepts. An underfitted model:

- Performs poorly on both training and unseen data
- Is usually a result of a model that is too simple to capture complex trends in data, like a linear model for non-linear data
- Can be mitigated by increasing the complexity of the model, feature engineering, etc.

It's important to find a balance between overfitting and underfitting. A good model should be able to capture the underlying trends in data (low bias) without capturing the noise (low variance). This is often referred to as the Bias-Variance Tradeoff.
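A quick way to see both problems is to compare training and test accuracy as model complexity grows. The sketch below does this with decision trees of different depths on the Breast Cancer dataset. The depths, split size, and variable names are assumptions chosen for illustration, and the size of the gap will vary with the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.3, random_state=0, stratify=y_bc
)

results = {}
for depth in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A very shallow tree that scores low on both sets suggests underfitting, while a tree whose training accuracy is far above its test accuracy suggests overfitting.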


### Dataset Overview

We will use the **Breast Cancer Wisconsin Dataset** available in Scikit-learn. It is commonly used for binary classification tasks.

- **Features**: 30 numeric features describing cell nuclei.
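- **Target**: Binary (0 = malignant, 1 = benign). Note that Scikit-learn metrics such as precision and recall treat label 1 (benign) as the positive class by default.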
- **Objective**: Predict if a tumor is benign or malignant.


In [None]:

from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Overview
print("Dataset shape:", df.shape)
df.head()
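
Before training anything, it helps to check how balanced the two classes are, because class balance affects how metrics such as accuracy should be read. The following is a minimal sketch; the names `labels` and `class_counts` are introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()

# Map the integer labels to their names (0 -> malignant, 1 -> benign)
labels = pd.Series(data.target).map(dict(enumerate(data.target_names)))

class_counts = labels.value_counts()
print(class_counts)                  # number of samples per class
print(class_counts / len(labels))    # share of each class
```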



### Data Exploration and Visualization

Let's explore feature correlations and visualize the dataset.


In [None]:
%pip install seaborn

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Feature correlation heatmap
plt.figure(figsize=(12,10))
sns.heatmap(df.corr(), cmap='coolwarm', annot=False)
plt.title("Feature Correlation Heatmap")
plt.show()
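
A 31 x 31 heatmap can be hard to read, so it may help to list the features most strongly correlated with the target. The sketch below rebuilds the DataFrame under the hypothetical name `df_corr` so that it runs on its own; the choice of the top five is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df_corr = pd.DataFrame(data.data, columns=data.feature_names)
df_corr["target"] = data.target

# Pearson correlation of every feature with the target, excluding the target itself
target_corr = df_corr.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are included
top_features = target_corr.abs().sort_values(ascending=False).head(5)
print(top_features)
```

Correlation only captures linear relationships, so a low value does not by itself mean a feature is useless to a non-linear model.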



### Evaluation Metrics

We will evaluate our models using the **confusion matrix**, **precision**, **recall**, **F1-score**, and the **ROC curve**.


In [None]:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Example Confusion Matrix
y_true = [0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Class 0', 'Class 1'])
disp.plot()
plt.show()
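
Precision, recall, and F1-score can all be derived from the four cells of this confusion matrix. The sketch below computes them by hand for the same `y_true` and `y_pred`; the `*_manual` names are introduced here for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1]

# For binary labels {0, 1}, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)   # of the predicted positives, how many were right
recall_manual = tp / (tp + fn)      # of the actual positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision_manual:.2f}, Recall={recall_manual:.2f}, F1={f1_manual:.2f}")
```

For this example the counts are TN=2, FP=1, FN=1, TP=3, so precision, recall, and F1 all equal 0.75, which matches what `precision_score`, `recall_score`, and `f1_score` return.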



### Decision Boundary Visualization

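Let's visualize decision boundaries for different classifiers. Because a decision boundary can only be drawn in two dimensions, we use a synthetic two-feature dataset generated with `make_classification` rather than the 30-feature Breast Cancer dataset.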


In [None]:
%pip install mlxtend

In [None]:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from mlxtend.plotting import plot_decision_regions

X, y = make_classification(n_features=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, random_state=42)

models = [("Logistic Regression", LogisticRegression()),
          ("KNN", KNeighborsClassifier()),
          ("Decision Tree", DecisionTreeClassifier(max_depth=3))]

for name, model in models:
    model.fit(X, y)
    plt.figure()
    plot_decision_regions(X, y, clf=model)
    plt.title(name)
    plt.show()
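
Conceptually, a decision-region plot can be produced by asking the model for a prediction at every point of a dense grid and coloring the grid by predicted class. The sketch below shows that idea with NumPy only. It is an illustration of the concept, not necessarily how `plot_decision_regions` is implemented, and the grid resolution, margin, and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_2d, y_2d = make_classification(n_features=2, n_redundant=0, n_classes=2,
                                 n_clusters_per_class=1, random_state=42)
clf = LogisticRegression().fit(X_2d, y_2d)

# Build a grid that covers the range of both features, with a small margin
f1_min, f1_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
f2_min, f2_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(f1_min, f1_max, 200),
                     np.linspace(f2_min, f2_max, 200))

# Predict a class for every grid point, then reshape back to the grid layout
grid_points = np.c_[xx.ravel(), yy.ravel()]
zz = clf.predict(grid_points).reshape(xx.shape)

print("Grid shape:", zz.shape)
print("Share of grid assigned to class 1:", zz.mean())
```

Passing `xx`, `yy`, and `zz` to `plt.contourf` would then draw the colored regions, and the boundary is wherever the predicted class changes.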



### Model Comparison

| Model               | Accuracy | Precision | Recall | F1-Score |
|---------------------|---------:|----------:|-------:|---------:|
| Logistic Regression |      XX% |       XX% |    XX% |      XX% |
| k-NN                |      XX% |       XX% |    XX% |      XX% |
| Decision Tree       |      XX% |       XX% |    XX% |      XX% |

*Replace XX% with actual results from your model evaluation.*
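
One possible way to produce these numbers is sketched below. It trains the three models on the Breast Cancer dataset and collects the four metrics into a DataFrame. The split, the scaling pipeline, the hyperparameters, and names such as `candidates` and `comparison` are assumptions for illustration, so results will differ if your setup differs.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc
)

# Logistic Regression and k-NN are sensitive to feature scale, so they get a scaler
candidates = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

rows = []
for name, model in candidates.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_hat),
        "Precision": precision_score(y_te, y_hat),  # positive class is label 1 (benign)
        "Recall": recall_score(y_te, y_hat),
        "F1-Score": f1_score(y_te, y_hat),
    })

comparison = pd.DataFrame(rows).set_index("Model")
print((comparison * 100).round(1))  # values shown as percentages
```

A single train/test split gives a noisy estimate, so cross-validation is worth considering before drawing conclusions from small differences between models.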



### Summary

- Supervised learning involves training models on labeled data.
- Classification models predict discrete class labels.
- Common models: Logistic Regression, k-NN, Decision Trees, and Naive Bayes.
- Evaluation metrics: Accuracy, Precision, Recall, F1-score, Confusion Matrix, and ROC-AUC.
- Visualizing decision boundaries helps understand model behavior.



### 📋 Classification Models Quiz

---

### **1. (MCQ)**  
**Which of the following is a supervised learning algorithm used for classification?**

A) Logistic Regression  
B) k-Nearest Neighbors  
C) Decision Trees  
D) All of the above  

---

### **2. (True/False)**  
**Logistic Regression can be used to predict continuous numerical values.**

---

### **3. (Fill in the Blank)**  
**In k-Nearest Neighbors (k-NN), the value of _____ determines how many neighbors are considered when making a prediction.**

---

### **4. (MCQ)**  
**What is the main drawback of Decision Trees?**

A) They are hard to interpret  
B) They often underfit the data  
C) They are prone to overfitting  
D) They require feature scaling  

---

### **5. (True/False)**  
**The Naive Bayes classifier assumes that features are dependent on each other.**

---

### **6. (MCQ)**  
**Which of the following evaluation metrics is LEAST reliable on imbalanced classification datasets?**

A) Accuracy  
B) Precision  
C) Recall  
D) F1-Score  

---

### **7. (Fill in the Blank)**  
**A confusion matrix displays the number of ___________ and ___________ predictions made by a classification model.**

---

### **8. (MCQ)**  
**In Logistic Regression, the output of the model is passed through which function to produce a probability?**

A) ReLU Function  
B) Sigmoid Function  
C) Linear Function  
D) Softmax Function  

---

### **9. (True/False)**  
**In k-NN, feature scaling (like normalization) has no impact on the model’s performance.**

---

### **10. (Fill in the Blank)**  
**The ___________ metric combines both precision and recall into a single number, balancing the two.**

---

