# Introduction
This tutorial applies the Naive Bayes algorithm to the Breast Cancer dataset. We'll review how the algorithm works, train a Gaussian Naive Bayes classifier, and print its accuracy on the test set.

# Naive Bayes Algorithm

## Overview
Naive Bayes is a probabilistic classification algorithm based on **Bayes' Theorem**. It assumes that the features are **conditionally independent**, given the class label. This strong independence assumption is what makes it "naive." Despite its simplicity, Naive Bayes is effective and widely used in various classification tasks, especially in **text classification** (e.g., spam detection, sentiment analysis).

## Bayes' Theorem
Bayes' Theorem is the core principle behind Naive Bayes. It calculates the posterior probability of a class, given a set of features:

$$
P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}
$$

Where:
- \(P(C|X)\) is the **posterior probability**: the probability of class \(C\) given the features \(X\).
- \(P(X|C)\) is the **likelihood**: the probability of observing the features \(X\) given class \(C\).
- \(P(C)\) is the **prior probability**: the probability of class \(C\) occurring.
- \(P(X)\) is the **evidence**: the total probability of observing features \(X\) across all classes.
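
To make the formula concrete, here is a minimal sketch that applies Bayes' Theorem by hand to a two-class problem. It relies on the naive assumption that, for features \(X = (x_1, \dots, x_n)\), the likelihood factorizes as \(P(X|C) = \prod_{i} P(x_i|C)\). All numbers (the priors and the per-word likelihoods of a toy spam filter) are hypothetical and chosen only for illustration.

```python
# Hypothetical priors P(C) for a toy spam filter
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-feature likelihoods P(x_i | C) for two observed words
likelihoods = {
    "spam": {"free": 0.5, "offer": 0.4},
    "ham": {"free": 0.1, "offer": 0.05},
}

observed_words = ["free", "offer"]

# Numerator of Bayes' Theorem: P(X|C) * P(C), with P(X|C) computed as a
# product of per-feature likelihoods (the "naive" independence assumption)
joint = {}
for label, prior in priors.items():
    likelihood = 1.0
    for word in observed_words:
        likelihood *= likelihoods[label][word]
    joint[label] = likelihood * prior

# Evidence P(X): sum of the numerators over all classes
evidence = sum(joint.values())

# Posterior P(C|X) for each class
posteriors = {label: value / evidence for label, value in joint.items()}

for label, posterior in posteriors.items():
    print(f"P({label} | X) = {posterior:.3f}")
```

With these made-up numbers the posterior for spam comes out at roughly 0.96, and the two posteriors sum to 1 because the evidence term normalizes them.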

## Types of Naive Bayes Classifiers
1. **Gaussian Naive Bayes**: Assumes that each continuous feature follows a Gaussian (normal) distribution within each class.
2. **Multinomial Naive Bayes**: Suitable for discrete data like word counts in text classification (bag of words model).
3. **Bernoulli Naive Bayes**: Works with binary/boolean features, making it ideal for binary data (e.g., whether a word appears in a document or not).
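
As a rough illustration of how the three variants map to different kinds of input, the sketch below fits each scikit-learn classifier on a tiny toy dataset. The arrays and the names `X_continuous`, `X_counts`, `X_binary`, and `y_toy` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy labels shared by all three examples
y_toy = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word presence/absence (1 or 0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

# Fit each variant on its matching feature type and predict on the same rows
toy_predictions = {}
for name, (model, X_toy) in models.items():
    model.fit(X_toy, y_toy)
    toy_predictions[name] = model.predict(X_toy)
    print(name, toy_predictions[name])
```

The Breast Cancer dataset used later consists of continuous measurements, which is why this tutorial uses `GaussianNB`.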

## Advantages
- **Simple and fast**: It is easy to implement and scales to large datasets, because training only requires estimating per-class statistics for each feature.
- **Efficient for text classification**: Performs well with high-dimensional data, especially in **Natural Language Processing** (NLP) tasks.
- **Handles multiple classes**: Naturally supports multi-class classification problems.

## Limitations
- **Strong independence assumption**: Assumes that features are conditionally independent given the class, which is often not the case in real-world data.
- **Zero-frequency problem**: If a feature value was never observed together with a class in the training data, its estimated likelihood is zero, which forces the whole posterior for that class to zero. This can be addressed by techniques like **Laplace smoothing**.
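
The zero-frequency problem and its fix can be sketched in a few lines. The word counts below are hypothetical, and the smoothing formula shown is the common add-`alpha` estimate \((count + \alpha) / (total + \alpha \cdot V)\), where \(V\) is the vocabulary size.

```python
# Hypothetical word counts observed in the "spam" class of a toy corpus
spam_word_counts = {"free": 3, "offer": 2, "meeting": 0}

total_count = sum(spam_word_counts.values())
vocabulary_size = len(spam_word_counts)
alpha = 1.0  # Laplace (add-one) smoothing strength

# Maximum-likelihood estimate: unseen words get probability zero
unsmoothed = {
    word: count / total_count for word, count in spam_word_counts.items()
}

# Laplace-smoothed estimate: every word gets a small non-zero probability
smoothed = {
    word: (count + alpha) / (total_count + alpha * vocabulary_size)
    for word, count in spam_word_counts.items()
}

for word in spam_word_counts:
    print(f"{word}: unsmoothed={unsmoothed[word]:.3f}, smoothed={smoothed[word]:.3f}")
```

The unseen word "meeting" moves from a probability of 0 to 0.125, while the smoothed probabilities still sum to 1. In scikit-learn, `MultinomialNB` and `BernoulliNB` expose this smoothing strength through their `alpha` parameter.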

## Example Use Cases
- **Spam detection**: Classifying emails as spam or not based on word occurrence.
- **Sentiment analysis**: Determining whether a review or comment has positive or negative sentiment.
- **Document classification**: Categorizing documents into predefined classes.

# Setup
Import the necessary libraries and load the Breast Cancer dataset.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the breast cancer dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split the dataset into training and testing sets

In [2]:
# Split the data into training and test sets (80% training, 20% test)
# random_state is set to 42 to ensure reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Naive Bayes model

In [3]:
# Create the Naive Bayes classifier
clf = GaussianNB()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)

In [4]:
# Print the accuracy of the classifier on the test set
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.97
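
Accuracy alone does not show which class the errors fall in. The setup cell already imports `classification_report` and `confusion_matrix`, so one possible next step is sketched below. The block rebuilds the same split and model so that it runs on its own; the names `X_tr`, `X_te`, `y_tr`, `y_te`, and `model` are hypothetical and chosen to avoid overwriting the variables defined earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Rebuild the same 80/20 split and model so this block is self-contained
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = GaussianNB().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall, and F1 score
print(classification_report(y_te, y_hat, target_names=data.target_names))

# Posterior probabilities P(C|X) for the first three test samples
proba = model.predict_proba(X_te[:3])
print(proba.round(3))
```

Each row of `predict_proba` holds the posterior probabilities from Bayes' Theorem for one sample, so the values in a row sum to 1.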


# Conclusion

This tutorial covers the Naive Bayes algorithm using the Breast Cancer dataset. It demonstrates how to create and train the Naive Bayes model for classification and prints the accuracy of the model on the test set.

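Naive Bayes is a simple yet powerful algorithm for classification tasks. It is especially popular for text and categorical data, and, as the result above shows, the Gaussian variant can also perform well on continuous features.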