# 🧬 Microbes Predictor

In this notebook, we will transition from health experts to microbiologists to save lives. Sadly, it takes more than 24 hours to culture the bacteria recovered from the blood of an infected patient, identify the species, and then determine which antibiotics the organism is resistant to, leading to a very high mortality rate for such infections.

Our goal in this notebook is to identify the microbe instantly using machine learning, given features extracted from it. Our classes are the following ten microbes (`Spirogyra`, `Volvox`, `Pithophora`, `Yeast`, `Raizopus`, `Penicillum`, `Aspergillus sp`, `Protozoa`, `Diatom`, `Ulothrix`).

Our dataset has over 24K observations to help us learn to classify the microbes based on some meaningful features, which you can read more about [here](https://www.kaggle.com/datasets/sayansh001/microbes-dataset).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

#### Data Ingestion

TODO 1: Download the dataset from [Kaggle](https://www.kaggle.com/datasets/sayansh001/microbes-dataset), rename the file to `microbes.csv`, and manually set `Serial` as the name of the first column.

In [None]:
dataset = pd.read_csv('microbes.csv')

dataset.head(10)
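
If you would rather not edit the file by hand, the first column can also be renamed after loading. The sketch below is only an illustration: it uses a tiny in-memory CSV with a made-up `feature_a` column instead of the real file, and it assumes the first column is the one that should become `Serial`.

```python
import io
import pandas as pd

# Toy CSV standing in for the Kaggle file (hypothetical columns);
# the empty first header mimics an unnamed index column.
toy_csv = io.StringIO(
    ",feature_a,microorganisms\n"
    "0,10.7,Spirogyra\n"
    "1,5.6,Volvox\n"
)

toy = pd.read_csv(toy_csv)
# Rename the first column by position, whatever name pandas gave it.
toy.columns = ["Serial", *toy.columns[1:]]
print(toy.columns.tolist())
```

Renaming by position avoids depending on the placeholder name pandas assigns to a column with an empty header.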

#### Data Cleaning

In [None]:
num_missing_vals = dataset.isnull().sum().sum()
print(f"Number of missing values: {num_missing_vals}")

#### Split the Dataset

In [None]:
# column split
x_data = dataset.drop(columns=['Serial', 'microorganisms'])     # Serial is ID and microorganisms is target
x_data = x_data.iloc[:, :10]                                    # First ten features contain enough information
y_data = dataset['microorganisms']                              # Target variable

# train-test split
x_data, y_data = x_data.to_numpy(), y_data.to_numpy()
x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=0, stratify=y_data)
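
Since the split passes `stratify=y_data`, each class should make up roughly the same share of both subsets. One way to sanity-check this is sketched below; `class_proportions` is a hypothetical helper, and the toy label arrays only stand in for `y_train` and `y_val`.

```python
import numpy as np

def class_proportions(labels):
    """Return {class: fraction of samples} for a 1-D label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

# Toy labels standing in for y_train and y_val (an 80/20 split of a 3:1 class ratio).
toy_train = np.array(["Yeast"] * 60 + ["Diatom"] * 20)
toy_val = np.array(["Yeast"] * 15 + ["Diatom"] * 5)

train_props = class_proportions(toy_train)
val_props = class_proportions(toy_val)

# Largest difference in class share between the two subsets.
max_gap = max(abs(train_props[c] - val_props[c]) for c in train_props)
print(train_props, val_props, max_gap)
```

On the real data, a small `max_gap` would suggest the stratified split behaved as intended.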

#### Exploratory Analysis

In [None]:
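plt.style.use('dark_background')
plt.figure(figsize=(10, 2), dpi=150)
# Count how many training samples fall into each class
labels, counts = np.unique(y_train, return_counts=True)
plt.bar(labels, counts, align='center')
# Rotate the class names after plotting so the rotation applies to the final tick labels
plt.xticks(rotation=90)
plt.show()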

## 😎 Put on Your Machine Learning Engineer Goggles

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score

from BayesClassifier import BayesClassifier
from KDEstimator import KDEstimator

#### Gaussian Naive Bayes

In [None]:
# Initiate
gnb_model = BayesClassifier(mode='GNB')
# Fit
gnb_model.fit(x_train, y_train)
# Predict
gnb_y_pred = gnb_model.predict(x_val)
# Evaluate
gnb_accuracy = gnb_model.score(x_val, y_val)
print("Gaussian Naive Bayes Accuracy:", gnb_accuracy)
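
For intuition, here is a rough NumPy sketch of what a Gaussian Naive Bayes classifier computes: one mean and variance per class and feature, plus class priors, combined through a sum of per-feature log-densities. This is not the implementation inside `BayesClassifier`; the function names `gnb_fit` and `gnb_predict`, the `var_floor` parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gnb_fit(X, y, var_floor=1e-9):
    """Estimate per-class feature means, variances, and log-priors."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # var_floor keeps the variances strictly positive (an illustrative choice).
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, variances, log_priors

def gnb_predict(X, classes, means, variances, log_priors):
    """Pick the class with the highest log-prior plus summed per-feature log-density."""
    diff = X[:, None, :] - means[None, :, :]                  # (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    return classes[np.argmax(log_lik + log_priors[None, :], axis=1)]

# Toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                   rng.normal(6.0, 1.0, size=(200, 2))])
y_toy = np.array([0] * 200 + [1] * 200)

params = gnb_fit(X_toy, y_toy)
toy_acc = (gnb_predict(X_toy, *params) == y_toy).mean()
print("Toy accuracy:", toy_acc)
```

Summing over features is where the "naive" independence assumption enters: correlations between features are ignored entirely.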

<img src="https://media1.tenor.com/m/P1kvseQ5INIAAAAd/blank-stare-really.gif" width="200"/>

#### Linear Discriminant Analysis

In [None]:
# Initiate
lda_model = BayesClassifier(mode='LDA')
# Fit
lda_model.fit(x_train, y_train)
# Predict
lda_y_pred = lda_model.predict(x_val)
# Evaluate
lda_accuracy = lda_model.score(x_val, y_val)
print("Linear Discriminant Analysis Accuracy:", lda_accuracy)

<img src="https://media1.tenor.com/m/lL6lmpGAtSwAAAAd/cat-kid-retekmacska.gif" width="200"/>

#### Quadratic Discriminant Analysis

In [None]:
# Initiate
qda_model = BayesClassifier(mode='QDA')
# Fit
qda_model.fit(x_train, y_train)
# Predict
qda_y_pred = qda_model.predict(x_val)
# Evaluate
qda_accuracy = qda_model.score(x_val, y_val)
print("Quadratic Discriminant Analysis Accuracy:", qda_accuracy)
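
LDA and QDA differ mainly in how they estimate covariance: QDA fits one covariance matrix per class, while LDA shares a single pooled matrix across all classes. The sketch below contrasts the two estimates on toy data. The helper names `class_covariances` and `pooled_covariance` are hypothetical, and this is not necessarily how `BayesClassifier` computes them.

```python
import numpy as np

def class_covariances(X, y):
    """QDA-style estimate: one covariance matrix per class."""
    classes = np.unique(y)
    return classes, np.array([np.cov(X[y == c], rowvar=False) for c in classes])

def pooled_covariance(X, y):
    """LDA-style estimate: class covariances averaged, weighted by degrees of freedom."""
    classes, covs = class_covariances(X, y)
    counts = np.array([(y == c).sum() for c in classes])
    weights = (counts - 1) / (counts.sum() - len(classes))
    return np.tensordot(weights, covs, axes=1)

# Toy data: two classes with clearly different spreads.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(100, 2)) * [1.0, 3.0]
X_b = rng.normal(size=(100, 2)) * [2.0, 0.5] + 4.0
X_toy = np.vstack([X_a, X_b])
y_toy = np.array([0] * 100 + [1] * 100)

classes, covs = class_covariances(X_toy, y_toy)
pooled = pooled_covariance(X_toy, y_toy)
print(covs.shape, pooled.shape)
```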

<img src="https://media1.tenor.com/m/AFKVFcP5to8AAAAd/julesmaru-really.gif" width="250"/>

#### Kernel Density Estimation

In [None]:
# TODO 2: Call KDEstimator with Gauss bump and bandwidth=0.5
kde_config = None
# TODO 3: Call BayesClassifier in KDE mode and pass it that instance
kde_model = None
kde_model.fit(x_train, y_train)
kde_y_pred = kde_model.predict(x_val)
kde_accuracy = kde_model.score(x_val, y_val)
print("Bayes Classifier with KDE Accuracy:", kde_accuracy)
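
As a reminder of what kernel density estimation does, the sketch below evaluates a Gaussian-kernel density estimate directly in NumPy: every query point is compared against every stored training sample. It is independent of `KDEstimator`, whose interface is not shown here; the function name `gaussian_kde_logpdf` and the toy data are assumptions for illustration only.

```python
import numpy as np

def gaussian_kde_logpdf(queries, samples, bandwidth=0.5):
    """Log-density at each query under an isotropic Gaussian-kernel KDE."""
    n, d = samples.shape
    diff = queries[:, None, :] - samples[None, :, :]          # (n_queries, n_samples, d)
    sq_dist = (diff ** 2).sum(axis=2)                          # (n_queries, n_samples)
    log_kernel = -0.5 * sq_dist / bandwidth**2 - d * np.log(bandwidth * np.sqrt(2 * np.pi))
    # Log-sum-exp over the samples for numerical stability, then average.
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).sum(axis=1)) - np.log(n)

# Toy 1-D check: the estimated density should integrate to roughly 1.
rng = np.random.default_rng(0)
toy_samples = rng.normal(size=(50, 1))
grid = np.linspace(-10.0, 10.0, 2001)[:, None]
density = np.exp(gaussian_kde_logpdf(grid, toy_samples, bandwidth=0.5))
total_mass = density.sum() * (grid[1, 0] - grid[0, 0])
print("Approximate integral of the density:", total_mass)
```

In a Bayes classifier, one such density would be fitted per class and combined with the class prior, in place of the Gaussian density used by the earlier models.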

<img src="https://media1.tenor.com/m/QxqYH15_UxYAAAAd/wow-omg.gif" width="250"/>

#### 1. Explain why all three of GNB, LDA, and QDA performed much worse than KDE:

In [None]:
'''
Answer goes here. You can use the same argument to explain why LDA was better than QDA.
'''

#### 2. Why is the Bayes Classifier with KDE much slower than the Bayes Classifier with the normality assumption? Explain in light of the inference complexity of both (even if roughly):

In [None]:
'''
Answer goes here
'''

#### 3. Deduce why the performance does not get better when the Silverman bandwidth is used:

In [None]:
'''
Answer goes here. 
This may help https://en.wikipedia.org/wiki/Kernel_density_estimation#A_rule-of-thumb_bandwidth_estimator
'''
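
For reference, the rule-of-thumb bandwidth described on the linked page can be computed for a single feature as sketched below. The function name `silverman_bandwidth` is hypothetical, and `KDEstimator` may compute its Silverman bandwidth differently.

```python
import numpy as np

def silverman_bandwidth(x):
    """Rule of thumb: h = 0.9 * min(std, IQR / 1.34) * n^(-1/5) for 1-D data."""
    x = np.asarray(x, dtype=float)
    n = x.size
    sigma = x.std(ddof=1)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(sigma, (q75 - q25) / 1.34)
    return 0.9 * spread * n ** (-1 / 5)

# Toy 1-D feature drawn from a standard normal distribution.
rng = np.random.default_rng(0)
toy_feature = rng.normal(size=1000)
h = silverman_bandwidth(toy_feature)
h_scaled = silverman_bandwidth(2 * toy_feature)
print("Silverman bandwidth:", h)
```

The bandwidth scales with the spread of the data, so `h_scaled` is twice `h`.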

#### [Extra] Try to exceed the current KDE accuracy by experimenting with different hyperparameters (bandwidth or bump)