# Names and IDs
 1.
 2.

---
# Section 1
---

# I. Naive Bayes (40 pts)

In this part we will test digit classification on the MNIST dataset, using Bernoulli Naive Bayes (a generative model), in contrast to the Multivariate Logistic Regression (a discriminative model) we saw.

The MNIST dataset contains 28x28 grayscale images of handwritten digits between 0 and 9 (10 classes). For clarity of the mathematical analysis, and to match the expected API, each image is flattened into a 1D array with 784 elements.

### Loading the MNIST dataset
Load the MNIST dataset. Keras provides a loader that downloads the data on first use and caches it locally. Use

```
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

to fetch the original data. Each image is 28 by 28 pixels with grayscale values in the range [0,255], and the corresponding label is an integer $y\in \{0,\ldots,9\}$. Each image should be transformed into a 1D integer array $x\in \{0,\ldots,255\}^{784}$.

```
x_train = x_train.reshape(x_train.shape[0], 784)
x_test = x_test.reshape(x_test.shape[0], 784)
```

Divide your data into train and test sets in an 80-20 split, and plot a single sample of each digit as the original image to get a feeling for what the data looks like.
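A minimal sketch of the split and of selecting one example per digit, assuming the data is pooled into a single pair of arrays before being re-split 80-20 (one possible reading of the instruction). Synthetic stand-in arrays are used so the snippet runs without downloading MNIST; the names `x_all`, `y_all`, and `first_index_per_digit` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the flattened MNIST arrays (assumption: the real data would
# come from mnist.load_data() followed by the reshape to 784 columns).
rng = np.random.default_rng(0)
y_all = np.repeat(np.arange(10), 20)                 # 200 labels, 20 per digit
x_all = rng.integers(0, 256, size=(y_all.size, 784), dtype=np.uint8)

# 80-20 split; stratify keeps the class proportions equal in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)

# Index of the first training sample of each digit 0..9.
first_index_per_digit = {d: int(np.flatnonzero(y_tr == d)[0]) for d in range(10)}

# Each selected row is reshaped back to 28x28 so it can be displayed,
# for example with matplotlib's imshow(image, cmap="gray").
images = {d: x_tr[i].reshape(28, 28) for d, i in first_index_per_digit.items()}
```

Stratifying is a design choice rather than a requirement: it only guarantees that every digit appears in both parts with the same frequency.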

In [None]:
# Implement here

### Bernoulli Naive Bayes
If we know how the digits are generated, then we know how to classify them (simply choose the digit class which will maximize the posterior probability) --- but which model should we use for describing the digits generation?

In this part we will try a very simplified model of digits creation (which is obviously not the same as the "real" model), using a Naive Bayes over an underlying Bernoulli distribution --- that is, we will assume that given a digit class, the pixels of the images are the result of independent coin flips, each with its own "head" probability.

Note that since we assume each pixel is either 0 (black) or 1 (white), we will need to adjust (preprocess) our data accordingly (see below).

So, the model is stated as follows:
$$
\begin{align}
\text{Domain} && x \in \{0,1\}^{784} \\
\text{Prior} && \pi_j = \Pr(y=j) \\
\text{Likelihood} && P_j(x) = \Pr(x | y=j) \\
\end{align}
$$

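where, for each pixel $i\in \{1,\ldots,784\}$, the probability of observing the pixel value $x_i$ given that the digit is $j$ is given below, with $p_{ji}$ denoting the probability that pixel $i$ is on (equal to 1) for digit $j$: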
$$
P_{ji}(x_i) = \Pr(x_i | y=j) =
\begin{cases}
p_{ji} & \text{if } x_i=1 \\
1-p_{ji} & \text{if } x_i=0 \\
\end{cases}
$$
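
Under the naive (conditional independence) assumption, the class likelihood factorizes over the pixels, which leads to the usual maximum-posterior decision rule. A short derivation consistent with the definitions above, where the products and sums run over all 784 pixels and $\hat{y}(x)$ is notation introduced here for the predicted class:

$$
\begin{align}
P_j(x) &= \prod_{i} P_{ji}(x_i) = \prod_{i} p_{ji}^{x_i}\,(1-p_{ji})^{1-x_i} \\
\hat{y}(x) &= \arg\max_j \; \pi_j P_j(x)
= \arg\max_j \left[ \log \pi_j + \sum_{i} \Big( x_i \log p_{ji} + (1-x_i)\log(1-p_{ji}) \Big) \right]
\end{align}
$$

Working with the logarithm avoids numerical underflow when multiplying 784 probabilities.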


#### Question 1
Research the differences between the three types of Naive Bayes classifiers: Bernoulli NB, Multinomial NB, and Gaussian NB.
Describe in your own words what makes each type unique and specify the kind of tasks for which you would prefer each one.

#### Answer 1
Put your answer here...

#### Question 2
Train a Naive Bayes classifier using the training data and predict on the test data. Use the [sklearn.naive_bayes.BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html) implementation (see the [source code for sklearn.naive_bayes](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py) for details).

Remember that we need to preprocess the data in this case so that each pixel becomes either black (0) or white (1). For this purpose, use the `binarize` parameter of the `BernoulliNB` class. Set this value to $0$ (this is the default), which in this case means that every pixel with a non-zero value will be set to 1.
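
To make the effect of `binarize` concrete, the sketch below compares letting `BernoulliNB` threshold the input itself with thresholding the input manually and passing `binarize=None`. The data is a small made-up array rather than MNIST, and the variable names are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(60, 16))      # toy "grayscale" features
X[rng.random(X.shape) < 0.5] = 0             # make roughly half the pixels black
y = np.arange(60) % 3                        # three toy classes

# binarize=0.0: the estimator thresholds internally (value > 0 becomes 1).
nb_internal = BernoulliNB(binarize=0.0).fit(X, y)

# binarize=None: the input is assumed to be binary already.
X_bin = (X > 0).astype(int)
nb_manual = BernoulliNB(binarize=None).fit(X_bin, y)

same_params = np.allclose(nb_internal.feature_log_prob_, nb_manual.feature_log_prob_)
same_preds = np.array_equal(nb_internal.predict(X), nb_manual.predict(X_bin))
```

Both routes should yield the same fitted parameters, so the choice between them is mostly about where the preprocessing step is easiest to inspect.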

1. Plot the confusion matrix of your classifier, as calculated on the test data (it is recommended to use [sklearn.metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)). Calculate the total accuracy (fraction of correctly classified images), and summarize the results in your own words.

    A **confusion matrix** for a multi-class classifier is a table that summarizes the performance of the model by comparing the predicted class labels to the true class labels: Each row represents the actual class, and each column represents the predicted class. The diagonal elements indicate the number of correct predictions for each class. Off-diagonal elements show misclassifications (e.g., how many times one class was predicted as another).


2. Plot the mean image of each class (estimated $\hat{p}_{ji}$) and generate one sample of each class (remember, you can do this since this is a generative model). You will need to access the `feature_log_prob_` attribute of the trained model.

3. Think of a way you can find the optimal threshold of the binarization part. **There is no need to actually perform this task --- just describe what you would have done.**
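
The mechanics behind items 1 and 2 can be sketched on a toy problem. This is a hedged illustration, not a reference solution: it assumes a synthetic 3-class dataset with 64 binary "pixels" instead of MNIST, and names such as `true_p`, `p_hat`, and `samples` are introduced only for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Toy stand-in for MNIST: 3 classes, 64 "pixels", each class with its own
# per-pixel on-probabilities (assumption made only to keep the sketch small).
n_classes, n_pixels, n_per_class = 3, 64, 200
true_p = rng.uniform(0.05, 0.95, size=(n_classes, n_pixels))
y = np.repeat(np.arange(n_classes), n_per_class)
X = (rng.random((y.size, n_pixels)) < true_p[y]).astype(int)

# Simple hold-out split: 480 training rows, 120 test rows.
perm = rng.permutation(y.size)
train_idx, test_idx = perm[:480], perm[480:]

model = BernoulliNB(binarize=0.0).fit(X[train_idx], y[train_idx])
y_pred = model.predict(X[test_idx])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y[test_idx], y_pred, labels=np.arange(n_classes))
accuracy = np.trace(cm) / cm.sum()

# feature_log_prob_ stores log p_ji, so exponentiate to get the "mean image".
p_hat = np.exp(model.feature_log_prob_)        # shape (n_classes, n_pixels)

# Generate one sample per class: independent coin flips with probability p_ji.
samples = (rng.random(p_hat.shape) < p_hat).astype(int)
```

For MNIST, each row of `p_hat` and `samples` would have 784 entries and could be reshaped to 28x28 before plotting.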

#### Answer 2
Put your answer here...

In [None]:
# code goes here

---
# Section 2 - Kaggle competition
---

# miRNA animals interaction prediction (60 pts)
In this section, you will explain the tools and methods you used in the competition. Fifty points will be given according to the explanations of the section and up to ten points according to your relative position in the competition. Participate in the following contest and answer the following questions:
https://www.kaggle.com/t/ae45745d840546ffa91755d7a06af0d7

In this section you are allowed to use only a Decision Tree as your ML model.



In [55]:
# imports
from google.colab import drive
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV



In [8]:
# data import
drive.mount('/content/drive')
x_train = pd.read_csv("/content/drive/Shareddrives/ML/x_train.csv", index_col="id")
y_train = pd.read_csv("/content/drive/Shareddrives/ML/y_train.csv", index_col="id")

x_test = pd.read_csv("/content/drive/Shareddrives/ML/x_test.csv", index_col="id")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### EDA - Exploratory Data Analysis (10 pts):
Use any visual tools to present and explain the data. Your answer must include statistics, images, and conclusions.
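
As a starting point for the statistics part, the sketch below computes a per-column summary (missing fraction, number of distinct values, standard deviation) and the class balance. It runs on a hypothetical toy frame; the column names and animal labels are made up and do not come from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for x_train / y_train.
features = pd.DataFrame({
    "f_numeric": [0.1, 0.4, np.nan, 0.9, 0.3, 0.7],
    "f_binary": [0, 1, 1, 0, 1, 1],
    "f_constant": [5, 5, 5, 5, 5, 5],
})
labels = pd.Series(["cow", "cow", "human", "mouse", "human", "cow"], name="label")

# One row per feature: how much is missing, how many distinct values, spread.
summary = pd.DataFrame({
    "missing_frac": features.isna().mean(),
    "n_unique": features.nunique(dropna=True),
    "std": features.std(numeric_only=True),
})

# Columns with a single distinct value carry no information for a tree.
constant_cols = summary.index[summary["n_unique"] <= 1].tolist()

# Relative frequency of each class, useful for spotting imbalance.
class_share = labels.value_counts(normalize=True)
```

The same `summary` table can feed plots such as a histogram of `missing_frac` or a bar chart of `class_share`.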

***Write your code below***


In [41]:
# Implement here
x_train['HotPairingMRNA_he_P1_L1'].nunique(dropna=True)


2

In [None]:
# Explain here

### Preprocessing (10 pts):
Describe in detail what you did in the preprocessing phase and why you did it.

***Write your code below***

In [57]:
# Implement here
labeled_mask = y_train['label'].notnull()
x_labeled = x_train.loc[labeled_mask]
y_labeled = y_train.loc[labeled_mask]
x_not_labeled = x_train.loc[~labeled_mask]

x_trainval, x_traintest, y_trainval, y_traintest = train_test_split(
    x_labeled, y_labeled,
    test_size=0.2,
    stratify=y_labeled,
    random_state=42
)


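class DropZeroStd(BaseEstimator, TransformerMixin):
    """Keep only the numeric columns whose standard deviation is positive."""

    def fit(self, X, y=None):
        # Only numeric columns are inspected, so non-numeric columns are
        # never added to columns_to_keep_ and are dropped by transform.
        X = X.select_dtypes(include=['int64', 'float64'])
        stds = X.std(axis=0, skipna=True)
        self.columns_to_keep_ = stds[stds > 0].index.tolist()
        return self

    def transform(self, X):
        return X[self.columns_to_keep_]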

def build_preprocessor(X):

    # Drop zero-std columns (done as a step in the full pipeline)
    dropzerostd = DropZeroStd()
    X_clean = dropzerostd.fit_transform(X.copy())

    # Detect boolean numeric columns
    bool_cols = [
        col for col in X_clean.select_dtypes(include=['int64', 'float64']).columns
        if set(X_clean[col].dropna().unique()).issubset({0, 1})
    ]

    # All object/category/bool columns
    cat_cols = (
        X_clean.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
        + bool_cols
    )
    cat_cols = list(set(cat_cols))  # remove duplicates

    # Final numeric columns = numeric - bool
    num_cols = [
        col for col in X_clean.select_dtypes(include=['int64', 'float64']).columns
        if col not in cat_cols
    ]

    # Pipelines
    num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])

    cat_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Column transformer
    preprocessor = Pipeline([
        ('drop_const', dropzerostd),
        ('column_processing', ColumnTransformer([
            ('num', num_pipeline, num_cols),
            ('cat', cat_pipeline, cat_cols)
        ]))
    ])

    return preprocessor

# 1. Build preprocessing + model pipeline
model_pipeline = Pipeline([
    ('preprocessor', build_preprocessor(x_trainval)),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# 2. Define parameter grid for the classifier
param_grid = {
    'classifier__max_depth': [3, 5, 10, None],
    'classifier__min_samples_leaf': [1, 5, 10]
}

# 3. Setup grid search with CV
grid = GridSearchCV(
    model_pipeline,
    param_grid=param_grid,
    cv=5,  # Number of folds
    scoring='accuracy',  # or another scorer name, e.g. 'f1_macro' or 'balanced_accuracy'
    n_jobs=-1  # parallelism if supported
)

# 4. Run grid search on your training data
grid.fit(x_trainval, y_trainval)

# 5. Get best parameters and score (the score is accuracy, matching `scoring`)
print("Best parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)

# 6. Score the refit best estimator on the held-out labeled split
holdout_accuracy = grid.score(x_traintest, y_traintest)



Best parameters: {'classifier__max_depth': 10, 'classifier__min_samples_leaf': 10}
Best cross-validated accuracy: 0.6080550098231827


In [None]:
# Explain here

### Model training (15 pts):
Train your Decision Tree model.
Explain in detail what model you used to achieve your highest score, what the hyper-parameters were, and why you chose both the model and these parameters.
Attach at least two learning plots and explain them.
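
One way to obtain the numbers behind such plots is scikit-learn's `learning_curve` (score versus training-set size) and `validation_curve` (score versus one hyper-parameter). The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the competition data, and the chosen `min_samples_leaf` and depth range are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=3, random_state=0
)
tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)

# Learning curve: score as a function of the training-set size.
sizes, train_scores, val_scores = learning_curve(
    tree, X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring="accuracy"
)

# Validation curve: score as a function of one hyper-parameter (max_depth).
depths = [1, 2, 3, 5, 8, 12]
depth_train, depth_val = validation_curve(
    tree, X, y, param_name="max_depth", param_range=depths, cv=5, scoring="accuracy"
)

# Averaging over folds gives one point per x-axis value, ready for plotting.
lc_train_mean, lc_val_mean = train_scores.mean(axis=1), val_scores.mean(axis=1)
vc_train_mean, vc_val_mean = depth_train.mean(axis=1), depth_val.mean(axis=1)
```

Plotting `sizes` against the two learning-curve means, and `depths` against the two validation-curve means, shows where the gap between training and validation scores opens up, which is the usual sign of overfitting.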

***Write your code below***

In [None]:
# Implement here

In [None]:
# Explain here

### Model evaluation (15 pts):
Evaluate your ML model using different evaluation metrics.
For every evaluation metric mentioned below, add your model score and answer the following questions:

What does this evaluation metric mean? Is it relevant to this prediction task?
Do you think the score you got is good for this task?
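
To illustrate how the averaging modes differ, the sketch below computes accuracy, balanced accuracy, and micro, macro, and weighted precision, recall, and F1 on a tiny made-up, imbalanced 3-class example. The labels are invented for illustration and are unrelated to the competition data.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_recall_fscore_support,
)

# Tiny imbalanced 3-class example (made-up labels).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
}
for avg in ("micro", "macro", "weighted"):
    # zero_division=0 scores a class that is never predicted as 0 precision.
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    scores[f"{avg}_precision"] = p
    scores[f"{avg}_recall"] = r
    scores[f"{avg}_f1"] = f1
```

In a single-label multi-class setting, the micro-averaged scores and the weighted recall coincide with accuracy, while macro recall coincides with balanced accuracy. The macro scores are therefore the ones that react to the rare class being missed entirely.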


***Write your code below***

In [60]:
# Implement here

best_model = grid.best_estimator_
y_pred = best_model.predict(x_test)
# The ids are stored in the index of x_test (it was read with index_col="id").
submission = pd.DataFrame({
    'id': x_test.index,
    'label': y_pred
})

submission.to_csv('submission.csv', index=False)


In [None]:
# List of evaluation metrics
# Accuracy -
# Balanced Accuracy -
# Micro Precision -
# Micro Recall -
# Micro F1-score -
# Macro Precision -
# Macro Recall -
# Macro F1-score -
# Weighted Precision -
# Weighted Recall -
# Weighted F1-score -

### Explainability (10 pts):
Explain the results of your model using SHAP and attach relevant outputs. Explain at least three conclusions following the SHAP outputs.

**Note:**
Use the animal names in your conclusions and not the label numbers.

***Write your code below***

In [None]:
# Implement here

In [None]:
# Explain here

### Competition rank (5 pts - bonus):
The competition will be open until 10.6.25 at 23:59. The results of the competition will be published about 12 hours later under the private tab on the leaderboard.

The scoring of this section is relative to your position on the leaderboard (between 0 and 5 pts).

Indicate here your team name in the competition and **attach an additional notebook/Python code** with which we can reproduce the rank you received.


In [None]:
# My team name was: