In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **HW 2: Classification**
In *homework 2*, you need to finish:

1.  Basic Part:

> *   Step 1: Load the input Data
> *   Step 2: Implement Naive Bayesian classifier
> *   Step 3: Build the classifier and check the correctness of Table building
> *   Step 4: Split Data
> *   Step 5: Make prediction and perform evaluation
> *   Step 6: Generate results

2.  Advanced Part:

> *   Step 1: Load the input Data
> *   Step 2: Implement Gaussian Naive Bayesian classifier
> *   Step 3: Build the classifier and check the correctness of Table building
> *   Step 4: Improve the classifier for Ranking
> *   Step 5: Make prediction and perform evaluation
> *   Step 6: Generate results

# **1. Basic Part (55%)**
In the first part, you need to implement the Naive Bayesian classifier:
- input features: ***9 physiological features***
- output prediction: ***hospital_death***

## *Import Packages*

> Note: You **cannot** import any other package

In [2]:
import numpy as np
import pandas as pd
import csv
import math
import random
import pickle

## Global attributes
> You can add your own global attributes here


## Step 1: Load the input Data
Load the input file **hw2_basic_training.csv**

> Note: please don't change the input var name ***training_df, testing_df, X, and y***.

In [4]:
# TODO: modify your file path
training_df = pd.read_csv('/content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_basic_training.csv')
# /content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_basic_training.csv
testing_df = pd.read_csv('/content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_basic_testing.csv')
# /content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_basic_testing.csv
y = training_df['hospital_death']
X = training_df.drop('hospital_death', axis=1)

You can take a look at the format of the input features and the ground truth:

In [5]:
testing_df

Unnamed: 0,ventilated_apache,gcs_eyes_apache,gcs_motor_apache,gcs_verbal_apache,albumin_apache,bilirubin_apache,bun_apache,creatinine_apache,fio2_apache
0,0,4,6,5,0,0,0,0,0
1,1,3,5,1,0,0,0,0,0
2,0,4,6,5,0,0,0,0,0
3,0,3,5,4,0,0,0,0,1
4,0,4,6,5,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...
9807,0,3,5,3,1,1,1,1,0
9808,1,4,6,5,1,1,1,1,1
9809,0,4,6,5,1,1,1,1,0
9810,0,3,6,4,0,0,1,1,1


In [None]:
 X[:10]

Unnamed: 0,ventilated_apache,gcs_eyes_apache,gcs_motor_apache,gcs_verbal_apache,albumin_apache,bilirubin_apache,bun_apache,creatinine_apache,fio2_apache
0,1,4,6,1,0,0,1,1,1
1,0,4,6,4,1,1,1,1,0
2,0,4,6,5,0,0,0,0,0
3,0,3,6,4,0,0,1,1,0
4,1,3,3,1,0,0,1,1,1
5,0,4,6,5,0,0,0,0,0
6,1,4,6,5,0,0,1,1,1
7,0,4,6,5,1,1,1,1,0
8,1,4,6,5,0,0,0,0,0
9,0,4,6,5,1,1,1,1,0


In [None]:
y[:10]

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: hospital_death, dtype: int64

## Step 2: Implement Naive Bayesian classifier
In this part, you need to implement the Naive Bayesian classifier. Since the data is categorical, you can refer to the L3-Bayesian Classifier course slides, p.12~16. Bayes' theorem is as follows:

$$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$$

We know that taking the logarithm of a series of multiplications can be transformed into a series of additions, making it easier to calculate. So, we can formulate it as follows by taking the logarithm of both sides:

$$\log(P(C|X)) = \log(P(X|C)) + \log(P(C)) - \log(P(X))$$

The term $\log(P(X))$ is a normalization constant that ensures the probabilities sum to 1 across all classes and is the same for all classes. Since it's constant during prediction for all classes, it doesn't affect class comparison. Therefore, in practice, you don't need to compute or include $\log(P(X))$ explicitly when comparing classes. Instead, you can focus on the relative values of the posterior probabilities.

So the **final equation** for implementation will be:
$$\log(P(C|X)) = \log(P(X|C)) + \log(P(C))$$
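To see why the log form is preferred in practice, here is a minimal sketch. It uses made-up probabilities, not values from the homework data, and the names `likelihoods`, `priors`, `direct` and `log_scores` are illustrative. It multiplies many small likelihoods directly and then sums their logs instead.

```python
import numpy as np

# Hypothetical per-feature likelihoods for two classes (illustrative numbers only).
likelihoods = {
    0: np.full(400, 0.10),   # 400 features, each with P(x_i | C=0) = 0.10
    1: np.full(400, 0.12),   # 400 features, each with P(x_i | C=1) = 0.12
}
priors = {0: 0.9, 1: 0.1}

# Direct product: underflows to 0.0 for both classes, so they cannot be compared.
direct = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}

# Log form: log P(C) + sum_i log P(x_i | C) stays finite and comparable.
log_scores = {c: np.log(priors[c]) + np.sum(np.log(likelihoods[c])) for c in priors}

best_class = max(log_scores, key=log_scores.get)
print(direct)
print(log_scores)
print(best_class)
```

With only 9 features underflow is unlikely, but the same code pattern scales to many features, and comparing sums of logs picks the same class as comparing the products would.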

However, the log form still breaks down if a likelihood $P(X|C)$ equals 0, because $\log(0)$ is undefined. This happens when a feature value never appears together with class $C$ in the training data. A simple way to handle this case is to assume that one new record is added to the table and use the following fallback likelihood:

`_likelihood = 1 / (len(self.feature_probs_table[c][feature]) + 1)`
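The fallback can be sketched in isolation as follows. The table contents and the helper name `lookup_likelihood` are hypothetical and only illustrate the lookup, assuming the same `[value] = probs` layout for a single class and feature.

```python
import numpy as np

# Hypothetical conditional table for one class and one feature:
# the value 2 never occurred with this class in the training data.
feature_probs = {1: 0.1, 3: 0.2, 4: 0.7}

def lookup_likelihood(feature_probs, value):
    """Return P(value | class), falling back to 1 / (k + 1) for unseen values,
    where k is the number of distinct values seen for this class."""
    if value in feature_probs:
        return feature_probs[value]
    return 1 / (len(feature_probs) + 1)

seen = lookup_likelihood(feature_probs, 3)
unseen = lookup_likelihood(feature_probs, 2)
print(seen, unseen, np.log(unseen))
```

The fallback keeps the logarithm finite, although the values for this feature no longer sum exactly to 1 once the fallback is used.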


In [6]:
X.columns

Index(['ventilated_apache', 'gcs_eyes_apache', 'gcs_motor_apache',
       'gcs_verbal_apache', 'albumin_apache', 'bilirubin_apache', 'bun_apache',
       'creatinine_apache', 'fio2_apache'],
      dtype='object')

In [7]:
class NaiveBayes:
    def build_table(self, X, y):
        # classes for ground truth: there are only negative(0) and positive(1) for y (hospital_death)
        self.classes = np.unique(y)
        # record prior for two classes
        self.class_priors = {}
        # **feature_probs_table** is a 3D dictionary table:
        # structure: [class]    [feature]           [value] = probs
        # example:   [0]        ['gcs_eyes_apache'] [3]     = # of (hospital_death=0 && gcs_eyes_apache=3) / # of (hospital_death=0)
        # for more usage of python dict, you can refer to the link: https://www.w3schools.com/python/python_dictionaries.asp
        self.feature_probs_table = {}

        # consider negative(0) and positive(1) separately
        for c in self.classes:
            X_c = X[y == c]
            self.class_priors[c] = len(X_c)/len(X) # TODO: calculate the prior
            self.feature_probs_table[c] = {}

            for feature in X.columns:
                self.feature_probs_table[c][feature] = {}
                for value in np.unique(X_c[feature]):
                    value_count = len(X_c[feature][X_c[feature]==value]) # TODO: Calculate the count of data points with the current feature value and the current class
                    # print(str(c) + ": " + str(feature) + " " + str(value_count))
                    total_count = len(X_c) # TODO: Calculate the total count of data points with the current feature within the current class
                    # print(str(c) + ": " + str(feature) + " " + str(total_count))
                    self.feature_probs_table[c][feature][value] = value_count/total_count # TODO: Calculate and store the conditional probability of the feature value given the class
                    # print()

    def predict(self, X):
        predictions = [self._predict(x) for x in X.to_dict(orient='records')]
        # predictions = [self._predict(x) for x in X.to_dict()]
        # print(str('prediciton :') + predictions)
        return predictions

    def _predict(self, x):
        # print("test x")
        # print(x)

        log_posteriors = []

        # this for loop implements: log(posterior) = log(prior) + log(likelihood)
        # print("NEW PREDICT")
        for c in self.classes:
            log_prior = np.log(self.class_priors[c])
            log_likelihood = 0
            for feature, value in x.items():
                if value in self.feature_probs_table[c][feature]:
                    # you can look up the table for the likelihood
                    _likelihood = self.feature_probs_table[c][feature][value] # TODO: calculate likelihood by the self.feature_probs_table
                    # print("PRINT LIKEHOOD")
                    # print(_likelihood)
                else:
                    # to fix issue of some cases not appear in the table
                    _likelihood = 1 / (len(self.feature_probs_table[c][feature]) + 1)
                log_likelihood += np.log(_likelihood)
            log_posterior = log_prior + log_likelihood
            # print("LOGTEST")
            # print(log_likelihood)
            # print(log_prior)
            # print(log_posterior)
            # print((c, log_posterior))
            log_posteriors.append((c, log_posterior))
        # print("PRINTING LOG LIST")
        # print(log_posteriors)
        return max(log_posteriors, key = lambda i : i[1])[0] # Return the class with the highest logarithm of posterior probability as the predicted class
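As a cross-check of the `[class][feature][value] = probs` structure, the same kind of table can be built with pandas on a tiny data set. This is a sketch on made-up toy data, not the homework implementation; `toy_X`, `toy_y` and `toy_table` are illustrative names.

```python
import pandas as pd

# Toy data (made up for illustration): one categorical feature and a binary label.
toy_X = pd.DataFrame({"gcs_eyes_apache": [4, 4, 3, 1, 4, 3]})
toy_y = pd.Series([0, 0, 0, 1, 1, 1], name="hospital_death")

toy_table = {}
for c in sorted(toy_y.unique()):
    rows = toy_X[toy_y == c]
    # value_counts(normalize=True) gives count(value and class) / count(class)
    toy_table[int(c)] = {
        feature: rows[feature].value_counts(normalize=True).to_dict()
        for feature in toy_X.columns
    }

print(toy_table)
```

In class 0 the value 4 occurs in 2 of 3 records, so its entry is 2/3. Values that never occur in a class are simply absent from the inner dictionary, which is the case the fallback likelihood handles.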


## Step 3: Build the classifier and check the correctness of Table building
You can easily build an instance of your classifier and then create the table.

To check whether you have correctly built the table of the Naive Bayesian classifier, there is an example for you to ensure that your implementation is correct.


In [8]:
# Create and build the dictionary table for Naive Bayes classifier
nb_classifier = NaiveBayes()
nb_classifier.build_table(X, y)

And you also need to output the dictionary variable ***feature_probs_table*** as a pickle file, which will be examined for correctness.
> Note: Since this is for checking the implementation of the build_table method, please ensure that the input for your table building, ***X and y,*** is taken from the provided hw2_basic_training.csv file ***BEFORE*** splitting the dataset into training and validation sets.

> Hint: Two values for you to check the implementation correctness:

> `nb_classifier.feature_probs_table[0]['gcs_eyes_apache'][3]` is
0.15299

> `nb_classifier.feature_probs_table[1]['gcs_eyes_apache'][3]` is
0.15896

In [9]:
nb_classifier.feature_probs_table

{0: {'ventilated_apache': {0: 0.7118389628113272, 1: 0.2881610371886728},
  'gcs_eyes_apache': {1: 0.07033776867963153,
   2: 0.04786079836233367,
   3: 0.15298532923916752,
   4: 0.7288161037188673},
  'gcs_motor_apache': {1: 0.04424428522688502,
   2: 0.002633913340156943,
   3: 0.0047355851245308766,
   4: 0.04499488229273286,
   5: 0.08399863527806209,
   6: 0.8193926987376322},
  'gcs_verbal_apache': {1: 0.15969976117366086,
   2: 0.02025247355851245,
   3: 0.03582395087001024,
   4: 0.12038212214261344,
   5: 0.663841692255203},
  'albumin_apache': {0: 0.6001228249744115, 1: 0.39987717502558856},
  'bilirubin_apache': {0: 0.6406004776526782, 1: 0.35939952234732175},
  'bun_apache': {0: 0.20649607642442852, 1: 0.7935039235755714},
  'creatinine_apache': {0: 0.20155578300921187, 1: 0.7984442169907882},
  'fio2_apache': {0: 0.7972705561241897, 1: 0.2027294438758103}},
 1: {'ventilated_apache': {0: 0.3357620817843866, 1: 0.6642379182156134},
  'gcs_eyes_apache': {1: 0.331598513011152

In [10]:
nb_classifier.class_priors

{0: 0.9159375, 1: 0.0840625}

In [11]:
if round(nb_classifier.feature_probs_table[0]['gcs_eyes_apache'][3], 5) == 0.15299 and \
   round(nb_classifier.feature_probs_table[1]['gcs_eyes_apache'][3], 5) == 0.15896:
    print('pass')
else:
    print('fail')

pass


In [12]:
# TODO: change your path to save the pickle file
pickle_file_path = 'hw2_basic_table'
# The with-block closes the file automatically on exit
with open(pickle_file_path, 'wb') as table_file:
    pickle.dump(nb_classifier.feature_probs_table, table_file)
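Before submitting, it can be worth reloading the pickle file to confirm that it round-trips. The sketch below uses a stand-in dictionary and a temporary path, both of which are assumptions for illustration, so that the real submission file is not touched.

```python
import os
import pickle
import tempfile

# Stand-in for nb_classifier.feature_probs_table (values are illustrative only).
example_table = {0: {"gcs_eyes_apache": {3: 0.15, 4: 0.85}},
                 1: {"gcs_eyes_apache": {3: 0.20, 4: 0.80}}}

# Write to a temporary path so the real submission file is not overwritten.
example_path = os.path.join(tempfile.mkdtemp(), "example_table")
with open(example_path, "wb") as f:
    pickle.dump(example_table, f)

# Read the file back and compare it with the original dictionary.
with open(example_path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == example_table)
```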

## Step 4: Split Data
Split the data in *X, and y* into the training dataset and validation dataset.
> Note: You can use what you have implemented in hw1.

Since many students may not understand the meaning of a validation set, let's provide more explanation:

The purpose of a validation set is to determine whether our model is overfitting the training data.
- Underfitting: If the performance on the training set is poor (e.g., you haven't prepared enough for exam 1).
- Overfitting: If the performance on the training set is high, but the performance on the validation set is poor. (e.g., if you've focused solely on practicing with "past exam papers" (考古題) for exam 1, you might answer those questions correctly but struggle with new, unfamiliar questions during the actual exam).

If we achieve good performance on both the training set and the validation set, it indicates that the model has not only learned from the training data but also has the ability to make accurate inferences on unseen data.

![](https://hackmd.io/_uploads/SJLptEZWT.png)

Please split the dataset into training set and validation set

> Note: The purpose of ***random_state*** is to ensure that you can reproduce the results each time you split your dataset. This is often helpful for debugging.
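One possible way to make a shuffled split reproducible is to seed a random generator with `random_state`. The helper below is a hypothetical sketch on toy data, not the required implementation; `shuffled_train_val_split` and the toy frames are illustrative names, and `val_size` is assumed to be a fraction.

```python
import numpy as np
import pandas as pd

def shuffled_train_val_split(X, y, val_size=0.2, random_state=0):
    """Hypothetical helper: shuffle row positions with a seeded generator,
    then hold out the last `val_size` fraction as the validation set."""
    rng = np.random.RandomState(random_state)
    order = rng.permutation(len(X))
    n_val = int(round(len(X) * val_size))
    train_idx, val_idx = order[:len(X) - n_val], order[len(X) - n_val:]
    return X.iloc[train_idx], X.iloc[val_idx], y.iloc[train_idx], y.iloc[val_idx]

toy_X = pd.DataFrame({"feature": range(10)})
toy_y = pd.Series([0, 1] * 5, name="label")

# Two calls with the same seed select the same validation rows.
a = shuffled_train_val_split(toy_X, toy_y, val_size=0.2, random_state=0)
b = shuffled_train_val_split(toy_X, toy_y, val_size=0.2, random_state=0)
print(len(a[0]), len(a[1]))
print(a[1].index.tolist() == b[1].index.tolist())
```

Shuffling matters mostly when the rows of the file are ordered in some way. If the rows are already in random order, taking the first rows for training gives a comparable split.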


In [13]:
def train_val_split(X, y, val_size, random_state):
    # val_size is the fraction of rows held out for validation (e.g. 0.2 = 20%).
    # random_state is kept for compatibility with the template; this split is
    # deterministic (first rows for training, last rows for validation, no
    # shuffling), so it is not used.
    n_total = len(X)
    n_val = int(round(n_total * val_size))
    n_train = n_total - n_val
    X_train = X[:n_train]
    X_val = X[n_train:]
    y_train = y[:n_train]
    y_val = y[n_train:]

    return X_train, X_val, y_train, y_val

In [14]:
# TODO: Split the data into training and validation sets
# Note: please follow template for the format of return variables
X_train, X_val, y_train, y_val = train_val_split(X, y, 0.2, 0) # 80% training, 20% validation

In [15]:
# Check Split Data
y_val

64000    0
64001    0
64002    0
64003    1
64004    0
        ..
79995    0
79996    0
79997    0
79998    0
79999    0
Name: hospital_death, Length: 16000, dtype: int64

## Step 5: Make predictions and perform evaluation
The metric we will use to evaluate the performance of the Bayesian classifier is the ***F1-score***:

$$\text{precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$

$$\text{recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$

$$F1\text{-}score = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

![](https://hackmd.io/_uploads/B1tfT9UWp.png)

Since the number of ground truth ***hospital_death*** where the outcome is positive is much less than the number of negative outcomes, we should focus on the f1-score of the positive class.

You need to implement the f1-score function to evaluate the performance of the Naive Bayesian classifier.

> Note: You should test your classifier by evaluating it on the training set and the validation set.
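The following sketch illustrates why the positive-class F1-score is more informative than accuracy on imbalanced labels. The labels are made up and the helper name `f1_from_labels` is hypothetical. A classifier that always predicts the negative class looks accurate but has an F1-score of 0.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Sketch of the positive-class F1-score; returns 0.0 when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision or recall is 0 (or undefined), so F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 positives among 10 records.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                                  # accuracy of "always negative"
print(f1_from_labels(y_true, always_negative))   # F1 of "always negative"
print(f1_from_labels(y_true, one_hit))           # one TP, one FP, one FN
```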


In [16]:
def cal_f1_score(y_true, y_pred):
    # Calculate True Positives, False Positives, False Negatives
    tp = sum(1 for true, pred in zip(y_true, y_pred) if true == 1 and pred == 1)
    fp = sum(1 for true, pred in zip(y_true, y_pred) if true == 0 and pred == 1) # predicted positive, actually negative
    fn = sum(1 for true, pred in zip(y_true, y_pred) if true == 1 and pred == 0) # predicted negative, actually positive

    # Calculate precision and recall (0.0 when the denominator is zero)
    precision = tp/(tp+fp) if (tp+fp) > 0 else 0.0
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    print("TP | FP | FN")
    print(str(tp) + ' | ' + str(fp) + ' | ' + str(fn))

    # Calculate F1-score
    if precision + recall == 0:
        return 0.0
    f1_score = (2*precision*recall)/(precision+recall)

    return f1_score

In [17]:
# TODO: build table on the training set
nb_classifier_Pers = NaiveBayes()
nb_classifier_Pers.build_table(X_train, y_train)

# TODO: Make predictions on training set and calculate the f1-score
Y_Train_Prediction = nb_classifier_Pers.predict(X_train)
Y_Train_List = y_train.tolist()
Train_F1_Score = cal_f1_score(Y_Train_List, Y_Train_Prediction) # ground truth first, predictions second

# TODO: Make predictions on validation set and calculate the f1-score
Y_Valid_Prediction = nb_classifier_Pers.predict(X_val)
Y_Valid_List = y_val.tolist()
Valid_F1_Score = cal_f1_score(Y_Valid_List, Y_Valid_Prediction) # ground truth first, predictions second

# X_train, X_val, y_train, y_val

TP | FP | FN
2337 | 6861 | 3018
TP | FP | FN
636 | 1710 | 734


In [21]:
print("Train: ", end='')
print(Train_F1_Score)
print("Valid: ", end='')
print(Valid_F1_Score)

Train: 0.3211708925994641
Valid: 0.34230355220667386


In [22]:
nb_classifier.class_priors

{0: 0.9159375, 1: 0.0840625}

In [23]:
nb_classifier_Pers.feature_probs_table

{0: {'ventilated_apache': {0: 0.7110410094637224, 1: 0.2889589905362776},
  'gcs_eyes_apache': {1: 0.07079887458436354,
   2: 0.047779009293204874,
   3: 0.15232330121920026,
   4: 0.7290988149032313},
  'gcs_motor_apache': {1: 0.04436865887969989,
   2: 0.0026600733225338904,
   3: 0.004638076562366783,
   4: 0.044794952681388014,
   5: 0.08459374200699121,
   6: 0.8189444965470202},
  'gcs_verbal_apache': {1: 0.16001364140165403,
   2: 0.020291584960354676,
   3: 0.03613266263108535,
   4: 0.11980561002643021,
   5: 0.6637565009804758},
  'albumin_apache': {0: 0.6010913121323216, 1: 0.3989086878676784},
  'bilirubin_apache': {0: 0.6415721715406258, 1: 0.3584278284593742},
  'bun_apache': {0: 0.2067013385625373, 1: 0.7932986614374627},
  'creatinine_apache': {0: 0.20173927871088754, 1: 0.7982607212891124},
  'fio2_apache': {0: 0.797919686247762, 1: 0.20208031375223803}},
 1: {'ventilated_apache': {0: 0.34136321195144725, 1: 0.6586367880485527},
  'gcs_eyes_apache': {1: 0.3271708683473

## Step 6: Generate result
> Note: Please follow the format mentioned in the slides. You can only change the path for saving your code down below.

In [24]:
predictions = nb_classifier_Pers.predict(testing_df) # TODO: predict on the testing_df
#testing_df

# TODO: Specify the CSV file path
csv_file_path = 'hw2_basic_prediction.csv'

# Write the predictions to the CSV file
with open(csv_file_path, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['hospital_death'])
    for prediction in predictions:
        writer.writerow([prediction])

# **2. Advanced Part (40%)**
In the advanced part, you need to implement the Gaussian Naive Bayesian classifier:
- input features: ***3 physiological features***
- output prediction: ***diabetes_mellitus***

## Global attributes
> You can add your own global attributes here


## Step 1: Load the input Data
Load the input file **hw2_advanced_training.csv**
> Note: please don't change the input var name ***training_df, testing_df, X, and y***.

In [25]:
# TODO: modify your file path
training_df = pd.read_csv('/content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_advanced_training.csv')
testing_df = pd.read_csv('/content/drive/MyDrive/IntroToMachineLearning/Assignment2/hw2_advanced_testing.csv')
y = training_df['diabetes_mellitus']
X = training_df.drop('diabetes_mellitus', axis=1)

In [28]:
testing_df

Unnamed: 0,age,bmi,glucose_apache
0,58,30.955908,377.0
1,88,17.764481,145.0
2,41,57.093426,163.0
3,82,39.100346,96.0
4,85,21.755430,170.0
...,...,...,...
9884,77,24.243339,248.0
9885,64,30.217293,89.0
9886,70,22.474579,45.0
9887,64,33.127140,177.0


In [26]:
X[:10]

Unnamed: 0,age,bmi,glucose_apache
0,64,40.808081,358.0
1,82,22.782294,183.0
2,54,19.651056,84.0
3,18,29.722807,89.0
4,46,24.404819,569.0
5,73,16.366843,267.0
6,63,26.386867,236.0
7,40,23.765104,89.0
8,58,47.528345,89.0
9,76,39.14707,193.0


In [27]:
y[:10]

0    1
1    0
2    0
3    0
4    1
5    1
6    1
7    1
8    0
9    1
Name: diabetes_mellitus, dtype: int64

you can check whether the standardization works

## Step 2: Implement Gaussian Naive Bayesian classifier
In this part, you need to implement the Gaussian Naive Bayesian classifier.

The main difference between Naive Bayesian and Gaussian Naive Bayesian is the likelihood part. For Gaussian NB, we can use the probability density function (PDF) of the ***Gaussian distribution*** (also known as the Normal distribution):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} exp({-\frac{(x - \mu)^2}{2\sigma^2}})$$

The reason we need to use Gaussian is that when the data type is continuous numbers instead of discrete numbers, we can't build a table by just counting all the possible cases. However, we can assume the data distribution follows a Gaussian (or Normal) distribution by calculating its mean and variance. Then, we can approximate the values, even if some records don't appear in the training set.
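As a minimal sketch of the density above, the two functions below evaluate the Gaussian pdf and its logarithm. The function names and the numbers are illustrative assumptions, not values from the homework data. Computing the log directly avoids taking the logarithm of a density that has underflowed to 0.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian density f(x) with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def gaussian_log_pdf(x, mean, var):
    """log f(x), computed without exponentiating first."""
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)

# Illustrative numbers only (not taken from the homework data).
x, mean, var = 30.0, 28.0, 64.0
print(gaussian_pdf(x, mean, var))
print(gaussian_log_pdf(x, mean, var))

# Far from the mean the density underflows to 0.0, but its log is still finite.
far = 1000.0
print(gaussian_pdf(far, mean, var), gaussian_log_pdf(far, mean, var))
```

Note that a density is not a probability: for a small variance it can exceed 1, so its log can be positive.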


In [29]:
class GaussianNaiveBayesian:
    def build_table(self, X, y):
        # classes for ground truth: there are only negative(0) and positive(1) for y (diabetes_mellitus)
        self.classes = np.unique(y)
        # record prior for two classes
        self.class_priors = {}
        # **feature_mean_var_table** is a 3D dictionary table:
        # structure: [class]    [feature]   ['mean'] = mean
        # structure: [class]    [feature]   ['var']  = var
        # example:   [0]        ['bmi']     ['mean'] = mean for feature='bmi' when diabetes_mellitus=0
        # example:   [0]        ['bmi']     ['var']  = var for feature='bmi' when diabetes_mellitus=0
        self.feature_mean_var_table = {}
        for c in self.classes:
            X_c = X[y == c]
            # prior = fraction of training records that belong to class c
            self.class_priors[c] = len(X_c)/len(X)
            self.feature_mean_var_table[c] = {}
            for feature in X.columns:
                self.feature_mean_var_table[c][feature] = {}
                # Calculate mean and sample variance (ddof=1) for each feature, rounded to 5 decimals
                self.feature_mean_var_table[c][feature]['mean'] = round(np.mean(X_c[feature]), 5)
                self.feature_mean_var_table[c][feature]['var'] = round(np.var(X_c[feature], ddof=1), 5)

    def _calculate_likelihood(self, x, mean, var):
        # Gaussian (normal) pdf evaluated at x, used as the likelihood
        likelihood_gauss = ( 1/(math.sqrt(2*math.pi*var)) ) * ( math.exp( -( (x-mean)**2 / (2*var)) ) )
        return likelihood_gauss

    def predict(self, X):
        predictions = [self._predict(x) for x in X.to_dict(orient='records')]
        return predictions

    def _predict(self, x):
        log_posteriors = []
        # this for loop implements: log(posterior) = log(prior) + log(likelihood)
        for c in self.classes:
            log_prior = np.log(self.class_priors[c])
            log_likelihood = 0
            for feature, value in x.items():
                currentmean = self.feature_mean_var_table[c][feature]['mean']
                currentvar = self.feature_mean_var_table[c][feature]['var']
                # add the log of the Gaussian density, so the sum is a log-likelihood
                log_likelihood += np.log(self._calculate_likelihood(value, currentmean, currentvar))
            log_posterior = log_prior + log_likelihood
            # print("LOGTEST")
            # print(log_likelihood)
            # print(log_prior)
            # print(log_posterior)
            # print((c, log_posterior))
            log_posteriors.append((c, log_posterior))
        # print("PRINTING LOG LIST")
        # print(log_posteriors)
        return max(log_posteriors, key = lambda i : i[1])[0] # TODO: Return the class with the highest logarithm of posterior probability as the predicted class


## Step 3: Build the classifier and check the correctness of Table building
You can easily build an instance of your classifier and then create the table.

To check whether you have correctly built the table of the ***Gaussian Naive Bayesian classifier***, there is an example for you to ensure that your implementation is correct.


In [34]:
# Initialize and build_table the model
gnb_classifier = GaussianNaiveBayesian()
gnb_classifier.build_table(X, y)

In [35]:
gnb_classifier.feature_mean_var_table

{0: {'age': {'mean': 61.98273, 'var': 292.33361},
  'bmi': {'mean': 28.61544, 'var': 63.57263},
  'glucose_apache': {'mean': 141.82843, 'var': 5397.29461}},
 1: {'age': {'mean': 64.59767, 'var': 205.38742},
  'bmi': {'mean': 31.99082, 'var': 80.94882},
  'glucose_apache': {'mean': 219.66866, 'var': 12779.24182}}}

In [36]:
gnb_classifier.class_priors

{0: 0.7517538461538461, 1: 0.24824615384615384}

And you also need to output the dictionary variable ***feature_mean_var_table*** as a pickle file, which will be examined for correctness.
> Note: Since this is for checking the implementation of the build_table method, please ensure that the input for your table building, ***X and y,*** is taken from the provided hw2_advanced_training.csv file ***BEFORE*** splitting the dataset into training and validation sets.

> Hint: Two values for you to check the implementation correctness:


> `gnb_classifier.feature_mean_var_table[0]['bmi']['mean']` is
28.61544

> `gnb_classifier.feature_mean_var_table[0]['bmi']['var']` is
63.57263

In [37]:
# *** 10/18 update: the value of ***
if round(gnb_classifier.feature_mean_var_table[0]['bmi']['mean'], 5) == 28.61544 and \
   round(gnb_classifier.feature_mean_var_table[0]['bmi']['var'], 5) == 63.57263:
    print('pass')
else:
    print('fail')

pass


In [38]:
# TODO: change your path to save the pickle file
pickle_file_path = 'hw2_advanced_table'
# The with-block closes the file automatically on exit
with open(pickle_file_path, 'wb') as table_file:
    pickle.dump(gnb_classifier.feature_mean_var_table, table_file)

## Step 4: Improve the classifier for Ranking (15%)

To make your model have better performance, you can try different ways to modify your model.

> hints (**you don't need to follow the hints**):

1. You can deal with the **outliers**
2. You can try first **converting real numbers into discrete values** and then using Naive Bayesian for classification.
3. You can try **defining a new class that gives the prior a different weight** in decision-making.
4. Anything you want to try based on Bayesian.

> Note: You need to consider what kind of operations should also be performed on the testing_df.
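One way to read hints 1 and 2 together with the note about `testing_df` is to learn the clipping limits and bin edges from the training data only, then apply exactly the same transformation to every other data set. The helpers `fit_preprocessing` and `apply_preprocessing`, the quantile limits and the number of bins are all assumptions for illustration, shown here on made-up values.

```python
import numpy as np
import pandas as pd

def fit_preprocessing(train_df, n_bins=4, lower_q=0.01, upper_q=0.99):
    """Hypothetical helper: learn clipping limits and bin edges from training data only."""
    params = {}
    for feature in train_df.columns:
        lo, hi = train_df[feature].quantile([lower_q, upper_q])
        clipped = train_df[feature].clip(lo, hi)
        # Interior quantile edges; np.unique sorts them and drops duplicates.
        edges = np.unique(clipped.quantile(np.linspace(0, 1, n_bins + 1)[1:-1]))
        params[feature] = (lo, hi, edges)
    return params

def apply_preprocessing(df, params):
    """Apply the learned limits and edges to any DataFrame with the same columns."""
    out = pd.DataFrame(index=df.index)
    for feature, (lo, hi, edges) in params.items():
        out[feature] = np.digitize(df[feature].clip(lo, hi).to_numpy(), edges)
    return out

# Made-up values: 900.0 and 1000.0 play the role of outliers.
toy_train = pd.DataFrame({"glucose_apache": [80.0, 90.0, 100.0, 120.0, 150.0, 200.0, 300.0, 900.0]})
toy_test = pd.DataFrame({"glucose_apache": [50.0, 110.0, 1000.0]})

params = fit_preprocessing(toy_train)
train_binned = apply_preprocessing(toy_train, params)
test_binned = apply_preprocessing(toy_test, params)
print(train_binned["glucose_apache"].tolist())
print(test_binned["glucose_apache"].tolist())
```

The resulting integer bins are categorical, so they could be fed to the table-based classifier from the basic part. Test values outside the training range are clipped first and therefore fall into the first or last bin.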

In [None]:
# TODO: you can try the hints written above to get higher ranking score

# training_df = pd.read_csv('hw2_advanced_training.csv') # TODO: modify your file path
# testing_df = pd.read_csv('hw2_advanced_testing.csv') # TODO: modify your file path
# Note: **you can change the order of following steps if you want**
# Note: **BUT, please make sure you have saved 'hw2_advanced_table' for submission BEFORE making the following improvements.**

# ... # hints(optional): deal with outliers
# ... # hints(optional): converting real numbers into discrete values
# ... # hints(optional): def a new class for giving the prior a different weight
# ... # hints: Split the data into training and validation sets
# ... # hints: build table for Bayesian Classifier
# ... # anything you want to try based on Bayesian

In [39]:
class GaussianNaiveBayesian_Pers:
    def build_table(self, X, y):
        # classes for ground truth: there are only negative(0) and positive(1) for y (diabetes_mellitus)
        self.classes = np.unique(y)
        # record prior for two classes
        self.class_priors = {}
        # **feature_mean_var_table** is a 3D dictionary table:
        # structure: [class]    [feature]   ['mean'] = mean
        # structure: [class]    [feature]   ['var']  = var
        # example:   [0]        ['bmi']     ['mean'] = mean for feature='bmi' when diabetes_mellitus=0
        # example:   [0]        ['bmi']     ['var']  = var for feature='bmi' when diabetes_mellitus=0
        self.feature_mean_var_table = {}
        for c in self.classes:
            X_c = X[y == c]
            # Additive smoothing of the prior: (n_c + Lambda) / (N + 2 * Lambda).
            # With a very large Lambda both priors are pulled towards 0.5.
            Lambda = 1000000000
            self.class_priors[c] = (len(X_c)+Lambda)/(len(X)+Lambda*2)
            self.feature_mean_var_table[c] = {}
            for feature in X.columns:
                self.feature_mean_var_table[c][feature] = {}
                # Calculate mean and sample variance (ddof=1) for each feature, rounded to 5 decimals
                self.feature_mean_var_table[c][feature]['mean'] = round(np.mean(X_c[feature]), 5)
                self.feature_mean_var_table[c][feature]['var'] = round(np.var(X_c[feature], ddof=1), 5)

    def _calculate_likelihood(self, x, mean, var):
        # Gaussian (normal) pdf evaluated at x, used as the likelihood
        likelihood_gauss = ( 1/(math.sqrt(2*math.pi*var)) ) * ( math.exp( -( (x-mean)**2 / (2*var)) ) )
        return likelihood_gauss

    def predict(self, X):
        predictions = [self._predict(x) for x in X.to_dict(orient='records')]
        return predictions

    def _predict(self, x):
        log_posteriors = []
        # this for loop scores each class as: log(prior) + sum of per-feature Gaussian densities
        for c in self.classes:
            log_prior = np.log(self.class_priors[c])
            log_likelihood = 0
            for feature, value in x.items():
                currentmean = self.feature_mean_var_table[c][feature]['mean']
                currentvar = self.feature_mean_var_table[c][feature]['var']
                # Note: this variant adds the raw densities, not their logs,
                # so the accumulated value is a score rather than a true log-likelihood
                log_likelihood += self._calculate_likelihood(value, currentmean, currentvar)
            log_posterior = log_prior + log_likelihood
            # print("LOGTEST")
            # print(log_likelihood)
            # print(log_prior)
            # print(log_posterior)
            # print((c, log_posterior))
            log_posteriors.append((c, log_posterior))
        # print("PRINTING LOG LIST")
        # print(log_posteriors)
        return max(log_posteriors, key = lambda i : i[1])[0] # TODO: Return the class with the highest logarithm of posterior probability as the predicted class

In [40]:
def train_val_split_AD(X, y, val_size, random_state):
    # val_size is the fraction of rows held out for validation (e.g. 0.2 = 20%).
    # random_state is kept for compatibility with the template; this split is
    # deterministic (first rows for training, last rows for validation, no
    # shuffling), so it is not used.
    n_total = len(X)
    n_val = int(round(n_total * val_size))
    n_train = n_total - n_val
    X_train = X[:n_train]
    X_val = X[n_train:]
    y_train = y[:n_train]
    y_val = y[n_train:]

    return X_train, X_val, y_train, y_val

X_train_AD, X_val_AD, y_train_AD, y_val_AD = train_val_split_AD(X, y, 0.2, 0) # 80% training, 20% validation

In [41]:
testx = X_train_AD[y_train_AD == 0]
print(len(testx))

39050


In [42]:
testx2 = X_train_AD[y_train_AD == 1]
print(len(testx2))

12950


In [43]:
len(testx2)/len(X_train_AD)

0.24903846153846154

## Step 5: Make predictions and perform evaluation
You should test your model by evaluating the training set and validation set using the ***cal_f1_score*** function you implemented.


In [44]:
def cal_f1_score_AD(y_true, y_pred):
    # Calculate True Positives, False Positives, False Negatives, True Negatives
    tp = sum(1 for true, pred in zip(y_true, y_pred) if true == 1 and pred == 1)
    fp = sum(1 for true, pred in zip(y_true, y_pred) if true == 0 and pred == 1) # predicted positive, actually negative
    fn = sum(1 for true, pred in zip(y_true, y_pred) if true == 1 and pred == 0) # predicted negative, actually positive
    tn = sum(1 for true, pred in zip(y_true, y_pred) if true == 0 and pred == 0)

    print("TP | FP | FN | TN")
    print(str(tp) + ' | ' + str(fp) + ' | ' + str(fn) + ' | ' + str(tn))

    # Calculate precision and recall (0.0 when the denominator is zero)
    precision = tp/(tp+fp) if (tp+fp) > 0 else 0.0
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0

    # Calculate F1-score
    if precision + recall == 0:
        return 0.0
    f1_score = (2*precision*recall)/(precision+recall)

    return f1_score

In [45]:
# TODO: build table on the training set
gnb_classifier_Pers = GaussianNaiveBayesian_Pers()
gnb_classifier_Pers.build_table(X_train_AD, y_train_AD)

# TODO: Make predictions on training set and calculate the f1-score
Y_Train_PredictionADgnb = gnb_classifier_Pers.predict(X_train_AD)
Y_Train_ListADgnb = y_train_AD.tolist()
Train_F1_ScoreADgnb = cal_f1_score_AD(Y_Train_ListADgnb, Y_Train_PredictionADgnb) # ground truth first, predictions second

# TODO: Make predictions on validation set and calculate the f1-score
Y_Valid_PredictionADgnb = gnb_classifier_Pers.predict(X_val_AD)
Y_Valid_ListADgnb = y_val_AD.tolist()
Valid_F1_ScoreADgnb = cal_f1_score_AD(Y_Valid_ListADgnb, Y_Valid_PredictionADgnb) # ground truth first, predictions second

# Predictions on the whole dataset (training + validation rows)
Y_ALL_PREDICT_AD_GNB = gnb_classifier_Pers.predict(X)
Y_ALL_LIST_AD_GNB = y.tolist()
ALL_F1_SCORE_AD_GNB = cal_f1_score_AD(Y_ALL_LIST_AD_GNB, Y_ALL_PREDICT_AD_GNB) # ground truth first, predictions second

# X_train_AD, X_val_AD, y_train_AD, y_val_AD

TP | FP | FN | TN
5533 | 8862 | 7417 | 30188
TP | FP | FN | TN
1371 | 2177 | 1815 | 7637
TP | FP | FN | TN
6904 | 11039 | 9232 | 37825


In [46]:
gnb_classifier_Pers.feature_mean_var_table

{0: {'age': {'mean': 61.97519, 'var': 292.96809},
  'bmi': {'mean': 28.62505, 'var': 63.5693},
  'glucose_apache': {'mean': 142.03088, 'var': 5438.23959}},
 1: {'age': {'mean': 64.49135, 'var': 207.05943},
  'bmi': {'mean': 31.999, 'var': 81.01078},
  'glucose_apache': {'mean': 219.87574, 'var': 12824.23284}}}

In [47]:
gnb_classifier_Pers.class_priors

{0: 0.5000065248303545, 1: 0.4999934751696456}

Normal= {0: 0.7509615384615385, 1: 0.24903846153846154}

λ:1 = {0: 0.7509518864659052, 1: 0.24904811353409484}

λ:10 = {0: 0.7509518864659052, 1: 0.24904811353409484}

λ:1000 = {0: 0.7416666666666667, 1: 0.25833333333333336}

λ:1000 000 000 = {0: 0.5000065248303545, 1: 0.4999934751696456} [0.40]
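The effect of λ on the priors can be reproduced from the class counts printed earlier in this notebook (39050 negative and 12950 positive training records). The helper name `smoothed_priors` is hypothetical. It applies the same additive formula, (n_c + λ) / (N + λ·K) with K = 2 classes, for a few values of λ.

```python
# Class counts in the training split, as printed earlier in this notebook.
class_counts = {0: 39050, 1: 12950}

def smoothed_priors(class_counts, lam):
    """(n_c + lam) / (N + lam * K), where K is the number of classes."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: (n + lam) / (total + lam * k) for c, n in class_counts.items()}

for lam in (0, 1, 1000, 1_000_000_000):
    print(lam, smoothed_priors(class_counts, lam))
```

Small values of λ barely change the priors, while a very large λ makes both priors almost equal to 0.5, so the decision is driven almost entirely by the likelihood term.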

In [48]:
print("Train: ", end='')
print(Train_F1_ScoreADgnb)
print("Valid: ", end='')
print(Valid_F1_ScoreADgnb)
print("All: ", end='')
print(ALL_F1_SCORE_AD_GNB)

Train: 0.404680928871823
Valid: 0.4071874071874072
All: 0.4051762082220723


## Step 6: Generate result
> Note: Please follow the format mentioned in the slides. You can only change the path for saving your code down below.


In [50]:
predictions = gnb_classifier_Pers.predict(testing_df) # TODO: predict on the testing_df

# TODO: Specify the CSV file path
csv_file_path = 'hw2_advanced_prediction.csv'

# Write the predictions to the CSV file
with open(csv_file_path, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['diabetes_mellitus'])
    for prediction in predictions:
        writer.writerow([prediction])

# Report *(5%)*

Report should be submitted as a pdf file **hw2_report.pdf**

* Briefly explain why we take the logarithm when implementing the Bayesian classifier. (1%)
* Briefly describe the difference between the Naïve Bayesian and Gaussian Naïve Bayesian classifiers. (1%)
* Briefly describe the difficulties you encountered. (1%)
* Summarize how you solved the difficulties, and your reflections. (2%)
* **No more than one page**

# Save the Code File
Please save your code and submit it as an ipynb file! (**hw2.ipynb**)

# Submission:
1. hw2_basic_prediction.csv
2. hw2_basic_table: **make sure you call build_table BEFORE splitting the data into training and validation sets, and that it passes the given example**
3. hw2_advanced_prediction.csv
4. hw2_advanced_table: **make sure you call build_table BEFORE splitting the data or doing any pre-processing, and that it passes the given example**
5. hw2.ipynb
6. hw2_report.pdf