# Practical 2: Generative and Discriminative Models


In this practical, we will compare the Naïve Bayes Classifier (NBC) and Logistic Regression on six
datasets. As part of the practical you should briefly read the following paper:



**On Discriminative vs. Generative classifiers: A comparison of logistic regression
and naive Bayes**  
*Andrew Y. Ng and Michael I. Jordan*  
Advances in Neural Information Processing Systems (NIPS) 2001.

The paper is available on OLAT. 

You should read the Introduction and the Experiments sections. The goal of this practical is
to qualitatively reproduce some of the experimental results in this paper. You are strongly
encouraged to read the rest of the paper, which is rather short and straightforward to read,
though some of you may want to skip the formal proofs.

## Naïve Bayes Classifier

You should implement a Naïve Bayes Classifier directly in Python. To keep your code tidy,
we recommend implementing it as a class. Make sure that your classifier can handle binary, continuous and categorical features, and an arbitrary number of class labels. Suppose the data has 3
different features, the first being binary, the second being continuous and the third being categorical, and that there are
4 classes. Write an implementation that you can initialise as follows:

    nbc = NBC(feature_types=['b', 'r', 'c'], num_classes=4)

Along the lines of classifiers provided in sklearn, you want to implement two more functions,
**fit** and **predict**. 
Recall the joint distribution of a generative model: $p(\mathbf{x}, y \mid \theta, \pi) = p(y \mid \pi) \cdot p(\mathbf{x} \mid y, \theta)$.
The fit function is expected to estimate all the parameters ($\theta$ and $\pi$) of the NBC. The predict function is expected to compute the probabilities that the new input belongs to all classes and
then return the class that has the largest probability.

    nbc.fit(X_train, y_train)
    y_predicted = nbc.predict(X_test)
    test_accuracy = np.mean(y_predicted == y_test)

Here we import the libraries. 

In [61]:
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = (10., 10.)

import pickle as cp
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal, multinomial
import pandas as pd

Before implementing the NBC, we suggest you first implement a class for each of the three feature distributions it uses. Each class should have two functions: **estimate** and **get_probability**. The estimate function takes some data as input and computes the maximum likelihood estimate (MLE) of the parameters $\theta$ of the distribution $p(x \mid \theta)$. The get_probability function takes a new input value $x_{new}$ and returns $p(x_{new} \mid \theta)$. For example, continuous features can be modelled with a Gaussian distribution: the estimate function finds the parameters $\mu$ and $\sigma$ of the Gaussian from the input data, and get_probability returns $\mathcal{N}(x_{new} \mid \mu, \sigma)$.

![alt text](pics/mle_4.png)


You can import statistical libraries to implement the distributions. We recommend using the statistical functions provided by `scipy.stats`. Read the documentation here: https://docs.scipy.org/doc/scipy/reference/stats.html
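
As a minimal sketch of the estimate/get_probability pair for one continuous feature, the snippet below computes the Gaussian MLE with NumPy and evaluates the log-density with `scipy.stats.norm.logpdf`. The helper names `fit_gaussian` and `gaussian_log_density` and the `min_std` floor are illustrative assumptions, not part of the practical's template.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian(x, min_std=1e-3):
    """MLE of a univariate Gaussian.

    min_std is an assumed floor (a std of 1e-3 corresponds to a variance of 1e-6)
    that keeps the density well defined for constant features.
    """
    mu = float(np.mean(x))
    sigma = float(np.std(x))          # ddof=0, i.e. the maximum likelihood estimate
    return mu, max(sigma, min_std)

def gaussian_log_density(x_new, mu, sigma):
    """Log-density of x_new (scalar or array) under N(mu, sigma^2)."""
    return norm.logpdf(x_new, loc=mu, scale=sigma)

x = np.array([1.0, 2.0, 3.0, 3.0, 2.0, 1.0])
mu, sigma = fit_gaussian(x)
log_p = gaussian_log_density(np.array([2.0, 2.5]), mu, sigma)
```

Returning log-densities rather than densities makes it easier to combine many features later without underflow.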


In [85]:
# Distribution for continuous features
class ContFeatureParam:
    def estimate(self, X):
        # TODO: Estimate the parameters for the Gaussian distribution 
        # so that it best describes the input data X
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
    
        self.mean = np.mean(X)
        self.variance = np.var(X)
        if self.variance.size == 1:
            if self.variance == 0:
                self.variance = 10**-6
        else:
            for i,value in enumerate(self.variance):
                if value == 0:
                    self.variance[i] = 10**-6
        
        return self.mean, self.variance
        # self.mean, self.var = norm.stats(moments="mv")
        
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################

    def get_probability(self, val, j, dfCont):
        # TODO: returns the density value of the input value val
        # Note the input value val could be a vector rather than a single value
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################       
        
        #Directly
        return multivariate_normal.pdf(val, mean=dfCont[j].iloc[0], cov=dfCont[j].iloc[1])
        
        # by hand implementation
        #pi = np.pi
        #return np.sqrt(1/(2*pi*self.variance))*np.exp(-(1/(2*self.variance)*(val-self.mean)**2))
    
    
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################

# Distribution for binary features
class BinFeatureParam:
    def estimate(self, X):
        # TODO: Estimate the parameters for the Bernoulli distribution 
        # so that it best describes the input data X
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        
        self.ones = np.count_nonzero(X==1)
        self.prob = self.ones/X.size
        return self.prob, (1-self.prob)
        
        #count = np.count_nonzero(X)
        #othercount = X.size - count
        ##theta = count / X.size
        ##LL = count * np.log(theta) + othercount * np.log(1-theta)
        #
        #self.theta_prime = count/(othercount+count)
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################

    def get_probability(self, val, j, dfprob):
        # TODO: returns the density value of the input value val
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        
        # val = individual rows of dataset
        # j = iterator of nr of classes
        # dfprob = df with nr of probability of 1's and (counter-) probability of 0's for each class
        k = np.sum(val)
        
        return dfprob[j].iloc[0]**k * dfprob[j].iloc[1]**(val.size-k)
        # out probability per column
        
        
        '''
        if val == 1:
            bin_p = np.divide(self.ones, self.n)
            if bin_p == 0:
                return 10**-6
            if bin_p == 1:
                return 1 - 10**-6
            else:
                return bin_p
        if val == 0:
            bin_p = np.divide(self.zeros, self.n)
            if bin_p == 0:
                return 10**-6
            if bin_p == 1:
                return 1 - 10**-6
            else:
                return bin_p
        else:
            print("Non-Binary!")
        '''
        #return self.theta_prime if val == 1 else 1-self.theta_prime
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################
        

# Distribution for categorical features
class CatFeatureParam:
    def estimate(self, X):
        # TODO: Estimate the parameters for the Multinoulli distribution 
        # so that it best describes the input data X
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        
        self.n = X.size
        self.features, self.counts = np.unique(X, return_counts=True)
        self.probs = np.array(np.divide(self.counts, self.n))
        return self.features, self.probs
        
        
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################

    def get_probability(self, val, j, dfCat):
        # TODO: returns the density value of the input value val
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        
        if val in self.features:
            position = np.where(self.features==val)
            mul_p = self.probs[position]
            if mul_p == 1:
                return 1 - 10**-6
            else:
                return mul_p
            
        else:
            return 10**-6
        
        #return multinomial.pmf(val, 1, self.probs)
        
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################
        
        

In [63]:
#Test Gauss
x = np.array([1,2,3,3,2,1])

test_cont = ContFeatureParam()
est= test_cont.estimate(x)
print(test_cont.estimate(x))
print(test_cont.get_probability(2,1,est))


(2.0, 0.6666666666666666)


AttributeError: 'numpy.float64' object has no attribute 'iloc'

In [64]:
#Test Bernoulli
x = np.array([1,1,1,1,1,1])
test_bin = BinFeatureParam()
test_bin.estimate(x)

print(test_bin.estimate(x))
test_bin.get_probability(0)

(1.0, 0.0)


TypeError: get_probability() missing 2 required positional arguments: 'j' and 'dfprob'

In [65]:
#Test Multi
x = np.array([2,2,2])
test_mul = CatFeatureParam()
print(test_mul.estimate(x))
print(test_mul.get_probability(1))


(array([2]), array([1.]))


TypeError: get_probability() missing 2 required positional arguments: 'j' and 'dfCat'

Let us now implement a class for NBC. We'll keep it simple and try to follow the sklearn models. We'll have an init function, fit function and predict function.

**Hints for function fit**: Recall the joint distribution of a generative model: $p(\mathbf{x}, y \mid \theta, \pi) = p(y \mid \pi) \cdot p(\mathbf{x} \mid y, \theta)$.
The fit function will estimate the parameters of the NBC from the training data.
Here we give you some hints on how to estimate the $\theta$ in $p(\mathbf{x} \mid y, \theta)$.

For each class $c$, we want to estimate the $\theta_c$ of the distribution $p(\mathbf{x} \mid y = c, \theta_c)$.
Since the NBC assumes that the features are conditionally independent given the class $c$, the class-conditional distribution is a product of $D$ distributions, one for each feature: $p(\mathbf{x} \mid y = c, \theta_c) = \prod_{j=1}^{D} p(x_j \mid y = c, \theta_{jc})$. Hence, we need to estimate each $\theta_{jc}$ from the values of feature $j$ in the training examples of class $c$.

![alt text](pics/fit_4.png)
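
The per-class, per-feature estimation described in these hints could be sketched as follows for the all-continuous case. This is only an illustration under stated assumptions: the function name `fit_gaussian_nb`, the array layout, and the `min_var` floor are hypothetical, every class is assumed to occur at least once in `y`, and binary or categorical features would need their own estimators.

```python
import numpy as np

def fit_gaussian_nb(X, y, num_classes, min_var=1e-6):
    """Return log-priors (C,), means (C, D) and variances (C, D)."""
    N, D = X.shape
    log_prior = np.empty(num_classes)
    means = np.empty((num_classes, D))
    variances = np.empty((num_classes, D))
    for c in range(num_classes):              # looping over classes, not over data
        Xc = X[y == c]                        # training rows with label c
        log_prior[c] = np.log(Xc.shape[0] / N)        # log pi_c
        means[c] = Xc.mean(axis=0)                    # mu_jc for every feature j
        variances[c] = np.maximum(Xc.var(axis=0), min_var)  # sigma^2_jc, floored
    return log_prior, means, variances

X = np.array([[1.0, 10.0], [3.0, 10.0], [6.0, 20.0], [8.0, 22.0]])
y = np.array([0, 0, 1, 1])
log_prior, means, variances = fit_gaussian_nb(X, y, num_classes=2)
```

In this toy example the second feature is constant within class 0, so its variance is replaced by the floor value.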


**Hints for function predict**: The predict function should compute the probabilities $p(y = c \mid \mathbf{x}_{new}, \pi, \theta)$ of all classes for the new inputs $\mathbf{x}_{new}$ by applying Bayes' rule:

$$p(y = c \mid \mathbf{x}_{new}, \pi, \theta) = \frac{p(y = c \mid \pi_c) \cdot p(\mathbf{x}_{new} \mid y=c, \theta_c)}{\sum^{C}_{c'=1}p(y=c' \mid \pi_{c'}) \cdot p(\mathbf{x}_{new} \mid y=c', \theta_{c'})},$$

and then return the class that has the largest probability:

$$y_{predict} = \underset{c}{\arg\max} \, p(y = c \mid \mathbf{x}_{new}, \pi, \theta).$$

Here we give you some hints on the computation of $p(\mathbf{x}_{new} \mid y=c, \theta_c)$.
Due to the conditional independence assumption, we have $p(\mathbf{x}_{new} \mid y=c, \theta_c) = \prod_{j=1}^{D} p(x_{new,j} \mid y = c, \theta_{jc})$. Since the parameters $\theta_{jc}$ were estimated in the fit phase, we can use them to compute the probabilities for the new data.

![alt text](pics/predict_3.png)
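
A hedged sketch of Bayes' rule in log space for Gaussian features is given below. The product over features becomes a sum of log-densities, and the denominator is computed with `scipy.special.logsumexp`. The function name `gaussian_nb_log_posterior` and the parameter arrays are assumptions made for illustration only.

```python
import numpy as np
from scipy.special import logsumexp

def gaussian_nb_log_posterior(X_new, log_prior, means, variances):
    """Log posterior log p(y=c | x) for each row of X_new; returns shape (N, C)."""
    # Differences between every sample and every class mean: shape (N, C, D)
    diff = X_new[:, None, :] - means[None, :, :]
    # Log N(x_j | mu_jc, var_jc) for every sample, class and feature
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances)[None] + diff ** 2 / variances[None])
    # Log joint: log prior plus the sum of per-feature log-likelihoods
    joint = log_prior[None, :] + log_lik.sum(axis=2)
    # Normalise by subtracting the log of the denominator in Bayes' rule
    return joint - logsumexp(joint, axis=1, keepdims=True)

log_prior = np.log(np.array([0.5, 0.5]))
means = np.array([[0.0, 0.0], [4.0, 4.0]])
variances = np.ones((2, 2))
X_new = np.array([[0.5, -0.2], [3.8, 4.1], [2.0, 2.0]])

log_post = gaussian_nb_log_posterior(X_new, log_prior, means, variances)
y_pred = np.argmax(log_post, axis=1)
```

Since the denominator is the same for every class, the argmax could also be taken directly over the unnormalised log joint.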

In [66]:
class NBC:
    # Inputs:
    #   feature_types: the array of the types of the features, e.g., feature_types=['r', 'r', 'r', 'r']
    #   num_classes: number of classes of labels
    def __init__(self, feature_types=[], num_classes=0):
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        self.feature_types = feature_types
        self.num_classes = num_classes
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################
    # The function uses the input data to estimate all the parameters of the NBC
    # You should use the parameters based on the types of the features
    def fit(self, X, y):
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        
        # get list of iterators for nr of classes
        self.classes = np.arange(self.num_classes)
        
        # writing masks to split columns into feature groups
        self.ContFeatureList = np.zeros(len(self.feature_types))
        self.BinFeatureList = np.zeros(len(self.feature_types))
        self.CatFeatureList = np.zeros(len(self.feature_types))
        
        for i,ftype in enumerate(self.feature_types):
            if ftype == "r":
                self.ContFeatureList[i] = 1
                
            elif ftype == "b":
                self.BinFeatureList[i] = 1
                
            elif ftype == "c":
                self.CatFeatureList[i] = 1
                
            else:
                print("{} is not a valid type ".format(ftype))
        

        df = pd.DataFrame(X)
        df2 = pd.DataFrame(y)
        
        # Continuous Feature Processing (& Mean per Class Calculation for further Processing )
        Contdf = df.loc[:,self.ContFeatureList[:] == 1] # apply mask to dataframe
        ContEstimates = []
        self.MeanPerClass = []
        self.ContFeatureObject = ContFeatureParam()
        for i in self.classes:
            Contdfcopy = Contdf.copy()
            Contdfcopy = Contdfcopy.loc[df2[0] == i] # select rows where class = c
            ContEstimates.append(self.ContFeatureObject.estimate(Contdfcopy))   
            self.MeanPerClass.append(np.divide(np.size(df2.loc[df2[0] == i]),np.size(df2)))
        
        # Binary Feature Processing
        Bindf = df.loc[:,self.BinFeatureList[:] == 1]
        BinEstimates = []
        self.BinFeatureObject = BinFeatureParam()
        for i in self.classes:
            Bindfcopy = Bindf.copy()
            Bindfcopy = Bindfcopy.loc[df2[0] == i]
            BinEstimates.append(self.BinFeatureObject.estimate(Bindfcopy))  
        
        # Categorical Feature Processing
        Catdf = df.loc[:,self.CatFeatureList[:] == 1]
        CatEstimates = []
        self.CatFeatureObject = CatFeatureParam()
        for i in self.classes:
            Catdfcopy = Catdf.copy()
            Catdfcopy = Catdfcopy.loc[df2[0] == i]
            CatEstimates.append(self.CatFeatureObject.estimate(Catdfcopy))  
        
        
        
        # place in dataframe for convenience
        self.dfCont = []
        for c in range(len(ContEstimates)):
            self.dfCont.append(pd.DataFrame(ContEstimates[c]))
        
        self.dfBin = []
        for c in range(len(BinEstimates)):
            self.dfBin.append(pd.DataFrame(BinEstimates[c]))
            
        self.dfCat = []
        for c in range(len(CatEstimates)):
            self.dfCat.append(pd.DataFrame(CatEstimates[c]))
            
        return self.dfBin
    
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################
                
    # The function takes the data X as input, and predicts the class for the data
    def predict(self, X):
        # The code below is just for compilation. 
        # You need to replace it by your own code.
        ###################################################
        ##### YOUR CODE STARTS HERE #######################
        ###################################################
        # return yhat based on estimators calculated in fit
        dftest = pd.DataFrame(X)
        
        pdf = []
    
        # Continuous
        dftestCont = dftest.loc[:,self.ContFeatureList[:] == 1]
        if dftestCont.empty:
            pass
        else:
            for i,row in enumerate(dftestCont.values):
                pdf_p_class1 = []
                for j in self.classes:
                    pdf_p_class1.append(self.ContFeatureObject.get_probability(dftestCont.values[i], j, 
                                                                              self.dfCont))
                pdf.append(pdf_p_class1)
                
        # Binary
        dftestBin = dftest.loc[:,self.BinFeatureList[:] == 1]
        if dftestBin.empty:
            pass
        else:
            
            for i,row in enumerate(dftestBin.values):
                pdf_p_class2 = []
                for j in self.classes:
                    pdf_p_class2.append(self.BinFeatureObject.get_probability(dftestBin.values[i], j, 
                                                                              self.dfBin))
                pdf.append(pdf_p_class2)
        
        
        # Categorical
        dftestCat = dftest.loc[:,self.CatFeatureList[:] == 1]
        if dftestCat.empty:
            pass
        else:
            for i,row in enumerate(dftestCat.values):
                pdf_p_class3 = []
                for j in self.classes:
                    pdf_p_class3.append(self.CatFeatureObject.get_probability(dftestCat.values[i], j, 
                                                                              self.dfCat))
                pdf.append(pdf_p_class3)
        
        return np.argmax((np.multiply(pdf,self.MeanPerClass)) ,axis=1)
        
        ###################################################
        ##### YOUR CODE ENDS HERE #########################
        ###################################################

    

**Implementation Issues**
- Feel free to add auxiliary functions.
- Don't forget to compute $p(y=c \mid \pi)$.
- Remember to do all the calculations in log space to avoid running into underflow issues. Read more: (Mur) Chapter 3.5.3.
- Your implementation should be able to handle missing values.
- As far as possible, use matrix operations. So assume that Xtrain, ytrain, Xtest will all
be numpy arrays. Try and minimise your use of Python loops. (In general, looping over
classes or features is OK, but looping over data is probably not a good idea.)
- The variance parameter for Gaussian distributions should never be exactly 0, so in
case your calculated variance is 0, you may want to set it to a small value such as 1e-6.
Note that this is essential to ensure that your code never encounters division-by-zero or
logarithm-of-zero errors. Also, you want to ensure that the estimates of the parameters of the Bernoulli or Multinoulli random variables
are never exactly 0 or 1. For this reason you should consider using Laplace smoothing (https://en.wikipedia.org/wiki/Additive_smoothing).
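
To illustrate the last point, here is a minimal sketch of additive (Laplace) smoothing for a categorical feature. It assumes the categories are already integer-coded as `0..num_categories-1`. The function name `smoothed_categorical` and the pseudo-count `alpha=1.0` are illustrative choices. A Bernoulli feature is the special case `num_categories=2`.

```python
import numpy as np

def smoothed_categorical(x, num_categories, alpha=1.0):
    """Additive-smoothing estimate of category probabilities.

    x must contain non-negative integer codes; alpha is the pseudo-count
    added to every category (alpha=1 is Laplace smoothing).
    """
    counts = np.bincount(x, minlength=num_categories)
    return (counts + alpha) / (counts.sum() + alpha * num_categories)

# Degenerate training data: only category 2 was observed
x = np.array([2, 2, 2])
probs = smoothed_categorical(x, num_categories=3)
log_probs = np.log(probs)   # finite, because no probability is exactly 0
```

Without smoothing, the unseen categories would get probability 0 and their logarithm would be minus infinity.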


You can use the code below to sanity-check your implementation on the iris dataset. All features of the iris dataset are continuous, so you do not need to implement all types of feature parameters to check your code.

Your implementation should reach an accuracy above 90%.

In [67]:
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris['data'], iris['target']

N, D = X.shape
Ntrain = int(0.8 * N)
shuffler = np.random.RandomState(seed=4).permutation(N)
Xtrain = X[shuffler[:Ntrain]]
ytrain = y[shuffler[:Ntrain]]
Xtest = X[shuffler[Ntrain:]]
ytest = y[shuffler[Ntrain:]]


nbc_iris = NBC(feature_types=['r', 'r', 'r', 'r'], num_classes=3)
nbc_iris.fit(Xtrain, ytrain)

yhat = nbc_iris.predict(Xtest)
test_accuracy = np.mean(yhat == ytest)
print("Accuracy:", test_accuracy)

yhat

Accuracy: 0.9666666666666667




array([0, 0, 2, 2, 1, 0, 0, 0, 2, 1, 0, 0, 2, 2, 2, 0, 0, 2, 1, 1, 1, 2,
       2, 1, 2, 1, 1, 2, 2, 2], dtype=int64)

In [68]:
ytest

array([0, 0, 2, 2, 1, 0, 0, 0, 2, 1, 0, 0, 2, 1, 2, 0, 0, 2, 1, 1, 1, 2,
       2, 1, 2, 1, 1, 2, 2, 2])

In [69]:
test = []
dftest = pd.DataFrame(test)
if dftest.empty:
    print("empty")

empty


## Logistic Regression

For logistic regression, you should use the implementation in sklearn. Adding the following
line will import the LR model.

    from sklearn.linear_model import LogisticRegression

Read the information provided on the following links to understand some details about how the
logistic regression model is implemented in scikit-learn.
- http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
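
As a small usage sketch, the snippet below fits scikit-learn's `LogisticRegression` on the iris data and measures the test error. Note that this estimator applies L2 regularisation by default (`C=1.0`). The split sizes, the seed, and the raised `max_iter` are assumptions chosen for the example, not requirements of the practical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Shuffle once and keep 30 of the 150 examples for testing
rng = np.random.RandomState(0)
idx = rng.permutation(X.shape[0])
train, test = idx[:120], idx[120:]

# max_iter is raised (an assumption) so the solver is more likely to converge
lr = LogisticRegression(max_iter=1000)
lr.fit(X[train], y[train])

test_error = np.mean(lr.predict(X[test]) != y[test])
proba = lr.predict_proba(X[test])   # class probabilities, one row per test example
```

The same `fit`/`predict` interface is what the NBC class is meant to mirror, which lets both models be passed to one comparison routine.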


## Comparing NBC and LR

### Experiments

You will compare the classification error of the NBC and LR trained on increasingly
large training datasets. Because the datasets are so small, you should do this multiple times and
average the classification error. One run should look as follows:
- Shuffle the data, put 20% aside for testing.
    
    ```
    N, D = X.shape
    Ntrain = int(0.8 * N)
    shuffler = np.random.permutation(N)
    Xtrain = X[shuffler[:Ntrain]]
    ytrain = y[shuffler[:Ntrain]]
    Xtest = X[shuffler[Ntrain:]]
    ytest = y[shuffler[Ntrain:]]
    ```


- Train the classifiers with increasingly more data. For example, we can train classifiers with 10%, 20%, ..., 100% of the training data. For each training-set size, store the classification error of each classifier on the test set.

You may want to repeat this with at least 200 random permutations (possibly as many as 1000)
to average out the test error across the runs. In the end, you will get average test errors as a
function of the size of the training data. Plot these curves for NBC and LR on the datasets.
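
The procedure above can be sketched for a single classifier as follows. `NearestMean` is a hypothetical stand-in with an sklearn-like `fit`/`predict` interface, used only to keep the example self-contained. `learning_curve` and its defaults are likewise illustrative assumptions, and the data is synthetic.

```python
import numpy as np

class NearestMean:
    """Hypothetical stand-in classifier: predicts the class with the closest mean."""
    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        d = ((X[:, None, :] - self.means_[None]) ** 2).sum(axis=2)
        return self.labels_[np.argmin(d, axis=1)]

def learning_curve(clf, X, y, num_runs=20, num_splits=10, seed=0):
    """Average test error for training fractions 1/num_splits, ..., 1."""
    rng = np.random.RandomState(seed)
    N = X.shape[0]
    Ntrain = int(0.8 * N)
    errs = np.zeros(num_splits)
    for _ in range(num_runs):
        shuffler = rng.permutation(N)                 # new shuffle for every run
        tr, te = shuffler[:Ntrain], shuffler[Ntrain:]
        for k in range(num_splits):
            n_k = max(1, int(Ntrain * (k + 1) / num_splits))
            clf.fit(X[tr[:n_k]], y[tr[:n_k]])         # train on the first n_k examples
            errs[k] += np.mean(clf.predict(X[te]) != y[te])
    return errs / num_runs

# Synthetic two-class data: two Gaussian blobs
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 3.0])
y = np.repeat([0, 1], 100)
errs = learning_curve(NearestMean(), X, y)
```

With very small training fractions, a subset may contain only one class. Some classifiers, including scikit-learn's `LogisticRegression`, raise an error in that case, so a real implementation may need to guard against it.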

In [70]:
# inputs:
#   nbc: Naive Bayes Classifier
#   lr: Logistic Regression Classifier
#   X, y: data
#   num_runs: number of times to repeat the experiment; the results are averaged over the runs
#   num_splits: we want to compare the two models on increasingly larger training sets.
#               num_splits defines the number of increasing steps. 
# outputs:
#   arrays of the test errors of the two classifiers, averaged across the runs
def compareNBCvsLR(nbc, lr, X, y, num_runs=200, num_splits=10):
    # The code below is just for compilation. 
    # You need to replace it by your own code.
    ###################################################
    ##### YOUR CODE STARTS HERE #######################
    ###################################################
    tst_errs_nbc = np.zeros((num_splits))
    tst_errs_lr = np.zeros((num_splits))
    return tst_errs_nbc, tst_errs_lr
    ###################################################
    ##### YOUR CODE ENDS HERE #########################
    ###################################################

The utility function below plots the test errors of the two classifiers.

In [71]:
def makePlot(nbc_perf, lr_perf, title=None):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    ax.tick_params(axis='both', labelsize=20)

    ax.set_xlabel('Percent of training data used', fontsize=20)
    ax.set_ylabel('Classification Error', fontsize=20)
    if title is not None: ax.set_title(title, fontsize=25)

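    # Derive the x-axis from the number of splits instead of hard-coding 10 steps
    xaxis_scale = [(i + 1) * 100 / len(nbc_perf) for i in range(len(nbc_perf))]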
    plt.plot(xaxis_scale, nbc_perf, label='Naive Bayes')
    plt.plot(xaxis_scale, lr_perf, label='Logistic Regression', linestyle='dashed')
    
    ax.legend(loc='upper right', fontsize=20)

### Datasets

Tasks: For each dataset,
1. prepare the data for the two classifiers
2. compare the two classifiers on the dataset and generate the plots
3. write a short report on how you prepared the data and on your observations from the comparison
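
One possible way to prepare categorical data for the two classifiers is sketched below on a made-up toy table: integer codes for the NBC's categorical features and one-hot indicator columns for logistic regression. The column names and values are invented for illustration, and other encodings (for example scikit-learn's encoders) would work as well.

```python
import pandas as pd

# Toy data, invented for this example
toy = pd.DataFrame({
    "safety": ["low", "high", "med", "high"],
    "doors": ["2", "4", "2", "3"],
    "label": ["unacc", "acc", "unacc", "acc"],
})

features = toy.drop(columns="label")

# Integer codes per column: suitable for categorical ('c') features in the NBC
X_nbc = features.apply(lambda col: col.astype("category").cat.codes).to_numpy()

# One-hot indicator columns: suitable for logistic regression
X_lr = pd.get_dummies(features).to_numpy(dtype=float)

# Integer class labels 0..C-1 as a 1-D array
y = toy["label"].astype("category").cat.codes.to_numpy()
```

Keeping `y` one-dimensional matters: comparing a 1-D prediction array with a column vector of shape `(n, 1)` broadcasts to an `(n, n)` matrix and gives a misleading accuracy.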

In [94]:
# imports for data preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit

**Dataset 1: Iris Dataset**

https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html

In [90]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################
from sklearn.datasets import load_iris
iris_obj = load_iris()
# Convert the iris dataset to a DataFrame
iris = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names,index=pd.Index([i for i in range(iris_obj.data.shape[0])])).join(pd.DataFrame(iris_obj.target, columns=pd.Index(["species"]), index=pd.Index([i for i in range(iris_obj.target.shape[0])])))
iris.info()
###################################################
##### YOUR CODE ENDS HERE #########################
###################################################

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int32  
dtypes: float64(4), int32(1)
memory usage: 11.4 KB


**Dataset 2: Voting Dataset**

https://archive.ics.uci.edu/ml/datasets/congressional+voting+records


In [91]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################

# Load Data
voting = pd.read_csv('./datasets/datasets/voting.csv')

# Drop two columns with most missing values, then drop all rows with missing values
voting = voting.drop("export-administration-act-south-africa", axis=1)
voting = voting.drop("water-project-cost-sharing", axis=1)
voting = voting.dropna()

# Ordinal Encoding

ordinal_encoder = OrdinalEncoder()

# # Encoding
# voting_encoded = ordinal_encoder.fit_transform(voting) # Rep=1, Dem=0 // Yes=1, No=0

# Party encoding
voting_party = voting[["label"]]
voting_party_encoded = ordinal_encoder.fit_transform(voting_party) # Rep=1, Dem=0

# Voting encoding
voting_votes = voting.drop("label", axis=1)
voting_votes_encoded = ordinal_encoder.fit_transform(voting_votes) # Yes=1, No=0

# Shuffle 


X = voting_votes_encoded
y = voting_party_encoded

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=1154)

# NBC
NBC_voting = NBC(feature_types=['b','b','b','b','b','b','b','b','b','b','b','b','b','b',], num_classes=2)
library = NBC_voting.fit(Xtrain, ytrain)

yhat = NBC_voting.predict(Xtest)
test_accuracy = np.mean(yhat == ytest)
print("Accuracy:", test_accuracy)



###################################################
##### YOUR CODE ENDS HERE #########################
###################################################

Accuracy: 0.5714285714285714


**Dataset 3: Car Evaluation Dataset**

https://archive.ics.uci.edu/ml/datasets/car+evaluation

In [114]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################
car = pd.read_csv('datasets/datasets/car.csv')
car.info()
car.value_counts()
one_hot_encoder = OneHotEncoder()

X = car.drop("buying", axis = 1)
y = car["buying"]
print(y)
X_encoded = one_hot_encoder.fit_transform(X)
X_encoded_array = X_encoded.toarray()



###################################################
##### YOUR CODE ENDS HERE #########################
###################################################

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   buying         1728 non-null   object
 1   maint          1728 non-null   object
 2   doors          1728 non-null   object
 3   persons        1728 non-null   object
 4   lug_boot       1728 non-null   object
 5   safety         1728 non-null   object
 6   acceptability  1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB
0       vhigh
1       vhigh
2       vhigh
3       vhigh
4       vhigh
        ...  
1723      low
1724      low
1725      low
1726      low
1727      low
Name: buying, Length: 1728, dtype: object


**Dataset 4: Breast Cancer Dataset**

https://archive.ics.uci.edu/ml/datasets/breast+cancer

In [None]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################
cancer = pd.read_csv('./datasets/breast-cancer.csv')
cancer.info()
###################################################
##### YOUR CODE ENDS HERE #########################
###################################################

**Dataset 5: Ionosphere Dataset**

https://archive.ics.uci.edu/ml/datasets/ionosphere

In [None]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################
ionosphere = pd.read_csv('./datasets/ionosphere.csv')
ionosphere.info()
###################################################
##### YOUR CODE ENDS HERE #########################
###################################################

**Dataset 6: Sonar Dataset**

http://archive.ics.uci.edu/ml/datasets/connectionist+bench+%28sonar,+mines+vs.+rocks%29

In [None]:
# TODO: insert your code for experiments
###################################################
##### YOUR CODE STARTS HERE #######################
###################################################
sonar = pd.read_csv('./datasets/sonar.csv')
sonar.info()
###################################################
##### YOUR CODE ENDS HERE #########################
###################################################