# <font color='gray'> Lab 3: Classification II</font>


## Introduction 


The aim of this lab is to familiarise ourselves with the concepts of **priors**, **dimensionality** and **Bayes** in
classification problems. 

- This lab constitutes your third course-work activity.
- A report answering the <font color = 'red'>**questions in</font><font color = "maroon"> red**</font> only should be submitted by the 5th of April. 
- The report should be a separate file in **pdf format** (so **NOT** *doc, docx, notebook* etc.), well identified with your name, student number, assignment number (for instance, Assignment 3), module code. 
- No other means of submission other than the appropriate QM+ link is acceptable at any time (so NO email attachments, etc.)
- **PLAGIARISM** <ins>is an irreversible non-negotiable failure in the course</ins> (if in doubt of what constitutes plagiarism, ask!). 


## **1. Data Proportions with `MaxEnt` + `Bayes`**


#### 0. Loading the dataset

*   This first cell loads the `Iris` flower dataset. The Iris flower dataset is a classic dataset used to identify types of flowers based on features describing their sepals and petals. It also comes with a description of itself:



In [None]:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
#print(iris)
print(iris.DESCR)

*   As you can read, the full dataset contains four attributes (`sepal length`, `sepal width`, `petal length`, `petal width`, in centimeters), but for simplicity (and to be able to plot them!) we will only work with the first two (`sepal length`, `sepal width`). 

In [None]:
X = iris.data[:, :2] 
print(X[:4,:]) # printing the first few samples, to get a feeling about the data
print(iris.feature_names[:2])

*  This dataset is labeled. The labels are the type of Iris flower each sample is: `setosa`, `versicolor`, and `virginica`. The labels are represented by numbers: 0, 1 and 2, respectively:

In [None]:
Y = iris.target
print(Y[:10]) # printing the first few labels
print(iris.target_names)

#### 1. Plotting the dataset

*   Since we now have only two attributes, we can plot them in 2-D. To differentiate the labels, we will use different markers: 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# marker_list = ['P', 'X', 'o']
marker_list = ['+', '.', 'x']
# marker_list = ['*', 's', 'D']

fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111)
ax.set_aspect('equal')

for l in [0, 1, 2]:
  ax.scatter(X[Y == l, 0], X[Y == l, 1], 
             marker=marker_list[l], s=70, color='black',
            #  color = plt.cm.Accent.colors[l], edgecolors='k',
             label='{:d} ({:s})'.format(l, iris.target_names[l]))

ax.legend(fontsize=12)
ax.set_xlabel(iris.feature_names[0], fontsize=14)
ax.set_ylabel(iris.feature_names[1], fontsize=14)
ax.grid(alpha=0.3)
ax.set_xlim(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5)
ax.set_ylim(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5)
plt.show()

Let's split our data into train and test sets (you should know by now why we do this!). We will use a simple 50/50 split:

In [None]:
Xtr = X[::2,:] # train data set (features)
Ytr = Y[::2] # train data set (labels)

Xte = X[1::2,:] # test data set (features)
Yte = Y[1::2] # test data set (labels)
#print (Xtr)
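
The alternating slice above is one way to obtain a 50/50 split. As a hedged alternative (not part of the original lab), the sketch below uses scikit-learn's `train_test_split` with `stratify` to draw a random 50/50 split that preserves the class proportions. The `*_demo` names and the choice `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
X_demo = iris_demo.data[:, :2]   # first two features, as in the lab
Y_demo = iris_demo.target

# stratify=Y_demo keeps the class proportions the same in both halves
Xtr_demo, Xte_demo, Ytr_demo, Yte_demo = train_test_split(
    X_demo, Y_demo, test_size=0.5, stratify=Y_demo, random_state=0)

print(np.bincount(Ytr_demo))  # class counts in the training half
print(np.bincount(Yte_demo))  # class counts in the test half
```

Unlike the alternating slice, this split depends on the random seed, so results obtained with it will not match the numbers quoted elsewhere in this lab.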

#### 2. Train a Multi-class Logistic Regression classifier 

In the previous lab, we implemented a logistic regression classifier (a.k.a. `MaxEnt`) by ourselves. Here, we will instead import it from Python's `scikit-learn` library. 
(Link to `scikit-learn` [documentation](https://scikit-learn.org/stable/documentation.html), and its [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) page in particular).
Some of the code here is from [this example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html) from scikit-learn's own documentation.

To learn more about `multinomial logistic regression` visit [Binary vs. Multi-Class Logistic Regression (by Chris Yeh)](https://chrisyeh96.github.io/2018/06/11/logistic-regression.html)
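
To make the multinomial model concrete: it assigns one linear score per class and turns the scores into probabilities with the softmax function. The sketch below checks this on the two Iris features used in this lab. It assumes a recent scikit-learn, where `LogisticRegression` with the `lbfgs` solver fits a multinomial model for three classes by default; `clf_demo`, `softmax_rows` and `max_iter=1000` are illustrative choices, not part of the lab code.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris_demo = datasets.load_iris()
X_demo = iris_demo.data[:, :2]   # sepal length and sepal width
Y_demo = iris_demo.target

# Assumption: with three classes and lbfgs, a multinomial (softmax) model is fitted
clf_demo = LogisticRegression(solver='lbfgs', max_iter=1000)
clf_demo.fit(X_demo, Y_demo)

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

scores_demo = clf_demo.decision_function(X_demo)  # one linear score per class
probs_demo = softmax_rows(scores_demo)            # scores -> class probabilities

# The hand-computed probabilities should agree with predict_proba
print(np.allclose(probs_demo, clf_demo.predict_proba(X_demo)))
```

The predicted class is simply the one with the largest score, so the decision regions of this model are bounded by straight lines in the feature plane.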
 

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg_classifier = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')
log_reg_classifier.fit(Xtr, Ytr)

x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5

h = .02  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = log_reg_classifier.predict(np.c_[xx.ravel(), yy.ravel()])

# First, plotting the decision regions:
Z = Z.reshape(xx.shape)
fig = plt.figure(1, figsize=(7, 7))
ax = fig.add_subplot(111)
ax.set_aspect('equal')
ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel2)


# Now plotting the training points:
for l in [0, 1, 2]:
  ax.scatter(Xtr[Ytr == l, 0], Xtr[Ytr == l, 1], 
             marker=marker_list[l], color='black', s=70,
            label='{:d} ({:s})'.format(l, iris.target_names[l]))
   

ax.set_xlabel(iris.feature_names[0], fontsize=14)
ax.set_ylabel(iris.feature_names[1], fontsize=14)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.legend(fontsize=12, loc='upper right')
ax.grid(alpha=0.3)
plt.show()


We trained our model using training data only. In the above plot, we are displaying the training data and the resulting three decision regions. To get a visual feeling of how well our model is generalising to unseen data, let's display the test data as well. We are going to plot test samples in red, to differentiate them from the training data:


In [None]:
fig = plt.figure(1, figsize=(7, 7))
ax = fig.add_subplot(111)
ax.set_aspect('equal')

# First plotting the decision regions again:
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel2)

# plotting the train points in black and with reduced opacity:
for l in [0, 1, 2]:
  plt.scatter(Xtr[Ytr == l, 0], Xtr[Ytr == l, 1], 
             marker=marker_list[l], color='black', s=70, alpha=0.5,)
# plotting the test points in red:
for l in [0, 1, 2]:
  plt.scatter(Xte[Yte == l, 0], Xte[Yte == l, 1], 
             marker=marker_list[l], color='red', s=70, 
              label='{:d} ({:s})'.format(l, iris.target_names[l]))
  
  
ax.set_xlabel(iris.feature_names[0], fontsize=14)
ax.set_ylabel(iris.feature_names[1], fontsize=14)
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.legend(fontsize=12, loc='upper right')
ax.grid(alpha=0.3)
plt.show()

---
> **Q0:** Using the code snippet below, compute the accuracy of our logistic model on both **train** and **test** data.


> **A0:**
---

The code is included below; the test accuracy is 82%.

In [None]:
# first computing the train accuracy:
train_accuracy = np.average(Ytr==log_reg_classifier.predict(Xtr))
print('train accuracy (computed ourselves) = {}'.format(train_accuracy))
print('train accuracy (using an already available method of our classifier object) = {}'.format(log_reg_classifier.score(Xtr, Ytr)))
# now computing the test accuracy ...   
test_accuracy = np.average(Yte==log_reg_classifier.predict(Xte))  # compare against the test labels
print('test accuracy (computed ourselves) = {}'.format(test_accuracy))
print('test accuracy (using an already available method of our classifier object) = {}'.format(log_reg_classifier.score(Xte, Yte)))

---
> **Q1:** How many instances of each class are there in the training data? (A sample code is provided for you below as a hint)

> **A1:**
---

Prss: There are 25 instances of each class in the training data.

In [None]:

print(sum(Ytr==0))
print(sum(Ytr==1))
print(sum(Ytr==2))
#print (sum(Yte==0))

Next, we compute the confusion matrix (here, the true classes correspond to the rows of the confusion matrix, whereas the predicted classes correspond to the columns).

In [None]:
# Computing the confusion matrix:
# first, from scikit-learn library:

from sklearn.metrics import confusion_matrix

train_confusion_matrix = confusion_matrix(y_true=Ytr, y_pred=log_reg_classifier.predict(Xtr))
print('train confusion matrix:\n {}\n'.format(train_confusion_matrix))

test_confusion_matrix = confusion_matrix(y_true=Yte, y_pred=log_reg_classifier.predict(Xte))
print('test confusion matrix:\n {}'.format(test_confusion_matrix))

cm = train_confusion_matrix
train_confusion_matrix_normalised = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
cm = test_confusion_matrix
test_confusion_matrix_normalised = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

print('Normalised train confusion matrix:\n {}\n'.format(train_confusion_matrix_normalised))
print('Normalised test confusion matrix:\n {}\n'.format(test_confusion_matrix_normalised))
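
As a small worked example of the row normalisation above, the sketch below uses a made-up 3-class confusion matrix (the counts are hypothetical, not results from this lab). After dividing each row by its sum, the diagonal gives the per-class accuracy, i.e. the recall of each class.

```python
import numpy as np

# Hypothetical confusion matrix (rows = true class, columns = predicted class)
cm_example = np.array([[25,  0,  0],
                       [ 0, 18,  7],
                       [ 0,  9, 16]])

# Divide every row by the number of true samples of that class
cm_example_normalised = cm_example / cm_example.sum(axis=1, keepdims=True)

# Diagonal of the normalised matrix = fraction of each class classified correctly
per_class_accuracy = np.diag(cm_example_normalised)

# Overall accuracy = correctly classified samples / all samples
overall_accuracy = np.trace(cm_example) / cm_example.sum()

print(per_class_accuracy)
print(overall_accuracy)
```

Off-diagonal entries of a row show which classes the true class is confused with.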

If you would like to display the confusion matrix more nicely (taken from [here](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)):

In [None]:
fig, ax = plt.subplots()
ax.set_aspect('equal')
ax.set_xlim(-0.5, 2.5)
ax.set_ylim(-0.5, 2.5)

cm = test_confusion_matrix
classes = iris.target_names
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       # ... and label them with the respective list entries
       xticklabels=classes, yticklabels=classes,
       title='Test Confusion Matrix',
       ylabel='True label',
       xlabel='Predicted label')

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
thresh = cm.max() / 2.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > thresh else "black")
fig.tight_layout()

Since we are going to print the evaluation of our classifiers a few times, let's create a new function that will do this job:

In [None]:
def print_classifier_report(Xtrain, ytrain, Xtest, ytest, classifier, plot_regions=False):
  train_confusion_matrix = confusion_matrix(y_true=ytrain, y_pred=classifier.predict(Xtrain))
  print('Train accuracy = {:.3f}'.format(classifier.score(Xtrain, ytrain)))
  print('Train confusion matrix:\n {}\n'.format(train_confusion_matrix))

  test_confusion_matrix = confusion_matrix(y_true=ytest, y_pred=classifier.predict(Xtest))
  print('Test accuracy = {:.3f}'.format(classifier.score(Xtest, ytest)))
  print('Test confusion matrix:\n {}\n'.format(test_confusion_matrix))

  cm = train_confusion_matrix
  train_confusion_matrix_normalised = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
  cm = test_confusion_matrix
  test_confusion_matrix_normalised = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

  print('Normalised train confusion matrix:\n {}\n'.format(train_confusion_matrix_normalised))
  print('Normalised test confusion matrix:\n {}\n'.format(test_confusion_matrix_normalised))
  
  if plot_regions:

    fig = plt.figure(1, figsize=(7, 7))
    ax = fig.add_subplot(111)
    ax.set_aspect('equal')

    # First plotting the decision regions again:
    x_min = min(Xtrain[:, 0].min(), Xtest[:, 0].min()) - .5 # min of the x axis
    x_max = max(Xtrain[:, 0].max(), Xtest[:, 0].max()) + .5 # max of the x axis
    y_min = min(Xtrain[:, 1].min(), Xtest[:, 1].min()) - .5 # min of the y axis
    y_max = max(Xtrain[:, 1].max(), Xtest[:, 1].max()) + .5 # max of the y axis
    h = .02  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel2)

    # plotting the train points in black and with reduced opacity:
    marker_list = ['+', '.', 'x']
    for l in [0, 1, 2]:
      plt.scatter(Xtrain[ytrain == l, 0], Xtrain[ytrain == l, 1], 
                 marker=marker_list[l], color='black', s=70, alpha=0.5,)
    # plotting the test points in red:
    for l in [0, 1, 2]:
      plt.scatter(Xtest[ytest == l, 0], Xtest[ytest == l, 1], 
                 marker=marker_list[l], color='red', s=70, 
                  label='{:d} ({:s})'.format(l, iris.target_names[l]))


    ax.set_xlabel(iris.feature_names[0], fontsize=14)
    ax.set_ylabel(iris.feature_names[1], fontsize=14)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.legend(fontsize=12, loc='upper right')
    ax.grid(alpha=0.3)
    plt.show()

And now just to check if the code works:

In [None]:
print_classifier_report(Xtr, Ytr, Xte, Yte, log_reg_classifier, plot_regions=True)

### 3. Imbalanced Classes
We now assume that the classifier is trained based on data gathered in some geographic region A, where the proportions of flowers of each type are different. Specifically, we assume that flower type 2 (`virginica Iris`) is 5 times less common.


In [None]:
index_A_tr = [i for i, x in enumerate(Xtr) if Ytr[i]!=2 or i % 5 == 0]
XtrImbalanced_A = Xtr[index_A_tr,:] # train data set (features)
YtrImbalanced_A = Ytr[index_A_tr] # train data set (labels)

# just to check the imbalanced-ness of the new training data:
for l in [0, 1, 2]:
  print('Number of instances of class {} in the new training data: {}'.format(l, sum(YtrImbalanced_A == l)))
  
  
index_A_te = [i for i, x in enumerate(Xte) if Yte[i]!=2 or i % 5 == 0]
XteImbalanced_A = Xte[index_A_te,:] # test data set (features)
YteImbalanced_A = Yte[index_A_te] # test data set (labels)

print()
# just to check the imbalanced-ness of the new test data:
for l in [0, 1, 2]:
  print('Number of instances of class {} in the new test data: {}'.format(l, sum(YteImbalanced_A == l)))

*   We will now train a logistic classifier on this new train data which has an imbalanced proportion of classes, and investigate the new accuracy, new confusion matrix, and the new decision regions:

In [None]:
log_reg_classifier_imbalanced_A = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')
log_reg_classifier_imbalanced_A.fit(XtrImbalanced_A, YtrImbalanced_A)

print_classifier_report(XtrImbalanced_A, YtrImbalanced_A, XteImbalanced_A, YteImbalanced_A, log_reg_classifier_imbalanced_A, True)

---
> **Q2:** Compare the new decision boundary with the previous one. Describe the difference.

> **A2:** 
---

*   Note that although the overall accuracy is high, one class is being heavily misclassified.
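
The effect described above can be reproduced with hypothetical labels (not the lab's actual predictions): overall accuracy is dominated by the large classes, while per-class recall exposes the weak one. `recall_score(..., average=None)` and `balanced_accuracy_score` are existing scikit-learn metrics; the arrays `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced test set: 25, 25 and 5 samples of classes 0, 1 and 2
y_true_demo = np.repeat([0, 1, 2], [25, 25, 5])

# Hypothetical predictions, listed in the same order as y_true_demo
y_pred_demo = np.concatenate([
    np.zeros(25, dtype=int),      # class 0: 25 of 25 correct
    np.repeat([1, 2], [22, 3]),   # class 1: 22 of 25 correct
    np.repeat([2, 1], [2, 3]),    # class 2: only 2 of 5 correct
])

overall = accuracy_score(y_true_demo, y_pred_demo)
per_class = recall_score(y_true_demo, y_pred_demo, average=None)
balanced = balanced_accuracy_score(y_true_demo, y_pred_demo)

print('overall accuracy  = {:.3f}'.format(overall))
print('per-class recall  = {}'.format(per_class))
print('balanced accuracy = {:.3f}'.format(balanced))
```

Balanced accuracy is the mean of the per-class recalls, so it is not inflated by the majority classes.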

---
#### <font color='maroon'>**Exercise 1:** Which class is being heavily mis-classified? Why has this happened? <ins>[1 mark]</ins></font>
---

Prss: Virginica is heavily misclassified, because the imbalanced training set contains far fewer virginica instances than instances of the other two classes.

---
#### <font color='maroon'>**Exercise 2:** Obtain the accuracy for this class from the test dataset and identify the other class that it is being confused with. <ins>[1 mark]</ins></font>
---

Prss: 3 of the 5 virginica test instances are classified as versicolor, so the accuracy for this class is 40%, as read from the normalised test confusion matrix.

Now let's assume our classifier (which was trained using data of region A) is taken to another geographic region B, where the flower proportions are different again, specifically flower type 1 (`versicolor Iris`) is 5 times less common.


In [None]:
index_B_te = [i for i, x in enumerate(Xte) if Yte[i]!=1 or i % 5 == 0]
XteImbalanced_B = Xte[index_B_te,:] # test data set (features)
YteImbalanced_B = Yte[index_B_te] # test data set (labels)


*   To visually compare the flower frequencies:


In [None]:
fig, ax = plt.subplots()
counts_A = np.bincount(YteImbalanced_A)
counts_B = np.bincount(YteImbalanced_B)

xlocations = np.array([0, 1, 2])
ax.bar(xlocations-0.15, counts_A, width=0.3, align='center', label='A')
ax.bar(xlocations+0.15, counts_B, width=0.3, align='center', hatch='//', label='B')

ax.set_xticks([0, 1, 2])
ax.set_xticklabels(iris.target_names)
ax.set_xlim(-1,3.0)

ax.grid(axis='y', alpha=0.75)
ax.set_xlabel('Iris category')
ax.set_ylabel('Frequency')
ax.set_title('Comparing the frequency of flower types between regions A and B')
ax.legend()
plt.show()

*   So how would our model that is trained in region A perform if it is used in region B? Let's see:

In [None]:
print_classifier_report(XtrImbalanced_A, YtrImbalanced_A, XteImbalanced_B, YteImbalanced_B, log_reg_classifier_imbalanced_A, True)

*   Analyse the distribution of circle and cross points in the predictor space
and the resulting decision boundaries (note that if samples were
correctly classified, the virginica samples (crosses) should be in the gray region,  and the
versicolor samples (circles) should be in light green region).


---
#### <font color='maroon'>**Exercise 3:** What is the new accuracy for class 2 (virginica)? Compare this accuracy with the accuracy obtained in the previous section and explain any discrepancies. <ins>[1 mark]</ins></font>
---

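The new accuracy for class 2 (virginica) is 44%, up from 40% on the region A test data. The classifier itself is unchanged; the region B test set contains all 25 virginica test samples, whereas the region A test set kept only 5 of them, so the earlier 40% was estimated from very few samples.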

## **2. Naive Bayes**

In this cell we train a Naïve Bayes classifier on the dataset corresponding to region A.

Although Naïve Bayes is simple enough that we could write our own code for it, let's use scikit-learn directly (as documented [here](https://scikit-learn.org/stable/modules/naive_bayes.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)):

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb_A = GaussianNB()
gnb_A.fit(XtrImbalanced_A, YtrImbalanced_A)

print_classifier_report(XtrImbalanced_A, YtrImbalanced_A, XteImbalanced_A, YteImbalanced_A, gnb_A, True)

### Naive Bayes with priors


*   In Naïve Bayes classifiers, priors can be specified. At test time the prior is combined with the likelihood via Bayes' theorem to classify the samples.

*   By default, Bayes classifiers use the observed class frequencies in the training data to estimate
the prior, although you can specify it yourself.

*   Let's specify a balanced prior (even though the data is imbalanced) by passing `priors=[1/3, 1/3, 1/3]`.

In [None]:
gnb_A_with_uniform_priors = GaussianNB(priors=np.array([1/3,1/3,1/3]))
gnb_A_with_uniform_priors.fit(XtrImbalanced_A, YtrImbalanced_A)

print_classifier_report(XtrImbalanced_A, YtrImbalanced_A, XteImbalanced_A, YteImbalanced_A, gnb_A_with_uniform_priors, True)
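
The role of the prior can be written out explicitly. For a Gaussian Naïve Bayes model with class prior $P(y=k)$ and one Gaussian likelihood per feature, the predicted class for a sample $x = (x_1, \dots, x_d)$ is given below. The symbols $\mu_{kj}$ and $\sigma_{kj}^2$ (the mean and variance of feature $j$ within class $k$) are notation introduced here for illustration.

$$
\hat{y} = \arg\max_k \; P(y=k) \prod_{j=1}^{d} \mathcal{N}\!\left(x_j \mid \mu_{kj}, \sigma_{kj}^2\right)
$$

Changing the prior rescales the score of each class without touching the likelihood terms, which is why the decision regions shift when `priors` is set by hand.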

---
> **Q3:** Compare the new decision boundary with the previous one. Describe the difference.
Compare your results against the ones obtained by letting the prior be determined from the data, by
analyzing any changes in the decision boundaries and in the confusion matrix.

> **A3:**
---

Now, suppose as before that we only have access to data from region A, but we know that our model is going to be used in region B. We can reuse the likelihoods already learnt and set the "priors" of region B. 
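
The idea of reusing the learnt likelihoods under a different prior can be sketched as follows. The posterior under the training prior is proportional to prior times likelihood, so multiplying it by the ratio of new to old prior and renormalising gives the posterior under the new prior. The data, the `*_demo` names and the target prior `[0.45, 0.10, 0.45]` are all hypothetical; the check against `GaussianNB(priors=...)` is a sanity test, not a claim about the lab's results.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical imbalanced 3-class data in 2-D: 25, 25 and 5 samples
counts_demo = [25, 25, 5]
means_demo = [[0, 0], [2, 0], [3, 1]]
X_demo = np.vstack([np.array(m) + rng.standard_normal((n, 2))
                    for m, n in zip(means_demo, counts_demo)])
y_demo = np.repeat([0, 1, 2], counts_demo)

# Model whose prior is estimated from the (imbalanced) training frequencies
gnb_demo = GaussianNB().fit(X_demo, y_demo)
prior_train = gnb_demo.class_prior_

# Hypothetical prior of the deployment region
prior_new = np.array([0.45, 0.10, 0.45])

# Re-weight the posteriors: divide out the old prior, multiply in the new one
posterior_train = gnb_demo.predict_proba(X_demo)
reweighted = posterior_train * prior_new / prior_train
posterior_new = reweighted / reweighted.sum(axis=1, keepdims=True)

# Same likelihoods, new prior passed directly to GaussianNB
gnb_new_prior = GaussianNB(priors=prior_new).fit(X_demo, y_demo)
print(np.allclose(posterior_new, gnb_new_prior.predict_proba(X_demo)))
```

In practice, passing `priors` to `GaussianNB` is the simpler route; the manual re-weighting only illustrates that the per-class Gaussians do not change.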


---
#### <font color='maroon'>**Exercise 4:** What prior should you use to get maximum accuracy in region B? What accuracy do you get by using this value? <ins>[1 mark]</ins></font>
---
Note that you should be able to get a higher accuracy than by using the MaxEnt classifier from
region A in B as we did before. Also keep in mind that a good prior should reflect the
relative frequency of flowers in region B.

Use the following code block to answer this exercise.

Prss: The prior should be set to (25/55, 5/55, 25/55), matching the class frequencies in region B. The accuracy increases to 81.8%.

In [None]:
# code block to be used to answer the above exercise
  
# likelihoods are learnt from region A data; the priors reflect the class frequencies of region B
gnb_A_with_region_B_priors = GaussianNB(priors=np.array([25/55,5/55,25/55]))
gnb_A_with_region_B_priors.fit(XtrImbalanced_A, YtrImbalanced_A)

print_classifier_report(XtrImbalanced_A, YtrImbalanced_A, XteImbalanced_B, YteImbalanced_B, gnb_A_with_region_B_priors, True)

## **3. Scaling with the number of dimensions (a.k.a. attributes, features)**

Run the following cell. You will create a dataset consisting of 25 examples for each of 2 classes (so a total of 50 instances):

In [None]:
import numpy as np

np.random.seed(2)

dim = 2       # Generate data with only 2 dimensions.
#dim = 200   # Generate data with 200 dimensions.
m1, m2, s1, s2 = [1, 0, 1, 1]
numberInstances=25

Ytr = np.append(np.ones(numberInstances, dtype = int), np.zeros(numberInstances, dtype = int), axis = 0)
Xtr = np.append(m1 + s1 * np.random.randn(numberInstances, dim),  m2 + s2 * np.random.randn(numberInstances, dim), axis = 0)
Yte = np.append(np.ones(numberInstances, dtype = int), np.zeros(numberInstances, dtype = int), axis = 0)
Xte = np.append(m1 + s1 * np.random.randn(numberInstances, dim),  m2 + s2 * np.random.randn(numberInstances, dim), axis = 0)
#print(Xtr)

As you can see in the following figure, the dataset is weakly separable.



In [None]:
import matplotlib.pyplot as plt

# split the training data based on the class labels
train_Label_0 = Xtr[Ytr == 0]
train_Label_1 = Xtr[Ytr == 1]

# plot the data
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111)
ax.set_aspect('equal')

ax.scatter(train_Label_0[:,0], train_Label_0[:,1], 
            c = 'b', marker = 'o', label = 'Label 0')
ax.scatter(train_Label_1[:,0], train_Label_1[:,1], 
            c = 'r', marker = 'x', label = 'Label 1')

# set a title and labels
ax.set_title('Plot of the $training$ dataset', fontsize = 18)
ax.set_xlabel("Attribute 1", fontsize = 14)
ax.set_ylabel("Attribute 2", fontsize = 14)

# set dimensions of x and y axes 
ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.grid(alpha=0.2)

# set legend
ax.legend(fontsize=14)

# show the plot
plt.show()

*   Run logistic regression. Note the train and test accuracy.


In [None]:
from sklearn.linear_model import LogisticRegression

log_reg_classifier = LogisticRegression(C=1e5, solver='lbfgs')
log_reg_classifier.fit(Xtr, Ytr)

print('Train accuracy (Logistic Regression): {:.2f}'.format(log_reg_classifier.score(Xtr, Ytr)))
print('Test accuracy (Logistic Regression): {:.2f}'.format(log_reg_classifier.score(Xte, Yte)))


*   Run naïve Bayes. Note the train and test accuracy. They should be similar.



In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_classifier = GaussianNB()
gnb_classifier.fit(Xtr, Ytr)

print('Train accuracy (Naive Bayes): {:.2f}'.format(gnb_classifier.score(Xtr, Ytr)))
print('Test accuracy (Naive Bayes): {:.2f}'.format(gnb_classifier.score(Xte, Yte)))


*   Now let's suppose that instead of 2 attributes, there are 200 attributes, each of
which is weakly informative.

  Set `dim = 200` in the data-generation cell at the start of this section and re-run the remaining cells.
* Note the train and test accuracy of each approach.
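
One way to see why many weakly informative attributes can help: for two Gaussian classes with equal priors, identity covariance and means that differ by the same amount in every attribute (as in the data generated above, where `m1 - m2 = 1` and `s1 = s2 = 1`), the distance between the class means grows like the square root of the number of attributes. The sketch below computes the error of an ideal classifier that knows the true distributions. It is a theoretical reference only, not the accuracy a classifier trained on 50 samples will reach; the function name `ideal_error` is hypothetical.

```python
import math

def ideal_error(dim, mean_gap=1.0, sigma=1.0):
    """Error of the ideal classifier for two equally likely isotropic Gaussians.

    The class means differ by `mean_gap` in each of `dim` attributes, so their
    Euclidean distance is mean_gap * sqrt(dim). The error is Phi(-distance / (2 * sigma)),
    where Phi is the standard normal CDF, written here with math.erfc.
    """
    distance = mean_gap * math.sqrt(dim)
    return 0.5 * math.erfc(distance / (2.0 * sigma * math.sqrt(2.0)))

for dim_demo in [2, 20, 200]:
    print('dim = {:3d}: ideal error = {:.3g}'.format(dim_demo, ideal_error(dim_demo)))
```

With few training samples and many attributes, a flexible model may still overfit, so the train and test accuracies observed in the lab can differ from this ideal.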

Prss: Increasing the number of attributes increases the accuracy. With 2 dimensions the data was not linearly separable; with 200 attributes the classes become easy to separate (for example, like folding a 2-D sheet of paper into more dimensions).

---
#### <font color='maroon'>**Exercise 5:** Compare the performance of both classifiers in the 2-feature scenario with the performance in the 200-feature scenario and explain any differences you might observe. <ins>[1 mark]</ins></font>
---

## 4. Exploring ROC curves

*   In part 2, we worked with imbalanced datasets. Specifically, we observed that minority classes are likely to be misclassified. One way to
address this is to adjust the boundary of the classifier.


*   The following cell generates data similar to what we used in the previous part (with 2 dimensions only) and trains MaxEnt and Naïve Bayes models on its training data.

In [None]:
import numpy as np

np.random.seed(2)

dim = 2     
m1, m2, s1, s2 = [1, 0, 1, 1]
numberInstances=50

Ytr = np.append(np.ones(numberInstances, dtype = int), np.zeros(numberInstances, dtype = int), axis = 0)
Xtr = np.append(m1 + s1 * np.random.randn(numberInstances, dim),  m2 + s2 * np.random.randn(numberInstances, dim), axis = 0)
Yte = np.append(np.ones(numberInstances, dtype = int), np.zeros(numberInstances, dtype = int), axis = 0)
Xte = np.append(m1 + s1 * np.random.randn(numberInstances, dim),  m2 + s2 * np.random.randn(numberInstances, dim), axis = 0)


log_reg_classifier = LogisticRegression(solver='lbfgs')
log_reg_classifier.fit(Xtr, Ytr)

gnb_classifier = GaussianNB()
gnb_classifier.fit(Xtr, Ytr)

print('Done')

*   In the following, we change the decision boundary by
adjusting a threshold that by default is 0.5. Try different values for the
threshold and observe how the confusion matrix and accuracy change. By
changing the threshold value (experiment with it!), you can make the classifier prefer class 1
over class 0, which could be useful in an application where the two classes have
different importance.


In [None]:
y_test_probilities_LogisticRegression = log_reg_classifier.predict_proba(Xte)[:,1]
y_test_probilities_NaiveBayes = gnb_classifier.predict_proba(Xte)[:,1]

thr = 0.3

from sklearn.metrics import recall_score, accuracy_score, precision_score, confusion_matrix

y_test_pred_LR = y_test_probilities_LogisticRegression > thr
y_test_pred_NB = y_test_probilities_NaiveBayes > thr


print('with threshold {}, the accuracy score of Naive Bayes (on test data) is: \t{:.2f}'.format(thr, accuracy_score(Yte,y_test_pred_NB)))
print('with threshold {}, the recall score of Naive Bayes (on test data) is: \t{:.2f}'.format(thr, recall_score(Yte,y_test_pred_NB)))
print('with threshold {}, the precision score of Naive Bayes (on test data) is: \t{:.2f}'.format(thr, precision_score(Yte,y_test_pred_NB)))
print('-'*20)
print('with threshold {}, the accuracy score of Logistic Regression (on test data) is: \t{:.2f}'.format(thr, accuracy_score(Yte,y_test_pred_LR)))
print('with threshold {}, the recall score of Logistic Regression (on test data) is: \t{:.2f}'.format(thr, recall_score(Yte,y_test_pred_LR)))
print('with threshold {}, the precision score of Logistic Regression (on test data) is: \t{:.2f}'.format(thr, precision_score(Yte,y_test_pred_LR)))

import pandas as pd

print('\n'+'='*30+'\n')
print('with threshold {}, confusion matrix on test data for Naive Bayes classifier:'.format(thr))
print(pd.DataFrame(confusion_matrix(Yte, y_test_pred_NB),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))

print('-'*20)

print('with threshold {}, confusion matrix on test data for Logistic Regression classifier:'.format(thr))
print(pd.DataFrame(confusion_matrix(Yte, y_test_pred_LR),
                 columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))
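
To see the effect of the threshold in isolation, the sketch below sweeps three thresholds over a tiny set of hypothetical labels and class-1 probabilities (invented for illustration, not taken from the classifiers above) and records accuracy, precision and recall for each.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and predicted probabilities of class 1
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_class1_demo = np.array([0.10, 0.30, 0.40, 0.60, 0.35, 0.55, 0.70, 0.90])

results_demo = {}
for thr_demo in [0.2, 0.5, 0.8]:
    # Predict class 1 whenever its probability exceeds the threshold
    y_pred_demo = (p_class1_demo > thr_demo).astype(int)
    results_demo[thr_demo] = (
        accuracy_score(y_true_demo, y_pred_demo),
        precision_score(y_true_demo, y_pred_demo),
        recall_score(y_true_demo, y_pred_demo),
    )
    print('thr={:.1f}  accuracy={:.2f}  precision={:.2f}  recall={:.2f}'.format(
        thr_demo, *results_demo[thr_demo]))
```

In this toy example, raising the threshold trades recall for precision, while accuracy peaks at the intermediate threshold.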



---
> **Q4:** Describe and then briefly explain the trend in the accuracy, precision and recall of a classifier with respect to the value of the threshold.


> **A4:**
---

*   The following code, [modified from here](https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65), generates and plots the ROC curve, which shows the true positive rate against the false positive rate over a range of threshold values, and reports the AUC (area under the curve):

In [None]:
from sklearn.metrics import roc_curve, auc


plt.figure(figsize=(8,8))
plt.title('ROC Curve')
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([-0.005, 1, 0, 1.005])
plt.xticks(np.arange(0,1, 0.05), rotation=90)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
    

# ROC curve and AUC for logistic regression
fpr, tpr, auc_thresholds = roc_curve(Yte, y_test_probilities_LogisticRegression)
print('AUC (Logistic Regression): {:.3f}'.format(auc(fpr, tpr)))
plt.plot(fpr, tpr, linewidth=3, alpha=0.7, label='Logistic Regression')

# ROC curve and AUC for Naive Bayes
fpr, tpr, auc_thresholds = roc_curve(Yte, y_test_probilities_NaiveBayes)
print('AUC (Naive Bayes): {:.3f}'.format(auc(fpr, tpr)))
plt.plot(fpr, tpr, linewidth=3, alpha=0.7, label='Naive Bayes')

plt.legend(loc='best')
plt.show()
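
A useful way to read the AUC: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this on a small set of hypothetical labels and scores (invented for illustration, with no tied scores) by comparing the area under `roc_curve` with the fraction of correctly ordered positive/negative pairs.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and scores for class 1
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([0.10, 0.30, 0.40, 0.60, 0.35, 0.55, 0.70, 0.90])

# Area under the ROC curve, computed from the curve itself
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, scores_demo)
auc_from_curve = auc(fpr_demo, tpr_demo)

# Fraction of (positive, negative) pairs in which the positive scores higher
pos_scores = scores_demo[y_true_demo == 1]
neg_scores = scores_demo[y_true_demo == 0]
auc_from_ranking = np.mean(pos_scores[:, None] > neg_scores[None, :])

print(auc_from_curve, auc_from_ranking)
```

Because the AUC depends only on the ordering of the scores, it does not change when the decision threshold is moved.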


