<div style="text-align: right"> <b>Last Updated:</b> 9JUNE2020 </div>

# Classification

__Authors:__ Dale Bowman, PhD; Natasha A Sahr, PhD

The classification problem involves training a _classifier_ to determine to which of several groups a new observation vector should be assigned. Many different types of classifiers can be trained.

The difference between classification and clustering is simple. In clustering, the grouping in the dataset is unknown or latent and we are searching for these unknown patterns in the data. In classification, we have a dataset where the grouping is known and want to use that information to build a tool to classify a new object whose group is unknown. 

Clustering is an _unsupervised_ method since the correct classes are unknown. Classification is a _supervised_ method since the classes are known. 

We will look at two commonly used classifiers: __linear discriminant analysis__ (LDA) and $\mathbf{k}$__-nearest neighbors__ (KNN).

## Linear Discriminant Analysis

Suppose we have a response variable $Y$ which can belong to one of $C$ different categories (classes). Associated with $Y$ is a set of features, $\mathbf{X}$. For each of the $C$ categories there is an associated __prior probability__ of belonging to that class, denoted $\pi_i$ for $i=1,\dots,C$. If no information is known about the class probabilities, we can set $\pi_1=\pi_2=\cdots=\pi_C$. For each category of $Y$, there is an associated _density function_, $f_i(\mathbf{x}) = P(\mathbf{X} = \mathbf{x}| Y=i)$. That is, $f_i(\mathbf{x})$ is the probability of observing $\mathbf{X} = \mathbf{x}$ given that $Y$ is in category $i$. The value $f_i(\mathbf{x})$ will be relatively high if observations in class $i$ are likely to have features close to $\mathbf{x}$ and relatively low otherwise.

To find the probability that $Y$ is in class $k$ given the value of the observed vector $\mathbf{X}$, we use _Bayes' theorem_, which states that

$$Pr(Y=k|\mathbf{X}= \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{\sum_{i=1}^C \pi_i f_i(\mathbf{x})}.$$

We assign $Y$ to the class with the highest probability. The probability $Pr(Y=k|\mathbf{X}= \mathbf{x})$ is called a __posterior probability__. In general, we won’t know the value of the prior probabilities, $\pi_i$’s, or the form of the density functions, $f_i(\mathbf{x})$.
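As a minimal numerical sketch of Bayes' theorem above, the snippet below computes the posterior probabilities for three classes and picks the class with the largest one. The prior and density values are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper, not a library function.

```python
import numpy as np

def posterior(priors, densities):
    """Return Pr(Y=k | X=x) for each class k via Bayes' theorem."""
    priors = np.asarray(priors, dtype=float)
    densities = np.asarray(densities, dtype=float)
    joint = priors * densities      # numerator: pi_k * f_k(x)
    return joint / joint.sum()      # divide by sum_i pi_i * f_i(x)

# Illustrative values only: assumed priors and assumed densities evaluated at x
priors = [0.5, 0.3, 0.2]
densities = [0.10, 0.40, 0.25]

post = posterior(priors, densities)
print(post)
print("assigned class:", post.argmax() + 1)
```

Note that the class with the largest prior is not necessarily the one chosen: here the second class wins because its density at $\mathbf{x}$ is much higher.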

### When $p=1$

The LDA classifier can be illustrated when there is only one feature ($p=1$). For this case, LDA assumes a _normal_ (or Gaussian) distribution for $f_i(\mathbf{x})$. The general equation for the normal density is 

$$f_i(x) = \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp \left( -\frac{1}{2 \sigma_i^2} (x-\mu_i)^2 \right).$$

In the normal distribution, $\mu_i$ is the __mean__ (measuring central tendency) and $\sigma_i^2$ is the __variance__ (measuring dispersion). It is symmetric about the mean.

The LDA classifier makes a further assumption that the variances are equal across categories, i.e. $\sigma_1^2=\sigma_2^2=\cdots=\sigma_C^2=\sigma^2$. If this assumption is not reasonable, a _quadratic discriminant analysis_ can be used. Using the normal model the posterior probability becomes, 

$$P(Y=k|X=x) = \frac{ \pi_k \exp \left(-\frac{1}{2\sigma^2} (x-\mu_k)^2\right)}{\sum_{i=1}^C \pi_i \exp \left(-\frac{1}{2\sigma^2} (x-\mu_i)^2\right)}$$

A new observation with feature value $x_0$ is classified into the category for which this probability is the biggest. If we take the log of this equation and drop some common terms, this is equivalent to choosing to assign $x_0$ to the category for which $\delta_k$ is largest, for 

$$\delta_k(x_0) = x_0 \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k).$$

We can illustrate the LDA classifier for the case where $C=2$ and $\pi_1=\pi_2=0.5$ in Figure 1 below. The <font color="red"> red line is a normal distribution with $\mu = - 1.25$ and $\sigma^2 = 1$</font> and the <font color="blue"> blue line is a normal distribution with $\mu = 1.25$ and  $\sigma^2 = 1$</font>. The black line ($x = 0$) shows the _decision boundary_ found using the LDA classifier. The response, $Y$, will be classified into: 
- <font color="red"> the red group for values of $X < 0$</font>; and,
- <font color="blue"> blue group for values of $X > 0$</font>.

__Figure 1__

<body>
    <div class="img-box">
        <img src="images/lda-1.jpg" alt="img1" style="width:100%" />
    </div>
</body>
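The decision rule illustrated in Figure 1 can be sketched numerically. The snippet below evaluates the discriminant $\delta_k$ for the two classes in the figure (means $-1.25$ and $1.25$, common variance 1, equal priors) at a few arbitrarily chosen values of $x_0$. The function name `lda_discriminant` and the test points are assumptions made for illustration.

```python
import numpy as np

def lda_discriminant(x0, mu, sigma2, prior):
    """delta_k(x0) = x0 * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)."""
    return x0 * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(prior)

mus = np.array([-1.25, 1.25])      # class means: red, blue
sigma2 = 1.0                       # common variance
priors = np.array([0.5, 0.5])      # equal prior probabilities
labels = np.array(["red", "blue"])

for x0 in (-2.0, -0.1, 0.3, 1.5):
    deltas = lda_discriminant(x0, mus, sigma2, priors)
    print(x0, labels[deltas.argmax()])   # class with the largest discriminant

# With equal priors and a common variance, the boundary is the midpoint of the means
boundary = mus.mean()
print("decision boundary:", boundary)
```

Negative values of $x_0$ are assigned to the red group and positive values to the blue group, matching the boundary at $x = 0$.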

Of course, in practice we don’t know the population mean or variance of the $C$ classes, so we estimate these parameters from the training data. We use the sample means within each category to estimate the $\mu_i$’s and the pooled sample variance $S^2$, computed from the deviations of all observations about their own class means, to estimate the common variance $\sigma^2$. We can estimate the prior probabilities using the proportions of responses in each category in our training data. For example, if we have 20 observations in category 1 and a total of 100 observations, we would estimate $\pi_1$ as $\frac{20}{100} = 0.2$.
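A minimal sketch of these estimates for a single feature is given below. The data are made up for illustration, and the common variance is estimated with one common choice, the pooled within-class variance with divisor $n - C$; treat that divisor as an assumption rather than the only possible estimator.

```python
import numpy as np

# Made-up one-feature training data for two classes (illustration only)
x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([1, 1, 1, 2, 2, 2, 2])

classes = np.unique(y)
n, C = len(x), len(classes)

# Prior estimates: class proportions in the training data
pi_hat = np.array([(y == k).mean() for k in classes])
# Mean estimates: sample mean within each class
mu_hat = np.array([x[y == k].mean() for k in classes])
# Pooled variance: within-class squared deviations divided by n - C
ss_within = sum(((x[y == k] - x[y == k].mean()) ** 2).sum() for k in classes)
sigma2_hat = ss_within / (n - C)

print(pi_hat, mu_hat, sigma2_hat)
```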

For the case where $p>1$, the LDA is constructed similarly using a multivariate normal distribution. This distribution assumes that each of the $p$ features in the observation vector, $\mathbf{X}$, has a normal distribution and it takes into account any linear relationships between the features using the correlation between them.
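For reference, under the standard multivariate normal model with class mean vectors $\boldsymbol{\mu}_k$ and a common covariance matrix $\boldsymbol{\Sigma}$ (notation introduced here for illustration), the discriminant takes the following form, and a new observation $\mathbf{x}_0$ is again assigned to the class with the largest value:

$$\delta_k(\mathbf{x}_0) = \mathbf{x}_0^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_k + \log(\pi_k).$$

When $p=1$, $\boldsymbol{\Sigma}$ reduces to $\sigma^2$ and this expression reduces to the one-feature discriminant given earlier.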

### LDA Programming Example

We have previously used the `iris.csv` dataset. The `iris.csv` dataset contains 5 variables:

- `SepalLength`: the sepal length (cm)
- `SepalWidth`: the sepal width (cm)
- `PetalLength`: the petal length (cm)
- `PetalWidth`: the petal width (cm)
- `Species`: the flower species

Import `pandas` as `pd`. Use the function `read_csv` in `pandas` to read in the `iris.csv` dataset and name it `dta_iris`. 

__Note__: The variable names are not found in the file and the file does not have an index.

Print the first 5 observations to familiarize yourself with the data again.

In [6]:
import pandas as pd

dta_iris = pd.read_csv("datasets/iris.csv", 
                       header=None,
                       names=['SepalLength',
                              'SepalWidth',
                              'PetalLength',
                              'PetalWidth',
                              'Species'])

dta_iris.head(5)

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


To perform LDA, import `numpy` as `np` and `pandas` as `pd`. We will need these libraries. 

In [5]:
import numpy as np
import pandas as pd

Once the dataset is loaded into a dataframe object, the first step is to separate the dataset into features and corresponding labels. 

Extract the first 4 columns of `dta_iris` using the `iloc` indexer. These columns correspond to the features. Use the `values` attribute to convert the dataframe into an array. Name the resulting object `X`.

In [9]:
X = dta_iris.iloc[:, 0:4].values

Extract the last column of `dta_iris` using the `iloc` indexer. This column corresponds to the labels. Use the `values` attribute to convert the column into an array. Name the resulting object `Y`.

In [10]:
Y = dta_iris.iloc[:, 4].values

We want to split our features and labels into a training set (`X_train`, `Y_train`) and a testing set (`X_test`, `Y_test`). To do this, we need to import the `train_test_split` function from `sklearn.model_selection`.

In [11]:
from sklearn.model_selection import train_test_split

Use the `train_test_split` function with the arguments
- `X`: the features in an array
- `Y`: the labels in an array
- `test_size=0.2`: the proportion of the dataset to include in the test split

The function returns four arrays, which we name `X_train`, `X_test`, `Y_train`, and `Y_test`.

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
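Because `train_test_split` shuffles the rows at random, each run of the cell above produces a different split. The sketch below shows two optional arguments that are often useful: `random_state`, which fixes the shuffle so results can be reproduced, and `stratify`, which keeps the class proportions the same in both splits. The toy arrays and the seed value 42 are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature array and the label array (illustration only)
X_demo = np.arange(40).reshape(20, 2)
Y_demo = np.array(["a"] * 10 + ["b"] * 10)

# random_state fixes the shuffle; stratify keeps class proportions in both splits
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42, stratify=Y_demo
)

print(X_tr.shape, X_te.shape)
print(np.unique(Y_te, return_counts=True))
```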

We need to scale the features so `SepalWidth`, `SepalLength`, `PetalLength`, and `PetalWidth` are on the same scale. To do this, we need to from `sklearn.preprocessing` import the `StandardScaler` function.

In [23]:
from sklearn.preprocessing import StandardScaler

Since `SepalLength`, `SepalWidth`, `PetalLength`, and `PetalWidth` are in `X_train` and `X_test`, we will transform both. The means and standard deviations used for scaling are estimated from the training set only and then applied to the test set, so that both sets are transformed in exactly the same way.

First, create a `StandardScaler()` object and name it `scaler`. Use the function `fit_transform` with the object `X_train` on `scaler`. Name the result `X_train1`. 

Next, use the function `transform` with the object `X_test` on `scaler`. Name the result `X_test1`. 

In [25]:
scaler = StandardScaler()
X_train1 = scaler.fit_transform(X_train)
X_test1 = scaler.transform(X_test)
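To make concrete what `StandardScaler` computes, the sketch below standardizes a small made-up feature matrix and reproduces the result by hand: each column has its mean subtracted and is divided by its standard deviation. The array `X_demo` and the name `scaler_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with two columns on different scales (illustration only)
X_demo = np.array([[5.1, 200.0],
                   [4.9, 180.0],
                   [6.3, 260.0],
                   [5.8, 240.0]])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# The same result by hand: subtract the column mean, divide by the column sd
# (StandardScaler uses the population sd, i.e. ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled.round(3))
print(np.allclose(X_scaled, X_manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates distance-based methods such as KNN.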

Whew! There is a lot of data preparation required prior to training a classifier. Thankfully, LDA implementation requires only four lines of code using `scikit-learn`. 

1. From `sklearn.discriminant_analysis` import the function `LinearDiscriminantAnalysis` named as `LDA`.
2. Use the function `LDA` with the argument `n_components=1` and name it `lda_iris`. The parameter `n_components` refers to the number of linear discriminants we want to retrieve. In setting this argument to `1`, we are checking the performance of a classifier with a single linear discriminant.
3. Use the function `fit_transform` with arguments `X_train1` and `Y_train` on the `lda_iris` object and name it `lda_iris_fit`.
4. Use the function `transform` with argument `X_test1` on the `lda_iris` object and name it `lda_iris_transform`. 

In [37]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda_iris = LDA(n_components=1)
lda_iris_fit = lda_iris.fit_transform(X_train1, Y_train)
lda_iris_transform = lda_iris.transform(X_test1)

The `lda_iris_fit` object contains the training linear discriminants. The `lda_iris_transform` object contains the testing linear discriminants.

We want to use the LDA to predict classes, hence _classification_. Use the function `predict` with the argument `X_test1` on the object `lda_iris` and name it `lda_pred_iris_labels`. This named object will contain the predicted labels for the features in the test dataset. 

In [42]:
lda_pred_iris_labels = lda_iris.predict(X_test1)

Sometimes, we may want to evaluate the performance by extracting the predicted probabilities that the observation (row of features) would fall into each class. To get this matrix, use the function `predict_proba` with argument `X_test1` on the object `lda_iris` and name it `lda_pred_iris_prob`. 

In [41]:
lda_pred_iris_prob = lda_iris.predict_proba(X_test1)
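The relationship between `predict_proba` and `predict` can be checked with a short sketch. It uses the copy of the iris data bundled with `scikit-learn` (`load_iris`) instead of `iris.csv`, and a fixed `random_state`; both are assumptions made so that the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X_demo, Y_demo = load_iris(return_X_y=True)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

lda_demo = LinearDiscriminantAnalysis().fit(X_tr, Y_tr)
proba = lda_demo.predict_proba(X_te)   # one row per observation, one column per class
labels = lda_demo.predict(X_te)

print(proba[:3].round(3))
# Each row sums to 1, and the predicted label is the class with the largest posterior
print(np.allclose(proba.sum(axis=1), 1.0))
print(np.array_equal(lda_demo.classes_[proba.argmax(axis=1)], labels))
```

The columns of the probability matrix follow the order of the `classes_` attribute of the fitted object.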

We now have performed all steps to train and predict classes with an LDA classifier. 

## $k$-Nearest Neighbor

The $k$-nearest neighbor (KNN) method of classification is also based on the posterior probabilities of the response being in a category given the observed vectors. The difference is in how this probability is estimated. For a given value of $K$ and a new observation vector, $x_0$, first the $K$ observation vectors in the data set that are nearest to $x_0$ are found. Out of this set of $K$ vectors, the posterior probability of class $j$ is estimated as the number of the $K$ nearest neighbors that are in class $j$ divided by $K$. The new observation vector, $x_0$, is then classified into the class with the highest posterior probability.
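The KNN estimate of the posterior probability can be sketched in a few lines of `numpy`. The helper `knn_posterior`, the toy training points, and the use of Euclidean distance to define "nearest" are all assumptions made for illustration.

```python
import numpy as np

def knn_posterior(X_train, y_train, x0, K):
    """Estimate Pr(Y=j | X=x0) as the share of the K nearest neighbors in class j."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:K]                # indices of the K closest points
    classes = np.unique(y_train)
    counts = np.array([(y_train[nearest] == j).sum() for j in classes])
    return classes, counts / K

# Made-up training points in two dimensions (illustration only)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

classes, post = knn_posterior(X_toy, y_toy, x0=np.array([1.0, 1.0]), K=5)
print(dict(zip(classes, post)))
print("assigned class:", classes[post.argmax()])
```

Three of the five nearest neighbors belong to class A, so the estimated posterior probabilities are 0.6 for A and 0.4 for B, and the new point is assigned to A.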

The choice of $K$ has a great impact on the KNN classifier. When $K$ is low, the classifier tends to be overly flexible, while for $K$ too high the classifier becomes less flexible and will make more mistakes. If we use a part of the data to train the classifier (_training data set_) and then test it on the rest of the data (_test data set_), then for $K$ too low, you will get low error rates on the training set but you may get very high error rates on the test set. This implies the classifier has been _overfit_. On the other hand, if $K$ is too high, the error rate will be high on both the training set and the test set, and the classifier will not be useful for classification.
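One way to see the effect of $K$ is to compare training and test error rates over a few values. The sketch below does this on synthetic data generated with `make_classification`; the data, the seed, and the particular values of $K$ are assumptions made for illustration, and the exact error rates will differ for other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (illustration only)
X_demo, Y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

results = {}
for K in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, Y_tr)
    # error rate = 1 - accuracy, on the training set and on the test set
    results[K] = (1 - knn.score(X_tr, Y_tr), 1 - knn.score(X_te, Y_te))
    print(K, results[K])
```

With $K=1$ every training point is its own nearest neighbor, so the training error is zero even though the test error is not, which is the overfitting pattern described above.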

### $k$-Nearest Neighbor Programming Example

We will use the same `iris.csv` dataset as in the LDA example. Recall that the data have already been preprocessed for use. We have: 

 - the training set: `X_train1` the features, `Y_train` the labels
 - the testing set: `X_test1` the features, `Y_test` the labels
 
To perform the KNN classifier, we need to import `KNeighborsClassifier` from `sklearn.neighbors` as `KNNclass`.

In [67]:
from sklearn.neighbors import KNeighborsClassifier as KNNclass

Use the function `KNNclass` with the argument `n_neighbors=5` and assign the object to `knnclass_iris`. The choice of the number of neighbors will depend on classifier tuning. For now, we will use 5 neighbors, which is a commonly used value and the default in `scikit-learn`. 

In [68]:
knnclass_iris = KNNclass(n_neighbors=5)

Fit the KNN algorithm using the function `fit` with arguments `X_train1` and `Y_train` on `knnclass_iris`. Name the object `knnclass_iris_fit`. 

In [70]:
knnclass_iris_fit = knnclass_iris.fit(X_train1, Y_train)

To get the predicted classes from the KNN algorithm, use the function `predict` with argument `X_test1` on `knnclass_iris` and name the resulting array `knnclass_pred_iris_labels`. 

In [72]:
knnclass_pred_iris_labels = knnclass_iris.predict(X_test1)

We now have performed all steps to train and predict classes with a KNN classifier. 

## Assessing the Classifier

### Confusion Matrix

For simplicity, consider a classifier where the response variable has only two classes, say positive (1) and negative (0). A _confusion matrix_ is a tool to evaluate the predictive value of the classifier. Figure 2 below is an example of a confusion matrix for a two category classifier.

__Figure 2__

<body>
    <div class="img-box">
        <img src="images/confusionMatrixSimple.png" alt="img1" style="width:100%" />
    </div>
</body>

__Source:__ https://justmachinelearning.com/2019/09/26/simple-guide-to-the-confusion-matrix/

A prediction is a true positive (TP) if the actual value was positive and the classifier predicted it to be positive. A prediction is a true negative (TN) if the actual value was negative and the classifier predicted it to be negative. TP and TN are correct predictions and should be large for a _good classifier_. 

A prediction is a false positive (FP) if the actual response was negative and the predicted response was positive. A prediction is a false negative (FN) if the actual response was positive and the classifier predicted it to be negative. FP and FN are errors and should be small for a _good classifier_. 

Consider a medical setting for evaluating diagnostic tests. TP is the number of people that have the disease and the test correctly showed that they have the disease. TN is the number of people that don’t have the disease and the test correctly says they don’t. FP is the number of people who do not have the disease but the test incorrectly predicts that they do and FN is the number of people who do have the disease but the test predicts they don’t.

There are some additional measures used to assess the classifier.

The _sensitivity_, also called _recall_ or the true positive rate, is defined as the ratio of true positives to the total number of actual positives, $\frac{TP}{TP+FN}$. The sensitivity will be between 0 and 1. The better the classifier, the larger the sensitivity.

The _specificity_, also called the true negative rate, is a similar measure for the negative response. It is the ratio of true negatives to the total number of actual negatives, $\frac{TN}{TN+FP}$. The specificity is also between 0 and 1. The better the classifier, the larger the specificity. Specificity should not be confused with _precision_, which is the ratio of true positives to the total number of predicted positives, $\frac{TP}{TP+FP}$.

A measure of the error that is often used is the _false positive rate_ (FPR). The FPR measures how many of the actual negative responses were predicted as positive by the classifier, $FPR = \frac{FP}{FP+TN} = 1 - \text{specificity}$. The better the classifier, the smaller the FPR. The related _false discovery rate_ (FDR) instead measures how many of the predicted positives are actually negative, $FDR = \frac{FP}{FP+TP} = 1 - \text{precision}$.

It is possible to construct confusion matrices for situations where there are more than two classes. In that case the sensitivity, specificity, and FPR are computed for each class in turn, treating that class as positive and all other classes as negative.
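The two-class measures can be computed directly from the four counts in the confusion matrix. In the sketch below, `binary_metrics` is a hypothetical helper and the counts are made up for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return common two-class metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr": fp / (fp + tn),           # false positive rate = 1 - specificity
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
    }

# Made-up counts for a diagnostic test (illustration only)
m = binary_metrics(tp=40, fn=10, fp=5, tn=45)
print(m)
```

For these counts the sensitivity is 0.8, the specificity is 0.9, and the false positive rate is 0.1.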

An expanded graphic is in Figure 3. Can you spot the metric missing?

__Figure 3__

<body>
    <div class="img-box">
        <img src="images/confusionMatrix.jpg" alt="img1" style="width:100%" />
    </div>
</body>

__Source:__ https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html

### LDA Programming Example cont.

Remember our LDA programming example on the `iris.csv` dataset. We now want to evaluate the performance of the classifier for our test dataset. Our observed labels were called `Y_test` and our predicted labels were called `lda_pred_iris_labels`. 

We first need to import `confusion_matrix`, `classification_report`, and `accuracy_score` from `sklearn.metrics`.

In [47]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

To get the confusion matrix, use the function `confusion_matrix` with the arguments `Y_test` and `lda_pred_iris_labels`. Name the object `lda_iris_cm`. Print the confusion matrix `lda_iris_cm` using the `print` function.

In [73]:
lda_iris_cm = confusion_matrix(Y_test, lda_pred_iris_labels)
print(lda_iris_cm)

[[11  0  0]
 [ 0  8  2]
 [ 0  0  9]]


To get the accuracy, use the function `accuracy_score` with the arguments `Y_test` and `lda_pred_iris_labels`. Name the object `lda_iris_acc`. Print the accuracy `lda_iris_acc` using the `print` function.

In [52]:
lda_iris_acc = accuracy_score(Y_test, lda_pred_iris_labels)
print(lda_iris_acc)

0.9333333333333333


To get the precision and recall for each of the classes, use the function `classification_report` with the arguments `Y_test` and `lda_pred_iris_labels`. Name the object `lda_iris_met`. Print the classification report `lda_iris_met` using the `print` function.

In [57]:
lda_iris_met = classification_report(Y_test, lda_pred_iris_labels)
print(lda_iris_met)

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.80      0.89        10
 Iris-virginica       0.82      1.00      0.90         9

       accuracy                           0.93        30
      macro avg       0.94      0.93      0.93        30
   weighted avg       0.95      0.93      0.93        30
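The figures in the classification report can be recovered from the confusion matrix printed earlier. The sketch below re-enters that matrix by hand and computes recall, precision, and accuracy; in the `scikit-learn` convention, rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix from the LDA example: rows are actual classes, columns are predicted
cm = np.array([[11, 0, 0],
               [0, 8, 2],
               [0, 0, 9]])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual count per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count per class
accuracy = np.trace(cm) / cm.sum()         # all correct / all observations

print(recall.round(2), precision.round(2), round(accuracy, 2))
```

The two versicolor flowers predicted as virginica lower the recall for versicolor to 0.80 and the precision for virginica to 0.82, as in the report.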



### $k$-Nearest Neighbor Programming Example cont.

We can also evaluate the performance of our KNN classifier. Our observed labels were called `Y_test` and our predicted labels were called `knnclass_pred_iris_labels`. 

Generate the confusion matrix as done previously with the function `confusion_matrix`. Name the confusion matrix for the KNN classifier as `knnclass_iris_cm`. Print the confusion matrix with the `print` function.

In [74]:
knnclass_iris_cm = confusion_matrix(Y_test, knnclass_pred_iris_labels)
print(knnclass_iris_cm)

[[11  0  0]
 [ 0  8  2]
 [ 0  0  9]]


Generate the classification report as done previously with the function `classification_report`. Name the classification report for the KNN classifier as `knnclass_iris_met`. Print the classification report with the `print` function. 

In [75]:
knnclass_iris_met = classification_report(Y_test, knnclass_pred_iris_labels)
print(knnclass_iris_met)

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.80      0.89        10
 Iris-virginica       0.82      1.00      0.90         9

       accuracy                           0.93        30
      macro avg       0.94      0.93      0.93        30
   weighted avg       0.95      0.93      0.93        30



### ROC Curves

The __ROC curve__ is a graphical tool to examine the errors of a classifier. ROC stands for receiver operating characteristic, a term from communications theory. Figure 4 below shows a typical ROC curve for a classifier.

__Figure 4__

<body>
    <div class="img-box">
        <img src="images/ROC-1.jpg" alt="img1" style="width:100%" />
    </div>
</body>

The ROC curve has the sensitivity on the vertical axis (y-axis) and 1 - specificity (the false positive rate) on the horizontal axis (x-axis). The ideal ROC curve hugs the top left corner, which corresponds to a high true positive rate and a low false positive rate. The forty-five degree line that goes from bottom left (0,0) to upper right (1,1), shown as a black dotted line in the plot, corresponds to a classifier that is no better than guessing, i.e. the posterior probabilities are both 0.5.

A single number summary of the ROC curve is the _AUC_ which stands for area under the (ROC) curve. The larger the AUC the better the classifier has done. The forty-five degree line has an AUC of 0.50 so a good classifier will have AUC higher than random guessing. For the ROC curve in Figure 4 the AUC is 0.8342. This is a moderately good classifier, certainly better than random.
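As a small illustration of how the AUC relates to the ROC curve, the sketch below computes the curve for four made-up observations and then obtains the area twice: once by summing trapezoids under the curve and once with `roc_auc_score`. The labels and scores are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities of the positive class (illustration only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
# Area under the curve by the trapezoid rule
auc_trapezoid = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
auc_sklearn = roc_auc_score(y_true, y_score)

print(fpr, tpr)
print(auc_trapezoid, auc_sklearn)
```

Both calculations give an AUC of 0.75 for these values.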

### Complete Example

We are going to work through a complete example using a new dataset. Refer to the previous sections for full instructions if you need assistance. 

The `binary.csv` dataset contains 4 variables:

- `admit`: the admittance status (0=not admitted, 1=admitted)
- `gre`: the student's GRE score
- `gpa`: the student's GPA
- `rank`: rank of the institution (1=highest to 4=lowest prestige)

Use the function `read_csv` in `pandas` to read in the `binary.csv` dataset and name it `dta_admit`. 

__Note__: The variable names are found in the file and the file does not have an index.

Print the first 5 observations to familiarize yourself with the data.

In [81]:
dta_admit = pd.read_csv("datasets/binary.csv")
dta_admit.head(5)

| | admit | gre | gpa | rank |
|---|---|---|---|---|
| 0 | 0 | 380 | 3.61 | 3 |
| 1 | 1 | 660 | 3.67 | 3 |
| 2 | 1 | 800 | 4.0 | 1 |
| 3 | 1 | 640 | 3.19 | 4 |
| 4 | 0 | 520 | 2.93 | 4 |


We need to process the data. The first step is to split the data into features and labels. The labels are in the first column, named `admit`. The features are in the other three columns. Create `X` and `Y` objects.

In [82]:
Y = dta_admit.iloc[:, 0].values
X = dta_admit.iloc[:, 1:4].values

Split the data into training and testing data. Obtain four objects: `X_train`, `X_test`, `Y_train`, `Y_test`. 

In [85]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

Scale the features, fitting the scaler on the training set and applying the same transformation to the test set. Obtain two objects: `X_train1`, `X_test1`.

In [86]:
scaler = StandardScaler()
X_train1 = scaler.fit_transform(X_train)
X_test1 = scaler.transform(X_test)

Perform LDA. There will be three lines of code:
1. Use the `LDA` function. Name the object `lda_admit`. 
2. Use the `fit_transform` function. Name the object `lda_admit_fit`.
3. Use the `predict` function. Name the object `lda_pred_admit_labels`.

In [87]:
lda_admit = LDA(n_components=1)
lda_admit_fit = lda_admit.fit_transform(X_train1, Y_train)
lda_pred_admit_labels = lda_admit.predict(X_test1)

Perform KNN classification. There will be three lines of code:
1. Use the `KNNclass` function. Name the object `knnclass_admit`. 
2. Use the `fit` function. Name the object `knnclass_admit_fit`.
3. Use the `predict` function. Name the object `knnclass_pred_admit_labels`.

In [88]:
knnclass_admit = KNNclass(n_neighbors=5)
knnclass_admit_fit = knnclass_admit.fit(X_train1, Y_train)
knnclass_pred_admit_labels = knnclass_admit.predict(X_test1)

We are now interested in evaluating the ROC and AUC. To do that, import `roc_curve` and `roc_auc_score` from `sklearn.metrics`. 

In [89]:
from sklearn.metrics import roc_curve, roc_auc_score

__FOR LDA__:

Use the `predict_proba` function on `lda_admit` with argument `X_test1`. Name the object `lda_pred_admit_proba`. Select the probabilities for the positive outcome only using `[:,1]`.

In [95]:
lda_pred_admit_proba = lda_admit.predict_proba(X_test1)[:,1]

Use the `roc_curve` function with the arguments `Y_test` and `lda_pred_admit_proba`. Passing the predicted probabilities rather than the predicted labels lets the curve trace out every classification threshold. Name the objects `fpr_lda, tpr_lda, thresholds_lda`. 

In [91]:
fpr_lda, tpr_lda, thresholds_lda = roc_curve(Y_test, lda_pred_admit_proba)

Use the `roc_auc_score` function with the arguments `Y_test` and `lda_pred_admit_proba`. Name the object `auc_lda`. Use the function `print` to show `auc_lda`.

In [97]:
auc_lda = roc_auc_score(Y_test, lda_pred_admit_proba)
print(auc_lda)

0.6991210277214335


The AUC is approximately 0.70 for the LDA performed on this data.

__FOR KNN__:

Use the `predict_proba` function on `knnclass_admit` with argument `X_test1`. Name the object `knnclass_pred_admit_proba`. Select the probabilities for the positive outcome only using `[:,1]`.

In [133]:
knnclass_pred_admit_proba = knnclass_admit.predict_proba(X_test1)[:,1]

Use the `roc_curve` function with the arguments `Y_test` and `knnclass_pred_admit_proba`. Name the objects `fpr_knnclass, tpr_knnclass, thresholds_knnclass`. 

In [132]:
fpr_knnclass, tpr_knnclass, thresholds_knnclass = roc_curve(Y_test, knnclass_pred_admit_proba)

Use the `roc_auc_score` function with the arguments `Y_test` and `knnclass_pred_admit_proba`. Name the object `auc_knnclass`. Use the function `print` to show `auc_knnclass`.

In [134]:
auc_knnclass = roc_auc_score(Y_test, knnclass_pred_admit_proba)
print(auc_knnclass)

0.5341446923597025


The AUC is approximately 0.53 for the KNN classification performed on this data.

__ROC Curves__

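To get the ROC curves for random guessing, LDA, and KNN classification on one plot, plot the false positive rates (`fpr_lda`, `fpr_knnclass`) on the x-axis and the true positive rates (`tpr_lda`, `tpr_knnclass`) on the y-axis. First import `plotly.graph_objects` as `go`, then use the code below: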

In [136]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_lda, 
                         y=tpr_lda, 
                         name='LDA',
                         line=dict(color='royalblue'))) 
fig.add_trace(go.Scatter(x=fpr_knnclass, 
                         y=tpr_knnclass, 
                         name='KNN classification',
                         line=dict(color='red'))) 
fig.add_trace(go.Scatter(x=[0,1], 
                         y=[0,1], 
                         name='No classifier',
                         line=dict(color='black', 
                                   dash='dash'))) 

fig.update_layout(
    xaxis=dict(title_text="False Positive Rate"),
    yaxis=dict(title_text="True Positive Rate"))
 
fig.show()