# Lab 3: Multiclass Classification with XGBoost

In [None]:
## Notebook Settings
# Add autotime of each block
#!pip install ipython-autotime
%load_ext autotime

### Goals:
- Learn the basics of cyber network data with respect to consumer IoT devices
- Load network data into a data frame
- Explore network data and features
- Use XGBoost to build a classification model
- Evaluate the model
- Experiment on your own with feature selection, aggregation, and/or XGBoost parameters

This lab builds on the previous labs and will utilize some of those skills.

### Background

#### The Internet of Things and Data at a Massive Scale

Gartner estimates there are currently over 8.4 billion Internet of Things (IoT) devices. By 2020, that number is [estimated to surpass 20 billion](https://www.zdnet.com/article/iot-devices-will-outnumber-the-worlds-population-this-year-for-the-first-time/). These types of devices range from consumer devices (e.g., Amazon Echo, smart TVs, smart cameras, door bells) to commercial devices (e.g., building automation systems, keycard entry). All of these devices exhibit behavior on the Internet as they communicate back with their own clouds and user-specified integrations.

#### Types of Network Data

The most detailed type of data that is typically collected on a network is full Packet CAPture (PCAP) data. This information is detailed and contains everything about the communication, including: source address, destination address, protocols used, bytes transferred, and even the raw data (e.g., image, audio file, executable). PCAP data is fine-grained, meaning that there is a record for each frame being transmitted. A typical communication is composed of many individual packets/frames.

If we aggregate PCAP data so that there is one row of data per communication session, we call that flow level data. A simplified example of this relationship is shown in the figure below.

![PCAP_flow_relationship](pcap_vs_flow.png)
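The relationship between packet-level and flow-level data can be sketched in a few lines of plain Python. The packets below are made up for illustration, and grouping by an unordered address pair is a simplifying assumption: real flow tools typically key on the full 5-tuple (addresses, ports, protocol) and apply timeouts.

```python
from collections import defaultdict

# Hypothetical toy packets: (timestamp, source, destination, bytes)
packets = [
    (0.00, "10.0.0.5", "52.1.1.1", 60),
    (0.02, "52.1.1.1", "10.0.0.5", 1500),
    (0.05, "10.0.0.5", "52.1.1.1", 40),
    (1.10, "10.0.0.7", "8.8.8.8", 70),
]

# One record per conversation instead of one record per packet
flows = defaultdict(lambda: {"pkts": 0, "bytes": 0, "start": None, "end": None})
for ts, src, dst, size in packets:
    key = tuple(sorted((src, dst)))  # both directions share one flow record
    flow = flows[key]
    flow["pkts"] += 1
    flow["bytes"] += size
    flow["start"] = ts if flow["start"] is None else min(flow["start"], ts)
    flow["end"] = ts if flow["end"] is None else max(flow["end"], ts)

for key, flow in flows.items():
    print(key, flow)
```

Four packet rows collapse into two flow rows, each carrying packet and byte totals plus a start and end time.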

For this tutorial, we use data from the University of New South Wales. In a lab environment, they [collected nearly three weeks of IoT data from 21 IoT devices](http://149.171.189.1). They also kept a detailed [list of devices by MAC address](http://149.171.189.1/resources/List_Of_Devices.txt), so we have ground-truth with respect to each IoT device's behavior on the network.

**Our goal is to utilize the behavior exhibited in the network data to classify IoT devices.**

### Data Investigation

In [None]:
!ls /dli/data/kdd-data

In [None]:
!ls /dli/data/kdd-data/iot

Let's first look at some of the data. We'll load a PCAP file using PyShark (a Python wrapper for Tshark).

In [None]:
import pyshark
cap = pyshark.FileCapture("/dli/data/kdd-data/iot/16-09-27.pcap")

In [None]:
print(cap[0])

In [None]:
print(cap[0].ip)

There are a lot of features there! In addition to having multiple layers (which may differ between packets), there are a number of other issues with working directly with PCAP. Often the payload is encrypted (note the SSL layer in the above example), so its contents can't be used as features. The lack of aggregation also makes it difficult to differentiate between packets. What we really care about for this application is what a *session* looks like. In other words, how a Roku interacts with the network is likely quite different from how a Google Home interacts.

To save time for the tutorial, all three weeks of PCAP data have already been transformed to flow data, and we can load that into a regular pandas data frame. Due to how the data was created, the file has a header row (with column names) as well as a footer row. We want to use the header but will skip the footer.

In [None]:
import pandas as pd
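# skipfooter is only supported by the Python parser engine, so request it explicitly
pdf = pd.read_csv("/dli/data/kdd-data/iot/bro/conn.log", sep='\t', skipfooter=1, engine='python')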
print("==> pdf shape: ",pdf.shape)

We can look at what this new aggregated data looks like, and get a better sense of the columns and their data types.

In [None]:
pdf.head()

In [None]:
pdf.dtypes

In [None]:
# if time allows, investigate the data here

### Adding ground truth labels back to the data

We'll need some labels for our classification task, so we've already prepared a file with those labels.

In [None]:
labels_pdf = pd.read_csv("/dli/data/kdd-data/iot/lab_mac_labels.csv", sep=',')
labels_pdf.head()

In order to use XGBoost, we need numeric category IDs. We'll use `.cat.codes` to create them and add them back to the labels data frame.

In [None]:
labels_pdf = labels_pdf.assign(CategoryID=(labels_pdf['Category']).astype('category').cat.codes)
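As a small illustration of what `.cat.codes` produces, the sketch below applies it to a made-up column. The category names are placeholders, not the lab's actual labels.

```python
import pandas as pd

# Hypothetical categories, for illustration only
toy = pd.DataFrame({"Category": ["camera", "hub", "camera", "speaker"]})

# Categories are sorted, then numbered from 0
toy["CategoryID"] = toy["Category"].astype("category").cat.codes
print(toy)
```

Codes follow the sorted order of the category values, so `camera` becomes 0, `hub` 1, and `speaker` 2. Missing values are encoded as -1, which is worth checking for before training.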

In [None]:
labels_pdf.head()

We now perform a series of merges to add the ground truth data (device name, connection, category, and categoryID) back to the dataset. Since each row of netflow has two participants, we'll have to do this twice - once for the originator (source) and once for the responder (destination).

In [None]:
merged_pdf = pd.merge(pdf,labels_pdf, how='left', left_on=['orig_l2_addr'], right_on=['MAC'])

In [None]:
merged_pdf = merged_pdf.rename(columns = {'Device':'orig_device',
                                          'MAC':'orig_MAC',
                                          'Connection':'orig_connection',
                                          'Category':'orig_category',
                                          'CategoryID':'orig_category_id'})

In [None]:
merged_pdf = pd.merge(merged_pdf,labels_pdf, how='left', left_on=['resp_l2_addr'], right_on=['MAC'])

In [None]:
merged_pdf = merged_pdf.rename(columns = {'Device':'resp_device',
                                          'MAC':'resp_MAC',
                                          'Connection':'resp_connection',
                                          'Category':'resp_category',
                                          'CategoryID':'resp_category_id'})
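A left merge keeps every flow row even when its MAC address has no label. The toy frames below (made-up MACs and categories) show what that looks like and how to count unlabeled rows.

```python
import pandas as pd

# Hypothetical flows and labels; "cc:cc" has no entry in the label table
flows = pd.DataFrame({
    "orig_l2_addr": ["aa:aa", "bb:bb", "cc:cc"],
    "orig_ip_bytes": [100, 200, 300],
})
labels = pd.DataFrame({
    "MAC": ["aa:aa", "bb:bb"],
    "Category": ["camera", "hub"],
})

merged = pd.merge(flows, labels, how="left", left_on="orig_l2_addr", right_on="MAC")
print(merged)
print("unlabeled rows:", merged["Category"].isna().sum())
```

Rows without a match carry `NaN` in the label columns. In the lab data this is expected for endpoints that are not one of the labeled IoT devices (for example, remote servers).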

Let's just look at our new dataset to make sure everything's okay.

In [None]:
merged_pdf.head()

In [None]:
merged_pdf.dtypes

### Exploding the Netflow Data into Originator and Responder Rows

We now have netflow that has one row per (sessionized) communication between an originator and responder. However, in order to classify an individual device, we need to explode the data. Instead of one row that contains both originator and responder, we'll explode to one row for originator information (orig_bytes, orig_pkts, orig_ip_bytes) and one for responder information (resp_bytes, resp_pkts, resp_ip_bytes).

The easiest way to do this is to create two new dataframes, rename all of the columns, then `concat` them back together. Just for sanity, we'll also check the new shape of our exploded data frame.

In [None]:
orig_comms_pdf = merged_pdf[['ts','id.orig_h','id.orig_p','proto','service','duration','orig_bytes','orig_pkts','orig_ip_bytes',
                             'orig_device','orig_MAC','orig_category','orig_category_id']]
orig_comms_pdf.columns = ['ts','ip','port','proto','service','duration','bytes','pkts','ip_bytes','device','MAC','category','category_id']

In [None]:
resp_comms_pdf = merged_pdf[['ts','id.resp_h','id.resp_p','proto','service','duration','resp_bytes','resp_pkts','resp_ip_bytes',
                             'resp_device','resp_MAC','resp_category','resp_category_id']]
resp_comms_pdf.columns = ['ts','ip','port','proto','service','duration','bytes','pkts','ip_bytes','device','MAC','category','category_id']

In [None]:
exploded_pdf = pd.concat([orig_comms_pdf,resp_comms_pdf])
print("==> shape = ", exploded_pdf.shape)

In [None]:
exploded_pdf.head()
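The same explode pattern on a single made-up flow makes the effect easy to see. The column names here are a reduced, illustrative subset of the real ones.

```python
import pandas as pd

# One hypothetical flow row with originator and responder fields
flow = pd.DataFrame({
    "ts": [1000.0],
    "orig_pkts": [4], "resp_pkts": [9],
    "orig_MAC": ["aa:aa"], "resp_MAC": ["bb:bb"],
})

# Split into one frame per participant and give both the same column names
orig = flow[["ts", "orig_pkts", "orig_MAC"]].copy()
orig.columns = ["ts", "pkts", "MAC"]
resp = flow[["ts", "resp_pkts", "resp_MAC"]].copy()
resp.columns = ["ts", "pkts", "MAC"]

# ignore_index=True gives the stacked rows a fresh 0..n-1 index
exploded = pd.concat([orig, resp], ignore_index=True)
print(exploded)
```

One flow row becomes two device-centric rows, so the exploded frame should have exactly twice as many rows as the merged frame.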

We're going to need the number of categories (classes) quite a bit, so we'll make a variable for it for easier access.

In [None]:
# count the distinct (non-null) category IDs
num_categories = int(exploded_pdf['category_id'].nunique())
print("==> number of IoT categories =", num_categories)

### Binning the Data and Aggregating the Features

But wait, there's still more data wrangling to be done! While we've exploded the flows into rows for orig/resp, we may want to bin the data further by time. The rationale is that any single communication may not be an accurate representation of how a device typically reacts in its environment. Imagine the simple case of how a streaming camera typically operates (most of its data will be uploaded from the device to a destination) versus how it operates during a firmware update (most of the data will be pushed down to the device, after which a brief disruption in connectivity will occur).

There are a lot of different time binnings we could use. It would also be useful to investigate how the average connection duration relates to the number of connections per bin across various time granularities. With that said, we'll just choose a time bin of 1 hour to begin with. In order to bin, we'll use the following formula:

$$\text{hour_time_bin}=\left\lfloor{\frac{ts}{60*60}}\right\rfloor$$

In [None]:
import numpy as np

In [None]:
exploded_pdf['hour_time_bin'] = exploded_pdf['ts'].apply(lambda x: int(np.floor(x/(60*60))))
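A quick worked example of the binning formula, using made-up epoch timestamps. The vectorized NumPy form shown here is an alternative to `apply` and should give the same bins.

```python
import numpy as np

# Hypothetical epoch timestamps in seconds
ts = np.array([1474934400.0, 1474936199.9, 1474938000.0])

# floor(ts / 3600): timestamps within the same hour share a bin
hour_time_bin = np.floor(ts / (60 * 60)).astype(np.int64)
print(hour_time_bin)
```

The first two timestamps are just under 30 minutes apart and fall within the same hour, so they share a bin. The third starts the next hour.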

We also have to make a choice about how we'll aggregate the binned data. One of the simplest ways is to sum the bytes and packets. There are really two choices for bytes, `bytes` and `ip_bytes`. With Bro, `bytes` is taken from the TCP sequence numbers and is potentially inaccurate, so we select `ip_bytes` instead for both originator and responder. We'll also use the sum of the number of packets.

In [None]:
one_hour_time_bin_pdf = exploded_pdf[['bytes','pkts','ip_bytes','MAC','category_id','hour_time_bin']].groupby(['MAC','category_id','hour_time_bin']).sum()
one_hour_time_bin_pdf = one_hour_time_bin_pdf.reset_index()
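To see what the aggregation does, here is the same `groupby(...).sum()` pattern on a tiny made-up frame.

```python
import pandas as pd

# Hypothetical exploded rows: device aa:aa is seen twice in hour 10
toy = pd.DataFrame({
    "MAC": ["aa:aa", "aa:aa", "aa:aa", "bb:bb"],
    "hour_time_bin": [10, 10, 11, 10],
    "pkts": [4, 6, 1, 20],
    "ip_bytes": [400, 600, 52, 9000],
})

# One row per (device, hour) with summed packet and byte counts
agg = toy.groupby(["MAC", "hour_time_bin"])[["pkts", "ip_bytes"]].sum().reset_index()
print(agg)
```

Each output row is one device in one hour, which is the unit the classifier will see.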

We finally have some data that's ready for classification. Let's take a look.

In [None]:
one_hour_time_bin_pdf.head()

### Creating the Training and Testing Datasets

We'll take a traditional 70/30 train/test split, and we'll randomly sample into a train and test data frame.

In [None]:
pdf_msk = np.random.rand(len(one_hour_time_bin_pdf)) < 0.7

In [None]:
train_pdf = one_hour_time_bin_pdf[pdf_msk]
test_pdf = one_hour_time_bin_pdf[~pdf_msk]

print("==> train length =",len(train_pdf))
print("==> test length =",len(test_pdf))
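Because the mask is random, the split (and every metric downstream) changes on each run. A minimal sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable for your experiments; the seed value and row count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen for illustration
n_rows = 1000                    # stand-in for len(one_hour_time_bin_pdf)

# Each row lands in the training set with probability 0.7
mask = rng.random(n_rows) < 0.7
train_idx = np.flatnonzero(mask)
test_idx = np.flatnonzero(~mask)

print("train:", len(train_idx), "test:", len(test_idx))
```

The split is only approximately 70/30, since each row is assigned independently.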

Prepare the training input (`train_X`), training target (`train_Y`), test input (`test_X`) and test target (`test_Y`) datasets.

In [None]:
train_X = train_pdf[['pkts','ip_bytes']].values
train_Y = train_pdf['category_id'].values

test_X = test_pdf[['pkts','ip_bytes']].values
test_Y = test_pdf['category_id'].values

### Configure XGBoost

**Finally** we're moving off the CPU and to the GPU with [XGBoost](https://xgboost.readthedocs.io/en/latest/). The package provides support for gradient boosted trees and can leverage distributed GPU compute environments.

In [None]:
import xgboost as xgb

Getting data into a format for XGBoost is really easy. Just make a `DMatrix` for both training and testing.

In [None]:
xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

Like any good ML package, there are quite a few parameters to set. We're going to start with the softmax objective function. This will let us get a predicted category out of our model. We'll also set other parameters like the maximum depth and number of threads. You can read more about the parameters [here](https://xgboost.readthedocs.io/en/latest/parameter.html). Experiment with them!

In [None]:
param = {}
param['objective'] = 'multi:softmax'
param['eta'] = 0.1  # learning rate ('learning_rate' is an alias for 'eta', so it is set only once)
param['max_depth'] = 8
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_categories
# 'max_features' is a scikit-learn tree parameter, not an XGBoost one, so it is not set here
param['n_gpus'] = -1
param['tree_method'] = 'gpu_hist'
# param
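To make the softmax objective concrete: the model produces one raw score per class, softmax turns those scores into probabilities, and the predicted category is the most probable class. The sketch below reproduces that arithmetic with NumPy on made-up scores. It illustrates the formula only and is not XGBoost's internal code.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical raw scores for 2 rows and 3 classes
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

class_probs = softmax(raw_scores)
predicted_class = class_probs.argmax(axis=1)
print(class_probs)
print(predicted_class)
```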

XGBoost allows us to define a watchlist so that we can keep track of performance as the algorithm trains. We'll configure a simple watchlist that watches the `xg_train` and `xg_test` error rates.

In [None]:
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 20

### Training our First XGBoost Model

Now it's time to train. Training is easy, keeping with the typical 80/20 rule for data science.

In [None]:
bst = xgb.train(param, xg_train, num_round, watchlist)

Prediction is also easy (and fast).

In [None]:
pred = bst.predict(xg_test)

We can get a sense of how well our model performs by calculating the error rate.

In [None]:
error_rate = np.sum(pred != test_Y) / test_Y.shape[0]
error_rate
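A single error rate can hide large differences between classes. As a hedged illustration on made-up labels, the sketch below computes the same error rate alongside a per-class recall, using only NumPy.

```python
import numpy as np

# Hypothetical true and predicted class IDs
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Fraction of rows where the prediction is wrong
toy_error_rate = np.mean(y_pred != y_true)

# Recall per class: of the rows that truly belong to class c, how many were predicted as c?
per_class_recall = {int(c): float(np.mean(y_pred[y_true == c] == c))
                    for c in np.unique(y_true)}

print("error rate:", toy_error_rate)
print("recall by class:", per_class_recall)
```

Here one class is always recovered while the other two are right only half the time, even though the overall error rate is a single number.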

That's not great, but it's not terrible considering we made quite a few seemingly arbitrary decisions in both the feature selection and aggregation phases. We can get more insight into how our model is performing by analyzing the ROC curves for each class, the micro average, and the macro average. We'll go back to traditional Python data science tools to do this analysis.

### Analyzing the Model's Performance

We'll start by importing some packages we'll need to perform this analysis. For simplicity in an already large notebook, we'll put them in a single cell.

In [None]:
# sklearn is used to binarize the labels as well as calculate ROC and AUC
from sklearn.metrics import roc_curve, auc,recall_score,precision_score
from sklearn.preprocessing import label_binarize

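# numpy's interp is used for interpolating the ROC curves
# (scipy.interp was a deprecated alias of numpy.interp and is gone from newer SciPy releases)
from numpy import interp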

# our old friend matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# choose whatever style you want
plt.style.use('fivethirtyeight')

# cycle is used just to make different colors for the different ROC curves
from itertools import cycle

A ROC curve analysis can be tricky for multiclass problems. One way to deal with it is to look at the ROC curve for each class. We'll take some steps to format our data so that it plays nicely with input requirements from sklearn (ah 80/20 rule, we meet again). We will also need to rerun our model with a different objective function.

### Rerunning the Model with the `softprob` Objective Function

We used the `softmax` objective function above, but what we really want out of the model this time is the probability that each aggregated row belongs to each of the classes. This is easy enough to do with XGBoost, as we just change the objective function to `softprob`. For simplicity, all of the configuration is in a single cell below rather than spread out. Note that the only intended difference is the objective function.

In [None]:
pdf_msk = np.random.rand(len(one_hour_time_bin_pdf)) < 0.7

train_pdf = one_hour_time_bin_pdf[pdf_msk]
test_pdf = one_hour_time_bin_pdf[~pdf_msk]

train_X = train_pdf[['pkts','ip_bytes']].values
train_Y = train_pdf['category_id'].values

test_X = test_pdf[['pkts','ip_bytes']].values
test_Y = test_pdf['category_id'].values

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

param = {}
param['objective'] = 'multi:softprob'
param['eta'] = 0.1
param['max_depth'] = 8
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_categories
param['n_gpus'] = -1
param['tree_method'] = 'gpu_hist'

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 20

Train the model.

In [None]:
bst = xgb.train(param, xg_train, num_round, watchlist)

Okay, so we have our new model. We now take some steps to make sure the data is in a format that makes sklearn happy. First we'll use the `predict` function to compute the probabilities. To extend `roc_curve` to multiclass, we'll also need to binarize the labels. Let's keep our sanity by also making sure the lengths match.

In [None]:
len(bst.predict(xg_test))

In [None]:
probs = bst.predict(xg_test).reshape(test_Y.shape[0],param['num_class'])
test_Y_binarize = label_binarize(test_Y, classes=np.arange(param['num_class']))

print("==> length of probs =",len(probs))
print("==> length of test_Y_binarize =", len(test_Y_binarize))
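Depending on the XGBoost version, `predict` with `multi:softprob` may return either a flat vector of length `n_rows * num_class` or an array that is already two-dimensional. The reshape above handles both cases. The toy example below (made-up numbers) shows the reshape and how the hard labels from `softmax` relate to these probabilities.

```python
import numpy as np

num_class = 3

# Hypothetical flat output for 2 rows: row 0's three probabilities, then row 1's
flat = np.array([0.1, 0.7, 0.2, 0.5, 0.3, 0.2])

# One row per sample, one column per class
probs_toy = flat.reshape(-1, num_class)

# Taking the most probable class per row recovers a hard label
hard_labels = probs_toy.argmax(axis=1)
print(probs_toy)
print(hard_labels)
```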

Some more housekeeping. We'll create Python dictionaries to hold FPR, TPR, and AUC values.

In [None]:
fpr = dict()
tpr = dict()
roc_auc = dict()

For each of our classes, we'll compute FPR, TPR, and AUC. We'll also compute the [micro and macro averages](http://rushdishams.blogspot.com/2011/08/micro-and-macro-average-of-precision.html).

In [None]:
print("==> number of classes =", num_categories)

In [None]:
# calculate FPR, TPR, and ROC AUC for every class
for i in range(num_categories):
    fpr[i], tpr[i], _ = roc_curve(test_Y_binarize[:, i], probs[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# calculate the micro average FPR, TPR, and ROC AUC (we'll calculate the macro average below)
fpr["micro"], tpr["micro"], _ = roc_curve(test_Y_binarize.ravel(), probs.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
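To show what the one-vs-rest and micro-average computations operate on, the sketch below builds a one-hot label matrix by hand and computes a single ROC point at one threshold. The data and the helper `roc_point` are made up for illustration; `roc_curve` does the equivalent across all thresholds.

```python
import numpy as np

# Hypothetical labels and class probabilities for 4 rows and 3 classes
y = np.array([0, 2, 1, 2])
num_classes = 3
one_hot = np.eye(num_classes, dtype=int)[y]  # same idea as label_binarize
scores = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.3, 0.4, 0.3],
                   [0.5, 0.1, 0.4]])

def roc_point(labels, score, threshold):
    """Return (FPR, TPR) when everything scoring >= threshold is called positive."""
    predicted = score >= threshold
    tp = np.sum(predicted & (labels == 1))
    fp = np.sum(predicted & (labels == 0))
    return fp / np.sum(labels == 0), tp / np.sum(labels == 1)

# Per-class (one-vs-rest) view: class 2 against everything else
fpr_2, tpr_2 = roc_point(one_hot[:, 2], scores[:, 2], 0.5)

# Micro average: pool every (row, class) cell into one binary problem
fpr_micro, tpr_micro = roc_point(one_hot.ravel(), scores.ravel(), 0.5)

print("class 2:", fpr_2, tpr_2)
print("micro:  ", fpr_micro, tpr_micro)
```

The micro average weights every (row, class) decision equally, so frequent classes dominate it, while the macro average gives each class's curve equal weight.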

### Plotting the ROC Curves

Phew! Lots of code below, but it's fairly straightforward and [adapted from an example in the scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#multiclass-settings). Before we plot though, we'll create a simple category lookup dictionary so we can label the classes with their actual names (not their category IDs).

In [None]:
category_lookup = labels_pdf[['Category','CategoryID']].drop_duplicates().set_index('CategoryID').T.to_dict()

In [None]:
# aggregate all of the false positive rates across all classes
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_categories)]))

# interpolate all of the ROC curves
mean_tpr = np.zeros_like(all_fpr)
for i in range(param['num_class']):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# average the TPR
mean_tpr /= num_categories

# compute the macro average FPR, TPR, and ROC AUC
fpr['macro'] = all_fpr
tpr['macro'] = mean_tpr
roc_auc['macro'] = auc(fpr['macro'], tpr['macro'])

# plot all of the ROC curves on a single plot (for comparison)
plt.figure(figsize=(9,9))
plt.plot(fpr['micro'], tpr['micro'],
         label="micro-average ROC curve (area = {0:0.2f})"
               "".format(roc_auc['micro']),
         color='deeppink', linestyle=':', linewidth=4)

plt.plot(fpr['macro'], tpr['macro'],
         label="macro-average ROC curve (area = {0:0.2f})"
               "".format(roc_auc['macro']),
         color='navy', linestyle=':', linewidth=4)

num_colors = param['num_class']
cm = plt.get_cmap('gist_rainbow')

colors = cycle([cm(1.*i/num_colors) for i in range(num_colors)])

lw = 2
for i, color in zip(range(param['num_class']), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label="ROC curve for "+category_lookup[i]['Category']+" class (area = {1:0.2f})"
             "".format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate", fontsize=12)
plt.ylabel("True Positive Rate", fontsize=12)
plt.title("ROC Curves for IoT Device Categories")
plt.legend(loc="lower right")
plt.show()
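The macro average in the cell above works by resampling every class's ROC curve onto a shared FPR grid and averaging the TPR values point by point. A two-curve toy example with made-up points shows the mechanics. The area is computed with an explicit trapezoid sum here, just to keep the sketch dependent on NumPy only.

```python
import numpy as np

# Two hypothetical ROC curves sampled at different FPR values
fpr_a, tpr_a = np.array([0.0, 0.5, 1.0]), np.array([0.0, 1.0, 1.0])
fpr_b, tpr_b = np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.5, 1.0])

# Shared grid: every FPR value that appears in either curve
grid = np.unique(np.concatenate([fpr_a, fpr_b]))

# Interpolate both curves onto the grid and average them
mean_tpr_toy = (np.interp(grid, fpr_a, tpr_a) + np.interp(grid, fpr_b, tpr_b)) / 2

# Trapezoidal area under the averaged curve
macro_auc_toy = np.sum(np.diff(grid) * (mean_tpr_toy[1:] + mean_tpr_toy[:-1]) / 2)
print(grid, mean_tpr_toy, macro_auc_toy)
```

The result equals the mean of the two individual AUCs (0.75 and 0.625), which is what a macro average should give.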

It's not a *terrible* plot, but it gets a little messy. We can also plot each class as its own subplot.

First we make a few variables so we can control the layout.

In [None]:
# one subplot per class, plus one each for the micro and macro averages
total_subplots = num_categories + 2
plot_grid_cols = 3
# ceiling division: just enough rows to hold every subplot
plot_grid_rows = -(-total_subplots // plot_grid_cols)

position_index = range(1, total_subplots+1)
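The number of rows a grid needs for a fixed number of columns is a ceiling division. The sketch below uses a hypothetical helper, `grid_rows`, and compares it with adding the remainder to the quotient, which can allocate more rows than necessary.

```python
import math

def grid_rows(n_plots, n_cols):
    """Smallest number of rows that fits n_plots in a grid with n_cols columns."""
    return math.ceil(n_plots / n_cols)

for n_plots in (12, 14, 15):
    quotient_plus_remainder = n_plots // 3 + n_plots % 3
    print(n_plots, "plots ->", grid_rows(n_plots, 3), "rows",
          "(quotient + remainder gives", quotient_plus_remainder, ")")
```

For 14 plots in 3 columns, ceiling division gives 5 rows, while quotient plus remainder gives 6 and leaves a fully empty row.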

Now we make the grid of plots.

In [None]:
# plt.subplots creates its own figure, so a separate plt.figure() call would only add an empty one
fig, axs = plt.subplots(plot_grid_rows, plot_grid_cols, sharex=True, sharey=True, figsize=(15,15))

lw = 2

plt_num = 0
for row in range(plot_grid_rows):
    for col in range(plot_grid_cols):
        if plt_num < num_categories:
            # one subplot per device category
            axs[row,col].plot(fpr[plt_num], tpr[plt_num], lw=lw)
            axs[row,col].set_title(category_lookup[plt_num]['Category']+' Devices ROC Curve', fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc[plt_num]), size=11)
        elif plt_num == num_categories:
            # first slot after the classes: micro average
            axs[row,col].plot(fpr['micro'], tpr['micro'], lw=lw)
            axs[row,col].set_title("Micro Average ROC Curve", fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc['micro']), size=12)
        elif plt_num == num_categories + 1:
            # second slot after the classes: macro average
            axs[row,col].plot(fpr['macro'], tpr['macro'], lw=lw)
            axs[row,col].set_title("Macro Average ROC Curve", fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc['macro']), size=12)
        axs[row,col].set_xlabel('False Positive Rate', fontsize=10)
        axs[row,col].set_ylabel('True Positive Rate', fontsize=10)
        plt_num += 1
            
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.subplots_adjust(wspace=0.2, hspace=0.4)
plt.show()

### Conclusions

As we've shown, it's possible to get fairly decent multiclass classification results for IoT data using only basic features (bytes and packets) when aggregated. This isn't surprising, based on the fact that we used expert knowledge to assign category labels. In addition, the majority of the time, IoT devices are in a "steady state" (idle), and are not heavily influenced by human interaction. This lets us take larger samples (e.g., aggregate to longer time bins) while still maintaining decent classification performance. It should also be noted that this is a very clean dataset. The traffic is mainly IoT traffic (e.g., little traditional compute traffic), and there are no intentional abnormal activities injected (e.g., red teaming).

We used Bro data, but it's also possible to use the raw PCAP data as input for classification. The preprocessing steps are more arduous than for flow data though. It'd be a great exercise...

### More to Explore: Possible Exercises

##### (1) It may be useful to investigate other time binnings. Can you build another model that uses data binned to a different granularity (e.g., 5 minutes)?

In [None]:
# your solution here

##### (2) We used the `sum` of bytes and packets for a device when aggregated to the hour. What about other ways to handle these quantitative features (e.g., average)? Would that improve the classification results?

In [None]:
# your solution here

##### (3) We selected specific parameters for XGBoost. These could probably use a bit more thought. You can [read more about the parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) and try adjusting them on our previous dataset.

In [None]:
# a reminder about our parameters
print(param)

In [None]:
# your solution here

##### (4) There are additional features in the netflow data that we didn't use. Some other quantitative fields (e.g., duration) and categorical fields (e.g., protocol, service, ports) may be useful for classification. Build another XGBoost model using some/all of these fields.

In [None]:
# your solution here

##### (5) We assigned categories to each device based on one expert's intuition. There are other ways to assign categories. Maybe there should be *fewer* categories even. Experiment with creating fewer categories, build a classification model, and compare your results with our previous findings.

In [None]:
# your solution here

##### (6) In addition to the Bro `conn` log, other logs are generated as part of the process. One of them is `dns.log`, which includes the DNS requests and responses for all of the devices. It's [been](http://www2.eet.unsw.edu.au/~vijay/pubs/conf/17infocom.pdf) [shown](https://www.nanog.org/sites/default/files/Nadji.pdf) that the set of DNS requests a device makes, along with the count of these requests, can accurately classify IoT devices. Utilize the DNS log file to build an XGBoost classifier for IoT devices.

In [None]:
# this is a list of the Bro directory; Bro log types end in .log
!ls /dli/data/kdd-data/iot/bro/*.log

In [None]:
# your solution here

##### (7) There may be other interesting things in the netflow data. Investigate the data some more!

In [None]:
# your solution here

### References

1. Nadji, Y., "Passive DNS-based Device Identification", *NANOG 67*, https://www.nanog.org/sites/default/files/Nadji.pdf.
1. Shams, R., "Micro- and Macro-average of Precision, Recall, and F-Score", http://rushdishams.blogspot.com/2011/08/micro-and-macro-average-of-precision.html.
1. Sivanathan, A. et al., "Characterizing and Classifying IoT Traffic in Smart Cities and Campuses", *2017 IEEE Conference on Computer Communications Workshops*, May 2017, http://www2.eet.unsw.edu.au/~vijay/pubs/conf/17infocom.pdf.
1. University of New South Wales Internet of Things Network Traffic Data Collection, http://149.171.189.1

<hr>

### Possible Solutions

#### Possible Solution to (1)

***Task:***  Investigate models using 5 minute aggregations
<details><summary>Click for Answer</summary>
<code>

# create 5-min time bins
exploded_pdf['five_min_time_bin'] = exploded_pdf['ts'].apply(lambda x: int(np.floor(x/(60*5))))
five_min_time_bin_pdf = exploded_pdf[['bytes','pkts','ip_bytes','MAC','category_id','five_min_time_bin']].groupby(['MAC','category_id','five_min_time_bin']).sum()
five_min_time_bin_pdf = five_min_time_bin_pdf.reset_index()

pdf_msk = np.random.rand(len(five_min_time_bin_pdf)) < 0.7

train_pdf = five_min_time_bin_pdf[pdf_msk]
test_pdf = five_min_time_bin_pdf[~pdf_msk]

train_X = train_pdf[['pkts','ip_bytes']].values
train_Y = train_pdf['category_id'].values

test_X = test_pdf[['pkts','ip_bytes']].values
test_Y = test_pdf['category_id'].values

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

param = {}
param['objective'] = 'multi:softprob'
param['eta'] = 0.1
param['max_depth'] = 8
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_categories
param['n_gpus'] = -1
param['tree_method'] = 'gpu_hist'

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 20

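bst = xgb.train(param, xg_train, num_round, watchlist)

# recompute the probabilities and binarized labels for the new 5-minute test set
# (otherwise the ROC analysis below would reuse the values from the hourly model)
probs = bst.predict(xg_test).reshape(test_Y.shape[0], param['num_class'])
test_Y_binarize = label_binarize(test_Y, classes=np.arange(param['num_class']))
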
fpr = dict()
tpr = dict()
roc_auc = dict()
                                                          
# calculate FPR, TPR, and ROC AUC for every class
for i in range(num_categories):
    fpr[i], tpr[i], _ = roc_curve(test_Y_binarize[:, i], probs[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# calculate the micro average FPR, TPR, and ROC AUC (we will calculate the macro average below)
fpr["micro"], tpr["micro"], _ = roc_curve(test_Y_binarize.ravel(), probs.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# aggregate all of the false positive rates across all classes
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(num_categories)]))

# interpolate all of the ROC curves
mean_tpr = np.zeros_like(all_fpr)
for i in range(param['num_class']):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

# average the TPR
mean_tpr /= num_categories

# compute the macro average FPR, TPR, and ROC AUC
fpr['macro'] = all_fpr
tpr['macro'] = mean_tpr
roc_auc['macro'] = auc(fpr['macro'], tpr['macro'])

# one subplot per class, plus one each for the micro and macro averages
total_subplots = num_categories + 2
plot_grid_cols = 3
# ceiling division: just enough rows to hold every subplot
plot_grid_rows = -(-total_subplots // plot_grid_cols)

position_index = range(1, total_subplots+1)

# plt.subplots creates its own figure, so a separate plt.figure() call would only add an empty one
fig, axs = plt.subplots(plot_grid_rows, plot_grid_cols, sharex=True, sharey=True, figsize=(15,15))

lw = 2

plt_num = 0
for row in range(plot_grid_rows):
    for col in range(plot_grid_cols):
        if plt_num < num_categories:
            # one subplot per device category
            axs[row,col].plot(fpr[plt_num], tpr[plt_num], lw=lw)
            axs[row,col].set_title(category_lookup[plt_num]['Category']+' Devices ROC Curve', fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc[plt_num]), size=11)
        elif plt_num == num_categories:
            # first slot after the classes: micro average
            axs[row,col].plot(fpr['micro'], tpr['micro'], lw=lw)
            axs[row,col].set_title("Micro Average ROC Curve", fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc['micro']), size=12)
        elif plt_num == num_categories + 1:
            # second slot after the classes: macro average
            axs[row,col].plot(fpr['macro'], tpr['macro'], lw=lw)
            axs[row,col].set_title("Macro Average ROC Curve", fontsize=14)
            axs[row,col].text(0.7, 0.1,"AUC = {:.4f}".format(roc_auc['macro']), size=12)
        axs[row,col].set_xlabel('False Positive Rate', fontsize=10)
        axs[row,col].set_ylabel('True Positive Rate', fontsize=10)
        plt_num += 1
            
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.subplots_adjust(wspace=0.2, hspace=0.4)
plt.show()    
</code>
</details>