# `mugatu` KDD-CUP-99 example

**This is a work in progress.**

The [KDD-CUP-99](https://kdd.ics.uci.edu/databases/kddcup99/task.html) dataset is designed for learning to detect network intrusions. Let's take a first pass at exploring the dataset with Mapper to see how some of the normal and abnormal cases cluster on a graph.

In [None]:
import pandas as pd
import sklearn.datasets
import panel as pn
import holoviews as hv
import dask

# configure default scheduler for dask (used to parallelize clustering)
dask.config.set(scheduler='processes')
# activate the holoviews bokeh extension (for plotting the graph)
hv.extension("bokeh")
# activate panel widgets in jupyter
pn.extension()

In [None]:
import mugatu

## Prepare data

Here's the plan:

* pull the data using the loader built in to `sklearn`
* discard a few of the columns (categorical variables and "duration", which is zero in most of the dataset)
  * it would be better to encode these columns (for example, one-hot encoding the categorical ones) so that we can incorporate their information
* pull out some of the target categories for coloring the Mapper graph to see whether the structure we're learning correlates with the problem we're trying to solve
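
As a rough illustration of the "restructure these columns" idea above, here is a minimal sketch of one-hot encoding categorical columns with `pd.get_dummies` instead of discarding them. The `toy` frame and its values are made up for illustration; the notebook itself does not do this, and the real columns may need decoding from byte strings first.

```python
import pandas as pd

# Toy stand-in for a few KDD-CUP-99 columns (values are made up for illustration)
toy = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["SF", "SF", "REJ", "SF"],
    "src_bytes": [181, 239, 0, 1032],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["protocol_type", "flag"], dtype=float)
# Cast everything to float so the frame is uniformly numeric
encoded = encoded.astype(float)

print(encoded.columns.tolist())
```

One-hot columns add many sparse dimensions, so they would probably need rescaling or a lower weight before being fed to a distance-based method.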

In [None]:
# by default sklearn fetches the 10% subset of the data (percent10=True)
data = sklearn.datasets.fetch_kddcup99()
df = pd.DataFrame(data=data["data"], columns=data["feature_names"])
len(df)

In [None]:
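# let's just remove the categorical columns and the duration for now
cat_cols = ["protocol_type", "service", "flag"]
# use the columns= keyword; the positional axis argument was removed in pandas 2.0
df = df.drop(columns=cat_cols + ["duration"])
# the remaining columns are numeric, so cast them all to float
df = df.astype(float)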
df.head()

In [None]:
# one-hot encode the target; the raw label names end with a period (e.g. "normal.")
target = pd.get_dummies(data["target"].astype(str))
# indicator arrays for a handful of classes, used to color the Mapper graph
labels = {name: target[name + "."].values
          for name in ["normal", "neptune", "smurf", "back", "satan"]}
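
Before coloring the graph, it can help to check how common each label is. The sketch below repeats the same `get_dummies` pattern on a small made-up target array (`toy_target` is hypothetical, not the real data) and computes the fraction of rows carrying each label.

```python
import numpy as np
import pandas as pd

# Made-up target values in the same "name." format as the KDD-CUP-99 labels
toy_target = np.array(["normal.", "smurf.", "smurf.", "neptune.", "normal.", "smurf."])

# One indicator column per distinct label
toy_dummies = pd.get_dummies(toy_target)
# Strip the trailing period so the keys are plain class names
toy_labels = {col.rstrip("."): toy_dummies[col].values for col in toy_dummies.columns}

# Fraction of rows carrying each label
fractions = {name: float(np.mean(values)) for name, values in toy_labels.items()}
print(fractions)
```

The same dictionary comprehension should work on the real `labels` dictionary, since its values are indicator arrays as well.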

## Toss all that in the GUI

I had OK luck with these parameters:

* `pca_dim = 10`
* `k = 3`
* `num_intervals = 12`
* `f = 0.5`
* `balance = False`
* `lens1 = "isolationforest"`
* `lens2 = "svd1"`
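
For intuition about the two lenses, here is a rough standalone sketch using scikit-learn on synthetic data. It assumes that `"isolationforest"` refers to an isolation forest anomaly score and `"svd1"` to the projection onto the first singular vector; how `mugatu` actually computes its lenses is not shown in this notebook, so treat this as an approximation rather than the library's implementation.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic stand-in for the numeric features

# Assumed meaning of "isolationforest": an anomaly score per row
forest = IsolationForest(random_state=0).fit(X)
lens1 = forest.score_samples(X)  # lower values mean more anomalous rows

# Assumed meaning of "svd1": projection onto the first singular vector
lens2 = TruncatedSVD(n_components=1, random_state=0).fit_transform(X)[:, 0]

# Stack into a two-dimensional lens, one row per data point
lens = np.column_stack([lens1, lens2])
print(lens.shape)
```

An anomaly score is a natural lens for intrusion data because it tends to pull unusual traffic toward one end of the cover, while the SVD component spreads points along the direction of greatest variation.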

In [None]:
mapper = mugatu.Mapperator(df, title="kddcup99", color_data=labels)

In [None]:
mapper.panel()