![DLI Logo](images/DLI_Header.png)

# Sensitive Information Detection with Morpheus

In this notebook, you will perform sensitive information detection (SID) using Morpheus, similar to what was discussed in the GTC session _[Morpheus: AI Inferencing for Cybersecurity Pipelines](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33291/)_ you viewed earlier in this course.

## Notebook Outline

- **Coding Environment**: Get an overview of the relevant features of this GPU-accelerated coding environment.
- **Data Overview**: Examine the data we will be sending through the Morpheus pipeline and performing sensitive information detection on.
- **SID Pipeline Overview**: Understand the Morpheus SID pipeline command.
- **Run the SID Pipeline**: Use Morpheus to launch a SID pipeline.
- **Results Overview**: Examine the output of the Morpheus SID pipeline.
- **Next Steps**: Learn how to get started with Morpheus for your own project.

## Coding Environment

This interactive JupyterLab environment has a number of features that are relevant to the Morpheus pipeline that you will be running. This section gives an overview of the ones you will use in this notebook.

### NVIDIA GPU

This environment is backed by an NVIDIA GPU. Run the following cell to see details for the GPU provided in this environment. Note: using `!` at the beginning of a Jupyter Python cell (like the one below) allows you to execute command line commands.

In [None]:
!nvidia-smi

### Morpheus SDK CLI

You will be using the Morpheus SDK CLI to execute pipelines. The SDK CLI is already installed in this environment.

In [None]:
!morpheus

### Terminal

In addition to running command line commands from Jupyter cells using `!`, JupyterLab also provides terminals where you can run command line commands directly. You can also use the JupyterLab launcher to start a new terminal tab anytime you like: from the Jupyter menu above choose *File -> New Launcher* and then click the *Terminal* icon from the available options.

Open a terminal now and execute the `morpheus` command. Following the guidance in its output, feel free to try out some of the `--help` commands to see a little more about the capabilities of the CLI.

### Triton Inference Server

We will be using the [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) to perform inference as part of our Morpheus pipeline. A Triton server is already up and running in this environment and is available at the hostname `triton`. Run the cell below to send a GET request to Triton's readiness check endpoint. A response that includes `200 OK` indicates that the Triton server is up and running.

In [None]:
!curl -v triton:8000/v2/health/ready
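
The same readiness check can also be made from Python. The sketch below is a minimal example using only the standard library. The helper names `readiness_url` and `triton_is_ready` are hypothetical, and the host `triton` and port `8000` are simply taken from the `curl` command above.

```python
import urllib.error
import urllib.request


def readiness_url(host: str = "triton", port: int = 8000) -> str:
    """Build the URL of the readiness endpoint used in the curl command."""
    return f"http://{host}:{port}/v2/health/ready"


def triton_is_ready(host: str = "triton", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(readiness_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, unknown host, timeout, or an HTTP error status.
        return False
```

Calling `triton_is_ready()` from a notebook cell should return `True` when the server is reachable and ready.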

## Data Overview

In this example we will be streaming data into the pipeline from a file, namely `data/pcap_dump.jsonlines`:

In [None]:
!ls -lh data/pcap_dump.jsonlines

`pcap_dump.jsonlines` contains 93085 packet captures, each represented as a JSON object.

In [None]:
!cat data/pcap_dump.jsonlines | wc -l # Count number of lines / packet captures.

We will be using the [`jq`](https://stedolan.github.io/jq/) command-line tool to help us read both this input JSON data and, later, the output data. Run the following cell to look at 3 arbitrary packet captures from the input data, paying special attention to the `data` fields, which might include sensitive information that we would like to know is being sent through the network.

In [None]:
# Look at 3 arbitrary packet captures at indices `1`, `31`, `2022`. Feel free to modify the values inside the `[]` to view different packet captures.
!cat data/pcap_dump.jsonlines | jq -s '.[1,31,2022]' | tr -d '\\' # Remove backslashes for easier reading.
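
If you prefer Python to `jq`, the sketch below shows one way to do the same counting and indexing with the standard `json` module. The helper name `load_records`, the `src_ip` field, and the two sample records are made up for illustration. Only the `data` field name comes from this notebook.

```python
import json
from typing import Iterable, List


def load_records(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


# Two made-up records standing in for the real file contents.
sample_lines = [
    '{"src_ip": "10.0.0.1", "data": "GET /index.html"}\n',
    '{"src_ip": "10.0.0.2", "data": "user=alice"}\n',
]

records = load_records(sample_lines)
print(len(records))                  # Number of records, like `wc -l`.
for index in (0, 1):                 # Like `jq -s '.[0,1]'`.
    print(json.dumps(records[index], indent=2))
```

To use it on the real data, pass an open file instead of the sample list, for example `with open("data/pcap_dump.jsonlines") as f: records = load_records(f)`. Like `jq -s`, this reads every record into memory.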

## SID Pipeline Overview

We use the `morpheus` command line tool to construct pipelines, which can consume data from a variety of sources, perform various preprocessing operations on the data, do inference on the preprocessed data, postprocess the data, and send the results of the pipeline to a variety of destinations.

We will be using the following pipeline:

```sh
morpheus --log_level=DEBUG run \
    --num_threads=5 \
    --edge_buffer_size=32 \
    --use_cpp=True \
    --pipeline_batch_size=1024 \
    --model_max_batch_size=32 \
    pipeline-nlp \
        --labels_file=/dli/task/data/labels_nlp.txt \
        --model_seq_length=256 \
        from-file --filename=/dli/task/data/pcap_dump.jsonlines \
        deserialize \
        preprocess --vocab_hash_file=/dli/task/data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
        monitor --description='Preprocessing rate' \
        inf-triton --force_convert_inputs=True --model_name=sid-minibert-onnx --server_url=triton:8001 \
        monitor --description='Inference rate' --smoothing=0.001 --unit inf \
        add-class \
        filter \
        serialize --exclude '^ts_' \
        to-file --filename=/dli/task/data/output/sid-minibert-onnx-output.jsonlines --overwrite
```

Before running the pipeline, let's take a look at its key components.
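
To make the structure of this command easier to see, the sketch below assembles the same arguments as Python lists grouped by role and joins them into a single command string. It only illustrates how the command is organized: the variable names are arbitrary, nothing is executed, and every flag is copied from the command above rather than from the Morpheus documentation.

```python
import shlex

# Options for `morpheus` itself and for `morpheus run`.
global_options = [
    "morpheus", "--log_level=DEBUG", "run",
    "--num_threads=5", "--edge_buffer_size=32", "--use_cpp=True",
    "--pipeline_batch_size=1024", "--model_max_batch_size=32",
]

# Options that apply to the NLP pipeline as a whole.
pipeline_options = [
    "pipeline-nlp",
    "--labels_file=/dli/task/data/labels_nlp.txt",
    "--model_seq_length=256",
]

# Stages, listed in the order they appear in the command.
stages = [
    ["from-file", "--filename=/dli/task/data/pcap_dump.jsonlines"],
    ["deserialize"],
    ["preprocess", "--vocab_hash_file=/dli/task/data/bert-base-uncased-hash.txt",
     "--truncation=True", "--do_lower_case=True", "--add_special_tokens=False"],
    ["monitor", "--description=Preprocessing rate"],
    ["inf-triton", "--force_convert_inputs=True",
     "--model_name=sid-minibert-onnx", "--server_url=triton:8001"],
    ["monitor", "--description=Inference rate", "--smoothing=0.001", "--unit", "inf"],
    ["add-class"],
    ["filter"],
    ["serialize", "--exclude", "^ts_"],
    ["to-file", "--filename=/dli/task/data/output/sid-minibert-onnx-output.jsonlines",
     "--overwrite"],
]

command = global_options + pipeline_options + [arg for stage in stages for arg in stage]
print(shlex.join(command))               # The full command as one shell-quoted string.
print([stage[0] for stage in stages])    # Just the stage names, in order.
```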

### Running a Morpheus Pipeline

The first part of the pipeline consists of calling `morpheus run` and passing it a variety of command line arguments:

```sh
morpheus --log_level=DEBUG run \
    --num_threads=5 \
    --edge_buffer_size=32 \
    --use_cpp=True \
    --pipeline_batch_size=1024 \
    --model_max_batch_size=32 \
```

Using `morpheus run --help` we can see the purpose of each of these command line arguments.

In [None]:
!morpheus run --help

### Running an NLP Pipeline

Our pipeline will perform sensitive information detection by doing inference on each packet capture using a pre-trained NLP model, so we will use the `morpheus run pipeline-nlp` subcommand:

```sh
pipeline-nlp \
        --labels_file=/dli/task/data/labels_nlp.txt \
        --model_seq_length=256 \
        from-file --filename=/dli/task/data/pcap_dump.jsonlines \
        deserialize \
        preprocess --vocab_hash_file=/dli/task/data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
        monitor --description='Preprocessing rate' \
        inf-triton --force_convert_inputs=True --model_name=sid-minibert-onnx --server_url=triton:8001 \
        monitor --description='Inference rate' --smoothing=0.001 --unit inf \
        add-class \
        filter \
        serialize --exclude '^ts_' \
        to-file --filename=/dli/task/data/output/sid-minibert-onnx-output.jsonlines --overwrite
```

### Input and Preprocessing

We will provide the pipeline with a labels file, via the `--labels_file` argument, that lists the classes we would like to identify in our data.

In [None]:
!cat data/labels_nlp.txt

As examined earlier, we will use the file `data/pcap_dump.jsonlines` as input to the pipeline.

`deserialize` deserializes the JSON input, and `preprocess` converts the input text into tokens for inference by the NLP model.
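
To give a rough feel for what converting text into tokens involves, here is a toy sketch. It is not the Morpheus `preprocess` implementation: the real stage works from a BERT vocabulary (`bert-base-uncased-hash.txt`), whereas the tiny vocabulary, the whitespace splitting, and the function name `toy_tokenize` below are all assumptions made for illustration. The sketch only mirrors three ideas that appear in the command: lower-casing (`--do_lower_case`), truncation (`--truncation`), and a fixed sequence length (`--model_seq_length`).

```python
from typing import Dict, List, Tuple

# Hypothetical miniature vocabulary, for illustration only.
TOY_VOCAB: Dict[str, int] = {"[PAD]": 0, "[UNK]": 1, "password": 2, "is": 3, "secret": 4}


def toy_tokenize(text: str, seq_length: int, do_lower_case: bool = True) -> Tuple[List[int], List[int]]:
    """Turn text into fixed-length token IDs plus an attention mask."""
    if do_lower_case:
        text = text.lower()
    # Map each whitespace-separated word to its ID, or to [UNK] if unknown.
    ids = [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.split()]
    ids = ids[:seq_length]                       # Truncate long inputs.
    mask = [1] * len(ids)                        # 1 marks a real token.
    padding = seq_length - len(ids)
    ids += [TOY_VOCAB["[PAD]"]] * padding        # Pad short inputs.
    mask += [0] * padding                        # 0 marks padding.
    return ids, mask


ids, mask = toy_tokenize("Password is hunter2", seq_length=6)
print(ids)   # [2, 3, 1, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

The model receives fixed-length numeric arrays like these rather than raw text, which is why every input is truncated or padded to the same length.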

### Inference

With `inf-triton` we use the Triton inference server with a minibert NLP model fine-tuned for SID to perform inference on each of the incoming packet captures. The model has already been loaded into Triton for you.

### Postprocessing and Output

After performing inference on each of the captured packets, we use the `add-class` stage to add the detected classifications to each packet capture before serializing the results and writing them to the file `data/output/sid-minibert-onnx-output.jsonlines`.
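
Conceptually, this postprocessing turns per-label scores into true/false fields and drops fields that should not be written out. The sketch below illustrates that idea in plain Python and is not the Morpheus implementation. The 0.5 threshold, the score values, the `ts_start` field, and both function names are assumptions for illustration. The `^ts_` pattern is the one passed to `serialize --exclude` in the command, and the label names `email` and `password` appear in the output fields queried later in this notebook.

```python
import re
from typing import List


def add_classes(record: dict, labels: List[str], scores: List[float], threshold: float = 0.5) -> dict:
    """Return a copy of the record with one boolean field per label."""
    annotated = dict(record)
    for label, score in zip(labels, scores):
        annotated[label] = score > threshold
    return annotated


def exclude_fields(record: dict, pattern: str) -> dict:
    """Drop every field whose name matches the regular expression."""
    regex = re.compile(pattern)
    return {key: value for key, value in record.items() if not regex.search(key)}


record = {"data": "password=hunter2", "ts_start": 1}
labels = ["email", "password"]
scores = [0.03, 0.97]          # Made-up model outputs.

annotated = add_classes(record, labels, scores)
print(exclude_fields(annotated, r"^ts_"))
# {'data': 'password=hunter2', 'email': False, 'password': True}
```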

## Run the SID Pipeline

Open a JupyterLab terminal and run `./launch_sid.sh` from the command line. In addition to other output, you should see `====Pipeline Started====`.

After a minute or less, you will also start to see information about the preprocessing and inference rates for the pipeline:

```
Preprocessing rate: 93085messages [01:02, 1488.33messages/s]
Inference rate: 66560inf [01:02, 1070.50inf/s]
```

## Results Overview

Now that the pipeline is running, we can confirm that it is writing its results to `data/output/sid-minibert-onnx-output.jsonlines` as we specified in the pipeline.

In [None]:
!ls data/output

If you don't see `sid-minibert-onnx-output.jsonlines` in the output above, wait a few seconds for the pipeline to spin up, and run the cell again until you can confirm it is present.

In [None]:
# Get the length of the output file once a second for 5 seconds.
!for i in {1..5}; do cat data/output/sid-minibert-onnx-output.jsonlines | wc -l; sleep 1; done
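
The same polling can be done in Python. The sketch below is a minimal standard-library version with hypothetical helper names (`count_lines`, `poll_line_count`). It treats a missing file as having zero lines, on the assumption that the pipeline may not have created the output file yet.

```python
import os
import time
from typing import List


def count_lines(path: str) -> int:
    """Count lines in a file, or return 0 if it does not exist yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def poll_line_count(path: str, times: int = 5, delay: float = 1.0) -> List[int]:
    """Sample the line count `times` times, sleeping `delay` seconds in between."""
    counts = []
    for attempt in range(times):
        counts.append(count_lines(path))
        if attempt < times - 1:
            time.sleep(delay)
    return counts
```

For example, `poll_line_count("data/output/sid-minibert-onnx-output.jsonlines")` returns five counts taken one second apart. While the pipeline is still writing, the numbers should grow from one sample to the next.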

Examining the output, we can see that the pipeline has labeled packets containing the classes of sensitive information we passed into the pipeline.

In [None]:
!cat data/output/sid-minibert-onnx-output.jsonlines | jq -s '.[0:3]' | tr -d '\\' # View the first 3 packet captures annotated with SID labels.

### Sample Outputs

Here is a small collection of sample outputs highlighting some examples of our pipeline's results. In addition to the sample outputs below, feel free to explore the data yourself.

**Secret Keys**

In [None]:
# Show the first 3 data fields for packets identified as having secret keys.
!cat data/output/sid-minibert-onnx-output.jsonlines | \
  jq -s 'map(select(.secret_keys == true) | {data: .data, secret_key: .secret_keys})[0:3]' | \
  `# Remove backslashes for easier reading.` \
  tr -d '\\'

**Email and Password**

In [None]:
# Show the first 3 data fields for packets identified as having both email addresses and passwords.
!cat data/output/sid-minibert-onnx-output.jsonlines | \
  jq -s 'map(select(.email == true and .password == true) | {data: .data, email: .email, password: .password})[0:3]' | \
  `# Remove backslashes for easier reading.` \
  tr -d '\\'
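
The two `jq` filters above can also be written in Python. The sketch below is one possible equivalent using only the standard library. The function name `select_flagged` and the three sample records are made up for illustration, while the field names `data`, `email`, `password`, and `secret_keys` are the ones used in the `jq` commands.

```python
import json
from typing import Iterable, List


def select_flagged(lines: Iterable[str], labels: List[str], limit: int = 3) -> List[dict]:
    """Return up to `limit` records in which every given label is true."""
    selected: List[dict] = []
    for line in lines:
        if len(selected) >= limit:
            break                                # Stop once enough matches are found.
        if not line.strip():
            continue                             # Skip blank lines.
        record = json.loads(line)
        if all(record.get(label) is True for label in labels):
            # Keep only the data field and the labels we filtered on.
            kept = {"data": record.get("data")}
            kept.update({label: record[label] for label in labels})
            selected.append(kept)
    return selected


# Made-up records standing in for the pipeline output file.
sample_output = [
    '{"data": "user=a@example.com pw=hunter2", "email": true, "password": true, "secret_keys": false}',
    '{"data": "GET /index.html", "email": false, "password": false, "secret_keys": false}',
    '{"data": "api_key=abc123", "email": false, "password": false, "secret_keys": true}',
]

print(select_flagged(sample_output, ["email", "password"]))
print(select_flagged(sample_output, ["secret_keys"]))
```

Unlike `jq -s`, this version reads the input one line at a time and stops as soon as it has enough matches, so it does not need to hold the whole output file in memory. To run it on the real results, pass an open file such as `open("data/output/sid-minibert-onnx-output.jsonlines")` in place of `sample_output`.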

## Next Steps

There are several steps you can take to learn more and get more experience with Morpheus.

### Morpheus Early Access Program

Morpheus is currently in early access, and you can [request access today](https://developer.nvidia.com/morpheus-early-access). As a part of the early access program you'll have access to all the components you need to run Morpheus, including, but not limited to:

- Containers and Helm Charts for the Morpheus SDK CLI, Triton-backed Morpheus AI engine, Kafka message broker, MLFlow plugin.
- Extensive documentation for system setup and Morpheus use.
- Several end-to-end example workflows using Morpheus for SID, Anomalous Behavior Profiling, Phishing Detection, Digital Fingerprinting and more.
- Collections of notebooks and scripts for model fine-tuning, performance analysis, and more.

Signing up for [early access](https://developer.nvidia.com/morpheus-early-access) is quick, easy, and free.

### GTC On Demand

Check out several presentations about Morpheus from NVIDIA, Booz Allen Hamilton, Red Hat, and Splunk for free at [GTC On-Demand](https://www.nvidia.com/en-us/on-demand/search/?facet.mimetype[]=event%20session&layout=list&page=1&q=morpheus&sort=date).

### GTC 2022

[Sign up](https://www.nvidia.com/auth/gtc?scope=openid+email+profile&client_id=FNwji43RhQ7B2YKM7B6rC6N7KA_Gu3-ohkaoljY9NJ8&redirect_uri=https%3A%2F%2Fevents.rainfocus.com%2Foauth%2Fnvidia%2F1629402589906001W3aJ&response_type=code&state=a804cc2db39acc087a63bdbb3226aec5de887edc9917f9846386305c1b118ce190236089340893ba3329b39a125b4691) for [GTC 2022](https://www.nvidia.com/gtc/), where there will be lots to discuss and learn about Morpheus, including a full-day deep-dive workshop on Morpheus.

## Conclusion

Thank you for taking the time to see Morpheus in action. We hope you will find this powerful tool tremendously useful.

Please return to the main course page (where you launched this interactive environment from), and continue to the next section.

![DLI Logo](images/DLI_Header.png)