<img src="../img/GTK_Logo_Social Icon.jpg" width=175 align="right" />


# Worksheet 10: Using GPT for Anomaly Detection - Answers

This worksheet covers concepts relating to Anomaly Detection using an LLM.  It should take no more than 20-30 minutes to complete.  Please raise your hand if you get stuck.  

There are many ways to accomplish the tasks presented here; however, if you use the techniques covered in class, the exercises should be relatively simple.

## Import the Libraries
For this exercise, we will be using:
* Pandas (https://pandas.pydata.org/pandas-docs/stable/)
* OpenAI (https://pypi.org/project/openai/)

## Obtain an OpenAI Key
For this exercise, you will need an OpenAI API Key. Please go to https://openai.com and get your API key. For the purposes of this exercise, you do not need a paid account.  

In [103]:
import pandas as pd 
import os 
import openai
import ast
import json

Copy your OpenAI API key here.

In [104]:
openai.api_key = "<Your OpenAI API Key Here...>"

## Step 1:  Read in the Data

For this exercise, we're going to see how well an OpenAI model detects anomalies in data.  We're going to use the same CPU dataset that we used in the Anomaly Detection worksheet.

First, we're going to read the "training" data into a dataframe, followed by the test data.  If you recall, the training data does not contain any anomalies, but the test data does, starting about 10 minutes into the data (the readings are one minute apart).

In [105]:
training_data = pd.read_csv('../data/cpu-train-b.csv', parse_dates=[0])
testing_data = pd.read_csv('../data/cpu-test-b.csv', parse_dates=[0])

## Step 2:  Format the Data for the Prompt
Next, we're going to create a prompt that has a sample of the training data, as well as the unknown data. There are many ways to format the prompt, but we're going to do something like this.  We want every line in the dataframe to be formatted as shown below:

```
Timestamp: 2022-01-01 23:01:01, cpu: 2.0
```

As a first step, write a function that iterates over the rows of a dataframe and returns them as a single string in the format above, one row per line.

In [106]:
def format_data_for_prompt(df: pd.DataFrame) -> str:
    '''
    This function accepts a dataframe as input and formats that data
    for the OpenAI prompt.
    '''
    data = ""
    for index, row in df.iterrows():
        data += f"Timestamp: {row['datetime']}, cpu: {row['cpu']}\n"
    return data
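
As a quick check of the expected format, the sketch below builds a two-row dataframe by hand and formats it. It assumes the columns are named `datetime` and `cpu`, as in the function above. The names `format_rows` and `sample` are illustrative only, and the join-based approach is just one alternative to `iterrows()`, not the required solution.

```python
import pandas as pd


def format_rows(df: pd.DataFrame) -> str:
    """Return one 'Timestamp: ..., cpu: ...' line per row of df."""
    # Assumes df has 'datetime' and 'cpu' columns (names taken from the worksheet).
    return "".join(
        f"Timestamp: {ts}, cpu: {cpu}\n"
        for ts, cpu in zip(df["datetime"], df["cpu"])
    )


# Two hand-written rows in the same shape as the CPU dataset.
sample = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2017-01-28 04:42:00", "2017-01-28 04:43:00"]),
        "cpu": [1.71, 1.58],
    }
)

formatted = format_rows(sample)
print(formatted)
```

Building the lines in a generator and joining them once avoids repeated string concatenation, which may matter if you format more than a few hundred rows.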

## Step 3:  Generate the Prompt
Now that we have that, use the function to craft a prompt.  In our experiments, the following structure worked well:

```
The following contains 50 lines representing normal CPU timeseries data.
Timestamp: 2017-01-27 19:03:00, cpu: 0.88
Timestamp: 2017-01-28 02:09:00, cpu: 1.22
Timestamp: 2017-01-28 04:04:00, cpu: 1.57
Timestamp: 2017-01-28 00:10:00, cpu: 0.87
Timestamp: 2017-01-28 02:08:00, cpu: 1.28
Timestamp: 2017-01-28 03:28:00, cpu: 1.42
Timestamp: 2017-01-28 01:03:00, cpu: 2.23
...

Below are 40 rows of unknown data.  Find any anomalies in this data.
Timestamp: 2017-01-28 04:42:00, cpu: 1.71
Timestamp: 2017-01-28 04:43:00, cpu: 1.58
Timestamp: 2017-01-28 04:44:00, cpu: 1.86
Timestamp: 2017-01-28 04:45:00, cpu: 1.66
Timestamp: 2017-01-28 04:46:00, cpu: 1.61
Timestamp: 2017-01-28 04:47:00, cpu: 1.52
Timestamp: 2017-01-28 04:48:00, cpu: 1.43
...


Output the anomalous rows as a python object.
```
You do not need, and should not use, the entire training set: that would create a very long prompt and, as a result, an expensive API call.  For this example, we arbitrarily chose a sample of 50 rows.  For the testing set, you'll need to send at least the first 20 rows so that the prompt includes some of the anomalous readings.
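
If you change the sample sizes, the row counts stated in the prompt text can easily drift out of sync with the data you actually send. The sketch below shows one way to avoid that by deriving the counts from the inputs. `build_prompt` is a hypothetical helper, not part of the worksheet, and it takes already formatted lines rather than a dataframe.

```python
from typing import Sequence


def build_prompt(normal_lines: Sequence[str], unknown_lines: Sequence[str]) -> str:
    """Assemble the three-part prompt, computing the row counts from the inputs."""
    parts = [
        f"The following contains {len(normal_lines)} lines representing normal CPU timeseries data.",
        *normal_lines,
        "",  # blank line between the normal data and the unknown data
        f"Below are {len(unknown_lines)} rows of unknown data.  Find any anomalies in this data.",
        *unknown_lines,
        "",
        "Output the anomalous rows as a python object.",
    ]
    return "\n".join(parts)


example_prompt = build_prompt(
    [
        "Timestamp: 2017-01-27 19:03:00, cpu: 0.88",
        "Timestamp: 2017-01-28 02:09:00, cpu: 1.22",
    ],
    [
        "Timestamp: 2017-01-28 04:42:00, cpu: 1.71",
        "Timestamp: 2017-01-28 04:53:00, cpu: 0.04",
    ],
)
print(example_prompt)
```

To use it with the worksheet's formatter, you could pass `format_data_for_prompt(df).splitlines()` for each argument.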

In [108]:
normal_data = format_data_for_prompt(training_data.sample(50))
prompt = f"The following contains 50 lines representing normal CPU timeseries data.\n{normal_data}\n"

In [109]:
unknown_data = format_data_for_prompt(testing_data.head(40))
prompt += f"Below are 40 rows of unknown data.  Find any anomalies in this data.\n\n{unknown_data}\n\n"
prompt += "Output the anomalous rows as a python object."

print(prompt)

The following contains 50 lines representing normal CPU timeseries data.
Timestamp: 2017-01-27 19:03:00, cpu: 0.88
Timestamp: 2017-01-28 02:09:00, cpu: 1.22
Timestamp: 2017-01-28 04:04:00, cpu: 1.57
Timestamp: 2017-01-28 00:10:00, cpu: 0.87
Timestamp: 2017-01-28 02:08:00, cpu: 1.28
Timestamp: 2017-01-28 03:28:00, cpu: 1.42
Timestamp: 2017-01-28 01:03:00, cpu: 2.23
Timestamp: 2017-01-28 00:14:00, cpu: 0.73
Timestamp: 2017-01-28 01:37:00, cpu: 2.41
Timestamp: 2017-01-28 01:10:00, cpu: 1.93
Timestamp: 2017-01-27 21:26:00, cpu: 0.98
Timestamp: 2017-01-27 22:41:00, cpu: 2.01
Timestamp: 2017-01-27 23:24:00, cpu: 1.75
Timestamp: 2017-01-27 19:31:00, cpu: 0.61
Timestamp: 2017-01-28 01:53:00, cpu: 1.81
Timestamp: 2017-01-27 23:39:00, cpu: 1.41
Timestamp: 2017-01-28 00:45:00, cpu: 1.48
Timestamp: 2017-01-28 04:01:00, cpu: 1.62
Timestamp: 2017-01-27 23:46:00, cpu: 1.23
Timestamp: 2017-01-27 21:04:00, cpu: 1.14
Timestamp: 2017-01-27 22:22:00, cpu: 0.86
Timestamp: 2017-01-27 21:05:00, cpu: 1.11
Tim

## Step 4:  Call the LLM
The next step is to actually call the LLM using the prompt you have generated.  We have provided a convenience function below to actually make the call.

For this step, make the call, then clean up the response so that you are left with only the model's answer.
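
Model output is not guaranteed to be a clean Python literal: it may carry a label such as "Anomalous Rows:", trailing commentary, or be cut off partway through. The sketch below is one defensive way to pull out a list literal. `extract_python_literal` is a hypothetical helper name, and the regular expression simply assumes the answer is the outermost `[...]` in the text.

```python
import ast
import re


def extract_python_literal(text: str):
    """Return the outermost list literal found in text, or None if none parses."""
    # Greedy match from the first '[' to the last ']' across line breaks.
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        # literal_eval only evaluates literals, so it is safer than eval().
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        # Malformed or truncated output ends up here.
        return None


raw_text = 'Anomalous Rows: \n[["Timestamp: 2017-01-28 04:53:00, cpu: 0.04"]]\nDone.'
parsed = extract_python_literal(raw_text)
print(parsed)
```

Returning `None` rather than raising lets the caller decide whether to retry the call or report the failure.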

### Note About Model Choice: 
In our experiments developing this lab, we found that the model `text-davinci-003` worked the best for this use case.  Unfortunately, OpenAI is deprecating this model as well as the completions interface. 

In [115]:
def make_openai_call(prompt: str):
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=1,
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return response

def clean_results(response):
    raw_response = response['choices'][0]['text'].strip()
    raw_response = raw_response.replace("Anomalous Rows: \n", "")
    return json.dumps(ast.literal_eval(raw_response))

In [110]:
response = make_openai_call(prompt)

In [114]:
results = clean_results(response)
print(results)

[["Timestamp: 2017-01-28 04:53:00, cpu: 0.04"], ["Timestamp: 2017-01-28 04:54:00, cpu: 0.07"], ["Timestamp: 2017-01-28 04:55:00, cpu: 0.03"], ["Timestamp: 2017-01-28 04:56:00, cpu: 0.07"], ["Timestamp: 2017-01-28 04:57:00, cpu: 0.03"], ["Timestamp: 2017-01-28 04:58:00, cpu: 0.04"], ["Timestamp: 2017-01-28 04:59:00, cpu: 0.06"], ["Timestamp: 2017-01-28 05:00:00, cpu: 0.05"], ["Timestamp: 2017-01-28 05:01:00, cpu: 0.07"], ["Timestamp: 2017-01-28 05:02:00, cpu: 0.09"], ["Timestamp: 2017-01-28 05:03:00, cpu: 0.04"], ["Timestamp: 2017-01-28 05:04:00, cpu: 0.07"], ["Timestamp: 2017-01-28 05:05:00, cpu: 0.03"], ["Timestamp: 2017-01-28 05:06:00, cpu: 0.05"], ["Timestamp: 2017-01-28 05:07:00, cpu: 0.05"], ["Timestamp: 2017-01-28 05:08:00, cpu: 0.05"], ["Timestamp: 2017-01-28 05:09:00, cpu: 0.03"], ["Timestamp: 2017-01-28 05:10:00, cpu: 0.07"], ["Timestamp: 2017-01-28 05:11:00, cpu: 0.06"], ["Timestamp: 2017-01-28 05:12:00, cpu: 1.1"], ["Timestamp: 2017-01-28 05:13:00, cpu: 2.13"], ["Timestamp: 
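
The cleaned result is a JSON string containing a list of single-element lists, each holding one formatted row. To compare the flagged rows against the original dataframe, it can help to turn them back into timestamps and numbers. The sketch below is one way to do that. `parse_anomalies` and `ROW_PATTERN` are hypothetical names, and the pattern assumes the model echoes rows in exactly the prompt format, which it may not always do.

```python
import json
import re
from datetime import datetime

# Matches rows of the form "Timestamp: 2017-01-28 04:53:00, cpu: 0.04".
ROW_PATTERN = re.compile(
    r"Timestamp: (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}), cpu: ([-+]?\d*\.?\d+)"
)


def parse_anomalies(results_json: str):
    """Convert the JSON result string into a list of (datetime, cpu) tuples."""
    records = []
    for entry in json.loads(results_json):
        # Each entry is expected to be a one-element list; tolerate bare strings.
        if isinstance(entry, list):
            entry = entry[0] if entry else ""
        match = ROW_PATTERN.search(str(entry))
        if match:  # rows that do not match the pattern are skipped
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            records.append((timestamp, float(match.group(2))))
    return records


sample_results = (
    '[["Timestamp: 2017-01-28 04:53:00, cpu: 0.04"], '
    '["Timestamp: 2017-01-28 05:12:00, cpu: 1.1"]]'
)
anomalies = parse_anomalies(sample_results)
print(anomalies)
```

From here, the timestamps could be matched against the `datetime` column of the test dataframe to check which flagged rows are genuinely anomalous.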

## Final Thoughts
How did the LLM do in identifying anomalous entries in your time series?

It is interesting to see that if you ask the LLM for a description, it is also capable of telling you why a data point is anomalous. It could be interesting to try this with log entries to identify unusual behavior. 

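The obvious concerns are security, since your data is sent to a third-party service, and cost, since API usage is billed according to the amount of text sent and received.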