## Exercise 2: Diagnosing what and why

This notebook demonstrates how to hunt through a dataset and diagnose malicious activity. The dataset is typical of what a threat hunter, Security Operations Center (SOC) analyst, or detection engineer encounters in their day-to-day role. We'll use **PandasAI**, a library that puts a large language model (LLM) interface on top of dataframes, to explore, query, and diagnose issues.

**What's the story?**

You are a threat hunter proactively looking to secure your organization. You form the hypothesis that some stealthy malicious activity is hiding in your network data, and you start exploring it. This is your exploratory data analysis (EDA) process.


### Key Questions:
- What can I explore in the data that I have? 
- Is there a strong dependency of certain fields on one another? How do these interactions play out?

# Imports

In [None]:
import pandas as pd
import pandasai as pai 
import os

from dotenv import load_dotenv

# pandasai imports
from pandasai.llm.openai import OpenAI
from pandasai import SmartDataframe
from pandasai import clear_cache
from pandasai import Agent

# Initialization

In [None]:
READ_FROM_PICKLE = True

In [None]:
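# Find the .env file and load its variables into the environment
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    # Fail early instead of passing a placeholder string as the API token
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")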

In [None]:
if READ_FROM_PICKLE:
    malicious_df = pd.read_pickle("../data/malicious_features_numeric.pkl")
    benign_df = pd.read_pickle("../data/benign_features_numeric.pkl")

In [None]:
# Instantiate an LLM
llm = OpenAI(api_token=openai_api_key)

In [None]:
malicious_smart = SmartDataframe(
    df=pd.DataFrame(malicious_df),
    config={"llm": llm, "verbose": True},
    name="Mirai botnet network packet capture.",
    description="A dataframe that is derived from a packet capture of the Mirai botnet network traffic.",
)

In [None]:
benign_smart = SmartDataframe(
    df=pd.DataFrame(benign_df),
    config={"llm": llm, "verbose": True},
    name="Packet capture of normal operation network traffic.",
    description="A dataframe that is derived from a packet capture of the regular operation of a network.",
)

In [None]:
malicious_smart.columns

# EDA

## Statistical

### Exploration via prompting

In [None]:
top_5_source_IPs = malicious_smart.chat("Which are the 5 most popular source IP addresses?")

top_5_source_IPs

In [None]:
print(malicious_smart.last_code_generated)
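
Because the LLM writes and runs code on your behalf, it helps to cross-check an answer with plain pandas. The cell below is a minimal sketch rather than the code PandasAI generates. It uses a tiny made-up frame in place of `malicious_df`, and the column name `src_ip` is an assumption; check `malicious_smart.columns` for the real name and encoding.

```python
import pandas as pd

# Toy stand-in for malicious_df; the column name "src_ip" is hypothetical.
toy_df = pd.DataFrame(
    {"src_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1", "10.0.0.2"]}
)


def top_n_values(df: pd.DataFrame, column: str, n: int = 5) -> pd.Series:
    """Return the n most frequent values in `column` together with their counts."""
    return df[column].value_counts().head(n)


top_sources = top_n_values(toy_df, "src_ip", n=2)
print(top_sources)
```

If the counts from the manual check and the LLM answer disagree, inspect `last_code_generated` to see which column and aggregation the model actually used.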

In [None]:
top_5_dst_ports = malicious_smart.chat("Find the most used destination ports.")
top_5_dst_ports

In [None]:
print(malicious_smart.last_code_generated)

### Prompt engineering

In [None]:
# The result holds commonly used well-known ports, so name it accordingly
known_ports = malicious_smart.chat(
    "Which are the most commonly used known destination ports?"
)
known_ports

In [None]:
# Same question, with "known" spelled out as an explicit numeric threshold
known_ports = malicious_smart.chat(
    "Which are the most used destination ports less than or equal to 1024?"
)
known_ports
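
The second prompt replaces the fuzzy word "known" with a number the model can filter on. For reference, IANA's system (well-known) port range is 0 to 1023, while the prompt uses 1024 as its cut-off, so the sketch below takes the cut-off as a parameter. It shows the same filter in plain pandas on made-up data. The column name `dst_port` is an assumption, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for malicious_df; "dst_port" is a hypothetical column name.
toy_df = pd.DataFrame({"dst_port": [23, 23, 2323, 80, 23, 48101, 80, 22]})


def top_ports_at_or_below(
    df: pd.DataFrame, column: str, max_port: int = 1024, n: int = 5
) -> pd.Series:
    """Count destination ports no greater than max_port, most frequent first."""
    low_ports = df.loc[df[column] <= max_port, column]
    return low_ports.value_counts().head(n)


low_port_counts = top_ports_at_or_below(toy_df, "dst_port")
print(low_port_counts)
```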

### Correlation

In [None]:
correlation = malicious_smart.chat(
    """1. Calculate the correlation between the source port and the length of a packet.
       2. Explain if the correlation that you calculated is significant and why.
    """
)
correlation
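
An LLM's verdict on "significance" is worth verifying. One way to do so, sketched here with NumPy only, is a permutation test: shuffle one variable many times and see how often chance alone produces a correlation as strong as the observed one. The data is synthetic, and the column names `src_port` and `packet_length` are assumptions rather than names from the dataset.

```python
import numpy as np
import pandas as pd


def correlation_with_permutation_test(x, y, n_permutations=2000, seed=0):
    """Return Pearson r and a two-sided permutation p-value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed_r = np.corrcoef(x, y)[0, 1]
    rng = np.random.default_rng(seed)
    # Shuffling y breaks any real association, giving a null distribution for r.
    null_r = np.array(
        [np.corrcoef(x, rng.permutation(y))[0, 1] for _ in range(n_permutations)]
    )
    # The add-one correction keeps the estimate away from an impossible p = 0.
    extreme = np.sum(np.abs(null_r) >= abs(observed_r))
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed_r, p_value


# Synthetic data in which packet length loosely depends on the source port.
data_rng = np.random.default_rng(42)
toy_df = pd.DataFrame({"src_port": data_rng.integers(1024, 65535, size=200)})
toy_df["packet_length"] = 0.001 * toy_df["src_port"] + data_rng.normal(0, 5, size=200)

r, p = correlation_with_permutation_test(toy_df["src_port"], toy_df["packet_length"])
print(f"r = {r:.3f}, permutation p-value = {p:.4f}")
```

A small p-value says the association is unlikely under random pairing; it does not say the relationship is causal or operationally meaningful.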

### Hypothesis testing
- Is there sufficient reason to believe that one field interacts with another?
- If yes, how do we quantify it?

In [None]:
hypothesis = malicious_smart.chat(
    "Is the difference between dst_ip_total_bytes and Packet Length statistically significant?"
)
hypothesis
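
The prompt leaves the choice of test to the model. One reasonable reading is a paired comparison: for each packet, does `dst_ip_total_bytes` differ from `Packet Length` on average? The sketch below runs a sign-flip permutation test on synthetic data. Treating the two columns as paired is an assumption, and the model may well pick a different test.

```python
import numpy as np
import pandas as pd


def paired_sign_flip_test(a, b, n_permutations=5000, seed=0):
    """Return the mean paired difference and a two-sided sign-flip p-value."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = diffs.mean()
    rng = np.random.default_rng(seed)
    # Under the null hypothesis each difference is equally likely to be + or -.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null_means = (signs * diffs).mean(axis=1)
    extreme = np.sum(np.abs(null_means) >= abs(observed))
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic stand-in: total bytes run about 20 higher than packet length.
data_rng = np.random.default_rng(7)
toy_df = pd.DataFrame({"Packet Length": data_rng.normal(100, 10, size=50)})
toy_df["dst_ip_total_bytes"] = toy_df["Packet Length"] + data_rng.normal(20, 5, size=50)

mean_diff, p_value = paired_sign_flip_test(
    toy_df["dst_ip_total_bytes"], toy_df["Packet Length"]
)
print(f"mean difference = {mean_diff:.2f}, p-value = {p_value:.4f}")
```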

### Outliers

In [None]:
outliers = malicious_smart.chat(
    "Find the z score of the Packet Length and then calculate the top ten outliers."
)
outliers
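
The z-score of a value is its distance from the mean, measured in standard deviations: $z = (x - \mu) / \sigma$. A minimal pandas sketch of the request above, run on made-up packet lengths rather than the real capture, looks like this:

```python
import pandas as pd


def top_zscore_outliers(df: pd.DataFrame, column: str, n: int = 10) -> pd.DataFrame:
    """Return the n rows whose `column` value lies furthest from the mean."""
    values = df[column].astype(float)
    # Population standard deviation (ddof=0); assumes the column is not constant.
    z = (values - values.mean()) / values.std(ddof=0)
    ranked = z.abs().sort_values(ascending=False).head(n)
    return df.loc[ranked.index].assign(z_score=z.loc[ranked.index])


# Made-up packet lengths with one obviously oversized packet.
toy_df = pd.DataFrame({"Packet Length": [60, 62, 58, 61, 59, 60, 1500, 63, 57, 60]})
outlier_rows = top_zscore_outliers(toy_df, "Packet Length", n=3)
print(outlier_rows)
```

Z-scores assume a roughly symmetric distribution; packet lengths are often skewed, so treat the ranking as a starting point for investigation rather than a verdict.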

## Visualizations

In [None]:
malicious_smart.chat(
    "Plot the heatmap of the correlations of all variables."
)

#### ATTENDEE EXERCISE: Create more visualizations that you think will be helpful via prompting

# Agents

In [None]:
from pandasai import Agent

In [None]:
agent = Agent(malicious_smart, config={"llm": llm}, memory_size=1000)

In [None]:
agent.chat('Is this dataframe indicating malicious or benign network behavior?')

In [None]:
agent.chat('The dataframe that you have been given is a packet capture of computer network traffic. It has numerical features that characterize the packets that have been observed in this network. A packet capture is malicious if it has too many requests to ports that are unusual. Usual ports are 23 for Telnet, 22 for SSH and 80 for HTTP. Is this packet capture malicious or benign?')

In [None]:
agent.explain()

# Clear cache

In [None]:
clear_cache()