## Exercise 1: Exploring Attack Surface Management Data with Pandas and Seaborn

This notebook demonstrates how to import and analyze an **Attack Surface Management (ASM)** dataset, the kind of data a vulnerability management analyst encounters in day-to-day work.
We'll use **pandas** for data manipulation and **seaborn** for visualization.

**What's the story?**

You are a new analyst in your security operations center, and you have been asked to explore the **attack surface** of your organization. The **attack surface** is the set of points on the boundary of a system, a system element, or an environment where an attacker can try to enter, cause an effect on, or extract data from, that system, system element, or environment.

As a member of an organization's security team, you need to understand what you are trying to keep secure: your data, your assets, and even your personnel! In this exercise we will explore the assets in your organization. Data exploration like this is the first step in many data science exercises.


### Key Questions:
- What is the distribution of risk levels in my environment?
- How many assets are in the cloud vs. on-prem?
- Which services are most exposed?
- How does vulnerability severity vary across risk levels?

Let's import some packages and configure our plotting:

In [None]:
# Packages to import

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:

# Configure visualization style
sns.set_style("whitegrid")

# Load dataset (update the path if necessary)
file_path = "../data/attack_surface_management_data_part_3.csv"  # Relative to this notebook: the file is expected in the sibling data/ directory
df = pd.read_csv(file_path)

# Display first few rows
df.head()

In [None]:
# Display dataset info
df.info()

In [None]:
# Check for missing values -- a good idea if you want to check the quality of your data!
df.isnull().sum()

In [None]:
# Summary statistics for numerical columns
df.describe()

In [None]:
# How old is my data? When was the data captured?
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Earliest and latest capture times in the dataset
df['timestamp'].agg(['min', 'max'])

Why might there be far fewer data points on some dates than on others? Do you think you will see the same pattern in the daily average risk scores? Why or why not?
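
One way to approach these questions is to count records per calendar day and compare the counts with the daily mean risk score. The sketch below uses a small made-up `sample` frame as a stand-in for `df`. Only the `timestamp` and `risk_score` column names come from this notebook; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the ASM data; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 12:30", "2024-01-01 17:45",
        "2024-01-02 09:15",
        "2024-01-03 10:00", "2024-01-03 22:10",
    ]),
    "risk_score": [7.0, 5.0, 9.0, 2.0, 4.0, 6.0],
})

# Group by calendar day, then count the non-missing risk scores
# and average them for each day.
daily = (
    sample.groupby(sample["timestamp"].dt.date)["risk_score"]
    .agg(records="count", avg_risk="mean")
)
print(daily)
```

Days with very few records produce averages that a single asset can swing, so it is worth reading `records` and `avg_risk` side by side before drawing conclusions from the daily trend.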

In [None]:
# Count plot for risk levels
# What is the distribution of risk levels in my environment?
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x="priority", order=["Critical", "High", "Medium", "Low"], hue="priority", palette="Reds")
plt.title("Distribution of Risk Levels")
plt.xlabel("Risk Level")
plt.ylabel("Count")
plt.show()

In [None]:
# Count cloud vs. on-prem assets
# How many assets are in the cloud vs. on-prem?
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x="asset_type", order=["cloud", "on-prem"], hue="asset_type", palette="Blues_d")
plt.title("Cloud vs. On-Prem Assets")
plt.xticks(rotation=30)
plt.xlabel("Asset Type")
plt.ylabel("Count")
plt.show()

In [None]:
# Count most common exposed services
# Which services are most exposed?
plt.figure(figsize=(8, 5))
sns.countplot(data=df, y="protocol", order=df["protocol"].value_counts().index, hue="protocol", palette="viridis")
plt.title("Most Exposed Services")
plt.xlabel("Count")
plt.ylabel("Service")
plt.show()

In [None]:
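# Box plot: Risk Level vs. Vulnerability Severity
# How does vulnerability severity vary across risk levels?
# Assumption: risk_score is numeric and vulnerability_severity holds the categories listed in `order`,
# so seaborn draws one horizontal box per severity level.
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x="risk_score", y="vulnerability_severity", order=["Critical", "High", "Medium", "Low"], hue="vulnerability_severity", palette="coolwarm")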
plt.title("Vulnerability Severity by Risk Level")
plt.xlabel("Risk Score")
plt.ylabel("Vulnerability Severity")
plt.show()

#### ATTENDEE EXERCISE: What is the relationship between vulnerability severity and priority?
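
One possible starting point, assuming both `vulnerability_severity` and `priority` are categorical columns, is a cross-tabulation that shows what share of each severity level lands in each priority. The `sample` frame below is invented for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the ASM data; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "vulnerability_severity": ["Critical", "Critical", "High", "High", "Low", "Low"],
    "priority":               ["Critical", "High",     "High", "Medium", "Low", "Low"],
})

# normalize="index" turns each row into proportions that sum to 1,
# so rows read as "of all assets with this severity, what fraction got each priority".
table = pd.crosstab(
    sample["vulnerability_severity"],
    sample["priority"],
    normalize="index",
)
print(table)
```

If the two columns tracked each other perfectly, each row would put all of its weight in a single column. Spread across several columns suggests that priority depends on more than severity alone.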

## Key Takeaways:
- **Risk Level Distribution**: Helps prioritize mitigation efforts.
- **Cloud vs. On-Prem Assets**: Identifies potential exposure in cloud environments.
- **Exposed Services**: Highlights commonly exposed attack vectors.
- **Risk vs. Vulnerabilities**: Shows how risk relates to the severity of detected vulnerabilities.

### Next Steps You Could Take:
- Drill down into specific IPs and domains for targeted mitigation.
- Identify misconfigured or outdated technologies.
- Monitor high-risk assets with more frequent scanning.
- Plot the risk scores over time to see how they are changing!
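
As a rough sketch of the last step, the daily mean risk score can be computed per priority level before plotting. The example assumes the `timestamp`, `priority`, and `risk_score` columns used earlier in this notebook; the `sample` values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the ASM data; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 12:30",
        "2024-01-02 09:15", "2024-01-02 18:40",
        "2024-01-03 10:00", "2024-01-03 22:10",
    ]),
    "priority": ["Critical", "Low", "Critical", "Critical", "Low", "Low"],
    "risk_score": [9.0, 2.0, 8.0, 6.0, 3.0, 1.0],
})

# Bucket rows by day and priority, average the risk score,
# then pivot priorities into columns (one time series per priority).
trend = (
    sample.groupby([pd.Grouper(key="timestamp", freq="D"), "priority"])["risk_score"]
    .mean()
    .unstack("priority")
)
print(trend)
```

The resulting wide table has one row per day and one column per priority, with `NaN` where a priority has no records that day. A table in this shape can be passed to `sns.lineplot(data=trend)` to draw one line per priority.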