# Module 4 Lab Notebook: Data Visualization with Python (Matplotlib + Seaborn)

**Course:** Introduction to Data Science and Python  
**Module 4:** Data Visualization  
**Estimated Time:** 90–120 minutes  

## 🎯 Lab Goals
By the end of this lab, you will be able to:
- Create common plots using **Matplotlib** (line, bar, scatter, histogram)
- Create statistical graphics using **Seaborn** (boxplot, heatmap)
- Apply visualization best practices (titles, labels, readability)
- Use visualizations to identify trends and outliers in cybersecurity-style data

## 📁 Files for this Lab
- Dataset: `module4_cyber_events.csv` (provided in this module)


## ✅ Step 0: Setup

Run the cell below to import the libraries. If you get an import error, review Module 0 (library installation).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: set a larger default figure size (width, height in inches) for readability
plt.rcParams["figure.figsize"] = (10, 5)

print("Libraries loaded successfully.")
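
If the imports succeed but a later cell behaves unexpectedly, it can help to see which library versions are installed. The snippet below is a minimal sketch of one way to check; the `required` list simply mirrors the import cell above, and the `versions` dictionary is a name introduced here for illustration.

```python
import importlib

# Library names taken from the import cell above.
required = ["pandas", "numpy", "matplotlib", "seaborn"]

versions = {}
for name in required:
    try:
        module = importlib.import_module(name)
        versions[name] = getattr(module, "__version__", "unknown")
    except ImportError:
        versions[name] = None  # not installed in this environment

for name, version in versions.items():
    status = version if version is not None else "MISSING (see Module 0)"
    print(f"{name}: {status}")
```

Any library reported as missing needs to be installed before the rest of the lab will run.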

## ✅ Step 1: Load the Dataset

1. Make sure `module4_cyber_events.csv` is in the same folder as this notebook.
2. Load it into a DataFrame and preview the first rows.

In [None]:
df = pd.read_csv("module4_cyber_events.csv")
df.head()
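
To see what `pd.read_csv` produces without touching the real file, you can load a few rows from an in-memory string. This is only a sketch: the column names are the ones used later in this lab, but the three rows and the event type labels are made up, and the real file may contain additional columns.

```python
import io

import pandas as pd

# Hypothetical sample rows; the values in the real dataset will differ.
sample_csv = io.StringIO(
    "timestamp,event_type,severity,packet_bytes\n"
    "2024-01-01 00:05:00,port_scan,2,512\n"
    "2024-01-01 00:40:00,malware,5,4096\n"
    "2024-01-01 02:15:00,port_scan,3,640\n"
)

sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # timestamp is read as text, not as a datetime
```

Note that `timestamp` is not parsed as a date automatically, which is why Step 3 converts it explicitly.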

## ✅ Step 2: Quick Exploration

Use `.info()` and `.describe()` to understand columns, missing values, and data ranges.

In [None]:
df.info()

In [None]:
df.describe(include="all")
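
`.info()` reports non-null counts, so you have to subtract to find the gaps. A more direct view of missing values is `isna()`. The sketch below uses a small made-up DataFrame (`toy`) rather than the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two gaps.
toy = pd.DataFrame({
    "event_type": ["port_scan", "malware", None, "port_scan"],
    "severity": [2, 5, 3, 1],
    "packet_bytes": [512.0, np.nan, 640.0, 128.0],
})

missing_counts = toy.isna().sum()        # missing values per column
missing_share = toy.isna().mean() * 100  # the same, as a percentage of rows

print(missing_counts)
print(missing_share.round(1))
```

Running the same two lines on `df` shows which columns of the lab dataset, if any, have missing values.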

## ✅ Step 3: Prepare Time Data

Convert the `timestamp` column into a datetime type. Then create two helper columns:
- `date` (YYYY-MM-DD)
- `hour` (0–23)

In [None]:
# errors="coerce" turns unparseable values into NaT instead of raising an error
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["date"] = df["timestamp"].dt.date
df["hour"] = df["timestamp"].dt.hour

df[["timestamp", "date", "hour"]].head()
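
Because `errors="coerce"` silently replaces bad values with `NaT`, it is worth counting how many rows were affected. The sketch below shows the behavior on three made-up strings, one of which is not a valid timestamp.

```python
import pandas as pd

# Hypothetical raw values; the second one is not a valid timestamp.
raw = pd.Series(["2024-01-01 13:45:00", "not a date", "2024-01-02 08:05:00"])

parsed = pd.to_datetime(raw, errors="coerce")  # invalid entries become NaT
n_bad = parsed.isna().sum()

print(parsed)
print(f"Unparseable timestamps: {n_bad}")
print(parsed.dt.hour)  # rows with NaT have no hour, so they show up as NaN
```

On the lab data, `df["timestamp"].isna().sum()` gives the equivalent count after the conversion.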

# Part A: Matplotlib Plots

## ✅ Step 4: Line Plot (Trend Over Time)

Create a line chart showing **count of events per hour**.

**Hint:** resample on `timestamp` at hourly frequency (or group by the timestamp rounded down to the hour) and count rows.

In [None]:
# Count rows in each hourly bin; hours with no events get a count of 0
events_per_hour = df.set_index("timestamp").resample("h").size()

plt.plot(events_per_hour.index, events_per_hour.values)
plt.title("Events Over Time (Hourly)")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
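
Resampling and grouping both give hourly counts, but they treat empty hours differently. The sketch below compares them on three made-up timestamps; `toy`, `by_resample`, and `by_groupby` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical timestamps: two events in hour 00, none in hour 01, one in hour 02.
ts = pd.to_datetime(pd.Series([
    "2024-01-01 00:05:00",
    "2024-01-01 00:40:00",
    "2024-01-01 02:15:00",
]))
toy = pd.DataFrame({"timestamp": ts})

by_resample = toy.set_index("timestamp").resample("h").size()
by_groupby = toy.groupby(toy["timestamp"].dt.floor("h")).size()

print(by_resample)  # three rows: the empty 01:00 hour appears with count 0
print(by_groupby)   # two rows: only hours that contain at least one event
```

For a line plot, the resampled version is usually the safer choice, since a quiet hour is drawn as a drop to zero instead of being skipped over.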

## ✅ Step 5: Bar Chart (Events by Type)

Create a bar chart showing the **number of events per event_type**.

In [None]:
counts_by_type = df["event_type"].value_counts()

plt.bar(counts_by_type.index, counts_by_type.values)
plt.title("Event Counts by Type")
plt.xlabel("Event Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
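
`value_counts()` returns the counts already sorted from most to least frequent, which is why the bars come out in descending order. The sketch below shows this on made-up event type labels, along with the `normalize=True` option for shares instead of raw counts.

```python
import pandas as pd

# Hypothetical event type labels.
event_types = pd.Series(
    ["port_scan", "malware", "port_scan", "phishing", "port_scan", "malware"]
)

counts = event_types.value_counts()                # most frequent first
shares = event_types.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares.round(2))
```

Shares can be easier to read than raw counts when you compare datasets of different sizes.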

## ✅ Step 6: Histogram (Distribution of Packet Size)

Create a histogram of `packet_bytes` to understand its distribution.

In [None]:
plt.hist(df["packet_bytes"], bins=20)
plt.title("Distribution of Packet Size (Bytes)")
plt.xlabel("Packet Bytes")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
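
With `bins=20`, the range from the smallest to the largest value is split into 20 equal-width intervals. `plt.hist` relies on `np.histogram` for this, so you can inspect the bins directly. The sketch below uses ten made-up packet sizes, one of them very large, to show how a single extreme value stretches the bins.

```python
import numpy as np

# Hypothetical packet sizes in bytes, including one very large value.
packet_sizes = np.array([64, 128, 128, 256, 512, 512, 1024, 1500, 1500, 9000])

counts, edges = np.histogram(packet_sizes, bins=20)

print(len(counts), len(edges))  # 20 bins are bounded by 21 edges
print(edges[1] - edges[0])      # width of each bin in bytes
print(counts)                   # most values land in the first few bins
```

If your histogram looks like one tall bar and a long empty tail, try more bins or a log-scaled x-axis before concluding that the data has no structure.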

## ✅ Step 7: Scatter Plot (Severity vs Packet Size)

Create a scatter plot where:
- X = `packet_bytes`
- Y = `severity`

Then answer in a Markdown cell:
- Do you see any pattern? Any outliers?

In [None]:
plt.scatter(df["packet_bytes"], df["severity"], alpha=0.7)
plt.title("Severity vs Packet Size")
plt.xlabel("Packet Bytes")
plt.ylabel("Severity")
plt.tight_layout()
plt.show()

**Your observation (write below):**  
- Pattern:  
- Outliers:  
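
To back up a visual impression with a number, you can compute the Pearson correlation between the two columns. The sketch below uses made-up values and also shows how much a single outlier can change the result, which is one reason to look at the scatter plot and not only at the coefficient.

```python
import pandas as pd

# Hypothetical values with a clear upward pattern.
toy = pd.DataFrame({
    "packet_bytes": [100, 200, 300, 400, 500],
    "severity": [1, 2, 2, 4, 5],
})
r_without = toy["packet_bytes"].corr(toy["severity"])  # Pearson by default

# Add one made-up outlier: a very large packet with low severity.
outlier = pd.DataFrame({"packet_bytes": [5000], "severity": [1]})
toy_with = pd.concat([toy, outlier], ignore_index=True)
r_with = toy_with["packet_bytes"].corr(toy_with["severity"])

print(f"r without outlier: {r_without:.2f}")
print(f"r with outlier:    {r_with:.2f}")
```

In this toy case the single extra point flips the sign of the correlation, even though the other five points still follow the same upward pattern.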


# Part B: Seaborn Statistical Graphics

## ✅ Step 8: Boxplot (Severity by Event Type)

Create a Seaborn boxplot of `severity` grouped by `event_type`.

Goal: identify which event types tend to have higher severity and where outliers appear.

In [None]:
sns.boxplot(data=df, x="event_type", y="severity")
plt.title("Severity by Event Type (Boxplot)")
plt.xlabel("Event Type")
plt.ylabel("Severity")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
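
With its default settings, a boxplot draws individual points for values that lie more than 1.5 times the interquartile range (IQR) beyond the box. The sketch below applies that same rule by hand to a made-up list of severity values, so you can see why a point is flagged.

```python
import pandas as pd

# Hypothetical severity values; 10 is far from the rest.
severity = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 10])

q1 = severity.quantile(0.25)
q3 = severity.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # values below this are flagged
upper = q3 + 1.5 * iqr  # values above this are flagged

outliers = severity[(severity < lower) | (severity > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower}, {upper}]")
print(outliers.tolist())
```

To apply this per event type on the lab data, you would compute the quartiles within each group, for example with `df.groupby("event_type")["severity"]`.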

## ✅ Step 9: Heatmap (Correlation)

Compute correlations for numeric columns and display a heatmap.

**Tip:** Use `df.select_dtypes("number")` to isolate numeric columns.

In [None]:
numeric_df = df.select_dtypes("number")
corr = numeric_df.corr()

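# Fix the color scale to [-1, 1] so that 0 sits at the midpoint of a diverging palette
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")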
plt.title("Correlation Heatmap (Numeric Features)")
plt.tight_layout()
plt.show()
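
It helps to look at the correlation matrix as a table before coloring it. The sketch below uses a made-up DataFrame to show that `select_dtypes("number")` drops text columns and that the resulting matrix is symmetric with 1.0 on the diagonal.

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns.
toy = pd.DataFrame({
    "event_type": ["port_scan", "malware", "phishing", "port_scan"],
    "severity": [1, 5, 3, 2],
    "packet_bytes": [100, 900, 400, 300],
})

numeric_toy = toy.select_dtypes("number")
corr_toy = numeric_toy.corr()  # Pearson correlation by default

print(list(numeric_toy.columns))  # the text column is gone
print(corr_toy.round(2))
```

On the lab data, remember that the `hour` column created in Step 3 is numeric too, so it will appear in the heatmap.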

# Part C: Mini Dashboard (3 Visuals)

## ✅ Step 10: Create a Simple Dashboard

Create **three visuals** that would help someone quickly understand security activity.

Minimum requirements:
1. Trend over time (line)
2. Breakdown by type (bar or countplot)
3. Distribution or outliers (histogram or boxplot)

**Instruction:** Put all three plots in sequence below with short titles and clear labels.

In [None]:
# Dashboard Visual 1: Trend over time
events_per_hour = df.set_index("timestamp").resample("h").size()
plt.plot(events_per_hour.index, events_per_hour.values)
plt.title("Dashboard 1: Events Over Time (Hourly)")
plt.xlabel("Time")
plt.ylabel("Event Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Dashboard Visual 2: Events by type
sns.countplot(data=df, x="event_type")
plt.title("Dashboard 2: Event Counts by Type")
plt.xlabel("Event Type")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Dashboard Visual 3: Outliers / distribution
sns.boxplot(data=df, x="event_type", y="severity")
plt.title("Dashboard 3: Severity by Event Type")
plt.xlabel("Event Type")
plt.ylabel("Severity")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
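
The lab asks for three plots in sequence, which is all you need to submit. If you later want the three visuals side by side in one figure, `plt.subplots` is one option. The sketch below is an optional alternative built on a small made-up DataFrame; it uses a plain Matplotlib histogram for the third panel, and the panel titles are only examples.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data using the lab's column names.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:05:00", "2024-01-01 00:40:00",
        "2024-01-01 01:10:00", "2024-01-01 02:15:00",
    ]),
    "event_type": ["port_scan", "malware", "port_scan", "phishing"],
    "severity": [2, 5, 3, 4],
})

per_hour = toy.set_index("timestamp").resample("h").size()
per_type = toy["event_type"].value_counts()

# One row, three panels; each panel is its own Axes object.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(per_hour.index, per_hour.values)
axes[0].set_title("Events Over Time (Hourly)")
axes[0].set_xlabel("Time")
axes[0].set_ylabel("Event Count")
axes[0].tick_params(axis="x", labelrotation=45)

axes[1].bar(per_type.index, per_type.values)
axes[1].set_title("Event Counts by Type")
axes[1].set_xlabel("Event Type")
axes[1].set_ylabel("Count")

axes[2].hist(toy["severity"], bins=5)
axes[2].set_title("Severity Distribution")
axes[2].set_xlabel("Severity")
axes[2].set_ylabel("Frequency")

fig.tight_layout()
plt.show()
```

With subplots, titles and labels are set on each `Axes` (`set_title`, `set_xlabel`) instead of through `plt.title` and `plt.xlabel`.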

## ✍️ Reflection (Required)

Answer each question in 1–3 sentences:

1. Which visualization was most useful for identifying patterns or anomalies? Why?  
2. What improvement would you make if presenting these visuals to a non-technical stakeholder?  
3. How could visualization support faster or more accurate security decisions?

## ✅ What to Submit

1. Your completed notebook saved as: `Module4_VisualizationLab_LastName.ipynb`  
2. Make sure all charts render correctly before submitting.

If you hit an error you cannot resolve, add a short note at the bottom of the notebook describing:
- what you tried
- the exact error message
