# Chapter 1

## 1.3 Distribution Analysis

### Context and Motivation

Understanding the shape of a data distribution is fundamental in statistics, especially when interpreting system logs and operational telemetry.
Distributions reveal properties such as central tendency, skewness, heavy tails, and modality, all of which can indicate either expected variability or early signs of system instability.

In this notebook, we continue exploring the Apache logs used in Days 1 and 2, focusing on **visual and statistical analysis of distributions**.


### 1.3.1 Dataset Overview

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# set_theme is the current name for seaborn's theme setup; sns.set is kept as an alias
sns.set_theme(style='whitegrid')

url = "https://raw.githubusercontent.com/logpai/loghub/refs/heads/master/Apache/Apache_2k.log_structured.csv"
df = pd.read_csv(url)
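# Parse timestamps; values that cannot be parsed become NaT instead of raising
df['Time_parsed'] = pd.to_datetime(df['Time'], errors='coerce')
# Derive the hour of day (0-23) and the weekday name from the parsed timestamp
df['Hour'] = df['Time_parsed'].dt.hour
df['DayOfWeek'] = df['Time_parsed'].dt.day_name()
df.head()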

### 1.3.2 Distribution of Log Volume per Hour

In [None]:
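# Count events per hour of day (0-23), pooled across all dates in the log
hourly_counts = df['Hour'].value_counts().sort_index()
# Summary statistics (count, mean, std, quartiles) of the per-hour event counts
hourly_counts.describe()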
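
Note that `df['Hour']` holds the hour of day, so the counts above merge the same clock hour from different dates into one bucket. If volume per calendar hour is of interest instead, one possible approach is to truncate each timestamp to the start of its hour before counting. The sketch below uses a few made-up timestamps (an assumption for illustration, not values taken from the dataset) in place of `df['Time_parsed']`.

```python
import pandas as pd

# Hypothetical timestamps standing in for df['Time_parsed'] (not real log data)
timestamps = pd.Series(pd.to_datetime([
    "2005-12-04 04:47:44",
    "2005-12-04 04:51:08",
    "2005-12-04 06:01:00",
    "2005-12-05 04:10:00",
]))

# Hour of day (0-23): the 04:xx events from both dates share one bucket
by_hour_of_day = timestamps.dt.hour.value_counts().sort_index()

# Calendar hour: floor each timestamp to the start of its hour, so
# 2005-12-04 04:00 and 2005-12-05 04:00 are counted separately
by_calendar_hour = timestamps.dt.floor("h").value_counts().sort_index()

print(by_hour_of_day)
print(by_calendar_hour)
```

Which grouping is appropriate depends on the question: hour of day describes the daily usage profile, while calendar hour describes how volume varies from one hour to the next.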

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(hourly_counts, bins=12, kde=True, color='cornflowerblue')
plt.title("Distribution of Log Volume per Hour")
plt.xlabel("Event Count per Hour")
plt.ylabel("Frequency")
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.kdeplot(hourly_counts, fill=True, linewidth=2, color='slateblue')
plt.title("Kernel Density Estimate (KDE) — Hourly Log Volume")
plt.xlabel("Log Volume")
plt.ylabel("Density")
plt.show()

In [None]:
plt.figure(figsize=(10, 2))
sns.boxplot(x=hourly_counts, color='tomato')
plt.title("Boxplot — Hourly Log Volume")
plt.xlabel("Log Volume")
plt.show()

### 1.3.3 Tail Behavior & Outlier Detection

In [None]:
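# Tukey's rule: flag values more than 1.5 * IQR above the third quartile
q1 = hourly_counts.quantile(0.25)
q3 = hourly_counts.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Hours whose event count exceeds the upper fence
outliers = hourly_counts[hourly_counts > upper_fence]
outliers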

Values above the upper fence (Q3 + 1.5 × IQR, the limit beyond which a boxplot draws points individually) indicate **exceptional log activity**, which may signal unusual workload, failure spikes, or logging anomalies.
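
The cell above checks only the upper fence. Unusually quiet hours can matter as well, for example when logging stops during an outage, so a symmetric check may be useful. The following is a minimal sketch under that assumption: the helper name `iqr_fences` and the sample counts are hypothetical and are not part of the notebook or the dataset.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical hourly counts (illustrative only)
counts = pd.Series([10, 12, 11, 13, 12, 11, 95, 1])

lower, upper = iqr_fences(counts)

# Keep the values that fall outside either fence
flagged = counts[(counts < lower) | (counts > upper)]
print(lower, upper)
print(flagged)
```

With real data the same call would be `iqr_fences(hourly_counts)`. The multiplier `k = 1.5` is the conventional default; a larger value such as 3 flags only more extreme points.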


### 1.3.4 Final Considerations

Distributions help uncover the nature of system behavior:

- A **right-skewed distribution** means most hours carry modest volume while a few carry far more, which points to rare but intense events.
- **Outliers** can be precursors to failures or reveal uneven usage.
- **KDEs and boxplots** provide compact, powerful ways to visualize spread, center, and anomalies.
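
Skew and tail weight can also be summarized numerically rather than only read off a plot. The sketch below uses pandas' built-in `Series.skew()` (sample skewness) and `Series.kurt()` (excess kurtosis, where a normal distribution scores 0) on made-up counts; the data are an assumption for illustration, and in the notebook the same calls would apply to `hourly_counts`.

```python
import pandas as pd

# Hypothetical hourly counts with one very busy hour (illustrative only)
counts = pd.Series([5, 6, 5, 7, 6, 5, 6, 40])

skewness = counts.skew()         # > 0 suggests a longer right tail
excess_kurtosis = counts.kurt()  # > 0 suggests heavier tails than a normal

# In a right-skewed sample the mean is usually pulled above the median
mean_value = counts.mean()
median_value = counts.median()

print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")
print(f"mean={mean_value:.1f}, median={median_value:.1f}")
```

With only a handful of hourly buckets these estimates are noisy, so they are best treated as a rough complement to the plots rather than a precise measurement.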

This understanding will support more accurate thresholds, expectations, and future hypothesis testing in the next chapters.
