# EDA: Phishing URL Dataset

This notebook provides a quick exploratory analysis for Lab 2 Track 1 (phishing detection).

In [None]:
from pathlib import Path

import pandas as pd

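DATA_PATH = Path('../data/raw/phishing_site_urls.csv')
df = pd.read_csv(DATA_PATH)
# Assumes the CSV has exactly two columns, in this order.
df.columns = ['URL', 'Label']
# Normalize whitespace and case so duplicate labels and URLs collapse together.
df['URL'] = df['URL'].astype(str).str.strip().str.lower()
df['Label'] = df['Label'].astype(str).str.strip().str.lower()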

print(df.head())
print('\nLabel counts:')
print(df['Label'].value_counts())

In [None]:
import matplotlib.pyplot as plt

counts = df['Label'].value_counts().sort_index()
ax = counts.plot(kind='bar', color=['#4C78A8', '#F58518'])
ax.set_title('Class Distribution')
ax.set_xlabel('Label')
ax.set_ylabel('Count')
plt.tight_layout()
plt.show()

In [None]:
feature_df = pd.DataFrame({
    'url_length': df['URL'].str.len(),
    # In a raw string, a single backslash is enough to escape the regex token.
    'num_dots': df['URL'].str.count(r'\.'),
    'num_digits': df['URL'].str.count(r'\d'),
    'num_slashes': df['URL'].str.count(r'/')
})

summary = feature_df.describe().T[['mean', 'std', 'min', 'max']]
print(summary)

print('\nFeature means by class:')
print(pd.concat([feature_df, df['Label']], axis=1).groupby('Label').mean())

## Insight
Phishing URLs in this dataset tend to be longer and to contain more dots, slashes, and digits than benign URLs.  
These patterns suggest that attackers frequently use nested paths, encoded tokens, and deceptive naming structures.  
Because these signals are purely lexical, they can be computed from the URL string alone, so a baseline model built on them is a reasonable first step before more advanced modeling.
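
As a minimal sketch of how these lexical signals could be computed for a single URL (for example, when scoring new inputs outside the DataFrame), the helper below mirrors the four counts used in this notebook. The function name `lexical_features` and the example URLs are hypothetical and are not taken from the dataset.

```python
import re


def lexical_features(url: str) -> dict:
    """Return the four lexical counts used in this notebook for one URL."""
    # Same normalization as the loading cell: trim whitespace, lowercase.
    url = url.strip().lower()
    return {
        'url_length': len(url),
        'num_dots': url.count('.'),
        'num_digits': len(re.findall(r'\d', url)),
        'num_slashes': url.count('/'),
    }


# Hypothetical URLs, invented for illustration only (not rows from the dataset).
examples = [
    'example.com/docs/index.html',
    'login.example.com.account-verify123.test/secure/update/9f8a7',
]
for u in examples:
    print(u, lexical_features(u))
```

The same function could be applied row by row with `df['URL'].apply(lexical_features)`, although the vectorized `str` methods used in the feature cell are likely faster on the full dataset.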