
# Exploratory Data Analysis (EDA)
## Project 4: ML vs Classical Regression

This notebook performs a **systematic exploratory analysis** of the dataset used for
system performance prediction. The purpose of this EDA is to:

- Understand feature distributions
- Detect non-linearities and interactions
- Identify scaling and variance issues
- Justify model selection (linear vs non-linear)

This notebook intentionally focuses on **analysis and reasoning**, not model training.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use("default")



## 1. Dataset Loading

We begin by loading the raw dataset generated by `generate_data.py`.


In [None]:

df = pd.read_csv("../data/raw/system_performance.csv")
df.head()



## 2. Basic Dataset Inspection

We inspect:

- number of samples
- feature types
- missing values


In [None]:

df.info()


In [None]:

df.isnull().sum()



## 3. Descriptive Statistics

This helps identify:

- scale differences
- variance ranges
- potential normalization needs


In [None]:

df.describe()
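
The scale differences that `df.describe()` exposes can also be condensed into a single table. The following is a minimal sketch, not part of the original notebook: `scale_summary` is a hypothetical helper, and the `demo` frame is stand-in data used only so the snippet runs on its own.

```python
import pandas as pd


def scale_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column min, max, std, range, and coefficient of variation."""
    numeric = frame.select_dtypes(include="number")
    summary = pd.DataFrame({
        "min": numeric.min(),
        "max": numeric.max(),
        "std": numeric.std(),
    })
    summary["range"] = summary["max"] - summary["min"]
    # Coefficient of variation: spread relative to the absolute mean.
    # It becomes inf when a column's mean is 0.
    summary["cv"] = summary["std"] / numeric.mean().abs()
    # Widest-ranging columns first, so scale gaps are visible at a glance.
    return summary.sort_values("range", ascending=False)


# Stand-in data (an assumption); in this notebook, call scale_summary(df).
demo = pd.DataFrame({
    "small_feature": [1.0, 2.0, 3.0, 4.0],
    "large_feature": [1000.0, 5000.0, 9000.0, 13000.0],
})
demo_summary = scale_summary(demo)
print(demo_summary)
```

Columns whose ranges differ by several orders of magnitude are the ones most likely to need normalization before fitting scale-sensitive models.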



## 4. Feature Distributions

We visualize each feature to understand its distribution and scale.


In [None]:

plt.figure()
plt.hist(df["input_size"], bins=40)
plt.xlabel("input_size")
plt.ylabel("Frequency")
plt.title("Distribution of input_size")
plt.show()


In [None]:

plt.figure()
plt.hist(df["threads"], bins=40)
plt.xlabel("threads")
plt.ylabel("Frequency")
plt.title("Distribution of threads")
plt.show()


In [None]:

plt.figure()
plt.hist(df["cache_kb"], bins=40)
plt.xlabel("cache_kb")
plt.ylabel("Frequency")
plt.title("Distribution of cache_kb")
plt.show()


In [None]:

plt.figure()
plt.hist(df["packet_count"], bins=40)
plt.xlabel("packet_count")
plt.ylabel("Frequency")
plt.title("Distribution of packet_count")
plt.show()


In [None]:

plt.figure()
plt.hist(df["execution_time"], bins=40)
plt.xlabel("execution_time")
plt.ylabel("Frequency")
plt.title("Distribution of execution_time")
plt.show()
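
Histograms show skew visually; a numeric check can complement them. The sketch below assumes that a log transform is a candidate remedy for right-skewed features, which this notebook does not itself claim. `skew_report` is a hypothetical helper, and the `demo` frame is synthetic stand-in data.

```python
import numpy as np
import pandas as pd


def skew_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Sample skewness per numeric column, before and after a log transform."""
    numeric = frame.select_dtypes(include="number")
    report = pd.DataFrame({"skew": numeric.skew()})
    # The log is only defined for strictly positive columns;
    # other columns get NaN in the skew_of_log column.
    positive = numeric.loc[:, (numeric > 0).all()]
    report["skew_of_log"] = np.log(positive).skew()
    return report


# Synthetic stand-in data (an assumption); in this notebook, call skew_report(df).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})
report = skew_report(demo)
print(report)
```

A large positive `skew` that drops toward zero in `skew_of_log` suggests the feature may be easier to model on a log scale.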



## 5. Relationship Between Features and Target

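Scatter plots of each feature against the target help reveal whether the relationship is linear or curved, and whether the spread of `execution_time` changes across the feature's range (heteroscedasticity).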


In [None]:

plt.figure()
plt.scatter(df["input_size"], df["execution_time"], alpha=0.4)
plt.xlabel("input_size")
plt.ylabel("Execution Time")
plt.title("Execution Time vs input_size")
plt.show()


In [None]:

plt.figure()
plt.scatter(df["threads"], df["execution_time"], alpha=0.4)
plt.xlabel("threads")
plt.ylabel("Execution Time")
plt.title("Execution Time vs threads")
plt.show()


In [None]:

plt.figure()
plt.scatter(df["cache_kb"], df["execution_time"], alpha=0.4)
plt.xlabel("cache_kb")
plt.ylabel("Execution Time")
plt.title("Execution Time vs cache_kb")
plt.show()


In [None]:

plt.figure()
plt.scatter(df["packet_count"], df["execution_time"], alpha=0.4)
plt.xlabel("packet_count")
plt.ylabel("Execution Time")
plt.title("Execution Time vs packet_count")
plt.show()
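
What the scatter plots suggest by eye can be checked numerically. One possible approach, sketched here under assumptions, is to fit a straight line and then summarize the residuals within quantile bins of the feature: a bin-wise residual mean that drifts away from zero hints at curvature, and a bin-wise residual standard deviation that grows or shrinks hints at heteroscedasticity. `residual_profile` is a hypothetical helper, and the synthetic data is constructed so that noise grows with `x`.

```python
import numpy as np
import pandas as pd


def residual_profile(x, y, n_bins: int = 5) -> pd.DataFrame:
    """Mean, std, and count of straight-line residuals per quantile bin of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = pd.Series(y - (slope * x + intercept))
    bins = pd.qcut(pd.Series(x), q=n_bins, duplicates="drop")
    return residuals.groupby(bins, observed=True).agg(["mean", "std", "count"])


# Synthetic stand-in data (an assumption): linear trend, noise proportional to x.
rng = np.random.default_rng(0)
x_demo = rng.uniform(1.0, 100.0, size=2000)
y_demo = 2.0 * x_demo + rng.normal(0.0, 0.1 * x_demo)
profile = residual_profile(x_demo, y_demo)
print(profile)
```

In this notebook the call would look like `residual_profile(df["input_size"], df["execution_time"])`, repeated for each feature of interest.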



## 6. Correlation Analysis

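Pearson correlation, the default method of `df.corr`, measures linear dependence only.
A low correlation does NOT imply that a feature is irrelevant: a strong but non-linear relationship can still produce a coefficient near zero.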


In [None]:

corr = df.corr(numeric_only=True)
corr


In [None]:

plt.figure(figsize=(6,5))
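# Fix the color limits to [-1, 1] so that zero correlation maps to the neutral midpoint.
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)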
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
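
The limits of linear correlation can be illustrated on synthetic data. The sketch below compares Pearson correlation with Spearman rank correlation, computed here as the Pearson correlation of ranks. `pearson_and_spearman` is a hypothetical helper, and the column names in `demo` are invented for illustration only.

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(frame: pd.DataFrame, target: str) -> pd.DataFrame:
    """Pearson and Spearman correlation of every numeric column with target."""
    numeric = frame.select_dtypes(include="number")
    pearson = numeric.corr(method="pearson")[target]
    # Spearman correlation equals the Pearson correlation of the ranks.
    spearman = numeric.rank().corr(method="pearson")[target]
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    return table.drop(index=target)


# Synthetic stand-in data (an assumption), fully deterministic.
z = np.linspace(-1.0, 1.0, 201)
demo = pd.DataFrame({
    "target": z,
    "u_shaped": z ** 2,               # exact dependence, but symmetric
    "monotone_curved": np.exp(3 * z)  # monotone, but far from a straight line
})
comparison = pearson_and_spearman(demo, target="target")
print(comparison)
```

The U-shaped column depends exactly on the target yet scores near zero on both measures, while the monotone column reaches a Spearman value of about 1 despite a noticeably lower Pearson value. In this notebook the call would be `pearson_and_spearman(df, target="execution_time")`.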



## 7. Interaction Effects (Conceptual)

Some system behaviors are inherently non-linear:
- diminishing returns from threads
- cache saturation
- mixed network + compute load

These effects motivate polynomial and tree-based models.
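
As a worked illustration of diminishing returns from threads, the sketch below uses Amdahl's law, where only part of the work can be parallelized. This is an assumption chosen for illustration: nothing here implies that `generate_data.py` uses this model, and `amdahl_execution_time` and its parameter values are hypothetical.

```python
import numpy as np


def amdahl_execution_time(serial_time, parallel_time, threads):
    """Execution time when only the parallel part is divided across threads."""
    threads = np.asarray(threads, dtype=float)
    return serial_time + parallel_time / threads


# Hypothetical workload: 10 time units serial, 90 time units parallelizable.
thread_counts = np.array([1, 2, 4, 8, 16])
times = amdahl_execution_time(serial_time=10.0, parallel_time=90.0,
                              threads=thread_counts)

# Time saved by each successive doubling of the thread count.
savings = -np.diff(times)
print(times)
print(savings)
```

Each doubling of threads saves half as much time as the previous one, and the total can never fall below the serial part. A model that is linear in `threads` cannot represent this shape, which is one reason to consider polynomial or tree-based alternatives.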



## 8. Feature Scaling Considerations

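Feature scale matters for linear and polynomial regression mainly through regularization, gradient-based solvers, and numerical conditioning: powers of large-valued features can differ by many orders of magnitude. Predictions from plain least squares are otherwise unchanged by rescaling a feature.
Random Forests are largely scale-invariant, because tree splits depend on the ordering of feature values rather than their magnitude.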
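
A minimal sketch of z-score standardization follows, assuming the common practice of computing the statistics on the training split only and reusing them for any other split. `standardize` is a hypothetical helper and the frames are stand-in data; the population standard deviation (`ddof=0`) is used, which matches scikit-learn's `StandardScaler`.

```python
import numpy as np
import pandas as pd


def standardize(train: pd.DataFrame, other: pd.DataFrame):
    """Scale both frames using the mean and std computed from train only."""
    mean = train.mean()
    # Guard against constant columns, whose std would be 0.
    std = train.std(ddof=0).replace(0.0, 1.0)
    return (train - mean) / std, (other - mean) / std


# Stand-in data (an assumption): two features on very different scales.
train_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                           "b": [1000.0, 2000.0, 3000.0, 4000.0]})
other_demo = pd.DataFrame({"a": [5.0], "b": [5000.0]})

train_scaled, other_scaled = standardize(train_demo, other_demo)
print(train_scaled)
print(other_scaled)
```

After scaling, both columns carry the same values even though their raw magnitudes differed by a factor of 1000.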



## 9. Outlier Inspection

Outliers may represent rare but valid system states, so they should be inspected before any decision to remove them.


In [None]:

plt.figure()
plt.boxplot(df["execution_time"], vert=False)
plt.title("Execution Time Outliers")
plt.show()
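
The points a box plot draws beyond its whiskers can also be listed explicitly. The sketch below applies the 1.5 × IQR rule, which is matplotlib's default whisker setting. `iqr_outlier_mask` is a hypothetical helper, and the `demo` values are invented for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Stand-in data (an assumption); in this notebook, use df["execution_time"].
demo = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0])
mask = iqr_outlier_mask(demo)
print(demo[mask])
```

Filtering the full frame with `df[iqr_outlier_mask(df["execution_time"])]` would show the feature values behind each flagged row, which helps judge whether the row is a valid extreme state or a data error.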



## 10. EDA Summary


- Execution time shows non-linear dependence on multiple features
- Several features operate on vastly different scales
- Linear correlation is insufficient to explain behavior
- Non-linear models are justified

