# CE93 Project 1 — Exploratory Data Analysis
**Dataset used:** `CE93_07_AirQuality_WaterPollution.csv` (provided)
**Number of measurements (rows):** 3538

**Columns and data types:**

| Column | Data type |
|---|---|
| City | object |
| Air Quality | float64 |
| Water Pollution | float64 |

---

**Notes:**
- The notebook below is pre-populated with analysis code that you can run directly.
- Replace or expand the markdown discussion sections with your own interpretation for the final submission.


In [None]:
# Imports and plotting configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline
sns.set_theme(style="whitegrid", font_scale=1.05)  # set_theme is the current name; sns.set is an alias
plt.rcParams['figure.figsize'] = (8,5)


In [None]:
# Load dataset
DATA_PATH = "/mnt/data/CE93_07_AirQuality_WaterPollution.csv"
df = pd.read_csv(DATA_PATH)
print("Loaded file:", DATA_PATH)
print(f"Number of rows: {len(df)}")  # computed from the data rather than hard-coded
df.head()

In [None]:
# Quick overview: dtypes, missing values, numeric columns
display(df.dtypes)
print("\nMissing values per column:")
print(df.isna().sum())
print("\nNumeric columns detected:", df.select_dtypes(include=['number']).columns.tolist())

In [None]:
# Summary statistics for numeric variables
num_cols = df.select_dtypes(include=['number']).columns.tolist()
summary = df[num_cols].describe().T
# Add IQR and Coefficient of Variation
summary['IQR'] = df[num_cols].quantile(0.75) - df[num_cols].quantile(0.25)
summary['CV'] = summary['std'] / summary['mean']
summary = summary[['count','mean','50%','std','IQR','CV']].rename(columns={'50%':'median'})
pd.options.display.float_format = '{:.3f}'.format
display(summary)

In [None]:
# Example: create a new variable from an existing numeric column
# We'll create a normalized version of `Air Quality` (z-score) and a min-max scaled version.
df['Air Quality_zscore'] = (df['Air Quality'] - df['Air Quality'].mean()) / df['Air Quality'].std()
df['Air Quality_minmax'] = (df['Air Quality'] - df['Air Quality'].min()) / (df['Air Quality'].max() - df['Air Quality'].min())

print(df[[ 'Air Quality', 'Air Quality_zscore', 'Air Quality_minmax']].head())

# Numerical summaries for the new variable (z-score)
z_mean = df['Air Quality_zscore'].mean()
z_std = df['Air Quality_zscore'].std()
print(f"Mean of z-score: {z_mean:.3f} (should be ~0), Std: {z_std:.3f} (should be ~1)")

In [None]:
# Visualizations for `Air Quality` (three univariate plots)
# 1) Histogram (change bins and alpha)
plt.figure(figsize=(8,4))
plt.hist(df['Air Quality'].dropna(), bins=20, edgecolor='k', alpha=0.7)
plt.title('Histogram of Air Quality')
plt.xlabel('Air Quality')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 2) Boxplot (change orientation and showfliers)
plt.figure(figsize=(6,3))
sns.boxplot(x=df['Air Quality'], orient='h', showfliers=True)
plt.title('Boxplot of Air Quality')
plt.xlabel('Air Quality')
plt.show()

# 3) Kernel Density Estimate (KDE) with rug
plt.figure(figsize=(8,4))
sns.kdeplot(df['Air Quality'].dropna(), bw_adjust=1, fill=True)
sns.rugplot(df['Air Quality'].dropna(), height=0.05)
plt.title('KDE of Air Quality')
plt.xlabel('Air Quality')
plt.show()

In [None]:
# Scatter plot between `Air Quality` (x) and `Water Pollution` (y)
plt.figure(figsize=(7,5))
plt.scatter(df['Air Quality'], df['Water Pollution'], s=40, alpha=0.7, marker='o', edgecolor='k')
plt.title('Water Pollution vs Air Quality')
plt.xlabel('Air Quality')
plt.ylabel('Water Pollution')
plt.grid(True)
plt.show()

# Compute and print Pearson and Spearman correlation
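# Drop incomplete rows pairwise so both inputs stay aligned and equal in length
pairs = df[['Air Quality', 'Water Pollution']].dropna()
pearson_r, pearson_p = stats.pearsonr(pairs['Air Quality'], pairs['Water Pollution'])
spearman_r, spearman_p = stats.spearmanr(pairs['Air Quality'], pairs['Water Pollution'])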
print(f"Pearson r = {pearson_r:.3f}, p-value = {pearson_p:.3g}")
print(f"Spearman rho = {spearman_r:.3f}, p-value = {spearman_p:.3g}")

---
## 4. Dependence / Independence Analysis (instructions)
The code above computes Pearson and Spearman correlations between two numeric variables. In your report, discuss:

- The strength and direction (positive/negative) of the association.
- Whether the relationship appears linear or monotonic.
- Potential confounding variables or data collection issues that could affect the association.
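
To make the linear-versus-monotonic distinction concrete, here is a small sketch on synthetic data (not the project dataset). It assumes an exponential relationship, which is monotonic but not linear. The helper name `pearson_and_spearman` is hypothetical. Spearman's rho is computed here as the Pearson correlation of the ranks, so the sketch needs only NumPy and pandas.

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(x, y):
    """Return (Pearson r, Spearman rho) for two equal-length sequences.

    Rows with a missing value in either input are dropped pairwise.
    """
    pairs = pd.DataFrame({"x": x, "y": y}).dropna()
    pearson_r = pairs["x"].corr(pairs["y"])  # Pearson is the default method
    # Spearman's rho is the Pearson correlation of the (average) ranks.
    spearman_rho = pairs["x"].rank().corr(pairs["y"].rank())
    return pearson_r, spearman_rho


# Synthetic example (an assumption for illustration): y grows
# exponentially with x, so the relationship is monotonic but not linear.
x_demo = np.linspace(0.0, 5.0, 50)
y_demo = np.exp(x_demo)
r_demo, rho_demo = pearson_and_spearman(x_demo, y_demo)
print(f"Pearson r = {r_demo:.3f}, Spearman rho = {rho_demo:.3f}")
```

A Spearman value close to 1 (or -1) alongside a noticeably weaker Pearson value suggests a monotonic but non-linear relationship. If the two values are similar, a linear description is usually adequate.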

Add further statistical tests or visual diagnostics (e.g., a scatter plot with a regression line, or a residual plot) if desired.
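
One way to produce such diagnostics is sketched below. It assumes `df` has the `Air Quality` and `Water Pollution` columns listed at the top of this notebook. The helpers `fit_line` and `plot_fit_and_residuals` are hypothetical names, and an ordinary least-squares line via `np.polyfit` is only one reasonable choice of fit.

```python
import numpy as np
import pandas as pd


def fit_line(df, x_col="Air Quality", y_col="Water Pollution"):
    """Fit y = slope * x + intercept by least squares on complete rows."""
    pairs = df[[x_col, y_col]].dropna()  # keep rows where both values exist
    x = pairs[x_col].to_numpy(dtype=float)
    y = pairs[y_col].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    return {"x": x, "y": y, "slope": slope, "intercept": intercept,
            "fitted": fitted, "residuals": y - fitted}


def plot_fit_and_residuals(fit, x_label="Air Quality", y_label="Water Pollution"):
    """Draw the scatter with the fitted line, and residuals vs fitted values."""
    # Imported here so the numeric part above runs without a plotting backend.
    import matplotlib.pyplot as plt

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 4))
    order = np.argsort(fit["x"])  # sort so the line is drawn left to right
    ax_fit.scatter(fit["x"], fit["y"], s=20, alpha=0.5)
    ax_fit.plot(fit["x"][order], fit["fitted"][order], color="red")
    ax_fit.set_xlabel(x_label)
    ax_fit.set_ylabel(y_label)
    ax_fit.set_title("Scatter with least-squares line")

    ax_res.scatter(fit["fitted"], fit["residuals"], s=20, alpha=0.5)
    ax_res.axhline(0.0, color="k", linewidth=1)
    ax_res.set_xlabel("Fitted values")
    ax_res.set_ylabel("Residuals")
    ax_res.set_title("Residuals vs fitted")
    fig.tight_layout()
    return fig
```

In the notebook, this could be used as `fit = fit_line(df)` followed by `plot_fit_and_residuals(fit)`. Residuals scattered evenly around zero are consistent with a linear relationship. A curved or funnel-shaped pattern suggests that a straight line is a poor description.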


---
## 5. Presentation & Submission

**To prepare final submission:**
- Run `Kernel -> Restart & Run All` to ensure outputs are generated.
- Export the notebook to PDF: `File -> Save and Export Notebook As... -> Webpdf` (or use your notebook server export).
- Submit both the `.ipynb` file and the exported `.pdf` file according to the project instructions.

**Note:** Replace the placeholder analysis paragraphs with your own interpretation, and check that all outputs are visible in the exported PDF.
