# CE93 Project 1 — Exploratory Data Analysis (Final Report)
**Course:** CE93 Engineering Data Analysis  
**Dataset used:** `CE93_07_AirQuality_WaterPollution.csv` (provided — CE93_07)  
**Group members:** [Member 1, Member 2, Member 3]

---

### Quick dataset summary
- Number of measurements (rows): **3538**  
- Columns: ['City', 'Air Quality', 'Water Pollution']

This notebook is structured to satisfy **every bullet point** in the project rubric. Run all cells (`Kernel -> Restart & Run All`) before exporting to PDF for submission.


In [None]:
# Imports and plotting configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

%matplotlib inline
sns.set(style="whitegrid", font_scale=1.05)
plt.rcParams['figure.figsize'] = (8,5)
pd.options.display.float_format = '{:.3f}'.format


In [None]:
# 1. Introduction: Load data, show a few rows, and output number of measurements
DATA_PATH = "/mnt/data/CE93_07_AirQuality_WaterPollution.csv"
df = pd.read_csv(DATA_PATH)
print("Dataset path:", DATA_PATH)
print("Number of rows (measurements):", len(df))
print("\nA preview of the data:")
df.head(8)

**Dataset origin:** This is one of the CE93-provided datasets (CE93_07). We did not collect these data ourselves.


---
## 2. Summary Statistics

For each of the two numeric variables (`Air Quality` and `Water Pollution`), we compute:
- Two measures of central tendency (mean and median)
- Three measures of variability (standard deviation, IQR, coefficient of variation)

All numeric outputs are rounded to three decimal places as required.


In [None]:

# Compute measures for the numeric variables
num_cols = ['Air Quality', 'Water Pollution']
results = {}
for v in num_cols:
    series = df[v].dropna()
    mean = series.mean()
    median = series.median()
    std = series.std(ddof=1)
    iqr = series.quantile(0.75) - series.quantile(0.25)
    cv = std / mean if mean != 0 else np.nan
    results[v] = dict(mean=mean, median=median, std=std, iqr=iqr, cv=cv, count=len(series))

# Print results with required formatting
for v, r in results.items():
    print("Variable:", v)
    print(f"  Count = {r['count']}")
    print(f"  Mean = {r['mean']:.3f}")
    print(f"  Median = {r['median']:.3f}")
    print(f"  Standard deviation = {r['std']:.3f}")
    print(f"  IQR = {r['iqr']:.3f}")
    print(f"  Coefficient of Variation = {r['cv']:.3f}\n")


### Discussion of numerical summaries (placeholder)
**Interpretation guidance (replace with your text):**
- Describe skewness, presence of outliers, relative variability between variables.
- For each variable, state which measure of central tendency is most appropriate (mean vs median) and why.
- For variability, comment on whether standard deviation, IQR, or CV is most informative given units and distribution.
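One way to support the skewness and outlier discussion with numbers is sketched below. The helper name `skew_and_outliers` and the toy data are illustrative assumptions, not part of the project template; the 1.5 × IQR fence is the same rule a boxplot uses by default, and other outlier definitions are equally defensible.

```python
import pandas as pd
from scipy import stats


def skew_and_outliers(series: pd.Series) -> dict:
    """Return sample skewness and the number of points outside the 1.5*IQR fences."""
    s = series.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = int(((s < lower) | (s > upper)).sum())
    return {
        "skewness": float(stats.skew(s, bias=False)),  # bias-corrected sample skewness
        "n_outliers": n_out,
        "lower_fence": float(lower),
        "upper_fence": float(upper),
    }


# Toy data (NOT from the CE93 dataset): right-skewed with one extreme value.
demo = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40], dtype=float)
demo_summary = skew_and_outliers(demo)
for key, value in demo_summary.items():
    print(f"{key} = {value:.3f}")
```

In the notebook, the same helper could be called as `skew_and_outliers(df['Air Quality'])`. A clearly positive skewness together with several points above the upper fence would favor the median and IQR over the mean and standard deviation.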


### Creating a new variable
We define a new index **Air-Water Index (AWI)** to combine Air Quality (AQ) and Water Pollution (WP).

Equation (weighted average): 
$$ AWI = 0.6 \times AQ + 0.4 \times WP $$

Equation (ratio): 
$$ AWI_{ratio} = \frac{AQ}{WP + 1} $$

We compute both versions (weighted and ratio) below; the summary statistics are reported for the weighted version.


In [None]:

# Create new variables according to the equations above
df['AWI_weighted'] = 0.6 * df['Air Quality'] + 0.4 * df['Water Pollution']
df['AWI_ratio'] = df['Air Quality'] / (df['Water Pollution'] + 1)  # +1 to avoid division by zero

# Compute central tendency and variability for AWI_weighted (as required)
awi = df['AWI_weighted'].dropna()
awi_mean = awi.mean()
awi_median = awi.median()
awi_std = awi.std(ddof=1)
awi_iqr = awi.quantile(0.75) - awi.quantile(0.25)
awi_cv = awi_std / awi_mean if awi_mean != 0 else np.nan

print(f"AWI_weighted (count = {len(awi)})") 
print(f"  Mean = {awi_mean:.3f}")
print(f"  Median = {awi_median:.3f}")
print(f"  Std = {awi_std:.3f}")
print(f"  IQR = {awi_iqr:.3f}")
print(f"  CV = {awi_cv:.3f}")


**Can the numerical summaries for AWI be obtained by converting the summaries of AQ and WP?**

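**Only partly.** The **mean** of AWI_weighted *can* be obtained from the means of AQ and WP, because expectation is linear: $E[aX+bY]=aE[X]+bE[Y]$. The standard deviation **cannot** be obtained from the two standard deviations alone, since $\mathrm{Var}(aX+bY)=a^2\mathrm{Var}(X)+b^2\mathrm{Var}(Y)+2ab\,\mathrm{Cov}(X,Y)$ also requires the covariance, and the CV inherits the same limitation. The median and IQR depend on the joint distribution of AQ and WP, so they cannot be derived from the separate summaries either. For AWI_ratio, not even the mean can be derived this way, because the ratio is a nonlinear function of AQ and WP. Explain this in your report.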
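The difference between the mean and the variability measures can be checked numerically. The sketch below uses synthetic, correlated stand-ins for AQ and WP (assumed values, not the CE93 data) with the same weights 0.6 and 0.4, and compares the directly computed statistics of the combination against values rebuilt from the separate summaries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, positively correlated stand-ins for AQ and WP (NOT the CE93 data).
aq = rng.normal(60.0, 15.0, size=5000)
wp = 0.5 * aq + rng.normal(20.0, 10.0, size=5000)

a, b = 0.6, 0.4
awi = a * aq + b * wp

# Mean: recoverable from the two means alone.
mean_from_parts = a * aq.mean() + b * wp.mean()

# Standard deviation: the covariance term is needed as well.
cov = np.cov(aq, wp, ddof=1)[0, 1]
var_no_cov = a**2 * aq.var(ddof=1) + b**2 * wp.var(ddof=1)
std_ignoring_cov = np.sqrt(var_no_cov)
std_with_cov = np.sqrt(var_no_cov + 2 * a * b * cov)

print(f"Mean: direct = {awi.mean():.3f}, from parts = {mean_from_parts:.3f}")
print(f"Std:  direct = {awi.std(ddof=1):.3f}")
print(f"      with covariance = {std_with_cov:.3f}")
print(f"      ignoring covariance = {std_ignoring_cov:.3f}")
```

With positively correlated inputs, the value that ignores the covariance underestimates the true standard deviation, while the version that includes it matches the direct computation.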


---
## 3. Visualizations

We will create **three different univariate plots** for the `Air Quality` variable (histogram, boxplot, KDE) and ensure each plot:
- Is univariate (single variable)
- Has title, axis labels, and units where applicable
- Changes at least two default plotting parameters per plot

Include a supporting paragraph for each plot describing the data characteristics.


In [None]:

# Univariate plots for Air Quality (column name: 'Air Quality')
col = 'Air Quality'
series = df[col].dropna()

# 1) Histogram: non-default bins, alpha, and edgecolor (density=False keeps raw counts)
plt.figure(figsize=(9,4))
plt.hist(series, bins=25, edgecolor='black', alpha=0.75, density=False)
plt.title('Histogram of Air Quality')
plt.xlabel('Air Quality (index units)')
plt.ylabel('Frequency (count)')
plt.grid(axis='y', linestyle='--', alpha=0.6)
plt.show()

# Supporting stats printed for reference
print('Count =', len(series))
print('Mean =', round(series.mean(),3))
print('Median =', round(series.median(),3))
print('Std =', round(series.std(ddof=1),3))

# 2) Boxplot: non-default notch and figure size; orientation and showfliers are set explicitly
plt.figure(figsize=(8,2.5))
sns.boxplot(x=series, orient='h', showfliers=True, notch=True)
plt.title('Boxplot of Air Quality (notched)')
plt.xlabel('Air Quality (index units)')
plt.show()

# 3) KDE: adjust bandwidth and fill, add rug
plt.figure(figsize=(9,4))
sns.kdeplot(series, bw_adjust=0.8, fill=True, common_norm=False)
sns.rugplot(series, height=0.05)
plt.title('Kernel Density Estimate of Air Quality (bw_adjust=0.8)')
plt.xlabel('Air Quality (index units)')
plt.ylabel('Density')
plt.show()


**Plot interpretations (fill in with your observations):**

- *Histogram:* Describe modality, skewness, gaps, and common value ranges.
- *Boxplot:* Note median location, IQR, and outliers.
- *KDE:* Discuss smoothness, peaks, and multimodality if present.
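When commenting on modality, it helps to remember that the apparent shape of a histogram depends on the bin count. The sketch below compares two automatic binning rules available through `np.histogram_bin_edges` on a synthetic bimodal sample (assumed data, not the CE93 measurements); the fixed `bins=25` used in the histogram cell is a third, equally valid choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic bimodal sample (NOT the CE93 data).
sample = np.concatenate([rng.normal(30, 5, 500), rng.normal(70, 5, 500)])

bin_counts = {}
for rule in ("sturges", "fd"):  # Sturges and Freedman-Diaconis rules
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule}: {bin_counts[rule]} bins, width = {edges[1] - edges[0]:.3f}")
```

If a feature such as a second peak appears under one bin count but not another, it is worth saying so in the interpretation rather than reporting it as a firm property of the data.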


### Scatter plot and relationship between Air Quality and Water Pollution
Create a scatter plot and compute two measures of dependence: **Pearson correlation** and **Spearman rank correlation**.
Discuss what the values mean and potential factors affecting the relationship.


In [None]:

# Scatter plot between Air Quality and Water Pollution
x = 'Air Quality'
y = 'Water Pollution'

plt.figure(figsize=(7,5))
plt.scatter(df[x], df[y], s=50, alpha=0.6, marker='D', edgecolor='black')
plt.title('Scatter: Water Pollution vs Air Quality')
plt.xlabel('Air Quality (index units)')
plt.ylabel('Water Pollution (index units)')
plt.grid(True, linestyle=':', alpha=0.6)
plt.show()

# Compute Pearson and Spearman correlations (drop NA pairwise)
pair_df = df[[x,y]].dropna()
pearson_r, pearson_p = stats.pearsonr(pair_df[x], pair_df[y])
spearman_r, spearman_p = stats.spearmanr(pair_df[x], pair_df[y])

print(f"Pearson correlation: r = {pearson_r:.3f}, p-value = {pearson_p:.3g}")
print(f"Spearman correlation: rho = {spearman_r:.3f}, p-value = {spearman_p:.3g}")


**Discussion (fill in):**
- Interpret strength and direction of correlation values.
- Do the correlations make sense given the scatter plot?
- Suggest real-world reasons why Air Quality and Water Pollution could be associated or independent.
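When interpreting the two coefficients, keep in mind that Pearson measures linear association while Spearman measures monotonic association. A minimal sketch on toy data (not the CE93 data) shows how they can disagree for a relationship that is perfectly monotonic but strongly curved:

```python
import numpy as np
from scipy import stats

# Toy data (NOT the CE93 data): y increases monotonically but nonlinearly with x.
x_demo = np.arange(1, 21, dtype=float)
y_demo = np.exp(0.4 * x_demo)

pearson_demo, _ = stats.pearsonr(x_demo, y_demo)
spearman_demo, _ = stats.spearmanr(x_demo, y_demo)

print(f"Pearson r    = {pearson_demo:.3f}")
print(f"Spearman rho = {spearman_demo:.3f}")
```

A Spearman value noticeably larger in magnitude than the Pearson value in the project data would point to a monotonic but nonlinear relationship, or to outliers pulling the Pearson coefficient.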


---
## 4. Dependence / Independence

We compute two measures of dependence (Pearson and Spearman as above). Additionally, present a short discussion on possible confounders and data collection issues that could influence the observed relationship.

Replace the placeholder text below with your analysis.


**Potential factors influencing dependence/independence (examples you can adapt):**

- Geographic clustering: cities in the same region may have similar pollution profiles.
- Industrial activity: regions with heavy industry are likely to have higher air and water pollution.
- Measurement methods and time differences: if measurements were taken in different years or with different protocols, the apparent association may be biased.
- Socioeconomic factors: wealthier cities may have lower pollution levels.
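The effect of a shared driver can be illustrated with a small simulation. Everything below is hypothetical: the latent `industry` variable, its effect sizes, and the noise levels are assumptions chosen for illustration, and the CE93 dataset contains no such column. The sketch shows two variables that are correlated only because both depend on the same factor; once that factor's linear effect is removed, the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical latent driver (e.g. industrial activity); purely simulated.
industry = rng.normal(0.0, 1.0, size=n)
air = industry + rng.normal(0.0, 1.0, size=n)
water = industry + rng.normal(0.0, 1.0, size=n)

raw_r = np.corrcoef(air, water)[0, 1]


def residual(v, z):
    """Remove the least-squares linear effect of z from v."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)


# Correlation of the residuals, i.e. the partial correlation given the driver.
partial_r = np.corrcoef(residual(air, industry), residual(water, industry))[0, 1]

print(f"Raw correlation               = {raw_r:.3f}")
print(f"Correlation controlling for z = {partial_r:.3f}")
```

Because the project data include only the city name and the two indices, this kind of adjustment cannot be carried out directly; the sketch is meant only to clarify why an observed correlation does not by itself imply a direct link.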


---
## 5. Presentation Checklist

- Ensure all code cells have preceding Markdown explaining purpose.
- All required outputs are printed and rounded to three decimals.
- All plots have titles, axis labels, units, and changed plotting parameters.
- Export the notebook to PDF (WebPDF) and confirm visuals are legible in the PDF.


---
## 6. Group Work Assessment Survey

Each team member must complete the peer-evaluation survey provided by the instructor. Ensure every member submits it separately.
