# Cross-Country Comparison

**Objective:**  
Synthesize the cleaned datasets from Benin, Sierra Leone, and Togo to identify relative solar potential and key differences across countries.

---


In [None]:
# =====================================================
# Task 3: Cross-Country Comparison
# =====================================================

# 1. Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

sns.set_style("whitegrid")

### 1. Load Cleaned Datasets
Load the cleaned CSV files for each country and combine them into a single DataFrame. Add a `Country` column to differentiate between datasets.


In [None]:
# =====================================================
# 2. Load Cleaned Datasets
# =====================================================
benin = pd.read_csv("../data/benin_clean.csv", parse_dates=["Timestamp"])
sierra_leone = pd.read_csv("../data/sierraleone_clean.csv", parse_dates=["Timestamp"])
togo = pd.read_csv("../data/togo_clean.csv", parse_dates=["Timestamp"])

# Add country column
benin["Country"] = "Benin"
sierra_leone["Country"] = "Sierra Leone"
togo["Country"] = "Togo"

# Combine datasets
df_combined = pd.concat([benin, sierra_leone, togo], ignore_index=True)

# Quick check
df_combined.head()
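
Before comparing metrics, it can help to confirm that every country contributed rows and that the metric columns are populated. The sketch below is one possible check, not part of the original task. The helper name `check_combined` and the synthetic `demo` frame are illustrative assumptions; in the notebook the function would be called on `df_combined`.

```python
import pandas as pd

def check_combined(df, metrics, group_col="Country"):
    """Return per-country row counts and missing-value counts for `metrics`.

    Hypothetical helper for illustration only.
    """
    rows = df.groupby(group_col).size().rename("rows")
    # Count NaNs per metric within each country
    missing = df[metrics].isna().groupby(df[group_col]).sum().add_suffix("_missing")
    return pd.concat([rows, missing], axis=1)

# Tiny synthetic frame standing in for df_combined (values are made up)
demo = pd.DataFrame({
    "Country": ["Benin", "Benin", "Togo", "Sierra Leone"],
    "GHI": [100.0, None, 250.0, 80.0],
    "DNI": [50.0, 60.0, None, 20.0],
    "DHI": [30.0, 35.0, 40.0, 25.0],
})
report = check_combined(demo, ["GHI", "DNI", "DHI"])
print(report)
```

A country with zero rows or an unexpectedly high missing count usually points to a wrong file path or a cleaning step that was skipped.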

### 2. Metric Comparison
Compare key solar metrics (GHI, DNI, DHI) across countries:
- **Boxplots** to visualize distribution and spread per country.
- **Summary statistics** (mean, median, standard deviation) for each metric.

In [None]:
# =====================================================
# 3. Metric Comparison
# =====================================================
metrics = ["GHI", "DNI", "DHI"]

# Boxplots
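for metric in metrics:
    plt.figure(figsize=(8, 5))
    # Assign hue explicitly: recent seaborn versions deprecate `palette` without `hue`
    ax = sns.boxplot(x="Country", y=metric, hue="Country", data=df_combined,
                     palette="Set2", dodge=False)
    if ax.get_legend() is not None:  # a legend would only repeat the x-axis labels
        ax.get_legend().remove()
    plt.title(f"{metric} Distribution by Country")
    plt.ylabel(metric)
    plt.show()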

# Summary Table
summary_table = df_combined.groupby("Country")[metrics].agg(["mean", "median", "std"])
print("Summary Statistics by Country:")
print(summary_table)
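
The summary table above has two-level column labels such as `("GHI", "mean")`, which can be awkward to index or export. The following sketch shows one way to flatten them. It also adds a coefficient of variation (std divided by mean) as a rough measure of relative variability; that extra column is an addition for illustration and was not part of the original task. The `demo` data is synthetic.

```python
import pandas as pd

# Synthetic stand-in for df_combined (values are made up)
demo = pd.DataFrame({
    "Country": ["Benin", "Benin", "Togo", "Togo"],
    "GHI": [100.0, 300.0, 150.0, 250.0],
    "DNI": [40.0, 80.0, 50.0, 70.0],
})

summary = demo.groupby("Country")[["GHI", "DNI"]].agg(["mean", "median", "std"])

# Flatten the two-level column index: ("GHI", "mean") -> "GHI_mean"
summary.columns = [f"{metric}_{stat}" for metric, stat in summary.columns]

# Relative variability of GHI: standard deviation as a fraction of the mean
summary["GHI_cv"] = summary["GHI_std"] / summary["GHI_mean"]

print(summary.round(2))
```

With flat names, a single statistic can be pulled out directly, for example `summary["GHI_mean"].sort_values(ascending=False)`.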


### 3. Statistical Testing
- Run **one-way ANOVA** to test if GHI differences between countries are significant.
- Use **Kruskal-Wallis test** if data is not normally distributed.
- Note p-values and interpret results.

In [None]:
# =====================================================
# 4. Statistical Testing
# =====================================================

# Extract GHI values per country
ghi_benin = df_combined[df_combined["Country"]=="Benin"]["GHI"]
ghi_sl = df_combined[df_combined["Country"]=="Sierra Leone"]["GHI"]
ghi_togo = df_combined[df_combined["Country"]=="Togo"]["GHI"]

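# ANOVA (assumes roughly normal groups with similar variances)
f_val, p_val = stats.f_oneway(ghi_benin, ghi_sl, ghi_togo)
# Scientific notation keeps very small p-values from printing as 0.00000
print(f"\nANOVA F-statistic: {f_val:.3f}, p-value: {p_val:.3e}")

# Kruskal-Wallis test (rank-based; does not assume normality)
h_val, p_val_kw = stats.kruskal(ghi_benin, ghi_sl, ghi_togo)
print(f"Kruskal-Wallis H-statistic: {h_val:.3f}, p-value: {p_val_kw:.3e}")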
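
The choice between ANOVA and Kruskal-Wallis described in this section can be made explicit with a normality check. The sketch below is one possible approach, assuming a Shapiro-Wilk test on a random subsample of each group. The helper name `compare_groups` and its parameters (`alpha`, `max_n`, `seed`) are hypothetical, and the demo data is synthetic. With very large samples, Shapiro-Wilk rejects normality for even small deviations, so treat the result as a guide rather than a rule.

```python
import numpy as np
from scipy import stats

def compare_groups(groups, alpha=0.05, max_n=5000, seed=0):
    """Pick ANOVA or Kruskal-Wallis based on a Shapiro-Wilk normality check.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Drop NaNs so neither test returns nan
    cleaned = []
    for g in groups:
        values = np.asarray(g, dtype=float)
        cleaned.append(values[~np.isnan(values)])

    all_normal = True
    for values in cleaned:
        # Subsample large groups before running Shapiro-Wilk
        if len(values) > max_n:
            values = rng.choice(values, size=max_n, replace=False)
        _, p_norm = stats.shapiro(values)
        if p_norm < alpha:
            all_normal = False

    if all_normal:
        stat, p = stats.f_oneway(*cleaned)
        return {"test": "anova", "statistic": stat, "p_value": p}
    stat, p = stats.kruskal(*cleaned)
    return {"test": "kruskal", "statistic": stat, "p_value": p}

# Synthetic, strongly skewed groups (made-up data, not the real GHI values)
rng = np.random.default_rng(42)
demo_groups = [
    rng.exponential(scale=100, size=500),
    rng.exponential(scale=100, size=500),
    rng.exponential(scale=300, size=500),
]
result = compare_groups(demo_groups)
print(result["test"], f"{result['p_value']:.3e}")
```

In the notebook, the same function could be called as `compare_groups([ghi_benin, ghi_sl, ghi_togo])`.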

### 4. Key Observations
- **Benin** has the highest mean GHI (237.48) and DNI (167.14), suggesting the strongest overall solar potential. It also has the largest GHI spread (std = 327.17), so its irradiance swings most widely around that mean.
- **Togo** has a slightly lower mean GHI (225.03) than Benin with similar variability (std = 316.45). Its median DHI (1.5) is slightly higher than Benin's, suggesting more consistent diffuse irradiance.
- **Sierra Leone** has the lowest mean GHI (187.21) and DNI (104.21), along with slightly lower variability in DHI, indicating lower solar energy potential and more frequent low-irradiance periods.

### 5. Bonus: Visual Summary
- Create a **bar chart** ranking countries by average GHI to quickly identify highest solar potential.

In [None]:

# =====================================================
# 6. Bonus: Average GHI Ranking
# =====================================================
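avg_ghi = df_combined.groupby("Country")["GHI"].mean().sort_values(ascending=False)

plt.figure(figsize=(6, 4))
# Assign hue explicitly: recent seaborn versions deprecate `palette` without `hue`
ax = sns.barplot(x=avg_ghi.index, y=avg_ghi.values, hue=avg_ghi.index,
                 palette="viridis", dodge=False)
if ax.get_legend() is not None:  # a legend would only repeat the x-axis labels
    ax.get_legend().remove()
plt.ylabel("Average GHI")
plt.title("Average GHI by Country")
plt.show()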