## Solar Challenge: Togo

## Summary Statistics & Missing-Value Report

We begin the EDA by checking the summary statistics of all numeric columns and identifying columns with missing values. Columns with more than 5% null values are flagged.
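
As a minimal sketch of the flagging rule, the toy frame below (illustrative values, not taken from the Togo dataset) uses `isna().mean()` to get the share of missing values per column directly as a percentage. The names `toy`, `missing_pct`, and `flagged` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only, not from the Togo dataset)
toy = pd.DataFrame({
    "GHI": [100.0, np.nan, 300.0, 400.0],
    "Tamb": [25.0, 26.0, 27.0, 28.0],
    "Comments": [np.nan, np.nan, np.nan, np.nan],
})

# isna().mean() is the fraction of missing values in each column
missing_pct = toy.isna().mean() * 100

# Flag columns whose missing share exceeds 5%
flagged = missing_pct[missing_pct > 5].index.tolist()

print(missing_pct.round(1))
print("Flagged:", flagged)
```

Working in percentages makes the report easier to read than raw counts when the dataset has many rows.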


In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('../data/togo-dapaong_qc.csv')

# Display the first few rows of the dataset
df.head()

In [None]:
# Summary statistics
df.describe()

In [None]:
# Missing value report
missing_report = df.isna().sum()
print("Missing Values Per Column:\n", missing_report)

# Flag columns with >5% missing data
threshold = 0.05 * len(df)
high_missing_cols = missing_report[missing_report > threshold]
print("\nColumns with >5% missing values:\n", high_missing_cols)


## Outlier Detection & Basic Cleaning

We flag outliers in the irradiance (GHI, DNI, DHI), module sensor (ModA, ModB), and wind (WS, WSgust) readings wherever |Z| > 3, then impute missing values in these key columns with the column median.
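
The sketch below illustrates the two steps on toy data (illustrative values only). It computes Z-scores with plain pandas rather than `scipy.stats.zscore`; this is a swapped-in equivalent that skips missing values, with `ddof=0` chosen to match scipy's default. The names `toy`, `z`, and `outlier_mask` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy readings (illustrative): one missing value and one obvious spike
values = [10.0] * 20 + [np.nan, 500.0]
toy = pd.DataFrame({"ModA": values})

# pandas mean/std skip NaN by default; ddof=0 matches scipy's zscore default
z = (toy["ModA"] - toy["ModA"].mean()) / toy["ModA"].std(ddof=0)

# NaN > 3 evaluates to False, so missing rows are never flagged as outliers
outlier_mask = z.abs() > 3

# Median imputation for the missing reading
toy["ModA"] = toy["ModA"].fillna(toy["ModA"].median())

print("Outliers flagged:", int(outlier_mask.sum()))
print("Missing after imputation:", int(toy["ModA"].isna().sum()))
```

Note that the flagged rows are only counted here, not removed, so the spike still influences the median used for imputation far less than it would influence a mean.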


In [None]:
from scipy.stats import zscore
import numpy as np

key_columns = ['GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'WS', 'WSgust']
df_clean = df.copy()

# Compute Z-scores once for all key columns; nan_policy='omit' keeps any
# missing values from turning a whole column's Z-scores into NaN
z_scores = df_clean[key_columns].apply(zscore, nan_policy='omit')
outlier_flags = np.abs(z_scores) > 3

# Per-column outlier report for the module sensor and wind readings
for col in ['ModA', 'ModB', 'WS', 'WSgust']:
    n_outliers = outlier_flags[col].sum()
    print(f"{col}: {n_outliers} outliers ({n_outliers / len(df_clean) * 100:.2f}% of rows)")

# Rows with an outlier in at least one key column
outliers = outlier_flags.any(axis=1)
print(f"Total outliers flagged: {outliers.sum()}")

# Impute missing values with median
df_clean[key_columns] = df_clean[key_columns].fillna(df_clean[key_columns].median())

# Export cleaned data (same ../data directory the raw file was read from)
df_clean.to_csv("../data/togo_clean.csv", index=False)
print("✅ Cleaned data exported to '../data/togo_clean.csv'")

## Time Series Analysis

We plot GHI, DNI, DHI, and ambient temperature (Tamb) against time to look for monthly and daily patterns and for unusual spikes.
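
Raw line plots can be dense, so aggregating by day or by hour of day is one way to make such patterns visible. The sketch below uses a synthetic two-day hourly series (illustrative only, not the Togo data); `toy`, `daily_mean`, and `hourly_profile` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic two-day, hourly GHI series (illustrative only)
idx = pd.Timestamp("2022-01-01") + pd.to_timedelta(np.arange(48), unit="h")
hours = idx.hour.to_numpy()
ghi = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None) * 800
toy = pd.DataFrame({"GHI": ghi}, index=idx)

# Daily means show day-to-day (and, over longer spans, seasonal) variation
daily_mean = toy["GHI"].resample("D").mean()

# Averaging by hour of day gives the typical diurnal profile
hourly_profile = toy["GHI"].groupby(toy.index.hour).mean()

print(daily_mean)
print("Peak hour:", hourly_profile.idxmax())
```

On the real data, the same calls would need `Timestamp` set as the index of `df_clean` first.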


In [None]:
import matplotlib.pyplot as plt

# Convert the 'Timestamp' column to datetime
df_clean['Timestamp'] = pd.to_datetime(df_clean['Timestamp'])

# Check the data types to confirm the change
print(df_clean.dtypes)


In [None]:
plt.figure(figsize=(8, 6))
for col in ['GHI', 'DNI', 'DHI', 'Tamb']:
    plt.plot(df_clean['Timestamp'], df_clean[col], label=col)
plt.legend(loc='upper right')
plt.title("Time Series of Solar and Temperature Data")
plt.xlabel("Timestamp")
plt.ylabel("Value (W/m² for irradiance, °C for Tamb)")
plt.grid(True)
plt.tight_layout()
plt.show()


## Cleaning Impact on ModA and ModB

We compare the average ModA and ModB readings between rows where the Cleaning flag is 0 and rows where it is 1.
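
Group means alone can mislead if one group is much smaller than the other, so it may help to report the group sizes alongside them. A minimal sketch on toy data (illustrative values; `toy` and `summary` are hypothetical names):

```python
import pandas as pd

# Toy data (illustrative): three uncleaned rows, two cleaned rows
toy = pd.DataFrame({
    "Cleaning": [0, 0, 0, 1, 1],
    "ModA": [200.0, 220.0, 210.0, 300.0, 320.0],
    "ModB": [190.0, 210.0, 200.0, 290.0, 310.0],
})

# Mean and row count per Cleaning flag for each sensor
summary = toy.groupby("Cleaning")[["ModA", "ModB"]].agg(["mean", "count"])
print(summary)
```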


In [None]:
cleaning_impact = df.groupby('Cleaning')[['ModA', 'ModB']].mean()
cleaning_impact.plot(kind='bar', figsize=(8, 6))
plt.title('Average ModA and ModB by Cleaning Flag')
plt.xlabel('Cleaning (0 = No, 1 = Yes)')
plt.ylabel('Average Value (W/m²)')
plt.xticks(rotation=0)
plt.show()

## Correlation and Relationship Analysis

We visualize relationships between variables using a correlation heatmap and scatter plots.
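
A heatmap shows every pair at once; to read off the strongest relationships as numbers, the correlation matrix can also be flattened into a ranked list. The sketch below does this on toy data (illustrative values only; `toy`, `corr`, and `pairs` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "GHI": [0.0, 200.0, 400.0, 600.0, 800.0],
    "DNI": [0.0, 150.0, 350.0, 500.0, 700.0],
    "RH": [90.0, 70.0, 55.0, 40.0, 30.0],
})

# Pearson correlation matrix (the pandas default)
corr = toy.corr()

# Keep each pair once: upper triangle without the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```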


In [None]:
import seaborn as sns

# Heatmap
corr_cols = ['GHI', 'DNI', 'DHI', 'TModA', 'TModB']
sns.heatmap(df_clean[corr_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()


In [None]:
# Scatter plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.scatterplot(x='WS', y='GHI', data=df_clean, ax=axes[0])
sns.scatterplot(x='WSgust', y='GHI', data=df_clean, ax=axes[1])
sns.scatterplot(x='RH', y='Tamb', data=df_clean, ax=axes[2])
axes[0].set_title("WS vs GHI")
axes[1].set_title("WSgust vs GHI")
axes[2].set_title("RH vs Tamb")
plt.tight_layout()
plt.show()

## Wind and Distribution Analysis

We visualize wind direction and speed using a wind rose and inspect the distribution of GHI and WS using histograms.
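
A tabular counterpart to the wind rose is to bin directions into compass sectors and summarize speed per sector. The sketch below assumes WD is in degrees measured clockwise from north (an assumption, not confirmed by the dataset description) and uses toy values; `toy`, `labels`, and `by_sector` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy wind readings (illustrative): WD in degrees, WS in m/s
toy = pd.DataFrame({
    "WD": [0.0, 10.0, 350.0, 90.0, 180.0, 200.0],
    "WS": [2.0, 3.0, 4.0, 1.0, 5.0, 3.0],
})

labels = np.array(["N", "NE", "E", "SE", "S", "SW", "W", "NW"])

# Shift by half a sector (22.5°) so that "N" covers 337.5° to 22.5°
sector_idx = (((toy["WD"] + 22.5) % 360) // 45).astype(int).to_numpy()
toy["sector"] = labels[sector_idx]

# Observation count and mean speed per sector
by_sector = toy.groupby("sector")["WS"].agg(["count", "mean"])
print(by_sector)
```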


In [None]:
# Wind rose
from windrose import WindroseAxes
ax = WindroseAxes.from_ax()
ax.bar(df['WD'], df['WS'], normed=True, opening=0.8, edgecolor='white')
ax.set_legend()
plt.title('Wind Rose (Togo)')
plt.show()


In [None]:
# Histograms
df_clean[['GHI', 'WS']].hist(bins=30, figsize=(10, 5))
plt.suptitle("Distribution of GHI and WS")
plt.tight_layout()
plt.show()


## Temperature and Relative Humidity Analysis

We examine how relative humidity (RH) relates to solar radiation (GHI) and to ambient temperature (Tamb).
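
Scatter plots of many points can hide the trend, so one complementary view is to bin RH into bands and average the other variables within each band. A minimal sketch on toy data (illustrative values; the band edges 40 and 70 are arbitrary assumptions, and `toy`, `bands`, and `band_means` are hypothetical names):

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "RH": [20.0, 30.0, 55.0, 60.0, 85.0, 90.0],
    "Tamb": [35.0, 33.0, 29.0, 28.0, 24.0, 23.0],
    "GHI": [800.0, 750.0, 500.0, 450.0, 150.0, 100.0],
})

# Right-inclusive bands: (0, 40], (40, 70], (70, 100]
bands = pd.cut(toy["RH"], bins=[0, 40, 70, 100], labels=["low", "mid", "high"])

# Mean temperature and irradiance within each humidity band
band_means = toy.groupby(bands, observed=True)[["Tamb", "GHI"]].mean()
print(band_means)
```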


In [None]:
sns.scatterplot(x='RH', y='GHI', data=df_clean)
plt.title("Relative Humidity vs GHI")
plt.tight_layout()
plt.show()

sns.scatterplot(x='RH', y='Tamb', data=df_clean)
plt.title("Relative Humidity vs Temperature")
plt.tight_layout()
plt.show()



## Bubble Chart: GHI vs. Tamb with Bubble Size as RH

We create a bubble chart to explore how RH modulates the relationship between GHI and ambient temperature, with bubble size proportional to RH.
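
Matplotlib's `s` argument is a marker area in points squared, so multiplying RH by a constant can produce very large, overlapping bubbles. One alternative is to rescale RH linearly into a fixed area range, sketched below on toy values. The bounds `min_area` and `max_area` are hypothetical choices, not values from the original analysis.

```python
import numpy as np
import pandas as pd

# Toy RH values (illustrative), including one missing reading
rh = pd.Series([10.0, 50.0, 100.0, np.nan])

# Hypothetical marker-area bounds, in points squared
min_area, max_area = 10.0, 300.0

# Linear min-max rescaling of RH into [min_area, max_area]
span = rh.max() - rh.min()
sizes = min_area + (rh - rh.min()) / span * (max_area - min_area)

# Missing RH yields a NaN size; such rows should be dropped before plotting
valid = sizes.notna()
print(sizes[valid].round(1).tolist())
```

The resulting `sizes[valid]` could then be passed as `s=` to `plt.scatter` together with the matching rows of Tamb and GHI.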


In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['Tamb'], df['GHI'], s=df['RH']*10, alpha=0.5)
plt.title('GHI vs. Tamb (Bubble Size = RH)')
plt.xlabel('Tamb (°C)')
plt.ylabel('GHI (W/m²)')
plt.show()
