### Exploratory Data Analysis (EDA)

In this notebook, we conduct exploratory data analysis (EDA) on the final merged dataset produced during the data preparation stage. The dataset integrates in-situ water quality measurements, satellite-derived spectral features, climate variables, population density, and spatial indicators such as river mouth and junction proximity. At this stage, every variable, including Total Alkalinity (TA), Electrical Conductance (EC), and Dissolved Reactive Phosphorus (DRP), is treated the same way for exploratory purposes; none is singled out as a response variable. The goal of this EDA is to understand the structure of the data, examine variable distributions, explore relationships between variables, and identify data quality issues or notable patterns.
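
As a rough illustration of those four goals, the sketch below runs a first-pass overview on a tiny stand-in frame. The four rows are copied from the first rows of the dataset preview shown further down; the helper name `eda_overview` is hypothetical and is not part of the project's `utils` package.

```python
import pandas as pd

# Stand-in rows: values taken from the first four rows of the merged dataset preview.
sample = pd.DataFrame(
    {
        "total alkalinity": [128.912, 74.72, 89.254, 82.0],
        "electrical conductance": [555.0, 162.9, 573.0, 203.6],
        "dissolved reactive phosphorus": [10.0, 163.0, 80.0, 101.0],
        "ndmi": [0.185538, 0.124566, -0.083293, 0.048048],
    }
)


def eda_overview(df: pd.DataFrame) -> dict:
    """Hypothetical helper: gather structure, quality, distribution, and relationship summaries."""
    numeric = df.select_dtypes("number")
    return {
        "shape": df.shape,                            # structure
        "dtypes": df.dtypes.astype(str).to_dict(),    # structure
        "missing_share": df.isna().mean(),            # data quality
        "describe": numeric.describe().T,             # distributions
        "spearman": numeric.corr(method="spearman"),  # monotonic relationships
    }


overview = eda_overview(sample)
print(overview["shape"])
print(overview["missing_share"])
```

Spearman correlation is used here only as one reasonable choice for skewed measurements; Pearson or pairwise scatter plots would serve the same exploratory purpose.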

In [None]:
# Download the South Africa shapefile from https://simplemaps.com/gis/country/za

In [None]:
%pip install folium

In [2]:
import pandas as pd
import geopandas as gpd
from utils import primary_dataset
import folium

## Pulling the merged dataset

In [3]:
water_quality_df = pd.read_csv('data/wq.csv')
water_quality_df.head(10)

Unnamed: 0.1,Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,month,pop_density_nn,distance_km_to_pd_cell,river_mouth
0,0,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31,0.57699,0.325999,0
1,1,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31,5.049022,0.251093,0
2,2,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31,23.239988,0.418343,0
3,3,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31,687.465759,0.069949,0
4,4,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31,6.092811,0.23173,0
5,5,South Africa,-27.010111,26.698083,2011-01-04,12433.5,10433.5,9579.5,8531.5,0.129651,0.042672,98.6,82.2,289.8,192.0,2011-01-31,77.849716,0.465583,0
6,6,South Africa,-25.127778,27.628889,2011-01-04,7814.0,5189.5,6664.0,5222.0,0.079431,-0.124394,105.5,125.0,438.0,163.0,2011-01-31,3.473888,0.222919,0
7,7,South Africa,-25.20639,27.558,2011-01-04,14137.5,10466.5,10315.5,8536.0,0.1563,0.007266,103.700005,116.62,568.0,69.0,2011-01-31,6.692756,0.383093,0
8,8,South Africa,-24.69514,27.40906,2011-01-04,15543.0,10647.0,11919.5,9642.5,0.131944,-0.056389,111.8,181.831,583.0,158.0,2011-01-31,2.95071,0.252874,0
9,9,South Africa,-26.984722,26.632278,2011-01-04,13683.0,10207.0,14011.5,11850.5,-0.011862,-0.157091,98.200005,196.0,452.0,158.0,2011-01-31,7.390572,0.529222,0
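
The preview above contains `Unnamed: 0.1` and `Unnamed: 0` columns, which usually appear when a DataFrame is written to CSV with its index and then read back, and the `sample date` and `month` columns are read in as plain strings. A minimal cleanup sketch follows, using two rows from the preview trimmed to a few columns; the helper name `drop_index_residue` is hypothetical.

```python
import io

import pandas as pd

# Two rows from the preview above, trimmed to a handful of columns for brevity.
csv_text = """Unnamed: 0.1,Unnamed: 0,country,latitude,longitude,sample date,total alkalinity,month
0,0,Namibia,-28.760833,17.730278,2011-01-02,128.912,2011-01-31
1,1,South Africa,-26.861111,28.884722,2011-01-03,74.72,2011-01-31
"""

raw = pd.read_csv(io.StringIO(csv_text))


def drop_index_residue(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop leftover index columns and parse the date columns."""
    leftover = [c for c in df.columns if c.startswith("Unnamed:")]
    out = df.drop(columns=leftover)  # returns a new frame, the input is untouched
    for col in ("sample date", "month"):
        out[col] = pd.to_datetime(out[col], format="%Y-%m-%d")
    return out


clean = drop_index_residue(raw)
print(clean.dtypes)
```

An alternative, if the file can be regenerated, is to write it with `index=False` so the extra columns never appear.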


In [None]:
# Note: primary_dataset only uses the competition dataset (Landsat and TerraClimate)
df = primary_dataset.primary_dataset()

gdf_points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326"
)

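# Province boundaries from the simplemaps shapefile directory (layer "za")
path_to_data = 'data/za_shp'
provinces = gpd.read_file(path_to_data, layer='za')
# Positional rename: assumes the layer has exactly these four columns, in this order
provinces.columns = ['id', 'province', 'source', 'geometry']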

In [None]:
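# Reproject the polygons to the points' CRS before joining; sjoin expects both sides to share a CRS
gdf_with_province = gpd.sjoin(
    gdf_points,
    provinces[["province", "geometry"]].to_crs(gdf_points.crs),
    how="left",
    predicate="within"
)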
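
Because the join uses `how="left"`, points that fall outside every province polygon are kept with a missing `province`, and a point that matches more than one polygon appears once per match under the same index label. The sketch below shows one way to check for both cases. It runs on a small stand-in frame rather than the real join output; the province labels are placeholders and `join_report` is a hypothetical helper.

```python
import pandas as pd

# Stand-in for the output of a left spatial join; "province" labels are placeholders.
joined = pd.DataFrame(
    {"province": ["A", "A", None, "B", "C"]},
    index=[0, 1, 2, 3, 3],  # label 3 repeats: that point matched two polygons
)


def join_report(df: pd.DataFrame, key: str = "province") -> dict:
    """Hypothetical helper: summarise unmatched and multiply-matched points."""
    return {
        "unmatched": int(df[key].isna().sum()),
        "duplicated_points": int(df.index.duplicated().sum()),
        "counts": df[key].value_counts(dropna=False),
    }


report = join_report(joined)
print(report["unmatched"], report["duplicated_points"])
print(report["counts"])
```

On the real data the same checks would be applied to `gdf_with_province`. Unmatched rows are plausible here, since the preview includes a sample labelled Namibia while the shapefile covers South Africa.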

In [None]:
gdf_with_province.head()

In [None]:
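# Centre the map on the first sample point (latitude and longitude taken from the same row)
zoom_start = 8
first_point = gdf_with_province.iloc[0]
m = folium.Map(location=[first_point['latitude'], first_point['longitude']], tiles="OpenStreetMap", zoom_start=zoom_start)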

# Add one marker per sample location
for lat, lon in zip(gdf_with_province['latitude'], gdf_with_province['longitude']):
    folium.Marker(location=[lat, lon]).add_to(m)

m
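
The map above is centred on a single sample with a fixed zoom level, which may leave part of the sampling network off screen. One alternative is to derive the centre and bounding box from all coordinates. The sketch below computes both with pandas on three coordinate pairs taken from the dataset preview; `map_extent` is a hypothetical helper.

```python
import pandas as pd

# Three coordinate pairs taken from the dataset preview.
coords = pd.DataFrame(
    {
        "latitude": [-28.760833, -26.861111, -26.45],
        "longitude": [17.730278, 28.884722, 28.085833],
    }
)


def map_extent(df: pd.DataFrame):
    """Hypothetical helper: return ([lat, lon] centre, [[south, west], [north, east]] bounds)."""
    valid = df.dropna(subset=["latitude", "longitude"])
    center = [float(valid["latitude"].mean()), float(valid["longitude"].mean())]
    bounds = [
        [float(valid["latitude"].min()), float(valid["longitude"].min())],
        [float(valid["latitude"].max()), float(valid["longitude"].max())],
    ]
    return center, bounds


center, bounds = map_extent(coords)
print(center)
print(bounds)
```

With folium, `center` can be passed as `location` to `folium.Map`, and `m.fit_bounds(bounds)` adjusts the view so that all points are visible.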

In [None]:
# !jupyter trust simplemaps.ipynb