# SIADS 593 WN26 Milestone I: Sustainable Water Quality Notebook

**Authors:**

Sungmin Kim  
Randy Best  
Kyle Rodriguez

#### **Introduction**

Welcome to our exploratory data analysis of water quality at river locations across South Africa! Water quality is a key factor in environmental sustainability, so we use data to identify and highlight the factors that most strongly influence it. As a team, we take this opportunity to apply data exploration techniques to real-world environmental data and to contribute to better water quality monitoring.

**Primary Dataset**

The primary dataset contains three target water quality parameters, **Total Alkalinity**, **Electrical Conductance**, and **Dissolved Reactive Phosphorus**, collected between 2011 and 2015 from approximately 200 river locations across South Africa. Our goal is to find out which environmental conditions most affect these three metrics. Each data point includes the geographic coordinates (latitude and longitude) of the sampling site, the date of collection, and the corresponding water quality measurements. The features also include four Landsat spectral bands, **SWIR22** (Shortwave Infrared 2), **NIR** (Near Infrared), **Green**, and **SWIR16** (Shortwave Infrared 1), along with two derived spectral indices, **NDMI** (Normalized Difference Moisture Index) and **MNDWI** (Modified Normalized Difference Water Index). In addition, the **PET** (Potential Evapotranspiration) variable was incorporated from the **TerraClimate** dataset to account for climatic influences on water quality.

- **SWIR22** – Sensitive to surface moisture and turbidity variations in water bodies.  
- **NIR** – Helps in identifying vegetation and suspended matter in water.  
- **Green** – Useful for detecting water color and surface reflectance changes.  
- **SWIR16** – Provides information on surface dryness and sediment concentration.  
- **NDMI** – Derived from NIR and SWIR16, indicates moisture and vegetation–water interaction.  
- **MNDWI** – Derived from Green and SWIR16, effective for distinguishing open water areas and reducing built-up noise.  
- **PET** – Extracted from the TerraClimate dataset, represents potential evapotranspiration influencing hydrological and water quality dynamics.
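
As a quick sanity check on the two derived indices, here is a minimal sketch that recomputes them as normalized differences. The band values are copied from the first two rows of the primary dataset displayed in this notebook; the helper name `normalized_difference` is ours, not part of any library. With these inputs, pairing NIR with SWIR16 reproduces the stored `ndmi` values, and pairing Green with SWIR16 reproduces the stored `mndwi` values.

```python
import pandas as pd


def normalized_difference(a, b):
    """Return (a - b) / (a + b), the general form of both indices."""
    return (a - b) / (a + b)


# Band values copied from the first two rows of the primary dataset
bands = pd.DataFrame(
    {
        "nir": [11190.0, 17658.5],
        "green": [11426.0, 9550.0],
        "swir16": [7687.5, 13746.5],
    }
)

# NDMI pairs NIR with SWIR16; MNDWI pairs Green with SWIR16
bands["ndmi"] = normalized_difference(bands["nir"], bands["swir16"])
bands["mndwi"] = normalized_difference(bands["green"], bands["swir16"])

print(bands.round(6))
```

Both indices are bounded between -1 and 1 whenever the band values are non-negative, which makes them easier to compare across sites than the raw reflectance values.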

The dataset spans a five-year period from 2011 to 2015. Using API-based data extraction methods, both Landsat and TerraClimate features were retrieved directly from the <a href="https://planetarycomputer.microsoft.com/">Microsoft Planetary Computer</a> portal. These combined spectral, index-based, and climatic features were used as predictors in a regression model to estimate three key water quality parameters: Total Alkalinity (TA), Electrical Conductance (EC), and Dissolved Reactive Phosphorus (DRP).

**Secondary Datasets**

Our secondary datasets comprise gridded population estimates from <a href="https://hub.worldpop.org/project/categories?id=18">WorldPop :: Population Density</a> and a <a href="https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa">List of rivers of South Africa</a> from Wikipedia. The population density data includes geographical coordinates and the number of people per square kilometre. It is based on country totals adjusted to match the corresponding official United Nations population estimates prepared by the Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat (<a href="https://population.un.org/wpp/">2019 Revision of World Population Prospects</a>). The list of rivers includes each river's name, province and location, source location, and the place name and geographical coordinates of its mouth or junction.

- Gridded population estimates are particularly useful because they give decision-makers and data users the flexibility to aggregate population estimates into different spatial units, whether existing enumeration areas or custom areas. Each grid cell holds an estimated population density, and the projection is the Geographic Coordinate System, WGS84.
- Each water quality sample in our primary data is identified only by geographical coordinates, so we will use the list of rivers to link each sample to its nominal water source.
 

#### **About the Notebook**

In this notebook, we **load previously extracted data** from CSV files generated in a separate extraction notebook. This approach ensures a smoother and faster workflow, allowing us to focus on data analysis without waiting for time-consuming data retrieval.

### **Load In Dependencies**

To run this demonstration notebook, you will need the packages below installed. Installation may take some time.  

```%pip install numpy pandas geopandas tqdm requests lxml matplotlib plotly seaborn nbformat openpyxl watermark```  
```%pip install plotly[express,jupyter]```  
```%pip install scipy scikit-learn```

In [2]:
# Data manipulation and analysis
import numpy as np
import pandas as pd
import geopandas as gpd

# useful tools
from datetime import date
import os
import requests
import openpyxl

# Modules
from utils import primary_dataset, wiki_scraper, convert_to_decimal_degrees, pop_density

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

%load_ext watermark
%watermark -v
%watermark --iversions

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 9.9.0

geopandas: 1.1.2
numpy    : 2.4.1
openpyxl : 3.1.5
pandas   : 3.0.0
requests : 2.32.5



### **Load Data**

#### Primary Dataset 1.0

In [3]:
# Water Quality Data: we will start out with a primary dataset, and follow up by joining other datasets to add variability.

wq_data = primary_dataset.primary_dataset()
print('Initial shape:',wq_data.shape)
wq_data.head()

We will explore Water Quality over the course of 60 months.
Initial shape: (9319, 14)


Unnamed: 0,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,month
0,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31
1,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31
2,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31
3,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31
4,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31


#### Secondary Dataset 2.0 - List of rivers in South Africa

- Wouldn't it be nice to load some data directly from the internet, without crowding space on the local machine?
1. Bing Search: "how to read a table from wikipedia pandas" --> ```HTTPError: HTTP Error 403: Forbidden```
2. Bing Search: "HTTPError: HTTP Error 403: Forbidden" --> https://stackoverflow.com/questions/16627227/how-do-i-avoid-http-error-403-when-web-scraping-with-python
3. Add ```import requests```, ```from io import StringIO```, *header* and ```response = requests.get(url, headers=headers)``` PLUS ``StringIO(response.text)`` --> https://docs.python-requests.org/ & https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
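
The three steps above can be sketched as follows. This is a minimal illustration and not the actual `utils.wiki_scraper` module: the function name `read_wiki_tables` is hypothetical, and the header value simply reuses the User-Agent string quoted in this notebook. Note that `pd.read_html` needs an HTML parser such as `lxml` installed.

```python
from io import StringIO

import pandas as pd
import requests

# Custom header so the request is not rejected with HTTP 403
BROWSER_HEADERS = {"User-Agent": "Chrome/107.0.0.0 Safari/537.36"}


def read_wiki_tables(url: str, headers: dict = BROWSER_HEADERS) -> list:
    """Download a page and return every HTML table on it as a DataFrame."""
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on 403 and other HTTP errors
    # Wrapping the text in StringIO lets pandas parse it as a file-like object
    return pd.read_html(StringIO(response.text))


# Example call (left commented out so nothing is downloaded on import):
# tables = read_wiki_tables("https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa")
# rivers_raw = tables[1]
```

Fetching the page with `requests` first, instead of handing the URL straight to `pd.read_html`, is what allows the custom header to be sent.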

In [4]:
# Scrape a list of rivers from South Africa @ https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa
# Custom headers to mimic a real browser
# def wiki_scraper(url="https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa", headers={"User-Agent": "Chrome/107.0.0.0 Safari/537.36"}, table=1, subset='Mouth / junction coordinates'):

rivers = wiki_scraper.wiki_scraper(url='https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa')

# Convert Degrees, Minutes and Seconds to Decimal Degrees
# Formula: Decimal Degrees = Degrees + (Minutes ÷ 60) + (Seconds ÷ 3,600)
# For example, to convert 30° 15′ 50″: Decimal Degrees = 30 + (15 ÷ 60) + (50 ÷ 3,600) = 30.2639°.
# https://stackoverflow.com/questions/33997361/how-to-convert-degree-minute-second-to-degree-decimal

# extensive 'print' debugging to figure out '\ufeff' was at the end of each string in the pd.Series
rivers['mouth/junctioncoordinates'] = rivers['mouth/junctioncoordinates'].apply(lambda x: x.replace('\ufeff', ""))
rivers['latitude'] = convert_to_decimal_degrees.convert_to_decimal_degrees(rivers['mouth/junctioncoordinates'])['latitude']
rivers['longitude'] = convert_to_decimal_degrees.convert_to_decimal_degrees(rivers['mouth/junctioncoordinates'])['longitude']
rivers= rivers[rivers['latitude'].apply(lambda x: isinstance(x, float))]

# round to 5 decimals
rivers['latitude'] = round(rivers['latitude'].astype(float), 5)
rivers['longitude'] = round(rivers['longitude'].astype(float), 5)

# all river names lower case
rivers['river'] = rivers['river'].str.lower()


rivers.info()

<class 'pandas.DataFrame'>
Index: 50 entries, 0 to 274
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   river                           50 non-null     str    
 1   drainagebasin[a]                29 non-null     str    
 2   provinceandlocation             50 non-null     str    
 3   sourcelocation(town/mountains)  36 non-null     str    
 4   tributaryof(river)              25 non-null     str    
 5   daminriver                      19 non-null     str    
 6   mouth/junctionatlocation(town)  43 non-null     str    
 7   mouth/junctioncoordinates       50 non-null     str    
 8   latitude                        50 non-null     float64
 9   longitude                       50 non-null     float64
dtypes: float64(2), str(8)
memory usage: 4.3 KB
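
The degrees, minutes and seconds formula quoted in the cell above can be sketched in a few lines. This is an illustration only and not the `utils.convert_to_decimal_degrees` module: the function names and the assumed input format (for example `30°15′50″S`) are hypothetical, and the real Wikipedia strings may need extra cleaning.

```python
import re

# Assumed format: degrees, optional minutes, optional seconds, hemisphere letter
DMS_PATTERN = re.compile(
    r"""
    (\d+(?:\.\d+)?)\s*°\s*            # degrees
    (?:(\d+(?:\.\d+)?)\s*[′']\s*)?    # optional minutes
    (?:(\d+(?:\.\d+)?)\s*[″"]\s*)?    # optional seconds
    ([NSEW])                          # hemisphere letter
    """,
    re.VERBOSE,
)


def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Decimal Degrees = Degrees + Minutes / 60 + Seconds / 3600."""
    value = degrees + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative by convention
    return -value if hemisphere in ("S", "W") else value


def parse_dms(text):
    """Parse the first coordinate in a string such as '30°15′50″S'."""
    match = DMS_PATTERN.search(text.replace("\ufeff", ""))
    if match is None:
        raise ValueError(f"no DMS coordinate found in {text!r}")
    deg, mins, secs, hemi = match.groups()
    return dms_to_decimal(float(deg), float(mins or 0), float(secs or 0), hemi)


print(round(dms_to_decimal(30, 15, 50), 4))
print(round(parse_dms("30°15′50″S"), 4))
```

The sign flip matters for this dataset: every South African latitude lies in the southern hemisphere and should therefore be negative in decimal degrees.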


#### Secondary Dataset 2.1 - River Mouths

Using the Wikipedia list of rivers above, we separated the rivers that end at a mouth from the rivers that end in a junction. The two terms are often used interchangeably; however, a mouth refers to a river ending in an ocean, sea, or lake, while a junction refers to a river running into another river. This nuance should be noted before exploration.

In [5]:
river_mouths = pd.read_excel('data/south_africa_river_mouths.xlsx')
river_mouths['latitude'] = round(river_mouths['latitude'], 5)
river_mouths['longitude'] = round(river_mouths['longitude'], 5)

# all river_names lowercase
river_mouths['river'] = river_mouths['river_name'].str.lower()
river_mouths = river_mouths[['river_name','province','mouth_location','latitude','longitude','river']]
river_mouths.info()

<class 'pandas.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   river_name      21 non-null     str    
 1   province        21 non-null     str    
 2   mouth_location  21 non-null     str    
 3   latitude        21 non-null     float64
 4   longitude       21 non-null     float64
 5   river           21 non-null     str    
dtypes: float64(2), str(4)
memory usage: 1.1 KB


#### Secondary Dataset 2.2 - Population Density

In [6]:
# Follow the yellow-brick-road
path = 'data/Population_density'
filenames = pop_density.pop_density()
filenames
# Source - https://stackoverflow.com/q/20906474
# Posted by jonas, modified by community. See post 'Timeline' for change history
# Retrieved 2026-01-30, License - CC BY-SA 4.0

['data/Population_density\\zaf_pd_2011_1km_UNadj_ASCII_XYZ.zip',
 'data/Population_density\\zaf_pd_2012_1km_UNadj_ASCII_XYZ.zip',
 'data/Population_density\\zaf_pd_2013_1km_UNadj_ASCII_XYZ.zip',
 'data/Population_density\\zaf_pd_2014_1km_UNadj_ASCII_XYZ.zip',
 'data/Population_density\\zaf_pd_2015_1km_UNadj_ASCII_XYZ.zip']

- Create a list of pandas dataframes using population density
- Each dataframe has 3 columns: X (longitude), Y (latitude) and Z (population density in people per square kilometre)

In [7]:
def world_pop(filename: str) -> pd.DataFrame:
    '''
    input: filename (path to one zipped WorldPop XYZ file)
    return: pandas dataframe with country and year columns added
    '''
    # Map country codes to full names (only South African population)
    country_map = {"zaf": "South Africa"}

    # Keep only the file name, whether the path uses "\\" or "/" separators
    basename = filename.replace("\\", "/").split("/")[-1]

    # Create a list with filename parts, e.g. ['zaf', 'pd', '2011', ...]
    parts = basename.split("_")
    print(f"parts of year {parts[2]}",parts)

    # Extract metadata from filename
    country_code = parts[0].lower()
    year = int(parts[2])
    country_name = country_map.get(country_code, "Unknown")

    # Read CSV
    df = pd.read_csv(filename)

    # Rename XYZ columns to descriptive names
    df = df.rename(columns={
        "X": "longitude",
        "Y": "latitude",
        "Z": "population_density"
    })

    # Insert metadata columns
    df.insert(0, "country", country_name)
    df.insert(1, "year", year)

    return df

all_dfs = [world_pop(filename) for filename in filenames]

# Combine all countries & years into one DataFrame
population_density_all = pd.concat(all_dfs, ignore_index=True)

# Sort for consistency
population_density_all = population_density_all.sort_values(
    by=["country", "year", "latitude", "longitude", "population_density"]
).reset_index(drop=True)

# Quick check
population_density_all.head()


parts of year 2011 ['zaf', 'pd', '2011', '1km', 'UNadj', 'ASCII', 'XYZ.zip']
parts of year 2012 ['zaf', 'pd', '2012', '1km', 'UNadj', 'ASCII', 'XYZ.zip']
parts of year 2013 ['zaf', 'pd', '2013', '1km', 'UNadj', 'ASCII', 'XYZ.zip']
parts of year 2014 ['zaf', 'pd', '2014', '1km', 'UNadj', 'ASCII', 'XYZ.zip']
parts of year 2015 ['zaf', 'pd', '2015', '1km', 'UNadj', 'ASCII', 'XYZ.zip']


Unnamed: 0,country,year,longitude,latitude,population_density
0,South Africa,2011,37.794583,-46.979583,0.0
1,South Africa,2011,37.802917,-46.979583,0.0
2,South Africa,2011,37.81125,-46.979583,0.0
3,South Africa,2011,37.819583,-46.979583,0.0
4,South Africa,2011,37.827917,-46.979583,0.0


### **Load and let the join begin**

#### Secondary Dataset 2.3 - Countries and Provinces

- 2.3.1 Load and add countries to the primary dataset so each sample can be linked to a nominal location
    - https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-details/

- 2.3.2 Load and add provinces too
    - https://simplemaps.com/gis/country/za

We can incorporate the primary dataset here, and start joining.

In [8]:
# Let's create a function for loading in the next datasets. We will use our primary dataset to join immediately.

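def shapeF(df, col='ADMIN', path_to_file=None, layer=None) -> pd.DataFrame:
    '''
    input:
    ------
    - df: pandas dataframe with "longitude" and "latitude" columns
    - col: str (boundary-file column to join; 'ADMIN' adds a country, any other name adds a province)
    - path_to_file: str (path to the boundary shapefile or folder)
    - layer: str or None (layer name passed to geopandas.read_file)

    returns
    -------
    joined and cleaned pandas dataframe
    '''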
    
    if col == 'ADMIN':
        new_col_name = 'country'
    else:
        new_col_name = 'province'

    print(new_col_name)

    # 1) Create GeoDataFrame from lat/lon
    gdf = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
        crs="EPSG:4326" # "Coordinate Reference System"
    )

    # 2) Load the boundary polygons (Natural Earth countries or South African provinces)
    count = gpd.read_file(path_to_file, layer=layer) # other dependent files (.shx, .dbf, etc.) are located in this path*
    #.to_crs("EPSG:4326")
    # WARNINGS SUPPRESSED

    # 3) Spatial join to add country column
    gdf_join = gpd.sjoin(
        gdf,
        count[[col, "geometry"]],
        how="left",
        predicate="within"
    ).to_crs("EPSG:4326")

    # 4) Rename + cleanup country columns
    gdf_join = (
        gdf_join
        .rename(columns={col: new_col_name})
        .drop(columns=["geometry", "index_right"], errors="ignore")
    )

    # 5) Move the new column (`country` or `province`) to be the FIRST column
    cols = gdf_join.columns.tolist()
    cols.insert(0, cols.pop(cols.index(new_col_name)))
    gdf_join = gdf_join[cols]

    # 6) Return the joined dataframe
    return gdf_join

In [9]:
# define variables for our function
cols_and_paths = {'ADMIN': 'data/new_earth/ne_50m_admin_0_countries.shp', 'name': 'data/za_shp'}

gdf_with_country = shapeF(wq_data, col='ADMIN', path_to_file=cols_and_paths['ADMIN'])
final_df = shapeF(gdf_with_country, col='name', path_to_file=cols_and_paths['name'], layer='za')
final_df.head()

country
province


Unnamed: 0,province,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,month
0,Northern Cape,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31
1,Mpumalanga,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31
2,Gauteng,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31
3,Free State,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31
4,Free State,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31


#### We only want to explore South African Rivers, because we're biased.

In [10]:
final_df = final_df[final_df['country']=='South Africa']

#### Lastly, we need to save copies for further analysis...because this notebook is getting too long.

- I've never heard of, or worked with pickle files before. Have you ever heard of "Pickle Rick" from the show *Rick and Morty*?
- Pickling preserves the column dtypes of our current dataframes

In [None]:
final_df.to_pickle('data/i_need_pop_density.pkl')
# population_density_all.to_pickle('data/pop_density.pkl')
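
To illustrate why a pickle file is handy for handing dataframes between notebooks, here is a small sketch comparing a pickle round trip with a CSV round trip. The toy frame and the file names are made up for the example; only `to_pickle`, `read_pickle`, `to_csv` and `read_csv` are real pandas calls.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column, loosely modelled on the primary dataset
frame = pd.DataFrame(
    {
        "sample date": pd.to_datetime(["2011-01-02", "2011-01-03"]),
        "pet": [174.2, 124.1],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "demo.pkl")
    csv_path = os.path.join(tmp, "demo.csv")

    frame.to_pickle(pkl_path)
    frame.to_csv(csv_path, index=False)

    restored = pd.read_pickle(pkl_path)  # dtypes come back unchanged
    from_csv = pd.read_csv(csv_path)     # dates come back as plain text

print(restored.dtypes)
print(from_csv.dtypes)
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so `pd.read_pickle` should only be used on files from a trusted source.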

In [16]:
population_density_all[population_density_all['country']=='South Africa']

Unnamed: 0,country,year,longitude,latitude,population_density
0,South Africa,2011,37.794583,-46.979583,0.000000
1,South Africa,2011,37.802917,-46.979583,0.000000
2,South Africa,2011,37.811250,-46.979583,0.000000
3,South Africa,2011,37.819583,-46.979583,0.000000
4,South Africa,2011,37.827917,-46.979583,0.000000
...,...,...,...,...,...
8157540,South Africa,2015,29.627917,-22.129583,62.707832
8157541,South Africa,2015,29.636250,-22.129583,26.883219
8157542,South Africa,2015,29.644583,-22.129583,34.937401
8157543,South Africa,2015,29.652917,-22.129583,131.900742


#### Next, we can join the population density variable to the dataset

In [None]:
# raise NotImplementedError()
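
One possible way to carry out this join, sketched under assumptions and not yet the notebook's final method: match each sample to the nearest population grid cell of the same year with `scipy.spatial.cKDTree`. The helper name `attach_nearest_density`, the derived `year` column, and the toy values below are hypothetical. Distances are computed in plain degrees, which is only an approximation of true ground distance.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def attach_nearest_density(samples: pd.DataFrame, grid: pd.DataFrame) -> pd.DataFrame:
    """Add the population density of the nearest same-year grid cell."""
    pieces = []
    for year, group in samples.groupby("year"):
        cells = grid[grid["year"] == year]
        if cells.empty:
            # No grid for this year: keep the samples, leave density missing
            pieces.append(group.assign(population_density=np.nan))
            continue
        tree = cKDTree(cells[["longitude", "latitude"]].to_numpy())
        _, nearest = tree.query(group[["longitude", "latitude"]].to_numpy())
        density = cells["population_density"].to_numpy()[nearest]
        pieces.append(group.assign(population_density=density))
    return pd.concat(pieces).sort_index()


# Toy data (hypothetical values) with the column names used in this notebook
grid = pd.DataFrame(
    {
        "year": [2011, 2011, 2012, 2012],
        "longitude": [28.0, 29.0, 28.0, 29.0],
        "latitude": [-26.0, -26.0, -26.0, -26.0],
        "population_density": [10.0, 50.0, 12.0, 55.0],
    }
)
samples = pd.DataFrame(
    {
        "sample date": ["2011-01-03", "2012-06-01"],
        "longitude": [28.9, 28.1],
        "latitude": [-26.0, -26.1],
    }
)
samples["year"] = pd.to_datetime(samples["sample date"]).dt.year

joined = attach_nearest_density(samples, grid)
print(joined)
```

Building one tree per year keeps each lookup within the matching annual grid, and a tree query avoids comparing every sample against millions of grid cells directly.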