# SIADS 593 WN26 Milestone I: Sustainable Water Quality Notebook

**Authors:**

Sungmin Kim  
Randy Best  
Kyle Rodriguez

#### **Introduction**

Welcome to this exploratory data analysis of water quality across river locations in South Africa! Water quality is a key factor in environmental sustainability, and we use data to identify and highlight the factors that most strongly influence it. As a team, we take this opportunity to apply data exploration techniques to real-world environmental data and contribute to advancing water quality monitoring.

**Primary Dataset**

The primary dataset contains three target water quality parameters, **Total Alkalinity**, **Electrical Conductance**, and **Dissolved Reactive Phosphorus**, collected between 2011 and 2015 from approximately 200 river locations across South Africa. Our goal is to find which environmental conditions most affect these three metrics. Each data point includes the geographic coordinates (latitude and longitude) of the sampling site, the date of collection, and the corresponding water quality measurements. The features also include four Landsat spectral bands, **SWIR22** (Shortwave Infrared 2), **NIR** (Near Infrared), **Green**, and **SWIR16** (Shortwave Infrared 1), along with the derived spectral indices **NDMI** (Normalized Difference Moisture Index) and **MNDWI** (Modified Normalized Difference Water Index). In addition, the **PET** (Potential Evapotranspiration) variable was incorporated from the **TerraClimate** dataset to account for climatic influences on water quality.

- **SWIR22** – Sensitive to surface moisture and turbidity variations in water bodies.  
- **NIR** – Helps in identifying vegetation and suspended matter in water.  
- **Green** – Useful for detecting water color and surface reflectance changes.  
- **SWIR16** – Provides information on surface dryness and sediment concentration.  
- **NDMI** – Derived from NIR and SWIR16, indicates moisture and vegetation–water interaction.  
- **MNDWI** – Derived from Green and SWIR16, effective for distinguishing open water areas and reducing built-up noise.  
- **PET** – Extracted from the TerraClimate dataset, represents potential evapotranspiration influencing hydrological and water quality dynamics.
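
As a quick check of the two index definitions, the sketch below recomputes NDMI and MNDWI from the band values in the first row of the dataset preview shown later in this notebook. The helper name `normalized_difference` is illustrative and not part of the project code. With these band values, the stored MNDWI is reproduced when SWIR16 is used as the shortwave band.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), the generic normalized-difference form."""
    return (a - b) / (a + b)

# Band values copied from the first row of the dataset preview
nir, green, swir16 = 11190.0, 11426.0, 7687.5

ndmi = normalized_difference(nir, swir16)     # moisture index: NIR vs SWIR16
mndwi = normalized_difference(green, swir16)  # water index: Green vs SWIR16

print(round(ndmi, 6), round(mndwi, 6))
```

Both results agree with the `NDMI` (0.185538) and `MNDWI` (0.195595) values stored for that row.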

The dataset spans a five-year period from 2011 to 2015. Using API-based data extraction methods, both Landsat and TerraClimate features were retrieved directly from the <a href="https://planetarycomputer.microsoft.com/">Microsoft Planetary Computer</a> portal. These combined spectral, index-based, and climatic features were used as predictors in a regression model to estimate three key water quality parameters: Total Alkalinity (TA), Electrical Conductance (EC), and Dissolved Reactive Phosphorus (DRP).

**Secondary Datasets**

Our secondary datasets comprise gridded population estimates from <a href="https://hub.worldpop.org/project/categories?id=18">WorldPop :: Population Density</a> and a <a href="https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa">List of rivers of South Africa</a> from Wikipedia. The population density data includes geographical coordinates and the number of people per square kilometre, based on country totals adjusted to match the corresponding official United Nations population estimates prepared by the Population Division of the Department of Economic and Social Affairs of the United Nations Secretariat (<a href="https://population.un.org/wpp/">2019 Revision of World Population Prospects</a>). The list of rivers includes each river's name, province and location, source location, and mouth/junction location name with its geographical coordinates.

- Gridded population estimates are particularly useful because they give decision-makers and data users the flexibility to aggregate population into different spatial units, whether existing enumeration areas or custom areas. Each grid cell holds an estimated population density. The projection is Geographic Coordinate System, WGS84.
- Each water quality sample in our original data is linked to geographical coordinates, so we will use a simple list of rivers to relate each sample to its nominal water source.
 

#### **About the Notebook**

In this notebook, we **load previously extracted data** from CSV files generated in a separate extraction notebook. This approach ensures a smoother and faster workflow, allowing participants to focus on data analysis and model development without waiting for time-consuming data retrieval.

## Load In Dependencies

To run this demonstration notebook, you will need to install the packages imported below. Installation may take some time.  

```%pip install numpy pandas geopandas tqdm requests lxml matplotlib plotly seaborn watermark```  
```%pip install nbformat```  
```%pip install plotly[express,jupyter]```  
```%pip install scipy scikit-learn openpyxl```

In [1]:
# Data manipulation and analysis
import numpy as np
import pandas as pd
import geopandas as gpd
from sklearn.neighbors import BallTree
from scipy.spatial import cKDTree

# useful tools
from datetime import date
from tqdm import tqdm
import os
import requests
from io import StringIO
import glob
import re
import zipfile
import openpyxl

# Visualization libraries
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Modules
import primary_dataset

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

%load_ext watermark
%watermark -v
%watermark --iversions

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 9.8.0

geopandas : 1.1.2
matplotlib: 3.10.8
numpy     : 2.4.0
openpyxl  : 3.1.5
pandas    : 2.3.3
plotly    : 6.5.2
re        : 2.2.1
requests  : 2.32.5
scipy     : 1.17.0
seaborn   : 0.13.2
sklearn   : 1.8.0
tqdm      : 4.67.1



## Load Data

In [2]:
# Water Quality Data

wq_data = primary_dataset.primary_dataset()
wq_data.head()

We will explore Water Quality over the course of 60 months.


Unnamed: 0,Latitude,Longitude,Sample Date,nir,green,swir16,swir22,NDMI,MNDWI,pet,Total Alkalinity,Electrical Conductance,Dissolved Reactive Phosphorus,binned_months
0,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31
1,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31
2,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31
3,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31
4,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31


In [3]:
# We will start out with a simple dataset, and join other datasets to add variability
wq_data.shape

(9319, 14)

## Joining the datasets

In [4]:
# 1) Copy original dataset
df = wq_data.copy()

# 2) Clean column names (safe + recommended)
df.columns = df.columns.str.strip().str.lower()

# 3) Create GeoDataFrame from lat/lon
gdf_points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326"
)

# 4) Load Natural Earth country boundaries
countries = gpd.read_file(
    "ne_50m_admin_0_countries.shp" #https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-admin-0-details/
)#.to_crs("EPSG:4326")
# WARNINGS SUPPRESSED

# 5) Spatial join to add country column
gdf_with_country = gpd.sjoin(
    gdf_points,
    countries[["ADMIN", "geometry"]],
    how="left",
    predicate="within"
).to_crs("EPSG:4326")

# 6) Rename + cleanup
gdf_with_country = (
    gdf_with_country
    .rename(columns={"ADMIN": "country"})
    .drop(columns=["geometry", "index_right"], errors="ignore")
)

# 7) Move `country` to be the FIRST column
cols = gdf_with_country.columns.tolist()
cols.insert(0, cols.pop(cols.index("country")))
final_df = gdf_with_country[cols]

# 8) Quick check
final_df.head()

Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,binned_months
0,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31
1,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31
2,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31
3,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31
4,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31


In [5]:
path = 'Population_density'
if not os.path.exists(path):
    dir_list = os.listdir()
    for file in dir_list:
        # print(file)
        if file.endswith('.zip'):
            with zipfile.ZipFile(file) as f:
                f.extractall(path)

In [6]:

# Source - https://stackoverflow.com/q/20906474
# Posted by jonas, modified by community. See post 'Timeline' for change history
# Retrieved 2026-01-30, License - CC BY-SA 4.0

# Map country codes to full names
country_map = {"zaf": "South Africa"}

# list of population dataframes
all_dfs = []

# Get data file names
filenames = glob.glob(path + "/*.csv")
# print(filenames)
for filename in filenames:

    # Split the base file name into its underscore-separated parts
    # (os.path.basename keeps this portable across operating systems)
    parts = os.path.basename(filename).split("_")
    # Basic safety check on filename structure
    if len(parts) < 3:
        continue

    # Extract metadata from filename
    country_code = parts[0].lower()
    year = int(parts[2])
    country_name = country_map.get(country_code, "Unknown")

    # Read CSV
    df = pd.read_csv(filename)

    # Rename XYZ columns to descriptive names
    df = df.rename(columns={
        "X": "longitude",
        "Y": "latitude",
        "Z": "population_density"
    })

    # Insert metadata columns
    df.insert(0, "country", country_name)
    df.insert(1, "year", year)

    all_dfs.append(df)

# Combine all countries & years into one DataFrame
population_density_all = pd.concat(all_dfs, ignore_index=True)

# Sort for consistency
population_density_all = population_density_all.sort_values(
    by=["country", "year", "latitude", "longitude", "population_density"]
).reset_index(drop=True)

# Quick check
population_density_all.head()


Unnamed: 0,country,year,longitude,latitude,population_density
0,South Africa,2011,37.794583,-46.979583,0.0
1,South Africa,2011,37.802917,-46.979583,0.0
2,South Africa,2011,37.81125,-46.979583,0.0
3,South Africa,2011,37.819583,-46.979583,0.0
4,South Africa,2011,37.827917,-46.979583,0.0
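
The loop above relies on the file name alone to recover the country code and year. Below is a minimal sketch of that parsing step, using `os.path.basename` so it does not depend on the path separator. The file name `zaf_pd_2011_1km_UNadj_ASCII_XYZ.csv` and the helper `parse_population_filename` are assumptions made for illustration, not names taken from the notebook's data folder.

```python
import os

COUNTRY_MAP = {"zaf": "South Africa"}

def parse_population_filename(path: str) -> tuple[str, int]:
    """Extract (country name, year) from a '<code>_<tag>_<year>_...' file name."""
    parts = os.path.basename(path).split("_")
    if len(parts) < 3:
        raise ValueError(f"Unexpected file name structure: {path}")
    country = COUNTRY_MAP.get(parts[0].lower(), "Unknown")
    return country, int(parts[2])

# Hypothetical file path, built with the separator of the current OS
example = os.path.join("Population_density", "zaf_pd_2011_1km_UNadj_ASCII_XYZ.csv")
country, year = parse_population_filename(example)
print(country, year)
```

Raising an error on malformed names is one option; the loop above skips such files instead, which is equally reasonable when the folder may contain unrelated CSVs.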


In [7]:

# --- 1) Copies + normalize column names ---
wq = final_df.copy()
pd_all = population_density_all.copy()

wq.columns = wq.columns.str.strip().str.lower()
pd_all.columns = pd_all.columns.str.strip().str.lower()

# --- 2) Ensure numeric types ---
for col in ["longitude", "latitude"]:
    wq[col] = pd.to_numeric(wq[col], errors="coerce")
    pd_all[col] = pd.to_numeric(pd_all[col], errors="coerce")

pd_all["population_density"] = pd.to_numeric(pd_all["population_density"], errors="coerce")
pd_all["year"] = pd.to_numeric(pd_all["year"], errors="coerce")

# --- 3) Extract sample year ---
wq["sample_year"] = pd.to_datetime(wq["sample date"], errors="coerce").dt.year

# --- 4) Initialize output columns ---
wq["pd_year"] = np.nan
wq["pop_density_nn"] = np.nan
wq["distance_km_to_pd_cell"] = np.nan

# --- 5) Helper: approx distance in km ---
def approx_deg_to_km(dlon, dlat, lat):
    km_lat = dlat * 111.0
    km_lon = dlon * 111.0 * np.cos(np.deg2rad(lat))
    return np.sqrt(km_lat**2 + km_lon**2)

# --- 6) Filter PD to clean rows only ---
pd_all_clean = pd_all.dropna(subset=["year", "longitude", "latitude", "population_density"]).copy()
pd_all_clean["year"] = pd_all_clean["year"].astype(int)

# --- 7) Build KDTree per year once ---
trees = {}
pd_by_year = {}

for yr, group in pd_all_clean.groupby("year"):
    group = group.reset_index(drop=True)
    coords = group[["longitude", "latitude"]].to_numpy()
    trees[yr] = cKDTree(coords)
    pd_by_year[yr] = group

# --- 8) Match WQ points year-by-year ---
valid_wq = wq.dropna(subset=["sample_year", "longitude", "latitude"]).copy()
valid_wq["sample_year"] = valid_wq["sample_year"].astype(int)

for yr in sorted(valid_wq["sample_year"].unique()):
    if yr not in trees:
        continue

    idx_rows = valid_wq.index[valid_wq["sample_year"] == yr]
    query_pts = valid_wq.loc[idx_rows, ["longitude", "latitude"]].to_numpy()

    dist, idx = trees[yr].query(query_pts, k=1)
    matched = pd_by_year[yr].iloc[idx].reset_index(drop=True)

    wq.loc[idx_rows, "pd_year"] = yr
    wq.loc[idx_rows, "pop_density_nn"] = matched["population_density"].to_numpy()

    dlon = valid_wq.loc[idx_rows, "longitude"].to_numpy() - matched["longitude"].to_numpy()
    dlat = valid_wq.loc[idx_rows, "latitude"].to_numpy() - matched["latitude"].to_numpy()
    lat  = valid_wq.loc[idx_rows, "latitude"].to_numpy()

    wq.loc[idx_rows, "distance_km_to_pd_cell"] = approx_deg_to_km(dlon, dlat, lat)

# --- 9) Build FINAL joined dataset: all original columns + new features ---
original_cols = final_df.copy()
original_cols.columns = original_cols.columns.str.strip().str.lower()
original_cols = original_cols.columns.tolist()

new_cols = ["pop_density_nn", "distance_km_to_pd_cell"]

# Keep original order + add new cols at the end (only if they exist)
final_cols = original_cols + [c for c in new_cols if c in wq.columns]
wq_joined = wq[final_cols].copy()

# --- 10) Show everything (not a subset) ---
wq_joined.head()

Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,binned_months,pop_density_nn,distance_km_to_pd_cell
0,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31,0.57699,0.325999
1,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31,5.049022,0.251093
2,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31,23.239988,0.418343
3,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31,687.465759,0.069949
4,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31,6.092811,0.23173
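
The matching step in the cell above can be illustrated on toy data. The sketch below swaps the `cKDTree` for a brute-force search so it runs with the standard library only, and it reuses the same flat-earth distance approximation. Note that it minimizes the approximate distance in km, whereas the tree in the notebook searches in raw degrees. The grid cells, densities, and the `nearest_cell` helper are made up for illustration.

```python
import math

def approx_deg_to_km(dlon: float, dlat: float, lat: float) -> float:
    """Approximate planar distance in km for small offsets given in degrees."""
    km_lat = dlat * 111.0
    km_lon = dlon * 111.0 * math.cos(math.radians(lat))
    return math.hypot(km_lat, km_lon)

# Hypothetical grid-cell centers: (longitude, latitude, population_density)
cells = [
    (28.880, -26.860, 5.0),
    (28.888, -26.860, 7.5),
    (28.880, -26.868, 2.0),
]

def nearest_cell(lon: float, lat: float, cells: list) -> tuple[float, float]:
    """Return (density, distance_km) of the closest cell center."""
    best = min(cells, key=lambda c: approx_deg_to_km(lon - c[0], lat - c[1], lat))
    return best[2], approx_deg_to_km(lon - best[0], lat - best[1], lat)

# Coordinates of the second sample in the preview above
density, dist_km = nearest_cell(28.884722, -26.861111, cells)
print(density, round(dist_km, 3))
```

A brute-force scan is fine for a handful of cells; for the full population grid a spatial index such as the `cKDTree` used above is the practical choice.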


### Handling Missing Values

It's important for us to know how many attributes we are working with. Below, we can see that there are 17 columns of data to choose from so far, and 9319 rows for each.

We also check for missing values to ensure data consistency. The check below reports none in any column.

In [8]:
# Count missing values per column, sorted in descending order
print('shape:',wq_joined.shape)
wq_joined.isna().sum().sort_values(ascending=False)

shape: (9319, 17)


country                          0
latitude                         0
longitude                        0
sample date                      0
nir                              0
green                            0
swir16                           0
swir22                           0
ndmi                             0
mndwi                            0
pet                              0
total alkalinity                 0
electrical conductance           0
dissolved reactive phosphorus    0
binned_months                    0
pop_density_nn                   0
distance_km_to_pd_cell           0
dtype: int64

In [9]:
wq_joined.head()

Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,binned_months,pop_density_nn,distance_km_to_pd_cell
0,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31,0.57699,0.325999
1,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31,5.049022,0.251093
2,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31,23.239988,0.418343
3,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31,687.465759,0.069949
4,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31,6.092811,0.23173


In [10]:
wq_joined["distance_km_to_pd_cell"].agg(
    ["min", "max", "mean", "median"]
)

min       0.025842
max       0.781277
mean      0.346963
median    0.362298
Name: distance_km_to_pd_cell, dtype: float64

**How to interpret this**

We are matching water-quality points to 1 km population-density grid cells.

Our results:

- Mean distance ≈ 0.35 km (about 350 meters)
- Median ≈ 0.36 km
- Max ≈ 0.78 km

For a 1 km grid, this is close to what we want to see.

Why?

- In a 1 km × 1 km cell, the furthest possible distance from the center is ~0.71 km.
- Our max (0.78 km) is slightly above that idealized worst case, so at least one sample was matched to a cell center a little farther away than the half-diagonal of an ideal cell.
- A mean around 0.35 km means most matches fall within the same grid cell or a very close neighbor.

This tells us:

- Spatial alignment is tight
- Nearest-neighbor matching behaved as expected
- No red flags like multi-km mismatches
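
The ~0.71 km figure is the half-diagonal of a square cell. A quick check, assuming an idealized 1 km × 1 km cell (the real grid is defined in degrees, so actual cell dimensions vary with latitude):

```python
import math

cell_size_km = 1.0  # assumed idealized square cell

# Farthest point from the center is a corner: half the width and half the height away
half_diagonal_km = math.hypot(cell_size_km / 2, cell_size_km / 2)
print(round(half_diagonal_km, 3))
```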

- Reading the table directly from the URL with pandas (found via a Bing search for "how to read a table from wikipedia pandas") raised ```HTTPError: HTTP Error 403: Forbidden```.
- A Bing search for "HTTPError: HTTP Error 403: Forbidden" led to https://stackoverflow.com/questions/16627227/how-do-i-avoid-http-error-403-when-web-scraping-with-python
- The fix: add ```import requests``` and ```from io import StringIO```, send the request with a custom *header* via ```response = requests.get(url, headers=headers)```, and pass ``StringIO(response.text)`` to `pd.read_html`. See https://docs.python-requests.org/ and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html

In [11]:
# List of rivers of South Africa
# Wikipedia URL
url = "https://en.wikipedia.org/wiki/List_of_rivers_of_South_Africa"

# Custom headers to mimic a real browser
headers = {
    "User-Agent": "Chrome/107.0.0.0 Safari/537.36"
}

# Send HTTP request with headers
response = requests.get(url, headers=headers)

# Read HTML tables from the response content
tables = pd.read_html(StringIO(response.text))

# the second table is where it's at
rivers = tables[1].dropna(subset='Mouth / junction coordinates')

# clean columns
def split_join_lower(string: str):
    # a = "Drainage basin[A] "
    lst = string.split()
    return ''.join(lst).lower()
rivers.columns = [split_join_lower(s) for s in rivers.columns] # add .strip()
# df.columns = df.columns.str.strip().str.lower()

### Convert Degrees, Minutes and Seconds to Decimal Degrees

In [12]:
# Formula: Decimal Degrees = Degrees + (Minutes ÷ 60) + (Seconds ÷ 3,600)
# For example, to convert 30° 15′ 50″: Decimal Degrees = 30 + (15 ÷ 60) + (50 ÷ 3,600) = 30.2639°.
# https://stackoverflow.com/questions/33997361/how-to-convert-degree-minute-second-to-degree-decimal

def dms2dd(degrees, minutes, seconds, direction):
    dd = float(degrees) + float(minutes)/60 + float(seconds)/(3600)
    if direction in ('W', 'S'):  # West and south are negative; east and north stay positive
        dd *= -1
    return dd

# def dd2dms(deg):
#     d = int(deg)
#     md = abs(deg - d) * 60
#     m = int(md)
#     sd = (md - m) * 60
#     return [d, m, sd]

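def parse_dms(dms):
    # Split on ° ′ ″ and any other non-alphanumeric characters
    parts = re.split(r'[^\d\w]+', dms)

    try:
        dd = dms2dd(parts[0], parts[1], parts[2], parts[3])
    except (IndexError, ValueError):
        dd = dms  # leave unparseable values unchanged; they are filtered out later
    return dd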


def convert_to_decimal_degrees(series: pd.Series) -> pd.DataFrame:
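    '''
    Split coordinate strings into latitude and longitude parts and convert each to decimal degrees.
    Returns a pandas DataFrame with 'latitude' and 'longitude' columns. Values are floats
    where parsing succeeds and the original strings otherwise.
    '''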

    df = series.str.split(' / ', n=1, expand=True)
    df = df[0].str.split(' ', n=1, expand=True)
    df.columns = ['latitude', 'longitude']

    df['latitude'] = df['latitude'].apply(lambda x: parse_dms(x))
    df['longitude'] = df['longitude'].apply(lambda x: parse_dms(x))

    return df


rivers['mouth/junctioncoordinates'] = rivers['mouth/junctioncoordinates'].apply(lambda x: x.replace('\ufeff', ""))
rivers['latitude'] = convert_to_decimal_degrees(rivers['mouth/junctioncoordinates'])['latitude']
rivers['longitude'] = convert_to_decimal_degrees(rivers['mouth/junctioncoordinates'])['longitude']
rivers= rivers[rivers['latitude'].apply(lambda x: isinstance(x, float))]

rivers.info()


<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, 0 to 274
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   river                           50 non-null     object
 1   drainagebasin[a]                29 non-null     object
 2   provinceandlocation             50 non-null     object
 3   sourcelocation(town/mountains)  36 non-null     object
 4   tributaryof(river)              25 non-null     object
 5   daminriver                      19 non-null     object
 6   mouth/junctionatlocation(town)  43 non-null     object
 7   mouth/junctioncoordinates       50 non-null     object
 8   latitude                        50 non-null     object
 9   longitude                       50 non-null     object
dtypes: object(10)
memory usage: 4.3+ KB
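
As a worked check of the conversion formula, the sketch below converts the example from the comment above, 30° 15′ 50″, and shows the usual sign convention in which south latitudes and west longitudes are negative. The helper name `dms_to_dd` and the sample coordinate string `28°38′00″S` are illustrative, not taken from the rivers table.

```python
import re

def dms_to_dd(degrees, minutes, seconds, direction):
    """Convert degrees, minutes, seconds plus a hemisphere letter to decimal degrees."""
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    return -dd if direction in ("S", "W") else dd

# Example from the formula comment: 30 + 15/60 + 50/3600
print(round(dms_to_dd(30, 15, 50, "N"), 4))

# Hypothetical coordinate string, split on every non-alphanumeric character
parts = re.split(r"[^\d\w]+", "28°38′00″S")
south_lat = dms_to_dd(*parts[:4])
print(parts, round(south_lat, 4))
```

Because South African rivers lie at southern latitudes and eastern longitudes, correctly converted coordinates should have negative latitudes and positive longitudes.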


## Lastly Join River Names and Junction Coordinates

In [None]:

THRESH_KM = 2
EARTH_RADIUS_KM = 6371

# river mouth coordinates
mouth_coords = np.radians(
    rivers[["latitude", "longitude"]].astype(float).dropna().to_numpy()
)

tree = BallTree(mouth_coords, metric="haversine")

# sampling point coordinates
sample_coords = np.radians(
    wq_joined[["latitude", "longitude"]].to_numpy()
)

dist_rad, _ = tree.query(sample_coords, k=1)

# single binary flag (1 = river mouth, 0 = not)
wq_joined["river_mouth"] = (
    dist_rad[:, 0] * EARTH_RADIUS_KM <= THRESH_KM
).astype(int)

In [22]:
pd.set_option("display.max_columns", None)

wq_joined.head()

Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,binned_months,pop_density_nn,distance_km_to_pd_cell,river_mouth
0,Namibia,-28.760833,17.730278,2011-01-02,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595,174.2,128.912,555.0,10.0,2011-01-31,0.57699,0.325999,0
1,South Africa,-26.861111,28.884722,2011-01-03,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134,124.1,74.72,162.9,163.0,2011-01-31,5.049022,0.251093,0
2,South Africa,-26.45,28.085833,2011-01-03,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805,127.5,89.254,573.0,80.0,2011-01-31,23.239988,0.418343,0
3,South Africa,-27.671111,27.236944,2011-01-03,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416,129.7,82.0,203.6,101.0,2011-01-31,687.465759,0.069949,0
4,South Africa,-27.356667,27.286389,2011-01-03,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683,129.2,56.1,145.1,151.0,2011-01-31,6.092811,0.23173,0


In [23]:
wq_joined[wq_joined["river_mouth"] == 1].head()

Unnamed: 0,country,latitude,longitude,sample date,nir,green,swir16,swir22,ndmi,mndwi,pet,total alkalinity,electrical conductance,dissolved reactive phosphorus,binned_months,pop_density_nn,distance_km_to_pd_cell,river_mouth


In [24]:
wq_joined["river_mouth"].value_counts()

river_mouth
0    9319
Name: count, dtype: int64
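
The flag above depends on the great-circle distance between each sample and its nearest river mouth. Below is a minimal sketch of that threshold test, using a plain haversine function instead of `BallTree`. The river-mouth coordinates are made-up values and `haversine_km` is an illustrative helper. The second case flips the sign of the longitude to show how strongly the flag depends on correct hemisphere signs.

```python
import math

EARTH_RADIUS_KM = 6371
THRESH_KM = 2

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

sample = (-28.760833, 17.730278)   # first sample in the preview above
near_mouth = (-28.770000, 17.735000)     # hypothetical river mouth close to the sample
flipped_mouth = (-28.770000, -17.735000)  # same point with the longitude sign flipped

near_flag = int(haversine_km(*sample, *near_mouth) <= THRESH_KM)
flipped_flag = int(haversine_km(*sample, *flipped_mouth) <= THRESH_KM)
print(near_flag, flipped_flag)
```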
