An attempt to classify global water risk typologies

Water Risk Classification

Leonardo Harth & John Michael LaSalle




Our project uses the Aqueduct Alliance Global Maps 3.0 data as its initial data source. This dataset aggregates diverse indicators of water risk for every water basin in the world. It covers issues such as water depletion, water quality, and extreme conditions like drought and flood. The richness of this data piqued our curiosity, leading us to wonder what other factors are associated with these characteristics. Is low water quality more prevalent in developing countries? Is it correlated with the human development index of the surrounding area? We embraced this challenge as a means to explore these relationships.

We chose the Global Human Settlement Layer from the European Commission as a measure of population density, the global subnational infant mortality rates dataset from the Socioeconomic Data and Applications Center (SEDAC) as a proxy for the Human Development Index (HDI), and the Nighttime Lights Annual Composites from the National Oceanic and Atmospheric Administration (NOAA) as a proxy for economic development, since recent research indicates that night light can be used to measure income growth.

Our main dataset is composed of vector data. Each basin is a multipolygon with several features. Each feature is presented as a raw value, a 0 to 5 score, a label (categorical string), and a categorical integer. We chose to work solely with raw values. The full description of the categories can be found here. All of our complementary datasets are raster images.

Methods – Data Wrangling and Modeling

The first step was to combine these four datasets into one. We used our water dataset as the baseline. Since it is vector data (and all the other datasets are rasters), we aggregated the mean or sum of the pixels that fall within the boundaries of each basin using the zonal statistics tool of the rasterstats library. This gave us a single dataset in which each row represents a basin, with all the original raw water risk values plus the mean values of child mortality, night light, and population density.
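To illustrate what the zonal aggregation computes, here is a minimal sketch using NumPy only. It assumes the basins have already been rasterized to an integer grid aligned with the value raster; the arrays and the `zonal_mean` helper are hypothetical and are not the project code, which relied on rasterstats to aggregate directly from the basin polygons and raster files.

```python
import numpy as np

def zonal_mean(values, zones):
    """Return {zone_id: mean of non-NaN pixels} for aligned value and zone grids."""
    values = np.asarray(values, dtype=float)
    zones = np.asarray(zones)
    valid = ~np.isnan(values)  # ignore nodata pixels, encoded here as NaN
    result = {}
    for zone_id in np.unique(zones):
        mask = (zones == zone_id) & valid
        result[int(zone_id)] = float(values[mask].mean()) if mask.any() else float("nan")
    return result

# Hypothetical 3x4 raster (e.g. night-light intensity) and the basin id of each pixel.
night_light = np.array([[1.0, 2.0, 10.0, 12.0],
                        [3.0, np.nan, 14.0, 16.0],
                        [5.0, 5.0, 20.0, 20.0]])
basin_ids = np.array([[1, 1, 2, 2],
                      [1, 1, 2, 2],
                      [3, 3, 3, 3]])

basin_means = zonal_mean(night_light, basin_ids)
print(basin_means)
```

Each basin ends up with one value per raster, which is what lets the raster-derived variables be joined to the basin table as ordinary columns.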

Unfortunately, several rows of this dataset have NA values. We attempted to use the multivariate feature imputation tool from the scikit-learn library to fill those values, but the results were not satisfactory. Since the developers flag this tool as experimental, we chose to remove the rows with NA values instead of imputing them, reducing the number of rows from 68,506 to 42,823.
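Removing incomplete rows can be sketched with pandas as follows. The table is a hypothetical miniature of the combined dataset, and the aggregated column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the combined table: one row per basin.
basins = pd.DataFrame({
    "bws_raw": [0.4, np.nan, 1.2, 0.1],
    "mean_child_mortality": [12.0, 30.5, np.nan, 8.2],  # assumed column name
    "mean_night_light": [3.1, 0.2, 5.6, 7.9],           # assumed column name
})

rows_before = len(basins)
complete = basins.dropna().reset_index(drop=True)  # keep only fully observed rows
rows_after = len(complete)
print(f"rows: {rows_before} -> {rows_after}")
```

Dropping rows is simple and leaves no artificial values behind, at the cost of losing every basin that is missing any single variable.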

The combined dataset has 11 water risk features and 3 aggregated features (mean child mortality, population density, and mean night light). Before performing the cluster analysis, we divided the water risk features into 3 groups:

  • Physical Risk variables: This group comprises variables related to the quality of the water and water systems infrastructure – Untreated connected wastewater (ucw_raw), Unimproved/no drinking water (udw_raw), and Unimproved/no sanitation (usa_raw).
  • Water depletion variables: This group holds variables regarding the scarcity or future scarcity of water – Baseline water stress (bws_raw), Baseline water depletion (bwd_raw), Interannual variability (iav_raw), Seasonal variability (sev_raw), and Groundwater table decline (gtd_raw).
  • Extreme conditions variables: This group has variables related to flood and drought – Riverine flood risk (rfr_raw), Coastal flood risk (cfr_raw), and Drought risk (drr_raw).
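The three groups listed above can be written down as a small lookup structure. This is only a sketch: the column names come from the list, while the dictionary keys and the `columns_for` helper are illustrative names.

```python
# Column names as given in the list above; the dictionary keys are illustrative.
FEATURE_GROUPS = {
    "physical_risk": ["ucw_raw", "udw_raw", "usa_raw"],
    "water_depletion": ["bws_raw", "bwd_raw", "iav_raw", "sev_raw", "gtd_raw"],
    "extreme_conditions": ["rfr_raw", "cfr_raw", "drr_raw"],
}

def columns_for(group):
    """Return the raw water risk columns that belong to one feature group."""
    return list(FEATURE_GROUPS[group])

# Flattened list of all water risk columns across the three groups.
all_water_columns = [col for cols in FEATURE_GROUPS.values() for col in cols]
print(len(all_water_columns), "water risk columns")
```

Keeping the grouping in one place makes it easy to run the same clustering code once per group.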

We applied the StandardScaler to a copy of our dataset before running the K-means clustering algorithm. To determine the optimal number of clusters, we used a for loop that ran the algorithm for 1 to 20 clusters, appending the inertia (the sum of squared distances from each observation to its nearest cluster center) for each number of clusters. We plotted the elbow curve for each of our variable groups to pick the optimal number of clusters (the optimal value should significantly reduce the inertia relative to the previous value, whereas the subsequent value improves it only marginally). From this we chose 10 clusters as the optimal k for every group of variables. This allowed us to record the cluster to which each observation in the dataset belongs.
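The scaling and elbow loop described above might look roughly like this with scikit-learn. The data is synthetic (three well-separated blobs standing in for one variable group), so the elbow it produces says nothing about the real basins; `chosen_k` is set by hand for this toy data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for one variable group (3 columns), not the project data.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 3)) for c in (0.0, 4.0, 8.0)])

X_scaled = StandardScaler().fit_transform(X)  # returns a scaled copy; X is unchanged

inertias = []
for k in range(1, 21):  # 1 to 20 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias.append(model.inertia_)

# Fit the chosen k and keep the cluster label of every observation.
chosen_k = 3  # read off the elbow for this synthetic data; the project chose 10
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(X_scaled)
```

Plotting `inertias` against `range(1, 21)` (for example with matplotlib) gives the elbow curve; the labels can then be stored as a new column next to the original, unscaled values.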

We learned that K-means cluster analysis is not an optimal tool for this kind of data. However, we hypothesized that the clusters it generated could be used as an input to a DBSCAN cluster analysis.

To perform the DBSCAN cluster analysis, the first step was to convert the multipolygons into points. We did that by replacing the geometry of each observation with the centroid of its shape. We also converted the dataset to the EASE-Grid 2.0 projection (EPSG 6933, a global equal-area projection) to get units in meters. Since we wanted the labels from the K-means analysis (which were added to the untransformed dataset), we used the StandardScaler again, transforming the raw variables, the x and y coordinates, and the K-means labels. We used this dataset as the input to the DBSCAN cluster analysis.
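Assembling the DBSCAN input can be sketched as below. All values are made up: the coordinates stand in for projected centroid positions in meters, `raw` for the raw water risk variables, and `kmeans_label` for the K-means labels.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 6
# Hypothetical centroid coordinates in meters for six basins.
x = rng.uniform(-1.7e7, 1.7e7, size=n)
y = rng.uniform(-7.3e6, 7.3e6, size=n)
raw = rng.random((n, 3))                      # stand-in for raw water risk variables
kmeans_label = np.array([0, 3, 3, 7, 1, 9])   # stand-in for K-means labels

# One row per basin: raw variables, then x, y, then the K-means label.
features = np.column_stack([raw, x, y, kmeans_label])
dbscan_input = StandardScaler().fit_transform(features)
```

Scaling matters here because coordinates in meters are many orders of magnitude larger than the raw indicators and would otherwise dominate the distance computation. Note that scaling the K-means labels treats the cluster ids as ordinary numbers.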

The epsilon and minimum samples parameters for the DBSCAN clustering were set by experimentation. In the end, we adopted parameters that generate around 20 clusters, so that the visualization tool would be easier to read. In preparation for web visualization, we reduced the size of the cluster dataset by dissolving the basins into their clusters and simplifying the geometry, and stored the result as a shapefile because of its reduced size. That enabled us to upload the file to GitHub and have our app hosted there.
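Experimenting with the two DBSCAN parameters can be organized as a small sweep that reports how many clusters and noise points each setting yields. The data and the parameter grid below are synthetic and chosen only for illustration; they are not the values used in the project.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
# Synthetic, already-scaled input: three dense blobs plus scattered points.
blobs = [rng.normal(loc=c, scale=0.15, size=(80, 2)) for c in ((-2, -2), (0, 2), (2, -1))]
scatter = rng.uniform(-4, 4, size=(30, 2))
X = np.vstack(blobs + [scatter])

def summarize(labels):
    """Count clusters and noise points; DBSCAN marks noise with label -1."""
    n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {}
for eps in (0.2, 0.4, 0.8):
    for min_samples in (5, 15):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        results[(eps, min_samples)] = summarize(labels)

for params, (n_clusters, n_noise) in results.items():
    print(params, "clusters:", n_clusters, "noise:", n_noise)
```

A table like this makes it easier to pick a setting that lands near a target number of clusters while keeping the share of noise points in view.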

Conclusions and final remarks

The result of our cluster analysis was less revealing than we had hoped. There are some visible patterns, though. In South America, we can see different clusters for the Amazon region and for the arid region of Northeast Brazil, and a separate cluster for Patagonia. The NAs we had to exclude from our dataset were mainly in the north of the African continent. Most of Africa was classified as noise. That also happened in India and Scandinavia.

Looking at clusters 0, 1, and 2 (the 3 clusters in Africa), we can see that the analysis did work: the bar chart shows that, aside from Untreated Connected Wastewater, all other variables have similar values. If we had more time, we would use it to better understand the relationships between these variables and make a more careful feature selection.
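A comparison like the one shown in the bar chart can be reproduced by averaging each variable per cluster. The sketch below uses invented numbers and an assumed label column named `cluster`; it does not reproduce the project's results.

```python
import pandas as pd

# Hypothetical basin table; "cluster" is an assumed name for the DBSCAN label column.
basins = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, -1],
    "ucw_raw": [0.9, 0.7, 0.2, 0.4, 0.5, 0.1],
    "udw_raw": [0.30, 0.34, 0.31, 0.33, 0.32, 0.90],
})

clustered = basins[basins["cluster"] != -1]   # drop noise points (label -1)
profile = clustered.groupby("cluster").mean() # one row of variable means per cluster
spread = profile.max() - profile.min()        # range of cluster means per variable
print(profile)
print(spread)
```

Variables with a large spread are the ones that separate the clusters; in this toy table only `ucw_raw` differs between clusters.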

User Notes

An attempt to classify global water risk typologies. Final project for CPLN 691 Geospatial Data Science in Python.


Create the water-risk-classification virtual environment with conda env create -f environment.yml

Update the environment with conda env update --name wrc --file environment.yml --prune

Git Usage

  1. Download updates with git pull
  2. Make changes to files.
  3. Stage changes with git add .
  4. Create commit with git commit -m "commit message"
  5. Push commit with git push

If you get an error like "error: Pulling is not possible because you have unmerged files.", use git reset --hard origin/master to overwrite your local version. Save any updated code outside of the water-risk-classification folder first, because the reset discards local changes.

Data Sources

  1. WRI Aqueduct
  2. Global Human Settlements Layer
  3. SEDAC Infant Mortality
  4. Version 4 DMSP-OLS Nighttime Lights Time Series