# Bayesian biodiversity: PREDICTS data processing

TODO: Change from pandas to polars

In [3]:
import pandas as pd
import geopandas as geopd
from shapely.geometry import Point

In [4]:
# Load black for formatting
import jupyter_black
jupyter_black.load()

# Adjust display settings for pandas
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

## Metadata description

**Identifiers**
- `_id`: ?
- `Source_ID`: ID for the Data Source. Unique.
- `Reference`: Reference for the Data Source in the main text.
- `Study_number`: Between 1 and n for n Studies within Data Source. 
- `Study_name`: Unique within Source_ID. 
- `Block`: Within a Study, either empty for all Sites, or non-empty for all Sites with at least two different values among them.
- `Site_number`: Between 1 and n for n Sites within Study. Unique within Study.
- `Site_name`: Unique within Study. Where requested by data providers, the names of some Sites have been replaced with 'Site ' + Site_number.
- `SS`: Concatenation of Source_ID and Study_number.
- `SSS`: Concatenation of Source_ID, Study_number and Site_number.
- `SSB`: Concatenation of Source_ID, Study_number and Block.
- `SSBS`: Concatenation of Source_ID, Study_number, Block and Site_number.
<br>
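
The concatenated identifiers can be sketched as follows. The single-space separator is an assumption inferred from sample rows in the merged data (for example `AD1_2008__Billeter 8 F2 32`). How an empty `Block` is rendered is not confirmed here, so the hypothetical helper `build_ids` only covers the non-empty case.

```python
def build_ids(source_id: str, study_number: int, block: str, site_number: int) -> dict:
    """Hypothetical helper: rebuild SS, SSS, SSB and SSBS from their parts.

    Assumes a single space as separator and a non-empty Block.
    """
    ss = f"{source_id} {study_number}"
    return {
        "SS": ss,
        "SSS": f"{ss} {site_number}",
        "SSB": f"{ss} {block}",
        "SSBS": f"{ss} {block} {site_number}",
    }


# Parts taken from a sample row of the merged data
ids = build_ids("AD1_2008__Billeter", 8, "F2", 32)
print(ids)
```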

**Geography**
- `Longitude`: Where requested by data providers, the coordinates for some Sites have been removed from the data extract.
- `Latitude`: Where requested by data providers, the coordinates for some Sites have been removed from the data extract.
- `Coordinates_method`: One of: "Direct from publication / author" / "Georeferenced".
- `Country`: Coordinates matched to a World Borders GIS polygon.
- `Country_distance_metres`: If zero, Site latitude and longitude were within the matching World Borders 0.3 (Thematic Mapping 2008) GIS polygon. If greater than zero, the value is the distance in metres to the nearest World Borders GIS polygon.
- `UN_region`: Coordinates matched to a World Borders GIS polygon.
- `UN_subregion`: Coordinates matched to a World Borders GIS polygon.
<br>

**Biogeography**
- `Realm`: Coordinates matched to an ecoregions GIS polygon.
- `Biome`: Coordinates matched to an ecoregions GIS polygon.
- `Ecoregion`: Coordinates matched to an ecoregions GIS polygon.
- `Ecoregion_distance_metres`: If zero, Site latitude and longitude were within the matching Terrestrial ecoregions of the world (The Nature Conservancy 2009) GIS polygon. If greater than zero, the value is the distance in metres to the nearest ecoregions GIS polygon.
- `Wilderness_area`: Coordinates matched to a high biodiversity wilderness areas (Mittermeier et al. 2003) GIS polygon. Empty if Site did not fall within a wilderness area polygon.
- `Hotspot`: Coordinates matched to a biodiversity hotspots (Conservation International Foundation 2011) GIS polygon. Empty if Site did not fall within a hotspot polygon.
<br>

**Study scope**
- `Study_common_taxon`: The Kingdom, Phylum, Class, Order, Family, Genus or Species that is common to all taxa within this Study. Empty for Studies that examined taxa in multiple kingdoms.
- `Rank_of_study_common_taxon`: The lowest taxonomic Rank that is common to all taxa within this Study. Empty for Studies that examined taxa in multiple kingdoms.
<br>

**Sampling approach**
- `Sample_start_earliest`: In the form YYYY-MM-DD. 
- `Sample_end_latest`: In the form YYYY-MM-DD. Value greater than or equal to Sample_start_earliest.
- `Sample_midpoint`: Mid-point of Sample_start_earliest and Sample_end_latest. 
- `Sample_date_resolution`: One of: day / month / year.
- `Sampling_method`
- `Sampling_effort`: In units given in Sampling_effort_unit. Where sampling effort did not vary among Sites within a Study, the Sampling_effort was set to 1. If present, a value greater than zero.
- `Rescaled_sampling_effort`: Sampling effort rescaled to be between 0 and 1 within the Study, i.e., Sampling_effort / max(Sampling_effort for this Study). If present, a value greater than zero.
- `Sampling_effort_unit`: The unit in which Sampling_effort is expressed.
- `Max_linear_extent_metres`: The maximum linear extent of sampling in metres.
- `Transect_details`: Free text. Where requested by data providers, the transect details of some Sites have been removed from the data extract.
<br>
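
A minimal sketch of the rescaling described for `Rescaled_sampling_effort`, using made-up toy values. It assumes that `SS` identifies a Study. The database already ships this column, so this only illustrates the definition.

```python
import pandas as pd

# Toy data (invented for illustration): two Studies, one with varying effort
df_toy = pd.DataFrame(
    {
        "SS": ["A 1", "A 1", "B 1", "B 1"],
        "Sampling_effort": [2.0, 4.0, 5.0, 5.0],
    }
)

# Divide each effort by the maximum effort within its Study
max_effort = df_toy.groupby("SS")["Sampling_effort"].transform("max")
df_toy["Rescaled_sampling_effort"] = df_toy["Sampling_effort"] / max_effort

print(df_toy)
```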

**Taxonomy**
- `Taxon`: Matching taxon in the Catalogue of Life 2013 checklist.
- `Taxon_number`: Between 1 and n for n taxa within Study.
- `Taxon_name_entered`: Name of the taxon as provided by the data contributor.
- `Parsed_name`: The result of parsing Taxon_name_entered.
- `Best_guess_binomial`: COL did not recognize all of the Latin binomials that were given to us so we employed the following scheme: The value of the Species column if Rank contains 'Species'. The first two words of the Species column if Rank contains 'Infraspecies'. The first two words of the Parsed_name column if Rank contains neither 'Infraspecies' nor 'Species' and Parsed_name contains two or more words. Empty in other cases.
- `COL_ID`: The ID of Taxon in COL (Catalogue of Life). 
- `Kingdom`: From COL.
- `Phylum`: From COL.
- `Class`: From COL.
- `Order`: From COL.
- `Family`: From COL.
- `Genus`: From COL.
- `Species`: From COL.
- `Higher_taxon`: The higher-taxonomic group that this taxon belongs to.
- `Indication`: A free-text description of the higher taxonomic group of this taxon.
- `Name_status`: From COL.
- `Rank`: From COL.
<br>

**Diversity metrics**
- `Diversity_metric_type`: One of: Abundance / Occurrence / Species richness.
- `Diversity_metric`
- `Diversity_metric_is_effort_sensitive`
- `Diversity_metric_is_suitable_for_Chao`
- `Diversity_metric_unit`
- `Measurement`: The biodiversity measurement of the Taxon at the Site in the Study, in units of Diversity_metric_unit.
- `Effort_corrected_measurement`: Where Diversity_metric_is_effort_sensitive is TRUE, the biodiversity measurement corrected for sampling effort (i.e., Measurement / Rescaled_sampling_effort). Where Diversity_metric_is_effort_sensitive is FALSE, the same value as Measurement.
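
The effort correction described for `Effort_corrected_measurement` can be sketched on toy values as below. The numbers are invented. The column is already present in the database, so this is only a check of the stated rule, not the database's own code.

```python
import pandas as pd

# Toy data (invented for illustration)
df_toy = pd.DataFrame(
    {
        "Measurement": [10.0, 3.0, 1.0],
        "Rescaled_sampling_effort": [0.5, 1.0, 0.25],
        "Diversity_metric_is_effort_sensitive": [True, True, False],
    }
)

sensitive = df_toy["Diversity_metric_is_effort_sensitive"]
corrected = df_toy["Measurement"] / df_toy["Rescaled_sampling_effort"]

# Keep Measurement where the metric is not effort sensitive,
# otherwise use Measurement / Rescaled_sampling_effort
df_toy["Effort_corrected_measurement"] = df_toy["Measurement"].where(
    ~sensitive, corrected
)

print(df_toy)
```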

**Habitat / land use**
- `Predominant_land_use`: One of: Primary vegetation / Young secondary vegetation / Intermediate secondary vegetation / Mature secondary vegetation / Secondary vegetation (indeterminate age) / Plantation forest / Pasture / Cropland / Urban / Cannot decide.
- `Source_for_predominant_land_use`: One of: Direct from publication or author / Google maps. May be empty for data collated before this information was captured.
- `Use_intensity`: One of: Minimal use / Light use / Intense use / Cannot decide.
- `Habitat_as_described`: Free text description of habitat. Where requested by data providers, the habitat descriptions of some Sites have been removed from the data extract.
- `Habitat_patch_area_square_metres`: Habitat_patch_area expressed in square metres.
- `Km_to_nearest_edge_of_habitat`: Distance in km to the nearest edge of habitat supporting high diversity. A negative value indicates that the Site was within the high-diversity habitat.
- `Years_since_fragmentation_or_conversion`: Years since fragmentation or conversion to present land cover (Primary habitat) or since start of recovery (Secondary habitat).

## Load and merge the two releases of the database

### 2016 release

https://data.nhm.ac.uk/dataset/the-2016-release-of-the-predicts-database-v1-1

**Summary**
- 3,278,056 measurements
- 26,194 sampling locations
- 94 countries
- 47,089 species
- Based on 480 studies

In [5]:
# Load the original predicts data
df_predicts_orig = pd.read_csv("../../data/PREDICTS/PREDICTS_2016/data.csv")


In [6]:
df_predicts_orig.shape

(3278056, 68)

### 2022 release of additional data

https://data.nhm.ac.uk/dataset/release-of-data-added-to-the-predicts-database-november-2022

**Summary**
- 1,040,752 measurements
- 9,544 sampling locations
- 46 countries
- 10,635 species
- Based on 115 studies

In [7]:
# Load the new 2022 predicts data
df_predicts_new = pd.read_csv("../../data/PREDICTS/PREDICTS_2022/data.csv")


In [8]:
df_predicts_new.shape

(1040752, 72)

### Merge 2016 and 2022 data

In [9]:
# Find out if there are any columns that are not overlapping
unique_2016 = list(set(df_predicts_orig.columns) - set(df_predicts_new.columns))
unique_2022 = list(set(df_predicts_new.columns) - set(df_predicts_orig.columns))
print(unique_2016)
print(unique_2022)

[]
['Max_linear_extent', 'Eco_region_distance_metres', 'Predominant_habitat', 'Source_for_predominant_habitat']


In [10]:
# Drop non-overlapping columns from 2022 dataframe
df_predicts_new = df_predicts_new.drop(
    [
        "Max_linear_extent",
        "Eco_region_distance_metres",
        "Predominant_habitat",
        "Source_for_predominant_habitat",
    ],
    axis="columns",
)

# Make sure we have the same column order
df_predicts_new = df_predicts_new[df_predicts_orig.columns]

# Append new data to old with matching column order
df_predicts = pd.concat([df_predicts_orig, df_predicts_new], ignore_index=True)
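
The align-and-append step can be sketched on toy frames, together with a row-count check. With the shapes reported above, the merged frame should have 3,278,056 + 1,040,752 = 4,318,808 rows. `align_and_concat` and the toy columns are hypothetical names used only for this illustration.

```python
import pandas as pd


def align_and_concat(df_old: pd.DataFrame, df_new: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the old release's columns, in order, then stack."""
    missing = set(df_old.columns) - set(df_new.columns)
    if missing:
        raise ValueError(f"Columns missing from the new release: {sorted(missing)}")

    # Selecting by the old column index drops extra columns and fixes the order
    merged = pd.concat([df_old, df_new[df_old.columns]], ignore_index=True)

    # No rows should be lost or duplicated by the append
    assert len(merged) == len(df_old) + len(df_new)
    return merged


# Toy frames (invented): the new one has an extra column and a different order
df_old_toy = pd.DataFrame({"SSBS": ["a 1 1"], "Measurement": [1.0]})
df_new_toy = pd.DataFrame({"Measurement": [2.0], "Extra": ["x"], "SSBS": ["b 1 1"]})

df_merged_toy = align_and_concat(df_old_toy, df_new_toy)
print(df_merged_toy)
```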

In [11]:
# Reorganize the columns in the df to a logical structure
# See the metadata description at the top of this notebook for details
col_order = [
    "_id",
    "Source_ID",
    "Reference",
    "Study_number",
    "Study_name",
    "Block",
    "Site_number",
    "Site_name",
    "SS",
    "SSS",
    "SSB",
    "SSBS",
    "Longitude",
    "Latitude",
    "Coordinates_method",
    "Country",
    "Country_distance_metres",
    "UN_region",
    "UN_subregion",
    "Realm",
    "Biome",
    "Ecoregion",
    "Ecoregion_distance_metres",
    "Wilderness_area",
    "Hotspot",
    "Study_common_taxon",
    "Rank_of_study_common_taxon",
    "Sample_start_earliest",
    "Sample_end_latest",
    "Sample_midpoint",
    "Sample_date_resolution",
    "Sampling_method",
    "Sampling_effort",
    "Rescaled_sampling_effort",
    "Sampling_effort_unit",
    "Max_linear_extent_metres",
    "Transect_details",
    "Taxon",
    "Taxon_number",
    "Taxon_name_entered",
    "Parsed_name",
    "Best_guess_binomial",
    "COL_ID",
    "Kingdom",
    "Phylum",
    "Class",
    "Order",
    "Family",
    "Genus",
    "Species",
    "Higher_taxon",
    "Indication",
    "Name_status",
    "Rank",
    "Diversity_metric_type",
    "Diversity_metric",
    "Diversity_metric_is_effort_sensitive",
    "Diversity_metric_is_suitable_for_Chao",
    "Diversity_metric_unit",
    "Measurement",
    "Effort_corrected_measurement",
    "Predominant_land_use",
    "Source_for_predominant_land_use",
    "Use_intensity",
    "Habitat_as_described",
    "Habitat_patch_area_square_metres",
    "Km_to_nearest_edge_of_habitat",
    "Years_since_fragmentation_or_conversion",
]

df_predicts = df_predicts[col_order]
df_predicts.head()

Unnamed: 0,_id,Source_ID,Reference,Study_number,Study_name,Block,Site_number,Site_name,SS,SSS,SSB,SSBS,Longitude,Latitude,Coordinates_method,Country,Country_distance_metres,UN_region,UN_subregion,Realm,Biome,Ecoregion,Ecoregion_distance_metres,Wilderness_area,Hotspot,Study_common_taxon,Rank_of_study_common_taxon,Sample_start_earliest,Sample_end_latest,Sample_midpoint,Sample_date_resolution,Sampling_method,Sampling_effort,Rescaled_sampling_effort,Sampling_effort_unit,Max_linear_extent_metres,Transect_details,Taxon,Taxon_number,Taxon_name_entered,Parsed_name,Best_guess_binomial,COL_ID,Kingdom,Phylum,Class,Order,Family,Genus,Species,Higher_taxon,Indication,Name_status,Rank,Diversity_metric_type,Diversity_metric,Diversity_metric_is_effort_sensitive,Diversity_metric_is_suitable_for_Chao,Diversity_metric_unit,Measurement,Effort_corrected_measurement,Predominant_land_use,Source_for_predominant_land_use,Use_intensity,Habitat_as_described,Habitat_patch_area_square_metres,Km_to_nearest_edge_of_habitat,Years_since_fragmentation_or_conversion
0,26004,AD1_2008__Billeter,Billeter et al. 2008,8,Greenveins2001_France02,F2,32,F2.P,AD1_2008__Billeter 8,AD1_2008__Billeter 8 32,AD1_2008__Billeter 8 F2,AD1_2008__Billeter 8 F2 32,-1.590365,48.472153,Direct from publication / author,France,0.0,Europe,Western Europe,Palearctic,Temperate Broadleaf & Mixed Forests,Atlantic Mixed Forests,0.0,,,Hymenoptera,Order,2002-01-01,2002-12-31,2002-07-02,year,flight trap,5.0,1.0,week,1414.214,Ecotone between a Green-veins habitat and an a...,Lasioglossum morio,49,Lasioglossum morio,Lasioglossum morio,Lasioglossum morio,6967008.0,Animalia,Arthropoda,Insecta,Hymenoptera,Halictidae,Lasioglossum,morio,Hymenoptera,Hymenoptera: Apidae sensu lato,accepted name,Species,Abundance,abundance,True,True,individuals,0.0,0.0,Cropland,Direct from publication / author,Minimal use,,,,13.5
1,26006,AD1_2008__Billeter,Billeter et al. 2008,8,Greenveins2001_France02,F2,32,F2.P,AD1_2008__Billeter 8,AD1_2008__Billeter 8 32,AD1_2008__Billeter 8 F2,AD1_2008__Billeter 8 F2 32,-1.590365,48.472153,Direct from publication / author,France,0.0,Europe,Western Europe,Palearctic,Temperate Broadleaf & Mixed Forests,Atlantic Mixed Forests,0.0,,,Hymenoptera,Order,2002-01-01,2002-12-31,2002-07-02,year,flight trap,5.0,1.0,week,1414.214,Ecotone between a Green-veins habitat and an a...,Lasioglossum pauxillum,51,Lasioglossum pauxillum,Lasioglossum pauxillum,Lasioglossum pauxillum,6967187.0,Animalia,Arthropoda,Insecta,Hymenoptera,Halictidae,Lasioglossum,pauxillum,Hymenoptera,Hymenoptera: Apidae sensu lato,accepted name,Species,Abundance,abundance,True,True,individuals,0.0,0.0,Cropland,Direct from publication / author,Minimal use,,,,13.5
2,26024,AD1_2008__Billeter,Billeter et al. 2008,8,Greenveins2001_France02,F3,33,F3.A,AD1_2008__Billeter 8,AD1_2008__Billeter 8 33,AD1_2008__Billeter 8 F3,AD1_2008__Billeter 8 F3 33,-1.610663,48.540593,Direct from publication / author,France,0.0,Europe,Western Europe,Palearctic,Temperate Broadleaf & Mixed Forests,Atlantic Mixed Forests,0.0,,,Hymenoptera,Order,2002-01-01,2002-12-31,2002-07-02,year,flight trap,5.0,1.0,week,1414.214,Ecotone between a Green-veins habitat and an a...,Andrena helvola,11,Andrena helvola,Andrena helvola,Andrena helvola,6960605.0,Animalia,Arthropoda,Insecta,Hymenoptera,Andrenidae,Andrena,helvola,Hymenoptera,Hymenoptera: Apidae sensu lato,accepted name,Species,Abundance,abundance,True,True,individuals,0.0,0.0,Cropland,Direct from publication / author,Light use,,,,63.5
3,26031,AD1_2008__Billeter,Billeter et al. 2008,8,Greenveins2001_France02,F3,33,F3.A,AD1_2008__Billeter 8,AD1_2008__Billeter 8 33,AD1_2008__Billeter 8 F3,AD1_2008__Billeter 8 F3 33,-1.610663,48.540593,Direct from publication / author,France,0.0,Europe,Western Europe,Palearctic,Temperate Broadleaf & Mixed Forests,Atlantic Mixed Forests,0.0,,,Hymenoptera,Order,2002-01-01,2002-12-31,2002-07-02,year,flight trap,5.0,1.0,week,1414.214,Ecotone between a Green-veins habitat and an a...,Andrena ovatula,18,Andrena ovatula,Andrena ovatula,Andrena ovatula,6960904.0,Animalia,Arthropoda,Insecta,Hymenoptera,Andrenidae,Andrena,ovatula,Hymenoptera,Hymenoptera: Apidae sensu lato,accepted name,Species,Abundance,abundance,True,True,individuals,0.0,0.0,Cropland,Direct from publication / author,Light use,,,,63.5
4,26032,AD1_2008__Billeter,Billeter et al. 2008,8,Greenveins2001_France02,F3,33,F3.A,AD1_2008__Billeter 8,AD1_2008__Billeter 8 33,AD1_2008__Billeter 8 F3,AD1_2008__Billeter 8 F3 33,-1.610663,48.540593,Direct from publication / author,France,0.0,Europe,Western Europe,Palearctic,Temperate Broadleaf & Mixed Forests,Atlantic Mixed Forests,0.0,,,Hymenoptera,Order,2002-01-01,2002-12-31,2002-07-02,year,flight trap,5.0,1.0,week,1414.214,Ecotone between a Green-veins habitat and an a...,Andrena,19,Andrena spinigera,Andrena spinigera,Andrena spinigera,13049592.0,Animalia,Arthropoda,Insecta,Hymenoptera,Andrenidae,Andrena,,Hymenoptera,Hymenoptera: Apidae sensu lato,accepted name,Genus,Abundance,abundance,True,True,individuals,0.0,0.0,Cropland,Direct from publication / author,Light use,,,,63.5


In [16]:
df_predicts.shape

(4318808, 68)

In [12]:
# Save the merged dataframe as a csv file
df_predicts.to_csv("../../data/PREDICTS/merged_data.csv", index=False)

## Extract coordinate data to use with raster data 

In [13]:
# Get the coordinates for each unique site
# (min() assumes the coordinates are constant within a site)
df_site_long_lat = (
    df_predicts.groupby("SSBS")[["Longitude", "Latitude"]].min().reset_index()
)

# Generate coordinate tuples from dataframe
coordinates = zip(
    df_site_long_lat["Longitude"].tolist(), df_site_long_lat["Latitude"].tolist()
)

# Create point geometries
geometry = [Point(x, y) for x, y in coordinates]

# Create a geodataframe containing site id and coordinates (WGS 84)
gdf_sites = geopd.GeoDataFrame(
    {"SSBS": df_site_long_lat["SSBS"], "geometry": geometry}, crs="EPSG:4326"
)

# Add the UN region to enable filtering later
df_site_region = df_predicts[["SSBS", "UN_region"]].groupby("SSBS")["UN_region"].min()
gdf_sites = gdf_sites.join(df_site_region, on="SSBS", how="left", validate="1:1")

# Save as shapefile
gdf_sites.to_file("../../data/PREDICTS/site_coordinates/site_coord.shp")
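
Taking the per-column minimum within each `SSBS` assumes that every row of a site carries the same coordinates. Otherwise the minimum longitude and the minimum latitude could come from different rows. A minimal check of that assumption, shown on invented toy data, could look like this:

```python
import pandas as pd

# Toy data (invented): site "s2" has two different longitudes
df_toy = pd.DataFrame(
    {
        "SSBS": ["s1", "s1", "s2", "s2"],
        "Longitude": [10.0, 10.0, 20.0, 20.5],
        "Latitude": [1.0, 1.0, 2.0, 2.0],
    }
)

# Count distinct coordinate values per site (missing values are ignored)
n_unique = df_toy.groupby("SSBS")[["Longitude", "Latitude"]].nunique()

# Sites with more than one distinct longitude or latitude
inconsistent_sites = n_unique[(n_unique > 1).any(axis=1)].index.tolist()
print(inconsistent_sites)
```

Running the same check on `df_predicts` before the `groupby(...).min()` step would show whether any site needs closer inspection.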

In [14]:
gdf_sites.head(10)

| | SSBS | geometry | UN_region |
|---|---|---|---|
| 0 | AD1_2001__Liow 1 1 | POINT (103.77861 1.35194) | Asia |
| 1 | AD1_2001__Liow 1 2 | POINT (103.80806 1.35472) | Asia |
| 2 | AD1_2001__Liow 1 3 | POINT (103.81167 1.39472) | Asia |
| 3 | AD1_2001__Liow 1 4 | POINT (103.78722 1.32694) | Asia |
| 4 | AD1_2001__Liow 1 5 | POINT (103.80361 1.28278) | Asia |
| 5 | AD1_2001__Liow 2 1 | POINT (103.77861 1.35194) | Asia |
| 6 | AD1_2001__Liow 2 2 | POINT (103.80806 1.35472) | Asia |
| 7 | AD1_2001__Liow 2 3 | POINT (103.81167 1.39472) | Asia |
| 8 | AD1_2001__Liow 2 4 | POINT (103.78722 1.32694) | Asia |
| 9 | AD1_2001__Liow 2 5 | POINT (103.80361 1.28278) | Asia |


In [15]:
gdf_sites.shape

(35736, 3)