<a href="https://colab.research.google.com/github/marqub/gbif-species-distribution-analysis/blob/main/notebooks/phase-1-data-exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%capture
# Install pandas-profiling from the master branch on GitHub
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
# Install the libraries used in this notebook
!pip install numpy pandas streamlit gdown pyarrow
# Install GitPython, which provides the `git` module imported below
!pip install GitPython
# Upgrade pandas-profiling to the latest release on PyPI
!pip install --upgrade pandas-profiling
# Show the installed version of pandas-profiling
!pip show pandas-profiling
# Install pytz, a library for timezone handling
!pip install pytz

In [2]:
# Import the markupsafe library
import markupsafe
print(markupsafe.__version__)
# Import necessary libraries for the project
import pandas as pd
import numpy as np
# Import GitPython (module name `git`) to clone the data repository
import git
# Import the os library for file and directory management
import os
# Import the zipfile library to handle zip files
import zipfile
# Import the ProfileReport class from the pandas_profiling library for creating profiling reports
from pandas_profiling import ProfileReport
# Import the pytz library for timezone handling
import pytz as tz

2.0.1


In [3]:
# Show all columns (instead of truncating the columns in the middle)
pd.set_option("display.max_columns", None)
# Don't show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

In [4]:
# Load GBIF dataset
repo = git.Repo.clone_from("https://github.com/marqub/gbif-species-distribution-analysis.git", "gbif-species-distribution-analysis")

In [5]:
os.chdir("gbif-species-distribution-analysis/data")
with zipfile.ZipFile("gbif_data_2016to2022_northamerica.csv.zip", "r") as zip_ref:
    zip_ref.extractall(".")

In [6]:
df = pd.read_csv("0233944-220831081235567.csv", sep='\t', on_bad_lines='skip')



In [7]:
# Explore the data
# Use df.info() to get a summary of the data types and missing values in the dataset
print(df.info())
# Use df.head() to view the first few rows of the dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616318 entries, 0 to 616317
Data columns (total 50 columns):
 #   Column                            Non-Null Count   Dtype  
---  ------                            --------------   -----  
 0   gbifID                            616318 non-null  int64  
 1   datasetKey                        616318 non-null  object 
 2   occurrenceID                      616301 non-null  object 
 3   kingdom                           616318 non-null  object 
 4   phylum                            613735 non-null  object 
 5   class                             563890 non-null  object 
 6   order                             540768 non-null  object 
 7   family                            591741 non-null  object 
 8   genus                             530786 non-null  object 
 9   species                           455886 non-null  object 
 10  infraspecificEpithet              21782 non-null   object 
 11  taxonRank                         616318 non-null  o

Unnamed: 0,gbifID,datasetKey,occurrenceID,kingdom,phylum,class,order,family,genus,species,infraspecificEpithet,taxonRank,scientificName,verbatimScientificName,verbatimScientificNameAuthorship,countryCode,locality,stateProvince,occurrenceStatus,individualCount,publishingOrgKey,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,coordinatePrecision,elevation,elevationAccuracy,depth,depthAccuracy,eventDate,day,month,year,taxonKey,speciesKey,basisOfRecord,institutionCode,collectionCode,catalogNumber,recordNumber,identifiedBy,dateIdentified,license,rightsHolder,recordedBy,typeStatus,establishmentMeans,lastInterpreted,mediaType,issue
0,2885135301,90d8babc-685f-449e-a4ec-7275ca7655c7,871d7336-1a5b-4926-8d12-c2102f55975b,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,Bombus johanseni,,SPECIES,"Bombus johanseni (Sladen, 1919) Sladen, 1919",Bombus johanseni Sladen,Sladen,CA,"Kitikmeot, Cambridge Bay",Nunavut,PRESENT,1.0,39fd7088-af63-4ad5-8d30-479a720a368b,69.13,-105.06,,,,,,,2018-08-09T00:00:00,9.0,8.0,2018,10827632,10827632.0,PRESERVED_SPECIMEN,CBG,,DCHAR2640-19,,Cory S. Sheffield,2020-01-01T00:00:00,CC0_1_0,,Collector(s): CBG Team 3,,,2022-11-25T07:54:38.249Z,,OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COU...
1,2885135303,90d8babc-685f-449e-a4ec-7275ca7655c7,60874093-5130-4d84-baa4-eafd4de80718,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,Bombus johanseni,,SPECIES,"Bombus johanseni (Sladen, 1919) Sladen, 1919",Bombus johanseni Sladen,Sladen,CA,"Sach's Harbour, Banks Island",Northwest Territories,PRESENT,1.0,39fd7088-af63-4ad5-8d30-479a720a368b,71.99,-125.25,,,,,,,2018-07-09T00:00:00,9.0,7.0,2018,10827632,10827632.0,PRESERVED_SPECIMEN,RSKM,ENT,RSKM_ENT_E-199719,,Cory S. Sheffield,2020-01-01T00:00:00,CC_BY_4_0,,Collector(s): J.M. Heron,,,2022-11-25T07:54:38.509Z,,OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COU...
2,2885135305,90d8babc-685f-449e-a4ec-7275ca7655c7,6b69e42c-2710-46bb-8581-8a6c6f5701dc,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,Bombus johanseni,,SPECIES,"Bombus johanseni (Sladen, 1919) Sladen, 1919",Bombus johanseni Sladen,Sladen,CA,"Sach's Harbour, Banks Island",Northwest Territories,PRESENT,1.0,39fd7088-af63-4ad5-8d30-479a720a368b,71.99,-125.29,,,,,,,2018-07-07T00:00:00,7.0,7.0,2018,10827632,10827632.0,PRESERVED_SPECIMEN,RSKM,ENT,RSKM_ENT_E-199704,,Cory S. Sheffield,2020-01-01T00:00:00,CC_BY_4_0,,Collector(s): J.M. Heron,,,2022-11-25T07:54:38.551Z,,OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COU...
3,2885135306,90d8babc-685f-449e-a4ec-7275ca7655c7,426e6fd4-8074-41a4-82ef-8c796837d99e,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,Bombus johanseni,,SPECIES,"Bombus johanseni (Sladen, 1919) Sladen, 1919",Bombus johanseni Sladen,Sladen,CA,"Sach's Harbour, Banks Island",Northwest Territories,PRESENT,1.0,39fd7088-af63-4ad5-8d30-479a720a368b,71.99,-125.24,,,,,,,2018-07-04T00:00:00,4.0,7.0,2018,10827632,10827632.0,PRESERVED_SPECIMEN,RSKM,ENT,RSKM_ENT_E-199663,,Cory S. Sheffield,2020-01-01T00:00:00,CC_BY_4_0,,Collector(s): J.M. Heron,,,2022-11-25T07:54:38.562Z,,OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COU...
4,2885135307,90d8babc-685f-449e-a4ec-7275ca7655c7,0c105d9b-1325-4f81-9b29-158ac8eeb6f9,Animalia,Arthropoda,Insecta,Hymenoptera,Apidae,Bombus,Bombus johanseni,,SPECIES,"Bombus johanseni (Sladen, 1919) Sladen, 1919",Bombus johanseni Sladen,Sladen,CA,"Sach's Harbour, Banks Island",Northwest Territories,PRESENT,1.0,39fd7088-af63-4ad5-8d30-479a720a368b,71.99,-125.24,,,,,,,2018-07-04T00:00:00,4.0,7.0,2018,10827632,10827632.0,PRESERVED_SPECIMEN,RSKM,ENT,RSKM_ENT_E-199662,,Cory S. Sheffield,2020-01-01T00:00:00,CC_BY_4_0,,Collector(s): J.M. Heron,,,2022-11-25T07:54:38.582Z,,OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COU...


In [8]:
# Split the values in the 'issue' column by ';'
tokens = df['issue'].str.split(';', expand=True).stack()
# Count the occurrences of each token
counts = tokens.value_counts()
# Print the counts
print(counts)

INSTITUTION_MATCH_FUZZY                              276698
GEODETIC_DATUM_ASSUMED_WGS84                         224727
OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT     200302
COLLECTION_MATCH_FUZZY                               159655
AMBIGUOUS_COLLECTION                                 120122
COORDINATE_ROUNDED                                    54520
INSTITUTION_COLLECTION_MISMATCH                       50669
TAXON_MATCH_HIGHERRANK                                22490
GEODETIC_DATUM_INVALID                                20759
INSTITUTION_MATCH_NONE                                20486
COORDINATE_REPROJECTED                                19328
COLLECTION_MATCH_NONE                                 17561
COORDINATE_PRECISION_INVALID                          17124
DIFFERENT_OWNER_INSTITUTION                           13675
AMBIGUOUS_INSTITUTION                                 10326
PRESUMED_NEGATED_LONGITUDE                             7154
RECORDED_DATE_INVALID                   
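The counting step above can be illustrated on a tiny example. This is a minimal sketch on made-up data (the `issues` Series is hypothetical); it uses `explode` instead of `expand=True` plus `stack`, which should give the same counts.

```python
import pandas as pd

# Toy stand-in for the 'issue' column: ';'-separated flags, with one missing value
issues = pd.Series([
    "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84",
    "COORDINATE_ROUNDED",
    None,
])

# Split each row into a list of flags, then give every flag its own row
tokens = issues.str.split(";").explode().dropna()

# Count how often each flag occurs
issue_counts = tokens.value_counts()
print(issue_counts)
```

Rows without any issue are dropped before counting, so the totals only reflect flagged records.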

In [9]:
# Iterate over the specified columns
for col in ['gbifID', 'datasetKey', 'occurrenceID']:
    # Count the number of unique values in the column
    uniq = df[col].nunique()
    # Count the number of non-null values in the column
    nonnull = df[col].count()
    # Print the results
    print(f"Column {col}: {uniq} unique values, {nonnull} non-null values")

Column gbifID: 616318 unique values, 616318 non-null values
Column datasetKey: 268 unique values, 616318 non-null values
Column occurrenceID: 611549 unique values, 616301 non-null values
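According to the output above, `occurrenceID` has 611549 unique values among 616301 non-null ones, so 4752 values repeat an earlier identifier. A minimal sketch of how such repeats could be located, using a hypothetical toy frame rather than the real data:

```python
import pandas as pd

# Hypothetical miniature of the occurrence table
toy = pd.DataFrame({
    "gbifID": [1, 2, 3, 4],
    "occurrenceID": ["a", "b", "b", None],
})

# Number of non-null identifiers that repeat an earlier one
ids = toy["occurrenceID"].dropna()
n_repeated = int(ids.count() - ids.nunique())

# keep=False marks every member of a duplicated group, not only the later ones
is_dup = toy["occurrenceID"].notna() & toy.duplicated("occurrenceID", keep=False)
dup_rows = toy[is_dup]
print(n_repeated, dup_rows["gbifID"].tolist())
```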


In [10]:
df = df.drop(columns=[
    "verbatimScientificNameAuthorship",
    "verbatimScientificName",
    "recordNumber",
    "catalogNumber",
    "taxonKey",
    "speciesKey",
    "year",
    "month",
    "day",
    "datasetKey",
    "occurrenceID"
])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616318 entries, 0 to 616317
Data columns (total 39 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   gbifID                         616318 non-null  int64  
 1   kingdom                        616318 non-null  object 
 2   phylum                         613735 non-null  object 
 3   class                          563890 non-null  object 
 4   order                          540768 non-null  object 
 5   family                         591741 non-null  object 
 6   genus                          530786 non-null  object 
 7   species                        455886 non-null  object 
 8   infraspecificEpithet           21782 non-null   object 
 9   taxonRank                      616318 non-null  object 
 10  scientificName                 616318 non-null  object 
 11  countryCode                    613282 non-null  object 
 12  locality                      

In [11]:
# Some columns are mostly empty. If I cannot extrapolate or make sense of the values, it is better to drop them.
# Create a list of tuples containing the column name and its null value ratio
ratios = [(col, round(df[col].isnull().sum() / df[col].shape[0], 2)) for col in df.columns]
# Sort the list by the null value ratio
from operator import itemgetter
ratios = sorted(ratios, key=itemgetter(1), reverse=True)
# Print the sorted list
print("\n".join(f"{col} {ratio}" for col, ratio in ratios))

coordinatePrecision 1.0
typeStatus 1.0
depthAccuracy 0.99
depth 0.98
infraspecificEpithet 0.96
establishmentMeans 0.94
mediaType 0.88
elevationAccuracy 0.87
elevation 0.79
coordinateUncertaintyInMeters 0.72
dateIdentified 0.67
rightsHolder 0.62
individualCount 0.59
identifiedBy 0.46
species 0.26
locality 0.19
collectionCode 0.19
recordedBy 0.19
institutionCode 0.16
issue 0.16
genus 0.14
decimalLatitude 0.13
decimalLongitude 0.13
order 0.12
class 0.09
family 0.04
stateProvince 0.03
gbifID 0.0
kingdom 0.0
phylum 0.0
taxonRank 0.0
scientificName 0.0
countryCode 0.0
occurrenceStatus 0.0
publishingOrgKey 0.0
eventDate 0.0
basisOfRecord 0.0
license 0.0
lastInterpreted 0.0
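The same ratios can be computed without an explicit loop. A minimal sketch on made-up data; the `threshold` value is a hypothetical cut-off, not one used in this notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "kingdom": ["Animalia", "Plantae", "Fungi", "Animalia"],
    "depth": [np.nan, np.nan, np.nan, 12.0],
    "elevation": [100.0, np.nan, 250.0, 30.0],
})

# isnull() gives booleans; their column mean is the fraction of missing values
null_ratio = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: columns above it are candidates for dropping
threshold = 0.5
mostly_empty = null_ratio[null_ratio > threshold].index.tolist()
print(null_ratio.round(2))
print(mostly_empty)
```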


In [12]:
df = df.drop(columns=["coordinatePrecision",
                     "typeStatus",
                     "depthAccuracy",
                     "establishmentMeans",
                     "mediaType",
                     "elevationAccuracy",
                     "coordinateUncertaintyInMeters",
                     "rightsHolder",
                     "identifiedBy",
                     "license",
                     "recordedBy",
                     "collectionCode",
                     "institutionCode"])

In [13]:
# Rename the date columns to give them more meaningful names
dateColumnNames = {"lastInterpreted":"lastInterpretationDate","dateIdentified":"identificationDate","eventDate":"eventObservationDate"}
df = df.rename(columns=dateColumnNames)

# Group the remaining columns by theme and prefix them accordingly
env_columns = ["countryCode","locality","stateProvince","decimalLatitude","decimalLongitude","elevation","depth","eventObservationDate"]
meta_columns = ["occurrenceStatus","individualCount","basisOfRecord","lastInterpretationDate","publishingOrgKey","identificationDate"]
species_columns = ["kingdom","phylum","class","order","family","genus","species","infraspecificEpithet","taxonRank","scientificName","issue"]

df = df.rename(columns={col:"env_"+col for col in env_columns})
df = df.rename(columns={col:"meta_"+col for col in meta_columns})
df = df.rename(columns={col:"species_"+col for col in species_columns})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616318 entries, 0 to 616317
Data columns (total 26 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   gbifID                        616318 non-null  int64  
 1   species_kingdom               616318 non-null  object 
 2   species_phylum                613735 non-null  object 
 3   species_class                 563890 non-null  object 
 4   species_order                 540768 non-null  object 
 5   species_family                591741 non-null  object 
 6   species_genus                 530786 non-null  object 
 7   species_species               455886 non-null  object 
 8   species_infraspecificEpithet  21782 non-null   object 
 9   species_taxonRank             616318 non-null  object 
 10  species_scientificName        616318 non-null  object 
 11  env_countryCode               613282 non-null  object 
 12  env_locality                  499963 non-nul

In [14]:
dateColumnNames = ["meta_lastInterpretationDate", "meta_identificationDate", "env_eventObservationDate"]

# Iterate over the date columns
for col in dateColumnNames:
    # Convert the data in the column to datetime format
    df[col] = pd.to_datetime(
        df[col], 
        # Infer the datetime format
        infer_datetime_format=True, 
        # Set the timezone to UTC
        utc=True, 
        # Handle errors by replacing invalid data with NaT
        errors="coerce"
    )
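To see what `utc=True` and `errors="coerce"` do, here is a minimal sketch on made-up strings (the `raw` Series is hypothetical). It omits `infer_datetime_format`, which recent pandas versions deprecate.

```python
import pandas as pd

# Two valid timestamps in the style of eventDate, plus one invalid entry
raw = pd.Series(["2018-08-09T00:00:00", "2018-07-09T00:00:00", "not a date"])

# utc=True yields timezone-aware values; errors="coerce" turns bad input into NaT
parsed = pd.to_datetime(raw, utc=True, errors="coerce")
print(parsed)
```

Because invalid values silently become `NaT`, comparing the null counts before and after the conversion is a cheap way to check how many dates failed to parse.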

In [15]:
# Get the object columns in the DataFrame
object_columns = df.select_dtypes(include=['object']).columns.tolist()

# Iterate over the object columns
for col in object_columns:
    # Calculate the ratio of unique values to non-null values in the column
    ratio = df[col].nunique() / df[col].count()
    # Print the ratio
    print(f"Ratio of unique values to rows for column '{col}': {ratio:.2f}")

Ratio of unique values to rows for column 'species_kingdom': 0.00
Ratio of unique values to rows for column 'species_phylum': 0.00
Ratio of unique values to rows for column 'species_class': 0.00
Ratio of unique values to rows for column 'species_order': 0.00
Ratio of unique values to rows for column 'species_family': 0.01
Ratio of unique values to rows for column 'species_genus': 0.03
Ratio of unique values to rows for column 'species_species': 0.09
Ratio of unique values to rows for column 'species_infraspecificEpithet': 0.13
Ratio of unique values to rows for column 'species_taxonRank': 0.00
Ratio of unique values to rows for column 'species_scientificName': 0.09
Ratio of unique values to rows for column 'env_countryCode': 0.00
Ratio of unique values to rows for column 'env_locality': 0.13
Ratio of unique values to rows for column 'env_stateProvince': 0.00
Ratio of unique values to rows for column 'meta_occurrenceStatus': 0.00
Ratio of unique values to rows for column 'meta_publishin

In [16]:
# Convert the object columns to categorical
df[object_columns] = df[object_columns].astype('category')
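The benefit of the conversion depends on how often values repeat. A minimal sketch on a made-up column with only three distinct values; columns with many distinct values, such as localities, would likely save less:

```python
import pandas as pd

# Toy column with few distinct values repeated many times
kingdoms = pd.Series(["Animalia", "Plantae", "Fungi"] * 10_000, dtype=object)

# deep=True counts the Python strings themselves, not just the pointers
as_object = kingdoms.memory_usage(deep=True)
as_category = kingdoms.astype("category").memory_usage(deep=True)
print(as_object, as_category)
```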

In [17]:
# Get the float columns in the DataFrame
floats = df.select_dtypes(include=['float64']).columns.tolist()

# Convert the float columns to a smaller data type
df[floats] = df[floats].apply(pd.to_numeric, downcast='float')

df[floats].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616318 entries, 0 to 616317
Data columns (total 5 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   meta_individualCount  253318 non-null  float32
 1   env_decimalLatitude   536572 non-null  float32
 2   env_decimalLongitude  536572 non-null  float32
 3   env_elevation         131838 non-null  float32
 4   env_depth             11972 non-null   float32
dtypes: float32(5)
memory usage: 11.8 MB
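Downcasting to float32 halves the memory of these columns but keeps only about seven significant digits. A minimal sketch with a made-up longitude, using an explicit `astype` for clarity instead of `pd.to_numeric`:

```python
import numpy as np
import pandas as pd

# Made-up longitude with more decimals than float32 can hold
lon64 = pd.Series([-125.2543219], dtype="float64")
lon32 = lon64.astype("float32")

# Absolute rounding error introduced by the narrower type, in degrees
rounding_error = abs(float(lon32.iloc[0]) - float(lon64.iloc[0]))
print(lon32.dtype, rounding_error)
```

For coordinates of this magnitude the error stays below 0.00001 degrees, which is on the order of a meter on the ground. That is usually negligible for occurrence data, but worth keeping in mind.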


In [18]:
print(df[['env_depth', 'env_elevation']].describe().loc[['min', 'max']])

     env_depth  env_elevation
min       0.00        -339.00
max    5832.04       16917.50


In [19]:
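# Rename the columns to make the unit explicit
df = df.rename(columns={"env_depth": "env_depthInMeters", "env_elevation": "env_elevationInMeters"})
# Convert feet to meters (1 ft = 0.3048 m), assuming the raw values are in feet; note that fillna(0) also turns missing values into 0
df[['env_depthInMeters', 'env_elevationInMeters']] = (df[['env_depthInMeters', 'env_elevationInMeters']] * 0.3048).fillna(0).round(1).astype("float32")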
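The conversion above treats the raw values as feet, which is an assumption worth verifying: to my knowledge GBIF documents elevation and depth in meters. It also replaces missing values with 0, which makes "unknown" indistinguishable from sea level. A minimal sketch of a variant that keeps missing values missing (toy data, hypothetical names):

```python
import numpy as np
import pandas as pd

FEET_TO_METERS = 0.3048  # exact by definition of the international foot

# Toy elevations, assumed here to be in feet, with one missing value
elevation_ft = pd.Series([100.0, np.nan, 16917.5])

# NaN propagates through the multiplication, so missing stays missing
elevation_m = (elevation_ft * FEET_TO_METERS).round(1)
print(elevation_m)
```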

In [20]:
print(df[['meta_individualCount']].describe().loc[['min', 'max']])

     meta_individualCount
min                  0.00
max              49000.00


In [21]:
# Replace NaN values in 'meta_individualCount' with 0 (assign the result back instead of using inplace on a column selection)
df['meta_individualCount'] = df['meta_individualCount'].fillna(0)
# Downcast the column to the smallest integer type that fits; values that cannot be parsed become NaN
df['meta_individualCount'] = pd.to_numeric(df['meta_individualCount'], downcast='integer', errors='coerce')
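Filling with 0 makes a missing count look like an observed count of zero. If that distinction matters later, pandas' nullable integer dtype is one option. A minimal sketch on made-up values, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy counts with one missing value
counts = pd.Series([1.0, np.nan, 49000.0])

# Nullable integer dtype: whole numbers become integers, NaN becomes <NA>
counts_int = counts.astype("Int64")
print(counts_int)
```

Aggregations such as `sum()` skip `<NA>` by default, so totals are unaffected by the missing entries.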

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 616318 entries, 0 to 616317
Data columns (total 26 columns):
 #   Column                        Non-Null Count   Dtype              
---  ------                        --------------   -----              
 0   gbifID                        616318 non-null  int64              
 1   species_kingdom               616318 non-null  category           
 2   species_phylum                613735 non-null  category           
 3   species_class                 563890 non-null  category           
 4   species_order                 540768 non-null  category           
 5   species_family                591741 non-null  category           
 6   species_genus                 530786 non-null  category           
 7   species_species               455886 non-null  category           
 8   species_infraspecificEpithet  21782 non-null   category           
 9   species_taxonRank             616318 non-null  category           
 10  species_scientificNa

In [23]:
# Look at the distribution of the main categorical columns before focusing on one part of the dataset
print(df["species_kingdom"].value_counts())
print(df["env_countryCode"].value_counts())
print(df['meta_basisOfRecord'].value_counts())

Animalia          393178
Plantae           166721
Fungi              54809
incertae sedis       893
Chromista            361
Bacteria             278
Protozoa              74
Viruses                4
Name: species_kingdom, dtype: int64
US    487784
CA     45360
CR     44261
NI      9066
PA      6512
MX      5950
BS      3208
JM      1827
GT      1789
DO      1158
BB      1005
BZ       893
TT       643
KY       584
DM       508
PR       469
SV       374
HN       351
MW       279
AG       279
ZZ       243
CW       232
GP        71
CU        71
GL        59
GD        49
VI        42
TC        41
UM        37
PM        29
MQ        23
BM        21
HT        19
KN        19
LC        17
CO         5
FR         2
VG         2
Name: env_countryCode, dtype: int64
PRESERVED_SPECIMEN     346097
HUMAN_OBSERVATION      141839
FOSSIL_SPECIMEN        104602
OCCURRENCE              10910
MATERIAL_SAMPLE          9064
MACHINE_OBSERVATION      3278
LIVING_SPECIMEN           528
Name: meta_basisOfRecord

In [24]:
# Focus on one region and one kingdom: occurrences recorded in the US for Animalia
df = df[(df["env_countryCode"] == "US") & (df["species_kingdom"] == "Animalia")]
# Alternative: filter on the US only and keep all species, which could be useful for correlating species presence later
# df = df[df["env_countryCode"] == "US"]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 294206 entries, 41 to 616313
Data columns (total 26 columns):
 #   Column                        Non-Null Count   Dtype              
---  ------                        --------------   -----              
 0   gbifID                        294206 non-null  int64              
 1   species_kingdom               294206 non-null  category           
 2   species_phylum                293485 non-null  category           
 3   species_class                 248216 non-null  category           
 4   species_order                 229127 non-null  category           
 5   species_family                274180 non-null  category           
 6   species_genus                 222434 non-null  category           
 7   species_species               177592 non-null  category           
 8   species_infraspecificEpithet  12509 non-null   category           
 9   species_taxonRank             294206 non-null  category           
 10  species_scientificN
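One side effect of filtering categorical columns is that the category labels of the removed rows are kept, so `value_counts()` would still list the other countries and kingdoms with a count of 0. A minimal sketch on a hypothetical toy frame showing how they could be pruned:

```python
import pandas as pd

toy = pd.DataFrame({
    "env_countryCode": pd.Categorical(["US", "CA", "US", "MX"]),
    "species_kingdom": pd.Categorical(["Animalia", "Animalia", "Plantae", "Animalia"]),
})

mask = (toy["env_countryCode"] == "US") & (toy["species_kingdom"] == "Animalia")
subset = toy[mask].copy()

# Filtering keeps every original category label, even those with no rows left
before = list(subset["env_countryCode"].cat.categories)

# Drop the labels that no longer occur in each categorical column
for col in subset.select_dtypes(include="category").columns:
    subset[col] = subset[col].cat.remove_unused_categories()

after = list(subset["env_countryCode"].cat.categories)
print(before, after)
```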

In [25]:
# Save the cleaned and filtered data to a new file
df.to_csv('cleaned_data.csv', index=False)
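CSV stores plain text, so the category and datetime types set earlier are not preserved in the file. A minimal sketch of restoring them on reload, using an in-memory buffer and a hypothetical two-column frame instead of `cleaned_data.csv`:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "species_kingdom": pd.Categorical(["Animalia", "Animalia"]),
    "env_eventObservationDate": pd.to_datetime(["2018-08-09", "2018-07-09"], utc=True),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without hints, both columns come back as plain strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# Restore the types explicitly when reloading
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"species_kingdom": "category"})
restored["env_eventObservationDate"] = pd.to_datetime(
    restored["env_eventObservationDate"], utc=True
)
print(plain.dtypes)
print(restored.dtypes)
```

A typed format such as Parquet (readable through the pyarrow package installed at the top) would avoid this step, at the cost of a file that is not human-readable.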

In [None]:
# Generate an interactive HTML report
profile = ProfileReport(df, title="Pandas Profiling Report")
# Display the report
profile.to_notebook_iframe()


DONE

Possible next steps:

- Compute summary statistics and run statistical tests, for example the average time between observation and interpretation per kingdom, per country, or per organization.
- Look for correlations, such as the impact of food availability on populations, the impact of bacteria or viruses on some populations, and whether the presence of one species correlates with that of another.
```
#filtered_df2 = df[df.applymap(lambda x: "virus" in str(x)).any(axis=1)]
#filtered_df3 = df[df.applymap(lambda x: "bacteria" in str(x)).any(axis=1)]
```
- Train a model to predict future occurrences, and their timing, from previous ones. The occurrence could be a disease, the presence or absence of food, etc.
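
As an illustration of the first step, the average time between observation and interpretation per kingdom could be computed along these lines. This is a minimal sketch on made-up dates; it reuses the column names from the cleaned frame, and the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "species_kingdom": ["Animalia", "Animalia", "Plantae"],
    "env_eventObservationDate": pd.to_datetime(
        ["2018-08-09", "2018-07-09", "2019-01-01"], utc=True),
    "meta_lastInterpretationDate": pd.to_datetime(
        ["2018-08-19", "2018-07-29", "2019-01-31"], utc=True),
})

# Delay between observation and interpretation, expressed in days
delay = toy["meta_lastInterpretationDate"] - toy["env_eventObservationDate"]
toy["delay_days"] = delay.dt.total_seconds() / 86400

# Average delay for each kingdom
mean_delay_days = toy.groupby("species_kingdom")["delay_days"].mean()
print(mean_delay_days)
```

Swapping `species_kingdom` for `env_countryCode` or `meta_publishingOrgKey` would give the per-country and per-organization variants.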
