<a href="https://colab.research.google.com/github/marqub/gbif-species-distribution-analysis/blob/main/notebooks/phase-1-data-exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [20]:
%%capture
# Install the pandas-profiling library from the master branch on GitHub
!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip 
# Install the necessary libraries for the project
!pip install numpy pandas streamlit gdown pyarrow
# Install GitPython, which provides the `git` module imported below
!pip install GitPython
# Upgrade the pandas-profiling library
!pip install --upgrade pandas-profiling
# Show the version of the pandas-profiling library that is currently installed
!pip show pandas-profiling
# Install pytz, a library for timezone handling
!pip install pytz

In [21]:
# Import the markupsafe library
import markupsafe
print(markupsafe.__version__)
# Import necessary libraries for the project
import pandas as pd
import numpy as np
# Import the git library for version control
import git
# Import the os library for file and directory management
import os
# Import the zipfile library to handle zip files
import zipfile
# Import the ProfileReport class from the pandas_profiling library for creating profiling reports
from pandas_profiling import ProfileReport
# Import the pytz library for timezone handling
import pytz as tz

2.0.1


In [22]:
# Show all columns (instead of cascading columns in the middle)
pd.set_option("display.max_columns", None)
# Don't show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

In [30]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install --skip-smudge
!git clone https://github.com/marqub/gbif-species-distribution-analysis.git

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Detected apt version as 1.6.14
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... Packagecloud gpg key imported to /etc/apt/keyrings/github_git-lfs-archive-keyring.gpg
done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (3.2.0).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 30 not upgraded.
Updated Git hooks.
Git LFS initialized.
Cloning into 'gbif-species-distribution-analysis'...
remote: Enumerating objects: 92, done.
remote: Counting obj

In [32]:
# Load GBIF dataset
os.chdir("gbif-species-distribution-analysis/data")
!git lfs pull --include="gbif_data_2016to2022_northamerica_CC.csv.zip"



In [33]:
with zipfile.ZipFile("gbif_data_2016to2022_northamerica_CC.csv.zip", "r") as zip_ref:
    zip_ref.extractall(".")

In [35]:
df = pd.read_csv("0246602-220831081235567.csv", sep='\t', on_bad_lines='skip')

  exec(code_obj, self.user_global_ns, self.user_ns)


In [36]:
# Explore the data
# Use df.info() to get a summary of the data types and missing values in the dataset
print(df.info())
# Use df.head() to view the first few rows of the dataset
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 50 columns):
 #   Column                            Dtype  
---  ------                            -----  
 0   gbifID                            int64  
 1   datasetKey                        object 
 2   occurrenceID                      object 
 3   kingdom                           object 
 4   phylum                            object 
 5   class                             object 
 6   order                             object 
 7   family                            object 
 8   genus                             object 
 9   species                           object 
 10  infraspecificEpithet              object 
 11  taxonRank                         object 
 12  scientificName                    object 
 13  verbatimScientificName            object 
 14  verbatimScientificNameAuthorship  object 
 15  countryCode                       object 
 16  locality                          ob

Unnamed: 0,gbifID,datasetKey,occurrenceID,kingdom,phylum,class,order,family,genus,species,infraspecificEpithet,taxonRank,scientificName,verbatimScientificName,verbatimScientificNameAuthorship,countryCode,locality,stateProvince,occurrenceStatus,individualCount,publishingOrgKey,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,coordinatePrecision,elevation,elevationAccuracy,depth,depthAccuracy,eventDate,day,month,year,taxonKey,speciesKey,basisOfRecord,institutionCode,collectionCode,catalogNumber,recordNumber,identifiedBy,dateIdentified,license,rightsHolder,recordedBy,typeStatus,establishmentMeans,lastInterpreted,mediaType,issue
0,1930479050,d415c253-4d61-4459-9d25-4015b9084fb0,bfc07e69-59f3-4ec5-8890-4df46fc8b293,Fungi,Ascomycota,Lecanoromycetes,Lecanorales,Cladoniaceae,Cladonia,Cladonia apodocarpa,,SPECIES,Cladonia apodocarpa Robb.,Cladonia apodocarpa Robbins,Robbins,US,James D. Martin – Skyline Wildlife Management ...,Alabama,PRESENT,,ae447c50-b8a8-11d8-92a4-b8a03c50a862,34.87,-86.14,,,,,,,2017-05-19T00:00:00,19.0,5.0,2017.0,3391449.0,3391449.0,PRESERVED_SPECIMEN,NY,NY,3218863,52530,J. C. Lendemer,2018-01-01T00:00:00,CC0_1_0,The New York Botanical Garden,J. C. Lendemer,,,2023-01-11T13:29:42.579Z,StillImage,GEODETIC_DATUM_ASSUMED_WGS84;COLLECTION_MATCH_...
1,3328180681,d415c253-4d61-4459-9d25-4015b9084fb0,c74f5b5e-0c90-4f1e-8f96-1a75f99fdd00,Fungi,Ascomycota,Lecanoromycetes,Lecanorales,Catillariaceae,Catillaria,Catillaria nigroclavata,,SPECIES,Catillaria nigroclavata (Nyl.) Schuler,Catillaria nigroclavata (Nyl.) Schuler,(Nyl.) Schuler,US,"Pisgah National Forest, Bald Mountains, W-faci...",North Carolina,PRESENT,,ae447c50-b8a8-11d8-92a4-b8a03c50a862,35.83,-82.94,,,,,,,2019-10-23T00:00:00,23.0,10.0,2019.0,2607504.0,9117432.0,PRESERVED_SPECIMEN,NY,NY,4258535,62601,J. C. Lendemer,2019-01-01T00:00:00,CC0_1_0,The New York Botanical Garden,J. C. Lendemer,,,2023-01-11T13:29:43.944Z,StillImage,GEODETIC_DATUM_ASSUMED_WGS84;COLLECTION_MATCH_...
2,3422300410,d415c253-4d61-4459-9d25-4015b9084fb0,38619449-806c-4521-afd5-d9eff75972a4,Fungi,Ascomycota,Lecanoromycetes,Pertusariales,Pertusariaceae,Pertusaria,Pertusaria paratuberculifera,,SPECIES,Pertusaria paratuberculifera Dibben,Pertusaria paratuberculifera Dibben,Dibben,US,"Great Smoky Mountains National Park, Foothills...",Tennessee,PRESENT,,ae447c50-b8a8-11d8-92a4-b8a03c50a862,35.77,-83.56,,,,,,,2021-10-18T00:00:00,18.0,10.0,2021.0,3411752.0,3411752.0,PRESERVED_SPECIMEN,NY,NY,4284937,71647,J. C. Lendemer,2021-01-01T00:00:00,CC0_1_0,The New York Botanical Garden,J. C. Lendemer,,,2023-01-11T13:30:24.353Z,StillImage,GEODETIC_DATUM_ASSUMED_WGS84;COLLECTION_MATCH_...
3,2235426236,d415c253-4d61-4459-9d25-4015b9084fb0,c9dec564-90b8-44ce-b171-888e60532983,Fungi,Ascomycota,Lecanoromycetes,Peltigerales,Lobariaceae,Ricasolia,Ricasolia quercizans,,SPECIES,Lobaria quercizans Michx.,Lobaria quercizans Michx.,Michx.,US,"Great Smoky Mountains National Park, N slopes ...",Tennessee,PRESENT,,ae447c50-b8a8-11d8-92a4-b8a03c50a862,35.56,-83.74,,,,,,,2017-12-11T00:00:00,11.0,12.0,2017.0,7086259.0,6337981.0,PRESERVED_SPECIMEN,NY,NY,3861913,8002,E. A. Tripp,2017-12-13T00:00:00,CC0_1_0,The New York Botanical Garden,E. A. Tripp;J. C. Lendemer,,,2023-01-11T13:29:44.652Z,StillImage,GEODETIC_DATUM_ASSUMED_WGS84;COLLECTION_MATCH_...
4,3328180691,d415c253-4d61-4459-9d25-4015b9084fb0,ce31fe91-27d7-4846-a15e-b0cfd46a7fce,Fungi,Ascomycota,Lecanoromycetes,Lecanorales,Parmeliaceae,Melanohalea,Melanohalea halei,,SPECIES,"Melanohalea halei (Ahti) O.Blanco, A.Crespo, D...","Melanohalea halei (Ahti) O.Blanco, A.Crespo, D...","(Ahti) O.Blanco, A.Crespo, Divakar, Essl., D.H...",US,"Pisgah National Forest, Bald Mountains, S-slop...",North Carolina,PRESENT,,ae447c50-b8a8-11d8-92a4-b8a03c50a862,35.95,-82.79,,,,,,,2020-08-17T00:00:00,17.0,8.0,2020.0,2605328.0,2605328.0,PRESERVED_SPECIMEN,NY,NY,4687056,69140,J. C. Lendemer,2020-01-01T00:00:00,CC0_1_0,The New York Botanical Garden,J. C. Lendemer,,,2023-01-11T13:29:45.790Z,StillImage,GEODETIC_DATUM_ASSUMED_WGS84;COLLECTION_MATCH_...


In [37]:
# Split the values in the 'issue' column by ';'
tokens = df['issue'].str.split(';', expand=True).stack()
# Count the occurrences of each token
counts = tokens.value_counts()
# Print the counts
print(counts)

CONTINENT_DERIVED_FROM_COORDINATES                   4341570
GEODETIC_DATUM_ASSUMED_WGS84                         1976773
COORDINATE_ROUNDED                                   1603999
GEODETIC_DATUM_INVALID                               1134822
REFERENCES_URI_INVALID                                587791
INSTITUTION_MATCH_FUZZY                               326699
COLLECTION_MATCH_FUZZY                                266620
DIFFERENT_OWNER_INSTITUTION                           192224
OCCURRENCE_STATUS_INFERRED_FROM_INDIVIDUAL_COUNT      184592
OCCURRENCE_STATUS_UNPARSABLE                          171696
RECORDED_DATE_INVALID                                 155747
TAXON_MATCH_HIGHERRANK                                149980
INSTITUTION_MATCH_NONE                                 96314
TAXON_MATCH_NONE                                       72634
INSTITUTION_COLLECTION_MISMATCH                        63484
AMBIGUOUS_COLLECTION                                   50552
COUNTRY_DERIVED_FROM_COO
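
The split-and-count step above can be checked on a tiny, made-up Series. This is a minimal sketch that uses `str.split` followed by `explode` instead of `expand=True` plus `stack()`, which avoids building a wide intermediate frame; the sample flag strings and the names `issues`, `flags`, and `flag_counts` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical sample of the semicolon-separated 'issue' column
issues = pd.Series([
    "COORDINATE_ROUNDED;GEODETIC_DATUM_ASSUMED_WGS84",
    "COORDINATE_ROUNDED",
    None,
])

# Split each string into a list, flatten to one flag per row, drop missing values
flags = issues.str.split(";").explode().dropna()

# Count how often each flag occurs
flag_counts = flags.value_counts()
print(flag_counts)
```

Both approaches should give the same counts; `explode` is likely to use less memory on a frame of this size.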

In [38]:
# Iterate over the specified columns
for col in ['gbifID', 'datasetKey', 'occurrenceID']:
    # Count the number of unique values in the column
    uniq = df[col].nunique()
    # Count the number of non-null values in the column
    nonnull = df[col].count()
    # Print the results
    print(f"Column {col}: {uniq} unique values, {nonnull} non-null values")

Column gbifID: 4675566 unique values, 4675566 non-null values
Column datasetKey: 933 unique values, 4675566 non-null values
Column occurrenceID: 4675124 unique values, 4675561 non-null values
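
Since `occurrenceID` has fewer unique values than non-null values, some identifiers repeat. A minimal sketch of listing the repeated rows is shown here on a made-up frame; the sample values and the names `sample`, `has_id`, and `dupes` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the identifier columns
sample = pd.DataFrame({
    "gbifID": [1, 2, 3, 4],
    "occurrenceID": ["a", "b", "b", None],
})

# Ignore rows without an occurrenceID, then keep every row whose ID appears more than once
has_id = sample["occurrenceID"].notna()
dupes = sample[has_id & sample["occurrenceID"].duplicated(keep=False)]
print(dupes)
```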


In [39]:
df = df.drop(columns=[
    "verbatimScientificNameAuthorship",
    "verbatimScientificName",
    "recordNumber",
    "catalogNumber",
    "taxonKey",
    "speciesKey",
    "year",
    "month",
    "day",
    "datasetKey",
    "occurrenceID"
])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 39 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   gbifID                         int64  
 1   kingdom                        object 
 2   phylum                         object 
 3   class                          object 
 4   order                          object 
 5   family                         object 
 6   genus                          object 
 7   species                        object 
 8   infraspecificEpithet           object 
 9   taxonRank                      object 
 10  scientificName                 object 
 11  countryCode                    object 
 12  locality                       object 
 13  stateProvince                  object 
 14  occurrenceStatus               object 
 15  individualCount                object 
 16  publishingOrgKey               object 
 17  decimalLatitude                float64
 18  de

In [40]:
# Some columns are mostly null. If I cannot extrapolate or make sense of the values, it is better to drop them.
# Create a list of tuples containing the column name and its null value ratio
ratios = [(col, round(df[col].isnull().sum() / df[col].shape[0], 2)) for col in df.columns]
# Sort the list by the null value ratio
from operator import itemgetter
ratios = sorted(ratios, key=itemgetter(1), reverse=True)
# Print the sorted list
print("\n".join(f"{col} {ratio}" for col, ratio in ratios))

coordinatePrecision 1.0
typeStatus 1.0
infraspecificEpithet 0.98
establishmentMeans 0.95
individualCount 0.83
elevationAccuracy 0.82
elevation 0.77
depth 0.75
depthAccuracy 0.75
mediaType 0.72
coordinateUncertaintyInMeters 0.6
dateIdentified 0.57
institutionCode 0.54
recordedBy 0.52
collectionCode 0.51
identifiedBy 0.5
rightsHolder 0.45
stateProvince 0.44
locality 0.43
species 0.23
class 0.2
genus 0.15
family 0.06
order 0.05
phylum 0.03
decimalLatitude 0.01
decimalLongitude 0.01
issue 0.01
gbifID 0.0
kingdom 0.0
taxonRank 0.0
scientificName 0.0
countryCode 0.0
occurrenceStatus 0.0
publishingOrgKey 0.0
eventDate 0.0
basisOfRecord 0.0
license 0.0
lastInterpreted 0.0
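
The same null ratios can be computed without a Python loop. The sketch below uses `isna().mean()` on a made-up frame and then selects columns above a cut-off; the 0.9 threshold and the sample values are assumptions for illustration, not the rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with different amounts of missing data
sample = pd.DataFrame({
    "typeStatus": [np.nan, np.nan, np.nan, np.nan],
    "elevation": [10.0, np.nan, np.nan, np.nan],
    "kingdom": ["Fungi", "Fungi", "Animalia", "Plantae"],
})

# The mean of a boolean mask is the fraction of null values per column
null_ratios = sample.isna().mean().round(2).sort_values(ascending=False)
print(null_ratios)

# Illustrative cut-off: list the columns that are almost entirely null
threshold = 0.9
mostly_null = null_ratios[null_ratios > threshold].index.tolist()
print(mostly_null)
```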


In [41]:
df = df.drop(columns=["coordinatePrecision",
                     "typeStatus",
                     "depthAccuracy",
                     "establishmentMeans",
                     "mediaType",
                     "elevationAccuracy",
                     "coordinateUncertaintyInMeters",
                     "rightsHolder",
                     "identifiedBy",
                     "license",
                     "recordedBy",
                     "collectionCode",
                     "institutionCode"])

In [42]:
# Let's rename the date columns to give them more meaningful names
dateColumnNames = {"lastInterpreted":"lastInterpretationDate","dateIdentified":"identificationDate","eventDate":"eventObservationDate"}
df = df.rename(columns=dateColumnNames)

env_columns = ["countryCode","locality","stateProvince","decimalLatitude","decimalLongitude","elevation","depth","eventObservationDate"]
meta_columns = ["occurrenceStatus","individualCount","basisOfRecord","typeStatus","establishmentMeans","lastInterpretationDate","publishingOrgKey","identificationDate"]
species_columns = ["kingdom","phylum","class","order","family","genus","species","infraspecificEpithet","taxonRank","scientificName","issue"]

df = df.rename(columns={col:"env_"+col for col in env_columns})
df = df.rename(columns={col:"meta_"+col for col in meta_columns})
df = df.rename(columns={col:"species_"+col for col in species_columns})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   gbifID                        int64  
 1   species_kingdom               object 
 2   species_phylum                object 
 3   species_class                 object 
 4   species_order                 object 
 5   species_family                object 
 6   species_genus                 object 
 7   species_species               object 
 8   species_infraspecificEpithet  object 
 9   species_taxonRank             object 
 10  species_scientificName        object 
 11  env_countryCode               object 
 12  env_locality                  object 
 13  env_stateProvince             object 
 14  meta_occurrenceStatus         object 
 15  meta_individualCount          object 
 16  meta_publishingOrgKey         object 
 17  env_decimalLatitude           float64
 18  env_decimalLongitude  

In [43]:
dateColumnNames = ["meta_lastInterpretationDate", "meta_identificationDate", "env_eventObservationDate"]

# Iterate over the date columns
for col in dateColumnNames:
    # Convert the data in the column to datetime format
    df[col] = pd.to_datetime(
        df[col], 
        # Infer the datetime format
        infer_datetime_format=True, 
        # Set the timezone to UTC
        utc=True, 
        # Handle errors by replacing invalid data with NaT
        errors="coerce"
    )
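
To see what `utc=True` and `errors="coerce"` do, here is a minimal sketch on a made-up Series. The sample strings and the names `raw` and `parsed` are assumptions; the invalid entry stands in for any unparsable value in the real columns.

```python
import pandas as pd

# Hypothetical date strings in one consistent format, plus one invalid entry
raw = pd.Series(["2017-05-19T00:00:00", "2019-10-23T00:00:00", "not a date"])

# Parse to timezone-aware UTC timestamps; unparsable values become NaT
parsed = pd.to_datetime(raw, utc=True, errors="coerce")
print(parsed)
print("Unparsable values:", parsed.isna().sum())
```

Counting `NaT` values before and after the conversion is a quick way to check how many dates were lost to coercion.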

In [44]:
# Get the object columns in the DataFrame
object_columns = df.select_dtypes(include=['object']).columns.tolist()

# Iterate over the object columns
for col in object_columns:
    # Calculate the ratio of unique values to rows in the column
    ratio = df[col].nunique() / df[col].count()
    # Print the ratio
    print(f"Ratio of unique values to rows for column '{col}': {ratio:.2f}")

Ratio of unique values to rows for column 'species_kingdom': 0.00
Ratio of unique values to rows for column 'species_phylum': 0.00
Ratio of unique values to rows for column 'species_class': 0.00
Ratio of unique values to rows for column 'species_order': 0.00
Ratio of unique values to rows for column 'species_family': 0.00
Ratio of unique values to rows for column 'species_genus': 0.01
Ratio of unique values to rows for column 'species_species': 0.02
Ratio of unique values to rows for column 'species_infraspecificEpithet': 0.06
Ratio of unique values to rows for column 'species_taxonRank': 0.00
Ratio of unique values to rows for column 'species_scientificName': 0.02
Ratio of unique values to rows for column 'env_countryCode': 0.00
Ratio of unique values to rows for column 'env_locality': 0.04
Ratio of unique values to rows for column 'env_stateProvince': 0.00
Ratio of unique values to rows for column 'meta_occurrenceStatus': 0.00
Ratio of unique values to rows for column 'meta_individua

In [45]:
# Convert the object columns to categorical
df[object_columns] = df[object_columns].astype('category')
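
Converting low-cardinality text columns to `category` mainly saves memory. A minimal sketch of measuring the effect on a made-up Series follows; the sample values and variable names are assumptions, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical low-cardinality column: three distinct values repeated many times
kingdoms = pd.Series(["Animalia", "Plantae", "Fungi"] * 1000)
as_category = kingdoms.astype("category")

# deep=True includes the memory held by the Python strings themselves
object_bytes = kingdoms.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"text: {object_bytes} bytes, category: {category_bytes} bytes")
```

The saving shrinks as the ratio of unique values to rows grows, which is why the ratios printed earlier are worth checking first.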

In [46]:
# Get the float columns in the DataFrame
floats = df.select_dtypes(include=['float64']).columns.tolist()

# Convert the float columns to a smaller data type
df[floats] = df[floats].apply(pd.to_numeric, downcast='float')

df[floats].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 3 columns):
 #   Column                Dtype  
---  ------                -----  
 0   env_decimalLatitude   float32
 1   env_decimalLongitude  float32
 2   env_depth             float32
dtypes: float32(3)
memory usage: 53.5 MB
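
Downcasting coordinates to `float32` trades precision for memory. The sketch below estimates the rounding error for one made-up longitude; the value and the rough figure of 111 km per degree are assumptions used only for illustration.

```python
import numpy as np

# Hypothetical longitude with seven decimal places
longitude = -86.1234567
as_float32 = float(np.float32(longitude))

# float32 keeps about seven significant digits, so the stored value shifts slightly
error_degrees = abs(as_float32 - longitude)

# One degree is roughly 111 km at most, so convert the error to an upper bound in meters
error_meters = error_degrees * 111_000
print(f"error: {error_degrees:.2e} degrees, about {error_meters:.2f} m")
```

An error below a meter is probably acceptable for species distribution work, but it is worth knowing it is there.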


In [47]:
print(df[['env_depth', 'env_elevation']].describe().loc[['min', 'max']])

     env_depth
min       0.00
max    2067.00


In [48]:
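df = df.rename(columns={"env_depth": "env_depthInMeters", "env_elevation": "env_elevationInMeters"})
# 'env_elevation' is categorical at this point, so parse both columns as numbers before the arithmetic
heights = df[['env_depthInMeters', 'env_elevationInMeters']].astype('object').apply(pd.to_numeric, errors='coerce')
df[['env_depthInMeters', 'env_elevationInMeters']] = (heights * 0.3048).fillna(0).round(1).astype("float32")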
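
Columns converted to `category` do not support arithmetic, which matters for a column such as `env_elevation` that was read as `object` and then made categorical. A minimal sketch of turning such a column back into numbers is shown on a made-up Series; the sample values and names are assumptions.

```python
import pandas as pd

# Hypothetical elevation values stored as text, then converted to category
elevation = pd.Series(["120", "35.5", "unknown", None]).astype("category")

# Cast back to object, then parse; unparsable or missing entries become NaN
numeric = pd.to_numeric(elevation.astype("object"), errors="coerce")
print(numeric)
```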

In [49]:
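# describe() only reports min/max for numeric data, and this column is categorical, so parse it first
print(pd.to_numeric(df['meta_individualCount'].astype('object'), errors='coerce').describe().loc[['min', 'max']])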

In [None]:
# Parse 'meta_individualCount' as numbers; the column is categorical here, so cast to object first
individual_counts = pd.to_numeric(df['meta_individualCount'].astype('object'), errors='coerce')
# Replace NaN (missing or unparsable) with 0, then downcast to the smallest integer type that fits
df['meta_individualCount'] = pd.to_numeric(individual_counts.fillna(0), downcast='integer')

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4675566 entries, 0 to 4675565
Data columns (total 26 columns):
 #   Column                        Dtype              
---  ------                        -----              
 0   gbifID                        int64              
 1   species_kingdom               category           
 2   species_phylum                category           
 3   species_class                 category           
 4   species_order                 category           
 5   species_family                category           
 6   species_genus                 category           
 7   species_species               category           
 8   species_infraspecificEpithet  category           
 9   species_taxonRank             category           
 10  species_scientificName        category           
 11  env_countryCode               category           
 12  env_locality                  category           
 13  env_stateProvince             category           
 14  me

In [None]:
#let's focus on one part of the dataset
print(df["species_kingdom"].value_counts())
print(df["env_countryCode"].value_counts())
print(df['meta_basisOfRecord'].value_counts())

Animalia          393178
Plantae           166721
Fungi              54809
incertae sedis       893
Chromista            361
Bacteria             278
Protozoa              74
Viruses                4
Name: species_kingdom, dtype: int64
US    487784
CA     45360
CR     44261
NI      9066
PA      6512
MX      5950
BS      3208
JM      1827
GT      1789
DO      1158
BB      1005
BZ       893
TT       643
KY       584
DM       508
PR       469
SV       374
HN       351
MW       279
AG       279
ZZ       243
CW       232
GP        71
CU        71
GL        59
GD        49
VI        42
TC        41
UM        37
PM        29
MQ        23
BM        21
HT        19
KN        19
LC        17
CO         5
FR         2
VG         2
Name: env_countryCode, dtype: int64
PRESERVED_SPECIMEN     346097
HUMAN_OBSERVATION      141839
FOSSIL_SPECIMEN        104602
OCCURRENCE              10910
MATERIAL_SAMPLE          9064
MACHINE_OBSERVATION      3278
LIVING_SPECIMEN           528
Name: meta_basisOfRecord

In [None]:
# Determine which species and geographic regions to focus on
df=df[(df["env_countryCode"]=="US") & (df["species_kingdom"]=="Animalia")]
# Alternative: filter on the US only and keep all species, which could be useful for correlating species presence later.
#df=df[(df["env_countryCode"]=="US")]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 294206 entries, 41 to 616313
Data columns (total 26 columns):
 #   Column                        Non-Null Count   Dtype              
---  ------                        --------------   -----              
 0   gbifID                        294206 non-null  int64              
 1   species_kingdom               294206 non-null  category           
 2   species_phylum                293485 non-null  category           
 3   species_class                 248216 non-null  category           
 4   species_order                 229127 non-null  category           
 5   species_family                274180 non-null  category           
 6   species_genus                 222434 non-null  category           
 7   species_species               177592 non-null  category           
 8   species_infraspecificEpithet  12509 non-null   category           
 9   species_taxonRank             294206 non-null  category           
 10  species_scientificN

In [None]:
# Save the cleaned and filtered data to a new file
df.to_csv('cleaned_data.csv', index=False)

In [None]:
# Generate an interactive HTML report
profile = ProfileReport(df, title="Pandas Profiling Report")
# Display the report
profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

DONE

Possible next steps:

- Compute summary statistics and statistical tests, for example the average time between observation and interpretation per kingdom, per country, or per organization.
- Look for correlations, such as the impact of food availability on populations, the impact of bacteria or viruses on some populations, and whether the presence of different species is correlated.
```
#filtered_df2 = df[df.applymap(lambda x: "virus" in str(x)).any(axis=1)]
#filtered_df3 = df[df.applymap(lambda x: "bacteria" in str(x)).any(axis=1)]
```
- Train a model to predict future occurrences and their timing from past ones; an occurrence could be a disease, the presence or absence of food, and so on.
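
As a starting point for the first item, the average time between observation and interpretation per kingdom could be computed with a `groupby`. This is a minimal sketch on made-up rows that reuses the column names from this notebook; the dates and the `lag` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows using the renamed columns from this notebook
sample = pd.DataFrame({
    "species_kingdom": ["Fungi", "Fungi", "Animalia"],
    "env_eventObservationDate": pd.to_datetime(
        ["2017-05-19", "2019-10-23", "2020-08-17"], utc=True),
    "meta_lastInterpretationDate": pd.to_datetime(
        ["2017-05-29", "2019-11-22", "2020-08-18"], utc=True),
})

# Time elapsed between the observation and its last interpretation
sample["lag"] = sample["meta_lastInterpretationDate"] - sample["env_eventObservationDate"]

# Average lag per kingdom
mean_lag = sample.groupby("species_kingdom")["lag"].mean()
print(mean_lag)
```

Swapping `species_kingdom` for `env_countryCode` or `meta_publishingOrgKey` would give the per-country and per-organization variants.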
