# Costa Rica Pipeline Notebook
## {Insert info on notebook purpose}

### Project objective:
- {TODO}

#### Notebook sections:
1. Setup
2. Data Acquisition / Cleaning
3. Utilizing ML Models
4. Maps / Visualizations / Documentation

#### Data sources
- {TODO}

## Section 1 - Setup

### Section 1.1 - Installing software and importing packages

In [4]:
!pip install pandas numpy geopandas seaborn scikit-learn tensor folium matplotlib mapclassify earthengine-api geemap



In [5]:
from geopandas import GeoDataFrame
from shapely.geometry import Point
import geopandas as gpd
import pandas as pd
import numpy as np
import pprint
import geemap
import ee
import folium

### Section 1.2 - Importing our Dataset

In [9]:
# We are importing the observational data from the ZIP file provided (which contains the shape file) into a GeoDataFrame
gdf = gpd.read_file('../Costa Rican Data/Classification_Plots.zip')

# These display information about the GeoDataFrame to confirm the contents are what we expected
display(gdf.crs)
display(gdf.columns)

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Index(['Source.Nam', 'plotid', 'sampleid', 'lon', 'lat', 'sample_geo', 'Uso',
       'Cobertura', 'Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo',
       'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF', 'Cambios15_',
       'Gana_Perdi', 'geometry'],
      dtype='object')

### Section 1.3 - English Translations

Renaming every column to its English translation

In [4]:
# Keep only the identifier, predictor, and geometry columns
k_clms = ['plotid','sampleid','Uso','Cobertura','Vegetacion','Herbaceas', 'Pasto_Arb', 'Cultivo','Humedal', 'Terreno','Agua','Otra_clase','SAF','Cambios15_','Gana_Perdi','geometry']
gdf_s = gdf[k_clms]


# new column names
english_translations_predictors = {
    'Uso': 'Use',
    'Cobertura': 'CoverType',
    'Vegetacion': 'Vegetations',
    'Herbaceas': 'Herbaceous',
    'Pasto_Arb': 'GrasslandShrub',
    'Cultivo': 'CropsType',
    'Humedal': 'WetlandArea',
    'Terreno': 'LandType',
    'Agua': 'WaterBodyType',
    'Otra_clase': 'OtherClass',
    'SAF': 'SAF',
    'Cambios15_': 'Changes_15',
    'Gana_Perdi': 'Gain_Loss',
}

# Create a new GeoDataFrame with selected columns and translations
gdf_e = gdf[k_clms].copy()  # Creating a new GeoDataFrame
gdf_e.rename(columns=english_translations_predictors, inplace=True)  # Renaming columns

Translating the values of each categorical predictor into English

In [5]:
# Translation dictionaries for each categorical predictor
translation_dicts = {
    'Use': { # num of missing values: 216
        'Bosque': 'Forest',
        'Pastos': 'Grasslands',
        'Humedal': 'Wetlands',
        'Otras clases': 'Other classes',
        'Agricultura': 'Agriculture',
        'Plantacion forestal': 'Forest plantation',
        'Sin informacion': 'No information',
        'None': 'None'
    },
    'CoverType': { # num of missing values: 216
        'Vegetacion': 'Vegetation',
        'Sin vegetacion': 'No vegetation',
        'Agua': 'Water',
        'Sin informacion': 'No information',
        'Nubes y sombras': 'Clouds and shadows',
        'None': 'None'
    },
     'Changes_15': {# num of missing values: 216
        'No se determina': 'Not determined',
        'No': 'No',
        'Si': 'Yes',
        'None': 'None'
    },
    'Vegetations': {
        'Arboles': 'Trees',
        'Herbaceas': 'Herbaceous plants',
        'None': 'None',
        'Palmas': 'Palms',
        'Arbustos': 'Shrubs',
        'Otra vegetacion': 'Other vegetation',
        'Saran': 'Saran (plastic cover)',
        'Plastico': 'Plastic'
    },
    'Herbaceous': {
        'None': 'None',
        'Gramineas': 'Grasses',
        'Otras Herbaceas': 'Other Herbaceous',
        'Musaceas': 'Plantains'
    },
    'GrasslandShrub': {
        'None': 'None',
        'Pastos mezclados (70-90%)': 'Mixed Pasture (70-90%)',
        'Pastos Puros (90-100%)': 'Pure Pasture (90-100%)',
        'Pastos Combinados (50-70%)': 'Combined Pasture (50-70%)'
    },
    'CropsType': { # 10 different crop types
        'None': 'None',
        'Pina': 'Pineapple',
        'Otro': 'Other',
        'Arroz': 'Rice',
        'Citricos': 'Citrus',
        'Cana': 'Sugarcane',
        'Palma': 'Palm',
        'Banano': 'Banana',
        'Melon': 'Melon',
        'Sandia': 'Watermelon',
        'Cafe': 'Coffee'
    },
     'WaterBodyType': { # 2 types of water bodies
        'None': 'None',
        'Continentales': 'Continental',
        'Mar�timas': 'Marine'  # 'Marítimas'; key left exactly as it appears here, likely an encoding artifact from the shapefile
    },

    'WetlandArea': { # 5 wetland area types
        'None': 'None',
        'Pantano (Palustre)': 'Swamp (Marsh)',
        'Cuerpos de agua': 'Water bodies',
        'Yolillal': 'Yolillal_Plants',
        'Salinera': 'Salt marsh',
        'Manglar': 'Mangrove'
    },
    'LandType': {
        'None': 'None',
        'Otras superficies': 'Other surfaces', # forested areas?
        'Terreno descubierto': 'Exposed land', #  land degradation (soil is exposed)
        'Suelo desnudo': 'Bare Land' # lacks cover (deforestation, desert)
    },
    'OtherClass': {
        'None': 'None',
        'Edificado / Desarrollado': 'Built/Developed',
        'Suelo desnudo': 'Bare Land',
        'Nubes': 'Clouds',
        'Sombra de nubes': 'Cloud shadow',
        'Paramo': 'Páramo (high-altitude ecosystem)',
        'Playas y arenales': 'Beaches and sandbanks'
    },
    'SAF': {
        'None': 'None',
        'Cultivo Puro (90-100%)': 'Pure crop (90-100%)',
        'Cultivo mezclado (70-90%)': 'Mixed crop (70-90%)',
        'Cultivo Combinado (50-70%)': 'Combined crop (50-70%)'
    },
    'Gain_Loss': {
        'None': 'None',# third category no gain or loss
        'Perdida de Bosque': 'Forest loss',
        'Ganancia de Bosque': 'Forest gain'
    }
}

# Loop through each column and replace the Spanish values with the English translations
for col, trans_dict in translation_dicts.items():
    gdf_e[col] = gdf_e[col].replace(trans_dict)
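
Because `replace` silently leaves any value that has no dictionary entry unchanged, a misspelled or mis-encoded key would go unnoticed. The following is a minimal sketch of a check that could be run on the Spanish values just before the loop above. The helper name `find_untranslated` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def find_untranslated(df, translation_dicts):
    """Return {column: sorted values that have no key in that column's translation dictionary}."""
    missing = {}
    for col, trans_dict in translation_dicts.items():
        # Ignore real missing values (None/NaN); they are not category labels
        observed = set(df[col].dropna().unique())
        unknown = sorted(observed - set(trans_dict))
        if unknown:
            missing[col] = unknown
    return missing

# Toy data standing in for the untranslated frame (hypothetical values)
toy = pd.DataFrame({'Use': ['Bosque', 'Pastos', 'Manglar', None]})
toy_dicts = {'Use': {'Bosque': 'Forest', 'Pastos': 'Grasslands'}}
print(find_untranslated(toy, toy_dicts))  # {'Use': ['Manglar']}
```

An empty result would suggest that every observed category has a translation. A non-empty result lists the values that would pass through untranslated.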

Confirming the translations in the new GeoDataFrame

In [6]:
# set display options to show more content
pd.set_option('display.max_colwidth', None)  # Allows full width of content in each column
pd.set_option('display.max_rows', None)      # Show all rows (if there are not too many unique values)

# Now, get and display unique values for each categorical feature
unique_values = gdf_e.select_dtypes(include=['object', 'category']).apply(lambda x: x.unique())

# Display the result
print(unique_values)

Use                             [Forest, Grasslands, Wetlands, Other classes, Agriculture, Forest plantation, No information, None]
CoverType                                              [Vegetation, No vegetation, Water, No information, Clouds and shadows, None]
Vegetations                       [Trees, Herbaceous plants, None, Palms, Shrubs, Other vegetation, Saran (plastic cover), Plastic]
Herbaceous                                                                             [None, Grasses, Other Herbaceous, Plantains]
CropsType                                [None, Pineapple, Other, Rice, Citrus, Sugarcane, Palm, Banana, Melon, Watermelon, Coffee]
WetlandArea                                              [None, Swamp (Marsh), Water bodies, Yolillal_Plants, Salt marsh, Mangrove]
LandType                                                                            [None, Other surfaces, Exposed land, Bare Land]
WaterBodyType                                                               

The translated GeoDataFrame is named `gdf_e`.

## Section 2 - Data Acquisition and Cleaning

### Section 2.1 - Points imported, and points turned into "wide" format for the "9 -> 1" row-per-plot conversion

In [7]:
# Insert code here for Section 2.1
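
The cell above is still a placeholder. As one possible starting point, the sketch below shows a long-to-wide pivot with pandas: points are numbered within each plot, then spread into columns such as `Use_1`, `Use_2`, and so on. The function name `points_to_wide`, the `point_n` helper column, the choice of value columns, and the toy data are all assumptions made for illustration. Plots with duplicated rows would need to be de-duplicated first, otherwise they would produce more than nine point columns.

```python
import pandas as pd

def points_to_wide(points, id_col='plotid', value_cols=('Use', 'CoverType')):
    """Pivot a long table (one row per point) into one row per plot."""
    long = points.copy()
    # Number the points within each plot: 1, 2, ... in their existing row order
    long['point_n'] = long.groupby(id_col).cumcount() + 1
    wide = long.pivot(index=id_col, columns='point_n', values=list(value_cols))
    # Flatten the (variable, point number) column MultiIndex into names like 'Use_1'
    wide.columns = [f'{var}_{n}' for var, n in wide.columns]
    return wide.reset_index()

# Toy data with 3 points per plot instead of 9 (hypothetical values)
toy = pd.DataFrame({
    'plotid': [1, 1, 1, 2, 2, 2],
    'Use': ['Forest', 'Forest', 'Grasslands', 'Wetlands', 'Wetlands', 'Wetlands'],
    'CoverType': ['Vegetation', 'Vegetation', 'Vegetation', 'Water', 'Water', 'Vegetation'],
})
wide = points_to_wide(toy)
print(wide.shape)  # (2, 7): plotid plus 2 variables x 3 points
```

Numbering by row order is only meaningful if the points of a plot are stored in a consistent order. If they are not, sorting by `sampleid` before the `cumcount` step may be a safer choice.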

### Section 2.2 - Exploratory Data Analysis (EDA)

#### Section 2.2.1 Exploration of Plots and Points

In [8]:
# Count of points, count of plots, and ID of NA's (if appropriate).

# Function to calculate unique values of a dataframe's columns and check if it contains NAs
def unique_values_table(dataframe):
    unique_counts = []

    for col in dataframe.columns:
        # Check the number of unique values in this column ('col')
        unique_without_na = dataframe[col].nunique(dropna=True)
        # Check if said column ('col') has any NA values
        has_na = dataframe[col].isna().any()

        # Append the information calculated above into a list of dictionaries
        unique_counts.append({
            'Column Name': col,
            'Unique Values (without NAs)': unique_without_na,
            'Has NAs': has_na
        })

    # Convert results into a DataFrame
    result_df = pd.DataFrame(unique_counts)
    return result_df

# Generate and display the table
unique_table = unique_values_table(gdf_s[['plotid', 'sampleid']])
display(unique_table)
display(gdf_s.shape)

| | Column Name | Unique Values (without NAs) | Has NAs |
|---|---|---|---|
| 0 | plotid | 11233 | False |
| 1 | sampleid | 49469 | False |


(101160, 16)

In summary, out of the **101,160 rows of data** in our GeoDataFrame, there are only:
- 11,233 unique plots,
- 49,469 unique points.

> Assumption 1: Some sampleids are being used more than once

> Question 1: How many sampleids (points) are being used more than one time?

Thankfully, there are no NA values in these two columns.

In [9]:
# Count of plots that have fewer than 9 points per plot (and why)

# Create subsets of gdf_s where the frequency of the plotid/sampleid is calculated
plotid_counts = pd.DataFrame(gdf_s['plotid'].value_counts(dropna=False))
sampleid_counts = pd.DataFrame(gdf_s['sampleid'].value_counts(dropna=False))

# Create subsets of plotid_counts that show plots with fewer than or more than 9 points
plotid_counts_lt9 = plotid_counts[plotid_counts['count'] < 9]
plotid_counts_mt9 = plotid_counts[plotid_counts['count'] > 9]

# Count how many sampleids appear once, twice, three times, etc. (the frequency of each occurrence count)
sampleid_counts_mt0 = sampleid_counts[sampleid_counts['count'] > 0]
sampleid_counts_frequency = pd.DataFrame(sampleid_counts_mt0['count'].value_counts())
sampleid_counts_frequency.rename(columns={ sampleid_counts_frequency.columns[0]: "count frequency" }, inplace=True)

print("There are {number_of_rows} plotid's with less than 9 points per plot:".format(number_of_rows=len(plotid_counts_lt9)))
display(plotid_counts_lt9)

print("There are {number_of_rows} plotid's with more than 9 points per plot:".format(number_of_rows=len(plotid_counts_mt9)))
display(plotid_counts_mt9)

# This answers Question 1: "How many sampleids (points) are being used more than one time?"
# It also confirms Assumption 1
print("There are {number_of_rows} points that appear more than once:".format(number_of_rows=len(sampleid_counts[sampleid_counts['count'] > 1])))
display(sampleid_counts_frequency)

There are 0 plotid's with less than 9 points per plot:


| plotid | count |
|---|---|


There are 7 plotid's with more than 9 points per plot:


| plotid | count |
|---|---|
| 2175 | 18 |
| 4607 | 18 |
| 6916 | 18 |
| 3150 | 18 |
| 3105 | 18 |
| 3163 | 18 |
| 1630 | 18 |


There are 41437 points that appear more than once:


| count | count frequency |
|---|---|
| 2 | 31204 |
| 3 | 10212 |
| 1 | 8032 |
| 4 | 21 |


##### **Recall Question 1: "How many sampleids (points) are being used more than one time?"**
In summary, out of all the unique plots in our data:
- **None** have fewer than nine points per plot
- **Seven** have more than nine points per plot (exactly 18 each)

> Question 2: Why do those seven plots have 18 points instead of nine?

> Question 3: Are all 18 of those points unique, or are some of them repeated?

Out of the 49,469 unique sampleids (points) in our data, 41,437 appear more than once, which answers Question 1:
- **21** points appear four times
- **10,212** points appear three times
- **31,204** points appear two times
- **8,032** points appear one time

> Question 4: Do the points that appear repeatedly have the same longitude and latitude?

In [41]:
# This answers Question 2: "Why do those seven plots have 18 points instead of nine?"
# Also Question 3: "Are all 18 of those points unique, or are some of them repeated?"

# Analyzes the uniqueness of sample IDs associated with each plot ID.
def analyze_plotids(plotids, gdf_s, onlyUniques=False):

    # Create an empty list to store the results
    results = []

    # Iterate over unique plot IDs
    for plotid in plotids:
        # Filter the DataFrame to include only rows for the current plotid
        filtered_data = gdf_s[gdf_s['plotid'] == plotid]

        # Get the array of all points (sampleid) associated with the current plotid
        if onlyUniques:
            points_array = filtered_data['sampleid'].drop_duplicates().tolist()
        else:
            points_array = filtered_data['sampleid'].tolist()

        # Check if all points are unique
        all_points_unique = filtered_data['sampleid'].is_unique

        # Append the plotid, points array, and uniqueness result to the results list
        results.append({
            'plotid': plotid,
            'sampleids (points)': points_array,
            'All points unique': all_points_unique
        })

    # Convert the results list into a new DataFrame
    return pd.DataFrame(results)

# TODO: Look into why so many points appear more than once (Might be found by completing TODO 1)

# Display the results
display(analyze_plotids(plotid_counts_mt9.index, gdf_s))

| | plotid | sampleids (points) | All points unique |
|---|---|---|---|
| 0 | 2175 | [8697, 8697, 8698, 8698, 8699, 8699, 8700, 8700, 8701, 8701, 8702, 8702, 8703, 8703, 8704, 8704, 8705, 8705] | False |
| 1 | 4607 | [18425, 18425, 18426, 18426, 18427, 18427, 18428, 18428, 18429, 18429, 18430, 18430, 18431, 18431, 18432, 18432, 18433, 18433] | False |
| 2 | 6916 | [27661, 27661, 27662, 27662, 27663, 27663, 27664, 27664, 27665, 27665, 27666, 27666, 27667, 27667, 27668, 27668, 27669, 27669] | False |
| 3 | 3150 | [12597, 12597, 12598, 12598, 12599, 12599, 12600, 12600, 12601, 12601, 12602, 12602, 12603, 12603, 12604, 12604, 12605, 12605] | False |
| 4 | 3105 | [12417, 12417, 12418, 12418, 12419, 12419, 12420, 12420, 12421, 12421, 12422, 12422, 12423, 12423, 12424, 12424, 12425, 12425] | False |
| 5 | 3163 | [12649, 12649, 12650, 12650, 12651, 12651, 12652, 12652, 12653, 12653, 12654, 12654, 12655, 12655, 12656, 12656, 12657, 12657] | False |
| 6 | 1630 | [6517, 6517, 6518, 6518, 6519, 6519, 6520, 6520, 6521, 6521, 6522, 6522, 6523, 6523, 6524, 6524, 6525, 6525] | False |


##### **Recall Questions 2 & 3: "Why do those seven plots have 18 points instead of nine; also, are all 18 of those points unique, or are some of them repeated?"**
From the output above, we can tell that in each of these seven plots:
- Each of the 9 sampleids appears exactly twice, which accounts for the 18 rows,
- The sampleids are therefore not all unique.
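
The output above shows that the sampleids are doubled, but not whether the doubled rows are exact copies of each other or carry different attribute values. A minimal sketch of how that could be checked follows. The function name `duplicate_summary` and the toy data are hypothetical. On the real data, the `geometry` column would probably need to be dropped or converted to text first, which is an assumption about how it compares.

```python
import pandas as pd

def duplicate_summary(df, key='sampleid'):
    """For each repeated key, count its rows and how many of those rows are distinct."""
    repeated = df[df.duplicated(subset=key, keep=False)]
    rows = repeated.groupby(key).size()
    # drop_duplicates compares every column, so identical copies collapse to one row
    distinct = repeated.drop_duplicates().groupby(key).size()
    return pd.DataFrame({
        'rows': rows,
        'distinct_rows': distinct,
        'identical': distinct.eq(1),
    })

# Toy data (hypothetical values): 8697 is an exact copy, 8698 differs in 'Use'
toy = pd.DataFrame({
    'sampleid': [8697, 8697, 8698, 8698, 8699],
    'Use': ['Forest', 'Forest', 'Forest', 'Grasslands', 'Wetlands'],
})
result = duplicate_summary(toy)
print(result)
```

Rows flagged `identical` could be safely collapsed with `drop_duplicates`. The others would need a rule for choosing which record to keep.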

In [45]:
# TODO: Answer Question 4: "Do the points that appear repeatedly have the same longitude and latitude?"

# For starters, let's work with the 21 points that appear four times

display(sampleid_counts[sampleid_counts['count'] == 4])

# TODO: Finish this loop
# for sampleid in sampleid_counts[sampleid_counts['count'] == 4].index:
#     Create a subset of gdf_s that contains only the sampleids listed below so we can check their lon/lat values

| sampleid | count |
|---|---|
| 12417 | 4 |
| 12601 | 4 |
| 12657 | 4 |
| 12597 | 4 |
| 12605 | 4 |
| 12653 | 4 |
| 6525 | 4 |
| 12649 | 4 |
| 6521 | 4 |
| 12421 | 4 |
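
Question 4 remains open at this point. One way it might be approached, sketched below, is to count the distinct (`lon`, `lat`) pairs recorded for each repeated sampleid: a count of 1 means every occurrence sits at the same location. The function name `coordinate_consistency` and the toy coordinates are hypothetical. Note that `gdf_s` does not keep the `lon` and `lat` columns, so the original `gdf` would likely be the frame to pass in.

```python
import pandas as pd

def coordinate_consistency(points, id_col='sampleid', coord_cols=('lon', 'lat')):
    """Count the distinct coordinate pairs recorded for each repeated id."""
    repeated = points[points.duplicated(subset=id_col, keep=False)]
    # One row per unique (id, lon, lat) combination
    pairs = repeated[[id_col, *coord_cols]].drop_duplicates()
    n_locations = pairs.groupby(id_col).size().rename('distinct_locations')
    return n_locations.reset_index()

# Toy data (hypothetical coordinates): 12417 repeats at one location, 12601 at two
toy = pd.DataFrame({
    'sampleid': [12417, 12417, 12601, 12601, 6525],
    'lon': [-84.1, -84.1, -83.5, -83.6, -85.0],
    'lat': [10.0, 10.0, 9.5, 9.5, 10.2],
})
locations = coordinate_consistency(toy)
print(locations)
```

This compares coordinates for exact equality. If the stored values differ only by floating-point noise, rounding them to a fixed number of decimals before the comparison may be more appropriate.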


Fill in with summary of above findings

In [19]:
# Map the plots (the 10,000 not the 90,000)
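
Mapping one marker per plot rather than one per point requires a single location for each plot. A minimal sketch, assuming the mean of a plot's point coordinates is an acceptable stand-in for its center, is shown below. The function name `plot_centers` and the toy coordinates are hypothetical.

```python
import pandas as pd

def plot_centers(points, plot_col='plotid', coord_cols=('lon', 'lat')):
    """Average the point coordinates of each plot to get one location per plot."""
    return points.groupby(plot_col)[list(coord_cols)].mean().reset_index()

# Toy data with 2 points per plot (hypothetical coordinates)
toy = pd.DataFrame({
    'plotid': [1, 1, 2, 2],
    'lon': [-84.0, -84.2, -83.0, -83.0],
    'lat': [10.0, 10.2, 9.0, 9.4],
})
centers = plot_centers(toy)
print(centers)
```

The resulting table has one row per plot and could then be handed to a mapping library such as folium. Duplicated points do not shift the mean when every point in a plot is duplicated equally, but uneven duplication would bias it.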

### Section 2.3 - Acquisition of data from GEE

## Section 3 - Utilizing ML Models

### Section 3.1 - Feature Selection

In [12]:
# Insert code here for Section 3.1

### Section 3.2 - Model Research

In [13]:
# Insert code here for Section 3.2

### Section 3.3 - Model Comparisons

In [14]:
# Insert code here for Section 3.3

## Section 4 - Maps / Visualizations / Documentation

### Section 4.1 - Creating the maps

In [15]:
# Insert code here for Section 4.1