# Costa Rica Pipeline Notebook
## {Insert info on notebook purpose}

### Project objective:
- {TODO}

#### Notebook sections:
1. Setup
2. English Translations
3. Clean Observational Data
4. Extract Medoid and NEM Data (TODO: provide a saved data file for readers who don't want to run this step for hours)
5. Create and Extract Predictor Surfaces (pred2)
6. Clean Remote Sensing Data
7. Preprocessing
8. Build a Predictive Model

#### Data sources
- {TODO}

## Section 1: Setup
### Installing software

In [2]:
!pip install pandas numpy geopandas seaborn scikit-learn tensor folium matplotlib mapclassify earthengine-api geemap



### Importing packages

In [18]:
from geopandas import GeoDataFrame
from shapely.geometry import Point
import geopandas as gpd
import pandas as pd
import numpy as np
import pprint
import geemap
import ee
import folium

### Importing our Dataset

In [19]:
# Import the observational data from the provided ZIP file (which contains the shapefile) into a GeoDataFrame
gdf = gpd.read_file('Data/Classification_Plots.zip')

# Display information about the GeoDataFrame to confirm the contents are what we expected
display(gdf.crs)
display(gdf.columns)

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

Index(['Source.Nam', 'plotid', 'sampleid', 'lon', 'lat', 'sample_geo', 'Uso',
       'Cobertura', 'Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo',
       'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF', 'Cambios15_',
       'Gana_Perdi', 'geometry'],
      dtype='object')

In [20]:
# Extra analysis, remove later
unique_values = {'Otra_clase': gdf['Otra_clase'].unique().tolist()}
display(unique_values)

unique_values = {'Cobertura': gdf['Cobertura'].unique().tolist()}
display(unique_values)

{'Otra_clase': [None,
  'Edificado / Desarrollado',
  'Suelo desnudo',
  'Nubes',
  'Sombra de nubes',
  'Paramo',
  'Playas y arenales']}

{'Cobertura': ['Vegetacion',
  'Sin vegetacion',
  'Agua',
  'Sin informacion',
  'Nubes y sombras',
  None]}

## Section 2: English Translations
### {subheading}

In [21]:
# Insert code here for Section 2
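
Section 2 is still a placeholder. One possible approach, sketched here on a toy frame rather than the real `gdf`, is to map each Spanish label to an English one with a dictionary. The `Cobertura_en` column name and the English wording are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for gdf; values are the 'Cobertura' labels listed in Section 1
df = pd.DataFrame({
    "Cobertura": ["Vegetacion", "Sin vegetacion", "Agua",
                  "Sin informacion", "Nubes y sombras", None]
})

# Assumed English labels (hypothetical wording, adjust as needed)
cobertura_en = {
    "Vegetacion": "Vegetation",
    "Sin vegetacion": "No vegetation",
    "Agua": "Water",
    "Sin informacion": "No information",
    "Nubes y sombras": "Clouds and shadows",
}

# Series.map keeps missing values missing; labels absent from the dict also become NaN
df["Cobertura_en"] = df["Cobertura"].map(cobertura_en)
print(df)
```

Writing to a new column instead of overwriting `Cobertura` keeps the original labels available for checking the translation.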

## Section 3: Clean Observational Data
### Uso's Null Values & Cobertura's 'Sin informacion'

In [22]:
# Creates a subset of the data that only contains columns that had null/empty values (from previous output)
# and the columns we are interested in (i.e. not Cambios15_ and Gana_Perdi)
subset_gdf = gdf[['Uso', 'Cobertura', 'Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo', 'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF']]

# Count the rows in which every column of the subset is NA
na_rows = subset_gdf[subset_gdf.isna().all(axis=1)]
print('number of rows with all na =', na_rows.shape[0])

# Display the dataframe of the rows which contain all na values (for confirmation)
display(na_rows)

number of rows with all na = 216


Unnamed: 0,Uso,Cobertura,Vegetacion,Herbaceas,Pasto_Arb,Cultivo,Humedal,Terreno,Agua,Otra_clase,SAF
14112,,,,,,,,,,,
14113,,,,,,,,,,,
14114,,,,,,,,,,,
14115,,,,,,,,,,,
14116,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
100201,,,,,,,,,,,
100202,,,,,,,,,,,
100203,,,,,,,,,,,
100204,,,,,,,,,,,


In [23]:
# Creates a subset of our 'subset_gdf' where Uso and Cobertura have the value 'Sin informacion'
# We do this because we want to check if these rows have any useful information (as 'Sin informacion' means no information)
filtered_rows = subset_gdf[subset_gdf['Uso'] == 'Sin informacion']
filtered_rows = filtered_rows[filtered_rows['Cobertura'] == 'Sin informacion']

# Creates a dictionary of the unique values of all the columns in filtered_rows
unique_values = {col: filtered_rows[col].unique().tolist() for col in filtered_rows.columns}

# Check the unique values of the columns in filtered_rows
pprint.pprint(unique_values, sort_dicts=False)
# Every other column is empty for these rows, so they carry no usable information and can be dropped

{'Uso': ['Sin informacion'],
 'Cobertura': ['Sin informacion'],
 'Vegetacion': [None],
 'Herbaceas': [None],
 'Pasto_Arb': [None],
 'Cultivo': [None],
 'Humedal': [None],
 'Terreno': [None],
 'Agua': [None],
 'Otra_clase': [None],
 'SAF': [None]}
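
The two-step filter above can also be written as a single boolean mask, followed by a direct check that every remaining column is empty. This is a minimal sketch on a toy frame; `no_info_rows`, `all_empty`, and the `Vegetacion` value are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for subset_gdf ("Arboles" is a made-up label)
df = pd.DataFrame({
    "Uso": ["Sin informacion", "Bosque", "Sin informacion"],
    "Cobertura": ["Sin informacion", "Sin informacion", "Nubes y sombras"],
    "Vegetacion": [None, "Arboles", None],
})

# Combine both conditions with & (each comparison needs its own parentheses)
mask = (df["Uso"] == "Sin informacion") & (df["Cobertura"] == "Sin informacion")
no_info_rows = df[mask]

# True only if every column other than Uso and Cobertura is NA in every selected row
all_empty = no_info_rows.drop(columns=["Uso", "Cobertura"]).isna().all().all()
print(no_info_rows.index.tolist(), all_empty)
```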


In [24]:
# Create a subset of 'gdf' without the na_rows previously found
gdf_cleaned = gdf.drop(na_rows.index)

# Drops the rows from 'gdf_cleaned' that had 'Sin informacion' and null/empty values (i.e. filtered_rows)
gdf_cleaned = gdf_cleaned.drop(filtered_rows.index)

# Time to check the null presence and counts in each column
nulls_in_columns = gdf_cleaned.isna().any()
null_counts = gdf_cleaned.isna().sum()

# Print the results in a visually aligned format
print(f"{'Column':<20}{'Contains Nulls':<15}\t\t\t{'Null Count':<10}")
print("-" * 55)

for column in gdf_cleaned.columns:
    has_null = nulls_in_columns[column]
    null_count = null_counts[column]
    print(f"{column:<20}{str(has_null):<15}\t\t\t{null_count:<10}")
    
# There are now no nulls in 'Uso', 'Cobertura', or (coincidentally) 'Cambios15_'.
# The next smallest null count belongs to 'Vegetacion', so that is the column we examine next.
# 'Vegetacion' is also of interest because it applies to every type of 'Uso', so it is expected to always be filled in.

Column              Contains Nulls 			Null Count
-------------------------------------------------------
Source.Nam          False          			0         
plotid              False          			0         
sampleid            False          			0         
lon                 False          			0         
lat                 False          			0         
sample_geo          False          			0         
Uso                 False          			0         
Cobertura           False          			0         
Vegetacion          True           			6685      
Herbaceas           True           			76378     
Pasto_Arb           True           			77552     
Cultivo             True           			92445     
Humedal             True           			96917     
Terreno             True           			95285     
Agua                True           			99956     
Otra_clase          True           			95394     
SAF                 True           			92445     
Cambios15_          False          			0         
Gana_Perdi   
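
The per-column null report above could alternatively be collected into one small DataFrame, which sorts and displays without manual string alignment. A sketch on toy data; the `null_summary` name and the sample values are assumptions.

```python
import pandas as pd

# Toy frame standing in for gdf_cleaned ("Arboles" is a made-up label)
df = pd.DataFrame({
    "Uso": ["Bosque", "Bosque", "Sin informacion"],
    "Vegetacion": ["Arboles", None, "Arboles"],
    "Agua": [None, None, None],
})

# One row per column: whether it has any nulls, and how many
null_summary = pd.DataFrame({
    "contains_nulls": df.isna().any(),
    "null_count": df.isna().sum(),
}).sort_values("null_count")

print(null_summary)
```

Sorting by `null_count` puts the columns with the fewest gaps first, which matches the order in which the notebook investigates them.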

### Looking into Vegetacion

In [25]:
# Creates a subset of the data that only contains columns that had null/empty values (from previous output) and the columns we are interested in
subset_gdf = gdf_cleaned[['Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo', 'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF']]
sliced_gdf = gdf_cleaned[['Uso', 'Cobertura', 'Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo', 'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF', 'Cambios15_', 'Gana_Perdi']]

# Count the rows in which every class column is NA
na_rows = subset_gdf[subset_gdf.isna().all(axis=1)]
print('number of rows with all na =', na_rows.shape[0])
# Display those rows, together with their 'Uso', 'Cobertura', and change columns, for confirmation
display(sliced_gdf.loc[na_rows.index])

# Creates a dictionary of the unique values of all the columns in the all-NA rows
#unique_values = {col: sliced_gdf.loc[na_rows.index][col].unique().tolist() for col in sliced_gdf.loc[na_rows.index].columns}

# Check the unique values of the columns in the all-NA rows
#pprint.pprint(unique_values, sort_dicts=False)

# Open question: should we keep these records in the dataframe?
# Apart from 'Cambios15_', they provide no data outside of 'Uso' and 'Cobertura', and even there the values are limited to 'Sin informacion', 'Bosque', and 'Nubes y sombras'.

number of rows with all na = 54


Unnamed: 0,Uso,Cobertura,Vegetacion,Herbaceas,Pasto_Arb,Cultivo,Humedal,Terreno,Agua,Otra_clase,SAF,Cambios15_,Gana_Perdi
12290,Bosque,Sin informacion,,,,,,,,,,No,
12291,Bosque,Sin informacion,,,,,,,,,,No,
12638,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12639,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12640,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12641,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12642,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12643,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12644,Sin informacion,Nubes y sombras,,,,,,,,,,No se determina,
12938,Sin informacion,Nubes y sombras,,,,,,,,,,No,


In [26]:
# Insert code here for Section 3

## Section 4: Extract Medoid and NEM Data
### Forewarning

The following two sections (Sections 4 and 5) take at least an hour to run and require setting up a Google Earth Engine project under your Google account.
If you'd prefer to skip them, uncomment the code below, run it, and continue from Section 6.
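
A guarded load along the following lines is one way the skip path might work: use a saved file when it exists, otherwise fall back to running Sections 4 and 5. The file path and the `load_cached_surfaces` helper are hypothetical; the notebook does not yet name the saved file.

```python
from pathlib import Path

# Hypothetical location of the saved data; the notebook does not specify one
CACHED_PATH = Path("Data/precomputed_medoid_nem_predictors.gpkg")

def load_cached_surfaces(path=CACHED_PATH):
    """Return the saved GeoDataFrame, or None if the file is not present."""
    path = Path(path)
    if not path.exists():
        return None
    import geopandas as gpd  # imported lazily so the existence check runs without it
    return gpd.read_file(path)

cached_gdf = load_cached_surfaces()
if cached_gdf is None:
    print("No saved file found; run Sections 4 and 5 to build the data.")
else:
    print(f"Loaded {len(cached_gdf)} rows from {CACHED_PATH}")
```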

In [27]:
# Insert code here that imports a saved GeoDataFrame with Medoid, NEM, and Predictor Costa Rican Data

TODO: Add comments about what the cell above does

In [28]:
# Insert code here from John's notebook on Medoid and NEM Costa Rican Data

## Section 5: Create and Extract Predictor Surfaces
### {subheading}

In [29]:
# Insert code here from John's notebook on Predictor Values and Surfaces

## Section 6: Clean Remote Sensing Data
### {subheading}

In [30]:
# Insert code here that cleans the Remote Sensing Costa Rican Data and discusses any observations

## Section 7: Preprocessing
### {subheading}

In [31]:
# Insert code here that prepares the data for modeling

## Section 8: Build a Predictive Model
### {subheading}

In [32]:
# Insert code here that builds the Predictive Model