# 📌 Benefits of a Modular Pipeline for Geospatial Data Processing

## Why Modularity?  
A well-structured pipeline should be modular to improve **maintainability**, **scalability**, and **reusability**. Breaking the pipeline into independent components, rather than relying on monolithic functions, offers several advantages:

### ✅ **1. Easier Maintenance**
- Each part of the pipeline (data loading, preprocessing, ML model training, validation) can be **modified independently**.
- A bug in one module is easier to isolate and fix, because each module can be tested on its own.

### 🚀 **2. Reusability Across Different Datasets**
- A modular design allows us to **swap out components** (e.g., use different ML models or data formats) with minimal changes.
- Functions and classes can be reused in **multiple projects**.

### 📊 **3. Scalability**
- If we need to **extend functionality** (e.g., add more preprocessing steps), a modular structure allows us to do so **without rewriting everything**.
- Independent steps with no shared state are also easier to run in parallel as the pipeline grows.

### 🛠 **4. Flexibility in Experimentation**
- Researchers and engineers can test **different preprocessing techniques** or **models** without affecting other pipeline components.

---

## 🔥 Moving Forward with This Modular Approach
- We'll **separate data processing, ML models, and validation** into independent modules.
- This should make it easier to plug individual modules into other machine learning workflows.
- Future enhancements (e.g., cloud storage support, dataset filtering) can be added **without breaking the existing pipeline**.
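
As a rough illustration of this separation, the sketch below treats each stage as a plain function and chains the stages in a small runner. The names `run_pipeline`, `drop_missing`, and `normalize_labels`, as well as the toy records, are hypothetical and not part of this notebook; real stages would wrap the loading, cleaning, and modelling functions.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def drop_missing(records: List[Record]) -> List[Record]:
    """Keep only records that have a non-empty 'Uso' label."""
    return [r for r in records if r.get("Uso")]


def normalize_labels(records: List[Record]) -> List[Record]:
    """Strip whitespace and lowercase the 'Uso' label of every record."""
    return [{**r, "Uso": str(r["Uso"]).strip().lower()} for r in records]


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order; steps can be added, removed, or swapped freely."""
    for step in steps:
        records = step(records)
    return records


# Hypothetical toy records, for illustration only
raw = [
    {"plotid": 1, "Uso": " Bosque "},
    {"plotid": 2, "Uso": None},
    {"plotid": 3, "Uso": "Pasto"},
]
cleaned = run_pipeline(raw, [drop_missing, normalize_labels])
print(cleaned)
```

Because every step takes and returns the same kind of object, a step can be tested in isolation or replaced without touching the others.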



# Loading and Inspecting Geospatial Data

In [1]:
import geopandas as gpd
from typing import Optional
from IPython.display import display  # explicit import so the function also works outside a notebook

def load_and_inspect_geodata(file_path: str, view: Optional[str] = None, num_rows: int = 5) -> gpd.GeoDataFrame:
    """
    Load a shapefile or ZIP-compressed spatial dataset into a GeoDataFrame and inspect it.

    Parameters:
    - file_path (str): Path to the file (shapefile, ZIP, etc.).
    - view (Optional[str]): 'head' to show the first rows, 'tail' to show the last rows. Default is None (show no rows).
    - num_rows (int): Number of rows to display if 'head' or 'tail' is selected.

    Returns:
    - gdf (GeoDataFrame): Loaded GeoDataFrame.
    """
    
    # load GeoDataFrame
    gdf = gpd.read_file(file_path)

    # display basic information
    print("Coordinate Reference System (CRS):", gdf.crs)
    print("Columns:", list(gdf.columns))
    
    # view requested elements
    if view == "head":
        display(gdf.head(num_rows))
    elif view == "tail":
        display(gdf.tail(num_rows))

    return gdf  # return the GeoDataFrame for further processing


In [4]:
gdf = load_and_inspect_geodata("../../Costa_Rica_Data/Classification_Plots.zip", view="head", num_rows=10)

Coordinate Reference System (CRS): EPSG:4326
Columns: ['Source.Nam', 'plotid', 'sampleid', 'lon', 'lat', 'sample_geo', 'Uso', 'Cobertura', 'Vegetacion', 'Herbaceas', 'Pasto_Arb', 'Cultivo', 'Humedal', 'Terreno', 'Agua', 'Otra_clase', 'SAF', 'Cambios15_', 'Gana_Perdi', 'geometry']


Unnamed: 0,Source.Nam,plotid,sampleid,lon,lat,sample_geo,Uso,Cobertura,Vegetacion,Herbaceas,Pasto_Arb,Cultivo,Humedal,Terreno,Agua,Otra_clase,SAF,Cambios15_,Gana_Perdi,geometry
101150,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904913,3619657,-83.937443,10.083982,POINT(-83.93744280866356 10.083981530646781),Bosque,Vegetacion,Arboles,,,,,,,,,No,,POINT (-83.93744 10.08398)
101151,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619573,-83.843396,10.028786,POINT(-83.84339619133644 10.028786397378227),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.8434 10.02879)
101152,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619574,-83.843396,10.029211,POINT(-83.84339619133644 10.029211000000004),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.8434 10.02921)
101153,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619575,-83.843396,10.029636,POINT(-83.84339619133644 10.029635602065312),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.8434 10.02964)
101154,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619576,-83.842965,10.028786,POINT(-83.842965 10.028786397378227),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84296 10.02879)
101155,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619577,-83.842965,10.029211,POINT(-83.842965 10.029211000000004),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84296 10.02921)
101156,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619578,-83.842965,10.029636,POINT(-83.842965 10.029635602065312),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84296 10.02964)
101157,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619579,-83.842534,10.028786,POINT(-83.84253380866357 10.028786397378227),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84253 10.02879)
101158,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619580,-83.842534,10.029211,POINT(-83.84253380866357 10.029211000000004),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84253 10.02921)
101159,ceo-ACTo-puntos-Mapa-de-tipos-de-Bosque-y-otra...,904894,3619581,-83.842534,10.029636,POINT(-83.84253380866357 10.029635602065312),Bosque,Vegetacion,Arboles,,,,,,,,,No se determina,,POINT (-83.84253 10.02964)
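
Later steps depend on particular fields being present, so a quick check right after loading can fail early with a clear message. This is a minimal sketch, assuming a hypothetical helper `missing_columns` and a hypothetical choice of required fields (`Uso_2015` is invented to show a failing case); it works on any list of column names, such as `list(gdf.columns)`.

```python
from typing import Iterable, List


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in the order requested."""
    present = set(columns)
    return [name for name in required if name not in present]


# A subset of the column names printed above
columns = ["plotid", "sampleid", "lon", "lat", "Uso", "Cobertura", "geometry"]

print(missing_columns(columns, ["plotid", "Uso", "geometry"]))  # nothing missing
print(missing_columns(columns, ["plotid", "Uso_2015"]))         # hypothetical missing field
```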


# Ensuring Valid Geometries for Geospatial Analysis

In [6]:
import geopandas as gpd
import pandas as pd
import folium
from shapely.geometry import Point

def ensure_geometry(df: pd.DataFrame, geometry_col: str = "geometry") -> gpd.GeoDataFrame:
    """
    Ensures that the specified column contains valid Point objects 
    and converts the DataFrame to a GeoDataFrame.

    Parameters:
    - df (pd.DataFrame): Input DataFrame with a geometry column.
    - geometry_col (str): Name of the column containing geometry data.

    Returns:
    - gdf (gpd.GeoDataFrame): A properly formatted GeoDataFrame.
    """
    
    def convert_to_point(coord):
        """Helper function to convert WKT-style strings into Point objects."""
        if isinstance(coord, Point):
            return coord  # Already a Point object, return as is

        try:
            # Take the text between the parentheses so that both
            # "POINT (x y)" and "POINT(x y)" are accepted
            text = str(coord)
            coords = text[text.index("(") + 1:text.rindex(")")].split()
            # WKT order is x then y (longitude then latitude for geographic data)
            return Point(float(coords[0]), float(coords[1]))
        except (ValueError, IndexError) as e:
            print(f"Warning: Could not convert {coord!r} to Point. Error: {e}")
            return None  # Handle cases where conversion fails

    df = df.copy()  # avoid modifying the caller's DataFrame

    # apply conversion function to the geometry column
    df[geometry_col] = df[geometry_col].apply(convert_to_point)

    # convert to a GeoDataFrame
    gdf = gpd.GeoDataFrame(df, geometry=geometry_col)

    return gdf  # return cleaned GeoDataFrame

# Example Usage
# WKT order is "x y", i.e. longitude first, then latitude
df = pd.DataFrame({'geometry': ['POINT (-84.07 9.92)', 'POINT (-83.9 10.5)']})  # example data
gdf = ensure_geometry(df)
print(gdf['geometry'].head())

0    POINT (-84.07 9.92)
1     POINT (-83.9 10.5)
Name: geometry, dtype: geometry


# Visualizing Geospatial Data with Folium

In [7]:
# create a folium map centered on a general location
m = folium.Map(location=[10.0, -84.0], zoom_start=6)

# add points from GeoDataFrame to the map
for _, row in gdf.iterrows():
    folium.Marker(
        location=[row.geometry.y, row.geometry.x],  # extract lat/lon from Point
        popup=f"Point: {row.geometry}"
    ).add_to(m)

# Display the map
m
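
The map above is centered on a hard-coded location. One alternative is to derive the center from the points themselves; the sketch below does this with a hypothetical helper `map_center`. It uses a small `XY` named tuple as a stand-in for shapely `Point` objects, which expose the same `x` (longitude) and `y` (latitude) attributes, and it returns the `[lat, lon]` order that `folium.Map(location=...)` expects.

```python
from typing import Iterable, List, NamedTuple, Optional


class XY(NamedTuple):
    """Stand-in for a shapely Point: x is longitude, y is latitude."""
    x: float
    y: float


def map_center(points: Iterable[Optional[XY]]) -> List[float]:
    """Return the mean position of the points as [lat, lon], skipping None entries."""
    pts = [p for p in points if p is not None]
    if not pts:
        raise ValueError("no valid points to center the map on")
    lat = sum(p.y for p in pts) / len(pts)
    lon = sum(p.x for p in pts) / len(pts)
    return [lat, lon]


# Two illustrative points in (lon, lat) order, plus one failed conversion
center = map_center([XY(-84.07, 9.92), XY(-83.9, 10.5), None])
print(center)
```

With a GeoDataFrame of points, `map_center(gdf.geometry)` should give a value that can be passed as `location` when creating the map.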

# Cleaning the Numerical Data

In [10]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def clean_numerical_data(df: pd.DataFrame, num_cols: list, fill_method: str = "mean", scale: bool = True) -> pd.DataFrame:
    """
    Cleans numerical columns by handling missing values, removing outliers, and optionally scaling.

    Parameters:
    - df (pd.DataFrame): The input DataFrame.
    - num_cols (list): List of numerical column names.
    - fill_method (str): Strategy for filling missing values ("mean", "median", or "zero"). Default is "mean".
    - scale (bool): Whether to standardize numerical features using StandardScaler. Default is True.

    Returns:
    - df (pd.DataFrame): Cleaned DataFrame with processed numerical columns.
    """
    
    df = df.copy()  # avoid modifying original DataFrame

    # validate the fill strategy up front instead of silently skipping the fill
    if fill_method not in ("mean", "median", "zero"):
        raise ValueError(f"Unknown fill_method: {fill_method!r}")

    # handle missing values
    for col in num_cols:
        if fill_method == "mean":
            df[col] = df[col].fillna(df[col].mean())
        elif fill_method == "median":
            df[col] = df[col].fillna(df[col].median())
        else:  # "zero"
            df[col] = df[col].fillna(0)

    # remove outliers (rows with values beyond 3 standard deviations of the column mean)
    for col in num_cols:
        std_dev = df[col].std()
        mean = df[col].mean()
        # .copy() avoids a possible SettingWithCopyWarning when the filtered frame is modified below
        df = df[(df[col] >= mean - 3 * std_dev) & (df[col] <= mean + 3 * std_dev)].copy()

    # scale numerical features if enabled
    if scale:
        scaler = StandardScaler()
        df[num_cols] = scaler.fit_transform(df[num_cols])

    print("✅ Numerical data cleaned successfully!")
    return df
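
One caveat of the 3-standard-deviation rule is worth checking on small tables. With the sample standard deviation (the default of pandas `.std()`), the largest possible z-score among n values is (n - 1) / sqrt(n). This stays below 3 until n reaches 11, so the filter cannot remove anything from a table of 10 or fewer rows. The sketch below demonstrates this with the standard library only; `max_abs_zscore` is a hypothetical helper and the values are made up.

```python
import math
import statistics
from typing import Sequence


def max_abs_zscore(values: Sequence[float]) -> float:
    """Largest absolute z-score in the sample, using the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # divides by n - 1, like pandas .std()
    return max(abs(v - mean) / std for v in values)


# One extreme value among otherwise identical values (made-up data)
ten = [1.0] * 9 + [1000.0]
eleven = [1.0] * 10 + [1000.0]

# With 10 values the extreme point sits at the theoretical bound, which is below 3
print(max_abs_zscore(ten), (10 - 1) / math.sqrt(10))

# With 11 values the same kind of point exceeds 3 and would be filtered out
print(max_abs_zscore(eleven), (11 - 1) / math.sqrt(11))
```

For small samples, a rule based on the median or on quantiles may be more appropriate, though that choice depends on the data.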