# ___

# [ Machine Learning in Geosciences ]

**Department of Applied Geoinformatics and Cartography, Charles University** 

*Lukas Brodsky lukas.brodsky@natur.cuni.cz*
    
___

#  Essential Python Data Science Techniques in **NumPy, Pandas and Matplotlib**

# 1: Introduction & Setup

## End-to-End Data Processing: Glacial Valley Topography


In this lab, we will simulate, clean, and visualize the cross-sectional topographic profile of a U-shaped glacial valley. Real-world elevation data (e.g., from drone LiDAR) is often noisy and contains missing values. We will process this data using the fundamental data science triad: **NumPy, Pandas, and Matplotlib**.

Learning Objectives:

1. Formulate non-linear spatial data using NumPy and linear algebra.
2. Manage tabular data and impute missing values using Pandas.
3. Create publication-ready geoscience figures using Matplotlib.

In [None]:
# Import the required data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set a random seed so all students get the exact same "random" noise
np.random.seed(42)

# 2: Simulating the Valley using Linear Algebra

**Data Simulation (NumPy & Linear Algebra)**

A U-shaped glacial valley can be approximated by a **second-degree polynomial** (parabola):
$$ y = ax^2 + bx + c $$

Where:
* **$y$** is the elevation.
* **$x$** is the distance across the valley.
* **$a$** controls the steepness of the valley walls.
* **$b$** controls the horizontal shift (symmetry).
* **$c$** is the base elevation of the valley floor.

Instead of calculating this point-by-point, we can use **Linear Algebra** to calculate the entire valley profile at once. We will build a **Design Matrix A** containing our variables, and a **Weight Vector W** containing our coefficients. The true elevation is the **dot product of A and W**.

* **$X$** is the distance across the valley (meters).
* **$W$** is the vector of coefficients $[a, b, c]$.

$$ Y_{true} = A \cdot W $$

In [None]:
# 1. Generate the spatial domain: 500 points from -1000m to 1000m
# Use np.linspace(), which returns evenly spaced numbers over a specified interval.
# By default, the interval includes both the start value and the end value.
# X = np.linspace()
pass 

In [None]:
# Explore NumPy method linspace() 
pass 
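
As a minimal sketch of how `np.linspace()` behaves, the example below builds a tiny five-point grid and then a 500-point grid over the same interval. The names `x_demo`, `x_lab`, and `step` are illustrative and not part of the lab solution.

```python
import numpy as np

# Five evenly spaced points from -1000 m to 1000 m (both endpoints included)
x_demo = np.linspace(-1000, 1000, 5)
print(x_demo.tolist())   # [-1000.0, -500.0, 0.0, 500.0, 1000.0]

# retstep=True additionally returns the spacing between neighbouring points
x_lab, step = np.linspace(-1000, 1000, 500, retstep=True)
print(x_lab.shape)       # (500,)
print(step)              # roughly 4.008 m, i.e. 2000 / 499
```

Note that 500 points span 499 intervals, which is why the spacing is 2000 / 499 rather than 2000 / 500.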

First, we construct the **Design Matrix** ($A$). For $n$ spatial points across the valley, $A$ is an $n \times 3$ matrix where the columns represent the polynomial features ($x^2$, $x$, and a constant $1$ for the intercept):

$$ A = \begin{bmatrix} 
x_1^2 & x_1 & 1 \\ 
x_2^2 & x_2 & 1 \\ 
\vdots & \vdots & \vdots \\ 
x_n^2 & x_n & 1 
\end{bmatrix} $$
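
To make the matrix layout concrete, here is a small sketch that builds the design matrix for just three points. The names `x_demo` and `A_demo` are hypothetical and chosen so they do not clash with the lab variables.

```python
import numpy as np

# Three illustrative distances across the valley
x_demo = np.array([-2.0, 0.0, 2.0])

# Stack the polynomial features as columns: [x^2, x, 1]
A_demo = np.column_stack((x_demo**2, x_demo, np.ones_like(x_demo)))

print(A_demo.shape)     # (3, 3): one row per point, one column per feature
print(A_demo.tolist())  # [[4.0, -2.0, 1.0], [0.0, 0.0, 1.0], [4.0, 2.0, 1.0]]
```

`np.ones_like(x_demo)` creates the column of ones with the same shape and dtype as `x_demo`, so the three columns always have matching lengths.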

In [None]:
# 2. Linear Algebra Setup: Build the Design Matrix (N x 3)
# Columns: [X^2, X, 1] (The '1' represents the intercept/base elevation)
# Use NumPy column_stack() and ones_like()
# A = np.column_stack((...))
pass 

In [None]:
# Explore NumPy method column_stack()
pass 

Next, we define our **Weight Vector** ($W$), which contains our geologic coefficients $[a, b, c]$. It is a $3 \times 1$ column vector:

$$ W = \begin{bmatrix} a \\ b \\ c \end{bmatrix} $$


In [None]:
# 3. Define the Weight Vector (coefficients for a U-shaped valley)
# a = 0.0005 (steepness), b = 0 (symmetry), c = 1200 (base elevation in meters)
# Use NumPy method array(), which constructs an array from a list
# W = np.array([...])
pass 

In [None]:
# Explore NumPy method array() 
pass 

By taking the dot product of $A$ and $W$, we get the $n \times 1$ vector of our true elevations ($Y_{true}$):

$$ Y_{true} = \begin{bmatrix} 
x_1^2 & x_1 & 1 \\ 
x_2^2 & x_2 & 1 \\ 
\vdots & \vdots & \vdots \\ 
x_n^2 & x_n & 1 
\end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} $$
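
The sketch below checks, on three points only, that the dot product gives the same result as evaluating the polynomial directly. It uses the coefficient values named in this lab; the `_demo` variable names are illustrative. It also assumes the weights are stored as a 1-D array of shape `(3,)`, in which case the result is a 1-D array as well, which is convenient later when the values become DataFrame columns.

```python
import numpy as np

# Left wall, valley floor, right wall
x_demo = np.array([-1000.0, 0.0, 1000.0])
A_demo = np.column_stack((x_demo**2, x_demo, np.ones_like(x_demo)))

# Coefficients [a, b, c] as a 1-D array
W_demo = np.array([0.0005, 0.0, 1200.0])

y_dot = np.dot(A_demo, W_demo)   # matrix-vector product
y_at = A_demo @ W_demo           # the @ operator is equivalent here
y_direct = 0.0005 * x_demo**2 + 0.0 * x_demo + 1200.0

print(y_dot.shape)                    # (3,)
print(np.allclose(y_dot, y_direct))   # True
print(np.allclose(y_dot, y_at))       # True
```

With these coefficients the valley floor sits at about 1200 m and the walls at about 1700 m, 1000 m away from the center.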

In [None]:
# 4. Matrix Multiplication to get True Elevation (Y_true)
# Calculate Y_true as dot product of A and W. Use NumPy np.dot()
# Y_true = np.dot(..)
pass 

In [None]:
# Explore NumPy method dot()
pass 

In [None]:
# 5. Inject Environmental Noise (Simulating rough terrain and sensor noise)
# generate noise using NumPy np.random.normal() 
# loc=0 (Mean), scale=35 (Standard deviation - spread), size=X.shape 
# noise = np.random.normal()
pass 
# add the noise to the true data to simulate measured data 
# Y_measured = Y_true + noise
pass 

In [None]:
print(f"Shape of Design Matrix A: {A.shape}")
print(f"Shape of Weight Vector W: {W.shape}")

In [None]:
# Explore NumPy method random.normal()
pass 
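
A quick way to get a feel for `np.random.normal()` is to draw a large sample and inspect its statistics. This is a sketch with an illustrative sample size of 100,000 (the lab itself uses `size=X.shape`); `noise_demo` and `noise_again` are hypothetical names.

```python
import numpy as np

np.random.seed(42)  # same seeding call as in the setup cell
noise_demo = np.random.normal(loc=0, scale=35, size=100_000)

print(noise_demo.shape)   # (100000,)
# The sample mean and standard deviation are close to loc and scale, not exact
print(noise_demo.mean(), noise_demo.std())

# Re-seeding reproduces exactly the same "random" numbers
np.random.seed(42)
noise_again = np.random.normal(loc=0, scale=35, size=100_000)
print(np.array_equal(noise_demo, noise_again))  # True
```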

# 3: Structuring and Cleaning Messy Data

**Data Management & Cleaning (Pandas)** 

Geoscience data is rarely perfect. Sensors fail, water surfaces absorb LiDAR pulses, and errors occur. We will move our raw NumPy arrays into a structured Pandas DataFrame.

We will artificially drop 5% of our sensor readings to simulate missing data (NaNs). Then, we will use Pandas to:

1. Interpolate (fill in) the missing gaps.
2. Apply a rolling window mean to smooth out the high-frequency sensor noise.

In [None]:
# 1. Create a Pandas DataFrame
# Use the pd.DataFrame() constructor and pass a dictionary of data; the keys become the column names
df = pd.DataFrame({
    'Distance_m': X,
    'True_Elevation_m': Y_true,
    'Measured_Elevation_m': Y_measured
})

In [None]:
# Explore pandas method DataFrame() 
pass 

In [None]:
# Explore DataFrame columns 
pass 
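
As a small illustration of how the dictionary passed to `pd.DataFrame()` maps to columns, the sketch below builds a five-row frame. The data and the name `df_demo` are made up for this example.

```python
import numpy as np
import pandas as pd

x_demo = np.linspace(-1000, 1000, 5)

# Each dictionary key becomes a column name; each 1-D array becomes a column
df_demo = pd.DataFrame({
    "Distance_m": x_demo,
    "True_Elevation_m": 0.0005 * x_demo**2 + 1200,
})

print(df_demo.columns.tolist())  # ['Distance_m', 'True_Elevation_m']
print(df_demo.shape)             # (5, 2)
print(df_demo.dtypes)            # both columns hold floating-point values
```

The arrays passed in must be one-dimensional and of equal length, otherwise the constructor raises an error.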

In [None]:
# 2. Simulate Sensor Failure: Randomly replace 5% of measurements with NaN
drop_indices = np.random.choice(df.index, size=int(0.05 * len(df)), replace=False)

In [None]:
# Explore the drop_indices array
pass 

In [None]:
# replace the data with NaN 
df.loc[drop_indices, 'Measured_Elevation_m'] = np.nan
print(f"Missing data points before cleaning: {df['Measured_Elevation_m'].isna().sum()}")

In [None]:
# Explore NumPy method random.choice()
pass 

In [None]:
# Explore Pandas indexing df.index 
pass 

In [None]:
# Explore Pandas df.loc[] 
pass 
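
The three pieces used above (`df.index`, `np.random.choice()`, and `df.loc[]`) can be tried together on a small frame. This is a sketch with made-up values; the 25% drop rate and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df_demo = pd.DataFrame({
    "Distance_m": np.arange(20.0),
    "Measured_Elevation_m": np.arange(20.0) + 1200.0,
})

# df.index holds the row labels (0..19 here); pick 25% of them without repeats
drop_demo = np.random.choice(df_demo.index, size=int(0.25 * len(df_demo)), replace=False)

# df.loc[row_labels, column_label] selects by label and allows assignment
df_demo.loc[drop_demo, "Measured_Elevation_m"] = np.nan

print(sorted(drop_demo.tolist()))                     # the five dropped row labels
print(df_demo["Measured_Elevation_m"].isna().sum())   # 5
print(df_demo["Distance_m"].isna().sum())             # 0, other columns are untouched
```

`replace=False` matters: with replacement, the same row could be drawn twice and fewer than the intended number of values would be removed.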

In [None]:
# 3. Data Imputation: Fill missing values using polynomial interpolation
# The pandas method interpolate() fills NaN values; here it uses second-order polynomial interpolation.
# Note: method='polynomial' relies on SciPy, so SciPy must be installed.
df['Cleaned_Elevation_m'] = df['Measured_Elevation_m'].interpolate(method='polynomial', order=2)

In [None]:
print(f"Missing data points after cleaning: {df['Cleaned_Elevation_m'].isna().sum()}")

In [None]:
# Explore Pandas methods isna() and sum() 
pass 
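
To see what `interpolate()`, `isna()`, and `sum()` do on a case small enough to check by hand, the sketch below uses the default linear method instead of the polynomial method from the lab, because linear interpolation needs no SciPy. The series values are made up.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1200.0, np.nan, 1210.0, np.nan, np.nan, 1240.0])
print(s_demo.isna().sum())   # 3 missing values before filling

# Linear interpolation places the missing values on straight lines between neighbours
filled = s_demo.interpolate(method="linear")
print(filled.tolist())       # [1200.0, 1205.0, 1210.0, 1220.0, 1230.0, 1240.0]
print(filled.isna().sum())   # 0

# A gap at the very start has no left neighbour and stays NaN by default
lead_demo = pd.Series([np.nan, 1.0, np.nan, 3.0]).interpolate(method="linear")
print(lead_demo.isna().sum())  # 1
```

The last case is worth remembering: if a dropped reading happens to be the first row, the count of missing values after cleaning may not be zero.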

In [None]:
# 4. Feature Engineering: Apply a Rolling Window Average to smooth noise
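# The Pandas method rolling() provides rolling window calculations
# window=15 with center=True averages each point with its 7 neighbors on either side
# The first and last 7 rows are NaN because the window is incomplete there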

df['Smoothed_Elevation_m'] = df['Cleaned_Elevation_m'].rolling(window=15, center=True).mean()

df.head()
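
The effect of a centered rolling window is easiest to see on a short series. The sketch below uses a window of 3 rather than 15 so the result can be checked by hand; the `min_periods=1` variant is shown only as one possible option and is not part of the lab's method.

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Centered window of 3: each value is averaged with one neighbour on each side.
# The first and last positions lack a full window, so they become NaN.
centered = s_demo.rolling(window=3, center=True).mean()
print(centered.tolist())   # [nan, 2.0, 3.0, 4.0, 5.0, 6.0, nan]

# min_periods=1 lets the window shrink at the ends instead of returning NaN
partial = s_demo.rolling(window=3, center=True, min_periods=1).mean()
print(partial.tolist())    # [1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 6.5]
```

With `window=15`, the same rule leaves 7 NaN values at each end of `Smoothed_Elevation_m`, which is why `df.head()` shows NaN in that column.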

# 4: Geoscience Visualization

**Visual Reporting (Matplotlib)** 

Data science in the geosciences requires clear, accurately labeled visualizations. We will use Matplotlib's object-oriented API (plt.subplots()) to overlay our raw noisy data, our Pandas-smoothed data, and the underlying mathematical truth.

Notice that the Pandas rolling mean tracks the true profile reasonably well in the center of the valley but struggles near the steep edges, where the centered window is incomplete and the smoothed line stops short of the valley walls. This limitation motivates the Scikit-Learn regression models in our next lab.

In [None]:
# Use Matplotlib to plot the data 

# 1. Create a figure and axis (plt.subplots()).

# 2. Plot:
# - Scatter points (raw LiDAR data)
# - Lines (smoothed data and true profile)

# 3. Format the plot (title, labels, grid, legend).

# 4. Display the final result (plt.show()).

In [None]:
# 1. Initialize the Figure and Axes
fig, ax = plt.subplots(figsize=(12, 6))

# 2. Plot the raw, noisy LiDAR measurements
ax.scatter(df['Distance_m'], df['Measured_Elevation_m'], 
           color='gray', alpha=0.4, s=15, label='Raw Sensor Data (Noisy)')

# 3. Plot the Pandas Smoothed Data
ax.plot(df['Distance_m'], df['Smoothed_Elevation_m'], 
        color='blue', linewidth=2.5, linestyle='--', label='Pandas Rolling Mean')

# 4. Plot the Underlying Mathematical Truth
ax.plot(df['Distance_m'], df['True_Elevation_m'], 
        color='red', linewidth=2, label='True Geologic Profile')

# 5. Formatting the Plot for a Science Report
ax.set_title("Cross-Sectional Topographic Profile of a Glacial Valley", fontsize=16, fontweight='bold')
ax.set_xlabel("Distance Across Valley [m]", fontsize=12)
ax.set_ylabel("Elevation [m.a.s.l.]", fontsize=12)

# Add grid lines for readability
ax.grid(True, linestyle=':', alpha=0.7)

# Add the legend to identify the layers
ax.legend(loc='upper center', fontsize=11)

# Render the plot
plt.tight_layout()
plt.show()

In [None]:
# Explore Matplotlib elements: 
# scatter()
# plot()
pass 