# Python Skill Week Mini-Project (Colab)
## Antarctic Ice‑Free Soil Geochemistry: marine aerosols, climate, and salinity

**Scientific question:** Are Antarctic soil salts controlled primarily by **proximity to the ocean** (marine aerosols), by **climate** (precipitation/temperature), or **both**? How does salinity relate to **soil carbon and nitrogen**?

### What you will do
You will build a reproducible workflow in Python:
**load → QA/QC → derive variables → analyze with a function → plot → interpret**

### What you submit
- This notebook (download as `.ipynb` and submit to Canvas, **or** share a viewable link as instructed).
- Notebook must run **top‑to‑bottom without errors**.

---

## Markdown cheat sheet (how to “write in Markdown”)
In Colab, click **+ Text** to add a Markdown (text) cell.

Examples:
- Headings: `## Heading`
- Bold: `**bold**`
- Bullets:
  - `- item 1`
  - `- item 2`
- Table (for your data dictionary):

```markdown
| Column | Units | Meaning |
|---|---:|---|
| dist_coast_km | km | Distance from sample to coast |
| cl_mgL | mg/L | Chloride concentration |
```


## Part 0 — Setup + load the dataset (run these cells)
Run the next cell first.

In [None]:
# --- RUN THIS FIRST ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

DATA_URL = "https://raw.githubusercontent.com/joshlemonte/EarthDataViz/refs/heads/main/ansoil_geochem_teaching.csv"
df = pd.read_csv(DATA_URL)

print("Loaded:", df.shape[0], "rows x", df.shape[1], "columns")
df.head()


In [None]:
# Inspect the dataset
df.info()
df.columns
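
Beyond `df.info()`, a per-column count of missing values can help you decide which QA/QC steps are worth adding in Part B. The sketch below is one way to build that summary. It runs on a small made-up table called `demo` (the values are invented for illustration; only the column names come from the teaching CSV), so swap in `df` when you use it in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; in the notebook, use `df` instead of `demo`.
demo = pd.DataFrame({
    "dist_coast_km": [1.2, 15.0, np.nan, 80.5],
    "cl_mgL": [350.0, np.nan, np.nan, 4.1],
    "ec_uScm": [1200.0, 310.0, 95.0, 40.0],
})

# Count and percentage of missing values per column, worst first.
n_missing = demo.isna().sum()
summary = pd.DataFrame({
    "n_missing": n_missing,
    "pct_missing": (100 * n_missing / len(demo)).round(1),
}).sort_values("n_missing", ascending=False)

print(summary)
```

Columns near the top of this table are the ones where a `dropna` step will cost you the most rows.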


### Required columns check
These are the columns we expect in the teaching CSV. If any are missing, tell the instructor.

In [None]:
required_cols = [
    "sample_id","sample_location","location_group",
    "lat","lon","dist_coast_km",
    "precip_racmo","temp_racmo",
    "ph_mq","ec_uScm",
    "cl_mgL","na_mgL","so4_mgL","ca_mgL",
    "wtpct_c","wtpct_n","cn_ratio"
]

missing = [c for c in required_cols if c not in df.columns]
print("Missing required columns:", missing if missing else "None ✅")


## Part A — Data dictionary (required)

Create a **Markdown table** describing ~10 key columns (include **units** where relevant).

Suggested columns:
- `location_group`, `dist_coast_km`, `precip_racmo`, `temp_racmo`
- `ph_mq`, `ec_uScm`
- `cl_mgL`, `na_mgL`, `so4_mgL`, `ca_mgL`
- `wtpct_c`, `wtpct_n`, `cn_ratio`

✅ **Action:** Click **+ Text** and paste a table like the template below, then fill it in.

```markdown
| Column | Units | Meaning |
|---|---:|---|
| dist_coast_km | km | ... |
| cl_mgL | mg/L | ... |
```
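
If typing the table by hand is tedious, you could generate the skeleton in a code cell and paste the printed text into a **+ Text** cell. This is only a convenience sketch: the `columns` list is an example subset, and you still need to replace each `...` with the real units and meaning.

```python
# Example subset of columns; extend this list to the ~10 columns you choose.
columns = ["location_group", "dist_coast_km", "cl_mgL", "ec_uScm"]

# Header row and alignment row, matching the template above.
lines = ["| Column | Units | Meaning |", "|---|---:|---|"]
for col in columns:
    lines.append(f"| {col} | ... | ... |")

table_md = "\n".join(lines)
print(table_md)
```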


## Part B — QA/QC and cleaning (required)

Create a cleaned dataframe called `dfc`.

**Requirements**
1. Implement **≥ 3 QA/QC steps**
2. Show **before/after row counts**
3. Explain each QA/QC step in Markdown (2–4 sentences each)

Below is a *starter* cleaning cell with two example QA/QC steps. You must add **at least one more**.


In [None]:
# Start from a core set of columns (only keep columns that exist)
core_cols = [
    "sample_id","sample_location","location_group",
    "lat","lon","dist_coast_km",
    "precip_racmo","temp_racmo",
    "ph_mq","ec_uScm",
    "cl_mgL","na_mgL","so4_mgL","ca_mgL","no3_mgL","po4_mgL",
    "wtpct_c","wtpct_n","cn_ratio","d15n_permil","d13c_vpdb_permil"
]
core_cols = [c for c in core_cols if c in df.columns]

dfc = df[core_cols].copy()
print("Before QC:", dfc.shape)

# QC Step 1 (example): drop rows missing key variables needed for Figure 1
dfc = dfc.dropna(subset=["dist_coast_km","cl_mgL","ec_uScm"])
print("After QC step 1:", dfc.shape)

# QC Step 2 (example): impossible negatives -> NaN (values are masked; rows are kept)
nonneg_cols = [c for c in ["ec_uScm","cl_mgL","na_mgL","so4_mgL","ca_mgL","no3_mgL","po4_mgL","wtpct_c","wtpct_n","cn_ratio"] if c in dfc.columns]
n_negative = int((dfc[nonneg_cols] < 0).sum().sum())  # count before masking
for c in nonneg_cols:
    dfc.loc[dfc[c] < 0, c] = np.nan
print("After QC step 2:", dfc.shape, "| negative values set to NaN:", n_negative)

# QC Step 3 (YOU ADD): choose one (or invent your own)
# - require climate variables:
# dfc = dfc.dropna(subset=["precip_racmo","temp_racmo"])
# - mask non-positive cl_mgL before log plots (log10 is undefined for values <= 0):
# dfc.loc[dfc["cl_mgL"] <= 0, "cl_mgL"] = np.nan
# - filter a subset (e.g., one location_group) AFTER you justify why
# dfc = dfc[dfc["location_group"] == "YOUR_GROUP_HERE"]

print("After your QC step 3:", dfc.shape)
dfc.head()


✅ **Action:** Add a **Text (Markdown)** cell below explaining your QA/QC steps:
- What did you do?
- Why is it scientifically reasonable?
- What did it change (counts, distributions, etc.)?
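
One way to answer "what did it change?" is to keep a small log as you clean, recording the row count and the number of usable values after each step. The sketch below shows the pattern on invented data; `demo`, `qc_log`, and `log_step` are hypothetical names, not part of the starter code.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; in the notebook, apply the same pattern to `dfc`.
demo = pd.DataFrame({
    "dist_coast_km": [1.0, 5.0, np.nan, 40.0, 90.0],
    "cl_mgL": [500.0, -3.0, 20.0, 8.0, np.nan],
})

qc_log = []  # one record per QC step

def log_step(label, frame):
    """Record the row count and the number of non-missing Cl values after a QC step."""
    qc_log.append({
        "step": label,
        "rows": len(frame),
        "cl_non_missing": int(frame["cl_mgL"].notna().sum()),
    })

log_step("raw", demo)

# Step 1: drop rows that lack distance or chloride.
step1 = demo.dropna(subset=["dist_coast_km", "cl_mgL"])
log_step("drop rows missing distance or Cl", step1)

# Step 2: mask impossible negative concentrations (rows are kept).
step2 = step1.copy()
step2.loc[step2["cl_mgL"] < 0, "cl_mgL"] = np.nan
log_step("negative Cl -> NaN", step2)

qc_table = pd.DataFrame(qc_log)
print(qc_table)
```

Note that a masking step leaves the row count unchanged while reducing the number of usable values, which is why tracking both is more informative than `shape` alone.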


## Part C — Derived variables (required)

Create **≥ 2 derived variables** that help tell the geochemical story.

Common choices:
- `total_salts_mgL = cl_mgL + na_mgL + so4_mgL + ca_mgL`
- `na_cl_ratio = na_mgL / cl_mgL`
- log transform (helps with skewed geochem data): `log_total_salts = log10(total_salts_mgL)`


In [None]:
# Derived variables (edit/extend as needed)

# 1) Total salts proxy
dfc["total_salts_mgL"] = dfc["cl_mgL"] + dfc["na_mgL"] + dfc["so4_mgL"] + dfc["ca_mgL"]

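# 2) Ratio (mass ratio, mg/L over mg/L); zero Cl is masked so the result is NaN, not inf
dfc["na_cl_ratio"] = dfc["na_mgL"] / dfc["cl_mgL"].where(dfc["cl_mgL"] > 0)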

# 3) Log transform (mask non-positive values)
dfc["log_total_salts"] = np.nan
mask = dfc["total_salts_mgL"] > 0
dfc.loc[mask, "log_total_salts"] = np.log10(dfc.loc[mask, "total_salts_mgL"])

dfc[["total_salts_mgL","na_cl_ratio","log_total_salts"]].describe()
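
The ratio in the starter cell divides mg/L by mg/L, so it is a mass ratio. If you want to compare your samples with seawater, a molar ratio is often easier to interpret; a commonly cited molar Na/Cl value for seawater is roughly 0.86. The sketch below converts to mmol/L using standard atomic masses. The data are invented, and the column names `na_mmolL`, `cl_mmolL`, and `na_cl_molar` are hypothetical additions.

```python
import numpy as np
import pandas as pd

NA_G_PER_MOL = 22.99   # atomic mass of sodium
CL_G_PER_MOL = 35.45   # atomic mass of chlorine

# Toy data for illustration only; the first row has seawater-like proportions.
demo = pd.DataFrame({"na_mgL": [107.8, 50.0, 10.0], "cl_mgL": [193.5, 40.0, 0.0]})

# mg/L divided by g/mol gives mmol/L.
demo["na_mmolL"] = demo["na_mgL"] / NA_G_PER_MOL
demo["cl_mmolL"] = demo["cl_mgL"] / CL_G_PER_MOL

# Mask zero chloride so the ratio becomes NaN rather than inf.
demo["na_cl_molar"] = demo["na_mmolL"] / demo["cl_mmolL"].where(demo["cl_mmolL"] > 0)

print(demo.round(3))
```

Samples that sit well above or below the seawater value may point to a salt source or process other than marine aerosol, which is worth discussing in Part F.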


✅ **Action:** Add a **Text (Markdown)** cell below explaining:
- Why these derived variables are useful for the scientific question
- Why log scaling is often used in geochemistry (when it helps)
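
To see what log scaling does to skewed data, it can help to compare summary statistics before and after the transform. The sketch below uses synthetic log-normal values (not real measurements) as a stand-in for a concentration column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "concentrations" (log-normal); not real measurements.
conc = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=500))
log_conc = np.log10(conc)

print("raw:   mean =", round(conc.mean(), 1), " median =", round(conc.median(), 1),
      " skew =", round(conc.skew(), 2))
print("log10: mean =", round(log_conc.mean(), 2), " median =", round(log_conc.median(), 2),
      " skew =", round(log_conc.skew(), 2))
```

In the raw values a few very large samples pull the mean far above the median; after the transform the two are close and the skewness is near zero, so a straight-line fit is no longer dominated by a handful of points.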


## Part D — Write a function + use it twice (required)

Write **one function** and use it **at least twice**.

Recommended option (simple + interpretable): fit a straight line and return **slope + intercept + r²**.


In [None]:
def fit_line_and_r2(x, y):
    """Return slope (m), intercept (b), and r² for y = m*x + b. Drops NaNs pairwise."""
    x = np.asarray(x)
    y = np.asarray(y)
    msk = np.isfinite(x) & np.isfinite(y)
    x = x[msk]
    y = y[msk]
    if len(x) < 2:
        return np.nan, np.nan, np.nan
    m, b = np.polyfit(x, y, 1)
    yhat = m*x + b
    ss_res = np.sum((y - yhat)**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    r2 = 1 - ss_res/ss_tot if ss_tot != 0 else np.nan
    return m, b, r2
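
Before trusting a function on real data, it is worth testing the same calculation on data where you already know the answer. The sketch below repeats the `np.polyfit` and r² steps on synthetic points generated from a known line (slope 2, intercept 1), and cross-checks r² against the squared Pearson correlation, which should match for a one-predictor least-squares line.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with a known relationship: y = 2.0*x + 1.0 + noise
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Same steps as the fitting function: least-squares line, then r² from residuals.
m, b = np.polyfit(x, y, 1)
yhat = m * x + b
r2 = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Independent cross-check: squared Pearson correlation coefficient.
r_pearson = np.corrcoef(x, y)[0, 1]

print(f"slope={m:.3f}, intercept={b:.3f}, r2={r2:.4f}, pearson_r^2={r_pearson**2:.4f}")
```

If the recovered slope is close to 2 and the two r² values agree, the calculation is behaving as expected.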


In [None]:
# Function use #1 (required): distance to coast vs log10(chloride); non-positive Cl is masked to NaN
x1 = dfc["dist_coast_km"]
y1 = np.log10(dfc["cl_mgL"].where(dfc["cl_mgL"] > 0, np.nan))
m1, b1, r21 = fit_line_and_r2(x1, y1)
print("Use #1: log10(Cl) vs distance -> slope, intercept, r2:", m1, b1, r21)


In [None]:
# Function use #2 (choose one pairing)
# Option A: precip vs EC
x2 = dfc["precip_racmo"]
y2 = dfc["ec_uScm"]

# Option B: temperature vs EC
# x2 = dfc["temp_racmo"]
# y2 = dfc["ec_uScm"]

# Option C: EC vs %C (biogeochem link)
# x2 = dfc["ec_uScm"]
# y2 = dfc["wtpct_c"]

m2, b2, r22 = fit_line_and_r2(x2, y2)
print("Use #2: slope, intercept, r2:", m2, b2, r22)


✅ **Action:** Add a **Text (Markdown)** cell below interpreting your function outputs:
- What does the slope mean (direction/strength)?
- What does r² tell you?
- How does this support (or not) your geochemical story?
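
When y is `log10(cl_mgL)`, the slope is in log10 units per km, which is hard to read directly. Converting it to a multiplicative factor or a halving distance usually makes the interpretation clearer. The sketch below shows the arithmetic; `fold_change`, `halving_distance_km`, and the slope value are hypothetical, so substitute your own `m1`.

```python
import math

def fold_change(slope_log10_per_km, distance_km):
    """Multiplicative change in concentration over `distance_km` for a log10-linear fit."""
    return 10 ** (slope_log10_per_km * distance_km)

def halving_distance_km(slope_log10_per_km):
    """Distance over which concentration halves; only meaningful for a negative slope."""
    if slope_log10_per_km >= 0:
        return math.nan
    return math.log10(0.5) / slope_log10_per_km

m_example = -0.01  # hypothetical slope in log10(mg/L) per km; substitute your own m1

factor_10km = fold_change(m_example, 10)
print(f"Factor per 10 km: {factor_10km:.3f}")                        # about 0.794
print(f"Halving distance: {halving_distance_km(m_example):.1f} km")  # about 30.1 km
```

With this example slope, chloride would drop by roughly 21% for every 10 km inland. A statement like that is easier to relate to marine aerosol transport than the raw slope.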


## Part E — Figures (required)

You must make **two figures**, each with a **caption** in a Markdown cell.

### Figure 1 (required)
Plot **distance to coast** vs **chloride**:
- x = `dist_coast_km`
- y = `cl_mgL` (log scale often helps)

### Figure 2 (choose one)
Pick one relationship:
- `precip_racmo` vs `ec_uScm` (or `total_salts_mgL`)
- `temp_racmo` vs `ec_uScm` (or `total_salts_mgL`)
- `ec_uScm` (or `total_salts_mgL`) vs `wtpct_c` / `wtpct_n` / `cn_ratio`

**Both figures must**
- Have axis labels + units
- Be readable
- Be saved with `plt.savefig(...)`


In [None]:
# Figure 1: dist_coast_km vs cl_mgL (log scale)

plt.figure()
plt.scatter(dfc["dist_coast_km"], dfc["cl_mgL"])
plt.yscale("log")
plt.xlabel("Distance to coast (km)")
plt.ylabel("Cl (mg/L) [log scale]")
plt.title("Figure 1 — Marine aerosol test: Cl vs distance to coast")
plt.tight_layout()
plt.savefig("figure1_cl_vs_distance.png", dpi=200)
plt.show()
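
If you want the figure to show the fitted trend as well as the points, one option is to fit in log space and convert the line back to mg/L before plotting it on the log axis. The sketch below uses synthetic data as a stand-in for `dfc["dist_coast_km"]` and `dfc["cl_mgL"]`; the file name is also a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-in for distance and chloride (exponential decline plus scatter).
dist = rng.uniform(0, 100, 60)
cl = 10 ** (2.5 - 0.015 * dist + rng.normal(scale=0.3, size=dist.size))

# Fit in log space, then convert the line back to mg/L for plotting.
m, b = np.polyfit(dist, np.log10(cl), 1)
x_line = np.linspace(dist.min(), dist.max(), 100)
y_line = 10 ** (m * x_line + b)

plt.figure()
plt.scatter(dist, cl, label="samples (synthetic)")
plt.plot(x_line, y_line, color="black", label=f"fit: slope={m:.3g} log10(mg/L) per km")
plt.yscale("log")
plt.xlabel("Distance to coast (km)")
plt.ylabel("Cl (mg/L) [log scale]")
plt.legend()
plt.tight_layout()
plt.savefig("figure1_with_fit_demo.png", dpi=200)
```

Because the y-axis is logarithmic, the fitted curve appears as a straight line, which makes departures from a simple exponential decline easy to spot.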


✅ **Action:** Add a **Text (Markdown)** cell below with your **Figure 1 caption**.

Template:
**Figure 1.** [1–3 sentences describing the pattern + what it suggests about marine aerosols. Mention log scale if used.]


In [None]:
# Figure 2: choose your x and y below, then run

# CHOOSE ONE:
# x = dfc["precip_racmo"]; xlab = "Precipitation (RACMO)"
# x = dfc["temp_racmo"];   xlab = "Temperature (RACMO)"
# x = dfc["ec_uScm"];      xlab = "EC (uS/cm)"

# CHOOSE ONE:
# y = dfc["ec_uScm"];           ylab = "EC (uS/cm)"
# y = dfc["total_salts_mgL"];   ylab = "Total salts proxy (mg/L)"
# y = dfc["wtpct_c"];           ylab = "C (wt%)"
# y = dfc["cn_ratio"];          ylab = "C/N ratio"

# --- Edit the next two lines (required); add units to each label once you have confirmed them ---
x = dfc["precip_racmo"]; xlab = "Precipitation (RACMO)"
y = dfc["ec_uScm"];      ylab = "EC (uS/cm)"

m, b, r2 = fit_line_and_r2(x, y)

plt.figure()
plt.scatter(x, y)
plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(f"Figure 2 — {ylab} vs {xlab} (slope={m:.3g}, r²={r2:.2f})")
plt.tight_layout()
plt.savefig("figure2_relationship.png", dpi=200)
plt.show()


✅ **Action:** Add a **Text (Markdown)** cell below with your **Figure 2 caption**.

Template:
**Figure 2.** [1–3 sentences describing the relationship. State what you think is the strongest control (distance vs climate vs both) and why.]


## Part F — Interpretation + limitations (required)

Add **two Markdown (Text) cells**:

1) **Story (200–300 words):** What controls salts in these soils? Use Figures 1 and 2 and your function outputs as evidence.  
2) **Limitations (100–200 words):** What assumptions did you make? What data would increase confidence?

Use geology/geochemistry reasoning, not just code descriptions.
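
If you want quantitative support for a "distance vs climate vs both" argument, one optional approach is to compare r² from fits that use each predictor alone and then both together. The sketch below is not part of the required workflow: `r2_lstsq` is a hypothetical helper, and the data, coefficients, and precipitation range are all invented for illustration.

```python
import numpy as np

def r2_lstsq(predictors, y):
    """r² of an ordinary least-squares fit of y on the given predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)
n = 200

# Synthetic predictors and response; the coefficients are invented for illustration.
dist = rng.uniform(0, 100, n)
precip = rng.uniform(50, 300, n)
log_cl = 2.5 - 0.015 * dist - 0.002 * precip + rng.normal(scale=0.2, size=n)

r2_dist = r2_lstsq([dist], log_cl)
r2_precip = r2_lstsq([precip], log_cl)
r2_both = r2_lstsq([dist, precip], log_cl)

print(f"distance only: {r2_dist:.2f} | precip only: {r2_precip:.2f} | both: {r2_both:.2f}")
```

Adding a predictor can never lower r², so what matters is how much it rises. In real data, distance and climate may be correlated with each other (for example, if coastal sites are also wetter), in which case these numbers cannot cleanly separate the two controls. That caveat belongs in your limitations cell.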


## Optional — Download your figures from Colab
If you want to download the saved PNGs:

In [None]:
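try:
    from google.colab import files  # available only inside Colab
except ImportError:
    files = None  # outside Colab, the PNGs are already in the working directory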
# files.download("figure1_cl_vs_distance.png")
# files.download("figure2_relationship.png")
