# Homework Assignment 3: Data Analysis & Optimization

# **[YOUR NAME]**

**Total Points:** 100

**Instructions:**
- Complete all three problems in this notebook
- Write your code in the provided cells
- Run all cells to verify your code works
- Save your completed `.ipynb` file to the `homework` folder in your private GitHub repository (shared with the instructor)
- Submit the link to your notebook on Canvas

**⚠️ IMPORTANT:** GitHub records the timestamp of every file update. Your notebook must be committed to GitHub **before the deadline**. **DO NOT** update the file after the deadline—late modifications will be flagged and may result in a grade penalty.

**Academic Integrity:** This is an individual assignment. You may consult course materials, Python documentation, and AI tools, and you may discuss concepts with classmates, but all code must be your own.

---

In [None]:
# Standard imports - run this cell first
import numpy as np
import pandas as pd
from scipy import stats
from scipy.optimize import minimize, curve_fit

---
## Problem 1: Stream Water Quality Analysis (35 points)

You are analyzing water quality data from multiple stream monitoring sites in Wisconsin. The dataset contains measurements of temperature, dissolved oxygen (DO), pH, and conductivity collected over several months.

### Your Tasks:

**Part A (10 points):** Load and explore the data
1. Load the data from the CSV string provided below into a pandas DataFrame
2. Display basic information about the dataset (shape, data types, first few rows)
3. Check for missing values and report how many are in each column
4. Convert the `date` column to datetime format using `pd.to_datetime()`
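
The sketch below illustrates the load-and-inspect pattern on a small made-up CSV string. The names `toy_csv` and `toy_df` and the column names are hypothetical stand-ins, not the assignment data; adapt the calls to your own DataFrame.

```python
from io import StringIO

import pandas as pd

# Hypothetical toy data (not the assignment dataset); one value is missing
toy_csv = """station,date,value
X1,2024-01-05,3.2
X1,2024-01-12,
X2,2024-01-05,4.8"""

toy_df = pd.read_csv(StringIO(toy_csv))

print(toy_df.shape)         # (rows, columns)
print(toy_df.dtypes)        # data type of each column
print(toy_df.head())        # first few rows
print(toy_df.isna().sum())  # number of missing values per column

# Parse the date strings into datetime64 values
toy_df["date"] = pd.to_datetime(toy_df["date"])
```

Converting the column to datetime is what later makes accessors such as `.dt.month` available.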

**Part B (15 points):** Data analysis with grouping
1. Calculate the mean, standard deviation, min, and max for dissolved oxygen (`do_mg_l`) grouped by `site_id`
2. Determine which site has the lowest mean dissolved oxygen
3. Create a new column called `do_status` that classifies each measurement as:
   - "Critical" if DO < 4 mg/L
   - "Low" if 4 ≤ DO < 6 mg/L
   - "Adequate" if 6 ≤ DO < 8 mg/L
   - "Good" if DO ≥ 8 mg/L
4. Count how many measurements fall into each `do_status` category per site
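
One possible way to approach grouped statistics and threshold-based classification is sketched below on toy data. The DataFrame `toy` and its columns `group`, `score`, and `status` are hypothetical; `pd.cut` is only one option (a function with `apply` or `np.select` would also work), and the sketch assumes each threshold belongs to the higher category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the assignment dataset)
toy = pd.DataFrame({
    "group": ["g1", "g1", "g2", "g2", "g2"],
    "score": [3.0, 6.0, 8.0, 5.5, 9.1],
})

# Several summary statistics per group in one call
summary = toy.groupby("group")["score"].agg(["mean", "std", "min", "max"])

# Label of the group with the smallest mean
lowest_group = summary["mean"].idxmin()

# right=False makes every bin closed on the left: [lo, hi)
bins = [-np.inf, 4, 6, 8, np.inf]
labels = ["Critical", "Low", "Adequate", "Good"]
toy["status"] = pd.cut(toy["score"], bins=bins, labels=labels, right=False)

# Count of each status per group
counts = pd.crosstab(toy["group"], toy["status"])
print(summary)
print(counts)
```

With `right=False`, a value exactly on a threshold (such as 6.0) falls into the higher category, which is the boundary behaviour assumed here.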

**Part C (10 points):** Filtering and summary
1. Filter the data to include only measurements where temperature > 15°C AND pH is between 6.5 and 8.5 (inclusive)
2. For this filtered subset, calculate the mean conductivity for each month (hint: extract month from date)
3. Identify which site-month combination had the highest number of "Critical" or "Low" DO readings
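
A minimal sketch of combining boolean conditions and grouping by calendar month, again on hypothetical toy data (`toy`, `temp`, `ph`, and `cond` are illustrative names only):

```python
import pandas as pd

# Hypothetical toy data (not the assignment dataset)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-15", "2024-04-02", "2024-04-20"]),
    "temp": [10.0, 18.0, 20.0, 22.0],
    "ph": [7.0, 7.2, 9.0, 6.5],
    "cond": [100.0, 200.0, 300.0, 400.0],
})

# Combine conditions with & and wrap comparisons in parentheses.
# Series.between includes both endpoints by default.
mask = (toy["temp"] > 15) & toy["ph"].between(6.5, 8.5)
subset = toy[mask]

# Group by the month number extracted from the datetime column
monthly_mean = subset.groupby(subset["date"].dt.month)["cond"].mean()
print(monthly_mean)
```

Python's `and` keyword does not work element-wise on Series, which is why the sketch uses `&`.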

In [None]:
# Water quality dataset
water_quality_csv = """site_id,date,temp_c,do_mg_l,ph,conductivity_us
SITE_A,2024-05-15,12.3,9.2,7.1,245
SITE_A,2024-05-22,14.1,8.5,7.3,252
SITE_A,2024-06-05,16.8,7.2,7.0,268
SITE_A,2024-06-19,19.2,5.8,6.8,285
SITE_A,2024-07-03,22.5,4.5,6.9,312
SITE_A,2024-07-17,24.1,3.8,7.1,298
SITE_A,2024-08-01,23.8,4.2,7.2,305
SITE_A,2024-08-15,21.2,5.5,7.0,289
SITE_B,2024-05-15,11.8,10.1,7.4,198
SITE_B,2024-05-22,13.5,9.8,7.5,205
SITE_B,2024-06-05,15.9,8.9,7.3,215
SITE_B,2024-06-19,18.4,7.5,7.2,228
SITE_B,2024-07-03,21.2,6.2,7.1,245
SITE_B,2024-07-17,22.8,5.8,7.0,251
SITE_B,2024-08-01,22.1,6.1,7.1,248
SITE_B,2024-08-15,20.5,7.0,7.2,235
SITE_C,2024-05-15,13.1,8.8,6.2,312
SITE_C,2024-05-22,14.8,8.1,6.4,325
SITE_C,2024-06-05,17.2,6.5,6.3,348
SITE_C,2024-06-19,20.1,5.2,6.1,372
SITE_C,2024-07-03,23.4,3.5,6.0,398
SITE_C,2024-07-17,25.2,2.8,5.9,412
SITE_C,2024-08-01,24.5,3.2,6.1,405
SITE_C,2024-08-15,22.3,4.1,6.2,385"""

# Part A: Load and explore the data
from io import StringIO
# Hint: Use pd.read_csv(StringIO(water_quality_csv))



In [None]:
# Part B: Data analysis with grouping



In [None]:
# Part C: Filtering and summary



---
## Problem 2: Statistical Comparison of Forest Plots (30 points)

Researchers measured tree biomass (kg) in two independent groups of plots: one subjected to a thinning treatment and one left as a control. They want to determine whether the thinning treatment significantly affected individual tree biomass and whether there is a relationship between tree diameter and biomass.

### Your Tasks:

**Part A (10 points):** Comparing treatment groups
1. Calculate descriptive statistics (mean, std, median) for biomass in each treatment group
2. Perform an independent two-sample t-test to determine if there's a significant difference in mean biomass between control and thinned plots (α = 0.05)
3. State your null and alternative hypotheses, report the t-statistic and p-value, and write a conclusion
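
As a rough illustration of the mechanics, the sketch below runs an independent two-sample t-test on synthetic samples. The arrays `group_a` and `group_b` are made up for this example and have no connection to the forest data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with clearly different means (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=60, scale=5, size=40)

# Default is Student's t-test (equal variances assumed);
# passing equal_var=False would give Welch's t-test instead
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05
reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

Whether to assume equal variances is a judgment call; state which version you used when you write up your conclusion.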

**Part B (10 points):** Correlation analysis
1. Calculate the Pearson correlation coefficient between diameter at breast height (DBH, column `dbh_cm`) and biomass for the entire dataset
2. Test whether this correlation is statistically significant (α = 0.05)
3. Interpret the strength and direction of the correlation
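
A short sketch of computing a Pearson correlation and its p-value with `scipy.stats.pearsonr`, using synthetic `x` and `y` arrays invented for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic, strongly linear data (illustration only)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + rng.normal(0, 1, size=50)

# pearsonr returns the coefficient r and a two-sided p-value
# for the null hypothesis of no linear correlation
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

The sign of `r` gives the direction of the relationship and its magnitude gives the strength; the p-value addresses significance only.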

**Part C (10 points):** Distribution fitting
1. Fit a normal distribution to the biomass data from the control plots
2. Report the fitted parameters (μ and σ)
3. Calculate the probability that a randomly selected tree from the control plots has biomass > 150 kg
4. What biomass value represents the 90th percentile for control plot trees?
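
The following sketch shows one way to fit a normal distribution and query it, using a synthetic `sample` array and an arbitrary threshold of 130, both chosen only for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic sample (illustration only)
rng = np.random.default_rng(2)
sample = rng.normal(loc=100, scale=20, size=200)

# Maximum-likelihood estimates of the mean and standard deviation
mu, sigma = stats.norm.fit(sample)

# sf(x) = 1 - cdf(x) = P(X > x)
p_above = stats.norm.sf(130, loc=mu, scale=sigma)

# ppf is the inverse CDF: the value below which 90% of the distribution lies
p90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"P(X > 130) = {p_above:.4f}, 90th percentile = {p90:.2f}")
```

Note that `norm.fit` returns the maximum-likelihood standard deviation (divisor n), which differs slightly from the sample standard deviation reported by pandas (divisor n - 1).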

In [None]:
# Forest plot data
np.random.seed(458)  # For reproducibility

# Control plots: untreated forest
n_control = 35
dbh_control = np.random.uniform(15, 50, n_control)
biomass_control = 0.1 * dbh_control**2.2 + np.random.normal(0, 15, n_control)
biomass_control = np.maximum(biomass_control, 10)  # Floor at 10 kg so biomass stays positive

# Thinned plots: trees have more resources, potentially larger
n_thinned = 30
dbh_thinned = np.random.uniform(18, 55, n_thinned)
biomass_thinned = 0.12 * dbh_thinned**2.2 + np.random.normal(5, 18, n_thinned)
biomass_thinned = np.maximum(biomass_thinned, 10)

# Create DataFrame
forest_df = pd.DataFrame({
    'dbh_cm': np.concatenate([dbh_control, dbh_thinned]),
    'biomass_kg': np.concatenate([biomass_control, biomass_thinned]),
    'treatment': ['Control']*n_control + ['Thinned']*n_thinned
})

forest_df.head()

In [None]:
# Part A: Comparing treatment groups



In [None]:
# Part B: Correlation analysis



In [None]:
# Part C: Distribution fitting



---
## Problem 3: Fitting a Light Response Curve (35 points)

Photosynthesis rates depend on light intensity and follow a saturating curve. The **rectangular hyperbola** (a simplified form of the non-rectangular hyperbola) is commonly used to model this relationship:

$$A = \frac{A_{max} \cdot I}{K + I} - R_d$$

Where:
- $A$ = net photosynthesis rate (μmol CO₂ m⁻² s⁻¹)
- $A_{max}$ = maximum photosynthesis rate at light saturation
- $I$ = light intensity (μmol photons m⁻² s⁻¹, PAR)
- $K$ = half-saturation constant (light level at which $A = A_{max}/2 - R_d$)
- $R_d$ = dark respiration rate (CO₂ released when I = 0)
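
As a quick consistency check on these definitions, substituting three light levels into the equation gives the following (a worked evaluation of the same formula, not an additional assumption):

$$
\begin{aligned}
I = 0 &: \quad A = \frac{A_{max} \cdot 0}{K + 0} - R_d = -R_d \\
I = K &: \quad A = \frac{A_{max} \cdot K}{K + K} - R_d = \frac{A_{max}}{2} - R_d \\
I \to \infty &: \quad A \to A_{max} - R_d
\end{aligned}
$$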

### Your Tasks:

**Part A (10 points):** Define the model and cost function
1. Write a function `light_response(I, Amax, K, Rd)` that implements the equation above
2. Write a cost function `light_response_mse(params, I_data, A_data)` that calculates the mean squared error between observed and predicted photosynthesis rates
3. Test your `light_response` function by calculating A for I = 500 with Amax = 25, K = 200, Rd = 2
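
The general pattern of a model function paired with an MSE cost function can be sketched as follows. To keep the example generic, it uses a hypothetical straight-line model (`line`, `line_mse`, `x_demo`, `y_demo` are all illustrative names), not the light response equation.

```python
import numpy as np


def line(x, a, b):
    """Hypothetical stand-in model: a straight line."""
    return a * x + b


def line_mse(params, x_data, y_data):
    """Mean squared error between observations and model predictions.

    params is a sequence (a, b), the layout scipy.optimize.minimize passes in.
    """
    a, b = params
    residuals = y_data - line(x_data, a, b)
    return np.mean(residuals ** 2)


x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 5.0])  # generated exactly by y = 2x + 1

print(line_mse([2.0, 1.0], x_demo, y_demo))  # prints 0.0 (perfect fit)
```

Keeping all parameters in a single first argument lets the same cost function be handed directly to an optimizer.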

**Part B (15 points):** Fit the model using optimization
1. Use `scipy.optimize.minimize` to find the optimal parameters (Amax, K, Rd) that minimize the MSE
2. Use initial guesses: Amax=20, K=150, Rd=1
3. Report the fitted parameters and final MSE
4. Also fit the model using `scipy.optimize.curve_fit` and compare the results
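
The sketch below contrasts the two fitting interfaces on a hypothetical straight-line model with noise-free synthetic data. All names (`line`, `line_mse`, `x_demo`, `y_demo`) are illustrative; the light response model will need its own function and initial guesses.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize


def line(x, a, b):
    """Hypothetical stand-in model: a straight line."""
    return a * x + b


def line_mse(params, x_data, y_data):
    """Mean squared error for the parameter sequence (a, b)."""
    a, b = params
    return np.mean((y_data - line(x_data, a, b)) ** 2)


# Noise-free synthetic data generated with a = 3, b = 2
x_demo = np.linspace(0, 10, 20)
y_demo = 3.0 * x_demo + 2.0

# minimize takes the cost function; extra data arrays go through args
result = minimize(line_mse, x0=[1.0, 0.0], args=(x_demo, y_demo))
a_min, b_min = result.x
print("minimize:", result.x, "final MSE:", result.fun)

# curve_fit takes the model itself and does least squares internally
popt, pcov = curve_fit(line, x_demo, y_demo, p0=[1.0, 0.0])
print("curve_fit:", popt)
```

Both approaches minimize squared error, so their estimates should agree closely; small differences usually reflect the different algorithms and stopping tolerances.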

**Part C (10 points):** Evaluate and interpret the model
1. Calculate the predicted photosynthesis values using your fitted parameters
2. Calculate R² (coefficient of determination) to assess model fit:
   $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
3. Calculate the light compensation point (the light level where A = 0, i.e., photosynthesis equals respiration). Hint: solve for I when A = 0
4. What is the net light-saturated photosynthesis rate ($A_{max} - R_d$)?
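
The R² formula above translates directly into NumPy. The helper name `r_squared` and the small `y_obs` array are hypothetical, chosen only to show the two reference cases.

```python
import numpy as np


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot


y_obs = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions give R^2 = 1
print(r_squared(y_obs, y_obs))  # prints 1.0

# Predicting the mean everywhere gives R^2 = 0
print(r_squared(y_obs, np.full(4, y_obs.mean())))  # prints 0.0
```

These two cases are useful sanity checks before applying the function to fitted model output.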

In [None]:
# Light response curve data
# PAR (photosynthetically active radiation) in μmol photons m⁻² s⁻¹
par_data = np.array([0, 25, 50, 75, 100, 150, 200, 300, 400, 600, 800, 1000, 1200, 1500, 1800])

# Net photosynthesis rate in μmol CO₂ m⁻² s⁻¹
photo_data = np.array([-1.8, 1.2, 4.5, 7.1, 9.2, 12.5, 14.8, 17.5, 19.2, 21.1, 22.0, 22.5, 22.8, 23.0, 23.1])

print(f"PAR range: {par_data.min()} to {par_data.max()} μmol photons m⁻² s⁻¹")
print(f"Photosynthesis range: {photo_data.min()} to {photo_data.max()} μmol CO₂ m⁻² s⁻¹")

In [None]:
# Part A: Define the model and cost function



In [None]:
# Part B: Fit the model using optimization



In [None]:
# Part C: Evaluate and interpret the model



---
## Submission Checklist

Before submitting, verify that:

- [ ] All code cells run without errors
- [ ] All three problems are complete
- [ ] Output is visible for all cells
- [ ] Your name is filled in at the top of the notebook
- [ ] File is saved to the `homework` folder in your private GitHub repository
- [ ] File is committed and pushed **before the deadline**
- [ ] Link to your notebook is submitted on Canvas