## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before the deadline. This cell is for your reference only and is not needed in your report.

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback



## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Hypothesis


Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Data

In [2]:
import pandas as pd
import numpy as np

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Puerto Rico Coral Reef Benthic Cover (CRCP Monitoring Data, 2014–2023)
---

The Puerto Rico Benthic Cover dataset includes 10,655 reef survey observations collected between 2014 and 2023 as part of NOAA’s Coral Reef Conservation Program (the raw file has 10,656 rows, but the first row holds unit labels rather than data). Each row represents one benthic category observed at a specific reef site and date. The dataset contains location information (latitude and longitude in decimal degrees), survey timing (year, month, day), depth measurements in meters, and percent cover values for different benthic groups such as live coral species, macroalgae, cyanobacteria, rubble, and bare substrate.

The main variable of interest is percent cover, measured in percentage units (%). Percent cover describes how much of the surveyed reef surface area is occupied by a given category. For example, 30% live coral cover means nearly one-third of the transect area was covered by living coral tissue. In general, coral cover above ~30% is often considered relatively healthy, while values below ~10% may indicate reef degradation. The dataset also includes reef complexity measurements (rugosity), which describe how structurally complex the reef surface is — higher values indicate more three-dimensional structure, which can support greater biodiversity and recovery potential.

**Variables We're Focusing On**

For this analysis, we’re mainly looking at location and habitat structure variables from the benthic dataset.

* Latitude (`latitude`, degrees north) and Longitude (`longitude`, degrees east)
  These tell us where each survey site is located. We’ll use them to look at spatial patterns and see whether habitat characteristics cluster in certain areas.
* Minimum and Maximum Depth (`MIN_DEPTH`, `MAX_DEPTH`, meters)
  These describe how deep the survey area is. Depth affects light availability, species distribution, and habitat type. We’ll use depth to see whether benthic composition changes across shallow vs. deeper sites.
* Substrate Percent Cover (`HARDBOTTOM_P`, `SOFTBOTTOM_P`, `RUBBLE_P`)
  These variables show what the seafloor is made of at each site. They help us compare habitat composition and determine whether certain substrate types are more common in specific locations or depth ranges.
* Mean Rugosity (`MEAN_RUG`)
  Rugosity measures how complex the seafloor structure is. Since structural complexity often supports greater biodiversity, we’ll use this to assess how habitat complexity varies across sites.

By combining location, depth, substrate composition, and rugosity, we can analyze how benthic habitat structure differs across Puerto Rico and identify environmental factors that may explain those differences. These variables together help us understand patterns in habitat composition rather than just describing individual sites.
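As a rough illustration of the shallow vs. deeper comparison described above, the sketch below bins sites by mid-depth and averages rugosity per bin. The toy values and the 12 m cutoff are hypothetical, chosen only for the example; the real dataset also carries its own `DEPTH_STRAT` label, which may be a better grouping variable.

```python
import pandas as pd

# Toy data with the same column names as the benthic dataset (values are made up).
demo = pd.DataFrame({
    "MIN_DEPTH": [2.0, 8.0, 14.0, 25.0],
    "MAX_DEPTH": [4.0, 10.0, 16.0, 27.0],
    "MEAN_RUG": [0.2, 0.3, 0.5, 0.6],
})

# Mid-depth of each survey as a single depth summary per row.
demo["MID_DEPTH"] = demo[["MIN_DEPTH", "MAX_DEPTH"]].mean(axis=1)

# Hypothetical cutoff: 0-12 m is "shallow", 12-30 m is "deep".
demo["DEPTH_BIN"] = pd.cut(
    demo["MID_DEPTH"], bins=[0, 12, 30], labels=["shallow", "deep"]
)

# Average rugosity within each depth bin.
rug_by_depth = demo.groupby("DEPTH_BIN", observed=True)["MEAN_RUG"].mean()
print(rug_by_depth)
```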


**Concerns and Limitations**

There are a few important limitations to keep in mind. Seven columns related to zoning, marine protected areas, and similar classifications are completely empty, so we can’t use this dataset to analyze management or protection effects in Puerto Rico. Rugosity measurements are also missing for roughly half of the observations (`MEAN_RUG` for about 45% of rows and `WTD_RUG` for about 55%), which may reflect changes in survey methods over time rather than random missing data.

In addition, several numeric variables (like latitude, depth, and percent cover values) are stored as text and need to be converted before analysis. The first row of the dataset also contains unit labels instead of real data and must be removed during cleaning.
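One possible way to handle the unit-label row at load time is to skip it with `skiprows`, so pandas infers numeric types directly. This is only a sketch on a hypothetical four-line excerpt that mimics the raw layout; it is an alternative to converting columns after loading, not the approach used in the cells below.

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the raw file: header, units row, then data.
raw_text = (
    "latitude,longitude,MIN_DEPTH,HARDBOTTOM_P\n"
    "degrees_north,degrees_east,m,percent\n"
    "18.14334361,-67.30063227,8.839199717,29\n"
    "18.14334361,-67.30063227,8.839199717,4\n"
)

# skiprows=[1] drops the second physical line (the units row) before parsing.
demo = pd.read_csv(io.StringIO(raw_text), skiprows=[1])

print(demo.dtypes)
print(demo.shape)
```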

Finally, the dataset only covers about 10 years (2014–2023), not the full 20-year period in our research question. While it still captures recent bleaching and stress events, longer-term trends would require additional historical data.
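To see which years actually contain surveys within the covered period, a check along these lines could be run on the `YEAR` column. The toy years below are illustrative only and are not the real row counts.

```python
import pandas as pd

# Toy YEAR column (illustrative values, not the real counts).
demo = pd.DataFrame({"YEAR": [2014, 2014, 2019, 2021, 2021, 2023]})

# Number of rows per survey year, in chronological order.
rows_per_year = demo["YEAR"].value_counts().sort_index()

# Years inside the covered span that have no rows at all.
full_range = range(int(demo["YEAR"].min()), int(demo["YEAR"].max()) + 1)
missing_years = [y for y in full_range if y not in rows_per_year.index]

print(rows_per_year)
print("Years with no surveys:", missing_years)
```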


In [12]:
## Load Raw Dataset
benthic_PR = pd.read_csv("data/00-raw/CRCP_Benthic_Cover_Puerto_Rico_07c6_3c17_78b6.csv")

benthic_PR.head()

Unnamed: 0,time,latitude,longitude,REGION,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,Date_UTC,...,DEPTH_STRAT,MIN_DEPTH,MAX_DEPTH,METERS_COMPLETED,COVER_CAT_CD,COVER_CAT_NAME,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P,accession_url
0,UTC,degrees_north,degrees_east,,,,,,,UTC,...,,m,m,m,,,percent,percent,percent,
1,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,BAR SUB.,Bare Substrate,29,0,0,https://accession.nodc.noaa.gov/0217139
2,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,CYA SPE.,Cyanophyta spp,4,0,0,https://accession.nodc.noaa.gov/0217139
3,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,DIC SPE.,Dictyota spp,18,0,0,https://accession.nodc.noaa.gov/0217139
4,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,GOR GORG,Gorgonians,1,0,0,https://accession.nodc.noaa.gov/0217139


In [13]:
benthic_PR.shape

(10656, 34)

In [14]:
benthic_PR.columns

Index(['time', 'latitude', 'longitude', 'REGION', 'PRIMARY_SAMPLE_UNIT',
       'STATION_NR', 'YEAR', 'MONTH', 'DAY', 'Date_UTC', 'HABITAT_CD', 'STRAT',
       'RUGOSITY_CD', 'WTD_RUG', 'MEAN_RUG', 'MAPGRID_NR', 'SUB_REGION_NAME',
       'SUB_REGION_NR', 'ZONE_NAME', 'ZONE_NR', 'MPA_NAME', 'MPA_NR', 'ADMIN',
       'PROT', 'DEPTH_STRAT', 'MIN_DEPTH', 'MAX_DEPTH', 'METERS_COMPLETED',
       'COVER_CAT_CD', 'COVER_CAT_NAME', 'HARDBOTTOM_P', 'SOFTBOTTOM_P',
       'RUBBLE_P', 'accession_url'],
      dtype='object')

In [15]:
benthic_PR.describe()

Unnamed: 0,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,RUGOSITY_CD,WTD_RUG,MAPGRID_NR,SUB_REGION_NR,ZONE_NAME,ZONE_NR,MPA_NAME,MPA_NR,PROT
count,10655.0,10655.0,10655.0,10655.0,10655.0,0.0,4754.0,10628.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,6273.545378,1.0,2018.36321,8.110371,17.405913,,0.444762,8270768.0,,,,,,
std,2407.341171,0.0,3.435621,1.806798,8.820304,,0.279351,2906061.0,,,,,,
min,1000.0,1.0,2014.0,4.0,1.0,,0.1,3642914.0,,,,,,
25%,6062.0,1.0,2014.0,7.0,11.0,,0.235417,5321685.0,,,,,,
50%,6321.0,1.0,2019.0,8.0,18.0,,0.354167,7981655.0,,,,,,
75%,9141.0,1.0,2021.0,9.0,25.0,,0.608333,10985700.0,,,,,,
max,9359.0,1.0,2023.0,12.0,31.0,,1.566667,13580560.0,,,,,,


**Check Tidy Structure**

The dataset is already in long (tidy) format because each row represents one benthic category observation at a specific site and date. Each column represents a single variable (e.g., depth, percent cover, region).

However, tidy does not mean *clean*: the data types, missing values, and value ranges still need to be checked.
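One way to back up the tidy claim is to confirm that no site, date, and cover category combination appears twice. The sketch below assumes that `PRIMARY_SAMPLE_UNIT`, `STATION_NR`, `Date_UTC`, and `COVER_CAT_CD` together identify a row; that key is our assumption rather than something documented by the data provider, and the toy rows are made up to include one deliberate duplicate.

```python
import pandas as pd

# Toy rows using the dataset's column names; rows 0 and 2 share the same key.
demo = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [6243, 6243, 6243, 6250],
    "STATION_NR": [1, 1, 1, 1],
    "Date_UTC": ["2019-07-18T00:00:00Z"] * 3 + ["2019-07-19T00:00:00Z"],
    "COVER_CAT_CD": ["BAR SUB.", "CYA SPE.", "BAR SUB.", "BAR SUB."],
})

# Assumed identifying key for one observation (hypothetical choice).
key = ["PRIMARY_SAMPLE_UNIT", "STATION_NR", "Date_UTC", "COVER_CAT_CD"]

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = demo[demo.duplicated(subset=key, keep=False)]
print(f"{len(dupes)} rows share a key with another row")
```

An empty `dupes` frame on the real data would support the one-row-per-category structure.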

In [16]:
## Convert Data Types

cols_to_numeric = [
    "latitude", "longitude",
    "MIN_DEPTH", "MAX_DEPTH",
    "HARDBOTTOM_P", "SOFTBOTTOM_P", "RUBBLE_P",
    "MEAN_RUG"
]

for col in cols_to_numeric:
    benthic_PR[col] = pd.to_numeric(benthic_PR[col], errors='coerce')

In [17]:
## Missing Data Analysis

missing_counts = benthic_PR.isna().sum()
missing_percent = benthic_PR.isna().mean() * 100

pd.DataFrame({
    "Missing Count": missing_counts,
    "Missing %": missing_percent
}).sort_values("Missing %", ascending=False)

Unnamed: 0,Missing Count,Missing %
SUB_REGION_NR,10656,100.0
ZONE_NR,10656,100.0
MPA_NAME,10656,100.0
MPA_NR,10656,100.0
RUGOSITY_CD,10656,100.0
PROT,10656,100.0
ZONE_NAME,10656,100.0
WTD_RUG,5902,55.386637
MEAN_RUG,4795,44.998123
MAPGRID_NR,28,0.262763


Missingness does not appear random: seven columns (e.g., the zoning and MPA variables) are completely empty, and the rugosity columns are missing for roughly half of the rows. This pattern suggests systematic absence rather than sporadic missing data.
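To probe whether the rugosity gaps line up with survey year, the share of missing `MEAN_RUG` values can be computed per year. This is a minimal sketch on made-up values; the pattern it prints is not a result from the real data.

```python
import numpy as np
import pandas as pd

# Toy data: rugosity entirely missing in one year, partly in another (made up).
demo = pd.DataFrame({
    "YEAR": [2014, 2014, 2019, 2019, 2021, 2021],
    "MEAN_RUG": [np.nan, np.nan, 0.3, np.nan, 0.4, 0.5],
})

# Fraction of rows with missing MEAN_RUG within each survey year.
missing_by_year = demo["MEAN_RUG"].isna().groupby(demo["YEAR"]).mean()
print(missing_by_year)
```

Fractions near 0 or 1 for whole years would point toward a change in survey protocol, while similar fractions across years would look more like random gaps.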

In [18]:
benthic_PR[["HARDBOTTOM_P", "SOFTBOTTOM_P", "RUBBLE_P"]].describe()

Unnamed: 0,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P
count,10655.0,10655.0,10655.0
mean,6.681933,1.512529,0.421211
std,10.735646,7.496323,2.470354
min,0.0,0.0,0.0
25%,1.0,0.0,0.0
50%,2.0,0.0,0.0
75%,7.0,0.0,0.0
max,98.0,98.0,72.0


In [21]:
benthic_PR[(benthic_PR["HARDBOTTOM_P"] > 100) | (benthic_PR["HARDBOTTOM_P"] < 0)]

Unnamed: 0,time,latitude,longitude,REGION,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,Date_UTC,...,DEPTH_STRAT,MIN_DEPTH,MAX_DEPTH,METERS_COMPLETED,COVER_CAT_CD,COVER_CAT_NAME,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P,accession_url


In [23]:
benthic_PR[["MIN_DEPTH", "MAX_DEPTH"]].describe()

Unnamed: 0,MIN_DEPTH,MAX_DEPTH
count,10632.0,10632.0
mean,12.819517,13.985222
std,6.59737,6.656109
min,0.3048,0.6096
25%,7.9248,9.144
50%,12.4968,13.716
75%,17.373599,18.592799
max,29.260799,30.479999


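No biologically impossible percent values (<0% or >100%) were found: the explicit check on `HARDBOTTOM_P` returned no rows, and the summary statistics show all three substrate columns ranging from 0% to at most 98%. Depth values fall within reasonable reef ranges (about 0.3 m to 30.5 m), with no extreme outliers.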
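The range check can be extended to all three substrate columns at once, along with a sanity check that minimum depth never exceeds maximum depth. The sketch below uses made-up rows that deliberately contain bad values so the flags have something to catch.

```python
import pandas as pd

# Toy rows with deliberate problems: 101%, -1%, and an inverted depth pair.
demo = pd.DataFrame({
    "HARDBOTTOM_P": [29, 4, 101],
    "SOFTBOTTOM_P": [0, -1, 0],
    "RUBBLE_P": [0, 0, 0],
    "MIN_DEPTH": [8.8, 9.0, 12.0],
    "MAX_DEPTH": [9.4, 8.5, 13.0],
})

pct_cols = ["HARDBOTTOM_P", "SOFTBOTTOM_P", "RUBBLE_P"]

# Flag rows where any percent column is outside 0-100 (NaN is not flagged).
out_of_range = ((demo[pct_cols] < 0) | (demo[pct_cols] > 100)).any(axis=1)

# Flag rows where the recorded minimum depth is deeper than the maximum.
depth_inverted = demo["MIN_DEPTH"] > demo["MAX_DEPTH"]

print(demo[out_of_range | depth_inverted])
```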

**Cleaning Strategy**

Columns that are entirely missing (e.g., `ZONE_NAME`, `MPA_NAME`, `PROT`) are dropped because they contain no usable information. Numeric variables stored as text were converted to numeric format. Rows missing `YEAR` or `COVER_CAT_NAME` are dropped because those fields are essential to every analysis; in practice this removes only the unit-label row. All other rows are retained, including those with missing rugosity or depth values, to avoid aggressive row deletion and preserve ecological observations.

In [25]:
benthic_PR = benthic_PR.dropna(axis=1, how='all') # drops fully empty columns

benthic_PR = benthic_PR.dropna(subset=["YEAR", "COVER_CAT_NAME"]) # drops missing values in essential variables

In [26]:
benthic_PR.head()

Unnamed: 0,time,latitude,longitude,REGION,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,Date_UTC,...,DEPTH_STRAT,MIN_DEPTH,MAX_DEPTH,METERS_COMPLETED,COVER_CAT_CD,COVER_CAT_NAME,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P,accession_url
1,2019-07-18T00:00:00Z,18.143344,-67.300632,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.8392,9.4488,15,BAR SUB.,Bare Substrate,29.0,0.0,0.0,https://accession.nodc.noaa.gov/0217139
2,2019-07-18T00:00:00Z,18.143344,-67.300632,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.8392,9.4488,15,CYA SPE.,Cyanophyta spp,4.0,0.0,0.0,https://accession.nodc.noaa.gov/0217139
3,2019-07-18T00:00:00Z,18.143344,-67.300632,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.8392,9.4488,15,DIC SPE.,Dictyota spp,18.0,0.0,0.0,https://accession.nodc.noaa.gov/0217139
4,2019-07-18T00:00:00Z,18.143344,-67.300632,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.8392,9.4488,15,GOR GORG,Gorgonians,1.0,0.0,0.0,https://accession.nodc.noaa.gov/0217139
5,2019-07-18T00:00:00Z,18.143344,-67.300632,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.8392,9.4488,15,HAL SPE.,Halimeda spp,8.0,0.0,0.0,https://accession.nodc.noaa.gov/0217139


In [27]:
benthic_PR.shape


(10655, 27)

In [29]:
benthic_PR.describe()

Unnamed: 0,latitude,longitude,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,WTD_RUG,MEAN_RUG,MAPGRID_NR,MIN_DEPTH,MAX_DEPTH,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P
count,10655.0,10655.0,10655.0,10655.0,10655.0,10655.0,10655.0,4754.0,5861.0,10628.0,10632.0,10632.0,10655.0,10655.0,10655.0
mean,18.159688,-66.376224,6273.545378,1.0,2018.36321,8.110371,17.405913,0.444762,0.354587,8270768.0,12.819517,13.985222,6.681933,1.512529,0.421211
std,0.184946,0.885091,2407.341171,0.0,3.435621,1.806798,8.820304,0.279351,0.218344,2906061.0,6.59737,6.656109,10.735646,7.496323,2.470354
min,17.862056,-67.949417,1000.0,1.0,2014.0,4.0,1.0,0.1,0.0,3642914.0,0.3048,0.6096,0.0,0.0,0.0
25%,17.979148,-67.266039,6062.0,1.0,2014.0,7.0,11.0,0.235417,0.196667,5321685.0,7.9248,9.144,1.0,0.0,0.0
50%,18.145444,-66.26637,6321.0,1.0,2019.0,8.0,18.0,0.354167,0.316667,7981655.0,12.4968,13.716,2.0,0.0,0.0
75%,18.342258,-65.518407,9141.0,1.0,2021.0,9.0,25.0,0.608333,0.47,10985700.0,17.373599,18.592799,7.0,0.0,0.0
max,18.519091,-65.17725,9359.0,1.0,2023.0,12.0,31.0,1.566667,1.243333,13580560.0,29.260799,30.479999,98.0,98.0,72.0


In [30]:
## Save Processed Dataset

benthic_PR.to_csv("data/02-processed/benthic_cover_PR_cleaned.csv", index=False)
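On a fresh clone the `data/02-processed/` directory may not exist yet, in which case `to_csv` raises an error. A minimal sketch of creating the directory first is shown below; it writes to a temporary folder with a hypothetical file name so that it does not touch the project's real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the cleaned dataset (made-up values).
demo = pd.DataFrame({"YEAR": [2019], "HARDBOTTOM_P": [29.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the nested output directory if it is missing.
    out_dir = Path(tmp) / "data" / "02-processed"
    out_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name used only for this example.
    out_path = out_dir / "demo_cleaned.csv"
    demo.to_csv(out_path, index=False)

    # Read the file back to confirm the write succeeded.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```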

*Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets*

*Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.*

## Ethics


Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> This project does not involve human subjects or individual-level data. All datasets used are publicly available environmental monitoring datasets collected by government agencies (e.g., NOAA) using standardized ecological survey methods.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Collection bias is relevant because coral reef monitoring sites are not evenly distributed across regions or reef types. Some U.S. reef systems may be monitored more frequently or consistently than others due to accessibility, funding, or conservation priority. This could bias observed recovery pathways toward better-studied regions. We acknowledge this limitation and will avoid overgeneralizing results beyond the monitored sites.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> This project does not collect or use any personally identifiable information. All data are ecological and site-level, such as percent coral cover and thermal stress metrics.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> Because this project does not involve human populations or protected groups, downstream bias related to demographic characteristics is not applicable.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The datasets used are publicly available and contain no sensitive information. Data will be stored locally for analysis using standard file protections. While advanced security measures are not required, care will be taken to avoid accidental modification or loss of data.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The project does not use personal or individual-level data. There are no individuals whose data could be removed upon request.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Data will be retained only for the duration of the course project and may be deleted afterward. Since the data are publicly available, long-term storage does not pose ethical concerns.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Our analysis does not include input from reef managers or local communities. Our findings *could* influence policy or funding, e.g., reefs with faster recovery might get more attention, while slower-recovering reefs could be deprioritized. We will clarify that recovery trajectories are descriptive, not value judgments.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Potential sources of bias include uneven temporal coverage across sites, missing years of data, and unmeasured confounding factors such as storms, pollution, or local management practices. These limitations may affect interpretation of recovery trajectories and will be discussed when presenting results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Visualizations and summary statistics will be designed to accurately reflect the underlying data, including showing uncertainty, missing data, and variability across sites. We will avoid visual choices that exaggerate trends or imply causation.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> No data with personally identifiable information will be used or displayed, as all data are ecological and environmental in nature.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The data cleaning, merging, and analysis process will be documented using reproducible code and clear descriptions of methods. This allows the analysis to be reviewed or revisited if issues are discovered later.


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> This project does not involve predictive models that affect individuals, nor does it include demographic variables.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> Fairness across human groups is not applicable. However, we recognize that modeling choices may implicitly emphasize certain regions or reef types over others due to data availability.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Recovery metrics such as changes in percent live coral cover or trajectory slopes were chosen because they are commonly used in reef ecology. We acknowledge that no single metric fully captures reef health and will discuss this limitation.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> The analytical methods used are interpretable and can be explained in clear terms. We will avoid complex models that obscure interpretation.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Limitations such as observational data, lack of causal inference, and incomplete coverage will be clearly communicated in the final report to avoid misinterpretation of results.


### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> This project is exploratory and academic in nature and will not be deployed in a production environment. Ongoing monitoring is therefore not applicable.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Because this analysis does not produce decisions affecting individuals or communities directly, formal redress mechanisms are not required. However, we aim to present findings responsibly to avoid misuse.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> There is no deployed system or model to roll back. If errors are discovered, analyses and conclusions can be revised.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> While the project is academic, results could potentially be misused if interpreted as causal or definitive. To mitigate this, we will clearly state the scope, assumptions, and limitations of the analysis.


## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them