## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback



## Background and Prior Work

Coral reefs are highly sensitive marine ecosystems that provide important ecological, economic, and coastal protection benefits. In recent years, coral reef health has declined worldwide due to rising sea surface temperatures, more frequent marine heatwaves, and local stressors such as algal overgrowth. Thermal stress, often measured using Degree Heating Weeks (DHW), is strongly linked to coral bleaching events, which can lead to partial or complete loss of live coral cover. <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) While many studies document coral decline following thermal stress, less is known about how coral reefs recover over time and whether recovery follows consistent patterns across different sites. <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

As students at the University of California, San Diego, we work in an academic environment shaped by the university’s strengths in marine biology, oceanography, and climate science, notably through the Scripps Institution of Oceanography. UC San Diego researchers have made major contributions to coral reef monitoring and the study of climate-driven marine stressors, which adds to our motivation for this project. Publicly available datasets from organizations like NOAA and the National Coral Reef Monitoring Program align well with UC San Diego’s emphasis on data-driven marine science and provide a strong foundation for this analysis.

Previous research has shown that coral recovery trajectories can vary widely depending on environmental conditions and local ecological dynamics. Studies using NOAA Coral Reef Watch data have found that higher DHW values are associated with more severe bleaching and increased coral mortality, often resulting in slow or incomplete recovery. <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) However, more recent monitoring efforts, such as the National Coral Reef Monitoring Program (NCRMP), provide long-term, site-level data on percent live coral and algal cover across U.S. reef systems. <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

While many studies using these data focus on overall trends in reef health, fewer explicitly examine differences in recovery pathways over time. Our project builds on this prior work by identifying coral recovery trajectories and analyzing how thermal stress and algal cover, both individually and together, are associated with differences in recovery outcomes across U.S. coral reef sites.


1. <a name="cite_note-1"></a> [^](#cite_ref-1) NOAA Coral Reef Watch. (n.d.). NOAA Coral Reef Watch Daily 5km Satellite Coral Bleaching Heat Stress Monitoring Products (Version 3.1). https://coralreefwatch.noaa.gov/product/5km/index.php
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Hughes et al. (2017, March 16). Global warming and recurrent mass bleaching of corals. Nature. https://www.nature.com/articles/nature21707
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Liu, G. et al. (2014, November 20). Reef-scale thermal stress monitoring of coral ecosystems: New 5-km global products from NOAA Coral Reef Watch. Remote Sensing, 6(11), 11579. https://www.mdpi.com/2072-4292/6/11/11579
4. <a name="cite_note-4"></a> [^](#cite_ref-4) National Coral Reef Monitoring Program: Tracking Environmental Conditions. NCRMP | Environmental. (n.d.). https://coralreef.noaa.gov/topics/national-coral-reef-monitoring-program/environmental


## Hypothesis


We hypothesize that coral reef sites exposed to higher cumulative thermal stress will exhibit recovery trajectories characterized by slower increases or sustained declines in percent live coral cover. Higher cumulative thermal stress is strongly associated with coral bleaching, reduced reproduction, and declining health, all of which can prevent reef recovery. We also predict that sites with higher algal cover will show poorer recovery outcomes. Elevated algal cover often follows coral loss and can itself be detrimental to coral by competing for space and changing local conditions.

## Data

In [7]:
import pandas as pd
import numpy as np

In [10]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [11]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import



### Data overview

### Florida (NCRMP Benthic Cover Data)

- Dataset name: NCRMP Benthic Cover – Florida
- Link: https://www.coris.noaa.gov/monitoring/monitoring_programs/ncrmp/data/
- Number of observations: 13,602 rows as loaded (the first row holds unit labels rather than data)
- Number of variables: 35
- Relevant variables: PRIMARY_SAMPLE_UNIT (site ID), latitude, longitude (for matching thermal data), YEAR (filter to 2016/2018/2020), REGION (FLK/PRICO/STX etc.), COVER_CAT_NAME (coral vs algae category), HARDBOTTOM_P (percent cover value)
- Planned use: group by PRIMARY_SAMPLE_UNIT + YEAR, sum HARDBOTTOM_P to compute total coral % and total algae %; use lat/long to match to nearest thermal grid point.
- Shortcomings: Some reefs are monitored more often than others, which may bias results. Surveys may not occur every year at every site.

We will use this dataset to track coral and algae changes across Florida reefs.
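
The planned grouping step can be sketched as follows. This is a minimal illustration on toy rows, not our final pipeline: `GROUP_LOOKUP` and `summarize_cover` are hypothetical names, and the lookup only covers three category names as an example. The real data lists individual taxa in `COVER_CAT_NAME`, so a complete lookup from category name to coral or algae would still need to be built.

```python
import pandas as pd

# Hypothetical lookup from COVER_CAT_NAME to a broad group. A real table
# would need one entry per category name that appears in the data.
GROUP_LOOKUP = {
    "Eusmilia fastigiata": "coral",
    "Meandrina meandrites": "coral",
    "Dictyota spp": "algae",
}

def summarize_cover(df: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Sum HARDBOTTOM_P per site, year, and broad group (coral or algae)."""
    out = df.copy()
    out["group"] = out["COVER_CAT_NAME"].map(lookup)
    out = out.dropna(subset=["group"])  # drop categories outside the lookup
    totals = (
        out.groupby(["PRIMARY_SAMPLE_UNIT", "YEAR", "group"])["HARDBOTTOM_P"]
        .sum()
        .unstack("group", fill_value=0)
        .reset_index()
    )
    totals.columns.name = None
    return totals

# Toy rows that mimic the column layout of the benthic cover files
toy = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [1702, 1702, 1702, 1702],
    "YEAR": [2016, 2016, 2016, 2016],
    "COVER_CAT_NAME": ["Eusmilia fastigiata", "Meandrina meandrites",
                       "Dictyota spp", "Gorgonians"],
    "HARDBOTTOM_P": [1.0, 1.0, 18.0, 13.0],
})
summary = summarize_cover(toy, GROUP_LOOKUP)
print(summary)
```

Categories that are not in the lookup (here, Gorgonians) are dropped instead of being guessed, so gaps in the lookup show up as missing cover, not as misclassified cover.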


### Dataset #1 


### Florida NCRMP Benthic Cover Data

This dataset contains benthic cover observations from Florida reef monitoring sites collected under the National Coral Reef Monitoring Program (NCRMP).

The dataset includes 13,602 observations and 35 variables. Important variables for our analysis include:

- PRIMARY_SAMPLE_UNIT: Unique reef site identifier  
- latitude and longitude: Geographic coordinates used to match sites to thermal grid data  
- YEAR: Survey year  
- REGION: Geographic region classification  
- COVER_CAT_NAME: Organism category (coral vs algae)  
- HARDBOTTOM_P: Percent hardbottom cover (used to compute total coral and algae percentages)

Percent cover values represent the proportion of reef surface covered by different organism categories.

One limitation of this dataset is uneven monitoring across sites and missing values in certain habitat and substrate variables.
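
One way to match a site to the nearest thermal grid point is sketched below. It assumes the thermal product sits on a regular latitude/longitude grid described by two 1-D arrays of cell centers. The grid here is made up for illustration (0.05-degree spacing), and `nearest_grid_index` is a hypothetical helper, not part of any NOAA tool.

```python
import numpy as np

def nearest_grid_index(site_lat, site_lon, grid_lats, grid_lons):
    """Return (row, col) of the grid cell center closest to a site.

    Assumes a regular grid, so latitude and longitude can be
    searched independently.
    """
    grid_lats = np.asarray(grid_lats, dtype=float)
    grid_lons = np.asarray(grid_lons, dtype=float)
    i = int(np.abs(grid_lats - site_lat).argmin())
    j = int(np.abs(grid_lons - site_lon).argmin())
    return i, j

# Made-up grid of cell centers with 0.05-degree spacing (an assumption)
grid_lats = 24.025 + 0.05 * np.arange(40)
grid_lons = -82.975 + 0.05 * np.arange(60)

# Coordinates of the first Florida site shown in this notebook
i, j = nearest_grid_index(24.46499, -81.9823, grid_lats, grid_lons)
print(i, j, grid_lats[i], grid_lons[j])
```

For an irregular grid, or if land-masked cells must be skipped, a distance search over valid cells would be needed instead of this independent lookup per axis.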

In [12]:
import pandas as pd

florida = pd.read_csv("data/00-raw/CRCP_Benthic_Cover_Florida_7018_0ee5_9488.csv")

print(florida.shape)
florida.head()
florida.isnull().sum()

(13602, 35)


time                       0
latitude                   0
longitude                  0
REGION                     1
PRIMARY_SAMPLE_UNIT        1
STATION_NR                 1
YEAR                       1
MONTH                      1
DAY                        1
Date_UTC                   0
HABITAT_CD                 4
HABITAT_TYPE              51
STRAT                     24
Description             4854
RUGOSITY_CD            10142
WTD_RUG                 6556
MAPGRID_NR                 1
SUB_REGION_NAME            1
SUB_REGION_NR              1
ZONE_NAME               3461
ZONE_NR                 3461
MPA_NAME                3461
MPA_NR                     1
ADMIN                  13602
PROT                       1
DEPTH_STRAT             8749
MIN_DEPTH                 29
MAX_DEPTH                 29
METERS_COMPLETED          29
COVER_CAT_CD              30
COVER_CAT_NAME            30
HARDBOTTOM_P              29
SOFTBOTTOM_P              29
RUBBLE_P                  29
accession_url 

In [16]:
florida.columns

Index(['time', 'latitude', 'longitude', 'REGION', 'PRIMARY_SAMPLE_UNIT',
       'STATION_NR', 'YEAR', 'MONTH', 'DAY', 'Date_UTC', 'HABITAT_CD',
       'HABITAT_TYPE', 'STRAT', 'Description', 'RUGOSITY_CD', 'WTD_RUG',
       'MAPGRID_NR', 'SUB_REGION_NAME', 'SUB_REGION_NR', 'ZONE_NAME',
       'ZONE_NR', 'MPA_NAME', 'MPA_NR', 'ADMIN', 'PROT', 'DEPTH_STRAT',
       'MIN_DEPTH', 'MAX_DEPTH', 'METERS_COMPLETED', 'COVER_CAT_CD',
       'COVER_CAT_NAME', 'HARDBOTTOM_P', 'SOFTBOTTOM_P', 'RUBBLE_P',
       'accession_url'],
      dtype='str')

In [17]:
cols_needed = [
    "PRIMARY_SAMPLE_UNIT",
    "latitude",
    "longitude",
    "YEAR",
    "REGION",
    "COVER_CAT_NAME",
    "HARDBOTTOM_P"
]

# Take a copy so later edits to florida_clean do not touch the raw frame
florida_clean = florida[cols_needed].copy()

florida_clean.head()

Unnamed: 0,PRIMARY_SAMPLE_UNIT,latitude,longitude,YEAR,REGION,COVER_CAT_NAME,HARDBOTTOM_P
0,,degrees_north,degrees_east,,,,percent
1,1702.0,24.46499,-81.9823,2016.0,FLK,Eusmilia fastigiata,1
2,1702.0,24.46499,-81.9823,2016.0,FLK,Encrusting gorgonian,2
3,1702.0,24.46499,-81.9823,2016.0,FLK,Gorgonians,13
4,1702.0,24.46499,-81.9823,2016.0,FLK,Meandrina meandrites,1


In [18]:
florida_clean.shape

(13602, 7)

In [19]:
florida_clean.isnull().sum()

PRIMARY_SAMPLE_UNIT     1
latitude                0
longitude               0
YEAR                    1
REGION                  1
COVER_CAT_NAME         30
HARDBOTTOM_P           29
dtype: int64
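
The Florida subset still contains the units row (row 0 above) and text-typed numeric columns. A minimal sketch of the remaining cleaning, mirroring the steps we apply to the Puerto Rico data, could look like this. The helper name `clean_benthic` and the toy rows are illustrative assumptions.

```python
import pandas as pd

def clean_benthic(df, numeric_cols, required_cols):
    """Convert text columns to numbers and drop rows missing key fields.

    The units row ("degrees_north", "percent", ...) becomes NaN during
    conversion and is removed by the dropna step.
    """
    out = df.copy()
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out.dropna(subset=required_cols).reset_index(drop=True)

# Toy rows shaped like the head of florida_clean, including the units row
toy = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [None, 1702.0, 1702.0],
    "latitude": ["degrees_north", "24.46499", "24.46499"],
    "longitude": ["degrees_east", "-81.9823", "-81.9823"],
    "YEAR": [None, 2016.0, 2016.0],
    "REGION": [None, "FLK", "FLK"],
    "COVER_CAT_NAME": [None, "Eusmilia fastigiata", "Gorgonians"],
    "HARDBOTTOM_P": ["percent", "1", "13"],
})
numeric_cols = ["latitude", "longitude", "YEAR", "HARDBOTTOM_P"]
required_cols = ["PRIMARY_SAMPLE_UNIT", "YEAR", "REGION",
                 "COVER_CAT_NAME", "HARDBOTTOM_P"]
cleaned = clean_benthic(toy, numeric_cols, required_cols)
print(cleaned.dtypes)
```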

### Puerto Rico Coral Reef Benthic Cover (CRCP Monitoring Data, 2014–2023)
---

The Puerto Rico Benthic Cover dataset includes 10,656 reef survey observations collected between 2014 and 2023 through NOAA’s Coral Reef Conservation Program. Each row represents one benthic category (like live coral or algae) measured at a specific reef site on a particular survey date. The dataset includes the site ID (`PRIMARY_SAMPLE_UNIT`), latitude and longitude (in decimal degrees), survey year (and month/day when available), region, the organism category (`COVER_CAT_NAME`), and the percent of hardbottom covered (`HARDBOTTOM_P`).

The main variable we care about is percent hardbottom cover, which is measured in percent (%). This tells us how much of the reef surface is covered by a certain organism type. For example, 30% live coral cover means that about one-third of the surveyed reef area was covered by living coral. In many reef systems, coral cover above about 30% is considered relatively healthy, while values below 10% often suggest significant degradation or limited recovery after stress events.

Since our research question looks at recovery patterns in live coral cover and how those patterns relate to algal cover, we use COVER_CAT_NAME to separate coral from algae and HARDBOTTOM_P to measure how dominant each group is. By combining location (site and region) with time (year), we can track how coral cover changes over time at different sites and see whether some reefs recover differently than others. This helps us explore whether higher algal cover is associated with weaker coral recovery after stress events.

**Variables We're Focusing On**

For this analysis, we focus on the site, location, date, region, organism category, and percent cover variables from the benthic dataset.
* Site ID (`PRIMARY_SAMPLE_UNIT`)
  This is the unique identifier for each survey site. It allows us to group observations by location and track changes over time at the same site.
* Latitude (degrees north) and Longitude (degrees east)
  These tell us where each survey site is located. We’ll use them to look at spatial patterns and see whether habitat characteristics cluster in certain areas.
* Date (`YEAR`, `MONTH`, `DAY`)
  These variables describe when the observation was recorded. YEAR allows us to evaluate long-term trends, while month/day could help identify seasonal patterns.
* Region
  This categorizes sites into broader geographic areas within Puerto Rico. It allows us to compare coral and algae cover across different parts of the island.
* `COVER_CAT_NAME`
  This identifies the organism category being measured (e.g., coral, algae). This is central to our research question because we are specifically comparing coral vs. algae cover.
* Percent Cover (`HARDBOTTOM_P`)
  This is the key outcome variable. It represents the percentage of hardbottom substrate covered by the organism listed in `COVER_CAT_NAME`.
  It is measured as a percent (%), ranging theoretically from 0 to 100.
  * Values near 0% indicate little to no coverage.
  * Values near 100% indicate nearly complete coverage of the hardbottom area by that organism type.
  * Values outside 0–100% would be biologically impossible and indicate data issues.

By combining site location (latitude and longitude), region, time (year, month, day), organism category (coral vs. algae), and percent hardbottom cover, we can analyze how benthic community composition varies across Puerto Rico and over time. These variables allow us to examine both spatial differences between regions and temporal trends within sites. Rather than simply describing individual survey locations, this approach helps us identify broader patterns in coral and algae cover and assess how benthic communities may be shifting across space and time.


**Concerns and Limitations**

One major concern is sampling bias. Survey sites are not randomly distributed across all reef habitat — they are likely chosen based on accessibility, monitoring priorities, or ecological importance. This means results may not represent all reef areas equally.

Another limitation is temporal imbalance. Some sites may have been surveyed more frequently than others, and some years may have more complete coverage than others. This could affect trend analysis.
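
A quick way to quantify this imbalance is to count how many distinct survey years each site has. The sketch below uses toy rows, and `survey_coverage` is a hypothetical helper name. It only shows the shape of the check and does not report results from the real data.

```python
import pandas as pd

def survey_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize, per site, how many distinct years were surveyed."""
    years = df.groupby("PRIMARY_SAMPLE_UNIT")["YEAR"]
    return pd.DataFrame({
        "n_years": years.nunique(),
        "first_year": years.min(),
        "last_year": years.max(),
    })

# Toy rows: one site visited in two years, another visited once
toy = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [6243, 6243, 6243, 9141],
    "YEAR": [2014, 2019, 2019, 2021],
})
coverage = survey_coverage(toy)

# Sites with fewer than two survey years cannot contribute a site-level trend
single_visit = coverage[coverage["n_years"] < 2]
print(coverage)
print(single_visit.index.tolist())
```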

In addition, several numeric variables (like latitude, depth, and percent cover values) are stored as text and need to be converted before analysis. The first row of the dataset also contains unit labels instead of real data and must be removed during cleaning.

Finally, the dataset only covers about 10 years (2014–2023), not the full 20-year period in our research question. While it still captures recent bleaching and stress events, longer-term trends would require additional historical data.


In [26]:
## Load Raw Dataset
benthic_PR = pd.read_csv("data/00-raw/CRCP_Benthic_Cover_Puerto_Rico_07c6_3c17_78b6.csv")

benthic_PR.head()

Unnamed: 0,time,latitude,longitude,REGION,PRIMARY_SAMPLE_UNIT,STATION_NR,YEAR,MONTH,DAY,Date_UTC,...,DEPTH_STRAT,MIN_DEPTH,MAX_DEPTH,METERS_COMPLETED,COVER_CAT_CD,COVER_CAT_NAME,HARDBOTTOM_P,SOFTBOTTOM_P,RUBBLE_P,accession_url
0,UTC,degrees_north,degrees_east,,,,,,,UTC,...,,m,m,m,,,percent,percent,percent,
1,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,BAR SUB.,Bare Substrate,29,0,0,https://accession.nodc.noaa.gov/0217139
2,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,CYA SPE.,Cyanophyta spp,4,0,0,https://accession.nodc.noaa.gov/0217139
3,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,DIC SPE.,Dictyota spp,18,0,0,https://accession.nodc.noaa.gov/0217139
4,2019-07-18T00:00:00Z,18.14334361,-67.30063227,PRICO,6243.0,1.0,2019.0,7.0,18.0,2019-07-18T00:00:00Z,...,SHLW,8.839199717,9.448799698,15,GOR GORG,Gorgonians,1,0,0,https://accession.nodc.noaa.gov/0217139


In [27]:
benthic_PR.shape

(10656, 34)

In [28]:
benthic_PR.columns

Index(['time', 'latitude', 'longitude', 'REGION', 'PRIMARY_SAMPLE_UNIT',
       'STATION_NR', 'YEAR', 'MONTH', 'DAY', 'Date_UTC', 'HABITAT_CD', 'STRAT',
       'RUGOSITY_CD', 'WTD_RUG', 'MEAN_RUG', 'MAPGRID_NR', 'SUB_REGION_NAME',
       'SUB_REGION_NR', 'ZONE_NAME', 'ZONE_NR', 'MPA_NAME', 'MPA_NR', 'ADMIN',
       'PROT', 'DEPTH_STRAT', 'MIN_DEPTH', 'MAX_DEPTH', 'METERS_COMPLETED',
       'COVER_CAT_CD', 'COVER_CAT_NAME', 'HARDBOTTOM_P', 'SOFTBOTTOM_P',
       'RUBBLE_P', 'accession_url'],
      dtype='object')

In [29]:
cols_to_keep = [
    "PRIMARY_SAMPLE_UNIT",
    "latitude",
    "longitude",
    "YEAR",
    "MONTH",
    "DAY",
    "REGION",
    "COVER_CAT_NAME",
    "HARDBOTTOM_P"
]

# Keep only the columns we need; copy so later type conversions modify an independent frame
benthic_PR = benthic_PR[[col for col in cols_to_keep if col in benthic_PR.columns]].copy()

In [31]:
benthic_PR.head()

Unnamed: 0,PRIMARY_SAMPLE_UNIT,latitude,longitude,YEAR,MONTH,DAY,REGION,COVER_CAT_NAME,HARDBOTTOM_P
0,,degrees_north,degrees_east,,,,,,percent
1,6243.0,18.14334361,-67.30063227,2019.0,7.0,18.0,PRICO,Bare Substrate,29
2,6243.0,18.14334361,-67.30063227,2019.0,7.0,18.0,PRICO,Cyanophyta spp,4
3,6243.0,18.14334361,-67.30063227,2019.0,7.0,18.0,PRICO,Dictyota spp,18
4,6243.0,18.14334361,-67.30063227,2019.0,7.0,18.0,PRICO,Gorgonians,1


In [43]:
benthic_PR["YEAR"].value_counts().sort_index()

YEAR
2014.0    2794
2016.0    1721
2017.0     266
2019.0    1673
2021.0    1962
2023.0    2239
Name: count, dtype: int64

**Check Tidy Structure**

The dataset is already in long (tidy) format because each row represents one benthic category observation at a specific site and date. Each column represents a single variable (e.g., year, region, percent cover).
**_BUT_**, this does not mean it is *clean*.

In [32]:
benthic_PR.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10656 entries, 0 to 10655
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   PRIMARY_SAMPLE_UNIT  10655 non-null  float64
 1   latitude             10656 non-null  object 
 2   longitude            10656 non-null  object 
 3   YEAR                 10655 non-null  float64
 4   MONTH                10655 non-null  float64
 5   DAY                  10655 non-null  float64
 6   REGION               10655 non-null  object 
 7   COVER_CAT_NAME       10655 non-null  object 
 8   HARDBOTTOM_P         10656 non-null  object 
dtypes: float64(4), object(5)
memory usage: 749.4+ KB


In [33]:
print("Number of rows:", benthic_PR.shape[0])
print("Number of columns:", benthic_PR.shape[1])

Number of rows: 10656
Number of columns: 9


In [34]:
## Convert Data Types

numeric_cols = [
    "latitude",
    "longitude",
    "YEAR",
    "MONTH", 
    "DAY",
    "HARDBOTTOM_P"
]

for col in numeric_cols:
    benthic_PR[col] = pd.to_numeric(benthic_PR[col], errors="coerce")


In [35]:
## Missing Data Analysis

missing_counts = benthic_PR.isna().sum()
missing_percent = benthic_PR.isna().mean() * 100

pd.DataFrame({
    "Missing Count": missing_counts,
    "Missing %": missing_percent
}).sort_values("Missing %", ascending=False)

Unnamed: 0,Missing Count,Missing %
PRIMARY_SAMPLE_UNIT,1,0.009384
latitude,1,0.009384
longitude,1,0.009384
YEAR,1,0.009384
MONTH,1,0.009384
DAY,1,0.009384
REGION,1,0.009384
COVER_CAT_NAME,1,0.009384
HARDBOTTOM_P,1,0.009384


In [36]:
outliers = benthic_PR[
    (benthic_PR["HARDBOTTOM_P"] < 0) |
    (benthic_PR["HARDBOTTOM_P"] > 100)
]

outliers

Unnamed: 0,PRIMARY_SAMPLE_UNIT,latitude,longitude,YEAR,MONTH,DAY,REGION,COVER_CAT_NAME,HARDBOTTOM_P


No biologically impossible percent values (<0% or >100%) were found.

In [37]:
benthic_PR["HARDBOTTOM_P"].describe()

count    10655.000000
mean         6.681933
std         10.735646
min          0.000000
25%          1.000000
50%          2.000000
75%          7.000000
max         98.000000
Name: HARDBOTTOM_P, dtype: float64

**Cleaning Strategy**

Columns that are entirely missing or not used for our research question (e.g., ZONE_NAME, MPA_NAME, PROT) were dropped above because they add no usable information for this analysis. Numeric variables stored as text were also converted to numeric format above.

Because HARDBOTTOM_P is our primary outcome variable, rows missing this value cannot be used for analysis and will be removed.
We also drop rows missing essential identifiers (site, year, region, category).

*We keep rows with missing MONTH or DAY if YEAR is present, since yearly trends are sufficient for our main analysis.*

In [38]:
benthic_clean = benthic_PR.dropna(
    subset=[
        "PRIMARY_SAMPLE_UNIT",
        "YEAR",
        "REGION",
        "COVER_CAT_NAME",
        "HARDBOTTOM_P"
    ]
)

In [39]:
benthic_PR.head()

Unnamed: 0,PRIMARY_SAMPLE_UNIT,latitude,longitude,YEAR,MONTH,DAY,REGION,COVER_CAT_NAME,HARDBOTTOM_P
0,,,,,,,,,
1,6243.0,18.143344,-67.300632,2019.0,7.0,18.0,PRICO,Bare Substrate,29.0
2,6243.0,18.143344,-67.300632,2019.0,7.0,18.0,PRICO,Cyanophyta spp,4.0
3,6243.0,18.143344,-67.300632,2019.0,7.0,18.0,PRICO,Dictyota spp,18.0
4,6243.0,18.143344,-67.300632,2019.0,7.0,18.0,PRICO,Gorgonians,1.0


In [40]:
benthic_PR.shape


(10656, 9)

In [41]:
benthic_PR.describe()

Unnamed: 0,PRIMARY_SAMPLE_UNIT,latitude,longitude,YEAR,MONTH,DAY,HARDBOTTOM_P
count,10655.0,10655.0,10655.0,10655.0,10655.0,10655.0,10655.0
mean,6273.545378,18.159688,-66.376224,2018.36321,8.110371,17.405913,6.681933
std,2407.341171,0.184946,0.885091,3.435621,1.806798,8.820304,10.735646
min,1000.0,17.862056,-67.949417,2014.0,4.0,1.0,0.0
25%,6062.0,17.979148,-67.266039,2014.0,7.0,11.0,1.0
50%,6321.0,18.145444,-66.26637,2019.0,8.0,18.0,2.0
75%,9141.0,18.342258,-65.518407,2021.0,9.0,25.0,7.0
max,9359.0,18.519091,-65.17725,2023.0,12.0,31.0,98.0


In [42]:
## Save Processed Dataset
# Save benthic_clean (rows missing key fields removed), not the uncleaned benthic_PR

benthic_clean.to_csv("data/02-processed/benthic_cover_PR_cleaned.csv", index=False)

*Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets*

*Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.*
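
Because the Florida and Puerto Rico files share the same core columns, one simple way to combine them is to stack them row-wise and tag each row with its source. The sketch below uses toy frames. `SHARED_COLS`, `combine_benthic`, and the `source` column are illustrative names we are assuming here, not something defined in the datasets.

```python
import pandas as pd

# Columns present in both benthic cover tables used in this notebook
SHARED_COLS = ["PRIMARY_SAMPLE_UNIT", "latitude", "longitude", "YEAR",
               "REGION", "COVER_CAT_NAME", "HARDBOTTOM_P"]

def combine_benthic(frames: dict) -> pd.DataFrame:
    """Stack several benthic tables, keeping shared columns and a source tag."""
    parts = []
    for source, df in frames.items():
        part = df[SHARED_COLS].copy()
        part["source"] = source  # hypothetical column recording the origin file
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Toy frames standing in for the cleaned Florida and Puerto Rico tables
florida_toy = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [1702.0], "latitude": [24.46499],
    "longitude": [-81.9823], "YEAR": [2016.0], "REGION": ["FLK"],
    "COVER_CAT_NAME": ["Gorgonians"], "HARDBOTTOM_P": [13.0],
})
pr_toy = pd.DataFrame({
    "PRIMARY_SAMPLE_UNIT": [6243.0, 6243.0], "latitude": [18.143344] * 2,
    "longitude": [-67.300632] * 2, "YEAR": [2019.0] * 2,
    "MONTH": [7.0] * 2, "DAY": [18.0] * 2, "REGION": ["PRICO"] * 2,
    "COVER_CAT_NAME": ["Dictyota spp", "Gorgonians"],
    "HARDBOTTOM_P": [18.0, 1.0],
})
combined = combine_benthic({"florida": florida_toy, "puerto_rico": pr_toy})
print(combined)
```

Keeping the `source` tag matters because site IDs are only guaranteed to be meaningful within one file, so any later grouping should use `source` together with `PRIMARY_SAMPLE_UNIT`.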

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> This project does not involve human subjects or individual-level data. All datasets used are publicly available environmental monitoring datasets collected by government agencies (e.g., NOAA) using standardized ecological survey methods.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Collection bias is relevant because coral reef monitoring sites are not evenly distributed across regions or reef types. Some U.S. reef systems may be monitored more frequently or consistently than others due to accessibility, funding, or conservation priority. This could bias observed recovery pathways toward better-studied regions. We acknowledge this limitation and will avoid overgeneralizing results beyond the monitored sites.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> This project does not collect or use any personally identifiable information. All data are ecological and site-level, such as percent coral cover and thermal stress metrics.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> Because this project does not involve human populations or protected groups, downstream bias related to demographic characteristics is not applicable.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The datasets used are publicly available and contain no sensitive information. Data will be stored locally for analysis using standard file protections. While advanced security measures are not required, care will be taken to avoid accidental modification or loss of data.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> The project does not use personal or individual-level data. There are no individuals whose data could be removed upon request.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> Data will be retained only for the duration of the course project and may be deleted afterward. Since the data are publicly available, long-term storage does not pose ethical concerns.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Our analysis does not include input from reef managers or local communities. Our findings *could* influence policy or funding, e.g., reefs with faster recovery might get more attention, while slower-recovering reefs could be deprioritized. We will clarify that recovery trajectories are descriptive, not value judgments.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Potential sources of bias include uneven temporal coverage across sites, missing years of data, and unmeasured confounding factors such as storms, pollution, or local management practices. These limitations may affect interpretation of recovery trajectories and will be discussed when presenting results.

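Uneven temporal coverage and missing years can be checked directly before any trajectories are fitted. The following is a minimal sketch that assumes the survey data can be reduced to `(site, year)` pairs; the function name and the field names in the summary are hypothetical.

```python
from collections import defaultdict


def coverage_by_site(records):
    """Summarize survey coverage for each site.

    `records` is an iterable of (site, year) pairs, one per survey.
    Returns a dict mapping site -> first/last year, number of surveyed
    years, and the years missing between the first and last survey.
    """
    years_by_site = defaultdict(set)
    for site, year in records:
        years_by_site[site].add(int(year))

    summary = {}
    for site, years in years_by_site.items():
        first, last = min(years), max(years)
        # Years inside the site's own survey window that have no survey.
        missing = sorted(set(range(first, last + 1)) - years)
        summary[site] = {
            "first_year": first,
            "last_year": last,
            "n_years": len(years),
            "missing_years": missing,
        }
    return summary
```

Sites with few surveyed years or long gaps can then be flagged in the results rather than silently treated like well-sampled sites.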
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Visualizations and summary statistics will be designed to accurately reflect the underlying data, including showing uncertainty, missing data, and variability across sites. We will avoid visual choices that exaggerate trends or imply causation.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> No data with personally identifiable information will be used or displayed, as all data are ecological and environmental in nature.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> The data cleaning, merging, and analysis process will be documented using reproducible code and clear descriptions of methods. This allows the analysis to be reviewed or revisited if issues are discovered later.


### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> This project does not involve predictive models that affect individuals, nor does it include demographic variables.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> Fairness across human groups is not applicable. However, we recognize that modeling choices may implicitly emphasize certain regions or reef types over others due to data availability.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> Recovery metrics such as changes in percent live coral cover or trajectory slopes were chosen because they are commonly used in reef ecology. We acknowledge that no single metric fully captures reef health and will discuss this limitation.

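To make the two metrics concrete, the sketch below computes a trajectory slope as an ordinary least-squares fit of percent live coral cover against year, alongside a simple net change between the first and last survey. This is an illustrative assumption about how the metrics might be computed, not the project's final method, and the function names are hypothetical.

```python
def trajectory_slope(years, cover):
    """Least-squares slope of percent cover vs. year (percentage points per year)."""
    if len(years) != len(cover) or len(years) < 2:
        raise ValueError("need at least two paired observations")
    n = len(years)
    mean_year = sum(years) / n
    mean_cover = sum(cover) / n
    # Sum of squared deviations in year; zero means all surveys share one year.
    sxx = sum((x - mean_year) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all observations are from the same year")
    sxy = sum((x - mean_year) * (y - mean_cover) for x, y in zip(years, cover))
    return sxy / sxx


def net_change(cover):
    """Change in percent cover between the first and last survey."""
    if len(cover) < 2:
        raise ValueError("need at least two observations")
    return cover[-1] - cover[0]
```

The two metrics can disagree: a reef surveyed at 40, 10, 20, 30, and 40 percent cover over five consecutive years has a net change of zero but a positive fitted slope, which is one reason to report more than one metric.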
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> The analytical methods used are interpretable and can be explained in clear terms. We will avoid complex models that obscure interpretation.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Limitations such as observational data, lack of causal inference, and incomplete coverage will be clearly communicated in the final report to avoid misinterpretation of results.


### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> This project is exploratory and academic in nature and will not be deployed in a production environment. Ongoing monitoring is therefore not applicable.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Because this analysis does not produce decisions affecting individuals or communities directly, formal redress mechanisms are not required. However, we aim to present findings responsibly to avoid misuse.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> There is no deployed system or model to roll back. If errors are discovered, analyses and conclusions can be revised.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> While the project is academic, results could potentially be misused if interpreted as causal or definitive. To mitigate this, we will clearly state the scope, assumptions, and limitations of the analysis.


## Team Expectations 

1. **Clear communication** - weekly meeting (Tuesday, 5:00-6:00 PM)
   - Assign roles for the week and agree on what we want to accomplish by the next meeting
2. **GitHub**
   - Don't push without verifying with others
3. **Disagreements**
   - Decide by majority vote
   - Flip a coin if we can’t reach a conclusion


## Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  5 PM | Think of some project ideas  | Decide on an idea and start working on the proposal | 
| 2/5  |  5 PM  |  Complete project proposal | Divvy up work and expectations on gathering info | 
| 2/15 | 5 PM  | Background research on topic  | Discuss ideal dataset(s) and ethics; draft project proposal + start data checkpoint 1.  Divvy up datasets |
| 2/18  | 9 AM  | Work on data checkpoint 1 (general data analysis)| Clean up and finalize checkpoint 1|
| 2/24  | 5 PM  | Background info on different ways to analyze the data | Discuss how we want to deeply analyze our data and present it. Divvy up work on EDA |
| 3/3  | 5 PM  | Work on EDA| Finishing touches on EDA checkpoint + figure out what needs to be done for final submission |
| 3/10  | 5 PM  | Work on cleaning up + finishing up requirements for final submission | Turn in Final Project & Group Project Surveys |
| 3/13  | 5 PM  | Work on final submission | Discuss video (script, who says what, etc) |
| 3/20  | Before 11:59 PM  | Practice for video; fine-tune the submission | Record video; turn in Final Project & Group Project Surveys |