
# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Do certain months have more homicides than others? If so, is the difference between months statistically significant, and has the monthly pattern changed over time?

## Background and Prior Work

Early criminological work and national surveillance data suggest that homicide is not evenly distributed throughout the year, with many studies reporting higher rates during warmer months. A publication from the Bureau of Justice Statistics [1] found that violent crime tends to increase in the summer, although the difference is quite small (about 4%). The increase is often attributed to greater outdoor activity, increased social interaction, and higher ambient temperatures, which may elevate aggression and the opportunity for conflict.

A closely related GitHub project, “Crime Seasonality Analysis” [2], investigates whether crime exhibits predictable seasonal structure by treating crime counts as a time series. The data was sourced from NYC Open Data [3] and the City of Chicago data portal [4]. Methodologically, the project’s pipeline is similar to what we plan to do: the author grouped crime incidents by month and created graphs to see whether some months consistently had higher crime than others. As in the BJS publication, the results showed that crime is not evenly spread across the year. The author also noted that some types of crime show clearer seasonal patterns than others. An important takeaway for us was that the patterns in New York City and Chicago are not universal: combining data from both cities yielded more ambiguous results than analyzing each city separately. This is related to the statistical phenomenon known as Simpson’s Paradox [5], in which a trend present in separate groups can weaken or reverse when the groups are pooled. From this, we take into consideration that crime seasonality depends on location and time period rather than following one fixed national pattern, and that any pattern we find is likely not permanent. Thus, we will also test whether homicide patterns change over time.

1. Block, Carolyn Rebecca. Seasonality of Crime and Victimization. Bureau of Justice Statistics, U.S. Department of Justice, 2010, https://bjs.ojp.gov/content/pub/pdf/spcvt.pdf
2. Ty1erz. Crime Seasonality Analysis. GitHub, https://github.com/ty1erz/seasonality_and_crime
3. New York City Open Data. NYC Crime Dataset, City of New York, https://data.cityofnewyork.us/Public-Safety/NYC-crime/qb7u-rbmr/about_data
4. City of Chicago Data Portal. Crimes – 2001 to Present, City of Chicago, https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data
5. Blyth, Colin. “Simpson’s Paradox.” Stanford Encyclopedia of Philosophy, Stanford University, https://plato.stanford.edu/entries/paradox-simpson/



## Hypothesis


Hypothesis: Homicide rates vary significantly by month, with summer months (June, July, August) experiencing higher homicide rates than winter months (December, January, February).

Null: There is no significant difference in homicide rates across months and any observed variations are due to randomness. 
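One way the null hypothesis could be tested is a chi-square goodness-of-fit test that compares observed monthly homicide counts with the counts expected if homicides were spread evenly across calendar days. The sketch below is only an illustration: the monthly split is invented (only its total, 6,312, matches the homicide count reported later in this notebook), and a non-leap year is assumed for the month lengths.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical monthly homicide counts (Jan..Dec); placeholders, NOT real results.
observed = np.array([480, 450, 510, 520, 540, 560, 590, 580, 540, 530, 500, 512])

# Days per month in a non-leap year (assumption); longer months should
# receive proportionally more deaths under the null hypothesis.
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

# Expected counts if every calendar day were equally likely.
expected = observed.sum() * days_in_month / days_in_month.sum()

result = chisquare(f_obs=observed, f_exp=expected)
chi2_stat, p_value = result.statistic, result.pvalue
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")
```

Scaling the expected counts by month length avoids flagging February as "low" simply because it has fewer days.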

## Data

### Data overview

We need a dataset that records individual deaths and includes a column for the manner of death. The data should span multiple years and include the month and year of each death, so that we can group deaths by month and compare results across years.

The dataset we are planning on using: https://github.com/the-pudding/data/tree/master/birthday-effect

- Dataset #1
  - Dataset Name: The Pudding “Birthday Effect” dataset (`birthdays.csv`)
  - Link to the dataset: https://github.com/the-pudding/data/tree/master/birthday-effect
  - Number of observations: 1,955,588
  - Number of variables: 7
  - Description of the variables most relevant to this project: `death` (date of death, used to derive the month and year) and `manner` (manner of death, where `h` marks homicides)
  - Descriptions of any shortcomings this dataset has with respect to the project: it covers only deaths recorded in Massachusetts, and homicides make up a small share of the records (6,312 after cleaning), so monthly counts may become small once they are split by year

We use a single dataset, so no combining of datasets is required.
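The requirement described in the data overview (month and year derived from each death date, then counted for homicides only) could be met roughly as follows. This is a minimal sketch on invented toy records; the column names `death` and `manner` and the code `h` for homicide follow the dataset described in this notebook, while `year`, `month`, and `monthly_counts` are names introduced here for illustration.

```python
import pandas as pd

# Toy records mimicking the dataset's `death` and `manner` columns (values are invented).
toy = pd.DataFrame({
    "death": ["1990-01-15", "1990-07-04", "1990-07-20", "1991-07-09", "1991-12-25", "1991-03-02"],
    "manner": ["h", "h", "n", "h", "h", "a"],
})
toy["death"] = pd.to_datetime(toy["death"], errors="coerce")

# Keep homicides only, then derive calendar year and month from the death date.
homicides = toy[toy["manner"] == "h"].copy()
homicides["year"] = homicides["death"].dt.year
homicides["month"] = homicides["death"].dt.month

# Rows = year, columns = month, cells = number of homicides (0 where none occurred).
monthly_counts = homicides.groupby(["year", "month"]).size().unstack(fill_value=0)
print(monthly_counts)
```

A year-by-month table like this supports both parts of the research question: column totals give the overall monthly pattern, and comparing rows shows whether that pattern shifts across years.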

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/the-pudding/data/refs/heads/master/birthday-effect/birthdays.csv', 'filename':'birthdays_raw.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')


### Dataset #1: Individual-Level Mortality Records with Birthday Proximity (The Pudding “Birthday Effect” Dataset)

This dataset contains approximately 1.96 million individual mortality records. Each row represents one person’s recorded death, along with their birth date and several demographic attributes. A key variable we will be examining, days_from_birthday, measures how many days before or after a person’s birthday their death occurred. This allows us to examine whether deaths cluster around birthdays, a phenomenon sometimes referred to as the “birthday effect.” We will use this term moving forward for clarity.


#### Important Variables and Metrics

* birth (date): The individual’s date of birth. Stored initially as a string but convertible to datetime format.

* death (date): The individual’s date of death.

* age_floor (integer, years): The age at death in completed years (rounded down). Typical human lifespans range from 0 to roughly 110–120 years, so values outside this range may indicate data errors.

* days_from_birthday (integer, days): The number of days between the individual’s birthday and their death. A value of 0 means the person died on their birthday; negative values mean the death occurred before the nearest birthday and positive values after it. Values range from -182 to +182 (approximately half a year before or after the birthday), which gives the 365 unique values found in the dataset.

* sex (categorical): Biological sex category.

* marital (categorical): Marital status at time of death.

* manner (categorical): Manner of death (e.g., natural, accident, suicide, homicide, etc.).

The key variable for birthday clustering analysis is days_from_birthday. If deaths are evenly distributed throughout the year, we would expect roughly uniform counts across values of this variable. Significant spikes near 0 could suggest a birthday effect.
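The uniformity idea above can be made concrete by comparing the number of deaths at each offset with the average count per offset. The sketch below uses simulated offsets rather than the real column, and the 25% threshold for calling something a spike is a hypothetical choice made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated offsets drawn uniformly from -182..182 (a stand-in for days_from_birthday).
simulated = pd.Series(rng.integers(-182, 183, size=100_000), name="days_from_birthday")

# Deaths per offset, relative to the average count per offset (1.0 = exactly average).
counts = simulated.value_counts().sort_index()
relative = counts / counts.mean()

# Offsets whose count exceeds the average by more than 25% (hypothetical threshold).
spikes = relative[relative > 1.25]
print(relative.loc[-3:3].round(2))
print("offsets flagged as spikes:", list(spikes.index))
```

On the real data, the same calculation would be applied to `df["days_from_birthday"]`; a relative count well above 1.0 at or near offset 0 would be the signature of a birthday effect.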


#### Potential Concerns and Limitations

Because this dataset consists of recorded mortality events, it reflects only individuals who have died and does not represent the living population. The records come from Massachusetts only (see the Ethics section) and may also be limited temporally, which may introduce sampling bias. Missing demographic values (such as marital status) may not be missing at random, and certain age groups or manners of death may be more likely to have incomplete records. Finally, birth and death dates are stored as strings and must be converted to datetime format for proper analysis.

#### Load Raw Data

In [1]:
import pandas as pd
import numpy as np
import os

os.makedirs("data/00-raw", exist_ok=True)
os.makedirs("data/01-interim", exist_ok=True)
os.makedirs("data/02-processed", exist_ok=True)

url = "https://raw.githubusercontent.com/the-pudding/data/refs/heads/master/birthday-effect/birthdays.csv"
df = pd.read_csv(url)

df.to_csv("data/00-raw/birthdays_raw.csv", index=False)

df.head()

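|   | birth | death | age_floor | days_from_birthday | sex | marital | manner |
|---|---|---|---|---|---|---|---|
| 0 | 1988-11-22 | 1990-09-26 | 1 | -57 | f | s | n |
| 1 | 1988-02-09 | 1990-01-07 | 1 | -33 | f | s | n |
| 2 | 1988-04-01 | 1990-02-13 | 1 | -47 | m | s | n |
| 3 | 1988-04-11 | 1990-02-24 | 1 | -46 | f | s | n |
| 4 | 1988-08-14 | 1990-02-27 | 1 | -168 | f | s | n |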


#### Tidy Dataset

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1955588 entries, 0 to 1955587
Data columns (total 7 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   birth               object
 1   death               object
 2   age_floor           int64 
 3   days_from_birthday  int64 
 4   sex                 object
 5   marital             object
 6   manner              object
dtypes: int64(2), object(5)
memory usage: 104.4+ MB


This data is already tidy: each row represents one individual record, each column represents a single variable, and each cell contains one value. No further tidying is necessary.

#### Dataset Size

In [2]:
print("Shape:", df.shape)
print("Total rows:", df.shape[0])
print("Total columns:", df.shape[1])

Shape: (1955588, 7)
Total rows: 1955588
Total columns: 7


#### Convert Date Columns

In [3]:
df["birth"] = pd.to_datetime(df["birth"], errors="coerce")
df["death"] = pd.to_datetime(df["death"], errors="coerce")

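We converted the date columns from text strings to pandas `datetime64[ns]` so that we can extract components such as month and year and perform date arithmetic. With `errors="coerce"`, any value that cannot be parsed as a date becomes `NaT` (missing) instead of raising an error.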
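The effect of `errors="coerce"` can be seen on a few invented strings. This small sketch is not taken from the dataset; it only illustrates that unparseable values are turned into `NaT`, which is one possible reason a date column can show missing values after conversion even if the raw column had none.

```python
import pandas as pd

# Invented examples: two valid ISO dates, one impossible date, one non-date string.
raw = pd.Series(["1988-11-22", "1990-02-30", "unknown", "1990-09-26"])

parsed = pd.to_datetime(raw, errors="coerce")

# Unparseable entries become NaT, which isna() reports as missing.
print(parsed)
print("unparseable entries:", int(parsed.isna().sum()))
```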

#### Check for Missing Data

In [4]:
missing_counts = df.isna().sum()
missing_percent = df.isna().mean() * 100

missing_summary = pd.DataFrame({
    "missing_count": missing_counts,
    "missing_percent": missing_percent
}).sort_values("missing_percent", ascending=False)

missing_summary

|   | missing_count | missing_percent |
|---|---|---|
| sex | 822646 | 42.066427 |
| manner | 10148 | 0.518923 |
| marital | 1627 | 0.083197 |
| birth | 2 | 0.000102 |
| death | 0 | 0.0 |
| age_floor | 0 | 0.0 |
| days_from_birthday | 0 | 0.0 |


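The core timing variables are almost complete: death, age_floor, and days_from_birthday have no missing values, and birth is missing in only 2 records. Among the categorical variables, sex is missing in 42.07% of records (822,646 rows), while manner (0.52%) and marital (0.08%) are rarely missing. The high missingness in sex limits any analysis that breaks results down by sex. To determine whether missingness appears systematic, we will compare missing rates across demographic groups.

#### Comparing Missing Rates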

In [5]:
df["marital_missing"] = df["marital"].isna()

df.groupby("manner")["marital_missing"].mean().sort_values(ascending=False)

manner
h    0.001267
s    0.001094
a    0.001093
n    0.000783
c    0.000000
p    0.000000
t    0.000000
Name: marital_missing, dtype: float64

* n = Natural
* a = Accident
* s = Suicide
* h = Homicide
* c = Could not determine
* p = Pending investigation
* t = Therapeutic complication (medical/surgical complication)

Missingness in the marital variable appears to be extremely low across all manner categories. The variation between groups is minimal, and several categories show no missing values at all. Because the missing rates are both very small and relatively similar across groups, there is no strong evidence that missingness is systematically associated with manner of death. This suggests that missing marital data is likely missing at random rather than structurally biased.
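The same comparison could be repeated for other columns, such as sex, which has far more missing values than marital. The helper below is a hypothetical sketch (the function name `missing_rate_by` is introduced here), demonstrated on invented toy data. It passes `dropna=False` so that rows whose grouping value is itself missing are reported as their own group rather than silently dropped, which is the default `groupby` behavior.

```python
import numpy as np
import pandas as pd

def missing_rate_by(frame: pd.DataFrame, target: str, group: str) -> pd.Series:
    """Share of rows where `target` is missing, within each level of `group`.

    dropna=False keeps rows whose `group` value is missing as a separate level.
    """
    is_missing = frame[target].isna()
    rates = is_missing.groupby(frame[group], dropna=False).mean()
    return rates.sort_values(ascending=False)

# Invented toy data, not drawn from the real dataset.
toy = pd.DataFrame({
    "manner": ["n", "n", "n", "h", "h", np.nan],
    "sex": ["f", np.nan, "m", np.nan, np.nan, "m"],
})
rates = missing_rate_by(toy, target="sex", group="manner")
print(rates)
```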

#### Outliers and Suspicious Entries

In [6]:
df["flag_age_outlier"] = (df["age_floor"] < 0) | (df["age_floor"] > 120)
df["flag_days_outlier"] = df["days_from_birthday"].abs() > 366

df[["flag_age_outlier", "flag_days_outlier"]].sum()

flag_age_outlier     0
flag_days_outlier    0
dtype: int64

No outliers or suspicious entries were flagged by either check.
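Range checks alone would not catch records whose dates are internally inconsistent. As an additional check that we have not run on the full data, one could flag deaths recorded before the birth date and rows where age_floor disagrees with the age implied by the two dates. The sketch below uses the first two rows shown by `df.head()` plus one invented bad row; the flag names are hypothetical.

```python
import pandas as pd

# Two rows from df.head() above, plus one invented inconsistent row (the last one).
toy = pd.DataFrame({
    "birth": pd.to_datetime(["1988-11-22", "1988-02-09", "1990-05-01"]),
    "death": pd.to_datetime(["1990-09-26", "1990-01-07", "1989-05-01"]),
    "age_floor": [1, 1, 0],
})

# A death recorded before the birth date is impossible.
toy["flag_death_before_birth"] = toy["death"] < toy["birth"]

# Recompute completed years of age: subtract one year if the birthday
# had not yet occurred in the year of death.
birthday_passed = (
    (toy["death"].dt.month > toy["birth"].dt.month)
    | ((toy["death"].dt.month == toy["birth"].dt.month)
       & (toy["death"].dt.day >= toy["birth"].dt.day))
)
recomputed_age = toy["death"].dt.year - toy["birth"].dt.year - (~birthday_passed).astype(int)
toy["flag_age_mismatch"] = recomputed_age != toy["age_floor"]

print(toy[["flag_death_before_birth", "flag_age_mismatch"]].sum())
```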

#### Clean Data

Because our analysis requires accurate birth and death timing, rows missing these fields cannot meaningfully contribute to the birthday effect analysis. Therefore, rows missing any of these essential timing variables are removed.

In [7]:
core_cols = ["birth", "death", "age_floor", "days_from_birthday"]

df_clean = df.dropna(subset=core_cols)

df_clean = df_clean[
    (df_clean["age_floor"].between(0,120)) &
    (df_clean["days_from_birthday"].abs() <= 366)
]

df_clean.to_csv("data/02-processed/birthdays_processed.csv", index=False)

print("Original shape:", df.shape)
print("Cleaned shape:", df_clean.shape)

Original shape: (1955588, 10)
Cleaned shape: (1955586, 10)


#### Summary of Statistics

##### Age at Death

In [8]:
df_clean["age_floor"].describe()

count    1.955586e+06
mean     7.514795e+01
std      1.643697e+01
min      1.000000e+00
25%      6.700000e+01
50%      7.900000e+01
75%      8.700000e+01
max      1.140000e+02
Name: age_floor, dtype: float64

The age_floor variable is the age at death; its mean is 75.1 years and its median is 79 years. A median above the mean indicates a left skew: relatively young deaths pull the mean downward.

The interquartile range spans from 67 years (25th percentile) to 87 years (75th percentile), indicating that the middle 50% of individuals died between ages 67 and 87. This suggests the dataset is heavily concentrated among older adults, which aligns with expected mortality patterns.

The minimum recorded age is 1 year, and the maximum is 114 years, both of which fall within plausible biological limits. There are no obvious extreme outliers (e.g., negative ages or values above 120 after cleaning), suggesting the age variable appears valid and well-behaved.

Overall, the distribution of age at death appears realistic and consistent with known mortality patterns.

##### Days from Birthday at Death

In [9]:
df_clean["days_from_birthday"].describe()

count    1.955586e+06
mean     6.318208e-02
std      1.053832e+02
min     -1.820000e+02
25%     -9.100000e+01
50%      0.000000e+00
75%      9.100000e+01
max      1.820000e+02
Name: days_from_birthday, dtype: float64

The dataset also contains 1,955,586 valid observations for days_from_birthday. The mean value is approximately 0.063 days, which is extremely close to zero. This suggests that, on average, deaths are distributed symmetrically around birthdays.

The median is exactly 0 days, so the distribution is centered on the birthday: about half of the deaths fall before it and about half after it. The interquartile range extends from -91 days (25th percentile) to +91 days (75th percentile), so the middle 50% of deaths occurred within about three months of a birthday. These quartiles are symmetric, which is what we would expect if deaths were spread evenly across the year relative to birthdays.

The minimum value is -182 days, and the maximum is +182 days, representing approximately half a year before or after a birthday. 

Because the mean is very close to zero and the distribution is symmetric, there is no immediate evidence from summary statistics alone of strong clustering around birthdays. However, more detailed visualizations such as graphs and plots would be required to detect subtle spikes near day 0.
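As a reference point for these statistics, it helps to compute what a perfectly uniform distribution over the 365 possible offsets would look like. The short calculation below is a benchmark only, not an analysis of the data: it lists every offset from -182 to +182 once and computes the standard deviation and quartiles, which can then be compared with the observed values reported by `describe()` (standard deviation about 105.38, quartiles -91, 0, and 91).

```python
import numpy as np

# All possible offsets, each equally likely under a perfectly uniform distribution.
offsets = np.arange(-182, 183)

uniform_std = offsets.std()                          # population standard deviation
uniform_quartiles = np.percentile(offsets, [25, 50, 75])

print(f"uniform std: {uniform_std:.2f}")
print("uniform quartiles:", uniform_quartiles)
```

The closer the observed summary statistics are to this benchmark, the less room there is for large-scale clustering, although a narrow spike on a single day could still be hidden inside nearly uniform summary statistics.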

##### Manner of Death

In [10]:
df_clean["manner"].value_counts()

manner
n    1838950
a      77769
s      19198
h       6312
c       1612
p        989
t        609
Name: count, dtype: int64

Manner of death is dominated by 'n' (natural causes), so any birthday effect will be driven primarily by natural deaths. Homicides ('h') account for only 6,312 records, which is the subset relevant to our research question.
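To put these counts in proportion, the shares can be computed directly from the `value_counts()` output above. The sketch below re-enters those printed counts by hand rather than reading the data, and the variable names are introduced here for illustration. Dividing the homicide count by 12 gives a rough sense of how many homicides fall in each calendar month when all years are pooled.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
manner_counts = pd.Series(
    {"n": 1838950, "a": 77769, "s": 19198, "h": 6312, "c": 1612, "p": 989, "t": 609},
    name="count",
)

# Percentage of records with a recorded manner that fall in each category.
manner_share = (manner_counts / manner_counts.sum() * 100).round(2)
print(manner_share)

# Average number of homicides per calendar month, pooled over all years.
homicides_per_month = manner_counts["h"] / 12
print("homicides per calendar month:", homicides_per_month)
```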

#### Conclusions

The summary statistics indicate that:

* The dataset is large and robust (~1.96 million records).

* Age at death is concentrated among older adults and appears biologically plausible.

* The days_from_birthday variable is symmetrically distributed around zero, suggesting proper construction.

* Most deaths are due to natural causes, consistent with general mortality patterns.

* There are no obvious extreme outliers or structural anomalies in these core variables after cleaning.

The dataset appears suitable for further analysis of potential clustering of deaths around birthdays.


## Ethics

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> The dataset was collected by the Massachusetts Registry of Vital Records and Statistics for administrative purposes and released publicly through a FOIA request. Because this study uses anonymous data with no personally identifiable information, informed consent from individuals was not required for our analysis.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Potential sources of bias include underreporting, differences in the classification of manners of death, and systemic factors that may affect which deaths are recorded as homicides. This dataset also only reflects deaths in Massachusetts, which may not represent other regions. These limits are acknowledged and our conclusions should be restricted to the studied population.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> The dataset contains no PII such as names, addresses, or identification numbers.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> Variables like age, sex, and marital status were used cautiously to check for potential disparities, and not to draw conclusions about any protected group.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> The dataset is public and anonymized on GitHub, but our analysis of it is accessible only to project members and used solely for academic purposes.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Our manipulated data will only be retained for the duration of the project. However, the raw dataset is public and controlled by another party.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> The analysis recognizes that homicide rates are influenced by complex social, economic, and environmental factors not expressed in the dataset. Interpretations avoid oversimplification and acknowledge that seasonal variation by itself cannot explain patterns of violence.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> Data cleaning steps, including the removal of records with missing information, the exclusion of infant deaths under 1.5 years old, and the removal of February 29 entries, might introduce bias by omitting certain cases. These decisions were made to ensure consistent month-to-month comparisons and are acknowledged as limitations.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> Only anonymous data is used. No individual cases are to be highlighted, and results should be aggregated.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> All data processing and analysis steps are documented and should be reproducible, allowing for future retesting and correction of potential errors.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> No model is used; the study is descriptive rather than predictive.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> The project communicates that findings show correlation, not causation, and are limited to the Massachusetts dataset and the analyzed timeframe.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> This research is for academic purposes, but it would be interesting to keep monitoring with current data and different locations.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> No decisions affecting individuals were made based on this analysis.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> To reduce risk of unintended use, results are presented carefully while emphasizing that seasonal patterns do not explain underlying causes of homicide and should not be used to support harmful narratives or policies.


## Team Expectations 

* *Each member will contribute to data cleaning, analysis, visualization, coding, and writing.*
* *We expect everyone to communicate clearly and regularly, meet deadlines, and be open to receiving and giving constructive feedback.*
* *Tasks will be divided equally (such as coding, writing, or visualization) to give everyone a fair chance to demonstrate their knowledge and practice their skills in this class.*
* *If someone runs into issues or falls behind, they will let the team know in a timely manner to adjust responsibilities and deadlines.*
* *Any major decisions will be discussed as a group first.*

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/1  |  12 PM | Individual brainstorming on project ideas; look through different data collections; finalize project question  | Decide on final dataset; finalize research questions and hypothesis; assign members for project proposal  |
| 2/3  |  5 PM |  Project Proposal: Research question (Affaan), data (Affaan), Background & Prior Work (Minsui), Ethics & Privacy (Joey), Hypotheses (Pato), Team Expectations (Minsui), Project Timeline (Minsui)| Review project proposal and format for GitHub submission |
| 2/5  | 5 PM  | Review CSV data collection and project proposal  | Review data cleaning progress; finalize plan for Checkpoint 1   |
| 2/11  | 5 PM  | Start cleaning script for dates and manner of death | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/16  | 12 PM  | Data wrangling: standardized formats and categories in pandas. EDA & Basic Viz: Generate histograms and frequency counts | Polish code/visuals; verify data clarity and readability; submit for Checkpoint 1 |
| 2/25  | 5 PM  | Statistical Analysis: run SciPy tests to check if monthly spikes are significant| Discuss findings and potential confounding variables  |
| 3/2  | 12 PM  | Time-period analysis: create subset DataFrames to compare trends across different years and regions | Finalize all visualizations; check for code readability; submit for Checkpoint 2|
| 3/11  | 5 PM  | Results & Discussion: Draft final project overview explaining data | Review project flow; create presentation; polish final data, project report, and presentation |
| 3/16  | 5 PM  | Check for any bugs in repo; finish group peer surveys (everyone) | Ensure all members agree on conclusions; submit final project & group peer surveys |