## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Do certain months consistently have more homicides than others? If so, are the differences between months statistically significant, and have these monthly patterns changed over time?

## Background and Prior Work

Early criminological work and national surveillance data suggest that homicide is not evenly distributed throughout the year, with many studies reporting higher rates during warmer months. A data collection from the Bureau of Justice Statistics<sup>1</sup> found that violent crime tends to increase in the summer, although the difference is quite small (about 4%). The increase is often attributed to greater outdoor activity, increased social interaction, and higher ambient temperatures, which may elevate aggression and opportunities for conflict.

A closely related project on GitHub, “Crime Seasonality Analysis”<sup>2</sup>, investigates whether crime exhibits predictable seasonal structure by treating crime counts as a time series. The data was sourced from NYC Open Data<sup>3</sup> and the City of Chicago data portal<sup>4</sup>. Methodologically, the project’s pipeline is similar to what our question requires: the author grouped crime incidents by month and created graphs to see whether some months consistently had higher crime than others. As with the BJS data collection, their results showed that crime is not evenly spread across the year. They also noted that some types of crime show clearer seasonal patterns than others. An important takeaway for us was that the patterns from New York City and Chicago are not universal: combined data from both cities yielded more ambiguous results than data analyzed separately, an aggregation effect related to the statistical phenomenon known as Simpson’s Paradox<sup>5</sup>. From this, we will take into consideration that crime seasonality is fluid and depends on location and time period rather than following one fixed national pattern, and that any pattern we find is likely not permanent. Thus, we will also test whether homicide patterns change over time.

1. Block, Carolyn Rebecca. Seasonality of Crime and Victimization. Bureau of Justice Statistics, U.S. Department of Justice, 2010, https://bjs.ojp.gov/content/pub/pdf/spcvt.pdf
2. Ty1erz. Crime Seasonality Analysis. GitHub, https://github.com/ty1erz/seasonality_and_crime
3. New York City Open Data. NYC Crime Dataset, City of New York, https://data.cityofnewyork.us/Public-Safety/NYC-crime/qb7u-rbmr/about_data
4. City of Chicago Data Portal. Crimes – 2001 to Present, City of Chicago, https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data
5. Blyth, Colin. “Simpson’s Paradox.” Stanford Encyclopedia of Philosophy, Stanford University, https://plato.stanford.edu/entries/paradox-simpson/



## Hypothesis


Hypothesis: Homicide rates vary significantly by month, with summer months (June, July, August) experiencing higher homicide rates than winter months (December, January, February).

Null: There is no significant difference in homicide rates across months, and any observed variation is due to random chance.
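
One possible way to test the null hypothesis is a chi-square goodness-of-fit test on monthly homicide counts. The sketch below is an illustration under stated assumptions, not our final analysis: it assumes the null corresponds to a constant daily rate (so expected counts are proportional to month length), it works on counts rather than population-adjusted rates, and the function name `monthly_uniformity_test` and the example numbers are hypothetical.

```python
import numpy as np
from scipy.stats import chisquare

# Days per month in a non-leap year (February 29 records are excluded from the data).
DAYS_IN_MONTH = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])


def monthly_uniformity_test(monthly_counts):
    """Chi-square goodness-of-fit test of 12 monthly counts (Jan..Dec)
    against a constant daily rate, i.e. expected counts proportional
    to the number of days in each month."""
    observed = np.asarray(monthly_counts, dtype=float)
    if observed.shape != (12,):
        raise ValueError("expected 12 monthly counts, January through December")
    expected = observed.sum() * DAYS_IN_MONTH / DAYS_IN_MONTH.sum()
    statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
    return statistic, p_value


# Made-up counts for illustration only; these are NOT results from our data.
example_counts = [40, 35, 42, 45, 50, 58, 63, 60, 48, 44, 39, 41]
stat, p = monthly_uniformity_test(example_counts)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

A small p-value would only indicate that the monthly distribution departs from a constant daily rate. A directed summer-versus-winter comparison would still be needed to address the hypothesis itself.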

## Data

We would want a dataset that records individual deaths and has a column denoting the manner of death. The data should span multiple years and include the month and year of each death, so that we can group deaths by month and compare results across years.

## The Dataset

Our dataset:
https://github.com/the-pudding/data/tree/master/birthday-effect.

The dataset was obtained from a publicly available GitHub repository and was originally released through a FOIA request to the Massachusetts Registry of Vital Records and Statistics. It consists of individual records spanning multiple years, where each record represents a single death event.

The dataset contains the following variables:

- Date of birth (date)
- Date of death (date)
- Age at death (numerical, continuous)
- Sex (categorical)
- Marital status (categorical)
- Manner of death (categorical)

Since the dataset consists of individual-level records rather than aggregated statistics, it allows for detailed analysis of temporal patterns such as seasonal variation. The inclusion of exact dates lets us compute derived variables such as month or season of death, which are central to the project's hypothesis.
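
As a rough sketch of how the derived variables could be computed with pandas, the example below adds year, month, and season columns from a date-of-death column. The column name `date_of_death`, the output column names, and the helper `add_month_and_season` are hypothetical, since we have not yet confirmed the raw file's actual headers. The season grouping follows the months named in our hypothesis.

```python
import pandas as pd

# Calendar month number -> season (summer = Jun-Aug, winter = Dec-Feb).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def add_month_and_season(df, date_col="date_of_death"):
    """Return a copy of df with year, month, and season of death added.

    `date_col` is a hypothetical column name; dates that cannot be
    parsed become missing values instead of raising an error.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    out["death_year"] = dates.dt.year
    out["death_month"] = dates.dt.month
    # Use a function so that missing months map to a missing season.
    out["death_season"] = dates.dt.month.map(lambda m: SEASON_BY_MONTH.get(m))
    return out


# Toy rows for illustration only (not taken from the dataset).
toy = pd.DataFrame({"date_of_death": ["2001-07-04", "2003-01-15", "not a date"]})
print(add_month_and_season(toy))
```

Coercing unparseable dates to missing values keeps the bad rows visible, so they can be counted and handled explicitly during cleaning.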

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/the-pudding/data/master/birthday-effect/birthdays.csv', 'filename':'birthdays.csv'},  # raw file URL; the /blob/ URL returns GitHub's HTML page instead of the CSV
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading birthdays.csv: 0.00B [00:00, ?B/s]
Downloading birthdays.csv: 176kB [00:00, 1.48MB/s]
Overall Download Progress: 100%|██████████| 1/1 [00:00<00:00,  1.26it/s]

Successfully downloaded: birthdays.csv





### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
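
The wrangling steps listed above could start from something like the following sketch. It is not our completed cleaning code: the column names `date_of_death` and `manner_of_death`, the label `"Homicide"`, and both helper functions are assumptions that must be checked against the raw file in `data/00-raw/` before use.

```python
import pandas as pd


def missingness_report(df):
    """Count and percentage of missing values per column, most-missing first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
    })
    return report.sort_values("n_missing", ascending=False)


def homicide_counts_by_year_month(df, date_col="date_of_death",
                                  manner_col="manner_of_death",
                                  homicide_label="Homicide"):
    """Year-by-month table of homicide counts.

    Column names and the homicide label are hypothetical. Labels are
    compared after trimming whitespace and lowercasing, and rows with
    unparseable or missing dates are dropped.
    """
    dates = pd.to_datetime(df[date_col], errors="coerce")
    manner = df[manner_col].astype(str).str.strip().str.lower()
    keep = (manner == homicide_label.lower()) & dates.notna()
    kept = dates[keep]
    return pd.crosstab(kept.dt.year.rename("year"), kept.dt.month.rename("month"))


# Toy rows for illustration only (not taken from the dataset).
# With the real file this would start from something like:
#   df = pd.read_csv("data/00-raw/birthdays.csv")
toy = pd.DataFrame({
    "date_of_death": ["2001-07-04", "2001-07-20", "2002-01-15", None],
    "manner_of_death": ["Homicide", "homicide ", "Natural", "Homicide"],
})
print(toy.shape)
print(missingness_report(toy))
print(homicide_counts_by_year_month(toy))
```

The year-by-month table is a convenient starting point both for monthly totals (sum over rows) and for checking whether the monthly pattern shifts across years.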


## Ethics

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Example of how to use the checkbox, and also of how you can put in a short paragraph that discusses the way this checklist item affects your project.  Remove this paragraph and the X in the checkbox before you fill this out for your project

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

The data was obtained from a publicly available dataset hosted on GitHub, which was released through a FOIA request to the Massachusetts Registry of Vital Records and Statistics (RVRS). The data is anonymous and is examined for analytical purposes only, with no attempt to identify individuals. To further reduce privacy risks and avoid sensitive cases, the dataset excludes deaths where the age is under one and a half years, records with missing birth or death dates, and records with dates falling on February 29 (leap days).

Given the sensitive nature of mortality and homicide data, the analysis should be conducted carefully to avoid misleading interpretations or sensationalism. Although we intend to focus on seasonal patterns, other personal details may end up being relevant, such as age, sex, marital status, and manner of death. Concern about location privacy is limited, as the data contains no location indicators (such as ZIP codes) beyond the fact that it is drawn from the Massachusetts population.

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

* *Each member will contribute to data cleaning, analysis, visualization, coding, and writing.*
* *We expect everyone to communicate clearly and regularly, meet deadlines, and be open to receiving and giving constructive feedback.*
* *Tasks will be divided equally (such as coding, writing, or visualization) to give everyone a fair chance to demonstrate their knowledge and practice their skills in this class.*
* *If someone runs into issues or falls behind, they will let the team know in a timely manner to adjust responsibilities and deadlines.*
* *Any major decisions will be discussed as a group first.*

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/1  |  12 PM | Individual brainstorming on project ideas; look through different data collections; finalize project question  | Decide on final dataset; finalize research questions and hypothesis; assign members for project proposal  |
| 2/3  |  5 PM |  Project Proposal: Research question (Affaan), data (Affaan), Background & Prior Work (Minsui), Ethics & Privacy (Joey), Hypotheses (Pato), Team Expectations (Minsui), Project Timeline (Minsui)| Review project proposal and format for GitHub submission |
| 2/5  | 5 PM  | Review CSV data collection and project proposal  | Review data cleaning progress; finalize plan for Checkpoint 1   |
| 2/11  | 5 PM  | Start cleaning script for dates and manner of death | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/16  | 12 PM  | Data wrangling: standardized formats and categories in pandas. EDA & Basic Viz: Generate histograms and frequency counts | Polish code/visuals; verify data clarity and readability; submit for Checkpoint 1 |
| 2/25  | 5 PM  | Statistical Analysis: run SciPy tests to check if monthly spikes are significant| Discuss findings and potential confounding variables  |
| 3/2  | 12 PM  | Time-period analysis: create subset DataFrames to compare trends across different years and regions | Finalize all visualizations; check for code readability; submit for Checkpoint 2|
| 3/11  | 5 PM  | Results & Discussion: Draft final project overview explaining data | Review project flow; create presentation; polish final data, project report, and presentation |
| 3/16  | 5 PM  | Check for any bugs in repo; finish group peer surveys (everyone) | Ensure all members agree on conclusions; submit final project & group peer surveys |