# COGS 108 - Data Checkpoint

# Names

- Jonathan Chou
- Eli Marx-Kahn
- Kevin Lee
- Sherry Ma

<a id='research_question'></a>
# Research Question

Is there a relationship between temperature trends and number of suicides in New York City? 
More specifically:
1. Do suicide rates in a given season differ when the average temperature in that season deviates from the typical\* seasonal temperature?
2. Do anomalously warm or cold seasons correlate with higher- or lower-than-average suicide rates?
    - Does greater fluctuation in temperature correlate with higher- or lower-than-average suicide rates?
3. Do suicide rates differ between men and women in years with greater temperature fluctuation?

**\*"Typical" here refers to the average seasonal temperature across the years 2008-2014.**
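
As a rough illustration of the "typical" definition above, the deviation of each season from its typical temperature could be computed as follows. This is only a sketch: the column names (`Year`, `Season`, `MeanTemp`) and every temperature value are made up for the example and do not come from our data.

```python
import pandas as pd

# Hypothetical seasonal mean temperatures (degrees F); all values are illustrative.
seasonal_temps = pd.DataFrame({
    "Year":     [2008, 2008, 2009, 2009, 2010, 2010],
    "Season":   ["Winter", "Summer", "Winter", "Summer", "Winter", "Summer"],
    "MeanTemp": [34.0, 75.0, 32.0, 73.0, 36.0, 77.0],
})

# "Typical" temperature: the mean for each season across all years in the table.
seasonal_temps["Typical"] = (
    seasonal_temps.groupby("Season")["MeanTemp"].transform("mean")
)

# Anomaly: how far a given season deviated from its typical value.
seasonal_temps["Anomaly"] = seasonal_temps["MeanTemp"] - seasonal_temps["Typical"]

print(seasonal_temps)
```

A positive `Anomaly` marks a season that was warmer than typical, and a negative one marks a colder season.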


# Dataset(s)

### Dataset 1
- Dataset Name: Suicide counts and rates in New York City 2000-2014
- Link to the dataset: https://www1.nyc.gov/assets/doh/downloads/pdf/epi/databrief75.pdf
    - Page 5 of the PDF
- Number of observations: 15
- Description: Yearly suicide counts and rates in NYC from 2000-2014, broken down by gender and rate type (crude and age-adjusted).

### Dataset 2
- Dataset Name: Seasonal counts of suicide in New York City 2008-2014 
- Link to the dataset: https://www1.nyc.gov/assets/doh/downloads/pdf/epi/databrief75.pdf
    - Page 6 of the PDF
- Number of observations: 7
- Description: Suicide counts in NYC from 2008-2014, broken down by season (winter, spring, summer, fall).

### Dataset 3
- Dataset Name: New York City Temperature Trends 2008-2010 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data from NYC as CSV for specified date range
- Number of observations: 47733
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the years 2008-2010.


### Dataset 4
- Dataset Name: New York City Temperature Trends 2011-2013 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data from NYC as CSV for specified date range
- Number of observations: 62131
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the years 2011-2013.

### Dataset 5
- Dataset Name: New York City Temperature Trends 2014 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data from NYC as CSV for specified date range
- Number of observations: 21809
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the year 2014.

### Multiple Datasets/Merge:
- Datasets 1 and 2 will be merged into one suicide dataset. Because they describe different variables, the merge adds columns rather than rows.
- Datasets 3, 4, and 5 track the same variables over different time frames, so they will be concatenated row-wise into one temperature dataset covering 2008 to 2014.
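
The two kinds of combination described above can be sketched with small stand-in tables. The column names (`Year`, `Total`, `Winter`, `DATE`, `TAVG`) and all values here are placeholders chosen for the example; the real files may be laid out differently.

```python
import pandas as pd

# Stand-ins for the yearly and seasonal suicide tables (placeholder values).
yearly = pd.DataFrame({"Year": [2008, 2009], "Total": [10, 12]})
seasonal = pd.DataFrame({"Year": [2008, 2009], "Winter": [2, 3]})

# Merging on the shared key adds columns: one row per year, more variables.
suicide_dataset = pd.merge(yearly, seasonal, on="Year", how="inner")

# Stand-ins for the three temperature files (placeholder values).
temps_2008_2010 = pd.DataFrame({"DATE": ["2008-01-01", "2010-12-31"], "TAVG": [30.0, 28.0]})
temps_2011_2013 = pd.DataFrame({"DATE": ["2011-01-01", "2013-12-31"], "TAVG": [25.0, 33.0]})
temps_2014 = pd.DataFrame({"DATE": ["2014-01-01"], "TAVG": [27.0]})

# Concatenating stacks rows: same variables, longer time span.
# ignore_index=True gives the result one continuous 0..n-1 index.
temperature_dataset = pd.concat(
    [temps_2008_2010, temps_2011_2013, temps_2014], ignore_index=True
)

print(suicide_dataset.shape)      # rows unchanged, columns added
print(temperature_dataset.shape)  # rows added, columns unchanged
```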



# Setup

In [13]:
import pandas as pd
import numpy as np

# Load the yearly and seasonal suicide data
suicide_count_yearly = pd.read_csv("Data/Suicide counts and rates in New York City 2000-2014.csv")
suicide_count_seasonal = pd.read_csv("Data/Seasonal counts of suicide in New York City 2008-2014.csv")

# Load the temperature data from the three date-range files
dataset_2008_to_2010 = pd.read_csv("Data/2008-10.csv")
dataset_2011_to_2013 = pd.read_csv("Data/2011-13.csv")
dataset_2014 = pd.read_csv("Data/2014.csv")

# Data Cleaning

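For the suicide data, we restrict the yearly table to 2008-2014 so it matches the seasonal table, keep only its first four columns (dropping the crude and age-adjusted rates), merge it with the seasonal counts, and rename the seasonal columns so their meaning is explicit.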

In [76]:
## Suicide Data Cleaning

# The yearly suicide data cover 2000-2014; keep only 2008-2014
suicide_count_yearly['Year'] = suicide_count_yearly['Year'].astype(int)
suicide_count_yearly = suicide_count_yearly[suicide_count_yearly['Year'] >= 2008]
suicide_count_yearly = suicide_count_yearly.reset_index(drop=True)

# Keep the first four columns; this drops the crude and age-adjusted rates
suicide_count_yearly = suicide_count_yearly.iloc[:, :4]

# Merge with the seasonal data (with no `on` argument, pd.merge joins on the columns both tables share)
suicide_dataset = pd.merge(left=suicide_count_yearly, right=suicide_count_seasonal, how='inner')

# Rename the seasonal columns so their meaning is explicit
new_columns = {'Winter': 'Winter count of suicides (Total)', 'Spring': 'Spring count of suicides (Total)',
               'Summer': 'Summer count of suicides (Total)', 'Fall': 'Fall count of suicides (Total)'}
suicide_dataset = suicide_dataset.rename(new_columns, axis='columns')

# Verify integrity of the dataset
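
One possible shape for the integrity check mentioned in the comment above is sketched below. The helper `check_integrity` is hypothetical, and the stand-in table uses a placeholder `Total` column with made-up values; only the `Year` column and the 2008-2014 range are taken from the cleaning steps.

```python
import pandas as pd

# Stand-in for the merged table; the real one comes from the pd.merge step.
# The "Total" column name and all of its values are placeholders.
suicide_dataset = pd.DataFrame({
    "Year": list(range(2008, 2015)),
    "Total": [1, 2, 3, 4, 5, 6, 7],
})

def check_integrity(df, first_year=2008, last_year=2014):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    expected_years = list(range(first_year, last_year + 1))
    # Each expected year should appear exactly once.
    if sorted(df["Year"].tolist()) != expected_years:
        problems.append("Year column does not contain each expected year exactly once")
    # No cell should be missing after the merge.
    if df.isnull().values.any():
        problems.append("table contains missing values")
    # No row should be repeated.
    if df.duplicated().any():
        problems.append("table contains duplicate rows")
    return problems

print(check_integrity(suicide_dataset))
```

Returning a list of problems, rather than raising on the first failure, makes it easier to see everything that is wrong with a table in one pass.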

In [None]:
## Temperature Data Cleaning
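
One way the temperature cleaning could begin is to reduce the daily station readings to one mean temperature per year and season, so they line up with the seasonal suicide counts. The sketch below runs on a small stand-in table. The column names `DATE` and `TAVG`, the month-to-season mapping, and the choice to count December in its own calendar year are all assumptions that would need to be checked against the real CSV files and the season definition used in the suicide data.

```python
import pandas as pd

# Stand-in for the combined daily station data (placeholder values).
daily = pd.DataFrame({
    "DATE": ["2008-01-15", "2008-01-15", "2008-07-04", "2008-07-05", "2009-01-10"],
    "TAVG": [30.0, 34.0, 80.0, 84.0, None],
})

# Parse dates so that year and month can be extracted.
daily["DATE"] = pd.to_datetime(daily["DATE"])

# Assumed mapping from calendar month to season.
month_to_season = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}
daily["Season"] = daily["DATE"].dt.month.map(month_to_season)
daily["Year"] = daily["DATE"].dt.year

# Drop readings with no average temperature, then average over all
# stations and days within each (Year, Season) pair.
seasonal_means = (
    daily.dropna(subset=["TAVG"])
         .groupby(["Year", "Season"], as_index=False)["TAVG"]
         .mean()
)

print(seasonal_means)
```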

# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/17  |  8 PM | Import & Wrangle Data, EDA  | Review/Edit wrangling/EDA; Discuss Analysis Plan | 
| 2/24  |  8 PM |  Finalize wrangling/EDA; Begin Analysis  | Discuss/edit Analysis; Complete project check-in | 
| 3/3  |   8 PM  | Complete analysis; Draft results/conclusion/discussion   | Discuss/edit full project   |
| 3/10  |  8 PM  | Any last minute revisions | Turn in Final Project & Group Project Surveys  |
