# COGS 108 - Data Checkpoint

# Names

- Jonathan Chou
- Eli Marx-Kahn
- Kevin Lee
- Sherry Ma

<a id='research_question'></a>
# Research Question

Is there a relationship between temperature trends and the number of suicides in New York City? 
More specifically:
1. Do suicide rates in a given season differ when average temperatures in that season deviate from typical* seasonal temperatures?
2. Do anomalously warm or cold seasons correlate with higher or lower than average suicide rates?
    - Does higher fluctuation in temperature correlate with higher or lower than average suicide rates?
3. Do suicide rates differ between men and women in years when there is higher temperature fluctuation?

**\*"Typical" here refers to the average seasonal temperature across the years 2008-2014.**
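
As a minimal sketch of how a deviation from "typical" could be computed, the block below subtracts each season's all-years mean from that season's mean in a given year. The column names (`Season`, `MeanTemp`, `Anomaly`) and the temperature values are hypothetical placeholders, not our data.

```python
import pandas as pd

# Hypothetical seasonal mean temperatures in °F (placeholder values, not real data).
seasonal_means = pd.DataFrame({
    "Year": [2008, 2009, 2010, 2008, 2009, 2010],
    "Season": ["Winter", "Winter", "Winter", "Summer", "Summer", "Summer"],
    "MeanTemp": [34.0, 32.0, 36.0, 75.0, 73.0, 77.0],
})

# "Typical" temperature for a season: the mean of that season across all years.
typical = seasonal_means.groupby("Season")["MeanTemp"].transform("mean")

# Anomaly: how far a given season's mean sits above or below its typical value.
seasonal_means["Anomaly"] = seasonal_means["MeanTemp"] - typical
print(seasonal_means)
```

A positive `Anomaly` marks a warmer-than-typical season and a negative one a colder-than-typical season.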


# Dataset(s)

### Dataset 1
- Dataset Name: Suicide counts and rates in New York City 2000-2014
- Link to the dataset: https://www1.nyc.gov/assets/doh/downloads/pdf/epi/databrief75.pdf
    - 5th page of pdf
- Number of observations: 15
- Description: Yearly suicide counts in NYC from 2000 to 2014 (total and broken down by gender), along with crude and age-adjusted rates.

### Dataset 2
- Dataset Name: Seasonal counts of suicide in New York City 2008-2014 
- Link to the dataset: https://www1.nyc.gov/assets/doh/downloads/pdf/epi/databrief75.pdf
    - 6th page of pdf
- Number of observations: 7
- Description: Suicide counts in NYC from 2008 to 2014, broken down by the 4 seasons.

### Dataset 3
- Dataset Name: New York City Temperature Trends 2008-2010 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data for NYC as CSV for specified date range
- Number of observations: 47733
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the years 2008-2010.


### Dataset 4
- Dataset Name: New York City Temperature Trends 2011-2013 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data for NYC as CSV for specified date range
- Number of observations: 62131
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the years 2011-2013.

### Dataset 5
- Dataset Name: New York City Temperature Trends 2014 
- Link to the dataset: https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table
    - Must request air temperature data for NYC as CSV for specified date range
- Number of observations: 21809
- Description: Daily min/max/avg temperature gathered by the weather stations in New York for the year 2014.

### Multiple Datasets/Merge:
- Datasets 1 and 2 will be merged on year into one suicide dataset; the seasonal counts from Dataset 2 are added as extra columns alongside the yearly counts from Dataset 1.
- Datasets 3, 4, and 5 track the same variables over different time frames, so they will be concatenated row-wise into one dataset that contains the temperature data from 2008 to 2014.



# Setup

In [1]:
import pandas as pd
import numpy as np

#getting the suicide data
suicide_count_yearly = pd.read_csv("Data/Suicide counts and rates in New York City 2000-2014.csv")
suicide_count_seasonal = pd.read_csv("Data/Seasonal counts of suicide in New York City 2008-2014.csv")

#getting the temperature data from all 3 datasets
dataset_2008_to_2010 = pd.read_csv("Data/2008-10.csv")
dataset_2011_to_2013 = pd.read_csv("Data/2011-13.csv")
dataset_2014 = pd.read_csv("Data/2014.csv")

# Data Cleaning

### Initial Suicide Data Cleaning

In [2]:
## Suicide Data Cleaning

#the yearly suicide data covers 2000-2014; keep only 2008-2014 to match the seasonal data
suicide_count_yearly['Year'] = suicide_count_yearly['Year'].astype(int)
suicide_count_yearly = suicide_count_yearly[suicide_count_yearly['Year'] >= 2008]
suicide_count_yearly = suicide_count_yearly.reset_index(drop=True)

#remove irrelevant columns (crude rate and age-adjusted rate) by keeping the first 4 columns
suicide_count_yearly = suicide_count_yearly.iloc[:,:4]

#merge with the seasonal data on the shared 'Year' column
suicide_dataset = pd.merge(left=suicide_count_yearly, right=suicide_count_seasonal, on='Year', how='inner')

#rename columns
new_columns = {'Winter':'Winter count of suicides (Total)', 'Spring':'Spring count of suicides (Total)', 
               'Summer':'Summer count of suicides (Total)', 'Fall':'Fall count of suicides (Total)'}
suicide_dataset = suicide_dataset.rename(new_columns, axis='columns')
suicide_dataset

Unnamed: 0,Year,Count of Suicides (Total),Count of suicides (Females),Count of suicides (Males),Winter count of suicides (Total),Spring count of suicides (Total),Summer count of suicides (Total),Fall count of suicides (Total)
0,2008,473,125,348,117,143,114,99
1,2009,475,115,360,129,103,128,115
2,2010,503,129,374,109,147,141,106
3,2011,509,128,381,126,133,137,113
4,2012,557,163,391,126,162,136,133
5,2013,550,146,404,142,158,133,117
6,2014,565,172,393,121,167,160,117
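
The merge step can be made a little more defensive. The sketch below rebuilds the 2008 and 2009 rows from the table above as two small frames and merges them with an explicit key and `validate="one_to_one"`, which makes pandas raise an error if either side has a duplicated year. The frame names `yearly` and `seasonal` are illustrative only.

```python
import pandas as pd

# Two-row stand-ins for the yearly and seasonal tables (values taken from the output above).
yearly = pd.DataFrame({
    "Year": [2008, 2009],
    "Count of Suicides (Total)": [473, 475],
})
seasonal = pd.DataFrame({
    "Year": [2008, 2009],
    "Winter": [117, 129],
    "Spring": [143, 103],
    "Summer": [114, 128],
    "Fall": [99, 115],
})

# validate="one_to_one" raises pandas.errors.MergeError if 'Year' repeats on either side,
# so an accidental duplicate row cannot silently multiply the merged rows.
merged = pd.merge(yearly, seasonal, on="Year", how="inner", validate="one_to_one")
print(merged)
```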


### Verifying Integrity of Dataset

Verifying that the seasonal counts sum to the total yearly counts

In [3]:
suicide_data_total_column = suicide_dataset['Count of Suicides (Total)']

#sum up seasonal total counts to check if matches yearly total count
season_columns = ['Winter count of suicides (Total)', 'Spring count of suicides (Total)', 
               'Summer count of suicides (Total)', 'Fall count of suicides (Total)']
sum_season_counts = suicide_dataset[season_columns].sum(axis=1)
print("Does seasonal suicide count match yearly total count? " + str(sum_season_counts.equals(suicide_data_total_column)))

Does seasonal suicide count match yearly total count? True


Verifying that the gender counts sum to the total yearly counts

In [4]:
#sum up gender-specific counts to check whether they match the yearly total count
gender_columns = ['Count of suicides (Females)', 'Count of suicides (Males)']
sum_gender_counts = suicide_dataset[gender_columns].sum(axis=1)
sum_gender_counts.equals(suicide_data_total_column)

#the comparison above is False, so find the rows where the sums disagree
false_rows = suicide_dataset[sum_gender_counts.ne(suicide_data_total_column)]
false_rows

Unnamed: 0,Year,Count of Suicides (Total),Count of suicides (Females),Count of suicides (Males),Winter count of suicides (Total),Spring count of suicides (Total),Summer count of suicides (Total),Fall count of suicides (Total)
4,2012,557,163,391,126,162,136,133


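For 2012, the female and male counts sum to 163 + 391 = 554, which is 3 fewer than the reported total of 557. The relative discrepancy is $\frac{557-554}{557} \approx 0.54\%$, which is under 1%, so the mismatch is minimal and we keep this observation.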
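
The arithmetic for the 2012 row can be checked directly. This short sketch uses only the three counts from the row above; the variable names are illustrative.

```python
# 2012 counts from the mismatched row above.
female, male, total = 163, 391, 557

gender_sum = female + male        # sum of the gender-specific counts
discrepancy = total - gender_sum  # deaths in the total not covered by either gender column
relative = discrepancy / total    # discrepancy as a fraction of the reported total

print(f"gender sum = {gender_sum}, discrepancy = {discrepancy}, relative = {relative:.2%}")
```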

### Temperature Data Cleaning

Initial temperature data cleaning.

In [5]:
#combine temperature datasets to get one dataframe with data from 2008-2014
temperature_data = pd.concat([dataset_2008_to_2010, dataset_2011_to_2013, dataset_2014], sort=True)

#remove irrelevant columns (TAVG, TOBS); NAME is kept because it is used to filter stations below
temperature_data = temperature_data.drop(['TAVG','TOBS'], axis=1)

#remove stations not in New York
temperature_data = temperature_data[temperature_data.NAME.str.contains("NY US")]

#keep only the stations with the highest observation count; stations with fewer rows are missing days in the timeframe
max_observations = temperature_data['NAME'].value_counts().max()
highest_observation_stations = temperature_data['NAME'].value_counts() == max_observations
highest_observation_stations = highest_observation_stations.index[highest_observation_stations]
temperature_data = temperature_data[temperature_data['NAME'].isin(highest_observation_stations)]


#separate all highest observation stations into separate DataFrames
stations = []
for station in highest_observation_stations:
    stations.append(temperature_data[temperature_data['NAME'] == station])

#check for missing data 
for s in stations:
    print("Any missing data? " + str(s.isnull().values.any()))

Any missing data? False
Any missing data? False
Any missing data? False


There is no missing data in the stations that had the highest observation counts. However, this does not guarantee that every day between 2008 and 2014 has a recorded observation, since the stations could coincidentally be missing the same number of days. Thus, we verify by checking each station for duplicate dates and comparing the number of observations with the number of days in those 7 years.

In [6]:
#check each station for duplicate dates
for s in stations:
    print("No Duplicate Date: " + str(s["DATE"].is_unique))
    
#calculate total number of days between 2008-2014
from datetime import date
f_date = date(2008, 1, 1)
l_date = date(2015, 1, 1) #up to and not including 1/1/2015
day_count = (l_date - f_date).days

print("Does total number of days match observation count: " + str(day_count == max_observations))

No Duplicate Date: True
No Duplicate Date: True
No Duplicate Date: True
Does total number of days match observation count: True
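
A more direct variant of this check lists exactly which calendar dates are absent for a station, rather than only comparing counts. The sketch below is one possible way to do that; `missing_dates` is a hypothetical helper, not part of our notebook, and the toy frame only borrows the `DATE` column name from the temperature data.

```python
import pandas as pd

def missing_dates(station_df, start="2008-01-01", end="2014-12-31"):
    """Return the calendar dates in [start, end] that do not appear in station_df['DATE']."""
    expected = pd.date_range(start=start, end=end, freq="D")
    observed = pd.DatetimeIndex(pd.to_datetime(station_df["DATE"]))
    return expected.difference(observed)

# Toy station with a deliberate gap on 2008-01-03.
toy = pd.DataFrame({"DATE": ["2008-01-01", "2008-01-02", "2008-01-04"]})
gaps = missing_dates(toy, start="2008-01-01", end="2008-01-04")
print(gaps)

# Number of calendar days from 2008-01-01 through 2014-12-31 (includes leap days in 2008 and 2012).
n_days = len(pd.date_range("2008-01-01", "2014-12-31", freq="D"))
print(n_days)
```

For a complete station the returned index would be empty, which is a stricter condition than matching row counts alone.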


Now that we have cleaned and verified each station's data, we assign each station to a variable for easier analysis.

In [7]:
station1 = stations[0]
station2 = stations[1]
station3 = stations[2]
display(station1)
display(station2)
display(station3)

Unnamed: 0,DATE,NAME,STATION,TMAX,TMIN
39330,2008-01-01,"NY CITY CENTRAL PARK, NY US",USW00094728,47.0,37.0
39331,2008-01-02,"NY CITY CENTRAL PARK, NY US",USW00094728,38.0,17.0
39332,2008-01-03,"NY CITY CENTRAL PARK, NY US",USW00094728,20.0,12.0
39333,2008-01-04,"NY CITY CENTRAL PARK, NY US",USW00094728,36.0,16.0
39334,2008-01-05,"NY CITY CENTRAL PARK, NY US",USW00094728,43.0,32.0
...,...,...,...,...,...
19492,2014-12-27,"NY CITY CENTRAL PARK, NY US",USW00094728,55.0,44.0
19493,2014-12-28,"NY CITY CENTRAL PARK, NY US",USW00094728,54.0,43.0
19494,2014-12-29,"NY CITY CENTRAL PARK, NY US",USW00094728,44.0,34.0
19495,2014-12-30,"NY CITY CENTRAL PARK, NY US",USW00094728,34.0,28.0


Unnamed: 0,DATE,NAME,STATION,TMAX,TMIN
5767,2008-01-01,"LAGUARDIA AIRPORT, NY US",USW00014732,49.0,35.0
5768,2008-01-02,"LAGUARDIA AIRPORT, NY US",USW00014732,39.0,19.0
5769,2008-01-03,"LAGUARDIA AIRPORT, NY US",USW00014732,23.0,15.0
5770,2008-01-04,"LAGUARDIA AIRPORT, NY US",USW00014732,37.0,19.0
5771,2008-01-05,"LAGUARDIA AIRPORT, NY US",USW00014732,43.0,33.0
...,...,...,...,...,...
4523,2014-12-27,"LAGUARDIA AIRPORT, NY US",USW00014732,55.0,42.0
4524,2014-12-28,"LAGUARDIA AIRPORT, NY US",USW00014732,52.0,42.0
4525,2014-12-29,"LAGUARDIA AIRPORT, NY US",USW00014732,44.0,35.0
4526,2014-12-30,"LAGUARDIA AIRPORT, NY US",USW00014732,35.0,28.0


Unnamed: 0,DATE,NAME,STATION,TMAX,TMIN
18911,2008-01-01,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,48.0,29.0
18912,2008-01-02,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,37.0,17.0
18913,2008-01-03,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,22.0,14.0
18914,2008-01-04,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,37.0,14.0
18915,2008-01-05,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,40.0,26.0
...,...,...,...,...,...
12703,2014-12-27,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,54.0,37.0
12704,2014-12-28,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,54.0,42.0
12705,2014-12-29,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,46.0,35.0
12706,2014-12-30,"JFK INTERNATIONAL AIRPORT, NY US",USW00094789,36.0,28.0
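
One possible next step is to collapse the three stations into a single citywide daily series and label each day with a season. The sketch below uses the first two days of each station from the tables above. It rests on two assumptions that are ours, not the source's: the daily midpoint `(TMAX + TMIN) / 2` stands in for a daily average, and seasons are assigned by calendar month (December to February as winter, and so on), which may not match the season definition used in the suicide data brief. The names `TMID` and `SEASON_BY_MONTH` are hypothetical.

```python
import pandas as pd

# First two days for each of the three stations (values from the displayed tables).
rows = pd.DataFrame({
    "DATE": ["2008-01-01"] * 3 + ["2008-01-02"] * 3,
    "NAME": ["NY CITY CENTRAL PARK, NY US",
             "LAGUARDIA AIRPORT, NY US",
             "JFK INTERNATIONAL AIRPORT, NY US"] * 2,
    "TMAX": [47.0, 49.0, 48.0, 38.0, 39.0, 37.0],
    "TMIN": [37.0, 35.0, 29.0, 17.0, 19.0, 17.0],
})
rows["DATE"] = pd.to_datetime(rows["DATE"])

# Assumption: approximate each station's daily average by the midpoint of TMAX and TMIN.
rows["TMID"] = (rows["TMAX"] + rows["TMIN"]) / 2

# Average across stations to get one citywide value per day.
daily = rows.groupby("DATE")[["TMAX", "TMIN", "TMID"]].mean()

# Assumption: month-based seasons; the source brief may define seasons differently.
SEASON_BY_MONTH = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}
daily["Season"] = [SEASON_BY_MONTH[m] for m in daily.index.month]
print(daily)
```

Note that labeling by month places December in the same calendar year's winter as the preceding January and February, so the grouping would need adjusting if winter is meant to span the year boundary.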


# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/17  |  8 PM | Import & Wrangle Data, EDA  | Review/Edit wrangling/EDA; Discuss Analysis Plan | 
| 2/24  |  8 PM |  Finalize wrangling/EDA; Begin Analysis  | Discuss/edit Analysis; Complete project check-in | 
| 3/3  |   8 PM  | Complete analysis; Draft results/conclusion/discussion   | Discuss/edit full project   |
| 3/10  |  8 PM  | Any last minute revisions | Turn in Final Project & Group Project Surveys  |
