# Group Project Datasets Spring 2024 

### Your group has been assigned one of the following data sets.
This notebook contains:
* The code to load each of the data sets
* References to the source and possible metadata
* Data cleaning issues to consider
* One or two ideas for relationships to explore, but do not feel constrained. Explore the data and use your imagination to find other possibilities!

In [2]:
# Import Numpy and Datascience modules.
import numpy as np
import pandas as pd
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline

## Data Set #1: Ecological Footprint
This dataset measures the ecological resources used by each country from 1961 to 2016. More information can be found at: https://data.world/footprint/nfa-2019-edition

This data set appears to be clean, but there is a lack of metadata:

* No units are provided. I believe areas are in hectares and carbon is in metric tons.
* QScore is explained here: https://www.footprintnetwork.org/data-quality-scores/
* The "total" column is not explained, but I think it is the total area (ha).

### Data Cleaning Issues:
* There are missing values in some of the columns.
* The "country" field includes "World" as a country, which could confound statistics.
* "forest_land" has both numbers and numbers in quotes that read in as strings.
* There is no "total land" column to put areas in perspective.
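A minimal sketch of the first two cleaning steps, assuming the data is handled as a pandas DataFrame (it could be converted back with `Table.from_df`). The rows below are made up to mimic the issues listed above; they are not taken from the real file.

```python
import pandas as pd

# Hypothetical rows that mimic the issues described above (not real data).
sample = pd.DataFrame({
    "country": ["Armenia", "World", "China"],
    "year": [1992, 1992, 1992],
    "forest_land": [0.097, 1.25, '"0.31"'],  # mix of numbers and quoted strings
})

# Remove the "World" aggregate so it is not counted as a country.
countries_only = sample[sample["country"] != "World"].copy()

# Strip stray quote characters, then convert; unparseable entries become NaN.
cleaned = countries_only["forest_land"].astype(str).str.strip('"')
countries_only["forest_land"] = pd.to_numeric(cleaned, errors="coerce")

print(countries_only)
```

Using `errors="coerce"` turns anything that still cannot be parsed into a missing value, so it is worth counting the NaNs afterwards to see how much was lost.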

### Possible Hypothesis to Test:
The changes in land use over time could be interesting to investigate. China is a fascinating example where huge policy shifts drove change. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5036680/

In [2]:
url = 'data/NFA 2019 public_data.csv'
ecoFootprint = Table.read_table(url, low_memory=False)
ecoFootprint.show(3)

country,year,country_code,record,crop_land,grazing_land,forest_land,fishing_ground,built_up_land,carbon,total,QScore
Armenia,1992,1,AreaPerCap,0.140292,0.199546,0.097188051,0.0368885,0.0293199,0,0.503235,3A
Armenia,1992,1,AreaTotHA,483000.0,687000.0,334600.0,127000.0,100943.0,0,1732540.0,3A
Armenia,1992,1,BiocapPerCap,0.159804,0.135261,0.084003213,0.0137421,0.0333978,0,0.426209,3A


## Data Set #2: Chronic Kidney Disease 
More information on columns at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease
##### <font color='red'>** Note: This data is not available for a group project because it is part of a class demonstration of k nearest neighbor machine learning ** </font>
See: [Class024](https://temple.2i2c.cloud/hub/user-redirect/lab/tree/datascience/Spring%202024/Class_Examples/Class024_knn%20kidney%20disease.ipynb)
### Data Cleaning Issues:
* There are missing values in some of the columns.

### Possible Hypothesis to Test:
This dataset is structured for machine learning to identify patients as positive or negative for chronic kidney disease. It would be interesting to compare the means of various columns for positive and negative patients. Correlations are likely to exist as well. The data set is also a candidate for machine learning using k-means clustering.

In [3]:
url = 'data/kidney_disease.csv'
kidneyDisease = Table.read_table(url)
kidneyDisease.show(3)

id,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,bu,sc,sod,pot,hemo,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,48,80,1.02,1,0,,normal,notpresent,notpresent,121.0,36,1.2,,,15.4,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,7,50,1.02,4,0,,normal,notpresent,notpresent,,18,0.8,,,11.3,38,6000,,no,no,no,good,no,no,ckd
2,62,80,1.01,2,3,normal,normal,notpresent,notpresent,423.0,53,1.8,,,9.6,31,7500,,no,yes,no,poor,no,yes,ckd


__________________________

## Data Set #3: Periodic Table 

<img src="data/xkcd_periodic_table.png" width="600">

More information on columns: https://www.kaggle.com/datasets/berkayalan/chemical-periodic-table-elements?select=chemical_elements.csv  Of course, there are numerous references that discuss element groupings.

### Data Cleaning Issues:
* The Discovery(Year) column includes "ancient" as a year.
* Some columns (e.g., Boiling Point) load as strings because they use commas as thousands separators, so they would need to be converted to numbers.
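One possible way to handle both issues, sketched with pandas on a few illustrative rows. The values are assumptions chosen to show the pattern, not copied from the file.

```python
import pandas as pd

# Hypothetical values illustrating the two issues (not copied from the file).
elements = pd.DataFrame({
    "Name": ["Iron", "Gold", "Lithium"],
    "Boiling Point (°C)": ["2,862", "2,856", "1347"],
    "Discovery(Year)": ["ancient", "ancient", "1817"],
})

# Remove thousands separators before converting to numbers.
boiling = elements["Boiling Point (°C)"].str.replace(",", "", regex=False)
elements["Boiling Point (°C)"] = pd.to_numeric(boiling)

# "ancient" has no numeric year, so it becomes NaN (missing).
elements["Discovery(Year)"] = pd.to_numeric(
    elements["Discovery(Year)"], errors="coerce"
)

print(elements)
```

Whether "ancient" should become a missing value or some placeholder year depends on the question being asked; treating it as missing is only one reasonable choice.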

### Possible Hypothesis:
Does boiling point correlate with atomic weight?
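As a rough sketch of how the correlation could be computed, the snippet below uses the three rows that appear in the preview of this data set (hydrogen, helium, lithium). Three points say nothing about the real relationship; the cleaned full columns would take their place. The helper names `standard_units` and `correlation` are defined here for illustration.

```python
import numpy as np

# First three rows shown in the table preview (hydrogen, helium, lithium).
atomic_weight = np.array([1.008, 4.003, 6.941])
boiling_point = np.array([-253.0, -269.0, 1347.0])

def standard_units(values):
    """Convert an array to standard units (mean 0, SD 1)."""
    return (values - np.mean(values)) / np.std(values)

def correlation(x, y):
    """Correlation coefficient r: the mean of the products in standard units."""
    return np.mean(standard_units(x) * standard_units(y))

r = correlation(atomic_weight, boiling_point)
print(round(r, 3))
```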

In [4]:
url = 'data/chemical_elements.csv'
ptdf = pd.read_csv(url, sep = ';')
pt = Table.from_df(ptdf)
pt.show(3)

Atomic Number,Name,Atomic weight,Symbol,Melting Point (°C),Boiling Point (°C),Discovery(Year),Group*,"Electron configuration,,"
1,Hydrogen,1.008,H,-259,-253,1776,1,"1s1,,"
2,Helium,4.003,He,-272,-269,1895,18,"1s2,,"
3,Lithium,6.941,Li,180,1347,1817,1,"[He] 2s1,"


## Data Set #4: Wine Quality Dataset 
More information at https://archive.ics.uci.edu/ml/datasets/wine+quality

The wine quality dataset can be used to understand which chemical properties contribute to a higher quality wine.

### Data Cleaning Issues:
The data set is clean.

### Possible Hypothesis to Test:
Example hypothesis: Wines with higher acidity may have lower quality at present (test) but improve with aging [Wine Enthusiast](https://www.wineenthusiast.com/basics/advanced-studies/what-is-acidity-in-wine/#). One might also compare the properties of red versus white wines. Do any have significantly different means?
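A question like "do red and white wines have different means?" could be approached with a permutation test. The sketch below uses NumPy and made-up pH values for two small groups; the numbers are assumptions for illustration, and the real `pH` and `type` columns would replace them.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical pH values for two small groups (not taken from the data set).
red = np.array([3.51, 3.20, 3.26, 3.16, 3.30])
white = np.array([3.00, 3.30, 3.26, 3.19, 3.10])

observed = np.mean(red) - np.mean(white)

# Shuffle the group labels many times and recompute the difference each time.
pooled = np.concatenate([red, white])
simulated = np.empty(5000)
for i in range(5000):
    shuffled = rng.permutation(pooled)
    simulated[i] = np.mean(shuffled[:len(red)]) - np.mean(shuffled[len(red):])

# Two-sided empirical p-value: how often chance alone is at least as extreme.
p_value = np.mean(np.abs(simulated) >= abs(observed))
print(observed, p_value)
```

The same pattern applies to any two-group comparison of means in this notebook, as long as the groups are defined before looking at the outcome.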

In [5]:
wine = Table.read_table('data/winequality_redwhite.csv')
wine.show(3)

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6,white


## Data Set #5: Air Quality

[From Kaggle:](https://www.kaggle.com/datasets/tawfikelmetwally/air-quality-dataset?resource=download)

**Content**
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city.
Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses.

Ground truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2) were provided by a co-located certified reference analyzer. Evidence of cross-sensitivities as well as both concept and sensor drift is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities.

**Attribute Information:**

* 0-Date (DD/MM/YYYY)
* 1-Time (HH.MM.SS)
* 2-True hourly averaged concentration CO in mg/m^3 (reference analyzer)
* 3-PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
* 4-True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
* 5-True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
* 6-PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
* 7-True hourly averaged NOx concentration in ppb (reference analyzer)
* 8-PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
* 9-True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
* 10-PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
* 11-PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
* 12-Temperature in °C
* 13-Relative Humidity (%)
* 14-AH Absolute Humidity

### Data Cleaning Issues:
There are missing values (nans). Working with time series data can be tricky using tables; look at Lab 04 for useful functions.
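A small sketch of date handling for this data set, assuming pandas is used for the parsing step. The first two rows are modeled on the preview of this data set (which shows colon-separated times); the July row and its missing value are hypothetical.

```python
import pandas as pd

# Two rows modeled on the preview, plus one hypothetical July row with a gap.
air_df = pd.DataFrame({
    "Date": ["10/03/2004", "10/03/2004", "15/07/2004"],
    "Time": ["18:00:00", "19:00:00", "08:00:00"],
    "NOx(GT)": [166.0, 103.0, None],
})

# Dates are day-first (DD/MM/YYYY), so give the format explicitly.
air_df["timestamp"] = pd.to_datetime(
    air_df["Date"] + " " + air_df["Time"], format="%d/%m/%Y %H:%M:%S"
)
air_df["month"] = air_df["timestamp"].dt.month

# Drop rows where the measurement of interest is missing.
usable = air_df.dropna(subset=["NOx(GT)"])
print(usable[["timestamp", "month", "NOx(GT)"]])
```

Stating the format matters here: without it, a date such as 10/03/2004 could be read as October 3 instead of March 10.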

### Possible Hypotheses to Test:

One could test whether there is a significant difference between nitrogen oxide (NOx) levels in the summer and winter months. Exploring correlations between different contaminants would also be interesting. This is a rich data set, so there are many possibilities.

In [6]:
url = 'data/Air Quality.csv'
air = Table.read_table(url)
air.show(3)

Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,Unnamed: 15,Unnamed: 16
10/03/2004,18:00:00,2.6,1360,150,11.9,1046,166,1056,113,1692,1268,13.6,48.9,0.7578,,
10/03/2004,19:00:00,2.0,1292,112,9.4,955,103,1174,92,1559,972,13.3,47.7,0.7255,,
10/03/2004,20:00:00,2.2,1402,88,9.0,939,131,1140,114,1555,1074,11.9,54.0,0.7502,,


## Data Set #6:  Earthquakes in the East Coast of the US
The East Coast of the US does not have nearly as many earthquakes as California, but as we experienced this semester, they do happen! The earthquake data provided here were extracted for the region shown on the map below for the last century, from 1924 to 2024 (of course, the monitoring of early earthquakes is incomplete).
More information: https://earthquake.usgs.gov/earthquakes/map

<img src="data/earthquake_extraction_region.jpeg">


### Data Cleaning Issues:

There are missing values in many fields. Information such as the state name will have to be extracted from the "place" column.
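One way the state name might be pulled out of "place" is sketched below with pandas string methods, using the three values that appear in the preview of this data set. It assumes the state always follows the last comma, which may not hold for every row.

```python
import pandas as pd

# The three "place" values shown in the preview of this data set.
places = pd.Series([
    "2 km NE of Ossipee, New Hampshire",
    "6 km NNE of Rutland, Ohio",
    "7 km NNW of Paul Smiths, New York",
])

# Assumes the state follows the last comma; other entries need separate handling.
states = places.str.rsplit(",", n=1).str[-1].str.strip()
print(states.tolist())
```

Checking the unique values of the result is a quick way to spot rows that do not follow the pattern.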

### Possible Hypothesis:
One might compare earthquakes in Pennsylvania and New York to see if there is a significant difference in the mean earthquake magnitude by state.

In [7]:
url = 'data/east_coast_earthquakes.csv'
eq = Table.read_table(url)
eq.show(3)

time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,net,id,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
1925-10-09T13:55:00.000Z,43.7,-71.1,,4.0,fa,,,,,ushis,ushis732,2018-06-04T20:43:44.000Z,"2 km NE of Ossipee, New Hampshire",earthquake,,,,,reviewed,ushis,sc
1926-11-05T16:53:00.000Z,39.1,-82.1,,3.8,fa,,,,,ushis,ushis753,2018-06-04T20:43:44.000Z,"6 km NNE of Rutland, Ohio",earthquake,,,,,reviewed,ushis,sc
1928-03-18T15:20:00.000Z,44.5,-74.3,,4.1,ml,,,,,ushis,ushis781,2018-06-04T20:43:44.000Z,"7 km NNW of Paul Smiths, New York",earthquake,,,,,reviewed,ushis,epb


## Data Set #7: Stream Gage Data for the Pennypack Creek in Philadelphia
The US Geological Survey has gages on many US streams that collect data continuously. The Pennypack Creek runs through Philadelphia.
See this website: https://waterdata.usgs.gov/monitoring-location/01467042/#parameterCode=00065&period=P7D&showMedian=true

### Possible relationship to explore: 

Turbidity (sediment in water) and discharge (stream flow rate)

### Data Cleaning Issues:

Working with time series data can be tricky using tables; look at Lab 04 for useful functions.

#### Column Headers in the data set
```
# Data provided for site 01467042
#    TS_ID       Parameter Description
#    121360      00010     Temperature, water, degrees Celsius
#    121357      00060     Discharge, cubic feet per second
#    121358      00065     Gage height, feet
#    121361      00095     Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius
#    121364      00300     Dissolved oxygen, water, unfiltered, milligrams per liter
#    121365      00301     Dissolved oxygen, water, unfiltered, percent of saturation
#    121362      00400     pH, water, unfiltered, field, standard units
#    277154      63680     Turbidity, water, unfiltered, monochrome near infra-red LED light, 780-900 nm, detection angle 90 +-2.5 degrees, formazin nephelometric units (FNU)
#
# Data-value qualification codes included in this output:
#     P  Provisional data subject to revision.
#     <  Actual value is known to be less than reported value.
```
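The coded column names can be mapped to readable ones using the header listing above. The sketch below assumes pandas and uses one row modeled on the preview; the new column names (`water_temp_C`, `discharge_cfs`, and so on) are made up for illustration.

```python
import pandas as pd

# Readable names keyed by "<TS_ID>_<parameter>" as listed in the header above.
# The names on the right are hypothetical choices, not part of the data set.
readable = {
    "121360_00010": "water_temp_C",
    "121357_00060": "discharge_cfs",
    "121358_00065": "gage_height_ft",
    "277154_63680": "turbidity_FNU",
}

# One row modeled on the preview; only a few columns are included here.
pp_df = pd.DataFrame({
    "datetime": ["2023-09-01 00:00"],
    "121360_00010": [21.4],
    "121357_00060": [23.5],
    "121358_00065": [3.21],
    "277154_63680": [0.3],
    "277154_63680_cd": ["P:<"],
})

pp_df = pp_df.rename(columns=readable)

# Flag turbidity readings qualified with "<" (true value below the reported one).
pp_df["turbidity_censored"] = pp_df["277154_63680_cd"].str.contains("<", regex=False)
print(pp_df)
```

Keeping a flag for the "<" qualifier makes it possible to decide later whether those readings should be excluded from a turbidity analysis.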

In [8]:
url = 'data/penny_pack.csv'
pp = Table.read_table(url)
pp.show(3)

agency_cd,site_no,datetime,tz_cd,121360_00010,121360_00010_cd,121357_00060,121357_00060_cd,121358_00065,121358_00065_cd,121361_00095,121361_00095_cd,121364_00300,121364_00300_cd,121365_00301,121365_00301_cd,121362_00400,121362_00400_cd,277154_63680,277154_63680_cd
USGS,1467042,2023-09-01 00:00,EDT,21.4,P,23.5,P,3.21,P,667,P,8.1,P,92,P,7.7,P,0.3,P:<
USGS,1467042,2023-09-01 00:15,EDT,21.3,P,24.7,P,3.22,P,668,P,8.1,P,92,P,7.6,P,0.3,P:<
USGS,1467042,2023-09-01 00:30,EDT,21.2,P,23.5,P,3.21,P,668,P,8.0,P,90,P,7.6,P,0.3,P


## Data Set #8:  Weather Data
Data from a Weather Underground station in the South Kensington neighborhood of Philadelphia
South Kensington - KPAPHILA131: https://www.wunderground.com/dashboard/pws/KPAPHILA131

The data are hourly from December 2019 to November 2021.

### Data Cleaning Issues:

Working with time series data can be tricky using tables; look at Lab 04 for useful functions. Comparing data by month requires parsing the date information.

### Possible Hypotheses
There are many interesting relationships to explore, such as between barometric pressure trends and precipitation, or seasonal differences in rainfall, temperature, etc.

In [9]:
url = 'data/KPAPHILA131_20191217_to_20211119.csv'
weather = Table.read_table(url)
weather.show(3)

stationID,tz,obsTimeUtc,obsTimeLocal,epoch,lat,lon,solarRadiationHigh,uvHigh,winddirAvg,humidityHigh,humidityLow,humidityAvg,qcStatus,metric.tempHigh,metric.tempLow,metric.tempAvg,metric.windspeedHigh,metric.windspeedLow,metric.windspeedAvg,metric.windgustHigh,metric.windgustLow,metric.windgustAvg,metric.dewptHigh,metric.dewptLow,metric.dewptAvg,metric.windchillHigh,metric.windchillLow,metric.windchillAvg,metric.heatindexHigh,metric.heatindexLow,metric.heatindexAvg,metric.pressureMax,metric.pressureMin,metric.pressureTrend,metric.precipRate,metric.precipTotal,dateTimeUtc,obsTimeEST
KPAPHILA131,America/New_York,2019-12-17T05:59:16Z,2019-12-17 00:59:16,1576562356,39.9712,-75.1362,0,0,62,89,88,88.6,1,1.1,1.0,1.0,13.0,0,5.7,20.2,0,8.0,-0.5,-0.8,-0.7,1.1,-2.8,-0.4,1.1,1.0,1.0,1013.55,1012.19,-0.69,5.33,5.33,2019-12-17 05:59:16+00:00,2019-12-17 00:59:16
KPAPHILA131,America/New_York,2019-12-17T06:59:16Z,2019-12-17 01:59:16,1576565956,39.9712,-75.1362,0,0,70,90,89,89.8,1,1.2,1.1,1.2,22.7,0,6.4,28.1,0,9.2,-0.3,-0.5,-0.3,1.2,-4.0,-0.4,1.2,1.1,1.2,1012.19,1010.5,-1.38,4.32,9.65,2019-12-17 06:59:16+00:00,2019-12-17 01:59:16
KPAPHILA131,America/New_York,2019-12-17T07:59:16Z,2019-12-17 02:59:16,1576569556,39.9712,-75.1362,0,0,70,91,90,90.5,1,1.4,1.2,1.3,19.1,0,6.9,28.1,0,9.9,0.1,-0.3,-0.1,1.4,-3.3,-0.4,1.4,1.2,1.3,1011.18,1009.48,-1.38,1.78,11.43,2019-12-17 07:59:16+00:00,2019-12-17 02:59:16


## Data Set #9:  Fetal Health 
Kaggle Dataset: https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification

Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress.
The UN expects that by 2030, countries end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.

Parallel to notion of child mortality is of course maternal mortality, which accounts for 295 000 deaths during and following pregnancy and childbirth (as of 2017). The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.

In light of what was mentioned above, Cardiotocograms (CTGs) are a simple and cost accessible option to assess fetal health, allowing healthcare professionals to take action in order to prevent child and maternal mortality. The equipment itself works by sending ultrasound pulses and reading its response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions and more.

**Data**

This dataset contains 2126 records of features extracted from Cardiotocogram exams, which were then classified by three expert obstetricians into 3 classes:

fetal_health
* Normal (1)
* Suspect (2)
* Pathological (3)



### Data Cleaning Issues:

The data appear to be clean. Need to research the features obtained from Cardiotocograms.

### Possible Hypotheses:
Apart from exploring correlations, this dataset would be an excellent one for trying k-means clustering to predict fetal health.

In [7]:
filename = "data/fetal_health.csv"
fetal = Table.read_table(filename)
fetal.show(3)

baseline value,accelerations,fetal_movement,uterine_contractions,light_decelerations,severe_decelerations,prolongued_decelerations,abnormal_short_term_variability,mean_value_of_short_term_variability,percentage_of_time_with_abnormal_long_term_variability,mean_value_of_long_term_variability,histogram_width,histogram_min,histogram_max,histogram_number_of_peaks,histogram_number_of_zeroes,histogram_mode,histogram_mean,histogram_median,histogram_variance,histogram_tendency,fetal_health
120,0.0,0,0.0,0.0,0,0,73,0.5,43,2.4,64,62,126,2,0,120,137,121,73,1,2
132,0.006,0,0.006,0.003,0,0,17,2.1,0,10.4,130,68,198,6,1,141,136,140,12,0,1
133,0.003,0,0.008,0.003,0,0,16,2.1,0,13.4,130,68,198,5,1,141,135,138,13,0,1


In [11]:
# Population by ZIP code might also be useful when looking at Philly vaccination rates
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/PA_zip_pop.csv"
paPop = Table.read_table(url)
# sort returns a new table, so keep the result
paPop = paPop.sort("pop", descending=True)
paPop.where('county','Philadelphia').show(3)

zip,city,county,pop
19120,Philadelphia,Philadelphia,74060
19124,Philadelphia,Philadelphia,70304
19111,Philadelphia,Philadelphia,68113


## Data Set #10: Diabetes Prediction
This data set is from Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

**Description:**

"The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes."

### Data Cleaning Issues:

While there are no missing values, the gender and smoking history columns need to be converted to numbers for modeling.
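A possible encoding is sketched below with pandas, using rows modeled on the preview of this data set. The column name `is_female` and the `smoke_` prefix are illustrative choices, not part of the data.

```python
import pandas as pd

# Rows modeled on the preview of this data set.
diabetes_df = pd.DataFrame({
    "gender": ["Female", "Female", "Male"],
    "smoking_history": ["never", "No Info", "never"],
})

# Binary indicator for gender; the "Other" category would need its own decision.
diabetes_df["is_female"] = (diabetes_df["gender"] == "Female").astype(int)

# One 0/1 column per smoking category, so no artificial ordering is implied.
dummies = pd.get_dummies(diabetes_df["smoking_history"], prefix="smoke", dtype=int)
encoded = pd.concat([diabetes_df, dummies], axis=1)
print(encoded)
```

Indicator columns are used for smoking history because mapping the categories to 0, 1, 2, ... would suggest an order and spacing that the labels do not have.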

### Possible Hypotheses:
This data set is a good candidate for k-means clustering to predict whether a patient has diabetes. One could also explore correlations between fields, look at differences by gender, smoking history, etc.

In [3]:
url = 'data/diabetes_prediction_dataset.csv'
diabetes = Table.read_table(url)
diabetes.show(3)

gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80,0,1,never,25.19,6.6,140,0
Female,54,0,0,No Info,27.32,6.6,80,0
Male,28,0,0,never,27.32,5.7,158,0


In [4]:
diabetes.stats()

statistic,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
min,Female,0.08,0,0,No Info,10.01,3.5,80.0,0
max,Other,80.0,1,1,not current,95.69,9.0,300.0,1
median,,43.0,0,0,,27.32,5.8,140.0,0
sum,,4188590.0,7485,3942,,2732080.0,552751.0,13805800.0,8500


In [5]:
np.unique(diabetes['smoking_history'])

array(['No Info', 'current', 'ever', 'former', 'never', 'not current'],
      dtype='<U11')

## Data Set #11: Philadelphia Open Data School Graduation Rates
This longitudinal open data file includes information about the graduation rates for schools broken out by: graduation rate type (four-year, five-year, or six-year), demographic category (EL status, IEP status, Economically Disadvantaged Status, Gender, or Ethnicity), and ninth-grade cohort. Students are attributed to the last school at which they actively attended in the respective graduation window, which ends on September 30 each year. Students are classified as EL, as having an IEP, and/or economically disadvantaged if they were designated as such at any point during their high school career.
see: https://www.philasd.org/performance/programsservices/open-data/school-performance/#school_graduation_rates 
see also: https://www.philasd.org/research/wp-content/uploads/sites/90/2020/05/graduation-rate-definitions-and-trends-may-2020.pdf

### Data Cleaning Issue:
Some of the fields have mixed numerical and text data (e.g., num, score), with the code "s" where a score was not calculated.
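One option is to declare "s" as a missing-value marker when the file is read, sketched here with `pandas.read_csv` on a made-up two-row excerpt. `Table.read_table` is already given a pandas keyword (`low_memory=False`) elsewhere in this notebook, so the same `na_values` option may work there too, but that is an assumption worth checking.

```python
import io
import pandas as pd

# Hypothetical excerpt; the second row uses "s" where no score was calculated.
csv_text = """schoolname,denom,num,score
School A,281,203,72.24
School B,s,s,s
"""

# Declaring "s" as a missing-value marker lets the columns load as numbers.
grad_df = pd.read_csv(io.StringIO(csv_text), na_values=["s"])
print(grad_df.dtypes)
print(grad_df)
```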

### Possible Hypotheses
Many possibilities. One could look at whether there is a statistically significant difference in scores between two schools, investigate trends over time, or look at different groups and subgroups. Keep in mind that this is a limited data set covering a socially sensitive topic, so do not draw overly broad conclusions.

In [13]:
url = "https://cdn.philasd.org/offices/performance/Open_Data/School_Performance/Graduation_Rates/SDP_Graduation_Rates_School_S_2022-05-23.csv"
grad = Table.read_table(url)
grad.show(3)

cohort,schoolid_ulcs,schoolname,sector,rate_type,group,subgroup,denom,num,score
2010-2011,1010,John Bartram High School,District,4-Year Graduation Rate,All Students,All Students,281,203,72.24
2010-2011,1010,John Bartram High School,District,4-Year Graduation Rate,Economically Disadvantaged,Economically Disadvantaged,211,153,72.51
2010-2011,1010,John Bartram High School,District,4-Year Graduation Rate,Economically Disadvantaged,Not Economically Disadvantaged,70,50,71.43


## Data Set #12: Jeopardy
see: https://www.jeopardy.com

### Data Cleaning Issues:
There are multiple tables to join. Some fields have missing values.

### Possible Hypothesis to Test:
Do returning champions score better? Does seating position matter? There are many imaginative possibilities to investigate.

In [14]:
contestant = "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/contestants.csv"
locations =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/locations.csv"
results =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/final_results.csv"
loc = Table.read_table(locations)
contest = Table.read_table(contestant)
outcome = Table.read_table(results)
outcome.show(3)

game_id,season,position,dj_score,wager,correct,coryat_score
5635,33,returning_champion,10000,3001,1,11000
5635,33,middle,12000,12000,0,12000
5635,33,right,13000,11001,0,15200


## Data Set #13: Crime Data for Philadelphia
The data came from OpenDataPhilly: https://opendataphilly.org/datasets/crime-incidents
Reported incidents cover the full year of 2023.

### Data Cleaning Issues:
To extract months, you would need to parse the date data. Working with time series data can be tricky using tables; look at Lab 04 for useful functions.

### Possible Hypotheses:
Possible correlation: type of crime and time of day. One could look at whether a particular type of crime occurs more frequently at a particular time of day, or whether the number of crimes is significantly different in different months.
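Both ideas start from simple counts. The sketch below, assuming pandas, builds a count per month and a crime-type by hour table; three rows are modeled on the preview of this data set and the April row is hypothetical.

```python
import pandas as pd

# Rows modeled on the preview, plus one hypothetical April record.
crime_df = pd.DataFrame({
    "dispatch_date": ["2023-03-11", "2023-03-11", "2023-03-11", "2023-04-02"],
    "hour": [12, 13, 17, 13],
    "text_general_code": [
        "Robbery No Firearm", "Theft from Vehicle", "Thefts", "Thefts",
    ],
})

# Parse the date string and pull out the month number.
crime_df["month"] = pd.to_datetime(crime_df["dispatch_date"]).dt.month

# Incidents per month, and a crime-type by hour-of-day table of counts.
per_month = crime_df.groupby("month").size()
type_by_hour = pd.crosstab(crime_df["text_general_code"], crime_df["hour"])
print(per_month)
print(type_by_hour)
```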

In [15]:
url = 'data/Philly_crime_2023.csv'
crime = Table.read_table(url)
crime.show(3)

the_geom,cartodb_id,the_geom_webmercator,objectid,dc_dist,psa,dispatch_date_time,dispatch_date,dispatch_time,hour,dc_key,location_block,ucr_general,text_general_code,point_x,point_y,lat,lng
0101000020E6100000A51C8299A5C752C006342AD3DCFF4340,2,0101000020110F0000F80DE2A145E65FC1E5EC7592BE8F5241,114,25,3,2023-03-11 17:12:00+00,2023-03-11,12:12:00,12,202325000000.0,3300 BLOCK HARTVILLE ST,300,Robbery No Firearm,-75.1195,39.9989,39.9989,-75.1195
0101000020E6100000F9245E3B64CC52C0B7195D940FF64340,4,0101000020110F00000426B7CE54EE5FC1C5E06D37E2845241,116,1,1,2023-03-11 18:31:00+00,2023-03-11,13:31:00,13,202301000000.0,2400 BLOCK S 28TH ST,600,Theft from Vehicle,-75.1936,39.9224,39.9224,-75.1936
0101000020E6100000118A52E7F6C052C0CFF41263190C4440,7,0101000020110F00006728CED7EBDA5FC169DB64F8519D5241,119,8,2,2023-03-11 22:13:00+00,2023-03-11,17:13:00,17,202308000000.0,9800 BLOCK Roosevelt Blvd,600,Thefts,-75.0151,40.0945,40.0945,-75.0151


## Data Set #14: Global Sustainable Energy Production
Data set taken from Kaggle:

"Uncover this dataset showcasing sustainable energy indicators and other useful factors across all countries from 2000 to 2020. Dive into vital aspects such as electricity access, renewable energy, carbon emissions, energy intensity, Financial flows, and economic growth. Compare nations, track progress towards Sustainable Development Goal 7, and gain profound insights into global energy consumption patterns over time."

Metadata: https://www.kaggle.com/datasets/anshtanwar/global-data-on-sustainable-energy

### Data Cleaning Issues:
Field names are overly long. Some fields have missing values.

### Possible Hypotheses:
Can investigate trends over time, differences in means between countries, correlation between fields -- many possibilities! For example, one could look for differences between the energy habits of richer (gdp_per_capita) and poorer nations.

In [16]:
filename = "data/global-data-on-sustainable-energy.csv"
energy = Table.read_table(filename)
energy.show(3)

Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),Low-carbon electricity (% electricity),Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
Afghanistan,2000,1.61359,6.2,9.22,20000.0,44.99,0.16,0,0.31,65.9574,302.595,1.64,760,,,,60,652230,33.9391,67.71
Afghanistan,2001,4.07457,7.2,8.86,130000.0,45.6,0.09,0,0.5,84.7458,236.892,1.74,730,,,,60,652230,33.9391,67.71
Afghanistan,2002,9.40916,8.2,8.47,3950000.0,37.83,0.13,0,0.56,81.1594,210.862,1.4,1030,,,179.427,60,652230,33.9391,67.71


## Data Set #15: Motor Vehicle Crash Data for Staten Island in 2023

This data set came from Data.gov. The accident data file for New York City is huge, so it has been trimmed to just Staten Island in 2023.

https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

"The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf). It should be noted that the data is preliminary and subject to change when the MV-104AN forms are amended based on revised crash details. For the most accurate, up to date statistics on traffic fatalities, please refer to the NYPD Motor Vehicle Collisions page (updated weekly) or Vision Zero View (updated monthly)."

### Data Cleaning Issues

Some fields have missing values. May need to parse dates.

### Possible Hypotheses
Such a large data set opens up many possibilities. Compare the percentage of accidents resulting in fatalities by vehicle type? How about in two-vehicle accidents? Are certain months statistically more likely to have accidents? Certain ZIP codes (you may need to look for population data to convert to per capita)?


In [17]:
filename = 'data/StatenIsland_crash_data_2023.csv'
crash = Table.read_table(filename)
crash.show(3)

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
24,12/13/2021,17:40,STATEN ISLAND,10301,40.6317,-74.0876,"(40.63165, -74.08762)",VICTORY BOULEVARD,WOODSTOCK AVENUE,,1,0,0,0,0,0,1,0,Unspecified,Unspecified,,,,4487001,Sedan,Sedan,,,
83,12/08/2021,22:37,STATEN ISLAND,10314,40.6212,-74.1239,"(40.62121, -74.12385)",,,288 MANOR ROAD,0,0,0,0,0,0,0,0,Pavement Slippery,,,,,4484906,Sedan,,,,
94,03/26/2022,14:00,STATEN ISLAND,10301,40.6378,-74.0819,"(40.637833, -74.08193)",CORSON AVENUE,WESTERVELT AVENUE,,0,0,0,0,0,0,0,0,Driver Inattention/Distraction,Unspecified,,,,4513697,Sedan,Sedan,,,
