# Group Project Datasets Fall 2025 

|Helpful Links|
|---|
|[Elements of Data Science Primer](https://laserchemist.github.io/dprimer/intro.html)|
|[Inferential Thinking](https://inferentialthinking.com/chapters/intro.html)|
|[Data Science Toolkit examples](https://temple.2i2c.cloud/hub/user-redirect/lab/tree/datascience/Fall%202024/DataScience_Toolkit.ipynb)|

### Your group has been assigned one of the following data sets.
This notebook contains:
* The code to load each of the data sets
* References to the source and possible metadata
* Data cleaning issues to consider
* One or two ideas for relationships to explore, but do not feel constrained -- explore the data and use your imagination to find other possibilities!

## Dataset Areas
<ul>
      <li><a href="#Bioinformatics">Bioinformatics</a></li>
      <li><a href="#Chemistry">Chemistry</a></li>
      <li><a href="#Environmental Science">Environmental Science</a></li>
      <li><a href="#Ecology">Ecology</a></li>
      <li><a href="#General">General</a></li>
      <li><a href="#Medical">Public Health and Medical</a></li>
</ul>

In [None]:
# Import Numpy and Datascience modules.
import numpy as np
import pandas as pd
from datascience import *
import matplotlib.pyplot as plt
%matplotlib inline

Useful auxiliary data (can use .join() to merge with your data where appropriate)

**US population by zip code (source census.gov)**

In [None]:
zippop = Table().read_table('data/census_pop_byzip.csv',dtype=str)

In [None]:
zippop.where('zip code','19122')

**World population by country and year (source: UN)**

In [None]:
worldpop = Table().read_table('data/WPP2024.csv', low_memory=False)

In [None]:
worldpop.where('Country','France').show(3)

## **Ecological Datasets** <a id="Ecology"></a>

## Ecological Footprint (DS1)
This dataset measures the amount of ecological resources used by each country from 1961 to 2016. More information can be found at: https://data.world/footprint/nfa-2019-edition

This data set appears to be clean, but there is a lack of metadata:

* No units are provided. I believe areas are in hectares, and carbon is in metric tons.
* QScore is explained here: https://www.footprintnetwork.org/data-quality-scores/
* The "total" column is not explained, but I think it is the total area (ha).

### Data Cleaning Issues:
* There are missing values in some of the columns.
* The "country" field includes "World" as a country, which could confound statistics.
* "forest_land" has both numbers and numbers in quotes that read in as strings.
* There is no "total land" column to put areas in perspective.
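
A minimal sketch of one way to handle the mixed `forest_land` values, assuming the quoted entries are ordinary numerals wrapped in extra quote characters and that missing entries show up as empty strings or `nan`. The sample values are made up for illustration, and `to_float` is a hypothetical helper, not part of the `datascience` library.

```python
import math

def to_float(value):
    """Convert a number, or a numeral wrapped in quotes, to a float.

    Returns NaN for missing or unparseable entries (an assumption about
    how you want to treat them).
    """
    if isinstance(value, (int, float)):
        return float(value)
    # Remove surrounding whitespace and any stray quote characters.
    text = str(value).strip().strip('"\'')
    try:
        return float(text)
    except ValueError:
        return math.nan

# Made-up sample mimicking a column that mixes numbers and quoted numbers.
sample = [1234.5, '"678.9"', '42', '', 'nan']
cleaned = [to_float(v) for v in sample]
```

With a Table, something like `ecoFootprint.apply(to_float, 'forest_land')` should give the converted array, which can then be stored with `.with_column()`.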

### Possible Hypothesis to Test:
The changes in land use over time could be interesting to investigate. China is a fascinating example where huge policy shifts drove change. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5036680/

In [None]:
url = 'data/NFA 2019 public_data.csv'
ecoFootprint = Table.read_table(url, low_memory=False)
ecoFootprint.show(3)

## **Public Health and Medical** <a id="Medical"></a>

## Heart Disease (DS2)
Data from Kaggle; see [https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)

|variable|
|---|
|age|
|sex|
|chest pain type (4 values)|
|resting blood pressure|
|serum cholesterol in mg/dl|
|fasting blood sugar > 120 mg/dl|
|resting electrocardiographic results (values 0, 1, 2)|
|maximum heart rate achieved|
|exercise induced angina|
|oldpeak = ST depression induced by exercise relative to rest|
|the slope of the peak exercise ST segment|
|number of major vessels (0-3) colored by fluoroscopy|
|thal: 0 = normal; 1 = fixed defect; 2 = reversible defect|

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

target (0 = no heart disease and 1 = heart disease)


### Possible Hypothesis to Test:
This dataset is structured for machine learning to identify patients with or without heart disease. It is a good candidate for k-nearest neighbor classification. See the EDS primer for succinct details on how to apply this algorithm, from Lab 10: [k-Nearest Neighbors Classification](https://laserchemist.github.io/dprimer/k-Nearest_Neighbors_classification.html#). It would also be interesting to compare the means of various columns for patients with (target = 1) and without (target = 0) heart disease. Correlations are likely to exist as well. The data set is also a candidate for k-means clustering.
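
As a rough illustration of the k-nearest neighbor idea (not the primer's implementation, which may differ), the sketch below classifies a new point by majority vote among its `k` closest training points. The feature values are made up and assumed to be already standardized; `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Pair each training label with its Euclidean distance to the new point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already-standardized feature pairs (not real patient data).
train_points = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7),
                (1.0, 0.9), (1.1, 1.2), (0.8, 1.0)]
train_labels = [0, 0, 0, 1, 1, 1]   # 0 = no heart disease, 1 = heart disease

prediction = knn_predict(train_points, train_labels, (0.9, 1.1), k=3)
```

Because the method relies on distances, columns measured on very different scales (for example age and cholesterol) would usually be standardized first.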

In [None]:
url = 'data/heart.csv'
heart = Table.read_table(url)
heart.show(3)

## Fetal Health (DS9)
Kaggle Dataset: https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification

Reduction of child mortality is reflected in several of the United Nations' Sustainable Development Goals and is a key indicator of human progress.
The UN expects that by 2030, countries end preventable deaths of newborns and children under 5 years of age, with all countries aiming to reduce under‑5 mortality to at least as low as 25 per 1,000 live births.

Parallel to the notion of child mortality is, of course, maternal mortality, which accounts for 295,000 deaths during and following pregnancy and childbirth (as of 2017). The vast majority of these deaths (94%) occurred in low-resource settings, and most could have been prevented.

In light of the above, Cardiotocograms (CTGs) are a simple and low-cost option for assessing fetal health, allowing healthcare professionals to take action to prevent child and maternal mortality. The equipment works by sending ultrasound pulses and reading the response, thus shedding light on fetal heart rate (FHR), fetal movements, uterine contractions, and more.

**Data**

This dataset contains 2126 records of features extracted from Cardiotocogram exams, which were then classified by three expert obstetricians into 3 classes:

fetal_health
* Normal (1)
* Suspect (2)
* Pathological (3)



### Data Cleaning Issues:

The data appear to be clean. You will need to research the features obtained from Cardiotocograms.

### Possible Hypotheses:
Apart from exploring correlations, this dataset would be an excellent one for trying k-means clustering to predict fetal health.

In [None]:
filename = "data/fetal_health.csv"
fetal = Table().read_table(filename)
fetal.show(3)

## Diabetes Prediction (DS10)
This data set is from Kaggle. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

**Description:**

"The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. This dataset can be used to build machine learning models to predict diabetes in patients based on their medical history and demographic information. This can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans. Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes."

### Data Cleaning Issues:

While there are no missing values, the gender and smoking history columns need to be converted to numbers before modeling.
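
One simple way to turn a text column into numbers is to give each distinct category an integer code. The sketch below is only an illustration under that assumption: the sample values are made up, and `encode_categories` is a hypothetical helper.

```python
def encode_categories(values):
    """Map each distinct category to an integer code (in sorted order)."""
    codes = {category: code for code, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

# Made-up sample values; check the real categories with np.unique first.
gender = ['Female', 'Male', 'Female', 'Other']
gender_codes, gender_map = encode_categories(gender)
```

Keep in mind that integer codes imply an ordering and spacing between categories. That may not be meaningful for a column like smoking history, so interpret distance-based results with care.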

### Possible Hypotheses:
This data set is a good candidate for k-means clustering to predict whether a patient has diabetes. One could also explore correlations between fields, look at differences by gender, smoking history, etc.

In [None]:
url = 'data/diabetes_prediction_dataset.csv'
diabetes = Table.read_table(url)
diabetes.show(3)

In [None]:
diabetes.stats()

In [None]:
np.unique(diabetes['smoking_history'])

__________________________

## **Chemistry** <a id="Chemistry"></a>

## Periodic Table (DS3)

<img src="data/xkcd_periodic_table.png" width="600">

More information on columns: https://www.kaggle.com/datasets/berkayalan/chemical-periodic-table-elements?select=chemical_elements.csv  Of course, there are numerous references that discuss element groupings.
<br>An auxiliary data set is available to complement the data in the original file.

### Data Cleaning Issues:
* The Discovery(Year) column includes "ancient" as a year.
* Some columns (e.g., Boiling point) load as strings because there are commas at the thousands place, and would need to be converted to numbers.
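
Both issues above can be handled with a small conversion function. This is a sketch that assumes commas appear only as thousands separators (as noted above) and that non-numeric entries such as "ancient" should become NaN. The sample strings are made up, and `parse_number` is a hypothetical helper.

```python
import math

def parse_number(text):
    """Parse strings like '1,234.5'; return NaN for non-numeric entries."""
    try:
        return float(str(text).replace(',', ''))
    except ValueError:
        return math.nan

# Made-up sample strings in the style described above.
boiling = [parse_number(v) for v in ['3,287', '100', '-252.9']]
years = [parse_number(v) for v in ['1766', 'ancient']]
```

Alternatively, `pd.read_csv` accepts a `thousands=','` argument that may take care of the separators when the file is loaded.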

### Possible Hypothesis:
Does boiling point correlate with atomic weight?

In [None]:
url = 'data/chemical_elements.csv'
auxdata = 'data/Periodic Table of Elements.csv'
ptdf = pd.read_csv(url, sep = ';')
pt = Table.from_df(ptdf)
pt.show(3)

In [None]:
pt_aux = Table.read_table(auxdata)
pt_aux.show(3)

## **Environmental Science** <a id="Environmental Science"></a>

## Air Quality (DS5)

[From Kaggle:](https://www.kaggle.com/datasets/tawfikelmetwally/air-quality-dataset?resource=download)

**Content**
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located in the field in a significantly polluted area, at road level, within an Italian city.
Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses.

Ground truth hourly averaged concentrations for CO, Non-Methane Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required), eventually affecting the sensors' concentration estimation capabilities.

**Attribute Information:**

* 0-Date (DD/MM/YYYY)
* 1-Time (HH.MM.SS)
* 2-True hourly averaged concentration CO in mg/m^3 (reference analyzer)
* 3-PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
* 4-True hourly averaged overall Non-Methane Hydrocarbons concentration in microg/m^3 (reference analyzer)
* 5-True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
* 6-PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
* 7-True hourly averaged NOx concentration in ppb (reference analyzer)
* 8-PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
* 9-True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
* 10-PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
* 11-PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
* 12-Temperature in °C
* 13-Relative Humidity (%)
* 14-AH Absolute Humidity

### Data Cleaning Issues:
There are missing values (nans). Working with time series data can be tricky using tables; look at Lab 04 for useful functions.
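
For seasonal or monthly comparisons, the date and time strings first need to be parsed. A minimal sketch, assuming the formats listed in the attribute information (DD/MM/YYYY and HH.MM.SS); the file itself may differ, so check a few rows first. `parse_timestamp` and `season` are hypothetical helpers, and the season boundaries are one possible choice.

```python
from datetime import datetime

def parse_timestamp(date_text, time_text):
    """Combine a DD/MM/YYYY date and an HH.MM.SS time into a datetime."""
    return datetime.strptime(f"{date_text} {time_text}", "%d/%m/%Y %H.%M.%S")

def season(moment):
    """Label a datetime as 'summer' (Jun-Aug), 'winter' (Dec-Feb), or 'other'."""
    if moment.month in (6, 7, 8):
        return 'summer'
    if moment.month in (12, 1, 2):
        return 'winter'
    return 'other'

# Made-up example values in the assumed formats.
stamp = parse_timestamp('10/03/2004', '18.00.00')
label = season(stamp)
```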

### Possible Hypotheses to Test:

One could test whether there is a significant difference between nitrogen oxide (NOx) levels in the summer vs. winter months. Exploring correlations between different contaminants would also be interesting. This is a rich data set, so there are many possibilities.

In [None]:
url = 'data/Air Quality.csv'
air = Table.read_table(url)
air.show(3)

## Global Sustainable Energy Production (DS14)
Data set taken from Kaggle:

"Uncover this dataset showcasing sustainable energy indicators and other useful factors across all countries from 2000 to 2020. Dive into vital aspects such as electricity access, renewable energy, carbon emissions, energy intensity, Financial flows, and economic growth. Compare nations, track progress towards Sustainable Development Goal 7, and gain profound insights into global energy consumption patterns over time."

Metadata: https://www.kaggle.com/datasets/anshtanwar/global-data-on-sustainable-energy

### Data Cleaning Issues:
Field names are overly long. Some fields have missing values.

### Possible Hypotheses:
You can investigate trends over time, differences in means between countries, correlation between fields -- many possibilities! For example, one could look for differences between the energy habits of richer (gdp_per_capita) and poorer nations.

In [None]:
filename = "data/global-data-on-sustainable-energy.csv"
energy = Table.read_table(filename)
energy.show(3)

## Earthquakes on the East Coast of the US (DS6)
The East Coast of the US does not have nearly as many earthquakes as California, but as we experienced this semester they do happen! The earthquake data provided here were extracted for the region shown on the map below for the last century, from 1924 to 2024 (of course, the monitoring of early earthquakes is incomplete).
More information: https://earthquake.usgs.gov/earthquakes/map

<img src="data/earthquake_extraction_region.jpeg">


### Data Cleaning Issues:

There are missing values in many fields. Information such as the state name will have to be extracted from the "place" column.
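
A sketch of one way to pull the state out of a place description, assuming entries look like "<distance> of <town>, <state>" so that the state is the text after the last comma. The example strings are made up, and `extract_state` is a hypothetical helper.

```python
def extract_state(place):
    """Return the text after the last comma in a place description, or None."""
    if not isinstance(place, str) or ',' not in place:
        return None
    return place.rsplit(',', 1)[1].strip()

# Made-up examples in the assumed style, plus entries without a state.
places = ['3 km N of Exampletown, Pennsylvania',
          '10 km SE of Sampleville, New York',
          'Atlantic Ocean',
          None]
states = [extract_state(p) for p in places]
```

It is worth checking the distinct results afterward, since some records may spell the state differently (for example as an abbreviation) or not include one at all.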

### Possible Hypothesis:
One might compare earthquakes in Pennsylvania and New York to see if there is a significant difference in the mean earthquake magnitude by state.

In [None]:
url = 'data/east_coast_earthquakes.csv'
eq = Table.read_table(url)
eq.show(3)

## Stream Gauge Data for the Pennypack Creek in Philadelphia (DS7)
The US Geological Survey has gauges on many US streams that collect data continuously. The Pennypack Creek runs through Philadelphia.
See this website: https://waterdata.usgs.gov/monitoring-location/01467042/#parameterCode=00065&period=P7D&showMedian=true

### Possible relationship to explore: 

Turbidity (sediment in water) and Discharge (stream flow rate)
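
One way to quantify such a relationship is the Pearson correlation coefficient. The sketch below uses `np.corrcoef` and drops pairs where either value is missing. The numbers are made up (and deliberately perfectly linear, which real data will not be), and `correlation` is a hypothetical helper.

```python
import numpy as np

def correlation(x, y):
    """Pearson correlation coefficient, ignoring pairs with a missing value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))   # True where both values are present
    return float(np.corrcoef(x[keep], y[keep])[0, 1])

# Made-up values; the third pair is dropped because discharge is missing.
discharge = [10.0, 20.0, np.nan, 40.0, 80.0]
turbidity = [2.0, 4.0, 5.0, 8.0, 16.0]
r = correlation(discharge, turbidity)
```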

### Data Cleaning Issues:

Working with time series data can be tricky using tables; look at Lab 04 for useful functions.

#### Column Headers in the data set
```
# Data provided for site 01467042
#    TS_ID       Parameter Description
#    121360      00010     Temperature, water, degrees Celsius
#    121357      00060     Discharge, cubic feet per second
#    121358      00065     Gauge height, feet
#    121361      00095     Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius
#    121364      00300     Dissolved oxygen, water, unfiltered, milligrams per liter
#    121365      00301     Dissolved oxygen, water, unfiltered, percent of saturation
#    121362      00400     pH, water, unfiltered, field, standard units
#    277154      63680     Turbidity, water, unfiltered, monochrome near infra-red LED light, 780-900 nm, detection angle 90 +-2.5 degrees, formazin nephelometric units (FNU)
#
# Data value qualification codes included in this output:
#     P  Provisional data subject to revision.
#     <  Actual value is known to be less than reported value.
```

In [None]:
url = 'data/penny_pack.csv'
pp = Table.read_table(url)
pp.show(3)

## Weather Data (DS8)
Data from a Weather Underground station in the South Kensington neighborhood of Philadelphia.<br>
South Kensington - KPAPHILA131: https://www.wunderground.com/dashboard/pws/KPAPHILA131

The data are hourly from December 2019 to January 2021.

### Data Cleaning Issues:

Working with time series data can be tricky using tables; look at Lab 04 for useful functions. To compare data by month requires parsing the date information.

### Possible Hypotheses
There are many interesting relationships to explore, such as between barometric pressure trends and precipitation, or seasonal differences in rainfall, temperature, etc.

In [None]:
url = 'data/KPAPHILA131_20191217_to_20211119.csv'
weather = Table.read_table(url)
weather.show(3)

In [None]:
# Population by zip code may also be useful, e.g., for looking at Philly vaccination rates
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/PA_zip_pop.csv"
paPop = Table.read_table(url)
paPop = paPop.sort("pop", descending=True)  # sort returns a new Table, so keep the result
paPop.where('county','Philadelphia').show(3)

## **General Datasets**  <a id="General"></a>

## Philadelphia Open Data School Graduation Rates (DS11)
This longitudinal open data file includes information about the graduation rates for schools broken out by: graduation rate type (four-year, five-year, or six-year), demographic category (EL status, IEP status, Economically Disadvantaged Status, Gender, or Ethnicity), and ninth-grade cohort. Students are attributed to the last school at which they actively attended in the respective graduation window, which ends on September 30 each year. Students are classified as EL, as having an IEP, and/or economically disadvantaged if they were designated as such at any point during their high school career.
see: https://www.philasd.org/performance/programsservices/open-data/school-performance/#school_graduation_rates 
see also: https://www.philasd.org/research/wp-content/uploads/sites/90/2020/05/graduation-rate-definitions-and-trends-may-2020.pdf

The 'group' column can be used with the `.where()` Table method to select a particular comparison such as Economically Disadvantaged which then has two subgroups for comparison.

### Data Cleaning Issue:
Some of the fields have mixed numerical and text data (e.g., num, score), with the code "s" where a score was not calculated.
<br>A version of the data set with the rows containing "s" removed and the score stored as a floating point number is available for analysis:<br>
```python
fscore_grad = 'data/Philly_grad_fscore.csv'
```
### Possible Hypotheses
Many possibilities. One could look at whether there is a statistically significant difference in scores between two schools, investigate trends over time, or look at different groups and subgroups. Keep in mind that this is a limited data set covering a socially sensitive topic, so do not draw overly broad conclusions.

In [None]:
url = "https://cdn.philasd.org/offices/performance/Open_Data/School_Performance/Graduation_Rates/SDP_Graduation_Rates_School_S_2022-05-23.csv"
grad = Table.read_table(url)
grad.show(3)

In [None]:
fscore_grad = 'data/Philly_grad_fscore.csv'
fgrad = Table.read_table(fscore_grad)
fgrad.show(3)

## Jeopardy (DS12)
see: https://www.jeopardy.com<br>
data source: https://anuparna.github.io/jeopardy/<br>
<br>In the outcome Table, `dj_score` is each contestant's score before Final Jeopardy, in which wagers are made. The final score, which determines the winner, is `dj_score` plus the wager if the contestant answers correctly, or minus the wager if not. It may help to compute the final score as an array and add it as a new column to find the winner of each game:
```python
final_score = outcome['dj_score'] + outcome['wager']*outcome['correct'] - outcome['wager']*(outcome['correct']==0)
```
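
To see what the expression above does, here is a small check with made-up numbers for three contestants, assuming `correct` is coded as 1 for a correct response and 0 for an incorrect one (verify this coding in the data).

```python
import numpy as np

# Made-up values for three contestants in one game.
dj_score = np.array([12000, 8000, 15000])
wager = np.array([5000, 8000, 1000])
correct = np.array([1, 0, 1])   # assumed coding: 1 = correct, 0 = incorrect

# Add the wager for a correct response, subtract it for an incorrect one.
final_score = dj_score + wager * correct - wager * (correct == 0)
winner_index = int(np.argmax(final_score))
```

The second contestant wagered everything and missed, so that score drops to zero, while the first contestant overtakes the leader by wagering more and answering correctly.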
### Data Cleaning Issues:
Multiple tables to join. Some fields have missing values.

### Possible Hypothesis to Test:
Do returning champions score better? Does seating position matter? There are many imaginative possibilities to investigate.

In [None]:
contestant = "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/contestants.csv"
locations =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/locations.csv"
results =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/final_results.csv"
loc = Table.read_table(locations)
contest = Table.read_table(contestant)
outcome = Table.read_table(results)
outcome.show(3)

In [None]:
contest.show(3)

## Crime Data for Philadelphia (DS13)
The data came from OpenDataPhilly: https://opendataphilly.org/datasets/crime-incidents
Reported incidents cover the full year of 2023.

### Data Cleaning Issues:
Working with time series data can be tricky using tables; look at Lab 04 for useful functions. To compare data by month, you would need to parse the date information.

### Possible Hypotheses:
Possible correlation: type of crime and time of day. One could look at whether a particular type of crime occurs more frequently at a particular time of day, or whether the number of crimes is significantly different in different months.

In [None]:
url = 'data/Philly_crime_2023.csv'
crime = Table.read_table(url)
crime.show(3)

## Motor Vehicle Crash Data for Staten Island in 2023 (DS15)

This data set came from Data.gov. The accident data file for New York City is huge, so it has been trimmed to just Staten Island in 2023.

https://catalog.data.gov/dataset/motor-vehicle-collisions-crashes

"The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf). It should be noted that the data is preliminary and subject to change when the MV-104AN forms are amended based on revised crash details.For the most accurate, up to date statistics on traffic fatalities, please refer to the NYPD Motor Vehicle Collisions page (updated weekly) or Vision Zero View (updated monthly)."

### Data Cleaning Issues

Some fields have missing values. May need to parse dates.

### Possible Hypotheses
Such a large data set opens up many possibilities. Compare the percentage of accidents resulting in fatalities by vehicle type? How about in two-vehicle accidents? Are certain months statistically more likely to have accidents? Certain zip codes (you may need to look for population data to convert to per capita)?
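
A per capita comparison might be sketched as follows: count crashes per zip code, then divide by that zip code's population. All values here are made up for illustration, and the rate is expressed per 10,000 residents as one possible convention.

```python
from collections import Counter

# Made-up crash records and populations, for illustration only.
# Zip codes are kept as strings so that leading zeros are not lost.
crash_zips = ['10301', '10301', '10314', '10301', '10314', '10306']
population = {'10301': 40000, '10314': 90000, '10306': 55000}

crash_counts = Counter(crash_zips)
crashes_per_10k = {
    z: 10000 * crash_counts[z] / population[z]
    for z in crash_counts
    if z in population   # skip zip codes with no population figure
}
```

With Tables, the counting and merging steps correspond to the `.group()` and `.join()` methods.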


In [None]:
filename = 'data/StatenIsland_crash_data_2023.csv'
crash = Table.read_table(filename)
crash.show(3)