# Potential Project Datasets 

##### Below are some datasets which could be used for the final group project.  Please feel free to find your own datasets as well if you have a particular interest or question.  
Some great websites for finding datasets include: 

1. https://www.kaggle.com/
2. https://data.world/
3. https://opendata.cern.ch/
4. https://www.sciencebase.gov/catalog/
5. https://data.neonscience.org/
6. https://opendataphilly.org

In [2]:
# Import Numpy and Datascience modules.
import numpy as np
import pandas as pd
from datascience import *

# Plotting modules
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', UserWarning)

### Biology Datasets
including Ecological Footprint & Chronic Kidney Disease

##### Ecological Footprint 
This dataset measures the amount of ecological resources used by each country from 1961 to 2016.  More information can be found at: https://data.world/footprint/nfa-2019-edition

In [None]:
url = 'data/NFA 2019 public_data.csv'
ecoFootprint = Table.read_table(url)
ecoFootprint

##### Chronic Kidney Disease 
More information on columns at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

In [None]:
url = 'data/kidney_disease.csv'
kidneyDisease = Table.read_table(url)
kidneyDisease

__________________________

### Chemistry Datasets
including pKa, Periodic Table, High Entropy Alloys, & Wine Quality

##### Molecular acid dissociation constant, pKa data
See: https://github.com/samplchallenges/SAMPL7/tree/master/physical_property/pKa
SMILES is a text-string representation of a molecule's chemical structure.

In [None]:
url = "https://raw.githubusercontent.com/robraddi/GP-SAMPL7/main/pKaDatabase/OChem/ochem0-2000.csv"
pka = Table.read_table(url)
pka

##### Periodic Table 
More information on columns: https://www.kaggle.com/datasets/berkayalan/chemical-periodic-table-elements?select=chemical_elements.csv

In [None]:
url = 'data/chemical_elements.csv'
ptdf = pd.read_csv(url, sep = ';')
pt = Table.from_df(ptdf)
pt

##### High Entropy Alloys
https://www.sciencedirect.com/science/article/pii/S2352340921006302?via%3Dihub

In [None]:
url = 'data/high_entropy_alloys.csv'
alloys = Table.read_table(url)
alloys

##### Wine Quality Dataset 
More information at https://archive.ics.uci.edu/ml/datasets/wine+quality

The wine quality dataset can be used to understand which chemical properties contribute to a higher quality wine.
<br>Example hypothesis: Wines with higher acidity may have lower quality at present (test) but improve with aging [Wine Enthusiast](https://www.wineenthusiast.com/basics/advanced-studies/what-is-acidity-in-wine/#).

Citation Request:
  This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. 
  Please include this citation if you plan to use this database:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
                [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
                [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

1. Title: Wine Quality 

2. Sources
   Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
   
3. Past Usage:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  In the above reference, two datasets were created, using red and white wine samples.
  The inputs include objective tests (e.g. pH values) and the output is based on sensory data
  (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality 
  between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
  these datasets under a regression approach. The support vector machine model achieved the
  best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
  etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
  analysis procedure).
 
4. Relevant Information:

   The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
   For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
   Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables 
   are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

   These datasets can be viewed as classification or regression tasks.
   The classes are ordered and not balanced (e.g. there are many more normal wines than
   excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
   or poor wines. Also, we are not sure if all input variables are relevant. So
   it could be interesting to test feature selection methods. 

5. Number of Instances: red wine - 1599; white wine - 4898. 

6. Number of Attributes: 11 + output attribute
  
   Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
   feature selection.

7. Attribute information:

   For more information, read [Cortez et al., 2009].

   Input variables (based on physicochemical tests):

   1. fixed acidity
   2. volatile acidity
   3. citric acid
   4. residual sugar
   5. chlorides
   6. free sulfur dioxide
   7. total sulfur dioxide
   8. density
   9. pH
   10. sulphates
   11. alcohol

   Output variable (based on sensory data):

   12. quality (score between 0 and 10)

8. Missing Attribute Values: None


In [10]:
# The file is semicolon-delimited; label every row with its wine type.
wine_red = Table.read_table('data/winequality-red.csv', sep=';')
wine_red = wine_red.with_columns('type', 'red')
wine_red

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5,red
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5,red
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5,red
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6,red
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5,red
7.4,0.66,0.0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5,red
7.9,0.6,0.06,1.6,0.069,15,59,0.9964,3.3,0.46,9.4,5,red
7.3,0.65,0.0,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7,red
7.8,0.58,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7,red
7.5,0.5,0.36,6.1,0.071,17,102,0.9978,3.35,0.8,10.5,5,red


In [11]:
# The file is semicolon-delimited; label every row with its wine type.
wine_white = Table.read_table('data/winequality-white.csv', sep=';')
wine_white = wine_white.with_columns('type', 'white')
wine_white

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6,white
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6,white
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6,white
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6,white
6.2,0.32,0.16,7.0,0.045,30,136,0.9949,3.18,0.47,9.6,6,white
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white
8.1,0.22,0.43,1.5,0.044,28,129,0.9938,3.22,0.45,11.0,6,white


In [12]:
# append() modifies a table in place, so copy first to leave wine_white unchanged.
wine = wine_white.copy().append(wine_red)
wine

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6,white
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6,white
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6,white
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6,white
6.2,0.32,0.16,7.0,0.045,30,136,0.9949,3.18,0.47,9.6,6,white
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6,white
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6,white
8.1,0.22,0.43,1.5,0.044,28,129,0.9938,3.22,0.45,11.0,6,white


In [13]:
wine.group('type')

type,count
red,1599
white,4898


In [15]:
wine.to_csv('data/winequality_redwhite.csv')
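
One way to start on the acidity hypothesis above is to compute a correlation coefficient between an acidity column and `quality`. The sketch below is only an illustration: it uses the ten red-wine rows displayed earlier rather than the full file, relies on the standard library so it runs without the data files, and `pearson_r` is a helper written for this example, not part of the `datascience` module. Ten rows are far too few to draw any conclusion.

```python
import csv
import io
import math

# The ten red-wine rows shown in the output above (fixed acidity and quality only).
sample_csv = """fixed acidity,quality
7.4,5
7.8,5
7.8,5
11.2,6
7.4,5
7.4,5
7.9,5
7.3,7
7.8,7
7.5,5
"""

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
acidity = [float(row["fixed acidity"]) for row in rows]
quality = [float(row["quality"]) for row in rows]

r_value = pearson_r(acidity, quality)
print(f"r = {r_value:.3f} on {len(rows)} rows")
```

With the full table loaded, the same helper could be called as `pearson_r(wine.column('fixed acidity'), wine.column('quality'))`, and repeated separately for red and white wines.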

### Environmental and Earth Science Datasets

#### Philadelphia Air Quality Measurements (Ozone, PM10, Carbon monoxide)

In [None]:
url = "https://opendata.arcgis.com/api/v3/datasets/3899a065577747fbb824f0a21afc2e7c_0/downloads/data?format=csv&spatialRefId=4326"
air = Table.read_table(url)
air

#### Earthquakes in the US 09/01/2023 until 11/12/2023
More information: https://earthquake.usgs.gov/earthquakes/map

Possible relationship to explore: earthquake magnitude with depth

In [None]:
url = 'data/USearthquake_fall_2023.csv'
eq = Table.read_table(url)
eq
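
As a possible first look at magnitude versus depth, events can be grouped into depth ranges and the mean magnitude compared across groups. This is a rough sketch under stated assumptions: the `(depth, magnitude)` pairs are made-up placeholders rather than rows from the file, the depth cutoffs are arbitrary choices for illustration, and the column names in the real file should be checked before adapting it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (depth in km, magnitude) pairs; placeholders, NOT real earthquake data.
events = [(5.0, 2.1), (8.2, 3.0), (12.5, 2.7), (35.0, 4.1), (48.3, 3.6), (110.0, 4.8)]

def depth_range(depth_km):
    """Assign a depth to one of three ranges (cutoffs are arbitrary for this sketch)."""
    if depth_km < 10:
        return "under 10 km"
    if depth_km < 70:
        return "10 to 70 km"
    return "70 km and deeper"

# Collect the magnitudes that fall in each depth range.
mags_by_range = defaultdict(list)
for depth_km, magnitude in events:
    mags_by_range[depth_range(depth_km)].append(magnitude)

# Mean magnitude per depth range.
mean_mag = {label: mean(mags) for label, mags in mags_by_range.items()}
for label, value in mean_mag.items():
    print(f"{label}: mean magnitude {value:.2f} from {len(mags_by_range[label])} events")
```

A scatter plot of depth against magnitude is a natural companion, since averages per range can hide how spread out the individual events are.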

#### Stream Gage data for the Pennypack Creek in Philadelphia
The US Geological Survey has gages on many US streams that collect data continuously. The Pennypack Creek runs through Philadelphia.
See this website: https://waterdata.usgs.gov/monitoring-location/01467042/#parameterCode=00065&period=P7D&showMedian=true

Possible relationship to explore: Turbidity (sediment in water) and Discharge (stream flow rate)

#### Column Headers in the data set
```
# Data provided for site 01467042
#    TS_ID       Parameter Description
#    121360      00010     Temperature, water, degrees Celsius
#    121357      00060     Discharge, cubic feet per second
#    121358      00065     Gage height, feet
#    121361      00095     Specific conductance, water, unfiltered, microsiemens per centimeter at 25 degrees Celsius
#    121364      00300     Dissolved oxygen, water, unfiltered, milligrams per liter
#    121365      00301     Dissolved oxygen, water, unfiltered, percent of saturation
#    121362      00400     pH, water, unfiltered, field, standard units
#    277154      63680     Turbidity, water, unfiltered, monochrome near infra-red LED light, 780-900 nm, detection angle 90 +-2.5 degrees, formazin nephelometric units (FNU)
#
# Data-value qualification codes included in this output:
#     P  Provisional data subject to revision.
#     <  Actual value is known to be less than reported value.
```

In [None]:
url = 'data/penny_pack.csv'
pp = Table.read_table(url)
pp
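
The numeric codes in the header block can be turned into readable column labels before analysis. The sketch below assumes the columns in `penny_pack.csv` are named `<TS_ID>_<parameter code>`, with a matching `..._cd` column for the qualification code; that naming is an assumption and should be checked against the actual labels. The short names such as `discharge_cfs` are hypothetical labels chosen for this example.

```python
# (TS_ID, parameter code) pairs copied from the header block above;
# the third entry is a hypothetical short label chosen for this sketch.
site_parameters = [
    ("121360", "00010", "water_temp_c"),
    ("121357", "00060", "discharge_cfs"),
    ("121358", "00065", "gage_height_ft"),
    ("121361", "00095", "specific_conductance_us_cm"),
    ("121364", "00300", "dissolved_oxygen_mg_l"),
    ("121365", "00301", "dissolved_oxygen_pct_sat"),
    ("121362", "00400", "ph"),
    ("277154", "63680", "turbidity_fnu"),
]

def build_rename_map(parameters):
    """Map assumed '<TS_ID>_<code>' labels (and their '_cd' twins) to readable names."""
    rename = {}
    for ts_id, code, name in parameters:
        rename[f"{ts_id}_{code}"] = name
        rename[f"{ts_id}_{code}_cd"] = f"{name}_qualifier"
    return rename

rename_map = build_rename_map(site_parameters)
print(rename_map["277154_63680"], "and", rename_map["121357_00060"])
```

If the labels match, each old label that is actually present in `pp.labels` could then be renamed with `pp.relabeled(old, new)`, which makes the turbidity and discharge columns easier to find.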

#### Weather Data
Data from a Weather Underground station in the South Kensington neighborhood of Philadelphia (South Kensington - KPAPHILA131): https://www.wunderground.com/dashboard/pws/KPAPHILA131

Temperature, humidity, windspeed, rainfall: many interesting features to compare. The data are hourly; the file name below indicates coverage from December 17, 2019 to November 19, 2021.

In [None]:
url = 'data/KPAPHILA131_20191217_to_20211119.csv'
weather = Table.read_table(url)
weather

________________________________________________________________

### Physics datasets
including exoplanets, Near Earth Objects (NEO), & CERN Electron Collision data. 

##### Exoplanets observed by Kepler telescope
https://exoplanets.nasa.gov/keplerscience/

In [None]:
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/kepler.csv"
exoplanets = Table.read_table(url)
exoplanets

##### Near Earth Objects 
Data is found at https://cneos.jpl.nasa.gov/ca/, but more information about the project can be found here: https://cneos.jpl.nasa.gov/

In [None]:
url = 'data/cneos_closeapproach_data.csv'
neo = Table.read_table(url)
neo

##### CERN Electron Collision Data 
The data were downloaded from https://www.kaggle.com/datasets/fedesoriano/cern-electron-collision-data, which is a modified version of the original data at https://opendata.cern.ch/record/304

In [None]:
url = 'data/dielectron.csv'
electron = Table.read_table(url)
electron

________________

### Public Health datasets
including Philadelphia vaccination rates, global vaccination rates, & Hepatitis C diagnosis.  

##### Philadelphia vaccination rates by zip code
COVID-19 Vaccinations

Shows distribution counts of first and second dose, as well as total dose information for all vaccinations performed by the health department. Also provides vaccinations by census tract, ZIP code, age, race, and sex. Vaccinations include residents and non-residents of Philadelphia. Updates daily.
See: https://www.opendataphilly.org/dataset/covid-vaccinations/resource/87ac5b4e-8491-41e3-8cf0-5bfebba2e3a0

In [None]:
url = "https://phl.carto.com/api/v2/sql?filename=covid_vaccines_by_zip&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20covid_vaccines_by_zip"
phillyVax = Table.read_table(url)
phillyVax

In [None]:
# Population by ZIP code may also be useful for turning Philadelphia vaccination counts into rates.
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/PA_zip_pop.csv"
paPop = Table.read_table(url)
# Keep only Philadelphia County and list the most populous ZIP codes first.
paPop.where('county', 'Philadelphia').sort('pop', descending=True)
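
Combining vaccination counts with population gives a rate per ZIP code. The sketch below shows only the arithmetic, assuming both tables can be keyed by ZIP code: the keys and numbers are placeholders rather than real counts, and `rates_per_100` is a hypothetical helper written for this example.

```python
# Hypothetical counts keyed by ZIP code; placeholders, NOT real vaccination or census data.
fully_vaccinated = {"ZIP_A": 20000, "ZIP_B": 15000, "ZIP_C": 12000}
population = {"ZIP_A": 50000, "ZIP_B": 30000, "ZIP_D": 1000}

def rates_per_100(vaccinated, pop):
    """Vaccinations per 100 residents for ZIP codes that appear in both inputs."""
    rates = {}
    for zip_code, count in vaccinated.items():
        residents = pop.get(zip_code)
        if residents:  # skip ZIP codes with missing or zero population
            rates[zip_code] = 100 * count / residents
    return rates

rates = rates_per_100(fully_vaccinated, population)
for zip_code, rate in sorted(rates.items()):
    print(f"{zip_code}: {rate:.1f} per 100 residents")
```

Because the vaccination counts include non-residents, a computed rate can exceed 100 in some ZIP codes, which is worth keeping in mind when interpreting the result. With the real tables, a `join` on the ZIP code columns would play the role of the dictionary lookup here.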

##### COVID Vaccination data by country

In [None]:
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/COVID_VAXDATA.csv"
globalVax = Table.read_table(url)
globalVax

##### Hepatitis C Diagnosis Datasets 
Hepatitis C is a disease caused by the hepatitis C virus.  More information on the dataset can be found at: https://archive.ics.uci.edu/ml/datasets/HCV+data

In [None]:
url = 'data/HepatitisCdata.csv'
hepC = Table.read_table(url)
hepC

_____________________________________________________________________________________________

###  Other datasets
including Philadelphia Graduation Rates, Jeopardy, & Philadelphia Crime Data

#### Philadelphia Open Data School Graduation Rates
This longitudinal open data file includes information about the graduation rates for schools broken out by: graduation rate type (four-year, five-year, or six-year), demographic category (EL status, IEP status, Economically Disadvantaged Status, Gender, or Ethnicity), and ninth-grade cohort. Students are attributed to the last school at which they actively attended in the respective graduation window, which ends on September 30 each year. Students are classified as EL, as having an IEP, and/or economically disadvantaged if they were designated as such at any point during their high school career.
see: https://www.philasd.org/performance/programsservices/open-data/school-performance/#school_graduation_rates 

In [None]:
url = "https://cdn.philasd.org/offices/performance/Open_Data/School_Performance/Graduation_Rates/SDP_Graduation_Rates_School_S_2022-05-23.csv"
grad = Table.read_table(url)
grad

#### Jeopardy
see: https://www.jeopardy.com

In [None]:
contestant = "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/contestants.csv"
locations =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/locations.csv"
results =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/final_results.csv"
loc = Table.read_table(locations)
contest = Table.read_table(contestant)
outcome = Table.read_table(results)
outcome

#### Crime Data for Philadelphia
The data came from OpenDataPhilly: https://opendataphilly.org/datasets/crime-incidents
Reported incidents from June 1st to mid-November 2023.

Possible correlation: type of crime and time of day.

In [None]:
url = 'data/incidents_June-midNovember2023.csv'
crime = Table.read_table(url)
crime
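
To explore crime type against time of day, each incident's timestamp can be reduced to a part of the day and then counted per category. This is a minimal sketch: the records are invented placeholders, the timestamp format is an assumption, and the real column names in the file should be checked before adapting it.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, category) records; placeholders, NOT real incidents.
incidents = [
    ("2023-06-01 01:15:00", "Thefts"),
    ("2023-06-01 14:40:00", "Thefts"),
    ("2023-06-02 14:05:00", "Vandalism"),
    ("2023-06-03 23:30:00", "Thefts"),
]

def period_of_day(timestamp):
    """Reduce a 'YYYY-MM-DD HH:MM:SS' timestamp to one of four six-hour periods."""
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").hour
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"

# Count incidents for every (category, period) combination that occurs.
counts = Counter((category, period_of_day(ts)) for ts, category in incidents)
for (category, period), n in sorted(counts.items()):
    print(f"{category:10s} {period:10s} {n}")
```

The resulting counts form a category-by-period table, which could be compared with what would be expected if crime type and time of day were unrelated.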