# Comparing Air Quality and Public Health across US Cities

We begin with some pre-processing. Data from before 1980 is available, but according to the EPA, 1980 is when nationally consistent procedures for measuring ozone were put in place, and required PM2.5 monitoring only began in 1999. For now, we will start with data from 2000 onward so that measurement procedures are consistent across pollutants. If necessary, we will re-introduce data from 1980 to 2000.

FROM THE EPA: 
1980 marked a revision to the ozone monitoring program (44 FR 8202) that included implementation of revised monitor calibration procedures. Data from prior years is available but the user should understand that total uncertainty and spatial variability as artefacts of the measurements are higher than in later years. 1980 marks the beginning of nationally consistent operational and quality assurance procedures.

1999 marked the beginning of required PM2.5 (particulate matter of 2.5 microns in aerodynamic diameter or less) and PM2.5 speciated monitoring (62 FR 38652). PM10 and TSP data is available in prior years, but 1999 is the first year with national FRM and non-FRM PM2.5 monitoring. This is reflected in the large jump in the number of monitors in 1999.
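As a minimal sketch of the year cutoff described above, the helper below keeps only rows from a chosen start year onward. It assumes a multi-year frame with a `Year` column like the one in the annual AQI files; `restrict_years`, `START_YEAR`, and the toy frame are hypothetical names and values used only for illustration.

```python
import pandas as pd

START_YEAR = 2000  # assumed cutoff, matching the plan to begin with data from 2000


def restrict_years(df: pd.DataFrame, start_year: int = START_YEAR) -> pd.DataFrame:
    """Return only the rows whose Year is at or after start_year."""
    return df[df["Year"] >= start_year].reset_index(drop=True)


# Toy frame for illustration only; these are not EPA values.
example = pd.DataFrame({"Year": [1979, 1985, 2000, 2023]})
recent = restrict_years(example)
print(recent["Year"].tolist())
```

Lowering `start_year` to 1980 would be one way to bring the earlier ozone-era data back in later.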


In [7]:
#imports
import pandas as pd
import altair as alt
import zipfile

##  Visualization/Exploration Planning (AQI by County for 2023)

### What Data is Shown? (Data Abstraction)
This dataframe contains annual air quality information by county: the number of days in each AQI category, summary AQI statistics, and a day count for each criteria pollutant. The criteria pollutants tracked are CO (carbon monoxide), NO2 (nitrogen dioxide), ozone, PM2.5 (fine particulate matter with a diameter of 2.5 microns or less), and PM10 (particulate matter with a diameter of 10 microns or less).
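One way to check our reading of these columns is to test whether the day counts add up. The sketch below assumes that the AQI category columns and the per-pollutant `Days ...` columns should each sum to `Days with AQI`; that is an inference from the rows we have looked at, not a documented guarantee. `check_day_totals` is a hypothetical helper, and the two sample rows are copied from the 2023 file (Baldwin and DeKalb, Alabama).

```python
import pandas as pd

CATEGORY_COLS = [
    "Good Days", "Moderate Days", "Unhealthy for Sensitive Groups Days",
    "Unhealthy Days", "Very Unhealthy Days", "Hazardous Days",
]
POLLUTANT_COLS = ["Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10"]


def check_day_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Report, per row, whether each group of day counts sums to Days with AQI."""
    out = pd.DataFrame(index=df.index)
    out["category_ok"] = df[CATEGORY_COLS].sum(axis=1) == df["Days with AQI"]
    out["pollutant_ok"] = df[POLLUTANT_COLS].sum(axis=1) == df["Days with AQI"]
    return out


# Two rows copied from annual_aqi_by_county_2023.csv
sample = pd.DataFrame(
    [
        ["Baldwin", 170, 143, 27, 0, 0, 0, 0, 0, 0, 84, 86, 0],
        ["DeKalb", 212, 155, 55, 2, 0, 0, 0, 0, 0, 141, 71, 0],
    ],
    columns=["County", "Days with AQI"] + CATEGORY_COLS + POLLUTANT_COLS,
)
totals = check_day_totals(sample)
print(totals)
```

Running the same check on the full dataframe and filtering for `False` would show any counties where the assumption does not hold.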

### Why is the user analyzing/viewing it? (Task Abstraction)
#### Task:
#### Action: 

### How is the data presented? (Visual Encodings)


In [8]:
aqi_county_2023 = pd.read_csv("annual_aqi_by_county_2023.csv")

In [9]:
aqi_county_2023.head()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2023,170,143,27,0,0,0,0,90,54,40,0,0,84,86,0
1,Alabama,Clay,2023,155,109,46,0,0,0,0,83,61,40,0,0,0,155,0
2,Alabama,DeKalb,2023,212,155,55,2,0,0,0,133,63,43,0,0,141,71,0
3,Alabama,Elmore,2023,118,102,16,0,0,0,0,90,54,40,0,0,118,0,0
4,Alabama,Etowah,2023,181,126,55,0,0,0,0,100,64,43,0,0,74,107,0


## Visualization 1 (Exploratory)
**Task**: Discover trends in the distribution of each criteria pollutant

**Action**: Display the distribution of each criteria pollutant

**What**:
**How**:
**Why**:

In [10]:
aqi_county_2023.describe()

Unnamed: 0,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
count,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0,949.0
mean,2023.0,194.891465,146.892518,43.48156,3.541623,0.832455,0.092729,0.05058,136.228662,64.929399,39.51844,0.788198,2.926238,114.210748,70.404636,6.561644
std,0.0,51.563261,44.509908,30.623047,5.1531,1.569205,0.334134,1.034006,78.274006,23.441782,9.752471,6.732097,13.412824,74.522416,67.77705,28.16538
min,2023.0,29.0,25.0,0.0,0.0,0.0,0.0,0.0,29.0,10.0,6.0,0.0,0.0,0.0,0.0,0.0
25%,2023.0,179.0,119.0,21.0,0.0,0.0,0.0,0.0,97.0,54.0,38.0,0.0,0.0,70.0,0.0,0.0
50%,2023.0,182.0,150.0,37.0,2.0,0.0,0.0,0.0,126.0,64.0,42.0,0.0,0.0,121.0,61.0,0.0
75%,2023.0,241.0,173.0,59.0,5.0,1.0,0.0,0.0,164.0,74.0,45.0,0.0,0.0,170.0,111.0,0.0
max,2023.0,274.0,271.0,153.0,51.0,19.0,3.0,31.0,1695.0,542.0,74.0,93.0,181.0,273.0,273.0,269.0
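The summary above reports a maximum `Max AQI` of 1695, well above 500, which is the nominal top of the AQI scale. A small sketch for pulling out such rows for inspection follows. `beyond_scale` and `AQI_SCALE_MAX` are hypothetical names, and the demo frame uses placeholder state and county labels with `Max AQI` values taken from the outputs above.

```python
import pandas as pd

AQI_SCALE_MAX = 500  # nominal top of the AQI scale


def beyond_scale(df: pd.DataFrame, column: str = "Max AQI",
                 limit: int = AQI_SCALE_MAX) -> pd.DataFrame:
    """Return rows whose value in `column` exceeds `limit`, largest first."""
    flagged = df.loc[df[column] > limit, ["State", "County", column]]
    return flagged.sort_values(column, ascending=False)


# Placeholder labels; the Max AQI values 90, 133, and 1695 appear in the outputs above.
demo = pd.DataFrame({
    "State": ["State A", "State B", "State C"],
    "County": ["County X", "County Y", "County Z"],
    "Max AQI": [90, 1695, 133],
})
flagged = beyond_scale(demo)
print(flagged)
```

Whether such values are data errors or genuine extreme events is something to decide before plotting, since they stretch any axis they appear on.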


In [16]:
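criteria_pollutant_days = ['Days CO', 'Days NO2', 'Days Ozone', 'Days PM2.5', 'Days PM10']
for col in criteria_pollutant_days:
    # Escape "." so Altair does not treat "Days PM2.5" as a nested field,
    # and state the quantitative type explicitly
    field = col.replace('.', '\\.')
    chart = alt.Chart(aqi_county_2023).mark_boxplot().encode(
        x=alt.X(f'{field}:Q', title=f'{col} (number of days)'),
    )

    # Display the box plot for the current pollutant column
    chart.display()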
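The loop above draws one chart per column, so each plot gets its own scale. As an alternative sketch, the wide columns can be reshaped into long form so that a single chart compares all five pollutants on a shared axis. `to_long` and `pollutant_boxplots` are hypothetical helpers; Altair is imported inside the plotting function only so that the reshaping step can run without it.

```python
import pandas as pd

POLLUTANT_COLS = ["Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10"]


def to_long(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape the per-pollutant day columns into one row per county and pollutant."""
    long_df = df.melt(
        id_vars=["State", "County"],
        value_vars=POLLUTANT_COLS,
        var_name="Pollutant",
        value_name="Days",
    )
    # Drop the "Days " prefix so the labels read "CO", "NO2", and so on
    long_df["Pollutant"] = long_df["Pollutant"].str.replace("Days ", "", regex=False)
    return long_df


def pollutant_boxplots(long_df: pd.DataFrame):
    """Build one box plot per pollutant on a shared x axis."""
    import altair as alt  # local import: only needed for plotting

    return alt.Chart(long_df).mark_boxplot().encode(
        x=alt.X("Days:Q", title="Number of days"),
        y=alt.Y("Pollutant:N", title="Criteria pollutant"),
    )


# Two rows copied from the 2023 file (Baldwin and DeKalb, Alabama)
sample = pd.DataFrame(
    [["Alabama", "Baldwin", 0, 0, 84, 86, 0], ["Alabama", "DeKalb", 0, 0, 141, 71, 0]],
    columns=["State", "County"] + POLLUTANT_COLS,
)
long_sample = to_long(sample)
print(long_sample)
```

In the notebook, `pollutant_boxplots(to_long(aqi_county_2023))` would produce the combined chart. The long-form column names contain no dots, so no escaping is needed.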


# Data Exploration for CDC 2023

In [3]:
# Specify the path to your zip file
zip_file_path = 'cdc_2023.csv.zip'

# Specify the file name within the zip file you wish to open
file_name_within_zip = 'cdc_2023.csv'

# Use the zipfile module to open the zip file
with zipfile.ZipFile(zip_file_path, 'r') as z:
    # Extract the specified file from the zip file
    with z.open(file_name_within_zip) as f:
        # Read the extracted file into a pandas DataFrame
        cdc_2023 = pd.read_csv(f)

# Now you can work with the DataFrame as usual
cdc_2023.head()


Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,Data_Value_Footnote,Low_Confidence_Limit,High_Confidence_Limit,TotalPopulation,Geolocation,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text
0,2020,AK,Alaska,Kiana,BRFSS,Health Outcomes,All teeth lost among adults aged >=65 years,%,Crude prevalence,38.5,...,,29.0,48.0,347.0,POINT (-160.4343638 66.97258641),239300,HLTHOUT,TEETHLOST,CrdPrv,All Teeth Lost
1,2021,AK,Alaska,Koliganek,BRFSS,Health Outcomes,Arthritis among adults aged >=18 years,%,Crude prevalence,22.0,...,,18.5,25.7,209.0,POINT (-157.2259091 59.69715239),241500,HLTHOUT,ARTHRITIS,CrdPrv,Arthritis
2,2021,AK,Alaska,Kongiganak,BRFSS,Health Outcomes,Arthritis among adults aged >=18 years,%,Crude prevalence,23.5,...,,20.2,27.0,439.0,POINT (-162.8830767 59.95797089),241610,HLTHOUT,ARTHRITIS,CrdPrv,Arthritis
3,2021,AK,Alaska,Lakes,BRFSS,Health Outcomes,Obesity among adults aged >=18 years,%,Crude prevalence,36.7,...,,32.5,41.2,8364.0,POINT (-149.3066764 61.60526948),242832,HLTHOUT,OBESITY,CrdPrv,Obesity
4,2021,AK,Alaska,Mountain Village,BRFSS,Health Outcomes,Obesity among adults aged >=18 years,%,Crude prevalence,47.3,...,,39.2,56.1,813.0,POINT (-163.7209368 62.09111567),251180,HLTHOUT,OBESITY,CrdPrv,Obesity
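Each row of this table is one measure for one location, so a natural next step is to select a single measure and split `Geolocation` into numeric coordinates for mapping or joining. The sketch below assumes every `Geolocation` value follows the `POINT (longitude latitude)` pattern seen in the rows above. `select_measure` is a hypothetical helper, and the sample rows are copied from the output.

```python
import pandas as pd


def select_measure(df: pd.DataFrame, measure_id: str) -> pd.DataFrame:
    """Keep one measure and add numeric lon/lat columns parsed from Geolocation."""
    cols = ["Year", "StateAbbr", "LocationName", "MeasureId", "Data_Value", "Geolocation"]
    subset = df.loc[df["MeasureId"] == measure_id, cols].copy()
    # Assumed format: "POINT (<longitude> <latitude>)"
    coords = subset["Geolocation"].str.extract(
        r"POINT \((?P<lon>-?[\d.]+) (?P<lat>-?[\d.]+)\)"
    ).astype(float)
    return subset.join(coords)


# Three rows copied from cdc_2023.head()
sample = pd.DataFrame({
    "Year": [2021, 2021, 2021],
    "StateAbbr": ["AK", "AK", "AK"],
    "LocationName": ["Koliganek", "Lakes", "Mountain Village"],
    "MeasureId": ["ARTHRITIS", "OBESITY", "OBESITY"],
    "Data_Value": [22.0, 36.7, 47.3],
    "Geolocation": [
        "POINT (-157.2259091 59.69715239)",
        "POINT (-149.3066764 61.60526948)",
        "POINT (-163.7209368 62.09111567)",
    ],
})
obesity = select_measure(sample, "OBESITY")
print(obesity[["LocationName", "Data_Value", "lon", "lat"]])
```

Rows whose `Geolocation` does not match the pattern would get missing coordinates rather than raising an error, so it is worth counting them before relying on the result.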
