<a href="https://colab.research.google.com/github/mydevco/python-desktop-reference/blob/main/EPA_Health_Air_Quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Air Quality Index data analysis for data science week 2 discussion

In [None]:
# Import pandas and NumPy for data handling
import pandas as pd
import numpy as np

# Import statistics and charting libraries
import statistics
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

We use pandas and NumPy to compute summary statistics such as the mean, median, standard deviation, and percentiles. These statistics help reveal central tendencies and variation in the Air Quality Index (AQI).
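
As a minimal sketch of these measures, the snippet below computes them on a short list of AQI readings. The values and the names `sample_aqi` and `summary` are illustrative assumptions, not taken from the EPA file.

```python
import numpy as np
import pandas as pd

# Hypothetical AQI readings, for illustration only (not the EPA dataset)
sample_aqi = pd.Series([20, 37, 52, 31, 31, 45, 60])

summary = {
    "mean": sample_aqi.mean(),
    "median": sample_aqi.median(),
    "std": sample_aqi.std(),  # pandas uses the sample formula (ddof=1) by default
    "p25": np.percentile(sample_aqi, 25),
    "p75": np.percentile(sample_aqi, 75),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The mean and median describe the typical reading, while the standard deviation and the gap between the 25th and 75th percentiles describe how much the readings vary.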

In [None]:
# Load data
epa_data = pd.read_csv("/content/daily_aqi_by_county_2025.csv", index_col=0)

# Replace spaces in column names with underscores and lowercase them
epa_data.columns = epa_data.columns.str.replace(" ", "_").str.lower()

# Fill NaN or empty values with 0 (note: a missing AQI then counts as 0 in the statistics)
epa_data = epa_data.fillna(0)

# Make sure the AQI column has only numeric values; non-numeric entries become NaN
epa_data['aqi'] = pd.to_numeric(epa_data['aqi'], errors='coerce')

# True if no value was coerced to NaN, i.e. every AQI entry is numeric
is_all_numeric = not epa_data['aqi'].isnull().values.any()
print(is_all_numeric)

# Display the first 5 rows
epa_data.head(5)



True


| Index | state_name | county_name | state_code | county_code | date | aqi | category | defining_parameter | defining_site | number_of_sites_reporting |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Baldwin | 1 | 3 | 1/1/2025 | 20 | Good | PM2.5 | 01-003-0010 | 1.0 |
| 1 | Alabama | Baldwin | 1 | 3 | 1/2/2025 | 37 | Good | PM2.5 | 01-003-0010 | 1.0 |
| 2 | Alabama | Baldwin | 1 | 3 | 1/3/2025 | 52 | Moderate | PM2.5 | 01-003-0010 | 1.0 |
| 3 | Alabama | Baldwin | 1 | 3 | 1/4/2025 | 31 | Good | PM2.5 | 01-003-0010 | 1.0 |
| 4 | Alabama | Baldwin | 1 | 3 | 1/5/2025 | 31 | Good | PM2.5 | 01-003-0010 | 1.0 |


In [None]:
#find more information about the file
epa_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 60030 entries, 0 to 60210
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   state_name                 60030 non-null  object 
 1   county_name                60030 non-null  object 
 2   state_code                 60030 non-null  int64  
 3   county_code                60030 non-null  int64  
 4   date                       60030 non-null  object 
 5   aqi                        60030 non-null  int64  
 6   category                   60030 non-null  object 
 7   defining_parameter         60030 non-null  object 
 8   defining_site              60030 non-null  object 
 9   number_of_sites_reporting  60030 non-null  float64
dtypes: float64(1), int64(3), object(6)
memory usage: 5.0+ MB
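
The `info()` output shows that `date` is stored as `object` (plain strings). A possible next step, sketched here on a small stand-in frame, is to parse it into a datetime column so that readings can be grouped by calendar period. The frame `demo` and its values are hypothetical, and the month/day/year format is an assumption based on the first rows of the file.

```python
import pandas as pd

# Stand-in frame with the same 'date' string format as the file (values are made up)
demo = pd.DataFrame({
    "date": ["1/1/2025", "1/2/2025", "2/15/2025"],
    "aqi": [20, 37, 52],
})

# Assumption: dates are month/day/year, as the first rows of the file suggest
demo["date"] = pd.to_datetime(demo["date"], format="%m/%d/%Y")

# With a datetime column, grouping by month is straightforward
monthly_mean = demo.groupby(demo["date"].dt.month)["aqi"].mean()

print(demo.dtypes)
print(monthly_mean)
```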


In [None]:
# Get descriptive statistics of AQI by state
epa_summary = epa_data[['state_name','aqi']].groupby('state_name')
epa_summary.describe()


AQI statistics by state:

| state_name | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Alabama | 1120.0 | 41.372321 | 15.055742 | 0.0 | 32.0 | 41.0 | 51.0 | 133.0 |
| Alaska | 482.0 | 38.628631 | 25.884219 | 0.0 | 21.25 | 39.0 | 51.0 | 190.0 |
| Arizona | 2343.0 | 48.97866 | 38.573335 | 2.0 | 31.0 | 47.0 | 61.0 | 1215.0 |
| Arkansas | 1329.0 | 42.2769 | 15.042763 | 5.0 | 33.0 | 41.0 | 50.0 | 239.0 |
| California | 5096.0 | 46.766484 | 21.212908 | 0.0 | 35.0 | 44.0 | 55.0 | 365.0 |
| Colorado | 3998.0 | 44.188844 | 21.108394 | 1.0 | 33.0 | 45.0 | 51.0 | 652.0 |
| Connecticut | 632.0 | 39.405063 | 11.879766 | 1.0 | 34.0 | 40.0 | 45.0 | 84.0 |
| Delaware | 270.0 | 44.066667 | 10.531736 | 22.0 | 36.0 | 42.0 | 52.0 | 82.0 |
| District Of Columbia | 151.0 | 43.834437 | 10.521363 | 22.0 | 37.0 | 44.0 | 49.0 | 93.0 |
| Florida | 3413.0 | 43.720481 | 11.265804 | 1.0 | 36.0 | 43.0 | 51.0 | 140.0 |


The Air Quality Index (AQI) measures air pollution levels. It helps assess potential health impacts, with higher values indicating poorer air quality. The AQI is based on measurements of pollutants such as particulate matter, ozone, carbon monoxide, sulfur dioxide, and nitrogen dioxide. The `defining_parameter` column shows which pollutant determined the AQI value reported for that day.
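
To make "higher values indicate poorer air quality" concrete, the sketch below maps numeric AQI values to the standard EPA category labels, which is the kind of label stored in the `category` column. The helper name `aqi_category` is hypothetical, and the sample values passed to it are chosen only to hit different bands.

```python
import pandas as pd

# Standard EPA AQI bands: 0-50, 51-100, 101-150, 151-200, 201-300, 301 and above
bins = [-1, 50, 100, 150, 200, 300, float("inf")]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

def aqi_category(aqi_values):
    """Map numeric AQI values to EPA category labels (upper edges inclusive)."""
    return pd.cut(pd.Series(aqi_values), bins=bins, labels=labels)

categories = aqi_category([20, 52, 133, 190, 365]).tolist()
print(categories)
```

Applied to the notebook's data, the result of `aqi_category(epa_data["aqi"])` could be compared with the existing `category` column as a consistency check.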

Particulate matter is reported in two size classes, PM2.5 and PM10. PM2.5 covers all atmospheric aerosol particles with a diameter of at most 2.5 micrometers.
PM10 refers to airborne particles that do not exceed 10 micrometers in diameter. PM10 can also carry carcinogenic substances, including benzopyrenes, furans, dioxins, and heavy metals.

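### Conclusion
Among the states in the summary above, Arizona has the highest mean AQI (48.98) and the highest maximum AQI (1215). Its standard deviation (38.57) is also the largest in the table, which suggests that a few extreme readings pull its mean upward.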
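
Because the summary output shows only part of the state list, one way to check this conclusion across every state is to rank the states directly. The sketch below does so on a tiny made-up frame. The helper `rank_states_by_aqi` and the frame `demo` are hypothetical, and only the column names match `epa_data`.

```python
import pandas as pd

def rank_states_by_aqi(df):
    """Return per-state mean and max AQI, sorted by mean (highest first)."""
    return (
        df.groupby("state_name")["aqi"]
        .agg(["mean", "max"])
        .sort_values("mean", ascending=False)
    )

# Made-up data with the same column names as epa_data
demo = pd.DataFrame({
    "state_name": ["A", "A", "B", "B", "C"],
    "aqi": [40, 60, 30, 35, 90],
})

ranked = rank_states_by_aqi(demo)
print(ranked)
print("Highest mean:", ranked["mean"].idxmax())
print("Highest max:", ranked["max"].idxmax())
```

On the real data, `rank_states_by_aqi(epa_data).head()` would list the states with the highest mean AQI first.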

In [None]:
# Analysis across states
epa_data['state_name'].describe()

| | state_name |
|---|---|
| count | 60030 |
| unique | 32 |
| top | California |
| freq | 5096 |


In [None]:
print(f'Mean AQI across all states is {np.mean(epa_data["aqi"])}')
print(f'Median AQI across all states is {np.median(epa_data["aqi"])}')
print(f'Minimum AQI across all states is {np.min(epa_data["aqi"])}')
print(f'Maximum AQI across all states is {np.max(epa_data["aqi"])}')
print(f'Standard deviation AQI across all states is {np.std(epa_data["aqi"], ddof=1)}')

Mean AQI across all states is 41.59153756455106
Median AQI across all states is 41.0
Minimum AQI across all states is 0
Maximum AQI across all states is 1215
Standard deviation AQI across all states is 18.68661285768535
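
The maximum of 1215 stands far apart from the mean and median of about 41. As a rough check of how extreme it is, the sketch below computes its z-score from the figures printed above. The variable names are illustrative.

```python
# Figures copied from the printed output above
mean_aqi = 41.59153756455106
std_aqi = 18.68661285768535
max_aqi = 1215

# z-score: how many standard deviations the maximum lies above the mean
z_max = (max_aqi - mean_aqi) / std_aqi
print(f"z-score of the maximum: {z_max:.1f}")
```

The maximum lies roughly 63 standard deviations above the mean. Since the standard AQI scale tops out at 500, a reading this high may be worth inspecting before drawing conclusions from the mean.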
