## PyAirPollution Data Analysis

In this tutorial, we use US EPA AirData, a public database maintained by the US Environmental Protection Agency. It provides detailed air quality data, including PM2.5 levels, for multiple years, and can be downloaded in several forms, such as yearly concentrations or the Air Quality Index (AQI) by county. This breadth makes it well suited to studying air quality trends and their effects.

### 1. Importing the Data

- Get the PM2.5 data file from the US EPA AirData site.
- Open the file called 'annual_aqi_by_county_2023.csv'.
- This CSV file includes the following fields:

  - **State:** The state in which the air quality was measured.
  - **County:** The specific county within the state.
  - **Year:** The year the data was recorded.
  - **Days with AQI:** Total number of days air quality was measured.
  - **Good Days:** Days with good air quality.
  - **Moderate Days:** Days with moderate air quality.
  - **Unhealthy for Sensitive Groups Days:** Days when air quality was unhealthy for sensitive groups like people with asthma.
  - **Unhealthy Days:** Days with unhealthy air quality for the general population.
  - **Very Unhealthy Days:** Days with very poor air quality, posing health risks to everyone.
  - **Hazardous Days:** Days with extremely dangerous air quality levels.
  - **Max AQI:** Maximum Air Quality Index recorded in the year.
  - **90th Percentile AQI:** The AQI value below which 90% of all AQI values fall.
  - **Median AQI:** The middle AQI value when all are lined up from lowest to highest.
  - **Days CO:** Days when Carbon Monoxide (CO) was the primary pollutant.
  - **Days NO2:** Days when Nitrogen Dioxide (NO2) was the primary pollutant.
  - **Days Ozone:** Days when Ozone was the primary pollutant.
  - **Days PM2.5:** Days when fine particulate matter (PM2.5) was the primary pollutant.
  - **Days PM10:** Days when coarse particulate matter (PM10) was the primary pollutant.
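
Before working with the file, it can help to confirm that the columns listed above are actually present. The following is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, the column spellings are assumed to match the list above exactly, and the toy frame holds made-up values rather than EPA data.

```python
import pandas as pd

# Column names as listed above; exact spelling in the CSV is an assumption.
EXPECTED_COLUMNS = [
    "State", "County", "Year", "Days with AQI", "Good Days", "Moderate Days",
    "Unhealthy for Sensitive Groups Days", "Unhealthy Days",
    "Very Unhealthy Days", "Hazardous Days", "Max AQI",
    "90th Percentile AQI", "Median AQI",
    "Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10",
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [name for name in expected if name not in df.columns]

# Toy frame (made-up values) that lacks most columns, to show the check at work.
toy = pd.DataFrame({"State": ["Alabama"], "County": ["Baldwin"], "Year": [2023]})
print(missing_columns(toy))
```

Once the real file is loaded, calling `missing_columns(epa_data)` should return an empty list if the file matches the description.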


In [None]:
# Flag to determine the source of the file
use_local_file = False # Change to True if you want to use a local file

In [None]:
import pandas as pd

if use_local_file:
  # Import file from a local file
  epa_data = pd.read_csv('annual_aqi_by_county_2023.csv')
else:
  # Import file from GitHub raw URL
  url = 'https://raw.githubusercontent.com/peyrone/PyAirPollution/main/annual_aqi_by_county_2023.csv'
  epa_data = pd.read_csv(url)

### 2. Cleaning and Preparing Data

- Handle missing values, correct data types, and remove duplicates.

#### 2.1 Inspect for Missing Values

In [None]:
# Check for missing values
print(epa_data.isnull().sum())

# If missing values are not significant, we might drop them
epa_data.dropna(inplace=True)
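
Whether missing values are "significant" is a judgment call. One rough way to inform it is to measure how much data a blanket `dropna()` would discard. The sketch below uses a made-up toy frame, not the EPA file, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and two gaps.
toy = pd.DataFrame({
    "County": ["A", "B", "C", "D"],
    "Good Days": [300, np.nan, 250, 280],
    "Max AQI": [90, 120, np.nan, 100],
})

# Fraction of missing values in each column
missing_share = toy.isnull().mean()

# Fraction of rows that dropna() would remove (rows with at least one gap)
rows_lost = toy.isnull().any(axis=1).mean()

print(missing_share)
print("Share of rows lost by dropna():", rows_lost)
```

If the share of lost rows is large, filling or selectively dropping columns may be preferable to dropping every incomplete row.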

#### 2.2 Correcting Data Types

In [None]:
# Correct data types if needed
epa_data['Year'] = epa_data['Year'].astype(str)

#### 2.3 Dealing with Duplicates

In [None]:
# Remove duplicate rows
epa_data.drop_duplicates(inplace=True)

#### 2.4 Handling Outliers

In [None]:
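# Flag outliers in 'Good Days' using Z-scores
import numpy as np
from scipy import stats

# Z-score: number of standard deviations each value lies from the column mean
z_scores = stats.zscore(epa_data['Good Days'])

# Rows more than 3 standard deviations from the mean are treated as outliers
outliers = epa_data[np.abs(z_scores) > 3]
print("Outliers based on Z-scores:\n", outliers)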
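
To make the Z-score rule concrete, the sketch below computes it by hand with NumPy on made-up numbers (not EPA data). It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Made-up 'Good Days' counts: 19 typical values and one extreme value.
good_days = np.array([195, 200, 205] * 6 + [200, 10], dtype=float)

# Z-score: distance from the mean in units of the standard deviation.
# np.std defaults to ddof=0 (population standard deviation).
z = (good_days - good_days.mean()) / good_days.std()

# Keep only the values lying more than 3 standard deviations from the mean.
flagged = good_days[np.abs(z) > 3]
print(flagged)
```

Only the extreme value is flagged here. Note that with very few observations no value can reach a Z-score of 3, so the threshold is most useful on larger samples.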

### 3. Exploring Data

- Explore the data with basic statistical methods.

In [None]:
# Summary statistics and information
print(epa_data.describe())
epa_data.info()  # info() prints its report directly and returns None

**Understanding Correlation Values:**

- Correlation coefficients range from -1 to 1.
- A value close to 1 implies a strong positive correlation (as one variable increases, so does the other).
- A value close to -1 implies a strong negative correlation (as one variable increases, the other decreases).
- A value around 0 suggests no correlation.
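
These three cases can be illustrated on a tiny made-up dataset. The column names and values below are invented for the illustration and are not EPA data.

```python
import pandas as pd

# Made-up values chosen to illustrate the three cases.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "rises_with_x": [2, 4, 6, 8, 10],   # moves in step with x
    "falls_with_x": [10, 8, 6, 4, 2],   # moves opposite to x
    "unrelated": [3, 1, 4, 1, 3],       # no linear relationship with x
})

# Pearson correlation of every column with x
r = toy.corr()["x"]
print(r.round(2))
```

The coefficient for `rises_with_x` is at the positive end of the range, the one for `falls_with_x` at the negative end, and the one for `unrelated` is essentially zero.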

In [None]:
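# Correlation analysis on numeric columns only
# (State and County are text, and Year was cast to str above)
correlation_matrix = epa_data.select_dtypes(include='number').corr()
print("\nCorrelation Matrix:\n", correlation_matrix)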

**This correlation matrix analyzes relationships between air quality indicators:**

- **'Days with AQI'** and **'Good Days'** are strongly positively correlated, indicating that counties with more monitored days tend to record more good air quality days.
- **'Moderate Days'** are negatively correlated with **'Good Days'** but positively correlated with **'Days with AQI'**.
- **'Unhealthy for Sensitive Groups Days'** rise with **'Moderate Days'** and fall as **'Good Days'** increase.
- A strong negative correlation between **'Days Ozone'** and **'Days PM2.5'** suggests that counties where ozone is often the primary pollutant have fewer days dominated by PM2.5, and vice versa.
- Higher AQI summary values (**'Max AQI'**, **'90th Percentile AQI'**, **'Median AQI'**) are associated with more **'Moderate'**, **'Unhealthy for Sensitive Groups'**, and **'Unhealthy'** days.
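
Observations like these are easier to check when the strongest relationships are listed explicitly rather than read cell by cell. The following is one possible sketch: `strongest_pairs` is a hypothetical helper introduced here, and the toy frame uses made-up values under real column names.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, top=3):
    """Return the `top` column pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value so strong negative correlations rank high too.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Good Days": [300, 250, 200, 150, 100],
    "Moderate Days": [50, 100, 150, 200, 250],
    "Max AQI": [80, 120, 110, 170, 160],
})
top_pairs = strongest_pairs(toy)
print(top_pairs)
```

On the real data, `strongest_pairs(epa_data, top=10)` would give a short list to compare against the bullet points above.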

### 4. Visualizing Data

- Show the data in graphs to see trends and patterns. For example, make a bar chart to display the average good days in each state.

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install seaborn

In [None]:
# Heatmap of Correlation Matrix
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
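# Bar chart of average good days per state
avg_good_days = epa_data.groupby('State')['Good Days'].mean()
avg_good_days.plot(kind='bar', figsize=(12, 5))  # wide figure so state labels stay readable
plt.ylabel('Average Good Days')
plt.title('Average Good Days per State')
plt.tight_layout()
plt.show()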