# Exercise: Exploring Air Quality Index (AQI) by County (2023)

### Introduction

Welcome to the Descriptive Statistics Mastery Exercise! In this exercise, we will explore the `annual_aqi_by_county_2023.csv` dataset, which contains Air Quality Index (AQI) data for individual counties in 2023. The AQI quantifies how clean or polluted the air is, based on several pollutants. Working through the exercise will sharpen your descriptive statistics skills and show how air quality varies across counties.

### Dataset Information:

- Dataset name: `annual_aqi_by_county_2023.csv`
- Columns: State, County, Year, Days with AQI, Good Days, Moderate Days, Unhealthy for Sensitive Groups Days, Unhealthy Days, Very Unhealthy Days, Hazardous Days, Max AQI, 90th Percentile AQI, Median AQI, Days CO, Days NO2, Days Ozone, Days PM2.5, Days PM10
- Rows: Each row represents AQI data for a specific county.
    
### Objective:

Explore and analyze the AQI dataset using descriptive statistics to understand the central tendencies, variabilities, and distributions of air quality across different counties. This exercise will cover mean, median, range, quartiles, and more, providing a comprehensive overview of the air quality landscape in 2023.

### Tools:
You may use Pandas, NumPy, and Matplotlib for data manipulation, calculations, and visualization.

Now, let's dive into the exercises and uncover valuable insights from the AQI dataset!

## Step 1: Import the relevant libraries

Import the relevant Python libraries `pandas` and `numpy`.

In [57]:
# Import relevant Python libraries.

### YOUR CODE HERE ###

## Step 2: Import the data from the file into a DataFrame



In [58]:
# RUN THIS CELL TO IMPORT YOUR DATA.

import os

import pandas as pd  # Also imported in Step 1; repeated so this cell runs on its own.

# Get the current working directory
current_directory = os.getcwd()

# Build the full path to the CSV file, which is expected in the working directory
file_name = "annual_aqi_by_county_2023.csv"
file_path = os.path.join(current_directory, file_name)

# Read the CSV file into a Pandas DataFrame
df = pd.read_csv(file_path)

## Step 3: Display the first 5 rows of the dataframe

In [59]:
# Displaying the first few rows of the DataFrame to get an overview
df.head()

Unnamed: 0,State,County,Year,Days with AQI,Good Days,Moderate Days,Unhealthy for Sensitive Groups Days,Unhealthy Days,Very Unhealthy Days,Hazardous Days,Max AQI,90th Percentile AQI,Median AQI,Days CO,Days NO2,Days Ozone,Days PM2.5,Days PM10
0,Alabama,Baldwin,2023,170,143,27,0,0,0,0,90,54,40,0,0,84,86,0
1,Alabama,Clay,2023,155,109,46,0,0,0,0,83,61,40,0,0,0,155,0
2,Alabama,DeKalb,2023,212,155,55,2,0,0,0,133,63,43,0,0,141,71,0
3,Alabama,Elmore,2023,118,102,16,0,0,0,0,90,54,40,0,0,118,0,0
4,Alabama,Etowah,2023,181,126,55,0,0,0,0,100,64,43,0,0,74,107,0
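
The `Unnamed: 0` column in the output above suggests that the CSV was saved together with its row index. That is an assumption about the file, not something the exercise states. If it holds, passing `index_col=0` to `pd.read_csv` is one way to avoid the extra column. The sketch below uses an in-memory copy of two rows from the output above (a subset of the columns), so it runs without the real file:

```python
import io

import pandas as pd

# Two rows copied from the df.head() output, with a leading unnamed index column
# (assumed layout of the real CSV).
csv_text = """,State,County,Year,Days with AQI,Good Days
0,Alabama,Baldwin,2023,170,143
1,Alabama,Clay,2023,155,109
"""

# Default read: the unnamed first column becomes a data column called "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead.
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(clean_df.columns.tolist())
```

With the real file, the same idea would be `pd.read_csv(file_path, index_col=0)`, again assuming the first column really is a saved index.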


## Data Exploration: What do the individual columns represent?

[Write your response here. Double-click (or enter) to edit.]

- State -
- County -
- Year - 
- Days with AQI -
- Good Days -
- Moderate Days -
- Unhealthy for Sensitive Groups Days -
- Unhealthy Days -
- Very Unhealthy Days -
- Hazardous Days -
- Max AQI -
- 90th Percentile AQI -
- Median AQI -
- Days CO - 
- Days NO2 -
- Days Ozone -
- Days PM2.5 -
- Days PM10 -
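
One way to build intuition for these columns is to check how they relate to each other. The sketch below rebuilds the five rows shown by `df.head()` above and tests two hypotheses: that the six AQI category columns add up to "Days with AQI", and that the five "Days <pollutant>" columns do too. The name `sample` is introduced here for illustration only, and passing on five rows does not prove the pattern for the whole dataset.

```python
import pandas as pd

# The five rows shown by df.head() above, limited to the day-count columns.
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days with AQI": [170, 155, 212, 118, 181],
    "Good Days": [143, 109, 155, 102, 126],
    "Moderate Days": [27, 46, 55, 16, 55],
    "Unhealthy for Sensitive Groups Days": [0, 0, 2, 0, 0],
    "Unhealthy Days": [0, 0, 0, 0, 0],
    "Very Unhealthy Days": [0, 0, 0, 0, 0],
    "Hazardous Days": [0, 0, 0, 0, 0],
    "Days CO": [0, 0, 0, 0, 0],
    "Days NO2": [0, 0, 0, 0, 0],
    "Days Ozone": [84, 0, 141, 118, 74],
    "Days PM2.5": [86, 155, 71, 0, 107],
    "Days PM10": [0, 0, 0, 0, 0],
})

category_cols = [
    "Good Days", "Moderate Days", "Unhealthy for Sensitive Groups Days",
    "Unhealthy Days", "Very Unhealthy Days", "Hazardous Days",
]
pollutant_cols = ["Days CO", "Days NO2", "Days Ozone", "Days PM2.5", "Days PM10"]

# Row-wise sums compared with the total number of days with an AQI value.
category_ok = sample[category_cols].sum(axis=1).eq(sample["Days with AQI"])
pollutant_ok = sample[pollutant_cols].sum(axis=1).eq(sample["Days with AQI"])

print(category_ok.all(), pollutant_ok.all())
```

Both checks hold for these five rows, which is consistent with each measured day falling into exactly one AQI category and being attributed to one main pollutant.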


## Exercises

### Basic Questions:
- What is the average number of "Days with AQI" across all counties in the dataset?
- What is the maximum value of "Max AQI" recorded in any county?
- How many counties have "Good Days" as the majority category?
- What is the median number of "Days CO" across all counties?
- What is the range of "Days PM10" for the given dataset?
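
As a warm-up, the sketch below applies the basic aggregations (mean, maximum, median, range) to the five rows shown by `df.head()` above. The `sample` DataFrame is an illustrative stand-in; your answers should come from the full `df`.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days with AQI": [170, 155, 212, 118, 181],
    "Max AQI": [90, 83, 133, 90, 100],
    "Days CO": [0, 0, 0, 0, 0],
    "Days PM10": [0, 0, 0, 0, 0],
})

mean_days = sample["Days with AQI"].mean()      # arithmetic mean
max_aqi = sample["Max AQI"].max()               # largest value in the column
median_co = sample["Days CO"].median()          # middle value
pm10_range = sample["Days PM10"].max() - sample["Days PM10"].min()  # max minus min

print(mean_days, max_aqi, median_co, pm10_range)
```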

#### Q1. What is the average number of "Days with AQI" across all counties in the dataset?

In [60]:
### YOUR CODE HERE

#### Q2. What is the maximum value of "Max AQI" recorded in any county?

In [61]:
### YOUR CODE HERE

#### Q3. How many counties have "Good Days" as the majority category?

In [62]:
### YOUR CODE HERE
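
"Majority category" can be read in more than one way. The sketch below assumes it means that "Good Days" account for more than half of "Days with AQI"; another reading would be that "Good Days" is the largest of the six categories. It uses the five sample rows from `df.head()`, so the count is for those rows only.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days with AQI": [170, 155, 212, 118, 181],
    "Good Days": [143, 109, 155, 102, 126],
})

# Boolean mask: True where good days are more than half of all measured days.
good_majority = sample["Good Days"] > sample["Days with AQI"] / 2

# Summing a boolean Series counts the True values.
n_good_majority = int(good_majority.sum())
print(n_good_majority)
```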

#### Q4. What is the median number of "Days CO" across all counties?

In [63]:
### YOUR CODE HERE

#### Q5. What is the range of "Days PM10" for the given dataset?

In [64]:
### YOUR CODE HERE

### Medium Difficulty Questions:

- Which county has the highest number of "Unhealthy Days"?
- What is the 90th percentile of "Unhealthy for Sensitive Groups Days" across all counties?
- Calculate the interquartile range (IQR) for the "Days NO2" column.
- In how many counties does the "Hazardous Days" category occur at least once?
- What is the correlation between "Days Ozone" and "Days PM2.5"?

#### Q1. Which county has the highest number of "Unhealthy Days"?

In [65]:
### YOUR CODE HERE
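
A point worth checking here is ties. `Series.idxmax()` returns only the first row label with the maximum value, so it can hide counties that share the top spot. The sketch below shows this on the five sample rows from `df.head()`, where every county has zero "Unhealthy Days"; the names `first_label` and `tied` are illustrative.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Unhealthy Days": [0, 0, 0, 0, 0],
})

col = "Unhealthy Days"

# idxmax gives the label of the first occurrence of the maximum.
first_label = sample[col].idxmax()

# Filtering on the maximum keeps every tied row.
tied = sample[sample[col] == sample[col].max()]

print(sample.loc[first_label, ["State", "County", col]].tolist())
print(len(tied), "row(s) share the maximum")
```

Reporting "State" alongside "County" is a deliberate choice, since county names can repeat across states.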

#### Q2. What is the 90th percentile of "Unhealthy for Sensitive Groups Days" across all counties?

In [66]:
### YOUR CODE HERE

#### Q3. Calculate the interquartile range (IQR) for the "Days NO2" column.

In [67]:
### YOUR CODE HERE
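
Percentiles and the IQR both come from `Series.quantile`. In the five sample rows, "Days NO2" is zero everywhere, so the sketch below uses "Moderate Days" as a stand-in column to show the calls; the same pattern applies to "Days NO2" and "Unhealthy for Sensitive Groups Days" on the full `df`. It relies on pandas' default linear interpolation between data points.

```python
import pandas as pd

# "Moderate Days" for the five sample rows from the df.head() output.
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Moderate Days": [27, 46, 55, 16, 55],
})

col = "Moderate Days"  # stand-in column for illustration

# First and third quartiles, then the interquartile range.
q1, q3 = sample[col].quantile([0.25, 0.75])
iqr = q3 - q1

# 90th percentile of the same column.
p90 = sample[col].quantile(0.90)

print(q1, q3, iqr, p90)
```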

#### Q4. In how many counties does the "Hazardous Days" category occur at least once?

In [68]:
### YOUR CODE HERE

#### Q5. What is the correlation between "Days Ozone" and "Days PM2.5"?

In [69]:
### YOUR CODE HERE
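
`Series.corr` computes the Pearson correlation by default. The sketch below applies it to the five sample rows from `df.head()`; with so few rows the coefficient is only an illustration of the call, not an estimate for the whole dataset.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days Ozone": [84, 0, 141, 118, 74],
    "Days PM2.5": [86, 155, 71, 0, 107],
})

# Pearson correlation coefficient between the two columns.
r = sample["Days Ozone"].corr(sample["Days PM2.5"])
print(round(r, 3))
```

On these rows the coefficient is negative, which makes sense if the two columns compete for the same days: a day counted under ozone is not counted under PM2.5.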

### Difficult Questions:

- For counties with more than 200 "Days with AQI," what is the average "Median AQI"?
- Identify the county with the highest percentage of "Very Unhealthy Days" relative to the total "Days with AQI."
- Perform a t-test to compare the means of "Days Ozone" and "Days PM2.5" across all counties.
- Create a box plot for the distribution of "Days CO" and "Days NO2" to visualize their spread.
- Using linear regression fitted across counties, predict "Days PM10" from "Days PM2.5" for a selected county.

#### Q1. For counties with more than 200 "Days with AQI," what is the average "Median AQI"?

In [70]:
### YOUR CODE HERE
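
This question combines a boolean filter with an aggregation. A minimal sketch on the five sample rows from `df.head()`, where only one county exceeds 200 days:

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days with AQI": [170, 155, 212, 118, 181],
    "Median AQI": [40, 40, 43, 40, 43],
})

# Keep only counties with more than 200 measured days, then average their Median AQI.
over_200 = sample[sample["Days with AQI"] > 200]
avg_median_aqi = over_200["Median AQI"].mean()

print(len(over_200), avg_median_aqi)
```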

#### Q2. Identify the county with the highest percentage of "Very Unhealthy Days" relative to the total "Days with AQI."

In [71]:
### YOUR CODE HERE
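
The general pattern is to derive a percentage column and then locate its maximum. In the five sample rows, "Very Unhealthy Days" is zero everywhere, so the sketch below uses "Unhealthy for Sensitive Groups Days" as a stand-in; the column name `Share (%)` is introduced here for illustration.

```python
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "State": ["Alabama"] * 5,
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days with AQI": [170, 155, 212, 118, 181],
    "Unhealthy for Sensitive Groups Days": [0, 0, 2, 0, 0],
})

share_col = "Unhealthy for Sensitive Groups Days"  # stand-in for "Very Unhealthy Days"

# Percentage of measured days that fall in the chosen category.
sample["Share (%)"] = 100 * sample[share_col] / sample["Days with AQI"]

# Row with the largest percentage.
top = sample.loc[sample["Share (%)"].idxmax()]
print(top["State"], top["County"], round(top["Share (%)"], 3))
```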

#### Q3. Perform a t-test to compare the means of "Days Ozone" and "Days PM2.5" across all counties.

In [72]:
### YOUR CODE HERE
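
The question does not say which t-test to use. Because both columns are measured on the same counties, a paired test is one reasonable choice, and an independent two-sample test is another. The sketch below computes the paired t statistic by hand with NumPy on the five sample rows from `df.head()`. If SciPy is installed, `scipy.stats.ttest_rel` (paired) or `scipy.stats.ttest_ind` (independent) would also return a p-value.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the df.head() output (subset of columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "Days Ozone": [84, 0, 141, 118, 74],
    "Days PM2.5": [86, 155, 71, 0, 107],
})

# Paired design: work with the per-county difference.
diff = sample["Days Ozone"] - sample["Days PM2.5"]
n = len(diff)

mean_diff = diff.mean()
std_err = diff.std(ddof=1) / np.sqrt(n)  # standard error of the mean difference
t_stat = mean_diff / std_err             # t statistic with n - 1 degrees of freedom

print(f"mean difference = {mean_diff:.2f}, t = {t_stat:.4f}, df = {n - 1}")
```

A t statistic close to zero, as on these five rows, means the sample gives no evidence that the two means differ.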

#### Q4. Create a box plot for the distribution of "Days CO" and "Days NO2" to visualize their spread.

In [73]:
### YOUR CODE HERE

#### Q5. Using linear regression fitted across counties, predict "Days PM10" from "Days PM2.5" for a selected county.

In [74]:
### YOUR CODE HERE
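
Each county has a single row, so the regression line has to be fitted across counties and then evaluated for the county you select. In the five sample rows, "Days PM10" is zero everywhere, so the sketch below uses "90th Percentile AQI" and "Median AQI" as stand-in columns to show the mechanics with `np.polyfit`. The choice of DeKalb as the selected county is arbitrary.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the df.head() output (stand-in columns).
sample = pd.DataFrame({
    "County": ["Baldwin", "Clay", "DeKalb", "Elmore", "Etowah"],
    "90th Percentile AQI": [54, 61, 63, 54, 64],
    "Median AQI": [40, 40, 43, 40, 43],
})

x = sample["90th Percentile AQI"].to_numpy(dtype=float)  # predictor
y = sample["Median AQI"].to_numpy(dtype=float)           # response

# Least-squares straight line: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line for one selected county.
selected = sample.loc[sample["County"] == "DeKalb"].iloc[0]
predicted = slope * selected["90th Percentile AQI"] + intercept

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, predicted = {predicted:.2f}")
```

For the actual question, `x` would be "Days PM2.5" and `y` would be "Days PM10" from the full `df`.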

## Conclusion

In this Descriptive Statistics Mastery Exercise, we explored the `annual_aqi_by_county_2023.csv` dataset to see how air quality varies across counties. The questions progressed from basic to medium to difficult, using descriptive statistics in Python to extract the answers.

Our journey began with fundamental questions, providing an overview of averages, ranges, and counts. As we progressed to medium difficulty questions, we navigated through percentile calculations, interquartile ranges, and correlations between air quality metrics. The exercise concluded with advanced challenges, including statistical tests, visualizations, and predictive modeling.

By engaging in this exercise, you have honed your skills in utilizing Pandas, NumPy, and Matplotlib to analyze and interpret data. Descriptive statistics served as powerful tools for summarizing, visualizing, and drawing meaningful conclusions about air quality dynamics.

This exercise not only improved your proficiency in Python for statistical analysis but also gave you a practical view of how air quality varies across counties. Posing and answering questions with descriptive statistics is a skill that applies to a wide range of data analysis scenarios.

Continue exploring datasets, asking pertinent questions, and applying statistical techniques to gain deeper insights into the world of data analysis. Congratulations on completing this exercise, and may your data exploration journey continue to be both rewarding and enlightening!