
# Activity: Explore descriptive statistics

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to provide collaborators with a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports**


Import the relevant Python libraries `pandas` and `numpy`.

In [None]:
# Import relevant Python libraries.

### YOUR CODE HERE ###
import pandas as pd

import numpy as np

The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [None]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
epa_data = pd.read_csv("/content/c4_epa_air_quality.csv", index_col=0)

<details>
  <summary><h4><strong>Hint</strong></h4></summary>

  Use the `read_csv` function from the `pandas` library. The `index_col` parameter can be set to `0` to read in the first column as an index (and to avoid `"Unnamed: 0"` appearing as a column in the resulting DataFrame).

</details>
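To see what `index_col=0` changes, the sketch below reads a tiny in-memory CSV twice. The sample rows and the `io.StringIO` buffer are stand-ins for illustration only and are not the lab's actual file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose first column is a saved, unnamed index.
csv_text = ",state_name,aqi\n0,Arizona,7\n1,Ohio,5\n"

# Without index_col, pandas keeps the first column as data named "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Comparing the two column lists shows why the lab passes `index_col=0`: the saved index no longer shows up as an extra data column.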

## **Step 2: Data exploration**

To understand how the dataset is structured, display the first 10 rows of the data.

In [None]:
# Display first 10 rows of the data.

### YOUR CODE HERE

epa_data.head(10)


| | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie | | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


<details>
  <summary><h4><strong>Hint</strong></h4></summary>

  Use the `head()` function from the `pandas` library.

</details>

**Question:** What does the `aqi` column represent?

**The `aqi` column holds the Air Quality Index for each observation, a standardized measure of air pollution levels. Lower values indicate better air quality, and higher values indicate worse air quality. For example, an AQI of 5 or 7 indicates low pollution, whereas values above 100 indicate unhealthy levels.**

Now, get a table that contains some descriptive statistics about the data.

In [None]:
# Get descriptive stats.

### YOUR CODE HERE
table = epa_data.describe()
table


| | arithmetic_mean | aqi |
|---|---|---|
| count | 260.0 | 260.0 |
| mean | 0.403169 | 6.757692 |
| std | 0.317902 | 7.061707 |
| min | 0.0 | 0.0 |
| 25% | 0.2 | 2.0 |
| 50% | 0.276315 | 5.0 |
| 75% | 0.516009 | 9.0 |
| max | 1.921053 | 50.0 |


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate a table of basic descriptive statistics about the numeric columns in a DataFrame.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `describe()` function from the `pandas` library.

</details>

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

**The count for the `aqi` column is 260, the same as the count for `arithmetic_mean`. Because `describe()` counts only non-missing values, this suggests that there are no missing values in the `aqi` column.**
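A count can be checked against the number of rows directly. The sketch below uses a small hypothetical DataFrame with one missing reading to show how the `count` reported by `describe()` relates to `len()` and `isna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one AQI reading is missing.
sample = pd.DataFrame({"aqi": [7, 5, np.nan, 3]})

n_rows = len(sample)                         # total rows, including missing ones
n_non_null = int(sample["aqi"].count())      # what describe() reports as "count"
n_missing = int(sample["aqi"].isna().sum())  # rows where aqi is NaN

print(n_rows, n_non_null, n_missing)
```

When `n_non_null` equals `n_rows`, as appears to be the case for the lab's `aqi` column, there are no missing values.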

**Question:** What do you notice about the 25th percentile for the `aqi` column?

This is an important measure for understanding where the aqi values lie.

**The 25th percentile for the aqi column is 2. This means that 25% of the observations have an aqi value of 2 or lower.**

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This is another important measure for understanding where the aqi values lie.

**The 75th percentile for the `aqi` column is 9. This means that 75% of the observations have an AQI value of 9 or lower.**
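Percentiles can also be computed directly with NumPy. The sketch below uses eight hypothetical readings (not the lab's data) and checks what share of them falls at or below each cut point.

```python
import numpy as np

# Hypothetical AQI readings, not taken from the lab's dataset.
readings = np.array([0, 1, 2, 3, 5, 7, 9, 14])

# 25th and 75th percentiles (linear interpolation is NumPy's default).
q25, q75 = np.percentile(readings, [25, 75])

# Share of readings at or below each cut point.
share_below_q25 = np.mean(readings <= q25)
share_below_q75 = np.mean(readings <= q75)

# Interquartile range: the width of the middle half of the data.
iqr = q75 - q25

print(q25, q75, share_below_q25, share_below_q75, iqr)
```

With real data the shares are usually only approximately 25% and 75%, because of ties and interpolation between neighboring values.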

## **Step 3: Statistical tests**

Next, get some descriptive statistics about the states in the data.

In [None]:
# Get descriptive stats about the states in the data.

### YOUR CODE HERE

state_stats = epa_data['state_name'].describe()

print("Descriptive Statistics for State Names:")
state_stats


Descriptive Statistics for State Names:


| | state_name |
|---|---|
| count | 260 |
| unique | 52 |
| top | California |
| freq | 66 |


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `pandas` library that allows you to generate basic descriptive statistics about a DataFrame or a column you are interested in.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

 Use the `describe()` function from the `pandas` library. Note that this function can be used:
- "on a DataFrame (to find descriptive statistics about the numeric columns)"
- "directly on a column containing categorical data (to find pertinent descriptive statistics)"

</details>

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data?

**The data contains 260 observations across 52 unique `state_name` values. California appears most often, with 66 observations (about 25% of the rows), so it is heavily represented compared with other states. This could reflect a larger number of monitoring sites there, or a focus on areas with significant air quality concerns or regulatory requirements.**

Note: Sometimes you have to calculate statistics individually. To review that approach, use the `numpy` library to calculate each of the main statistics in the preceding table for the `aqi` column.
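The four values that `describe()` reports for a categorical column can also be computed one at a time. The sketch below does this for a short hypothetical Series of state labels, where each entry stands for one observation (row), not one monitoring site.

```python
import pandas as pd

# Hypothetical state labels, one per observation (row).
states = pd.Series(["California", "Ohio", "California", "Hawaii", "California"])

counts = states.value_counts()   # rows per state, most frequent first

count = int(states.count())      # non-missing observations
unique = int(states.nunique())   # distinct states
top = counts.index[0]            # most frequent state
freq = int(counts.iloc[0])       # number of rows for that state

print(count, unique, top, freq)
```

Here `freq` counts rows, so a high value means a state contributes many observations, which may or may not correspond to many distinct sites.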

## **Step 4. Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [None]:
# Compute the mean value from the aqi column.

### YOUR CODE HERE
aqi_values = epa_data['aqi'].values

# Calculate the mean of the 'aqi' values
mean_aqi = np.mean(aqi_values)

# Print the result
print(f"Mean AQI: {mean_aqi}")

Mean AQI: 6.757692307692308


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the mean value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `mean()` function from the `numpy` library.

</details>

**Question:** What do you notice about the mean value from the `aqi` column?

This is an important measure, as it tells you what the average air quality is based on the data.

**The mean AQI of 6.76 indicates generally good air quality across the observations in the dataset. An average this low suggests that most locations experience minimal carbon monoxide pollution, well within health standards for air quality.**

Next, compute the median value from the `aqi` column.

In [None]:
# Compute the median value from the aqi column.

### YOUR CODE HERE

median_aqi = np.median(aqi_values)

print("Median AQI: ",median_aqi)

Median AQI:  5.0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the median value from an array or a series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `median()` function from the `numpy` library.

</details>

**Question:** What do you notice about the median value from the `aqi` column?

This is an important measure for understanding the central location of the data.

**The median AQI of 5.0 means that half of the observations have an AQI of 5 or lower, which reflects generally good air quality. The mean (6.76) is higher than the median, which suggests a right-skewed distribution: most readings are low, while a small number of higher readings (up to the maximum of 50) pull the mean upward.**
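One way to check which way a distribution leans is to compare its mean with its median and compute its skewness. The readings below are hypothetical and chosen to include one large value in the right tail.

```python
import pandas as pd

# Hypothetical readings with one large value in the right tail.
readings = pd.Series([2, 2, 3, 5, 5, 6, 9, 50])

mean_value = readings.mean()
median_value = readings.median()
skewness = readings.skew()  # sample skewness; positive means a longer right tail

print(mean_value, median_value)
print(skewness > 0)
```

A mean that sits above the median, together with positive skewness, points to a right-skewed distribution.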

Next, identify the minimum value from the `aqi` column.

In [None]:
# Identify the minimum value from the aqi column.

### YOUR CODE HERE
min_aqi = np.min(aqi_values)
print("Minimum AQI: ",min_aqi)


Minimum AQI:  0


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the minimum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `min()` function from the `numpy` library.

</details>

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.


**The minimum AQI value of 0 indicates that the best air quality observed in the dataset is at a level considered excellent. This suggests that there are locations where carbon monoxide levels are negligible, posing no health risks. Such low readings are essential for understanding the potential for clean air zones and can serve as benchmarks for air quality improvement efforts in more polluted areas.**

Now, identify the maximum value from the `aqi` column.

In [None]:
# Identify the maximum value from the aqi column.

### YOUR CODE HERE
max_aqi = np.max(aqi_values)
print(f"Maximum AQI: {max_aqi}")


Maximum AQI: 50


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the maximum value from an array or a Series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

  Use the `max()` function from the `numpy` library.

</details>

**Question:** What do you notice about the maximum value from the `aqi` column?

This is an important measure, as it tells you which value in the data corresponds to the worst air quality observed in the data.

**The maximum AQI value of 50 is the worst air quality observed in the dataset. On the AirNow scale, values from 0 to 50 fall in the "Good" category, so even the highest reading sits at the upper edge of "Good" and below the "Moderate" range (51 to 100). Continued monitoring is still worthwhile for the sites that report the highest values.**
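AQI values can be mapped to their named categories with `pd.cut`. In the sketch below the category boundaries follow the AirNow AQI scale, while the readings and the lower bin edge of `-1` (used so that 0 is included) are illustrative choices.

```python
import pandas as pd

# Category boundaries follow the AirNow AQI scale; the readings are hypothetical.
bins = [-1, 50, 100, 150, 200, 300, 500]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

readings = pd.Series([0, 5, 50, 51, 101])

# pd.cut uses right-inclusive intervals by default, so 50 lands in "Good".
categories = pd.cut(readings, bins=bins, labels=labels)

print(categories.tolist())
```

Applied to a real `aqi` column, `categories.value_counts()` would give the number of observations in each category.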

Now, compute the standard deviation for the `aqi` column.

By default, the `numpy` library uses 0 as the Delta Degrees of Freedom (`ddof`), while the `pandas` library uses 1. To get the same standard deviation from either library, set the `ddof` parameter to 1 when calculating it with `numpy`.

In [None]:
# Compute the standard deviation for the aqi column.

### YOUR CODE HERE
std_aqi = np.std(aqi_values, ddof=1)


print(f"Standard Deviation of AQI: {std_aqi}")

Standard Deviation of AQI: 7.061706678820724


<details>
  <summary><h4><strong>Hint 1</strong></h4></summary>

Refer to the video section about descriptive statistics in Python.

</details>

<details>
  <summary><h4><strong>Hint 2</strong></h4></summary>

  There is a function in the `numpy` library that allows you to get the standard deviation from an array or a series of values.

</details>

<details>
  <summary><h4><strong>Hint 3</strong></h4></summary>

Use the `std()` function from the `numpy` library. Make sure to specify the `ddof` parameter as 1. To read more about this function,  refer to its documentation in the references section of this lab.

</details>

**Question:** What do you notice about the standard deviation for the `aqi` column?

This is an important measure of how spread out the aqi values are.

**The standard deviation of 7.06 for the `aqi` column indicates a moderate level of variability in air quality across the monitored sites. While many locations report low AQI values, some report noticeably higher values, which widens the spread. This variability helps identify areas that may need more attention or intervention to reduce air pollution.**
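The effect of `ddof` is easiest to see side by side. The sketch below uses hypothetical readings to compare NumPy's default (`ddof=0`, dividing by n), NumPy with `ddof=1` (dividing by n - 1), and the pandas default.

```python
import numpy as np
import pandas as pd

# Hypothetical readings.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(values)        # ddof=0: divide by n
sample_std = np.std(values, ddof=1)    # ddof=1: divide by n - 1
pandas_std = pd.Series(values).std()   # pandas defaults to ddof=1

print(population_std, sample_std, pandas_std)
```

The `ddof=1` result matches pandas and is slightly larger than the `ddof=0` result; the gap shrinks as the number of observations grows.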

## **Considerations**


**What are some key takeaways that you learned during this lab?**



1. **Understanding Descriptive Statistics:** The lab showed how to compute and interpret descriptive statistics for the AQI values, including the mean, median, minimum, maximum, and standard deviation. These metrics are essential for summarizing data and identifying trends in air quality.
2. **Data Variability:** The analysis revealed a moderate standard deviation in AQI values, indicating variability in air quality across different locations. This highlights the importance of continuous monitoring and targeted attention to the sites with the highest readings.
3. **Air Quality Insights:** The minimum AQI of 0 and maximum of 50 show that every observation falls within the "Good" category (0 to 50) on the AirNow scale, well below the level of 100 above which air quality is considered unhealthy for sensitive groups.
4. **Geographic Distribution:** The dataset covers 52 unique `state_name` values, although observations are unevenly distributed (California alone accounts for 66 of 260), which points to the value of localized analysis.
5. **Compute Statistics Using NumPy:** NumPy functions compute the mean, median, standard deviation, minimum, and maximum values. These built-in functions handle arrays efficiently, so a few lines of code can yield a full set of statistics without extensive boilerplate.
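The NumPy calls used in this lab can be gathered into one small helper. The function name `summarize` and the sample readings are hypothetical; this is a sketch of one possible arrangement, not part of the lab's required code.

```python
import numpy as np


def summarize(values):
    """Return the main descriptive statistics for a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    return {
        "mean": float(np.mean(arr)),
        "median": float(np.median(arr)),
        "min": float(np.min(arr)),
        "max": float(np.max(arr)),
        "std": float(np.std(arr, ddof=1)),  # ddof=1 matches the pandas default
    }


# Hypothetical readings.
summary = summarize([0, 2, 5, 9, 14])
print(summary)
```

In the lab itself, the same helper could be called on `epa_data['aqi'].values`, assuming the column has no missing values.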

**How would you present your findings from this lab to others? Consider the following relevant points noted by AirNow.gov as you respond:**
- "AQI values at or below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered to be unhealthy—at first for certain sensitive groups of people, then for everyone as AQI values increase."
- "An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million."

### Presentation of Findings



1. Overview of AQI Data:
   * The dataset includes air quality measurements from over 200 monitoring sites across various states, specifically focusing on carbon monoxide levels.

2. Key Statistics:
   * Mean AQI: 6.76, indicating generally good air quality.
   * Median AQI: 5.0, meaning that half of the observations have an AQI of 5 or lower.
   * Minimum AQI: 0, representing the best air quality observed.
   * Maximum AQI: 50, the upper bound of the "Good" category on the AirNow scale.
   * Standard Deviation: 7.06, reflecting moderate variability in air quality across different locations.

3. Health Implications:
   * According to AirNow.gov, AQI values at or below 100 are generally considered satisfactory. Every observation in this dataset is at or below 50, so all monitored sites are well within this range.

4. Carbon Monoxide Levels:
   * An AQI of 100 for carbon monoxide corresponds to a level of 9.4 parts per million (ppm). The highest `arithmetic_mean` in the data is about 1.92 ppm and the maximum AQI is 50, so carbon monoxide levels remain well below this threshold, indicating a low risk of adverse health effects.

5. Conclusion and Recommendations:
   * Overall, the data suggests that air quality is good across the monitored sites, but continued monitoring is needed to catch any spikes in pollution levels promptly.
   * Sites with the highest AQI readings are the natural places to focus efforts to reduce carbon monoxide emissions.


**What summary would you provide to stakeholders? Use the same information provided previously from AirNow.gov as you respond.**

### Summary for Stakeholders

The analysis of air quality data focused on carbon monoxide levels across various states has yielded several key insights:

1. **Overall Air Quality**: The mean AQI value is **6.76**, and the median is **5.0**, indicating that air quality is generally good across the monitored sites. The minimum AQI recorded is **0**, reflecting instances of excellent air quality.

2. **Variability in Air Quality**: The maximum AQI value is **50**, the upper bound of the "Good" category, so even the highest reading in the data does not reach the "Moderate" range (51 to 100).

3. **Health Implications**: According to AirNow.gov, AQI values at or below **100** are considered satisfactory. Every observation in this dataset is at or below **50**, so all monitored sites are well within this range.

4. **Carbon Monoxide Levels**: An AQI of **100** corresponds to a carbon monoxide level of **9.4 parts per million (ppm)**. The maximum AQI of **50** indicates that carbon monoxide levels are significantly lower than this threshold, suggesting a lower risk for adverse health effects.

5. **Recommendations**: Continuous monitoring is essential to maintain air quality standards and address any spikes in pollution levels promptly. Targeted interventions may be necessary in regions experiencing higher AQI readings to protect public health.

This summary provides a clear understanding of the current state of air quality concerning carbon monoxide and highlights areas for potential action to ensure ongoing public health safety.
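The AirNow thresholds quoted above can be checked against the data in a few lines. The sketch below uses three hypothetical rows with the lab's column names. The constant names are illustrative, and comparing `arithmetic_mean` directly with 9.4 ppm is a simplifying assumption about how the two measures line up.

```python
import pandas as pd

AQI_LIMIT = 100      # AirNow: values above 100 are considered unhealthy
CO_PPM_AT_100 = 9.4  # AirNow: CO level corresponding to an AQI of 100

# Hypothetical rows using the lab's column names.
sample = pd.DataFrame({
    "arithmetic_mean": [0.47, 0.26, 1.92],
    "aqi": [7, 5, 50],
})

# Count observations that exceed each threshold.
n_above_aqi = int((sample["aqi"] > AQI_LIMIT).sum())
n_above_ppm = int((sample["arithmetic_mean"] > CO_PPM_AT_100).sum())

# Distance between the highest reading and the 9.4 ppm level.
headroom_ppm = CO_PPM_AT_100 - sample["arithmetic_mean"].max()

print(n_above_aqi, n_above_ppm, round(headroom_ppm, 2))
```

Running the same comparisons on `epa_data` would give stakeholders a direct count of observations above each threshold.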





**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.