# Activity: Explore descriptive statistics

## **Introduction**

Data professionals often use descriptive statistics to understand the data they are working with and to give collaborators a summary of the relative location of values in the data, as well as information about its spread.

For this activity, you are a member of an analytics team for the United States Environmental Protection Agency (EPA). You are assigned to analyze data on air quality with respect to carbon monoxide, a major air pollutant. The data includes information from more than 200 sites, identified by state, county, city, and local site names. You will use Python functions to gather statistics about air quality, then share insights with stakeholders.

## **Step 1: Imports** 


Import the relevant Python libraries `pandas` and `numpy`.

In [None]:
# Import relevant Python libraries.

import pandas as pd
import numpy as np

The dataset provided is in the form of a .csv file named `c4_epa_air_quality.csv`. It contains a subset of data from the U.S. EPA. As shown in this cell, the dataset has been automatically loaded for you. You do not need to download the .csv file or provide more code in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [3]:
# RUN THIS CELL TO IMPORT YOUR DATA.

### YOUR CODE HERE
epa_data = pd.read_csv("c4_epa_air_quality.csv", index_col = 0)

## **Step 2: Data exploration** 

To understand how the dataset is structured, display the first 10 rows of the data.

In [4]:
# Display first 10 rows of the data.

epa_data.head(10)

| Unnamed: 0 | date_local | state_name | county_name | city_name | local_site_name | parameter_name | units_of_measure | arithmetic_mean | aqi |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Arizona | Maricopa | Buckeye | BUCKEYE | Carbon monoxide | Parts per million | 0.473684 | 7 |
| 1 | 2018-01-01 | Ohio | Belmont | Shadyside | Shadyside | Carbon monoxide | Parts per million | 0.263158 | 5 |
| 2 | 2018-01-01 | Wyoming | Teton | Not in a city | Yellowstone National Park - Old Faithful Snow ... | Carbon monoxide | Parts per million | 0.111111 | 2 |
| 3 | 2018-01-01 | Pennsylvania | Philadelphia | Philadelphia | North East Waste (NEW) | Carbon monoxide | Parts per million | 0.3 | 3 |
| 4 | 2018-01-01 | Iowa | Polk | Des Moines | CARPENTER | Carbon monoxide | Parts per million | 0.215789 | 3 |
| 5 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.994737 | 14 |
| 6 | 2018-01-01 | Hawaii | Honolulu | Not in a city | Kapolei | Carbon monoxide | Parts per million | 0.2 | 2 |
| 7 | 2018-01-01 | Pennsylvania | Erie | Erie |  | Carbon monoxide | Parts per million | 0.2 | 2 |
| 8 | 2018-01-01 | Hawaii | Honolulu | Honolulu | Honolulu | Carbon monoxide | Parts per million | 0.4 | 5 |
| 9 | 2018-01-01 | Colorado | Larimer | Fort Collins | Fort Collins - CSU - S. Mason | Carbon monoxide | Parts per million | 0.3 | 6 |


**Question:** What does the `aqi` column represent?

AQI is a yardstick that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 or below represents good air quality, while an AQI value over 300 represents hazardous air quality.
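As a rough illustration of how the scale is read, the sketch below buckets AQI values into the category bands published in the EPA AQI guide. The `bins` and `labels` names are chosen here for illustration, and the `aqi` values are copied from the first 10 rows displayed above rather than loaded from the file.

```python
import pandas as pd

# AQI category bands from the EPA AQI guide.
# Upper edges are inclusive, so an AQI of 50 is still "Good".
bins = [-1, 50, 100, 150, 200, 300, 500]
labels = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]

# aqi values from the first 10 rows shown above (not the full dataset).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

# pd.cut assigns each value to the band whose interval contains it.
aqi_category = pd.cut(aqi, bins=bins, labels=labels)
category_counts = aqi_category.value_counts()
print(category_counts)
```

In this small sample every value lands in the "Good" band.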

**Question:** In what units are the aqi values expressed?

AQI values have no units. The index is a dimensionless number that indicates how polluted the air is.

Now, get a table that contains some descriptive statistics about the data.

In [5]:
# Get descriptive stats.
epa_data['aqi'].describe()

count    260.000000
mean       6.757692
std        7.061707
min        0.000000
25%        2.000000
50%        5.000000
75%        9.000000
max       50.000000
Name: aqi, dtype: float64

**Question:** Based on the table of descriptive statistics, what do you notice about the count value for the `aqi` column?

The count is 260. Because `describe()` counts only non-null values and this matches the number of rows in the dataset, the `aqi` column has no missing values.
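The relationship between the count and missing values can be checked directly. The sketch below uses a hypothetical five-row column with one missing reading (not taken from the EPA file) to show how the row count, the non-null count, and the number of missing values relate.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing reading, for illustration only.
aqi_with_gap = pd.Series([7, 5, np.nan, 3, 3], name="aqi")

n_rows = len(aqi_with_gap)             # every row, including NaN
n_non_null = aqi_with_gap.count()      # the same "count" that describe() reports
n_missing = aqi_with_gap.isna().sum()  # number of NaN entries

print(n_rows, n_non_null, n_missing)
```

When `n_rows` and `n_non_null` agree, as they do for `aqi` in this activity, the column has no missing values.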

**Question:** What do you notice about the 25th percentile for the `aqi` column?

The 25th percentile is 2, which means that 25% of the values in the `aqi` column are 2 or lower. In other words, at least a quarter of the data is at or below this value.

**Question:** What do you notice about the 75th percentile for the `aqi` column?

This indicates that 75% of the data points in the "aqi" column have a value of 9 or below. In other words, at least three-quarters of the data falls below or equal to this value, suggesting that a significant portion of the data is clustered in the lower range.
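The percentiles reported by `describe()` can also be computed on their own with `Series.quantile`. The following is a minimal sketch using only the `aqi` values from the first 10 rows displayed earlier, so the numbers differ from those of the full dataset.

```python
import pandas as pd

# aqi values from the first 10 rows shown earlier (not the full dataset).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

# 25th, 50th, and 75th percentiles (linear interpolation by default).
quartiles = aqi.quantile([0.25, 0.5, 0.75])

# Interquartile range: the width of the middle 50% of the data.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

# Share of values at or below the 25th percentile.
share_at_or_below_q1 = (aqi <= quartiles.loc[0.25]).mean()

print(quartiles)
print(iqr, share_at_or_below_q1)
```

The interquartile range is a compact way to report how tightly the middle half of the values is clustered.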

## **Step 3: Statistical tests** 

Next, get some descriptive statistics about the states in the data.

In [6]:
# Get descriptive stats about the states in the data.

epa_data['state_name'].describe()

count            260
unique            52
top       California
freq              66
Name: state_name, dtype: object

**Question:** What do you notice while reviewing the descriptive statistics about the states in the data? 

The 260 records contain 52 unique values in `state_name`. California is the most frequent, appearing in 66 of the 260 records (about 25%).
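For a categorical column, `describe()` reports only the single most frequent value. A fuller picture comes from `value_counts`, sketched below on the `state_name` values from the first 10 rows displayed earlier (so California does not appear in this small sample).

```python
import pandas as pd

# state_name values from the first 10 rows shown earlier (not the full dataset).
state_name = pd.Series(
    ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
     "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    name="state_name",
)

state_counts = state_name.value_counts()  # records per state, most frequent first
n_unique = state_name.nunique()           # same as "unique" in describe()
top_state = state_counts.idxmax()         # same as "top" in describe()
top_share = state_counts.max() / len(state_name)

print(state_counts)
print(n_unique, top_state, top_share)
```

Checking the share of the most frequent state helps judge whether a summary statistic is dominated by one region.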

## **Step 4: Results and evaluation**

Now, compute the mean value from the `aqi` column.

In [20]:
# Compute the mean value from the aqi column.

epa_data['aqi'].mean()

6.757692307692308

**Question:** What do you notice about the mean value from the `aqi` column?

The mean shows that the `aqi` values average about 6.76. However, the mean can be influenced by extreme values, such as the maximum value of 50. Therefore, while the mean provides a measure of central tendency, it may not fully represent the typical values of the `aqi` column, especially if outliers are present.

Next, compute the median value from the `aqi` column.

In [21]:
# Compute the median value from the aqi column.

epa_data['aqi'].median()

5.0

**Question:** What do you notice about the median value from the `aqi` column?

The median of 5 means that half of the values in the `aqi` column are 5 or lower and half are 5 or higher.
Because the mean (about 6.76) is greater than the median (5), the distribution is likely skewed to the right: a few high values, such as the maximum of 50, pull the mean upward while leaving the median unchanged. The median is robust to extreme values and outliers, which makes it a useful indicator of the typical value in the `aqi` column.
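The comparison between mean and median can be made concrete with a short check. This sketch uses the `aqi` values from the first 10 rows displayed earlier, where a single high value (14) plays the role that the maximum of 50 plays in the full data, and `Series.skew` gives a sample skewness estimate.

```python
import pandas as pd

# aqi values from the first 10 rows shown earlier (not the full dataset).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

mean_aqi = aqi.mean()
median_aqi = aqi.median()
skewness = aqi.skew()  # positive values indicate a longer right tail

print(mean_aqi, median_aqi, skewness)

# Dropping the single largest value shows how strongly it affects the mean.
mean_without_max = aqi.drop(aqi.idxmax()).mean()
print(mean_without_max)
```

A mean above the median together with positive skewness is the usual signature of a right-skewed distribution.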

Next, identify the minimum value from the `aqi` column.

In [22]:
# Identify the minimum value from the aqi column.

epa_data['aqi'].min()

0

**Question:** What do you notice about the minimum value from the `aqi` column?

This is an important measure, as it tells you the best air quality observed in the data.

In the context of air quality index (AQI), where higher values typically indicate poorer air quality, a minimum value of 0 might suggest either a measurement error or the absence of air pollution in that particular instance. However, it is important to consider the context and the data collection process to determine the exact interpretation of this minimum value.

Now, identify the maximum value from the `aqi` column.

In [23]:
# Identify the maximum value from the aqi column.

epa_data['aqi'].max()

50

**Question:** What do you notice about the maximum value from the `aqi` column?

Higher AQI values indicate poorer air quality, so the maximum of 50 represents the worst air quality observed in this data. Even so, it sits at the upper edge of the range described earlier as good air quality (an AQI of 50 or below). In other words, no observation in this subset moves beyond the good range with respect to carbon monoxide.
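Knowing the extreme values is more useful when you can also see where they occurred. The sketch below, which assumes a small `sample` DataFrame built from three columns of the first 10 rows displayed earlier, uses `idxmin` and `idxmax` to retrieve the full rows for the lowest and highest AQI.

```python
import pandas as pd

# Three columns from the first 10 rows shown earlier (not the full dataset).
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "city_name": ["Buckeye", "Shadyside", "Not in a city", "Philadelphia",
                  "Des Moines", "Not in a city", "Not in a city", "Erie",
                  "Honolulu", "Fort Collins"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

# idxmax / idxmin return the index label of the first occurrence.
max_row = sample.loc[sample["aqi"].idxmax()]
min_row = sample.loc[sample["aqi"].idxmin()]

print(max_row)
print(min_row)
```

Note that when several rows share the minimum or maximum, only the first one is returned; a boolean filter such as `sample[sample["aqi"] == sample["aqi"].min()]` lists them all.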

Now, compute the standard deviation for the `aqi` column.

In [26]:
# Compute the standard deviation for the aqi column.

epa_data['aqi'].std()

7.0617066788207215

**Question:** What do you notice about the standard deviation for the `aqi` column? 

The standard deviation of about 7.06 is slightly larger than the mean of about 6.76, so the `aqi` values are widely spread relative to their average. Because AQI cannot be negative, much of this spread comes from a small number of high values, such as the maximum of 50, rather than from values far below the mean.
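One detail worth knowing when comparing results across libraries: pandas `Series.std` uses the sample formula (`ddof=1`) by default, while `numpy.std` uses the population formula (`ddof=0`). The sketch below shows the difference on the `aqi` values from the first 10 rows displayed earlier; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# aqi values from the first 10 rows shown earlier (not the full dataset).
aqi = pd.Series([7, 5, 2, 3, 3, 14, 2, 2, 5, 6], name="aqi")

sample_std = aqi.std()                         # pandas default: ddof=1
population_std = np.std(aqi.to_numpy())        # numpy default: ddof=0
matched_std = np.std(aqi.to_numpy(), ddof=1)   # numpy made to match pandas

# Coefficient of variation: spread expressed relative to the mean.
coef_variation = sample_std / aqi.mean()

print(sample_std, population_std, matched_std, coef_variation)
```

The two defaults diverge most on small samples, so it helps to state which one a reported standard deviation uses.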

## **Considerations**


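- 75% of the AQI values in the data are 9 or lower, and the maximum is 50, so every observation in this subset falls within the range described earlier as good air quality (an AQI of 50 or below). AQI itself is unitless; parts per million applies to the `arithmetic_mean` concentration column.
- Funding should be allocated for further investigation of any regions found to have unhealthy levels of carbon monoxide, in order to learn how to improve conditions there.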
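One possible starting point for identifying regions that merit closer attention is a per-state summary. The sketch below is an assumption about how such a follow-up might look, not part of the original activity; it groups a small `sample` DataFrame (built from the first 10 rows displayed earlier) by `state_name` and ranks states by their highest observed AQI.

```python
import pandas as pd

# Two columns from the first 10 rows shown earlier (not the full dataset).
sample = pd.DataFrame({
    "state_name": ["Arizona", "Ohio", "Wyoming", "Pennsylvania", "Iowa",
                   "Hawaii", "Hawaii", "Pennsylvania", "Hawaii", "Colorado"],
    "aqi": [7, 5, 2, 3, 3, 14, 2, 2, 5, 6],
})

# Per-state record count, mean AQI, and maximum AQI, worst maximum first.
state_summary = (
    sample.groupby("state_name")["aqi"]
    .agg(["count", "mean", "max"])
    .sort_values("max", ascending=False)
)

print(state_summary)
```

States with only one or two records should be read cautiously, since a single reading can dominate the summary.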

**References**

[Air Quality Index - A Guide to Air Quality and Your Health](https://www.airnow.gov/sites/default/files/2018-04/aqi_brochure_02_14_0.pdf). (2014, February).

[Numpy.Std — NumPy v1.23 Manual](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

US EPA, OAR. (2014, 8 July). [*Air Data: Air Quality Data Collected at Outdoor Monitors Across the US*](https://www.epa.gov/outdoor-air-quality-data).