# Class activity: Peak over threshold algorithm development 


## 1 Introduction: Air quality (ozone) regulations in the US


### 1.1 Background

#### The 2015 National Ambient Air Quality Standards (NAAQS) for Ozone is a rule issued by the Environmental Protection Agency (EPA)

The rule sets the standards used to regulate air quality. It starts with a summary statement:  

"_SUMMARY: Based on its review of the air quality criteria for ozone (O3) and related photochemical oxidants and national ambient air quality standards (NAAQS) for O3, the Environmental Protection Agency (EPA) is revising the primary and secondary NAAQS for O3 to provide requisite protection of public health and welfare, respectively._"
 [[1]](https://www.epa.gov/ground-level-ozone-pollution/2015-revision-2008-ozone-national-ambient-air-quality-standards-naaqs)

Monitoring ozone concentrations is therefore important: it allows the public to be informed about health risks when ozone levels are too high.


#### The EPA lists critical threshold levels for near-surface ozone concentrations on its web page.

"_Based on extensive scientific evidence about the effects of ozone on public health and welfare, on October 1, 2015, EPA strengthened the ground-level ozone standard to 0.070 ppm, averaged over an 8-hour period. This standard is met at an air quality monitor when the 3-year average of the annual fourth-highest daily maximum 8-hour average ozone concentration is less than or equal to 0.070 ppm_" 
 [[2]](https://www3.epa.gov/region1/airquality/avg8hr.html)

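__The critical value is 0.070 ppm__ for the daily maximum 8-hour average, according to the revised standard set by the EPA. In this activity, a day counts as an exceedance when its value is above 0.070 ppm.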
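
The statistic in the quoted rule can be illustrated with a short calculation. The sketch below uses made-up daily values and a hypothetical helper named `fourth_highest`; it picks the fourth-highest daily maximum in each of three years and averages them. It ignores any data-completeness or rounding conventions the EPA may apply, so it only illustrates the idea.

```python
def fourth_highest(daily_max):
    """Return the fourth-highest value in a list of daily maximum 8-hour averages."""
    ranked = sorted(daily_max, reverse=True)  # largest value first
    return ranked[3]                          # index 3 is the fourth-highest

# Made-up daily maxima in ppm (a real year has up to 365 or 366 values)
year_1 = [0.081, 0.075, 0.072, 0.071, 0.066, 0.050]
year_2 = [0.074, 0.073, 0.070, 0.069, 0.061, 0.048]
year_3 = [0.079, 0.072, 0.071, 0.067, 0.064, 0.052]

# One fourth-highest value per year, then the 3-year average
annual_values = [fourth_highest(v) for v in (year_1, year_2, year_3)]
three_year_average = sum(annual_values) / len(annual_values)

# The standard is met when the average is less than or equal to 0.070 ppm
standard_met = three_year_average <= 0.070

print(annual_values)
print(round(three_year_average, 4), standard_met)
```

Note that a single day above 0.070 ppm does not by itself mean the standard is violated, because only the fourth-highest value of each year enters the average.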


### 1.2 Objectives

- Apply the Python data visualization functions you have learned to present the data, explore the main features of the time series, and describe any interesting patterns that you can identify by visual inspection.
- Develop code that counts the number of days on which the ozone threshold was exceeded.

---

In [16]:
%matplotlib inline
from IPython.display import HTML
from IPython.display import display

# import packages that we need to read the data files
# convert date strings into numerical values for plotting time series
# pandas is a powerful (but a bit more complicated) package to work with spreadsheet data
import pandas as pd
import datetime as dt 

# our two main packages for data analysis
import numpy as np
import matplotlib.pyplot as plt

### A supporting function is defined in the next code cell. 

We run the code cell so that we can call the function _load_data_ in the main part of our code,
but we do not look at its code in detail. Go to the main part below.

In [17]:
def load_data(city,pollutant='ozone'):
    """A supporting function to load ozone data from a csv file
    
    Args:
        city (str): A string for the city name (must match string in file names).
        pollutant (str): name of the pollutant (defaults to 'ozone')
            Use this second parameter to assign another string 
    
    Returns:
        x (numpy array): an array with the dates (values have type datetime)
        y (numpy array): an array with the concentration
    """
    path="/nfs/home11/staff/timm/Public/Data/hw2/"
    filename=city+'_'+pollutant+'.csv'
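    try:
        # open and immediately close the file to check that it is readable
        with open(path+filename,'r'):
            pass
        is_file=True
        print (80*"+")
        print ("Load data for "+city+" pollutant: "+pollutant)
        print ("Local file is "+path+filename)
        print (80*"+")
    except OSError:
        # file is missing or cannot be read
        print("Warning: could not open file "+path+filename)
        is_file=False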
    
    if is_file:
        df=pd.read_csv(path+filename)
        print(80*"-")
        print ("+ Name of data columns in the Pandas Dataframe:")
        for name in df.columns:
            print (name)
        print(80*"-")
        ########################################################
        # pre-processing of the data
        ########################################################
        
        ########################################################
        # 1. convert the date data (type string) into numerical 
        # values (useful for plotting in plt.plot)
        ########################################################
        
        dates=df['Date'] # extracts the column named 'Date' from dataframe
        datelist=[]
        n=0
        for d in dates: # dates is iterable
            # take the string and convert into a numerical value
            value=dt.datetime.strptime(d,'%m/%d/%y')
            datelist.append(value)
            n=n+1
        
        # 
        x=np.array(datelist) # convert the list with datetimevalues into numpy array
        
        ########################################################
        # 2. extract column with the ozone concentration data
        ########################################################
        # gets data in a type numpy array
        y=df['Daily Max 8-hour Ozone Concentration'].values 
        # units we expect to be the same in each row, so we get one cell value
        unit=df['UNITS'][0] 
        print ("Loaded the data successfully!")
        print ("Number of days in file: "+str(n))
        print ("Dates:"+str(x[0])+" to "+str(x[-1]))
        print ("Concentration values range from: ")
        print ("%12.4f to %12.4f" % (np.nanmin(y), np.nanmax(y)))
        print ("Units: "+unit)
        
    else:
        # no data could be loaded (a warning was printed above),
        # so return missing values for both outputs
        x,y = np.nan, np.nan
    
    return x,y
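
The date conversion inside `load_data` relies on `datetime.strptime` with the format string `'%m/%d/%y'` (month/day/two-digit year). As a minimal sketch with made-up date strings, the same conversion can be tried on its own:

```python
import datetime as dt

# Made-up date strings in the same month/day/two-digit-year format
date_strings = ['01/01/19', '07/04/19', '12/31/19']

# Convert each string into a datetime object
parsed = [dt.datetime.strptime(s, '%m/%d/%y') for s in date_strings]

print(parsed[1])                     # the second date as a datetime object
print((parsed[2] - parsed[0]).days)  # number of days between first and last date
```

Because the results are `datetime` objects and not strings, they can be subtracted from each other and used directly on the time axis of a plot.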

---
## Here begins the main part of the Python code
---

In [18]:
help(load_data)

Help on function load_data in module __main__:

load_data(city, pollutant='ozone')
    A supporting function to load ozone data from a csv file
    
    Args:
        city (str): A string for the city name (must match string in file names).
        pollutant (str): name of the pollutant (defaults to 'ozone')
            Use this second parameter to assign another string 
    
    Returns:
        x (numpy array): an array with the dates (values have type datetime)
        y (numpy array): an array with the concentration



Note: we have five data files for these cities:

| City          | String to use in function  load_data()  |
|---------------|-----------------------------------------|
| New York City | 'nyc'                                   |
| Los Angeles   | 'los_angeles'                           |
| Houston       | 'houston_tx'                            |
| Philadelphia  | 'philadelphia_pa'                       |
| Phoenix       | 'phoenix_az'                            |


In [19]:
x,y = load_data('nyc')

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Load data for nyc pollutant: ozone
Local file is /nfs/home11/staff/timm/Public/Data/hw2/nyc_ozone.csv
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
--------------------------------------------------------------------------------
+ Name of data columns in the Pandas Dataframe:
Date
Site ID
Daily Max 8-hour Ozone Concentration
UNITS
SITE_LATITUDE
SITE_LONGITUDE
Unnamed: 6
Unnamed: 7
--------------------------------------------------------------------------------
Loaded the data successfully!
Number of days in file: 365
Dates:2019-01-01 00:00:00 to 2019-12-31 00:00:00
Concentration values range from: 
      0.0070 to       0.0810
Units: ppm


### Solution with for-loop:

In [20]:
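i=0 # index into the arrays (also counts the loop iterations)
e=0 # counter for the exceedance events
for a in x: # one iteration per day in the record
    if y[i] > 0.07: # daily value above the EPA threshold (ppm)
        print(y[i])
        e=e+1
    i=i+1
print('The number of exceedance events:')
print(e)

print('The number of loop iterations:')
print(i)

0.081
0.071
The number of exceedance events:
2
The number of loop iterations:
365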


### Solution with while-loop:

In [28]:
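i=0 # index into the arrays (also counts the loop iterations)
e=0 # counter for the exceedance events
while i < len(x): # stop after the last day in the record
    if y[i] > 0.07: # daily value above the EPA threshold (ppm)
        print(y[i])
        e=e+1
    i=i+1
print('The number of exceedance events:')
print(e)

print('The number of loop iterations:')
print(i)

0.081
0.071
The number of exceedance events:
2
The number of loop iterations:
365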
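
Both loops can also be replaced by a NumPy comparison that acts on the whole array at once. The sketch below is an alternative to the loop solutions, not part of them; it uses short made-up arrays as stand-ins for the `x` and `y` returned by `load_data`, and the names `threshold`, `above`, `n_exceed` and `exceed_dates` are chosen here for illustration.

```python
import numpy as np

# Made-up stand-ins for the arrays returned by load_data()
y = np.array([0.045, 0.081, np.nan, 0.071, 0.070, 0.052])
x = np.array(['2019-07-01', '2019-07-02', '2019-07-03',
              '2019-07-04', '2019-07-05', '2019-07-06'])

threshold = 0.07  # ppm

# Boolean array: True where the threshold is exceeded.
# A missing value (NaN) compares as False, so it is never counted.
above = y > threshold

n_exceed = int(np.sum(above))  # True counts as 1, False as 0
exceed_dates = x[above]        # dates of the exceedance days

print(n_exceed)
print(exceed_dates)
```

With the real data, the same three lines (`above`, `n_exceed`, `exceed_dates`) should give the same count as the loops, and the boolean array additionally tells us on which dates the exceedances occurred.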


---
