# PDAP-2018: Exam Homework Exercise 2

## Summer term 2018
### University of Bremen / Dr. Andreas Hilboll

This is the second graded homework assignment for the course *Practical Data Analysis with Python*.  There will be one more graded homework assignment towards the end of the semester; your total course grade will consist of the aggregated grades of all three homework assignments.

## Rules and regulations

### When to submit
This homework must be submitted by Monday, 02 July 2018, 08:00:00 CET.

### How to submit
For now, you can submit this homework assignment by sending the Jupyter notebook (`.ipynb` file) to [hilboll@uni-bremen.de](mailto:hilboll@uni-bremen.de).

**Note:** It is *your* responsibility to make sure that your homework submission reaches me in time.  If in doubt, submit early and ask me if I received your file.

### Technical requirements
Your solution has to be written to this Jupyter Notebook.  Please rename the notebook so that your last name(s) is/are included in the filename.  Your solution has to consist of exactly one (i.e., this) file.  Please leave the blocks labeled **EVALUATION** in place so that I can fill them out when correcting your submission.

### Study groups
You are allowed to complete this homework assignment either alone or in groups of up to two students.  In case you do not do your homework alone, please clearly state who has contributed how much to which part of the homework.

### Discussion
At the end of the semester, there will be a separate ~20 minute oral discussion for each study group, in which all students are expected to be able to demonstrate that they understand the code they submitted.

### Using internet resources
You are allowed to use book and/or internet resources to complete this homework assignment.  However, you are expected to a) clearly state any reference you have used to complete the assignment (e.g., by giving the URL to a website in a code comment), and b) be able to explain the code you are writing.

### Evaluation criteria
You will be graded on all *tasks* laid out below.  There are some possibilities to earn extra credit, clearly indicated in the task description.

## Background

NO2 is a trace gas which is produced mainly from the burning of fossil fuels;
other (natural) sources include biomass burning (forest fires, agricultural
fires), lightning, and microbial emissions from soils.

MAX-DOAS stands for *Multi-AXis Differential Optical Absorption Spectroscopy*.
The instrument consists of a telescope and a spectrometer, which measure the
intensity of scattered sunlight in different elevation and azimuth directions.
DOAS is an application of the Beer-Lambert law, in which the integrated trace
gas concentration along the average light path (from the sun to the instrument),
called *slant column* or *slant column density*, is derived from the trace gas'
absorption cross section (measured in the laboratory) and the attenuation of the
scattered sunlight in the atmosphere.  The slant column density is in units of
*molecules per ground area*.  As it depends strongly on the length of the light
path, it is larger close to sunrise and sunset, when the sun is low, compared to
midday, when the sun is high.

The elevation angle is the angle between the vertical (pointing downwards) and
the viewing elevation of the telescope, i.e., it is 90° for looking towards the
horizon and 180° for looking towards the zenith.

The azimuth angle is the geographical direction of the telescope line-of-sight.
In these data files, it is defined to go from -180° to 180°, with -90° being
East, 0° being South, and 90° being West.

## Technical comments
- The filename of the NO2 data files (`*.VisNO2A`) contains two pieces of
  information, the date (in the form `YYMMDD`) and one of five viewing azimuth
  directions (`SS`, `TS`, `US`, `VS`, `WS`).  For example, the file
  `130624VS.VisNO2A` contains measurements from 24 Jun 2013 for the azimuth
  direction `VS`.
- The data files start with a description of the file contents. Each of the first comment lines, starting with `*`, contains information on the contents of one column.  E.g., the first column contains information on *Day of Year 1993* and the second column contains information on *Uhrzeit [UT]*.
- The NO2 slant column density is contained in the column *Schräge Säule NO2*.
- The column *Day of Year 1993* contains the days which passed since
  1992-12-31T00:00:00 UTC.  This means that for example 23 Jun 2013 has values
  between 7479.0 and 7480.0.
- The column *Line of Sight* contains the elevation angle in degrees.
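
The day count described above can be checked with a few lines of standard-library Python. This is only a minimal sketch: the helper name `days_since_epoch_to_datetime` is hypothetical, and the task below suggests `netCDF4.num2date` as an alternative way to do the same conversion.

```python
from datetime import datetime, timedelta

# Reference epoch as stated in the file description: 1992-12-31T00:00:00 UTC
EPOCH = datetime(1992, 12, 31)


def days_since_epoch_to_datetime(days):
    """Convert a fractional 'Day of Year 1993' value to a naive UTC datetime.

    Hypothetical helper, for illustration only.
    """
    return EPOCH + timedelta(days=days)


# 7479.5 lies between 7479.0 and 7480.0, i.e. on 23 Jun 2013
noon = days_since_epoch_to_datetime(7479.5)
print(noon)
```
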

## Task 0: Participants

Please fill in your personal details into the following table:

| Last name | First name | Study program | Student ID |
|-----------|------------|---------------|------------|
| FOO       | Alice      | PEP           | 1234567    |
| BAR       | Bob        | SST           | 7654321    |

## Part 1: NO2 data analysis for the Athens measurements

### Data download
Download the NO2 data files from here: https://seafile.zfn.uni-bremen.de/f/097046dd20/ (download size ~85M; uncompressed size ~260M) and unzip them to a new folder in your course repository. **Note:** Do not commit this directory to version control!

### Task 1.1
Write a function to read a single data file.

Use the `pandas.read_csv()` function to read the data file.  You can use the following optional keyword arguments:
- `names` -> to define what the columns in the data frame should be called
- `encoding` -> needed to accommodate some special characters in the files.  Set it to the string `"cp1252"`
- `comment` -> specify the character that marks the rest of a line as a comment, so that the description lines starting with `*` are not parsed as values
- `usecols` -> choose the required columns
- `delim_whitespace` -> specify that the values are separated by whitespace
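
To see how these keyword arguments interact, here is a small sketch on a made-up three-column text snippet (not the real file format; the column names `day`, `value`, and `unused` are assumptions for illustration). It uses `sep=r"\s+"`, which the pandas documentation describes as equivalent to `delim_whitespace=True`; recent pandas releases deprecate the latter.

```python
import io

import pandas as pd

# Made-up miniature "data file": comment lines start with '*',
# values are separated by whitespace.
sample = """* column 1: day
* column 2: value
* column 3: unused
1.0  10.5  99
2.0  11.5  98
"""

toy = pd.read_csv(
    io.StringIO(sample),
    names=["day", "value", "unused"],  # column names, since the file has no header row
    comment="*",                       # lines starting with '*' are skipped
    usecols=["day", "value"],          # keep only the columns that are needed
    sep=r"\s+",                        # whitespace-separated values
)
print(toy)
```

When reading from a real file path, `encoding="cp1252"` would be passed in the same call.
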

After reading the data, the function should add two additional columns to the DataFrame:
- a column `timestamp` specifying the date/time of the measurement.  You can use the function `netCDF4.num2date` to create the datetime objects needed to do this
- a column `azimuth_direction` specifying the azimuth direction's letter (see above, one of `S`, `T`, `U`, `V`, `W`).  You can use the functions `os.path.split` and `os.path.splitext` and string indexing to extract this information from the `filename` parameter.
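
As a rough illustration of the filename handling, the following sketch splits an example path step by step. The directory `data/` and the helper name `azimuth_letter` are assumptions made for this example; it relies on the `YYMMDD` date prefix described in the technical comments.

```python
import os.path


def azimuth_letter(filename):
    """Return the azimuth direction letter encoded in a data file name.

    Hypothetical helper; assumes names of the form 'YYMMDDxS.VisNO2A'.
    """
    basename = os.path.split(filename)[1]  # e.g. '130624VS.VisNO2A'
    stem = os.path.splitext(basename)[0]   # e.g. '130624VS'
    return stem[6]                         # first character after 'YYMMDD'


letter = azimuth_letter("data/130624VS.VisNO2A")
print(letter)
```
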

The function should return one single `pandas.DataFrame` which has the columns `no2_scd`, `solar_zenith_angle`, `azimuth_direction`, and `elevation_angle`; the DataFrame's *index* should be the `timestamp`.

Note that the columns might be called differently in the file, so you will have to rename the columns after reading the DataFrame.  Here is a translation of column names as used in the file to column names as they should be in the DataFrame:

| in file               | in output            |
|-----------------------|----------------------|
| not there (see above) | `timestamp`          |
| `a[NO2]`              | `no2_scd`            |
| `zenith_angle`        | `solar_zenith_angle` |
| not there (see above) | `azimuth_direction`  |
| `los`                 | `elevation_angle`    |


To help you, here is a list of all columns contained in the data file:

In [None]:
column_names = 'day-of-year-1993 time time_LT endtime zenith_angle solar-azimuth-angle los viewing-azimuth-angle ref-zenith-angle ref-azimuth-angle ref-LOS a[O3] sig[O3] a[NO2] sig[NO2] a[O4] sig[O4] a[BrO] sig[BrO] a[H2O] sig[H2O] a[RING] sig[RING] a[Offset] sig[Offset] a[Bezug] sig[Bezug] sh[Bezug] sq[Bezug] chisq Q rms it spikes ints expt'.split()
print(column_names)

You can start from this function stub:

In [None]:
import os.path
import pandas as pd
import netCDF4

def read_datafile(filename):
    # this is a list of all columns contained in the data file
    column_names = 'day-of-year-1993 time time_LT endtime zenith_angle solar-azimuth-angle los viewing-azimuth-angle ref-zenith-angle ref-azimuth-angle ref-LOS a[O3] sig[O3] a[NO2] sig[NO2] a[O4] sig[O4] a[BrO] sig[BrO] a[H2O] sig[H2O] a[RING] sig[RING] a[Offset] sig[Offset] a[Bezug] sig[Bezug] sh[Bezug] sq[Bezug] chisq Q rms it spikes ints expt'.split()

    df = pd.read_csv(filename, )  # add the above keyword arguments to this function call
    
    df['datetime'] =  # add the date/time of the measurement.
    
    # delete the day-of-year-1993 column
    del ...
    
    df['azimuth_direction'] =   # add the letter specifying the azimuth direction.
    
    df.columns =   # rename the columns as required
    
    # make the timestamp column the DataFrame's index
    df....
    
    return df

Use this function to read any one data file:

In [None]:
df = read_datafile(...)
df

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.2
Create a data frame consisting of all NO2 measurements for year 2013, and save this data frame to an HDF file.  First you will need to use `glob.glob()` to create a list of all relevant file names, and then loop over this list.  After you have concatenated all files' data into one large DataFrame, make sure that it is sorted by timestamp, and use the DataFrame's `.to_hdf()` method to save.
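
The concatenate-and-sort step can be tried out on toy data before running it on the real files. In this sketch the two small frames stand in for the per-file results, and `combine_frames` is a hypothetical helper name. Saving with `.to_hdf()` is left out here because it requires the optional PyTables package.

```python
import pandas as pd


def combine_frames(frames):
    """Concatenate per-file DataFrames and sort the result by its index."""
    return pd.concat(frames).sort_index()


# Toy stand-ins for two files' data, deliberately passed in the "wrong" order
late = pd.DataFrame(
    {"no2_scd": [3.0]},
    index=pd.to_datetime(["2013-06-24 10:00"]),
)
early = pd.DataFrame(
    {"no2_scd": [1.0, 2.0]},
    index=pd.to_datetime(["2013-06-23 09:00", "2013-06-23 11:00"]),
)

combined = combine_frames([late, early])
print(combined)
```
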

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.3
Choose one azimuth direction.  For this azimuth direction, calculate the average NO2 for each elevation and month.  The result should be a `pandas.DataFrame` with one *column* per elevation angle, and one *row* per month.
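
The requested layout (months as rows, elevation angles as columns) can be produced with a group-by followed by `unstack`. The sketch below shows this on made-up numbers; the elevation angles and NO2 values are invented for illustration, and the selection of a single azimuth direction is not included.

```python
import pandas as pd

# Made-up measurements: two elevation angles, two months
toy = pd.DataFrame(
    {
        "elevation_angle": [92, 92, 100, 92, 100],
        "no2_scd": [1.0, 3.0, 5.0, 7.0, 9.0],
    },
    index=pd.to_datetime(
        ["2013-01-05", "2013-01-20", "2013-01-20", "2013-02-03", "2013-02-03"]
    ),
)

# Group by (month, elevation), average, then move elevation into the columns
monthly = (
    toy.groupby([toy.index.month, "elevation_angle"])["no2_scd"]
    .mean()
    .unstack("elevation_angle")
)
print(monthly)
```
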

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.4
For the same azimuth direction as in *Task 1.3*, calculate the annual mean NO2 for each elevation.  The result should be a `pandas.Series`, with the elevation angle as index.

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.5
There is more than one way of calculating the annual mean:

- Directly calculate the mean for each elevation from the data frame containing all measurements
- Calculate the annual mean as the average of the monthly mean values

#### Task 1.5.1
Explain under which circumstances the two methods can yield different results.

**EVALUATION**

*Grade:*  / 0.5

*Comment:*

#### Task 1.5.2
Implement the variant you did not choose in *Task 1.4* and visually compare the results.

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.6
Write a function that takes as inputs

1. the data frame you created in *Task 1.2*
2. the azimuth direction
3. the elevation angle

and returns a data frame of the average diurnal cycle (i.e., hourly means) for the specified azimuth and elevation, for each month, i.e., the data frame should have the month (`Jan`, `Feb`, ...) as columns and the time-of-day as index.

*Hint:* First write a helper function which takes the same inputs as before plus the month as fourth input, and returns as a series the average diurnal cycle for this month. Then, use this helper function inside a loop to solve this task.
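
One possible building block for the helper function is grouping by the hour of the timestamp index. The sketch below shows only this step on a made-up series; filtering by azimuth, elevation, and month is not included.

```python
import pandas as pd

# Made-up values measured on different days, at two different hours of the day
toy = pd.Series(
    [1.0, 3.0, 10.0, 20.0],
    index=pd.to_datetime(
        [
            "2013-06-01 09:10",
            "2013-06-02 09:50",
            "2013-06-01 12:05",
            "2013-06-03 12:30",
        ]
    ),
)

# Average over all measurements that fall into the same hour of the day
diurnal = toy.groupby(toy.index.hour).mean()
print(diurnal)
```
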

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.7
Choose any one azimuth direction.  For this azimuth direction, plot the average diurnal cycle for each month and elevation angle. The time-of-day should be on the x-axis, and the NO2 value on the y-axis.

If possible (optional - *extra credit!*), do this by creating one subplot for each elevation angle, with all months for the same elevation angle within the same plot.  Use different colors for each month.

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 1.8
Choose any one elevation angle and any pair of two azimuth directions.  For this subset of data, create a scatter plot of all daily mean NO2 values (one azimuth direction on *x*, the other azimuth direction on *y*).

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

#### Task 1.8.1 (extra credit)
As in *Task 1.8*, but give each scatter point a color indicating the month, and include a legend for the different colors.

**EVALUATION**

*Grade:*  / 0.5

*Comment:*

### Task 1.9 (extra credit)
Use the [xarray](http://xarray.pydata.org/) module to create a four-dimensional array holding the average diurnal cycle for all months, elevation angles, and azimuth directions, and save this array to a netCDF file.

*Hint:* You will want to create a 4D `xarray.DataArray`, with the dimensions *month*, *time*, *azimuth*, and *elevation*, and iteratively fill this array in loops over azimuth and elevation, using the functions from above.

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

## Part 2: Analysis of the Mauna Loa CO2 time series

### Task 2.1
Create a data frame of the Mauna Loa CO2 measurements, available at ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt, and plot the CO2 time series.  The actual CO2 value is contained in the column *average*.

**EVALUATION**

*Grade:*  / 1.0

*Comment:*

### Task 2.2
Calculate annual minima, maxima, and averages from the CO2 data and plot these.  Your plot should have time on the x-axis, CO2 on the y-axis, and should show three different lines.
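
Several statistics per year can be computed in one pass with `groupby(...).agg(...)`. The sketch below uses invented numbers, not real CO2 data, and assumes the data frame has a `year` column alongside `average`.

```python
import pandas as pd

# Invented values for two years; not actual Mauna Loa measurements
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2001, 2001, 2001],
        "average": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    }
)

# One row per year, one column per statistic
annual = toy.groupby("year")["average"].agg(["min", "max", "mean"])
print(annual)
```

Calling `annual.plot()` on such a frame draws one line per column, provided matplotlib is installed.
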

**EVALUATION**

*Grade:*  / 1.0

*Comment:*