# Assignment #07

## Exercise #07-01: Using ChatGPT for programming

A [Galton Board](https://en.wikipedia.org/wiki/Galton_board) is a vertical board with staggered rows of pins or other obstacles. As beads are dropped from the top, they bounce off these pins, either to the left or right, finding a path between the pins until they drop into one of the collecting bins at the bottom. The resulting distribution of beads in the bins will follow a binomial distribution.
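
To make the binomial claim concrete: assuming every bounce goes left or right with equal probability, a board with `n` rows of pins has `n + 1` bins, and a bead ends up in bin `k` with probability C(n, k) / 2^n. The following minimal sketch computes the resulting expected counts; the function name `expected_bin_counts` is made up for illustration and is not part of the assignment.

```python
from math import comb


def expected_bin_counts(n_rows, n_beads):
    """Expected number of beads per bin for a fair Galton board.

    Assumes every bounce goes left or right with probability 1/2, so
    bin k (k = number of bounces to the right) has probability
    C(n_rows, k) / 2**n_rows.
    """
    total_paths = 2 ** n_rows
    return [n_beads * comb(n_rows, k) / total_paths for k in range(n_rows + 1)]


print(expected_bin_counts(4, 160))  # [10.0, 40.0, 60.0, 40.0, 10.0]
```

A simulation with many beads should produce counts that scatter around these values, which gives you a simple sanity check for your own program.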

In this exercise, we are going to explore how we can use [ChatGPT](https://chat.openai.com) to potentially improve our program and our programming style and what some of the drawbacks may be. To interact with ChatGPT, you will need to create an account if you don't have one yet.

**A.** Write a program that simulates a Galton Board using only the standard library. The user should be able to specify the number of rows of pins on the Galton Board and the number of beads to simulate as command-line arguments. For this assignment you are asked to use the [argparse](https://docs.python.org/3/library/argparse.html) module to parse the command-line arguments. The end result of your program should be a list containing the number of beads in each bin. *It is important that you write this code yourself without the help of ChatGPT.*
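
If you have not used argparse before, the following sketch shows one possible way to read two integers from the command line. The argument names `rows` and `beads` and the function name `parse_args` are illustrative assumptions, not a required interface; the simulation itself is deliberately left out.

```python
import argparse


def parse_args(argv=None):
    """Parse the two board parameters from the command line."""
    parser = argparse.ArgumentParser(description="Simulate a Galton board.")
    parser.add_argument("rows", type=int, help="number of rows of pins")
    parser.add_argument("beads", type=int, help="number of beads to drop")
    # With argv=None, argparse reads the arguments from sys.argv.
    return parser.parse_args(argv)


# Passing a list instead of reading sys.argv makes the parser easy to try out.
args = parse_args(["8", "1000"])
print(args.rows, args.beads)
```

Because `type=int` is given, argparse converts the strings for you and exits with an error message if the user passes something that is not an integer.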

**B.** Now ask ChatGPT to document the code for you. Does it provide accurate and sufficient documentation (e.g., docstrings and meaningful comment lines)?

**C.** Ask ChatGPT for suggestions on how to improve your code, and make sure that the result still follows the instructions of this assignment. Are the suggestions actually improvements?

**C1.** Hit regenerate and see if you get different suggestions this time.

**D.** We have not covered plotting yet in our course. So ask ChatGPT to write a function for you to plot the distribution of beads using matplotlib. It should also modify your program accordingly, i.e., include calls to the plot function and store the resulting plot in a directory that the user can specify via command-line arguments. Does the code work as expected?

**E.** Now start a fresh conversation with ChatGPT and actually ask it to produce the entire code for you. Is the code doing what it is supposed to be doing? How does it compare to your code? Think about what it takes to make sure that you can rely on the code being correct.

**F.** Which of the above interactions with ChatGPT provided useful input to you, and which did not?

## Exercise #07-02: ACINN meteorological data

This exercise will give you a glimpse of working with [pandas](https://pandas.pydata.org/). We are going to analyze a dataset downloaded from the [ACINN department database](https://acinn-data.uibk.ac.at/). The month-long dataset is from the automatic weather station [TAWES UIBK](https://acinn-data.uibk.ac.at/pages/tawes-uibk.html). [Here](https://acinn-data.uibk.ac.at/station/1/RAWDATA/) you can find a description of the variable names.

The data is shared by ACINN under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).

<a href="https://creativecommons.org/licenses/by-sa/4.0/" target="_blank">
  <img src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png"/>
</a>

You can access the dataset at the URL shown below. Data downloaded from the department database are formatted as [csv](https://en.wikipedia.org/wiki/Comma-separated_values) files, which we can read in easily using pandas. You may want to read the documentation of [read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to see what all the arguments do.

```python
from urllib.request import Request, urlopen
from io import BytesIO
import pandas as pd

url = 'https://raw.githubusercontent.com/manuelalehner/scientific_programming/master/data/data_Ibk_Sep2023.csv'
# Download the file contents as bytes
req = urlopen(Request(url)).read()
# Read the data into a DataFrame
data = pd.read_csv(BytesIO(req), sep=';', header=1, index_col=0, parse_dates=True)
```

The data are read into a so-called [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), which can be very useful for time series analysis, e.g., from weather stations. Let's explore this DataFrame somewhat to get you started.

```python
data.columns  # list all the column headers
```

```
Index(['tl', 'tl2', 'ts', 'tb1', 'tb2', 'tb3', 'tp', 'rf', 'rf2', 'rr', 'rrm',
       'p', 'som', 'glom', 'ffamm', 'ffm', 'ddm', 'ffxm', 'ddxm'],
      dtype='object')
```

```python
data['som']  # access the data column 'som' (sunshine duration)
```

```
rawdate
2023-09-01 00:00:00    0.0
2023-09-01 00:01:00    0.0
2023-09-01 00:02:00    0.0
2023-09-01 00:03:00    0.0
2023-09-01 00:04:00    0.0
                      ... 
2023-09-30 23:55:00    0.0
2023-09-30 23:56:00    0.0
2023-09-30 23:57:00    0.0
2023-09-30 23:58:00    0.0
2023-09-30 23:59:00    0.0
Name: som, Length: 37434, dtype: float64
```

```python
data.index  # access the datetime index
```

```
DatetimeIndex(['2023-09-01 00:00:00', '2023-09-01 00:01:00',
               '2023-09-01 00:02:00', '2023-09-01 00:03:00',
               '2023-09-01 00:04:00', '2023-09-01 00:05:00',
               '2023-09-01 00:06:00', '2023-09-01 00:07:00',
               '2023-09-01 00:08:00', '2023-09-01 00:09:00',
               ...
               '2023-09-30 23:50:00', '2023-09-30 23:51:00',
               '2023-09-30 23:52:00', '2023-09-30 23:53:00',
               '2023-09-30 23:54:00', '2023-09-30 23:55:00',
               '2023-09-30 23:56:00', '2023-09-30 23:57:00',
               '2023-09-30 23:58:00', '2023-09-30 23:59:00'],
              dtype='datetime64[ns]', name='rawdate', length=37434, freq=None)
```

**Write a script that allows the user to input the variable (e.g., air temperature, wind speed, ...) either as a command-line argument or using ``input()`` and prints the following information in the terminal.**

If the variable is wind direction:

    The dominant wind direction was {XX} ({XX}% of the time). The least dominant wind direction was {XX} ({XX}% of the time).
    
If it is any other variable:

    The maximum {VARIABLE} was {XX} {UNITS} ({DATE/TIME}), while the strongest {VARIABLE} averaged over an hour was {XX} {UNITS} ({DATE/TIME}).

*Hint 1:* You can either use numpy to determine, for example, the maximum values, or you can work directly with the DataFrame ([e.g., calculating the maximum](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html)). To convert a column of the DataFrame to a numpy array, you can use:

```python
temp = data['som'].to_numpy()
```

*Hint 2:* Calculating time averages is easy using the pandas [resample](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html) method.
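
As a rough illustration of what resampling does, the sketch below averages a synthetic one-minute series into hourly means. The series is invented for the example and merely stands in for a column of `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one data column: three hours of 1-minute values.
index = pd.date_range("2023-09-01 00:00", periods=180, freq="min")
series = pd.Series(np.arange(180, dtype=float), index=index)

# Average all values that fall into the same clock hour.
hourly_mean = series.resample("h").mean()
print(hourly_mean)
```

The result is a new Series with one value per hour, indexed by the start of each hour, so methods such as `max()` can be applied to it in the same way as to the original column.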

*Hint 3:* For wind direction, use the following eight wind direction classes: N, NW, W, SW, S, SE, E, NE.
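
One possible way to assign the classes is sketched below. It assumes that wind direction is given in degrees, with 0 (or 360) meaning north and angles increasing clockwise; check the variable description to confirm this for the dataset. The names `SECTORS` and `to_sector` are illustrative only.

```python
import numpy as np

# Sector labels in clockwise order, starting at north.
SECTORS = np.array(["N", "NE", "E", "SE", "S", "SW", "W", "NW"])


def to_sector(degrees):
    """Map wind directions in degrees to one of eight 45-degree sectors.

    Assumes 0 (or 360) means north and angles increase clockwise. Each
    sector is centred on its compass direction, so N runs from 337.5
    to 22.5 degrees.
    """
    degrees = np.asarray(degrees, dtype=float)
    # Shift by half a sector so that flooring puts e.g. 350 and 10 into N.
    index = np.floor((degrees + 22.5) / 45.0).astype(int) % 8
    return SECTORS[index]


sample = [0, 10, 350, 90, 180, 225, 270, 315, 44]
print(to_sector(sample))
```

Missing values (NaN) are not handled by this sketch and would need to be removed before the conversion.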

*Hint 4:* To output the datetime index in a specific format, you can use the [strftime](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.strftime.html) method.
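
The short sketch below combines `idxmax()`, which returns the index label of the largest value, with `strftime`. The series is synthetic and only stands in for a column of `data`; the format strings are examples, not a required output format.

```python
import pandas as pd

# Synthetic stand-in for a data column with a datetime index.
index = pd.date_range("2023-09-01 00:00", periods=5, freq="min")
series = pd.Series([1.0, 3.5, 2.0, 3.0, 0.5], index=index)

# idxmax() returns the index label (a Timestamp) of the largest value.
time_of_max = series.idxmax()
print(time_of_max.strftime("%Y-%m-%d %H:%M"))

# The same kind of format string works on the whole DatetimeIndex.
labels = series.index.strftime("%H:%M")
print(list(labels))
```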

## Exercise #07-03: Reading and writing netCDF

[netCDF](https://www.unidata.ucar.edu/software/netcdf/) is a very commonly used data format in the Atmospheric Sciences and it is becoming more and more of a standard for data exchange because netCDF data files are self-describing. That means that the format allows you to store metadata (e.g., information about the measurement location and instrumentation and information about each of the variables) together with the actual data.

It is thus not unlikely that you will encounter netCDF data during your studies, e.g., output from the WRF model, such as this [file](https://fileshare.uibk.ac.at/f/82dfc813aa0b4bc8bf33/?dl=1). In this exercise, we are going to get some experience with the netCDF file structure. Unlike our WRF output file, which contains only a small subset of variables, full model output files can be very large. To share data with other people, you may thus want to split the file into multiple smaller files or extract only part of the data.

You will need to install the [netCDF4 package](https://unidata.github.io/netcdf4-python/) for this exercise, and you should read the sections on creating/opening/closing, dimensions, variables, and attributes of the documentation to get familiar with the structure of netCDF files.

**A.** Write a program that creates a new netCDF file for each of the variables in the original file (use the netCDF4 format).

*Hint A1:* The dimensions and the global attributes will be the same for each of the new files. You can copy them directly from the original file, that is, read them from the original file and create the corresponding dimension/attribute in the new dataset.

*Hint A2:* The data are either one- or two-dimensional arrays similar to ``numpy`` arrays, which we will only introduce in the next chapter. Here, you simply need to copy the whole array to the new dataset, e.g.,
```python
# example for variable TSK
# nc0 ... original dataset created with nc.Dataset()
# nc1 ... new dataset
temp = nc1.createVariable('TSK', nc0['TSK'].dtype, nc0['TSK'].dimensions)
temp[:] = nc0['TSK'][:]
```

*Hint A3:* Each of the variables also contains its own attributes, which you need to copy to the new datasets together with the data.


**B. (optional)** Add the option of writing a new netCDF file for each output time instead of each variable, that is, each of the new files contains all the variables, but only for a single output time.

*Hint B1:* Add a command-line option that allows the user to choose between writing files for each variable or files for each output time.

*Hint B2:* Since each of the new files will contain only a single output time, the dimension 'Time' should be removed.

*Hint B3:* If one of the dimensions of a variable is 'Time', you need to extract the data for a given output time from the data arrays. Here is an example for variable 'TSK':
```python
# example for variable TSK and the output time with the index 'it'
data = nc0['TSK'][:]
# create indices that cover the whole array
ind = [slice(0, dimlen) for dimlen in data.shape]
# since we don't know in advance which dimension is Time, we need to first find out from the list of dimensions
dims = list(nc0['TSK'].dimensions)
timeind = dims.index('Time')
# now we can replace the slice for the time dimension with the single index of the selected output time
ind[timeind] = it
data = data[tuple(ind)]
```
Here we are making use of so-called [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to create the indices ``ind``. *Remember to also remove 'Time' from the variable dimensions.*
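
As an aside, numpy offers an alternative for selecting a single index along an axis whose position is only known at run time, namely `numpy.take`. The sketch below compares both approaches on a synthetic array; the array and the dimension names are invented for illustration and merely mimic a variable with a 'Time' dimension.

```python
import numpy as np

# Synthetic stand-in for a variable with dimensions (Time, south_north, west_east).
dims = ["Time", "south_north", "west_east"]
data = np.arange(24).reshape(2, 3, 4)
it = 1  # index of the selected output time

timeind = dims.index("Time")

# Approach from the hint: full slices everywhere, a single index for Time.
ind = [slice(0, dimlen) for dimlen in data.shape]
ind[timeind] = it
by_slices = data[tuple(ind)]

# Alternative: let numpy select index 'it' along the Time axis.
by_take = np.take(data, it, axis=timeind)

# Drop 'Time' from the dimension names to match the reduced array.
new_dims = tuple(d for d in dims if d != "Time")
print(by_take.shape, new_dims)
```

Both approaches return an array with one dimension fewer than the original, which is why 'Time' also has to be dropped from the tuple of dimension names passed to `createVariable`.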