# Data Analytics Spring 2024 &mdash; Exercises 1

### A K M Mahmudul Haque (last modified: Sun Jan 14 at 13:13)

- Five problems
- Minor variations between users
- Theme: Python & Numpy (no Pandas allowed)
- Hints will be given during the opening weekend
- Deadline: <b>Tue Jan 23 at noon</b>
- Make a copy of the original notebook (right click & duplicate) and add your answers (new cells) there
- Please make both your code and your notebook readable
- When you are done, run the handin code cell at the end of this notebook
- The original notebook may change after publication, but the changes should be minor
- The changes are not visible in the Last Modified column of the hub but Harri will inform you if there are notable changes
- See the edited opening weekend notes in the public folder
- Keep your folder structure up to date by running the code cell below:

In [None]:
import os
os.system('/usr/bin/bash /home/varpha/data_analytics/bin/config.sh');

## Problem 1. Documentation
- Browse through the Python and Numpy documentation
- Find a function that a) interests you, and b) has messy documentation
- Play with the function and find simple use cases
- Explain the function to your anonymous peer reviewer.

Please write a nice and clear explanation. Include some elementary examples.

## Problem 2. Map, Lambda, Groupby
In this problem, only plain Python may be used (no Numpy).<br/>
The following links may be helpful:
- [sorting howto](https://docs.python.org/3/howto/sorting.html)
- [lambda sorting](https://blogboard.io/blog/knowledge/python-sorted-lambda)
- [itertools groupby](https://stackoverflow.com/questions/773/how-do-i-use-itertools-groupby).

Using the code cell below, read a csv (real wind turbine data) into a list of dicts.<br/>
Then do the following:
- a) using map, convert the timestamps into the format <b>MM/dd/yyyy HH:mm:ss</b>, e.g. 11/04/2018 09:10:43
- b) using sorted and lambda, sort the rows according to increasing rotorspeed
- c) add a column called <b><i>WindSpeed_Group</i></b> that contains the letter A, B or C, where A = less than 5 m/s, B = 5-10 m/s, C = more than 10 m/s. Try to use [itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby) (although it may not be the smartest tool for this job).

In your handin, include the code that does a) - c) above. No need to save the modified data. Here is the code for reading the raw data:

In [None]:
from getpass import getuser
import csv
user = getuser()
user = user.upper() if user not in ['x1234'] else user
csv_location = f'/home/varpha/data_analytics/private/{user}' + \
                f'/exrc_01/data/prob2_{user}.csv'
with open(csv_location) as handle:
    mydata = list(csv.DictReader(handle))
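
As a rough illustration of the three tools on toy data (not a model answer), the sketch below assumes hypothetical column names `Timestamp`, `RotorSpeed` and `WindSpeed` and an input timestamp format of `yyyy-MM-dd HH:mm:ss`. Check the keys and formats in your own `mydata` before borrowing anything.

```python
from datetime import datetime
from itertools import groupby

# Toy rows standing in for the real csv. The column names and the
# input timestamp format are assumptions, not taken from the course data.
rows = [
    {"Timestamp": "2018-11-04 09:10:43", "RotorSpeed": "12.1", "WindSpeed": "7.3"},
    {"Timestamp": "2018-11-04 09:20:43", "RotorSpeed": "9.8", "WindSpeed": "4.1"},
    {"Timestamp": "2018-11-04 09:30:43", "RotorSpeed": "14.6", "WindSpeed": "11.9"},
]

def reformat_time(row):
    """Return a copy of the row with the timestamp as MM/dd/yyyy HH:mm:ss."""
    parsed = datetime.strptime(row["Timestamp"], "%Y-%m-%d %H:%M:%S")
    return {**row, "Timestamp": parsed.strftime("%m/%d/%Y %H:%M:%S")}

def wind_group(row):
    """A = below 5 m/s, B = 5-10 m/s, C = above 10 m/s."""
    speed = float(row["WindSpeed"])
    return "A" if speed < 5 else "B" if speed <= 10 else "C"

# a) map applies the conversion to every row
rows = list(map(reformat_time, rows))

# b) csv.DictReader yields strings, so convert before comparing
rows = sorted(rows, key=lambda row: float(row["RotorSpeed"]))

# c) groupby only merges adjacent items, so sort by the same key first
grouped = []
for letter, members in groupby(sorted(rows, key=wind_group), key=wind_group):
    grouped.extend({**row, "WindSpeed_Group": letter} for row in members)
```

Note that step c) reorders the rows by group. Because `sorted` is stable, the rotor speed order from step b) survives only within each group.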

## Problem 3. Vectorization
- Some [general info](https://www.askpython.com/python-modules/numpy/vectorization-numpy)
- The code in <b>data_analytics/lib/integrator.py</b> contains rudimentary code,<br/>
  written in plain python, that numerically integrates a (math) function<br/>
  $f\colon \mathbb{R} \to \mathbb{R}$ over an interval $[a,b]$.
- Rewrite the code using numpy and vectorization.
- Introduce timings to measure the gain of vectorization.
- Use the (math) function $f(x)=8 x^{12} + 11 x^{10} - 12 x^{8} + 3$ and interval $[a,b] = [-15, 17]$ to test the code.
- Increase the number of subintervals in order to obtain a noticeable difference in the timings.

In your handin, include the rewritten code along with the timing measures.
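
To show what vectorization and timing can look like, here is a minimal sketch. It assumes a midpoint rule, which may differ from the method used in `integrator.py`, and the function names `integrate_loop` and `integrate_numpy` are made up for this example.

```python
import time
import numpy as np

def f(x):
    # Works for both a single float and a whole numpy array
    return 8 * x**12 + 11 * x**10 - 12 * x**8 + 3

def integrate_loop(func, a, b, n):
    """Midpoint rule with an explicit Python loop (assumed baseline)."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        total += func(a + (i + 0.5) * h)
    return total * h

def integrate_numpy(func, a, b, n):
    """Same midpoint rule, evaluated on all midpoints at once."""
    h = (b - a) / n
    midpoints = a + (np.arange(n) + 0.5) * h
    return np.sum(func(midpoints)) * h

n = 200_000  # number of subintervals; increase to widen the timing gap

start = time.perf_counter()
slow = integrate_loop(f, -15.0, 17.0, n)
t_loop = time.perf_counter() - start

start = time.perf_counter()
fast = integrate_numpy(f, -15.0, 17.0, n)
t_numpy = time.perf_counter() - start

print(f"loop:  {slow:.6e} in {t_loop:.4f} s")
print(f"numpy: {fast:.6e} in {t_numpy:.4f} s")
```

A single timing run is noisy. Repeating the measurement, for example with the `%timeit` magic in the notebook, gives more reliable numbers.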

## Problem 4. Numpy arrays

- The folder <b>/home/AB0208/data_analytics/private/exrc_01/data</b><br/>
  contains a csv file (<b>prob4_AB0208.csv</b>) with some weather data.
- a) Use [numpy.genfromtxt](https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html) to read the file into a 2-dimensional numpy array.<br/>
  Use dtype=str in order to not lose the headers.
- b) Use Boolean masking to drop the rows that contain <b>nan</b> entries.
- c) Convert the time entries (standard timestamp) into a human-readable format of your choice.
- d) Add a new row that contains the averages of the columns, except <b>nan</b> for the time column.

In your handin, include the code that does a) - d) above. Do not include any saved data.
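
The masking and averaging steps can be tried out on a small hand-made array first. The sketch below assumes that the time column holds Unix timestamps in seconds and that missing values appear as the string `nan`. The column names and values are invented, so adapt them to your own file.

```python
from datetime import datetime, timezone
import numpy as np

# Toy stand-in for what genfromtxt returns with dtype=str.
# Column names and values are made up for illustration.
raw = np.array([
    ["time", "temperature", "humidity"],
    ["1704067200", "-3.5", "81"],
    ["1704070800", "nan", "79"],
    ["1704074400", "-4.5", "83"],
])
header, body = raw[0], raw[1:]

values = body.astype(float)              # the string "nan" becomes np.nan
keep = ~np.isnan(values).any(axis=1)     # True for rows without any nan
values = values[keep]                    # Boolean masking drops the rest

# Assumed: timestamps are seconds since 1970-01-01 UTC
times = [
    datetime.fromtimestamp(t, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
    for t in values[:, 0]
]

means = values.mean(axis=0)              # column averages
means[0] = np.nan                        # no average for the time column
```

A float array cannot hold the formatted time strings, so the final table with a human-readable time column and the extra averages row has to be assembled as a string array again.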

## Problem 5. Data download
- Start by exploring / running the code in <b>data_analytics/lib/statfi.py</b>
- Choose a topic that interests you. Then try to download a "lot" of data on that topic. Here a lot means something in the 500 kB - 2 MB range. (It's not really a lot, but enough that the downloaded data is hard to grasp manually.)
- Save your data in one or several json files.

In your handin, include the code that you used (no saved data).
Also, say a few words about your experiences. What problems, if any, did you encounter?
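
For the saving step, a minimal sketch of writing a json file and checking its size is given below. The `payload` dictionary and the file name are placeholders. In the exercise, the data would come from your own download code.

```python
import json
import os
import tempfile

# Placeholder data; replace with whatever your download code returns
payload = {"topic": "example", "values": list(range(1000))}

# Hypothetical file name, written to the system's temporary folder here
path = os.path.join(tempfile.gettempdir(), "example_download.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(payload, handle, ensure_ascii=False)

# Check how close the file is to the 500 kB - 2 MB target
size_kb = os.path.getsize(path) / 1024
print(f"{path}: {size_kb:.1f} kB")
```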

## Hand in your final answers by running the code cell below.
- Save your latest changes first, and please remove anything that may identify you to your anonymous reviewer.
- More information about the anonymous reviewing process will appear in the second set of exercises, which will be published on Tue Jan 23.
- You may run the code cell as many times as you wish.
- Your permission to write the handin file ends at the deadline.

In [None]:
import sys
sys.path.append('/home/varpha/data_analytics/lib')
from handin import handin_exrc_01
handin_exrc_01()