# Grading process


The submission notebook will be autovalidated with `papermill`. The exact command is the following:

```bash
papermill <notebook-name>.ipynb <notebook-name>-run.ipynb -p TEST True
```

Papermill will inject a new cell after each cell tagged `parameters` (see `View > Cell toolbar > Tags`). The notebook will be executed from top to bottom in linear order. `solutions.py` contains the correct implementations used to validate your solutions.

Please **fill the `STUDENT` variable with the name of the submitting student**, so that we can collect the results automatically. Please **do not change the `TEST` variable** or the `validation` cells. If you need to inject your own code for testing, wrap it in

```python
if not TEST:
    ...
```
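As a rough illustration of the mechanism described above, the following sketch writes out by hand what the executed notebook effectively looks like once papermill has injected its cell. The injected cell is generated by papermill itself; it is reproduced here only for illustration, and the `executed` list is a hypothetical helper used to show which branches run.

```python
# Cell tagged `parameters`: the default value committed in the notebook.
TEST = False

# Cell injected by papermill right after it for `-p TEST True`
# (written out by hand here, purely for illustration).
TEST = True

executed = []

if not TEST:
    # Local experiments: skipped during validation.
    executed.append("local experiments")

if TEST:
    # Validation cells: run only during validation.
    executed.append("validation")

print(executed)
```

Because the injected assignment comes after the default, only the `if TEST:` branch runs during validation, which is why local test code belongs under `if not TEST:`.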

Different problems are worth different numbers of points. All problems in the basic section are worth 1 point, while all problems in the intermediate section are worth 2 points.

Each problem contains specific validation details. You need to fill each cell tagged `solution` with your code. Note that a solution function must be self-contained, i.e. it must not use any state from the notebook itself.
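To make the self-containment requirement concrete, here is a small sketch contrasting a function that depends on notebook state with one that does not. `SCALE` and both function names are hypothetical and not part of the assignment.

```python
SCALE = 10  # notebook-level state; it may differ or be missing during validation


def scaled_sum_fragile(values):
    # Depends on the global SCALE defined in some other cell.
    return SCALE * sum(values)


def scaled_sum_self_contained(values):
    # Everything the function needs is defined inside it or passed as an argument.
    scale = 10
    return scale * sum(values)


print(scaled_sum_fragile([1, 2, 3]), scaled_sum_self_contained([1, 2, 3]))
```

Both calls agree here, but only the second keeps working if the cell that defines `SCALE` is changed or never runs.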

# Dataset

All problems in the assignment use the [electricity load dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). Some functions/methods accept the data itself; in that case it is a Pandas dataframe as obtained by

```python
df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
```

In contrast, whenever a function/method accepts a filename, it is the filename of the **unzipped** data file (i.e. `LD2011_2014.txt`). When testing, do not rely on any specific location of the dataset, as the validation environment will almost certainly be different from your local one. Hence, calls like

```python
df = pd.read_csv("<your-local-directory>/LD2011_2014.txt")
```

will fail.
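A minimal sketch of a location-independent loader, assuming the same `read_csv` arguments as in the snippet above. `load_el_data` is a hypothetical helper name; the point is only that the path arrives as an argument instead of being hardcoded.

```python
import pandas as pd


def load_el_data(filename):
    """Read the unzipped data file from whatever path the caller provides."""
    df = pd.read_csv(filename,
                     parse_dates=[0],
                     delimiter=";",
                     decimal=",")
    # The first column has an empty header, which pandas names "Unnamed: 0".
    return df.rename({"Unnamed: 0": "timestamp"}, axis=1)
```

A caller would then write `load_el_data("LD2011_2014.txt")` locally, while the validator is free to pass its own path.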

In [1]:
import numpy as np
import pandas as pd
import torch

In [2]:
STUDENT = "Noam Salomonski"

In [3]:
ASSIGNMENT = 1
TEST = False

In [4]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 12

# Pandas

### 1. Resample the dataset (1 point)

Resample the dataset to 1-hour resolution. Use `mean` as the aggregation function. Your function must output a dataframe with the same structure as the original one (i.e. not indexed by datetime).
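The expected shape of the result can be checked on a toy frame first. The sketch below uses made-up values and a single household column. It spells the hourly frequency as `"60min"`, because the alias for hours is spelled differently across pandas versions; that choice is an assumption of this example, not a requirement of the assignment.

```python
import pandas as pd

# Five 15-minute readings for one made-up household.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2011-01-01 00:15:00", "2011-01-01 00:30:00", "2011-01-01 00:45:00",
        "2011-01-01 01:00:00", "2011-01-01 01:15:00",
    ]),
    "MT_001": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Hourly means keyed on the `timestamp` column; reset_index() turns the
# datetime index back into an ordinary column, as in the original layout.
hourly = toy.resample("60min", on="timestamp").mean().reset_index()
print(hourly)
```

The first three readings fall into the 00:00 bin (mean 2.0) and the last two into the 01:00 bin (mean 5.0), and `timestamp` is a regular column again.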

In [5]:
if not TEST:
    orig_df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
    orig_df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
    print(orig_df.head())

            timestamp  MT_001  MT_002  MT_003  MT_004  MT_005  MT_006  MT_007  \
0 2011-01-01 00:15:00     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
1 2011-01-01 00:30:00     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
2 2011-01-01 00:45:00     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3 2011-01-01 01:00:00     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4 2011-01-01 01:15:00     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

   MT_008  MT_009  ...  MT_361  MT_362  MT_363  MT_364  MT_365  MT_366  \
0     0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
1     0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
2     0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
3     0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
4     0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   

   MT_367  MT_368  MT_369  MT_370  
0     0.0     0.0     0.0     0.

In [6]:
def el_resample(df):
    # Aggregate the 15-minute readings into hourly means keyed on the
    # `timestamp` column, then restore `timestamp` as a regular column.
    df = df.resample('H', on='timestamp').mean().reset_index()
    return df

if not TEST:
    el_resample(orig_df.copy())

In [7]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, el_resample)

### 2. Consumption peaks (1 point)

For each household, calculate which month in 2014 had the highest consumption. Your function must output a series indexed by household ID (e.g., `MT_XXX`) and containing the month as an integer (`1-12`).
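The required output format can be illustrated on a toy frame with two made-up households. The sketch groups by calendar month and aggregates with `mean`; whether the validator compares monthly means or monthly sums is not stated in the task, so that choice is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-12-31", "2014-01-15", "2014-02-15", "2014-03-15"]),
    "MT_001": [100.0, 1.0, 5.0, 2.0],
    "MT_002": [100.0, 3.0, 1.0, 7.0],
})

# Keep 2014 only, then aggregate per calendar month (1-12).
toy2014 = toy[toy["timestamp"].dt.year == 2014]
monthly = (toy2014.drop(columns="timestamp")
           .groupby(toy2014["timestamp"].dt.month)
           .mean())

# idxmax returns, for each household column, the index label of its largest
# row; the index holds month numbers, so the result is a month in 1-12.
peaks = monthly.idxmax(axis=0)
print(peaks)
```

Here `MT_001` peaks in month 2 and `MT_002` in month 3, and the large 2013 reading is ignored.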

In [8]:
def cons_peak(df):
    # Keep only the 2014 readings
    df2014 = df[df["timestamp"].dt.year == 2014]
    # Monthly mean consumption per household, indexed by month-end timestamps
    months2014 = df2014.resample('M', on='timestamp').mean()
    # Re-index by month number so that idxmax returns 1-12 instead of a row position
    months2014.index = months2014.index.month
    max_month = months2014.idxmax(axis=0)
    return max_month

if not TEST:
    cons_peak(orig_df.copy())

In [9]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, cons_peak)

# PyTorch

### 3. Find minimum (2 points)

Consider the following scalar function:

$$
f(x) = ax^2 + bx + c
$$

Given $a,b,c$, find the $x$ that minimizes $f(x)$, as well as the minimum value of $f(x)$. Note that:

- $a,b,c$ are fixed and generated in such a way that the minimum always exists ($f(x)$ is convex),
- $x$ is a scalar value, i.e. a 0-dimensional tensor.

For reference, see the `generate_coeffs` function, which is used to generate the coefficients. Since the optimization process is not completely deterministic, the output is considered correct if it falls within `1e-3` of the actual values.

This problem must be solved as an optimization one using gradient descent.

For that, use only PyTorch functionality: `SciPy` (or similar) optimization routines are not allowed, and neither is direct calculation from the coefficients.
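The following short derivation may help when choosing a step size. It covers plain gradient descent with a constant learning rate $\eta$ and no momentum, which is an assumption of this sketch; the closed-form minimizer $x^*$ appears only to analyze convergence, not as a way to compute the answer. One gradient step reads

$$
x_{k+1} = x_k - \eta f'(x_k) = x_k - \eta (2ax_k + b)
$$

Writing $x^* = -\frac{b}{2a}$ for the minimizer, the distance to it shrinks by a constant factor per step:

$$
x_{k+1} - x^* = (1 - 2a\eta)(x_k - x^*)
$$

For $a > 0$ the iteration therefore converges exactly when $|1 - 2a\eta| < 1$, i.e. $0 < \eta < \frac{1}{a}$. A small $a$ makes the factor close to 1, so convergence is slow and more iterations are needed to reach the `1e-3` tolerance.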

In [10]:
def generate_coeffs():
    a = torch.rand(size=()) * 10
    b = -10 + torch.rand(size=()) * 10
    c = -10 + torch.rand(size=()) * 10
    return a, b, c

def func(x, a, b, c):
    return x.pow(2) * a + x * b + c

In [11]:
def find_min(a, b, c):
    param = torch.zeros(size=(), requires_grad=True)
    value_function = lambda x :func(x, a, b, c)
    loss_function = lambda val: val

    optimizer = torch.optim.SGD([param], lr=0.01, momentum=0.9)
    for _ in range(10000):
        optimizer.zero_grad()
        value = value_function(param)
        loss = loss_function(value)
        loss.backward()
        optimizer.step()

    x_min = param
    val_min = value_function(x_min)

    return x_min, val_min

if not TEST:
    for _ in range(100):
        a, b, c = generate_coeffs()
        x_min, val_min = find_min(a, b, c)
        print(f"a={a}, b={b}, c={c}, x_min={x_min}, val_min={val_min}")

        # Closed-form minimum of the parabola, used only to check the result
        analytic_solution_x = -b / (2 * a)
        analytic_solution_val = func(analytic_solution_x, a, b, c)

        np.testing.assert_almost_equal(analytic_solution_x.numpy(), x_min.detach().numpy(), decimal=3)
        np.testing.assert_almost_equal(analytic_solution_val.numpy(), val_min.detach().numpy(), decimal=3)

a=0.5775004625320435, b=-1.6436710357666016, c=-5.416696071624756, x_min=1.4230908155441284, val_min=-6.58624267578125
a=3.18483829498291, b=-6.9510650634765625, c=-3.4636788368225098, x_min=1.091274380683899, val_min=-7.256438255310059
a=9.740724563598633, b=-4.7058939933776855, c=-8.453173637390137, x_min=0.24155767261981964, val_min=-9.021546363830566
a=5.575559139251709, b=-3.188587188720703, c=-9.4719877243042, x_min=0.28594326972961426, val_min=-9.927865028381348
a=8.176003456115723, b=-7.5644941329956055, c=-5.421421527862549, x_min=0.4626034200191498, val_min=-7.1711015701293945
a=2.0038437843322754, b=-0.41909027099609375, c=-7.927860736846924, x_min=0.10457159578800201, val_min=-7.94977331161499
a=2.9687881469726562, b=-4.592647552490234, c=-6.167239665985107, x_min=0.7734885811805725, val_min=-7.943419933319092
a=3.999666690826416, b=-4.192836761474609, c=-1.156327247619629, x_min=0.5241482853889465, val_min=-2.2551612854003906
a=9.922456741333008, b=-2.0750832557678223, c=-



AssertionError: 
Arrays are not almost equal to 4 decimals

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.0045166
Max relative difference: 1.6317188e-05
 x: array(276.8047, dtype=float32)
 y: array(276.8002, dtype=float32)

In [None]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_min)

### 4. PyTorch `Dataset` (3 points)

Implement a `torch.utils.data.Dataset` subclass for the electricity consumption data. Individual training instances must be week-long univariate series of hourly consumption (input, 168 values), followed by a 24-hour series of hourly consumption (output, 24 values) for a single household. Such a class can be used when training a consumption forecast model that uses 7 days of historical consumption to forecast the next 24 hours of consumption.

`__getitem__(self, idx)` must return a tuple of 1D tensors, `in_data` and `out_data`. `in_data` contains 168 hours of consumption (hourly), starting from some `start_ts`, while `out_data` must contain 24 hourly consumption values starting from `start_ts + 168 hours` for some household. `start_ts` should be sampled randomly.

You also need to implement a `get_mapping(self, idx)` method, which returns the `(household, start_ts)` pair that corresponds to a given `idx`.

This class will be validated as follows:

- a dataset object is created with some random `samples`: `dataset = ElDataset(df, samples)`,
- the validator fetches a random `idx` (between `0` and `len(dataset)`) from the dataset:
```python
household, start_ts = dataset.get_mapping(idx)
hist_data, future_data = dataset[idx]
```
- then, `hist_data` and `future_data` are compared with the data obtained directly from `df` using `household, start_ts`.
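One way to reconcile randomly sampled start times with a reproducible `get_mapping` is to draw the start positions once, at construction time, and to derive both the household and the start from `idx` by integer division. The sketch below shows only that index arithmetic on plain integers. `draw_starts`, `idx_to_position`, and `window_slices` are hypothetical helper names, and the household-first ordering is an assumption rather than something the validator is known to require.

```python
import random


def draw_starts(n_rows, samples, in_len=168, out_len=24, seed=0):
    """Draw `samples` random start rows such that a full window still fits."""
    rng = random.Random(seed)
    last_valid_start = n_rows - (in_len + out_len)
    return [rng.randint(0, last_valid_start) for _ in range(samples)]


def idx_to_position(idx, n_households, starts):
    """Map a flat index to (household column position, start row)."""
    sample_no, household_pos = divmod(idx, n_households)
    return household_pos, starts[sample_no]


def window_slices(start_row, in_len=168, out_len=24):
    """Row slices for the input week and the following output day."""
    split = start_row + in_len
    return slice(start_row, split), slice(split, split + out_len)


# Made-up sizes: 1000 hourly rows, 3 households, 4 samples per household.
n_rows, n_households, samples = 1000, 3, 4
starts = draw_starts(n_rows, samples)
positions = [idx_to_position(i, n_households, starts)
             for i in range(n_households * samples)]
```

Because `starts` is fixed after construction, calling the mapping twice with the same `idx` gives the same answer, which is what lets the validator compare `dataset[idx]` against `get_mapping(idx)`.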

In [30]:
from torch.utils.data import Dataset


class ElDataset(Dataset):
    """Electricity dataset."""

    def __init__(self, df, samples):
        """
        Args:
            df: original electricity data (see HW intro for details).
            samples (int): number of samples to take per household.
        """
        # Hourly means, indexed by timestamp on a regular hourly grid
        self.raw_data = df.resample('H', on='timestamp').mean().asfreq('H')
        self._freq = self.raw_data.index.freq

        self._n_households = self.raw_data.shape[1]
        self._n_frames_per_house = self.raw_data.shape[0]

        # Last row index at which a full 168 + 24 hour window still fits
        self._max_allowed_timestamp_index = self._n_frames_per_house - 24 * (7 + 1)
        if self._max_allowed_timestamp_index < samples:
            print("warning, too many samples required!")
            samples = self._max_allowed_timestamp_index
        self._samples = samples
        # Start times are spread evenly over the data rather than drawn at random
        self._time_stride = self._max_allowed_timestamp_index // self._samples

    def __len__(self):
        return self._samples * self._n_households

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(f"index {idx} out of bounds ({len(self)})")
        household, start_ts = self.get_mapping(idx)
        house_data = self.raw_data[household]
        # Label-based slicing on a DatetimeIndex includes both endpoints
        hist_data = house_data.loc[start_ts:start_ts + (168 - 1) * self._freq]
        future_data = house_data.loc[start_ts + 168 * self._freq:start_ts + (168 + 24 - 1) * self._freq]
        # Return 1D tensors, as the task requires
        return torch.tensor(hist_data.to_numpy()), torch.tensor(future_data.to_numpy())

    def get_mapping(self, idx):
        # idx enumerates households first, then start times, so that every
        # idx in range(len(self)) maps to a distinct, valid (household, start) pair
        time_sample_idx, household_ind = divmod(idx, self._n_households)
        start_ts_ind = time_sample_idx * self._time_stride

        household = self.raw_data.columns[household_ind]
        start_ts = self.raw_data.index[start_ts_ind]
        return household, start_ts

if not TEST:
    samples = 5
    dataset = ElDataset(orig_df.copy(), samples)

    for idx in range(len(dataset)):
        household, start_ts = dataset.get_mapping(idx)

        # Reference slices taken directly from the hourly data
        tested_past = dataset.raw_data[household][start_ts:start_ts+167*dataset._freq]
        tested_future = dataset.raw_data[household][start_ts+168*dataset._freq:start_ts+(168+23)*dataset._freq]
        my_past, my_future = dataset[idx]

        assert my_past.shape == (168,) and my_future.shape == (24,)
        assert np.allclose(my_past.numpy(), tested_past.to_numpy(), equal_nan=True)
        assert np.allclose(my_future.numpy(), tested_future.to_numpy(), equal_nan=True)



In [None]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, ElDataset)

# Your grade

In [None]:
if TEST:
    print(f"{STUDENT}: {total_grade}")