# Grading process


The submission notebook will be autovalidated with `papermill`. The exact command is the following:

```bash
papermill <notebook-name>.ipynb <notebook-name>-run.ipynb -p TEST True
```

Papermill will inject a new cell after each cell tagged `parameters` (see `View > Cell toolbar > Tags`). The notebook will be executed from top to bottom, in linear order. `solutions.py` contains the correct implementations used to validate your solutions.

Please **set the `STUDENT` variable to the name of the submitting student**, so that we can collect the results automatically. Please **do not change the `TEST` variable** or the `validation` cells. If you need to add your own code for testing, wrap it in

```python
if not TEST:
    ...
```

Different problems are worth different numbers of points. All problems in the basic section give 1 point, while all problems in the intermediate section give 2 points.

Each problem contains specific validation details. You need to fill each cell tagged `solution` with your code. Note that each solution function must be self-contained, i.e. it must not use any state from the notebook itself.
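
As an illustration of the self-contained requirement, here is a small sketch. The function `mean_load` and the toy frame are hypothetical and not part of the assignment; the point is only that the function reads nothing but its own argument.

```python
import pandas as pd


def mean_load(df):
    """Self-contained: uses only its argument, never notebook-level variables."""
    return df.drop(columns="timestamp").mean()


# Hypothetical toy frame, used only to show that the function works in isolation.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-01 00:15", "2014-01-01 00:30"]),
    "MT_demo": [1.0, 3.0],
})
result = mean_load(toy)
print(result)
```

A version that referred to a global `df` from an earlier cell instead of its parameter would work in the notebook but could fail when the validator calls it on its own data.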

# Dataset

All problems in the assignment use the [electricity load dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). Some functions/methods accept the data itself, in which case it is a Pandas dataframe obtained by

```python
df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
```

In contrast, whenever a function/method accepts a filename, it is the filename of the **unzipped** data file (i.e. `LD2011_2014.txt`). When testing, do not rely on any specific location of the dataset, as the validation environment will most certainly be different from your local one. Hence, calls like

```python
df = pd.read_csv("<your-local-directory>/LD2011_2014.txt")
```

will fail.
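
One way to stay independent of the dataset location is to take the path as a parameter and let the caller decide where the file lives. The sketch below assumes a hypothetical helper named `load_el_data` and uses a two-line stand-in file (a simplified imitation of the real layout) written to a temporary directory.

```python
import os
import tempfile

import pandas as pd


def load_el_data(filename):
    """Read the semicolon-separated file at `filename` (hypothetical helper)."""
    df = pd.read_csv(filename, parse_dates=[0], delimiter=";", decimal=",")
    return df.rename({"Unnamed: 0": "timestamp"}, axis=1)


# Tiny stand-in file (simplified layout), written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LD2011_2014.txt")
    with open(path, "w") as f:
        f.write(";MT_001;MT_002\n")
        f.write("2011-01-01 00:15:00;0;1,5\n")
    demo = load_el_data(path)

print(demo)
```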

In [1]:
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

In [2]:
STUDENT = "Itamar Trainin"

In [3]:
ASSIGNMENT = 1
TEST = False

In [4]:
# Parameters
TEST = False


In [5]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 12

# Pandas

In [6]:
if not TEST:
    # Raw string, so the backslashes in the Windows path are not treated as escapes.
    df = pd.read_csv(r"C:\Data\Electricity Load Dataset\LD2011_2014.txt",
                     parse_dates=[0],
                     delimiter=";",
                     decimal=",")
    df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
    print(df)

                 timestamp    MT_001     MT_002    MT_003      MT_004  \
0      2011-01-01 00:15:00  0.000000   0.000000  0.000000    0.000000   
1      2011-01-01 00:30:00  0.000000   0.000000  0.000000    0.000000   
2      2011-01-01 00:45:00  0.000000   0.000000  0.000000    0.000000   
3      2011-01-01 01:00:00  0.000000   0.000000  0.000000    0.000000   
4      2011-01-01 01:15:00  0.000000   0.000000  0.000000    0.000000   
...                    ...       ...        ...       ...         ...   
140251 2014-12-31 23:00:00  2.538071  22.048364  1.737619  150.406504   
140252 2014-12-31 23:15:00  2.538071  21.337127  1.737619  166.666667   
140253 2014-12-31 23:30:00  2.538071  20.625889  1.737619  162.601626   
140254 2014-12-31 23:45:00  1.269036  21.337127  1.737619  166.666667   
140255 2015-01-01 00:00:00  2.538071  19.914651  1.737619  178.861789   

           MT_005      MT_006     MT_007      MT_008     MT_009  ...  \
0        0.000000    0.000000   0.000000    0.00000

### 1. Resample the dataset (1 point)

Resample the dataset to 1-hour resolution. Use `mean` as the aggregation function. Your function must output a dataframe with the same structure as the original one (i.e. not indexed by datetime).
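
To make the expected shape concrete, here is a sketch on a hypothetical toy frame (one made-up household, eight 15-minute readings). It uses `DataFrame.resample` with pandas' default left-closed hourly bins; whether the validator expects exactly this binning is an assumption.

```python
import pandas as pd

# Hypothetical toy frame: eight 15-minute readings for one made-up household.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2011-01-01 00:15:00", periods=8, freq="15min"),
    "MT_demo": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Bin by hour on the timestamp column, average each bin, and restore the
# flat layout so that `timestamp` is an ordinary column again.
hourly = toy.resample("60min", on="timestamp").mean().reset_index()
print(hourly)
```

With the default binning, a reading stamped exactly on the hour (such as `01:00`) starts a new bin, so the three hourly means here are `2.0`, `5.5` and `8.0`.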

In [7]:
def el_resample(df):
    return df.groupby(pd.Grouper(key='timestamp', freq='1H')).mean().reset_index()

In [8]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, el_resample)
else:
    from datetime import datetime
    print(df[df['timestamp'].apply(lambda x: datetime.strptime('2014-12-31 23:00:00', '%Y-%m-%d %H:%M:%S') <= x < datetime.strptime('2015-1-1 00:00:00', '%Y-%m-%d %H:%M:%S'))].mean())
    df_t = el_resample(df)
    print(df_t)



MT_001       2.220812
MT_002      21.337127
MT_003       1.737619
MT_004     161.585366
MT_005      83.841463
             ...     
MT_366       7.314219
MT_367     676.031607
MT_368     161.519199
MT_369     659.274194
MT_370    6932.432432
Length: 370, dtype: float64


                timestamp    MT_001     MT_002    MT_003      MT_004  \
0     2011-01-01 00:00:00  0.000000   0.000000  0.000000    0.000000   
1     2011-01-01 01:00:00  0.000000   0.000000  0.000000    0.000000   
2     2011-01-01 02:00:00  0.000000   0.000000  0.000000    0.000000   
3     2011-01-01 03:00:00  0.000000   0.000000  0.000000    0.000000   
4     2011-01-01 04:00:00  0.000000   0.000000  0.000000    0.000000   
...                   ...       ...        ...       ...         ...   
35060 2014-12-31 20:00:00  2.220812  25.248933  1.737619  186.483740   
35061 2014-12-31 21:00:00  2.538071  22.759602  1.737619  162.093496   
35062 2014-12-31 22:00:00  1.903553  22.048364  1.737619  161.077236   
35063 2014-12-31 23:00:00  2.220812  21.337127  1.737619  161.585366   
35064 2015-01-01 00:00:00  2.538071  19.914651  1.737619  178.861789   

          MT_005      MT_006     MT_007      MT_008     MT_009  ...  \
0       0.000000    0.000000   0.000000    0.000000   0.000000  

### 2. Consumption peaks (1 point)

For each household, determine which month of 2014 had the highest consumption. Your function must output a series indexed by household ID (e.g., `MT_XXX`) and containing the month as an integer (`1-12`).
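
"Highest consumption" can be read either as the largest single reading or as the largest monthly total. The sketch below assumes the second reading and is not the reference solution; `monthly_total_peak` and the toy data are hypothetical.

```python
import pandas as pd


def monthly_total_peak(df, year=2014):
    """Month (1-12) with the largest summed consumption, per household."""
    in_year = df[df["timestamp"].dt.year == year]
    month = in_year["timestamp"].dt.month
    totals = in_year.drop(columns="timestamp").groupby(month).sum()
    return totals.idxmax()


# Hypothetical toy data: daily readings, each household elevated in one month.
ts = pd.Series(pd.date_range("2013-12-01", "2014-12-31", freq="D"))
toy = pd.DataFrame({
    "timestamp": ts,
    "MT_A": (ts.dt.month == 3).astype(float) * 4 + 1,
    "MT_B": (ts.dt.month == 7).astype(float) * 4 + 1,
})
peaks = monthly_total_peak(toy)
print(peaks)
```

Grouping by the month number avoids having to convert month-end timestamps back to integers afterwards.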

In [9]:
def cons_peak(df):
    df = df[np.logical_and(df['timestamp'] >= '2014', df['timestamp'] < '2015')]
    df = df.groupby(pd.Grouper(key='timestamp', freq='1M')).max().reset_index()
    df = df.drop('timestamp', axis=1)
    df.index += 1
    df = df.idxmax()
    return df

In [10]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, cons_peak)
else:
    s = cons_peak(df)
    print(s)

MT_001     9
MT_002     7
MT_003     1
MT_004     1
MT_005     2
          ..
MT_366    10
MT_367    10
MT_368     6
MT_369    10
MT_370     9
Length: 370, dtype: int64


# PyTorch

### 3. Find minimum (2 points)

Consider the following scalar function:

$$
f(x) = ax^2 + bx + c
$$

Given $a,b,c$, find the $x$ that minimizes $f(x)$, and the minimum value of $f(x)$. Note that:

- $a,b,c$ are fixed, and generated in such a way that a minimum always exists ($f(x)$ is convex),
- $x$ is a scalar value, i.e. 0-dimensional tensor.

For reference, see the `generate_coeffs` function, which is used to generate the coefficients. Since the optimization process is not completely deterministic, the output is considered correct if it falls within `1e-3` of the actual values.

This problem must be solved as an optimization problem using gradient descent.

Use only PyTorch functionality for that: `SciPy` (or similar) optimization routines are not allowed, and neither is direct calculation from the coefficients.
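
As a rough sketch of what plain gradient descent on this quadratic can look like with a 0-dimensional `x`: the names `quadratic` and `gd_minimize`, the learning rate, the tolerance and the iteration cap are all illustrative assumptions, and this is not the reference solution.

```python
import torch


def quadratic(x, a, b, c):
    return a * x.pow(2) + b * x + c


def gd_minimize(a, b, c, lr=0.05, tol=1e-5, max_iter=100_000):
    """Gradient descent on f(x) = a*x^2 + b*x + c, starting from x = 0."""
    x = torch.zeros((), requires_grad=True)  # 0-dimensional tensor
    for _ in range(max_iter):
        y = quadratic(x, a, b, c)
        (grad,) = torch.autograd.grad(y, x)  # df/dx at the current x
        if grad.abs() < tol:
            break
        with torch.no_grad():
            x -= lr * grad  # step against the gradient
    with torch.no_grad():
        return x.detach().clone(), quadratic(x, a, b, c)


# Fixed illustrative coefficients: the minimum is at x = 1 with f(1) = -1.
a, b, c = torch.tensor(2.0), torch.tensor(-4.0), torch.tensor(1.0)
x_min, f_min = gd_minimize(a, b, c)
print(x_min, f_min)
```

Because the gradient is $2ax + b$, a gradient tolerance `tol` bounds the error in $x$ by `tol / (2a)`, so a small `a` calls for a tighter tolerance than the required `1e-3` accuracy might suggest.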

In [11]:
def generate_coeffs():
    a = torch.rand(size=()) * 10
    b = -10 + torch.rand(size=()) * 10
    c = -10 + torch.rand(size=()) * 10
    return a, b, c

def func(x, a, b, c):
    return x.pow(2) * a + x * b + c

In [12]:
def find_min(a, b, c):
    opt = 'SGD'
    max_iter = 10**6

    if opt == 'SGD':
        x = torch.randn(1, requires_grad=True)
        optimizer = torch.optim.SGD([x], lr=1e-3)
        
        for iter in range(max_iter):
            optimizer.zero_grad()
            y = func(x,a,b,c)
            y.backward()
            optimizer.step()
            if x.grad.abs() < 1e-3:
                return x, y           
    else:
        lr = torch.tensor([1e-1], device="cpu", dtype=torch.float, requires_grad=True)
        x = torch.tensor([0], device="cpu", dtype=torch.float, requires_grad=True) 
        for iter in range(max_iter):
            if (x.grad is not None):
                x.grad.data.zero_()
            y = func(x,a,b,c)

            y.backward()

            if x.grad.abs() < 1e-3:
                return x, y

            x = torch.tensor([x - lr * x.grad], device="cpu", dtype=torch.float, requires_grad=True)

        print('min not found in wanted range.')
    
    return x, y

In [13]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_min)
else:
    a, b, c = generate_coeffs()
    x_min, val_min = find_min(a,b,c)
    true_x_min = -b/(2*a)
    true_y_min = func(true_x_min, a, b, c)
    print(true_x_min - 1e-3 < x_min < true_x_min + 1e-3)
    print(true_y_min - 1e-3 < val_min < true_y_min + 1e-3)
    print(x_min, val_min)

tensor([True])
tensor([True])
tensor([1.0252], requires_grad=True) tensor([-10.3172], grad_fn=<AddBackward0>)


### 4. PyTorch `Dataset` (3 points)

Implement a `torch.utils.data.Dataset` sub-class for the electricity consumption data. Individual training instances must be week-long univariate series of hourly consumption (input, 168 values), followed by a 24-hour series of hourly consumption (output, 24 values) for a single household. Such a class can be used when training a consumption forecast model, which uses 7 days of historical consumption to forecast the next 24 hours of consumption.

`__getitem__(self, idx)` must return a tuple of 1D tensors, `in_data` and `out_data`. `in_data` contains 168 hours of consumption (hourly), starting from some `start_ts`, while `out_data` must contain 24 hourly consumption values starting from `start_ts + 168 hours` for some household. `start_ts` should be sampled randomly.

You also need to implement a `get_mapping(self, idx)` method, which returns the `(household, starting time)` pair that corresponds to `idx`.

This class will be validated as follows:

- a dataset object is created with some random `samples`: `dataset = ElDataset(df, samples)`,
- the validator fetches a random `idx` (between `0` and `len(dataset)`) from the dataset:
```python
household, start_ts = dataset.get_mapping(idx)
hist_data, future_data = dataset[idx]
```
- then, `hist_data` and `future_data` are compared with the data obtained directly from `df` using `household, start_ts`.
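
The sketch below shows one way to keep `__len__`, `__getitem__` and `get_mapping` consistent with each other. It assumes the frame passed in is already at hourly resolution. The class name `WindowDataset`, the `seed` argument and the flat `idx = household * samples + sample` layout are hypothetical choices, not the reference implementation.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset

IN_LEN, OUT_LEN = 168, 24  # one week of history, one day of forecast


class WindowDataset(Dataset):
    """Random (168 h, 24 h) windows; assumes `hourly_df` is already hourly."""

    def __init__(self, hourly_df, samples, seed=0):
        self.households = [c for c in hourly_df.columns if c != "timestamp"]
        self.timestamps = hourly_df["timestamp"].reset_index(drop=True)
        self.values = hourly_df[self.households].to_numpy(dtype=np.float32)
        self.samples = samples
        # Random start rows, chosen so that every window fits inside the data.
        last_start = len(hourly_df) - (IN_LEN + OUT_LEN)
        rng = np.random.default_rng(seed)
        self.starts = rng.integers(
            0, last_start + 1, size=(len(self.households), samples)
        )

    def __len__(self):
        return len(self.households) * self.samples

    def _locate(self, idx):
        # Flat index -> (household position, sample position).
        return divmod(idx, self.samples)

    def get_mapping(self, idx):
        h, s = self._locate(idx)
        return self.households[h], self.timestamps.iloc[int(self.starts[h, s])]

    def __getitem__(self, idx):
        h, s = self._locate(idx)
        start = int(self.starts[h, s])
        series = self.values[start:start + IN_LEN + OUT_LEN, h].copy()
        window = torch.from_numpy(series)
        return window[:IN_LEN], window[IN_LEN:]


# Hypothetical toy frame: 300 hourly rows for two made-up households.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2014-01-01", periods=300, freq="60min"),
    "MT_A": np.arange(300, dtype=float),
    "MT_B": np.arange(300, dtype=float) * 10,
})
ds = WindowDataset(toy, samples=3)
household, start_ts = ds.get_mapping(4)
in_data, out_data = ds[4]
print(household, start_ts, in_data.shape, out_data.shape)
```

Drawing the start positions once in `__init__` keeps `dataset[idx]` and `get_mapping(idx)` in agreement across repeated calls, which is what the validation procedure relies on.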

In [14]:
if not TEST:
    samples = 10
    raw_data = df
    sampled_data = df.iloc[list(torch.rand(samples) * len(raw_data))]
    hourly_sampled_data = el_resample(raw_data)
    households = list(hourly_sampled_data.columns)
    households.remove('timestamp')
    idx_mapping = {
        i: {
            'household': households[int(torch.rand(1) * len(households))],
            'start_st_hist': date,
            'end_st_hist': date + pd.DateOffset(days=7),
            'end_st_future': date + pd.DateOffset(days=8)
        } for i, date in enumerate(list(hourly_sampled_data['timestamp']))
    }
    idx_mapping
    print(len(hourly_sampled_data[np.logical_and(
        hourly_sampled_data['timestamp'] >= str(idx_mapping[0]['start_st_hist']),
        hourly_sampled_data['timestamp'] < str(idx_mapping[0]['end_st_hist'])
    )]))
    print(len(hourly_sampled_data[np.logical_and(
        hourly_sampled_data['timestamp'] >= str(idx_mapping[0]['end_st_hist']),
        hourly_sampled_data['timestamp'] < str(idx_mapping[0]['end_st_future'])
    )]))

    hourly_sampled_data[np.logical_and(
            hourly_sampled_data['timestamp'] >= str(idx_mapping[0]['start_st_hist']),
            hourly_sampled_data['timestamp'] < str(idx_mapping[0]['end_st_hist'])
        )]

168
24


In [15]:
class ElDataset(Dataset):
    """Electricity dataset."""

    def __init__(self, df, samples):
        """
        Args:
            df: original electricity data (see HW intro for details).
            samples (int): number of samples to take per household.
        """
        self.samples = samples
        self.raw_data = df
        self.sampled_data = df.iloc[list(torch.rand(self.samples) * len(self.raw_data))]
        self.hourly_sampled_data = self.raw_data.groupby(pd.Grouper(key='timestamp', freq='1H')).max().reset_index()
        self.households = list(self.hourly_sampled_data.columns)
        self.households.remove('timestamp')
        self.idx_mapping = {
            i: {
                'household': self.households[int(torch.rand(1) * len(self.households))],
                'start_st_hist': date,
                'end_st_hist': date + pd.DateOffset(days=7),
                'end_st_future': date + pd.DateOffset(days=8)
            } for i, date in enumerate(list(self.hourly_sampled_data['timestamp']))
        }
        
    def __len__(self):
        return self.samples * (self.raw_data.shape[1] - 1)

    def __getitem__(self, idx):
        query_obj = self.idx_mapping[idx]
        household = query_obj['household']
        timestamp = self.hourly_sampled_data['timestamp']
        
        hist_data = self.hourly_sampled_data[
            np.logical_and(
                timestamp >= str(query_obj['start_st_hist']),
                timestamp < str(query_obj['end_st_hist'])
            )
        ]
        future_data = self.hourly_sampled_data[
            np.logical_and(
                timestamp >= str(query_obj['end_st_hist']),
                timestamp < str(query_obj['end_st_future'])
            )
        ]
        
        hist_data = torch.tensor(hist_data[household].tolist())
        future_data = torch.tensor(future_data[household].tolist())

        return hist_data, future_data

    def get_mapping(self, idx):
        mp = self.idx_mapping[idx]
        household = mp['household']
        start_ts = mp['start_st_hist']
        return household, start_ts


In [16]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, ElDataset)
else:
    dataset = ElDataset(df, 1400)
    household, start_st = dataset.get_mapping(100)
    print(household, '|', start_st)
    hist_data, future_data = dataset[100]
    print(hist_data)
    print(future_data)
    print(len(hist_data))
    print(len(future_data))    

MT_059 | 2011-01-05 04:00:00
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
168
24


# Your grade

In [17]:
if TEST:
    print(f"{STUDENT}: {total_grade}")