<a href="https://colab.research.google.com/gist/taruma/aca7f90c8fbb0034587809883d0d9e92/taruma_hk98_rekap_deret_waktu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based on issue [#98](https://github.com/taruma/hidrokit/issues/98): **create a summary/recap of time series data**

Problem description:

- Create a summary/recap/report of time series data.

Solution strategy:

- Write functions that make it easy to plug in user-defined aggregation functions.

Notes:

- These functions have only been tested on daily data, for the purpose of summarizing each month.

# SETUP AND DATASET

In [0]:
import numpy as np
import pandas as pd

In [2]:
try:
    import hidrokit
except ModuleNotFoundError:
    !pip install git+https://github.com/taruma/hidrokit.git@latest -q
    import hidrokit

print(f'hidrokit version: {hidrokit.__version__}')

  Building wheel for hidrokit (setup.py) ... done
hidrokit version: 0.3.5-beta.4


In [0]:
!wget -O sample.xlsx "https://taruma.github.io/assets/hidrokit_dataset/data_daily_sample.xlsx" -q
dataset_path = 'sample.xlsx'

In [4]:
from hidrokit.contrib.taruma import hk88

_data = hk88.read_workbook(dataset_path, ['STA_A', 'STA_B', 'STA_C'], 
                           as_df=False)
dataset = pd.concat(_data, sort=True, axis=1).infer_objects()
dataset.info()
dataset.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5478 entries, 2001-01-01 to 2015-12-31
Freq: D
Data columns (total 3 columns):
STA_A    5477 non-null float64
STA_B    5470 non-null float64
STA_C    5475 non-null float64
dtypes: float64(3)
memory usage: 171.2 KB


Unnamed: 0,STA_A,STA_B,STA_C
2001-01-01,0.0,0.0,0.0
2001-01-02,0.0,0.0,0.65
2001-01-03,0.0,45.0,9.16
2001-01-04,0.0,0.0,0.0
2001-01-05,0.0,5.0,1.03


# CODE

In [0]:
def summary_station(dataset, column, ufunc, ufunc_col, n_days='M'):
    # Group the records by calendar year and month.
    grouped = [dataset.index.year, dataset.index.month]

    # Accept a single function/name as well as a list or tuple of them.
    ufunc = ufunc if isinstance(ufunc, (list, tuple)) else (ufunc,)
    ufunc_col = (ufunc_col
                 if isinstance(ufunc_col, (list, tuple)) else (ufunc_col,))

    if len(ufunc) != len(ufunc_col):
        raise ValueError('ufunc and ufunc_col must have the same length.')

    ix_month = []
    val_month = []
    for _, x in dataset[column].groupby(by=grouped):
        # Within each month, split into n_days windows and aggregate them.
        each_month = x.groupby(pd.Grouper(freq=n_days)).agg(ufunc)
        val_month.append(each_month.values)
        ix_month.extend(each_month.index)
    # Label the columns as (station, statistic) pairs.
    return pd.DataFrame(
        data=np.vstack(val_month), index=ix_month,
        columns=pd.MultiIndex.from_product([[column], ufunc_col])
    )

def summary_all(dataset, ufunc, ufunc_col, columns=None, n_days='M'):
    res = []

    # Default to every column; wrap a single column name in a list.
    columns = columns if columns is not None else list(dataset.columns)
    columns = columns if isinstance(columns, (list, tuple)) else [columns]

    for column in columns:
        print('PROCESSING:', column)
        res.append(
            summary_station(dataset, column, ufunc, ufunc_col, n_days=n_days)
        )
    # Place the per-station summaries side by side.
    return pd.concat(res, axis=1)

# FUNCTIONS

## The `summary_station()` function

This function builds a summary for a single station/column and returns it as a `pandas.DataFrame`. Its arguments are:

- `dataset`: the dataset, given as a `pandas.DataFrame` with a datetime index.
- `column`: the single column to process, given as a _string_.
- `ufunc`: the function, or _list_ of functions, to apply. Given as a callable or a _list_ of callables.
- `ufunc_col`: the name, or _list_ of names, labelling each function in `ufunc`. Given as a _string_ or a _list of strings_ with the same length as `ufunc`.
- `n_days='M'`: the length of the window processed within each month. Any value valid for the `freq` parameter of `pd.Grouper` is accepted ([reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)). `'M'` means every month (*M*onth); `'9D'` means every 9 days (*D*ays).
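To make the roles of `ufunc`, `ufunc_col`, and `n_days` more concrete, here is a minimal sketch of the aggregation step for a single month. It is an illustration under assumptions rather than the function itself: the series values, the station name `STA_X`, and the helpers `total` and `n_values` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for January 2001 with values 0, 1, ..., 30.
index = pd.date_range('2001-01-01', '2001-01-31', freq='D')
january = pd.Series(np.arange(31, dtype=float), index=index)

def total(x):
    "Sum of the values in a window"
    return x.sum()

def n_values(x):
    "Number of rows in a window"
    return len(x)

# Split the month into 9-day windows and apply both functions to each window.
agg = january.groupby(pd.Grouper(freq='9D')).agg([total, n_values])

# Relabel the columns as (station, statistic) pairs.
agg.columns = pd.MultiIndex.from_product([['STA_X'], ['sum', 'n_days']])
print(agg)
```

The last window covers only 28 to 31 January, so it holds 4 days instead of 9.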

### The `ufunc` and `ufunc_col` arguments

Users are free to choose which calculations `summary_station` performs. This _notebook_ gives examples using functions available in Python and NumPy, as well as a user-defined one.

In [0]:
# User-defined function
def n_rain(x):
    "Number of rainy days"
    return (x > 0).sum()

myfunc = [np.sum, n_rain, len]
myfunc_col = ['sum', 'n_rain', 'n_days']
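As a rough illustration of what each entry in such a list returns, the sketch below applies the three functions to a small made-up series. The values are hypothetical, and `n_rain` is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def n_rain(x):
    "Number of rainy days"
    return (x > 0).sum()

# Hypothetical four-day window: one dry day, two rainy days, one missing value.
sample = pd.Series([0.0, 1.5, np.nan, 3.0])

results = {
    'sum': float(np.sum(sample)),    # the missing value is skipped
    'n_rain': int(n_rain(sample)),   # NaN > 0 evaluates to False
    'n_days': len(sample),           # counts every row, including NaN
}
print(results)
```

Because `len` counts rows regardless of their content, `n_days` reflects the window length rather than the number of valid observations.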

### Usage (_default_)

If `n_days` is not given, the function processes the data month by month.

In [7]:
summary_station(
    dataset=dataset, column='STA_B', 
    ufunc=myfunc, ufunc_col=myfunc_col)

Unnamed: 0_level_0,STA_B,STA_B,STA_B
Unnamed: 0_level_1,sum,n_rain,n_days
2001-01-31,454.0,18.0,31.0
2001-02-28,298.0,12.0,28.0
2001-03-31,475.0,18.0,31.0
2001-04-30,272.0,12.0,30.0
2001-05-31,86.0,4.0,31.0
...,...,...,...
2015-08-31,0.0,0.0,31.0
2015-09-30,0.0,0.0,30.0
2015-10-31,14.0,1.0,31.0
2015-11-30,165.0,3.0,30.0


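### The `n_days` argument

`n_days` can be set to the number of days to process within **each month**. Windows restart on the first day of every month, so the last window of a month may be shorter than `n_days`.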
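The difference between windowing within each month and windowing the whole series at once can be sketched on a synthetic two-month series. The data below is hypothetical, and the loop is only an approximation of the grouping used in `summary_station()`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering January and February 2001.
index = pd.date_range('2001-01-01', '2001-02-28', freq='D')
series = pd.Series(np.ones(len(index)), index=index)

# Windows restarted every month: group by (year, month) first, then by 15 days.
labels, counts = [], []
for _, month in series.groupby([series.index.year, series.index.month]):
    windows = month.groupby(pd.Grouper(freq='15D')).count()
    labels.extend(windows.index.strftime('%Y-%m-%d'))
    counts.extend(windows.tolist())

# For contrast: 15-day windows running continuously across month boundaries.
continuous = series.groupby(pd.Grouper(freq='15D')).count()
continuous_labels = list(continuous.index.strftime('%Y-%m-%d'))

print(labels, counts)
print(continuous_labels, continuous.tolist())
```

With the monthly restart, 31 January forms a one-day window of its own and February starts a fresh window on the 1st; the continuous version instead carries the third window from 31 January into February.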

In [8]:
# Every 8 days
summary_station(
    dataset=dataset, column='STA_B', 
    ufunc=myfunc, ufunc_col=myfunc_col,
    n_days='8D')

Unnamed: 0_level_0,STA_B,STA_B,STA_B
Unnamed: 0_level_1,sum,n_rain,n_days
2001-01-01,90.0,4.0,8.0
2001-01-09,123.0,5.0,8.0
2001-01-17,192.0,6.0,8.0
2001-01-25,49.0,3.0,7.0
2001-02-01,129.0,5.0,8.0
...,...,...,...
2015-11-25,48.0,1.0,6.0
2015-12-01,78.0,2.0,8.0
2015-12-09,48.0,4.0,8.0
2015-12-17,52.0,3.0,8.0


In [9]:
# Every 15 days
summary_station(
    dataset=dataset, column='STA_C', 
    ufunc=myfunc, ufunc_col=myfunc_col,
    n_days='15D')

Unnamed: 0_level_0,STA_C,STA_C,STA_C
Unnamed: 0_level_1,sum,n_rain,n_days
2001-01-01,158.08,13.0,15.0
2001-01-16,146.94,14.0,15.0
2001-01-31,22.96,1.0,1.0
2001-02-01,157.80,12.0,15.0
2001-02-16,77.45,11.0,13.0
...,...,...,...
2015-11-01,152.00,7.0,15.0
2015-11-16,76.00,4.0,15.0
2015-12-01,23.00,1.0,15.0
2015-12-16,46.00,8.0,15.0


## The `summary_all()` function

This function simply runs `summary_station()` on every column, or on the columns selected with the `columns` argument. The arguments `dataset`, `ufunc`, `ufunc_col`, and `n_days='M'` are the same as in `summary_station()`; the only difference is the `columns` argument, which accepts a single column name or a _list_ of names and defaults to all columns.
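Since the per-station results are concatenated column-wise, the output has two column levels: station and statistic. The sketch below, built from hypothetical numbers rather than from `summary_all()` itself, shows one way such a frame could be sliced.

```python
import pandas as pd

index = pd.to_datetime(['2001-01-01', '2001-01-17'])
stats = ['n_days', 'SUM']

# Hypothetical per-station summaries with (station, statistic) column labels.
sta_a = pd.DataFrame(
    [[16, 0.0], [15, 0.0]], index=index,
    columns=pd.MultiIndex.from_product([['STA_A'], stats]))
sta_c = pd.DataFrame(
    [[16, 185.80], [15, 142.18]], index=index,
    columns=pd.MultiIndex.from_product([['STA_C'], stats]))

# Same concatenation step as in summary_all().
combined = pd.concat([sta_a, sta_c], axis=1)

one_station = combined['STA_C']                  # every statistic for one station
one_stat = combined.xs('SUM', axis=1, level=1)   # one statistic for every station
print(one_station)
print(one_stat)
```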

In [0]:
# Using more functions
def n_rain(x):
    "Number of rainy days"
    return (x > 0).sum()

def n_dry(x):
    "Number of dry days (zero or missing values)"
    return np.logical_or(x == 0, x.isna()).sum()

myfunc_all = [len, n_rain, n_dry, np.sum, np.mean, np.std]
myfunc_all_col = ['n_days', 'n_rain', 'n_dry', 'SUM', 'MEAN', 'STD']

### All columns

In [11]:
summary_all(
    dataset=dataset,
    ufunc=myfunc_all, ufunc_col=myfunc_all_col,
    n_days='7D')

PROCESSING: STA_A
PROCESSING: STA_B
PROCESSING: STA_C


Unnamed: 0_level_0,STA_A,STA_A,STA_A,STA_A,STA_A,STA_A,STA_B,STA_B,STA_B,STA_B,STA_B,STA_B,STA_C,STA_C,STA_C,STA_C,STA_C,STA_C
Unnamed: 0_level_1,n_days,n_rain,n_dry,SUM,MEAN,STD,n_days,n_rain,n_dry,SUM,MEAN,STD,n_days,n_rain,n_dry,SUM,MEAN,STD
2001-01-01,7.0,0.0,7.0,0.0,0.000000,0.000000,7.0,3.0,4.0,58.0,8.285714,16.499639,7.0,5.0,2.0,15.43,2.204286,3.384252
2001-01-08,7.0,0.0,7.0,0.0,0.000000,0.000000,7.0,4.0,3.0,68.0,9.714286,13.300555,7.0,7.0,0.0,125.30,17.900000,32.992996
2001-01-15,7.0,0.0,7.0,0.0,0.000000,0.000000,7.0,5.0,2.0,224.0,32.000000,37.434387,7.0,7.0,0.0,93.38,13.340000,12.806884
2001-01-22,7.0,0.0,7.0,0.0,0.000000,0.000000,7.0,4.0,3.0,60.0,8.571429,10.906529,7.0,6.0,1.0,55.88,7.982857,5.429001
2001-01-29,3.0,0.0,3.0,0.0,0.000000,0.000000,3.0,2.0,1.0,44.0,14.666667,13.650397,3.0,3.0,0.0,37.99,12.663333,11.376741
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-12-01,7.0,3.0,4.0,86.0,12.285714,25.408098,7.0,1.0,6.0,43.0,6.142857,16.252472,7.0,0.0,7.0,0.00,0.000000,0.000000
2015-12-08,7.0,5.0,2.0,55.0,7.857143,12.266874,7.0,3.0,4.0,67.0,9.571429,13.513662,7.0,0.0,7.0,0.00,0.000000,0.000000
2015-12-15,7.0,7.0,0.0,105.0,15.000000,11.503623,7.0,4.0,3.0,55.0,7.857143,9.118271,7.0,5.0,2.0,44.00,6.285714,8.220184
2015-12-22,7.0,7.0,0.0,136.0,19.428571,12.053452,7.0,3.0,4.0,51.0,7.285714,11.954278,7.0,2.0,5.0,18.00,2.571429,5.255383


### Selected columns

In [12]:
summary_all(
    dataset=dataset, columns=['STA_A', 'STA_C'],
    ufunc=myfunc_all, ufunc_col=myfunc_all_col,
    n_days='16D')

PROCESSING: STA_A
PROCESSING: STA_C


Unnamed: 0_level_0,STA_A,STA_A,STA_A,STA_A,STA_A,STA_A,STA_C,STA_C,STA_C,STA_C,STA_C,STA_C
Unnamed: 0_level_1,n_days,n_rain,n_dry,SUM,MEAN,STD,n_days,n_rain,n_dry,SUM,MEAN,STD
2001-01-01,16.0,0.0,16.0,0.0,0.000000,0.000000,16.0,14.0,2.0,185.80,11.612500,22.786906
2001-01-17,15.0,0.0,15.0,0.0,0.000000,0.000000,15.0,14.0,1.0,142.18,9.478667,9.163421
2001-02-01,16.0,0.0,16.0,0.0,0.000000,0.000000,16.0,13.0,3.0,157.81,9.863125,15.319503
2001-02-17,12.0,0.0,12.0,0.0,0.000000,0.000000,12.0,10.0,2.0,77.44,6.453333,7.835151
2001-03-01,16.0,0.0,16.0,0.0,0.000000,0.000000,16.0,14.0,2.0,9.01,0.563125,0.719228
...,...,...,...,...,...,...,...,...,...,...,...,...
2015-10-17,15.0,9.0,6.0,97.0,6.466667,9.500877,15.0,1.0,14.0,6.00,0.400000,1.549193
2015-11-01,16.0,12.0,4.0,190.0,11.875000,14.655488,16.0,7.0,9.0,152.00,9.500000,19.721393
2015-11-17,14.0,13.0,1.0,384.0,27.428571,15.360861,14.0,4.0,10.0,76.00,5.428571,13.119066
2015-12-01,16.0,10.0,6.0,197.0,12.312500,19.269038,16.0,1.0,15.0,23.00,1.437500,5.750000


# Changelog

```
- 20191217 - 1.0.0 - Initial
```

#### Copyright &copy; 2019 [Taruma Sakti Megariansyah](https://taruma.github.io)

Source code in this notebook is licensed under the [MIT License](https://choosealicense.com/licenses/mit/). Data in this notebook is licensed under the [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.
