<a href="https://colab.research.google.com/gist/taruma/6d48b3ec9d601019c15fb5833ae03730/taruma_hk88_ambil_dataset_harian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based on issue [#88](https://github.com/taruma/hidrokit/issues/88): **request: ambil dataset hujan harian** (request: fetch a daily rainfall dataset)

Related issue:

- `hidrokit.contrib.taruma.hk79` [#79](https://github.com/taruma/hidrokit/issues/79) \([see the notebook/manual](https://nbviewer.jupyter.org/gist/taruma/05dab67fac8313a94134ac02d0398897)\): **request: ambil dataset hujan jam-jaman dari excel** (request: fetch an hourly rainfall dataset from Excel)

Problem description:

- Similar to issue #79, except that the dataset contains daily data.
- Read a daily dataset from an Excel file laid out as a _pivot table_.
- Convert that table into a `pandas.DataFrame` in which each row is an observation (one day) and each column is a station.
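
As a rough illustration of the target shape, the sketch below builds a tiny synthetic pivot table (day-of-month rows, month columns) and flattens it into a date-indexed column. The station name `STATION_A`, the two-month layout, and the values are made up for illustration; they are not taken from the template file.

```python
import numpy as np
import pandas as pd

# Hypothetical 31-row x 2-column pivot for January and February 2001.
# Rows are days of the month, columns are months.
pivot = pd.DataFrame(
    {1: np.arange(1.0, 32.0), 2: np.arange(101.0, 132.0)},
    index=range(1, 32),
)
pivot.loc[29:31, 2] = np.nan  # 29-31 February do not exist in 2001

# Flatten column by column (month by month) and drop the impossible dates.
values = pivot.melt()['value'].dropna().to_numpy()

# One row per day, one column per station.
index = pd.date_range('2001-01-01', periods=len(values), freq='D')
long_df = pd.DataFrame({'STATION_A': values}, index=index)
print(long_df.shape)
```

Dropping by `NaN` is only a shortcut for this toy example: it would also discard genuinely missing observations, so dropping by position (as the notebook code does) is the safer choice for real data.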

# SETUP AND DATASET

In [1]:
try:
    import hidrokit
except ModuleNotFoundError:
    !pip install hidrokit -q
    import hidrokit
print(f'hidrokit version: {hidrokit.__version__}')

hidrokit version: 0.3.4


In [0]:
# Download the dataset
!wget -O sample.xlsx "https://taruma.github.io/assets/hidrokit_dataset/hidrokit_daily_template.xlsx" -q
FILE = 'sample.xlsx'

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# CODE

In [0]:
from calendar import isleap
import pandas as pd


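def _melt_to_array(df, year):
    """Flatten a 31 x 12 (day x month) table into a 1D array for one year"""
    # ref: hidrokit.contrib.taruma.hk43
    # Positions of impossible dates in the flattened 372-value array:
    # 59-61 = 29-31 Feb, 123 = 31 Apr, 185 = 31 Jun, 278 = 31 Sep, 340 = 31 Nov.
    _drop = [59, 60, 61, 123, 185, 278, 340]
    # In a leap year 29 Feb (position 59) is a real date and is kept.
    _drop_leap = [60, 61, 123, 185, 278, 340]

    data = df.melt().drop('variable', axis=1)
    if isleap(year):
        return data['value'].drop(_drop_leap).values
    else:
        return data['value'].drop(_drop).values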


def _index_daily(year):
    """Return DatetimeIndex object for one year"""
    # Explicit start/end dates avoid the `closed` argument, which newer
    # pandas releases no longer accept in date_range().
    return pd.date_range('{}-01-01'.format(year), '{}-12-31'.format(year))


def _yearly_df(df, year, station_name):
    """Create dataframe for one year"""
    return pd.DataFrame(
        data=_melt_to_array(df, year),
        index=_index_daily(year),
        columns=[station_name]
    )


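def _data_from_sheet(df, station_name, as_df=True):
    """Read dataset from single sheet as dataframe (or list of dataframes)"""
    # The number of years is stored in the first row, second column.
    n_years = int(df.iloc[0, 1])

    frames = []
    # Each year occupies a 33-row block; the first block starts at row 2.
    for i in range(2, n_years * 33, 33):
        year = int(df.iloc[i, 1])
        # 31 day rows x 12 month columns (columns 4 to 15).
        pivot = df.iloc[i:i + 31, 4:16]
        data = _yearly_df(pivot, year, station_name)
        frames.append(data)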

    if as_df:
        return pd.concat(frames, sort=True)
    else:
        return frames


def read_workbook(io, stations, as_df=True):
    """Read dataset from single file based on stations"""
    excel = pd.ExcelFile(io)

    data = []
    for station in stations:
        df = pd.read_excel(excel, sheet_name=station, header=None)
        data.append(_data_from_sheet(df, station))

    if as_df:
        return pd.concat(data, sort=True, axis=1)
    else:
        return data
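
The hard-coded drop lists in `_melt_to_array` can be cross-checked against the calendar. The sketch below is not part of the module; `invalid_positions` is a hypothetical helper that assumes the same month-major flattening of a 31 x 12 grid and derives the positions that do not correspond to real dates.

```python
from calendar import monthrange


def invalid_positions(year):
    """Positions in a flattened 31 x 12 (month-major) grid that are not real dates."""
    positions = []
    for month in range(1, 13):
        n_days = monthrange(year, month)[1]  # number of days in this month
        offset = (month - 1) * 31            # where this month starts in the flat array
        positions.extend(range(offset + n_days, offset + 31))
    return positions


print(invalid_positions(2003))  # [59, 60, 61, 123, 185, 278, 340]
print(invalid_positions(2004))  # [60, 61, 123, 185, 278, 340]
```

The two results match `_drop` and `_drop_leap`, and removing them from the 372 flattened values leaves 365 or 366 entries, the same length as the index built by `_index_daily`.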


# USAGE

In [5]:
from hidrokit.contrib.taruma import hk79

# Get the Excel file information using the .hk79 module
data_info = hk79._get_info(FILE, config_sheet='_INFO')
print(':: FILE INFORMATION')
print(data_info)

:: FILE INFORMATION
{'key': 'VALUE', 'n_stations': 2, 'stations': 'AURENE, TYBALT', 'source': 'RATA SUM', 'station_1_years': '2002, 2003, 2004', 'station_2_years': '2007, 2008'}


In [6]:
stations = data_info['stations'].replace(' ', '').split(',')
print('station names in the file:', stations)

station names in the file: ['AURENE', 'TYBALT']
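
Note that `replace(' ', '')` removes every space, including any inside a station name. If names containing spaces were ever used, a sketch like the following would split on commas and trim only the surrounding whitespace. The helpers `parse_stations` and `parse_years` and the name `KOTA BARU` are made up for illustration and are not part of hidrokit.

```python
def parse_stations(text):
    """Split a comma-separated station list, trimming whitespace around names."""
    return [name.strip() for name in text.split(',') if name.strip()]


def parse_years(text):
    """Split a comma-separated list of years into integers."""
    return [int(year) for year in text.split(',')]


print(parse_stations('AURENE, TYBALT'))      # ['AURENE', 'TYBALT']
print(parse_stations('AURENE,  KOTA BARU'))  # ['AURENE', 'KOTA BARU']
print(parse_years('2002, 2003, 2004'))       # [2002, 2003, 2004]
```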


## Read a single station

In [7]:
aurene = read_workbook(FILE, ['AURENE'])
aurene.info()
aurene.head()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1096 entries, 2002-01-01 to 2004-12-31
Freq: D
Data columns (total 1 columns):
AURENE    1096 non-null object
dtypes: object(1)
memory usage: 17.1+ KB


| | AURENE |
|---|---|
| 2002-01-01 | 17.08 |
| 2002-01-02 | 16.28 |
| 2002-01-03 | 20.32 |
| 2002-01-04 | 18.34 |
| 2002-01-05 | 13.16 |
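
The `info()` output above reports the column as `object` rather than a numeric dtype, presumably because the sheet is read with `header=None` and so mixes text and numbers in the same columns. A minimal sketch for converting such a column, using a small made-up frame instead of the real workbook (the `'-'` marker for a missing value is an assumption, not something found in the template):

```python
import pandas as pd

# Stand-in for a frame returned by read_workbook(); values are stored as objects.
index = pd.date_range('2002-01-01', periods=4, freq='D')
raw = pd.DataFrame({'AURENE': [17.08, 16.28, '-', 18.34]}, index=index, dtype=object)

# Coerce every column to numbers; anything that cannot be parsed becomes NaN.
numeric = raw.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)
```

With a numeric dtype, operations such as `numeric.resample('M').sum()` or plotting behave as expected.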


## Read more than one station

In [8]:
dataset = read_workbook(FILE, ['TYBALT', 'AURENE'])
dataset.sort_index()

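| | TYBALT | AURENE |
|---|---|---|
| 2002-01-01 | NaN | 17.08 |
| 2002-01-02 | NaN | 16.28 |
| 2002-01-03 | NaN | 20.32 |
| 2002-01-04 | NaN | 18.34 |
| 2002-01-05 | NaN | 13.16 |
| ... | ... | ... |
| 2008-12-27 | 134.83 | NaN |
| 2008-12-28 | 81.88 | NaN |
| 2008-12-29 | 20.14 | NaN |
| 2008-12-30 | 208.54 | NaN |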
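
Because the two stations cover different years, aligning them on a shared date index leaves missing values wherever a station has no record. The sketch below reproduces that behaviour with two short made-up series and shows one way to find each station's period of record; the dates and values are illustrative only.

```python
import pandas as pd

# Two hypothetical stations with non-overlapping records.
a = pd.DataFrame({'AURENE': [1.0, 2.0, 3.0]},
                 index=pd.date_range('2002-01-01', periods=3, freq='D'))
t = pd.DataFrame({'TYBALT': [4.0, 5.0]},
                 index=pd.date_range('2007-01-01', periods=2, freq='D'))

# Column-wise concat aligns on the union of both date indexes.
combined = pd.concat([t, a], axis=1).sort_index()

# First and last date with a valid value, per station.
record = {col: (combined[col].first_valid_index(), combined[col].last_valid_index())
          for col in combined.columns}
print(combined)
print(record)
```

Note that `sort_index()` returns a sorted copy; assign the result (as done for `combined` here) if the sorted order should persist.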


# Changelog

```
- 20191213 - 1.0.0 - Initial
- 20191217 - 1.0.1 - Fix read_workbook() issue#95
```

#### Copyright &copy; 2019 [Taruma Sakti Megariansyah](https://taruma.github.io)

Source code in this notebook is licensed under an [MIT License](https://choosealicense.com/licenses/mit/). Data in this notebook is licensed under a [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) license.
