# Getting started with hillmaker

In this tutorial we'll focus on basic use of hillmaker for analyzing arrivals, departures, and occupancy by time of day and day of week for a typical *discrete entity flow system*. A few examples of such systems include:

- patients arriving, undergoing some sort of care process and departing some healthcare system (e.g. emergency department, surgical recovery, nursing unit, outpatient clinic, and many more),
- customers renting, using, and returning bikes in a bike share system,
- users of licensed software checking out, using, and checking back in a software license,
- products undergoing some sort of manufacturing or assembly process, where occupancy is work in process (WIP),
- patrons arriving, dining and leaving a restaurant,
- travelers renting, residing in, and checking out of a hotel,
- flights taking off and arriving at their destination,
- ...

Basically, any sort of discrete [stock and flow system](https://en.wikipedia.org/wiki/Stock_and_flow) for which you are interested in time of day and day of week specific statistical summaries of occupancy, arrivals and departures, and have raw data on the arrival and departure times, is fair game for hillmaker.

## Installation

You can use pip to install hillmaker into the Python virtual environment of your choice.

```
pip install hillmaker
```

## Ways of using hillmaker

There are currently three ways of using hillmaker. 

1. Command line interface (CLI)
2. Calling a single Python function
3. An object oriented API in Python

The plan is to add a fourth option:

4. Through a GUI (not implemented yet)

Depending on your level of comfort with Python, you can choose the method that works best for you. This Getting Started tutorial will demo all three ways of using hillmaker and subsequent tutorials will go into more detail. There are numerous input parameters that can be used to customize the behavior of hillmaker and we'll only touch on a few in this tutorial.

## Module imports
To run hillmaker we only need to import a few modules. Often we will be using pandas as part of our analysis and we'll import that as well. 

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pandas as pd
import hillmaker as hm

## A prototypical example - occupancy analysis of a hospital Short Stay Unit

Patients flow through a short stay unit (SSU) for a variety of procedures, tests or therapies. Let's assume patients can be classified into one of five patient types: ART (arteriogram), CAT (post cardiac-cath), MYE (myelogram), IVT (IV therapy), and OTH (other). We are interested in occupancy statistics by time of day and day of week to support things like staff scheduling and capacity planning.

From one of our hospital information systems we were able to get raw data about the entry and exit times of each patient and exported the data to a csv file. We call each row of such data a *stop* (as in, the patient stopped here for a while). Let's take a peek at the data by first reading the csv file into a pandas `DataFrame`.

In [None]:
ssu_stopdata = 'https://raw.githubusercontent.com/misken/hillmaker-examples/main/data/ssu_2024.csv'
# ssu_stopdata = './data/ssu_2024.csv'
stops_df = pd.read_csv(ssu_stopdata, parse_dates=['InRoomTS','OutRoomTS'])
stops_df.info() 

In [None]:
stops_df.head()

Before running hillmaker, we need to know the timeframe for which we have data. 

In [None]:
stops_df['InRoomTS'].min()

In [None]:
stops_df['InRoomTS'].max()

Looks like we have data from January through September of 2024. Since patients usually stay for less than 24 hours in an SSU, we'll do our occupancy analysis starting on January 2, 2024 (since the 1st is a holiday) and ending on September 30, 2024. Later in the tutorial we'll discuss the important issues of choosing an appropriate analysis timeframe and *horizon effects*.

As part of an operational analysis we would like to compute a number of relevant statistics, such as:

- mean and 95th percentile of overall SSU occupancy by hour of day and day of week,
- similar hourly statistics for patient arrivals and departures,
- all of the above but by patient type as well.

In addition to tabular summaries, let's make some plots showing the mean and 95th percentile of occupancy by time of day and day of week.
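To make the target statistics concrete, here is a minimal pandas sketch of computing the mean and 95th percentile of occupancy by day of week and hour of day. It runs on a made-up hourly occupancy table: the random data, the column names (`datetime`, `occupancy`, `day_of_week`, `hour_of_day`) and the `p95` helper are assumptions for illustration only, not hillmaker's output schema or its internal method.

```python
import numpy as np
import pandas as pd

# Made-up hourly occupancy values over the tutorial's analysis range.
# Column names here are illustrative assumptions, not hillmaker's schema.
hours = pd.date_range('2024-01-02', '2024-09-30 23:00',
                      freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(seed=42)
demo_df = pd.DataFrame({'datetime': hours,
                        'occupancy': rng.poisson(lam=5, size=len(hours))})

# Derive the two grouping fields from the timestamp
demo_df['day_of_week'] = demo_df['datetime'].dt.day_name()
demo_df['hour_of_day'] = demo_df['datetime'].dt.hour


def p95(x):
    """95th percentile of a pandas Series."""
    return x.quantile(0.95)


# One row per (day of week, hour of day) combination
summary_df = (demo_df
              .groupby(['day_of_week', 'hour_of_day'])['occupancy']
              .agg(['count', 'mean', p95]))
print(summary_df.head())
```

With hourly bins there is one summary row for each of the 7 x 24 hours of the week. hillmaker produces summaries of this general kind, along with many more statistics, directly from the raw stop data.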

## Running hillmaker via the command line interface (CLI)

To run hillmaker from the command line, first activate the virtual environment in which hillmaker is installed. Let's see the help for hillmaker's CLI:

```bash
> hillmaker -h
```

```bash
usage: hillmaker [--scenario_name SCENARIO_NAME] [--data DATA]
                 [--in_field IN_FIELD] [--out_field OUT_FIELD]
                 [--start_analysis_dt START_ANALYSIS_DT]
                 [--end_analysis_dt END_ANALYSIS_DT] [--config CONFIG]
                 [--cat_field CAT_FIELD] [--bin_size_minutes BIN_SIZE_MINUTES]
                 [--cats_to_exclude [CATS_TO_EXCLUDE ...]]
                 [--occ_weight_field OCC_WEIGHT_FIELD]
                 [--percentiles [PERCENTILES ...]] [--los_units LOS_UNITS]
                 [--csv_export_path CSV_EXPORT_PATH] [--no_dow_plots]
                 [--no_week_plots] [--plot_export_path PLOT_EXPORT_PATH]
                 [--plot_style PLOT_STYLE] [--figsize FIGSIZE FIGSIZE]
                 [--bar_color_mean BAR_COLOR_MEAN] [--alpha ALPHA]
                 [--plot_percentiles PLOT_PERCENTILES [PLOT_PERCENTILES ...]]
                 [--pctile_color PCTILE_COLOR [PCTILE_COLOR ...]]
                 [--pctile_linestyle PCTILE_LINESTYLE [PCTILE_LINESTYLE ...]]
                 [--pctile_linewidth [PCTILE_LINEWIDTH ...]] [--cap CAP]
                 [--cap_color CAP_COLOR] [--xlabel XLABEL] [--ylabel YLABEL]
                 [--main_title MAIN_TITLE] [--subtitle SUBTITLE]
                 [--first_dow FIRST_DOW] [--edge_bins EDGE_BINS]
                 [--highres_bin_size_minutes HIGHRES_BIN_SIZE_MINUTES]
                 [--keep_highres_bydatetime] [--verbosity VERBOSITY] [-h]

Occupancy analysis by time of day and day of week

Required arguments (either on command line or via config file):
  --scenario_name SCENARIO_NAME
                        Used in output filenames
  --data DATA           Path to csv file containing the stop data to be
                        processed
  --in_field IN_FIELD   Column name corresponding to the arrival times
  --out_field OUT_FIELD
                        Column name corresponding to the departure times
  --start_analysis_dt START_ANALYSIS_DT
                        Starting datetime for the analysis (use yyyy-mm-dd
                        format)
  --end_analysis_dt END_ANALYSIS_DT
                        Ending datetime for the analysis (use yyyy-mm-dd
                        format)

Optional arguments:
  --config CONFIG       Configuration file (TOML format) containing input
                        parameter arguments and values. Input parameters set
                        via a config file will override parameters values
                        passed via the command line.
  --cat_field CAT_FIELD
                        Column name corresponding to the categories. If None,
                        then only overall occupancy is analyzed.
  --bin_size_minutes BIN_SIZE_MINUTES
                        Number of minutes in each time bin of the day
                        (default=60) for aggregate statistics and plots.
  --cats_to_exclude [CATS_TO_EXCLUDE ...]
                        Category values to exclude from the analysis.
  --occ_weight_field OCC_WEIGHT_FIELD
                        Column name corresponding to occupancy weights. If
                        None, then weight of 1.0 is used. Default is None.
  --percentiles [PERCENTILES ...]
                        Which percentiles to compute
  --los_units LOS_UNITS
                        The time units for length of stay analysis. See https:
                        //pandas.pydata.org/docs/reference/api/pandas.Timedelt
                        a.html for allowable values (smallest value allowed is
                        'seconds', largest is 'days'. The default is 'hours'.
  --csv_export_path CSV_EXPORT_PATH
                        Destination path for exported csv files, default is
                        current directory.
  --no_dow_plots        If set, no day of week plots are created.
  --no_week_plots       If set, no weekly plots are created.
  --plot_export_path PLOT_EXPORT_PATH
                        Destination path for exported plots, default is
                        current directory.
  --plot_style PLOT_STYLE
                        Matplotlib style name.
  --figsize FIGSIZE FIGSIZE
                        Figure size
  --bar_color_mean BAR_COLOR_MEAN
                        Matplotlib color name for the bars representing mean
                        values.
  --alpha ALPHA         Transparency for bars, default=0.5.
  --plot_percentiles PLOT_PERCENTILES [PLOT_PERCENTILES ...]
                        Which percentiles to plot
  --pctile_color PCTILE_COLOR [PCTILE_COLOR ...]
                        Line color for each percentile series plotted. Order
                        should match order of percentiles list.
  --pctile_linestyle PCTILE_LINESTYLE [PCTILE_LINESTYLE ...]
                        Line style for each percentile series plotted.
  --pctile_linewidth [PCTILE_LINEWIDTH ...]
                        Line width for each percentile series plotted.
  --cap CAP             Capacity level line to include in occupancy plots
  --cap_color CAP_COLOR
                        Matplotlib color code.
  --xlabel XLABEL       x-axis label for plots.
  --ylabel YLABEL       y-axis label for plots.
  --main_title MAIN_TITLE
                        Main title for plot. Default = '{Occupancy, Arrivals,
                        Departures} by time of day and day of week'
  --subtitle SUBTITLE   Subtitle for plot. Default = 'Scenario:
                        {scenario_name}'
  --first_dow FIRST_DOW
                        Controls which day of week appears first in plot. One
                        of 'mon', 'tue', 'wed', 'thu', 'fri', 'sat, 'sun'
  -h, --help

Advanced optional arguments:
  --edge_bins EDGE_BINS
                        Occupancy contribution method for arrival and
                        departure bins. 1=fractional, 2=entire bin
  --highres_bin_size_minutes HIGHRES_BIN_SIZE_MINUTES
                        Number of minutes in each time bin of the day used for
                        initial computation of the number of arrivals,
                        departures, and the occupancy level. This value should
                        be <= `bin_size_minutes`. The shorter the duration of
                        stays, the smaller the resolution should be if using
                        edge_bins=2. See docs for more details.
  --keep_highres_bydatetime
                        Save the high resolution bydatetime dataframe in hills
                        attribute.
  --verbosity VERBOSITY
                        Used to set level in loggers. 0=logging.WARNING,
                        1=logging.INFO (default), 2=logging.DEBUG
```

There are several required arguments: 

- SCENARIO_NAME - a scenario name, 
- DATA - the path to the csv file containing the stop data, 
- IN_FIELD, OUT_FIELD - the field names containing the arrival times and the departure times, 
- START_ANALYSIS_DT, END_ANALYSIS_DT - starting and ending dates for the analysis.

There are also numerous optional arguments controlling how hillmaker works and which outputs are created.

Let's run hillmaker by specifying the required arguments as well as an output path for plots and csv files. The stop data file, `ssu_2024.csv`, is in the `data` folder. We'll output plots and csv summary files to the `output` folder. Typically, we would also specify a category field, which in this case is `PatType`. We'll use 60 minutes for the time bin size.



By default, hillmaker prints out several informational messages. You can suppress these with `--verbosity 0`. You can get even more detailed status messages (useful for debugging) by using `--verbosity 2`.

```bash
> hillmaker --scenario_name cli_demo_ssu_60 --data ./data/ssu_2024.csv \
--in_field InRoomTS --out_field OutRoomTS --cat_field PatType --bin_size_minutes 60 \
--start_analysis_dt 2024-01-02 --end_analysis_dt 2024-09-30 --csv_export_path output --plot_export_path output --ylabel Patients 
```



In [None]:
!hillmaker --scenario_name cli_demo_ssu_60 --data ./data/ssu_2024.csv \
--in_field InRoomTS --out_field OutRoomTS --cat_field PatType --bin_size_minutes 60 \
--start_analysis_dt 2024-01-02 --end_analysis_dt 2024-09-30 --csv_export_path output --plot_export_path output --ylabel Patients 



## CSV file outputs
When you use the CLI, CSV versions of the output tables are exported to `CSV_EXPORT_PATH`.

In [None]:
for fname in Path('output/').glob('cli_demo_ssu_60*.csv'):
    print(fname)

There are four groups of files, each beginning with the scenario name `'cli_demo_ssu_60'`. 

- `occupancy`, `arrivals`, `departures` - summary statistics for occupancy, arrivals and departures
- `bydatetime` - number of arrivals, departures and occupancy level by datetime bin over the analysis range (e.g. individual hours on each date)

Usually it's the occupancy summaries that we are most interested in. From each occupancy related filename, we can infer the grouping levels used for the summary statistics.

### \<scenario name\>_occupancy_dow_binofday.csv

This is probably the most used summary as it gives us overall occupancy statistics by time bin of day (in this case, hourly) and day of week. We can read it into a pandas `DataFrame` and take a look. Since we used hourly time bins, there will be 168 rows in the summary. Numerous summary statistics are computed for each hour of the week.

In [None]:
occ_dow_binofday_df = pd.read_csv('output/cli_demo_ssu_60_occupancy_dow_binofday.csv')
occ_dow_binofday_df.head(30)


From the `count` field we can see that there were 39 Mondays in the analysis date range. It is this `DataFrame` that is used to create the weekly and day of week plots.
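As a quick sanity check on the `count` values, we can count how often each day of week occurs in the analysis date range using plain pandas. This is a small independent check, not part of hillmaker; the variable names are arbitrary.

```python
import pandas as pd

# Every calendar date in the analysis range used in this tutorial
analysis_dates = pd.date_range('2024-01-02', '2024-09-30')

# Number of times each day of week occurs in the range
dow_counts = analysis_dates.day_name().value_counts()
print(dow_counts['Monday'])  # 39
```

The range covers 273 days, which is exactly 39 weeks, so every day of the week occurs 39 times.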

Plots are created in PNG format and exported to `PLOT_EXPORT_PATH`.

In [None]:
[fname.name for fname in Path('output/').glob('cli_demo_ssu_60*occupancy*.png')]

In [None]:
from IPython.display import Image
Image('output/cli_demo_ssu_60_occupancy_week.png')

There are also day of week specific plots.

In [None]:
Image('output/cli_demo_ssu_60_occupancy_wed.png')

### \<scenario name\>_occupancy_PatType_dow_binofday.csv

This is the most detailed summary as it is grouped by category (patient type in this example), day of week and hour of day.  This `DataFrame` is useful for seeing how individual patient types contribute to overall occupancy in the SSU. 


In [None]:
occ_PatType_dow_binofday_df = pd.read_csv('output/cli_demo_ssu_60_occupancy_PatType_dow_binofday.csv')
occ_PatType_dow_binofday_df.iloc[50:75]


The other two occupancy related csv files are summaries aggregated over time. One, `cli_demo_ssu_60_occupancy_PatType.csv`, is grouped by the category field and the other, `cli_demo_ssu_60_occupancy.csv`, is aggregated both over time and category.


In [None]:
occ_PatType_df = pd.read_csv('output/cli_demo_ssu_60_occupancy_PatType.csv')
occ_PatType_df


In [None]:
occ_df = pd.read_csv('output/cli_demo_ssu_60_occupancy.csv')
occ_df


### The bydatetime files
The remaining CSV files provide detailed occupancy, arrival and departure values for every time bin over the entire analysis range. For example, here's what `cli_demo_ssu_60_bydatetime_datetime.csv` looks like.

In [None]:
bydatetime_df = pd.read_csv('output/cli_demo_ssu_60_bydatetime_datetime.csv')
bydatetime_df.head(15)


Notice that the occupancy field contains non-integer values. This is by design. For now, it is enough to say that a patient's contribution to occupancy in each time bin is the fraction of that time bin they spent in the system. For example, if a patient arrives at 7:15 a.m. and departs at 9:30 a.m., then their contribution to occupancy in the `bydatetime` dataframe is:

| Time bin | Occupancy |
|---|---|
| 07-08 | 0.75 |
| 08-09 | 1.00 |
| 09-10 | 0.50 |

For all the details on how hillmaker computes occupancy and options for controlling those computations, see {doc}`topic guide on occupancy computation <computing_occupancy>`.
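The fractional occupancy idea can be sketched in a few lines of pandas. The `fractional_occupancy` function below is a hypothetical helper written for this illustration; it is not part of hillmaker and is not how hillmaker computes occupancy internally. It simply reproduces the arithmetic for a single stay.

```python
import pandas as pd


def fractional_occupancy(arrival, departure, bin_size_minutes=60):
    """Return the fraction of each time bin occupied by a single stay.

    Hypothetical helper for illustration only; not a hillmaker function.
    """
    arrival = pd.Timestamp(arrival)
    departure = pd.Timestamp(departure)
    bin_width = pd.Timedelta(minutes=bin_size_minutes)

    contributions = {}
    # Start with the bin that contains the arrival time
    bin_start = arrival.floor(f'{bin_size_minutes}min')
    while bin_start < departure:
        bin_end = bin_start + bin_width
        # Portion of this bin during which the entity was present
        overlap = min(departure, bin_end) - max(arrival, bin_start)
        contributions[bin_start] = overlap / bin_width
        bin_start = bin_end
    return pd.Series(contributions, name='occupancy')


# The stay from the example: arrive at 7:15 a.m., depart at 9:30 a.m.
occ = fractional_occupancy('2024-01-02 07:15', '2024-01-02 09:30')
print(occ)
```

For this stay the three hourly bins get 0.75, 1.0 and 0.5. The values sum to 2.25, which is the length of stay in hours.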

## Using the `make_hills` function

Before the creation of the object-oriented API, hillmaker could be used by calling a single module-level function called `make_hills`. This type of legacy use is still possible. The `make_hills` function returns a dictionary containing DataFrames and plots along with a few other items. We'll use the same example but will use two-hour time bins.

Like the CLI, the legacy `make_hills` function can create and export all plots, as well as the dataframes as CSV files. This behavior can be more finely controlled through input arguments. See {doc}`using_make_hills` for all the details.

Notice that since we already read the CSV data into a pandas `DataFrame` named `stops_df`, we can use that for the `data=` argument.


In [None]:
# Required inputs
scenario_name = 'func_demo_ssu_120'
in_field_name = 'InRoomTS'
out_field_name = 'OutRoomTS'
start_date = '2024-01-02'
end_date = '2024-09-30'

# Optional inputs
cat_field_name = 'PatType'
verbosity = 1  # INFO level logging
csv_export_path = './output'
plot_export_path = './output'
bin_size_minutes = 120

# Use legacy function interface
hills = hm.make_hills(scenario_name=scenario_name, data=stops_df,
              in_field=in_field_name, out_field=out_field_name,
              start_analysis_dt=start_date, end_analysis_dt=end_date,
              cat_field=cat_field_name,
              bin_size_minutes=bin_size_minutes,
              csv_export_path=csv_export_path, plot_export_path=plot_export_path, verbosity=verbosity)

# Get and display occupancy plot
occ_plot = hm.get_plot(hills, 'occupancy')
occ_plot

This histogram looks "blockier" due to the two-hour bin sizes.
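The effect of bin size on plot resolution is simple arithmetic. The short sketch below (plain Python, independent of hillmaker) counts the number of time bins per day and per week for the bin sizes used in this tutorial.

```python
MINUTES_PER_DAY = 24 * 60

bins_per_week = {}
for bin_size_minutes in (30, 60, 120):
    bins_per_day = MINUTES_PER_DAY // bin_size_minutes
    bins_per_week[bin_size_minutes] = 7 * bins_per_day
    print(f'{bin_size_minutes:>3} minute bins: {bins_per_day} per day, '
          f'{7 * bins_per_day} per week')
```

Two-hour bins give 84 bars across the week instead of the 168 we get with hourly bins, so each bar covers twice as much time.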

## Using the object oriented hillmaker API

For more control over hillmaker you can use the object oriented API. In this tutorial we'll just do a brief introduction. You can get more details in {doc}`using_oo_api`.

The main steps in using the API are:

- create a `Scenario` object initialized with your hillmaker inputs,
- call the `make_hills` method to run hillmaker and store the outputs in the `Scenario` object,
- use methods to retrieve `DataFrame` objects, plots and other outputs.

Here's a brief example. We'll use the same inputs as in the first example, except for setting `bin_size_minutes=30`. By default, plots aren't automatically created or exported. We will use the `export_all_week_plots` argument to create and export just the weekly plots. There are a number of other input arguments that can be used for more detailed control of the hillmaker analysis process, but we'll hold off on that for now.

In [None]:
oo_demo_ssu_30 = hm.Scenario(scenario_name='oo_demo_ssu_30',
                             data=stops_df,
                             in_field='InRoomTS',
                             out_field='OutRoomTS',
                             start_analysis_dt='2024-01-02',
                             end_analysis_dt='2024-09-30',
                             cat_field='PatType',
                             bin_size_minutes=30,
                             plot_export_path='./output',
                             export_all_week_plots=True)

# Show datatype
type(oo_demo_ssu_30)

You can see the values for all of the input parameters by printing the scenario object.

In [None]:
print(oo_demo_ssu_30)

Let's make some hills. By default, the `verbosity` parameter is set to 0 - we won't get any messages unless they are warnings or errors.

In [None]:
oo_demo_ssu_30.make_hills()

All of the outputs from running `make_hills` get stored in an attribute dictionary called `hills`.

In [None]:
oo_demo_ssu_30.hills.keys()

There are methods for retrieving specific items from this dictionary. For example, there is `get_plot`.

In [None]:
help(oo_demo_ssu_30.get_plot)

In [None]:
oo_demo_ssu_30.get_plot('occupancy')

Similarly, we can use `get_bydatetime_df` and `get_summary` to retrieve specific result `DataFrame` objects.

In [None]:
help(oo_demo_ssu_30.get_bydatetime_df)

In [None]:
oo_demo_ssu_30.get_bydatetime_df(by_category=False)

You may have noticed that there were a few other keys in the `hills` dictionary. For example, we can get a simple length of stay summary. The summary consists of two histogram plots and two tabular summaries: one of each by category, and one of each aggregated over all records in the stop data.

In [None]:
oo_demo_ssu_30.hills['los_summary']

In [None]:
oo_demo_ssu_30.get_los_plot()

In [None]:
oo_demo_ssu_30.get_los_plot(by_category=False)

In [None]:
oo_demo_ssu_30.get_los_stats()

In [None]:
oo_demo_ssu_30.get_los_stats(by_category=False)
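To see what a length of stay summary involves, here is a minimal pandas sketch that computes length of stay in hours, by category and overall. The four stop records are made up for illustration, and the `los_hours` column name is an assumption; hillmaker's own summary has more statistics and its own column names.

```python
import pandas as pd

# Made-up stop records using the field names from this tutorial's data
demo_stops = pd.DataFrame({
    'PatType': ['IVT', 'IVT', 'CAT', 'CAT'],
    'InRoomTS': pd.to_datetime(['2024-01-02 07:15', '2024-01-02 08:00',
                                '2024-01-02 09:00', '2024-01-02 10:30']),
    'OutRoomTS': pd.to_datetime(['2024-01-02 09:30', '2024-01-02 09:00',
                                 '2024-01-02 15:00', '2024-01-02 14:30']),
})

# Length of stay in hours: departure time minus arrival time
demo_stops['los_hours'] = ((demo_stops['OutRoomTS'] - demo_stops['InRoomTS'])
                           / pd.Timedelta(hours=1))

stats = ['count', 'mean', 'min', 'max']

# Summary by patient type
los_by_cat = demo_stops.groupby('PatType')['los_hours'].agg(stats)
print(los_by_cat)

# Summary aggregated over all records
los_overall = demo_stops['los_hours'].agg(stats)
print(los_overall)
```

Dividing by `pd.Timedelta(hours=1)` converts the time differences to hours. Dividing by a different `Timedelta` would give other units, such as days.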

See {doc}`using_oo_api` for all the details on using the API.