# Using the `make_hills()` function

The `hillmaker.hills.make_hills` function is the gateway to hillmaker. It is used by the CLI and the object oriented API, and it can also be called on its own to launch the hillmaking process. It has numerous input arguments for customizing how hillmaker works. In this tutorial we describe all of the input arguments and discuss their use. The same information applies to the CLI as well as to the object oriented API's `Scenario.make_hills` method.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import hillmaker as hm


In [None]:
ssu_stopdata = './data/ssu_2024.csv'
ssu_stops_df = pd.read_csv(ssu_stopdata, parse_dates=['InRoomTS','OutRoomTS'])
ssu_stops_df.info() # Check out the structure of the resulting DataFrame

In [None]:
help(hm.make_hills)

## Required input arguments

### `scenario_name` (*str*)

This is a string that gets used in a few places:

- part of filenames of exported CSV files,
- part of filenames of exported plots,
- plot subtitle default

Since it gets used in filenames, best to avoid spaces and special characters (other than underscore). Any non-alphanumeric characters other than the underscore will get transformed to underscores.
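The character replacement described above can be approximated with a single regular expression. This is a minimal sketch and not hillmaker's own code; the helper name `sanitize_scenario_name` is hypothetical, and it assumes "alphanumeric" means ASCII letters and digits.

```python
import re

def sanitize_scenario_name(name: str) -> str:
    """Replace every character that is not a letter, digit, or underscore.

    Hypothetical helper for illustration only.
    """
    return re.sub(r'[^A-Za-z0-9_]', '_', name)

print(sanitize_scenario_name('ssu summer-2024'))  # ssu_summer_2024
```

Choosing a name such as `ssu_summer_2024` up front avoids any surprise in the exported filenames.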

### `data` (*str*, *Path*, or *DataFrame*)

The data table with each row representing one visit, or *stop*, by an entity. You can pass a string or a `Path` object to a CSV file, which will then get read into a pandas `DataFrame`. Or, you can directly pass in a `DataFrame`.

For example, in the SSU example, each row is a patient who visits the short stay unit. In cycle share data, each row might be a rental of a bike for some period of time. Here are the first few records from `ssu_stops_df`. It is **NOT** necessary to have a field containing the duration of time that the entity spent in the system (e.g. `LOS_hours` below). You only need to have fields representing the arrival and departure times from the system, which are `InRoomTS` and `OutRoomTS` in this example.



In [None]:
ssu_stops_df.head()

### `in_field` (*str*)

The fieldname in `data` containing the arrival times. The datatype for the field itself must be a pandas `Timestamp` (or `datetime64`). 

### `out_field` (*str*)

The fieldname in `data` containing the departure times. The datatype for the field itself must be a pandas `Timestamp` (or `datetime64`). 
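If your timestamps are stored as strings, they can be converted before the `DataFrame` is handed to hillmaker. The following is a small sketch using made-up stop records; only the column names `InRoomTS` and `OutRoomTS` come from the SSU example.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Toy stop data with timestamps stored as strings (made-up values)
stops = pd.DataFrame({
    'InRoomTS': ['2024-06-01 08:15', '2024-06-01 09:40'],
    'OutRoomTS': ['2024-06-01 10:05', '2024-06-01 13:30'],
})

# Convert both fields to datetime64
for field in ['InRoomTS', 'OutRoomTS']:
    stops[field] = pd.to_datetime(stops[field])

# Confirm the datatypes before running the analysis
print(stops.dtypes)
print(all(is_datetime64_any_dtype(stops[f]) for f in ['InRoomTS', 'OutRoomTS']))
```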

In [None]:
ssu_stops_df.info()

### `start_analysis_dt` and `end_analysis_dt` (*something convertible to a pandas `Timestamp`*)

These two dates define what we call the *analysis date range*. All records in `data` whose `in_field` and `out_field` values overlap this range in any way are included in the hillmaker computations.
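The overlap rule can be sketched as a simple interval test. This is an illustration and not hillmaker's implementation; the function name is hypothetical, and treating the end date as running through the end of that day is an assumption made here.

```python
from datetime import datetime

def overlaps_analysis_range(in_ts, out_ts, start_dt, end_dt):
    """True if the stop [in_ts, out_ts] shares any time with [start_dt, end_dt]."""
    return in_ts <= end_dt and out_ts >= start_dt

start_dt = datetime(2024, 6, 1)
end_dt = datetime(2024, 8, 31, 23, 59, 59)  # assumption: end date covers the whole day

# Arrives before the range starts but departs inside it: included
straddles_start = overlaps_analysis_range(
    datetime(2024, 5, 31, 22), datetime(2024, 6, 1, 3), start_dt, end_dt)

# Departs before the range starts: excluded
before_range = overlaps_analysis_range(
    datetime(2024, 5, 30, 8), datetime(2024, 5, 30, 12), start_dt, end_dt)

print(straddles_start, before_range)  # True False
```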

Care must be taken in selecting the analysis date range. In an example like the SSU, where most patients are staying less than 24 hours, we are probably fine with picking a `start_analysis_dt` very close or even equal to the earliest arrival date in our stop data. However, for a system in which the length of stay may be on the order of several days, we need to be cognizant of *warm up* effects. In such a case, if we used the earliest arrival date for the start of the analysis, we are essentially assuming that the system starts out empty on that date. This is certainly not likely to be true in a busy system where entities are staying multiple days. Similarly, the end date should not be after the date of the latest arrival or the system will appear to be emptying out - again, not realistic.

See {doc}`example_occupancy_analysis` for more on this issue.

## Optional, but frequently used, input arguments

### `cat_field` (*str*, default=None)

The fieldname in `data` containing some sort of categorical information for which you would like to get hillmaker statistics. In the SSU example, this would be the `PatType` field. If a `cat_field` is specified, then arrival, departure and occupancy statistics are computed by category as well as overall. A common use of the category field is to specify a location. In this way, one hillmaker run can compute occupancy statistics for multiple locations. An example could be the name of the nursing unit visited as inpatients flow through a hospital. In the cycle share data example, a field specifying whether the renter was a subscription holder or a casual renter lets us see the very different bike rental patterns of these two distinct populations.

### `bin_size_minutes` (*int*, default=60)

Central to hillmaker is the notion of dividing each day into equally sized time bins such as hours (`bin_size_minutes=60`) or half-hours. All of the summary tables and plots will use `bin_size_minutes`. Pick a value that makes sense for your study and for the level of time of day fluctuations present. Try different values and compare the plots. Large values might obscure important short-term fluctuations in arrivals or occupancy.

The value used must be a factor of 1440 (minutes in a day).
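A quick way to check a candidate bin size is to test whether it divides 1440 evenly. The helper below is a hypothetical sketch, not part of hillmaker.

```python
MINUTES_PER_DAY = 1440

def bins_per_day(bin_size_minutes: int) -> int:
    """Return the number of time bins in a day, or raise if the size is not a factor of 1440."""
    if bin_size_minutes <= 0 or MINUTES_PER_DAY % bin_size_minutes != 0:
        raise ValueError(f'{bin_size_minutes} is not a factor of {MINUTES_PER_DAY}')
    return MINUTES_PER_DAY // bin_size_minutes

print(bins_per_day(60))  # 24
print(bins_per_day(30))  # 48
```

A value such as 7 would be rejected because 1440 is not evenly divisible by 7.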

## More optional input arguments

### `cats_to_exclude` (*list*, default=None)

If you specify a category field via `cat_field`, you can optionally provide a list of specific category values that you do **not** want to consider in the analysis. A similar effect could be obtained by pre-filtering these records out of `data`.
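The pre-filtering alternative might look like the following pandas sketch. The records are made up for illustration; only the `PatType` column name and the category codes are taken from this tutorial.

```python
import pandas as pd

# Toy stop data (made-up records)
stops = pd.DataFrame({
    'PatType': ['CAT', 'IVT', 'OTH', 'CAT'],
    'InRoomTS': pd.to_datetime(['2024-06-01 08:00'] * 4),
})

cats_to_exclude = ['IVT', 'OTH']

# Keep only rows whose category is not in the exclusion list
filtered = stops[~stops['PatType'].isin(cats_to_exclude)]
print(filtered['PatType'].tolist())  # ['CAT', 'CAT']
```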

### `occ_weight_field` (*str*, default=None)

While hillmaker is usually used to compute occupancy summaries, it can also be used to compute associated measures that are directly related to occupancy. A common example is using hillmaker to estimate staffing requirements based on staff to patient ratios (by category). This can be done by specifying a column in the stops `DataFrame` which contains the weights to use for occupancy incrementing. The default of None corresponds to a weight of 1.0. For example, in the SSU dataset, let's assume that a 4:1 patient to staff ratio was appropriate for all patient types except CAT, for which 2:1 was needed. We could create a column of occupancy weights of 0.50 for the CAT patients and 0.25 for all other patient types.
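One way to build such a weight column is sketched below with made-up records. The column name `StaffWeight` is hypothetical; any name can be used as long as it is passed as `occ_weight_field`.

```python
import pandas as pd

# Toy stop data (made-up records)
stops = pd.DataFrame({'PatType': ['CAT', 'IVT', 'OTH', 'CAT']})

# 2:1 patients per staff member for CAT -> weight 0.50
# 4:1 for every other patient type      -> weight 0.25
stops['StaffWeight'] = stops['PatType'].map({'CAT': 0.50}).fillna(0.25)

print(stops)
```

With this column in place, the resulting "occupancy" statistics would be interpreted as staff required rather than patients present.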



### `percentiles` (*tuple or list*, default=(0.25, 0.5, 0.75, 0.95, 0.99))

Use this parameter to control which percentiles of occupancy (and number of arrivals and departures) are computed.

### `los_units` (*str*, default='hours')

A statistical summary of length of stay (the difference between departure and arrival times) is computed. This parameter controls the time units used for reporting the results.

See https://pandas.pydata.org/docs/reference/api/pandas.Timedelta.html for allowable values. The smallest value allowed is 'seconds' and the largest is 'days'. The default is 'hours'.
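To make the effect of the units concrete, here is a small sketch that expresses one stay in different units using only the standard library. The function and the unit table are illustrative assumptions; hillmaker itself relies on the unit strings accepted by pandas.

```python
from datetime import datetime, timedelta

# Illustrative unit table (assumption, not hillmaker's own lookup)
UNIT = {
    'seconds': timedelta(seconds=1),
    'minutes': timedelta(minutes=1),
    'hours': timedelta(hours=1),
    'days': timedelta(days=1),
}

def length_of_stay(in_ts, out_ts, los_units='hours'):
    """Length of stay expressed in the requested time units."""
    return (out_ts - in_ts) / UNIT[los_units]

arrival = datetime(2024, 6, 1, 8, 0)
departure = datetime(2024, 6, 1, 9, 30)

print(length_of_stay(arrival, departure))             # 1.5
print(length_of_stay(arrival, departure, 'minutes'))  # 90.0
```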

### `export_bydatetime_csv` and `export_summaries_csv` (*bool*, default=True)

These two parameters control the exporting to CSV of the bydatetime and summary `DataFrame` objects. They are exported to the location specified by the `csv_export_path` parameter.

### `csv_export_path` (*str or Path*, default=current directory)

Destination path for exported CSV files. The default is the current directory.

### `make_all_dow_plots` and `make_all_week_plots` (*bool*, default=True)

If `make_all_dow_plots=True`, day of week plots are created for occupancy, arrival, and departure (resulting in 21 plots).
If `make_all_week_plots=True`, full week plots are created for occupancy, arrival, and departure (3 plots).

For the CLI and the `make_hills` legacy interface, these parameters default to True and all plots are created by default. In the object oriented API, these default to False and the user has full control over plot creation and exporting.
   

### `plot_export_path` (*str or Path*, default=current directory)

Destination path for exported plot (PNG) files. The default is the current directory.

## Optional plot related input arguments

### `plot_style` (*str*, default='ggplot')

You can use any Matplotlib plot style sheet. See https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html.

### `figsize` (*tuple*, default=(15, 10))

See https://matplotlib.org/stable/gallery/subplots_axes_and_figures/figure_size_units.html.

### `bar_color_mean` (*str*, default='steelblue')

Matplotlib color specification for the bars representing mean values. See https://matplotlib.org/stable/users/explain/colors/colors.html#sphx-glr-users-explain-colors-colors-py.

### `plot_percentiles` (*tuple* or *list* of *float*, default=(0.95, 0.75))

Percentiles to plot. The order matters as subsequent parameters will allow you to style these series.

### `pctile_color` (*list* or *tuple* of color codes, default=('black', 'grey'))

Line color for each percentile series plotted, e.g. `['blue', 'green']` or `list('gb')`. The order should match the order of the percentiles in `plot_percentiles`.



### `pctile_linestyle` (*list* or *tuple* of line styles, e.g. ['-', '--'], default=('-', '--'))

Line style for each percentile series plotted. 

### `pctile_linewidth` (*list* or *tuple* of line widths, e.g. [1.0, 0.75], default=(0.75, 0.75))

Line width for each percentile series plotted. 
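Because the three styling parameters are matched to `plot_percentiles` by position, it can help to line them up explicitly. The sketch below pairs the default values with `zip`; it only illustrates the ordering idea and is not how hillmaker builds its plots internally.

```python
plot_percentiles = (0.95, 0.75)
pctile_color = ('black', 'grey')
pctile_linestyle = ('-', '--')
pctile_linewidth = (0.75, 0.75)

# Pair each percentile with the style entries in the same position
series_styles = {
    pctile: {'color': color, 'linestyle': style, 'linewidth': width}
    for pctile, color, style, width in zip(
        plot_percentiles, pctile_color, pctile_linestyle, pctile_linewidth)
}

for pctile, style in series_styles.items():
    print(pctile, style)
```

Reordering `plot_percentiles` without reordering the style sequences would swap which line gets which style.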

### `cap` (*int*, default=None)

Use this to display a capacity line on the occupancy graphs.

### `cap_color` (*str*, default='r')

Color to use for the capacity line.

### `xlabel` and `ylabel` (*str*, default='Hour' and 'Volume')

Axes labels for the plots. For more detailed control over plotting, use the object oriented API.

### `main_title` (*str*, default='{Occupancy or Arrivals or Departures} by time of day and day of week')

Main title for plot.

### `main_title_properties` (*Dict*, default={'loc': 'left', 'fontsize': 16})

Dictionary of main title properties. See https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html

### `subtitle` (*str*, default='Scenario: {scenario_name}')

Subtitle for plot.

### `subtitle_properties` (*Dict*, default={'loc': 'left', 'style': 'italic'})

Dictionary of subtitle properties. See https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html.
Both the title and subtitle are actually matplotlib `title` objects.

### `legend_properties` (*Dict*, default={'loc': 'best', 'frameon': True, 'facecolor': 'w'})

Dictionary of legend properties. See https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html.

### `first_dow` (*str*, default='mon')

Controls which day of week appears first in the weekly plot. One of `['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']`.
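The effect on the weekly plot's day ordering can be pictured as rotating the list of days. The helper below is a hypothetical sketch of that idea, not hillmaker's plotting code.

```python
DAYS = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']

def ordered_days(first_dow='mon'):
    """Rotate the week so that first_dow comes first."""
    start = DAYS.index(first_dow)
    return DAYS[start:] + DAYS[:start]

print(ordered_days('sun'))
# ['sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat']
```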

## Advanced optional input arguments

There are a few advanced options related to how occupancy is computed. Best not to use these unless you know what you're doing.

### `edge_bins` (*int*, default=1)

Occupancy contribution method for arrival and departure bins. 1=fractional (the default), 2=entire bin. The default way that hillmaker computes occupancy is described in detail in {doc}`computing_occupancy`. For the datetime bins corresponding to an entity's arrival and departure, occupancy in that bin is the fraction of time for which the entity was present during that time bin. Alternatively, you could choose to give "full credit" for occupancy during the arrival and departure bins. However, this can lead to serious overestimates of occupancy, especially for bin sizes that are large relative to typical time spent in the system by each entity. If, for whatever reason, you choose to use `edge_bins=2`, you should use a small value of the next parameter, `highres_bin_size_minutes`.
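The difference between the two methods can be seen by working through a single stop. The function below is a simplified sketch written for this tutorial, not hillmaker's implementation; the name `bin_contributions` is hypothetical.

```python
from datetime import datetime, timedelta

def bin_contributions(in_ts, out_ts, bin_minutes=60, edge_bins=1):
    """Return {bin_start: occupancy contribution} for a single stop."""
    width = timedelta(minutes=bin_minutes)
    midnight = in_ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Start of the bin that contains the arrival
    bin_start = midnight + ((in_ts - midnight) // width) * width
    contributions = {}
    while bin_start < out_ts:
        bin_end = bin_start + width
        # Time the entity was present during this bin
        overlap = min(out_ts, bin_end) - max(in_ts, bin_start)
        fraction = overlap / width
        if edge_bins == 2 and fraction > 0:
            fraction = 1.0  # "full credit" for partially occupied bins
        contributions[bin_start] = fraction
        bin_start = bin_end
    return contributions

arrival = datetime(2024, 6, 1, 10, 40)
departure = datetime(2024, 6, 1, 12, 15)

fractional = bin_contributions(arrival, departure, 60, edge_bins=1)
full = bin_contributions(arrival, departure, 60, edge_bins=2)

print(round(sum(fractional.values()), 3))  # 1.583
print(sum(full.values()))                  # 3.0
```

For this 95 minute stay, the fractional method credits 20, 60, and 15 minutes to the three hourly bins, while the entire bin method credits three full hours, nearly double the actual time in the system.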

### `highres_bin_size_minutes` (*int*, default=`bin_size_minutes`)

Number of minutes in each time bin of the day used for initial computation of the number of arrivals, departures, and the occupancy level, i.e. in the creation of the bydatetime table. By default, this is set equal to the value of `bin_size_minutes` since it doesn't affect aggregate arrival, occupancy or departure statistics as long as the default of `edge_bins=1` is used. So, why would you ever use this parameter?

1. If you use `edge_bins=2`, you should use a small value for `highres_bin_size_minutes` to avoid serious overestimates of occupancy.
2. You may want to take a very detailed look at occupancy on specific dates for very short time bin sizes.

### `keep_highres_bin_size_minutes` (*bool*, default=False)
    
If you want to save the high resolution version of the bydatetime `DataFrame` in the dictionary returned by `make_hills()`, then set this parameter to True.


### `nonstationary_stats` (*bool*, default=True)

If True, datetime bin stats are computed. Otherwise, they aren't computed. Default is True.

### `stationary_stats` (*bool*, default=True)

If True, overall stats that do not depend on the time bin are computed. Otherwise, they aren't computed. Default is True.

### `verbosity` (*int*, default=1)

Used to set level in loggers. 0=logging.WARNING, 1=logging.INFO (the default), 2=logging.DEBUG. The default level provides quite a bit of detail about the hillmaking process and is a good way to make sure that everything is okay with your data.
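The mapping from `verbosity` to logging levels can be written out with the standard `logging` module. This is a sketch of the mapping described above; the logger name `hillmaker_demo` and the helper function are hypothetical.

```python
import logging

VERBOSITY_TO_LEVEL = {
    0: logging.WARNING,
    1: logging.INFO,
    2: logging.DEBUG,
}

def configure_logger(verbosity=1, name='hillmaker_demo'):
    """Return a logger whose level matches the given verbosity."""
    logger = logging.getLogger(name)
    logger.setLevel(VERBOSITY_TO_LEVEL[verbosity])
    return logger

quiet = configure_logger(0)
print(quiet.isEnabledFor(logging.INFO))     # False
print(quiet.isEnabledFor(logging.WARNING))  # True
```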

## Calling `make_hills()`


Before the creation of the object-oriented API, hillmaker could be used by calling a single, module level, function called `make_hills`. This type of legacy use is still possible. The `make_hills` function returns the same `hills` dictionary returned by the OO API but can also create and export plots and dataframes via function arguments. 

We'll build on the example from the {doc}`getting_started` section. For this scenario, we want to:

- use half-hourly time bins
- analyze the summer period of 2024-06-01 - 2024-08-31
- include a capacity line at 110 for the occupancy plot
- plot the 85th and 95th percentiles


In [None]:
# Required inputs
scenario_name = 'ssu_summer24'
in_field_name = 'InRoomTS'
out_field_name = 'OutRoomTS'
start_date = '2024-06-01'
end_date = '2024-08-31'

# Optional inputs
cat_field_name = 'PatType'
bin_size_minutes = 30
csv_export_path = './output'

# Optional plotting inputs
plot_export_path = './output'
plot_style = 'default'
bar_color_mean = 'grey'
percentiles = [0.85, 0.95]
plot_percentiles = [0.95, 0.85]
pctile_color = ['blue', 'green']
pctile_linewidth = [0.8, 1.0]
cap = 110
cap_color = 'black'
main_title = 'Occupancy summary'
main_title_properties = {'loc': 'center', 'fontsize':20}
subtitle = 'Summer 2024 analysis'
subtitle_properties = {'loc': 'center'}
xlabel = ''
ylabel = 'Patients'

# Use legacy function interface
hills = hm.make_hills(scenario_name=scenario_name, data=ssu_stops_df,
                      in_field=in_field_name, out_field=out_field_name,
                      start_analysis_dt=start_date, end_analysis_dt=end_date,
                      cat_field=cat_field_name,
                      bin_size_minutes=bin_size_minutes,
                      csv_export_path=csv_export_path,
                      plot_export_path=plot_export_path, plot_style=plot_style,
                      bar_color_mean=bar_color_mean,
                      percentiles=percentiles, plot_percentiles=plot_percentiles,
                      pctile_color=pctile_color, pctile_linewidth=pctile_linewidth,
                      cap=cap, cap_color=cap_color,
                      main_title=main_title, main_title_properties=main_title_properties,
                      subtitle=subtitle, subtitle_properties=subtitle_properties,
                      xlabel=xlabel, ylabel=ylabel
                     )

Now we can use the `get_plot` function to retrieve the occupancy plot. All the plots and CSV files were also exported.

In [None]:
# Get and display occupancy plot
occ_plot = hm.get_plot(hills, 'occupancy')
occ_plot

## Output dictionary

The `make_hills` function returns a dictionary containing all of the outputs created. There are convenience functions such as `get_plot`, `get_bydatetime_df` and `get_summary_df` that can be used to pull out items of interest. Of course, you can also access the dictionary directly. Let's check it out.

In [None]:
hills.keys()

### `bydatetime` - Detailed bydatetime dataframes

In [None]:
hills['bydatetime'].keys()

In [None]:
hills['bydatetime']['PatType_datetime']

In [None]:
hm.get_bydatetime_df(hills, by_category=False)

### `summaries` - Occupancy, arrival and departure summary dataframes

In [None]:
hills['summaries'].keys()

In [None]:
hills['summaries']['nonstationary'].keys()

In [None]:
hills['summaries']['nonstationary']['PatType_dow_binofday'].keys()

In [None]:
hills['summaries']['nonstationary']['PatType_dow_binofday']['occupancy']

In [None]:
help(hm.get_summary_df)

In [None]:
hm.get_summary_df(hills, flow_metric='arrivals', by_category=False)

### `los_summary` - Length of stay summary

In [None]:
hills['los_summary']

In [None]:
hills['los_summary']['los_stats_bycat']

In [None]:
hills['los_summary']['los_histo_bycat']

In [None]:
help(hm.get_los_plot)

In [None]:
hm.get_los_plot(hills, by_category=False)

In [None]:
hm.get_los_stats(hills, by_category=False)

### `plots` - Daily and weekly summary plots

In [None]:
hills['plots'].keys()

In [None]:
hm.get_plot?

In [None]:
hm.get_plot(hills, 'o', day_of_week='Mon')

In [None]:
hm.get_plot(hills, 'o', 'week')

### `settings` - Scenario settings

In [None]:
hills['settings']

### `runtime` - Total runtime in seconds

In [None]:
hills['runtime']

## Using a config file

Instead of passing a bunch of arguments to `make_hills`, you can use a [TOML formatted](https://realpython.com/python-toml/) config file. Here's what an example config file might look like:

```
[scenario_data]
scenario_name = "ss_example_1_config"
data = "./data/ssu_2024.csv"

[fields]
in_field = "InRoomTS"
out_field = "OutRoomTS"
# Just remove the following line if no category field
cat_field = "PatType"

[analysis_dates]
start_analysis_dt = 2024-01-02
end_analysis_dt = 2024-03-30

[settings]
bin_size_minutes = 60
verbosity = 1
csv_export_path = './output'
plot_export_path = './output'

# Add any additional arguments here
# Strings should be surrounded in double quotes
# Floats and ints are specified in the normal way as values
# Dates are specified as shown above

# For arguments that take lists, the entries look
# just like Python lists and follow the other rules above

# cats_to_exclude = ["IVT", "OTH"]
# percentiles = [0.5, 0.8, 0.9]

# For arguments that take dictionaries, do this:
# main_title_properties = {loc = 'left', fontsize = 16}
# subtitle_properties = {loc = 'left', style = 'italic'}
# legend_properties = {loc = 'best', frameon = true, facecolor = 'w'}
```

The section headings `[scenario_data]`, `[fields]`, and `[analysis_dates]` aren't strictly necessary. You could put all input parameters within the `[settings]` section. Including the other headings is just an organizational aid.

```{warning}
You MUST include the `[settings]` section header.
```

In [None]:
hills_config = hm.make_hills(config='./input/ssu_example_1_config.toml')

In [None]:
hills_config['settings']