# An Introduction to Using PyBEN

This short tutorial is meant to help users get started with PyBEN, the Python interface
for the [binary-ensemble](https://crates.io/crates/binary-ensemble) Rust package.


BEN (short for Binary-Ensemble) is a compression algorithm designed for efficient storage and
access of ensembles of districting plans. It was designed to work primarily as a companion to
the GerrySuite collection of packages (GerryChain, GerryTools, FRCW), and it is also compatible
with other ensemble generators (e.g. ForestRecom, Sequential Monte Carlo \[SMC\]).

When working with an ensemble of plans, there is generally an underlying dual graph, $G$,
together with an ordering of its nodes $(n_1, n_2, \ldots, n_\ell)$. If we then wish to
partition the graph into, say, districts, the only thing that we need to do is assign each
node in the graph a district number. The resulting list of district numbers, in node order, is
what we call the ***assignment vector*** for the districting plan. To encode an ensemble of
districting plans in a JSONL file (short for JSON Lines, which really just means a file with one
JSON object on every line), we may format each of the lines in the following way:

```json
{"assignment": <assignment_vector>, "sample": <sample_number_indexed_from_1>}
```
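As a minimal sketch of this layout, the snippet below turns a tiny made-up ensemble into JSONL lines using only the standard library. The three plans and the helper name `ensemble_to_jsonl_lines` are hypothetical and exist only for illustration; they are not part of PyBEN.

```python
import json

# Hypothetical toy ensemble: three plans on a six-node graph with two districts.
# Position i of each assignment vector holds the district of the (i+1)-th node.
plans = [
    [1, 1, 1, 2, 2, 2],
    [1, 1, 2, 2, 2, 1],
    [1, 2, 2, 2, 1, 1],
]


def ensemble_to_jsonl_lines(assignments):
    """Return one JSON string per plan, with samples numbered from 1."""
    return [
        json.dumps({"assignment": assignment, "sample": sample})
        for sample, assignment in enumerate(assignments, start=1)
    ]


lines = ensemble_to_jsonl_lines(plans)
print("\n".join(lines))
```

Writing these lines to a file, one per row, gives a JSONL ensemble in the shape described above.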

However, if the graph has a lot of nodes in it and we want to collect millions of samples (as we
tend to want to do), then this JSONL format can make for MASSIVE files (tens or hundreds of GB).
This is why we have BEN and XBEN (e\[X\]treme BEN): to make the storage and processing of these
millions of plans possible without needing to buy an extra hard drive for every project that you
would like to work on.
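To get a feel for where those file sizes come from, here is a back-of-the-envelope estimate. It assumes each node costs its district digits plus a two-character `", "` separator, and that each line carries about 30 bytes of keys, brackets, and sample number. The function name and the scenario (10,000 nodes, one million samples, single-digit district numbers) are hypothetical.

```python
def estimate_jsonl_bytes(num_nodes, num_samples, digits_per_district=1):
    """Rough size in bytes of a JSONL ensemble file (an estimate, not a measurement)."""
    # Per node: the district digits plus a ", " separator.
    # Per line: roughly 30 bytes of keys, brackets, and the sample number (assumed).
    per_line = num_nodes * (digits_per_district + 2) + 30
    return per_line * num_samples


# Hypothetical scenario: 10,000 nodes and 1,000,000 sampled plans.
size = estimate_jsonl_bytes(10_000, 1_000_000)
print(f"{size / 1e9:.1f} GB")
```

Under these assumptions the estimate comes out to roughly 30 GB for a single ensemble.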

## Setup for the Tutorial

For this tutorial, you will need access to a few files. We are going to go ahead and download
them here and then place them in a folder called "example_data".

In [17]:
from urllib.request import urlopen
from pathlib import Path

Path("example_data").mkdir(exist_ok=True)

In [23]:

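def open_and_save(base_url, file_name):
    """Download `file_name` from `base_url` into ./example_data."""
    url = f"{base_url}/{file_name}"
    out_path = f"./example_data/{file_name}"

    # Stream the response in 64 KiB chunks so large files never sit fully in memory.
    chunk = 1024 * 64
    with urlopen(url, timeout=30) as resp, open(out_path, "wb") as f:
        while True:
            buf = resp.read(chunk)
            if not buf:
                break
            f.write(buf)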

url_base = "https://raw.githubusercontent.com/peterrrock2/binary-ensemble/main/example"
for file_name in [
    "100k_CO_chain.jsonl.xben",
    "CO_small.json",
    "small_example.jsonl",
]:
    out_path = f"./example_data/{file_name}"
    if not Path(out_path).exists():
        print(f"Downloading {file_name}...")
        open_and_save(url_base, file_name)
    else:
        print(f"{file_name} already exists, skipping download.")



url_base = "https://raw.githubusercontent.com/mggg/GerryChain/refs/heads/main/docs/_static"
for file_name in [
    "gerrymandria.json",
]:
    out_path = f"./example_data/{file_name}"
    if not Path(out_path).exists():
        print(f"Downloading {file_name}...")
        open_and_save(url_base, file_name)
    else:
        print(f"{file_name} already exists, skipping download.")

100k_CO_chain.jsonl.xben already exists, skipping download.
CO_small.json already exists, skipping download.
small_example.jsonl already exists, skipping download.
Downloading gerrymandria.json...


## Converting between file types

PyBEN comes equipped with some utility functions for users who wish to convert between different
file types.

In [None]:
from pyben import (
    compress_jsonl_to_ben,
    compress_ben_to_xben,
    compress_jsonl_to_xben,
    decompress_ben_to_jsonl,
    decompress_xben_to_jsonl,
    decompress_xben_to_ben,
)



### BEN compression

The most basic and quickest type of compression available is the BEN compression format. You
may convert a standard JSONL file to a BEN file using the following function:


In [None]:
compress_jsonl_to_ben(
    in_file="example_data/small_example.jsonl", 
    out_file="example_data/small_example_jsonl_to_ben.jsonl.ben",
)

As a small note, the above function (like all of the conversion functions) does not overwrite an
existing output file by default. There is an optional `overwrite` parameter that can be set to
`True` to change this. In addition, there is a `variant` parameter with two options: "standard"
and "mkv_chain". The "mkv_chain" variant is a special variation of BEN that is optimized for
ensembles generated using an MCMC method with a non-zero rejection probability (so the generated
maps may repeat a few times in order to target an appropriate probability distribution, as in
[Reversible ReCom](https://mggg.org/rrc)).
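To illustrate why repeated plans are cheap to store, the sketch below collapses consecutive identical assignment vectors into (assignment, count) pairs. This is only an illustration of the general idea of exploiting repeats; it is not PyBEN's actual on-disk encoding, and the helper name `collapse_repeats` and the toy chain are hypothetical.

```python
from itertools import groupby


def collapse_repeats(assignments):
    """Collapse consecutive identical plans into (assignment, count) pairs."""
    return [
        (list(assignment), sum(1 for _ in run))
        for assignment, run in groupby(map(tuple, assignments))
    ]


# Hypothetical chain in which rejected proposals repeat the current plan.
chain = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
    [1, 1, 2, 2],
]

collapsed = collapse_repeats(chain)
for assignment, count in collapsed:
    print(assignment, "x", count)
```

Six plans reduce to three entries here. Note that only consecutive repeats are merged, so a plan that reappears later in the chain gets its own entry.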

In [None]:
compress_jsonl_to_xben(
    in_file="example_data/small_example.jsonl", 
    out_file="example_data/small_example_jsonl_to_xben.jsonl.xben",
)
compress_ben_to_xben(
    in_file="example_data/small_example_jsonl_to_ben.jsonl.ben", 
    out_file="example_data/small_example_jsonl_to_ben_to_xben.jsonl.xben",
)
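After running the conversions, one way to see the effect of each format is to compare file sizes on disk. The helper below is a small standard-library sketch, not a PyBEN function; the name `file_sizes` is hypothetical, and files that do not exist yet are simply skipped.

```python
from pathlib import Path


def file_sizes(paths):
    """Map each existing path to its size in bytes, skipping missing files."""
    return {str(p): Path(p).stat().st_size for p in paths if Path(p).exists()}


# Paths written by the conversion cells in this tutorial.
candidates = [
    "example_data/small_example.jsonl",
    "example_data/small_example_jsonl_to_ben.jsonl.ben",
    "example_data/small_example_jsonl_to_xben.jsonl.xben",
]

for name, size in file_sizes(candidates).items():
    print(f"{name}: {size:,} bytes")
```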