![PZ-banner](https://palimpzest-workloads.s3.us-east-1.amazonaws.com/palimpzest-cropped.png)

# Palimpzest Quickstart
This notebook contains a sample program to guide you through the features of the Palimpzest (PZ) library. PZ provides a high-level, declarative interface for composing and executing pipelines of semantic operators.

## Load Private Key(s)
1. Click on the "key" icon on the left-hand side of the Colab notebook.
2. In the sidebar that opens, click `+ Add new secret`
  - **Note:** your secrets are not visible to anyone other than Google and your version of the notebook.
3. Enter one or more of the following keys as secrets:
  - `OPENAI_API_KEY`
  - `TOGETHER_API_KEY`
    - You can create a `together.ai` API key [here](https://api.together.ai/) for this demo (it comes with $1 of free API requests)
4. Make sure you have toggled `Notebook access` ON
5. Execute the cell below to store these keys in notebook environment variables.


#### Note: for the changes to take effect, you may need to restart the session (`Runtime > Restart Session`) if you've already connected the notebook to a runtime

In [None]:
from google.colab import userdata
import os

# set environment variables
def set_api_key_from_secret(key_name):
  try:
    os.environ[key_name] = userdata.get(key_name)
  except Exception:
    # the secret is missing or notebook access is disabled; skip this key
    pass

set_api_key_from_secret('OPENAI_API_KEY')
set_api_key_from_secret('TOGETHER_API_KEY')
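
To confirm that the keys were actually loaded, a quick check along the following lines can help. This is a minimal sketch: the helper name `missing_api_keys` is hypothetical and is not part of Colab or PZ.

```python
import os

def missing_api_keys(key_names, env=None):
    """Return the names in key_names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in key_names if not env.get(name)]

missing = missing_api_keys(["OPENAI_API_KEY", "TOGETHER_API_KEY"])
if missing:
    print(f"Not set: {', '.join(missing)}")
else:
    print("All API keys are set.")
```

Only one of the keys is required to get started, so a missing key is not necessarily an error. Note, however, that the third program in this notebook reads `OPENAI_API_KEY` directly when it loads the embedding index.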

## Install Palimpzest
First, let's install the Palimpzest package. This may take a few minutes. **Dependency error messages from `pip` are expected and can be ignored.**

In [None]:
!pip install palimpzest==0.7.6
!pip install --upgrade pyarrow
!pip install chromadb==0.6.3
import palimpzest as pz

## Download Test Files

Next, we'll download the dataset we need for this demo:

In [None]:
# download tar files with test data
!wget -nc https://people.csail.mit.edu/gerarvit/PalimpzestData/enron-tiny.tar.gz
!wget -nc https://people.csail.mit.edu/gerarvit/PalimpzestData/real-estate-eval-5.tar.gz
!wget -nc https://palimpzest-workloads.s3.us-east-1.amazonaws.com/chroma-biodex.tar.gz

# extract tar files
!tar -xzf enron-tiny.tar.gz
!tar -xzf real-estate-eval-5.tar.gz
!tar -xzf chroma-biodex.tar.gz

# First PZ Program: Filtering Enron Emails
For this demo, we will work with a small subset of the Enron Email Dataset to identify emails matching some search criteria.

We are going to use Palimpzest to perform the following tasks:
1. Load the text files that contain the emails. (Each `.txt` file contains a single email).
2. Compute the sender, subject, and date of each email.
3. Filter for emails that are about holidays and were sent in the month of July.

We can compose these tasks into a PZ program as follows:


In [None]:
# define the fields we wish to compute
email_cols = [
    {"name": "sender", "type": str, "desc": "The email address of the sender"},
    {"name": "subject", "type": str, "desc": "The subject of the email"},
    {"name": "date", "type": str, "desc": "The date the email was sent"},
]

# lazily construct the computation to get emails about holidays sent in July
dataset = pz.Dataset("enron-tiny/")
dataset = dataset.sem_add_columns(email_cols)
dataset = dataset.sem_filter("The email was sent in July")
dataset = dataset.sem_filter("The email is about holidays")

First, we define the set of columns we want to compute in `email_cols`.

Next, we create a dataset by constructing `pz.Dataset()` with the path to our files.

We then instruct PZ to compute the email columns with a call to `sem_add_columns()`.

Finally, we apply our two natural language filters with `sem_filter()`.

**Note:** due to PZ's lazy execution, the code above will not execute the PZ program. It simply defines the semantic computation graph.
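
To make the idea of lazy execution concrete, here is a toy sketch in plain Python. It is not how Palimpzest is implemented: the `LazyPipeline` class, the `sent_in_july` predicate, and the two sample emails are all made up for illustration. The point is only that building the pipeline records the steps, and the work happens when `run()` is called.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (NOT Palimpzest's implementation)."""

    def __init__(self, items, steps=()):
        self.items = items
        self.steps = tuple(steps)  # recorded operations, not yet applied

    def filter(self, predicate):
        # return a new pipeline with one more recorded step; nothing runs here
        return LazyPipeline(self.items, self.steps + (predicate,))

    def run(self):
        # only now are the recorded predicates applied to the data
        return [item for item in self.items if all(step(item) for step in self.steps)]


calls = []

def sent_in_july(email):
    calls.append(email["subject"])  # track when the predicate is evaluated
    return email["date"].startswith("2001-07")

# made-up sample data
emails = [
    {"subject": "Vacation plans", "date": "2001-07-04"},
    {"subject": "Q3 budget", "date": "2001-09-12"},
]

pipeline = LazyPipeline(emails).filter(sent_in_july)
print(len(calls))  # 0: defining the pipeline evaluated nothing
results = pipeline.run()
print(len(calls))  # 2: the predicate ran once per email during run()
```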

In the next cell, we execute the PZ program with the goal of optimizing for quality:

In [None]:
# execute the computation w/the MaxQuality policy
config = pz.QueryProcessorConfig(policy=pz.MaxQuality(), execution_strategy="parallel", progress=True)
output = dataset.run(config)

Once our pipeline completes, we can convert the output to a Pandas DataFrame:

In [None]:
# display output (if using Jupyter, otherwise use print(output_df))
output_df = output.to_df(cols=["date", "sender", "subject"])
display(output_df)

Furthermore, Palimpzest provides a detailed report of the execution, with statistics about the runtime and cost of each operation, as well as the final plan that PZ executed.

These statistics are stored in `output.execution_stats`:

In [None]:
print(f"Optimization Time: {output.execution_stats.optimization_time:.2f}s")
print(f"Optimization Cost: ${output.execution_stats.optimization_cost:.3f}")
print("---")
print(f"Plan Execution Time: {output.execution_stats.plan_execution_time:.2f}s")
print(f"Plan Execution Cost: ${output.execution_stats.plan_execution_cost:.3f}")

print("Final plan executed:")
print("---")
final_plan_id = list(output.execution_stats.plan_strs.keys())[-1]
print(output.execution_stats.plan_strs[final_plan_id])

# Second PZ Program: Multi-Modal Data Processing

For our next demo, we will work with a small dataset of five real estate listings to search for properties of interest.

We are going to use Palimpzest to execute the following pipeline:
1. Load the images and text description for each listing
2. Compute the price and address of each listing from the text description
3. Filter for homes within our price range
4. Filter for homes that look modern and attractive

Let's take a moment to visualize the homes in our dataset:


In [None]:
from PIL import Image
import numpy as np
import gradio as gr

# Boilerplate code to build our visualization
fst_imgs, snd_imgs, thrd_imgs, texts = [], [], [], []
for idx in range(1, 6):
    listing = f"listing{idx}"
    with open(os.path.join("real-estate-eval-5", listing, "listing-text.txt")) as f:
        texts.append(f.read())
    # use a separate index name so the outer loop's idx is not shadowed
    for img_idx, img_name in enumerate(["img1.png", "img2.png", "img3.png"]):
        path = os.path.join("real-estate-eval-5", listing, img_name)
        img = Image.open(path)
        img_arr = np.asarray(img)
        if img_idx == 0:
            fst_imgs.append(img_arr)
        elif img_idx == 1:
            snd_imgs.append(img_arr)
        elif img_idx == 2:
            thrd_imgs.append(img_arr)

with gr.Blocks() as demo:
    fst_img_blocks, snd_img_blocks, thrd_img_blocks, text_blocks = [], [], [], []
    for fst_img, snd_img, thrd_img, text in zip(fst_imgs, snd_imgs, thrd_imgs, texts):
        with gr.Row(equal_height=True):
            with gr.Column():
                fst_img_blocks.append(gr.Image(value=fst_img))
            with gr.Column():
                snd_img_blocks.append(gr.Image(value=snd_img))
            with gr.Column():
                thrd_img_blocks.append(gr.Image(value=thrd_img))
        with gr.Row():
            with gr.Column():
                text_blocks.append(gr.Textbox(value=text, info="Text Description"))

demo.launch()

In [None]:
demo.close()

As a first step, we need to write a custom `pz.DataReader` to enable PZ to load our data properly:

In [None]:
from palimpzest.core.lib.fields import ImageFilepathField, ListField

# we first define the schema for each record output by the DataReader
real_estate_listing_cols = [
    {"name": "listing", "type": str, "desc": "The name of the listing"},
    {"name": "text_content", "type": str, "desc": "The content of the listing's text description"},
    {"name": "image_filepaths", "type": ListField(ImageFilepathField), "desc": "A list of the filepaths for each image of the listing"},
]

# we then implement the DataReader
class RealEstateListingReader(pz.DataReader):
    def __init__(self, listings_dir):
        super().__init__(schema=real_estate_listing_cols)
        self.listings_dir = listings_dir
        self.listings = sorted(os.listdir(self.listings_dir))

    def __len__(self):
        return len(self.listings)

    def __getitem__(self, idx: int):
        # get listing
        listing = self.listings[idx]

        # get fields
        image_filepaths, text_content = [], None
        listing_dir = os.path.join(self.listings_dir, listing)
        for file in os.listdir(listing_dir):
            if file.endswith(".txt"):
                with open(os.path.join(listing_dir, file), "rb") as f:
                    text_content = f.read().decode("utf-8")
            elif file.endswith(".png"):
                image_filepaths.append(os.path.join(listing_dir, file))

        # construct and return dictionary with fields
        return {"listing": listing, "text_content": text_content, "image_filepaths": image_filepaths}

Every `pz.DataReader` must have the following:
1. A `schema` defining the fields present in each output record
2. A `__len__()` function which returns the number of items in the dataset
3. A `__getitem__(idx)` function which returns the `idx`th item in the dataset
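
The three requirements above can be sketched with a plain-Python stand-in. The `InMemoryReader` class below is hypothetical and deliberately does not subclass `pz.DataReader`, so it runs without PZ installed. It only illustrates the shape of the interface: a schema, a length, and indexed access that returns a dictionary of fields.

```python
class InMemoryReader:
    """Plain-Python stand-in with the same three-part shape (hypothetical).

    A real reader would subclass pz.DataReader, as in the cell above.
    """

    def __init__(self, records, schema):
        self.schema = schema          # 1. describes the fields of each record
        self.records = list(records)

    def __len__(self):                # 2. number of items in the dataset
        return len(self.records)

    def __getitem__(self, idx):       # 3. the idx-th item as a dict of schema fields
        record = self.records[idx]
        return {col["name"]: record[col["name"]] for col in self.schema}


schema = [
    {"name": "listing", "type": str, "desc": "The name of the listing"},
    {"name": "text_content", "type": str, "desc": "The listing's text description"},
]
# made-up record; the "extra" key is not in the schema and is dropped
reader = InMemoryReader(
    [{"listing": "listing1", "text_content": "Example description", "extra": 1}],
    schema,
)
print(len(reader), reader[0])
```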

Once we've implemented the `pz.DataReader`, we can compose our PZ program as follows:

In [None]:
# schema for computing the address and price of each home
real_estate_text_cols = [
    {"name": "address", "type": str, "desc": "The address of the property"},
    {"name": "price", "type": int | float, "desc": "The listed price of the property"},
]

# define a UDF for filtering based on a price range
def in_price_range(record: dict):
    try:
        price = record["price"]
        if isinstance(price, str):
            price = price.strip()
            # use float so prices with cents (e.g. "$1,250,000.50") also parse
            price = float(price.replace("$", "").replace(",", ""))
        return 6e5 < price <= 2e6
    except Exception:
        return False

# construct our PZ program to filter for listings matching our search criteria
ds = pz.Dataset(RealEstateListingReader("real-estate-eval-5"))
ds = ds.sem_add_columns(real_estate_text_cols, depends_on="text_content")
ds = ds.sem_filter(
    "The interior is modern and attractive, and has lots of natural sunlight",
    depends_on="image_filepaths",
)
ds = ds.filter(in_price_range, depends_on="price")

First, we write a schema for the `address` and `price` fields we wish to compute.

Next, we write a UDF to filter for homes based on our price range.

Then we compose our program by:
1. Constructing our `pz.DataReader` with the real estate data
2. Using `sem_add_columns()` to compute the `address` and `price`
3. Using a `sem_filter()` to filter for modern homes with lots of sunlight
4. Using our UDF to filter for homes based on our price range
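
Since the `price` column is computed by a model, its value may arrive as a number or as a formatted string. The sketch below mirrors the bounds of the UDF above (strictly greater than 600,000 and at most 2,000,000) and shows how a few made-up inputs are handled. The names `parse_price` and `price_in_range` are illustrative and not part of PZ.

```python
def parse_price(price):
    """Convert a price such as "$1,250,000" or 750000 to a float."""
    if isinstance(price, str):
        price = price.strip().replace("$", "").replace(",", "")
    return float(price)

def price_in_range(record, low=6e5, high=2e6):
    # same bounds as the UDF: strictly above `low`, at most `high`
    try:
        return low < parse_price(record["price"]) <= high
    except (KeyError, TypeError, ValueError):
        # a missing or unparseable price is treated as out of range
        return False

# made-up inputs covering numbers, formatted strings, and bad values
examples = [750000, "$1,250,000", "$2,000,000", 600000, "N/A", None]
for price in examples:
    print(repr(price), price_in_range({"price": price}))
```

Note that the lower bound is exclusive, so a listing priced at exactly 600,000 is filtered out, while one priced at exactly 2,000,000 is kept.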

We now execute the program:

In [None]:
# execute the computation w/the MaxQuality policy
config = pz.QueryProcessorConfig(policy=pz.MaxQuality(), execution_strategy="parallel", progress=True)
output = ds.run(config)

Now let's take a look at our output:

In [None]:
from PIL import Image
import numpy as np
import gradio as gr

demo.close()

# Boilerplate code to build our visualization
fst_imgs, snd_imgs, thrd_imgs, addrs, prices = [], [], [], [], []
for record in output:
    addrs.append(record.address)
    prices.append(record.price)
    for idx, img_name in enumerate(["img1.png", "img2.png", "img3.png"]):
        path = os.path.join("real-estate-eval-5", record.listing, img_name)
        img = Image.open(path)
        img_arr = np.asarray(img)
        if idx == 0:
            fst_imgs.append(img_arr)
        elif idx == 1:
            snd_imgs.append(img_arr)
        elif idx == 2:
            thrd_imgs.append(img_arr)

with gr.Blocks() as demo:
    fst_img_blocks, snd_img_blocks, thrd_img_blocks, addr_blocks, price_blocks = [], [], [], [], []
    for fst_img, snd_img, thrd_img, addr, price in zip(fst_imgs, snd_imgs, thrd_imgs, addrs, prices):
        with gr.Row(equal_height=True):
            with gr.Column():
                fst_img_blocks.append(gr.Image(value=fst_img))
            with gr.Column():
                snd_img_blocks.append(gr.Image(value=snd_img))
            with gr.Column():
                thrd_img_blocks.append(gr.Image(value=thrd_img))
        with gr.Row():
            with gr.Column():
                addr_blocks.append(gr.Textbox(value=addr, info="Address"))
            with gr.Column():
                price_blocks.append(gr.Textbox(value=price, info="Price"))

    plan_str = list(output.execution_stats.plan_strs.values())[0]
    gr.Textbox(value=plan_str, info="Query Plan")

demo.launch()

In [None]:
demo.close()

# Third PZ Program: Optimizing a Biomedical Classification Pipeline

For our final demo, we will work with a subset of the BioDEX dataset.

Each input in the dataset is a medical report describing an adverse reaction a patient had in response to taking one or more drugs.

The goal is to correctly predict the reactions experienced by the patient by matching them to a database of ~24,300 official medical reaction terms.

We are going to use Palimpzest to implement the following pipeline:
1. Load a medical report
2. Compute a list of reactions mentioned in the report
3. Retrieve the most similar reaction terms from a vector database with embeddings for each of the ~24,300 official terms
4. Re-rank the list of official terms based on their relevance

First, we will once again create a `pz.DataReader` to load the medical reports:

In [None]:
import datasets
from functools import partial

# define the schema for records returned by the DataReader
biodex_entry_cols = [
    {"name": "pmid", "type": str, "desc": "The PubMed ID of the medical paper"},
    {"name": "title", "type": str, "desc": "The title of the medical paper"},
    {"name": "abstract", "type": str, "desc": "The abstract of the medical paper"},
    {"name": "fulltext", "type": str, "desc": "The full text of the medical paper, which contains information relevant for creating a drug safety report."},
]

# implement the DataReader
class BiodexReader(pz.DataReader):
    def __init__(
        self,
        rp_at_k: int = 5,
        num_samples: int = 10,
        split: str = "test",
        shuffle: bool = True,
        seed: int = 42,
    ):
        super().__init__(biodex_entry_cols)

        self.dataset = datasets.load_dataset("BioDEX/BioDEX-Reactions", split=split).to_pandas()
        if shuffle:
            self.dataset = self.dataset.sample(n=num_samples, random_state=seed).to_dict(orient="records")
        else:
            self.dataset = self.dataset.to_dict(orient="records")[:num_samples]

        self.rp_at_k = rp_at_k
        self.num_samples = num_samples
        self.shuffle = shuffle
        self.seed = seed
        self.split = split

    def compute_label(self, entry: dict) -> dict:
        """Compute the label for a BioDEX report given its entry in the dataset."""
        reactions_lst = [
            reaction.strip().lower().replace("'", "").replace("^", "")
            for reaction in entry["reactions"].split(",")
        ]
        label_dict = {
            "reactions": reactions_lst,
            "reaction_labels": reactions_lst,
            "ranked_reaction_labels": reactions_lst,
        }
        return label_dict

    @staticmethod
    def rank_precision_at_k(preds, targets, k: int):
        if preds is None:
            return 0.0

        try:
            # lower-case each list
            preds = [pred.strip().lower().replace("'", "").replace("^", "") for pred in preds]
            targets = set([target.strip().lower().replace("'", "").replace("^", "") for target in targets])

            # compute rank-precision at k
            rn = len(targets)
            denom = min(k, rn)
            total = 0.0
            for i in range(k):
                total += preds[i] in targets if i < len(preds) else 0.0

            return total / denom

        except Exception:
            return 0.0

    @staticmethod
    def term_recall(preds, targets):
        if preds is None:
            return 0.0

        try:
            # normalize terms in each list
            pred_terms = set([
                term.strip()
                for pred in preds
                for term in pred.lower().replace("'", "").replace("^", "").split(" ")
            ])
            target_terms = set([
                term.strip()
                for target in targets
                for term in target.lower().replace("'", "").replace("^", "").split(" ")
            ])

            # compute term recall and return
            intersect = pred_terms.intersection(target_terms)
            term_recall = len(intersect) / len(target_terms)

            return term_recall

        except Exception:
            return 0.0

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx: int):
        # get entry
        entry = self.dataset[idx]

        # get input fields
        pmid = entry["pmid"]
        title = entry["title"]
        abstract = entry["abstract"]
        fulltext = entry["fulltext"]

        # create item with fields
        item = {"fields": {}, "labels": {}, "score_fn": {}}
        item["fields"]["pmid"] = pmid
        item["fields"]["title"] = title
        item["fields"]["abstract"] = abstract
        item["fields"]["fulltext"] = fulltext

        if self.split == "train":
            # add label info
            item["labels"] = self.compute_label(entry)

            # add scoring functions for list fields
            rank_precision_at_k = partial(BiodexReader.rank_precision_at_k, k=self.rp_at_k)
            item["score_fn"]["reactions"] = BiodexReader.term_recall
            item["score_fn"]["reaction_labels"] = BiodexReader.term_recall
            item["score_fn"]["ranked_reaction_labels"] = rank_precision_at_k

        return item


There are a few new features of this `pz.DataReader` which are needed for the optimization process:
1. `__getitem__()` returns a dictionary with top-level keys `{"fields", "labels", "score_fn"}`
2. `fields` contains the data emitted by the `pz.DataReader`
3. (for `train` data only): `labels` contains the expected results for each output field
4. (for `train` data only): `score_fn` contains scoring functions for each output field
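
To get a feel for what a scoring function computes, here is a small standalone sketch of rank-precision at k, following the same logic as `BiodexReader.rank_precision_at_k` above: count how many of the first k predictions appear in the targets, then divide by `min(k, number of targets)`. The helper `normalize` and the sample terms are made up for illustration.

```python
def normalize(term):
    # same normalization as in the reader: trim, lower-case, drop ' and ^
    return term.strip().lower().replace("'", "").replace("^", "")

def rank_precision_at_k(preds, targets, k):
    """Hits among the first k predictions, divided by min(k, number of targets)."""
    preds = [normalize(pred) for pred in preds]
    targets = {normalize(target) for target in targets}
    if not targets:
        return 0.0
    hits = sum(1 for pred in preds[:k] if pred in targets)
    return hits / min(k, len(targets))

# made-up predictions and labels
preds = ["Nausea", "headache", "rash", "fever"]
targets = ["nausea", "rash", "dizziness"]

# two of the predictions ("nausea", "rash") are targets; min(5, 3) = 3
print(rank_precision_at_k(preds, targets, k=5))
```

With these inputs the score is 2 hits out of 3, since there are fewer targets than k.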

Once we've defined our `pz.DataReader`, we can create our training and test datasets:

In [None]:
SEED = 123

# create the train and test datasets and the validator
train_datareader = BiodexReader(split="train", seed=SEED)
test_datareader = BiodexReader(split="test", num_samples=20, seed=SEED)
validator = pz.Validator(train_datareader, None)

Next, we provide the search function that is passed to the `sem_topk` operator. For each reaction computed by PZ, it fetches the five most similar official terms from the index, sorts all candidates by cosine similarity, removes duplicates, and returns the top-k terms.

In [None]:
import chromadb
from chromadb.utils.embedding_functions.openai_embedding_function import OpenAIEmbeddingFunction

# load index [text-embedding-3-small]
chroma_client = chromadb.PersistentClient(".chroma-biodex")
openai_ef = OpenAIEmbeddingFunction(
  api_key=os.environ["OPENAI_API_KEY"],
  model_name="text-embedding-3-small",
)
index = chroma_client.get_collection("biodex-reaction-terms", embedding_function=openai_ef)

def search_func(index: chromadb.Collection, query: list[list[float]], k: int) -> dict[str, list[str]]:
    # execute query with embeddings
    results = index.query(query, n_results=5)

    # get list of result terms with their cosine similarity scores
    final_results = []
    for query_docs, query_distances in zip(results["documents"], results["distances"]):
        for doc, dist in zip(query_docs, query_distances):
            cosine_similarity = 1 - dist
            final_results.append({"content": doc, "similarity": cosine_similarity})

    # sort the results by similarity score
    sorted_results = sorted(final_results, key=lambda result: result["similarity"], reverse=True)

    # remove duplicates
    sorted_results_set = set()
    final_sorted_results = []
    for result in sorted_results:
        if result["content"] not in sorted_results_set:
            sorted_results_set.add(result["content"])
            final_sorted_results.append(result["content"])

    # return the top-k most similar terms, keyed by the output field name
    return {"reaction_labels": final_sorted_results[:k]}
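
The merging logic in `search_func` can be tried out without a vector database. The sketch below applies the same steps (flatten, convert distance to similarity, sort, de-duplicate, truncate) to a hand-written result dictionary shaped like the `documents` and `distances` fields used above. The function name `merge_top_k` and the sample values are made up for illustration.

```python
def merge_top_k(results, k):
    """Flatten per-query results, rank by similarity, drop duplicates, keep k."""
    scored = [
        (1 - dist, doc)  # treat distance as cosine distance, as in search_func
        for docs, dists in zip(results["documents"], results["distances"])
        for doc, dist in zip(docs, dists)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)

    # keep the first (most similar) occurrence of each term
    seen, top = set(), []
    for _, doc in scored:
        if doc not in seen:
            seen.add(doc)
            top.append(doc)
    return top[:k]

# made-up results for two query reactions; "vomiting" appears for both
fake_results = {
    "documents": [["nausea", "vomiting"], ["vomiting", "headache"]],
    "distances": [[0.10, 0.30], [0.05, 0.40]],
}
print(merge_top_k(fake_results, k=2))  # ['vomiting', 'nausea']
```

Because duplicates are removed after sorting, a term returned for several reactions is ranked by its best similarity score.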

Finally, we can construct our PZ program:

In [None]:
# define the schema for each computation in our program
biodex_reactions_cols = [
    {"name": "reactions", "type": list[str], "desc": "The list of all medical conditions experienced by the patient as discussed in the report. Try to provide as many relevant medical conditions as possible."},
]
biodex_reaction_labels_cols = [
    {"name": "reaction_labels", "type": list[str], "desc": "Official terms for medical conditions listed in `reactions`"},
]
biodex_ranked_reactions_labels_cols = [
    {"name": "ranked_reaction_labels", "type": list[str], "desc": "The ranked list of medical conditions experienced by the patient. The most relevant label occurs first in the list. Be sure to rank ALL of the inputs."},
]


# construct pz plan
plan = pz.Dataset(test_datareader)
plan = plan.sem_add_columns(biodex_reactions_cols)
plan = plan.sem_topk(
    index=index,
    search_func=search_func,
    search_attr="reactions",
    output_attrs=biodex_reaction_labels_cols,
)
plan = plan.sem_add_columns(biodex_ranked_reactions_labels_cols, depends_on=["title", "abstract", "fulltext", "reaction_labels"])


First, let's execute our plan without training data and score our performance:

In [None]:
def score_output(output, seed):
    # score output
    test_dataset = datasets.load_dataset("BioDEX/BioDEX-Reactions", split="test").to_pandas()
    test_dataset = test_dataset.sample(n=20, random_state=seed).to_dict(orient="records")

    # construct mapping from pmid --> label (field, value) pairs
    def compute_target_record(entry):
        reactions_lst = [
            reaction.strip().lower().replace("'", "").replace("^", "")
            for reaction in entry["reactions"].split(",")
        ]
        label_dict = {"ranked_reaction_labels": reactions_lst}
        return label_dict

    label_fields_to_values = {
        entry["pmid"]: compute_target_record(entry) for entry in test_dataset
    }

    def rank_precision_at_k(preds: list, targets: list, k: int):
        if preds is None:
            return 0.0

        # lower-case each list
        preds = [pred.lower().replace("'", "").replace("^", "") for pred in preds]
        targets = set([target.lower().replace("'", "").replace("^", "") for target in targets])

        # compute rank-precision at k
        rn = len(targets)
        denom = min(k, rn)
        total = 0.0
        for i in range(k):
            total += preds[i] in targets if i < len(preds) else 0.0

        return total / denom

    def compute_avg_rp_at_k(records, k=5):
        total_rp_at_k = 0
        bad = 0
        for record in records:
            pmid = record['pmid']
            preds = record['ranked_reaction_labels']
            targets = label_fields_to_values[pmid]['ranked_reaction_labels']
            try:
                total_rp_at_k += rank_precision_at_k(preds, targets, k)
            except Exception:
                bad += 1

        return total_rp_at_k / len(records), bad

    rp_at_k, bad = compute_avg_rp_at_k([record.to_dict() for record in output], k=5)
    final_plan_id = list(output.execution_stats.plan_stats.keys())[0]
    final_plan_str = output.execution_stats.plan_strs[final_plan_id]
    print("---")
    print("#########################")
    print(f"##### RP@5: {rp_at_k:.5f} #####")
    print("#########################")
    print("---")
    print(f"Optimization time: {output.execution_stats.optimization_time:.2f}s")
    print(f"Optimization cost: ${output.execution_stats.optimization_cost:.3f}")
    print("---")
    print(f"Plan exec. time: {output.execution_stats.plan_execution_time:.2f}s")
    print(f"Plan exec. cost: ${output.execution_stats.plan_execution_cost:.3f}")
    print("---")
    print(f"Total time: {output.execution_stats.total_execution_time:.2f}s")
    print(f"Total cost: ${output.execution_stats.total_execution_cost:.3f}")
    print("---")
    print("Final Plan:")
    print(final_plan_str)

import logging
logger = logging.getLogger()
logger.disabled = True

# execute pz plan
config = pz.QueryProcessorConfig(
    policy=pz.MaxQuality(),
    execution_strategy="parallel",
    max_workers=64,
    progress=True,
)

output = plan.run(config=config, seed=SEED)
score_output(output, seed=SEED)

Now, let's run the program again while using our `train_datareader` as a validation dataset:

In [None]:
import logging
logger = logging.getLogger()
logger.disabled = True

# execute pz plan
config = pz.QueryProcessorConfig(
    policy=pz.MaxQuality(),
    validator=validator,
    optimizer_strategy="pareto",
    sentinel_execution_strategy="mab",
    execution_strategy="parallel",
    use_final_op_quality=True,
    max_workers=64,
    progress=True,
)

output = plan.run(config=config, k=6, j=4, sample_budget=72, seed=SEED)
score_output(output, seed=SEED)