### Quickstart
This notebook contains a sample program to guide you through the features of the Palimpzest (PZ) library. 
PZ provides a high-level, declarative interface for composing and executing pipelines of semantic operators.

### Pre-requisites
Because Palimpzest calls LLMs, you need to set **at least** one of the following
API keys as an environment variable:

- `OPENAI_API_KEY` for using OpenAI's GPT-3.5 and GPT-4 models
- `TOGETHER_API_KEY` for using TogetherAI's models, including Mixtral

Support for local model execution and more APIs is underway!

Edit the following snippet with your API key in order to run the notebook.
You can skip this cell if the corresponding environment variable is already set.
You can provide one or both keys in the snippet below. The more keys you provide, the more optimizations Palimpzest can perform!


In [1]:
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# os.environ["TOGETHER_API_KEY"] = "your-together-api-key"
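
Before running the rest of the notebook, it can help to confirm which keys are actually visible to the Python process. The sketch below uses only the standard library; the helper name `configured_keys` is made up for this example and is not part of Palimpzest.

```python
import os

# The two environment variables named in the Pre-requisites section.
SUPPORTED_KEYS = ("OPENAI_API_KEY", "TOGETHER_API_KEY")


def configured_keys(env=None):
    """Return the supported key names that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in SUPPORTED_KEYS if env.get(name, "").strip()]


found = configured_keys()
if found:
    print("Keys found:", ", ".join(found))
else:
    print("No API key found; set at least one of:", ", ".join(SUPPORTED_KEYS))
```

Note that this only checks that a variable is non-empty. A placeholder such as `"your-openai-api-key"` would still count as set, so the check cannot tell you whether the key is valid.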

### Application Use Case: Enron Email Dataset
In this application use case, we will work with the Enron Email Dataset. The Enron Email Dataset is a large database of over 600,000 emails. Don't worry! For this demo, we will only be working with a small subset of the dataset.

In this demo, we are going to use Palimpzest to perform the following tasks:
1. Load the text files that contain the emails. Each `.txt` file contains a single email.
2. Convert the text files into an `Email` type by extracting the sender, subject, and date of each email.
3. Filter the emails to retain only those that mention a vacation plan and were sent in the month of July.


### Step 1: Load the dataset

First, we have to register the directory containing the text files with Palimpzest. To do so, we call the `register_local_directory` method of the `DataDirectory` object from the `datamanager` module. This method takes the path to the directory and a name that can later be used to reference the dataset.

This step has to be run once for each dataset you want to load, and the registration is persisted on disk. If you have already registered the dataset, you can skip this step.

When we load the dataset, we can specify a schema for the input objects we are going to work with.
A schema consists of a set of attributes that Palimpzest will extract from the input objects.

In this case, we know that `enron-tiny` contains text files, so we specify the schema type `TextFile`. This built-in schema parses the text of each file and stores it in the `content` attribute.
Palimpzest will automatically detect the file format and the number of files in the directory.

In [24]:
import palimpzest.datamanager.datamanager as pzdm
from palimpzest.core.lib.schemas import TextFile
from palimpzest.sets import Dataset

# Dataset registration
dataset_path = "testdata/enron-tiny"
dataset_name = "enron-tiny"
pzdm.DataDirectory().register_local_directory(dataset_path, dataset_name)

# Dataset loading
dataset = Dataset(dataset_name, schema=TextFile)

### Step 2: Convert the textual files into an "Email" type
Since we want to extract useful information from the input files, we need to define a custom `Schema` to specify which attributes we are interested in.
Fear not! This is a simple process. We just need to define a class that inherits from `Schema` and specify the attributes we want to extract, using descriptive names and natural language descriptions.

Do not forget to include a class description, as this will be used by Palimpzest during the conversion process!

The `Email` schema will extract the sender, subject, and date of the email. We pass this schema to the `dataset.convert(output_schema)` method, which tells Palimpzest to convert records from their current input schema into the given output schema by extracting the necessary attributes.

In [25]:
from palimpzest.core.lib.fields import Field
from palimpzest.core.lib.schemas import Schema


class Email(Schema):
    """Represents an email, which in practice is usually from a text file"""
    sender = Field(desc="The email address of the sender")
    subject = Field(desc="The subject of the email")
    date = Field(desc="The date the email was sent")

dataset = dataset.convert(Email, desc="An email from the Enron dataset")
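
Why do the descriptions matter? One way to picture it is that the class docstring and each field's `desc` serve as natural language instructions for the extraction. The sketch below is a hypothetical stand-in written with standard-library dataclasses; `EmailSketch`, `describe_schema`, and the wording of the rendered text are invented for illustration and say nothing about how Palimpzest actually builds its prompts.

```python
from dataclasses import dataclass, field, fields


@dataclass
class EmailSketch:
    """Represents an email, which in practice is usually from a text file"""
    sender: str = field(default="", metadata={"desc": "The email address of the sender"})
    subject: str = field(default="", metadata={"desc": "The subject of the email"})
    date: str = field(default="", metadata={"desc": "The date the email was sent"})


def describe_schema(schema_cls):
    """Render a schema's docstring and field descriptions as plain text."""
    lines = [f"Extract a record of type {schema_cls.__name__}: {schema_cls.__doc__}"]
    for f in fields(schema_cls):
        lines.append(f"- {f.name}: {f.metadata['desc']}")
    return "\n".join(lines)


print(describe_schema(EmailSketch))
```

A vague description such as "the date" leaves more room for interpretation than "The date the email was sent", which is why descriptive names and descriptions are worth the effort.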

If you inspect the dataset, you will see that its schema is now `Email`.
However, the schema has not yet been applied to the files themselves, and the attributes have not yet been extracted.
This is by design: first, users define all of the operations they want to perform on the dataset, and then they invoke the execution of these operations.

Thanks to this design, Palimpzest can optimize the execution of the operations and avoid unnecessary computation, for example when it recognizes that a later step does not depend on an earlier one.
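
The define-first, execute-later idea can be sketched in a few lines of plain Python. The `LazyPipeline` class below is a toy written for this explanation, not Palimpzest's implementation: each call only records a step, and no work happens until `execute()` is called.

```python
class LazyPipeline:
    """Records operations and runs them only when execute() is called."""

    def __init__(self, records, steps=()):
        self._records = records
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new pipeline with one more recorded step; nothing runs yet.
        return LazyPipeline(self._records, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyPipeline(self._records, self._steps + (("filter", predicate),))

    def execute(self):
        # Only now are the recorded steps applied, in order.
        out = list(self._records)
        for kind, fn in self._steps:
            if kind == "map":
                out = [fn(r) for r in out]
            else:
                out = [r for r in out if fn(r)]
        return out


calls = []


def shout(text):
    calls.append(text)  # track when the work actually happens
    return text.upper()


pipeline = LazyPipeline(["a", "b", "c"]).map(shout).filter(lambda s: s != "B")
print("calls before execute:", len(calls))  # 0
result = pipeline.execute()
print("calls after execute:", len(calls))   # 3
print(result)                               # ['A', 'C']
```

Because the full list of steps is known before anything runs, a real system has the chance to reorder or skip steps. This toy version simply runs them in the order given.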


In [26]:
print("Dataset", dataset)
print("The schema of the dataset is", dataset.schema)

Dataset Dataset(schema=<class '__main__.Email'>, desc=An email from the Enron dataset, filter=None, udf=None, agg_func=None, limit=None, project_cols=None, uid=06a23b1a60)
The schema of the dataset is <class '__main__.Email'>


### Step 3: Apply a Filter to the Emails
Now that we have the emails in the dataset, we can filter them to only retain the ones that mention a vacation plan and were sent in the month of July.

To do this, we will use the `filter` method. It takes a string that describes, in natural language, the condition a record must satisfy to pass the filter.

With natural language filters, you don't need to implement the filter logic yourself: the condition is evaluated by an LLM. Such is the power of Palimpzest!

In [27]:
dataset = dataset.filter("The email was sent in July")
dataset = dataset.filter("The email is about holidays")
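
Chaining two `filter` calls keeps only the records that satisfy both conditions. The sketch below mimics that behavior with simple keyword checks standing in for the LLM's judgment. The predicates, the helper `apply_filters`, and the third record are made up for illustration; the first two records are taken from the output shown later in this notebook.

```python
records = [
    {"date": "6 Jul 2001", "sender": "sheila.nacey@enron.com", "subject": "Vacation plans"},
    {"date": "26 Jul 2001", "sender": "larry.berger@enron.com", "subject": "Vacation Days in August"},
    # Hypothetical record, added only so the filters have something to reject.
    {"date": "3 Mar 2001", "sender": "someone@example.com", "subject": "Quarterly report"},
]


def sent_in_july(email):
    # Crude stand-in for "The email was sent in July".
    return " Jul " in f" {email['date']} "


def about_holidays(email):
    # Crude stand-in for "The email is about holidays".
    return "vacation" in email["subject"].lower()


def apply_filters(rows, predicates):
    """Apply predicates one after another, as chained filter calls would."""
    for predicate in predicates:
        rows = [row for row in rows if predicate(row)]
    return rows


kept = apply_filters(records, [sent_in_july, about_holidays])
print([row["subject"] for row in kept])
```

Keyword checks like these miss emails that talk about time off without using the word "vacation", which is the kind of case a natural language filter is meant to handle.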

### Execute the operations
Finally, we can execute the operations we have defined on the dataset by calling the `Execute` function on the final dataset. 
There is one important parameter to discuss here: an execution `policy`. This parameter allows you to specify how the operations should be executed.
Palimpzest optimizes along three axes: cost, time, and quality of the output. You can specify which of these axes is most important to you, and Palimpzest will optimize the execution accordingly.

Here, we use the `MinCost` policy, which will try to minimize the cost of the execution regardless of output quality and runtime. This is useful for large datasets or when you are experimenting with the framework and want to keep the costs low.
You can experiment with the `MaxQuality` policy to see how it affects the execution of the operations!
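
To see what choosing a policy means in practice, imagine that several candidate plans exist, each with an estimated cost, runtime, and quality. The sketch below picks among them under a cost-minimizing and a quality-maximizing rule. The `PlanEstimate` class, the plan names, and all numbers are hypothetical, and the selection functions are a simplification that does not reflect how Palimpzest's optimizer works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    cost_usd: float  # estimated cost in dollars
    time_s: float    # estimated runtime in seconds
    quality: float   # estimated output quality in [0, 1]


# Hypothetical candidate plans with made-up estimates.
candidates = [
    PlanEstimate("small-model", cost_usd=0.003, time_s=50.0, quality=0.70),
    PlanEstimate("large-model", cost_usd=0.040, time_s=90.0, quality=0.90),
    PlanEstimate("mixed", cost_usd=0.015, time_s=65.0, quality=0.82),
]


def choose_min_cost(plans):
    """Pick the cheapest plan, ignoring time and quality."""
    return min(plans, key=lambda p: p.cost_usd)


def choose_max_quality(plans):
    """Pick the plan with the highest estimated quality, ignoring cost and time."""
    return max(plans, key=lambda p: p.quality)


print("Cheapest plan:", choose_min_cost(candidates).name)
print("Highest-quality plan:", choose_max_quality(candidates).name)
```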


In [28]:
from palimpzest.policy import MaxQuality, MinCost
from palimpzest.query import Execute

policy = MinCost()
results, execution_stats = Execute(dataset, policy)


### Displaying the output

The output of our data pipeline can be found in the `results` variable. 
To print the results as a table, we build a pandas `DataFrame` from the dictionaries returned by the `to_dict` method of each output record.

In [39]:
import pandas as pd

output_df = pd.DataFrame([r.to_dict() for r in results])[["date","sender","subject"]]
display(output_df)


| | date | sender | subject |
|---|---|---|---|
| 0 | 6 Jul 2001 | sheila.nacey@enron.com | Vacation plans |
| 1 | 26 Jul 2001 | larry.berger@enron.com | Vacation Days in August |


However, that is not the only output of the pipeline execution! 

Palimpzest also provides a detailed report of the execution, with statistics about the runtime and cost of each operation.
To access these statistics, use the `execution_stats` object, which is the second value returned by the call to `Execute()`.


In [51]:
print("Time to find an optimal plan:", execution_stats.total_optimization_time, "s")
print("Time to execute the plan:", execution_stats.total_execution_time, "s")
print("Total cost:", execution_stats.total_execution_cost, "USD")

print("Final plan executed:")
for plan, stats in execution_stats.plan_stats.items():
    print(stats)

Time to find an optimal plan: 0.0 s
Time to execute the plan: 50.940274477005005 s
Total cost: 0.0032746499999999996 USD
Final plan executed:
Total_plan_time=50.935797929763794 
Total_plan_cost=0.0032746499999999996 
0. MarshalAndScanDataOp time=0.004853487014770508 cost=0.0 
1. LLMConvertBonded time=37.450973987579346 cost=0.0024560999999999997 
2. LLMFilter time=11.312363147735596 cost=0.00067695 
3. LLMFilter time=2.13616943359375 cost=0.0001416 
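
To read the report, it helps to see where the time and money went. The sketch below copies the per-operator numbers from the output above into a plain list (the labels "first" and "second" are added here to tell the two filters apart) and computes each operator's share.

```python
# (operator, seconds, USD) copied from the report above.
operators = [
    ("MarshalAndScanDataOp", 0.004853487014770508, 0.0),
    ("LLMConvertBonded", 37.450973987579346, 0.0024560999999999997),
    ("LLMFilter (first)", 11.312363147735596, 0.00067695),
    ("LLMFilter (second)", 2.13616943359375, 0.0001416),
]

total_time = sum(seconds for _, seconds, _ in operators)
total_cost = sum(cost for _, _, cost in operators)

for name, seconds, cost in operators:
    print(f"{name:<22} {seconds / total_time:6.1%} of time  {cost / total_cost:6.1%} of cost")

print(f"Summed operator cost: {total_cost:.8f} USD")
print(f"Summed operator time: {total_time:.2f} s")
```

The per-operator costs add up to the reported total cost, and the conversion step accounts for most of both time and cost. The per-operator times sum to slightly less than `Total_plan_time`; the difference is presumably overhead spent outside the operators.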



We hope this notebook is only the start of your Palimpzest journey! Feel free to reach out to us for more information!