fix table, pgml.transform

postgresml · Apr 27, 2024 · db94ff1
1 parent 158853d
commit db94ff1

Showing 3 changed files with 123 additions and 26 deletions.
diff --git a/pgml-cms/docs/SUMMARY.md b/pgml-cms/docs/SUMMARY.md
@@ -16,8 +16,18 @@
 
 * [Overview](api/apis.md)
 * [SQL extension](api/sql-extension/README.md)
-  * [pgml.deploy()](api/sql-extension/pgml.deploy.md)
   * [pgml.embed()](api/sql-extension/pgml.embed.md)
+  * [pgml.transform()](api/sql-extension/pgml.transform/README.md)
+    * [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
+    * [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
+    * [Summarization](api/sql-extension/pgml.transform/summarization.md)
+    * [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
+    * [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
+    * [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
+    * [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
+    * [Translation](api/sql-extension/pgml.transform/translation.md)
+    * [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
+  * [pgml.deploy()](api/sql-extension/pgml.deploy.md)
   * [pgml.chunk()](api/sql-extension/pgml.chunk.md)
   * [pgml.generate()](api/sql-extension/pgml.generate.md)
   * [pgml.predict()](api/sql-extension/pgml.predict/README.md)
@@ -29,16 +39,6 @@
     * [Data Pre-processing](api/sql-extension/pgml.train/data-pre-processing.md)
     * [Hyperparameter Search](api/sql-extension/pgml.train/hyperparameter-search.md)
     * [Joint Optimization](api/sql-extension/pgml.train/joint-optimization.md)
-  * [pgml.transform()](api/sql-extension/pgml.transform/README.md)
-    * [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
-    * [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
-    * [Summarization](api/sql-extension/pgml.transform/summarization.md)
-    * [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
-    * [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
-    * [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
-    * [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
-    * [Translation](api/sql-extension/pgml.transform/translation.md)
-    * [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
   * [pgml.tune()](api/sql-extension/pgml.tune.md)
 * [Client SDK](api/client-sdk/README.md)
   * [Collections](api/client-sdk/collections.md)

diff --git a/pgml-cms/docs/api/sql-extension/pgml.transform/README.md b/pgml-cms/docs/api/sql-extension/pgml.transform/README.md
@@ -17,37 +17,134 @@ layout:
 
 # pgml.transform()
 
-PostgresML integrates [🤗 Hugging Face Transformers](https://huggingface.co/transformers) to bring state-of-the-art models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw inputs into useful results. Many state of the art deep learning architectures have been published and made available for download. You will want to browse all the [models](https://huggingface.co/models) available to find the perfect solution for your [dataset](https://huggingface.co/dataset) and [task](https://huggingface.co/tasks).
+`pgml.transform()` is the most powerful function in PostgresML. It integrates open-source large language models, such as Llama and Mixtral, so you can perform complex tasks on your data.
 
-We'll demonstrate some of the tasks that are immediately available to users of your database upon installation: [translation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#translation), [sentiment analysis](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#sentiment-analysis), [summarization](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#summarization), [question answering](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#question-answering) and [text generation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#text-generation).
+The models are downloaded from [🤗 Hugging Face](https://huggingface.co/transformers), which hosts tens of thousands of pre-trained and fine-tuned models for tasks such as text generation, question answering, summarization, and text classification.
 
-### Examples
+## API
 
-All of the tasks and models demonstrated here can be customized by passing additional arguments to the `Pipeline` initializer or call. You'll find additional links to documentation in the examples below.
+The `pgml.transform()` function comes in two flavors, task-based and model-based.
 
-The Hugging Face [`Pipeline`](https://huggingface.co/docs/transformers/main\_classes/pipelines) API is exposed in Postgres via:
+### Task-based API
 
-```sql
+The task-based API automatically chooses a model to use based on the task:
+
+```postgresql
+pgml.transform(
+    task TEXT,
+    args JSONB,
+    inputs TEXT[]
+)
+```
+
+| Argument | Description | Example |
+|----------|-------------|---------|
+| task | The name of a natural language processing task. | `text-generation` |
+| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
+| inputs | Array of prompts to pass to the model for inference. | `ARRAY['Once upon a time...']` |
+
+#### Example
+
+{% tabs %}
+{% tab title="SQL" %}
+
+```postgresql
+SELECT *
+FROM pgml.transform(
+  'translation_en_to_fr',                  -- task
+  '{}'::JSONB,                             -- args (none)
+  ARRAY['How do I say hello in French?']   -- inputs
+);
+```
+
+{% endtab %}
+{% endtabs %}
+
+### Model-based API
+
+The model-based API requires the name of the model and the task, passed as a JSON object, which allows it to be more generic:
+
+```postgresql
 pgml.transform(
-    task TEXT OR JSONB,      -- task name or full pipeline initializer arguments
-    call JSONB,              -- additional call arguments alongside the inputs
-    inputs TEXT[] OR BYTEA[] -- inputs for inference
+    task JSONB,
+    args JSONB,
+    inputs TEXT[]
 )
 ```
 
-This is roughly equivalent to the following Python:
+| Argument | Description | Example |
+|----------|-------------|---------|
+| task | Model configuration, including name and task. | `{"task": "text-generation", "model": "mistralai/Mixtral-8x7B-v0.1"}` |
+| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
+| inputs | Array of prompts to pass to the model for inference. | `ARRAY['Once upon a time...']` |
+
+#### Example
+
+{% tabs %}
+{% tab title="SQL" %}
+
+```postgresql
+SELECT pgml.transform(
+  task   => '{
+    "task": "text-generation",
+    "model": "TheBloke/zephyr-7B-beta-GPTQ",
+    "model_type": "mistral",
+    "revision": "main"
+  }'::JSONB,
+  inputs  => ARRAY['AI is going to change the world in the following ways:'],
+  args   => '{
+    "max_new_tokens": 100
+  }'::JSONB
+);
+```
+
+{% endtab %}
+
+{% tab title="Equivalent Python" %}
 
 ```python
 import transformers
 
 def transform(task, call, inputs):
     return transformers.pipeline(**task)(inputs, **call)
+
+transform(
+    {
+        "task": "text-generation",
+        "model": "TheBloke/zephyr-7B-beta-GPTQ",
+        "model_type": "mistral",
+        "revision": "main",
+    },
+    {"max_new_tokens": 100},
+    ['AI is going to change the world in the following ways:']
+)
 ```
 
-Most pipelines operate on `TEXT[]` inputs, but some require binary `BYTEA[]` data like audio classifiers. `inputs` can be `SELECT`ed from tables in the database, or they may be passed in directly with the query. The output of this call is a `JSONB` structure that is task specific. See the [Postgres JSON](https://www.postgresql.org/docs/14/functions-json.html) reference for ways to process this output dynamically.
+{% endtab %}
+{% endtabs %}
+
+
+### Supported tasks
+
+PostgresML currently supports most NLP tasks available on Hugging Face:
+
+| Task | Name | Description |
+|------|-------------|---------|
+| [Fill mask](fill-mask) | `fill-mask` | Fill in the blank in a sentence. |
+| [Question answering](question-answering) | `question-answering` | Answer a question based on a context. |
+| [Summarization](summarization) | `summarization` | Summarize a long text. |
+| [Text classification](text-classification) | `text-classification` | Classify a text, for example as positive or negative. |
+| [Text generation](text-generation) | `text-generation` | Generate text based on a prompt. |
+| [Text-to-text generation](text-to-text-generation) | `text2text-generation` | Generate text based on an instruction in the prompt. |
+| [Token classification](token-classification) | `token-classification` | Classify tokens in a text. |
+| [Translation](translation) | `translation` | Translate text from one language to another. |
+| [Zero-shot classification](zero-shot-classification) | `zero-shot-classification` | Classify a text without training data. |
+
+
+## Performance
 
-!!! tip
+Much like `pgml.embed()`, the models used in `pgml.transform()` are downloaded from Hugging Face and cached locally. If the connection to the database is kept open, the model remains in memory, which allows for faster inference on subsequent calls. If you want to free up memory, you can close the connection.
 
-Models will be downloaded and stored locally on disk after the first call. They are also cached per connection to improve repeated calls in a single session. To free that memory, you'll need to close your connection. You may want to establish dedicated credentials and connection pools via [pgcat](https://github.com/levkk/pgcat) or [pgbouncer](https://www.pgbouncer.org/) for larger models that have billions of parameters. You may also pass `{"cache": false}` in the JSON `call` args to prevent this behavior.
+## Additional resources
 
-!!!
+- [Hugging Face datasets](https://huggingface.co/datasets)
+- [Hugging Face tasks](https://huggingface.co/tasks)
diff --git a/pgml-dashboard/src/components/tables/small/table/table.scss b/pgml-dashboard/src/components/tables/small/table/table.scss
@@ -11,7 +11,7 @@ table.table.table-sm {
       background: transparent;
       text-transform: uppercase;
       font-size: 12px;
-      padding: 12px 0 12px 0;
+      padding: 12px 12px 12px 0;
       border-bottom: 1px solid #{$gray-600};
       font-weight: #{$font-weight-semibold};
     }
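
A note on the model-based API documented in the README hunk: the `task` and `args` values are cast to `JSONB`, so they must be strictly valid JSON (no trailing commas, for instance). One way to avoid hand-written JSON is to build the call from application code. The sketch below is only an illustration, not part of this commit: `build_transform_query` is a hypothetical helper, the `%s` placeholders assume a psycopg-style driver that adapts Python lists to Postgres arrays, and the named arguments (`task`, `inputs`, `args`) are taken from the SQL example in the README.

```python
import json


def build_transform_query(task, inputs, args=None):
    """Return (sql, params) for a model-based pgml.transform() call.

    Hypothetical helper: it only builds the statement and its parameters,
    and leaves the choice of database driver and connection to the caller.
    """
    sql = (
        "SELECT pgml.transform("
        "task => %s::JSONB, "
        "inputs => %s::TEXT[], "
        "args => %s::JSONB)"
    )
    # json.dumps always emits valid JSON, so the JSONB casts cannot fail
    # on syntax errors such as a trailing comma.
    params = (json.dumps(task), list(inputs), json.dumps(args or {}))
    return sql, params


sql, params = build_transform_query(
    {
        "task": "text-generation",
        "model": "TheBloke/zephyr-7B-beta-GPTQ",
        "model_type": "mistral",
        "revision": "main",
    },
    ["AI is going to change the world in the following ways:"],
    {"max_new_tokens": 100},
)
```

With a psycopg-style cursor, the pair would typically be passed as `cursor.execute(sql, params)`. Reusing the same connection for later calls should keep the model in memory, as the Performance section of the README describes.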