Commit

fix table, pgml.transform
levkk committed Apr 27, 2024
1 parent 158853d commit db94ff1
Showing 3 changed files with 123 additions and 26 deletions.
22 changes: 11 additions & 11 deletions pgml-cms/docs/SUMMARY.md
@@ -16,8 +16,18 @@

* [Overview](api/apis.md)
* [SQL extension](api/sql-extension/README.md)
* [pgml.deploy()](api/sql-extension/pgml.deploy.md)
* [pgml.embed()](api/sql-extension/pgml.embed.md)
* [pgml.transform()](api/sql-extension/pgml.transform/README.md)
* [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
* [Summarization](api/sql-extension/pgml.transform/summarization.md)
* [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
* [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
* [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
* [Translation](api/sql-extension/pgml.transform/translation.md)
* [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
* [pgml.deploy()](api/sql-extension/pgml.deploy.md)
* [pgml.chunk()](api/sql-extension/pgml.chunk.md)
* [pgml.generate()](api/sql-extension/pgml.generate.md)
* [pgml.predict()](api/sql-extension/pgml.predict/README.md)
@@ -29,16 +39,6 @@
* [Data Pre-processing](api/sql-extension/pgml.train/data-pre-processing.md)
* [Hyperparameter Search](api/sql-extension/pgml.train/hyperparameter-search.md)
* [Joint Optimization](api/sql-extension/pgml.train/joint-optimization.md)
* [pgml.transform()](api/sql-extension/pgml.transform/README.md)
* [Fill Mask](api/sql-extension/pgml.transform/fill-mask.md)
* [Question Answering](api/sql-extension/pgml.transform/question-answering.md)
* [Summarization](api/sql-extension/pgml.transform/summarization.md)
* [Text Classification](api/sql-extension/pgml.transform/text-classification.md)
* [Text Generation](api/sql-extension/pgml.transform/text-generation.md)
* [Text-to-Text Generation](api/sql-extension/pgml.transform/text-to-text-generation.md)
* [Token Classification](api/sql-extension/pgml.transform/token-classification.md)
* [Translation](api/sql-extension/pgml.transform/translation.md)
* [Zero-shot Classification](api/sql-extension/pgml.transform/zero-shot-classification.md)
* [pgml.tune()](api/sql-extension/pgml.tune.md)
* [Client SDK](api/client-sdk/README.md)
* [Collections](api/client-sdk/collections.md)
125 changes: 111 additions & 14 deletions pgml-cms/docs/api/sql-extension/pgml.transform/README.md
@@ -17,37 +17,134 @@ layout:

# pgml.transform()

PostgresML integrates [🤗 Hugging Face Transformers](https://huggingface.co/transformers) to bring state-of-the-art models into the data layer. There are tens of thousands of pre-trained models with pipelines to turn raw inputs into useful results. Many state of the art deep learning architectures have been published and made available for download. You will want to browse all the [models](https://huggingface.co/models) available to find the perfect solution for your [dataset](https://huggingface.co/dataset) and [task](https://huggingface.co/tasks).
`pgml.transform()` is the most powerful function in PostgresML. It integrates open-source large language models, such as Llama and Mixtral, which lets you perform complex tasks on your data.

We'll demonstrate some of the tasks that are immediately available to users of your database upon installation: [translation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#translation), [sentiment analysis](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#sentiment-analysis), [summarization](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#summarization), [question answering](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#question-answering) and [text generation](https://github.com/postgresml/postgresml/blob/v2.7.12/pgml-dashboard/content/docs/guides/transformers/pre\_trained\_models.md#text-generation).
The models are downloaded from [🤗 Hugging Face](https://huggingface.co/transformers), which hosts tens of thousands of pre-trained and fine-tuned models for tasks such as text generation, question answering, summarization, and text classification.

### Examples
## API

All of the tasks and models demonstrated here can be customized by passing additional arguments to the `Pipeline` initializer or call. You'll find additional links to documentation in the examples below.
The `pgml.transform()` function comes in two flavors, task-based and model-based.

The Hugging Face [`Pipeline`](https://huggingface.co/docs/transformers/main\_classes/pipelines) API is exposed in Postgres via:
### Task-based API

```sql
The task-based API automatically chooses a model to use based on the task:

```postgresql
pgml.transform(
task TEXT,
args JSONB,
inputs TEXT[]
)
```

| Argument | Description | Example |
|----------|-------------|---------|
| task | The name of a natural language processing task. | `text-generation` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |

#### Example

{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT *
FROM pgml.transform(
    task => 'translation_en_to_fr',
    inputs => ARRAY['How do I say hello in French?']
);
```

{% endtab %}
{% endtabs %}

### Model-based API

The model-based API requires the name of the model and the task, passed as a JSON object, which allows it to be more generic:

```postgresql
pgml.transform(
task TEXT OR JSONB, -- task name or full pipeline initializer arguments
call JSONB, -- additional call arguments alongside the inputs
inputs TEXT[] OR BYTEA[] -- inputs for inference
model JSONB,
args JSONB,
inputs TEXT[]
)
```

This is roughly equivalent to the following Python:
| Argument | Description | Example |
|----------|-------------|---------|
| task | Model configuration, including name and task. | `{"task": "text-generation", "model": "mistralai/Mixtral-8x7B-v0.1"}` |
| args | Additional kwargs to pass to the pipeline. | `{"max_new_tokens": 50}` |
| inputs | Array of prompts to pass to the model for inference. | `['Once upon a time...']` |

#### Example

{% tabs %}
{% tab title="SQL" %}

```postgresql
SELECT pgml.transform(
    task => '{
        "task": "text-generation",
        "model": "TheBloke/zephyr-7B-beta-GPTQ",
        "model_type": "mistral",
        "revision": "main"
    }'::JSONB,
    inputs => ARRAY['AI is going to change the world in the following ways:'],
    args => '{
        "max_new_tokens": 100
    }'::JSONB
);
```

{% endtab %}

{% tab title="Equivalent Python" %}

```python
import transformers

def transform(task, call, inputs):
    return transformers.pipeline(**task)(inputs, **call)

transform(
    {
        "task": "text-generation",
        "model": "TheBloke/zephyr-7B-beta-GPTQ",
        "model_type": "mistral",
        "revision": "main",
    },
    {"max_new_tokens": 100},
    ['AI is going to change the world in the following ways:']
)
```

Most pipelines operate on `TEXT[]` inputs, but some require binary `BYTEA[]` data like audio classifiers. `inputs` can be `SELECT`ed from tables in the database, or they may be passed in directly with the query. The output of this call is a `JSONB` structure that is task specific. See the [Postgres JSON](https://www.postgresql.org/docs/14/functions-json.html) reference for ways to process this output dynamically.
{% endtab %}
{% endtabs %}
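
To make the mapping from the SQL arguments to the Python call more concrete, here is a minimal sketch. It assumes the `task` and `args` values arrive as JSON strings; `transform_from_json`, `echo_factory`, and the `pipeline_factory` parameter are hypothetical names introduced for illustration and are not part of PostgresML. Passing `transformers.pipeline` as the factory would give roughly the behavior of the equivalent Python shown in the tab; the stub used here only echoes its arguments, so the example runs without downloading a model.

```python
import json


def transform_from_json(task_json, args_json, inputs, pipeline_factory):
    """Hypothetical helper: decode the JSON arguments and run the pipeline."""
    task = json.loads(task_json)   # pipeline initializer kwargs
    args = json.loads(args_json)   # kwargs forwarded with the call
    pipe = pipeline_factory(**task)
    return pipe(inputs, **args)


def echo_factory(**task):
    """Stand-in for transformers.pipeline that records what it receives."""
    def pipe(inputs, **args):
        return [{"task": task, "args": args, "input": text} for text in inputs]
    return pipe


result = transform_from_json(
    '{"task": "text-generation", "model": "TheBloke/zephyr-7B-beta-GPTQ"}',
    '{"max_new_tokens": 100}',
    ["AI is going to change the world in the following ways:"],
    echo_factory,
)
```

Like the Postgres `JSONB` parser, `json.loads` rejects a trailing comma after the last key of an object, so the JSON passed as `task` or `args` must be strictly valid.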


### Supported tasks

PostgresML currently supports most NLP tasks available on Hugging Face:

| Task | Name | Description |
|------|-------------|---------|
| [Fill mask](fill-mask) | `fill-mask` | Fill in the blank in a sentence. |
| [Question answering](question-answering) | `question-answering` | Answer a question based on a context. |
| [Summarization](summarization) | `summarization` | Summarize a long text. |
| [Text classification](text-classification) | `text-classification` | Assign a label to a text, for example positive or negative sentiment. |
| [Text generation](text-generation) | `text-generation` | Generate text based on a prompt. |
| [Text-to-text generation](text-to-text-generation) | `text2text-generation` | Generate text based on an instruction in the prompt. |
| [Token classification](token-classification) | `token-classification` | Classify tokens in a text. |
| [Translation](translation) | `translation` | Translate text from one language to another. |
| [Zero-shot classification](zero-shot-classification) | `zero-shot-classification` | Classify a text without training data. |


## Performance

!!! tip
Much like `pgml.embed()`, the models used in `pgml.transform()` are downloaded from Hugging Face and cached locally. If the connection to the database is kept open, the model remains in memory, which allows for faster inference on subsequent calls. If you want to free up memory, you can close the connection.

Models will be downloaded and stored locally on disk after the first call. They are also cached per connection to improve repeated calls in a single session. To free that memory, you'll need to close your connection. You may want to establish dedicated credentials and connection pools via [pgcat](https://github.com/levkk/pgcat) or [pgbouncer](https://www.pgbouncer.org/) for larger models that have billions of parameters. You may also pass `{"cache": false}` in the JSON `call` args to prevent this behavior.
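
The per-connection caching described above can be pictured with the following sketch. It is an illustration only, not PostgresML's implementation: the `Session` class, its `load_model` callback, and the cache key are all assumptions made for the example. Each session keeps loaded models in a dictionary keyed by the task configuration, bypasses that dictionary when `cache` is false, and drops everything when the session is closed.

```python
import json


class Session:
    """Hypothetical stand-in for one database connection."""

    def __init__(self, load_model):
        self._load_model = load_model  # expensive loader, e.g. reads from disk
        self._models = {}              # models kept in memory for this session

    def transform(self, task, inputs, cache=True):
        key = json.dumps(task, sort_keys=True)  # stable key for the task config
        if cache and key in self._models:
            model = self._models[key]
        else:
            model = self._load_model(task)
            if cache:
                self._models[key] = model
        return [model(text) for text in inputs]

    def close(self):
        self._models.clear()  # closing the session frees the cached models


loads = []


def fake_loader(task):
    loads.append(task["task"])  # record every time a "model" is loaded
    return str.upper            # toy model: upper-cases its input


session = Session(fake_loader)
task = {"task": "text-generation"}
first = session.transform(task, ["hello"])
second = session.transform(task, ["again"])             # served from the cache
uncached = session.transform(task, ["x"], cache=False)  # forces a reload
session.close()
```

In this toy model the loader runs once for the two cached calls and once more for the uncached call, which mirrors why keeping a connection open makes repeated inference faster.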
## Additional resources

!!!
- [Hugging Face datasets](https://huggingface.co/datasets)
- [Hugging Face tasks](https://huggingface.co/tasks)
@@ -11,7 +11,7 @@ table.table.table-sm {
background: transparent;
text-transform: uppercase;
font-size: 12px;
padding: 12px 0 12px 0;
padding: 12px 12px 12px 0;
border-bottom: 1px solid #{$gray-600};
font-weight: #{$font-weight-semibold};
}