# 2/ Introduction to SQL AI Function: generating fake data with custom Model Serving Endpoint

For this demo, we'll start by generating fake data using `AI_QUERY()`. 

The sample data will mimic customer reviews for grocery products submitted to an e-commerce website.

## Working with [`AI_QUERY`](https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_query) function

Our function signature is the following:

```
SELECT ai_query(endpointName, request)              -- external models and foundation models
SELECT ai_query(endpointName, request, returnType)  -- custom model serving endpoints
```

`AI_QUERY` sends the prompt to the configured remote model and returns the result as a SQL value.

In addition, we use the optional `responseFormat` argument of `AI_QUERY` to specify the response format we want the model to follow.
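As a minimal sketch of that argument, assuming it is passed by name and that the endpoint accepts the same `json_object` format used later in this notebook (the JSON keys requested in the prompt are illustrative only):

```sql
SELECT
  AI_QUERY(
    "databricks-meta-llama-3-3-70b-instruct",
    "Generate a short product review for a red dress. Return a JSON object with the keys product_name and review.",
    -- assumption: responseFormat is supplied as a named argument
    responseFormat => '{"type": "json_object"}'
  ) as product_review_json
```

The result is still a string. The format only guides the model toward producing parseable JSON.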

The [Query Profile UI](https://docs.databricks.com/aws/en/sql/user/queries/query-profile) provides real-time execution status, processing times, and error visibility for `AI_QUERY`.

*Note: this reproduces the behavior of the built-in `ai_gen` function, but uses a Model Serving endpoint of your choice.*<br/>
*If you just want to generate data quickly, we recommend using the built-in function.*

*This notebook uses the foundation Llama 3.3 70B Instruct model for inference.*


To run this notebook, connect to a <b>SQL endpoint</b>. The `AI_QUERY` function is available on Databricks SQL Pro and Serverless.

In [0]:
-- as previously, make sure you run this notebook using a SQL Warehouse or Serverless endpoint (not a classic cluster)
-- assert_true function returns an untyped null if no error is returned
SELECT assert_true(current_version().dbsql_version is not null, 'YOU MUST USE A SQL WAREHOUSE OR SERVERLESS, not a classic cluster');

USE CATALOG dbacademy;
CREATE SCHEMA IF NOT EXISTS himanshu_gupta;
USE SCHEMA himanshu_gupta;

In [0]:
SELECT
  AI_QUERY(
    "databricks-meta-llama-3-3-70b-instruct",
    "Generate a short product review for a red dress. The customer is very happy with the article."
  ) as product_review

### Introduction to SQL Function: adding a wrapper function to simplify the call

While this function is easy to call, having to pass the model endpoint name as a parameter makes it harder to use, especially for data analysts, who should focus on crafting a good prompt.

To simplify the next steps of the demo, we'll create a wrapper SQL function `ASK_LLM_MODEL`. It takes two string parameters, `prompt` (the question to ask) and `response_format` (the output format to request), and hides all the model configuration.


<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/product/sql-ai-functions/sql-ai-query-function-review-wrapper.png" width="1200px">

In [0]:
CREATE OR REPLACE FUNCTION ASK_LLM_MODEL(prompt STRING, response_format STRING DEFAULT '{"type": "string"}') 
  RETURNS STRING
  RETURN 
    AI_QUERY("databricks-meta-llama-3-3-70b-instruct", 
              prompt,
              -- pass the format by name so it is not read as the positional returnType argument
              responseFormat => response_format);

-- ALTER FUNCTION ASK_LLM_MODEL OWNER TO `your_principal`; -- for the demo only, make sure other users can access your function

In [0]:
SELECT ASK_LLM_MODEL("Generate a short product review for a red dress. The customer is very happy with the article.")

## Generating a more complete sample dataset with prompt engineering

Now that we know how to send a basic query to our model serving endpoint using SQL functions, let's ask the model a more detailed question.

We'll ask the model to generate multiple rows and return them directly as JSON. Notice that we set the response format to `json_object` to guide the LLM toward generating JSON.

Here's a prompt example to generate JSON:
```
Generate a sample dataset for me of 2 rows that contains the following columns: "review_date" (random dates in 2022), 
"review_id" (random id), "customer_id" (random long from 1 to 100), and "review". Reviews should mimic useful product reviews 
left on an e-commerce marketplace website. The review must include the product name.

The reviews should vary in length (shortest: one sentence, longest: 2 paragraphs), sentiment, and complexity. A very complex review 
would talk about multiple topics (entities) about the product with varying sentiment per topic. Provide a mix of positive, negative, 
and neutral reviews.

Return JSON ONLY. No other text outside the JSON. JSON format:
[{"review_date":<date>, "review_id":<review_id>, "customer_id":<customer_id>, "review":<review>}]
```

In [0]:
SELECT ASK_LLM_MODEL(
      'Generate a sample dataset of 2 rows that contains the following columns: "review_date" (random dates in 2022), 
      "review_id" (random id), "customer_id" (random long from 1 to 100)  and "review". 
      Reviews should mimic useful reviews of products from popular grocery brands, left on an e-commerce marketplace website. The review must include the product name.

      The reviews should vary in length (shortest: one sentence, longest: 2 paragraphs), sentiment, and complexity. A very complex review 
      would talk about multiple topics (entities) about the product with varying sentiment per topic. Provide a mix of positive, negative, 
      and neutral reviews.

      Give me JSON only. No text outside JSON. No explanations or notes
      [{"review_date":<date>, "review_id":<long>, "customer_id":<long>, "review":<string>}]', "{'type': 'json_object'}") as fake_reviews;

## Converting the results to JSON

Our results look good. All we have to do now is parse the returned text as JSON and explode the resulting array into N rows.
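As a minimal sketch of that parse-and-explode step on its own, here is the same pattern applied to a hard-coded JSON literal instead of a model response (the two sample reviews are made up for illustration):

```sql
SELECT review.*
FROM (
  SELECT explode(
    -- from_json turns the JSON text into an array of structs matching the schema string
    from_json(
      '[{"review_date":"2022-03-14","review_id":1001,"customer_id":7,"review":"Great oat milk."},
        {"review_date":"2022-08-02","review_id":1002,"customer_id":12,"review":"The cereal arrived stale."}]',
      'array<struct<review_date:date, review_id:long, customer_id:long, review:string>>'
    )
  ) AS review
)
```

`explode` emits one row per array element, and `review.*` expands each struct into its columns.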

Let's create a new function to do that:

In [0]:
CREATE OR REPLACE FUNCTION GENERATE_FAKE_REVIEWS(num_reviews INT DEFAULT 5)
  RETURNS array<struct<review_date:date, review_id:long, customer_id:long, review:string>>
  RETURN 
  SELECT FROM_JSON(
      ASK_LLM_MODEL(
        CONCAT('Generate a sample dataset of ', num_reviews, ' rows that contains the following columns: "review_date" (random dates in 2022), 
        "review_id" (random long), "customer_id" (random long from 1 to ', num_reviews, '), and "review". 
        Reviews should mimic useful reviews of products from popular grocery brands, left on an e-commerce marketplace website. The review must include the product name.
        
        The reviews should vary in length (shortest: one sentence, longest: 2 paragraphs), sentiment, and complexity. A very complex review 
        would talk about multiple topics (entities) about the product with varying sentiment per topic. Provide a mix of positive, negative, 
        and neutral reviews.

        Give me JSON only. No text outside JSON. No explanations or notes
        [{"review_date":<date>, "review_id":<long>, "customer_id":<long>, "review":<string>}]'), "{'type': 'json_object'}"), 
        "array<struct<review_date:date, review_id:long, customer_id:long, review:string>>")

-- ALTER FUNCTION GENERATE_FAKE_REVIEWS OWNER TO `your_principal`; -- for the demo only, make sure other users can access your function

In [0]:
SELECT
  review.*
FROM
  (
    SELECT
      explode(reviews) as review
    FROM
      (
        SELECT
          GENERATE_FAKE_REVIEWS(10) as reviews
      )
  )


## Saving our dataset as a table to be used directly in our demo.

*Note: if you want to create more rows, you can first create a table with multiple rows of extra information (categories, expected customer satisfaction, etc.) to concatenate to your prompt. Once that table is created, you can call a new custom GENERATE function that takes more parameters and crafts a more advanced prompt.*
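One way this could look, as a rough sketch: the table `review_seeds`, its columns, and the function `GENERATE_FAKE_REVIEW_FOR` are hypothetical names introduced here for illustration, and the sketch reuses the `ASK_LLM_MODEL` wrapper defined earlier.

```sql
-- Hypothetical seed table: one row per review we want to generate
CREATE OR REPLACE TABLE review_seeds AS
SELECT * FROM VALUES
  ('dairy', 'very satisfied'),
  ('snacks', 'disappointed'),
  ('beverages', 'neutral')
AS seeds(category, satisfaction);

-- Hypothetical generator that builds the prompt from the seed columns
CREATE OR REPLACE FUNCTION GENERATE_FAKE_REVIEW_FOR(category STRING, satisfaction STRING)
  RETURNS STRING
  RETURN ASK_LLM_MODEL(
    CONCAT('Write a short product review for a grocery product in the "', category,
           '" category. The customer is ', satisfaction,
           '. The review must include the product name.'));

-- One model call per seed row
SELECT category, satisfaction, GENERATE_FAKE_REVIEW_FOR(category, satisfaction) AS review
FROM review_seeds;
```

Generating one review per row keeps each prompt small and lets the seed columns control the mix of categories and sentiment, at the cost of one model call per row.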

In [0]:
CREATE OR REPLACE TABLE fake_reviews COMMENT "Raw Review Data" AS
SELECT
  review.*
FROM
  (
    SELECT
      explode(reviews) as review
    FROM
      (
        SELECT
          generate_fake_reviews(50) as reviews
      )
  )

In [0]:
CREATE OR REPLACE FUNCTION GENERATE_FAKE_CUSTOMERS(num_customers INT DEFAULT 10)
  RETURNS array<struct<customer_id:long, firstname:string, lastname:string, order_count:int>>
  RETURN 
  SELECT FROM_JSON(
      ASK_LLM_MODEL(
        CONCAT('Generate a sample dataset of ', num_customers, ' customers containing the following columns: 
        "customer_id" (long from 1 to ', num_customers, '), "firstname", "lastname" and "order_count" (random positive number, smaller than 200).

        Give me JSON only. No text outside JSON. No explanations or notes
        [{"customer_id":<long>, "firstname":<string>, "lastname":<string>, "order_count":<int>}]'), "{'type': 'json_object'}"), 
        "array<struct<customer_id:long, firstname:string, lastname:string, order_count:int>>")
        
-- ALTER FUNCTION GENERATE_FAKE_CUSTOMERS OWNER TO `your_principal`; -- for the demo only, make sure other users can access your function

In [0]:
CREATE OR REPLACE TABLE fake_customers
  COMMENT "Raw customers"
  AS
  SELECT customer.* FROM (
    SELECT explode(customers) as customer FROM (
      SELECT GENERATE_FAKE_CUSTOMERS(50) as customers))

In [0]:
SELECT * FROM fake_reviews

In [0]:
SELECT * FROM fake_customers

## Next steps
We're now ready to implement our pipeline to extract information from our reviews! Open [03-automated-product-review-and-answer]($./03-automated-product-review-and-answer) to continue.


Go back to [the introduction]($./00-SQL-AI-Functions-Introduction)