## **Creating SQL Queries from Natural Language**

This tutorial will guide you through creating an application that generates SQL queries from natural language instructions and evaluates the quality of the generated queries. Along the way, you'll learn how to use Orq's deployment feature to enhance SQL generation. By the end of this tutorial, you'll be ready to experiment with SQL generation in your own projects.

Before starting, make sure you have an Orq account. If you don't, sign up at Orq.ai.

To simplify the process, we've prepared a [Google Colab](https://colab.research.google.com/drive/1OYST2gldxBXbAN10wRTfnTeExCjWrF9i#scrollTo=EJ5-MLEjmpg9) notebook that you can copy and run after adding your own API key. It provides a ready-to-use environment with the required configuration in place, so you can focus on experimenting with SQL generation instead of the initial setup. Let's dive in!

**Step 1: Setting Up the Environment**  
The following command installs the libraries needed to work with the Orq platform and to load datasets from Hugging Face. Later steps also use pandas and scikit-learn, which come preinstalled on Google Colab. Feel free to reuse and adapt this code for your projects.

## Install packages

In [None]:
pip install orq-ai-sdk datasets huggingface_hub

### Initializing the OrqAI Client

This code initializes the OrqAI client. The API key is read from the `ORQ_API_KEY` environment variable; if the variable is not set, the placeholder string `"your_api_key_here"` is used instead, so replace it with your own key. The environment is set to production.


In [None]:
import os

from orq_ai_sdk import OrqAI

client = OrqAI(
  api_key=os.environ.get("ORQ_API_KEY", "your_api_key_here"),
  environment="production"
)
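
With the placeholder fallback, a missing key only shows up later as a failed request. A minimal sketch of failing fast instead, assuming the key lives in `ORQ_API_KEY`; `require_env` is a hypothetical helper and not part of the Orq SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Stop here rather than sending requests with a placeholder key
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

You could then pass `api_key=require_env("ORQ_API_KEY")` when constructing the client.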

In [None]:
# Associate subsequent requests with a user ID (2024 is an arbitrary example value)
client.set_user(id=2024)

### **Hugging Face**

Before proceeding, sign up for a free Hugging Face account if you don't already have one. You'll need an access token to log in from the notebook; create or copy one [here](https://huggingface.co/settings/tokens) after signing up or logging in. Hugging Face recommends authenticating, although it remains optional for public datasets.

**Step 3: Loading the Dataset**  
Use the Hugging Face `datasets` library to load a dataset containing table schemas and natural language instructions. Convert the `train` split to a pandas DataFrame for easy manipulation and keep the first 50 rows.

In [None]:
from huggingface_hub import login

# Use your Hugging Face API token
login(token="your_HF_api_key_here")

In [None]:
from datasets import load_dataset

ds = load_dataset("Clinton/Text-to-sql-v1")

# Convert the "train" split to a pandas DataFrame
df = ds["train"].to_pandas()

# Keep the first 50 rows to limit the number of deployment calls
df = df.head(50)

# Display the DataFrame
print(df)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



                                          instruction  \
0            Name the home team for carlton away team   
1   what will the population of Asia be when Latin...   
2   How many faculty members do we have for each g...   
3              List the record of 0-1 from the table?   
4   Which silver has a Gold smaller than 12, a Ran...   
5   When did Samsung Electronics Co LTD make the G...   
6   what are the early morning flights from BOSTON...   
7                             Name the most 3 credits   
8   What is every yellow jersey entry for the dist...   
9   In what years was there a rank lower than 9, u...   
10  What aired at 10:00 when Flashpoint aired at 9...   
11  count the number of patients whose insurance i...   
12  What was the record of the game in which Dydek...   
13       When was the game played at glenferrie oval?   
14  What is the highest K 2 O, when Na 2 O is grea...   
15  what is the total number of patients diagnosed...   
16  count the number of patient

In [None]:
df.columns

Index(['instruction', 'input', 'response', 'source', 'text'], dtype='object')

### **SQL Query Generation Use Case**
This deployment is designed to generate valid SQL queries based on specific table schemas and user-provided instructions. The model analyzes the instruction and the associated table schema to produce a precise and contextually appropriate SQL query.

SQL query generation is particularly useful when automating database interactions, building query assistants, or streamlining the process of accessing structured data through natural language inputs.

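The deployment prompt used in this tutorial is shown below. The `{{instruction}}` and `{{table}}` variables are filled from the `inputs` passed when the deployment is invoked.

```plaintext
Below are SQL table schemas paired with instructions that describe a task. Using valid SQLite, write a response that appropriately completes the request for the provided tables:

Here is the instruction: {{instruction}}

Here is the table: {{table}}

OUTPUT ONLY VALID SQL
```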
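
To see roughly what the model receives, the substitution of the two template variables can be imitated locally. This is only an illustration: the template is rendered by Orq, not by your code, and `render_prompt` as well as the sample instruction and schema are made up for this sketch.

```python
import re

PROMPT_TEMPLATE = (
    "Here is the instruction: {{instruction}}\n\n"
    "Here is the table: {{table}}\n\n"
    "OUTPUT ONLY VALID SQL"
)


def render_prompt(template: str, inputs: dict) -> str:
    """Replace each {{name}} placeholder with inputs[name]; a missing name raises KeyError."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(inputs[m.group(1)]), template)


example = render_prompt(
    PROMPT_TEMPLATE,
    {
        "instruction": "How many singers are there?",         # made-up sample
        "table": "CREATE TABLE singer (id INT, name TEXT)",   # made-up sample
    },
)
print(example)
```

Rendering a few rows this way is a quick check that the `instruction` and `input` columns contain what the prompt expects.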


In [None]:
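# Keep only the columns needed for generation and evaluation.
# .copy() avoids pandas' SettingWithCopyWarning when the "output" column is added later.
df = df[["instruction", "input", "response"]].copy()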

**Step 4: Generating SQL Queries**

This step invokes the Orq deployment once per row to generate a SQL query. The `instruction` column provides the natural language task and the `input` column contains the table schema. The results are stored in a new column named `output`.

In [None]:
# Initialize the outputs list
outputs = []

# Iterate through each row in the DataFrame
for _, row in df.iterrows():
    # Extract the 'instruction' and 'input' columns for each row
    instruction = row["instruction"]
    table = row["input"]

    # Invoke the deployment for each row
    generation = client.deployments.invoke(
        key="text_to_SQL",  # Replace with your actual deployment key
        context={
            "environments": []
        },
        inputs={
            "table": table,
            "instruction": instruction
        },
        metadata={
            "custom-field-name": "custom-metadata-value"
        }
    )

    # Append the model's output to the outputs list
    outputs.append(generation.choices[0].message.content)

# Add the outputs as a new column in the DataFrame
df["output"] = outputs
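
A single failed request aborts the loop above and loses every result collected so far. One way to make the run more forgiving is sketched below with a generic callable, so it does not depend on any particular SDK behaviour. `generate_all`, `invoke_fn`, and the retry settings are illustrative names and values, not part of the Orq SDK.

```python
import time
from typing import Callable, Iterable, Optional


def generate_all(
    rows: Iterable[dict],
    invoke_fn: Callable[[str, str], str],
    retries: int = 2,
    delay_seconds: float = 1.0,
) -> list[Optional[str]]:
    """Call invoke_fn(instruction, table) for each row; store None when every attempt fails."""
    outputs: list[Optional[str]] = []
    for row in rows:
        result = None
        for attempt in range(retries + 1):
            try:
                result = invoke_fn(row["instruction"], row["input"])
                break
            except Exception as exc:  # broad on purpose: any failure is retried, then recorded as None
                print(f"attempt {attempt + 1} failed: {exc}")
                if attempt < retries:
                    time.sleep(delay_seconds)
        outputs.append(result)
    return outputs
```

To use it, wrap the `client.deployments.invoke(...)` call in a function that takes `(instruction, table)` and returns `generation.choices[0].message.content`, then pass `df.to_dict("records")` as `rows`. Rows that failed show up as `None` and can be retried or excluded before evaluation.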



### Performance Check

**Step 5: Saving and Evaluating Results**  
Save the updated DataFrame containing the SQL queries to a file and evaluate their quality. Use metrics or manual inspection to verify the accuracy and relevance of the generated queries.
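
Saving can be a single call on the DataFrame: `df.to_csv("text_to_sql_results.csv", index=False)`. The sketch below does the same with the standard library on a list of row dictionaries (for example `df.to_dict("records")`). The file name and `save_results` are placeholders chosen for this example.

```python
import csv


def save_results(records: list[dict], path: str) -> None:
    """Write row dictionaries to a CSV file, using the keys of the first row as the header."""
    if not records:
        raise ValueError("no records to save")
    # newline="" lets the csv module handle line endings, including newlines inside SQL strings
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```

Keeping the expected `response` next to the generated `output` in the same file makes later manual inspection easier.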

In [None]:
df

In [None]:

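from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

# Each distinct SQL string is treated as its own class, so these scores only
# reward exact string matches; equivalent queries written differently count as errors.
true_labels = df["response"].str.strip()
predicted_labels = df["output"].str.strip()

# Calculate performance metrics
# (zero_division=0 avoids warnings for classes that are never predicted)
accuracy = accuracy_score(true_labels, predicted_labels)
precision = precision_score(true_labels, predicted_labels, average='macro', zero_division=0)
recall = recall_score(true_labels, predicted_labels, average='macro', zero_division=0)
f1 = f1_score(true_labels, predicted_labels, average='macro', zero_division=0)

# Print the metrics
print("Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")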
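
Exact string comparison is strict: two queries that differ only in spacing or letter case count as a mismatch. Two lightweight checks that may complement the metrics above are sketched here. `normalize_sql` and `is_valid_sqlite` are hypothetical helper names, and the validity check assumes the schema in the `input` column is itself valid SQLite.

```python
import re
import sqlite3


def normalize_sql(query: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase the query.

    Lowercasing also affects string literals, so this comparison is deliberately lenient.
    """
    collapsed = re.sub(r"\s+", " ", query).strip()
    return collapsed.rstrip(";").strip().lower()


def is_valid_sqlite(schema_sql: str, query: str) -> bool:
    """Return True if `query` compiles against the tables defined in `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)     # create the (empty) tables
        conn.execute(f"EXPLAIN {query}")   # compile the query without running it on data
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Covers syntax errors, unknown tables or columns, and multi-statement input
        return False
    finally:
        conn.close()
```

For example, the share of rows where `normalize_sql(response) == normalize_sql(output)` gives a more tolerant match rate, and the share of rows where `is_valid_sqlite(input, output)` is true shows how often the model produced a query that SQLite accepts for the given schema.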

**Next Steps**  
Congratulations! You've successfully built and tested a SQL generation application using Orq. To further enhance your project:

- Experiment with different datasets or deployment keys.
- Refine the prompt to improve SQL generation quality.
- Integrate the solution into a larger application for automated data access.

For more details and advanced features, visit the Orq documentation.