This repository contains the fine-tuned model Qwen2-0.5B-Instruct-SQL-query-generator, designed to generate SQL queries from natural language text prompts.
The Qwen2-0.5B-Instruct-SQL-query-generator is a specialized model fine-tuned on the motherduckdb/duckdb-text2sql-25k
dataset (first 10k rows). This model can convert natural language questions into SQL queries, facilitating data retrieval and database querying through natural language interfaces.
- Convert natural language questions to SQL queries.
- Facilitate data retrieval from databases using natural language.
- Assist in building natural language interfaces for databases.
- The model is fine-tuned on a specific subset of data and may not generalize well to all SQL query formats or databases.
- Review generated SQL queries for correctness and security before executing them against live databases.
The model was fine-tuned on the motherduckdb/duckdb-text2sql-25k
dataset, specifically using the first 10,000 rows. This dataset includes natural language questions and their corresponding SQL queries.
Evaluation data was drawn from the same dataset, keeping the training and evaluation distributions consistent.
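The 10,000-row subset described above can be loaded with the `datasets` library, for example (a sketch; the exact split used for fine-tuning is defined in the training code on GitHub):

```python
from datasets import load_dataset

# Load only the first 10,000 rows of the text-to-SQL dataset,
# matching the subset used for fine-tuning.
dataset = load_dataset("motherduckdb/duckdb-text2sql-25k", split="train[:10000]")
print(len(dataset))
```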
The training code is available on GitHub.
- Learning Rate: 1e-4
- Batch Size: 8
- Save Steps: 1
- Logging Steps: 500
- Number of Epochs: 5
- Transformers: 4.39.0
- PyTorch: 2.2.0
- Datasets: 2.20.0
- Tokenizers: 0.15.2
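The hyperparameters above map onto a `transformers` `TrainingArguments` configuration roughly as follows. This is a sketch, not the exact training code; the `output_dir` is an assumed placeholder, and the actual setup lives in `train.py` in the GitHub repository:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments;
# see train.py in the repository for the configuration actually used.
training_args = TrainingArguments(
    output_dir="qwen2-0.5b-sql",    # assumed output path, not from the repo
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    save_steps=1,
    logging_steps=500,
    num_train_epochs=5,
)
```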
Evaluation metrics such as accuracy, precision, recall, and F1-score were used to assess the model's performance.
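As a concrete illustration of one such metric, exact-match accuracy over normalized query strings can be computed as below. The `normalize_sql` helper and the example queries are illustrative only, not part of the released evaluation code:

```python
import re

def normalize_sql(query: str) -> str:
    # Collapse whitespace and lowercase so formatting differences
    # do not count as mismatches; a deliberately simple normalization.
    return re.sub(r"\s+", " ", query.strip()).lower()

def exact_match_accuracy(predictions, references) -> float:
    # Fraction of generated queries that equal the reference
    # after normalization.
    matches = sum(
        normalize_sql(p) == normalize_sql(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(references)

preds = ["SELECT name FROM employees  WHERE salary > 100000"]
refs = ["select name from employees where salary > 100000"]
print(exact_match_accuracy(preds, refs))  # → 1.0
```

Exact match is a strict criterion; two queries can differ textually yet return identical results, so execution-based comparison is often used alongside it.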
To use this model, load it from the Hugging Face Model Hub and provide natural language text prompts. The model will generate the corresponding SQL queries.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("omaratef3221/Qwen2-0.5B-Instruct-SQL-query-generator")
model = AutoModelForCausalLM.from_pretrained("omaratef3221/Qwen2-0.5B-Instruct-SQL-query-generator")

# Depending on the fine-tuning prompt format, you may need to include the table schema in the prompt.
inputs = tokenizer("Show me all employees with a salary greater than $100,000", return_tensors="pt")

# max_new_tokens prevents generate()'s short default length from truncating the SQL.
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
To train the model, follow these steps:

- Clone the repository:

  ```bash
  git clone https://github.com/omaratef3221/SQL_Query_Generator_llm.git
  cd SQL_Query_Generator_llm
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the training script:

  ```bash
  python train.py --model_id Qwen/Qwen2-0.5B-Instruct --dataset_id motherduckdb/duckdb-text2sql-25k --epochs 5 --batch_size 8
  ```
After training completes, a notification is sent to https://ntfy.sh/sql_query_generator_llm.
This project is licensed under the MIT License. See the LICENSE file for more details.
For any questions or issues, please open an issue on this GitHub repository or contact Omar Atef.
⭐️ Don't forget to star the repository if you find it useful!