In [2]:
!pip install -r requirements.txt



In [None]:
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"  # placeholder; replace with a real key

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert data scientist."),
    ("user", "{input}")
])
chain = prompt | llm

In [None]:
from langchain_core.output_parsers import StrOutputParser

output_parser = StrOutputParser()

chain = prompt | llm | output_parser

In [None]:
chain.invoke({"input": "define data"})

'Data refers to raw, unprocessed facts, figures, and symbols that are collected from various sources. These can include numbers, text, images, audio, and video, among other formats. Data itself does not have meaning until it is processed and analyzed to extract useful information. In the context of data science and computing, data is often used as the foundation for generating insights, making decisions, and driving technological advancements. It can be structured, like data in a database with rows and columns, or unstructured, like emails, social media posts, or multimedia files.'

In [None]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='example.csv',
    csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['Sentence', 'Label']
})


data = loader.load()

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
)
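
A note on the `fieldnames` argument above: assuming `CSVLoader` forwards `csv_args` to Python's `csv.DictReader`, supplying `fieldnames` makes the reader treat the first line of the file as data rather than as a header. The sketch below uses only the standard library and a hypothetical two-line CSV (the real contents of `example.csv` are not shown here) to illustrate the difference.

```python
import csv
import io

# Hypothetical file contents: one header line plus one data row.
raw = 'Sentence,Label\n"SELECT * FROM users",1\n'

# Same arguments as the csv_args passed to CSVLoader above.
with_names = list(
    csv.DictReader(
        io.StringIO(raw),
        delimiter=",",
        quotechar='"',
        fieldnames=["Sentence", "Label"],
    )
)

# Without fieldnames, the first line is consumed as the header.
without_names = list(
    csv.DictReader(io.StringIO(raw), delimiter=",", quotechar='"')
)

print(len(with_names), with_names[0])
print(len(without_names), without_names[0])
```

If `example.csv` already starts with a `Sentence,Label` header, dropping `fieldnames` may be the safer choice, since otherwise the header line would likely be loaded and embedded as an ordinary document.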



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(data)
vector = FAISS.from_documents(documents, embeddings)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain

prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

document_chain = create_stuff_documents_chain(llm, prompt)

retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)

In [None]:
from langchain_core.documents import Document

print(document_chain.invoke({
    "input": "generate a generic regex that matches the most common queries present on this database",
    "context": [Document(page_content='This is a database where the "Sentence" column represents SQL queries and the "Label" column indicates whether the SQL query is valid')]
}))

To generate a generic regex that matches the most common SQL queries, we need to consider the basic structure of SQL queries. Here is a simple regex pattern that can match common SQL queries like `SELECT`, `INSERT`, `UPDATE`, and `DELETE`:

```regex
(?i)^\s*(SELECT\s+.+?\s+FROM\s+\w+|INSERT\s+INTO\s+\w+\s*\(.+?\)\s*VALUES\s*\(.+?\)|UPDATE\s+\w+\s+SET\s+.+?\s+WHERE\s+.+?|DELETE\s+FROM\s+\w+\s+WHERE\s+.+?)\s*;?\s*$
```

Explanation:
- `(?i)`: Case-insensitive matching.
- `^\s*`: Matches the start of the string, allowing for optional leading whitespace.
- `SELECT\s+.+?\s+FROM\s+\w+`: Matches a basic `SELECT` query structure.
- `|`: Acts as an OR operator to match different query types.
- `INSERT\s+INTO\s+\w+\s*\(.+?\)\s*VALUES\s*\(.+?\)`: Matches a basic `INSERT` query structure.
- `UPDATE\s+\w+\s+SET\s+.+?\s+WHERE\s+.+?`: Matches a basic `UPDATE` query structure.
- `DELETE\s+FROM\s+\w+\s+WHERE\s+.+?`: Matches a basic `DELETE` query structure.
- `\s*;?\s*$`: Allows for optional trailing w
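
The pattern in the output above can be checked quickly with Python's `re` module before relying on it. This is a minimal sketch: the sample queries are made up for illustration and do not come from `example.csv`.

```python
import re

# The regex returned by the model, split across lines for readability.
SQL_PATTERN = re.compile(
    r"(?i)^\s*(SELECT\s+.+?\s+FROM\s+\w+"
    r"|INSERT\s+INTO\s+\w+\s*\(.+?\)\s*VALUES\s*\(.+?\)"
    r"|UPDATE\s+\w+\s+SET\s+.+?\s+WHERE\s+.+?"
    r"|DELETE\s+FROM\s+\w+\s+WHERE\s+.+?)\s*;?\s*$"
)

# Hypothetical sample queries, chosen to probe each branch of the pattern.
samples = [
    "SELECT * FROM users;",
    "select name from users where id = 1",
    "INSERT INTO users (name) VALUES ('a');",
    "UPDATE users SET name = 'a' WHERE id = 1",
    "DELETE FROM users WHERE id = 1;",
    "DELETE FROM users;",
    "DROP TABLE users;",
]

# Map each query to whether the pattern matches it.
results = {query: SQL_PATTERN.search(query) is not None for query in samples}

for query, matched in results.items():
    print(f"{matched!s:5}  {query}")
```

Two limitations show up in this check: the `SELECT` branch stops at the table name, so a `SELECT` with a `WHERE` clause is rejected, and the `DELETE` branch requires a `WHERE` clause.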

In [None]:
response = retrieval_chain.invoke({"input": "generate generic regexes that match boolean values and numeric values in the document"})
print(response["answer"])

Based on the provided context, here are some generic regular expressions that could match boolean and numeric values:

1. **Boolean Values:**
   - To match boolean values like "true" or "false", you can use the following regex:
     ```
     \b(true|false)\b
     ```

2. **Numeric Values:**
   - To match numeric values, including integers and decimals, you can use the following regex:
     ```
     \b\d+(\.\d+)?\b
     ```

These regex patterns are designed to match standalone boolean and numeric values in a text.
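
As a sanity check, the two patterns from the output above can be run against a short string with Python's `re` module. The sample text is hypothetical and only meant to expose edge cases.

```python
import re

# Patterns copied from the model output above.
BOOL_PATTERN = re.compile(r"\b(true|false)\b")
NUM_PATTERN = re.compile(r"\b\d+(\.\d+)?\b")

# Hypothetical sample text covering a few edge cases.
text = "active=true count=42 ratio=3.14 flag=False id=abc123 delta=-5"

# Use finditer with group(0) so the full match is returned,
# not just the contents of the capturing group.
bools = [m.group(0) for m in BOOL_PATTERN.finditer(text)]
nums = [m.group(0) for m in NUM_PATTERN.finditer(text)]

print(bools)
print(nums)
```

On this sample, the boolean pattern is case-sensitive and skips `False`, and the numeric pattern ignores digits embedded in identifiers such as `abc123` but drops the sign of `-5`. Passing `re.IGNORECASE` or adding an optional `-?` prefix would be possible adjustments, depending on what the data actually contains.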
