### _Pretrain_:

Pretrain data is plain text stored under the `"text"` key, one JSON object per line. E.g.:

```jsonl
{"text": "Text contained in document n°1"}
{"text": "Text contained in document n°2"}
```
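
A minimal sketch of producing such a file from Python, assuming the documents are already available as a list of strings. The helper name `write_pretrain_jsonl` and the output path are hypothetical; `ensure_ascii=False` keeps characters such as `°` readable in the output.

```python
import json


def write_pretrain_jsonl(documents, path):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for text in documents:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


docs = ["Text contained in document n°1", "Text contained in document n°2"]
write_pretrain_jsonl(docs, "pretrain_sample.jsonl")
```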

### _Instruct_:

Currently two different types of instruction-following data are supported:

- _Instruct_: conversational data stored in the `"messages"` key in the form of a list. Each list item is a dictionary containing the `"content"` and `"role"` keys. `"role"` is a string that is one of `"user"`, `"assistant"` or `"system"`. The loss is computed only on messages whose `"role"` is `"assistant"`. E.g.:

```jsonl
{
  "messages": [
    {
      "role": "user",
      "content": "User interaction n°1 contained in document n°1"
    },
    {
      "role": "assistant",
      "content": "Bot interaction n°1 contained in document n°1"
    },
    {
      "role": "user",
      "content": "User interaction n°2 contained in document n°1"
    },
    {
      "role": "assistant",
      "content": "Bot interaction n°2 contained in document n°1"
    }
  ]
}
```
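
The example above is pretty-printed for readability; in a JSON Lines file each conversation sits on a single line. A minimal sketch of checking a sample and serializing it that way, assuming only the rules stated above (a `"messages"` list whose items carry `"role"` and `"content"`, with `"role"` one of the three listed values). The function name `to_jsonl_line` is hypothetical, and rejecting an empty list is an extra assumption made here.

```python
import json

ALLOWED_ROLES = {"user", "assistant", "system"}


def to_jsonl_line(sample):
    """Check a conversation against the rules above and serialize it on one line."""
    messages = sample.get("messages")
    # Assumption: an empty conversation is treated as invalid.
    if not isinstance(messages, list) or not messages:
        raise ValueError('"messages" must be a non-empty list')
    for i, message in enumerate(messages):
        if not isinstance(message, dict) or not {"role", "content"} <= message.keys():
            raise ValueError(f'message {i} needs "role" and "content" keys')
        if message["role"] not in ALLOWED_ROLES:
            raise ValueError(f"message {i} has unsupported role {message['role']!r}")
    return json.dumps(sample, ensure_ascii=False)


sample = {
    "messages": [
        {"role": "user", "content": "User interaction n°1 contained in document n°1"},
        {"role": "assistant", "content": "Bot interaction n°1 contained in document n°1"},
    ]
}
print(to_jsonl_line(sample))
```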

In [None]:
# let's start with the data we already have, then we can check for more data later on!

In [1]:
import pandas as pd

In [2]:
test = pd.read_csv(
    "/Users/naimsassine/Desktop/DS & AI/BelgianLaw Finetuning/Data/synthetic.csv"
)

In [8]:
# Let's prompt ChatGPT to get answers to the questions, using a random sample of 2000 rows
test = test.sample(2000)

In [9]:
import openai

openai.api_type = "azure"
openai.api_key = ""
openai.api_base = ""
openai.api_version = ""

In [10]:
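def generate_text(question, model="gpt-4"):
    """Ask the chat model a Belgian-law question and return its short answer."""
    # English gloss of the French instruction below: "You are an assistant with
    # in-depth knowledge of the Belgian legal field. Answer the following
    # question, which falls within the Belgian legal context, in 2-3 lines at
    # most, with the necessary justifications."
    prompt = (
        "Tu es un assistant avec des connaissances approfondies dans le milieu "
        "légal Belge. Je veux que tu répondes à la question suivante qui réside "
        "dans le contexte légal Belge en maximum 2-3 lignes, pas plus, avec les "
        "justifications nécessaires."
    )
    full_prompt = f"{prompt}\n\n{question}"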

    try:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "user", "content": full_prompt},
            ],
            deployment_id=model,
        )
        return response.choices[0].message["content"].strip()
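    except openai.OpenAIError as e:
        # Note: the error text is returned in place of an answer, so failed
        # calls should be filtered out before the results are used as data.
        return f"An error occurred: {e}"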

In [None]:
# Example usage
question = (
    "Quels sont les droits des travailleurs en Belgique concernant les congés annuels?"
)


generated_text = generate_text(question)
print(generated_text)

In [None]:
test.extra_description = test.extra_description.fillna("pas d'information en plus")

In [11]:
test["detailed_question"] = test.question

In [None]:
test["detailed_question"] = test.apply(
    lambda x: x["question"]
    + ". En sachant que la categorie de la question est : "
    + x["subcategory"]
    + " et voici une information en plus par rapport à la question: "
    + x["extra_description"],
    axis=1,
)

In [12]:
generate_text(test["detailed_question"].values[0])

"Un conseiller peut exécuter une réclamation à l'administration belge lorsqu'il identifie une violation des droits ou des intérêts légitimes d'un individu ou d'une entreprise, et ce, conformément aux procédures légales en vigueur (par exemple, dans le cadre d'un recours administratif)."

In [13]:
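def generate_answers_and_write_to_file(df, question_column, output_file):
    """Generate an answer for every question and write each pair to a text file."""
    # Write as UTF-8 so the file matches the encoding used when it is read back.
    with open(output_file, "w", encoding="utf-8") as file:
        for _, row in df.iterrows():
            question = row[question_column]
            answer = generate_text(question)
            # Records are separated by a blank line.
            file.write(f"Question: {question}\nAnswer: {answer}\n\n")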

In [None]:
generate_answers_and_write_to_file(test, "detailed_question", "answers_synthet.txt")
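
The `Question:`/`Answer:` text layout relies on blank lines as record separators, which becomes ambiguous if a generated answer itself contains a blank line. One alternative, sketched here under the assumption that the answers are already available as strings, is to store each pair as one JSON object per line. The names `write_qa_jsonl` and `read_qa_jsonl` and the file name are hypothetical.

```python
import json


def write_qa_jsonl(pairs, path):
    """Write (question, answer) tuples as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"question": question, "answer": answer}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_qa_jsonl(path):
    """Read the records back as a list of dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A multi-paragraph answer survives the round trip unchanged.
pairs = [("Question 1?", "First paragraph.\n\nSecond paragraph.")]
write_qa_jsonl(pairs, "answers_sample.jsonl")
print(read_qa_jsonl("answers_sample.jsonl"))
```

Passing the list returned by `read_qa_jsonl` to `pd.DataFrame` gives a dataframe with `question` and `answer` columns.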

In [15]:
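def parse_qa_file(file_path):
    """Parse "Question: ... / Answer: ..." records separated by blank lines."""
    with open(file_path, "r", encoding="utf-8") as file:
        content = file.read()

    questions = []
    answers = []

    for pair in content.split("\n\n"):
        if not pair.strip():
            continue
        if not pair.startswith("Question:") or "Answer:" not in pair:
            # A chunk that is not a full record is treated as the continuation
            # of a multi-paragraph answer and attached to the previous record.
            if answers:
                answers[-1] += "\n\n" + pair.strip()
            continue
        question_part, answer_part = pair.split("Answer:", 1)
        questions.append(question_part.replace("Question:", "", 1).strip())
        answers.append(answer_part.strip())

    df = pd.DataFrame({"Question": questions, "Answer": answers})

    return df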

In [20]:
df = parse_qa_file("answers_synthet.txt")

In [21]:
df = df.rename(columns={"Question": "question", "Answer": "answer"})

In [22]:
from fuzzywuzzy import process


# Function to perform fuzzy matching and merge dataframes
def fuzzy_merge(df1, df2, key1, key2, threshold=90, limit=1):
    """
    df1, df2: Dataframes to be merged
    key1, key2: Column names to match on
    threshold: Score cutoff (0-100) below which a candidate is discarded
    limit: Currently unused; only the single best match is kept
    """
    # Work on a copy so the caller's dataframe is not modified.
    df1 = df1.copy()
    s = df2[key2].tolist()

    # extractOne returns (match, score), or None when nothing reaches the cutoff.
    matches = df1[key1].apply(
        lambda x: process.extractOne(x, s, score_cutoff=threshold)
    )

    df1["match"] = matches
    df1["match_score"] = df1["match"].apply(lambda x: x[1] if x is not None else None)
    df1["match"] = df1["match"].apply(lambda x: x[0] if x is not None else None)

    # Left join keeps every row of df1; unmatched rows get NaN in df2's columns.
    merged_df = df1.merge(df2, left_on="match", right_on=key2, how="left")
    merged_df = merged_df.drop(columns=["match", "match_score"])

    return merged_df


# Use the fuzzy_merge function to merge the dataframes
merged_df = fuzzy_merge(test, df, "question", "question", threshold=90)

In [23]:
merged_df = merged_df[merged_df.answer.notna()]

In [28]:
merged_df.to_csv("synth_QA.csv", index=False)
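
To use these question/answer pairs as instruction data in the `"messages"` format described earlier, each row can be turned into a two-turn conversation. A minimal sketch, assuming the rows are available as dictionaries with `question` and `answer` keys; the sample row, the function name `qa_to_instruct_jsonl`, and the output file name are illustrative only.

```python
import json


def qa_to_instruct_jsonl(rows, path, question_key="question", answer_key="answer"):
    """Write each row as a user/assistant conversation, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            sample = {
                "messages": [
                    {"role": "user", "content": row[question_key]},
                    {"role": "assistant", "content": row[answer_key]},
                ]
            }
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


# Illustrative data only; not taken from the real dataset.
example_rows = [{"question": "Example question?", "answer": "Example answer."}]
qa_to_instruct_jsonl(example_rows, "instruct_sample.jsonl")
```

With pandas, the rows can be obtained with `merged_df.to_dict("records")`. After the fuzzy merge the question column may carry a pandas suffix such as `question_x`, in which case pass `question_key="question_x"`.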