# Step 2: Generate queries

Step 1 produced a set of EN/FR section pairs that we can use for both IR and cross-language IR (CLIR) evaluations.

Before we can evaluate anything, we need queries. In particular, we want natural language queries, since that is how users are expected to interact with the system.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('laws_pairs.csv.zip').fillna("")

In [5]:
df.columns

Index(['section_id', 'doc_id', 'type', 'doc_title_eng', 'doc_title_fra',
       'section_str_eng', 'section_str_fra', 'heading_str_eng',
       'heading_str_fra', 'text_eng', 'text_fra', 'char_cnt_eng',
       'char_cnt_fra', 'token_cnt_eng', 'token_cnt_fra'],
      dtype='object')

In [20]:
def combine_text(row):
    nl = "\n"
    return (
        f"{row['doc_title_eng']}\n"
        f"{' > ' + row['heading_str_eng'] + nl if row['heading_str_eng'] else ''}"
        f"{row['section_str_eng']}\n"
        f"---\n{row['text_eng']}"
    ), (
        f"{row['doc_title_fra']}\n"
        f"{' > ' + row['heading_str_fra'] + nl if row['heading_str_fra'] else ''}"
        f"{row['section_str_fra']}\n"
        f"---\n{row['text_fra']}"
    )

combined_texts = df.apply(combine_text, axis=1)
df["text_combined_eng"] = [x[0] for x in combined_texts]
df["text_combined_fra"] = [x[1] for x in combined_texts]

df[['text_combined_eng', 'text_combined_fra']].sample(5)

| | text_combined_eng | text_combined_fra |
|---|---|---|
| 32470 | Telecommunications Act\n > Investigation and E... | Loi sur les télécommunications\n > Enquêtes et... |
| 58177 | Canada Occupational Health and Safety Regulati... | Règlement canadien sur la santé et la sécurité... |
| 13221 | Excise Tax Act\n > Air Transportation Tax > Pe... | Loi sur la taxe d’accise\n > Taxe de transport... |
| 50300 | Apprentice Loans Regulations\n > Removal of Re... | Règlement sur les prêts aux apprentis\n > Levé... |
| 28950 | Pension Act\n > Pensions > Pensions for Death\... | Loi sur les pensions\n > Pensions > Pensions p... |
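
Because the sampled output above is truncated, the layout of each combined string is hard to see. The sketch below rebuilds the same layout for a single language so the full result can be printed. `render_section` is a hypothetical helper that mirrors `combine_text`, and the row values are made up for illustration; they are not taken from `laws_pairs.csv.zip`.

```python
def render_section(title, heading, section, text):
    """Mirror the layout built by combine_text for a single language."""
    # The heading line is only emitted when the heading string is non-empty
    heading_line = f" > {heading}\n" if heading else ""
    return f"{title}\n{heading_line}{section}\n---\n{text}"


# Hypothetical values, for illustration only
with_heading = render_section(
    title="Example Act",
    heading="Part 1 > Definitions",
    section="Section 2",
    text="In this Act, ...",
)
without_heading = render_section("Example Act", "", "Section 2", "In this Act, ...")

print(with_heading)
print("=" * 20)
print(without_heading)
```

The document title comes first, then the optional heading path, then the section label, and the section text follows the `---` separator.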


In [35]:
np.random.seed(42)
small_df = df.sample(1000)

## Question generation with LlamaIndex

In [33]:
import os
import textwrap as tr

import nest_asyncio
from azure.identity import DefaultAzureCredential, ManagedIdentityCredential
from azure.keyvault.secrets import SecretClient
from dotenv import load_dotenv
from llama_index import ServiceContext, set_global_service_context
from llama_index.embeddings import AzureOpenAIEmbedding
from llama_index.llms import AzureOpenAI
from llama_index.prompts import ChatMessage, ChatPromptTemplate, MessageRole
from tqdm import tqdm

# Patch asyncio to allow nested event loops, which Jupyter needs because
# it already runs its own loop
nest_asyncio.apply()

def pwrap(text):
    print(tr.fill(str(text), width=80))

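# Load environment variables from a local .env file, if one exists.
# load_dotenv() returns False instead of raising when the file is missing,
# so no try/except is needed.
load_dotenv(dotenv_path=".env")

# If we're running on Azure, use the Managed Identity to get the secrets.
# Defaulting to "" avoids an AttributeError when CREDENTIAL_TYPE is unset.
if os.environ.get("CREDENTIAL_TYPE", "").lower() == "managed":
    credential = ManagedIdentityCredential()
else:
    credential = DefaultAzureCredential()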

# Login to KeyVault using Azure credentials
client = SecretClient(
    vault_url=os.environ.get("AZURE_KEYVAULT_URL"), credential=credential
)

OPENAI_API_BASE = os.environ.get("AZURE_OPENAI_ENDPOINT")
OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_VERSION")
OPENAI_API_KEY = client.get_secret("OPENAI-SERVICE-KEY").value

api_key = OPENAI_API_KEY
azure_endpoint = OPENAI_API_BASE
api_version = OPENAI_API_VERSION

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="gpt-35-turbo-unfiltered",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
    temperature=0.1,
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="text-embedding-ada-002",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)


service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
)

llm4 = AzureOpenAI(
    model="gpt-4",
    deployment_name="gpt-4",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
    temperature=0,
)

service_context_gpt4 = ServiceContext.from_defaults(
    llm=llm4,
    embed_model=embed_model,
)

set_global_service_context(service_context)


QUESTION_GEN_USER_TMPL = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "generate the relevant question."
)

QUESTION_GEN_SYS_TMPL = """\
You are labelling a cross-language information retrieval (CLIR) dataset.
You are given a chunk of context information, which will be in {language}.
Generate a question, in Canadian {language}, that relates to the context information.
The questions will be used to evaluate the quality of the information retrieval system.
Restrict the question to the context information provided.\
"""

question_gen_template = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role=MessageRole.SYSTEM, content=QUESTION_GEN_SYS_TMPL),
        ChatMessage(role=MessageRole.USER, content=QUESTION_GEN_USER_TMPL),
    ]
)

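def generate_queries(texts, language="english"):
    """Generate one question per text with the LLM, checkpointing to queries.txt."""

    def save(queries):
        # utf-8 keeps accented French characters intact on any platform
        with open("queries.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(queries))

    queries = []
    for i, text in enumerate(tqdm(texts)):
        fmt_messages = question_gen_template.format_messages(
            context_str=text,
            language=language,
        )
        chat_response = llm.chat(fmt_messages)
        queries.append(chat_response.message.content)
        tqdm.write(chat_response.message.content)
        # Checkpoint after every 10th query
        if (i + 1) % 10 == 0:
            save(queries)

    # Final save so queries generated after the last checkpoint are not lost
    save(queries)
    return queries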
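
The periodic save in `generate_queries` protects against losing work, but a restarted run still begins from the first text and pays for every LLM call again. One possible alternative, sketched below under stated assumptions, appends each result to a JSON Lines file keyed by row and skips keys that are already present. `load_done`, `generate_with_resume`, and `fake_generate` are hypothetical names; `fake_generate` stands in for the `llm.chat` call so the sketch runs without Azure credentials. In the notebook, the keys could be the `section_id` values.

```python
import json
import os
import tempfile


def load_done(path):
    """Return {key: query} for every record already written to the JSONL file."""
    done = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["key"]] = record["query"]
    return done


def generate_with_resume(items, generate_fn, path):
    """items: iterable of (key, text). Keys already present in `path` are skipped."""
    done = load_done(path)
    with open(path, "a", encoding="utf-8") as f:
        for key, text in items:
            if key in done:
                continue
            query = generate_fn(text)
            # json.dumps escapes embedded newlines, so one record stays on one line
            f.write(json.dumps({"key": key, "query": query}, ensure_ascii=False) + "\n")
            f.flush()
            done[key] = query
    return done


# Hypothetical stand-in for the LLM call, used only to exercise the sketch
calls = []


def fake_generate(text):
    calls.append(text)
    return f"What does this section say?\n({text})"


checkpoint = os.path.join(tempfile.mkdtemp(), "queries.jsonl")
items = [(1, "section one"), (2, "section two"), (3, "section three")]

first = generate_with_resume(items[:2], fake_generate, checkpoint)  # "interrupted" run
second = generate_with_resume(items, fake_generate, checkpoint)     # resumed run
print(len(calls), sorted(second))
```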

In [34]:
text_combined_eng_queries = generate_queries(small_df["text_combined_eng"], "English")

  5%|▌         | 1/20 [00:01<00:23,  1.25s/it]

What are the conditions for a judge to make an election to become a supernumerary judge in a provincial superior court?


 10%|█         | 2/20 [00:01<00:16,  1.07it/s]

What happens to ongoing applications under section 66 of the former Act when the Budget Implementation Act, 2021, No. 1 comes into force?


 15%|█▌        | 3/20 [00:02<00:13,  1.25it/s]

What criteria must be satisfied for an expungement order to be granted for a conviction related to procuring a woman or female person's miscarriage?


 20%|██        | 4/20 [00:03<00:10,  1.55it/s]

What are the minimum maintenance standards for cottages and accessory buildings in the National Parks of Canada?


 25%|██▌       | 5/20 [00:03<00:10,  1.40it/s]

What factors are considered in determining whether the requirements of subsection 5.1(2) of the Act have been met in respect of the adoption of a person referred to in section 7 of the Regulations?


 30%|███       | 6/20 [00:04<00:09,  1.47it/s]

What types of positions can the Department of National Defence recruit for under the Department of National Defence Terms Under Three Months Regulations, 1992?


 35%|███▌      | 7/20 [00:05<00:08,  1.48it/s]

What measures should be taken by certain professionals and entities to determine if a person they have a business relationship with is a politically exposed individual?


 40%|████      | 8/20 [00:05<00:07,  1.70it/s]

What are the requirements for a register that incorporates an electronic display?


 45%|████▌     | 9/20 [00:06<00:06,  1.77it/s]

What happens to the Canadian Environmental Assessment Act if a project is referred to a review panel under subsection 29(1) of that Act?


 50%|█████     | 10/20 [00:06<00:05,  1.73it/s]

What is the purpose of the General Rules under Section 209 of the Bankruptcy and Insolvency Act?


 55%|█████▌    | 11/20 [00:07<00:07,  1.28it/s]

What are the percentages of the fee or charge that a satellite operator or a ground station operator must pay to the Minister for receiving satellite remote sensing imagery, tape or service described in Column I of items 6, 7, 15, and 16 of the schedule?


 60%|██████    | 12/20 [00:08<00:06,  1.31it/s]

What are the eligibility criteria for a Canadian offender who committed a murder as a young person and is serving an adult sentence under the International Transfer of Offenders Act?


 65%|██████▌   | 13/20 [00:09<00:05,  1.36it/s]

What is the purpose of deleting the name of a distinct breed or evolving breed from an association's articles of incorporation under the Animal Pedigree Act?


 70%|███████   | 14/20 [00:09<00:04,  1.45it/s]

What happens if a commercial operation or eligible person is not entitled to a contribution made under the Industrial and Regional Development Act?


 75%|███████▌  | 15/20 [00:10<00:02,  1.76it/s]

What is the purpose of establishing park reserves under the Canada National Parks Act?


 80%|████████  | 16/20 [00:10<00:02,  1.72it/s]

What is the purpose of the Minister's order in relation to the operation of remote sensing space systems?


 85%|████████▌ | 17/20 [00:11<00:01,  1.70it/s]

What are the purposes of the pre-arbitration meeting according to subsection 14(1) of the Rules of Procedure for Rail Level of Service Arbitration?


 90%|█████████ | 18/20 [00:11<00:01,  1.70it/s]

What is the requirement for the leader of a political party regarding the provision of personal information and internet address to the Chief Electoral Officer?


 95%|█████████▌| 19/20 [00:12<00:00,  1.52it/s]

What fees are imposed on a vessel if the marine safety inspector requires more than 3.75 hours to assess a tank prewash operation exemption request?


100%|██████████| 20/20 [00:13<00:00,  1.49it/s]

What information should a person or entity obtain and record if they determine that an account will be used by or on behalf of a third party?
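
Since these queries become evaluation labels, a quick mechanical check before using them may catch obvious failures such as empty responses or answers that are not phrased as questions. The sketch below is one way to do that; `check_queries` is a hypothetical helper, and the checks it applies are assumptions about what counts as a usable query, not criteria stated in this notebook.

```python
from collections import Counter


def check_queries(queries):
    """Return simple quality flags (as index or text lists) for generated queries."""
    counts = Counter(q.strip() for q in queries)
    return {
        "empty": [i for i, q in enumerate(queries) if not q.strip()],
        "multiline": [i for i, q in enumerate(queries) if "\n" in q.strip()],
        "not_a_question": [
            i for i, q in enumerate(queries)
            if q.strip() and not q.strip().endswith("?")
        ],
        "duplicates": [q for q, n in counts.items() if n > 1 and q],
    }


sample = [
    # Two queries copied from the output above
    "What are the requirements for a register that incorporates an electronic display?",
    "What is the purpose of establishing park reserves under the Canada National Parks Act?",
    # Hypothetical bad outputs, added to show what gets flagged
    "",
    "Question: explain park reserves",
]
report = check_queries(sample)
print(report)
```

Flagged indices can then be regenerated or dropped before the queries are paired with their source sections.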



