<a href="https://colab.research.google.com/github/k-dovan/ALQAC23/blob/master/Sonic_QA_Competition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ALQAC 2023 - Sonic team's notebook



## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/log-level)


## Installing Haystack

To start, let's install the latest release of Haystack with `pip`:

In [1]:
!pip install --upgrade pip
!pip install farm-haystack[colab]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Set the logging level so that Haystack messages are shown from INFO upwards, while all other libraries stay at WARNING:

In [2]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
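To make the effect of those two calls concrete, here is a small standard-library sketch. It assumes only Python's usual logger hierarchy; `some_other_library` is a hypothetical logger name used for contrast.

```python
import logging

# Root logger at WARNING, "haystack" logger at INFO, as in the cell above.
logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

# Child loggers inherit the level of their nearest configured ancestor.
haystack_child = logging.getLogger("haystack.modeling.utils")
other_lib = logging.getLogger("some_other_library")  # hypothetical name

haystack_level = logging.getLevelName(haystack_child.getEffectiveLevel())
other_level = logging.getLevelName(other_lib.getEffectiveLevel())
print(haystack_level, other_level)
```

This is why the `INFO:haystack...` lines appear in later cell outputs while other libraries stay quiet unless they warn.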

## Initializing the DocumentStore

We'll start building our question answering system by initializing a DocumentStore, which stores the Documents that the system searches for answers. In this tutorial, we're using the `InMemoryDocumentStore`, the simplest DocumentStore to get started with. It requires no external dependencies and is a good option for smaller projects and debugging. However, it doesn't scale well to larger Document collections, so it's not a good choice for production systems. To learn more about the DocumentStore and the different types of external databases that we support, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

Let's initialize the DocumentStore:

In [3]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


The DocumentStore is now ready. Now it's time to fill it with some Documents.

## Preparing Dataset

1. Mount the dataset from Google Drive.

In [4]:
# mount data from Google Drive
from google.colab import drive
drive.mount('/content/drive')

data_dir = "/content/drive/MyDrive/Colab Notebooks/ALQAC_2023_training_data"
print (data_dir)

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/ALQAC_2023_training_data


2. Read the dataset from the JSON file and load it into the DocumentStore.

In [5]:
# Read the law data from a JSON file into a list of dicts in the format
# that write_documents() expects, for example:
# {"content": "...", "meta": {"law_id": "05/2022/QH15", "article_id": "95"}}
import json

def read_data(file_path: str) -> list:
  res = []
  with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)
  if not isinstance(data, list):
    return res
  for law in data:
    # skip laws without an id or without a list of articles
    if not (law.get("id") and isinstance(law.get("articles"), list)):
      continue
    for art in law["articles"]:
      # skip articles without an id or text
      if not (art.get("id") and art.get("text")):
        continue
      item = {"content": art["text"], "meta": {"law_id": law["id"], "article_id": art["id"]}}
      res.append(item)

  return res
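To illustrate the input and output shapes that `read_data` works with, here is a self-contained sketch. The sample JSON is made up for illustration (it is not taken from the ALQAC data), and `flatten_laws` is a hypothetical helper that mirrors the same flattening logic on an in-memory list.

```python
import json

# Hypothetical sample mirroring the structure that read_data expects:
# a list of laws, each with an "id" and a list of "articles".
sample_json = """
[
  {"id": "Law A", "articles": [
      {"id": "1", "text": "First article text."},
      {"id": "2", "text": ""}
  ]},
  {"id": "Law B", "articles": [
      {"id": "7", "text": "Another article."}
  ]}
]
"""

def flatten_laws(laws: list) -> list:
    """Turn nested laws into one record per non-empty article."""
    records = []
    for law in laws:
        for art in law.get("articles") or []:
            if not (art.get("id") and art.get("text")):
                continue  # skip articles without an id or text
            records.append({
                "content": art["text"],
                "meta": {"law_id": law["id"], "article_id": art["id"]},
            })
    return records

records = flatten_laws(json.loads(sample_json))
for record in records:
    print(record)
```

The article with empty text is dropped, and each remaining record carries its law and article identifiers in `meta`, which makes them available again on every retrieved Document.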

In [6]:
from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(use_bm25=True)

# load data from json file
data = read_data(f"{data_dir}/law.json")

# skip documents that already exist in the store
document_store.write_documents(data, batch_size=1000, duplicate_documents="skip")

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.document_stores.base:Duplicate Documents: Document with id 'bf5f5b16a6442013ce41e0a198119614' already exists in index 'document'


Updating BM25 representation...:   0%|          | 0/2130 [00:00<?, ? docs/s]
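The duplicate message in the log above suggests that document ids are derived from the content. The following toy sketch shows what "skip" means under that assumption. `TinyStore` is a hypothetical stand-in, not Haystack's implementation, and MD5 is used only for illustration.

```python
import hashlib

class TinyStore:
    """Toy stand-in for a document store; not Haystack's implementation."""

    def __init__(self):
        self.docs = {}

    def write_documents(self, documents):
        skipped = 0
        for doc in documents:
            # Assumption: the id is a hash of the content only (MD5 for illustration).
            doc_id = hashlib.md5(doc["content"].encode("utf-8")).hexdigest()
            if doc_id in self.docs:
                skipped += 1  # same content already stored, so skip it
                continue
            self.docs[doc_id] = doc
        return skipped

store = TinyStore()
skipped = store.write_documents([
    {"content": "Article text", "meta": {"law_id": "Law A", "article_id": "1"}},
    {"content": "Article text", "meta": {"law_id": "Law B", "article_id": "9"}},
    {"content": "Other text", "meta": {"law_id": "Law A", "article_id": "2"}},
])
print(len(store.docs), skipped)
```

Note that in this sketch two articles from different laws collapse into one entry when their text is identical, which is worth keeping in mind if the competition data contains such articles.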

## Initializing the Retriever

Our search system will use a Retriever, so we need to initialize it. A Retriever sifts through all the Documents and returns only the ones relevant to the question. This tutorial uses the BM25 algorithm. For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).
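For intuition about how BM25 ranks Documents, here is a minimal pure-Python sketch of a basic Okapi BM25 formula. The whitespace tokenization and the values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; they are not necessarily what Haystack uses internally.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score each document against the query with a basic Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the court may fine a person who insults the court",
    "a contract is void if it violates the law",
    "a person who insults a colleague may be imprisoned",
]
scores = bm25_scores("person insults colleague", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The third document wins because it contains all three query terms, including the rare term "colleague", while the second document shares no terms with the query and scores zero.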

Let's initialize a BM25Retriever and make it use the InMemoryDocumentStore we initialized earlier in this tutorial:

In [18]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

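# retrieve the single most relevant document (top_k = 1) for a question; the query asks, roughly:
# "If a person in a working relationship seriously insults the dignity or honor of a comrade and the
# victim commits suicide as a result, what is the range of the prison sentence in years?"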
relevants = retriever.retrieve(
    query = "Người trong quan hệ công tác mà xúc phạm nghiêm trọng nhân phẩm, danh dự đồng đội, dẫn đến nạn nhân tự sát thì bị phạt tù từ bao nhiêu đến bao nhiêu năm?",
    top_k = 1
)

print (relevants)

[<Document: {'content': 'Xử lý hành vi xúc phạm uy tín của Tòa án, danh dự, nhân phẩm, sức khoẻ của những người tiến hành tố tụng hoặc những người khác thực hiện nhiệm vụ theo yêu cầu của Tòa án\n\nNgười có hành vi xúc phạm uy tín của Tòa án, danh dự, nhân phẩm của những người tiến hành tố tụng hoặc những người khác thực hiện nhiệm vụ theo yêu cầu của Tòa án thì tùy theo tính chất, mức độ vi phạm mà bị xử phạt vi phạm hành chính hoặc bị truy cứu trách nhiệm hình sự theo quy định của pháp luật.', 'content_type': 'text', 'score': 0.983657998003853, 'meta': {'law_id': 'Luật Tố tụng hành chính', 'article_id': '317'}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '8e81b1d15e5e99784af6d11bef7a1f94'}>]


## Plugging a Ranker into the Pipeline

A Ranker reorders the documents returned by the Retriever so that the most relevant ones come first. This improvement comes at the cost of some additional computation time. The ranking models supported by Haystack are powered by transformers, meaning that they are sensitive to word order and syntax.

A Ranker can pair nicely with a sparse retriever, such as the BM25Retriever. While the BM25Retriever is fast and lightweight, it is not sensitive to word order but rather treats text as a bag of words. By placing a Ranker afterwards, you can offset this weakness and have a better sorted list of relevant documents.
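The two-stage idea can be sketched without any model. In this toy example, word overlap stands in for the bag-of-words retriever, and a shared-word-pair count stands in for an order-aware Ranker. Both scoring functions are illustrative assumptions and are far simpler than BM25 or a cross-encoder.

```python
def bag_of_words_score(query: str, doc: str) -> int:
    """First stage: count shared words, ignoring order (bag of words)."""
    return len(set(query.split()) & set(doc.split()))

def bigrams(text: str) -> set:
    """Return the set of adjacent word pairs in the text."""
    toks = text.split()
    return set(zip(toks, toks[1:]))

def order_aware_score(query: str, doc: str) -> int:
    """Second stage: toy stand-in for a Ranker; counts shared word pairs."""
    return len(bigrams(query) & bigrams(doc))

def retrieve_then_rank(query, docs, retriever_top_k, ranker_top_k):
    # Stage 1: cheap scoring over every document.
    candidates = sorted(docs, key=lambda d: bag_of_words_score(query, d), reverse=True)
    candidates = candidates[:retriever_top_k]
    # Stage 2: costlier scoring over the shortlist only.
    reranked = sorted(candidates, key=lambda d: order_aware_score(query, d), reverse=True)
    return reranked[:ranker_top_k]

docs = [
    "dog bites man",
    "man bites dog",
    "cat sleeps all day",
]
query = "man bites dog"
first_stage = [bag_of_words_score(query, d) for d in docs]
top = retrieve_then_rank(query, docs, retriever_top_k=2, ranker_top_k=1)
print(first_stage, top)
```

The first stage cannot tell "dog bites man" from "man bites dog" because both share the same words with the query. The order-aware second stage separates them, and it only has to look at the two shortlisted documents.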

In [26]:
from haystack.nodes import BM25Retriever, SentenceTransformersRanker
from haystack import Pipeline

ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")

pipe = Pipeline()

pipe.add_node(component=retriever, name="BM25Retriever", inputs=["Query"])
pipe.add_node(component=ranker, name="Ranker", inputs=["BM25Retriever"])

prediction = pipe.run(
    query='Người trong quan hệ công tác mà xúc phạm nghiêm trọng nhân phẩm, danh dự đồng đội, dẫn đến nạn nhân tự sát thì bị phạt tù từ bao nhiêu đến bao nhiêu năm?',
    params={"BM25Retriever": {"top_k": 2}, "Ranker": {"top_k": 1}}
)

print (prediction)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


{'documents': [<Document: {'content': 'Xử lý hành vi xúc phạm uy tín của Tòa án, danh dự, nhân phẩm, sức khoẻ của những người tiến hành tố tụng hoặc những người khác thực hiện nhiệm vụ theo yêu cầu của Tòa án\n\nNgười có hành vi xúc phạm uy tín của Tòa án, danh dự, nhân phẩm của những người tiến hành tố tụng hoặc những người khác thực hiện nhiệm vụ theo yêu cầu của Tòa án thì tùy theo tính chất, mức độ vi phạm mà bị xử phạt vi phạm hành chính hoặc bị truy cứu trách nhiệm hình sự theo quy định của pháp luật.', 'content_type': 'text', 'score': 0.965289831161499, 'meta': {'law_id': 'Luật Tố tụng hành chính', 'article_id': '317'}, 'id_hash_keys': ['content'], 'embedding': None, 'id': '8e81b1d15e5e99784af6d11bef7a1f94'}>], 'root_node': 'Query', 'params': {'BM25Retriever': {'top_k': 2}, 'Ranker': {'top_k': 1}}, 'query': 'Người trong quan hệ công tác mà xúc phạm nghiêm trọng nhân phẩm, danh dự đồng đội, dẫn đến nạn nhân tự sát thì bị phạt tù từ bao nhiêu đến bao nhiêu năm?', 'node_id': 'Ranke
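Since each returned Document carries `law_id` and `article_id` in its `meta`, the raw output can be reduced to the identifiers alone. The sketch below uses `SimpleNamespace` objects as stand-ins for Documents so that it runs without Haystack. `relevant_articles` and `fake_prediction` are hypothetical names, and the sketch assumes only that each document exposes a `.meta` dict.

```python
from types import SimpleNamespace

def relevant_articles(documents) -> list:
    """Collect law_id/article_id pairs from objects that expose `.meta`."""
    return [
        {"law_id": doc.meta["law_id"], "article_id": doc.meta["article_id"]}
        for doc in documents
    ]

# Stand-in for prediction["documents"]; values copied from the output above.
fake_prediction = {
    "documents": [
        SimpleNamespace(
            meta={"law_id": "Luật Tố tụng hành chính", "article_id": "317"},
            score=0.965289831161499,
        )
    ]
}
articles = relevant_articles(fake_prediction["documents"])
print(articles)
```

With a real pipeline result, the same function could be called on `prediction["documents"]`.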

## Initializing the Reader

A Reader scans the texts it receives from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. In this tutorial, we're using a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a strong all-round model that's good as a starting point. To find the best model for your use case, see [Models](https://haystack.deepset.ai/pipeline_nodes/reader#models).

Let's initialize the Reader:

In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading pytorch_model.bin:   0%|          | 0.00/496M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


We've initialized all the components for our pipeline. We're now ready to create the pipeline.

## Creating the Retriever-Reader Pipeline

In this tutorial, we're using a ready-made pipeline called `ExtractiveQAPipeline`. It connects the Reader and the Retriever. The combination of the two speeds up processing because the Reader only processes the Documents that the Retriever has passed on. To learn more about pipelines, see [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines).
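The speed-up can be illustrated with a toy sketch in which a counter tracks how often the expensive step runs. `toy_retriever`, `toy_reader`, and `toy_pipeline` are hypothetical stand-ins written for this illustration. They are not Haystack components, and the "reader" here just returns the first sentence of a document.

```python
calls = {"reader": 0}  # counts how often the expensive step runs

def toy_retriever(query: str, docs: list, top_k: int) -> list:
    """Cheap filter: keep the top_k documents sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def toy_reader(query: str, doc: str) -> str:
    """Stand-in for an expensive model call; it just returns the first sentence."""
    calls["reader"] += 1
    return doc.split(".")[0]

def toy_pipeline(query: str, docs: list, retriever_top_k: int) -> list:
    # The reader only ever sees the retriever's shortlist.
    shortlist = toy_retriever(query, docs, retriever_top_k)
    return [toy_reader(query, doc) for doc in shortlist]

docs = [f"Article {i} covers topic {i}. More text follows." for i in range(100)]
answers = toy_pipeline("what does article 42 cover", docs, retriever_top_k=3)
print(calls["reader"], "reader calls instead of", len(docs))
```

With 100 documents and `retriever_top_k=3`, the expensive step runs three times rather than a hundred, which is the effect described above.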

To create the pipeline, run:

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

The pipeline is ready. You can now go ahead and ask a question!

## Asking a Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. You can also set the number of documents the Retriever returns and the number of answers the Reader returns with the `top_k` parameter. To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments). To understand the importance of the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).

In [None]:
prediction = pipe.run(
    query="Who is the father of Arya Stark?",
    params={
        "Retriever": {"top_k": 1},
        "Reader": {"top_k": 1}
    }
)

Inferencing Samples:   0%|          | 0/4 [00:00<?, ? Batches/s]

Here are some questions you could try out:
- Who is the father of Arya Stark?
- Who created the Dothraki vocabulary?
- Who is the sister of Sansa?

2. Print out the answers the pipeline returned:

In [None]:
from pprint import pprint

pprint(prediction)

3. Simplify the printed answers:

In [None]:
from haystack.utils import print_answers

print_answers(
    prediction,
    details="minimum"  # choose from `minimum`, `medium`, and `all`
)

'Query: Who is the father of Arya Stark?'
'Answers:'
[   {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'}]


And there you have it! Congratulations on building your first machine-learning-based question answering system!