nguyenvulebinh/extractive-qa-mrc: Machine Reading Comprehension specialized for the Vietnamese language

Model Description

Language model: XLM-RoBERTa
Fine-tune: MRCQuestionAnswering
Language: Vietnamese, English
Downstream-task: Extractive QA
Dataset (combined English and Vietnamese):

This model is intended for QA in Vietnamese, so the validation set is Vietnamese only (although English also works fine). The evaluation results below were obtained on 10% of the Vietnamese dataset.

Model	EM	F1
base	76.43	84.16
large	77.32	85.46
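
For readers unfamiliar with the two columns: EM (exact match) and F1 are the usual extractive-QA metrics. The sketch below shows how they are typically computed in the SQuAD style, assuming simple lowercasing and whitespace tokenization. The helper names are illustrative, and the normalization used by this repository's own evaluation may differ.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a simplifying assumption)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and the gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "Google Developer Expert"
gold = "chứng chỉ Google Developer Expert"
print(exact_match(prediction, gold))        # 0.0
print(f"{f1_score(prediction, gold):.2f}")  # 0.75
```

A partially correct span gets no EM credit but still earns F1 in proportion to its token overlap, which is why F1 is the higher number in the table.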

MRCQuestionAnswering uses XLM-RoBERTa as its pre-trained language model. By default, XLM-RoBERTa splits words into sub-words. In my implementation, however, I recombine the sub-word representations (after they are encoded by the BERT layer) into word representations using a sum strategy.
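
As a rough illustration of the sum strategy, the sketch below adds up the vectors of all sub-words that belong to the same word. It uses plain Python lists and a hypothetical `word_ids` mapping (sub-word position to word index). The actual model code operates on tensors and may differ in detail.

```python
from typing import List


def sum_subwords_to_words(
    subword_vectors: List[List[float]], word_ids: List[int]
) -> List[List[float]]:
    """Sum the sub-word vectors that share a word index into one vector per word."""
    if len(subword_vectors) != len(word_ids):
        raise ValueError("need exactly one word index per sub-word vector")
    if not subword_vectors:
        return []
    num_words = max(word_ids) + 1
    dim = len(subword_vectors[0])
    word_vectors = [[0.0] * dim for _ in range(num_words)]
    for vector, word_id in zip(subword_vectors, word_ids):
        for i, value in enumerate(vector):
            word_vectors[word_id][i] += value
    return word_vectors


# Toy input: three sub-word vectors, where the first two belong to word 0
# and the third to word 1 (values and split are made up for illustration).
subword_vectors = [[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]]
word_ids = [0, 0, 1]
print(sum_subwords_to_words(subword_vectors, word_ids))
# [[1.5, 2.5], [3.0, -1.0]]
```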

Using pre-trained model

Hugging Face pipeline style (NOT using sum features strategy).

from transformers import pipeline
# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
nlp = pipeline('question-answering', model=model_checkpoint,
               tokenizer=model_checkpoint)
QA_input = {
  'question': "Bình là chuyên gia về gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}
res = nlp(QA_input)
print('pipeline: {}'.format(res))
#{'score': 0.5782045125961304, 'start': 45, 'end': 68, 'answer': 'xử lý ngôn ngữ tự nhiên'}
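
The `start` and `end` fields in the pipeline result are character offsets into the context string, so the answer can be recovered by slicing. The snippet below checks this against the output shown above without loading the model. It normalizes the text to NFC first, on the assumption that the offsets refer to precomposed Vietnamese characters.

```python
import unicodedata

# Same context as in the pipeline example, normalized so that every accented
# letter is a single code point (assumption: offsets were computed on NFC text).
context = unicodedata.normalize(
    "NFC",
    "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . "
    "Anh nhận chứng chỉ Google Developer Expert năm 2020",
)

# Result copied from the pipeline output shown above.
res = {"score": 0.5782045125961304, "start": 45, "end": 68}

# `end` is exclusive, so a plain slice yields the answer text.
span = context[res["start"]:res["end"]]
print(span)  # xử lý ngôn ngữ tự nhiên
```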

More accurate inference process (using sum features strategy)

from infer import tokenize_function, data_collator, extract_answer
from model.mrc_model import MRCQuestionAnswering
from transformers import AutoTokenizer

# model_checkpoint = "nguyenvulebinh/vi-mrc-large"
model_checkpoint = "nguyenvulebinh/vi-mrc-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = MRCQuestionAnswering.from_pretrained(model_checkpoint)

QA_input = {
  'question': "Bình được công nhận với danh hiệu gì ?",
  'context': "Bình Nguyễn là một người đam mê với lĩnh vực xử lý ngôn ngữ tự nhiên . Anh nhận chứng chỉ Google Developer Expert năm 2020"
}

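# NOTE: *QA_input unpacks the dictionary keys ('question', 'context'), not its values;
# check the signature of tokenize_function in infer.py if this call fails.
inputs = [tokenize_function(*QA_input)]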
inputs_ids = data_collator(inputs)
outputs = model(**inputs_ids)
answer = extract_answer(inputs, outputs, tokenizer)

print(answer)
# answer: Google Developer Expert. Score start: 0.9926977753639221, Score end: 0.9909810423851013
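
The two scores printed above are the start and end probabilities of the selected answer. A common way to turn per-word start and end scores into an answer is to pick the span that maximizes their product, subject to start <= end and a length cap. The sketch below shows that idea on made-up scores. `best_span` and `max_answer_len` are hypothetical names, and the repository's `extract_answer` may work differently.

```python
from typing import List, Tuple


def best_span(
    start_probs: List[float], end_probs: List[float], max_answer_len: int = 30
) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_p in enumerate(start_probs):
        # Only consider ends at or after the start, within the length cap.
        last_end = min(start + max_answer_len, len(end_probs))
        for end in range(start, last_end):
            score = start_p * end_probs[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy word-level scores for the second sentence of the context (made-up values).
words = ["Anh", "nhận", "chứng", "chỉ", "Google", "Developer", "Expert", "năm", "2020"]
start_probs = [0.01, 0.0, 0.0, 0.0, 0.97, 0.01, 0.0, 0.01, 0.0]
end_probs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.98, 0.0, 0.01]

start, end, score = best_span(start_probs, end_probs)
print(" ".join(words[start:end + 1]))  # Google Developer Expert
```

Scoring whole spans rather than taking the best start and best end independently guards against an end position that falls before the start.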

Training model

The data-bin/raw folder already contains some sample data files for the training process. Follow these steps:

Create an environment using the requirements.txt file
Clean the data

python squad_to_mrc.py
python train_valid_split.py

Train model

python main.py

Test model

python infer.py
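
The evaluation above uses 10% of the data for validation. A minimal sketch of such a split, assuming a seeded shuffle over a list of examples, is given below. The `train_valid_split` function here is illustrative and is not the contents of train_valid_split.py.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_valid_split(
    examples: Sequence[T], valid_ratio: float = 0.1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then hold out `valid_ratio` of the examples."""
    if not 0.0 < valid_ratio < 1.0:
        raise ValueError("valid_ratio must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder examples; real records would hold question, context and answer.
examples = [{"id": i} for i in range(20)]
train, valid = train_valid_split(examples)
print(len(train), len(valid))  # 18 2
```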
