
## Create Training Data

There are two ways to generate training data:

1. **Annotation**: You can use the [annotation tool](https://haystack.deepset.ai/guides/annotation) to label your data, that is, to highlight the answers to your questions in a document. The tool supports structuring your workflow with organizations, projects, and users. The labels can be exported in SQuAD format, which is compatible with training in Haystack.

![Snapshot of the annotation tool](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/_src/img/annotation_tool.png)

2. **Feedback**: For production systems, you can collect training data from direct user feedback via Haystack's [REST API](https://github.com/deepset-ai/haystack#rest-api). It includes a customizable user feedback API for rating the answers the API returns, and a feedback export endpoint for retrieving that feedback data to fine-tune your model further.
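
To make the SQuAD format mentioned above more concrete, here is a minimal sketch of what such a file can look like and how to sanity-check it before training. The context, question, title, and id are made up for illustration, and `check_answer_offsets` is a hypothetical helper, not part of Haystack. The field names follow the public SQuAD v2.0 layout; an export from the annotation tool may carry additional fields.

```python
import json

# Made-up example text; replace with your own documents and labels.
context = "Haystack is an open-source framework for building search systems."
answer_text = "open-source framework"

squad_data = {
    "version": "v2.0",
    "data": [
        {
            "title": "example-doc",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is Haystack?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer inside the context
                                    "answer_start": context.find(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(dataset):
    """Return True if every answer_start points at its answer text (hypothetical helper)."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    if ctx[start:start + len(ans["text"])] != ans["text"]:
                        return False
    return True


# Serialize the same way you would before writing it to a .json training file.
serialized = json.dumps(squad_data, indent=2)
```

Misaligned `answer_start` values are a common source of silent label noise, so a check like this is worth running on any file you assemble by hand.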


## Fine-tune your model

Once you have collected training data, you can fine-tune your base model.
Here we initialize a reader as the base model and fine-tune it on our own custom dataset, which should be in SQuAD-like format.
We recommend starting from a base model that was already trained on SQuAD or a similar QA dataset, so that you benefit from transfer learning.

**Recommendation**: Run training on a GPU.
If you are using Colab, enable it via the menu "Runtime" > "Change runtime type" and select "GPU" in the dropdown.
Then set the `use_gpu` arguments below to `True`.

In [2]:
# Make sure you have a GPU running
!nvidia-smi

Sun Feb 27 14:17:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
# Install the latest release of Haystack in your own environment
# !pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

Collecting pip
  Downloading pip-22.0.3-py3-none-any.whl (2.1 MB)
     |████████████████████████████████| 2.1 MB 12.3 MB/s 
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.1.3
    Uninstalling pip-21.1.3:
      Successfully uninstalled pip-21.1.3
Successfully installed pip-22.0.3
Collecting farm-haystack[colab]
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-install-9c8m7kbm/farm-haystack_ea171481cf994da0aec5ffe4f7b8a6d8
  Running command git clone --filter=blob:none --quiet https://github.com/deepset-ai/haystack.git /tmp/pip-install-9c8m7kbm/farm-haystack_ea171481cf994da0aec5ffe4f7b8a6d8
  Resolved https://github.com/deepset-ai/haystack.git to commit b563b6622cbe65a249baff8735bf31cf99baee35
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting azure-ai-formrecognizer==3.2.0b2
  D

In [4]:
from haystack.nodes import FARMReader

INFO - haystack.modeling.model.optimization -  apex not found, won't use it. See https://nvidia.github.io/apex/


In [5]:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
data_dir = "data/"
# data_dir = "PATH/TO_YOUR/TRAIN_DATA"


INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find distilbert-base-uncased-distilled-squad locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


Downloading:   0%|          | 0.00/451 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/253M [00:00<?, ?B/s]

INFO - haystack.modeling.model.language_model -  Loaded distilbert-base-uncased-distilled-squad


Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0    0 
INFO - haystack.modeling.infer -  /w\  /w\
INFO - haystack.modeling.infer -  /'\  / \


In [7]:
reader.train(data_dir=data_dir, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1, save_dir="my_model")

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.data_handler.data_silo -  
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
INFO - haystack.modeling.data_handler.data_silo -  LOADING TRAIN DATA
INFO - haystack.modeling.data_handler.data_silo -  Loading train set from: data/dev-v2.0.json 
INFO - haystack.modeling.data_handler.data_silo -  Multiprocessing disabled, using a single worker to convert 1204dictionaries to pytorch datasets.
Preprocessing Dataset data/dev-v2.0.json: 100%|██████████| 1204/1204 [00:08<00:00, 147.67 Dicts/s]  
INFO - haystack.modeling.data_handler.data_silo -  
INFO - haystack.modeling.data_handler.data_silo -  LOADING DEV DATA
INFO - haystack.modeling.data_handler.data_silo -  No dev set is being loaded
INFO - haystack.modeling.data_handler.data_silo - 

In [None]:
# The model is saved automatically to the `save_dir` you specified at the end of training.
# You can also save a reader manually via:
reader.save(directory="my_model")

In [8]:
# To load the model at a later point:
new_reader = FARMReader(model_name_or_path="my_model")

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Model found locally at my_model
INFO - haystack.modeling.model.language_model -  Loaded my_model
INFO - haystack.modeling.model.adaptive_model -  Found files for loading 1 prediction heads
INFO - haystack.modeling.model.prediction_head -  Loading prediction head from my_model/prediction_head_0.bin
INFO - haystack.modeling.data_handler.processor -  Initialized processor without tasks. Supply `metric` and `label_list` to the constructor for using the default task or add a custom task later via processor.add_task()
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  G

In [9]:
reader_eval_results = new_reader.eval_on_file("data/", "dev-v2.0.json", device="cuda")

INFO - haystack.modeling.data_handler.data_silo -  
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
INFO - haystack.modeling.data_handler.data_silo -  LOADING TRAIN DATA
INFO - haystack.modeling.data_handler.data_silo -  No train set is being loaded
INFO - haystack.modeling.data_handler.data_silo -  
INFO - haystack.modeling.data_handler.data_silo -  LOADING DEV DATA
INFO - haystack.modeling.data_handler.data_silo -  No dev set is being loaded
INFO - haystack.modeling.data_handler.data_silo -  
INFO - haystack.modeling.data_handler.data_silo -  LOADING TEST DATA
INFO - haystack.modeling.data_handler.data_silo -  Loading test set from: data/dev-v2.0.json
INFO - haystack.modeling.data_handler.data_silo -  Got ya 2 parallel workers to convert 1204 dictionaries to pytorch datasets (chunksize = 121)...
INFO - haystack.modeling.data_handler.data_silo -   0    0 

In [10]:
reader_eval_results

{'EM': 0.6663859176282321,
 'f1': 0.6848796258878742,
 'top_n_accuracy': 0.9693422050029479}
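
For readers unfamiliar with the `EM` and `f1` numbers above, the sketch below shows how exact match and token-level F1 are conventionally defined for a single prediction in SQuAD-style evaluation. It is an illustration of the metric definitions only, not Haystack's own implementation, which may normalize text differently and aggregates over the whole dataset. The function names are hypothetical, and `top_n_accuracy` is not covered here.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match (the "no answer" case); otherwise a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Bangalore", "bangalore")
partial_f1 = f1_score("city of Bangalore", "Bangalore")
```

Here the first prediction counts as an exact match after normalization, while the second gets partial credit from F1 because only one of its three tokens overlaps with the reference.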

In [11]:
context = '''Bangalore (/bæŋɡəˈlɔːr/), officially known as Bengaluru (Kannada pronunciation: [ˈbeŋɡəɭuːɾu] (audio speaker iconlisten)), is the capital and the largest city of the Indian state of Karnataka. It has a population of more than 8 million and a metropolitan population of around 11 million, making it the third most populous city and fifth most populous urban agglomeration in India.[12] Located in southern India on the Deccan Plateau, at a height of over 900 m (3,000 ft) above sea level, Bangalore is known for its pleasant climate throughout the year. Its elevation is the highest among the major cities of India.[13]'''

In [13]:
new_reader.predict_on_texts("what is the language spoken in Karnataka?",[context])

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.57 Batches/s]


{'answers': [<Answer {'answer': 'Kannada', 'type': 'extractive', 'score': 0.10681728646159172, 'context': 'Bangalore (/bæŋɡəˈlɔːr/), officially known as Bengaluru (Kannada pronunciation: [ˈbeŋɡəɭuːɾu] (audio speaker iconlisten)), is the capital and the larg', 'offsets_in_document': [{'start': 57, 'end': 64}], 'offsets_in_context': [{'start': 57, 'end': 64}], 'document_id': 'f035ef1741f01a771e207677acb66291', 'meta': {}}>],
 'no_ans_gap': -4.2062296867370605,
 'query': 'what is the language spoken in Karnataka?'}
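
The `offsets_in_document` values in the answer above are character positions, with an exclusive end: `start` 57 and `end` 64 span the seven characters of "Kannada". The sketch below shows how such a pair maps back to the answer string. It uses a shortened, made-up text and a plain dictionary as stand-ins for the real context and `Answer` object, and `span_text` is a hypothetical helper.

```python
# Simplified stand-in for the document text used in this notebook.
text = "Bengaluru (Kannada pronunciation) is the capital of Karnataka."

# Plain-dict stand-in mirroring two fields of the Answer shown above.
start = text.find("Kannada")
answer = {
    "answer": "Kannada",
    "offsets_in_document": [{"start": start, "end": start + len("Kannada")}],
}


def span_text(document_text, offsets):
    """Slice the document with a start/end pair; the end index is exclusive."""
    return document_text[offsets["start"]:offsets["end"]]


recovered = span_text(text, answer["offsets_in_document"][0])
```

Slicing by offsets like this is useful when you want to highlight an answer in its source document rather than only display the answer string.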

### Inference Using Pipeline


In [14]:
from haystack import Pipeline, Document
from haystack.utils import print_answers
# reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
p = Pipeline()
p.add_node(component=new_reader, name="Reader", inputs=["Query"])
res = p.run(
    query="what is the largest city in karnataka? ", documents=[Document(content=context)]
)
print_answers(res,details="medium")

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.36 Batches/s]


Query: what is the largest city in karnataka? 
Answers:
[   {   'answer': 'Bangalore',
        'context': 'Bangalore (/bæŋɡəˈlɔːr/), officially known as Bengaluru '
                   '(Kannada pronunciation: [ˈbeŋɡəɭuːɾu] (audio speaker '
                   'iconlisten)), is the capital and the larg',
        'score': 0.3878067880868912}]



