<a href="https://colab.research.google.com/github/leopard8k/IRCC_Scraping/blob/master/IRCC_Basic_QA_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import re

In [None]:
CND_SITE = 'https://www.canada.ca'
IRCC_SUFFIX = '/en/immigration-refugees-citizenship/'
# Keep only links under the IRCC section of the site
ircc_link_pattern = re.compile('^' + re.escape(IRCC_SUFFIX))
# Skip in-page anchors
exclude_some = re.compile("#")
url = CND_SITE + IRCC_SUFFIX

In [None]:
def get_hrefs(url):
  """Return the set of IRCC hrefs found on the page at `url`."""
  response = requests.get(url)
  soup = BeautifulSoup(response.text, "html.parser")
  return {a['href'] for a in soup.find_all('a', href=ircc_link_pattern)
          if not exclude_some.search(a['href'])}
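
To see what the link filter keeps, here is a small offline sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without network access. The helper name `extract_ircc_links` and the sample HTML are made up for illustration.

```python
import re
from html.parser import HTMLParser

IRCC_SUFFIX = '/en/immigration-refugees-citizenship/'
# Same idea as the notebook's filter: keep hrefs that start with the IRCC path
link_pattern = re.compile('^' + re.escape(IRCC_SUFFIX))


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_ircc_links(html):
    collector = _HrefCollector()
    collector.feed(html)
    # Drop anything outside the IRCC path and any in-page anchor ('#')
    return {h for h in collector.hrefs if link_pattern.match(h) and '#' not in h}


# Hypothetical sample page with one matching link and three rejected ones
sample_html = '''
<a href="/en/immigration-refugees-citizenship/services/work-canada.html">Work</a>
<a href="/en/immigration-refugees-citizenship/services.html#wb-cont">Skip</a>
<a href="/en/services/taxes.html">Taxes</a>
<a href="https://www.canada.ca/en/immigration-refugees-citizenship/">Absolute</a>
'''
print(extract_ircc_links(sample_html))
```

Note that absolute URLs are rejected because the pattern is anchored at the start of the path, which is why the crawler later prepends `CND_SITE` to every kept link.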

In [None]:
scraped_uris = get_hrefs(url)
# in my experience, two iterations return over 5000 links;
#   experiment with a larger depth if necessary
SCRAPE_DEPTH = 1
WEBCALLS_WITH_NO_BREAK = 10

for i in range(SCRAPE_DEPTH):
  new_uris = set()
  web_calls=0
  for suffix in scraped_uris:
    new_uris |= get_hrefs(CND_SITE+suffix)
    # take a second between every few site calls to not flood the site
    web_calls += 1
    if web_calls >= WEBCALLS_WITH_NO_BREAK:
      time.sleep(1)
      web_calls = 0
  scraped_uris |= new_uris

len(scraped_uris)

629
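
The loop above fetches every URI in `scraped_uris` again on each pass. A possible variant, sketched below, only visits pages discovered in the previous pass and takes the fetch function as a parameter so it can be tried offline. The names `crawl` and `fetch_links` and the toy `site` dictionary are illustrative assumptions, not part of the notebook.

```python
import time


def crawl(start, fetch_links, depth=1, calls_per_pause=10, pause_seconds=1.0):
    """Breadth-first crawl that never fetches the same URI twice."""
    seen = set(fetch_links(start))   # links found on the start page
    frontier = set(seen)             # pages still to be visited
    calls = 0
    for _ in range(depth):
        discovered = set()
        for uri in frontier:
            discovered |= fetch_links(uri)
            calls += 1
            # Pause after every few requests so the site is not flooded
            if calls >= calls_per_pause:
                time.sleep(pause_seconds)
                calls = 0
        frontier = discovered - seen  # only new pages go into the next pass
        seen |= frontier
        if not frontier:
            break
    return seen


# Toy link graph standing in for real pages
site = {
    '/': {'/a', '/b'},
    '/a': {'/b', '/c'},
    '/b': {'/a'},
    '/c': {'/d'},
}
found = crawl('/', lambda uri: set(site.get(uri, set())), depth=1, pause_seconds=0)
print(sorted(found))
```

With the notebook's own function, the call would look like `crawl(url, lambda u: get_hrefs(u if u.startswith('http') else CND_SITE + u), depth=SCRAPE_DEPTH)`.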

In [6]:
count = 10000
for ahref in scraped_uris:
  download_url = CND_SITE + ahref
  count += 1
  urllib.request.urlretrieve(download_url,'./file-'+str(count)+'.html') 
  time.sleep(1)
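
The numbered file names drop the link between a saved page and its URL. One option is to record that mapping while downloading, so an answer can later be traced back to its source page. The sketch below only builds the mapping (it downloads nothing); `build_manifest` is a hypothetical helper and the two URIs are sample values.

```python
import json


def build_manifest(uris, base='https://www.canada.ca', start=10000):
    """Map each output file name to the URL it would be downloaded from."""
    manifest = {}
    # Sorting makes the numbering reproducible between runs
    for offset, uri in enumerate(sorted(uris), start=1):
        manifest['file-' + str(start + offset) + '.html'] = base + uri
    return manifest


# Sample URIs for illustration only
example_uris = {
    '/en/immigration-refugees-citizenship/services/work-canada.html',
    '/en/immigration-refugees-citizenship/services/study-canada.html',
}
manifest = build_manifest(example_uris)
print(json.dumps(manifest, indent=2))
```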

In [55]:
!pip install html2text

Collecting html2text
  Downloading https://files.pythonhosted.org/packages/ae/88/14655f727f66b3e3199f4467bafcc88283e6c31b562686bf606264e09181/html2text-2020.1.16-py3-none-any.whl
Installing collected packages: html2text
Successfully installed html2text-2020.1.16


In [None]:
!mkdir -p dataIRCC
!for file in file*html;do html2text ${file} > dataIRCC/${file}.txt;done

In [9]:
# Make sure you have a GPU running
!nvidia-smi

Fri Feb 26 22:59:09 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4


In [11]:
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

02/26/2021 23:00:09 - INFO - faiss.loader -   Loading faiss with AVX2 support.
02/26/2021 23:00:09 - INFO - faiss.loader -   Loading faiss.
02/26/2021 23:00:10 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

**Here:** We recommend Elasticsearch as it comes preloaded with features like [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [vector storage for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

**Alternatives:** If you are unable to set up an Elasticsearch instance, follow [Tutorial 3](https://github.com/deepset-ai/haystack/blob/master/tutorials/Tutorial3_Basic_QA_Pipeline_without_Elasticsearch.ipynb) to use the SQL/InMemory document stores.

**Hint**: This notebook creates a new document store instance with pages scraped from the IRCC section of canada.ca. However, you can configure Haystack to work with your existing document stores.

### Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [None]:
# Recommended: Start Elasticsearch using Docker
#! docker run -d -p 9200:9200 -e "discovery.type=single-node" elasticsearch:7.9.2



In [12]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
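
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As an alternative, the sketch below polls until the server answers. The helper names `http_ok` and `wait_until_ready` are made up for this example, and the demonstration uses a fake probe so it runs without a server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=2):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=30, delay_seconds=2.0):
    """Call `probe()` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_seconds)
    return False


# Offline demonstration: a fake probe that succeeds on its third call
calls = {'n': 0}


def fake_probe():
    calls['n'] += 1
    return calls['n'] >= 3


ready = wait_until_ready(fake_probe, attempts=5, delay_seconds=0)
print(ready, calls['n'])

# Against the server started above one would call:
# wait_until_ready(lambda: http_ok('http://localhost:9200'))
```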

In [13]:
# Connect to Elasticsearch

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

02/26/2021 23:00:55 - INFO - elasticsearch -   HEAD http://localhost:9200/ [status:200 request:0.083s]
02/26/2021 23:00:56 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:200 request:0.339s]
02/26/2021 23:00:56 - INFO - elasticsearch -   PUT http://localhost:9200/label [status:200 request:0.177s]


## Preprocessing of documents

Haystack provides a customizable pipeline for:
 - converting files into texts
 - cleaning texts
 - splitting texts
 - writing them to a Document Store
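
The cleaning step can be any function that takes a `str` and returns a `str`. As a hedged sketch, the function below strips markup from a saved HTML page using only the standard library. `html_to_text` is a hypothetical helper, and the choice of block-level tags is an assumption. It could be passed as the `clean_func` argument of `convert_files_to_dicts`.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    SKIPPED_TAGS = {'script', 'style'}
    BLOCK_TAGS = {'p', 'div', 'li', 'br', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append('\n')  # keep a line break between blocks

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Drop tags, scripts and styles; return one line of text per block element."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    lines = (re.sub(r'\s+', ' ', line).strip()
             for line in ''.join(extractor.parts).split('\n'))
    return '\n'.join(line for line in lines if line)


page = ('<html><head><style>p{color:red}</style></head>'
        '<body><h1>Work permit</h1><p>Who can <b>apply</b>?</p></body></html>')
print(html_to_text(page))
```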


In [35]:

doc_dir = "dataIRCC"

# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, split_paragraphs=False)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#    'text': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
#}
# (Optionally, you can add more key-value pairs here; they will be indexed as fields in Elasticsearch and
# can be used later for filtering or shown in the responses of the Pipeline)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10287.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10467.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10567.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10446.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10283.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10202.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10291.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10267.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10237.txt
02/26/2021 23:28:30 - INFO - haystack.preprocessor.utils -   Converting dataIRCC/file-10296.txt
02/26/2021 23:28:30 - INFO - haystack.pr

[{'text': '<!doctype html>\n\n\n<html class="no-js" dir="ltr" lang="en" xmlns="http://www.w3.org/1999/xhtml">\n\n<head prefix="og: http://ogp.me/ns#">\n    \n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<meta charset="utf-8"/>\n<title>Review of the Migration Policy Development Program - Canada.ca</title>\n<meta content="width=device-width,initial-scale=1" name="viewport"/>\n\n\n\t<link rel="schema.dcterms" href="http://purl.org/dc/terms/"/>\n\t\n\t\t<meta name="description" content="Review of the Migration Policy Development Program"/>\n\t\n\t\n\t\t<meta name="keywords" content="Evaluation; Report; Migration Policy Development Program (MPDP); International migration policy and research; Inter-Governmental Consultations on Asylum, Refugee and Migration Policies (IGC); Regional Conference on Migration (RCM); Migration Policy Institute (MPI)"/>\n\t\n\t\n\t\t<meta name="author" content="Immigration, Refugees and Citizenship Canada"/>\n\t\n\t\n\t\t<meta name="dcterms.title" conte

02/26/2021 23:28:41 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:4.934s]
02/26/2021 23:28:43 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.148s]


## Initialize Retriever, Reader, & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.
They use simple but fast algorithms.

**Here:** We use Elasticsearch's default BM25 algorithm.

**Alternatives:**

- Customize the `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `TfidfRetriever` in combination with a SQL or InMemory document store for simple prototyping and debugging
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `DensePassageRetriever` to use different embedding models for passage and query (see Tutorial 6)
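
To make the BM25 idea concrete, here is a small pure-Python sketch of the scoring formula. It is a textbook variant with the common defaults `k1 = 1.2` and `b = 0.75` and a naive whitespace tokenizer, not Elasticsearch's implementation. The three documents are made up.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    'a work permit lets a foreign national work in canada',
    'study permit holders may work part time',
    'visitor visa processing times',
]
scores = bm25_scores('work permit', docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The document that shares no term with the query scores zero, and a repeated query term raises the score less and less with each occurrence. That saturation is the main difference from plain TF-IDF.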

In [36]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [37]:
# Alternative: An in-memory TfidfRetriever based on Pandas dataframes for building quick-prototypes with SQLite document store.

# from haystack.retriever.sparse import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

A Reader scans the texts returned by the retriever in detail and extracts the k best answers. Readers are based
on powerful but slower deep learning models.

Haystack currently supports Readers based on the frameworks FARM and Transformers.
With both you can either load a local model or one from Hugging Face's model hub (https://huggingface.co/models).

**Here:** a medium-sized RoBERTa QA model using a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2)

**Alternatives (Reader):** TransformersReader (leveraging the `pipeline` of the Transformers package)

**Alternatives (Models):** e.g. "distilbert-base-uncased-distilled-squad" (fast) or "deepset/bert-large-uncased-whole-word-masking-squad2" (good accuracy)

**Hint:** You can adjust how readily the model returns "no answer possible" with `no_ans_boost`. Higher values mean the model prefers "no answer possible".
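
As a rough illustration of the hint above, the sketch below treats `no_ans_boost` as a value added to the score of the "no answer" option before it is compared with the best span. This is a simplified model of the idea, not FARM's actual code, and the candidate answers and scores are made up.

```python
def pick_answer(span_candidates, no_answer_score, no_ans_boost=0.0):
    """Return the best span text, or None when the boosted no-answer score wins."""
    best_text, best_score = max(span_candidates, key=lambda pair: pair[1])
    if no_answer_score + no_ans_boost > best_score:
        return None
    return best_text


# Hypothetical (text, score) pairs produced by a reader
candidates = [('a valid job offer', 4.2), ('Mexican citizens', 3.1)]

# Without a boost the best span (4.2) beats the no-answer score (2.0)
print(pick_answer(candidates, no_answer_score=2.0))
# With a boost of 3 the no-answer score becomes 5.0 and wins
print(pick_answer(candidates, no_answer_score=2.0, no_ans_boost=3))
```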

#### FARMReader

In [38]:
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

02/26/2021 23:29:13 - INFO - farm.utils -   Using device: CUDA 
02/26/2021 23:29:13 - INFO - farm.utils -   Number of GPUs: 1
02/26/2021 23:29:13 - INFO - farm.utils -   Distributed Training: False
02/26/2021 23:29:13 - INFO - farm.utils -   Automatic Mixed Precision: None
Some weights of RobertaModel were not initialized from the model checkpoint at deepset/roberta-base-squad2 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
02/26/2021 23:29:45 - INFO - farm.utils -   Using device: CUDA 
02/26/2021 23:29:45 - INFO - farm.utils -   Number of GPUs: 1
02/26/2021 23:29:45 - INFO - farm.utils -   Distributed Training: False
02/26/2021 23:29:45 - INFO - farm.utils -   Automatic Mixed Precision: None
02/26/2021 23:29:45 - INFO - farm.infer -   Got ya 2 parallel workers to do inference ...
02/26/2021 23:29:45 - INFO - farm.infer -    0    0 
02/26/2021 23:29:45 - INFO - farm.infer -   /w\  /w\
02/26/2021 23:29:45 - INFO - farm.infer -   /'\  / \
02/26/2021 23:29:45 - INFO - farm.infer -     


#### TransformersReader

In [None]:
# Alternative:
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
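
Conceptually, the extractive QA pipeline is a two-node graph: the retriever narrows the corpus to `top_k_retriever` documents and the reader extracts answers from those. The toy classes below mimic that flow with keyword overlap and a sentence picker. They are stand-ins written for this sketch and share no code with Haystack's `ExtractiveQAPipeline`.

```python
class ToyRetriever:
    """Ranks documents by how many query words they contain."""

    def __init__(self, docs):
        self.docs = docs

    def retrieve(self, query, top_k):
        terms = set(query.lower().split())
        scored = [(len(terms & set(d.lower().split())), d) for d in self.docs]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [d for _, d in scored[:top_k]]


class ToyReader:
    """Picks the sentences with the largest word overlap as 'answers'."""

    def predict(self, query, documents, top_k):
        terms = set(query.lower().split())
        answers = []
        for doc in documents:
            for sentence in doc.split('. '):
                overlap = len(terms & set(sentence.lower().split()))
                if overlap:
                    answers.append({'answer': sentence.strip('. '), 'score': overlap})
        answers.sort(key=lambda a: a['score'], reverse=True)
        return answers[:top_k]


class ToyPipeline:
    """Retriever first, then reader, mirroring the two-step flow."""

    def __init__(self, reader, retriever):
        self.reader, self.retriever = reader, retriever

    def run(self, query, top_k_retriever=10, top_k_reader=5):
        documents = self.retriever.retrieve(query, top_k_retriever)
        return {'query': query,
                'answers': self.reader.predict(query, documents, top_k_reader)}


docs = [
    'A work permit is issued by an officer. It allows a person to work in Canada.',
    'A study permit is required for most foreign students.',
    'Passport photos must meet the specifications.',
]
toy_pipe = ToyPipeline(ToyReader(), ToyRetriever(docs))
result = toy_pipe.run('work permit', top_k_retriever=2, top_k_reader=1)
print(result)
```

The reader only ever sees what the retriever passes on, which is why a larger `top_k_retriever` can surface answers the reader would otherwise never get to read.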

In [39]:
from haystack.pipeline import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [48]:
# You can configure how many candidates the reader and retriever shall return
# A higher top_k_retriever tends to give better answers, but makes the query slower.
prediction = pipe.run(query="who can get a work permit in canada?", top_k_retriever=10, top_k_reader=5)

02/26/2021 23:36:47 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.024s]
Inferencing Samples: 100%|██████████| 8/8 [00:05<00:00,  1.41 Batches/s]
Inferencing Samples: 100%|██████████| 4/4 [00:02<00:00,  1.44 Batches/s]
Inferencing Samples: 100%|██████████| 3/3 [00:01<00:00,  1.54 Batches/s]
Inferencing Samples: 100%|██████████| 9/9 [00:06<00:00,  1.45 Batches/s]
Inferencing Samples: 100%|██████████| 4/4 [00:02<00:00,  1.35 Batches/s]
Inferencing Samples: 100%|██████████| 4/4 [00:02<00:00,  1.67 Batches/s]
Inferencing Samples: 100%|██████████| 4/4 [00:02<00:00,  1.44 Batches/s]
Inferencing Samples: 100%|██████████| 5/5 [00:03<00:00,  1.50 Batches/s]
Inferencing Samples: 100%|██████████| 3/3 [00:02<00:00,  1.26 Batches/s]
Inferencing Samples: 100%|██████████| 3/3 [00:02<00:00,  1.37 Batches/s]


In [50]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Mexican citizens who have been admitted to Canada as '
                  'visitors',
        'context': ' was not issued at a port of entry;</li>\n'
                   '<li>Mexican citizens who have been admitted to Canada as '
                   'visitors may apply for a work permit under any <abbr'},
    {   'answer': 'a person who is not a Canadian citizen or a permanent '
                  'resident of Canada',
        'context': 'work in Canada issued by an officer to a person who is not '
                   'a Canadian citizen or a permanent resident of Canada. It '
                   'is required if the employment loca'},
    {   'answer': 'spouse or common-law partner',
        'context': 'rmit pilot program for permanent residence applicants in '
                   'the spouse or common-law partner in Canada class '
                   '(A70)</li>\n'
                   '      <li>Foreign physicians comi'},
    {   'answer': 'Mexican citizens',
        'contex