<a href="https://colab.research.google.com/github/patzacher/extractive_qa/blob/main/extractive_qa_webscrape_haystack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extractive QA System Using Python and Haystack




## Overview

Haystack is an end-to-end, open-source framework for building question-answering systems. It has three primary components: the DocumentStore, the Retriever, and the Reader.

1. DocumentStore: Stores text documents and their metadata. Documents are typically split into smaller units (e.g., paragraphs) before indexing so that answers can be located more accurately and at a finer granularity.

2. Retriever: A fast, lightweight algorithm that identifies candidate passages in a large collection of documents and sends the top-k candidates to the Reader. This narrows the scope for the Reader, which then searches those top-k documents thoroughly for the best answer.

3. Reader: Takes passages of text as input and returns the top-k answers with their confidence scores (in the range 0-1). Readers are powerful models that search the selected documents in full to find the right answer.

The DocumentStore, Retriever, and Reader are connected by a querying pipeline, which receives a query from the user and produces a result.


## Preparing the Colab Environment

- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)


## Install Packages

Install Haystack and other required packages:

In [None]:
%%bash

pip install --upgrade pip
pip install farm-haystack[colab,preprocessing,elasticsearch,inference]
pip install bs4
pip install requests
pip install youtube-transcript-api

Configure Haystack's logging level:

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.DEBUG)
logging.getLogger("haystack").setLevel(logging.DEBUG)

## Import Packages

In [None]:
from bs4 import BeautifulSoup
from google.colab import files
from urllib.parse import urlparse, urlunparse
from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

import json
import openpyxl
import requests
import re
import os

## Webscrape

### Upload URL List
Upload an Excel file containing the URLs we want to scrape.

In [None]:
quest_url_list = files.upload()

Saving NUIT-RCS-KB-Articles-08-24-2023.xlsx to NUIT-RCS-KB-Articles-08-24-2023.xlsx


### Retrieve URLs from Excel File

URLs are contained in a single column of the .xlsx file and appear as hyperlinks. For each hyperlink, extract the URL, and put it in `url_list`.

In [None]:
# Load the Excel file
file_path = "NUIT-RCS-KB-Articles-08-24-2023.xlsx"
workbook = openpyxl.load_workbook(file_path)

# Select the desired worksheet
worksheet = workbook['NUIT-RCS-KB-Articles']

# Specify the column index containing hyperlinks (starting from 1)
column_index = 4

# Get the number of rows in the worksheet
num_rows = worksheet.max_row

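# Iterate through rows and retrieve hyperlinks. Skip cells without a
# hyperlink (e.g., blank rows), because `.hyperlink` is None for them.
url_list = []
for row in range(2, num_rows + 1):
    hyperlink = worksheet.cell(row=row, column=column_index).hyperlink
    if hyperlink is not None and hyperlink.target:
        url_list.append(hyperlink.target)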

### Create a new directory to store all of the scraped content.

In [None]:
# Create new directory for scraped content (no error if it already exists).
os.makedirs("/content/Data", exist_ok=True)

# Switch to new directory.
os.chdir("/content/Data")

### **Optional Step**: Crawl for more URLs
To obtain more data, we can find all of the links on each page supplied by the Excel file, check that each link points to a Northwestern site, remove duplicates and near-duplicates, and collect the results in a set that replaces `url_list`. We can then use this larger data set to train our model.

In [None]:
# Define constants
SEARCH_STRING = 'northwestern' # To ensure we are looking at Northwestern websites
normalized_urls = []

def normalize_url(url):
    """
    Normalize a URL by removing '.aspx' extension while preserving query
    parameters. The purpose of this function is to remove near-duplicate URLs
    which point to the same website but differ only by '.aspx'.
    """
    parsed_url = urlparse(url)
    path = parsed_url.path
    if path.endswith('.aspx'):
        path = path[:-5]  # Remove the '.aspx' extension
    normalized_url = urlunparse(parsed_url._replace(path=path))
    return normalized_url

def get_unique_urls(soup, search_string):
    """
    Extract unique URLs containing the search string within the webpage.
    """
    links = set()
    for link in soup.find_all('a', href=True):
        href = link.get('href')
        if href and href.startswith(('http://', 'https://')) and search_string in href:
            links.add(href)
    return links

def scrape_urls(start_url, search_string):
    """
    Scrape unique URLs containing the search string within the webpage.
    """
    try:
        source_code = requests.get(start_url)
        soup = BeautifulSoup(source_code.content, 'lxml')
        urls = get_unique_urls(soup, search_string)
        return urls
    except Exception as e:
        print(f"Error while processing {start_url}: {e}")
        return set()

if __name__ == '__main__':

    # Initialize a set to store all scraped URLs. Start with those contained in
    # `url_list` because we want to ensure that none of the scraped urls are
    # duplicates of those we provided.
    scraped_urls = set(url_list)

    # Loop through the list of URLs and scrape each one
    for start_url in url_list:
        scraped_urls.update(scrape_urls(start_url, SEARCH_STRING))
        #scraped_urls.update(normalize_url(scraped))

    # Remove near-duplicate URLs
    for url in scraped_urls:
        normalized_url = normalize_url(url)
        normalized_urls.append(normalized_url)

    url_list = set(normalized_urls)
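
As a quick check of the near-duplicate handling above, the following minimal sketch re-implements the same `.aspx` stripping in isolation and applies it to a few made-up URLs. The function name `strip_aspx` and the example URLs are illustrative assumptions and are not part of the notebook.

```python
from urllib.parse import urlparse, urlunparse


def strip_aspx(url):
    """Drop a trailing '.aspx' from the URL path; keep the query string."""
    parsed = urlparse(url)
    path = parsed.path
    if path.endswith(".aspx"):
        path = path[: -len(".aspx")]
    return urlunparse(parsed._replace(path=path))


# Hypothetical example URLs (not taken from the scraped data).
candidates = [
    "https://www.example.northwestern.edu/kb/Article.aspx?ID=1",
    "https://www.example.northwestern.edu/kb/Article?ID=1",
    "https://www.example.northwestern.edu/kb/Other.aspx?ID=2",
]

# A set comprehension removes the URLs that become identical after stripping.
unique_urls = sorted({strip_aspx(url) for url in candidates})
print(unique_urls)
```

The first two candidates collapse into a single entry, so three inputs yield two unique URLs.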

### Scrape Websites


In [None]:
# Create an empty list to store text content from all pages in `url_list`, as
# well as the metadata for each url.
all_pages_content = []
all_titles = []
meta_file = []
all_links = []

for url in url_list:
    # Send a GET request to the URL and store the response in `page`. Some
    # requests may time out or fail, so we report the error and skip that URL.
    try:
        page = requests.get(url, timeout=30)

    except Exception as e:
        print(f"Error while processing {url}: {e}")
        continue

    # Parse the webpage into a BeautifulSoup object.
    soup = BeautifulSoup(page.content, 'html.parser')

    # Find the <div> elements with specified id's and exclude them. These are
    # largely related to footers that contain content unrelated to the article.
    divs_to_remove = soup.find_all('div', {'id': ['divDetails',
                                                  'divFeedback2',
                                                  'ArticleID',
                                                  'divAuthor',
                                                  'ctl00_ctl00_cpContent_cpContent_pbMain',
                                                  'ctl00_ctl00_cpContent_cpContent_upFeedbackGrid',
                                                  'divShareModal']})

    for div in divs_to_remove:
      div.extract()

    # Find the main content of the webpage by excluding headers and footers.
    # Tag names are matched with `find_all`; <div> elements are matched by
    # class with CSS selectors, because a string passed to `find_all` is
    # treated as a tag name.
    headers = soup.find_all(['header', 'nav']) + soup.select(
        'div.header, div.navbar')
    footers = soup.find_all('footer') + soup.select(
        'div.footer, div.panel.panel-default.gutter-top')
    for header in headers:
        header.extract()  # Remove headers
    for footer in footers:
        footer.extract()  # Remove footers

    # Get the title of the webpage
    page_title = soup.find('title').get_text()

    # Clean page_title by removing \r\n\t characters
    page_title = page_title.replace('\r', '').replace('\n', '').replace('\t', '')

    # Clean page_title by removing non-alphanumeric characters using regex
    page_title = re.sub(r'[^a-zA-Z0-9\s]', '', page_title)

    # Find all occurrences of `<head>` and `<body>` tags in the HTML content.
    p_tags = (soup.find_all(['head', 'body']))

    # Find all occurrences of `<a>` (link) tags.
    link = soup.find_all('a')

    # Create an empty list to store the content of the current webpage
    page_content = []

    # Create an empty list to store the links contained in the current webpage
    links = []

    # Return plain text by finding the first tag in the HTML content and use `.get_text()` to
    # extract only the text content inside the tag. Clean `page_content` by
    # removing \r\n\t characters.
    for p_tag in p_tags:
        page_content.append(p_tag.get_text().replace('\r', ' ').replace('\n', ' ').replace('\t', ' '))

    # Define a regular expression pattern to match consecutive '#' characters
    hash_pattern = re.compile(r'#+')

    # Define a regular expression pattern to match consecutive spaces
    space_pattern = re.compile(r'\s+')

    # Remove multiple consecutive spaces and replace with a single space.
    page_content = [space_pattern.sub(' ', text) for text in page_content]

    # Remove multiple consecutive # and replace with a single #.
    page_content = [hash_pattern.sub('#', text) for text in page_content]

    # Save links from current webpage to a list
    for link in soup.findAll('a'):
        links.append(link.get('href'))

    # Append the `page_content` list to the `all_pages_content` list,
    # `page_title` to the `all_titles` list, and `links` to the `all_links`
    # list.
    all_pages_content.append((page_title, page_content))
    all_titles.append(page_title)
    all_links.append(links)

    # Create a dictionary for the current page and store as metadata
    page_info = {'Link': url, 'Title': page_title}

    # Append `page_info` dictionary to the `meta_file` list
    meta_file.append(page_info)
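
The text clean-up performed inside the loop above can be tried out on its own. This is a minimal sketch of the same two substitutions (collapse whitespace, then collapse runs of `#`); the helper name `clean_text`, the final `.strip()`, and the sample string are assumptions added for illustration.

```python
import re

SPACE_PATTERN = re.compile(r"\s+")  # \r, \n and \t all count as whitespace
HASH_PATTERN = re.compile(r"#+")


def clean_text(text):
    """Collapse runs of whitespace and runs of '#' in scraped text."""
    text = SPACE_PATTERN.sub(" ", text)
    text = HASH_PATTERN.sub("#", text)
    # Assumption: leading/trailing spaces are not wanted either.
    return text.strip()


# Made-up sample resembling raw page text.
sample = "Quest\r\n\tUser   Guide ### Overview\n"
print(clean_text(sample))
```

Because `\s+` already matches carriage returns, newlines, and tabs, a single substitution covers both the character replacement and the space normalization.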


### Scrape YouTube Transcripts
**From Northwestern IT YouTube channel**

NUIT has many Quest-related YouTube videos. We can scrape their transcripts and use them in model training.

First, we define a function to retrieve the transcript from a YouTube video given its URL. Then, we loop through a list of URLs to retrieve transcripts and other useful information.

In [None]:
def retrieve_transcript(video_link):
  """
  Retrieve the transcript and title of a YouTube video and perform minimal
  cleaning. Returns the cleaned transcript, the video id, and the video title.
  """
  # Extract the video id: everything after "?v=".
  index = video_link.find("?v=")
  if index == -1:
    raise ValueError(f"No video id found in {video_link}")
  video_id = video_link[index + 3:]

  # Retrieve the transcript for the video.
  transcript = YouTubeTranscriptApi.get_transcript(video_id)

  # Initialize a formatter to convert the transcript from its Python data type
  # into a consistent string of a given format, such as basic text (.txt).
  formatter = TextFormatter()

  # .format_transcript(transcript) turns the transcript into a TXT string.
  txt_formatted = formatter.format_transcript(transcript)

  # Clean transcript by removing '\n' and "\'" characters.
  txt_formatted = txt_formatted.replace('\n', ' ').replace('\'', '')

  # Replace runs of consecutive whitespace with a single space. The pattern is
  # applied to the whole string rather than character by character.
  space_pattern = re.compile(r'\s+')
  txt_formatted = space_pattern.sub(' ', txt_formatted)

  # Retrieve video title
  response = requests.get(video_link)
  soup = BeautifulSoup(response.text, 'html.parser')
  link = soup.find_all(name="title")[0]
  video_title = link.text

  return txt_formatted, video_id, video_title
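
The function above finds the video id by searching for `?v=`, which works for the links listed in this notebook. If links with extra query parameters or short `youtu.be` links were ever added, a parser-based approach might be more robust. The following is a sketch under that assumption; `extract_video_id` is a hypothetical helper, not part of the notebook or of `youtube_transcript_api`.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(video_link):
    """Return the YouTube video id from a watch URL or a youtu.be short link."""
    parsed = urlparse(video_link)
    if parsed.hostname == "youtu.be":
        # Short links carry the id in the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    else:
        # Watch links carry the id in the `v` query parameter.
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    if not video_id:
        raise ValueError(f"No video id found in {video_link}")
    return video_id


print(extract_video_id("https://www.youtube.com/watch?v=g1jIaDCnH-k&t=30s"))
```

Unlike slicing after `?v=`, this ignores trailing parameters such as `&t=30s`.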


In [None]:
# List of URLs to scrape
video_links = [
    "https://www.youtube.com/watch?v=g1jIaDCnH-k", # Intro to Quest Part 5
    "https://www.youtube.com/watch?v=_OV5TxIQ4ss", # Intro to Quest Part 4
    "https://www.youtube.com/watch?v=eLOqLyysZjc", # Intro to Quest Part 3
    "https://www.youtube.com/watch?v=zuaveFyOJ_o", # Intro to Quest Part 2
    "https://www.youtube.com/watch?v=rGWOoR9ASBY", # Intro to Quest Part 1
    "https://www.youtube.com/watch?v=YhXATOQRISw", # Applying to Access Quest
    "https://www.youtube.com/watch?v=i-QPjF580Sc", # Debugging Jobs on Quest
    "https://www.youtube.com/watch?v=87bVF0aw4Hs", # Intro to Vim
    "https://www.youtube.com/watch?v=l2wBwT8e0h8", # Intro to Nano
    "https://www.youtube.com/watch?v=VkPv1k-IalA", # Singularity Part 4
    "https://www.youtube.com/watch?v=a6V9hqQ2M8k", # Singularity Part 3
    "https://www.youtube.com/watch?v=m67YUxxSlho", # Singularity Part 2
    "https://www.youtube.com/watch?v=YnzWpLXts9c", # Singularity Part 1
    "https://www.youtube.com/watch?v=SQol0ji83WM", # Bash Scripting Part 5
    "https://www.youtube.com/watch?v=AN7gBi0EDAw", # Bash Scripting Part 4
    "https://www.youtube.com/watch?v=htXBwXtEEV4", # Bash Scripting Part 3
    "https://www.youtube.com/watch?v=KNe42MngDg4", # Bash Scripting Part 2
    "https://www.youtube.com/watch?v=AMsEvgDoZdc", # Bash Scripting Part 1
    "https://www.youtube.com/watch?v=rIFbHt_2g4s", # Intro to Quest Remote
    "https://www.youtube.com/watch?v=xFYHs19gGjU", # Logging into Quest
    "https://www.youtube.com/watch?v=cLn_1BzRxCM" # Navigating Quest via Shell
]

# Loop through the links and scrape the transcript from each one.
for video_link in video_links:
  video_transcript, video_id, video_title = retrieve_transcript(video_link)

  # Write each transcript to a file.
  output_file = f"{video_title}.txt"
  with open(output_file, 'w', encoding='utf-8') as file:
    file.write(f"Title: {video_title}\n")
    file.write(f"Link: {video_link}\n")
    file.write("".join(video_transcript))

  # Create a dictionary for the current video and store as metadata
  page_info_transcripts = {'Link': video_link, 'Title': video_title}

  # Append `page_info_transcripts` dictionary to the `meta_file` list
  meta_file.append(page_info_transcripts)
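
Video titles are used directly as file names in the loop above. A title containing a character such as `/` would make `open()` fail, so one possible safeguard is to sanitize the title first. This is only a sketch: `safe_filename`, the set of replaced characters, and the `"untitled"` fallback are assumptions, not something the notebook does.

```python
import re


def safe_filename(title, replacement="_"):
    """Replace characters that are unsafe in file names and tidy whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Fall back to a placeholder if nothing usable remains.
    return cleaned or "untitled"


# Made-up title used only to show the effect.
print(safe_filename("Vim/Nano: Editors on Quest"))
```

If file names are changed this way, the same cleaned title should also be stored in `meta_file` so that file names and metadata stay in step.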

### Save All Webpage Content
Includes scraped websites, transcripts, and additional content from crawler

In [None]:
# If `url_list` is a set, convert it to a list. This should only be the case
# if we used the crawler to obtain more websites.
if isinstance(url_list, set):
  url_list = list(url_list)

# Save the scraped content from each page to its own .txt file. Each page's
# link is read from `meta_file`, whose first entries were appended in the same
# order as `all_pages_content`.
for (page_title, page_content), page_info in zip(all_pages_content, meta_file):
    output_file = f"{page_title}.txt"
    with open(output_file, "w", encoding="utf-8") as file:
        file.write(f"Title: {page_title}\n")
        file.write(f"Link: {page_info['Link']}\n")
        file.write("\n".join(page_content))
        file.write("\n")

# Save metadata file
with open("meta_file.txt", "w", encoding="utf-8") as file:
    for page_info in meta_file:
        file.write(f"Title: {page_info['Title']}\n")
        file.write(f"Link: {page_info['Link']}\n")
        file.write("\n")

# Change back to parent directory.
os.chdir("/content/")

### Download Scraped Data Folder
Execute this block as needed.

If we are running this notebook for the first time, lost previously scraped data, or changed any of the code above that affects the data we are using to train our model, we will run the following block. Otherwise, it is more efficient to load previously scraped data than to scrape new data every time we open this notebook.

In [None]:
# Change back to parent directory.
os.chdir("/content/")

In [None]:
from google.colab import drive
from google.colab import files

# Mount Google Drive
drive.mount('/content/drive')

# Create zip file
!zip -r Data.zip Data/

# Download folder
files.download('Data.zip')

## Initialize the ElasticsearchDocumentStore

A DocumentStore stores the documents that the question-answering system uses to find answers to questions. Here, we're using the [`ElasticsearchDocumentStore`](https://docs.haystack.deepset.ai/reference/document-store-api#module-elasticsearch), which connects to a running Elasticsearch service. It's a fast and scalable text-focused storage option. This service runs independently from Haystack and persists even after the Haystack program has finished running. To learn more about the DocumentStore and the different types of external databases that Haystack supports, see [DocumentStore](https://docs.haystack.deepset.ai/docs/document_store).

As an aside, Elasticsearch is an open-source, distributed search and analytics engine designed for scalability, real-time searching, and data analysis. Among other things, Elasticsearch can index and search large volumes of text data quickly and efficiently.

1. Download, extract, and set the permissions for the Elasticsearch installation image:

In [None]:
%%bash

# Use `wget` utility to quietly (-q) download the Elasticsearch archive from
# this URL
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q

# After downloading, use `tar` to extract contents (-x extract, -z archive is
# compressed with gzip, -f specifies file name)
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz

# Change ownership and group ownership (`chown`) of extracted files and
# directories to `daemon`, a non-privileged user and group name for running
# Elasticsearch in a secure manner.
chown -R daemon:daemon elasticsearch-7.9.2

2. Start the server:

In [None]:
%%bash --bg

# Start Elasticsearch server as the `daemon` user.
sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

If Docker is available in your environment (Colab notebooks do not support Docker), you can also start Elasticsearch using Docker. You can do this manually or with Haystack's [`launch_es()`](https://docs.haystack.deepset.ai/reference/utils-api#module-doc_store) utility function.

In [None]:
# from haystack.utils import launch_es

# launch_es()

3. Wait 30 seconds for the server to fully start up:

In [None]:
import time

time.sleep(30)
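
A fixed 30-second sleep may be longer than necessary on a fast machine and too short on a slow one. One alternative is to poll the server until it responds. The sketch below assumes Elasticsearch is listening on its default HTTP port (`http://localhost:9200`); the helper names `wait_until_ready` and `elasticsearch_is_up` are hypothetical and not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=60.0, interval=2.0):
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def elasticsearch_is_up(url="http://localhost:9200"):
    """Return True if an HTTP GET to the (assumed) Elasticsearch URL succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

In the notebook, a call such as `wait_until_ready(elasticsearch_is_up, timeout=120)` could stand in for the fixed sleep, with the return value checked before the DocumentStore is initialized.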

4. Initialize the ElasticsearchDocumentStore:


In [None]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

# Configure ElasticsearchDocumentStore for accessing and storing documents.
document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).


ElasticsearchDocumentStore is up and running and ready to store the Documents.

## Upload Files


#### **Option 1:** Upload Previously Scraped Data

Step 1.

If we scraped content from the Quest KB websites and the Northwestern IT YouTube channel in a previous session, upload it here.

In [None]:
from google.colab import files

# Create new directory for scraped content (no error if it already exists).
os.makedirs("/content/Data", exist_ok=True)

# Switch to new directory.
os.chdir("/content/Data")

files.upload()

Saving Article  Finding the full path to yo.txt to Article  Finding the full path to yo.txt
Saving Article  Using VASP on Quest.txt to Article  Using VASP on Quest.txt
Saving Introduction to Quest–Remote - YouTube.txt to Introduction to Quest–Remote - YouTube.txt
Saving Article  Parabricks on the Genomics .txt to Article  Parabricks on the Genomics .txt
Saving Article  Genomic Data Commons Data T.txt to Article  Genomic Data Commons Data T.txt
Saving Bash Scripting Practice Part 1: General Considerations - YouTube.txt to Bash Scripting Practice Part 1: General Considerations - YouTube.txt
Saving Article  Creating an Amazon S3 bucket.txt to Article  Creating an Amazon S3 bucket.txt
Saving Introduction to the Text Editor: Vim - YouTube.txt to Introduction to the Text Editor: Vim - YouTube.txt
Saving Article  Managing Full Access Alloca.txt to Article  Managing Full Access Alloca.txt
Saving Introduction to Quest Part 5   Submitting Jobs - YouTube.txt to Introduction to Quest Part 5   Subm

{'Article  Finding the full path to yo.txt': b'Title: Article  Finding the full path to yo\nLink: https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet.aspx?ID=2026\n Article - Finding the full path to yo... \n Updating... Finding the full path to your Research Data Storage Service (RDSS) share This page describes how to format your RDSS/FSMResfiles filepath when requesting access to Globus To allow users to access RDSS via Globus, we first need to tell our data transfer node where to look for your files. To do this, we need the full path to your RDSS share. Step 1: Identify your access zone Your access zone was determined when the share was created. Two of these access zones are allowed to use Globus to transfer data to and from Quest. fsmresfiles.fsm.northwestern.edu is for Feinberg School of Medicine users resfiles.northwestern.edu is for other unaudited shares If you\'re not sure which group you fall into, look at the address that you use to connect to your RDSS share.

Step 2.

Set `meta_file` from meta_file.txt for use in the indexing pipeline.

In [None]:
meta_file_path = "/content/Data/meta_file.txt"

# Read meta_file.txt
with open(meta_file_path, "r") as file:
  meta_data = file.read()

# Split the input string into items based on the newline characters
meta_items = meta_data.strip().split('\n\n')

# Initialize a list to store dictionaries
meta_file = []

# Iterate through the items and create the list of dictionaries contained in
# `meta_file`.
for item in meta_items:
    item_lines = item.split('\n')
    item_dict = {}

    for line in item_lines:
        key, value = line.split(': ', 1)
        item_dict[key] = value

    meta_file.append(item_dict)


# Set directory where documents are located.
doc_dir = "/content/Data"

# Switch back to parent directory.
os.chdir("/content/")
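
The parser above depends on the exact `Title: ...` / `Link: ...` layout of `meta_file.txt`. If the metadata format is ever revisited, a JSON file is one alternative that avoids hand-written parsing. The sketch below is an illustration only: `save_meta`, `load_meta`, and the file name `meta_file.json` are assumptions, and the two records are copied from cell outputs shown in this notebook.

```python
import json
import os
import tempfile

meta_records = [
    {
        "Title": "Applying to Access Quest - YouTube",
        "Link": "https://www.youtube.com/watch?v=YhXATOQRISw",
    },
    {
        "Title": "Article  Finding the full path to yo",
        "Link": "https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet.aspx?ID=2026",
    },
]


def save_meta(records, path):
    """Write the list of metadata dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(records, file, ensure_ascii=False, indent=2)


def load_meta(path):
    """Read the list of metadata dictionaries back from a JSON file."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)


# Round trip through a temporary directory to show nothing is lost.
with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = os.path.join(tmp_dir, "meta_file.json")
    save_meta(meta_records, meta_path)
    restored = load_meta(meta_path)

print(restored == meta_records)
```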

#### **Option 2:**
Use the web-scraped content stored in `/content/Data`. Perform this step if we have scraped new data in this session. Otherwise, use Option 1.

In [None]:
doc_dir = "/content/Data"

#### **Option 3: Example Dataset from Haystack Tutorial**

Download 517 articles from the Game of Thrones Wikipedia. You can find them in *data/build_a_scalable_question_answering_system* as a set of *.txt* files.

In [None]:
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_a_scalable_question_answering_system"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip",
    output_dir=doc_dir,
)

INFO:haystack.utils.import_utils:Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip to 'data/build_a_scalable_question_answering_system'


True

## Index Documents with a Pipeline

Indexing pipelines prepare the files for search. The main objective here is to convert files (.txt, in our case) into Haystack Documents, so they can be saved in a DocumentStore. Our indexing pipeline will have three nodes:

1. `TextConverter`, which turns `.txt` files into Haystack `Document` objects and sends them to the `PreProcessor`.
2. `PreProcessor`, which cleans and splits the text within a `Document` and sends the results to the `DocumentStore`.
3. `DocumentStore`, the database that stores text and metadata and provides them to the Retriever at query time. Our `ElasticsearchDocumentStore` has already been initialized.

Once we combine these nodes into a pipeline, the pipeline will ingest `.txt` file paths, preprocess them, and write them into the DocumentStore.

Note: More nodes are available for our indexing pipeline as needed. For example, a `FileTypeClassifier` can be added as the first node to classify files into text, PDF, Markdown, docx, and HTML files and route them to the appropriate converter. Also, a `DocumentClassifier` could be used to attach a classification label to each Document's metadata (e.g., sentiment labels like "positive" or "negative").


#### **Option 1:**

Step 1.

Manually apply the converter(s) to each file. If we are only using one type of file (in this case, .txt), then we can specify the type of file converter we want to use.

Next, initialize the preprocessor with recommended values.

In [None]:
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor

indexing_pipeline = Pipeline() # Initialize the indexing pipeline
text_converter = TextConverter() # Reads text from .txt file
                                 # Sends to preprocessor
preprocessor = PreProcessor(
    clean_whitespace=True, # Remove whitespace at start/end of each line in text
    clean_header_footer=True, # Remove repeated header/footer
    clean_empty_lines=True, # Normalize 3+ empty lines to 2 empty lines
    split_by="word", # Unit to split document by
    split_length=100, # Max number of units per document (Recommended value)
    split_overlap=10, # Overlap between adjacent documents
    split_respect_sentence_boundary=True, # Doc boundaries preserve sentences
    max_chars_check = 1000 # Some docs have very long passages so we need to split them
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


To learn more about the parameters of the `PreProcessor`, see [Usage](https://docs.haystack.deepset.ai/docs/preprocessor#usage).

[Document splitting](https://docs.haystack.deepset.ai/docs/optimization#document-length) is important for your question answering system's performance. If you halve the length of your documents, you will halve the workload placed on your Reader. The recommended maximum document length depends on the type of Retriever used and typically falls between 100 and 500 words.

Our current pipeline uses a dense retriever, which has more restrictive guidelines for document length. We have to ensure that documents are not longer than the retriever's maximum input length (256 tokens). Decent performance has been reported with documents around 100 words long (see [Optimization - Document Length](https://docs.haystack.deepset.ai/docs/optimization#document-length) for more details).
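
To make the effect of `split_length=100` and `split_overlap=10` concrete, here is a minimal word-window sketch. It is not Haystack's `PreProcessor` implementation: it ignores sentence boundaries and cleaning, and the function name `split_by_words` is an assumption made for illustration.

```python
def split_by_words(text, split_length=100, split_overlap=10):
    """Split text into chunks of up to `split_length` words that share
    `split_overlap` words with the preceding chunk."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


# Synthetic 250-word text: w0 w1 ... w249
text = " ".join(f"w{i}" for i in range(250))
chunks = split_by_words(text)
print([len(chunk.split()) for chunk in chunks])
```

With these settings each new chunk starts 90 words after the previous one, so a 250-word text yields three chunks, and the last 10 words of one chunk reappear at the start of the next.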

Step 2.

Add the nodes to an indexing pipeline. For each node, provide the name(s) of the preceding node(s) as the `inputs` argument. Note that in an indexing pipeline, the input to the first node is `File`.

In [None]:
import os

indexing_pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])

Step 3.

Run the indexing pipeline to write the text data into the DocumentStore. We can add metadata to our files using the `meta` argument in the `indexing_pipeline.run_batch` command. For example, we can include the title and URL of a document and return them for additional context when the user asks a question.

Note that `meta_file` has to be a list of dictionaries, the same length as `files_to_index`. Also, we are alphabetically sorting these arguments because the entries in `meta` must be in the same order as `file_paths`.

In [None]:
# Specify the files we want to send to the DocumentStore.
files_to_index = [
    os.path.join(doc_dir, f)
    for f in os.listdir(doc_dir)
    if f != "meta_file.txt" # Don't include metadata file because we read that
]                           # separately.

# Run our indexing pipeline to convert files, preprocess, and store them.
indexing_pipeline.run_batch(file_paths=sorted(files_to_index),
                            meta=sorted(meta_file, key=lambda x: x['Title']))

# Earlier versions used a single .txt document containing all scraped content.
#files_to_index = ["quest_webpage_content.txt"]
#indexing_pipeline.run_batch(file_paths=files_to_index)

INFO:haystack.pipelines.base:It seems that an indexing Pipeline is run, so using the nodes' run method instead of run_batch.
DEBUG:haystack.pipelines.base:Running node 'File` with input: {'root_node': 'File', 'params': None, 'file_paths': ['/content/Data/Applying to Access Quest - YouTube.txt', '/content/Data/Article  Advanced Globus Features.txt', '/content/Data/Article  AlphaFold on Quest.txt', '/content/Data/Article  Amazon Web Services.txt', '/content/Data/Article  Checking Processor and Memo.txt', '/content/Data/Article  Compiling Code on Quest.txt', '/content/Data/Article  Connecting to Quest with FastX.txt', '/content/Data/Article  Connecting to a Research Da.txt', '/content/Data/Article  Creating an Amazon S3 bucket.txt', '/content/Data/Article  Debugging your Slurm submis.txt', '/content/Data/Article  Determine the SMB\xa0version y.txt', '/content/Data/Article  Disconnecting from a Resear.txt', '/content/Data/Article  Everything You Need to Know.txt', '/content/Data/Article  E

{'documents': [<Document: {'content': 'Title: Applying to Access Quest - YouTube\nLink: https://www.youtube.com/watch?v=YhXATOQRISw\nhi everyone my name is Alexis Porter Im with  research Computing Services and today were going   to focus on applying to allocations to gain access  to Quest before you can log into quest for the   first time youll need to have an account there  are several options to choose from when creating   an account so lets go through these options  together before we jump into more detail this   video is going to cover several different sections  from joining an existing allocation to General   access allocations buy-in allocations and workshop  where classroom allocations and its important to   note that allocations on quests are not backed up  regardless of what kind of allocation you choose so well start with the first option join  an existing allocation this is a common   procedure if youre joining a lab or  working on a collaborative project to join an existi
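
The indexing step above relies on sorting file paths and metadata separately and assumes the two orders agree. They can disagree when one title is a prefix of another, because a space sorts before the `.` of `.txt`. A sketch of a more explicit alignment follows; it assumes each file name (without `.txt`) equals the `Title` in its metadata entry, and `align_meta` plus the example titles and links are hypothetical.

```python
import os


def align_meta(file_paths, meta_records):
    """Return metadata entries in the same order as `file_paths`, matching
    each file's name (without extension) to the record's 'Title'."""
    by_title = {record["Title"]: record for record in meta_records}
    aligned = []
    for path in file_paths:
        stem = os.path.splitext(os.path.basename(path))[0]
        if stem not in by_title:
            raise KeyError(f"No metadata entry for {path}")
        aligned.append(by_title[stem])
    return aligned


# Made-up titles where one is a prefix of the other.
file_paths = sorted(["/content/Data/Quest.txt", "/content/Data/Quest Remote.txt"])
meta = [
    {"Title": "Quest", "Link": "https://example.com/quest"},
    {"Title": "Quest Remote", "Link": "https://example.com/remote"},
]

naive = sorted(meta, key=lambda record: record["Title"])
aligned = align_meta(file_paths, meta)

print([os.path.basename(path) for path in file_paths])
print([record["Title"] for record in naive])
print([record["Title"] for record in aligned])
```

In this example the sorted file paths begin with `Quest Remote.txt` while the title-sorted metadata begins with `Quest`, so pairing by position would attach the wrong link; matching by name keeps each file with its own entry.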

#### **Option 2:**
Haystack has a convenience function that will automatically apply the right converter to each file in a directory instead of having to specify a converter (i.e., for pdf, docx, txt). See [Better Retrieval via Embedding Retrieval](https://haystack.deepset.ai/tutorials/06_better_retrieval_via_embedding_retrieval) tutorial for an example.

In [None]:
from haystack.utils import convert_files_to_docs

all_docs = convert_files_to_docs(dir_path=doc_dir)

## Initializing the Retriever

Now that the Documents are in the DocumentStore, let's initialize the nodes we want to use in our query pipeline. First, a Retriever.

In a query pipeline, the Retriever takes a query as input and checks it against the documents contained in the DocumentStore. It scores each document for its relevance to the query and returns the top candidates (top-k documents) to the Reader. The Reader then performs the more complex task of extracting answers using a transformer-based language model.

Two (out of many) Retriever options are the **BM25Retriever** (no GPU needed) and an **EmbeddingRetriever** with Sentence Transformers models (recommended if we have a GPU available). The BM25Retriever is a *sparse* retriever while the EmbeddingRetriever is *dense*.

Sparse methods operate by looking for shared keywords between the document and the query. Dense approaches generally perform better than their sparse counterparts but are computationally more expensive. The models used by the EmbeddingRetriever are trained to embed similar sentences close to each other in a shared embedding space.

A good starting model for a dense Retriever is `multi-qa-mpnet-base-dot-v1`, as it was tuned for semantic search (i.e., given a query, it can find relevant passages) on a large and diverse set of question/answer pairs. The downside is that it is one of the larger models (420 MB); a smaller, similar option is `multi-qa-MiniLM-L6-cos-v1` (80 MB). Here we have to weigh model size against performance, as the smaller model generally performs worse.

For more Retriever options, see [Retriever](https://docs.haystack.deepset.ai/docs/retriever).

For model info, see [Pretrained Models](https://www.sbert.net/docs/pretrained_models.html#).


#### **Option 1: EmbeddingRetriever**

Let's use the `multi-qa-mpnet-base-dot-v1` model.

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store, embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1"
)
# Important:
# Now that we initialized the Retriever, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time consuming operation (depending on the corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing document embeddings, which is very fast.
document_store.update_embeddings(retriever)


INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1



INFO:haystack.nodes.retriever.dense:Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1


INFO:haystack.document_stores.search_engine:Updating embeddings for all 1692 docs ...

Updating embeddings: 10000 Docs [00:47, 211.13 Docs/s]


#### **Option 2: BM25Retriever**

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

## Route Documents

Now that the Retriever has been initialized, we can move on to specifying our approach to routing documents. We can use the EmbeddingRetriever to retrieve both texts and tables. To do question answering on these documents, we need to route the "text" documents to a FARMReader and the "table" documents to a TableReader. Then we need to join the answers coming from the two Readers into a single list of answers.

To read more about this process, see [Pipeline for QA on Combination of Text and Tables](https://haystack.deepset.ai/tutorials/15_tableqa) including how to evaluate the pipeline and how to add tables from PDFs.
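The route-then-join idea can be sketched in plain Python. This is only an illustration under stated assumptions, not Haystack's implementation: the documents and answers are hypothetical plain dicts, and `route_by_content_type` and `join_by_score` are made-up helper names.

```python
def route_by_content_type(documents):
    """Split documents into text and table groups by their content_type."""
    text_docs = [doc for doc in documents if doc["content_type"] == "text"]
    table_docs = [doc for doc in documents if doc["content_type"] == "table"]
    return text_docs, table_docs


def join_by_score(*answer_lists):
    """Merge several answer lists into one, highest score first."""
    merged = [answer for answers in answer_lists for answer in answers]
    return sorted(merged, key=lambda answer: answer["score"], reverse=True)


# Hypothetical retrieved documents (plain dicts, not Haystack Documents).
retrieved = [
    {"id": "d1", "content_type": "text"},
    {"id": "d2", "content_type": "table"},
    {"id": "d3", "content_type": "text"},
]
text_docs, table_docs = route_by_content_type(retrieved)

# Hypothetical answers, as if each reader had processed its own group.
text_answers = [{"answer": "from text", "score": 0.62}]
table_answers = [{"answer": "from table", "score": 0.81}]
all_answers = join_by_score(text_answers, table_answers)
```

Each reader only sees the documents it can handle, and the user still receives a single ranked list of answers.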

## Initializing the Reader

Our query pipeline also needs a Reader, so we'll initialize it next. A Reader scans the texts it received from the Retriever and extracts the top answer candidates. Readers are based on powerful deep learning models but are much slower than Retrievers at processing the same amount of text. This is due to the model complexity (e.g., number of parameters), but also the difficulty of the task. Readers must process the text within the selected documents to extract the answer to a question, which involves fine-grained language understanding and reasoning.

We'll use a FARMReader with a base-sized RoBERTa question answering model called [`deepset/roberta-base-squad2`](https://huggingface.co/deepset/roberta-base-squad2). It's a good all-round model to start with and has been trained on QA pairs, including unanswerable questions, for the task of question-answering.

See [Models](https://docs.haystack.deepset.ai/docs/reader#models) for more options.

In [None]:
from haystack.nodes import FARMReader, TableReader, RouteDocuments, JoinAnswers

text_reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", context_window_size=300, use_gpu=True)
table_reader = TableReader(model_name_or_path="deepset/tapas-large-nq-hn-reader")
route_documents = RouteDocuments()
join_answers = JoinAnswers()

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1



INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)



INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.
DEBUG:haystack.modeling.model.prediction_head:Prediction head initialized with size [768, 2]



INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1



## Creating the Retriever-Reader Pipeline

You can combine the Reader and Retriever in a querying pipeline using the `Pipeline` class. The combination of the two speeds up processing because the Reader only processes the Documents that it received from the Retriever.

To speed things up, Haystack comes with a few predefined pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer questions.

**Option 1: Manually Define a Pipeline**

*We'll use this option if we expect some answers to be contained within tables.*

Initialize the `Pipeline` object and add the Retriever, the document router, the two Readers, and the answer joiner as nodes. For each node, pass the name or names of the preceding nodes in the `inputs` argument. Note that in a querying pipeline, the input to the first node is `Query`.

In [None]:
from haystack import Pipeline

text_table_qa_pipeline = Pipeline()
text_table_qa_pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
text_table_qa_pipeline.add_node(component=route_documents, name="RouteDocuments", inputs=["EmbeddingRetriever"])
text_table_qa_pipeline.add_node(component=text_reader, name="TextReader", inputs=["RouteDocuments.output_1"])
text_table_qa_pipeline.add_node(component=table_reader, name="TableReader", inputs=["RouteDocuments.output_2"])
text_table_qa_pipeline.add_node(component=join_answers, name="JoinAnswers", inputs=["TextReader", "TableReader"])

**Option 2: Predefined Pipeline**

In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(text_reader, retriever)

That's it! The pipeline is ready to answer questions!

## Ask your Question

1. Use the pipeline's `run()` method to ask a question. The `query` argument is where you type your question. Additionally, you can set the number of results the Retriever and Reader return using the `top_k` parameter. The `top_k` value of each node is a trade-off between speed and accuracy: Retriever `top_k` dictates how many retrieved documents are passed on to the Reader, while Reader `top_k` determines how many answer candidates are shown. Haystack recommends a Retriever `top_k` of 10 for decent overall performance.

To learn more about setting arguments, see [Arguments](https://docs.haystack.deepset.ai/docs/pipelines#arguments).

To read more about the `top_k` parameter, see [Choosing the Right top-k Values](https://docs.haystack.deepset.ai/docs/optimization#choosing-the-right-top-k-values).
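The effect of the two values can be illustrated with a toy two-stage search in plain Python. This is a sketch with made-up scores and hypothetical helper names (`retrieve`, `read`); it does not call Haystack.

```python
def retrieve(retrieval_scores, top_k):
    """Return the ids of the top_k documents with the highest retrieval score."""
    ranked = sorted(retrieval_scores, key=retrieval_scores.get, reverse=True)
    return ranked[:top_k]


def read(candidate_answers, doc_ids, top_k):
    """Return the top_k answers, considering only the retrieved documents."""
    found = [a for a in candidate_answers if a["doc_id"] in doc_ids]
    found.sort(key=lambda a: a["score"], reverse=True)
    return found[:top_k]


# Made-up retrieval scores per document.
retrieval_scores = {"doc_a": 0.9, "doc_b": 0.7, "doc_c": 0.4, "doc_d": 0.1}

# Made-up answers the reader would extract from each document if it saw it.
candidate_answers = [
    {"doc_id": "doc_a", "answer": "answer A", "score": 0.3},
    {"doc_id": "doc_b", "answer": "answer B", "score": 0.6},
    {"doc_id": "doc_c", "answer": "answer C", "score": 0.8},
]

# A small retriever top_k is fast but can hide the best answer from the reader.
narrow = read(candidate_answers, retrieve(retrieval_scores, top_k=2), top_k=1)

# A larger retriever top_k costs more reader time but widens the search.
wide = read(candidate_answers, retrieve(retrieval_scores, top_k=3), top_k=1)
```

Here the narrow search returns "answer B" because the document holding the strongest answer ranked third at retrieval time, while the wider search reaches it and returns "answer C".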


In [None]:
question = "allocation"

# Wrap the pipeline call in try/except so that a failure falls back to the
# default answer instead of stopping the notebook.
try:
    prediction = text_table_qa_pipeline.run(
        query=question,
        params={
            "EmbeddingRetriever": {"top_k": 10},
            "TableReader": {"top_k": 2},
            "TextReader": {"top_k": 2},
        },
    )
except Exception as err:
    # Report the error rather than silently swallowing it.
    print(f"Pipeline run failed: {err}")
    prediction = []  # If we run into an error, return an empty list.

DEBUG:haystack.pipelines.base:Running node 'Query` with input: {'root_node': 'Query', 'params': {'EmbeddingRetriever': {'top_k': 10}, 'TableReader': {'top_k': 2}, 'TextReader': {'top_k': 2}}, 'query': 'allocation', 'node_id': 'Query'}
DEBUG:haystack.pipelines.base:Running node 'EmbeddingRetriever` with input: {'root_node': 'Query', 'params': {'EmbeddingRetriever': {'top_k': 10}, 'TableReader': {'top_k': 2}, 'TextReader': {'top_k': 2}}, 'query': 'allocation', 'node_id': 'EmbeddingRetriever'}



DEBUG:haystack.nodes.retriever.base:Retrieved documents with IDs: ['208d8a0e4320dc8d99761f76e057334d', '3bdb8e9924653dea647495f24ddd5a7a', '4cc01300cdbc06da702b157e332748c5', '9081dc0c8182c3bd61510640bf1e99f6', 'b4fe11e2d205f265853ee74d9ed8927c', '5ff97ea9aa436a648df2019be7e5e4fe', 'b7d4e77b029d7292f75ab7e5a3e12329', 'fcd2faba1beed3c7979b1350e34ceb27', '8c05280cfe0ce2247b0a45fa726a8ca0', 'd78ebc2992241df881214985da5109c0']
DEBUG:haystack.pipelines.base:Running node 'RouteDocuments` with input: {'documents': [<Document: {'content': 'An allocation is a group of people typically who are sharing some resources. So theyre sharing access to time on the compute nodes - so how much time or what priority jobs are when they land on the compute nodes - and this group also an allocation will share storage related to a project or an allocation. So storage related to allocations is always in slash projects slash the allocation name. This is not backed up. ', 'content_type': 'text', 'score': 0.565722

## Filter by Score Threshold

First, check whether the prediction pipeline returned a result (of type `dict`). If it did, use a score threshold to filter the answers so that only answers scoring above the threshold are returned. If no answer meets the threshold, return a default answer instead.

If the prediction pipeline ran into an error and returned an empty list, use the default answer.

In [None]:
# Fallback shown when the pipeline failed or no answer clears the threshold.
default_answer = {"answer": "Sorry, I don't have an answer for that. Try asking your question in a different way or send an email stating your issue to quest-help@northwestern.edu and the Northwestern IT Research Computing Services team will assist you with your issue", "score": 0.0}
score_threshold = 0.1

if isinstance(prediction, dict):
    # Keep only the answers whose score exceeds the threshold.
    filtered_documents = [ans for ans in prediction['answers'] if ans.score > score_threshold]
else:
    # The pipeline raised an error and returned an empty list.
    filtered_documents = []

if not filtered_documents:
    filtered_documents = [default_answer]

2. Print out the answers the pipeline returns:

In [None]:
from haystack.schema import Answer

# Set a hyperlink format for answers.
hyperlink_format = '<a href="{link}">{text}</a>'

# Check whether each item in filtered_documents is a Haystack Answer object.
# If so, print the answer details. If not, print the default answer.
for answer in filtered_documents:
    if isinstance(answer, Answer):
        print('The suggested answer is:',
              '"',
              answer.answer,
              '"',
              'with {} percent probability.'.format(round((answer.score)*100)),
              '\n\n',
              'See here for more information related to this answer: ',
              hyperlink_format.format(link = answer.meta['Link'], text = answer.meta['Title']),
              '\n\n',
              'Context for this answer: ',
              answer.context,
              '\n\n',
              'Document ID: ',
              answer.document_ids,
              '\n\n')

    else:
        print(answer['answer'])

#print(filtered_documents) # Print all filtered answers and related info

Sorry, I don't have an answer for that. Try asking your question in a different way or send an email stating your issue to quest-help@northwestern.edu and the Northwestern IT Research Computing Services team will assist you with your issue


## Improvements/Extras

Improve the performance of the Reader by [fine-tuning](https://haystack.deepset.ai/tutorials/02_finetune_a_model_on_your_data) it on your own data.

In [None]:
# Count of documents in the DocumentStore
document_store.describe_documents()

{'count': 1920,
 'chars_mean': 627.371875,
 'chars_max': 1000,
 'chars_min': 1,
 'chars_median': 603.0}

In [None]:
blah = document_store.get_all_documents()

In [None]:
# Save document_store
with open("document_store.txt", "w", encoding="utf-8") as file:
    for document in blah:
        file.write(f"{document}\n")
        file.write("\n")

In [None]:
filtered_documents

[<Answer {'answer': 'high performance compute cluster', 'type': 'extractive', 'score': 0.8137719631195068, 'context': ' you through the details of how to set up your job submission so lets get right into the first section what is Quest is northwesterns high performance compute cluster or hvc now this might sound like different to some people so I wanted to break this down for you Quest is called hype of Mormons beca', 'offsets_in_document': [{'start': 887, 'end': 919}], 'offsets_in_context': [{'start': 134, 'end': 166}], 'document_ids': ['1fac8a7369df7dd130292bbc950b0994'], 'meta': {'_split_id': 0, '_split_overlap': [], 'Title': 'Introduction to Quest Part 1   What is Quest - YouTube', 'Link': 'https://www.youtube.com/watch?v=rGWOoR9ASBY'}}>,
 <Answer {'answer': "Northwestern's High Performance Computing (HPC) cluster", 'type': 'extractive', 'score': 0.681267499923706, 'context': "st Analytics Nodes or KLC, submit new jobs, run jobs, or access files stored on Quest in any way including

In [None]:
# Print the text from documents used in an answer
docs = document_store.get_documents_by_id("5d631f6d428816e7b8c3fd7d59cd8af4")
docs[0].content

'Was this helpful?                  Thank you. Your feedback has been recorded.                  0 reviews        CommentsDo not fill this field out. It is used to deter robots.SubmitCancel FeedbackBlankBlankBlankDetailsArticle ID:       1669            Created      Thu 5/12/22 12:39 PM        Modified      Thu 9/7/23 11:36 AMDeleting...×Share        Recipient(s) - separate email addresses with a commaMessagePress Alt + 0 within the editor to access accessibility instructions, or press Alt + F10 to access the menu.Check out this article I found in the Client Portal knowledge base.<br /><br /><a href="https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1669">https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1669</a><br /><br />Getting Started on the Genomics Compute Cluster (b1042) on QuestSendClose          Powered by TeamDynamix | Site Map'

In [None]:
# Initialize variables to track the longest string and its length
longest_string = ""
longest_length = 0
item_num = 0

# Iterate through the list of dictionaries
for i,item in enumerate(blah):
  content = item.content
  if len(content) > longest_length:
    longest_string = content
    longest_length = len(content)
    item_num = i

# Print the longest string and its length
print("Longest String:", longest_string)
print("Length:", longest_length)
print("Location:", item_num)

Longest String: Title: Applying to Access Quest - YouTube
Link: https://www.youtube.com/watch?v=YhXATOQRISw
hi everyone my name is Alexis Porter Im with  research Computing Services and today were going   to focus on applying to allocations to gain access  to Quest before you can log into quest for the   first time youll need to have an account there  are several options to choose from when creating   an account so lets go through these options  together before we jump into more detail this   video is going to cover several different sections  from joining an existing allocation to General   access allocations buy-in allocations and workshop  where classroom allocations and its important to   note that allocations on quests are not backed up  regardless of what kind of allocation you choose so well start with the first option join  an existing allocation this is a common   procedure if youre joining a lab or  working on a collaborative project to join an existing allocation youll need 

In [None]:
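def short_answers(answers):
    # Print a one-line summary for each Answer in a pipeline result dict.
    # Answer objects expose their text and score as attributes.
    for i, answer in enumerate(answers['answers']):
        print('Answer no. {}: {}, with {} percent probability'.format(i, answer.answer, round(answer.score * 100)))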

In [None]:
import re

page_content = ["This is  some  text.     ", "###    ####", "#######   #####", "More text.", "", "Even more text.", "   "]

# Define a regular expression pattern to match consecutive '#' characters
hash_pattern = re.compile(r'#+')

# Define a regular expression pattern to match consecutive whitespace characters
space_pattern = re.compile(r'\s+')

# Collapse each run of whitespace into a single space.
page_content = [space_pattern.sub(' ', text) for text in page_content]

# Collapse each run of '#' characters into a single '#'.
page_content = [hash_pattern.sub('#', text) for text in page_content]
page_content

['This is some text. ', '# #', '# #', 'More text.', '', 'Even more text.', ' ']