# Indexing Data

Before we can perform question-answering (QA) operations, we must initialize a document store, which holds all of our source text and lets us ask questions about it. The process of initializing the document store and adding data to it is called *indexing*.

## Load Data

Before we begin indexing, we must load and prepare our dataset. We are using an NHS dataset from Deepset; the same files can be downloaded from the demo repository. TK link

We load them from the `../data` directory like so:

In [None]:
from pathlib import Path

paths = [str(x) for x in Path('../data').glob('*.txt')]
paths[:5]

These are all plaintext files, each containing NHS information on a particular condition such as *pre-eclampsia* or *acute-lymphoblastic-leukaemia*. Let's see how many of these files we have:

In [None]:
len(paths)

We will load them all into a list called `nhs_text`.

In [None]:
nhs_text = []
for path in paths:
    with open(path, 'r') as fp:
        text = fp.read()
        nhs_text.append(text)
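
As a minimal, self-contained sketch of the same loading pattern, the snippet below runs against a temporary directory holding two made-up files (the file names and contents are invented for illustration). It differs from the cell above in two ways, both assumptions on our part rather than something the dataset requires: it sorts the paths so that positions such as the first page are reproducible, and it reads the files as UTF-8 explicitly.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    # two made-up files standing in for the NHS pages
    (data_dir / "b_condition.txt").write_text("Second page", encoding="utf-8")
    (data_dir / "a_condition.txt").write_text("First page", encoding="utf-8")

    # glob() gives no ordering guarantee, so sort for a reproducible order
    demo_paths = sorted(data_dir.glob("*.txt"))
    demo_pages = [p.read_text(encoding="utf-8") for p in demo_paths]

print(demo_pages)
```

The names `demo_paths` and `demo_pages` are chosen so that running the sketch does not overwrite `paths` or `nhs_text`.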

Let's see what they look like:

In [None]:
nhs_text[0]

In [None]:
nhs_text[1]

Clearly there is some data cleanup required to remove the generic text found on each page.

In [None]:
nhs_text[0][:623]

In [None]:
nhs_text[2][:623]

In [None]:
nhs_text[5][:623]

In [None]:
import re

header = re.compile(r'(?s)Cookies.*\nHome Health A to Z\n')

def clean(text):
    text = header.sub('', text)
    text = re.sub(r'\n+', '\n', text)
    text = text.split("Page last reviewed:")[0]
    return text

In [None]:
clean(nhs_text[0])

In [None]:
clean(nhs_text[100])
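
To see each cleaning step in isolation, here is a small sketch that applies the same three operations to an invented sample page. The sample text is made up for illustration and is not taken from the dataset; `demo_header` and `demo_clean` simply mirror the regex and function defined above under different names.

```python
import re

# same pattern as the header regex above, under a different name
demo_header = re.compile(r'(?s)Cookies.*\nHome Health A to Z\n')

def demo_clean(text):
    text = demo_header.sub('', text)             # drop the site-wide banner
    text = re.sub(r'\n+', '\n', text)            # collapse runs of newlines
    return text.split("Page last reviewed:")[0]  # drop the page footer

# invented sample page, not taken from the dataset
sample = (
    "Cookies on the NHS website\n"
    "Some banner text\n"
    "Home Health A to Z\n"
    "Example condition\n\n\n"
    "An example sentence about the condition.\n"
    "Page last reviewed: (date)\n"
    "Footer text\n"
)

cleaned = demo_clean(sample)
print(repr(cleaned))
```

Only the title line and the body sentence survive: everything from `Cookies` up to the navigation line is removed, the blank lines collapse to single newlines, and the text from `Page last reviewed:` onwards is cut off.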

This cleaning function seems to work well enough, so we can apply it to all of our data to get a cleaner set of pages.

In [None]:
nhs_text = [clean(text) for text in nhs_text]

Our embedding model expects no more than 128 *tokens* of text, which translates to roughly 400-600 characters. We will therefore split the text into chunks of ~500 characters, with a few extra conditions:

* We split on newline characters `\n`
* We leave an overlap of one sentence between each chunk (to avoid missing relevant relationships between sentences).

In [None]:
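chunked = []
chunk = 500  # target chunk length in characters

for i, page in enumerate(nhs_text):
    # derive the page URL from the file path
    url = paths[i][8:-5].replace('_', '/')
    lines = page.split("\n")
    context = ""
    for j, line in enumerate(lines):
        if j != 0 and len(context) == 0:
            # starting a new chunk: repeat the previous line as overlap
            context += lines[j-1] + " "
        context += line + " "
        if len(context) >= chunk:
            chunked.append({
                "text": context,
                "url": url
            })
            context = ""
    if context.strip():
        # keep trailing text that never reached the chunk length
        chunked.append({
            "text": context,
            "url": url
        })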

---

*To keep things simpler, we can skip both the sentence overlap and the newline-based splitting, and cut every 500 characters instead:*

```python
chunked = []
chunk = 500

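for i, page in enumerate(nhs_text):
    url = paths[i][8:]
    # step through the page in fixed-size windows
    # (`start` avoids shadowing the outer loop variable `i`)
    for start in range(0, len(page), chunk):
        end = min(start + chunk, len(page))
        chunked.append({
            "text": page[start:end],
            "url": url
        })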
```

---

In [None]:
chunked[0]

In [None]:
chunked[1]

Here we can see the sentence overlap `"If you're pregnant, hospitals and clinics are making sure it's safe for you to go to appointments."` shared between the two chunks. We also split chunks between sentences rather than mid-word (which would be the case if cutting on every 500 characters without consideration of newlines).
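
The overlap is easier to see on a toy input. The sketch below wraps the newline-based chunking logic in a hypothetical helper, `chunk_lines` (a name introduced here for illustration), and runs it on a made-up four-line page with a very small limit. The sketch keeps any trailing text shorter than the limit as a final chunk.

```python
def chunk_lines(page, limit=500):
    """Split `page` on newlines, closing a chunk once it reaches `limit`
    characters and repeating the previous line at the start of the next."""
    lines = page.split("\n")
    chunks = []
    context = ""
    for j, line in enumerate(lines):
        if j != 0 and not context:
            # new chunk: start with the previous line as overlap
            context += lines[j - 1] + " "
        context += line + " "
        if len(context) >= limit:
            chunks.append(context)
            context = ""
    if context.strip():
        # keep trailing text that never reached the limit
        chunks.append(context)
    return chunks

# made-up page, with a tiny limit so the overlap is easy to see
toy_page = "Line one.\nLine two.\nLine three.\nLine four."
toy_chunks = chunk_lines(toy_page, limit=15)
for c in toy_chunks:
    print(repr(c))
```

Each chunk after the first begins with the last line of the chunk before it. Note that the limit is a threshold rather than a cap: a chunk is closed once it reaches the limit, so chunks can run somewhat longer than `limit` characters.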

## Indexing the Data

With our data prepared we can begin indexing it. First, we initialize the document store.

In [None]:
from haystack.document_stores import PineconeDocumentStore

API_KEY = "<<YOUR_API_KEY>>"
INDEX_NAME = "haystack-nhs-jul"

document_store = PineconeDocumentStore(
    api_key=API_KEY,
    index=INDEX_NAME,
    similarity="dot_product",
    embedding_dim=768,
    metadata_config={"indexed": ["url"]}
)

And now we create our documents and add them to the document store.

In [None]:
from haystack import Document
from tqdm.auto import tqdm  # progress bar

batch_size = 256

counter = 0
docs = []

for d in tqdm(chunked):
    # create haystack document object with text content and doc metadata
    doc = Document(
        content=d["text"],
        meta={
            "url": d["url"]
        }
    )
    docs.append(doc)
    counter += 1
    if counter % batch_size == 0:
        # write to the document store every time batch_size docs are collected
        document_store.write_documents(docs)
        docs.clear()

if len(docs) > 0:
    document_store.write_documents(docs)
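
The batching pattern can be tried without a Pinecone account. In the sketch below, `FakeStore` is a hypothetical in-memory stand-in that only records the size of each batch it receives, and `write_in_batches` is an illustrative helper; neither is part of Haystack.

```python
class FakeStore:
    """Hypothetical in-memory stand-in for a document store."""

    def __init__(self):
        self.write_sizes = []

    def write_documents(self, batch):
        # record how many items arrived in this call
        self.write_sizes.append(len(batch))


def write_in_batches(items, store, batch_size):
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            store.write_documents(batch)
            batch = []  # start a fresh list for the next batch
    if batch:
        # flush the final, partially filled batch
        store.write_documents(batch)


store = FakeStore()
write_in_batches(range(10), store, batch_size=4)
print(store.write_sizes)
```

Ten items with a batch size of four produce two full writes and one final write of two items, which is why the last flush after the loop matters.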

Although we have added the documents to our document store, they do not include *embeddings*, which are required for us to perform a vector search. To create these embeddings we need to initialize a retriever model.

In [None]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model='sentence-transformers/multi-qa-mpnet-base-dot-v1',
    model_format="sentence_transformers"
)

We use this retriever model to create the embeddings by calling the `update_embeddings` method.

In [None]:
document_store.update_embeddings(
    retriever,
    batch_size=16
)
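
To make the role of the embeddings concrete, here is a toy sketch of dot-product retrieval, the similarity measure configured on the document store above. It uses hand-written three-dimensional vectors and made-up chunk labels rather than the 768-dimensional output of the real model, and it illustrates only the ranking idea, not how Pinecone implements search internally.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# hand-written toy vectors; real embeddings here have 768 dimensions
toy_index = {
    "chunk about symptoms": [0.9, 0.1, 0.0],
    "chunk about treatment": [0.1, 0.8, 0.2],
    "chunk about causes": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.1]

# rank stored chunks by similarity to the query, highest score first
ranked = sorted(
    toy_index,
    key=lambda name: dot(toy_query, toy_index[name]),
    reverse=True,
)
print(ranked[:2])
```

The query vector points mostly in the same direction as the first stored vector, so that chunk scores highest. Without stored embeddings there is nothing to compare the query vector against, which is why the update step is needed before searching.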

We can check that our documents have been added using a simple `DocumentSearchPipeline`.

In [None]:
from haystack.pipelines import DocumentSearchPipeline
from haystack.utils import print_documents

search_pipe = DocumentSearchPipeline(retriever)
result = search_pipe.run(
    query="Who is affected by pre-eclampsia?",
    params={"Retriever": {"top_k": 2}}
)

print_documents(result)

This all looks good and we're returning relevant information for our query.

This is just the indexing step, i.e. adding the documents and their embeddings to our document store. In the following notebook, *01_test_pipeline.ipynb*, we will initialize and test a full *extractive QA* pipeline.

---