# Similarity Research

In this notebook, an experiment is conducted in an attempt to find similarities between segments.

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

You can double check whether the GPU runtime is enabled with the following command:

In [None]:
%%bash

nvidia-smi

To start, install Haystack from the `main` branch of its GitHub repository with `pip`:

In [None]:
%%bash

pip install --upgrade pip
pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

## Logging

In [None]:
import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

## Document Store

In [None]:
from haystack.utils import launch_es

launch_es()

### Start an Elasticsearch server in Colab

If Docker is not readily available in your environment (e.g. in Colab notebooks), you can manually download the Elasticsearch release archive and run it directly.

In [None]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch

### Create the Document Store

In [None]:
import time

# Give the Elasticsearch server started above some time to come up
time.sleep(30)
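A fixed pause can be too short on a slow machine and needlessly long on a fast one. As an alternative, here is a minimal sketch that polls the server until it responds. The helper name `wait_for_http` is hypothetical (it is not part of Haystack), and the URL in the example call assumes that Elasticsearch listens on its default port 9200.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval)
    return False


# Example call (assumption: Elasticsearch runs locally on port 9200):
# wait_for_http("http://localhost:9200", timeout=60)
```

The function returns `False` instead of raising when the deadline passes, so the calling cell can decide whether to stop or retry.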

Finally, we create the Document Store instance:

In [None]:
import os
from haystack.document_stores import ElasticsearchDocumentStore

# Get the host where Elasticsearch is running, default to localhost
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")
document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

## Preprocessing of documents

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
import pandas as pd
file_path = "drive/MyDrive/Colab Notebooks/data/segemnts.csv"
df = pd.read_csv(file_path)

# cleanup
df.fillna(value="", inplace=True)
df["text"] = df["text"].apply(lambda x: x.strip())
df = df.rename(columns={"text": "content"})
print(df.head())
print(df.count())

In [None]:
print(df.head())

In [None]:
docs = df.to_dict(orient="records")

from pprint import pprint
# Let's have a look at the first 3 entries:
pprint(docs[:3])
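Haystack 1.x documents are commonly written as dictionaries with a `content` key and an optional `meta` dictionary. The following sketch shows that shape using only the standard library. The CSV columns `id` and `title` are hypothetical (the real file may contain different columns), and `rows_to_docs` is an illustrative helper, not a Haystack function.

```python
import csv
import io

# Hypothetical CSV content; the column names in the real file may differ.
raw_csv = (
    "id,title,text\n"
    "1,Intro,  Artificial intelligence is a broad field.  \n"
    "2,,Search engines rank documents.\n"
)


def rows_to_docs(csv_text: str) -> list:
    """Turn CSV rows into dictionaries with a `content` key and a `meta` dict."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Missing values become empty strings, mirroring `fillna(value="")`.
        cleaned = {key: (value or "") for key, value in row.items()}
        # Strip surrounding whitespace from the text and store it as `content`.
        content = cleaned.pop("text").strip()
        docs.append({"content": content, "meta": cleaned})
    return docs


docs_sketch = rows_to_docs(raw_csv)
print(docs_sketch[0])
```

Keeping the remaining columns under `meta` makes them available later, for example for filtering or for tracing a result back to its source row.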

In [None]:

document_store.write_documents(docs)

## Initialize Retriever, Reader & Pipeline

### Retriever

In [None]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)
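To build some intuition for what a BM25-based retriever ranks on, here is a minimal, self-contained sketch of a basic BM25 score. It is an illustration only, not Elasticsearch's implementation: the whitespace tokenizer, the toy corpus, and the function name `bm25_scores` are assumptions made for this example. The values `k1=1.2` and `b=0.75` are the usual defaults.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list, k1: float = 1.2, b: float = 0.75) -> list:
    """Score each document in `corpus` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through `b`.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "artificial intelligence studies intelligent agents",
    "the sun rises in the east",
    "machine learning is a branch of artificial intelligence research",
]
scores = bm25_scores("artificial intelligence", corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Both matching documents contain each query term once, so the shorter one ranks first because of the length normalization.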

In [None]:
# Alternative: an in-memory TfidfRetriever based on Pandas dataframes, useful for quick prototypes with an SQLite document store.

# from haystack.nodes import TfidfRetriever
# retriever = TfidfRetriever(document_store=document_store)

### Reader

#### FARMReader

In [None]:
from haystack.nodes import FARMReader

# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

#### TransformersReader

Alternative:

In [None]:
from haystack.nodes import TransformersReader
# reader = TransformersReader(model_name_or_path="distilbert-base-uncased-distilled-squad", tokenizer="distilbert-base-uncased", use_gpu=-1)

### Pipeline


In [None]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Voilà! Ask a question!

In [None]:
# You can configure how many candidates the Reader and Retriever shall return.
# The higher the Retriever's top_k, the better (but also the slower) your answers.
prediction = pipe.run(
    query="artificial intelligence", params={"Retriever": {"top_k": 100}, "Reader": {"top_k": 10}}
)

In [None]:
from haystack.utils import print_answers

# `details` can be `minimum`, `medium`, or `all`; `all` prints the highest level of detail
print_answers(prediction, details="all")

In [None]:
from pprint import pprint

pprint(prediction)

## Similar Documents

In [None]:
from haystack.nodes import DensePassageRetriever, JoinDocuments
from haystack.pipelines import Pipeline
dpr = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
)

# Compute an embedding for every stored document and save it in the document store
document_store.update_embeddings(dpr)

In [None]:
from haystack.pipelines import MostSimilarDocumentsPipeline
mspipe = MostSimilarDocumentsPipeline(document_store=document_store)
results = mspipe.run(document_ids=["162c7a1bf9dfeb9a306933936249c71d"])
pprint(results)
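Conceptually, finding the most similar documents means comparing the stored embedding of one segment with the embeddings of all other segments. The sketch below shows this with toy three-dimensional vectors and a dot-product similarity. The segment ids, the vectors, and the helper `most_similar` are made up for illustration, and leaving the query segment out of its own results is a choice made here, not a statement about how the pipeline behaves.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_similar(doc_id, embeddings, top_k=2):
    """Rank all other segments by dot-product similarity to `doc_id`."""
    query_vec = embeddings[doc_id]
    scored = [
        (other_id, dot(query_vec, vec))
        for other_id, vec in embeddings.items()
        if other_id != doc_id  # skip the query segment itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (hypothetical); real DPR embeddings have far more dimensions.
toy_embeddings = {
    "seg-a": [0.9, 0.1, 0.0],
    "seg-b": [0.8, 0.2, 0.1],
    "seg-c": [0.0, 0.1, 0.9],
    "seg-d": [0.1, 0.9, 0.1],
}
neighbours = most_similar("seg-a", toy_embeddings)
print(neighbours)
```

With a dot product, vectors that point in a similar direction and have larger magnitudes score higher, which is why `seg-b` ends up closest to `seg-a` in this toy data.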

In [None]:
join_node = JoinDocuments(join_mode="merge")
p = Pipeline()
p.add_node(component=dpr, name="R2", inputs=["Query"])
# Join the outputs of the two retrievers (BM25 and DPR), not the full QA pipeline
p.add_node(component=retriever, name="R1", inputs=["Query"])
p.add_node(component=join_node, name="Join", inputs=["R1", "R2"])
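To make the idea of merging two result lists concrete, here is a small sketch that combines per-retriever scores with a weighted sum per document id. This assumes that a merge-style join works roughly like this; the actual `JoinDocuments` node may weight or normalize scores differently. The helper `merge_scores`, the segment ids, and the scores are hypothetical.

```python
from collections import defaultdict


def merge_scores(result_lists, weights=None):
    """Combine several (doc_id, score) lists into one ranking by weighted sum."""
    if weights is None:
        # Assumption: every retriever contributes equally.
        weights = [1.0 / len(result_lists)] * len(result_lists)
    merged = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for doc_id, score in results:
            merged[doc_id] += weight * score
    return sorted(merged.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical results from a sparse and a dense retriever.
bm25_results = [("seg-a", 0.9), ("seg-b", 0.4)]
dense_results = [("seg-b", 0.8), ("seg-c", 0.6)]
merged = merge_scores([bm25_results, dense_results])
print(merged)
```

A segment found by both retrievers (`seg-b`) can overtake one that only a single retriever ranked highly. Note that raw BM25 and dense scores usually live on different scales, so in practice some normalization is typically needed before summing them.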

In [None]:
query = "Where does the sun rise?"
results = p.run(query=query)