#### [LangChain Handbook](https://qdrant.tech/articles/langchain-integration/)

# LangChain Retrieval Agent

`Conversational agents` can be very accurate, but they struggle with data freshness, access to internal documentation, and knowledge of specific domains. `Retrieval augmentation` addresses these issues, but retrieving on every query is inefficient for the many simple cases where no retrieval is needed. Combining the two gives us a system that can answer simple questions directly and look up extra knowledge when asked more complex questions. In this notebook we will see how to do this with LangChain and Qdrant, using Falcon-7B-Instruct as our LLM. Falcon-7B-Instruct is a ready-to-use chat/instruct model based on Falcon-7B, which, according to its model card, outperforms comparable open-source models.


## Install Dependencies
Let's get started by installing the packages needed to run this notebook:

In [1]:
!pip install -qU qdrant-client==1.3.1 langchain==0.0.235 datasets==2.13.1 sentence_transformers==2.2.2

## Import libraries

In [2]:
from datasets import load_dataset
import qdrant_client
import os
import torch
from pathlib import Path
from tqdm.auto import tqdm
from qdrant_client import QdrantClient
from qdrant_client.http import models
from langchain.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from sentence_transformers import SentenceTransformer
from langchain import HuggingFaceHub


## Building the Knowledge Base

Our knowledge base is built from the Hugging Face dataset `vietgpt/multi_news_en`, which consists of about 45k news articles paired with human-written summaries.

In [3]:
data = load_dataset("vietgpt/multi_news_en", split="train")
data

Found cached dataset parquet (C:/Users/karti/.cache/huggingface/datasets/vietgpt___parquet/vietgpt--multi_news_en-4921e62a5a375465/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)


Dataset({
    features: ['document', 'summary'],
    num_rows: 44972
})

We convert the dataset into a pandas dataframe for further use:

In [4]:
data = data.to_pandas()
data.head()

| | document | summary |
|---|---|---|
| 0 | National Archives \n \n Yes, it’s that time ag... | – The unemployment rate dropped to 8.2% last m... |
| 1 | LOS ANGELES (AP) — In her first interview sinc... | – Shelly Sterling plans "eventually" to divorc... |
| 2 | GAITHERSBURG, Md. (AP) — A small, private jet ... | – A twin-engine Embraer jet that the FAA descr... |
| 3 | Tucker Carlson Exposes His Own Sexism on Twitt... | – Tucker Carlson is in deep doodoo with conser... |
| 4 | A man accused of removing another man's testic... | – What are the three most horrifying words in ... |


### Initialize Embedding Model

We will use the `all-MiniLM-L6-v2` model to create vector representations of both our records and our search queries. These vector embeddings capture the semantic meaning of the text. During the retrieval phase, a similarity measure (here, cosine similarity) is applied in vector space to find the records most similar to a given query.

In [5]:
# set device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
if device != "cuda":
    print(
        f"You are using {device}. This is much slower than using "
        "a CUDA-enabled GPU. If on Colab you can change this by "
        "clicking Runtime > Change runtime type > GPU."
    )
# Instantiate the SentenceTransformer model
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
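
The printed model lists three stages: a Transformer, a Pooling layer with `pooling_mode_mean_tokens` enabled, and a `Normalize()` step. As a rough illustration of the last two stages, the sketch below averages a few token vectors and scales the result to unit length. The helper names and the 2-dimensional toy numbers are made up for this example, and padding masks are ignored for simplicity; the real model produces 384-dimensional vectors.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average token vectors component-wise into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def l2_normalize(vector: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Toy "token embeddings" (hypothetical values, 2 dimensions instead of 384).
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy_tokens)
sentence_vector = l2_normalize(pooled)
```

Because the output vectors have unit length, the cosine similarity between two of them reduces to a plain dot product.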

## Initialize Qdrant client

In [6]:
# Initialize Qdrant client

current_folder = Path.cwd()  # Get the current folder
qdrant_folder = current_folder / "qdrant"
qdrant_folder.mkdir(exist_ok=True)  # Create qdrant folder to store collection (no error on re-run)

client = QdrantClient(path=str(qdrant_folder.resolve()))  # path to new qdrant folder

collection_name = "langchain-retrieval-agent"

collections = client.get_collections()
print(collections)

# only create collection if it doesn't exist
# (get_collections() returns a response object, so we compare against the collection names)
existing_names = {collection.name for collection in collections.collections}
if collection_name not in existing_names:
    client.recreate_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=384,  # specifying dimensionality of vectors output by model
            distance=models.Distance.COSINE,  # specifying which metric will be used to check similarity of vectors
        ),
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='langchain-retrieval-agent')]
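
The printed output shows that the response is an object with a `collections` list of `CollectionDescription` entries rather than a plain list of names, so a membership test has to look at the `name` fields. The sketch below illustrates this with stand-in dataclasses that only mirror the printed shape; they are not the real `qdrant_client` classes, and the helper names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CollectionDescription:
    """Stand-in mirroring the printed CollectionDescription(name=...)."""
    name: str


@dataclass
class CollectionsResponse:
    """Stand-in mirroring the printed collections=[...] response."""
    collections: List[CollectionDescription] = field(default_factory=list)


def collection_names(response: CollectionsResponse) -> Set[str]:
    """Collect the names of all collections listed in the response."""
    return {description.name for description in response.collections}


def collection_exists(response: CollectionsResponse, name: str) -> bool:
    """Check membership against the names, not the response object itself."""
    return name in collection_names(response)


empty = CollectionsResponse()
populated = CollectionsResponse([CollectionDescription("langchain-retrieval-agent")])
```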


## Generate Embeddings -> Store in Qdrant
Now we will generate embeddings for the summary column. We do this in batches, which is much faster than encoding records one at a time, and then send a single API call to upsert each batch (also much faster).

In Qdrant, each point needs an id (a unique value), a vector (here, the embedding of the summary), and a payload holding metadata. The metadata is a dictionary containing data relevant to our embeddings.

In [7]:
%%time
batch_size = 1024  # specify batch size according to your RAM and compute, higher batch size = more RAM usage

for i in tqdm(range(0, len(data), batch_size)):
    i_end = min(len(data), i + batch_size)  # get end of batch
    batch = data.iloc[i:i_end]  # extract batch
    meta = batch.to_dict(orient="records")  # metadata (payload) dicts, one per record in this batch
    embeds = model.encode(
        batch["summary"].tolist()
    ).tolist()  # encoding the whole batch of summary passages into vectors

    ids = list(range(i, i_end))  # create unique IDs

    # upsert to qdrant
    client.upsert(
        collection_name=collection_name,
        points=models.Batch(ids=ids, vectors=embeds, payloads=meta),
    )

collection_vector_count = client.get_collection(
    collection_name=collection_name
).vectors_count
print(f"Vector count in collection: {collection_vector_count}")
assert collection_vector_count == len(data)

Vector count in collection: 44972
CPU times: total: 4min 51s
Wall time: 10min 43s
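
The batching arithmetic used in the loop above can be checked on its own, without the model or Qdrant. The sketch below computes the `(start, end)` bounds of each batch and arranges records column-wise (ids, vectors, payloads), which is the layout passed to `models.Batch`. The helper names `batch_bounds` and `to_columnar` are hypothetical, and a plain dictionary stands in for the real batch object.

```python
from typing import Any, Dict, Iterator, List, Tuple


def batch_bounds(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(total) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


def to_columnar(
    records: List[Dict[str, Any]], vectors: List[List[float]], start: int
) -> Dict[str, list]:
    """Arrange one batch column-wise: parallel lists of ids, vectors, payloads."""
    if len(records) != len(vectors):
        raise ValueError("each record needs exactly one vector")
    ids = list(range(start, start + len(records)))  # unique, sequential ids
    return {"ids": ids, "vectors": vectors, "payloads": records}


# Same numbers as the notebook: 44972 rows, batch size 1024.
bounds = list(batch_bounds(44972, 1024))

# Tiny made-up batch starting at offset 1024.
example = to_columnar(
    [{"summary": "a"}, {"summary": "b"}], [[0.1, 0.2], [0.3, 0.4]], start=1024
)
```

Using the row offset as the id keeps ids unique across batches, so re-running the loop overwrites the same points instead of duplicating them.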


Let's check our collection info:

In [15]:
client.get_collection(collection_name=collection_name)

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=44972, indexed_vectors_count=0, points_count=44972, segments_count=1, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=384, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None), shard_number=None, replication_factor=None, write_consistency_factor=None, on_disk_payload=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=None, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None), payload_schema={})

## Creating a Vector Store

We will reuse the same collection to create a LangChain vector store.

In [9]:
qdrant = Qdrant(
    client=client,
    collection_name=collection_name,
    embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    content_payload_key="summary",
)
qdrant

<langchain.vectorstores.qdrant.Qdrant at 0x13e2e1748d0>

## Querying
Now, with the help of LangChain, we can run a `similarity search` directly (without the generation component).

In [16]:
query = "When did the biggest terror attack on USA happen?"
qdrant.similarity_search(query, k=2)

 Document(page_content='– Since 9/11, the NYPD has stopped 14 terror attacks, right? It must be true because Mayor Bloomberg and police chief Ray Kelly keeping trumpeting the stat, and the media keep circulating it. (Like in this profile.) ProPublica took a look at the NYPD\'s own list of the 14 to test the accuracy of the claim. "Is it true? In a word, no," writes Justin Elliott. The boast "overstates both the number of serious, developed terrorist plots against New York and exaggerates the NYPD\'s role in stopping attacks." Of the 14, ProPublica says two, maybe three, qualify as true terror threats. And that includes "a failed attempt to bomb Times Square by a Pakistani-American in 2010 that the NYPD did not stop." What\'s more, the NYPD doesn\'t seem to have played a big role in most of the busts. "In several cases, it played no role at all." See the full article and a breakdown of the 14 cases here. (Asked about the story today, Bloomberg responded: “I could make as cogent an argum

Looks like we're getting good results. Let's take a look at how we can begin integrating this into a conversational chain.
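
For intuition before moving on, a similarity search with `k=2` conceptually scores every stored vector against the query vector and keeps the two best matches. The sketch below does this by brute force with cosine similarity on made-up 2-dimensional vectors. It is an illustration only: the function names are hypothetical, and Qdrant itself can use an HNSW index (see `hnsw_config` in the collection info) rather than scanning every point.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Vector, stored: Dict[int, Vector], k: int) -> List[Tuple[int, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(point_id, cosine_similarity(query, vec)) for point_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy collection: point id -> vector (hypothetical values).
stored_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
hits = top_k([1.0, 0.1], stored_vectors, k=2)
hit_ids = [point_id for point_id, _ in hits]
```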

## Initializing the Conversational Chain

We will use `Falcon-7B-Instruct` as our LLM. We will also need `conversational memory` to store previous turns of the conversation and a `ConversationalRetrievalChain` to retrieve extra data when needed.

We will query Falcon-7B-Instruct through the Hugging Face Inference API. This requires an API_TOKEN, which we can get from Hugging Face.

In [18]:
# get API_TOKEN from the Hugging Face website and set it as an environment variable
API_TOKEN = os.getenv("API_TOKEN")  # no placeholder fallback, so the check below can actually fail
if not API_TOKEN:
    raise ValueError(
        "API_TOKEN is not set. Please obtain a valid API_TOKEN from the Hugging Face website."
    )

In [27]:
# chat completion llm
llm = HuggingFaceHub(
    huggingfacehub_api_token=API_TOKEN,
    repo_id="tiiuae/falcon-7b-instruct",
    model_kwargs={
        "temperature": 0.1,
        "max_new_tokens": 2000,  # maximum number of tokens the model will generate
    },
)
# conversational memory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# conversational retrieval qa chain using vector store
qa = ConversationalRetrievalChain.from_llm(llm, qdrant.as_retriever(), memory=memory)
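
To make the roles of the three components clearer, here is a simplified sketch of the control flow a conversational retrieval chain follows: when there is chat history, the follow-up question is first rewritten into a standalone question, then documents are retrieved for it, and finally the LLM answers using those documents. This is not LangChain's implementation; the function name, the prompt wording, and the stub `fake_llm` and `fake_retrieve` callables are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (question, answer) pairs from earlier turns


def conversational_answer(
    question: str,
    history: History,
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> str:
    """One turn: condense the question, retrieve context, answer, remember."""
    if history:
        # Rewrite the follow-up so it makes sense without the earlier turns.
        transcript = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)
        standalone = llm(
            f"Chat history:\n{transcript}\nFollow-up: {question}\nStandalone question:"
        )
    else:
        standalone = question
    documents = retrieve(standalone)  # e.g. a vector store similarity search
    context = "\n\n".join(documents)
    answer = llm(f"Context:\n{context}\n\nQuestion: {standalone}\nAnswer:")
    history.append((question, answer))  # plays the role of the buffer memory
    return answer


def fake_llm(prompt: str) -> str:
    """Stub LLM: returns canned text so the flow can run offline."""
    if prompt.rstrip().endswith("Standalone question:"):
        return "What sport did Michael Phelps play?"
    return "stub answer"


seen_queries: List[str] = []


def fake_retrieve(query_text: str) -> List[str]:
    """Stub retriever that records which query it was asked."""
    seen_queries.append(query_text)
    return ["stub document"]


chat_history: History = []
first = conversational_answer("Who won the most Olympic medals?", chat_history, fake_llm, fake_retrieve)
second = conversational_answer("What sport did he play?", chat_history, fake_llm, fake_retrieve)
```

Note that the retriever receives the rewritten question on the second turn, which is what lets a pronoun like "he" resolve to the person mentioned earlier.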

To generate an answer to a question, we will use the `query` function:

In [26]:
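def query(question: str) -> str:
    """
    Generates an answer to the given question using the ConversationalRetrievalChain,
    which uses the Qdrant retriever and the conversational memory defined above.

    Args:
        question (str): The question to generate an answer for.

    Returns:
        str: The generated answer.
    """
    result = qa({"question": question})
    # drop the first character of the raw completion (a leading whitespace character)
    return result["answer"][1:]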

Now all the components are ready. We can start querying.

In [28]:
query("What was the largest gdp in 2020?")

'The largest GDP in 2020 was the United States with a GDP of 21.44 trillion USD.'

In [29]:
query("Which person has won the most olympic medals in history?")

' Michael Phelps\n\nAnswer:  Michael Phelps\n\nExplanation:  Michael Phelps has won a total of 21 Olympic gold medals, making him the most decorated Olympian in history.'

In [30]:
query("What sport did he used to play?")

' Michael Phelps is a swimmer.'

We are getting the answers we wanted: the agent can refer to the previous conversation as a source of information.
That's all we wanted to showcase. Feel free to try more queries.

In [None]:
client.delete_collection(collection_name=collection_name)

---