# LangChain

This notebook explains how to use LangChain to simplify a RAG (retrieval-augmented generation) workflow. It is based on Pere Martra's [LLM Course](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/3-LangChain/ask-your-documents-with-langchain-vectordb-hf.ipynb) ([article](https://pub.towardsai.net/query-your-dataframes-with-powerful-large-language-models-using-langchain-abe25782def5)).

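LangChain is a framework for building LLM applications. It provides composable building blocks (document loaders, text splitters, vector store integrations, prompt templates, and chains) that you can use to build prompt chains or customize existing templates. Because relevant documents are retrieved at query time and passed to the model as context, a LangChain application also lets an LLM use new data sets without retraining.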

In [None]:
import pandas as pd
import numpy as np

## Create a Vector Store

Uncomment one of the following data sets to experiment: [Labeled Newscatcher](https://www.kaggle.com/datasets/kotartemiy/topic-labeled-news-dataset?source=post_page-----ab2f995f09ba--------------------------------), [BBC News](https://www.kaggle.com/datasets/gpreda/bbc-news?source=post_page-----ab2f995f09ba--------------------------------&select=bbc_news.csv), or [MIT AI News](https://www.kaggle.com/datasets/deepanshudalal09/mit-ai-news-published-till-2023?source=post_page-----ab2f995f09ba--------------------------------). We'll treat one of the columns as the document (`DOCUMENT`) and one of the others as the topic (`TOPIC`), a searchable part of the metadata.

In [11]:
# news = pd.read_csv("./data/bbc_news.csv")
# MAX_NEWS = 500
# DOCUMENT = "description"
# TOPIC = "title"

news = pd.read_csv("./data/labeled_newscatcher.csv", sep=';')
MAX_NEWS = 1000
DOCUMENT = "title"
TOPIC = "topic"

# news = pd.read_csv("./data/mit_ai_news.csv")
# MAX_NEWS = 1000
# DOCUMENT = "Article Body"
# TOPIC = "Article Header"

subset_news = news.head(MAX_NEWS)

subset_news.head(3)

Unnamed: 0,topic,link,domain,published_date,title,lang
0,SCIENCE,https://www.eurekalert.org/pub_releases/2020-0...,eurekalert.org,2020-08-06 13:59:45,A closer look at water-splitting's solar fuel ...,en
1,SCIENCE,https://www.pulse.ng/news/world/an-irresistibl...,pulse.ng,2020-08-12 15:14:19,"An irresistible scent makes locusts swarm, stu...",en
2,SCIENCE,https://www.express.co.uk/news/science/1322607...,express.co.uk,2020-08-13 21:01:00,Artificial intelligence warning: AI will know ...,en


Use LangChain to create a data frame loader, then use the loader to create a list of document objects. Each element of `df_document` is a `Document` whose `page_content` holds the text and whose `metadata` holds the other columns in the data frame (topic, link, domain, etc.).

In [28]:
from langchain.document_loaders import DataFrameLoader

df_loader = DataFrameLoader(subset_news, page_content_column=DOCUMENT)
df_document = df_loader.load()

display(df_document[:2])

[Document(page_content="A closer look at water-splitting's solar fuel potential", metadata={'topic': 'SCIENCE', 'link': 'https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php', 'domain': 'eurekalert.org', 'published_date': '2020-08-06 13:59:45', 'lang': 'en'}),
 Document(page_content='An irresistible scent makes locusts swarm, study finds', metadata={'topic': 'SCIENCE', 'link': 'https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw', 'domain': 'pulse.ng', 'published_date': '2020-08-12 15:14:19', 'lang': 'en'})]
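To make the loader's behavior concrete, here is a minimal sketch of the same idea in plain Python. It is not LangChain's implementation: `rows_to_documents` is a hypothetical helper, rows are plain dicts rather than a data frame, and the two sample rows are trimmed copies of the newscatcher rows shown above.

```python
# Simplified sketch of what a data frame loader does; NOT LangChain's code.
# Rows are plain dicts so the example runs without pandas or LangChain.

def rows_to_documents(rows, page_content_column):
    """Turn each row into a dict with 'page_content' and 'metadata' keys."""
    documents = []
    for row in rows:
        # Every column except the content column becomes metadata.
        metadata = {
            key: value for key, value in row.items()
            if key != page_content_column
        }
        documents.append(
            {"page_content": row[page_content_column], "metadata": metadata}
        )
    return documents


# Trimmed sample rows (only three of the original columns are kept).
sample_rows = [
    {
        "topic": "SCIENCE",
        "domain": "eurekalert.org",
        "title": "A closer look at water-splitting's solar fuel potential",
    },
    {
        "topic": "SCIENCE",
        "domain": "pulse.ng",
        "title": "An irresistible scent makes locusts swarm, study finds",
    },
]

documents = rows_to_documents(sample_rows, page_content_column="title")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```

The content column is removed from the metadata, which matches the `Document` output above, where `title` does not appear under `metadata`.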

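Use LangChain to split `page_content` into text blocks. The block size is set to 250 characters with an overlap of 10 characters, to balance the benefit of more context per block against the cost of memory usage. The first two titles shown below are shorter than 250 characters, so they pass through the splitter unchanged.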

In [29]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=250, chunk_overlap=10)
texts = text_splitter.split_documents(df_document)

display(texts[:2])

[Document(page_content="A closer look at water-splitting's solar fuel potential", metadata={'topic': 'SCIENCE', 'link': 'https://www.eurekalert.org/pub_releases/2020-08/dbnl-acl080620.php', 'domain': 'eurekalert.org', 'published_date': '2020-08-06 13:59:45', 'lang': 'en'}),
 Document(page_content='An irresistible scent makes locusts swarm, study finds', metadata={'topic': 'SCIENCE', 'link': 'https://www.pulse.ng/news/world/an-irresistible-scent-makes-locusts-swarm-study-finds/jy784jw', 'domain': 'pulse.ng', 'published_date': '2020-08-12 15:14:19', 'lang': 'en'})]
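The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small sliding-window sketch. This is a deliberate simplification and an assumption on my part: `CharacterTextSplitter` splits on a separator first and then merges the pieces, whereas the hypothetical `chunk_text` helper below simply slices the string at fixed offsets.

```python
# Simplified fixed-size chunking with overlap; NOT how CharacterTextSplitter
# works internally (it splits on a separator and merges pieces).

def chunk_text(text, chunk_size, chunk_overlap):
    """Slice text into windows of chunk_size that share chunk_overlap chars."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# A short string makes the overlap easy to see.
small_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(small_chunks)

# A text shorter than chunk_size yields a single chunk.
title = "An irresistible scent makes locusts swarm, study finds"
title_chunks = chunk_text(title, chunk_size=250, chunk_overlap=10)
print(len(title_chunks))
```

Each chunk repeats the last character of the previous one, which is the role the overlap plays: text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.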

Create embeddings using the all-MiniLM-L6-v2 model and load them into the ChromaDB vector store.

In [31]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

chromadb_index = Chroma.from_documents(texts, embedding_function, persist_directory='./input')
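Conceptually, a vector store keeps one embedding per text block and returns the blocks whose embeddings are closest to the query embedding. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. Everything in it is illustrative: the headlines and vectors are made up, real all-MiniLM-L6-v2 embeddings have far more dimensions, and Chroma may use a different distance metric by default.

```python
import math

# Toy nearest-neighbor search; the vectors are invented for illustration
# and do not come from a real embedding model.

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# (text, embedding) pairs; hypothetical headlines and vectors.
toy_store = [
    ("Toshiba stops making laptops", [0.9, 0.1, 0.0]),
    ("Locusts swarm study", [0.0, 0.2, 0.9]),
    ("New laptop models announced", [0.8, 0.3, 0.1]),
]

# Pretend this is the embedding of a laptop-related question.
query_vector = [1.0, 0.2, 0.0]
matches = top_k(query_vector, toy_store, k=2)
print(matches)
```

The two laptop headlines rank above the locust headline because their vectors point in nearly the same direction as the query vector.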


## LangChain

With the vector store in hand, we can use LangChain to create a RAG application. The application performs a similarity-based search of the vector store using embeddings, then sends the retrieved documents as context along with the user query to an LLM, either **dolly-v2-3b** or **flan-t5-large** (try them both!). The vector store search is handled by the **retriever**.



In [32]:
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain_core.output_parsers import StrOutputParser

# You can get these models from Hugging Face. The model's supported task is in
# the documentation.
# https://huggingface.co/databricks/dolly-v2-3b
model_id = "databricks/dolly-v2-3b"
task="text-generation"
#model_id = "google/flan-t5-large"
#task="text2text-generation"

hf_llm = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task=task,
    model_kwargs={
        "temperature": 0,
        "max_length": 256
    },
    pipeline_kwargs={
        "repetition_penalty":1.1
    }
)

retriever = chromadb_index.as_retriever()

# "stuff" sends all retrieved documents in a single prompt. Other options:
# "refine" sends documents one at a time to refine the response each time;
# "map_reduce" queries each document separately, then combines the partial answers;
# "map_rerank" sends documents one at a time and ranks the results to return the best one.
document_qa = RetrievalQA.from_chain_type(
    llm=hf_llm, chain_type="stuff", retriever=retriever
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
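The "stuff" strategy can be pictured as plain string assembly: every retrieved document is pasted into one prompt ahead of the question. The sketch below is an assumption-laden illustration; `build_stuff_prompt` is a hypothetical helper and the prompt wording is invented, not the template that `RetrievalQA` actually uses.

```python
# Illustration of the "stuff" idea; the template text is hypothetical and
# is NOT the prompt that RetrievalQA uses internally.

def build_stuff_prompt(question, documents):
    """Concatenate all documents into one context block before the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved texts standing in for the retriever's results.
retrieved = [
    "Toshiba stops making laptops",
    "New laptop models announced",
]

stuff_prompt = build_stuff_prompt("Can I buy a Toshiba laptop?", retrieved)
print(stuff_prompt)
```

Because every document lands in one prompt, this approach is simple but limited by the model's context length, which is the trade-off the other chain types address.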


We're ready to use the chain. Let's ask a question that the newscatcher data set can answer: "Can I buy a Toshiba laptop?"

In [33]:
response = document_qa.invoke("Can I buy a Toshiba laptop?")

# Sample question for the BBC data set.
# response = document_qa.invoke("Who is going to meet boris johnson?")

display(response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


{'query': 'Can I buy a Toshiba laptop?',
 'result': ' No, Toshiba has officially stopped making laptops.\n\n'}

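A newer way to build the same pipeline is the LangChain Expression Language (LCEL), which composes the retriever, prompt, LLM, and output parser into a chain with the `|` operator.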

In [34]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """Answer the question based on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | hf_llm
    | StrOutputParser()
)

chain.invoke("Can I buy a Toshiba laptop?")

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


'Answer'
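The `|` syntax in the chain above reads as "feed the output of the left step into the right step". The toy sketch below mimics that composition in plain Python to show the data flow. It is not LangChain's implementation: `Step`, `parallel`, and the fake retriever and LLM are all hypothetical stand-ins.

```python
# Toy pipe composition that mimics the data flow of the chain above.
# Step, parallel, fake_retriever and fake_llm are hypothetical stand-ins.

class Step:
    """Wraps a function and supports `left | right` composition."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run this step first, then pass its output to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Send the same input to every named step and collect a dict."""
    return Step(
        lambda value: {name: step.invoke(value) for name, step in steps.items()}
    )


passthrough = Step(lambda value: value)
fake_retriever = Step(lambda question: ["Toshiba has stopped making laptops."])
fill_prompt = Step(
    lambda inputs: (
        f"Context: {' '.join(inputs['context'])}\n"
        f"Question: {inputs['question']}"
    )
)
# Stand-in for a model call: it only echoes the prompt it received.
fake_llm = Step(lambda prompt_text: f"[fake answer based on: {prompt_text}]")

toy_chain = (
    parallel(context=fake_retriever, question=passthrough)
    | fill_prompt
    | fake_llm
)

result = toy_chain.invoke("Can I buy a Toshiba laptop?")
print(result)
```

The question reaches the prompt twice: once through the retriever, which turns it into context, and once unchanged through the passthrough, mirroring the `{"context": retriever, "question": RunnablePassthrough()}` mapping in the real chain.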