## Example RAG with LangChain and ChromaDB

In [1]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter

# Load text data from a directory
loader = DirectoryLoader("textdata", glob="**/*.txt")
documents = loader.load()
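# Split into chunks of roughly 500 characters with a 50-character overlap.
# CharacterTextSplitter only cuts at its separator ("\n\n" by default), so a
# paragraph longer than chunk_size is kept whole and triggers the warnings below.
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)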
chunks = text_splitter.split_documents(documents)
chunks

Created a chunk of size 645, which is longer than the specified 500
Created a chunk of size 846, which is longer than the specified 500
Created a chunk of size 590, which is longer than the specified 500
Created a chunk of size 586, which is longer than the specified 500
Created a chunk of size 605, which is longer than the specified 500
Created a chunk of size 756, which is longer than the specified 500


[Document(page_content="Rafael Nadal Parera[pron 1] (born 3 June 1986) is a Spanish professional tennis player. Nadal has been ranked world No. 1 in singles by the Association of Tennis Professionals (ATP) for 209 weeks, and has finished as the year-end No. 1 five times. Nadal has won 22 Grand Slam men's singles titles, including a record 14 French Open titles. He has won 92 ATP-level singles titles, including 36 Masters titles and an Olympic gold medal, with 63 of these on clay courts. Nadal is one of only two men to complete the Career Golden Slam in singles. [b] His 81 consecutive wins on clay constitute the longest single-surface win streak in the Open Era.", metadata={'source': 'textdata\\rafaelnadal.txt'}),
 Document(page_content="For over a decade, Nadal has led men's tennis along with Roger Federer and Novak Djokovic as the Big Three. [c] At the start of his professional career, Nadal became one of the most successful teenagers in ATP Tour history, reaching the world No. 2 rank
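
The "longer than the specified 500" warnings above follow from how a separator-based splitter works: it cuts only at the separator and then merges neighbouring pieces while they still fit. The sketch below imitates that behaviour with the standard library only. It is an illustration, not LangChain's implementation: `split_on_separator` is a hypothetical helper, the paragraphs are synthetic, and chunk overlap is deliberately ignored.

```python
def split_on_separator(text, chunk_size=500, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Synthetic paragraphs of 300, 150 and 645 characters (assumed sizes).
paragraphs = ["a" * 300, "b" * 150, "c" * 645]
chunks = split_on_separator("\n\n".join(paragraphs), chunk_size=500)
for chunk in chunks:
    note = " (longer than 500)" if len(chunk) > 500 else ""
    print(f"{len(chunk)}{note}")
```

The first two paragraphs are merged into one chunk, while the 645-character paragraph stays whole and exceeds the limit. A splitter that falls back to finer separators would be one way to keep every chunk under the limit.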

In [9]:
import getpass
from langchain_community.embeddings import HuggingFaceHubEmbeddings
from langchain_community.vectorstores import Chroma

# Prompt for the Hugging Face API token without echoing it
api_key = getpass.getpass('HuggingFace API Key:')

embeddings = HuggingFaceHubEmbeddings(
    huggingfacehub_api_token=api_key,
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
)

# Embed the chunks and load them into an in-memory (non-persistent) Chroma collection
db = Chroma.from_documents(chunks, embeddings)

Using embedded DuckDB without persistence: data will be transient
No embedding_function provided, using default embedding function: SentenceTransformerEmbeddingFunction


In [3]:
# Similarity search in Chroma
query = "How old is Roger Federer?"
docs = db.similarity_search(query, k=2)
docs

[Document(page_content='Federer and Stan Wawrinka led the Switzerland Davis Cup team to their first title in 2014, following their Olympic doubles gold victory at the 2008 Beijing Olympics. Federer also won a silver medal in singles at the 2012 London Olympics, finishing runner-up to Andy Murray. After a half-year hiatus in late 2016 to recover from knee surgery, Federer returned to tennis, winning three more majors over the next two years, including the 2017 Australian Open over Nadal and an eighth singles title at the 2017 Wimbledon Championships. At the 2018 Australian Open, Federer became the first man to win 20 major singles titles and shortly arter the oldest ATP world No. 1 at age 36. In September 2022, he retired from professional tennis following the Laver Cup.', metadata={'source': 'textdata\\rogerfederer.txt'}),
 Document(page_content="Roger Federer (German pronunciation: [ˈrɔdʒər ˈfeːdərər]; born 8 August 1981) is a Swiss former professional tennis player. Federer was ranke
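
Conceptually, `similarity_search(query, k=2)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. Everything in it is an assumption made for illustration: the vectors are invented, `cosine_similarity` and `top_k` are hypothetical helpers, and the metric and index that Chroma actually uses are not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the texts of the k entries most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (text, embedding) pairs with made-up embeddings.
toy_index = [
    ("Federer biography chunk", [0.9, 0.1, 0.0]),
    ("Federer career chunk", [0.8, 0.3, 0.1]),
    ("Nadal biography chunk", [0.1, 0.9, 0.2]),
]
toy_query = [1.0, 0.2, 0.0]  # stands in for the embedded question

results = top_k(toy_query, toy_index, k=2)
print(results)
```

The search ranks chunks by vector closeness, not by whether they answer the question, which is why a query about Federer's age can retrieve a passage that does not contain his birth date first.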

In [4]:
from langchain_community.llms import HuggingFaceHub

# Load LLM
model = HuggingFaceHub(huggingfacehub_api_token=api_key, repo_id="mistralai/Mistral-7B-Instruct-v0.1")

  warn_deprecated(


In [5]:
from langchain.prompts import PromptTemplate

# Build prompt template
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

print(prompt)

input_variables=['context', 'question'] template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'
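
To make the role of the two placeholders concrete, the sketch below fills the same template with plain `str.format`. It assumes the retrieved chunks are joined with a blank line before being substituted for `{context}`, which matches the layout of the printed chain outputs in this notebook. `build_prompt` and the two sample chunk strings are hypothetical stand-ins, not part of LangChain.

```python
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def build_prompt(template, retrieved_texts, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(retrieved_texts)
    return template.format(context=context, question=question)


# Placeholder chunks standing in for the retriever's results.
retrieved_texts = ["First retrieved chunk.", "Second retrieved chunk."]
filled = build_prompt(
    template, retrieved_texts, "How many titles did Roger Federer win?"
)
print(filled)
```

Because every retrieved chunk is pasted into a single prompt, the number of chunks (`k`) and their size directly determine how long the prompt sent to the model becomes.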


In [12]:
from langchain.chains import RetrievalQA

# Use Chroma as the retriever (top 2 chunks) and build the chain
retriever = db.as_retriever(search_kwargs={"k": 2})

chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)

In [13]:
question = "How many titles did Roger Federer win?"
answer = chain.run({"query": question})
print(answer)

Answer the question based only on the following context:
Roger Federer (German pronunciation: [ˈrɔdʒər ˈfeːdərər]; born 8 August 1981) is a Swiss former professional tennis player. Federer was ranked world No. 1 in singles by the Association of Tennis Professionals (ATP) for 310 weeks, including a record 237 consecutive weeks, and finished as the year-end No. 1 five times. He won 103 singles titles on the ATP Tour, the second most of all time, including 20 major men's singles titles (among which a record eight men's singles Wimbledon titles, and an Open Era joint-record five men's singles US Open titles) and six year-end championships.

Federer and Stan Wawrinka led the Switzerland Davis Cup team to their first title in 2014, following their Olympic doubles gold victory at the 2008 Beijing Olympics. Federer also won a silver medal in singles at the 2012 London Olympics, finishing runner-up to Andy Murray. After a half-year hiatus in late 2016 to recover from knee surgery, Federer retur

In [14]:
question = "When was Roger Federer born and how old is he?"
answer = chain.run({"query": question})
print(answer)

Answer the question based only on the following context:
Federer and Stan Wawrinka led the Switzerland Davis Cup team to their first title in 2014, following their Olympic doubles gold victory at the 2008 Beijing Olympics. Federer also won a silver medal in singles at the 2012 London Olympics, finishing runner-up to Andy Murray. After a half-year hiatus in late 2016 to recover from knee surgery, Federer returned to tennis, winning three more majors over the next two years, including the 2017 Australian Open over Nadal and an eighth singles title at the 2017 Wimbledon Championships. At the 2018 Australian Open, Federer became the first man to win 20 major singles titles and shortly arter the oldest ATP world No. 1 at age 36. In September 2022, he retired from professional tennis following the Laver Cup.

Roger Federer (German pronunciation: [ˈrɔdʒər ˈfeːdərər]; born 8 August 1981) is a Swiss former professional tennis player. Federer was ranked world No. 1 in singles by the Association 

In [18]:
question = "Summarize the life of Rafa in 10 words."
answer = chain.run({"query": question})
print(answer)

Answer the question based only on the following context:
As a left-handed player, one of Nadal's main strengths is his forehand, which he hits with a high degree of topspin. He also regularly places among the Tour leaders in percentage of return games, return points, and break points won. Nadal has won the Stefan Edberg Sportsmanship Award five times and was the Laureus World Sportsman of the Year in 2011 and 2021. Time named Nadal one of the 100 most influential people in the world in 2022. He is a recipient of the Grand Cross of Royal Order of Sports Merit, Grand Cross of Order of the Second of May, the Grand Cross of Naval Merit, and the Medal of the City of Paris. Representing Spain, he has won two Olympic gold medals, and led the nation to four Davis Cup titles. Nadal has also opened a tennis academy in Mallorca, and is an active philanthropist.

For over a decade, Nadal has led men's tennis along with Roger Federer and Novak Djokovic as the Big Three. [c] At the start of his prof
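
Each printed answer above starts with the prompt itself, which suggests that the model endpoint returns the prompt followed by its completion. Assuming that is the case, one possible way to keep only the completion is to cut everything up to and including the final `Question: ...` line. `extract_completion` is a hypothetical helper and the sample strings are placeholders, not real model output.

```python
def extract_completion(generated, question):
    """Return the text after the last 'Question: <question>' line, if present."""
    marker = f"Question: {question}"
    _, found, tail = generated.rpartition(marker)
    # If the marker is missing, fall back to the full text.
    return tail.strip() if found else generated.strip()


# Placeholder strings imitating an echoed prompt plus a completion.
question = "Who?"
echoed_prompt = (
    "Answer the question based only on the following context:\n"
    "Some context.\n\n"
    f"Question: {question}\n"
)
raw_output = echoed_prompt + "\nA placeholder completion."

cleaned = extract_completion(raw_output, question)
print(cleaned)
```

With the chain from this notebook, the helper would be applied to the returned `answer` string together with the original `question`.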