https://medium.com/international-school-of-ai-data-science/implementing-rag-with-langchain-and-hugging-face-28e3ea66c5f7

In [3]:
print('Hello World')

Hello World


In [20]:
# !python -m pip list

In [21]:
# !python -m pip install langchain --user

In [22]:
# !python -m pip install torch

In [23]:
# !python -m pip install transformers

In [24]:
# !python -m pip install sentence-transformers

In [25]:
# !python -m pip install datasets   

In [26]:
# !python -m pip install faiss-cpu

In [27]:
# !python -m pip install ipywidgets

In [15]:
# revert to 1.x, issues with 2.x
# !pip install pydantic==1.10.14

Note: langchain.document_loaders required Python 3.10

In [28]:
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Load

In [29]:
# specify dataset name and column containing the content
dataset_name = 'databricks/databricks-dolly-15k'
page_content_column = 'context'

# create loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# load data
data = loader.load()

# display the first 2 documents
data[:2]

Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 8.49MB/s]
Downloading data: 100%|██████████| 13.1M/13.1M [00:01<00:00, 9.86MB/s]
Generating train split: 15011 examples [00:00, 102632.28 examples/s]


[Document(page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."', metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'}),
 Document(page_content='""', metadata={'instruction': 'Which is a species of fish? Tope or Rope', 'response': 'Tope', 'category': 'classification'})]

In [31]:
type(data)

list
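
The second document in the output above has an empty context (its page_content is just a pair of quote characters). A minimal sketch of how such entries could be dropped before splitting and embedding follows. `Doc` and `has_content` are hypothetical names used only for illustration; `Doc` stands in for the loader's Document objects so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def has_content(doc) -> bool:
    """Return True if the document holds text beyond whitespace and quote characters."""
    return bool(doc.page_content.strip().strip('"').strip())


sample = [
    Doc('"Virgin Australia is an Australian-based airline."'),
    Doc('""'),  # empty context, as in the second document above
]

non_empty = [d for d in sample if has_content(d)]
print(len(non_empty))
```

In the notebook, the same predicate could be applied as `data = [d for d in data if has_content(d)]` before the splitting step, assuming the loaded objects expose `page_content` as shown in the output above.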

Test Splitter

In [32]:
# Create instance of recursive text splitter

# split text into chunks of 1000 chars, w/ 150 char overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the text to split -> split it into chunked documents 'docs' using the splitter
docs = text_splitter.split_documents(data)

In [36]:
docs[0]

Document(page_content='"Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia\'s domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."', metadata={'instruction': 'When did Virgin Australia start operating?', 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.', 'category': 'closed_qa'})
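
To make the chunk_size and chunk_overlap parameters concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ); `chunk_with_overlap` is a hypothetical helper written only to show how consecutive chunks share characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one. With chunk_size=1000 and chunk_overlap=150, as in the cell above, the shared region would be up to 150 characters, which helps keep a sentence that straddles a boundary retrievable from either side.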

Text Embedding

In [43]:
## Note: Required enabling Developer Mode in Windows ##

# Define path to pre-trained model
modelPath = 'sentence-transformers/all-MiniLM-l6-v2'

# Dict with model config options -> specifying CPU
model_kwargs = {'device' : 'cpu'}

# Dict of encoding options -> setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings' : False}

# Initialize instance of HuggingFaceEmbeddings with specified params
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [46]:
text = 'this is a text document.'
query_result = embeddings.embed_query(text)
query_result[:3]

[-0.039057452231645584, 0.1383126676082611, -0.01722455956041813]
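
The encode_kwargs above set normalize_embeddings to False, so vectors are kept as the model returns them. As a rough illustration of what normalization would change, the sketch below scales a vector to unit length and computes cosine similarity in plain Python. `l2_normalize` and `cosine_similarity` are hypothetical helpers written for this note, not part of LangChain or sentence-transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
```

Once vectors have unit length, ranking by dot product, by cosine similarity, or by Euclidean distance gives the same order, which is one reason normalization is often switched on for similarity search.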

Vector Stores

In [48]:
%%time
db = FAISS.from_documents(docs, embeddings)

CPU times: total: 40min 2s
Wall time: 6min 45s
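
Conceptually, the vector store keeps one embedding per chunk and answers a query by finding the stored vectors closest to the query's embedding. The sketch below shows an exhaustive nearest-neighbour search by Euclidean distance over toy two-dimensional vectors. It assumes a flat (exhaustive) index with Euclidean distance, which is an assumption about the configuration rather than something shown in this notebook, and `squared_l2` and `nearest` are hypothetical names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec, stored, k=4):
    """Return the k stored (text, vector) pairs closest to query_vec, nearest first."""
    ranked = sorted(stored, key=lambda item: squared_l2(query_vec, item[1]))
    return ranked[:k]


# Toy "index": (chunk text, embedding) pairs with made-up 2-d vectors
store = [
    ('cheese', [1.0, 0.0]),
    ('airline', [0.0, 1.0]),
    ('milk', [0.9, 0.1]),
]

top = nearest([1.0, 0.0], store, k=2)
print([text for text, _ in top])  # ['cheese', 'milk']
```

Real embeddings from the model above have several hundred dimensions rather than two, and FAISS performs the distance computation in optimized native code, but the ranking idea is the same.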


Ask Question

In [50]:
question = 'What is cheesemaking?'
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)

"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,


Preparing LLM Model

In [51]:
# Create tokenizer object by loading pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained('Intel/dynamic_tinybert')

# Create a question-answer model object by loading pretrained model
model = AutoModelForQuestionAnswering.from_pretrained('Intel/dynamic_tinybert')

In [52]:
# Specify model name
model_name = 'Intel/dynamic_tinybert'

# Load tokenizer associated with specified model
tokenizer = AutoTokenizer.from_pretrained(model_name, padding=True, truncation=True, max_length=512)

# Define question-answer pipeline 
question_answerer = pipeline(
    'question-answering',
    model=model_name,
    tokenizer=tokenizer,
    return_tensors='pt'
)

# Create instance of the HuggingFacePipeline, which wraps the question-answer pipeline with add'l model-specific args
llm = HuggingFacePipeline(
    pipeline=question_answerer,
    model_kwargs={'temperature':0.7, 'max_length':512}
)

Retrievers

In [53]:
# create retriever object from the 'db' using 'as_retriever' method
retriever = db.as_retriever()

Search Relevant Doc for Question

In [54]:
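# use a separate name so the chunk list 'docs' created by the splitter is not overwritten
relevant_docs = retriever.get_relevant_documents('What is Cheesemaking?')
print(relevant_docs[0].page_content)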

"The goal of cheese making is to control the spoiling of milk into cheese. The milk is traditionally from a cow, goat, sheep or buffalo, although, in theory, cheese could be made from the milk of any mammal. Cow's milk is most commonly used worldwide. The cheesemaker's goal is a consistent product with specific characteristics (appearance, aroma, taste, texture). The process used to make a Camembert will be similar to, but not quite the same as, that used to make Cheddar.\n\nSome cheeses may be deliberately left to ferment from naturally airborne spores and bacteria; this approach generally leads to a less consistent product but one that is valuable in a niche market.\n\nCulturing\nCheese is made by bringing milk (possibly pasteurised) in the cheese vat to a temperature required to promote the growth of the bacteria that feed on lactose and thus ferment the lactose into lactic acid. These bacteria in the milk may be wild, as is the case with unpasteurised milk, added from a culture,


Retrieval QA Chain

In [63]:
# Create a retriever object from 'db' w/ search config where it retrieves 4 relevant splits/documents
retriever = db.as_retriever(search_kwargs={'k':4})

# Create a question-answering instance (qa) using RetrievalQA class
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='refine',
    retriever=retriever,
    return_source_documents=False
)

Call QA Chain

In [62]:
question = 'Who is Thomas Jefferson?'
result = qa.invoke({"query" : question})
print(result['result'])

ValueError: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

"Thomas Jefferson (April 13, 1743 \u2013 July 4, 1826) was an American statesman, diplomat, lawyer, architect, philosopher, and Founding Father who served as the third president of the United States from 1801 to 1809. Among the Committee of Five charged by the Second Continental Congress with authoring the Declaration of Independence, Jefferson was the Declaration's primary author. Following the American Revolutionary War and prior to becoming the nation's third president in 1801, Jefferson was the first United States secretary of state under George Washington and then the nation's second vice president under John Adams."

"Washington played an indispensable role in adopting and ratifying the Constitution of the United States, which replaced the Articles of Confederation in 1789 and remains the world's longest-standing written and codified national constitution to this day. He was then twice elected president by the Electoral College unanimously. As the first U.S. president, Washington implemented a strong, well-financed national government while remaining impartial in a fierce rivalry that emerged between cabinet members Thomas Jefferson and Alexander Hamilton. During the French Revolution, he proclaimed a policy of neutrality while sanctioning the Jay Treaty. He set enduring precedents for the office of president, including use of the title \"Mr. President\" and taking an Oath of Office with his hand on a Bible. His Farewell Address on September 19, 1796, is widely regarded as a preeminent statement on republicanism."

"Marcus Morton (1784 \u2013 February 6, 1864) was an American lawyer, jurist, and politician from Taunton, Massachusetts. He served two terms as Governor of Massachusetts and several months as Acting Governor following the death in 1825 of William Eustis. He served for 15 years as an associate justice of the Massachusetts Supreme Judicial Court, all the while running unsuccessfully as a Democrat for governor. He finally won the 1839 election, acquiring exactly the number of votes required for a majority win over Edward Everett. After losing the 1840 and 1841 elections, he was elected in a narrow victory in 1842.\n\nThe Massachusetts Democratic Party was highly factionalized, which contributed to Morton's long string of defeats. His brief periods of ascendancy, however, resulted in no substantive Democratic-supported reforms, since the dominant Whigs reversed most of the changes enacted during his terms. An opponent of the extension of slavery, he split with longtime friend John C.

"Muskets with interchangeable locks caught the attention of Thomas Jefferson through the efforts of Honor\u00e9 Blanc when Jefferson was Ambassador to France in 1785. Jefferson tried to persuade Blanc to move to America, but was not successful, so he wrote to the American Secretary of War with the idea, and when he returned to the USA he worked to fund its development. President George Washington approved of the idea, and by 1798 a contract was issued to Eli Whitney for 12,000 muskets built under the new system."

Question: Who is Thomas Jefferson?
Helpful Answer: argument needs to be of type (SquadExample, dict)
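
The ValueError above is the chain's full prompt followed by "argument needs to be of type (SquadExample, dict)". The 'question-answering' pipeline is extractive and expects a question plus a context, while the chain appears to hand it a single prompt string. One possible workaround, sketched here under that assumption, is to skip the chain and pass the retrieved text to the pipeline directly. `Doc` and `build_qa_input` are hypothetical names; `Doc` stands in for retrieved documents so the snippet runs without downloading a model.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def build_qa_input(question, docs, max_docs=4):
    """Join the text of up to max_docs documents into one context for an extractive QA pipeline."""
    context = '\n\n'.join(d.page_content for d in docs[:max_docs])
    return {'question': question, 'context': context}


retrieved = [
    Doc('Thomas Jefferson was the third president of the United States.'),
    Doc('George Washington was the first U.S. president.'),
]
qa_input = build_qa_input('Who is Thomas Jefferson?', retrieved)
print(qa_input['question'])

# In the notebook, with the objects from the earlier cells, this would become (not run here):
#   retrieved = retriever.get_relevant_documents(question)
#   answer = question_answerer(**build_qa_input(question, retrieved))
#   print(answer['answer'])
```

An extractive model returns a span copied from the context rather than generated text, so the answer can only be as good as the retrieved chunks. Keeping the RetrievalQA chain instead would require a pipeline whose task takes a prompt string, such as text generation.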