# HW11: Retrieval-augmented LM for QA
In the lecture and the notebook, we learned that LLMs become stronger when they are given retrieval (i.e., relevant context for text generation) and carefully designed prompts.

In this homework, we will build a question-answering system using context retrieval + text-ada-001 (the cheapest OpenAI GPT-3 checkpoint, which I believe should be affordable within the free budget of an OpenAI account).

If you cannot make any OpenAI API calls, you can still complete the assignment without executing the code.

In [1]:
!pip install langchain openai faiss-cpu wikipedia tiktoken

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.183-py3-none-any.whl (938 kB)
Collecting openai
  Downloading openai-0.27.7-py3-none-any.whl (71 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Collecting tiktoken
  Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

### Set Your OpenAI Key

In [8]:
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

OpenAI API Key:··········


### Gather the context for question answering.

In [2]:
import wikipedia

wikipedia.set_lang('en')
page = wikipedia.page("Python (programming language)")
content = page.content

In [17]:
print(type(content))

<class 'str'>


The context contains 8485 tokens, which is too long to fit in the context window of text-ada-001. Therefore, we need to retrieve only the passages most relevant to the question.

In [3]:
import tiktoken

enc = tiktoken.encoding_for_model("text-ada-001")

len(enc.encode(content))

8485
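To make the size gap concrete, here is a back-of-the-envelope sketch. Only the 8485-token count comes from the cell above; the 2,049-token context window for text-ada-001 and the 256 tokens reserved for the question and the answer are assumptions made for illustration.

```python
# Rough token-budget check (illustrative; the two limits below are assumptions).
CONTEXT_WINDOW = 2049   # assumed context window of text-ada-001, in tokens
RESERVED_TOKENS = 256   # hypothetical allowance for the question and the answer
page_tokens = 8485      # token count of the Wikipedia page, measured above

budget = CONTEXT_WINDOW - RESERVED_TOKENS
fits = page_tokens <= budget
fraction_usable = budget / page_tokens

print(f"Token budget for context: {budget}")
print(f"Whole page fits: {fits}")
print(f"Share of the page that could be passed at once: {fraction_usable:.0%}")
```

Under these assumptions only about a fifth of the page could be sent in one prompt, which is why the rest of the notebook retrieves a few short chunks instead.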

### TODO 1: Split the context into chunks of 100 characters, then store them in FAISS using the OpenAI embedding model.

In [4]:
print(enc.encode(content))

[37906, 318, 257, 1029, 12, 5715, 11, 2276, 12, 29983, 8300, 3303, 13, 6363, 1486, 8876, 31648, 2438, 1100, 1799, 351, 262, 779, 286, 2383, 33793, 341, 2884, 262, 572, 12, 1589, 3896, 13, 37906, 318, 32366, 25683, 290, 15413, 12, 4033, 12609, 13, 632, 6971, 3294, 8300, 11497, 328, 907, 11, 1390, 20793, 357, 31722, 27931, 828, 2134, 12, 17107, 290, 10345, 8300, 13, 632, 318, 1690, 3417, 355, 257, 366, 65, 1436, 444, 3017, 1, 3303, 2233, 284, 663, 9815, 3210, 5888, 13, 8205, 17305, 5719, 9847, 388, 2540, 1762, 319, 11361, 287, 262, 2739, 7169, 82, 355, 257, 17270, 284, 262, 9738, 8300, 3303, 290, 717, 2716, 340, 287, 10249, 355, 11361, 657, 13, 24, 13, 15, 13, 11361, 362, 13, 15, 373, 2716, 287, 4751, 13, 11361, 513, 13, 15, 11, 2716, 287, 3648, 11, 373, 257, 1688, 18440, 407, 3190, 19528, 12, 38532, 351, 2961, 6300, 13, 11361, 362, 13, 22, 13, 1507, 11, 2716, 287, 12131, 11, 373, 262, 938, 2650, 286, 11361, 362, 13, 37906, 9835, 9803, 355, 530, 286, 262, 749, 2968, 8300, 8950, 13, 628, 

In [11]:
from langchain.document_loaders import TextLoader
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()

In [14]:
print(documents)

[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  \n\nLast year COVID-19 kept us apart. This year we are finally together again. \n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \n\nWith a duty to one another to the American people to the Constitution. \n\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \n\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \n\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \n\nHe met the Ukrainian people. \n\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. \n\nGroups of citize

In [16]:
print(type(documents[0]))

<class 'langchain.schema.Document'>


In [18]:
from langchain.schema import Document

In [21]:
# Document takes keyword arguments; passing the text positionally raises a TypeError.
doc = Document(page_content=content)

In [25]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI

# TODO: Split the context into chunks of 100 characters, then store them in FAISS using the OpenAI embedding model.
# Hint: use the tools imported above
# Hint: refer to https://python.langchain.com/en/latest/modules/indexes/getting_started.html for the documentation of text splitters, vectorstores, and retrievers

# Split the page text into chunks of at most 100 characters. The 10-character overlap
# keeps a little shared text between neighbouring chunks at each boundary.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
chunks = splitter.split_text(content)

# Sanity check: the splitter should not produce empty chunks.
assert all(len(chunk) > 0 for chunk in chunks)

# Embed every chunk with the OpenAI embedding model and index the vectors in FAISS.
# This call needs a valid OPENAI_API_KEY; an invalid key raises AuthenticationError.
embeddings_model = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(chunks, embeddings_model)

AuthenticationError: ignored
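The splitting step can be pictured offline with a simplified sketch. This is a plain fixed-window splitter, not the algorithm of `RecursiveCharacterTextSplitter` (which prefers to break on separators such as paragraph breaks and spaces); the function `split_with_overlap` and the sample text are hypothetical and only illustrate what `chunk_size` and `chunk_overlap` mean.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 100, chunk_overlap: int = 10) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is never made of overlap only.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# Hypothetical sample text standing in for the Wikipedia page content.
sample = "Python is a high-level, general-purpose programming language. " * 5
chunks_demo = split_with_overlap(sample)
for piece in chunks_demo:
    print(len(piece), repr(piece[:30]))
```

A small overlap is a hedge against cutting a fact in half: text that straddles a boundary appears at the end of one chunk and again at the start of the next.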

In [34]:
# FAISS.from_texts in the previous cell already embeds the chunks and builds the index,
# so the store does not need to be assembled by hand here.
# Wrap the vector store in a retriever; OpenAI(...) is the LLM class, not a retriever.
retriever = vectorstore.as_retriever()

In [20]:
question = "Who invented Python?"

# retrieve useful information for QA using the retriever you obtained.
docs = retriever.get_relevant_documents(question)
context = "\n".join([doc.page_content for doc in docs])

In [None]:
print(context)
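Conceptually, the retriever embeds the question and returns the chunks whose vectors lie closest to it. The sketch below imitates that with an exhaustive Euclidean nearest-neighbour search, the kind of search a flat FAISS index performs. `toy_embed`, `l2_distance`, `retrieve`, and the three sample chunks are hypothetical; the unit-length word-count vectors stand in for real OpenAI embeddings, and no API call is made.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length word-count vector."""
    words = "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()
    counts = Counter(words)
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {word: v / norm for word, v in counts.items()}


def l2_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def retrieve(query: str, texts: List[str], k: int = 2) -> List[str]:
    """Return the k texts whose toy embeddings are closest to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(texts, key=lambda t: l2_distance(query_vec, toy_embed(t)))
    return ranked[:k]


# Hypothetical chunks standing in for the 100-character pieces of the page.
toy_chunks = [
    "Guido van Rossum invented Python in the late 1980s.",
    "Python 2.0 was released in 2000.",
    "Python uses significant indentation.",
]
top = retrieve("Who invented Python?", toy_chunks, k=1)
print(top)
```

The vectors are normalised to unit length so that the distance reflects shared words rather than chunk length; with raw counts, a short unrelated chunk could rank above a longer relevant one.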

### TODO 2: question answering using retrieved context and OpenAI call.

In [None]:
qa_prompt = """Given the context: {context}
The answer to "{question}" is:"""
model_name = 'text-ada-001'

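# TODO 2: question answering using retrieved context and OpenAI call.
# Fill the prompt template with the retrieved context and the question, then query the model.
# temperature=0 is an assumption made here to keep the answer as repeatable as possible.
llm = OpenAI(model_name=model_name, temperature=0)
output = llm(qa_prompt.format(context=context, question=question))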

In [None]:
output

' Guido van Rossum.'
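The prompt can be previewed offline before spending any API budget. The sketch below reuses the `qa_prompt` template from this section; the two strings in `retrieved` are hypothetical stand-ins for retriever output, not chunks actually returned in this notebook.

```python
qa_prompt = """Given the context: {context}
The answer to "{question}" is:"""

# Hypothetical retrieved chunks, standing in for real retriever output.
retrieved = [
    "Guido van Rossum began working on Python in the late 1980s",
    "as a successor to the ABC programming language",
]
context = "\n".join(retrieved)

# str.format substitutes the named placeholders in the template.
prompt = qa_prompt.format(context=context, question="Who invented Python?")
print(prompt)
```

Because the template ends with `is:`, a completion model such as text-ada-001 is nudged to continue with the answer itself rather than restating the question.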