# Generative Question Answering






## Installation

1. Click File -> Save a Copy in Drive. This will create a copy for you to modify in your Google Drive.

2. This project will require you to install langchain, cohere, chromadb, and other libraries. Run the command below to start the installation. Installation may take a couple of minutes.


In [None]:
!pip install cohere nltk unstructured langchain chromadb langchain-community langchain-cohere

## Import Libraries




In [None]:
import os
import nltk
nltk.download('punkt')

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import TextLoader  # loaders live in langchain-community
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma  # vector stores live in langchain-community
from langchain_cohere import ChatCohere, CohereEmbeddings

## Setup

Before running the next cell, make sure you have a [Cohere API key](https://dashboard.cohere.ai/api-keys). Enter your API key using getpass in the cell below.


In [None]:
from getpass import getpass
api_key = getpass()
os.environ["COHERE_API_KEY"] = api_key

## Uploading documents to Colab

There are two ways you can upload your documents to Colab.

#### Method 1

1. Right-click on the link below and save the text file to your computer.

[Wikipedia Article: Quantum Computing](https://raw.githubusercontent.com/Thinkful-Ed/ai-in-web-dev-resources/refs/heads/main/books/quantumcomputers.txt)

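2. Click on the folder icon tab at the left of the Colab editor. Click on the *upload file* icon and select the file that you want to upload. Upload it to the top-level working directory so that it is available at `./quantumcomputers.txt`, the path that the loading cell further down expects.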

#### Method 2

Run the command below:

In [None]:
!curl -o ./quantumcomputers.txt https://raw.githubusercontent.com/Thinkful-Ed/ai-in-web-dev-resources/refs/heads/main/books/quantumcomputers.txt

## Load the documents

The `TextLoader` class used in the cell below loads a single text file. If you want to load multiple documents that are in a directory instead, use the [DirectoryLoader](https://js.langchain.com/docs/api/document_loaders_fs_directory/classes/DirectoryLoader) class: create a directory by clicking the folder icon at the left and upload your text documents there. If you want to use another format, such as `md` files, make sure to change the `glob="**/*.txt"` argument and read the documentation to confirm that the file type is supported.
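To see what a `**/*.txt` glob pattern selects, here is a minimal sketch that uses only the Python standard library. It is not the LangChain loader itself; the helper name `find_text_files` and the sample file names are made up for illustration.

```python
from pathlib import Path
import tempfile


def find_text_files(root, pattern="**/*.txt"):
    """Return the sorted paths under `root` that match a recursive glob pattern.

    Hypothetical helper for illustration; it only lists files and does not
    read or parse them.
    """
    return sorted(Path(root).glob(pattern))


# Build a throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "a.txt").write_text("alpha", encoding="utf-8")
    (root / "notes" / "b.txt").write_text("beta", encoding="utf-8")
    (root / "notes" / "c.md").write_text("gamma", encoding="utf-8")

    # Paths are shown relative to the root, with forward slashes.
    found = [p.relative_to(root).as_posix() for p in find_text_files(root)]
    print(found)
```

The `**` part matches the top-level directory and every subdirectory, so `a.txt` and `notes/b.txt` are selected while `notes/c.md` is skipped. Changing the pattern to `**/*.md` would select the Markdown file instead.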

**Note:** Cohere now imposes a limit on embedding requests. If you get an error, try a smaller text (such as a single Wikipedia article).

In [None]:
loader = TextLoader("./quantumcomputers.txt", encoding="utf-8")
documents = loader.load()
print(f'Loaded {len(documents)} document(s)')

## Splitting Text

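Splitting text is useful because, in a long document, it is difficult to find the specific passage relevant to a question. The `RecursiveCharacterTextSplitter` breaks the text into smaller, more manageable chunks. The `chunk_size` argument sets the maximum chunk length in characters, and `chunk_overlap` sets how many characters neighboring chunks share so that a sentence cut at a boundary still appears with some context.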
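The idea of chunking with overlap can be sketched in a few lines of plain Python. This is a simplified fixed-window splitter, not the algorithm the LangChain splitters use (they also try to break on separators such as paragraphs and sentences); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split `text` into fixed-size character windows that share an overlap.

    Simplified illustration only: it cuts at exact character offsets and
    ignores word or sentence boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

With a window of 10 characters and an overlap of 3, each chunk starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next.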

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
print(f'Your document(s) was/were split into {len(texts)} chunks.')

## Embeddings and RetrievalQA

In simple terms, this code sets up the necessary tools for question-answering. It converts text into numerical representations called embeddings, creates a search index for the documents, and sets up a retrieval-based question-answering system using a specific type of chain (RetrievalQA).
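To make the retrieval step more concrete, the sketch below ranks a few text chunks by cosine similarity against a query vector. The three-dimensional vectors are hand-made stand-ins, not real Cohere embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical. A real vector store uses an optimized index and may use a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the texts of the k entries in `index` most similar to `query_vec`."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings": made-up numbers chosen so related texts point the same way.
toy_index = [
    ("Qubits can exist in superposition.", [0.9, 0.1, 0.0]),
    ("Classical bits are either 0 or 1.", [0.1, 0.9, 0.0]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]  # stands in for the embedding of a question about qubits

best = top_k(toy_query, toy_index, k=2)
print(best)
```

In a retrieval-based chain, the chunks selected this way are passed to the language model as context, which is why the answer can be traced back to source documents.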


In [None]:
# Read the key stored in COHERE_API_KEY during Setup; never hard-code it in the notebook.
embeddings = CohereEmbeddings(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model="embed-english-v3.0"  # Required model name
)
docsearch = Chroma.from_documents(texts, embeddings, persist_directory='db')
docsearch.persist()  # save db (may be a no-op on recent Chroma versions, which persist automatically)
llm = ChatCohere(cohere_api_key=os.environ["COHERE_API_KEY"])

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=docsearch.as_retriever(),
    return_source_documents=True
)


In [None]:
query = "what is quantum computing?"
result = qa.invoke({"query": query})
result['result']

You can also get the sources:

In [None]:
result['source_documents']

In [None]:
sources = ""
for count, source in enumerate(result['source_documents'],1):
  sources += "Source " + str(count) + "\n"
  sources += source.page_content + "\n"

print(sources)


## Download the database

Zip the database into `db.zip` and download it to your computer.


In [None]:
import shutil
shutil.make_archive('db', 'zip', root_dir='db')  # zips the folder named 'db' into db.zip


After running the cell above, `db.zip` should appear in the same folder as the notebook. You may need to refresh the file browser to see it.
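If you want to confirm what ended up in the archive, you can list its contents with the standard `zipfile` module. The sketch below runs against a throwaway folder with a placeholder file so that it is self-contained; the file name `example.bin` is made up and is not what Chroma actually writes.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real 'db' folder, holding one placeholder file.
    folder = Path(tmp) / "db"
    folder.mkdir()
    (folder / "example.bin").write_bytes(b"placeholder")

    # Same call shape as the cell above: base name, format, folder to zip.
    archive_path = shutil.make_archive(str(Path(tmp) / "db"), "zip", root_dir=str(folder))

    # List the entries stored in the archive.
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
    print(names)
```

In the notebook itself, `zipfile.ZipFile('db.zip').namelist()` gives the same kind of listing for the real database.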