## Building a Q&A application over a text data source

###### Importing Libraries and Packages

In [1]:
import getpass
import os
from langchain_community.document_loaders import WebBaseLoader
import bs4
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_cohere import ChatCohere, CohereEmbeddings
from langchain_chroma import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

###### Setting the Environment Variables

In [2]:
os.environ['COHERE_API_KEY'] = getpass.getpass()

 ········


In [3]:
cohere_api_key = os.environ['COHERE_API_KEY']

In [4]:
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = getpass.getpass()

 ········


#### Indexing Pipeline

###### 1. Document Loading

In [5]:
loader = WebBaseLoader(
    web_paths=('https://karpathy.github.io/2019/04/25/recipe/',)
)
docs = loader.load()

In [6]:
# len(docs) = 1
docs

[Document(page_content="\n\n\n\n\nA Recipe for Training Neural Networks\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nAndrej Karpathy blog\n\n\n\n\n\n\n\n\n\nAbout\n\n\n\n\n\n\n\n\nA Recipe for Training Neural Networks\nApr 25, 2019\n\n\nSome few weeks ago I posted a tweet on “the most common neural net mistakes”, listing a few common gotchas related to training neural nets. The tweet got quite a bit more engagement than I anticipated (including a webinar :)). Clearly, a lot of people have personally encountered the large gap between “here is how a convolutional layer works” and “our convnet achieves state of the art results”.\nSo I thought it could be fun to brush off my dusty blog to expand my tweet to the long form that this topic deserves. However, instead of going into an enumeration of more common errors or fleshing them out, I wanted to dig a bit deeper and talk about how one can avoid making these errors altogether (or fix them very fast). The trick to doing so is to follow a certain proc

In [7]:
docs[0].metadata

{'source': 'https://karpathy.github.io/2019/04/25/recipe/',
 'title': 'A Recipe for Training Neural Networks',
 'description': 'Musings of a Computer Scientist.',
 'language': 'No language found.'}

###### 2. Document Transformation: Splitting into Chunks

In [8]:
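text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # characters shared between neighbouring chunks
    add_start_index=True,  # record each chunk's offset in the source text
)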
doc_splits = text_splitter.split_documents(documents=docs)

In [9]:
# len(doc_splits) = 34
doc_splits[7]

Document(page_content='training. Or you initialized your weights from a pretrained checkpoint but didn’t use the original mean. Or you just screwed up the settings for regularization strengths, learning rate, its decay rate, model size, etc. Therefore, your misconfigured neural net will throw exceptions only if you’re lucky; Most of the time it will train but silently work a bit worse.', metadata={'source': 'https://karpathy.github.io/2019/04/25/recipe/', 'title': 'A Recipe for Training Neural Networks', 'description': 'Musings of a Computer Scientist.', 'language': 'No language found.', 'start_index': 3994})

In [10]:
doc_splits[7].metadata

{'source': 'https://karpathy.github.io/2019/04/25/recipe/',
 'title': 'A Recipe for Training Neural Networks',
 'description': 'Musings of a Computer Scientist.',
 'language': 'No language found.',
 'start_index': 3994}
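
To make `chunk_size`, `chunk_overlap`, and `start_index` more concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks, which is why real chunks vary in length); `sliding_window_chunks` is a hypothetical helper written only for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters.

    Hypothetical helper for illustration; the real splitter breaks on separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["text"])
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk reappears at the head of the next. That repetition is what lets a sentence cut at a chunk boundary still be retrieved with some surrounding context.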

###### 3. Document Embedding and Storing in Chroma Vectorstore

In [11]:
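# Name the embedding model explicitly; newer langchain_cohere releases may require it.
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    embedding=CohereEmbeddings(model='embed-english-v3.0'),
)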

#### Retrieval and Generation

###### Retriever

In [12]:
retriever = vectorstore.as_retriever()
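
Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it (called with no arguments, `as_retriever()` should use similarity search and return a handful of chunks, four by default in the versions I am aware of). The sketch below imitates that with hand-written 3-dimensional vectors and cosine similarity. The vectors, `toy_index`, and `top_k` are made up for illustration and are unrelated to the actual Cohere embeddings or Chroma internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_index = [
    ("inspect your data", [0.9, 0.1, 0.0]),
    ("set up a baseline", [0.1, 0.9, 0.1]),
    ("regularize the model", [0.0, 0.2, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]

results = top_k(toy_query, toy_index, k=2)
print(results)
```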

###### Prompt

In [13]:
prompt = hub.pull('rlm/rag-prompt')

In [14]:
prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))])
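
The template has two placeholders, `{question}` and `{context}`, that are filled in at run time. As a rough illustration of that substitution step only, the sketch below fills the same template text with plain `str.format`; the question and context values are hypothetical, and the real `ChatPromptTemplate` additionally wraps the result in a chat message.

```python
# Template text copied from the rlm/rag-prompt output shown above.
rag_template = (
    "You are an assistant for question-answering tasks. Use the following pieces of "
    "retrieved context to answer the question. If you don't know the answer, just say "
    "that you don't know. Use three sentences maximum and keep the answer concise."
    "\nQuestion: {question} \nContext: {context} \nAnswer:"
)

# Hypothetical values standing in for the user's query and the retrieved chunks.
filled = rag_template.format(
    question="What should I do before training a model?",
    context="Become one with the data.",
)
print(filled)
```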

###### Chat Model

In [15]:
model = ChatCohere(model='command-r')

###### Output Parser

In [16]:
output_parser = StrOutputParser()

###### RAG Chain

In [17]:
chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt
    | model
    | output_parser
)
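
In the chain above the retriever's output, a list of `Document` objects, goes straight into the `{context}` slot, so as far as I can tell the model sees the list's string representation, metadata included. A common pattern is to join only the chunk text first. The sketch below shows such a formatter; `FakeDocument` is a hypothetical stand-in for LangChain's `Document` so the example runs without any dependencies, and `format_docs` is not part of the original notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks with blank lines, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("Become one with the data.", {"start_index": 0}),
    FakeDocument("Set up the end-to-end skeleton.", {"start_index": 950}),
]
context = format_docs(retrieved)
print(context)
```

In the notebook this could slot in as `{'context': retriever | format_docs, 'question': RunnablePassthrough()}`. Treat that as a suggestion rather than what the original cells ran.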

In [18]:
output = chain.invoke('list down the specific process followed by Andrej Karpathy when applying a neural network to a specific problem.')

In [19]:
print('question: list down the specific process followed by Andrej Karpathy when applying a neural network to a specific problem. \n\nanswer : ', output)

question: list down the specific process followed by Andrej Karpathy when applying a neural network to a specific problem. 

answer :  Andrej Karpathy's process when applying neural networks to a problem is as follows:

1. Inspect your data thoroughly and become familiar with it. 
2. Resist the temptation to use complex architectures; instead, opt for simplicity and copy proven architectures. 
3. Set up an end-to-end training framework and establish dumb baselines.


In [20]:
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer \
the question. If you don't know the answer, just say that you don't know. Don't make up anything yourself, answer \
based on the context and keep the answer concise.

Question: {question} 
Context: {context}
Answer: \
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt
    | model
    | output_parser
)

output = chain.invoke('List down the points mentioned by Andrej on how one can get familiar with the data before applying neural networks to a problem.')

In [21]:
print('question : List down the points mentioned by Andrej on how one can get familiar with the data before applying neural networks to a problem.\n\nanswer : ',output)

question : List down the points mentioned by Andrej on how one can get familiar with the data before applying neural networks to a problem.

answer :  Here is a list of suggestions by Andrej on getting familiar with the data before applying neural networks:

1. Spend a lot of time examining the data closely, looking for patterns, imbalances, and biases.
2. Think about the types of network architectures needed based on the data.
3. Look for outliers and try to understand why they exist.
4. Write code to visualise the distributions for better understanding.


In [22]:
output = chain.invoke('Can you brief the tips and tricks suggested by Andrej to setup the training or evaluation skeleton and a baseline?')

In [23]:
print('question : Can you brief the tips and tricks suggested by Andrej to setup the training or evaluation skeleton and a baseline?\n\nanswer : ',output)

question : Can you brief the tips and tricks suggested by Andrej to setup the training or evaluation skeleton and a baseline?

answer :  Andrej suggests spending a lot of time understanding the data. He recommends using simple models for the first run, such as a linear classifier or a tiny ConvNet, and choosing a well-performing architecture from a related paper.


In [24]:
output = chain.invoke("What are Andrej's suggestions on the topics of Overfit, Regularize and Tune? List down your answer for each one of them separately.")

In [25]:
print("question : What are Andrej's suggestions on the topics of Overfit, Regularize and Tune? List down your answer for each one of them separately.\n\nanswer : ",output)

question : What are Andrej's suggestions on the topics of Overfit, Regularize and Tune? List down your answer for each one of them separately.

answer :  Here are Andrej's suggestions on the topics of overfitting, regularizing, and tuning based on the available context:

1. Overfit: Andrej suggests first getting a model large enough that it can overfit by focusing on the training loss.

2. Regularize: To regularize the model, he recommends giving up some training loss to improve the validation loss. This can be achieved by weight decay and early stopping before the model fully overfits.

3. Tune: Andrej does not explicitly mention 'tuning' in the context provided. However, his suggestion of employing a two-stage approach, comprising obtaining a model capable of overfitting followed by regularization, implies that the model is tuned during the regularization stage to improve validation loss.
