### Basic App for Evaluation

In [8]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

In [9]:
from langchain import OpenAI

In [10]:
llm = OpenAI()

**Load document**

In [11]:
from langchain.document_loaders import TextLoader

In [13]:
loader = TextLoader('be-good-and-how-not-to-die.txt')
document = loader.load()

In [14]:
print(f"The document has {len(document[0].page_content)} characters")

The document has 27419 characters


**Split the document into smaller chunks**

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [16]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 3000,
    chunk_overlap = 400
)

In [17]:
documents_chunks = text_splitter.split_documents(document)

In [18]:
print(f"Now you have {len(documents_chunks)} chunks")

Now you have 12 chunks
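
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks are usually shorter than `chunk_size` and its chunk count can differ. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=3000, chunk_overlap=400):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only: it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 10,000 characters cycling through the alphabet.
toy_text = "".join(chr(97 + i % 26) for i in range(10000))
toy_chunks = split_with_overlap(toy_text)
# The tail of one chunk is repeated at the head of the next one.
print(toy_chunks[0][-400:] == toy_chunks[1][:400])
```

On a 27,419-character string this fixed-window sketch yields 11 windows, one fewer than the 12 chunks reported above, because the real splitter cuts at separators rather than at exact character offsets.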


In [19]:
print(type(documents_chunks))

<class 'list'>


**Convert the text chunks into numeric embeddings and load them into the vector database**

In [21]:
from langchain.embeddings import OpenAIEmbeddings

In [22]:
embeddings = OpenAIEmbeddings()

In [23]:
from langchain.vectorstores import FAISS

In [24]:
stored_embeddings = FAISS.from_documents(documents_chunks, embeddings)
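
Conceptually, the vector store keeps one vector per chunk and, at query time, returns the chunks whose vectors lie closest to the query vector. The sketch below shows that idea with a brute-force search over toy 3-dimensional vectors. Everything in it is illustrative: the vectors are made up, real embeddings have far more dimensions, and the helper names `euclidean` and `nearest_chunks` are hypothetical rather than part of FAISS or LangChain.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, store, k=2):
    """Return the texts of the k stored chunks closest to query_vec."""
    ranked = sorted(store, key=lambda item: euclidean(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks.
toy_store = [
    ("chunk about startups", [0.9, 0.1, 0.0]),
    ("chunk about Google", [0.1, 0.9, 0.0]),
    ("chunk about cooking", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]  # pretend embedding of a startup question
top = nearest_chunks(toy_query, toy_store, k=2)
print(top)
```

A dedicated index such as FAISS serves the same purpose but is built to stay fast when there are many vectors.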

**Create a Retrieval Question & Answering Chain**

In [26]:
from langchain.chains import RetrievalQA

In [28]:
# input_key: tells the chain which key of each input dict holds the user's question
QA_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=stored_embeddings.as_retriever(),
    input_key="question"
)

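Notice that we have added `input_key` to the `QA_chain` configuration. It tells the chain which key of each input dictionary holds the user's question. By default `RetrievalQA` expects that key to be `query`; setting it to `question` lets us pass the evaluation dictionaries below without renaming their keys.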
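
To make the roles of `chain_type="stuff"` and `input_key` concrete, here is a rough sketch of what a "stuff" step does: it reads the question from the input dictionary and places all retrieved chunks into a single prompt. The function `build_stuff_prompt` and the prompt wording are hypothetical, not LangChain's actual template.

```python
def build_stuff_prompt(inputs, retrieved_chunks, input_key="question"):
    """Put every retrieved chunk into one prompt, followed by the question."""
    question = inputs[input_key]  # raises KeyError if the key is missing
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks, for illustration only.
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
example_inputs = {
    "question": "Where is a whole neighborhood of YC-funded startups?",
    "answer": "In San Francisco",  # extra keys are simply ignored here
}
example_prompt = build_stuff_prompt(example_inputs, example_chunks)
print(example_prompt)
```

Because every chunk goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window.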


**We are going to evaluate this app with two questions whose answers we already know (these known answers are called "ground truth answers")**

In [29]:
questions_and_answers = [
    {
        'question' : "Where is a whole neighborhood of YC-funded startups?", 
        'answer' :"In San Francisco"},
    {
        'question' : "What may be the most valuable  thing Paul Buchheit made for Google?", 
        'answer' : "The motto Don't be evil"}
]

In [30]:
predictions = QA_chain.apply(questions_and_answers)

In [31]:
predictions

[{'question': 'Where is a whole neighborhood of YC-funded startups?',
  'answer': 'In San Francisco',
  'result': ' San Francisco'},
 {'question': 'What may be the most valuable  thing Paul Buchheit made for Google?',
  'answer': "The motto Don't be evil",
  'result': ' The phrase "Don\'t be evil."'}]

* `question` is the input query
* `answer` is the expected response (the ground truth)
* `result` is the app's response

**The app passes this evaluation: it answered both evaluation questions correctly.**

**But instead of confirming that manually, we can ask the LLM to check whether the responses match the "ground truth answers"**
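
A simple string comparison is not enough for this check, which is why a grader that understands meaning is useful. The sketch below applies a naive normalized substring match to the two answer pairs shown above. The helpers `normalize` and `naive_match` are hypothetical and only illustrate the limitation.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(no_punct.split())


def naive_match(expected, predicted):
    """True if either normalized string contains the other."""
    e, p = normalize(expected), normalize(predicted)
    return e in p or p in e


# The expected answers and app responses from the predictions above.
sample_predictions = [
    {"answer": "In San Francisco", "result": " San Francisco"},
    {"answer": "The motto Don't be evil", "result": ' The phrase "Don\'t be evil."'},
]
naive_verdicts = [naive_match(p["answer"], p["result"]) for p in sample_predictions]
print(naive_verdicts)
```

The second pair is reported as a mismatch ("motto" versus "phrase") even though a human reader would accept it. An LLM-based grader is meant to close that gap.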

In [33]:
from langchain.evaluation.qa import QAEvalChain

In [34]:
evaluation_chain = QAEvalChain.from_llm(llm)

In [35]:
evaluate_responses = evaluation_chain.evaluate(
    questions_and_answers,
    predictions,
    question_key="question",
    answer_key="answer"
)

In [36]:
evaluate_responses

[{'results': ' CORRECT'}, {'results': ' CORRECT'}]
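
With more than a handful of questions it helps to reduce the graded list to a single score. This is a minimal sketch that assumes each grade is a dictionary with a `results` key holding `CORRECT` or `INCORRECT`, as in the output above. Other LangChain versions may use a different key, so the `key` parameter is adjustable. The function name `accuracy` is hypothetical.

```python
def accuracy(graded, key="results"):
    """Fraction of graded items whose verdict is exactly CORRECT."""
    if not graded:
        return 0.0
    # Compare for equality: "INCORRECT" contains "CORRECT" as a substring,
    # so a substring test would count wrong answers as right.
    correct = sum(1 for g in graded if g[key].strip().upper() == "CORRECT")
    return correct / len(graded)


graded_sample = [{"results": " CORRECT"}, {"results": " CORRECT"}]
score = accuracy(graded_sample)
print(f"Accuracy: {score:.0%}")
```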