<img src="https://www.rp.edu.sg/images/default-source/default-album/rp-logo.png" width="200" alt="Republic Polytechnic"/>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/koayst-rplesson/SST_DP2025/blob/main/Day_02/L11/L11_Answer.ipynb)

# Setup and Installation

You can run this Jupyter notebook either on your local machine or on Google Colab.

* On a local machine, it is recommended to install Anaconda and create a new development environment called `SST_DP2025`.
* Install the libraries listed below with pip or conda when necessary.
---

# <font color='red'>ATTENTION</font>

## Google Colab
- If you are running this code in Google Colab, **DO NOT** store the API Key in a text file and load the key later from Google Drive. This is insecure and will expose the key.
- **DO NOT** hard code the API Key directly in the Python code, even though it might seem convenient for quick development.
- Enter the API key when prompted by the Python call `getpass.getpass()`.

## Local Environment/Laptop
- If you are running this code locally on your laptop, you can create an `env.txt` file and store the API key there.
- Make sure `env.txt` is in the same directory as this Jupyter notebook.
- You need to install `python-dotenv` and run the Python code below to load the API key.

---
```
%pip install python-dotenv

import os

from dotenv import load_dotenv

# load the key from env.txt in the notebook's directory
load_dotenv('env.txt')
openai_api_key = os.getenv('OPENAI_API_KEY')
```
---

## GitHub/GitLab
- **DO NOT** `commit` or `push` API Key to services like GitHub or GitLab.

## <font color="#FF0000">IMPORTANT</font>
If you are running this code in Google Colab, you need to run the commands below to download `MicrosoftEULA.txt` to Google Colab.

Comment out the code below with '#' if you are running it on your local machine.

In [None]:
!wget https://raw.githubusercontent.com/koayst-rplesson/SST_DP2025/refs/heads/main/Day_02/L11/MicrosoftEULA.txt
!dir

# Lesson 11

In [None]:
%%capture --no-stderr
%pip install --quiet -U langchain
%pip install --quiet -U langgraph
%pip install --quiet -U langchain-openai
%pip install --quiet -U grandalf
%pip install --quiet -U langchain-community
%pip install --quiet -U faiss-cpu
%pip install --quiet -U pytube
%pip install --quiet -U youtube-transcript-api

In [1]:
# grandalf               0.8
# langchain              0.3.11
# langgraph              0.2.59
# langchain-core         0.3.24
# langchain-openai       0.2.12
# langchain-community    0.3.12
# openai                 1.57.2
# pydantic               2.10.3
# pytube                 15.0.0
# faiss-cpu              1.9.0.post1
# youtube-transcript-api 0.6.3

In [2]:
import getpass
import os

# set up the OpenAI API key

# have your OpenAI API key ready and enter it when asked
os.environ["OPENAI_API_KEY"] = getpass.getpass()

 ········


In [3]:
# set up the LangChain API key
# Go to https://smith.langchain.com/ to register and get the key

os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
#os.environ["LANGCHAIN_TRACING_V2"] ="false"

 ········
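Both key prompts above follow the same pattern, so a small helper can skip the prompt when the variable is already set (for example, after `load_dotenv` on a local machine). This is only a minimal sketch: `ensure_env_key` and its `prompt` parameter are hypothetical names introduced here and are not part of the lesson code.

```python
import getpass
import os


def ensure_env_key(name, prompt=getpass.getpass):
    """Return the value of env var `name`, prompting for it only if missing."""
    value = os.environ.get(name)
    if not value:
        # Nothing set yet: ask interactively (getpass hides the typed input).
        value = prompt(f"{name}: ")
        os.environ[name] = value
    return value


# Possible usage in the notebook (prompts only when the variable is not set):
# ensure_env_key("OPENAI_API_KEY")
# ensure_env_key("LANGCHAIN_API_KEY")
```

The key still never appears in the notebook source, which keeps the guidance in the ATTENTION section intact.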


## Retrieval

Retrieve relevant information from an external data source and pass the data to the LLM.

### Load a text file
- Load a text file using `TextLoader`.
- Chunk the document using `CharacterTextSplitter`.
- Encode the document using `OpenAIEmbeddings` and store it in a vector store.
- Use `create_stuff_documents_chain` to create a chain for passing a list of Documents to a model.
- Use `create_retrieval_chain` to create a chain that retrieves documents and passes them to the chain above.
- `FAISS` is an open-source library designed for efficient similarity search and clustering of dense vectors.
- [LangChain Hub](https://blog.langchain.dev/langchain-prompt-hub/) is a home for uploading, browsing, pulling and managing prompts.
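To make the "stuff" step in the list above less abstract, here is a plain-Python sketch of the idea: all retrieved chunks are joined into one context block and placed into a prompt. It is an illustration under assumptions, not LangChain's implementation; the function name `stuff_documents`, the prompt wording, and the example chunks are all made up for this sketch.

```python
def stuff_documents(chunks, question):
    """Join all retrieved chunks into one context block and build a prompt."""
    # Assumption: chunks are separated by a blank line inside the context.
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )


# Hypothetical chunks standing in for what a retriever might return.
retrieved = [
    "The license is between you and the device manufacturer.",
    "Activation sends certain information to Microsoft.",
]
prompt = stuff_documents(retrieved, "Who is the agreement with?")
print(prompt)
```

Because every chunk is "stuffed" into a single prompt, this strategy only works while the combined chunks fit within the model's context window.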

In [4]:
# load langchain libraries

from langchain import hub

from langchain.chains.combine_documents.stuff import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

from langchain_community.document_loaders.text import TextLoader
from langchain_community.vectorstores import FAISS

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from langchain_text_splitters import CharacterTextSplitter

In [5]:
document_loader = TextLoader(
    "MicrosoftEULA.txt",
    encoding="utf-8"
)

documents = document_loader.load()

In [6]:
# the chunk_size (1000) is arbitrary. You may need to experiment to find the optimum size

text_splitter = CharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=0
)

text = text_splitter.split_documents(documents)

Created a chunk of size 1406, which is longer than the specified 1000
Created a chunk of size 1161, which is longer than the specified 1000
Created a chunk of size 1083, which is longer than the specified 1000
Created a chunk of size 1125, which is longer than the specified 1000
Created a chunk of size 1073, which is longer than the specified 1000
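The warnings above appear because a separator-based splitter only cuts at its separator, so a single piece that is longer than `chunk_size` is kept whole. The sketch below imitates that behaviour in plain Python. It is a simplified illustration, not the library's code; `split_on_separator` is a hypothetical name, and the blank-line separator is an assumption about the default.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole, which is why
    some chunks can exceed the requested size.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 30 + "\n\n" + "b" * 5 + "\n\n" + "c" * 5 + "\n\n" + "d" * 8
chunk_lengths = [len(chunk) for chunk in split_on_separator(sample, chunk_size=12)]
print(chunk_lengths)  # [30, 12, 8]
```

The first paragraph has no separator inside it, so it stays at 30 characters even though the limit is 12. If a hard upper bound matters, `RecursiveCharacterTextSplitter` from `langchain_text_splitters` is worth trying, since it falls back to smaller separators.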


In [7]:
embeddings = OpenAIEmbeddings()

vectorstore = FAISS.from_documents(text, embeddings)
retriever = vectorstore.as_retriever()
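Conceptually, the retriever embeds the query and returns the chunks whose vectors lie closest to it. The brute-force sketch below shows that idea with toy 2-D vectors. It is not how FAISS works internally (FAISS uses optimized indexes); the name `top_k`, the Euclidean distance, and the default `k=4` are assumptions chosen for illustration.

```python
import math


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    # Sort by distance (ties broken by index) and keep the k nearest.
    return [idx for _, idx in sorted(distances)[:k]]


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (-1.0, 0.0)]
query_vec = (1.0, 0.05)

nearest = top_k(query_vec, doc_vecs, k=2)
print(nearest)  # [1, 2]
```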

In [8]:
model = ChatOpenAI()

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

# You will get a warning if you have not set the LangChain API key
# LangSmithMissingAPIKeyWarning: API key must be provided when using hosted LangSmith API 

In [9]:
print(retrieval_qa_chat_prompt)

input_variables=['context', 'input'] optional_variables=['chat_history'] input_types={'chat_history': list[typing.Annotated[typing.Union[typing.Annotated[langchain_core.messages.ai.AIMessage, Tag(tag='ai')], typing.Annotated[langchain_core.messages.human.HumanMessage, Tag(tag='human')], typing.Annotated[langchain_core.messages.chat.ChatMessage, Tag(tag='chat')], typing.Annotated[langchain_core.messages.system.SystemMessage, Tag(tag='system')], typing.Annotated[langchain_core.messages.function.FunctionMessage, Tag(tag='function')], typing.Annotated[langchain_core.messages.tool.ToolMessage, Tag(tag='tool')], typing.Annotated[langchain_core.messages.ai.AIMessageChunk, Tag(tag='AIMessageChunk')], typing.Annotated[langchain_core.messages.human.HumanMessageChunk, Tag(tag='HumanMessageChunk')], typing.Annotated[langchain_core.messages.chat.ChatMessageChunk, Tag(tag='ChatMessageChunk')], typing.Annotated[langchain_core.messages.system.SystemMessageChunk, Tag(tag='SystemMessageChunk')], typing.

In [10]:
# it is OK to just use gpt-3.5-turbo for this sample code

model.model_name

'gpt-3.5-turbo'

In [11]:
question_answer_chain = create_stuff_documents_chain(model, retrieval_qa_chat_prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

In [12]:
response = chain.invoke({
    "input":"What is this document about? Summarize it in a paragraph"}
)

In [13]:
response['answer']

'This document is a Microsoft Software License Agreement for the Windows Operating System. It outlines the rights and conditions for using the Windows software, including the requirement to review the entire agreement and any supplemental license terms that accompany the software. By accepting the agreement or using the software, users consent to the transmission of certain information during activation and usage as per the privacy statement. It specifies that depending on how the software was obtained, the agreement is between the user and the device manufacturer, software installer, or Microsoft Corporation. The agreement emphasizes the importance of reviewing all terms, including any linked terms, before using the software or services. It also provides instructions on how to access and review the terms during software usage.'
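Besides `answer`, the dictionary returned by a retrieval chain should also carry the original `input` and the retrieved documents under `context`, which is handy for checking what the answer was based on. The sketch below runs offline on a stand-in response; `FakeDocument`, `summarize_sources`, and the sample chunk text are hypothetical and only mimic the expected shape.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(response, width=60):
    """List the source and a short preview of each retrieved chunk."""
    lines = []
    for doc in response.get("context", []):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:width].replace("\n", " ")
        lines.append(f"{source}: {preview}")
    return lines


# Made-up response shaped like the chain output (input, context, answer).
fake_response = {
    "input": "What is this document about?",
    "context": [
        FakeDocument("Example chunk about activation.\nSecond line.",
                     {"source": "MicrosoftEULA.txt"}),
    ],
    "answer": "A license agreement.",
}
print(summarize_sources(fake_response))
```

In the notebook, `summarize_sources(response)` could be called on the real `response` from the cell above, assuming it contains a `context` list.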

### Retrieve and Query a YouTube Video

Note that in this sample code, the only change from the text-file example is the document loader: `YoutubeLoader` replaces `TextLoader`.

In [14]:
from langchain import hub

from langchain_community.document_loaders import YoutubeLoader

from langchain.chains.combine_documents.stuff import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

from langchain_community.vectorstores import FAISS

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

from langchain_text_splitters import CharacterTextSplitter

In [15]:
# Youtube: How a Worm Could Save Humanity From Bad AI | Ramin Hasani | TED
# https://www.youtube.com/watch?v=x6oM9hQMjUY

loader = YoutubeLoader.from_youtube_url( 
    "https://www.youtube.com/watch?v=x6oM9hQMjUY",
    add_video_info=False,
    language=["en"]    
)

youtube_data = loader.load()

In [16]:
youtube_data

[Document(metadata={'source': 'x6oM9hQMjUY'}, page_content="My wildest dream is to design\nartificial intelligence that is our friend, you know. If you have an AI system\nthat helps us understand mathematics, you can solve the economy of the world. If you have an AI system\nthat can understand humanitarian sciences, we can actually solve\nall of our conflicts. I want this system to, given Einstein’s\nand Maxwell’s equations, take it and solve new physics, you know. If you understand physics,\nyou can solve the energy problem. So you can actually design ways for humans to be the better versions of themselves. I'm Ramin Hasani. I’m the cofounder and CEO of Liquid AI. Liquid AI is an AI company built\non top of a technology that I invented back at MIT. It’s called “liquid neural networks.” These are a form of flexible intelligence, as opposed to today's AI systems\nthat are fixed, basically. So think about your brain. You can change your thoughts. When somebody talks to you, you can compl

In [17]:
text_splitter = CharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=0
)

text = text_splitter.split_documents(youtube_data)

In [18]:
embeddings = OpenAIEmbeddings()

vectorstore = FAISS.from_documents(text, embeddings)
retriever = vectorstore.as_retriever()

In [19]:
model = ChatOpenAI()

retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [20]:
question_answer_chain = create_stuff_documents_chain(model, retrieval_qa_chat_prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

In [21]:
response = chain.invoke({
    "input" : "Summarize it in a paragraph"}
)

print(response['answer'])

Ramin Hasani, the CEO of Liquid AI, envisions designing artificial intelligence that can be a beneficial friend to humanity. By developing liquid neural networks inspired by the brain of the C. elegans worm, his company aims to create flexible and understandable AI systems that can solve complex problems and evolve beyond human capabilities. By understanding the behavior of the AI through transparent mathematics, Liquid AI seeks to provide a level of control over superintelligent technology, ensuring it does not pose a threat to humanity. Their goal is to demonstrate that powerful AI can coexist safely with humans, offering immense benefits while maintaining oversight and preventing potential doomsday scenarios caused by unchecked technology advancement.


In [22]:
response = chain.invoke({
    "input" : "What is the fascinating fact about the brain of the worm?"}
)

print(response['answer'])

The fascinating fact about the brain of the worm is that it shares 75 percent of its genome with humans.


## Evaluation
In LangChain, large language models (LLMs) can be evaluated in several ways, such as comparing chain outputs, pairwise string comparison, string distance, and embedding distance. These evaluations help identify the preferred model or output by analyzing the differences between outputs.

The evaluator types are documented [here](https://python.langchain.com/api_reference/langchain/evaluation/langchain.evaluation.schema.EvaluatorType.html#langchain.evaluation.schema.EvaluatorType).
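Of the methods mentioned above, string distance is the easiest to reproduce by hand. The sketch below computes a Levenshtein (edit) distance and a normalized variant in plain Python. It illustrates one possible string distance only; it is not claimed to be the metric LangChain's evaluator uses by default, and the function names are made up for this example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance to [0, 1]; 0 means identical strings."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
```

A distance like this needs no LLM call, so it is cheap, but it only measures surface similarity and not meaning.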

In [23]:
from langchain.evaluation import load_evaluator

In [24]:
help(load_evaluator)

# notice that `llm` is optional
# according to the documentation: The language model to use for evaluation, if none is provided, a default ChatOpenAI gpt-4 model will be used.

Help on function load_evaluator in module langchain.evaluation.loading:

load_evaluator(evaluator: langchain.evaluation.schema.EvaluatorType, *, llm: Optional[langchain_core.language_models.base.BaseLanguageModel] = None, **kwargs: Any) -> Union[langchain.chains.base.Chain, langchain.evaluation.schema.StringEvaluator]
    Load the requested evaluation chain specified by a string.
    
    Parameters
    ----------
    evaluator : EvaluatorType
        The type of evaluator to load.
    llm : BaseLanguageModel, optional
        The language model to use for evaluation, by default None
    **kwargs : Any
        Additional keyword arguments to pass to the evaluator.
    
    Returns
    -------
    Chain
        The loaded evaluation chain.
    
    Examples
    --------
    >>> from langchain.evaluation import load_evaluator, EvaluatorType
    >>> evaluator = load_evaluator(EvaluatorType.QA)



### Pairwise String Comparison

In [25]:
# before you run the code, don't forget to set up the OpenAI API key

evaluator = load_evaluator("labeled_pairwise_string")

In [26]:
text_01 = '''
LangChain is a Python library designed to make it easier to build applications with large language models, 
providing tools for chaining components and managing complex natural language processing workflows
'''

text_02 = '''
LangChain is a Python framework that connects learners and tutors worldwide, 
offering personalized lessons, real-time practice, and transparent payment using blockchain technology 
to ensure security and fairness.
'''

# some criteria require reference labels to work correctly
reference = '''
LangChain is a Python framework designed to streamline AI application development, focusing on real-time
data processing and integration with Large Language Models.
'''

eval_result = evaluator.evaluate_string_pairs(
    prediction = text_01,
    prediction_b = text_02,
    input = "describe LangChain in thirty words",
    reference = reference,
)

In [27]:
print(eval_result['reasoning'])

Assistant A's response is more accurate and relevant to the user's question. It correctly describes LangChain as a Python library designed for building applications with large language models and managing complex natural language processing workflows. On the other hand, Assistant B's response is incorrect as it describes LangChain as a platform that connects learners and tutors worldwide, which is not accurate. Therefore, Assistant A's response is more helpful, correct, and demonstrates a better depth of thought. 

Final Verdict: [[A]]
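The reasoning text ends with a bracketed verdict, which makes it easy to turn into a label or a number. The sketch below extracts it with a regular expression. The mapping to scores (1 for A, 0 for B, 0.5 for a tie marked as C) is an assumed convention for illustration, and `parse_verdict` is a hypothetical helper; the `eval_result` dictionary may already expose equivalent fields.

```python
import re


def parse_verdict(reasoning):
    """Extract the last pairwise verdict ('A', 'B', or 'C') from judge text."""
    matches = re.findall(r"\[\[([ABC])\]\]", reasoning)
    return matches[-1] if matches else None


# Assumed convention: 1 = first answer preferred, 0 = second, 0.5 = tie.
VERDICT_TO_SCORE = {"A": 1, "B": 0, "C": 0.5}

verdict = parse_verdict("... better depth of thought.\n\nFinal Verdict: [[A]]")
print(verdict, VERDICT_TO_SCORE[verdict])  # A 1
```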


### Predefined Criteria - Conciseness
According to `Cambridge Dictionary`, conciseness is the quality of being short and clear, and expressing what needs to be said without unnecessary words.

### Concise Example:

In [28]:
evaluator = load_evaluator("criteria", criteria = "conciseness")

In [29]:
concise='''
Generative AI is a type of artificial intelligence that can autonomously create new and original content, such as images, 
text, or other forms of data, using algorithms and models. It has the ability to generate outputs that mimic human-created 
content without relying solely on predefined patterns.
'''

eval_result = evaluator.evaluate_strings(
    prediction = concise,
    input = "What is generative AI?",
)

In [30]:
print(eval_result['reasoning'])

The criterion for this assessment is conciseness. 

The submission is a brief explanation of generative AI, providing a clear definition and an example of what it can do. It does not include unnecessary information or go off-topic. 

The submission is concise and to the point, therefore it meets the criterion.

Y


### Inconcise Example:

In [31]:
inconcise='''
In the vast landscape of artificial intelligence, generative AI emerges as a subset intricately enmeshed in a labyrinthine 
array of algorithms, particularly those rooted in the complex neural networks exemplified by the captivating Generative 
Adversarial Networks (GANs). This convoluted field empowers machines to autonomously and creatively navigate the expansive 
spectrum of content creation, spanning from vividly evocative imagery to the nuanced articulation found in various textual expressions. 
This intricate process, laden with multifaceted intricacies, tangentially mirrors the profound subtleties inherent in the intricate 
tapestry of human cognition and expressive exploration.
'''

eval_result = evaluator.evaluate_strings(
    prediction = inconcise,
    input = "What is generative AI?",
)

In [32]:
print(eval_result['reasoning'])

The criterion for this assessment is conciseness, which means the submission should be brief, to the point, and without unnecessary details or complex language.

Looking at the submission, it is clear that the answer is not concise. The language used is complex and verbose, with many unnecessary details and metaphors. The answer could have been much shorter and simpler, while still conveying the same information.

For example, the first sentence could have been simplified to: "Generative AI is a subset of artificial intelligence that uses algorithms like Generative Adversarial Networks (GANs)." This would have conveyed the same information in a much more concise manner.

Therefore, the submission does not meet the criterion of conciseness.

N


### Correctness
According to `Cambridge Dictionary`, correctness is the quality of being in agreement with the true facts or with what is generally accepted.

In [33]:
evaluator = load_evaluator("labeled_criteria", criteria = "correctness")

In [34]:
# Kopi C Kosong
# Black coffee with evaporated milk and no sugar – think of it as a cafe au lait
# Ref: https://thehoneycombers.com/singapore/order-kopi-singapore/

correctness_test_1='''
The name 'Kopi C Kosong' means 'empty coffee' which refers to the fact that no milk is added to the coffee.
'''

correctness_test_2='''
'Kopi C Kosong' is a coffee with evaporated milk but without any sugar.
'''

reference='''
'Kopi C Kosong' is a coffee with evaporated milk but without any sugar. It is one of the variations in a 
wide array of coffee styles available in the coffee shops (kopitiams) of Singapore and Malaysia.
'''

In [35]:
# test 1

eval_result = evaluator.evaluate_strings(
    input = "What is Kopi C Kosong?",
    prediction = correctness_test_1,
    reference = reference,
)

print(f'With ground truth: {eval_result["score"]}')

With ground truth: 0


In [36]:
# test 2

eval_result = evaluator.evaluate_strings(
    input = "What is Kopi C Kosong?",
    prediction = correctness_test_2,
    reference = reference,
)

print(f'With ground truth: {eval_result["score"]}')

With ground truth: 1
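The two tests above repeat the same call, so they can be folded into a loop that scores any number of predictions against one reference. The sketch below shows the loop with an offline stand-in so that it runs without an API key. `score_predictions` and `ExactPhraseEvaluator` are hypothetical names, and the stand-in's substring rule is far cruder than the LLM-based judgement.

```python
def score_predictions(evaluator, question, predictions, reference):
    """Run evaluator.evaluate_strings on each prediction and collect the scores."""
    scores = {}
    for name, prediction in predictions.items():
        result = evaluator.evaluate_strings(
            input=question, prediction=prediction, reference=reference
        )
        scores[name] = result["score"]
    return scores


class ExactPhraseEvaluator:
    """Hypothetical offline stand-in for the LLM-based evaluator."""

    def evaluate_strings(self, *, input, prediction, reference):
        # Crude rule: correct only if the prediction's text appears in the reference.
        return {"score": int(prediction.strip() in reference)}


demo_reference = (
    "'Kopi C Kosong' is a coffee with evaporated milk but without any sugar. "
    "It is one of the variations in a wide array of coffee styles."
)
demo_predictions = {
    "test_1": "The name 'Kopi C Kosong' means 'empty coffee'.",
    "test_2": "'Kopi C Kosong' is a coffee with evaporated milk but without any sugar.",
}
demo_scores = score_predictions(
    ExactPhraseEvaluator(), "What is Kopi C Kosong?", demo_predictions, demo_reference
)
print(demo_scores)  # {'test_1': 0, 'test_2': 1}
```

In the notebook, the `evaluator` loaded with `"labeled_criteria"` could be passed in place of the stand-in, since it is called through the same `evaluate_strings` method.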


### Custom Criteria
- LangChain supports custom criteria and predefined principles for evaluation.
- Custom criteria are defined as key-value pairs of the form {criterion_name : criterion_description}.
- These criteria can be used to assess outputs against specific requirements or rubrics.

**Note**
[LangChain](https://python.langchain.com/v0.1/docs/guides/productionization/evaluation/string/criteria_eval_chain/#custom-criteria): it's recommended that you create a single evaluator per criterion. This way, separate feedback can be provided for each aspect. Additionally, if you provide antagonistic criteria, the evaluator won't be very useful, as it will be configured to predict compliance for ALL

In [37]:
custom_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
    "precision": "Is the writing precise, with no unnecessary words or details?",
    "truthfulness": "Does the writing feel honest and sincere?"
}
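Following the note above about using a single evaluator per criterion, a multi-criterion dictionary like this one can be split into one-entry dictionaries. The sketch below does the split in plain Python and, in a separate function that is not called here, shows how each entry might be handed to `load_evaluator("criteria", ...)`. The helper names and `demo_criteria` are hypothetical, and the `"criteria"` evaluator is swapped in for illustration; it is not the pairwise evaluator this lesson uses.

```python
def split_criteria(criteria):
    """Turn one multi-criterion dict into a list of single-criterion dicts."""
    return [{name: description} for name, description in criteria.items()]


def build_evaluators(criteria):
    """Create one "criteria" evaluator per criterion (needs langchain and an API key)."""
    from langchain.evaluation import load_evaluator  # imported lazily on purpose

    return {
        name: load_evaluator("criteria", criteria={name: description})
        for name, description in criteria.items()
    }


# A shortened copy of the criteria, so the notebook's own variable is untouched.
demo_criteria = {
    "simplicity": "Is the language straightforward and unpretentious?",
    "clarity": "Are the sentences clear and easy to understand?",
}
single_criteria = split_criteria(demo_criteria)
print(single_criteria)
```

With one evaluator per criterion, each aspect gets its own reasoning and score instead of a single combined verdict.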

In [38]:
evaluator = load_evaluator("pairwise_string", criteria=custom_criteria)

In [39]:
eval_result = evaluator.evaluate_string_pairs(
    prediction="Every cheerful household shares a similar rhythm of joy; but sorrow, in each household, plays a unique, haunting melody.",
    prediction_b="Where one finds a symphony of joy, every domicile of happiness resounds in harmonious, "
                 "identical notes; yet, every abode of despair conducts a dissonant orchestra, each "
                 "playing an elegy of grief that is peculiar and profound to its own existence.",
    input="Write some prose about families.",
)

In [40]:
print(eval_result['reasoning'])

Assistant A's response is simpler, clearer, and more precise than Assistant B's. Assistant A uses straightforward language and clear sentences, making it easy to understand. On the other hand, Assistant B's response is more complex and uses more sophisticated language, which may make it harder for some users to understand. Both responses seem truthful and sincere, but Assistant A's response is more in line with the criteria provided. Therefore, Assistant A's response is better. 

Final Verdict: [[A]]
