## Dependencies

In [1]:
!pip install -qU langchain langchain-core langchain-community langchain-openai langchain-chroma beautifulsoup4 ragas neptune pandas datasets

In [None]:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
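
Hard-coding the key in a cell makes it easy to commit by accident. A minimal sketch of an alternative, assuming the key is exported in the shell before the notebook starts; `require_env` is a hypothetical helper, not part of any library used here, and `DEMO_API_KEY` is a throwaway name used only so the sketch runs without real credentials.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Throwaway variable for illustration only (hypothetical name and value).
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

In the notebook itself the call would be `require_env("OPENAI_API_KEY")`, which fails early instead of surfacing as an authentication error on the first API request.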

## ETL and Data Preparation

In [2]:
import bs4
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=[
        "https://neptune.ai/blog/llm-hallucinations",
        "https://neptune.ai/blog/llmops",
        "https://neptune.ai/blog/llm-guardrails"
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(name=["p", "h2", "h3", "h4"])
    ),
)
docs = loader.load()
len(docs)

USER_AGENT environment variable not set, consider setting it to identify your requests.


3

In [3]:
docs[0]

Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='Tell 120+K peers about your AI research → Learn more 💡Live Neptune projectHow deepsense.ai Tracked and Analyzed 120K+ Models Using NeptuneHow ReSpo.Vision Uses Neptune to Easily Track Training Pipelines at ScaleObservability in LLMOps: Different Levels of ScaleBreaking Down AI Research Across 3 Levels of DifficultyLive Neptune projectHow deepsense.ai Tracked and Analyzed 120K+ Models Using NeptuneHow ReSpo.Vision Uses Neptune to Easily Track Training Pipelines at ScaleObservability in LLMOps: Different Levels of ScaleBreaking Down AI Research Across 3 Levels of Difficulty\n            TL;DR        Hallucinations are an inherent feature of LLMs that becomes a bug in LLM-based applications.Causes of hallucinations include insufficient training data, misalignment, attention limitations, and tokenizer issues.Hallucinations can be detected by verifying the accuracy and reliability of the model’s respon

In [4]:
assert len(docs[0].page_content) > 0

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
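
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With the settings used above (1000 and 200), each chunk would repeat the last 200 characters of the previous one under this simplified scheme, which helps keep a sentence that straddles a boundary retrievable.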

In [6]:
# Filter out header and footer chunks.
header_footer_keywords = ["peers about your AI research", "deepsense", "ReSpo", "Was the article useful?", "related articles", "All rights reserved"]
splits = []
for s in text_splitter.split_documents(docs):
    kw_found = False
    for kw in header_footer_keywords:
        if kw in s.page_content:
            kw_found = True
            break
    if not kw_found:
        splits.append(s)

len(splits)

57
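
The same keyword filter can be sketched on plain strings, which makes it easy to check before running it on real chunks. `drop_boilerplate` is a hypothetical helper and the sample chunks are made up for illustration. Note that substring matching is case-sensitive, so a keyword such as "related articles" will not match "Related articles".

```python
def drop_boilerplate(chunks: list[str], keywords: list[str]) -> list[str]:
    """Keep only the chunks that contain none of the boilerplate keywords."""
    return [c for c in chunks if not any(kw in c for kw in keywords)]


# Made-up sample chunks, not taken from the scraped pages.
sample_chunks = [
    "Hallucinations are an inherent feature of LLMs.",
    "How deepsense.ai Tracked and Analyzed 120K+ Models Using Neptune",
    "Was the article useful? Thank you for your feedback!",
]
kept = drop_boilerplate(sample_chunks, ["deepsense", "Was the article useful?"])
print(kept)  # ['Hallucinations are an inherent feature of LLMs.']
```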

In [7]:
splits[:5]

[Document(metadata={'source': 'https://neptune.ai/blog/llm-hallucinations'}, page_content='TL;DR        Hallucinations are an inherent feature of LLMs that becomes a bug in LLM-based applications.Causes of hallucinations include insufficient training data, misalignment, attention limitations, and tokenizer issues.Hallucinations can be detected by verifying the accuracy and reliability of the model’s responses.Effective mitigation strategies involve enhancing data quality, alignment, information retrieval methods, and prompt engineering.In 2022, when GPT-3.5 was introduced with ChatGPT, many, like me, started experimenting with various use cases. A friend asked me if it could read an article, summarize it, and answer some questions, like a research assistant. At that time, ChatGPT had no tools to explore websites, but I was unaware of this. So, I gave it the article’s link. It responded with an abstract of the article. Since the article was a medical research paper, and I had no medical

In [8]:
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

In [9]:
# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever(search_kwargs={'k': 1})
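
Conceptually, a retriever with `k` set to 1 returns the single chunk whose embedding is closest to the query embedding. The sketch below illustrates that idea with toy three-dimensional vectors and cosine similarity. The vectors, the `toy_index` name, and the choice of cosine similarity are assumptions for illustration; the vector store may use a different distance metric and an approximate index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True)
    return ranked[:k]


# Toy embeddings (hypothetical values).
toy_index = {
    "guardrails": [0.9, 0.1, 0.0],
    "llmops": [0.1, 0.9, 0.2],
    "hallucinations": [0.2, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.1], toy_index, k=2))  # ['guardrails', 'hallucinations']
```

Raising `k` passes more chunks to the LLM, which can improve recall at the cost of longer prompts and more potentially irrelevant context.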

In [10]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [11]:
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables.base import Runnable
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What are DOM-based attacks?"})
print(response["answer"])

DOM-based attacks are a type of vulnerability that involves feeding harmful instructions to a system by hiding them within a website's code. This can occur when an attacker embeds malicious key phrases in parts of the HTML that are not visible to users, such as matching text color to the background or placing it in a style tag. The rendered page may appear normal to users, but the hidden instructions can affect how the system processes the content.


In [12]:
response.keys()

dict_keys(['input', 'context', 'answer'])

In [13]:
response['context']

[Document(metadata={'source': 'https://neptune.ai/blog/llm-guardrails'}, page_content='By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code 

In [14]:
question_answer_chain.invoke({"input": "What are DOM-based attacks?", "context": response['context']})

"DOM-based attacks are a type of vulnerability that involves injecting harmful instructions into a system by concealing them within a website's code. This can happen when a program crawls websites and sends the raw HTML to a language model, allowing attackers to hide malicious key phrases in parts of the HTML that are not visible to users. The objective is to exploit the way the system processes and renders the code, potentially leading to unauthorized actions."

In [15]:
def predict(chain: Runnable, query: str, context: list[Document] | None = None) -> dict:
    """
    Accepts a retrieval chain or a stuff documents chain. If the latter, context must be passed in.
    Returns a dict keyed by the query: {query: {"context": [...], "answer": "..."}}.
    """
    inputs = {"input": query}
    if context:
        inputs["context"] = context
    response = chain.invoke(inputs)
    if isinstance(response, str):
        # A stuff documents chain returns only the answer string, so rebuild the response dict.
        response = {**inputs, "answer": response}
    result = {
        response['input']: {
            "context": [d.page_content for d in response['context']],
            "answer": response['answer'],
        }
    }
    return result

## RAGAS

### Eval set generation

In [None]:
!pip install nltk

In [87]:
# import pandas as pd
# pd.set_option('display.max_colwidth', None)

In [16]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



In [17]:
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import AbstractQuerySynthesizer, SpecificQuerySynthesizer

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(
    splits, 
    testset_size=50, 
    query_distribution=[
        (AbstractQuerySynthesizer(llm=generator_llm), 0.1),
        (SpecificQuerySynthesizer(llm=generator_llm), 0.9),
    ],
)
dataset.to_pandas()

Applying [SummaryExtractor, HeadlinesExtractor]:   0%|          | 0/114 [00:00<?, ?it/s]

Generating Samples: 100%|██████████| 50/50 [00:13<00:00,  3.85it/s]                                                  


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,How users trick chatbot to bypass restrictions?,[By prompting the application to pretend to be...,Users trick chatbots to bypass restrictions by...,AbstractQuerySynthesizer
1,What distnguishes 'promt injecton' frm 'jailbr...,[Although “prompt injection” and “jailbreaking...,'Prompt injection' and 'jailbreaking' are dist...,AbstractQuerySynthesizer
2,DOM-based attacks exploit vulnerabilities web ...,[DOM-based attacksDOM-based attacks are an ext...,DOM-based attacks exploit vulnerabilities in w...,AbstractQuerySynthesizer
3,What are the challenges and benefits of fine-t...,[Fine-tuning and serving pre-trained Large Lan...,The challenges of fine-tuning Large Language M...,AbstractQuerySynthesizer
4,What Neptune and Transformers do in LLM fine-t...,[LLM Fine-Tuning and Model Selection Using Nep...,Neptune and Transformers are used in LLM fine-...,AbstractQuerySynthesizer
5,What role does reflection play in identifying ...,[After the responseCorrecting a hallucination ...,Reflection plays a role in identifying and cor...,SpecificQuerySynthesizer
6,What role does Giskard play in scanning LLM ap...,[Assessing an LLM application for vulnerabilit...,Giskard plays a role in scanning LLM applicati...,SpecificQuerySynthesizer
7,How does an LLM's architecture contribute to h...,[What causes LLMs to hallucinate?While there a...,An LLM's architecture contributes to hallucina...,SpecificQuerySynthesizer
8,What are some key practices involved in the ob...,[Large Language Model (LLM) Observability: Fun...,The key practices involved in the observabilit...,SpecificQuerySynthesizer
9,What are some examples of LLMs that utilize a ...,[Post-training or alignmentIt is hypothesized ...,Some examples of LLMs that utilize a reasoning...,SpecificQuerySynthesizer


In [18]:
dataset.to_pandas()['reference_contexts'][46]

['pipeline']

In [42]:
dataset.to_pandas()['user_input'].nunique()

48

In [55]:
# Remove duplicated questions
unique_indices = list(dataset.to_pandas().drop_duplicates(subset=['user_input']).index)
len(unique_indices)

48

In [56]:
# Remove unhelpful contexts/answers
testset_df = dataset.to_pandas()
not_helpful = list(testset_df[testset_df['reference'].str.contains("does not contain|does not provide|context does not|is insufficient|is incomplete", case=False, regex=True)].index)
len(not_helpful)

7

In [57]:
not_helpful

[12, 20, 21, 24, 27, 32, 43]

In [58]:
for x in not_helpful:
    if x in unique_indices:
        unique_indices.remove(x)

In [59]:
len(unique_indices)

41
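
The two clean-up steps above (keep the first occurrence of each question, then drop rows whose reference answer admits the context was insufficient) can be sketched in one pass over plain dictionaries. `select_eval_rows` is a hypothetical helper and the rows are made up; the regular expression is the one used in the cell above.

```python
import re

UNHELPFUL = re.compile(
    r"does not contain|does not provide|context does not|is insufficient|is incomplete",
    re.IGNORECASE,
)


def select_eval_rows(rows: list[dict]) -> list[int]:
    """Return indices of rows that are first occurrences and have a helpful reference."""
    seen = set()
    keep = []
    for i, row in enumerate(rows):
        if row["user_input"] in seen:
            continue  # later duplicate of an earlier question
        seen.add(row["user_input"])
        if UNHELPFUL.search(row["reference"]):
            continue  # reference admits the context was not enough
        keep.append(i)
    return keep


# Made-up rows for illustration.
rows = [
    {"user_input": "What are DOM-based attacks?", "reference": "They hide instructions in a page's code."},
    {"user_input": "What are DOM-based attacks?", "reference": "They extend prompt injection."},
    {"user_input": "What is a pipeline?", "reference": "The context does not provide enough detail."},
    {"user_input": "What do guardrails do?", "reference": "Guardrails validate LLM inputs and outputs."},
]
print(select_eval_rows(rows))  # [0, 3]
```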

In [60]:
ds = dataset.to_hf_dataset().select(unique_indices)
ds

Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name'],
    num_rows: 41
})

In [61]:
ds.to_csv("eval_data.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 247.45ba/s]


32380

### Run inference over eval set

In [62]:
predict(rag_chain, ds['user_input'][0])

{'How users trick chatbot to bypass restrictions?': {'context': ['By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered

In [63]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import Dataset

def concurrent_predict_retrieval_chain(chain: Runnable, dataset: Dataset):
    results = {}
    threads = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        for query in dataset['user_input']:
            threads.append(pool.submit(predict, chain, query))
        for task in as_completed(threads):
            results.update(task.result())
    return results

predictions = concurrent_predict_retrieval_chain(rag_chain, ds)

len(predictions.keys())

41
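
`as_completed` yields futures in the order they finish, not the order they were submitted, which is why the results are keyed by query rather than collected in a list. The sketch below shows the same pattern with a stand-in for `predict` and adds per-query error capture so that one failed request does not abort the whole batch. `fake_predict` and `run_concurrently` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_predict(query: str) -> dict:
    """Stand-in for predict(): returns a dict keyed by the query."""
    return {query: {"answer": query.upper()}}


def run_concurrently(queries: list[str], max_workers: int = 5) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which query each future belongs to, so failures can be attributed.
        futures = {pool.submit(fake_predict, q): q for q in queries}
        for future in as_completed(futures):
            try:
                results.update(future.result())
            except Exception as exc:
                results[futures[future]] = {"error": repr(exc)}
    return results


out = run_concurrently(["a", "b", "a"])
print(len(out))  # 2, because duplicate queries collapse into one key
```

Keying by query also means duplicate questions overwrite each other, which is one reason to deduplicate the eval set before running inference.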

In [64]:
predictions[next(iter(predictions.keys()))]

{'context': ['potential cost increases with scaling.Fine-tuning and serving pre-trained Large Language ModelsAs needs become more specific and off-the-shelf APIs prove insufficient, teams progress to fine-tuning pre-trained models like Llama-2-70B or Mistral 8x7B. This middle ground balances customization and resource management, so teams can adapt these models to niche use cases or proprietary data sets.The process is more resource-intensive than using APIs directly. However, it provides a tailored experience that leverages the inherent strengths of pre-trained models without the exorbitant cost of training from scratch. This stage introduces challenges such as the need for quality domain-specific data, the risk of overfitting, and navigating potential licensing issues.                LLM Fine-Tuning and Model Selection Using Neptune and Transformers            Training and serving LLMsFor larger organizations or dedicated research teams, the journey may involve training LLMs from scr

### Evaluation Metrics
Reference: https://docs.ragas.io/en/stable/getstarted/rag_evaluation/

In [None]:
!pip install rapidfuzz

In [65]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity, NoiseSensitivity
from ragas import EvaluationDataset
from ragas import evaluate

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

metrics = [
    LLMContextRecall(llm=evaluator_llm), 
    FactualCorrectness(llm=evaluator_llm), 
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings),
    NoiseSensitivity(llm=evaluator_llm),
]

In [66]:
# map predictions back to eval set
ds_k_1 = ds.map(lambda example: {"response": predictions[example["user_input"]]['answer'], "retrieved_contexts": predictions[example['user_input']]["context"]})

Map: 100%|██████████| 41/41 [00:00<00:00, 5500.99 examples/s]


In [67]:
ds_k_1

Dataset({
    features: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name', 'response', 'retrieved_contexts'],
    num_rows: 41
})

In [68]:
results = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k_1), metrics=metrics)
df = results.to_pandas()
df.head()

Evaluating:  19%|█▉        | 39/205 [00:15<01:20,  2.07it/s]Exception raised in Job[151]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
Evaluating:  71%|███████   | 145/205 [01:03<00:18,  3.23it/s]Exception raised in Job[186]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
Evaluating:  92%|█████████▏| 188/205 [01:22<00:08,  2.09it/s]Exception raised in Job[131]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
Evaluating: 100%|██████████| 205/205 [01:38<00:00,  2.08it/s]


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,factual_correctness,faithfulness,semantic_similarity,noise_sensitivity_relevant
0,How users trick chatbot to bypass restrictions?,[By prompting the application to pretend to be...,[By prompting the application to pretend to be...,Users can trick chatbots to bypass restriction...,Users trick chatbots to bypass restrictions by...,1.0,0.73,0.666667,0.919034,0.166667
1,What distnguishes 'promt injecton' frm 'jailbr...,"[one day, you suddenly find an ineligible cand...",[Although “prompt injection” and “jailbreaking...,Prompt injection aims to manipulate an applica...,'Prompt injection' and 'jailbreaking' are dist...,0.0,0.0,0.833333,0.904588,0.833333
2,DOM-based attacks exploit vulnerabilities web ...,[By prompting the application to pretend to be...,[DOM-based attacksDOM-based attacks are an ext...,"Yes, DOM-based attacks exploit vulnerabilities...",DOM-based attacks exploit vulnerabilities in w...,1.0,0.89,1.0,0.946643,0.666667
3,What are the challenges and benefits of fine-t...,[potential cost increases with scaling.Fine-tu...,[Fine-tuning and serving pre-trained Large Lan...,The challenges of fine-tuning Large Language M...,The challenges of fine-tuning Large Language M...,1.0,0.83,1.0,0.993499,0.25
4,What Neptune and Transformers do in LLM fine-t...,[potential cost increases with scaling.Fine-tu...,[LLM Fine-Tuning and Model Selection Using Nep...,Neptune and Transformers are used in LLM fine-...,Neptune and Transformers are used in LLM fine-...,1.0,0.0,0.0,0.935244,0.0


In [69]:
df.tail()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,factual_correctness,faithfulness,semantic_similarity,noise_sensitivity_relevant
36,How were users able to manipulate ChatGPT by m...,[By prompting the application to pretend to be...,[By prompting the application to pretend to be...,Users manipulated ChatGPT by prompting it to a...,Users were able to manipulate ChatGPT by promp...,1.0,0.4,0.75,0.929504,0.25
37,What are the differences between online and ba...,[models in production.The main difference is t...,"[LLM deployment: Serving, monitoring, and obse...",I don't know.,Online inference mode in LLM deployment involv...,0.0,,0.0,0.739923,0.0
38,What role does containerization play in the de...,"[language understanding (e.g., using techniqu...","[DeploymentDeploy models through pipelines, ty...",Containerization plays a crucial role in the d...,Containerization plays a role in the deploymen...,1.0,0.8,0.125,0.979062,0.0
39,What r the key compnents n functons of a pipline?,[pipelineImplement the inference componentTest...,[pipeline],The key components of a pipeline typically inc...,The key components and functions of a pipeline...,0.0,0.29,0.1,0.941395,0.0
40,What are the key vulnerabilities in LLMs that ...,[key guardrail methodsThe common vulnerabiliti...,[LLM guardrails are small programs that valida...,The key vulnerabilities in LLMs that guardrail...,The key vulnerabilities in LLMs that LLM guard...,0.0,0.17,0.428571,0.907696,0.0


In [70]:
df.to_csv("eval_results.csv", index=False)

## Neptune Experiment Tracking

In [71]:
os.environ["NEPTUNE_PROJECT"] = "your_workspace/your_project"
os.environ["NEPTUNE_API_TOKEN"] = "your_neptune_API_token"

In [72]:
import neptune

run = neptune.init_run()



[neptune] [info   ] Neptune initialized. Open in the app: https://app.neptune.ai/community/building-RAG-using-LangChain/e/BUIL1-5


In [None]:
# Upload the eval file

run["eval_data"].upload("eval_results.csv")

In [74]:
# Track metrics for each row of the eval set, and the overall metric.
import pandas as pd

def log_detailed_metrics(results_df: pd.DataFrame, run: neptune.Run):
    for i, row in results_df.iterrows():
        for m in metrics:
            val = row[m.name]
            run[f"eval/q{i}/{m.name}"].append(val)
    
    overall_metrics = results_df[[m.name for m in metrics]].mean(axis=0).to_dict()
    for k, v in overall_metrics.items():
        run[f"eval/overall/{k}"].append(v)

In [75]:
overall_metrics = df[[m.name for m in metrics]].mean(axis=0).to_dict()
overall_metrics

{'context_recall': 0.7317073170731707,
 'factual_correctness': 0.4694736842105263,
 'faithfulness': 0.615642487593707,
 'semantic_similarity': 0.9280948510865946,
 'noise_sensitivity_relevant': 0.29297856614929785}
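
One caveat when reading these averages: pandas skips NaN values by default when computing a mean, so a metric whose jobs failed (the evaluation log above shows three exceptions, and row 37 has an empty `factual_correctness`) is averaged over fewer rows than the others. A minimal sketch of the same behaviour in plain Python, with `nan_mean` as a hypothetical helper and illustrative scores:

```python
import math


def nan_mean(values: list[float]) -> tuple[float, int]:
    """Mean over non-NaN values, plus the number of values that were actually used."""
    valid = [v for v in values if not math.isnan(v)]
    if not valid:
        return math.nan, 0
    return sum(valid) / len(valid), len(valid)


# Illustrative scores: three valid values and one failed (NaN) evaluation.
scores = [0.73, 0.0, float("nan"), 0.89]
mean, n_valid = nan_mean(scores)
print(round(mean, 2), n_valid)  # 0.54 3
```

Reporting the count of valid rows next to each mean makes it clear when two metrics are not averaged over the same set of questions.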

In [76]:
log_detailed_metrics(df, run)



## Iterate on RAG system

In [None]:
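from itertools import repeat

def concurrent_predict(chain: Runnable, dataset: Dataset, k: int = 1):
    """Uses the stuff documents chain, and thus needs context.

    Assumes `dataset` has a `context_{k}` column with the context for each row
    in the form `predict` expects; this notebook does not create that column.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=5) as pool:
        # pool.map expects one iterable per argument, so repeat the chain for every row.
        for result in pool.map(predict, repeat(chain), dataset['user_input'], dataset[f'context_{k}']):
            results.update(result)
    return results

predictions = concurrent_predict(question_answer_chain, ds)

len(predictions.keys())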

In [77]:
for k in [3, 5]:
    retriever_k = vectorstore.as_retriever(search_kwargs={'k': k})
    rag_chain_k = create_retrieval_chain(retriever_k, question_answer_chain)
    predictions_k = concurrent_predict_retrieval_chain(rag_chain_k, ds)

    # map predictions back to eval set
    ds_k = ds.map(lambda example: {
        "response": predictions_k[example["user_input"]]['answer'], 
        "retrieved_contexts": predictions_k[example['user_input']]["context"]
    })

    results_k = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k), metrics=metrics)
    df_k = results_k.to_pandas()
    # Write one file per k so later iterations do not overwrite earlier results.
    df_k.to_csv(f"eval_results_k_{k}.csv", index=False)

    log_detailed_metrics(df_k, run)

Map: 100%|██████████| 41/41 [00:00<00:00, 4247.66 examples/s]
Evaluating:  37%|███▋      | 75/205 [00:46<01:49,  1.19it/s]Exception raised in Job[186]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
Evaluating: 100%|██████████| 205/205 [02:35<00:00,  1.32it/s]
Map: 100%|██████████| 41/41 [00:00<00:00, 4820.50 examples/s]
Evaluating: 100%|██████████| 205/205 [03:32<00:00,  1.04s/it]
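
Because every iteration appends to the same `eval/overall/...` series, the value of `k` is only implied by the position in the series. One possible alternative, sketched here as an assumption rather than the approach used in this notebook, is to build a separate field path per `k`. `metric_paths` is a hypothetical helper and the path layout is illustrative; the two scores are the rounded `k = 1` values reported above.

```python
def metric_paths(k: int, overall: dict[str, float]) -> dict[str, float]:
    """Map each overall metric to a field path that encodes the retriever's k."""
    return {f"eval/k_{k}/overall/{name}": value for name, value in overall.items()}


paths = metric_paths(1, {"context_recall": 0.7317, "faithfulness": 0.6156})
for path, value in paths.items():
    print(path, value)
```

Each path and value pair could then be logged to the run, keeping the configurations side by side instead of in a single series.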


In [78]:
# End the run
run.stop()

[neptune] [info   ] Shutting down background jobs, please wait a moment...
[neptune] [info   ] Done!
[neptune] [info   ] All 0 operations synced, thanks for waiting!
[neptune] [info   ] Explore the metadata in the Neptune app: https://app.neptune.ai/community/building-RAG-using-LangChain/e/BUIL1-5/metadata
