# Langchain x Phoenix

The following notebook showcases the power of LangChain and Phoenix combined.

I am quite passionate about the Trust and Safety domain, and thus thought there would be lots of utility in scraping and building a RAG around this website: https://features.integrityinstitute.org/. I call it the Integrity RAG. For the sake of simplicity, I am focusing on a subset of online harms.

## Scope of Notebook

The order of steps is the following:
1. Set up OpenAI, Phoenix, and LangChain
2. Build the Integrity RAG
3. Generate LLM-based questions (substituting for user questions)
4. Generate RAG-based answers
5. Generate LLM-based evaluations (substituting for user-based evaluations)

## Setup Phoenix

Setup is quite simple: connect Phoenix to LangChain with the LangChainInstrumentor, which traces LangChain calls automatically.

Side note: this code was hard to find in the onboarding flow. The LlamaIndex equivalent was much easier to find.

In [2]:
import os
import nest_asyncio
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

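# Read the API key from the environment; never hard-code a secret in a notebook.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set the OPENAI_API_KEY environment variable before running this cell.")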
nest_asyncio.apply()
LangChainInstrumentor().instrument()
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x28e9793d0>

## Setup the RAG

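This cell sets up a simple RAG: a RecursiveUrlLoader crawls the site, the text is split into 1,000-character chunks with a 200-character overlap, and the retriever returns the top k=6 chunks. None of these settings were tuned.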

In [3]:
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders import RecursiveUrlLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_openai import OpenAIEmbeddings
from langchain_openai import ChatOpenAI
from langchain import hub

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

URL = "https://features.integrityinstitute.org/" 

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

loader = RecursiveUrlLoader(url=URL, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()


splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_chunks = splitter.split_documents(docs)
filtered_chunks = filter_complex_metadata(all_chunks)

vector_store = Chroma.from_documents(documents=filtered_chunks, embedding=OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 6})

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Next step: generate LLM-written questions to simulate users.

# Feedback notes on Phoenix:
#  - It's hard to find out how to integrate Phoenix with LangChain.
#  - Once you do, the Phoenix API is not obvious.
#  - Why doesn't the Phoenix UI have an easy way to generate LLM data and use it for evals?
#  - Why doesn't the Phoenix UI have buttons to run and inspect evals?
#  - Too much pandas and too many steps: I had to use one LLM to create dummy data,
#    and then another LLM to actually do the evaluation.
#  - The onboarding is crazy.
#  - Give open source onboarding ideas.
#  - Give non-open source onboarding ideas.
#  - Why not support open source models as well?
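
To make the chunking settings used above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a plain sliding window and is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper used only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper: a simplified stand-in for a real text splitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The overlap means the last 200 characters of one chunk are repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.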



## Generate LLM Based Questions

This section generates questions with an OpenAI model (gpt-3.5-turbo in the cell below) to create dummy data for our RAG. In practice, we would ideally use real user questions.

That being said, llm_generate should offer a flag that concatenates the input columns onto its output, so we don't have to do it manually. Ideally, llm_generate does all the work for us! #TODO

In [4]:
import json
from phoenix.experimental.evals import OpenAIModel, llm_generate
import pandas as pd


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}

generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the below query.

You are a Teacher/Professor. Your task is to set up \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""


chunks_df = pd.DataFrame({"text": [doc.page_content for doc in filtered_chunks]})

questions_df = llm_generate(
    dataframe=chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model_name="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)


questions_chunks_df = pd.concat([questions_df, chunks_df], axis=1)
questions_chunks_df = questions_chunks_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
questions_chunks_df = questions_chunks_df[
    questions_chunks_df["question"].notnull()
]

llm_generate |██████████| 144/144 (100.0%) | ⏳ 00:48<00:00 |  2.96it/s
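
The concat-and-melt step above is easier to follow on toy data. The sketch below assumes the parsed output has one column per JSON key (`question_1` to `question_3`, plus `__error__` when parsing fails); `to_long_questions` is a hypothetical helper, not part of Phoenix. Unlike the cell above, it melts only the `question_*` columns, so parse-error strings cannot end up in the `question` column.

```python
import pandas as pd


def to_long_questions(chunks_df: pd.DataFrame, questions_df: pd.DataFrame) -> pd.DataFrame:
    """Pair every generated question with the chunk it came from (one row per question)."""
    # Keep only the question columns; this drops helper columns such as "__error__".
    question_cols = [c for c in questions_df.columns if c.startswith("question_")]
    wide = pd.concat([chunks_df, questions_df[question_cols]], axis=1)
    # Wide -> long: one (text, question) row per generated question.
    long = wide.melt(id_vars=["text"], value_vars=question_cols, value_name="question")
    return long.drop(columns="variable").dropna(subset=["question"]).reset_index(drop=True)


# Toy data: the first chunk parsed cleanly, the second produced a JSON error.
toy_chunks = pd.DataFrame({"text": ["chunk A", "chunk B"]})
toy_questions = pd.DataFrame(
    [
        {"question_1": "A1?", "question_2": "A2?", "question_3": "A3?"},
        {"__error__": "Expecting value"},
    ]
)
toy_long = to_long_questions(toy_chunks, toy_questions)
print(toy_long)
```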


## Answer Questions Based on RAG

This section answers the sample questions using our RAG.

In [7]:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    _ = list(executor.map(rag_chain.invoke, questions_chunks_df["question"]))

RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-BDBDjjZSO7DkFd7sFj5HpbS1 on tokens per min (TPM): Limit 60000, Used 58791, Requested 1275. Please try again in 66ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
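
The run above hit OpenAI's tokens-per-minute limit. One common way to handle this is to retry with exponential backoff; the sketch below is a generic version of that idea, not something Phoenix or LangChain provides. `call_with_backoff` and all of its parameters are hypothetical names chosen for illustration.

```python
import random
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(*args), retrying on the given exceptions with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            # Wait base_delay * 1, 2, 4, ... plus random jitter so workers don't retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Demo with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky_double(x):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return x * 2


recorded_delays = []
result = call_with_backoff(
    flaky_double, 21, base_delay=0.01, retry_on=(RuntimeError,), sleep=recorded_delays.append
)
print(result, attempts["count"], len(recorded_delays))
```

In this notebook, the same wrapper could be mapped over the questions with `rag_chain.invoke` as `fn` and the OpenAI client's rate-limit exception as `retry_on`. Lowering `max_workers` is the other obvious lever.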

## Evaluation

This section evaluates the RAG using GPT.

The inefficiencies start to show here: this is quite hard to use. We have to:

1. Define a RelevanceEvaluator
2. Massage the dataframe
3. Compute nDCG ourselves
4. Compute precision ourselves
5. Compute hit rate ourselves


Almost all of this code could be folded into a single "evaluate and log to Phoenix" call. The API is not declarative at all.
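
For readers unfamiliar with the three retrieval metrics named above, here is a small hand-rolled sketch on a single query. It assumes binary relevance labels (1 = relevant, 0 = not) listed in the order the retriever ranked the documents. It is not the scikit-learn implementation, and all function names are hypothetical.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(relevance[:k]) / k


def hit_at_k(relevance, k):
    """True if at least one of the top-k retrieved documents is relevant."""
    return any(r > 0 for r in relevance[:k])


def dcg_at_k(relevance, k):
    """Discounted cumulative gain: relevance discounted by log2 of the rank."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k):
    """DCG normalised by the DCG of the ideal ordering (most relevant first)."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Six retrieved documents (k=6 in the retriever); only ranks 2 and 3 are relevant.
relevance = [0, 1, 1, 0, 0, 0]
print(precision_at_k(relevance, 2))  # 0.5
print(hit_at_k(relevance, 2))        # True
print(ndcg_at_k(relevance, 2))
```

nDCG@2 is below 1 here because a relevant document was available for rank 1 but the retriever placed an irrelevant one there.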

In [1]:
import numpy as np
import pandas as pd
import phoenix as px
from phoenix.experimental.evals import (
    OpenAIModel,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import get_retrieved_documents
from sklearn.metrics import ndcg_score


retrieved_documents_df = get_retrieved_documents(px.Client())
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model_name="gpt-4-turbo-preview"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]

documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)

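def _compute_ndcg(df, k):
    """Compute nDCG@k for one retrieval span, padding so there are at least two documents."""
    n = max(len(df), 2)
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    # Column names are singular: "eval_score" (LLM relevance) and "document_score" (retriever score).
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score

    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        # Raised when a score is missing (NaN); report the metric as undefined.
        return np.nan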

ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)

precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)

hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)

retrievals_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ndcg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)



KeyboardInterrupt: 

NameError: name 'documents_with_relevance_df' is not defined