# Setup
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pprados/langchain-references/blob/master/langchain_reference.ipynb)


In [15]:
!python -m pip -q install --upgrade pip

In [16]:
# Document loading, retrieval methods and text splitting
%pip install -qU wikipedia

%pip install -qU langchain-references
%pip install -qU langchain-community
%pip install -qU langchain-text-splitters

# Local vector store via Chroma
%pip install -qU langchain-chroma

# Inference and embeddings
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [17]:
import langchain_references

langchain_references.__version__

'0.0.0'

# Document loading, retrieval methods and text splitting
Load documents from Wikipedia. Each document is truncated to 2,000 characters (`doc_content_chars_max`), which keeps the chunks small enough for processing.

In [18]:
import os
from pprint import pprint

os.environ["USER_AGENT"] = "langchain-references"

from langchain_community.retrievers import WikipediaRetriever

documents = WikipediaRetriever(
    top_k_results=10,
    doc_content_chars_max=2000
).invoke("mathematic")


In [19]:
pprint([(doc.metadata["title"], doc.metadata["source"]) for doc in documents])

[('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics'),
 ('Mathematical Reviews', 'https://en.wikipedia.org/wiki/Mathematical_Reviews'),
 ('List of mathematics competitions',
  'https://en.wikipedia.org/wiki/List_of_mathematics_competitions'),
 ('Applied mathematics', 'https://en.wikipedia.org/wiki/Applied_mathematics'),
 ('List of mathematics awards',
  'https://en.wikipedia.org/wiki/List_of_mathematics_awards'),
 ('Group (mathematics)', 'https://en.wikipedia.org/wiki/Group_(mathematics)'),
 ('Indian mathematics', 'https://en.wikipedia.org/wiki/Indian_mathematics'),
 ('Mathematical sciences',
  'https://en.wikipedia.org/wiki/Mathematical_sciences'),
 ('Encyclopedia of Mathematics',
  'https://en.wikipedia.org/wiki/Encyclopedia_of_Mathematics')]


In [20]:
import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass()

In [21]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings()
model = ChatOpenAI(model="gpt-4o-mini")

Load the documents into a vector store.

In [22]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(documents=documents,
                                    embedding=embeddings,
                                    )

Combine the documents into a single context string, giving each document a unique numeric identifier.

In [23]:
def format_docs(docs):
    # Wrap each document in a <document> tag with a 1-based id so that the LLM can reference it
    return "\n".join(
        f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
        for i, doc in enumerate(docs)
    )
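
To make the shape of the context string concrete, here is a minimal sketch that runs the same formatting on two tiny documents. The `Doc` class is a hypothetical stand-in for LangChain's `Document` (only `page_content` is used), and the sample texts are made up.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


def format_docs(docs):
    # Wrap each document in a <document> tag whose id is its 1-based position
    return "\n".join(
        f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
        for i, doc in enumerate(docs)
    )


formatted = format_docs([Doc("Chess is a game."), Doc("An olympiad is a contest.")])
print(formatted)
```

The id is simply the position of the document in the list, so the order of the list passed to the chain matters when citations are resolved later.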


# Manage references with langchain-references

Create a prompt with `{format_references}`, `{context}` and `{question}` placeholders.

In [24]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

RAG_TEMPLATE = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved documents to answer the question. 
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

{format_references}
  
<documents>
{context}
</documents>

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

Build a `context` runnable that adds the formatted documents (`context`) and the citation instructions (`format_references`) to the input.

In [25]:
from langchain_references import FORMAT_REFERENCES

context = RunnablePassthrough.assign(
    context=lambda input: format_docs(input["documents"]),
    format_references=lambda _: FORMAT_REFERENCES,
)
pprint(FORMAT_REFERENCES)

('When referencing the documents, add a citation right after. Use '
 '"[NUMBER](id=ID_NUMBER)" for the citation (e.g. "The Space Needle is in '
 'Seattle [1](id=55)[2](id=12).").')
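
The citation format requested by these instructions is easy to recognize mechanically. As a rough illustration (the regular expression below is an assumption for this sketch, not something taken from the library), the markers can be extracted like this:

```python
import re

# Matches citations of the form [NUMBER](id=ID_NUMBER)
CITATION = re.compile(r"\[(\d+)\]\(id=(\d+)\)")

text = "The Space Needle is in Seattle [1](id=55)[2](id=12)."
citations = [(int(num), int(doc_id)) for num, doc_id in CITATION.findall(text)]
print(citations)  # [(1, 55), (2, 12)]
```

Each pair holds the number chosen by the model and the id of the document it refers to.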


Create a chain with the `context`, `rag_prompt` and `model`.

In [26]:
from langchain_core.output_parsers import StrOutputParser

# Build the chain without `manage_references()`
chain = (
        context
        | rag_prompt
        | model
)

Select documents similar to the question.

In [27]:
question = "What is the difference kind of games and competition of mathematics?"

docs = vectorstore.similarity_search(question, k=6)
pprint([(d.metadata["title"], d.metadata["source"]) for d in docs])

[('Mathematical game', 'https://en.wikipedia.org/wiki/Mathematical_game'),
 ('List of mathematics competitions',
  'https://en.wikipedia.org/wiki/List_of_mathematics_competitions'),
 ('List of mathematics competitions',
  'https://en.wikipedia.org/wiki/List_of_mathematics_competitions'),
 ('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics')]
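
Conceptually, a similarity search embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below illustrates the idea with cosine similarity on made-up 3-dimensional vectors. The metric, the `top_k` helper, and the vectors are assumptions for illustration; the metric and index that Chroma actually uses may differ.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, indexed, k):
    """Return the k titles whose vectors are most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [title for title, _ in ranked[:k]]


# Toy 3-dimensional "embeddings" (made up for illustration)
indexed = [
    ("Mathematical game", [0.9, 0.1, 0.0]),
    ("History of mathematics", [0.1, 0.9, 0.2]),
    ("List of mathematics competitions", [0.7, 0.2, 0.6]),
]
query = [0.8, 0.0, 0.5]
nearest = top_k(query, indexed, k=2)
print(nearest)  # ['List of mathematics competitions', 'Mathematical game']
```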


Invoke the chain with the `documents` and `question`, but without `manage_references()`. The answer contains raw citations such as `[1](id=1)`, which point to document ids rather than to the sources.

In [28]:
answer = (chain | StrOutputParser()).invoke({"documents": docs, "question": question})
answer

'Mathematical games are typically defined by clear rules and strategies that can be analyzed mathematically, often focusing on recreational aspects without requiring deep mathematical knowledge to play [1](id=1). In contrast, mathematics competitions involve participants completing tests that may require problem-solving skills, proofs, or detailed answers, often under competitive conditions [2](id=2). Thus, the former emphasizes play and strategy, while the latter emphasizes assessment and skill in mathematics.'

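Now wrap `rag_prompt | model` with `manage_references()` and invoke the chain with the same `documents` and `question`. The raw citations are replaced with links to the sources, and a list of references is appended to the answer.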

In [29]:
from langchain_references import manage_references

managed_chain = context | manage_references(rag_prompt | model)

In [30]:

answer = (managed_chain | StrOutputParser()).invoke(
    {"documents": docs, "question": question})
pprint(answer)

('Mathematical games are generally defined by clear rules and strategies, '
 'allowing players to engage without requiring deep mathematical knowledge, '
 'focusing instead on enjoyment and skill '
 '<sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup> In '
 'contrast, mathematics competitions, such as mathematical olympiads, involve '
 'participants completing tests that may include multiple-choice questions or '
 'proofs, emphasizing problem-solving and mathematical expertise '
 '<sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup> '
 'Thus, the primary distinction lies in the recreational nature of games '
 'versus the academic challenge of competitions.\n'
 '\n'
 '- **1** [Mathematical '
 'game](https://en.wikipedia.org/wiki/Mathematical_game)\n'
 '- **2** [List of mathematics '
 'competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)\n')


In [31]:
# Print in markdown format
from IPython.display import display, Markdown

display(Markdown(answer))

Mathematical games are generally defined by clear rules and strategies, allowing players to engage without requiring deep mathematical knowledge, focusing instead on enjoyment and skill <sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup> In contrast, mathematics competitions, such as mathematical olympiads, involve participants completing tests that may include multiple-choice questions or proofs, emphasizing problem-solving and mathematical expertise <sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup> Thus, the primary distinction lies in the recreational nature of games versus the academic challenge of competitions.

- **1** [Mathematical game](https://en.wikipedia.org/wiki/Mathematical_game)
- **2** [List of mathematics competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)
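
The effect of `manage_references()` on the answer can be approximated with a few lines of plain Python. This is only an illustrative sketch, not the library's implementation: the `resolve_citations` helper and the dict-based documents are hypothetical, and it assumes that each `id` is the 1-based position of a document in the list passed to the chain (as produced by `format_docs`).

```python
import re

# Matches citations of the form [NUMBER](id=ID_NUMBER); only the id is captured
CITATION = re.compile(r"\[\d+\]\(id=(\d+)\)")


def resolve_citations(answer, docs):
    """Replace [n](id=k) markers with numbered links and append a reference list."""
    numbers = {}  # source URL -> reference number, in order of first citation
    titles = {}   # source URL -> title

    def replace(match):
        idx = int(match.group(1)) - 1  # ids are assumed to be 1-based positions
        if not 0 <= idx < len(docs):
            return ""  # unknown id: drop the marker (a choice made for this sketch)
        source = docs[idx]["source"]
        if source not in numbers:
            numbers[source] = len(numbers) + 1
            titles[source] = docs[idx]["title"]
        return f"<sup>[[{numbers[source]}]({source})]</sup>"

    body = CITATION.sub(replace, answer)
    footer = "".join(f"- **{n}** [{titles[src]}]({src})\n" for src, n in numbers.items())
    return f"{body}\n\n{footer}"


docs = [
    {"title": "Mathematical game",
     "source": "https://en.wikipedia.org/wiki/Mathematical_game"},
    {"title": "List of mathematics competitions",
     "source": "https://en.wikipedia.org/wiki/List_of_mathematics_competitions"},
    {"title": "List of mathematics competitions",
     "source": "https://en.wikipedia.org/wiki/List_of_mathematics_competitions"},
]
raw = "Games are recreational [1](id=1). Competitions are evaluative [2](id=3)."
resolved = resolve_citations(raw, docs)
print(resolved)
```

Numbering by source rather than by chunk means that two chunks of the same page share one entry in the reference list, which matches the two-entry list shown above even though six chunks were retrieved.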


# Chain with retriever
The previous chain requires the list of documents to be retrieved before the chain is invoked.
A retriever can instead be placed inside the chain, so that retrieval and generation happen in a single step.

In [32]:
from operator import itemgetter
from langchain_core.runnables import RunnableParallel

retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
context = (
    RunnableParallel(
        # Get list of documents, necessary for reference analysis
        documents=(itemgetter("question") | retriever),
        # and question
        question=itemgetter("question"),
    ).assign(
        context=lambda input: format_docs(input["documents"]),
        format_references=lambda _: FORMAT_REFERENCES,
    )
)

In [33]:
context.invoke({"question": question}).keys()

dict_keys(['documents', 'question', 'context', 'format_references'])
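
The data flow of this `context` runnable can be pictured with plain Python functions. In this sketch, `fake_retriever` and `build_context` are hypothetical stand-ins (no LangChain involved): the first step computes independent keys from the same input, as `RunnableParallel` does, and the second step adds keys derived from the first, as `.assign()` does.

```python
def fake_retriever(question):
    # Hypothetical stand-in for the real retriever; returns canned documents
    return [
        {"page_content": "Chess is a game."},
        {"page_content": "An olympiad is a contest."},
    ]


def build_context(inputs):
    # Step 1 (RunnableParallel): compute independent keys from the same input
    state = {
        "documents": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2 (.assign): add keys derived from the state built so far
    state["context"] = "\n".join(
        f"<document id={i + 1}>\n{d['page_content']}\n</document>\n"
        for i, d in enumerate(state["documents"])
    )
    state["format_references"] = "<citation instructions>"
    return state


state = build_context({"question": "What is a mathematical game?"})
print(list(state))  # ['documents', 'question', 'context', 'format_references']
```

Keeping `documents` in the output alongside `context` is what allows the citation ids in the answer to be mapped back to their sources.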

In [34]:
chain = (
        context
        | rag_prompt
        | model
)

In [35]:
pprint((chain | StrOutputParser()).invoke({"question": question}))

('Mathematical games are structured activities defined by clear mathematical '
 'rules and strategies, often involving elements of chance or strategy, and '
 'can enhance arithmetic skills in an engaging manner [1](id=1). In contrast, '
 'mathematics competitions, such as mathematical olympiads, are formal events '
 'where participants solve math problems or proofs, often under timed '
 'conditions [2](id=2). The primary distinction lies in the interactive, '
 'playful nature of games versus the competitive, evaluative context of '
 'competitions.')


In [36]:
answer = (
    context | manage_references(rag_prompt | model) | StrOutputParser()
).invoke({"question": question})
pprint(answer)

('Mathematical games are structured activities defined by clear mathematical '
 'rules and strategies, often focusing on skill development without requiring '
 'deep mathematical knowledge, such as tic-tac-toe or chess '
 '<sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup> In '
 'contrast, mathematics competitions are formal events where participants '
 'solve mathematical problems, often requiring significant knowledge and '
 'understanding of mathematics, like the International Mathematical Olympiad '
 '<sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup> '
 'Thus, the primary difference lies in the nature of engagement: games are '
 'recreational and skill-based, while competitions are serious and '
 'knowledge-based.\n'
 '\n'
 '- **1** [Mathematical '
 'game](https://en.wikipedia.org/wiki/Mathematical_game)\n'
 '- **2** [List of mathematics '
 'competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)\n')


In [37]:
display(Markdown(answer))

Mathematical games are structured activities defined by clear mathematical rules and strategies, often focusing on skill development without requiring deep mathematical knowledge, such as tic-tac-toe or chess <sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup> In contrast, mathematics competitions are formal events where participants solve mathematical problems, often requiring significant knowledge and understanding of mathematics, like the International Mathematical Olympiad <sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup> Thus, the primary difference lies in the nature of engagement: games are recreational and skill-based, while competitions are serious and knowledge-based.

- **1** [Mathematical game](https://en.wikipedia.org/wiki/Mathematical_game)
- **2** [List of mathematics competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)


## Customize the references format
The reference format can be customized by subclassing `ReferenceStyle`. A custom style can also remove a reference, for example when the chunk has no anchor that can be referenced and the corresponding document has too many pages for a document-level reference to be useful.

In [38]:
from langchain_core.documents.base import BaseMedia
from langchain_references import ReferenceStyle
from typing import List, Optional, Tuple


class MyReferenceStyle(ReferenceStyle):
    # Remove the reference when the chunk has no "header 1" anchor
    # and the document has more than `max_total_pages` pages
    source_id_key = "source"
    chunk_anchor_key = "header 1"
    total_pages_key = "total_pages"
    max_total_pages = 3

    def format_reference(self, ref: int, media: BaseMedia) -> Optional[str]:
        if media.metadata.get(self.chunk_anchor_key) is None:
            # The chunk has no anchor, so it can only be referenced as a whole document.
            # If that document is too big, return None to remove the reference.
            get_total_pages = self._get_key_assigner(self.total_pages_key)
            total_pages = get_total_pages(media)
            if total_pages and total_pages > self.max_total_pages:
                return None
        return f"[{ref}]"

    def format_all_references(self, refs: List[Tuple[int, BaseMedia]]) -> str:
        if not refs:
            return ""
        result = []
        for ref, media in refs:
            source = media.metadata[self.source_id_key]
            if media.metadata.get(self.chunk_anchor_key):
                # Append the chunk's anchor to the source URL
                source += "#" + media.metadata[self.chunk_anchor_key]
            if "title" in media.metadata:
                result.append(f"- [{ref}] {media.metadata['title']} ({source})\n")
            else:
                result.append(f"- [{ref}] {source}\n")
        if not result:
            return ""
        return "\n\n" + "".join(result)

answer = (
    context
    | manage_references(rag_prompt | model, style=MyReferenceStyle())
    | StrOutputParser()
).invoke({"question": question})
display(Markdown(answer))

Mathematical games involve clear mathematical parameters and are designed for play, often focusing on strategy and skill, such as chess or checkers [1] In contrast, mathematics competitions, like the International Mathematical Olympiad, are structured events where participants solve mathematical problems, often requiring detailed proofs or solutions [2][2] While games are primarily recreational, competitions are competitive and evaluative in nature.

- [1] Mathematical game (https://en.wikipedia.org/wiki/Mathematical_game)
- [2] List of mathematics competitions (https://en.wikipedia.org/wiki/List_of_mathematics_competitions)
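
The filtering rule in `format_reference` can be checked in isolation on plain metadata dictionaries. The `should_cite` helper below is hypothetical and reads the page count with a plain `dict.get` instead of the library's key assigner, so treat it as an approximation of the rule rather than the class's exact behavior.

```python
def should_cite(metadata, anchor_key="header 1", pages_key="total_pages", max_pages=3):
    """Approximate the rule in MyReferenceStyle.format_reference on a metadata dict."""
    if metadata.get(anchor_key) is None:
        # No anchor: keep the reference only if the document is short enough
        total_pages = metadata.get(pages_key)
        if total_pages and total_pages > max_pages:
            return False
    return True


long_no_anchor = should_cite({"source": "a.pdf", "total_pages": 120})
long_with_anchor = should_cite({"source": "a.pdf", "total_pages": 120, "header 1": "Intro"})
short_no_anchor = should_cite({"source": "b.pdf", "total_pages": 2})
unknown_pages = should_cite({"source": "https://en.wikipedia.org/wiki/Mathematics"})
print(long_no_anchor, long_with_anchor, short_no_anchor, unknown_pages)
# False True True True
```

Under this rule, a document whose metadata has no page count always keeps its reference, which is consistent with the Wikipedia references being kept in the answer above.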
