# Setup
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pprados/langchain-references/blob/master/langchain_reference.ipynb)


In [1]:
!python -m pip -q install --upgrade pip

In [2]:
# Document loading, retrieval methods and text splitting
%pip install -qU wikipedia

%pip install -qU langchain-references
%pip install -qU langchain-community
%pip install -qU langchain-text-splitters

# Local vector store via Chroma
%pip install -qU langchain-chroma

# inference and embeddings 
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
import langchain_references

langchain_references.__version__

'0.0.0'

# Document loading, retrieval methods and text splitting
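Load documents from Wikipedia with `WikipediaRetriever`. Each document is truncated to 2,000 characters (`doc_content_chars_max=2000`), so the cells below do not split them further.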

In [4]:
import os
from pprint import pprint

os.environ["USER_AGENT"] = "langchain-references"

from langchain_community.retrievers import WikipediaRetriever

documents = WikipediaRetriever(
    top_k_results=10,
    doc_content_chars_max=2000
).invoke("mathematic")


In [5]:
pprint([(doc.metadata["title"], doc.metadata["source"]) for doc in documents])

[('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics'),
 ('Mathematical Reviews', 'https://en.wikipedia.org/wiki/Mathematical_Reviews'),
 ('Applied mathematics', 'https://en.wikipedia.org/wiki/Applied_mathematics'),
 ('Mathematical game', 'https://en.wikipedia.org/wiki/Mathematical_game'),
 ('Mathematical object', 'https://en.wikipedia.org/wiki/Mathematical_object'),
 ('List of mathematics competitions',
  'https://en.wikipedia.org/wiki/List_of_mathematics_competitions'),
 ('Mathematical sciences',
  'https://en.wikipedia.org/wiki/Mathematical_sciences'),
 ('Mathematical logic', 'https://en.wikipedia.org/wiki/Mathematical_logic'),
 ('Group (mathematics)', 'https://en.wikipedia.org/wiki/Group_(mathematics)')]


In [6]:
import os
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass()

In [7]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

embeddings = OpenAIEmbeddings()
model = ChatOpenAI(model="gpt-4o-mini")

Load the documents into a vector store.

In [8]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(documents=documents,
                                    embedding=embeddings,
                                    )

Combine the documents into a single context string, wrapping each document in a tag that carries a numeric identifier.

In [9]:
def format_docs(docs):
    # return "\n\n".join(doc.page_content for doc in docs)
    return "\n".join(
        # Add a document id so that LLM can reference it 
        [f"<document id={i + 1}>\n{doc.page_content}\n</document>\n" for i, doc in
         enumerate(docs)]
    )
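To see what the model receives, the sketch below runs the same function on two stand-in documents. `FakeDoc` is a hypothetical minimal class used here in place of LangChain's `Document`; only the `page_content` attribute matters.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    # Hypothetical stand-in for a LangChain Document; only page_content is used
    page_content: str


def format_docs(docs):
    # Same logic as the cell above: wrap each document in a tag with a 1-based id
    return "\n".join(
        f"<document id={i + 1}>\n{doc.page_content}\n</document>\n"
        for i, doc in enumerate(docs)
    )


sample_context = format_docs([FakeDoc("First text."), FakeDoc("Second text.")])
print(sample_context)
```

The ids are 1-based positions in the list, so a citation with `id=N` refers to `docs[N - 1]`.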


# Manage references with langchain-references

Create a prompt with `{format_references}`, `{context}` and `{question}` placeholders.

In [10]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

RAG_TEMPLATE = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved documents to answer the question. 
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.

{format_references}
  
<documents>
{context}
</documents>

Answer the following question:

{question}"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

Build a `context` runnable that adds the formatted documents (`context`) and the citation instructions (`format_references`) to the input.

In [11]:
from langchain_references import FORMAT_REFERENCES

context = RunnablePassthrough.assign(
    context=lambda input: format_docs(input["documents"]),
    format_references=lambda _: FORMAT_REFERENCES,
)
pprint(FORMAT_REFERENCES)

('When referencing the documents, add a citation right after. Use '
 '"[NUMBER](id=ID_NUMBER)" for the citation (e.g. "The Space Needle is in '
 'Seattle [1](id=55)[2](id=12).").')


Create a chain with the `context`, `rag_prompt` and `model`.

In [12]:
from langchain_core.output_parsers import StrOutputParser

# Build the chain without `manage_references()`
chain = (
        context
        | rag_prompt
        | model
)

Select documents similar to the question.

In [13]:
question = "What is the difference kind of games and competition of mathematics?"

docs = vectorstore.similarity_search(question, k=6)
pprint([(d.metadata["title"], d.metadata["source"]) for d in docs])

[('Mathematical game', 'https://en.wikipedia.org/wiki/Mathematical_game'),
 ('List of mathematics competitions',
  'https://en.wikipedia.org/wiki/List_of_mathematics_competitions'),
 ('Mathematics', 'https://en.wikipedia.org/wiki/Mathematics'),
 ('History of mathematics',
  'https://en.wikipedia.org/wiki/History_of_mathematics'),
 ('Applied mathematics', 'https://en.wikipedia.org/wiki/Applied_mathematics'),
 ('Mathematical sciences',
  'https://en.wikipedia.org/wiki/Mathematical_sciences')]


Invoke the chain with `documents` and `question`, but without `manage_references()`. The answer contains raw citations such as **\[1](id=1)**.

In [15]:
answer = (chain | StrOutputParser()).invoke({"documents": docs, "question": question})
pprint(answer)

('Mathematical games are defined by clear rules and strategies that can '
 'involve arithmetic concepts, allowing players to engage without needing deep '
 'mathematical expertise, such as tic-tac-toe or chess [1](id=1). In contrast, '
 'mathematics competitions, like the International Mathematical Olympiad, '
 'require participants to solve mathematical tests, often demanding advanced '
 'problem-solving skills and proofs [2](id=2). Thus, the former focuses on '
 'gameplay and strategy, while the latter emphasizes mathematical '
 'problem-solving and competition.')
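The raw citations follow the `[NUMBER](id=ID_NUMBER)` pattern requested in the prompt, so they can be located with a regular expression. The following is a minimal sketch, not the implementation used by `langchain-references`; the names `CITATION` and `find_citations` are hypothetical.

```python
import re

# Matches the "[NUMBER](id=ID_NUMBER)" citation format requested in the prompt
CITATION = re.compile(r"\[(\d+)\]\(id=(\d+)\)")


def find_citations(text):
    """Return (number, document id) pairs in order of appearance."""
    return [(int(number), int(doc_id)) for number, doc_id in CITATION.findall(text)]


sample = "Games use clear rules [1](id=1). Competitions use tests [2](id=2)."
citations = find_citations(sample)
print(citations)
```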


Invoke the chain with the `documents` and `question` with `manage_references()`.

In [19]:
from langchain_references import manage_references

managed_chain = manage_references(chain)

In [20]:

answer = (managed_chain | StrOutputParser()).invoke(
    {"documents": docs, "question": question})
pprint(answer)

('Mathematical games are structured activities defined by clear mathematical '
 'rules and strategies, often aimed at enhancing arithmetic skills, while '
 'mathematics competitions involve participants completing math tests that may '
 'require problem-solving, proofs, or calculations. Games focus more on '
 'interactive play and strategy, whereas competitions emphasize assessment and '
 'achievement in mathematical knowledge '
 '<sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup><sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup>\n'
 '\n'
 '- **1** [Mathematical '
 'game](https://en.wikipedia.org/wiki/Mathematical_game)\n'
 '- **2** [List of mathematics '
 'competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)\n')


In [21]:
# Print in markdown format
from IPython.display import display, Markdown

display(Markdown(answer))

Mathematical games are structured activities defined by clear mathematical rules and strategies, often aimed at enhancing arithmetic skills, while mathematics competitions involve participants completing math tests that may require problem-solving, proofs, or calculations. Games focus more on interactive play and strategy, whereas competitions emphasize assessment and achievement in mathematical knowledge <sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup><sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup>

- **1** [Mathematical game](https://en.wikipedia.org/wiki/Mathematical_game)
- **2** [List of mathematics competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)
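The rendered answer above replaces each raw citation with a superscript link and appends a numbered list of sources. The sketch below imitates that transformation under stated assumptions: documents are plain dicts with `title` and `source` keys, citation ids are 1-based positions, and footnotes are numbered by first use of each source. It is not the library's code, and `rewrite_citations` and the `example.org` URLs are hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]\(id=(\d+)\)")


def rewrite_citations(text, docs):
    """Replace "[n](id=k)" with a link to docs[k - 1] and append a source list."""
    numbering = {}  # source URL -> footnote number, in order of first use

    def replace(match):
        doc = docs[int(match.group(2)) - 1]
        number = numbering.setdefault(doc["source"], len(numbering) + 1)
        return f"<sup>[[{number}]({doc['source']})]</sup>"

    body = CITATION.sub(replace, text)
    titles = {doc["source"]: doc["title"] for doc in docs}
    footnotes = "\n".join(
        f"- **{number}** [{titles[source]}]({source})"
        for source, number in numbering.items()
    )
    return f"{body}\n\n{footnotes}\n"


sample_docs = [
    {"title": "A", "source": "https://example.org/a"},
    {"title": "B", "source": "https://example.org/b"},
]
rewritten = rewrite_citations("X [1](id=2). Y [2](id=1).", sample_docs)
print(rewritten)
```

Numbering by first use means the footnote numbers follow the order of citations in the text rather than the retrieval order.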


# Another template
The previous chain requires the list of documents to be retrieved before the chain is invoked. In this variant the retriever is part of the chain, so only the question is passed in.


In [22]:
from operator import itemgetter
from langchain_core.runnables import RunnableParallel

retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
context = (
    RunnableParallel(
        # Get list of documents, necessary for reference analysis
        documents=(itemgetter("question") | retriever),
        # and question
        question=itemgetter("question"),
    ).assign(
        context=lambda input: format_docs(input["documents"]),
        format_references=lambda _: FORMAT_REFERENCES,
    )
)

In [23]:
context.invoke({"question": question}).keys()

dict_keys(['documents', 'question', 'context', 'format_references'])
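The data flow of this `context` runnable can be mimicked with plain Python, which may help when reasoning about which keys exist at each step. This is only an illustrative sketch: `fake_retriever` and `build_context` are hypothetical, and the retriever returns canned strings instead of querying the vector store.

```python
def fake_retriever(question):
    # Hypothetical retriever: returns canned texts instead of querying a vector store
    return ["Text about games.", "Text about competitions."]


def build_context(inputs):
    # Step 1 (like RunnableParallel): compute each key from the same input
    state = {
        "documents": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2 (like .assign): add keys derived from the dict built in step 1
    state["context"] = "\n".join(state["documents"])
    state["format_references"] = "<citation instructions>"
    return state


result = build_context({"question": "What is a mathematical game?"})
print(list(result))
```

Keeping `documents` in the dictionary matters because the reference handling needs the original documents, not only the formatted `context` string.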

In [24]:
chain = (
        context
        | rag_prompt
        | model
)

In [27]:
pprint((chain | StrOutputParser()).invoke({"question": question}))

('Mathematical games are structured activities with clear rules and '
 'strategies, often focused on recreational and educational aspects, such as '
 'enhancing arithmetic skills through play [1](id=1). In contrast, mathematics '
 'competitions, like the International Mathematical Olympiad, are formal '
 'events where participants solve complex mathematical problems under '
 'competitive conditions [2](id=2). While games emphasize enjoyment and '
 'learning, competitions prioritize skill demonstration and problem-solving '
 'capabilities.')


In [31]:
answer = (
    context
    | manage_references(rag_prompt | model)
    | StrOutputParser()
).invoke({"question": question})
pprint(answer)

('Mathematical games are defined by clear mathematical parameters and often '
 'involve strategies that do not require deep mathematical knowledge to play, '
 'while mathematical competitions are structured events where participants '
 'solve math problems, often under time constraints. Games are typically '
 'recreational and can enhance arithmetic skills, whereas competitions focus '
 'on testing mathematical proficiency and problem-solving abilities. Thus, the '
 'former emphasizes enjoyment and learning, while the latter emphasizes '
 'performance and assessment in mathematics '
 '<sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup><sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup>\n'
 '\n'
 '- **1** [Mathematical '
 'game](https://en.wikipedia.org/wiki/Mathematical_game)\n'
 '- **2** [List of mathematics '
 'competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)\n')


In [32]:
display(Markdown(answer))

Mathematical games are defined by clear mathematical parameters and often involve strategies that do not require deep mathematical knowledge to play, while mathematical competitions are structured events where participants solve math problems, often under time constraints. Games are typically recreational and can enhance arithmetic skills, whereas competitions focus on testing mathematical proficiency and problem-solving abilities. Thus, the former emphasizes enjoyment and learning, while the latter emphasizes performance and assessment in mathematics <sup>[[1](https://en.wikipedia.org/wiki/Mathematical_game)]</sup><sup>[[2](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)]</sup>

- **1** [Mathematical game](https://en.wikipedia.org/wiki/Mathematical_game)
- **2** [List of mathematics competitions](https://en.wikipedia.org/wiki/List_of_mathematics_competitions)
