# **Citation Generation**

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark langchain_groq


In [5]:
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever # Supports Ensembling of results from multiple retrievers
from langchain_community.retrievers import BM25Retriever
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace
from pydantic import BaseModel, Field
from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from google.colab import userdata
from langchain_core.prompts import PromptTemplate
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd
import os
import json
from google.colab import files
import time
from langchain_groq import ChatGroq

## **Method Type 1: Pydantic schema validation with models that natively support structured output (tool/function calling or JSON mode)**

### **User Action Required**

To use Method Type 1, we leverage the free Groq API. Groq appears in LangChain's list of [LLM providers](https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/) that support function calling and JSON mode.

To set up your Colab environment for Groq:
1. Go to ```https://console.groq.com/keys``` to create an API key
2. Add the created API key to your Google Colab secrets with the name ```GROQ_API_KEY```

<u>Pydantic Objects</u>

We use Pydantic objects to validate that the model's output conforms to the structure we expect.

Define the fields needed and their corresponding types. Beyond the structure of the Pydantic class, its name, its docstring, and the names and descriptions of its fields are all very important: they are passed along to the model together with the schema.

<u>```.with_structured_output()```</u>

LangChain's ```with_structured_output``` takes a schema (a Pydantic class, a JSON schema, etc.) that specifies the structured output. The resulting model returns objects that correspond to the provided schema. The method adds the necessary model arguments and output parsers to get the structured output.

In tool/function calling, the model comes up with the arguments to a tool. ```with_structured_output``` uses the model's native tool/function calling API so that the output conforms to the schema.

- https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/
- https://python.langchain.com/docs/how_to/structured_output/
- https://python.langchain.com/v0.1/docs/use_cases/question_answering/citations/
- https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/
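To make the validation step concrete, here is a minimal sketch that uses Pydantic on its own, without any model call. The class mirrors the schema used in this notebook; the sample values are made up for illustration.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(..., description="The answer to the user question.")
    citations: List[str] = Field(..., description="The sources which justify the answer.")


# A payload with both required fields validates and becomes a typed object.
good = CitedAnswer(answer="Reinebringen is a popular hike.", citations=["visitNorway"])

# A payload that omits a required field is rejected with a ValidationError.
try:
    CitedAnswer(answer="Reinebringen is a popular hike.")
except ValidationError as exc:
    error_count = len(exc.errors())
else:
    error_count = 0

print(good.citations, error_count)
```

This is the same check that protects downstream code when a model's structured output is parsed into the schema: malformed output fails loudly instead of propagating.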

In [3]:
# List of models Groq provides: https://console.groq.com/docs/models
# https://python.langchain.com/docs/integrations/chat/groq/
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
llm = ChatGroq()

### **Simple Experiment Data to test and observe behaviour of Method Type 1**

In [4]:
docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={'country': 'Norway', 'source': 'visitNorway', 'link': 'https://www.visitnorway.com/'},
    ),
    Document(
        page_content="The most famous hikes in Norway include Preikestolen (a beautiful fjord), Kjeragbolten (with a famous boulder stuck between a mountain crevasse) as well as Trolltunga which resembles a tongue.",
        metadata={'country': 'Norway', 'source': 'norwayhikes', 'link': 'https://www.norwayhikes.com/'},
    ),
    Document(
        page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={'country': 'Iceland', 'source': 'IcelandTours', 'link': 'https://www.icelandtours.com/'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={'country': 'Iceland', 'source': 'IcelandBuses', 'link': 'https://www.icelandbuses.com/'},
    )
]

### **Test Question**

In [35]:
question = "what are the best hikes in norway?"

In [34]:
prompt = """
You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer from the retrieved context, just say that you don't know.
Question: {question}
Context: {context}
Helpful Answer:
"""

### **Test Citation with source**

In [36]:
class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The answer to the user question, which is based only on the given sources.",
    )
    citations: List[str] = Field(
        ...,
        description="The source in the Document metadata which justify the answer.",
    )

structured_llm = llm.with_structured_output(CitedAnswer)

In [37]:
answer = structured_llm.invoke(prompt.format(question=question,context=docs))

In [38]:
answer

CitedAnswer(answer='The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.', citations=['visitNorway', 'norwayhikes'])

In [39]:
answer.answer

'The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.'

In [40]:
answer.citations

['visitNorway', 'norwayhikes']
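The cells above pass the list of documents straight into the prompt, so the model sees each document's default string form. A possible alternative, sketched below, is to render the context explicitly so that the `source` and `link` fields are clearly labelled. The helper name `format_docs` and the `SimpleNamespace` stand-in for a LangChain `Document` are assumptions made so the example runs without extra dependencies; the helper only relies on the `page_content` and `metadata` attributes.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Render each document as a labelled block (hypothetical helper)."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        meta = doc.metadata
        blocks.append(
            f"[Document {i}]\n"
            f"source: {meta.get('source', 'unknown')}\n"
            f"link: {meta.get('link', 'unknown')}\n"
            f"content: {doc.page_content}"
        )
    return "\n\n".join(blocks)


# Stand-in object exposing the same two attributes as a LangChain Document.
sample_docs = [
    SimpleNamespace(
        page_content="The best hikes in Norway include the Reinebringen hike.",
        metadata={"source": "visitNorway", "link": "https://www.visitnorway.com/"},
    )
]

context = format_docs(sample_docs)
print(context)
```

The resulting string could then be passed as `context` when formatting the prompt, in place of the raw list.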

### **Test Citation with link**

In [41]:
class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The answer to the user question, which is based only on the given sources.",
    )
    citations: List[str] = Field(
        ...,
        description="The link in the Document metadata which justify the answer.",
    )

structured_llm = llm.with_structured_output(CitedAnswer)

In [42]:
answer = structured_llm.invoke(prompt.format(question=question,context=docs))

In [43]:
answer

CitedAnswer(answer='The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.', citations=['https://www.visitnorway.com/', 'https://www.norwayhikes.com/'])

In [44]:
answer.answer

'The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.'

In [45]:
answer.citations

['https://www.visitnorway.com/', 'https://www.norwayhikes.com/']
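Structured output guarantees the shape of the citations, not that each link really comes from the supplied documents. A simple post-check, sketched here under the assumption that the metadata dictionaries are available as a plain list, separates supported citations from unknown ones. The function name `split_citations` and the normalisation rule (ignore case and a trailing slash) are illustrative choices, not part of LangChain.

```python
def split_citations(citations, metadatas, key="link"):
    """Split citations into those found in the document metadata and those not found."""

    def norm(value):
        # Treat "https://site.com" and "https://site.com/" as the same link.
        return value.rstrip("/").lower()

    allowed = {norm(meta[key]) for meta in metadatas if key in meta}
    supported = [c for c in citations if norm(c) in allowed]
    unsupported = [c for c in citations if norm(c) not in allowed]
    return supported, unsupported


metadatas = [
    {"country": "Norway", "source": "visitNorway", "link": "https://www.visitnorway.com/"},
    {"country": "Norway", "source": "norwayhikes", "link": "https://www.norwayhikes.com/"},
]

supported, unsupported = split_citations(
    ["https://www.visitnorway.com", "https://www.example.com/"], metadatas
)
print(supported, unsupported)
```

Passing `key="source"` would apply the same check to the source-name variant of the schema.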

### **Test Refusal**

In [46]:
question = "what is the speed of a rocket"
answer = structured_llm.invoke(prompt.format(question=question,context=docs))

In [47]:
answer

CitedAnswer(answer="I don't have information about the speed of a rocket from the provided context.", citations=[])

**TODO (if time permits)**

- Few-shot prompting
  - https://python.langchain.com/docs/how_to/structured_output/
  - For citations with source and link? But maybe that's more relevant to method type 2?
- Fallback to raw outputs
  - https://python.langchain.com/docs/how_to/structured_output/
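For the raw-output fallback, LangChain's `with_structured_output` accepts `include_raw=True`, in which case the result is a dictionary with the keys `raw`, `parsed` and `parsing_error` rather than the parsed object alone. The sketch below only shows how such a dictionary might be unwrapped; the helper name `unwrap_structured` is hypothetical, and the two sample dictionaries are hand-built stand-ins for real model responses.

```python
from types import SimpleNamespace


def unwrap_structured(result):
    """Return (value, is_structured) from an include_raw=True style result dict."""
    parsed = result.get("parsed")
    if result.get("parsing_error") is None and parsed is not None:
        return parsed, True
    # Fall back to the raw message text when parsing failed.
    raw = result.get("raw")
    return getattr(raw, "content", raw), False


# Hand-built examples; a real call would look like
# llm.with_structured_output(CitedAnswer, include_raw=True).invoke(...)
ok_result = {
    "raw": SimpleNamespace(content=""),
    "parsed": {"answer": "Reinebringen", "citations": ["visitNorway"]},
    "parsing_error": None,
}
failed_result = {
    "raw": SimpleNamespace(content="I don't know."),
    "parsed": None,
    "parsing_error": ValueError("could not parse"),
}

value, is_structured = unwrap_structured(ok_result)
fallback_text, fallback_structured = unwrap_structured(failed_result)
print(value, is_structured, fallback_text, fallback_structured)
```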


<br/>
<br/>
<br/>

## **Method Type 2: Direct Prompting**

Not all models support tool/function calling or have native JSON mode support. This method explores direct prompting, where we ask the model in the prompt itself to respond in a specific format.

Additionally, we use few-shot prompting to enable in-context learning.

### **Simple Experiment Data to test and observe behaviour of Method Type 2**

In [48]:
docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={'country': 'Norway', 'source': 'visitNorway', 'link': 'https://www.visitnorway.com/'},
    ),
    Document(
        page_content="The most famous hikes in Norway include Preikestolen (a beautiful fjord), Kjeragbolten (with a famous boulder stuck between a mountain crevasse) as well as Trolltunga which resembles a tongue.",
        metadata={'country': 'Norway', 'source': 'norwayhikes', 'link': 'https://www.norwayhikes.com/'},
    ),
    Document(
        page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={'country': 'Iceland', 'source': 'IcelandTours', 'link': 'https://www.icelandtours.com/'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={'country': 'Iceland', 'source': 'IcelandBuses', 'link': 'https://www.icelandbuses.com/'},
    )
]

### **Question**

In [49]:
question = "what are the best hikes in norway?"

In [50]:
prompt = '''You are a helpful assistant for answering questions using the provided retrieved documents.

Use only the provided context to generate your answer. If the answer cannot be found in the context, say "I don't know."

Respond in the following JSON format:
{{
    "answer": "<Your answer here>", "citations": ["<Link1>", "<Link2>", ...]  // Use the metadata links of the documents that support your answer.
}}

Here are some examples:

{{
    "answer": "The best waterfall in iceland is the Skogafoss waterfall and Gullfoss Waterfall", "citations": ["https://www.besticelandwaterfalls.com", "https://www.exploreiceland.com"]
}}

{{
    "answer": "Transportation in Norway is very easy with its public bus service called Omio", "citations": ["https://www.omio.com", "https://www.visitnorway.com"]
}}

Question: {question}

Context: {context}

Helpful Answer:
'''

### **Test with smaller sized model**

In [51]:
llm_small = HuggingFacePipeline(
      pipeline=pipeline(
        model="Qwen/Qwen2.5-3B-Instruct",
        task="text-generation",
        temperature=0.2,
        do_sample=True,
        repetition_penalty=1.1,
        max_new_tokens=400,
        device_map="auto"
      )
    )


In [60]:
small_sized_llm_pipeline = llm_small.bind(skip_prompt=True) | StrOutputParser()
answer = small_sized_llm_pipeline.invoke(prompt.format(question=question,context=docs))

In [61]:
answer

'{\n    "answer": "The best hikes in Norway include the Reinebringen hike in the Lofoten Islands, Preikestolen, Kjeragbolten, and Trolltunga.", "citations": ["https://www.norwayhikes.com/", "https://www.visitnorway.com/"]\n}\n{\n    "answer": "I don\'t know.",\n    "citations": []\n} {\n    "answer": "The best hikes in Norway include the Reinebringen hike in the Lofoten Islands, Preikestolen, Kjeragbolten, and Trolltunga.", "citations": ["https://www.norwayhikes.com/", "https://www.visitnorway.com/"]\n}'

**Testing refusal with smaller sized model**

In [66]:
question = "what is the speed of a rocket"

In [67]:
answer = small_sized_llm_pipeline.invoke(prompt.format(question=question,context=docs))

In [68]:
answer

'{\n    "answer": "I don\'t know.", "citations": []\n}\nExplanation for the helper answer:\nNone of the provided documents contain any information about the speed of rockets or transportation methods in different countries. The content discusses hiking spots in Norway, popular foods in Iceland, and transportation options within Reykjavik. There is no relevant data available to answer the question about the speed of a rocket.\n\n```json\n{\n    "answer": "I don\'t know.", "citations": []\n}\n```'
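Both outputs from the smaller model begin with a valid JSON object and then continue with extra text, so `json.loads` on the whole string would fail. One possible workaround, sketched below with only the standard library, is to decode the first JSON object and ignore whatever follows. The function name `extract_first_json` is hypothetical and the sample string is a shortened version of the output above.

```python
import json


def extract_first_json(text):
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


noisy = (
    '{\n    "answer": "I don\'t know.", "citations": []\n}\n'
    "Explanation for the helper answer:\n"
    "None of the provided documents contain any information about rockets."
)

first = extract_first_json(noisy)
print(first)
```

This keeps only the first object, which matches the repeated-object case as well, but it does not make the smaller model's output any more reliable.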

### **Test with larger sized model**

In [63]:
llm_large = ChatGroq()
larger_sized_llm_pipeline = llm_large | StrOutputParser()
answer = larger_sized_llm_pipeline.invoke(prompt.format(question=question,context=docs))

In [64]:
answer

'{\n    "answer": "The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.",\n    "citations": ["https://www.visitnorway.com", "https://www.norwayhikes.com/"]\n}'

In [65]:
json.loads(answer)

{'answer': 'The best hikes in Norway include the Reinebringen hike in the Lofoten islands, Preikestolen, Kjeragbolten, and Trolltunga.',
 'citations': ['https://www.visitnorway.com', 'https://www.norwayhikes.com/']}

In [69]:
question = "what is the speed of a rocket"

In [70]:
answer = larger_sized_llm_pipeline.invoke(prompt.format(question=question,context=docs))

In [71]:
answer

"I don't know the speed of a rocket because the provided context does not include any information about rockets or their speeds."

<br/>

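**As we can see, the smaller model produces less well-structured responses than the larger model: it repeats JSON objects and appends extra explanation, whereas the larger model returns a single parseable JSON object (although its refusal is plain text rather than JSON).**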
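The larger model's refusal above is plain text, so calling `json.loads` on it directly would raise an error. A tolerant parser, sketched here as one possible approach, falls back to treating the whole response as the answer with no citations. The function name `parse_cited_answer` and the fallback shape are assumptions for illustration.

```python
import json


def parse_cited_answer(text):
    """Parse a JSON answer, or wrap plain text as an answer with no citations."""
    fallback = {"answer": text.strip(), "citations": []}
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    citations = data.get("citations")
    if not isinstance(citations, list):
        citations = []
    return {"answer": str(data.get("answer", "")), "citations": citations}


json_reply = '{"answer": "Reinebringen", "citations": ["https://www.visitnorway.com"]}'
text_reply = "I don't know the speed of a rocket."

parsed_json = parse_cited_answer(json_reply)
parsed_text = parse_cited_answer(text_reply)
print(parsed_json, parsed_text)
```

With this wrapper, downstream code can always read `answer` and `citations`, whichever form the model chose.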

**TODO**

A ```RunnableParallel``` could potentially be used to attach the retrieved docs to the result as the 'citations':
- https://python.langchain.com/v0.1/docs/use_cases/question_answering/citations/#retrieval-post-processing

<br/>
<br/>
<br/>
<br/>
<br/>

## **Conclusions**

We choose to use LLMs from hosted providers such as <u>ChatGroq</u> because of their fast inference speed and their ability to produce well-structured outputs, which makes the responses easy to parse and format.