# **Citation Generation**

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark langchain_groq


In [2]:
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever # Supports Ensembling of results from multiple retrievers
from langchain_community.retrievers import BM25Retriever
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_huggingface.llms import HuggingFacePipeline
from langchain_huggingface import ChatHuggingFace
from pydantic import BaseModel, Field
from typing import List
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd
import os
from google.colab import files
import time
from langchain_groq import ChatGroq

## **Method Type 1: Pydantic schema validation with models that support tool/function calling or JSON mode (i.e. that provide native APIs for structured output)**

To use Method Type 1, we leverage the free Groq API, which LangChain lists among the [example LLM providers](https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/) that support function calling and JSON mode.

To set up your Colab environment for Groq:
1. Go to ```https://console.groq.com/keys``` to create an API key
2. Add the created API key to your Google Colab secrets with the name ```GROQ_API_KEY```

In [11]:
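# List of models Groq provides: https://console.groq.com/docs/models
# https://python.langchain.com/docs/integrations/chat/groq/
# Read the key from the Colab secret named GROQ_API_KEY instead of hard-coding it
from google.colab import userdata
llm = ChatGroq(api_key=userdata.get("GROQ_API_KEY"))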

In [29]:
# Leverage Pydantic objects to validate that data conforms to the output we expect

'''
Beyond the structure of the Pydantic class, its name, its docstring, and the
names and descriptions of its fields all matter, because they are passed along
to the model together with the schema.

Define the fields needed and their corresponding types.
'''

class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The answer to the user question, which is based only on the given sources.",
    )
    citations: List[int] = Field(
        ...,
        description="The integer IDs of the SPECIFIC sources which justify the answer.",
    )

'''
Use LangChain's with_structured_output. It takes a schema (Pydantic class, JSON schema, etc.)
that specifies the structured output and returns a wrapped model whose responses are objects
corresponding to that schema. The method adds the necessary model arguments and output
parsers to get the structured output.

Tool/function calling: the model comes up with the arguments to a tool.

with_structured_output uses the model's native function/tool calling API to make sure
the response conforms to the schema.

https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/
https://python.langchain.com/docs/how_to/structured_output/
https://python.langchain.com/v0.1/docs/use_cases/question_answering/citations/
https://python.langchain.com/v0.1/docs/modules/model_io/chat/function_calling/
'''
structured_llm = llm.with_structured_output(CitedAnswer)
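
As a rough illustration of what the Pydantic schema contributes, the sketch below validates two hand-written payloads against a `CitedAnswer` class without calling any LLM. The payload dictionaries are invented for illustration; in the notebook, the arguments produced by the model's tool call would take their place.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(..., description="The answer to the user question.")
    citations: List[int] = Field(..., description="The integer IDs of the sources used.")


# Hypothetical payloads standing in for the arguments a model might produce.
good_payload = {"answer": "Brian is 5'11\".", "citations": [1, 3]}
bad_payload = {"answer": "Brian is 5'11\".", "citations": ["Suzy"]}

validated = CitedAnswer(**good_payload)
print(validated.citations)

try:
    CitedAnswer(**bad_payload)
    rejected = False
except ValidationError:
    # "Suzy" cannot be parsed as an integer, so validation fails.
    rejected = True
print(rejected)
```

The second payload is rejected because `citations` is declared as `List[int]`, which is the same check that protects downstream code from a malformed model response.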

**Example question with sources**

In [None]:
example_q = """What is Brian's height?

Sources:
1. Suzy is 6'2".
2. Jeremiah is blonde.
3. Brian is 3 inches shorter than Suzy.
"""


In [None]:
structured_llm.invoke(example_q)
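
The numbered `Sources:` list in `example_q` can also be generated from a plain Python list, and the integer IDs returned in `citations` can then be mapped back to the text they refer to. The helper names `build_question` and `resolve_citations` are hypothetical, and the cited IDs below are hard-coded rather than taken from a real model response.

```python
from typing import List


def build_question(question: str, sources: List[str]) -> str:
    """Render a question followed by a 1-indexed list of sources."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(sources, start=1))
    return f"{question}\n\nSources:\n{numbered}\n"


def resolve_citations(citation_ids: List[int], sources: List[str]) -> List[str]:
    """Map 1-indexed citation IDs back to source text, skipping out-of-range IDs."""
    return [sources[i - 1] for i in citation_ids if 1 <= i <= len(sources)]


sources = ["Suzy is 6'2\".", "Jeremiah is blonde.", "Brian is 3 inches shorter than Suzy."]
prompt = build_question("What is Brian's height?", sources)

# Assumed citation IDs, as a model might return them in CitedAnswer.citations.
# ID 7 does not exist and is silently dropped.
cited = resolve_citations([1, 3, 7], sources)
print(cited)
```

Dropping out-of-range IDs is one possible design choice; raising an error instead would make invalid citations more visible.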

### **Simple Experiment Data to test and observe behaviour of Method Type 1**

In [12]:
docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={'country': 'Norway', 'source': 'visitNorway', 'link': 'https://www.visitnorway.com/'},
    ),
    Document(
        page_content="The most famous hikes in Norway include Preikestolen (a beautiful fjord), Kjeragbolten (with a famous boulder stuck between a mountain crevasse) as well as Trolltunga which resembles a tongue.",
        metadata={'country': 'Norway', 'source': 'norwayhikes', 'link': 'https://www.norwayhikes.com/'},
    ),
    Document(
        page_content="The famous street food of Iceland is the hot dog! The Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of Iceland include fish and chips as well as Tommi's burger.",
        metadata={'country': 'Iceland', 'source': 'IcelandTours', 'link': 'https://www.icelandtours.com/'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={'country': 'Iceland', 'source': 'IcelandBuses', 'link': 'https://www.icelandbuses.com/'},
    )
]

**Citation with source**

In [20]:
class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The answer to the user question, which is based only on the given sources.",
    )
    citations: List[str] = Field(
        ...,
        description="The source in the Document metadata which justifies the answer.",
    )

question = '''You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Question: What are the best hikes in Norway?
Context: {context}
Helpful Answer:'''.format(context=docs)

structured_llm = llm.with_structured_output(CitedAnswer)

answer = structured_llm.invoke(question)

In [22]:
answer

CitedAnswer(answer='The best hikes in Norway include the Reinebringen hike in the Lofoten islands.', citations=['visitNorway'])

In [23]:
answer.answer

'The best hikes in Norway include the Reinebringen hike in the Lofoten islands.'

In [24]:
answer.citations

['visitNorway']
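
Passing `docs` straight into `.format(context=docs)` inserts the Python `repr` of each `Document` into the prompt. One alternative, sketched here under the assumption that each document is reduced to a plain dict with `page_content` and `metadata` keys, is to render the context explicitly so the `source` field is clearly labelled. `format_context` is a hypothetical helper, not part of LangChain.

```python
from typing import Dict, List


def format_context(documents: List[Dict]) -> str:
    """Render each document as a labelled block with its source name and text."""
    blocks = []
    for doc in documents:
        source = doc["metadata"].get("source", "unknown")
        blocks.append(f"Source: {source}\nContent: {doc['page_content']}")
    return "\n\n".join(blocks)


# Plain-dict stand-ins for two of the Document objects defined above (text shortened).
documents = [
    {"page_content": "The best hikes in Norway include the Reinebringen hike.",
     "metadata": {"source": "visitNorway", "link": "https://www.visitnorway.com/"}},
    {"page_content": "The most famous hikes in Norway include Preikestolen.",
     "metadata": {"source": "norwayhikes", "link": "https://www.norwayhikes.com/"}},
]

context = format_context(documents)
print(context)
```

With real `Document` objects, the same idea would read `doc.metadata` and `doc.page_content` as attributes rather than dict keys.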

**Citation with link**

In [27]:
class CitedAnswer(BaseModel):
    """Answer the user question based only on the given sources, and cite the sources used."""

    answer: str = Field(
        ...,
        description="The answer to the user question, which is based only on the given sources.",
    )
    citations: List[str] = Field(
        ...,
        description="The link in the Document metadata which justifies the answer.",
    )

question = '''You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Question: What are the best hikes in Norway?
Context: {context}
Helpful Answer:'''.format(context=docs)

structured_llm = llm.with_structured_output(CitedAnswer)

answer = structured_llm.invoke(question)

In [28]:
answer

CitedAnswer(answer='The best hikes in Norway include the Reinebringen hike in the Lofoten islands.', citations=['https://www.visitnorway.com/'])
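
Because the model writes the `citations` strings itself, it may be worth checking that every returned link really occurs in the metadata of the documents that were supplied. A minimal sketch, assuming the known links are collected into a set beforehand; `split_citations` is a hypothetical helper, and the second citation is an invented example of a link that was not in the context.

```python
from typing import Iterable, List, Set, Tuple


def split_citations(citations: Iterable[str], known_links: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate citations into those found in the known links and those that are not."""
    verified, unknown = [], []
    for link in citations:
        (verified if link in known_links else unknown).append(link)
    return verified, unknown


# Links taken from the metadata of the four example documents.
known_links = {
    "https://www.visitnorway.com/",
    "https://www.norwayhikes.com/",
    "https://www.icelandtours.com/",
    "https://www.icelandbuses.com/",
}

verified, unknown = split_citations(
    ["https://www.visitnorway.com/", "https://www.example.com/hikes"], known_links
)
print(verified, unknown)
```

Any entry that ends up in `unknown` can be dropped or flagged before the answer is shown to a user.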

**TODO (if time permits)**

- Few-shot prompting
  - https://python.langchain.com/docs/how_to/structured_output/
  - Possibly for citations with both source and link, although that may be more relevant to Method Type 2
- Fallback to raw outputs
  - https://python.langchain.com/docs/how_to/structured_output/
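
For the raw-output fallback, LangChain's `with_structured_output` accepts `include_raw=True`, in which case the result is a dict with the keys `raw`, `parsed` and `parsing_error`. The sketch below only shows how such a dict might be handled; `answer_or_raw` is a hypothetical helper, and both result dicts are hand-written stand-ins rather than real model responses.

```python
from typing import Any, Dict


def answer_or_raw(result: Dict[str, Any]) -> str:
    """Return the parsed answer if parsing succeeded, otherwise the raw model text."""
    if result.get("parsing_error") is None and result.get("parsed") is not None:
        parsed = result["parsed"]
        # Accept either a dict or an object with an `answer` attribute.
        return parsed["answer"] if isinstance(parsed, dict) else parsed.answer
    raw = result.get("raw")
    # Chat messages expose their text as `.content`; plain strings are used as-is.
    return getattr(raw, "content", str(raw))


# Hand-written stand-ins for a successful and a failed structured call.
succeeded = {
    "raw": "ignored",
    "parsed": {"answer": "The best hikes include Reinebringen.", "citations": ["visitNorway"]},
    "parsing_error": None,
}
failed = {
    "raw": "Reinebringen is a popular hike.",
    "parsed": None,
    "parsing_error": ValueError("could not parse tool call"),
}

print(answer_or_raw(succeeded))
print(answer_or_raw(failed))
```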

<br/>
<br/>
<br/>

## **Method Type 2: Direct Prompting**

Not all models support tool/function calling or have a native JSON mode. This method explores direct prompting, where the prompt itself asks the model to respond in a specific format.