##### Copyright 2024 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Question Answering using Gemini, LangChain, and Chroma

## Overview

[Gemini](https://ai.google.dev/models/gemini) is a family of generative AI models that lets developers generate content and solve problems. These models are designed and trained to handle both text and images as input.

[LangChain](https://www.langchain.com/) is a framework that makes it easier to integrate Large Language Models (LLMs) like Gemini into applications.

[Chroma](https://docs.trychroma.com/) is an open-source embedding database focused on simplicity and developer productivity. Chroma allows users to store embeddings and their metadata, embed documents and queries, and search the embeddings quickly.

In this notebook, you'll learn how to create an application that answers questions using data from a website with the help of Gemini, LangChain, and Chroma.

## Setup

 - All setup dependencies have been moved into `requirements.txt`.
 - The project also includes an app and helper scripts, which make the code easier to showcase and manage.


### Web Agent Loading from Website

In [21]:
import os
import pprint
import re
import markdown

In [16]:
os.listdir()

['Gemini_LangChain_QA_Chroma_WebLoad.ipynb']

In [22]:
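# Read the Markdown file that lists the pages to load.
with open("../data/web/links.md", "r") as file:
    text = file.read()

# Extract (name, url) pairs, assuming inline Markdown links: [name](url)
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", text)

for link_name, url in links:
    print(f"Link Name: {link_name}")
    print(f"URL: {url}")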

Link Name: FWD Claim
URL: https://www.fwd.com.sg/claim-online/3ci-insurance/quickQuestions
Link Name: FWD CI Plus Product Page
URL: https://www.fwd.com.sg/critical-illness-insurance/critical-illness-plus/
Link Name: FWD Big3 Product Page
URL: https://www.fwd.com.sg/critical-illness-insurance/big-3-critical-illness/
Link Name: Prudential PruShield Product Page
URL: https://www.prudential.com.sg/products/health-insurance/medical/prushield
Link Name: Prudential Claim Page
URL: https://www.prudential.com.sg/claims-and-support/how-to-submit-a-claim


In [28]:
from langchain_community.document_loaders import WebBaseLoader

help(WebBaseLoader)

Help on class WebBaseLoader in module langchain_community.document_loaders.web_base:

class WebBaseLoader(langchain_community.document_loaders.base.BaseLoader)
 |  WebBaseLoader(web_path: Union[str, Sequence[str]] = '', header_template: Optional[dict] = None, verify_ssl: bool = True, proxies: Optional[dict] = None, continue_on_failure: bool = False, autoset_encoding: bool = True, encoding: Optional[str] = None, web_paths: Sequence[str] = (), requests_per_second: int = 2, default_parser: str = 'html.parser', requests_kwargs: Optional[Dict[str, Any]] = None, raise_for_status: bool = False, bs_get_text_kwargs: Optional[Dict[str, Any]] = None, bs_kwargs: Optional[Dict[str, Any]] = None, session: Any = None) -> None
 |  
 |  Load HTML pages using `urllib` and parse them with `BeautifulSoup'.
 |  
 |  Method resolution order:
 |      WebBaseLoader
 |      langchain_community.document_loaders.base.BaseLoader
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __ini

In [29]:
from langchain_community.document_loaders import WebBaseLoader

# WebBaseLoader accepts a list of URLs; the claim page is listed once so it is not loaded twice.
loader = WebBaseLoader(["https://www.prudential.com.sg/claims-and-support/how-to-submit-a-claim"])

docs = loader.load()

In [31]:
docs

[Document(page_content="\n\n\n\n\n\n\n\n\n\n\nHow to Submit a Claim | Prudential Singapore \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n   \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nOnline Payment\n\n\nPersonal\nEnterprise\n\nLogin\n\n\n\n\n\n\n\n\n\n\n\n\n\nWe Do\nProducts\nClaims & Services\nPriority Programmes\nWork With Us\nAbout Us\n\n\n\n\n\n\nWe Do\n\n\nWe Do Pulse\nWe Do Life\nWe Do Innovation\n\n\n\n\nProducts\n\n\nHealth Protection\nLife Protection\nWealth Accumulation\nLegacy Planning\nBuy Insurance Online\nPromotions\n\n\n\n\nClaims & Services\n\n\nClaims\nPayments\nSupport\nTools\nPRUPanel Connect\n\n\n\n\nPriority Programmes\n\n\nAscend By Prudential\nOPUS By Prudential\nPURSUE\n\n\n\n\nWork With Us\n\n\nJoin PRU as FC\nCorporate Careers\n\n\n\n\nAbout Us\n\n\nAbout Us\nESG\nIn Our Community\nNewsroom\nCommunity Investment\n\n\n\n\n\n\n\nLogin\n\n\nPRUaccess\nCorporate Insurance\n\n\n\n\nOnline Payment\n\n\nEnterpris

In [27]:
pprint.pprint(docs[0].page_content)

('\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 'How to Submit a Claim | Prudential Singapore \n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '   \n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 'Online Payment\n'
 '\n'
 '\n'
 'Personal\n'
 'Enterprise\n'
 '\n'
 'Login\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 'We Do\n'
 'Products\n'
 'Claims & Services\n'
 'Priority Programmes\n'
 'Work With Us\n'
 'About Us\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 '\n'
 'We Do\n'
 '\n'
 '\n'
 'We Do Pulse\n'
 'We Do Life\n'
 'We Do Innovation\n'
 '\n'
 '\n'
 '\n'
 '\n'
 'Products\n'
 '\n'
 '\n'
 'Health Protection\n'
 'Lif
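
The scraped page text above is mostly blank lines left over from the HTML layout. A minimal sketch of one way to tidy it before embedding follows; `collapse_whitespace` is a hypothetical helper (not part of LangChain), and `sample` is a short stand-in for `docs[0].page_content`, not the real page.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces/tabs, strip each line, and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Stand-in for docs[0].page_content (assumption: similar shape to the output above).
sample = (
    "\n\n\nHow to Submit a Claim | Prudential Singapore \n\n\n   \n\n"
    "Online Payment\n\n\nPersonal\nEnterprise\n"
)
cleaned = collapse_whitespace(sample)
print(cleaned)
```

Applying the same function to each loaded document's `page_content` would leave the navigation labels and body text on consecutive lines, which is usually easier to inspect and cheaper to embed.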

In [4]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

True

In [5]:
youtube_api_key = os.environ.get("youtube_api_key")

video_id='Moe-2TUYQTI'

In [7]:
import json
import youtube_transcript_api
from googleapiclient.discovery import build

def get_transcript(video_id, api_key):
    """Retrieves the transcript of a YouTube video.

    `api_key` is unused here: youtube_transcript_api does not call the
    YouTube Data API. The parameter is kept so both helpers share a signature.
    """
    transcript = youtube_transcript_api.YouTubeTranscriptApi.get_transcript(video_id)
    return transcript

def get_comments(video_id, api_key):
    """Retrieves up to 100 top-level comments of a YouTube video."""
    # Use the `api_key` argument instead of the global `youtube_api_key`.
    youtube = build('youtube', 'v3', developerKey=api_key)
    results = youtube.commentThreads().list(
        part='snippet',
        videoId=video_id,
        textFormat='plainText',
        maxResults=100
    ).execute()
    comments = []
    for item in results['items']:
        comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
        comments.append(comment)
    return comments

In [None]:
def get_comments(video_id, api_key):
    """
    Retrieves top comments and replies of a YouTube video.

    Args:
        video_id: The YouTube video ID.
        api_key: The YouTube API key.

    Returns:
        A list of dictionaries containing top comment and reply information.
    """
    youtube = build('youtube', 'v3', developerKey=api_key)
    results = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id,
        textFormat='plainText',
        maxResults=100
    ).execute()

    comments_list = []
    for item in results['items']:
        comment = {
            'text': item['snippet']['topLevelComment']['snippet']['textDisplay'],
            'author': item['snippet']['topLevelComment']['snippet']['authorDisplayName'],
            'timestamp': item['snippet']['topLevelComment']['snippet']['publishedAt'],
            'replies': []
        }
        if 'replies' in item:
            # Replies are nested under item['replies']['comments'].
            for reply in item['replies']['comments']:
                comment['replies'].append({
                    'text': reply['snippet']['textDisplay'],
                    'author': reply['snippet']['authorDisplayName'],
                    'timestamp': reply['snippet']['publishedAt']
                })
        comments_list.append(comment)
    return comments_list

# Example usage:
# comments = get_comments(video_id, api_key)
# for comment in comments:
#     print(f"Top comment: {comment['text']}")
#     for reply in comment['replies']:
#         print(f"\tReply: {reply['text']}")

In [11]:
get_comments(video_id, youtube_api_key)

['🔥 Improve your problem-solving skills with Brilliant — https://brilliant.org/georgiadow\n👀 My Full Arcane Video Series: https://www.youtube.com/playlist?list=PL3I0HsOf9M_S0HzHJKjhVwsr_wTNhdaF9',
 "Seems like Rio is Singed's daughter\nThink about it.",
 'your analysis/call-out of hugs is a comfort for me 🥺',
 'I have fallen in love with your arcane analysis videos! Your comments and enthusiasm show just how much you enjoy the characters, and it warms my heart to see someone love Viktor just as much as I do!',
 'I think Singed would be a good one to do this for',
 "I rly like Viktor and the fact he's disabled to me was rly relatable and I liked feeling like I could relate to a character especially during his younger scenes since I was also an outcast",
 "The tragedy here is that Viktor chases the Singed and we all know you must NEVER chase the Singed. (jk it's a League of Legends reference)",
 'I love that she changes her hair and outfit depending on what character she reacting to',
 '

In [8]:
transcript_1 = get_transcript(video_id, youtube_api_key)

In [9]:
transcript_1

[{'text': "when i'm doing an assessment on who",
  'start': 9.519,
  'duration': 3.841},
 {'text': 'someone is the first thing that i do is',
  'start': 11.28,
  'duration': 4.239},
 {'text': 'i find out about their history', 'start': 13.36, 'duration': 4.64},
 {'text': 'where were they born what was that like',
  'start': 15.519,
  'duration': 4.721},
 {'text': 'what was their life experience i want to',
  'start': 18.0,
  'duration': 4.96},
 {'text': 'see the world through their eyes and',
  'start': 20.24,
  'duration': 5.279},
 {'text': 'what happens if you are born in a',
  'start': 22.96,
  'duration': 5.2},
 {'text': 'hostile environment and what happens if',
  'start': 25.519,
  'duration': 5.121},
 {'text': 'you feel like an outsider even in that',
  'start': 28.16,
  'duration': 5.36},
 {'text': 'hostile environment or you come from a',
  'start': 30.64,
  'duration': 6.16},
 {'text': 'society where you are looked down on by',
  'start': 33.52,
  'duration': 6.64},
 {'text': 

In [None]:
comments_1 = get_comments(video_id, youtube_api_key)

In [None]:
from youtube_transcript_api.formatters import TextFormatter

formatter = TextFormatter()

# .format_transcript(transcript) joins the transcript segments into plain text,
# one segment per line.
text_formatted = formatter.format_transcript(transcript_1)
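
The formatter discards timestamps. If the transcript is later used for retrieval, it can help to group segments into time windows first, so each chunk keeps a start time. A minimal sketch, assuming the segment dictionaries have the `text` and `start` keys shown in the `transcript_1` output; `group_segments` is a hypothetical helper, not part of `youtube_transcript_api`.

```python
def group_segments(segments, window_seconds=30.0):
    """Concatenate transcript segments into chunks of roughly `window_seconds`."""
    chunks, current, window_start = [], [], None
    for seg in segments:
        if window_start is None:
            window_start = seg["start"]
        # Close the current chunk once the window length has been reached.
        if current and seg["start"] - window_start >= window_seconds:
            chunks.append({"start": window_start, "text": " ".join(current)})
            current, window_start = [], seg["start"]
        current.append(seg["text"])
    if current:
        chunks.append({"start": window_start, "text": " ".join(current)})
    return chunks

# First three segments copied from the transcript output above.
sample = [
    {"text": "when i'm doing an assessment on who", "start": 9.519, "duration": 3.841},
    {"text": "someone is the first thing that i do is", "start": 11.28, "duration": 4.239},
    {"text": "i find out about their history", "start": 13.36, "duration": 4.64},
]
chunks = group_segments(sample, window_seconds=3.0)
for chunk in chunks:
    print(chunk["start"], chunk["text"])
```

The 3-second window is only there to make the tiny sample split into two chunks; a real transcript would use a much larger window.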

In [None]:
comments_1

['🔥 Improve your problem-solving skills with Brilliant — https://brilliant.org/georgiadow\n👀 My Full Arcane Video Series: https://www.youtube.com/playlist?list=PL3I0HsOf9M_S0HzHJKjhVwsr_wTNhdaF9',
 "Seems like Rio is Singed's daughter\nThink about it.",
 'your analysis/call-out of hugs is a comfort for me 🥺',
 'I have fallen in love with your arcane analysis videos! Your comments and enthusiasm show just how much you enjoy the characters, and it warms my heart to see someone love Viktor just as much as I do!',
 'I think Singed would be a good one to do this for',
 "I rly like Viktor and the fact he's disabled to me was rly relatable and I liked feeling like I could relate to a character especially during his younger scenes since I was also an outcast",
 "The tragedy here is that Viktor chases the Singed and we all know you must NEVER chase the Singed. (jk it's a League of Legends reference)",
 'I love that she changes her hair and outfit depending on what character she reacting to',
 '

In [None]:
text_formatted

"when i'm doing an assessment on who\nsomeone is the first thing that i do is\ni find out about their history\nwhere were they born what was that like\nwhat was their life experience i want to\nsee the world through their eyes and\nwhat happens if you are born in a\nhostile environment and what happens if\nyou feel like an outsider even in that\nhostile environment or you come from a\nsociety where you are looked down on by\nother people it makes a huge difference\nto our psychological makeup\nand if you don't really grasp that then\nyou're not going to be able to really\nunderstand who someone is\nso here we have victor he's a little boy\nand you see he's\nseparated from where all the other kids\nare and they're playing in this\ntoxic oil filled water but they're happy\nand they're all together and here he is\nwith this beautiful fanciful contraption\nthat we can tell that he's made himself\nand i love his eyes he's seen someone\nanother friend that's come over to take\na a look at wh

### Grab an API Key

To use Gemini you need an *API key*. You can create an API key with one click in [Google AI Studio](https://makersuite.google.com/).
After creating the API key, either set an environment variable named `GOOGLE_API_KEY` or pass the key as an argument to the LangChain classes that need it: `ChatGoogleGenerativeAI` for Google's `gemini` and `gemini-vision` models, and `GoogleGenerativeAIEmbeddings` for Google's Generative AI embedding model.

In this tutorial, you will set the environment variable `GOOGLE_API_KEY` to configure Gemini to use your API key.

In [1]:
# Run this cell and paste the API key in the prompt
import os
import getpass

os.environ['GOOGLE_API_KEY'] = getpass.getpass('Gemini API Key:')
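
The cell above prompts on every run. Since this notebook also loads a `.env` file with `load_dotenv`, one option is to prompt only when the variable is missing. The sketch below is an assumption about how that could look; `resolve_api_key` is a hypothetical helper, and the demonstration uses a plain dictionary and a fake prompt so it does not touch your real environment.

```python
import os
import getpass

def resolve_api_key(name="GOOGLE_API_KEY", environ=None, prompt=getpass.getpass):
    """Return the key from the environment, prompting only when it is missing."""
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        key = prompt(f"{name}: ").strip()
        environ[name] = key  # Cache it so later cells do not prompt again.
    return key

# Demonstration with a stand-in environment and a fake prompt.
demo_env = {}
first = resolve_api_key(environ=demo_env, prompt=lambda _: "demo-key")
second = resolve_api_key(environ=demo_env, prompt=lambda _: "should-not-be-used")
print(first == second)
```

In the notebook itself you would call `resolve_api_key()` with no arguments so that it reads `os.environ` and falls back to `getpass.getpass`.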

## Basic steps
LLMs are trained offline on a large corpus of public data. Hence they cannot answer questions based on custom or private data accurately without additional context.

If you want to make use of LLMs to answer questions based on private data, you have to provide the relevant documents as context alongside your prompt. This approach is called Retrieval Augmented Generation (RAG).

You will use this approach to create a question-answering assistant using the Gemini text model integrated through LangChain. The assistant is expected to answer questions about AI and climate change. To make this possible you will add more context to the assistant using data from a website.

In this tutorial, you'll implement the two main components in a RAG-based architecture:

1. Retriever

    Based on the user's query, the retriever retrieves relevant snippets that add context from the document. In this tutorial, the document is the website data.
    The relevant snippets are passed as context to the next stage - "Generator".

2. Generator

    The relevant snippets from the website data are passed to the LLM along with the user's query to generate accurate answers.

You'll learn more about these stages in the upcoming sections while implementing the application.
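
Before bringing in Gemini and Chroma, the two stages can be pictured with a dependency-free toy. This is only a sketch of the idea: the retriever ranks snippets by shared words instead of embeddings, the generator step stops at building the prompt, and the snippets are paraphrased from the blog post used later in this tutorial. `tokenize`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, snippets, k=1):
    """Retriever stand-in: rank snippets by the number of words shared with the query."""
    query_words = tokenize(query)
    ranked = sorted(snippets, key=lambda s: len(query_words & tokenize(s)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_snippets):
    """Generator input: the question plus the retrieved context."""
    context = "\n\n".join(context_snippets)
    return f"Question: {query}\nContext: {context}\nAnswer:"

snippets = [
    "AI can help mitigate 5-10% of global greenhouse gas emissions by 2030.",
    "Global leaders will gather in Dubai to build momentum for climate action.",
]
query = "How much of global emissions can AI help mitigate?"
context = retrieve(query, snippets, k=1)
prompt = build_prompt(query, context)
print(prompt)
```

In the real application, word overlap is replaced by embedding similarity in Chroma, and the prompt is sent to Gemini rather than printed.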

## Import the required libraries

In [None]:
# Import from `langchain_core` and `langchain_community`, matching the
# `langchain_community` import used earlier in this notebook.
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma

## Retriever

In this stage, you will perform the following steps:

1. Read and parse the website data using LangChain.

2. Create embeddings of the website data.

    Embeddings are numerical representations (vectors) of text, built so that text with similar meaning has similar embedding vectors. You'll use Gemini's embedding model to create the embedding vectors of the website data.

3. Store the embeddings in Chroma's vector store.
    
    Chroma is a vector database. The Chroma vector store retrieves similar vectors efficiently, so the text whose embeddings best match the user's question can be looked up easily and added to the LLM prompt as context.

4. Create a Retriever from the Chroma vector store.

    The retriever will be used to pass the relevant website text to the LLM along with user queries.
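
To make "similar embedding vectors" concrete, the sketch below ranks a few texts by cosine similarity to a query vector. The three-dimensional vectors are invented for illustration; real embedding vectors have far more dimensions, and the similarity measure Chroma uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors (assumption: not produced by any real embedding model).
vectors = {
    "climate action": [0.9, 0.1, 0.0],
    "reducing emissions": [0.8, 0.2, 0.1],
    "submit a claim": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query_vector, vectors[name]),
    reverse=True,
)
print(ranked)
```

A retriever with `k=1` would return only the first entry of such a ranking.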

### Read and parse the website data

LangChain provides a wide variety of document loaders. To read the website data as a document, you will use the `WebBaseLoader` from LangChain.

To know more about how to read and parse input data from different sources using the document loaders of LangChain, read LangChain's [document loaders guide](https://python.langchain.com/docs/integrations/document_loaders).

In [None]:
loader = WebBaseLoader("https://blog.google/outreach-initiatives/sustainability/report-ai-sustainability-google-cop28/")
docs = loader.load()

In [None]:
docs

[Document(page_content=", global leaders will gather in Dubai to build momentum for climate action. The United Nations’ Intergovernmental Panel on Climate Change (IPCC) forecasts that the world needs to reduce emissions by 43% by 2030. We believe that artificial intelligence (AI) and collective action can help achieve this goal and create a sustainable future for everyone.Today, we released a report with Boston Consulting Group (BCG), which shows that AI has the potential to help mitigate 5-10% of global greenhouse gas (GHG) emissions by 2030 — the equivalent of the total annual emissions of the European Union. Here’s a look at how we’re building AI that can drive climate progress, while at the same time working to mitigate AI’s environmental impact.\n\n\n\n\nAccelerating climate action with AIAI can have a transformative effect on climate progress. Already, it is starting to address climate challenges in three key areas: providing people and organizations with better information to ma

If you only want to select a specific portion of the website data to add context to the prompt, you can use regex, text slicing, or text splitting.

In this example, you'll use Python's `split()` function to extract the required portion of the text. The extracted text should be converted back to LangChain's `Document` format.

In [None]:
# Extract the text from the website data document
text_content = docs[0].page_content

# The text content between the substrings "Later this month at COP28" to
# "POSTED IN:" is relevant for this tutorial. You can use Python's `split()`
# to select the required content.
text_content_1 = text_content.split("Later this month at COP28",1)[1]
final_text = text_content_1.split("POSTED IN:",1)[0]

# Convert the text to LangChain's `Document` format
docs =  [Document(page_content=final_text, metadata={"source": "local"})]

In [None]:
docs

[Document(page_content=", global leaders will gather in Dubai to build momentum for climate action. The United Nations’ Intergovernmental Panel on Climate Change (IPCC) forecasts that the world needs to reduce emissions by 43% by 2030. We believe that artificial intelligence (AI) and collective action can help achieve this goal and create a sustainable future for everyone.Today, we released a report with Boston Consulting Group (BCG), which shows that AI has the potential to help mitigate 5-10% of global greenhouse gas (GHG) emissions by 2030 — the equivalent of the total annual emissions of the European Union. Here’s a look at how we’re building AI that can drive climate progress, while at the same time working to mitigate AI’s environmental impact.\n\n\n\n\nAccelerating climate action with AIAI can have a transformative effect on climate progress. Already, it is starting to address climate challenges in three key areas: providing people and organizations with better information to ma
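
The `split(...)[1]` call in the extraction cell raises an `IndexError` if the page no longer contains the start marker, and the second split silently returns everything when the end marker is missing. A slightly more defensive variant is sketched below with `str.partition`; `extract_between` is a hypothetical helper, and `sample` is a shortened stand-in for the page text.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text after `start_marker` and before `end_marker`.

    Raises ValueError when either marker is missing.
    """
    _, found_start, rest = text.partition(start_marker)
    if not found_start:
        raise ValueError(f"start marker not found: {start_marker!r}")
    body, found_end, _ = rest.partition(end_marker)
    if not found_end:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return body

# Shortened stand-in for docs[0].page_content.
sample = (
    "Intro. Later this month at COP28, global leaders will gather in Dubai. "
    "POSTED IN: Sustainability"
)
section = extract_between(sample, "Later this month at COP28", "POSTED IN:")
print(repr(section))
```

Failing loudly is a design choice here: if the website layout changes, an explicit error is easier to notice than a vector store that quietly contains the wrong text.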

### Initialize Gemini's embedding model

To create the embeddings from the website data, you'll use Gemini's embedding model, **embedding-001**, which supports creating text embeddings.

To use this embedding model, you have to import `GoogleGenerativeAIEmbeddings` from LangChain. To know more about the embedding model, read Google AI's [language documentation](https://ai.google.dev/models/gemini).

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# If there is no environment variable set for the API key, you can pass the API
# key to the parameter `google_api_key` of the `GoogleGenerativeAIEmbeddings`
# constructor: `google_api_key = "key"`.

gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

### Store the data using Chroma

To create a Chroma vector database from the website data, you will use the `from_documents` function of `Chroma`. Under the hood, this function creates embeddings from the documents created by the document loader of LangChain using any specified embedding model and stores them in a Chroma vector database.  

You have to specify the `docs` you created from the website data using LangChain's `WebBaseLoader` and the `gemini_embeddings` as the embedding model when invoking the `from_documents` function to create the vector database from the website data. You can also specify a directory in the `persist_directory` argument to store the vector store on disk. If you don't specify a directory, the data is kept in memory only and is lost when the process ends.


In [None]:
# Save to disk
vectorstore = Chroma.from_documents(
                     documents=docs,                 # Data
                     embedding=gemini_embeddings,    # Embedding model
                     persist_directory="./chroma_db" # Directory to save data
                     )
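
The store above holds a single document, which is why the retriever later uses `k=1`. For longer pages it is common to split the text into overlapping chunks and store each chunk as its own document. LangChain ships text splitters for this (for example `RecursiveCharacterTextSplitter`); the dependency-free sketch below only illustrates the idea, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=200, overlap=50):
    """Split `text` into chunks of `chunk_size` characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

Each chunk could then be wrapped as `Document(page_content=chunk, metadata={"source": "local"})` before calling `Chroma.from_documents`, at which point a `k` larger than 1 becomes meaningful.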

### Create a retriever using Chroma

You'll now create a retriever that can retrieve website data from the newly created Chroma vector store. This retriever can later be used to pass text that gives the LLM more context for answering the user's queries.


To load the vector store that you previously stored on disk, specify the directory that contains the vector store in the `persist_directory` argument and the embedding model in the `embedding_function` argument of Chroma's initializer.

You can then invoke the `as_retriever` function of `Chroma` on the vector store to create a retriever.

In [None]:
# Load from disk
vectorstore_disk = Chroma(
                        persist_directory="./chroma_db",       # Directory of db
                        embedding_function=gemini_embeddings   # Embedding model
                   )
# Get the Retriever interface for the store to use later.
# When an unstructured query is given to a retriever it will return documents.
# Read more about retrievers in the following link.
# https://python.langchain.com/docs/modules/data_connection/retrievers/
#
# Since only 1 document is stored in the Chroma vector store, search_kwargs `k`
# is set to 1 to decrease the `k` value of chroma's similarity search from 4 to
# 1. If you don't pass this value, you will get a warning.
retriever = vectorstore_disk.as_retriever(search_kwargs={"k": 1})

# Check if the retriever is working by trying to fetch the relevant docs related
# to the word 'climate'. If the length is greater than zero, it means that
# the retriever is functioning well.
print(len(retriever.get_relevant_documents("climate")))

1


## Generator

The Generator prompts the LLM for an answer when the user asks a question. The retriever you created in the previous stage from the Chroma vector store will be used to pass relevant text from the website data to the LLM, giving it more context for the user's query.

You'll perform the following steps in this stage:

1. Chain together the following:
    * The retriever, which fetches the website text relevant to the question.
    * A prompt template for answering any question using LangChain.
    * A Gemini LLM that receives the filled-in prompt.
    
2. Run the created chain with a question as input to prompt the model for an answer.


### Initialize Gemini

You must import `ChatGoogleGenerativeAI` from the `langchain_google_genai` package to initialize your model.
In this example, you will use **gemini-pro**, as it supports text summarization. To know more about the text model, read Google AI's [language documentation](https://ai.google.dev/models/gemini).

You can configure model parameters such as ***temperature*** or ***top_p*** by passing the appropriate values when initializing the `ChatGoogleGenerativeAI` LLM. To learn more about the parameters and their uses, read Google AI's [concepts guide](https://ai.google.dev/docs/concepts#model_parameters).
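
As a rough intuition for these two parameters, the sketch below applies the textbook definitions of temperature scaling and top-p (nucleus) filtering to three made-up logits. It is not a description of how Gemini implements sampling internally; the function names and the logits are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their total mass reaches `top_p`, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.1]  # Made-up scores for three candidate tokens.
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.85)
print(sharp, flat, nucleus)
```

With these numbers, the lower temperature concentrates more probability on the top token, and `top_p=0.85` removes the least likely of the three tokens from consideration.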

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

# If there is no environment variable set for the API key, you can pass the API
# key to the parameter `google_api_key` of the `ChatGoogleGenerativeAI`
# constructor: `google_api_key="key"`.
llm = ChatGoogleGenerativeAI(model="gemini-pro",
                 temperature=0.7, top_p=0.85)

### Create prompt templates

You'll use LangChain's [PromptTemplate](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/) to generate prompts to the LLM for answering questions.

In the `llm_prompt`, the variable `question` will be replaced later by the input question, and the variable `context` will be replaced by the relevant text from the website retrieved from the Chroma vector store.

In [None]:
# Prompt template to query Gemini
llm_prompt_template = """You are an assistant for question-answering tasks.
Use the following context to answer the question.
If you don't know the answer, just say that you don't know.
Use five sentences maximum and keep the answer concise.\n
Question: {question} \nContext: {context} \nAnswer:"""

llm_prompt = PromptTemplate.from_template(llm_prompt_template)

print(llm_prompt)

input_variables=['context', 'question'] template="You are an assistant for question-answering tasks.\nUse the following context to answer the question.\nIf you don't know the answer, just say that you don't know.\nUse five sentences maximum and keep the answer concise.\n\nQuestion: {question} \nContext: {context} \nAnswer:"
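
To see what the model eventually receives, the placeholder substitution can be mimicked with Python's built-in `str.format`. This is only an illustration with a shortened template and a placeholder context string; in the notebook, the LangChain counterpart would be `llm_prompt.format(question=..., context=...)`.

```python
# Shortened copy of the template above, filled with plain str.format.
template = (
    "Use the following context to answer the question.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)

filled = template.format(
    question="How can AI address climate challenges?",
    context="<retrieved website text goes here>",  # Placeholder, not real retrieved text.
)
print(filled)
```

Printing the filled prompt like this is a quick way to check that the context and the question land where you expect before spending an API call.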


### Create a stuff documents chain

LangChain provides [Chains](https://python.langchain.com/docs/modules/chains/) for chaining together LLMs with each other or other components for complex applications. You will create a **stuff documents chain** for this application. A stuff documents chain lets you combine all the relevant documents, insert them into the prompt, and pass that prompt to the LLM.

You can create a stuff documents chain using the [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/expression_language).

To learn more about different types of document chains, read LangChain's [chains guide](https://python.langchain.com/docs/modules/chains/document/).

The stuff documents chain for this application retrieves the relevant website data and passes it as the context to an LLM prompt along with the input question.

In [None]:
# Combine data from documents to readable string format.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Create stuff documents chain using LCEL.
#
# This is called a chain because you are chaining together different elements
# with the LLM. In the following example, to create the stuff chain, you will
# combine the relevant context from the website data matching the question, the
# LLM model, and the output parser together like a chain using LCEL.
#
# The chain implements the following pipeline:
# 1. Extract the website data relevant to the question from the Chroma
#    vector store and save it to the variable `context`.
# 2. `RunnablePassthrough` forwards the input given to `invoke()` unchanged,
#    so it becomes the value of `question`.
# 3. The `context` and `question` are then passed to the prompt where they
#    are populated in the respective variables.
# 4. This prompt is then passed to the LLM (`gemini-pro`).
# 5. Output from the LLM is passed through an output parser
#    to structure the model's response.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | llm_prompt
    | llm
    | StrOutputParser()
)
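
The `|` operator in the chain reads like a shell pipe: the output of each step becomes the input of the next. The toy class below imitates that behavior with `__or__` to show the data flow. It is a simplified sketch, not LangChain's actual `Runnable` implementation, and `Step`, the stub retriever, and the fake LLM are all hypothetical.

```python
class Step:
    """Wraps a function so that steps can be composed with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b returns a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Stand-ins for the retriever, prompt, LLM, and output parser.
retrieve_step = Step(lambda question: {"context": "stub context", "question": question})
prompt_step = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}\nAnswer:")
fake_llm = Step(lambda prompt: f"  [model output for a {len(prompt)}-character prompt]  ")
parse_step = Step(str.strip)

toy_chain = retrieve_step | prompt_step | fake_llm | parse_step
result = toy_chain.invoke("How can AI address climate challenges?")
print(result)
```

In `rag_chain`, the dictionary at the start plays the role of `retrieve_step`: it builds the `context` and `question` values that the prompt template expects.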

### Prompt the model

You can now query the LLM by passing any question to the `invoke()` function of the stuff documents chain you created previously.

In [None]:
rag_chain.invoke("How can AI address climate challenges?")

'AI can address climate challenges by providing better information for sustainable choices, improving predictions for climate adaptation, and optimizing climate action for high-impact applications. It can also help reduce emissions from data centers and contrails. However, it is important to manage the environmental impact of AI and ensure its sustainable and equitable use.'

## Conclusion

That's it. You have successfully created an LLM application that answers questions using data from a website with the help of Gemini, LangChain, and Chroma.