<a href="https://colab.research.google.com/github/priyariyyer/AIML_Projects/blob/main/ChatBot_GitBook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Understanding Problem and Objective

In [20]:
# The objective of this project is to build a chatbot that answers questions about topics covered in the GitBook documentation.
# It summarizes a topic so the reader does not have to go through every page of the documentation.
# The project follows a RAG (retrieval-augmented generation) approach:
  # Step1: Loading Data
  # Step2: Chunking/Splitting
  # Step3: Tokenizing & Embedding Chunks
  # Step4: Indexing & Storing Context
  # Step5: Retrieving & Generating Response
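
The five steps can be illustrated end to end with a toy example that uses only the standard library. This is a minimal sketch, not the pipeline built in this notebook: the hard-coded `document`, the word-count "embedding", and the helper names `chunk`, `embed`, and `score` are all hypothetical stand-ins for the web loader, the embedding model, and the vector database used in the later steps.

```python
import re
from collections import Counter

def chunk(text: str, size: int) -> list[str]:
    """Step 2: split text into fixed-size character chunks (no overlap, for brevity)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Step 3: a toy 'embedding' that counts lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def score(query_vec: Counter, chunk_vec: Counter) -> int:
    """Number of word occurrences shared by the query and the chunk."""
    return sum((query_vec & chunk_vec).values())

# Step 1: "load" a document (a hard-coded string standing in for a web page).
document = "GitBook syncs with GitHub. Pages live inside spaces. Spaces hold pages."

# Steps 2-4: chunk, embed, and keep (vector, chunk) pairs in a list acting as the index.
index = [(embed(c), c) for c in chunk(document, 27)]

# Step 5: retrieve the best-matching chunk for a question.
question = "github sync"
question_vec = embed(question)
best_vec, best_chunk = max(index, key=lambda pair: score(question_vec, pair[0]))
print(best_chunk)
```

In the real pipeline the retrieved chunk would be passed to an LLM as context; here the sketch stops at retrieval.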

# Step1: Loading Data

In [21]:
! pip install langchain_community



In [22]:
# import required libraries
import langchain
from langchain_community.document_loaders import WebBaseLoader
import bs4

In [23]:
#load all details from gitbook documentation
loader = WebBaseLoader(web_paths=("https://docs.gitbook.com/",))
# all_docs = loader.load_and_split()
all_docs = loader.load()
all_docs

[Document(metadata={'source': 'https://docs.gitbook.com/', 'title': 'GitBook Documentation | GitBook Documentation', 'description': 'Create and publish beautiful documentation your users will love. GitBook has all the tools you need to create everything from product guides to API references and beyond.', 'language': 'en'}, page_content='GitBook Documentation | GitBook DocumentationExplore how adaptive content transforms your docs into a dynamic, tailored experience for every user.Read the docsCtrlKGitBook AssistantAskProductPricingLog inSign upMore🇺🇸 EnglishDocumentationDevelopersGuidesResourcesGetting StartedGitBook DocumentationQuickstartImporting contentGitHub & GitLab SyncEnabling GitHub SyncEnabling GitLab SyncContent configurationGitHub pull request previewCommit messages & AutolinkMonoreposTroubleshootingCreating ContentFormatting your contentInline contentMarkdownContent structureSpacesPagesCollectionsBlocksParagraphsHeadingsUnordered listsOrdered listsTask listsHintsQuot

In [24]:
all_docs[0].page_content

'GitBook Documentation | GitBook DocumentationExplore how adaptive content transforms your docs into a dynamic, tailored experience for every user.Read the docsCtrlKGitBook AssistantAskProductPricingLog inSign upMore🇺🇸 EnglishDocumentationDevelopersGuidesResourcesGetting StartedGitBook DocumentationQuickstartImporting contentGitHub & GitLab SyncEnabling GitHub SyncEnabling GitLab SyncContent configurationGitHub pull request previewCommit messages & AutolinkMonoreposTroubleshootingCreating ContentFormatting your contentInline contentMarkdownContent structureSpacesPagesCollectionsBlocksParagraphsHeadingsUnordered listsOrdered listsTask listsHintsQuotesCode blocksFilesImagesEmbedded URLsTablesCardsTabsExpandableStepperDrawingsMath & TeXPage linksColumnsConditional contentButtonsIconsExpressionsVariables and expressionsReusable contentSearching internal contentSearch & Quick findGitBook AIWriting with GitBook AIVersion controlTranslationsAPI ReferencesOpenAPIAdd an OpenAPI specificat

# Step2: Chunking of Data

In [25]:
print("No of characters in page content : ", len(all_docs[0].page_content))

No of characters in page content :  3293


In [26]:
!pip install langchain-text-splitters



In [27]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 10
)
doc_splits = text_splitter.split_documents(all_docs)
doc_splits

[Document(metadata={'source': 'https://docs.gitbook.com/', 'title': 'GitBook Documentation | GitBook Documentation', 'description': 'Create and publish beautiful documentation your users will love. GitBook has all the tools you need to create everything from product guides to API references and beyond.', 'language': 'en'}, page_content='GitBook Documentation | GitBook DocumentationExplore how adaptive content transforms your docs into'),
 Document(metadata={'source': 'https://docs.gitbook.com/', 'title': 'GitBook Documentation | GitBook Documentation', 'description': 'Create and publish beautiful documentation your users will love. GitBook has all the tools you need to create everything from product guides to API references and beyond.', 'language': 'en'}, page_content='docs into a dynamic, tailored experience for every user.Read the docsCtrlKGitBook'),
 Document(metadata={'source': 'https://docs.gitbook.com/', 'title': 'GitBook Documentation | GitBook Documentation', 'description': 'C
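
The output above shows the overlap at work: the first chunk ends with "your docs into" and the second begins with "docs into". The sketch below illustrates how `chunk_size` and `chunk_overlap` interact using a plain sliding window. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself first splits on separators such as paragraphs, lines, and spaces before merging pieces, so its chunk boundaries will differ.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for c in chunks:
    print(c)
```

Each chunk repeats the last two characters of the previous one, which keeps a little shared context across chunk boundaries at the cost of storing some text twice.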

# Step3: Tokenizing & Embedding Chunks

In [28]:
!pip install transformers



In [29]:
from transformers import AutoTokenizer
model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

In [30]:
# Count tokens per chunk; no tensors or GPU are needed just to measure length
tokens_count = [len(tokenizer(doc.page_content).input_ids) for doc in doc_splits]
tokens_count

[22,
 20,
 22,
 15,
 25,
 16,
 22,
 19,
 18,
 22,
 18,
 25,
 19,
 21,
 22,
 17,
 18,
 20,
 19,
 22,
 24,
 20,
 24,
 20,
 22,
 19,
 30,
 24,
 18,
 20,
 22,
 21,
 21,
 24,
 20,
 18,
 14,
 26,
 3]
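
A quick summary of these counts makes it easier to judge whether the chunks fit a model's input limit. The snippet below is a small sketch over the values printed above; `max_tokens` is a hypothetical budget chosen for illustration, and because the counts come from the Llama tokenizer they only approximate what a different embedding model's tokenizer would produce.

```python
from statistics import mean

# Token counts copied from the cell output above (one entry per chunk).
token_counts = [22, 20, 22, 15, 25, 16, 22, 19, 18, 22, 18, 25, 19,
                21, 22, 17, 18, 20, 19, 22, 24, 20, 24, 20, 22, 19,
                30, 24, 18, 20, 22, 21, 21, 24, 20, 18, 14, 26, 3]

max_tokens = 384  # hypothetical per-chunk budget, not taken from this notebook

summary = {
    "chunks": len(token_counts),
    "min": min(token_counts),
    "max": max(token_counts),
    "mean": round(mean(token_counts), 1),
}
over_budget = [n for n in token_counts if n > max_tokens]

print(summary)
print("chunks over budget:", len(over_budget))
```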

In [31]:
def token_len(text):
  # Return the number of tokens in the text; runs on CPU, no GPU required
  return len(tokenizer(text).input_ids)

In [33]:
!pip install langchain-huggingface



In [34]:
from langchain_huggingface import HuggingFaceEmbeddings

# The embedding model tokenizes and embeds the text itself, so there is no need to run AutoTokenizer on the document chunks first.
model_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")


In [36]:
# embed_documents expects a list of strings, one per chunk
embeddings = model_embed.embed_documents([doc.page_content for doc in doc_splits])
len(embeddings[0])  # embedding dimension of the first chunk

768

# Step4: Indexing & Storing

In [37]:
!pip install langchain-pinecone


In [40]:
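import os
import getpass
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

# Read the Pinecone API key from the environment (or prompt for it) instead of hard-coding it in the notebook
if not os.environ.get("PINECONE_API_KEY"):
  os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter API Key for Pinecone: ")

pc_store = Pinecone(api_key=os.environ["PINECONE_API_KEY"])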

index_name = "gitbook"

if pc_store.has_index(index_name):
  print("Index already exists")
  # pc_store.delete_index(index_name)
  index = pc_store.Index(index_name)
else:
  pc_store.create_index(
      name = index_name,
      dimension = len(embeddings[0]),
      metric = 'cosine',
      spec = ServerlessSpec(
          cloud='aws', region='us-east-1')
  )
  index = pc_store.Index(index_name)

print(index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}
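
The index is created with `metric = 'cosine'`, so stored vectors are ranked by cosine similarity to the query vector. The following sketch shows that computation on tiny hand-made vectors; `cosine_similarity` is a hypothetical helper written for illustration and says nothing about how Pinecone computes or approximates the score internally.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(a) != len(b):
        # The same constraint applies to the index: every vector must match its dimension.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # longer vector, same direction
orthogonal = [0.0, 3.0, 0.0]       # no shared direction

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Cosine similarity ignores vector length and compares direction only, which is why the scaled copy of the query scores as a perfect match.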


In [41]:
# Create a vector store that embeds text with the embedding model and writes it to the vector DB in one go
vector_store = PineconeVectorStore(
    index=index,
    embedding=model_embed
)

In [42]:
# Embed the chunks and store them in the vector DB; returns the IDs of the stored vectors
doc_ids = vector_store.add_documents(doc_splits)
doc_ids

['855937ef-d724-4729-95f6-b1616e61f9f8',
 'c0e0c586-f319-49fe-a399-56a67241bbc7',
 'f65ae2e7-2b6e-42d3-8d3d-c6d447050ae8',
 '99ea4bd0-8832-4a62-be41-9760f77f5e7c',
 '7d4e0458-b6b6-47b1-8e63-8e17851f56e5',
 '900fecdd-6ce1-4b38-99b0-4da5e4d03eee',
 '8a53e1a6-ff31-4f59-ae8d-1bb1eb89a8eb',
 '5c251619-2a27-48c6-81e1-222dd5609282',
 'e1f2cc28-f664-4c61-b3c9-771cef19b17e',
 '884581a4-96ae-4afe-8231-3f32b090e507',
 '74b26b5a-34d5-4713-9386-4a46cf8f05fb',
 '43fb5e6b-7254-41f7-bdd8-863d46493cf2',
 '946b17e0-2928-4620-9f1c-7315ea021460',
 '45e66ac2-5e43-44ce-9e03-621b7dce9c8b',
 'afdf9360-357d-4cfb-828b-e001d29efb37',
 '1acd5d87-e7f7-4d63-a045-927bd30d2b2a',
 'd0714698-2752-418b-afb0-ee5aa309eb53',
 '2499a08a-1771-47c9-b95b-e305cec26734',
 '6722e846-50a5-4e53-93cf-f4c7b0a018c0',
 '5c4ca474-0030-45be-ba80-088363bc557e',
 'e9bc9dcb-b6d3-41ac-b481-ee89548e279e',
 '5d8027f5-5bf7-4428-86c1-f612ea0e1f6d',
 '93140b54-eed2-459c-bfac-18a91cf5c69b',
 'c4b8b22f-aeb9-4755-b3ca-a87aa34c258d',
 '94239480-6be0-

In [44]:
#check if storage is done correctly
print(index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 39}},
 'total_vector_count': 39,
 'vector_type': 'dense'}


# Step5: Retrieving/Querying Data

In [45]:
!pip install langgraph


In [46]:
!pip install "langchain[groq]"



In [68]:
#define the LLM model
from langchain_groq import ChatGroq
import os
import getpass
if not os.environ.get("GROQ_API_KEY"):
  os.environ["GROQ_API_KEY"] = getpass.getpass("Enter API Key for Groq: ")

llm_model = ChatGroq(model = "llama-3.1-8b-instant", groq_api_key= os.environ["GROQ_API_KEY"])

# from langchain.chat_models import init_chat_model
# llm_model = init_chat_model("llama-3.1-8b-instant", model_provider="groq")

Enter API Key for Groq: ··········


In [62]:
#import required libraries for retrieval and response generation
from langchain import hub #used to pull the shared RAG prompt
from langgraph.graph import StateGraph, START
from langchain_core.documents import Document
from typing_extensions import List, TypedDict

#define system prompts
prompt = hub.pull("rlm/rag-prompt")

#define State to store context, query and response
class State(TypedDict):
  question : str
  answer : str
  context : List[Document]

#define the retrieval step using similarity search
def retrieve(state: State):
  retrieved_docs = vector_store.similarity_search(state['question']) #implicitly embeds the query with model_embed
  return {"context": retrieved_docs} #LangGraph merges the returned keys into the state

#define the response generation step
def generate_response(state: State):
  docs_content = "\n\n".join(doc.page_content for doc in state['context'])
  messages = prompt.invoke({"question": state['question'], "context": docs_content})
  response = llm_model.invoke(messages)
  return {"answer": response.content}

#compile the chatbot
graph_builder = StateGraph(State).add_sequence([retrieve, generate_response]) #initialize the graph
graph_builder.add_edge(START, "retrieve") #set the starting node in the graph
graph = graph_builder.compile()
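
Each node in the graph receives the current state and returns only the keys it wants to update. The sketch below mimics that visible behavior with plain functions so the data flow is easier to follow. It is not LangGraph's implementation: `run_sequence`, the two-sentence `corpus`, and the canned answer are hypothetical stand-ins for the compiled graph, the vector store, and the LLM.

```python
import re
from typing import Callable, TypedDict

class State(TypedDict, total=False):
    question: str
    context: list[str]
    answer: str

def words(text: str) -> set[str]:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(state: State) -> dict:
    # Toy retrieval: keep passages that share at least one word with the question.
    corpus = ["GitBook is a documentation platform", "Pinecone stores vectors"]
    hits = [doc for doc in corpus if words(state["question"]) & words(doc)]
    return {"context": hits}

def generate_response(state: State) -> dict:
    # Toy generation: a real node would send the context to an LLM.
    return {"answer": f"Found {len(state['context'])} relevant passage(s)."}

def run_sequence(nodes: list[Callable[[State], dict]], initial: State) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial)
    for node in nodes:
        state.update(node(state))
    return state

result = run_sequence([retrieve, generate_response], {"question": "What is GitBook?"})
print(result["answer"])
```

Returning partial updates keeps each node focused on one responsibility: `retrieve` only knows about `context`, and `generate_response` only writes `answer`.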


In [67]:
# Use the chatbot by sending your query
response = graph.invoke({"question": "What is GitBook?"})
print(response['answer'])

GitBook is a platform that provides a user-friendly and collaborative solution for creating, editing, and sharing product and API documentation. It offers a feature called adaptive content that transforms documentation into interactive experiences. GitBook's mission is to provide a simple solution for documentation needs.
