<a href="https://colab.research.google.com/github/mahesh-from-sirsi/All_My_AI_Work/blob/main/MaheshVShet_BuildFastWithAI_Module2_5_Chatbot_on_Any_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chatbot with Website/YouTube Video
This guide walks through building a question-answering chatbot over a website or a YouTube video transcript using Retrieval-Augmented Generation (RAG) with LangChain and Chroma.

### Installing Dependencies

In [1]:
%pip install -qU langchain-community  langchain langchain-openai requests chromadb beautifulsoup4


### Storing API keys

- Get OpenAI API key: https://platform.openai.com/account/api-keys

In [2]:
import os

os.environ["OPENAI_API_KEY"] = ""
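
Hardcoding a key in a notebook cell makes it easy to leak when the notebook is shared. One possible alternative, shown here as a minimal sketch, is to read the key from the environment and prompt for it only when it is missing. The helper name `ensure_api_key` and its `prompt_fn` parameter are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in the environment, prompting only if it is missing.

    `ensure_api_key` and `prompt_fn` are hypothetical names used for this sketch.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        # getpass hides the typed value, so the key never appears in cell output
        key = prompt_fn(f"Enter {name}: ").strip()
        if not key:
            raise ValueError(f"{name} must not be empty")
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above would leave the key out of the saved notebook.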

## Chat with Website Using ChromaDB


### Import Required Libraries

In [3]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory



### Load Website Content

In [4]:
def load_website(url):
    loader = WebBaseLoader(url)
    data = loader.load()
    return data

# Example usage
url = "https://www.buildfastwithai.com/"  # Replace with your target website
website_data = load_website(url)

In [7]:
website_data

[Document(metadata={'source': 'https://www.buildfastwithai.com/', 'title': 'Build Fast with AI', 'description': 'Build Fast with AI - a vibrant community of AI builders, innovators, and enthusiasts. Whether you are an entrepreneur, a product manager, a developer, or anyone intrigued by AI, this is your platform to learn, grow, and innovate.', 'language': 'en'}, page_content='Build Fast with AIGenAI BootcampCorporate TrainingAI WorkshopsBuildFast StudioMore Sign In Sign In Ask toBuildFast BotHey! Wanna know about Generative AI Crash Course?What will I learn?How can I join?What\'s the course duration?What\'s the course fee?What\'s the course syllabus?SendGenAI 2025 Launch PadTransform AI Ideas into RealityJoin 20,000+ professionals mastering practical AI development at Build Fast with AILearn. Build. Deploy.Begin Your AI JourneyJoin WaitlistNow Partnering with Industry LeadersWe don’t just build with AI. We build with people.Partnering with TWO AILeading DevRel for SUTRALeading develop

### Split Content into Chunks

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(website_data)

In [9]:
len(splits)

16
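
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, lines, and spaces, and treats `chunk_size` as an upper bound); the function name `sliding_chunks` is a hypothetical one chosen for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some characters twice.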

### Initialize Embeddings and Chroma Vector Store

In [6]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a Chroma vector store
vectorstore = Chroma.from_documents(splits, embeddings)
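
Conceptually, the vector store embeds each chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The sketch below illustrates that idea with the standard library only. It assumes a toy bag-of-words "embedding" in place of `text-embedding-3-small`, and the sample chunks are invented for illustration; Chroma's actual indexing and distance computation differ.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Invented sample chunks, not taken from the website
sample_chunks = [
    "the bootcamp teaches generative ai development",
    "workshops cover ai automation tools",
    "contact the team by email",
]
print(top_k("what does the bootcamp teach", sample_chunks, k=1))
```

In the RAG pipeline, the retrieved chunks are then passed to the LLM as context for answering the question.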

### Set Up Conversational Retrieval Chain

In [7]:
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True),
)



### Chat Function

In [8]:
def chat_with_website(query):
    result = qa.invoke({"question": query})
    return result['answer']

# Example usage
query = "What is the main topic of this website?"
response = chat_with_website(query)
print(f"Human: {query}")
print(f"AI: {response}")

Human: What is the main topic of this website?
AI: The main topic of the website is focused on AI tools and automation, particularly in the context of building applications and websites with minimal coding. It offers workshops, courses, and resources for learning AI automation, generative AI, and using AI tools for tasks like coding, finance, and data analysis. The website appears to aim at making AI technology accessible to both technical and non-technical users, with an emphasis on practical applications and skill-building.


In [9]:
# Example usage
query = "Tell me about the Generative AI Bootcamp"
response = chat_with_website(query)
print(f"Human: {query}")
print(f"AI: {response}")

Human: Tell me about the Generative AI Bootcamp
AI: The Generative AI Bootcamp offered on the website is an intensive program designed to help you become a Generative AI developer. The bootcamp covers building and deploying cutting-edge AI applications, from large language models (LLMs) to retrieval-augmented generation (RAG), and takes projects from prototypes to production. The program includes comprehensive lectures and hands-on projects, and it is taught by the founder, Satvik Paramkusham, an AI Developer & Consultant and IIT Delhi alumnus. There are no prerequisites required to join the course.


In [10]:
# Example usage
query = "Who is the author of this website and who is the mentor"
response = chat_with_website(query)
print(f"Human: {query}")
print(f"AI: {response}")

Human: Who is the author of this website and who is the mentor
AI: The author of the website focused on AI tools and automation, and who serves as the mentor for the Generative AI Bootcamp is Satvik, the founder of Build Fast with AI.


## Chat with YouTube Video Using ChromaDB

### Install the required dependencies:

In [12]:
%pip install -qU langchain-community langchain langchain-openai requests chromadb youtube_transcript_api pytube





### Import Required Libraries

In [13]:
import re  # extracts the video ID from the URL

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.schema import Document  # wraps the transcript text for the splitter
from youtube_transcript_api import YouTubeTranscriptApi  # fetches the transcript

### Load Video Transcript

In [14]:
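def load_video_transcript(video_url):
    """Fetch a YouTube video's transcript and wrap it in a single Document."""

    def extract_video_id(url):
        # First pattern: watch?v=, youtu.be/ and /embed/ URLs.
        # Second pattern: watch URLs where v= is not the first query parameter.
        patterns = [
            r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?#]+)',
            r'youtube\.com\/watch\?.*v=([^&\n?#]+)',
        ]
        for pattern in patterns:
            match = re.search(pattern, url)
            if match:
                return match.group(1)
        return None

    video_id = extract_video_id(video_url)
    if video_id is None:
        # Fail early with a clear message instead of passing None to the API
        raise ValueError(f"Could not extract a video ID from: {video_url}")

    api = YouTubeTranscriptApi()
    transcript_data = api.fetch(video_id)

    # Join the text of every transcript snippet into one string
    full_text = " ".join(snippet.text for snippet in transcript_data.snippets)

    return [Document(
        page_content=full_text,
        metadata={
            'source': video_url,
            'video_id': video_id,
            'language': transcript_data.language_code
        }
    )]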
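
To see which URL shapes the two patterns accept, the nested helper can be pulled out and exercised on its own. This is a standalone sketch of the same regular expressions; the playlist parameter `list=PL123` in the last example is a made-up value.

```python
import re

# Same two patterns as in load_video_transcript
VIDEO_ID_PATTERNS = [
    r'(?:youtube\.com\/watch\?v=|youtu\.be\/|youtube\.com\/embed\/)([^&\n?#]+)',
    r'youtube\.com\/watch\?.*v=([^&\n?#]+)',
]


def extract_video_id(url):
    """Return the video ID found in a YouTube URL, or None if no pattern matches."""
    for pattern in VIDEO_ID_PATTERNS:
        match = re.search(pattern, url)
        if match:
            return match.group(1)
    return None


examples = [
    "https://www.youtube.com/watch?v=bCz4OMemCcA",
    "https://youtu.be/bCz4OMemCcA?t=30",
    "https://www.youtube.com/embed/bCz4OMemCcA",
    "https://www.youtube.com/watch?list=PL123&v=bCz4OMemCcA",  # hypothetical playlist ID
]
for example_url in examples:
    print(example_url, "->", extract_video_id(example_url))
```

Each of these URLs yields `bCz4OMemCcA`; a URL from another site yields `None`.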

In [15]:
# Example usage
# video_url = "https://www.youtube.com/watch?v=hzUuklUo5NA"  # Replace with your target video
# video_data = load_video_transcript(video_url)


video_url = "https://www.youtube.com/watch?v=bCz4OMemCcA"  # Replace with your target video
video_data = load_video_transcript(video_url)

In [16]:
video_data

[Document(metadata={'source': 'https://www.youtube.com/watch?v=bCz4OMemCcA', 'video_id': 'bCz4OMemCcA', 'language': 'en'}, page_content="hello guys welcome to my video about the Transformer and this is actually the person 2.0 of my series on the Transformer I had a previous video in which I talked about the Transformer but the audio quality was not good and as suggested by my viewers as the video was really uh had a huge success the viewers suggested me to to improve their audio quality so this this is why I'm doing this video uh you don't have to watch the previous series because I would be doing basically the same things but with some improvements so I'm actually compensating from some mistakes I made or from some improvements that I could add after watching this video I suggest watch my watching my other video about or how to code a Transformer model from scratch so how to code the model itself how to train it online data and how to inference it stick it with me because it's gonna b

### Split Content into Chunks

In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(video_data)

### Initialize Embeddings and Chroma Vector Store

In [18]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a Chroma vector store
vectorstore = Chroma.from_documents(splits, embeddings)

### Set Up Conversational Retrieval Chain

In [19]:
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

qa = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True),
)

### Chat Function

In [20]:
def chat_with_video(query):
    result = qa.invoke({"question": query})
    return result['answer']

# Example usage
query = "What is the main topic of this video?"
response = chat_with_video(query)
print(f"Human: {query}")
print(f"AI: {response}")

Human: What is the main topic of this video?
AI: The main topic of the video is about the Transformer model, specifically a revised version (2.0) of a previous series on the Transformer. The video aims to improve audio quality and discusses the structure of the Transformer model, how to code it from scratch, train it on a dataset, and perform inference.


### Create an Interactive Chat Interface

In [25]:
from IPython.display import display, HTML
from ipywidgets import widgets

chat_history = []

def on_send_button_clicked(b):
    query = input_box.value
    input_box.value = ""

    response = chat_with_video(query)

    chat_history.append(f"Human: {query}")
    chat_history.append(f"AI: {response}")

    output.clear_output()
    with output:
        print("\n".join(chat_history))

input_box = widgets.Text(description="You:")
send_button = widgets.Button(description="Send")
output = widgets.Output()

send_button.on_click(on_send_button_clicked)

display(HTML("<h3>Chat with Video</h3>"))
display(widgets.VBox([input_box, send_button, output]))


### Classwork 1

1. Try different open-source models using Together.




### Classwork 2

1. Create a bot for a famous personality (Bill Gates, Mahatma Gandhi, etc.) with system instructions and an image.
2. Create a bot for a use case or scenario (interview prep, a chatbot for a specific service, etc.).

1. Create a QA engine on CSV, audio, video, or PDF data.
2. Experiment with different chunk sizes, models, and vector databases.