### Information about the Code

The code provides a small framework for searching YouTube, retrieving video transcripts, and answering questions about them with retrieval-augmented generation. It includes functionality to:

1. Fetch YouTube Transcripts: Using the YoutubeLoader from the langchain_community library, the code retrieves each video's transcript in 30-second chunks.

2. Search for YouTube Videos: The scrapetube library is used to search for videos, sorted by a user-selected criterion: relevance, upload date, view count, or rating.

3. Generate Document Embeddings: The OpenAIEmbeddings class from the langchain_openai library creates embeddings for the transcript documents so that they can be compared by semantic similarity.

4. Create and Utilize Vectorstores: The FAISS library indexes the document embeddings and retrieves the transcript chunks most relevant to a question.

5. Ask AI Questions: The ChatGroq model from langchain_groq answers questions using only the context retrieved from the video transcripts.

6. Display Video and Results: Helper functions embed videos, print transcripts, and present search results and AI-generated responses in the notebook.


##### Step 1: Import Libraries

In [1]:
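%pip install -qU langchain-groq
%pip install langchain-community
%pip install youtube-transcript-api
%pip install pytube
# Also required by the imports in the next cell
%pip install langchain-openai faiss-cpu scrapetube python-dotenv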

You should consider upgrading via the '/Users/taurangela/Desktop/Github/YouTubeEcho/env/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
Collecting pytube
  Using cached pytube-15.0.0-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-15.0.0

In [2]:
import os
import ssl
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import YoutubeLoader
from langchain_community.document_loaders.youtube import TranscriptFormat
import scrapetube
from IPython.display import display, HTML

##### Step 2: Setup Environment

In [3]:
# Disable HTTPS certificate verification for the whole Python process.
# Development only: this removes protection against man-in-the-middle attacks.
ssl._create_default_https_context = ssl._create_unverified_context

# Load environment variables
load_dotenv()

# Constants
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
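
os.getenv returns None for a variable that is not set, so a missing key only surfaces later as an authentication error. A minimal sketch of an early check follows; the helper name missing_keys and the REQUIRED_KEYS tuple are hypothetical and not part of the original notebook.

```python
import os

# Names of the variables the notebook reads in Step 2
REQUIRED_KEYS = ("OPENAI_API_KEY", "GROQ_API_KEY")

def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment
print(missing_keys(env={"OPENAI_API_KEY": "sk-example"}))
```

Calling missing_keys() with no arguments checks the real environment, so it could be run right after load_dotenv().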

##### Step 3: Define Function to Fetch Transcript

In [4]:
def fetch_transcript(url):
    try:
        transcript_loader = YoutubeLoader.from_youtube_url(
            url, 
            add_video_info=True, 
            transcript_format=TranscriptFormat.CHUNKS, 
            chunk_size_seconds=30
        )
        return transcript_loader.load()
    except Exception as e:
        print(f"An error occurred while fetching the transcript: {str(e)}")
        return []
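
With TranscriptFormat.CHUNKS, each document's source metadata points at the start of its chunk, as in the &t=360s URL visible in the references output at the end of this notebook. A small sketch of reading that offset back out of the URL follows; chunk_start_seconds and format_offset are hypothetical helpers, and the code assumes the offset always has the form of an integer followed by "s".

```python
from urllib.parse import urlparse, parse_qs

def chunk_start_seconds(source_url):
    """Extract the start offset in seconds from a URL such as ...&t=360s (0 if absent)."""
    query = parse_qs(urlparse(source_url).query)
    raw = query.get("t", ["0s"])[0]
    return int(raw.rstrip("s"))

def format_offset(seconds):
    """Render a number of seconds as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"

source = "https://www.youtube.com/watch?v=t6VYByDYg7c&t=360s"
print(format_offset(chunk_start_seconds(source)))
```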

##### Step 4: Define Function to Search for Videos

In [5]:
def search_videos(search_term, video_count=1, sorting_criteria="Most Relevant"):
    convert_sorting_option = {
        "Most Relevant": "relevance",
        "Upload Date": "upload_date",
        "View Count": "view_count", 
        "Rating": "rating"
    }
    
    if sorting_criteria not in convert_sorting_option:
        raise ValueError(f"Invalid sorting criteria: {sorting_criteria}. Valid options are: {', '.join(convert_sorting_option.keys())}")

    videos = scrapetube.get_search(
        query=search_term, 
        limit=video_count, 
        sort_by=convert_sorting_option[sorting_criteria]
    )
    return [
        {
            "video_id": video["videoId"],
            "video_title": video["title"]["runs"][0]["text"],
            "video_url": f"https://www.youtube.com/watch?v={video['videoId']}",
            "channel_name": video["longBylineText"]["runs"][0]["text"],
            "duration": video["lengthText"]["accessibility"]["accessibilityData"]["label"],
            "publish_date": video["publishedTimeText"]["simpleText"]
        }
        for video in videos
    ]
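
The list comprehension above indexes straight into the nested search result, so a single result that lacks a field raises KeyError and discards the whole search. A sketch of more defensive parsing follows, under the assumption that some results (for example live streams) may omit fields such as lengthText or publishedTimeText. The helpers dig and parse_video are hypothetical, and the sample dictionary is made up for illustration.

```python
def dig(data, *path, default="Unknown"):
    """Follow `path` (dict keys or list indexes) through nested data.

    Returns `default` as soon as any step is missing.
    """
    for step in path:
        try:
            data = data[step]
        except (KeyError, IndexError, TypeError):
            return default
    return data

def parse_video(video):
    """Build the same dictionary as search_videos, tolerating missing fields."""
    return {
        "video_id": video["videoId"],
        "video_title": dig(video, "title", "runs", 0, "text"),
        "video_url": f"https://www.youtube.com/watch?v={video['videoId']}",
        "channel_name": dig(video, "longBylineText", "runs", 0, "text"),
        "duration": dig(video, "lengthText", "accessibility", "accessibilityData", "label"),
        "publish_date": dig(video, "publishedTimeText", "simpleText"),
    }

# Made-up result with no duration or publish date
sample = {"videoId": "abc123", "title": {"runs": [{"text": "Live stream"}]}}
print(parse_video(sample))
```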


##### Step 5: Define Function for Document Embeddings

In [6]:
def embed_documents(documents):
    embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
    return embeddings.embed_documents(documents)

##### Step 6: Define Function to Create Vectorstore

In [7]:
def create_vectorstore(documents):
    embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)
    return FAISS.from_documents(documents, embeddings)
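
Conceptually, the vectorstore answers one question: which stored vectors lie closest to the query vector? The brute-force sketch below illustrates that idea with toy three-dimensional vectors standing in for real embeddings. It is not how FAISS is implemented, and the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors, k=2):
    """Return the indexes of the k vectors closest to `query`, nearest first."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query, vectors[i]))
    return ranked[:k]

# Toy "embeddings": the first two point in nearly the same direction
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(nearest([1.0, 0.05, 0.0], toy_vectors))
```

An index such as FAISS serves the same kind of query efficiently over many high-dimensional vectors.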

##### Step 7: Define Function to Ask AI Questions

In [8]:
def ask_question(messages):
    llm = ChatGroq(
        model="mixtral-8x7b-32768",  # Adjust as needed; check Groq's current model list, since hosted models are periodically retired
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2,
    )
    return llm.invoke(messages).content

##### Step 8: Define Function for RAG (Retrieval-Augmented Generation) Processing

In [9]:
def rag_with_video_transcript(transcript_docs, prompt):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
        length_function=len
    )
    
    splitted_documents = text_splitter.split_documents(transcript_docs)
    
    if len(splitted_documents) == 0:
        print("No documents to process.")
        return "No documents to process.", []

    try:
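        # Sanity check: embed the chunk texts once to confirm the embedding service responds.
        # Note: create_vectorstore below embeds the same chunks again, so this doubles the embedding calls.
        document_texts = [doc.page_content for doc in splitted_documents]
        document_embeddings = embed_documents(document_texts)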

        if len(document_embeddings) == 0 or len(document_embeddings[0]) == 0:
            print("Document embeddings are empty or invalid.")
            return "Document embeddings are empty or invalid.", []

        vectorstore = create_vectorstore(splitted_documents)
        retriever = vectorstore.as_retriever()
        # Using `invoke` to get relevant documents
        relevant_documents = retriever.invoke(prompt)
        
        context_data = " ".join(doc.page_content for doc in relevant_documents)
        final_prompt = f"""I have the following question: {prompt}
        To answer this question, we have the following information: {context_data}.
        Please use only the information provided here to answer the question. Do not include any external information.
        """
        
        AI_Response = ask_question([("system", "You are a helpful assistant."), ("human", final_prompt)])
        
        return AI_Response, relevant_documents

    except Exception as e:
        print(f"An error occurred during RAG processing: {str(e)}")
        return "An error occurred during RAG processing.", []
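
The triple-quoted f-string in the function above carries its source indentation into the prompt that is sent to the model. A sketch of assembling the same prompt without the stray whitespace follows; build_prompt is a hypothetical helper, not part of the original notebook.

```python
def build_prompt(question, chunks):
    """Join retrieved chunk texts into one context string and wrap it with the question."""
    context = " ".join(chunks)
    return (
        f"I have the following question: {question}\n"
        f"To answer this question, we have the following information: {context}\n"
        "Please use only the information provided here to answer the question. "
        "Do not include any external information."
    )

print(build_prompt("What is discussed?", ["First chunk.", "Second chunk."]))
```

Keeping prompt assembly in a pure function also makes it possible to inspect or test the prompt without calling the embedding or chat services.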

##### Step 9: Define Functions for Displaying Videos and Results

In [10]:
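def display_video(video_url):
    # YouTube refuses to render watch-page URLs inside an iframe, so switch to the /embed/ form.
    # Simple string replacement; assumes a plain https://www.youtube.com/watch?v=<id> URL.
    embed_url = video_url.replace("watch?v=", "embed/")
    html_code = f"""
    <iframe width="560" height="315" src="{embed_url}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    """
    display(HTML(html_code))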

def fetch_and_display_transcript(video_url):
    transcript_docs = fetch_transcript(video_url)
    display(HTML(f"<h3>Video Transcript for {video_url}</h3>"))
    for doc in transcript_docs:
        print(doc.page_content)
    return transcript_docs

def display_search_results(videos):
    for i, video in enumerate(videos):
        print(f"Video No: {i+1}")
        display_video(video["video_url"])
        print(f"Title: {video['video_title']}")
        print(f"Channel: {video['channel_name']}")
        print(f"Duration: {video['duration']}")
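
YouTube serves embeddable players from https://www.youtube.com/embed/<id> rather than from the watch-page URL. A sketch of a URL-parsing conversion that also carries a start offset across follows; to_embed_url is a hypothetical helper, and it assumes the video ID is in the v query parameter.

```python
from urllib.parse import urlparse, parse_qs

def to_embed_url(watch_url):
    """Convert https://www.youtube.com/watch?v=<id>[&t=<n>s] to the /embed/ form."""
    query = parse_qs(urlparse(watch_url).query)
    video_id = query["v"][0]
    embed = f"https://www.youtube.com/embed/{video_id}"
    if "t" in query:
        # The embedded player takes its start offset, in seconds, as ?start=
        start = query["t"][0].rstrip("s")
        embed += f"?start={start}"
    return embed

print(to_embed_url("https://www.youtube.com/watch?v=t6VYByDYg7c&t=360s"))
```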

##### Step 10: Execute Example Usage

In [12]:
# 1. Search for Videos
search_term = "AI and Machine Learning"
video_count = 3
sorting_criteria = "Most Relevant"  # Must be one of the keys of convert_sorting_option
videos = search_videos(search_term, video_count, sorting_criteria)
display_search_results(videos)

Video No: 1


Title: AI vs Machine Learning
Channel: IBM Technology
Duration: 5 minutes, 49 seconds
Video No: 2


Title: AI, Machine Learning, Deep Learning and Generative AI Explained
Channel: IBM Technology
Duration: 10 minutes, 1 second
Video No: 3


Title: Machine Learning | What Is Machine Learning? | Introduction To Machine Learning | 2024 | Simplilearn
Channel: Simplilearn
Duration: 7 minutes, 52 seconds


In [13]:
# 2. Fetch and Display Transcript for a Selected Video
video_url = "https://www.youtube.com/watch?v=t6VYByDYg7c"  # Replace with actual URL
display_video(video_url)
transcript_docs = fetch_and_display_transcript(video_url)


This laptop, cell phone, tablet and headphones
together cost more than $3,500, that's about the same as two and a half
months rent for the average American. Let’s face it. Apple products
have never been cheap. And the cost of some of its products
has increased dramatically over time. Just look at how the price of the
iPhone has increased over the years.
What started at $499 in
2007, now starts at $999. So what makes these
products so pricey? Well, some say, it boils down to no other reason than the
fact that Apple can convince us to pay the hefty price. There’s even an unofficial
term for this phenomenon. It’s called The Apple Tax which describes
the extra money customers are willing to pay for an Apple product over a competitor
product with similar features. And, often it’s attributed to the so-called
“cool factor” associated with Apple.
It’s those premium prices that helped catapult Apple into
becoming one of the world’s most valuable companies. And at the start of 2019, it announced

In [14]:
# 3. Ask a Question About the Transcript
prompt = "What are the main topics discussed in this video?"
AI_Response, relevant_documents = rag_with_video_transcript(transcript_docs, prompt)
print("ANSWER:")
print(AI_Response)
print("REFERENCES:")
for doc in relevant_documents:
    print(f"Source: {doc.metadata}")
    print(doc.page_content)

ANSWER:
Based on the information provided, the main topics discussed in the video include:

1. The decline of Apple's share in the smartphone market and the rise of Huawei.
2. The significance of Apple's decision not to report units sold of iPhones and the potential impact on sales.
3. The need for Apple to innovate and not just raise prices to maintain its position as one of the world's most valuable companies.
4. The argument that creating innovative products is not cheap and Apple will not compromise on quality for price.
5. The role of innovation in Apple's success, with examples such as the original iPod and iPhone.
6. The impact of the iPhone on Apple's growth and the record-breaking shipments of the device.
7. The concern about Apple's ability to continue innovating in the future.
REFERENCES:
Source: {'source': 'https://www.youtube.com/watch?v=t6VYByDYg7c&t=360s', 'title': 'Why is Apple so expensive? | CNBC Explains', 'description': 'Unknown', 'view_count': 3474101, 'thumbnail_u