*My friends asked me to post something that would help them learn stuff like this without actually spending hours on tutorials from various sources.*
*So I made this as descriptive as possible so even a **beginner** might understand. This notebook combines my learnings from various sources.* 

***I hope this helps.***


# YouTube Video Summarizer

This notebook aims to summarize YouTube video transcripts using different techniques.

In [3]:
import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings

from langchain._api import LangChainDeprecationWarning
warnings.simplefilter("ignore", category=LangChainDeprecationWarning)

## The TF-IDF (Term Frequency-Inverse Document Frequency) approach.



1. **Transcript Extraction**: The summarizer begins by extracting the transcript from a YouTube video using the `YouTubeTranscriptApi` library. The transcript is converted into a single text string.

2. **Sentence Tokenization**: The text is then split into individual sentences using NLTK's `sent_tokenize` function, which helps maintain the integrity of each sentence for further processing.

3. **TF-IDF Calculation**: The TF-IDF vectorizer from the `scikit-learn` library is used to calculate the importance of each sentence. TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It is calculated by multiplying two metrics:
   - **Term Frequency (TF)**: This measures how frequently a term appears in a document. The more frequently a term appears, the higher its TF value.
   - **Inverse Document Frequency (IDF)**: This measures how unique or rare a term is across all documents in the corpus. If a term appears in many documents, its IDF value is low, indicating that the term is not unique or informative.
   
   TF-IDF is computed by multiplying TF and IDF for each term. The resulting value reflects the importance of the term: it is high for terms that occur frequently in the current document but rarely in the entire corpus. In the context of summarization, TF-IDF helps in determining the sentences that contain the most informative words, allowing us to assign importance scores to each sentence.

4. **Sentence Scoring and Selection**: After calculating the TF-IDF values for each term, the sum of the TF-IDF scores for each sentence is used to determine its importance. The sentences with the highest scores are selected for the summary.

5. **Sentence Ordering**: To maintain coherence, the selected sentences are reordered based on their original positions in the transcript.

6. **Summary Generation**: Finally, the ordered sentences are joined together to generate a concise summary of the transcript.

The TF-IDF approach ensures that the most relevant sentences are selected for summarization based on the importance of the words they contain, making the generated summary informative and efficient.
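
To make the scoring steps above concrete, here is a minimal pure-Python sketch of the same idea on three toy sentences. It is a simplification and not what `TfidfVectorizer` computes exactly (scikit-learn smooths the IDF and normalizes each vector by default); the helper names `tokenize`, `tfidf_sentence_scores`, and `summarize` are made up for this illustration.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())

def tfidf_sentence_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its words."""
    tokenized = [tokenize(s) for s in sentences]
    n_docs = len(tokenized)
    # Document frequency: in how many sentences does each word appear?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        term_freq = Counter(words)
        score = 0.0
        for word, count in term_freq.items():
            tf = count / len(words)                  # share of the sentence taken by this word
            idf = math.log(n_docs / doc_freq[word])  # 0 when the word is in every sentence
            score += tf * idf
        scores.append(score)
    return scores

def summarize(sentences, n=2):
    """Pick the n highest-scoring sentences and return them in their original order."""
    scores = tfidf_sentence_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(top))

toy_sentences = [
    "Python is a popular language.",
    "Python is popular.",
    "Automation scripts rename files and upload folders.",
]
print(tfidf_sentence_scores(toy_sentences))
print(summarize(toy_sentences, n=1))
```

The third sentence scores highest because every one of its words is rare across the toy corpus, while the second sentence only contains words that also appear elsewhere.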



In [6]:
import youtube_transcript_api
from youtube_transcript_api import YouTubeTranscriptApi #for downloading transcripts
import nltk # Natural Language Toolkit
import re # helps with text processing tasks such as pattern matching, search, and substitution.
from nltk.corpus import stopwords #This imports a list of common words (like "and", "the", etc.) from the Natural Language Toolkit (NLTK) that are often removed from text data during preprocessing to improve efficiency in NLP tasks.
import sklearn # Most basic ML library
from sklearn.feature_extraction.text import TfidfVectorizer # used to convert text into numerical features by calculating the Term Frequency-Inverse Document Frequency (TF-IDF)
import numpy as np # used for numerical operations

In [7]:
# Download necessary NLTK data
nltk.download('punkt')  # Tokenizer model for splitting text into sentences
nltk.download('punkt_tab')  # Additional data required for tokenization

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [8]:
# YouTube video link and extraction of unique ID
link = "https://www.youtube.com/watch?v=Y8Tko2YC5hA"
unique_id = link.split("=")[-1]  # Extract the video ID from the link
# 'Y8Tko2YC5hA' is the unique ID of this YouTube video
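
Splitting on `=` works for the link above, but it would pick up the wrong piece if the URL carried extra parameters such as a timestamp. A slightly more defensive sketch using only the standard library could look like this; `extract_video_id` is a hypothetical helper, not part of any library used in this notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the YouTube video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    if parsed.hostname == "youtu.be":
        # Short links carry the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip("/")
    # Standard watch URLs carry the ID in the "v" query parameter
    values = parse_qs(parsed.query).get("v")
    if not values:
        raise ValueError(f"No video ID found in {url!r}")
    return values[0]

print(extract_video_id("https://www.youtube.com/watch?v=Y8Tko2YC5hA"))
print(extract_video_id("https://www.youtube.com/watch?v=Y8Tko2YC5hA&t=42s"))
```

Both calls print `Y8Tko2YC5hA`, whereas `link.split("=")[-1]` would return `42s` for the second URL.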

In [9]:
# Fetching transcript
try:
    sub = YouTubeTranscriptApi.get_transcript(unique_id)  # Fetch the transcript 
except youtube_transcript_api.exceptions.TranscriptsDisabled:
    print("Transcripts are disabled for this video.")
    exit()  # Exit the program if transcripts are not available
except youtube_transcript_api.exceptions.NoTranscriptFound:
    print("No transcript found for this video.")
    exit()  # Exit the program if no transcript is found

# The except blocks handle errors; the same errors can also be logged.
# What is logging? When running big, complex programs, logging records events or errors, which helps with debugging.
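
As a small illustration of the logging idea mentioned in the comments, the sketch below records the error with Python's built-in `logging` module instead of printing and exiting. `fetch_or_none` and `fake_fetch` are hypothetical names used only here; `fake_fetch` stands in for the real transcript call so the example runs offline.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("summarizer")

def fetch_or_none(fetch, video_id):
    """Call fetch(video_id); on failure, log the error and return None."""
    try:
        transcript = fetch(video_id)
    except Exception as error:  # in the notebook: TranscriptsDisabled, NoTranscriptFound
        logger.error("Could not fetch transcript for %s: %s", video_id, error)
        return None
    logger.info("Fetched %d transcript segments for %s", len(transcript), video_id)
    return transcript

def fake_fetch(video_id):
    """Hypothetical stand-in for the real transcript call, used only for this demo."""
    if video_id == "missing":
        raise LookupError("no transcript")
    return [{"text": "hello"}, {"text": "world"}]

fetch_or_none(fake_fetch, "Y8Tko2YC5hA")  # logs an INFO line
fetch_or_none(fake_fetch, "missing")      # logs an ERROR line and returns None
```

Returning `None` rather than calling `exit()` lets the caller decide what to do next, which tends to be friendlier inside a notebook.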

In [10]:
# Joining transcript parts into a single string
subtitle = " ".join([x['text'] for x in sub])  # Join each piece of text from the transcript into a single string

In [11]:
subtitle

'In this video, I\'m going to answer the top 3 questions my students ask me about Python. What is Python? What  can you do with it? And why is it so popular? In other words, what does it do that other programming languages don\'t? Python is the  world\'s fastest growing and most popular programming language, not just  amongst software engineers, but also amongst mathematicians,  data analysts, scientists, accountants, networking engineers, and even kids! Because it\'s a very beginner friendly programming  language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization,  artificial intelligence and machine learning, automation  in fact this is one of the big uses of Python amongst people who are not software developers. If you constantly have to do boring, repetitive  tasks, such as copying files and folders around, renaming them,  uploading them to a server, you can easily write a Python script to automate all that and

In [12]:
# Tokenization
from nltk.tokenize import sent_tokenize
subtitle = subtitle.replace("\n", " ")  # Replace newline characters with spaces
sentences = sent_tokenize(subtitle)  # Split the transcript into individual sentences
organized_sent = {k: v for v, k in enumerate(sentences)}  # Create a dictionary mapping each sentence to its index

In [13]:
# TF-IDF Calculation
tf_idf = TfidfVectorizer(min_df=2,  # Ignore terms that appear in fewer than 2 documents
                         strip_accents='unicode',  # Remove accents and normalize characters
                         max_features=None,  # Use all features available (no limit on features)
                         lowercase=True,  # Convert all characters to lowercase
                         token_pattern=r'\w{1,}',  # Token pattern to use for tokenizing the text
                         ngram_range=(1, 3),  # Use unigrams, bigrams, and trigrams
                         use_idf=True,  # Use inverse document frequency for calculating TF-IDF
                         smooth_idf=True,  # Smooth the idf weights by adding 1 to document frequencies
                         sublinear_tf=True,  # Apply sublinear tf scaling (use 1 + log(tf))
                         stop_words='english')  # Remove common English stop words


In [14]:
# Fit and transform sentences into TF-IDF vectors
sentence_vectors = tf_idf.fit_transform(sentences)
# Sum TF-IDF scores for each sentence
sent_scores = np.array(sentence_vectors.sum(axis=1)).ravel()

In [15]:
# Selecting Top N Sentences
N = 3  # Number of sentences to select for the summary
top_n_sentences = [sentences[index] for index in np.argsort(sent_scores, axis=0)[::-1][:N]]  # Select top N sentences with highest scores

In [16]:
# Mapping the scored sentences with their indexes as in the subtitle
mapped_sentences = [(sentence, organized_sent[sentence]) for sentence in top_n_sentences]  # Map each sentence to its original index

In [17]:
# Ordering the top-n sentences in their original order
mapped_sentences = sorted(mapped_sentences, key=lambda x: x[1])  # Sort the selected sentences by their original index
ordered_sentences = [element[0] for element in mapped_sentences]  # Extract the sentences in the correct order

In [18]:
# Joining the ordered sentences to form the summary
summary = " ".join(ordered_sentences)  # Join the ordered sentences to form the summary
print("TF-IDF Summary:")
print(summary)

TF-IDF Summary:
Python is the  world's fastest growing and most popular programming language, not just  amongst software engineers, but also amongst mathematicians,  data analysts, scientists, accountants, networking engineers, and even kids! Now if you have some programming experience you may say, "But Mosh we can do all this stuff with other programming languages, so what's the big deal  about Python?" So in a nutshell, Python is a multi-purpose language with a simple, clean, and beginner-friendly  syntax.


_______________________________________________________________________________

_______________________________________________________________________________

_______________________________________________________________________________

## BART Technique Overview
This section uses the BART transformer model from the `transformers` library by Hugging Face to summarize YouTube video transcripts. BART is an encoder-decoder model designed for natural language generation tasks, including summarization.

#### BART Summarization Steps

1. **Transcript Extraction**: Extract the transcript using the `YouTubeTranscriptApi` and join it into a single text string.

2. **BART Model Setup**: Load the pre-trained BART model (`facebook/bart-large-cnn`) and its tokenizer to prepare the input text.

3. **Dynamic Input Length**: Cap the number of input tokens based on the transcript length (at most 1,024, the model's input limit) and tokenize the input; anything beyond the cap is truncated.

4. **Summary Generation**: Generate a summary using the BART model with parameters like `max_length`, `min_length`, `length_penalty`, and `num_beams` to control output quality.

5. **Pipeline Summarization**: Use the `pipeline` function for easy summarization with automatic parameter handling.

#### Advantages of BART
- **Contextual Understanding**: Uses transformer-based attention for context-aware summaries.
- **Flexibility**: Adjustable parameters for different types of content.
- **Efficiency**: Pre-trained for summarization, providing accurate results.

#### Summary Methods
- **Direct BART**: Allows fine control over summarization parameters.
- **Pipeline**: Simplifies the process with automated handling of steps.



In [24]:
import transformers # most basic gen ai library. get familiar with it
from transformers import BartTokenizer, BartForConditionalGeneration # importing the BART model

The transcript download steps are the same as before; I repeated them so that you can revise them, which is why there are no comments.

In [26]:
nltk.download('punkt')  

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
link = "https://www.youtube.com/watch?v=Y8Tko2YC5hA"
unique_id = link.split("=")[-1]  

try:
    sub = YouTubeTranscriptApi.get_transcript(unique_id)  
    if not sub:
        raise ValueError("Transcript is empty.")  
except youtube_transcript_api.exceptions.TranscriptsDisabled:
    print("Transcripts are disabled for this video.")
    exit()  
except youtube_transcript_api.exceptions.NoTranscriptFound:
    print("No transcript found for this video.")
    exit() 


subtitle = " ".join([x['text'] for x in sub]) 

In [30]:
# Load pre-trained BART model and tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')  # Load BART tokenizer
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')  # Load BART model for summarization

# You can also search for other models on Hugging Face. It hosts thousands of such models, and thousands more for other tasks as well.

In [31]:
# Tokenize the transcript, capping the number of input tokens
input_length = len(subtitle.split())  # Transcript length in words
adjusted_max_length = min(1024, int(0.8 * input_length))  # 80% of the word count, capped at 1024 (BART's input limit)
input_tensor = tokenizer.encode(subtitle, return_tensors="pt", max_length=adjusted_max_length, truncation=True)  # Tokens beyond adjusted_max_length are cut off

In [32]:
# Generate the summary using BART model with specified parameters
outputs_tensor = model.generate(input_tensor, max_length=160, min_length=120, length_penalty=1.5, num_beams=6, early_stopping=True)

In [33]:
# Decode the output tensor to get the summary
bart_summary = tokenizer.decode(outputs_tensor[0], skip_special_tokens=True)  # Decode the generated output to text
print("BART Summary:")
print(bart_summary)  # Print the BART based summary

BART Summary:
Python is the world's fastest growing and most popular programming language. It is used by mathematicians, data analysts, scientists, accountants, networking engineers, and even kids. With Python you can solve complex problems in less time with fewer lines of code. It's a high level language so you don't have to worry about complex tasks such as memory management, like you do in C++. It has a huge community so whenever you get  stuck, there is someone out there to help. You can also use Python to build  web, mobile and desktop applications as well as software  testing or even hacking.


_______________________________________________________________________________

### Alternative method using pipeline
This is much easier than doing everything yourself because the pipeline abstracts away the nitty-gritty details.

Also keep one thing in mind: professional coders almost never write everything from scratch. They reuse their own code or someone else's, and they abstract as much as possible by using tools and libraries. The more nitty-gritty detail you handle yourself, the harder the code is to debug.

In [35]:
from transformers import pipeline

In [36]:
# Create summarization pipeline
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')  # Create a summarization pipeline using BART model




In [37]:
# Adjust max_length and min_length based on input length
adjusted_max_length = min(180, int(0.4 * input_length))  # Set max_length to be 40% of the input length, capped at 180
adjusted_min_length = min(30, int(0.1 * input_length))  # Set min_length to be 10% of the input length, capped at 30
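
To see what these two formulas produce for transcripts of different sizes, here is a small sketch that wraps them in a function. `summary_length_bounds` is a made-up helper name. Keep in mind that the model counts `max_length` and `min_length` in tokens, while `input_length` in this notebook is a word count, so the percentages are only approximate.

```python
def summary_length_bounds(word_count, max_cap=180, min_cap=30):
    """Mirror the notebook's rule: 40% and 10% of the word count, each capped."""
    max_length = min(max_cap, int(0.4 * word_count))
    min_length = min(min_cap, int(0.1 * word_count))
    return min_length, max_length

for words in (50, 200, 1000):
    print(words, summary_length_bounds(words))
# 50 (5, 20)
# 200 (20, 80)
# 1000 (30, 180)
```

For long transcripts both caps take over, so the summary length stops growing once the transcript passes roughly 450 words.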

In [38]:
# Generate summary using the pipeline with adjusted parameters
pipeline_summary = summarizer(subtitle, max_length=adjusted_max_length, min_length=adjusted_min_length, truncation=True)  # Generate summary with pipeline
print("Pipeline BART Summary:")
print(pipeline_summary[0]['summary_text'])  # Print the summary generated by the pipeline

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Pipeline BART Summary:
Python is the world's fastest growing and most popular programming language. People from different disciplines use Python for a variety of different tasks. With Python you can solve complex problems in less time with fewer lines of code.


_______________________________________________________________________________

_______________________________________________________________________________

_______________________________________________________________________________

# LangChain YouTube Video Summarizer

**Now let's use LangChain to make this much easier and more abstract.**

LangChain abstracts a lot of detail, and it is one of the most beginner-friendly frameworks in GenAI (to my knowledge).


This section uses LangChain to summarize YouTube transcripts with two methods: **Stuff** and **Map Reduce**. LangChain provides an efficient and flexible way to create high-quality summaries by leveraging advanced language model capabilities.

#### Why LangChain?

Compared to traditional methods like TF-IDF and BART, LangChain offers several advantages that make it a superior choice for summarizing YouTube transcripts:

1. **Scalability**: LangChain handles larger texts effectively by breaking them into smaller chunks and processing them efficiently. This makes it well-suited for lengthy transcripts that would be challenging for models with fixed input limits.

2. **Modularity**: LangChain provides a modular framework that makes it easy to chain different LLM operations, such as splitting documents, summarizing sections, and combining summaries. This modularity allows for greater flexibility and customization in the summarization process, enabling users to adapt the approach based on the specific needs of the content.

3. **LLM Integration**: LangChain seamlessly integrates with advanced LLMs, offering better customization, parameter control, and the ability to dynamically adjust prompts to suit different scenarios. This results in more accurate and context-aware summaries, which are often difficult to achieve with traditional techniques alone.

Using LangChain helps manage long transcripts effectively and produces high-quality summaries by employing adaptable strategies like Stuff and Map Reduce. These methods allow for a more nuanced approach to summarization, ensuring that key information is retained and that the final summary is both informative and concise. By comparison, TF-IDF can only pick existing sentences, and BART on its own is limited to a fixed input length (1,024 tokens for `facebook/bart-large-cnn`); LangChain's chunking strategies work around that limit.



In [45]:
# !pip install --upgrade --quiet  pytube

In [46]:
from langchain_community.document_loaders import YoutubeLoader

In [47]:
# Load YouTube transcript using YoutubeLoader
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=Y8Tko2YC5hA", add_video_info=False
)  # Initialize the loader to fetch subtitles from the YouTube URL
subtitles = loader.load()  # Load the subtitles as document objects
subtitles
# Notice that loading the transcript this way is much easier.

[Document(metadata={'source': 'Y8Tko2YC5hA'}, page_content='In this video, I\'m going to answer the top 3 questions my students ask me about Python. What is Python? What can you do with it? And why is it so popular? In other words, what does it do that other programming languages don\'t? Python is the world\'s fastest growing and most popular programming language, not just amongst software engineers, but also amongst mathematicians, data analysts, scientists, accountants, networking engineers, and even kids! Because it\'s a very beginner friendly programming language. So people from different disciplines use Python for a variety of different tasks, such as data analysis and visualization, artificial intelligence and machine learning, automation in fact this is one of the big uses of Python amongst people who are not software developers. If you constantly have to do boring, repetitive tasks, such as copying files and folders around, renaming them, uploading them to a server, you can eas

Go to the Groq website, sign in, and create an API key. It's free (for personal use, at least)!

Save this API key in a safe place so that you don't lose it, and don't share your API keys with anyone.

Then, in the current directory, create a file and name it '.env', or in Jupyter 'groq_api_token.env'. The code below loads 'groq_api_token.env', so if you choose a different name, adjust the path passed to `load_dotenv` accordingly.

Open it and write GROQ_API_TOKEN = "--your key--"
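
Once the file has been loaded (the next cell does this with `load_dotenv`), it can help to fail early with a clear message if the key is missing, rather than getting a confusing authentication error later. A minimal sketch, assuming the variable name `GROQ_API_TOKEN` from above; `require_env` is a hypothetical helper, and the placeholder value is set only so the example runs without a real key.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return value

# Demo only: a placeholder so the sketch runs without a real key
os.environ.setdefault("GROQ_API_TOKEN", "--your key--")
key = require_env("GROQ_API_TOKEN")
print(f"Loaded a key of length {len(key)}")  # never print the key itself
```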

In [49]:
import os
from langchain_groq import ChatGroq
from dotenv import load_dotenv # for loading the .env file

# Load environment variables and set up ChatGroq
load_dotenv('groq_api_token.env')  # Load the environment variables from the specified .env file
groq_key = os.getenv('GROQ_API_TOKEN')  # Retrieve the API token for Groq from environment variables

In [50]:
# Set up the LLM using ChatGroq with specified parameters
llm = ChatGroq(temperature=0, groq_api_key=groq_key, model_name="llama3-70b-8192")  # Set temperature to 0 for deterministic output
# Llama 3 is an LLM; accessing it through Groq and LangChain is much easier than loading it yourself from Hugging Face.

#### Stuff Method

The **Stuff** method combines all the content into a single context and passes it to the LLM for summarization. This approach is straightforward and works well for shorter documents or when the entire input can fit within the model's length limit without truncation. By using the Stuff method, the LLM is able to consider all the information at once, leading to a cohesive summary that captures the essence of the entire content.
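
Stripped of the framework, the idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how LangChain implements it; `stuff_documents` and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real model so the example runs offline.

```python
def stuff_documents(documents, llm,
                    template="Write a concise summary of the following:\n\n{context}"):
    """Join every document into one context string and make a single LLM call."""
    context = "\n\n".join(documents)
    return llm(template.format(context=context))

def fake_llm(prompt):
    """Hypothetical stand-in for a real model: just reports the prompt size."""
    return f"(summary of a {len(prompt.split())}-word prompt)"

docs = ["Python is beginner friendly.", "It is used for automation."]
print(stuff_documents(docs, fake_llm))
```

Because everything goes into one prompt, the whole input has to fit in the model's context window, which is exactly the limitation described above.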



In [53]:
from langchain.chains.combine_documents import create_stuff_documents_chain  # Combines multiple documents into one prompt
from langchain.chains.llm import LLMChain  # Chain to call an LLM
from langchain_core.prompts import ChatPromptTemplate  # Defines the structure of the prompts that you'll be passing to the LLM

In [54]:
# Define prompt for summarization
prompt = ChatPromptTemplate.from_messages(
    [("system", "Write a concise summary of the following:\n\n{context}")]
)  # Define a prompt that will instruct the LLM to summarize the given context



In [55]:
# Instantiate and invoke chain using the Stuff method
chain = create_stuff_documents_chain(llm, prompt)  # Create a chain that uses all the content to generate a summary
result = chain.invoke({"context": subtitles})  # Pass the subtitles as context to the chain and get the summary
print("Stuff Method Summary:")
print(result)  # Print the result of the Stuff summarization method

Stuff Method Summary:
Here is a concise summary of the video:

The video answers the top 3 questions about Python: what it is, what it can do, and why it's so popular. Python is a beginner-friendly, multi-purpose language used by people from various disciplines for tasks like data analysis, automation, and web development. It's popular due to its simplicity, ease of use, and high-level syntax, making it ideal for solving complex problems quickly. Python's advantages include its cross-platform compatibility, large community, and extensive libraries. As a result, it's the most in-demand language by employers, with Python developers earning an average of $116,000 per year.


_______________________________________________________________________________

#### Map Reduce Method

The **Map Reduce** method is designed to handle longer content that cannot be processed in a single pass. It splits the content into smaller, manageable chunks and summarizes each chunk independently in the **map** phase. In the **reduce** phase, these individual summaries are consolidated into a final, cohesive summary that captures the main themes and key points. This method is particularly useful for lengthy transcripts, ensuring that important information is not lost and that the final summary is comprehensive and well-structured.

Map Reduce is most useful when you have a large document or, as here, a long video whose transcript cannot fit into the context window of an LLM. LLMs nowadays come with much larger context windows, but what if we have a thousand-page document?
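
The two phases can be sketched without any framework. In the illustration below, `first_sentence` and `join_summaries` are hypothetical stand-ins for the LLM calls (a real map step would send each chunk to the model, and a real reduce step would send the partial summaries); they are there only so the example runs offline.

```python
def map_reduce_summarize(chunks, summarize_chunk, combine):
    """Summarize each chunk on its own (map), then merge the results (reduce)."""
    partial_summaries = [summarize_chunk(chunk) for chunk in chunks]  # map phase
    return combine(partial_summaries)                                 # reduce phase

def first_sentence(text):
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."

def join_summaries(summaries):
    """Hypothetical stand-in for the reduce prompt: concatenate the pieces."""
    return " ".join(summaries)

chunks = [
    "Python is popular. It has a large community.",
    "Python automates boring tasks. Scripts can rename files.",
]
print(map_reduce_summarize(chunks, first_sentence, join_summaries))
# Python is popular. Python automates boring tasks.
```

Since each chunk is summarized independently, the map calls could also run in parallel, which is part of why this pattern suits very long inputs.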



In [59]:
from langchain_core.output_parsers import StrOutputParser

# Define prompt for the map phase
map_prompt = ChatPromptTemplate.from_messages(
    [("system", "Write a concise summary of the following:\n\n{context}")]
)  # Define a prompt to summarize each chunk of the content

# Create map chain that summarizes each chunk individually
map_chain = map_prompt | llm | StrOutputParser()  # Create a chain that maps the prompt to the LLM and parses the output

In [60]:
# Define reduce template and prompt for consolidation
reduce_template = """
The following is a set of summaries:
{docs}
Take these and distill it into a final, consolidated summary
of the main themes.
"""  # Template to instruct the LLM to consolidate multiple summaries
reduce_prompt = ChatPromptTemplate([("human", reduce_template)])
reduce_chain = reduce_prompt | llm | StrOutputParser()  # Create a chain to reduce the mapped summaries into a final summary

In [62]:
from langchain_text_splitters import CharacterTextSplitter
# Split the subtitles into chunks for map reduce
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)  # Split the content into chunks of at most 1000 tokens (counted with tiktoken), without overlap, to keep each chunk manageable for the LLM
# We can also use overlap, which creates continuity between chunks: each chunk then starts with some tokens, say 50, from the end of the previous chunk.

split_subtitles = text_splitter.split_documents(subtitles)  # Split the subtitles document into smaller chunks
print(f"Generated {len(split_subtitles)} documents.")  # Print the number of generated chunks

Generated 1 documents.


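Only 1 document was generated because the transcript is too short to be divided, i.e. it is smaller than the chunk size of 1000 tokens. If a longer transcript still comes back as a single chunk, that is likely because `CharacterTextSplitter` only splits at its separator (`"\n\n"` by default), which a transcript may not contain; `RecursiveCharacterTextSplitter` is a common alternative.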

Try to summarize a big video and see the result.
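
To get a feel for how chunk size and overlap interact, here is a rough sketch that splits a list of words instead of tokens. It is not how the LangChain splitter works internally (that one counts tokens with tiktoken and splits at a separator); `split_with_overlap` is a made-up helper for illustration.

```python
def split_with_overlap(words, chunk_size, overlap):
    """Split a list of words into chunks that share `overlap` words with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
for chunk in split_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

With a text shorter than `chunk_size`, the function returns a single chunk, which matches the "Generated 1 documents." output above.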

In [64]:
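# Map phase: summarize each chunk independently
chunk_summaries = map_chain.batch([{"context": doc.page_content} for doc in split_subtitles])
# Reduce phase: consolidate the chunk summaries into one final summary
print("Map Reduce Method Summary:")
print(reduce_chain.invoke({"docs": "\n\n".join(chunk_summaries)}))  # Print the consolidated summary from the Map Reduce method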


Map Reduce Method Summary:
Here is a consolidated summary of the main themes:

**Python's Popularity and Versatility**

* Python is the world's fastest-growing and most popular programming language, used by people from various disciplines, including software engineers, mathematicians, data analysts, scientists, accountants, and kids.
* It's a beginner-friendly language that can be used for a variety of tasks, such as data analysis and visualization, artificial intelligence and machine learning, automation, web and mobile development, and software testing.

**Advantages of Python**

* Python allows users to solve complex problems in less time with fewer lines of code, making it a more efficient language.
* It has a simple, clean, and powerful syntax, making it easy to learn and use.
* Python is a high-level language, so users don't have to worry about complex tasks like memory management.
* It's cross-platform, allowing users to build and run applications on Windows, Mac, and Linux.
* P

Some of the sources (that I remember) are vidyaanalytics, the LangChain documentation, a couple of YouTube videos, and GPT (of course).