### Dependencies


- langchain_google_genai
- langchain
- langchain_core
- time
- dotenv
- pprint
- datasets
- typing_extensions
- typing
- IPython
- ragas
- langgraph
- tiktoken
- re
- PyPDF2
- pylcs
- pandas
- textwrap
- markdown
- vertexai

In [1]:
# INSTALL DEPENDENCIES

!pip install langchain_google_genai langchain langchain_core python-dotenv datasets typing_extensions IPython ragas langgraph tiktoken PyPDF2 pylcs pandas langchain_groq langchain_mongodb markdown bs4 vertexai sentencepiece



In [2]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_groq import ChatGroq
from langchain.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.runnables.graph import MermaidDrawMethod

from langgraph.graph import END, StateGraph

from dotenv import load_dotenv
from pprint import pprint
import os
from datasets import Dataset
from typing_extensions import TypedDict
from IPython.display import display, Image
from typing import List  # TypedDict is already imported from typing_extensions above

from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_similarity
)

import langgraph

### Helper functions for notebook


"""
from helper_functions import num_tokens_from_string, replace_t_with_space, replace_double_lines_with_one_line, split_into_chapters,\
analyse_metric_results, escape_quotes, text_wrap,extract_book_quotes_as_documents
"""


load_dotenv(override=True)


For example, replace imports like: `from langchain_core.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet: `from pydantic.v1 import BaseModel`


True

### Setting GEMINI and GROQ API keys

In [3]:
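gemini_api_key = os.getenv("GOOGLE_API_KEY")
# os.environ only accepts strings, so fail early with a clear message if the key is missing
if gemini_api_key is None:
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
os.environ["GOOGLE_API_KEY"] = gemini_api_key
groq_api_key = os.getenv("GROQ_API_KEY")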

## Data preprocessing

### Extract text from source

- pdf file

In [4]:
from my_helper_function import pdf_text_extract

pdf_path = "Computer_Network_Chapter_3.pdf"
pdf_text = pdf_text_extract(pdf_path)
print(pdf_text)

3/12/2025
1
Chapter 3: 
Datalinklayer
•Functionalities:
•Encapsulation, addressing
•Error detection and correction
•Flow control
•Media access control
1
Overview of Data link layer
21
2 3/12/2025
2TRƯỜNG CÔNG NGHỆ THÔNG TIN VÀ TRUYỀN THÔNG
School of Information and Communication Technology
Link layer: introduction
Link Layer: 6-3terminology:
hosts and routers: nodes
communication channels that 
connect adjacent nodes along 
communication path: links
•wired 
•wireless 
•LANs
layer-2 packet: frame, 
encapsulates datagrammobile network
enterprise
network
national or global ISP
datacenter 
network
link layer has responsibility of 
transferring datagram from one node 
to physically adjacent node over a li nk
TRƯỜNG CÔNG NGHỆ THÔNG TIN VÀ TRUYỀN THÔNG
School of Information and Communication Technology
Datalinklayer in Layer architecture
5LLC
(Logical Link Control)
MAC
(Media Access Control)Application
Transport
Network
Data-link
Physical
802.2 LLC
802.3
Ethernet802.4
Token Bus802.5
Token 

- markdown file

In [5]:
from my_helper_function import md_text_extract

md_path = "Computer_Network_Chapter_3.md"
md_text = md_text_extract(md_path)
print(md_text)

Chapter 3: Data Link Layer
(Based on PDF: Computer_Network_Chapter_3.pdf)

Overview & Introduction (Slides 1-5)

Functionalities: Encapsulation, addressing, Error detection and correction, Flow control, Media access control.
Introduction:
Terminology:
Nodes: Hosts and routers.
Links: Communication channels connecting adjacent nodes (wired, wireless, LANs).
Frame: Layer-2 packet, encapsulates datagram.


Responsibility: Transferring datagram from one node to a physically adjacent node over a link.
Datalink Layer in Layer Architecture:
Positioned between Network and Physical layers.
Often split into sublayers:
LLC (Logical Link Control): Media independent (e.g., IEEE 802.2).
MAC (Media Access Control): Media dependent (e.g., IEEE 802.3 Ethernet, 802.5 Token Ring, 802.11 Wi-Fi, 802.16 Wi-Max).



Functionalities & Context (Slides 6-9)

Core Functionalities: Framing, Addressing, Error control, Flow control, Media Access Control.
Link Layer Context:
Datagrams are transferred by different li

- txt file

In [6]:
from my_helper_function import txt_text_extract

txt_path = "Computer_Network_Chapter_3.txt"
txt_text = txt_text_extract(txt_path)
print(txt_text)
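
The helpers `txt_text_extract` and `md_text_extract` live in `my_helper_function`, which is not shown in this notebook. As a rough, dependency-free sketch of what they might do (the `_sketch` names and the regex-based Markdown stripping are assumptions; the real helper may instead use the `markdown` and `bs4` packages installed above):

```python
import re
from pathlib import Path


def txt_text_extract_sketch(path: str) -> str:
    """Hypothetical stand-in: read a plain-text file as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def md_text_extract_sketch(path: str) -> str:
    """Hypothetical stand-in: read a Markdown file and strip common syntax."""
    text = Path(path).read_text(encoding="utf-8")
    # Remove heading markers such as "# " or "### "
    text = re.sub(r"^#{1,6}[ \t]*", "", text, flags=re.MULTILINE)
    # Remove bullet and numbered-list markers, including indented ones
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap bold text and inline code
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"`([^`]*)`", r"\1", text)
    return text
```

This only covers the Markdown constructs visible in the printed output above (headings, lists, bold); links, tables, and fenced code would need extra handling.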



### Clean extracted text

In [7]:
from my_helper_function import clean_text

# clean_text removes unnecessary characters and extra spaces and standardizes
# formatting; it also handles common issues in OCR'd text.

cleaned_pdf_text = clean_text(pdf_text)
cleaned_txt_text = clean_text(txt_text)
cleaned_md_text = clean_text(md_text)
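
The body of `clean_text` is not shown here. A minimal sketch of the kind of cleanup it describes, written against the artifacts visible in the raw PDF output above (slide dates, bare page numbers, bullet glyphs), could look like this. The name `clean_text_sketch` and every rule in it are assumptions, not the helper's actual logic:

```python
import re

_PAGE_NUMBER = re.compile(r"\d{1,3}")            # a line holding only a slide/page number
_SLIDE_DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")  # a line holding only a date like 3/12/2025


def clean_text_sketch(text: str) -> str:
    """Hypothetical cleanup pass for slide text extracted from a PDF."""
    # Normalize line endings and turn tabs into spaces
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\t", " ")
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        # Heuristic: drop lines that are only a page number or a date stamp
        if _PAGE_NUMBER.fullmatch(line) or _SLIDE_DATE.fullmatch(line):
            continue
        kept.append(line)
    text = "\n".join(kept)
    # Strip leading bullet glyphs left over from the slides
    text = re.sub(r"^[•▪◦][ ]*", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and runs of blank lines
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The page-number rule is deliberately blunt: it would also drop a legitimate line that contains only a small number, so it is worth checking against the source slides before relying on it.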

### Chunking

In [None]:
"""

# --- 4. Chunking with LangChain ---
# Initialize the text splitter
# Adjust chunk_size and chunk_overlap based on your RAG needs and the LLM's context window
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Max characters per chunk
    chunk_overlap=150, # Characters to overlap between chunks
    length_function=len,
    is_separator_regex=False,
    separators=["\n\n", "\n", ". ", ", ", " ", ""] # Order of separators to try
)

# Split the cleaned text into documents (chunks)
# In LangChain, text is often wrapped in a 'Document' object, 
# but for this example, we'll work directly with the text strings.
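# (cleaned_pdf_text comes from the cleaning cell above; swap in cleaned_md_text or cleaned_txt_text as needed)
chunks = text_splitter.split_text(cleaned_pdf_text)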

"""
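
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified, dependency-free sketch of fixed-size character chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries the separators in order to avoid cutting mid-sentence); the function name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Each chunk repeats the last character of the previous one (overlap of 1)
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger overlap lowers the chance that a definition is split across two chunks, at the cost of storing and embedding more duplicated text.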

In [None]:
'''

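# --- 5. Condensing Chunks with Structured Output ---

# Imports this cell needs that are missing from the import cell above
import json
from langchain_core.output_parsers import PydanticOutputParser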

# Define the desired structured output format using Pydantic
class ChunkCondensation(BaseModel):
    """Defines the desired structure for condensed information from a text chunk."""
    summary: str = Field(description="A concise summary (2-3 sentences) of the main topic discussed in the chunk.")
    keywords: List[str] = Field(description="A list of 3-5 main keywords or terms mentioned.")
    key_concepts: List[str] = Field(description="A list of key concepts, protocols, or definitions explained in the chunk.")
    # You could add more fields, e.g., potential_questions: List[str]

# Set up the PydanticOutputParser
# This parser will automatically generate formatting instructions for the LLM
# and parse the LLM's JSON output back into the Pydantic object.
output_parser = PydanticOutputParser(pydantic_object=ChunkCondensation)

# Define the Prompt Template
# This template includes the original text chunk and the formatting instructions
# generated by the output_parser.
prompt_template = PromptTemplate(
    template="Analyze the following text chunk about the Data Link Layer. Extract the requested information.\n{format_instructions}\nChunk Text:\n{chunk}\n",
    input_variables=["chunk"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()}
)

# Initialize the LLM (e.g., Gemini via LangChain) - Requires API Key
# Make sure to replace "YOUR_API_KEY" with your actual key if running this
# llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key="YOUR_API_KEY", convert_system_message_to_human=True) 

# --- Process Chunks ---
condensed_data_list = []
for i, chunk in enumerate(chunks):
    
    # --- This is where you would call the LLM with the structured prompt ---
    
    # 1. Format the prompt with the current chunk and parser instructions
    # formatted_prompt = prompt_template.format_prompt(chunk=chunk)

    # 2. Invoke the LLM (Requires API Key and uncommenting)
    # response = llm.invoke(formatted_prompt.to_string()) # Get LLM response as string

    # 3. Parse the LLM response using the Pydantic parser
    # try:
    #    parsed_output = output_parser.parse(response)
    #    condensed_data_list.append(parsed_output.dict()) # Store the structured data as a dictionary
    # except Exception as e:
    #    print(f"Error parsing LLM output for chunk {i+1}: {e}")
    #    # Handle error, maybe store the raw chunk or a default structure
    #    condensed_data_list.append({
    #        "summary": f"Error processing chunk {i+1}", 
    #        "keywords": [], 
    #        "key_concepts": [],
    #        "raw_chunk": chunk # Optionally keep raw chunk on error
    #    })
        
    # --- Placeholder for demonstration (Simulating LLM Output) ---
    # To make the script runnable without an API key, we'll create placeholder structured data.
    # In a real implementation, you would remove this placeholder block 
    # and use the actual LLM call and parsing above.
    placeholder_data = {
        "summary": f"This is a placeholder summary for chunk {i+1}. It would normally contain a concise overview of the chunk's content.",
        "keywords": [f"keyword{j}" for j in range(1, 4)], # Placeholder keywords
        "key_concepts": [f"concept{j}" for j in range(1, 3)] # Placeholder concepts
    }
    condensed_data_list.append(placeholder_data)
    # --- End of Placeholder ---

    # Optional: Print progress
    # print(f"Processed chunk {i+1}/{len(chunks)}")


# --- 6. Output Example ---
# Print the first few structured condensation results
print("--- Example Cleaned and Condensed Structured Output (First 3 Chunks) ---")
for i in range(min(3, len(condensed_data_list))): 
    print(f"--- Condensed Data for Chunk {i+1} ---")
    # Print the structured data (dictionary format)
    # In a real scenario, this would contain the data extracted by the LLM
    print(json.dumps(condensed_data_list[i], indent=2)) 
    print("-" * 20)

# You can now use the 'condensed_data_list' which contains structured dictionaries
# for each chunk, suitable for indexing in a vector database or other RAG components.
# For example, you might index the 'summary' or 'key_concepts' along with the original chunk.

# Example: Saving structured data to a JSON file
# with open("condensed_datalink_data.json", "w", encoding="utf-8") as f:
#     json.dump(condensed_data_list, f, indent=2, ensure_ascii=False)

'''
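
The commented-out parsing step above falls back to a default record when the model's reply cannot be parsed. A dependency-free sketch of that fallback is shown below; it uses a dataclass as a stand-in for the Pydantic model, and the names `ChunkCondensationSketch` and `parse_condensation` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ChunkCondensationSketch:
    """Stand-in for the Pydantic ChunkCondensation model."""
    summary: str
    keywords: List[str] = field(default_factory=list)
    key_concepts: List[str] = field(default_factory=list)


def parse_condensation(raw: str, chunk_index: int) -> dict:
    """Parse a JSON reply into a record, or return a default record on failure."""
    try:
        data = json.loads(raw)
        parsed = ChunkCondensationSketch(
            summary=str(data["summary"]),
            keywords=list(data.get("keywords", [])),
            key_concepts=list(data.get("key_concepts", [])),
        )
        return asdict(parsed)
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed JSON, a missing summary, or a reply that is not a JSON object
        return {
            "summary": f"Error processing chunk {chunk_index}",
            "keywords": [],
            "key_concepts": [],
        }
```

Keeping the fallback record in the same shape as a successful one means downstream indexing code does not need a special case for failed chunks.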

### Condense text

In [14]:
def condense_information(text: str) -> str:
    information_condenser_model = ChatGoogleGenerativeAI(
        model = "gemini-2.5-pro-exp-03-25",
        temperature = 0,
        max_tokens=8192,
        timeout=None
    )

    condense_information_prompt_template = """
        You are an expert text condenser for Retrieval Augmented Generation (RAG) systems. Your task is to maximize the information density of the given text while minimizing information loss.  This is for the preprocessing stage of a RAG pipeline, where the condensed text will be stored and later retrieved to answer user queries.

        Here are the guidelines:

        1.  **Core Information Preservation:** Retain all key entities, facts, relationships, and concepts.  Do NOT remove information that is crucial to understanding the original meaning.

        2.  **Contextual Integrity:** Ensure the condensed text remains coherent and understandable.  Maintain the logical flow of ideas, even if expressed more succinctly.

        3.  **Density Increase:**
            * Remove redundant phrases and words.
            * Replace verbose expressions with more concise equivalents.
            * Combine sentences where appropriate, without sacrificing clarity.
            * Use abbreviations and acronyms judiciously, only if they are widely understood or defined within the text.
            * Omit filler words and phrases (e.g., "it is important to note that", "in conclusion").
            * Avoid unnecessary details or elaborations that do not significantly contribute to the core meaning.
            * Use active voice where possible.

        4.  **Minimize Loss:** Do not sacrifice accuracy or completeness for the sake of brevity.  If a piece of information is important, keep it, even if it makes the text slightly longer.  Favor including slightly more information over omitting something crucial.

        5.  **Output Format:** Provide the condensed text as a single, coherent paragraph. Do not use bullet points or numbered lists unless the original text heavily relies on them and they are essential for understanding.

        6.  **Examples:**

            * **Input:** "The Battle of Hastings was fought on 14 October 1066 between the Norman-French army of William the Conqueror and an English army under the Anglo-Saxon King Harold Godwinson, beginning the Norman conquest of England."
            * **Output:** "On 14 October 1066, the Battle of Hastings marked the beginning of the Norman conquest of England, opposing the Norman-French army of William the Conqueror against the English army of Anglo-Saxon King Harold Godwinson."

            * **Input:** "The quick brown fox jumps over the lazy dog. This is a common English pangram. Pangrams are useful because they display all of the letters in the alphabet."
            * **Output:** "The quick brown fox jumps over the lazy dog, a common English pangram that displays all letters of the alphabet."

        7. **Specific Instructions for this task**:
            * Keep the language of the condensed text the same as the original text.
            * Do not add any information that is not present in the original text.
            * Be precise and factual.
            * The output should be one paragraph.

        Here is the text to condense inside three backticks:

        ```
        {text}
        ```
    """

    condense_information_prompt = PromptTemplate(template=condense_information_prompt_template, input_variables=["text"])

    string_output_parser = StrOutputParser()

    condense_chain = (
        {"text": lambda x: x}  # Identity function to pass the text
        | condense_information_prompt
        | information_condenser_model
        | string_output_parser # Use string output parser
    )
    # Condenses the input text using the Gemini LLM and LangChain.
    #   Args:    text, the raw text to condense.
    #   Returns: the condensed text as a single paragraph.

    return condense_chain.invoke(text)

In [15]:
condensed_pdf_text = condense_information(cleaned_pdf_text)
print(condensed_pdf_text)

The Datalink layer, positioned between the Network and Physical layers in the network architecture, is responsible for transferring network-layer datagrams between physically adjacent nodes (hosts and routers) across communication links (wired, wireless, LANs). Its primary functions include encapsulating datagrams into frames (Layer 2 packets), link-layer addressing using unique 48-bit MAC addresses assigned by IEEE, flow control to prevent receiver overload, error detection and correction to handle transmission impairments, and Media Access Control (MAC) to coordinate access on shared media. The Datalink layer is typically implemented within a node's Network Interface Card (NIC) using a combination of hardware, software, and firmware, often split into Logical Link Control (LLC, e.g., IEEE 802.2) and MAC (e.g., IEEE 802.3 Ethernet, 802.11 Wi-Fi) sublayers. Error detection techniques include Parity checks, Checksum (using ones' complement sum), and Cyclic Redundancy Check (CRC), a power

In [17]:
from my_helper_function import count_tokens_for_gemini

num_token_raw = count_tokens_for_gemini(pdf_text)
num_token_cleaned = count_tokens_for_gemini(cleaned_pdf_text)
num_token_processed = count_tokens_for_gemini(condensed_pdf_text)
print(f"""
    {num_token_raw}
    ------
    {num_token_cleaned}
    ------
    {num_token_processed}
""")


    CountTokensResult(total_tokens=16742)
    ------
    CountTokensResult(total_tokens=15291)
    ------
    CountTokensResult(total_tokens=724)
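
The three counts are easier to compare as percentages. A small sketch using the numbers printed above (in the notebook they would come from the `total_tokens` attribute of each result; the helper name `token_reduction` is hypothetical):

```python
def token_reduction(before: int, after: int) -> float:
    """Return the percentage of tokens removed going from before to after."""
    if before <= 0:
        raise ValueError("before must be a positive token count")
    return 100.0 * (before - after) / before


# Token counts copied from the cell output above
raw_tokens, cleaned_tokens, condensed_tokens = 16742, 15291, 724

print(f"cleaning alone:   {token_reduction(raw_tokens, cleaned_tokens):.1f}% fewer tokens")
print(f"after condensing: {token_reduction(raw_tokens, condensed_tokens):.1f}% fewer tokens")
```

With these counts, cleaning removes roughly 8.7% of the tokens and condensing brings the total reduction to roughly 95.7% of the raw text.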

