### Dependencies


- langchain_google_genai
- langchain
- langchain_core
- time
- dotenv
- pprint
- datasets
- typing_extensions
- typing
- IPython
- ragas
- langgraph
- tiktoken
- re
- PyPDF2
- pylcs
- pandas
- textwrap
- markdown
- vertexai
- chunking_evaluation (pip install git+https://github.com/brandonstarxel/chunking_evaluation.git)
- langchain_openai
- langchain_experimental
- pymongo
- langchain_groq
- langchain_mongodb

In [1]:
# INSTALL DEPENDENCIES

### pip install -r requirements.txt


In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_groq import ChatGroq
from langchain.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.runnables.graph import MermaidDrawMethod

from langgraph.graph import END, StateGraph

from dotenv import load_dotenv
from pprint import pprint
import os
from datasets import Dataset
from typing_extensions import TypedDict
from IPython.display import display, Image
from typing import List

from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    faithfulness,
    answer_relevancy,
    context_recall,
    answer_similarity
)

import langgraph
from pymongo import MongoClient


### Helper functions for notebook


# Disabled for now; uncomment to use the shared helpers.
# from helper_functions import (
#     num_tokens_from_string, replace_t_with_space, replace_double_lines_with_one_line,
#     split_into_chapters, analyse_metric_results, escape_quotes, text_wrap,
#     extract_book_quotes_as_documents,
# )


load_dotenv(override=True)

### Setting GEMINI and GROQ API keys

In [3]:
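google_api_key = os.getenv("GOOGLE_API_KEY")
if google_api_key is None:  # os.environ only accepts strings, so fail early
    raise RuntimeError("GOOGLE_API_KEY is not set; add it to your .env file")
os.environ["GOOGLE_API_KEY"] = google_api_key
groq_api_key = os.getenv("GROQ_API_KEY")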

## Data preprocessing

### Extract text from source

- PDF file

In [None]:
from my_helper_function import pdf_text_extract

pdf_path = "Computer_Network_Chapter_3.pdf"
pdf_text = pdf_text_extract(pdf_path)
print(pdf_text)

- markdown file

In [None]:
from my_helper_function import md_text_extract

md_path = "Computer_Network_Chapter_3.md"
md_text = md_text_extract(md_path)
print(md_text)

- txt file

In [None]:
from my_helper_function import txt_text_extract

txt_path = "Computer_Network_Chapter_3.txt"
txt_text = txt_text_extract(txt_path)
print(txt_text)

### Clean extracted text

In [7]:
from my_helper_function import clean_text_basic

# clean_text_basic removes unnecessary characters and extra spaces and
# standardizes formatting. It handles common issues in OCR'd text.

cleaned_pdf_text = clean_text_basic(pdf_text)
cleaned_txt_text = clean_text_basic(txt_text)
cleaned_md_text = clean_text_basic(md_text)
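
The helper itself is not shown in this notebook. Purely as an illustration, a basic cleaner along these lines might look like the sketch below. `clean_text_basic_sketch` and its specific rules are assumptions, not the actual implementation in `my_helper_function`.

```python
import re


def clean_text_basic_sketch(text: str) -> str:
    """Hypothetical stand-in for clean_text_basic (the real helper is not shown)."""
    # Re-join words split by a hyphen at a line break ("implemen-\ntation").
    # Caveat: this also joins genuinely hyphenated words that wrap at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn tabs and non-breaking spaces into ordinary spaces.
    text = text.replace("\t", " ").replace("\u00a0", " ")
    # Collapse runs of spaces, then drop spaces left at the end of a line.
    text = re.sub(r" {2,}", " ", text)
    text = re.sub(r" +\n", "\n", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(repr(clean_text_basic_sketch("implemen-\ntation\tof  CRC \n\n\n\nnext")))
```

Comparing the token counts before and after a pass like this gives a quick sense of how much of the raw extraction was whitespace noise.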

In [None]:
from my_helper_function import llm_clean_data

# condensed_pdf_text = llm_clean_data(cleaned_pdf_text)
# The call is commented out; the text below is a hard-coded copy of the condensed output.

condensed_pdf_text = """
```plaintext
Chapter 3: Data link layer

Functionalities:
* Encapsulation, addressing
* Error detection and correction
* Flow control
* Media access control

Overview of Data link layer

Link layer: introduction

Link Layer: terminology:
* hosts and routers: nodes
* communication channels that connect adjacent nodes along communication path: links
    * wired
    * wireless
    * LANs
* layer 2 packet: frame, encapsulates datagram

(Visual representation: mobile network, enterprise network, national or global ISP, datacenter network)

link layer has responsibility of transferring datagram from one node to physically adjacent node over a link

Data link layer in the layer architecture

Layers: Application, Transport, Network, Datalink, Physical

Datalink sublayers:
* LLC (Logical Link Control): media independent sublayer
* MAC (Media Access Control): media dependent sublayer

IEEE 802.x series:
* 802.2 LLC
* 802.3 Ethernet
* 802.4 Token Bus
* 802.5 Token Ring
* 802.11 Wi-Fi
* 802.16 WiMAX

Functionalities

Datalink layer:
* Framing
* Addressing
* Flow control
* Error control
* Media Access Control

Link layer: context

datagram transferred by different link protocols over different links:
* e.g., Wi-Fi on first link, Ethernet on next link
each link protocol provides different services
* e.g., may or may not provide reliable data transfer over link

transportation analogy:
* trip from Princeton to Lausanne
* limo: Princeton to JFK
* plane: JFK to Geneva
* train: Geneva to Lausanne
* tourist = datagram
* transport segment = communication link
* transportation mode = link layer protocol
* travel agent = routing algorithm

Link layer: services

framing, link access:
* encapsulate datagram into frame, adding header, trailer
* channel access if shared medium
* MAC addresses in frame headers identify source, destination (different from IP address!)

Media access control:
* If the nodes in the network share common media, a Media access control protocol is required

Link layer: services (more)

flow control:
* pacing between adjacent sending and receiving nodes

error detection:
* errors caused by signal attenuation, noise.
* receiver detects errors, signals retransmission, or drops frame

error correction:
* receiver identifies and corrects bit error(s) without retransmission

half-duplex and full-duplex:
* with half-duplex, nodes at both ends of link can transmit, but not at same time

Where is the link layer implemented?

in each and every host
link layer implemented in network interface card (NIC) or on a chip
* Ethernet, Wi-Fi card or chip
* implements link, physical layer
* attaches into host's system buses
* combination of hardware, software, firmware

(Visual representation: host with application, transport, network, and link layers on the cpu/memory side, and link and physical layers in the network interface controller, attached over the host bus (e.g., PCI))

Interfaces communicating

(Visual representation: network interfaces of the sending and receiving hosts communicating)

sending side:
* encapsulates datagram in frame
* adds error checking bits, reliable data transfer, flow control, etc.

receiving side:
* looks for errors, reliable data transfer, flow control, etc.
* extracts datagram, passes to upper layer at receiving side

(Visual representation: each frame on the link is a link header (linkh) followed by a datagram)

Identifier: MAC address

MAC address: 48 bit, organized by IEEE
* Each port is assigned one MAC
* Cannot be changed
* Physical address
* No hierarchical system, flexible
* MAC Address is unchanged when changing networks
* Broadcast address in LAN: FFFFFFFFFFFF

Error control
* Error detection
* Error correction

Principle of error detection

EDC: error detection and correction bits (e.g., redundancy)
D: data protected by error checking, may include header fields
Error detection not 100% reliable!
* protocol may miss some errors, but rarely
* larger EDC field yields better detection and correction

(Visual representation: datagram -> D, EDC (d data bits plus EDC bits) -> bit-error-prone link -> D', EDC' -> all bits in D' OK? yes: datagram is delivered; no: detected error)

Parity code

Single code
* Able to detect single bit error
* A check bit is added to the original data to ensure that the total number of bit 1 is even (even parity code) or odd (odd parity code)

Two dimension code
* Detect and correct single bit error
* Application: mainly on hardware, ex: while sending data on PCI and SCSI bus
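Example layout (last column and last row are even parity bits):
No errors:
101011
111100
011101
001010
Detected and correctable single bit error (row 2, column 2 flipped):
101011
101100
011101
001010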

Parity code

Sent data with odd parity: 01010101, parity bit: 1
Case 1: Received data 01110101, received parity bit: 1
Total number of 1 bits: 6 (even), so the parity bit does not match the data -> error
Case 2: Received data 01110100, received parity bit: 1
Total number of 1 bits: 5 (odd), so the parity bit matches the data -> no error detected (even though two bits were flipped)

For data m bits long, the data space has 2^m values. To have a different code for each data value, codes must be at least m bits long.

Checksum

sender:
* Divide data into n-bit segments
* Calculate the sum of the segments. If there are overflow bits, add them back to the result
* checksum: inverted ones complement sum of segment content

receiver:
* Divide data into n-bit segments
* Calculate the sum of the segments. If there are overflow bits, add them back to the result
* Add the received checksum to the result
* Check the final outcome
    * Contains a 0 bit - error detected
    * All 1s - no error detected. But maybe errors nonetheless?

Goal: detect errors ( i.e., flipped bits) in transmitted segment

Checksum: Example

Data: 0011 0110 1000
Calculate checksum 4 bit:
  0011
+ 0110
+ 1000
-------
 10001 (Overflow bit 1)
+    1
-------
  0010
Invert bits -> checksum code: 1101
Bits to send: 0011 0110 1000 1101

Checksum: Processing on receiver

Bits received: 0011 0110 1000 1101
Verification:
  0011
+ 0110
+ 1000
+ 1101
-------
 11110 (Overflow bit 1)
+    1
-------
 1111 -> no bit error

Cyclic Redundancy Check (CRC)

more powerful error detection coding
D: data bits (given, think of these as a binary number)
G: bit pattern (generator), of r+1 bits (given)
goal: choose r CRC bits, R, such that <D,R> exactly divisible by G (mod 2)
receiver knows G, divides <D,R> by G. If nonzero remainder: error detected!
can detect all burst errors less than r+1 bits
widely used in practice (Ethernet, 802.11 Wi-Fi)

d data bits | r CRC bits
D R
<D,R> = D.2^r XOR R (formula for the bit pattern)

CRC: How to find R

<D, R> = D.2^r XOR R
Since <D, R> is divisible by G:
D.2^r XOR R = n.G
D.2^r = n.G XOR R (XOR both sides with R)
This means R is the remainder of dividing D.2^r by G (modulo 2 division):
R = D.2^r mod G

Ex: D= 10101001
r= 3 bits
G=1001
Calculation:
10101001 000 / 1001 = 10111110 (Quotient)
... (steps of modulo 2 division shown visually in original) ...
Remainder R = 110
The string to send is 10101001 110 (D followed by R)

CRC under polynomial form

1011 -> x^3 + x + 1
Examples of CRC generators used in practice:
* CRC 8 = x^8 + x^2 + x + 1
* CRC 12 = x^12 + x^11 + x^3 + x^2 + x + 1
* CRC 16 CCITT = x^16 + x^12 + x^5 + 1
* CRC 32 = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1

The longer G is, the more likely CRC is to detect errors.
CRC is widely used in practice: Wi-Fi, ATM, Ethernet
The XOR operation is implemented in hardware
Can detect burst errors shorter than r+1 bits

CRC Example

Frame : 1101011011
Generator : G(x) = x^4 + x + 1 -> P = 10011
Dividend : Fk = 1101011011 0000 (Frame appended with r=4 zeros)
R = Fk mod P = 1110 (Result of modulo 2 division)
Send : 1101011011 1110 (Frame + CRC)

CRC Example (Division steps shown visually in original)
1101011011 0000 / 10011 -> Remainder: 1110 (CRC)

CRC Check (Received: 1101011011 1110)
1101011011 1110 / 10011 -> Remainder: 00000 -> No errors

CRC Check (Received with error: 1101001011 1110)
1101001011 1110 / 10011 -> Remainder: 00101 -> not 0 -> errors

Reaction when errors detected

Objective: to assure that data are received correctly even though the channel is not reliable.
Constraint:
* Data frame must be correctly received
* Negligible transmission delay.

Possible errors:
* Whole frame loss
* Error frame
* Loss of error warning message

Popular techniques:
* Error detection (as seen above)
* Acknowledgement/confirmation
* Retransmission after a clear confirmation that the frame has not arrived
* Retransmission after timeout
* ARQ technique (automatic repeat request). There are 3 versions:
    * Stop and Wait ARQ
    * Go Back N ARQ
    * Selective Reject ARQ
* Similar to techniques used in flow control.

Stop and wait ARQ

Normal case (Visual representation):
Sender -> send pkt 0 -> Receiver (pkt 0 is OK)
Receiver -> rcv ACK -> Sender
Sender -> send pkt 1 -> Receiver (pkt 1 is corrupted)
Receiver -> rcv NAK -> Sender
Sender -> resend pkt 1 -> Receiver

Error ACK/NAK (Visual representation):
ACK error, resend the previous packet
Duplicated packets problem.
To eliminate repeated packet: Use Seq.#
All packets are assigned Seq# before sending out. Repeated packet has identical Seq#

Sender -> send pkt 0 -> Receiver (pkt 0 is OK)
Receiver -> rcv ACK -> Sender
Sender -> send pkt 1 -> Receiver (pkt 1 is OK)
Receiver -> rcv sth corrupted ! (ACK lost/corrupted) -> Sender (Timeout occurs)
Sender -> resend pkt 1 -> Receiver (rcv pkt 1 duplicate, discard it)

Stop and wait ARQ (Not using NAK) (Visual representation):
ACK packet carries the Seq# of the packet being acknowledged. This number is called the acknowledgment number.
An ACK with acknowledgment number n implicitly confirms that all packets with Seq# <= n have been received correctly.

Sender -> send pkt 0 -> Receiver (pkt 0 is OK)
Receiver -> rcv ACK 0 -> Sender
Sender -> send pkt 1 -> Receiver (pkt 1 is OK)
Receiver -> rcv ACK 1 -> Sender
Sender -> send pkt 2 -> Receiver (pkt 2 is corrupted)
Receiver -> (Does nothing, waits) -> Sender (Timeout for ACK 2)
Sender -> resend pkt 2 -> Receiver

Stop and wait ARQ: When ACK is lost

Data packets and ACK packets may be lost
No ACK is received at the sender side
How does a sender decide whether to resend data?
Solution:
* After sending out a packet, the sender starts a timer specifying the maximum waiting time (timeout) for an ACK of the packet.
* When the timeout expires, the sender resends the packet
How long should a timeout be?
* At least 1 RTT (Round Trip Time)
If a packet arrives at the destination but its ACK is lost, the packet is still resent because the associated timeout expired.
The duplicated packets are eliminated at the receiver side according to the repeated Seq#.

ARQ with timeout (Visual representations showing timeout scenarios for lost packet and lost ACK)

Flow control

What is flow control
Goal: Make sure that the sender does not overload the receiver
Why overloading?
* The receiver stores data frames in a buffer.
* The receiver performs some processing before delivering data to the upper level.
* The buffer could be full, leaving no space for receiving more frames, so some data frames must be dropped.
The problem of transmission errors is excluded here:
* All frames are transmitted to the correct receiver without error
* Propagation time is small and can be ignored
Solution
* Stop and wait mechanism
* Sliding window mechanism

Stop and wait

Principles
* Transmitter sends a single frame
* Receiver receives the frame, processes it, and then informs the transmitter that it is ready to receive the next frame by an explicit acknowledgement (ACK).
* Transmitter waits until reception of the ACK before sending the next frame.

Stop and wait (Visual representation showing frame transmission and ACK wait time)
transmitter -> frame -> receiver
receiver -> Ack -> transmitter (wait time)

Stop and wait

Advantage
* Simple, suitable for transmission of big size frames
Weakness
* When frames are small, the transmission channel is not used efficiently.
* Big size frames often cannot be used due to
    * Limitation in buffer size
    * Big size frames are prone to higher error probability
* In a shared medium, it is not convenient to let one station use the medium for a long time

Sliding window: principle

Transmitter sends more than one frame without waiting in order to reduce waiting time
Transmitted frames without ACK are kept in a buffer.
The number of frames that can be transmitted without ACK depends on the size of the buffer at the transmitter
When the transmitter receives an ACK, it releases the successfully transmitted frames from the buffer
The transmitter then continues by sending a number of frames equal to the number of successfully transmitted frames.

Sliding window: principle

Assume that A and B are two stations connected by a full duplex medium
B has a buffer size of n frames.
B can receive n frames without sending ACK
Acknowledgement
* In order to keep track of ACKed frames, it is necessary to number frames.
* B acknowledges a frame by telling A which frame B is waiting for (by frame number), implicitly saying that B has received all frames before that one.
* One ACK frame serves to acknowledge several frames.

Sliding windows: principle (Visual representation showing sender and receiver windows)
Sender window lists the frames to transmit
Receiver window lists the frames waiting to be received

Sliding windows (Visual representation showing window sliding as frames are sent and ACKed)

Sliding windows

Frames are numbered. The maximum number must not be smaller than the size of the window.
Frames are ACKed by another message carrying a number
Accumulated ACK: If frames 1,2,3,4 are well received, just send ACK 4
ACK with number k means frames k-1, k-2, ... have already been well received.

Sliding windows

Transmitter needs to manage some information:
* List of frames transmitted successfully
* List of frames transmitted without ACK
* List of frames to be sent immediately
* List of frames NOT to be sent immediately
Receiver keeps track of
* List of frames well received
* List of frames expected to receive

Piggybacking

A and B transmit data in both directions
When B needs to send an ACK while it still has data to send, B attaches the ACK to the data frame: piggybacking
Otherwise, B can send an ACK frame separately
After the ACK, if B sends some other data, it still puts the ACK information in the data frame.
Sliding window is much more efficient than Stop and Wait
More complicated to manage.

Exercises

Given a link with rate R=100 Mbps
We need to send a file over data link layer with file size L=100 KB
Assume that the size of a frame is: 1 KB, header size is ignored
Round trip time (RTT) between 2 ends of the link is 3 ms
An ACK message is sent back from receiver whenever a frame is arrived. Size of ACK message is negligible
* What is the transmission time required if using the Stop and wait mechanism?
* What is the transmission time with sliding window if the window size is 7?
* Which window size gives the fastest transmission?

Transmission time with Stop and wait (Visual representation showing T_transmit + RTT per frame)
T_transmit + RTT

Media access control

Connection types

Point to point
* ADSL
* Telephone modem
* Leased Line.

Broadcast
* LAN using bus topology
* Wireless LAN
* HFC
Broadcast networks need media access control protocol in order to avoid collision when nodes try to send data.

Multiple access links, protocols

two types of links:
* point to point
    * point to point link between Ethernet switch, host
    * PPP for dialup access
* broadcast (shared wire or medium)
    * old fashioned Ethernet
    * upstream HFC in cable based access network
    * 802.11 wireless LAN, 4G/5G, satellite

shared wire (e.g., cabled Ethernet)
shared radio: Wi-Fi
shared radio: satellite
humans at a cocktail party (shared air, acoustical)
shared radio: 4G/5G

Multiple access protocols

single shared broadcast channel
two or more simultaneous transmissions by nodes: interference
collision if node receives two or more signals at the same time
multiple access protocol: distributed algorithm that determines how nodes share channel, i.e., determine when node can transmit
communication about channel sharing must use channel itself! no out of band channel for coordination

An ideal multiple access protocol

given: multiple access channel (MAC) of rate R bps
desiderata:
1. when one node wants to transmit, it can send at rate R.
2. when M nodes want to transmit, each can send at average rate R/M
3. fully decentralized:
    * no special node to coordinate transmissions
    * no synchronization of clocks, slots
4. simple

MAC protocols: taxonomy

three broad classes:
* channel partitioning
    * divide channel into smaller pieces (time slots, frequency, code)
    * allocate piece to node for exclusive use
    * e.g. time: TDMA, frequency: FDMA, code: CDMA
* random access
    * channel not divided, allow collisions
    * recover from collisions
    * e.g. Pure Aloha, Slotted Aloha, CSMA/CD, CSMA/CA
* taking turns (sequence access)
    * nodes take turns, but nodes with more to send can take longer turns
    * Token Ring, Token Bus

Channel division

FDMA: frequency division multiple access
TDMA: time division multiple access
CDMA: code division multiple access

TDMA vs FDMA (Visual representation showing frequency/time allocation for 4 stations)
FDMA: frequency divided among stations, time is continuous for each.
TDMA: time divided into slots, each station uses full frequency during its slot.

CDMA

Several senders can share the same frequency on a single physical channel.
Signals coming from different senders are encoded (multiplied) with different random codes. Those codes must be orthogonal.
Encoded signals are mixed and then transmitted on a common frequency.
The signals are recovered at the receiver by finding the correlation with the same codes used at the sender side.
CDMA shows a lot of advantages that other technologies cannot achieve. For example, the same frequency can be used in adjacent mobile cells without the interference that would occur if TDMA or FDMA were used.

CDMA (example) (Visual representation showing encoding/decoding with orthogonal codes)

Random access: Pure Aloha

Aloha is used in mobile network of 1G, 2.5G, 3G using GSM technology.
Pure Aloha:
* When one sender has data to send, it just sends it
* If, while sending, the sender receives data from other stations, there is a collision.
* All stations need to resend their data.
* There is a possibility of collision again when retransmitting.
Problem: Sender does not check whether the channel is free before sending data
(Visual representation showing overlapping packets causing collision) Grey packets overlap in time, causing collision

Random access: Slotted Aloha

The time axis is divided into equal slots.
Each station sends data only at the beginning of a time slot.
Collision possibility is reduced
(Visual representation showing packets aligned to slots, collision still possible within a slot) Grey packets still collide

Random access: CSMA

CSMA: Carrier Sense Multiple Access
CSMA idea is similar to what happens in a meeting.
CSMA:
* The sender listens before talking
* If the channel is busy, wait
* If the channel is free, transmit

CSMA

CSMA: Sender listens before transmission:
* If the channel is free, send all the data
* If the channel is busy, wait.
Why are there still collisions?
* Due to propagation delay
(Visual representation showing collision due to propagation delay)

Collision in CSMA

Assume that there are 4 nodes in the channel
The propagation of the signal from one node to the other requires a certain delay.
Ex: Transmissions from B and D cause collision
(Visual representation showing spatial layout and signal propagation leading to collision)

CSMA/CA (Collision Avoidance)

CSMA/CA is used in the Wi-Fi standard IEEE 802.11
If two stations discover that the channel is busy and both wait, it is possible that they will try to resend data at the same time. -> collision
Solution: CSMA/CA.
Each station waits for a random period to reduce the collision possibility

CSMA/CD

Used in Ethernet
CSMA with Collision Detection:
* Listen while talking.
* A sender listens to the channel,
* If the channel is free then transmit data
* While a station transmits data, it listens to the channel. If it detects a collision, it transmits a short signal warning of the collision and then stops
* Do not continue the transmission during a collision, as plain CSMA does
* If the channel is busy, wait then transmit with probability p
* Retransmit after a random waiting time.

Comparison between channel division and random access

Channel division
* Efficient, treats stations equally.
* Waste of resources if one station has much less data to send than the others
Random access
* When total load is small: Efficient since each station can use the whole channel
* When total load is large: Collision possibility increases.
Token control: compromise between the two methods above.

Taking turns MAC protocols

polling:
* master node invites other nodes to transmit in turn
* typically used with dumb devices
* concerns:
    * polling overhead
    * latency
    * single point of failure (master)
(Visual representation: master polls slaves, slaves send data)

Token Ring

A token is passed from one node to the next in a ring topology
Only the token holder can transmit data
After it finishes sending data, the token holder must pass the token to the next node.
Some problems
* Passing the token takes time
* The token can be lost for various reasons
(Visual representation: Token (T) passed around ring, node with T sends data)

Summary on Media access control mechanisms

* Channel division
* Random access
* Token
What do you think about their advantages and weaknesses?

Point to Point forwarding mechanism
Hub, Switch, Bridge

Devices of LAN

Repeater, Hub, bridge and switch
All are LAN devices with many ports
Repeater:
* Repeats the bits received in one port to the other port
* One network with repeaters = one collision domain
* Repeater is a physical layer system.
Hub:
* Receives the signal from one port, amplifies it, and forwards it to the remaining ports
* Does not offer datalink layer services
* Layer 1 intermediate system

Hub

Hub = Multiple port repeater
Single collision domain
Receives the signal from one port, amplifies it, and forwards it to the remaining ports
(Visual representation of a Hub connecting multiple devices)

Devices of LAN (cont.)

Bridge
* More intelligent than hub
* Can store and forward data (Ethernet frame) according to MAC address.
* Bridge breaks the network into two collision domains.
* Layer 2 intermediate system
Switch
* More ports than bridge
* Can store and forward data according to MAC address
* Receive full frame, check error, forward

Bridge

Two-port system
* Forward frames from one port to the other based on MAC address
* Create two collision domains
(Visual representation of a bridge connecting two hubs/segments)

Switch: multiple simultaneous transmissions

switch with six interfaces ( 1,2,3,4,5,6 )
(Visual representation: Switch with hosts A, B, C connected)
hosts have dedicated, direct connection to switch
"""


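# Strip the leading code-fence marker that the LLM wrapped around its output.
FENCE_PREFIX = "\n```plaintext\n"
if condensed_pdf_text.startswith(FENCE_PREFIX):
    condensed_pdf_text = condensed_pdf_text[len(FENCE_PREFIX):]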


# print(condensed_pdf_text)
from my_helper_function import count_tokens_for_gemini

num_token_raw = count_tokens_for_gemini(pdf_text)
num_token_cleaned = count_tokens_for_gemini(cleaned_pdf_text)
num_token_condensed = count_tokens_for_gemini(condensed_pdf_text)
print(f"""
    raw:       {num_token_raw}
    ------
    cleaned:   {num_token_cleaned}
    ------
    condensed: {num_token_condensed}
""")

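Because `condensed_pdf_text` is pasted in by hand, its worked examples can be checked mechanically before the text is embedded. The sketch below is a minimal, illustrative implementation of the modulo-2 (XOR) division and the ones complement checksum described in the text. The function names are hypothetical and are not part of `my_helper_function`.

```python
from typing import Sequence


def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of the bit string divided by the generator (modulo-2 division)."""
    r = len(generator) - 1
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    # Slide the generator along the dividend, XORing wherever the leading bit is 1.
    for i in range(len(work) - r):
        if work[i]:
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-r:])


def crc_bits(data: str, generator: str) -> str:
    """CRC bits R = D.2^r mod G: append r zeros to D, then take the remainder."""
    r = len(generator) - 1
    return mod2_remainder(data + "0" * r, generator)


def ones_complement_checksum(segments: Sequence[str]) -> str:
    """Inverted ones complement sum of equal-length bit segments."""
    n = len(segments[0])
    mask = (1 << n) - 1
    total = 0
    for segment in segments:
        total += int(segment, 2)
        total = (total & mask) + (total >> n)  # add the overflow bit back in
    return format(~total & mask, f"0{n}b")


print(crc_bits("10101001", "1001"))                         # 110
print(crc_bits("1101011011", "10011"))                      # 1110
print(mod2_remainder("11010110111110", "10011"))            # 0000, no error
print(mod2_remainder("11010010111110", "10011"))            # 0101, error detected
print(ones_complement_checksum(["0011", "0110", "1000"]))   # 1101
```

A receiver that passes the segments together with the checksum to `ones_complement_checksum` gets all zeros back when nothing was corrupted, which is the inverted form of the "all 1s" test in the text.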


### Condense information (placement in the notebook undecided)

In [9]:
def condense_information(text: str) -> str:
    information_condenser_model = ChatGoogleGenerativeAI(
        model = "gemini-2.5-pro-exp-03-25",
        temperature = 0,
        max_tokens=8192,
        timeout=None
    )

    condense_information_prompt_template = """
        You are an expert text condenser for Retrieval Augmented Generation (RAG) systems. Your task is to maximize the information density of the given text while minimizing information loss.  This is for the preprocessing stage of a RAG pipeline, where the condensed text will be stored and later retrieved to answer user queries.

        Here are the guidelines:

        1.  **Core Information Preservation:** Retain all key entities, facts, relationships, and concepts.  Do NOT remove information that is crucial to understanding the original meaning.

        2.  **Contextual Integrity:** Ensure the condensed text remains coherent and understandable.  Maintain the logical flow of ideas, even if expressed more succinctly.

        3.  **Density Increase:**
            * Remove redundant phrases and words.
            * Replace verbose expressions with more concise equivalents.
            * Combine sentences where appropriate, without sacrificing clarity.
            * Use abbreviations and acronyms judiciously, only if they are widely understood or defined within the text.
            * Omit filler words and phrases (e.g., "it is important to note that", "in conclusion").
            * Avoid unnecessary details or elaborations that do not significantly contribute to the core meaning.
            * Use active voice where possible.

        4.  **Minimize Loss:** Do not sacrifice accuracy or completeness for the sake of brevity.  If a piece of information is important, keep it, even if it makes the text slightly longer.  Favor including slightly more information over omitting something crucial.

        5.  **Output Format:** Provide the condensed text as a single, coherent paragraph. Do not use bullet points or numbered lists unless the original text heavily relies on them and they are essential for understanding.

        6.  **Examples:**

            * **Input:** "The Battle of Hastings was fought on 14 October 1066 between the Norman-French army of William the Conqueror and an English army under the Anglo-Saxon King Harold Godwinson, beginning the Norman conquest of England."
            * **Output:** "On 14 October 1066, the Battle of Hastings marked the beginning of the Norman conquest of England, opposing the Norman-French army of William the Conqueror against the English army of Anglo-Saxon King Harold Godwinson."

            * **Input:** "The quick brown fox jumps over the lazy dog. This is a common English pangram. Pangrams are useful because they display all of the letters in the alphabet."
            * **Output:** "The quick brown fox jumps over the lazy dog, a common English pangram that displays all letters of the alphabet."

        7. **Specific Instructions for this task**:
            * Keep the language of the condensed text the same as the original text.
            * Do not add any information that is not present in the original text.
            * Be precise and factual.
            * The output should be one paragraph.

        Here is the text to condense inside three backticks:

        ```
        {text}
        ```
    """

    condense_information_prompt = PromptTemplate(template=condense_information_prompt_template, input_variables=["text"])

    string_output_parser = StrOutputParser()

    condense_chain = (
        {"text": lambda x: x}  # Identity function to pass the text
        | condense_information_prompt
        | information_condenser_model
        | string_output_parser # Use string output parser
    )

    # Condense the input text using the Gemini LLM and LangChain.
    # `text` is the raw text to condense; the chain returns the condensed
    # text as a single string.

    return condense_chain.invoke(text)
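
`max_tokens=8192` caps the model's output, so a long source may come back truncated when it is condensed in a single call. One way around this, sketched here as an assumption rather than as part of this notebook's pipeline, is to split the input on blank lines and condense each piece separately. `split_by_paragraphs` and its `max_chars` budget are hypothetical.

```python
def split_by_paragraphs(text: str, max_chars: int = 8000) -> list:
    """Greedily pack blank-line-separated paragraphs into pieces of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    pieces = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = paragraph
    if current:
        pieces.append(current)
    return pieces


print(split_by_paragraphs("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

Each piece could then be passed to `condense_information` and the results joined. The trade-off is that the model no longer sees context across pieces.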

### Chunking

In [None]:
### Using cluster semantic chunker
from my_helper_function import cluster_chunking



'''
chunks = cluster_chunking(condensed_pdf_text)

print(f"Number of chunks: {len(chunks)}")
for i in range(len(chunks)):
    print(f"""
============================== {i}th chunks =============================

{chunks[i]}
""")

'''
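
Printing every chunk gets unwieldy for a long document. A short summary of chunk lengths, sketched below, is often enough to spot chunks that are far too short or too long. `summarize_chunks` is a hypothetical helper and assumes `chunks` is a list of strings, as the commented-out loop above suggests.

```python
from statistics import mean


def summarize_chunks(chunks):
    """Return basic length statistics (in characters) for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": mean(lengths),
    }


print(summarize_chunks(["Framing", "Error control and flow control"]))
```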

## Data Encoding

In [11]:
MONGODB_ATLAS_CLUSTER_URI = os.getenv("MONGODB_ATLAS_CLUSTER_URI")

In [12]:
def encode_cleaned_text(text, mongo_db_uri):
    # Chunk `text`, embed the chunks, and store them in MongoDB Atlas.
    # Note: the target collection is cleared before the new chunks are inserted.
    client = MongoClient(mongo_db_uri)

    DB_NAME = "langchain_test_db"
    COLLECTION_NAME = "langchain_test_vectorstores"
    ATLAS_VECTOR_SEARCH_INDEX_NAME = "langchain-test-index-vectorstores"

    MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

    # To reuse an existing vector store instead of rebuilding it, uncomment:
    # if MONGODB_COLLECTION.count_documents({}) > 0:
    #     print("Using existing vector store from MongoDB")
    #     embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    #     vector_store = MongoDBAtlasVectorSearch(
    #         collection=MONGODB_COLLECTION,
    #         embedding=embeddings,
    #         index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    #         relevance_score_fn="cosine",
    #     )
    #     return vector_store

    # Start from an empty collection so reruns do not duplicate chunks.
    MONGODB_COLLECTION.delete_many({})

    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = MongoDBAtlasVectorSearch(
        collection=MONGODB_COLLECTION,
        embedding=embeddings,
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
        relevance_score_fn="cosine",
    )
    vector_store.create_vector_search_index(dimensions=768)

    # cluster_chunking is imported in the Chunking cell above.
    chunks = cluster_chunking(text)

    # Wrap each chunk in a Document; more metadata can be added as needed.
    documents = [
        Document(page_content=chunk, metadata={"source": "pdf"})
        for chunk in chunks
    ]

    # add_documents updates this vector_store instance in place.
    # from_documents is a class method that returns a new instance instead:
    # vector_store = MongoDBAtlasVectorSearch.from_documents(
    #     documents=documents,
    #     embedding=embeddings,
    #     collection=MONGODB_COLLECTION,
    #     index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME
    # )
    vector_store.add_documents(documents)

    return vector_store


In [13]:
chunks_vector_store = encode_cleaned_text(text=condensed_pdf_text, mongo_db_uri=MONGODB_ATLAS_CLUSTER_URI)

In [None]:
results = chunks_vector_store.similarity_search_with_score("error control", k=5)
for res, score in results:
    print(f"""
==================================================================================
* [SIM={score:.3f}] {res.page_content} [{res.metadata}]""")
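
With `relevance_score_fn="cosine"`, the scores printed above are derived from the cosine similarity between the query embedding and each chunk embedding. The sketch below shows that computation in plain Python on toy vectors. The `(1 + cos) / 2` normalization is how the Atlas Vector Search documentation appears to describe its cosine score; treat it as an assumption to verify against your Atlas version. Both function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def normalized_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed Atlas-style score)."""
    return (1 + cosine_similarity(a, b)) / 2


query = [1.0, 0.0]
print(normalized_cosine_score(query, [2.0, 0.0]))   # same direction
print(normalized_cosine_score(query, [0.0, 3.0]))   # orthogonal
print(normalized_cosine_score(query, [-1.0, 0.0]))  # opposite direction
```

Under that assumption, a score near 1 means the chunk points in almost the same direction as the query, and a score near 0.5 means the two are essentially unrelated.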
