In [2]:
import langchain
from langchain_core.documents import Document

In [3]:
import os
from typing import List, Dict, Any
import pandas as pd
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
)


Understanding Document Structure

In [4]:
doc = Document(
    page_content="Just a small example",
    metadata = {
        "source":"wikipedia",
        "author":"morty"
    }
)


In [5]:
print("Content",doc.page_content)

Content Just a small example
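
A Document pairs the text (page_content) with a free-form metadata dictionary, and the metadata is what lets you filter or trace chunks back to their source later. The sketch below uses a hypothetical stand-in class, SimpleDocument, built with the standard library so that it runs without LangChain installed; the second entry ("blog", "rick") is made-up illustration data.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus a metadata dict."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


sample_docs = [
    SimpleDocument("Just a small example", {"source": "wikipedia", "author": "morty"}),
    SimpleDocument("Another example", {"source": "blog", "author": "rick"}),  # made-up entry
]

# Keep only the documents whose metadata says they came from wikipedia
wiki_docs = [d for d in sample_docs if d.metadata.get("source") == "wikipedia"]
print(len(wiki_docs), wiki_docs[0].metadata["author"])
```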


Working with Text Files

In [1]:
import os
os.makedirs("data/textfiles",exist_ok=True)

In [2]:
# Sample texts, keyed by the file path each one is written to
sample_text = {
    "data/textfiles/rl_intro.txt":"""Reinforcement learning is a way of teaching machines to make decisions by letting them learn from experience rather than instructions.
      Instead of being told exactly what to do, a learning agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. Over time, the agent adjusts its behavior to maximize the total reward it receives.

At the heart of reinforcement learning is a simple idea: actions have consequences. Some choices lead to good outcomes, others to bad ones, and many only reveal their value after a sequence of steps. Because of this, reinforcement learning focuses not just on immediate rewards but on long-term success. 
An agent must balance trying new actions (exploration) with using what it already knows works well (exploitation).

This learning style closely mirrors how humans and animals learn skills, from riding a bicycle to playing a game.
 By repeatedly experimenting and improving based on feedback, reinforcement learning systems can eventually discover effective strategies in complex and uncertain environments, even when no clear “right answer” is provided in advance.
""",
    "data/textfiles/robotics_intro.txt":"""Robotics is the field that focuses on designing, building, and controlling machines that can sense, think, and act in the physical world
      A robot is more than just a machine that moves; it combines mechanical parts, electronics, and intelligence to perform tasks autonomously or with human guidance.
        The goal of robotics is to create systems that can assist, replace, or enhance human effort.

Robots rely on sensors to understand their surroundings and actuators, such as motors, to interact with the environment. 
Using control systems and algorithms, they process sensor information and decide how to move or respond. 
Because the real world is often uncertain and changing, robots must be able to adapt to different situations rather than follow fixed instructions.

Today, robotics is used in many areas, including manufacturing, healthcare, exploration, and everyday services.
 As robots become more advanced, they are increasingly capable of working safely alongside humans, performing complex tasks, and operating in environments that are difficult or dangerous for people.
"""
}

for filepath,content in sample_text.items():
    with open(filepath,"w",encoding="utf-8") as f:
        f.write(content)

print("Done")

Done


Text file loader

In [None]:
from langchain_community.document_loaders import TextLoader

# Load a single file
loader = TextLoader("data/textfiles/rl_intro.txt", encoding="utf-8")
documents = loader.load()  # returns a list of Document objects
print(type(documents))
print(documents)  # each Document shows its metadata and page_content


<class 'list'>
[Document(metadata={'source': 'data/textfiles/rl_intro.txt'}, page_content='Reinforcement learning is a way of teaching machines to make decisions by letting them learn from experience rather than instructions.\n      Instead of being told exactly what to do, a learning agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. Over time, the agent adjusts its behavior to maximize the total reward it receives.\n\nAt the heart of reinforcement learning is a simple idea: actions have consequences. Some choices lead to good outcomes, others to bad ones, and many only reveal their value after a sequence of steps. Because of this, reinforcement learning focuses not just on immediate rewards but on long-term success. \nAn agent must balance trying new actions (exploration) with using what it already knows works well (exploitation).\n\nThis learning style closely mirrors how humans and animals learn skills, from riding a bicyc

Directory loader - multiple text files

In [None]:
from langchain_community.document_loaders import DirectoryLoader

dloader = DirectoryLoader(
    "data/textfiles",
    glob="**/*.txt",    #pattern to match text files
    loader_cls=TextLoader,
    loader_kwargs={'encoding':'utf-8'},
    show_progress=True
)

docs = dloader.load()
print(f"Loaded {len(docs)} documents")  # count the directory loader's result, not the earlier single-file list



100%|██████████| 2/2 [00:00<00:00, 598.12it/s]

Loaded 2 documents
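
To make the glob pattern concrete, here is a rough standard-library sketch of what a directory loader does conceptually: walk a folder, match "**/*.txt", read each file, and record its path as the source. It is not LangChain's implementation; the function name load_text_files and the temporary files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_text_files(root: str, pattern: str = "**/*.txt", encoding: str = "utf-8") -> list[dict]:
    """Hypothetical helper: read every file matching pattern under root."""
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append({
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            })
    return records


with tempfile.TemporaryDirectory() as tmp:
    # Two .txt files (one nested) and one file the pattern should ignore
    (Path(tmp) / "a.txt").write_text("first file", encoding="utf-8")
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "nested" / "b.txt").write_text("second file", encoding="utf-8")
    (Path(tmp) / "skip.md").write_text("not matched", encoding="utf-8")

    records = load_text_files(tmp)
    print(f"Loaded {len(records)} documents")
```

The "**" part of the pattern is what makes the search recursive, so files in subfolders are picked up as well.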





Text Splitting Strategies

In [8]:

from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter
)
print(docs)

[Document(metadata={'source': 'data/textfiles/robotics_intro.txt'}, page_content='Robotics is the field that focuses on designing, building, and controlling machines that can sense, think, and act in the physical world\n      A robot is more than just a machine that moves; it combines mechanical parts, electronics, and intelligence to perform tasks autonomously or with human guidance.\n        The goal of robotics is to create systems that can assist, replace, or enhance human effort.\n\nRobots rely on sensors to understand their surroundings and actuators, such as motors, to interact with the environment. \nUsing control systems and algorithms, they process sensor information and decide how to move or respond. \nBecause the real world is often uncertain and changing, robots must be able to adapt to different situations rather than follow fixed instructions.\n\nToday, robotics is used in many areas, including manufacturing, healthcare, exploration, and everyday services.\n As robots be

In [11]:
# Method 1: character text splitter
text = documents[0].page_content  # RL intro text loaded with TextLoader above
char_splitter = CharacterTextSplitter(
    separator="\n",  # Split on newlines
    chunk_size=200,  # Max chunk size in characters
    chunk_overlap=20,  # Overlap between chunks
    length_function=len  # How to measure chunk size
)

char_chunks=char_splitter.split_text(text)
print(f"Created {len(char_chunks)} chunks")
print(f"First chunk: {char_chunks[0][:100]}...")

Created a chunk of size 253, which is longer than the specified 200
Created a chunk of size 304, which is longer than the specified 200


Created 6 chunks
First chunk: Reinforcement learning is a way of teaching machines to make decisions by letting them learn from ex...


In [12]:
print(char_chunks[0])
print("------------------")
print(char_chunks[1])

Reinforcement learning is a way of teaching machines to make decisions by letting them learn from experience rather than instructions.
------------------
Instead of being told exactly what to do, a learning agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. Over time, the agent adjusts its behavior to maximize the total reward it receives.
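
The "longer than the specified 200" warnings above appear because this splitter only cuts on its single separator: a line that is longer than chunk_size has nowhere else to be cut, so it is kept whole. The sketch below is a simplified illustration of that split-then-merge idea in plain Python, not the library's code; it ignores chunk_overlap, and split_and_merge is a hypothetical name.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole: there is no finer separator
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 30 + "\nanother short line"
chunks = split_and_merge(sample, "\n", chunk_size=20)
for c in chunks:
    print(len(c), repr(c))
```

The middle chunk comes out at 30 characters even though chunk_size is 20, which is the same situation the warnings report.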


In [13]:
# Method 2: recursive character text splitter
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # Try these in order; "" falls back to single characters
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)

recursive_chunks = recursive_splitter.split_text(text)
print(f"Created {len(recursive_chunks)} chunks")
print(f"First chunk: {recursive_chunks[0][:100]}...")

Created 9 chunks
First chunk: Reinforcement learning is a way of teaching machines to make decisions by letting them learn from ex...


In [14]:
print(recursive_chunks[0])
print("-----------------")
print(recursive_chunks[1])
print("------------------")
print(recursive_chunks[2])


Reinforcement learning is a way of teaching machines to make decisions by letting them learn from experience rather than instructions.
-----------------
Instead of being told exactly what to do, a learning agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. Over time, the agent adjusts
------------------
the agent adjusts its behavior to maximize the total reward it receives.
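
Unlike the single-separator splitter, the recursive one falls back to finer separators when a piece is still too long, which is why the long second line above was cut at a space instead of being kept whole. A simplified sketch of that fallback, assuming no overlap and a hard character cut once the separators run out (recursive_split is a hypothetical name, not the library's function):

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing finer left to try: cut into fixed-size slices
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry it with the remaining, finer separators
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon zeta eta theta"
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=16)
print(chunks)
```

Every chunk stays within chunk_size here because the second paragraph is re-split on spaces.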


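Token-based splitting

Note: this cell may fail behind a proxy (as it did for me), because TokenTextSplitter uses tiktoken, which downloads its encoding files on first use.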

In [18]:
# Method 3: token-based splitting
token_splitter = TokenTextSplitter(
    chunk_size=50,  # Size in tokens (not characters)
    chunk_overlap=10
)

token_chunks = token_splitter.split_text(text)
print(f"Created {len(token_chunks)} chunks")
print(f"First chunk: {token_chunks[0][:100]}...")

Created 6 chunks
First chunk: Reinforcement learning is a way of teaching machines to make decisions by letting them learn from ex...
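
Token splitting is essentially a sliding window over the token sequence: each window holds chunk_size tokens and starts chunk_size - chunk_overlap tokens after the previous one. The sketch below shows that windowing with whitespace-separated words standing in for tokens; the real splitter counts tokenizer tokens, so its boundaries will differ. window_tokens is a hypothetical helper.

```python
def window_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return windows


# Assumption: words act as tokens here purely for illustration
tokens = "the agent takes actions and receives rewards from the environment".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(" ".join(w))
```

Each window repeats the last token of the previous one, which is the same effect as the repeated phrases at the chunk boundaries in the token splitter's output.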


In [19]:
for chunk_num, chunk in enumerate(token_chunks):
    print(f"{chunk_num}. {chunk}\n")

0. Reinforcement learning is a way of teaching machines to make decisions by letting them learn from experience rather than instructions.
      Instead of being told exactly what to do, a learning agent interacts with an environment, takes actions, and

1.  agent interacts with an environment, takes actions, and receives feedback in the form of rewards or penalties. Over time, the agent adjusts its behavior to maximize the total reward it receives.

At the heart of reinforcement learning is a simple idea: actions

2.  heart of reinforcement learning is a simple idea: actions have consequences. Some choices lead to good outcomes, others to bad ones, and many only reveal their value after a sequence of steps. Because of this, reinforcement learning focuses not just on immediate rewards but

3. , reinforcement learning focuses not just on immediate rewards but on long-term success. 
An agent must balance trying new actions (exploration) with using what it already knows works well (exploit