In [1]:
# pip install langchain==0.2.17
# pip install --upgrade pip
# pip install --no-cache-dir langchain==0.3.1

## Document Loaders In LangChain

### 1. TextLoader

In [2]:
# Loaders live in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader

loader = TextLoader("nvda_news_1.txt")
loader.load()[0]



In [3]:
loader.file_path

'nvda_news_1.txt'
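
As a rough mental model, a text loader reads a file and wraps its contents in a document object that carries the text plus a metadata dictionary. The sketch below mimics that idea with the standard library only. SimpleDocument and load_text are hypothetical names used for illustration, not LangChain's classes, and the temporary file stands in for nvda_news_1.txt.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(file_path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and return it as a single-document list."""
    with open(file_path, encoding=encoding) as f:
        content = f.read()
    return [SimpleDocument(page_content=content, metadata={"source": file_path})]


# Demo with a temporary file standing in for a real news article.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Sample article text.")
    sample_path = tmp.name

docs = load_text(sample_path)
print(docs[0].page_content)
print(docs[0].metadata)
os.remove(sample_path)
```

Returning a list even for a single file keeps the shape consistent with loaders that produce many documents, which is why the notebook indexes the result with [0].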

### 2. CSVLoader

In [4]:
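# Same note as above: import from langchain_community rather than the deprecated langchain path
from langchain_community.document_loaders.csv_loader import CSVLoader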

In [5]:
loader = CSVLoader(file_path="movies.csv")
data = loader.load()
data[0]

Document(metadata={'source': 'movies.csv', 'row': 0}, page_content='movie_id: 101\ntitle: K.G.F: Chapter 2\nindustry: Bollywood\nrelease_year: 2022\nimdb_rating: 8.4\nstudio: Hombale Films\nlanguage_id: 3\nbudget: 1\nrevenue: 12.5\nunit: Billions\ncurrency: INR')
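
The output above suggests that each CSV row becomes one document whose page_content is a series of "column: value" lines, with the file name and row index in the metadata. The following sketch reproduces that observable format with Python's csv module. It is an illustration under that assumption, not LangChain's implementation; rows_to_documents is a hypothetical helper and the inline CSV holds only the first few columns of movies.csv.

```python
import csv
import io

# Inline CSV standing in for movies.csv (first four columns only).
csv_text = "movie_id,title,industry,release_year\n101,K.G.F: Chapter 2,Bollywood,2022\n"


def rows_to_documents(handle, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    documents = []
    for row_number, row in enumerate(csv.DictReader(handle)):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


documents = rows_to_documents(io.StringIO(csv_text), source="movies.csv")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```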

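### 3. UnstructuredURLLoader
LangChain's UnstructuredURLLoader internally uses the unstructured Python library to load content from URLs. The cell below uses WebBaseLoader instead, which fetches each page and parses the HTML with BeautifulSoup.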

https://unstructured-io.github.io/unstructured/introduction.html

https://pypi.org/project/unstructured/#description

In [6]:
from langchain_community.document_loaders import WebBaseLoader

# Define the URLs to load
urls = [
    "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
    "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
]

# Load documents from URLs
loader = WebBaseLoader(urls)
data = loader.load()

# Check the number of documents loaded
print(len(data))  # Should print 2 if both URLs are accessible

USER_AGENT environment variable not set, consider setting it to identify your requests.


2
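
The USER_AGENT warning in the output above appears because that environment variable is not set. One way to address it, assuming the variable is read when the loader module is first imported, is to set it before the import. The agent string below is a made-up placeholder.

```python
import os

# Hypothetical identifier; replace it with something that describes your application.
os.environ["USER_AGENT"] = "langchain-loaders-tutorial/0.1"

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader
print(os.environ["USER_AGENT"])
```

In a notebook, this cell needs to run before the first import of the loader; restarting the kernel may be required if the module was already imported.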


In [7]:
for doc in data:
    print(doc.page_content[:500])

  HDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer             

  

      

   

  EnglishHindiGujaratiSpecialsSearch Quotes, News, Mutual Fund NAVsTrending StocksKalyan Jeweller INE303R01014, KALYANKJIL, 543278ITC Hotels INE379A01028, ITCHOTELS, 544325Ola Electric INE0LXG01040, OLAELEC, 544225Suzlon Energy INE040H01021, SUZLON, 532667Reliance INE002A01018, RELIANCE, 500325QuotesMutual FundsCommoditiesFutures & OptionsCurrencyNewsCryptocurrencyForumNoticesVideosGlossaryAll  Hello,
  Market corrects post RBI MPC outcome| Bet on these top 10 rate-sensitive stocks             

  

      

   

  EnglishHindiGujaratiSpecialsSearch Quotes, News, Mutual Fund NAVsTrending StocksKalyan Jeweller INE303R01014, KALYANKJIL, 543278ITC Hotels INE379A01028, ITCHOTELS, 544325Ola Electric INE0LXG01040, OLAELEC, 544225Suzlon Energy INE040H01021, SUZLON, 532667Reliance INE002A01018, RELIANCE, 500325QuotesMutual FundsCommoditiesFutures & OptionsCurrencyNewsCryptocurrencyForumNoticesVideosG
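
The extracted page_content above contains long runs of blank lines and padding spaces. A minimal cleanup sketch using only the standard library follows. normalize_whitespace is a hypothetical helper, and the sample string is a shortened imitation of the output rather than the real page.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Collapse runs of spaces/tabs inside each line and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "  HDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer             \n"
    "\n  \n\n      \n\n"
    "  EnglishHindiGujarati"
)
print(normalize_whitespace(sample))
```

This only tidies whitespace; it does not remove navigation text such as the menu labels that appear in the scraped output.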

In [8]:
data[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html',
 'title': 'HDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer',
 'description': 'Chakrabarti has been appointed for a period of five years from December 14, 2023 to December 13, 2028.',
 'language': 'en'}

### Text Splitters

#### Why do we need text splitters in the first place?

LLMs have token limits, so a large text must be split into smaller chunks, each within the limit. LangChain provides several text splitter classes that do this.

In [9]:
text = '''Friends is an American television sitcom created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons.[1] With an ensemble cast starring Jennifer Aniston, Courteney Cox, Lisa Kudrow, Matt LeBlanc, Matthew Perry and David Schwimmer, the show revolves around six friends in their 20s and early 30s who live in Manhattan, New York City. The original executive producers were Kevin S. Bright, Kauffman, and David Crane.

Kauffman and Crane began developing Friends under the working title Insomnia Cafe between November and December 1993. They presented the idea to Bright, and together they pitched a seven-page treatment of the show to NBC. After several script rewrites and changes, including title changes to Six of One[2] and Friends Like Us, the series was finally named Friends.[3] Filming took place at Warner Bros. Studios in Burbank, California. The series was produced by Bright/Kauffman/Crane Productions and Warner Bros. Television.

The show ranked within the top ten of the final television season ratings; it ultimately reached the number-one spot in its eighth season. The series finale aired on May 6, 2004, and was watched by around 52.5 million American viewers, making it the fifth-most-watched series finale in television history[4][5][6] and the most-watched television episode of the 2000s.[7][8] Friends received acclaim throughout its run, becoming one of the most popular television shows of all time.[9] It is also one of the most successful and highest-grossing television shows of all time, having grossed an estimated $1.4 billion since its debut.[10] The series was nominated for 62 Primetime Emmy Awards, winning the Outstanding Comedy Series award in 2002 for its eighth season.[11] The show ranked no. 21 on TV Guide's 50 Greatest TV Shows of All Time,[12] no. 29 on Variety magazine's The 100 Greatest TV Shows of All Time,[13] and no. 5 on Empire magazine's The 50 Greatest TV Shows of All Time.[14] In 1997, the episode "The One with the Prom Video" was ranked no. 100 on TV Guide's 100 Greatest Episodes of All-Time.[15] In 2013, Friends ranked no. 24 on the Writers Guild of America's 101 Best Written TV Series of All Time,[16] and no. 28 on TV Guide's 60 Best TV Series of All Time.[17] The sitcom's cast members returned for Friends: The Reunion, a reunion special which was released on HBO Max on May 27, 2021.'''

#### 1. Manual approach of splitting the text into chunks

In [10]:
# Suppose the limit is 200. The simplest approach is to slice the text.
# Note that this counts characters, not tokens, and can cut a word in half.

text[0:200]

'Friends is an American television sitcom created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons.[1] With an ensemble cast starring J'

In [11]:
# We want complete words, and we want to process the entire text, so we can use Python's split function

words = text.split(" ")
len(words)

392

In [12]:
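chunks = []

s = ""
for word in words:
    s += word + " "
    # Close the chunk once it grows past 200 characters (so a chunk may slightly exceed 200)
    if len(s) > 200:
        chunks.append(s)
        s = ""

# Keep whatever is left over as the final chunk
chunks.append(s)
print(f"First chunk from the manual Python script:\n\n{chunks[0]}")

First chunk from the manual Python script: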

Friends is an American television sitcom created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons.[1] With an ensemble cast starring Jennifer 
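
The first chunk above is slightly longer than 200 characters because the length check happens after the word is appended. A variant that checks before appending keeps every chunk within the limit. This is a minimal sketch; chunk_by_words is a hypothetical helper, and the sample string is a repeated sentence rather than the full text.

```python
def chunk_by_words(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept as its own oversized chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            # The word fits, or it starts a new chunk on its own.
            current = candidate
        else:
            # The word would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "Friends is an American television sitcom created by David Crane and Marta Kauffman. " * 5
chunks = chunk_by_words(sample, max_chars=200)
print([len(chunk) for chunk in chunks])
```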


#### 2. Using Text Splitter Classes from LangChain

Splitting data into chunks can be done in plain Python, but it is a tedious process. You may also need to experiment with different delimiters iteratively to ensure that no chunk exceeds the token limit of the target LLM.

<b>LangChain provides a better way through text splitter classes.</b>
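
To make the idea of trying delimiters iteratively concrete, here is a simplified pure-Python sketch that tries coarse separators first and falls back to finer ones only when a piece is still too long. It is not LangChain's implementation: split_recursively and its separator list are assumptions for illustration, separators are dropped from the output, and small pieces are not merged back together.

```python
def split_recursively(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to hard character slices.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, *rest = separators
    pieces = []
    for part in text.split(head):
        # Each part is retried with the remaining, finer separators.
        pieces.extend(split_recursively(part, max_chars, tuple(rest)))
    return pieces


sample = "First paragraph is short.\n\nSecond paragraph is a bit longer. It has two sentences."
pieces = split_recursively(sample, max_chars=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

The limit here is in characters; a token-based limit would need a tokenizer to measure each piece instead of len.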

In [13]:
from langchain_text_splitters import TokenTextSplitter  # imported for reference; the split below is done by hand with tiktoken
import tiktoken  # tokenizer library used by OpenAI models

# Initialize tokenizer using OpenAI's "cl100k_base" tokenizer
encoding = tiktoken.get_encoding("cl100k_base")

# Tokenize the text into tokens
tokens = encoding.encode(text)

# Create chunks ensuring each chunk has a max of 100 tokens
max_chunk_size = 100
chunks = []

for i in range(0, len(tokens), max_chunk_size):
    chunk = tokens[i:i+max_chunk_size]
    chunks.append(encoding.decode(chunk))  # Decode the tokens back into text

# Print total chunks and their token counts
print(f"Total Chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    # Re-tokenize the chunk to check token count
    chunk_tokens = encoding.encode(chunk)
    print(f"Chunk {i+1} Token Count: {len(chunk_tokens)}")

Total Chunks: 6
Chunk 1 Token Count: 100
Chunk 2 Token Count: 100
Chunk 3 Token Count: 100
Chunk 4 Token Count: 100
Chunk 5 Token Count: 100
Chunk 6 Token Count: 79
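
The chunk counts above follow directly from fixed-size windowing: the number of chunks is the token count divided by the window size, rounded up, and only the last chunk can be shorter. The sketch below checks that arithmetic with placeholder integer ids instead of real tokens. chunk_token_ids is a hypothetical helper, and the total of 579 is inferred from the printed counts (5 * 100 + 79), assuming re-tokenizing each chunk preserved its length.

```python
import math


def chunk_token_ids(token_ids, max_chunk_size):
    """Slice a token-id sequence into consecutive windows of at most max_chunk_size."""
    return [
        token_ids[i:i + max_chunk_size]
        for i in range(0, len(token_ids), max_chunk_size)
    ]


# Placeholder ids standing in for the output of a real tokenizer.
token_ids = list(range(579))
windows = chunk_token_ids(token_ids, max_chunk_size=100)

print(f"Total Chunks: {len(windows)}")
print([len(window) for window in windows])
print(len(windows) == math.ceil(len(token_ids) / 100))
```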
