Document Loading

In [1]:
from langchain.document_loaders import TextLoader

# Read the file as UTF-8 explicitly; the platform default encoding can garble characters such as ’ and —
loader = TextLoader('AI News.txt', encoding='utf-8')
data = loader.load()
print(data[0])

page_content='In Silicon Valley, what matters most is staying ahead of the competition. If you are Apple or Meta and find yourself short on talent or unable to build the technology in-house, the typical approach is to spend big. That usually means acquiring a successful company, bringing in the team behind the product, and either integrating it into your core offerings or giving the founders enough autonomy to continue innovating within your brand. It’s an approach many Silicon Valley companies have embraced—some have made blockbuster acquisitions, while others have gotten burned.

Last week, both Apple and Meta made headlines with news of their interest in acquiring Perplexity AI, a leading AI startup founded by Indian-origin computer scientist Arvind Srinivas. While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time. The timing makes it even more interesti

In [2]:
from langchain.document_loaders.csv_loader import CSVLoader
# CSVLoader turns each row of the CSV file into one Document
loader = CSVLoader("Telco Customer Churn.csv")
data = loader.load()
len(data)

7032

In [3]:
data[0].page_content

'_c0: 0\ncustomerID: 7590-VHVEG\ngender: Female\nSeniorCitizen: 0\nDependents: No\ntenure: 1\nPhoneService: No\nMultipleLines: No\nInternetService: DSL\nOnlineSecurity: No\nStreamingTV: No\nContract: Month-to-month\nMonthlyCharges: 29.85\nTotalCharges: 29.85\nChurn: No'

In [4]:
data[0].metadata

{'source': 'Telco Customer Churn.csv', 'row': 0}
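
The output above suggests that each CSV row becomes one document whose text is a list of "column: value" lines, with the file name and row index kept as metadata. The following is a minimal sketch of that layout using only the standard library. It is not LangChain's implementation; the function name `rows_to_documents`, the plain-dict document shape, and the second sample row are assumptions made for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Build one dict per CSV row, mimicking the layout shown above (simplified)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_index, row in enumerate(reader):
        # Each column becomes a "name: value" line in the document text.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_index},
        })
    return docs


# Hypothetical two-row sample; only the first row comes from the output above.
sample = "customerID,gender,tenure\n7590-VHVEG,Female,1\n0000-AAAAA,Male,34\n"
docs = rows_to_documents(sample, "sample.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping the row index in the metadata makes it possible to trace a retrieved chunk back to the exact row it came from.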

In [7]:
from langchain.document_loaders import UnstructuredURLLoader
# Fetches each URL, parses the page's HTML structure, and extracts its text content
loader = UnstructuredURLLoader(urls=[
    "https://www.reuters.com/world/india/gold-subdued-dollar-gains-markets-await-iran-response-2025-06-23/",
    "https://www.jll.com/en-in/insights/land-transactions-in-india"
])

data = loader.load()
print(len(data))
data[0].metadata

2


{'source': 'https://www.reuters.com/world/india/gold-subdued-dollar-gains-markets-await-iran-response-2025-06-23/'}

Text Splitting

In [10]:
# Sample text for the splitting examples
text = """In Silicon Valley, what matters most is staying ahead of the competition. 
If you are Apple or Meta and find yourself short on talent or unable to build the technology in-house, the typical approach is to spend big. 
That usually means acquiring a successful company, bringing in the team behind the product, and either integrating it into your core offerings or giving the founders enough autonomy to continue innovating within your brand.
It’s an approach many Silicon Valley companies have embraced—some have made blockbuster acquisitions, while others have gotten burned.

Last week, both Apple and Meta made headlines with news of their interest in acquiring Perplexity AI, a leading AI startup founded by Indian-origin computer scientist Arvind Srinivas. 
While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time. 
The timing makes it even more interesting, especially in the case of Apple, which typically avoids bringing in outside talent and prefers to build competing technologies in-house.

Bloomberg first reported that Meta approached Perplexity about a potential takeover before the company recently invested $14.3 billion in Scale AI. 
Unsurprisingly, the San Francisco–based AI startup was also on Apple’s radar, according to another Bloomberg report. 
It’s not clear whether Perplexity is up for sale, whether Meta or Apple has held formal talks to acquire the company, or how close either might be to sealing a deal. 
We may not know until one of them makes an official announcement."""

In [11]:
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=200,
    chunk_overlap=0
)

chunks = splitter.split_text(text)
len(chunks)

Created a chunk of size 223, which is longer than the specified 200


11

In [12]:
chunks

['In Silicon Valley, what matters most is staying ahead of the competition.',
 'If you are Apple or Meta and find yourself short on talent or unable to build the technology in-house, the typical approach is to spend big.',
 'That usually means acquiring a successful company, bringing in the team behind the product, and either integrating it into your core offerings or giving the founders enough autonomy to continue innovating within your brand.',
 'It’s an approach many Silicon Valley companies have embraced—some have made blockbuster acquisitions, while others have gotten burned.',
 'Last week, both Apple and Meta made headlines with news of their interest in acquiring Perplexity AI, a leading AI startup founded by Indian-origin computer scientist Arvind Srinivas.',
 'While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time.',
 'The timing makes it even more inte

In [13]:
for chunk in chunks:
    print(len(chunk))

73
140
223
134
183
181
179
147
116
165
65
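
One chunk above is 223 characters long even though chunk_size is 200. With a single separator, a line that is already longer than the limit has nothing left to be split on. The sketch below illustrates that behaviour in plain Python. It is a simplified approximation, not the CharacterTextSplitter source; the function name `split_on_separator` and the sample string are assumptions.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # This piece may itself exceed chunk_size; with a single
            # separator there is no finer way to break it up.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: the middle line is longer than the limit of 20.
sample = "short line\n" + "x" * 30 + "\nanother short line"
chunks = split_on_separator(sample, "\n", chunk_size=20)
print([len(c) for c in chunks])
```

The oversized middle line comes back as a single chunk, which mirrors the "Created a chunk of size 223" warning shown earlier.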


In [14]:
# Let's use the recursive character text splitter to split the text with multiple separators
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n', '\n', ' '],
    chunk_size=200,
    chunk_overlap=0
)

chunks = r_splitter.split_text(text)
len(chunks)

12

In [15]:
for chunk in chunks:
    print(len(chunk))

73
140
193
29
134
183
181
179
147
116
165
65
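
The 223-character line is now broken into pieces of 193 and 29 characters, so every chunk fits the limit. A rough sketch of the idea: split on the first separator, and whenever a piece is still too long, split that piece again with the remaining separators, merging small pieces back together as long as they fit. This is a simplified illustration, not the RecursiveCharacterTextSplitter source; the function name `recursive_split` and the sample text are assumptions.

```python
def recursive_split(text, separators, chunk_size):
    """Split with the first separator; re-split oversized pieces with the rest."""
    if len(text) <= chunk_size or not separators:
        return [text]
    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush the chunk built so far, then fall back to finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Merge small pieces so that splitting on ' ' does not yield single words.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample text with a paragraph break, a line break, and spaces.
sample = "alpha beta\ngamma delta epsilon zeta\n\neta theta"
result = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=12)
print(result)
```

Ordering the separators from coarse to fine keeps paragraphs and lines intact where possible and only falls back to word-level splits for text that would otherwise exceed the limit.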


In [None]:
# After splitting on '\n', one line was still longer than the 200-character limit,
# so the splitter split that line again using the space ' ' separator.
# Splitting on spaces alone would produce one piece per word, so the splitter
# automatically merges the small pieces back together without exceeding the limit.
chunks

['In Silicon Valley, what matters most is staying ahead of the competition.',
 'If you are Apple or Meta and find yourself short on talent or unable to build the technology in-house, the typical approach is to spend big.',
 'That usually means acquiring a successful company, bringing in the team behind the product, and either integrating it into your core offerings or giving the founders enough autonomy to continue',
 'innovating within your brand.',
 'It’s an approach many Silicon Valley companies have embraced—some have made blockbuster acquisitions, while others have gotten burned.',
 'Last week, both Apple and Meta made headlines with news of their interest in acquiring Perplexity AI, a leading AI startup founded by Indian-origin computer scientist Arvind Srinivas.',
 'While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time.',
 'The timing makes it even more 

In [None]:
# What a plain split on a space " " (this cell) and on "\n\n" (next cell) looks like
"""While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time. 
The timing makes it even more interesting, especially in the case of Apple, which typically avoids bringing in outside talent and prefers to build competing technologies in-house.

Bloomberg first reported that Meta approached Perplexity about a potential takeover before the company recently invested $14.3 billion in Scale AI. """.split(" ")

['While',
 'major',
 'Silicon',
 'Valley',
 'players',
 'often',
 'work',
 'behind',
 'the',
 'scenes',
 'to',
 'quietly',
 'pursue',
 'companies',
 'they',
 'are',
 'interested',
 'in,',
 'this',
 'time',
 'two',
 'tech',
 'giants',
 'are',
 'eyeing',
 'Perplexity—at',
 'the',
 'same',
 'time.',
 '\nThe',
 'timing',
 'makes',
 'it',
 'even',
 'more',
 'interesting,',
 'especially',
 'in',
 'the',
 'case',
 'of',
 'Apple,',
 'which',
 'typically',
 'avoids',
 'bringing',
 'in',
 'outside',
 'talent',
 'and',
 'prefers',
 'to',
 'build',
 'competing',
 'technologies',
 'in-house.\n\nBloomberg',
 'first',
 'reported',
 'that',
 'Meta',
 'approached',
 'Perplexity',
 'about',
 'a',
 'potential',
 'takeover',
 'before',
 'the',
 'company',
 'recently',
 'invested',
 '$14.3',
 'billion',
 'in',
 'Scale',
 'AI.',
 '']

In [20]:
"""While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time. 
The timing makes it even more interesting, especially in the case of Apple, which typically avoids bringing in outside talent and prefers to build competing technologies in-house.

Bloomberg first reported that Meta approached Perplexity about a potential takeover before the company recently invested $14.3 billion in Scale AI. """.split("\n\n")

['While major Silicon Valley players often work behind the scenes to quietly pursue companies they are interested in, this time two tech giants are eyeing Perplexity—at the same time. \nThe timing makes it even more interesting, especially in the case of Apple, which typically avoids bringing in outside talent and prefers to build competing technologies in-house.',
 'Bloomberg first reported that Meta approached Perplexity about a potential takeover before the company recently invested $14.3 billion in Scale AI. ']