# Document Loaders and Splitters in LangChain

### UnstructuredURLLoader
LangChain's UnstructuredURLLoader uses the unstructured Python library internally to load content from URLs.

In [1]:
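# Note: in newer LangChain releases this loader is imported from
# langchain_community.document_loaders instead.
from langchain.document_loaders import UnstructuredURLLoader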

loader = UnstructuredURLLoader(
    urls = [
        "https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html",
        "https://www.moneycontrol.com/news/business/markets/market-corrects-post-rbi-ups-inflation-forecast-icrr-bet-on-these-top-10-rate-sensitive-stocks-ideas-11142611.html"
    ]
)

In [2]:
data = loader.load()
len(data)

2

In [3]:
text = data[0].page_content + " " + data[1].page_content
text

'English\n\nHindi\n\nGujarati\n\nSpecials\n\nHello, Login\n\nHello, Login\n\nLog-inor Sign-Up\n\nMy Account\n\nMy Profile\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nMy Profile\n\nMy PRO\n\nMy Portfolio\n\nMy Watchlist\n\nMy Alerts\n\nMy Messages\n\nPrice Alerts\n\nLogout\n\nLoans up to ₹15 LAKHS\n\nFixed Deposits\n\nCredit CardsLifetime Free\n\nCredit Score\n\nChat with Us\n\nDownload App\n\nFollow us on:\n\nGo Ad-Free\n\nMy Alerts\n\n>->MC_ENG_DESKTOP/MC_ENG_NEWS/MC_ENG_BUSINESS_AS/MC_ENG_ROS_NWS_BUS_AS_ATF_728\n\nGo PRO @₹99 PRO\n\nAdvertisement\n\nRemove Ad\n\nBusiness\n\nMarkets\n\nStocks\n\nEconomy\n\nCompanies\n\nTrends\n\nIPO\n\nOpinion\n\nEV Special\n\nHomeNewsBusinessBanksHDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer\n\nTrending Topics\n\nBusiness News LiveBudget 2025 News LiveSat Kartar Shopping Share PriceWorld BankIndian Rupee\n\nHDFC Bank re-appoints Sanmoy Chakrabarti as Chief Risk Officer\n\nChakrabarti has been appoin

In [4]:
data[0].metadata

{'source': 'https://www.moneycontrol.com/news/business/banks/hdfc-bank-re-appoints-sanmoy-chakrabarti-as-chief-risk-officer-11259771.html'}

In [5]:
type(data)

list
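
The concatenation in In [3] indexes the two documents by hand. A minimal sketch of a version that works for any number of URLs is to join page_content across the whole list. The Doc class, the combine_pages helper, and the sample values are hypothetical stand-ins so the snippet runs without network access; in the notebook you would pass data directly.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes used in the notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def combine_pages(docs, sep=" "):
    """Join the text of every loaded document into a single string."""
    return sep.join(doc.page_content for doc in docs)


# Sample data only; these are not real loader results.
sample = [
    Doc("First article text.", {"source": "https://example.com/a"}),
    Doc("Second article text.", {"source": "https://example.com/b"}),
]
text = combine_pages(sample)
print(text)
```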

### Text Splitters
Why do we need text splitters in the first place?

LLMs have token limits, so a large text has to be split into smaller chunks, each of which fits within the limit. LangChain provides several text splitter classes for this.

### RecursiveCharacterTextSplitter
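RecursiveCharacterTextSplitter takes an ordered list of separators, e.g. separators = ["\n\n", "\n", "."]. It splits on the first separator and falls back to the next one only for pieces that are still larger than chunk_size. With length_function = len, chunk_size and chunk_overlap are measured in characters, not tokens.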
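
The idea of working through a list of separators can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores chunk_overlap, drops the separators it splits on, and the function name recursive_split and the sample text are made up for the example.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator first; use finer separators
    only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
result = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=30)
for chunk in result:
    print(len(chunk), repr(chunk))
```

Short paragraphs survive intact, and only the oversized middle paragraph is broken further, at word boundaries.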

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

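r_splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n', '\n', ' '],  # tried in order: paragraphs, lines, words
    chunk_size=200,                  # maximum chunk length, as measured by length_function
    chunk_overlap=20,                # maximum overlap between neighbouring chunks
    length_function=len              # measure length in characters
)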
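
To see what chunk_overlap means, here is a character-level sketch with a sliding window. It is only an illustration under simplifying assumptions: LangChain applies overlap to the pieces produced by the separators rather than to raw characters, and sliding_chunks is a hypothetical helper name.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop early enough that the final window is not just repeated overlap.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_chunks(alphabet, chunk_size=10, chunk_overlap=3)
for window in windows:
    print(window)
```

The repeated characters at each boundary give a downstream consumer some context from the previous chunk.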

In [8]:
chunks = r_splitter.split_text(text)
len(chunks)

155

In [9]:
for chunk in chunks:
    print(len(chunk))

190
186
197
181
173
192
79
189
164
146
198
25
199
195
191
43
57
194
186
98
193
193
196
127
108
195
196
69
159
186
197
121
192
152
130
79
196
163
194
130
197
26
46
198
47
164
191
97
148
70
195
78
196
102
197
188
46
196
189
36
197
195
75
60
196
184
198
65
197
140
130
97
196
196
110
199
47
115
103
192
115
195
154
199
151
156
191
192
147
197
76
195
96
197
32
90
196
58
190
79
123
99
195
56
160
199
42
88
193
198
84
196
66
191
90
197
70
154
198
137
198
149
199
129
98
198
88
191
189
82
95
192
141
191
160
190
62
195
77
60
194
131
190
184
115
193
193
196
127
108
195
196
69
159
67
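
Printing every length is hard to scan. A short summary answers the question that matters here: does every chunk stay within chunk_size? The sketch below uses only the first seven lengths from the output above as sample data; in the notebook, the list would be built as [len(c) for c in chunks].

```python
from statistics import mean

# First seven values from the printed output above (sample data only).
chunk_lengths = [190, 186, 197, 181, 173, 192, 79]
chunk_size = 200  # same limit as passed to the splitter

summary = {
    "count": len(chunk_lengths),
    "min": min(chunk_lengths),
    "max": max(chunk_lengths),
    "mean": round(mean(chunk_lengths), 1),
    "all_within_limit": all(n <= chunk_size for n in chunk_lengths),
}
print(summary)
```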


In [10]:
import pandas as pd

# Save the chunks, one per row, in a single 'text' column.
df = pd.DataFrame({'text': chunks})
df.to_csv('chunks.csv', index=False)