# Data Loading in LangChain

## Loading a Text File

In [1]:
from langchain_community.document_loaders import TextLoader

In [2]:
loader = TextLoader("speech.txt", encoding="utf-8")
loader

<langchain_community.document_loaders.text.TextLoader at 0x108977cb0>

In [5]:
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness be

## Loading a PDF File

In [6]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("syllabus.pdf")
docs = loader.load()
docs

[Document(metadata={'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-01-30T20:27:03+00:00', 'title': 'Ultimate Data Science & GenAI Bootcamp', 'moddate': '2025-01-30T20:26:59+00:00', 'keywords': 'DAGdmhcqnYw,BAEmsmap8Lg,0', 'author': 'monal singh', 'containsaigeneratedcontent': 'Yes', 'source': 'syllabus.pdf', 'total_pages': 34, 'page': 0, 'page_label': '1'}, page_content='MACHINE\nLEARNING\nDEEP\nLEARNING\nPYTHON +\nSTATS\nCOMPUTER VISIONNATURAL LANGUAGE PROCESSING\nGENERATIVE AI\nRETRIEVAL AUGUMENT GENERATION\nVECTOR DB'),
 Document(metadata={'producer': 'Canva', 'creator': 'Canva', 'creationdate': '2025-01-30T20:27:03+00:00', 'title': 'Ultimate Data Science & GenAI Bootcamp', 'moddate': '2025-01-30T20:26:59+00:00', 'keywords': 'DAGdmhcqnYw,BAEmsmap8Lg,0', 'author': 'monal singh', 'containsaigeneratedcontent': 'Yes', 'source': 'syllabus.pdf', 'total_pages': 34, 'page': 1, 'page_label': '2'}, page_content='This course is designed for aspiring data scientists, machine lea
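
Each page of the PDF becomes its own `Document`, and the metadata above carries a zero-based `page` index alongside a printed `page_label`. Below is a minimal sketch of turning that metadata into a short citation string. `page_citation` is a hypothetical helper, not part of LangChain, and `SimpleNamespace` objects stand in for loaded documents so the snippet runs without the PDF.

```python
from types import SimpleNamespace


def page_citation(doc) -> str:
    """Build a 'source, page X of N' label from a document's metadata dict."""
    meta = doc.metadata
    # Prefer the printed page label; fall back to the zero-based index + 1.
    label = meta.get("page_label") or str(meta["page"] + 1)
    return f"{meta['source']}, page {label} of {meta['total_pages']}"


# Stand-ins mirroring a subset of the metadata keys shown in the output above.
pages = [
    SimpleNamespace(metadata={"source": "syllabus.pdf", "total_pages": 34,
                              "page": 0, "page_label": "1"}),
    SimpleNamespace(metadata={"source": "syllabus.pdf", "total_pages": 34,
                              "page": 1, "page_label": "2"}),
]

citations = [page_citation(p) for p in pages]
print(citations)
```

With the real loader output, the same helper could be called as `page_citation(docs[0])`, assuming the metadata keys match those shown above.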

## Loading from a Web Page

In [7]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.wikipedia.org/")
docs = loader.load()
docs

USER_AGENT environment variable not set, consider setting it to identify your requests.
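
The warning above is logged when no `USER_AGENT` environment variable is found. Below is a minimal sketch of one way to avoid it. It assumes the variable is read when the loader module is first imported, which appears to be the case in current versions, so it is set beforehand. The value `my-notebook/0.1` is a placeholder, not something from the original notebook.

```python
import os

# Placeholder identifier (an assumption, not from the original notebook);
# replace it with a string that identifies your own application.
os.environ.setdefault("USER_AGENT", "my-notebook/0.1")

# Import the loader only after the variable is set, for example:
# from langchain_community.document_loaders import WebBaseLoader

user_agent = os.environ["USER_AGENT"]
print(user_agent)
```

In a notebook that has already imported the loader, the kernel would likely need a restart for the setting to take effect.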


[Document(metadata={'source': 'https://www.wikipedia.org/', 'title': 'Wikipedia', 'description': 'Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.', 'language': 'en'}, page_content='\n\n\n\nWikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWikipedia\n\nThe Free Encyclopedia\n\n\n\n\n\n\nEnglish\n6,974,000+ articles\n\n\n\n\n\n日本語\n1,457,000+ 記事\n\n\n\n\n\nРусский\n2\xa0036\xa0000+ статей\n\n\n\n\n\nDeutsch\n3.001.000+ Artikel\n\n\n\n\n\nEspañol\n2.021.000+ artículos\n\n\n\n\n\nFrançais\n2\u202f674\u202f000+ articles\n\n\n\n\n\n中文\n1,470,000+ 条目 / 條目\n\n\n\n\n\nItaliano\n1.910.000+ voci\n\n\n\n\n\nPortuguês\n1.146.000+ artigos\n\n\n\n\n\nPolski\n1\xa0652\xa0000+ haseł\n\n\n\n\n\n\n\n\nSearch Wikipedia\n\n\n\n\nAfrikaans\nالعربية\nAsturianu\nAzərbaycanca\nБългарски\n閩南語 / Bân-lâm-gú\nবাংলা\nБеларуская\nCatalà\nČeština\nCymraeg\nDansk\nDeutsch\nEesti\nΕλληνικά\nEnglish\nEspañol\nEsperanto\nEuskara\nفارس
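
The `page_content` above is dominated by runs of newlines left over from the page's HTML layout. Below is a minimal sketch of one way to tidy such text before further processing, using only the standard library. `collapse_blank_lines` is a hypothetical helper, not part of LangChain, and the sample string is a shortened excerpt of the output shown above.

```python
def collapse_blank_lines(text: str) -> str:
    """Strip each line and drop the empty ones, leaving one item per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Shortened excerpt of the Wikipedia page_content shown above.
sample = (
    "\n\n\n\nWikipedia\n\n\n\n\nThe Free Encyclopedia"
    "\n\n\n\n\n\n\nEnglish\n6,974,000+ articles\n\n\n"
)

cleaned = collapse_blank_lines(sample)
print(cleaned)
```

With a loaded document, this could be applied as `collapse_blank_lines(docs[0].page_content)`.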

In [None]:
# Load only selected HTML elements by passing a SoupStrainer to BeautifulSoup via bs_kwargs

from langchain_community.document_loaders import WebBaseLoader
import bs4

loader = WebBaseLoader(
    web_paths=(
        "https://lilianweng.github.io/posts/2023-06-23-agent/",
    ),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=(
                "post-title",
                "post-content",
                "post-header"
            )
        )
    )
)

# load() returns a list of Document objects, one per URL in web_paths
docs = loader.load()
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake