# Document loaders

Document loaders load data from a source into the standard LangChain `Document` format.

Each document loader takes its own parameters, but all of them are invoked the same way, through the `.load()` method. The first cell below sets up the environment; the sections after it walk through several loaders.

In [1]:
import os
from pprint import pprint

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Read variables from a local .env file into the process environment.
load_dotenv()

# Re-export each key only when it is set: assigning None to os.environ raises TypeError.
for key in ("OPENAI_API_KEY", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT"):
    value = os.getenv(key)
    if value is not None:
        os.environ[key] = value

# LangSmith tracking and tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
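
The shared interface can be pictured with a small stand-in. The sketch below is not LangChain code: `Document` and `SimpleTextLoader` are hypothetical names defined here with the standard library only. It shows the shape the loaders in this notebook follow: configure the loader in its constructor, then call `.load()` to get a list of documents, each with `page_content` and `metadata`.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader that reads one text file into a single Document."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        # The constructor only stores configuration; no file is read yet.
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> list[Document]:
        # Reading happens here, and the result is always a list of documents.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        return [Document(page_content=text, metadata={"source": self.file_path})]


# Write a small sample file so the sketch runs without external data.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "sample.txt")
    Path(sample_path).write_text("Dear Shareholders,\nThank you.", encoding="utf-8")
    sample_docs = SimpleTextLoader(sample_path).load()

print(len(sample_docs))
print(sample_docs[0].page_content)
```

The real loaders differ in what they read and which metadata they attach, but the calling pattern stays the same.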


# Text loader

In [2]:
from langchain_community.document_loaders import TextLoader


In [6]:
# Creating the loader only stores the path; the file is read when .load() is called.
loader = TextLoader("data/loreal_shareholder_2022.txt")
loader

<langchain_community.document_loaders.text.TextLoader at 0x126a5f9b0>

In [7]:
# Returns a list containing a single Document that holds the whole file.
text_documents = loader.load()


In [9]:
text_documents

[Document(metadata={'source': 'data/loreal_shareholder_2022.txt'}, page_content='“Dear Shareholders,\nL’Oréal continues on the path to success with an ever-stronger ambition, while acting with the sense of responsibility of a global leader. Dual financial and social excellence will always be at the heart of our business model.\nWe have set ourselves the ultimate goal of creating value that benefits everyone.\nWe create value for you, our shareholders.\xa0The resilience and outperformance of your Company are the perfect demonstration of its robust, virtuous and value creating business model. The quality of our results puts us in a position to offer a dividend of €6 per share, representing a significant increase of +25%. And the preferential dividend with a 10% loyalty bonus(1), at €6.60, is recognition of your long-term loyalty.\nI also know that you attach just as much importance to the quality of our relationship with you, our shareholders. I am delighted to welcome the more than 30,0

# PDF loader

In [10]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('data/2024_berkshire_hathway_shareholder_letter.pdf')

In [11]:
# Returns one Document per page; the metadata carries the page index and PDF properties.
pdfdocs = loader.load()
pdfdocs

[Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PyPDF', 'creationdate': '2025-02-22T07:14:18-06:00', 'moddate': '2025-02-22T07:14:40-06:00', 'title': 'printmgr file', 'source': 'data/2024_berkshire_hathway_shareholder_letter.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='BERKSHIRE HATHAWAY INC. \nTo the Shareholders of Berkshire Hathaway Inc.: \nThis letter comes to you as part of Berkshire’s annual report. As a public company, we \nare required to periodically tell you many specific facts and figures. \n“Report,” however, implies a greater responsibility. In addition to the mandated data, we \nbelieve we owe you additional commentary about what you own and how we think. Our goal is \nto communicate with you in a manner that we would wish you to use if our positions were \nreversed – that is, if you were Berkshire’s CEO while I and my family were passive investors, \ntrusting you with our savings. \nThis approach leads us to an an
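
The metadata above shows a zero-based `page` index next to a `page_label` string and a `total_pages` count, which is consistent with one document per PDF page. The sketch below uses plain dictionaries shaped like that metadata to show how pages can be picked by index and stitched back into one string. The page texts and the helper names `get_page` and `join_pages` are placeholders invented for this example, not content from the letter or part of LangChain.

```python
# Stand-ins for per-page documents. Only the metadata keys mirror the output
# above; the page texts are placeholders.
pages = [
    {"metadata": {"page": 1, "page_label": "2", "total_pages": 3}, "page_content": "second page"},
    {"metadata": {"page": 0, "page_label": "1", "total_pages": 3}, "page_content": "first page"},
    {"metadata": {"page": 2, "page_label": "3", "total_pages": 3}, "page_content": "third page"},
]


def get_page(page_docs, page_index):
    """Return the text of the page whose zero-based 'page' value matches."""
    for doc in page_docs:
        if doc["metadata"]["page"] == page_index:
            return doc["page_content"]
    raise KeyError(f"no page with index {page_index}")


def join_pages(page_docs, separator="\n\n"):
    """Concatenate page texts in page order, whatever order they arrive in."""
    ordered = sorted(page_docs, key=lambda doc: doc["metadata"]["page"])
    return separator.join(doc["page_content"] for doc in ordered)


second_page = get_page(pages, 1)
full_text = join_pages(pages)
print(second_page)
print(full_text)
```

With real loader output, the same logic would read `doc.metadata["page"]` and `doc.page_content` instead of dictionary keys.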

# arXiv loader

In [12]:
from langchain_community.document_loaders import ArxivLoader

In [13]:
# Query by arXiv identifier and cap the result at one document.
research_paper = ArxivLoader(query="1706.03762", load_max_docs=1).load()

In [14]:
research_paper

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr
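
In the output above, the metadata stores `Authors` as one comma-separated string and `Published` as an ISO-formatted date string. A minimal sketch of turning those two fields into structured values follows. The dictionary is copied from the output shown, and the variable names are illustrative only.

```python
from datetime import date

# Metadata fields copied from the output above (Summary omitted).
arxiv_metadata = {
    "Published": "2023-08-02",
    "Title": "Attention Is All You Need",
    "Authors": (
        "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, "
        "Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin"
    ),
}

# Split the author string into a list of names.
authors = [name.strip() for name in arxiv_metadata["Authors"].split(",")]

# Parse the publication date into a date object for sorting or filtering.
published = date.fromisoformat(arxiv_metadata["Published"])

print(authors[0], len(authors), published.year)
```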

# CSV loader

In [16]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader('data/Financials.csv')

In [17]:
# Each CSV row becomes its own Document, with the row index stored in the metadata.
data = loader.load()
data

[Document(metadata={'source': 'data/Financials.csv', 'row': 0}, page_content='Segment: Government\nCountry: Canada\nProduct: Carretera\nDiscount Band: None\nUnits Sold: $1,618.50\nManufacturing Price: $3.00\nSale Price: $20.00\nGross Sales: $32,370.00\nDiscounts: $-\nSales: $32,370.00\nCOGS: $16,185.00\nProfit: $16,185.00\nDate: 01/01/2014\nMonth Number: 1\nMonth Name: January\nYear: 2014'),
 Document(metadata={'source': 'data/Financials.csv', 'row': 1}, page_content='Segment: Government\nCountry: Germany\nProduct: Carretera\nDiscount Band: None\nUnits Sold: $1,321.00\nManufacturing Price: $3.00\nSale Price: $20.00\nGross Sales: $26,420.00\nDiscounts: $-\nSales: $26,420.00\nCOGS: $13,210.00\nProfit: $13,210.00\nDate: 01/01/2014\nMonth Number: 1\nMonth Name: January\nYear: 2014'),
 Document(metadata={'source': 'data/Financials.csv', 'row': 2}, page_content='Segment: Midmarket\nCountry: France\nProduct: Carretera\nDiscount Band: None\nUnits Sold: $2,178.00\nManufacturing Price: $3.00\nSa
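
The output above has a recognizable shape: one document per CSV row, `column: value` pairs joined by newlines as the content, and the source path plus row index as metadata. The sketch below reproduces that shape with the standard `csv` module. It is an approximation for illustration, not the `CSVLoader` implementation, and `rows_to_documents` is a hypothetical helper. The sample uses three of the columns and two of the rows shown above.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


sample_csv = (
    "Segment,Country,Product\n"
    "Government,Canada,Carretera\n"
    "Government,Germany,Carretera\n"
)
csv_docs = rows_to_documents(sample_csv, "data/Financials.csv")
print(csv_docs[0]["page_content"])
print(csv_docs[1]["metadata"])
```

Keeping one row per document means each record can be embedded and retrieved on its own, at the cost of losing cross-row context.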

# Wikipedia loader

In [18]:
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="Generative AI", load_max_docs=4).load()
len(docs)

4

In [22]:
docs

[Document(metadata={'title': 'Generative artificial intelligence', 'summary': 'Generative artificial intelligence (Generative AI, GenAI, or GAI) is a subset of artificial intelligence that uses generative models to produce text, images, videos, or other forms of data. These models learn the underlying patterns and structures of their training data and use them to produce new data based on the input, which often comes in the form of natural language prompts.  \nImprovements in transformer-based deep neural networks, particularly large language models (LLMs), enabled an AI boom of generative AI systems in the 2020s. These include chatbots such as ChatGPT, Copilot, Gemini, and LLaMA; text-to-image artificial intelligence image generation systems such as Stable Diffusion, Midjourney, and DALL-E; and text-to-video AI generators such as Sora. Companies such as OpenAI, Anthropic, Microsoft, Google, and Baidu as well as numerous smaller firms have developed generative AI models.\nGenerative AI