# **Document Loaders** in LangChain

### Loading Text Files

In [2]:
from langchain_community.document_loaders import TextLoader

# Initialize the TextLoader with the path to the text file
loader = TextLoader("./data/example_txt_file.txt")

# Load the text data
txt_data = loader.load()

print(txt_data)

[Document(metadata={'source': './data/example_txt_file.txt'}, page_content="Sri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World Cup in 1996. \nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker in Test cricket history. \nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships. \nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style. \nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's most picturesque cricket venues. \nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest format of the game. \nLasith Malinga, famous for his unique bowling action and lethal yorkers, has been a key player in their T20 success. \nCricket is deeply embedded in Sri Lankan culture, uniting people from all walks of life during major tournaments.")]


In [None]:
# load() returns a list of Documents, so index into it first
txt_data[0].page_content

"Sri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World Cup in 1996. \nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker in Test cricket history. \nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships. \nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style. \nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's most picturesque cricket venues. \nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest format of the game. \nLasith Malinga, famous for his unique bowling action and lethal yorkers, has been a key player in their T20 success. \nCricket is deeply embedded in Sri Lankan culture, uniting people from all walks of life during major tournaments."

In [None]:
# The metadata lives on the individual Document, not on the list
txt_data[0].metadata

{'source': './data/example_txt_file.txt'}
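
`load()` returns a list of `Document` objects even when the loader points at a single file, which is why the fields are read through an index. The sketch below mimics that shape with the standard library only. `SimpleDocument` and `load_text_file` are hypothetical stand-ins for illustration, not LangChain classes, and the real `TextLoader` does more than this.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two fields shown above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and wrap it in a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Write a small throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("first line\nsecond line", encoding="utf-8")

    docs = load_text_file(sample_path)
    first_doc = docs[0]  # index first, then read the fields
    print(len(docs))
    print(first_doc.page_content)
    print(first_doc.metadata["source"] == sample_path)
```

Calling `docs.page_content` on the list itself would raise an `AttributeError`, because a Python list has no such attribute.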

### Loading Text Files from a Directory

In [3]:
# !pip install unstructured -qU

In [5]:
from langchain_community.document_loaders import DirectoryLoader

# Initialize the DirectoryLoader with the path to the directory and a glob pattern for text files
loader = DirectoryLoader("./data/txt_folder", glob="**/*.txt")

# Load the text data from the directory
dataset = loader.load()

for data in dataset:
    print("------------------------")
    print(data.page_content)
    print(data.metadata)

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


------------------------
The AI revolution continues to transform industries and reshape the global economy.

Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-driven diagnostics improving patient outcomes and reducing costs.

Autonomous systems are becoming increasingly prevalent in logistics and transportation, enhancing efficiency and safety.
{'source': 'data\\txt_folder\\ai_news.txt'}
------------------------
The T20 World Cup 2024 is in full swing, bringing excitement and drama to cricket fans worldwide.

India's team, captained by Rohit Sharma, is preparing for a crucial match against Ireland, with standout player Jasprit Bumrah expected to play a pivotal role in their campaign.

The tournament has already seen controversy, particularly concerning the pitch conditions at Nassau County International Cricket Stadium in New York, which came under fire after a low-scoring game between Sri Lanka and South Africa.
{'source': 'data\\txt
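
The `glob="**/*.txt"` argument is a recursive pattern: `**` matches the directory itself and any nested subdirectories. One way to preview which files such a pattern selects is `pathlib`, sketched here against a temporary folder. The folder layout and the `cricket_news.txt` and `notes.md` names are made up for the example, and this only illustrates the pattern, not `DirectoryLoader` internals.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "ai_news.txt").write_text("top-level text file", encoding="utf-8")
    (root / "nested" / "cricket_news.txt").write_text("nested text file", encoding="utf-8")
    (root / "notes.md").write_text("not a .txt file", encoding="utf-8")

    # Collect matches relative to the root, with forward slashes on every OS.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)
```

Only the two `.txt` files are selected, including the one in the nested folder. The `.md` file is skipped.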

### Loading PDF Files

In [6]:
# !pip install pypdf -qU

In [7]:
from langchain_community.document_loaders import PyPDFLoader

# Initialize the PyPDFLoader with the path to the PDF file
loader = PyPDFLoader("./data/example_pdf_file.pdf")

# Load the PDF data
pdf_data = loader.load()

print(pdf_data)

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2024-06-17T22:17:35+05:30', 'author': 'Dinesh Piyasamara', 'moddate': '2024-06-17T22:17:35+05:30', 'source': './data/example_pdf_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Sri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World \nCup in 1996.  \nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker \nin Test cricket history.  \nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships.  \nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style.  \nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's \nmost picturesque cricket venues.  \nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest forma

In [8]:
pdf_data[0].page_content

"Sri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World \nCup in 1996.  \nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker \nin Test cricket history.  \nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships.  \nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style.  \nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's \nmost picturesque cricket venues.  \nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest format of \nthe game.  \nLasith Malinga, famous for his unique bowling action and lethal yorkers, has been a key player in \ntheir T20 success.  \nCricket is deeply embedded in Sri Lankan culture, uniting people from all walks of life during major \ntournaments."

In [9]:
pdf_data[0].metadata

{'producer': 'Microsoft® Word for Microsoft 365',
 'creator': 'Microsoft® Word for Microsoft 365',
 'creationdate': '2024-06-17T22:17:35+05:30',
 'author': 'Dinesh Piyasamara',
 'moddate': '2024-06-17T22:17:35+05:30',
 'source': './data/example_pdf_file.pdf',
 'total_pages': 1,
 'page': 0,
 'page_label': '1'}

### Loading PDF Files from a Directory

In [10]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Initialize the PyPDFDirectoryLoader with the path to the directory containing PDF files
loader = PyPDFDirectoryLoader("./data/pdf_folder")

# Load the PDF data from the directory
dataset = loader.load()

for data in dataset:
    print("------------------------")
    print(data.page_content)
    print(data.metadata)

------------------------
The AI revolution continues to transform industries and reshape the global economy. 
Significant advancements in artificial intelligence have led to breakthroughs in healthcare, with AI-
driven diagnostics improving patient outcomes and reducing costs. 
Autonomous systems are becoming increasingly prevalent in logistics and transportation, 
enhancing efficiency and safety.
{'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2024-06-17T22:35:41+05:30', 'author': 'Dinesh Piyasamara', 'moddate': '2024-06-17T22:35:41+05:30', 'source': 'data\\pdf_folder\\ai_news.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
------------------------
The T20 World Cup 2024 is in full swing, bringing excitement and drama to cricket fans worldwide. 
India's team, captained by Rohit Sharma, is preparing for a crucial match against Ireland, with 
standout player Jasprit Bumrah expected to play a pivotal role in their c
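
By default each PDF page becomes its own entry, and `metadata['source']` together with the zero-based `metadata['page']` identifies where it came from. When whole-file text is needed, pages can be regrouped by source. The sketch below uses plain dictionaries with made-up contents as stand-ins for the loaded documents, and `merge_pages_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def merge_pages_by_source(pages: list[dict]) -> dict[str, str]:
    """Join page texts per source file, ordered by the zero-based page index."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["metadata"]["source"]].append(page)

    merged = {}
    for source, source_pages in grouped.items():
        ordered = sorted(source_pages, key=lambda p: p["metadata"]["page"])
        merged[source] = "\n".join(p["page_content"] for p in ordered)
    return merged


# Made-up records shaped like the printed metadata (source, page).
pages = [
    {"page_content": "report, page 2", "metadata": {"source": "report.pdf", "page": 1}},
    {"page_content": "memo, page 1", "metadata": {"source": "memo.pdf", "page": 0}},
    {"page_content": "report, page 1", "metadata": {"source": "report.pdf", "page": 0}},
]

full_text = merge_pages_by_source(pages)
print(full_text["report.pdf"])
```

With real `Document` objects the same idea applies, reading `doc.metadata` and `doc.page_content` as attributes instead of dictionary keys.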

### Loading CSV Files

In [11]:
from langchain_community.document_loaders import CSVLoader

# Initialize the CSVLoader with the path to the CSV file.
# The file starts with a UTF-8 byte order mark (BOM); decoding with "utf-8-sig"
# keeps the BOM out of the first column name.
loader = CSVLoader("./data/example_csv_file.csv", encoding="utf-8-sig")

# Load the CSV data
csv_data = loader.load()

print(csv_data)

[Document(metadata={'source': './data/example_csv_file.csv', 'row': 0}, page_content='name: kamal\ncity: colombo\nresult: pass'), Document(metadata={'source': './data/example_csv_file.csv', 'row': 1}, page_content='name: saman\ncity: kandy\nresult: pass'), Document(metadata={'source': './data/example_csv_file.csv', 'row': 2}, page_content='name: pawan\ncity: jaffna\nresult: fail'), Document(metadata={'source': './data/example_csv_file.csv', 'row': 3}, page_content='name: nimal\ncity: puttalam\nresult: fail'), Document(metadata={'source': './data/example_csv_file.csv', 'row': 4}, page_content='name: sunil\ncity: anuradapura\nresult: pass')]


In [12]:
csv_data[0]

Document(metadata={'source': './data/example_csv_file.csv', 'row': 0}, page_content='name: kamal\ncity: colombo\nresult: pass')

In [13]:
csv_data[0].page_content

'name: kamal\ncity: colombo\nresult: pass'
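
CSV files saved by spreadsheet tools often start with a UTF-8 byte order mark (BOM). If the file is decoded without accounting for it, the marker ends up attached to the first column name. Python's `utf-8-sig` codec strips it. The sketch below demonstrates this with the standard `csv` module and also formats a row as `key: value` lines, the layout seen in `page_content`. `row_to_text` is a hypothetical helper, and the two sample rows are taken from the data in this section.

```python
import csv
import tempfile
from pathlib import Path


def row_to_text(row: dict) -> str:
    """Format one CSV row as newline-separated 'key: value' pairs."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "sample.csv"
    # Write the file with a leading BOM, as some spreadsheet exports do.
    csv_path.write_text(
        "name,city,result\nkamal,colombo,pass\nsaman,kandy,pass\n",
        encoding="utf-8-sig",
    )

    # Decoding as plain UTF-8 keeps the BOM on the first header.
    with open(csv_path, newline="", encoding="utf-8") as f:
        raw_header = next(csv.reader(f))

    # Decoding as utf-8-sig removes it.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))

print(repr(raw_header[0]))
print(row_to_text(rows[0]))
```

`CSVLoader` accepts an `encoding` argument, so `encoding="utf-8-sig"` is one way to apply the same idea when loading.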

### Loading HTML Files

In [14]:
from langchain_community.document_loaders import BSHTMLLoader

# Initialize the BSHTMLLoader with the path to the HTML file
loader = BSHTMLLoader("./data/example_html_file.html")

# Load the HTML data
html_data = loader.load()

print(html_data)

[Document(metadata={'source': './data/example_html_file.html', 'title': 'Sri Lanka Cricket'}, page_content="\n\n\n\nSri Lanka Cricket\n\n\n\n\nSri Lanka Cricket\nSri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World Cup in 1996.\nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker in Test cricket history.\nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships.\nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style.\nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's most picturesque cricket venues.\nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest format of the game.\nLasith Malinga, famous for his unique bowling action and lethal yorkers, has been a key player in their T20 success.\nCricket is deeply embedded in Sri Lankan culture, uniting

In [15]:
html_data[0].page_content

"\n\n\n\nSri Lanka Cricket\n\n\n\n\nSri Lanka Cricket\nSri Lanka's national cricket team achieved a historic milestone by winning the ICC Cricket World Cup in 1996.\nThe team is known for producing cricket legends like Muttiah Muralitharan, the highest wicket-taker in Test cricket history.\nKumar Sangakkara and Mahela Jayawardene are celebrated for their prolific batting partnerships.\nSanath Jayasuriya revolutionized one-day cricket with his explosive batting style.\nThe Galle International Stadium, with its stunning backdrop of the Galle Fort, is one of the world's most picturesque cricket venues.\nSri Lanka won the ICC T20 World Cup in 2014, demonstrating their prowess in the shortest format of the game.\nLasith Malinga, famous for his unique bowling action and lethal yorkers, has been a key player in their T20 success.\nCricket is deeply embedded in Sri Lankan culture, uniting people from all walks of life during major tournaments.\n\n\n\n"

In [16]:
html_data[0].metadata

{'source': './data/example_html_file.html', 'title': 'Sri Lanka Cricket'}
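
The extracted `page_content` keeps the blank lines left behind by the HTML markup. If that padding is unwanted, a small post-processing step can drop the empty lines. `tidy_text` below is a hypothetical helper, and the sample string is shortened from the output above.

```python
def tidy_text(text: str) -> str:
    """Strip each line and drop the empty ones."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = (
    "\n\n\n\nSri Lanka Cricket\n\n\n\n\nSri Lanka Cricket\n"
    "Sanath Jayasuriya revolutionized one-day cricket with his explosive batting style.\n"
    "\n\n\n"
)

clean = tidy_text(raw)
print(clean)
```

The title still appears twice, once from the `<title>` element and once from the page heading. Deduplicating it is a separate decision that depends on how the text will be used.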