## Data Ingestion - Document Loaders

#### Official docs link: https://python.langchain.com/v0.2/docs/integrations/document_loaders/

In [5]:
## CSV Loader

from langchain_community.document_loaders import CSVLoader

# Build the loader first, then call load() to get a list of Document objects
loader = CSVLoader("planets.csv")
data = loader.load()
print(data)

[Document(metadata={'source': 'planets.csv', 'row': 0}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 269.3\nmass: 7.1\ndistance: 77.4\nyear: 2006'), Document(metadata={'source': 'planets.csv', 'row': 1}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 874.774\nmass: 2.21\ndistance: 56.95\nyear: 2008'), Document(metadata={'source': 'planets.csv', 'row': 2}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 763.0\nmass: 2.6\ndistance: 19.84\nyear: 2011'), Document(metadata={'source': 'planets.csv', 'row': 3}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 326.03\nmass: 19.4\ndistance: 110.62\nyear: 2007'), Document(metadata={'source': 'planets.csv', 'row': 4}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 516.22\nmass: 10.5\ndistance: 119.47\nyear: 2009'), Document(metadata={'source': 'planets.csv', 'row': 5}, page_content='method: Radial Velocity\nnumber: 1\norbital_period: 185.84\nmass: 4.8\nd
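The output above suggests how each CSV row becomes one document: the columns are rendered as `column: value` lines in `page_content`, and `metadata` records the source file and the row index. Below is a minimal sketch of that mapping using only the standard library. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins for illustration, not LangChain's implementation, and the header row is an assumption inferred from the keys in the output.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_file, source):
    """Turn each CSV row into one document made of 'column: value' lines."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(csv_file)):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            SimpleDocument(
                page_content=content,
                metadata={"source": source, "row": row_index},
            )
        )
    return documents


# Two rows copied from the output above, inlined so the sketch runs without planets.csv.
# The header row is assumed from the keys shown in page_content.
sample_csv = io.StringIO(
    "method,number,orbital_period,mass,distance,year\n"
    "Radial Velocity,1,269.3,7.1,77.4,2006\n"
    "Radial Velocity,1,874.774,2.21,56.95,2008\n"
)
sample_docs = rows_to_documents(sample_csv, source="planets.csv")
print(sample_docs[0].page_content)
print(sample_docs[1].metadata)
```

Keeping the row index in `metadata` makes it possible to trace a retrieved chunk back to the exact CSV row it came from.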

In [6]:
## Text Loader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("rag.txt")
data = loader.load()
print(data)

[Document(metadata={'source': 'rag.txt'}, page_content="Load: First we need to load our data. This is done with Document Loaders.\nSplit: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.\nStore: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.\nRetrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.\nGenerate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data")]


In [7]:
## PDF Loader

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("RAG.pdf")

data = loader.load()
print(data)

[Document(metadata={'source': 'RAG.pdf', 'page': 0}, page_content='Adaptive-RAG: Learning to Adapt Retrieval-Augmented\nLarge Language Models through Question Complexity\nSoyeong Jeong1Jinheon Baek2Sukmin Cho1Sung Ju Hwang1,2Jong C. Park1*\nSchool of Computing1Graduate School of AI2\nKorea Advanced Institute of Science and Technology1,2\n{starsuzi,jinheon.baek,nelllpic,sjhwang82,jongpark}@kaist.ac.kr\nAbstract\nRetrieval-Augmented Large Language Models\n(LLMs), which incorporate the non-parametric\nknowledge from external knowledge bases into\nLLMs, have emerged as a promising approach\nto enhancing response accuracy in several tasks,\nsuch as Question-Answering (QA). However,\neven though there are various approaches deal-\ning with queries of different complexities, they\neither handle simple queries with unnecessary\ncomputational overhead or fail to adequately\naddress complex multi-step queries; yet, not\nall user requests fall into only one of the sim-\nple or complex categories.

In [9]:
type(data)

list

In [10]:
type(data[0])

langchain_core.documents.base.Document
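Since `load()` returns a plain Python list of `Document` objects, ordinary list operations apply to it. The sketch below assumes only the `page_content` and `metadata` attributes visible in the outputs above. `PageDoc` is a hypothetical stand-in so the example runs without `RAG.pdf`, and the page texts are placeholders rather than real content from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in exposing the same two attributes as a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Placeholder pages mimicking the one-document-per-page shape of the PDF output.
pages = [
    PageDoc("placeholder text for page one", {"source": "RAG.pdf", "page": 0}),
    PageDoc("placeholder text for page two", {"source": "RAG.pdf", "page": 1}),
]

# Look up a single page through its metadata.
first_page = next(doc for doc in pages if doc.metadata["page"] == 0)

# Join all pages into one string, for example before splitting into chunks.
full_text = "\n".join(doc.page_content for doc in pages)

# Character count per page, keyed by page number.
lengths = {doc.metadata["page"]: len(doc.page_content) for doc in pages}
print(lengths)
```

The same pattern should carry over to the real `data` list, as long as the loader in use populates a `page` key in `metadata` (the PDF output above does; the text loader output does not).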

In [13]:
## Web-Based Loader

from langchain_community.document_loaders import WebBaseLoader
import bs4

# Parse only the elements with these CSS classes instead of the whole page
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)
data = loader.load()
data

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake

In [17]:
## Arxiv Loader

from langchain_community.document_loaders import ArxivLoader

# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="1706.03762",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)

data = loader.load()
data

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

In [19]:
## Wikipedia Loader

from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="India", load_max_docs=2).load()
len(docs)  # result is discarded here; only the print() output below is shown
print(docs)

[Document(metadata={'title': 'India', 'summary': "India, officially the Republic of India (ISO: Bhārat Gaṇarājya), is a country in South Asia.  It is the seventh-largest country by area; the most populous country with effect from June 2023; and from the time of its independence in 1947, the world's most populous democracy. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.\nModern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.\nTheir long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic dive

In [20]:
docs

[Document(metadata={'title': 'India', 'summary': "India, officially the Republic of India (ISO: Bhārat Gaṇarājya), is a country in South Asia.  It is the seventh-largest country by area; the most populous country with effect from June 2023; and from the time of its independence in 1947, the world's most populous democracy. Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest, and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east. In the Indian Ocean, India is in the vicinity of Sri Lanka and the Maldives; its Andaman and Nicobar Islands share a maritime border with Thailand, Myanmar, and Indonesia.\nModern humans arrived on the Indian subcontinent from Africa no later than 55,000 years ago.\nTheir long occupation, initially in varying forms of isolation as hunter-gatherers, has made the region highly diverse, second only to Africa in human genetic dive