# Document Loaders

LangChain comes with built-in loaders that read files into its own `Document` objects.

More information on loaders: https://python.langchain.com/docs/modules/data_connection/document_loaders/

## CSV

In [2]:
from langchain.document_loaders import CSVLoader

In [6]:
loader = CSVLoader('penguins.csv')
# Loading is lazy: creating the loader reads nothing, so we need to call .load() to actually load the file

In [7]:
data = loader.load()

The file is typically loaded as a list of `Document` objects, here one per CSV row.

In [9]:
print(data)

[Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE', metadata={'source': 'penguins.csv', 'row': 0}), Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.5\nbill_depth_mm: 17.4\nflipper_length_mm: 186\nbody_mass_g: 3800\nsex: FEMALE', metadata={'source': 'penguins.csv', 'row': 1}), Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 40.3\nbill_depth_mm: 18\nflipper_length_mm: 195\nbody_mass_g: 3250\nsex: FEMALE', metadata={'source': 'penguins.csv', 'row': 2}), Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: \nbill_depth_mm: \nflipper_length_mm: \nbody_mass_g: \nsex: ', metadata={'source': 'penguins.csv', 'row': 3}), Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 36.7\nbill_depth_mm: 19.3\nflipper_length_mm: 193\nbody_mass_g: 3450\nsex: FEMALE', metadata={'source': 'penguins.csv', '

In [10]:
type(data)

list

In [11]:
data[0]

Document(page_content='species: Adelie\nisland: Torgersen\nbill_length_mm: 39.1\nbill_depth_mm: 18.7\nflipper_length_mm: 181\nbody_mass_g: 3750\nsex: MALE', metadata={'source': 'penguins.csv', 'row': 0})

In [13]:
print(data[0].page_content)

species: Adelie
island: Torgersen
bill_length_mm: 39.1
bill_depth_mm: 18.7
flipper_length_mm: 181
body_mass_g: 3750
sex: MALE


In [20]:
print(data[0].metadata)

{'source': 'penguins.csv', 'row': 0}
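
To make the shape of the output above concrete, here is a minimal standard-library sketch of the same row-to-document mapping: each row becomes one block of `column: value` lines plus a metadata dict holding the source and row index. It is not LangChain's implementation; `SimpleDocument` and `rows_to_documents` are hypothetical names used only for illustration, and the sample data is a shortened stand-in for `penguins.csv`.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_text: str, source: str) -> list:
    """Turn each CSV row into one document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(SimpleDocument(content, {"source": source, "row": i}))
    return docs


# Shortened sample: the second row has a missing value, like row 3 above.
sample = "species,island,bill_length_mm\nAdelie,Torgersen,39.1\nAdelie,Torgersen,\n"
docs = rows_to_documents(sample, source="penguins.csv")
print(docs[0].page_content)
print(docs[1].metadata)
```

Note that a missing value still produces its `column: ` line with nothing after the colon, which matches what the loader output shows for the incomplete penguin row.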


## HTML

In [1]:
from langchain.document_loaders import BSHTMLLoader

In [14]:
loader = BSHTMLLoader('some_website.html')

In [15]:
data = loader.load()

In [16]:
print(data[0].page_content)

Heading 1
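
`BSHTMLLoader` relies on BeautifulSoup to reduce the page to its text. As a rough illustration of that idea only, the sketch below does a similar extraction with the standard library's `html.parser`. It is not what the loader runs internally; `TextExtractor` and `html_to_text` are hypothetical names, the sample HTML is made up, and skipping `<script>` and `<style>` content is a choice of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Made-up sample page; the real 'some_website.html' is not shown in this notebook.
sample_html = "<html><body><h1>Heading 1</h1><style>h1 {color: red}</style></body></html>"
page_text = html_to_text(sample_html)
print(page_text)
```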


## PDF

There is no guarantee about how Python will actually read this kind of file. Expect errors and badly formatted sentences, because it is not always clear how PDF loading libraries interpret the PDF file structure.

In [18]:
from langchain.document_loaders import PyPDFLoader

In [20]:
loader = PyPDFLoader('SomeReport.pdf')

In [21]:
pages = loader.load_and_split()

In [22]:
type(pages)

list

In [23]:
pages[0]

Document(page_content='This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF.\nThis\nis\nthe\nthird\nline\nin\nthe\nPDF.', metadata={'source': 'SomeReport.pdf', 'page': 0})

In [24]:
print(pages[0].page_content)

This
is
the
first
line
PDF.
This
is
the
second
line
in
the
PDF.
This
is
the
third
line
in
the
PDF.
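
Output like the above, with one word per line, can be tidied after loading. The sketch below shows one possible cleanup, assuming the line breaks carry no meaning and can be collapsed into single spaces. `normalize_whitespace` is a hypothetical helper, not part of LangChain, and the sample string is copied from the start of the page content shown above.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (newlines included) into one space."""
    return " ".join(text.split())


# Start of the page content shown above, one word per line.
raw = "This\nis\nthe\nfirst\nline\nPDF.\nThis\nis\nthe\nsecond\nline\nin\nthe\nPDF."
cleaned = normalize_whitespace(raw)
print(cleaned)
```

This discards paragraph breaks as well, so it only suits cases where the original layout does not need to be preserved.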
