<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# Load Documents Using LangChain for Different Sources 


Estimated time needed: **20** minutes


Imagine you are working as a data scientist at a consulting firm, and you've been tasked with analyzing documents from multiple clients. Each client provides their data in different formats: some in PDFs, others in Word documents, CSV files, or even HTML webpages. Manually loading and parsing each document type is not only time-consuming but also prone to errors. Your goal is to streamline this process, making it efficient and error-free.

To achieve this, you'll use LangChain’s powerful document loaders. These loaders allow you to read and convert various file formats into a unified document structure that can be easily processed. For example, you'll load client policy documents from text files, financial reports from PDFs, marketing strategies from Word documents, and product reviews from JSON files. By the end of this lab, you will have a robust pipeline that can handle any new file formats clients might send, saving you valuable time and effort.


<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Hvf-jk8b5Fs-E_E4AJyEow/loader.png" width="50%" alt="indexing"/>


In this lab, you will explore how to use various loaders provided by LangChain to load and process data from different file formats. These loaders simplify the task of reading and converting files into a document format that can be processed downstream. By the end of this lab, you will be able to efficiently load text, PDF, Markdown, JSON, CSV, DOCX, and other file formats into a unified format, allowing for seamless data analysis and manipulation for LLM applications.

(Note: This lab introduces only a few commonly used document loaders. LangChain provides many more loaders for other document formats [here](https://python.langchain.com/v0.2/docs/integrations/document_loaders/).)


## __Table of Contents__

<ol>
    <li><a href="#Objectives">Objectives</a></li>
    <li>
        <a href="#Setup">Setup</a>
        <ol>
            <li><a href="#Installing-required-libraries">Installing required libraries</a></li>
            <li><a href="#Importing-Required-Libraries">Importing required libraries</a></li>
        </ol>
    </li>
    <li><a href="#Load-from-TXT-files">Load from TXT files</a></li>
    <li><a href="#Load-from-PDF-files">Load from PDF files</a></li>
    <li><a href="#Load-from-Markdown-files">Load from Markdown files</a></li>
    <li><a href="#Load-from-JSON-files">Load from JSON files</a></li>
    <li><a href="#Load-from-CSV-files">Load from CSV files</a></li>
    <li><a href="#Load-from-URL/Website-files">Load from URL/Website files</a></li>
    <li><a href="#Load-from-WORD-files">Load from WORD files</a></li>
    <li><a href="#Load-from-Unstructured-Files">Load from Unstructured Files</a></li>
</ol>

<a href="#Exercises">Exercises</a>
<ol>
    <li><a href="#Exercise-1---Try-to-use-other-PDF-loaders">Exercise 1 - Try to use other PDF loaders</a></li>
    <li><a href="#Exercise-2---Load-from-Arxiv">Exercise 2 - Load from Arxiv</a></li>
</ol>


## Objectives

After completing this lab you will be able to:

 - Understand how to use `TextLoader` to load text files.
 - Learn how to load PDFs using `PyPDFLoader` and `PyMuPDFLoader`.
 - Use `UnstructuredMarkdownLoader` to load Markdown files.
 - Load JSON files with `JSONLoader` using jq schemas.
 - Process CSV files with `CSVLoader` and `UnstructuredCSVLoader`.
 - Load Webpage content using `WebBaseLoader`.
 - Load Word documents using `Docx2txtLoader`.
 - Utilize `UnstructuredFileLoader` for various file types.


----


## Setup


### Installing required libraries

The following required libraries are __not__ preinstalled in the Skills Network Labs environment. __You must run the following cell__ to install them:

**Note:** The versions are pinned so that the lab keeps working even if newer releases of these libraries change their behavior. We recommend pinning versions in your own projects as well.

Installation takes approximately 1 minute.

Because `%%capture` suppresses the cell's output, you won't see any installation progress. When the installation completes, an execution number appears beside the cell.


In [24]:
%%capture
# After executing this cell, please RESTART the kernel and run all the cells.
!pip install --user "langchain-community==0.2.1"
!pip install --user "pypdf==4.2.0"
!pip install --user "PyMuPDF==1.24.5"
!pip install --user "unstructured==0.14.8"
!pip install --user "markdown==3.6"
!pip install --user  "jq==1.7.0"
!pip install --user "pandas==2.2.2"
!pip install --user "docx2txt==0.8"
!pip install --user "requests==2.32.3"
!pip install --user "beautifulsoup4==4.12.3"
!pip install --user "nltk==3.8.0"

After you install the libraries, restart your kernel. You can do that by clicking the **Restart the kernel** icon.

<p style="text-align:left">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/1_Bd_EvpEzLH9BbxRXXUGQ/screenshot-to-replace.png" width="50%"/>
</p>
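
After restarting the kernel, you may want to confirm that the installed versions match the pins. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_pins` are hypothetical, and `PINNED` lists just a subset of the packages from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins from the install cell (shortened for brevity).
PINNED = {
    "langchain-community": "0.2.1",
    "pypdf": "4.2.0",
    "unstructured": "0.14.8",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, expected in pins.items():
        found = installed_version(name)
        report[name] = (expected, found, found == expected)
    return report


for name, (expected, found, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {found} -> {status}")
```

A `MISMATCH` line with `found None` means the package is not visible to the current kernel, which usually indicates that the kernel has not been restarted yet.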



### Importing Required Libraries

_We recommend you import all required libraries in one place (here):_


In [2]:
# You can also use this section to suppress warnings generated by your code:

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

from pprint import pprint
import json
from pathlib import Path
import nltk
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.document_loaders import JSONLoader
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders.csv_loader import UnstructuredCSVLoader
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import UnstructuredFileLoader

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

### Load from TXT files


`TextLoader` loads plain-text files.

It is the simplest loader: it reads a file as text and places all of the content into a single document.
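
To make "a single document" concrete, here is a conceptual sketch of that behavior using only the standard library. It is not LangChain's implementation: `SimpleDocument` and `load_text_file` are hypothetical names, and the sample file is created on the fly for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """A stand-in for a document: the text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path, encoding="utf-8"):
    """Read the whole file and wrap it in a one-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Write a small throwaway file, then load it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("1. Code of Conduct\n\nBe kind.", encoding="utf-8")
    docs = load_text_file(sample)

print(len(docs))
print(docs[0].page_content)
```

The key point is that the result is a list, even for one file, so downstream code can treat every loader's output the same way.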


We have prepared a .txt file for you to load. First, we need to download it from a remote source.


In [3]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt"

--2025-05-22 16:38:33--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Ec5f3KYU1CpbKRp1whFLZw/new-Policies.txt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6363 (6.2K) [text/plain]
Saving to: ‘new-Policies.txt’


2025-05-22 16:38:33 (924 MB/s) - ‘new-Policies.txt’ saved [6363/6363]



Next, we will use the `TextLoader` class to load the file.


In [4]:
loader = TextLoader("new-Policies.txt")
loader

<langchain_community.document_loaders.text.TextLoader at 0x7f474e210f50>

Here, we use the `load` method to load the data as documents.


In [5]:
data = loader.load()

Let's display the loaded data.

`data` is a list containing a single `Document` object, which has `page_content` and `metadata` attributes.


In [6]:
data

[Document(metadata={'source': 'new-Policies.txt'}, page_content="1. Code of Conduct\n\nOur Code of Conduct establishes the core values and ethical standards that all members of our organization must adhere to. We are committed to fostering a workplace characterized by integrity, respect, and accountability.\n\nIntegrity: We commit to the highest ethical standards by being honest and transparent in all our dealings, whether with colleagues, clients, or the community. We protect sensitive information and avoid conflicts of interest.\n\nRespect: We value diversity and every individual's contribution. Discrimination, harassment, or any form of disrespect is not tolerated. We promote an inclusive environment where differences are respected, and everyone is treated with dignity.\n\nAccountability: We are responsible for our actions and decisions, complying with all relevant laws and regulations. We aim for continuous improvement and report any breaches of this code, supporting investigations

We can also use the `pprint` function to print the first 1000 characters of the `page_content` here.


In [7]:
pprint(data[0].page_content[:1000])

('1. Code of Conduct\n'
 '\n'
 'Our Code of Conduct establishes the core values and ethical standards that '
 'all members of our organization must adhere to. We are committed to '
 'fostering a workplace characterized by integrity, respect, and '
 'accountability.\n'
 '\n'
 'Integrity: We commit to the highest ethical standards by being honest and '
 'transparent in all our dealings, whether with colleagues, clients, or the '
 'community. We protect sensitive information and avoid conflicts of '
 'interest.\n'
 '\n'
 "Respect: We value diversity and every individual's contribution. "
 'Discrimination, harassment, or any form of disrespect is not tolerated. We '
 'promote an inclusive environment where differences are respected, and '
 'everyone is treated with dignity.\n'
 '\n'
 'Accountability: We are responsible for our actions and decisions, complying '
 'with all relevant laws and regulations. We aim for continuous improvement '
 'and report any breaches of this code, supporting i

### Load from PDF files


Sometimes, we may have files in PDF format that we want to load for processing.

LangChain provides several classes for loading PDFs. Here, we introduce two classes: `PyPDFLoader` and `PyMuPDFLoader`.


#### PyPDFLoader


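Load the PDF with `PyPDFLoader` into a list of documents, where each document contains the page content and metadata that includes the page number. Note that `load_and_split()` also runs a default text splitter, so a very long page may be divided into more than one document; `load()` returns exactly one document per page.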


In [8]:
pdf_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/Q81D33CdRLK6LswuQrANQQ/instructlab.pdf"

loader = PyPDFLoader(pdf_url)

pages = loader.load_and_split()

Display the first page of the PDF.


In [9]:
print(pages[0])

page_content='LAB: L ARGE -SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the drawbacks of cata

Display the first three pages of the PDF.


In [10]:
for p,page in enumerate(pages[0:3]):
    print(f"page number {p+1}")
    print(page)

page number 1
page_content='LAB: L ARGE -SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the dra

#### PyMuPDFLoader


`PyMuPDFLoader` is the fastest of the PDF parsing options. It provides detailed metadata about the PDF and its pages, and returns one document per page.


In [11]:
loader = PyMuPDFLoader(pdf_url)
loader

<langchain_community.document_loaders.pdf.PyMuPDFLoader at 0x7f47618681a0>

In [12]:
data = loader.load()

In [13]:
print(data[0])

page_content='LAB: LARGE-SCALE ALIGNMENT FOR CHATBOTS
MIT-IBM Watson AI Lab and IBM Research
Shivchander Sudalairaj∗
Abhishek Bhandwaldar∗
Aldo Pareja∗
Kai Xu
David D. Cox
Akash Srivastava∗,†
*Equal Contribution, †Corresponding Author
ABSTRACT
This work introduces LAB (Large-scale Alignment for chatBots), a novel method-
ology designed to overcome the scalability challenges in the instruction-tuning
phase of large language model (LLM) training. Leveraging a taxonomy-guided
synthetic data generation process and a multi-phase tuning framework, LAB sig-
nificantly reduces reliance on expensive human annotations and proprietary mod-
els like GPT-4. We demonstrate that LAB-trained models can achieve compet-
itive performance across several benchmarks compared to models trained with
traditional human-annotated or GPT-4 generated synthetic data. Thus offering a
scalable, cost-effective solution for enhancing LLM capabilities and instruction-
following behaviors without the drawbacks of catast

Inspecting the `metadata` attribute (for example, `data[0].metadata`) shows that `PyMuPDFLoader` provides more detailed metadata than `PyPDFLoader`.


### Load from Markdown files


Sometimes, our file source might be in Markdown format.

LangChain provides the `UnstructuredMarkdownLoader` to load content from Markdown files.


In [14]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eMSP5vJjj9yOfAacLZRWsg/markdown-sample.md'

--2025-05-22 16:38:35--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/eMSP5vJjj9yOfAacLZRWsg/markdown-sample.md
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3398 (3.3K) [text/markdown]
Saving to: ‘markdown-sample.md’


2025-05-22 16:38:35 (494 MB/s) - ‘markdown-sample.md’ saved [3398/3398]



In [15]:
markdown_path = "markdown-sample.md"
loader = UnstructuredMarkdownLoader(markdown_path)
loader

<langchain_community.document_loaders.markdown.UnstructuredMarkdownLoader at 0x7f477c3203e0>

In [16]:
data = loader.load()

[nltk_data] Downloading package punkt to /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jupyterlab/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [17]:
data

[Document(metadata={'source': 'markdown-sample.md'}, page_content='An h1 header\n\nParagraphs are separated by a blank line.\n\n2nd paragraph. Italic, bold, and monospace. Itemized lists\nlook like:\n\nthis one\n\nthat one\n\nthe other one\n\nNote that --- not considering the asterisk --- the actual text\ncontent starts at 4-columns in.\n\nBlock quotes are\nwritten like so.\n\nThey can span multiple paragraphs,\nif you like.\n\nUse 3 dashes for an em-dash. Use 2 dashes for ranges (ex., "it\'s all\nin chapters 12--14"). Three dots ... will be converted to an ellipsis.\nUnicode is supported. ☺\n\nAn h2 header\n\nHere\'s a numbered list:\n\nfirst item\n\nsecond item\n\nthird item\n\nNote again how the actual text starts at 4 columns in (4 characters\nfrom the left side). Here\'s a code sample:\n\nAs you probably guessed, indented 4 spaces. By the way, instead of\nindenting the block, you can use delimited blocks, if you like:\n\n~~~\ndefine foobar() {\n    print "Welcome to flavor country

### Load from JSON files



`JSONLoader` uses a specified [jq schema](https://en.wikipedia.org/wiki/Jq_(programming_language)) to parse JSON files. It relies on the `jq` Python package, which we installed earlier.


In [18]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/hAmzVJeOUAMHzmhUHNdAUg/facebook-chat.json'

--2025-05-22 16:38:37--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/hAmzVJeOUAMHzmhUHNdAUg/facebook-chat.json
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2167 (2.1K) [application/json]
Saving to: ‘facebook-chat.json’


2025-05-22 16:38:37 (420 MB/s) - ‘facebook-chat.json’ saved [2167/2167]



First, let's use `pprint` to take a look at the JSON file and its structure. 


In [19]:
file_path='facebook-chat.json'
data = json.loads(Path(file_path).read_text())

In [20]:
pprint(data)

{'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
 'is_still_participant': True,
 'joinable_mode': {'link': '', 'mode': 1},
 'magic_words': [],
 'messages': [{'content': 'Bye!',
               'sender_name': 'User 2',
               'timestamp_ms': 1675597571851},
              {'content': 'Oh no worries! Bye',
               'sender_name': 'User 1',
               'timestamp_ms': 1675597435669},
              {'content': 'No Im sorry it was my mistake, the blue one is not '
                          'for sale',
               'sender_name': 'User 2',
               'timestamp_ms': 1675596277579},
              {'content': 'I thought you were selling the blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595140251},
              {'content': 'Im not interested in this bag. Im interested in the '
                          'blue one!',
               'sender_name': 'User 1',
               'timestamp_ms': 1675595109305},
   

We use `JSONLoader` to load data from the JSON file. However, JSON files can have various attribute-value pairs. If we want to load a specific attribute and its value, we need to set an appropriate `jq schema`.

So for example, if we want to load the `content` from the JSON file, we need to set `jq_schema='.messages[].content'`.


In [21]:
loader = JSONLoader(
    file_path=file_path,
    jq_schema='.messages[].content',
    text_content=False)

data = loader.load()

In [None]:
pprint(data)
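
To make the jq expression concrete: `.messages[].content` reads as "take the `messages` array, iterate over its elements, and extract each element's `content` field". The plain-Python sketch below performs the same selection on a two-message excerpt of the chat file shown earlier. The helper name `select_message_contents` is hypothetical, and this only illustrates the selection; it is not how `JSONLoader` is implemented.

```python
import json

# Two messages excerpted from facebook-chat.json (other fields omitted).
raw = """
{
  "messages": [
    {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
    {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669}
  ]
}
"""


def select_message_contents(chat):
    """Plain-Python counterpart of the jq path .messages[].content."""
    return [message["content"] for message in chat["messages"]]


chat = json.loads(raw)
contents = select_message_contents(chat)
print(contents)
```

Each selected value becomes the text of one document, which is why the choice of jq path determines how many documents the loader returns.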

### Load from CSV files


CSV files are a common format for storing tabular data. The `CSVLoader` provides a convenient way to read and process this data.


In [None]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IygVG_j0M87BM4Z0zFsBMA/mlb-teams-2012.csv'

In [None]:
loader = CSVLoader(file_path='mlb-teams-2012.csv')
data = loader.load()

In [None]:
data

When you load data from a CSV file, the loader typically creates a separate `Document` object for each row of data in the CSV.
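
The row-per-document idea can be sketched with the standard `csv` module. This is an illustration under stated assumptions, not LangChain's code: `rows_to_documents` is a hypothetical helper, the sample data is made up (the lab's CSV has different columns and values), and documents are represented as plain dictionaries.

```python
import csv
import io

# Hypothetical sample data; the lab's CSV has different columns and values.
SAMPLE_CSV = "name,city\nAda,London\nGrace,New York\n"


def rows_to_documents(csv_text, source="sample.csv"):
    """Turn each CSV row into one document with 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


docs = rows_to_documents(SAMPLE_CSV)
for doc in docs:
    print(doc)
```

Because the header names are repeated inside every document, each row stays understandable on its own when it is later retrieved in isolation.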


#### UnstructuredCSVLoader


In contrast to `CSVLoader`, which treats each row as an individual document with headers defining the data, `UnstructuredCSVLoader` considers the entire CSV file as a single unstructured table element. This approach is beneficial when you want to analyze the data as a complete table rather than as separate entries.


In [None]:
loader = UnstructuredCSVLoader(
    file_path="mlb-teams-2012.csv", mode="elements"
)
data = loader.load()

In [None]:
data[0].page_content

In [None]:
print(data[0].metadata["text_as_html"])

### Load from URL/Website files


We usually use the `BeautifulSoup` package to load and parse an HTML or XML file, but used naively it has some limitations.

The following code uses `BeautifulSoup` to parse a web page. Let's see what the limitation is.


In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ibm.com/topics/langchain'
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

The printed output shows that `BeautifulSoup` returns not only the page's text but also a large number of HTML tags and external links, which are unnecessary if we only want the text content of the page.


LangChain's `WebBaseLoader` addresses this limitation.

`WebBaseLoader` is designed to extract all text from HTML webpages and convert it into a document format suitable for further processing.
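
To illustrate what "extracting all text" involves, here is a small offline sketch built on the standard library's `html.parser`. It is a conceptual stand-in, not `WebBaseLoader`'s implementation: `TextExtractor`, `html_to_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring tags, scripts, and styles."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text fragments of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>LangChain</h1>"
    "<p>Loads <a href='https://example.com'>documents</a>.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample_html)
print(text)
```

The markup, link targets, and script contents are dropped, leaving only the text that a downstream LLM application would actually use.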


#### Load from single web page


In [None]:
loader = WebBaseLoader("https://www.ibm.com/topics/langchain")

In [None]:
data = loader.load()

In [None]:
data

#### Load from multiple web pages


You can load multiple webpages simultaneously by passing a list of URLs to the loader. This will return a list of documents corresponding to the order of the URLs provided.


In [None]:
loader = WebBaseLoader(["https://www.ibm.com/topics/langchain", "https://www.redhat.com/en/topics/ai/what-is-instructlab"])
data = loader.load()
data

### Load from WORD files


`Docx2txtLoader` is utilized to convert Word documents into a document format suitable for further processing.


In [None]:
!wget "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/94hiHUNLZdb0bLMkrCh79g/file-sample.docx"

In [None]:
loader = Docx2txtLoader("file-sample.docx")

In [None]:
data = loader.load()

In [None]:
data

### Load from Unstructured Files


Sometimes, we need to load content from various text sources and formats without writing a separate loader for each one. Additionally, when a new file format emerges, we want to save time by not having to write a new loader for it. `UnstructuredFileLoader` addresses this need by supporting the loading of multiple file types. Currently, `UnstructuredFileLoader` can handle text files, PowerPoints, HTML, PDFs, images, and more.


For example, we can load a `.txt` file.


In [None]:
loader = UnstructuredFileLoader("new-Policies.txt")
data = loader.load()
data

We can also load a `.md` file.


In [None]:
loader = UnstructuredFileLoader("markdown-sample.md")
data = loader.load()
data

#### Multiple files with different formats


We can even load a list of files with different formats.


In [None]:
files = ["markdown-sample.md", "new-Policies.txt"]

In [None]:
loader = UnstructuredFileLoader(files)

In [None]:
data = loader.load()

In [None]:
data
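
If you prefer to use format-specific loaders for a mixed list of files, one possible pattern is to choose the loader from the file extension. The sketch below shows the idea with standard-library stand-ins; `LOADERS_BY_SUFFIX`, `load_files`, and the two loader functions are hypothetical, and in practice you would register LangChain loader classes instead.

```python
import json
import tempfile
from pathlib import Path


def load_plain_text(path):
    """Stand-in loader: the whole file becomes one document."""
    text = path.read_text(encoding="utf-8")
    return [{"page_content": text, "metadata": {"source": path.name}}]


def load_json_file(path):
    """Stand-in loader: parse the JSON and store it as one document."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return [{"page_content": json.dumps(data), "metadata": {"source": path.name}}]


# Registry that maps a file extension to the function that loads it.
LOADERS_BY_SUFFIX = {
    ".txt": load_plain_text,
    ".md": load_plain_text,
    ".json": load_json_file,
}


def load_files(paths):
    """Load every path with the loader registered for its extension."""
    documents = []
    for path in map(Path, paths):
        loader = LOADERS_BY_SUFFIX.get(path.suffix.lower())
        if loader is None:
            raise ValueError(f"No loader registered for {path.suffix!r}")
        documents.extend(loader(path))
    return documents


# Create two throwaway files and load them in one call.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "notes.md").write_text("# Title\n\nBody", encoding="utf-8")
    (folder / "policy.txt").write_text("Be kind.", encoding="utf-8")
    docs = load_files([folder / "notes.md", folder / "policy.txt"])

print([doc["metadata"]["source"] for doc in docs])
```

Supporting a new format then only requires adding one entry to the registry, and unsupported files fail loudly instead of being skipped silently.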

# Exercises


### Exercise 1 - Try to use other PDF loaders

There are many other PDF loaders in LangChain, for example, `PyPDFium2Loader`. Can you use this PDF loader to load the PDF and see the difference?


In [9]:
from langchain_community.document_loaders import PyPDFium2Loader

pdf_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/WgM1DaUn2SYPcCg_It57tA/A-Comprehensive-Review-of-Low-Rank-Adaptation-in-Large-Language-Models-for-Efficient-Parameter-Tuning-1.pdf"

loader = PyPDFium2Loader(pdf_url)
data = loader.load()

# Get the sliced content
sliced_content = data[0].page_content[:1000]

# Print the actual length of the sliced content
print(f"The actual length of the printed content is: {len(sliced_content)} characters.")

# Now print the content itself
print(sliced_content)

The actual length of the printed content is: 1000 characters.
A Comprehensive Review of Low-Rank
Adaptation in Large Language Models for
Efficient Parameter Tuning
September 10, 2024
Abstract
Natural Language Processing (NLP) often involves pre-training large
models on extensive datasets and then adapting them for specific tasks
through fine-tuning. However, as these models grow larger, like GPT-3
with 175 billion parameters, fully fine-tuning them becomes computationally expensive. We propose a novel method called LoRA (Low-Rank
Adaptation) that significantly reduces the overhead by freezing the original model weights and only training small rank decomposition matrices.
This leads to up to 10,000 times fewer trainable parameters and reduces
GPU memory usage by three times. LoRA not only maintains but sometimes surpasses fine-tuning performance on models like RoBERTa, DeBERTa, GPT-2, and GPT-3. Unlike other methods, LoRA introduces
no extra latency during inference, making it more 

<details>
    <summary>Click here for Solution</summary>


```python

!pip install pypdfium2

from langchain_community.document_loaders import PyPDFium2Loader

loader = PyPDFium2Loader(pdf_url)

data = loader.load()

```

</details>


### Exercise 2 - Load from Arxiv


Sometimes we want to load a paper directly from arXiv. Can you load this [paper](https://arxiv.org/abs/1605.08386) using `ArxivLoader`?


In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Note: You might need to install 'langchain' if not already installed.
# !pip install --user "langchain==0.2.11"

# Use a raw string so that backslashes (for example, the \b in \begin)
# are kept literally instead of being treated as escape sequences.
latex_text = r"""
    \documentclass{article}
    \begin{document}
    \maketitle
    \section{Introduction}
    Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in various natural language processing tasks, including language translation, text generation, and sentiment analysis.
    \subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.
\subsection{Applications of LLMs}
LLMs have many applications in the industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.
\end{document}
"""

# Define the text splitter for LaTeX
# We define common LaTeX structural elements as separators
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Example chunk size
    chunk_overlap=50, # Example overlap
    separators=[
        "\n\n",       # Double newline (paragraph break)
        "\n",         # Single newline
        " ",          # Space
        "",           # Character by character fallback
        "\\section{", # LaTeX sections
        "\\subsection{", # LaTeX subsections
        "\\subsubsection{", # LaTeX subsubsections
        "\\chapter{", # LaTeX chapters (if using a document class that supports them)
        "\\documentclass{", # Document class
        "\\begin{document}", # Document environment start
        "\\end{document}", # Document environment end
    ],
    length_function=len,
    is_separator_regex=False # Set to True if your separators are regex
)

# Split the LaTeX document
texts = text_splitter.split_text(latex_text)

# Print the resulting chunks
print("--- Split Texts ---")
for i, chunk in enumerate(texts):
    print(f"Chunk {i+1} (Length: {len(chunk)}):\n{chunk}\n{'-'*30}\n")

--- Split Texts ---
Chunk 1 (Length: 443):
\documentclass{article}
    \begin{document}
    \maketitle
    \section{Introduction}
    Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in various natural language processing tasks, including language translation, text generation, and sentiment analysis.
    \subsection{History of LLMs}
------------------------------

Chunk 2 (Length: 410):
\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.
\subsection{Applications of LLMs}
------------------------------

Chunk 3 (Length: 2

<details>
    <summary>Click here for Solution</summary>
    
```python

!pip install arxiv

from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()

print(docs[0].page_content[:1000])

```

</details>


## Authors


[Kang Wang](https://www.linkedin.com/in/kangwang95/)

Kang Wang is a Data Scientist at IBM. He is also a PhD candidate at the University of Waterloo.


### Other Contributors


[Joseph Santarcangelo](https://www.linkedin.com/in/joseph-s-50398b136/)

Joseph has a Ph.D. in Electrical Engineering. His research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his Ph.D.

[Hailey Quach](https://author.skills.network/instructors/hailey_quach)

Hailey is a Data Scientist at IBM. She is also an undergraduate student at Concordia University, Montreal.


© Copyright IBM Corporation. All rights reserved.
