# Common PDF loaders

https://docs.langchain.com/oss/python/integrations/document_loaders#pdfs

- PyPDF
- PyMuPDF
- PDFPlumber

## PyPDF

### Extracting text



In [2]:
! pip install -qU langchain_community pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(
    file_path="baichuan2.pdf",
    mode="page",
)

docs = loader.load()

print(docs[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5 Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA 2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM 2-6B (base)∗ 51.70 47.86 - - - 33.68 32.37 -
Baichuan 1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.20
7B
Baichuan 2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA 2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.59
13B
Baichuan 2-13B-Base 58.10 59.17 61.97 54.33 48

In [14]:

docs[0].page_content[1583:2000]

'Introduction\nThe field of large language models has witnessed\npromising and remarkable progress in recent years.\nThe size of language models has grown from\nmillions of parameters, such as ELMo (Peters\net al., 2018), GPT-1 (Radford et al., 2018), to\nbillions or even trillions of parameters such as GPT-\n3 (Brown et al., 2020), PaLM (Chowdhery et al.,\n2022; Anil et al., 2023) and Switch Transformers\n(Fedus et al., 20'

### Extracting text from images

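Note: extracting text from images currently requires patching the library source. The `extract_images_from_page` method of `PyPDFParser` (in `langchain_community\document_loaders\parsers\pdf.py`) contains a bug: it checks whether the `image_bytes` buffer is empty before any image data has been written to it. As a result, every image is skipped and none is passed to the OCR parser.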

The issue was observed in the following environment:
```text
langchain-community      0.4.1
rapidocr-onnxruntime     1.4.4
numpy                    2.2.6
```
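Before editing anything, it helps to confirm which versions are installed and where the file to patch lives. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `module_file` are illustrative and not part of LangChain. Both helpers return `None` when the package or module is not installed.

```python
import importlib.metadata
import importlib.util


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def module_file(module_name):
    """Return the path of the file that defines a module, or None if not found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or cannot be imported.
        return None
    return spec.origin if spec is not None else None


# Versions of the packages listed in the environment above.
for dist in ("langchain-community", "rapidocr-onnxruntime", "numpy"):
    print(dist, installed_version(dist))

# Location of the parser module that contains the method to patch.
print(module_file("langchain_community.document_loaders.parsers.pdf"))
```

If the printed versions differ from the ones listed above, check whether the buggy check is still present before applying the patch.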


Fix: move `Image.fromarray(np_image).save(image_bytes, format="PNG")` so that it runs before the `if image_bytes.getbuffer().nbytes == 0:` check.

Original code:
```python
                if np_image is not None:
                    image_bytes = io.BytesIO()

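                    # Bug: the buffer is still empty at this point, so the check
                    # is always true and every image is skipped.
                    if image_bytes.getbuffer().nbytes == 0:
                        continue

                    Image.fromarray(np_image).save(image_bytes, format="PNG")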
                    blob = Blob.from_data(image_bytes.getvalue(), mime_type="image/png")
                    image_text = next(self.images_parser.lazy_parse(blob)).page_content
                    images.append(
                        _format_inner_image(blob, image_text, self.images_inner_format)
                    )
```

Fixed code:

```python
                if np_image is not None:
                    image_bytes = io.BytesIO()
                    # Write the PNG data first, then check for an empty buffer.
                    Image.fromarray(np_image).save(image_bytes, format="PNG")
                    if image_bytes.getbuffer().nbytes == 0:
                        continue

                    
                    blob = Blob.from_data(image_bytes.getvalue(), mime_type="image/png")
                    image_text = next(self.images_parser.lazy_parse(blob)).page_content
                    images.append(
                        _format_inner_image(blob, image_text, self.images_inner_format)
                    )

```
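The effect of the ordering can be reproduced in isolation with the standard library alone. The sketch below is not the library code: `is_skipped` is a hypothetical helper, and a plain `write` call stands in for `Image.fromarray(...).save(...)`. It only shows that a freshly created `io.BytesIO` always reports zero bytes, so a check placed before the write can never pass.

```python
import io


def is_skipped(write_first, payload=b"\x89PNG fake image bytes"):
    """Mimic the empty-buffer check with the write placed before or after it."""
    image_bytes = io.BytesIO()
    if write_first:
        # Stands in for Image.fromarray(np_image).save(image_bytes, format="PNG").
        image_bytes.write(payload)
    if image_bytes.getbuffer().nbytes == 0:
        return True  # corresponds to `continue`: the image never reaches OCR
    return False


print(is_skipped(write_first=False))  # check before write -> True
print(is_skipped(write_first=True))   # write before check -> False
```

With the write moved ahead of the check, the check only skips images whose encoded data is genuinely empty.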

In [None]:
! pip install -qU rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyPDFLoader(
    file_path="baichuan2.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=RapidOCRBlobParser()
)

docs = loader.load()

print(docs[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5 Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA 2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM 2-6B (base)∗ 51.70 47.86 - - - 33.68 32.37 -
Baichuan 1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.20
7B
Baichuan 2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA 2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan 1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.59
13B
Baichuan 2-13B-Base 58.10 59.17 61.97 54.33 48

## PyMuPDF

### Extracting text

In [10]:
! pip install -qU langchain_community pymupdf

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    file_path="baichuan2.pdf",
    mode="page",
)

docs = loader.load()

print(docs[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH
GSM8K HumanEval
GPT-4
68.40
83.93
70.33
66.15
63.27
75.12
89.99
69.51
GPT-3.5 Turbo
51.10
68.54
54.06
47.07
46.13
61.59
57.77
52.44
LLaMA-7B
27.10
35.10
26.75
27.81
28.17
32.38
9.78
11.59
LLaMA 2-7B
28.90
45.73
31.38
25.97
26.53
39.16
16.22
12.80
MPT-7B
27.15
27.93
26.00
26.54
24.83
35.20
8.64
14.02
Falcon-7B
24.23
26.03
25.66
24.24
24.10
28.77
5.46
-
ChatGLM 2-6B (base)∗
51.70
47.86
-
-
-
33.68
32.37
-
Baichuan 1-7B
42.80
42.30
44.02
36.34
34.44
32.48
9.17
9.20
7B
Baichuan 2-7B-Base
54.00
54.16
57.07
47.47
42.73
41.56
24.49
18.29
LLaMA-13B
28.50
46.30
31.15
28.23
28.22
37.89
20.55
15.24
LLaMA 2-13B
35.80
55.09
37.99
30.83
32.29
46.98
28.89
15.24
Vicuna-13B
32.80
52.00
36.28
30.11
31.55
43.04
28.13
16.46
Chinese-Alpaca-Plus-13B
38.80
43.90
33.43
34.78
35.46
28.94
11.98
16.46
XVERSE-13B
53.70
55.21
58.44
44.69
42.54
38.06
18.20
15.85
Baichuan 1-13B-Base
52.40
51.60
55.30
49.69
43.20
43.01
26.76
11.59
13B
Baichuan 2-13B-Base
58.10
59.17


In [15]:
docs[0].page_content[1583:2000]

'Introduction\nThe field of large language models has witnessed\npromising and remarkable progress in recent years.\nThe size of language models has grown from\nmillions of parameters, such as ELMo (Peters\net al., 2018), GPT-1 (Radford et al., 2018), to\nbillions or even trillions of parameters such as GPT-\n3 (Brown et al., 2020), PaLM (Chowdhery et al.,\n2022; Anil et al., 2023) and Switch Transformers\n(Fedus et al., 20'

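The output shows that PyMuPDF handles this table less well than PyPDF: every cell ends up on its own line, so the row structure is lost.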
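When the one-cell-per-line output has to be used anyway, the rows can be partly recovered with a heuristic. The sketch below assumes that a cell is either a plain number or a lone `-`, and that any other line starts a new row; `regroup_rows` is a hypothetical helper, not part of PyMuPDF or LangChain. Header fragments and group labels such as `7B` come out as rows of their own.

```python
import re

# A cell is a plain number such as 68.40 or 48, or a lone dash for a missing score.
_CELL = re.compile(r"^(?:-|\d+(?:\.\d+)?)$")


def regroup_rows(text):
    """Join one-cell-per-line output back into one line per table row."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if _CELL.match(line) and rows:
            rows[-1].append(line)  # numeric cell: belongs to the current row
        else:
            rows.append([line])    # anything else starts a new row
    return [" ".join(row) for row in rows]


sample = "GPT-4\n68.40\n83.93\nFalcon-7B\n24.23\n-\n7B\nBaichuan 2-7B-Base\n54.00"
for row in regroup_rows(sample):
    print(row)
```

On real pages the same call would be `regroup_rows(docs[2].page_content)`. The heuristic breaks down if a row label is itself a bare number.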

### Extracting text from images

PyMuPDF has the same problem, and the source needs a similar change.

In the `_extract_images_from_page` method of `PyMuPDFParser` (in `langchain_community\document_loaders\parsers\pdf.py`), move `numpy.save(image_bytes, image)` so that it runs before the `if image_bytes.getbuffer().nbytes == 0:` check.


Before:

```python
            if self.images_parser:
                xref = img[0]
                pix = pymupdf.Pixmap(doc, xref)
                image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, -1
                )
                image_bytes = io.BytesIO()
                
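                # Bug: nothing has been written to the buffer yet, so this check
                # is always true and every image is skipped.
                if image_bytes.getbuffer().nbytes == 0:
                    continue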

                numpy.save(image_bytes, image)
                blob = Blob.from_data(
                    image_bytes.getvalue(), mime_type="application/x-npy"
                )
                image_text = next(self.images_parser.lazy_parse(blob)).page_content

                images.append(
                    _format_inner_image(blob, image_text, self.images_inner_format)
                )
```

After:

```python
            if self.images_parser:
                xref = img[0]
                pix = pymupdf.Pixmap(doc, xref)
                image = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, -1
                )
                image_bytes = io.BytesIO()
                # Write the array into the buffer first, then check for an empty buffer.
                numpy.save(image_bytes, image)
                if image_bytes.getbuffer().nbytes == 0:
                    continue

                
                blob = Blob.from_data(
                    image_bytes.getvalue(), mime_type="application/x-npy"
                )
                image_text = next(self.images_parser.lazy_parse(blob)).page_content

                images.append(
                    _format_inner_image(blob, image_text, self.images_inner_format)
                )
```


In [14]:
! pip install -qU rapidocr-onnxruntime

In [1]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyMuPDFLoader(
    file_path="baichuan2.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

print(docs[2].page_content)

C-Eval MMLU CMMLU Gaokao AGIEval BBH
GSM8K HumanEval
GPT-4
68.40
83.93
70.33
66.15
63.27
75.12
89.99
69.51
GPT-3.5 Turbo
51.10
68.54
54.06
47.07
46.13
61.59
57.77
52.44
LLaMA-7B
27.10
35.10
26.75
27.81
28.17
32.38
9.78
11.59
LLaMA 2-7B
28.90
45.73
31.38
25.97
26.53
39.16
16.22
12.80
MPT-7B
27.15
27.93
26.00
26.54
24.83
35.20
8.64
14.02
Falcon-7B
24.23
26.03
25.66
24.24
24.10
28.77
5.46
-
ChatGLM 2-6B (base)∗
51.70
47.86
-
-
-
33.68
32.37
-
Baichuan 1-7B
42.80
42.30
44.02
36.34
34.44
32.48
9.17
9.20
7B
Baichuan 2-7B-Base
54.00
54.16
57.07
47.47
42.73
41.56
24.49
18.29
LLaMA-13B
28.50
46.30
31.15
28.23
28.22
37.89
20.55
15.24
LLaMA 2-13B
35.80
55.09
37.99
30.83
32.29
46.98
28.89
15.24
Vicuna-13B
32.80
52.00
36.28
30.11
31.55
43.04
28.13
16.46
Chinese-Alpaca-Plus-13B
38.80
43.90
33.43
34.78
35.46
28.94
11.98
16.46
XVERSE-13B
53.70
55.21
58.44
44.69
42.54
38.06
18.20
15.85
Baichuan 1-13B-Base
52.40
51.60
55.30
49.69
43.20
43.01
26.76
11.59
13B
Baichuan 2-13B-Base
58.10
59.17
61.97
54.33
48

## PDFPlumber

### Extracting text

In [4]:
! pip install -qU langchain-community pdfplumber

In [6]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("baichuan2.pdf")

docs = loader.load()

print(docs[2].page_content)


C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
7B Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM2-6B(base)∗ 51.70 47.86 - - - 33.68 32.37 -
Baichuan1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.20
Baichuan2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
13B Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.59
Baichuan2-13B-Base 58.10 59.17 61.97 54.33 48.17 48.78

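PDFPlumber handles this table very well: every row stays on a single line, and the `7B` and `13B` group labels are attached to a row rather than emitted as separate lines.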
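One rough way to put a number on this comparison is to count the lines that still look like complete table rows. The sketch below assumes a complete row is a label followed by eight cells, each a number or a lone `-`, which matches the benchmark table on this page; `count_intact_rows` is a hypothetical helper and the threshold is an assumption.

```python
import re

# A cell is a plain number such as 68.40 or 48, or a lone dash for a missing score.
_CELL = re.compile(r"^(?:-|\d+(?:\.\d+)?)$")


def count_intact_rows(text, n_cells=8):
    """Count lines made of a label followed by at least n_cells table cells."""
    count = 0
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) <= n_cells:
            continue  # too short to hold a label plus n_cells cells
        if all(_CELL.match(tok) for tok in tokens[-n_cells:]):
            count += 1
    return count


row_per_line = (
    "GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51\n"
    "7B Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -"
)
cell_per_line = "GPT-4\n68.40\n83.93\n70.33\n66.15\n63.27\n75.12\n89.99\n69.51"

print(count_intact_rows(row_per_line))   # rows kept on one line
print(count_intact_rows(cell_per_line))  # cells split across lines
```

Calling `count_intact_rows(docs[2].page_content)` on the output of each loader gives a quick, if crude, basis for comparing them on this page.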

In [17]:
docs[0].page_content[1583:2000]

'Introduction\nThe field of large language models has witnessed\npromising and remarkable progress in recent years.\nThe size of language models has grown from\nmillions of parameters, such as ELMo (Peters\net al., 2018), GPT-1 (Radford et al., 2018), to\nbillions or even trillions of parameters such as GPT-\n3 (Brown et al., 2020), PaLM (Chowdhery et al.,\n2022; Anil et al., 2023) and Switch Transformers\n(Fedus et al., 20'

### Extracting text from images

In [None]:
! pip install -qU rapidocr-onnxruntime

In [2]:
from langchain_community.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("baichuan2.pdf", extract_images=True)

docs = loader.load()

print(docs[2].page_content)


C-Eval MMLU CMMLU Gaokao AGIEval BBH GSM8K HumanEval
GPT-4 68.40 83.93 70.33 66.15 63.27 75.12 89.99 69.51
GPT-3.5Turbo 51.10 68.54 54.06 47.07 46.13 61.59 57.77 52.44
LLaMA-7B 27.10 35.10 26.75 27.81 28.17 32.38 9.78 11.59
LLaMA2-7B 28.90 45.73 31.38 25.97 26.53 39.16 16.22 12.80
MPT-7B 27.15 27.93 26.00 26.54 24.83 35.20 8.64 14.02
7B Falcon-7B 24.23 26.03 25.66 24.24 24.10 28.77 5.46 -
ChatGLM2-6B(base)∗ 51.70 47.86 - - - 33.68 32.37 -
Baichuan1-7B 42.80 42.30 44.02 36.34 34.44 32.48 9.17 9.20
Baichuan2-7B-Base 54.00 54.16 57.07 47.47 42.73 41.56 24.49 18.29
LLaMA-13B 28.50 46.30 31.15 28.23 28.22 37.89 20.55 15.24
LLaMA2-13B 35.80 55.09 37.99 30.83 32.29 46.98 28.89 15.24
Vicuna-13B 32.80 52.00 36.28 30.11 31.55 43.04 28.13 16.46
13B Chinese-Alpaca-Plus-13B 38.80 43.90 33.43 34.78 35.46 28.94 11.98 16.46
XVERSE-13B 53.70 55.21 58.44 44.69 42.54 38.06 18.20 15.85
Baichuan1-13B-Base 52.40 51.60 55.30 49.69 43.20 43.01 26.76 11.59
Baichuan2-13B-Base 58.10 59.17 61.97 54.33 48.17 48.78