This notebook investigates the output of `langchain` PDF loaders and splitters. Loaders extract text from raw PDF files, while splitters apply heuristics to chunk the extracted documents so that LLMs can process them more efficiently.

In [40]:
import os
import time
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

pdf_dir = "/Users/kkumbier/github/persisters/papers/"
paper = "Hata/Cabanos2021.pdf"
pdf_file = os.path.join(pdf_dir, paper)


# PDF loaders
Below we examine the output of several PDF loaders, each used with its default splitter.

In [18]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader(pdf_file)

a = time.time()
pages = loader.load_and_split()
b = time.time()

print(b - a)
print(len(pages))

for p in pages:
  print(p)

0.9931268692016602
19
page_content='A UGUST  2021\u2003 CANCER DISCOVERY \u2002|\u2002OF1 \nResea Rch B Rief\nClinical Acquired Resistance to KRASG12C \nInhibition through a Novel KRAS Switch-II \nPocket Mutation and Polyclonal Alterations Converging on RAS–MAPK Reactivation  \n \nNoritaka Tanaka1, Jessica J. Lin1, Chendi Li1, Meagan B. Ryan1, Junbing Zhang1, Lesli A. Kiedrowski2,  \nAlexa G. Michel1, Mohammed U. Syed1, Katerina A. Fella1, Mustafa Sakhi1, Islam Baiev1, Dejan Juric1,  \nJustin F . Gainor1, Samuel J. Klempner1, Jochen K. Lennerz3, Giulia Siravegna1, Liron Bar-Peled1,  \nAaron N. Hata1, Rebecca S. Heist1, and Ryan B. Corcoran1\naBstRact Mutant-selective KRASG12C inhibitors, such as MRTX849 (adagrasib) and AMG 510 \n(sotorasib), have demonstrated efficacy in KRASG12C-mutant cancers, including \nnon–small cell lung cancer (NSCLC). However, mechanisms underlying clinical acquired resistance to \nKRASG12C inhibitors remain undetermined. To begin to define the mechanistic spec

In [19]:
from langchain_community.document_loaders import PDFPlumberLoader
loader = PDFPlumberLoader(pdf_file)

a = time.time()
pages = loader.load_and_split()
b = time.time()

print(b - a)
print(len(pages))

for p in pages:
  print(p)

3.132020950317383
19
page_content='Published OnlineFirst April 6, 2021; DOI: 10.1158/2159-8290.CD-21-0365\nReseaRch BRief\nClinical Acquired Resistance to KRASG12C\nInhibition through a Novel KRAS Switch-II\nPocket Mutation and Polyclonal Alterations\nConverging on RAS–MAPK Reactivation\nNoritaka Tanaka1, Jessica J. Lin1, Chendi Li1, Meagan B. Ryan1, Junbing Zhang1, Lesli A. Kiedrowski2,\nAlexa G. Michel1, Mohammed U. Syed1, Katerina A. Fella1, Mustafa Sakhi1, Islam Baiev1, Dejan Juric1,\nJustin F. Gainor1, Samuel J. Klempner1, Jochen K. Lennerz3, Giulia Siravegna1, Liron Bar-Peled1,\nAaron N. Hata1, Rebecca S. Heist1, and Ryan B. Corcoran1\naBstRact Mutant-selective KRASG12C inhibitors, such as MRTX849 (adagrasib) and AMG 510\n(sotorasib), have demonstrated efficacy in KRASG12C-mutant cancers, including\nnon–small cell lung cancer (NSCLC). However, mechanisms underlying clinical acquired resistance to\nKRASG12C inhibitors remain undetermined. To begin to define the mechanistic spectru

In [35]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(pdf_file)

a = time.time()
pages = loader.load_and_split()
b = time.time()

print(b - a)
print(len(pages))

for p in pages:
  print(p)



0.22659611701965332
19
page_content='AUGUST  2021\u2003CANCER DISCOVERY\u2002|\u2002OF1 \nResearch Brief\nClinical Acquired Resistance to KRASG12C \nInhibition through a Novel KRAS Switch-II \nPocket Mutation and Polyclonal Alterations \nConverging on RAS–MAPK Reactivation  \nNoritaka Tanaka1, Jessica J. Lin1, Chendi Li1, Meagan B. Ryan1, Junbing Zhang1, Lesli A. Kiedrowski2,  \nAlexa G. Michel1, Mohammed U. Syed1, Katerina A. Fella1, Mustafa Sakhi1, Islam Baiev1, Dejan Juric1,  \nJustin F. Gainor1, Samuel J. Klempner1, Jochen K. Lennerz3, Giulia Siravegna1, Liron Bar-Peled1,  \nAaron N. Hata1, Rebecca S. Heist1, and Ryan B. Corcoran1\nabstract\nMutant-selective KRASG12C inhibitors, such as MRTX849 (adagrasib) and AMG 510 \n(sotorasib), have demonstrated efficacy in KRASG12C-mutant cancers, including \nnon–small cell lung cancer (NSCLC). However, mechanisms underlying clinical acquired resistance to \nKRASG12C inhibitors remain undetermined. To begin to define the mechanistic spectrum 
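
The three loader cells above repeat the same timing pattern. A minimal sketch of a reusable helper follows; `timed` is a hypothetical name that does not appear in the notebook, and it uses `time.perf_counter`, which is generally better suited to measuring short intervals than `time.time`.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without a PDF; in the notebook this
# would be something like timed(loader.load_and_split).
result, elapsed = timed(sorted, [3, 1, 2])
print(result, f"{elapsed:.6f}s")
```

A single run per loader, as in the cells above, gives only a rough comparison; repeating each call a few times would make the timings more reliable.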

In [29]:
for p in pages:
  pg = p.page_content
  print(f"CHARS: {len(pg)}, WORDS: {len(pg.split(' '))}")
  print("-" * 80)

CHARS: 3996, WORDS: 506
--------------------------------------------------------------------------------
CHARS: 240, WORDS: 28
--------------------------------------------------------------------------------
CHARS: 3942, WORDS: 521
--------------------------------------------------------------------------------
CHARS: 3079, WORDS: 432
--------------------------------------------------------------------------------
CHARS: 2924, WORDS: 364
--------------------------------------------------------------------------------
CHARS: 3964, WORDS: 532
--------------------------------------------------------------------------------
CHARS: 1135, WORDS: 145
--------------------------------------------------------------------------------
CHARS: 2425, WORDS: 280
--------------------------------------------------------------------------------
CHARS: 2725, WORDS: 353
--------------------------------------------------------------------------------
CHARS: 3987, WORDS: 554
---------------------------------
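
The word counts above come from `split(" ")`, which splits only on the ASCII space character. Newlines and the Unicode spaces visible in the loader output (such as `\u2003`) are therefore not treated as separators, and consecutive spaces produce empty strings. The sketch below compares this with `split()`, which splits on any run of whitespace; `count_words` and the sample string are illustrative and not part of the notebook.

```python
def count_words(text):
    """Return (count using split(' '), count using split())."""
    by_space = len(text.split(" "))    # splits only on the ASCII space
    by_whitespace = len(text.split())  # splits on any run of whitespace
    return by_space, by_whitespace


# A fragment shaped like the header line in the loader output.
sample = "AUGUST  2021\u2003CANCER DISCOVERY\nResearch Brief"
print(count_words(sample))
```

For this sample the two approaches disagree (5 versus 6), so the WORDS column above is best read as an approximation.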

As an alternative to the langchain PDF loaders / splitters, we consider a parser designed for scientific papers: https://github.com/titipata/scipdf_parser

In [42]:
import scipdf
article_dict = scipdf.parse_pdf_to_dict(pdf_file)
print(article_dict)

None
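
`parse_pdf_to_dict` returned `None` here. The scipdf_parser README describes the parser as relying on a running GROBID service, so one plausible cause (an assumption, not verified in this notebook) is that no GROBID server was reachable. The sketch below guards against a `None` result before summarising sections. The helper name `section_lengths` and the sample dictionary are hypothetical, and the `sections` / `heading` / `text` keys follow the layout described in the README rather than anything observed in this run.

```python
def section_lengths(article_dict):
    """Return [(heading, n_chars)] per parsed section, or [] if parsing failed."""
    if not article_dict:
        return []
    return [
        (sec.get("heading", ""), len(sec.get("text", "")))
        for sec in article_dict.get("sections", [])
    ]


# Hypothetical parsed result, used only to exercise the helper.
example = {
    "title": "Example paper",
    "sections": [
        {"heading": "Introduction", "text": "Some text."},
        {"heading": "Results", "text": "More text here."},
    ],
}
print(section_lengths(None))
print(section_lengths(example))
```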


# Splitters
The default splitters produce chunks that are essentially page-based. Here we try to split by section / subsection instead.
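
One simple way to approximate section-based splitting is to look for lines that consist only of a known heading. The sketch below uses only the standard library; `split_by_heading` and the heading list are hypothetical, and real papers will need a list tailored to the journal's layout and to how the loader renders headings.

```python
import re

# Hypothetical heading names; adjust to the paper being parsed.
SECTION_HEADINGS = ["abstract", "introduction", "results", "discussion", "methods"]


def split_by_heading(text, headings=SECTION_HEADINGS):
    """Split text into (heading, body) pairs at lines that contain only a heading."""
    alternatives = "|".join(re.escape(h) for h in headings)
    pattern = re.compile(
        rf"^[ \t]*({alternatives})[ \t]*$",
        flags=re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    if not matches:
        return [("preamble", text.strip())]

    sections = []
    # Anything before the first heading (title, authors) is kept as a preamble.
    preamble = text[: matches[0].start()].strip()
    if preamble:
        sections.append(("preamble", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), text[m.end():end].strip()))
    return sections


sample = (
    "Title line\nAuthors\nabstract\nSome abstract text.\n"
    "Introduction\nIntro text.\nResults\nResult text."
)
for heading, body in split_by_heading(sample):
    print(heading, "->", repr(body))
```

This only works when the loader places headings on their own line, as the PyMuPDF output does for `abstract`; the PyPDF and PDFPlumber outputs above merge the heading with the following text.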

In [37]:
pdf = loader.load()

splitter = RecursiveCharacterTextSplitter(
  chunk_size=1000,
  chunk_overlap=50
)

TypeError: expected string or bytes-like object, got 'Document'
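
The call that raised this `TypeError` is not shown in the cell, but the message is the one Python's `re` functions raise when given a non-string. That would be consistent with passing `Document` objects to a method that expects plain strings, though this is an assumption about the cause. The sketch below reproduces the same class of error with a stand-in `Doc` dataclass, a hypothetical substitute for langchain's `Document`, and shows the usual remedy of passing `page_content` instead.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a langchain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


docs = [Doc("first page\n\nmore text"), Doc("second page")]

try:
    re.split(r"\n\n", docs[0])  # a Doc is not a string, so re raises TypeError
except TypeError as err:
    print("TypeError:", err)

# Passing the text itself works.
pieces = [re.split(r"\n\n", d.page_content) for d in docs]
print(pieces)
```

In langchain, `split_text` takes a string while `split_documents` takes a list of `Document` objects, which is why the `split_documents(pdf)` call in the next cell succeeds.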

In [39]:
splitter.split_documents(pdf)

[Document(page_content='AUGUST  2021\u2003CANCER DISCOVERY\u2002|\u2002OF1 \nResearch Brief\nClinical Acquired Resistance to KRASG12C \nInhibition through a Novel KRAS Switch-II \nPocket Mutation and Polyclonal Alterations \nConverging on RAS–MAPK Reactivation  \nNoritaka Tanaka1, Jessica J. Lin1, Chendi Li1, Meagan B. Ryan1, Junbing Zhang1, Lesli A. Kiedrowski2,  \nAlexa G. Michel1, Mohammed U. Syed1, Katerina A. Fella1, Mustafa Sakhi1, Islam Baiev1, Dejan Juric1,  \nJustin F. Gainor1, Samuel J. Klempner1, Jochen K. Lennerz3, Giulia Siravegna1, Liron Bar-Peled1,  \nAaron N. Hata1, Rebecca S. Heist1, and Ryan B. Corcoran1\nabstract\nMutant-selective KRASG12C inhibitors, such as MRTX849 (adagrasib) and AMG 510 \n(sotorasib), have demonstrated efficacy in KRASG12C-mutant cancers, including \nnon–small cell lung cancer (NSCLC). However, mechanisms underlying clinical acquired resistance to \nKRASG12C inhibitors remain undetermined. To begin to define the mechanistic spectrum of acquired',
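
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks; `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=50):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts with the last character of the previous one.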