Tools for processing text extracted from PDFs for use with AI and LLMs, including splitters and heading finders

millsit/pdf-text-tools

pdf-text-tools

A collection of tools for processing text extracted from a PDF, for use with LLMs: for example, finding headers and splitting text at headers. It is particularly useful for processing pages of PDF text that are not structured in a way that is easy to process.

Install

npm install pdf-text-tools

Usage

/**
 * Find header titles in PDF text using regex-like matching
 */
import { findHeaderTitles } from 'pdf-text-tools';

findHeaderTitles('..some text string from pdf..');
//=> ['header1', 'header2'] 

/**
 * Split text at header titles
 *  - Useful for grabbing the last part of a page
 */
import { splitAtHeader } from 'pdf-text-tools';

splitAtHeader('..some text string from pdf..', "last");
//=> ['text before the header', 'text after the header, including the header']
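
To make the idea behind the two calls above concrete, here is a minimal, self-contained sketch of how header finding and splitting at the last header could work. It is not the implementation used by pdf-text-tools: the function names `guessHeaderLines` and `splitAtLastHeader` and the heuristics (short line, no trailing punctuation, numbered or capitalized words) are assumptions made for illustration only.

```javascript
// Hypothetical helper, NOT part of pdf-text-tools.
// Treats a line as a header when it is short, does not end in punctuation,
// and is either numbered ("2.1 Methods") or has every word capitalized.
function guessHeaderLines(text) {
  const numbered = /^\d+(\.\d+)*\.?\s+\S/;
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => {
      if (line.length === 0 || line.length > 60) return false;
      if (/[.,;:!?]$/.test(line)) return false; // looks like a sentence
      if (numbered.test(line)) return true;
      // Title Case or ALL CAPS: every word starts with an uppercase letter
      return line.split(/\s+/).every((word) => /^[A-Z]/.test(word));
    });
}

// Hypothetical helper, NOT part of pdf-text-tools.
// Splits the text at the last guessed header; the header stays in the
// second part. Returns [text, ''] when no header is found.
function splitAtLastHeader(text) {
  const headers = guessHeaderLines(text);
  if (headers.length === 0) return [text, ''];
  const last = headers[headers.length - 1];
  // Caveat: lastIndexOf matches the header string anywhere in the text,
  // so a later body line repeating the header text would be picked instead.
  const index = text.lastIndexOf(last);
  return [text.slice(0, index), text.slice(index)];
}

const page = [
  'Introduction',
  'This paper describes a method.',
  '2.1 Methods',
  'We sampled ten pages.',
  'RESULTS',
  'the values were stable',
].join('\n');

console.log(guessHeaderLines(page));
// [ 'Introduction', '2.1 Methods', 'RESULTS' ]

console.log(splitAtLastHeader(page)[1]);
// RESULTS
// the values were stable
```

Heuristics like these tend to misfire on short capitalized body lines (names, table cells), so real PDF text usually needs tuning of the length limit and patterns.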

More tools coming soon!
