diff --git a/README.md b/README.md
index ae78a49..ba7d3bb 100644
--- a/README.md
+++ b/README.md
@@ -9,30 +9,50 @@ JChunk project is simple library that enables different types of text splitting
 
 ## Docs
 
-### Chunkers
- - [Fixed Chunker](jchunk-fixed/README.md)
- - [Recursive Character Chunker](jchunk-recursive-character/README.md)
- - [Semantic Chunker](jchunk-semantic/README.md)
-
-### More
- - [Jchunk Documentation](docs/modules/ROOT/pages/index.adoc)
+[Jchunk Website](https://jchunk-io.github.io/jchunk/)
 
 ## Installing
 
-### Maven
+### Fixed Chunker
+
+```xml
+<dependency>
+    <groupId>io.jchunk</groupId>
+    <artifactId>jchunk-fixed</artifactId>
+    <version>${jchunk.version}</version>
+</dependency>
+```
+
+```groovy
+implementation("io.jchunk:jchunk-fixed:${JCHUNK_VERSION}")
+```
+
+### Recursive Chunker
 
 ```xml
 <dependency>
     <groupId>io.jchunk</groupId>
-    <artifactId>jchunk-...</artifactId>
+    <artifactId>jchunk-recursive-character</artifactId>
     <version>${jchunk.version}</version>
 </dependency>
 ```
 
-### Gradle
+```groovy
+implementation("io.jchunk:jchunk-recursive-character:${JCHUNK_VERSION}")
+```
+
+### Semantic Chunker
+
+```xml
+<dependency>
+    <groupId>io.jchunk</groupId>
+    <artifactId>jchunk-semantic</artifactId>
+    <version>${jchunk.version}</version>
+</dependency>
+```
 
 ```groovy
-implementation group: 'io.jchunk', name: 'jchunk-...', version: "${JCHUNK_VERSION}" // replace dots with desired module name
+implementation("io.jchunk:jchunk-semantic:${JCHUNK_VERSION}")
 ```
 
 ## Building
diff --git a/jchunk-fixed/README.md b/jchunk-fixed/README.md
deleted file mode 100644
index 66d8c5c..0000000
--- a/jchunk-fixed/README.md
+++ /dev/null
@@ -1,50 +0,0 @@
-# FixedChunker
-
-Splits text into **fixed-size chunks** using a single delimiter.
-
-## Installing
-
-Considering there is a property defined for jchunk:
-```xml
-<properties>
-    <jchunk.version>X.X.X</jchunk.version>
-</properties>
-```
-
-Then:
-```xml
-<dependency>
-    <groupId>io.jchunk</groupId>
-    <artifactId>jchunk-fixed</artifactId>
-    <version>${jchunk.version}</version>
-</dependency>
-```
-
-```groovy
-implementation group: 'io.jchunk', name: 'jchunk-fixed', version: "${JCHUNK_VERSION}"
-```
-
-## Usage
-
-```java
-import io.jchunk.fixed.Config;
-import io.jchunk.fixed.FixedChunker;
-import io.jchunk.core.chunk.Chunk;
-import io.jcunk.commons.Delimiter;
-
-var config = Config.builder()
-    .chunkSize(1000)                // max characters per chunk
-    .chunkOverlap(100)              // overlapping characters between chunks
-    .delimiter("\\.")               // split on dots (this is regex based)
-    .keepDelimiter(Delimiter.NONE)  // NONE / START / END
-    .trimWhitespace(true)
-    .build();
-
-var chunker = new FixedChunker(config); // or new FixedChunker() using default config
-List<Chunk> chunks = chunker.split("Your long text here...");
-```
-## Notes
-
-- Chunk size is a target, not guaranteed if input cannot be split further.
-- Overlap keeps context between chunks.
-- Delimiter is regex based.
\ No newline at end of file
diff --git a/jchunk-recursive-character/README.md b/jchunk-recursive-character/README.md
deleted file mode 100644
index 56f469c..0000000
--- a/jchunk-recursive-character/README.md
+++ /dev/null
@@ -1,52 +0,0 @@
-# RecursiveCharacterChunker
-
-Splits text **recursively by multiple delimiters**, starting with bigger ones (paragraphs) and falling back to smaller ones (sentences, words, characters).
-
-## Installing
-
-Considering there is a property defined for jchunk:
-```xml
-<properties>
-    <jchunk.version>X.X.X</jchunk.version>
-</properties>
-```
-
-Then:
-```xml
-<dependency>
-    <groupId>io.jchunk</groupId>
-    <artifactId>jchunk-recursive-character</artifactId>
-    <version>${jchunk.version}</version>
-</dependency>
-```
-
-```groovy
-implementation group: 'io.jchunk', name: 'jchunk-recursive-character', version: "${JCHUNK_VERSION}"
-```
-
-
-## Usage
-
-```java
-import io.jchunk.recursive.Config;
-import io.jchunk.recursive.RecursiveCharacterChunker;
-import io.jchunk.core.chunk.Chunk;
-import io.jchunk.commons..Delimiter;
-
-var config = Config.builder()
-    .chunkSize(500)
-    .chunkOverlap(50)
-    .delimiters(List.of("\n\n", "\n", " ", "")) // fallback to character-level, regex-string based
-    .keepDelimiter(Delimiter.START)              // NONE / START / END
-    .trimWhitespace(true)
-    .build();
-
-var chunker = new RecursiveCharacterChunker(config);
-List<Chunk> chunks = chunker.split("Your long text here...");
-```
-
-## Notes
-
-- Delimiters are applied in order (first match wins).
-- Last delimiter "" means character-level splitting.
-- Keeps context overlap between chunks.
diff --git a/jchunk-semantic/README.md b/jchunk-semantic/README.md
deleted file mode 100644
index 50dc5bd..0000000
--- a/jchunk-semantic/README.md
+++ /dev/null
@@ -1,53 +0,0 @@
-# SemanticChunker
-
-Splits text into chunks based on **semantic similarity** using embeddings.
-Instead of relying only on character counts or delimiters, it groups sentences into coherent chunks that better preserve meaning, useful for **RAG pipelines**, **semantic search**, and **embedding-based retrieval**.
-
-## Installing
-
-Considering there is a property defined for jchunk:
-```xml
-<properties>
-    <jchunk.version>X.X.X</jchunk.version>
-</properties>
-```
-
-Then:
-```xml
-<dependency>
-    <groupId>io.jchunk</groupId>
-    <artifactId>jchunk-semantic</artifactId>
-    <version>${jchunk.version}</version>
-</dependency>
-```
-
-```groovy
-implementation group: 'io.jchunk', name: 'jchunk-semantic', version: "${JCHUNK_VERSION}"
-```
-
-## Usage
-
-```java
-import io.jchunk.semantic.SemanticChunker;
-import io.jchunk.semantic.embedder.JChunkEmbedder;
-import io.jchunk.core.chunk.Chunk;
-
-var config = Config.builder()
-    .sentenceSplittingStrategy(SentenceSplittingStrategy.DEFAULT) // regex being used to split into sentences, can be user defined
-    .percentile(95)                                               // similarity threshold (1–99)
-    .bufferSize(1)                                                // number of neighboring sentences to include
-    .build();
-
-var embedder = new JChunkEmbedder(); // default provided embedder
-var chunker = new SemanticChunker(embedder);
-
-List<Chunk> chunks = chunker.split("Your long text here...");
-```
-
-## Notes
-
-- The number and size of chunks depend on the embedding model.
-- Preserves semantic coherence rather than enforcing strict size.
-  - The model selected must support the language being use if not chunks might not have coherence
-  - If the text is in English make sure the model supports English.
-- Requires an Embedder implementation (default: `JChunkEmbedder`).
\ No newline at end of file
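
Note on the removed recursive-chunker README: its notes say delimiters are tried in order and that a final `""` delimiter means character-level splitting. The sketch below illustrates that fallback idea only. It is not JChunk's implementation: `RecursiveSplitSketch`, `split`, and `maxSize` are hypothetical names, delimiters are matched literally rather than as regexes, and overlap, delimiter retention, and merging of small pieces are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not JChunk's RecursiveCharacterChunker.
class RecursiveSplitSketch {

    // Splits on the first delimiter; any piece still longer than maxSize
    // is split again with the remaining delimiters.
    static List<String> split(String text, List<String> delimiters, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<String> out = new ArrayList<>();
        // Small enough, or no delimiters left: keep the piece as it is
        // (so an unsplittable piece may exceed maxSize).
        if (text.length() <= maxSize || delimiters.isEmpty()) {
            if (!text.isEmpty()) {
                out.add(text);
            }
            return out;
        }
        String delimiter = delimiters.get(0);
        List<String> remaining = delimiters.subList(1, delimiters.size());
        if (delimiter.isEmpty()) {
            // "" is the character-level fallback: cut fixed-width slices.
            for (int i = 0; i < text.length(); i += maxSize) {
                out.add(text.substring(i, Math.min(text.length(), i + maxSize)));
            }
            return out;
        }
        // Assumption: delimiters are treated literally in this sketch.
        for (String piece : text.split(Pattern.quote(delimiter))) {
            out.addAll(split(piece, remaining, maxSize));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> chunks = split("aa bb\n\ncccccc", List.of("\n\n", " ", ""), 4);
        System.out.println(chunks); // prints [aa, bb, cccc, cc]
    }
}
```

In the example, the paragraph break splits the input first, the space splits the oversized first paragraph, and the empty delimiter slices the remaining word that has no space to split on.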