Merged
37 changes: 18 additions & 19 deletions .github/workflows/docs.yml
@@ -1,11 +1,10 @@
name: Deploy Antora to GitHub Pages
name: Deploy Docusaurus to GitHub Pages

on:
push:
branches: [ main ]
paths:
- 'docs/**'
- 'antora-playbook.yml'
- '.github/workflows/docs.yml'
workflow_dispatch:

@@ -20,38 +19,38 @@ concurrency:

jobs:
build:
name: Build Jchunk docs
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- uses: actions/checkout@v4
with:
fetch-depth: 0

- name: Setup Node.js
uses: actions/setup-node@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
cache-dependency-path: docs/package-lock.json

- name: Setup Pages
uses: actions/configure-pages@v5

- name: Build Antora site
run: |
npx -y -p @antora/cli@3.1 -p @antora/site-generator@3.1 antora -r @antora/site-generator antora-playbook.yml

- name: Disable Jekyll on Pages
run: |
echo > build/site/.nojekyll
- name: Install dependencies
working-directory: docs
run: npm ci
- name: Build website
working-directory: docs
run: npm run build

- name: Upload Pages artifact
- name: Upload Build Artifact
uses: actions/upload-pages-artifact@v3
with:
path: build/site
path: docs/build

deploy:
name: Deploy to GitHub pages
needs: build
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
needs: build
steps:
- name: Deploy to GitHub Pages
id: deployment
21 changes: 20 additions & 1 deletion .gitignore
@@ -32,4 +32,23 @@ build/
### VS Code ###
.vscode/

.DS_Store
.DS_Store

### Docusaurus ###

# Dependencies
node_modules

# Generated files
.docusaurus
.cache-loader

# Misc
.env.local
.env.development.local
.env.test.local
.env.production.local

npm-debug.log*
yarn-debug.log*
yarn-error.log*
19 changes: 1 addition & 18 deletions README.md
@@ -1,7 +1,7 @@
# JChunk

[![GitHub Actions Status](https://img.shields.io/github/actions/workflow/status/jchunk-io/jchunk/build.yml?branch=main&logo=GitHub&style=for-the-badge)](.)
[![Apache 2.0 License](https://img.shields.io/github/license/arconia-io/arconia?style=for-the-badge&logo=apache&color=brightgreen)](.)
[![Apache 2.0 License](https://img.shields.io/github/license/jchunk-io/jchunk?style=for-the-badge&logo=apache&color=brightgreen)](.)

## A Java Library for Text Chunking

@@ -55,23 +55,6 @@ To check javadocs using the javadoc:javadoc
./mvnw javadoc:javadoc -Pjavadoc
```

## Building the docs locally

You can build and preview the Antora documentation locally without installing anything globally.

Prerequisites:
- Node.js 18+ (20 recommended).
- Download from https://nodejs.org/

Build the site:

```sh
npx -y -p @antora/cli@3.1 -p @antora/site-generator@3.1 antora -r @antora/site-generator antora-playbook.yml
```

Open the generated site:
- `build/site/index.html`

## Contributing

Please read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
25 changes: 0 additions & 25 deletions antora-playbook.yml

This file was deleted.

41 changes: 41 additions & 0 deletions docs/README.md
@@ -0,0 +1,41 @@
# Website

This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.

## Installation

```bash
yarn
```

## Local Development

```bash
yarn start
```

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.

## Build

```bash
yarn build
```

This command generates static content into the `build` directory, which can be served by any static content hosting service.

## Deployment

Using SSH:

```bash
USE_SSH=true yarn deploy
```

Not using SSH:

```bash
GIT_USER=<Your GitHub username> yarn deploy
```

If you are using GitHub Pages for hosting, this command is a convenient way to build the website and push it to the `gh-pages` branch.
8 changes: 0 additions & 8 deletions docs/antora.yml

This file was deleted.

7 changes: 7 additions & 0 deletions docs/docs/chunkers/_category_.json
@@ -0,0 +1,7 @@
{
"label": "Chunkers",
"position": 3,
"link": {
"type": "generated-index"
}
}
98 changes: 98 additions & 0 deletions docs/docs/chunkers/fixed-chunker.md
@@ -0,0 +1,98 @@
# Fixed Character Chunker

## Overview

The Fixed Character Chunker divides text into chunks of a fixed maximum number of characters. While simple, it is a good starting point for understanding the fundamentals of text splitting.

## Installation

```xml
<dependency>
    <groupId>io.jchunk</groupId>
    <artifactId>jchunk-fixed</artifactId>
    <version>${jchunk.version}</version>
</dependency>
```

```groovy
implementation group: 'io.jchunk', name: 'jchunk-fixed', version: "${JCHUNK_VERSION}"
```

## Configuration

```java
// using default config
FixedChunker chunker = new FixedChunker();

// with custom config
Config config = Config.builder()
    .chunkSize(10)
    .chunkOverlap(0)
    .delimiter(";")
    .trimWhitespace(true)
    .keepDelimiter(Delimiter.START)
    .build();

// a distinct variable name, so both snippets compile in the same scope
FixedChunker customChunker = new FixedChunker(config);
```

### Configuration Options

- `chunkSize`: Target maximum number of characters per chunk. If a single segment (the text between two delimiters) is longer than this value, the resulting chunk may exceed the limit.
  - Default: `1000`.
- `chunkOverlap`: Number of characters shared between consecutive chunks, which preserves context across chunk boundaries.
  - Default: `100`.
- `delimiter`: Regex used to split the text into segments before chunks are formed. Common values: `" "` for spaces, `"\n"` for newlines, `""` for character-level splitting.
  - Default: `" "` (a single space).
- `trimWhitespace`: Whether to trim leading and trailing whitespace from each chunk.
  - Default: `true`.
- `keepDelimiter`: Whether and where to keep delimiters in chunks: `NONE`, `START`, or `END`.
  - Default: `NONE`.
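
To make the `keepDelimiter` modes concrete, here is a minimal sketch in plain Java that does not use JChunk at all. It assumes that `START` attaches each delimiter to the beginning of the segment that follows it and `END` attaches it to the end of the segment that precedes it; that reading is an assumption, not something confirmed by the library. The class `KeepDelimiterSketch` and its `Keep` enum are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: NOT JChunk's implementation. The START/END semantics
// shown here are an assumption about what the option means.
class KeepDelimiterSketch {

    // Hypothetical stand-in for the library's delimiter-keeping modes.
    enum Keep { NONE, START, END }

    static List<String> split(String text, String delimiterRegex, Keep keep) {
        String pattern;
        switch (keep) {
            case START:
                // Zero-width lookahead: cut just before each delimiter.
                pattern = "(?=" + delimiterRegex + ")";
                break;
            case END:
                // Zero-width lookbehind: cut just after each delimiter.
                // Java may reject lookbehind patterns of unbounded length.
                pattern = "(?<=" + delimiterRegex + ")";
                break;
            default:
                // Plain split: delimiters are dropped.
                pattern = delimiterRegex;
        }
        return Arrays.asList(text.split(pattern));
    }

    public static void main(String[] args) {
        System.out.println(split("a;b;c", ";", Keep.NONE));  // [a, b, c]
        System.out.println(split("a;b;c", ";", Keep.START)); // [a, ;b, ;c]
        System.out.println(split("a;b;c", ";", Keep.END));   // [a;, b;, c]
    }
}
```

The lookaround patterns split at zero-width positions, so the delimiter text itself is never consumed and stays attached to one of the neighbouring segments.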

## Examples

### Basic Chunking

Chunk size of 10 and no overlap (0):

```java
Config config = Config.builder()
    .chunkSize(10)
    .chunkOverlap(0)
    .build();
FixedChunker chunker = new FixedChunker(config);
String text = "This is an example of character splitting.";

List<Chunk> chunks = chunker.split(text);

// Result: ["This is an", "example of", "character", "splitting."]
```
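
The result above is consistent with a greedy strategy: split on the delimiter, then keep joining segments while the joined text still fits in `chunkSize`. The following is a minimal sketch of that idea in plain Java, not JChunk's actual implementation. The class `GreedyMergeSketch` and its parameters (including the separate `joiner` argument) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a greedy merge that happens to reproduce the documented
// result. JChunk's real algorithm may differ.
class GreedyMergeSketch {

    // Splits on delimiterRegex, then joins segments with `joiner` for as long
    // as the joined length stays within chunkSize. A single segment longer
    // than chunkSize is emitted unchanged, so it can exceed the limit.
    static List<String> chunk(String text, String delimiterRegex, String joiner, int chunkSize) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String segment : text.split(delimiterRegex)) {
            if (segment.isEmpty()) {
                continue; // skip empty pieces produced by repeated delimiters
            }
            int joinedLength = current.length() + joiner.length() + segment.length();
            if (current.length() > 0 && joinedLength > chunkSize) {
                chunks.add(current.toString()); // current chunk is full
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(joiner);
            }
            current.append(segment);
        }
        if (current.length() > 0) {
            chunks.add(current.toString()); // flush the last chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints: [This is an, example of, character, splitting.]
        System.out.println(chunk("This is an example of character splitting.", " ", " ", 10));
    }
}
```

Note that "character" ends up alone in a 9-character chunk: appending " splitting." would exceed the limit, so the sketch closes the chunk early rather than cutting a word.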

### With Overlap

Chunk size of 35 with 4 characters of overlap and an empty delimiter (`""`), which splits at the character level:

```java
Config config = Config.builder()
    .chunkSize(35)
    .chunkOverlap(4)
    .delimiter("")
    .build();
FixedChunker chunker = new FixedChunker(config);
String text = "This is the text I would like to chunk up. It is the example text for this exercise";
List<Chunk> chunks = chunker.split(text);

// Result: ["This is the text I would like to ch", "o chunk up. It is the example text", "ext for this exercise"]
```
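
With an empty delimiter, the result above matches a plain sliding window: each chunk starts `chunkSize - chunkOverlap` characters after the previous one, and whitespace is trimmed from each chunk. Below is a minimal sketch of that window in plain Java. It is an illustration under those assumptions rather than JChunk's code, and `SlidingWindowSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a character-level sliding window with overlap.
// Indices are UTF-16 code units, as with String.substring.
class SlidingWindowSketch {

    static List<String> chunk(String text, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= chunkOverlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - chunkOverlap; // how far each window advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end).trim());
            if (end == text.length()) {
                break; // the window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "This is the text I would like to chunk up. It is the example text for this exercise";
        // Prints the three chunks listed in the example above.
        System.out.println(chunk(text, 35, 4));
    }
}
```

For this 83-character input the windows start at offsets 0, 31, and 62, which is why the second chunk begins with "o ch", the last four characters of the first chunk.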

## Pros and Cons

### Pros
- Easy to implement and understand
- Predictable chunk sizes
- Fast processing

### Cons
- Doesn't consider text structure or context
- May split words inappropriately
- Overlap creates duplicate data