Chunking code makes odd assumption #23

Open
Aemon-Algiz opened this issue Apr 20, 2023 · 4 comments
Labels
help wanted Extra attention is needed

Comments


Aemon-Algiz commented Apr 20, 2023

func CreateChunks(fileContent string, window int, stride int, title string) []Chunk {
	sentences := strings.Split(fileContent, ".") // assuming sentences end with a period
	newData := make([]Chunk, 0)

	for i := 0; i < len(sentences)-window; i += stride {
		iEnd := i + window
		text := strings.Join(sentences[i:iEnd], ". ")
		start := 0
		end := 0

		if i > 0 {
			start = len(strings.Join(sentences[:i], ". ")) + 2 // +2 for the period and space
		}

		end = len(strings.Join(sentences[:iEnd], ". "))

		newData = append(newData, Chunk{
			Start: start,
			End:   end,
			Title: title,
			Text:  text,
		})
	}

	return newData
}

Based on the source, this seems to assume that a document will have at least 20 sentences (the window passed at the call site). Anything with fewer than 20 sentences never enters the loop, so no chunks and therefore no embeddings are created. This probably isn't the desired result. It would probably be better to chunk based on token count rather than sentence count.
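
To make the failure mode concrete, here's a minimal, hypothetical repro (the window of 20 matches the call site discussed above; the stride value is arbitrary, and CreateChunks is the function quoted above, assumed to be in the same package):

package main

import "fmt"

func main() {
	// Three sentences, well under a 20-sentence window.
	doc := "One sentence. Two sentences. Three sentences."

	// With window=20, len(sentences)-window is negative, so the loop body
	// in CreateChunks never runs and no chunks come back.
	chunks := CreateChunks(doc, 20, 4, "short doc")

	fmt.Println(len(chunks)) // prints 0 – nothing gets embedded
}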

@pashpashpash (Owner)

@Aemon-Algiz yeah, the chunking for different types of documents can be improved significantly... Some thoughts:

  1. Books and video/audio transcripts – 20-sentence chunks are largely fine. The only consideration, as you pointed out, is that sentences can vary wildly in length, so an estimated tiktoken count would be nice to factor in for some cases.
  2. Legal documents and code/manufacturing documentation – structure (sections, subsections) is particularly important for these, so all things being equal, it would be better to ingest an entire section instead of 20 sentences (a rough sketch of this idea is below).
  3. Code – obviously, code is not organized around sentences, so a completely different chunking algorithm would be needed for documents containing code.

If anyone has any other thoughts, I'd be happy to hear them.
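
For point 2, here's a rough sketch of what section-aware chunking could look like for Markdown-style documents. Purely illustrative, not something that's in the repo; the chunkByHeadings name and the "#" heading heuristic are made up:

// Illustrative only: split a Markdown-style document into one chunk per
// heading-delimited section instead of a fixed sentence window.
func chunkByHeadings(fileContent string, title string) []Chunk {
	lines := strings.Split(fileContent, "\n")
	newData := make([]Chunk, 0)

	var section []string
	start := 0 // byte offset where the current section begins
	pos := 0   // running byte offset into fileContent

	flush := func(end int) {
		if len(section) == 0 {
			return
		}
		newData = append(newData, Chunk{
			Start: start,
			End:   end,
			Title: title,
			Text:  strings.Join(section, "\n"),
		})
		section = nil
	}

	for _, line := range lines {
		if strings.HasPrefix(line, "#") { // a new section starts at each heading
			flush(pos)
			start = pos
		}
		section = append(section, line)
		pos += len(line) + 1 // +1 for the newline consumed by Split
	}
	if pos > len(fileContent) {
		pos = len(fileContent) // the last line has no trailing newline
	}
	flush(pos)

	return newData
}

Oversized sections would still need a token-based split inside them, but this keeps logically related text together.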

@pashpashpash pashpashpash added the help wanted Extra attention is needed label Apr 20, 2023

lonelycode commented Apr 22, 2023

I modified this locally to use an NLP library to detect sentences; I think it's a bit better at dealing with different media types. No PR because my local version is a kludgy mess atm, but here's my replacement CreateChunks function (note it has an extra parameter to define how many sentences to pull per chunk, so make sure to update the caller too):

import (
	"fmt"
	"log"

	"github.com/jdkato/prose/v2"
)

func CreateChunks(fileContent string, window int, stride int, title string, chunkSize int) []Chunk {
	doc, err := prose.NewDocument(fileContent)
	if err != nil {
		log.Fatal(err)
	}

	sentences := doc.Sentences() //strings.Split(fileContent, ".") // assuming sentences end with a period
	newData := make([]Chunk, 0)

	c := 0
	text := ""
	start := 0
	end := 0
	for si := range sentences {
		text += " " + sentences[si].Text
		end = start + len(text)

		if c == chunkSize || (c < chunkSize && si == len(sentences)-1) { // flush on a full chunk, or on the final sentence
			if checkTokenLimit(text) { // checkTokenLimit: local helper (not shown) that rejects over-long chunks
				// only write chunks that are ok
				newData = append(newData, Chunk{
					Start: start,
					End:   end,
					Title: title,
					Text:  text,
				})
			} else {
				fmt.Println("chunk size too large!")
			}

			text = ""
			c = 0
		}

		c++
		start = end + 1
	}

	return newData

}

And a test:

func TestCreateChunks(t *testing.T) {
	// 14 sentences
	doc := `Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur pretium scelerisque lorem eget eleifend. Suspendisse condimentum libero at nisl commodo, ac pretium sapien convallis. Sed id lectus non justo varius semper sit amet in sapien. Proin arcu arcu, consequat fermentum tortor lacinia, tincidunt consectetur turpis. Donec iaculis tincidunt iaculis. Cras pulvinar mauris tempor lectus lacinia efficitur. Sed in nibh tellus. Curabitur molestie aliquet leo, non efficitur felis. Integer condimentum libero nec sapien ultrices accumsan. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam quis sagittis dui. Phasellus venenatis nulla quis ligula rutrum bibendum.`
	chunks := CreateChunks(doc, 1, 1, "foo", 1)

	if len(chunks) != 12 {
		tx := ""
		for _, s := range chunks {
			tx += s.Text
		}

		fmt.Println(tx)
		t.Fatalf("expected 12 chunks got %v\n", len(chunks))

	}
}

(edit: forgot the import)

pashpashpash commented Apr 24, 2023

@Aemon-Algiz @lonelycode good idea using an NLP library. Does this library support most languages? I ended up going with github.com/neurosnap/sentences – check out my fix here 2bff175

@pashpashpash (Owner)

// MaxTokensPerChunk is the maximum number of tokens allowed in a single chunk for OpenAI embeddings
const MaxTokensPerChunk = 500
const EmbeddingModel = "text-embedding-ada-002"

func CreateChunks(fileContent string, title string) ([]Chunk, error) {
	tokenizer, _ := english.NewSentenceTokenizer(nil)
	sentences := tokenizer.Tokenize(fileContent)

	log.Println("[CreateChunks] getting tiktoken for", EmbeddingModel, "...")
	// Get tiktoken encoding for the model
	tiktoken, err := tke.EncodingForModel(EmbeddingModel)
	if err != nil {
		return []Chunk{}, fmt.Errorf("getEncoding: %v", err)
	}

	newData := make([]Chunk, 0)
	position := 0
	i := 0

	for i < len(sentences) {
		chunkTokens := 0
		chunkSentences := []*s.Sentence{}

		// Add sentences to the chunk until the token limit is reached
		for i < len(sentences) {
			tiktokens := tiktoken.Encode(sentences[i].Text, nil, nil)
			tokenCount := len(tiktokens)
			fmt.Printf(
				"[CreateChunks] #%d Token count: %d | Total number of sentences: %d | Sentence: %s\n",
				i, tokenCount, len(sentences), sentences[i].Text)

			if chunkTokens+tokenCount <= MaxTokensPerChunk {
				chunkSentences = append(chunkSentences, sentences[i])
				chunkTokens += tokenCount
				i++
			} else {
				log.Println("[CreateChunks] Adding this sentence would exceed max token limit. Breaking....")
				break
			}
		}

		if len(chunkSentences) > 0 {
			text := strings.Join(sentencesToStrings(chunkSentences), "")

			start := position
			end := position + len(text)

			fmt.Printf("[CreateChunks] Created chunk and adding it to the array...\nText: %s\n",
				text)

			newData = append(newData, Chunk{
				Start: start,
				End:   end,
				Title: title,
				Text:  text,
			})
			fmt.Printf("[CreateChunks] New chunk array length: %d\n",
				len(newData))
			position = end

			// Set the stride for overlapping chunks
			stride := len(chunkSentences) / 2
			if stride < 1 {
				stride = 1
			}

			oldI := i
			i -= stride

			// Check if the next sentence would still fit within the token limit
			nextTokens := tiktoken.Encode(sentences[i].Text, nil, nil)
			nextTokenCount := len(nextTokens)

			if chunkTokens+nextTokenCount <= MaxTokensPerChunk {
				// Increment i without applying the stride
				i = oldI + 1
			} else if i == oldI {
				// Ensure i is always incremented to avoid an infinite loop
				i++
			}

		}
	}

	return newData, nil
}
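
For anyone wiring this in, a rough sketch of the call side (hypothetical snippet; assumes log and fmt are imported and fileContent holds the document text):

chunks, err := CreateChunks(fileContent, "My Document")
if err != nil {
	log.Fatalf("[CreateChunks] failed: %v", err)
}
for _, c := range chunks {
	// Start/End are byte offsets into the original document text
	fmt.Printf("chunk [%d:%d] %d bytes\n", c.Start, c.End, len(c.Text))
}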
