feat: add task chunk text #139

chuang8511 · 2024-05-24T20:43:36Z

Because

we want to enrich the functions to chunk text for knowledge base

This commit

add the task to chunk

linear · 2024-05-24T20:43:39Z

INS-4499 [Component][Text Operator] add TASK_CHUNK_TEXT for more advanced chunking methods

codecov · 2024-05-28T19:03:26Z

Codecov Report

Attention: Patch coverage is 75.46083% with 213 lines in your changes are missing coverage. Please review.

Project coverage is 61.99%. Comparing base (4ce9851) to head (f92868c).
Report is 48 commits behind head on main.

❗ Current head f92868c differs from pull request most recent head 70a1dd2

Please upload reports for the commit 70a1dd2 to get more accurate results.

Files	Patch %	Lines
...rnal/langchaingo/textsplitter/markdown_splitter.go	77.17%	103 Missing and 15 partials ⚠️
external/langchaingo/textsplitter/options.go	49.18%	30 Missing and 1 partial ⚠️
...ternal/langchaingo/textsplitter/split_documents.go	70.37%	20 Missing and 4 partials ⚠️
operator/text/v0/main.go	0.00%	15 Missing ⚠️
operator/text/v0/chunk_text.go	83.75%	9 Missing and 4 partials ⚠️
...xternal/langchaingo/textsplitter/token_splitter.go	81.25%	6 Missing and 3 partials ⚠️
...al/langchaingo/textsplitter/recursive_character.go	95.45%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   57.32%   61.99%   +4.66%     
==========================================
  Files          30       42      +12     
  Lines        3145     4305    +1160     
==========================================
+ Hits         1803     2669     +866     
- Misses       1104     1342     +238     
- Partials      238      294      +56

Flag	Coverage Δ
unittests	`61.99% <75.46%> (+4.66%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jvallesm

❓ Is the code in external/langchaingo fully copied from the langchaingo project? Why not importing the package? If not, what are the main differences? The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix). I was reviewing this code, then realised it comes from somewhere else.

jvallesm · 2024-05-29T06:02:48Z

external/langchaingo/textsplitter/token_splitter.go

+import (
+	"fmt"
+
+	"github.com/pkoukk/tiktoken-go"


👀 In pipeline-backend-cloud we use https://github.com/hupe1980/go-tiktoken, please consider using a single package across our codebase.

Since this is going to be used by an operator, I'd avoid making external calls so, if we need to keep this package, I'd use this loader.

@jvallesm

Why not importing the package? If not, what are the main differences?

There is no difference. I chose this way because of this discussion.

discussion
In short, the main reason is that importing all package needs to upgrade Go version.
And, I am not quite sure the risk about updating Go version.
So, I want to keep updating Go version in another branch.

The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix).

I agree! I am thinking about we transfer to import the package directly after we upgrade Go version.
However, I am not quite sure about this topic.
Do you have experience on this part?

It has its own data model and interfaces for connecting LLMs, database I/O, etc., which is a bit too heavy for a single Instill Component. We probably want to implement the code ourselves instead of importing it as a whole Go package.

I checked Task_Split_By_Token, and it used "github.com/pkoukk/tiktoken-go"

code

Is it different between pipeline-backend-cloud & pipeline-backend now?
Should we modify pipeline-backend-cloud?

I am not quite sure about whether it is a good source.
It seems the star / fork count is low.
Is it a standard to judge?

About the Go version, I'd import an earlier version of the package that is compatible with our Go version. If this is the reason to copy instead of import the code, a much more sensible approach is updating Go, then importing the latest version of the dependency, even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.

About the size of langchaingo, the rule of thumb I'd follow would be that it's a good argument as long as what you're trying to copy fits in your package (a function, a struct type). Creating 2 external packages and importing them adds confusion to the structure and maybe it's a smell that indicates that it's worth importing the source (and more honest). At this point I don't think binary size is a massive concern and I think there are plenty of ways to improve it anyways. In other SDKs like S3 or Stripe we're using a tiny subset of their features and we're not considering copying their source code.

Is it different between pipeline-backend-cloud & pipeline-backend now? Should we modify pipeline-backend-cloud?

pipeline-backend doesn't use it because this is needed for the Credit feature, implemented only on Cloud.

I am not quite sure about whether it is a good source.

I picked the other package (without being aware of tiktoken-go being used in component) because the encoding is embedded by default and because gpt-4o was quickly added after it was announced, so it fit our needs. The star count is definitely an indicator of the quality and health of the project but we shouldn't be guided solely by it and in this case the choice made sense. I'm not against modifying pipeline-backend-cloud instead of this package.

@jvallesm

even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.

Just share
In my previous experience / industry / company, losing functionality is not acceptable in terms of client sides. So, we barely upgrade the language version to keep the stability.

About Go version, I checked the release note.
It seems like what you said, it is not a big migration.
However, I still want to separate branches between upgrading Go version & this feature branch.
I think it will be clearer than combining them together.

How about we mark it as TODO and import the package after we upgrade the Go version?
What do you think about it?

I'm not against modifying pipeline-backend-cloud instead of this package.

I am confused about this. So, I want to confirm it again.
If we are going to import all langchaingo package, should we modify pipeline-backend-cloud to avoid two token package rather than pipeline-backend?

How about we mark it as TODO and import the package after we upgrade the Go version?
If we're going to tackle one task after the other, I'm not against moving forward like this.

If we are going to import all langchaingo package, should we modify pipeline-backend-cloud to avoid two token package rather than pipeline-backend?

Yep my suggestion is that we can keep using https://github.com/pkoukk/tiktoken-go here and update pipeline-backend-cloud to use it, too. I'll create a ticket and modify it in the next sprint, it's a low-hanging fruit.

@chuang8511
I'll upgrade the Golang version today in another PR.

If my understanding is correct, after we upgrade the Golang version, we don't need to modify any Golang code from LangChainGo, right? If so, let's just import it.

It has its own data model and interfaces for connecting LLMs, database I/O, etc., which is a bit too heavy for a single Instill Component. We probably want to implement the code ourselves instead of importing it as a whole Go package.

This only happens when you need to make modifications. If we don't update their code, we can just import it. The size of the package is not a problem, I remember that when using go build, the compiler will only compile the code you used.

I just checked https://github.com/pkoukk/tiktoken-go, and they have added GPT-4 support. I guess we can safely switch to it.

jvallesm · 2024-05-29T06:18:33Z

external/langchaingo/schema/chat_message_history.go

There are some files where everything is commented out. We should not include them.

jvallesm

About test case
There are several functions that we should mock to avoid uncertainty of unit test. However, it will damage the code structure. So, I decide not to mock split.SplitText & tkm.Encode

Why would these methods be undeterministic? Given the same model / encoding and input, text should be split or tokenised consistently.

I tried to refactor the part of switch e.Task. However, it damaged the readability when I kept trying to abstract the code.

Could you expand on what you had in mind for this refactorisation and what you tried to achieve? This switch pattern was added not so long ago to avoid code duplication so I'd like to know if it's confusing or missing anything.

I want to keep the Go version in this feature branch. So, I commented out the parts that are required to upgrade Go version when I cherry-pick the code.

Not in the scope of this PR but I'm for updating Go to the latest version. The current version is already preventing us from upgrading certain packages. cc @donch1989

jvallesm · 2024-05-29T06:33:47Z

operator/text/v0/chunk_text.go

+func chunkText(input ChunkTextInput) (ChunkTextOutput, error) {
+	var split textsplitter.TextSplitter
+	setting := input.Strategy.Setting
+	setting.SetDefault()


What's the point in having settings as user input (e.g. allowed special) if they're going to ve overriden? Shouldn't the default field in the config schema determine the default behaviour, rather than the Go code?

Shouldn't the default field in the config schema determine the default behaviour, rather than the Go code?

I also think so.
However, there is an unknown bug that I could not fix.
I also feel sad about it. 😢

Ref

But, in a positive way, it decreases the risk of the bugs come from other places.

Sorry, I reviewed without having all the context in mind. If this is a workaround to a bug that we can't address in this sprint, I'd add a TODO comment before this line explaining why it's needed and how / when it should be fixed.

jvallesm · 2024-05-29T06:35:49Z

operator/text/v0/chunk_text.go

+	if setting.ChunkMethod == "Token" {
+		tkm, err := tiktoken.EncodingForModel(setting.ModelName)
+		if err != nil {
+			return ChunkTextOutput{}, err
+		}
+		token := tkm.Encode(input.Text, setting.AllowedSpecial, setting.DisallowedSpecial)
+		output.TokenCount = len(token)
+	}


This can go in the switch statement above

As mentioned in a previous comment, I'd avoid external calls in this document. You can set an offline BPE loader by using https://github.com/pkoukk/tiktoken-go-loader

Replied here

As mentioned, I'm fine keeping this package and updating Pipeline to use it, too. But if we do so, I'd add the offline loader to avoid external calls.

chuang8511 · 2024-05-29T08:43:52Z

@jvallesm

About test case
There are several functions that we should mock to avoid uncertainty of unit test. However, it will damage the code structure. So, I decide not to mock split.SplitText & tkm.Encode
Why would these methods be undeterministic? Given the same model / encoding and input, text should be split or tokenised consistently.

split functions belong to external services. If there is an update in external service, the result could be failure.
In my opinion, unit test should mock the external service to focus on what we are testing.
And, the logic of the external service should contain their own test.

I tried to refactor the part of switch e.Task. However, it damaged the readability when I kept trying to abstract the code.
Could you expand on what you had in mind for this refactorisation and what you tried to achieve? This switch pattern was added not so long ago to avoid code duplication so I'd like to know if it's confusing or missing anything.

OK. I will checkout to another branch and push it for you to check it later!

I want to keep the Go version in this feature branch. So, I commented out the parts that are required to upgrade Go version when I cherry-pick the code.
Not in the scope of this PR but I'm for updating Go to the latest version. The current version is already preventing us from upgrading certain packages. cc @donch1989

OK. Thanks for letting me know!
After we check this topic, we can decide if we import the package or keep the current way.

jvallesm · 2024-05-29T09:14:58Z

split functions belong to external services. If there is an update in external service, the result could be failure.
In my opinion, unit test should mock the external service to focus on what we are testing.
And, the logic of the external service should contain their own test.

@chuang8511 by service you mean we're making external calls to split the text? I.e. not only to packages we import but to online APIs. Is there any call besides the encoding fetch from tiktoken (as mentioned, we can have it offline)? I think this is a task where the results shouldn't change if the dependencies are updated (except on e.g. a bugfix). My fear is that, by mocking the dependencies, we might be hiding the real behaviour of the component and have unexpected results even if the tests pass. Without mocking, a breaking test will signal a breaking change in a dependency. After inspecting the change, we can decide to incorporate the new behaviour if we find it's a legit modification.

I'd start by not mocking the depencencies (it's also less work on our side) and only consider mocking if we find the tests are flickering or requiring frequent updates.

chuang8511 · 2024-05-29T10:54:17Z

@jvallesm

My fear is that, by mocking the dependencies, we might be hiding the real behaviour of the component and have unexpected results even if the tests pass.

Before read
All contents below are my previous experience, but I think it is highly related to the character of the industry.

We divide unit test & integration test.
By integration test, we can monitor the dependencies.

We separate unit test & integration test because the integration test takes way longer time for CI.
The reason why taking longer time matters is related to two factors

The confidence of how much the test protects our function.
How urgent we need to release the functions.

I feel the confidence level below.

unit test -> 99% confidence.
- It means no matter how urgent it is, we cannot release without passing unit test.
integration test -> 80% confidence.
- It means we can come back to fix it later.

However, I missed considering that this external service won't take such long time.
And, the situation is totally different from my previous experience.
So, I think it will be good not to mock the external service here!

jvallesm · 2024-05-29T14:53:48Z

However, I missed considering that this external service won't take such long time.
And, the situation is totally different from my previous experience.
So, I think it will be good not to mock the external service here!

I understand your point about confidence & time. Although here we wouldn't be strictly unit-testing the package, we're tackling directly the input/output interface of our package. You get much more confidence by including the dependencies in your tests and, as you mentioned, the time difference is not significant. The added value in mocks here would be (in my opinion) testing edge cases such as error codes, etc. Here we're importing purely cpu-bound functions so it is not so different from not mocking a function in an official package like strings.

In DB tests you can see a more radical case of not following strict unit testing. We could use sqlmock to avoid needing a database running but in my experience production bugs tend to get leaked into mock expectations / return values. By not mocking the infra layer, we make sure that the package API implements the desired behaviour. E.g. testing a database lock with mocks is quite futile but testing them from the end-to-end tests is overkill given their execution cost. This middle ground is both quick and robust in terms of confidence.

chuang8511 · 2024-05-29T17:33:20Z

It is quite weird that I added main_test.go but the coverage did not increase....
Not sure how to check it yet.
I have searched in the report but did not find it.

donch1989 · 2024-05-30T07:29:38Z

Hi @chuang8511
I've upgraded the Go version.

donch1989

@chuang8511
I'll merge this PR first. Please note the improvement items, and we can optimize them afterward.

🤖 I have created a release *beep* *boop* --- ## [0.19.0-beta](v0.18.0-beta...v0.19.0-beta) (2024-06-05) ### Features * add pdf component ([#138](#138)) ([517afcf](517afcf)) * add task chunk text ([#139](#139)) ([7b36553](7b36553)) * **instill:** adopt latest Model endpoints ([#146](#146)) ([7f2537b](7f2537b)) * optimise ux for slack component ([#143](#143)) ([ed60235](ed60235)) * refactor package structure ([#140](#140)) ([4853d4c](4853d4c)) * support markdown to text function in text operator ([#149](#149)) ([dcbae37](dcbae37)) * unify component interface ([#144](#144)) ([ad35e10](ad35e10)) ### Bug Fixes * bug of failure of document component ([#152](#152)) ([aed51f8](aed51f8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Because - we want to enrich the functions to chunk text for knowledge base This commit - add the task to chunk

🤖 I have created a release *beep* *boop* --- ## [0.19.0-beta](instill-ai/component@v0.18.0-beta...v0.19.0-beta) (2024-06-05) ### Features * add pdf component ([instill-ai#138](instill-ai#138)) ([517afcf](instill-ai@517afcf)) * add task chunk text ([instill-ai#139](instill-ai#139)) ([7b36553](instill-ai@7b36553)) * **instill:** adopt latest Model endpoints ([instill-ai#146](instill-ai#146)) ([7f2537b](instill-ai@7f2537b)) * optimise ux for slack component ([instill-ai#143](instill-ai#143)) ([ed60235](instill-ai@ed60235)) * refactor package structure ([instill-ai#140](instill-ai#140)) ([4853d4c](instill-ai@4853d4c)) * support markdown to text function in text operator ([instill-ai#149](instill-ai#149)) ([dcbae37](instill-ai@dcbae37)) * unify component interface ([instill-ai#144](instill-ai#144)) ([ad35e10](instill-ai@ad35e10)) ### Bug Fixes * bug of failure of document component ([instill-ai#152](instill-ai#152)) ([aed51f8](instill-ai@aed51f8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

droplet-bot added the instill component label May 24, 2024

chuang8511 marked this pull request as ready for review May 28, 2024 19:57

chuang8511 requested review from donch1989, pinglin, xiaofei-du and jvallesm as code owners May 28, 2024 19:57

jvallesm requested changes May 29, 2024

View reviewed changes

jvallesm reviewed May 29, 2024

View reviewed changes

chuang8511 requested a review from jvallesm May 29, 2024 17:32

chuang8511 added 13 commits May 30, 2024 10:04

chore: Add a new task chunk to text component

41f747d

fix: add chunk file back to resolve conflict

a6564e6

fix: Revert test code

a4bbd0c

fix: Modify the chunk methoud for token split

16135e8

fix: Add recursive & markdown back to setting

3407fbe

fix: change the source package to resolve conflict

a68514f

fix: Fix lint error

f6de82b

fix: Fix golint error in GitHub Action

d7f1a23

fix: fix golintci error in external package

aaf29c1

feat: add test code to increase the coverage

3164076

fix: import the langchaingo package to resolve conflict

1f753a2

fix: modify test code

cf772ed

chore: update the package

9b6ec01

chuang8511 force-pushed the chunhao/ins-4499 branch from f51e04d to 9b6ec01 Compare May 30, 2024 09:19

chuang8511 added 2 commits May 30, 2024 10:32

chore: add documents for text and document operator

7e954fa

chore: add test cases to increase the coverage

70a1dd2

donch1989 approved these changes May 31, 2024

View reviewed changes

donch1989 merged commit 7b36553 into main May 31, 2024
8 checks passed

donch1989 deleted the chunhao/ins-4499 branch May 31, 2024 03:00

droplet-bot mentioned this pull request May 30, 2024

chore(main): release 0.19.0-beta #141

Merged

chuang8511 mentioned this pull request Jun 14, 2024

feat: add Anthropic component #161

Closed

namwoam pushed a commit to namwoam/component that referenced this pull request Jun 24, 2024

feat: add task chunk text (instill-ai#139)

7c58de8

Because - we want to enrich the functions to chunk text for knowledge base This commit - add the task to chunk

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add task chunk text #139

feat: add task chunk text #139

chuang8511 commented May 24, 2024 •

edited

Loading

linear bot commented May 24, 2024

codecov bot commented May 28, 2024 •

edited

Loading

jvallesm left a comment

jvallesm May 29, 2024

chuang8511 May 29, 2024

chuang8511 May 29, 2024

jvallesm May 29, 2024

jvallesm May 29, 2024

chuang8511 May 29, 2024

jvallesm May 29, 2024

donch1989 May 30, 2024 •

edited

Loading

donch1989 May 31, 2024

jvallesm May 29, 2024

jvallesm left a comment

jvallesm May 29, 2024

chuang8511 May 29, 2024

jvallesm May 29, 2024

jvallesm May 29, 2024

jvallesm May 29, 2024

chuang8511 May 29, 2024

jvallesm May 29, 2024

chuang8511 commented May 29, 2024 •

edited

Loading

jvallesm commented May 29, 2024

chuang8511 commented May 29, 2024

jvallesm commented May 29, 2024

chuang8511 commented May 29, 2024

donch1989 commented May 30, 2024

donch1989 left a comment

feat: add task chunk text #139

feat: add task chunk text #139

Conversation

chuang8511 commented May 24, 2024 • edited Loading

linear bot commented May 24, 2024

codecov bot commented May 28, 2024 • edited Loading

Codecov Report

jvallesm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

donch1989 May 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jvallesm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chuang8511 commented May 29, 2024 • edited Loading

jvallesm commented May 29, 2024

chuang8511 commented May 29, 2024

jvallesm commented May 29, 2024

chuang8511 commented May 29, 2024

donch1989 commented May 30, 2024

donch1989 left a comment

Choose a reason for hiding this comment

chuang8511 commented May 24, 2024 •

edited

Loading

codecov bot commented May 28, 2024 •

edited

Loading

donch1989 May 30, 2024 •

edited

Loading

chuang8511 commented May 29, 2024 •

edited

Loading