Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add task chunk text #139

Merged
merged 15 commits into from
May 31, 2024
Merged

feat: add task chunk text #139

merged 15 commits into from
May 31, 2024

Conversation

chuang8511
Copy link
Member

@chuang8511 chuang8511 commented May 24, 2024

Because

  • we want to enrich the functions to chunk text for knowledge base

This commit

  • add the task to chunk

Copy link

codecov bot commented May 28, 2024

Codecov Report

Attention: Patch coverage is 75.46083% with 213 lines in your changes are missing coverage. Please review.

Project coverage is 61.99%. Comparing base (4ce9851) to head (f92868c).
Report is 48 commits behind head on main.

Current head f92868c differs from pull request most recent head 70a1dd2

Please upload reports for the commit 70a1dd2 to get more accurate results.

Files Patch % Lines
...rnal/langchaingo/textsplitter/markdown_splitter.go 77.17% 103 Missing and 15 partials ⚠️
external/langchaingo/textsplitter/options.go 49.18% 30 Missing and 1 partial ⚠️
...ternal/langchaingo/textsplitter/split_documents.go 70.37% 20 Missing and 4 partials ⚠️
operator/text/v0/main.go 0.00% 15 Missing ⚠️
operator/text/v0/chunk_text.go 83.75% 9 Missing and 4 partials ⚠️
...xternal/langchaingo/textsplitter/token_splitter.go 81.25% 6 Missing and 3 partials ⚠️
...al/langchaingo/textsplitter/recursive_character.go 95.45% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   57.32%   61.99%   +4.66%     
==========================================
  Files          30       42      +12     
  Lines        3145     4305    +1160     
==========================================
+ Hits         1803     2669     +866     
- Misses       1104     1342     +238     
- Partials      238      294      +56     
Flag Coverage Δ
unittests 61.99% <75.46%> (+4.66%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@chuang8511 chuang8511 marked this pull request as ready for review May 28, 2024 19:57
Copy link
Collaborator

@jvallesm jvallesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ Is the code in external/langchaingo fully copied from the langchaingo project? Why not importing the package? If not, what are the main differences? The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix). I was reviewing this code, then realised it comes from somewhere else.

import (
"fmt"

"github.com/pkoukk/tiktoken-go"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 In pipeline-backend-cloud we use https://github.com/hupe1980/go-tiktoken, please consider using a single package across our codebase.

Since this is going to be used by an operator, I'd avoid making external calls so, if we need to keep this package, I'd use this loader.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvallesm

Why not importing the package? If not, what are the main differences?

There is no difference. I chose this way because of this discussion.

  • discussion
    In short, the main reason is that importing all package needs to upgrade Go version.
    And, I am not quite sure the risk about updating Go version.
    So, I want to keep updating Go version in another branch.

The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix).

I agree! I am thinking about we transfer to import the package directly after we upgrade Go version.
However, I am not quite sure about this topic.
Do you have experience on this part?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked Task_Split_By_Token, and it used "github.com/pkoukk/tiktoken-go"

Is it different between pipeline-backend-cloud & pipeline-backend now?
Should we modify pipeline-backend-cloud?

I am not quite sure about whether it is a good source.
It seems the star / fork count is low.
Is it a standard to judge?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • About the Go version, I'd import an earlier version of the package that is compatible with our Go version. If this is the reason to copy instead of import the code, a much more sensible approach is updating Go, then importing the latest version of the dependency, even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.
  • About the size of langchaingo, the rule of thumb I'd follow would be that it's a good argument as long as what you're trying to copy fits in your package (a function, a struct type). Creating 2 external packages and importing them adds confusion to the structure and maybe it's a smell that indicates that it's worth importing the source (and more honest). At this point I don't think binary size is a massive concern and I think there are plenty of ways to improve it anyways. In other SDKs like S3 or Stripe we're using a tiny subset of their features and we're not considering copying their source code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it different between pipeline-backend-cloud & pipeline-backend now? Should we modify pipeline-backend-cloud?

pipeline-backend doesn't use it because this is needed for the Credit feature, implemented only on Cloud.

I am not quite sure about whether it is a good source.

I picked the other package (without being aware of tiktoken-go being used in component) because the encoding is embedded by default and because gpt-4o was quickly added after it was announced, so it fit our needs. The star count is definitely an indicator of the quality and health of the project but we shouldn't be guided solely by it and in this case the choice made sense. I'm not against modifying pipeline-backend-cloud instead of this package.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jvallesm

even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.

Just share
In my previous experience / industry / company, losing functionality is not acceptable in terms of client sides. So, we barely upgrade the language version to keep the stability.

About Go version, I checked the release note.
It seems like what you said, it is not a big migration.
However, I still want to separate branches between upgrading Go version & this feature branch.
I think it will be clearer than combining them together.

How about we mark it as TODO and import the package after we upgrade the Go version?
What do you think about it?

I'm not against modifying pipeline-backend-cloud instead of this package.

I am confused about this. So, I want to confirm it again.
If we are going to import all langchaingo package, should we modify pipeline-backend-cloud to avoid two token package rather than pipeline-backend?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we mark it as TODO and import the package after we upgrade the Go version?
If we're going to tackle one task after the other, I'm not against moving forward like this.

If we are going to import all langchaingo package, should we modify pipeline-backend-cloud to avoid two token package rather than pipeline-backend?

Yep my suggestion is that we can keep using https://github.com/pkoukk/tiktoken-go here and update pipeline-backend-cloud to use it, too. I'll create a ticket and modify it in the next sprint, it's a low-hanging fruit.

Copy link
Member

@donch1989 donch1989 May 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chuang8511
I'll upgrade the Golang version today in another PR.

If my understanding is correct, after we upgrade the Golang version, we don't need to modify any Golang code from LangChainGo, right? If so, let's just import it.

It has its own data model and interfaces for connecting LLMs, database I/O, etc., which is a bit too heavy for a single Instill Component. We probably want to implement the code ourselves instead of importing it as a whole Go package.

This only happens when you need to make modifications. If we don't update their code, we can just import it. The size of the package is not a problem, I remember that when using go build, the compiler will only compile the code you used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked https://github.com/pkoukk/tiktoken-go, and they have added GPT-4 support. I guess we can safely switch to it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some files where everything is commented out. We should not include them.

Copy link
Collaborator

@jvallesm jvallesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. About test case
    There are several functions that we should mock to avoid uncertainty of unit test. However, it will damage the code structure. So, I decide not to mock split.SplitText & tkm.Encode

Why would these methods be undeterministic? Given the same model / encoding and input, text should be split or tokenised consistently.

  1. I tried to refactor the part of switch e.Task. However, it damaged the readability when I kept trying to abstract the code.

Could you expand on what you had in mind for this refactorisation and what you tried to achieve? This switch pattern was added not so long ago to avoid code duplication so I'd like to know if it's confusing or missing anything.

  1. I want to keep the Go version in this feature branch. So, I commented out the parts that are required to upgrade Go version when I cherry-pick the code.

Not in the scope of this PR but I'm for updating Go to the latest version. The current version is already preventing us from upgrading certain packages. cc @donch1989

func chunkText(input ChunkTextInput) (ChunkTextOutput, error) {
var split textsplitter.TextSplitter
setting := input.Strategy.Setting
setting.SetDefault()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point in having settings as user input (e.g. allowed special) if they're going to ve overriden? Shouldn't the default field in the config schema determine the default behaviour, rather than the Go code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the default field in the config schema determine the default behaviour, rather than the Go code?

I also think so.
However, there is an unknown bug that I could not fix.
I also feel sad about it. 😢

But, in a positive way, it decreases the risk of the bugs come from other places.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I reviewed without having all the context in mind. If this is a workaround to a bug that we can't address in this sprint, I'd add a TODO comment before this line explaining why it's needed and how / when it should be fixed.

Comment on lines 117 to 122
if setting.ChunkMethod == "Token" {
tkm, err := tiktoken.EncodingForModel(setting.ModelName)
if err != nil {
return ChunkTextOutput{}, err
}
token := tkm.Encode(input.Text, setting.AllowedSpecial, setting.DisallowedSpecial)
output.TokenCount = len(token)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can go in the switch statement above

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in a previous comment, I'd avoid external calls in this document. You can set an offline BPE loader by using https://github.com/pkoukk/tiktoken-go-loader

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned, I'm fine keeping this package and updating Pipeline to use it, too. But if we do so, I'd add the offline loader to avoid external calls.

@chuang8511
Copy link
Member Author

chuang8511 commented May 29, 2024

@jvallesm

  1. About test case
    There are several functions that we should mock to avoid uncertainty of unit test. However, it will damage the code structure. So, I decide not to mock split.SplitText & tkm.Encode
    Why would these methods be undeterministic? Given the same model / encoding and input, text should be split or tokenised consistently.

split functions belong to external services. If there is an update in external service, the result could be failure.
In my opinion, unit test should mock the external service to focus on what we are testing.
And, the logic of the external service should contain their own test.

I tried to refactor the part of switch e.Task. However, it damaged the readability when I kept trying to abstract the code.
Could you expand on what you had in mind for this refactorisation and what you tried to achieve? This switch pattern was added not so long ago to avoid code duplication so I'd like to know if it's confusing or missing anything.

OK. I will checkout to another branch and push it for you to check it later!

I want to keep the Go version in this feature branch. So, I commented out the parts that are required to upgrade Go version when I cherry-pick the code.
Not in the scope of this PR but I'm for updating Go to the latest version. The current version is already preventing us from upgrading certain packages. cc @donch1989

OK. Thanks for letting me know!
After we check this topic, we can decide if we import the package or keep the current way.

@jvallesm
Copy link
Collaborator

split functions belong to external services. If there is an update in external service, the result could be failure.
In my opinion, unit test should mock the external service to focus on what we are testing.
And, the logic of the external service should contain their own test.

@chuang8511 by service you mean we're making external calls to split the text? I.e. not only to packages we import but to online APIs. Is there any call besides the encoding fetch from tiktoken (as mentioned, we can have it offline)? I think this is a task where the results shouldn't change if the dependencies are updated (except on e.g. a bugfix). My fear is that, by mocking the dependencies, we might be hiding the real behaviour of the component and have unexpected results even if the tests pass. Without mocking, a breaking test will signal a breaking change in a dependency. After inspecting the change, we can decide to incorporate the new behaviour if we find it's a legit modification.

I'd start by not mocking the depencencies (it's also less work on our side) and only consider mocking if we find the tests are flickering or requiring frequent updates.

@chuang8511
Copy link
Member Author

@jvallesm

My fear is that, by mocking the dependencies, we might be hiding the real behaviour of the component and have unexpected results even if the tests pass.

Before read
All contents below are my previous experience, but I think it is highly related to the character of the industry.

We divide unit test & integration test.
By integration test, we can monitor the dependencies.

We separate unit test & integration test because the integration test takes way longer time for CI.
The reason why taking longer time matters is related to two factors

  • The confidence of how much the test protects our function.
  • How urgent we need to release the functions.

I feel the confidence level below.

  • unit test -> 99% confidence.
    • It means no matter how urgent it is, we cannot release without passing unit test.
  • integration test -> 80% confidence.
    • It means we can come back to fix it later.

However, I missed considering that this external service won't take such long time.
And, the situation is totally different from my previous experience.
So, I think it will be good not to mock the external service here!

@jvallesm
Copy link
Collaborator

However, I missed considering that this external service won't take such long time.
And, the situation is totally different from my previous experience.
So, I think it will be good not to mock the external service here!

I understand your point about confidence & time. Although here we wouldn't be strictly unit-testing the package, we're tackling directly the input/output interface of our package. You get much more confidence by including the dependencies in your tests and, as you mentioned, the time difference is not significant. The added value in mocks here would be (in my opinion) testing edge cases such as error codes, etc. Here we're importing purely cpu-bound functions so it is not so different from not mocking a function in an official package like strings.

In DB tests you can see a more radical case of not following strict unit testing. We could use sqlmock to avoid needing a database running but in my experience production bugs tend to get leaked into mock expectations / return values. By not mocking the infra layer, we make sure that the package API implements the desired behaviour. E.g. testing a database lock with mocks is quite futile but testing them from the end-to-end tests is overkill given their execution cost. This middle ground is both quick and robust in terms of confidence.

@chuang8511 chuang8511 requested a review from jvallesm May 29, 2024 17:32
@chuang8511
Copy link
Member Author

It is quite weird that I added main_test.go but the coverage did not increase....
Not sure how to check it yet.
I have searched in the report but did not find it.

@donch1989
Copy link
Member

Hi @chuang8511
I've upgraded the Go version.

Copy link
Member

@donch1989 donch1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chuang8511
I'll merge this PR first. Please note the improvement items, and we can optimize them afterward.

@donch1989 donch1989 merged commit 7b36553 into main May 31, 2024
8 checks passed
@donch1989 donch1989 deleted the chunhao/ins-4499 branch May 31, 2024 03:00
donch1989 pushed a commit that referenced this pull request Jun 6, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.19.0-beta](v0.18.0-beta...v0.19.0-beta)
(2024-06-05)


### Features

* add pdf component
([#138](#138))
([517afcf](517afcf))
* add task chunk text
([#139](#139))
([7b36553](7b36553))
* **instill:** adopt latest Model endpoints
([#146](#146))
([7f2537b](7f2537b))
* optimise ux for slack component
([#143](#143))
([ed60235](ed60235))
* refactor package structure
([#140](#140))
([4853d4c](4853d4c))
* support markdown to text function in text operator
([#149](#149))
([dcbae37](dcbae37))
* unify component interface
([#144](#144))
([ad35e10](ad35e10))


### Bug Fixes

* bug of failure of document component
([#152](#152))
([aed51f8](aed51f8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
namwoam pushed a commit to namwoam/component that referenced this pull request Jun 24, 2024
Because

- we want to enrich the functions to chunk text for knowledge base

This commit

- add the task to chunk
namwoam pushed a commit to namwoam/component that referenced this pull request Jun 24, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.19.0-beta](instill-ai/component@v0.18.0-beta...v0.19.0-beta)
(2024-06-05)


### Features

* add pdf component
([instill-ai#138](instill-ai#138))
([517afcf](instill-ai@517afcf))
* add task chunk text
([instill-ai#139](instill-ai#139))
([7b36553](instill-ai@7b36553))
* **instill:** adopt latest Model endpoints
([instill-ai#146](instill-ai#146))
([7f2537b](instill-ai@7f2537b))
* optimise ux for slack component
([instill-ai#143](instill-ai#143))
([ed60235](instill-ai@ed60235))
* refactor package structure
([instill-ai#140](instill-ai#140))
([4853d4c](instill-ai@4853d4c))
* support markdown to text function in text operator
([instill-ai#149](instill-ai#149))
([dcbae37](instill-ai@dcbae37))
* unify component interface
([instill-ai#144](instill-ai#144))
([ad35e10](instill-ai@ad35e10))


### Bug Fixes

* bug of failure of document component
([instill-ai#152](instill-ai#152))
([aed51f8](instill-ai@aed51f8))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: 👋 Done
Development

Successfully merging this pull request may close these issues.

4 participants