-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add task chunk text #139
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #139 +/- ##
==========================================
+ Coverage 57.32% 61.99% +4.66%
==========================================
Files 30 42 +12
Lines 3145 4305 +1160
==========================================
+ Hits 1803 2669 +866
- Misses 1104 1342 +238
- Partials 238 294 +56
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Is the code in external/langchaingo
fully copied from the langchaingo
project? Why not importing the package? If not, what are the main differences? The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix). I was reviewing this code, then realised it comes from somewhere else.
import ( | ||
"fmt" | ||
|
||
"github.com/pkoukk/tiktoken-go" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 In pipeline-backend-cloud
we use https://github.com/hupe1980/go-tiktoken, please consider using a single package across our codebase.
Since this is going to be used by an operator, I'd avoid making external calls so, if we need to keep this package, I'd use this loader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not importing the package? If not, what are the main differences?
There is no difference. I chose this way because of this discussion.
- discussion
In short, the main reason is that importing all package needs to upgrade Go version.
And, I am not quite sure the risk about updating Go version.
So, I want to keep updating Go version in another branch.
The project is actively maintained so by duplicating its code we're losing any update they make (e.g. a vulnerability fix).
I agree! I am thinking about we transfer to import the package directly after we upgrade Go version.
However, I am not quite sure about this topic.
Do you have experience on this part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked Task_Split_By_Token, and it used "github.com/pkoukk/tiktoken-go"
Is it different between pipeline-backend-cloud
& pipeline-backend
now?
Should we modify pipeline-backend-cloud
?
I am not quite sure about whether it is a good source.
It seems the star / fork count is low.
Is it a standard to judge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- About the Go version, I'd import an earlier version of the package that is compatible with our Go version. If this is the reason to copy instead of import the code, a much more sensible approach is updating Go, then importing the latest version of the dependency, even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.
- About the size of
langchaingo
, the rule of thumb I'd follow would be that it's a good argument as long as what you're trying to copy fits in your package (a function, a struct type). Creating 2external
packages and importing them adds confusion to the structure and maybe it's a smell that indicates that it's worth importing the source (and more honest). At this point I don't think binary size is a massive concern and I think there are plenty of ways to improve it anyways. In other SDKs like S3 or Stripe we're using a tiny subset of their features and we're not considering copying their source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it different between pipeline-backend-cloud & pipeline-backend now? Should we modify pipeline-backend-cloud?
pipeline-backend
doesn't use it because this is needed for the Credit feature, implemented only on Cloud.
I am not quite sure about whether it is a good source.
I picked the other package (without being aware of tiktoken-go
being used in component
) because the encoding is embedded by default and because gpt-4o was quickly added after it was announced, so it fit our needs. The star count is definitely an indicator of the quality and health of the project but we shouldn't be guided solely by it and in this case the choice made sense. I'm not against modifying pipeline-backend-cloud
instead of this package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even if in the beginning we might potentially lose some functionality like a given token encoding. The version difference is small so this shouldn't be a big migration task.
Just share
In my previous experience / industry / company, losing functionality is not acceptable in terms of client sides. So, we barely upgrade the language version to keep the stability.
About Go version, I checked the release note.
It seems like what you said, it is not a big migration.
However, I still want to separate branches between upgrading Go version & this feature branch.
I think it will be clearer than combining them together.
How about we mark it as TODO and import the package after we upgrade the Go version?
What do you think about it?
I'm not against modifying pipeline-backend-cloud instead of this package.
I am confused about this. So, I want to confirm it again.
If we are going to import all langchaingo package, should we modify pipeline-backend-cloud
to avoid two token package rather than pipeline-backend
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we mark it as TODO and import the package after we upgrade the Go version?
If we're going to tackle one task after the other, I'm not against moving forward like this.
If we are going to import all langchaingo package, should we modify pipeline-backend-cloud to avoid two token package rather than pipeline-backend?
Yep my suggestion is that we can keep using https://github.com/pkoukk/tiktoken-go here and update pipeline-backend-cloud
to use it, too. I'll create a ticket and modify it in the next sprint, it's a low-hanging fruit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chuang8511
I'll upgrade the Golang version today in another PR.
If my understanding is correct, after we upgrade the Golang version, we don't need to modify any Golang code from LangChainGo, right? If so, let's just import it.
It has its own data model and interfaces for connecting LLMs, database I/O, etc., which is a bit too heavy for a single Instill Component. We probably want to implement the code ourselves instead of importing it as a whole Go package.
This only happens when you need to make modifications. If we don't update their code, we can just import it. The size of the package is not a problem, I remember that when using go build
, the compiler will only compile the code you used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just checked https://github.com/pkoukk/tiktoken-go, and they have added GPT-4 support. I guess we can safely switch to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some files where everything is commented out. We should not include them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- About test case
There are several functions that we should mock to avoid uncertainty of unit test. However, it will damage the code structure. So, I decide not to mocksplit.SplitText
&tkm.Encode
Why would these methods be undeterministic? Given the same model / encoding and input, text should be split or tokenised consistently.
- I tried to refactor the part of
switch e.Task
. However, it damaged the readability when I kept trying to abstract the code.
Could you expand on what you had in mind for this refactorisation and what you tried to achieve? This switch
pattern was added not so long ago to avoid code duplication so I'd like to know if it's confusing or missing anything.
- I want to keep the Go version in this feature branch. So, I commented out the parts that are required to upgrade Go version when I cherry-pick the code.
Not in the scope of this PR but I'm for updating Go to the latest version. The current version is already preventing us from upgrading certain packages. cc @donch1989
func chunkText(input ChunkTextInput) (ChunkTextOutput, error) { | ||
var split textsplitter.TextSplitter | ||
setting := input.Strategy.Setting | ||
setting.SetDefault() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point in having settings as user input (e.g. allowed special) if they're going to ve overriden? Shouldn't the default
field in the config schema determine the default behaviour, rather than the Go code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the default field in the config schema determine the default behaviour, rather than the Go code?
I also think so.
However, there is an unknown bug that I could not fix.
I also feel sad about it. 😢
But, in a positive way, it decreases the risk of the bugs come from other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I reviewed without having all the context in mind. If this is a workaround to a bug that we can't address in this sprint, I'd add a TODO comment before this line explaining why it's needed and how / when it should be fixed.
operator/text/v0/chunk_text.go
Outdated
if setting.ChunkMethod == "Token" { | ||
tkm, err := tiktoken.EncodingForModel(setting.ModelName) | ||
if err != nil { | ||
return ChunkTextOutput{}, err | ||
} | ||
token := tkm.Encode(input.Text, setting.AllowedSpecial, setting.DisallowedSpecial) | ||
output.TokenCount = len(token) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can go in the switch
statement above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in a previous comment, I'd avoid external calls in this document. You can set an offline BPE loader by using https://github.com/pkoukk/tiktoken-go-loader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I'm fine keeping this package and updating Pipeline to use it, too. But if we do so, I'd add the offline loader to avoid external calls.
split functions belong to external services. If there is an update in external service, the result could be failure.
OK. I will checkout to another branch and push it for you to check it later!
OK. Thanks for letting me know! |
@chuang8511 by service you mean we're making external calls to split the text? I.e. not only to packages we import but to online APIs. Is there any call besides the encoding fetch from tiktoken (as mentioned, we can have it offline)? I think this is a task where the results shouldn't change if the dependencies are updated (except on e.g. a bugfix). My fear is that, by mocking the dependencies, we might be hiding the real behaviour of the component and have unexpected results even if the tests pass. Without mocking, a breaking test will signal a breaking change in a dependency. After inspecting the change, we can decide to incorporate the new behaviour if we find it's a legit modification. I'd start by not mocking the depencencies (it's also less work on our side) and only consider mocking if we find the tests are flickering or requiring frequent updates. |
Before read We divide unit test & integration test. We separate unit test & integration test because the integration test takes way longer time for CI.
I feel the confidence level below.
However, I missed considering that this external service won't take such long time. |
I understand your point about confidence & time. Although here we wouldn't be strictly unit-testing the package, we're tackling directly the input/output interface of our package. You get much more confidence by including the dependencies in your tests and, as you mentioned, the time difference is not significant. The added value in mocks here would be (in my opinion) testing edge cases such as error codes, etc. Here we're importing purely cpu-bound functions so it is not so different from not mocking a function in an official package like strings. In DB tests you can see a more radical case of not following strict unit testing. We could use sqlmock to avoid needing a database running but in my experience production bugs tend to get leaked into mock expectations / return values. By not mocking the infra layer, we make sure that the package API implements the desired behaviour. E.g. testing a database lock with mocks is quite futile but testing them from the end-to-end tests is overkill given their execution cost. This middle ground is both quick and robust in terms of confidence. |
It is quite weird that I added main_test.go but the coverage did not increase.... |
Hi @chuang8511 |
f51e04d
to
9b6ec01
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chuang8511
I'll merge this PR first. Please note the improvement items, and we can optimize them afterward.
🤖 I have created a release *beep* *boop* --- ## [0.19.0-beta](v0.18.0-beta...v0.19.0-beta) (2024-06-05) ### Features * add pdf component ([#138](#138)) ([517afcf](517afcf)) * add task chunk text ([#139](#139)) ([7b36553](7b36553)) * **instill:** adopt latest Model endpoints ([#146](#146)) ([7f2537b](7f2537b)) * optimise ux for slack component ([#143](#143)) ([ed60235](ed60235)) * refactor package structure ([#140](#140)) ([4853d4c](4853d4c)) * support markdown to text function in text operator ([#149](#149)) ([dcbae37](dcbae37)) * unify component interface ([#144](#144)) ([ad35e10](ad35e10)) ### Bug Fixes * bug of failure of document component ([#152](#152)) ([aed51f8](aed51f8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Because - we want to enrich the functions to chunk text for knowledge base This commit - add the task to chunk
🤖 I have created a release *beep* *boop* --- ## [0.19.0-beta](instill-ai/component@v0.18.0-beta...v0.19.0-beta) (2024-06-05) ### Features * add pdf component ([instill-ai#138](instill-ai#138)) ([517afcf](instill-ai@517afcf)) * add task chunk text ([instill-ai#139](instill-ai#139)) ([7b36553](instill-ai@7b36553)) * **instill:** adopt latest Model endpoints ([instill-ai#146](instill-ai#146)) ([7f2537b](instill-ai@7f2537b)) * optimise ux for slack component ([instill-ai#143](instill-ai#143)) ([ed60235](instill-ai@ed60235)) * refactor package structure ([instill-ai#140](instill-ai#140)) ([4853d4c](instill-ai@4853d4c)) * support markdown to text function in text operator ([instill-ai#149](instill-ai#149)) ([dcbae37](instill-ai@dcbae37)) * unify component interface ([instill-ai#144](instill-ai#144)) ([ad35e10](instill-ai@ad35e10)) ### Bug Fixes * bug of failure of document component ([instill-ai#152](instill-ai#152)) ([aed51f8](instill-ai@aed51f8)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Because
This commit