add code splitter #7100

Merged: 11 commits merged into run-llama:main on Aug 1, 2023

@yisding (Collaborator) commented Jul 31, 2023

Description

Add code splitter to text splitters. Thanks to @kevinlu1248 from Sweep AI for the idea and push.

Type of Change

Please delete options that are not relevant.

  • [x] New feature (non-breaking change which adds functionality)
  • [x] This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • [x] Added new unit/integration tests

Suggested Checklist:

  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes

@yisding (Collaborator, Author) commented Aug 1, 2023

@kevinlu1248 can you take a look at the auto-language detection logic? I don't think it's strictly necessary, although it's a cool idea if it works.

@kevinlu1248 (Contributor) commented

Taking a look. I'll play with it a bit later tonight.

@kevinlu1248 (Contributor) commented

Talked to Yi, and the plan is to ship this for now and add auto-language detection later, possibly using https://github.com/yoeo/guesslang. I also added tests for several additional common languages.

Technical detail: it seems tree-sitter-languages' version of the TypeScript parser can also parse TSX. At Sweep, we just use the TSX parser directly; I'm not sure what tree-sitter-languages is using under the hood.

@yisding yisding marked this pull request as ready for review August 1, 2023 03:27
@yisding yisding merged commit e567e6a into run-llama:main Aug 1, 2023
8 checks passed
@jerryjliu (Collaborator) commented

thanks guys this is awesome!

@jon-chuang (Contributor) commented

Re: "add auto-language detection later"

Another possibility: https://github.com/matthewdeanmartin/whats_that_code

Note that https://github.com/yoeo/guesslang requires TensorFlow.

@kevinlu1248 (Contributor) commented

Makes sense. A deep-learning-based language detector is overkill.
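
A lighter heuristic than a learned detector is to guess the language from the file extension. The sketch below is only an illustration under that assumption: `guess_language` and the `EXTENSION_TO_LANGUAGE` table are hypothetical names that are not part of this PR, and the language strings are assumed (not verified) to match whatever grammar names the parser expects.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to a parser language name.
# The values are assumptions; adjust them to the names your parser accepts.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".tsx": "tsx",
    ".go": "go",
    ".rs": "rust",
    ".java": "java",
    ".c": "c",
    ".cpp": "cpp",
}


def guess_language(filename: str) -> Optional[str]:
    """Return a language name for a file, or None if the extension is unknown."""
    # Lower-case the suffix so "App.TSX" and "App.tsx" are treated the same.
    suffix = Path(filename).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)
```

This only works when a file name is available; a bare snippet with no name would still need a content-based detector or an explicit language argument.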

@SebastianBodza commented

Could it be that the chunk_lines parameter is not used at all, and the splitter only chunks based on max_chars?

Seems not to be used in the Code ...
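
To illustrate the behavior being asked about, here is a minimal sketch of a splitter that packs syntax nodes into chunks by `max_chars` alone and accepts `chunk_lines` without ever reading it. It uses Python's standard-library `ast` module as a stand-in for tree-sitter, handles only top-level statements, and `split_code` is a hypothetical name. It is not the code merged in this PR.

```python
import ast
from typing import List


def split_code(source: str, max_chars: int = 1500, chunk_lines: int = 40) -> List[str]:
    """Greedily pack top-level statements into chunks of roughly max_chars.

    chunk_lines is accepted but deliberately ignored, to mirror the
    behavior described in the question above. A single statement longer
    than max_chars is kept whole, so a chunk can exceed the limit.
    """
    lines = source.splitlines(keepends=True)
    tree = ast.parse(source)

    chunks: List[str] = []
    current = ""
    last_end = 0  # index of the first source line not yet emitted

    for node in tree.body:
        # Take everything from the previous statement's end up to this
        # statement's end, so comments, blank lines, and decorators between
        # statements are preserved.
        piece = "".join(lines[last_end:node.end_lineno])
        last_end = node.end_lineno

        # Start a new chunk when adding this piece would exceed the limit.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece

    # Keep any trailing comments or blank lines after the last statement.
    current += "".join(lines[last_end:])
    if current:
        chunks.append(current)
    return chunks
```

Because each piece is a contiguous slice of the source, joining the chunks reproduces the input exactly, and changing `chunk_lines` has no effect on the result.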


6 participants