Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) #5171

Merged
merged 12 commits into from
May 30, 2023

Conversation

ByronHsu
Copy link
Contributor

@ByronHsu ByronHsu commented May 24, 2023

As the title says, I added more code splitters.
The implementation is trivial, so i don't add separate tests for each splitter.
Let me know if any concerns.

Fixes # (issue)
#5170

Who can review?

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
@eyurtsev @hwchase17

…ift, rst)

Signed-off-by: byhsu <byhsu@linkedin.com>
@ByronHsu ByronHsu changed the title Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rst) Add more code splitters (go, rst, js, java, cpp, scala, ruby, php, swift, rust) May 24, 2023
Signed-off-by: byhsu <byhsu@linkedin.com>
@dev2049
Copy link
Contributor

dev2049 commented May 24, 2023

could we add some unit tests for all these :) may be easier to have sep pr for each, but will let you decide

@dev2049 dev2049 added the needs test PR needs to be updated with tests label May 24, 2023
@ByronHsu
Copy link
Contributor Author

@dev2049 ok i will add tests and split the pr :)

Signed-off-by: byhsu <byhsu@linkedin.com>
@ByronHsu
Copy link
Contributor Author

ByronHsu commented May 25, 2023

Added tests! plz take a look again. Thanks!

@dev2049

btw, may i ask why i run into mypy error locally, but not on github action?

~/learn-repo/langchain more-code-splitter !2 ❯ make lint                                                                                              4s  langchain byhsu@byhsu-ld1 22:50:19
poetry run mypy .
langchain/evaluation/loading.py:5: error: Incompatible import of "load_dataset" (imported name has type "Callable[[str, Optional[str], Optional[str], Union[str, Sequence[str], Mapping[str, Union[str, Sequence[str]]], None], Union[str, Split, None], Optional[str], Optional[Features], Optional[DownloadConfig], Optional[GenerateMode], bool, Optional[bool], bool, Union[str, Version, None], Union[bool, str, None], Union[str, TaskTemplate, None], bool, Any, KwArg(Any)], Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]]", local name has type "Callable[[str], List[Dict[Any, Any]]]")  [assignment]
langchain/evaluation/loading.py:8: error: No overload variant of "__getitem__" of "list" matches argument type "str"  [call-overload]
langchain/evaluation/loading.py:8: note: Possible overload variants:
langchain/evaluation/loading.py:8: note:     def __getitem__(self, SupportsIndex, /) -> Dict[Any, Any]
langchain/evaluation/loading.py:8: note:     def __getitem__(self, slice, /) -> List[Dict[Any, Any]]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "Dataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:81: error: Item "IterableDataset" of "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" has no attribute "keys"  [union-attr]
langchain/document_loaders/hugging_face_dataset.py:82: error: Value of type "Union[DatasetDict, Dataset, IterableDatasetDict, IterableDataset]" is not indexable  [index]
Found 5 errors in 2 files (checked 1024 source files)
make: *** [lint] Error 1

@eyurtsev eyurtsev requested review from dev2049 and eyurtsev May 25, 2023 14:54
byhsu added 2 commits May 27, 2023 00:56
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
@ByronHsu
Copy link
Contributor Author

@eyurtsev Can you review? Thanks!

Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor suggestion with enum. Let's update the exception to a ValueError rather than throwing a generic exception.

langchain/text_splitter.py Outdated Show resolved Hide resolved
@@ -439,3 +440,261 @@ def __init__(self, **kwargs: Any):
"",
]
super().__init__(separators=separators, **kwargs)


class Language(str, Enum):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

Have you considered using a union of literals instead of an enum?

Language = Union[Literal['go'], Literal['js'], ...]

Gives the same static typing guarantees and makes it such that a user does not have to import an extra enum.

I think the only downside is that it's harder to list all the options in the set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I would think prompting available languages is more important to users

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe VSCode/PyLance would still prompt appropriately for a union of literals, but the enum is more foolproof.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE auto-completion will work in either case

I meant that with enums it may be more convenient for a developer to import the enum and then do something like looping through all the possibilities.

In [1]: from enum import Enum

In [2]: class A(Enum):
   ...:     foo = 'foo'
   ...:     bar = 'bar'
   ...: 

In [4]: list(A)
Out[4]: [<A.foo: 'foo'>, <A.bar: 'bar'>]

Anyway this works for now. I think we could propagate the file name through in the document metadata via the source field, then the code splitter will be able to automatically split based on the language potentially

@eyurtsev
Copy link
Collaborator

@ByronHsu i can take a look at mypy error on monday. i assume you checked from master?

byhsu added 3 commits May 28, 2023 20:24
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
@dev2049 dev2049 added the 03 enhancement Enhancement of existing functionality label May 29, 2023
byhsu added 4 commits May 29, 2023 18:29
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
Signed-off-by: byhsu <byhsu@linkedin.com>
@ByronHsu
Copy link
Contributor Author

@dev2049 i've added notebook examples and improved the tests. Could you plz review again? Thanks!

@eyurtsev eyurtsev added lgtm PR looks good. Use to confirm that a PR is ready for merging. maintainer-to-merge and removed needs test PR needs to be updated with tests labels May 30, 2023
@eyurtsev eyurtsev merged commit 9d658aa into langchain-ai:master May 30, 2023
13 checks passed
vowelparrot pushed a commit that referenced this pull request May 31, 2023
…ift, rust) (#5171)

As the title says, I added more code splitters.
The implementation is trivial, so i don't add separate tests for each
splitter.
Let me know if any concerns.

Fixes # (issue)
#5170

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @hwchase17

---------

Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
…ift, rust) (langchain-ai#5171)

As the title says, I added more code splitters.
The implementation is trivial, so i don't add separate tests for each
splitter.
Let me know if any concerns.

Fixes # (issue)
langchain-ai#5170

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
@eyurtsev @hwchase17

---------

Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
This was referenced Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
03 enhancement Enhancement of existing functionality lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants