Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat (documents): add a source code loader based on AST manipulation #6486

Merged
merged 14 commits into from
Jun 27, 2023

Conversation

cristobalcl
Copy link
Contributor

Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into separate documents. Then, an additional document is created with the top-level code, but without the already loaded functions and classes.

This could improve the accuracy of QA chains over source code.

For instance, having this script:

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()

The loader will create three documents with this content:

First document:

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

Second document:

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

Third document:

# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()

A threshold parameter is added to control whether small scripts are split in this way or not.

At this moment, only Python and JavaScript are supported. The appropriate parser is determined by examining the file extension.

Tests

This PR adds:

  • Unit tests
  • Integration tests

Dependencies

Only one dependency was added as optional (needed for the JavaScript parser).

Documentation

A notebook is added showing how the loader can be used.

Who can review?

@eyurtsev @hwchase17

@vercel
Copy link

vercel bot commented Jun 20, 2023

@cristobalcl is attempting to deploy a commit to the LangChain Team on Vercel.

A member of the Team first needs to authorize it.

Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rlancemartin we have the concept of parsers - how does this relate to that?

@hwchase17
Copy link
Contributor

cc @eyurtsev for his thoughts as well

@cristobalcl
Copy link
Contributor Author

@rlancemartin we have the concept of parsers - how does this relate to that?

AFAIK parsers are used in LangChain to process the output of a model response and convert it to a Python struct.

In the context of this PR, parser refers to the manipulation of a source code written in a programming language, in order to separate chunks of text in a meaningful way.

Any suggestion about naming is welcome 😄

@dev2049 dev2049 added 03 enhancement Enhancement of existing functionality Ɑ: doc loader Related to document loader module (not documentation) labels Jun 21, 2023
@dev2049
Copy link
Contributor

dev2049 commented Jun 21, 2023

@rlancemartin we have the concept of parsers - how does this relate to that?

@hwchase17 you're referring to BlobParsers here, right (not OutputParser)

@vercel
Copy link

vercel bot commented Jun 21, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Jun 27, 2023 9:38pm

@rlancemartin
Copy link
Collaborator

Nice work! Interesting PR.

Suggestion: we can re-organize the code slightly to fit into a loader -> parser workflow as we do here.

(1) You should be able to use existing FileSystemBlobLoader here to load the .js or .py files. No new code should be required for loading.

(2) Then, just move your parsing logic into a new parser file in the directory here.

The UX is simple and we reduce code duplication (no new loading code). Specifically, the UX can follow what we do here where the loader and parser are called:

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()

In this case, of course it will look like:

# Parse js or py to docs
loader = GenericLoader(FileSystemBlobLoader(path="",glob=""), <New_Code_Parser>())
docs = loader.load()

Thoughts?

@cristobalcl
Copy link
Contributor Author

Oh, right! That makes much more sense. I wasn't familiar with GenericLoader, and I wasn't sure how parsers were used. Now that's clear.

Let me refactor the code.

@cristobalcl
Copy link
Contributor Author

Done! It's much better now, in my opinion.

One thing to note: I've reused langchain.text_splitter.Language to avoid duplicate things, but maybe that class should be moved to a global module.

More suggestions are welcome!

@rlancemartin
Copy link
Collaborator

rlancemartin commented Jun 26, 2023

Done! It's much better now, in my opinion.

One thing to note: I've reused langchain.text_splitter.Language to avoid duplicate things, but maybe that class should be moved to a global module.

More suggestions are welcome!

Nice! Yes, clean UX:

loader = GenericLoader.from_filesystem(
    "./example_data/languages",
    glob="*",
    suffixes=[".py", ".js"],
    parser=LanguageParser()
)
docs = loader.load()

Small clarification in the docs:

  1. it will default split on classes and functions? can this be configured (e.g., only split on functions)?
  2. will the top-level code, but without the already loaded functions and classes doc always be generated?

This is looking good. I think we only need to clarify usage a bit.

In short, it simply looks like this will by default split any js or py based on class and function definitions?

@cristobalcl
Copy link
Contributor Author

Exactly, it only splits, or segments, the source code based on top-level classes and functions, and the remaining code.

I've edited the documentation trying to make it more straightforward. Also, the docstring mentions the only parameters that can be configured: language and parser_threshold.

Should I add an example in the notebook like the one I mention in the first post in this PR? I thought that the code and the results would be self-explanatory, but perhaps it's not clear enough.

@rlancemartin
Copy link
Collaborator

Exactly, it only splits, or segments, the source code based on top-level classes and functions, and the remaining code.

I've edited the documentation trying to make it more straightforward. Also, the docstring mentions the only parameters that can be configured: language and parser_threshold.

Should I add an example in the notebook like the one I mention in the first post in this PR? I thought that the code and the results would be self-explanatory, but perhaps it's not clear enough.

Nice work! No need to add more examples. I'm running tests now and will plan to merge this once they all pass.

@rlancemartin rlancemartin merged commit e494b0a into langchain-ai:master Jun 27, 2023
14 checks passed
@cristobalcl
Copy link
Contributor Author

Nice, thanks for your help, and for that last fixes! 😃

kacperlukawski pushed a commit to kacperlukawski/langchain that referenced this pull request Jun 29, 2023
…angchain-ai#6486)

#### Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into separate
documents. Then, an additional document is created with the top-level
code, but without the already loaded functions and classes.

This could improve the accuracy of QA chains over source code.

For instance, having this script:

```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
```

The loader will create three documents with this content:

First document:
```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")
```

Second document:
```
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
```

Third document:
```
# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()
```

A threshold parameter is added to control whether small scripts are
split in this way or not.

At this moment, only Python and JavaScript are supported. The
appropriate parser is determined by examining the file extension.

#### Tests

This PR adds:

- Unit tests
- Integration tests

#### Dependencies

Only one dependency was added as optional (needed for the JavaScript
parser).

#### Documentation

A notebook is added showing how the loader can be used.

#### Who can review?

@eyurtsev @hwchase17

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
vowelparrot pushed a commit that referenced this pull request Jul 4, 2023
…6486)

#### Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into separate
documents. Then, an additional document is created with the top-level
code, but without the already loaded functions and classes.

This could improve the accuracy of QA chains over source code.

For instance, having this script:

```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
```

The loader will create three documents with this content:

First document:
```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")
```

Second document:
```
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
```

Third document:
```
# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()
```

A threshold parameter is added to control whether small scripts are
split in this way or not.

At this moment, only Python and JavaScript are supported. The
appropriate parser is determined by examining the file extension.

#### Tests

This PR adds:

- Unit tests
- Integration tests

#### Dependencies

Only one dependency was added as optional (needed for the JavaScript
parser).

#### Documentation

A notebook is added showing how the loader can be used.

#### Who can review?

@eyurtsev @hwchase17

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
aerrober pushed a commit to aerrober/langchain-fork that referenced this pull request Jul 24, 2023
…angchain-ai#6486)

#### Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into separate
documents. Then, an additional document is created with the top-level
code, but without the already loaded functions and classes.

This could improve the accuracy of QA chains over source code.

For instance, having this script:

```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
```

The loader will create three documents with this content:

First document:
```
class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")
```

Second document:
```
def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()
```

Third document:
```
# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()
```

A threshold parameter is added to control whether small scripts are
split in this way or not.

At this moment, only Python and JavaScript are supported. The
appropriate parser is determined by examining the file extension.

#### Tests

This PR adds:

- Unit tests
- Integration tests

#### Dependencies

Only one dependency was added as optional (needed for the JavaScript
parser).

#### Documentation

A notebook is added showing how the loader can be used.

#### Who can review?

@eyurtsev @hwchase17

---------

Co-authored-by: rlm <pexpresss31@gmail.com>
hwchase17 added a commit that referenced this pull request Feb 13, 2024
## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
- Java (contributed by @Mario928
ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
- TypeScript (contributed by @Harrolee
ThatsJustCheesy#1)

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.

## Issues

- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed
#6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.

<!-- Thank you for contributing to LangChain!

Replace this entire comment with:
  - **Description:** a description of the change, 
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!

Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.

See contribution guidelines for more information on how to write/run
tests, lint, etc:

https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in `docs/extras`
directory.

If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
 -->

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
snsten pushed a commit to snsten/langchain that referenced this pull request Feb 15, 2024
…ai#13318)

## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (langchain-ai#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
- Java (contributed by @Mario928
ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
- TypeScript (contributed by @Harrolee
ThatsJustCheesy#1)

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.

## Issues

- Closes langchain-ai#11229
- Closes langchain-ai#10996
- Closes langchain-ai#8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed
langchain-ai#6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.

<!-- Thank you for contributing to LangChain!

Replace this entire comment with:
  - **Description:** a description of the change, 
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!

Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.

See contribution guidelines for more information on how to write/run
tests, lint, etc:

https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in `docs/extras`
directory.

If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
 -->

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
haydeniw pushed a commit to haydeniw/langchain that referenced this pull request Feb 27, 2024
…ai#13318)

## Description

I am submitting this for a school project as part of a team of 5. Other
team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR
also has contributions from community members @Harrolee and @Mario928.

Initial context is in the issue we opened (langchain-ai#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser`
can handle, using the
[tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter)
parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
- Java (contributed by @Mario928
ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
- TypeScript (contributed by @Harrolee
ThatsJustCheesy#1)

Here is the [design
document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk)
if curious, but no need to read it.

## Issues

- Closes langchain-ai#11229
- Closes langchain-ai#10996
- Closes langchain-ai#8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add
these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a
section to `source_code.ipynb` detailing how to add support for
additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed
langchain-ai#6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if
necessary. Let us know if this is desirable, or if you will be
squash-merging anyway.

<!-- Thank you for contributing to LangChain!

Replace this entire comment with:
  - **Description:** a description of the change, 
  - **Issue:** the issue # it fixes (if applicable),
  - **Dependencies:** any dependencies required for this change,
- **Tag maintainer:** for a quicker response, tag the relevant
maintainer (see below),
- **Twitter handle:** we announce bigger features on Twitter. If your PR
gets announced, and you'd like a mention, we'll gladly shout you out!

Please make sure your PR is passing linting and testing before
submitting. Run `make format`, `make lint` and `make test` to check this
locally.

See contribution guidelines for more information on how to write/run
tests, lint, etc:

https://github.com/langchain-ai/langchain/blob/master/.github/CONTRIBUTING.md

If you're adding a new integration, please include:
1. a test for the integration, preferably unit tests that do not rely on
network access,
2. an example notebook showing its use. It lives in `docs/extras`
directory.

If no one reviews your PR within a few days, please @-mention one of
@baskaryan, @eyurtsev, @hwchase17.
 -->

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
03 enhancement Enhancement of existing functionality Ɑ: doc loader Related to document loader module (not documentation)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants