feat (documents): add a source code loader based on AST manipulation #6486

cristobalcl · 2023-06-20T16:38:21Z

Summary

A new approach to loading source code is implemented:

Each top-level function and class in the code is loaded into its own document. Then, an additional document is created with the remaining top-level code, in which each already-loaded function and class is replaced by a placeholder comment.

This could improve the accuracy of QA chains over source code.

For instance, given this script:

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()

The loader will create three documents with this content:

First document:

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

Second document:

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

Third document:

# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()

A threshold parameter is added to control whether small scripts are split in this way or not.
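
The splitting described above can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions, not the code added by this PR: the function name `split_top_level`, the `SAMPLE` constant, and the meaning given to `threshold` (a minimum number of lines below which the source is returned whole) are all hypothetical.

```python
from __future__ import annotations

import ast

# The example script from this description.
SAMPLE = '''class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
'''


def split_top_level(source: str, threshold: int = 0) -> list[str]:
    """Return one chunk per top-level function/class, then the remaining code.

    `threshold` is a hypothetical minimum line count: sources with fewer
    lines are returned unsplit, as a single chunk.
    """
    lines = source.splitlines()
    if len(lines) < threshold:
        return [source]

    chunks: list[str] = []
    remainder: list[str | None] = list(lines)
    definitions = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    for node in ast.parse(source).body:
        if not isinstance(node, definitions):
            continue
        # ast line numbers are 1-based; start at the first decorator, if any.
        first = min([d.lineno for d in node.decorator_list] + [node.lineno])
        start, end = first - 1, node.end_lineno
        chunks.append("\n".join(lines[start:end]))
        # Leave a one-line placeholder where the definition used to be.
        remainder[start] = f"# Code for: {lines[node.lineno - 1]}"
        for i in range(start + 1, end):
            remainder[i] = None

    chunks.append("\n".join(line for line in remainder if line is not None))
    return chunks
```

Calling `split_top_level(SAMPLE)` yields three chunks that correspond to the three documents listed above; passing a `threshold` larger than the script's line count returns the script as a single chunk.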

At the moment, only Python and JavaScript are supported. The appropriate parser is determined by examining the file extension.
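
Choosing a parser from the file extension could look roughly like the lookup below. The mapping, the language labels, and the `guess_language` helper are assumptions made for illustration; the PR's actual table and names may differ.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table covering the two languages mentioned above.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
}


def guess_language(path: str) -> Optional[str]:
    """Return a language label for `path`, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or load it as a single unsplit document.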

Tests

This PR adds:

Unit tests
Integration tests

Dependencies

Only one dependency was added, and it is optional (needed for the JavaScript parser).

Documentation

A notebook is added showing how the loader can be used.

Who can review?

@eyurtsev @hwchase17

hwchase17

@rlancemartin we have the concept of parsers - how does this relate to that?

hwchase17 · 2023-06-20T17:33:20Z

cc @eyurtsev for his thoughts as well

cristobalcl · 2023-06-20T19:49:49Z

> @rlancemartin we have the concept of parsers - how does this relate to that?

AFAIK parsers are used in LangChain to process the output of a model response and convert it to a Python struct.

In the context of this PR, "parser" refers to code that processes source code written in a programming language in order to split it into meaningful chunks of text.

Any suggestions about naming are welcome 😄

dev2049 · 2023-06-21T01:46:26Z

> @rlancemartin we have the concept of parsers - how does this relate to that?

@hwchase17 you're referring to BlobParsers here, right (not OutputParser)?

rlancemartin · 2023-06-25T20:58:02Z

Nice work! Interesting PR.

Suggestion: we can re-organize the code slightly to fit into a loader -> parser workflow as we do here.

(1) You should be able to use existing FileSystemBlobLoader here to load the .js or .py files. No new code should be required for loading.

(2) Then, just move your parsing logic into a new parser file in the directory here.

The UX is simple and we reduce code duplication (no new loading code). Specifically, the UX can follow what we do here where the loader and parser are called:

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()

In this case, of course it will look like:

# Parse js or py to docs
loader = GenericLoader(FileSystemBlobLoader(path="",glob=""), <New_Code_Parser>())
docs = loader.load()

Thoughts?
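
To make the loader -> parser separation concrete, here is a standard-library-only sketch of the pattern. It does not use LangChain: `Blob`, `iter_blobs`, `load`, `paragraph_parser`, and `demo` are hypothetical stand-ins that play the roles of the blob loader, the generic loader, and the parser.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterator


@dataclass
class Blob:
    """A reference to raw file content (toy stand-in, not LangChain's Blob)."""

    path: Path

    def as_string(self) -> str:
        return self.path.read_text(encoding="utf-8")


def iter_blobs(root: str, glob: str, suffixes: tuple[str, ...]) -> Iterator[Blob]:
    """Loading step: find matching files without interpreting them."""
    for path in sorted(Path(root).glob(glob)):
        if path.is_file() and path.suffix in suffixes:
            yield Blob(path)


def load(blobs: Iterator[Blob], parse: Callable[[Blob], Iterator[str]]) -> list[dict]:
    """Glue step: pass every blob to the parser and collect the documents."""
    return [
        {"page_content": text, "metadata": {"source": blob.path.name}}
        for blob in blobs
        for text in parse(blob)
    ]


def paragraph_parser(blob: Blob) -> Iterator[str]:
    """Parsing step (toy): one document per blank-line-separated block."""
    for block in blob.as_string().split("\n\n"):
        if block.strip():
            yield block.strip()


def demo() -> list[dict]:
    """Load two small files from a temporary directory; only .py/.js are kept."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.py").write_text("x = 1\n\ny = 2\n", encoding="utf-8")
        Path(tmp, "notes.txt").write_text("ignored\n", encoding="utf-8")
        return load(iter_blobs(tmp, "*", (".py", ".js")), paragraph_parser)
```

Because file discovery and parsing are independent, swapping `paragraph_parser` for a source-code-aware parser requires no change to the loading code, which is the point of the suggestion.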

cristobalcl · 2023-06-26T06:28:59Z

Oh, right! That makes much more sense. I wasn't familiar with GenericLoader, and I wasn't sure how parsers were used. Now that's clear.

Let me refactor the code.

cristobalcl · 2023-06-26T16:34:38Z

Done! It's much better now, in my opinion.

One thing to note: I've reused langchain.text_splitter.Language to avoid duplication, but maybe that class should be moved to a global module.

More suggestions are welcome!

rlancemartin · 2023-06-26T17:50:30Z

> Done! It's much better now, in my opinion.
>
> One thing to note: I've reused langchain.text_splitter.Language to avoid duplication, but maybe that class should be moved to a global module.
>
> More suggestions are welcome!

Nice! Yes, clean UX:

loader = GenericLoader.from_filesystem(
    "./example_data/languages",
    glob="*",
    suffixes=[".py", ".js"],
    parser=LanguageParser()
)
docs = loader.load()

Small clarification in the docs:

Will it split on classes and functions by default? Can this be configured (e.g., to split only on functions)?
Will the "top-level code, but without the already loaded functions and classes" document always be generated?

This is looking good. I think we only need to clarify usage a bit.

In short, it looks like this will, by default, split any .js or .py file based on class and function definitions?

cristobalcl · 2023-06-27T16:14:30Z

Exactly, it only splits, or segments, the source code based on top-level classes and functions, and the remaining code.

I've edited the documentation trying to make it more straightforward. Also, the docstring mentions the only parameters that can be configured: language and parser_threshold.

Should I add an example in the notebook like the one I mention in the first post in this PR? I thought that the code and the results would be self-explanatory, but perhaps it's not clear enough.

rlancemartin · 2023-06-27T18:34:16Z

> Exactly, it only splits, or segments, the source code based on top-level classes and functions, and the remaining code.
>
> I've edited the documentation trying to make it more straightforward. Also, the docstring mentions the only parameters that can be configured: language and parser_threshold.
>
> Should I add an example in the notebook like the one I mention in the first post in this PR? I thought that the code and the results would be self-explanatory, but perhaps it's not clear enough.

Nice work! No need to add more examples. I'm running tests now and will plan to merge this once they all pass.

cristobalcl · 2023-06-28T05:59:04Z

Nice, thanks for your help, and for those last fixes! 😃

@LeilaChr

## Description

I am submitting this for a school project as part of a team of 5. Other team members are @LeilaChr, @maazh10, @Megabear137, @jelalalamy. This PR also has contributions from community members @Harrolee and @Mario928. Initial context is in the issue we opened (#11229).

This pull request adds:

- Generic framework for expanding the languages that `LanguageParser` can handle, using the [tree-sitter](https://github.com/tree-sitter/py-tree-sitter#py-tree-sitter) parsing library and existing language-specific parsers written for it
- Support for the following additional languages in `LanguageParser`:
  - C
  - C++
  - C#
  - Go
  - Java (contributed by @Mario928 ThatsJustCheesy#2)
  - Kotlin
  - Lua
  - Perl
  - Ruby
  - Rust
  - Scala
  - TypeScript (contributed by @Harrolee ThatsJustCheesy#1)

Here is the [design document](https://docs.google.com/document/d/17dB14cKCWAaiTeSeBtxHpoVPGKrsPye8W0o_WClz2kk) if curious, but no need to read it.

## Issues

- Closes #11229
- Closes #10996
- Closes #8405

## Dependencies

`tree_sitter` and `tree_sitter_languages` on PyPI. We have tried to add these as optional dependencies.

## Documentation

We have updated the list of supported languages, and also added a section to `source_code.ipynb` detailing how to add support for additional languages using our framework.

## Maintainer

- @hwchase17 (previously reviewed #6486)

Thanks!!

## Git commits

We will gladly squash any/all of our commits (esp merge commits) if necessary. Let us know if this is desirable, or if you will be squash-merging anyway.

---------

Co-authored-by: Maaz Hashmi <mhashmi373@gmail.com>
Co-authored-by: LeilaChr <87657694+LeilaChr@users.noreply.github.com>
Co-authored-by: Jeremy La <jeremylai511@gmail.com>
Co-authored-by: Megabear137 <zubair.alnoor27@gmail.com>
Co-authored-by: Lee Harrold <lhharrold@sep.com>
Co-authored-by: Mario928 <88029051+Mario928@users.noreply.github.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

Add LanguageParser

022737a

Merge

f140c16

hwchase17 reviewed Jun 20, 2023

hwchase17 assigned rlancemartin Jun 20, 2023

cristobalcl added 2 commits June 20, 2023 21:41

Mark test requirement

7a29e21

Fix some linting issues

20ade7b

dev2049 added the "03 enhancement" (Enhancement of existing functionality) and "Ɑ: doc loader" (Related to document loader module, not documentation) labels Jun 21, 2023

cristobalcl added 2 commits June 21, 2023 16:35

Merge

a60f34d

Fix linting

fbda4eb

cristobalcl added 2 commits June 23, 2023 16:33

Fix linting

4b1ba42

Merge

d4ce5e2

cristobalcl added 3 commits June 26, 2023 18:08

Refactor to use the code as a parser for GenericLoader

b4d21de

Update LanguageParser notebook for doc

bbe75ee

Add some documentation

b512788

cristobalcl added 2 commits June 27, 2023 17:59

Fix tests

b63471a

Improve documentation

58b17a4

Formatting and re-name notebook for documentation

027f2f8

rlancemartin merged commit e494b0a into langchain-ai:master Jun 27, 2023
14 checks passed

rlancemartin mentioned this pull request Jun 29, 2023

Grobid parser for Scientific Articles from PDF #6729

Merged

ThatsJustCheesy mentioned this pull request Nov 14, 2023

Framework for supporting more languages in LanguageParser #13318

Merged
