# Language

This notebook covers how to load source code files using language-aware parsing.

Each top-level function and class in the code is loaded into a separate document. An additional document is then created with the remaining top-level code, in which each already-loaded function or class is replaced by a placeholder comment.

Splitting along these boundaries can improve the accuracy of QA chains over source code, because each retrieved document holds a complete function or class.
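
To make the segmentation idea concrete, here is a minimal sketch built on Python's standard `ast` module. It is not the loader's implementation: the helper name `segment_python` is hypothetical, and the sketch ignores decorators and non-Python files. It only illustrates how top-level definitions can be pulled out and replaced by `# Code for:` placeholders.

```python
import ast

SOURCE = '''class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")


def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()


if __name__ == "__main__":
    main()
'''


def segment_python(source):
    """Hypothetical helper: return (segments, simplified_code).

    segments holds the source of each top-level function or class;
    simplified_code is the remaining code with placeholder comments.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    segments = []
    simplified = list(lines)
    for node in tree.body:  # tree.body contains only top-level statements
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno requires Python 3.8+.
            start, end = node.lineno - 1, node.end_lineno
            segments.append("\n".join(lines[start:end]))
            # Keep one placeholder line and drop the rest of the definition.
            simplified[start] = f"# Code for: {lines[start]}"
            for i in range(start + 1, end):
                simplified[i] = None
    simplified_code = "\n".join(line for line in simplified if line is not None)
    return segments, simplified_code


segments, simplified_code = segment_python(SOURCE)
print(len(segments))
print(simplified_code)
```

Working from `tree.body` rather than walking the whole tree is what keeps methods such as `greet` inside their class instead of turning them into separate segments.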

Currently, only Python and JavaScript are supported. The parser is chosen based on the file extension.
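
Extension-based selection can be pictured with a small lookup table. The mapping and the `guess_language` helper below are assumptions made for illustration; the loader's actual table and function names may differ.

```python
from pathlib import Path

# Hypothetical mapping; the real loader may recognise other extensions.
EXTENSION_TO_LANGUAGE = {".py": "python", ".js": "javascript"}


def guess_language(file_path):
    """Return the language name for a path, or None if it is unsupported."""
    suffix = Path(file_path).suffix.lower()
    return EXTENSION_TO_LANGUAGE.get(suffix)


print(guess_language("./example_data/example.py"))
print(guess_language("./example_data/example.js"))
print(guess_language("./example_data/notes.txt"))
```

Returning `None` for an unknown extension leaves it to the caller to decide whether to fall back to loading the file as a single unparsed document.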

In [1]:
from pprint import pprint

from langchain.document_loaders import LanguageLoader

In [2]:
loader = LanguageLoader(file_path="./example_data/example.py")

data = loader.load()

In [3]:
len(data)

3

In [4]:
for document in data:
    pprint(document.metadata)

{'content_type': 'functions_classes',
 'language': 'python',
 'source': './example_data/example.py'}
{'content_type': 'functions_classes',
 'language': 'python',
 'source': './example_data/example.py'}
{'content_type': 'simplified_code',
 'language': 'python',
 'source': './example_data/example.py'}


In [5]:
print("\n\n--8<--\n\n".join([document.page_content for document in data]))

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

--8<--

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

--8<--

# Code for: class MyClass:

# Code for: def main():

if __name__ == '__main__':
    main()


Parsing can be disabled for small files. The `parser_threshold` parameter sets the minimum number of lines a source file must have for the parser to be used; shorter files are loaded as a single document:

In [6]:
loader = LanguageLoader(file_path="./example_data/example.py", parser_threshold=1000)

data = loader.load()

In [7]:
len(data)

1

In [8]:
print(data[0].page_content)

class MyClass:
    def __init__(self, name):
        self.name = name

    def greet(self):
        print(f"Hello, {self.name}!")

def main():
    name = input("Enter your name: ")
    obj = MyClass(name)
    obj.greet()

if __name__ == '__main__':
    main()
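
The threshold check can be sketched as a simple line count. The `should_parse` helper is hypothetical, and treating a file with exactly `parser_threshold` lines as large enough is an assumption; the loader may handle that boundary differently.

```python
def should_parse(source, parser_threshold=0):
    """Hypothetical check: parse only if the file has enough lines.

    Assumption: a file with exactly parser_threshold lines is parsed.
    """
    return len(source.splitlines()) >= parser_threshold


small_file = "def main():\n    pass\n\nmain()\n"

print(should_parse(small_file))                         # no threshold set
print(should_parse(small_file, parser_threshold=1000))  # file is too short
```

With a threshold of 1000, the four-line file above is not parsed, which matches the single document returned for `example.py`.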



## JavaScript code

JavaScript files are loaded the same way:

In [9]:
loader = LanguageLoader(file_path="./example_data/example.js")

data = loader.load()

In [10]:
len(data)

3

In [11]:
for document in data:
    pprint(document.metadata)

{'content_type': 'functions_classes',
 'language': 'javascript',
 'source': './example_data/example.js'}
{'content_type': 'functions_classes',
 'language': 'javascript',
 'source': './example_data/example.js'}
{'content_type': 'simplified_code',
 'language': 'javascript',
 'source': './example_data/example.js'}


In [12]:
print("\n\n--8<--\n\n".join([document.page_content for document in data]))

class MyClass {
  constructor(name) {
    this.name = name;
  }

  greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");
  const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

main();


## Splitting

Functions, classes, or scripts that are too large may need additional splitting.
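
One way to decide whether further splitting is needed is to compare each document's length with the intended chunk size. The sketch below uses plain strings in place of loaded documents, and the `oversized` helper is hypothetical; it assumes length is measured in characters.

```python
def oversized(texts, chunk_size):
    """Return the indices of texts longer than chunk_size characters."""
    return [i for i, text in enumerate(texts) if len(text) > chunk_size]


# Stand-ins for document.page_content values.
texts = [
    'function main() {\n'
    '  const name = prompt("Enter your name:");\n'
    '  const obj = new MyClass(name);\n'
    '  obj.greet();\n'
    '}',
    "main();",
]

too_big = oversized(texts, chunk_size=60)
print(too_big)
```

Only the entries flagged here would gain anything from being passed to a splitter; short documents such as `main();` already fit in a single chunk.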

In [13]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

In [14]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

In [15]:
result = js_splitter.split_documents(data)

In [16]:
len(result)

7

In [17]:
print("\n\n--8<--\n\n".join([document.page_content for document in result]))

class MyClass {
  constructor(name) {
    this.name = name;

--8<--

}

--8<--

greet() {
    console.log(`Hello, ${this.name}!`);
  }
}

--8<--

function main() {
  const name = prompt("Enter your name:");

--8<--

const obj = new MyClass(name);
  obj.greet();
}

--8<--

// Code for: class MyClass {

// Code for: function main() {

--8<--

main();
