# Split code with LangChain

- Author: [greencode](https://github.com/greencode-99)
- Design: 
- Peer Review: [Teddy Lee](https://github.com/teddylee777), [heewung song](https://github.com/kofsitho87)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/05-CodeSplitter.ipynb)

## Overview

`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.

You can split code written in various programming languages by creating the splitter with `RecursiveCharacterTextSplitter.from_language`.

To do this, import the `Language` enum and specify the corresponding programming language.


### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Code Splitter Examples](#code-splitter-examples)
   - [Python](#python)
   - [JS](#js)
   - [TS](#ts)
   - [Markdown](#markdown)
   - [LaTeX](#latex)
   - [HTML](#html)
   - [Solidity](#solidity)
   - [C#](#c)
   - [PHP](#php)
   - [Kotlin](#kotlin)


### References
- [How to split code](https://python.langchain.com/docs/how_to/code_splitter/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [2]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [3]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_text_splitters",
    ],
    verbose=False,
    upgrade=False,
)

In [4]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Code-Splitter",
    }
)

Environment variables have been set successfully.


In [5]:
from dotenv import load_dotenv

load_dotenv()

True

## Code Splitter Examples

Here is an example of splitting text using `RecursiveCharacterTextSplitter`.

- Import the `Language` and `RecursiveCharacterTextSplitter` classes from the `langchain_text_splitters` module.
- `RecursiveCharacterTextSplitter` tries a prioritized list of separators in order and recursively re-splits any piece that is still longer than the chunk size using the next separator.

In [9]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

Supported languages are stored in the `langchain_text_splitters.Language` enum.

API Reference: [Language](https://python.langchain.com/docs/api_reference/text_splitters/Language) | [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/api_reference/text_splitters/RecursiveCharacterTextSplitter)

Below is the full list of supported languages.

In [10]:
# Get the full list of supported languages.
[e.value for e in Language]

['cpp',
 'go',
 'java',
 'kotlin',
 'js',
 'ts',
 'php',
 'proto',
 'python',
 'rst',
 'ruby',
 'rust',
 'scala',
 'swift',
 'markdown',
 'latex',
 'html',
 'sol',
 'csharp',
 'cobol',
 'c',
 'lua',
 'perl',
 'haskell',
 'elixir']
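
One practical use of these values is picking a language from a file name. The sketch below is a hypothetical helper: `guess_language_value` and the `EXTENSION_TO_LANGUAGE` table are illustrative names, not part of LangChain, and the table only covers the languages demonstrated in this tutorial. Because `Language` is a string-valued enum, the returned value could then be passed to `Language(...)`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to the enum values listed above.
# Extend it to cover the languages you actually use.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".ts": "ts",
    ".md": "markdown",
    ".tex": "latex",
    ".html": "html",
    ".sol": "sol",
    ".cs": "csharp",
    ".php": "php",
    ".kt": "kotlin",
}


def guess_language_value(filename: str) -> Optional[str]:
    """Return the enum value for a file name, or None if the extension is unknown."""
    return EXTENSION_TO_LANGUAGE.get(Path(filename).suffix.lower())


print(guess_language_value("contracts/Token.sol"))  # sol
print(guess_language_value("notes.txt"))            # None
```

Returning `None` for unknown extensions leaves the fallback decision (for example, a plain `RecursiveCharacterTextSplitter`) to the caller.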

You can use the `get_separators_for_language` method of the `RecursiveCharacterTextSplitter` class to check the separators used for a specific language.

- In the example, the `Language.PYTHON` enum value is passed as an argument to check the separators used for the Python language.

In [11]:
# You can check the separators used for the given language.
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
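
To make the role of this ordered list concrete, here is a minimal pure-Python sketch of the recursive idea. It is not the library's implementation: `recursive_split` is a hypothetical function, it treats separators as plain strings, and it skips the step where the real splitter merges small neighbouring pieces back together up to `chunk_size`.

```python
from typing import List

# Separator list copied from the output above, highest priority first.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Simplified sketch: split on the first separator found, then recurse into oversized pieces."""
    if len(text) <= chunk_size:
        stripped = text.strip()
        return [stripped] if stripped else []
    for index, separator in enumerate(separators):
        if separator == "":
            # Last resort: cut at fixed character positions.
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        if separator in text:
            chunks: List[str] = []
            for position, piece in enumerate(text.split(separator)):
                # Keep the separator at the start of the piece it introduces.
                if position > 0:
                    piece = separator + piece
                # Only lower-priority separators are tried on the pieces.
                chunks.extend(recursive_split(piece, separators[index + 1:], chunk_size))
            return chunks
    return [text]  # No separator applies; return the oversized text unchanged.


sample = "\ndef first():\n    return 1\n\ndef second():\n    return 2\n"
for chunk in recursive_split(sample, PYTHON_SEPARATORS, chunk_size=30):
    print(repr(chunk))
```

With this sample, the `"\ndef "` separator is the first one that occurs, so each function ends up in its own chunk before any lower-priority separator is needed.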

### Python

Use `RecursiveCharacterTextSplitter` to split Python code into document units.
- Specify `Language.PYTHON` as the `language` parameter to use the Python-specific separators.
- Set `chunk_size` to 50 to limit the maximum size of each document (measured in characters by default).
- Set `chunk_overlap` to 0 to disallow overlap between documents.

In [13]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)

# Create `Document`. The created `Document` is returned as a list.
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():\n    print("Hello, World!")'),
 Document(page_content='hello_world()')]

In [14]:
# This section iterates through the list of documents created by the RecursiveCharacterTextSplitter
# and prints each document's content followed by a separator line for readability.
for doc in python_docs:
    print(doc.page_content, end="\n==================\n")

def hello_world():
    print("Hello, World!")
==================
hello_world()
==================
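
The `chunk_overlap` parameter is easier to picture on plain character windows. The sketch below is only an illustration of the parameter, with a hypothetical `sliding_chunks` function: the real splitter applies overlap while merging pieces at separator boundaries, not on raw character positions.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The window already reached the end of the text.
        start += chunk_size - chunk_overlap
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0 every character appears in exactly one chunk, which is the setting used throughout this tutorial.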


### JS

Here is an example of using the JS text splitter.
- Specify `Language.JS` as the `language` parameter to use the JavaScript language.
- Set `chunk_size` to 60 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.


In [12]:
JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

# Create `Document`. The created `Document` is returned as a list.
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'),
 Document(page_content='helloWorld();')]

### TS  

Here is an example of using the TS text splitter.
- Specify `Language.TS` as the `language` parameter to use the TypeScript language.
- Set `chunk_size` to 60 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.


In [15]:
TS_CODE = """
function helloWorld(): void {
  console.log("Hello, World!");
}

helloWorld();
"""

ts_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.TS, chunk_size=60, chunk_overlap=0
)


ts_docs = ts_splitter.create_documents([TS_CODE])
ts_docs

[Document(page_content='function helloWorld(): void {'),
 Document(page_content='console.log("Hello, World!");\n}'),
 Document(page_content='helloWorld();')]
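
The TS function is split across chunks while the nearly identical JS function in the previous example stays whole. A quick length check suggests why: the `: void` annotation pushes the function past the 60-character limit. The sketch below only uses `len`, which is also the splitter's default way of measuring chunk size.

```python
# Function bodies copied from the JS and TS examples above.
js_function = 'function helloWorld() {\n  console.log("Hello, World!");\n}'
ts_function = 'function helloWorld(): void {\n  console.log("Hello, World!");\n}'

CHUNK_SIZE = 60  # Same limit as in both examples.

for name, source in [("JS", js_function), ("TS", ts_function)]:
    fits = len(source) <= CHUNK_SIZE
    print(f"{name}: {len(source)} characters, fits in one chunk: {fits}")
```

When a logical unit such as a function must stay together, raising `chunk_size` slightly above its length is usually the simplest adjustment.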

### Markdown

Here is an example of using the Markdown text splitter.

- Specify `Language.MARKDOWN` as the `language` parameter to use the Markdown language.
- Set `chunk_size` to 60 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.

In [14]:
markdown_text = """
# 🦜️🔗 LangChain

⚡ Building applications with LLMs through composability ⚡

## What is LangChain?

# Hopefully this code block isn't split
LangChain is a framework for...

As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=60,
    chunk_overlap=0,
)

md_docs = md_splitter.create_documents([markdown_text])
md_docs

[Document(page_content='# 🦜️🔗 LangChain'),
 Document(page_content='⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='## What is LangChain?'),
 Document(page_content="# Hopefully this code block isn't split"),
 Document(page_content='LangChain is a framework for...'),
 Document(page_content='As an open-source project in a rapidly developing field, we'),
 Document(page_content='are extremely open to contributions.')]
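
In the output above, the line `# Hopefully this code block isn't split` ends up in its own chunk because any line that starts with `#` and a space looks like a heading. The sketch below reproduces that behaviour with a regular expression. The pattern is an assumption about how heading boundaries are detected, and `split_on_headings` is a hypothetical helper, not a LangChain function.

```python
import re
from typing import List

# Assumed pattern: a newline followed by one to six '#' characters and a space.
HEADING_BOUNDARY = re.compile(r"\n(?=#{1,6} )")


def split_on_headings(markdown: str) -> List[str]:
    """Split Markdown before each heading-like line and drop empty sections."""
    sections = HEADING_BOUNDARY.split(markdown)
    return [section.strip() for section in sections if section.strip()]


sample = (
    "# Title\n\nIntro text.\n\n"
    "## Details\n\n"
    "# a shell comment that looks like a heading\nBody text.\n"
)
for section in split_on_headings(sample):
    print(repr(section))
```

A purely textual separator cannot tell a shell comment from a heading, so code samples embedded in Markdown may be split at comment lines.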

### LaTeX

LaTeX is a markup language for document creation, widely used for representing mathematical symbols and formulas.

Here is an example of splitting LaTeX text.
- Specify `Language.LATEX` as the `language` parameter to use the LaTeX language.
- Set `chunk_size` to 60 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.


In [16]:
# Use a raw string so that backslash sequences such as \b in \begin are kept literally.
latex_text = r"""
\documentclass{article}

\begin{document}

\maketitle

\section{Introduction}
Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in a variety of natural language processing tasks, including language translation, text generation, and sentiment analysis.

\subsection{History of LLMs}
The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}
LLMs have many applications in industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}
"""

latex_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.LATEX,
    chunk_size=60,
    chunk_overlap=0,
)

latex_docs = latex_splitter.create_documents([latex_text])
latex_docs

[Document(page_content='\\documentclass{article}\n\n\\begin{document}\n\n\\maketitle'),
 Document(page_content='\\section{Introduction}\nLarge language models (LLMs) are a'),
 Document(page_content='type of machine learning model that can be trained on vast'),
 Document(page_content='amounts of text data to generate human-like language. In'),
 Document(page_content='recent years, LLMs have made significant advances in a'),
 Document(page_content='variety of natural language processing tasks, including'),
 Document(page_content='language translation, text generation, and sentiment'),
 Document(page_content='analysis.'),
 Document(page_content='\\subsection{History of LLMs}\nThe earliest LLMs were'),
 Document(page_content='developed in the 1980s and 1990s, but they were limited by'),
 Document(page_content='the amount of data that could be processed and the'),
 Document(page_content='computational power available at the time. In the past'),
 Document(page_content='decade, however, adva
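
LaTeX commands start with a backslash, and Python interprets some backslash sequences inside a normal string literal as escape characters. The short sketch below compares a normal and a raw string literal for `\begin`, where `\b` would otherwise become a backspace character (`\x08`).

```python
# A normal string literal turns "\b" into a backspace character (\x08).
plain = "\begin{document}"
# A raw string literal keeps the backslash and the letter "b" as written.
raw = r"\begin{document}"

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

Writing LaTeX samples as raw strings (`r"""..."""`) keeps every command intact, so the chunks contain the text exactly as it appears in the source file.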

### HTML

Here is an example of using the HTML text splitter.
- Specify `Language.HTML` as the `language` parameter to use the HTML language.
- Set `chunk_size` to 60 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.


In [51]:
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)

html_docs = html_splitter.create_documents([html_text])
html_docs

[Document(page_content='<!DOCTYPE html>\n<html>'),
 Document(page_content='<head>\n        <title>🦜️🔗 LangChain</title>'),
 Document(page_content='<style>\n            body {\n                font-family: Aria'),
 Document(page_content='l, sans-serif;\n            }\n            h1 {'),
 Document(page_content='color: darkblue;\n            }\n        </style>\n    </head'),
 Document(page_content='>'),
 Document(page_content='<body>'),
 Document(page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'),
 Document(page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
 Document(page_content='</p>\n        </div>'),
 Document(page_content='<div>\n            As an open-source project in a rapidly dev'),
 Document(page_content='eloping field, we are extremely open to contributions.'),
 Document(page_content='</div>\n    </body>\n</html>')]
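
The HTML output above contains chunks that end in the middle of a word (`Aria` followed by `l, sans-serif`). A likely explanation, stated here as an assumption rather than a confirmed detail, is that the HTML separators are tag prefixes without whitespace separators, so long text without tags falls back to a cut at a fixed character position. The hypothetical helper below flags such boundaries so they are easy to spot.

```python
from typing import List, Tuple


def mid_word_breaks(chunks: List[str]) -> List[Tuple[str, str]]:
    """Return (tail, head) pairs where a chunk boundary falls between two alphanumeric characters."""
    breaks = []
    for previous, following in zip(chunks, chunks[1:]):
        if previous and following and previous[-1].isalnum() and following[0].isalnum():
            breaks.append((previous[-10:], following[:10]))
    return breaks


# Chunk texts copied from part of the HTML output above.
html_chunks = [
    "<style>\n            body {\n                font-family: Aria",
    "l, sans-serif;\n            }\n            h1 {",
    "color: darkblue;\n            }\n        </style>\n    </head",
    ">",
]
print(mid_word_breaks(html_chunks))
```

If such breaks matter for your use case, a larger `chunk_size` or less deeply indented input reduces how often the fallback is reached.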

### Solidity

Here is an example of using the Solidity text splitter:

- The Solidity code is stored in the `SOL_CODE` variable as a string.
- The `RecursiveCharacterTextSplitter` is used to create `sol_splitter`, which splits the Solidity code into chunks.
  - The `language` parameter is set to `Language.SOL` to specify the Solidity language.
  - The `chunk_size` is set to 128 to specify the maximum size of each chunk.
  - The `chunk_overlap` is set to 0 to prevent overlap between chunks.
  
- The `sol_splitter.create_documents()` method is used to split `SOL_CODE` into chunks and store the split chunks in the `sol_docs` variable.
- The `sol_docs` are output to verify the split Solidity code chunks.


In [52]:
SOL_CODE = """
pragma solidity ^0.8.20; 
contract HelloWorld {  
   function add(uint a, uint b) pure public returns(uint) {
       return a + b;
   }
}
"""

sol_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.SOL, chunk_size=128, chunk_overlap=0
)

sol_docs = sol_splitter.create_documents([SOL_CODE])
sol_docs

[Document(page_content='pragma solidity ^0.8.20;'),
 Document(page_content='contract HelloWorld {  \n   function add(uint a, uint b) pure public returns(uint) {\n       return a + b;\n   }\n}')]
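
Verifying the result can also be done programmatically. The sketch below checks that no chunk exceeds the configured size. `FakeDocument` is a hypothetical stand-in that only mimics the `page_content` attribute, so the example runs without LangChain; with the real objects you would pass `sol_docs` directly.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in exposing only the `page_content` attribute used below."""
    page_content: str


def oversized_chunks(docs: Iterable, chunk_size: int) -> List[int]:
    """Return the indexes of documents whose text is longer than chunk_size."""
    return [i for i, doc in enumerate(docs) if len(doc.page_content) > chunk_size]


# Chunk texts copied from the Solidity output above.
docs = [
    FakeDocument("pragma solidity ^0.8.20;"),
    FakeDocument(
        "contract HelloWorld {  \n   function add(uint a, uint b) pure public returns(uint) {\n"
        "       return a + b;\n   }\n}"
    ),
]
print(oversized_chunks(docs, chunk_size=128))  # []
```

An empty list means every chunk respects the limit; the whole contract fits in one chunk here because it is shorter than 128 characters.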

### C#

Here is an example of using the C# text splitter.
- Specify `Language.CSHARP` as the `language` parameter to use the C# language.
- Set `chunk_size` to 128 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.

In [54]:
C_CODE = """
using System;
class Program
{
    static void Main()
    {
        Console.WriteLine("Enter a number (1-5):");
        int input = Convert.ToInt32(Console.ReadLine());
        for (int i = 1; i <= input; i++)
        {
            if (i % 2 == 0)
            {
                Console.WriteLine($"{i} is even.");
            }
            else
            {
                Console.WriteLine($"{i} is odd.");
            }
        }
        Console.WriteLine("Goodbye!");
    }
}
"""

c_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.CSHARP, chunk_size=128, chunk_overlap=0
)

c_docs = c_splitter.create_documents([C_CODE])
c_docs

[Document(page_content='using System;'),
 Document(page_content='class Program\n{\n    static void Main()\n    {\n        Console.WriteLine("Enter a number (1-5):");'),
 Document(page_content='int input = Convert.ToInt32(Console.ReadLine());\n        for (int i = 1; i <= input; i++)\n        {'),
 Document(page_content='if (i % 2 == 0)\n            {\n                Console.WriteLine($"{i} is even.");\n            }\n            else'),
 Document(page_content='{\n                Console.WriteLine($"{i} is odd.");\n            }\n        }\n        Console.WriteLine("Goodbye!");'),
 Document(page_content='}\n}')]

### PHP

Here is an example of using the PHP text splitter.
- Specify `Language.PHP` as the `language` parameter to use the PHP language.
- Set `chunk_size` to 50 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.

In [56]:
PHP_CODE = """<?php
namespace foo;
class Hello {
    public function __construct() { }
}
function hello() {
    echo "Hello World!";
}
interface Human {
    public function breath();
}
trait Foo { }
enum Color
{
    case Red;
    case Blue;
}"""

php_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PHP, chunk_size=50, chunk_overlap=0
)

php_docs = php_splitter.create_documents([PHP_CODE])
php_docs

[Document(page_content='<?php\nnamespace foo;'),
 Document(page_content='class Hello {'),
 Document(page_content='public function __construct() { }\n}'),
 Document(page_content='function hello() {\n    echo "Hello World!";\n}'),
 Document(page_content='interface Human {\n    public function breath();\n}'),
 Document(page_content='trait Foo { }\nenum Color\n{\n    case Red;'),
 Document(page_content='case Blue;\n}')]
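
With `chunk_overlap` set to 0, the chunks should contain the original source once, apart from whitespace that is stripped at chunk boundaries. The sketch below checks that property on plain strings; `preserves_content` is a hypothetical helper, and the sample is a shortened version of the PHP example.

```python
from typing import List


def without_whitespace(text: str) -> str:
    """Remove all whitespace so that only the visible characters are compared."""
    return "".join(text.split())


def preserves_content(original: str, chunks: List[str]) -> bool:
    """Check that the chunks, taken in order, contain exactly the original non-whitespace text."""
    return without_whitespace(original) == without_whitespace("".join(chunks))


source = "<?php\nnamespace foo;\nclass Hello {\n    public function __construct() { }\n}"
chunks = [
    "<?php\nnamespace foo;",
    "class Hello {",
    "public function __construct() { }\n}",
]
print(preserves_content(source, chunks))       # True
print(preserves_content(source, chunks[:-1]))  # False
```

The check would not hold with a non-zero overlap, because overlapping chunks repeat part of the text.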

### Kotlin

Here is an example of using the Kotlin text splitter.
- Specify `Language.KOTLIN` as the `language` parameter to use the Kotlin language.
- Set `chunk_size` to 100 to limit the maximum size of each document.
- Set `chunk_overlap` to 0 to disallow overlap between documents.

In [65]:
KOTLIN_CODE = """
fun main() {
    val directoryPath = System.getProperty("user.dir")
    val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy { it.lastModified() } ?: emptyArray()

    files.forEach { file ->
        println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")
    }
}
"""

kotlin_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.KOTLIN, chunk_size=100, chunk_overlap=0
)

kotlin_docs = kotlin_splitter.create_documents([KOTLIN_CODE])
kotlin_docs

[Document(page_content='fun main() {\n    val directoryPath = System.getProperty("user.dir")'),
 Document(page_content='val files = File(directoryPath).listFiles()?.filter { !it.isDirectory }?.sortedBy {'),
 Document(page_content='it.lastModified() } ?: emptyArray()'),
 Document(page_content='files.forEach { file ->'),
 Document(page_content='println("Name: ${file.name} | Last Write Time: ${file.lastModified()}")\n    }\n}')]