## Data Connection

### Load Documents

More info: https://python.langchain.com/v0.1/docs/integrations/document_loaders/

In [1]:
from langchain.document_loaders import TextLoader

loader = TextLoader('docs/Advantages_of_Langchain.md')
docs = loader.load()
print(docs)

[Document(page_content='# Advantages of LangChain\n\n## 1. Connect LLMs to our own data\nRecently, there are many use cases in LLMs. However, LLMs may generate answers that do not meet our expectations in some use cases. As a result, we can solve this problem by using LangChain to connect LLMs to our own data. Therefore, LLMs could do referencing from our data.\n\n## 2. Combine LLMs on doing different tasks in one use case\nFor example, we can use GPT-4 to interpret our queries in a specific case and use Claud-3 to response to those queries. By combining two LLMs in different needs, we could have a better model in some use cases.\n\n## 3. Split the text to suitable length\nAs we known, LLMs usually have limited imput token length. LangChain allows us to split a long document into different sections and input to LLMs. Moreover, LLMs can recognize patterns in the texts. Therefore, text-splitting is very efficient.\n\n\n## 4. Control the format of LLM responses\nWe can provide an example 

### Split Documents

**Character splitting**

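`CharacterTextSplitter` is the simplest document splitter: it splits text on a single separator (here `"\n\n"`) and merges the resulting pieces into chunks of up to about `chunk_size` characters, with `chunk_overlap` characters shared between neighbouring chunks.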
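As a rough illustration of how a separator-based splitter can work, the sketch below splits on a separator and then greedily merges pieces up to a size limit. It is a simplified stand-in written in plain Python, not LangChain's implementation; the function name `split_on_separator` and the sample text are hypothetical, and overlap is left out to keep it short.

```python
def split_on_separator(text: str, separator: str = "\n\n", chunk_size: int = 30) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph, a bit longer.\n\nThird."
chunks = split_on_separator(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch a single piece is never cut, so the second chunk ends up longer than `chunk_size`. Chunks produced this way are bounded by the separator positions rather than being of equal length.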

In [2]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

split_docs = text_splitter.split_documents(docs)
print(split_docs)

[Document(page_content='# Advantages of LangChain\n\n## 1. Connect LLMs to our own data\nRecently, there are many use cases in LLMs. However, LLMs may generate answers that do not meet our expectations in some use cases. As a result, we can solve this problem by using LangChain to connect LLMs to our own data. Therefore, LLMs could do referencing from our data.\n\n## 2. Combine LLMs on doing different tasks in one use case\nFor example, we can use GPT-4 to interpret our queries in a specific case and use Claud-3 to response to those queries. By combining two LLMs in different needs, we could have a better model in some use cases.\n\n## 3. Split the text to suitable length\nAs we known, LLMs usually have limited imput token length. LangChain allows us to split a long document into different sections and input to LLMs. Moreover, LLMs can recognize patterns in the texts. Therefore, text-splitting is very efficient.', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_c

**Code splitting**

The `from_language` class method of `RecursiveCharacterTextSplitter` builds a splitter whose separators match the syntax of the given programming language, so code is split at natural boundaries such as function and class definitions.
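To give a feel for what splitting at language boundaries means, here is a minimal sketch that cuts Python source just before each top-level `def` or `class`. It uses a regular expression rather than LangChain's separator list, and the helper name `split_python_source` and the sample source are made up for this illustration.

```python
import re


def split_python_source(source: str) -> list[str]:
    """Cut the source just before every top-level `def` or `class` line."""
    # (?m) makes ^ match at each line start; the lookahead keeps the keyword
    # in the following part instead of consuming it.
    parts = re.split(r"(?m)^(?=(?:def|class) )", source)
    return [part.strip() for part in parts if part.strip()]


SOURCE = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

parts = split_python_source(SOURCE)
for part in parts:
    print(repr(part))
```

The indented `def __init__` stays inside the `Circle` part because only unindented definitions count as boundaries here. A real language-aware splitter would also enforce a chunk size, which this sketch ignores.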

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

PYTHON_CODE = """
def hello_langchain():
    print("Hello, Langchain!")

# Call the function
hello_langchain()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=0
)

python_docs = python_splitter.create_documents([PYTHON_CODE])
print(python_docs)

[Document(page_content='def hello_langchain():'), Document(page_content='print("Hello, Langchain!")'), Document(page_content='# Call the function\nhello_langchain()')]


**Markdown file splitting**

`MarkdownHeaderTextSplitter` splits a Markdown document at the headers you specify and records, in each chunk's metadata, the titles of the headers that chunk falls under.
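The idea of header-based splitting can be sketched in plain Python as follows. This is a simplified approximation, not the library's code: the function `split_by_headers` and its dictionary output are hypothetical, and it only recognises headers written as a marker followed by a space at the start of a line.

```python
def split_by_headers(markdown: str, headers: dict[str, str]) -> list[dict]:
    """Group body lines under the most recent headers and keep the titles as metadata."""
    sections: list[dict] = []
    current_meta: dict[str, str] = {}
    current_lines: list[str] = []

    def flush() -> None:
        # Emit the collected body text, if any, with a copy of the active headers.
        content = "\n".join(current_lines).strip()
        if content:
            sections.append({"content": content, "metadata": dict(current_meta)})
        current_lines.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        marker = next((m for m in headers if line.startswith(m + " ")), None)
        if marker is None:
            current_lines.append(line)
            continue
        flush()
        # A new header replaces any header at the same or a deeper level.
        for m in headers:
            if len(m) >= len(marker):
                current_meta.pop(headers[m], None)
        current_meta[headers[marker]] = line[len(marker):].strip()
    flush()
    return sections


text = "# Chapter 1\n\n## Section 1\n\nThis is section 1\n\n## Section 2\n\nThis is section 2"
sections = split_by_headers(text, {"#": "Header 1", "##": "Header 2", "###": "Header 3"})
for section in sections:
    print(section)
```

Because a header is only detected at the start of a line, a `##` that appears in the middle of a line is treated as ordinary body text.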

In [4]:
# ! pip install unstructured
# ! pip install markdown

In [5]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_example = "# Chapter 1\n\n## Section 1\n\nThis is section 1\n\n## Section 2\n\nThis is section 2"

# Headers to split on, paired with the metadata key used for each level
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3")
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
splits = splitter.split_text(markdown_example)
print(splits)

[Document(page_content='This is section 1', metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 1'}), Document(page_content='This is section 2', metadata={'Header 1': 'Chapter 1', 'Header 2': 'Section 2'})]


**Recursive character splitting**

`RecursiveCharacterTextSplitter` takes an ordered list of separators and tries them in sequence until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`, so it splits on paragraphs first, then lines, then words, and only as a last resort on individual characters. This keeps paragraphs, sentences, and words intact wherever possible, which helps each chunk stay semantically coherent.
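The recursive strategy can be approximated in a few lines of plain Python. The sketch below is a simplified illustration under stated assumptions, not LangChain's code: the name `recursive_split` is hypothetical, overlap is not handled, and the separator list is assumed to end with `""` so that there is always a last-resort cut.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split with the coarsest separator first, recursing into pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    separator, rest = separators[0], separators[1:]
    if separator == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep growing the chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long, so retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "Para one.\n\nPara two is a longer paragraph."
chunks = recursive_split(text, chunk_size=15)
print(chunks)
```

Here the first paragraph fits as a whole, while the second is broken at word boundaries because no paragraph or line break inside it is available.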

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    length_function=len,
)

texts = text_splitter.split_documents(docs)
print(texts)

[Document(page_content='# Advantages of LangChain', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_content='## 1. Connect LLMs to our own data', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_content='Recently, there are many use cases in LLMs. However, LLMs may generate answers that do not meet our expectations in some use cases. As a result, we can solve this problem by using LangChain to', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_content='we can solve this problem by using LangChain to connect LLMs to our own data. Therefore, LLMs could do referencing from our data.', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_content='## 2. Combine LLMs on doing different tasks in one use case', metadata={'source': 'docs/Advantages_of_Langchain.md'}), Document(page_content='For example, we can use GPT-4 to interpret our queries in a specific case and use Claud-3 to response to those queries. By

**Token splitting**

Language models, such as those from OpenAI, have token limits, and an API call must not exceed them. Splitting text by token count rather than character count therefore keeps each chunk within the model's limit.
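A minimal sketch of chunking by token count is shown below. To stay self-contained it treats whitespace-separated words as tokens, which is only a stand-in: a real tokenizer such as tiktoken produces different token boundaries and counts. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Slide a window of `chunk_size` tokens, stepping by `chunk_size - chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: list[list[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks


# Assumption: words stand in for tokens in this example.
tokens = "a b c d e f g".split()
token_chunks = chunk_tokens(tokens, chunk_size=3, chunk_overlap=1)
print(token_chunks)
```

Each chunk holds at most `chunk_size` tokens, and consecutive chunks share `chunk_overlap` tokens so that context is not lost at the boundaries.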

In [7]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100,
    chunk_overlap=0
)

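# To apply the token-based splitter, call text_splitter.split_documents(docs).
# Note: `split_docs` below was created in In [2], so the output shown is the
# character-split result, not a token-split one.
split_docs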

[Document(page_content='# Advantages of LangChain\n\n## 1. Connect LLMs to our own data\nRecently, there are many use cases in LLMs. However, LLMs may generate answers that do not meet our expectations in some use cases. As a result, we can solve this problem by using LangChain to connect LLMs to our own data. Therefore, LLMs could do referencing from our data.\n\n## 2. Combine LLMs on doing different tasks in one use case\nFor example, we can use GPT-4 to interpret our queries in a specific case and use Claud-3 to response to those queries. By combining two LLMs in different needs, we could have a better model in some use cases.\n\n## 3. Split the text to suitable length\nAs we known, LLMs usually have limited imput token length. LangChain allows us to split a long document into different sections and input to LLMs. Moreover, LLMs can recognize patterns in the texts. Therefore, text-splitting is very efficient.', metadata={'source': 'docs/Advantages_of_Langchain.md'}),
 Document(page_

## Vectorize Document Chunks
Embedding models map text fragments to vectors. Working in vector space lets us perform operations such as semantic search, which finds the text fragments whose vectors are most similar to that of a query.
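To show what "most similar in vector space" means, the sketch below ranks a few texts by cosine similarity. The three-dimensional vectors are invented for illustration and do not come from any embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors, not real model output.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

query = toy_vectors["cat"]
ranked = sorted(
    (name for name in toy_vectors if name != "cat"),
    key=lambda name: cosine_similarity(query, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

With these toy values, "kitten" ranks above "car" because its vector points in nearly the same direction as the query vector.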

In [8]:
# ! pip install tiktoken

In [9]:
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()
embedding = embedding_model.embed_documents(
    [
        "Hi!",
        "I am Johnson.",
        "I am great!"
    ]
)
# print(embedding)

## Store Vector Data
A vector store, also called a vector database, stores the embedding vectors of text and provides vector retrieval. LangChain integrates with several open-source and commercial vector stores, including Chroma, FAISS, and Pinecone.
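The two responsibilities of a vector store, storing vectors and retrieving the nearest ones, can be sketched with a small in-memory class. `TinyVectorStore`, its methods, and the toy vectors are all hypothetical; real stores such as Chroma or FAISS add persistence and approximate nearest-neighbour indexes on top of this idea.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TinyVectorStore:
    """Keeps texts with their vectors and returns the k most similar texts."""
    texts: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, text: str, vector: list[float]) -> None:
        self.texts.append(text)
        self.vectors.append(vector)

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    def search(self, query_vector: list[float], k: int = 4) -> list[str]:
        # Never ask for more results than there are stored items.
        k = min(k, len(self.texts))
        scored = sorted(
            zip(self.texts, self.vectors),
            key=lambda item: self._cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]


store = TinyVectorStore()
store.add("LangChain connects LLMs to data", [0.9, 0.1])  # made-up vectors
store.add("Bread recipes", [0.1, 0.9])
results = store.search([1.0, 0.0], k=4)
print(results)
```

Only two items are stored, so asking for four results returns two, ordered from most to least similar to the query vector.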

In [10]:
# ! pip install -q chromadb

### Storage

In [11]:
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
document = text_splitter.split_documents(docs)
db = Chroma.from_documents(document, OpenAIEmbeddings())

### Retrieval

In [12]:
query = "what is Langchain?"
doc = db.similarity_search(query)
print(doc[0].page_content)

Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


# Advantages of LangChain

## 1. Connect LLMs to our own data
Recently, there are many use cases in LLMs. However, LLMs may generate answers that do not meet our expectations in some use cases. As a result, we can solve this problem by using LangChain to connect LLMs to our own data. Therefore, LLMs could do referencing from our data.

## 2. Combine LLMs on doing different tasks in one use case
For example, we can use GPT-4 to interpret our queries in a specific case and use Claud-3 to response to those queries. By combining two LLMs in different needs, we could have a better model in some use cases.

## 3. Split the text to suitable length
As we known, LLMs usually have limited imput token length. LangChain allows us to split a long document into different sections and input to LLMs. Moreover, LLMs can recognize patterns in the texts. Therefore, text-splitting is very efficient.
