# OpenAI Multimodal

GPT is a large language model that can understand both text and images. It can read and write text, and it can also look at a picture and describe what it shows. Think of it as an assistant that helps with writing, reading, and interpreting photos.


In [1]:
from IPython.display import display, HTML

# Define the HTML to display images side by side
html = """
<div style="display: flex; justify-content: space-around;">
    <div>
        <img src="nDATs9kmQk7sNrx5ELrhZ.png" height="900" width="600" />
    </div>
    <div>
        <img src="754703591882697477.png" height="900" width="600" />
    </div>
</div>
"""

# Display the HTML
display(HTML(html))

In [2]:
import os

os.chdir("../../")

In [3]:
from langchain_community.chat_models import ChatOpenAI

from src.initialization import credential_init

credential_init()

model = ChatOpenAI(openai_api_key=os.environ['OPENAI_API_KEY'],
                   model_name="gpt-4o-2024-05-13", temperature=0)



GPT does not see the image file itself; it receives the image as a Base64-encoded string.
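Base64 writes arbitrary bytes using only printable ASCII characters, which is why an image can travel inside a text request. A minimal sketch using only the standard library; the sample bytes are an assumption, standing in for the contents of a real image file:

```python
import base64

# The 8-byte PNG file signature, used here as sample data instead of a real image
raw_bytes = bytes([137, 80, 78, 71, 13, 10, 26, 10])

encoded = base64.b64encode(raw_bytes).decode("utf-8")  # bytes -> ASCII text
decoded = base64.b64decode(encoded)                     # ASCII text -> the same bytes

print(encoded)
print(decoded == raw_bytes)
```

The round trip is lossless, so the model side can recover exactly the bytes that were encoded.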

In [4]:
import io
import base64

from PIL import Image
from langchain_core.messages.human import HumanMessage
from langchain.prompts import ChatPromptTemplate


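def image_to_base64(image_path):
    """Return the image at image_path as a Base64-encoded JPEG string."""
    with Image.open(image_path) as image:
        # JPEG has no alpha channel, so convert (e.g. RGBA PNGs) to RGB
        # before saving the image to an in-memory buffer
        buffered = io.BytesIO()
        image.convert("RGB").save(buffered, format="JPEG")

    # Encode the JPEG bytes to Base64 and return them as text
    return base64.b64encode(buffered.getvalue()).decode('utf-8')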



### 1. Convert Image Path to Base64 String

- The image path is constructed and passed to image_to_base64 to get the Base64 string of the image.

In [5]:
from src.io.path_definition import get_project_dir

image_str = image_to_base64(os.path.join(get_project_dir(), 'tutorial/Week-5/754703591882697477.png'))

In [7]:
# image_str

### 2. Create a Human Message

- A HumanMessage object is created containing two parts:
    - A text message asking "What is in this image?"
    - An image URL containing the Base64 encoded image.
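The message content described above is just a list of dictionaries. A minimal sketch, assuming a hypothetical helper `build_image_content` (not a LangChain API) that assembles the text part plus one `image_url` part per Base64 string:

```python
import base64


def build_image_content(question, images_b64, mime_type="image/jpeg"):
    """Build a multimodal content list: one text part, then one image part per image.

    Hypothetical helper for illustration only; it is not part of LangChain.
    """
    content = [{"type": "text", "text": question}]
    for image_b64 in images_b64:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime_type};base64,{image_b64}"},
        })
    return content


# Stand-in bytes instead of a real image file, so the sketch runs anywhere
fake_image_b64 = base64.b64encode(b"not a real image").decode("utf-8")
content = build_image_content("What is in this image?", [fake_image_b64])
print(content[1]["image_url"]["url"][:30])
```

The resulting list has the same shape as the `content` argument that the notebook passes to `HumanMessage`.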

In [9]:
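# python f-string

天氣 = "cloudy"  # 天氣 means "weather"; it must be defined before the f-string uses it
text = f"Today's weather is: {天氣}"

print(text)

Today's weather is: cloudy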


In [10]:
human_message = HumanMessage(content=[{'type': 'text', 
                                       'text': 'What is in this image?'},
                                      {'type': 'image_url',
                                       'image_url': {
                                           'url': f"data:image/jpeg;base64,{image_str}"}
                                      }])

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message])

# Generate the Chain
chain = prompt|model

In [12]:
output = chain.invoke(input={})

In [13]:
print(output.content)

The image depicts a person dressed in a cosplay outfit inspired by a kitsune, a mythical fox spirit from Japanese folklore. The individual is wearing fox ears and has multiple fox tails visible behind them. They are also dressed in traditional Japanese-style clothing, including a kimono with intricate patterns and a decorative hair accessory. The overall look is detailed and carefully crafted to resemble a character from anime, manga, or Japanese mythology.


## Make the input image a dynamic variable

- With PromptTemplate

In [14]:
from langchain.prompts import HumanMessagePromptTemplate

HumanMessagePromptTemplate?

Init signature:
HumanMessagePromptTemplate(
    *,
    prompt: Union[langchain_core.prompts.string.StringPromptTemplate, List[Union[langchain_core.prompts.string.StringPromptTemplate, langchain_core.prompts.image.ImagePromptTemplate]]],
    additional_kwargs: dict = None,
) -> None
Docstring:      Human message prompt template. This is a message sent from the user.
Init docstring:
Create a new model by parsing and validati

In [15]:
from langchain.prompts import HumanMessagePromptTemplate

human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': 'What is in this image?'},
        {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,{image_str}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
chain = prompt|model

chain.invoke(input={"image_str": image_str})

AIMessage(content='The image depicts a person dressed in a cosplay outfit inspired by a kitsune, a mythical fox spirit from Japanese folklore. The individual is wearing fox ears and has multiple fox tails visible behind them. They are also dressed in traditional Japanese-style clothing, including a kimono with intricate patterns and a decorative hair accessory. The overall look is detailed and carefully crafted to resemble a character from anime, manga, or Japanese mythology.', response_metadata={'token_usage': {'completion_tokens': 83, 'prompt_tokens': 1118, 'total_tokens': 1201, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d0c6e590be', 'finish_reason': 'stop', 'logprobs': None}, id='run-d17695a0-4b04-44e8-bcf8-49dbc143c1f9-0')

Turn both the `question` and the `image` into input variables.

In [16]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '{question}'},
        {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,{image_str}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
chain = prompt|model

chain.invoke(input={"image_str": image_str, "question": "Are you able to connect this image with any anime character?"})

AIMessage(content='The character in the image appears to be a person dressed in a fox-themed costume, which is reminiscent of characters from various anime series. The fox ears and multiple tails suggest a kitsune, a mythical fox spirit from Japanese folklore, which is a common motif in anime. \n\nOne well-known anime character that fits this description is Ahri from the game "League of Legends," who is often depicted with fox ears and multiple tails. Another character is Tamamo no Mae from the "Fate" series, who also has a similar appearance. However, without more specific details, it\'s difficult to definitively identify the character.', response_metadata={'token_usage': {'completion_tokens': 124, 'prompt_tokens': 1124, 'total_tokens': 1248, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'syst

The answer is too broad, so give the model one more constraint: the source is Azur Lane (碧藍航線).

In [17]:
chain.invoke(input={"image_str": image_str, "question": "Are you able to connect this image with any anime character? Hint: Azur Lane."})

AIMessage(content='The character in the image appears to be cosplaying as Akagi from the game "Azur Lane." Akagi is known for her fox-like appearance, including fox ears and multiple tails, which are characteristic features depicted in the image.', response_metadata={'token_usage': {'completion_tokens': 46, 'prompt_tokens': 1130, 'total_tokens': 1176, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d0c6e590be', 'finish_reason': 'stop', 'logprobs': None}, id='run-ed283c6a-4d5c-489d-b75a-4d1b27783575-0')

Take the chain one step further: use the image path as an input variable.

In [None]:
# system_message = SystemMessagePromptTemplate(prompt=system_prompt)

# human_prompt = PromptTemplate(template='existing ingredients:[{existing_ingredients}]; '
#                                        'suggested ingredients: [{suggested_ingredients}]\n; '
#                                        'format instruction: {format_instructions}',
#                               input_variables=["existing_ingredients", "suggested_ingredients"],
#                               partial_variables={"format_instructions": format_instructions}
#                               )

# human_message = HumanMessagePromptTemplate(prompt=human_prompt)

# chat_prompt = ChatPromptTemplate.from_messages([system_message,
#                                                 human_message
#                                                 ])

In [18]:
from operator import itemgetter

from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '{question}'},
        {'type': 'image_url', 'image_url': {'url': 'data:image/jpeg;base64,{image_str}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
chain = RunnablePassthrough.assign(image_str=itemgetter('image_path')|RunnableLambda(image_to_base64))|prompt|model|StrOutputParser()
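`RunnablePassthrough.assign` passes the input dictionary through and adds new keys computed from it. A plain-Python analogy of that step follows. It is a sketch of the idea only, not LangChain's implementation, and `assign` and `fake_image_to_base64` are hypothetical names:

```python
from operator import itemgetter


def assign(**computed):
    """Return a step that copies its input dict and adds one key per keyword argument."""
    def step(inputs):
        extra = {key: fn(inputs) for key, fn in computed.items()}
        return {**inputs, **extra}
    return step


def fake_image_to_base64(image_path):
    # Stand-in for the notebook's image_to_base64, so the sketch runs without an image file
    return f"<base64 of {image_path}>"


get_path = itemgetter("image_path")
add_image_str = assign(image_str=lambda inputs: fake_image_to_base64(get_path(inputs)))

result = add_image_str({"question": "What is in this image?", "image_path": "cat.png"})
print(result)
```

The original keys survive untouched, which is why the prompt can still read `question` after `image_str` has been added.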

In [19]:
image_path = os.path.join(get_project_dir(), 'tutorial/Week-5/nDATs9kmQk7sNrx5ELrhZ.png')

In [None]:
# chain = RunnablePassthrough.assign(image_str=itemgetter('image_path')|RunnableLambda(image_to_base64))|prompt

In [20]:
chain.invoke({"question": "What is in this image?",
              "image_path": image_path})

'The image appears to be a stylized illustration of a female character in a futuristic, form-fitting combat suit. She is holding a high-tech sniper rifle with a scope. The character has long, flowing hair and is depicted in a dynamic pose, suggesting readiness for action. The background includes some text and logos, with "NKF" prominently displayed in the top left corner. The overall aesthetic is reminiscent of sci-fi or cyberpunk themes.'

Pass the image URL directly as an input variable.

In [21]:
from IPython.display import Image as Image_IPYTHON

Image_IPYTHON(url="https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")

In [22]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[
        {'type': 'text', 'text': '{question}'},
        {'type': 'image_url', 'image_url': {'url': '{image_url}'}}
    ],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

# Generate the Chain
chain = RunnablePassthrough.assign(image_url=itemgetter('url'))|prompt|model|StrOutputParser()

url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                                   
chain.invoke({"question": "What is in this image?",
              "url": url})

'The image depicts a scenic landscape with a wooden boardwalk path leading through a lush, green field. The sky is clear with a few scattered clouds, and the horizon is lined with trees and bushes. The overall atmosphere is serene and inviting, suggesting a natural, outdoor setting, possibly a park or nature reserve.'

## Homework 1: Use LCEL to build an image-analysis function that takes a file name as input and returns the content

## Multiple Images

In [23]:
human_message_template = HumanMessagePromptTemplate.from_template(
    template=[{'type': 'text', 
               'text': 'What are in these images? Is there any difference between them?'},
              {'type': 'image_url',
               'image_url': {
                   'url': "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
              },
              {'type': 'image_url',
               'image_url': {
                   'url': "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}
              }],
)

# Create a Prompt Template
prompt = ChatPromptTemplate.from_messages([human_message_template])

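# prompt.format() flattens the messages into a single string, so the model would
# receive the URLs only as text (the 157 prompt tokens in the recorded output below
# are consistent with that); format_messages() keeps the image parts intact.
model.invoke(prompt.format_messages())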

AIMessage(content='The images you provided are identical. They both depict the same scene: a nature boardwalk in Madison, Wisconsin. The image shows a wooden pathway surrounded by lush greenery, with trees and plants on either side. The sky is clear, and the overall setting appears to be a peaceful, natural environment.\n\nSince the images are the same, there is no difference between them.', response_metadata={'token_usage': {'completion_tokens': 74, 'prompt_tokens': 157, 'total_tokens': 231, 'prompt_tokens_details': {'cached_tokens': 0, 'audio_tokens': 0}, 'completion_tokens_details': {'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}}, 'model_name': 'gpt-4o-2024-05-13', 'system_fingerprint': 'fp_d0c6e590be', 'finish_reason': 'stop', 'logprobs': None}, id='run-aaba0571-21b7-49bf-b120-2a8a0b9f5b91-0')

Any ideas you would like to try? Let's do it live and hope nothing breaks.

# Text Splitting

https://www.youtube.com/watch?v=8OJC21T2SL4

- Character Split
- Recursive Character Split
- Document Specific Splitting
- Semantic Splitting
- Agentic Splitting

Two reasons to split text before passing it to a language model:

1. Context limit: a language model accepts only a limited number of words/tokens per request.
2. Signal to noise: remove information that isn't helpful to your task.

## Character Splitting

Character splitting is the most basic way to split text: it simply divides the text into N-character chunks, regardless of content or form.

This method isn't recommended for real applications, but it's a great starting point for understanding the basics.

- Pros: Easy & Simple
- Cons: Very rigid and doesn't take into account the structure of your text

Concepts to know:

- Chunk Size - The number of characters you would like in your chunks. 50, 100, 100000, etc.
- Chunk Overlap - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.
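Chunk size and chunk overlap can be illustrated with a sliding window over the string. A minimal sketch in plain Python, assuming a hypothetical function `split_by_characters`; it mirrors the idea only and is not LangChain's `CharacterTextSplitter` implementation:

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Cut text into chunk_size-character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"
no_overlap = split_by_characters(text, chunk_size=35)
with_overlap = split_by_characters(text, chunk_size=35, chunk_overlap=4)
print(no_overlap)
print(with_overlap)
```

With an overlap of 4, each chunk starts with the last four characters of the previous one, which is the duplicated data mentioned above.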



In [24]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [25]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]

In [26]:
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=4, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='o chunk up. It is the example text '),
 Document(page_content='ext for this exercise')]

In [27]:
from IPython.display import IFrame

IFrame(src='https://chunkviz.up.railway.app/', width=800, height=800)

- Separators are the character sequences you would like to split on. For example, to split your data at `ch`, pass `separator='ch'`.

In [28]:
text_splitter = CharacterTextSplitter(chunk_size=4, chunk_overlap=0, separator='ch')
text_splitter.create_documents([text])

Created a chunk of size 33, which is longer than the specified 4


[Document(page_content='This is the text I would like to'),
 Document(page_content='unk up. It is the example text for this exercise')]

## Recursive character splitting

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
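The "try each separator in order" idea can be sketched in plain Python. This is a simplified illustration under stated assumptions: the function name `recursive_split` is hypothetical, there is no chunk overlap, and separators are dropped at chunk boundaries. It is not LangChain's `RecursiveCharacterTextSplitter` implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with the next one for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []

    separator, *remaining = separators
    if separator == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: fall back to the next separator
            chunks.extend(recursive_split(piece, chunk_size, remaining or [""]))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
chunks = recursive_split(sample, chunk_size=11)
print(chunks)
```

The first paragraph fits in one chunk and stays whole; only the second, longer paragraph is broken further at spaces.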


### CNN (Cable News Network) dataset

In [29]:
import pandas as pd

df_news = pd.read_csv("tutorial/Week-5/CNN_Articels_clean.csv")

In [30]:
df_news.head(5)

| Unnamed: 0 | Index | Author | Date published | Category | Section | Url | Headline | Description | Keywords | Second headline | Article text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Jacopo Prisco, CNN | 2021-07-15 02:46:59 | news | world | https://www.cnn.com/2021/07/14/world/tusimple-... | There's a shortage of truckers, but TuSimple t... | The e-commerce boom has exacerbated a global t... | world, There's a shortage of truckers, but TuS... | There's a shortage of truckers, but TuSimple t... | (CNN)Right now, there's a shortage of truck d... |
| 1 | 2 | Stephanie Bailey, CNN | 2021-05-12 07:52:09 | news | world | https://www.cnn.com/2021/05/12/world/ironhand-... | Bioservo's robotic 'Ironhand' could protect fa... | Working in a factory can mean doing the same t... | world, Bioservo's robotic 'Ironhand' could pro... | A robotic 'Ironhand' could protect factory wor... | (CNN)Working in a factory or warehouse can me... |
| 2 | 3 | Words by Stephanie Bailey, video by Zahra Jamshed | 2021-06-16 02:51:30 | news | asia | https://www.cnn.com/2021/06/15/asia/swarm-robo... | This swarm of robots gets smarter the more it ... | In a Hong Kong warehouse, a swarm of autonomou... | asia, This swarm of robots gets smarter the mo... | This swarm of robots gets smarter the more it ... | (CNN)In a Hong Kong warehouse, a swarm of aut... |
| 3 | 4 | Paul R. La Monica, CNN Business | 2022-03-15 09:57:36 | business | investing | https://www.cnn.com/2022/03/15/investing/brics... | Russia is no longer an option for investors. T... | For many years, the world's most popular emerg... | investing, Russia is no longer an option for i... | Russia is no longer an option for investors. T... | New York (CNN Business)For many years, the wor... |
| 4 | 7 | Reuters | 2022-03-15 11:27:02 | business | business | https://www.cnn.com/2022/03/15/business/russia... | Russian energy investment ban part of new EU s... | The European Union formally approved on Tuesda... | business, Russian energy investment ban part o... | EU bans investment in Russian energy in new sa... | The European Union formally approved on Tuesda... |


In [31]:
text = df_news.iloc[0]['Article text']

In [32]:
len(text)

12361

In [36]:
text[:100]

" (CNN)Right now, there's a shortage of truck drivers in the US and worldwide, exacerbated by the e-c"

In [33]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0)

In [38]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=65, chunk_overlap=0, separators=[",", ".", "?", "!"])

In [39]:
documents = text_splitter.create_documents([text])

In [40]:
print(documents[0])
print(len(documents[0].page_content))

page_content='(CNN)Right now'
14


In [41]:
print(documents[1])
print(len(documents[1].page_content))

page_content=", there's a shortage of truck drivers in the US and worldwide"
61


In [42]:
print(documents[2])
print(len(documents[2].page_content))

page_content=', exacerbated by the e-commerce boom brought on by the pandemic'
63


In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)

In [None]:
documents = text_splitter.create_documents([text])

In [None]:
print(documents[0])
print(len(documents[0].page_content))

In [43]:
import re

# Input text (note: this reassigns `text`, so the chunkers in later cells see only this sentence)
text = ", there's a shortage of truck drivers in the US and worldwide."

# Remove punctuation using regex
cleaned_text = re.sub(r"[^\w\s]", "", text)

print(cleaned_text)

 theres a shortage of truck drivers in the US and worldwide


# **** Expected end of the first hour ****


## Document Specific Splitting

### Markdown splitter

This code snippet demonstrates how to use LangChain's MarkdownTextSplitter to split a Markdown text document into smaller chunks. The MarkdownTextSplitter class is designed to handle Markdown-specific structure, making it easier to process and retrieve information from Markdown documents.
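To see what "Markdown-specific structure" means in practice, here is a minimal sketch that splits a document at its headings with a regular expression. The function name `split_markdown_by_heading` and the sample text are assumptions for illustration; this is not how `MarkdownTextSplitter` is implemented, and it ignores chunk size entirely:

```python
import re


def split_markdown_by_heading(markdown_text):
    """Split Markdown into sections, starting a new section at every heading line."""
    # Zero-width split right before any line that begins with 1 to 6 '#' and a space
    sections = re.split(r"^(?=#{1,6} )", markdown_text, flags=re.MULTILINE)
    return [section.strip() for section in sections if section.strip()]


sample = "# Title\n\nIntro\n\n## A\n\nText A\n\n## B\n\nText B\n"
sections = split_markdown_by_heading(sample)
for section in sections:
    print(repr(section))
```

A real splitter also has to respect a size limit and avoid treating `#` lines inside code fences as headings, which this sketch does not attempt.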

### 1. Import LangChain Components

- Import `MarkdownTextSplitter` from LangChain's text splitter module.

In [44]:
from langchain.text_splitter import MarkdownTextSplitter

### 2. Initialize the Text Splitter

- The MarkdownTextSplitter is initialized with a chunk_size of 40 and chunk_overlap of 0. This means each chunk will contain up to 40 characters, and there will be no overlap between chunks.

In [45]:
text_splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

In [46]:
markdown_text = """
# Fun in Califormia

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

### 3. Create Documents from Markdown Text

- The create_documents method of MarkdownTextSplitter is used to split the Markdown text into smaller chunks based on the specified chunk size.

In [47]:
text_splitter.create_documents([markdown_text])

[Document(page_content='# Fun in Califormia\n\n## Driving'),
 Document(page_content='Try driving on the 1 down to San Diego'),
 Document(page_content='### Food'),
 Document(page_content="Make sure to eat a burrito while you're"),
 Document(page_content='there'),
 Document(page_content='## Hiking\n\nGo to Yosemite')]

### Python splitter

In [48]:
from langchain.text_splitter import PythonCodeTextSplitter

python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)

for i in range(10):
    print(i)
"""

python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
python_splitter.create_documents([python_text])

[Document(page_content='class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age'),
 Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print(i)')]

In [49]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language


python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_text])
python_docs

[Document(page_content='class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age'),
 Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print(i)')]

### split code: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/code_splitter/

In [50]:
from semantic_router.encoders import HuggingFaceEncoder

encoder = HuggingFaceEncoder()



## Semantic Splitting

- StatisticalChunker (text)
- ConsecutiveChunker (text, audio)
- CumulativeChunker (text)

### StatisticalChunker

The statistical chunking method is the most robust chunking method in `semantic_chunkers`. It uses a varying similarity threshold to identify more dynamic and local similarity splits. It offers a good balance between accuracy and efficiency, but it can only be used for text documents (unlike the multi-modal ConsecutiveChunker).

The StatisticalChunker can automatically identify a good threshold value to use while chunking the text, so it tends to require less customization than the other chunkers.


In [51]:
from semantic_chunkers import StatisticalChunker

chunker = StatisticalChunker(encoder=encoder)

chunks = chunker(docs=[text])

100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.28it/s]


In [59]:
chunks[0][0].splits

[", there's a shortage of truck drivers in the US and worldwide."]

### Consecutive Chunking

Consecutive chunking is the simplest version of semantic chunking.
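The idea can be sketched as: compare each sentence with the one before it, and start a new chunk when their similarity drops below a threshold. The sketch below is an assumption-heavy illustration, not the `semantic_chunkers` implementation: `embed` is a toy bag-of-words encoder standing in for a real embedding model, and `consecutive_chunks` is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(sentence):
    # Toy "embedding": word counts. A real encoder would return a dense vector.
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def consecutive_chunks(sentences, score_threshold=0.3):
    """Group sentences; a new chunk starts when similarity to the previous sentence is too low."""
    chunks = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= score_threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Truck drivers are in short supply.",
    "Truck drivers are hard to hire.",
    "Burritos taste great in San Diego.",
]
grouped = consecutive_chunks(sentences)
print(grouped)
```

The two trucking sentences share enough words to stay together, while the burrito sentence starts a new chunk.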


In [60]:
from semantic_chunkers import ConsecutiveChunker

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

chunks = chunker(docs=[text])

100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 112.18it/s]
0it [00:00, ?it/s]


In [61]:
chunks[0][0].splits

[", there's a shortage of truck drivers in the US and worldwide."]

### Cumulative Chunking

Cumulative chunking is a more compute intensive process, but can often provide more stable results as it is more noise resistant. However, it is very expensive in both time and (if using APIs) money.

In [62]:
from semantic_chunkers import CumulativeChunker

chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)

chunks = chunker(docs=[text])

100%|████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]


In [63]:
chunks[0]

[Chunk(splits=[", there's a shortage of truck drivers in the US and worldwide."], is_triggered=False, triggered_score=None, token_count=None, metadata=None)]