<a href="https://colab.research.google.com/github/kairamilanifitria/PurpleBox-Intern/blob/main/02_13_Chunking_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document-Specific Chunking

In [1]:
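from langchain.text_splitter import MarkdownHeaderTextSplitter

# Load the Markdown source (the text contains non-ASCII characters, so read it as UTF-8)
with open("/content/PDF1 (2).md", "r", encoding="utf-8") as file:
    markdown_text = file.read()

# Header levels to split on, paired with the metadata key stored for each level
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# strip_headers=False keeps each header line inside its chunk's text
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)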
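
To make the idea behind header-based splitting concrete, here is a minimal plain-Python sketch. It is not LangChain's implementation: `split_on_headers` is a hypothetical helper written only for illustration, and it assumes headers appear at the start of a line followed by a space. Like `strip_headers=False`, it keeps the header line in the chunk text and records the active headers as metadata.

```python
def split_on_headers(text, headers_to_split_on):
    """Toy header splitter (illustrative only): returns (content, metadata) pairs."""
    chunks, lines, active = [], [], {}  # active: marker length -> (metadata key, title)

    def flush():
        # Close the chunk collected so far, tagging it with the headers in effect.
        body = "\n".join(lines).strip()
        if body:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append((body, metadata))
        lines.clear()

    for line in text.splitlines():
        for marker, name in headers_to_split_on:
            # The trailing space keeps "##" lines from matching the "#" marker.
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes any header at the same or a deeper level.
                for deeper in [k for k in active if k >= level]:
                    del active[deeper]
                active[level] = (name, line[len(marker):].strip())
                break
        lines.append(line)  # the header line stays in the chunk text
    flush()
    return chunks


# Hypothetical sample text, not taken from the notebook's input file.
sample = "# Title\nintro\n## A\ntext a\n## B\ntext b"
for content, metadata in split_on_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(metadata, repr(content))
```

Each chunk starts at a header and runs until the next header listed in `headers_to_split_on`, which is why chunk sizes depend entirely on how the document is structured.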

In [2]:
chunks = splitter.split_text(markdown_text)

In [3]:
for i, chunk in enumerate(chunks):
  print(f"chunk {i+1}: \n{chunk}")

chunk 1: 
page_content='## A Hype-Adjusted Probability Measure for NLP Stock Return Forecasting  
Zheng Cao ∗ zcao26@jh.edu  
Hélyette Geman †  
hgeman1@jhu.edu' metadata={'Header 2': 'A Hype-Adjusted Probability Measure for NLP Stock Return Forecasting'}
chunk 2: 
page_content='## Abstract  
This article introduces a Hype-Adjusted Probability Measure in the context of a new Natural Language Processing (NLP) approach for stock return and volatility forecasting. A novel sentiment score equation is proposed to represent the impact of intraday news on forecasting next-period stock return and volatility for selected U.S. semiconductor tickers, a very vibrant industry sector. This work improves the forecast accuracy by addressing news bias, memory, and weight, and incorporating shifts in sentiment direction. More importantly, it extends the use of the remarkable tool of change of Probability Measure developed in the finance of Asset Pricing to NLP forecasting by constructing a Hype-Adjusted

# Semantic Chunkers

### Semantic Chunking (KMeans Method)
Semantic chunking splits text by meaning rather than by fixed length or sentence boundaries. Related sentences are grouped into semantic units, which may vary in length. This is typically done by embedding each sentence and clustering the embeddings, so that sentences with similar meaning end up in the same chunk.

Use Case:
Best when meaningful units (e.g., paragraphs, topics) should stay together regardless of chunk size. This is useful for document summarization, knowledge extraction, and other advanced NLP tasks.

In [None]:
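from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Small general-purpose sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Naive sentence split on ". "; drop empty fragments
sentences = [s.strip() for s in markdown_text.split('. ') if s.strip()]

# One embedding vector per sentence, shape (n_sentences, embedding_dim)
embeddings = model.encode(sentences)

# Cluster the sentences into two groups (n_clusters is a tunable choice)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

# Collect the sentences that share a cluster label
chunks = {}
for sentence, label in zip(sentences, labels):
    chunks.setdefault(int(label), []).append(sentence)

# Rejoin with ". " to restore the separator removed by the split
for label, chunk in chunks.items():
    print(f"Semantic Chunk {label + 1}: {'. '.join(chunk)}")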
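
KMeans assigns a label to each sentence without regard to its position, so a cluster can gather sentences from distant parts of the document. If contiguous chunks are wanted instead, one option is to merge only runs of neighbouring sentences that share a label. The sketch below is an assumption layered on top of the notebook, not part of it: `contiguous_chunks` is a hypothetical helper, and the sentences and labels are made up to stand in for the KMeans output.

```python
from itertools import groupby


def contiguous_chunks(sentences, labels):
    """Merge runs of neighbouring sentences that share a cluster label."""
    merged = []
    pairs = zip(labels, sentences)
    # groupby only groups adjacent items, which is exactly the behaviour wanted here.
    for label, run in groupby(pairs, key=lambda pair: pair[0]):
        merged.append((label, ". ".join(sentence for _, sentence in run)))
    return merged


# Hypothetical sentences and labels, standing in for the KMeans output.
demo_sentences = ["Stocks rose", "Volatility fell", "We use BERT", "Returns improved"]
demo_labels = [0, 0, 1, 0]
for label, text in contiguous_chunks(demo_sentences, demo_labels):
    print(f"cluster {label}: {text}")
```

With these labels the last sentence forms its own chunk even though it shares label 0 with the first two, because the run is interrupted by a sentence from cluster 1.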

In [16]:
# Number of semantic chunks (clusters) produced in the previous step

print(len(chunks))


2


In [17]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
from sentence_transformers import SentenceTransformer
import numpy as np
import torch
import re

# Load Markdown file
with open("/content/PDF1 (2).md", "r") as file:
    markdown_text = file.read()

# Step 1: Document-Specific Chunking
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
chunks = splitter.split_text(markdown_text)


In [18]:
# Step 2: Identify long text chunks (excluding tables and images)
def is_table_or_image(chunk):
    # Simple heuristic: a Markdown image with empty alt text, or any pipe character
    return "![](" in chunk or "|" in chunk

def needs_semantic_chunking(chunk, max_tokens=512):
    # Length is approximated by whitespace-separated words, not model tokens
    return not is_table_or_image(chunk) and len(chunk.split()) > max_tokens
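
The check in `is_table_or_image` is deliberately simple: it misses images that have alt text and treats any `|` as a table, including a pipe inside a formula. A slightly stricter variant could use regular expressions, as sketched below. The names `IMAGE_RE`, `TABLE_ROW_RE`, and `looks_like_table_or_image` are hypothetical, and the patterns assume standard Markdown image and pipe-table syntax.

```python
import re

# ![alt](path), where the alt text may be empty
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# A line that starts and ends with a pipe, as in a Markdown table row
TABLE_ROW_RE = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)


def looks_like_table_or_image(chunk):
    """Stricter variant of the table/image heuristic (illustrative only)."""
    return bool(IMAGE_RE.search(chunk) or TABLE_ROW_RE.search(chunk))


print(looks_like_table_or_image("![chart](fig1.png)"))         # image with alt text
print(looks_like_table_or_image("| a | b |\n|---|---|"))       # Markdown table
print(looks_like_table_or_image("the density P(x|y) is ..."))  # pipe inside prose
```

The first two calls are detected as image and table, while the inline pipe in the third is left as ordinary text that can still be sent to semantic chunking.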

In [19]:
# Load Hugging Face Embedding Model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def semantic_split(text, max_sentences=5, threshold=0.5):
    # NOTE: max_sentences is accepted for compatibility but is not used below
    sentences = re.split(r'(?<=[.!?])\s+', text)  # Split after ., ! or ?
    if len(sentences) < 2:
        return [s for s in sentences if s]  # Nothing to compare

    embeddings = model.encode(sentences, convert_to_tensor=True)

    # Cosine similarity between each sentence and the one that follows it
    similarities = torch.nn.functional.cosine_similarity(embeddings[:-1], embeddings[1:])

    # Start a new chunk at sentence i+1 when its similarity to sentence i is below the threshold
    split_points = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]  # Tune threshold as needed

    # Generate new chunks
    sub_chunks, start = [], 0
    for split in split_points:
        sub_chunks.append(" ".join(sentences[start:split]))
        start = split
    sub_chunks.append(" ".join(sentences[start:]))  # Add remaining text

    return [c for c in sub_chunks if c]
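
The split-point logic in `semantic_split` can be traced without loading a model. The sketch below repeats only that step on made-up similarity values. `split_by_similarity` is a hypothetical helper and the numbers are invented for illustration, not measured from the document.

```python
def split_by_similarity(sentences, similarities, threshold=0.5):
    """Start a new chunk after sentence i whenever similarities[i] < threshold."""
    split_points = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]
    chunks, start = [], 0
    for split in split_points:
        chunks.append(" ".join(sentences[start:split]))
        start = split
    chunks.append(" ".join(sentences[start:]))  # remaining sentences
    return [c for c in chunks if c]


demo_sentences = ["A.", "B.", "C.", "D."]
# Hypothetical similarity for each adjacent pair: (A, B), (B, C), (C, D)
demo_similarities = [0.8, 0.3, 0.7]
print(split_by_similarity(demo_sentences, demo_similarities))
```

Only the pair (B, C) falls below 0.5, so the text is cut once, between B and C. Raising the threshold produces more, shorter chunks, and lowering it produces fewer, longer ones.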

In [21]:
# Step 3: Apply Semantic Chunking where needed
final_chunks = []
for chunk in chunks:
    text = chunk.page_content  # Extract text from Document object
    if needs_semantic_chunking(text):
        final_chunks.extend(semantic_split(text))
    else:
        final_chunks.append(text)

# Print final chunks
for i, chunk in enumerate(final_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n---\n")

Chunk 1:
## A Hype-Adjusted Probability Measure for NLP Stock Return Forecasting  
Zheng Cao ∗ zcao26@jh.edu  
Hélyette Geman †  
hgeman1@jhu.edu
---

Chunk 2:
## Abstract  
This article introduces a Hype-Adjusted Probability Measure in the context of a new Natural Language Processing (NLP) approach for stock return and volatility forecasting. A novel sentiment score equation is proposed to represent the impact of intraday news on forecasting next-period stock return and volatility for selected U.S. semiconductor tickers, a very vibrant industry sector. This work improves the forecast accuracy by addressing news bias, memory, and weight, and incorporating shifts in sentiment direction. More importantly, it extends the use of the remarkable tool of change of Probability Measure developed in the finance of Asset Pricing to NLP forecasting by constructing a Hype-Adjusted Probability Measure, obtained from a redistribution of the weights in the probability space, meant to correct for exces