## Chunking Strategy


In [None]:
import json
import nltk
from nltk import word_tokenize
import re
import contractions
from langchain.text_splitter import TokenTextSplitter

* **Function: `chunk_text_file`**

  * Initializes a `TokenTextSplitter` from LangChain with:

    * `encoding_name`: which tokenizer to use (`cl100k_base`).
    * `chunk_size`: how many tokens per chunk (default 512).
    * `chunk_overlap`: how many tokens overlap between chunks (default 50).
  * Splits the text into token-based chunks using `text_splitter.split_text`.
  * Prints:

    * Approximate word count of the original text.
    * Number of chunks generated.
  * Returns the list of chunks.

* **Function: `save_chunks_to_json`**

  * Iterates over the chunks list.
  * For each chunk, creates a dictionary with:

    * `chunk_id`: sequential ID (starting from 1).
    * `text`: the chunked text content.
    * `token_count`: rough size estimate (a whitespace word count from `.split()`, not an exact `cl100k_base` token count).
  * Collects all chunk dictionaries into a list under `"chunks"`.
  * Saves the data as a formatted JSON file (`output_file`).
  * Prints confirmation of where the chunks were saved.

* **Main Execution (`if __name__ == "__main__":`)**

  * Defines the input file path (`cleaned_text_content.txt`) and output file path (`chunks.json`).
  * Calls `chunk_text_file` to generate token-based chunks.
  * Calls `save_chunks_to_json` to store the chunks into a JSON file.
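
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of sliding-window chunking over a token list. It is an illustration only, not LangChain's actual implementation: the helper name `sliding_window_chunks` is hypothetical, and plain integers stand in for `cl100k_base` token IDs.

```python
def sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):  # the last window reached the end of the input
            break
        start += stride
    return chunks


# Toy input: integers stand in for token IDs (an assumption for illustration).
toy_tokens = list(range(12))
toy_chunks = sliding_window_chunks(toy_tokens, chunk_size=5, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Under this sketch, the defaults used in this notebook (512 and 50) would make each new chunk start 462 tokens after the previous one, with its first 50 tokens repeating the tail of the previous chunk.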


In [None]:
def chunk_text_file(file_path, chunk_size=512, chunk_overlap=50, encoding_name="cl100k_base"):
    """
    Chunk a text file into fixed-size chunks based on tokens using LangChain

    Args:
        file_path (str): Path to the text file
        chunk_size (int): Number of tokens per chunk
        chunk_overlap (int): Number of tokens to overlap between chunks
        encoding_name (str): Tokenizer encoding (cl100k_base works with OpenAI models)

    Returns:
        list: List of text chunks
    """
    # Read the file content
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Initialize LangChain TokenTextSplitter
    text_splitter = TokenTextSplitter(
        encoding_name=encoding_name,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    # Split text into chunks
    chunks = text_splitter.split_text(text)

    print(f"Original length: {len(text.split())} words (approx)")
    print(f"Number of chunks created: {len(chunks)}")
    return chunks


def save_chunks_to_json(chunks, output_file="chunks.json"):
    """
    Save chunks to a JSON file with metadata
    """
    chunks_data = []
    for i, chunk in enumerate(chunks):
        chunk_info = {
            "chunk_id": i + 1,
            "text": chunk,
            # Whitespace word count; only approximates the tokenizer's token count
            "token_count": len(chunk.split())
        }
        chunks_data.append(chunk_info)

    output_data = {"chunks": chunks_data}

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)

    print(f"\nChunks saved to {output_file}")


if __name__ == "__main__":
    file_path = "../Scraping_Digico_Website/scraped_data/cleaned_text_content.txt"
    output_json = "./data/chunks.json"

    # Fixed-size token chunking
    chunks = chunk_text_file(file_path, chunk_size=512, chunk_overlap=50)
    save_chunks_to_json(chunks, output_json)


Original length: 84155 words (approx)
Number of chunks created: 227

Chunks saved to ./data/chunks.json


## Data Enrichment

In NLP, removing stopwords (common words like *the, is, and, of*) helps reduce noise and focus on the more meaningful terms in a text. Because stopwords occur very frequently but carry little semantic value, removing them can make text processing more efficient, reduce the size of the data, and improve the performance of models that rely on informative words for tasks like semantic search.

Note: some stopwords, such as "not", can change the meaning of a sentence. For that reason I defined a list of stopwords that should not be removed, because they play a crucial role in building the meaning.

Another note: I know this list does not contain every word that carries meaning, but it is still better than removing all stopwords.
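
The effect of protecting negation words can be seen in a small sketch. The stopword set below is a tiny hand-written stand-in, not NLTK's list, and the names `toy_stop_words`, `protected_words`, and `drop_stopwords` are hypothetical.

```python
# Tiny hand-written stand-in for a stopword list (assumption: NOT NLTK's list).
toy_stop_words = {"the", "is", "a", "of", "and", "not", "this"}

# Words that carry meaning and should survive filtering.
protected_words = {"not"}


def drop_stopwords(text, stop_words):
    """Remove every whitespace-separated word that appears in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)


sentence = "this laptop is not a good choice"

# Removing every stopword flips the meaning of the sentence.
naive = drop_stopwords(sentence, toy_stop_words)

# Protecting "not" keeps the negation intact.
careful = drop_stopwords(sentence, toy_stop_words - protected_words)

print(naive)    # laptop good choice
print(careful)  # laptop not good choice
```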

In [99]:
from nltk.corpus import stopwords

# Load standard English stopwords (requires a one-time nltk.download("stopwords"))
stop_words = set(stopwords.words("english"))

# Keep stopwords that change meaning: negations, intensifiers, conjunctions,
# quantifiers, and modal verbs
negation_words = {
    "not", "no", "nor", "never", "none", "nothing", "neither", "nowhere",
    "hardly", "scarcely", "barely",
    "very", "too", "just", "only", "even", "almost",
    "but", "however", "although", "though", "yet", "before", "after",
    "until", "since", "because", "if", "unless", "while",
    "all", "any", "few", "most", "some", "many", "several", "more", "less",
    "can", "could", "should", "would", "may", "might", "must", "shall",
}
stop_words = stop_words - negation_words

def remove_stopwords(text):
    words = text.split()
    filtered = [w for w in words if w not in stop_words]
    return " ".join(filtered)


Punctuation introduces noise and does not help build meaning, so it needs to be removed. (While writing this documentation I noticed that I could have used a library to catch punctuation.)

In addition, the output of `TokenTextSplitter` contains strings that add noise. Examples:
- `'\n\n'`
- `'\n'`
- `'\'`

Also:
- I used the `contractions` library to expand contractions first (isn't -> is not, don't -> do not, etc.).
- I applied lowercasing, because the embedding vector of "Laptop" differs from that of "laptop", which can cause confusion.
- Many lines start with ". ", so I removed that prefix.
- I normalized whitespace.
- I considered lemmatization, but for the baseline version I chose not to overcomplicate the approach.
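
As noted above, a library could handle the punctuation instead of a hand-written replacement list. One possible sketch uses Python's standard `string.punctuation` with `str.translate`. The function name `strip_punctuation` is hypothetical, and this is an alternative to the approach in this notebook rather than what the notebook runs. Because it also removes apostrophes, it would need to run after contraction expansion.

```python
import re
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text):
    """Replace ASCII punctuation with spaces, then collapse repeated whitespace."""
    no_punct = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


sample = '. Our services:\n\nweb design, "SEO"; and hosting.'
print(strip_punctuation(sample))  # Our services web design SEO and hosting
```

Unlike the replacement list, this sketch also strips punctuation inside tokens such as decimal numbers or URLs, which may or may not be desirable for the scraped content.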

In [None]:
from pathlib import Path

# Input/output paths
input_path = Path("./data/chunks.json")
output_path = Path("./data/chunks_cleaned.json")

# Replacement rules
replacements = [
    ('\n\n', ' '),
    ('\n', ' '),
    ('\"', ''),
    ('. ', ' '),
    (', ', ' '),
    ('; ', ' '),
    (': ', ' ')
]

def clean_text(text):
    cleaned = text

    # Expand contractions first (isn't -> is not, don't -> do not, etc.)
    cleaned = contractions.fix(cleaned)
    
    # If the text starts with ". ", remove it (before the replacements below
    # turn that ". " into a plain space)
    cleaned = re.sub(r'^\.\s+', '', cleaned)

    # Apply simple replacements
    for old, new in replacements:
        cleaned = cleaned.replace(old, new)

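    # Normalize whitespace
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()

    # Lowercase so the text matches NLTK's lowercase stopword list
    cleaned = cleaned.lower()

    # Remove stopwords, keeping the protected words defined earlier
    cleaned = remove_stopwords(cleaned)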

    return cleaned

def main():
    # Load original file
    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    cleaned_chunks = {"chunks": []}

    for chunk in data.get("chunks", []):
        original_text = chunk.get("text", "")
        cleaned_text = clean_text(original_text)

        cleaned_chunks["chunks"].append({
            "chunk_id": chunk.get("chunk_id"),
            "original_text": original_text,
            "cleaned_text": cleaned_text
        })

    # Save new file
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(cleaned_chunks, f, ensure_ascii=False, indent=2)
    print(f"The output is saved to {output_path}")

if __name__ == "__main__":
    main()

The output is saved to data\chunks_cleaned.json
