# CHUNKING STRATEGY
RULES OF CHUNKING - 
1. Process every document separately.
2. Remove everything that precedes 'PART I' (Business Strategy), along with the 'Table of Contents' and page numbers.
3. Chunk all lowest-level headings. Implement a 20% overlap.
4. Merge two chunks if one has less than 200 tokens.
5. Add the full hierarchical path of all headings/subheadings and document metadata to the metadata tag.
(e.g., document_id, fiscal_year, part, section_1, section_2, section_3)
6. Send the chunks along with metadata information to the LLM.
7. Architectural layer: implement a parent-child retrieval mechanism in which search runs on the fine-grained chunks, but all chunks under the same second-level heading are combined and sent to the LLM.
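
Rule 7 can be illustrated with a minimal sketch. It assumes records shaped like the output of this notebook (`ID`, `Chunks.parent_chunk`, `Chunks.child_chunks`). The function name `retrieve_parents` and the word-overlap score are hypothetical stand-ins for a real embedding search; they only show how a hit on a child chunk is mapped back to its parent.

```python
def retrieve_parents(query, records, top_k=2):
    """Score the fine-grained child chunks, then return the parent chunks of the best hits."""
    query_terms = set(query.lower().split())
    scored = []
    for record in records:
        children = record["Chunks"].get("child_chunks") or {}
        # Records without child chunks (tables, short text) are scored on the parent text.
        candidates = list(children.values()) or [record["Chunks"]["parent_chunk"]]
        # Hypothetical relevance score: number of query words shared with the chunk.
        best = max(len(query_terms & set(c.lower().split())) for c in candidates)
        scored.append((best, record["ID"]))

    # Highest score first; ties are broken by the lower chunk ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    by_id = {record["ID"]: record for record in records}
    return [by_id[rid]["Chunks"]["parent_chunk"] for score, rid in scored[:top_k] if score > 0]
```

Each parent is returned at most once, even when several of its children match, so the LLM receives the wider context without duplicated text.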


## 10-K CHUNKING

### Algorithm - 
1. Iterate through every item one by one.
2. Initialize a dynamic metadata dictionary for every chunk with 'parent_header': '{item_no: text}', 'form_type': '10_K', 'Company': '{company name}', 'filing_date': '{date}'.
3. If an element is a header, prefix it with '--- heading' and join it back to its parent chunk.
4. Every chunk should contain a maximum of 250 words.
5. Flag tables in the METADATA of the table text and keep each table as a standalone chunk, regardless of its size.
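
The steps above can be sketched as follows. This is a simplified illustration, not the implementation used later in the notebook: the input is assumed to be a list of `(kind, text)` pairs with `kind` in `"header"`, `"text"` or `"table"`, and the names `split_by_words` and `build_chunks` are hypothetical.

```python
def split_by_words(text, max_words=250):
    """Split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_chunks(elements, item_no, item_title, company, filing_date, max_words=250):
    """Turn (kind, text) elements of one item into word-capped chunks with metadata."""
    base = {
        "parent_header": {item_no: item_title},
        "form_type": "10_K",
        "Company": company,
        "filing_date": filing_date,
        "is_table": False,
    }
    chunks, buffer = [], []

    def flush():
        # Emit the buffered text as one or more chunks of at most max_words words.
        if buffer:
            for piece in split_by_words(" ".join(buffer), max_words):
                chunks.append({"Metadata": dict(base), "text": piece})
            buffer.clear()

    for kind, text in elements:
        if kind == "table":
            # Tables stay standalone, whatever their size.
            flush()
            chunks.append({"Metadata": {**base, "is_table": True}, "text": text})
        elif kind == "header":
            # Headers are marked and kept with the text that follows them.
            buffer.append("--- " + text)
        else:
            buffer.append(text)
    flush()
    return chunks
```

Keeping the header marker inside the chunk text means the heading stays attached to its content even after the chunk is split by word count.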

### Cleaning Chunks -
1. Remove footers that match the pattern '.* \| .* \| [0-9]+'.
2. Remove text that contains only a single number ('[0-9]+'); this removes page numbers.
3. Lowercase the text and remove it if it contains '[a-z]* inc.' and has fewer than 3 words.
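
A minimal sketch of these three checks, assuming footers look like `<company> | <period> Form 10-K | <page>`. The helper name `is_noise` and the exact patterns are illustrative; the patterns used by the notebook itself are defined in `chunk_document`.

```python
import re

FOOTER = re.compile(r".+ \| .+ \| \d+")      # e.g. 'Apple Inc. | 2024 Form 10-K | 21'
PAGE_NUMBER = re.compile(r"\d+")             # a block that is only a number
COMPANY_LINE = re.compile(r"[a-z]* inc\.")   # e.g. 'apple inc.'


def is_noise(text):
    """Return True for footers, bare page numbers and short company-name lines."""
    stripped = text.strip()
    if FOOTER.fullmatch(stripped) or PAGE_NUMBER.fullmatch(stripped):
        return True
    lowered = stripped.lower()
    # Only drop the company name when it stands (almost) alone on the line.
    return bool(COMPANY_LINE.search(lowered)) and len(lowered.split()) < 3
```

The word-count guard in the last rule matters: a sentence that merely mentions the company name is kept.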

### Further Refinements (Semantic Chunking) - 
1. For every heading, apply steps 2 to 4.
2. Check whether the dictionary of headings is empty; if so, add the heading to the dictionary at level 1.
3. Otherwise, find the maximum heading level and add the heading at level max + 1.
4. If the previous element was content and has at least 10 words, append the current chunk as a new chunk, empty curr_chunk, empty the heading dictionary, and add the new heading at level 1.
5. While appending content, check whether the chunk has reached 200 words; if so, append the chunk, otherwise keep adding content.
6. If more than 3 headings occur consecutively, continue until the next content appears and append all of the headings as content.
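
The heading bookkeeping in steps 1 to 4 and step 6 can be sketched as below. This is a reduced illustration under assumptions: the input is a list of `(is_header, text)` pairs, the function name `group_by_headings` is hypothetical, and the 200-word cap from step 5 is left out to keep the example short.

```python
def group_by_headings(elements, min_words=10, max_heading_run=3):
    """Attach each run of content to the headings that precede it."""
    chunks = []
    headings = {}   # level number -> heading text for the chunk being built
    content = []    # content collected under the current headings
    prev_was_header = False

    for is_header, text in elements:
        if is_header:
            body = " ".join(content)
            # A heading after enough content closes the current chunk (step 4).
            if len(body.split()) >= min_words:
                chunks.append({"sub_headings": dict(headings), "text": body})
                headings.clear()
                content.clear()
            # Level 1 if the dictionary is empty, otherwise max + 1 (steps 2 and 3).
            headings[max(headings, default=0) + 1] = text
        else:
            # A long run of headings is stored as content instead of metadata (step 6).
            if prev_was_header and len(headings) > max_heading_run:
                chunks.append({"sub_headings": {}, "text": "\n".join(headings.values())})
                headings.clear()
            content.append(text)
        prev_was_header = is_header

    if content:
        chunks.append({"sub_headings": dict(headings), "text": " ".join(content)})
    return chunks
```

Long heading runs usually come from lists or index-like pages, which is why they are kept as searchable text rather than as nested metadata.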

In [None]:
from edgar import set_identity, Filing, find, use_local_storage, set_local_storage_path
import pandas as pd
import re
import os
from dotenv import load_dotenv
import math
from edgar.files.html_documents import TextBlock, TableBlock
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document

In [None]:
CHUNK_COUNTER = 1
TENK_TENQ_CHUNKS = []
TENK_CHUNKS = []
TENQ_CHUNKS = []
TOTAL_CHUNKS_TENK = 0
TOTAL_CHUNKS_TENQ = 0
#Base paths to the 10-K and 10-Q filing metadata
BASE_PATH_TENK = 'path_to tenk_metadata'
BASE_PATH_TENQ = 'path_to tenq_metadata'

load_dotenv()
set_identity(os.getenv("SEC_ID"))
use_local_storage(os.getenv("SEC_CACHE"))

In [None]:

COMPANY_NAMES = [
        "Apple Inc.",
        "MICROSOFT CORP",
        "Alphabet Inc.",
        "AMAZON COM INC.",
        "Meta Platforms, Inc.",
        "NVIDIA CORP",
        "Tesla, Inc.",
        "ORACLE CORP",
        "Salesforce, Inc.",
        "NETFLIX INC",
        "ADOBE INC."
    ]

TICKERS = {
        320193 : "AAPL",
        789019: "MSFT",
        1652044: "GOOGL",
        1018724: "AMZN",
        1326801 : "META",
        1045810: "NVDA",
        1318605 : "TSLA",
        1341439 : "ORCL",
        1108524 : "CRM",
        1065280 : "NFLX",
        796343 : "ADBE"
    }



#Post-processing the existing chunks: assign IDs and split long parent chunks into child chunks
def post_processing_chunks(all_chunks):
    global CHUNK_COUNTER
    all_chunks_processed = all_chunks.copy()
    splitter = SentenceSplitter(
                chunk_size=512,
                chunk_overlap=100
            )
    for chunk in all_chunks_processed:
        chunk['ID'] = CHUNK_COUNTER
        CHUNK_COUNTER+=1
        if chunk["Metadata"]:
            if chunk["Metadata"]['is_table']:
                chunk["Chunks"]['child_chunks'] = {}
                continue
            text = chunk["Chunks"]["parent_chunk"]
            chunk["Chunks"]['child_chunks'] = {}
            documents = [Document(text = text)]
            nodes = splitter.get_nodes_from_documents(documents)
            for i,node in enumerate(nodes):
                if text == node.text:
                    break
                if i==0:
                    if len(text.split()) - len(node.text.split()) > 100:
                        chunk["Chunks"]['child_chunks'][f"{i}"] = node.text
                    else:
                        break
                else:
                    if len(node.text.split()) > 50:
                        chunk["Chunks"]['child_chunks'][f"{i}"]=node.text
                    else:
                        chunk["Chunks"]['child_chunks'][f"{i-1}"]+=node.text
    return all_chunks_processed


#is_table returns True when the first 30% of all_items all appear in the table's text
def get_table_of_contents(chunk_obj):
    def is_table(table,all_items):
        q_items = all_items[:math.ceil(len(all_items)*0.3)]
        table = table.get_text().lower()
        for item in q_items:
            if item.lower() not in table:
                return False
        return True
    
    all_items = chunk_obj.list_items()
    table_of_contents = pd.DataFrame()
    MAX_ITEMS = -1
    ITEM_COLUMN_INDEX = -1
    for table in chunk_obj.tables():
        if is_table(table, all_items):
            table_of_contents = table.to_dataframe()
            #Naming the 'Items' columns and 'Headings' columns
            for i in range(table_of_contents.shape[1]):
                NUM_ITEMS = sum('item' in item.lower() for item in table_of_contents.iloc[:,i])
                if NUM_ITEMS > MAX_ITEMS:
                    ITEM_COLUMN_INDEX = i
                    MAX_ITEMS = NUM_ITEMS
            
            #naming the columns
            columns = [''] * table_of_contents.shape[1]
            columns[ITEM_COLUMN_INDEX] = 'Index'
            columns[ITEM_COLUMN_INDEX + 1] = 'Headings'

            #setting the columns to datarame's column att
            table_of_contents.columns = columns
            table_of_contents = table_of_contents.set_index('Index')

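            return table_of_contents

    #No table matched: return the empty DataFrame so callers can check `.empty`
    return table_of_contents
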
def get_heading_dict(all_items, table_of_contents):
    #Dictionary mapping each item to 'item: heading'
    heading_dict = {}

    #If the table of contents is empty, map every item to itself
    if table_of_contents.empty :
        for item in all_items:
            heading_dict[item] = item
        return heading_dict

    for item in all_items:
        for t_item in table_of_contents.index:
            #Escape the item and allow trailing punctuation such as 'Item 1.'
            if re.fullmatch(re.escape(item.lower()) + r'\W*', t_item.lower()):
                heading_dict[item] = item + ': ' + table_of_contents.loc[t_item,'Headings']
                break
    return heading_dict
    

def chunk_document(chunk_obj, filing_date, company, item_heading_dictionary, form_type = '10-K'):
    regex1 = re.compile(
                    r"""
                        \n*                     # optional newlines at start
                        .*?                     # company name
                        \s\|\s                  # ' | '
                        .*?                     # Any text like 'Q2'
                        \d{4}                   # 4-digit year
                        \sForm\s10-K            # ' Form 10-K'
                        \s\|\s                  # ' | '
                        \d+                     # page number
                        \n*                     # optional newlines at end
                        $""",
                    re.VERBOSE,
                    )
    # Regex 2: Matches only digits
    # ^\d+$
    regex2 = re.compile(r"\n*\d+\n*$")

    metadata = {'item_heading':None, 'filing_date':filing_date, 'company':company, 'form':form_type, 'is_table':False}
    #Get all the items present in the table
    items = chunk_obj.list_items()

    all_chunks = []
    processed_chunk = []
    count_of_words = 0
    for item in items:
        metadata['item_heading'] = item_heading_dictionary.get(item, item)
        chunks_for_item = chunk_obj.chunks_for_item(item)
        Headings = {}
        prev_content = False
        prev_heading = False
        metadata_chunk = metadata.copy()
        count_of_words = 0
        for chunk in chunks_for_item:
            for element in chunk:
                #We want only text or tables
                if isinstance(element, (TextBlock)):
                    if element.is_header:
                        if regex1.fullmatch(element.get_text()) or regex2.fullmatch(element.get_text()):
                            Headings = {}
                            continue

                        text = element.get_text().strip('\n').strip(' ')
                        if item.lower() in text.lower() or 'part' in text.lower():
                            continue
                        if prev_content and processed_chunk:
                            chunk_text = ' '.join(processed_chunk)
                            count_of_words = 0
                            if len(chunk_text.split()) > 20:
                                all_chunks.append({
                                                    'Metadata': metadata_chunk.copy(),
                                                    'Chunks': {"parent_chunk":chunk_text}
                                                })
                            processed_chunk = []
                            Headings = {}
                        
                        if Headings:
                            max_level = max(list(Headings.keys()))
                            Headings[max_level+1] = text
                        else:
                            Headings[1] = text
                        
                        prev_content = False
                        prev_heading = True
                    
                    else:
                        if prev_heading:
                            metadata_chunk = metadata.copy()
                            if len(Headings) > 3:
                                chunk_text = ''
                                for key, value in Headings.items():
                                    chunk_text+=value + '\n'
                                all_chunks.append({
                                                    'Metadata': metadata_chunk,
                                                    'Chunks': {"parent_chunk":chunk_text}
                                                })
                            else:
                                metadata_chunk['sub_headings'] = {}
                                for key, value in Headings.items():
                                    metadata_chunk['sub_headings']['level_'+str(key)+'_heading'] = value

                        text = element.get_text().strip('\n').strip(' ')
                        processed_chunk.append(text)
                        count_of_words += len(text.split())
                        prev_content = True
                        prev_heading = False

                elif isinstance(element, TableBlock):
                    metadata_chunk = metadata.copy()
                    if prev_heading:
                        if len(Headings) > 3:
                            chunk_text = ''
                            for key, value in Headings.items():
                                chunk_text+=value + '\n'
                            all_chunks.append({
                                                'Metadata': metadata_chunk,
                                                'Chunks': {"parent_chunk":chunk_text}
                                            })
                        else:
                            metadata_chunk['sub_headings'] = {}
                            for key, value in Headings.items():
                                metadata_chunk['sub_headings']['level_'+str(key)+'_heading'] = value
                    
                    chunk_text = ''
                    if prev_content:
                        text = '\n'.join(processed_chunk)
                        if len(text.split()) < 20 and len(text.split()) > 5:
                            chunk_text = text

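                    chunk_text += element.get_text()
                    metadata_chunk['is_table'] = True
                    all_chunks.append({
                                        'Metadata': metadata_chunk,
                                        'Chunks': {"parent_chunk":chunk_text}
                                    })
                    #Use a fresh metadata dict so the text that follows is not flagged as a table
                    metadata_chunk = metadata_chunk.copy()
                    metadata_chunk['is_table'] = False
                    prev_content = True
                    prev_heading = False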

        if processed_chunk:
            count_of_words = 0
            chunk_text = '\n'.join(processed_chunk)
            if len(chunk_text.split()) > 20:
                all_chunks.append({
                                    'Metadata': metadata_chunk,
                                    'Chunks': {"parent_chunk":chunk_text}
                                })
            processed_chunk = []

    return all_chunks


def get_chunks(filing_date: str,ticker:str, company: str, cik: int, accession_no:str) -> list:
    filing = Filing(
            form='10-K',
            filing_date= filing_date,
            company= ticker,
            cik=cik,
            accession_no= accession_no
        )

    obj = filing.obj()
    chunk_obj = obj.chunked_document

    table_of_contents = get_table_of_contents(chunk_obj)
    item_heading_dictionary = get_heading_dict(sorted(chunk_obj.list_items()), table_of_contents)
    all_chunks = chunk_document(chunk_obj, filing_date, company, item_heading_dictionary, form_type = '10-K')
    
    print(f"Processed {ticker}, {filing_date}")
    return all_chunks

def main():
    global TOTAL_CHUNKS_TENK
    for (cik, ticker),company in zip(TICKERS.items(),COMPANY_NAMES):
        meta_df_path = os.path.join(BASE_PATH_TENK, ticker, '10-K.csv')
        meta_df = pd.read_csv(meta_df_path)
        meta_df = meta_df[pd.to_datetime(meta_df['filing_date']).dt.year > 2023]

        for _, row in meta_df.iterrows():
            filing_date = row['filing_date']
            accession_no = row['accession_number']
            all_chunks = get_chunks(filing_date, ticker, company, cik, accession_no)
            TOTAL_CHUNKS_TENK+=len(all_chunks)
            all_chunks_processed = post_processing_chunks(all_chunks)
            
            chunk_det = {
                'filing_date' : filing_date,
                'ticker' : ticker,
                'file_chunks' : all_chunks_processed
            }
            
            TENK_CHUNKS.append(chunk_det)
if __name__ == '__main__':
    main()


Processed AAPL, 2025-10-31
Processed AAPL, 2024-11-01
Processed MSFT, 2025-07-30
Processed MSFT, 2024-07-30
Processed GOOGL, 2025-02-05
Processed GOOGL, 2024-01-31
Processed AMZN, 2025-02-07
Processed AMZN, 2024-02-02
Processed META, 2025-01-30
Processed META, 2024-02-02
Processed NVDA, 2025-02-26
Processed NVDA, 2024-02-21
Processed TSLA, 2025-04-30
Processed TSLA, 2025-01-30
Processed TSLA, 2024-01-29
Processed ORCL, 2025-06-18
Processed ORCL, 2024-06-20
Processed CRM, 2025-03-05
Processed CRM, 2024-03-06
Processed NFLX, 2025-01-27
Processed NFLX, 2024-01-26
Processed ADBE, 2025-01-13
Processed ADBE, 2024-01-17


### SAVE 10K Chunks

In [None]:
import json

# Use 'w' mode for writing text
with open("./tenk_chunks.json", "w", encoding="utf-8") as f:
    json.dump(TENK_CHUNKS, f, indent=4)

## 10-Q CHUNKING

### Algorithm - 
1. Iterate over Parts one by one.
2. Check in headings and tables for 'Item' headings and match with the Parts' items.
3. Initialize a dynamic metadata dictionary for every chunk with 'part_header': '{part_no, heading_text}', 'parent_header': '{item_no, heading_text}', 'form_type': '10_Q', 'Company': '{company name}', 'filing_date': '{date}'.
4. If an element is a header, prefix it with '--- heading' and join it back to its parent chunk.
5. Every chunk should contain a maximum of 250 words.
6. Every table should be a separate chunk.
7. Flag tables in the METADATA of the table text and keep each table standalone, regardless of its size.
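
Because item numbers repeat across the two parts of a 10-Q (both parts have an 'Item 1'), the heading dictionary has to be keyed by part first. A minimal sketch of that grouping, assuming the table of contents is already available as `(index, heading)` rows; the function name `parts_from_toc` is hypothetical and the real notebook code handles more layout variants.

```python
import re


def parts_from_toc(rows):
    """Group (index, heading) table-of-contents rows into Part I / Part II."""
    parts = {}
    current = None
    for index, heading in rows:
        label = index.strip(" .").lower()
        part_match = re.fullmatch(r"part (i{1,2})", label)
        if part_match:
            # A 'PART I' / 'PART II' row opens a new part.
            current = "Part " + part_match.group(1).upper()
            parts[current] = {"Heading": f"{current}: {heading.strip()}", "Items": {}}
        elif label.startswith("item") and current is not None:
            # Item rows are stored under the part that is currently open.
            item = index.strip(" .")
            parts[current]["Items"][item] = f"{item}: {heading.strip()}"
    return parts
```

The resulting structure has the same shape as the dictionary that `chunk_document` reads below: a `Heading` string and an `Items` mapping per part.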

### Cleaning Chunks.
1. Remove footers that match the pattern '.* \| .* \| [0-9]+'.
2. Remove text that contains only a single number ('[0-9]+'); this removes page numbers.
3. Lowercase the text and remove it if it contains '[a-z]* inc.' and has fewer than 3 words.

In [None]:
metadata_csv = pd.DataFrame()


COMPANY_NAMES = [
        "Apple Inc.",
        "MICROSOFT CORP",
        "Alphabet Inc.",
        "AMAZON COM INC.",
        "Meta Platforms, Inc.",
        "NVIDIA CORP",
        "Tesla, Inc.",
        "ORACLE CORP",
        "Salesforce, Inc.",
        "NETFLIX INC",
        "ADOBE INC."
    ]

TICKERS = {
        320193 : "AAPL",
        789019: "MSFT",
        1652044: "GOOGL",
        1018724: "AMZN",
        1326801 : "META",
        1045810: "NVDA",
        1318605 : "TSLA",
        1341439 : "ORCL",
        1108524 : "CRM",
        1065280 : "NFLX",
        796343 : "ADBE"
    }


#Post-processing the existing chunks: assign IDs and split long parent chunks into child chunks
def post_processing_chunks(all_chunks):
    global CHUNK_COUNTER
    all_chunks_processed = all_chunks.copy()
    splitter = SentenceSplitter(
                chunk_size=512,
                chunk_overlap=100
            )
    for chunk in all_chunks_processed:
        chunk['ID'] = CHUNK_COUNTER
        CHUNK_COUNTER+=1
        if chunk["Metadata"]:
            if chunk["Metadata"]['is_table']:
                chunk["Chunks"]['child_chunks'] = {}
                continue
            text = chunk["Chunks"]["parent_chunk"]
            chunk["Chunks"]['child_chunks'] = {}
            documents = [Document(text = text)]
            nodes = splitter.get_nodes_from_documents(documents)
            for i,node in enumerate(nodes):
                if text == node.text:
                    break
                if i==0:
                    if len(text.split()) - len(node.text.split()) > 100:
                        chunk["Chunks"]['child_chunks'][f"{i}"] = node.text
                    else:
                        break
                else:
                    if len(node.text.split()) > 50:
                        chunk["Chunks"]['child_chunks'][f"{i}"]=node.text
                    else:
                        chunk["Chunks"]['child_chunks'][f"{i-1}"]+=node.text
    return all_chunks_processed

#is_table returns True when the first 30% of all_items all appear in the table's text
def get_table_of_contents(chunk_obj):
    def is_table(table,all_items):
        q_items = all_items[:math.ceil(len(all_items)*0.3)]
        table = table.get_text().lower()
        for item in q_items:
            if item.lower() not in table:
                return False
        return True
    
    all_items = chunk_obj.list_items()
    table_of_contents = pd.DataFrame()
    MAX_ITEMS = -1
    ITEM_COLUMN_INDEX = -1
    for table in chunk_obj.tables():
        if is_table(table, all_items):
            table_of_contents = table.to_dataframe()
            
            #Naming the 'Items' columns and 'Headings' columns
            for i in range(table_of_contents.shape[1]):
                NUM_ITEMS = sum('item' in item.lower() for item in table_of_contents.iloc[:,i])
                if NUM_ITEMS > MAX_ITEMS:
                    ITEM_COLUMN_INDEX = i
                    MAX_ITEMS = NUM_ITEMS
            
            #Naming the columns
            columns = [''] * table_of_contents.shape[1]
            columns[ITEM_COLUMN_INDEX] = 'Index'
            columns[ITEM_COLUMN_INDEX + 1] = 'Headings'

            #Setting the columns to datarame's column attribute
            table_of_contents.columns = columns
            
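            return table_of_contents

    #No table matched: return the empty DataFrame so callers can check `.empty`
    return table_of_contents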

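def get_correct_item(item, all_items):
    #Match the table-of-contents label against the parsed items, ignoring trailing punctuation
    for actual_item in all_items:
        if re.fullmatch(re.escape(item) + '[^a-zA-Z0-9]*', actual_item):
            return actual_item

    #Fall back to a normalised 'Item N' label
    parts = item.strip('. ').split()
    if len(parts) < 2:
        return item.strip('. ')
    return parts[0].capitalize() + ' ' + parts[1]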



def get_heading_dict(all_items, table_of_contents):
    #Dictionary of items
    headings = {}
    part_1 = False
    part_2 = False

    #If the table of contents is empty, fall back to the default part headings with no items
    if table_of_contents.empty :
        headings['Part I'] = {'Heading':'Financial Information','Items':{}}
        headings['Part II'] = {'Heading':'Other Information','Items':{}}
        return headings

    for _, row in table_of_contents.iterrows() :
        if re.fullmatch(r'part i[^a-zA-Z0-9]*', row['Index'].lower()) or 'financial information' in row['Index'].lower() and not part_1:
            headings['Part I'] = {'Heading':'Part I: Financial Information','Items':{}}
            part_1, part_2 = True, False
            
        elif re.fullmatch(r'part i[^a-zA-Z0-9]*', row['Headings'].lower()) or 'financial information' in row['Headings'].lower() and not part_1:
            headings['Part I'] = {'Heading':'Part I: Financial Information','Items':{}}
            part_1, part_2 = True, False

        elif re.fullmatch(r'part ii[^a-zA-Z0-9]*', row['Index'].lower()) or 'other information' in row['Index'].lower() and not part_2:
            headings['Part II'] = {'Heading':'Part II: Other Information','Items':{}}
            part_1, part_2 = False, True

        elif re.fullmatch(r'part ii[^a-zA-Z0-9]*', row['Headings'].lower()) or 'other information' in row['Headings'].lower() and not part_2:
            headings['Part II'] = {'Heading':'Part II: Other Information','Items':{}}
            part_1, part_2 = False, True

        if part_1:
            item = row['Index'].strip('.')
            heading = row['Headings']
            if 'item' in item.lower():
                actual_item = get_correct_item(item,all_items)
                headings['Part I']['Items'][actual_item] = actual_item + ': '+ heading.strip(':. \n')

        elif part_2:
            item = row['Index'].strip('.')
            heading = row['Headings']
            if 'item' in item.lower():
                actual_item = get_correct_item(item,all_items)
                headings['Part II']['Items'][actual_item] = actual_item + ': '+ heading.strip(':. \n')

    return headings
    

def chunk_document(chunk_obj, filing_date, company, heading_dictionary, form_type = '10-Q'):

    regex1 = re.compile(
                        r"""
                            \n*                     # optional newlines at start
                            .*?                     # company name
                            \s\|\s                  # ' | '
                            .*?                     # Any text like 'Q2'
                            \d{4}                   # 4-digit year
                            \sForm\s10-Q            # ' Form 10-Q'
                            \s\|\s                  # ' | '
                            \d+                     # page number
                            \n*                     # optional newlines at end
                            $""",
                        re.VERBOSE,
                        )

    # Regex 2: Matches only digits
    # ^\d+$
    regex2 = re.compile(r"^\n*\d+\n*$")

    metadata = {'part_heading':None,'item_heading':None, 'filing_date':filing_date, 'company':company, 'form':form_type, 'is_table':False}
    #Get all the items present in the table

    all_chunks = []

    processed_chunk = []
    prev_heading = {'T/F':False,'Heading': None}
    count_of_words = 0
    
    for part in ['Part I','Part II']:
        metadata['part_heading'] = heading_dictionary[part]['Heading']
        part_items = list(heading_dictionary[part]['Items'].keys())
        chunk_for_part = chunk_obj.chunks_for_part(part)
        
        Headings = {}
        prev_content = False
        prev_heading = False
        metadata_chunk = metadata.copy()
        count_of_words = 0
        
        for chunk in chunk_for_part:
            for element in chunk:
                #We want only text or tables
                if isinstance(element, (TextBlock)):
                    if element.is_header:
                        if regex1.fullmatch(element.get_text()) or regex2.fullmatch(element.get_text()):
                            Headings = {}
                            continue

                        #Extracting Items                        
                        for item in part_items:
                            item_heading = re.sub(r"item\s*\d*\W*:\s*","",heading_dictionary[part]['Items'][item].lower())
                            if item_heading in element.get_text().lower():
                                if processed_chunk:
                                    count_of_words = 0
                                    chunk_text = '\n'.join(processed_chunk)
                                    if len(chunk_text.split()) > 20:
                                        all_chunks.append({
                                                            'Metadata': metadata_chunk,
                                                            'Chunks': {"parent_chunk":chunk_text}
                                                        })
                                        processed_chunk = []
                                metadata['item_heading'] = heading_dictionary[part]['Items'][item]
                        
                        
                        text = element.get_text().strip('\n').strip(' ')

                        if 'item' in text.lower() or 'part' in text.lower():
                            continue

                        if prev_content and processed_chunk:
                            chunk_text = ' '.join(processed_chunk)
                            count_of_words = 0
                            if len(chunk_text.split()) > 20:
                                all_chunks.append({
                                                    'Metadata': metadata_chunk.copy(),
                                                    'Chunks': {"parent_chunk":chunk_text}
                                                })
                            processed_chunk = []
                            Headings = {}
                        
                        if Headings:
                            max_level = max(list(Headings.keys()))
                            Headings[max_level+1] = text
                        else:
                            Headings[1] = text
                        
                        prev_content = False
                        prev_heading = True
                    
                    else:
                        if prev_heading:
                            metadata_chunk = metadata.copy()
                            if len(Headings) > 3:
                                chunk_text = ''
                                for key, value in Headings.items():
                                    chunk_text+=value + '\n'
                                all_chunks.append({
                                                    'Metadata': metadata_chunk,
                                                    'Chunks': {"parent_chunk":chunk_text}
                                                })
                            else:
                                metadata_chunk['sub_headings'] = {}
                                for key, value in Headings.items():
                                    metadata_chunk['sub_headings']['level_'+str(key)+'_heading'] = value

                        text = element.get_text().strip('\n').strip(' ')
                        processed_chunk.append(text)
                        count_of_words += len(text.split())
                        prev_content = True
                        prev_heading = False

                elif isinstance(element, TableBlock):
                    #Update Metadata parent_heading key
                    for item in part_items:
                        item_heading = re.sub(r"item\s*\d*\W*:\s*","",heading_dictionary[part]['Items'][item].lower())
                        if item_heading in element.get_text().lower():
                            if processed_chunk:
                                chunk_text = '\n'.join(processed_chunk)
                                if len(chunk_text.split()) > 20:
                                    all_chunks.append({
                                                        'Metadata': metadata_chunk,
                                                        'Chunks': {"parent_chunk":chunk_text}
                                                    })
                                processed_chunk = []
                                count_of_words = 0
                            metadata['item_heading'] = heading_dictionary[part]['Items'][item]

                    if element.get_text().lower().count('item') < 6:
                        metadata_chunk = metadata.copy()
                        if prev_heading or prev_content:
                            if len(Headings) > 3:
                                chunk_text = ''
                                for key, value in Headings.items():
                                    chunk_text+=value + '\n'
                                all_chunks.append({
                                                    'Metadata': metadata_chunk,
                                                    'Chunks': {"parent_chunk":chunk_text}
                                                })
                            else:
                                metadata_chunk['sub_headings'] = {}
                                for key, value in Headings.items():
                                    metadata_chunk['sub_headings']['level_'+str(key)+'_heading'] = value

                        chunk_text = ''
                        if prev_content:
                            text = '\n'.join(processed_chunk)
                            if len(text.split()) < 20 and len(text.split()) > 5:
                                chunk_text = text

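                        chunk_text += element.get_text()

                        metadata_chunk['is_table'] = True
                        all_chunks.append({
                                            'Metadata': metadata_chunk,
                                            'Chunks': {"parent_chunk":chunk_text}
                                        })
                        #Use a fresh metadata dict so the text that follows is not flagged as a table
                        metadata_chunk = metadata_chunk.copy()
                        metadata_chunk['is_table'] = False
                        prev_content = True
                        prev_heading = False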

        if processed_chunk:
            count_of_words = 0
            chunk_text = '\n'.join(processed_chunk)
            if len(chunk_text.split()) > 20:
                all_chunks.append({
                                    'Metadata': metadata_chunk,
                                    'Chunks': {"parent_chunk":chunk_text}
                                })
            processed_chunk = []
            
    return all_chunks


def get_chunks(filing_date: str,ticker:str, company: str, cik: int, accession_no:str) -> list:
    # print(filing_date, ticker, cik, accession_no)
    filing = Filing(
            form='10-Q',
            filing_date= filing_date,
            company= ticker,
            cik=cik,
            accession_no= accession_no
        )

    obj = filing.obj()
    chunk_obj = obj.chunked_document

    table_of_contents = get_table_of_contents(chunk_obj)
    heading_dictionary = get_heading_dict(sorted(chunk_obj.list_items()),table_of_contents)
    all_chunks = chunk_document(chunk_obj, filing_date, company, heading_dictionary, form_type = '10-Q')
    
    print(f"Processed {ticker}, {filing_date}")
    return all_chunks

def main():
    global TOTAL_CHUNKS_TENQ
    for (cik, ticker),company in zip(TICKERS.items(),COMPANY_NAMES):
        meta_df_path = os.path.join(BASE_PATH_TENQ, ticker, '10-Q.csv')
        meta_df = pd.read_csv(meta_df_path)
        meta_df = meta_df[pd.to_datetime(meta_df['filing_date']).dt.year > 2023]

        for _, row in meta_df.iterrows():
            filing_date = row['filing_date']
            accession_no = row['accession_number']
            all_chunks = get_chunks(filing_date, ticker, company, cik, accession_no)
            TOTAL_CHUNKS_TENQ+=len(all_chunks)
            all_chunks_processed = post_processing_chunks(all_chunks)
            
            chunk_det = {
                'filing_date' : filing_date,
                'ticker' : ticker,
                'file_chunks' : all_chunks_processed
            }
            
            TENQ_CHUNKS.append(chunk_det)

if __name__ == '__main__':
    main()

Processed AAPL, 2025-08-01
Processed AAPL, 2025-05-02
Processed AAPL, 2025-01-31
Processed AAPL, 2024-08-02
Processed AAPL, 2024-05-03
Processed AAPL, 2024-02-02
Processed MSFT, 2025-10-29
Processed MSFT, 2025-04-30
Processed MSFT, 2025-01-29
Processed MSFT, 2024-10-30
Processed MSFT, 2024-04-25
Processed MSFT, 2024-01-30
Processed GOOGL, 2025-10-30
Processed GOOGL, 2025-07-24
Processed GOOGL, 2025-04-25
Processed GOOGL, 2024-10-30
Processed GOOGL, 2024-07-24
Processed GOOGL, 2024-04-26
Processed AMZN, 2025-10-31
Processed AMZN, 2025-08-01
Processed AMZN, 2025-05-02
Processed AMZN, 2024-11-01
Processed AMZN, 2024-08-02
Processed AMZN, 2024-05-01
Processed META, 2025-10-30
Processed META, 2025-07-31
Processed META, 2025-05-01
Processed META, 2024-10-31
Processed META, 2024-08-01
Processed META, 2024-04-25
Processed NVDA, 2025-11-19
Processed NVDA, 2025-08-27
Processed NVDA, 2025-05-28
Processed NVDA, 2024-11-20
Processed NVDA, 2024-08-28
Processed NVDA, 2024-05-29
Processed TSLA, 2025-1

### SAVE 10-Q CHUNKS

In [None]:
import json

# Use 'w' mode for writing text
with open("./tenq_chunks.json", "w", encoding="utf-8") as f:
    json.dump(TENQ_CHUNKS, f, indent=4)

In [6]:
TOTAL_CHUNKS_TENK + TOTAL_CHUNKS_TENQ

13745

## Save all Chunks

In [None]:
tenk_tenq_chunks = TENK_CHUNKS + TENQ_CHUNKS

with open("./tenk_tenq_chunks.json",'w', encoding="utf-8") as f:
    json.dump(tenk_tenq_chunks, f, indent = 4)