 # <center>Creating Embeddings for RAG Pipeline</center>

In this notebook, we look at methods for improving document retrieval, focusing on embeddings, text chunking, and efficient database setup and insertion. Our goal is to use these techniques to improve the performance of a Retrieval-Augmented Generation (RAG) system, which combines retrieval from a large corpus with the generative capabilities of a language model.

## 1. Embeddings

### What Is an Embedding and Why Do We Need Them?
Embeddings are dense vector representations of text data that capture the semantic meaning of words, sentences, or documents. By transforming text into vectors in a high-dimensional space, embeddings let us perform arithmetic on the vectors that represent words and documents, which supports tasks like similarity search, clustering, and classification. They are crucial for enabling the RAG system to efficiently retrieve relevant documents based on the query's semantic content.
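
To make the idea of similarity search concrete, here is a minimal sketch that ranks a few texts by cosine similarity to a query. The three-dimensional vectors are hand-made, hypothetical values rather than the output of any embedding model, and the helper names (`cosine_similarity`, `rank_by_similarity`) are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output).
toy_embeddings = {
    "mortgage rates": [0.9, 0.1, 0.0],
    "home loan interest": [0.8, 0.2, 0.1],
    "cricket batting average": [0.0, 0.1, 0.9],
}

def rank_by_similarity(query_vector, embeddings):
    """Return the keys sorted from most to least similar to the query."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vector, embeddings[name]),
        reverse=True,
    )

# Hypothetical query vector pointing in the same direction as "mortgage rates".
query = [1.8, 0.2, 0.0]
ranking = rank_by_similarity(query, toy_embeddings)
```

Texts whose vectors point in a similar direction to the query rank first, regardless of vector length. A real pipeline would obtain the vectors from an embedding model instead of writing them by hand.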

### Things to Keep in Mind While Creating Embeddings
- **Dimensionality**: The choice of dimensionality affects both the granularity of the captured semantics and the computational efficiency.
- **Training Data**: The quality and representativeness of the training data determine the embeddings' applicability to various domains.
- **Normalization**: Normalizing embeddings can improve the performance of similarity searches.
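
As a small illustration of the normalization point, the sketch below scales vectors to unit length. Once both vectors have unit length, their plain dot product equals their cosine similarity, which is why normalized embeddings can be compared with an inner product. The function names are illustrative, not taken from any library.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

# Both vectors now have length 1, so this dot product is their cosine similarity.
similarity = dot(a, b)
```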

### Strategies for Creating Embeddings
- **Pre-trained Models**: Utilizing models pre-trained on large, diverse corpora as a starting point.
- **Domain-specific Training**: Fine-tuning pre-trained models on domain-specific data to capture specialized vocabulary and concepts.
- **Dimensionality Reduction**: Applying techniques like PCA to reduce the dimensionality of embeddings while preserving semantic relationships.

### MTEB Leaderboard
The Massive Text Embedding Benchmark (MTEB) leaderboard compares the performance of embedding models across many tasks and languages, serving as a guide for selecting a suitable model for a specific application.

### Creating Embeddings
This section will demonstrate how to generate embeddings from our text data using selected strategies, preparing them for use in our retrieval system.

### How Embeddings Impact the Performance of the RAG System
Embeddings play a pivotal role in the RAG system by enabling the efficient retrieval of semantically relevant documents. High-quality embeddings can significantly enhance the system's ability to generate accurate and contextually appropriate responses.

## 2. Chunking Text

### Why Do We Need Chunks?
Chunking breaks large documents into manageable pieces that fit within the input limits of machine learning models and can be retrieved individually in response to specific queries.

### Best Practices for Chunking
- **Consistency**: Maintain uniform chunk sizes where possible.
- **Contextual Integrity**: Ensure chunks are self-contained to preserve meaning.

### Different Types of Chunking
- **Sentence-level**: Ideal for fine-grained analysis or retrieval.
- **Paragraph-level**: Balances granularity with context preservation.
- **Section-level**: Useful for documents with clear structural delineations.
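
The sentence-level and paragraph-level options above can be sketched with simple regular expressions. This is a deliberately naive illustration: real text needs more careful sentence detection (abbreviations, decimals), and section-level splitting depends on each document's heading structure, so it is not shown. The function names are hypothetical.

```python
import re

def split_paragraphs(text):
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text):
    """Naive split after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

sample = (
    "A buydown lowers the rate. It is temporary.\n\n"
    "Rates rise each year. Borrowers should plan ahead."
)

paragraph_chunks = split_paragraphs(sample)
sentence_chunks = [s for p in paragraph_chunks for s in split_sentences(p)]
```

Paragraph chunks keep neighbouring sentences together, while sentence chunks give finer granularity at the cost of context.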

### Mega Chunking Strategy
Mega chunking involves grouping related smaller chunks into larger units for purposes like summarization or when a broader context is necessary for understanding.

### Creating Chunks and Mega Chunks
We will explore methods to programmatically segment our documents into chunks and group these into mega chunks, according to our predefined strategies.

### Summarizing Mega Chunks
Summarization techniques applied to mega chunks will be demonstrated to condense the information content, facilitating quicker comprehension by the RAG system.

## 3. DB Setup and Insertion

### How to Set Up a Database for Fast Retrieval
Efficient database setup is essential for minimizing retrieval times and enhancing the overall performance of the RAG system.

### How to Store Embeddings and Metadata
We discuss strategies for storing embeddings alongside rich metadata to facilitate effective retrieval and relevance determination.

### Creating FAISS Vectorstore

#### What is FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It excels in large-scale vector storage and quick nearest neighbor retrieval.

#### Different Types of FAISS Index
- **Flat (L2) Index**: Offers exact nearest neighbor search.
- **HNSW Index**: Provides a balance between speed and accuracy for approximate searches.

#### Creating L2 and HNSW Indexes
Step-by-step guidance on setting up L2 and HNSW indexes in FAISS, tailored for different retrieval needs.
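
To show what an exact (flat) L2 search computes, here is a pure-Python sketch that compares the query with every stored vector and keeps the closest ones. It does not use FAISS itself; the function names and the toy vectors are assumptions made for illustration. An HNSW index returns similar results approximately by walking a graph instead of scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_search(query, vectors, k):
    """Return the k (distance, position) pairs closest to the query."""
    scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]

# Toy 2-dimensional vectors standing in for stored embeddings.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = exact_search([0.9, 0.1], stored, k=2)
nearest_positions = [position for _, position in hits]
```

The cost grows linearly with the number of stored vectors, which is the price paid for exact results.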

### Creating MongoDB Collection (Metadata)

#### What Type of Metadata We Want to Store
Discussing the importance of various metadata fields in enhancing document retrieval and understanding.

#### Document Metadata Information (Mongo)
- **document_name**
- **document_word_count**
- **page_number**
- **page_word_count**
- **page_sentence_count**
- **page_token_count**
- **text**

This section outlines the process of creating a MongoDB collection to store the aforementioned metadata, ensuring that each document and its chunks are easily retrievable.


In the previous notebook, we went through the project overview and fetched and preprocessed our documents.
In this notebook, we look at the best ways of using these documents to answer queries with a Large Language Model.
We will cover the following points:

- 1. Embeddings
    - What is an Embedding and why we need them?
    - Things to keep in mind while creating Embeddings
    - Strategies for Creating Embeddings
    - MTEB Leaderboard
    - **Creating Embeddings**
    - How Embeddings impact the performance of RAG System.
    
- 2. Chunking Text
    - Why we Need Chunks?
    - What are the Best Practices for Chunking?
    - Different Types of Chunking
    - *Mega Chunking* Strategy
    - **Creating Chunks and Mega Chunks**
    - **Summarizing Mega chunks**
- 3. DB Setup and Insertion
    - How to set up a Database for Fast Retrieval
    - How to store Embeddings and Metadata
    - Creating FAISS Vectorstore
        - What is FAISS
        - Different types of FAISS Index
        - Creating L2 Index
        - Creating HNSW Index
    - Creating Mongo DB Collection (Metadata)
        - What Type of metadata we want to store
        - Document Metadata Information: (Mongo)
            - document_name
            - document_word_count
            - page_number
            - page_word_count
            - page_sentence_count
            - page_token_count
            - text
        

# 1. Chunking

This notebook covers the Chunking Strategies used in a RAG Pipeline.



- Overlap Chunking
- CharacterSplitter
- Chunks lack global concept awareness -> Sub-Document Summaries
    - A context augmentation technique.
    - Along with each chunk, we can summarize the whole document and attach that summary as metadata to each embedding.
    - This may not solve the problem completely, as each chunk can have a "local context". For example, a document on Sachin Tendulkar can have various parts that focus on different aspects of his life.
    - So we can have sub-summaries: related parts of the document can be summarized together.

- The chunk size must be smaller than the maximum number of tokens the embedding model accepts.

- Split the document into smaller, manageable chunks.

- Sub-Document Summary: divide the document into sub-documents, then create a summary of each sub-document.
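
The token-limit constraint above can be checked with a rough estimate. The sketch below uses the same "about four characters per token" approximation that this notebook applies when it computes `page_token_count`; the limit of 512 tokens is a hypothetical value, so check the documentation of the embedding model you actually use.

```python
def estimate_tokens(text):
    """Rough token estimate: about four characters per token."""
    return len(text) // 4

def fits_model(text, max_tokens):
    """True when the estimated token count is within the model limit."""
    return estimate_tokens(text) <= max_tokens

MAX_MODEL_TOKENS = 512      # hypothetical limit, not taken from any specific model

chunk = "x" * 300           # stand-in for a 300-character chunk
page = "x" * 5320           # stand-in for a full page of text

chunk_fits = fits_model(chunk, MAX_MODEL_TOKENS)
page_fits = fits_model(page, MAX_MODEL_TOKENS)
```

A tokenizer-based count is more reliable than this heuristic when the chunk size is close to the limit.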


For example, if a document has 100 chunks, we can combine every 20 chunks into a group and summarize each group.
- The user query first goes to the sub-document summaries to find the most relevant summary, and then to the chunks covered by that summary.
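
The two-step lookup described above can be sketched as follows: first pick the best-matching summary, then search only the chunks that belong to it. The vectors, identifiers, and function names are all hypothetical, and a plain dot product stands in for whatever similarity measure the vector store uses.

```python
def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def best_match(query, candidates):
    """Return the key whose vector has the highest dot product with the query."""
    return max(candidates, key=lambda key: dot(query, candidates[key]))

# Hypothetical toy vectors for two sub-document summaries and their chunks.
summary_vectors = {"summary_0": [1.0, 0.0], "summary_1": [0.0, 1.0]}
chunks_by_summary = {
    "summary_0": {"chunk_0": [0.9, 0.1], "chunk_1": [0.7, 0.3]},
    "summary_1": {"chunk_2": [0.1, 0.9], "chunk_3": [0.3, 0.7]},
}

def two_stage_retrieve(query):
    """Stage 1 selects a summary; stage 2 searches only that summary's chunks."""
    summary_id = best_match(query, summary_vectors)
    chunk_id = best_match(query, chunks_by_summary[summary_id])
    return summary_id, chunk_id

result = two_stage_retrieve([0.2, 0.8])
```

Restricting the second stage to one summary's chunks keeps the search small, at the risk of missing a relevant chunk that sits under a different summary.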

- There is a trade-off:
    - smaller chunks -> higher accuracy, but more chunks to manage
    - bigger chunks -> lower accuracy, but easier handling

## Chunking Strategy
- Each document will be divided into chunks of 300 characters with a 50-character overlap.
- 10 chunks will be combined to create a summary_chunk.
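
As a simplified illustration of this strategy, the sketch below cuts text into fixed 300-character windows that overlap by 50 characters, then groups every 10 chunks. It is an assumption-laden stand-in: the notebook itself uses `RecursiveCharacterTextSplitter`, which prefers to split on natural separators, so its chunk boundaries and counts will differ. The function names are hypothetical.

```python
def chunk_with_overlap(text, chunk_size=300, chunk_overlap=50):
    """Fixed-size character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

def group_chunks(chunks, group_size=10):
    """Group consecutive chunks; the last group may be shorter."""
    return [chunks[i:i + group_size] for i in range(0, len(chunks), group_size)]

# Stand-in for a 5,320-character page.
text = "".join(str(i % 10) for i in range(5320))
chunks = chunk_with_overlap(text)
groups = group_chunks(chunks)
```

With a step of 250 characters, a 5,320-character text yields 22 windows, which fall into three groups of 10, 10, and 2 chunks.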


- Document Metadata Information: (Mongo)
    - document_name
    - document_word_count
    - page_number
    - page_word_count
    - page_sentence_count
    - page_token_count
    - text
    
- chunk_metadata
    - document_metadata +
    - chunk_text
    - summary_chunk_text
    - FAISS_SUMMARY_CHUNK_INDEX (PK)
    - FAISS_CHUNK_INDEX_LIST
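
The record layout listed above could be assembled along these lines. This is a sketch only: the field names come from the list, but the helper `build_chunk_record`, the choice to copy the summary index into `_id`, and every value shown are assumptions for illustration.

```python
def build_chunk_record(page_metadata, chunk_texts, summary_text,
                       summary_index, chunk_indices):
    """Merge page-level metadata with chunk text and FAISS index positions."""
    record = dict(page_metadata)  # shallow copy so the input dict is not modified
    record.update({
        "_id": summary_index,  # assumption: the summary index doubles as the primary key
        "chunk_text": chunk_texts,
        "summary_chunk_text": summary_text,
        "FAISS_SUMMARY_CHUNK_INDEX": summary_index,
        "FAISS_CHUNK_INDEX_LIST": chunk_indices,
    })
    return record

# All values below are made up for illustration.
page_metadata = {
    "document_name": "example_document",
    "document_word_count": 1200,
    "page_number": 1,
    "page_word_count": 600,
    "page_sentence_count": 30,
    "page_token_count": 900,
    "text": "Full text of the page...",
}

record = build_chunk_record(
    page_metadata,
    chunk_texts=["first chunk", "second chunk"],
    summary_text="Short summary of both chunks.",
    summary_index=0,
    chunk_indices=[10, 11],
)
```

Keeping the FAISS positions next to the text means a search hit, which is only an integer position, can be resolved back to readable content with a single lookup.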

In [36]:
## Load Environment Variables
import os
from pathlib import Path
from dotenv import load_dotenv
load_dotenv(Path('C:/Users/erdrr/OneDrive/Desktop/Scholastic/NLP/LLM/RAG/CompleteRAG/.env'))

True

In [37]:
import os
from pathlib import Path
import unicodedata
from time import perf_counter as timer
from tqdm.auto import tqdm
import torch
import fitz
import pandas as pd
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from src.embeddings.models import summarize_mega_chunk,get_embedding_model
from src.embeddings.vectorstore import add_faiss_indices_to_dataframe
from src.embeddings.database import insert_into_mongodb
load_dotenv(Path('C:/Users/erdrr/OneDrive/Desktop/Scholastic/NLP/LLM/RAG/CompleteRAG/.env'))

True

In [38]:
preprocessed_data_path = Path(os.environ["PREPROCESSED_DATA_DIR"])
preprocessed_data_path

WindowsPath('C:/Users/erdrr/OneDrive/Desktop/Scholastic/NLP/LLM/RAG/CompleteRAG/data/preprocessed')

In [39]:
def get_document_pagewise_data(file_path: Path):
    """Extract per-page text and simple statistics from a PDF."""
    doc = fitz.open(file_path)
    pagewise_data = []
    document_text = " "
    for num, page in tqdm(enumerate(doc)):
        text = page.get_text().replace("\n", " ").strip()
        # Decompose Unicode characters, then drop anything that is not ASCII
        normalized_text = unicodedata.normalize('NFKD', text)
        text = normalized_text.encode('ascii', 'ignore').decode('ascii')
        document_text += text + " "
        pagewise_data.append({
            "page_number": num + 1,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count": len(text.split(". ")),
            "page_token_count": len(text) // 4,  # rough estimate: ~4 characters per token
            "page_text": text,
            "document_name": os.path.basename(file_path).replace(".pdf", "")
            })
    doc.close()
    ## Add Document Level Metadata
    page_count = len(pagewise_data)  # also correct for a PDF with no pages
    for page_data in pagewise_data:
        # NOTE: despite its name, this field holds the character length of the concatenated text
        page_data["document_word_count"] = len(document_text)
        page_data["document_page_count"] = page_count
    return pagewise_data

In [40]:
def get_all_documents_data(documents_dir_path: Path = preprocessed_data_path):
    all_documents_data = []
    # Process only a slice of the directory listing (entries 40 to 79) in this run
    docs_list = os.listdir(documents_dir_path)[40:80]
    for item in tqdm(docs_list):
        # Build the path from the function argument rather than the global variable
        file_path = Path(documents_dir_path) / item
        all_documents_data.append(get_document_pagewise_data(file_path))

    # Flatten the per-document lists into a single list of page dictionaries
    return [page for document_pages in all_documents_data for page in document_pages]

In [5]:
all_documents_data = get_all_documents_data(preprocessed_data_path)

  0%|          | 0/40 [00:00<?, ?it/s]

In [6]:
all_data_df = pd.DataFrame(all_documents_data)
all_data_df.shape

(67, 9)

In [7]:
all_data_df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | document_word_count | document_page_count |
|---|---|---|---|---|---|---|---|
| count | 67.0 | 67.0 | 67.0 | 67.0 | 67.0 | 67.0 | 67.0 |
| mean | 1.492537 | 3971.253731 | 651.492537 | 30.492537 | 992.462687 | 7903.014925 | 1.985075 |
| std | 0.682534 | 1685.696613 | 277.44704 | 14.044705 | 421.436732 | 3688.485229 | 0.807025 |
| min | 1.0 | 11.0 | 1.0 | 1.0 | 2.0 | 1863.0 | 1.0 |
| 25% | 1.0 | 3033.5 | 493.5 | 20.0 | 758.0 | 5156.0 | 1.0 |
| 50% | 1.0 | 4976.0 | 788.0 | 35.0 | 1244.0 | 7033.0 | 2.0 |
| 75% | 2.0 | 5296.5 | 857.0 | 40.0 | 1323.5 | 10298.0 | 2.0 |
| max | 4.0 | 5473.0 | 977.0 | 50.0 | 1368.0 | 16852.0 | 4.0 |


## Split Text

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [9]:
chunk_size = int(os.environ["CHUNK_SIZE"])
chunk_overlap = int(os.environ["CHUNK_OVERLAP_SIZE"])
mega_chunk_multiplier = int(os.environ["MEGA_CHUNK_MULTIPLIER"])
print(f"[INFO]: Using {chunk_size=} | {chunk_overlap=} | {mega_chunk_multiplier=}")

[INFO]: Using chunk_size=300 | chunk_overlap=50 | mega_chunk_multiplier=10


In [10]:
def find_overlap(str1: str, str2: str) -> int:
    """
    Finds the length of the largest overlap between the end of str1 and the start of str2.
    
    Parameters:
    - str1 (str): The first string.
    - str2 (str): The second string.
    
    Returns:
    - int: The length of the overlap.
    """
    min_length = min(len(str1), len(str2))
    # Check the longest candidate first so that the largest overlap is returned
    for i in range(min_length, 0, -1):
        if str1[-i:] == str2[:i]:
            return i
    return 0

def combine_strings_remove_overlap(str_list: list) -> str:
    """
    Combines a list of strings by removing overlapping text.
    
    Parameters:
    - str_list (list): A list of strings with possible overlaps.
    
    Returns:
    - str: A single combined string with overlaps removed.
    """
    if not str_list:
        return ""

    # Initialize the combined string with the first string in the list
    combined_string = str_list[0]

    # Iterate through the list, combining strings with overlap removed
    for i in range(1, len(str_list)):
        overlap_length = find_overlap(combined_string, str_list[i])
        combined_string += str_list[i][overlap_length:]

    return combined_string

In [18]:
import os
import pandas as pd

def create_chunks(df,
                  chunk_size: int = int(os.environ.get("CHUNK_SIZE", 256)),
                  chunk_overlap: int = int(os.environ.get("CHUNK_OVERLAP_SIZE", 50)),
                  mega_chunk_multiplier: int = int(os.environ.get("MEGA_CHUNK_MULTIPLIER", 4))):
    print(f"[INFO]: Creating Chunks Using: chunk_size={chunk_size} | chunk_overlap={chunk_overlap} | mega_chunk_multiplier={mega_chunk_multiplier}")

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Splits on natural separators first and measures chunk length in characters
    chunk_text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )

    # Split text into chunks
    df['chunks'] = df['page_text'].apply(chunk_text_splitter.split_text)

    # Create a list of mega chunks, each with its constituent chunks
    def create_mega_chunks_with_chunks(chunks, multiplier):
        mega_chunks_with_chunks = []
        for i in range(0, len(chunks), multiplier):
            mega_chunk_with_chunks = chunks[i:i + multiplier]
            # Chunks are concatenated as-is, so any overlapping text between neighbours is repeated
            mega_chunks_with_chunks.append({'mega_chunk': ''.join(mega_chunk_with_chunks), 'chunks': mega_chunk_with_chunks})
        return mega_chunks_with_chunks

    df['mega_chunks_with_chunks'] = df['chunks'].apply(lambda x: create_mega_chunks_with_chunks(x, mega_chunk_multiplier))
    
    # Explode DataFrame by 'mega_chunks_with_chunks'
    df = df.explode('mega_chunks_with_chunks').reset_index(drop=True)
    
    # Extract 'mega_chunk' and 'chunks' from the exploded dictionaries
    df['mega_chunk'] = df['mega_chunks_with_chunks'].apply(lambda x: x['mega_chunk'])
    df['chunks_in_mega'] = df['mega_chunks_with_chunks'].apply(lambda x: x['chunks'])
    
    # Assign mega_chunk_number for enumeration within each document/page
    df['mega_chunk_number'] = df.groupby(['document_name', 'page_number']).cumcount() + 1

    # Calculate the number of mega chunks per page
    df['page_mega_chunk_count'] = df.groupby(['document_name', 'page_number'])['mega_chunk_number'].transform('max')

    # Specify columns to include in the final DataFrame
    df_columns = ['document_name', 'document_word_count', 'document_page_count', 'page_number', 'page_char_count', 'page_word_count',
                  'page_sentence_count', 'page_token_count', 'page_text', 'page_mega_chunk_count', 'mega_chunk_number', 'mega_chunk', 'chunks_in_mega']
    df = df[df_columns]
    df_final = df.rename(columns={'chunks_in_mega': 'chunks'})

    return df_final


In [19]:
create_chunks(all_data_df)#.to_json("test.json",orient="records",indent=4)

[INFO]: Creating Chunks Using: chunk_size=300 | chunk_overlap=50 | mega_chunk_multiplier=10


Unnamed: 0,document_name,document_word_count,document_page_count,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_text,page_mega_chunk_count,mega_chunk_number,mega_chunk,chunks
0,Investopedia_1_What_Is_3_2_1_Buydown,7033,2,1,5320,932,40,1330,A 3-2-1 buydown mortgage is a type of home loa...,3,1,A 3-2-1 buydown mortgage is a type of home loa...,[A 3-2-1 buydown mortgage is a type of home lo...
1,Investopedia_1_What_Is_3_2_1_Buydown,7033,2,1,5320,932,40,1330,A 3-2-1 buydown mortgage is a type of home loa...,3,2,future years. Over the first three years of lo...,[future years. Over the first three years of l...
2,Investopedia_1_What_Is_3_2_1_Buydown,7033,2,1,5320,932,40,1330,A 3-2-1 buydown mortgage is a type of home loa...,3,3,such as investing it or using it to pay off ot...,[such as investing it or using it to pay off o...
3,Investopedia_1_What_Is_3_2_1_Buydown,7033,2,2,1710,296,12,427,"even then, ask yourself whether the maximum mo...",1,1,"even then, ask yourself whether the maximum mo...","[even then, ask yourself whether the maximum m..."
4,Investopedia_1_What_Is_3_6_3_Rule,5299,2,1,5271,850,32,1317,What Is the 3-6-3 Rule? The 3-6-3 rule is a sl...,3,1,What Is the 3-6-3 Rule? The 3-6-3 rule is a sl...,[What Is the 3-6-3 Rule? The 3-6-3 rule is a s...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,Investopedia_A_What_Is_Absoluteadvantage,10621,3,2,5343,873,35,1335,"disasters, for example, can destroy farmland, ...",3,2,bacon. Pacifica can spend one-third of the yea...,[bacon. Pacifica can spend one-third of the ye...
147,Investopedia_A_What_Is_Absoluteadvantage,10621,3,2,5343,873,35,1335,"disasters, for example, can destroy farmland, ...",3,3,represents Adam Smith's explanation of why cou...,[represents Adam Smith's explanation of why co...
148,Investopedia_A_What_Is_Absoluteadvantage,10621,3,3,11,1,1,2,advantages.,1,1,advantages.,[advantages.]
149,Investopedia_A_What_Is_Absolutereturn,3342,1,1,3340,544,26,835,What Is Absolute Return? Absolute return is th...,2,1,What Is Absolute Return? Absolute return is th...,[What Is Absolute Return? Absolute return is t...


# Embeddings

To save our embeddings, we need a vector database that can store our 1024-dimensional embedding vectors. When a user submits a query, we convert the query to an embedding and search for it in the vector database. Various vector databases are available, but we will use the FAISS library from Facebook to store our vectors.

## FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI Research to facilitate efficient similarity search and clustering of dense vectors. It's widely used for tasks such as image retrieval, recommendation systems, and natural language processing where finding nearest neighbors in high-dimensional spaces is required. FAISS is designed to be highly efficient both in terms of speed and memory usage for vector search operations. It supports various index types, each optimized for different use cases and trade-offs between speed, accuracy, and memory usage. Here are the main types of indexes in FAISS:

1. **Flat Indexes (IndexFlat)**: These are the simplest form of indexes in FAISS, where the database vectors are stored as they are, without any encoding or compression. Searches in flat indexes involve comparing the query vector with every vector in the database, which is computationally expensive but provides the exact nearest neighbors. Flat indexes are often used as benchmarks for accuracy.

2. **Quantization-based Indexes**:
   - **IndexIVFFlat**: This index uses an Inverted File system (IVF) with flat encoding. It divides the vector space into a finite number of coarse quantization regions (clusters) and assigns each vector to its nearest cluster. Searches are accelerated by only considering vectors in the nearest clusters to the query vector.
   - **IndexIVFPQ**: Similar to IndexIVFFlat, but within each cluster, it applies Product Quantization (PQ) for further compression and faster search. PQ decomposes the space into a Cartesian product of subspaces and quantizes each subspace separately.
   - **IndexIVFScalarQuantizer**: This index also uses IVF clustering but employs scalar quantization for the vectors within each cluster, offering a trade-off between memory usage and accuracy.

3. **Hierarchical Navigable Small World (HNSW) Indexes (IndexHNSW)**: HNSW indexes are graph-based and excel in providing a good balance between search speed and accuracy, especially for very high-dimensional data. They build a multi-layered graph structure that allows for efficient greedy searches.

4. **Binary Indexes (IndexBinaryFlat, IndexBinaryIVF, etc.)**: These indexes are designed for binary vectors (vectors of 0s and 1s) and use specialized distance measures like Hamming distance. They are useful for applications where data is naturally binary or has been binarized.

5. **Composite Indexes**:
   - **IndexIDMap**: This is a wrapper that allows any index to store arbitrary IDs for each vector, facilitating easy retrieval of original identifiers after search operations.
   - **IndexPreTransform**: This index applies a transformation (e.g., PCA, whitening) to vectors before indexing them with another index type, which can improve search performance and accuracy.

6. **GPU Indexes (GpuIndexFlat, GpuIndexIVFFlat, etc.)**: FAISS provides GPU versions for many of its indexes, enabling significantly faster search speeds by leveraging the parallel processing power of GPUs.

Each index type in FAISS serves different needs, ranging from exact searches to approximate searches that prioritize speed or memory efficiency. The choice of index depends on the specific requirements of the application, such as the size of the dataset, the dimensionality of the vectors, the acceptable trade-off between accuracy and speed, and the available computational resources.
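
The inverted-file (IVF) idea from the list above can be illustrated without FAISS: each vector is assigned to its nearest centroid, and a query only scans the vectors in the `nprobe` closest clusters. This is a conceptual sketch with hand-picked centroids and hypothetical function names. In FAISS the centroids are learned during a training step, and the real implementation is far more optimized.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Position of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_l2(point, centers[i]))

def build_inverted_lists(vectors, centroids):
    """Map each centroid position to the positions of the vectors assigned to it."""
    lists = {i: [] for i in range(len(centroids))}
    for position, vector in enumerate(vectors):
        lists[nearest(vector, centroids)].append(position)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda i: squared_l2(query, centroids[i]))
    candidates = [p for c in order[:nprobe] for p in lists[c]]
    return sorted(candidates, key=lambda p: squared_l2(query, vectors[p]))[:k]

# Centroids are fixed by hand here purely for illustration.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.5, 0.2], [1.0, 1.5], [9.0, 9.5], [10.5, 10.0]]

lists = build_inverted_lists(vectors, centroids)
hits = ivf_search([9.2, 9.4], vectors, centroids, lists, k=1, nprobe=1)
```

Raising `nprobe` scans more clusters, which improves recall at the cost of speed. With `nprobe` equal to the number of clusters, the search covers every vector and becomes exact.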

# Database Setup

## Save dataframe to MongoDB

In [10]:
import json

In [11]:
processed_data_path = Path(os.environ["PROCESSED_DATA_DIR"])
json_file_path = os.path.join(processed_data_path,"all_chunks.json")

In [12]:
df = pd.read_json(json_file_path, lines=True)

In [13]:
df

Unnamed: 0,document_name,document_word_count,document_page_count,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_text,mega_chunk,mega_chunk_summary,mega_chunk_summary_embedding_index,chunks,chunks_embedding_list_index,_id
0,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,1,5358,892,45,1339,Billionaires play an outsized role in shaping ...,Billionaires play an outsized role in shaping ...,Forbes puts the number of billionaires in the ...,0,[Billionaires play an outsized role in shaping...,"[6627, 6628, 6629, 6630, 6631, 6632, 6633, 663...",0
1,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,1,5358,892,45,1339,Billionaires play an outsized role in shaping ...,"500 , becoming the largest company added, and ...","In April 2022, Musk began a campaign to take X...",1,[Billionaires play an outsized role in shaping...,"[6649, 6650, 6651, 6652, 6653, 6654, 6655, 665...",1
2,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,2,5308,832,44,1327,"10,000-year clockalso known as the Long Now. O...","10,000-year clockalso known as the Long Now. O...",Bezos' wealth peaked at $213 billion in the sa...,2,"[10,000-year clockalso known as the Long Now. ...","[6671, 6672, 6673, 6674, 6675, 6676, 6677, 667...",2
3,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,2,5308,832,44,1327,"10,000-year clockalso known as the Long Now. O...",seeks to leverage technology to fix societal i...,"Bill Gates is the co-founder of Microsoft, the...",3,"[10,000-year clockalso known as the Long Now. ...","[6692, 6693, 6694, 6695, 6696, 6697, 6698, 669...",3
4,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,3,5210,831,45,1302,"2014, shortly after stepping down as Microsoft...","2014, shortly after stepping down as Microsoft...",Ballmer lived in the same dorm and on the same...,4,"[2014, shortly after stepping down as Microsof...","[6713, 6714, 6715, 6716, 6717, 6718, 6719, 672...",4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6622,Investopedia_G_What_Is_G_30,5007,1,1,5005,803,32,1251,What Is the Group of 30 (G-30)? The Group of 3...,What Is the Group of 30 (G-30)? The Group of 3...,The Group of 30 (G-30) is a nonprofit group of...,6622,[What Is the Group of 30 (G-30)? The Group of ...,"[122830, 122831, 122832, 122833, 122834, 12283...",6622
6623,Investopedia_G_What_Is_G_30,5007,1,1,5005,803,32,1251,What Is the Group of 30 (G-30)? The Group of 3...,"Witteveen, the former managing director of the...",The Group of 30 (G-30) is a group of 10 nation...,6623,[What Is the Group of 30 (G-30)? The Group of ...,"[122850, 122851, 122852, 122853, 122854, 12285...",6623
6624,Investopedia_G_What_Is_Soros,5842,2,1,5300,892,39,1325,George Soros is a legendary hedge fund manager...,George Soros is a legendary hedge fund manager...,"Soros managed the Quantum Fund, a fund that ac...",6624,[George Soros is a legendary hedge fund manage...,"[122870, 122871, 122872, 122873, 122874, 12287...",6624
6625,Investopedia_G_What_Is_Soros,5842,2,1,5300,892,39,1325,George Soros is a legendary hedge fund manager...,"European Union issue perpetual bonds , a metho...",George Soros is unique among highly successful...,6625,[George Soros is a legendary hedge fund manage...,"[122892, 122893, 122894, 122895, 122896, 12289...",6625


In [30]:
doc_master["document_name"][0]+".pdf"

'Investopedia_012715_What_Is_5_Richest_People_World.pdf'

In [54]:
def insert_into_document_master(df):
    client = MongoClient(os.environ["MONGODB_IP"], int(os.environ["MONGODB_PORT"]))
    db = client[os.environ["MONGODB_DB"]]
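    # Copy the de-duplicated slice so that adding '_id' does not trigger a SettingWithCopyWarning
    df = df[['document_name', 'document_word_count', 'document_page_count']].drop_duplicates().copy()
    collection = db[os.environ["MONGODB_DOCUMENTS_MASTER_COLLECTION"]]
    df['_id'] = df['document_name'] + ".pdf"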
    data_dict = df.to_dict("records")
    # Initialize counter for inserted documents
    inserted_count = 0

    # Insert documents only if they don't already exist
    for document in data_dict:
        if collection.count_documents({'_id': document['_id']}, limit=1) == 0:
            collection.insert_one(document)
            inserted_count += 1

    # Logging the result
    print(f"[INFO]: Successfully inserted {inserted_count} new documents.")

In [55]:
from pymongo import MongoClient

In [56]:
doc_master

Unnamed: 0,document_name,document_word_count,document_page_count,_id
0,Investopedia_012715_What_Is_5_Richest_People_W...,17922,4,Investopedia_012715_What_Is_5_Richest_People_W...
7,Investopedia_03_What_Is_071603,6008,2,Investopedia_03_What_Is_071603.pdf
10,Investopedia_042315_What_Is_How_Do_Prepaid_Deb...,5810,2,Investopedia_042315_What_Is_How_Do_Prepaid_Deb...
13,Investopedia_05_What_Is_Economicmoat,7852,2,Investopedia_05_What_Is_Economicmoat.pdf
16,Investopedia_063015_What_Is_What_Effective_Int...,9033,2,Investopedia_063015_What_Is_What_Effective_Int...
...,...,...,...,...
6613,Investopedia_G_What_Is_Guppy_Multiple_Moving_A...,7284,2,Investopedia_G_What_Is_Guppy_Multiple_Moving_A...
6616,Investopedia_G_What_Is_Gwei_Ethereum,6531,2,Investopedia_G_What_Is_Gwei_Ethereum.pdf
6619,Investopedia_G_What_Is_G_20,7306,2,Investopedia_G_What_Is_G_20.pdf
6622,Investopedia_G_What_Is_G_30,5007,1,Investopedia_G_What_Is_G_30.pdf


In [57]:
insert_into_document_master(df)

[INFO]: Successfully inserted 0 new documents.


In [None]:
df.columns

In [None]:
import numpy as np

In [None]:
def update_chunk_mapping_collection(input_df):
    """
    Builds a chunk-to-summary index mapping and inserts it into MongoDB.

    Each inserted row maps a chunk embedding index ('_id') to the index of the
    mega chunk summary it belongs to. Every summary index is also mapped to itself.

    Parameters:
    - input_df (pd.DataFrame): The input DataFrame with columns 
      'mega_chunk_summary_embedding_index' and 'chunks_embedding_list_index'.
    """
    input_df = input_df[['chunks_embedding_list_index','mega_chunk_summary_embedding_index']]
    # Explode 'chunks_embedding_list_index' so that each chunk index gets its own row
    exploded_df = input_df.explode('chunks_embedding_list_index')
    
    # Add self-mapping rows: each summary index also appears as its own '_id'
    self_map_df = pd.DataFrame({
        'mega_chunk_summary_embedding_index': input_df['mega_chunk_summary_embedding_index'],
        'chunks_embedding_list_index': input_df['mega_chunk_summary_embedding_index']
    })
    
    # Rename 'chunks_embedding_list_index' to '_id' in both DataFrames
    exploded_df = exploded_df.rename(columns={'chunks_embedding_list_index': '_id'})
    self_map_df = self_map_df.rename(columns={'chunks_embedding_list_index': '_id'})
    
    # Concatenate the exploded DataFrame with the self-mapping DataFrame
    final_df = pd.concat([exploded_df, self_map_df], ignore_index=True)

    # Convert NumPy integer types to Python native int types for MongoDB compatibility
    final_df = final_df.applymap(lambda x: int(x) if isinstance(x, np.integer) else x)

    ## Insert into Mongo Collection
    client = MongoClient(os.environ["MONGODB_IP"], int(os.environ["MONGODB_PORT"]))
    db = client[os.environ["MONGODB_DB"]]
    collection = db[os.environ["MONGODB_MAPPING_COLLECTION"]]
    data_dict = final_df.to_dict("records")
    collection.insert_many(data_dict)
    print("[INFO]: Successfully Inserted!")

In [None]:
update_chunk_mapping_collection(df)

In [None]:
df.columns

In [None]:
def insert_into_mongodb(df):
    client = MongoClient(os.environ["MONGODB_IP"], int(os.environ["MONGODB_PORT"]))
    db = client[os.environ["MONGODB_DB"]]
    collection = db[os.environ["MONGODB_COLLECTION"]]
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Use the summary embedding index as the MongoDB primary key
    df['_id'] = df['mega_chunk_summary_embedding_index']
    data_dict = df.to_dict("records")
    collection.insert_many(data_dict)
    print("[INFO]: Successfully Inserted!")

In [None]:
df

In [None]:
df.drop_duplicates()