<a href="https://colab.research.google.com/github/prakul/MongoDB-AI-Resources/blob/main/Langchain%2BMongoDB_Parent_document_retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Author: Prakul Agarwal
# Install prerequisite dependencies


In [None]:
!pip install langchain pypdf pymongo openai python-dotenv tiktoken

## Set up the environment

In [None]:
import os
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv(override=True)

# Add a .env file to the notebook root directory containing OPENAI_API_KEY="xxx" and MONGO_URI="xxx" so that load_dotenv can load these variables

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MONGO_URI = os.environ["MONGO_URI"]
DB_NAME = "langchain-test-3"
COLLECTION_NAME = "test"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "default"
EMBEDDING_FIELD_NAME = "embedding"

client = MongoClient(MONGO_URI)
db = client[DB_NAME]
MONGODB_COLLECTION = db[COLLECTION_NAME]
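
If either variable is missing from the `.env` file, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of `python-dotenv` or of the original notebook.

```python
import os


def require_env(*names, environ=None):
    """Return the values of the named environment variables, in order.

    Raises RuntimeError listing every missing or empty variable at once.
    `environ` defaults to os.environ and can be replaced for testing.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Add them to the .env file in the notebook root directory."
        )
    return tuple(env[name] for name in names)
```

After `load_dotenv(override=True)`, it could be used as `OPENAI_API_KEY, MONGO_URI = require_env("OPENAI_API_KEY", "MONGO_URI")`.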

## PREPARE DATA

In [None]:
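from langchain.document_loaders import PyPDFLoader

# Load the GPT-4 Technical Report; PyPDFLoader downloads the PDF from the URL
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()  # one Document per page
docs = loader.load_and_split()  # pages split further by the default text splitter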


In [None]:
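import copy

def produce_sentences(docs):
    """Split each document into sentence-level documents.

    Returns the new sentence documents; the input documents are tagged
    in place with doc_level='page'.
    """
    new_docs = []
    for doc in docs:
        for sentence in doc.page_content.split('.'):
            # Skip empty or single-character fragments left over by the split
            if len(sentence) < 2:
                continue
            new_doc = copy.deepcopy(doc)
            new_doc.page_content = sentence
            new_doc.metadata['doc_level'] = 'sentence'
            new_docs.append(new_doc)
        doc.metadata['doc_level'] = 'page'

    return new_docs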

In [None]:
from collections import defaultdict
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def parent_child_splitter(data):
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    # This text splitter is used to create the child documents
    # It should create documents smaller than the parent
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    parent_docs = parent_splitter.split_documents(data)
    child_docs = child_splitter.split_documents(data)
    for parent_doc in parent_docs:
        parent_doc.metadata["doc_level"] = "parent"
    for child_doc in child_docs:
        child_doc.metadata["doc_level"] = "child"
    return parent_docs, child_docs



In [None]:
parent_docs, child_docs = parent_child_splitter(data)

In [None]:
print(parent_docs[0])
len(parent_docs)

page_content='GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4,

207

In [None]:
print(child_docs[0])
len(child_docs)


page_content='GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated' metadata={'source': '/tmp/tmpfbhs4lhc/tmp.pdf', 'page': 0, 'doc_level': 'child'}


1393

## INSERT DATA

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

# Insert the parent and child documents into MongoDB Atlas Vector Search
x = MongoDBAtlasVectorSearch.from_documents(
    documents=parent_docs + child_docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)


## CREATE INDEX

Create an Atlas Search index with the definition given below, using either of these options:

* (option a) the pymongo driver
* (option b) Atlas UI -> Search -> JSON Editor

Reference: https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/


```
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      },
      "doc_level": [
        {
          "type": "token"
        }
      ]
    }
  }
}
```

In [None]:

MONGODB_COLLECTION.create_search_index(
    {
        "definition": {
            "mappings": {
                "dynamic": True,
                "fields": {
                    EMBEDDING_FIELD_NAME: {
                        "dimensions": 1536,
                        "similarity": "cosine",
                        "type": "knnVector",
                    },
                    "doc_level": {"type": "token"},
                },
            }
        },
        "name": ATLAS_VECTOR_SEARCH_INDEX_NAME,
    }
)
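
An index is not usable the moment it is created; Atlas builds it in the background, and queries issued before the build finishes can come back empty. The sketch below polls until the index is ready. It assumes pymongo 4.5 or newer, where `Collection.list_search_indexes` is available, and it assumes each returned index document carries a boolean `queryable` field. `wait_for_search_index` is a hypothetical helper name, not a pymongo API.

```python
import time


def wait_for_search_index(collection, index_name, timeout_s=300, poll_s=5):
    """Poll until the named Atlas Search index reports itself queryable.

    Assumes `collection.list_search_indexes(name)` yields dict-like index
    descriptions with a boolean "queryable" field (an assumption about the
    response shape). Returns True once queryable, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in collection.list_search_indexes(index_name):
            if index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In this notebook it could be called as `wait_for_search_index(MONGODB_COLLECTION, ATLAS_VECTOR_SEARCH_INDEX_NAME)` before running the queries in the next section.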

## DATA QUERY

In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import MongoDBAtlasVectorSearch

vector_search = MongoDBAtlasVectorSearch.from_connection_string(
    MONGO_URI,
    DB_NAME + "." + COLLECTION_NAME,
    OpenAIEmbeddings(disallowed_special=()),
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME
)


In [None]:
# Let's now call the vector search functionality on the smaller 'child' chunks
sub_docs = vector_search.similarity_search(
    "gpt-4 compute requirements",
    k=5,
    pre_filter={"doc_level": {"$eq": "child"}},
    post_filter_pipeline=[{"$project": {"embedding": 0}}],
)
for doc in sub_docs:
    print(doc)


page_content='gpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simulate the\nconditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5\nperformance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the\nlower end of the range of percentiles, but this creates some artifacts on the AP exams which have very' metadata={'_id': ObjectId('65430b6f179f5b5bbf292c55'), 'source': '/tmp/tmpfbhs4lhc/tmp.pdf', 'page': 5, 'doc_level': 'child'}
page_content='GPT-4 model, but they did not have the ability to ﬁne-tune it. They also did not have access to the\nﬁnal version of the model that we deployed. The ﬁnal version has capability improvements relevant\nto some of the factors that limited the earlier models power-seeking abilities, such as longer context\nlength, and improved problem-solving abilities as in some cases we /quotesingle.ts1ve observed.' metadata={'_id': ObjectId('65430b7c179f5b5bbf292f33')

In [None]:
# Let's now call the vector search functionality on the larger 'parent' chunks
sub_docs = vector_search.similarity_search(
    "gpt-4 compute requirements",
    k=5,
    pre_filter={"doc_level": {"$eq": "parent"}},
    post_filter_pipeline=[{"$project": {"embedding": 0}}],
)
for doc in sub_docs:
    print(doc)


page_content='Observed\nPrediction\ngpt-4\n100p 10n 1µ 100µ 0.01 1\nCompute1.02.03.04.05.06.0Bits per wordOpenAI codebase next word predictionFigure 1. Performance of GPT-4 and smaller models. The metric is ﬁnal loss on a dataset derived\nfrom our internal codebase. This is a convenient, large dataset of code tokens which is not contained in\nthe training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of training compute. A power law ﬁt to the smaller models (excluding GPT-4) is\nshown as the dotted line; this ﬁt accurately predicts GPT-4’s ﬁnal loss. The x-axis is training compute\nnormalized so that GPT-4 is 1.\nObserved\nPrediction\ngpt-4\n1µ 10µ 100µ 0.001 0.01 0.1 1\nCompute012345– Mean Log Pass RateCapability prediction on 23 coding problems\nFigure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of\nthe HumanEval dataset. A power law ﬁt to the smaller models (excluding GPT-4) 

## Parent child retrieval

Now we want to search on the smaller 'child' chunks but return the larger 'parent' chunks.

To do this, we perform a `left outer join` on our collection to find a parent chunk
whose `page number` matches that of each child chunk. This `left outer join` is performed via the [$lookup operator in MongoDB](https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/).
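
To make the join semantics concrete, here is a small in-memory emulation on plain dictionaries. It is an illustration only: `lookup_parent_by_page` and the toy documents are hypothetical, and the real join runs server-side inside the aggregation pipeline.

```python
def lookup_parent_by_page(child_hits, collection_docs):
    """Emulate the page-based left outer join on plain dicts.

    For each child hit, attach a "parent_context" list holding at most one
    parent chunk from the same page, or an empty list when no parent matches
    (the left-outer-join behaviour).
    """
    joined = []
    for child in child_hits:
        parents = [
            doc for doc in collection_docs
            if doc["doc_level"] == "parent" and doc["page"] == child["page"]
        ]
        # Keep only the first match, mirroring a limit of one parent per child
        joined.append({**child, "parent_context": parents[:1]})
    return joined


# Toy stand-ins for documents in the collection (hypothetical data)
collection_docs = [
    {"text": "parent chunk A on page 5", "page": 5, "doc_level": "parent"},
    {"text": "parent chunk B on page 5", "page": 5, "doc_level": "parent"},
    {"text": "child chunk on page 5", "page": 5, "doc_level": "child"},
    {"text": "child chunk on page 9", "page": 9, "doc_level": "child"},
]
child_hits = [d for d in collection_docs if d["doc_level"] == "child"]

for row in lookup_parent_by_page(child_hits, collection_docs):
    print(row["text"], "->", [p["text"] for p in row["parent_context"]])
```

Note what the toy data shows: a child with no parent on its page still appears, with an empty `parent_context`, and when a page holds several parent chunks only one of them is returned. Because the join key is the page number, that parent is not guaranteed to be the chunk that actually contains the child text.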




In [None]:
# Search on the smaller 'child' chunks, then use $lookup to attach a 'parent' chunk
# from the same page to each hit (see the explanation above)
results = vector_search.similarity_search(
    "gpt-4 compute requirements",
    k=5,
    pre_filter={"doc_level": {"$eq": "child"}},
    post_filter_pipeline=[
        {"$project": {"embedding": 0}},
        {
            "$lookup": {
                "from": COLLECTION_NAME,
                "localField": "page",
                "foreignField": "page",
                "as": "parent_context",
                "pipeline": [
                    {"$match": {"doc_level": "parent"}},
                    {"$limit": 1},
                    {"$project": {"embedding": 0}},
                ],
            }
        },
    ],
)
for result in results:
    print(f"Child_doc: {result.page_content}\n\nParent_doc: {result.metadata['parent_context'][0]['text']} \n\n\n")



Child_doc: gpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simulate the
conditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5
performance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the
lower end of the range of percentiles, but this creates some artifacts on the AP exams which have very

Parent_doc: AP Calculus BCAMC 12Codeforces RatingAP English LiteratureAMC 10Uniform Bar ExamAP English LanguageAP ChemistryGRE QuantitativeAP Physics 2USABO Semifinal 2020AP MacroeconomicsAP StatisticsLSATGRE WritingAP MicroeconomicsAP BiologyGRE VerbalAP World HistorySAT MathAP US HistoryAP US GovernmentAP PsychologyAP Art HistorySAT EBRWAP Environmental Science
Exam0%20%40%60%80%100%Estimated percentile lower bound (among test takers)
Exam results (ordered by GPT-3.5 performance)gpt-4
gpt-4 (no vision)
gpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simul


If we look at the first result, we can see that `result.page_content` is the child chunk and `result.metadata['parent_context']` contains the parent chunk:

*   ['page': 5, 'doc_level': 'child']
page_content='gpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simulate the\nconditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5\nperformance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the\nlower end of the range of percentiles, but this creates some artifacts on the AP exams which have very'

* ['page': 5, 'doc_level': 'parent', 'metadata'.'parent_context'.'text' ]  
text= 'AP Calculus BCAMC 12Codeforces RatingAP English LiteratureAMC 10Uniform Bar ExamAP English LanguageAP ChemistryGRE QuantitativeAP Physics 2USABO Semifinal 2020AP MacroeconomicsAP StatisticsLSATGRE WritingAP MicroeconomicsAP BiologyGRE VerbalAP World HistorySAT MathAP US HistoryAP US GovernmentAP PsychologyAP Art HistorySAT EBRWAP Environmental Science\nExam0%20%40%60%80%100%Estimated percentile lower bound (among test takers)\nExam results (ordered by GPT-3.5 performance)gpt-4\ngpt-4 (no vision)\ngpt3.5Figure 4. GPT performance on academic and professional exams. In each case, we simulate the\nconditions and scoring of the real exam. Exams are ordered from low to high based on GPT-3.5\nperformance. GPT-4 outperforms GPT-3.5 on most exams tested. To be conservative we report the\nlower end of the range of percentiles, but this creates some artifacts on the AP exams which have very\nwide scoring bins. For example although GPT-4 attains the highest possible score on AP Biology (5/5),\nthis is only shown in the plot as 85th percentile because 15 percent of test-takers achieve that score.\nGPT-4 exhibits human-level performance on the majority of these professional and academic exams.\nNotably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of\ntest takers (Table 1, Figure 4).\nThe model’s capabilities on exams appear to stem primarily from the pre-training process and are not\nsigniﬁcantly affected by RLHF. On multiple choice questions, both the base GPT-4 model and the\nRLHF model perform equally well on average across the exams we tested (see Appendix B).\nWe also evaluated the pre-trained base GPT-4 model on traditional benchmarks designed for evaluating\nlanguage models. For each benchmark we report, we ran contamination checks for test data appearing\nin the training set (see Appendix D for full details on per-benchmark contamination).5We used\nfew-shot prompting [1] for all benchmarks when evaluating GPT-4.6'



-------------------------------------------------
