1. Create a Python virtual environment to run this notebook, if one does not already exist.
    > Sample: python -m venv .venv-exp
2. Install the required packages.


In [1]:
!pip install -q PyPDF2
!pip install -q unstructured langchain
!pip install -q chromadb
!pip install -q sentence_transformers


3. **Preprocessing Module -- with RecursiveCharacterTextSplitter and SentenceTransformersTokenTextSplitter**

> All the texts extracted from the PDF in this example are single sentences, and the embedding model can encode text only up to a fixed length. For example, "all-MiniLM-L6-v2" encodes up to 256 tokens (word pieces, not whole words), hence tokens_per_chunk=256. It truncates any text longer than this. Ref: https://realpython.com/chromadb-vector-database/

> That is why the character splitter (RecursiveCharacterTextSplitter) alone is not enough: it can produce text chunks longer than 256 tokens. I therefore also apply the embedding model's SentenceTransformersTokenTextSplitter with tokens_per_chunk=256, which is the maximum context window length of our embedding model.

In [4]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter


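def get_recursive_char_splitted_text(raw_texts):
    # Stage 1: split on paragraph, line, sentence, and word boundaries (in that order)
    character_splitter = RecursiveCharacterTextSplitter(
        separators=["\n\n", "\n", ". ", " ", ""],
        chunk_size=1000,
        chunk_overlap=0
    )
    # Use the function argument rather than the global pdf_texts
    character_split_texts = character_splitter.split_text('\n\n'.join(raw_texts))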

    #print((character_split_texts[10]))
    #print(f"\nTotal chunks: {len(character_split_texts)}")

    return character_split_texts

def transform(raw_texts):

    token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

    token_split_texts = []
    character_split_texts = get_recursive_char_splitted_text(raw_texts)
    for text in character_split_texts:
        token_split_texts += token_splitter.split_text(text)

    #print((token_split_texts[10]))
    #print(f"\nTotal chunks: {len(token_split_texts)}")

    return token_split_texts
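
The two-stage idea above can be illustrated without any third-party library. The following is only a minimal sketch, not the LangChain implementation: it assumes whitespace-separated words as a stand-in for the model's word-piece tokens, and the helper names `split_by_chars`, `split_by_tokens`, and `two_stage_split` are hypothetical. It shows why a chunk that fits the character limit may still need a second, token-based split.

```python
def split_by_chars(text, chunk_size):
    """Stage 1 (simplified): pack paragraphs greedily up to chunk_size characters.

    A single paragraph longer than chunk_size is kept whole here; stage 2
    is what enforces the hard token limit.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def split_by_tokens(text, tokens_per_chunk):
    """Stage 2 (simplified): cut text into windows of at most tokens_per_chunk tokens.

    Assumption: whitespace tokens approximate the model's word-piece tokens.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + tokens_per_chunk])
        for i in range(0, len(tokens), tokens_per_chunk)
    ]


def two_stage_split(pages, chunk_size=1000, tokens_per_chunk=256):
    """Apply the character split first, then the token split to every chunk."""
    result = []
    for chunk in split_by_chars("\n\n".join(pages), chunk_size):
        result.extend(split_by_tokens(chunk, tokens_per_chunk))
    return result


# A 606-character chunk fits the character limit but holds 303 short tokens,
# so the token stage splits it again.
pages = [("a " * 300).strip(), "b b b"]
chunks = two_stage_split(pages)
print([len(c.split()) for c in chunks])  # [256, 47]
```

With a real word-piece tokenizer the counts would differ, but the principle is the same: the character limit bounds chunk size only roughly, while the token limit guarantees nothing is truncated at embedding time.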


4. **Custom Vector Data Store class creation module** - encapsulates multiple related functions around the Chroma DB client and collection creation; uses the default "all-MiniLM-L6-v2" model to create the embedding vectors

In [21]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

EMBED_MODEL = "all-MiniLM-L6-v2"
CHROMA_DATA_PATH = "chroma_vec_store/"


embedding_func = SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)

class VectorDataStore:
    def __init__(self, collection_name) -> None:
        self.embedding_function = embedding_func
        self.vector_db_client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)
        self.vector_collection = self.vector_db_client.create_collection(
                                    name=collection_name,
                                    embedding_function=embedding_func,
                                    metadata={"hnsw:space": "cosine"},
                                 )
        
    def populate_vectors(self, chunked_raw_datasets):
        # Call preprocessor methods here
        token_split_texts = transform(chunked_raw_datasets)
        self.vector_collection.add(
                                    ids=[str(i) for i in range(len(token_split_texts))],
                                    documents=token_split_texts
                                    )
        # Note: count() returns the number of stored chunks in this collection
        print(f"# of collections:{self.vector_collection.count()}")

    def search_vectors(self, query, n_results=1):
        # query() returns one result list per query text; a single query is passed, so take index 0
        results = self.vector_collection.query(query_texts=[query], n_results=n_results)
        retrieved_documents = results['documents'][0]

        return retrieved_documents
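
The collection above is configured with `"hnsw:space": "cosine"`, so results are ranked by cosine distance (1 minus cosine similarity). The sketch below shows that ranking with plain Python lists. It is an exact brute-force search over hand-made toy vectors, whereas Chroma uses an approximate HNSW index over real embeddings; the helper names `cosine_distance` and `top_n` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top_n(query_vec, doc_vecs, n_results=1):
    """Return the indices of the n_results vectors closest to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_distance(query_vec, doc_vecs[i]),
    )
    return ranked[:n_results]


# Toy 2-dimensional "embeddings" (illustrative values only).
docs = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
print(top_n([1.0, 0.0], docs, n_results=2))  # [1, 0]
```

A smaller distance means a closer match, which is why the document whose vector points in nearly the same direction as the query comes first.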


    

5. **Read the input PDF file from a local folder using PdfReader and build the text data as a list[str]**

In [22]:
from PyPDF2 import PdfReader

# location of the pdf file/files. 
reader = PdfReader('MS_2022_Annual_Report.pdf')
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

#print((pdf_texts[0]))
#print(len(pdf_texts[0]))

6. **Vector Data Store & embedding vector creation module**

In [24]:
COLLECTION_NAME = "take-home-vec-store" 

vector_store = VectorDataStore(COLLECTION_NAME)
vector_store.populate_vectors(pdf_texts)


# of collections:349


7. **Finally, query the vector data store using a sample query string**

In [25]:

query = "What was the total revenue?"

retrieved_documents = vector_store.search_vectors(query, n_results=3)

for document in retrieved_documents:
    print(document)
    print('\n')

revenue, classified by significant product and service offerings, was as follows : ( in millions ) year ended june 30, 2022 2021 2020 server products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22, 488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8, 077 search and news advertising 11, 591 9, 267 8, 524 enterprise services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291 4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast certain previously reported amounts in the table above to conform to the way we internally manage and monitor our business.


74 note 13 — unearned revenue unearned revenue by segment was as follows : ( in millions ) june 30, 2022 2021 productivity and business processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in unearned revenue were as follows : ( in milli

Querying the vector data store using a second sample query string:

In [26]:
query = "What were Microsoft's mission for 2022?"

retrieved_documents = vector_store.search_vectors(query, n_results=2)

for document in retrieved_documents:
    print(document)
    print('\n')

1 dear shareholders, colleagues, customers, and partners : we are living through a period of historic economic, societal, and geopolitical change. the world in 2022 looks nothing like the world in 2019. as i write this, inflation is at a 40 - year high, supply chains are stretched, and the war in ukraine is ongoing. at the same time, we are entering a technological era with the potential to power awesome advancements across every sector of our economy and society. as the world ’ s largest software company, this places us at a historic intersection of opportunity and responsibility to the world around us. our mission to empower every person and every organization on the planet to achieve more has never been more urgent or more necessary. for all the uncertainty in the world, one thing is clear : people and organizations in every industry are increasingly looking to digital technology to overcome today ’ s challenges and emerge stronger. and no


7 220, 000 people who work at microsoft. 

More queries:

In [27]:
query = "Tell me how did Microsoft Protect fundamental rights in 2022?"

retrieved_documents = vector_store.search_vectors(query, n_results=1)

for document in retrieved_documents:
    print(document)
    print('\n')

addressing the world ’ s most pressing issues. this year, we provided $ 3. 2 billion in donated and discounted technology to 302, 000 nonprofits serving over 1. 2 billion people globally. and earlier this month, we announced that microsoft will double the number of nonprofits we reach worldwide over the next five years. protect fundamental rights we unequivocally support the fundamental rights of people, from defending democracy, to protecting human rights, to addressing racial injustice and inequity. and, as people ’ s access to education, healthcare, jobs, and other critical services becomes increasingly dependent on technology, it ’ s clear that access to broadband and accessible technology is also fundamental to building a more equitable future. since 2017, we ’ ve helped more than 50 million people in unserved rural communities globally gain access to affordable




In [28]:
query = "Tell me what was the total amount shared repurchased by Microsoft Protect in First quarter of 2022?"

retrieved_documents = vector_store.search_vectors(query, n_results=1)

for document in retrieved_documents:
    print(document)
    print('\n')

share repurchases. this share repurchase program commenced in november 2021, following completion of the program approved on september 18, 2019, has no expiration date, and may be terminated at any time. as of june 30, 2022, $ 40. 7 billion remained of this $ 60. 0 billion share repurchase program. we repurchased the following shares of common stock under the share repurchase programs : ( in millions ) shares amount shares amount shares amount year ended june 30, 2022 2021 2020 first quarter 21 $ 6, 200 25 $ 5, 270 29 $ 4, 000 second quarter 20 6, 233 27 5, 750 32 4, 600 third quarter 26 7, 800 25 5, 750 37 6, 000 fourth quarter 28 7, 800 24 6, 200 28 5, 088 total 95 $ 28, 033 101 $ 22, 970 126 $ 19, 688 all repurchases were made using cash resources. shares repurchased during the fourth and third quarters of fiscal year




In [34]:
query = "Tell me what ambition drives the Microsoft as a Company?"

retrieved_documents = vector_store.search_vectors(query, n_results=3)

for document in retrieved_documents:
    print(document)
    print('\n')

our future growth depends on our ability to transcend current product category definitions, business models, and sales motions. we have the opportunity to redefine what customers and partners can expect and are working to deliver new solutions that reflect the best of microsoft.


microsoft is a technology company whose mission is to empower every person and every organization on the planet to achieve more. we strive to create local opportunity, growth, and impact in every country around the world. our platforms and tools help drive small business productivity, large business competitiveness, and public - sector efficiency. they also support new startups, improve educational and health outcomes, and empower human ingenuity. we generate revenue by offering a wide range of cloud - based and other services to people and businesses ; licensing and supporting an array of software products ; designing, manufacturing, and selling devices ; and delivering relevant online advertising to a globa

In [32]:
query = "Tell me the future opportunities of Microsoft?"

retrieved_documents = vector_store.search_vectors(query, n_results=1)

for document in retrieved_documents:
    print(document)
    print('\n')

our future growth depends on our ability to transcend current product category definitions, business models, and sales motions. we have the opportunity to redefine what customers and partners can expect and are working to deliver new solutions that reflect the best of microsoft.


