------------------------------------
#### Retrievals using embeddings (key points)
-----------------------------------

**Dataset**
- PDF file of the Microsoft annual report (2022)
    - plain text
    - tables
    - Plots/graphs
    
**Extract data**
- Extract text pages

**Chunking**
- Two-Step Chunking Strategy (both splitters come from LangChain): `RecursiveCharacterTextSplitter` + `SentenceTransformersTokenTextSplitter`

**Embeddings**
- sentence transformer (from chromadb)

**Vector database**
- in memory chromadb

**Query**
- Using embeddings
- Using RAG


In [1]:
#!pip install chromadb

In [2]:
#!pip install langchain

In [3]:
#!pip install pdfreader
#!pip install PyPDF2

In [4]:
#!pip install pypdf

In [5]:
from helper_utils import word_wrap

In [6]:
from pypdf import PdfReader

In [7]:
reader = PdfReader(r".\data\microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

In [8]:
len(pdf_texts)

93

In [9]:
# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [10]:
len(pdf_texts)

90

**RecursiveCharacterTextSplitter**

The `RecursiveCharacterTextSplitter` is a utility that helps split long text into smaller chunks while maintaining as much context as possible. Here's how it works:

##### Separators

The `separators` list defines the order in which the text will be split. In this example:

- It first attempts to split by two newlines (`"\n\n"`), which typically indicate a paragraph break.
- If a resulting piece is still longer than the chunk size, it splits that piece by a single newline (`"\n"`), i.e. at line breaks.
- Then it tries a period followed by a space (`". "`), which marks sentence boundaries.
- After that, it splits by a space (`" "`), which breaks the text at the word level.
- Finally, it splits between individual characters (`""`) if none of the above yields pieces that fit the chunk size.

##### Chunk size and overlap

- `chunk_size=1000` means that each chunk will have a maximum of 1000 characters.
- `chunk_overlap=0` means there will be no overlap between consecutive chunks (i.e., no repeated content).

##### Recursive splitting

The process is recursive because it starts from the coarsest separator (paragraphs) and, only for pieces that are still larger than 1000 characters, moves down to the next finer separator (lines, sentences, words, characters). Adjacent small pieces are merged back together as long as the result stays within the limit, so the chunks end up as close to 1000 characters as possible while remaining coherent pieces of text.


```python
text = "This is a long paragraph with multiple sentences. It discusses several topics and ideas, flowing continuously. For instance, it talks about machine learning, deep learning, and various AI applications. While doing so, it doesn’t include paragraph breaks or line breaks. Everything is packed in a single block."
```

##### Initial Split
It tries to split using `"\n\n"` (paragraph breaks). There are no `\n\n` in this text, so no split happens.

##### Next Split
It then looks for `"\n"` (line breaks). There are none here either.

##### Next Split
It tries `". "` (sentence breaks). Here, it successfully splits the text into five sentences:
- "This is a long paragraph with multiple sentences."
- "It discusses several topics and ideas, flowing continuously."
- "For instance, it talks about machine learning, deep learning, and various AI applications."
- "While doing so, it doesn’t include paragraph breaks or line breaks."
- "Everything is packed in a single block."

##### Final Chunks
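If any of these sentences exceeded 1000 characters, the splitter would continue splitting it by `" "` (spaces) and eventually by characters if necessary. Here every sentence is short and the whole text is well under 1000 characters, so with `chunk_size=1000` the pieces are merged back into a single chunk.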


```python
text = """Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis.

Machine learning is a subset of AI that enables systems to learn from data and improve from experience.

Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."""
```

##### Initial Split
The first separator `"\n\n"` (paragraph breaks) will be applied:
- "Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis."
- "Machine learning is a subset of AI that enables systems to learn from data and improve from experience."
- "Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."

##### Next Split
If any paragraph exceeded 1000 characters, the splitter would then split it further using `"\n"`, `". "`, and so on.
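
The recursive idea described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and unlike the real splitter it drops the separator at chunk boundaries and has no overlap support. A tiny `chunk_size` is used so the behaviour is visible on a short string.

```python
def recursive_split(text, separators, chunk_size):
    """Toy recursive splitter (hypothetical helper, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first (coarsest) separator that occurs in the text.
    # The empty string means "cut anywhere".
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            break
    else:
        return [text]  # nothing left to split on; give up on this piece

    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    pieces = [p for p in text.split(sep) if p]
    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then recurse with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, separators[i + 1:], chunk_size))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            # Merge small neighbours back together while they still fit.
            current = current + sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


separators = ["\n\n", "\n", ". ", " ", ""]
sample = "aaaa bbbb. cccc dddd. eeee"

print(recursive_split(sample, separators, chunk_size=12))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
print(recursive_split(sample, separators, chunk_size=6))
# ['aaaa', 'bbbb', 'cccc', 'dddd', 'eeee']
```

With `chunk_size=12` the sentence-level split is enough; with `chunk_size=6` each sentence is still too long, so the sketch falls through to the word-level separator.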

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [12]:
character_splitter = RecursiveCharacterTextSplitter(
    separators   = ["\n\n", "\n", ". ", " ", ""],
    chunk_size   = 1000,
    chunk_overlap= 0
)

In [13]:
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

In [14]:
print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 344


In [15]:
word_wrap(character_split_texts[10])

'increased, due in large part to significant global datacenter\nexpansions and the growth in Xbox sales and usage. Despite \nthese\nincreases, we remain dedicated to achieving a net -zero future. We\nrecognize that progress won’t always be linear, \nand the rate at which\nwe can implement emissions reductions is dependent on many factors that\ncan fluctuate over time.  \nOn the path to becoming water positive, we\ninvested in 21 water replenishment projects that are expected to\ngenerate \nover 1.3 million cubic meters of volumetric benefits in nine\nwater basins around the world. Progress toward our zero waste\n\ncommitment included diverting more than 15,200 metric tons of solid\nwaste otherwise headed to landfills and incinerators, \nas well as\nlaunching new Circular Centers to increase reuse and reduce e-waste at\nour datacenters.  \nWe contracted to protect over 17,000 acres of land\n(50% more than the land we use to operate), thus achieving our'

#### The SentenceTransformersTokenTextSplitter

The `SentenceTransformersTokenTextSplitter` splits text by token count rather than character count. It counts tokens with the tokenizer of a Sentence Transformers model, so chunk sizes are measured in the same kind of units an embedding model consumes. Here's how it works:

##### chunk_overlap=0
This means there is no overlap between consecutive chunks. Each chunk will be entirely separate from the previous one, with no repeated content.

##### tokens_per_chunk=256
This indicates that each chunk will contain a maximum of 256 tokens. Tokens here refer to the processed units of text after tokenization, which could be words, parts of words, punctuation marks, etc., depending on the tokenizer.

##### Use Case
This splitter is typically useful when working with models that have token limits (like many transformer models), where you need to control the number of tokens being processed at a time.
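
To make `tokens_per_chunk` and `chunk_overlap` concrete, here is a minimal sketch of a token-window splitter. It is an illustration under stated assumptions, not the LangChain class: the name `split_by_tokens` is hypothetical, and whitespace-separated words stand in for the subword tokens a real tokenizer would produce.

```python
def split_by_tokens(text, tokens_per_chunk, chunk_overlap=0, tokenize=str.split):
    """Toy token-window splitter; words stand in for model tokens (assumption)."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and < tokens_per_chunk")

    tokens = tokenize(text)
    step = tokens_per_chunk - chunk_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        if start + tokens_per_chunk >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = "one two three four five six seven"

print(split_by_tokens(sample, tokens_per_chunk=3))
# ['one two three', 'four five six', 'seven']
print(split_by_tokens(sample, tokens_per_chunk=3, chunk_overlap=1))
# ['one two three', 'three four five', 'five six seven']
```

With `chunk_overlap=0` no token appears twice; with `chunk_overlap=1` the last token of each chunk is repeated at the start of the next one.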


#### Two-Step Chunking Strategy: RecursiveCharacterTextSplitter + SentenceTransformersTokenTextSplitter

Chunking first with `RecursiveCharacterTextSplitter` at 1000 characters, then splitting each of those chunks with `SentenceTransformersTokenTextSplitter` at 256 tokens, gives a layered approach: chunks stay coherent while still fitting the embedding model's input window. Here's how it benefits:

##### 1. Balanced Chunk Sizes for Text Processing
- **Initial Character-Based Chunking:** The initial chunking by `RecursiveCharacterTextSplitter` (1000 characters) ensures that the text is divided into manageable pieces that retain context, such as paragraphs or sentences, without breaking down into excessively small parts.
- **Token-Based Splitting for Model Constraints:** After chunking, each chunk is split further based on token limits (256 tokens per chunk) to fit within the constraints of transformer models, reducing the risk that the embedding model silently truncates long inputs.

##### 2. Optimized for Transformer Models
- Transformer-based models typically have a **maximum token limit** (often 512 or 1024 tokens). By splitting into 256-token chunks, you ensure that each chunk is well within the limit, reducing the risk of truncation or cutting off important information in the middle of a chunk.

##### 3. Combines Flexibility with Granularity
- **Character-based Splitting:** Handles initial splitting by context (paragraphs, sentences) and ensures that large blocks of text are broken up in a logical way without splitting mid-word.
- **Token-based Splitting:** Offers more **granularity** by ensuring each piece fits neatly into a model’s processing window, providing efficient model performance without losing coherence.

##### 4. Improved Performance for Downstream Tasks
- The combination of these two splitting strategies helps to balance **context retention** (larger chunks from character splitting) with **computational efficiency** (smaller chunks optimized for transformer models).
- This is especially useful for **tasks like text embedding, summarization, and question answering**.


In [16]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

In [17]:
token_split_texts = []

for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}") 

increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349


In [18]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# The embedding model is essentially a BERT model.
# BERT gives one embedding per token.
# Sentence Transformers take the tokens of a document and produce
# a single embedding for the whole document (in this case, a chunk).

# That is why we used the SentenceTransformersTokenTextSplitter.

In [19]:
embedding_function = SentenceTransformerEmbeddingFunction()
len(embedding_function([token_split_texts[10]])[0])

384

#### Overview of `SentenceTransformerEmbeddingFunction`

1. **Input Text**: 
   - The function takes input text, which can be sentences, paragraphs, or even entire documents.

2. **Tokenization**: 
   - It processes the text through a Sentence Transformer model, which first tokenizes the input to handle it appropriately for embedding.

3. **Embedding Generation**: 
   - The tokenized input is passed through the model to generate embeddings. Each embedding is typically a fixed-length vector that represents the semantic meaning of the text.
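
One way to picture the step from per-token vectors to a single fixed-length vector is mean pooling. The sketch below assumes mean pooling followed by L2 normalization, which many Sentence Transformer models use; the exact pooling depends on the model, and the function names and toy 2-dimensional vectors here are made up for illustration.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padded positions (mask 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


# Three token vectors; the last position is padding and is masked out.
token_vectors = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 2.0]
print(l2_normalize(sentence_vector))
```

However many tokens the input has, the pooled vector keeps the model's dimensionality, which is why every chunk in this notebook maps to a vector of the same length (384).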

##### Key Parameters (Example)

While the implementation specifics may vary, common parameters for initializing a SentenceTransformer embedding function might include:

- **model_name**: The name of the pre-trained Sentence Transformer model to use (e.g., `"all-MiniLM-L6-v2"`).
- **device**: Specifies whether to run the model on CPU or GPU for faster processing.

##### Benefits

1. **High-Quality Embeddings**: 
   - Sentence Transformers are pre-trained on large datasets and are optimized for producing high-quality embeddings that capture nuanced semantic meanings.

2. **Versatility**: 
   - The embeddings can be used in various NLP applications, including:
     - **Semantic similarity**
     - **Information retrieval**
     - **Text classification**
     - **Clustering**


In [20]:
# Configure the client with allow_reset enabled
chroma_client = chromadb.Client(
    chromadb.config.Settings(
        allow_reset      =True,                            # Enable the ability to reset the database
        #persist_directory="path_to_persistent_directory",  # Optional for persistence
        #chroma_db_impl   ="duckdb+parquet"                 # Optional for persistence
    )
)

In [21]:
# reset
chroma_client.reset()                # Clears all data in the database
print("Database has been reset.")

Database has been reset.


In [22]:
chroma_collection = chroma_client.get_or_create_collection("microsoft_annual_report_2022", 
                                                            embedding_function = embedding_function)

In [23]:
%%time
# takes time

# create IDs
ids = [str(i) for i in range(len(token_split_texts))]

# add documents to chroma collection
chroma_collection.add(ids=ids, documents=token_split_texts)

chroma_collection.count()

CPU times: total: 24.8 s
Wall time: 22.7 s


349

In [24]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts= [query], 
                                  n_results  = 5)

retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

74 note 13 — unearned revenue unearned revenue by segment was as
follows : ( in millions ) june 30, 2022 2021 productivity and business
processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more
personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in
unearned revenue were as follows : ( in millions ) year ended june 30,
2022 balance, beginning of period $ 44, 141 deferral of revenue 110,
455 recognition of unearned revenue ( 106, 188 ) balance, end of period
$ 48, 408 revenue allocated to remaining performance obligations, which
includes unearned revenue and amounts that will be invoiced and
recognized as revenue in future periods, was $ 193 billion as of june
30, 2022, of which $ 189 billion is related to the commercial portion
of revenue. we expect to recognize approximately 45 % of this revenue
over the next 12 months and the remainder thereafter. note 14 — leases


that are not sold separately. • we tested the mathematical accuracy of
management ’ s calculat
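
Conceptually, a query like the one above is embedded with the same embedding function as the chunks, and the chunks whose vectors lie closest to the query vector are returned. The sketch below shows that idea as a brute-force search with cosine similarity over made-up 3-dimensional vectors. It is only an illustration: Chroma's own index and distance metric may differ, and the names `cosine_similarity`, `top_k`, and the document labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vec), doc_id)
        for doc_id, vec in doc_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings for three chunks and one query.
doc_vectors = {
    "revenue note": [0.9, 0.1, 0.0],
    "sustainability": [0.0, 0.2, 0.9],
    "audit": [0.6, 0.6, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

print(top_k(query_vector, doc_vectors, k=2))
# ['revenue note', 'audit']
```

Brute force is fine for a few hundred chunks like the 349 here; vector databases use an index so that search stays fast on much larger collections.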

In [25]:
#!pip install openai

In [26]:
import os
import openai
from openai import OpenAI

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable
openai_client = OpenAI()

In [27]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so the first one ends with a space
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {
            "role": "user", "content": f"Question: {query}. \n Information: {information}"
        }
    ]
    
    response = openai_client.chat.completions.create(
        model   = model,
        messages= messages,
    )
    content = response.choices[0].message.content
    return content

In [28]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))

The total revenue was $198,270 million for the year ended June 30,
2022.
