Referenced Sites
https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/

https://hasanaboulhasan.medium.com/the-best-text-chunking-method-f5faeb243d80

https://docs.llamaindex.ai/en/stable/api_reference/embeddings/huggingface/

https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/token_text_splitter/

https://docs.llamaindex.ai/en/stable/examples/embeddings/ollama_embedding/

https://ollama.com/blog/embedding-models

https://docs.llamaindex.ai/en/stable/examples/llm/ollama_gemma/

https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/



In [1]:
import chromadb
import os
from pypdf import PdfReader

In [2]:
# extract the text from pdf
def extract_text_from_pdf(pdf_path):
    pdf_texts=[]
    for filename in os.listdir(pdf_path):
        pdf_path= os.path.join(pdf_path, filename)
        if filename.endswith('.pdf'):
            reader= PdfReader(pdf_path)
            text=''
            for page in reader.pages:
                text+=page.extract_text().strip()
            pdf_texts.append(text.strip())
    return pdf_texts

In [3]:
extracted_text_from_pdf= extract_text_from_pdf('google_pdfs')

In [5]:
extracted_text_from_pdf= '\n'.join(extracted_text_from_pdf)

an embedding model is required to do the semantic chunking siupported by langchain.

For that here i use ollama embeddings

In [None]:
! pip install llama-index-embeddings-huggingface

In [7]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model=HuggingFaceEmbedding(model_name='BAAI/bge-small-en')

In [8]:
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser,
    SentenceSplitter)
from llama_index.core.schema import Document


splitter=SemanticSplitterNodeParser(
    buffer_size=5, 
    breakpoint_percentile_threshold=70,
    embed_model=embed_model,
)

doc=Document(text=extracted_text_from_pdf)


In [9]:
# chunk the pdf text now
nodes= splitter.get_nodes_from_documents([doc])

In [29]:
print(nodes[100].get_content())

Security Ownership of Certain Beneficial Owners and Management and 
Related Stockholder Matters
90
Item 13. Certain Relationships and Related Transactions, and Director Independence 90
Item 14. Principal Accountant Fees and Services 90
Part IV Item 15. Exhibits, Financial Statement Schedules 91
Item 16. Form 10-K Summary 94
Signatures 95iv Alphabet 2023 Annual Report
NOTE ABOUT FORWARD-LOOKING STATEMENTS
This Annual Report on Form 10-K contains forward-looking statements within the meaning of the Private Securities Litigation 
Reform Act of 1995. These include, among other things, statements regarding:
• the growth of our business and revenues and our 
expectations about the factors that influence our success 
and trends in our business;
• fluctuations in our revenues and margins and various 
factors contributing to such fluctuations;
• our expectation that the continuing shift from an offline to 
online world will continue to benefit our business;
• our expectation that the portion of

In [13]:
# get the text chunks
chunked_data= [node.get_content() for node in nodes]

* When you tokenize text with Hugging Face's AutoTokenizer, it returns a dictionary of token IDs, attention masks, etc.
* These token IDs are numerical representations of subword tokens.

For example:

```
tokenizer("Hello")  ➝ {'input_ids': [101, 3872, 776, 102], ...}
Here, 3872 and 776 are subword token IDs.
```

If you want to see the subwords, do this:

```
tokenizer.tokenize("hey there") ➝ ['hey', 'there']
tokenizer.convert_ids_to_tokens(tokenizer("Hey there")['input_ids']) 
```

In [19]:
import chromadb
from sentence_transformers import SentenceTransformer

In [20]:
# build your chroma db vector database
model_name_or_path = 'BAAI/bge-small-en'

def build_embeddings_and_store_in_vector_db(chunked_data):
    
    # get the embedding model
    embedding_model= SentenceTransformer(model_name_or_path)

    # craete the chroma client
    client = chromadb.PersistentClient(
        path="google_pdf_embeddings"
    )

    # define the collection [like a table]
    collection= client.get_or_create_collection(
        name="google_collection",
    ) 

    # get the embedding of each chunk
    embedding= embedding_model.encode(chunked_data)

    # add this to the collection
    collection.add(
        documents=chunked_data,
        ids=[str(i) for i in range(len(chunked_data))],
        embeddings= embedding,
        metadatas=[{"source": "google_pdfs"}] * len(chunked_data),
    )
    return collection

In [21]:
collection_built= build_embeddings_and_store_in_vector_db(chunked_data=chunked_data)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

* Now lets define a query
* Send it in to the LLM
* Recieve the Relevant queries

In [30]:
import ollama
def send_query_to_llm_for_more_queries(query, model_name="gemma3n:latest"):
    prompt = f"""   You are a knowledgeable financial research assistant. 
    Your users are inquiring about an annual report. 
    For the given question, propose up to five related questions to assist them in finding the information they need. 
    Provide concise, single-topic questions (withouth compounding sentences) that cover various aspects of the topic. 
    Ensure each question is complete and directly related to the original inquiry. 
    List each question on a separate line without numbering.
    

    Question:
    {query}

    Answer:"""
    

    try:
        response = ollama.chat(
            model=model_name,
            messages=[
                {
                    'role': 'user',
                    'content': prompt,
                }
            ],
            options={
                'temperature': 0.5, #[ closer to 0: Deterministic]
                'max_tokens': 250,  # length of response
                'top_p': 0.7, # Filters the model's vocabulary to the top probable tokens.
            }                 # Range: 0.1 (strict) to 1.0 (broad).
        )
        return response['message']['content']
    except Exception as e:
        return f"Error generating response: {str(e)}"


In [31]:
query="How did the google's revenue change from 2022 to 2023? And Why?"

In [32]:
similar_queries= send_query_to_llm_for_more_queries(query=query, model_name="gemma3n:latest")

In [33]:
similar_queries

"What was Google's total revenue in 2022?\nWhat was Google's total revenue in 2023?\nWhat were the primary drivers of Google's revenue growth or decline between 2022 and 2023?\nHow did revenue from Google's advertising business change between 2022 and 2023?\nWhat was the impact of economic conditions on Google's revenue during 2022 and 2023?\n\n\n\n"

In [35]:
# now make an augmented query
def augment_query_with_similar_queries(original_query, similar_queries):
    return f"{original_query}/n {similar_queries}"

In [36]:
augmented_query= augment_query_with_similar_queries(
    original_query=query,
    similar_queries=similar_queries)

retrieve the queries results from the Vector Database

In [38]:
retrieved_results= collection_built.query(
    query_texts=[augmented_query],
    n_results=5,
    include=["documents"]
)

In [39]:
retrieved_results

{'ids': [['297', '32', '265', '370', '308']],
 'embeddings': None,
 'documents': [['Changes in cost-per-click and cost-per-impression are driven by a number of interrelated factors including changes in \ndevice mix, geographic mix, advertiser spending, ongoing product and policy changes, product mix, property mix, and \nchanges in foreign currency exchange rates.\nGoogle subscriptions, platforms, and devices\nGoogle subscriptions, platforms, and devices revenues increased $5.6 billion from 2022 to 2023 primarily driven by \ngrowth in subscriptions, largely for YouTube services. The growth in YouTube services was primarily due to an increase in \npaid subscribers.\nGoogle subscriptions, platforms, and devices revenues increased $1.0 billion from 2021 to 2022 primarily driven by growth \nin subscription and device revenues, partially offset by a decrease in platform revenues. The growth in subscriptions was \nlargely for YouTube services, primarily due to an increase in paid subscribers.

hmmm.. seems a better reply..

lets test this

In [40]:
# send the response to the LLM for summarization
def send_retrieved_response_llm(retrieved_response, model_name="gemma3n:latest"):
    prompt = f"""You are a financial research assistant. 
    Also let the users know what all information [line by line] was provided to you.
    Your users have retrieved information from an annual report. 
    Summarize the key points from the provided text, focusing on the main findings and insights. 
    Ensure the summary is concise and captures the essence of the information.

    Retrieved Information:
    {retrieved_response}

    Summary:"""
    
    try:
        response = ollama.chat(
            model=model_name,
            messages=[
                {
                    'role': 'user',
                    'content': prompt,
                }
            ],
            options={
                'temperature': 0.5, #[ closer to 0: Deterministic]
                'max_tokens': 300,  # length of response
                'top_p': 0.7, # Filters the model's vocabulary to the top probable tokens.
            }                 # Range: 0.1 (strict) to 1.0 (broad).
        )
        return response['message']['content']
    except Exception as e:
        return f"Error generating response: {str(e)}"


In [41]:
final_response=send_retrieved_response_llm(retrieved_response=retrieved_results['documents'], model_name="gemma3n:latest")

In [42]:
final_response

"## Summary of Key Points from Alphabet's 2023 Annual Report\n\nThis summary highlights the key findings and insights from the provided text, focusing on financial performance, strategic direction, and market trends.\n\n**Financial Performance:**\n\n*   **Revenue Growth:** Google subscriptions, platforms, and devices revenue increased significantly from 2022 to 2023 ($5.6 billion) and from 2021 to 2022 ($1.0 billion), primarily driven by growth in YouTube subscriptions.\n*   **International Revenue:** Revenues from international markets are increasing, driven by growing internet adoption, particularly in emerging markets like India. However, these revenues are susceptible to fluctuations in foreign currency exchange rates.\n*   **Non-Advertising Revenue Growth:** Revenues beyond advertising (cloud, subscriptions, platforms, devices) are growing, but these revenues generally have lower margins than advertising. Device sales, in particular, can negatively impact overall margins due to pr

"## Summary of Key Points from Alphabet's 2023 Annual Report\n\nThis summary highlights the key findings and insights from the provided text, focusing on financial performance, strategic direction, and market trends.\n\n**Financial Performance:**\n\n*   **Revenue Growth:** Google subscriptions, platforms, and devices revenue increased significantly from 2022 to 2023 ($5.6 billion) and from 2021 to 2022 ($1.0 billion), primarily driven by growth in YouTube subscriptions.\n*   **International Revenue:** Revenues from international markets are increasing, driven by growing internet adoption, particularly in emerging markets like India. However, these revenues are susceptible to fluctuations in foreign currency exchange rates.\n*   **Non-Advertising Revenue Growth:** Revenues beyond advertising (cloud, subscriptions, platforms, devices) are growing, but these revenues generally have lower margins than advertising. Device sales, in particular, can negatively impact overall margins due to pricing and cost pressures.\n*   **Operating Income:** Total income from operations increased from $74.842 billion in 2022 to $84.293 billion in 2023, driven by growth in Google Services and Google Cloud.\n*   **General & Administrative Expenses:** G&A expenses increased from $15.724 billion in 2022 to $16.425 billion in 2023, largely due to increased compensation expenses related to workforce reductions.\n*   **Google Cloud:** Google Cloud experienced a shift from a loss in 2022 to an operating income of $1.716 billion in 2023, indicating improved performance.\n\n**Strategic Insights:**\n\n*   **User-Centric Development:** Alphabet prioritizes user experience when developing new products and services, followed by monetization strategies.\n*   **Cloud Focus:** Google Cloud, established in 2008, has become a major enterprise player.\n*   **Emerging Markets:**  Significant investment is being made in developing localized products and advertising programs to capitalize on the growth in online users in emerging markets.\n*   **Investment in Operations:**  Continued heavy investment in operating and capital expenditures is planned to support business expansion.\n\n**Key Risks & Considerations:**\n\n*   **Foreign Exchange Risk:** Fluctuations in foreign currency exchange rates pose a risk to international revenues.\n*   **Margin Pressure:** Non-advertising revenues have lower margins, potentially impacting overall profitability.\n*   **Workforce Optimization:**  Significant workforce reductions and office space optimization efforts are underway, resulting in associated costs.\n\n\n\n**Information Provided to Generate this Summary:**\n\n1.  **Changes in cost-per-click and cost-per-impression...changes in foreign currency exchange rates.** - Explains factors influencing advertising costs.\n2.  **Google subscriptions, platforms, and devices revenues increased $5.6 billion from 2022 to 2023...paid subscribers.** - Details revenue growth in this segment, driven by YouTube subscriptions.\n3.  **Google was built in the cloud...top enterprise companies in the world.** - Highlights the company's history and current position in cloud computing.\n4.  **When developing new products and services...2023 Annual Report.** -  Outlines the company's product development philosophy.\n5.  **As users in developing economies increasingly come online...emerging markets.** - Discusses the importance of international markets and the associated investment.\n6.  **International revenues represent a significant portion of our revenues...U.S. dollar.** -  Addresses the impact of foreign exchange rates on international revenue.\n7.  **Revenues that we derive beyond advertising...advertising revenues.** - Explains the growth of non-advertising revenues and their margin characteristics.\n8.  **(Alphabet) became the successor issuer to Google.** - Provides a corporate structure detail.\n9.  **General and Administrative...2023.** - Presents G&A expenses and their percentage of revenues for 2022 and 2023, along with a breakdown of the increase in G&A expenses.\n10. **Segment Profitability...2022 2023.** - Shows operating income (loss) for Google Services, Google Cloud, Other Bets, and Alphabet-level activities for 2022 and 2023.\n11. **In addition to the costs included in Alphabet-level activities...2023.** - Provides additional context on charges related to workforce reductions and office space optimization.\n\n\n\n"

So i evaluated the final_response given by LLM.

On a rating of 1-10 these are the ratings for each question it gave->

* How did Google's revenue change from 2022 to 2023?
    * Assessment: 7/10 (Provides specific segment growth but lacks a holistic view of total revenue change).

* Why? 
    * Assessment: Rating: 6/10 (Identifies key drivers like YouTube and emerging markets but lacks depth on other factors).

* What was Google's total revenue in 2022?
    * Assessment: Rating: 1/10 (No direct answer provided).

* What was Google's total revenue in 2023?
    * Assessment: Rating: 1/10 (No direct answer provided).

* What were the primary drivers of Google's revenue growth or decline between 2022 and 2023?
    * Assessment: 7/10 (Covers major drivers but lacks comprehensive detail on all segments).

* How did revenue from Google's advertising business change between 2022 and 2023?
    * Assessment: Rating: 2/10 (No direct data on advertising revenue changes, only indirect references).

* What was the impact of economic conditions on Google's revenue during 2022 and 2023?
    * Assessment: Rating: 3/10 (Only indirectly mentions foreign exchange risks, no broader economic context).


#### We can see that there is still scope of improvement.