<a href="https://colab.research.google.com/github/midhun-james/gliner-desc/blob/main/rag_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q -U google-generativeai langchain-google-genai langchain-community chromadb sentence-transformers

In [2]:
!pip install -q pymupdf

In [3]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''
os.environ['NO_GCE_CHECK'] = 'True'

Set up the API key configuration and the Gemini chat model.

In [29]:
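import os

from google.colab import userdata
from langchain_google_genai import ChatGoogleGenerativeAI

# Read the key from Colab secrets and expose it to the Gemini client
os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

# temperature=0 keeps the answers as repeatable as possible
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_retries=2
)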


In [46]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain.chains import RetrievalQA
from langchain_google_genai import ChatGoogleGenerativeAI
import pprint
import fitz

The text of the 10-K report was copied into a txt file.

The `fitz` module from PyMuPDF is then used to open that file and extract its text page by page.

In [47]:

pdf_path = 'assign_3_data.txt'
doc = fitz.open(pdf_path)

# Concatenate the text of every page into one string
text = ''
for page in doc:
    text += page.get_text()
print(text[:500])

 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
FORM 10-K
 
 
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE
SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Fiscal Year Ended June 30, 2022
 
 
 
OR
 
 
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF
THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845
 
MICROSOFT CORPORATION
 
 
Washington
 
91-1144442
(STATE OF INCORPORATION)
 
(I


For text splitting, the chunk size is set to 1500 characters and the chunk overlap to 200 characters.

Because the file is very large, a larger chunk size is needed. The overlap is set so that continuity is not lost at chunk boundaries.
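
To make the effect of the overlap concrete, here is a minimal sketch of a fixed-size sliding-window chunker. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which also tries to break on the listed separators. The function name `sliding_chunks` and the toy input are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one neighbouring chunk when it is shorter than the overlap.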

In [14]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    separators=["\n\n", "\n", ".", " ", ""]
)
chunks = text_splitter.split_text(text)



In [15]:
print(f'Total chunks: {len(chunks)}')
print(f'first chunk : {chunks[0]}')

Total chunks: 286
first chunk : UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
FORM 10-K
 
 
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE
SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Fiscal Year Ended June 30, 2022
 
 
 
OR
 
 
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF
THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845
 
MICROSOFT CORPORATION
 
 
Washington
 
91-1144442
(STATE OF INCORPORATION)
 
(I.R.S. ID)
 
ONE MICROSOFT WAY, REDMOND, Washington 98052-6399
(425) 882-8080
www.microsoft.com/investor
 
 
 
 
 
Securities registered pursuant to Section 12(b) of
the Act:
 
 
 
 
 
 
 
 
 
Title of each class
 
Trading Symbol
 
Name of exchange on which registered
 
 
 
 
 
Common stock, $0.00000625 par value per share
 
MSFT
 
Nasdaq
3.125% Notes due 2028
 
MSFT
 
Nasdaq
2.625% Notes due 2033
 
MSFT
 
Nasdaq
 
 
 
 
 
Securities registered pursuant to Section 12(g

For RAG, the text chunks need to be wrapped as documents.

Each document carries a metadata field alongside its text content.
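
As a rough picture of what a document with metadata looks like, the sketch below uses a small dataclass. `SimpleDocument` is a hypothetical stand-in that only mirrors the `page_content` and `metadata` attributes of a LangChain document, and the metadata keys `source` and `chunk_index` are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_documents(chunks: list[str], source: str) -> list[SimpleDocument]:
    """Wrap each chunk and record where it came from and its position."""
    return [
        SimpleDocument(page_content=chunk, metadata={"source": source, "chunk_index": i})
        for i, chunk in enumerate(chunks)
    ]


demo_docs = to_documents(["first chunk", "second chunk"], source="assign_3_data.txt")
print(demo_docs[1].metadata)
# {'source': 'assign_3_data.txt', 'chunk_index': 1}
```

In the notebook itself no metadata is passed, so each document's metadata stays empty; `create_documents` accepts an optional `metadatas` list if per-text metadata is wanted.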

In [16]:
docs = text_splitter.create_documents([text])
print(f"Total documents: {len(docs)}")
print(docs[0].page_content[:300])

Total documents: 286
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
FORM 10-K
 
 
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE
SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Fiscal Year Ended June 30, 2022
 
 
 
OR
 
 
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF
THE SECURIT


In [17]:
embedding_function = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")





In [18]:
# Create a Chroma vector store and persist it locally
db = Chroma.from_documents(documents=docs, embedding=embedding_function, persist_directory="chroma_store")

# Save to disk
db.persist()



Set {"k": 10}, which means the retriever returns the 10 most similar chunks (documents) for each query.
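
The idea behind `k` can be sketched with a toy top-k search in plain Python. Cosine similarity is used here only for illustration; the metric the vector store actually applies depends on its configuration. The helper names and the two-dimensional toy vectors are made up, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
top_two = top_k([1.0, 0.0], toy_vectors, k=2)
print(top_two)
# [0, 2]
```

A larger `k` gives the model more context to draw on, at the cost of a longer prompt and more loosely related chunks.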

In [50]:
persist_directory = "chroma_store"

embedding_model = SentenceTransformerEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vector_store = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding_model
)

retriever = vector_store.as_retriever(search_kwargs={"k": 10})

The QA chain is defined with the chain type set to 'stuff', a simple mode that stuffs all the retrieved chunks into a single prompt.
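
Conceptually, 'stuff' amounts to joining the retrieved chunks and asking the model once. The sketch below shows that idea only; the prompt wording and the function name `build_stuff_prompt` are assumptions, not the template LangChain uses, and the two chunks are toy strings.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one context block, then ask once."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


demo_prompt = build_stuff_prompt(
    "What was total revenue?",
    ["Total revenue was $198,270 million.", "Fiscal year ended June 30, 2022."],
)
print(demo_prompt)
```

With 10 retrieved chunks of up to 1500 characters each, the context part of such a prompt can reach roughly 15,000 characters, so this mode only works while everything fits in the model's context window.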

In [51]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

In [52]:
query = "What were the company's total revenues for the fiscal year that ended on June 30, 2022?"
result = qa_chain.invoke(query)


In [53]:
result['result']

"The company's total revenues for the fiscal year that ended on June 30, 2022, were $198,270 million."

Reviewing the source document shows the value **$198,270** (in millions) in the Total row of the segment revenue table, which confirms that the answer is correct.
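
This manual check can be partly automated. The sketch below tests whether a value appears in any source text once whitespace is ignored, which matters here because the extracted table puts `$` and the number on separate lines. The function name `value_in_sources` is hypothetical, and the excerpt is a shortened string shaped like the retrieved chunk.

```python
import re


def value_in_sources(value: str, source_texts: list[str]) -> bool:
    """Return True if value appears in any source once whitespace is ignored."""
    needle = re.sub(r"\s+", "", value)
    return any(needle in re.sub(r"\s+", "", text) for text in source_texts)


# "$" and the number are separated by newlines, as in the extracted table
excerpt = "Total\n \n$\n198,270\n \n \n$\n168,088\n"
found = value_in_sources("$198,270", [excerpt])
missing = value_in_sources("$199,000", [excerpt])
print(found, missing)
# True False
```

In the notebook this could be called with `[d.page_content for d in result['source_documents']]`. It is a plain substring match, so it can report false positives (for example, a shorter number contained in a longer one) and does not replace reading the source.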

In [54]:
result['source_documents'][0].page_content

'services, and customer service and support. Each\nallocation is measured differently based on the\nspecific facts and circumstances of the costs being\nallocated.\nSegment revenue and operating income were as follows\nduring the periods presented:\n \n(In millions)\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nYear Ended June 30,\n \n \n2022\n \n \n \n2021\n \n \n \n2020\n \n \n \n \n \n \n \n \n \n \n \n \n \n \nRevenue\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nProductivity and Business Processes\n \n$\n63,364\n \n \n$\n53,915\n \n \n$\n46,398\n \nIntelligent Cloud\n \n \n75,251\n \n \n \n60,080\n \n \n \n48,366\n \nMore Personal Computing\n \n \n59,655\n \n \n \n54,093\n \n \n \n48,251\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nTotal\n \n$\n198,270\n \n \n$\n168,088\n \n \n$\n143,015\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nOperating Income\n \n \n \n \n \n \n \n \n \n \n \n \

**Specific fact evaluation**

Checking the answers against the retrieved source documents suggests that the chain handles specific-fact questions well.

In [55]:
specific_questions = [
    "list all the directors ",
    "How much did Microsoft spend on research and development during the fiscal year 2022?",
    "Total Number shares purchased from April 1st 2022 to June 30 2022?",
]

# Ask each question and show the answer with its top-ranked source chunk
for query in specific_questions:
    print(f'Question :  {query}\n\n\n')
    result = qa_chain.invoke(query)
    print(f'Answer : {result["result"]}\n\n')
    pprint.pprint(f'Source Document : {result["source_documents"][0].page_content}\n\n')

Question :  list all the directors 



Answer : Here are all the directors listed in the provided text:

*   Reid Hoffman
*   Hugh F. Johnston
*   Teri L. List
*   Sandra E. Peterson
*   Penny S. Pritzker
*   Carlos A. Rodriguez
*   Charles W. Scharf
*   John W. Stanton
*   John W. Thompson (Lead Independent Director)
*   Emma N. Walmsley
*   Padmasree Warrior


('Source Document : Reid Hoffman\n'
 ' \n'
 'Director\n'
 ' \n'
 ' \n'
 '/s/ HUGH F. JOHNSTON        \n'
 ' \n'
 'Hugh F. Johnston\n'
 ' \n'
 'Director\n'
 ' \n'
 ' \n'
 '/s/ TERI L. LIST\n'
 ' \n'
 'Teri L. List\n'
 ' \n'
 'Director\n'
 ' \n'
 ' \n'
 '/s/ SANDRA E. PETERSON\n'
 ' \n'
 'Sandra E. Peterson\n'
 ' \n'
 'Director\n'
 ' \n'
 ' \n'
 ' \n'
 '/s/ PENNY S. PRITZKER\n'
 ' \n'
 'Penny S. Pritzker\n'
 ' \n'
 'Director\n'
 ' \n'
 ' \n'
 '/s/ CARLOS A. RODRIGUEZ\n'
 ' \n'
 'Director\n'
 'Carlos A. Rodriguez\n'
 ' \n'
 ' \n'
 ' \n'
 ' \n'
 '/s/ CHARLES W. SCHARF        \n'
 ' \n'
 'Charles W. Scharf\n'
 ' \n'
 'Director\n'
 '

**Summarization questions**


In [56]:
summarization_questions = [
    "Summarize Microsoft’s approach to sustainability and carbon reduction initiatives mentioned in the report.",
    "Provide a brief summary of Microsoft’s strategy for cloud services and intelligent edge.",
]

# Ask each question and show the answer with its top-ranked source chunk
for query in summarization_questions:
    print(f'Question :  {query}\n\n\n')
    result = qa_chain.invoke(query)
    print(f'Answer : {result["result"]}\n\n')
    pprint.pprint(f'Source Document : {result["source_documents"][0].page_content}\n\n')


Question :  Summarize Microsoft’s approach to sustainability and carbon reduction initiatives mentioned in the report.



Answer : Microsoft's approach to sustainability and carbon reduction is comprehensive, aiming for a more sustainable future by reducing its environmental footprint, advancing research, helping customers build sustainable solutions, and advocating for beneficial environmental policies.

Key initiatives and commitments include:

*   **Carbon Negative by 2030:** A bold commitment announced in January 2020, with a detailed plan.
*   **Historical Carbon Removal by 2050:** To remove all the carbon emitted since its founding in 1975.
*   **Investment in Climate Solutions:** Pledged $1 billion over four years (starting 2020) in new technologies and innovative climate solutions.
*   **Broader Sustainability Goals by 2030:**
    *   Water positive.
    *   Zero waste.
    *   Protect ecosystems by developing a Planetary Computer.
*   **Supporting Others:** Helping suppliers a

**Keyword-dependent questions**

In [57]:
keyword_questions = [
    "What does the report mention about 'Windows OEM' revenue trends?",
    "What does the report state about Microsoft’s 'gaming' or 'Xbox' business performance in fiscal year 2022?",
]

# Ask each question and show the answer with its top-ranked source chunk
for query in keyword_questions:
    print(f'Question :  {query}\n\n\n')
    result = qa_chain.invoke(query)
    print(f'Answer : {result["result"]}\n\n')
    pprint.pprint(f'Source Document : {result["source_documents"][0].page_content}\n\n')

Question :  What does the report mention about 'Windows OEM' revenue trends?



Answer : The report states that Windows OEM revenue increased by 11%. This growth was driven by continued strength in the commercial PC market, which has a higher revenue per license.


('Source Document : business within this segment. These metrics provide\n'
 'strategic product insights which allow us to assess\n'
 'the performance across our commercial and consumer\n'
 'businesses. As we have diversity of target audiences\n'
 'and sales motions within the Windows business, we\n'
 'monitor metrics that are reflective of those varying\n'
 'motions.\n'
 ' \n'
 'Windows OEM revenue growth\n'
 ' \n'
 'Revenue from sales of Windows Pro and non-Pro\n'
 'licenses sold through the OEM channel\n'
 ' \n'
 ' \n'
 ' \n'
 'Windows Commercial products and cloud services\n'
 'revenue growth\n'
 ' \n'
 'Revenue from Windows Commercial products and cloud\n'
 'services, comprising volume licensing of the Windows\n'
 'opera