
In [2]:
!pip -q install langchain openai tiktoken PyPDF2 faiss-cpu

In [19]:
import os

# Set your own OpenAI API key here; never commit a real key to a shared notebook.
os.environ["OPENAI_API_KEY"] = ""

Read PDF

In [6]:
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

In [7]:
# read pdf file from the location
document = PdfReader('/content/sample_data/2022 Microsoft Environmental Sustainability Report.pdf')

In [8]:
document

<PyPDF2._reader.PdfReader at 0x7bea88ea54b0>

In [66]:
# read the text of every page into a single string, raw_text
raw_text = ''
for page in document.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text

In [67]:
len(raw_text)

233736

In [12]:
raw_text[:100]

'2022 \nEnvironmental Sustainability Report\nEnabling sustainability for our company, \nour customers, a'

Split the Text

In [13]:
# split the data into chunks for indexing
text_splitter = CharacterTextSplitter(
    separator = '\n',
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len,
)

In [14]:
text = text_splitter.split_text(raw_text)

In [15]:
len(text)

307

In [17]:
text[0]

'2022 \nEnvironmental Sustainability Report\nEnabling sustainability for our company, \nour customers, and the worldColor Palette\nNames are TBCContents\nOverview  \nForeword  4\n2022 progress  7\nHow we work  8\nAbout this report  9Microsoft  \nsustainability\nCarbon  \nOur approach  11\nReducing Scope 1 and 2 emissions  15\nReducing Scope 3 emissions  17\nTransitioning to carbon-free energy  20\nRemoving carbon  22\nKey trends and what’s next  24Water  \nOur approach  26\nReducing our water footprint  30\nReplenishing water  33\nImproving access to water  34\nKey trends and what’s next  35\nWaste  \nOur approach  37\nReducing our waste footprint  41\nKey trends and what’s next  44Ecosystems  \nOur approach  46\nTaking responsibility for  \nour land footprint  48\nKey trends and what’s next  50Customer sustainability\n \nCommitments and progress  53\nMicrosoft Cloud for Sustainability  54\nGreen software  56\nSustainable devices  59\nPlanetary Computer \nand AI for Good  63Global \nsu
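
The splitter settings above (split on newlines, at most 1000 characters per chunk, up to 200 characters repeated between neighbours) can be pictured with a simplified stand-in. This is a minimal sketch and not LangChain's implementation: `split_with_overlap` is a hypothetical helper, and it only approximates how whole lines are packed into chunks and carried over as overlap.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating up to chunk_overlap trailing characters
    (whole pieces only) at the start of the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        # Joined length, counting the separators between parts.
        return sum(len(p) for p in parts) + len(separator) * max(len(parts) - 1, 0)

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks
```

With ten short lines, `chunk_size=25` and `chunk_overlap=10`, each chunk holds three lines and the last line of one chunk reappears as the first line of the next, which is the same effect the 200-character overlap has on the report text.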

Embedding Creation

In [20]:
# set up the OpenAI embedding client (vectors are computed when the index is built)
embeddings = OpenAIEmbeddings()

In [21]:
docsearch = FAISS.from_texts(text, embeddings)

In [22]:
docsearch.embedding_function

<bound method OpenAIEmbeddings.embed_query of OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='<redacted>', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False)>

In [28]:
query = "with how many people did microsoft provide clean water in 2022 and where?"

In [29]:
docs = docsearch.similarity_search(query)

In [30]:
len(docs)

4

In [31]:
docs[0]

Document(page_content='309,921India\n225,389 IndonesiaWater Table 3 \nDelivering on our water positive commitment by enabling access to water and sanitation services, and through water replenishment projects \nIn FY22, Microsoft provided 552,058 people with water access across Brazil, India, Indonesia, and Mexico. From the program’s inception through December 2022, we have provided nearly one million people with water access  \nacross these regions. \nSince year one, we have contracted 27 replenishment programs in water-stressed basins, which are contracted to deliver more than 35 million m3 of replenishment over their lifetime. \n552,058 \nTotal population \nprovided with wateraccess in FY22\n27 \nTotal number of water replenishment projects   \nLearn more in the Environmental Data Fact Sheet \na. Reported access values represent data reviewed and validated by water.org . \nb. Microsoft’s reported replenishment values represent impact since program’s inception in FY18.29\n |  |  | Red
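
The search above returned 4 chunks because `similarity_search` defaults to `k=4`. Conceptually it embeds the query and returns the chunks whose vectors lie closest to it. The sketch below illustrates that idea with toy 2-D vectors and plain Python; `top_k` is a hypothetical helper, and the use of L2 (Euclidean) distance is an assumption about the default index rather than something this notebook shows.

```python
import math

def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors closest to query_vec,
    nearest first, using L2 (Euclidean) distance."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(range(len(doc_vecs)), key=lambda i: l2(query_vec, doc_vecs[i]))
    return ranked[:k]

# Toy 2-D "embeddings"; real text-embedding-ada-002 vectors have 1536 dimensions.
doc_vecs = [(0.0, 1.0), (1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (-1.0, 0.0)]
nearest = top_k((1.0, 0.0), doc_vecs, k=4)
```

Here `nearest` lists the four closest vectors in order of increasing distance, and the vector pointing the opposite way from the query is left out.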

QA Chain

In [32]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [33]:
chain = load_qa_chain(OpenAI(), chain_type="stuff")

In [34]:
chain.llm_chain.prompt.template

"Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

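The "stuff" chain type places all retrieved chunks into the `{context}` slot of the template above and sends one prompt to the model. A minimal sketch of that assembly, assuming the chunks are joined with a blank line between them (`build_stuff_prompt` and its `separator` argument are hypothetical names, not LangChain API):

```python
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\n"
    "Question: {question}\nHelpful Answer:"
)

def build_stuff_prompt(chunks, question, separator="\n\n"):
    """Join all chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)
```

Because every chunk goes into a single prompt, this approach only works while the combined chunks fit within the model's context window.
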
In [37]:
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' In FY22, Microsoft provided 552,058 people with water access across Brazil, India, Indonesia, and Mexico.'

In [38]:
query = "how much carbon was removed in fy22"
docs = docsearch.similarity_search(query)
chain.run(input_documents=docs, question=query)

' 1,443,981 metric tons of carbon were removed in FY22.'

QA Chain with MapRerank

In [53]:
chain = load_qa_chain(OpenAI(), chain_type="map_rerank", return_intermediate_steps=True)

In [58]:
query = "how much carbon was removed in fy22"
docs = docsearch.similarity_search(query)
res = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
res



{'intermediate_steps': [{'answer': ' 1,443,981 metric tons of carbon removal',
   'score': '100'},
  {'answer': ' 514,156 metric tons of carbon removal', 'score': '100'},
  {'answer': ' 1,443,981 metric tons of carbon removal', 'score': '100'},
  {'answer': ' 28,000 mt of CO2e savings', 'score': '100'}],
 'output_text': ' 1,443,981 metric tons of carbon removal'}
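
With `map_rerank`, each chunk is answered and scored separately, and the answer with the highest score becomes `output_text`. The sketch below reproduces that selection step on the intermediate steps shown above; `pick_best` is a hypothetical helper, and breaking ties in favour of the earliest chunk is an assumption that happens to match this output.

```python
def pick_best(intermediate_steps):
    """Return the answer with the highest integer score.
    max() keeps the first item among ties, so earlier chunks win."""
    return max(intermediate_steps, key=lambda step: int(step["score"]))["answer"]

steps = [
    {"answer": " 1,443,981 metric tons of carbon removal", "score": "100"},
    {"answer": " 514,156 metric tons of carbon removal", "score": "100"},
    {"answer": " 1,443,981 metric tons of carbon removal", "score": "100"},
    {"answer": " 28,000 mt of CO2e savings", "score": "100"},
]
best = pick_best(steps)
```

All four candidates scored 100 even though they disagree, so the final answer here is decided by retrieval order rather than by the score itself.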

In [55]:
query = "with how many people did microsoft provide clean water in 2022 and where?"
docs = docsearch.similarity_search(query, k=10)
res = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
res



{'intermediate_steps': [{'answer': ' Microsoft provided 552,058 people with water access across Brazil, India, Indonesia, and Mexico in 2022.',
   'score': '100'},
  {'answer': ' In FY22, Microsoft’s contribution to water.org has provided more than 550,000 people with access to clean water globally.',
   'score': '100'},
  {'answer': ' Microsoft provided access to clean water and sanitation solutions to 850,000 people in Brazil, India, Indonesia, and Mexico in 2022. ',
   'score': '100'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': " Microsoft's contribution in FY22 provided more than 1.5 million people with access to safe drinking water and sanitation solutions, primarily in communities and schools. ",
   'score': '80'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question', 'score': '0'},
  {'answer': ' Microsoft provided access to clean water and sanitation s

In [59]:
query = "where is the apple stock at now?"
docs = docsearch.similarity_search(query, k=10)
res = chain({"input_documents": docs, "question": query}, return_only_outputs=True)
res



{'intermediate_steps': [{'answer': ' This document does not answer the question.',
   'score': '0'},
  {'answer': ' This document does not answer the question', 'score': '0'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question', 'score': '0'},
  {'answer': ' This document does not answer the question. ', 'score': '0'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question.', 'score': '0'},
  {'answer': ' This document does not answer the question', 'score': '0'},
  {'answer': ' This document does not answer the question', 'score': '0'}],
 'output_text': ' This document does not answer the question.'}

Retrieval QA

In [60]:
from langchain.chains import RetrievalQA

retriever = docsearch.as_retriever(search_type="similarity")

rqa = RetrievalQA.from_chain_type(llm=OpenAI(),
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

In [62]:
rqa("with how many people did microsoft provide clean water in 2022 and where?")['result']

' In 2022, Microsoft provided 550,000 people with access to clean water across Brazil, India, Indonesia, and Mexico.'
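
This answer gives 550,000 people, while the chunk retrieved earlier states 552,058. Since the chain was built with `return_source_documents=True`, the retrieved text is available to check against. The sketch below is one rough way to flag numbers in an answer that do not appear in any source text. `numbers_in` and `unsupported_numbers` are hypothetical helpers, and the `source` string is an excerpt of the chunk shown earlier, assumed (not verified) to be among the documents this chain retrieved.

```python
import re

def numbers_in(text):
    """Return the set of number tokens (e.g. '552,058', '27') found in text."""
    return set(re.findall(r"\d[\d,]*\d|\d", text))

def unsupported_numbers(answer, source_texts):
    """Numbers that appear in the answer but in none of the source texts."""
    supported = set()
    for source in source_texts:
        supported |= numbers_in(source)
    return sorted(numbers_in(answer) - supported)

answer = (" In 2022, Microsoft provided 550,000 people with access to clean "
          "water across Brazil, India, Indonesia, and Mexico.")
source = ("In FY22, Microsoft provided 552,058 people with water access across "
          "Brazil, India, Indonesia, and Mexico. From the program's inception "
          "through December 2022, we have provided nearly one million people "
          "with water access")
flagged = unsupported_numbers(answer, [source])
```

With the real chain, the source texts would come from the `page_content` of each item in the result's `source_documents` list.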