# Handling Excel files with LangChain
Most of this notebook focuses on the [UnstructuredExcelLoader](https://python.langchain.com/v0.2/docs/integrations/document_loaders/microsoft_excel/).

In [2]:
#%pip install --upgrade --quiet langchain langchain-community langchain-openai unstructured openpyxl faiss-cpu

from langchain_community.document_loaders import UnstructuredExcelLoader


In [4]:
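file_path = './pdf.xlsx'
# mode="elements" keeps each parsed element as its own Document and stores an
# HTML rendering of the table in metadata["text_as_html"]
loader = UnstructuredExcelLoader(file_path=file_path, mode="elements")
docs = loader.load()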

print(len(docs))

docs



6


[Document(page_content='\n\n\nLegal name\nShort name\nRisk Based Category\n\n\nAlly Financial Inc.\nAlly\nCategory IV\n\n\nAmerican Express Company\nAmerican Express\nCategory IV\n\n\nBank of America Corporation\nBank of America\nCategory I\n\n\nThe Bank of New York Mellon Corporation\nBank of NY-Mellon\nCategory I\n\n\nBarclays US LLC\nBarclays US\nCategory III\n\n\nBMO Financial Corp.\nBMO\nCategory III\n\n\nCapital One Financial Corporation\nCapital One\nCategory III\n\n\nThe Charles Schwab Corporation\nCharles Schwab Corp\nCategory III\n\n\nCitigroup Inc.\nCitigroup\nCategory I\n\n\nCitizens Financial Group, Inc.\nCitizens\nCategory IV\n\n\nDB USA Corporation\nDB USA\nCategory III\n\n\nDiscover Financial Services\nDiscover\nCategory IV\n\n\nFifth Third Bancorp\nFifth Third\nCategory IV\n\n\nThe Goldman Sachs Group, Inc.\nGoldman Sachs\nCategory I\n\n\nHSBC North America Holdings Inc.\nHSBC\nCategory IV\n\n\nHuntington Bancshares Incorporated\nHuntington\nCategory IV\n\n\nJPMorgan C
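
The raw `docs` output is hard to scan, so a small helper can list each document's sheet name with a short preview. This is only a sketch: `summarize_docs` is a hypothetical helper, and `demo` is a stand-in object (built from the Table016 excerpt that appears in the retrieved context later in this notebook) so the cell runs without the Excel file.

```python
from types import SimpleNamespace


def summarize_docs(docs, preview_chars=60):
    """Return one (page_name, preview) pair per document."""
    summary = []
    for doc in docs:
        page_name = doc.metadata.get("page_name", "<unknown>")
        # Cells are newline-separated in page_content; drop the empty lines
        # and join the rest so the preview fits on one line.
        cells = [cell for cell in doc.page_content.split("\n") if cell.strip()]
        summary.append((page_name, " | ".join(cells)[:preview_chars]))
    return summary


# Stand-in for one loaded Document (assumption: only .page_content and
# .metadata are needed, which LangChain Document objects provide).
demo = SimpleNamespace(
    page_content=(
        "\n\n\nLoan Type\nbillions of dollars\nportfolio loss rate (percent)"
        "\n\n\nLoan losses\n15.7\n0.044\n\n\n"
    ),
    metadata={"page_name": "Table016 (Page 7)"},
)

for name, preview in summarize_docs([demo]):
    print(f"{name}: {preview}")
```

With the real loader output, `summarize_docs(docs)` should give one line per sheet.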

In [5]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")



In [8]:
from langchain_community.vectorstores import FAISS 
from langchain_openai import OpenAIEmbeddings

# from langchain_text_splitters import RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# splits = text_splitter.split_documents(docs)


vectorstore = FAISS.from_documents(documents=docs, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()




In [9]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


In [10]:
system_prompt = (
    "You are an assistant for extracting information from excel sheets. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "Compare and contrast the loan loss numbers for us bank from Table026 and Table016"})

results


{'input': 'Compare and contrast the loan loss numbers for us bank from Table026 and Table016',
 'context': [Document(page_content='\n\n\nLoan Type\nbillions of dollars\nportfolio loss rate (percent)\n\n\nLoan losses\n15.7\n0.044\n\n\nFirst-lien mortgages, domestic\n1\n0.009\n\n\nJunior liens and HELOCs,\n0.3\n0.025\n\n\nCommercial and industrial\n4.6\n0.046\n\n\nCommercial real estate, domestic\n3.4\n0.071\n\n\nCredit cards\n4.7\n0.173\n\n\nOther consumer\n0.9\n0.036\n\n\nOther loans\n0.8\n0.028\n\n\n', metadata={'source': './pdf.xlsx', 'file_directory': '.', 'filename': 'pdf.xlsx', 'last_modified': '2024-07-07T21:27:56', 'page_name': 'Table016 (Page 7)', 'page_number': 6, 'text_as_html': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Loan Type</td>\n      <td>billions of dollars</td>\n      <td>portfolio loss rate (percent)</td>\n    </tr>\n    <tr>\n      <td>Loan losses</td>\n      <td>15.7</td>\n      <td>0.044</td>\n    </tr>\n    <tr>\n      <td>First-lien 

In [11]:
print(results['answer'])

From Table026, we see that U.S. Bancorp had $6.8 billion in loan losses. In Table016, the loan losses for U.S. Bancorp are not directly provided, but we can see that the total loan losses for the 31 banks listed is $7.1 billion. 

Therefore, we can compare that the U.S. Bancorp's loan losses of $6.8 billion from Table026 are slightly less than the total loan losses of $7.1 billion for the 31 banks listed in Table016. 

In terms of contrast, specifically looking at the types of loans and their associated losses, we can see from Table026 that U.S. Bancorp had a higher portfolio loss rate for its loan types compared to some other banks. For example, its loss rate for credit cards was 0.173, which is higher than that of several other banks. 

However, without the specific breakdown of U.S. Bancorp's loan losses by type from Table016, we cannot provide a detailed comparison on the individual loan type losses between U.S. Bancorp and the other banks listed.


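The answer above is flawed in some ways; for instance, it treats $7.1 billion as the total loan losses for all 31 banks, while the Table016 context retrieved above lists loan losses of 15.7 billion.\
Possible improvements include editing the table, fine-tuning the prompt, etc.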
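
One further option, sketched here under assumptions, is to bypass similarity search when the question names its tables and select documents by their `page_name` metadata instead. `tables_named_in` and `select_docs_by_table` are hypothetical helpers, and `demo_docs` are stand-ins: only the `Table016 (Page 7)` label comes from the output above, while the other labels and the placeholder contents are made up for illustration.

```python
import re
from types import SimpleNamespace


def tables_named_in(question):
    """Return sheet labels such as 'Table026' mentioned in the question."""
    return re.findall(r"Table\d+", question)


def select_docs_by_table(docs, question):
    """Keep only documents whose page_name starts with a named table."""
    wanted = tables_named_in(question)
    return [
        doc
        for doc in docs
        if any(doc.metadata.get("page_name", "").startswith(name) for name in wanted)
    ]


# Stand-in documents (assumption: real ones come from loader.load()).
demo_docs = [
    SimpleNamespace(page_content="...", metadata={"page_name": "Table001"}),
    SimpleNamespace(page_content="...", metadata={"page_name": "Table016 (Page 7)"}),
    SimpleNamespace(page_content="...", metadata={"page_name": "Table026"}),
]

question = (
    "Compare and contrast the loan loss numbers for us bank "
    "from Table026 and Table016"
)
selected = select_docs_by_table(demo_docs, question)
print([doc.metadata["page_name"] for doc in selected])
```

The selected documents could then be passed straight to the stuff-documents chain, for example `question_answer_chain.invoke({"input": question, "context": selected})`, so that both named tables are guaranteed to be in the prompt.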

In [12]:
results = rag_chain.invoke({"input": "Compare and contrast the loan loss numbers for us bank with other banks using Table026"})

results


{'input': 'Compare and contrast the loan loss numbers for us bank with other banks using Table026',
 'context': [Document(page_content='\n\n\nLoan Type\nbillions of dollars\nportfolio loss rate (percent)\n\n\nLoan losses\n15.7\n0.044\n\n\nFirst-lien mortgages, domestic\n1\n0.009\n\n\nJunior liens and HELOCs,\n0.3\n0.025\n\n\nCommercial and industrial\n4.6\n0.046\n\n\nCommercial real estate, domestic\n3.4\n0.071\n\n\nCredit cards\n4.7\n0.173\n\n\nOther consumer\n0.9\n0.036\n\n\nOther loans\n0.8\n0.028\n\n\n', metadata={'source': './pdf.xlsx', 'file_directory': '.', 'filename': 'pdf.xlsx', 'last_modified': '2024-07-07T21:27:56', 'page_name': 'Table016 (Page 7)', 'page_number': 6, 'text_as_html': '<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Loan Type</td>\n      <td>billions of dollars</td>\n      <td>portfolio loss rate (percent)</td>\n    </tr>\n    <tr>\n      <td>Loan losses</td>\n      <td>15.7</td>\n      <td>0.044</td>\n    </tr>\n    <tr>\n      <td>First-

In [14]:
print(results['answer'])

Based on the data provided in Table026:

- **US Bancorp** has total loan losses of 6.8 billion dollars.
- The average total loan losses for the 31 banks listed in the table is 7.1 billion dollars.

Therefore, in comparison to the average of the 31 banks, US Bancorp has slightly lower total loan losses.



The answer above is correct, though limited.\
More exploration is needed into the LLM's ability to understand the table.

https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb?ref=blog.langchain.dev
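
When exploring how well the LLM reads a table, it helps to have ground-truth values to check its answers against. The sketch below parses the `page_content` layout seen in the retrieved context above (rows separated by three newlines, cells by one) into records. That layout is inferred from this notebook's output and may not hold for other files; `parse_table_text` and `lookup` are hypothetical helpers, and `sample` is a shortened excerpt of the Table016 text.

```python
def parse_table_text(page_content):
    """Split element text into rows, then map each row onto the header."""
    rows = [
        block.split("\n")
        for block in page_content.split("\n\n\n")
        if block.strip()
    ]
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def lookup(records, key_column, key, value_column):
    """Return the numeric value for the row whose key_column equals key."""
    for record in records:
        if record.get(key_column) == key:
            return float(record[value_column])
    return None


# Shortened excerpt of the Table016 page_content shown in the context above.
sample = (
    "\n\n\nLoan Type\nbillions of dollars\nportfolio loss rate (percent)"
    "\n\n\nLoan losses\n15.7\n0.044"
    "\n\n\nCredit cards\n4.7\n0.173\n\n\n"
)

records = parse_table_text(sample)
print(lookup(records, "Loan Type", "Credit cards", "portfolio loss rate (percent)"))
print(lookup(records, "Loan Type", "Loan losses", "billions of dollars"))
```

Comparing such looked-up values with the figures quoted in `results['answer']` gives a quick way to spot numbers the model attributed to the wrong table.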