<a href="https://colab.research.google.com/github/kanalive/notebooks/blob/main/chatwithmyfile.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Your API Key
Mount your Google Drive before running the code below.

In [None]:
import json
# Load environment object from JSON
file_path = '/content/drive/MyDrive/keys/keys.json'  # Replace with the actual file path
with open(file_path, 'r') as file:
    loaded_object = json.load(file)

OPEN_AI_API_KEY = loaded_object['OPEN_AI_API_KEY']
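
The cell above fails with a bare `FileNotFoundError` or `KeyError` if the drive is not mounted or the JSON file lacks the key. A minimal sketch of a more defensive loader follows; the helper name `load_api_key` and the environment-variable fallback are assumptions for illustration, not part of the original notebook.

```python
import json
import os
from pathlib import Path


def load_api_key(file_path, key_name="OPEN_AI_API_KEY"):
    """Return the API key from a JSON file, or from the environment as a fallback."""
    path = Path(file_path)
    if path.is_file():
        with path.open("r") as file:
            loaded_object = json.load(file)
        if key_name in loaded_object:
            return loaded_object[key_name]
    # Hypothetical fallback: an environment variable with the same name.
    value = os.environ.get(key_name)
    if value is None:
        raise KeyError(f"{key_name} not found in {file_path} or the environment")
    return value
```

It could be called as `OPEN_AI_API_KEY = load_api_key('/content/drive/MyDrive/keys/keys.json')`.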

# Environment Setup

*   Pip install all packages
*   Import the required libraries
*   Set up the OpenAI API key
*   Initialise the LLM



In [None]:
!pip install openai -q
!pip install langchain -q
!pip install chromadb -q
!pip install tiktoken -q
!pip install pypdf -q
!pip install unstructured[local-inference] -q


High-level architecture view:
https://miro.medium.com/v2/resize:fit:4800/format:webp/1*vQUhrf8uCyFQfBbIgF46Zw.png

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain

import os
os.environ["OPENAI_API_KEY"] = OPEN_AI_API_KEY

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

In [None]:
import PIL
print(PIL.__version__)

9.5.0


In [None]:
!pip uninstall Pillow
!pip install --upgrade Pillow
print(PIL.__version__)

Found existing installation: Pillow 9.5.0
Uninstalling Pillow-9.5.0:
  Would remove:
    /usr/local/lib/python3.10/dist-packages/PIL/*
    /usr/local/lib/python3.10/dist-packages/Pillow-9.5.0.dist-info/*
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libXau-154567c4.so.6.0.0
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libbrotlicommon-92722cb2.so.1
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libbrotlidec-db4b3db6.so.1.0.9
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libfreetype-cb9caf6f.so.6.19.0
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libharfbuzz-3543f599.so.0.60710.0
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libjpeg-f2134fdd.so.62.3.0
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/liblcms2-12745711.so.2.0.15
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/liblzma-95592ee6.so.5.4.2
    /usr/local/lib/python3.10/dist-packages/Pillow.libs/libopenjp2-78c47f58.so.2.5.0
    /usr/local/lib/python3.10/d

8.4.0
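
`PIL.__version__` reports the copy of the module already imported into the running kernel, so it may not reflect a reinstall until the runtime restarts. As a sketch, the installed version can instead be read from the package metadata with the standard library; the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the version pip has installed for a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("Pillow"))
```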


# Load data from your location
In this example, the files are located on Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
from langchain.document_loaders import DirectoryLoader

pdf_loader = DirectoryLoader('/content/drive/MyDrive/Colab Notebooks/test/', glob="**/*.pdf")


loaders = [pdf_loader]
documents = []
for loader in loaders:
  documents.extend(loader.load())
print (f"Total number of documents: {len(documents)}")

Total number of documents: 1
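
Before loading, it can help to confirm which files the glob pattern matches. The sketch below uses only `pathlib` and assumes the loader interprets `**/*.pdf` the same way `Path.glob` does; `list_pdfs` is a hypothetical helper, not a LangChain function.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF paths under folder, searching subfolders recursively."""
    return sorted(Path(folder).glob("**/*.pdf"))


# Prints nothing if the folder is missing or holds no PDFs.
for path in list_pdfs('/content/drive/MyDrive/Colab Notebooks/test/'):
    print(path)
```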


# File data processing
Split the text into chunks of a defined size, embed each chunk, and store the embeddings in a vector store.

In [24]:
# Split the loaded documents into chunks of roughly 1000 characters, with no overlap.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(documents)

# Embed each chunk and index it in a Chroma vector store.
embeddings = OpenAIEmbeddings()
vectorstores = Chroma.from_documents(documents, embeddings)
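
To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified pure-Python sketch that cuts text into fixed-width character windows. It is not LangChain's implementation: to my understanding `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `split_text` is an assumption.

```python
def split_text(text, chunk_size=1000, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks, so a sentence that straddles a boundary is split across two chunks. A small overlap trades some duplicated storage for a better chance that such a sentence appears whole in at least one chunk.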




# Construct the LLM chain
Build a conversational retrieval chain from the LLM and the vector store created above.

In [26]:
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), vectorstores.as_retriever())
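
The retriever's job is to return the stored chunks most similar to the question, which the chain then passes to the LLM as context. The toy sketch below shows that idea with bag-of-words vectors and cosine similarity. It is a stand-in only: the notebook uses OpenAI embeddings and Chroma, and `embed`, `cosine`, `retrieve` and the parameter `k` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (NOT an OpenAI embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]
```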

# Start chatting!

In [13]:
chat_history = []


In [27]:
query = "what is this document about"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

"The document is a report on data science methods by Barclays, discussing topic modeling as an unsupervised machine learning technique to attribute a topic to a text. It also includes information on the author's role in the Fixed Income, Currencies and Commodities Research department and important disclosures related to equity and fixed income research. Additionally, the report includes brief summaries of economic prospects, manufacturing, money measures, and trade."

In [28]:
query = "Could you summarise what is topic modelling"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'Topic modeling is an unsupervised machine learning technique used to attribute a topic to a text. It allows for detailed assessment of how the focus of texts change over time. The topic of a text can be a full document, but one might also decide to split the full document into sentences or paragraphs, and apply a topic model on those levels. The topic taxonomy and the text granularity are part of the design choices the modeler needs to make upfront, as are which topic model to use and how many topics to model. All these choices have large implications for model complexity and interpretability.'

In [30]:
query = "Could you summarise the end to end processes of the topic modelling described in this document"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'The document describes the end-to-end process of topic modeling, starting with data preparation, which involves selecting important keywords and formatting and normalizing the remaining tokens. Then, the high-dimensional TFIDF matrix is transformed into a lower-dimensional representation using Non-negative matrix factorization (NMF) or Latent semantic indexing (LSI) algorithms. Other algorithms like BERTopic and Top2Vec are also available. The best model choice is a simple, easy-to-understand model that provides intuitive and usable results. Finally, the topics are analyzed and interpreted to gain insights from the data.'

In [32]:
query = "Could you articulate the step 2 - high-dimensional TFIDF matrix is transformed into a lower-dimensional representation using Non-negative matrix factorization (NMF) or Latent semantic indexing (LSI) algorithms"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'Yes, step 2 involves transforming the high-dimensional TFIDF matrix into a lower-dimensional representation using either Non-negative matrix factorization (NMF) or Latent semantic indexing (LSI) algorithms. \n\nIn Non-negative matrix factorization (NMF), the algorithm decomposes the TFIDF matrix into a document-topic matrix and a topic-term matrix. This transformation reduces the dimensionality of the matrix, making it easier to analyze and interpret.\n\nIn Latent semantic indexing (LSI), the algorithm uses truncated singular value decomposition to reduce the number of words while preserving the similarity structure among columns. This transformation also reduces the dimensionality of the matrix, making it easier to analyze and interpret.\n\nBoth NMF and LSI are well-documented and well-tested algorithms that have optimized python encapsulations in the gensim module.'
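
In the cells shown above, `chat_history` stays empty, so each question is answered without reference to earlier turns. A minimal sketch of carrying the conversation forward follows, assuming the chain accepts the history as a list of `(question, answer)` tuples; the helper name `ask` is hypothetical.

```python
def ask(qa, query, chat_history):
    """Send a question to the chain and record the turn in chat_history."""
    result = qa({"question": query, "chat_history": chat_history})
    # Store the turn so that follow-up questions can refer back to it.
    chat_history.append((query, result["answer"]))
    return result["answer"]
```

With this helper, `ask(qa, "what is this document about", chat_history)` followed by a second call would give the chain the first exchange as context, so a follow-up such as "Could you articulate step 2" would not need to restate the earlier answer.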

# The RBA meeting test case questions

In [None]:
query = "Could you summarise the main points the April 2021 RBA statement is trying to deliver"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'The Reserve Bank of Australia (RBA) has decided to maintain the current policy settings, including the targets of 10 basis points for the cash rate and the yield on the 3-year Australian Government bond, as well as the parameters of the Term Funding Facility and the government bond purchase program. The global economy is recovering, although the recovery is uneven, and inflation remains low. The Australian economy is recovering faster than expected, with GDP increasing by 3.1% in the December quarter, boosted by a further lift in household consumption as the health situation improved. The recovery is expected to continue, with above-trend growth this year and next. Wage and price pressures are subdued and are expected to remain so for some years. The Board is committed to maintaining highly supportive monetary conditions until its goals of full employment and inflation consistent with the target are achieved. The Board will not increase the cash rate until actual inflation is sustaina

In [None]:
query = "In the April 2021 RBA statement, does the sentiment sounds like RBA is going to lift the interest rate in the next meeting?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'No, the sentiment in the April 2021 RBA statement does not suggest that the RBA is going to lift the interest rate in the next meeting. The statement mentions that the Board decided to maintain the cash rate target at 10 basis points and the interest rate on Exchange Settlement balances at zero per cent. The statement also mentions that the Board is committed to achieving the goals of full employment and inflation consistent with the target, and that the current measures will provide the continuing monetary support that the economy needs as it transitions from the recovery phase to the expansion phase.'

In [None]:
query = "What are the key points delivered in the 2 May 2023 RBA statement?"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

"The key points delivered in the 2 May 2023 RBA statement are:\n\n- The Board decided to increase the cash rate target by 25 basis points to 3.85 per cent and the rate paid on Exchange Settlement balances by 25 basis points to 3.75 per cent.\n- Inflation in Australia has passed its peak, but at 7 per cent is still too high and it will be some time yet before it is back in the target range.\n- Goods price inflation is clearly slowing due to a better balance of supply and demand following the resolution of the pandemic disruptions. But services price inflation is still very high and broadly based and the experience overseas points to upside risks.\n- The labour market remains very tight, with the unemployment rate at a near 50-year low. Many firms continue to experience difficulty hiring workers, although there has been some easing in labour shortages and the number of vacancies has declined a little.\n- The Board's priority remains to return inflation to target. High inflation makes lif

In [None]:
query = "In the May 2023 RBA statement, What's likelihood of RBA lift interest rate in the next meeting, 1 to indicate most likely, 0 not likely"
result = qa({"question": query, "chat_history": chat_history})
result["answer"]

'The statement does not provide information on the likelihood of the RBA lifting interest rates in the next meeting. The Board will continue to assess the state of the economy and the outlook, and make decisions based on developments in the global economy, trends in household spending, and the outlook for inflation and the labor market.'

In [None]:
!pip install tabula-py


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tabula-py
  Downloading tabula_py-2.7.0-py3-none-any.whl (12.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.0/12.0 MB[0m [31m62.5 MB/s[0m eta [36m0:00:00[0m
Collecting distro (from tabula-py)
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Installing collected packages: distro, tabula-py
Successfully installed distro-1.8.0 tabula-py-2.7.0


In [None]:
import tabula
file1 = "/content/drive/MyDrive/Colab Notebooks/test/2023-asx-half-year-financial-statements.pdf"
table = tabula.read_pdf(file1,pages=6)
table[0]

| | Unnamed: 0 | 31 Dec | 30 Jun | Variance | Unnamed: 1 |
|---|---|---|---|---|---|
| 0 | | 2022 | 2022 | increase/(decrease) | |
| 1 | | $m | $m | $m | % |
| 2 | Assets | | | | |
| 3 | Cash | 5952.9 | 4972.2 | 980.7 | 19.7 |
| 4 | Financial assets1 | 6072.3 | 9484.8 | (3,412.5) | (36.0) |
| 5 | Intangibles (excluding software) | 2325.5 | 2325.5 | — | — |
| 6 | Capitalised software and property, plant and e... | 158.7 | 363.5 | (204.8) | (56.3) |
| 7 | Investments | 90.3 | 97.6 | (7.3) | (7.5) |
| 8 | Right-of-use assets | 52.8 | 58.3 | (5.5) | (9.4) |
| 9 | Other assets | 831.9 | 935.6 | (103.7) | (11.1) |
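
The extracted table mixes plain numbers with accounting-style text such as `(3,412.5)` for negatives and a dash for empty values. A sketch of one way to normalise a single cell to a float is shown below; `parse_amount` is a hypothetical helper, and the set of placeholder strings it treats as empty is an assumption based on the output above.

```python
def parse_amount(value):
    """Convert a statement cell such as '(3,412.5)' or '980.7' to a float.

    Returns None for blanks, NaN and dash placeholders.
    """
    if value is None:
        return None
    text = str(value).strip().replace(",", "")
    if text in {"", "—", "-", "nan"}:
        return None
    # Parentheses mark a negative amount in accounting notation.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    number = float(text)
    return -number if negative else number
```

Header rows such as rows 0 and 1 above contain labels like `$m`, which this function does not handle, so they would need to be dropped before applying it to a column.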
