<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/end_to_end/01_open_AI_in_finance_domain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the libraries

In [50]:
!pip install -q langchain langchain_community langchain-openai unstructured libmagic python-magic faiss-cpu sentence-transformers langchain-huggingface

In [71]:
import os
import configparser
from glob import glob
from dotenv import load_dotenv

import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import pickle

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

## Loading the Env Variables


In [5]:
if os.path.exists(".env"):
    os.remove(".env")

from google.colab import files
uploaded = files.upload()
if uploaded:
    if load_dotenv(".env"):
        print("Uploaded and loaded successfully")

Saving .env to .env
Uploaded and loaded successfully


Alternatively, load the API key from a config file with `configparser`:

In [6]:
# if os.path.exists("config.cfg"):
#     os.remove("config.cfg")

# from google.colab import files
# uploaded = files.upload()
# if uploaded:
#     try:
#         ## Read the config file
#         config = configparser.ConfigParser()
#         config.read('config.cfg')

#         ## get the API information
#         api_key = config['OPENAI']['API_KEY']

#         ## set it as environment variable
#         os.environ['OPENAI_API_KEY'] = api_key

#     except Exception as e:
#         print("Error",e)

## Load the Document Source

In [7]:
from langchain_community.document_loaders import TextLoader, CSVLoader, UnstructuredURLLoader

url_loader = UnstructuredURLLoader(
    urls = [
        "https://bbc.com/sport/football/articles/cvgm79448kyo",
        "https://www.standard.co.uk/sport/football/transfer-news-live-arsenal-fc-chelsea-man-utd-mbeumo-zubimendi-dorgu-b1202083.html"
    ]
)

data = url_loader.load()
len(data)

2

In [8]:
for doc in data:
    print(doc.metadata)
    print(f"Number of characters in this article: {len(doc.page_content)}")

{'source': 'https://bbc.com/sport/football/articles/cvgm79448kyo'}
Number of characters in this article: 5421
{'source': 'https://www.standard.co.uk/sport/football/transfer-news-live-arsenal-fc-chelsea-man-utd-mbeumo-zubimendi-dorgu-b1202083.html'}
Number of characters in this article: 6777


## Text Splitting

**Why chunk the text?**

Chunking breaks a large text into smaller, meaningful pieces so that it can be embedded and retrieved effectively.

Benefits:
1. Token Limits: Each piece fits within the token limits of the embedding model and the LLM.
2. Retrieval Precision: Smaller chunks let the retriever return only the passages relevant to a query.
3. Semantic Understanding: Overlapping chunks preserve context that would otherwise be cut at a chunk boundary.
4. Efficiency: Passing only the relevant chunks to the LLM reduces latency and computational cost.
5. Preventing Hallucinations: Focused, relevant context gives the model less room to make things up, as long as each chunk still carries enough context.
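
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-width sliding-window splitter. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on paragraph, line, and word boundaries). The function name `sliding_window_chunks` and the sample string are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last 4 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.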

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

chunks = text_splitter.split_documents(data)
print(f"Data source split into {len(chunks)} chunks")

Data source split into 90 chunks


## Embedding

Next, convert the input chunks into numerical vectors using an encoder-based model.

We will store the vectors in an in-memory vector database.

In [16]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-mpnet-base-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**Note**: When you pass a text longer than 384 tokens to the `all-mpnet-base-v2` encoder through the SentenceTransformer library, the model truncates the input to 384 tokens by default. This keeps the input within the model's maximum sequence length, but any text past the limit is not reflected in the embedding.

Read more about this [model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
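
As a rough illustration of the note above, the sketch below flags texts that would exceed a token limit before they are embedded. The helper names `find_overlong` and `whitespace_count` are hypothetical, and counting whitespace-separated words is only a crude proxy: a subword tokenizer usually produces more tokens, so with the real model you would pass a counter built on the model's own tokenizer instead.

```python
from typing import Callable


def find_overlong(
    texts: list[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 384,
) -> list[int]:
    """Return the positions of texts whose token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_tokens]


def whitespace_count(text: str) -> int:
    # Assumption: a rough proxy only. Subword tokenizers usually yield more tokens.
    return len(text.split())


texts = ["short text", "word " * 500]
overlong = find_overlong(texts, whitespace_count)
print(overlong)
```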

In [23]:
sentences = [
    "The latest fashion trends emphasize sustainable materials and eco-friendly designs, making it easier than ever to look stylish while caring for the planet.",
    "Exploring the ancient ruins of Machu Picchu offers a breathtaking glimpse into the rich history and culture of the Inca civilization.",
    "The excitement in the stadium was palpable as the underdog team scored a last-minute goal to clinch the championship title."
]

vectors = encoder.encode(sentences)
vectors.shape

(3, 768)

In [24]:
import faiss

dim = vectors.shape[1]
index = faiss.IndexFlatL2(dim)

## Add all the vectors to the in-memory vector database
index.add(vectors)

In [28]:
search_query = "I want to play football"

## convert the query to vector
search_vector = encoder.encode(search_query)
print(search_vector.shape)

## Reshape it to a 2D array of shape (1, dim), since FAISS searches a batch of queries
search_vector = search_vector.reshape(1,-1)
print(search_vector.shape)

(768,)
(1, 768)


Search the vector database for the vectors nearest to this query vector:

In [46]:
distances, I = index.search(search_vector, k=2)

for d,i in zip(distances[0], I[0]):
    sentence = sentences[i]
    print(f"sentence:\n{sentence}")
    print(f'distance: {d:.2f}')
    print("-"*30)

sentence:
The excitement in the stadium was palpable as the underdog team scored a last-minute goal to clinch the championship title.
distance: 1.81
------------------------------
sentence:
Exploring the ancient ruins of Machu Picchu offers a breathtaking glimpse into the rich history and culture of the Inca civilization.
distance: 2.02
------------------------------
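
The distances printed above come from `IndexFlatL2`, which compares the query against every stored vector and reports squared Euclidean distances, so smaller means more similar. The sketch below reproduces that brute-force idea in plain Python on toy 2-D vectors; the function names and the toy data are hypothetical, and FAISS itself is far more optimized.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored: list[list[float]], query: list[float], k: int):
    """Exhaustive search: score every stored vector, keep the k closest."""
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy unit-length vectors standing in for sentence embeddings.
stored_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query_vector = [0.6, 0.8]

toy_distances, toy_ids = flat_l2_search(stored_vectors, query_vector, k=2)
for d, i in zip(toy_distances, toy_ids):
    print(f"id={i} distance={d:.2f}")
```

If the embeddings are unit-length, the squared L2 distance equals `2 - 2 * cosine_similarity`, so values near 2 indicate almost unrelated texts and values near 0 indicate near-duplicates.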


## Create Vector Database From Documents



In [52]:
embeddings = HuggingFaceEmbeddings(model_name='all-mpnet-base-v2')

db = FAISS.from_documents(
    documents = chunks,
    embedding = embeddings
)

Now export this vector store as a pickle file

In [66]:
with open("vector_database.pkl", "wb") as pkl_file:
    pickle.dump(db, pkl_file)
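
Pickling the whole store ties the file to the library versions that wrote it. As a hedged alternative, LangChain's FAISS wrapper provides `save_local` and `load_local`. The sketch below assumes a recent `langchain_community` release, where `load_local` needs `allow_dangerous_deserialization=True`; the helper names and the `faiss_index` folder are hypothetical.

```python
def save_vector_store(db, folder: str = "faiss_index") -> None:
    """Persist a LangChain FAISS store using its own serializer."""
    db.save_local(folder)


def load_vector_store(embeddings, folder: str = "faiss_index"):
    """Reload a store saved with save_vector_store, using the same embeddings."""
    # Imported lazily so this sketch can be defined without the package installed.
    from langchain_community.vectorstores import FAISS

    # The docstore is still pickled internally, so only load files you created yourself.
    return FAISS.load_local(
        folder, embeddings, allow_dangerous_deserialization=True
    )
```

Usage would look like `save_vector_store(db)` after building the store and `vector_index = load_vector_store(embeddings)` before creating the retriever.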

## Retrieval

In [67]:
from langchain_openai import ChatOpenAI
from langchain.chains.qa_with_sources.retrieval import RetrievalQAWithSourcesChain

In [58]:
llm = ChatOpenAI(model='gpt-4o-mini')

***Note***: `gpt-4o-mini` has a context window of **128,000** tokens, shared by the input and the output, and can generate at most **16,384** output tokens per response. This allows for substantial context and detailed responses.

In [68]:
## Load the retriever

vector_db_filepath = "vector_database.pkl"
if os.path.exists(vector_db_filepath):
    with open(vector_db_filepath, "rb") as pkl_file:
        vector_index = pickle.load(pkl_file)
else:
    raise FileNotFoundError(f"File does not exist: {vector_db_filepath}")

Let's create a RAG pipeline with the prebuilt `RetrievalQAWithSourcesChain`.

In [69]:
chain = RetrievalQAWithSourcesChain.from_llm(
    llm = llm,
    retriever = vector_index.as_retriever()
)

**About [`RetrievalQAWithSourcesChain`](https://python.langchain.com/api_reference/langchain/chains/langchain.chains.qa_with_sources.retrieval.RetrievalQAWithSourcesChain.html)**

The RetrievalQAWithSourcesChain is a prebuilt chain for question-answering tasks that integrates document retrieval and processing. It uses a combination of MapReduceDocumentsChain and ReduceDocumentsChain to extract relevant information and generate a final answer with sources. In this notebook, the chain uses a ChatOpenAI model for language processing and FAISS for efficient document retrieval.

Key Components:
1. **RetrievalQAWithSourcesChain**:
    - Purpose: A chain designed for question-answering tasks with sources.
    - Verbose: Set to False to minimize logging output.

2. **Combine Documents Chain**:
    - Type: MapReduceDocumentsChain.

    <img src = "https://miro.medium.com/v2/resize:fit:1100/format:webp/1*e9t9O9BxQTc9E6bWA1JAuA.png" height=250/>

    - LLM Chain: Uses LLMChain with a prompt to extract relevant text from documents.
    - Prompt: Asks to return any relevant text verbatim based on the context and question.
    - LLM: Utilizes ChatOpenAI with the model gpt-4o-mini.

3. **Reduce Documents Chain**:
    - Type: ReduceDocumentsChain.
    - Combine Documents Chain: Uses StuffDocumentsChain.
    - LLM Chain: Uses LLMChain with a prompt to create a final answer with references.
    - Prompt: Instructs to create a final answer with sources and not to make up answers if unknown.
    - LLM: Utilizes ChatOpenAI with the model gpt-4o-mini.

4. **Document Prompts:**
    - MapReduceDocumentsChain: Uses a prompt to extract relevant text.
    - StuffDocumentsChain: Uses a prompt to combine extracted parts into a final answer.

5. **Retriever**:
    - Type: VectorStoreRetriever.
    - Tags: Includes FAISS and HuggingFaceEmbeddings.
    - Vectorstore: Uses FAISS for efficient similarity search.
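
The map-reduce flow described above can be sketched in a few lines of plain Python. This is a simplified illustration and not LangChain's implementation: `map_reduce_answer`, the prompt wording, and the `fake_llm` stub are all hypothetical, and the sources are simply collected from the input documents rather than parsed from the model's reply.

```python
from typing import Callable


def map_reduce_answer(question: str, docs: list[dict], llm: Callable[[str], str]) -> dict:
    # Map step: one LLM call per retrieved chunk, extracting the relevant text.
    extracts = []
    for doc in docs:
        prompt = (
            f"Question: {question}\n"
            f"Context: {doc['text']}\n"
            "Return any relevant text verbatim."
        )
        extracts.append({"text": llm(prompt), "source": doc["source"]})

    # Reduce step: stuff all extracts into one prompt and ask for the final answer.
    combined = "\n".join(
        f"Content: {e['text']}\nSource: {e['source']}" for e in extracts
    )
    final_prompt = f"Question: {question}\n{combined}\nAnswer with sources."
    sources = sorted({e["source"] for e in extracts})
    return {"answer": llm(final_prompt), "sources": sources}


# Stub standing in for a real chat model: it records prompts and returns a canned reply.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


docs = [
    {"text": "chunk A", "source": "url-1"},
    {"text": "chunk B", "source": "url-2"},
]
result = map_reduce_answer("Who is the target?", docs, fake_llm)
print(len(calls), result)
```

With two retrieved chunks the stub is called three times: twice in the map step and once in the reduce step. This is why a map-reduce chain costs more LLM calls than stuffing all chunks into a single prompt.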

In [72]:
query = "Which players are the transfer targets for Arsenal ?"

langchain.debug = True

chain.invoke(
    {"question": query},
    return_only_outputs = True
)

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
[0m{
  "question": "Which players are the transfer targets for Arsenal ?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQAWithSourcesChain > chain:MapReduceDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQAWithSourcesChain > chain:MapReduceDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "input_list": [
    {
      "context": "Arsenal appear to be one of the busiest clubs in the Premier League right now. Not only have the Gunners reportedly agreed a deal to sign Real Sociedad and Spain midfielder Martin Zubimendi this",
      "question": "Which players are the transfer targets for Arsenal ?"
    },
    {
      "context": "said to be a man in demand and reports suggest he is now Arsenal’s leading transfer target.",
      "question": "Which players are the transfer targets for Arsenal ?"
    },


{'answer': "Arsenal's transfer targets include Spain midfielder Martin Zubimendi from Real Sociedad and Brentford's forward Bryan Mbeumo. Reports suggest that Zubimendi has had an agreement for a deal, and Mbeumo is currently a leading transfer target for Arsenal.\n\n",
 'sources': 'https://www.standard.co.uk/sport/football/transfer-news-live-arsenal-fc-chelsea-man-utd-mbeumo-zubimendi-dorgu-b1202083.html'}