<a href="https://colab.research.google.com/github/kavyajeetbora/nlp_rag/blob/master/end_to_end/01_open_AI_in_finance_domain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install the libraries

In [15]:
!pip install -q langchain langchain_community langchain-openai unstructured libmagic python-magic faiss-cpu sentence-transformers

In [4]:
import os
import configparser
from glob import glob
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

## Loading the Env Variables


In [5]:
if os.path.exists(".env"):
    os.remove(".env")

from google.colab import files
uploaded = files.upload()
if uploaded:
    if load_dotenv(".env"):
        print("Uploaded and Loaded Successfully")

Saving .env to .env
Uploaded and Loaded Successfully


Alternatively, load the API key from a config file with `configparser`:

In [6]:
# if os.path.exists("config.cfg"):
#     os.remove("config.cfg")

# from google.colab import files
# uploaded = files.upload()
# if uploaded:
#     try:
#         ## Read the config file
#         config = configparser.ConfigParser()
#         config.read('config.cfg')

#         ## get the API information
#         api_key = config['OPENAI']['API_KEY']

#         ## set it as environment variable
#         os.environ['OPENAI_API_KEY'] = api_key

#     except Exception as e:
#         print("Error",e)
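
As a minimal sketch of the config-file route, the snippet below parses an `[OPENAI]` section and exports the key. The file contents and the `sk-placeholder` value are assumptions made up for illustration, and `read_string` stands in for reading the uploaded `config.cfg`. `OPENAI_API_KEY` is the environment variable that `langchain-openai` looks for.

```python
import configparser
import os

# Hypothetical file contents; the key is a placeholder, not a real credential.
SAMPLE_CFG = """
[OPENAI]
API_KEY = sk-placeholder
"""

config = configparser.ConfigParser()
# config.read("config.cfg") would load the uploaded file instead.
config.read_string(SAMPLE_CFG)

# fallback=None avoids an exception when the section or option is missing.
api_key = config.get("OPENAI", "API_KEY", fallback=None)
if api_key is None:
    raise ValueError("config.cfg has no [OPENAI] API_KEY entry")

# Expose the key under the name the OpenAI client libraries read.
os.environ["OPENAI_API_KEY"] = api_key
```

Checking for a missing entry up front gives a clearer error than a failed API call later in the notebook.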

## Load the Document Source

In [7]:
# Document loaders live in langchain_community in recent LangChain releases
from langchain_community.document_loaders import TextLoader, CSVLoader, UnstructuredURLLoader

url_loader = UnstructuredURLLoader(
    urls = [
        "https://bbc.com/sport/football/articles/cvgm79448kyo",
        "https://www.standard.co.uk/sport/football/transfer-news-live-arsenal-fc-chelsea-man-utd-mbeumo-zubimendi-dorgu-b1202083.html"
    ]
)

data = url_loader.load()
len(data)

2

In [8]:
for doc in data:
    print(doc.metadata)
    print(f"Number of characters in this article: {len(doc.page_content)}")

{'source': 'https://bbc.com/sport/football/articles/cvgm79448kyo'}
Number of characters in this article: 5421
{'source': 'https://www.standard.co.uk/sport/football/transfer-news-live-arsenal-fc-chelsea-man-utd-mbeumo-zubimendi-dorgu-b1202083.html'}
Number of characters in this article: 6777


## Text Splitting

**Why chunk the text?**

Chunking breaks a large text into smaller, meaningful pieces so that each piece can be embedded and retrieved on its own.

Benefits:
1. Token limits: Each chunk fits within the model's token limits.
2. Retrieval precision: Smaller chunks let the retriever return only the passages relevant to a query.
3. Semantic understanding: Splitting at natural boundaries, with some overlap, helps each chunk keep its context.
4. Efficiency: Passing less text to the model reduces latency and computational cost.
5. Fewer hallucinations: A well-chosen chunk size balances context richness with retrieval precision.
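
To make the chunk size and overlap parameters concrete, here is a simplified sketch of fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph breaks, newlines and spaces); the helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window repeats the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.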

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

chunks = text_splitter.split_documents(data)
print(f"Data Source split into {len(chunks)} chunks")

Data Source split into 90 chunks


## Embedding

Next, we embed the chunks into numerical vectors using an encoder-based model.

The vectors will be stored in an in-memory database.

In [16]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-mpnet-base-v2')


**Note**: The all-mpnet-base-v2 encoder accepts at most 384 tokens. When you pass a longer text through the SentenceTransformer library, the input is truncated to 384 tokens by default so that it fits within the model's maximum token limit. Any text beyond that point does not contribute to the embedding.

Read more about this [model](https://huggingface.co/sentence-transformers/all-mpnet-base-v2).

In [19]:
text_lines = [
    "The quick brown fox jumps over the lazy dog.",
    "In the beginning, there was light.",
    "To be or not to be, that is the question."
]

encoder.encode(text_lines).shape

(3, 768)
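
Once every chunk has a vector, retrieval comes down to finding the stored vectors closest to the query vector. The sketch below shows that idea with cosine similarity over a plain Python list. It is only a stand-in for a real in-memory index (the notebook installs `faiss-cpu`, which would presumably fill that role), and the chunk labels and 3-dimensional vectors are invented in place of the 768-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy in-memory store of (text, vector) pairs; all values are made up.
store = [
    ("chunk about transfers", [0.9, 0.1, 0.0]),
    ("chunk about injuries", [0.1, 0.9, 0.0]),
    ("chunk about fixtures", [0.0, 0.2, 0.9]),
]


def search(query_vector, k=2):
    """Return the k stored chunks most similar to query_vector, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(reverse=True)
    return scored[:k]


results = search([0.8, 0.2, 0.1])
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for the 90 chunks produced here; a dedicated index becomes worthwhile as the number of stored vectors grows.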