In [None]:
CHATGPT
Using **LangChain** is a great choice for managing LLM pipelines, including Retrieval-Augmented Generation (RAG) tasks. Here’s how you can integrate LangChain into your project to build a multimodal LLM for financial documents:

### Steps to Implement RAG with LangChain

1. **Data Ingestion & Preprocessing**:
   - **Document Loading**: Use LangChain’s built-in document loaders to import your financial documents (PDFs, earnings calls, transcripts).
     - `langchain_community.document_loaders` (formerly `langchain.document_loaders`) provides support for PDF, CSV, text, and other file types.
     - For financial documents, you might use **PyPDFLoader** or **PyMuPDFLoader** from LangChain.
     ```python
     from langchain_community.document_loaders import PyMuPDFLoader

     loader = PyMuPDFLoader("financial_document.pdf")
     documents = loader.load()
     ```
   
   - **Text Preprocessing**: Clean and preprocess the text using libraries like `nltk` or `spaCy` to handle financial jargon, remove noise, and tokenize text.

2. **Embedding Financial Documents**:
   - Convert your documents into embeddings using a pretrained financial model like **FinBERT** or a model from Hugging Face.
   - LangChain has built-in support for various embedding models. You can directly use `OpenAIEmbeddings`, `HuggingFaceEmbeddings`, or **SentenceTransformers** to convert financial documents into vector representations.
     ```python
     from langchain_community.embeddings import HuggingFaceEmbeddings

     embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
     document_embeddings = embeddings.embed_documents([doc.page_content for doc in documents])
     ```

3. **Storing and Indexing**:
   - Use **FAISS** or **Pinecone** as your vector store to index the embeddings of financial documents for retrieval.
   - LangChain supports vector stores like FAISS, which integrates easily for building a RAG pipeline.
     ```python
     from langchain_community.vectorstores import FAISS

     vector_store = FAISS.from_documents(documents, embeddings)
     ```


In [1]:
from langchain_community.document_loaders import PyMuPDFLoader
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
import re
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')

loader = PyMuPDFLoader("PTLO 2023 Q4 10K.pdf")
documents = loader.load()

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\David\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [2]:
def clean_text(text):
    # Remove unwanted symbols and special characters (like checkboxes)
    text = re.sub(r'[☐☒]', '', text)  # Remove checkbox symbols

    # Remove newlines and excessive whitespace
    text = re.sub(r'\n+', ' ', text)  # Replace newlines with space
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space

    # Trim leading and trailing whitespace
    text = text.strip()

    return text

In [3]:
cleaned_text = [clean_text(doc.page_content) for doc in documents]

In [4]:
cleaned_text[1]

"If securities are registered pursuant to Section 12(b) of the Act, indicate by check mark whether the financial statements of the registrant included in the filing reflect the correction of an error to previously issued financial statements. Yes No Indicate by check mark whether any of those error corrections are restatements that required a recovery analysis of incentive-based compensation received by any of the registrant’s executive officers during the relevant recovery period pursuant to § 240.10D-1(b). Yes No Indicate by check mark whether the registrant is a shell company (as defined in Rule 12b-2 of the Act). Yes No The aggregate market value of the common stock held by non-affiliates of the registrant on June 23, 2023, the last business day of the Registrant's most recently completed second fiscal quarter, based on the closing price of the registrant's Class A common stock as reported by The Nasdaq Stock Market on that date, was approximately $ 981,393,224 . This calculation d

In [8]:
# Function to map POS tags to WordNet
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # Adjective
    elif treebank_tag.startswith('V'):
        return 'v'  # Verb
    elif treebank_tag.startswith('N'):
        return 'n'  # Noun
    elif treebank_tag.startswith('R'):
        return 'r'  # Adverb
    else:
        return 'n'  # Default to noun
        
# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove ASCII punctuation (string.punctuation does not cover characters like ’ or §)
    text = text.translate(str.maketrans("", "", string.punctuation))

    # Remove numbers
    text = re.sub(r'\d+', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # POS (Part of Speech) tagging
    pos_tags = nltk.pos_tag(tokens)

    # Apply lemmatization, using each token's POS tag
    lemmatized_tokens = [LEMMATIZER.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tags]

    return ' '.join(lemmatized_tokens)

In [9]:
preprocessed_text = [preprocess_text(text) for text in cleaned_text]

In [10]:
preprocessed_text[1]

'security register pursuant section b act indicate check mark whether financial statement registrant include filing reflect correction error previously issue financial statement yes indicate check mark whether error correction restatement require recovery analysis incentivebased compensation receive registrant ’ executive officer relevant recovery period pursuant § db yes indicate check mark whether registrant shell company define rule b act yes aggregate market value common stock hold nonaffiliates registrant june last business day registrant recently complete second fiscal quarter base closing price registrant class common stock report nasdaq stock market date approximately calculation reflect determination certain person affiliate registrant purpose february share registrant class common stock par value per share issue outstanding'
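
The output above still contains `’` and `§` because `string.punctuation` lists only ASCII characters. A minimal sketch of a Unicode-aware alternative follows; the helper name `strip_punctuation` and the sample sentence are illustrative assumptions, not part of the notebook. It drops every character whose Unicode category is punctuation or symbol, which also covers the ASCII set.

```python
import unicodedata

def strip_punctuation(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
    )

# Hypothetical sample modeled on the 10-K wording shown above
sample = "registrant’s executive officers pursuant to § 240.10D-1(b)"
stripped = strip_punctuation(sample)
print(stripped.split())
```

Like the `string.punctuation` approach, this also removes `$` and `%`, which may or may not be desirable for financial text.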

In [13]:
# For now, use bert-base-uncased to get a model running. However, the chunks (currently one page of text each) are quite large
# and may need to be split up before they are stored.
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="bert-base-uncased")
document_embeddings = embeddings.embed_documents(preprocessed_text)

  from tqdm.autonotebook import tqdm, trange
No sentence-transformers model found with name bert-base-uncased. Creating a new one with mean pooling.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
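
The "mean pooling" message above means that sentence-transformers wraps the plain BERT model and averages its per-token vectors into one vector per input text. A toy sketch of that averaging is below, assuming 2-dimensional vectors instead of 768 and a hypothetical helper named `mean_pool`; the real library does this on tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding tokens are skipped)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy input: three real tokens plus one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```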


In [15]:
len(document_embeddings[1])

768
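
Each page becomes a single 768-dimensional vector regardless of its length, and BERT-style models accept at most 512 tokens, so text beyond that limit is typically truncated. The splitting mentioned in the cell In [13] comment could be sketched as follows. This is a plain word-based splitter with overlap; the function name and default sizes are assumptions, and LangChain's `RecursiveCharacterTextSplitter` is a likely substitute in practice.

```python
def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most chunk_size words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

# Toy input: ten placeholder words, small sizes so the overlap is visible
sample_text = " ".join(f"w{i}" for i in range(10))
chunks = split_into_chunks(sample_text, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk could then be embedded separately, so that retrieval returns a passage rather than a whole page.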

In [19]:
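# FAISS (Facebook AI Similarity Search) lets us retrieve the stored vectors closest to a query vector
# Note: from_documents embeds the raw page_content of `documents`, not `preprocessed_text`
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)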

In [20]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x1a9915df410>
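
Conceptually, querying the store means embedding the query and returning the stored vectors closest to it. A brute-force sketch of that search using Euclidean (L2) distance is below. It assumes toy 2-dimensional vectors and hypothetical helper names; FAISS uses optimized index structures rather than this loop, and the L2 default of LangChain's wrapper is my understanding rather than something shown in this notebook.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    scored = [(l2_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()
    return [(idx, dist) for dist, idx in scored[:k]]

# Toy stand-ins for the stored document embeddings and an embedded query
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_query = [0.9, 1.2]
results = nearest_neighbors(toy_query, toy_vectors, k=2)
print(results)
```

With the real store, the equivalent call would be along the lines of `vector_store.similarity_search("some question", k=2)`, which embeds the query string with the same embedding model before searching.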