# Reading and Cleaning Text

In [1]:
# pdf_file = "./doc/1706.03762.pdf"
pdf_file = "./doc/2005.11401.pdf"
text_file = "./doc/textfile.txt"

In [2]:
import fitz  # PyMuPDF

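def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, with a newline after each page."""
    with fitz.open(pdf_path) as doc:  # close the file handle when done
        return "".join(page.get_text("text") + "\n" for page in doc)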

In [11]:
unclean_text = extract_text_from_pdf(pdf_file)
unclean_text

'Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrick Lewis†‡, Ethan Perez⋆,\nAleksandra Piktus†, Fabio Petroni†, Vladimir Karpukhin†, Naman Goyal†, Heinrich Küttler†,\nMike Lewis†, Wen-tau Yih†, Tim Rocktäschel†‡, Sebastian Riedel†‡, Douwe Kiela†\n†Facebook AI Research; ‡University College London; ⋆New York University;\nplewis@fb.com\nAbstract\nLarge pre-trained language models have been shown to store factual knowledge\nin their parameters, and achieve state-of-the-art results when ﬁne-tuned on down-\nstream NLP tasks. However, their ability to access and precisely manipulate knowl-\nedge is still limited, and hence on knowledge-intensive tasks, their performance\nlags behind task-speciﬁc architectures. Additionally, providing provenance for their\ndecisions and updating their world knowledge remain open research problems. Pre-\ntrained models with a differentiable access mechanism to explicit non-parametric\nmemory have so far been only investigated for extract

In [4]:
import re
import unicodedata

def clean_text_(text):
    # Collapse every run of whitespace, including newlines, into a single space
    text = re.sub(r'\s+', ' ', str(text))
    return text.strip()  # Trim leading and trailing spaces

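def remove_special_chars(text):
    # Keep ASCII letters, digits, spaces, and common punctuation; everything else
    # (hyphens, accented letters, ligatures, symbols) is deleted
    text = re.sub(r'[^a-zA-Z0-9.,!?\'" ]', '', text)
    return text

def fix_hyphenation(text):
    # Join words split by a hyphen followed by whitespace ("down- stream" -> "downstream")
    return re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)

def normalize_unicode(text):
    # NFKD decomposes accented characters and expands ligatures into base characters
    return unicodedata.normalize("NFKD", text)

def remove_headers_footers(text):
    # Works line by line, so it only has an effect on text that still contains newlines
    lines = text.split("\n")
    cleaned_lines = [line for line in lines if not re.match(r'(Page \d+|Confidential|Company Name)', line)]
    return " ".join(cleaned_lines)

def normalize_text(text):
    return " ".join(text.lower().split())  # Lowercase and remove extra spaces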

def full_text_cleanup(text):
    """
    Takes in unclean text and returns cleaned text by applying a series of cleaning functions.
    """
    text = clean_text_(text)
    text = fix_hyphenation(text)
    text = remove_special_chars(text)
    text = normalize_unicode(text)
    text = remove_headers_footers(text)
    text = normalize_text(text)
    
    return text

In [12]:
clean_text = full_text_cleanup(unclean_text)
clean_text

'retrievalaugmented generation for knowledgeintensive nlp tasks patrick lewis, ethan perez, aleksandra piktus, fabio petroni, vladimir karpukhin, naman goyal, heinrich kttler, mike lewis, wentau yih, tim rocktschel, sebastian riedel, douwe kiela facebook ai research university college london new york university plewisfb.com abstract large pretrained language models have been shown to store factual knowledge in their parameters, and achieve stateoftheart results when netuned on downstream nlp tasks. however, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledgeintensive tasks, their performance lags behind taskspecic architectures. additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. pretrained models with a differentiable access mechanism to explicit nonparametric memory have so far been only investigated for extractive downstream tasks. we explore a generalpurpose netuning
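The output above shows the cost of the current step order: `remove_special_chars` runs before `normalize_unicode`, so the "fi" ligature and accented letters are deleted rather than converted ("netuned", "kttler"), and every hyphen is dropped ("retrievalaugmented"). The following is a minimal sketch of one alternative ordering, assuming that NFKC normalization, accent stripping, and a hyphen-preserving character class are acceptable for this corpus. `cleanup_reordered` is a hypothetical name, not part of the notebook.

```python
import re
import unicodedata

def cleanup_reordered(text):
    """Hypothetical variant of full_text_cleanup with the steps reordered."""
    # 1. Expand compatibility characters first (the "fi" ligature becomes "fi").
    text = unicodedata.normalize("NFKC", text)
    # 2. Rejoin words split across a line break ("down-\nstream" -> "downstream").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # 3. Collapse the remaining whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Decompose accented letters and drop the combining marks ("ü" -> "u").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # 5. Strip remaining symbols but keep in-word hyphens.
    text = re.sub(r"[^a-zA-Z0-9.,!?'\" -]", "", text)
    return " ".join(text.lower().split())

# \ufb01 is the "fi" ligature, \u00fc is "ü", \u2020 is the dagger footnote mark.
sample = "\ufb01ne-tuned on down-\nstream NLP tasks. Heinrich K\u00fcttler\u2020"
print(cleanup_reordered(sample))
```

One limitation of this sketch: a genuinely hyphenated compound that happens to break across lines (for example "Pre-" followed by "trained") is also joined into one word.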

In [6]:
from langchain.text_splitter import NLTKTextSplitter
def split_text_into_sentences(text):
    """Splits text into sentences using NLTKTextSplitter."""
    text_splitter = NLTKTextSplitter()
    return text_splitter.split_text(text)

sentences = split_text_into_sentences(clean_text)
print(sentences)

['retrievalaugmented generation for knowledgeintensive nlp tasks patrick lewis, ethan perez, aleksandra piktus, fabio petroni, vladimir karpukhin, naman goyal, heinrich kttler, mike lewis, wentau yih, tim rocktschel, sebastian riedel, douwe kiela facebook ai research university college london new york university plewisfb.com abstract large pretrained language models have been shown to store factual knowledge in their parameters, and achieve stateoftheart results when netuned on downstream nlp tasks.\n\nhowever, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledgeintensive tasks, their performance lags behind taskspecic architectures.\n\nadditionally, providing provenance for their decisions and updating their world knowledge remain open research problems.\n\npretrained models with a differentiable access mechanism to explicit nonparametric memory have so far been only investigated for extractive downstream tasks.\n\nwe explore a generalpur

In [25]:
# Using LangChain to extract text
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import NLTKTextSplitter

def extract_text_langchain(pdf_path):
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()
    return "\n".join([doc.page_content for doc in documents])


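def lang_clean_text(text):
    text = full_text_cleanup(text)
    # CharacterTextSplitter splits on its separator (default "\n\n"). The cleaned
    # text has no newlines left, so it comes back as a single oversized chunk.
    text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
    return text_splitter.split_text(text)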

def split_text_into_sentences(text):
    """Splits text into sentences using NLTKTextSplitter."""
    text_splitter = NLTKTextSplitter()
    sentences = text_splitter.split_text(text)
    cleaned_sentences = [sentence.replace("\n", " ") for sentence in sentences]
    return cleaned_sentences

lang_text = extract_text_langchain(pdf_file)
lang_cleaned_text = lang_clean_text(lang_text)
lang_sentences = split_text_into_sentences(lang_cleaned_text[0])

lang_sentences

['retrievalaugmented generation for knowledgeintensive nlp tasks patrick lewis, ethan perez, aleksandra piktus, fabio petroni, vladimir karpukhin, naman goyal, heinrich kttler, mike lewis, wentau yih, tim rocktschel, sebastian riedel, douwe kiela facebook ai research university college london new york university plewisfb.com abstract large pretrained language models have been shown to store factual knowledge in their parameters, and achieve stateoftheart results when netuned on downstream nlp tasks.  however, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledgeintensive tasks, their performance lags behind taskspecic architectures.  additionally, providing provenance for their decisions and updating their world knowledge remain open research problems.  pretrained models with a differentiable access mechanism to explicit nonparametric memory have so far been only investigated for extractive downstream tasks.  we explore a generalpurpose net

In [26]:
import chromadb
chroma_client = chromadb.Client()

sentence_chunks = lang_sentences

collection = chroma_client.get_or_create_collection(name="my_collection")
collection.add(
    documents=sentence_chunks,
    ids=[f"{i}" for i in range(len(sentence_chunks))]
)


Insert of existing embedding ID: 0
Insert of existing embedding ID: 1
Insert of existing embedding ID: 2
Insert of existing embedding ID: 3
Insert of existing embedding ID: 4
Insert of existing embedding ID: 5
Insert of existing embedding ID: 6
Insert of existing embedding ID: 7
Insert of existing embedding ID: 8
Insert of existing embedding ID: 9
Insert of existing embedding ID: 10
Insert of existing embedding ID: 11
Insert of existing embedding ID: 12
Insert of existing embedding ID: 13
Insert of existing embedding ID: 14
Insert of existing embedding ID: 15
Insert of existing embedding ID: 16
Insert of existing embedding ID: 17
Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of ex
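The "existing embedding ID" warnings above are consistent with the cell being re-run against a collection that already holds the positional IDs `0` to `17`. One way to make re-runs idempotent is to derive each ID from the chunk's content and to drop duplicate chunks before adding them. A minimal standard-library sketch follows; `content_ids` is a hypothetical helper, and the 16-character digest length is an arbitrary choice.

```python
import hashlib

def content_ids(chunks):
    """Return (unique_chunks, ids), with each ID derived from the chunk's text."""
    seen = set()
    unique_chunks, ids = [], []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        if digest in seen:  # skip chunks whose text was already collected
            continue
        seen.add(digest)
        unique_chunks.append(chunk)
        ids.append(digest)
    return unique_chunks, ids

chunks, ids = content_ids(["first sentence.", "second sentence.", "first sentence."])
print(len(chunks), len(ids))
```

The two lists could then be passed as `documents=` and `ids=` to `collection.add`. Recent Chroma versions also offer `collection.upsert`, which overwrites an existing ID instead of warning about it.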

In [27]:
results = collection.query(
    query_texts=["Wha are paramatric and non parametric weights"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

print(results["documents"])

[['arxiv, abs2004.07159, 2020.  url httpsarxiv.orgabs2004.07159.  5 danqi chen, adam fisch, jason weston, and antoine bordes.  reading wikipedia to answer opendomain questions.  in proceedings of the 55th annual meeting of the association for computational linguistics volume 1 long papers, pages 18701879, vancouver, canada, july 2017. association for computational linguistics.  doi 10.18653v1p171171.  url httpswww.aclweb.organthologyp171171.  6 eunsol choi, daniel hewlett, jakob uszkoreit, illia polosukhin, alexandre lacoste, and jonathan berant.  coarsetone question answering for long documents.  in proceedings of the 55th annual meeting of the association for computational linguistics volume 1 long papers, pages 209220, vancouver, canada, july 2017. association for computational linguistics.  doi 10.18653v1p171020.  url httpswww.aclweb.organthologyp171020.  10 7 christopher clark and matt gardner.  simple and effective multiparagraph reading comprehension.  arxiv1710.10723 cs, octobe

# LLM Integration

In [53]:
import requests

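def ask_ollama(query, context=None):
    """Query a local Ollama model with the question and the retrieved context."""
    # Join the retrieved chunks into one string for the prompt
    context = "-".join(context) if context else "No context provided."
    
    # Strict context-only prompt (kept for comparison; it is not the one sent below)
    prompt = f"""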
            You are an advanced Retrieval-Augmented Generation (RAG) system designed to provide highly accurate and contextually relevant responses. Use *only* the information provided in the context below to generate your answer. Do not use any prior knowledge or external sources. If the context does not contain enough information to answer the question, explicitly state: "I cannot answer this question based on the provided information."

            ## Instructions:
            - Analyze the retrieved context carefully to extract the most relevant details.
            - Ensure that your answer is comprehensive, well-structured, and directly addresses the user's question.
            - If multiple pieces of evidence exist in the context, synthesize them for a cohesive response.
            - If the context is unclear, ambiguous, or conflicting, acknowledge this uncertainty in your response.
            - Do not assume or infer facts beyond what is stated in the provided context.

            ## Context:
            {context}

            ## Question:
            {query}

            ## Answer:
            """
    
    # Permissive prompt that may fall back on the model's own knowledge; this is the one sent below
    grok_prompt = f"""
        ### Prompt for RAG System

            **Instruction:**
            You are an AI designed to answer queries using a two-step process involving context retrieval and knowledge-based answering. Here's how you should proceed:

            1. **Context Retrieval (Step 1):**
            - **Context:** {context}
            - **Query:** {query}

            First, attempt to answer the query using the provided context. Look for relevant information within the context that directly relates to the query. If you can answer the query comprehensively using only this context, do so. If you cannot:

            2. **Knowledge-Based Answer (Step 2):**
            - If the context does not provide enough information to answer the query accurately, or if the query is not adequately addressed by the context, use your pre-existing knowledge to answer the query. 
            - Be clear that you are now using your knowledge by starting your response with "Based on my knowledge:".

            **Guidelines:**
            - **Accuracy:** Prioritize accuracy. If the context does not provide a clear answer and your knowledge is uncertain or outdated, acknowledge this by saying, "I'm not certain about this, but based on my knowledge:".
            - **Completeness:** If part of the query can be answered with context but not fully, use context for what you can and supplement with knowledge.
            - **Citations:** When answering from context, if possible, reference or quote directly from the context by using quotation marks or by specifying where in the context the answer was found (e.g., "According to the context...").
            - **Admit Limitations:** If neither the context nor your knowledge can provide an answer, admit this by saying, "I do not have enough information to answer this query adequately."

            **Example Response Formats:**

            - **From Context:** "The context states that the boiling point of water at sea level is 100°C."
            - **From Knowledge:** "Based on my knowledge, the average adult human body contains approximately 60% water."
            - **Mixed:** "From the context, we learn that the Eiffel Tower was completed in 1889. Based on my knowledge, it was designed by Gustave Eiffel."
            - **Admitting Limitation:** "I do not have enough information to answer this query adequately."

            **Proceed:**
            Now, attempt to answer the query provided:

            **Query:** {query}

            Your answer should simply explain your understanding of the question. Don't list steps or anything else; just explain the concept. Don't say "Based on my knowledge".
    
    """
    
    # Ollama local API endpoint
    OLLAMA_URL = "http://localhost:11434/api/generate"

    # Define the request payload
    payload = {
        "model": "llama3.2",  # Change this to the model you have installed
        "prompt": grok_prompt,
        "stream": False  # Set to True if you want to stream responses
    }

    # Send the request
    response = requests.post(OLLAMA_URL, json=payload)

    # Parse, print, and return the response text so the caller can reuse it
    if response.status_code == 200:
        answer = response.json()["response"]
        print(answer)
        return answer
    print(f"Error: {response.status_code}, {response.text}")
    return None
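`ask_ollama` builds the prompt and performs the HTTP call in one function, which makes the prompt hard to inspect without a running server. A minimal sketch that separates payload construction from the request is shown here. `build_payload` and its shortened template are hypothetical, while the `model`, `prompt`, and `stream` fields mirror the payload above.

```python
def build_payload(query, context_chunks=None, model="llama3.2"):
    """Assemble the JSON body for the /api/generate request without sending it."""
    # Separate chunks with blank lines so chunk boundaries stay visible in the prompt.
    context = "\n\n".join(context_chunks) if context_chunks else "No context provided."
    prompt = (
        "Answer the question using the context below.\n\n"
        f"## Context:\n{context}\n\n"
        f"## Question:\n{query}\n\n"
        "## Answer:\n"
    )
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_payload("What is RAG?", ["chunk one", "chunk two"])
print(payload["prompt"])
```

The returned dict could be passed straight to `requests.post(OLLAMA_URL, json=payload)`, and the prompt can be printed or unit-tested on its own.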




In [54]:
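query = "What are parametric and non-parametric weights?"
response_query = collection.query(query_texts=[query], n_results=1)  # Chroma expects a list of query texts
context = response_query["documents"][0]

response = ask_ollama(query, context)  # pass the plain string so the prompt shows the question, not a list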

From the context, it seems that "parametric" and "non-parametric" are terms used in the context of machine learning models, specifically in relation to memory components in hybrid generation models.

Parametric weights refer to the learnable parameters or weights of a model that are adjusted during training to optimize its performance. These weights are typically learned using an optimization algorithm such as stochastic gradient descent (SGD).

Non-parametric weights, on the other hand, refer to the knowledge or information stored in the memory component of a model that is not explicitly defined by a set of learnable parameters. This type of weight can be thought of as " implicit" or "hardcoded" into the model's architecture.

In essence, parametric weights are learned during training, while non-parametric weights are more like pre-defined rules or heuristics that guide the model's behavior without being explicitly optimized for performance.
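
The retrieval and generation steps can be folded into one helper so that both are always called with the same question. This is a minimal sketch, assuming the collection's `query` method returns a dict whose `"documents"` entry holds one list of chunks per query text, as in the outputs above. `answer_question`, `FakeCollection`, and `echo` are hypothetical names; the last two are stand-ins so the sketch runs without Chroma or an Ollama server.

```python
def answer_question(collection, question, ask, n_results=1):
    """Retrieve the top chunks for `question` and pass them to `ask`."""
    results = collection.query(query_texts=[question], n_results=n_results)
    context = results["documents"][0]  # chunks for the first (only) query text
    return ask(question, context)


class FakeCollection:
    """Stand-in for a Chroma collection so the sketch runs without a database."""
    def query(self, query_texts, n_results):
        chunks = ["chunk about parametric memory", "chunk about retrieval"]
        return {"documents": [chunks[:n_results] for _ in query_texts]}


def echo(question, context):
    """Stand-in for ask_ollama that needs no running server."""
    return f"{question} | {' / '.join(context)}"


print(answer_question(FakeCollection(), "What is parametric memory?", echo))
```

In the notebook, the call would look like `answer_question(collection, query, ask_ollama)`.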
