# Reading and Cleaning Text

In [7]:
# pdf_file = "./doc/1706.03762.pdf"
pdf_file = "./doc/2005.11401.pdf"
text_file = "./doc/textfile.txt"
brian_pdf = "./doc/Brian's_Resume.pdf"
opt_workshop = "./doc/opt_workshop_umbc.pdf"
eda_chatper = "./doc/eda-chapter.pdf"

In [8]:
import fitz

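def extract_text_from_pdf(pdf_path):
    """Return the plain text of every page in the PDF, each page followed by a newline."""
    # Open the document in a context manager so the file handle is released.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text("text") + "\n" for page in doc)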

unclean_text = extract_text_from_pdf(eda_chatper)
unclean_text

'Chapter 4\nExploratory Data Analysis\nA ﬁrst look at the data.\nAs mentioned in Chapter 1, exploratory data analysis or “EDA” is a critical\nﬁrst step in analyzing the data from an experiment. Here are the main reasons we\nuse EDA:\n• detection of mistakes\n• checking of assumptions\n• preliminary selection of appropriate models\n• determining relationships among the explanatory variables, and\n• assessing the direction and rough size of relationships between explanatory\nand outcome variables.\nLoosely speaking, any method of looking at data that does not include formal\nstatistical modeling and inference falls under the term exploratory data analysis.\n4.1\nTypical data format and the types of EDA\nThe data from an experiment are generally collected into a rectangular array (e.g.,\nspreadsheet or database), most commonly with one row per experimental subject\n61\n\n62\nCHAPTER 4. EXPLORATORY DATA ANALYSIS\nand one column for each subject identiﬁer, outcome variable, and explanatory\

In [None]:
import re
import unicodedata

def clean_text_(text):
    # text = text.replace("\n", " ")  # Replace newlines with spaces
    text = re.sub(r'\s+', ' ', str(text))  # Collapse every whitespace run (newlines included) into one space
    return text.strip()  # Trim leading and trailing spaces

def remove_special_chars(text):
    text = re.sub(r'[^a-zA-Z0-9.,!?\'" ]', '', text)  # Keep letters, numbers, and common punctuation
    return text

def fix_hyphenation(text):
    return re.sub(r'(\w+)-\s+(\w+)', r'\1\2', text)  # Removes hyphenation across lines

def normalize_unicode(text):
    return unicodedata.normalize("NFKD", text)

def remove_headers_footers(text):
    lines = text.split("\n")
    cleaned_lines = [line for line in lines if not re.match(r'(Page \d+|Confidential|Company Name)', line)]
    return " ".join(cleaned_lines)

def normalize_text(text):
    return " ".join(text.lower().split())

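def full_text_cleanup(text):
    """
    Take unclean text and return cleaned text by applying a series of cleaning functions.

    The order matters: clean_text_ removes every newline, so the line-based
    remove_headers_footers has nothing left to filter, and remove_special_chars
    drops ligatures such as "ﬁ" before normalize_unicode can expand them.
    """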
    text = clean_text_(text)
    text = fix_hyphenation(text)
    text = remove_special_chars(text)
    text = normalize_unicode(text)
    text = remove_headers_footers(text)
    text = normalize_text(text)
    
    return text

In [10]:
clean_text = full_text_cleanup(unclean_text)
clean_text

'chapter 4 exploratory data analysis a rst look at the data. as mentioned in chapter 1, exploratory data analysis or eda is a critical rst step in analyzing the data from an experiment. here are the main reasons we use eda detection of mistakes checking of assumptions preliminary selection of appropriate models determining relationships among the explanatory variables, and assessing the direction and rough size of relationships between explanatory and outcome variables. loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis. 4.1 typical data format and the types of eda the data from an experiment are generally collected into a rectangular array e.g., spreadsheet or database, most commonly with one row per experimental subject 61 62 chapter 4. exploratory data analysis and one column for each subject identier, outcome variable, and explanatory variable. each column contains the numeri
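
The cleaned output reads "rst" and "identier" where the PDF had "ﬁrst" and "identiﬁer". The ligature "ﬁ" (U+FB01) is outside the ASCII whitelist in `remove_special_chars`, so it is deleted before `normalize_unicode` can expand it to "fi". The sketch below compares the two orderings on one sentence. The names `strip_then_normalize` and `normalize_then_strip` are hypothetical and only illustrate the effect; they are not part of the pipeline above.

```python
import re
import unicodedata

# Same whitelist as remove_special_chars.
DISALLOWED = re.compile(r'[^a-zA-Z0-9.,!?\'" ]')

def strip_then_normalize(text):
    # Order used by full_text_cleanup: the ligature is deleted first.
    return unicodedata.normalize("NFKD", DISALLOWED.sub("", text))

def normalize_then_strip(text):
    # Hypothetical reordering: expand the ligature before filtering.
    return DISALLOWED.sub("", unicodedata.normalize("NFKD", text))

sample = "A ﬁrst look at the data."
print(strip_then_normalize(sample))  # A rst look at the data.
print(normalize_then_strip(sample))  # A first look at the data.
```

If the ligatures matter for retrieval quality, moving the Unicode normalization ahead of the character filter is one option worth trying.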

In [22]:
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import NLTKTextSplitter

def extract_text_langchain(pdf_path):
    loader = PyMuPDFLoader(pdf_path)
    documents = loader.load()
    return "\n".join([doc.page_content for doc in documents])


def lang_clean_text(text):
    # text = text.replace("\n", " ").strip()  
    text = full_text_cleanup(text)
    text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
    return text_splitter.split_text(text)

def split_text_into_sentences(text):
    text_splitter = NLTKTextSplitter()
    sentences = text_splitter.split_text(text)
    cleaned_sentences = [sentence.replace("\n", " ") for sentence in sentences]
    return cleaned_sentences


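def create_resources(file_path):
    # Use the path that was passed in rather than the global eda_chatper.
    lang_text = extract_text_langchain(file_path)
    lang_cleaned_text = lang_clean_text(lang_text)
    # Only the first chunk is used; with the newline-free cleaned text the
    # splitter is expected to return a single chunk.
    lang_sentences = split_text_into_sentences(lang_cleaned_text[0])
    return lang_sentences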
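
`lang_clean_text` passes `chunk_size=100` and `chunk_overlap=20`. To make those two parameters concrete, here is a minimal sketch of a plain sliding-window chunker. It is not LangChain's algorithm (`CharacterTextSplitter` first splits on a separator and then merges the pieces); `sliding_window_chunks` is a hypothetical helper written only to show what size and overlap mean.

```python
def sliding_window_chunks(text, chunk_size=100, chunk_overlap=20):
    """Cut text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = sliding_window_chunks("a" * 250)
print([len(c) for c in chunks])  # [100, 100, 90]
```

Each window starts 80 characters after the previous one, so consecutive chunks share 20 characters.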

In [24]:
import chromadb
chroma_client = chromadb.Client()

sentence_chunks = create_resources(eda_chatper)

collection = chroma_client.get_or_create_collection(name="my_collection")
collection.add(
    documents=sentence_chunks,
    ids=[f"{i}" for i in range(len(sentence_chunks))]
)

# chroma_client.delete_collection("my_collection")

In [14]:
results = collection.query(
    query_texts=["What is central tendency?"],
    n_results=4
)

print(results["documents"])

[['for nonsymmetric distributions, the mean is the balance point if the histogram is cut out of some homogeneous stimaterial such as cardboard, it will balance on a fulcrum placed at the mean.  for many descriptive quantities, there are both a sample and a population version.  for a xed nite population or for a theoretic innite population described by a pmf or pdf, there is a single population mean which is a xed, often unknown, value called the mean parameter see section 3.5. on the other hand, the sample mean will vary from sample to sample as dierent samples are taken, and so is a random variable.  the probability distribution of the sample mean is referred to as its sampling distribution.  this term expresses the idea that any experiment could at least theoretically, given enough resources be repeated many times and various statistics such as the sample mean can be calculated each time.  often we can use probability theory to work out the exact distribution of the sample statistic,
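
The printed result is a list inside a list because `collection.query` returns one inner list per query text. The sketch below walks a result of that shape, assuming the default fields (`ids`, `documents`, `distances`) are present. The dictionary is a hand-written stand-in, not real output, and `top_hits` is a hypothetical helper.

```python
def top_hits(results, query_index=0):
    """Pair each retrieved id with its document and distance for one query."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return list(zip(ids, docs, dists))

# Hand-written stand-in for a collection.query(...) result; values are invented.
fake_results = {
    "ids": [["12", "3"]],
    "documents": [["the mean is the balance point", "the median splits the data"]],
    "distances": [[0.41, 0.57]],
}

for doc_id, doc, dist in top_hits(fake_results):
    print(f"{doc_id}  {dist:.2f}  {doc[:40]}")
```

Looking at the distances next to the text makes it easier to judge whether the retrieved chunks are actually close to the question.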

# LLM Integration

We use Llama 3.2, which has a knowledge cutoff of December 2023.

In [25]:
import requests
import json

def ask_ollama(query, context=None):

    context = "-".join(context) if context else "No context provided."
    
    # Strict context-only prompt. It is defined here for comparison but is not sent below.
    prompt = f"""
            You are an advanced Retrieval-Augmented Generation (RAG) system designed to provide highly accurate and contextually relevant responses. Use *only* the information provided in the context below to generate your answer. Do not use any prior knowledge or external sources. If the context does not contain enough information to answer the question, explicitly state: "I cannot answer this question based on the provided information."

            ## Instructions:
            - Analyze the retrieved context carefully to extract the most relevant details.
            - Ensure that your answer is comprehensive, well-structured, and directly addresses the user's question.
            - If multiple pieces of evidence exist in the context, synthesize them for a cohesive response.
            - If the context is unclear, ambiguous, or conflicting, acknowledge this uncertainty in your response.
            - Do not assume or infer facts beyond what is stated in the provided context.

            ## Context:
            {context}

            ## Question:
            {query}

            ## Answer:
            """
    
    # Prompt that falls back on the model's own knowledge; this is the one placed in the payload.
    grok_prompt = f"""
        ### Prompt for RAG System

            **Instruction:**
            You are an AI designed to answer queries using a two-step process involving context retrieval and knowledge-based answering. Here's how you should proceed:

            1. **Context Retrieval (Step 1):**
            - **Context:** {context}
            - **Query:** {query}

            First, attempt to answer the query using the provided context. Look for relevant information within the context that directly relates to the query. If you can answer the query comprehensively using only this context, do so. If you cannot:

            2. **Knowledge-Based Answer (Step 2):**
            - If the context does not provide enough information to answer the query accurately, or if the query is not adequately addressed by the context, use your pre-existing knowledge to answer the query. 
            - Be clear that you are now using your knowledge by starting your response with "Based on my knowledge:".

            **Guidelines:**
            - **Accuracy:** Prioritize accuracy. If the context does not provide a clear answer and your knowledge is uncertain or outdated, acknowledge this by saying, "I'm not certain about this, but based on my knowledge:".
            - **Completeness:** If part of the query can be answered with context but not fully, use context for what you can and supplement with knowledge.
            - **Citations:** When answering from context, if possible, reference or quote directly from the context by using quotation marks or by specifying where in the context the answer was found (e.g., "According to the context...").
            - **Admit Limitations:** If neither the context nor your knowledge can provide an answer, admit this by saying, "I do not have enough information to answer this query adequately."

            **Example Response Formats:**

            - **From Context:** "The context states that the boiling point of water at sea level is 100°C."
            - **From Knowledge:** "Based on my knowledge, the average adult human body contains approximately 60% water."
            - **Mixed:** "From the context, we learn that the Eiffel Tower was completed in 1889. Based on my knowledge, it was designed by Gustave Eiffel."
            - **Admitting Limitation:** "I do not have enough information to answer this query adequately."

            **Proceed:**
            Now, attempt to answer the query provided:

            **Query:** {query}

            Your answer should simply explain your understanding of the question. Do not list steps or anything else; just explain the concept. Do not say "Based on my knowledge" or "Based on the provided context" in your answer.
            Answer the question directly, in detail, and in simple terms.
            DO NOT MENTION THE CONTEXT OR THE QUESTION IN YOUR ANSWER.
    
    """
    
    OLLAMA_URL = "http://localhost:11434/api/generate"

    payload = {
        "model": "llama3.2",  
        "prompt": grok_prompt,
        "stream": False 
    }

    response = requests.post(OLLAMA_URL, json=payload)


    if response.status_code == 200:
        data = response.json()
        print(data["response"])
    else:
        print(f"Error: {response.status_code}, {response.text}")
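
`ask_ollama` prints the answer, so its callers receive `None`. One possible variation is sketched below: build the payload and parse the reply in small functions that return values, which can then be checked without a running server. `build_payload` and `extract_answer` are hypothetical names, and the reply dictionary is a stand-in rather than real model output.

```python
def build_payload(prompt, model="llama3.2"):
    # Same fields as the payload above; stream=False asks for a single JSON object.
    return {"model": model, "prompt": prompt, "stream": False}

def extract_answer(status_code, body):
    """Return the generated text, or raise if the call failed."""
    if status_code != 200:
        raise RuntimeError(f"Ollama returned HTTP {status_code}")
    if "response" not in body:
        raise KeyError("no 'response' field in the Ollama reply")
    return body["response"].strip()

# Stand-in for response.status_code and response.json().
reply = {"response": " The mean is a measure of central tendency. "}
print(extract_answer(200, reply))
```

Returning the text instead of printing it lets the caller decide whether to display, log, or post-process the answer.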


In [26]:
qn1 = "What is central tendency?"
qn2 = "What do non-graphical methods of data presentation involve?"
qn3 = "Is the IQR a robust measure of spread?"

In [27]:
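def ask_rag(question):
    # Retrieve the four chunks most similar to the question.
    response_query = collection.query(query_texts=[question], n_results=4)
    context = response_query["documents"][0]
    # ask_ollama prints the model's answer and returns None, so nothing is returned here.
    ask_ollama(question, context)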

In [28]:
ask_rag(qn1)

Central tendency is a statistical measure that describes the central or typical value of a dataset. It provides an overview of the data's main characteristics, allowing for easy interpretation and comparison. The most commonly used measures of central tendency are the mean, median, and mode.

The mean is calculated by summing up all the values in the dataset and dividing by the number of values. It is sensitive to extreme values and can be affected by outliers.

The median is the middle value in a sorted dataset. If there is an even number of values, the median is the average of the two middle values. The median is more resistant to outliers than the mean.

The mode is the most frequently occurring value in the dataset. A dataset can have one mode (unimodal), two modes (bimodal), or no mode at all ( multimodal).

Central tendency is important because it helps us understand the main characteristics of a dataset, which can inform decisions and predictions. It is widely used in statistics
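
The claims about outlier sensitivity in the answer above can be checked with a small worked example using Python's `statistics` module. The numbers are made up for illustration. `statistics.multimode` returns every most-frequent value, so a sample with several modes yields several values rather than none.

```python
import statistics

data = [2, 3, 3, 4, 5]
with_outlier = data + [100]  # one extreme value added

# The mean moves a lot; the median barely moves.
print(statistics.mean(data), statistics.median(data))                  # 3.4 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 19.5 3.5

# Two values tie for most frequent, so both are reported.
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```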

In [29]:
ask_rag(qn2)

Non-graphical methods of data presentation involve statistical methods such as calculating means, variances, standard deviations, t-tests, ANOVA, regression analysis, hypothesis testing, confidence intervals, and non-parametric tests. These methods use mathematical formulas and equations to summarize and analyze data without the need for visual representations like charts or graphs.


In [21]:
ask_rag(qn3)

The IQR (Interquartile Range) is a robust measure of spread because it is not significantly affected by extreme values or outliers. Half of the data points fall within an interval whose width equals the IQR, meaning that even if most values are concentrated in the middle part of the distribution, the IQR remains relatively stable and resistant to changes caused by outliers. This property makes the IQR a more robust measure of spread compared to variance or standard deviation.
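
A short numeric check of the robustness claim, using made-up values: replacing the largest observation with an extreme one leaves the IQR unchanged while the standard deviation grows sharply. The helper `iqr` is hypothetical and relies on the default (exclusive) method of `statistics.quantiles`; other quartile conventions give slightly different numbers.

```python
import statistics

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

data = [10, 12, 13, 15, 16, 18, 19, 21]
with_outlier = data[:-1] + [210]  # replace the largest value with an extreme one

print(iqr(data), iqr(with_outlier))  # 6.5 6.5
print(statistics.stdev(data), statistics.stdev(with_outlier))
```

The quartiles depend only on the values near the 25% and 75% positions, which is why the single extreme value does not affect them here.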

