# Preprocessing and Modeling

This notebook preprocesses the data scraped from `arXiv.org` on "Alzheimer's Disease". The goal of preprocessing is to prepare the data for use by the LLM as part of the Retrieval Augmented Generation (RAG) architecture shown below:

<br>

<div style="text-align: center;">
  <img src="img/RAG-Architecture.png" alt="rag" width="600"/><br>
  <em>Picture reference: Litvinov, A. (2024, Feb 19). How was @ZoomcampQABot made? 
  <a href="https://docs.google.com/presentation/d/1Z__Qo7g8j6TWxMN0yxmVeXGyji0QmA2q4zTCQt4Zgs4/edit?slide=id.p#slide=id.p" target="_blank">Google Slides presentation</a></em>
</div>




<br>


**Note:** Due to limited computational resources, I am using a small LLM and a small dataset. The point of this exercise is to illustrate a RAG pipeline that has access to online data.

# 0. Model Response before RAG

In [1]:
from transformers import pipeline

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

ask_llm = pipeline(model=model_name)


Device set to use mps:0


In [2]:
prompt_1 = "At what age do people usually start showing AD symptoms?"
prompt_2 = "What is the latest development in treating AD?"
prompt_3 = "At what age do people usually start showing Alzheimer's Disease symptoms?"
prompt_4 = prompt_3 + " Give me a number."

In [3]:
llm_response_1 = ask_llm(prompt_1)[0]["generated_text"]
print(llm_response_1)

At what age do people usually start showing AD symptoms?


In [4]:
llm_response_2 = ask_llm(prompt_2)[0]["generated_text"]
print(llm_response_2)

What is the latest development in treating AD?


In [5]:
llm_response_3 = ask_llm(prompt_3)[0]["generated_text"]
print(llm_response_3)

At what age do people usually start showing Alzheimer's Disease symptoms?


Without RAG, the model keeps repeating the question without providing any answer. Let's try changing the prompt by giving the model a hint. We get a slightly more informative response, but it is still not quite right.

In [6]:
llm_response_4 = ask_llm(prompt_4)[0]["generated_text"]
print(llm_response_4)

At what age do people usually start showing Alzheimer's Disease symptoms? Give me a number.
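
The `generated_text` field returned by a text-generation pipeline contains the prompt followed by whatever the model appended, so an output identical to the prompt means nothing new was generated. The helper below is a minimal sketch for separating the two. `split_echo` is a hypothetical name (not part of `transformers`), and the continuation in the second call is made up for illustration. The pipeline call also accepts `return_full_text=False`, which should drop the prompt on the library side.

```python
def split_echo(prompt: str, generated_text: str) -> tuple[bool, str]:
    """Return (is_echo_only, continuation) for a text-generation output."""
    if generated_text.startswith(prompt):
        # everything after the echoed prompt is the model's own contribution
        continuation = generated_text[len(prompt):].strip()
    else:
        continuation = generated_text.strip()
    return continuation == "", continuation


prompt = "At what age do people usually start showing AD symptoms?"
print(split_echo(prompt, prompt))                         # echo only
print(split_echo(prompt, prompt + " Usually after 65."))  # made-up continuation
```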


The focus of this notebook is not on prompt engineering. One could argue that with more trial and error we could find a prompt that works relatively well. The idea, however, is to keep prompt engineering to a minimum and rely on facts instead.

# 1. Preprocessing

## 1.1. Setup
The following packages need to be installed only once, so the commands are commented out.

In [7]:
# !pip install llama-index
# !pip install llama-index-embeddings-huggingface
# !pip install llama-index-llms-huggingface
# !pip install sentence-transformers

In [7]:
import pandas as pd  # for handling structured data, here .json
import requests # to download files or make HTTP requests
import nest_asyncio  # allows running async code in environments like Jupyter notebooks
nest_asyncio.apply()  # applies the asyncio patch to enable nested event loops

from llama_index.core import SimpleDirectoryReader  # for loading documents from a local directory
from llama_index.core import Document  # used to convert raw text into Document objects for processing
from llama_index.core.node_parser import SentenceSplitter  # for splitting documents into smaller text chunks (nodes)
from llama_index.core import Settings  # to configure global settings like LLMs or embedding models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # to use Hugging Face models for generating embeddings
from llama_index.core import VectorStoreIndex  # to build a vector index for retrieval and search
from transformers import pipeline  # provides access to Hugging Face pre-trained models for tasks like text generation or classification
from huggingface_hub import notebook_login  # to authenticate with Hugging Face and access gated models or datasets
from llama_index.llms.huggingface import HuggingFaceLLM  # to use a Hugging Face language model (LLM) as the backend for generating responses in LlamaIndex
from llama_index.core.indices.list import ListIndex # import a simple sequential index for storing and querying documents as a list
from sentence_transformers import SentenceTransformer, util  # to embed text and compute cosine similarity for semantic search

## 1.2. Load the Data

In [8]:
df = pd.read_json('data/alzheimer.json').T
df.head()

| | link | published | title | summary | authors | author | arxiv_affiliation |
|---|---|---|---|---|---|---|---|
| 0 | http://arxiv.org/abs/2111.08794v2 | 2021-11-16T21:48:09Z | Investigating Conversion from Mild Cognitive I... | Alzheimer's disease is the most common cause o... | [{'name': 'Deniz Sezin Ayvaz'}, {'name': 'Inci... | Inci M. Baytas | |
| 1 | http://arxiv.org/abs/1411.4221v1 | 2014-11-16T06:39:23Z | A dynamic mechanism of Alzheimer based on arti... | In this paper, we provide another angle to ana... | [{'name': 'Zhi Cheng'}] | Zhi Cheng | |
| 2 | http://arxiv.org/abs/1509.02273v2 | 2015-09-08T08:02:18Z | Reduction of Alzheimer's disease beta-amyloid ... | Alzheimer's disease is the most common form of... | [{'name': 'T. Harach'}, {'name': 'N. Marungrua... | T. Bolmont | |
| 3 | http://arxiv.org/abs/2409.05989v1 | 2024-09-09T18:31:39Z | A Comprehensive Comparison Between ANNs and KA... | Alzheimer's Disease is an incurable cognitive ... | [{'name': 'Akshay Sunkara'}, {'name': 'Sriram ... | Himesh Anumala | |
| 4 | http://arxiv.org/abs/2402.11931v1 | 2024-02-19T08:18:52Z | Soft-Weighted CrossEntropy Loss for Continous ... | Alzheimer's disease is a common cognitive diso... | [{'name': 'Xiaohui Zhang'}, {'name': 'Wenjie F... | Mangui Liang | |


## 1.3. Convert to Document

The most important field is the "summary" column (the abstract), which carries the most information. So, I use the text in that column to feed the model.


In [9]:
documents = [Document(text=s) for s in df["summary"].tolist()]


## 1.4. Split to Chunks/Nodes

In [10]:
# initiate a splitter
splitter = SentenceSplitter(chunk_size=200,
                           chunk_overlap=20)

# create nodes from the documents using the splitter
nodes = splitter.get_nodes_from_documents(documents)
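
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is a simplification and not how `SentenceSplitter` works internally: the real splitter measures length in tokens and tries to respect sentence boundaries, while this sketch (with the hypothetical helper `chunk_words`) counts whitespace-separated words only.

```python
def chunk_words(text: str, chunk_size: int = 200, chunk_overlap: int = 20) -> list[str]:
    """Split text into word windows of chunk_size that overlap by chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(toy_text, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last word of the previous one, which is the purpose of the overlap: a sentence cut at a chunk boundary still appears with some of its neighboring context in the next chunk.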

## 1.5. Vectorize the Data

In [11]:
# initiate an embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# vectorize the data
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=Settings.embed_model
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
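
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell runs before the tokenizer is first used in the process (for example, at the top of the notebook); choosing `"false"` is a judgment call that gives up parallel tokenization in exchange for avoiding the fork warning.

```python
import os

# Set this before the tokenizer is first used in this process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```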


# 2. Modeling

In this section, we feed the prepared (vectorized) data to the model.

## 2.1. Logging in and Defining Model

In [12]:
# log in to your Hugging Face account; you need to create an access token first
notebook_login()


In [13]:

# define your model with the desired parameters
llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tokenizer_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # match model and tokenizer
    device_map="cpu",
    context_window=2048,
    max_new_tokens=256,
)


# setting this Hugging Face model (llm) as the default LLM
Settings.llm = llm


## 2.2. Invoke LLM

In [14]:
query_engine = index.as_query_engine()

In [15]:
RAG_response_1 = query_engine.query(prompt_1) # Reminder: prompt_1 = "At what age do people usually start showing AD symptoms?"
print(RAG_response_1)

65 years of age is the age at which most people start showing AD symptoms.


In [16]:
RAG_response_2 = query_engine.query(prompt_2) # Reminder: prompt_2 = "What is the latest development in treating AD?"
print(RAG_response_2)

Multitarget molecules, especially those targeting neuronal membrane
protection, could offer a comprehensive approach to AD therapy, advocating for
further research into their mechanisms and therapeutic potential.


In [17]:
RAG_response_3 = query_engine.query(prompt_3) # Reminder: prompt_3 = "At what age do people usually start showing Alzheimer's Disease symptoms?"
print(RAG_response_3)

65 years or older.


In [18]:
RAG_response_4 = query_engine.query(prompt_4) # Reminder: prompt_4 = prompt_3 + " Give me a number."
print(RAG_response_4)

65 years is the age at which Alzheimer's Disease symptoms usually start showing.


To save computational resources, one can use a subset of the data together with LlamaIndex's `ListIndex` as an alternative to the vector index.

The queries on the smaller dataset are provided below:

In [19]:
# use only part of the data, as the whole dataset could not be processed with the current computational resources
small_node_list = nodes[:5]

# for the same reason, use ListIndex instead of the vector index defined above
small_index = ListIndex(nodes=small_node_list)
query_engine = small_index.as_query_engine()

response = query_engine.query(prompt_1)
print(response)


60-70 years old is the typical age range for showing AD symptoms.


In [20]:
response = query_engine.query(prompt_2)
print(response)


The latest development in treating AD is the development of new drugs and
therapies. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new drugs and therapies is a crucial step in
treating AD. The development of new


In [21]:
response = query_engine.query(prompt_3)
print(response)

65 years of age is the average age at which Alzheimer's Disease symptoms
begin to appear.


In [22]:
response = query_engine.query(prompt_4)
print(response)

65 years old is the average age at which Alzheimer's Disease symptoms are first noticed.


We see that even the simplified version provides an acceptable answer, although not as elaborate.

Having a large amount of data, as we do here, seems to confuse the model. A recent study shows that "...while [LLMs] perform well in short contexts (<1K), performance degrades significantly as context length increases" ([Ref](https://arxiv.org/pdf/2502.05167v3)).

One way to circumvent this issue and boost model performance is to retrieve the relevant chunks from the context using semantic search. This is followed by reranking the most relevant chunks and using the one with the highest score. The chart below elaborates on the architecture.

<br>

<div style="text-align: center;">
  <img src="img/Retrieval.png" alt="rag" width="700"/><br>
  <em>Picture reference: Litvinov, A. (2024, Feb 19). How was @ZoomcampQABot made? 
  <a href="https://docs.google.com/presentation/d/1Z__Qo7g8j6TWxMN0yxmVeXGyji0QmA2q4zTCQt4Zgs4/edit?slide=id.p#slide=id.p" target="_blank">Google Slides presentation</a></em>
</div>
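
The retrieve-then-rerank flow described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's implementation: the first-stage scores and texts in `toy_candidates` are made up, and `word_overlap` is a toy stand-in for a real reranker (in practice, the second stage is often a dedicated model such as a cross-encoder).

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    candidates: list[tuple[str, float]],
    rerank_score: Callable[[str, str], float],
    top_k: int = 3,
) -> tuple[str, float]:
    """Keep the top_k candidates by first-stage score, then return the best one under rerank_score."""
    shortlist = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:top_k]
    reranked = [(text, rerank_score(query, text)) for text, _ in shortlist]
    return max(reranked, key=lambda pair: pair[1])


def word_overlap(query: str, text: str) -> float:
    """Toy reranker: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# (text, first-stage similarity) pairs; values are invented for the example
toy_candidates = [
    ("symptoms usually appear after age 65", 0.58),
    ("deep learning for MRI classification", 0.61),
    ("protein folding simulation", 0.12),
]
best_text, best_score = retrieve_then_rerank(
    "at what age do symptoms appear", toy_candidates, word_overlap, top_k=2
)
print(best_text, best_score)
```

In this toy case the candidate with the highest first-stage score is not the one that survives reranking, which is the situation the second stage is meant to catch.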




<br>


I use the "all-MiniLM-L6-v2" embedding model for semantic search.

In [27]:

# Load embedder on CPU to avoid MPS issues
embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")


The corpus is too big as is, and it causes out-of-memory errors.

For that reason, I use the nodes I created above by chunking the text, which makes the corpus smaller.

In [28]:
corpus = [node.text for node in nodes[:5000]]

In [29]:
corpus_embeddings_np = embedder.encode(
    corpus,
    batch_size=32,
    convert_to_numpy=True,
    show_progress_bar=True
)

print("Corpus length:", len(corpus))                    # Should be 5000
print("Embeddings shape:", corpus_embeddings_np.shape)  # Should be (5000, 384)


Batches:   0%|          | 0/157 [00:00<?, ?it/s]

Corpus length: 5000
Embeddings shape: (5000, 384)
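
The numbers in this output can be checked with simple arithmetic: 5000 chunks at `batch_size=32` give the 157 batches shown in the progress bar, and the resulting matrix is small. The memory estimate below assumes the embeddings are stored as 32-bit floats (4 bytes per value), which is an assumption rather than something the output confirms.

```python
import math

n_chunks, batch_size, dim = 5000, 32, 384

n_batches = math.ceil(n_chunks / batch_size)  # the last batch is only partially filled
embedding_bytes = n_chunks * dim * 4          # assuming float32: 4 bytes per value

print(n_batches)              # 157
print(embedding_bytes / 1e6)  # 7.68 (megabytes)
```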


In [30]:
import torch  # needed for the tensor operations below

# Given a query, return the top_k most similar corpus chunks together with their cosine similarity scores
def semantic_search(query, corpus, corpus_embeddings_np, embedder, top_k=5):
    query_embedding_np = embedder.encode(query, convert_to_numpy=True)
    query_embedding = torch.tensor(query_embedding_np, dtype=torch.float32).cpu()  # use the CPU explicitly to avoid MPS issues
    corpus_tensor = torch.tensor(corpus_embeddings_np, dtype=torch.float32).cpu()

    similarity_scores = util.cos_sim(query_embedding, corpus_tensor)[0]

    k = min(top_k, len(corpus))
    scores, indices = torch.topk(similarity_scores, k=k)

    results = [(corpus[i], scores[idx].item()) for idx, i in enumerate(indices.tolist())]
    return results
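
For readers who want to see what the cosine-similarity ranking in `semantic_search` computes, here is a dependency-free sketch on toy two-dimensional vectors. The helper names `cosine` and `top_k_by_cosine` are hypothetical, and the vectors are invented; the real function works on 384-dimensional embeddings through `torch`.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_by_cosine(query_vec, corpus_vecs, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to query_vec."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k_by_cosine([1.0, 0.0], toy_corpus, k=2))
```

The vector pointing in the same direction as the query ranks first, the diagonal one second, and the orthogonal one is dropped.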


In [31]:
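# Retrieve the most relevant chunks and pass them, together with the question, to the query engine
def generate_answer(query):
    retrieved = semantic_search(query, corpus, corpus_embeddings_np, embedder)
    retrieved_docs = [text for text, score in retrieved]

    context = "\n\n".join(retrieved_docs)

    # combine the retrieved context with the question
    prompt = f"Context:\n{context}\n\nQuestion: {query}"

    response = query_engine.query(prompt)
    return str(response)  # the Response object renders its answer text via str()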
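
Since long contexts hurt a small model, it can help to cap how much retrieved text goes into the prompt. The sketch below shows one way to assemble a prompt under a character budget. `build_prompt`, the instruction wording, and the `max_context_chars` default are all assumptions for illustration; a character count is only a rough proxy for the token-based `context_window` configured earlier.

```python
def build_prompt(query: str, retrieved_docs: list[str], max_context_chars: int = 4000) -> str:
    """Join retrieved chunks into a context block, stopping before the character budget is exceeded."""
    kept, used = [], 0
    for doc in retrieved_docs:  # docs are assumed to be ordered from most to least relevant
        if used + len(doc) > max_context_chars:
            break
        kept.append(doc)
        used += len(doc)
    context = "\n\n".join(kept)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


print(build_prompt("At what age do symptoms start?", ["chunk A", "chunk B"], max_context_chars=10))
```

With a budget of 10 characters only the first chunk fits, so the second one is left out of the prompt.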


In [32]:
semantic_results_1 = semantic_search(prompt_1, corpus, corpus_embeddings_np, embedder, top_k=1)
print("Top results:", semantic_results_1)


Top results: [('The prevalence of\nAD around the world is on the rise, with a predicted 152\nmillion people to be affected by the disease in 2050 [1]. While\nAD is incurable, the early detection of AD has been found\nto help with managing cognitive symptoms and quality of life\nproblems that might be related to the disorder [2]. AD is found\nto be most prevalent in people who are 65 years and older,\nwhich makes it hard to detect early on as diminishing cognitive\ncapability can be otherwise attributed to old age within this\ngroup [3]. A solution that can aid with the early detection of\nAD for those of old age is crucial in ensuring that the highest\nquality treatment is possible for this group of people.', 0.6020330190658569)]


In [33]:
semantic_results_2 = semantic_search(prompt_2, corpus, corpus_embeddings_np, embedder, top_k=1)
print("Top results:", semantic_results_2)


Top results: [('The prevalence of\nAD around the world is on the rise, with a predicted 152\nmillion people to be affected by the disease in 2050 [1]. While\nAD is incurable, the early detection of AD has been found\nto help with managing cognitive symptoms and quality of life\nproblems that might be related to the disorder [2]. AD is found\nto be most prevalent in people who are 65 years and older,\nwhich makes it hard to detect early on as diminishing cognitive\ncapability can be otherwise attributed to old age within this\ngroup [3]. A solution that can aid with the early detection of\nAD for those of old age is crucial in ensuring that the highest\nquality treatment is possible for this group of people.', 0.5587400197982788)]


In [34]:
semantic_results_3 = semantic_search(prompt_3, corpus, corpus_embeddings_np, embedder, top_k=1)
print("Top results:", semantic_results_3)

Top results: [("Furthermore, Alzheimer's \ndisease does not appear suddenly; dementia symptoms \nemerge gradually. Memory loss is limited in the \nbeginning stages of Alzheimer's, but people with late-\nstage Alzheimer's frequently lose their capacity to \ncommunicate and respond to their surroundings \n(Alzheimer's Association, 2019). Even with advances in \nneuroimaging techniques, physicians and doctors have \ndifficulty \ndiagnosing \nAlzheimer's \ndisease \nstages. \nApproximately 5.8 million Americans of all ages lived \nwith Alzheimer's dementia in 2019. The distribution of \nthis is given in Fig. 1. Manual diagnosis of Alzheimer's \ndisease \nis \nsubjective \nand \ntime-consuming \nand \ngeriatricians are sometimes needed to determine the exact \nstage of the disease.", 0.7148532867431641)]
