# Preprocessing and Modeling

This notebook preprocesses data on "Alzheimer's Disease" scraped from `arXiv.org`. The goal of preprocessing is to prepare the data for use by an LLM as part of the Retrieval-Augmented Generation (RAG) architecture shown below:

<br>

<div style="text-align: center;">
  <img src="RAG-Architecture.png" alt="rag" width="600"/><br>
  <em>Picture reference: Litvinov, A. (2024, Feb 19). How was @ZoomcampQABot made? 
  <a href="https://docs.google.com/presentation/d/1Z__Qo7g8j6TWxMN0yxmVeXGyji0QmA2q4zTCQt4Zgs4/edit?slide=id.p#slide=id.p" target="_blank">Google Slides presentation</a></em>
</div>




<br>

The focus of this notebook is mostly on the "ingestion" part of this architecture.

# 1. Preprocessing

## 1.1. Setup
The following packages need to be installed only once, so the commands are commented out.

In [1]:
# !pip install llama-index
# !pip install llama-index-embeddings-huggingface
# !pip install llama-index-llms-huggingface

In [2]:
import pandas as pd  # for handling structured data, here .json

import nest_asyncio  # allows running async code in environments like Jupyter notebooks
nest_asyncio.apply()  # applies the asyncio patch to enable nested event loops

from llama_index.core import SimpleDirectoryReader  # for loading documents from a local directory

from llama_index.core import Document  # used to convert raw text into Document objects for processing

from llama_index.core.node_parser import SentenceSplitter  # for splitting documents into smaller text chunks (nodes)

from llama_index.core import Settings  # to configure global settings like LLMs or embedding models
from llama_index.embeddings.huggingface import HuggingFaceEmbedding  # to use Hugging Face models for generating embeddings

from llama_index.core import VectorStoreIndex  # to build a vector index for retrieval and search

from transformers import pipeline  # provides access to Hugging Face pre-trained models for tasks like text generation or classification
from huggingface_hub import notebook_login  # to authenticate with Hugging Face and access gated models or datasets

from llama_index.llms.huggingface import HuggingFaceLLM  # to use a Hugging Face language model (LLM) as the backend for generating responses in LlamaIndex

from llama_index.core.indices.list import ListIndex # import a simple sequential index for storing and querying documents as a list



## 1.2. Load the Data

In [3]:
df = pd.read_json('alzheimer.json').T
df.head()

| Unnamed: 0 | link | published | title | summary | authors | author | arxiv_affiliation |
|---|---|---|---|---|---|---|---|
| 0 | http://arxiv.org/abs/2111.08794v2 | 2021-11-16T21:48:09Z | Investigating Conversion from Mild Cognitive I... | Alzheimer's disease is the most common cause o... | [{'name': 'Deniz Sezin Ayvaz'}, {'name': 'Inci... | Inci M. Baytas | |
| 1 | http://arxiv.org/abs/1411.4221v1 | 2014-11-16T06:39:23Z | A dynamic mechanism of Alzheimer based on arti... | In this paper, we provide another angle to ana... | [{'name': 'Zhi Cheng'}] | Zhi Cheng | |
| 2 | http://arxiv.org/abs/1509.02273v2 | 2015-09-08T08:02:18Z | Reduction of Alzheimer's disease beta-amyloid ... | Alzheimer's disease is the most common form of... | [{'name': 'T. Harach'}, {'name': 'N. Marungrua... | T. Bolmont | |
| 3 | http://arxiv.org/abs/2409.05989v1 | 2024-09-09T18:31:39Z | A Comprehensive Comparison Between ANNs and KA... | Alzheimer's Disease is an incurable cognitive ... | [{'name': 'Akshay Sunkara'}, {'name': 'Sriram ... | Himesh Anumala | |
| 4 | http://arxiv.org/abs/2402.11931v1 | 2024-02-19T08:18:52Z | Soft-Weighted CrossEntropy Loss for Continous ... | Alzheimer's disease is a common cognitive diso... | [{'name': 'Xiaohui Zhang'}, {'name': 'Wenjie F... | Mangui Liang | |


## 1.3. Convert to Document

The most important field is the "summary" column, which holds each paper's abstract and carries the most information. So, I use the text in that column to feed the model.

In [4]:
# convert your list of strings into Document objects
documents = [Document(text=s) for s in df["summary"].tolist()]
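
Abstracts scraped from arXiv may contain hard line breaks, and a scraped table can have missing entries. The sketch below shows one way to normalize the summaries before wrapping them in `Document` objects. The helper name `clean_summaries` and the sample strings are hypothetical, not part of the original notebook.

```python
import math
import re


def clean_summaries(raw_summaries):
    """Return whitespace-normalized abstracts, skipping missing or empty entries."""
    cleaned = []
    for item in raw_summaries:
        # pandas may represent missing values as None or as float NaN
        if item is None or (isinstance(item, float) and math.isnan(item)):
            continue
        # collapse line breaks and repeated spaces into single spaces
        text = re.sub(r"\s+", " ", str(item)).strip()
        if text:
            cleaned.append(text)
    return cleaned


# made-up sample: one usable abstract, three entries that should be dropped
sample = ["Alzheimer's disease is\n  the most common cause", None, float("nan"), "   "]
print(clean_summaries(sample))
```

Under these assumptions, the cleaned list could replace `df["summary"].tolist()` in the cell above.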

## 1.4. Split to Chunks/Nodes

In [5]:
# initiate a splitter
splitter = SentenceSplitter(chunk_size=200,
                           chunk_overlap=20)

# create nodes from the documents using the splitter
nodes = splitter.get_nodes_from_documents(documents)
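
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is a simplification and not how `SentenceSplitter` is implemented: the real splitter counts tokenizer tokens and prefers sentence boundaries, while this illustration works on a plain list of placeholder tokens. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size=200, chunk_overlap=20):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


# 450 placeholder tokens standing in for one long abstract
tokens = [f"w{i}" for i in range(450)]
chunks = sliding_window_chunks(tokens)
print([len(c) for c in chunks])
# the last 20 tokens of one chunk reappear at the start of the next
print(chunks[0][-20:] == chunks[1][:20])
```

The overlap means a sentence that straddles a chunk boundary is still seen whole in at least one chunk, at the cost of storing some text twice.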

## 1.5. Vectorize the Data

In [6]:
# initiate an embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

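# vectorize the data
# note: from_documents() chunks the documents again with the default node parser,
# so the `nodes` created in 1.4 are not used by this index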
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=Settings.embed_model
)
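
Conceptually, a vector index answers a query by embedding it and returning the stored chunks whose vectors are most similar. The sketch below illustrates that idea with cosine similarity over hand-made 3-dimensional vectors. The vectors, chunk labels, and function names are made up for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and this is not the library's internal code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# toy store of (chunk text, embedding) pairs; the numbers are invented
toy_store = [
    ("chunk about age of onset", [0.9, 0.1, 0.0]),
    ("chunk about MRI classification", [0.1, 0.9, 0.1]),
    ("chunk about beta-amyloid", [0.2, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of a question about onset age
print(top_k(toy_query, toy_store, k=1))
```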

# 2. Modeling

In this section, we use the prepared (vectorized) data as the retrieval source for the LLM.

## 2.1. Logging in and Defining Model

In [7]:
# log in to your Hugging Face account; you need to create an access token first
notebook_login()

In [8]:

# define your model with the desired parameters
llm = HuggingFaceLLM(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tokenizer_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # match model and tokenizer
    device_map="auto",
    context_window=2048,
    max_new_tokens=256,
)


# setting this Hugging Face model (llm) as the default LLM
Settings.llm = llm
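
The `context_window` and `max_new_tokens` values above jointly limit how much retrieved text can be sent to the model, since the prompt and the generated answer share one window. The arithmetic below is a rough sketch of that budget. The helper names and the 100 tokens reserved for the question and prompt template are assumptions for illustration; the chunk size of 200 is taken from section 1.4.

```python
def prompt_budget(context_window, max_new_tokens):
    """Tokens left for the prompt once room for the generated answer is reserved."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be positive and smaller than context_window")
    return context_window - max_new_tokens


def chunks_that_fit(budget, chunk_size, reserved_tokens):
    """Number of full chunks that fit after setting aside tokens for question and template."""
    return max(0, (budget - reserved_tokens) // chunk_size)


budget = prompt_budget(context_window=2048, max_new_tokens=256)
print(budget)  # 1792
# reserved_tokens=100 is a hypothetical allowance, not a measured value
print(chunks_that_fit(budget, chunk_size=200, reserved_tokens=100))  # 8
```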


## 2.2. Invoke LLM

In [9]:
query_engine = index.as_query_engine()

response = query_engine.query("What is the onset of AD age in people?")
print(response)

The onset of Alzheimer's disease (AD) is typically in the late 40s to
early 60s, with a median age of onset of 65 years.


To save computational resources, one can instead query a small subset of the nodes with `ListIndex`, a LlamaIndex index that stores nodes as a plain sequence instead of building a vector index.

The query against this list index is provided below:


In [10]:
# use only part of the data, as the whole dataset could not be analyzed with the current computational resources
small_node_list = nodes[:5]

# for the same reason, use ListIndex instead of the vector index defined above
small_index = ListIndex(nodes=small_node_list)
query_engine = small_index.as_query_engine()

response = query_engine.query("What is the onset of AD age in people?")
print(response)


65 years old is the onset age for Alzheimer's disease.


Even the simplified version gives an acceptable answer, although a less elaborate one.

# 3. Evaluation