# Article Retrieval System with Gemma and LangChain

## Introduction

#### This notebook focuses on two primary objectives:
   - Constructing a system designed to retrieve relevant article fragments according to a given query.
   - Using the retrieved fragments to augment the input to a generative model, producing a RAG Q&A system.

I'm going to use the articles from the [1300+ Towards DataScience Medium Articles Dataset](https://www.kaggle.com/datasets/meruvulikith/1300-towards-datascience-medium-articles-dataset).

## Installation and imports


In [None]:
!pip install -q -U transformers accelerate bitsandbytes langchain sentence-transformers

In [None]:
import pandas as pd
import re
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DataFrameLoader
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer,BitsAndBytesConfig
from sentence_transformers import SentenceTransformer, util,CrossEncoder
from IPython.display import display, Markdown
import logging
import warnings

In [None]:
# Hide warnings
warnings.filterwarnings('ignore')
# Disable logging for transformers library
logging.getLogger("transformers").setLevel(logging.ERROR)
# Disable logging for the TensorFlow logger
logging.getLogger("tensorflow").setLevel(logging.ERROR)

## Data cleaning and exploration

In [None]:
df=pd.read_csv('/kaggle/input/medium/medium.csv')
df.describe()

### Handling duplicate Article Titles

In [None]:
mask = df.groupby('Title')['Title'].transform('size') > 1
non_unique_rows = df[mask]
non_unique_rows

In [None]:
for i,row in enumerate(non_unique_rows['Text']):
    print(f'Text {i}')
    print(row)
    print('--------------------------------------------------------------------------')

### Remove the duplicate row - the texts with the same article title are identical.

In [None]:
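# Drop rows whose title and text both repeat an earlier row, instead of hard-coding a row index
df = df.drop_duplicates(subset=['Title', 'Text'], keep='first')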

### Remove emojis

In [None]:
def remove_emojis(text):
    """Strip characters in the listed emoji code-point ranges from text."""
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)

df['Title'] = df['Title'].apply(remove_emojis)
df['Text'] = df['Text'].apply(remove_emojis)
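
A quick way to see what this kind of pattern does and does not catch is to run it on a couple of hand-made strings. The sketch below is standalone and uses the same four code-point ranges; the name `strip_emojis` and the sample strings are only illustrative. It suggests that newer emoji blocks (for example U+1F9E0, the brain emoji) fall outside these ranges and would survive the cleaning step.

```python
import re

# Same four code-point ranges as the cleaning function in this notebook.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def strip_emojis(text):
    """Remove every character that falls inside the ranges above."""
    return EMOJI_PATTERN.sub("", text)


# Rocket (U+1F680) and grinning face (U+1F600) are inside the ranges.
covered = strip_emojis("Ship it \U0001F680 today \U0001F600")
# Brain (U+1F9E0) lies in a later Unicode block and is left untouched.
not_covered = strip_emojis("Neural nets \U0001F9E0")

print(repr(covered))
print(repr(not_covered))
```

If the dataset turns out to contain such characters, extending the character class with further ranges is one option.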





### Explore the data

In [None]:
df_stats=df.copy()
df_stats['Text_length'] = df['Text'].apply(len)
df_stats['Words_count'] = df['Text'].apply(lambda x: len(x.split(" ")))
df_stats['Sentence_count'] = df['Text'].apply(lambda x: len(x.split(". ")))
# Rough estimate, assuming about 4 characters per token
df_stats['Token_count'] = df_stats['Text_length'] / 4
df_stats.plot(y='Text_length')
df_stats.describe()


## Chunking
### I'll use LangChain's `RecursiveCharacterTextSplitter`. Splitting text recursively tries to keep related pieces of text next to each other.

### After experimenting with different chunk sizes, I found that a chunk size of 1024 characters works best. Chunks of this size carry enough context that the top-ranked chunks are likely to contain the information needed to answer a query.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1024,
    chunk_overlap  = 100
)

articles = DataFrameLoader(df, page_content_column = "Text")
document = articles.load()
text_chunks= text_splitter.split_documents(document)
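
To make the idea of recursive splitting more concrete, here is a minimal plain-Python sketch. It is not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions for illustration, and chunk overlap is left out. The idea it shows is to split on the coarsest separator first (paragraphs), and only fall back to finer separators (lines, words, single characters) for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Illustrative only: coarse separators are tried first, finer ones are
    used only for pieces that still exceed chunk_size. No overlap.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit together.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
sample_chunks = recursive_split(sample, chunk_size=12)
print(sample_chunks)
```

In this toy run the second paragraph is too long for one chunk, so it is split on spaces, while the short paragraphs stay intact.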

In [None]:
for i,chunk in enumerate(text_chunks[:3]):
   print(f'Chunk {i}')
   print(chunk.metadata)
   print(chunk.page_content)
   print('---------------------------------------------------------------------------------')

#### Convert chunk document to a DataFrame

In [None]:
chunks_df=pd.DataFrame()
chunks_df['Chunk']=[chunk.page_content for chunk in text_chunks]
chunks_df['Title']=[chunk.metadata['Title'] for chunk in text_chunks]
chunks_list=list(chunks_df['Chunk'])
chunks_df

## Embedding
### I'll use the `gte-large-en-v1.5` model for embedding. I chose it because it uses less memory than other models while still performing strongly. At the time of writing, it ranks 9th on the MTEB leaderboard, which can be viewed [here](https://huggingface.co/spaces/mteb/leaderboard).


In [None]:
embedding_model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True )

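Note: Creating the embeddings with this model takes some time, so I've also saved them to a CSV file. The next cell computes the embeddings from scratch (its commented-out lines save them to CSV). To reuse saved embeddings instead, uncomment the CSV-loading cell that follows it.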

In [None]:
embeddings = embedding_model.encode(chunks_list, convert_to_tensor=True, normalize_embeddings=True)
# embeddings_df = pd.DataFrame(embeddings.tolist())
# embeddings_df_save_path = "embeddings_df.csv"
# embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [None]:
# embeddings_df=pd.read_csv('/kaggle/input/embeddings/embeddings_df.csv')
# embeddings = torch.tensor(embeddings_df.values, dtype=torch.float32, device='cuda')
# embeddings

In [None]:
embeddings.shape

### Retrieval Without Reranking


In [None]:
def top_k_results_with_scores(query, embeddings,chunks_df, k_=5, embedding_model=embedding_model):
    """
    Calculates the dot product similarity scores between the query embedding and the embeddings of chunks in the corpus.
    Returns the top k_ chunks along with their corresponding similarity scores.

    Parameters:
    - query: The query string.
    - embeddings: Embeddings of the chunks.
    - chunks_df: DataFrame containing the chunks.
    - k_: Number of top chunks to retrieve (default is 5).
    - embedding_model: The model used for embedding (default is embedding_model).

    Returns:
    - An iterator of tuples containing the top k_ chunks from chunks_df along with their dot product similarity scores.
      Each tuple consists of a chunk (DataFrame row) and its corresponding score.
    """
    query_embedding=embedding_model.encode(query,convert_to_tensor=True,normalize_embeddings=True)
    dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
    top_results_dot_product = torch.topk(dot_scores, k=k_)
    return zip([chunks_df.loc[i] for i in top_results_dot_product[1].tolist()], top_results_dot_product[0].tolist())

In [None]:
def top_k_results(query, embeddings,chunks_df, k_=5,embedding_model=embedding_model):
    """
    Calculates the dot product similarity scores between the query embedding and the embeddings of chunks in the corpus.
    Returns the top k_ chunks.

    Parameters:
    - query: The query string.
    - embeddings: Embeddings of the chunks.
    - chunks_df: DataFrame containing the chunks.
    - k_: Number of top chunks to retrieve (default is 5).
    - embedding_model: The model used for embedding (default is embedding_model).

    Returns:
    - DataFrame containing the top k_ chunks from chunks_df.
    """
    query_embedding=embedding_model.encode(query,convert_to_tensor=True,normalize_embeddings=True)
    dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
    top_results_dot_product = torch.topk(dot_scores, k=k_)
    top_k=chunks_df.loc[top_results_dot_product[1].tolist()]
    top_k.reset_index(drop=True, inplace=True)
    return top_k
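
Because both the chunk embeddings and the query embedding are normalized to unit length, the dot product used here equals cosine similarity. The following standalone sketch illustrates the same top-k selection on tiny hand-made vectors in plain Python; the helper names (`normalize`, `dot`, `top_k_by_dot`) and the toy vectors are assumptions for illustration and stand in for the real embeddings and `torch.topk`.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(query_vec, corpus_vecs, k):
    """Return (index, score) pairs for the k highest dot-product scores."""
    scores = [dot(query_vec, v) for v in corpus_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Three toy "chunk embeddings" and one "query embedding", all unit length.
corpus = [normalize(v) for v in ([1.0, 0.0], [3.0, 4.0], [0.0, 2.0])]
query_vec = normalize([3.0, 1.0])

hits = top_k_by_dot(query_vec, corpus, k=2)
for index, score in hits:
    print(index, round(score, 4))
```

With unit-length vectors the scores lie in [-1, 1], which makes them easier to compare across queries than raw dot products of unnormalized vectors.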

#### Example query

In [None]:
query='What is the curse of dimensionality?'
for chunk, score in top_k_results_with_scores(query, embeddings,chunks_df):
    print(f'Score: {score}')
    print("Title: "+ chunk['Title']+'\n')
    print("Text:")
    print(chunk['Chunk']+'\n')


## Reranking
### Reranking serves the purpose of reordering and refining a set of retrieved article fragments based on their relevance to a given query.


In [None]:
reranking_model = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

### Searching for relevant article fragments according to a given query

In [None]:
def top_k_results_reranked(query,embeddings, chunks_df, embedding_model=embedding_model, reranking_model=reranking_model, k1=15, k2=5 ):
    """
    Embeds a query with the specified model and retrieves the top k1 relevant article fragments (chunks) using semantic search.
    Then, it reranks these chunks using the reranking model and returns the top k2 article fragments along with their scores.

    Parameters:
    - query: The query string.
    - embeddings: Embeddings of the chunks.
    - chunks_df: DataFrame containing the chunks.
    - embedding_model: The model used for embedding (default is embedding_model).
    - reranking_model: The model used for reranking (default is reranking_model).
    - k1: Number of top relevant chunks to retrieve initially (default is 15).
    - k2: Number of top reranked chunks to return (default is 5).

    Returns:
    - top_k_reranked: DataFrame containing the top k2 reranked chunks.
    - results['score']: Series containing the scores of the top k2 reranked chunks.
    """
    top_k=top_k_results(query, embeddings,chunks_df, embedding_model=embedding_model, k_=k1)
    results = reranking_model.rank(query, list(top_k['Chunk']), return_documents=False, top_k=k2)
    results=pd.DataFrame(results)
    index=(results['corpus_id'])
    top_k_reranked=top_k.loc[index]
    top_k_reranked.reset_index(drop=True, inplace=True)
    return top_k_reranked, results['score']
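
The two-stage pattern (cheap retrieval of `k1` candidates, then a more careful rescoring that keeps `k2`) can be sketched without any models. In the example below the two scorers are deliberately simple stand-ins, not the embedding model or the cross-encoder: `word_overlap` and `overlap_ratio` are hypothetical helpers chosen only so the sketch runs on its own.

```python
def retrieve_then_rerank(query, docs, cheap_score, precise_score, k1=15, k2=5):
    """Keep the k1 best docs under cheap_score, then return the k2 best
    of those under precise_score as (score, doc) pairs."""
    # Stage 1: fast, approximate shortlist.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k1]
    # Stage 2: rescore only the shortlist with the slower, more precise scorer.
    rescored = [(precise_score(query, d), d) for d in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:k2]


def word_overlap(query, doc):
    """Stand-in for the first stage: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_ratio(query, doc):
    """Stand-in for the reranker: Jaccard similarity of the word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


docs = [
    "gradient descent updates weights step by step",
    "gradient descent",
    "decision trees split on features",
    "stochastic gradient descent uses mini batches of data",
]
reranked = retrieve_then_rerank(
    "gradient descent", docs, word_overlap, overlap_ratio, k1=3, k2=2
)
for score, doc in reranked:
    print(round(score, 3), doc)
```

The first stage cannot tell the three gradient-descent documents apart (each shares two words with the query), while the second stage orders them. This mirrors the reason for reranking: the second scorer only has to look at a handful of candidates, so it can afford to be slower and more precise.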

In [None]:
def print_top_k_results_reranked(query,embeddings, chunks_df, embedding_model=embedding_model, reranking_model=reranking_model, k1=10, k2=5 ):
    """
    Prints out the results from the function top_k_results_reranked
    """
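    # Pass the caller's models and k values through instead of hard-coding them
    top_k_reranked, scores = top_k_results_reranked(
        query, embeddings, chunks_df,
        embedding_model=embedding_model, reranking_model=reranking_model,
        k1=k1, k2=k2,
    )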
    for i, chunk in top_k_reranked.iterrows():
        print(f'Score: {scores[i]}')
        print("Title: "+ chunk['Title'])
        print("Text:")
        print(chunk['Chunk']+'\n')

#### Get results for the same example query with reranking

In [None]:
query='What is the curse of dimensionality?'
print_top_k_results_reranked(query, embeddings, chunks_df, k1=10, k2=5 )

#### Comparing the results, it's evident that reranking significantly improves the relevance of retrieved passages, with higher-scoring passages aligning more closely with the query's intent. This highlights the effectiveness of rerankers in enhancing the quality of retrieved content for RAG models, ensuring more accurate and contextually appropriate responses.

#### More examples

In [None]:
query='How to handle outliers in a dataset'
print_top_k_results_reranked(query, embeddings, chunks_df, k1=15, k2=5 )

In [None]:
query='How does gradient descent work?'
print_top_k_results_reranked(query, embeddings, chunks_df, k1=15, k2=5 )

### LLM Model
#### I'll use the `gemma-1.1-7b-it` model for text generation.

In [None]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/gemma/transformers/1.1-7b-it/1/")
model = AutoModelForCausalLM.from_pretrained(
    "/kaggle/input/gemma/transformers/1.1-7b-it/1/",
    quantization_config=quantization_config,
    low_cpu_mem_usage=False,
)


### Testing the model without RAG

In [None]:
query="What are some common cases for using on-device deep learning with TensorFlow Mobile?"
chat = [
    { "role": "user", "content": query },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
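# The chat template already inserts <bos>, so don't let the tokenizer add a second one
input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")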
outputs = model.generate(**input_ids,max_new_tokens=256)
answer = (tokenizer.decode(outputs[0]))
display(Markdown(answer.replace(prompt, '')))

### Q&A system

In [None]:
def format_prompt(query, context_items) -> str:
    """
    Formats a prompt for generating answers based on a given query and context items.
    """
    context= "- "+ "\n- ".join(context_items)
    base_prompt = """Based on the following context items, please provide a comprehensive answer to the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are some commonly used evaluation metrics for classification tasks?
Answer: Some commonly used evaluation metrics for classification tasks include:
1. Accuracy: It measures the proportion of correct predictions among the total number of cases examined. It's suitable for well-balanced classification problems without class imbalance.
2. Precision: It measures the proportion of true positive predictions among all positive predictions made by the model. It's useful when the cost of false positives is high.
3. Recall (Sensitivity): It measures the proportion of true positive predictions among all actual positive cases in the dataset. It's important when the cost of false negatives is high.
4. F1-score: It is the harmonic mean of precision and recall, providing a balance between the two metrics. It's helpful when there's an uneven class distribution or when both false positives and false negatives are important.
5. ROC (Receiver Operating Characteristic) curve: It plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings. It's useful for assessing the trade-off between true positive rate and false positive rate.
6. AUC (Area Under the ROC Curve): It quantifies the overall performance of a binary classification model by calculating the area under the ROC curve. A higher AUC indicates better model performance.
These metrics provide insights into different aspects of a classification model's performance and are commonly used to evaluate and compare models in various applications.
\nExample 2:
Query: What is the role of cross-validation in model evaluation?
Answer: Cross-validation is a technique used to assess how well a predictive model will generalize to an independent dataset. It involves partitioning the dataset into multiple subsets, called folds, training the model on several combinations of these folds, and evaluating its performance on the remaining fold(s). This process helps to detect overfitting by providing a more accurate estimate of the model's performance on unseen data compared to a single train-test split. Common cross-validation methods include k-fold cross-validation and stratified k-fold cross-validation, which ensure that each fold preserves the class distribution of the original dataset. Cross-validation is essential for robust model evaluation and hyperparameter tuning in machine learning projects.
\nExample 3:
Query: How does regularization help in preventing overfitting in machine learning models?
Answer: Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty term to the loss function. This penalty term discourages complex models that fit the training data too closely, thus reducing the likelihood of overfitting. Common regularization techniques include L1 regularization (Lasso), L2 regularization (Ridge), and elastic net regularization, each of which adds a different type of penalty to the model's weights. Regularization helps to improve the model's generalization performance by balancing between fitting the training data well and avoiding excessive complexity.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:"""
    prompt=base_prompt.format(context=context, query=query)
    return prompt
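
The context-building step is easy to check in isolation. The sketch below uses a shortened, hypothetical template (not the full prompt above) and a helper named `build_context` that is only illustrative. It shows how the chunks become a bulleted list, and that curly braces inside a retrieved chunk are safe because `str.format` only interprets braces in the template itself.

```python
def build_context(context_items):
    """Join retrieved chunks into a dash-bulleted block, one chunk per bullet."""
    return "- " + "\n- ".join(context_items)


# Shortened stand-in for the real prompt template.
template = "Context:\n{context}\n\nUser query: {query}\nAnswer:"

items = [
    "PCA projects data onto directions of maximal variance.",
    "Use a dict like {'a': 1} for lookups.",  # braces inside a value are fine
]
demo_prompt = template.format(
    context=build_context(items), query="Can you explain PCA?"
)
print(demo_prompt)
```

One design consequence is worth keeping in mind: literal braces added to the template text itself would need to be doubled (`{{` and `}}`), otherwise `format` would treat them as placeholders.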

In [None]:
def ask(query, temperature=0.7,max_new_tokens=512, embeddings=embeddings, chunks_df=chunks_df):
    """
    Takes a query, finds relevant resources/context and generates an answer to the query based on the relevant resources.
    """
    context_items,_= top_k_results_reranked(query,embeddings, chunks_df)
    input_text=format_prompt(query,list(context_items["Chunk"]))
    chat = [
    { "role": "user", "content": input_text },
    ]
    prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
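    # The chat template already inserts <bos>, so don't let the tokenizer add a second one
    input_ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
    # temperature only takes effect when sampling is enabled
    outputs = model.generate(**input_ids, max_new_tokens=max_new_tokens, do_sample=True, temperature=temperature)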
    answer = (tokenizer.decode(outputs[0]))
    return answer.replace(prompt, '').replace('<bos>', '').replace('<eos>', ''), context_items

def print_context_items(context_items):
    print("Retrieved article fragments:")
    for _, chunk in context_items.iterrows():
        print("---------------------------------------------------------------------------------------")
        print("Title: "+ chunk['Title'])
        print("Text:")
        print(chunk['Chunk']+'\n')

### Get results for the same query with RAG

In [None]:
query="What are some common cases for using on-device deep learning with TensorFlow Mobile?"
answer, context_items=ask(query)
print(f'Query: {query}')
display(Markdown(answer))
print_context_items(context_items)

#### The variation in output between the two responses highlights the impact of using RAG. The second response focused specifically on common cases for on-device deep learning with TensorFlow Mobile, emphasizing applications like image recognition and text classification. In contrast, the first response, without RAG, covered a broader range of use cases. This demonstrates how RAG can yield more targeted and contextually relevant responses from a specific domain of knowledge.

### More query examples

In [None]:
query = "Can you explain PCA?"
answer, context_items=ask(query)
print(f'Query: {query}')
display(Markdown(answer))
print_context_items(context_items)

In [None]:
query = "What is the purpose of feature scaling in machine learning?"
answer, context_items=ask(query)
print(f'Query: {query}')
display(Markdown(answer))
print_context_items(context_items)

In [None]:
query="What are some ethical considerations in data science, particularly regarding privacy and bias?"
answer, context_items=ask(query)
print(f'Query: {query}')
display(Markdown(answer))
print_context_items(context_items)

In [None]:
query="Explain how k-nearest neighbors (k-NN) algorithm works and discuss its limitations."
answer, context_items=ask(query)
print(f'Query: {query}')
display(Markdown(answer))
print_context_items(context_items)