# 1. Introduction

This notebook demonstrates how to build an advanced RAG (Retrieval-Augmented Generation) pipeline that explains concepts from Kaggle competition solution write-ups.

We are going to use the following public dataset: https://www.kaggle.com/datasets/thedrcat/kaggle-winning-solutions-methods

# 2. Installation and imports

## 2.1 Install packages

In [1]:
# pip3 install accelerate bitsandbytes langchain langchain-community sentence-transformers ragatouille faiss-cpu rank_bm25
# pip3 install beautifulsoup4 # Install beautifulsoup4 if you are not running the notebook on Kaggle
# pip3 install keras-nlp
# pip3 install "keras>3" # Quote the specifier so the shell does not treat ">" as a redirection

## 2.2 Imports

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM



In [3]:
import tensorflow

In [4]:
from langchain_community.vectorstores import Chroma
from sentence_transformers import SentenceTransformer

In [12]:
!pip install -U pip setuptools wheel
!pip install -U spacy

zsh:1: /Users/nayaghos/Documents/training-project/langenv/bin/pip: bad interpreter: /Users/nayaghos/Documents/gemini-rag/langenv/bin/python3.11: no such file or directory
zsh:1: /Users/nayaghos/Documents/training-project/langenv/bin/pip: bad interpreter: /Users/nayaghos/Documents/gemini-rag/langenv/bin/python3.11: no such file or directory


In [13]:
from langchain.schema import Document
from transformers import AutoModel
import spacy
spacy.load('en_core_web_sm')

ModuleNotFoundError: No module named 'spacy'

In [None]:
import os

# The Keras backend must be selected before keras is imported, otherwise the setting is ignored.
os.environ["KERAS_BACKEND"] = "jax"  # Or "torch" or "tensorflow".
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "1.00" # Avoid memory fragmentation on JAX backend.

import keras
import keras_nlp
import pandas as pd

from bs4 import BeautifulSoup
from typing import Optional, List, Tuple
from IPython.display import display, Markdown

from transformers import AutoTokenizer
from ragatouille import RAGPretrainedModel
from langchain.docstore.document import Document
from langchain.prompts.prompt import PromptTemplate
from langchain_core.runnables import ConfigurableField
from langchain_community.vectorstores import FAISS, Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy

# 3. Prepare the data
## 3.1 Preprocessing

In [10]:
data = pd.read_csv('kaggle_winning_solutions_methods.csv')
data.head()

Unnamed: 0,link,place,competition_name,prize,team,kind,metric,year,nm,writeup,num_tokens,methods,cleaned_methods
0,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Replace augmentation
1,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Finger tree rotate
2,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Data Augmentation
3,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Onecycle scheduler
4,https://www.kaggle.com/c/asl-signs/discussion/...,2,Google - Isolated Sign Language Recognition,"$100,000",1165,Research,PostProcessorKernelDesc,2023,406306,<h2>TLDR</h2>\n<p>We used an approach similar ...,2914,"['EfficientNet-B0', 'Data Augmentation', 'Norm...",Flip pose


Let's look at an example of a write-up:

In [11]:
data['writeup'][42]

'<p>Here is a quick overview of the 5th-place solution.</p>\n<ol>\n<li><p><strong>we applied various augmentations like flip, concatenation, etc</strong><br>\n1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -&gt; 0.78)</p></li>\n<li><p><strong>the model is only a transformer model based on the public kernels</strong><br>\n2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78-&gt;0.8) in public LB.<br>\n2.1.1. 3 layers of transformer with the embedding size 480.</p></li>\n<li><p><strong>Preprocessing by mean and std of single sign sequence</strong><br>\n3.1. the preprocessing does affect the final performance. <br>\n3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.</p></li>\n<li><p><strong>Feature engineering like distances between points</strong><br>\n4.1. we selected and used around 106 p

The write-ups contain HTML tags and links that are not relevant to our knowledge base, so we'll use BeautifulSoup to extract the text of each write-up and join the pieces into a single string.

In [12]:
%%time

def clean_html(html_content):
    """Function to clean up HTML tags in each writeup"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Use '\n' as a separator to preserve the structure of the various parts
    text = soup.get_text(separator='\n', strip=True)
    return text

data['writeup'] = data['writeup'].apply(clean_html) # This might take a while

CPU times: user 13 s, sys: 155 ms, total: 13.1 s
Wall time: 13.2 s


**Here is the result:**

In [13]:
print(data['writeup'][42])

Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances withinpoints of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, ema,

**This looks good now!**

To build our knowledge base, which will serve as the context for the LLM, we will concatenate the relevant information: the name of the competition, the place achieved by the team that wrote the solution, the methods used, and the solution itself.

Other columns could also help answer the user's query, but let's keep it simple for now.


In [14]:
data['LLM_context'] = (
    "Competition Name: " + data['competition_name'] +
    ",\nPlace: " + data['place'].astype(str) +
    ",\nMethods Used: " + data['methods'] +
    ",\nSolution: " + data['writeup']
)

print(data['LLM_context'][42])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

In [15]:
data = data.drop("writeup", axis=1) # We remove 'writeup' column as it is already in LLM_context

## 3.2 Loading data

We'll now use LangChain's [DataFrameLoader](https://python.langchain.com/docs/integrations/document_loaders/pandas_dataframe) to store the information as a list of LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.

The **`Document`** class in **LangChain** serves as a fundamental building block for storing text and associated metadata. Let's explore its key features:

1. **Purpose**: The `Document` class is designed to hold a piece of text along with relevant metadata. You can think of it as a container for textual content.

2. **Attributes**:
    - **`page_content`**: This attribute stores the actual text content of the document.
    - **`metadata` (Optional)**: You can attach arbitrary metadata to the document. For example, this could include information about the source of the content or relationships to other documents.

For more detailed information, you can refer to the [official LangChain documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
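
To make the two attributes concrete, here is a minimal sketch of how a table row maps onto a document: one column becomes `page_content` and the remaining columns become `metadata`. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins written for illustration only. They are not LangChain classes, and the real `DataFrameLoader` works on a pandas DataFrame rather than a list of dicts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the two attributes described above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(rows: List[Dict[str, Any]], page_content_column: str) -> List[SimpleDocument]:
    """Turn each row into a document: one column is the text, the rest is metadata."""
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items() if key != page_content_column}
        documents.append(SimpleDocument(page_content=row[page_content_column], metadata=metadata))
    return documents


# Toy row shaped like the dataset used in this notebook.
rows = [
    {
        "competition_name": "Google - Isolated Sign Language Recognition",
        "place": 5,
        "LLM_context": "Competition Name: Google - Isolated Sign Language Recognition,\nPlace: 5",
    },
]
example_docs = rows_to_documents(rows, page_content_column="LLM_context")
print(example_docs[0].page_content)
print(example_docs[0].metadata)
```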



In [16]:
loader = DataFrameLoader(data, page_content_column="LLM_context")
docs = loader.load()
docs_subset = docs[:1500] # Part of the data is used to reduce execution time.

In [17]:
print("-----------PAGE CONTENT-----------")
print(docs_subset[42].page_content)
print("\n\n-----------METADATA-----------\n")
print(docs_subset[42].metadata)

-----------PAGE CONTENT-----------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like dist


# 4. Chunking

To create relevant answer snippets for the LLM, we break down the knowledge base documents into smaller pieces. These chunks should capture specific ideas, not be too short (cutting off the thought) or too long (making it hard to find the main point).

We use "recursive chunking" to achieve this. It works by repeatedly splitting the text into smaller parts using a list of separators (e.g. ["\n\n", "\n", ".", ""]), starting with the most important (like double line breaks) and moving down to less important ones (like sentence ends) only when a part is still too long. This keeps every chunk under the size limit while cutting at the most natural boundary available.
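
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` helper counts characters instead of tokens, adds no overlap between chunks, and drops the separator wherever it cuts.

```python
from typing import List


def recursive_split(text: str, separators: List[str], max_chars: int) -> List[str]:
    """Split `text` into pieces of at most `max_chars` characters."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No meaningful separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    separator, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        # Try to grow the current chunk with the next piece.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, remaining, max_chars))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer. It has two sentences.\n\nThird."
chunks = recursive_split(sample, ["\n\n", "\n", ".", ""], max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, and only the paragraph that exceeds the limit is split again at sentence ends.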

In [None]:
# EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5" 
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
CHUNK_SIZE = 256 # We choose a chunk size adapted to our model

In [None]:
# %%time

# def split_documents(chunk_size: int, knowledge_base: List[Document], tokenizer_name: Optional[str] = EMBEDDING_MODEL_NAME) -> List[Document]:
#     """
#     Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
#     """
    
#     text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
#         AutoTokenizer.from_pretrained(tokenizer_name),
#         chunk_size = chunk_size,
#         chunk_overlap = int(chunk_size / 10),
#         add_start_index = True,
#         strip_whitespace = True,
#     )

#     docs_processed = []
#     for doc in knowledge_base:
#         docs_processed += text_splitter.split_documents([doc])

#     # Remove duplicates
#     unique_texts = {}
#     docs_processed_unique = []
#     for doc in docs_processed:
#         if doc.page_content not in unique_texts:
#             unique_texts[doc.page_content] = True
#             docs_processed_unique.append(doc)

#     return docs_processed_unique

# chunked_docs = split_documents(
#     CHUNK_SIZE,  
#     docs_subset,
#     tokenizer_name=EMBEDDING_MODEL_NAME,
# )

Token indices sequence length is longer than the specified maximum sequence length for this model (956 > 512). Running this sequence through the model will result in indexing errors


CPU times: user 17.9 s, sys: 156 ms, total: 18 s
Wall time: 18.5 s


In [None]:
# Load spaCy model for semantic chunking
nlp = spacy.load("en_core_web_sm")  # Ensure you have this model installed: `python -m spacy download en_core_web_sm`

def split_documents_semantic(knowledge_base: List[Document], chunk_size: Optional[int] = None, tokenizer_name: Optional[str] = None) -> List[Document]:
    """
    Split documents into semantic chunks (e.g., sentences or paragraphs) and return a list of documents.
    Optionally, enforce a maximum chunk size in tokens.
    """
    docs_processed = []

    # Load tokenizer if chunk_size is provided
    tokenizer = None
    if chunk_size and tokenizer_name:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    for doc in knowledge_base:
        # Use spaCy to split the document into sentences or paragraphs
        spacy_doc = nlp(doc.page_content)
        
        # Split into sentences
        sentences = [sent.text for sent in spacy_doc.sents]
        
        # Optionally enforce a maximum chunk size in tokens
        if chunk_size and tokenizer:
            current_chunk = []
            current_length = 0
            
            for sentence in sentences:
                sentence_tokens = tokenizer.tokenize(sentence)
                sentence_length = len(sentence_tokens)
                
                # If adding this sentence exceeds the chunk size, finalize the current chunk
                if current_length + sentence_length > chunk_size and current_chunk:
                    docs_processed.append(Document(page_content=" ".join(current_chunk), metadata=doc.metadata))
                    current_chunk = []
                    current_length = 0
                
                # Add the sentence to the current chunk
                current_chunk.append(sentence)
                current_length += sentence_length
            
            # Add the last chunk if it exists
            if current_chunk:
                docs_processed.append(Document(page_content=" ".join(current_chunk), metadata=doc.metadata))
        else:
            # If no chunk_size is provided, treat each sentence as a separate chunk
            for sentence in sentences:
                docs_processed.append(Document(page_content=sentence, metadata=doc.metadata))

    # Remove duplicates (optional)
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

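    return docs_processed_unique

# chunked_docs is used by the retrievers in section 5.2. Assumption: since the recursive
# splitter above is commented out, we build it here with the semantic splitter.
chunked_docs = split_documents_semantic(
    docs_subset,
    chunk_size=CHUNK_SIZE,
    tokenizer_name=EMBEDDING_MODEL_NAME,
)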

**If the dataset is too large, chunking all the documents can take a long time. To speed things up, consider working with a representative subset of the data.**


# 5. Embeddings and retriever
## 5.1 Embeddings

Now that the documents are correctly sized, we're ready to start building a database that includes their embeddings.

To create embeddings for the document chunks, we'll use LangChain's [HuggingFaceEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/huggingfacehub) with the model set in `EMBEDDING_MODEL_NAME`, here [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) is left commented out as an alternative. A broader selection of text embedding models can be found on the Hugging Face Hub, where the most effective models are highlighted in the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
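
The embedding model is configured with `normalize_embeddings` set to `True`. The sketch below shows why this matters for cosine similarity: once vectors are scaled to unit length, a plain dot product gives the same value as the cosine. The two-dimensional vectors are toy values, not real embeddings, and the helper names are illustrative.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Both values agree: normalizing first turns the dot product into the cosine.
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```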

In [20]:
embedding_model = HuggingFaceEmbeddings(
    model_name = EMBEDDING_MODEL_NAME,
    multi_process = True,
    model_kwargs = {"device": "cpu"},
    encode_kwargs = {"normalize_embeddings": True},  # set True for cosine similarity
)



In [21]:
# embedding_model = HuggingFaceEmbeddings(
#     model_name="sentence-transformers/all-MiniLM-L6-v2",
#     model_kwargs={"device": "cpu"},
#     encode_kwargs={"normalize_embeddings": True},
# )

# # Test embedding a single sentence
# try:
#     test_embeddings = embedding_model.embed_documents(["This is a test sentence."])
#     print(f"Test embeddings shape: {len(test_embeddings[0])}")  # Should print 384 for all-MiniLM-L6-v2
# except Exception as e:
#     print(f"Error with HuggingFaceEmbeddings: {e}")

In [22]:
# embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

In [23]:
# test_embeddings = embedding_model.encode(["This is a test sentence."])
# print(f"Test embeddings shape: {len(test_embeddings[0])}")
# print(f"Embeddings shape: {test_embeddings.shape}") 

## 5.2 Fusion retrieval or hybrid search

Hybrid search, though not entirely new, combines the strengths of two distinct search methods: traditional keyword-based search, which employs sparse retrieval algorithms such as [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or the search industry standard [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), and contemporary semantic or vector search.
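
To give an intuition for the keyword side, here is a minimal sketch of one common variant of the Okapi BM25 formula. It is illustrative only: it tokenizes by whitespace, the `k1` and `b` defaults are typical values chosen for this example, and the package used behind `BM25Retriever` may differ in details such as the IDF formula.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents that contain each term.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


toy_docs = [
    "we used awp and ema to prevent overfitting",
    "the model is a transformer",
    "augmentations like flip",
]
scores = bm25_scores("overfitting prevention ema", toy_docs)
print(scores)
```

Only the first toy document shares exact words with the query, so it is the only one with a non-zero score. "prevention" does not match "prevent", which is the kind of gap semantic search is meant to cover.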

The challenge lies in effectively merging the results obtained from these different similarity scoring methods. This issue is typically addressed using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (RRF) algorithm, which re-ranks the retrieved results to produce the final output.
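
A minimal sketch of weighted RRF follows. Each retriever contributes `weight / (rank + c)` for every document it returns, so a document ranked well by both lists rises to the top. The constant `c = 60` is the value commonly used with RRF, and the document ids and function name are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], weights: List[float], c: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b", "doc_c"]   # keyword search results, best first
dense_ranking = ["doc_b", "doc_d", "doc_a"]  # vector search results, best first
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Only ranks are used, never the raw scores, which is why BM25 scores and cosine similarities can be merged without putting them on a common scale.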

In LangChain this is implemented in the [Ensemble Retriever class](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble), which combines a list of retrievers you define (for example a FAISS vector index and a BM25-based retriever) and uses RRF for reranking.


As the vector database, we'll use [FAISS](https://github.com/facebookresearch/faiss), a library developed by Facebook AI. FAISS specializes in efficient similarity search and clustering of dense vectors, which suits our needs perfectly. Currently, FAISS is among the top libraries for conducting Nearest Neighbor (NN) search in large datasets.



In [24]:
num_docs = 5 # Default number of documents to retrieve

# chunked_docs = [
#     Document(page_content="This is the first document.", metadata={"source": "doc1"}),
#     Document(page_content="This is the second document.", metadata={"source": "doc2"}) ]

bm25_retriever = BM25Retriever.from_documents(chunked_docs).configurable_fields(
    k=ConfigurableField(
        id="search_kwargs_bm25",
        name="k",
        description="The search kwargs to use",
    )
)
# chroma_vectorstore = Chroma.from_documents(
#     chunked_docs, embedding_model
# )
faiss_vectorstore = FAISS.from_documents(
    chunked_docs, embedding_model, distance_strategy=DistanceStrategy.COSINE)

# try:
#     faiss_vectorstore = FAISS.from_documents(
#         chunked_docs, embedding_model, distance_strategy=DistanceStrategy.COSINE
#     )
#     print("FAISS vector store created successfully.")
# except Exception as e:
#     print(f"Error creating FAISS vector store: {e}")

faiss_retriever = faiss_vectorstore.as_retriever(
    search_kwargs={"k": num_docs}
    ).configurable_fields(
    search_kwargs=ConfigurableField(
        id="search_kwargs_faiss",
        name="Search Kwargs",
        description="The search kwargs to use",
    )
)

# initialize the ensemble retriever
vector_database = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5] # You can adjust the weight of each retriever in the EnsembleRetriever
)

I picked row 42 as the basis for generating questions for the model.

In [25]:
print(data.iloc[42, :])

link                https://www.kaggle.com/c/asl-signs/discussion/...
place                                                               5
competition_name          Google - Isolated Sign Language Recognition
prize                                                        $100,000
team                                                            1,165
kind                                                         Research
metric                                        PostProcessorKernelDesc
year                                                             2023
nm                                                             406491
num_tokens                                                        473
methods             ['Augmentation', 'Transformer model', 'Preproc...
cleaned_methods                                       Post-processing
LLM_context         Competition Name: Google - Isolated Sign Langu...
Name: 42, dtype: object


In [26]:
print(data['LLM_context'][42])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

Below are several questions that can be derived from the above solution description (generated by an LLM):

- "What specific augmentations were applied to improve the cross-validation score, and how did each contribute to the increase?"
- "Why was a transformer model chosen for this solution, and how did public kernels influence its development?"
- "Can you detail the impact of increasing the model's parameters on its performance on the public leaderboard?"
- "Describe the architecture of the 3-layer transformer model, specifically focusing on the choice of embedding size."
- "How does preprocessing with mean and standard deviation of single sign sequences enhance model performance?"
- "What process did you use to determine that using mean and std of single sign sequences yields better cross-validation scores?"
- "In terms of feature engineering, why were distances between points chosen as a feature, and how were they calculated?"
- "How did the selection of 106 points influence the model's ability to understand and process the data?"
- "What methods were implemented to prevent overfitting, and can you explain how each method contributed to model robustness?"
- "Reflecting on your teamwork, how did your teammates contribute to the development and success of the solution?"

You can use them as inspiration or rephrase them before asking the model. I'll choose one to test the pipeline.

Let's make a simple query on our database!

In [27]:
user_query = """
I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
config = {"configurable": {"search_kwargs_faiss": {"k": 5}, "search_kwargs_bm25": 5}}
retrieved_docs = vector_database.invoke(user_query, config=config)
print("----------------------Top document content----------------------")
print(retrieved_docs[0].page_content)
print("----------------------Top document metadata----------------------")
print(retrieved_docs[0].metadata)

----------------------Top document content----------------------
Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.

# 6. Reranking 

A practical strategy for RAG involves fetching a larger number of documents initially than the final count you aim for, followed by employing a stronger retrieval model to rerank these results. This process narrows down the selection to only the best top_k documents.

To implement this, we will use [ColBERTv2](https://arxiv.org/abs/2112.01488), which is conveniently accessible through the [RAGatouille library](https://github.com/bclavie/RAGatouille).
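
ColBERT-style models score a document with "late interaction": every query token embedding is matched to its most similar document token embedding, and these maxima are summed (MaxSim). The sketch below illustrates that scoring rule and the "fetch many, keep the best `k`" step on hand-made 2-dimensional vectors. The vectors, ids and function names are toy assumptions, not output from ColBERTv2 or calls into RAGatouille.

```python
from typing import Dict, List

Vector = List[float]


def maxsim_score(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    total = 0.0
    for q in query_vectors:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_vectors)
    return total


def rerank_top_k(query_vectors: List[Vector], candidates: Dict[str, List[Vector]], k: int) -> List[str]:
    """Order candidate ids by MaxSim score and keep the k best."""
    ranked = sorted(candidates, key=lambda doc_id: maxsim_score(query_vectors, candidates[doc_id]), reverse=True)
    return ranked[:k]


# Toy token embeddings: two query tokens, three candidate documents.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_x": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_y": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
    "doc_z": [[0.6, 0.6]],              # partial match for both
}
top_docs = rerank_top_k(query_vectors, candidates, k=2)
print(top_docs)
```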


In [28]:
reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

[Mar 04, 23:59:09] Loading segmented_maxsim_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...




In [29]:
page_contents = [doc.page_content for doc in retrieved_docs]  # keep only the text
relevant_docs = reranker.rerank(user_query, page_contents, k=5)
relevant_docs = [doc["content"] for doc in relevant_docs]

100%|██████████| 1/1 [00:01<00:00,  1.35s/it]


In [30]:
print(relevant_docs[0])

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we select

# 7. Model building

In [47]:
%%time
# gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_instruct_2b_en")
# gemma_lm = keras_nlp.models.GPT2CausalLM.from_preset("gpt2_base_en")

# model_name = "mistralai/Mistral-7B-Instruct-v0.2"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)

CPU times: user 3 μs, sys: 1 μs, total: 4 μs
Wall time: 6.2 μs


## 7.1 Testing Gemma model directly

In [48]:
%%time
# display(Markdown(model.generate("Hi, what can you tell me about Kaggle competitions?", max_length=256)))

CPU times: user 2 μs, sys: 1e+03 ns, total: 3 μs
Wall time: 4.77 μs


## 7.2 Prompt

The RAG prompt template gives the LLM our retrieved context together with the user's question, laid out in clearly labeled sections that end with an `ANSWER:` marker for the model to complete.
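
As a rough sketch of how such a template is filled in and how the answer can be recovered afterwards, the example below uses plain `str.format` instead of LangChain's `PromptTemplate`. It assumes the model echoes the prompt before its completion, which is typical for causal LMs, and `fake_output` is a made-up string standing in for real model output. `build_prompt` and `extract_answer` are illustrative names.

```python
PROMPT_TEMPLATE = """
CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved context and the user's question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def extract_answer(generated_text: str) -> str:
    """Return the text after the first ANSWER: marker, or a fallback message."""
    _, marker, answer = generated_text.partition("ANSWER:")
    return answer.strip() if marker else "No answer has been generated"


prompt = build_prompt(
    context="some methods to prevent overfitting like awp, random mask of frames, ema",
    question="What overfitting prevention techniques were used?",
)
fake_output = prompt + "AWP, random frame masking and EMA."  # stand-in for model output
print(extract_answer(fake_output))
```

Because the split happens at the first `ANSWER:`, a context that itself contains that marker would confuse the extraction. A more distinctive delimiter avoids this.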

In [None]:
# prompt_template = """
# Based on your extensive knowledge and the following detailed context, 
# please provide a comprehensive answer to explain concepts from Kaggle competition solution write-ups:

# CONTEXT:
# {context}

# QUESTION:
# {question}

# ANSWER:
# """

# RAG_PROMPT_TEMPLATE = PromptTemplate(
#     input_variables=["context", "question"],
#     template=prompt_template,
# )

# 8. Creating the RAG pipeline

In [None]:
# def answer_with_rag(
#     question: str,
#     llm,
#     knowledge_index: FAISS,
#     reranker: Optional[RAGPretrainedModel] = None,
#     num_retrieved_docs: int = 10,
#     num_docs_final: int = 5,
# ) -> Tuple[str, List[Document]]:
#     # Gather documents with retriever
#     print("=> Retrieving documents...")
#     config = {"configurable": {"search_kwargs_faiss": {"k": num_retrieved_docs}, "search_kwargs_bm25": num_retrieved_docs}}
#     relevant_docs = knowledge_index.invoke(question, config=config)
#     relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text
    
#     # Optionally rerank results
#     if reranker:
#         print("=> Reranking documents...")
#         relevant_docs = reranker.rerank(question, relevant_docs, k = num_docs_final)
#         relevant_docs = [doc["content"] for doc in relevant_docs]
        
#     relevant_docs = relevant_docs[:num_docs_final] # Keeping only num_docs_final documents

#     # Build the final prompt
#     context = relevant_docs[0] # We select only the top relevant document
    
#     final_prompt = RAG_PROMPT_TEMPLATE.format(context = context, question = question)

#     # Generate the answer
#     print("=> Generating answer...")
#     answer = llm.generate(final_prompt, max_length=1024)

#     return answer, relevant_docs

In [None]:
%%time
# question = """I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
# What overfitting prevention techniques were used, and how did they ensure model robustness?
# """
# answer, relevant_docs = answer_with_rag(question, model, vector_database, reranker)

In [None]:
# def get_gemma_answer(generated_answer: str) -> str:
#     """Function to get Gemma answer"""
#     split = generated_answer.split("ANSWER:")
#     return split[1] if len(split) > 1 else "No answer has been generated"

# display(Markdown("### Gemma Answer"))
# display(Markdown(get_gemma_answer(answer)))
# display(Markdown("### Source docs"))
# for i, doc in enumerate(relevant_docs):
#     display(Markdown(f"**Document {i}------------------------------------------------------------**"))
#     display(Markdown(doc))

**Let's ask another question**

In [None]:
%%time
# question = """What can you tell me about the 'RSNA Screening Mammography Breast Cancer Detection' competition ?
# """
# answer, relevant_docs = answer_with_rag(question, gemma_lm, vector_database, reranker)

# display(Markdown("### Gemma Answer"))
# display(Markdown(get_gemma_answer(answer)))
# display(Markdown("### Source docs"))
# for i, doc in enumerate(relevant_docs):
#     display(Markdown(f"**Document {i}------------------------------------------------------------**"))
#     display(Markdown(doc))

In [31]:
import torch
print(torch.backends.mps.is_available())  # Should return True
print(torch.backends.mps.is_built())  # Should return True

True
True
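
The check above assumes an Apple Silicon machine. To run the notebook elsewhere, one option is to choose the device at runtime instead of hard-coding `"mps"`. The sketch below is an assumption about how that could look; `pick_device` is a hypothetical helper and is not part of the original notebook. The import is guarded only so that the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return the best available PyTorch device name: 'mps', 'cuda', or 'cpu'."""
    try:
        import torch
    except ImportError:
        return "cpu"  # assumption for this sketch: no PyTorch means CPU-only
    # torch.backends.mps is absent in old PyTorch builds, hence the getattr guard
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `.to(device)` for both the model and the tokenized inputs. Note that a 7B model in float16 on CPU is likely to be very slow.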


In [32]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from typing import List, Tuple, Optional

# Check if MPS is available
if not torch.backends.mps.is_available():
    raise RuntimeError("MPS backend is not available. Ensure you're using PyTorch 2.0+ on an Apple Silicon device.")

# Initialize Mistral model and tokenizer in half precision (float16)
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("mps")

# Define the prompt template
prompt_template = """
Based on your extensive knowledge and the following detailed context, 
please provide a comprehensive answer to explain concepts from Kaggle competition solution write-ups:

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

RAG_PROMPT_TEMPLATE = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

def answer_with_rag(
    question: str,
    llm,
    knowledge_index: FAISS,
    reranker,
    num_retrieved_docs: int = 10,
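    num_docs_final: int = 5) -> Tuple[str, List[str]]:
    """Retrieve documents for the question, optionally rerank them, and generate an answer
    from the top document. Returns the decoded generation (prompt included) and the retained documents."""
    # Gather documents with retriever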
    print("=> Retrieving documents...")
    config = {"configurable": {"search_kwargs_faiss": {"k": num_retrieved_docs}, "search_kwargs_bm25": num_retrieved_docs}}
    relevant_docs = knowledge_index.invoke(question, config=config)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text
    
    # Optionally rerank results
    if reranker:
        print("=> Reranking documents...")
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]
        
    relevant_docs = relevant_docs[:num_docs_final]  # Keeping only num_docs_final documents

    # Build the final prompt
    context = relevant_docs[0]  # We select only the top relevant document
    
    final_prompt = RAG_PROMPT_TEMPLATE.format(context=context, question=question)

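    # Generate an answer using Mistral (use the model passed in as `llm`, not the global)
    print("=> Generating answer...")
    inputs = tokenizer(final_prompt, return_tensors="pt").to(llm.device)
    # Note: max_length counts the prompt tokens too, so a long context leaves fewer tokens for the answer
    outputs = llm.generate(**inputs, max_length=1024)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)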

    return answer, relevant_docs

def get_mistral_answer(generated_answer: str) -> str:
    """Return the text that follows the first 'ANSWER:' marker in the generated output."""
    split = generated_answer.split("ANSWER:")
    return split[1] if len(split) > 1 else "No answer has been generated"

# Example usage
question = """I want to understand the 5th-place solution in the 'Google - Isolated Sign Language Recognition' competition. 
What overfitting prevention techniques were used, and how did they ensure model robustness?
"""
answer, relevant_docs = answer_with_rag(question, model, vector_database, reranker)

display(Markdown("### Mistral Answer"))
display(Markdown(get_mistral_answer(answer)))
display(Markdown("### Source docs"))
for i, doc in enumerate(relevant_docs):
    display(Markdown(f"**Document {i}------------------------------------------------------------**"))
    display(Markdown(doc))

Loading checkpoint shards: 100%|██████████| 3/3 [00:26<00:00,  8.77s/it]


=> Retrieving documents...


  return torch.cuda.amp.autocast() if self.activated else NullContextManager()


=> Reranking documents...


100%|██████████| 1/1 [00:02<00:00,  2.57s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


=> Generating answer...


### Mistral Answer


In the 5th-place solution for the 'Google - Isolated Sign Language Recognition' competition, several techniques were used to prevent overfitting and ensure model robustness. These techniques include:

1. Augmentation: Augmentation is a data preprocessing technique used to artificially increase the size of the training dataset by applying various transformations to the existing data. In this solution, different augmentations like flip, concatenation, etc., were applied to the sign language videos to increase the variability of the training data and reduce overfitting. By applying different augmentations, the cross-validation score was increased from 0.76 to 0.78.

2. Transfer Learning: The model used in this solution is a transformer model based on public kernels. Transfer learning is a machine learning technique where a pre-trained model is used as a starting point for a new model, and the new model is fine-tuned on a new dataset. By using a pre-trained model, the new model can learn from the features extracted by the pre-trained model, reducing the risk of overfitting to the new dataset.

3. Preprocessing: Preprocessing is a data preprocessing technique used to transform the raw data into a format that is suitable for machine learning models. In this solution, the mean and standard deviation of each single sign sequence were calculated and used for preprocessing. This preprocessing step affects the final performance, and using the mean and standard deviation of the single sign sequence resulted in better cross-validation scores.

4. Feature Engineering: Feature engineering is a process of extracting meaningful features from raw data to be used as inputs to machine learning models. In this solution, around 106 points were selected and used for feature engineering. Distances between points within hands, nose, eyes, etc., were calculated as features. This feature engineering step helps to reduce the dimensionality of the data and improve the model's ability to learn meaningful patterns from the data.

5. Overfitting Prevention Techniques: Several techniques were used to prevent overfitting and ensure model robustness. These techniques include:

   a. Average Weighted Pruning (AWP): AWP is a regularization technique used to prevent overfitting by pruning the weights of the model based on their importance. In this solution, AWP was used to reduce the number of parameters in the model and prevent overfitting.

   b. Random Mask of Frames: Random mask

### Source docs

**Document 0------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 5,
Methods Used: ['Augmentation', 'Transformer model', 'Preprocessing', 'Feature engineering', 'Overfitting prevention'],
Solution: Here is a quick overview of the 5th-place solution.
we applied various augmentations like flip, concatenation, etc
1.1. By applying different augmentations, we can increase the cv by ~ 0.02 (0.76 -> 0.78)
the model is only a transformer model based on the public kernels
2.1. By increasing the number of parameters, the performance of a single model can be increased to around 0.8 (0.78->0.8) in public LB.
2.1.1. 3 layers of transformer with the embedding size 480.
Preprocessing by mean and std of single sign sequence
3.1. the preprocessing does affect the final performance.
3.1.1. we tried different ways of calculating the mean and std and found out that using the mean and std of the single sign sequence results in better cv.
Feature engineering like distances between points
4.1. we selected and used around 106 points (as the public notebook by Heck).
4.2. distances withinpoints of hands/nose/eyes/… are calculated.
some methods to prevent overfitting like awp, random mask of frames, ema, etc …
many thanks to my teammates
@qiaoshiji
@zengzhaoyang
The source code for training models can be found here :
https://github.com/zhouyuanzhe/kaggleasl5thplacesolution

**Document 1------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 6,
Methods Used: ['MLP', 'Encoder', 'Transformer', 'Convolutional Neural Network (CNN)', 'Data Augmentation', 'Cross Entropy Loss', 'Weight Decay', 'Mean Teacher', 'Knowledge Distillation', 'Ensemble Learning', 'Stratified K-fold', 'Baseline Model', 'Deberta', 'Max Pooling', 'Normalization', 'Interpolation', 'Manifold Mixup', 'Face CutMix', 'Outlier Sample Mining (OUSM)', 'Model Soup', 'Data Relabeling', 'Data Truncation', 'Mish Activation Function'],
Solution: Thanks to both, the organizers of this competition who offered a fun yet challenging problem as well as all of the other competitors - well done to everyone who worked hard for small incremental increases.
Although I am the one posting the topic, this is the result of a great team effort, so big shoutout to
@christofhenkel
.
Brief Summary
Our solution is a 2 model ensemble of a MLP-encoder-frame-transformer model. We pushed our transformer models close to the limit and implemented a lot of tricks to climb up to 6th place.
I have 1403 hours of experiment monitoring time in April (that’s 48h per day :)).
Update :
Code is available here :
https://github.com/TheoViel/kaggle_islr
Detailed Summary
Preprocessing & Model
Preprocessing
Remove frames without fingers
Stride the sequence (use 1 every n frames) such that the sequence size is
<= max_len
. We used
max_len=25
and
80
in the final ensemble

**Document 2------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 14,
Methods Used: ['Transformer architecture', 'Mixed PostLN & PreLN architecture', 'Data preprocessing', 'Increase number of layers', 'Set number of hidden units for each part', 'ArcFace layer', 'Loss function', 'Label smoothing', 'Data augmentation', 'TFLite Conversion', 'TTA (Test Time Augmentation)'],
Solution: Summary
I thank Kaggle administraror & host for holding this competition. Although struggling with Tensorflow's unfriendly errors and lots of try-and-errors for failing TFLite conversion was really, really tough, these low-layer experience was valuable for me.
Below is my solution writeup of this competition.
Pipeline
I tried over 377 different patterns of training models for this competition, however, the best architecture is only minor-changed one from
Mark Wijkhuizen's great public notebook
.
Mixed PostLN & PreLN Architecture
I tested a) PostLN, b) PreLN, c) Mixed architectures, and found Mixed architecture provides the best result. This architecture was originally (perhaps unintentionally) implemented in the Mark Wijkhuizen's public notebook (in earlier version).
Other modifications
keep frames with no hands (instead of dropping) in pre-processing.
increase number of layers in the keypoint encoder. With more layers, the accuracy gets better. The 4x layer seetting is the best tradeoff for accuracy and inference time.
set number of hidden units in keypoint encoder independently for each parts (lips=192, left_hand=256, right_hand=256, pose=128). This reduces inference time without losing accuracy.
attach ArcFace layer on training.
Training Setting
loss function:
0.5 * ArcFace + 0.5 * CrossEntropy
(This setting was shared by
Med Ali Bouchhioua

**Document 3------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 26,
Methods Used: ['Mixup', 'Mirroring', 'LLaMa-inspired architecture', 'RMSNorm normalization', 'Lion optimizer', 'Cosine decay learning rate', 'Batch size 128', 'Dropout 0.1', 'Exponential moving average of weights'],
Solution: Github with all the code used
Summary
The most important part of the solution is the data utilization. Major improvements were from keypoints choice and mixup. External data does not help because it is from a very different distribution. Given data amount does not benefit larger models so ensembles of small models is the way to utilize given constraints to the fullest.
Most augmentations are not helpful, because they prevent model from learning the true data distribution. So only used mirroring and mixup (0.5).
Inputs to the model
All models are trained to support sequences of up to 512 frames.
Preprocessing
Only 2d coordinates are used as 3rd dimension leads to unstable training.
To normalize inputs all keypoints are shifted so that head is located at the origin.
Scaling did not provide any benefit so not used.
All nans are replaced with 0 after normalization.
Chosen keypoints
All (21) hand keypoints
26 face keypoints
17 pose keypoints
Architecture
LLaMa-inspired architecture. Most notable improvement comes from much better normalization RMSNorm.
For all models head dimensions are set to 64
Single model (Private/Public LB: 0.8543689/0.7702471)
6 heads 5 layers 9.2M parameters
Ensemble of 3 models (Private/Public LB: 0.8584568/0.7725324)
2 heads 6 layers 1.7M parameters per model
Larger models could be fit into file size limit, but it would time out during submission.
Augmentations

**Document 4------------------------------------------------------------**

Competition Name: Google - Isolated Sign Language Recognition,
Place: 11,
Methods Used: ['Ensemble', 'Strong augmentation', 'Manual model conversion from pytorch to tensorflow', 'CLIP transformer architecture', 'Decrease parameter size', 'Motion features', 'Longer epoch'],
Solution: Thank you to the organizer and Kaggle for hosting this interesting challenge.
Especially I enjoyed this strict inference time restriction. It keeps model size reasonable and requires us for some practical technique.
TL;DR
Ensemble 5 transformer models
Strong augmentation
Manual model conversion from pytroch to tensorflow
Code is available here ->
https://github.com/bamps53/kaggle-asl-11th-place-solution
Overview
I started from
@hengck23
‘s
great discussion
and
notebook
. Thanks for sharing a lot of useful tricks as always!
The changes I made are following;
Change model architecture to CLIP transformer in HuggingFace
Decrease parameter size to maximize latency within the range of same accuracy
Some strong augmentations
Horizontal flip(p=0.5)
Random 3d rotation(p=1, -45~45)
Random scale(p=1, 0.5~1.5)
Random shift(p=1, 0.7~1.3)
Random mask frames(p=1, mask_ratio=0.5)
Random resize (p=1, 0.5~1.5)
Add motion features
current - prev
next - current
Velocity
Longer epoch, 250 for 5 fold and 300 for all data
For the details, please refer to the code.(planning to upload)
Model conversion