# Image and Text Feature Extractor for the RIMAS Dataset

## Setup and library imports

In [27]:
%load_ext autoreload
%autoreload 2



In [28]:
import os
import sys

import cv2
from tqdm import tqdm
import numpy as np
import pandas as pd
from PIL import Image
from pathlib import Path
from typing import List, Tuple

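# Add the directory two levels above the working directory to sys.path
# so that the `src` package can be imported from this notebook.
sys.path.append(str(Path(os.getcwd()).resolve().parent.parent))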

from src.core.config.config import Config
from src.ml.embeddings.feature_extractor import FeatureExtractor
from src.utils.saving_utils.save_embeddings import EmbeddingsSaver
from src.core.loaders.embedding_loader import EmbeddingsLoader

config = Config()
saver_service = EmbeddingsSaver(compression=config.COMPRESSION)
loader_service = EmbeddingsLoader()

## Research Workflow: Image & Text Feature Extraction

### Goal

Identify the most effective feature extraction method (for both images and text) by experimenting with multiple approaches and comparing classification performance.

---

## Image Feature Extraction

### Methods to Compare

* Flattening approach
* HOG (Histogram of Oriented Gradients)
* LBP (Local Binary Patterns)
* SIFT (Scale-Invariant Feature Transform)
* SURF (Speeded-Up Robust Features)
* (Optional: add more methods)
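As a rough illustration of two of the listed methods, the sketch below implements the flattening approach and a basic 3x3 LBP histogram with NumPy only. It is not the project's `FeatureExtractor`; the function names `flatten_features` and `lbp_histogram` are hypothetical, and the input is assumed to be a 2-D grayscale array that has already been resized.

```python
import numpy as np


def flatten_features(image: np.ndarray) -> np.ndarray:
    """Flattening approach: unroll the pixel grid into one 1-D vector."""
    return image.astype(np.float32).ravel()


def lbp_histogram(image: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP: compare each interior pixel with its 8 neighbours,
    pack the comparisons into an 8-bit code, and histogram the codes."""
    img = image.astype(np.int32)
    height, width = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy: height - 1 + dy, 1 + dx: width - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float32)
    return hist / hist.sum()  # normalise so images of any size are comparable


# Synthetic stand-in for a resized grayscale word image (assumption).
rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(32, 64), dtype=np.uint8)

flat = flatten_features(demo_image)
lbp = lbp_histogram(demo_image)
print(flat.shape, lbp.shape)
```

Flattening keeps every pixel, so its dimensionality grows with the target size, whereas the LBP histogram always has 256 bins regardless of image size.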

### Workflow

1. Extract features using each method.
2. Construct a consolidated `DataFrame` with all feature sets.
3. Train classification models on each feature representation.
4. Evaluate classifiers on the task: detecting the presence of specific letters in word images.
5. Compare metrics across models and methods.
6. Select the best-performing image feature extractor.
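The workflow above can be sketched as a single comparison loop. This is a minimal sketch under several assumptions: the images and labels are synthetic, `row_means` is a toy extractor invented for illustration, and a nearest-centroid rule stands in for whichever classifiers are eventually trained. All function names are hypothetical.

```python
import numpy as np


def letter_presence_target(labels, letter):
    """Binary target: 1 if `letter` occurs in the word label, else 0."""
    return np.array([int(letter in label.lower()) for label in labels])


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier: assign each test row to the closer class mean.
    Assumes both classes (0 and 1) occur in the training split."""
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in (0, 1)])
    distances = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def compare_extractors(images, labels, extractors, letter, train_fraction=0.8):
    """Return {extractor name: test accuracy} for one letter-presence task."""
    y = letter_presence_target(labels, letter)
    split = int(len(images) * train_fraction)
    scores = {}
    for name, extract in extractors.items():
        x = np.stack([extract(img) for img in images])
        predictions = nearest_centroid_predict(x[:split], y[:split], x[split:])
        scores[name] = float((predictions == y[split:]).mean())
    return scores


# Synthetic data (assumption): random "images" paired with a few word labels.
rng = np.random.default_rng(0)
words = ["compte", "domicile", "privées", "Sincères", "l'expression"]
labels = [words[i % len(words)] for i in range(50)]
images = [rng.random((8, 16)) for _ in labels]

extractors = {
    "flatten": lambda img: img.ravel(),
    "row_means": lambda img: img.mean(axis=1),  # hypothetical toy extractor
}
scores = compare_extractors(images, labels, extractors, letter="o")
print(scores)
```

Because the images here are random noise, the resulting accuracies carry no meaning; the point is only the shape of the loop, in which every extractor is scored on the same split and the same target.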

---

## Text Feature Extraction

### Starting Point

* Bag of letters representation

### Next Steps

1. Implement bag of letters as baseline.
2. Experiment with additional encoders (TF-IDF, n-grams, etc.).
3. Train classifiers on text-based features.
4. Evaluate and compare performance.
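A bag-of-letters baseline and a character n-gram variant can be sketched with the standard library alone. The helper names below are hypothetical, and lowercasing the labels is an assumption; a case-sensitive alphabet may be preferable if capital letters should count as separate features.

```python
from collections import Counter
from typing import Dict, List


def build_alphabet(labels: List[str]) -> List[str]:
    """Sorted list of every distinct character seen in the labels."""
    return sorted({ch for label in labels for ch in label.lower()})


def bag_of_letters(label: str, alphabet: List[str]) -> List[int]:
    """Count how often each alphabet character occurs in the label."""
    counts = Counter(label.lower())
    return [counts[ch] for ch in alphabet]


def char_ngrams(label: str, n: int = 2) -> Dict[str, int]:
    """Count overlapping character n-grams in the label."""
    text = label.lower()
    return dict(Counter(text[i:i + n] for i in range(len(text) - n + 1)))


labels = ["compte", "domicile"]
alphabet = build_alphabet(labels)
print(alphabet)
# ['c', 'd', 'e', 'i', 'l', 'm', 'o', 'p', 't']
print(bag_of_letters("compte", alphabet))
# [1, 0, 1, 0, 0, 1, 1, 1, 1]
print(bag_of_letters("domicile", alphabet))
# [1, 1, 1, 2, 1, 1, 1, 0, 0]
print(char_ngrams("compte"))
# {'co': 1, 'om': 1, 'mp': 1, 'pt': 1, 'te': 1}
```

Clipping the counts to 0/1 turns the vector into a letter-presence indicator, which matches the letter-detection task used to evaluate the image features.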

---

## Evaluation

* **Metrics:** Accuracy, Precision, Recall, F1-score (and others if needed).
* **Outcome:** Best image feature extractor + best text feature extractor → Final approaches for classification tasks.
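For a single binary task (for example, "does the word contain a given letter?"), the four listed metrics follow directly from the confusion counts. The sketch below is a plain-Python illustration with a hypothetical function name and made-up predictions; returning 0.0 when a denominator is zero is a convention chosen here, not something the notebook specifies.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

When a letter is rare, accuracy alone can look high for a classifier that never predicts it, which is why precision, recall, and F1 are reported alongside it.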


## Flattened Image Embeddings

In [29]:

feature_extractor = FeatureExtractor(
    dataset_path=config.DATASET_PATH,
    target_size=config.TARGET_SIZE,
    image_embeddings_path=config.IMAGE_EMBEDDINGS_PATH,
    encoder_type=config.ENCODER_TYPE,
    saver_service=saver_service
)

INFO:core.loaders.data_loader:Loaded 28475 text-image pairs


In [30]:
# Compute the embeddings and save them in batches (skip this cell if the batches already exist on disk)
feature_extractor.prepare_and_save_text_image_pairs_with_batch(
    saver_service=saver_service,
    batch_size=config.BATCH_SIZE,
    image_embeddings_path=config.IMAGE_EMBEDDINGS_PATH
)

Embedding batches:   0%|          | 0/29 [00:00<?, ?it/s]
INFO:src.utils.saving_utils.save_embeddings:Image embeddings saved to /home/nikolay/Deloitte/RIMAS/src/data/processed/words/weights/image_batches/part-00000.parquet successfully.
INFO:src.utils.saving_utils.save_embeddings:Image embeddings saved to /home/nikolay/Deloitte/RIMAS/src/data/processed/words/weights/image_batches/part-00001.parquet successfully.
Embedding batches:   7%|▋         | 2/29 [01:23<18:59, 42.21s/it]
INFO:src.utils.saving_utils.save_embeddings:Image embeddings saved to /home/nikolay/Deloitte/RIMAS/src/data/processed/words/weights/image_batches/part-00002.parquet successfully.
Embedding batches:  10%|█         | 3/29 [02:04<17:54, 41.34s/it]
INFO:src.utils.saving_utils.save_embeddings:Image embeddings saved to /home/nikolay/Deloitte/RIMAS/src/data/processed/words/weights/image_batches/part-00003.parquet successfully.
INFO:src.utils.saving_utils.save_embeddings:Image embeddings saved to /home/nikolay/Deloitte/RIMA
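The batch count in the progress bar follows from a ceiling division of the number of pairs by the batch size. The sketch below reproduces that arithmetic; `iter_batches` is a hypothetical helper, and the batch size of 1000 is an assumption, since `config.BATCH_SIZE` is not shown in this notebook (1000 is simply one value consistent with 28475 pairs and 29 batches).

```python
import math
from typing import Iterator, Tuple


def iter_batches(n_items: int, batch_size: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (batch_index, start, stop) slices that cover range(n_items)."""
    for index, start in enumerate(range(0, n_items, batch_size)):
        yield index, start, min(start + batch_size, n_items)


n_pairs = 28475      # number of text-image pairs reported by the data loader
batch_size = 1000    # assumption: the real config.BATCH_SIZE is not shown

batches = list(iter_batches(n_pairs, batch_size))
# File names follow the zero-padded pattern seen in the saver's log lines.
part_names = [f"part-{index:05d}.parquet" for index, _, _ in batches]

print(len(batches), math.ceil(n_pairs / batch_size))
# 29 29
print(batches[-1])
# (28, 28000, 28475)
print(part_names[0], part_names[-1])
# part-00000.parquet part-00028.parquet
```

Writing one file per batch keeps peak memory bounded by the batch size and lets an interrupted run be inspected or resumed from the parts already on disk.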

In [33]:
# Load the precomputed embeddings from all saved batches
feature_extractor.image_embeddings_list = loader_service.load_image_embeddings_from_all_batches(
    image_embeddings_dir=config.IMAGE_EMBEDDINGS_PATH
)
image_embeddings = feature_extractor.image_embeddings_list
print(image_embeddings.shape)
image_embeddings.set_index("id").head()

(28475, 4)


| id | label | image_path | image_embedding |
|---|---|---|---|
| 0 | l'expression | /home/nikolay/Deloitte/RIMAS/src/data/processe... | [19.508981704711914, 15.61077880859375, 11.281... |
| 1 | Sincères | /home/nikolay/Deloitte/RIMAS/src/data/processe... | [12.892561912536621, 9.752065658569336, 9.1487... |
| 2 | compte | /home/nikolay/Deloitte/RIMAS/src/data/processe... | [12.104000091552734, 22.055999755859375, 19.55... |
| 3 | privées | /home/nikolay/Deloitte/RIMAS/src/data/processe... | [19.907894134521484, 16.74342155456543, 14.822... |
| 4 | domicile | /home/nikolay/Deloitte/RIMAS/src/data/processe... | [15.165775299072266, 18.45989227294922, 17.016... |
