<a href="https://colab.research.google.com/github/mansamoussa/llm-skill-extractor/blob/main/notebooks/01_data_preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-Stage Job Advertisement Analysis ‚Äî Data Preparation

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mansamoussa/llm-skill-extractor/blob/main/notebooks/01_data_preparation.ipynb)

---

### Objective
Prepare the raw annotated dataset (`annotated.json`) for training a multilingual BERT model for **zone identification** in job advertisements.

This notebook will:
1. Load and parse the annotated dataset  
2. Tokenize text using `BertTokenizerFast`  
3. Align character-level labels with tokens  
4. Handle long sequences using the sliding window approach  
5. Generate and save:
   - `label2id.json`  
   - `id2label.json`  
   - PyTorch `train_dataset` and `test_dataset`

### Input Data
- `data/annotated.json` ‚Äî cleaned and annotated job ads  
- `src/preprocessing.py` ‚Äî preprocessing helper script  

### Output
- Tokenized and labeled datasets  
- `label2id.json` and `id2label.json` mapping files  


---
# 1. Setting up my Environment

**Objective:** Prepare the Google Colab environment by cloning the project repository and installing the necessary Python dependencies.

**Why I need this:**
* **The Issue:** When I start Google Colab, it is empty. It does not have my project files.
* **The Fix:** I clone my repository. This makes sure I have my code (like `preprocessing.py`). If I do not do this, I will get errors about missing files.

**What I did:**
1.  **Clone Repository:** I downloaded my files from GitHub.
2.  **Install Dependencies:** I installed the Python libraries I need using`requirements.txt`.

In [1]:
# --- SETUP STEP ---
# This cell prepares the Colab environment by downloading the code and installing libraries.
import os

# 1. Clone the repository if it doesn't exist in the notebook
if not os.path.exists('llm-skill-extractor'):
    !git clone https://github.com/mansamoussa/llm-skill-extractor.git
else:
    print("Repository already cloned.")

# 2. Install dependencies
!pip install -r llm-skill-extractor/requirements.txt

Cloning into 'llm-skill-extractor'...
remote: Enumerating objects: 32, done.[K
remote: Counting objects: 100% (32/32), done.[K
remote: Compressing objects: 100% (24/24), done.[K
remote: Total 32 (delta 9), reused 21 (delta 5), pack-reused 0 (from 0)[K
Receiving objects: 100% (32/32), 2.83 MiB | 7.44 MiB/s, done.
Resolving deltas: 100% (9/9), done.


# 2. Loading my Data

**Objective:** Securely import the raw `annotated.json` dataset from Google Drive into the local runtime environment.

**Why I need this:**
* **The Issue:** My data file is private. It is not on GitHub. If I skip this, the code will crash because the `data/` folder is empty.
* **The Fix:** I connect to my Google Drive. I copy the file from Drive to my project folder manually.

**What I did:**
1.  **Mount Drive:** I connected my Google Drive.
2.  **Copy Data:** I copied `annotated.json` to the `data/` folder.

In [3]:
# --- DATA LOADING STEP ---
# This cell brings the gridve data into the project environment
from google.colab import drive
import shutil

# 1. Mount Google Drive
drive.mount('/content/drive')

# 2. Define paths (here works with mika's gdrive)
source_path = '/content/drive/MyDrive/GEN03/annotated.json'
destination_folder = '/content/llm-skill-extractor/data/'

# 3. Copy the file
if os.path.exists(source_path):
    os.makedirs(destination_folder, exist_ok=True)
    shutil.copy(source_path, destination_folder)
    print(f"‚úÖ Success! Data copied to {destination_folder}")
else:
    print(f"‚ùå File not found at {source_path}")
    print("Please verify the path in your Drive or upload 'annotated.json' manually to the 'llm-skill-extractor/data/' folder using the Files sidebar on the left.")

Mounted at /content/drive
‚úÖ Success! Data copied to /content/llm-skill-extractor/data/


# 3. Configuring my Project

**Objective:** Configure the Python execution environment to recognize custom modules and define all required input/output file paths.

**Why I need this:**
* **The Issue 1 (Logging File):** The `preprocessing.py` script attempts to save a log file using a relative path (`../data/preprocessing.log`). Because I was running the notebook from the root folder, the script failed with a `FileNotFoundError` during import.
* **The Fix 1:** I temporarily changed the working directory to the `src` folder just long enough for the script to set up its logging correctly, and then immediately switched back.
* **The Issue 2 (Silent Logs):** The script also removes all existing log handlers, silencing all output in the notebook.
* **The Fix 2:** I explicitly added the console logger back *after* the import so I can see the processing messages.

**What I did:**
1.  **Directory Fix:** I temporarily changed the directory to `src/` to resolve the log file path issue.
2.  **System Path:** I added the `src/` folder to the system path.
3.  **Imports:** I imported my custom functions (`preprocess_data`, `create_dataset`).
4.  **Logging:** I restored the console logging (screen output).
5.  **Paths:** I defined all project file paths.

In [4]:
# --- Configuration Step ---
# This cell links the environment to the custom scripts in the 'src' folder.
import sys
import os
import logging
import pandas as pd

# 1. Define Project Structure
PROJECT_ROOT = '/content/llm-skill-extractor'
SRC_PATH = os.path.join(PROJECT_ROOT, 'src')
DATA_PATH = os.path.join(PROJECT_ROOT, 'data')

# 2. Ensure Python can find the code
# We add 'src' to the system path
if SRC_PATH not in sys.path:
    sys.path.append(SRC_PATH)
    print(f"‚úÖ Added '{SRC_PATH}' to system path.")

# 3. Import the Module using a temporary directory change (The fix!)
original_dir = os.getcwd() # Save where we are now
try:
    # Change directory so the relative log path (../data) resolves correctly
    os.chdir(SRC_PATH)
    print(f"üîÑ Temporarily changed directory to: {os.getcwd()}")

    # Importing 'preprocessing' will now execute the logging setup successfully
    from preprocessing import preprocess_data, create_dataset
    print("‚úÖ 'preprocessing' module imported successfully!")

except Exception as e:
    print(f"‚ùå CRITICAL ERROR during import: {e}")
    raise e

finally:
    # Always change the directory back immediately
    os.chdir(original_dir)
    print(f"‚û°Ô∏è Directory restored to: {os.getcwd()}")


# 4. Configure Logging
# The preprocessing script removes screen logs. We turn them back on here.
logger = logging.getLogger()
logger.setLevel(logging.INFO)

has_screen_handler = any(isinstance(h, logging.StreamHandler) for h in logger.handlers)
if not has_screen_handler:
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setFormatter(logging.Formatter('%(message)s'))
    logger.addHandler(console_handler)
    print("‚úÖ Logger output restored to screen.")

# 5. Define File Paths for the next steps
INPUT_FILE_PATH = os.path.join(DATA_PATH, 'annotated.json')
LABEL_MAPPING_PATH = os.path.join(PROJECT_ROOT, 'model', 'label2id.json')
ID2LABEL_PATH = os.path.join(PROJECT_ROOT, 'model', 'id2label.json')
TRAIN_DATASET_PATH = os.path.join(DATA_PATH, 'train_dataset.pt')
TEST_DATASET_PATH = os.path.join(DATA_PATH, 'test_dataset.pt')

print("\nüöÄ SETUP COMPLETE. You can now run the 'Load Data' cell.")

‚úÖ Added '/content/llm-skill-extractor/src' to system path.
üîÑ Temporarily changed directory to: /content/llm-skill-extractor/src
‚úÖ 'preprocessing' module imported successfully!
‚û°Ô∏è Directory restored to: /content

üöÄ SETUP COMPLETE. You can now run the 'Load Data' cell.


# 4. Reading the Data

**Objective:** Load the raw JSON data into a Pandas DataFrame and perform initial data quality filtering.

**What I am doing:** I am reading my raw data into a table so I can look at it.

**What I did:**
1.  **Load JSON:** I read the `annotated.json` file.
2.  **Filter:** I removed empty rows. This stops errors from happening later.

In [5]:
# PROJECT STEP 1: Load and parse the annotated dataset
logger.info(f"üìñ Reading data from {INPUT_FILE_PATH}...")

try:
    # Load JSON into a Pandas DataFrame
    df = pd.read_json(INPUT_FILE_PATH)

    # Filter out rows that don't have valid annotations (prevent errors later)
    initial_count = len(df)
    df = df[df.annotations.apply(lambda x: isinstance(x, list) and len(x) > 0 and isinstance(x[0], dict) and len(x[0].get('result', [])) > 0)]

    logger.info(f"‚úÖ Successfully loaded {len(df)} rows (filtered from {initial_count}).")

    # Display the first few rows to check
    display(df.head())

except Exception as e:
    logger.error(f"‚ùå Error loading data: {e}")

Unnamed: 0,id,annotations,file_upload,drafts,predictions,data,meta,created_at,updated_at,inner_id,total_annotations,cancelled_annotations,total_predictions,comment_count,unresolved_comment_count,last_comment_updated_at,project,updated_by,comment_authors
0,1,"[{'id': 10, 'completed_by': 2, 'result': [{'va...",e6d47862-240625_content_clean_Kopie.json,[],[],{'duplicate_group': '50481aee-11dd-4167-bc63-8...,{},2024-06-25 11:28:24.727849+00:00,2024-06-25 11:50:28.913292+00:00,1,1,0,0,0,0,NaT,1,2,[]
1,2,"[{'id': 7, 'completed_by': 4, 'result': [{'val...",e6d47862-240625_content_clean_Kopie.json,[],[],{'duplicate_group': '609e314f-6ed2-4876-beb2-0...,{},2024-06-25 11:28:24.727981+00:00,2024-06-25 11:44:47.858773+00:00,2,1,0,0,0,0,NaT,1,4,[]
2,3,"[{'id': 11, 'completed_by': 2, 'result': [{'va...",e6d47862-240625_content_clean_Kopie.json,[],[],{'duplicate_group': 'ed0ea7e0-253f-40a9-8e09-4...,{},2024-06-25 11:28:24.728044+00:00,2024-06-25 11:55:00.450712+00:00,3,1,0,0,0,0,NaT,1,2,[]
3,4,"[{'id': 12, 'completed_by': 4, 'result': [{'va...",e6d47862-240625_content_clean_Kopie.json,[],[],{'duplicate_group': '5cd13b12-9c04-4477-9dc7-8...,{},2024-06-25 11:28:24.728132+00:00,2024-06-25 11:56:05.915975+00:00,4,1,0,0,0,0,NaT,1,4,[]
4,5,"[{'id': 13, 'completed_by': 2, 'result': [{'va...",e6d47862-240625_content_clean_Kopie.json,"[{'id': 1049, 'user': 'marcel.blattner@x28.ch'...",[],{'duplicate_group': '0fa3d5dd-eaa6-4134-bad9-2...,{},2024-06-25 11:28:24.728189+00:00,2024-06-25 11:57:00.795412+00:00,5,1,0,0,0,0,NaT,1,2,[]


# 5. Processing the Text

**Objective:** Transform raw text and annotations into tokenized, BERT-compatible sequences using a sliding window approach.

**Why I need this:**
* **The Issue:** BERT cannot read very long texts (more than 512 tokens). Many of my job ads are too long.
* **The Fix:** I used a "Sliding Window". This cuts the long text into smaller pieces that overlap. The code handles the warning about "sequence length" automatically.

**What I did:**
* I tokenized the text.
* I split long documents into chunks.
* I matched the labels to the correct tokens.

In [6]:
# PROJECT STEPS 2, 3, & 4:
# - Tokenize text using BertTokenizerFast
# - Align character-level labels with tokens
# - Handle long sequences using the sliding window approach
logger.info("‚öôÔ∏è Starting preprocessing (Tokenization & Label Alignment)...")

try:
    # preprocess_data is the function we imported from your 'preprocessing.py' file
    # this function performs all the steps above.
    processed_data, label2id = preprocess_data(df)

    logger.info("‚úÖ Preprocessing complete!")
    logger.info(f"üì¶ Generated {len(processed_data)} total sequences (chunks).")
    logger.info(f"üè∑Ô∏è  Labels found: {list(label2id.keys())}")

except Exception as e:
    logger.error(f"‚ùå Error during preprocessing: {e}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (575 > 512). Running this sequence through the model will result in indexing errors


# 6. Saving the Labels

**Objective:** Generate and persist the `label2id` and `id2label` mappings to ensure consistent decoding of model predictions.

**Why I need this:**
I need to know which number matches which label (e.g., 0 = "Anstellung"). I will need these files later to understand what the model predicts.

In [7]:
# PROJECT STEP 5 (Part A): Generate and save label2id.json and id2label.json
import json

# Create inverse mapping (ID -> Label)
id2label = {i: label for label, i in label2id.items()}

# Ensure the output directory exists
os.makedirs(os.path.dirname(LABEL_MAPPING_PATH), exist_ok=True)

# Save label2id.json
with open(LABEL_MAPPING_PATH, 'w') as f:
    json.dump(label2id, f, indent=2)
logger.info(f"üíæ Saved label2id to: {LABEL_MAPPING_PATH}")

# Save id2label.json
with open(ID2LABEL_PATH, 'w') as f:
    json.dump(id2label, f, indent=2)
logger.info(f"üíæ Saved id2label to: {ID2LABEL_PATH}")

# 7. Saving the Datasets

**Objective:** Split the processed data into training and evaluation sets, convert them into PyTorch TensorDatasets, and save them to disk for the training phase.

**What I am doing:** I am saving the final data so it is ready for training.

**What I did:**
1.  **Split:** I separated the data: 80% for training and 20% for testing.
2.  **Convert:** I turned the data into PyTorch format (TensorDataset).
3.  **Save:** I saved the files to the disk (`.pt` files). Now I can load them quickly in the next notebook.

In [8]:
# PROJECT STEP 5 (Part B): Generate and save PyTorch train_dataset and test_dataset
from sklearn.model_selection import train_test_split
import torch

# 1. Split data into Training and Testing sets (80/20 split)
train_data, test_data = train_test_split(processed_data, test_size=0.2, random_state=42)
logger.info(f"Training chunks: {len(train_data)}")
logger.info(f"Testing chunks:  {len(test_data)}")

# 2. Convert to PyTorch TensorDatasets
# This handles padding so all sequences in a batch are the same length
logger.info("üîÑ Converting to PyTorch Datasets...")
train_dataset = create_dataset(train_data, label2id)
test_dataset = create_dataset(test_data, label2id)

# 3. Save the final datasets
logger.info("üíæ Saving datasets to disk...")
torch.save(train_dataset, TRAIN_DATASET_PATH)
torch.save(test_dataset, TEST_DATASET_PATH)

logger.info("üéâ Data Preparation Finished Successfully!")
logger.info(f"Train Dataset saved to: {TRAIN_DATASET_PATH}")
logger.info(f"Test Dataset saved to:  {TEST_DATASET_PATH}")

# 8. Saving Outputs to Google Drive

**Objective:** Persist the generated datasets and label mappings to permanent cloud storage to prevent data loss upon session termination.

**Why I need this:**
* **The Issue:** Google Colab is temporary. If I close this tab or disconnect, all the files I just created (the datasets and maps) will be deleted immediately.
* **The Fix:** I must copy these files to my Google Drive. This way, they are safe, and I can load them easily when I start the next notebook for training.

**What I did:**
1.  **Create Folder:** I made a new folder in my Google Drive called `processed_data`.
2.  **Copy Files:** I copied the 4 critical files (`train_dataset.pt`, `test_dataset.pt`, `label2id.json`, `id2label.json`) into that folder.

In [9]:
# CELL 8: Save Outputs to Google Drive (FIXED)
import shutil
import os

# 1. Define where we want to save the results in your Drive
# I am creating a new folder called 'processed_data' in your GEN03 folder
drive_save_path = '/content/drive/MyDrive/GEN03/processed_data'

# 2. Create the folder if it doesn't exist
os.makedirs(drive_save_path, exist_ok=True)

# 3. Use the ABSOLUTE PATH VARIABLES defined in Cell 3!
# This ensures the script always finds the files, regardless of the current directory.
files_to_save = [
    LABEL_MAPPING_PATH,
    ID2LABEL_PATH,
    TRAIN_DATASET_PATH,
    TEST_DATASET_PATH
]

# 4. Copy them
print(f"üöÄ Backing up files to: {drive_save_path}")

for full_source_path in files_to_save:
    filename = os.path.basename(full_source_path)
    destination = os.path.join(drive_save_path, filename)

    if os.path.exists(full_source_path):
        # We don't need os.path.abspath here since the variables are already absolute
        shutil.copy(full_source_path, destination)
        print(f"‚úÖ Saved: {filename}")
    else:
        # This should no longer happen if Cells 5, 6, and 7 ran successfully
        print(f"‚ö†Ô∏è Could not find: {filename} at path: {full_source_path}")

print("\nüéâ Everything is saved to my Google Drive! I can safely close this tab.")

üöÄ Backing up files to: /content/drive/MyDrive/GEN03/processed_data
‚úÖ Saved: label2id.json
‚úÖ Saved: id2label.json
‚úÖ Saved: train_dataset.pt
‚úÖ Saved: test_dataset.pt

üéâ Everything is saved to my Google Drive! I can safely close this tab.
