# DL Text Tokenization and Sequencing  

To effectively train **Deep Learning models** for text classification, raw textual data must be transformed into a structured numerical format that neural networks can process. Unlike **traditional Machine Learning models** that leverage **TF-IDF** or **Bag-of-Words (BoW)** for feature extraction, deep learning architectures rely on **embedding-based representations** to capture semantic relationships between words.  

In this notebook, we focus on **preparing text data for deep learning models** by performing the following key steps:  

- **Tokenization**: Converting text into a vocabulary of indexed tokens using Keras' `Tokenizer`.  
- **Sequence Encoding**: Mapping words to their corresponding integer representations.  
- **Padding Sequences**: Ensuring uniform input size by applying zero-padding to sequences.  
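
To make the three steps above concrete, here is a minimal pure-Python sketch on a toy corpus. It is a simplified stand-in for what Keras' `Tokenizer` and `pad_sequences` do, not their implementation. The helper names (`build_vocab`, `encode`, `pad_post`) and the toy texts are illustrative assumptions. Index 0 is reserved for padding and index 1 for out-of-vocabulary words, which mirrors the Keras convention when an `oov_token` is set.

```python
from collections import Counter

def build_vocab(texts, oov_token="<OOV>"):
    """Index words by descending frequency; 0 is reserved for padding, 1 for OOV."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {oov_token: 1}
    for index, (word, _) in enumerate(counts.most_common(), start=2):
        vocab[word] = index
    return vocab

def encode(text, vocab, oov_token="<OOV>"):
    """Replace each word by its integer index; unknown words map to the OOV index."""
    return [vocab.get(word, vocab[oov_token]) for word in text.split()]

def pad_post(seq, maxlen):
    """Zero-pad on the right. This toy helper does not truncate longer sequences."""
    return seq + [0] * max(0, maxlen - len(seq))

toy_texts = ["robot piscine electrique", "pompe filtration piscine"]
vocab = build_vocab(toy_texts)
encoded = [encode(text, vocab) for text in toy_texts]
padded = [pad_post(seq, maxlen=5) for seq in encoded]
print(padded)  # [[3, 2, 4, 0, 0], [5, 6, 2, 0, 0]]
```

The most frequent word (`piscine`) receives the smallest non-reserved index, so a vocabulary cap can later be applied simply by dropping high indices.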

These transformations are required to feed textual data into **Recurrent Neural Networks (RNNs)**, **Long Short-Term Memory Networks (LSTMs)**, **Gated Recurrent Units (GRUs)**, **1D Convolutional Neural Networks (Conv1D)**, and **Deep Neural Networks (DNNs)**. Recurrent layers can in principle process sequences of varying length, but batching samples into a single tensor requires a uniform length, which padding provides. All of these architectures also expect integer token indices as input to their embedding layer.  

By the end of this notebook, we will have a **tokenized and padded dataset**, ready for deep learning model training in the next phase.  


## 1. Import Required Libraries 

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
import os
from pathlib import Path
import pickle
import pandas as pd

from sklearn.model_selection import train_test_split
import tensorflow as tf

import importlib


### Setting Up Project Paths and Configurations

In [2]:
# Get the current notebook directory
CURRENT_DIR = Path(os.getcwd()).resolve()

# Automatically find the project root (two levels up: parents[1] is the grandparent directory)
PROJECT_ROOT = CURRENT_DIR.parents[1]

# Add project root to sys.path
sys.path.append(str(PROJECT_ROOT))

# Function to get relative paths from project root
def get_relative_path(absolute_path):
    return str(Path(absolute_path).relative_to(PROJECT_ROOT))

# Print project root directory
print(f"Project Root Directory: {PROJECT_ROOT.name}")  # Display only the root folder name

import config  # Now Python can find config.py

Project Root Directory: Data_Scientist_Rakuten_Project-main
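
The fixed `parents[1]` lookup above depends on where the notebook is launched from. A possible alternative, sketched here under the assumption that `config.py` sits in the project root (the helper `find_project_root` is hypothetical, not part of the project), is to walk upward until that marker file is found:

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_project_root(start: Path, marker: str = "config.py") -> Optional[Path]:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None  # marker not found anywhere above `start`

# Demonstration on a throwaway directory tree (names are illustrative only)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "project"
    nested = root / "notebooks" / "text"
    nested.mkdir(parents=True)
    (root / "config.py").write_text("# placeholder\n")
    root_found_ok = find_project_root(nested) == root.resolve()
    print(root_found_ok)
```

In the notebook this would be called as `find_project_root(Path.cwd())`, with a check for `None` before appending the result to `sys.path`.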


## 2. Load Preprocessed Data

Before applying tokenization and sequencing, we load the final cleaned datasets (`X_train_final_encoded.pkl` and `X_test_Submission_final.pkl`).  
Additionally, we load the encoded target labels (`y_train_encoded.pkl`) that were prepared during the TF-IDF vectorization step.  
Reusing these labels keeps the class encoding consistent between the machine learning and deep learning models.


In [22]:
importlib.reload(config)  # Reload config to ensure any updates are applied

# Define paths for datasets
train_pickle_path = Path(config.XTRAIN_FINAL_ENCODED_PATH)
# train_pickle_path = Path(config.XTRAIN_FINAL_PATH)
test_pickle_path = Path(config.XTEST_SUB_FINAL_PATH)
y_train_pickle_path = Path(config.YTRAIN_FINAL_PATH)
y_train_encoded_pickle_path = Path(config.YTRAIN_ENCODED_PATH)

# Function to get relative paths from project root
def get_relative_path(absolute_path: Path):
    """Returns the relative path from the project root."""
    return str(absolute_path.relative_to(config.BASE_DIR))

# Function to load a Pickle file safely
def load_pickle(file_path: Path, dataset_name: str):
    """Loads a pickle file with error handling and basic visualization."""
    if not file_path.exists():
        print(f"Error: `{dataset_name}` file not found at {file_path}")
        return None

    try:
        data = pd.read_pickle(file_path)
        print(f"Successfully loaded `{dataset_name}` | Shape: {data.shape}")
        
        # Display first rows if dataset is not empty
        if not data.empty:
            display(data.head())

        return data
    except Exception as e:
        print(f"Error loading `{dataset_name}`: {e}")
        return None

# List of required files with their names
required_files = {
    "Training Dataset": train_pickle_path,
    "Testing Dataset": test_pickle_path,
    "Encoded Training Labels": y_train_encoded_pickle_path,
    "Final Training Labels": y_train_pickle_path
}

# Check if files exist before loading
for name, path in required_files.items():
    if not path.exists():
        raise FileNotFoundError(f"Error: `{name}` file not found at {get_relative_path(path)}")

# Load datasets
X_train_full = load_pickle(train_pickle_path, "X_train_final_encoded.pkl")
y_train_encoded = load_pickle(y_train_encoded_pickle_path, "y_train_encoded.pkl")
X_test_Submission = load_pickle(test_pickle_path, "X_test_Submission_final.pkl") # Submission dataset



Successfully loaded `X_train_final_encoded.pkl` | Shape: (84916, 9)


Unnamed: 0,designation,description,productid,imageid,prdtypecode,prdtypecode_encoded,Label,image_name,text
0,Olivia: Personalisiertes Notizbuch / 150 Seite...,,3804725264,1263597046,10,0,Adult Books,image_1263597046_product_3804725264.jpg,olivia personalisiertes notizbuch seiten punkt...
1,Journal Des Arts (Le) N° 133 Du 28/09/2001 - L...,,436067568,1008141237,2280,1,Magazines,image_1008141237_product_436067568.jpg,journal arts art marche salon art asiatique pa...
2,Grand Stylet Ergonomique Bleu Gamepad Nintendo...,PILOT STYLE Touch Pen de marque Speedlink est ...,201115110,938777978,50,2,Video Games Accessories,image_938777978_product_201115110.jpg,grand stylet ergonomique bleu gamepad nintendo...
3,Peluche Donald - Europe - Disneyland 2000 (Mar...,,50418756,457047496,1280,3,Toys for Children,image_457047496_product_50418756.jpg,peluche donald europe disneyland marionnette d...
4,La Guerre Des Tuques,Luc a des id&eacute;es de grandeur. Il veut or...,278535884,1077757786,2705,4,Books,image_1077757786_product_278535884.jpg,guerre tuques luc idees grandeur veut organise...


Successfully loaded `y_train_encoded.pkl` | Shape: (84916,)


0    0
1    1
2    2
3    3
4    4
Name: prdtypecode, dtype: int64

Successfully loaded `X_test_Submission_final.pkl` | Shape: (13812, 6)


Unnamed: 0,designation,description,productid,imageid,image_name,text
84916,Folkmanis Puppets - 2732 - Marionnette Et Théâ...,,516376098,1019294171,image_1019294171_product_516376098.jpg,folkmanis puppets marionnette theatre mini turtle
84917,Porte Flamme Gaxix - Flamebringer Gaxix - 136/...,,133389013,1274228667,image_1274228667_product_133389013.jpg,porte flamme gaxix flamebringer twilight dragons
84918,Pompe de filtration Speck Badu 95,,4128438366,1295960357,image_1295960357_product_4128438366.jpg,pompe filtration speck badu
84919,Robot de piscine électrique,<p>Ce robot de piscine d&#39;un design innovan...,3929899732,1265224052,image_1265224052_product_3929899732.jpg,robot piscine electrique robot design innovant...
84920,Hsm Destructeur Securio C16 Coupe Crois¿E: 4 X...,,152993898,940543690,image_940543690_product_152993898.jpg,hsm destructeur securio coupe croise


## 3. Apply Tokenization and Sequencing  


******************


### 3.1 Splitting Data into Training and Testing Sets

In [23]:
from sklearn.model_selection import train_test_split

# Define the split parameters
TEST_SIZE = 0.2
RANDOM_STATE = 1234

# Split while keeping all columns
X_train_full, X_test_full = train_test_split(X_train_full, test_size=TEST_SIZE, random_state=RANDOM_STATE)

# Sanity check
print(f" Training Shape : {X_train_full.shape}")
print(f" Testing Shape : {X_test_full.shape}")

# Save the splits
X_train_full.to_pickle(Path(config.PROCESSED_DIR) / "X_train_split.pkl")
X_test_full.to_pickle(Path(config.PROCESSED_DIR) / "X_test_split.pkl")


 Training Shape : (67932, 9)
 Testing Shape : (16984, 9)
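
The split above is purely random, so class proportions may differ slightly between the two parts. The following sketch shows one way to measure that difference using only the standard library. The helper names and toy labels are assumptions for illustration; in the notebook the inputs would be the `prdtypecode_encoded` columns of the two splits.

```python
from collections import Counter

def label_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def max_share_gap(train_labels, test_labels):
    """Largest absolute difference in class share between the two splits."""
    train = label_shares(train_labels)
    test = label_shares(test_labels)
    return max(abs(train.get(k, 0.0) - test.get(k, 0.0)) for k in set(train) | set(test))

# Toy labels standing in for the real encoded product categories
toy_train = [0, 0, 0, 1, 1, 2, 2, 2]
toy_test = [0, 1, 1, 2]
gap = max_share_gap(toy_train, toy_test)
print(f"Largest class-share gap: {gap:.3f}")  # Largest class-share gap: 0.250
```

If the gap turns out to be large on the real data, `train_test_split` accepts a `stratify` argument that preserves class proportions. Using it would produce different splits from the ones saved here.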


In [24]:
X_train_text = X_train_full["text"]
X_test_text = X_test_full["text"]

X_test_Submission_text = X_test_Submission["text"]

y_train_text = X_train_full["prdtypecode_encoded"]
y_test_text = X_test_full["prdtypecode_encoded"]

### 3.2 Define and Train the Tokenizer - Converting Text to Numerical Form

In [37]:
%%time

# Define the tokenizer with a max vocabulary size
MAX_VOCAB_SIZE = 20000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")


# Fit the tokenizer on the training text data
tokenizer.fit_on_texts(X_train_text)

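# Define word-index mappings
word2idx = tokenizer.word_index
idx2word = tokenizer.index_word
# Note: num_words is the configured cap (MAX_VOCAB_SIZE), not the number of
# distinct words seen; len(word2idx) gives the full vocabulary size.
vocab_size = tokenizer.num_words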

# Print vocabulary size
print(f"Vocabulary Size: {vocab_size}")



Vocabulary Size: 20000
Wall time: 2.27 s
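
A vocabulary cap such as `MAX_VOCAB_SIZE = 20000` is a trade-off between embedding size and the share of tokens that fall back to `<OOV>`. One way to judge a candidate cap is to compute how many token occurrences the top-k words cover. The sketch below does this on a toy corpus with plain whitespace splitting, which only approximates the Keras `Tokenizer` defaults. The function name `coverage_at` and the toy texts are assumptions.

```python
from collections import Counter

def coverage_at(texts, top_k):
    """Fraction of token occurrences covered by the top_k most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_k))
    return covered / total if total else 0.0

toy_texts = [
    "robot piscine electrique robot",
    "pompe filtration piscine",
    "peluche donald europe",
]
for k in (1, 3, 5):
    print(k, round(coverage_at(toy_texts, k), 3))
# 1 0.2
# 3 0.5
# 5 0.7
```

On the real training text, a coverage close to 1.0 at the chosen cap would suggest that few tokens are replaced by `<OOV>`. Note that Keras keeps only words whose index is below `num_words`, and indices 0 and 1 are reserved, so the effective number of real words is slightly below the cap.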


### 3.3 Convert Text Data to Tokenized Sequences

In [38]:
# Convert text data to sequences
X_train_text = tokenizer.texts_to_sequences(X_train_text)
X_test_text = tokenizer.texts_to_sequences(X_test_text)
X_test_Submission_text = tokenizer.texts_to_sequences(X_test_Submission_text)  # Submission dataset

# Print example sequences
print("Example tokenized sentence (before padding):", X_train_text[0])


Example tokenized sentence (before padding): [43, 39, 179, 1967, 36, 3316, 4045, 802, 400, 180, 179, 1967, 36, 828, 43, 17, 31, 1595, 938, 1322, 2805, 352, 39, 3252, 932, 614, 1036, 2058, 15, 2738, 1448, 236, 1, 3137, 757, 610, 682, 61, 862, 14877, 113, 2751, 298, 830, 1032, 834, 932, 3379, 515, 1389, 716, 1392, 253, 5624, 680, 2519, 1207, 284, 573, 10709, 2388, 149, 239, 2253, 4404, 237, 21, 1061, 2423, 122, 250, 1151, 179, 144, 249, 510, 4981, 35, 617, 8160, 5970, 810, 2718, 68, 968, 46, 132, 160, 5, 828, 5177, 280, 12424, 3517, 1066, 2859, 10193, 6871, 459, 933, 1404, 447, 66, 9732, 847, 2566, 879, 2790, 46, 1304, 11659, 6442, 8, 4508, 1795, 126, 113, 1952, 1274, 2058, 193, 2, 3232, 5514, 22, 583, 84, 78, 153, 381, 238, 84, 48, 467, 3106]
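
An integer sequence like the one above is hard to inspect by eye. Mapping indices back to words is a quick sanity check. The sketch below uses a small hand-written dictionary in place of `idx2word`. The `decode` helper and the `<UNK>` fallback are illustrative assumptions. Keras' `Tokenizer` also offers `sequences_to_texts` for the same purpose.

```python
def decode(sequence, index_to_word, pad_index=0):
    """Map integer ids back to words, skipping padding zeros."""
    return " ".join(
        index_to_word.get(i, "<UNK>") for i in sequence if i != pad_index
    )

# Toy mapping standing in for tokenizer.index_word
toy_index_word = {1: "<OOV>", 2: "piscine", 3: "robot", 4: "electrique"}
decoded = decode([3, 2, 4, 1, 0, 0], toy_index_word)
print(decoded)  # robot piscine electrique <OOV>
```

Every `<OOV>` in the decoded text marks a word that was either unseen during fitting or ranked beyond the vocabulary cap.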


### 3.4 Apply Padding to Standardize Input Length

In [44]:
# Define the maximum sequence length
maxlen = 500  

# Pad sequences to ensure equal length
X_train_text = tf.keras.preprocessing.sequence.pad_sequences(X_train_text, maxlen=maxlen, padding='post')
X_test_text = tf.keras.preprocessing.sequence.pad_sequences(X_test_text, maxlen=maxlen, padding='post')

X_test_Submission_text = tf.keras.preprocessing.sequence.pad_sequences(X_test_Submission_text, maxlen=maxlen, padding='post')  # Submission dataset

# Print shape to confirm
print(f"Training data shape: {X_train_text.shape}")
print(f"Testing data shape: {X_test_text.shape}")
print(f"Submission data shape: {X_test_Submission_text.shape}")

Training data shape: (67932, 500)
Testing data shape: (16984, 500)
Submission data shape: (13812, 500)
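
Two details of the padding step deserve a closer look: how `maxlen` relates to the actual sequence lengths, and which end of an over-long sequence is cut. With `padding='post'` and the default `truncating='pre'`, `pad_sequences` appends zeros on the right but drops tokens from the beginning of sequences longer than `maxlen`. The pure-Python sketch below mimics that behaviour and computes a length percentile. Both helper names and the toy data are assumptions.

```python
import math

def length_percentile(lengths, pct):
    """Nearest-rank percentile of a non-empty list of lengths (0 < pct <= 100)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def pad_post_truncate_pre(seq, maxlen):
    """Keep the last `maxlen` tokens, then zero-pad on the right."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    kept = seq[-maxlen:]
    return kept + [0] * (maxlen - len(kept))

toy_sequences = [[5, 6], [7, 8, 9, 10, 11], [3], [4, 4, 4]]
lengths = [len(seq) for seq in toy_sequences]
p75 = length_percentile(lengths, 75)
print(p75)                                               # 3
print(pad_post_truncate_pre([7, 8, 9, 10, 11], maxlen=3))  # [9, 10, 11]
print(pad_post_truncate_pre([5, 6], maxlen=3))             # [5, 6, 0]
```

Running the percentile on the real sequence lengths (before padding) would show how many samples `maxlen = 500` truncates. If the start of a product text matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the leading tokens instead.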


### 3.5 Save Processed Data

In [54]:

# Define save paths
TOKENIZER_PATH = Path(config.PROCESSED_TEXT_DIR) / "tokenizer.pkl"
X_TRAIN_SPLIT_TOKENIZED_PATH = Path(config.PROCESSED_TEXT_DIR) / "X_train_TokenizationSequencing.pkl"
X_TEST_SPLIT_TOKENIZED_PATH = Path(config.PROCESSED_TEXT_DIR) / "X_test_TokenizationSequencing.pkl"
Y_TRAIN_SPLIT_PATH  = Path(config.PROCESSED_TEXT_DIR) / "y_train_split.pkl"
Y_TEST_SPLIT_PATH = Path(config.PROCESSED_TEXT_DIR) / "y_test_split.pkl"
X_SUBMISSION_TOKENIZED_PATH = Path(config.PROCESSED_TEXT_DIR) / "X_submission_TokenizationSequencing.pkl"


# Save tokenized & padded datasets using Pickle
with open(TOKENIZER_PATH, "wb") as f:
    pickle.dump(tokenizer, f)
    
    
with open(X_TRAIN_SPLIT_TOKENIZED_PATH, "wb") as f:
    pickle.dump(X_train_text, f)  # padded training sequences

with open(X_TEST_SPLIT_TOKENIZED_PATH, "wb") as f:
    pickle.dump(X_test_text, f)  # padded test sequences
    
with open(Y_TRAIN_SPLIT_PATH, "wb") as f:
    pickle.dump(y_train_text, f)  # encoded training labels
    
with open(Y_TEST_SPLIT_PATH, "wb") as f:
    pickle.dump(y_test_text, f)  # encoded test labels

with open(X_SUBMISSION_TOKENIZED_PATH, "wb") as f:
    pickle.dump(X_test_Submission_text, f)  # padded submission sequences

# Check if files were saved successfully before printing

if TOKENIZER_PATH.exists():
    print(f"\nProcessed tokenizer data saved at: {TOKENIZER_PATH}")
else:
    print("Error: Tokenizer file was not saved!")
    
if X_TRAIN_SPLIT_TOKENIZED_PATH.exists():
    print(f"\nProcessed training data saved at: {X_TRAIN_SPLIT_TOKENIZED_PATH}")
else:
    print("Error: Training data file was not saved!")

if X_TEST_SPLIT_TOKENIZED_PATH.exists():
    print(f"\nProcessed test data saved at: {X_TEST_SPLIT_TOKENIZED_PATH}")
else:
    print("Error: Test data file was not saved!")
    
if Y_TRAIN_SPLIT_PATH.exists():
    print(f"\ny_train data saved at: {Y_TRAIN_SPLIT_PATH}")
else:
    print("Error: y_train data file was not saved!")

if Y_TEST_SPLIT_PATH.exists():
    print(f"\ny_test data saved at: {Y_TEST_SPLIT_PATH}")
else:
    print("Error: y_test data file was not saved!")

if X_SUBMISSION_TOKENIZED_PATH.exists():
    print(f"\nProcessed submission data saved at: {X_SUBMISSION_TOKENIZED_PATH}")
else:
    print("Error: Submission data file was not saved!")


Processed tokenizer data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\tokenizer.pkl

Processed training data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\X_train_TokenizationSequencing.pkl

Processed test data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\X_test_TokenizationSequencing.pkl

y_train data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\y_train_split.pkl

y_test data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\y_test_split.pkl

Processed submission data saved at: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\processed\text\X_submission_TokenizationSequencing.pkl
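
Checking that a file exists confirms that something was written, but not that it can be read back intact. A round-trip check is a slightly stronger test. The sketch below pickles a toy object to a temporary directory and compares the reloaded copy with the original. The helper `save_and_verify` and the toy data are assumptions. For NumPy arrays such as the padded sequences, `==` compares element by element, so `numpy.array_equal` would be the appropriate comparison instead.

```python
import pickle
import tempfile
from pathlib import Path

def save_and_verify(obj, path: Path) -> bool:
    """Pickle `obj` to `path`, reload it, and report whether the copy equals the original."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
    return restored == obj

# Toy padded sequences (plain lists) written to a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    toy_padded = [[3, 2, 4, 0, 0], [5, 6, 2, 0, 0]]
    ok = save_and_verify(toy_padded, Path(tmp) / "toy_sequences.pkl")
    print(ok)  # True
```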


## 4. 🔄 Next Steps  

In this notebook, we preprocessed the text data by tokenizing and padding the sequences, preparing them for deep learning models. The following artifacts have been saved for future use:
- The **tokenizer** object, which contains the word-to-index mapping for tokenization.
- The **tokenized and padded sequences** for the training, test, and submission datasets.
- The **label splits** (`y_train_split.pkl` and `y_test_split.pkl`) that match the training and test sequences.

---
This completes the exploration and preprocessing of the textual data: initial dataset analysis, text cleaning, visualization with word clouds, TF-IDF feature extraction for machine learning models, and tokenization and sequence preparation for deep learning models.

The following notebooks were executed as part of this process:

- **1_Project_and_Data_Overview.ipynb** → Initial project and data exploration
- **2_CSV_Exploration_and_Visualization.ipynb** → CSV data exploration and visualization
- **3_Image_Exploration_and_Visualization.ipynb** → Image dataset analysis
- **4_Text_Cleaning.ipynb** → Text preprocessing and cleaning
- **5_Text_WordClouds_for_Product_Categories.ipynb** → Visualizing text data through word clouds
- **6_ML_Text_Vectorization_TF-IDF.ipynb** → Text feature extraction for machine learning models
- **7_DL_Text_Tokenization_and_Sequencing.ipynb** → Text tokenization and sequence preparation for deep learning models

---
We will now move on to the modeling phase, starting with classic machine learning models for text classification, such as **Logistic Regression** and **Random Forest**, to establish a baseline performance. We will then explore deep learning architectures such as **RNNs**, **LSTMs**, and **GRUs** for more advanced modeling. After optimizing both approaches for text, we will shift our focus to image data using **CNNs** (Convolutional Neural Networks). Once both text and image models are optimized, we will explore a **multimodal** (text + image) approach to enhance classification performance.
