# Text Vectorization with TF-IDF


This notebook vectorizes the cleaned text data with **TF-IDF** and produces the **training (`Xtrain_matrix.pkl`)** and **test (`Xtest_matrix.pkl`)** feature matrices. It also encodes the target labels as integers (`y_train_encoded.pkl`).  

By the end of this notebook, we will save:  
- **`Xtrain_matrix.pkl`** → TF-IDF matrix for training data  
- **`Xtest_matrix.pkl`** → TF-IDF matrix for test data  
- **`tfidf_vectorizer.pkl`** → The trained TF-IDF vectorizer  
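- **`y_train_encoded.pkl`** → Encoded target labels  
- **`X_train_final_encoded.pkl`** → Training data with the added `prdtypecode_encoded` column  
- **`prdtypecode_mapping.csv`** / **`prdtypecode_mapping.pkl`** → Mapping from original product codes to encoded labels  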



## 1. Load Preprocessed Data
Before applying TF-IDF, we load the final cleaned datasets (`X_train_final.pkl` and `X_test_final.pkl`).  
We also load the target labels (`y_train_final.pkl`) so that we can encode them in a format suitable for model training.  


In [42]:
import os
import pandas as pd
import numpy as np
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer


train_pickle_path = "../../data/processed/X_train_final.pkl"  
test_pickle_path = "../../data/processed/X_test_final.pkl"
y_train_pickle_path = "../../data/processed/y_train_final.pkl" 

# Function to load a Pickle file safely
def load_pickle(file_path, dataset_name):
    if os.path.exists(file_path):
        try:
            data = pd.read_pickle(file_path)
            print(f"Successfully loaded `{dataset_name}` | Shape: {data.shape}\n")
            display(data.head())  # Display first few rows
            return data
        except Exception as e:
            print(f"Error loading `{dataset_name}`: {e}")
    else:
        print(f"File not found: {file_path}")
    return None

# Load the two feature datasets and the target labels
X_train = load_pickle(train_pickle_path, "X_train_final.pkl")
X_test = load_pickle(test_pickle_path, "X_test_final.pkl")
y_train = load_pickle(y_train_pickle_path, "y_train_final.pkl")


# # Extract text column
# train_text = X_train['text']
# test_text = X_test['text']

# print("✅ Data successfully loaded!")
# print(f"Training samples: {train_text.shape[0]}, Test samples: {test_text.shape[0]}")


Successfully loaded `X_train_final.pkl` | Shape: (84916, 8)



Unnamed: 0,designation,description,productid,imageid,prdtypecode,image_name,text,Label
0,Olivia: Personalisiertes Notizbuch / 150 Seite...,,3804725264,1263597046,10,image_1263597046_product_3804725264.jpg,olivia personalisiertes notizbuch seiten punkt...,Adult Books
1,Journal Des Arts (Le) N° 133 Du 28/09/2001 - L...,,436067568,1008141237,2280,image_1008141237_product_436067568.jpg,journal arts art marche salon art asiatique pa...,Magazines
2,Grand Stylet Ergonomique Bleu Gamepad Nintendo...,PILOT STYLE Touch Pen de marque Speedlink est ...,201115110,938777978,50,image_938777978_product_201115110.jpg,grand stylet ergonomique bleu gamepad nintendo...,Video Games Accessories
3,Peluche Donald - Europe - Disneyland 2000 (Mar...,,50418756,457047496,1280,image_457047496_product_50418756.jpg,peluche donald europe disneyland marionnette d...,Toys for Children
4,La Guerre Des Tuques,Luc a des id&eacute;es de grandeur. Il veut or...,278535884,1077757786,2705,image_1077757786_product_278535884.jpg,guerre tuques luc idees grandeur veut organise...,Books


Successfully loaded `X_test_final.pkl` | Shape: (13812, 6)



Unnamed: 0,designation,description,productid,imageid,image_name,text
84916,Folkmanis Puppets - 2732 - Marionnette Et Théâ...,,516376098,1019294171,image_1019294171_product_516376098.jpg,folkmanis puppets marionnette theatre mini turtle
84917,Porte Flamme Gaxix - Flamebringer Gaxix - 136/...,,133389013,1274228667,image_1274228667_product_133389013.jpg,porte flamme gaxix flamebringer twilight dragons
84918,Pompe de filtration Speck Badu 95,,4128438366,1295960357,image_1295960357_product_4128438366.jpg,pompe filtration speck badu
84919,Robot de piscine électrique,<p>Ce robot de piscine d&#39;un design innovan...,3929899732,1265224052,image_1265224052_product_3929899732.jpg,robot piscine electrique robot design innovant...
84920,Hsm Destructeur Securio C16 Coupe Crois¿E: 4 X...,,152993898,940543690,image_940543690_product_152993898.jpg,hsm destructeur securio coupe croise


Successfully loaded `y_train_final.pkl` | Shape: (84916, 1)



Unnamed: 0,prdtypecode
0,10
1,2280
2,50
3,1280
4,2705


## 2. Convert and Save Target Labels


The `prdtypecode` column contains **product category codes** (for example 10, 2280, 2705), which are not consecutive.  
We map the 27 distinct codes to integer labels from 0 to 26, in order of first appearance in `y_train`, because many classifiers and loss functions expect consecutive class indices.  
After conversion, we save the encoded labels as `y_train_encoded.pkl`.  
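
As a minimal sketch of this encoding on toy data (the seven codes are the first rows of `y_train` shown above; the sorted variant is an optional alternative, not what this notebook uses), the mapping can be built in order of first appearance or in sorted order. The sorted variant assigns the same index to a code regardless of how the rows are ordered.

```python
# Toy target values: the first seven prdtypecode entries of y_train
codes = [10, 2280, 50, 1280, 2705, 2280, 10]

# Order of first appearance (the approach used in this notebook).
# dict.fromkeys keeps the first occurrence of each code, in order.
unique_in_order = list(dict.fromkeys(codes))
appearance_mapping = {code: i for i, code in enumerate(unique_in_order)}

# Sorted order (alternative: independent of row order)
sorted_mapping = {code: i for i, code in enumerate(sorted(set(codes)))}

# Apply the notebook-style mapping to every row
encoded = [appearance_mapping[code] for code in codes]

print(appearance_mapping)
print(sorted_mapping)
print(encoded)
```

Whichever variant is chosen, the mapping has to be saved with the labels, since the encoded integers are meaningless without it.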


### 2.1 Category Labels from the Previous Notebook: *5_WordClouds_for_Text_Product_Categories.ipynb*

In [43]:
import os
import pandas as pd

# From previous Notebook : 5_WordClouds_for_Text_Product_Categories
dict_code_label = {
    10: "Adult Books",
    40: "Imported Video Games",
    50: "Video Games Accessories",
    60: "Games and Consoles",
    1140: "Figurines and Toy Pop",
    1160: "Playing Cards",
    1180: "Figurines, Masks, and Role-Playing Games",
    1280: "Toys for Children",
    1281: "Board Games",
    1300: "Remote Controlled Models",
    1301: "Accessories for Children",
    1302: "Toys, Outdoor Playing, and Clothes",
    1320: "Early Childhood",
    1560: "Interior Furniture and Bedding",
    1920: "Interior Accessories",
    1940: "Food",
    2060: "Decoration Interior",
    2220: "Supplies for Domestic Animals",
    2280: "Magazines",
    2403: "Children Books and Magazines",
    2462: "Games",
    2522: "Stationery",
    2582: "Furniture, Kitchen, and Garden",
    2583: "Piscine and Spa",
    2585: "Gardening and DIY",
    2705: "Books",
    2905: "Online Distribution of Video Games"
}

### 2.2 Convert product codes to numerical labels

In [44]:
# Convert product codes to numerical labels
prdtypecodes = list(y_train['prdtypecode'].unique())  # Extract unique product categories
target_mapping = {code: i for i, code in enumerate(prdtypecodes)}  # Create mapping {prdtypecode: numeric_label}

# Apply mapping to create numerical target labels
y_train_encoded = y_train['prdtypecode'].map(target_mapping)

# Ensure integer format
y_train_encoded = y_train_encoded.astype('int64')
# ─────────────────────────────────────────────────────────────────────────────── #

X_train["prdtypecode_encoded"] = X_train["prdtypecode"].map(target_mapping)

# Ensure integer format
X_train["prdtypecode_encoded"] = X_train["prdtypecode_encoded"].astype("int64")

# Reorder columns to place "prdtypecode_encoded" right after "prdtypecode"
columns_order = [
    "designation", "description", "productid", "imageid", "prdtypecode", 
    "prdtypecode_encoded", "Label", "image_name", "text"
]

# Apply the new column order
X_train = X_train[columns_order]

# Display the comparison between original and encoded labels
# ─────────────────────────────────────────────────────────────────────────────── #
comparison_df_y_train = pd.DataFrame({
    "Original prdtypecode": y_train['prdtypecode'].head(10),  # 10 first rows before encoding
    "prdtypecode_encoded": y_train_encoded.head(10)  # 10 first rows after encoding
})

print("y_train - Comparison: Original vs. Encoded Labels")
display(comparison_df_y_train)

# Display the same comparison for X_train
comparison_df_X_train = pd.DataFrame({
    "Original prdtypecode": X_train['prdtypecode'].head(10),  # First 10 rows before encoding
    "prdtypecode_encoded": X_train['prdtypecode_encoded'].head(10)  # First 10 rows after encoding
})
print("X_train - Comparison: Original vs. Encoded Labels")
display(comparison_df_X_train)




y_train - Comparison: Original vs. Encoded Labels


Unnamed: 0,Original prdtypecode,prdtypecode_encoded
0,10,0
1,2280,1
2,50,2
3,1280,3
4,2705,4
5,2280,1
6,10,0
7,2522,5
8,1280,3
9,2582,6


X_train - Comparison: Original vs. Encoded Labels


Unnamed: 0,Original prdtypecode,prdtypecode_encoded
0,10,0
1,2280,1
2,50,2
3,1280,3
4,2705,4
5,2280,1
6,10,0
7,2522,5
8,1280,3
9,2582,6
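
The encoded labels are only useful if they can be turned back into product codes later, for example when reporting predictions. The sketch below shows one way to invert the mapping. It uses a five-entry subset of the mapping shown in the tables above, and `predicted_labels` is a made-up stand-in for model output.

```python
# Subset of the mapping built in this section (values from the table above)
target_mapping_example = {10: 0, 2280: 1, 50: 2, 1280: 3, 2705: 4}

# Matching subset of dict_code_label
label_names_example = {
    10: "Adult Books",
    2280: "Magazines",
    50: "Video Games Accessories",
    1280: "Toys for Children",
    2705: "Books",
}

# Invert the mapping: encoded label -> original prdtypecode
inverse_mapping = {label: code for code, label in target_mapping_example.items()}

# Hypothetical model output (encoded labels)
predicted_labels = [4, 0, 1]

predicted_codes = [inverse_mapping[label] for label in predicted_labels]
predicted_names = [label_names_example[code] for code in predicted_codes]

print(predicted_codes)
print(predicted_names)
```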


## 3. Apply TF-IDF Vectorization

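TF-IDF (Term Frequency - Inverse Document Frequency) converts each text into a numerical vector: a term's weight grows with its frequency in the document and shrinks as the term appears in more documents.  
We set `max_features=5000`, which keeps only the 5,000 most frequent terms of the training corpus and bounds the size of the feature matrix.  
We call **`fit_transform()`** on `train_text` and **`transform()`** on `test_text` with the same fitted vectorizer, so the vocabulary and IDF weights are learned from the training data only.  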
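
To make the weighting concrete, here is a minimal, library-free sketch of TF-IDF under the defaults documented for scikit-learn's `TfidfVectorizer`: smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The helper names `tfidf_fit` and `tfidf_transform` and the toy documents are illustrative assumptions. The sketch splits on whitespace instead of using the vectorizer's token pattern, so its numbers are not meant to match the notebook's output.

```python
import math
from collections import Counter


def tfidf_fit(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def tfidf_transform(docs, vocab, idf):
    """Vectorize documents with an already fitted vocabulary and IDF weights."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())  # terms outside vocab are simply ignored
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return rows


train_docs = ["peluche donald disneyland", "robot piscine electrique", "robot jouet"]
test_docs = ["robot peluche inconnu"]

# Fit on training text only, then reuse the fitted state for the test text
vocab, idf = tfidf_fit(train_docs)
train_rows = tfidf_transform(train_docs, vocab, idf)
test_rows = tfidf_transform(test_docs, vocab, idf)

print(vocab)
print([round(v, 3) for v in test_rows[0]])
```

The test document contains `inconnu`, which never occurs in the training text, so it contributes nothing to the test vector. This is the same behavior that `transform()` has on unseen words.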


In [45]:
# Extract text column
train_text = X_train['text']
test_text = X_test['text']


print(f"Training samples: {train_text.shape[0]}")
print(f"Testing samples: {test_text.shape[0]}")

Training samples: 84916
Testing samples: 13812


In [46]:
# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)  # Limit vocabulary size for efficiency

# Fit and transform on training text
X_train_matrix = tfidf.fit_transform(train_text)
print("Shape of the TF-IDF X_train Matrix:", X_train_matrix.shape)

# Transform test text using the same vectorizer
X_test_matrix = tfidf.transform(test_text)
print("Shape of the TF-IDF X_test Matrix:", X_test_matrix.shape)

# Display a sample of the TF-IDF X_train matrix
print("Sample of the TF-IDF X_train Matrix:")
print(X_train_matrix[:5, :5].toarray())  # Displaying a small portion of the matrix (5 rows and 5 columns)

# Display the corresponding words (vocabulary terms) for the sample
print("\nCorresponding feature names (terms) for the sample:")
print(tfidf.get_feature_names_out()[:5])  # Display the first 5 feature names

# Display a sample of the TF-IDF X_test matrix
print("\nSample of the TF-IDF X_test Matrix:")
print(X_test_matrix[:5, :5].toarray())  # Displaying a small portion of the matrix (5 rows and 5 columns)

# Display the corresponding words (vocabulary terms) for the sample
print("\nCorresponding feature names (terms) for the sample:")
print(tfidf.get_feature_names_out()[:5])  # Display the first 5 feature names


Shape of the TF-IDF X_train Matrix: (84916, 5000)
Shape of the TF-IDF X_test Matrix: (13812, 5000)
Sample of the TF-IDF X_train Matrix:
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Corresponding feature names (terms) for the sample:
['aaa' 'abat' 'aberration' 'ability' 'abord']

Sample of the TF-IDF X_test Matrix:
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Corresponding feature names (terms) for the sample:
['aaa' 'abat' 'aberration' 'ability' 'abord']
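
The all-zero 5×5 samples above are expected: the matrix is sparse, and the first five vocabulary terms do not occur in the first five documents. A more informative check is to list the highest-weighted terms of a single row. The helper below is a sketch (the name `top_terms` and the toy values are assumptions) that works on the `indices` and `data` arrays of one sparse row.

```python
def top_terms(indices, values, feature_names, k=5):
    """Return the k highest-weighted (term, weight) pairs of one sparse row."""
    # Pair each non-zero weight with its column index, largest weight first
    pairs = sorted(zip(values, indices), reverse=True)[:k]
    return [(feature_names[i], round(float(w), 4)) for w, i in pairs]


# Toy stand-ins for one row of a sparse matrix (hypothetical values)
feature_names = ["aaa", "abat", "notizbuch", "olivia", "seiten"]
indices = [3, 2, 4]          # columns that hold non-zero weights
values = [0.61, 0.52, 0.33]  # the corresponding TF-IDF weights

print(top_terms(indices, values, feature_names, k=2))
```

In this notebook, the same helper could be called as `row = X_train_matrix[0]` followed by `top_terms(row.indices, row.data, tfidf.get_feature_names_out())`, assuming `X_train_matrix` is the SciPy CSR matrix returned by `fit_transform()`.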


## 4. Save Encoded Labels, Product Mapping, and TF-IDF Data
 
To ensure **reproducibility** and **efficient data handling**, we save the following files:

✔ **`y_train_encoded.pkl`** → Encoded target labels for training data.  
✔ **`X_train_final_encoded.pkl`** → Training dataset with the new `prdtypecode_encoded` column and the updated column order.  
✔ **`prdtypecode_mapping.csv`** → CSV file mapping original product codes to encoded targets.    
✔ **`prdtypecode_mapping.pkl`** → Pickle version of the same mapping for easier loading.  
✔ **`Xtrain_matrix.pkl`** → TF-IDF representation of the training data.    
✔ **`Xtest_matrix.pkl`** → TF-IDF representation of the test data.    
✔ **`tfidf_vectorizer.pkl`** → The trained vectorizer, ensuring we apply the same transformation later.


In [47]:
import os
import pickle
import pandas as pd  # Ensure Pandas is imported

# Define the output directory for saving files
TEXT_PROCESSED_DIR  = "../../data/processed/text/" # For text-related data
GENERAL_PROCESSED_DIR = "../../data/processed/"    # For general processed data

os.makedirs(TEXT_PROCESSED_DIR, exist_ok=True)  # Create the directory if it doesn't exist
os.makedirs(GENERAL_PROCESSED_DIR, exist_ok=True)  # Create the directory if it doesn't exist

# ─────────────────────────────────────────────────────────────────────────────── #
# Save encoded target labels
label_path = os.path.join(TEXT_PROCESSED_DIR, "y_train_encoded.pkl")
y_train_encoded.to_pickle(label_path)
print(f"\n[✔] Target labels saved at: {label_path}")

# ─────────────────────────────────────────────────────────────────────────────── #
# Save the updated X_train with the encoded target
X_train_encoded_path = os.path.join(GENERAL_PROCESSED_DIR, "X_train_final_encoded.pkl")
X_train.to_pickle(X_train_encoded_path)

print(f"\n[✔] X_train with encoded target saved at: {X_train_encoded_path}")

# ─────────────────────────────────────────────────────────────────────────────── #
#  Create a mapping between original product codes and encoded targets
mapping_df = pd.DataFrame(list(target_mapping.items()), columns=["Original prdtypecode", "Encoded target"])

# Add product labels from dictionary mapping
mapping_df["Label"] = mapping_df["Original prdtypecode"].map(dict_code_label)

#  Save the mapping as CSV and Pickle for future reference
mapping_csv_path = os.path.join(TEXT_PROCESSED_DIR, "prdtypecode_mapping.csv")
mapping_pkl_path = os.path.join(TEXT_PROCESSED_DIR, "prdtypecode_mapping.pkl")

mapping_df.to_csv(mapping_csv_path, index=False)
mapping_df.to_pickle(mapping_pkl_path)

print(f"\n[✔] Mapping saved as CSV: {mapping_csv_path}")
print(f"[✔] Mapping saved as Pickle: {mapping_pkl_path}")

# ─────────────────────────────────────────────────────────────────────────────── #
#  Define file paths for matrices and the trained TF-IDF vectorizer
file_paths = {
    "X_train_matrix": os.path.join(TEXT_PROCESSED_DIR, "Xtrain_matrix.pkl"),
    "X_test_matrix": os.path.join(TEXT_PROCESSED_DIR, "Xtest_matrix.pkl"),
    "TF-IDF vectorizer": os.path.join(TEXT_PROCESSED_DIR, "tfidf_vectorizer.pkl")
}

#  Save matrices and TF-IDF vectorizer
objects_to_save = {
    "X_train_matrix": X_train_matrix,
    "X_test_matrix": X_test_matrix,
    "TF-IDF vectorizer": tfidf
}

try:
    for name, obj in objects_to_save.items():
        # A context manager ensures each file is flushed and closed
        with open(file_paths[name], "wb") as f:
            pickle.dump(obj, f)

    #  Verify if files were successfully saved
    for name, path in file_paths.items():
        if os.path.exists(path):
            print(f"[✔] {name} saved at: {path}")
        else:
            print(f"[X] Error: {name} was not saved.")

except Exception as e:
    print(f"[X] Error during saving: {e}")



[✔] Target labels saved at: ../../data/processed/text/y_train_encoded.pkl

[✔] X_train with encoded target saved at: ../../data/processed/X_train_final_encoded.pkl

[✔] Mapping saved as CSV: ../../data/processed/text/prdtypecode_mapping.csv
[✔] Mapping saved as Pickle: ../../data/processed/text/prdtypecode_mapping.pkl
[✔] X_train_matrix saved at: ../../data/processed/text/Xtrain_matrix.pkl
[✔] X_test_matrix saved at: ../../data/processed/text/Xtest_matrix.pkl
[✔] TF-IDF vectorizer saved at: ../../data/processed/text/tfidf_vectorizer.pkl


In [37]:
# Reload labels and mapping to verify correctness
y_train_encoded_check = pd.read_pickle(label_path)
mapping_df_check = pd.read_pickle(mapping_pkl_path)

print("\nVerification: First 5 rows of reloaded labels:")
print(y_train_encoded_check.head())

print("\nVerification: Mapping (first 5 rows):")
print(mapping_df_check.head())


Verification: First 5 rows of reloaded labels:
0    0
1    1
2    2
3    3
4    4
Name: prdtypecode, dtype: int64

Verification: Mapping (first 5 rows):
   Original prdtypecode  Encoded target                    Label
0                    10               0              Adult Books
1                  2280               1                Magazines
2                    50               2  Video Games Accessories
3                  1280               3        Toys for Children
4                  2705               4                    Books
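
The check above covers the labels and the mapping. The matrices and the vectorizer can be verified the same way with a pickle round trip. The sketch below uses hypothetical helpers (`save_pickle_file`, `load_pickle_file`) and a small stand-in object written to a temporary directory, so it does not touch the real files.

```python
import os
import pickle
import tempfile


def save_pickle_file(obj, path):
    """Write an object to disk, closing the file when done."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle_file(path):
    """Read an object back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.pkl")
    original = {"shape": (84916, 5000), "max_features": 5000}
    save_pickle_file(original, demo_path)
    restored = load_pickle_file(demo_path)

print(restored == original)
```

In this notebook, `load_pickle_file(file_paths["X_train_matrix"]).shape` would be expected to equal `(84916, 5000)`, and the reloaded vectorizer should return the same feature names as `tfidf`.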


## 5. 🔄 Next Steps

Now that we have **vectorized our text data using TF-IDF**, a representation commonly used with classical machine learning models, and **prepared our target labels**, we are ready to move on to the next phase: **Text Data Preparation for Deep Learning Models**.

In the next step, we will focus on **tokenizing and sequencing our text data** to make it suitable for deep learning architectures. This involves:

- **Tokenization**: Converting text into sequences of tokens (words or subwords) that can be processed by neural networks.
- **Sequencing and Padding**: Ensuring all text sequences are of uniform length to efficiently batch and train models.

These processes are crucial for the effective training of deep learning models on textual data.
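
The two steps listed above can be sketched without any library, to show the idea before the next notebook. Everything here is an assumption for illustration: the helper names, whitespace tokenization, the reserved ids (0 for padding, 1 for unknown words), and padding or truncating at the end of each sequence. The next notebook may use a library tokenizer with different conventions.

```python
def build_vocab(texts):
    """Assign an integer id to each word; 0 = padding, 1 = unknown word."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def texts_to_padded_sequences(texts, vocab, max_len):
    """Convert texts to id sequences of exactly max_len entries."""
    sequences = []
    for text in texts:
        # Tokenization: words -> ids, truncated to max_len
        ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()][:max_len]
        # Padding: fill the remainder with the padding id
        ids += [vocab["<pad>"]] * (max_len - len(ids))
        sequences.append(ids)
    return sequences


train_texts = ["pompe filtration speck badu", "robot piscine electrique"]
vocab = build_vocab(train_texts)  # built from training text only

padded = texts_to_padded_sequences(train_texts + ["robot inconnu"], vocab, max_len=5)
for row in padded:
    print(row)
```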

➡️ *Continue with the next notebook:*
**`7_DL_Text_Tokenization_and_Sequencing.ipynb`**
  
