# Text Cleaning


## **Notebook Objective**  
This notebook focuses on **cleaning and preprocessing text data** to improve the quality of textual features for modeling.  
Text cleaning is essential in **Natural Language Processing (NLP)** to ensure consistency, remove noise, and prepare data for classification models.

### **Key Steps**
✔ **Loading structured datasets** → Importing training and test datasets containing product features, labels, and image filenames.  
- **`X_train_full_img.pkl`** → Training set with text fields, target labels (y_train), and associated image filenames.  
- **`X_test_img_sub.pkl`** → Test set with text fields and generated image filenames.  
  
✔ **Cleaning product titles (`designation`)** → Standardizing before merging.  
✔ **Creating** a unified **text** column → Combining `designation` and `description` to enrich textual features and handle missing values, since 35.1% of `description` values are missing in the training data and 35.4% in the submission data.  
✔ **Applying full text cleaning pipeline** → Lowercasing, removing accents, filtering noise, and more.  
✔ **Finalizing and saving** → Preparing cleaned text for modeling.  
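
The missing-value shares quoted above can be checked with a one-line pandas computation. The snippet below is a minimal sketch on a toy frame; `missing_share` and `toy` are hypothetical names used only for illustration, and the toy rows are not the real dataset.

```python
import pandas as pd

# Toy frame standing in for X_train; the real data comes from the pickle files.
toy = pd.DataFrame({
    "designation": ["Stylet Wii U", "Peluche Donald", "La Guerre Des Tuques", "Notizbuch"],
    "description": ["Stylet ergonomique", None, "Luc a des idees de grandeur", None],
})

def missing_share(df: pd.DataFrame, column: str) -> float:
    """Return the percentage of missing (NaN/None) values in `column`."""
    return float(df[column].isna().mean() * 100)

print(f"description missing: {missing_share(toy, 'description'):.1f}%")
```

Note that `isna()` only counts NaN/None. Empty strings would need an extra check if they also occur in the data.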


## 1. Load Pickle Files (X_train_full_img.pkl & X_test_img_sub.pkl)

###  Import Required Libraries 

In [46]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
import os
from pathlib import Path
import re
import pickle
import html
import importlib
import string
import pandas as pd
from nltk.corpus import stopwords

#### Setting Up Project Paths and Configurations

In [47]:
# Get the current notebook directory
CURRENT_DIR = Path(os.getcwd()).resolve()

# Locate the project root: parents[2] is three levels above the notebook directory
PROJECT_ROOT = CURRENT_DIR.parents[2]

# Add project root to sys.path
sys.path.append(str(PROJECT_ROOT))

# Function to get relative paths from project root
def get_relative_path(absolute_path):
    return str(Path(absolute_path).relative_to(PROJECT_ROOT))

# Print project root directory
print(f"Project Root Directory: {PROJECT_ROOT.name}")  # Display only the root folder name

import config  # Now Python can find config.py

Project Root Directory: Append_Data_Engineer_AWS_MLOPS
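
The fixed `parents[2]` index only works when the notebook is launched from one specific folder depth. One possible alternative, sketched below, walks upward until a marker file is found. It assumes `config.py` sits in the project root (as the `import config` call suggests); `find_project_root` is a hypothetical helper, not part of this project.

```python
import tempfile
from pathlib import Path

def find_project_root(start: Path, marker: str = "config.py") -> Path:
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found above {start}")

# Demo on a throwaway directory tree: <tmp>/config.py and <tmp>/notebooks/text/eda
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir).resolve()
    (root / "config.py").write_text("INTERIM_DIR = 'data/interim'\n")
    nested = root / "notebooks" / "text" / "eda"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    found_matches_root = found == root

print(found_matches_root)
```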


In [50]:
# Reload config to ensure any updates are applied
importlib.reload(config)

# Define directory for interim text-related pickle files
INTERIM_DIR = Path(config.INTERIM_DIR)

INTERIM_DIR.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

# Define file paths for storing text-related datasets
TRAIN_PICKLE_PATH = INTERIM_DIR / "X_train_full_img.pkl"
TEST_PICKLE_PATH = INTERIM_DIR / "X_test_img_sub.pkl"

# Adjust the display width for columns
pd.set_option('display.max_colwidth', 500)  # Widen columns so long text is not truncated; adjust as needed

# Function to load a Pickle file safely
def load_pickle(file_path, dataset_name):
    if os.path.exists(file_path):
        try:
            data = pd.read_pickle(file_path)
            print(f"Successfully loaded `{dataset_name}` | Shape: {data.shape}\n")
            display(data.head())  # Display first few rows
            return data
        except Exception as e:
            print(f"Error loading `{dataset_name}`: {e}")
    else:
        print(f"File not found: {file_path}")
    return None

# Load both datasets
X_train = load_pickle(TRAIN_PICKLE_PATH, "X_train_full_img.pkl")
X_test_sub = load_pickle(TEST_PICKLE_PATH, "X_test_img_sub.pkl")


Successfully loaded `X_train_full_img.pkl` | Shape: (84916, 6)



Unnamed: 0,designation,description,productid,imageid,prdtypecode,image_name
0,Olivia: Personalisiertes Notizbuch / 150 Seiten / Punktraster / Ca Din A5 / Rosen-Design,,3804725264,1263597046,10,image_1263597046_product_3804725264.jpg
1,Journal Des Arts (Le) N° 133 Du 28/09/2001 - L'art Et Son Marche Salon D'art Asiatique A Paris - Jacques Barrere - Francois Perrier - La Reforme Des Ventes Aux Encheres Publiques - Le Sna Fete Ses Cent Ans.,,436067568,1008141237,2280,image_1008141237_product_436067568.jpg
2,Grand Stylet Ergonomique Bleu Gamepad Nintendo Wii U - Speedlink Pilot Style,PILOT STYLE Touch Pen de marque Speedlink est 1 stylet ergonomique pour GamePad Nintendo Wii U.<br> Pour un confort optimal et une précision maximale sur le GamePad de la Wii U: ce grand stylet hautement ergonomique est non seulement parfaitement adapté à votre main mais aussi très élégant.<br> Il est livré avec un support qui se fixe sans adhésif à l'arrière du GamePad<br> <br> Caractéristiques:<br> Modèle: Speedlink PILOT STYLE Touch Pen<br> Couleur: Bleu<br> Ref. Fabricant: SL-3468-BE<br>...,201115110,938777978,50,image_938777978_product_201115110.jpg
3,Peluche Donald - Europe - Disneyland 2000 (Marionnette À Doigt),,50418756,457047496,1280,image_457047496_product_50418756.jpg
4,La Guerre Des Tuques,Luc a des id&eacute;es de grandeur. Il veut organiser un jeu de guerre de boules de neige et s'arranger pour en &ecirc;tre le vainqueur incontest&eacute;. Mais Sophie s'en m&ecirc;le et chambarde tous ses plans...,278535884,1077757786,2705,image_1077757786_product_278535884.jpg


Successfully loaded `X_test_img_sub.pkl` | Shape: (13812, 5)



Unnamed: 0,designation,description,productid,imageid,image_name
84916,Folkmanis Puppets - 2732 - Marionnette Et Théâtre - Mini Turtle,,516376098,1019294171,image_1019294171_product_516376098.jpg
84917,Porte Flamme Gaxix - Flamebringer Gaxix - 136/220 - U - Twilight Of The Dragons,,133389013,1274228667,image_1274228667_product_133389013.jpg
84918,Pompe de filtration Speck Badu 95,,4128438366,1295960357,image_1295960357_product_4128438366.jpg
84919,Robot de piscine électrique,<p>Ce robot de piscine d&#39;un design innovant et élégant apportera un nettoyage efficace et rapide. Il est conçu pour tous types de revêtements tels que le vinyle le béton la fibre de verre et la céramique.Avec un dispositif de commande il est facile à régler le cycle de nettoyage et le temps. Il suffit de brancher ce robot et le mettre dans la piscine.Ses deux filtres à paniers intérieurs sont mobiles pour retenir les moindres débits et le nettoyage. Ce robot de piscine robuste d&#39;une ...,3929899732,1265224052,image_1265224052_product_3929899732.jpg
84920,Hsm Destructeur Securio C16 Coupe Crois¿E: 4 X 25 Mm,,152993898,940543690,image_940543690_product_152993898.jpg


## **2. Text Cleaning Process**
✔ **Display Raw Data Before Cleaning**  
Before applying any transformations, we inspect the raw `designation` and `description` columns to understand inconsistencies, missing values, and potential noise.

✔ **Cleaning the `designation` Column**  
We apply text preprocessing to standardize product titles before merging:  
- Lowercasing  
- HTML decoding  
- Removing accents  
- Normalizing special characters  
- Filtering unnecessary noise  

✔ **Creating the `text` Column**  
- Combines `designation` and `description` into a single text column for NLP processing.  
- Handles missing values by keeping only `designation` when `description` is empty.  
- Removes duplicate words to reduce redundancy.  
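
The merge step can be sketched as follows. The notebook's actual `text_cleaning.create_clean_text` lives in `src/data_preprocessing` and is not shown here, so `merge_title_and_description` is a hypothetical stand-in. It assumes duplicates are dropped case-insensitively and that the first occurrence of each word is kept.

```python
import pandas as pd

def merge_title_and_description(designation, description) -> str:
    """Join both fields, skip missing ones, and drop repeated words (first occurrence kept)."""
    parts = [str(p) for p in (designation, description) if pd.notna(p) and str(p).strip()]
    seen = set()
    unique_words = []
    for word in " ".join(parts).split():
        key = word.lower()  # compare case-insensitively (an assumption of this sketch)
        if key not in seen:
            seen.add(key)
            unique_words.append(word)
    return " ".join(unique_words)

# Missing description: only the title is used, and the repeated word is dropped.
print(merge_title_and_description("porte flamme gaxix flamebringer gaxix twilight dragons", None))
```

Preserving word order, rather than using an unordered `set` alone, keeps the text readable and suitable for n-gram features.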

✔ **Cleaning the `text` Column**  
Once merged, we apply the full text cleaning pipeline, which includes:  
✔ **Lowercasing** → Ensures case consistency.  
✔ **Decoding HTML entities** → Converts `&eacute;` → `é`.  
✔ **Removing HTML tags** → Strips unwanted `<p>`, `<b>`, etc.  
✔ **Removing accents** → Normalizes characters (e.g., `é` → `e`).  
✔ **Normalizing special characters** → Standardizes dashes, quotes, and punctuation.  
✔ **Keeping only alphabetic characters** → Removes numbers and symbols.  
✔ **Removing punctuation** → Eliminates unnecessary characters.  
✔ **Removing stopwords and short words** → Retains only meaningful words.  
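
The steps listed above can be sketched with the standard library alone. This is not the project's `clean_text_pipeline`, whose source is not shown. `clean_text_sketch`, the tiny `STOPWORDS` set, and the `min_len` threshold are illustrative assumptions; the notebook itself imports NLTK's stopword lists.

```python
import html
import re
import unicodedata

# Tiny illustrative stopword set; a real run would use a full list such as NLTK's.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "et", "un", "une", "pour", "est", "the", "of"}

def clean_text_sketch(text, min_len: int = 3) -> str:
    """Apply the listed cleaning steps to one string; non-string input yields ''."""
    if not isinstance(text, str):
        return ""
    text = html.unescape(text).lower()                # decode entities (&eacute; -> é), then lowercase
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML tags such as <p> or <br>
    text = unicodedata.normalize("NFKD", text)        # split accented letters into base + accent
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents: é -> e
    text = re.sub(r"[^a-z]+", " ", text)              # keep letters only (removes digits, punctuation)
    words = [w for w in text.split() if len(w) >= min_len and w not in STOPWORDS]
    return " ".join(words)

print(clean_text_sketch("<p>Luc a des id&eacute;es de grandeur.</p>"))
```

In this sketch, entities are decoded before tags are stripped, so escaped markup such as `&lt;br&gt;` is removed as well.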

✔ **Display Cleaned Data**  
The cleaned dataset is displayed to verify the transformations.  
The `text` column is now standardized, with most markup and noise removed, and ready for further analysis.  

✅ **Before Cleaning:**  
- Text contains HTML tags, special characters, and inconsistencies.  
- `designation` and `description` may be redundant or incomplete.  

✅ **After Cleaning:**  
- Text is standardized and optimized for NLP models.  
- A clean `text` column is ready for feature extraction and modeling.  


In [51]:
# ============================================================= #
#  IMPORTS & CONFIG RELOAD
# ============================================================= #
import importlib
import config
import src

importlib.reload(config)
importlib.reload(src)

from src.data_preprocessing import text_cleaning
from IPython.display import display

# ============================================================= #
#  DISPLAY RAW DATA BEFORE CLEANING
# ============================================================= #
print("\n[INFO] Displaying sample raw data before cleaning...")

print("\n[BEFORE] Sample raw data (Training Set):")
display(X_train[["designation", "description"]].head(2))

print("\n[BEFORE] Sample raw data (Test Set):")
display(X_test_sub[["designation", "description"]].head(2))

# ============================================================= #
#  CLEANING 'designation' COLUMN
# ============================================================= #
print("\n[INFO] Cleaning 'designation' column...")
X_train["designation"] = X_train["designation"].apply(text_cleaning.clean_text_pipeline)
X_test_sub["designation"] = X_test_sub["designation"].apply(text_cleaning.clean_text_pipeline)

# ============================================================= #
#  CREATING & CLEANING 'text' COLUMN
# ============================================================= #
print("\n[INFO] Creating and cleaning 'text' column...")
X_train["text"] = X_train.apply(lambda row: text_cleaning.create_clean_text(row["designation"], row["description"]), axis=1)
X_test_sub["text"] = X_test_sub.apply(lambda row: text_cleaning.create_clean_text(row["designation"], row["description"]), axis=1)

X_train["text"] = X_train["text"].apply(text_cleaning.clean_text_pipeline)
X_test_sub["text"] = X_test_sub["text"].apply(text_cleaning.clean_text_pipeline)

# ============================================================= #
#  DISPLAY CLEANED DATA
# ============================================================= #
# print("\n[INFO] Displaying sample cleaned data...")

print("\n[AFTER] Sample cleaned data (Training Set):")
display(X_train[["designation", "description", "text"]].head(2))

print("\n[AFTER] Sample cleaned data (Test Set):")
display(X_test_sub[["designation", "description", "text"]].head(2))

print("\n✔ Text cleaning completed successfully!")



[INFO] Displaying sample raw data before cleaning...

[BEFORE] Sample raw data (Training Set):


Unnamed: 0,designation,description
0,Olivia: Personalisiertes Notizbuch / 150 Seiten / Punktraster / Ca Din A5 / Rosen-Design,
1,Journal Des Arts (Le) N° 133 Du 28/09/2001 - L'art Et Son Marche Salon D'art Asiatique A Paris - Jacques Barrere - Francois Perrier - La Reforme Des Ventes Aux Encheres Publiques - Le Sna Fete Ses Cent Ans.,



[BEFORE] Sample raw data (Test Set):


Unnamed: 0,designation,description
84916,Folkmanis Puppets - 2732 - Marionnette Et Théâtre - Mini Turtle,
84917,Porte Flamme Gaxix - Flamebringer Gaxix - 136/220 - U - Twilight Of The Dragons,



[INFO] Cleaning 'designation' column...

[INFO] Creating and cleaning 'text' column...

[AFTER] Sample cleaned data (Training Set):


Unnamed: 0,designation,description,text
0,olivia personalisiertes notizbuch seiten punktraster din rosen design,,olivia personalisiertes notizbuch seiten punktraster din rosen design
1,journal arts art marche salon art asiatique paris jacques barrere francois perrier reforme ventes encheres publiques sna fete cent ans,,journal arts art marche salon asiatique paris jacques barrere francois perrier reforme ventes encheres publiques sna fete cent ans



[AFTER] Sample cleaned data (Test Set):


Unnamed: 0,designation,description,text
84916,folkmanis puppets marionnette theatre mini turtle,,folkmanis puppets marionnette theatre mini turtle
84917,porte flamme gaxix flamebringer gaxix twilight dragons,,porte flamme gaxix flamebringer twilight dragons



✔ Text cleaning completed successfully!


## **3. Saving Cleaned Datasets**  

To streamline the workflow, we save the cleaned datasets as Pickle files to avoid redundant preprocessing in future steps.  

We will store the following updated datasets for future use:  
- **`X_train_cleaned.pkl`** → Includes the cleaned training dataset with the newly created `text` column and a preprocessed `designation` column.  
- **`X_test_sub_cleaned.pkl`** → Includes the cleaned test dataset for submission with the `text` column and a preprocessed `designation` column.  
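
Before relying on the saved files in later notebooks, a quick round-trip check can confirm that a pickle reloads unchanged. The sketch below uses a temporary directory and a toy frame; `save_and_verify` and the file name `X_toy_cleaned.pkl` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_and_verify(df: pd.DataFrame, path: Path) -> bool:
    """Write `df` to `path` as a pickle and confirm it reloads with identical content."""
    df.to_pickle(path)
    reloaded = pd.read_pickle(path)
    return df.equals(reloaded)

toy = pd.DataFrame({"designation": ["mini turtle"], "text": ["mini turtle"]})
with tempfile.TemporaryDirectory() as tmp_dir:
    ok = save_and_verify(toy, Path(tmp_dir) / "X_toy_cleaned.pkl")

print(ok)
```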



In [52]:
# Reload config to ensure any updates are applied
importlib.reload(config)

# Define the directory where Pickle files will be stored
PICKLE_DIR = Path(config.INTERIM_DIR)
PICKLE_DIR.mkdir(parents=True, exist_ok=True)  # Ensure the directory exists

# Define output file paths for the cleaned datasets
# Note: this rebinds TRAIN_PICKLE_PATH, which pointed to the input file in the loading cell
TRAIN_PICKLE_PATH = PICKLE_DIR / "X_train_cleaned.pkl"
TEST_SUB_PICKLE_PATH = PICKLE_DIR / "X_test_sub_cleaned.pkl"


try:
    # Save updated training dataset
    X_train.to_pickle(TRAIN_PICKLE_PATH)
    print(f"Training dataset saved: {TRAIN_PICKLE_PATH}")

    # Save updated test dataset
    X_test_sub.to_pickle(TEST_SUB_PICKLE_PATH)
    print(f"Test dataset saved: {TEST_SUB_PICKLE_PATH}")

except Exception as e:
    print(f"Error saving datasets: {e}")


Training dataset saved: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\interim\X_train_cleaned.pkl
Test dataset saved: D:\Data_Science\Append_Data_Engineer_AWS_MLOPS\Data_Scientist_Rakuten_Project-main\data\interim\X_test_sub_cleaned.pkl


## 4. 🔄 Next Steps  

Now that we have cleaned the text data, the next step is to enhance our understanding of the dataset through **visualizations** and further analysis.

We will utilize the cleaned text data to:
- Create **WordClouds** to visually represent the most frequent words in different classes.
- Identify key labels for each class, such as:
  - 50: 'video games accessories'
  - 2705: 'books'
  - ...and more.

The **WordClouds** will help us to visually explore and understand the dominant themes for each product category.

---
➡️ **Proceed to `5_Text_WordClouds_for_Product_Categories.ipynb`**  
This notebook will focus on generating WordClouds for each product category and analyzing the textual content of the dataset.
