# Text Cleaning  

## 📌 Notebook Objective  

In this notebook, we focus on **cleaning and preprocessing text data** to enhance the quality of textual features for modeling.  

Text cleaning is a crucial step in **Natural Language Processing (NLP)**: it makes the input data **consistent, structured, and less noisy**, which generally **helps model performance** in text-based classification.  

### **Key Steps**  

✔ **Loading structured datasets** → Import cleaned product metadata (`X_train_full_img.pkl` & `X_test_img_sub.pkl`).  
✔ **Creating a unified text column** → Merging designation (title) and description.  
✔ **Cleaning text** → Lowercasing, removing accents, stripping HTML tags, and filtering special characters.  
✔ **Removing stopwords and short words** → Eliminating common, unimportant words in French, English, and German, as well as words with fewer than 3 characters.  
✔ **Finalizing and saving** → Ensuring cleaned text is ready for further tasks and storing it as (`X_train_cleaned.pkl` & `X_test_sub_cleaned.pkl`).

## 1. Load Pickle Files (X_train_full_img.pkl & X_test_img_sub.pkl)

### Import Required Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
import os
from pathlib import Path
import re
import pickle
import html
import importlib
import string
import pandas as pd
from nltk.corpus import stopwords

#### Setting Up Project Paths and Configurations
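
How `Path.parents` is indexed determines which folder is treated as the project root: `parents[0]` is the immediate parent and `parents[1]` is the folder above that. A small sketch with a made-up path (the folder names are hypothetical, not this project's actual layout):

```python
from pathlib import PurePosixPath

# Hypothetical notebook location, used only to illustrate the indexing
notebook_dir = PurePosixPath("/home/user/project/notebooks/text")

one_level_up = notebook_dir.parents[0]   # immediate parent
two_levels_up = notebook_dir.parents[1]  # parent of the parent

print(one_level_up)
print(two_levels_up)
```

If the printed root is not the folder that contains `config.py`, the index is the first thing to check.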

In [None]:
# Get the current notebook directory
CURRENT_DIR = Path(os.getcwd()).resolve()

# Locate the project root: parents[1] is two levels above the current directory
PROJECT_ROOT = CURRENT_DIR.parents[1]

# Add project root to sys.path
sys.path.append(str(PROJECT_ROOT))

# Function to get relative paths from project root
def get_relative_path(absolute_path):
    return str(Path(absolute_path).relative_to(PROJECT_ROOT))

# Print project root directory
print(f"Project Root Directory: {PROJECT_ROOT.name}")  # Display only the root folder name

import config  # Now Python can find config.py

In [None]:
# Reload config to ensure any updates are applied
importlib.reload(config)

# Define directory for interim text-related pickle files
INTERIM_DIR = Path(config.INTERIM_DIR)

INTERIM_DIR.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

# Define file paths for storing text-related datasets
TRAIN_PICKLE_PATH = INTERIM_DIR / "X_train_full_img.pkl"
TEST_PICKLE_PATH = INTERIM_DIR / "X_test_img_sub.pkl"

# Adjust the display width for columns
pd.set_option('display.max_colwidth', 500)  # Adjust as needed

# Function to load a Pickle file safely
def load_pickle(file_path, dataset_name):
    if os.path.exists(file_path):
        try:
            data = pd.read_pickle(file_path)
            print(f"Successfully loaded `{dataset_name}` | Shape: {data.shape}\n")
            display(data.head())  # Display first few rows
            return data
        except Exception as e:
            print(f"Error loading `{dataset_name}`: {e}")
    else:
        print(f"File not found: {file_path}")
    return None

# Load both datasets
X_train = load_pickle(TRAIN_PICKLE_PATH, "X_train_full_img.pkl")
X_test_sub = load_pickle(TEST_PICKLE_PATH, "X_test_img_sub.pkl")


## **2. Generating a Unified Text Column**  

In the previous analysis of **CSV Exploration and Visualization**, we observed:  

- **35.1%** of `description` is missing in `df_xtrain` and **35.4%** in `df_xtest`.  
- `designation` (product title) is always present.  
- The `description` field, when available, provides additional context about the product.  

 **Action Plan:**  
To handle missing values and **enrich product information**, we will:  
✔ **Combine `designation` and `description`** into a single **text column**.  


In [None]:

# Convert designation and description to string type explicitly
X_train["designation"] = X_train["designation"].astype("string")
X_test_sub["designation"] = X_test_sub["designation"].astype("string")

X_train["description"] = X_train["description"].astype("string")
X_test_sub["description"] = X_test_sub["description"].astype("string")



# Function to remove duplicate words while preserving order
def remove_duplicate_words(text):
    words = text.split()
    unique_words = list(dict.fromkeys(words))  # Removes duplicates while keeping order
    return " ".join(unique_words)

# Function to create a cleaned text column
def create_clean_text(designation, description):
    """
    Combines 'designation' and 'description' into a single text column.
    - Handles missing descriptions
    - Merges text fields and removes duplicate words
    """
    if pd.isna(description) or description.strip() == "":
        text = designation
    else:
        text = f"{designation} {description}"

    return remove_duplicate_words(text)  # Apply duplicate removal

# Apply functions to create the cleaned 'text' column
X_train["text"] = X_train.apply(lambda row: create_clean_text(row["designation"], row["description"]), axis=1)
X_test_sub["text"] = X_test_sub.apply(lambda row: create_clean_text(row["designation"], row["description"]), axis=1)

# Display sample results
print(" Sample merged text column (Training Data):")
display(X_train[["designation", "description", "text"]].head())

print("\n Sample merged text column (Submission Data):")
display(X_test_sub[["designation", "description", "text"]].head())
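
One subtlety worth checking: duplicate removal with `dict.fromkeys` is case-sensitive, and here it runs before the text is lowercased, so case variants of the same word are kept as distinct words. A minimal sketch on made-up strings (the sample title and description are hypothetical):

```python
def remove_duplicate_words(text: str) -> str:
    # dict.fromkeys keeps the first occurrence of each word, in order
    return " ".join(dict.fromkeys(text.split()))

title = "Coque iPhone rouge"
description = "coque en silicone pour iPhone"
merged = f"{title} {description}"

# Deduplicating the raw text keeps both "Coque" and "coque"
raw_dedup = remove_duplicate_words(merged)

# Lowercasing first lets the two variants collapse into one
lower_dedup = remove_duplicate_words(merged.lower())

print(raw_dedup)
print(lower_dedup)
```

Whether the remaining duplicates matter depends on the downstream vectorizer; the sketch only shows how the order of the two steps changes the result.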


## 3. Cleaning text

### 3.1 Converting Text to Lowercase

To ensure consistency in text processing, we convert all text to **lowercase**.  
This helps in **avoiding mismatches** between words like `"Laptop"` and `"laptop"`, which should be treated as the same word.  
We also remove extra spaces at the beginning and end of the text to make it cleaner.

In [None]:
# Function to convert text to lowercase
def lower_case(text):
    return text.lower().strip()

# Apply transformation
X_train["text"] = X_train["text"].apply(lower_case)
X_test_sub["text"] = X_test_sub["text"].apply(lower_case)

# Display sample results
print(" Sample cleaned text (Training Data):")
display(X_train[["text"]].head())

print("\n Sample cleaned text (Testing Data):")
display(X_test_sub[["text"]].head())


### 3.2 Decode HTML Entities
This step will decode any HTML entities in the text (e.g., `&eacute;` becomes `é`).
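
A quick illustration of what `html.unescape` does on a made-up string. It also shows why tag removal is best done after decoding: escaped tags such as `&lt;b&gt;` turn into real tags once decoded.

```python
import html

# Hypothetical raw snippet mixing accented-letter entities and escaped tags
raw = "caf&eacute; &amp; th&eacute; &lt;b&gt;bio&lt;/b&gt;"
decoded = html.unescape(raw)

print(decoded)
```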


In [None]:


def decode_html_entities(text):
    """
    Decodes HTML entities in the text (e.g., &eacute; to é).
    """
    return html.unescape(text)

# Apply the function to the text columns
X_train['text'] = X_train['text'].apply(decode_html_entities)
X_test_sub['text'] = X_test_sub['text'].apply(decode_html_entities)

# Display a sample to verify
print("Sample text after decoding HTML entities (Training Data):")
display(X_train[['text']].head())

print("\nSample text after decoding HTML entities (Testing Data):")
display(X_test_sub[['text']].head())


### 3.3 Remove HTML Tags
This step removes any remaining HTML tags such as `<p>`, `<b>`, `<br>`, etc.
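
Whether a tag is replaced by an empty string or by a space matters when the tag sits directly between two words. A small sketch on a made-up string, using the same pattern as the function in this section:

```python
import re

TAG_PATTERN = r"<[^<]+?>"
sample = "coque<br>iphone <b>rouge</b>"  # hypothetical input

# Empty replacement glues "coque" and "iphone" together
joined = re.sub(TAG_PATTERN, "", sample)

# Space replacement keeps the words apart; extra spaces are collapsed afterwards
spaced = " ".join(re.sub(TAG_PATTERN, " ", sample).split())

print(joined)
print(spaced)
```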

In [None]:


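def remove_html_tags(text):
    """
    Removes HTML tags from the given text using regex substitution.
    Each tag is replaced by a space so that words separated only by a tag
    (e.g. "foo<br>bar") are not merged; extra spaces are collapsed in later steps.
    This ensures that only meaningful product information remains.
    """
    return re.sub(r"<[^<]+?>", " ", text)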

# Apply the function to the text columns
X_train['text'] = X_train['text'].apply(remove_html_tags)
X_test_sub['text'] = X_test_sub['text'].apply(remove_html_tags)

# Display a sample to verify
print("Sample text after removing HTML tags (Training Data):")
display(X_train[['text']].head())

print("\nSample text after removing HTML tags (Testing Data):")
display(X_test_sub[['text']].head())


### 3.4 Removing Accents from Text
This step will remove accents from characters in the text (e.g., `é` becomes `e`).
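
A hand-written replacement table only covers the letters it lists. As a possible alternative (not the approach this notebook uses), Unicode decomposition from the standard library strips combining accent marks generically. A minimal sketch, where `strip_accents` is a hypothetical helper name:

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFKD splits an accented letter into base letter + combining mark(s)
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping the combining marks leaves only the base letters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("déjà vu à noël"))
print(strip_accents("für größe"))
```

Letters without a decomposition, such as `ß` or `œ`, pass through unchanged, so they would still need explicit handling.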

In [None]:
# Function to remove accented characters from text
def remove_accent(text):
    """
    Replaces accented characters with their non-accented counterparts.
    This ensures consistency in text processing and improves model robustness.
    """
    text = text.replace('à', 'a').replace('á', 'a').replace('â', 'a').replace('ä', 'a')
    text = text.replace('é', 'e').replace('è', 'e').replace('ê', 'e').replace('ë', 'e')
    text = text.replace('î', 'i').replace('ï', 'i')
    text = text.replace('ö', 'o').replace('ô', 'o').replace('ò', 'o').replace('ó', 'o')
    text = text.replace('ù', 'u').replace('û', 'u').replace('ü', 'u')
    text = text.replace('ç', 'c')

    return text

# Apply transformation
X_train["text"] = X_train["text"].apply(remove_accent)
X_test_sub["text"] = X_test_sub["text"].apply(remove_accent)

# Display sample results
print(" Sample text after removing accents (Training Data):")
display(X_train[["text"]].head())

print("\n Sample text after removing accents (Testing Data):")
display(X_test_sub[["text"]].head())


### 3.5 Normalize Text by Replacing Special Characters
This step replaces certain special characters and typographic symbols with their standard equivalents. This ensures consistency and helps in preparing the text for further processing.

We will replace:
- **Smart quotes** with regular quotes
- **En dashes** and **em dashes** with a hyphen
- **Ellipses** with three dots
- Remove unwanted characters (e.g., `¿`)

This will further clean up the text and avoid unwanted characters affecting model performance.
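
The same kind of mapping can also be applied in a single pass with `str.translate`. A minimal sketch, assuming the replacements listed above; `TYPOGRAPHIC_MAP` and `normalize_typography` are hypothetical names, and the characters are written as Unicode escapes for readability:

```python
# Single-character keys map to replacement strings; None deletes the character
TYPOGRAPHIC_MAP = str.maketrans({
    "\u2019": "'",    # right single quote
    "\u2018": "'",    # left single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # ellipsis
    "\u00bf": None,   # inverted question mark, removed
})

def normalize_typography(text: str) -> str:
    return text.translate(TYPOGRAPHIC_MAP)

sample = "l\u2019etui \u2013 \u201cpremium\u201d\u2026"
print(normalize_typography(sample))
```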


In [None]:
def normalize_text(text):
    """
    Replace special characters with their standard equivalents.
    This step includes replacing smart quotes, dashes, ellipses, and removing unwanted characters.
    """
    replacements = {
        "’": "'",    # Smart quote → standard quote
        "‘": "'",    # Smart quote → standard quote
        "“": '"',    # Smart quote → standard quote
        "”": '"',    # Smart quote → standard quote
        "–": "-",    # En dash → hyphen
        "—": "-",    # Em dash → hyphen
        "…": "...",  # Ellipsis → three dots
        "¿": "",     # Remove unwanted character
    }
    for key, value in replacements.items():
        text = text.replace(key, value)
    return text

# Apply the function to the text columns
X_train["text"] = X_train["text"].apply(normalize_text)
X_test_sub["text"] = X_test_sub["text"].apply(normalize_text)

# Display sample results
print("Sample text after normalizing special characters (Training Data):")
display(X_train[["text"]].head())

print("\nSample text after normalizing special characters (Testing Data):")
display(X_test_sub[["text"]].head())


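### 3.6 Keeping Essential Characters Only
In this step, we replace every run of non-alphabetic characters (digits, punctuation, symbols) with a single space and keep only the unaccented letters `a-z` and `A-Z`. Any accented letter that survived step 3.4 is also treated as a separator here. This helps in focusing on meaningful words for the model.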
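
To make the effect of the pattern concrete, here is a small sketch on made-up strings. It shows that digits disappear and that a letter still carrying an accent splits its word, which is why accent removal has to run first. `keep_letters` is a hypothetical name for illustration.

```python
import re

def keep_letters(text: str) -> str:
    # Each run of characters outside a-z / A-Z becomes one space
    return re.sub(r"[^a-zA-Z]+", " ", text)

digits_dropped = keep_letters("iphone 12 pro - 64go!").split()
accents_removed_first = keep_letters("deja vu").split()
accents_still_present = keep_letters("déjà vu").split()

print(digits_dropped)
print(accents_removed_first)
print(accents_still_present)
```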


In [None]:
def keeping_essentiel(text):
    """
    Replaces each run of non-alphabetic characters with a single space,
    keeping only the letters a-z and A-Z.
    This ensures that the text contains only relevant words and characters.
    """
    return re.sub(r"[^a-zA-Z]+", " ", text)

# Apply the function to the text columns
X_train['text'] = X_train['text'].apply(keeping_essentiel)
X_test_sub['text'] = X_test_sub['text'].apply(keeping_essentiel)

# Display a sample to verify
print("Sample text after keeping only essential characters (Training Data):")
display(X_train[['text']].head())

print("\nSample text after keeping only essential characters (Testing Data):")
display(X_test_sub[['text']].head())


### 3.7 Removing Stopwords and Short Words

In this step, we will remove **stopwords** (commonly used words like "the", "and", "is", etc.) and **short words** (words with fewer than 3 characters) from the text. These words do not add significant meaning to the text and can introduce noise, making it harder for the model to learn useful patterns.

The function `remove_stopwords_and_short_words` will:
- Eliminate stopwords from the text, including those from French, English, and German, as well as custom stopwords specific to our dataset.
- Remove any words that are shorter than 3 characters.

This helps ensure that only meaningful words remain in the dataset, improving model performance.
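
One point that may be worth verifying: stopword lists can contain accented entries, while the text at this stage has already had its accents removed, so such entries might never match. A minimal sketch with a hypothetical mini stopword list (not the NLTK lists used here) and hypothetical helper names:

```python
import unicodedata

def strip_accents(word: str) -> str:
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical mini list for illustration only
raw_stop_words = {"pour", "avec", "même", "the", "und"}
normalized_stop_words = {strip_accents(w) for w in raw_stop_words}

def filter_tokens(tokens, stop_words, min_len=3):
    # Keep tokens that are long enough and not stopwords
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

tokens = "coque pour iphone avec le meme design".split()

with_raw_list = filter_tokens(tokens, raw_stop_words)
with_normalized_list = filter_tokens(tokens, normalized_stop_words)

print(with_raw_list)
print(with_normalized_list)
```

Using a `set` for the stopwords also keeps each membership test fast, which adds up across a large number of product texts.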


In [None]:


# Build the stopword collection as a set for fast membership tests
stop_words = set(
    stopwords.words('french')
    + stopwords.words('english')
    + stopwords.words('german')
    + ['plus', 'peut', 'tout', 'etre', 'sans', 'dont', 'aussi', 'comme', 'meme', 'bien',
       'leurs', 'elles', 'cette', 'celui', 'ainsi', 'encore', 'alors', 'toujours', 'toute',
       'deux', 'nouveau', 'peu', 'car', 'autre', 'jusqu', 'quand', 'ici', 'ceux', 'enfin',
       'jamais', 'autant', 'tant', 'avoir', 'moin', 'celle', 'tous', 'contre', 'pourtant',
       'quelque', 'toutes', 'surtout', 'cet', 'comment', 'rien', 'avant', 'doit', 'autre',
       'depuis', 'moins', 'tre', 'souvent', 'etait', 'pouvoir', 'apre', 'non', 'ver', 'quel',
       'pourquoi', 'certain', 'fait', 'faire', 'sou', 'donc', 'trop', 'quelques', 'parfois',
       'tres', 'donc', 'dire', 'eacute', 'egrave', 'rsquo', 'agrave', 'ecirc', 'nbsp', 'acirc',
       'apres', 'autres', 'ocirc', 'entre', 'sous', 'quelle']
)



def remove_stopwords_and_short_words(words):
    """
    Removes stopwords from a list of words and filters out words that are too short.
    Only words that are not in `stop_words` and have at least 3 characters are kept.
    """
    # `words` avoids shadowing the built-in `list`
    return [word for word in words if word not in stop_words and len(word) > 2]

# Split the text into individual words (tokens)
X_train['text'] = X_train['text'].str.split()
X_test_sub['text'] = X_test_sub['text'].str.split()

# Apply stopwords removal and filter short words from both datasets
X_train['text'] = X_train['text'].apply(lambda x: remove_stopwords_and_short_words(x))
X_test_sub['text'] = X_test_sub['text'].apply(lambda x: remove_stopwords_and_short_words(x))

# Join the words back into a string
X_train['text'] = X_train['text'].apply(lambda x: " ".join(x))
X_test_sub['text'] = X_test_sub['text'].apply(lambda x: " ".join(x))

# Display a sample to verify
print("Sample text after removing stopwords and short words (Training Data):")
display(X_train[['text']].head())

print("\nSample text after removing stopwords and short words (Testing Data):")
display(X_test_sub[['text']].head())


### 3.8 Removing Punctuation

Punctuation marks such as periods, commas, question marks, exclamation points, parentheses, and quotation marks do not generally contribute meaningful information for most text classification tasks. Therefore, we will remove them to clean up the text and ensure that only the relevant words are processed.

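In this step:
- We use the `string.punctuation` constant from Python's `string` module to identify and remove all ASCII punctuation marks from the text.
- Step 3.6 already replaced every non-letter character with a space, so this pass acts as a safety check and should leave the text unchanged.
- This ensures the model focuses on the core content without being distracted by unnecessary symbols.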

The function `remove_punctuation` will be applied to both the training and testing datasets.
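
As a small illustration of the technique, the sketch below builds a translation table from `string.punctuation` and applies it to made-up strings. `strip_punctuation` is a hypothetical name; note that `string.punctuation` only covers ASCII symbols.

```python
import string

# Third argument of maketrans lists the characters to delete
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text: str) -> str:
    return text.translate(PUNCT_TABLE)

with_punct = strip_punctuation("hello, world! (test)")
letters_only = strip_punctuation("coque iphone silicone")

print(with_punct)
print(letters_only)
```

Text that already contains only letters and spaces comes back unchanged.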


In [None]:



def remove_punctuation(text):
    # Remove punctuation using Python string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply this function to both datasets
X_train['text'] = X_train['text'].apply(remove_punctuation)
X_test_sub['text'] = X_test_sub['text'].apply(remove_punctuation)


# Display a sample to verify
print("Sample text after removing punctuation (Training Data):")
display(X_train[['text']].head())

print("\nSample text after removing punctuation (Testing Data):")
display(X_test_sub[['text']].head())


## 4. Saving Updated Datasets for Future Use

To avoid reloading and recomputing the datasets in every notebook, we save the cleaned datasets as Pickle files.

In this notebook, we have performed the following:
- Combined the `designation` (title) and `description` into a single column called `text`.
- Cleaned the text data by converting it to lowercase, removing accents, stripping HTML tags, filtering special characters, and eliminating stopwords and short words.

We will store the following updated datasets for future use:
- **`X_train_cleaned.pkl`** → The training dataset with cleaned text.
- **`X_test_sub_cleaned.pkl`** → The test dataset with cleaned text.

These files will be used in future steps, including feature engineering and model training.
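
The cleaning steps summarized above could also be chained into a single function, which makes their order explicit and easy to test on one string. A minimal sketch, with simplified stand-ins for the notebook's functions: it uses Unicode decomposition instead of the replacement table for accents, and a hypothetical mini stopword list instead of the NLTK lists.

```python
import html
import re
import unicodedata

# Hypothetical mini stopword list; the notebook builds a much larger one
STOP_WORDS = {"pour", "avec", "the", "and", "und"}

def clean_text(text: str) -> str:
    text = text.lower().strip()            # lowercase and trim
    text = html.unescape(text)             # decode HTML entities
    text = re.sub(r"<[^<]+?>", " ", text)  # drop tags, keep words apart
    # Strip accents (stand-in for the notebook's replacement table)
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z]+", " ", text)   # keep letters only
    # Remove stopwords and words shorter than 3 characters
    tokens = [t for t in text.split() if len(t) > 2 and t not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("Coque <b>Protégée</b> pour iPhone 12 &amp; Chargeur")
print(cleaned)
```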

In [None]:


# Define the directory and file names
pickle_dir = "../data/interim/"
os.makedirs(pickle_dir, exist_ok=True)

# Define file paths
train_pickle_path = os.path.join(pickle_dir, "X_train_cleaned.pkl")
test_pickle_path = os.path.join(pickle_dir, "X_test_sub_cleaned.pkl")

try:
    # Save updated training dataset
    X_train.to_pickle(train_pickle_path)
    print(f"Training dataset saved: {train_pickle_path}")

    # Save updated test dataset
    X_test_sub.to_pickle(test_pickle_path)
    print(f"Test dataset saved: {test_pickle_path}")

except Exception as e:
    print(f"Error saving datasets: {e}")


## 5. 🔄 Next Steps  

Now that we have cleaned the text data, the next step is to enhance our understanding of the dataset through **visualizations** and further analysis.

We will utilize the cleaned text data to:
- Create **WordClouds** to visually represent the most frequent words in different classes.
- Identify key labels for each class, such as:
  - 50: 'video games accessories'
  - 2705: 'books'
  - ...and more.

The **WordClouds** will help us to visually explore and understand the dominant themes for each product category.

---
➡️ **Proceed to `5_Text_WordClouds_for_Product_Categories.ipynb`**  
This notebook will focus on generating WordClouds for each product category and analyzing the textual content of the dataset.
