# Word Clouds for Text Product Categories

## 📌 Objectives of the Notebook

In this notebook, we will explore and visualize the text data associated with product codes using Word Clouds. This will help us better understand the most frequent words for each category and identify potential patterns.

## Key Steps:
✔ **Loading pre-cleaned text data**  → Import processed datasets (`X_train_cleaned.pkl` & `X_test_cleaned.pkl`).       
✔  **Generating and visualizing Word Clouds** → Extract and display the most frequent words for each product category.   
✔  **Identifying Product Category Labels from Word Cloud Analysis**    
✔  **Mapping product codes to labels** → Assign descriptive category names to product codes for better interpretability.  
✔  **Saving the labeled dataset** → Store the final processed data (`X_train_final.pkl`) for future use.


## 1. Import Required Libraries 

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

import sys
import os
from pathlib import Path
import pandas as pd  
import matplotlib.pyplot as plt  
import seaborn as sns  
import matplotlib  
import importlib
%matplotlib inline

### Setting Up Project Paths and Configurations

In [None]:
# Get the current notebook directory
CURRENT_DIR = Path(os.getcwd()).resolve()

# Find the project root: parents[1] is two levels above the notebook directory
PROJECT_ROOT = CURRENT_DIR.parents[1]

# Add project root to sys.path
sys.path.append(str(PROJECT_ROOT))

# Function to get relative paths from project root
def get_relative_path(absolute_path):
    return str(Path(absolute_path).relative_to(PROJECT_ROOT))

# Print project root directory
print(f"Project Root Directory: {PROJECT_ROOT.name}")  # Display only the root folder name

import config  # Now Python can find config.py
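
The hard-coded `parents[1]` above only works while the notebook stays at the same depth in the repository. A minimal alternative sketch, assuming `config.py` sits in the project root (which the `import config` above suggests), is to walk upward until that marker file is found. The helper name `find_project_root` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "config.py") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`.

    Returns None when no ancestor directory contains the marker file.
    """
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None
```

In the notebook this could be used as `PROJECT_ROOT = find_project_root(Path.cwd())`, followed by an explicit check that the result is not `None`.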

## 2. Load Pickle Files (X_train_cleaned.pkl & X_test_cleaned.pkl)

In [None]:
importlib.reload(config)  # Reload config to ensure any updates are applied

# Define paths for datasets (FILES, not directories)
train_pickle_path = Path(config.INTERIM_TEXT_DIR) / "X_train_cleaned.pkl"
test_pickle_path = Path(config.INTERIM_TEXT_DIR) / "X_test_cleaned.pkl"
y_train_pickle_path = Path(config.INTERIM_TEXT_DIR) / "y_train.pkl"

# Function to get relative paths from project root
def get_relative_path(absolute_path: Path):
    """Returns the relative path from the project root."""
    return str(absolute_path.relative_to(config.BASE_DIR))

# Function to load a Pickle file safely
def load_pickle(file_path: Path, dataset_name: str):
    """Loads a pickle file with error handling and basic visualization."""
    if not file_path.exists():
        print(f"Error: `{dataset_name}` file not found at {file_path}")
        return None

    try:
        data = pd.read_pickle(file_path)
        print(f"Successfully loaded `{dataset_name}` | Shape: {data.shape}")

        if not data.empty:
            display(data.head())  # Display first rows

        return data
    except Exception as e:
        print(f"Error loading `{dataset_name}`: {e}")
        return None

# List of required files with their names
required_files = {
    "Training Dataset": train_pickle_path,
    "Testing Dataset": test_pickle_path,   
    "Training Labels": y_train_pickle_path
}

# Check if files exist before loading
for name, path in required_files.items():
    if not path.exists():
        raise FileNotFoundError(f"Error: `{name}` file not found at {get_relative_path(path)}")

# Load datasets
X_train = load_pickle(train_pickle_path, "X_train_cleaned.pkl")
X_test = load_pickle(test_pickle_path, "X_test_cleaned.pkl")
y_train = load_pickle(y_train_pickle_path, "y_train.pkl")


## 3. Generating and Visualizing Word Clouds

Word Clouds help us quickly identify the most frequent words in each product category. This can provide insights into key terms associated with different categories.

### 3.1  Import Required Libraries

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### 3.2 Get unique product code values

In [None]:
import numpy as np

# === Get unique product codes === #
unique_product_codes = np.unique(X_train["prdtypecode"])

# Display unique values
print("Unique Product Codes:")
print(unique_product_codes)
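
Besides the list of codes, the number of rows per code is useful context, because a Word Cloud built from few products is less representative. The sketch below uses a small toy frame as a stand-in for `X_train`; the values are placeholders, and only the `prdtypecode` column name comes from this notebook.

```python
import pandas as pd

# Toy stand-in for X_train (placeholder values, not real data)
toy = pd.DataFrame({"prdtypecode": [10, 40, 10, 2705, 10, 40]})

# Number of rows per product code, ordered by code
counts = toy["prdtypecode"].value_counts().sort_index()
print(counts)
```

On the real data, the same call would be `X_train["prdtypecode"].value_counts().sort_index()`.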


### 3.3 Define a Function to Generate Word Clouds

In [None]:
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Global dictionary to store top words per category
word_freq_dict = {}

def plot_wordcloud(category, data, column="text"):
    """
    Generate and display a Word Cloud for a given product category,
    and store the most frequent words in a global dictionary.

    Parameters:
    - category (int): The product category code to visualize.
    - data (DataFrame): The dataset containing text data.
    - column (str): The column containing the text (default: "text").
    """
    text_data = " ".join(data[data["prdtypecode"] == category][column].dropna())
    
    # Generate the Word Cloud
    wc = WordCloud(
        background_color="black",  # Set background color to black for better contrast
        max_words=100,             # Limit the number of words displayed in the Word Cloud
        max_font_size=50,          # Set the maximum font size for the largest words
        random_state=42            # Ensure reproducibility
    ).generate(text_data)

    # Extract most frequent words
    word_frequencies = wc.words_  # Dictionary {word: frequency}
    top_words = Counter(word_frequencies).most_common(10)  # Get top 10 words

    # Store in global dictionary
    word_freq_dict[category] = top_words

    # Print most frequent words for the category
    print(f"\n Most frequent words for category {category}:")
    for word, freq in top_words:
        print(f"   {word}: {freq:.4f}")

    # Display the Word Cloud
    plt.figure(figsize=(15, 15))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word Cloud for {category}", fontsize=14)
    plt.show()
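
The values printed by `plot_wordcloud` are not raw counts. To our understanding, `wc.words_` holds relative frequencies scaled so that the most frequent token gets 1.0. The sketch below reproduces that scaling with plain whitespace tokenization. It is only an approximation of what the library does, since `WordCloud` also removes stopwords and may merge frequent word pairs. The helper name `relative_word_frequencies` is hypothetical.

```python
from collections import Counter


def relative_word_frequencies(text: str, top_n: int = 10):
    """Return the top_n tokens with counts scaled by the highest count."""
    counts = Counter(text.lower().split())
    if not counts:
        return []  # No tokens at all: nothing to rank
    max_count = max(counts.values())
    return [(word, count / max_count) for word, count in counts.most_common(top_n)]


# Placeholder tokens: "a" occurs 3 times, "b" twice, "c" once
print(relative_word_frequencies("a b a c a b"))
```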


### 3.4 Generating Word Clouds for All Product Code Categories

In [1]:
plot_wordcloud(10, X_train)



In [2]:
plot_wordcloud(40, X_train)



In [None]:
plot_wordcloud(50, X_train)


In [None]:
plot_wordcloud(60, X_train)


In [None]:
plot_wordcloud(1140, X_train)


In [None]:
plot_wordcloud(1160, X_train)


In [None]:
plot_wordcloud(1180, X_train)


In [None]:
plot_wordcloud(1280, X_train)


In [None]:
plot_wordcloud(1281, X_train)


In [None]:
plot_wordcloud(1300, X_train)


In [None]:
plot_wordcloud(1301, X_train)


In [None]:
plot_wordcloud(1302, X_train)


In [None]:
plot_wordcloud(1320, X_train)


In [None]:
plot_wordcloud(1560, X_train)


In [None]:
plot_wordcloud(1920, X_train)


In [None]:
plot_wordcloud(1940, X_train)


In [None]:
plot_wordcloud(2060, X_train)


In [None]:
plot_wordcloud(2220, X_train)


In [None]:
plot_wordcloud(2280, X_train)


In [None]:
plot_wordcloud(2403, X_train)


In [None]:
plot_wordcloud(2462, X_train)


In [None]:
plot_wordcloud(2522, X_train)


In [None]:
plot_wordcloud(2582, X_train)


In [None]:
plot_wordcloud(2583, X_train)


In [None]:
plot_wordcloud(2585, X_train)


In [None]:
plot_wordcloud(2705, X_train)


In [None]:
plot_wordcloud(2905, X_train)
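
The individual cells above could also be replaced by a single loop over the unique product codes. The sketch below is one possible way to do this. `for_each_category` is a hypothetical helper, and the toy frame with a recording callback only stands in for `X_train` and `plot_wordcloud`.

```python
import pandas as pd


def for_each_category(data, plot_fn, code_column="prdtypecode"):
    """Call plot_fn(code, data) once per unique product code, in ascending order."""
    codes = sorted(int(code) for code in data[code_column].unique())
    for code in codes:
        plot_fn(code, data)
    return codes


# Toy stand-in for X_train; the callback only records which codes were visited
toy = pd.DataFrame({"prdtypecode": [40, 10, 2705, 10]})
seen = []
for_each_category(toy, lambda code, df: seen.append(code))
print(seen)
```

In the notebook the call would be `for_each_category(X_train, plot_wordcloud)`. Separate cells remain the better choice when each cloud should be inspected on its own.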


### 3.5 Summary: Most Frequent Words by Product Category

In [None]:
print("\nSummary: Most Frequent Words by Product Category")
print("=" * 100)

for category, words in word_freq_dict.items():
    top_words = ", ".join([word for word, freq in words])
    print("-" * 100)
    print(f"Product Code {category:<5} | {top_words}")

print("=" * 100)
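
For easier comparison, the same summary can be collected in a DataFrame instead of printed text. This is a minimal sketch that assumes the `{code: [(word, frequency), ...]}` structure of `word_freq_dict`. The helper name `summarize_top_words` and the example words are placeholders.

```python
import pandas as pd


def summarize_top_words(word_freq):
    """Build one row per product code with its top words joined into a string."""
    rows = [
        {"prdtypecode": code, "top_words": ", ".join(word for word, _ in words)}
        for code, words in sorted(word_freq.items())
    ]
    return pd.DataFrame(rows, columns=["prdtypecode", "top_words"])


# Placeholder input with the same structure as word_freq_dict
example = {
    40: [("word_c", 1.0), ("word_d", 0.8)],
    10: [("word_a", 1.0), ("word_b", 0.5)],
}
summary = summarize_top_words(example)
print(summary)
```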


In [None]:
dict_prdtypecode = {
    "prdtypecode": [10, 40, 50, 60, 1140, 1160, 1180, 1280, 1281, 1300, 
                    1301, 1302, 1320, 1560, 1920, 1940, 2060, 2220, 2280, 
                    2403, 2462, 2522, 2582, 2583, 2585, 2705, 2905],  
 
    "Label": ["adult books", "imported video games", "video games accessories", "games and consoles", 
              "figurines and Toy Pop", "playing cards", "figurines, masks and role playing games", 
              "toys for children", "board games", "remote controlled models", "accessories children", 
              "toys, outdoor playing, clothes", "early childhood", "interior furniture and bedding", 
              "interior accessories", "Food", "decoration interior", "supplies for domestic animals", 
              "magazines", "children books and magazines", "games", "stationery", 
              "furniture kitchen and garden", "piscine spa", "gardening and DIY", "books", 
              "online distribution of video games"]
}


### 3.6 📌 Identifying Product Category Labels from Word Cloud Analysis

By analyzing the **Word Cloud visualizations** and the **summary of most frequent words**, we assigned a descriptive category to each `prdtypecode`. These labels are our interpretation of the dominant vocabulary in each category. The final mapping of **product codes to categories** is shown below:

| **Product Code** | **Identified Category** |  
|-----------------|------------------------|  
| 10             | Adult Books |  
| 40             | Imported Video Games |  
| 50             | Video Games Accessories |  
| 60             | Games and Consoles |  
| 1140           | Figurines and Toy Pop |  
| 1160           | Playing Cards |  
| 1180           | Figurines, Masks, and Role-Playing Games |  
| 1280           | Toys for Children |  
| 1281           | Board Games |  
| 1300           | Remote Controlled Models |  
| 1301           | Accessories for Children |  
| 1302           | Toys, Outdoor Playing, and Clothes |  
| 1320           | Early Childhood |  
| 1560           | Interior Furniture and Bedding |  
| 1920           | Interior Accessories |  
| 1940           | Food |  
| 2060           | Decoration Interior |  
| 2220           | Supplies for Domestic Animals |  
| 2280           | Magazines |  
| 2403           | Children Books and Magazines |  
| 2462           | Games |  
| 2522           | Stationery |  
| 2582           | Furniture, Kitchen, and Garden |  
| 2583           | Piscine and Spa |  
| 2585           | Gardening and DIY |  
| 2705           | Books |  
| 2905           | Online Distribution of Video Games |  

 This labeling will now be used for further analysis and model training.  
 📌 **We will now add these category labels to our training dataset (`X_train`).**


### 3.7 Adding Category Labels to X_train

In [None]:
# Define the mapping of prdtypecode to category labels
dict_code_label = {
    10: "Adult Books",
    40: "Imported Video Games",
    50: "Video Games Accessories",
    60: "Games and Consoles",
    1140: "Figurines and Toy Pop",
    1160: "Playing Cards",
    1180: "Figurines, Masks, and Role-Playing Games",
    1280: "Toys for Children",
    1281: "Board Games",
    1300: "Remote Controlled Models",
    1301: "Accessories for Children",
    1302: "Toys, Outdoor Playing, and Clothes",
    1320: "Early Childhood",
    1560: "Interior Furniture and Bedding",
    1920: "Interior Accessories",
    1940: "Food",
    2060: "Decoration Interior",
    2220: "Supplies for Domestic Animals",
    2280: "Magazines",
    2403: "Children Books and Magazines",
    2462: "Games",
    2522: "Stationery",
    2582: "Furniture, Kitchen, and Garden",
    2583: "Piscine and Spa",
    2585: "Gardening and DIY",
    2705: "Books",
    2905: "Online Distribution of Video Games"
}

#  Add the category labels to X_train
X_train["Label"] = X_train["prdtypecode"].map(dict_code_label)

# Display a sample to verify
X_train.head()
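
`Series.map` returns `NaN` for any code that is missing from the dictionary, so a quick coverage check after the mapping is worthwhile. The sketch below shows one way to list such codes. `find_unmapped_codes` is a hypothetical helper, and the code `9999` in the toy frame is deliberately invented to trigger the check.

```python
import pandas as pd


def find_unmapped_codes(data, mapping, code_column="prdtypecode"):
    """Return the sorted product codes that have no entry in `mapping`."""
    labels = data[code_column].map(mapping)
    return sorted(data.loc[labels.isna(), code_column].unique().tolist())


# Toy stand-in for X_train; 9999 is an invented code with no label
toy = pd.DataFrame({"prdtypecode": [10, 40, 9999, 10]})
toy_mapping = {10: "Adult Books", 40: "Imported Video Games"}
print(find_unmapped_codes(toy, toy_mapping))
```

In the notebook, `find_unmapped_codes(X_train, dict_code_label)` should return an empty list if every code has a label.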


## 4. Saving Updated Datasets for Future Use

To avoid reloading and recomputing the datasets in every notebook, we save the cleaned and labeled training dataset as a Pickle file. This ensures that we can efficiently reuse the data in future steps without the need for redundant preprocessing.

The datasets will be saved in the **`processed`** directory as follows:

- **Training dataset**: `X_train_final.pkl`

To ensure consistency in subsequent steps, we also save the target variable `y_train` and the test dataset `X_test`, even though they have not been modified in the current notebook.

- **Training target variable (prdtypecode)**: `y_train_final.pkl`  
  (Saved for consistency, even though it hasn't been modified.)

- **Testing dataset**: `X_test_final.pkl`  
  (Saved for future use, especially for challenge submissions, as it has no associated target variable for labeling.)


In [None]:
# Define the output directory (created if it does not exist yet)
pickle_dir = Path(config.PROCESSED_DIR)
pickle_dir.mkdir(parents=True, exist_ok=True)

# Define file paths for the labeled datasets
train_pickle_path = pickle_dir / "X_train_final.pkl"
test_pickle_path = pickle_dir / "X_test_final.pkl"
target_pickle_path = pickle_dir / "y_train_final.pkl"  # Path for y_train

try:
    # Save updated training dataset
    X_train.to_pickle(train_pickle_path)
    print(f"Training dataset saved: {train_pickle_path}")

    # Save test dataset (even if it hasn't been modified)
    X_test.to_pickle(test_pickle_path)
    print(f"Test dataset saved: {test_pickle_path}")

    # Save the training target variable (even if it hasn't been modified)
    y_train.to_pickle(target_pickle_path)
    print(f"Training target variable saved: {target_pickle_path}")

except Exception as e:
    print(f"Error saving datasets: {e}")
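
Because the `try`/`except` above only prints errors, a failed save can go unnoticed. One optional safeguard is to reload a saved file and compare it with the in-memory object. The sketch below shows this on a toy frame in a temporary directory. The helper name `pickle_round_trip_ok` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def pickle_round_trip_ok(frame, path):
    """Save `frame` to `path`, reload it, and report whether both are equal."""
    frame.to_pickle(path)
    reloaded = pd.read_pickle(path)
    return frame.equals(reloaded)


# Toy stand-in for the labeled training data, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    toy = pd.DataFrame(
        {"prdtypecode": [10, 40], "Label": ["Adult Books", "Imported Video Games"]}
    )
    ok = pickle_round_trip_ok(toy, Path(tmp) / "toy.pkl")
    print(ok)
```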


## 5. 🔄 Next Steps

We used Word Clouds to visualize the most frequent terms in each product category and assigned a descriptive label to every product code based on that analysis. These labels make the categories easier to interpret and will help when analyzing prediction results.

**Next, we will:**

- **Vectorize the text data**: Convert the cleaned text into numerical representations suitable for Machine Learning Models.  
➡️ This will be accomplished in the upcoming notebook  **`6_ML_Text_Vectorization_TF.ipynb`**

