# Environment & Data Setup

Before building the model, we need to configure our notebook environment for high-performance processing and load our data.

#### 1. Data Configuration
* **Kaggle Datasets**: All project data (the `train.csv`, `test.csv`, and the `images/` directory) was first uploaded as a private Kaggle Dataset.
* **Notebook Import**: We then imported this dataset into the notebook. This is the most efficient way to handle large files on Kaggle, as the data is mounted directly to the notebook's filesystem under the `/kaggle/input/` directory.

#### 2. Accelerator: GPU T4 x2
* **Enabled GPU**: To dramatically speed up our feature extraction, we enabled a high-performance GPU accelerator. This can be done in the notebook's right-hand menu:
    `Settings` -> `Accelerator` -> `GPU T4 x2`
* **Why T4 x2?**: This option provides **two** NVIDIA T4 GPUs. Batch inference with `model.predict()` is an "embarrassingly parallel" task, so it can benefit from both. If the model is built inside a distribution strategy scope (such as `tf.distribute.MirroredStrategy`), TensorFlow splits each batch across the two GPUs, which can cut feature-extraction time by up to roughly half compared with a single GPU, provided the input pipeline keeps up. Without a strategy, Keras runs the model on the first GPU only; the feature-extraction cells below do not create one.

# Importing the Libraries

In [1]:
import os
import pandas as pd
import numpy as np
import tensorflow as tf
import time
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib
import xgboost as xgb
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from scipy.sparse import csr_matrix

2025-10-13 15:55:07.557668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760370907.798538      37 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760370907.873522      37 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


# Importing Dataset

In [11]:
DATASET_FOLDER = '/kaggle/input/amazon-ml-data'
train = pd.read_csv(os.path.join(DATASET_FOLDER, 'train.csv'))
test = pd.read_csv(os.path.join(DATASET_FOLDER, 'test.csv'))

# Load Pre-trained Base Model (MobileNetV2)

We initialize the **MobileNetV2** model, pre-trained on the ImageNet dataset.

* `input_shape=(224, 224, 3)`: Specifies the input size for our images (224x224 pixels with 3 color channels).
* `include_top=False`: Excludes the final fully-connected (classification) layer from the original MobileNetV2, so the network outputs image features rather than ImageNet class scores.
* `weights='imagenet'`: Loads weights pre-trained on the ImageNet dataset, which is crucial for transfer learning.
* `pooling='avg'`: Applies **Global Average Pooling** to the output of the base model. This converts the feature maps into a single flat vector per image, which is easy to feed into a downstream model.
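
To make the effect of `pooling='avg'` concrete, here is a minimal NumPy sketch. It assumes the final MobileNetV2 feature map for a 224x224 input has shape `(7, 7, 1280)` per image; the random array `feature_maps` is a stand-in for real activations, not output from the model.

```python
import numpy as np

# Stand-in for the last convolutional output of MobileNetV2 on a batch of
# 4 images: (batch, height, width, channels). Values are random placeholders.
rng = np.random.default_rng(0)
feature_maps = rng.random((4, 7, 7, 1280), dtype=np.float32)

# Global average pooling: average each channel over the 7x7 spatial grid.
embeddings = feature_maps.mean(axis=(1, 2))

print(embeddings.shape)  # (4, 1280)
```

Each row of `embeddings` plays the role of one image's 1280-dimensional feature vector.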

In [2]:
base_model = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3),
        include_top=False,
        weights='imagenet',
        pooling='avg' # Applies Global Average Pooling to the output
    )

I0000 00:00:1760370921.021442      37 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13942 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
I0000 00:00:1760370921.022293      37 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13942 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5


Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet_v2/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_1.0_224_no_top.h5
9406464/9406464 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


# Load Image Data using `tf.data`

We use the `tf.keras.utils.image_dataset_from_directory` utility to create a `tf.data.Dataset` object. This is a highly efficient way to load images from a directory directly into a TensorFlow data pipeline.

* `root_dir`: The path to our images.
* `labels=None` / `label_mode=None`: We explicitly tell the function *not* to look for labels. This is because our images are not organized into subfolders by class (e.g., `/dogs`, `/cats`). We will get our labels from a separate CSV file.
* `batch_size=64`: Loads the images in batches of 64.
* `image_size=(224, 224)`: Resizes all images to 224x224 to match the input shape required by our MobileNetV2 model.
* `shuffle=False`: This is **critical**. We must not shuffle the dataset here so that the order of the images loaded from the directory perfectly matches the order of the corresponding data (like labels or IDs) in our `train.csv` or `test.csv` file.
* `interpolation='lanczos5'`: Uses a high-quality resampling filter for resizing the images.

Finally, we apply `.prefetch(buffer_size=tf.data.AUTOTUNE)`. This is a performance optimization that allows the CPU to pre-load the next batch of images while the GPU is busy processing the current one, preventing data bottlenecks.

In [3]:
root_dir = "/kaggle/input/amazon-ml-image-data/images/"
image_dataset = tf.keras.utils.image_dataset_from_directory(
            root_dir,
            labels=None,
            label_mode=None,
            batch_size=64,
            image_size=(224, 224),
            shuffle=False, # CRITICAL for aligning with external data
            interpolation='lanczos5'
        )
image_dataset = image_dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

Found 72284 files.


# Extract Features (Generate Embeddings)

We now pass our entire `image_dataset` through the pre-trained `base_model` using the `.predict()` method.

* This is the main step of **feature extraction**.
* Since we set `include_top=False` and `pooling='avg'` when loading MobileNetV2, the model doesn't output final predictions. Instead, it processes each image and converts it into a high-level feature vector (also known as an "embedding").
* For each image in the dataset, the output will be a 1D vector of **1280 features**.

The `extracted_features` variable will now hold a large NumPy array with the shape `(total_number_of_images, 1280)`. This new array represents our image data in a numerical format that we can easily use to train a simpler machine learning model (like XGBoost, LightGBM, or a small, dense neural network).

In [4]:
extracted_features = base_model.predict(image_dataset)

I0000 00:00:1760371211.129387     100 service.cc:148] XLA service 0x7f6b7820fdd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1760371211.130327     100 service.cc:156]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
I0000 00:00:1760371211.130348     100 service.cc:156]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
I0000 00:00:1760371211.592547     100 cuda_dnn.cc:529] Loaded cuDNN version 90300


   3/1130 ━━━━━━━━━━━━━━━━━━━━ 50s 45ms/step

I0000 00:00:1760371216.674549     100 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


1010/1130 ━━━━━━━━━━━━━━━━━━━━ 1:22 684ms/step

Input file read error


InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_1_device_/job:localhost/replica:0/task:0/device:CPU:0}} jpeg::Uncompress failed. Invalid JPEG data or crop window.
	 [[{{node decode_image/DecodeImage}}]] [Op:IteratorGetNext] name: 

# Debugging: Finding Corrupted Images After a Failed Batch

During the feature extraction, the `model.predict()` process failed at a specific batch (`Batch 1009`). This almost always indicates that one of the image files in that batch is corrupted or truncated.

This script is a targeted debugger to find the exact problematic file without having to re-scan the entire dataset from the beginning.

Here is the logic:

1.  **Configuration**: We set the `FAILED_AT_BATCH` variable to the batch number where the error occurred (1009) and note the `BATCH_SIZE` (64).
2.  **Calculate Offset**: We calculate a `start_index` to skip all files from the batches that we know completed successfully (i.e., batches 1 through 1008). This is the key optimization.
3.  **Gather & Sort Files**: The script gets a complete list of all image paths from the directory.
4.  **Critical Sort**: It performs an alphabetical sort on the file list. This is **essential** to ensure the list's order perfectly matches the order used by `tf.keras.utils.image_dataset_from_directory` (when `shuffle=False`).
5.  **Targeted Scan**: It creates a new, smaller list (`files_to_check`) containing only the "suspect" files, starting from the calculated `start_index`.
6.  **Verify with TensorFlow**: It iterates *only* through this suspect list and attempts to open and decode each file using TensorFlow's own I/O functions (`tf.io.read_file` and `tf.io.decode_image`). This is crucial because it finds the *exact* file that TensorFlow considers corrupt, which other libraries (like PIL) might not.

Any file that throws an exception during this process is identified as the corrupted file.
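
The offset arithmetic in steps 1 and 2 can be sketched on its own. The helper name `batch_file_range` is hypothetical; the numbers (72,284 files, batch size 64) come from the outputs above.

```python
import math

def batch_file_range(batch_number, batch_size, total_files):
    """Return the [start, stop) file indices covered by a 1-indexed batch."""
    start = (batch_number - 1) * batch_size
    stop = min(start + batch_size, total_files)
    return start, stop

TOTAL_FILES = 72284
BATCH_SIZE = 64

# Number of batches predict() has to process (the last one is partial).
num_batches = math.ceil(TOTAL_FILES / BATCH_SIZE)
print(num_batches)  # 1130

# Files that belong to batch 1009, and how many files remain from there on.
start, stop = batch_file_range(1009, BATCH_SIZE, TOTAL_FILES)
print(start, stop)          # 64512 64576
print(TOTAL_FILES - start)  # 7772
```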

In [6]:
image_dir = '/kaggle/input/amazon-ml-image-data/images/'
BATCH_SIZE = 64
FAILED_AT_BATCH = 1009 # This is the batch number where the error occurred


# 1. Calculate the starting index
# We skip the files from all the batches that completed successfully
start_index = (FAILED_AT_BATCH - 1) * BATCH_SIZE
print(f"Calculation: ({FAILED_AT_BATCH} - 1) * {BATCH_SIZE} = {start_index}")
print(f"Will start scanning for bad files after skipping the first {start_index} files.")

# 2. Get a complete, sorted list of all image files
print("Gathering and sorting all file paths...")
all_files = []
for root, dirs, files in os.walk(image_dir):
    for filename in files:
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            all_files.append(os.path.join(root, filename))

# **CRITICAL**: You must sort the list to match the order TensorFlow uses.
all_files.sort()
print(f"Found a total of {len(all_files)} images.")

# 3. Slice the list to get only the files that need to be checked
files_to_check = all_files[start_index:]
print(f"Now checking the remaining {len(files_to_check)} files...\n")

# 4. Run the check on the smaller, targeted list of files
bad_files = []
for filepath in files_to_check:
    try:
        # Use TensorFlow's own functions to be certain
        image_bytes = tf.io.read_file(filepath)
        tf.io.decode_image(image_bytes)
    except Exception as e:
        print(f"Corrupted file found: {filepath}")
        print(f"   Reason: {e}")
        bad_files.append(filepath)
        # Optional: stop after finding the first bad file
        # break 

print("\n--- Scan Complete ---")
if bad_files:
    print(f"Found {len(bad_files)} corrupted files in the suspected range.")
    print("Remove these files from your dataset and any corresponding label files.")
else:
    print("No corrupted files found in the suspected range. The issue might be different.")

Calculation: (1009 - 1) * 64 = 64512
Will start scanning for bad files after skipping the first 64512 files.
Gathering and sorting all file paths...
Found a total of 72284 images.
Now checking the remaining 7772 files...



Input file read error


Corrupted file found: /kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg
   Reason: {{function_node __wrapped__DecodeImage_device_/job:localhost/replica:0/task:0/device:CPU:0}} jpeg::Uncompress failed. Invalid JPEG data or crop window. [Op:DecodeImage] name: 

--- Scan Complete ---
Found 1 corrupted files in the suspected range.
Remove these files from your dataset and any corresponding label files.


# Debugging & Cleaning: Finding All Corrupted Images

After the `model.predict()` process failed at a specific batch (e.g., `1010`), it's clear we have one or more corrupted image files. Finding them by hand in a dataset of this size is impractical.

This script is an optimized debugger designed to find and handle *all* problematic files from the point of failure onward.

#### 1. Optimized Scanning
Instead of re-checking the entire dataset (which could take hours), we first calculate a `start_index`. This index tells the script to skip all the files from the batches that we *know* processed successfully (e.g., batches 1 through 1009).

#### 2. File Collection and Sorting
* The script gathers all image paths from the directory.
* **Critically**, it sorts the file list alphabetically (`all_files.sort()`). This is essential because it exactly matches the order that `tf.keras.utils.image_dataset_from_directory(shuffle=False)` uses to read the files.
* It then slices this main list to create a smaller `files_to_check` list, containing only the "suspect" files from the point of failure.

#### 3. TensorFlow-based Verification
The script loops through the suspect list and uses `tf.io.decode_image(..., channels=3)` to test each file. This is the most reliable method, as it uses the *exact* same decoding function that the TensorFlow model pipeline uses. Any file that causes an exception here is guaranteed to be a problem.

#### 4. Collect & Remove
* It finds and collects *all* corrupted files in the suspect range (it doesn't stop after the first one).
* It prints a final summary list of all bad files found.
* Finally, it attempts to `os.remove()` them. The `try...except` block correctly anticipates and handles the `OSError` (read-only filesystem error), which is expected when working in the `/kaggle/input/` directory. This serves to confirm which files *would* be removed if the directory were writable.
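
If the failed `os.remove()` calls are unwanted noise, one option is to check up front whether deletion is possible at all. The sketch below is an alternative under stated assumptions, not what the notebook does: the helper `can_delete` is hypothetical, and it relies on the fact that deleting a file requires write permission on its parent directory.

```python
import os
import tempfile

def can_delete(filepath):
    """Best-effort check: the file exists and its directory is writable."""
    parent = os.path.dirname(os.path.abspath(filepath))
    return os.path.isfile(filepath) and os.access(parent, os.W_OK | os.X_OK)

# Demonstration on a temporary, writable directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = os.path.join(tmp_dir, "sample.jpg")
    with open(sample, "wb") as handle:
        handle.write(b"not a real image")
    writable_case = can_delete(sample)
    missing_case = can_delete(os.path.join(tmp_dir, "missing.jpg"))

print(writable_case, missing_case)
```

On Kaggle, paths under `/kaggle/input/` would be expected to fail this check, which is why the pipeline has to skip bad files rather than delete them.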

In [8]:

image_dir = '/kaggle/input/amazon-ml-image-data/images/' 
BATCH_SIZE = 64
FAILED_AT_BATCH = 1010 

start_index = (FAILED_AT_BATCH - 1) * BATCH_SIZE
print(f"Configuration: Will start scanning after skipping the first {start_index} files.")

print("Gathering and sorting all file paths...")
all_files = []
for root, dirs, files in os.walk(image_dir):
    for filename in files:
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            all_files.append(os.path.join(root, filename))

all_files.sort()
print(f"Found a total of {len(all_files)} images.")

files_to_check = all_files[start_index:]
print(f"Now scanning {len(files_to_check)} files for corruption...\n")

corrupted_files_list = []
for filepath in files_to_check:
    try:
        image_bytes = tf.io.read_file(filepath)
        tf.io.decode_image(image_bytes, channels=3) # Use channels=3 for consistency
    except Exception as e:
        print(f"Found corrupted file: {filepath}")
        corrupted_files_list.append(filepath)

print("\n" + "="*50)
if not corrupted_files_list:
    print("Scan complete. No corrupted files were found in the suspected range.")
else:
    print(f"Found a total of {len(corrupted_files_list)} corrupted files.")
    print("--- List of Corrupted Files ---")
    for f in corrupted_files_list:
        print(f)
    
    print("\n--- Attempting to Remove Corrupted Files ---")
    for filepath in corrupted_files_list:
        try:
            os.remove(filepath)
            print(f"Successfully removed: {filepath}")
        except OSError as e:
            print(f"Error removing file: {filepath}")
            print(f"   Reason: {e}")
            print("   This is expected in read-only environments like /kaggle/input/.")

print("="*50)

Configuration: Will start scanning after skipping the first 64576 files.
Gathering and sorting all file paths...
Found a total of 72284 images.
Now scanning 7708 files for corruption...



Input file read error


Found corrupted file: /kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg

Found a total of 1 corrupted files.
--- List of Corrupted Files ---
/kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg

--- Attempting to Remove Corrupted Files ---
Error removing file: /kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg
   Reason: [Errno 30] Read-only file system: '/kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg'
   This is expected in read-only environments like /kaggle/input/.


# Building a Custom `tf.data` Pipeline to Skip Corrupted Files

Our previous debugging step successfully identified a corrupted file (`81stS23MRfL.jpg`) that was causing the `model.predict()` process to fail.

Since we cannot remove this file from the read-only `/kaggle/input/` directory, we must build a new data pipeline that manually filters it out *before* processing. This code replaces the original `image_dataset_from_directory` with a more flexible `tf.data` pipeline.

---

#### Step 1: Filter Out Corrupted Files

First, we define a "blocklist" (`corrupted_files_list`) containing the paths of all known bad files.

* We convert this list to a `set` for high-performance lookups (checking if an item is in a set is much faster than checking a list).
* We get a list of *all* file paths in the directory.
* We create our final `good_file_paths` list by iterating through all files and keeping only the ones *not* present in the `corrupted_files_set`.
* Finally, we **sort** this clean list to ensure the image order remains consistent for later steps (like merging with labels).

---

#### Step 2: Build the Custom `tf.data` Pipeline

Since we can no longer use the simple `image_dataset_from_directory` (as it doesn't support an "exclude" feature), we build our own pipeline from scratch.

* `tf.data.Dataset.from_tensor_slices`: This is the key. We create a new dataset directly from our clean list of `good_file_paths`. The dataset now only contains paths to valid images.
* `load_and_preprocess_image`: We define a helper function that does the work `image_dataset_from_directory` used to do for us:
    1.  `tf.io.read_file`: Reads the image file from its path.
    2.  `tf.io.decode_jpeg`: Decodes the raw bytes into an image tensor.
    3.  `tf.image.resize`: Resizes the image to `(224, 224)` using the `lanczos5` method to match our original setup.
* We then apply this function using `.map()` and add the standard performance optimizations: `.batch()` and `.prefetch()`.
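
One detail worth checking in this helper: Keras documents that MobileNetV2 expects pixel values scaled to `[-1, 1]`, normally via `tf.keras.applications.mobilenet_v2.preprocess_input`, whereas the helper described here passes values in the `[0, 255]` range. The NumPy sketch below only illustrates that scaling arithmetic; it is not part of the original notebook, and whether adding it improves the downstream model is untested here. The function name `scale_to_mobilenet_range` is hypothetical.

```python
import numpy as np

def scale_to_mobilenet_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1] (x / 127.5 - 1)."""
    return pixels.astype(np.float32) / 127.5 - 1.0

# Three example pixel intensities: black, mid-grey, white.
pixels = np.array([0.0, 127.5, 255.0])
scaled = scale_to_mobilenet_range(pixels)

print(scaled)  # [-1.  0.  1.]
```

In the TensorFlow pipeline, the equivalent would be calling `preprocess_input` on the resized image inside `load_and_preprocess_image`.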

---

#### Step 3: Define Model and Extract Features

This part is now straightforward.

* We define the exact same `MobileNetV2` `base_model` as before.
* We call `base_model.predict()`, but this time we pass it our new, custom, and **guaranteed-clean** `image_dataset`.
* The feature extraction now runs to completion without errors, producing the final `extracted_features` array.

In [10]:
# The root directory containing all your images
root_dir = "/kaggle/input/amazon-ml-image-data/images/"

# Add all corrupted file paths you identify to this list.
corrupted_files_list = [
    '/kaggle/input/amazon-ml-image-data/images/81stS23MRfL.jpg',
]

# Model and pipeline parameters
BATCH_SIZE = 64
IMG_SIZE = (224, 224)
print("--- Step 1: Filtering out corrupted files ---")

# Convert the list to a set for very fast lookups
corrupted_files_set = set(corrupted_files_list)

# Get the full paths of all files in the directory
all_file_paths = [os.path.join(root_dir, fname) for fname in os.listdir(root_dir)]

# Create the final, clean list of file paths by excluding the corrupted ones
good_file_paths = [path for path in all_file_paths if path not in corrupted_files_set]

good_file_paths.sort()

print(f"Total files found in directory: {len(all_file_paths)}")
print(f"Number of corrupted files to skip: {len(corrupted_files_set)}")
print(f"Final number of clean files for processing: {len(good_file_paths)}")

print("\n--- Step 2: Building the custom tf.data pipeline ---")

# Define a function to read, decode, and resize an image from its path
def load_and_preprocess_image(path):
    # Read the raw file data
    image = tf.io.read_file(path)
    # Decode it as a JPEG image with 3 color channels (RGB)
    image = tf.io.decode_jpeg(image, channels=3)
    # Resize to the model's input size, using the same interpolation as the original pipeline
    image = tf.image.resize(image, IMG_SIZE, method='lanczos5')
    return image

# Create a TensorFlow Dataset from the clean list of file paths
image_dataset = tf.data.Dataset.from_tensor_slices(good_file_paths)

# Apply the preprocessing function to each file path in parallel for efficiency
image_dataset = image_dataset.map(load_and_preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)

# Batch the images into groups of 64
image_dataset = image_dataset.batch(BATCH_SIZE)

# Prefetch the next batch in the background for improved performance
image_dataset = image_dataset.prefetch(buffer_size=tf.data.AUTOTUNE)

print("Custom data pipeline created successfully!")
print(image_dataset)


print("\n--- Step 3: Defining the model and extracting features ---")

# Same model definition as before
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet',
    pooling='avg'  # Applies Global Average Pooling to the output
)

# Run the prediction on the new, clean, and filtered dataset
# This will now run to completion without errors.
extracted_features = base_model.predict(image_dataset)

print("\n--- Feature Extraction Complete! ---")
print(f"Shape of extracted features: {extracted_features.shape}")
print(f"This shape means ({extracted_features.shape[0]} images, {extracted_features.shape[1]} features per image)")

--- Step 1: Filtering out corrupted files ---
Total files found in directory: 72284
Number of corrupted files to skip: 1
Final number of clean files for processing: 72283

--- Step 2: Building the custom tf.data pipeline ---
Custom data pipeline created successfully!
<_PrefetchDataset element_spec=TensorSpec(shape=(None, 224, 224, 3), dtype=tf.float32, name=None)>

--- Step 3: Defining the model and extracting features ---
1130/1130 ━━━━━━━━━━━━━━━━━━━━ 734s 647ms/step

--- Feature Extraction Complete! ---
Shape of extracted features: (72283, 1280)
This shape means (72283 images, 1280 features per image)


# Text Feature Engineering: TF-IDF

To use the `catalog_content` text data in our model, we convert it into a numerical format using **TF-IDF** (Term Frequency-Inverse Document Frequency).

* `TfidfVectorizer(max_features=5000)`: We initialize the vectorizer. This object will convert our text into a matrix of TF-IDF scores.
    * `max_features=5000`: We limit the vocabulary to the **5,000 most frequent terms** across all catalog entries. This controls the dimensionality of our data by dropping rare terms. It does not remove very common words; that would require options such as `stop_words` or `max_df`.

* `X_text_sparse = vectorizer.fit_transform(train.catalog_content)`:
    * `fit_transform()` is called on the **training data**.
    * **fit**: The vectorizer "learns" the 5,000-word vocabulary from the `train.catalog_content`.
    * **transform**: It then converts this training text into a **sparse matrix** (`X_text_sparse`), where each row represents a catalog entry and each column represents one of the 5,000 vocabulary words.

* `test_sparse = vectorizer.transform(test.catalog_content)`:
    * We *only* call `.transform()` on the **test data**.
    * This is **critical**: It applies the *same vocabulary* learned from the training data to the test data. This ensures that the resulting columns (features) are consistent and aligned between our training and test sets.
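
The fit-on-train, transform-on-test pattern can be seen on a toy corpus. The two small lists below are made-up stand-ins for `train.catalog_content` and `test.catalog_content`, and `max_features=5` is chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the catalog text columns.
train_text = [
    "organic green tea bags",
    "green taco sauce mild",
    "organic cooking wine",
]
test_text = ["mild green tea", "unseen words only"]

toy_vectorizer = TfidfVectorizer(max_features=5)

# fit_transform learns the vocabulary from the training text only.
train_matrix = toy_vectorizer.fit_transform(train_text)

# transform reuses that vocabulary; words never seen in training are ignored.
test_matrix = toy_vectorizer.transform(test_text)

print(train_matrix.shape, test_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

Both matrices have the same number of columns, and the second test document, which shares no words with the training text, becomes an all-zero row.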

In [13]:
vectorizer = TfidfVectorizer(max_features=5000)
X_text_sparse = vectorizer.fit_transform(train.catalog_content)
test_sparse = vectorizer.transform(test.catalog_content)
X_text_sparse.shape

# Convert Sparse Matrix to Dense Array

This line converts our `X_text_sparse` matrix (which was efficiently created by TF-IDF) into a standard, **dense** NumPy array.

* `X_text_sparse`: This is a **sparse matrix**. It saves memory by only storing the *locations* of non-zero values (i.e., the words that are actually present in a document).
* `.toarray()`: This method expands the sparse matrix into a full, dense array. It fills in all the "un-stored" values with zeros.

The resulting `np_arr_x_sparse` will have the shape `(number_of_documents, 5000)`.

---
**Important Note on Memory:**
Be very cautious with this operation. If the sparse matrix is large (e.g., 100,000 documents $\times$ 5,000 features), the dense array will require a huge amount of RAM: about 4 GB at 8 bytes per `float64` value.

* **When to use `.toarray()`**: Only when the resulting array will comfortably fit in your memory *and* the model you are using (like a standard `tf.keras.layers.Dense` layer) *requires* a dense input.
* **When to avoid it**: If you are using models like XGBoost, LightGBM, or scikit-learn's `Ridge`, you should **keep the data in its sparse format**. Those models are optimized to work directly with sparse matrices, which is much faster and more memory-efficient.
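
A quick back-of-the-envelope check shows why. The sketch below uses a small random sparse matrix as a stand-in for `X_text_sparse` (the 1% density is an arbitrary assumption, and the helper names are hypothetical) and compares the memory actually stored with what the dense equivalent needs.

```python
import numpy as np
from scipy import sparse

def dense_nbytes(n_rows, n_cols, dtype=np.float64):
    """Bytes needed to hold an n_rows x n_cols dense array of `dtype`."""
    return n_rows * n_cols * np.dtype(dtype).itemsize

def csr_nbytes(matrix):
    """Bytes held by the three arrays that make up a CSR matrix."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

# Small stand-in matrix: 1,000 documents x 5,000 terms, 1% non-zero.
toy = sparse.random(1000, 5000, density=0.01, format="csr", random_state=0)

print(csr_nbytes(toy) < dense_nbytes(*toy.shape))  # True

# The example from the note above: 100,000 documents x 5,000 float64 features.
print(dense_nbytes(100_000, 5_000) / 1e9)  # 4.0 (GB)
```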

In [30]:
np_arr_x_sparse = X_text_sparse.toarray()

### Converting Dense Array to Pandas DataFrame

In [32]:
pd_arr = pd.DataFrame(np_arr_x_sparse)

## Vectorizing Test Dataset Catalog_Content Column

In [52]:
X_test_sparse_test = vectorizer.transform(test.catalog_content)

In [53]:
X_test_sparse_test

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5548071 stored elements and shape (75000, 5000)>

# Saving Sparse Matrix to NPZ file

We are saving our processed sparse text features (`X_test_sparse_test`) to a file using `scipy.sparse.save_npz`.

This is a crucial step for efficiency:

1.  **Saves Space**: It preserves the **sparse format** (like `csr_matrix`), which takes up significantly less disk space than saving the full, dense array would.
2.  **Saves Time**: It allows us to **reload** this processed data instantly in the future using `load_npz()`. This means we can skip the time-consuming `TfidfVectorizer` step in subsequent runs or in a separate notebook (e.g., for inference).

The file `sparse_matrix.npz` now contains the complete structure of our sparse test features.
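
As a sanity check, the save and reload round trip can be tried on a small matrix first. In this sketch the random matrix and the temporary file name `demo_sparse_matrix.npz` are stand-ins, not the notebook's data.

```python
import os
import tempfile

from scipy import sparse
from scipy.sparse import load_npz, save_npz

# Small stand-in for the TF-IDF matrix.
original = sparse.random(50, 200, density=0.05, format="csr", random_state=0)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_sparse_matrix.npz")
    save_npz(demo_path, original)
    restored = load_npz(demo_path)

# The restored matrix keeps the CSR format, shape, and values.
same_format = restored.format == original.format
same_values = (restored != original).nnz == 0
print(same_format, restored.shape, same_values)  # True (50, 200) True
```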

In [56]:
from scipy.sparse import csr_matrix, save_npz, load_npz
file_path = 'sparse_matrix.npz'
save_npz(file_path, X_test_sparse_test)

# Saving Extracted Features (Checkpoint)

This is one of the most important steps for an efficient workflow. The previous step, `base_model.predict()`, was the most computationally expensive part of the entire notebook. It used the **T4 x2 GPU** accelerator to process every single image in our dataset, which can take a significant amount of time.

We **do not** want to re-run this step every time we open the notebook.

By using `np.save()`, we save the resulting `extracted_features` array directly to disk as a `.npy` file. This is a standard, highly efficient binary format for storing NumPy arrays.

* `np.save(file_path, extracted_features)`: This command takes our large feature array (e.g., shape `(72283, 1280)`) and writes it to the file `mobilenet_features.npy`.
* `np.load(file_path)`: This command will be used in all future sessions. It reloads the entire array from the file back into a variable almost instantly.

This creates a crucial **checkpoint**. In our next notebook session, we can comment out the entire model loading and `model.predict()` sections and just run `np.load()`, allowing us to jump straight to training our final model (like XGBoost or LightGBM) in seconds, **saving valuable time and GPU quota**.
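
Instead of commenting cells in and out, the checkpoint can also be wrapped in a small "load if present, otherwise compute" helper. This is a sketch of that pattern, not the notebook's code: `load_or_compute` and `fake_extraction` are hypothetical names, and `fake_extraction` stands in for the expensive `base_model.predict(image_dataset)` call.

```python
import os
import tempfile

import numpy as np

def load_or_compute(path, compute_fn):
    """Load a cached array from `path`, or compute and cache it."""
    if os.path.exists(path):
        return np.load(path)
    features = compute_fn()
    np.save(path, features)
    return features

calls = []

def fake_extraction():
    # Placeholder for the expensive base_model.predict(image_dataset) call.
    calls.append(1)
    return np.ones((3, 1280), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "mobilenet_features.npy")
    first = load_or_compute(cache_path, fake_extraction)   # computes and saves
    second = load_or_compute(cache_path, fake_extraction)  # loads from disk

print(len(calls), np.array_equal(first, second))  # 1 True
```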

In [12]:
import numpy as np

# 1. Save the array to a .npy file
file_path = 'mobilenet_features.npy'
np.save(file_path, extracted_features)

print(f"Features saved successfully to '{file_path}'")

# 2. To use it later, load it back into a variable
loaded_features = np.load(file_path)

print(f"\nFeatures loaded successfully from '{file_path}'")
print(f"Shape of loaded features: {loaded_features.shape}")

# Optional: Verify that the loaded data is identical to the original
are_equal = np.array_equal(extracted_features, loaded_features)
print(f"Verification: Loaded data is the same as the original: {are_equal}")

Features saved successfully to 'mobilenet_features.npy'

Features loaded successfully from 'mobilenet_features.npy'
Shape of loaded features: (72283, 1280)
Verification: Loaded data is the same as the original: True


# Re-associating Image Features with Training Data

We had a "one-to-many" problem to solve:
1.  **Original Data (`train`):** Has many rows (e.g., 75,000+) where multiple rows (products) can share the *same* `image_link`.
2.  **Extracted Features (`extracted_features`):** This is a NumPy array where we *efficiently* calculated features for each *unique* image (e.g., 72,283 feature vectors).

We needed to map the correct feature vector back to *every* row in the original `train` DataFrame.

---

#### Step 1: Create a Feature "Lookup" DataFrame

We first created a new, small DataFrame (`features_df`) that acts as a simple key-value "lookup table."

* `'image_link': sorted_unique_links`: We use our clean, sorted list of unique image links. This list **must** be in the exact same order as the feature vectors in `extracted_features`.
* `'mobilenet_features': list(extracted_features)`: We add our NumPy array of features.

This `features_df` now has one row for each unique image, linking its `image_link` (the key) to its `mobilenet_features` (the value).

---

#### Step 2: Merge Back to the Original DataFrame

This is the final, crucial step. We used `pd.merge()` to combine our original `train` DataFrame with our new `features_df` lookup table.

* `pd.merge(train, features_df, ...)`: We are merging `train` (the "left" DataFrame) with `features_df` (the "right" DataFrame).
* `on='image_link'`: Specifies that the `image_link` column is the key to join on.
* `how='left'`: This is the most important part. It instructs pandas to:
    1.  Keep **all** rows from the "left" DataFrame (our original `train` data).
    2.  For each row in `train`, it looks up its `image_link` in `features_df`.
    3.  It then "attaches" (merges) the corresponding `mobilenet_features` from `features_df` to that row.

The result (`final_df`) is our complete, enriched dataset. It has the original 75,000+ rows, but now includes a new `mobilenet_features` column containing the correct 1280-dimension vector for each product's image.
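
Because the merge is only correct if each link is paired with the right vector, it can be safer to key the lookup on the image file name taken from the list of files that were actually processed, rather than relying on two independently sorted lists having the same length and order. The sketch below is one way to do that under stated assumptions: every image on disk is named after the last path component of its `image_link` (as `81stS23MRfL.jpg` suggests), and the toy `paths`, `features`, and `toy_train` values stand in for `good_file_paths`, `extracted_features`, and `train`.

```python
import os

import numpy as np
import pandas as pd

# Stand-ins for good_file_paths and extracted_features (same order, same length).
paths = ["/images/a.jpg", "/images/b.jpg", "/images/d.jpg"]
features = np.arange(6, dtype=np.float32).reshape(3, 2)

# Stand-in for train: two rows share a.jpg, and c.jpg has no features.
toy_train = pd.DataFrame({
    "sample_id": [1, 2, 3, 4],
    "image_link": [
        "https://example.com/I/a.jpg",
        "https://example.com/I/b.jpg",
        "https://example.com/I/a.jpg",
        "https://example.com/I/c.jpg",
    ],
})

# One row per processed image, keyed by file name.
lookup = pd.DataFrame({
    "image_file": [os.path.basename(p) for p in paths],
    "mobilenet_features": list(features),
})

toy_train["image_file"] = toy_train["image_link"].map(os.path.basename)
merged = toy_train.merge(lookup, on="image_file", how="left", validate="many_to_one")

# Rows whose image was skipped or missing end up with NaN and can be handled explicitly.
missing = merged["mobilenet_features"].isna()
print(len(merged), int(missing.sum()))  # 4 1
```

The `validate="many_to_one"` argument makes pandas raise an error if the lookup table ever contains duplicate keys, which would otherwise silently multiply rows.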

In [20]:
# train is the original DataFrame with 75,000 rows.

# Unique image links, sorted to mirror the sorted file order used for feature extraction.
sorted_unique_links = sorted(train['image_link'].unique())

# Drop the last five links so the list length matches extracted_features (72,283 rows).
# NOTE: this keeps links and feature vectors aligned only if the five links without
# features are the last five in sorted order.
sorted_unique_links = sorted_unique_links[:-5]

# extracted_features is the NumPy array of shape (number_of_unique_links, feature_size).

# Creating the features DataFrame
features_df = pd.DataFrame({
    'image_link': sorted_unique_links,
    'mobilenet_features': list(extracted_features)
})

print("--- Features DataFrame ---")
print(features_df)
print("\n")


# Merging it back with the original DataFrame
final_df = pd.merge(train, features_df, on='image_link', how='left')

print("--- Final Merged DataFrame ---")
print(final_df)

--- Features DataFrame ---
                                              image_link  \
0      https://m.media-amazon.com/images/I/018kdDYIAY...   
1      https://m.media-amazon.com/images/I/01O1AwI4pJ...   
2      https://m.media-amazon.com/images/I/01SCsYMIKj...   
3      https://m.media-amazon.com/images/I/11+1w3qzdn...   
4      https://m.media-amazon.com/images/I/11+C-FVBGY...   
...                                                  ...   
72278  https://m.media-amazon.com/images/I/A1zEWsomY4...   
72279  https://m.media-amazon.com/images/I/A1zJmUvrGo...   
72280  https://m.media-amazon.com/images/I/A1zPpWyb2t...   
72281  https://m.media-amazon.com/images/I/A1zcbLqB0t...   
72282  https://m.media-amazon.com/images/I/A1zdqaF5Vo...   

                                      mobilenet_features  
0      [0.3847377, 0.12714423, 0.002564683, 0.0, 0.0,...  
1      [0.0, 0.056353927, 0.0, 0.0, 0.0, 0.12231338, ...  
2      [0.0, 0.03530587, 0.0690719, 0.07184036, 0.051...  
3      [1.133339

### Final Merged Training Dataset

In [22]:
final_df

Unnamed: 0,sample_id,catalog_content,image_link,price,mobilenet_features
0,33127,"Item Name: La Victoria Green Taco Sauce Mild, ...",https://m.media-amazon.com/images/I/51mo8htwTH...,4.890,"[0.0, 0.3311414, 0.0, 0.0, 0.004351908, 0.0832..."
1,198967,"Item Name: Salerno Cookies, The Original Butte...",https://m.media-amazon.com/images/I/71YtriIHAA...,13.120,"[0.29220903, 0.7780709, 0.0, 0.0, 0.0, 0.0, 1...."
2,261251,"Item Name: Bear Creek Hearty Soup Bowl, Creamy...",https://m.media-amazon.com/images/I/51+PFEe-w-...,1.970,"[0.016326116, 1.9629351, 0.0, 0.0, 0.0, 0.0855..."
3,55858,Item Name: Judee’s Blue Cheese Powder 11.25 oz...,https://m.media-amazon.com/images/I/41mu0HAToD...,30.340,"[0.0, 0.0, 0.008575429, 0.0, 0.0, 0.0, 1.56927..."
4,292686,"Item Name: kedem Sherry Cooking Wine, 12.7 Oun...",https://m.media-amazon.com/images/I/41sA037+Qv...,66.490,"[0.8362831, 0.49924576, 0.00024921066, 0.0, 0...."
...,...,...,...,...,...
74995,41424,Item Name: ICE BREAKERS Spearmint Sugar Free M...,https://m.media-amazon.com/images/I/81p9PcPsff...,10.395,"[0.24348412, 2.5177064, 0.11823583, 0.0, 0.0, ..."
74996,35537,"Item Name: Davidson's Organics, Vanilla Essenc...",https://m.media-amazon.com/images/I/51DDKoa+mb...,35.920,"[0.027533248, 1.2338995, 0.0, 0.0, 0.0, 0.0627..."
74997,249971,Item Name: Jolly Rancher Hard Candy - Blue Ras...,https://m.media-amazon.com/images/I/91R2XCcpUf...,50.330,"[0.0, 0.8403328, 0.0, 0.0, 0.0, 0.0, 1.9255025..."
74998,188322,Item Name: Nescafe Dolce Gusto Capsules - CARA...,https://m.media-amazon.com/images/I/51W40YU98+...,15.275,"[0.12463549, 2.4297733, 0.0, 0.026566584, 0.0,..."


In [23]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sample_id           75000 non-null  int64  
 1   catalog_content     75000 non-null  object 
 2   image_link          75000 non-null  object 
 3   price               75000 non-null  float64
 4   mobilenet_features  74995 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 2.9+ MB


### Saving Merged Training Dataset to CSV

In [31]:
final_df.to_csv("final_df.csv", index=False)

### Concatenated DataFrame

In [36]:
concat_df = pd.concat([final_df.drop(["catalog_content",'image_link'],axis=1), pd_arr],axis=1)

### Saving Concatenated DataFrame to CSV

In [37]:
concat_df.to_csv("concat_df.csv", index=False)

### Features Shape Extracted from a single image using MobileNet

In [38]:
final_df.mobilenet_features[0].shape

(1280,)

# Final Feature Engineering: Expanding Image Features

Our `mobilenet_features` column is currently "packed": each cell contains a list or vector of 1280 numbers. To make these features usable for models like XGBoost or LightGBM, we must "unpack" them, so that each of the 1280 features gets its **own column**.

---

#### Step 1: Expand the Feature Column

`expanded_features = concat_df['mobilenet_features'].apply(pd.Series)`

We used the `.apply(pd.Series)` method on the `mobilenet_features` column. This powerful pandas function takes each list-like item (of 1280 features) and explodes it horizontally into a new DataFrame.

* **Before**: One column named `mobilenet_features`.
* **After**: A new DataFrame (`expanded_features`) with **1280 columns** (named `0`, `1`, `2`, ..., `1279`) and the same number of rows.

---

#### Step 2: Rename New Columns

`expanded_features.columns = [f'f_{i}' for i in expanded_features.columns]`

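The new columns are named with plain integers (0, 1, 2...). Integer names can be problematic for some models (like LightGBM), and here they would also duplicate the integer-named columns already present in `concat_df` (they appear as `0`, `1`, `2`, ... in the printed output of the cell). We rename them to be descriptive and unique, such as `f_0`, `f_1`, `f_2`, and so on.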

---

#### Step 3: Concatenate Back to the Main DataFrame

`full_final = pd.concat([..., expanded_features], axis=1)`

Finally, we combined our original data with our new expanded features.

1.  `concat_df.drop('mobilenet_features', axis=1)`: We take our DataFrame but **drop** the original "packed" `mobilenet_features` column, as it's now redundant.
2.  `expanded_features`: This is our new DataFrame containing the 1280 individual feature columns.
3.  `axis=1`: This tells `pd.concat` to join the two DataFrames **side-by-side** (horizontally).

The `full_final` DataFrame is now our complete, "flat" dataset. It contains `sample_id`, the target `price`, all text features, and all 1280 image features as separate columns.
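
The three steps above can be tried on a toy frame. The sketch below uses 3-dimensional vectors instead of 1280 and a hypothetical frame named `toy`. It also swaps `.apply(pd.Series)` for `np.stack`, which is typically much faster on large frames but assumes every row holds a vector of the same length (no missing values); this is an alternative, not the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a "packed" column of 3-dimensional vectors (illustrative values only).
toy = pd.DataFrame({
    "sample_id": [10, 11],
    "mobilenet_features": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# Step 1: stack the per-row vectors into one 2-D array.
stacked = np.stack(toy["mobilenet_features"].tolist())

# Step 2: wrap it in a DataFrame with descriptive, unique column names.
expanded = pd.DataFrame(
    stacked,
    index=toy.index,
    columns=[f"f_{i}" for i in range(stacked.shape[1])],
)

# Step 3: drop the packed column and join the expanded one side by side.
flat = pd.concat([toy.drop("mobilenet_features", axis=1), expanded], axis=1)
print(flat.columns.tolist())
print(flat.shape)
```

Passing `index=toy.index` keeps the rows aligned during `pd.concat`, which matches rows by index label rather than by position.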

In [39]:
expanded_features = concat_df['mobilenet_features'].apply(pd.Series)

expanded_features.columns = [f'f_{i}' for i in expanded_features.columns]


full_final = pd.concat([concat_df.drop('mobilenet_features', axis=1), expanded_features], axis=1)


print("Final DataFrame with Expanded Features:")
print(full_final.head())
print(f"\nFinal Shape: {full_final.shape}")

Final DataFrame with Expanded Features:
   sample_id  price    0    1    2    3    4    5    6    7  ...    f_1270  \
0      33127   4.89  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.461782   
1     198967  13.12  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  2.485705   
2     261251   1.97  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.103680   
3      55858  30.34  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.448067   
4     292686  66.49  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  2.490304   

     f_1271    f_1272    f_1273    f_1274    f_1275    f_1276  f_1277  \
0  0.185137  0.047164  0.000000  3.179038  0.000000  0.422553     0.0   
1  0.027164  2.033786  0.126224  3.105233  0.006095  0.367020     0.0   
2  0.003377  0.000000  1.435634  1.427566  0.004048  0.404094     0.0   
3  0.000000  0.733142  0.062608  1.937328  0.000000  0.349168     0.0   
4  0.000000  1.652621  0.000000  0.900392  0.000000  0.211833     0.0   

     f_1278    f_1279  
0  0.220657  0.000000  
1  1

### Saving the Final DataFrame to CSV (Checkpoint)

In [40]:
full_final.to_csv("Full_final.csv",index=False)

### Creating CSR Matrix of the final dataframe

In [44]:
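# Note: full_final still contains `sample_id` and the target `price`;
# drop them here if they should not be model inputs
full_final_csr = csr_matrix(full_final)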

### Splitting the Dataset into Train and Evaluation Sets

In [46]:
X_train, X_eval, y_train, y_eval = train_test_split(full_final_csr, train.price)
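
Since the matrix built above is derived from the whole of `full_final`, it includes the `price` column that the model is supposed to predict. A minimal sketch of a leak-free variant is shown below on a hypothetical toy frame (`toy_full`); the `test_size` and `random_state` values are assumptions chosen for illustration, not settings taken from the notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Toy stand-in for `full_final` (illustrative values only).
toy_full = pd.DataFrame({
    "sample_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "price": [4.89, 13.12, 1.97, 30.34, 66.49, 10.40, 35.92, 50.33],
    "f_0": [0.0, 0.3, 0.0, 0.0, 0.8, 0.2, 0.0, 0.1],
    "f_1": [0.3, 0.8, 2.0, 0.0, 0.5, 2.5, 1.2, 2.4],
})

target = toy_full["price"]
# Keep only genuine predictors: drop the identifier and the target.
features = toy_full.drop(columns=["sample_id", "price"])

X = csr_matrix(features.to_numpy())
X_train, X_eval, y_train, y_eval = train_test_split(
    X, target, test_size=0.25, random_state=42  # fixed seed for a reproducible split
)
print(X_train.shape, X_eval.shape)
```

Fixing `random_state` makes the split, and therefore the evaluation score, repeatable between runs.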

### Training the XGBRegressor Model on the Training Set

In [49]:
xgb_regressor = XGBRegressor()

# Train the model on the training split
xgb_regressor.fit(X_train, y_train, verbose=True)  # verbose only prints per-round metrics when an eval_set is passed

# Evaluate Model Performance (SMAPE)

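Now we evaluate our trained `xgb_regressor` on the hold-out validation set (`X_eval`). Because `X_eval` was built from `full_final`, which still contains the `price` column, the score reported here should be read as optimistic.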

#### 1. Generate Predictions
`y_pred = xgb_regressor.predict(X_eval)`

#### 2. Define the Evaluation Metric: SMAPE
We must define the **SMAPE** (Symmetric Mean Absolute Percentage Error) function, as it is not built-in.

The formula is: $SMAPE = \frac{100}{n} \sum_{i=1}^{n} \frac{2 \cdot |F_i - A_i|}{|A_i| + |F_i|}$

* `epsilon = np.finfo(float).eps`: `epsilon` is the machine epsilon for 64-bit floats (about 2.2e-16), i.e. the gap between 1.0 and the next representable number. We add it to the denominator to avoid a **0/0 division** (which would yield `NaN`) when both the actual and predicted values are 0.
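
A small worked example helps to sanity-check the formula. The sketch below is an equivalent form of the metric that uses `np.mean` instead of dividing the sum by `len(actual)`; the input values are made up. For actual values 100 and 200 and forecasts 110 and 180, the two terms are 20/210 and 40/380, so the score is roughly 10.03.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    epsilon = np.finfo(float).eps  # guards against 0/0 when both values are 0
    ratio = 2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast) + epsilon)
    return 100 * np.mean(ratio)

score = smape([100.0, 200.0], [110.0, 180.0])  # ordinary case
zero_case = smape([0.0], [0.0])                # both zero: numerator is 0, no NaN
worst_case = smape([100.0], [0.0])             # forecast of 0 for a non-zero actual
print(round(score, 3), zero_case, round(worst_case, 3))
```

The last case shows the upper bound of the metric: a forecast of 0 for a non-zero actual value contributes the maximum of 200.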

#### 3. Calculate and Print the Final Score
Finally, we call our custom `smape` function to compare the true validation labels (`y_eval`) against our model's predictions (`y_pred`).

`print("SMAPE:", smape(y_eval, y_pred))`

In [50]:
y_pred = xgb_regressor.predict(X_eval)
def smape(actual, forecast):
    epsilon = np.finfo(float).eps
    return 100 / len(actual) * np.sum(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast) + epsilon))

print("SMAPE:", smape(y_eval, y_pred))

SMAPE: 1.4315532046788857


# Save the Trained Model

This is the final and one of the most important steps of our training pipeline. We've spent a lot of time and computational resources training our `xgb_regressor` model. We must now **save (or "serialize") this trained object** to disk so we can use it later without retraining.

* **Why `joblib`?**: `joblib.dump()` is the usual way to persist scikit-learn-compatible estimators, and `XGBRegressor` follows that API, so the whole Python object can be saved and restored in one call. The resulting file is tied to the library versions used to create it; XGBoost's native `save_model()` is the more portable option when the model must be loaded under a different version.

* **What it does**: The command saves the entire model (including all its learned internal parameters, feature importances, and configuration) into a single file named `xgbregressor_model.joblib`.

* **Next Steps**: This file is our final "artifact." We can now download it, or (more commonly) create a new, separate "Inference Notebook." In that notebook, we will simply load this file using `joblib.load()` and use it to generate predictions on the *actual competition test set*.

In [51]:
joblib_filename = 'xgbregressor_model.joblib'
joblib.dump(xgb_regressor, joblib_filename)
print(f"Model saved successfully to '{joblib_filename}'")

Model saved successfully to 'xgbregressor_model.joblib'
