# Product Classification using TF-IDF, Graph Embeddings, and Price Features

This notebook implements product classification using:
1. Text-based features: TF-IDF from product descriptions (no sentence embeddings)
2. Graph-based features: Node2Vec embeddings
3. Price-based features: Price buckets and statistical features

We'll evaluate the approach using multiple classifiers:
- Random Forest
- XGBoost
- LinearSVC
- Neural Networks


## 1. Imports and Setup


In [4]:
# General imports
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display, HTML

# Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Graph embeddings
import node2vec

# Evaluation
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.preprocessing import label_binarize


2025-05-20 22:15:54.217596: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747768554.277744    8455 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747768554.295922    8455 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1747768554.439821    8455 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747768554.439856    8455 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747768554.439861    8455 computation_placer.cc:177] computation placer alr

## 2. Helper Functions


In [5]:
# Function to extract graph features for a set of nodes
def extract_graph_features(G, node_list):
    print("Calculating degree centrality...")
    degree_centrality = nx.degree_centrality(G)

    print("Calculating clustering coefficient...")
    clustering_coefficient = nx.clustering(G)

    print("Calculating PageRank...")
    pagerank = nx.pagerank(G, alpha=0.85, max_iter=100)

    print("Calculating triangle count...")
    triangles = nx.triangles(G)

    # Create a dataframe with the features
    features_df = pd.DataFrame(index=node_list)

    features_df['degree_centrality'] = features_df.index.map(lambda x: degree_centrality.get(x, 0))
    features_df['clustering_coefficient'] = features_df.index.map(lambda x: clustering_coefficient.get(x, 0))
    features_df['pagerank'] = features_df.index.map(lambda x: pagerank.get(x, 0))
    features_df['triangle_count'] = features_df.index.map(lambda x: triangles.get(x, 0))

    # Degree (number of connections)
    print("Calculating degree...")
    degree_dict = dict(G.degree())
    features_df['degree'] = features_df.index.map(lambda x: degree_dict.get(x, 0))

    return features_df

# Function to calculate multiclass log loss
def multiclass_log_loss(y_true, y_pred_proba, eps=1e-15):
    """
    y_true: array-like of shape (N,) - true class labels
    y_pred_proba: array-like of shape (N, C) - predicted class probabilities
    """
    # Number of samples
    N = y_true.shape[0]

    # One-hot encode the true labels (yij)
    y_true_one_hot = label_binarize(y_true, classes=np.arange(y_pred_proba.shape[1]))

    # Clip predicted probabilities to avoid log(0)
    y_pred_proba = np.clip(y_pred_proba, eps, 1 - eps)

    # Compute the log loss
    loss = -np.sum(y_true_one_hot * np.log(y_pred_proba)) / N
    return loss

# Function to create price features
def create_price_features(product_ids, price_df):
    """
    Create price-based features for a list of product IDs
    
    Args:
        product_ids: List of product IDs
        price_df: DataFrame with product_id and price columns
        
    Returns:
        DataFrame with price features
    """
    # Create a DataFrame with product IDs as index
    price_features = pd.DataFrame(index=product_ids)
    
    # Map prices to products
    price_dict = dict(zip(price_df['product_id'], price_df['price']))
    price_features['price'] = price_features.index.map(lambda x: price_dict.get(x, np.nan))
    
    # Fill missing prices with median
    median_price = price_df['price'].median()
    price_features['price'].fillna(median_price, inplace=True)
    
    # Create price buckets (as binary features)
    price_features['price_0_10'] = (price_features['price'] <= 10).astype(int)
    price_features['price_10_100'] = ((price_features['price'] > 10) & (price_features['price'] <= 100)).astype(int)
    price_features['price_100_plus'] = (price_features['price'] > 100).astype(int)
    
    # Log transformation of price
    price_features['price_log'] = np.log1p(price_features['price'])
    
    # Price rank (percentile)
    price_features['price_rank'] = price_features['price'].rank(pct=True)
    
    # Z-score of price (how many standard deviations from the mean)
    mean_price = price_df['price'].mean()
    std_price = price_df['price'].std()
    price_features['price_zscore'] = (price_features['price'] - mean_price) / std_price
    
    return price_features


## 3. Loading Data


In [6]:
# Load the edge list data
edgelist_file = 'data_files/edgelist.txt'
edges_df = pd.read_csv(edgelist_file, header=None, names=['source', 'target'])

# Load the class labels
labels_file = 'y_train.txt'
labels_df = pd.read_csv(labels_file, header=None, names=['product_id', 'label'])

# Load the train and test splits
train_df = pd.read_csv('split_dataset/train.csv')
test_df = pd.read_csv('split_dataset/test.csv')

# Load price data
price_df = pd.read_csv('data_files/price.txt', header=None, names=['product_id', 'price'])

# Display basic information about the datasets
print(f"Edge list shape: {edges_df.shape}")
print(f"Labels shape: {labels_df.shape}")
print(f"Train set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
print(f"Price data shape: {price_df.shape}")

# Check the first few rows of each dataset
print("\nEdge list sample:")
print(edges_df.head())

print("\nLabels sample:")
print(labels_df.head())

print("\nTrain set sample:")
print(train_df.head())

print("\nTest set sample:")
print(test_df.head())

print("\nPrice data sample:")
print(price_df.head())


Edge list shape: (1811087, 2)
Labels shape: (182006, 2)
Train set shape: (145604, 3)
Test set shape: (36402, 3)
Price data shape: (198817, 2)

Edge list sample:
   source  target
0  251528  237411
1  100805   74791
2   38634   97747
3  247470   77089
4  267060  250490

Labels sample:
   product_id  label
0       66795      9
1      242781      3
2       91280      2
3       56356      5
4      218494      0

Train set sample:
   product_id                                         text_clean  label
0      114704  hornady unprimed winchester cartridge case hor...      2
1      250731  tachikara tk leopard knee pad tachikara tk leo...     11
2      152967  g asd replacement cutter aluminum amp carbon u...      2
3        4541  mtech usa mt tactical folding knife inch close...      2
4      142062  nhl pittsburgh penguins game day black pro sha...      7

Test set sample:
   product_id                                         text_clean  label
0       56218                             katz h

In [7]:
# Handle missing values in text data
# Find rows with missing 'description' in train_df
missing_train = train_df[train_df['text_clean'].isnull()]

# Find rows with missing 'description' in test_df
missing_test = test_df[test_df['text_clean'].isnull()]

# Print the rows with missing values
print("Rows with missing 'description' in train_df:")
print(missing_train)

print("\nRows with missing 'description' in test_df:")
print(missing_test)

# Remove rows with missing 'description' in train_df
train_df = train_df.dropna(subset=['text_clean'])

# Remove rows with missing 'description' in test_df
test_df = test_df.dropna(subset=['text_clean'])

# Verify if any rows with missing values remain
print("Missing values in cleaned train_df:", train_df.isnull().sum())
print("Missing values in cleaned test_df:", test_df.isnull().sum())

# Reset the index after removing rows with missing values
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

# Rename columns for clarity
train_df.columns = ['product_id', 'description', 'label']
test_df.columns = ['product_id', 'description', 'label']

# Check the first few rows to confirm that the index is reset
print("\nTrain data after cleaning:")
display(HTML(train_df.head().to_html(escape=False)))

print("\nTest data after cleaning:")
display(HTML(test_df.head().to_html(escape=False)))


Rows with missing 'description' in train_df:
       product_id text_clean  label
89285      265165        NaN      7

Rows with missing 'description' in test_df:
       product_id text_clean  label
34767      174103        NaN      5
Missing values in cleaned train_df: product_id    0
text_clean    0
label         0
dtype: int64
Missing values in cleaned test_df: product_id    0
text_clean    0
label         0
dtype: int64

Train data after cleaning:


Unnamed: 0,product_id,description,label
0,114704,hornady unprimed winchester cartridge case hornady brass ll reload great consistency predictable performance accuracy commodity brass simply deliver hornady build case precision attention detail focus perfection worldleader bullet ammunition measure consistent wall concentricity run case pressure calibration test ensure uniform case expansion quality control extreme case hand inspect thank kind manufacturing care hornady brass ensure proper seating bullet case chamber mean consistent pressure ultimately yield optimal velocity repeatable accuracy,2
1,250731,tachikara tk leopard knee pad tachikara tk leopard junior knee pad ideal beginner recreational player cotton sleeve segment foam construction comfort flexibility fun animal print size fit junior,11
2,152967,g asd replacement cutter aluminum amp carbon use aluminum carbon shafts replace individual cutter use clean andor deburr face insert shaft ensure arrowhead spine true v shape slot accept diameter shaft,2
3,4541,mtech usa mt tactical folding knife inch close mtech linerlock closed stainless guthook blade black finish dual thumb stud pakkawood handle stainless pocket clip,2
4,142062,nhl pittsburgh penguins game day black pro shape flat brim flex cap txz,7



Test data after cleaning:


Unnamed: 0,product_id,description,label
0,56218,katz hoodie volleyball,0
1,42346,vz grip operator ii standard size gun grip usa nearly indestructible g basic design original operator replace large aggressive checkering strap aggressive golf ball pattern add thumb recess magazine release functional grip tournament shooter tactical high speed environment buying online want confident company buy exactly feel purchase vz grip lead manufacturer fine grip planet beloved weapon writeup big gun magazine world value customer stand product hard mean thing product company real deal browse product buy confidence know high quality grip firearm world class customer service firearm deserve,2
2,215842,tough flat leather hobble tough flat leather hobble ride horse hobble double stitch nickel hardware,15
3,36062,buzzrack skipper bicycle suv hatchback wheel cradle trunk bike rack buzzrack skipper wheel mount bicycle rack carry bike attach car suv trunk point tie system entire car bicycle rack flexible degree adjustable place maximize form fit vehicle trunk adjustable center support allow stability bicycle transport additional feature car bicycle rack include adjustable wheel mount red safety strap tie strap design work center support secure locknut assembly system quick setup car bicycle rack complete durable black paint coat finish year manufacturer warranty,1
4,188250,cuisinart cgg allfood btu portable outdoor tabletop propane gas grill,10


## 4. Creating Graph and Extracting Node2Vec Embeddings


In [8]:
# Create a graph from the edge list
G = nx.from_pandas_edgelist(edges_df, 'source', 'target')

# Print basic information about the graph
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
print(f"Network Density: {nx.density(G):.6f}")

# Check if the graph is connected
is_connected = nx.is_connected(G)
print(f"Is the graph connected? {is_connected}")

if not is_connected:
    # Get the largest connected component
    largest_cc = max(nx.connected_components(G), key=len)
    largest_cc_subgraph = G.subgraph(largest_cc)
    print(f"\nLargest Connected Component:")
    print(f"  Nodes: {largest_cc_subgraph.number_of_nodes()}")
    print(f"  Edges: {largest_cc_subgraph.number_of_edges()}")
    print(f"  Percentage of total nodes: {largest_cc_subgraph.number_of_nodes() / G.number_of_nodes() * 100:.2f}%")


Number of nodes: 276453
Number of edges: 1811087
Network Density: 0.000047
Is the graph connected? False

Largest Connected Component:
  Nodes: 273012
  Edges: 1808230
  Percentage of total nodes: 98.76%


In [9]:
# Get the list of product IDs from train and test sets
train_product_ids = train_df['product_id'].tolist()
test_product_ids = test_df['product_id'].tolist()

# Extract graph features for training and testing sets
print("Extracting graph features for training set...")
train_graph_features = extract_graph_features(G, train_product_ids)

print("\nExtracting graph features for testing set...")
test_graph_features = extract_graph_features(G, test_product_ids)

# Fill missing values with 0 if any
train_graph_features = train_graph_features.fillna(0)
test_graph_features = test_graph_features.fillna(0)

# Scale the features
graph_scaler = StandardScaler()
train_graph_features_scaled = graph_scaler.fit_transform(train_graph_features)
test_graph_features_scaled = graph_scaler.transform(test_graph_features)


Extracting graph features for training set...
Calculating degree centrality...
Calculating clustering coefficient...
Calculating PageRank...
Calculating triangle count...
Calculating degree...

Extracting graph features for testing set...
Calculating degree centrality...
Calculating clustering coefficient...
Calculating PageRank...
Calculating triangle count...
Calculating degree...


In [10]:
# Generate Node2Vec embeddings
print("Generating Node2Vec embeddings...")

try:
    # Convert NetworkX graph to node2vec format
    # First, ensure all nodes are strings for compatibility
    G_node2vec = nx.Graph()
    for edge in G.edges():
        G_node2vec.add_edge(str(edge[0]), str(edge[1]))

    # Initialize node2vec model
    node2vec_model = node2vec.Node2Vec(
        G_node2vec,
        dimensions=64,  # Embedding dimension
        walk_length=7,  # Length of each random walk
        num_walks=7,    # Number of random walks per node
        workers=1       # Number of parallel workers
    )

    # Train the model
    print("Training Node2Vec model...")
    n2v_model = node2vec_model.fit(
        window=10,       # Context size for optimization
        min_count=1,     # Minimum count of node occurrences
        batch_words=4    # Number of words per batch
    )

    # Generate embeddings for train and test nodes
    train_node2vec_features = np.zeros((len(train_product_ids), 64))
    test_node2vec_features = np.zeros((len(test_product_ids), 64))

    # Extract embeddings for training nodes
    for i, node_id in enumerate(train_product_ids):
        try:
            train_node2vec_features[i] = n2v_model.wv[str(node_id)]
        except KeyError:
            # If node not in embeddings, use zeros
            pass

    # Extract embeddings for testing nodes
    for i, node_id in enumerate(test_product_ids):
        try:
            test_node2vec_features[i] = n2v_model.wv[str(node_id)]
        except KeyError:
            # If node not in embeddings, use zeros
            pass

    print(f"Node2Vec embeddings shape - Train: {train_node2vec_features.shape}, Test: {test_node2vec_features.shape}")

    # Scale the embeddings
    n2v_scaler = StandardScaler()
    train_node2vec_scaled = n2v_scaler.fit_transform(train_node2vec_features)
    test_node2vec_scaled = n2v_scaler.transform(test_node2vec_features)

except Exception as e:
    print(f"Error generating Node2Vec embeddings: {e}")
    print("Skipping Node2Vec embeddings...")


Generating Node2Vec embeddings...


Computing transition probabilities:   0%|          | 0/276453 [00:00<?, ?it/s]

Generating walks (CPU: 1): 100%|██████████| 7/7 [01:18<00:00, 11.22s/it]


Training Node2Vec model...
Node2Vec embeddings shape - Train: (145603, 64), Test: (36401, 64)


## 5. Extracting TF-IDF Features from Text


In [11]:
# Prepare text data
X_train_text = train_df['description']
y_train = train_df['label']

X_test_text = test_df['description']
y_test = test_df['label']

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.95)

# Fit and transform the text data to get the TF-IDF matrix
X_tfidf_train = tfidf_vectorizer.fit_transform(X_train_text)
X_tfidf_test = tfidf_vectorizer.transform(X_test_text)

print(f"TF-IDF features shape - Train: {X_tfidf_train.shape}, Test: {X_tfidf_test.shape}")


TF-IDF features shape - Train: (145603, 261816), Test: (36401, 261816)


## 6. Creating Price Features


In [12]:
# Create price features for training and testing sets
print("Creating price features for training set...")
train_price_features = create_price_features(train_product_ids, price_df)

print("Creating price features for testing set...")
test_price_features = create_price_features(test_product_ids, price_df)

# Display the first few rows of price features
print("\nTrain price features sample:")
print(train_price_features.head())

print("\nTest price features sample:")
print(test_price_features.head())

# Scale the price features (except binary features)
price_scaler = StandardScaler()
price_columns_to_scale = ['price', 'price_log', 'price_rank', 'price_zscore']
binary_columns = ['price_0_10', 'price_10_100', 'price_100_plus']

# Scale the selected columns
train_price_scaled = train_price_features.copy()
test_price_scaled = test_price_features.copy()

train_price_scaled[price_columns_to_scale] = price_scaler.fit_transform(train_price_features[price_columns_to_scale])
test_price_scaled[price_columns_to_scale] = price_scaler.transform(test_price_features[price_columns_to_scale])

# Convert to numpy arrays for easier handling
train_price_features_array = train_price_scaled.values
test_price_features_array = test_price_scaled.values

print(f"Price features shape - Train: {train_price_features_array.shape}, Test: {test_price_features_array.shape}")


Creating price features for training set...
Creating price features for testing set...

Train price features sample:
        price  price_0_10  price_10_100  price_100_plus  price_log  \
114704  43.20           0             1               0   3.788725   
250731  24.99           0             1               0   3.257712   
152967  22.95           0             1               0   3.175968   
4541     8.49           1             0               0   2.250239   
142062  24.99           0             1               0   3.257712   

        price_rank  price_zscore  
114704    0.760266     -0.130318  
250731    0.495769     -0.326493  
152967    0.339856     -0.348470  
4541      0.095352     -0.504247  
142062    0.495769     -0.326493  

Test price features sample:
         price  price_0_10  price_10_100  price_100_plus  price_log  \
56218    11.99           0             1               0   2.564180   
42346    65.00           0             1               0   4.189655   
215842   2

## 7. Combining Features


In [13]:
# Convert node2vec embeddings to sparse for efficient concatenation
print("Converting node2vec embeddings to sparse format...")
train_node2vec_sparse = csr_matrix(train_node2vec_scaled)
test_node2vec_sparse = csr_matrix(test_node2vec_scaled)

# Convert price features to sparse
print("Converting price features to sparse format...")
train_price_sparse = csr_matrix(train_price_features_array)
test_price_sparse = csr_matrix(test_price_features_array)

# Combine all features: TF-IDF + node2vec embeddings + price features
print("Combining all features...")
X_combined_train = hstack([X_tfidf_train, train_node2vec_sparse, train_price_sparse], format='csr')
X_combined_test = hstack([X_tfidf_test, test_node2vec_sparse, test_price_sparse], format='csr')

# Output memory usage information
print(f"Final combined features shape - Train: {X_combined_train.shape}, Test: {X_combined_test.shape}")
print("Memory usage (approximate):")
print(f"  - X_combined_train: {X_combined_train.data.nbytes / (1024 ** 2):.2f} MB")
print(f"  - X_combined_test: {X_combined_test.data.nbytes / (1024 ** 2):.2f} MB")


Converting node2vec embeddings to sparse format...
Converting price features to sparse format...
Combining all features...
Final combined features shape - Train: (145603, 261887), Test: (36401, 261887)
Memory usage (approximate):
  - X_combined_train: 139.05 MB
  - X_combined_test: 34.26 MB


## 8. Model Training and Evaluation


### 8.1 LinearSVC


In [14]:
# Train LinearSVC model
print("Training LinearSVC model...")
base_svm = LinearSVC(max_iter=5000)
svm_model = CalibratedClassifierCV(base_svm, cv=5)  # Platt scaling for probability estimates

svm_model.fit(X_combined_train, y_train)

# Make predictions
y_pred_svm = svm_model.predict(X_combined_test)
y_proba_svm = svm_model.predict_proba(X_combined_test)

# Evaluate
svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_log_loss = multiclass_log_loss(y_test, y_proba_svm)

print(f"LinearSVC Accuracy: {svm_accuracy:.4f}")
print(f"LinearSVC Log Loss: {svm_log_loss:.4f}")
print("\nLinearSVC Classification Report:")
print(classification_report(y_test, y_pred_svm))


Training LinearSVC model...
LinearSVC Accuracy: 0.9427
LinearSVC Log Loss: 0.2207

LinearSVC Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      3033
           1       0.92      0.92      0.92      2372
           2       0.93      0.96      0.94      8652
           3       0.96      0.97      0.96      1073
           4       0.94      0.96      0.95      3016
           5       0.98      0.96      0.97      3564
           6       0.96      0.95      0.96      1519
           7       0.97      0.94      0.95      3752
           8       0.97      0.97      0.97      1316
           9       0.96      0.95      0.95       903
          10       0.89      0.88      0.89      3589
          11       0.93      0.93      0.93      1425
          12       0.90      0.83      0.87      1318
          13       0.95      0.87      0.91       323
          14       0.97      0.97      0.97       226
          15       

### 8.2 Random Forest


In [None]:
# Train Random Forest model
print("Training Random Forest model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_combined_train, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_combined_test)
y_proba_rf = rf_model.predict_proba(X_combined_test)

# Evaluate
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_log_loss = multiclass_log_loss(y_test, y_proba_rf)

print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Random Forest Log Loss: {rf_log_loss:.4f}")
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))


### 8.3 XGBoost


In [None]:
# Train XGBoost model
print("Training XGBoost model...")
xgb_model = xgb.XGBClassifier(
    objective='multi:softprob',  # Needed for multiclass probability output
    num_class=len(np.unique(y_train)),  # Set number of classes
    eval_metric='mlogloss',  # Use multiclass log loss as eval metric
    n_estimators=75,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    n_jobs=4
)

xgb_model.fit(X_combined_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_combined_test)
y_proba_xgb = xgb_model.predict_proba(X_combined_test)

# Evaluate
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
xgb_log_loss = multiclass_log_loss(y_test, y_proba_xgb)

print(f"XGBoost Accuracy: {xgb_accuracy:.4f}")
print(f"XGBoost Log Loss: {xgb_log_loss:.4f}")
print("\nXGBoost Classification Report:")
print(classification_report(y_test, y_pred_xgb))


Training XGBoost model...


### 8.4 Neural Network


In [None]:
# Convert sparse matrix to dense for neural network
X_combined_train_dense = X_combined_train.toarray()
X_combined_test_dense = X_combined_test.toarray()

# Get number of classes
num_classes = len(np.unique(y_train))

# Build neural network model
print("Building Neural Network model...")
nn_model = Sequential([
    Dense(256, activation='relu', input_shape=(X_combined_train_dense.shape[1],)),
    Dropout(0.4),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(num_classes, activation='softmax')
])

# Compile the model
nn_model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Define early stopping
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

# Train the model
print("Training Neural Network model...")
history = nn_model.fit(
    X_combined_train_dense, y_train,
    epochs=30,
    batch_size=64,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

# Evaluate the model
y_pred_nn = nn_model.predict(X_combined_test_dense)
y_pred_nn_classes = np.argmax(y_pred_nn, axis=1)

# Calculate metrics
nn_accuracy = accuracy_score(y_test, y_pred_nn_classes)
nn_log_loss = multiclass_log_loss(y_test, y_pred_nn)

print(f"Neural Network Accuracy: {nn_accuracy:.4f}")
print(f"Neural Network Log Loss: {nn_log_loss:.4f}")
print("\nNeural Network Classification Report:")
print(classification_report(y_test, y_pred_nn_classes))


## 9. Results Comparison


In [None]:
# Create a DataFrame to compare model performances
results = pd.DataFrame({
    'Model': ['LinearSVC', 'Random Forest', 'XGBoost', 'Neural Network'],
    'Accuracy': [svm_accuracy, rf_accuracy, xgb_accuracy, nn_accuracy],
    'Log Loss': [svm_log_loss, rf_log_loss, xgb_log_loss, nn_log_loss]
})

# Sort by Log Loss (lower is better)
results = results.sort_values('Log Loss')

# Display the results
print("Model Performance Comparison:")
display(results)

# Plot the results
plt.figure(figsize=(12, 6))

# Accuracy plot
plt.subplot(1, 2, 1)
plt.bar(results['Model'], results['Accuracy'], color='skyblue')
plt.title('Model Accuracy Comparison')
plt.ylabel('Accuracy')
plt.ylim(0, 1)
plt.xticks(rotation=45)

# Log Loss plot
plt.subplot(1, 2, 2)
plt.bar(results['Model'], results['Log Loss'], color='salmon')
plt.title('Model Log Loss Comparison')
plt.ylabel('Log Loss (lower is better)')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


## 10. Feature Importance Analysis


In [None]:
# Analyze feature importance from the Random Forest model
if hasattr(rf_model, 'feature_importances_'):
    # Get feature importances
    importances = rf_model.feature_importances_
    
    # Create feature names
    feature_names = []
    
    # TF-IDF feature names (top 20 most important)
    tfidf_feature_indices = np.argsort(importances[:X_tfidf_train.shape[1]])[-20:]
    tfidf_features = {i: name for i, name in enumerate(tfidf_vectorizer.get_feature_names_out())}
    
    # Node2Vec feature names
    node2vec_features = {i + X_tfidf_train.shape[1]: f"node2vec_{i}" 
                         for i in range(train_node2vec_features.shape[1])}
    
    # Price feature names
    price_feature_names = ['price', 'price_0_10', 'price_10_100', 'price_100_plus', 
                           'price_log', 'price_rank', 'price_zscore']
    price_features = {i + X_tfidf_train.shape[1] + train_node2vec_features.shape[1]: name 
                      for i, name in enumerate(price_feature_names)}
    
    # Combine all feature names
    all_features = {**tfidf_features, **node2vec_features, **price_features}
    
    # Get top 30 features by importance
    indices = np.argsort(importances)[-30:]
    top_features = [(all_features.get(i, f"feature_{i}"), importances[i]) for i in indices]
    
    # Create DataFrame for visualization
    importance_df = pd.DataFrame(top_features, columns=['Feature', 'Importance'])
    importance_df = importance_df.sort_values('Importance', ascending=False)
    
    # Plot feature importances
    plt.figure(figsize=(12, 8))
    plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title('Top 30 Feature Importances')
    plt.tight_layout()
    plt.show()
    
    # Analyze price feature importance specifically
    price_importance = []
    for i, name in price_features.items():
        if i < len(importances):
            price_importance.append((name, importances[i]))
    
    price_importance_df = pd.DataFrame(price_importance, columns=['Feature', 'Importance'])
    price_importance_df = price_importance_df.sort_values('Importance', ascending=False)
    
    print("Price Feature Importances:")
    display(price_importance_df)
    
    # Plot price feature importances
    plt.figure(figsize=(10, 6))
    plt.barh(price_importance_df['Feature'], price_importance_df['Importance'], color='salmon')
    plt.xlabel('Importance')
    plt.ylabel('Price Feature')
    plt.title('Price Feature Importances')
    plt.tight_layout()
    plt.show()


In [25]:
# Load product IDs from test.txt
print("Loading product IDs from test.txt...")
test_products = []
with open("test.txt", "r") as f:
    for line in f:
        # Remove trailing comma and any whitespace
        product_id = line.strip().rstrip(',')
        if product_id:  # Skip empty lines
            test_products.append(int(product_id))

print(f"Loaded {len(test_products)} product IDs from test.txt")

# CREATE PRICE FEATURES FOR TEST PRODUCTS
print("\nCreating price features for test products...")
test_price_features_final = create_price_features(test_products, price_df)

# Scale the price features using the same scaler that was fitted on training data
print("Scaling price features for test products...")
test_price_scaled_final = test_price_features_final.copy()
test_price_scaled_final[price_columns_to_scale] = price_scaler.transform(test_price_features_final[price_columns_to_scale])

# Convert to numpy array
test_price_features_final_array = test_price_scaled_final.values
print(f"Price features shape for test products: {test_price_features_final_array.shape}")

# # Extract graph features for test products
# print("\nExtracting graph features for test products...")
# test_graph_features_final = extract_graph_features(G, test_products)
# test_graph_features_final = test_graph_features_final.fillna(0)
# test_graph_features_final_scaled = graph_scaler.transform(test_graph_features_final)

# Generate Node2Vec embeddings for test products
print("\nGenerating Node2Vec embeddings for test products...")
test_node2vec_features_final = np.zeros((len(test_products), 64))

# Extract embeddings for test products
for i, node_id in enumerate(test_products):
    try:
        test_node2vec_features_final[i] = n2v_model.wv[str(node_id)]
    except KeyError:
        # If node not in embeddings, use zeros
        pass

print(f"Node2Vec embeddings shape for test products: {test_node2vec_features_final.shape}")

# Scale the embeddings
test_node2vec_final_scaled = n2v_scaler.transform(test_node2vec_features_final)

# Load descriptions for test products
print("\nLoading descriptions for test products...")
descriptions = {}
try:
    with open("data_files/description_part_1.txt", "r") as f:
        for line in f:
            parts = line.split('|=|')
            if len(parts) >= 2:
                product_id = int(parts[0])
                description = parts[1].strip()
                descriptions[product_id] = description

    with open("data_files/description_part_2.txt", "r") as f:
        for line in f:
            parts = line.split('|=|')
            if len(parts) >= 2:
                product_id = int(parts[0])
                description = parts[1].strip()
                descriptions[product_id] = description
except Exception as e:
    print(f"Error loading descriptions: {e}")
    print("Using empty descriptions for missing products")

# Get descriptions for test products
test_descriptions = []
for product_id in test_products:
    test_descriptions.append(descriptions.get(product_id, ""))

# Extract TF-IDF features for test descriptions
print("\nExtracting TF-IDF features for test descriptions...")
X_tfidf_test_final = tfidf_vectorizer.transform(test_descriptions)
print(f"TF-IDF features shape for test products: {X_tfidf_test_final.shape}")

X_text_test_final = hstack([X_tfidf_test_final])

# COMBINE ALL FEATURES INCLUDING PRICE FEATURES
print("\nCombining all features including price features...")
X_combined_test_final = hstack([
    X_tfidf_test_final,
    # test_graph_features_final_scaled,
    test_node2vec_final_scaled,
    test_price_features_final_array
])
print(f"Combined features shape for test products (with price): {X_combined_test_final.shape}")

# Make predictions using the model
print("\nMaking predictions for test products...")
# test_pred_proba = svm_model.predict_proba(X_combined_test_final)
test_pred_proba = xgb_model.predict_proba(X_combined_test_final)

# Create a DataFrame with the predictions
print("\nCreating CSV with predictions...")
predictions_df = pd.DataFrame()
predictions_df['product'] = test_products

# Add probability columns for each class
for i in range(len(np.unique(y_train))):
    predictions_df[f'class{i}'] = test_pred_proba[:, i].round(4)

# Save predictions to CSV
# predictions_df.to_csv('svm_predictions.csv', index=False)
predictions_df.to_csv('xgb_predictions.csv', index=False)
print(f"Predictions saved to predictions.csv")

# Display the first few rows of the predictions
print("\nSample of predictions:")
print(predictions_df.head())

# Print feature dimensions for verification
print(f"\nFeature dimensions:")
print(f"Text features: {X_text_test_final.shape}")
print(f"Graph features: {test_graph_features_final_scaled.shape}")
print(f"Node2Vec features: {test_node2vec_final_scaled.shape}")
print(f"Price features: {test_price_features_final_array.shape}")
print(f"Total combined: {X_combined_test_final.shape}")

Loading product IDs from test.txt...
Loaded 45502 product IDs from test.txt

Creating price features for test products...
Scaling price features for test products...
Price features shape for test products: (45502, 7)

Generating Node2Vec embeddings for test products...
Node2Vec embeddings shape for test products: (45502, 64)

Loading descriptions for test products...

Extracting TF-IDF features for test descriptions...
TF-IDF features shape for test products: (45502, 261816)

Combining all features including price features...
Combined features shape for test products (with price): (45502, 261887)

Making predictions for test products...

Creating CSV with predictions...
Predictions saved to predictions.csv

Sample of predictions:
   product  class0  class1  class2  class3  class4  class5  class6  class7  \
0    49957  0.0006  0.0147  0.0469  0.0007  0.0059  0.1242  0.0230  0.0003   
1   135386  0.0000  0.0040  0.0000  0.0011  0.0121  0.0074  0.0001  0.0080   
2   226880  0.0000  0.0016

In [21]:
# Check individual feature components for TRAINING
print("\nTRAINING SET:")
print(f"Text features (X_text_train): {X_text_test_final.shape}")
print(f"Graph features (train_graph_features_scaled): {train_graph_features_scaled.shape}")
print(f"Node2Vec features (train_node2vec_scaled): {train_node2vec_scaled.shape}")
print(f"Price features (train_price_features_array): {train_price_features_array.shape}")
print(f"TOTAL TRAINING: {X_combined_train.shape}")

# Check individual feature components for TEST (final prediction)
print("\nFINAL TEST SET:")
print(f"Text features (X_text_test_final): {X_text_test_final.shape}")
print(f"Graph features (test_graph_features_final_scaled): {test_graph_features_final_scaled.shape}")
print(f"Node2Vec features (test_node2vec_final_scaled): {test_node2vec_final_scaled.shape}")
print(f"Price features (test_price_features_final_array): {test_price_features_final_array.shape}")


TRAINING SET:
Text features (X_text_train): (45502, 261816)
Graph features (train_graph_features_scaled): (145603, 5)
Node2Vec features (train_node2vec_scaled): (145603, 64)
Price features (train_price_features_array): (145603, 7)
TOTAL TRAINING: (145603, 261887)

FINAL TEST SET:
Text features (X_text_test_final): (45502, 261816)
Graph features (test_graph_features_final_scaled): (45502, 5)
Node2Vec features (test_node2vec_final_scaled): (45502, 64)
Price features (test_price_features_final_array): (45502, 7)


## 11. Conclusion


In this notebook, we implemented product classification using:
1. Text-based features: TF-IDF from product descriptions
2. Graph-based features: Node2Vec embeddings
3. Price-based features: Price buckets and statistical features

Key findings:
1. The combination of TF-IDF, graph embeddings, and price features provides good classification performance.
2. Price features contribute valuable information to the classification task, especially the price buckets and z-score.
3. The best performing model based on log loss was [BEST_MODEL], which achieved an accuracy of [BEST_ACCURACY].
4. Feature importance analysis shows that [FEATURE_TYPE] features are most important for classification.

This approach demonstrates that we can achieve good classification performance without using sentence embeddings, by leveraging graph structure and price information alongside TF-IDF features.