
=== Imports ===

In [1]:
import kagglehub
import pandas as pd
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

=== 1. Dataset Download ===

In [2]:
# Download the latest version of the dataset
print("Downloading dataset...")
path = kagglehub.dataset_download("vivek468/superstore-dataset-final")
print(f"Dataset downloaded to: {path}")
csv_file_path = os.path.join(path, "Sample - Superstore.csv")
print(f"Reading data from: {csv_file_path}")

Downloading dataset...
Dataset downloaded to: /kaggle/input/superstore-dataset-final
Reading data from: /kaggle/input/superstore-dataset-final/Sample - Superstore.csv
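
The cell above hardcodes the CSV file name. As a hedged alternative, the sketch below searches a directory for CSV files instead. `find_csv_files` is a hypothetical helper (not part of kagglehub), and it is demonstrated on a temporary directory rather than on the real download.

```python
import os
import tempfile


def find_csv_files(root):
    """Return the sorted paths of all .csv files under root (hypothetical helper)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".csv"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


# Demonstrate on a throwaway directory that mimics a downloaded dataset folder.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "Sample - Superstore.csv"), "w").close()
    open(os.path.join(demo_dir, "notes.txt"), "w").close()
    found_names = [os.path.basename(p) for p in find_csv_files(demo_dir)]

print(found_names)
```

In the notebook, the same helper could be called on `path` and the single match passed to `pd.read_csv`, assuming the dataset ships exactly one CSV file.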


=== 2. Load & Clean Data ===

In [3]:
try:
    superstore_data = pd.read_csv(csv_file_path, encoding='ISO-8859-1')
    print("Data loaded successfully.")
except FileNotFoundError:
    print(f"ERROR: File not found at {csv_file_path}.")
    raise  # Stop here; every later cell depends on superstore_data

Data loaded successfully.


In [4]:
# Keep only the columns used downstream; copy so later column assignments act on an independent frame
columns_to_keep = ['Order ID', 'Order Date', 'Ship Date', 'Customer ID', 'Country', 'City', 'State', 'Postal Code', 'Product ID', 'Product Name', 'Sales', 'Quantity', 'Category', 'Sub-Category']
superstore_data = superstore_data[columns_to_keep].copy()

In [None]:
# Display the first 5 rows to check the data
print("First 5 rows of data:")
print(superstore_data.head())

First 5 rows of data:
         Order ID  Order Date   Ship Date Customer ID        Country  \
0  CA-2016-152156   11/8/2016  11/11/2016    CG-12520  United States   
1  CA-2016-152156   11/8/2016  11/11/2016    CG-12520  United States   
2  CA-2016-138688   6/12/2016   6/16/2016    DV-13045  United States   
3  US-2015-108966  10/11/2015  10/18/2015    SO-20335  United States   
4  US-2015-108966  10/11/2015  10/18/2015    SO-20335  United States   

              City       State  Postal Code       Product ID  \
0        Henderson    Kentucky        42420  FUR-BO-10001798   
1        Henderson    Kentucky        42420  FUR-CH-10000454   
2      Los Angeles  California        90036  OFF-LA-10000240   
3  Fort Lauderdale     Florida        33311  FUR-TA-10000577   
4  Fort Lauderdale     Florida        33311  OFF-ST-10000760   

                                        Product Name     Sales  Quantity  \
0                  Bush Somerset Collection Bookcase  261.9600         2   
1  Hon D

In [5]:
# Convert dates
superstore_data['Order Date'] = pd.to_datetime(superstore_data['Order Date'])
superstore_data['Ship Date'] = pd.to_datetime(superstore_data['Ship Date'])

In [6]:
# Drop rows that are missing a value in any required column
superstore_data.dropna(subset=columns_to_keep, inplace=True)
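
The date conversion above lets pandas infer the format. The sketch below shows one possible stricter variant on made-up rows: an explicit `%m/%d/%Y` format (an assumption based on the sample output, e.g. `11/8/2016`) together with `errors='coerce'`, so that unparseable values become `NaT` and are then removed by `dropna`.

```python
import pandas as pd

# Made-up rows for illustration; the last date is deliberately malformed.
sample = pd.DataFrame({
    "Order Date": ["11/8/2016", "6/12/2016", "not a date"],
    "Quantity": [2, 3, 1],
})

# Parse with an explicit month/day/year format; bad values turn into NaT.
sample["Order Date"] = pd.to_datetime(
    sample["Order Date"], format="%m/%d/%Y", errors="coerce"
)

# Rows whose date could not be parsed are dropped like any other missing value.
cleaned = sample.dropna(subset=["Order Date"])
print(cleaned)
```

Whether to coerce or to fail loudly on a bad date is a design choice; coercing keeps the pipeline running but silently discards rows.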

=== 3. Precomputation ===

In [7]:
# 1. Product Popularity
product_popularity = superstore_data.groupby('Product ID').agg({
    'Product Name': 'first',
    'Category': 'first',
    'Sub-Category': 'first',
    'Quantity': 'sum',
    'Sales': 'sum'
}).reset_index()

# Normalize popularity score
product_popularity['popularity_score'] = product_popularity['Quantity'] / product_popularity['Quantity'].max()

# 2. Content-Based Info Preparation
superstore_data['product_info'] = (
    superstore_data['Product Name'].astype(str) + ' ' +
    superstore_data['Category'].astype(str) + ' ' +
    superstore_data['Sub-Category'].astype(str)
)

# One row per product; reset the index so row labels match row positions in the TF-IDF matrix
products = superstore_data.drop_duplicates(subset='Product ID')[
    ['Product ID', 'Product Name', 'Category', 'Sub-Category', 'product_info']
].reset_index(drop=True)

# 3. TF-IDF Matrix and Cosine Similarity
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(products['product_info'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 4. Product Index Mapping (Product ID -> row position in products and cosine_sim)
product_indices = pd.Series(products.index, index=products['Product ID'])

# 5. User-Product Matrix and Product Similarity for Collaborative Filtering
user_product_matrix = superstore_data.pivot_table(
    index='Customer ID',
    columns='Product ID',
    values='Quantity',
    aggfunc='sum'
).fillna(0)

product_similarity = cosine_similarity(user_product_matrix.T)
product_similarity_df = pd.DataFrame(
    product_similarity,
    index=user_product_matrix.columns,
    columns=user_product_matrix.columns
)
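
To make the content-based part of the precomputation concrete, here is a minimal sketch on a made-up three-product catalogue (the IDs and names are invented). It also illustrates why the Product ID mapping should hold row positions: the similarity matrix is addressed by position, not by the original DataFrame labels.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue; the index is deliberately non-contiguous, as it would be
# after drop_duplicates on a larger transaction table, and is then reset.
toy_products = pd.DataFrame(
    {
        "Product ID": ["P-1", "P-2", "P-3"],
        "product_info": [
            "Red Office Chair Furniture Chairs",
            "Blue Office Chair Furniture Chairs",
            "Laser Printer Paper Office Supplies Paper",
        ],
    },
    index=[0, 7, 42],
).reset_index(drop=True)

# TF-IDF vectors and pairwise cosine similarity between products.
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(
    toy_products["product_info"]
)
toy_sim = cosine_similarity(toy_tfidf)

# Product ID -> row position in toy_products and toy_sim.
toy_positions = pd.Series(toy_products.index, index=toy_products["Product ID"])

# Rank all other products by similarity to P-1.
query_pos = toy_positions["P-1"]
ranked = [pos for pos in toy_sim[query_pos].argsort()[::-1] if pos != query_pos]
most_similar_id = toy_products.iloc[ranked[0]]["Product ID"]
print(most_similar_id)
```

The two chairs share most of their terms, so P-2 ranks above the paper product for a P-1 query.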

=== 4. Recommendation Functions ===

In [14]:
# === Helper Functions ===
def get_customer_data(customer_id, df):
    """Fetches purchase data and unique purchased product IDs for a customer."""
    customer_data = df[df['Customer ID'] == customer_id].copy()
    purchased_ids = customer_data['Product ID'].unique()
    return customer_data, purchased_ids

def get_unseen_products(customer_id, df, product_df):
    """Get the products a customer has not purchased yet.

    Args:
        customer_id (str): ID of the target customer.
        df (pd.DataFrame): Full transaction data (e.g., superstore_data).
        product_df (pd.DataFrame): All products to recommend from (e.g., product_popularity).

    Returns:
        tuple: (product_df filtered to unseen products, array of purchased
               Product IDs for use in the fallback logic).
    """
    purchased = df[df['Customer ID'] == customer_id]['Product ID'].unique()
    return product_df[~product_df['Product ID'].isin(purchased)], purchased

def add_fallback_if_needed(recommendations, purchased, product_df, n, by):
    """Top up the recommendations with globally popular products when fewer than n are available.

    Args:
        recommendations (pd.DataFrame): Unseen products, already ranked.
        purchased: Product IDs the customer has already purchased.
        product_df (pd.DataFrame): Global product list (e.g., product_popularity).
        n (int): Number of products we want to recommend.
        by (str): Popularity metric ('Quantity' or 'Sales').

    Returns:
        pd.DataFrame: The recommendations, topped up from product_df when needed.
                      The caller trims the result to n rows.
    """
    if len(recommendations) < n:
        print(f"Customer has only {len(recommendations)} new products available. Adding globally popular items.")
        # Rank the whole product list so enough unseen items remain after filtering
        fallback = product_df.sort_values(by=by, ascending=False)
        fallback = fallback[~fallback['Product ID'].isin(purchased)]
        recommendations = pd.concat([recommendations, fallback]).drop_duplicates('Product ID')
    return recommendations

# Get the most common categories and sub-categories for a customer
# This is used in personalized popularity-based recommendation
# Returns two lists: top categories and top sub-categories based on purchase frequency
def get_customer_preferences(customer_id, df):
    customer_data = df[df['Customer ID'] == customer_id]
    if customer_data.empty:
        return [], []
    top_categories = customer_data['Category'].value_counts().index.tolist()
    top_subcategories = customer_data['Sub-Category'].value_counts().index.tolist()
    return top_categories, top_subcategories




# === Recommendation Functions ===
def recommend_popular_products(n=10, by='Quantity'):
    """
    Recommends the top-N globally popular products.

    Sorts all products by the given metric ('Quantity' or 'Sales') and returns
    the top N. Does not consider customer history.

    Args:
        n (int, optional): The number of products to recommend. Defaults to 10.
        by (str, optional): The metric to sort popularity by ('Quantity' or 'Sales'). Defaults to 'Quantity'.

    Returns:
        pd.DataFrame: The top N popular products with columns
                      ['Product ID', 'Product Name', 'Category', 'Sub-Category', <by>].

    Raises:
        ValueError: If 'by' is not 'Quantity' or 'Sales'.
    """
    if by not in ['Quantity', 'Sales']:
        raise ValueError("Parameter 'by' must be either 'Quantity' or 'Sales'")

    return product_popularity.sort_values(by=by, ascending=False).head(n)[['Product ID', 'Product Name', 'Category', 'Sub-Category', by]]

def recommend_popular(customer_id=None, personalized=False, n=10, by='Quantity'):
    """Recommends popular products, optionally personalized for a customer.

    Modes:
    1. Global: If customer_id is None, returns globally popular products.
    2. Unseen for Customer: If customer_id is provided and personalized=False,
       returns globally popular products not yet purchased by the customer.
    3. Personalized Popular: If customer_id is provided and personalized=True,
       returns popular products from the customer's preferred categories/sub-categories
       that they haven't purchased yet.

    Includes fallback to global popular items if not enough personalized/unseen
    items are found.

    Args:
        customer_id (str, optional): The ID of the customer. Defaults to None.
        personalized (bool, optional): Whether to filter by customer preferences.
                                       Defaults to False. Ignored if customer_id is None.
        n (int, optional): The number of products to recommend. Defaults to 10.
        by (str, optional): The metric for popularity ('Quantity' or 'Sales').
                            Defaults to 'Quantity'.

    Returns:
        pd.DataFrame: A DataFrame containing the recommended products with columns
                      ['Product ID', 'Product Name', 'Category', 'Sub-Category', <metric_used>].
                      Returns global recommendations if customer preferences cannot be determined.
                      May return fewer than n items if insufficient products are available
                      even after fallback.

    Raises:
        ValueError: If 'by' is not 'Quantity' or 'Sales'.
    """
    if by not in ['Quantity', 'Sales']:
        raise ValueError("Parameter 'by' must be either 'Quantity' or 'Sales'")

    if customer_id is None:
        return recommend_popular_products(n, by)

    df = product_popularity

    if personalized:
        top_cats, top_subcats = get_customer_preferences(customer_id, superstore_data)
        if not top_cats or not top_subcats:
            return recommend_popular_products(n, by)
        df = df[(df['Category'].isin(top_cats)) | (df['Sub-Category'].isin(top_subcats))]

    unseen, purchased = get_unseen_products(customer_id, superstore_data, df)
    unseen = unseen.sort_values(by=by, ascending=False)
    final = add_fallback_if_needed(unseen, purchased, product_popularity, n, by)

    return final.head(n)[['Product ID', 'Product Name', 'Category', 'Sub-Category', by]]

def recommend_similar_products(product_id, top_n=5):
    """Recommends products similar to a given product based on content.

    Uses precomputed TF-IDF vectors and cosine similarity based on product
    name, category, and sub-category.

    Args:
        product_id (str): The ID of the product to find similar items for.
        top_n (int, optional): The number of similar products to return.
                               Defaults to 5.

    Returns:
        pd.DataFrame: A DataFrame containing the top_n similar products with
                      columns ['Product Name', 'Category', 'Sub-Category'].
                      Returns an empty DataFrame if the product_id is not found.
    """
    if product_id not in product_indices.index:
        print(f"Product ID '{product_id}' not found in product indices.")
        return pd.DataFrame() # Or an empty list

    idx = product_indices[product_id]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    product_idxs = [i[0] for i in sim_scores]

    return products.iloc[product_idxs][['Product Name', 'Category', 'Sub-Category']]

def recommend_to_customer_content_based(customer_id, top_n=5):
    """Recommends products similar to the last item purchased by a customer.

    Finds the customer's most recent purchase and then uses content-based
    similarity (recommend_similar_products) to find similar items.

    Args:
        customer_id (str): The ID of the customer.
        top_n (int, optional): The number of similar products to recommend.
                               Defaults to 5.

    Returns:
        pd.DataFrame: The recommended products (from recommend_similar_products),
                      or an empty DataFrame if the customer has no purchase history.
    """
    customer_purchases = superstore_data[superstore_data['Customer ID'] == customer_id]
    if customer_purchases.empty:
        print(f"No purchase history for customer '{customer_id}'.")
        return pd.DataFrame() # Or an empty list

    # Get last product bought
    last_purchase = customer_purchases.sort_values('Order Date', ascending=False).iloc[0]
    last_product_id = last_purchase['Product ID']
    last_product_name = last_purchase['Product Name']

    print(f"Based on last product purchased (ID: {last_product_id}): {last_product_name}")
    return recommend_similar_products(last_product_id, top_n)

def recommend_hybrid(customer_id, top_n=5, similarity_weight=0.5, popularity_weight=0.5):
    """Recommends products using a hybrid approach (content similarity + popularity).

    Calculates a weighted score based on the average content similarity to items
    the customer has purchased and the global popularity score of candidate products.
    Excludes items already purchased by the customer.

    Args:
        customer_id (str): The ID of the customer.
        top_n (int, optional): The number of products to recommend. Defaults to 5.
        similarity_weight (float, optional): The weight for the content similarity score.
                                             Defaults to 0.5.
        popularity_weight (float, optional): The weight for the popularity score.
                                             Defaults to 0.5.

    Returns:
        pd.DataFrame: The top_n hybrid recommendations with columns
                      ['Product Name', 'Category', 'Sub-Category', 'hybrid_score'],
                      sorted by hybrid_score. Returns an empty DataFrame if the
                      customer has no history or none of their purchases appear
                      in the product index.
    """
    customer_data = superstore_data[superstore_data['Customer ID'] == customer_id]
    if customer_data.empty:
        print(f"No purchase history for customer '{customer_id}'.")
        return pd.DataFrame()

    purchased_ids = customer_data['Product ID'].unique()
    purchased_idxs = [product_indices[pid] for pid in purchased_ids if pid in product_indices]
    if not purchased_idxs:
        print("No purchased products found in product index.")
        return pd.DataFrame()

    sim_scores = sum(cosine_sim[idx] for idx in purchased_idxs)
    sim_scores = sim_scores / len(purchased_idxs)

    sim_df = pd.DataFrame({
        'product_index': range(len(sim_scores)),
        'similarity_score': sim_scores
    })

    sim_df = sim_df.merge(products.reset_index(), left_on='product_index', right_index=True)
    sim_df = sim_df[~sim_df['Product ID'].isin(purchased_ids)]
    sim_df = sim_df.merge(product_popularity[['Product ID', 'popularity_score']], on='Product ID', how='left')
    sim_df['popularity_score'] = sim_df['popularity_score'].fillna(0)
    sim_df['hybrid_score'] = (
        similarity_weight * sim_df['similarity_score'] +
        popularity_weight * sim_df['popularity_score']
    )

    sim_df = sim_df.sort_values(by='hybrid_score', ascending=False).head(top_n)
    return sim_df[['Product Name', 'Category', 'Sub-Category', 'hybrid_score']]

def recommend_collaborative(product_id, top_n=5):
    """Recommends products similar to a given product using item-item collaborative filtering.

    Uses a precomputed product similarity matrix based on user co-purchase patterns.

    Args:
        product_id (str): The ID of the product to find collaboratively similar items for.
        top_n (int, optional): The number of similar products to return. Defaults to 5.

    Returns:
        pd.DataFrame or str: A DataFrame containing the top_n similar products
                             with columns ['Product ID', 'Similarity Score', 'Product Name',
                             'Category', 'Sub-Category']. Returns a string message if the
                             product_id is not found in the similarity matrix. (Consider
                             changing string returns to an empty DataFrame).
    """

    if product_id not in product_similarity_df.columns:
        print(f"Product {product_id} not found in dataset.")
        return pd.DataFrame() # Or an empty list
    # Drop the query product itself rather than assuming it sorts first (ties at 1.0 are possible)
    similar_scores = product_similarity_df[product_id].drop(labels=product_id).sort_values(ascending=False)

    recommended = similar_scores.head(top_n).reset_index()
    recommended.columns = ['Product ID', 'Similarity Score']
    return recommended.merge(
        product_popularity[['Product ID', 'Product Name', 'Category', 'Sub-Category']],
        on='Product ID', how='left'
    )

def recommend_collaborative_for_customer(customer_id, top_n=5):
    """Recommends products to a customer based on collaborative filtering.

    Aggregates similarity scores from items the customer has purchased to find
    new items that are similar based on co-purchase patterns across all users.
    Excludes items already purchased by the customer.

    Args:
        customer_id (str): The ID of the customer.
        top_n (int, optional): The number of products to recommend. Defaults to 5.

    Returns:
        pd.DataFrame or str: A DataFrame containing the top_n recommended products
                             with columns ['Product ID', 'Product Name', 'Category',
                             'Sub-Category']. Returns a string message if the customer
                             has no history or suitable product data isn't found.
                             (Consider changing string returns to an empty DataFrame).
    """

    customer_data = superstore_data[superstore_data['Customer ID'] == customer_id]
    if customer_data.empty:
        print(f"No purchase history for customer '{customer_id}'.")
        return pd.DataFrame() # Or an empty list

    purchased_ids = customer_data['Product ID'].unique()

    # If user has multiple purchases, accumulate similarity
    sim_scores = None
    for pid in purchased_ids:
        if pid not in product_similarity_df.columns:
            continue
        product_scores = product_similarity_df[pid]
        sim_scores = product_scores if sim_scores is None else sim_scores + product_scores

    if sim_scores is None:
        return f"No valid products found for similarity for customer '{customer_id}'."

    # Normalize if multiple products
    sim_scores = sim_scores / len(purchased_ids)

    # Remove already purchased products
    sim_scores = sim_scores.drop(labels=purchased_ids, errors='ignore')

    # Top N similar products, highest score first
    top_ids = sim_scores.sort_values(ascending=False).head(top_n).index

    # Look up product details by ID so the output keeps the ranking order
    return product_popularity.set_index('Product ID').loc[top_ids].reset_index()[[
        'Product ID', 'Product Name', 'Category', 'Sub-Category'
    ]]
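
The item-item collaborative functions above rely on cosine similarity between product columns of the user-product matrix. The following is a minimal pure-Python sketch of that idea on made-up purchase data; the customer and product names are invented, and the `cosine` helper is written out here for illustration instead of using scikit-learn.

```python
import math

# Made-up purchase quantities: customer -> {product: quantity}.
purchases = {
    "cust_1": {"A": 1, "B": 1},
    "cust_2": {"A": 2, "B": 2},
    "cust_3": {"C": 3},
}
customers = sorted(purchases)
items = sorted({product for basket in purchases.values() for product in basket})

# One vector per product: the quantity each customer bought (0 if none).
item_vectors = {
    item: [purchases[c].get(item, 0) for c in customers] for item in items
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_a_b = cosine(item_vectors["A"], item_vectors["B"])  # bought by the same customers
sim_a_c = cosine(item_vectors["A"], item_vectors["C"])  # no customer in common
print(sim_a_b, sim_a_c)
```

Products A and B are always bought together, so their similarity is close to 1, while A and C share no buyers and score 0. Because cosine similarity ignores vector length, the larger quantities of `cust_2` do not lower the A-B score.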

=== 5. Example Usage ===

In [16]:
print("Top 10 Global Products:")
print(recommend_popular())  # global

print("\nTop 10 Popular Products Not Yet Purchased:")
print(recommend_popular(customer_id='CG-12520'))  # exclude purchased

print("\nTop 10 Popular Products by Customer Preference:")
print(recommend_popular(customer_id='CG-12520', personalized=True))  # personalized

print("\nTop 5 Content-Based Recommendations:")
print(recommend_to_customer_content_based('CG-12520', 5))

print("\nTop 5 Hybrid Recommendations (Popularity + Content-Based):")
print(recommend_hybrid('CG-12520', top_n=5))  # Combines popularity and content similarity

print("\nTop 5 Collaborative Recommendations (based on customer history):")
print(recommend_collaborative_for_customer('CG-12520', top_n=5))

print("\nTop 5 Collaborative Recommendations (based on a product):")
print(recommend_collaborative('FUR-CH-10000454', top_n=5))

Top 10 Global Products:
           Product ID                                       Product Name  \
1569  TEC-AC-10003832                 Logitech P710e Mobile Speakerphone   
1144  OFF-PA-10001970                                         Xerox 1881   
694   OFF-BI-10001524  GBC Premium Transparent Covers with Diagonal L...   
721   OFF-BI-10002026                            Avery Arch Ring Binders   
93    FUR-CH-10002647         Situations Contoured Folding Chairs, 4/Set   
325   FUR-TA-10001095                 Chromcraft Round Conference Tables   
1517  TEC-AC-10002049          Logitech G19 Programmable Gaming Keyboard   
835   OFF-BI-10004728  Wilson Jones Turn Tabs Binder Tool for Ring Bi...   
110   FUR-CH-10003774    Global Wood Trimmed Manager's Task Chair, Khaki   
183   FUR-FU-10001473                            DAX Wood Document Frame   

             Category Sub-Category  Quantity  
1569       Technology  Accessories        75  
1144  Office Supplies        Paper        70  
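
The hybrid recommender called above ranks candidates by a weighted sum of content similarity and popularity. The sketch below works through that arithmetic on two invented candidates to show how the weights change the ordering; the candidate names and scores are made up, and `hybrid_score` and `rank` are illustrative helpers rather than functions from the notebook.

```python
def hybrid_score(similarity, popularity, similarity_weight=0.5, popularity_weight=0.5):
    """Weighted sum of a similarity score and a popularity score (both in [0, 1])."""
    return similarity_weight * similarity + popularity_weight * popularity


# Two invented candidates with opposite strengths.
candidates = {
    "niche_but_similar": {"similarity": 0.9, "popularity": 0.1},
    "popular_but_unrelated": {"similarity": 0.2, "popularity": 1.0},
}


def rank(similarity_weight, popularity_weight):
    """Return candidate names ordered by hybrid score, best first."""
    return sorted(
        candidates,
        key=lambda name: hybrid_score(
            candidates[name]["similarity"],
            candidates[name]["popularity"],
            similarity_weight,
            popularity_weight,
        ),
        reverse=True,
    )


balanced = rank(0.5, 0.5)          # equal weights favor the popular item
similarity_heavy = rank(0.8, 0.2)  # heavier similarity weight favors the similar item
print(balanced, similarity_heavy)
```

With equal weights the popular product wins (0.60 against 0.50); shifting the weight toward similarity reverses the order (0.74 against 0.36).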

In [None]:
# Visualize the top products by quantity sold
import matplotlib.pyplot as plt

top10 = product_popularity.sort_values(by='Quantity', ascending=False).head(10)
plt.figure(figsize=(10,5))
plt.barh(top10['Product Name'], top10['Quantity'], color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel('Total Quantity Sold')
plt.title('Top 10 Most Popular Products')
plt.show()

--- Pre-computation for Recommendations ---

1. Create a unique list of products

2. Create 'product_info' text feature for Content-Based Filtering

3. Calculate TF-IDF Matrix and Cosine Similarity

4. Create a mapping from Product ID to its index in our matrices/products_df

5. Calculate Product Popularity

--- Recommendation Functions ---

1. Popularity-Based Recommendation (Simple)

2. Popularity-Based Recommendation (Personalized with Category)

3. Content-Based Recommendation

4. Hybrid Recommendation (Content Similarity + Popularity)

5. Collaborative Recommendation

6. Hybrid Recommendation 2 (Content Similarity + Popularity + Collaborative)
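
Item 6 in the list above is not implemented in this notebook. As a hypothetical sketch only, one way to combine the three signals is to min-max normalize each one, so that raw quantities and similarity scores share a scale, and then take a weighted sum. The function names, weights, and scores below are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a {product: score} dict to [0, 1]; all zeros if the scores are constant."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {product: 0.0 for product in scores}
    return {product: (value - lo) / (hi - lo) for product, value in scores.items()}


def blend(content, popularity, collaborative, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of three normalized signals; a missing score counts as 0."""
    signals = [min_max(content), min_max(popularity), min_max(collaborative)]
    products = set(content) | set(popularity) | set(collaborative)
    return {
        product: sum(w * signal.get(product, 0.0) for w, signal in zip(weights, signals))
        for product in products
    }


# Invented scores for three candidate products.
content = {"A": 0.2, "B": 0.8, "C": 0.5}        # content similarity
popularity = {"A": 75, "B": 10, "C": 40}        # raw quantity sold
collaborative = {"A": 0.1, "B": 0.6, "C": 0.0}  # co-purchase similarity

scores = blend(content, popularity, collaborative)
best = max(scores, key=scores.get)
print(best, scores)
```

Here product B wins despite being the least popular, because it leads on both similarity signals and those carry most of the weight. The weights are arbitrary and would need tuning against real data.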

--- Example Usage ---

--- Visualization ---