# Home Depot Product Search Relevance

Search relevance is an implicit measure Home Depot uses to gauge how quickly it can get customers to the right products. This script predicts the relevance score of every search query/product pair in Home Depot's product search relevance dataset on Kaggle (https://www.kaggle.com/c/home-depot-product-search-relevance).

###### Environment Setup

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import gensim
from porter2stemmer import Porter2Stemmer
from sklearn.metrics import r2_score
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from fuzzywuzzy import fuzz

###### Helper Function(s)

Function to count how many search-term words also occur in a product field. The two texts are passed as one tab-separated string:

In [2]:
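def common_count(pair):
    # `pair` is "<search term>\t<product text>"; the name avoids shadowing the built-in `str`
    query, text = pair.split('\t')
    count = 0
    for word in query.strip().split(' '):
        # `find` does substring matching, so "in" also matches inside "paint"
        if text.find(word) >= 0:
            count += 1
    return count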
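As a quick check of this behaviour, the sketch below repeats the helper and runs it on two made-up pairs (the example strings are illustrative, not taken from the dataset). Because the helper relies on `str.find`, it counts substring hits rather than whole-word matches.

```python
def common_count(pair):
    """Count query tokens that occur as substrings of the product text."""
    query, text = pair.split('\t')
    count = 0
    for word in query.strip().split(' '):
        if text.find(word) >= 0:
            count += 1
    return count

# Only "angle" occurs in the product text
exact = common_count("angle bracket\tgalvanized steel angle")
# "in" is counted too, because it is a substring of "paint"
partial = common_count("in paint\tpaint bucket")
print(exact, partial)
```

If whole-word matching is preferred, comparing against `set(text.split(' '))` would be one option, at the cost of missing partial matches such as `bracket` inside `brackets`.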

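Function to compute the similarity in embedded meaning between search-term words and product words (using Word2Vec embeddings). For each search-term word in the vocabulary it takes the highest similarity to any product word, then averages these maxima and returns a float: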

In [3]:
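def wv_sim(pair, model):
    # `pair` is "<search term>\t<product text>"; the name avoids shadowing the built-in `str`
    query, text = pair.split('\t')
    count = 0.0  # sum of best-match similarities
    wc = 0       # number of query words found in the vocabulary
    for word in query.strip().split(' '):
        # Note: `model.wv.vocab` is the gensim < 4.0 API (gensim >= 4.0 uses `model.wv.key_to_index`)
        if word in model.wv.vocab:
            agg = 0  # best similarity so far; negative similarities are floored at 0
            for term in text.strip().split(' '):
                if term in model.wv.vocab:
                    tx = model.wv.similarity(word, term)
                    if tx > agg:
                        agg = tx
            count += agg
            wc += 1
    # Average over in-vocabulary query words; 0.0 when none are known
    return count / (wc if wc > 0 else 1)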
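The aggregation can be illustrated without a trained model. The sketch below is a stand-in under stated assumptions: `toy_vectors` is a hypothetical dictionary of 2-d embeddings and `best_match_similarity` is an illustrative name, with plain cosine similarity used in place of `model.wv.similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings, chosen only to make the arithmetic easy to follow
toy_vectors = {
    "drill": (1.0, 0.0),
    "driver": (0.8, 0.6),
    "paint": (0.0, 1.0),
}

def best_match_similarity(query, text, vectors):
    """Average, over known query words, of the best similarity to any known text word."""
    total, known = 0.0, 0
    for word in query.split(' '):
        if word not in vectors:
            continue  # out-of-vocabulary query words are skipped
        best = 0.0    # same floor at 0 as in the helper above
        for term in text.split(' '):
            if term in vectors:
                best = max(best, cosine(vectors[word], vectors[term]))
        total += best
        known += 1
    return total / known if known else 0.0

# "drill" vs "driver" gives 0.8, "paint" vs "driver" gives 0.6, so the mean is about 0.7
score = best_match_similarity("drill paint", "cordless driver kit", toy_vectors)
print(round(score, 3))
```

Averaging only over in-vocabulary words means a query made entirely of unknown words scores 0.0, the same value as a query whose words are all dissimilar to the product text.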

Fuzzy string-matching scores (fuzzywuzzy's token-set and token-sort ratios, which build on Levenshtein-style edit distance):

In [4]:
def token_set(x):
    stra, strb = x.split('\t')
    return fuzz.token_set_ratio(stra, strb)

def token_sort(x):
    stra, strb = x.split('\t')
    return fuzz.token_sort_ratio(stra, strb)
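For intuition about what these ratios build on, here is a minimal sketch of the classic edit-distance recurrence together with the token-sorting idea. It is not fuzzywuzzy's implementation, whose scoring differs in detail; `levenshtein` and `sort_tokens` are illustrative names.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def sort_tokens(s):
    """Lowercase and sort whitespace-separated tokens, so word order no longer matters."""
    return ' '.join(sorted(s.lower().split()))

d_plain = levenshtein("kitten", "sitting")
d_sorted = levenshtein(sort_tokens("steel angle bracket"),
                       sort_tokens("bracket angle steel"))
print(d_plain, d_sorted)
```

Sorting tokens before comparing is what makes a reordered query such as `bracket angle steel` look identical to `steel angle bracket`, which is the motivation behind a token-sort score.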

###### Dataset

The dataset contains a number of products and real customer search terms from Home Depot's website. The most important files are train.csv, test.csv, product_descriptions.csv and attributes.csv.

The training data consists of 74067 instances and the test data of 166693 instances.

Load dataset:

In [5]:
dfs = dict()
dfs['train'] = pd.read_csv('train.csv', encoding = "ISO-8859-1")
dfs['test'] = pd.read_csv('test.csv', encoding = "ISO-8859-1")
dsc = pd.read_csv('product_descriptions.csv', encoding = "utf-8")
att = pd.read_csv('attributes.csv', encoding = "utf-8")

Extract Brand Information and additional description:

In [6]:
stemmer = Porter2Stemmer()
aux_dsc = att[att['name'] == 'Bullet01'][["product_uid", "value"]].rename(columns = {"value" : "auxilary_description"})
brands_list = att[att['name'] == 'MFG Brand Name'][["product_uid", "value"]].rename(columns = {"value" : "brand"})

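Build the Word2Vec training corpus: every title, search term, description, brand and auxiliary description becomes a list of lowercased, stemmed tokens: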

In [7]:
def stem_tokens(x):
    # Lowercase, split on single spaces and stem each token
    return [stemmer.stem(word) for word in str(x).lower().split(' ')]

sentences = np.array([])
for k in dfs:
    for field in ['product_title', 'search_term']:
        sentences = np.append(sentences, dfs[k][field].map(stem_tokens).values)
sentences = np.append(sentences, dsc['product_description'].map(stem_tokens).values)
sentences = np.append(sentences, brands_list['brand'].map(stem_tokens).values)
sentences = np.append(sentences, aux_dsc['auxilary_description'].map(stem_tokens).values)

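Train the Word2Vec model (16-dimensional embeddings, words seen fewer than 5 times are ignored):

In [8]:
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`
wv_model = Word2Vec(sentences, size=16, window=5, min_count=5, workers=4)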

Merge in the product fields, stem the text, and compute the query length and the similarity features between the search query and each product field:

In [9]:
for i, k in enumerate(dfs):
    dfs[k] = pd.merge(dfs[k], dsc, how='left', on='product_uid')
    dfs[k] = pd.merge(dfs[k], brands_list, how='left', on='product_uid')
    dfs[k] = pd.merge(dfs[k], aux_dsc, how='left', on='product_uid')
    
    for field in ['product_title', 'brand', 'product_description', 'auxilary_description', 'search_term']:
        dfs[k][field] = dfs[k][field].map(lambda x: " ".join([stemmer.stem(word) for word in str(x).lower().split(' ')]))
    dfs[k]['search_len'] = dfs[k]['search_term'].map(lambda x: len(x.split(' '))).astype(np.int32)
    
    for field in ['product_title', 'brand', 'product_description', 'auxilary_description']:
        dfs[k][field + '_common_count'] = (dfs[k]['search_term'] + '\t' + dfs[k][field]).map(lambda x: common_count(x)).astype(np.float32)
        dfs[k][field + '_wv_sim'] = (dfs[k]['search_term'] + '\t' + dfs[k][field]).map(lambda x: wv_sim(x, wv_model)).astype(np.float32)
        dfs[k][field + '_token_set'] = (dfs[k]['search_term'] + '\t' + dfs[k][field]).map(lambda x: token_set(x)).astype(np.float32)
        dfs[k][field + '_token_sort'] = (dfs[k]['search_term'] + '\t' + dfs[k][field]).map(lambda x: token_sort(x)).astype(np.float32)
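One detail of this cell is worth a small illustration. After a left merge, products without a brand or auxiliary description hold a missing value (NaN), and `str(x).lower()` turns that into the literal text `nan`, which the similarity helpers then treat as product text. The sketch below shows the effect in plain Python; `to_text` is a hypothetical helper, not part of the original script.

```python
import math

def to_text(value):
    """Return '' for missing values (None or float NaN), else the lowercased string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ''
    return str(value).lower()

missing = float('nan')              # stand-in for the value left by an unmatched left merge
as_written = str(missing).lower()   # what the cell above produces for such a row
spurious_hit = as_written.find('an') >= 0   # a query token "an" would be counted as common

cleaned = to_text(missing)
no_hit = cleaned.find('an') >= 0
print(repr(as_written), spurious_hit, repr(cleaned), no_hit)
```

Mapping missing values to an empty string before stemming (for example with pandas' `fillna('')`) would be one way to keep such rows from contributing spurious matches.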

Assemble the feature matrices and the target vector:

In [10]:
# One query-length feature plus four similarity features per product field (17 columns)
feature_cols = ['search_len'] + [
    field + suffix
    for field in ['product_title', 'brand', 'product_description', 'auxilary_description']
    for suffix in ['_common_count', '_wv_sim', '_token_set', '_token_sort']
]
y_train = dfs['train']['relevance'].values
x_test = dfs['test'][feature_cols].values
x_train = dfs['train'][feature_cols].values


###### Model Analysis

In [11]:
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.08, gamma=0, subsample=0.75, colsample_bytree=1, max_depth=7)
model.fit(x_train, y_train)

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.08, max_delta_step=0, max_depth=7,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.75)

R2 score (computed on the training data):

In [12]:
y = model.predict(x_train)
print(r2_score(y_train, y))

0.308294417585
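For reference, the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below spells that out on made-up numbers; `r2` is an illustrative re-implementation, not sklearn's code.

```python
def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
good = r2(y_true, [1.5, 2.0, 2.5])      # SS_res = 0.5, SS_tot = 2.0
baseline = r2(y_true, [2.0, 2.0, 2.0])  # always predicting the mean
print(good, baseline)
```

A model that always predicts the mean scores 0, so the value above can be read as the share of variance in the training relevance scores that the fitted model accounts for.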


RMS error (computed on the training data):

In [13]:
np.mean((y_train - y) ** 2) ** 0.5

0.4441052537155975
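The expression in the cell above is the root mean squared error: the square root of the mean squared difference between targets and predictions. A plain-Python sketch on made-up numbers (`rmse` is an illustrative name):

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Squared errors are 0.25, 0.0 and 0.25, so the mean is 0.5 / 3
score = rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
print(round(score, 4))
```

Since the figure above is measured on the rows the model was fitted on, it is likely to be optimistic. Holding out part of the training data, for instance with the `train_test_split` imported during setup, would give an estimate closer to what unseen queries produce.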

In [14]:
y = model.predict(x_test)
ans = pd.DataFrame({"id": dfs['test']['id'], "relevance": y})
ans.to_csv('answers.csv',index=False)
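A regressor is not constrained to the range of the training labels. Assuming, as in the competition description, that relevance scores lie between 1 and 3, predictions can be clipped to that range before writing the submission. The sketch below is an optional step that is not part of the original script; `clip_relevance` is a hypothetical helper.

```python
def clip_relevance(predictions, low=1.0, high=3.0):
    """Clamp each prediction to [low, high]; the 1 to 3 range is an assumption about the labels."""
    return [min(max(p, low), high) for p in predictions]

clipped = clip_relevance([0.8, 2.4, 3.2])
print(clipped)
```

With NumPy arrays, `np.clip(y, 1.0, 3.0)` applied to the predictions would do the same in one call.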

The XGBoost regressor scores a root mean squared error (RMSE) of 0.47179 on Kaggle (vs. 0.48221 for sklearn's Random Forest and 0.51127 for a Keras ANN).