## HomeDepot

### Data Exploration

- Shoppers rely on Home Depot’s product authority to find and buy the latest products and to get timely solutions to their home improvement needs. From installing a new ceiling fan to remodeling an entire kitchen, with the click of a mouse or tap of the screen, customers expect the correct results to their queries – quickly. Speed, accuracy and delivering a frictionless customer experience are essential.



- In this competition, Home Depot is asking Kagglers to help them improve their customers' shopping experience by developing a model that can accurately predict the relevance of search results.



- Search relevancy is an implicit measure Home Depot uses to gauge how quickly they can get customers to the right products. Currently, human raters evaluate the impact of potential changes to their search algorithms, which is a slow and subjective process. By removing or minimizing human input in search relevance evaluation, Home Depot hopes to increase the number of iterations their team can perform on the current search algorithms.




In [1]:
import pandas as pd
import numpy as np

import gensim 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity  

from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from nltk.stem.snowball import SnowballStemmer



In [2]:
# Read the CSVs as ISO-8859-1; background on Python unicode handling:
# https://pythonhosted.org/kitchen/unicode-frustrations.html
train = pd.read_csv('train.csv',encoding="ISO-8859-1")
test = pd.read_csv('test.csv',encoding="ISO-8859-1")

In [3]:
print(train.shape)
train.head()


(74067, 5)


Unnamed: 0,id,product_uid,product_title,search_term,relevance
0,2,100001,Simpson Strong-Tie 12-Gauge Angle,angle bracket,3.0
1,3,100001,Simpson Strong-Tie 12-Gauge Angle,l bracket,2.5
2,9,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,deck over,3.0
3,16,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,rain shower head,2.33
4,17,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,shower only faucet,2.67


In [4]:
print("Number of unique products: {0}".format(train['product_title'].nunique()))
print("Number of unique IDs: {0}".format(train['id'].nunique()))
print("Number of unique search terms: {0}".format(train['search_term'].nunique()))

Number of unique products: 53489
Number of unique IDs: 74067
Number of unique search terms: 11795


In [5]:
print(test.shape)
test.head()

(166693, 4)


Unnamed: 0,id,product_uid,product_title,search_term
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket
1,4,100001,Simpson Strong-Tie 12-Gauge Angle,metal l brackets
2,5,100001,Simpson Strong-Tie 12-Gauge Angle,simpson sku able
3,6,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong ties
4,7,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong tie hcc668


### Concatenate product_description to test dataset
#### Possible Features
- Cosine similarity: vectorize the search term and the product text with Word2Vec (W2V) and compute the cosine similarity between them?
- Number of overlapping words between the product description and the search term
- Length of the query

In [6]:
test['product_title'].nunique()

94731

In [7]:
attributes = pd.read_csv('attributes.csv')
print(attributes.shape)
attributes.head()

(2044803, 3)


Unnamed: 0,product_uid,name,value
0,100001.0,Bullet01,Versatile connector for various 90° connection...
1,100001.0,Bullet02,Stronger than angled nailing or screw fastenin...
2,100001.0,Bullet03,Help ensure joints are consistently straight a...
3,100001.0,Bullet04,Dimensions: 3 in. x 3 in. x 1-1/2 in.
4,100001.0,Bullet05,Made from 12-Gauge steel


In [8]:
prod_desc = pd.read_csv('product_descriptions.csv')
print(prod_desc.shape)
prod_desc.head()

(124428, 2)


Unnamed: 0,product_uid,product_description
0,100001,"Not only do angles make joints stronger, they ..."
1,100002,BEHR Premium Textured DECKOVER is an innovativ...
2,100003,Classic architecture meets contemporary design...
3,100004,The Grape Solar 265-Watt Polycrystalline PV So...
4,100005,Update your bathroom with the Delta Vero Singl...


In [9]:
len(prod_desc['product_uid'].unique())

124428

In [10]:
prod_desc['product_uid'].nunique()

124428

In [11]:
# prod_desc.describe()

In [12]:
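# merge descriptions
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
# Note: how="right" keeps every row of prod_desc, so products without a
# training row appear with NaN id/search_term/relevance (hence the float ids below).
train_data = pd.merge(train, prod_desc, on="product_uid", how="right")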

In [13]:
train_data = train_data[['id','product_uid','product_title','product_description','search_term','relevance']]
train_data.head(10)

Unnamed: 0,id,product_uid,product_title,product_description,search_term,relevance
0,2.0,100001,Simpson Strong-Tie 12-Gauge Angle,"Not only do angles make joints stronger, they ...",angle bracket,3.0
1,3.0,100001,Simpson Strong-Tie 12-Gauge Angle,"Not only do angles make joints stronger, they ...",l bracket,2.5
2,9.0,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,BEHR Premium Textured DECKOVER is an innovativ...,deck over,3.0
3,16.0,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,Update your bathroom with the Delta Vero Singl...,rain shower head,2.33
4,17.0,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,Update your bathroom with the Delta Vero Singl...,shower only faucet,2.67
5,18.0,100006,Whirlpool 1.9 cu. ft. Over the Range Convectio...,Achieving delicious results is almost effortle...,convection otr,3.0
6,20.0,100006,Whirlpool 1.9 cu. ft. Over the Range Convectio...,Achieving delicious results is almost effortle...,microwave over stove,2.67
7,21.0,100006,Whirlpool 1.9 cu. ft. Over the Range Convectio...,Achieving delicious results is almost effortle...,microwaves,3.0
8,23.0,100007,Lithonia Lighting Quantum 2-Light Black LED Em...,The Quantum Adjustable 2-Light LED Black Emerg...,emergency light,2.67
9,27.0,100009,House of Fara 3/4 in. x 3 in. x 8 ft. MDF Flut...,Get the House of Fara 3/4 in. x 3 in. x 8 ft. ...,mdf 3/4,3.0
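
The float `id` values in the table above are a side effect of the `how="right"` merge. The following is a minimal sketch on made-up toy frames (`toy_train` and `toy_desc` are hypothetical stand-ins, not the competition data) comparing a right merge with a left merge:

```python
import pandas as pd

# Toy stand-ins for train and prod_desc (hypothetical data)
toy_train = pd.DataFrame({"id": [2, 3, 9],
                          "product_uid": [100001, 100001, 100002]})
toy_desc = pd.DataFrame({"product_uid": [100001, 100002, 100003],
                         "product_description": ["angle", "paint", "tub"]})

# how="right" keeps every description, even for products with no training row
right = pd.merge(toy_train, toy_desc, on="product_uid", how="right")
# how="left" keeps exactly the training rows
left = pd.merge(toy_train, toy_desc, on="product_uid", how="left")

print(right)  # 4 rows; id is NaN for product 100003, so the column becomes float
print(left)   # 3 rows; id stays an integer column
```

If only the labelled search/product pairs are needed for training, the left merge is probably the safer choice, since it cannot introduce rows without a relevance score.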


In [14]:
# # merge product counts
# product_counts = pd.DataFrame(pd.Series(training_data.groupby(["product_uid"]).size(), name="product_count"))
# training_data = pd.merge(train_data, product_counts, left_on="product_uid", right_index=True, how="left")

# # merge brand names
# brand_names = attribute_data[attribute_data.name == "MFG Brand Name"][["product_uid", "value"]].rename(columns={"value": "brand_name"})
# train_data = pd.merge(training_data, brand_names, on="product_uid", how="left")
# train_data.brand_name.fillna("Unknown", inplace=True)

In [15]:
# pd.Series
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html 

train_data.groupby(['product_uid']).size().head()

product_uid
100001    2
100002    1
100003    1
100004    1
100005    2
dtype: int64

In [16]:
pd.DataFrame(train_data.groupby(['product_uid']).size()).head()

Unnamed: 0_level_0,0
product_uid,Unnamed: 1_level_1
100001,2
100002,1
100003,1
100004,1
100005,2


In [17]:
pd.DataFrame(pd.Series(train_data.groupby(['product_uid']).size())).head()

Unnamed: 0_level_0,0
product_uid,Unnamed: 1_level_1
100001,2
100002,1
100003,1
100004,1
100005,2


In [18]:
train_data.groupby(['product_uid'])

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x121922e10>

In [19]:
# to make '.groupby()' visible 
pd.Series(train_data.groupby(['product_uid']).size()).head()

product_uid
100001    2
100002    1
100003    1
100004    1
100005    2
dtype: int64

In [20]:
# Diff. between 'value_counts' and 'groupby' 
train_data['product_uid'].value_counts().head()

102893    21
101959    21
101892    18
102456    17
104691    17
Name: product_uid, dtype: int64
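
The two outputs above contain the same counts in a different order. A small sketch on a made-up frame (the `toy` data is hypothetical) shows the difference: `groupby(...).size()` is ordered by the group key, while `value_counts()` is ordered by the count, largest first.

```python
import pandas as pd

# Hypothetical toy data: product 100005 appears 3 times, 100001 twice, 100002 once
toy = pd.DataFrame({"product_uid": [100005, 100001, 100005, 100002, 100005, 100001]})

by_group = toy.groupby("product_uid").size()   # index sorted by product_uid
by_count = toy["product_uid"].value_counts()   # sorted by count, descending

print(by_group)
print(by_count)
```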

In [21]:
# merge product counts
product_counts = pd.DataFrame(pd.Series(train_data.groupby(["product_uid"]).size(), name="product_count"))
train_data = pd.merge(train_data, product_counts, left_on="product_uid", right_index=True, how="left")

In [22]:
train_data.head()

Unnamed: 0,id,product_uid,product_title,product_description,search_term,relevance,product_count
0,2.0,100001,Simpson Strong-Tie 12-Gauge Angle,"Not only do angles make joints stronger, they ...",angle bracket,3.0,2
1,3.0,100001,Simpson Strong-Tie 12-Gauge Angle,"Not only do angles make joints stronger, they ...",l bracket,2.5,2
2,9.0,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,BEHR Premium Textured DECKOVER is an innovativ...,deck over,3.0,1
3,16.0,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,Update your bathroom with the Delta Vero Singl...,rain shower head,2.33,2
4,17.0,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,Update your bathroom with the Delta Vero Singl...,shower only faucet,2.67,2


In [23]:
# merge brand names

brand_names = attributes[attributes.name == "MFG Brand Name"][["product_uid", "value"]].rename(columns={"value": "brand_name"})
# training_data = pd.merge(training_data, brand_names, on="product_uid", how="left")
# training_data.brand_name.fillna("Unknown", inplace=True)
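
The commented-out lines above outline the intended brand-name merge. Below is a minimal sketch of that idea on made-up toy frames; `toy_attributes`, `toy_products`, and the values in them are hypothetical, and dropping rows with a missing `product_uid` before casting the key is an assumption about how one might handle the float key column.

```python
import pandas as pd

# Hypothetical toy stand-ins for attributes and the merged training frame
toy_attributes = pd.DataFrame({
    "product_uid": [100001.0, 100001.0, 100002.0],
    "name": ["Bullet01", "MFG Brand Name", "MFG Brand Name"],
    "value": ["Versatile connector", "Simpson Strong-Tie", "BEHR"],
})
toy_products = pd.DataFrame({"product_uid": [100001, 100002, 100003]})

# Keep only the brand rows and rename the value column
brands = (
    toy_attributes[toy_attributes["name"] == "MFG Brand Name"][["product_uid", "value"]]
    .rename(columns={"value": "brand_name"})
    .dropna(subset=["product_uid"])
    .assign(product_uid=lambda d: d["product_uid"].astype("int64"))  # align key dtypes
)

# Left merge keeps every product; products without a brand row get "Unknown"
merged = pd.merge(toy_products, brands, on="product_uid", how="left")
merged["brand_name"] = merged["brand_name"].fillna("Unknown")
print(merged)
```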


### Data Preprocessing

In [15]:
import pandas as pd
train  = pd.read_csv('train.csv',encoding="ISO-8859-1")
test = pd.read_csv('test.csv',encoding="ISO-8859-1")
prod_desc = pd.read_csv('product_descriptions.csv')
attributes = pd.read_csv('attributes.csv')

In [16]:
train.head()

Unnamed: 0,id,product_uid,product_title,search_term,relevance
0,2,100001,Simpson Strong-Tie 12-Gauge Angle,angle bracket,3.0
1,3,100001,Simpson Strong-Tie 12-Gauge Angle,l bracket,2.5
2,9,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,deck over,3.0
3,16,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,rain shower head,2.33
4,17,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,shower only faucet,2.67


In [17]:
test.head()

Unnamed: 0,id,product_uid,product_title,search_term
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket
1,4,100001,Simpson Strong-Tie 12-Gauge Angle,metal l brackets
2,5,100001,Simpson Strong-Tie 12-Gauge Angle,simpson sku able
3,6,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong ties
4,7,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong tie hcc668


In [18]:
prod_desc.head()

Unnamed: 0,product_uid,product_description
0,100001,"Not only do angles make joints stronger, they ..."
1,100002,BEHR Premium Textured DECKOVER is an innovativ...
2,100003,Classic architecture meets contemporary design...
3,100004,The Grape Solar 265-Watt Polycrystalline PV So...
4,100005,Update your bathroom with the Delta Vero Singl...


In [19]:
attributes.head()

Unnamed: 0,product_uid,name,value
0,100001.0,Bullet01,Versatile connector for various 90° connection...
1,100001.0,Bullet02,Stronger than angled nailing or screw fastenin...
2,100001.0,Bullet03,Help ensure joints are consistently straight a...
3,100001.0,Bullet04,Dimensions: 3 in. x 3 in. x 1-1/2 in.
4,100001.0,Bullet05,Made from 12-Gauge steel


In [20]:
# Examples

a1 = 'i want this'
a2 = 'you want this'

# each find() comparison yields a Boolean
print([(a1.find(word)>=0) for word in a2.split()])

# wrapping in int() turns the Booleans into 0 or 1
print([int(a1.find(word)>=0) for word in a2.split()])

# sum() adds up the 1s, i.e. counts the common words
print(sum(int(a1.find(word)>=0) for word in a2.split()))

[False, True, True]
[0, 1, 1]
2


In [21]:
# Define functions

stemmer = SnowballStemmer('english')

# Lowercase the string and stem every whitespace-separated word
def str_stemmer(s):
    return " ".join([stemmer.stem(word) for word in s.lower().split()])

# Count the words of str1 that occur in str2
# (substring match: 'l' also counts as found in 'angle')
def str_common_word(str1, str2):
    return sum(int(str2.find(word)>=0) for word in str1.split())
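
Because the common-word count relies on `str.find`, a query word is counted whenever it appears anywhere inside the text, even in the middle of another word. The sketch below contrasts that with a whole-token match; `common_token_count` is a hypothetical alternative and not part of the original notebook.

```python
def common_substring_count(query, text):
    # Same logic as str_common_word: counts query words found anywhere in text
    return sum(int(text.find(word) >= 0) for word in query.split())

def common_token_count(query, text):
    # Hypothetical alternative: counts query words that equal a whole token of text
    tokens = set(text.split())
    return sum(int(word in tokens) for word in query.split())

query = "l bracket"
title = "simpson strong-tie 12-gauge angle"

print(common_substring_count(query, title))  # 1, because 'l' occurs inside 'angle'
print(common_token_count(query, title))      # 0, no whole word matches
```

Which variant works better as a feature is an empirical question; the substring version is more forgiving of stemming differences but also counts accidental matches for very short query words.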

In [22]:
# Data Concatenation with Product Description

# df_all = pd.merge(df_all, df_pro_desc, how='left', on='product_uid')
train_c = pd.merge(train, prod_desc, how = 'left', on = 'product_uid')
test_c = pd.merge(test, prod_desc, how = 'left', on = 'product_uid')

In [23]:
train_c.head()

Unnamed: 0,id,product_uid,product_title,search_term,relevance,product_description
0,2,100001,Simpson Strong-Tie 12-Gauge Angle,angle bracket,3.0,"Not only do angles make joints stronger, they ..."
1,3,100001,Simpson Strong-Tie 12-Gauge Angle,l bracket,2.5,"Not only do angles make joints stronger, they ..."
2,9,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,deck over,3.0,BEHR Premium Textured DECKOVER is an innovativ...
3,16,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,rain shower head,2.33,Update your bathroom with the Delta Vero Singl...
4,17,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,shower only faucet,2.67,Update your bathroom with the Delta Vero Singl...


In [24]:
test_c.head()

Unnamed: 0,id,product_uid,product_title,search_term,product_description
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket,"Not only do angles make joints stronger, they ..."
1,4,100001,Simpson Strong-Tie 12-Gauge Angle,metal l brackets,"Not only do angles make joints stronger, they ..."
2,5,100001,Simpson Strong-Tie 12-Gauge Angle,simpson sku able,"Not only do angles make joints stronger, they ..."
3,6,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong ties,"Not only do angles make joints stronger, they ..."
4,7,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong tie hcc668,"Not only do angles make joints stronger, they ..."


In [25]:
# stemming text data

train_c['product_title'] = train_c['product_title'].apply(str_stemmer)
train_c['search_term'] = train_c['search_term'].apply(str_stemmer)
train_c['product_description'] = train_c['product_description'].apply(str_stemmer)

test_c['product_title'] = test_c['product_title'].apply(str_stemmer)
test_c['search_term'] = test_c['search_term'].apply(str_stemmer)
test_c['product_description'] = test_c['product_description'].apply(str_stemmer)

In [26]:
# Create a new column for the length of search term

train_c['len_of_query'] = train_c['search_term'].apply(lambda x:len(x.split())).astype(np.int64)
test_c['len_of_query'] = test_c['search_term'].apply(lambda x:len(x.split())).astype(np.int64)

In [27]:
# Create new column 'product_info' by concatenating other columns

train_c['product_info'] = train_c['search_term']+"\t"+train_c['product_title']+"\t"+train_c['product_description']
#train_c['product_info'].head()

test_c['product_info'] = test_c['search_term']+"\t"+test_c['product_title']+"\t"+test_c['product_description']
#test_c['product_info'].head()

In [28]:
# Num. of COMMON words between Search term & Product_title;  [0] => search term    [1] => product_title

train_c['word_in_title'] = train_c['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[1]))
#train_c['word_in_title'].head()

test_c['word_in_title'] = test_c['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[1]))
#test_c['word_in_title'].head()

In [29]:
# Num. of COMMON words between Search term & Product_description;  [0] => search term    [2] => product_description

train_c['word_in_description'] = train_c['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[2]))
#train_c['word_in_description'].head()

test_c['word_in_description'] = test_c['product_info'].map(lambda x:str_common_word(x.split('\t')[0],x.split('\t')[2]))
#test_c['word_in_description'].head()

In [30]:
## TFIDF Analysis of Product description & search_term

train_c['product_info_tfidf'] = train_c['product_title']+" "+train_c['product_description']
#train_c['product_info'].head()

test_c['product_info_tfidf'] = test_c['product_title']+" "+test_c['product_description']
#test_c['product_info'].head()





In [33]:
tfidf_model = TfidfVectorizer(analyzer = 'word', ngram_range = (1,1), max_features = 50000, binary = True)
# tfidf_model = TfidfVectorizer(analyzer = 'word', ngram_range = (1,1), binary = True)
tfidf_model

TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [36]:
tfidf_matrix = tfidf_model.fit_transform(train_c['product_info_tfidf'])
tfidf_matrix

<74067x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 6544807 stored elements in Compressed Sparse Row format>

In [37]:
search_vec = tfidf_model.transform(train_c['search_term'])
search_vec

<74067x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 207422 stored elements in Compressed Sparse Row format>

In [None]:
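def cos_sim(p_desc, query):
    # Row-wise cosine similarity between matching rows of two sparse matrices.
    # cosine_similarity(p_desc, query) would return the full 74067 x 74067
    # matrix, which cannot be stored in a single column. TfidfVectorizer
    # L2-normalizes rows by default (norm='l2'), so the row-wise dot product
    # is already the cosine similarity.
    return np.asarray(p_desc.multiply(query).sum(axis=1)).ravel()

train_c['cos_sim'] = cos_sim(tfidf_matrix, search_vec)
train_c.head()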

In [None]:
# Note: .transform() search_term with the fitted vectorizer, then compare the cosine similarity between search_term and product_info_tfidf
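
For a per-row feature, each search term only needs to be compared with its own product text, not with every product. The sketch below illustrates this with small dense NumPy arrays standing in for the TF-IDF rows; `rowwise_cosine` and the toy values are hypothetical, and treating zero vectors as similarity 0 is an assumption.

```python
import numpy as np

def rowwise_cosine(a, b):
    # Cosine similarity between a[i] and b[i] for every row i
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows with zero norm (e.g. a query with no known words) get similarity 0
    return np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den > 0)

# Hypothetical toy vectors: 3 products and their 3 matching queries
docs = np.array([[1.0, 1.0, 0.0], [0.0, 2.0, 2.0], [3.0, 0.0, 0.0]])
queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 0.0]])

sims = rowwise_cosine(docs, queries)
pairwise_shape = (docs @ queries.T).shape  # all-pairs products give an n x n matrix

print(sims.shape)      # (3,), one value per row
print(pairwise_shape)  # (3, 3)
```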

In [12]:
# Load Word2Vec Model for vectorization for similarity calculation
# import gensim 

# model = gensim.models.KeyedVectors.load_word2vec_format('/Users/keonpark/Downloads/TD_Internship/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [11]:
# # Put the cleaned texts into a list for vectorization
# train_c['product_info_vect'] = train_c['product_info'].values.tolist()

# # Collect the unique words used in the product texts
# unique_words = list(set([word for sublist in [doc.split() for doc in train_c['product_info_vect']] for word in sublist]))

# # Get the full vocabulary and the corresponding vectors (300 dimensions each) of the pretrained Word2Vec embedding
# w2v_words = list(model.wv.index2word)
# w2v_vectors = model.vectors

# # Find the unique words that do not exist in the Word2Vec embedding (out-of-vocabulary, OOV)
# OOV = list(set(unique_words)-set(w2v_words))

# # Assign small randomly generated vectors to the words that do not exist in the Word2Vec embedding
# np.random.seed(seed=42)
# OOV_dict = {oov:np.random.uniform(low = -0.0001, high = 0.0001, size = (300,)) for oov in OOV}

# w2v_dict = {word:w2v_vectors[i] for i, word in enumerate(w2v_words)}
# total_dict = {**w2v_dict, **OOV_dict}

In [3]:
# # Define vectorizer function for vector averaging
# def vectorizer(doc, model):
#     word_list = doc.split()
#     doc_vector = np.mean([model[word] for word in word_list], axis=0)
#     return doc_vector
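
The commented-out cells above average word vectors into one document vector and give out-of-vocabulary words a small random vector. A minimal sketch of that idea follows, using a tiny made-up embedding table instead of the pretrained model; `doc_vector`, `embeddings`, and the 4-dimensional vectors are hypothetical, and returning a zero vector for an empty string is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 4  # toy dimension; the pretrained vectors mentioned above have 300

# Hypothetical toy embedding table standing in for the pretrained vectors
embeddings = {"shower": np.array([1.0, 0.0, 0.0, 0.0]),
              "head":   np.array([0.0, 1.0, 0.0, 0.0])}

def doc_vector(doc, table, dim, rng):
    # Average the vectors of all words; unseen words get a small random vector
    vectors = []
    for word in doc.split():
        if word not in table:
            table[word] = rng.uniform(-0.0001, 0.0001, size=dim)
        vectors.append(table[word])
    # An empty document maps to the zero vector instead of raising
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

vec = doc_vector("rain shower head", embeddings, dim, rng)
empty_vec = doc_vector("", embeddings, dim, rng)
print(vec.shape)  # (4,)
```

Caching the random vector in the table means the same unseen word always maps to the same vector, which keeps repeated queries comparable.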

In [4]:
# len(total_dict['nail'])

In [5]:
# Average the word vectors to get a sentence vector (in this case, a short description vector)
# query_vectors = np.array([vectorizer(q, total_dict) for q in train_c['product_info_vect'].values.tolist()])

In [6]:
# query_vectors.shape

In [7]:
# st_vectors = np.array([vectorizer(q, total_dict) for q in train_c['search_term'].values.tolist()])
# st_vectors.shape

In [8]:
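def cos_sim(p_desc, query):
    # Row-wise cosine similarity for dense (n, d) arrays such as the averaged
    # word vectors; cosine_similarity() would return the full n x n matrix
    num = np.sum(p_desc * query, axis=1)
    den = np.linalg.norm(p_desc, axis=1) * np.linalg.norm(query, axis=1)
    return num / den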

In [9]:
# train_c['cos_sim'] = cos_sim(query_vectors,st_vectors )
# train_c['cos_sim']

In [10]:
# train_c.head()