# Home Depot Product Search Relevance
The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters.

## GraphLab Create
This notebook uses the GraphLab Create machine learning library for Python. You need a personal license to run this code.

In [1]:
import graphlab as gl

### Load data from CSV files

In [2]:
train = gl.SFrame.read_csv("../data/train.csv")

[INFO] This non-commercial license of GraphLab Create is assigned to thomasv1000@hotmail.fr and will expire on October 12, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-16476 - Server binary: C:\Users\T.Jaskula\AppData\Local\Continuum\Anaconda2\lib\site-packages\graphlab\unity_server.exe - Server log: C:\Users\T9773~1.JAS\AppData\Local\Temp\graphlab_server_1454677328.log.0
[INFO] GraphLab Server Version: 1.8.1


PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\train.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.155015 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\train.csv
PROGRESS: Parsing completed. Parsed 74067 lines in 0.10101 secs.


In [3]:
test = gl.SFrame.read_csv("../data/test.csv")

PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\test.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.539053 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,long,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\test.csv
PROGRESS: Parsing completed. Parsed 166693 lines in 0.218021 secs.


In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")

PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\product_descriptions.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.854086 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[long,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 61134 lines. Lines per second: 69701.1
PROGRESS: Finished parsing file C:\Users\T.Jaskula\Python\HomeDepot\data\product_descriptions.csv
PROGRESS: Parsing completed. Parsed 124428 lines in 1.12511 secs.


### Data merging

In [6]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [7]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')
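
The two joins above attach a description to every search/product row and keep rows that have no matching description. A minimal pure-Python sketch of that left-join behaviour follows. The toy rows and the `left_join` helper are hypothetical and only illustrate the idea; they are not how GraphLab Create implements `SFrame.join`, and the sketch assumes each `product_uid` appears at most once on the right side.

```python
# Hypothetical toy rows standing in for train.csv and product_descriptions.csv.
train_rows = [
    {"id": 1, "product_uid": 100001, "search_term": "angle bracket"},
    {"id": 2, "product_uid": 100001, "search_term": "l bracket"},
    {"id": 3, "product_uid": 100002, "search_term": "deck paint"},
]
desc_rows = [
    {"product_uid": 100001, "product_description": "steel angle bracket for framing"},
]


def left_join(left, right, key):
    """Left join two lists of dicts on `key`; unmatched left rows get None."""
    # Assumption: `key` is unique in `right`, so a plain dict index is enough.
    index = {row[key]: row for row in right}
    right_columns = sorted({col for row in right for col in row if col != key})
    joined = []
    for row in left:
        match = index.get(row[key], {})
        merged = dict(row)
        for col in right_columns:
            merged[col] = match.get(col)  # None when there is no match
        joined.append(merged)
    return joined


joined_rows = left_join(train_rows, desc_rows, "product_uid")
```

Every left row survives the join, so the row count stays the same; a missing description shows up as `None` and would need handling before any text features are computed.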

In [None]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['search_term'])
train_search_tfidf = gl.text_analytics.tf_idf(train['search_term_word_count'])

In [None]:
train['search_tfidf'] = train_search_tfidf
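
To make the two calls above less opaque, here is a small pure-Python sketch of bag-of-words counting followed by TF-IDF weighting. It assumes the common `count * log(N / document_frequency)` weighting and simple lowercase whitespace tokenization; the exact formula and tokenizer used by `gl.text_analytics` may differ, and the example queries are made up.

```python
import math
from collections import Counter


def count_words(text):
    """Lowercase, split on whitespace, and count each token."""
    return dict(Counter(text.lower().split()))


def tf_idf(bags):
    """Weight each count by log(N / document frequency) across `bags`."""
    n_docs = len(bags)
    # Each word is counted once per document it appears in.
    doc_freq = Counter(word for bag in bags for word in bag)
    return [
        {word: count * math.log(n_docs / doc_freq[word]) for word, count in bag.items()}
        for bag in bags
    ]


queries = ["angle bracket", "l bracket", "deck paint"]  # hypothetical queries
bags = [count_words(q) for q in queries]
weights = tf_idf(bags)
```

Words that occur in many documents (here `bracket`) receive a lower weight than rarer ones (here `angle`). Note that the notebook computes TF-IDF separately per column, so the document frequencies for search terms, titles, and descriptions each come from a different corpus.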

In [None]:
train.head()

In [None]:
train['product_desc_word_count'] = gl.text_analytics.count_words(train['product_description'])
train_desc_tfidf = gl.text_analytics.tf_idf(train['product_desc_word_count'])

In [None]:
train['desc_tfidf'] = train_desc_tfidf

In [None]:
train.head()

In [None]:
train['product_title_word_count'] = gl.text_analytics.count_words(train['product_title'])
train_title_tfidf = gl.text_analytics.tf_idf(train['product_title_word_count'])
train['title_tfidf'] = train_title_tfidf
train.head()

In [None]:
# Cosine distance between the search term and the description / title TF-IDF vectors
train['distance'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'], x['desc_tfidf']))
train['distance2'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'], x['title_tfidf']))
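
The two features above are cosine distances between sparse word-weight dictionaries. The sketch below shows the computation in plain Python, assuming the usual definition of one minus cosine similarity. How `gl.distances.cosine` treats an all-zero vector is not shown in this notebook, so the fallback value here is an assumption, and the example vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two sparse {word: weight} vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat an empty or all-zero vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


query = {"angle": 1.0, "bracket": 1.0}
title = {"steel": 1.0, "angle": 1.0, "bracket": 1.0, "galvanized": 1.0}
unrelated = {"deck": 1.0, "paint": 1.0}

close = cosine_distance(query, title)
far = cosine_distance(query, unrelated)
```

A smaller distance means more shared weighted vocabulary, so a negative coefficient on these features in the regression would be the expected sign.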

In [None]:
train.head()

In [None]:
model1 = gl.linear_regression.create(train, target = 'relevance', features = ['distance', 'distance2'], validation_set = None)
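
For intuition about what the model above fits, here is a minimal ordinary least squares sketch with an intercept and two features, solved through the normal equations. It is an illustration only: GraphLab Create may apply regularization and use a different solver, and the `fit_linear` and `predict` helpers as well as the toy data are hypothetical.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    n = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, row):
    """Apply intercept plus weighted sum of the features."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Toy data generated from a known linear rule so the fit can be checked.
features = [(0.2, 0.1), (0.9, 0.8), (0.5, 0.3), (1.0, 1.0), (0.4, 0.7)]
relevance = [3.0 - 0.5 * d1 - 1.5 * d2 for d1, d2 in features]
weights = fit_linear(features, relevance)
```

Because the toy targets are exactly linear in the two distances, the fit recovers the generating coefficients up to floating-point error. Real relevance labels are noisy, and with `validation_set = None` the notebook has no held-out data to estimate how well the model generalizes.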

In [None]:
# Let's take a look at the weights before we plot
model1.get("coefficients")

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Rebuild the same features for the test set (TF-IDF is recomputed on the test data)
test['search_term_word_count'] = gl.text_analytics.count_words(test['search_term'])
test_search_tfidf = gl.text_analytics.tf_idf(test['search_term_word_count'])
test['search_tfidf'] = test_search_tfidf
test['product_desc_word_count'] = gl.text_analytics.count_words(test['product_description'])
test_desc_tfidf = gl.text_analytics.tf_idf(test['product_desc_word_count'])
test['desc_tfidf'] = test_desc_tfidf
test['product_title_word_count'] = gl.text_analytics.count_words(test['product_title'])
test_title_tfidf = gl.text_analytics.tf_idf(test['product_title_word_count'])
test['title_tfidf'] = test_title_tfidf
test['distance'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
test['distance2'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))

In [None]:
output = model1.predict(test)

In [None]:
output

In [None]:
submission = gl.SFrame(test['id'])

In [None]:
# The unnamed columns get the default names 'X1' and 'X2'; rename them for the submission file
submission.add_column(output)
submission.rename({'X1': 'id', 'X2': 'relevance'})

In [None]:
# Clamp predictions to the range [1.0, 3.0]
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
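
The two `apply` calls above clamp each prediction to the interval from 1.0 to 3.0, since a linear model can predict values outside it. The same logic in plain Python, with a hypothetical `clip` helper and made-up predictions:

```python
def clip(value, lower=1.0, upper=3.0):
    """Clamp `value` into the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


raw_predictions = [0.7, 1.0, 2.35, 3.0, 3.4]  # hypothetical model outputs
clipped = [clip(p) for p in raw_predictions]
```

Doing both bounds in one pass avoids iterating over the column twice. If the installed GraphLab Create version provides a clipping method on `SArray`, that would likely be a more direct replacement for the row-wise lambdas.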

In [None]:
# Store the scores as strings before export
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [None]:
submission.export_csv('../data/submission.csv', quote_level = 3)
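
The value `quote_level = 3` matches the standard library constant `csv.QUOTE_NONE`, which writes fields without surrounding quotes. Assuming `export_csv` follows the `csv` module's quoting constants, the resulting file layout can be sketched with the standard library alone; the ids and scores below are made up.

```python
import csv
import io

rows = [(1, "2.35"), (4, "1.0")]  # hypothetical ids and string scores

buffer = io.StringIO()
# QUOTE_NONE keeps string fields unquoted, as in a plain "id,relevance" file.
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE, lineterminator="\n")
writer.writerow(["id", "relevance"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Passing `csv.QUOTE_NONE` by name instead of the bare number `3` would make the intent clearer to a reader of the notebook.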

In [None]:
#gl.canvas.set_target('ipynb')