# Using Random Forests to Find the Relevance of Search Queries

We are given a data set produced by a group of people who looked through search queries and decided how relevant each one was to the product that the user eventually ended up on. For each row we know the product UID, the query made, and the title of the product. We also have access to the attributes of each product, which is a great help because it gives us several more opportunities to create features.

## Initial procedures

In [1]:
import pandas as pd
import matplotlib as mpl
import numpy as np
import math
import os

from sklearn.ensemble import RandomForestRegressor

In [2]:
folder = '/home/pbnjeff/Dropbox/KaggleHomeDepotData/'
cleaned_data_path = '/home/pbnjeff/Dropbox/KaggleHomeDepotData/combined_cleaned.csv'

train_path = folder + 'train.csv'
test_path = folder + 'test.csv'

train = pd.read_csv(train_path)
test = pd.read_csv(test_path)

if os.path.exists(cleaned_data_path):
    combined = pd.read_csv(cleaned_data_path)
else:
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    combined = pd.concat([train, test])
    combined = combined.reset_index(drop=True)

In [3]:
attributes = pd.read_csv('/home/pbnjeff/Dropbox/KaggleHomeDepotData/attributes.csv')

## Feature engineering

I create only a handful of features, as I didn't have enough time to go through everything that I wanted to. To summarize:

* Percent query in product title: An obvious feature. If there are more words from the query in the title, then it should be more relevant.

* There is at least one match between query and title: This might be a way to "normalize" the queries. If 1 of 10 terms matches for one query and 1 of 9 terms matches for another, the previous feature alone would say that the 1-of-9 query is more relevant, but that's not exactly a fair comparison given that the two percentages are nearly the same anyway.

* Numerical string in both: People might not know the exact number, but may have gotten something close to it. Basically, it's a feature that gives an A for effort, but combined with the previous two features, gives a bonus (possibly too much) for matching numbers.

* Material matches: An obvious feature if you're given the attributes.
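
The first three features in the list above can be sketched on a few made-up rows. This is only an illustration, not the notebook's implementation: the toy data, `tokenize`, `is_number`, and `title_features` are hypothetical names, and it assumes the percentage is taken over the number of query terms.

```python
import pandas as pd

# Toy rows standing in for the real train/test data (hypothetical values).
toy = pd.DataFrame({
    'search_term': ['Steel Bracket 4', 'wood-screw', 'paint'],
    'product_title': ['4 in. steel corner bracket', 'Deck Screw 10 pack', 'Interior Paint'],
})

def tokenize(text):
    """Lowercase, turn hyphens into spaces, and split on whitespace."""
    return text.lower().replace('-', ' ').split()

def is_number(token):
    """True if the token parses as a float, e.g. '4' or '2.5'."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def title_features(row):
    query = tokenize(row['search_term'])
    title = set(tokenize(row['product_title']))
    # Count how many query terms also appear in the title.
    matches = sum(term in title for term in query)
    return pd.Series({
        # Assumption: fraction of query terms found in the title.
        'percentQueryInName': matches / len(query) if query else 0.0,
        'hasMatch': matches > 0,
        'BothHaveNumbers': any(map(is_number, query)) and any(map(is_number, title)),
    })

toy_features = toy.apply(title_features, axis=1)
print(toy_features)
```

Using a set for the title terms keeps each membership test cheap, which matters once the same idea is applied to a few hundred thousand rows.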

In [5]:
def breakDownQueryNames(df):
    
    # Lowercase everything and replace hyphens with spaces to normalize
    df['query_terms'] = df['search_term'].str.lower()
    df['query_terms'] = df['query_terms'].str.replace('-',' ')
    df['productname_terms'] = df['product_title'].str.lower()
    df['productname_terms'] = df['productname_terms'].str.replace('-',' ')
    df['query_terms'] = df['query_terms'].str.split(' ')
    df['productname_terms'] = df['productname_terms'].str.split(' ')
    
    return df

def removeLists(df):
    """
    Helper function to remove unnecessary columns for training models
    """
    
    return df.drop(['query_terms','productname_terms','material_terms'],axis=1)

def percentQueryInProductName(df):
    
    
    df['percentQueryInName'] = pd.Series()
    
    for i in range(len(df['query_terms'])):

        numQueryTerms = len(df['query_terms'][i])
        numNameTerms = len(df['productname_terms'][i])
        queryTermsInName = 0

        for j in range(numQueryTerms):

            if df['query_terms'][i][j] in df['productname_terms'][i]:

                queryTermsInName += 1

        # Fraction of the query's terms that appear in the product title
        df.loc[i,'percentQueryInName'] = float(queryTermsInName) / numQueryTerms
        
        printCompleted(i, len(df['query_terms']))
    
    return df

def printCompleted(i, total):
    
    if i % 10000 == 0:
        
        print('{0}/{1} completed!'.format(i, total))

def hasOneMatch(df):
    
    df['hasMatch'] = pd.Series()
    
    for i in range(len(df['query_terms'])):
        
        numQueryTerms = len(df['query_terms'][i])
        queryTermInName = False

        for j in range(numQueryTerms):

            if df['query_terms'][i][j] in df['productname_terms'][i]:

                queryTermInName = True

        df.loc[i,'hasMatch'] = queryTermInName
        
        printCompleted(i, len(df['query_terms']))
    
    return df

def isNumber(string):
    
    try:
        float(string)
        return True
    except ValueError:
        return False

def queryProductHaveNumeric(df):
    """
    Humans might input the wrong number, but the intention
    was to specify a number. This is a measure of whether
    a human had the intention.
    """
    
    df['BothHaveNumbers'] = pd.Series()
    
    for i in range(len(df['query_terms'])):
        
        queryNumber = False
        productNumber = False

        for word in df['query_terms'][i]:

            if isNumber(str(word)):

                queryNumber = True

        for word in df['productname_terms'][i]:

            if isNumber(str(word)):

                productNumber = True

        # Set only row i; assigning to df['BothHaveNumbers'] would overwrite the whole column
        if (queryNumber and productNumber):

            df.loc[i,'BothHaveNumbers'] = True

        else:

            df.loc[i,'BothHaveNumbers'] = False
            
        printCompleted(i, len(df['query_terms']))

    return df

def materialHasMatch(df):
    
    df['MaterialMatch'] = pd.Series()
    
    for i in range(len(df['query_terms'])):
        
        hasMatch = False
        
        for term in df['query_terms'][i]:
            
            if term in df['material_terms'][i]:

                hasMatch = True
        
        # .loc avoids chained assignment, which may silently write to a copy
        df.loc[i,'MaterialMatch'] = hasMatch
        
        printCompleted(i, len(df['query_terms']))
    
    return df

def percentMaterialMatched(df):
    
    df['percentMaterialMatched'] = pd.Series()

    for i in range(len(df['query_terms'])):
        
        numMatches = 0
        numMaterialTerms = len(df['material_terms'][i])

        for term in df['query_terms'][i]:

            if term in df['material_terms'][i]:

                numMatches += 1

        if numMatches > numMaterialTerms:
            numMatches = numMaterialTerms

        # .loc avoids chained assignment, which may silently write to a copy
        df.loc[i,'percentMaterialMatched'] = float(numMatches) / numMaterialTerms

        printCompleted(i, len(df['query_terms']))
    
    return df

def getMaterials(attributes, df):
    
    uid_materials = attributes.loc[attributes['name'] == 'Material'].drop('name', axis=1).reset_index(drop=True)
    
    # Object dtype so that each cell can hold a list of terms
    df['material_terms'] = pd.Series(dtype=object)
    uid_materials['material_terms'] = uid_materials['value'].str.lower()
    uid_materials['materials'] = uid_materials['material_terms'].str.split(' ')
    uid_materials = uid_materials.drop('material_terms', axis=1)
    
    for i in range(len(df['product_uid'])):
        
        uid = df['product_uid'][i]
        mat_df = uid_materials[uid_materials['product_uid']==uid]['materials'].to_frame()
        
        try:
            material_terms = mat_df.iloc[0]['materials']
        except IndexError:
            # No 'Material' attribute for this product
            material_terms = ['']
            
        # A missing attribute value comes through as NaN rather than a list
        if isinstance(material_terms, float) and math.isnan(material_terms):
            material_terms = ['']
        
        # .at sets a single cell and avoids chained assignment
        df.at[i,'material_terms'] = material_terms
        
        # TODO: Eliminate the parenthesis surrounding things like '(mdf)'
        
    return df
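
The row-by-row lookup in `getMaterials` could probably be replaced by a single join. The following is a minimal sketch under that assumption, not a drop-in replacement: `attributes_toy` and `combined_toy` are made-up miniature tables, and only the first `Material` row per product is kept, as in the loop above.

```python
import pandas as pd

# Hypothetical miniature versions of the attributes table and the combined data.
attributes_toy = pd.DataFrame({
    'product_uid': [100001, 100001, 100002],
    'name': ['Material', 'Color', 'Material'],
    'value': ['Galvanized Steel', 'Gray', 'Wood (MDF)'],
})
combined_toy = pd.DataFrame({
    'product_uid': [100001, 100002, 100003],
    'search_term': ['steel bracket', 'mdf board', 'paint'],
})

# Keep one 'Material' row per product, tokenized the same way as getMaterials.
materials = attributes_toy.loc[attributes_toy['name'] == 'Material',
                               ['product_uid', 'value']]
materials = materials.drop_duplicates('product_uid')
materials['material_terms'] = materials['value'].str.lower().str.split(' ')

# A left join keeps every query row; products without a material get NaN.
merged = combined_toy.merge(materials[['product_uid', 'material_terms']],
                            on='product_uid', how='left')

# Replace missing entries with [''] to mirror the fallback in getMaterials.
merged['material_terms'] = merged['material_terms'].apply(
    lambda terms: terms if isinstance(terms, list) else [''])
print(merged)
```

A join like this does the lookup once for the whole table instead of filtering the attributes for every one of the 240760 rows.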

In [72]:
combined = breakDownQueryNames(combined)

In [73]:
combined = percentQueryInProductName(combined)

0/240760 completed!
10000/240760 completed!
20000/240760 completed!
30000/240760 completed!
40000/240760 completed!
50000/240760 completed!
60000/240760 completed!
70000/240760 completed!
80000/240760 completed!
90000/240760 completed!
100000/240760 completed!
110000/240760 completed!
120000/240760 completed!
130000/240760 completed!
140000/240760 completed!
150000/240760 completed!
160000/240760 completed!
170000/240760 completed!
180000/240760 completed!
190000/240760 completed!
200000/240760 completed!
210000/240760 completed!
220000/240760 completed!
230000/240760 completed!
240000/240760 completed!


In [74]:
combined = hasOneMatch(combined)

0/240760 completed!
10000/240760 completed!
20000/240760 completed!
30000/240760 completed!
40000/240760 completed!
50000/240760 completed!
60000/240760 completed!
70000/240760 completed!
80000/240760 completed!
90000/240760 completed!
100000/240760 completed!
110000/240760 completed!
120000/240760 completed!
130000/240760 completed!
140000/240760 completed!
150000/240760 completed!
160000/240760 completed!
170000/240760 completed!
180000/240760 completed!
190000/240760 completed!
200000/240760 completed!
210000/240760 completed!
220000/240760 completed!
230000/240760 completed!
240000/240760 completed!


In [75]:
combined = queryProductHaveNumeric(combined)

0/240760 completed!
10000/240760 completed!
20000/240760 completed!
30000/240760 completed!
40000/240760 completed!
50000/240760 completed!
60000/240760 completed!
70000/240760 completed!
80000/240760 completed!
90000/240760 completed!
100000/240760 completed!
110000/240760 completed!
120000/240760 completed!
130000/240760 completed!
140000/240760 completed!
150000/240760 completed!
160000/240760 completed!
170000/240760 completed!
180000/240760 completed!
190000/240760 completed!
200000/240760 completed!
210000/240760 completed!
220000/240760 completed!
230000/240760 completed!
240000/240760 completed!


In [76]:
combined = getMaterials(attributes, combined)

In [77]:
combined = materialHasMatch(combined)

0/240760 completed!
10000/240760 completed!
20000/240760 completed!
30000/240760 completed!
40000/240760 completed!
50000/240760 completed!
60000/240760 completed!
70000/240760 completed!
80000/240760 completed!
90000/240760 completed!
100000/240760 completed!
110000/240760 completed!
120000/240760 completed!
130000/240760 completed!
140000/240760 completed!
150000/240760 completed!
160000/240760 completed!
170000/240760 completed!
180000/240760 completed!
190000/240760 completed!
200000/240760 completed!
210000/240760 completed!
220000/240760 completed!
230000/240760 completed!
240000/240760 completed!


In [78]:
combined = percentMaterialMatched(combined)

0/240760 completed!
10000/240760 completed!
20000/240760 completed!
30000/240760 completed!
40000/240760 completed!
50000/240760 completed!
60000/240760 completed!
70000/240760 completed!
80000/240760 completed!
90000/240760 completed!
100000/240760 completed!
110000/240760 completed!
120000/240760 completed!
130000/240760 completed!
140000/240760 completed!
150000/240760 completed!
160000/240760 completed!
170000/240760 completed!
180000/240760 completed!
190000/240760 completed!
200000/240760 completed!
210000/240760 completed!
220000/240760 completed!
230000/240760 completed!
240000/240760 completed!


In [79]:
combined.to_csv(cleaned_data_path, index=False)

In [6]:
combined = removeLists(combined)

In [8]:
combined = combined.drop(['product_title','product_uid','search_term'], axis=1)

In [10]:
train_cleaned = combined[~combined['relevance'].isnull()]
test_cleaned = combined[combined['relevance'].isnull()]

In [13]:
train_X = train_cleaned.drop(['id','relevance'], axis=1)
train_y = train_cleaned['relevance']

In [14]:
test_X = test_cleaned.drop(['id','relevance'], axis=1)

## Predicting the relevance

I don't go terribly wild here. I could have used more estimators, but time and computational power limit me right now.
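
One way to judge whether more estimators would be worth the extra time is to compare cross-validated error for a couple of forest sizes. The sketch below runs on synthetic data rather than the competition data; `X_demo`, `y_demo`, the choice of 10 versus 50 trees, and RMSE as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for train_X / train_y (not the competition data).
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 5)
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 0.1 * rng.randn(300)

scores = {}
for n in (10, 50):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    # scikit-learn reports negated MSE so that larger is always better.
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=3,
                              scoring='neg_mean_squared_error')
    # Convert the mean MSE across folds into an RMSE.
    scores[n] = float(np.sqrt(-neg_mse.mean()))
    print(n, 'trees -> RMSE', round(scores[n], 4))
```

If the error barely changes between the two sizes, the smaller forest is likely good enough for the features at hand.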

In [16]:
from sklearn.ensemble import RandomForestRegressor

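# min_samples_split = 1 worked in the scikit-learn version used here;
# newer releases require an integer value of at least 2
rf = RandomForestRegressor(n_estimators = 10, min_samples_split = 1)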

In [17]:
rf.fit(train_X, train_y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=1, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [18]:
predictions = rf.predict(test_X)

submission = pd.DataFrame()

submission['id'] = test_cleaned['id']

submission['relevance'] = predictions

out_path = '/home/pbnjeff/Dropbox/KaggleHomeDepotData/submission.csv'
submission.to_csv(out_path, index = False)