### Cosine Similarity features (Version 1)
A simple baseline that generates features from the cosine similarity of the search query with the product description, product title and attribute name vectors. The model first builds the vocab of the words under a column and computes the tfidf vector of that column for each training example. It then computes the tfidf values of the search query against the same vocab, which gives two vectors per example: a document vector and a search vector. Each feature is the cosine similarity of a document vector with the corresponding search vector.
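
As a rough illustration of the idea, the sketch below computes one similarity per (title, query) pair on a few made-up strings. The toy data and the names `titles`, `queries` and `sims` are assumptions for this example only, and it uses the plain TfidfVectorizer weights for the query rather than the custom query weighting defined later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data, not taken from the competition files.
titles = [
    "steel angle bracket",
    "cordless drill battery",
    "white ceiling fan",
]
queries = ["angle bracket", "drill", "garden hose"]

# Build the vocab and idf weights from the document column only.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(titles)

# Project the queries into the same vector space.
query_matrix = vectorizer.transform(queries)

# One feature per row: similarity of title i with query i.
sims = [
    cosine_similarity(doc_matrix[i], query_matrix[i])[0, 0]
    for i in range(len(titles))
]
print(sims)
```

A query with no word in the vocab, such as the last one here, maps to an all-zero vector and gets a similarity of 0.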

In [2]:
import pandas as pd
import csv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

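The following method takes two vectors, the document vector and the search vector, and returns the cosine similarity between them. Both are sparse representations of the tfidf values of the words in the vocab. The first step is to fill in the values of the search vector: for every query word found in the vocab, the entry is set to the document's tfidf value for that word multiplied by the word's relative frequency in the query.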

In [None]:
# Takes the document and search vectors and returns the cosine similarity of the 2 vectors.
def computeCosineSim(words, doc_vector, search_vector, vectorizer):
    # First store the term frequencies of the query words.
    # (Named term_counts to avoid shadowing the built-in dict.)
    term_counts = {}
    for word in words:
        term_counts[word] = term_counts.get(word, 0) + 1

    doc_array = doc_vector.toarray()
    search_array = search_vector.toarray()
    # Now compute the tfidf of the words with respect to the document to get the search vector.
    for word, count in term_counts.items():
        index = vectorizer.vocabulary_.get(word, -1)
        if index != -1:
            # Document tfidf scaled by the word's relative frequency in the query.
            search_array[0][index] = (doc_array[0][index] * count) / len(words)  # Has to be reconsidered.
    # The result is a 1x1 array; the similarity itself is at [0][0].
    return cosine_similarity(doc_array, search_array)
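
To make the weighting concrete, here is a small self-contained sketch that repeats the same update on a made-up corpus and query. The strings and the names `corpus`, `query` and `sim` are assumptions for illustration; the update rule is the one used in computeCosineSim above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical two-document corpus and query, for illustration only.
corpus = ["steel angle bracket", "cordless drill battery"]
query = "angle bracket bracket hose"

vectorizer = TfidfVectorizer().fit(corpus)
doc_array = vectorizer.transform([corpus[0]]).toarray()
search_array = vectorizer.transform([query]).toarray()

words = query.split(" ")
for word in set(words):
    index = vectorizer.vocabulary_.get(word, -1)
    if index != -1:
        # Same weighting as computeCosineSim: document tfidf times the
        # word's relative frequency in the query.
        search_array[0][index] = doc_array[0][index] * words.count(word) / len(words)

sim = cosine_similarity(doc_array, search_array)[0, 0]
print(search_array)
print(sim)
```

Because each entry is scaled by the document's own tfidf value, a query word that is in the vocab but absent from this document ends up with a weight of 0, and a word outside the vocab ("hose" here) is skipped entirely.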


The first step in extracting the features is merging the files. The preprocessing step works on 3 files: attributes.csv, train.csv and product_descriptions.csv. The motivation is that we need the text of the descriptions, titles and attributes in one place.

In [None]:
# This method takes the different product files and combines them into one on product_uid.
def preprocess():
    # Combine the rows of the attributes file into one row per product_uid.
    df_attributes = pd.read_csv("attributes.csv", encoding='latin-1')
    df_attributes['name'] = df_attributes['name'].astype(str) + " "
    df_attributes['value'] = df_attributes['value'].astype(str) + " "
    # Concatenate only the text columns; product_uid becomes the index.
    df_attributes = df_attributes.groupby('product_uid')[['name', 'value']].agg(''.join)
    # Write with the same encoding used for reading so non-ASCII text round-trips.
    df_attributes.to_csv("att_mod.csv", sep=',', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

    # Attach the combined attributes to the product descriptions.
    df_attributes = pd.read_csv("att_mod.csv", encoding='latin-1')
    df_prodDesc = pd.read_csv("product_descriptions.csv", encoding='latin-1')
    result = pd.merge(df_prodDesc, df_attributes, on='product_uid', how='left')
    result.to_csv("merged.csv", sep=',', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')

    # Merge the product titles in the train file with the product details.
    df_train = pd.read_csv("train.csv", encoding='latin-1')
    df_merged = pd.read_csv("merged.csv", encoding='latin-1')
    result = pd.merge(df_merged,
        df_train[['product_uid', 'product_title', 'search_term', 'relevance']], on='product_uid', how='inner')
    result.to_csv("training.csv", sep=',', quoting=csv.QUOTE_NONNUMERIC, encoding='latin-1')
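
The group-then-merge pattern can be tried without the real files. The sketch below uses three tiny in-memory frames whose contents are made up for illustration (only the column names follow the csv files above), and it skips the intermediate csv files.

```python
import pandas as pd

# Hypothetical miniature versions of the three input files.
attributes = pd.DataFrame({
    "product_uid": [1, 1, 2],
    "name": ["Color", "Material", "Color"],
    "value": ["White", "Steel", "Black"],
})
descriptions = pd.DataFrame({
    "product_uid": [1, 2, 3],
    "product_description": ["angle bracket", "ceiling fan", "garden hose"],
})
train = pd.DataFrame({
    "product_uid": [1, 3],
    "product_title": ["Steel Angle", "Rubber Hose"],
    "search_term": ["angle bracket", "hose"],
    "relevance": [3.0, 2.33],
})

# Collapse the attribute rows to one space-separated string per product.
attributes_per_product = (
    attributes.groupby("product_uid")[["name", "value"]]
    .agg(" ".join)
    .reset_index()
)

# The left join keeps products without attributes; the inner join keeps
# only the products that appear in the training pairs.
merged = descriptions.merge(attributes_per_product, on="product_uid", how="left")
training = merged.merge(train, on="product_uid", how="inner")
print(training)
```

Product 3 has no attribute rows, so its name and value come out as NaN after the left join. This is why the vectorizing step has to fill missing values before treating the column as text.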


#### Vectorize the corpus:
This method vectorizes the text under a given column of the merged csv file. TfidfVectorizer() builds and fits the vocab on that column. Each training example is then transformed into a tfidf vector before its cosine similarity with the search query is computed.

In [None]:
# Given the column to vectorize, this function builds the vocab from words under
# the column, gets the tfidf vector representation, computes and returns the
# cosine similarity of each row with its search query (a list of 1x1 arrays).
# Relies on the global df_merged loaded by the driver.
def vectorize(column):
    # Replace missing values so that every cell can be treated as text.
    df_merged.fillna(' ', inplace=True)
    text = df_merged[column].values.astype('U')

    # create the transform
    vectorizer = TfidfVectorizer()
    # tokenize and build vocab
    vectorizer.fit(text)

    cosine = []
    # Encode each document based on the transform.
    for t in range(len(df_merged)):
        doc_vector = vectorizer.transform([text[t]])

        # Now get the TFIDF of the search query.
        search_term = str(df_merged['search_term'][t])
        search_vector = vectorizer.transform([search_term])
        words = search_term.split(" ")
        cos = computeCosineSim(words, doc_vector, search_vector, vectorizer)
        cosine.append(cos)
    return cosine
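
The difference between fitting and transforming is easy to check on a toy corpus. The strings below are made up for illustration; the calls are the standard TfidfVectorizer ones used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus.
corpus = ["steel angle bracket", "white ceiling fan"]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)  # builds the vocab and the idf weights

vocab = sorted(vectorizer.vocabulary_)
print(vocab)

# transform() reuses the fitted vocab; unseen words are ignored.
query_vector = vectorizer.transform(["Ceiling fan remote"])
print(query_vector.shape)  # one row, one column per vocab word
print(query_vector.nnz)    # number of query words found in the vocab

# The vocab keys are lowercased by default, so a raw lookup with a
# capitalized word misses.
print(vectorizer.vocabulary_.get("Ceiling"))
```

One thing to keep in mind: the vectorizer lowercases its input by default, while words obtained with split(" ") keep their original case, so a capitalized query word will not be found by a direct vocabulary_ lookup.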


The driver of the program. Flow of execution:

1) Preprocess the data.
2) Compute the cosine similarity of the search vector with each of the product description, product title and attribute name vectors.
3) Collect the similarities in a feature matrix and print it.

In [None]:
preprocess()
df_merged = pd.read_csv("training.csv", encoding='latin-1')
# Each vectorize() result is a list of 1x1 arrays; [0][0] extracts the scalar.
columns = ['product_description', 'product_title', 'name']
similarities = [[c[0][0] for c in vectorize(column)] for column in columns]
# Transpose so that each row holds the three features of one training example.
features = [list(row) for row in zip(*similarities)]

# This matrix can be fed into a Linear Regression model.
print(features)
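
As a minimal sketch of that last step, the block below fits scikit-learn's LinearRegression on a handful of made-up feature rows. The numbers and the `relevance` targets are assumptions for illustration; in the notebook the targets would presumably come from the relevance column of training.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature rows: [description_sim, title_sim, attribute_sim].
features = [
    [0.10, 0.80, 0.00],
    [0.45, 0.60, 0.20],
    [0.00, 0.00, 0.00],
    [0.30, 0.95, 0.10],
    [0.05, 0.20, 0.00],
]
# Hypothetical relevance scores, one per feature row.
relevance = [2.67, 3.0, 1.0, 3.0, 1.67]

X = np.asarray(features)
y = np.asarray(relevance)

# Fit an ordinary least squares model on the three similarity features.
model = LinearRegression()
model.fit(X, y)

preds = model.predict(X)
rmse = np.sqrt(np.mean((preds - y) ** 2))
print(model.coef_, model.intercept_)
print(rmse)
```

The error printed here is measured on the same rows the model was fitted on, so it only shows that the pipeline runs; a held-out split would be needed to judge the features.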