# Doc2vec

We tried doc2vec to find similar products. However, the results are poor.

We infer that the poor results may due to :

1. The data set is small. Most published work trains on tens-of-thousands to millions of documents, of dozens to thousands of words each. 

2. Paramenters needed beter tuning process.

In conclusion, we would not use this function as our final recommendation function. 

In [1]:
def search_doc2vec(query):
    '''
    This function uses Doc2Vec algorithm to vectorize and score documents,
    and print the best outfit we can find in product and outfit files.
    '''
    # Input packages and dataset
    from gensim.test.utils import common_texts
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.metrics.pairwise import cosine_similarity
    import pandas as pd
    import numpy as np
    df = pd.read_csv('processed_product.csv')
    out_fit = pd.read_csv('outfit_combinations.csv')
    
    # Output is expected to be a list of dictionaries.
    output = []
    
    # check if the query is a product ID
    # If not a product ID, we would do doc2vec after.
    # If it is a product ID, print the product name.
    # Moreover, if the product ID is in human domain experts combos, print the combinations.
    out_fit_products = list(out_fit['product_id'].unique())
    if query in list(df['product_id']):
        matched_product_name = df[df['product_id']==query]['name'].values[0]
        print(f'Matched Product: {matched_product_name} ({query})\n')
        if query in out_fit_products:
            print('WOW! The product is in a great combination(s):\n')
            matched_outfits = out_fit[out_fit['product_id']==query]['outfit_id'].unique()
            for outfit in matched_outfits:
                outfit_details = out_fit[out_fit['outfit_id']==outfit].reset_index()
                output.append(dict(zip(outfit_details['outfit_item_type'],outfit_details['product_full_name']+'('+outfit_details['product_id']+')')))
            combo_idx=1
            for c in output:
                print(f'Combo {combo_idx}:\n')
                combo_idx+=1
                for i in c:
                    print(f'{i}: {c[i]}\n')
        return None  
    
    
    # Build 2 doc2vec models by name, description columns respectively
    names = [TaggedDocument(doc, [i]) for i, doc in enumerate(df['name'])]
    model_name = Doc2Vec(names, vector_size=1000, min_count=8, workers=4)
    
    descriptions = [TaggedDocument(doc, [i]) for i, doc in enumerate(df['description'])]
    model_desc = Doc2Vec(descriptions, vector_size=1000, min_count=8, workers=4)
    
    # Fit the query with 2 models 
    query = [query]
    query_vector_name = model_name.infer_vector(query)
    query_vector_desc = model_desc.infer_vector(query)
    
    
    # Combine vectors from 2 models together.
    # Since there are too many noisy words in description, we weighted name and desction vector.
    # The final vector would be 0.8 weight for name-model vectors, 0.2 weight for desc-model vector.
    similarities_lst = []
    for i in range(len(df)):
        name_vector = model_name.infer_vector([df['name'][i]])
        desc_vector = model_name.infer_vector([df['description'][i]])
        doc_vector = np.append(0.8*name_vector,0.2*desc_vector).reshape(1,-1)
        query_vector = np.append(0.8*query_vector_name,0.2*query_vector_desc).reshape(1,-1)
        similarities_lst.append(cosine_similarity(doc_vector, query_vector)[0][0])
    similarities = pd.DataFrame({'similarity':similarities_lst},index=df['product_id']).sort_values(by='similarity',ascending=False).reset_index()
    similarities = pd.merge(similarities, df[['product_id','name','product_category']],on='product_id',how='left')
    
    # Find the most similar product
    most_matched_product = similarities.loc[0,'product_id']
    most_matched_product_name = similarities.loc[0,'name']
    
    # Threshold of "good" similarity
    # Check if the most similar product is clothing
    # We found there are some products in the dataset are not clothing, such as gift card, candles, etc.
    if similarities.loc[0,'product_category']=='UNKNOWN_TOKEN':
        print('Sorry, we do not find matched product. Please check if you are searching for clothing.')
        return None    
    
    # If the most similar product in outfit dataset, print out experts recommended combos
    if most_matched_product in out_fit_products:
        print('WOW! The product you are searching for is in human domain experts recommended combos.\n')
        matched_outfits = out_fit[out_fit['product_id']==most_matched_product]['outfit_id'].unique()
        for outfit in matched_outfits:
            outfit_details = out_fit[out_fit['outfit_id']==outfit].reset_index()
            output.append(dict(zip(outfit_details['outfit_item_type'],outfit_details['product_full_name']+'('+outfit_details['product_id']+')')))
        print(f'The most recommended product is {most_matched_product_name} ({most_matched_product}).\n')
        print('Following are our recommended outfit combination:\n')
        idx = 1
        for c in output:
            print(f'Combo {idx}:\n')
            idx += 1
            for i in c:
                print(f'{i}: {c[i]}\n')
    # Otherwise, we give a recommendation by ourselves
    # We have two format of combo: [top, bottom, shoes, accessory] or [one-piece, shoes, accessory]     
    else:           
        top = similarities[similarities['product_category']=='top'].iloc[0]['name']+'('+similarities[similarities['product_category']=='top'].iloc[0]['product_id']+')'        
        bottom = similarities[similarities['product_category']=='bottom'].iloc[0]['name']+'('+similarities[similarities['product_category']=='bottom'].iloc[0]['product_id']+')'
        onepiece = similarities[similarities['product_category']=='onepiece'].iloc[0]['name']+'('+similarities[similarities['product_category']=='onepiece'].iloc[0]['product_id']+')'
        shoes = similarities[similarities['product_category']=='shoe'].iloc[0]['name']+'('+similarities[similarities['product_category']=='shoe'].iloc[0]['product_id']+')'
        accessory = similarities[similarities['product_category']=='accessory'].iloc[0]['name']+'('+similarities[similarities['product_category']=='accessory'].iloc[0]['product_id']+')'
        # if matched product is top/bottom, we recommend a combo with top, bottom, shoe, accessory
        if similarities[similarities['product_id']==most_matched_product]['product_category'].values[0] in ['top','bottom']:
            output.append({'top':top,'bottom':bottom,'shoe':shoes,'accessory':accessory})
        # if matched product is onepiece, we recommend a combo with onepiece, shoe, accessory
        elif similarities[similarities['product_id']==most_matched_product]['product_category'].values[0]=='onepiece':
            output.append({'onepiece':onepiece,'shoe':shoes,'accessory':accessory})
        # if matched product is shoe/accessory, we recommend 2 kinds of combos
        else:
            output.append({'top':top,'bottom':bottom,'shoe':shoes,'accessory':accessory})
            output.append({'onepiece':onepiece,'shoe':shoes,'accessory':accessory})
    
        print(f'The most recommended product is {most_matched_product_name} ({most_matched_product}).\n')
        print('Following are our recommended outfit combination:\n')
        idx = 1
        for c in output:
            print(f'Combo {idx}:\n')
            idx += 1
            for i in c:
                print(f'{i}: {c[i]}\n')

In [20]:
search_doc2vec('computer')

Sorry, we do not find matched product. Please check if you are searching for clothing.


In [22]:
search_doc2vec('01EWTHFH4H3GP0Q34E6JBYJZNZ')

Matched Product: clara (01EWTHFH4H3GP0Q34E6JBYJZNZ)



In [23]:
search_doc2vec('01DVA59VHYAPT4PVX32NXW91G5')

Matched Product: juan embossed mules (01DVA59VHYAPT4PVX32NXW91G5)

WOW! The product is in a great combination(s):

Combo 1:

top: Knightley Striped Cotton-Voile Shirt(01DTATDR81EZ9S7DTYW3NE1QH0)

bottom: Vanessa High-Rise Straight-Leg Jeans(01DTATGN3YQGYEPCXAD0E207TP)

shoe: Juan Embossed Mules(01DVA59VHYAPT4PVX32NXW91G5)



In [19]:
search_doc2vec('slim fitting, straight leg pant')

The most recommended product is lost coast moleskin shirt  final sale (01F22MVY1H6FRN2JDSV0TR7SME).

Following are our recommended outfit combination:

Combo 1:

top: lost coast moleskin shirt  final sale(01F22MVY1H6FRN2JDSV0TR7SME)

bottom:    straight leg jeans(01E1JKRT9RKDNHAWNK381JGTBY)

shoe: gwen flats in croc embossed leather(01DPGTHPSYZCNW17PFDX20C2B6)

accessory: soko  black capped quill dangle earrings(01EC8M5W6SMAPCW5X2RDT53G9K)



In [21]:
search_doc2vec('juan embossed mules')

WOW! The product you are searching for is in human domain experts recommended combos.
The most recommended product is juan embossed mules (01DVA59VHYAPT4PVX32NXW91G5).

Following are our recommended outfit combination:

Combo 1:

top: Knightley Striped Cotton-Voile Shirt(01DTATDR81EZ9S7DTYW3NE1QH0)

bottom: Vanessa High-Rise Straight-Leg Jeans(01DTATGN3YQGYEPCXAD0E207TP)

shoe: Juan Embossed Mules(01DVA59VHYAPT4PVX32NXW91G5)



In [24]:
search_doc2vec('high rise straight leg jeans')

WOW! The product you are searching for is in human domain experts recommended combos.
The most recommended product is high rise straight leg jeans (01DVA4XSMTZ334M7SPPW0M1EDV).

Following are our recommended outfit combination:

Combo 1:

bottom: High-Rise Straight-Leg Jeans(01DVA4XSMTZ334M7SPPW0M1EDV)

shoe: Clarita Bow-Embellished Suede Sandals(01DVA4XY7A0QMMSK3V3SBR52J9)

top: Harper Cotton Eyelet Blouse(01DVA4Y85Y5VZTKZNVEKCTDJXQ)

Combo 2:

shoe: Doey Suede Ankle Boots(01DTATDENPZ2G048Q6YTM51C91)

bottom: High-Rise Straight-Leg Jeans(01DVA4XSMTZ334M7SPPW0M1EDV)

top: Drama cropped satin blouse(01DVVFWANCGCFK1E1WXKQR5ER5)

