# Project Phase 2
# Team Purple

## Notebook Description

**Executive Summary**
- This notebook builds a recommendation system for the outfit combinations file using Product ID and/or free text.
- The notebook allows a user to enter inputs – either product IDs or product descriptions and details - and returns recommended outfits.
- We tried different vectorization techniques, using Word2Vec, Word2Vec weighted average by TF-IDF, 1-Hot encoding
- We then calculated a similarity score between the user's input and the existing Product ID, descriptions and Full names in the database, in order to find the best match for the given input

> We finally choose Word2vec embeddings (Skipgram) as our final model. It uses the 'Spacy' library to generate document vectors by averaging individual word vectors 

In [1]:
# Importing relevant libraries

import pandas as pd
import re
import nltk
import spacy
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
import numpy as np
from numpy import array, argmax, asarray, zeros
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from fuzzywuzzy import fuzz
from tabulate import tabulate
from scipy.spatial.distance import cosine
from keras.preprocessing.text import Tokenizer
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import warnings
warnings.filterwarnings("ignore")
nlp = spacy.load("en_core_web_md")

Using TensorFlow backend.


In [2]:
# Reading Outfit Combinations Provided by Experts
## We have merged Product Description from Part 1 on Product ID as unique key
data = pd.read_csv("outfit_combinations_description.csv")

# Replacing Null values in product description with blank space
data['product description'] = data['product description'].fillna('')
data.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,product description
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt,A nice skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,Eileen Fisher,Rib Mock Neck Tank,A nice tank
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel,A nice bag
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump,A nice shoe
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt,A nice skirt


In [3]:
# Defining Recommendation Function
## This function returns a randomly chosen set of outfit combinations for a given product ID

def recommendation(prod_id):
    df_product = data[data['product_id'] == prod_id].reset_index(drop = True)
    outfit_type = df_product.loc[0,"outfit_item_type"]
    print(f"Outfit type for product id {prod_id} is :",outfit_type,"\n")
    
    outfit_id_show = list(np.random.choice(a = list(df_product['outfit_id']), size = 1))
    df_outfit = data[data['outfit_id'] == outfit_id_show[0]]
    print(f"Matching Outfit ID is :",outfit_id_show[0],"\n")
    output = df_outfit['outfit_item_type']+ ": "+ df_outfit['product_full_name']+ " (" + df_outfit['product_id'] + ")"
    return output

# Product ID Input
*if user has the Product ID, they can enter it in the cell below, and get the corresponding outfit recommendations*

In [4]:
  # User Input
prod_id = "01DMBRYVA2ZFDYRYY5TRQZJTBD"            # <----- ENTER INPUT HERE


# Removes any white spaces to give a contiguous string for exact match
prod_id = ''.join(prod_id.split())

## Assigns productID similarity score to each row using Fuzzywuzzy library
## Chooses list of top 3 unique scores as suggested product IDs, in case of no exact match
## User may choose one of the suggested 3 IDs, and re-enter in the user input space (in Line 2 above)

df = data.copy()
df['fuzz_score'] = data["product_id"].apply(lambda x: fuzz.ratio(x,prod_id.upper()))
df = df.sort_values(by = 'fuzz_score', ascending = False)
matches = list(pd.Series(df['product_id'].unique())[:3])

# If a perfect match is found then recommendation function is called
if (df['fuzz_score'] == 100).any():
    output = pd.DataFrame(recommendation(prod_id),columns = ["Recommended Outfit Combination:"]).reset_index(drop = True)
    print(tabulate(output, showindex=False, headers=df.columns))
   
# If a perfect match is not found then similar product IDs are suggested
else:
   print(f'{prod_id} not found\n\nSuggested Product IDs {matches}')
    

Outfit type for product id 01DMBRYVA2ZFDYRYY5TRQZJTBD is : shoe 

Matching Outfit ID is : 01DMHRX35M2DPVYVQ1PNER4S4B 

outfit_id
------------------------------------------------------------
onepiece: Chemelle Midi Dress (01DMBRYVA2Q2ST7MNYR6EEY4TK)
shoe: Penelope Mid Cap Toe Pump (01DMBRYVA2ZFDYRYY5TRQZJTBD)
accessory1: Crystal Clutch (01DMHCNT41E14QWP503V7CT9G6)


# Product Description Input
*if user does not have a Prouct ID, they can enter the prodcut's Brand and/or description in the cell below, and get the corresponding outfit recommendations*

In [25]:
# Brand and Description input
## ENTER Brand and product description information below
## In case any information is missing, jut enter blank string, i.e., ''

brand = "Reformation"
description = "Sexy silky, a-line mini skirt zipper Benson skirt" 

### Cleaning the input text

In [26]:
# Stores the descriotion in a temporary test variable
test_desc = description

# Remove Punctuations from the input text
punctuation = "!@#$%^&*()_+<>?:.,;"  
    
for c in test_desc:
    if c in punctuation:
        test_desc = test_desc.replace(c, "")

# Remove Stopwords from input text
stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(test_desc) 
test_desc = [w for w in word_tokens if not w in stop_words] 
test_desc = [] 
for w in word_tokens: 
    if w not in stop_words: 
        test_desc.append(w) 
test_desc = ' '.join(test_desc)
test_desc

'Sexy silky a-line mini skirt zipper Benson skirt'

### Determine Outfit Type
- Find the most relevant words (eg: common nouns) associated with each outfit item type
- When a test query/document is submitted on user interface, this query is parsed to check with what outfit item type(s) it matches using regular expression
- Once we know the possible outfit item types,
we find the most similar product by filtering dataset on these outfit item types only.

In [27]:
# Regex to identify right category for filtering
shoe=r'(boot|sandal|pump|mule|sneaker|loafer|slingback|flat|slide|croc)'
top=r'(shirt|sweater|top|blouse|turtleneck|jersey|tee|bodysuit|neck|sleeve|jacket|coat|cardigan|blazer|sweater|hoodie|pullover|bomber|vest|camisole|dickey|puffer)'
bottom=r'(leg|pant|skirt|jean|rise|midi|short|trouser)'
onepiece = r'(dress|jumpsuit|wrap|stretch|maxi|midi|larina|francoise|polka|shirt|sweater|top|blouse|turtleneck|jersey|tee|bodysuit|neck|sleeveleg|pant|skirt|jean|rise|short|trouser|jacket|coat|cardigan|blazer|sweater|hoodie|pullover|bomber|mirella|vest|camisole|dickey|charmeuse|puffer)'
accessory1=r'(bag|tote|croc|tori|clutch|mini|scarf|cabinet|top|bucket|backpack|hammock|belt|lazo|handle|box|saddle|amal|protea|drawstring|saffiano|camera|wallet|chain|charmeuse|pouch|puffer|margaux|jacket|coat|cardigan|wrap|belt|blazer|sweater|shirt|hoodie|dickey|camisole|sunglasses|vest|shawl|mirella|pullover|bomber|aviator)'
accessory2=r'(bag|tote|croc|tori|clutch|mini|scarf|cabinet|top|bucket|backpack|hammock|belt|lazo|handle|box|saddle|amal|protea|drawstring|saffiano|camera|wallet|chain|charmeuse|pouch|puffer|margaux|jacket|coat|cardigan|wrap|belt|blazer|sweater|shirt|hoodie|dickey|camisole|sunglasses|vest|shawl|mirella|pullover|bomber|aviator)'
accessory3= r'(coat)'

In [28]:
# Determining Potential outfit categories using the above Regex

# outfitTypes is a dictionary to map 'outfit item type' with it's regular expression created above
outfitTypes={'top':top,'bottom':bottom,'shoe':shoe,'onepiece':onepiece,'accessory1':accessory1,'accessory2':accessory2,'accessory3':accessory3}
outfits = [outfit for outfit in outfitTypes if re.search(outfitTypes[outfit],test_desc,flags=re.IGNORECASE)]
outfits

['bottom', 'onepiece', 'accessory1', 'accessory2']

In [29]:
# Filtering data to the outfit categories found above
## We use this filtering to search in a subset of the dataframe (only for the categories identified)
## This will make search faster and give more accurate results

outfit_data = data.copy()

if outfits != []:
    outfit_data = data[data['outfit_item_type'].isin(outfits)].reset_index(drop = True)
outfit_data.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,product description
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt,A nice skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel,A nice bag
2,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt,A nice skirt
3,01DMHCX50CFX5YNG99F3Y65GQW,01DMHCNT41E14QWP503V7CT9G6,accessory1,Nina,Crystal Clutch,A nice clutch
4,01DMHRX35M2DPVYVQ1PNER4S4B,01DMBRYVA2Q2ST7MNYR6EEY4TK,onepiece,Equipment,Chemelle Midi Dress,A nice dress


### Determine Brand
- Find close matches to the brand entered by the user, based on FuzzyWuzzy library, with cutoff of 85
- If a close match is found, we filter the outfit data further for the given brand

In [30]:
# Filtering data with specific brand if a brand match is found in the data

brand_data = outfit_data.copy()

if brand != "":
    brand_data['fuzz_score'] = brand_data["brand"].str.lower().apply(lambda x: fuzz.ratio(x,brand.lower()))
    brand_data = brand_data[brand_data['fuzz_score'] > 85].drop('fuzz_score',axis=1)

brand_data.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,product description
8,01DQ63P636Q4BQVCKT6Z4S41G5,01DPKMGJ33SDFXM7XHGPQJWQ12,bottom,Reformation,Benson Skirt,Sexy silky. This is an a-line mini skirt with ...
10,01DQ86EH3GMXAVKNECH2Z6FCSV,01DPKNJ6J1NQPQ1D3DBKWK5ARS,onepiece,Reformation,Rosamund Dress,Let your dress do the work. This is a midi len...
12,01DQ8KWVX1GBJTPTVDAC6NQ9B4,01DPKNJA2K022V3KP077611MKC,bottom,Reformation,Everett Pant,It's cold. Put some pants on. This is a high r...
14,01DQ8ME3M3QS9MQGZCQHXDHE1R,01DPKMH0D252JKMAA27MFCT5GM,bottom,Reformation,Marlon Pant,Let your pants do the talking. This is a slim ...
15,01DQ8MQAVBFSGHJXCF5JCYJ7A6,01DPKMKG14KT68YQY0MWA1CAA8,bottom,Reformation,Julia Crop High Cigarette Jean,"Better butts. This is a high rise, rigid jean ..."


### Declaring Relevant functions

In [31]:
## Finding most similar document's product ID and relevant outfit combination for the dataset filtered  from above
## We calculate 2 separate cosine similarity scores: one for Description and 1 for Product_Full_Name

def subset_prodid(test_doc): 
    df2 = brand_data.copy()

    test=[]
    score_desc=[]
    score_full_name = []
    for idx,row in df2.iterrows():
        descr=row['product description']
        full_name = row['product_full_name']
        org=nlp(descr)
        org2 = nlp(full_name)
        score_desc.append(test_doc.similarity(org))
        score_full_name.append(test_doc.similarity(org2))
        test.append(test_doc)
    df2['test_doc']=test
    df2['score_sim_full_name'] = score_full_name
    df2['score_sim_desc']=score_desc

    # If the user input is longer than 20 characters, we give higher weightage to Description
    if (len_test_desc>20):
        df2['score_sim'] = 0.7*df2['score_sim_desc'] + 0.3* df2['score_sim_full_name']
    
    # If the user input is less than 20 characters, we take the maximum of Similarity scores received from Description and Full_name
    # This gives higher weightage to Full_Name as Full Name column is generally 15-20 characters
    else:
        df2['score_sim'] = df2[['score_sim_desc','score_sim_full_name']].max(axis = 1) 
    df2 = df2.sort_values(by='score_sim',ascending=False).reset_index()
    df2.head()
    tar_prodid=df2.loc[0,"product_id"] 
    tar_prodid  
    return tar_prodid   

# This function generates a word2vec vector, does a weighted average TF-IDF score, to give higher weightage to relevant words

def word2vec_TFIDF(data,X):
    X = X.transform(data)

    tf_idf_lookup_table = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
    

    DOCUMENT_SUM_COLUMN = "DOCUMENT_TF_IDF_SUM"

    # sum the tf idf scores for each document
    tf_idf_lookup_table[DOCUMENT_SUM_COLUMN] = tf_idf_lookup_table.sum(axis=1)
    available_tf_idf_scores = tf_idf_lookup_table.columns # a list of all the columns we have
    available_tf_idf_scores = list(map( lambda x: x.lower(), available_tf_idf_scores)) # lowercase everything

    row_vectors = []
    for idx, row in enumerate(data): # iterate through each review
        tokens = nlp(row) # have spacy tokenize the review text

        # initially start a running total of tf-idf scores for a document
        total_tf_idf_score_per_document = 0

        # start a running total of initially all zeroes (300 is picked since that is the word embedding size used by word2vec)
        running_total_word_embedding = np.zeros(300) 
        for token in tokens: # iterate through each token

        # if the token has a pretrained word embedding it also has a tf-idf score
            if token.has_vector and token.text.lower() in available_tf_idf_scores:

                tf_idf_score = tf_idf_lookup_table.loc[idx, token.text.lower()]
                #print(f"{token} has tf-idf score of {tf_idf_lookup_table.loc[idx, token.text.lower()]}")
                running_total_word_embedding += tf_idf_score * token.vector

                total_tf_idf_score_per_document += tf_idf_score

        # divide the total embedding by the total tf-idf score for each document
        document_embedding = running_total_word_embedding / total_tf_idf_score_per_document
        row_vectors.append(document_embedding)
    return row_vectors

# Method 1 - Using Word Embeddings from Spacy
# (Final Output)

In [32]:
# Calculating the total # of characters of the input query
len_test_desc = sum(len(word) for word in test_desc)

# Calculating scores on filtered subset of the original dataframe and returns outfit recommendations
prod_id = subset_prodid(nlp(test_desc))
output = pd.DataFrame(recommendation(prod_id),columns = ["Recommended Outfit Combination:"]).reset_index(drop = True)
print(tabulate(output, showindex=False, headers=df.columns))

Outfit type for product id 01DPKMGJ33SDFXM7XHGPQJWQ12 is : bottom 

Matching Outfit ID is : 01DQ63P636Q4BQVCKT6Z4S41G5 

outfit_id
-------------------------------------------------------------
shoe: Pointed-toe flats in suede (01DPCRZWX4S2Z8Q5HYDFM4HNEG)
top: Ashlynn Blouse (01DPET2NWSA221STZF740BZ9SW)
bottom: Benson Skirt (01DPKMGJ33SDFXM7XHGPQJWQ12)


# Other Methods Tried

# Method 2 - Using Weighted Average Word Embeddings 

In [13]:
brand_data.head(1)

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name,product description
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump,A nice shoe


In [19]:
# pd.set_option('display.max_colwidth', None)

data_list = list(brand_data['product_full_name'] + ' ' + brand_data['product description'])

vectorizer = TfidfVectorizer()
X = vectorizer.fit(data_list)
train_vec = word2vec_TFIDF(data_list,X)

test_vec = word2vec_TFIDF([test_desc],X)
test_vec = [list(test_vec[0])]

sim_score = []

for i in range(brand_data.shape[0]):
     train_vec[i] = list(train_vec[i])
     score =  float(cosine_similarity([train_vec[i]],test_vec))
     sim_score.append(score)

max_row = sim_score.index(max(sim_score))
recommendation(data.loc[max_row,"product_id"])

Outfit type for product id 01DMHCNT41E14QWP503V7CT9G6 is : accessory1 

Matching Outfit ID is : 01DMHRX35M2DPVYVQ1PNER4S4B 



8     onepiece: Chemelle Midi Dress (01DMBRYVA2Q2ST7...
9     shoe: Penelope Mid Cap Toe Pump (01DMBRYVA2ZFD...
10    accessory1: Crystal Clutch (01DMHCNT41E14QWP50...
dtype: object

# Method 3 - One Hot Encoding + Cosine Similarity

In [20]:
vectorizer = CountVectorizer(stop_words="english", binary=True)
H = vectorizer.fit(data_list)
train_vec = H.transform(data_list)
train_vec_df = pd.DataFrame(train_vec.toarray(), columns=vectorizer.get_feature_names())
train_vec_df.head()

Unnamed: 0,01,06,100,100mm,105,105mm,12,15mm,1774,19,...,wrapped,wraps,www,years,young,zebra,zip,zipper,zippers,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
test_vec = H.transform(test_desc).toarray()
test_vec = [list(test_vec[0])]

In [24]:
sim_score = []

for i in range(brand_data.shape[0]):
    train_vec = list(train_vec_df.iloc[i,:])
    score =  float(cosine_similarity([train_vec],test_vec))
    sim_score.append(score)

max_row = sim_score.index(max(sim_score))
recommendation(brand_data.loc[max_row,"product_id"])

Outfit type for product id 01DMBRYVA2ZFDYRYY5TRQZJTBD is : shoe 

Matching Outfit ID is : 01DDBHC62ES5K80P0KYJ56AM2T 



0    bottom: Slim Knit Skirt (01DMBRYVA2P5H24WK0HTK...
1    top: Rib Mock Neck Tank (01DMBRYVA2PEPWFTT7RMP...
2    accessory1: medium margaux leather satchel (01...
3    shoe: Penelope Mid Cap Toe Pump (01DMBRYVA2ZFD...
dtype: object