# TEAM PURPLE
## APPENDIX to 'NLP Part2 Team Purple Code.ipynb'

**Notebook Description**

* In this notebook we find the most relevant words (e.g., common nouns) associated with each outfit item type. When a test query/document is submitted through the user interface, it is matched against a regular expression for each outfit item type to determine which type(s) it belongs to.

* Once the possible outfit item types are known, we find the most similar product by filtering the dataset to those outfit item types only.

* This approach has reduced false positives to a minimum: without it, finding exact/similar products from the description alone sometimes returned irrelevant matches.

In [1]:
##Importing required libraries

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from functools import reduce
import re

In [3]:
# Reading input file

data = pd.read_csv("outfit_combinations.csv")
data.head()

Unnamed: 0,outfit_id,product_id,outfit_item_type,brand,product_full_name
0,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt
1,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2PEPWFTT7RMP5AA1T,top,Eileen Fisher,Rib Mock Neck Tank
2,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2S5T9W793F4CY41HE,accessory1,kate spade new york,medium margaux leather satchel
3,01DDBHC62ES5K80P0KYJ56AM2T,01DMBRYVA2ZFDYRYY5TRQZJTBD,shoe,Tory Burch,Penelope Mid Cap Toe Pump
4,01DMHCX50CFX5YNG99F3Y65GQW,01DMBRYVA2P5H24WK0HTK4R0A1,bottom,Eileen Fisher,Slim Knit Skirt


### TFIDF Vectorized Scores

* The function below takes the product full names for one outfit item type at a time.
* It computes the TF-IDF score of every word in those product names, sums the scores across all products of that outfit item type, and returns the words sorted by summed score in descending order.

In [4]:
# Function to generate TF-IDF Vector

vectorizer = TfidfVectorizer(token_pattern=r'\b[a-zA-Z0-9]{3,}\b',
                             min_df=0.001,
                            stop_words=stopwords.words('english'))

def vectorizeProductFullName(df,outfitType):
    X = vectorizer.fit_transform(df)
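    # get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names() instead
    terms = vectorizer.get_feature_names_out()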
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)
    tf_idf = tf_idf.sum(axis=1)
    outfit = pd.DataFrame(tf_idf, columns=[outfitType+'_score'])
    outfit["term"] = terms
    outfit["category"] = outfitType
    outfit.sort_values(by=outfitType+'_score', ascending=False, inplace=True)
    return outfit
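
To make the summed score concrete, here is a minimal pure-Python sketch of the same idea. It assumes scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document) and reuses the token pattern above, but it omits the stop-word list and the `min_df` filter. The helper name `summed_tfidf` and the sample product names are hypothetical and are not part of the notebook.

```python
import math
import re
from collections import Counter

# Same token pattern as the vectorizer above: alphanumeric tokens of 3+ characters
TOKEN = re.compile(r"\b[a-zA-Z0-9]{3,}\b")

def summed_tfidf(names):
    """Return {term: TF-IDF score summed over all product names}.

    Approximates TfidfVectorizer defaults; stop words and min_df are ignored.
    """
    docs = [Counter(TOKEN.findall(name.lower())) for name in names]
    n_docs = len(docs)
    # Document frequency: in how many product names each term occurs
    df = Counter(term for doc in docs for term in doc)

    scores = Counter()
    for doc in docs:
        weights = {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                   for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0:  # product name with no usable tokens
            continue
        for t, w in weights.items():
            scores[t] += w / norm  # L2-normalized weight, summed per term
    return dict(scores)

# Hypothetical toy inputs, not taken from outfit_combinations.csv
single = summed_tfidf(["Asymmetric Cotton Trench Coat"])
several = summed_tfidf(["leather ankle boots", "suede ankle boots", "leather pumps"])
# single: each of the four terms scores 0.5
```

A single product name with four distinct tokens gives every term a score of 1/sqrt(4) = 0.5, which is consistent with the accessory3 scores shown later in this appendix. Across several names, a term that appears in many products accumulates a higher sum even though each individual occurrence has a lower IDF.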

#### Category: Shoe

In [5]:
## filter dataset on shoe outfit and get the product_full_name
shoe_df = data[data['outfit_item_type']=='shoe']['product_full_name'].tolist()

In [6]:
## Get the TF-IDF score for the words in shoe outfit
shoe = vectorizeProductFullName(shoe_df,'shoe')
shoe.reset_index(drop=True,inplace=True)

In [7]:
## Here we will find most relevant words associated with shoes outfit
shoe.head(30)

Unnamed: 0,shoe_score,term,category
0,118.846684,leather,shoe
1,96.593574,boots,shoe
2,81.333578,ankle,shoe
3,71.811042,suede,shoe
4,61.301217,sandals,shoe
5,61.222516,pumps,shoe
6,44.062204,mules,shoe
7,43.332791,effect,shoe
8,42.119832,snake,shoe
9,40.608446,sneakers,shoe


#### Category: Top

In [8]:
## filter dataset on top outfit and get the product_full_name
top_df = data[data['outfit_item_type']=='top']['product_full_name'].tolist()

In [11]:
## Get the TF-IDF score for the words in top outfit
top = vectorizeProductFullName(top_df,'top')
top.reset_index(drop=True,inplace=True)

In [12]:
## Here we will find most relevant words associated with top outfit
top.head(30)

Unnamed: 0,top_score,term,category
0,66.810834,shirt,top
1,58.766229,sweater,top
2,56.77674,silk,top
3,52.190902,cotton,top
4,51.345017,top,top
5,43.847669,wool,top
6,42.782341,blend,top
7,41.965295,blouse,top
8,41.783438,turtleneck,top
9,36.613681,satin,top


#### Category: Bottom

In [13]:
## filter dataset on bottom outfit and get the product_full_name
bottom_df = data[data['outfit_item_type']=='bottom']['product_full_name'].tolist()

In [14]:
## Get the TF-IDF score for the words in bottom outfit
bottom = vectorizeProductFullName(bottom_df,'bottom')
bottom.reset_index(drop=True,inplace=True)

In [15]:
## Here we will find most relevant words associated with bottom outfit
bottom.head(30)

Unnamed: 0,bottom_score,term,category
0,80.074348,leg,bottom
1,75.137667,pants,bottom
2,71.064973,skirt,bottom
3,64.417121,wide,bottom
4,62.404956,rise,bottom
5,61.726302,jeans,bottom
6,61.051533,high,bottom
7,50.211487,midi,bottom
8,41.212341,cotton,bottom
9,38.242021,cropped,bottom


#### Category: Onepiece

In [16]:
## filter dataset on onepiece and get the product_full_name
onepiece_df = data[data['outfit_item_type']=='onepiece']['product_full_name'].tolist()

In [17]:
## Get the TF-IDF score for the words in onepiece
onepiece = vectorizeProductFullName(onepiece_df,'onepiece')
onepiece.reset_index(drop=True,inplace=True)

In [18]:
## Here we will find most relevant words associated with onepiece
onepiece.head(30)

Unnamed: 0,onepiece_score,term,category
0,28.83809,dress,onepiece
1,16.207471,mini,onepiece
2,13.048706,cotton,onepiece
3,12.389663,linen,onepiece
4,12.140031,jumpsuit,onepiece
5,11.556358,wrap,onepiece
6,11.330612,crepe,onepiece
7,10.406386,silk,onepiece
8,9.981422,midi,onepiece
9,9.513179,floral,onepiece


#### Category: Accessory1

In [19]:
## filter dataset on accessory1 and get the product_full_name
accessory1_df = data[data['outfit_item_type']=='accessory1']['product_full_name'].tolist()

In [20]:
## Get the TF-IDF score for the words in accessory1
accessory1 = vectorizeProductFullName(accessory1_df,'accessory1')
accessory1.reset_index(drop=True,inplace=True)

In [21]:
## Here we will find most common and relevant words associated with accessory1
accessory1.head(30)

Unnamed: 0,accessory1_score,term,category
0,131.773751,leather,accessory1
1,111.024354,bag,accessory1
2,77.260215,shoulder,accessory1
3,60.858545,tote,accessory1
4,46.544087,small,accessory1
5,38.989623,croc,accessory1
6,38.046061,clutch,accessory1
7,36.532774,mini,accessory1
8,35.516223,textured,accessory1
9,34.775623,large,accessory1


#### Category: Accessory2

In [22]:
## filter dataset on accessory2 and get the product_full_name
accessory2_df = data[data['outfit_item_type']=='accessory2']['product_full_name'].tolist()

In [23]:
## Get the TF-IDF score for the words in accessory2
accessory2 = vectorizeProductFullName(accessory2_df,'accessory2')
accessory2.reset_index(drop=True,inplace=True)

In [24]:
## Here we will find most relevant words associated with accessory2
accessory2.head(30)

Unnamed: 0,accessory2_score,term,category
0,69.166653,wool,accessory2
1,65.922846,jacket,accessory2
2,63.089966,coat,accessory2
3,58.555551,cardigan,accessory2
4,49.313984,wrap,accessory2
5,48.041975,blend,accessory2
6,42.934633,cashmere,accessory2
7,40.659035,leather,accessory2
8,36.495422,cotton,accessory2
9,35.404966,bag,accessory2


#### Category: Accessory3

In [9]:
## filter dataset on accessory3 and get the product_full_name
accessory3_df = data[data['outfit_item_type']=='accessory3']['product_full_name'].tolist()

In [10]:
## Get the TF-IDF score for the words in accessory3
accessory3 = vectorizeProductFullName(accessory3_df,'accessory3')
accessory3.reset_index(drop=True,inplace=True)

In [11]:
## Here we will find most relevant words associated with accessory3
accessory3.head(30)

Unnamed: 0,accessory3_score,term,category
0,0.5,asymmetric,accessory3
1,0.5,coat,accessory3
2,0.5,cotton,accessory3
3,0.5,trench,accessory3


### Regular Expressions
* In the cell below we have prepared a regular expression for each outfit item type using the relevant words (preferably nouns) found in the cells above.
* If a relevant word appears in more than one outfit item type, we have included it in the regular expressions of all those outfit item types.
* Similarly, we have included the words unique to each outfit item type (from the cells above). So, if a user enters a product description that is unique to one outfit item type, we narrow our search to that specific outfit item type.

In [12]:
#Regular expressions for each of the outfit item types

shoe=r'(boot|sandal|pump|mule|sneaker|loafer|slingback|flat|slide|croc)'
top=r'(shirt|sweater|top|blouse|turtleneck|jersey|tee|bodysuit|neck|sleeve|jacket|coat|cardigan|blazer|sweater|hoodie|pullover|bomber|vest|camisole|dickey|puffer)'
bottom=r'(leg|pant|skirt|jean|rise|midi|short|trouser)'
onepiece = r'(dress|jumpsuit|wrap|stretch|maxi|midi|larina|francoise|polka|shirt|sweater|top|blouse|turtleneck|jersey|tee|bodysuit|neck|sleeve|leg|pant|skirt|jean|rise|short|trouser|jacket|coat|cardigan|blazer|sweater|hoodie|pullover|bomber|mirella|vest|camisole|dickey|charmeuse|puffer)'
accessory1=r'(bag|tote|croc|tori|clutch|mini|scarf|cabinet|top|bucket|backpack|hammock|belt|lazo|handle|box|saddle|amal|protea|drawstring|saffiano|camera|wallet|chain|charmeuse|pouch|puffer|margaux|jacket|coat|cardigan|wrap|belt|blazer|sweater|shirt|hoodie|dickey|camisole|sunglasses|vest|shawl|mirella|pullover|bomber|aviator)'
accessory2=r'(bag|tote|croc|tori|clutch|mini|scarf|cabinet|top|bucket|backpack|hammock|belt|lazo|handle|box|saddle|amal|protea|drawstring|saffiano|camera|wallet|chain|charmeuse|pouch|puffer|margaux|jacket|coat|cardigan|wrap|belt|blazer|sweater|shirt|hoodie|dickey|camisole|sunglasses|vest|shawl|mirella|pullover|bomber|aviator)'
accessory3= r'(coat)'
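
Long hand-typed alternations are easy to get wrong: a missing `|` silently merges two keywords into one that never matches on its own. As a hedged alternative, the patterns could be generated from keyword lists. The helper `build_pattern` and the list name `bottom_keywords` below are hypothetical and not part of the notebook; the keywords are the ones used for the bottom pattern above.

```python
import re

def build_pattern(keywords):
    """Join keywords into one alternation group, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(keywords))
    # re.escape protects any keyword that contains regex metacharacters
    return '(' + '|'.join(re.escape(word) for word in unique) + ')'

bottom_keywords = ['leg', 'pant', 'skirt', 'jean', 'rise', 'midi', 'short', 'trouser']
bottom_pattern = build_pattern(bottom_keywords)
```

Keeping the keywords in lists also makes it straightforward to share a common list between outfit item types that use the same words, such as accessory1 and accessory2.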

In [13]:
# Test description
# 'bucket' belongs to Accessory1 and Accessory2 only
# Thus, our output should recognize both these categories

description = 'bucket'

In [14]:
# outfitTypes is a dictionary that maps each outfit item type to its regular expression created above
outfitTypes={'top':top,'bottom':bottom,'shoe':shoe,'onepiece':onepiece,'accessory1':accessory1,'accessory2':accessory2,'accessory3':accessory3}

# Parse test description and return its corresponding outfit item types in a list called outfits
outfits = [outfit for outfit in outfitTypes if re.search(outfitTypes[outfit],description,flags=re.IGNORECASE)]
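
Because `re.search` matches substrings, a short keyword such as `top` also matches inside longer, unrelated words. The sketch below compares a shortened version of the top pattern with a stricter variant that uses word boundaries and an optional plural `s`. The stricter variant is an assumption about a possible refinement, not what the notebook does, and the names `loose`, `strict`, and the sample queries are hypothetical.

```python
import re

loose = r'(shirt|sweater|top|blouse)'             # same style as the patterns above
strict = r'\b(?:shirt|sweater|top|blouse)s?\b'    # whole words only, optional plural

# 'stopwatch' contains the substring 'top' but is not a top
loose_hit = bool(re.search(loose, 'stopwatch', flags=re.IGNORECASE))
strict_hit = bool(re.search(strict, 'stopwatch', flags=re.IGNORECASE))

# Plurals still match with the stricter pattern
plural_hit = bool(re.search(strict, 'silk tops', flags=re.IGNORECASE))
```

The trade-off is that whole-word matching misses compounds such as "sweatshirt", which the substring approach catches, so the choice depends on which kind of error matters more for the queries being handled.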

In [15]:
# 'bucket' is common to accessory1 and accessory2, so both are returned.
# The main notebook uses this outfits list to narrow down the dataset.
outfits

['accessory1', 'accessory2']
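
The narrowing step itself is not shown in this appendix. The following is a minimal sketch of how the matched outfit item types could be used to filter candidate products, written in plain Python so that it is self-contained (the main notebook presumably filters a pandas DataFrame instead). The function `match_outfit_types`, the shortened patterns, and the query string are hypothetical; the three rows are copied from the `data.head()` output above.

```python
import re

def match_outfit_types(description, patterns):
    """Return the outfit item types whose pattern occurs in the description."""
    return [name for name, pattern in patterns.items()
            if re.search(pattern, description, flags=re.IGNORECASE)]

# Shortened patterns, for illustration only
patterns = {
    'top': r'(shirt|sweater|top|blouse)',
    'bottom': r'(leg|pant|skirt|jean)',
    'shoe': r'(boot|sandal|pump|mule)',
}

rows = [
    {'outfit_item_type': 'bottom', 'brand': 'Eileen Fisher',
     'product_full_name': 'Slim Knit Skirt'},
    {'outfit_item_type': 'top', 'brand': 'Eileen Fisher',
     'product_full_name': 'Rib Mock Neck Tank'},
    {'outfit_item_type': 'shoe', 'brand': 'Tory Burch',
     'product_full_name': 'Penelope Mid Cap Toe Pump'},
]

matched = match_outfit_types('black knit skirt', patterns)
# Keep only products whose outfit item type matched the query
candidates = [row for row in rows if row['outfit_item_type'] in matched]
```

Here only the bottom pattern matches, so the similarity search would run against the single skirt row rather than all three products. How to handle a query that matches no outfit item type (for example, falling back to the full dataset) is left to the caller.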