# Instructions


a.) If we want to understand which catalog (such as clothing, shoes, accessories, beauty, jewelry etc.) each item is, how will you make that happen?


b.) How can you extract the additional information from the item names, such as the color, style, size, material, gender etc. if there is any?


c.) A plus if you write the queries/code, or build a machine learning model, to achieve the goal of a.) or b.) above on the attached dataset. Python is preferred.

## Part A

To determine which catalog each item belongs to, I built topic models to cluster the items, then manually mapped each resulting topic to a catalog label.




#### Processing included:
1. removing punctuation
2. lowercasing
3. lemmatizing
4. removing stop words
5. keyword extraction using YAKE

The processed dataset can be found in `data/processed-data-with-keywords.csv`
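
As a rough illustration of the first steps in the list above, the sketch below handles punctuation, lowercasing, and stop words using only the standard library. The `preprocess` function and the tiny `STOP_WORDS` set are assumptions made for this example, not the code used to build the processed dataset. Lemmatizing and YAKE keyword extraction are left out because they need third-party libraries.

```python
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the one shipped with an NLP library.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}

# Map every punctuation character to a space so "Watch-For" stays two tokens.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(name: str) -> list[str]:
    """Lowercase an item name, strip punctuation, and drop stop words."""
    # Remove apostrophes first so "Women's" becomes "womens", not "women s".
    cleaned = name.lower().replace("'", "").translate(PUNCT_TO_SPACE)
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


print(preprocess("Rochees RW38 Analog Watch - For Boys"))
# ['rochees', 'rw38', 'analog', 'watch', 'boys']
```

A lemmatizer would then be applied to each token before keyword extraction.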

#### BERTopic

To cluster the items, I passed the keywords extracted by YAKE to the BERTopic library for topic modeling.

Looking at some clusters based on common words:

![topics](output/topwords.png)


Note: the original output is an interactive HTML page that lets the user click on topics and sort by keyword. Because I ran the models in Google Colab (to use the GPU), I was unable to export the HTML, so I have included a few screenshots here instead.

![topic1](output/topics1.png)

![topic2](output/topics2.png)

![topic3](output/topics3.png)

Given these common words per topic, a human can then map them to a category name.

**Topic 0:** pencil, box, canvas --> Art supplies

**Topic 1:** ceramic, ceramic mug, ceramic printland --> Home goods

**Topic 2:** necklace, alloy necklace, necklace indian --> Jewelry 
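
The manual mapping step can be sketched as a plain dictionary from topic id to catalog label. The three entries come from the topics listed above; `label_topic`, the `"unlabeled"` fallback, and the example `assignments` are hypothetical names and data introduced for illustration.

```python
# Hand-written mapping from topic id to catalog label, covering only the
# three topics listed above; a full mapping would cover every topic found.
TOPIC_LABELS = {
    0: "Art supplies",
    1: "Home goods",
    2: "Jewelry",
}


def label_topic(topic_id: int) -> str:
    """Return the catalog label for a topic id, or 'unlabeled' if unmapped."""
    # BERTopic assigns -1 to outlier documents, so unmapped ids are expected.
    return TOPIC_LABELS.get(topic_id, "unlabeled")


# Hypothetical (item name, topic id) pairs standing in for model output.
assignments = [
    ("example necklace item", 2),
    ("example mug item", 1),
    ("example outlier item", -1),
]
for name, topic_id in assignments:
    print(name, "->", label_topic(topic_id))
```

Keeping the mapping in one place makes it easy to review and extend as new topics are labeled.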


The full list of product names and their corresponding topics can be found in `output/new_topics_labels.csv`

You can also add new items and discover their most closely related topic(s).

![newterm](output/newterm.png)

```python
# some sample labels
import pandas as pd

sample_labels = pd.read_csv('output/sample_labels.csv')
sample_labels.head(20)
```

| | raw | topic_label |
|---|---|---|
| 0 | Steppings Trendy Boots | footwear |
| 1 | Rochees RW38 Analog Watch - For Boys | watch |
| 2 | Rorlig RR-028 Expedition Analog Watch - For M... | watches |
| 3 | Catwalk Boots | footwear |
| 4 | Magnum Footwear Lifestyle | footwear |
| 5 | Rialto Boots | footwear |
| 6 | Alfajr WY16B Youth Digital Watch - For Men, Boys | watches |
| 7 | La Briza Ashley Boots | womens clothing |
| 8 | TAG Heuer CAU1116.BA0858 Formula 1 Analog Watc... | watch |
| 9 | Salt N Pepper 13-019 Femme Black Boots Boots | footwear |


## Pros and cons

Pros - 
1. Quick, easily interpretable output
2. Can use multi-lingual BERT models



Cons - 
1. Requires human labeling - not ideal
2. Difficult to detect small topics or overlapping topics without humans

## Potential Improvements
1. Add labels to enable semi-supervised learning with a predefined partial topic list
2. Try other methods for dimensionality reduction (the pipeline currently uses UMAP for dimensionality reduction and HDBSCAN for clustering; an autoencoder may be worth trying)

---

#### LDA Topic Model
I initially tried gensim's implementation of Latent Dirichlet Allocation (LDA), but the results were poor: judging by my spot checks, the clusters did not segment item classes accurately.

This is a known problem with LDA (and topic models in general): short texts like those in our dataset are difficult to segment. Possible mitigations include adding n-grams.

In a previous project on topic modeling of scientific articles, we found that adding metadata such as the article's author, the name of the journal, and the bibliography significantly improved model performance. I would expect similar results here. If we had other information, such as the merchant name or the purchase URL, we should add it to the text.


## Part B

One simple option for extracting information such as color or gender is to create lookup tables.

Lists such as (boy, girl, man, woman) for gender and (blue, black, red ...) for color allow extraction by simple matching.

This is quick and easy, but it may not be sufficient for all attributes, and it suffers from the 'unknown unknowns' problem: if my list of colors is missing green, it will be very hard for me to notice.



I created simple functions with lookup tables to append color and gender information to the dataset.
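
A minimal sketch of such a lookup function is shown below. The `COLORS` and `GENDERS` tables and the `extract` helper are assumptions made for illustration, not the exact lists or functions used to produce the dataset. Each table maps a surface form to a canonical value, so that "Women's" and "woman" both resolve to `woman`.

```python
import re

# Illustrative lookup tables (assumptions): surface form -> canonical value.
COLORS = {"black": "black", "blue": "blue", "red": "red", "silver": "silver"}
GENDERS = {
    "boy": "boy", "boys": "boy",
    "girl": "girl", "girls": "girl",
    "man": "man", "men": "man", "mens": "man",
    "woman": "woman", "women": "woman", "womens": "woman",
}


def extract(name: str, table: dict[str, str]) -> list[str]:
    """Return the canonical values from `table` that appear in an item name."""
    # Drop apostrophes so "Women's" becomes "womens", then keep only
    # alphanumeric runs as tokens.
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace("'", ""))
    found = []
    for tok in tokens:
        value = table.get(tok)
        if value is not None and value not in found:
            found.append(value)  # keep first-seen order, skip duplicates
    return found


name = "Alisha Solid Women's Cycling Shorts"
print(extract(name, COLORS), extract(name, GENDERS))
# [] ['woman']
```

Matching on whole tokens rather than substrings avoids false hits such as finding "red" inside "tired".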

```python
# some sample color and gender labels
import pandas as pd

sample_labels = pd.read_csv('output/sample_color_data.csv')
sample_labels.head(20)
```

| | raw | color | gender |
|---|---|---|---|
| 0 | Alisha Solid Women's Cycling Shorts | `[]` | `['woman']` |
| 1 | FabHomeDecor Fabric Double Sofa Bed | `[]` | `[]` |
| 2 | AW Bellies | `[]` | `[]` |
| 3 | Sicons All Purpose Arnica Dog Shampoo | `[]` | `['dog']` |
| 4 | Eternal Gandhi Super Series Crystal Paper Weig... | `['silver']` | `[]` |
| 5 | dilli bazaaar Bellies, Corporate Casuals, Casuals | `[]` | `[]` |
| 6 | Ladela Bellies | `[]` | `[]` |
| 7 | Carrel Printed Women's | `[]` | `['woman']` |
| 8 | Sicons All Purpose Tea Tree Dog Shampoo | `[]` | `['dog']` |
| 9 | Freelance Vacuum Bottles 350 ml Bottle | `[]` | `[]` |


From the lists, one can easily one-hot-encode the data to allow for quick search features or modeling. 
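
One way to do this encoding without extra dependencies is sketched below. The `one_hot` helper is a hypothetical name introduced here; the input rows are the gender lists from three rows of the sample output above.

```python
def one_hot(rows: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Turn list-valued labels into column names plus a 0/1 indicator matrix."""
    # A sorted vocabulary gives a stable column order across runs.
    columns = sorted({value for row in rows for value in row})
    matrix = [[1 if col in row else 0 for col in columns] for row in rows]
    return columns, matrix


# Gender lists taken from rows 0, 1, and 3 of the sample output.
genders = [["woman"], [], ["dog"]]
columns, matrix = one_hot(genders)
print(columns)  # ['dog', 'woman']
print(matrix)   # [[0, 1], [0, 0], [1, 0]]
```

Rows with an empty list become all-zero vectors, which keeps "no attribute found" distinct from any specific value.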


This approach misses some implicit labels, such as the wedding lingerie set likely being for women. Embedding models and the topic model approach above may help with this, as lingerie should be mapped near bras or other women's clothing and could be labeled without explicit use of the word woman.



---

If this were a larger project and I had more time and resources, I would look into a few other ways to solve this.

First, I would use common libraries for named entity recognition (NER) to extract company and product names.

I would also use part-of-speech (PoS) tagging to identify patterns (for example, adjectives are more likely to be attributes, whereas nouns are more likely to be the item).



One reason to do this is that learning information about some attributes necessarily informs predictions about other attributes. For example, consider a product I found online: **"adidas ultraboost 40 black size 10"**

If we know the color is black, then we can reasonably assume the size is **not** black. And if we learn the brand is adidas, we know the color is **not** adidas, etc. 
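
This exclusion logic can be illustrated with a deliberately simple rule-based sketch: once a token is claimed by one attribute, it is removed from the candidates for every other attribute. The `ATTRIBUTE_TABLES` vocabularies and the `assign_attributes` function are hypothetical, and a hard rule like this is only a stand-in for the model-based approach described next.

```python
# Hypothetical vocabularies; attributes listed earlier claim tokens first.
ATTRIBUTE_TABLES = {
    "brand": {"adidas"},
    "color": {"black", "blue", "red"},
}


def assign_attributes(name: str) -> tuple[dict[str, str], list[str]]:
    """Assign each token to at most one attribute; return labels and leftovers."""
    remaining = name.lower().split()
    labels = {}
    for attribute, vocabulary in ATTRIBUTE_TABLES.items():
        for tok in remaining:
            if tok in vocabulary:
                labels[attribute] = tok
                remaining.remove(tok)  # a claimed token cannot be reused
                break  # stop iterating right after mutating the list
    return labels, remaining


labels, leftover = assign_attributes("adidas ultraboost 40 black size 10")
print(labels)    # {'brand': 'adidas', 'color': 'black'}
print(leftover)  # ['ultraboost', '40', 'size', '10']
```

The leftover tokens form a smaller search space for the remaining attributes, such as size or model name.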

Therefore, it's possible to use predictions/labels from one attribute and pass that into a model to predict another type of attribute. I used this stacking technique (based on [some work here](https://link.springer.com/content/pdf/10.1007/s10994-016-5546-z.pdf)) to successfully model personality traits based on social media data. This is especially useful when you have multiple targets (as we do with many product attributes) that are either correlated or on some level related.



---

I also found this [paper from Rakuten](http://sigmo.id/__paper__/nio.icde.2019.pdf) that processes the data with methods similar to those I outlined, and then uses a semi-supervised approach with an LSTM model to extract product attributes.

---

This would only be worthwhile for a sufficiently large workload, but if multiple teams or groups were working on this project, there is a framework developed by [Facebook Research called CLARA](https://research.fb.com/wp-content/uploads/2020/08/CLARA-Confidence-of-Labels-and-Raters.pdf). It is a methodology for using multiple discrete models to estimate the accuracy of predictions, with confidence intervals, when there is little ground-truth data. It also provides an assessment of each individual model's performance and minimizes human annotation time in data collection.

I am currently using this methodology for a project on social media analysis and think it could be valuable here also.