# Keyword extraction for Interactive Image Retrieval

## KeyBERT:

[Why KeyBERT?](https://towardsdatascience.com/keyword-extraction-a-benchmark-of-7-algorithms-in-python-8a905326d93f) \
The previously described method is implemented in a library called [KeyBERT](https://maartengr.github.io/KeyBERT/), a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings to find the keywords and keyphrases most similar to a document.
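
The core idea, ranking candidate words by how similar their embeddings are to the document embedding, can be sketched in plain Python. This is only an illustration: the 3-dimensional vectors are hand-picked stand-ins for real BERT embeddings, and `cosine` and `rank_candidates` are hypothetical helper names, not part of KeyBERT.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(doc_vec, cand_vecs, top_n=5):
    """Return (candidate, similarity) pairs, most similar to the document first."""
    scored = [(cand, cosine(doc_vec, vec)) for cand, vec in cand_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy embeddings (assumption: real ones come from a BERT model)
doc_vec = [1.0, 1.0, 0.0]
cand_vecs = {
    "supervised": [1.0, 0.9, 0.1],
    "function": [0.2, 1.0, 0.8],
    "way": [0.0, 0.1, 1.0],
}
ranked = rank_candidates(doc_vec, cand_vecs, top_n=2)
print(ranked)
```

The candidate whose vector points in nearly the same direction as the document vector ranks first, which mirrors how the scores in the cells below should be read: higher means closer to the document as a whole.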

Using the KeyBERT model:

In [1]:
import keybert

In [2]:
from keybert import KeyBERT

doc = """
         Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).
      """
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(doc)

In [3]:
keywords

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

In [4]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), stop_words='english')

[('supervised learning algorithm', 0.6992),
 ('supervised learning example', 0.6807),
 ('supervised learning', 0.6779),
 ('supervised learning machine', 0.6706),
 ('supervised', 0.6676)]
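
The effect of `keyphrase_ngram_range` and `stop_words` on the candidate phrases can be approximated with a small sketch. It assumes candidates are built the way a bag-of-words vectorizer such as scikit-learn's `CountVectorizer` builds them: lowercase the text, keep tokens of two or more characters, drop stop words, then form n-grams from what remains. The stop-word set here is a tiny hypothetical subset, and `candidate_ngrams` is an illustrative name.

```python
import re

# Tiny hypothetical stop-word list; the real 'english' list is much longer
STOP_WORDS = {"it", "an", "the", "to", "of", "on", "is", "that", "from"}

def candidate_ngrams(text, ngram_range=(1, 3), stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, then build unique n-grams in order."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in stop_words]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram not in grams:
                grams.append(gram)
    return grams

text = ("based on example input-output pairs. "
        "It infers a function from labeled training data")
trigrams = candidate_ngrams(text, ngram_range=(3, 3))
print(trigrams)
```

Because stop words are removed before the n-grams are formed, a candidate can join words that were not adjacent in the original text, and can even span a sentence boundary, as with "pairs infers function" in this sample.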

In [5]:
kw_model.extract_keywords(doc, highlight=True)

[('supervised', 0.6676),
 ('labeled', 0.4896),
 ('learning', 0.4813),
 ('training', 0.4134),
 ('labels', 0.3947)]

In [6]:
kw_model.extract_keywords(doc, keyphrase_ngram_range=(3, 3), stop_words='english', 
                              use_mmr=True, diversity=0.7)

[('supervised learning algorithm', 0.6992),
 ('pairs infers function', 0.1981),
 ('unseen situations reasonable', 0.2142),
 ('value called supervisory', 0.2895),
 ('class labels unseen', 0.3469)]
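
With `use_mmr=True`, the returned scores are no longer in descending order. A plausible reading is that the list reflects selection order under Maximal Marginal Relevance (MMR), while each score is still the similarity to the document. The sketch below is a plain-Python version of greedy MMR under that assumption. The toy vectors and the `mmr` helper are hypothetical and are not KeyBERT's actual implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(doc_vec, cand_vecs, top_n=3, diversity=0.7):
    """Greedy MMR: trade off document similarity against redundancy."""
    doc_sim = {c: cosine(doc_vec, v) for c, v in cand_vecs.items()}
    # Start with the candidate most similar to the document
    selected = [max(doc_sim, key=doc_sim.get)]
    remaining = [c for c in cand_vecs if c not in selected]
    while remaining and len(selected) < top_n:
        def mmr_score(c):
            # Redundancy: similarity to the closest already-selected phrase
            redundancy = max(cosine(cand_vecs[c], cand_vecs[s]) for s in selected)
            return (1 - diversity) * doc_sim[c] - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    # Report document similarity, in selection order
    return [(c, round(doc_sim[c], 4)) for c in selected]

# Hypothetical toy embeddings, hand-picked for illustration
doc_vec = [1.0, 1.0, 0.0]
cand_vecs = {
    "supervised learning": [1.0, 0.9, 0.1],
    "supervised learning algorithm": [1.0, 0.8, 0.1],
    "class labels": [0.1, 0.6, 1.0],
}
diverse = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.7)
greedy = mmr(doc_vec, cand_vecs, top_n=2, diversity=0.0)
print(diverse)
print(greedy)
```

At `diversity=0.7` the near-duplicate phrase is skipped in favor of a less similar but more distinct one. At `diversity=0.0` the selection reduces to plain similarity ranking.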

## More relevant data:

In [7]:
def readArticleText(path):
    with open(path, 'r', encoding="utf-8") as file:
        doc = file.read()
    return doc

nftArticle = readArticleText("./Data/Fashion/KarlNFT.txt")
levis = readArticleText("./Data/Fashion/Levis.txt")
winterOlympics = readArticleText("./Data/Fashion/WinterOlympics.txt")

### Article about NFT by Karl Lagerfeld:

In [8]:
kw_model.extract_keywords(nftArticle, keyphrase_ngram_range=(1, 3), stop_words='english')

[('nft collectibles luxury', 0.6237),
 ('lagerfeld newest nfts', 0.616),
 ('nft figurines', 0.6009),
 ('nft collectibles', 0.5922),
 ('nft figurines sold', 0.5918)]

In [9]:
kw_model.extract_keywords(nftArticle, highlight=True)

[('klxendless', 0.4777),
 ('nfts', 0.4736),
 ('kl7xendless', 0.4733),
 ('nft', 0.4663),
 ('collectibles', 0.3812)]

In [10]:
kw_model.extract_keywords(nftArticle, keyphrase_ngram_range=(1, 3), stop_words='english', 
                              use_mmr=True, diversity=0.6)

[('nft collectibles luxury', 0.6237),
 ('karl lagerfeld announced', 0.3616),
 ('klxendless exclusive version', 0.4586),
 ('prints endless figurines', 0.3691),
 ('digital karl augmented', 0.364)]

In [11]:
kw_model.extract_keywords(nftArticle, keyphrase_ngram_range=(1, 3), stop_words='english', 
                              use_mmr=True)

[('nft collectibles luxury', 0.6237),
 ('karl lagerfeld announced', 0.3616),
 ('klxendless figures', 0.5851),
 ('digital karl augmented', 0.364),
 ('777 euros editions', 0.3403)]

### Winter Olympics:

In [12]:
kw_model.extract_keywords(winterOlympics, highlight=True)

[('attire', 0.4226),
 ('olympics', 0.3928),
 ('worn', 0.3795),
 ('outfit', 0.3791),
 ('medal', 0.3781)]

In [13]:
kw_model.extract_keywords(winterOlympics, keyphrase_ngram_range=(1, 3), stop_words='english')

[('gb athletes wear', 0.6547),
 ('gb badge olympic', 0.581),
 ('wear winter olympics', 0.5754),
 ('winter olympics ben', 0.5704),
 ('olympics ben sherman', 0.5677)]

In [14]:
kw_model.extract_keywords(winterOlympics, keyphrase_ngram_range=(1, 3), stop_words='english', 
                              use_mmr=True)

[('gb athletes wear', 0.6547),
 ('ben sherman unveiled', 0.482),
 ('closing ceremony beijing', 0.3306),
 ('winter olympics ben', 0.5704),
 ('classic quilted peacoat', 0.3691)]

### Levi's Article:

In [15]:
kw_model.extract_keywords(levis, highlight=True)

[('levi', 0.4894),
 ('brand', 0.3749),
 ('jean', 0.3508),
 ('laser', 0.3446),
 ('smith', 0.2871)]

In [16]:
kw_model.extract_keywords(levis, keyphrase_ngram_range=(1, 3), stop_words='english')

[('partner collection levi', 0.5859),
 ('levi partnered jaden', 0.5796),
 ('collection levi partnered', 0.5738),
 ('jean levi type', 0.5489),
 ('levi jaden smith', 0.5406)]

In [17]:
kw_model.extract_keywords(levis, keyphrase_ngram_range=(1, 3), stop_words='english', use_mmr=True)

[('partner collection levi', 0.5859),
 ('limited capsule inspired', 0.4734),
 ('laser focus group', 0.4098),
 ('jaden smith new', 0.3773),
 ('trucker jacket embodies', 0.2581)]

### Can we get some examples relevant to the FashionIQ dataset?

### Vogue: 2022’s Fashion Forecast?

In [18]:
article = readArticleText("./Data/Fashion/VogueForecast.txt")

In [19]:
kw_model.extract_keywords(article, highlight=True)

[('fashion', 0.5683),
 ('vogue', 0.5457),
 ('glamour', 0.4726),
 ('dressing', 0.4095),
 ('2021', 0.4071)]

In [20]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 2), stop_words='english')

[('2022 fashion', 0.7389),
 ('vogue 2022', 0.6993),
 ('fashion trends', 0.6406),
 ('fashion demands', 0.6158),
 ('fashion forecast', 0.5769)]

In [21]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 3), stop_words='english')

[('2022 fashion trends', 0.8065),
 ('vogue 2022 fashion', 0.77),
 ('2022 fashion', 0.7389),
 ('2022 fashion forecast', 0.7283),
 ('summer 2022 fashion', 0.7246)]

In [22]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 3), stop_words='english', 
                              use_mmr=True)

[('2022 fashion trends', 0.8065),
 ('gaultier expecting brand', 0.4445),
 ('dressing juggernaut free', 0.3774),
 ('versace tailoring going', 0.4753),
 ('womenswear matches brands', 0.5389)]

### Euronews Culture:

In [23]:
article = readArticleText("./Data/Fashion/euronews.txt")

In [24]:
kw_model.extract_keywords(article, highlight=True)

[('fashion', 0.5095),
 ('textile', 0.5074),
 ('textiles', 0.4845),
 ('fabrics', 0.4527),
 ('wardrobe', 0.4086)]

In [25]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 2), stop_words='english')

[('future fashion', 0.668),
 ('fashion trends', 0.5596),
 ('fashion forward', 0.555),
 ('fashion rehabilitation', 0.5478),
 ('sustainable fabrics', 0.5142)]

In [26]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 3), stop_words='english')

[('fashion trends 2022', 0.6917),
 ('new future fashion', 0.6835),
 ('future fashion forward', 0.6745),
 ('2022 years fashion', 0.6723),
 ('future fashion', 0.668)]

In [27]:
kw_model.extract_keywords(article, keyphrase_ngram_range=(1, 3), stop_words='english', 
                              use_mmr=True)

[('fashion trends 2022', 0.6917),
 ('comfort healing restoration', 0.3324),
 ('tactile textiles need', 0.4879),
 ('solace feel clothes', 0.4985),
 ('envelope comfort trend', 0.4393)]

In [28]:
import json

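def extract_keywords(path):
    """Return the top keyphrases (without scores) for the article at `path`."""
    with open(path, 'r', encoding="utf-8") as file:
        article = file.read()
    # Candidate phrases of 1 to 5 words, diversified with MMR
    keywds_l = kw_model.extract_keywords(article,
                                         keyphrase_ngram_range=(1, 5),
                                         stop_words='english',
                                         use_mmr=True)
    # Drop the similarity scores and keep only the phrases
    keywds = [word for (word, _) in keywds_l]
    return keywds
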
extract_keywords("./Data/Fashion/euronews.txt")

['exploring new future fashion forward',
 '2022 trends know tactile textiles',
 'seek comfort healing restoration',
 'habits year dictated bittersweet emotion',
 'solace feel clothes']
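
The helper above discards the similarity scores. If weak phrases should be dropped before the keywords are used for retrieval, one option is to filter on the score first. The sketch below is a hypothetical post-processing step, not something the notebook does: `filter_keyphrases` and the `min_score=0.4` threshold are assumptions, and the input is the scored output of cell 27.

```python
def filter_keyphrases(scored, min_score=0.4):
    """Keep phrases scoring at least min_score, highest score first."""
    kept = [(phrase, score) for phrase, score in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [phrase for phrase, _ in kept]

# Scored keyphrases copied from the MMR output for the Euronews article
scored = [
    ('fashion trends 2022', 0.6917),
    ('comfort healing restoration', 0.3324),
    ('tactile textiles need', 0.4879),
    ('solace feel clothes', 0.4985),
    ('envelope comfort trend', 0.4393),
]
filtered = filter_keyphrases(scored)
print(filtered)
```

The threshold is arbitrary and would need tuning per corpus, since the scores in the cells above vary widely between articles.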