In [74]:
import json
import math
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import requests as r
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob as tb
from time import time

To identify the most meaningful words for each of 10 Etsy stores, I chose the TF-IDF algorithm to score the importance of each word.
TF-IDF (term frequency - inverse document frequency) scores words based on their frequency within a document and their uniqueness to that document. Words that appear often in a document but rarely in other documents are very important to that document, while words that appear in many documents (such as "the", "a", "what") are less important.

TF-IDF is widely used in search engines.
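
As a rough illustration of the scoring idea, here is a minimal sketch on a made-up corpus of three tiny "stores". The corpus, the function name `toy_tfidf`, and the unsmoothed formula `tf * ln(N / df)` are assumptions chosen for readability; the notebook's own functions use a slightly different IDF.

```python
import math

# Hypothetical toy corpus: three tiny "stores", already tokenized.
docs = {
    "store_a": ["feather", "earring", "feather", "gift"],
    "store_b": ["oil", "skin", "gift", "oil"],
    "store_c": ["bottle", "cap", "gift", "necklace"],
}


def toy_tfidf(word, doc_tokens, all_docs):
    """Term frequency times unsmoothed inverse document frequency."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    n_containing = sum(1 for tokens in all_docs if word in tokens)
    idf = math.log(len(all_docs) / n_containing)
    return tf * idf


all_docs = list(docs.values())
toy_scores = {w: toy_tfidf(w, docs["store_a"], all_docs) for w in set(docs["store_a"])}
best_word = max(toy_scores, key=toy_scores.get)
```

In this toy corpus "feather" is frequent in store_a and absent elsewhere, so it gets the highest score, while "gift" appears in every store and scores zero.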


The steps to identify the most meaningful words are:
 - obtain the data (text and description of each listing of each store)
 - concatenate all listings of each store
 - tokenize the text, remove stopwords, filter word types (such as removing adverbs), etc
 - perform tf-idf (I provide my own implementation, and a calculation using scikit-learn's implementation)
 - extract the top n words by tf-idf score
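
The last step, extracting the top n words, can be sketched with the standard library. The helper name `top_n` and the scores in `example_scores` are made up for illustration; the notebook itself sorts the full score dictionary instead.

```python
import heapq


def top_n(scores, n):
    """Return the n highest-scoring (word, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


# Illustrative scores only, not results from the data set.
example_scores = {"earring": 0.09, "hook": 0.04, "unique": 0.03, "gift": 0.0}
top_two = top_n(example_scores, 2)
```

`heapq.nlargest` avoids sorting the whole vocabulary when n is small, although for a few thousand words the difference is negligible.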


# Constants

In [22]:
api_key = 'ggfgyosfeez1vnzek7h2ec68'
stores_to_analyze = ['Plumailes','Leinloune','HalkaBOrganics','ArtsyBottleCapsUS','DesertSoulBabe','VintagePlanePrints','beachbohojewelryshop','SurplusHandsShop','Homewarebyleahmarie','AtHooksEnd']
top_n_words = 5

In [23]:
req = r.get("https://s3.amazonaws.com/etsy-data/listings_text.json")
stores_listings = req.json()
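
The later cells assume that the downloaded JSON maps each store name to a list of listings, each with a "title" and a "description" field. A small sanity check along those lines might look like the sketch below; the helper name `missing_or_malformed` and the sample data are hypothetical.

```python
def missing_or_malformed(listings_by_store, stores):
    """Return store names that are absent, empty, or have listings without title/description."""
    problems = []
    for store in stores:
        listings = listings_by_store.get(store)
        if not listings:
            problems.append(store)
            continue
        if any("title" not in item or "description" not in item for item in listings):
            problems.append(store)
    return problems


# Hypothetical sample in the assumed shape.
sample_listings = {
    "ShopA": [{"title": "Feather earring", "description": "Handmade"}],
    "ShopB": [{"title": "Album"}],  # description missing
}
problem_stores = missing_or_malformed(sample_listings, ["ShopA", "ShopB", "ShopC"])
```

Calling it as `missing_or_malformed(stores_listings, stores_to_analyze)` before preprocessing would surface a misspelled store name early instead of as a `KeyError` inside the loop.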

## Concatenate all the listings of each store (both title and description), and store in a list of documents

In [56]:
def merge_listings_and_preprocess_tokens():
    """Merge the text (title and description) of all listings of each store and preprocess the tokens:
        - lowercase
        - remove newline characters
        - remove English stop words
        - remove non-words (punctuation, digits)
        - lemmatize to remove plurals
        - remove words with one or two characters, to avoid abbreviations such as cm (centimeter)
        - filter by word type (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
            - remove adverbs, pronouns, interjections, etc.
            - keep only nouns, verbs, and adjectives, since these provide more meaningful descriptions of a store

    Returns:
        list(string): a list of strings. Each string is a filtered concatenation of all of the store's listings text
    """


    wnl = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))  # set for fast membership tests
    # Penn Treebank tags to keep: nouns, adjectives and inflected verbs
    kept_tags = ('NN', 'NNS', 'JJ', 'JJS', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ')
    documents = []  # one filtered string per store
    for store in stores_to_analyze:
        # collect the title and description of every listing of this store
        pieces = []
        for listing in stores_listings[store]:
            pieces.append(listing["title"])
            pieces.append(listing["description"])
        # lowercase, then tokenize on whitespace (split() also discards newlines)
        tokens = " ".join(pieces).lower().split()
        # lowercasing happens before the stop-word test so that capitalized stop words ("The") are removed too
        filtered_words = [wnl.lemmatize(word) for word in tokens if
                          len(word) > 2 and word.isalpha() and word not in stop_words]
        tagged = nltk.pos_tag(filtered_words)  # list of (word, part-of-speech tag) pairs
        filtered_words = [word for word, tag in tagged if tag in kept_tags]
        documents.append(" ".join(filtered_words))
    return documents

In [67]:
textlist = merge_listings_and_preprocess_tokens()  # list of documents with concatenated listings text
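
To make the cheaper filters concrete, here is a minimal pure-Python sketch of the lowercase, stop-word, non-word and length filters on one sample sentence. It leaves out lemmatization and part-of-speech tagging, and the stop-word set `TOY_STOPWORDS`, the helper `cheap_filter` and the sample text are assumptions for illustration (the notebook uses NLTK's English stop-word list).

```python
# Hypothetical, tiny stop-word set standing in for NLTK's English list.
TOY_STOPWORDS = {"the", "and", "for", "with"}


def cheap_filter(text, stop_words=TOY_STOPWORDS):
    """Lowercase, then drop stop words, non-alphabetic tokens and tokens of 1-2 characters."""
    kept = []
    for token in text.replace("\n", " ").split():
        token = token.lower()
        if len(token) > 2 and token.isalpha() and token not in stop_words:
            kept.append(token)
    return kept


sample = "The feather earrings\nmeasure 5 cm and ship with care"
filtered = cheap_filter(sample)
```

Lowercasing before the stop-word test matters here: the capitalized "The" would otherwise slip past a lowercase stop-word list.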

# Calculate tf-idf scores using my own non-vectorized implementation of the algorithm

## TF-IDF algorithm

In [58]:
def tf(word, blob):
    """Calculate term frequency as the raw count of the word divided by the number of words in the document.
        Other weighting schemes could be used, such as binary, log normalization,
        double normalization 0.5, or double normalization K.
        See https://en.wikipedia.org/wiki/Tf%E2%80%93idf for more information

    Args:
        word (string): a single term
        blob (TextBlob): a blob representation of the document text

    Returns:
        float: fraction of the document's words that are equal to word
    """
    return blob.words.count(word) / len(blob.words)


def df(word, bloblist):
    """Calculate document frequency, i.e. the number of documents that contain the word

    Args:
        word (string): a single term
        bloblist (list): all documents, each as a TextBlob

    Returns:
        int
    """
    return sum(1 for blob in bloblist if word in blob.words)


def idf(word, bloblist):
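    """Calculate inverse document frequency as log(N / (1 + df)).
    Adding 1 to df avoids a division by zero; as a side effect, a word that
    occurs in every document gets a slightly negative value.

    Args:
        word (string): a single term
        bloblist (list): all documents, each as a TextBlob

    Returns:
        float
    """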
    return math.log(len(bloblist) / (1 + df(word, bloblist)))


def tfidf(word, blob, bloblist):
    """Calculate the tf-idf score of a word in one document

    Args:
        word (string): a single term
        blob (TextBlob): a blob representation of the document text
        bloblist (list): all documents
    Returns:
        float
    """
    return tf(word, blob) * idf(word, bloblist)
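
The functions above rescan every document for each word. One possible way to reduce that cost, sketched here under the same formula `(count / length) * ln(N / (1 + df))`, is to count document frequencies once with `collections.Counter`. The function name `tfidf_scores_counter` and the toy input are hypothetical, and this is not the implementation used for the results in this notebook.

```python
import math
from collections import Counter


def tfidf_scores_counter(token_docs):
    """Return one {word: score} dict per document, using idf = ln(N / (1 + df))."""
    n_docs = len(token_docs)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter()
    for tokens in token_docs:
        doc_freq.update(set(tokens))
    all_scores = []
    for tokens in token_docs:
        counts = Counter(tokens)
        total = len(tokens)
        all_scores.append({
            word: (count / total) * math.log(n_docs / (1 + doc_freq[word]))
            for word, count in counts.items()
        })
    return all_scores


# Hypothetical tokenized documents.
toy_docs = [["a", "a", "b"], ["b", "c"], ["c", "d"], ["e"]]
toy_result = tfidf_scores_counter(toy_docs)
```

With real data the input would be something like `[doc.split() for doc in textlist]`, which produces plain token lists rather than TextBlob objects.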

In [68]:
bloblist = [tb(x) for x in textlist]  # convert each document to a TextBlob

In [69]:
start = time()
for i, blob in enumerate(bloblist):
    print("Calculating tfidf scores of document {} of {}".format(i + 1, len(bloblist)))
    print("\tStore: {}. Top Words and tf-idf score:".format(stores_to_analyze[i]))
    # dict.fromkeys() keeps each distinct word once, in order of first appearance,
    # so repeated words are not scored again
    scores = {word: tfidf(word, blob, bloblist) for word in dict.fromkeys(blob.words)}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:top_n_words]:
        print("\t\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
print("\nIt took {}s to calculate tf-idf".format(str(round(time()-start, 2))))

Calculating tfidf scores of document 1 of 10
	Store: Plumailes. Top Words and tf-idf score:
		Word: earring, TF-IDF: 0.09316
		Word: hook, TF-IDF: 0.04467
		Word: unique, TF-IDF: 0.03296
		Word: rooster, TF-IDF: 0.02826
		Word: feather, TF-IDF: 0.02625
Calculating tfidf scores of document 2 of 10
	Store: Leinloune. Top Words and tf-idf score:
		Word: album, TF-IDF: 0.03368
		Word: faux, TF-IDF: 0.02562
		Word: leinloune, TF-IDF: 0.02021
		Word: name, TF-IDF: 0.02016
		Word: cover, TF-IDF: 0.01848
Calculating tfidf scores of document 3 of 10
	Store: HalkaBOrganics. Top Words and tf-idf score:
		Word: oil, TF-IDF: 0.08535
		Word: skin, TF-IDF: 0.02265
		Word: essential, TF-IDF: 0.01607
		Word: pure, TF-IDF: 0.01549
		Word: body, TF-IDF: 0.0132
Calculating tfidf scores of document 4 of 10
	Store: ArtsyBottleCapsUS. Top Words and tf-idf score:
		Word: bottle, TF-IDF: 0.21601
		Word: cap, TF-IDF: 0.14674
		Word: epoxy, TF-IDF: 0.14201
		Word: military, TF-IDF: 0.13728
		Word: necklace, TF-I

## Calculate tf-idf scores using scikit-learn

In [87]:
start = time()
tokenize = lambda doc: doc.split(" ")  # text was already pre-processed, otherwise it could be done here
n_features = 1000  # maximum vocabulary size

tfidf_vectorizer = TfidfVectorizer(norm='l2', min_df=2, max_df=0.95, use_idf=True, smooth_idf=False,
                                   sublinear_tf=True, tokenizer=tokenize, max_features=n_features,
                                   stop_words='english')
# named tfidf_matrix and top_df so that the tfidf() and df() functions defined earlier are not overwritten
tfidf_matrix = tfidf_vectorizer.fit_transform(textlist)
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
features = tfidf_vectorizer.get_feature_names_out()

for i in range(len(textlist)):
    row = np.squeeze(tfidf_matrix[i].toarray())
    topn_ids = np.argsort(row)[::-1][:top_n_words]  # column indices of the highest scores
    top_feats = [(features[j], row[j]) for j in topn_ids]
    top_df = pd.DataFrame(top_feats, columns=['Word', 'tfidf score'])
    print("Top Words for Store {}:".format(stores_to_analyze[i]))
    print(top_df)
    print()
    # cumulative elapsed time, printed after every store
    print("\nIt took {}s to calculate tf-idf with scikit".format(str(round(time()-start, 2))))

Top Words for Store Plumailes:
      Word  tfidf score
0  earring     0.191818
1  feather     0.176312
2   nickel     0.175881
3  bouquet     0.172211
4   unique     0.162860


It took 0.03s to calculate tf-idf with scikit
Top Words for Store Leinloune:
           Word  tfidf score
0          faux     0.149779
1         cover     0.140206
2  personalized     0.125894
3          book     0.124788
4          wire     0.122442


It took 0.03s to calculate tf-idf with scikit
Top Words for Store HalkaBOrganics:
      Word  tfidf score
0     skin     0.145128
1     pure     0.135778
2     seed     0.125515
3     drop     0.121990
4  organic     0.121511


It took 0.03s to calculate tf-idf with scikit
Top Words for Store ArtsyBottleCapsUS:
        Word  tfidf score
0     bottle     0.517588
1   necklace     0.386179
2        lot     0.383710
3    support     0.382444
4  available     0.232703


It took 0.04s to calculate tf-idf with scikit
Top Words for Store DesertSoulBabe:
       Word  tfid

# Conclusion

I used TF-IDF to score the relative importance of each word for each store. I implemented a naive version of
the algorithm without any optimization (such as vectorization). The meaningful words it identifies mostly
agree with the scores produced by scikit-learn's implementation. The few differences can be attributed to differences in the weighting scheme:
the scikit-learn run uses sublinear term frequency, L2 normalization, a different IDF formula, and min_df=2, which discards words that appear in only one store.
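
To give a feel for how much the weighting scheme alone can change the ranking, the sketch below compares the hand-written weight with the unnormalized weight that scikit-learn documents for sublinear_tf=True and smooth_idf=False. The function names and the counts are made up for illustration, and the L2 normalization that scikit-learn applies afterwards is left out.

```python
import math


def naive_weight(count, doc_len, n_docs, doc_freq):
    """Hand-written scheme: (count / doc_len) * ln(N / (1 + df))."""
    return (count / doc_len) * math.log(n_docs / (1 + doc_freq))


def sklearn_style_weight(count, n_docs, doc_freq):
    """Unnormalized weight with sublinear tf and unsmoothed idf:
    (1 + ln(count)) * (ln(N / df) + 1)."""
    return (1 + math.log(count)) * (math.log(n_docs / doc_freq) + 1)


# Hypothetical words in a 1000-word document, both present in 2 of 10 stores.
frequent = {"count": 50, "doc_freq": 2}
rare = {"count": 5, "doc_freq": 2}

naive_ratio = (naive_weight(frequent["count"], 1000, 10, frequent["doc_freq"])
               / naive_weight(rare["count"], 1000, 10, rare["doc_freq"]))
sklearn_ratio = (sklearn_style_weight(frequent["count"], 10, frequent["doc_freq"])
                 / sklearn_style_weight(rare["count"], 10, rare["doc_freq"]))
```

Under the naive scheme the word that occurs ten times more often scores ten times higher, while the sublinear scheme compresses that gap to less than a factor of two, so frequently repeated words dominate the scikit-learn ranking less.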

The text was pre-processed in the same way for both algorithms. Tokenization involved:
 - lowercase
 - remove newline characters
 - remove English stop words
 - remove non-words (punctuation, digits)
 - lemmatize to remove plurals
 - remove words with one or two characters, to avoid abbreviations such as cm (centimeter)
 - filter by word type (http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
     - remove adverbs, pronouns, interjections, etc.
     - keep only nouns, verbs, and adjectives, since these provide more meaningful descriptions of a store


Finally, inspection of the webpage for each store (see URLs below) shows that the words found by the algorithm describe the respective store reasonably well.




https://www.etsy.com/shop/Plumailes/items
https://www.etsy.com/shop/Leinloune/items
https://www.etsy.com/shop/HalkaBOrganics/items
https://www.etsy.com/shop/ArtsyBottleCapsUS/items
https://www.etsy.com/shop/DesertSoulBabe/items
https://www.etsy.com/shop/VintagePlanePrints/items
https://www.etsy.com/shop/beachbohojewelryshop/items
https://www.etsy.com/shop/SurplusHandsShop/items
https://www.etsy.com/shop/Homewarebyleahmarie/items
https://www.etsy.com/shop/AtHooksEnd/items


