# Etsy shops key terms analysis

## Data sourcing and preparation

The Etsy API was queried for the active listings of two shops, `AgnesHart` and `DesignsByBrandiCo`.

For each active listing, the title and description were saved together with the shop name to a CSV file at `./data/shop_listings.csv`.
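A minimal sketch of how such a file could be written, assuming each listing has already been fetched into a dictionary. The `records` list, its `title` and `description` field names, the `write_listings` helper, and the choice to join title and description into a single `text` column are all assumptions made for illustration; the fetching code is not shown in this notebook. An in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical records; the real ones would come from the Etsy API responses.
records = [
    {"shop_name": "AgnesHart", "title": "Example title A", "description": "Example description A"},
    {"shop_name": "DesignsByBrandiCo", "title": "Example title B", "description": "Example description B"},
]

def write_listings(records, handle):
    """Write one row per listing with the columns shop_name and text."""
    writer = csv.DictWriter(handle, fieldnames=["shop_name", "text"])
    writer.writeheader()
    for record in records:
        # Assumption: title and description are concatenated into one text field.
        text = f"{record['title']} {record['description']}"
        writer.writerow({"shop_name": record["shop_name"], "text": text})

buffer = io.StringIO()  # stands in for an open file handle
write_listings(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```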

### Load store listings data

In [74]:
import pandas as pd

# Load the saved listings: a shop name and a text field per row.
shops_df = pd.read_csv('../data/shop_listings.csv')
shops_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 2 columns):
shop_name    323 non-null object
text         323 non-null object
dtypes: object(2)
memory usage: 5.1+ KB


In [75]:
agnes_df = shops_df[shops_df['shop_name'] == 'AgnesHart']
agnes_df.head()

Unnamed: 0,shop_name,text
0,AgnesHart,"1950s Bridal Headpiece, Bridal Cocktail Hat an..."
1,AgnesHart,Ivory Birdcage veil and Delicate Hair Vine - B...
2,AgnesHart,Ivory Birdcage veil with Floral lace - Bridal...
3,AgnesHart,Vintage Wedding Veil and Tiara - Bridal Crown...
4,AgnesHart,"Luxury Beaded Juliet Cap Wedding Veil , Kate ..."


In [190]:
brandi_df = shops_df[shops_df['shop_name'] == 'DesignsByBrandiCo']
brandi_df.head()

Unnamed: 0,shop_name,text
105,DesignsByBrandiCo,Disney Snack Goals // Disney World decal // Di...
106,DesignsByBrandiCo,Cheerleading socks // Summit Socks // Good Luc...
107,DesignsByBrandiCo,Cheerleading socks // Good Luck socks // dance...
108,DesignsByBrandiCo,Texas Aggie ring dish // Whoop! // ring dish /...
109,DesignsByBrandiCo,Cactus Tan Decals // Tanning Stickers // Tanni...


### Naive word counts

In [212]:
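# Concatenate all listing text per shop into one long string.
text_df = shops_df.groupby('shop_name').agg(lambda x: ' '.join(x))
# Lowercase, then replace every non-letter with a space (regex=True is required in pandas >= 2.0).
text_df['text'] = text_df['text'].str.lower().str.replace(r'[^a-z]', ' ', regex=True)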
text_df.head()

Unnamed: 0_level_0,text
shop_name,Unnamed: 1_level_1
AgnesHart,s bridal headpiece bridal cocktail hat an...
DesignsByBrandiCo,disney snack goals disney world decal di...

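To see what the cleaning step does to a single title, here is a small sketch using the standard `re` module rather than pandas. The `normalize` helper is a hypothetical name, not part of the notebook. Because digits are replaced along with punctuation, a token such as `1950s` is reduced to a lone `s`, which is why the AgnesHart text above begins with `s bridal headpiece`.

```python
import re

def normalize(text):
    """Lowercase the text and replace every character outside a-z with a space."""
    return re.sub(r'[^a-z]', ' ', text.lower())

cleaned = normalize("1950s Bridal Headpiece, Bridal Cocktail Hat")
tokens = cleaned.split()  # split() with no argument also drops the empty strings
print(tokens)
```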

In [271]:
from collections import Counter

STOP_WORDS = [
    '', 'the', 'and', 'a', 'to', 'in', 'for', 'is', 'it', 'i',
    'with', 'of', 'your', 'com', 'www', 'this', 'be', 'that',
    'me', 'etsy', 'you', 'can', 's', 'on', 'here'
]

# Keep only words longer than three characters.
agnes_text = [word for word in text_df.loc['AgnesHart', 'text'].split(' ') if len(word) > 3]

counts = Counter(agnes_text)
# Deleting a missing key from a Counter is a no-op, so no membership check is needed.
for stop_word in STOP_WORDS:
    del counts[stop_word]

counts.most_common(7)

[('veil', 731),
 ('shop', 286),
 ('please', 280),
 ('hair', 277),
 ('headpiece', 256),
 ('agneshart', 223),
 ('will', 219)]

In [270]:
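# Same procedure for the second shop: keep words longer than three characters.
brandi_text = [word for word in text_df.loc['DesignsByBrandiCo', 'text'].split(' ') if len(word) > 3]

counts = Counter(brandi_text)
# Drop stop words; missing keys are ignored by Counter.
for stop_word in STOP_WORDS:
    del counts[stop_word]

counts.most_common(7)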

[('decal', 590),
 ('monogram', 300),
 ('decals', 231),
 ('personalized', 222),
 ('color', 185),
 ('tanning', 184),
 ('name', 159)]
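
The two cells above repeat the same steps, so they could be folded into one helper. The sketch below is one way to do that; `top_words` and its parameters are hypothetical names, and the sample sentence is invented for illustration. It filters stop words and short tokens before counting instead of deleting them from the counter afterwards, which yields the same ranking.

```python
from collections import Counter

def top_words(text, stop_words, min_length=4, n=7):
    """Return the n most common words that are long enough and not stop words."""
    stop_words = set(stop_words)
    words = [
        word for word in text.split()
        if len(word) >= min_length and word not in stop_words
    ]
    return Counter(words).most_common(n)

# Invented sample text, for illustration only.
sample = "ivory veil with lace veil for your bridal hair veil and hair comb"
result = top_words(sample, stop_words=["with", "your"], n=2)
print(result)
```

With the notebook's data, a call such as `top_words(text_df.loc['AgnesHart', 'text'], STOP_WORDS)` would take the place of the per-shop cells.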

## Compute TF-IDF matrices

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [272]:
STOP_WORDS = [
    'http', 'www', 'com', 'that', 'the', 'this', 'with', 'will', 'and', 'but', 'etsy', 'for', 'about', 'can',
    'also', 'here', 'see', 'read', 'you', 'your', 'are', 'all', 'these', 'etc', 'quot', 'more', 'other', 'any', 'car'
]

tf = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 1),              # unigrams only
    min_df=0.2,                      # ignore terms found in fewer than 20% of listings
    token_pattern=r'\b[a-z]{3,}\b',  # lowercase words of three or more letters
    max_features=7,                  # keep the 7 most frequent remaining terms
    strip_accents='ascii',
    lowercase=True,
    stop_words=STOP_WORDS
)

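X = tf.fit_transform(agnes_df['text'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names().
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)
len(feature_names)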

['agneshart', 'hair', 'headpiece', 'made', 'please', 'shop', 'veil']


7

In [273]:
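X = tf.fit_transform(brandi_df['text'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names().
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)
len(feature_names)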

['checkout', 'color', 'decal', 'decals', 'monogram', 'name', 'personalized']


7
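
Note that `max_features` keeps the terms with the highest term frequency across the corpus, not the highest TF-IDF weight, which is why the vocabularies above largely match the naive word counts. To make the weighting itself concrete, the following sketch computes TF-IDF by hand for a tiny invented corpus, following the formula scikit-learn documents for its default settings (`smooth_idf=True`, `norm='l2'`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, with each document vector then scaled to unit length. The corpus and the `tfidf` function are illustrative assumptions, and tokenization is a plain whitespace split rather than the `token_pattern` used above.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Invented corpus, for illustration only.
docs = ["ivory veil lace", "ivory veil comb", "ivory tiara"]
rows = tfidf(docs)

# "ivory" occurs in every document, so it gets the lowest weight in each row.
for row in rows:
    print(sorted(row, key=row.get, reverse=True))
```

Ranking terms by their mean weight across a shop's listings, instead of relying on `max_features`, would be one way to surface terms that are distinctive rather than merely frequent.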