# HW3

Submit via Slack. Due on Monday, April 13th, 2020, 11:59pm PST. You may work with one other person.

## TF-IDF

You are an analyst working at Amazon as a product analyst, and charged with identifying areas for improvement to the Amazon toy product lines, which have been suffering recently from lower reviews.

Using the **`poor_amazon_toy_reviews.txt`** and **`good_amazon_toy_reviews.txt`** datasets, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that **visualizes**:
* the features your analysis showed that customers cited as reasons for a 5 star review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?



## Step 1. Importing packages and the text files

In [1]:
import pandas as pd
from collections import Counter
import re
import nltk 
from nltk.tokenize import sent_tokenize, word_tokenize 
from nltk.stem import PorterStemmer 
from sklearn.feature_extraction.text import TfidfVectorizer


In [2]:
# Opening two files
poor_review = open("poor_amazon_toy_reviews.txt", "r", encoding = 'UTF-8')
good_review = open("good_amazon_toy_reviews.txt", "r", encoding = 'UTF-8')
poor_review_read = poor_review.read().lower()
good_review_read = good_review.read().lower()


## Step 2. Creating a list of custom stop words

First, I combined the lists of stopwords from gensim and nltk to cover as much general stopwords as possible. After removing the duplicates, I added words that are expected to appear a lot in this specific texts. Finally, I excluded the negative words (not, nor, no, neither, never) that might reverse the meaning of the words following those words. 

In [3]:
from gensim.parsing.preprocessing import STOPWORDS
stopwords_gensim = list(STOPWORDS)

from nltk.corpus import stopwords
stopwords_NLTK = list(stopwords.words("english"))

stopwords_combined = list(set(stopwords_gensim+stopwords_NLTK)) #to remove duplicates

custom_stopwords = ['amazon', 'prime'] #added words that are very likely to be found given the context of the files
stopwords_combined += custom_stopwords


negatives = ['not','nor','no','neither', 'never'] #took out the negative words for a more accurate analysis
stopwords_combined = list(filter(lambda x: x not in negatives, stopwords_combined))

stopwords_combined.sort()
stopwords_expression = '|'.join(stopwords_combined)
stopwords_pattern = f'({stopwords_expression})'

In [4]:
stopwords_pattern

"(a|about|above|across|after|afterwards|again|against|ain|all|almost|alone|along|already|also|although|always|am|amazon|among|amongst|amoungst|amount|an|and|another|any|anyhow|anyone|anything|anyway|anywhere|are|aren|aren't|around|as|at|back|be|became|because|become|becomes|becoming|been|before|beforehand|behind|being|below|beside|besides|between|beyond|bill|both|bottom|but|by|call|can|cannot|cant|co|computer|con|could|couldn|couldn't|couldnt|cry|d|de|describe|detail|did|didn|didn't|do|does|doesn|doesn't|doing|don|don't|done|down|due|during|each|eg|eight|either|eleven|else|elsewhere|empty|enough|etc|even|ever|every|everyone|everything|everywhere|except|few|fifteen|fifty|fill|find|fire|first|five|for|former|formerly|forty|found|four|from|front|full|further|get|give|go|had|hadn|hadn't|has|hasn|hasn't|hasnt|have|haven|haven't|having|he|hence|her|here|hereafter|hereby|herein|hereupon|hers|herself|him|himself|his|how|however|hundred|i|ie|if|in|inc|indeed|interest|into|is|isn|isn't|it|it's|i

## Step 3. Defining functions to be used in the analysis

Next, I created functions that I can apply to both files during my analysis.

I chose stemming over lemmatization because I would be able to benefit from faster speed and smaller dimension size. Lemmatization would do a better job in keeping the correct base form of the words, but, I assumed that the Amazon reviews for a product should be simple enough for me to understand them intuitively. Therefore, I decided to go with the stemming.

For the tf_idf_vectorizer function, my ngram range is (2,3). with min_df=0.01, max_df = 0.8. I thought the ngram has to be at least 2 words to understand any context. Upon multiple attempts, I noticed that the memory of my laptop does not hold anything above 3 words. The min_df and max_df were also the product of adjustment to produce as much value possible within the spec of the laptop.

In [5]:
def stem_text (text):
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.stem import PorterStemmer
    porter=PorterStemmer()
    token_sentence = nltk.sent_tokenize(text)
    stemmed_text=[]
    for sentence in token_sentence:
        token_words=nltk.word_tokenize(sentence)
        for word in token_words:
            stemmed_text.append(porter.stem(word))
            stemmed_text.append(" ")
    return "".join(stemmed_text)

In [6]:
def tf_idf_vectorizer(corpus):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    vectorizer = TfidfVectorizer(ngram_range=(2,3), min_df=0.01, max_df = 0.8)
    
    X = vectorizer.fit_transform(corpus)
    terms = vectorizer.get_feature_names()
    tf_idf = pd.DataFrame(X.toarray().transpose(), index=terms)

    tf_idf = tf_idf.sum(axis=1)
    score = pd.DataFrame(tf_idf, columns=["score"])
    score.sort_values(by="score", ascending=False, inplace=True)
    
    return score

## Step 4. Finding Collocated Words

In [7]:
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
lemmatizer = WordNetLemmatizer()
poor_review = open("poor_amazon_toy_reviews.txt", "r", encoding = 'UTF-8')
good_review = open("good_amazon_toy_reviews.txt", "r", encoding = 'UTF-8')


In [8]:
documents = []
for line in poor_review.readlines():
    line = line.replace("\n", "") 
    line = re.sub(rf'\b{stopwords_pattern}\b','', line)
    line = " ".join(re.findall(r'\w+', line)).lower()
    
    if len(line) > 0:
        line = [lemmatizer.lemmatize(token) for token in word_tokenize(line)] 
        documents.append(line)
        
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        new_document.append(word)
    new_documents.append(new_document)
    
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

good_collocated = collocation_finder.nbest(measures.raw_freq, 10)
good_collocated

[('br', 'br'),
 ('waste', 'money'),
 ('i', 'bought'),
 ('i', 'not'),
 ('year', 'old'),
 ('not', 'worth'),
 ('do', 'not'),
 ('i', 'received'),
 ('i', 'got'),
 ('not', 'buy')]

From the top 10 collocated words, I will combine "year", "old" as one word in my analysis later and remove br altogether.

In [9]:
documents = []
for line in good_review.readlines():
    line = line.replace("\n", "") # replace the new line escape character
    line = re.sub(rf'\b{stopwords_pattern}\b','', line)
    line = " ".join(re.findall(r'\w+', line)).lower()
    if len(line) > 0:
        line = [lemmatizer.lemmatize(token) for token in word_tokenize(line)] 
        documents.append(line)
        
new_documents = []
for doc in documents:
    new_document = []
    for word in doc:
        new_document.append(word)
    new_documents.append(new_document)
    
collocation_finder = BigramCollocationFinder.from_documents(new_documents)
measures = BigramAssocMeasures()

good_collocated = collocation_finder.nbest(measures.raw_freq, 10)
good_collocated

[('year', 'old'),
 ('br', 'br'),
 ('i', 'love'),
 ('i', 'bought'),
 ('my', 'son'),
 ('my', 'daughter'),
 ('daughter', 'love'),
 ('son', 'love'),
 ('kid', 'love'),
 ('i', 'got')]

Similar to the negative reviews, I will combine "year", "old" as one word in my analysis later and remove br altogether for the positive reviews as well.

## Step 5. Poor Reviews Analysis

In [10]:
poor_review_cleaned = re.sub(rf'\b{stopwords_pattern}\b','', poor_review_read)

# Handling the collocated words identified in previous step
poor_review_cleaned = re.sub(rf'\bbr\b','', poor_review_cleaned)
poor_review_cleaned = re.sub(rf'\byear old\b','yearold', poor_review_cleaned)
poor_review_cleaned = re.sub(rf'\byears old\b','yearold', poor_review_cleaned)

poor_review_split = poor_review_cleaned.split('\n')
poor_review_df = pd.DataFrame(poor_review_split, columns = ['Poor_Reviews']) 
poor_review_df

Unnamed: 0,Poor_Reviews
0,not buy ! break fast spun 15 minutes e...
1,showed not ' shown . ' old toy. paint .
2,need expansion packs 3-5 want access play...
3,""" gift husband new pool. not receive ..."
4,received pineapple advertised '
...,...
12696,small
12697,contained glass dangerous barefoot.
12698,"""fake. not original. time 5 yr old kid sees ..."
12699,poor quality


In [11]:
poor_review_split = poor_review_cleaned.split('\n')
poor_review_df = pd.DataFrame(poor_review_split, columns = ['Poor_Reviews']) 
poor_review_df

Unnamed: 0,Poor_Reviews
0,not buy ! break fast spun 15 minutes e...
1,showed not ' shown . ' old toy. paint .
2,need expansion packs 3-5 want access play...
3,""" gift husband new pool. not receive ..."
4,received pineapple advertised '
...,...
12696,small
12697,contained glass dangerous barefoot.
12698,"""fake. not original. time 5 yr old kid sees ..."
12699,poor quality


In [12]:
for i in range(0,len(poor_review_df)):   
    #only keeping the words and not punctuations. The reason I am cleaning here in this step is because
    #I needed '/' to separate out by /n in the step above.
    poor_review_df.iloc[i,0] = " ".join(re.findall(r'\w+', poor_review_df.iloc[i,0])) 
    
    #using the stemming function identified above
    poor_review_df.iloc[i,0] = stem_text(poor_review_df.iloc[i,0])
    
poor_review_df

Unnamed: 0,Poor_Reviews
0,not buy break fast spun 15 minut end flew wast...
1,show not shown old toy paint
2,need expans pack 3 5 want access player aid fa...
3,gift husband new pool not receiv color order m...
4,receiv pineappl advertis
...,...
12696,small
12697,contain glass danger barefoot
12698,fake not origin time 5 yr old kid see origin f...
12699,poor qualiti


In [13]:
corpus = list(poor_review_df["Poor_Reviews"].values)
poor_review_score = tf_idf_vectorizer(corpus)
poor_review_score.head(20)

Unnamed: 0,score
wast money,669.23028
not work,395.584498
not worth,379.338508
not buy,373.199001
look like,357.473216
not recommend,243.414852
poor qualiti,239.068573
not good,179.312844
stop work,136.302272
fell apart,135.313463


The words above are the top 20 terms that are considered as the most important words in the poor Amazon reviews. Reading through the terms, it makes sense that these terms were found in the poor reviews.

## Step 6. Good Reviews Analysis

In [14]:
# Removing Stopwords
good_review_cleaned = re.sub(rf'\b{stopwords_pattern}\b','', good_review_read)

# Handling the collocated words identified in previous step
good_review_cleaned = re.sub(r'\bbr\b', '', good_review_cleaned)
good_review_cleaned = re.sub(r'\byear old\b', 'yearold', good_review_cleaned)
good_review_cleaned = re.sub(r'\byears old\b', 'yearold', good_review_cleaned)

good_review_split = good_review_cleaned.split('\n')
good_review_df = pd.DataFrame(good_review_split, columns = ['Good_Reviews']) 
good_review_df

Unnamed: 0,Good_Reviews
0,excellent!!!
1,"""great quality wooden track (better tried..."
2,daughter loved liked price came sho...
3,great item. pictures pop add &#34;painted....
4,pleased product.
...,...
102213,"""nice kit, priced"""
102214,supposed .
102215,grandson loves playing police figurines…….
102216,grandson loves littlebits!


In [15]:
# I did not want to use for loop because it's not very efficient for this data size, 
# but I could not just leave out "\n" when I run re.findall(r'\w+'...) or try to stem it

for i in range(0,len(good_review_df)):   
    #only keeping the words and not punctuations. The reason I am cleaning here in this step is because
    #I needed '/' to separate out by /n in the step above.    
    good_review_df.iloc[i,0] = " ".join(re.findall(r'\w+', good_review_df.iloc[i,0]))
    
    #using the stemming function identified above
    good_review_df.iloc[i,0] = stem_text(good_review_df.iloc[i,0])
    
good_review_df

Unnamed: 0,Good_Reviews
0,excel
1,great qualiti wooden track better tri perfect ...
2,daughter love like price came shop ton peopl b...
3,great item pictur pop add 34 paint 34 pictur d...
4,pleas product
...,...
102213,nice kit price
102214,suppos
102215,grandson love play polic figurin
102216,grandson love littlebit


In [16]:
corpus = list(good_review_df["Good_Reviews"].values)
good_review_score = tf_idf_vectorizer(corpus)
good_review_score.head(20)

Unnamed: 0,score
daughter love,2973.864818
son love,2750.532371
kid love,2725.666928
grandson love,2093.386828
great product,1636.458058
granddaught love,1484.997371
good qualiti,1478.405115
work great,1413.083141
highli recommend,1385.296162
great qualiti,1226.976575


Similar to the bad review, these are the top 20 that were considered as the most important in the good reviews. The terms in the list are very different from the terms identified in the poor reviews and show much positive context. 

As useful as TF-IDF is, it is still based on the "Bag of Words" method. What this means is that the number of words occurrences are emphasized in the text, but the semantics or the position within the text is not taken into cosideration.

Source: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

In [17]:
import pandas as pd
catalog = pd.read_csv("truncated_catalog.csv")
catalog = catalog.apply(lambda x: x.astype(str).str.lower())
# making entire dataframe lower case: https://stackoverflow.com/questions/39512002/convert-whole-dataframe-from-lower-case-to-upper-case-with-pandas

catalog.head(50)

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '..."
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':..."
3,lilly pulitzer kids,little gir's & girl's ariana one-piece upf 50+...,pretty ruffle sleeves and trim elevate essenti...,"justkids/girls214/girls/swimwearcoverups,justk...",https://www.saksfifthavenue.com/lilly-pulitzer...,scoopneck\nadjustable straps\nflutter sleeves\...,'50':14a 'allov':28 'ariana':9a 'color':27 'el...
4,kissy kissy,baby girl's endearing elephants pima cotton co...,versatile convertible gown with elephant applique,justkids/baby024months/infantgirls/footiesrompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,v-neckline\nlong sleeves\nfront snap closure\n...,"'appliqu':17 'babi':3a 'convert':10a,13 'cotto..."
5,jocelyn,savage love texty time leopard-print rabbit fu...,from the savage love collection. fingerless kn...,jewelryaccessories/accessories/gloves,https://www.saksfifthavenue.com/jocelyn-savage...,acrylic/wool\nfur type: dyed rabbit\nfur origi...,'ad':29 'collect':16 'craft':20 'fingerless':1...
6,theory,teah stretch-silk camisole,"beige stretch-silk slips on 93% silk, 7% spand...",clothing / tops / tanks and camis,https://www.net-a-porter.com/us/en/product/119...,"fits true to size, take your normal size\ncut ...",'7':15 '93':13 'beig':7 'camisol':6a 'clean':1...
7,ami paris,postcard patch hoodie,casual cotton-blend hoodie with an embossed la...,themensstore/apparel/sweatshirtshoodies18q1,https://www.saksfifthavenue.com/ami-paris-post...,attached drawstring hood\nlong sleeves\npullov...,"'ami':1a,15 'blend':9 'casual':6 'chest':21 'c..."
8,alexander wang,layered velvet mini dress,black velvet concealed hook and zip fastening ...,clothing / dresses / mini,https://www.net-a-porter.com/us/en/product/120...,"fits true to size, take your normal size \ndes...",'100':21 '35':18 '65':16 'alexand':1a 'back':1...
9,j.crew,wide leather belt,the ideal way to add definition to your favori...,belts,https://www.jcrew.com/p/womens_category/belts/...,,"'add':9 'belt':4a,17 'better':27 'custom':19 '..."


## Step 2. Filtering for only women's clothing items.

In [18]:
import re
catalog['is_womens_clothing']=0

for i in range(0,len(catalog)): #in order to go through every row
    for j in [1,2,3,5,6]: #in order to go through columns "name", "description", "brand_category", "details", and "tsv"
        text = catalog.iloc[i,j] #pinpoint a cell
        
        exist = len(re.findall(r'\b(themensstore|man|men|home|tech|technology|fragrance|beauty|accessories|accessory|boy|boys|baby)\b'," ".join((re.findall(r'\w+', text))))) 
        #The reason I'm looking for non-women items are because I am assuming that the default data to be related to women's clothing products, as written in the instruction.
        #In other words, if a product includes info for neither women or not women, I am assuming that it is women's products
        
        if exist >= 1:
            catalog.iloc[i,7]+=1 #if a cell includes at least one of those words(NOT WOMEN), count up 
        else:
            catalog.iloc[i,7]+=0 #if the row seems to belong to women product, remain 0           
            
#Then, I want to flip the 0 and more than 1s in the 'is_womens_clothing' column to accurately portray women clothing
for i in range(0,len(catalog)):
    if catalog.iloc[i,7] >= 1:
        catalog.iloc[i,7] = 0
    else:
        catalog.iloc[i,7] = 1

In [19]:
#women clothing
catalog.loc[catalog['is_womens_clothing']==1,:].head()

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv,is_womens_clothing
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a,1
3,lilly pulitzer kids,little gir's & girl's ariana one-piece upf 50+...,pretty ruffle sleeves and trim elevate essenti...,"justkids/girls214/girls/swimwearcoverups,justk...",https://www.saksfifthavenue.com/lilly-pulitzer...,scoopneck\nadjustable straps\nflutter sleeves\...,'50':14a 'allov':28 'ariana':9a 'color':27 'el...,1
6,theory,teah stretch-silk camisole,"beige stretch-silk slips on 93% silk, 7% spand...",clothing / tops / tanks and camis,https://www.net-a-porter.com/us/en/product/119...,"fits true to size, take your normal size\ncut ...",'7':15 '93':13 'beig':7 'camisol':6a 'clean':1...,1
8,alexander wang,layered velvet mini dress,black velvet concealed hook and zip fastening ...,clothing / dresses / mini,https://www.net-a-porter.com/us/en/product/120...,"fits true to size, take your normal size \ndes...",'100':21 '35':18 '65':16 'alexand':1a 'back':1...,1
9,j.crew,wide leather belt,the ideal way to add definition to your favori...,belts,https://www.jcrew.com/p/womens_category/belts/...,,"'add':9 'belt':4a,17 'better':27 'custom':19 '...",1


In [20]:
#not women clothing
catalog.loc[catalog['is_womens_clothing']==0,:].head()

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv,is_womens_clothing
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '...",0
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':...",0
4,kissy kissy,baby girl's endearing elephants pima cotton co...,versatile convertible gown with elephant applique,justkids/baby024months/infantgirls/footiesrompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,v-neckline\nlong sleeves\nfront snap closure\n...,"'appliqu':17 'babi':3a 'convert':10a,13 'cotto...",0
5,jocelyn,savage love texty time leopard-print rabbit fu...,from the savage love collection. fingerless kn...,jewelryaccessories/accessories/gloves,https://www.saksfifthavenue.com/jocelyn-savage...,acrylic/wool\nfur type: dyed rabbit\nfur origi...,'ad':29 'collect':16 'craft':20 'fingerless':1...,0
7,ami paris,postcard patch hoodie,casual cotton-blend hoodie with an embossed la...,themensstore/apparel/sweatshirtshoodies18q1,https://www.saksfifthavenue.com/ami-paris-post...,attached drawstring hood\nlong sleeves\npullov...,"'ami':1a,15 'blend':9 'casual':6 'chest':21 'c...",0


In [21]:
print(f'{round(catalog.is_womens_clothing.sum()/catalog.is_womens_clothing.count(),4)*100}% of the dataset are women clothing item.')

50.81% of the dataset are women clothing item.


## Step 3. Identifying Category

In [22]:
catalog['product_category'] = ""
for i in range(0,len(catalog)): 
    if catalog.iloc[i,7] == 0:
        catalog.iloc[i,8] = 'na'
    
    else:
        bottom, onepiece, shoes, handbag, scarf = 0,0,0,0,0
        for j in [1,2,3,5,6]:
            cell = catalog.iloc[i,j]
            text = " ".join(re.findall(r'\w+', cell))
            
            
            #Basically, I kept track of the number of instances when the words appeared in each cell for each row. 
            #Then, I found out the part of clothing that had the most count for each of the row
            #Whatever has the most count would be considered as the category for the clothing. 
            #Otherwise, it would have been just whatever category that appeared the last in each row
            
            bottom += len(re.findall(r'\b(bottom|pants|skirt|jeans|shorts|leggings)\b',text))
            onepiece += len(re.findall(r'\b(one piece|onepiece|jumpsuit|romper|overall)\b',text))
            shoes += len(re.findall(r'\b(shoes|sneakers|boots|flats|heels|slippers|sandals)\b',text))
            handbag += len(re.findall(r'\b(handbag|bag|tote|crossbody})\b',text))
            scarf += len(re.findall(r'\b(scarf|wrap})\b',text))
            
            
        if bottom+onepiece+shoes+handbag+scarf == 0: #if everything is 0, just marked it as 'other'
            catalog.iloc[i,8] = 'other' 
            
        else:
            clothing = {'bottom': bottom, 'one piece': onepiece, 'shoes': shoes, 'handbag': handbag, 'scarf': scarf}
            winner = max(clothing, key=clothing.get)
            #https://stackoverflow.com/questions/23154821/how-to-find-out-which-variable-has-the-greatest-value
            catalog.iloc[i,8] = winner

In [23]:
print('Bottom: ',len(catalog.loc[catalog.product_category == 'bottom',:]))
print('One Piece: ',len(catalog.loc[catalog.product_category == 'one piece',:]))
print('Shoes: ',len(catalog.loc[catalog.product_category == 'shoes',:]))
print('Handbag: ',len(catalog.loc[catalog.product_category == 'handbag',:]))
print('Scarf: ',len(catalog.loc[catalog.product_category == 'scarf',:]))
print('Other: ',len(catalog.loc[catalog.product_category == 'other',:]))
print('NA: ',len(catalog.loc[catalog.product_category == 'na',:]))

Bottom:  3756
One Piece:  568
Shoes:  1738
Handbag:  807
Scarf:  289
Other:  14372
NA:  20843


## Step 4. Identifying Color

In [24]:
catalog['colors'] = ""
for i in range(0,len(catalog)): 
    if catalog.iloc[i,7] == 0:   #If not women clothing, I did not assign any color
        catalog.iloc[i,9] = 'na'
    
    elif catalog.iloc[i,8] == 'other': #Similarly, I did not assign any color to the unassigned clothing from the step above
        catalog.iloc[i,9] = 'na'
         
    else:
        beige, black, blue, brown, burgundy, gold, gray, green, multi, navy, neutral, orange, pink, purple, red, silver, teal, white, yellow = 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
        for j in [1,2,3,5,6]:
            cell = catalog.iloc[i,j]
            text = " ".join(re.findall(r'\w+', cell))
            
            beige += len(re.findall(r'\b(beige)\b',text))
            black += len(re.findall(r'\b(black)\b',text))
            blue += len(re.findall(r'\b(blue)\b',text))
            burgundy += len(re.findall(r'\b(burgundy)\b',text))
            gold += len(re.findall(r'\b(gold)\b',text))
            gray += len(re.findall(r'\b(gray)\b',text))
            green += len(re.findall(r'\b(green)\b',text))
            multi += len(re.findall(r'\b(multi)\b',text))
            navy += len(re.findall(r'\b(navy)\b',text))
            neutral += len(re.findall(r'\b(neutral)\b',text))
            orange += len(re.findall(r'\b(orange)\b',text))
            pink += len(re.findall(r'\b(pink)\b',text))
            purple += len(re.findall(r'\b(purple)\b',text))
            red += len(re.findall(r'\b(red)\b',text))
            silver += len(re.findall(r'\b(silver)\b',text))
            teal += len(re.findall(r'\b(teal)\b',text))
            white += len(re.findall(r'\b(white)\b',text))
            yellow += len(re.findall(r'\b(yellow)\b',text))           
            
        if beige+black+blue+brown+burgundy+gold+gray+green+multi+navy+neutral+orange+pink+purple+red+silver+teal+white+yellow == 0:
            catalog.iloc[i,9] = 'other'
            
        else: #finding the max count of color to associate the clothing with
            color = {'beige':beige, 'black':black, 'blue':blue, 'brown':brown, 'burgundy':burgundy, 'gold':gold, 'gray':gray, 'green':green, 'multi':multi, 'navy':navy, 'neutral':neutral, 'orange':orange, 'pink':pink, 'purple':purple, 'red':red, 'silver':silver, 'teal':teal, 'white':white, 'yellow':yellow}
            winner = max(color, key=color.get)
            #https://stackoverflow.com/questions/23154821/how-to-find-out-which-variable-has-the-greatest-value
            catalog.iloc[i,9] = winner

In [25]:
print('beige: ',len(catalog.loc[catalog.colors == 'beige',:]))
print('black: ',len(catalog.loc[catalog.colors == 'black',:]))
print('blue: ',len(catalog.loc[catalog.colors == 'blue',:]))
print('burgundy: ',len(catalog.loc[catalog.colors == 'burgundy',:]))
print('gold: ',len(catalog.loc[catalog.colors == 'gold',:]))
print('gray: ',len(catalog.loc[catalog.colors == 'gray',:]))
print('green: ',len(catalog.loc[catalog.colors == 'green',:]))


print('multi: ',len(catalog.loc[catalog.colors == 'multi',:]))
print('navy: ',len(catalog.loc[catalog.colors == 'navy',:]))
print('neutral: ',len(catalog.loc[catalog.colors == 'neutral',:]))
print('orange: ',len(catalog.loc[catalog.colors == 'orange',:]))
print('pink: ',len(catalog.loc[catalog.colors == 'pink',:]))
print('purple: ',len(catalog.loc[catalog.colors == 'purple',:]))
print('red: ',len(catalog.loc[catalog.colors == 'red',:]))

print('silver: ',len(catalog.loc[catalog.colors == 'silver',:]))
print('teal: ',len(catalog.loc[catalog.colors == 'teal',:]))
print('white: ',len(catalog.loc[catalog.colors == 'white',:]))
print('yellow: ',len(catalog.loc[catalog.colors == 'yellow',:]))
print('other: ',len(catalog.loc[catalog.colors == 'other',:]))
print('na: ',len(catalog.loc[catalog.colors == 'na',:]))

beige:  10
black:  341
blue:  133
burgundy:  4
gold:  170
gray:  11
green:  36
multi:  72
navy:  42
neutral:  69
orange:  6
pink:  54
purple:  1
red:  36
silver:  37
teal:  1
white:  216
yellow:  6
other:  5913
na:  35215


In [26]:
catalog

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv,is_womens_clothing,product_category,colors
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '...",0,na,na
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a,1,other,na
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':...",0,na,na
3,lilly pulitzer kids,little gir's & girl's ariana one-piece upf 50+...,pretty ruffle sleeves and trim elevate essenti...,"justkids/girls214/girls/swimwearcoverups,justk...",https://www.saksfifthavenue.com/lilly-pulitzer...,scoopneck\nadjustable straps\nflutter sleeves\...,'50':14a 'allov':28 'ariana':9a 'color':27 'el...,1,one piece,other
4,kissy kissy,baby girl's endearing elephants pima cotton co...,versatile convertible gown with elephant applique,justkids/baby024months/infantgirls/footiesrompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,v-neckline\nlong sleeves\nfront snap closure\n...,"'appliqu':17 'babi':3a 'convert':10a,13 'cotto...",0,na,na
...,...,...,...,...,...,...,...,...,...,...
42368,mara hoffman,atlas oversized belted mélange wool coat,mélange beige and cream wool button fastenings...,clothing / coats / long,https://www.net-a-porter.com/us/en/product/117...,"fits true to size, take your normal size \ndes...",'100':21 'atlas':3a 'beig':10 'belt':5a 'breas...,1,other,na
42369,philosophy di lorenzo serafini,cropped crochet-trimmed georgette top,"cream georgette ties at neck, concealed hook f...",clothing / tops / blouses,https://www.net-a-porter.com/us/en/product/111...,"fits true to size, take your normal size \nint...",'100':21 'back':20 'conceal':16 'cream':11 'cr...,1,other,na
42370,vanessa bruno,juna cotton-corduroy mini skirt,sand cotton-corduroy concealed hook and zip fa...,clothing / skirts / mini,https://www.net-a-porter.com/us/en/product/116...,"fits true to size, take your normal size \ntho...",'100':20 '35':25 '65':23 'acet':24 'back':19 '...,1,bottom,other
42371,eve denim,annabel rigid mid-rise skinny jean,although mom jeans and boyfriend jeans are all...,women:clothing:jeans,https://pink.modaoperandi.com/eve-denim-r20/an...,button and zip fastening \ncomposition: 98% co...,"'add':36 'although':10 'annabel':3a,40 'boyfri...",1,bottom,other


As the instruction states, the output, "catalog" is the same dataset with 3 additional columns defined in previous steps.