# HW3

Submit via Slack. Due on Tuesday, April 13th, 2020, 6:29pm PST. You may work with one other person.

## TF-IDF

You are an analyst working at McDonalds as a store operations analyst, and charged with identifying areas for improvement for each franchise. Several metropolitan locations have been suffering recently from lower reviews.

Using the **mcdonalds-yelp-negative-reviews.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?
- which words to collocate together?

Finally, generate a TF-IDF report that either **visualizes** or explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# import nltk
# nltk.download('wordnet')

In [39]:
mac_df = pd.read_csv("mcdonalds-yelp-negative-reviews.csv", encoding="latin1")

In [40]:
mac_df.head()

Unnamed: 0,_unit_id,city,review
0,679455653,Atlanta,"I'm not a huge mcds lover, but I've been to be..."
1,679455654,Atlanta,Terrible customer service. I came in at 9:30pm...
2,679455655,Atlanta,"First they ""lost"" my order, actually they gave..."
3,679455656,Atlanta,I see I'm not the only one giving 1 star. Only...
4,679455657,Atlanta,"Well, it's McDonald's, so you know what the fo..."


In [41]:
# Make all lowercase
mac_df["review"] = mac_df["review"].str.lower()

In [42]:
vectorizer = CountVectorizer(stop_words="english", binary=True)

X = vectorizer.fit_transform(mac_df["review"])

In [43]:
vec_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names()).T

vec_df["num_count"] = vec_df.sum(axis=1)

In [44]:
vec_df.sort_values("num_count", ascending=False)\
    .head(20)
#     .drop(list(set(stopwords.words('english'))), errors="ignore")

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1516,1517,1518,1519,1520,1521,1522,1523,1524,num_count
food,0,1,0,0,1,0,1,0,1,1,...,1,0,0,0,0,1,1,1,0,574
order,1,0,1,0,1,0,0,1,1,1,...,1,1,1,1,0,0,0,1,1,515
mcdonald,0,0,0,0,1,1,1,0,0,1,...,0,0,0,0,0,0,1,1,0,486
drive,1,0,0,0,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,473
service,0,1,0,0,1,0,0,0,1,0,...,0,0,0,0,0,1,0,1,0,423
just,0,1,1,0,0,0,0,0,0,1,...,0,1,0,1,0,0,1,1,1,419
time,1,0,0,0,1,0,0,1,1,1,...,0,1,0,0,0,1,0,1,0,394
mcdonalds,0,1,0,0,0,0,0,0,1,0,...,0,0,0,1,1,1,0,0,0,389
like,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,1,361
place,0,0,0,0,1,0,0,0,0,1,...,0,0,0,1,0,0,0,1,1,350


From the CountVectorizer, we can notice that many reviews contains "food", "order", "service", and "time"

## Regex Cleaning

In [None]:
# Hamburger Variation
mac_df["review"] = mac_df["review"].str.replace(r"\w*\s*burgers?", "burger")

In [None]:
# Big Macs
mac_df["review"] = mac_df["review"].str.replace(r"big\s*macs?", "burger")

Change types of burgers into burger to see how burgers served are reviewed

In [None]:
# McDonald's
mac_df["review"] = mac_df["review"].str.replace(r"(?:\bmcdonald(?:'?s?)?\b)|(?:\bmcds?\b)", "mcdonald")

Changed variations of McDonald's to mcdonald to add to stopwords later

In [None]:
# Punctuation Removal
mac_df["review"] = mac_df["review"].str.replace(r"[!|@|#|$|%|^|&|*|(|)|+|<|>|?|:|.|,|;|\"|\'|\\]", ' ')

In [None]:
# Whitespace
mac_df["review"] = mac_df["review"].str.replace(r"\s{2,}", ' ')

In [None]:
# Numbers
mac_df["review"] = mac_df["review"].str.replace(r"\d+\S*\d*\w*", "NUM_TOKEN")

Miscellaneous Regex

## Stemming

In [None]:
# stemmer = PorterStemmer()
stemmer = SnowballStemmer("english")

In [None]:
def stmmer_func(review):
    tokens = [stemmer.stem(token) for token in review.split()]
    return ' '.join(tokens)

Reason for choosing SnowballStemmer: 

In [None]:
mac_df["city"].fillna("Unknown", inplace=True)
grouped = mac_df.groupby("city")
df_dict = {}
cities = mac_df.city.unique()

In [None]:
# Make DataFrames for each metropolitan
for city in cities:
    df_new = grouped.get_group(city)
    df_dict[city] = df_new.reset_index()

In [None]:
df_dict["Las Vegas"]

## Customize Stopwords

In [None]:
stop_words = stopwords.words('english').remove([])

## TF-IDF

In [None]:
tf_idf = TfidfVectorizer(input="mcdonalds-yelp-negative-reviews.csv",
                         encoding="latin1",
                         lowercase=True,
                         stop_words=stopwords.words('english'))

n-gram justification:

## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

`colors` should be a list of colors, since it is possible for a piece of clothing to have multiple colors.

In [2]:
import re

In [3]:
prod_df = pd.read_csv("truncated_catalog.csv")

In [4]:
cols = prod_df.columns.to_list()

for col in cols:
    prod_df[col] = prod_df[col].str.lower()

In [5]:
# Fill Null Values
prod_df.iloc[:, :7] = prod_df.iloc[:, :7].fillna('')

In [6]:
prod_df.head(3)

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '..."
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':..."


## is_womens_clothing

In [7]:
# Regex for capturing women related words 
woman_exp = "\bwi(?:fe|ves)|girls?|wom(?:a|e)n|lad(?:y|ies)|madams?|brides?|widows?|females?|femini\w*|maternal\w*|moms?\b"

# Search all columns
for col in cols:
    prod_df[f"is_womens_clothing_{col}"] = False
    
    # Find if women related words exist in the column
    prod_df[f"is_womens_clothing_{col}"] = prod_df[col].str.contains(woman_exp, case=False, flags=re.IGNORECASE, regex=True)
        
    print(f"{col} searched")
    
# If any of is_womens_clothing is True, then is_womens_clothing is True. Otherwise False
prod_df[f"is_womens_clothing"] = prod_df.iloc[:, 7:].any(axis=1)

# Drop other intermediate columns
col_to_drop = prod_df.iloc[0, 7:-1].index.to_list()
prod_df.drop(col_to_drop, axis=1, inplace=True)

brand searched
name searched
description searched
brand_category searched
brand_canonical_url searched
details searched
tsv searched


## product_category

In [8]:
# Expressions
bottom_exp = "(?:baggies|bottom|pant|jean|cord|chino|denim|legging|overall|short|trouser)(?:es|s)?"
one_piece_exp = "\bone[\S|\s]?piece|\w*dress|all[\S|\s]?in[\S|\s]?one\b"
shoe_exp = "(?:shoe|boot|cleat|hopper|trainer|flat|flip[\S|\s]?flop|heel|pump|slide|slipper|skate|sneaker|wedge)(?:s|es)?"
handbag_exp = "(?:\w* ?bags?|clutch(?:es)?|satchels?)"
scarf_exp = "(?:\w* ?scar(?:f|(?:ves))?|snoods?|stoles?|boas?|sarongs?)"

cats_list = ["Bottom", "One_Piece", "Shoe", "Handbag", "Scarf"]
exps_list = [bottom_exp, one_piece_exp, shoe_exp, handbag_exp, scarf_exp]

# For each product category
for cat, exp in zip(cats_list, exps_list):
    prod_df.loc[:,cat] = 0
    
    for col in cols:
        # Add the number of occurrences in all columns
        prod_df[cat] = prod_df[col].str.findall(exp, flags=re.IGNORECASE).apply(lambda x: len(x))
        print(f"{cat} in {col} searched")

Bottom in brand searched
Bottom in name searched
Bottom in description searched
Bottom in brand_category searched
Bottom in brand_canonical_url searched
Bottom in details searched
Bottom in tsv searched
One_Piece in brand searched
One_Piece in name searched
One_Piece in description searched
One_Piece in brand_category searched
One_Piece in brand_canonical_url searched
One_Piece in details searched
One_Piece in tsv searched
Shoe in brand searched
Shoe in name searched
Shoe in description searched
Shoe in brand_category searched
Shoe in brand_canonical_url searched
Shoe in details searched
Shoe in tsv searched
Handbag in brand searched
Handbag in name searched
Handbag in description searched
Handbag in brand_category searched
Handbag in brand_canonical_url searched
Handbag in details searched
Handbag in tsv searched
Scarf in brand searched
Scarf in name searched
Scarf in description searched
Scarf in brand_category searched
Scarf in brand_canonical_url searched
Scarf in details searched


In [9]:
# Finds the category that has the highest score
prod_df["product_category"] = prod_df.iloc[:, 8:].apply(lambda x: x.idxmax() if x.sum() != 0 else None, axis=1)

Acknowledgement: idxmax fails to identify the category when there is a tie. idxmax fails to break the tie as it chooses the index of former tie. For example, if both shoe and handbag show up once in a product, it idxmax will choose shoe instead of handbag, which may not be true.

In [10]:
prod_df.drop(["Bottom", "One_Piece", "Shoe", "Handbag", "Scarf"], axis=1, inplace=True)

In [11]:
prod_df.head(3)

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv,is_womens_clothing,product_category
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '...",False,Shoe
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a,False,
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':...",False,


## colors

In [19]:
color_exp = "(?:Beige|Black|Blue|Brown|Burgund(?:y|ies)|Gold|Gra(?:y|ies)|Green|Multi|Nav(?:y|ies)|Neutral|Orange|Pink|Purple|Red|Silver|Teal|White|Yellow)s?"

for col in cols:
    prod_df[f"colors_{col}"] = None
    
    # Find colors
    prod_df[f"colors_{col}"] = prod_df[col].str.findall(color_exp, flags=re.IGNORECASE).apply(lambda x: ''.join(x))
        
    print(f"{col} searched")

brand searched
name searched
description searched
brand_category searched
brand_canonical_url searched
details searched
tsv searched


In [21]:
# Append colors to the list
prod_df["colors"] = prod_df.iloc[:, 9:].apply(lambda x: list(set([color for color in x if color != ''])), axis=1)

In [23]:
# Drop columns
prod_df.drop(["colors_brand", "colors_name", "colors_description", "colors_brand_category",
              "colors_brand_canonical_url", "colors_details", "colors_tsv"],
             axis=1, inplace=True)

In [28]:
# Change color to "Multi" if multiple colors in a product
prod_df["colors"] = prod_df["colors"].apply(lambda x: x if len(x) < 2 else ["Multi"])
prod_df["colors"] = prod_df["colors"].apply(lambda x: None if len(x) == 0 else x)

In [29]:
prod_df

Unnamed: 0,brand,name,description,brand_category,brand_canonical_url,details,tsv,is_womens_clothing,product_category,colors
0,fila,original fitness sneakers,vintage fitness leather sneakers with logo pri...,themensstore/shoes/sneakers/lowtop,https://www.saksfifthavenue.com/fila-original-...,leather/synthetic upper\nlace-up closure\ntext...,"'design':12 'fila':1a 'fit':3a,6 'leather':7 '...",False,Shoe,
1,chanel,hat,,unknown,https://www.saksfifthavenue.com/chanel-hat/pro...,wool tweed & felt,'chanel':1a 'hat':2a,False,,
2,frame,petit oval buckle belt,a timeless leather belt crafted from smooth co...,accessories,https://frame-store.com/products/petit-oval-bu...,,"'belt':5a,9 'buckl':4a,21 'cowhid':13 'craft':...",False,,[Multi]
3,lilly pulitzer kids,little gir's & girl's ariana one-piece upf 50+...,pretty ruffle sleeves and trim elevate essenti...,"justkids/girls214/girls/swimwearcoverups,justk...",https://www.saksfifthavenue.com/lilly-pulitzer...,scoopneck\nadjustable straps\nflutter sleeves\...,'50':14a 'allov':28 'ariana':9a 'color':27 'el...,True,,
4,kissy kissy,baby girl's endearing elephants pima cotton co...,versatile convertible gown with elephant applique,justkids/baby024months/infantgirls/footiesrompers,https://www.saksfifthavenue.com/kissy-kissy-ba...,v-neckline\nlong sleeves\nfront snap closure\n...,"'appliqu':17 'babi':3a 'convert':10a,13 'cotto...",True,,
...,...,...,...,...,...,...,...,...,...,...
42368,mara hoffman,atlas oversized belted mélange wool coat,mélange beige and cream wool button fastenings...,clothing / coats / long,https://www.net-a-porter.com/us/en/product/117...,"fits true to size, take your normal size \ndes...",'100':21 'atlas':3a 'beig':10 'belt':5a 'breas...,False,,[beige]
42369,philosophy di lorenzo serafini,cropped crochet-trimmed georgette top,"cream georgette ties at neck, concealed hook f...",clothing / tops / blouses,https://www.net-a-porter.com/us/en/product/111...,"fits true to size, take your normal size \nint...",'100':21 'back':20 'conceal':16 'cream':11 'cr...,False,,
42370,vanessa bruno,juna cotton-corduroy mini skirt,sand cotton-corduroy concealed hook and zip fa...,clothing / skirts / mini,https://www.net-a-porter.com/us/en/product/116...,"fits true to size, take your normal size \ntho...",'100':20 '35':25 '65':23 'acet':24 'back':19 '...,False,Bottom,
42371,eve denim,annabel rigid mid-rise skinny jean,although mom jeans and boyfriend jeans are all...,women:clothing:jeans,https://pink.modaoperandi.com/eve-denim-r20/an...,button and zip fastening \ncomposition: 98% co...,"'add':36 'although':10 'annabel':3a,40 'boyfri...",True,Bottom,[pink]
