# HW3

Submit via Slack. Due on Tuesday, April 13th, 2020, 6:29pm PST. You may work with one other person.

In [None]:
import plotly.express as px
import matplotlib.pyplot as plt
from collections import Counter,OrderedDict
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from nltk import word_tokenize
lemmatizer = WordNetLemmatizer()
stemmer=nltk.stem.porter.PorterStemmer()

## TF-IDF

You are a store operations analyst at McDonald's, charged with identifying areas for improvement for each franchise. Several metropolitan locations have recently been suffering from lower reviews.

Using the **mcdonalds-yelp-negative-reviews.csv** dataset, clean and parse the text reviews. Explain the decisions you make:
- why remove/keep stopwords?
- which stopwords to remove?
- stemming versus lemmatization?
- regex cleaning and substitution?
- adding in custom stopwords?
- what `n` for your `n-grams`?

Finally, generate a TF-IDF report that either **visualizes** or explains for a business (non-technical) stakeholder:
* the features your analysis showed that customers cited as reasons for a poor review
* the most common issues identified from your analysis that generated customer dissatisfaction.

Explain to what degree the TF-IDF findings make sense - what are its limitations?

In [None]:
reviews=pd.read_csv('mcdonalds-yelp-negative-reviews.csv',encoding='latin1')
reviews.head()

In [None]:
reviews['review']=reviews['review'].apply(lambda x: " ".join([stemmer.stem(i) for i in x.split(" ")]))

In [None]:
# Use a new name so the imported nltk `stopwords` module is not overwritten
# (re-running this cell would otherwise fail)
stop_words=list(stopwords.words('english'))
# Custom stopwords: brand and generic words that carry no signal here
stop_words.extend(['mcd','mcds','mcdonald','mcdonalds','restaurant','restaurants','order','orders'])
# Keep negations, which change the meaning of a complaint
for w in ["don't","shouldn't","didn't",'no','not']:
    stop_words.remove(w)
# The reviews were stemmed above, so include stemmed forms too (e.g. 'restaurant' -> 'restaur')
stop_words=sorted(set(stop_words)|{stemmer.stem(w) for w in stop_words})

In [None]:
vectorizer=TfidfVectorizer(ngram_range=(3,4),token_pattern=r'\b[a-zA-Z]{3,}\b',max_df=0.4,stop_words=stop_words)
corpus=list(reviews['review'].values)
X=vectorizer.fit_transform(corpus)
terms=vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
tf_idf=pd.DataFrame(X.toarray().transpose(),index=terms)
tf_idf=tf_idf.sum(axis=1)
score=pd.DataFrame(tf_idf,columns=['score'])
score.sort_values(by='score',ascending=False,inplace=True)
top25=score.head(25)
top25

The code above produces a list of n-grams ranked by summed TF-IDF score, which summarizes the major recurring themes of the negative reviews. From the top 25, a few patterns stand out. The drive-thru appears to be a major bottleneck and a frequent complaint for unhappy McDonald's customers. Customer service and cleanliness also appear unsatisfactory to these reviewers. Among McDonald's products, ice cream and breakfast (meals with eggs) seem to be pain points for some.

In [None]:
fig=px.bar(top25,y='score')
fig.show()

- I removed stopwords to cut noise from the review set. I started from the English stopword list, took the negations ("don't", "shouldn't", "didn't", "no", "not") back out of it to preserve some of the meaning of the reviews, and then added a few custom stopwords. For example, I removed the variations of "McDonald's" and "order" because they add little value in this context. One caveat: with the `token_pattern` used, two-letter tokens such as "no" are dropped and contractions are split at the apostrophe, so in practice "not" is the negation most likely to survive.
- I used stemming because it reduces dimensionality more than lemmatization. Looking only at word stems fits the business use case here, and we lose little interpretability compared to lemmatization.
- The regex cleaning was handled mainly by the `token_pattern` argument of `TfidfVectorizer`, which keeps only alphabetic tokens of three or more letters. Stemming also helped by collapsing plurals and similar variants.
- I set `ngram_range` to (3, 4), so each feature is a phrase of three or four words. Phrases of that length carry enough context to capture the major themes throughout the reviews.
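
To make the tokenization and n-gram choices above concrete, here is a minimal stdlib-only sketch of roughly what `TfidfVectorizer` does before scoring: lowercase the text, extract tokens with the `token_pattern`, drop stop words, then build n-grams from what is left. The sample review and the tiny stop list are made up for illustration; the real run uses the full list built earlier.

```python
import re

TOKEN_PATTERN = r'\b[a-zA-Z]{3,}\b'   # same pattern passed to TfidfVectorizer above
STOP_WORDS = {'the', 'was', 'and', 'order', 'don'}  # illustrative subset, not the full list

def ngrams(text, n_min=3, n_max=4):
    """Lowercase, tokenize, drop stop words, then build word n-grams."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(kept) - n + 1):
            grams.append(' '.join(kept[i:i + n]))
    return grams

sample = "The drive thru was slow and they don't care, no fries"
print(re.findall(TOKEN_PATTERN, sample.lower()))  # tokens before stop-word removal
print(ngrams(sample))                             # 3- and 4-grams after removal
```

Two things show up in the output. The apostrophe splits `don't` into `don` plus a dropped `t`, and `no` is too short for the pattern, so neither negation reaches the n-grams in this sketch. The n-grams also bridge over removed words, so a phrase like "thru slow they" never appeared verbatim in the review.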

TF-IDF does have some limitations. Its scoring assumes that a term matters most when it appears often within a particular document (review) but rarely across the rest of the corpus. If the complaints in the McDonald's reviews do not follow that pattern, for example when the same issue is raised in most reviews and is therefore down-weighted, the scores and the takeaways drawn from them could be misleading.
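
As a rough illustration of this weighting, the sketch below computes an unnormalized TF-IDF by hand on a made-up three-review corpus and then sums the scores per term, as the report does. It uses the smoothed IDF that scikit-learn applies by default, idf = ln((1 + n) / (1 + df)) + 1, but skips the L2 row normalization, so the numbers will not match `TfidfVectorizer` output exactly.

```python
import math
from collections import Counter

# Made-up, already-tokenized reviews (assumption for illustration only)
docs = [
    ['slow', 'drive', 'thru', 'slow'],
    ['cold', 'fries', 'drive', 'thru'],
    ['rude', 'staff', 'drive', 'thru'],
]

n_docs = len(docs)
# Document frequency: in how many reviews does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times smoothed IDF (no normalization)."""
    tf = Counter(doc)
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}

# Sum the per-review scores over the whole corpus
totals = Counter()
for doc in docs:
    totals.update(tfidf(doc))

for term, score in totals.most_common():
    print(f'{term:6s} {score:.3f}')
```

In this toy corpus, `drive` and `thru` appear in every review, so their IDF is at its minimum of 1, yet their summed score is still close to that of `slow`. Summing over reviews partly reintroduces raw frequency, which is worth keeping in mind when reading the top-25 list.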

## Product Attribution (Feature Engineering and Regex Practice)

Download the [dataset](https://dso-560-nlp-text-analytics.s3.amazonaws.com/truncated_catalog.csv) from the class S3 bucket (`dso560-nlp-text-analytics`).

In preparation for the group project, our client company has provided a dataset of women's clothing products they are considering cataloging. 

1. Filter for only **women's clothing items**.

2. For each clothing item:

* Identify its **category**:
```
Bottom
One Piece
Shoe
Handbag
Scarf
```
* Identify its **color**:
```
Beige
Black
Blue
Brown
Burgundy
Gold
Gray
Green
Multi 
Navy
Neutral
Orange
Pinks
Purple
Red
Silver
Teal
White
Yellow
```

Your output will be the same dataset, except with **3 additional fields**:
* `is_womens_clothing`
* `product_category`
* `colors`

`colors` should be a list of colors, since it is possible for a piece of clothing to have multiple colors.

In [None]:
clothes=pd.read_csv('truncated_catalog.csv',encoding='latin1')
clothes=clothes.rename(columns={"ï»¿brand": "brand"})
clothes.head()

In [None]:
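# Keep the pattern on one line: inside a raw string, a backslash-newline
# (and the indentation after it) becomes part of the regex
WOMEN_PATTERN=re.compile(r"women|woman|girl|ladies|lady|unisex",flags=re.IGNORECASE)

def women(x):
    # Substring search, so "women's", "womens", "girls", etc. are covered
    return isinstance(x,str) and bool(WOMEN_PATTERN.search(x))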
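
A note on regex patterns that span several source lines: inside a raw string, a trailing backslash does not join the lines. The backslash, the newline, and the indentation all stay in the pattern, so the alternative that starts the next line only matches text preceded by that exact whitespace. The small check below (sample strings are made up) shows the effect and one way around it using `re.VERBOSE`.

```python
import re

# Pattern split with a backslash-newline inside a raw string
broken = r'(women|\
            woman)'

# re.VERBOSE ignores whitespace and newlines written in the pattern itself
verbose = re.compile(r'''
    women
  | woman
''', flags=re.IGNORECASE | re.VERBOSE)

print('\n' in broken)                                           # newline is part of the pattern
print(re.findall(broken, 'Woman shoes', flags=re.IGNORECASE))   # second alternative cannot match
print(verbose.findall('Woman shoes'))
```

Under `re.VERBOSE`, a literal space such as the one in `flip flop` has to be written as `[ ]` or escaped, since unescaped whitespace is ignored.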

In [None]:
clothes['is_womens_clothing']=clothes['name'].apply(women)

In [None]:
for i in ['description','brand_category','brand_canonical_url','details']:
    clothes.loc[(clothes['is_womens_clothing']==False) & (clothes[i].notnull()),'is_womens_clothing']=\
        clothes.loc[(clothes['is_womens_clothing']==False) & (clothes[i].notnull()),i].apply(women)

In [None]:
# Checked in order; the first matching category wins. Each pattern sits on one
# line because a backslash-newline inside a raw string stays in the regex.
# Substring search, so plurals ('skirts', 'boots', 'tote bags') are covered.
CATEGORY_PATTERNS=[
    ('Bottom',r'pants|shorts|skirt|bottom|jeans|sweats'),
    ('One Piece',r'dress|one[ -]piece'),
    ('Shoe',r'shoe|heel|sandal|sneaker|boot|flip flop|cleat'),
    ('Handbag',r'handbag|purse|bag|clutch'),
    ('Scarf',r'scarf|scarves'),  # 'skirt' removed here: it belongs to Bottom
]

def category(x):
    for label,pattern in CATEGORY_PATTERNS:
        if re.search(pattern,x,flags=re.IGNORECASE):
            return label
    return None  # no category found

In [None]:
clothes.loc[clothes['name'].notnull(),'product_category']=clothes.loc[clothes['name'].notnull(),'name'].apply(category)

In [None]:
for i in ['description','brand_category','brand_canonical_url','details']:
    clothes.loc[(clothes[i].notnull()) & (clothes['product_category']\
        .isnull()),'product_category']=clothes.loc[(clothes[i].notnull())\
        & (clothes['product_category'].isnull()),i].apply(category)

In [None]:
clothes['product_category'].value_counts().plot.bar()
plt.xticks(rotation=0)

In [None]:
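COLORS=['Beige','Black','Blue','Brown','Burgundy','Gold','Gray','Green','Multi','Navy','Neutral',
        'Orange','Pinks','Purple','Red','Silver','Teal','White','Yellow']
# Search terms that differ from the label itself
COLOR_TERMS={'Gray':r'gr[ae]y','Pinks':r'pinks?','Multi':r'multi\w*'}

def color(x):
    colors=[]
    for i in COLORS:
        term=COLOR_TERMS.get(i,i)
        # \b keeps 'Red' from matching inside words such as 'tailored'
        if re.search(rf'\b{term}\b',x,flags=re.IGNORECASE):
            colors.append(i)
    return colors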

In [None]:
clothes.loc[clothes['name'].notnull(),'colors']=clothes.loc[clothes['name'].notnull(),'name'].apply(color)

In [None]:
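for i in ['description','brand_category','brand_canonical_url','details']:
    # color() returns [] when nothing matches, so treat empty lists as missing too;
    # isnull() alone would skip rows whose name simply contained no color
    missing=~(clothes['colors'].str.len()>0)
    clothes.loc[(clothes[i].notnull()) & missing,'colors']=\
        clothes.loc[(clothes[i].notnull()) & missing,i].apply(color)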

In [None]:
clothes.loc[clothes['colors'].notnull(),'colors']=clothes.loc[clothes['colors'].notnull(),'colors']\
    .apply(lambda x: None if len(x)==0 else x)

In [None]:
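# Lists are unhashable, so convert each to a tuple before counting combinations
clothes['colors'].dropna().apply(tuple).value_counts().head(10).plot.barh()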
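
Because each `colors` value is a list, counting whole values gives counts per color combination. If per-color counts are more useful, one option is to flatten the lists first. The sketch below does this with `collections.Counter` on a made-up stand-in for the column; `sample_colors` is hypothetical and not taken from the dataset.

```python
from collections import Counter

# Hypothetical stand-in for clothes['colors']: lists of labels, None where no color was found
sample_colors = [['Black'], ['Black', 'White'], None, ['Red'], ['White', 'Black']]

# Per-combination counts; sorting makes ['Black', 'White'] and ['White', 'Black'] the same key
combo_counts = Counter(tuple(sorted(colors)) for colors in sample_colors if colors)

# Per-color counts, flattening every list
color_counts = Counter(c for colors in sample_colors if colors for c in colors)

print(combo_counts.most_common())
print(color_counts.most_common())
```

On the real DataFrame, `clothes['colors'].explode().value_counts()` should give the same kind of per-color tally.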

In [None]:
clothes