# Text Preprocessing:
This notebook covers the aggregation and cleaning of all the review text in preparation for modeling, as well as initial exploratory data analysis.

- Natural Language Processing (NLP) requires a lot of text preprocessing and cleaning before any modeling begins. This is the most crucial step in determining the quality of output.
- Initial preprocessing includes removing unnecessary characters and extra whitespace.
- The bulk of the processing will involve tokenizing sentences into words, removing stopwords, part-of-speech tagging, and stemming/lemmatizing words.

### Import necessary packages:

In [1]:
import pandas as pd
import re
import glob
import pickle

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.ldamodel import LdaModel
from gensim.test.utils import datapath
import spacy # python -m spacy download en_core_web_sm
import en_core_web_sm

import plotly.graph_objs as go
import chart_studio.plotly as py
import cufflinks # !pip install ipywidgets for cufflinks to work
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
import plotly.figure_factory as ff
from plotly.offline import iplot
import bokeh

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

### Assembling the base dataframe:

Earlier, we developed a web scraping script that captured reviews from Glassdoor and saved them as separate CSV files (~40 hours of web scraping to obtain all the data!). We can use glob to combine all the company review CSV files into a single dataframe:

In [3]:
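# read in all .csv files in our data folder, including any subfolders
# ('**' only recurses when it is its own path component, so the pattern
# is written as path/**/*.csv rather than path + "**/*.csv")
path = '../data/reviews'
files = glob.glob(path + '/**/*.csv', recursive=True)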

# using pandas to assemble individual DFs and combine together
li = [pd.read_csv(f) for f in files]
all_reviews = pd.concat(li, axis=0, ignore_index=True, sort=True) # sort=True b/c columns are in diff orders per df
all_reviews = all_reviews.drop(columns = ['Unnamed: 0','Unnamed: 0.1'])
all_reviews = all_reviews.dropna()
all_reviews.reset_index(drop=True, inplace=True)

# save output
all_reviews.to_csv('../../all_reviews_glob.csv')
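
A note on the file pattern: in Python's `glob`, `**` only descends into nested directories when it stands alone as a path component and `recursive=True` is set. The sketch below builds a throwaway temporary directory with hypothetical file names (`a.csv` and `batch_2/b.csv`, not the real scraped files) and compares a pattern where `**` is glued to the folder name against one where it is its own component.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # hypothetical layout: reviews/a.csv and reviews/batch_2/b.csv
    reviews_dir = os.path.join(root, "reviews")
    nested_dir = os.path.join(reviews_dir, "batch_2")
    os.makedirs(nested_dir)
    for folder, name in [(reviews_dir, "a.csv"), (nested_dir, "b.csv")]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("company,pros,cons,rating,title\n")

    # '**' glued to the folder name behaves like an ordinary '*' wildcard
    glued = glob.glob(reviews_dir + "**/*.csv", recursive=True)
    # '**' as its own path component also descends into subfolders
    separate = glob.glob(os.path.join(reviews_dir, "**", "*.csv"), recursive=True)

    glued_names = sorted(os.path.basename(p) for p in glued)
    separate_names = sorted(os.path.basename(p) for p in separate)

print(glued_names)     # files found directly inside reviews/
print(separate_names)  # files found at any depth below reviews/
```

If all the CSV files sit directly in one folder, both patterns find the same files; the difference only matters once the data is split into subfolders.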

In [14]:
all_reviews = pd.read_csv('../../all_reviews.csv').drop(columns = ['Unnamed: 0'])
all_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258358 entries, 0 to 258357
Data columns (total 5 columns):
company    258358 non-null object
cons       258358 non-null object
pros       258358 non-null object
rating     258358 non-null float64
title      258258 non-null object
dtypes: float64(1), object(4)
memory usage: 9.9+ MB


- As we can see, there are 258,358 total company reviews. Every column is complete except `title`, which has 258,258 non-null entries (100 missing titles).
- We still need to clean our text by removing line endings, unwanted spaces, and special characters.

In [18]:
# remove line endings and extra whitespace left over from the HTML (i.e. "\n", "\r")
all_reviews.pros = all_reviews.pros.map(lambda x: re.sub(r'\s+', ' ', x))
all_reviews.cons = all_reviews.cons.map(lambda x: re.sub(r'\s+', ' ', x))
# remove special characters and lowercase all text
# (title is wrapped in str() because some titles are missing; those become the string 'nan')
all_reviews.title = all_reviews.title.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', str(x).lower()))
all_reviews.pros = all_reviews.pros.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', x.lower()))
all_reviews.cons = all_reviews.cons.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n.]', '', x.lower()))
# add word count columns
all_reviews['word_count_pros'] = all_reviews.pros.map(lambda x: len(x.split()))
all_reviews['word_count_cons'] = all_reviews.cons.map(lambda x: len(x.split()))
all_reviews['total_word_count'] = all_reviews.word_count_pros + all_reviews.word_count_cons
# let's look at pros vs cons word count disparity
all_reviews['word_count_disparity'] = all_reviews.word_count_pros - all_reviews.word_count_cons

In [19]:
all_reviews.head(10)

| | company | cons | pros | rating | title | word_count_pros | word_count_cons | total_word_count | word_count_disparity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Sysco | no flexibility micromanage no room for growth. | good benefits nice values nice co workers nice... | 3.0 | could be better | 9 | 7 | 16 | 2 |
| 1 | Sysco | its a lot of manual labor | its great pay amp steady work | 5.0 | great place to work | 6 | 6 | 12 | 0 |
| 2 | Sysco | appreciation for good work is no longer celebr... | great worklife balance limitless income potential | 3.0 | typical corporate company. | 6 | 17 | 23 | -11 |
| 3 | Sysco | customers will call 7 days a week and nights t... | money benefits flexibility of schedule great u... | 4.0 | marketing associate | 8 | 23 | 31 | -15 |
| 4 | Sysco | with the recent layoffs lacks stability | excellent benefits competitive within the market | 5.0 | hr specialist | 6 | 6 | 12 | 0 |
| 5 | Sysco | no work life balance pay is not that good | flexible working hours and good size office | 2.0 | analyst | 7 | 9 | 16 | -2 |
| 6 | Sysco | some management and up are not ethical human b... | the people for the most part are good people. | 2.0 | not worth it | 9 | 32 | 41 | -23 |
| 7 | Sysco | stop training people on the same products mont... | continual training good benefits. lots of grea... | 3.0 | sales | 8 | 18 | 26 | -10 |
| 8 | Sysco | ever changing and fast paced goes both ways | great pay fast paced challenging ever changing... | 5.0 | great growing and proven leader in food service | 22 | 8 | 30 | 14 |
| 9 | Sysco | the warehouse can be difficult | very good pay and work schedule | 4.0 | great team work | 6 | 5 | 11 | 1 |
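
To see what the cleaning step does to a single string, here is a small standard-library sketch. The helper name `clean_review`, its `unescape_html` flag, and the raw sentence are all assumptions made for illustration; the helper strips special characters first and collapses whitespace afterwards, which differs slightly in order from the cell above. It also shows one plausible explanation for the stray `amp` token in row 1 of the preview: an HTML entity such as `&amp;` loses its `&` and `;` and leaves `amp` behind.

```python
import html
import re

def clean_review(text, unescape_html=False):
    """Lowercase a review and strip special characters (hypothetical helper)."""
    if unescape_html:
        text = html.unescape(text)                 # "&amp;" -> "&"
    text = re.sub(r"[^a-zA-Z0-9 \n.]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace

raw = "It's great pay &amp; steady\r\n work!"       # made-up raw review text
print(clean_review(raw))                           # entity residue survives as "amp"
print(clean_review(raw, unescape_html=True))       # entity decoded before stripping
```

Whether unescaping is worthwhile depends on how often entities occur in the scraped text, which this notebook does not measure.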


### Preprocessing Functions:
Below, we will define functions that will aid in preprocessing our data:

In [None]:
# %load ../python_files/functions

# Text Preprocessing Functions:

# Regex and tokenizing text:
# takes text, removes special characters and punctuation 
# and returns a list of lower-cased tokenized words
def tokenize_sentences(text):
    pattern = "([a-zA-Z]+(?:'[a-z]+)?)" 
    raw_tokens = nltk.regexp_tokenize(text, pattern)
    return [token.lower() for token in raw_tokens]

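# Removing stopwords:
# removes stopwords from pre-tokenized list of words
# (expects `stopwords` from nltk.corpus to be imported before it is called)
def remove_stopwords(token_list, custom_words=None):
    # build the stopword set once per call instead of once per token
    stop_set = set(stopwords.words('english')) | set(custom_words or [])
    return [word for word in token_list if word not in stop_set]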

# Joining a list of words together
def join_word_list(word_list):
    return " ".join(word_list)

# Creating part-of-speech tags using the spaCy package
# (relies on the global `nlp` pipeline loaded in the part-of-speech tagging section)
# input is a string
# output is a list of tuples [('I', 'PRON'),('went', 'VERB'),('to', 'ADP'),('the', 'DET'),('store', 'NOUN'),('today', 'NOUN')]
def create_pos_tokens(review):
    doc = nlp(review)
    pos_tag_list = [(token.text, token.pos_) for token in doc]
    return pos_tag_list
    
# Obtaining N-Grams:

# returns most frequent words in corpus
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range=(1, 1), stop_words=stopwords.words('english')+custom_stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

# returns most frequent bigrams in corpus
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words=stopwords.words('english')+custom_stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

# returns most frequent trigrams in corpus
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words=stopwords.words('english')+custom_stopwords).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

### Tokenization:
Below, the sentences in each review are being separated into distinct words. This is called "tokenization."

In [8]:
tokenized_pros = all_reviews.pros.apply(lambda x: tokenize_sentences(x))
# preview
tokenized_pros[:3]

0    [good, benefits, nice, values, nice, co, worke...
1                 [its, great, pay, amp, steady, work]
2    [great, worklife, balance, limitless, income, ...
Name: pros, dtype: object

In [9]:
tokenized_cons = all_reviews.cons.apply(lambda x: tokenize_sentences(x))
# preview
tokenized_cons[:3]

0    [no, flexibility, micromanage, no, room, for, ...
1                     [its, a, lot, of, manual, labor]
2    [appreciation, for, good, work, is, no, longer...
Name: cons, dtype: object
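
To make the pattern in `tokenize_sentences` concrete, the sketch below applies the same regular expression with the standard library's `re.findall`. It is meant as an approximation of what `nltk.regexp_tokenize` does for this particular pattern, not a drop-in replacement, and the sample sentences are made up.

```python
import re

# same pattern as in tokenize_sentences, minus the outer capture group:
# a run of letters, optionally followed by an apostrophe and lowercase letters
TOKEN_PATTERN = r"[a-zA-Z]+(?:'[a-z]+)?"

def tokenize(text):
    """Return lower-cased word tokens; digits and punctuation are dropped."""
    return [token.lower() for token in re.findall(TOKEN_PATTERN, text)]

tokens_with_apostrophe = tokenize("Don't micromanage; customers call 7 days a week!")
tokens_already_cleaned = tokenize("dont micromanage")
print(tokens_with_apostrophe)
print(tokens_already_cleaned)
```

Because apostrophes were already stripped during cleaning, contractions reach the tokenizer as single words such as "dont" and "doesnt," which is why those forms show up in the frequency lists later in the notebook.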

### Removing Stopwords:
Stopwords are commonly occurring words that are usually removed to negate the influence of words such as "and," "the," and "to." These words are mentioned frequently but offer limited information about the meaning of a sentence. I have also added custom stopwords that appear frequently in our corpus only because they are part of the name of the company being reviewed.

In [107]:
# nltk.download('stopwords')
from nltk.corpus import stopwords

In [108]:
custom_stopwords = ['home', 'depot', 'verizon', 'boa', 'bank', 'america', 'boeing', 'comcast', 'ge', 'google', 'jp', 'morgan', 'microsoft', 'kroger', 'walgreens', 'wells', 'fargo']

In [109]:
# token list without stop words
tokenized_pros = tokenized_pros.apply(lambda x: remove_stopwords(x, custom_stopwords))
# preview
tokenized_pros[:3]

0    [good, benefits, nice, values, nice, co, worke...
1                      [great, pay, amp, steady, work]
2    [great, worklife, balance, limitless, income, ...
Name: pros, dtype: object

In [110]:
# token list without stop words
tokenized_cons = tokenized_cons.apply(lambda x: remove_stopwords(x, custom_stopwords))
# preview
tokenized_cons[:3]

0             [flexibility, micromanage, room, growth]
1                                 [lot, manual, labor]
2    [appreciation, good, work, longer, celebrated,...
Name: cons, dtype: object
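
Membership tests against a list are linear in the list length, so for a corpus of this size it can help to hold the stopwords in a `set` that is built once and reused. The sketch below uses a tiny hand-written set as a stand-in for `stopwords.words('english')` plus `custom_stopwords`; the name `remove_stopwords_fast` is hypothetical.

```python
# stand-in for set(stopwords.words('english')) | set(custom_stopwords)
SAMPLE_STOPWORDS = {"no", "for", "its", "a", "of", "verizon"}

def remove_stopwords_fast(token_list, stop_set):
    # set membership is O(1) on average, so the set is built once and reused
    return [word for word in token_list if word not in stop_set]

cons_tokens = ["no", "flexibility", "micromanage", "no", "room", "for", "growth"]
filtered = remove_stopwords_fast(cons_tokens, SAMPLE_STOPWORDS)
print(filtered)
```

With this input the result matches the first row of the preview above.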

In [14]:
# pros_sentences = [' '.join(sentence) for sentence in tokenized_pros]
# cons_sentences = [' '.join(sentence) for sentence in tokenized_cons]

# Or use defined function (output is Series instead of a list):
pros_sentences = tokenized_pros.map(lambda x: join_word_list(x))
cons_sentences = tokenized_cons.map(lambda x: join_word_list(x))
# preview
pros_sentences[:10]

0    good benefits nice values nice co workers nice...
1                            great pay amp steady work
2    great worklife balance limitless income potential
3    money benefits flexibility schedule great uppe...
4         excellent benefits competitive within market
5              flexible working hours good size office
6                              people part good people
7     continual training good benefits lots great food
8    great pay fast paced challenging ever changing...
9                               good pay work schedule
Name: pros, dtype: object

In [111]:
# save tokenized words using pickle
with open('../../saved_objects/pros_tokens.pkl', 'wb') as f:
    pickle.dump(tokenized_pros, f)
with open('../../saved_objects/cons_tokens.pkl', 'wb') as f:
    pickle.dump(tokenized_cons, f)
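
The pickled objects can be read back in a later notebook with `pickle.load`. A minimal round-trip sketch follows, using a temporary file and a plain list of token lists in place of the real Series (loading the real files would additionally require pandas to be importable).

```python
import os
import pickle
import tempfile

tokens = [["great", "pay", "amp", "steady", "work"], ["lot", "manual", "labor"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "pros_tokens.pkl")
    with open(pkl_path, "wb") as f:      # write in binary mode
        pickle.dump(tokens, f)
    with open(pkl_path, "rb") as f:      # read in binary mode
        restored = pickle.load(f)

print(restored == tokens)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that you created yourself or otherwise trust.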

### Part-of-speech Tagging:
POS tagging allows us to isolate certain words by part of speech. If we want, we can choose to examine only nouns, adjectives, or verbs, for example.

In [17]:
# import spacy
# import en_core_web_sm
nlp = en_core_web_sm.load()

In [18]:
# use subset of dataframe since this process is intensive
subset_pros = pros_sentences.sample(10000)
subset_cons = cons_sentences.sample(10000)
# convert the series of pros/cons to pos-tagged tokens

# preview the series:
print(subset_pros[:5],'\n','\n', subset_cons[:5])

79906                                 great support manager
18596      pay benefits high compared call center positions
151233    main advantage working worklife balance even b...
225828    great people work family oriented culture grea...
5116          flexible schedule higher pay jobs outside lax
Name: pros, dtype: object 
 
 42003     available shifts inconsistent metrics wrong ca...
35858     management tougher previous pharma experiences...
207386    managment become awful company run like walmar...
238460                                     pay could better
182254    company big policies impersonal taken case cas...
Name: cons, dtype: object


In [19]:
postag_pros_sentences = subset_pros.apply(lambda x: create_pos_tokens(x))
postag_cons_sentences = subset_cons.apply(lambda x: create_pos_tokens(x))
# preview
postag_pros_sentences[:5]

79906      [(great, ADJ), (support, NOUN), (manager, NOUN)]
18596     [(pay, VERB), (benefits, NOUN), (high, ADJ), (...
151233    [(main, ADJ), (advantage, NOUN), (working, VER...
225828    [(great, ADJ), (people, NOUN), (work, VERB), (...
5116      [(flexible, ADJ), (schedule, NOUN), (higher, A...
Name: pros, dtype: object

In [20]:
pros_adjs = postag_pros_sentences.apply(lambda x: [word[0] for word in x if word[1] == 'ADJ'])
cons_adjs = postag_cons_sentences.apply(lambda x: [word[0] for word in x if word[1] == 'ADJ'])
# preview
pros_adjs[:5]

79906                                               [great]
18596                                                [high]
151233    [main, awesome, great, domestic, fair, bottom,...
225828    [great, great, flexible, depending, exciting, ...
5116                                     [flexible, higher]
Name: pros, dtype: object

In [21]:
joined_pros_adjs = pros_adjs.apply(lambda x: ' '.join(x))
joined_cons_adjs = cons_adjs.apply(lambda x: ' '.join(x))
# preview
joined_pros_adjs[:5]

79906                                                 great
18596                                                  high
151233    main awesome great domestic fair bottom availa...
225828      great great flexible depending exciting certain
5116                                        flexible higher
Name: pros, dtype: object
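
The adjective filter generalizes to any set of part-of-speech tags. The helper below (`words_with_pos` is a hypothetical name) works on the `(text, pos)` tuples that `create_pos_tokens` returns; the tagged example is copied from the first preview row so that no spaCy model is needed to run it.

```python
def words_with_pos(tagged_tokens, wanted_tags):
    """Keep the text of every (text, pos) tuple whose tag is in wanted_tags."""
    wanted = set(wanted_tags)
    return [text for text, pos in tagged_tokens if pos in wanted]

tagged = [("great", "ADJ"), ("support", "NOUN"), ("manager", "NOUN")]
adjectives = words_with_pos(tagged, {"ADJ"})
nouns_and_verbs = words_with_pos(tagged, {"NOUN", "VERB"})
print(adjectives)
print(nouns_and_verbs)
```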

In [22]:
# save POS tagged words using pickle
with open('../../saved_objects/postag_pros.pkl', 'wb') as f:
    pickle.dump(postag_pros_sentences, f)
with open('../../saved_objects/postag_cons.pkl', 'wb') as f:
    pickle.dump(postag_cons_sentences, f)

### Stemming:
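We can stem or lemmatize words in order to reduce different inflected forms of a word to a common root. For example, the words "run" and "running" capture the same meaning for our purposes. Stemming the word "running" would reduce it to "run." Here we use stemming only, via NLTK's Snowball stemmer, so some outputs (such as "valu" for "values") are not dictionary words.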

In [23]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

In [24]:
# preview
tokenized_pros[:5]

0    [good, benefits, nice, values, nice, co, worke...
1                      [great, pay, amp, steady, work]
2    [great, worklife, balance, limitless, income, ...
3    [money, benefits, flexibility, schedule, great...
4    [excellent, benefits, competitive, within, mar...
Name: pros, dtype: object

In [25]:
stemmed_pros = tokenized_pros.apply(lambda x: ' '.join([stemmer.stem(word) for word in x]))
stemmed_cons = tokenized_cons.apply(lambda x: ' '.join([stemmer.stem(word) for word in x]))
print(stemmed_pros[:5], '\n', '\n', stemmed_cons[:5])

0    good benefit nice valu nice co worker nice outing
1                            great pay amp steadi work
2         great worklif balanc limitless incom potenti
3      money benefit flexibl schedul great upper manag
4                 excel benefit competit within market
Name: pros, dtype: object 
 
 0                       flexibl micromanag room growth
1                                     lot manual labor
2    appreci good work longer celebr focus longer a...
3    custom call day week night work life balanc tr...
4                            recent layoff lack stabil
Name: cons, dtype: object


In [26]:
# save stemmed words objects using pickle
with open('../../saved_objects/stemmed_pros.pkl', 'wb') as f:
    pickle.dump(stemmed_pros, f)
with open('../../saved_objects/stemmed_cons.pkl', 'wb') as f:
    pickle.dump(stemmed_cons, f)

# Initial Exploratory Data Analysis

## Distribution of Reviews by Company
Admittedly, reviews are not evenly distributed across companies. I started off collecting as many reviews as possible from each company, but time constraints led me to collect ~2000-3000 per company instead. As time went on, I favored a breadth of companies over as many reviews as possible from a single company. I will have to go back and filter the collected reviews by date range, which I believe would be the best criterion for collecting reviews.

In [25]:
all_reviews.groupby('company').company.count().iplot(
    kind='bar', xTitle='Company', yTitle='Count', linecolor='black', orientation = 'v', color='orange', title="Distribution of Reviews by Company:")

## Distribution of Reviews by Word Counts
### Word Counts in "Pros":

In [57]:
all_reviews.loc[all_reviews.word_count_pros < 200].word_count_pros

1415

In [61]:
all_reviews.loc[all_reviews.word_count_pros < 200].word_count_pros.iplot(
    kind='hist', xTitle='Word Count', yTitle='Count', bins=100, linecolor='black', color='orange', title="Distribution of Reviews by Word Count:")

## Star Rating Distribution:
Let's look at how many reviews are in each rating category from 1-5:

In [10]:
all_reviews.groupby('rating').company.count().iplot(
    kind='bar', xTitle='Star Rating', yTitle='Count', linecolor='black', orientation = 'v', color='blue', title="Distribution of Reviews by Star Rating:")

In [49]:
round(all_reviews.rating.value_counts(normalize=True).sort_values(ascending = True)*100,1).iplot(
    kind='bar', xTitle='Star Rating', yTitle='Percentage of Reviews', linecolor='black', orientation = 'v', color='grey', title="Distribution of Reviews by Star Rating (Percentage):")

It seems most ratings are in the 3-5 category (~81% of reviews).

## Word Frequency Distributions:
Just to get an initial sense of the type of words and text contained in the reviews, we can look at word frequency counts. We can see which words are most frequently used among Pros and Cons, and across the entire corpus as a whole.

In [27]:
from nltk import FreqDist

In [28]:
all_cons = []
for record in tokenized_cons:
    all_cons += record
    
all_pros = []
for record in tokenized_pros:
    all_pros += record

In [29]:
# freq dist for all cons and all pros combined
freqdist_all = FreqDist(all_cons+all_pros)
# freq dist for pros and cons separately
freqdist_pros = FreqDist(all_pros)
freqdist_cons = FreqDist(all_cons)
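
NLTK's `FreqDist` is a subclass of the standard library's `collections.Counter`, so the same counts can be sketched without NLTK. The token lists below are a made-up miniature corpus, and `itertools.chain.from_iterable` stands in for the flattening loops.

```python
from collections import Counter
from itertools import chain

# made-up miniature corpus of already-tokenized reviews
sample_pros = [["great", "pay", "steady", "work"], ["great", "benefits", "good", "pay"]]
sample_cons = [["lot", "manual", "labor"], ["pay", "could", "better"]]

freq_pros = Counter(chain.from_iterable(sample_pros))
freq_cons = Counter(chain.from_iterable(sample_cons))
freq_all = freq_pros + freq_cons   # same totals as counting pros and cons together

print(freq_pros.most_common(2))
print(freq_all["pay"])
```

Adding the two counters gives the same totals as counting the combined token list, which is a quick way to sanity-check that the combined and separate distributions agree.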

### Most commonly used words across all reviews:

Let's take a look at the commonly used words in Fortune 100 company reviews.

In [30]:
# most common words in all reviews (pros+cons)
# hard to extract general sentiment from just words. Will probably need to include
    # bi-grams, tri-grams, and POS-tags for adjectives
freqdist_all.most_common(15)

[('work', 150174),
 ('good', 111301),
 ('great', 111282),
 ('company', 91301),
 ('benefits', 83722),
 ('people', 72005),
 ('management', 65245),
 ('pay', 63965),
 ('get', 49394),
 ('employees', 44510),
 ('time', 43931),
 ('job', 40462),
 ('hours', 39772),
 ('environment', 31580),
 ('working', 30826)]

Most of the words here are unsurprising, like "work," "good," "great," and "company." The words that particularly stand out and give us some insight into what employees value are "benefits," "pay," "hours," and "management." It is not surprising that these words are popular in job reviews, but it supports our assumption that these are common aspects of a job that factor into an employee's overall level of satisfaction.

### Most popular words in "Pros":

In [31]:
# most common words in 'pros'
freqdist_pros.most_common(15)

[('great', 101877),
 ('good', 94749),
 ('work', 88453),
 ('benefits', 76614),
 ('company', 48957),
 ('people', 44032),
 ('pay', 37686),
 ('environment', 21946),
 ('opportunities', 21110),
 ('job', 18834),
 ('time', 17715),
 ('balance', 16443),
 ('working', 16232),
 ('employees', 16052),
 ('get', 15688)]

"Great" and "good" are two very generic words that capture positive sentiment, so it makes sense that they top the charts here. The only words unique to the Pros list that give us some added insight are "opportunities" and "balance." "Balance" is most likely part of the phrase "work life balance."

### Most popular words in "Cons":

In [32]:
# most common words in 'cons'
freqdist_cons.most_common(15)

[('work', 61721),
 ('management', 50192),
 ('company', 42344),
 ('get', 33706),
 ('employees', 28458),
 ('people', 27973),
 ('pay', 26279),
 ('time', 26216),
 ('hours', 25679),
 ('job', 21628),
 ('many', 20032),
 ('much', 19248),
 ('dont', 19072),
 ('managers', 17371),
 ('like', 17219)]

What is interesting here is that "management" is far more frequent among Cons (50,192 occurrences) than among Pros, where it does not appear in the top 15. "Hours" follows the same pattern. It seems likely that management and working hours are common complaints among employees at Fortune 100 companies.

Overall, it's quite difficult to extract general sentiment from just words alone. It will probably help to examine bi-grams, tri-grams, and POS-tagged adjectives. This may give us additional, more specific insight into themes of employee satisfaction.

## Most commonly used bi-grams and tri-grams:

We can look at n-grams, or sequences of n consecutive words, to give us better insight into our text data.
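
For intuition, an n-gram counter can be sketched with the standard library alone. This is not how `CountVectorizer` is implemented; it only mirrors the order of operations of its word analyzer, which drops stop words before forming n-grams. That ordering is why a phrase such as "hard to get promoted" can surface as the trigram "hard get promoted." The helper names, the stop word set, and the two reviews below are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(reviews, n, stop_words, k=None):
    """Count n-grams over whitespace-split reviews after removing stop words."""
    counts = Counter()
    for review in reviews:
        tokens = [w for w in review.split() if w not in stop_words]  # drop stop words first
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

reviews = ["hard to get promoted", "it is hard to get promoted here"]
stop_words = {"to", "it", "is", "here"}   # tiny stand-in for the full stop word list
top_trigrams = top_ngrams(reviews, 3, stop_words, k=1)
print(top_trigrams)
```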

### Top bigrams in "Pros":

In [33]:
common_words = get_top_n_bigram(all_reviews['pros'], 20)
# for word, freq in common_words:
#     print(word, freq)
df1 = pd.DataFrame(common_words, columns = ['pros' , 'count'])

In [34]:
df1.groupby('pros').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', xTitle='Count', linecolor='black', orientation='h', color ='green', title="Top 20 Bigrams in 'Pros' after Removing Stop Words:")

The most popular two-word phrases among Pros are "great benefits" and "good benefits." Other than that, we can see the words "great" and "good" combined with an assortment of other nouns. Examples are "good pay," "great people," and "great company."

### Top bigrams in "Cons":

In [35]:
common_words = get_top_n_bigram(all_reviews['cons'], 20)
# for word, freq in common_words:
#     print(word, freq)
df2 = pd.DataFrame(common_words, columns = ['cons' , 'count'])

In [36]:
df2.groupby('cons').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='red', title="Top 20 Bigrams in 'Cons' after Removing Stop Words:")

Looking at the frequency of the phrases "work life," "life balance," "worklife balance," and "long hours," it seems like a very common complaint among Cons is poor work life balance. Some other bigrams that stand out are "upper management," "customer service," and "red tape." It seems like a lot of Cons may come from customer service workers. It's also surprising that the phrase "red tape" is so popular. This is most likely because the reviews only come from Fortune 100 companies.

### Top trigrams in "Pros":

In [37]:
common_words = get_top_n_trigram(all_reviews['pros'], 20)
# for word, freq in common_words:
#     print(word, freq)
df3 = pd.DataFrame(common_words, columns = ['pros' , 'count'])

In [38]:
df3.groupby('pros').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', xTitle='Count', linecolor='black', orientation = 'h', color='green', title="Top 20 Trigrams in 'Pros' after Removing Stop Words:")

Like the bigrams, a lot of the trigrams consist of various combinations of "great" and "good." Other commonly listed words are "benefits," "people," "pay," "life," and "balance."

### Top trigrams in "Cons":

In [39]:
common_words = get_top_n_trigram(all_reviews['cons'], 20)
# for word, freq in common_words:
#     print(word, freq)
df4 = pd.DataFrame(common_words, columns = ['cons' , 'count'])

In [40]:
df4.groupby('cons').sum()['count'].sort_values(ascending=True).iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='red', title="Top 20 Trigrams in 'Cons' after Removing Stop Words:")

Unsurprisingly, the top trigram in both Pros and Cons is "work life balance." Some other trigrams that are particularly informative are "get things done," "high turnover rate," and "hard get promoted." We can infer a lot of information from these trigrams, but we should remember that they are only present in a very small subset of our reviews.

The distribution of trigrams is much more skewed to the right than the bigram and word distributions. This makes sense. As the number of words in an n-gram increases, it becomes more likely that only a select few n-grams are used on a regular basis, while the number of rare, irrelevant n-grams grows.

## Let's search for our frequent words in the trigrams and discover the context:

We can look at the most frequently occurring words in the corpus, and search for bigrams/trigrams containing those words. This will give us greater context surrounding their usage. As an example, we can look at the word "people" and see what the most popular trigrams containing "people" are.

Let's see how people are using the word "people" in the Pros section of reviews:

In [68]:
for tri_gram in get_top_n_trigram(all_reviews['pros'], 200):
    if 'people' in tri_gram[0]:
        print(tri_gram)

('great people work', 1629)
('good people work', 845)
('great people great', 767)
('benefits great people', 474)
('good people good', 391)
('work great people', 381)
('nice people work', 374)
('smart people work', 357)
('benefits good people', 356)
('people work great', 349)
('people great work', 333)
('people great benefits', 320)
('smart people great', 316)
('great people good', 310)
('work smart people', 298)
('lots smart people', 269)
('smart people good', 254)
('meet new people', 244)
('really smart people', 238)
('people good benefits', 216)
('people work good', 201)
('nice people good', 193)
('meeting new people', 185)
('environment great people', 170)
('benefits smart people', 169)
('pay good people', 168)


It seems the most revealing descriptors attached to "people" are "nice" and "smart," which suggests these are qualities employees appreciate in coworkers. The words "great" and "good" also occur frequently, but are too generic to reveal the exact context.
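
Note that `'people' in tri_gram[0]` is a substring test on the n-gram string, so it would also match longer words that merely contain the search term. A small sketch of a whole-word variant follows; `ngrams_containing` is a hypothetical helper, the first two tuples come from the output above, and the third is invented to show the difference.

```python
def ngrams_containing(ngram_counts, word):
    """Keep (ngram, count) pairs in which `word` appears as a whole token."""
    return [(ngram, count) for ngram, count in ngram_counts if word in ngram.split()]

trigram_counts = [
    ("great people work", 1629),
    ("meet new people", 244),
    ("peoples choice award", 3),   # invented entry: contains "people" only as a substring
]
people_trigrams = ngrams_containing(trigram_counts, "people")
print(people_trigrams)
```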

Let's see how people are using the word "management" in the Cons section of reviews:

In [42]:
for tri_gram in get_top_n_trigram(all_reviews['cons'], 300):
    if 'management' in tri_gram[0]:
        print(tri_gram)

('many layers management', 286)
('management doesnt care', 254)
('upper management doesnt', 145)
('upper level management', 144)
('mid level management', 120)
('many levels management', 102)
('poor management poor', 100)
('communication upper management', 97)
('management doesnt seem', 91)
('poor upper management', 89)
('management care employees', 80)
('management low pay', 78)
('management plays favorites', 78)
('management could better', 77)
('management hit miss', 77)
('poor management lack', 72)
('management dont care', 71)
('life balance management', 70)
('lack communication management', 70)
('middle upper management', 66)
('lower level management', 66)
('management upper management', 61)
('top heavy management', 61)
('poor middle management', 61)


The results here seem to be a lot more interesting. We can see that the top trigram containing "management" is "many layers management," followed by "management doesn't care." This gives us some pretty clear context of how employees negatively view management in a company. From the results, we can make an educated guess that common management complaints revolve around a lot of hierarchy and bureaucracy, as well as a lack of care from managers.

## Most Commonly Used Adjectives in Pros and Cons:

We can use the POS-tagged reviews that we generated earlier with spaCy to discover the most commonly used adjectives. This is extremely useful for text generated from any sort of reviews, since it lets us roughly gauge how people view a certain item, product, or, in our case, company.

### Most common adjectives in "Pros":

In [43]:
common_words = get_top_n_words(joined_pros_adjs, 20)
# for word, freq in common_words:
#     print(word, freq)
df5 = pd.DataFrame(common_words, columns = ['pros' , 'count'])

In [58]:
top_adjectives_pros = df5.groupby('pros').sum()['count'].sort_values(ascending=True)
top_adjectives_pros.iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='green', title="Top 20 Adjectives in 'Pros' after Removing Stop Words (sample of 10k):")

Unsurprisingly, "great" and "good" top the list of most frequently used adjectives, heavily skewing the distribution to the right. Adjectives that stand out and give us an idea of what characteristics employees desire in a workplace are "flexible," "friendly," "smart," and "interesting."

### Most common adjectives in "Cons":

In [45]:
common_words = get_top_n_words(joined_cons_adjs, 20)
# for word, freq in common_words:
#     print(word, freq)
df6 = pd.DataFrame(common_words, columns = ['cons' , 'count'])

In [59]:
top_adjectives_cons = df6.groupby('cons').sum()['count'].sort_values(ascending=True)
top_adjectives_cons.iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='red', title="Top 20 Adjectives in 'Cons' after Removing Stop Words (sample of 10k):")

The adjective distribution among the Cons is not as skewed as the Pros distribution. The top three adjectives are "many," "good," and "much." At first glance, these seem like curious results, but upon further investigation, these words are usually used in the phrases "not many," "not good," and "not much."

Some adjectives that stand out are "difficult," "slow," and "corporate."
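
One possible reason the negations are invisible in these counts: NLTK's English stopword list includes "no" and "not," so removing stopwords before tagging can turn "not good" into "good." A hedged sketch of keeping negation words, using a tiny hand-written stop word set rather than the real list (the helper name and sample sentence are assumptions):

```python
NEGATIONS = {"no", "not", "nor"}
sample_stop_words = {"the", "is", "are", "there", "no", "not", "nor"}  # stand-in list

def remove_stopwords_keep_negations(tokens, stop_words):
    """Remove stop words but leave negation words in place."""
    keep_out = stop_words - NEGATIONS      # negation words are no longer filtered
    return [t for t in tokens if t not in keep_out]

tokens = "the pay is not good".split()
plain = [t for t in tokens if t not in sample_stop_words]
with_negations = remove_stopwords_keep_negations(tokens, sample_stop_words)
print(plain)
print(with_negations)
```

Keeping negations would change the downstream bigram and adjective counts, so it is a modeling choice rather than a pure fix.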

In [61]:
# save adjective counts using pickle
with open('../../saved_objects/top_adjectives_pros.pkl', 'wb') as f:
    pickle.dump(top_adjectives_pros, f)
with open('../../saved_objects/top_adjectives_cons.pkl', 'wb') as f:
    pickle.dump(top_adjectives_cons, f)

In [None]:
# save originally used pros and cons tokens
# build each stopword set once instead of rebuilding the list for every word
base_stopwords = set(custom_stopwords) | set(stopwords.words('english'))
pros_remove = all_reviews['pros'].values.tolist()
pros_remove = [' '.join(w for w in review.split() if w not in base_stopwords) for review in pros_remove]
pros_remove = [tokenize_sentences(review) for review in pros_remove]

# 'work' and 'company' are additionally removed from the cons
cons_stopwords = base_stopwords | {'work', 'company'}
cons_remove = all_reviews['cons'].values.tolist()
cons_remove = [' '.join(w for w in review.split() if w not in cons_stopwords) for review in cons_remove]
cons_remove = [tokenize_sentences(review) for review in cons_remove]

In [112]:
with open('../../saved_objects/tokenized_pros.pkl', 'wb') as f:
    pickle.dump(pros_remove, f)
with open('../../saved_objects/tokenized_cons.pkl', 'wb') as f:
    pickle.dump(cons_remove, f)

## Conclusions
In this notebook, we assembled all the reviews from our web scraping efforts into a single dataframe. From there, we saw that we had to do a lot of text preprocessing in order to prepare for modeling. Lastly, we did some EDA using frequency counts, n-gram counts, and POS tagging to get a good initial idea of the content of our data. 

## Main Takeaways
- Detailed text preprocessing is essential for NLP and differs on a case-to-case basis depending on the objective.
- Common preprocessing steps include: removing punctuation and special characters, lowercasing, tokenization, removing stopwords, stemming/lemmatization, adding n-grams, pointwise mutual information scoring, part-of-speech (POS) tagging, using context-free grammars (CFG) and parse trees, and lastly deciding on the type of vectorization to use.
- Some insightful words that give us an idea of what employees value: benefits, people, pay, environment, hours, management.
- Some insightful bigrams include: smart people, worklife balance, long hours, customer service, poor management, red tape.
- Some insightful trigrams include: work life balance, work long hours, high turnover rate, pay could better, hard get promoted, poor worklife balance, many layers management, lots red tape.
- The most popular adjectives used in the "pros" section are also the two most popular words used across all reviews: great, good.
- Some insightful adjectives in "pros": flexible, friendly, smart, interesting, easy, different
- Some insightful adjectives in "cons": difficult, slow, corporate