# Employee Satisfaction Themes


## Project Goals
- Attempt to discover common themes of employee satisfaction and dissatisfaction among Fortune 100 companies using reviews from glassdoor.com
- Determine which companies particularly excel or fail in these categories
- Attempt to discover common principles that companies advocate for in their mission and vision statements, using manually scraped mission and vision statements from company pages
- Determine the amount of "mismatch" between employee and company sentiment by evaluating the difference between employee-provided reviews and company mission and vision statements

## Summary of Data
The employee review data consists of 250,000+ employee-provided reviews of companies. The data was scraped from glassdoor.com and covers Fortune 100 companies only. Each review consists of a "Pros" and a "Cons" section.

### Library and Function Import

In [62]:
#Import libraries
%run ../python_files/libraries
%run ../python_files/functions
%matplotlib inline
# from libraries import *   #for use within .py file

### Object Import

In [58]:
# 'unpickle' saved objects
with open('../../saved_objects/tokenized_pros.pkl', 'rb') as f:
    tokenized_pros = pickle.load(f)
with open('../../saved_objects/tokenized_cons.pkl', 'rb') as f:
    tokenized_cons = pickle.load(f)
with open('../../saved_objects/stemmed_pros.pkl', 'rb') as f:
    stemmed_pros = pickle.load(f)
with open('../../saved_objects/stemmed_cons.pkl', 'rb') as f:
    stemmed_cons = pickle.load(f)
with open('../../saved_objects/top_adjectives_pros.pkl', 'rb') as f:
    top_adjectives_pros = pickle.load(f)
with open('../../saved_objects/top_adjectives_cons.pkl', 'rb') as f:
    top_adjectives_cons = pickle.load(f)
with open('../../saved_objects/ldamodel_pros.pkl', 'rb') as f:
    ldamodel_pros = pickle.load(f)
with open('../../saved_objects/ldamodel_cons.pkl', 'rb') as f:
    ldamodel_cons = pickle.load(f)

## Initial Exploratory Analysis

### How do employees differ in describing jobs they enjoy vs. jobs they hate?
We can use the reviews that we previously POS-tagged with spaCy to find the most commonly used adjectives. This is useful for any kind of review text, because adjectives give a rough gauge of how people view an item, a product, or, in our case, a company.
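The counting step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `tagged_reviews` sample and the `top_adjectives` helper are hypothetical, and the (token, tag) pairs stand in for what spaCy exposes as `token.text` and `token.pos_`.

```python
from collections import Counter

# Hypothetical stand-in for spaCy output: each review is a list of
# (token, coarse POS tag) pairs. spaCy labels adjectives as "ADJ".
tagged_reviews = [
    [("great", "ADJ"), ("pay", "NOUN"), ("and", "CCONJ"),
     ("flexible", "ADJ"), ("hours", "NOUN")],
    [("great", "ADJ"), ("people", "NOUN")],
    [("friendly", "ADJ"), ("team", "NOUN")],
]


def top_adjectives(reviews, n=20):
    """Return the n most frequent adjectives across all tagged reviews."""
    counts = Counter(
        token.lower()
        for review in reviews
        for token, pos in review
        if pos == "ADJ"
    )
    return counts.most_common(n)


print(top_adjectives(tagged_reviews, n=3))
```

The resulting (adjective, count) pairs are the kind of data a horizontal bar chart of top adjectives would be drawn from.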
### Most common adjectives used in Pros:

In [55]:
top_adjectives_pros.iplot(
    kind='bar', xTitle='Count', linecolor='black', orientation='h', color='green',
    title="Top 20 Adjectives in 'Pros' after Removing Stop Words (sample of 10k):")

Unsurprisingly, "great" and "good" top the list of most frequently used adjectives, heavily skewing the distribution to the right. The adjectives that stand out and hint at which characteristics employees desire in a workplace are "flexible," "friendly," "smart," and "interesting."

### Most common adjectives used in Cons:

In [56]:
top_adjectives_cons.iplot(
    kind='bar', xTitle='Count', linecolor='black', orientation='h', color='red',
    title="Top 20 Adjectives in 'Cons' after Removing Stop Words (sample of 10k):")

The adjective distribution among the Cons is not as skewed as the Pros distribution. The top three adjectives are "many," "good," and "much." At first glance, these seem like curious results, but upon further investigation, these words are usually used in the phrases "not many," "not good," and "not much."

Some adjectives that stand out are "difficult," "slow," and "corporate."
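The observation that "many," "good," and "much" mostly follow "not" can be checked with a small count. The sketch below is an assumption about how such a check might look: `negation_share` and the sample reviews are hypothetical, and it has to run on tokens taken before stop word removal, since "not" is commonly on stop word lists.

```python
from collections import Counter


def negation_share(reviews, adjectives, negators=("not",)):
    """For each adjective, return the fraction of its occurrences
    that are directly preceded by a negator such as "not"."""
    total = Counter()
    negated = Counter()
    for tokens in reviews:
        for i, tok in enumerate(tokens):
            if tok in adjectives:
                total[tok] += 1
                if i > 0 and tokens[i - 1] in negators:
                    negated[tok] += 1
    return {adj: negated[adj] / total[adj] for adj in total}


# Hypothetical raw tokens (stop words still present).
sample_cons = [
    ["not", "many", "opportunities"],
    ["many", "meetings"],
    ["pay", "not", "good"],
]
print(negation_share(sample_cons, {"many", "good", "much"}))
```

Adjectives that never occur are left out of the result, so a missing key means "no data" rather than a share of zero.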

## Topic Modeling using Latent Dirichlet Allocation (LDA)
We may have a rough idea of what employees value/dislike in a job based on popular adjectives used, but let's attempt to get a clearer idea using topic modeling. We will use unsupervised natural language processing, applying Latent Dirichlet Allocation (LDA) to attempt to extract common themes of employee satisfaction among company reviews.

### Pros

In [8]:
# creating a dictionary
dictionary_pros = corpora.Dictionary(tokenized_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
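For readers unfamiliar with the bag-of-words format built in this cell, the pure-Python sketch below mimics the idea: every distinct token gets an integer id, and each review becomes a sorted list of (id, count) pairs. The helpers `build_vocab` and `to_bow` are hypothetical, and the id assignment order is illustrative only; it may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs,
    ignoring tokens that are not in the vocabulary."""
    counts = Counter(vocab[tok] for tok in doc if tok in vocab)
    return sorted(counts.items())


docs = [["great", "pay", "great", "hours"], ["flexible", "hours"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded in this representation, which is why LDA topics are described purely by word weights.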

In [9]:
ldamodel_pros_20 = LdaModel(corpus=pros_corpus, num_topics=20, id2word=dictionary_pros,
                            passes=10, random_state=25, minimum_probability=0.0)

In [17]:
# save model
# from gensim.test.utils import datapath
temp_file = datapath("lda_pros_20")
ldamodel_pros_20.save(temp_file)

# lda_pros_20 = LdaModel.load(datapath("lda_pros_20"))
# doc_topics_pros = list(lda_pros_20.get_document_topics(pros_corpus))

In [15]:
# display all 20 topics with their 13 highest-weighted words
ldamodel_pros_20.show_topics(num_topics=20, num_words=13)

[(0,
  '0.313*"work" + 0.100*"environment" + 0.100*"great" + 0.093*"good" + 0.062*"place" + 0.048*"people" + 0.048*"life" + 0.039*"balance" + 0.019*"culture" + 0.018*"team" + 0.018*"company" + 0.015*"challenging" + 0.013*"working"'),
 (1,
  '0.107*"best" + 0.080*"customers" + 0.061*"worked" + 0.055*"store" + 0.054*"associates" + 0.050*"people" + 0.047*"one" + 0.028*"meet" + 0.026*"product" + 0.025*"school" + 0.021*"ever" + 0.020*"helping" + 0.017*"thing"'),
 (2,
  '0.059*"really" + 0.047*"employees" + 0.043*"company" + 0.040*"care" + 0.035*"amazing" + 0.032*"employee" + 0.030*"take" + 0.024*"working" + 0.020*"positive" + 0.015*"think" + 0.015*"pros" + 0.014*"everything" + 0.013*"enjoyed"'),
 (3,
  '0.162*"flexible" + 0.143*"hours" + 0.140*"work" + 0.091*"schedule" + 0.055*"easy" + 0.039*"love" + 0.031*"working" + 0.017*"schedules" + 0.017*"busy" + 0.015*"hourly" + 0.014*"safety" + 0.012*"overtime" + 0.012*"weekends"'),
 (4,
  '0.038*"technology" + 0.033*"strong" + 0.032*"leadership" + 
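Topic strings in this format are hard to compare across models by eye. As a rough sketch, they can be parsed into (word, weight) pairs with a regular expression; `parse_topic` is a hypothetical helper, and the sample string is a shortened copy of topic 3 from the output above.

```python
import re

# Matches one term such as 0.162*"flexible" and captures weight and word.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn a formatted topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


topic_3 = '0.162*"flexible" + 0.143*"hours" + 0.140*"work" + 0.091*"schedule"'
print(parse_topic(topic_3)[:2])  # [('flexible', 0.162), ('hours', 0.143)]
```

When the model object is still in memory, calling `show_topics` with `formatted=False` should return the pairs directly, so parsing is mainly useful for output that was saved as text.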

### Pros removing common stopwords (great, good, work, company)
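The `remove_stopwords` helper used in the next cell comes from `../python_files/functions` and is not shown in this notebook. Judging only from how it is called (a token list plus a list of words to drop), it might look something like the sketch below; the name `remove_stopwords_sketch` and its body are assumptions, not the project's actual implementation.

```python
def remove_stopwords_sketch(tokens, stopwords):
    """Hypothetical stand-in for the project's remove_stopwords helper:
    return the tokens that are not in the given stop word list."""
    stopword_set = set(stopwords)  # set lookup keeps the filter fast
    return [tok for tok in tokens if tok not in stopword_set]


sample_review = ["great", "pay", "amp", "steady", "work"]
print(remove_stopwords_sketch(sample_review, ["great", "good", "work", "company"]))
# ['pay', 'amp', 'steady']
```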

In [43]:
tokenized_pros = tokenized_pros.apply(lambda x: remove_stopwords(x, ['great','good','work','company']))
dictionary_pros = corpora.Dictionary(tokenized_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
ldamodel_pros_20_custom_stopwords = LdaModel(corpus=pros_corpus, num_topics=20, id2word=dictionary_pros,
                                             passes=10, random_state=25, minimum_probability=0.0)
temp_file = datapath("lda_pros_20_custom_stopwords")
ldamodel_pros_20_custom_stopwords.save(temp_file)
# display all 20 topics with their 13 highest-weighted words
ldamodel_pros_20_custom_stopwords.show_topics(num_topics=20, num_words=13)

[(0,
  '0.169*"time" + 0.080*"part" + 0.040*"employees" + 0.034*"full" + 0.031*"available" + 0.030*"ok" + 0.026*"even" + 0.025*"quality" + 0.023*"resources" + 0.021*"provide" + 0.021*"schedules" + 0.020*"group" + 0.020*"needed"'),
 (1,
  '0.094*"make" + 0.078*"amazing" + 0.064*"money" + 0.054*"big" + 0.037*"back" + 0.033*"hard" + 0.031*"sales" + 0.028*"large" + 0.021*"exciting" + 0.021*"bad" + 0.020*"thats" + 0.018*"role" + 0.018*"none"'),
 (2,
  '0.268*"pay" + 0.243*"benefits" + 0.072*"decent" + 0.064*"salary" + 0.042*"competitive" + 0.036*"people" + 0.022*"awesome" + 0.018*"coworkers" + 0.017*"perks" + 0.013*"union" + 0.013*"wonderful" + 0.012*"bonuses" + 0.010*"healthcare"'),
 (3,
  '0.125*"culture" + 0.123*"management" + 0.084*"employees" + 0.075*"team" + 0.031*"leadership" + 0.027*"level" + 0.026*"values" + 0.020*"support" + 0.019*"supportive" + 0.019*"corporate" + 0.018*"training" + 0.017*"professional" + 0.017*"helpful"'),
 (4,
  '0.155*"place" + 0.082*"like" + 0.060*"really" + 

### Pros removing just stopwords 'work' and 'company' (same as original model)

In [63]:
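# Run this on freshly loaded tokenized_pros: the previous cell filters tokenized_pros in place,
# which would otherwise have removed 'great' and 'good' already.
tokenized_pros = tokenized_pros.apply(lambda x: remove_stopwords(x, ['work','company']))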
dictionary_pros = corpora.Dictionary(tokenized_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
ldamodel_pros_20_custom_stopwords = LdaModel(corpus=pros_corpus, num_topics=20, id2word=dictionary_pros,
                                             passes=10, random_state=25, minimum_probability=0.0)
# NOTE: same variable name and save path as the previous cell, so this overwrites that saved model
temp_file = datapath("lda_pros_20_custom_stopwords")
ldamodel_pros_20_custom_stopwords.save(temp_file)
# display all 20 topics with their 13 highest-weighted words
ldamodel_pros_20_custom_stopwords.show_topics(num_topics=20, num_words=13)

[(0,
  '0.102*"well" + 0.088*"like" + 0.053*"family" + 0.042*"team" + 0.030*"employees" + 0.019*"healthcare" + 0.017*"put" + 0.017*"pays" + 0.017*"management" + 0.015*"fairly" + 0.014*"role" + 0.014*"get" + 0.014*"members"'),
 (1,
  '0.121*"lot" + 0.121*"lots" + 0.099*"learn" + 0.062*"experience" + 0.053*"new" + 0.035*"opportunity" + 0.029*"people" + 0.026*"high" + 0.024*"things" + 0.022*"different" + 0.019*"technology" + 0.019*"focus" + 0.019*"resources"'),
 (2,
  '0.102*"really" + 0.058*"managers" + 0.057*"amazing" + 0.052*"worked" + 0.046*"associates" + 0.034*"level" + 0.031*"people" + 0.029*"helpful" + 0.028*"everyone" + 0.022*"ive" + 0.021*"cool" + 0.020*"years" + 0.018*"experience"'),
 (3,
  '0.462*"good" + 0.098*"balance" + 0.093*"benefits" + 0.075*"life" + 0.038*"worklife" + 0.032*"people" + 0.023*"culture" + 0.021*"salary" + 0.014*"management" + 0.014*"compensation" + 0.008*"overall" + 0.007*"flexibility" + 0.007*"stable"'),
 (4,
  '0.039*"help" + 0.026*"every" + 0.026*"worker

### Stemmed pros

In [26]:
tokenized_pros[:5]

0    [good, benefits, nice, values, nice, co, worke...
1                      [great, pay, amp, steady, work]
2    [great, worklife, balance, limitless, income, ...
3    [money, benefits, flexibility, schedule, great...
4    [excellent, benefits, competitive, within, mar...
Name: pros, dtype: object

In [27]:
stemmed_pros[:5]

0    good benefit nice valu nice co worker nice outing
1                            great pay amp steadi work
2         great worklif balanc limitless incom potenti
3      money benefit flexibl schedul great upper manag
4                 excel benefit competit within market
Name: pros, dtype: object
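Note the difference between the two outputs: `tokenized_pros` holds lists of tokens, while `stemmed_pros` holds one space-joined string per review. gensim's `Dictionary` expects each document as a list of tokens, so the strings need to be split first. The `tokenize_sentences` helper that does this lives in `../python_files/functions` and is not shown; the sketch below assumes a plain whitespace split, and `tokenize_review` is a hypothetical name.

```python
def tokenize_review(stemmed_review):
    """Hypothetical stand-in for tokenize_sentences:
    split a space-joined stemmed review into a list of tokens."""
    return stemmed_review.split()


# Two stemmed reviews copied from the output above.
stemmed_sample = [
    "great pay amp steadi work",
    "excel benefit competit within market",
]
stemmed_tokens = [tokenize_review(review) for review in stemmed_sample]
print(stemmed_tokens[0])  # ['great', 'pay', 'amp', 'steadi', 'work']
```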

In [33]:
# each element of stemmed_pros is a whole review string, not a single word
stemmed_pros = [tokenize_sentences(review) for review in stemmed_pros]
dictionary_pros = corpora.Dictionary(stemmed_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in stemmed_pros]

In [34]:
ldamodel_stemmed_pros_20 = LdaModel(corpus=pros_corpus, num_topics=20, id2word=dictionary_pros,
                                    passes=10, random_state=25, minimum_probability=0.0)

In [35]:
# save model
# from gensim.test.utils import datapath
temp_file = datapath("lda_stemmed_pros_20")
ldamodel_stemmed_pros_20.save(temp_file)

# lda_stemmed_pros_20 = LdaModel.load(datapath("lda_stemmed_pros_20"))
# doc_topics_pros = list(lda_stemmed_pros_20.get_document_topics(pros_corpus))

In [36]:
# display all 20 topics with their 13 highest-weighted words
ldamodel_stemmed_pros_20.show_topics(num_topics=20, num_words=13)

[(0,
  '0.078*"year" + 0.056*"day" + 0.048*"work" + 0.042*"love" + 0.039*"everi" + 0.030*"enjoy" + 0.024*"month" + 0.020*"come" + 0.016*"stay" + 0.015*"wage" + 0.015*"get" + 0.014*"leav" + 0.014*"ive"'),
 (1,
  '0.331*"great" + 0.200*"work" + 0.064*"balanc" + 0.059*"benefit" + 0.057*"peopl" + 0.053*"place" + 0.048*"life" + 0.036*"compani" + 0.029*"environ" + 0.027*"cultur" + 0.025*"worklif" + 0.018*"team" + 0.010*"excel"'),
 (2,
  '0.125*"custom" + 0.086*"product" + 0.056*"servic" + 0.034*"improv" + 0.029*"market" + 0.028*"chang" + 0.027*"process" + 0.022*"qualiti" + 0.021*"perform" + 0.020*"system" + 0.020*"motiv" + 0.019*"driven" + 0.019*"extrem"'),
 (3,
  '0.235*"time" + 0.077*"part" + 0.069*"vacat" + 0.065*"paid" + 0.031*"week" + 0.031*"full" + 0.028*"top" + 0.024*"benefit" + 0.022*"employ" + 0.021*"get" + 0.020*"day" + 0.019*"tech" + 0.017*"live"'),
 (4,
  '0.099*"benefit" + 0.085*"employe" + 0.077*"great" + 0.057*"k" + 0.046*"health" + 0.032*"care" + 0.032*"insur" + 0.029*"discou

### Stemmed pros removing common stopwords (great, good, work, company)

In [47]:
# stemmed_pros = [tokenize_sentences(word) for word in stemmed_pros]
stemmed_pros = pd.Series(stemmed_pros).apply(lambda x: remove_stopwords(x, ['great','good','work','compani']))
dictionary_pros = corpora.Dictionary(stemmed_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in stemmed_pros]
ldamodel_stemmed_pros_20_custom_stopwords = LdaModel(corpus=pros_corpus, num_topics=20, id2word=dictionary_pros,
                                                     passes=10, random_state=25, minimum_probability=0.0)
temp_file = datapath("lda_stemmed_pros_20_custom_stopwords")
ldamodel_stemmed_pros_20_custom_stopwords.save(temp_file)
# display all 20 topics with their 13 highest-weighted words
ldamodel_stemmed_pros_20_custom_stopwords.show_topics(num_topics=20, num_words=13)

[(0,
  '0.067*"perk" + 0.053*"right" + 0.051*"perform" + 0.047*"thing" + 0.045*"travel" + 0.042*"process" + 0.037*"mobil" + 0.037*"treat" + 0.032*"everyth" + 0.030*"fantast" + 0.029*"direct" + 0.027*"well" + 0.025*"patient"'),
 (1,
  '0.316*"peopl" + 0.066*"smart" + 0.058*"help" + 0.054*"friend" + 0.047*"around" + 0.045*"interest" + 0.039*"project" + 0.034*"lot" + 0.023*"alway" + 0.021*"realli" + 0.018*"enjoy" + 0.018*"someth" + 0.016*"get"'),
 (2,
  '0.093*"experi" + 0.075*"like" + 0.073*"famili" + 0.053*"room" + 0.038*"offer" + 0.036*"full" + 0.027*"gain" + 0.024*"huge" + 0.022*"varieti" + 0.021*"financi" + 0.021*"orient" + 0.019*"feel" + 0.018*"field"'),
 (3,
  '0.142*"learn" + 0.092*"lot" + 0.063*"new" + 0.048*"product" + 0.045*"technolog" + 0.044*"divers" + 0.031*"skill" + 0.028*"talent" + 0.024*"busi" + 0.023*"experi" + 0.022*"meet" + 0.022*"knowledg" + 0.016*"improv"'),
 (4,
  '0.113*"best" + 0.085*"compens" + 0.058*"industri" + 0.053*"one" + 0.048*"big" + 0.045*"world" + 0.032*