# Employee Satisfaction Themes


## Project Goals
- Attempt to discover common themes of employee satisfaction and dissatisfaction among Fortune 100 companies using reviews from glassdoor.com
- Determine which companies particularly excel or fail in these categories
- Attempt to discover common principles that companies advocate for in their mission and vision statements, using manually scraped mission and vision statements from company pages
- Determine the amount of "mismatch" between employee and company sentiment by evaluating the difference between employee-provided reviews and company mission and vision statements

## Summary of Data
The employee review data consists of 250,000+ employee-provided reviews of companies. This data was web-scraped from glassdoor.com, and contains information only from Fortune 100 companies. Each review consists of a "Pros" and "Cons" section.

### Library and Function Import

In [133]:
#Import libraries
%run ../python_files/libraries
# need to import libraries that functions are dependent on separately within functions.py file
%run ../python_files/functions
%matplotlib inline
# from libraries import *   #for use within .py file

### Object Import

In [126]:
# 'unpickle' saved objects
with open('../../saved_objects/tokenized_pros.pkl', 'rb') as f:
    tokenized_pros = pickle.load(f)
with open('../../saved_objects/tokenized_cons.pkl', 'rb') as f:
    tokenized_cons = pickle.load(f)
with open('../../saved_objects/stemmed_pros.pkl', 'rb') as f:
    stemmed_pros = pickle.load(f)
with open('../../saved_objects/stemmed_cons.pkl', 'rb') as f:
    stemmed_cons = pickle.load(f)
with open('../../saved_objects/top_adjectives_pros.pkl', 'rb') as f:
    top_adjectives_pros = pickle.load(f)
with open('../../saved_objects/top_adjectives_cons.pkl', 'rb') as f:
    top_adjectives_cons = pickle.load(f)
with open('../../saved_objects/ldamodel_pros.pkl', 'rb') as f:
    ldamodel_pros = pickle.load(f)
with open('../../saved_objects/ldamodel_cons.pkl', 'rb') as f:
    ldamodel_cons = pickle.load(f)
with open('../../saved_objects/reviews_df.pkl', 'rb') as f:
    all_reviews = pickle.load(f)

## Initial Exploratory Analysis

### How do employees differ in describing jobs they enjoy vs. jobs they hate?
We can use the previously POS-tagged reviews that we generated using spaCy to find the most commonly used adjectives. This is useful for review text of any kind, because adjectives give a rough gauge of how people view an item, a product, or, in our case, a company.
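
As a minimal sketch of the counting step, assume each review is already available as a list of `(token, pos)` pairs, mirroring what spaCy exposes as `token.text` and `token.pos_`. The sample data and the `top_adjectives` helper are hypothetical and are not the code that produced the saved `top_adjectives_pros` object.

```python
from collections import Counter

# Hypothetical POS-tagged reviews: one list of (token, pos) pairs per review.
tagged_reviews = [
    [("great", "ADJ"), ("benefits", "NOUN"), ("and", "CCONJ"),
     ("flexible", "ADJ"), ("hours", "NOUN")],
    [("friendly", "ADJ"), ("people", "NOUN"), ("great", "ADJ"), ("pay", "NOUN")],
]

def top_adjectives(reviews, n=20):
    """Return the n most frequent adjectives across all tagged reviews."""
    counts = Counter(
        token.lower()
        for review in reviews
        for token, pos in review
        if pos == "ADJ"
    )
    return counts.most_common(n)

adjective_counts = top_adjectives(tagged_reviews, n=3)
print(adjective_counts)
```
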
### Most common adjectives used in Pros:

In [55]:
top_adjectives_pros.iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='green', title="Top 20 Adjectives in 'Pros' after Removing Stop Words (sample of 10k):")

Unsurprisingly, "great" and "good" top the list by a wide margin, which leaves the distribution heavily right-skewed. The adjectives that stand out and hint at what employees want in a workplace are "flexible," "friendly," "smart," and "interesting."

### Most common adjectives used in Cons:

In [56]:
top_adjectives_cons.iplot(
    kind='bar', xTitle='Count', linecolor='black',orientation = 'h', color='red', title="Top 20 Adjectives in 'Cons' after Removing Stop Words (sample of 10k):")

The adjective distribution among the Cons is not as skewed as the Pros distribution. The top three adjectives are "many," "good," and "much." At first glance, these seem like curious results, but upon further investigation, these words are usually used in the phrases "not many," "not good," and "not much."

Some adjectives that stand out are "difficult," "slow," and "corporate."
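
One way to check the "not many" / "not good" observation is to measure how often each adjective is directly preceded by a negator. The sketch below assumes token lists taken before stop-word removal, since common stop-word lists drop "not". The `negation_rate` helper and the sample tokens are hypothetical.

```python
from collections import Counter

def negation_rate(token_lists, adjectives, negators=("not", "no", "never")):
    """For each adjective, return the share of its occurrences directly preceded by a negator."""
    total = Counter()
    negated = Counter()
    for tokens in token_lists:
        # Pair every token with the token immediately before it (None for the first).
        for prev, word in zip([None] + tokens[:-1], tokens):
            if word in adjectives:
                total[word] += 1
                if prev in negators:
                    negated[word] += 1
    return {adj: negated[adj] / total[adj] for adj in total}

# Hypothetical tokenized Cons reviews.
cons_tokens = [
    ["not", "many", "opportunities"],
    ["not", "much", "growth"],
    ["many", "meetings"],
    ["pay", "is", "not", "good"],
]
rates = negation_rate(cons_tokens, {"many", "good", "much"})
print(rates)
```

A rate close to 1 for an adjective would support the reading that it mostly appears in negated form.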

## Topic Modeling using Latent Dirichlet Allocation (LDA)
We may have a rough idea of what employees value/dislike in a job based on popular adjectives used, but let's attempt to get a clearer idea using topic modeling. We will use unsupervised natural language processing, applying Latent Dirichlet Allocation (LDA) to attempt to extract common themes of employee satisfaction among company reviews.

LDA topic modeling is an unsupervised learning method. It is similar to clustering but differs in a fundamental way: clustering assigns each observation (or document, in this case) to a single cluster, whereas LDA assumes that each document is a mixture of topics and assigns a topic probability distribution to each document. For reviews, this is the more realistic assumption. A single review likely covers several topics (e.g. benefits, culture, and management) rather than just one.
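
To make the difference concrete, the sketch below takes a made-up topic distribution for one review, in the `(topic_id, probability)` format that gensim's `get_document_topics` returns, and contrasts the mixture view with a hard, clustering-style assignment. The probabilities and the 0.25 threshold are illustrative assumptions.

```python
# Hypothetical topic distribution for a single review.
doc_topics = [(0, 0.05), (3, 0.55), (11, 0.30), (13, 0.10)]

# LDA view: the review is a mixture of topics whose weights sum to 1.
total_weight = sum(prob for _, prob in doc_topics)

# Clustering view: collapse the mixture to the single most likely topic.
hard_label = max(doc_topics, key=lambda pair: pair[1])[0]

# Topics that carry a meaningful share of the review (the threshold is arbitrary).
salient_topics = [topic for topic, prob in doc_topics if prob >= 0.25]

print(total_weight, hard_label, salient_topics)
```

The hard label keeps only one topic, while the mixture still records that a second topic accounts for a large part of the same review.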

### Pros

In [115]:
# Create a corpus from a list of texts
dictionary_pros = corpora.Dictionary(tokenized_pros)
# document to bag of words 
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
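
For readers unfamiliar with the bag-of-words step, here is a pure-Python sketch of what the dictionary and `doc2bow` calls conceptually produce: an integer id per distinct token, and one list of `(token_id, count)` pairs per document. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids gensim assigns may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized reviews.
docs = [["great", "pay", "great", "benefits"], ["flexible", "hours", "pay"]]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(bow)
```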

In [116]:
# model on the corpus
ldamodel_pros_20 = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)

In [117]:
# save model
# from gensim.test.utils import datapath
temp_file = datapath("lda_pros_20")
ldamodel_pros_20.save(temp_file)

# Load a pretrained model from disk
# pretrained_lda_pros = LdaModel.load(temp_file)

In [128]:
ldamodel_pros_20.show_topics(num_topics = 20, num_words = 13)

[(0,
  '0.053*"training" + 0.037*"leadership" + 0.037*"development" + 0.034*"management" + 0.029*"strong" + 0.025*"culture" + 0.025*"interesting" + 0.025*"level" + 0.024*"values" + 0.023*"support" + 0.020*"program" + 0.020*"programs" + 0.019*"plenty"'),
 (1,
  '0.196*"place" + 0.078*"work" + 0.039*"people" + 0.036*"start" + 0.027*"cool" + 0.025*"think" + 0.022*"talented" + 0.021*"career" + 0.020*"generally" + 0.019*"meet" + 0.018*"need" + 0.017*"role" + 0.017*"plus"'),
 (2,
  '0.142*"opportunities" + 0.082*"lots" + 0.061*"opportunity" + 0.055*"many" + 0.054*"growth" + 0.043*"career" + 0.040*"move" + 0.039*"within" + 0.036*"different" + 0.033*"advancement" + 0.027*"grow" + 0.027*"large" + 0.025*"company"'),
 (3,
  '0.113*"time" + 0.081*"benefits" + 0.071*"k" + 0.057*"health" + 0.041*"paid" + 0.040*"vacation" + 0.038*"insurance" + 0.032*"part" + 0.027*"bonus" + 0.026*"days" + 0.024*"year" + 0.021*"medical" + 0.021*"pto"'),
 (4,
  '0.119*"lot" + 0.102*"job" + 0.098*"learn" + 0.076*"experi

### Labeling Topics in Pros
As you can see, the LDA model output 20 topics from our corpus. The next, more difficult step is to assign meaning to each of the topics above and see whether they make sense. I have assigned a rough label to each topic by 1) looking at the weighted words assigned to each topic in the model output, and 2) examining the documents with the highest probability assigned to each topic.

- 0: Training and development programs
- 1: "Cool" people & place to work at the start of your career
- 2: High potential for growth and advancement
- 3: Good benefits: 401k, health/medical insurance, pto, vacation, bonus
- 4: Great learning opportunity and skills development
- 5: Rewarded for good work (caring managers, bonuses)
- 6: Good pay, reimbursement for expenses, flexible hours
- 7: Challenging, fast-paced, strive for excellence
- 8: Help and customer-oriented
- 9: No pros
- 10: Superior products/industry
- 11: Work-life balance, flexible schedule
- 12: ? throwaway topic **-> throwaway**
- 13: Friendly, helpful people and work environment
- 14: High quality, exciting projects and smart people
- 15: ? Great pay, culture, benefits **-> throwaway**
- 16: ? "Love" working there **-> throwaway**
- 17: Free amenities: free food, discounts, events, gym, education
- 18: People like "family." Best company, best brand. Always new things.
- 19: Better than other similar companies, competitors

There were some topics that were difficult to label either due to not being segmented or specific enough (topic 15 mentioned good benefits, pay, people, management, and training), or just containing noise. I had to throw away those topics. While the topic segmentation is not perfect and contains some overlap, the LDA topic modeling overall worked quite well.
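
The second labeling step described above (reading the documents most strongly assigned to a topic) could be sketched as follows. It assumes the per-document topic distributions are available as lists of `(topic_id, probability)` pairs. The helper name `top_documents_for_topic` and the toy distributions are hypothetical.

```python
import heapq

def top_documents_for_topic(doc_topic_dists, topic_id, n=3):
    """Return (doc_index, probability) for the n documents weighted most heavily toward topic_id."""
    scored = (
        (dict(dist).get(topic_id, 0.0), idx)
        for idx, dist in enumerate(doc_topic_dists)
    )
    best = heapq.nlargest(n, scored)
    return [(idx, prob) for prob, idx in best]

# Hypothetical two-topic distributions for three reviews.
doc_topic_dists = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.1), (1, 0.9)],
    [(0, 0.5), (1, 0.5)],
]
top_for_topic_1 = top_documents_for_topic(doc_topic_dists, topic_id=1, n=2)
print(top_for_topic_1)
```

The returned indices can then be used to pull the original review text and read it alongside the topic's top words.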

### Cons

In [120]:
# Create a corpus from a list of texts
dictionary_cons = corpora.Dictionary(tokenized_cons)
# document to bag of words 
cons_corpus = [dictionary_cons.doc2bow(text) for text in tokenized_cons]

In [121]:
# model on the corpus
ldamodel_cons_20 = LdaModel(corpus = cons_corpus, num_topics=20, id2word = dictionary_cons, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)

In [122]:
# save model
# from gensim.test.utils import datapath
temp_file = datapath("lda_cons_20")
ldamodel_cons_20.save(temp_file)

# Load a pretrained model from disk
# pretrained_lda_cons = LdaModel.load(temp_file)

In [123]:
ldamodel_cons_20.show_topics(num_topics = 20, num_words = 13)

[(0,
  '0.023*"get" + 0.021*"one" + 0.019*"people" + 0.016*"job" + 0.015*"even" + 0.012*"dont" + 0.012*"manager" + 0.012*"go" + 0.011*"youre" + 0.010*"want" + 0.010*"make" + 0.010*"time" + 0.010*"never"'),
 (1,
  '0.171*"hours" + 0.101*"long" + 0.039*"time" + 0.033*"schedule" + 0.033*"working" + 0.031*"enough" + 0.030*"day" + 0.030*"days" + 0.029*"week" + 0.019*"shift" + 0.014*"weekends" + 0.014*"hour" + 0.013*"busy"'),
 (2,
  '0.131*"store" + 0.081*"environment" + 0.071*"think" + 0.064*"none" + 0.052*"retail" + 0.039*"changing" + 0.030*"cant" + 0.025*"competitive" + 0.023*"driven" + 0.019*"toxic" + 0.018*"corporation" + 0.018*"constantly" + 0.016*"overall"'),
 (3,
  '0.085*"great" + 0.082*"good" + 0.066*"really" + 0.058*"cons" + 0.048*"nothing" + 0.039*"place" + 0.026*"bad" + 0.024*"isnt" + 0.022*"say" + 0.021*"best" + 0.021*"much" + 0.020*"everything" + 0.020*"learn"'),
 (4,
  '0.075*"sales" + 0.070*"customer" + 0.042*"service" + 0.033*"expectations" + 0.031*"customers" + 0.031*"goal

### Labeling Topics in Cons
As with the Pros section, I assigned a rough label to each of the topics generated for the Cons section of all reviews.

- 0: ? throwaway topic **-> throwaway**
- 1: Long work hours, weekend shifts
- 2: ? No cons **-> throwaway**
- 3: No cons
- 4: Unrealistic sales goals. Lofty expectations.
- 5: Incompetent managers, focused on metrics.
- 6: Office politics, red tape, a lot of "manager influence"
- 7: Not treated well or respected
- 8: Lack of leadership, poor management, limited communication
- 9: Bureaucratic, big company, hierarchy, protocols. "small fish in a big pond."
- 10: Limited career opportunities/advancement, growth, work life balance.
- 11: Too many processes, procedures, meetings. Too formalized.
- 12: Lack of development path, training, support.
- 13: Frequent change and layoffs
- 14: Lack of care for employees from upper management
- 15: Difficult scheduling, working outside
- 16: ? throwaway topic **-> throwaway**
- 17: Low salary, limited benefits, lack of promotion
- 18: High stress, micromanagement, turnover
- 19: Hard to advance from a lower level position, especially from part time to full time

Some results here immediately catch our attention: the clear mention of "office politics" and "corporate environment" in topic 6; "poor leadership," "management," and "favoritism" in topic 8; and "job changes," "constant," "layoffs," and "security" in topic 13.

When you think about it, these are all common employee concerns that make sense. Again, while not perfect, the LDA topic modeling did a fairly good job segmenting the corpus into identifiable topics.

## Identifying Overall Topic Distributions within Each Company
Now that we have identified common reasons why people love or hate their jobs, the next question is: which companies excel or fail spectacularly in these categories?

We can accomplish this task by aggregating topic distributions by company and performing an outlier analysis to determine outlier companies in each category.

### Topic Distributions by Company (Pros)

In [129]:
doc_topics = list(ldamodel_pros_20.get_document_topics(pros_corpus))

all_reviews['doc_topic_pros'] = doc_topics

all_reviews['doc_topic_pros'] = all_reviews['doc_topic_pros'].map(lambda x: dict(x))

for topic_num in range(20):
    all_reviews[f'topic_pros_{topic_num}'] = all_reviews['doc_topic_pros'].apply(lambda x: x.get(topic_num, 0))
    
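# Preview table: mean topic weight per company, as a percentage
# (select the topic columns by name rather than by position)
topic_cols_pros = [f'topic_pros_{i}' for i in range(20)]
dist_pros = round(all_reviews.groupby('company')[topic_cols_pros].mean(), 3) * 100
dist_pros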

| company | topic_pros_0 | topic_pros_1 | topic_pros_2 | topic_pros_3 | topic_pros_4 | topic_pros_5 | topic_pros_6 | topic_pros_7 | topic_pros_8 | topic_pros_9 | topic_pros_10 | topic_pros_11 | topic_pros_12 | topic_pros_13 | topic_pros_14 | topic_pros_15 | topic_pros_16 | topic_pros_17 | topic_pros_18 | topic_pros_19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3M | 6.5 | 3.5 | 7.9 | 3.8 | 5.2 | 3.5 | 9.7 | 3.9 | 3.2 | 3.8 | 5.4 | 5.7 | 3.2 | 7.2 | 4.6 | 8.4 | 2.7 | 3.2 | 5.2 | 3.5 |
| AIG | 5.0 | 3.9 | 6.6 | 8.3 | 4.6 | 3.4 | 10.9 | 3.4 | 3.2 | 4.6 | 3.6 | 7.6 | 3.2 | 6.6 | 3.9 | 9.1 | 2.5 | 2.6 | 4.5 | 2.4 |
| AT&T | 4.6 | 3.4 | 6.3 | 6.0 | 4.0 | 3.0 | 17.5 | 3.8 | 4.6 | 3.4 | 3.2 | 4.2 | 3.0 | 3.7 | 2.6 | 11.7 | 3.0 | 5.5 | 3.7 | 2.8 |
| AbbVie | 6.5 | 3.3 | 6.3 | 4.8 | 4.2 | 4.0 | 8.9 | 3.4 | 3.1 | 3.7 | 5.1 | 8.0 | 3.5 | 8.6 | 4.2 | 10.1 | 2.2 | 2.5 | 4.2 | 3.3 |
| Albertsons | 3.9 | 3.7 | 4.4 | 5.0 | 5.0 | 3.9 | 14.4 | 4.9 | 7.4 | 5.5 | 2.4 | 2.6 | 3.3 | 8.2 | 3.0 | 6.9 | 2.9 | 4.0 | 5.5 | 3.3 |
| Amazon | 5.2 | 3.8 | 6.2 | 5.8 | 4.9 | 3.0 | 12.9 | 5.9 | 5.0 | 4.1 | 3.0 | 3.3 | 3.3 | 6.2 | 4.5 | 7.7 | 3.8 | 2.9 | 5.2 | 3.4 |
| American Airlines | 3.4 | 3.0 | 5.0 | 6.5 | 4.5 | 3.2 | 11.1 | 3.6 | 4.1 | 4.5 | 3.2 | 4.2 | 3.1 | 5.2 | 3.0 | 12.9 | 6.9 | 3.7 | 5.7 | 3.0 |
| American Express | 5.9 | 3.1 | 6.9 | 5.1 | 4.2 | 3.6 | 9.6 | 2.9 | 2.9 | 3.5 | 3.2 | 11.1 | 3.3 | 6.5 | 3.8 | 11.5 | 2.9 | 3.1 | 4.8 | 2.2 |
| AmerisourceBergen | 4.4 | 3.4 | 6.4 | 6.6 | 4.0 | 4.0 | 12.8 | 4.0 | 3.4 | 4.7 | 3.4 | 4.0 | 3.3 | 7.9 | 3.4 | 10.1 | 2.8 | 3.7 | 5.1 | 2.6 |
| Apple | 4.5 | 3.2 | 3.7 | 5.6 | 4.2 | 3.3 | 11.7 | 3.7 | 4.5 | 4.0 | 4.0 | 3.3 | 3.1 | 7.6 | 3.6 | 14.6 | 4.3 | 4.1 | 4.3 | 2.8 |


The above table represents the concentration of topics within each company. Put another way, it very *roughly* shows the percentage of review content that fell into each topic. For example, topic 0 represents training and development programs, so we can very roughly say that 6.5% of reviews at 3M mentioned training and development programs.

The reason I use the word *roughly* is because, remember, LDA topic modeling assigns percentage weights of topics to each review. So each review is made up of a variety of topics. Therefore, the above table is an *average of averages*.
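
The "average of averages" can be illustrated with a small pure-Python sketch. It assumes each review is reduced to a `{topic_id: probability}` dictionary and averages those per-review weights within each company. The helper `mean_topic_share_by_company` and the sample rows are hypothetical; the notebook itself does this with a pandas `groupby`.

```python
from collections import defaultdict

def mean_topic_share_by_company(rows, num_topics):
    """rows: iterable of (company, {topic_id: probability}).
    Returns {company: [mean topic weight in percent, one entry per topic]}."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for company, dist in rows:
        counts[company] += 1
        for topic_id in range(num_topics):
            sums[company][topic_id] += dist.get(topic_id, 0.0)
    return {
        company: [round(100 * s / counts[company], 1) for s in totals]
        for company, totals in sums.items()
    }

# Hypothetical per-review topic weights for two companies and two topics.
rows = [
    ("3M", {0: 0.25, 1: 0.75}),
    ("3M", {0: 0.5, 1: 0.5}),
    ("AIG", {0: 0.125, 1: 0.875}),
]
shares = mean_topic_share_by_company(rows, num_topics=2)
print(shares)
```

In this toy data, 37.5 for topic 0 at 3M does not mean that 37.5% of reviews are about topic 0. Every review contributes a fraction of its weight to that topic, and the table reports the mean of those fractions.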

### Detecting Outlier Companies within each Topic

The next step is to determine which companies excel in each of our predefined topics. For example, which companies excel in offering training and development programs to employees?

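We can accomplish this by conducting an outlier analysis in which we define outliers as any value greater than Q3 (3rd quartile) + IQR (interquartile range). Note that the code below uses a fence multiplier of 1, which is tighter than the conventional 1.5*IQR rule and therefore flags more companies.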
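
As a standalone illustration of the fence rule, the sketch below flags values above Q3 + k*IQR using only the standard library, with the multiplier `k` left as a parameter (1.5 is the conventional choice). The helper `upper_outliers` and the sample values are hypothetical. The `method="inclusive"` setting interpolates quartiles linearly between data points, which is assumed here to correspond to the default behavior of pandas' `quantile`.

```python
import statistics

def upper_outliers(values_by_company, k=1.5):
    """Return (company, value) pairs whose value exceeds Q3 + k * IQR, largest first."""
    q1, _, q3 = statistics.quantiles(
        values_by_company.values(), n=4, method="inclusive"
    )
    fence = q3 + k * (q3 - q1)
    flagged = {name: v for name, v in values_by_company.items() if v > fence}
    return sorted(flagged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical mean topic weights (in percent) for five companies.
sample = {
    "Company A": 3.0,
    "Company B": 3.5,
    "Company C": 4.0,
    "Company D": 4.5,
    "Company E": 9.0,
}
outliers = upper_outliers(sample)
print(outliers)
```

Lowering `k` widens the set of flagged companies, so the choice of multiplier directly controls how many "excelling" companies each topic reports.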

In [132]:
# flag outliers per topic: X < Q1 - IQR or X > Q3 + IQR
# (the fence multiplier here is 1, not the conventional 1.5)
for i in range(20):
    s = dist_pros[f'topic_pros_{i}']
    IQR = iqr(s)
    below = s[s < (s.quantile(.25) - IQR)].sort_values(ascending=True)  # computed but not printed
    above = s[s > (s.quantile(.75) + IQR)].sort_values(ascending=False)
    print(f'topic_pros_{i}: ', above, '\n')

topic_pros_0:  company
New York Life Insurance    8.7
General Electric           8.6
Proctor & Gamble           8.4
Name: topic_pros_0, dtype: float64 

topic_pros_1:  company
Tech Data        5.3
Google           4.6
Intel            4.5
Goldman Sachs    4.5
Walt Disney      4.2
HP               4.2
Name: topic_pros_1, dtype: float64 

topic_pros_2:  company
General Electric    10.2
Name: topic_pros_2, dtype: float64 

topic_pros_3:  company
Humana    8.8
Name: topic_pros_3, dtype: float64 

topic_pros_4:  company
Archer Daniels      6.6
Morgan Stanley      6.5
Goldman Sachs       6.3
General Electric    6.2
Tech Data           6.1
Name: topic_pros_4, dtype: float64 

topic_pros_5:  company
Costco            5.1
Delta Airlines    5.0
Liberty Mutual    4.6
HCA Healthcare    4.6
Name: topic_pros_5, dtype: float64 

topic_pros_6:  company
Costco                 17.7
UPS                    17.6
AT&T                   17.5
Enterprise Products    17.0
Name: topic_pros_6, dtype: float64 

to

# -------BREAK----------

In [94]:
import re
all_reviews = pd.read_csv('../../all_reviews_glob.csv').drop(columns = ['Unnamed: 0'])
# collapse line endings and repeated whitespace left over from the HTML (e.g. "\n", "\r")
all_reviews.pros = all_reviews.pros.map(lambda x: re.sub(r'\s+', ' ', x))
all_reviews.cons = all_reviews.cons.map(lambda x: re.sub(r'\s+', ' ', x))
# remove special characters and lowercase all text (raw strings avoid invalid-escape warnings)
all_reviews.title = all_reviews.title.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n\.]', '', str(x).lower()))
all_reviews.pros = all_reviews.pros.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n\.]', '', x.lower()))
all_reviews.cons = all_reviews.cons.map(lambda x: re.sub(r'[^a-zA-Z0-9 \n\.]', '', x.lower()))

In [None]:
pros_remove = all_reviews['pros'].values.tolist()
# build the stop-word set once: company-name tokens plus NLTK's English stop words
stop_set = set(['home', 'depot', 'verizon', 'boa', 'bank', 'america', 'boeing', 'comcast', 'ge', 'google', 'jp', 'morgan', 'microsoft', 'kroger', 'walgreens', 'wells', 'fargo'] + list(stopwords.words('english')))
pros_remove = [' '.join(w for w in review.split() if w not in stop_set) for review in pros_remove]

In [105]:
pros_remove = [tokenize_sentences(word) for word in pros_remove]

In [106]:
dictionary_pros = corpora.Dictionary(pros_remove)
pros_corpus = [dictionary_pros.doc2bow(text) for text in pros_remove]
ldamodel = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)
ldamodel.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.316*"great" + 0.162*"work" + 0.084*"people" + 0.063*"place" + 0.058*"benefits" + 0.041*"environment" + 0.038*"company" + 0.028*"culture" + 0.023*"fun" + 0.017*"team" + 0.016*"with" + 0.015*"nice" + 0.015*"awesome"'),
 (1,
  '0.156*"employees" + 0.056*"company" + 0.044*"care" + 0.040*"strong" + 0.040*"high" + 0.036*"customer" + 0.032*"leadership" + 0.028*"service" + 0.025*"culture" + 0.023*"diverse" + 0.021*"well" + 0.020*"take" + 0.017*"senior"'),
 (2,
  '0.085*"job" + 0.084*"get" + 0.072*"easy" + 0.045*"money" + 0.044*"make" + 0.041*"projects" + 0.037*"free" + 0.024*"lot" + 0.023*"meet" + 0.023*"food" + 0.018*"need" + 0.016*"freedom" + 0.015*"usually"'),
 (3,
  '0.067*"friendly" + 0.059*"environment" + 0.044*"coworkers" + 0.044*"team" + 0.035*"help" + 0.035*"nice" + 0.034*"management" + 0.033*"managers" + 0.029*"customers" + 0.027*"staff" + 0.024*"atmosphere" + 0.023*"colleagues" + 0.023*"able"'),
 (4,
  '0.121*"training" + 0.045*"development" + 0.029*"support" + 0.027*"care

In [15]:
# display model topics
ldamodel_pros_20.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.313*"work" + 0.100*"environment" + 0.100*"great" + 0.093*"good" + 0.062*"place" + 0.048*"people" + 0.048*"life" + 0.039*"balance" + 0.019*"culture" + 0.018*"team" + 0.018*"company" + 0.015*"challenging" + 0.013*"working"'),
 (1,
  '0.107*"best" + 0.080*"customers" + 0.061*"worked" + 0.055*"store" + 0.054*"associates" + 0.050*"people" + 0.047*"one" + 0.028*"meet" + 0.026*"product" + 0.025*"school" + 0.021*"ever" + 0.020*"helping" + 0.017*"thing"'),
 (2,
  '0.059*"really" + 0.047*"employees" + 0.043*"company" + 0.040*"care" + 0.035*"amazing" + 0.032*"employee" + 0.030*"take" + 0.024*"working" + 0.020*"positive" + 0.015*"think" + 0.015*"pros" + 0.014*"everything" + 0.013*"enjoyed"'),
 (3,
  '0.162*"flexible" + 0.143*"hours" + 0.140*"work" + 0.091*"schedule" + 0.055*"easy" + 0.039*"love" + 0.031*"working" + 0.017*"schedules" + 0.017*"busy" + 0.015*"hourly" + 0.014*"safety" + 0.012*"overtime" + 0.012*"weekends"'),
 (4,
  '0.038*"technology" + 0.033*"strong" + 0.032*"leadership" + 

### Pros removing common stopwords (great, good, work, company)
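
The `remove_stopwords` helper used in the next cell lives in the project's `functions.py`, which is not shown here. A minimal sketch of what such a helper might do, assuming it takes a list of tokens and a list of extra words to drop, is given below under a different name so that it is not mistaken for the project's own implementation.

```python
def remove_stopwords_sketch(tokens, extra_stopwords):
    """Return the tokens that are not in extra_stopwords, preserving order."""
    blocked = set(extra_stopwords)  # set lookup keeps filtering fast on long lists
    return [token for token in tokens if token not in blocked]

# Hypothetical tokenized review.
sample = ["great", "pay", "good", "benefits", "work", "life", "balance"]
filtered = remove_stopwords_sketch(sample, ["great", "good", "work", "company"])
print(filtered)
```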

In [43]:
tokenized_pros = tokenized_pros.apply(lambda x: remove_stopwords(x, ['great','good','work','company']))
dictionary_pros = corpora.Dictionary(tokenized_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
ldamodel_pros_20_custom_stopwords = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)
temp_file = datapath("lda_pros_20_custom_stopwords")
ldamodel_pros_20_custom_stopwords.save(temp_file)
ldamodel_pros_20_custom_stopwords.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.169*"time" + 0.080*"part" + 0.040*"employees" + 0.034*"full" + 0.031*"available" + 0.030*"ok" + 0.026*"even" + 0.025*"quality" + 0.023*"resources" + 0.021*"provide" + 0.021*"schedules" + 0.020*"group" + 0.020*"needed"'),
 (1,
  '0.094*"make" + 0.078*"amazing" + 0.064*"money" + 0.054*"big" + 0.037*"back" + 0.033*"hard" + 0.031*"sales" + 0.028*"large" + 0.021*"exciting" + 0.021*"bad" + 0.020*"thats" + 0.018*"role" + 0.018*"none"'),
 (2,
  '0.268*"pay" + 0.243*"benefits" + 0.072*"decent" + 0.064*"salary" + 0.042*"competitive" + 0.036*"people" + 0.022*"awesome" + 0.018*"coworkers" + 0.017*"perks" + 0.013*"union" + 0.013*"wonderful" + 0.012*"bonuses" + 0.010*"healthcare"'),
 (3,
  '0.125*"culture" + 0.123*"management" + 0.084*"employees" + 0.075*"team" + 0.031*"leadership" + 0.027*"level" + 0.026*"values" + 0.020*"support" + 0.019*"supportive" + 0.019*"corporate" + 0.018*"training" + 0.017*"professional" + 0.017*"helpful"'),
 (4,
  '0.155*"place" + 0.082*"like" + 0.060*"really" + 

### Pros removing just stopwords 'work' and 'company' (same as the original model)

In [63]:
tokenized_pros = tokenized_pros.apply(lambda x: remove_stopwords(x, ['work','company']))
dictionary_pros = corpora.Dictionary(tokenized_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in tokenized_pros]
ldamodel_pros_20_custom_stopwords = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)
temp_file = datapath("lda_pros_20_custom_stopwords")
ldamodel_pros_20_custom_stopwords.save(temp_file)
ldamodel_pros_20_custom_stopwords.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.102*"well" + 0.088*"like" + 0.053*"family" + 0.042*"team" + 0.030*"employees" + 0.019*"healthcare" + 0.017*"put" + 0.017*"pays" + 0.017*"management" + 0.015*"fairly" + 0.014*"role" + 0.014*"get" + 0.014*"members"'),
 (1,
  '0.121*"lot" + 0.121*"lots" + 0.099*"learn" + 0.062*"experience" + 0.053*"new" + 0.035*"opportunity" + 0.029*"people" + 0.026*"high" + 0.024*"things" + 0.022*"different" + 0.019*"technology" + 0.019*"focus" + 0.019*"resources"'),
 (2,
  '0.102*"really" + 0.058*"managers" + 0.057*"amazing" + 0.052*"worked" + 0.046*"associates" + 0.034*"level" + 0.031*"people" + 0.029*"helpful" + 0.028*"everyone" + 0.022*"ive" + 0.021*"cool" + 0.020*"years" + 0.018*"experience"'),
 (3,
  '0.462*"good" + 0.098*"balance" + 0.093*"benefits" + 0.075*"life" + 0.038*"worklife" + 0.032*"people" + 0.023*"culture" + 0.021*"salary" + 0.014*"management" + 0.014*"compensation" + 0.008*"overall" + 0.007*"flexibility" + 0.007*"stable"'),
 (4,
  '0.039*"help" + 0.026*"every" + 0.026*"worker

### Stemmed pros

In [26]:
tokenized_pros[:5]

0    [good, benefits, nice, values, nice, co, worke...
1                      [great, pay, amp, steady, work]
2    [great, worklife, balance, limitless, income, ...
3    [money, benefits, flexibility, schedule, great...
4    [excellent, benefits, competitive, within, mar...
Name: pros, dtype: object

In [27]:
stemmed_pros[:5]

0    good benefit nice valu nice co worker nice outing
1                            great pay amp steadi work
2         great worklif balanc limitless incom potenti
3      money benefit flexibl schedul great upper manag
4                 excel benefit competit within market
Name: pros, dtype: object

In [33]:
stemmed_pros = [tokenize_sentences(word) for word in stemmed_pros]
dictionary_pros = corpora.Dictionary(stemmed_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in stemmed_pros]

In [34]:
ldamodel_stemmed_pros_20 = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)

In [35]:
# save model
# from gensim.test.utils import datapath
temp_file = datapath("lda_stemmed_pros_20")
ldamodel_stemmed_pros_20.save(temp_file)

# lda_stemmed_pros_20 = LdaModel.load(datapath("lda_stemmed_pros_20"))
# doc_topics_pros = list(lda_stemmed_pros_20.get_document_topics(pros_corpus))

In [36]:
ldamodel_stemmed_pros_20.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.078*"year" + 0.056*"day" + 0.048*"work" + 0.042*"love" + 0.039*"everi" + 0.030*"enjoy" + 0.024*"month" + 0.020*"come" + 0.016*"stay" + 0.015*"wage" + 0.015*"get" + 0.014*"leav" + 0.014*"ive"'),
 (1,
  '0.331*"great" + 0.200*"work" + 0.064*"balanc" + 0.059*"benefit" + 0.057*"peopl" + 0.053*"place" + 0.048*"life" + 0.036*"compani" + 0.029*"environ" + 0.027*"cultur" + 0.025*"worklif" + 0.018*"team" + 0.010*"excel"'),
 (2,
  '0.125*"custom" + 0.086*"product" + 0.056*"servic" + 0.034*"improv" + 0.029*"market" + 0.028*"chang" + 0.027*"process" + 0.022*"qualiti" + 0.021*"perform" + 0.020*"system" + 0.020*"motiv" + 0.019*"driven" + 0.019*"extrem"'),
 (3,
  '0.235*"time" + 0.077*"part" + 0.069*"vacat" + 0.065*"paid" + 0.031*"week" + 0.031*"full" + 0.028*"top" + 0.024*"benefit" + 0.022*"employ" + 0.021*"get" + 0.020*"day" + 0.019*"tech" + 0.017*"live"'),
 (4,
  '0.099*"benefit" + 0.085*"employe" + 0.077*"great" + 0.057*"k" + 0.046*"health" + 0.032*"care" + 0.032*"insur" + 0.029*"discou

### Stemmed pros removing common stopwords (great, good, work, company)

In [47]:
# stemmed_pros = [tokenize_sentences(word) for word in stemmed_pros]
stemmed_pros = pd.Series(stemmed_pros).apply(lambda x: remove_stopwords(x, ['great','good','work','compani']))
dictionary_pros = corpora.Dictionary(stemmed_pros)
pros_corpus = [dictionary_pros.doc2bow(text) for text in stemmed_pros]
ldamodel_stemmed_pros_20_custom_stopwords = LdaModel(corpus = pros_corpus, num_topics=20, id2word = dictionary_pros, 
                                           passes=10, random_state = 25, minimum_probability = 0.0)
temp_file = datapath("lda_stemmed_pros_20_custom_stopwords")
ldamodel_stemmed_pros_20_custom_stopwords.save(temp_file)
ldamodel_stemmed_pros_20_custom_stopwords.show_topics(num_topics = 20, num_words = 13) # remove index to show all topics

[(0,
  '0.067*"perk" + 0.053*"right" + 0.051*"perform" + 0.047*"thing" + 0.045*"travel" + 0.042*"process" + 0.037*"mobil" + 0.037*"treat" + 0.032*"everyth" + 0.030*"fantast" + 0.029*"direct" + 0.027*"well" + 0.025*"patient"'),
 (1,
  '0.316*"peopl" + 0.066*"smart" + 0.058*"help" + 0.054*"friend" + 0.047*"around" + 0.045*"interest" + 0.039*"project" + 0.034*"lot" + 0.023*"alway" + 0.021*"realli" + 0.018*"enjoy" + 0.018*"someth" + 0.016*"get"'),
 (2,
  '0.093*"experi" + 0.075*"like" + 0.073*"famili" + 0.053*"room" + 0.038*"offer" + 0.036*"full" + 0.027*"gain" + 0.024*"huge" + 0.022*"varieti" + 0.021*"financi" + 0.021*"orient" + 0.019*"feel" + 0.018*"field"'),
 (3,
  '0.142*"learn" + 0.092*"lot" + 0.063*"new" + 0.048*"product" + 0.045*"technolog" + 0.044*"divers" + 0.031*"skill" + 0.028*"talent" + 0.024*"busi" + 0.023*"experi" + 0.022*"meet" + 0.022*"knowledg" + 0.016*"improv"'),
 (4,
  '0.113*"best" + 0.085*"compens" + 0.058*"industri" + 0.053*"one" + 0.048*"big" + 0.045*"world" + 0.032*

### Cons

In [66]:
ldamodel_cons.show_topics(num_topics = 20, num_words = 13)

[(0,
  '0.023*"get" + 0.021*"one" + 0.019*"people" + 0.016*"job" + 0.015*"even" + 0.012*"dont" + 0.012*"manager" + 0.012*"go" + 0.011*"youre" + 0.010*"want" + 0.010*"make" + 0.010*"time" + 0.010*"never"'),
 (1,
  '0.171*"hours" + 0.101*"long" + 0.039*"time" + 0.033*"schedule" + 0.033*"working" + 0.031*"enough" + 0.030*"day" + 0.030*"days" + 0.029*"week" + 0.019*"shift" + 0.014*"weekends" + 0.014*"hour" + 0.013*"busy"'),
 (2,
  '0.131*"store" + 0.081*"environment" + 0.071*"think" + 0.064*"none" + 0.052*"retail" + 0.039*"changing" + 0.030*"cant" + 0.025*"competitive" + 0.023*"driven" + 0.019*"toxic" + 0.018*"corporation" + 0.018*"constantly" + 0.016*"overall"'),
 (3,
  '0.085*"great" + 0.082*"good" + 0.066*"really" + 0.058*"cons" + 0.048*"nothing" + 0.039*"place" + 0.026*"bad" + 0.024*"isnt" + 0.022*"say" + 0.021*"best" + 0.021*"much" + 0.020*"everything" + 0.020*"learn"'),
 (4,
  '0.075*"sales" + 0.070*"customer" + 0.042*"service" + 0.033*"expectations" + 0.031*"customers" + 0.031*"goal