## Statement 2

It is common for a company to conduct employee surveys to gauge staff sentiment and concerns. Such surveys often include questions that employees answer in free-form text.

The objective is to build a model that answers the following questions: (1) What topics appear in the response text? (2) Which topics concern each department? (3) What can we infer about the profile of individuals? (bonus)

A dataframe with three columns will be provided: (1) the ID of the individual, (2) the ID of the department, and (3) the response text. The data may need to be cleaned before model building.
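One cleaning step worth considering is dropping placeholder answers such as `na`, `Nil`, or `NO`, which carry no topical content. The sketch below is only an illustration on toy data: the `PLACEHOLDERS` set and the `drop_placeholder_responses` helper are hypothetical and are not part of `src.utils2`, whose `preprocess` function may already handle such rows differently.

```python
import pandas as pd

# Hypothetical list of placeholder answers; extend it after inspecting the data.
PLACEHOLDERS = {"", "na", "n/a", "nil", "no", "none"}


def drop_placeholder_responses(
    df: pd.DataFrame, text_col: str = "employee_feedback"
) -> pd.DataFrame:
    """Return a copy of df without missing or placeholder responses."""
    # Normalize case and whitespace so that " Nil " and "NIL" match "nil".
    normalized = df[text_col].fillna("").astype(str).str.strip().str.lower()
    keep = ~normalized.isin(list(PLACEHOLDERS))
    return df.loc[keep].reset_index(drop=True)


# Toy data for illustration only (not the survey data).
toy = pd.DataFrame(
    {
        "unique_identifier": [1, 2, 3, 4],
        "employee_feedback": ["The workload is heavy.", "na", " Nil ", None],
        "department": ["Dept A", "Dept A", "Dept B", "Dept B"],
    }
)
cleaned = drop_placeholder_responses(toy)
print(cleaned)
```

Filtering before modelling changes the row count, so any later step that aligns predictions back to the original dataframe would need to use the filtered frame as well.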

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
# Download stopwords
import nltk
nltk.download("stopwords")

In [3]:
import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim
import gensim.corpora as corpora
from gensim.models.ldamodel import LdaModel
from pprint import pprint

from src.utils2 import load_data, preprocess, evaluate, save_json

In [4]:
df = load_data()
print(df.shape)
df.head()

(155, 3)


Unnamed: 0,unique_identifier,employee_feedback,department
0,3565,There's a culture of blame within the company ...,Dept A
1,7323,The company's approach to feedback and perform...,Dept A
2,5008,"While page limits have been set, some departme...",Dept A
3,3460,na,Dept A
4,2179,The culture of collaboration within our team i...,Dept A


## Preprocess data

In [5]:
data = df["employee_feedback"].tolist()

data_lemmatized = preprocess(data)
print(len(data_lemmatized))

155


## Model

In [6]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in data_lemmatized]

Setting `num_topics=7` seemed to give the best results among the values tried.

In [7]:
lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=7,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=20,
    alpha="auto",
    per_word_topics=True,
)

In [8]:
_ = evaluate(lda_model, corpus, data_lemmatized, id2word)

  Perplexity = -6.6560
  Coherence Score = 0.4371
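A true perplexity cannot be negative, so the value labelled "Perplexity" is probably a log-scale quantity. Assuming `evaluate` reports `LdaModel.log_perplexity(corpus)` (the helper's source is not shown here), the number is gensim's per-word likelihood bound, and gensim's own perplexity estimate is 2 raised to the negative of that bound. A small worked conversion:

```python
# Value printed as "Perplexity" above; assumed to be gensim's per-word bound.
per_word_bound = -6.6560

# gensim's perplexity estimate is 2 ** (-bound); lower is better.
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity estimate: {perplexity:.1f}")
```

Under that assumption, the bound itself is better when it is closer to zero, while the converted perplexity is better when it is smaller.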


In [9]:
topics_keywords = lda_model.print_topics()
pprint(topics_keywords)

[(0,
  '0.048*"day" + 0.017*"team" + 0.014*"teammate" + 0.014*"future" + '
  '0.014*"operation" + 0.010*"value" + 0.010*"opinion" + 0.007*"base" + '
  '0.007*"workflow" + 0.007*"logically"'),
 (1,
  '0.031*"job" + 0.021*"help" + 0.014*"innovation" + 0.014*"passion" + '
  '0.013*"safety" + 0.013*"psychological" + 0.013*"work" + 0.013*"need" + '
  '0.012*"seem" + 0.010*"feel"'),
 (2,
  '0.023*"good" + 0.018*"pay" + 0.016*"feel" + 0.015*"work" + 0.015*"benefit" '
  '+ 0.015*"receive" + 0.014*"allow" + 0.012*"salary" + 0.012*"structure" + '
  '0.011*"hear"'),
 (3,
  '0.047*"company" + 0.030*"work" + 0.022*"employee" + 0.022*"feel" + '
  '0.020*"make" + 0.018*"provide" + 0.016*"help" + 0.016*"lack" + 0.012*"team" '
  '+ 0.011*"manager"'),
 (4,
  '0.054*"work" + 0.025*"life_balance" + 0.024*"effort" + 0.021*"flexibility" '
  '+ 0.020*"appreciate" + 0.019*"workload" + 0.017*"training" + '
  '0.017*"diverse" + 0.017*"expectation" + 0.016*"policy"'),
 (5,
  '0.027*"team" + 0.027*"feel" + 0.025*

In [10]:
save_json(topics_keywords, "./results/topics_words.json")

In [11]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

In [12]:
pyLDAvis.save_html(vis, "./results/pyldavis_vis.html")

In [13]:
# Predict the dominant topic for each feedback
doc_lda = list(lda_model.get_document_topics(corpus))
# Keep the (topic_id, probability) pair with the highest probability per document
topics = np.asarray([max(y, key=lambda x: x[1]) for y in doc_lda])
df["topic"] = topics[:, 0].astype(int)
df["prob"] = topics[:, 1]
df.head(10)

Unnamed: 0,unique_identifier,employee_feedback,department,topic,prob
0,3565,There's a culture of blame within the company ...,Dept A,2,0.569126
1,7323,The company's approach to feedback and perform...,Dept A,3,0.570516
2,5008,"While page limits have been set, some departme...",Dept A,5,0.967868
3,3460,na,Dept A,3,0.434422
4,2179,The culture of collaboration within our team i...,Dept A,3,0.987331
5,6830,While the workload can be overwhelming at time...,Dept A,4,0.965782
6,3828,Nil,Dept A,5,0.661428
7,1598,NO,Dept A,3,0.434422
8,7594,While the company offers competitive compensat...,Dept A,3,0.980142
9,7910,While the company's benefits package is genera...,Dept A,3,0.631473


In [14]:
df["topic"].value_counts()

topic
3    77
5    16
2    14
4    12
1    12
6    12
0    12
Name: count, dtype: int64
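The counts above are pooled across all departments. To address question (2), the dominant topics can be broken down by department. The following is a minimal sketch on toy data; the `toy` frame stands in for `df` after the `topic` column has been added, and its values are illustrative only.

```python
import pandas as pd

# Toy stand-in for df with the "topic" column (illustrative values only).
toy = pd.DataFrame(
    {
        "department": ["Dept A", "Dept A", "Dept A", "Dept B", "Dept B", "Dept B"],
        "topic": [3, 3, 4, 2, 3, 2],
    }
)

# Rows: departments; columns: topics; values: share of each department's responses.
topic_share = pd.crosstab(toy["department"], toy["topic"], normalize="index")
print(topic_share.round(2))

# Most frequent dominant topic per department.
top_topic = topic_share.idxmax(axis=1)
print(top_topic)
```

Normalizing by row makes departments of different sizes comparable. Passing `normalize=False` (the default) would give raw counts instead.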

In [15]:
# Save results
df.to_csv("./results/employee_feedback.csv", index=False)