# Problem Statement 2

Companies commonly run employee surveys to gauge staff sentiment and concerns. These surveys often include questions that employees answer in free-form text.

The objective is to build a model that answers three questions: (1) What topics appear in the response text? (2) Which topics concern each department? (3) What can we infer about the profile of individual respondents? (bonus)

A dataframe with three columns is provided: (1) the ID of the individual, (2) the ID of the department, and (3) the response text. The data may need cleaning before model building.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
from pprint import pprint

from src.utils2 import load_data, preprocess, evaluate, save_json

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kokmeng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
df = load_data()
print(df.shape)
df.head()

(155, 3)


Unnamed: 0,unique_identifier,employee_feedback,department
0,3565,There's a culture of blame within the company ...,Dept A
1,7323,The company's approach to feedback and perform...,Dept A
2,5008,"While page limits have been set, some departme...",Dept A
3,3460,na,Dept A
4,2179,The culture of collaboration within our team i...,Dept A


## Preprocess

In [4]:
data = df["employee_feedback"].tolist()

data_lemmatized = preprocess(data)
print(len(data_lemmatized))

155
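
The data preview above contains the response `na`, and later rows contain `Nil` and `NO`. These carry no topical content, yet they still pass through `preprocess` and receive a topic. Below is a minimal sketch of one possible cleaning step, assuming such placeholders should be dropped before modelling. The `PLACEHOLDERS` set and the `is_placeholder` helper are hypothetical and are not part of `src.utils2`.

```python
# Hypothetical list of non-answers; extend it after inspecting the real data.
PLACEHOLDERS = {"", "na", "n/a", "nil", "no", "none", "nothing"}


def is_placeholder(text):
    """Return True when a response is empty or a known non-answer."""
    normalized = str(text).strip().strip(".!").lower()
    return normalized in PLACEHOLDERS


responses = ["There's a culture of blame within the company", "na", "Nil", "NO"]
kept = [r for r in responses if not is_placeholder(r)]
print(kept)
```

In the notebook, the same helper could be applied with `df = df[~df["employee_feedback"].map(is_placeholder)]` before calling `preprocess`. Note that this would reduce the row count below 155.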


## Model

In [6]:
# Create Dictionary
id2word = Dictionary(data_lemmatized)

# Bag-of-words corpus: each document becomes a list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in data_lemmatized]
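
To make the corpus format concrete, here is a pure-Python sketch of the bag-of-words idea behind `Dictionary` and `doc2bow`. It is an illustration only, not gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, and gensim may assign token ids in a different order.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


docs = [["team", "feel", "team"], ["company", "feel"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded in this representation; only the per-document token counts reach the LDA model.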

In [7]:
# Setting `num_topics = 7` seems to give the most interpretable results

lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=7,
    random_state=100,
    update_every=1,
    chunksize=100,
    passes=20,
    alpha="auto",
    per_word_topics=True,
)

In [8]:
_ = evaluate(lda_model, corpus, data_lemmatized, id2word)

  Perplexity = -6.7575
  Coherence Score = 0.4186
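
The negative "Perplexity" value suggests that `evaluate` prints the result of `lda_model.log_perplexity(corpus)`, which in gensim is a per-word likelihood bound rather than the perplexity itself. This is an assumption, since the source of `evaluate` is not shown. Under that assumption, gensim defines perplexity as `2 ** (-bound)`, which the sketch below computes with a hypothetical helper.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound to perplexity."""
    return 2 ** (-per_word_bound)


perplexity = perplexity_from_bound(-6.7575)
print(round(perplexity, 1))
```

For the bound reported above this gives a perplexity of roughly 108. Lower perplexity indicates a better fit, so a bound closer to zero is better.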


In [9]:
topics_keywords = lda_model.print_topics()
pprint(topics_keywords)

[(0,
  '0.016*"company" + 0.014*"always" + 0.013*"future" + 0.013*"difficult" + '
  '0.012*"level" + 0.012*"opinion" + 0.012*"provide" + 0.012*"vision" + '
  '0.011*"question" + 0.010*"help"'),
 (1,
  '0.042*"team" + 0.027*"feel" + 0.020*"make" + 0.018*"work" + 0.015*"member" '
  '+ 0.013*"good" + 0.013*"career" + 0.012*"company" + 0.012*"share" + '
  '0.011*"within"'),
 (2,
  '0.030*"company" + 0.023*"feedback" + 0.015*"feel" + 0.015*"could" + '
  '0.015*"benefit" + 0.014*"job" + 0.014*"progress" + 0.013*"opportunity" + '
  '0.013*"see" + 0.012*"employee"'),
 (3,
  '0.026*"day" + 0.026*"work" + 0.019*"effort" + 0.016*"feel" + 0.014*"allow" '
  '+ 0.013*"process" + 0.013*"meeting" + 0.013*"leadership" + 0.012*"within" + '
  '0.012*"company"'),
 (4,
  '0.023*"employee" + 0.020*"work" + 0.018*"management" + 0.016*"value" + '
  '0.014*"help" + 0.013*"culture" + 0.013*"however" + 0.012*"seem" + '
  '0.011*"commitment" + 0.011*"team"'),
 (5,
  '0.024*"company" + 0.020*"lack" + 0.018*"day" +

In [10]:
save_json(topics_keywords, "./results/topics_keywords.json")

In [11]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

In [12]:
pyLDAvis.save_html(vis, "./results/pyldavis_vis.html")

In [13]:
# Assign each feedback its most probable topic
doc_lda = list(lda_model.get_document_topics(corpus))
# Each entry of doc_lda is a list of (topic_id, probability) pairs; keep the top pair
topics = np.asarray([max(y, key=lambda x: x[1]) for y in doc_lda])
df["topic"] = topics[:, 0].astype(int)
df["prob"] = topics[:, 1]
df.head(10)

Unnamed: 0,unique_identifier,employee_feedback,department,topic,prob
0,3565,There's a culture of blame within the company ...,Dept A,3,0.967132
1,7323,The company's approach to feedback and perform...,Dept A,2,0.991127
2,5008,"While page limits have been set, some departme...",Dept A,1,0.978631
3,3460,na,Dept A,4,0.73078
4,2179,The culture of collaboration within our team i...,Dept A,1,0.98775
5,6830,While the workload can be overwhelming at time...,Dept A,6,0.978904
6,3828,Nil,Dept A,3,0.720302
7,1598,NO,Dept A,6,0.207057
8,7594,While the company offers competitive compensat...,Dept A,2,0.981205
9,7910,While the company's benefits package is genera...,Dept A,6,0.720279
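
Row 7 (`NO`) is assigned topic 6 with a probability of only 0.207, which is not far above the uniform value of 1/7 (about 0.143) for seven topics. The sketch below shows one way to flag such weak assignments instead of treating them as confident labels. The `dominant_topic` helper and its `min_prob` threshold of 0.5 are hypothetical choices, not part of the original analysis.

```python
def dominant_topic(topic_probs, min_prob=0.5):
    """Return (topic_id, prob) for the most probable topic.

    topic_id is None when the top probability falls below min_prob.
    """
    topic_id, prob = max(topic_probs, key=lambda pair: pair[1])
    if prob < min_prob:
        return None, prob
    return topic_id, prob


confident = dominant_topic([(1, 0.02), (2, 0.98)])
uncertain = dominant_topic([(3, 0.19), (6, 0.21), (4, 0.20)])
print(confident, uncertain)
```

Rows flagged with `None` could be excluded from the per-topic and per-department counts, or reviewed by hand.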


In [14]:
df["topic"].value_counts()

topic
6    37
1    27
2    24
3    20
4    20
0    14
5    13
Name: count, dtype: int64

In [15]:
df.groupby("department")["topic"].value_counts()

department  topic
Dept A      6        13
            2         6
            4         6
            5         5
            0         4
            1         3
            3         3
Dept B      6         6
            1         3
            3         2
            4         1
            0         1
Dept C      6        16
            2        13
            3        12
            4        11
            1        10
            5         6
            0         6
Dept D      1        11
            2         5
            0         3
            3         3
            4         2
            5         2
            6         2
Name: count, dtype: int64
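
The departments differ widely in size (40, 13, 74, and 28 responses for Dept A to Dept D), so raw counts are hard to compare across departments. The sketch below converts the counts from the output above into per-department shares. The `topic_shares` helper is hypothetical, and the `counts` dictionary is copied by hand from that output.

```python
# Counts copied from the groupby output above: {department: {topic: count}}
counts = {
    "Dept A": {6: 13, 2: 6, 4: 6, 5: 5, 0: 4, 1: 3, 3: 3},
    "Dept B": {6: 6, 1: 3, 3: 2, 4: 1, 0: 1},
    "Dept C": {6: 16, 2: 13, 3: 12, 4: 11, 1: 10, 5: 6, 0: 6},
    "Dept D": {1: 11, 2: 5, 0: 3, 3: 3, 4: 2, 5: 2, 6: 2},
}


def topic_shares(topic_counts):
    """Normalize one department's topic counts so they sum to 1."""
    total = sum(topic_counts.values())
    return {topic: n / total for topic, n in sorted(topic_counts.items())}


shares = {dept: topic_shares(tc) for dept, tc in counts.items()}
for dept, dept_shares in shares.items():
    top = max(dept_shares, key=dept_shares.get)
    print(f"{dept}: top topic {top}, share {dept_shares[top]:.3f}")
```

In the notebook itself, `pd.crosstab(df["department"], df["topic"], normalize="index")` should produce the same shares directly from `df`. Topic 6 leads in Dept A, Dept B, and Dept C, while topic 1 leads in Dept D.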

In [16]:
# Save results
df.to_csv("./results/employee_feedback.csv", index=False)