# How do employees talk about epistemic virtue and vice of their workplaces?

This notebook experiments with different methods for classifying employee reviews. The goal is to identify reviews that talk about the epistemic virtues and vices of the organizations reviewers work at, such as curiosity, intellectual courage, epistemic vigilance etc. on the virtue side, and epistemic malevolence, indifference, closed-mindedness etx. on the vice side. Moreover, The notebook also experiments with different approaches to extract text features based on the classification, e.g. dictionaries. 

ToDo
- I wonder whether we are using HuggingFace inefficiently. I configured the zero-shot classifier to use batches of 5, and there is some parallelization built into the standard configuration, but I have not really gotten to the bottom of what the Dataset object is capable of. Not even sure we are using the GPU, or go more or less sequentially. Fixing this would speed up the classification tasks.

## Installs

In [None]:
# HuggingFace library -- the NLP transformer library
!pip install transformers

In [None]:
# Also part of HuggingFace, to process whole datasets effectively
!pip install datasets

In [104]:
# Spacy is another NLP library that we use, for splitting text into sentences.
!pip install spacy



In [105]:
# We use this Spacy language model to lemmatize reviews before creating dictionaries 
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
[K     |████████████████████████████████| 96.4 MB 1.6 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')


## Imports

In [111]:
import pandas as pd
import numpy as np
from transformers import pipeline
from collections import Counter
from datasets import Dataset
from google.colab import files

import spacy
import en_core_web_md
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

## Load Data

In [None]:
# Note that there is a repository that takes all reviews as an input and outputs this sample of relevant companies -- based on which companies we can match with ViolationTracker and CompuStat, and based on which companies have sufficiently many reviews.
df = pd.read_csv("ReviewsFiltered_sample.csv")

## Tokenize Reviews into sentences

The goal here is to split each comment into single sentences, so that we can classify sentences one-by-one -- as long comments are often talking about a whole bunch of things.

In [107]:
# Create Spacy pipeline to convert reviews into single sentences
nlp = English()
sentencizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sentencizer)

In [66]:
# Drop reviews that have no entries for pros and cons
df_cons = df[~df["cons"].isna()]
df_pros = df[~df["pros"].isna()]

In [None]:
# Split each con and each pro review into sentences; result is a list in the respective cell
df_cons["cons"] = df_cons["cons"].apply(lambda x: [sent.text for sent in nlp(x).sents])
df_pros["pros"] = df_pros["pros"].apply(lambda x: [sent.text for sent in nlp(x).sents])


In [79]:
# Explode list into separate rows
df_cons = df_cons.explode("cons")
df_pros = df_pros.explode("pros")


Note that we end up with two dataframes: df_cons if we want to work with negative comments, and df_pros if we want to work with positive comments.

## Configure zero-shot classifier pipeline

In [71]:
classifier = pipeline("zero-shot-classification", batch_size=5, device=0)

No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)


## Function to classify reviews

This function returns a dataframe with a row per comment. First column contains the comment text. All further columns contain scores between 0 and 1, representing the output of the classifier. The closer the score to 1, the higher the similarity between the word embedding of the label and the word embedding of the text.

Arguments:
- dataframe: either df_pros for positive comments or df_cons for negative comments
- review_column: either pros for positive comments or cons for negative comments
- classifier: the zero-shot classifier defined above.
- labels: (list). The categories that the classifier should attempt to classify comments into.
- number_of_reviews: (int). If 0, all rows in dataframe are used. If > 0, the function samples the given number of comments from the dataframe and only processes this sample. This is useful for quickly testing an idea -- just take 10.000 comments, then classification should be super fast.

In [95]:
def classify_reviews(dataframe, review_column, classifier, labels, number_of_reviews=0):
  #Sample dataset if a certain number of reviews is set
  if (number_of_reviews > 0):
    dataframe = dataframe.sample(number_of_reviews)
  # Create a HuggingFace Dataset object -- this can probably
  huggingface_dataset = Dataset.from_pandas(dataframe[["reviewID", review_column]])
  # Run the classifier
  results = classifier(huggingface_dataset[review_column][:10000], candidate_labels=labels)
  # Convert result in pandas dataframe and format
  results = pd.DataFrame(results)
  results[labels] = pd.DataFrame(results.scores.tolist())
  results = results.drop(["scores", "labels"], axis=1)
  results = results.rename(columns={"sequence": "comment"})
  return results

## Classification Experiments

Below are a bunch of attempts to classify comments based 

In [96]:
# Experiment 1: Using a single virtue category
labels = ["People here are curious"] # put as many categories in the list as you like
dataframe = df_pros
review_column = "pros"
experiment_1 = classify_reviews(dataframe, review_column, classifier, labels)


In [116]:
experiment_1.sort_values("People here are curious", ascending=False).head(25)

Unnamed: 0,comment,People here are curious
2537,"Good people in the work force, eager to listen...",0.998601
4207,"Very friendly, intellectually curious employees.",0.998492
3465,Taking Summer Friday's year-round?,0.998091
3466,Not knowing when the next round of layoffs + c...,0.998046
8242,"Why not $40,000?",0.997574
4829,Interesting work area.,0.997409
2057,Who does that??,0.997323
978,Now what did that cost?,0.997242
9835,At this point I'm wondering,0.996934
7433,"New company, looking forward to seeing what in...",0.996933


So I'd say this works insufficiently well yet. Row 1 and 2 are great hits. Rows 3-5 are misses. Row 6 is not bad, but more about the curiosity of the reviewer than of the company. The next real hit is 10 rows down: Teammates are open to new ideas.

Ideas: 
* Take out all sentences that end in a question mark.
* Provide more example / finetune the zero-shot clafssifier.




In [87]:
# Experiment 2: Using several vice categories, sentences
labels = ["People here are closed-minded.", "People here lack curiosity.", "People here are deceptive.", "People here have other attitudes."] # put as many categories in the list as you like
dataframe = df_cons
review_column = "cons"
experiment_2 = classify_reviews(dataframe, review_column, classifier, labels)


In [118]:
experiment_2.sort_values("People here are closed-minded.", ascending=False).head(25)

Unnamed: 0,comment,People here are closed-minded.,People here lack curiosity.,People here are deceptive.,People here have other attitudes.
8759,Outlook of company can vary based on the depar...,0.977762,0.014803,0.005337,0.002097
6028,"In a world where remote teams are common, ther...",0.973211,0.011267,0.007949,0.007573
1710,A lot of competing priorities,0.972569,0.019458,0.005057,0.002916
9084,Since the acquisition it seems to be a very d...,0.972233,0.015004,0.008812,0.003951
9765,"work/life balance, moving across operating groups",0.972099,0.016486,0.007808,0.003607
6627,"Some managers are more open, creating more gro...",0.972017,0.020153,0.00396,0.00387
3762,Customers come in with all sorts of problems a...,0.971912,0.021587,0.004557,0.001944
2188,Each team had wildly different personalities a...,0.971788,0.020925,0.004236,0.003051
9415,Many different cultures within same company (...,0.970684,0.020482,0.005835,0.002998
4347,Another thing is the guests at times can be ru...,0.967747,0.025132,0.005851,0.00127


These results are not compelling at all. Can hardly find a single sentence that is well-classified.

Ideas:
* Use less fine-grained categories. perhaps there are a few sentences that just indicate epistemic vice generally?
* Define final category for all the stuff that does not relate to information. 


In [89]:
# Experiment 3: Using several vice categories, single terms
labels = ["deception", "myopia", "indifference", "carelessness"] # put as many categories in the list as you like
dataframe = df_cons
review_column = "cons"
experiment_3 = classify_reviews(dataframe, review_column, classifier, labels)


In [119]:
experiment_3.sort_values("deception", ascending=False).head(25)

Unnamed: 0,comment,deception,myopia,indifference,carelessness
3454,"None at this time, neutral",0.994772,0.002264,0.001845,0.001119
4120,Nothing bad to talk about,0.989134,0.00452,0.003249,0.003097
2962,"None really, nothing to complain about",0.98911,0.00563,0.003794,0.001466
1283,None in particular to comment on,0.988848,0.004639,0.00362,0.002894
1870,There is nothing bad to report,0.988452,0.005479,0.003354,0.002716
1689,I have nothing bad to say,0.98798,0.00591,0.003929,0.002182
5351,Nothing to complain.,0.987129,0.004976,0.004535,0.00336
3409,They are going to cheat you on your hard work ...,0.986407,0.007492,0.003875,0.002225
3282,Sometimes columbia can be a little myopic and ...,0.984232,0.005534,0.005313,0.00492
6824,no cons to highlight at this time,0.983961,0.007003,0.005501,0.003535


None of this looks particularly compelling.

Ideas:
* Tokenize into words rather than sentences and try then?
* Use a version of experiment 1 and go for these terms one-by-one.

In [92]:
# Experiment 4: Using a single vice category
labels = ["People here are deceitful"] # put as many categories in the list as you like
dataframe = df_cons
review_column = "cons"
experiment_4 = classify_reviews(dataframe, review_column, classifier, labels)


In [120]:
experiment_4.sort_values("People here are deceitful", ascending=False).head(25)

Unnamed: 0,comment,People here are deceitful
3074,"Bad pay, random hours, they treat u like crap,...",0.999161
9222,I am disabled and they fired me after 2 months...,0.999015
8459,"Company doesn't care about it's technicians, N...",0.998979
7793,"This is a ""watch-your-back"", Machiavellian kin...",0.998923
5580,\r\n\r\nNot everyone that works at Overstock i...,0.998832
3265,\r\nSenior management are fake about what they...,0.998816
5581,They’re all entangled in this disgusting shady...,0.998649
5308,"Extremely unorganized and unprofessional, favo...",0.99859
5767,low pay; even after a couple years\r\nmicro ma...,0.998557
6651,"Hard work, low pay, they will trick you to fir...",0.998501


Ok this looks pretty promising, both for the highest-ranked and for the lowest-ranked comments!

Insights:
* Classifying according to just one category seems to work well.

Ideas:
* Describe category better by giving more examples. How do you do that?
* Try "This company is..." as a formular, rather than "People here are..." 


## Extract dictionary from top results

We need to do a couple of things here, tbd:

* Preprocessing: replace each comment with its lemma, so we are analysing at the level of lemmata.
* For each comment, create tf-idf vector matrix using sklearn
* Convert the vector matrix into a pandas dataframe and join with the results of the previous exercise. 
* Identify lemmas or ngrams that have higher tf-idf scores in comments that have a higher classification score. What is the algorithm here? Could be: We are looking at comments with a classification score above a certain threshold, say 0.95. For each lemma or ngram, we calculate the mean tf-idf score across all documents (is that always the same?) and in the subset of documents with classification scores above the threshold. We then calculate the difference between the two. Lemmas or ngrams where this difference is highest have the highest excess prevalence in comments classified above the threshold.
* Output should be a dataframe with a row for each lemma or ngram, and the metric just outlined in a second column.

### Preprocessing function
Replace each comment with its lemma, so we are analysing at the level of lemmata.

In [None]:
# Function to preprocess text into lemmas and kick out stopwords.
nlp = spacy.load('en_core_web_md')

def preprocess(text):
  	# Create Doc object
    doc = nlp(text, disable=["ner", "parser"])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in STOP_WORDS]
    
    return ' '.join(a_lemmas)

In [None]:
# Apply preprocess to positive and negative comments columns

df_cons_lemmatized = df_cons.copy()
df_cons_lemmatized["cons"] = df_cons_lemmatized["cons"].apply(preprocess)
df_pros_lemmatized = df_pros.copy()
df_pros_lemmatized["pros"] = df_pros_lemmatized["pros"].apply(preprocess)



ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])