# Competitor Analysis - Sentiment Analysis

## GlassDoor
For the Competitor Analysis use case, there is an opportunity to explore sentiment analysis of competitors' company culture based on reviews from the website GlassDoor. In this exploration, we use a Kaggle dataset containing GlassDoor reviews from 2008 to 2021. We removed a portion of the healthcare and public government reviews and subsetted for smaller companies, although we kept a couple of big firms based in London for comparison.

This sentiment analysis exploration looks at three pre-trained models: DistilBERT, VADER and RoBERTa.

### Dataset

In [1]:
import pandas as pd
from pprint import pprint
glassdoor_hotel = pd.read_csv("glassdoor_hotels.csv")

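# Combine the free-text fields; fill missing values first so that one NaN does not turn the whole review into NaN
glassdoor_hotel["reviewText"] = glassdoor_hotel["headline"].fillna("") + " " + glassdoor_hotel["pros"].fillna("") + " " + glassdoor_hotel["cons"].fillna("")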
pprint(glassdoor_hotel.head(5))


       id                             firm date_review            job_title  \
0  161804  Crowne-Plaza-Hotels-and-Resorts  30/09/2011                        
1  161805  Crowne-Plaza-Hotels-and-Resorts  21/02/2012                        
2  161806  Crowne-Plaza-Hotels-and-Resorts  26/03/2012    Sales Coordinator   
3  161807  Crowne-Plaza-Hotels-and-Resorts  09/04/2012              Manager   
4  161808  Crowne-Plaza-Hotels-and-Resorts  06/05/2012   Assistant Manager    

                    location  overall_rating  \
0            Cherry Hill, NJ               3   
1                 Dallas, TX               1   
2                        NaN               3   
3            Saint Louis, MO               2   
4  Gurgaon, Haryana, Haryana               1   

                                            headline  \
0  A lot of potential if a few key players were l...   
1                                Would not work here   
2  The hotel needed renovations so it was hard to...   
3           

In [None]:
import pandas as pd
# assign data of lists.
data = {
    'ID': [1, 2, 3, 4, 5],
    'Glassdoor Rating': [4, 4.2, 2.4, 1.8, 3],
    'ReviewText': [
        "I have been working at ONTHEBOOKS for over four years, and I am truly grateful for the opportunities and experiences I've had here. The company is committed to providing top-notch consulting services to its clients, and this dedication to excellence is evident in every aspect of the organization. The management team is supportive and encourages open communication, fostering a culture of collaboration and continuous improvement. ONTHEBOOKS also invests heavily in employee development, offering a wide range of training programs and resources to help us stay at the forefront of industry trends. The work-life balance is commendable, with flexible hours and remote work options that make it easy to maintain a healthy lifestyle. Overall, I am proud to be a part of the ONTHEBOOKS team and look forward to many more years of success here.",
        "My experience at ONTHEBOOKS has been nothing short of exceptional. The company's commitment to delivering high-quality consulting services is evident in the talented and dedicated team I have the pleasure of working with. The management is approachable and genuinely cares about the well-being and professional growth of their employees. The company culture is one of collaboration, innovation, and mutual respect, which has allowed me to thrive both personally and professionally. The compensation and benefits package is competitive, and the company regularly recognizes and rewards hard work and outstanding performance. I am excited about the future of ONTHEBOOKS and feel fortunate to be a part of this incredible organization.",
        "I regret to say that my time at ONTHEBOOKS has been quite disappointing. The workload is often overwhelming, with unrealistic expectations and tight deadlines that lead to a constant state of stress and burnout. The management team seems to be out of touch with the day-to-day challenges faced by employees and is often unresponsive to concerns raised. The company culture is highly competitive, with little emphasis on collaboration or teamwork. This has resulted in a toxic work environment where employees are pitted against each other, rather than working together towards common goals. Additionally, the compensation and benefits package is not on par with industry standards, making it difficult to justify the long hours and high-pressure environment.",
        "Having worked at ONTHEBOOKS for over a year, I must express my dissatisfaction with the company's leadership and work environment. The management is disorganized and often makes decisions without considering the impact on employees. Communication between departments is poor, leading to confusion and inefficiency. The work-life balance is practically non-existent, with employees regularly expected to work late nights and weekends to meet demanding client expectations. The company also lacks a clear growth plan, making it difficult to envision a long-term future here. Overall, I would not recommend ONTHEBOOKS to anyone seeking a fulfilling and supportive work environment.",
        "My experience at ONTHEBOOKS has been a mixed bag. On the positive side, the company offers a stable job with a decent salary and benefits package. The office location is convenient, and the facilities are well-maintained. The client portfolio is impressive, and the projects can be quite interesting and challenging. However, the company culture leaves much to be desired. There is a noticeable lack of camaraderie among employees, and the management team could do a better job of fostering a more inclusive and supportive environment. The workload can be quite heavy at times, but it is manageable with proper time management and prioritization. Overall, ONTHEBOOKS has its pros and cons, and whether it's the right fit for you will depend on your individual priorities and preferences.",
    ],
}
# Create DataFrame.
main_df = pd.DataFrame(data)
# Print the output.
print(main_df)

   ID  Glassdoor Rating                                         ReviewText
0   1               4.0  I have been working at ONTHEBOOKS for over fou...
1   2               4.2  My experience at ONTHEBOOKS has been nothing s...
2   3               2.4  I regret to say that my time at ONTHEBOOKS has...
3   4               1.8  Having worked at ONTHEBOOKS for over a year, I...
4   5               3.0  My experience at ONTHEBOOKS has been a mixed b...


### DistilBert

We can load the model from Hugging Face's transformers library, which provides pre-trained models that are easy to use.

In [1]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, pipeline

# Load pre-trained model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")


Before feeding data into DistilBERT, it needs to be tokenized. The tokenizer object we loaded earlier can handle this:

In [None]:
# pre processing data
def encode_review(text):
    return tokenizer.encode_plus(
        text,
        max_length=512,
        truncation=True,
        padding='max_length',
        return_attention_mask=True,
        return_tensors='pt'
    )
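
To make the `max_length`, `truncation` and `padding` arguments more concrete, here is a minimal sketch of what truncation and padding do to a sequence of token ids. It is a toy illustration, not the Hugging Face implementation: the function name `pad_or_truncate`, the pad id of 0 and the example ids are assumptions made for this sketch, and a real tokenizer additionally inserts special tokens and preserves them when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Toy illustration only: pad_id=0 and the ids below are hypothetical.
    """
    kept = token_ids[:max_length]            # truncation: drop ids past the limit
    n_pad = max_length - len(kept)           # padding needed to reach the limit
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9, 10], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=512` and `padding='max_length'`, as in `encode_review`, every review ends up as a fixed-length sequence, and the attention mask tells the model which positions hold real tokens.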

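Loading the model and tokenizer directly, as above, is better suited to training (fine-tuning) the model, a future prospect that may lead to more accurate results. The base distilbert-base-uncased checkpoint has no trained classification head, so it cannot give meaningful sentiment predictions until it is fine-tuned. For now, we will use the pipeline class to make predictions with models available on the Hugging Face Hub.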

In [25]:
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


We can see that it falls back to the default sentiment analysis model, distilbert-base-uncased-finetuned-sst-2-english.

In [45]:
def get_sentiment(
    text: str
):
    """Run the sentiment pipeline on a single review.

    Args:
        text (str): Review text for which we are extracting sentiment.

    Returns:
        list[dict]: One dict per input text, with a 'label' key
            ('POSITIVE' or 'NEGATIVE') and a 'score' key (confidence).
    """
    # The pipeline handles tokenization, inference and label mapping
    model_output = sentiment_pipeline(text)
    return model_output

In [52]:
label = []
score = []
for text in main_df["ReviewText"]:
    output = get_sentiment(text = text)
    label.append(output[0]['label'])
    score.append(output[0]['score'])

main_df["Sentiment_Label"] = label
main_df["Sentiment_Confidence"] = score

print(main_df)



   ID                                         ReviewText Sentiment_Label  \
0   1  I have been working at ONTHEBOOKS for over fou...        POSITIVE   
1   2  My experience at ONTHEBOOKS has been nothing s...        POSITIVE   
2   3  I regret to say that my time at ONTHEBOOKS has...        NEGATIVE   
3   4  Having worked at ONTHEBOOKS for over a year, I...        NEGATIVE   
4   5  My experience at ONTHEBOOKS has been a mixed b...        POSITIVE   

   Sentiment_Confidence  
0              0.999833  
1              0.999849  
2              0.999221  
3              0.999303  
4              0.883224  


Running this model gives each review a sentiment label, either POSITIVE or NEGATIVE, together with the model's confidence in that label. This is a good starting point; however, we would like more detail: a score for how positive, negative and neutral each review is, not just a single confidence value.
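
As a rough illustration of that limitation, the sketch below folds the label and confidence into a single signed score between -1 and 1, using the values printed above. The helper `signed_score` is a hypothetical name introduced here, and the mapping is only one possible convention, not part of the pipeline's API.

```python
def signed_score(label, confidence):
    """Map a ('POSITIVE' | 'NEGATIVE', confidence) pair to a value in [-1, 1]."""
    if label == "POSITIVE":
        return confidence
    if label == "NEGATIVE":
        return -confidence
    raise ValueError(f"unexpected label: {label!r}")


# (label, confidence) pairs copied from the pipeline output above
pipeline_outputs = [
    ("POSITIVE", 0.999833),
    ("POSITIVE", 0.999849),
    ("NEGATIVE", 0.999221),
    ("NEGATIVE", 0.999303),
    ("POSITIVE", 0.883224),
]

scores = [round(signed_score(label, conf), 3) for label, conf in pipeline_outputs]
print(scores)
```

Even on this scale the mixed review (ID 5) still comes out clearly positive, because a two-class model has no way to express neutrality.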

### VADER

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon to be available locally
nltk.download("vader_lexicon")
sentiments = SentimentIntensityAnalyzer()

In [63]:
# Score each review once and reuse the result for all three columns
vader_scores = [sentiments.polarity_scores(text) for text in main_df["ReviewText"]]
main_df['Positive'] = [s["pos"] for s in vader_scores]
main_df['Negative'] = [s["neg"] for s in vader_scores]
main_df['Neutral'] = [s["neu"] for s in vader_scores]

In [64]:
print(main_df)

   ID  Glassdoor Rating                                         ReviewText  \
0   1               4.0  I have been working at ONTHEBOOKS for over fou...   
1   2               4.2  My experience at ONTHEBOOKS has been nothing s...   
2   3               2.4  I regret to say that my time at ONTHEBOOKS has...   
3   4               1.8  Having worked at ONTHEBOOKS for over a year, I...   
4   5               3.0  My experience at ONTHEBOOKS has been a mixed b...   

   Positive  Negative  Neutral  
0     0.262     0.000    0.738  
1     0.308     0.011    0.682  
2     0.048     0.094    0.858  
3     0.070     0.163    0.767  
4     0.152     0.013    0.836  


Using VADER, we get the strength of Positive, Negative and Neutral sentiment for each review. However, from a manual comparison, the results do not seem to be as accurate as those from the Hugging Face models, so we will go back to using another Hugging Face pre-trained model.
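
One way to make part of that manual comparison repeatable is to check which of the three VADER components is largest for each review. The sketch below does this with the values copied from the table above; the helper name `dominant_component` is an assumption introduced for illustration. Note that VADER's pos, neg and neu values are proportions of the text falling in each category rather than class probabilities, so this is only a rough check.

```python
# VADER scores copied from the printed main_df above
vader_rows = [
    {"ID": 1, "Positive": 0.262, "Negative": 0.000, "Neutral": 0.738},
    {"ID": 2, "Positive": 0.308, "Negative": 0.011, "Neutral": 0.682},
    {"ID": 3, "Positive": 0.048, "Negative": 0.094, "Neutral": 0.858},
    {"ID": 4, "Positive": 0.070, "Negative": 0.163, "Neutral": 0.767},
    {"ID": 5, "Positive": 0.152, "Negative": 0.013, "Neutral": 0.836},
]


def dominant_component(row):
    """Return the name of the largest of the three VADER components."""
    components = ("Positive", "Negative", "Neutral")
    return max(components, key=lambda name: row[name])


dominant = {row["ID"]: dominant_component(row) for row in vader_rows}
print(dominant)
```

For all five reviews the Neutral share is the largest component, including the strongly worded positive and negative ones, which makes these three columns hard to use directly as a sentiment label.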

### RoBERTa

In [2]:
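from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"
roberta_model = AutoModelForSequenceClassification.from_pretrained(roberta)
roberta_tokenizer = AutoTokenizer.from_pretrained(roberta)
# Order of the model's output classes
labels = ['Negative', 'Neutral', 'Positive']

def roberta_sentiment(
    text: str
):
    """Return [negative, neutral, positive] probabilities for one review."""
    # Truncate long reviews so they fit within the model's input length
    encoded_text = roberta_tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    output = roberta_model(**encoded_text)
    # output[0] holds the logits with shape (1, 3); take the single row
    scores = output[0][0].detach().numpy()
    # Convert the logits into probabilities that sum to 1
    scores = softmax(scores)

    return scores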
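
The softmax step is what turns the model's raw outputs (logits) into the three scores that sum to 1. A minimal pure-Python sketch of the same calculation is shown below; the function name `softmax_py` and the example logits are made up for illustration and are not outputs of the model.

```python
import math


def softmax_py(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits in the order Negative, Neutral, Positive
example_logits = [-1.5, 0.2, 2.3]
probs = softmax_py(example_logits)

for label, p in zip(["Negative", "Neutral", "Positive"], probs):
    print(f"{label}: {p:.3f}")
```

The largest logit always receives the largest probability, so softmax changes the scale of the scores but not their ranking.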


In [28]:
neg = []
neu = []
pos = []
for text in glassdoor_hotel["reviewText"]:
    output = roberta_sentiment(text = text)
    neg.append(output[0])
    neu.append(output[1])
    pos.append(output[2])

glassdoor_hotel["Negative Sentiment"] = neg
glassdoor_hotel["Neutral Sentiment"] = neu
glassdoor_hotel["Positive Sentiment"] = pos

   ID  Glassdoor Rating                                         ReviewText  \
0   1               4.0  I have been working at ONTHEBOOKS for over fou...   
1   2               4.2  My experience at ONTHEBOOKS has been nothing s...   
2   3               2.4  I regret to say that my time at ONTHEBOOKS has...   
3   4               1.8  Having worked at ONTHEBOOKS for over a year, I...   
4   5               3.0  My experience at ONTHEBOOKS has been a mixed b...   

   Negative Sentiment  Neutral Sentiment  Positive Sentiment  
0            0.000942           0.009013            0.990045  
1            0.001332           0.008737            0.989931  
2            0.940457           0.054331            0.005212  
3            0.938415           0.055953            0.005631  
4            0.099290           0.291708            0.609002  


In [46]:
sentiment_df = glassdoor_hotel[["Negative Sentiment", "Neutral Sentiment", "Positive Sentiment"]]
glassdoor_hotel['Sentiment'] = sentiment_df.idxmax(axis=1)
pprint(glassdoor_hotel)

   ID  Glassdoor Rating                                         ReviewText  \
0   1               4.0  I have been working at ONTHEBOOKS for over fou...   
1   2               4.2  My experience at ONTHEBOOKS has been nothing s...   
2   3               2.4  I regret to say that my time at ONTHEBOOKS has...   
3   4               1.8  Having worked at ONTHEBOOKS for over a year, I...   
4   5               3.0  My experience at ONTHEBOOKS has been a mixed b...   

   Negative Sentiment  Neutral Sentiment  Positive Sentiment  \
0            0.000942           0.009013            0.990045   
1            0.001332           0.008737            0.989931   
2            0.940457           0.054331            0.005212   
3            0.938415           0.055953            0.005631   
4            0.099290           0.291708            0.609002   

            Sentiment  
0  Positive Sentiment  
1  Positive Sentiment  
2  Negative Sentiment  
3  Negative Sentiment  
4  Positive Sentiment  
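
The `idxmax(axis=1)` call picks, for each row, the name of the column holding the largest score. The sketch below reproduces that logic in plain Python on the values printed above and also keeps the winning score, which helps flag less clear-cut reviews. The helper name `top_label` and the choice to strip the " Sentiment" suffix are assumptions made for this illustration.

```python
# RoBERTa scores copied from the printed output above
roberta_rows = [
    {"Negative Sentiment": 0.000942, "Neutral Sentiment": 0.009013, "Positive Sentiment": 0.990045},
    {"Negative Sentiment": 0.001332, "Neutral Sentiment": 0.008737, "Positive Sentiment": 0.989931},
    {"Negative Sentiment": 0.940457, "Neutral Sentiment": 0.054331, "Positive Sentiment": 0.005212},
    {"Negative Sentiment": 0.938415, "Neutral Sentiment": 0.055953, "Positive Sentiment": 0.005631},
    {"Negative Sentiment": 0.099290, "Neutral Sentiment": 0.291708, "Positive Sentiment": 0.609002},
]


def top_label(row):
    """Return (label, score) for the highest-scoring sentiment column."""
    column = max(row, key=row.get)           # same idea as idxmax over one row
    return column.replace(" Sentiment", ""), row[column]


results = [top_label(row) for row in roberta_rows]
for review_id, (label, score) in enumerate(results, start=1):
    print(review_id, label, score)
```

The mixed review (ID 5) is still labelled Positive, but its top score of about 0.61 is far lower than the others, so keeping the score alongside the label preserves that nuance.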
