# Notebook 2: Sentiment Analysis Notebook

##### Please refer to the Python Requirements and Installation Guide pdf 

Purpose: This notebook analyzes the translated dataframes by applying a pretrained sentiment model. 

### Imports required within the notebook 

#### Additional Python Dependencies: 
1. __PyTorch__ <br>
    Link: https://pytorch.org/get-started/locally/ <br>
    The installation guide at the link above gives the exact install command, which depends on your operating system. <br>
    pip install torch torchvision torchaudio <br> 


In [1]:
# Install the following packages to run the sentiment model:
# pip install torch
# pip install transformers
# pip install scipy
# pip install scikit-learn   (used for the mean absolute error at the end of the notebook)

In [2]:
# Importing the required packages for this notebook: 

import torch 
import transformers 
from timeit import default_timer as timer
import pandas as pd
from tqdm import tqdm
import time
import numpy as np
import sys
import warnings
# Suppress all warnings to keep the cell output readable
warnings.simplefilter('ignore')
import matplotlib.pyplot as plt
import datetime 
from scipy.special import softmax
import seaborn as sns 



## Importing Sentiment Model

The model we decided to use supports six languages; its accuracy varies with the input language. It has already been trained on reviews, including 150k English and 137k German reviews, and predicts a rating from 1 to 5. More information is available at: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment?text=I+hate+you. 

Load the tokenizer and the model from Hugging Face by passing the model identifier "nlptown/bert-base-multilingual-uncased-sentiment".

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [4]:
# Import the previously translated files: our focus company Bechtle, the market leader SAP,
# and the IT industry master table. 
# The paths below are specific to the machine the notebook was written on.
# Please adapt them accordingly, but keep sep='\t', encoding = 'utf-8' after inserting your path.
# The translated csv files can be found in the "translated_csvs_folder" folder/directory.
# Bechtle is our company of interest; SAP, as the market leader, serves as a benchmark.
# The master table is a union of all the IT companies we have scraped.
becthle_translated = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\translated_csvs_folder\bechtle_translated.csv",sep='\t', encoding = 'utf-8')
sap_translated = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\translated_csvs_folder\sap_translated.csv", sep = '\t', encoding = 'utf-8')
master_table = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\master_table", sep = '\t', encoding = 'utf-8')

In [5]:
becthle_translated.head(3)

Unnamed: 0.1,Unnamed: 0,review_idx,review_date,review_title,review_recommendation,review_rating,review_employee_info,Arbeitsatmosphäre_star,Arbeitsatmosphäre_plain_text,Work-Life-Balance_star,...,Spaßfaktor_star,Spaßfaktor_plain_text,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_star,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_plain_text,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_star,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_plain_text,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_star,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_plain_text,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_star,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_plain_text
0,0,review_0,2022-09-23T00:00:00+00:00,"Utopian performance expectations, no cohesion,...",,2.8,Ex-employee Has worked in the field of IT at B...,2.0,"blasphemy, permanent dissatisfaction, pulling ...",5.0,...,,,,,,,,,,
1,1,review_1,2022-09-23T00:00:00+00:00,Good employer with many freedoms.,,4.2,Employee Worked in IT at Bechtle Solingen in S...,,,,...,,,,,,,,,,
2,2,review_2,2022-09-21T00:00:00+00:00,Honest and fair employer,,4.7,Employee Worked for Bechtle IT-Systemhaus Nure...,4.0,"The atmosphere is great, so 4 stars is always ...",4.0,...,,,,,,,,,,


In [6]:
def construct_time(df): 
    """
    Purpose: Split the review_date column of the given dataframe into two separate columns,
    one for the year and one for the month. This aids data analysis and visualization.  
    Parameters: 
        df: dataframe to update (modified in place)
    
    Return: The dataframe with two extra columns, year and month, and without review_date.
    """
    if "Unnamed: 0" in df.columns: 
        df.drop(columns = "Unnamed: 0", inplace = True)
        
    # Plain column assignment, so the column really becomes datetime64
    # (required for the .dt accessor used below)
    df["review_date"] = pd.to_datetime(df["review_date"])
    df.insert(4, "year", df["review_date"].dt.year)
    df.insert(5, "month", df["review_date"].dt.month)
    df.pop("review_date")

    return df
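
As a quick illustration of what the split produces, here is a minimal standard-library sketch applied to the timestamp string shown in the preview above. It uses `datetime.fromisoformat` as a stand-in for `pd.to_datetime`, purely for demonstration; the notebook itself works on whole dataframe columns.

```python
from datetime import datetime

# Timestamp string as it appears in the review_date column of the preview
review_date = "2022-09-23T00:00:00+00:00"

# Parse the ISO 8601 string and pull out the two parts the notebook keeps
parsed = datetime.fromisoformat(review_date)
year, month = parsed.year, parsed.month
print(year, month)
```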


In [7]:
# We don't pass the master_table dataframe here, as the function has already been applied to it in the previous notebook. 
bechtle_translated = construct_time(becthle_translated)
sap_translated = construct_time(sap_translated)

# Measuring overall company sentiment 

To calculate overall company sentiment, all text fields of a review (one row) are combined into a single paragraph, and the model predicts one sentiment score for it. This is repeated for every row of the dataframe. 
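
The row-wise combination of text can be sketched in plain Python. The row below is a hypothetical, shortened review represented as a dictionary; as in a pandas dataframe, missing fields are `NaN` floats and are skipped.

```python
# Hypothetical review row: missing fields are NaN floats, as in a pandas dataframe
row = {
    "review_title": "Honest and fair employer",
    "review_recommendation": float("nan"),
    "Arbeitsatmosphäre_plain_text": "The atmosphere is great",
}

# Keep only the string fields and join them into one paragraph
paragraph = " ".join(value for value in row.values() if isinstance(value, str))
print(paragraph)
```

The resulting paragraph is what gets tokenized and passed to the model for that review.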

In [43]:

def calculate_sentiment_row(df):
    
    """
    Purpose: Concatenate all text fields of each row into one single paragraph and predict
    the sentiment of that paragraph. Scoring the whole review at once gives the model more context.
    
    Parameter: 
        df: dataframe for which we want to run the sentiment analysis. 
    
    Return: the dataframe with a new "Sentiment Score" column (1 to 5) placed after "review_rating".
    Text longer than the model's maximum input length is truncated by the tokenizer.
    """
    
    # The text columns are the same for every row, so determine them once
    first_cols = ['review_title', 'review_recommendation', 'review_employee_info']
    second_cols = list(df.filter(like = 'plain_text', axis = 1).columns)
    cols = first_cols + second_cols
    
    scores = []
    for idx in tqdm(df.index):
        # Missing values are NaN floats; only string fields are joined
        parts = [df.at[idx, col] for col in cols]
        string = ' '.join(part for part in parts if isinstance(part, str))
                
        tokens = tokenizer.encode(string, truncation=True, padding=True, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            result = model(tokens)
        # Class indices 0-4 correspond to ratings 1-5
        scores.append(int(torch.argmax(result.logits))+1)
    
    # Replace a score column left over from an earlier run
    if "Sentiment Score" in df.columns:
        df.pop("Sentiment Score")
    df.insert(df.columns.get_loc("review_rating")+1, "Sentiment Score", scores)
    
    return df
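
The function turns the model output into a score by taking `torch.argmax` of the logits and adding one. The sketch below reproduces that step in plain Python on hypothetical logits (made-up numbers, not real model output) and also shows how a softmax would turn them into class probabilities.

```python
import math

# Hypothetical logits for the five classes (ratings 1 to 5);
# in the notebook these come from model(tokens).logits
logits = [-1.2, 0.3, 2.1, 0.8, -0.5]

# Predicted rating: index of the largest logit, shifted from 0-4 to 1-5
stars = logits.index(max(logits)) + 1

# Softmax converts the logits into probabilities that sum to 1
exps = [math.exp(x - max(logits)) for x in logits]
probabilities = [e / sum(exps) for e in exps]

print(stars)
print([round(p, 3) for p in probabilities])
```

Keeping the probabilities instead of only the argmax would show how confident the model is in each rating, which the notebook does not currently use.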



The cell below shows one example of how the "Sentiment Score" column is calculated.

In [44]:
# Kept to demonstrate how the function works.
# Takes the bechtle_translated dataframe as input. 
bechtle_sentiment = calculate_sentiment_row(bechtle_translated)
# bechtle_sentiment now holds the predicted sentiment for each review (row). 
# The row-level sentiment for each review has been calculated and saved in the "sentiment" folder/directory. 


100%|██████████| 1646/1646 [04:49<00:00,  5.68it/s]


In [48]:
bechtle_sentiment.head(3)

Unnamed: 0,review_idx,review_title,review_recommendation,year,month,review_rating,Sentiment Score,review_employee_info,Arbeitsatmosphäre_star,Arbeitsatmosphäre_plain_text,...,Spaßfaktor_star,Spaßfaktor_plain_text,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_star,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_plain_text,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_star,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_plain_text,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_star,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_plain_text,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_star,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_plain_text
0,review_0,"Utopian performance expectations, no cohesion,...",,2022,9,2.8,1,Ex-employee Has worked in the field of IT at B...,2.0,"blasphemy, permanent dissatisfaction, pulling ...",...,,,,,,,,,,
1,review_1,Good employer with many freedoms.,,2022,9,4.2,5,Employee Worked in IT at Bechtle Solingen in S...,,,...,,,,,,,,,,
2,review_2,Honest and fair employer,,2022,9,4.7,4,Employee Worked for Bechtle IT-Systemhaus Nure...,4.0,"The atmosphere is great, so 4 stars is always ...",...,,,,,,,,,,


In [None]:
# Uncomment to run function on different tables 
# sentiments for SAP and IT industry:
# sap_sentiment = calculate_sentiment_row(sap_translated)
# master_table_sentiment = calculate_sentiment_row(master_table)

In [None]:
# This function saves a dataframe with the predicted sentiment score as a new csv file

def clean_send(df, file_name): 
    """
    Purpose: Save a dataframe as a tab-separated, utf-8 encoded csv file.
    
    Parameters:
        df: dataframe to save
        file_name: path where the file is supposed to be saved
    """
    
    # Drop the leftover index column from a previous save, if present
    if "Unnamed: 0" in df.columns: 
        df.drop(columns = "Unnamed: 0", inplace = True)
        
    df.rename(columns = {'':'team_name'}, inplace = True)
    df.to_csv(file_name, sep = '\t', encoding = 'utf-8', index = True)

In [50]:
# After running the sentiment function, the dataframes will match the ones loaded below. 
# The clean_send function above can be used to save the row-level sentiment. This was a step in our process. 
# If the notebook is run in a single session, there is no need for the clean_send function. 
bechtle_sentiment = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\Sentiment\bechtle_sentiment.csv", sep   = '\t', encoding = 'utf-8')
sap_sentiment = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\Sentiment\sap_sentiment.csv", sep   = '\t', encoding = 'utf-8')
master_table_sentiment = pd.read_csv(r"C:\Users\jdzuc\OneDrive\Frankfurt School Courses\Third Semester\Strategic Management\Final Project\Strategy Final Project\Sentiment\Industry_sentiment.csv", sep   = '\t', encoding = 'utf-8')

---------------------------------

### Sentiment Per Column 

In [32]:
# To restrict the sentiment to certain years, pass a filtered dataframe, e.g. bechtle_translated[bechtle_translated['year'] > 2019]

def column_sentiment(df):
    
    """
    Purpose: Concatenate the texts of all rows in each text column into one paragraph per column
    and predict one sentiment score for it.
    
    Parameter: 
        df: dataframe for which we want to run the column-wise sentiment analysis. 
    
    Return: a dictionary mapping each text (object dtype) column name to its predicted sentiment (1 to 5).
    Because the tokenizer truncates its input, only the beginning of each concatenated paragraph,
    up to the model's maximum input length, is scored.
    """
    
    df = df.select_dtypes(include='object')
    results = []
    column_names = df.columns
    
    for i in tqdm(df.columns):
        string = ''
        for row in df[i]:
            # Missing values are NaN floats and are skipped
            if isinstance(row, float):
                continue
            else:
                string += ' ' + row
        tokens = tokenizer.encode(string, truncation=True, padding=True, return_tensors='pt')
        with torch.no_grad():  # inference only, no gradients needed
            result = model(tokens)
        sentiment_score = int(torch.argmax(result.logits))+1
        results.append(sentiment_score)
                
    return dict(zip(column_names, results))

            

To calculate column sentiment, the text of all rows in one column is combined into a single paragraph, and the model predicts one sentiment score per column. 

In [37]:
bechtle_sentiment_columns = column_sentiment(bechtle_translated)

100%|██████████| 32/32 [00:18<00:00,  1.73it/s]


In [38]:
# Below you can see the columns and their corresponding predicted sentiment. The column-level sentiment has 
# been calculated as part of our exploratory data analysis.
bechtle_sentiment_columns

{'review_idx': 1,
 'review_title': 1,
 'review_recommendation': 4,
 'review_employee_info': 4,
 'Arbeitsatmosphäre_plain_text': 2,
 'Work-Life-Balance_plain_text': 2,
 'Kollegenzusammenhalt_plain_text': 4,
 'Vorgesetztenverhalten_plain_text': 1,
 'Kommunikation_plain_text': 2,
 'Gehalt/Sozialleistungen_plain_text': 2,
 'Gut am Arbeitgeber finde ich_plain_text': 4,
 'Schlecht am Arbeitgeber finde ich_plain_text': 1,
 'Verbesserungsvorschläge_plain_text': 1,
 'Image_plain_text': 2,
 'Karriere/Weiterbildung_plain_text': 4,
 'Umwelt-/Sozialbewusstsein_plain_text': 2,
 'Umgang mit älteren Kollegen_plain_text': 2,
 'Arbeitsbedingungen_plain_text': 2,
 'Gleichberechtigung_plain_text': 4,
 'Interessante Aufgaben_plain_text': 2,
 'Arbeitszeiten_plain_text': 4,
 'Ausbildungsvergütung_plain_text': 3,
 'Die Ausbilder_plain_text': 5,
 'Aufgaben/Tätigkeiten_plain_text': 2,
 'Variation_plain_text': 4,
 'Respekt_plain_text': 5,
 'Karrierechancen_plain_text': 4,
 'Spaßfaktor_plain_text': 3,
 'Wie kann 

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
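
One simple way to read the dictionary is to sort it so that the lowest-scoring topics come first. A minimal sketch, using a handful of the values printed above:

```python
# A few of the column scores printed above
column_scores = {
    'Kollegenzusammenhalt_plain_text': 4,
    'Vorgesetztenverhalten_plain_text': 1,
    'Kommunikation_plain_text': 2,
    'Die Ausbilder_plain_text': 5,
    'Verbesserungsvorschläge_plain_text': 1,
}

# Sort ascending by score, then by name so that ties have a stable order
ranked = sorted(column_scores.items(), key=lambda item: (item[1], item[0]))
for name, score in ranked:
    print(score, name)
```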

In [51]:
# Uncomment to apply to the other dataframes 

# sap_sentiment_columns = column_sentiment(sap_translated)
# industry_sentiment_columns = column_sentiment(master_table) 


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


# Calculating NPS and Absolute Error Score 

In [34]:
def calculate_nps(df): 
    """
    Purpose: Derive a Net Promoter Score (NPS) style breakdown from the kununu review rating.
    The 5-point rating is doubled to map it onto a 10-point scale, and each review is classed
    as detractor (<= 6), neutral (> 6 and < 9) or promoter (>= 9).
    
    Parameter: 
        df: dataframe for which we want to calculate the NPS breakdown. It must contain the
            "review_rating" and "Sentiment Score" columns; a "Net Promotion Score" column
            is inserted after "Sentiment Score".
    
    Return: a list [detractors, neutrals, promoters] with the share of each group in percent.
    """
    
    if "Unnamed: 0" in df.columns: 
        df.drop(columns= ["Unnamed: 0"], inplace=  True) 
        
    df.insert(df.columns.get_loc("Sentiment Score")+1, "Net Promotion Score", df["review_rating"].values * 2)
    
    detractors = len(df[df["Net Promotion Score"] <= 6])
    neutrals = len(df[(df['Net Promotion Score'] > 6) &  (df['Net Promotion Score'] < 9)])
    promoters = len(df[df["Net Promotion Score"] >= 9])
    total = df.shape[0]
    
    # Convert the counts to percentages of all reviews
    detractors_pct = round((detractors/total)*100,1)
    neutrals_pct = round((neutrals/total)*100,1)
    promoters_pct = round((promoters/total)*100,1)
    
    return [detractors_pct, neutrals_pct, promoters_pct]
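
To make the thresholds concrete, the sketch below classifies single ratings with the same rules as `calculate_nps`. The helper `nps_group` is hypothetical and only serves as an illustration; the ratings are those of the three Bechtle reviews shown in the preview.

```python
def nps_group(review_rating):
    """Classify one review using the thresholds of calculate_nps."""
    score = review_rating * 2  # map the 5-point rating onto a 10-point scale
    if score <= 6:
        return "detractor"
    if score < 9:
        return "neutral"
    return "promoter"

# Ratings of the three Bechtle reviews shown in the preview
for rating in (2.8, 4.2, 4.7):
    print(rating, nps_group(rating))
```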
            
    
    

The code below calculates, for each company and across all years, the percentage of detractors, neutrals and promoters. 

In [None]:
nps_results = {}
for company in list(master_table_sentiment["company"].unique()):
    nps_results.update({company: list(calculate_nps(master_table_sentiment[master_table_sentiment['company'] == company]))})

In [292]:
nps_results

{'capgemini': [21.2, 39.0, 39.8],
 'ibm': [42.3, 37.4, 20.2],
 'adesso': [17.2, 32.4, 50.4],
 'fujitsu': [28.6, 41.9, 29.5],
 'computacenter': [22.6, 38.0, 39.4],
 'swisscom': [20.0, 45.5, 34.5],
 't_systems': [29.9, 45.6, 24.5],
 'sap': [5.3, 26.4, 68.3],
 'dell': [14.0, 29.2, 56.8],
 'cancom': [38.5, 35.6, 26.0],
 'bechtle': [38.4, 34.9, 26.7]}
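
Each list above holds the percentage of detractors, neutrals and promoters. The conventional single-number NPS is the promoter share minus the detractor share. The notebook does not compute it, so the helper `nps_from_shares` below is a hypothetical addition, shown as a minimal sketch on three of the results above.

```python
def nps_from_shares(shares):
    """Return promoters minus detractors, both given in percent."""
    detractors, _neutrals, promoters = shares
    return round(promoters - detractors, 1)

# Three of the results printed above
example_results = {
    'sap': [5.3, 26.4, 68.3],
    'ibm': [42.3, 37.4, 20.2],
    'bechtle': [38.4, 34.9, 26.7],
}

nps = {company: nps_from_shares(shares) for company, shares in example_results.items()}
print(nps)
```

A positive value means promoters outnumber detractors; a negative value means the opposite.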

The code below calculates the mean absolute error (MAE) between the kununu review rating and the predicted sentiment score. 

In [53]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(bechtle_sentiment[bechtle_sentiment['year'] > 2011].loc[:,"review_rating"], bechtle_sentiment[bechtle_sentiment['year'] > 2011].loc[:,"Sentiment Score"])


In [56]:
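# Compare the kununu rating with the predicted sentiment for reviews from 2012 onwards
recent_reviews = bechtle_sentiment[bechtle_sentiment['year'] > 2011]
mae = mean_absolute_error(recent_reviews["review_rating"], recent_reviews["Sentiment Score"])
print(round(mae,2))           # MAE in rating points
print(round((mae/5)*100,2))   # MAE as a percentage of the 5-point scale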

0.75
14.97
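
The mean absolute error is the average of the absolute differences between rating and prediction. A minimal worked example in plain Python, using only the three Bechtle reviews shown in the preview (so the result differs from the full-dataset value above):

```python
# Ratings and predicted sentiment of the three Bechtle reviews shown in the preview
review_rating = [2.8, 4.2, 4.7]
sentiment_score = [1, 5, 4]

# Mean absolute error: average of |rating - prediction|
errors = [abs(rating - score) for rating, score in zip(review_rating, sentiment_score)]
mae_example = sum(errors) / len(errors)

print(round(mae_example, 2))            # error in rating points
print(round(mae_example / 5 * 100, 2))  # error as a percentage of the 5-point scale
```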


#### End of Notebook 

##### The next notebook to utilise is 03_topic_modeling.ipynb