# Notebook 3: Topic Modeling and Extraction

##### Please refer to the Python Requirements and Installation Guide pdf 

####  Purpose: The purpose of the code below is to dig deeper into the 'Gut am Arbeitgeber finde ich_plain_text' and 'Schlecht am Arbeitgeber finde ich_plain_text' columns and analyse the reviews. To do so, we use Latent Dirichlet Allocation (LDA) for topic modelling and key topic extraction. LDA is a probabilistic method that groups a large collection of unstructured documents according to the topics their text is about. By analysing these 2 columns/features, we are able to better understand the most spoken-about topics within them.

#### Intuition: Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one example of a topic model and is used to assign the text in a document to one or more topics. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors. LDA is an unsupervised technique in which each topic is represented as a probability distribution over words, and each document is represented as a mixture of these topics.
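
To make the intuition concrete, the following is a minimal sketch of the generative story that LDA assumes, written with the Python standard library only. It is not how gensim fits the model. The helper names (`sample_dirichlet`, `generate_document`), the toy vocabulary and the two topic distributions are all hypothetical and chosen purely for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw this document's topic mixture (theta) from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Pick a word from that topic's distribution over the vocabulary.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words


rng = random.Random(42)
vocab = ["salary", "bonus", "team", "colleague", "training", "career"]
# Hypothetical topics: each row is a probability distribution over vocab.
topic_word = [
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],  # a "remuneration"-like topic
    [0.02, 0.03, 0.45, 0.45, 0.02, 0.03],  # a "work environment"-like topic
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 2) for p in theta], words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures per document and the word distributions per topic.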

#### Additional Python Libraries Required: 

1. __spaCy__ <br> 
Link: https://spacy.io/ <br>
!pip install -U spacy <br>
!python -m spacy download en_core_web_sm

2. __Gensim__ <br> 
Link: https://pypi.org/project/gensim/ <br>
!pip install gensim <br> 

3. __pyLDAvis__ <br> 
Link: https://pypi.org/project/pyLDAvis/ <br>
!pip install pyLDAvis <br> 


## Importing packages that will be required: 

In [23]:
# imports: 
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
import gensim
import numpy as np
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
import spacy
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
import re
nlp = spacy.load("en_core_web_sm")

## Loading the translated csv file that needs to be analysed: 

In [24]:
# Here, we load in Bechtle's translated dataframe and store it into input_df: 
# The translated Bechtle file can be found in "translated_csvs_folder" folder/directory. 
input_df = pd.read_csv('bechtle_translated.csv',sep='\t', encoding= 'utf-8')
input_df

Unnamed: 0.1,Unnamed: 0,review_idx,review_date,review_title,review_recommendation,review_rating,review_employee_info,Arbeitsatmosphäre_star,Arbeitsatmosphäre_plain_text,Work-Life-Balance_star,...,Spaßfaktor_star,Spaßfaktor_plain_text,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_star,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_plain_text,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_star,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_plain_text,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_star,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_plain_text,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_star,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_plain_text
0,0,review_0,2022-09-23T00:00:00+00:00,"Utopian performance expectations, no cohesion,...",,2.8,Ex-employee Has worked in the field of IT at B...,2.0,"blasphemy, permanent dissatisfaction, pulling ...",5.0,...,,,,,,,,,,
1,1,review_1,2022-09-23T00:00:00+00:00,Good employer with many freedoms.,,4.2,Employee Worked in IT at Bechtle Solingen in S...,,,,...,,,,,,,,,,
2,2,review_2,2022-09-21T00:00:00+00:00,Honest and fair employer,,4.7,Employee Worked for Bechtle IT-Systemhaus Nure...,4.0,"The atmosphere is great, so 4 stars is always ...",4.0,...,,,,,,,,,,
3,3,review_3,2022-09-18T00:00:00+00:00,Even a red apple can be rotten inside,,1.1,Employee Has worked in the field of logistics ...,1.0,Very many employees are dissatisfied. No motiv...,,...,,,,,,,,,,
4,4,review_4,2022-09-12T00:00:00+00:00,Great company with a lot of potential and clea...,,5.0,Manager / Management Worked at Bechtle GmbH & ...,5.0,Working hours and places are flexible. A high ...,5.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1641,1641,review_1641,2008-08-19T00:00:00+00:00,Very good: 4.08 out of 5 stars,,4.3,Worked for Bechtle IT-Systemhaus Oberhausen in...,,,,...,,,,,,,,,,
1642,1642,review_1642,2008-07-09T00:00:00+00:00,Good: 3.46 out of 5 stars,,3.8,Ex-employee Has worked at Bechtle GmbH in Fran...,,,,...,,,,,,,,,,
1643,1643,review_1643,2008-07-03T00:00:00+00:00,Sufficient: 1.62 out of 5 stars,,1.5,Ex-employee Has worked at Bechtle GmbH in Fran...,,,,...,,,,,,,,,,
1644,1644,review_1644,2008-01-30T00:00:00+00:00,Sufficient: 1.69 out of 5 stars,,1.9,Ex-employee Has worked at Bechtle Köln GmbH in...,,,,...,,,,,,,,,,


In [25]:
def construct_time(df): 
    '''
    Purpose: Extract the year and month that will be used in our further analysis. The year and 
    month are extracted from the review_date column and placed into new columns, namely year and month. 
    
    Parameters:
        df: The input_df that has been loaded above. Note that it is modified in place. 
    Return: The dataframe with the newly extracted year and month columns and without the review_date column. 
    '''
    
    # Plain column assignment so the column takes the datetime dtype; assigning through
    # .loc[:, ...] can keep the original object dtype in newer pandas versions, which breaks .dt
    df["review_date"] = pd.to_datetime(df["review_date"])
    df.insert(2, "year", df["review_date"].dt.year)
    df.insert(3, "month", df["review_date"].dt.month)
    df.pop("review_date")

    return df
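
As a pandas-free illustration of what the year and month extraction yields for a single row, the sketch below parses one `review_date` value from the table above with the standard library. The helper name `year_and_month` is hypothetical and not part of the notebook.

```python
from datetime import datetime


def year_and_month(timestamp):
    """Return (year, month) from an ISO 8601 timestamp string."""
    parsed = datetime.fromisoformat(timestamp)
    return parsed.year, parsed.month


# review_date of review_0 in the table above
print(year_and_month("2022-09-23T00:00:00+00:00"))  # (2022, 9)
```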

In [26]:
# As you can see, the input_df_time will contain the year and month columns for Bechtle's translated dataframe. 
input_df_time = construct_time(input_df)
input_df_time

Unnamed: 0.1,Unnamed: 0,review_idx,year,month,review_title,review_recommendation,review_rating,review_employee_info,Arbeitsatmosphäre_star,Arbeitsatmosphäre_plain_text,...,Spaßfaktor_star,Spaßfaktor_plain_text,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_star,Wie kann dich dein Arbeitgeber im Umgang mit der Corona-Situation noch besser unterstützen?_plain_text,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_star,Wofür möchtest du deinen Arbeitgeber im Umgang mit der Corona-Situation loben?_plain_text,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_star,Was macht dein Arbeitgeber im Umgang mit der Corona-Situation nicht gut?_plain_text,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_star,Wo siehst du Chancen für deinen Arbeitgeber mit der Corona-Situation besser umzugehen?_plain_text
0,0,review_0,2022,9,"Utopian performance expectations, no cohesion,...",,2.8,Ex-employee Has worked in the field of IT at B...,2.0,"blasphemy, permanent dissatisfaction, pulling ...",...,,,,,,,,,,
1,1,review_1,2022,9,Good employer with many freedoms.,,4.2,Employee Worked in IT at Bechtle Solingen in S...,,,...,,,,,,,,,,
2,2,review_2,2022,9,Honest and fair employer,,4.7,Employee Worked for Bechtle IT-Systemhaus Nure...,4.0,"The atmosphere is great, so 4 stars is always ...",...,,,,,,,,,,
3,3,review_3,2022,9,Even a red apple can be rotten inside,,1.1,Employee Has worked in the field of logistics ...,1.0,Very many employees are dissatisfied. No motiv...,...,,,,,,,,,,
4,4,review_4,2022,9,Great company with a lot of potential and clea...,,5.0,Manager / Management Worked at Bechtle GmbH & ...,5.0,Working hours and places are flexible. A high ...,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1641,1641,review_1641,2008,8,Very good: 4.08 out of 5 stars,,4.3,Worked for Bechtle IT-Systemhaus Oberhausen in...,,,...,,,,,,,,,,
1642,1642,review_1642,2008,7,Good: 3.46 out of 5 stars,,3.8,Ex-employee Has worked at Bechtle GmbH in Fran...,,,...,,,,,,,,,,
1643,1643,review_1643,2008,7,Sufficient: 1.62 out of 5 stars,,1.5,Ex-employee Has worked at Bechtle GmbH in Fran...,,,...,,,,,,,,,,
1644,1644,review_1644,2008,1,Sufficient: 1.69 out of 5 stars,,1.9,Ex-employee Has worked at Bechtle Köln GmbH in...,,,...,,,,,,,,,,


In [27]:
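import gensim
from gensim.utils import simple_preprocess 
# Imported under an alias so that the module is not shadowed by the stopword list below
from nltk.corpus import stopwords as nltk_stopwords
# Stopwords carry little topical meaning, so we load a list of English stopwords to remove before topic extraction
# (this requires the NLTK stopwords corpus, e.g. via nltk.download('stopwords'))
stopwords = nltk_stopwords.words('english')
import spacy
nlp = spacy.load("en_core_web_sm")
from pprint import pprint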


def topic_modelling(df, year, topic_col):
    
    '''
    Purpose: Analyse the words in the reviews and cluster them in order to identify the key topics that have been spoken
    about in the reviews. 
    df: The dataframe to use for topic extraction 
    year: The year for which we will perform topic modelling (int), or None to use the reviews from all years
    topic_col: The column for which we will perform topic modelling (str)
    
    Returns: A pyLDAvis dashboard of topics, in which words have been clustered. The next step is to inspect these clustered words on a yearly basis
    and analyse the topics that can be derived from them. The topics found by the LDA model are an approximation of the key topics. 
    '''
    df= df[[topic_col, 'year']]
    df= df.dropna(how='any', axis=0)
    df['len']= df[topic_col].map(lambda x: len(x))
    df = df.drop(df[df.len == 1].index)

    def sent_to_words(sentences):
        '''
        Purpose: Take the reviews as input and create a list of word tokens for each review. The operation is performed 
        row-wise in the later stages. 
        Parameters: 
            sentences: The list of review texts to tokenize.
        Returns: Tokenized reviews 
        '''
        for sentence in sentences:
            yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))       

    def eliminate_stopwords(data): 
        '''Purpose: Eliminate any stopwords in the reviews and thereby further curate the review text. The operation is 
        performed row-wise. 
        
        Parameters: 
            data: The tokenized reviews, one word list per row (review)
        
        Return: The reviews after the removal of stopwords
        '''
        doc= []
        for word_list in data: 
            temp=[]
            for word in word_list: 
                if word not in stopwords:
                    temp.append(word)
                else: 
                    continue
            doc.append(temp)
        return doc

    def make_trigram(texts):
        '''Purpose: Merge frequently co-occurring words into n-grams, i.e. phrases. A trigram captures phrases such
        as "flexible working hours", which convey more information than the single word "flexible".
        Parameters: 
            texts: The tokenized reviews after the stopwords have been eliminated 
        Returns: The reviews with any detected bigrams/trigrams joined into single tokens. 
        '''
        return [trigram_mod[bigram_mod[doc]] for doc in texts]
    def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        '''Purpose: Lemmatize the reviews and keep only words with selected part-of-speech tags. Here we allow nouns, adjectives, verbs and adverbs, 
        so the topics for our analysis are extracted from these words only. 
        Parameters: 
            texts: The reviews as token lists; the operation is performed row-wise as well. 
            allowed_postags: The spaCy part-of-speech tags to keep. 
        Returns: The lemmas of the nouns, adjectives, verbs and adverbs found in each review. 
        '''
        texts_out = []
        for sent in texts:
            doc = nlp(" ".join(sent)) 
            texts_out.append([token.lemma_ for token in doc
                             if token.pos_ in allowed_postags])
            
        return texts_out

    if year is None:
        

        topic_df= df[[topic_col]]
        topic_df= topic_df.dropna(how='any', axis=0)
        topic_df['len']= topic_df[topic_col].map(lambda x: len(x))
        topic_df = topic_df.drop(topic_df[topic_df.len == 1].index) # dropping rows whose text is a single character
        data = topic_df[topic_col].values.tolist() # convert to list
        data = [re.sub(r'[^a-zA-Z ]+', '', str(sent)) for sent in data] # removing special characters


        data_words = list(sent_to_words(data))
        no_stopwords = eliminate_stopwords(data_words)
            # # Creating and Applying Bigrams and Trigrams
        bigram = gensim.models.Phrases(data_words, min_count=2, threshold=30)
        trigram = gensim.models.Phrases(bigram[data_words],min_count= 3, threshold=30)
        bigram_mod = gensim.models.phrases.Phraser(bigram)
        trigram_mod = gensim.models.phrases.Phraser(trigram)
        data_words_trigrams = make_trigram(no_stopwords)

        data_lemmatized = lemmatization(data_words_trigrams,
                                    allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

        corpora_dict = corpora.Dictionary(data_lemmatized)      
        texts = data_lemmatized                       
        corpus = [corpora_dict.doc2bow(text) for text in texts] 


        #LDA Model
        lda_model = gensim.models.ldamodel.LdaModel\
                    (corpus=corpus, id2word=corpora_dict, num_topics =8, random_state = 42,
                     update_every = 1, chunksize = 50, passes = 5, alpha = 'auto',
                     per_word_topics=True) 
        pprint(lda_model.print_topics())
        doc_lda = lda_model[corpus]
        pyLDAvis.enable_notebook()
        vis = gensimvis.prepare(lda_model, corpus, corpora_dict)
        return vis 
    
    else: 
        print(year)
        time_df =  df[df['year']==year]
        topic_df= time_df[[topic_col]]
        topic_df= topic_df.dropna(how='any', axis=0)
        topic_df['len']= topic_df[topic_col].map(lambda x: len(x))
        topic_df = topic_df.drop(topic_df[topic_df.len == 1].index) # dropping rows whose text is a single character
        data = topic_df[topic_col].values.tolist() # convert to list
        data = [re.sub(r'[^a-zA-Z ]+', '', str(sent)) for sent in data] # removing special characters


        data_words = list(sent_to_words(data))
        no_stopwords = eliminate_stopwords(data_words)
            # # Create and Apply Bigrams and Trigrams
        bigram = gensim.models.Phrases(data_words, min_count=2, threshold=30)
        # # Higher threshold fewer phrases
        trigram = gensim.models.Phrases(bigram[data_words],min_count= 3, threshold=30)
        bigram_mod = gensim.models.phrases.Phraser(bigram)
        # # Faster way to get a sentence into a trigram/bigram
        trigram_mod = gensim.models.phrases.Phraser(trigram)
        data_words_trigrams = make_trigram(no_stopwords)

        data_lemmatized = lemmatization(data_words_trigrams,
                                    allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

        corpora_dict = corpora.Dictionary(data_lemmatized)      
        texts = data_lemmatized                       
        corpus = [corpora_dict.doc2bow(text) for text in texts] 


        # Building the LDA Model
        lda_model = gensim.models.ldamodel.LdaModel\
                    (corpus=corpus, id2word=corpora_dict, num_topics =8, random_state = 42,
                     update_every = 1, chunksize = 50, passes = 5, alpha = 'auto',
                     per_word_topics=True) 
        pprint(lda_model.print_topics())
        doc_lda = lda_model[corpus]
        pyLDAvis.enable_notebook()
        vis = gensimvis.prepare(lda_model, corpus, corpora_dict)
        # Compute the per-word likelihood bound used to estimate perplexity (computed here but not returned)
        perplexity = lda_model.log_perplexity(corpus)
        return vis 
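
The function above calls `lda_model.log_perplexity(corpus)`, which returns a per-word likelihood bound rather than the perplexity itself. The sketch below shows one way to turn such a bound into a perplexity estimate, assuming the convention gensim uses in its own log output (2 raised to the negative bound). The helper name `perplexity_from_bound` and the example bound values are hypothetical and are not results from this notebook.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate (2 ** -bound)."""
    return 2 ** (-per_word_bound)


# Hypothetical bound values for two models; a bound closer to zero gives a lower perplexity.
for bound in (-6.0, -7.5):
    print(bound, round(perplexity_from_bound(bound), 1))
```

When comparing runs with different numbers of topics, a lower perplexity on the same corpus indicates a better statistical fit, although it does not guarantee more interpretable topics.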


In [28]:
# In this cell, we pass in the company's dataframe for which we want to extract topics. The motivation is to run it on a yearly basis and 
# analyse the topics that have been spoken about. We run the cell below for each year from 2010 to 2022. By doing so, as analysts, we 
# analyse the words belonging to the various unsupervised clusters and derive the most spoken-about topics. From our analysis we were able to identify 3 
# main topics: 1. Management, 2. Remuneration and 3. Work Environment. The bubbles are unsupervised clusters and an approximation of the key topics.
# By running multiple iterations and analysing the various bubbles and the perplexity across all years from 2010 to 2022, we were able to bucket 
# the reviews into the aforementioned 3 topics.
topic_modelling(input_df_time,2016,'Gut am Arbeitgeber finde ich_plain_text')

2016
[(0,
  '0.062*"work" + 0.060*"great" + 0.052*"climate" + 0.039*"good" + '
  '0.031*"salary" + 0.031*"contract" + 0.031*"young" + 0.031*"working_hour" + '
  '0.027*"colleague" + 0.019*"flexible"'),
 (1,
  '0.026*"level" + 0.025*"personal" + 0.016*"flat" + 0.015*"hierarchy" + '
  '0.014*"mutual" + 0.014*"hire" + 0.014*"orientation" + 0.014*"fire" + '
  '0.014*"mentality" + 0.014*"longterm"'),
 (2,
  '0.100*"colleague" + 0.032*"good" + 0.025*"cohesion" + 0.020*"short" + '
  '0.020*"top" + 0.020*"brand" + 0.015*"perhaps" + 0.015*"commitment" + '
  '0.015*"constructively" + 0.015*"superior"'),
 (3,
  '0.066*"work" + 0.060*"good" + 0.058*"team" + 0.046*"atmosphere" + '
  '0.028*"building" + 0.019*"time" + 0.019*"level" + 0.019*"reputation" + '
  '0.019*"office" + 0.019*"nice"'),
 (4,
  '0.077*"training" + 0.067*"opportunity" + 0.036*"company" + 0.034*"offer" + '
  '0.031*"employee" + 0.027*"good" + 0.020*"give" + 0.020*"management" + '
  '0.015*"train" + 0.015*"employment"'),
 (5,
  '0.
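
The topic strings printed above follow the pattern `weight*"word" + weight*"word" + ...`. As a small sketch for working with this output, the snippet below parses such a string into (word, weight) pairs with a regular expression. The helper name `parse_topic` is hypothetical; gensim's `show_topic` method returns such pairs directly if the model object is still available.

```python
import re


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' topic string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


# First line of topic 0 from the output above
topic_0 = '0.062*"work" + 0.060*"great" + 0.052*"climate" + 0.039*"good"'
for word, weight in parse_topic(topic_0):
    print(word, weight)
```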



#### End of Notebook

##### Next notebook is 04_word_frequency.ipynb