<a href="https://colab.research.google.com/github/ispada/attrition/blob/main/ATTRITION_BERTopic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ATTRITION** - Topic Modeling with BERTopic

BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. (https://maartengr.github.io/BERTopic/index.html)

## Information about the algorithm
website: https://maartengr.github.io/BERTopic/algorithm/algorithm.html

paper: https://arxiv.org/pdf/2203.05794.pdf  

<img src="https://maartengr.github.io/BERTopic/img/algorithm.png" width="50%">
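
The class-based TF-IDF step can be illustrated with a small, dependency-free sketch. It follows the weighting given in the BERTopic paper, W(t, c) = tf(t, c) · log(1 + A / f(t)), where tf(t, c) is the frequency of term t in cluster c, f(t) is its frequency across all clusters, and A is the average number of words per cluster. The function name and the toy clusters are illustrative assumptions; BERTopic's own implementation works on sparse matrices and may apply additional normalization.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """Score terms per cluster; `clusters` maps a cluster id to its list of tokens."""
    term_counts = {cid: Counter(tokens) for cid, tokens in clusters.items()}
    # f(t): frequency of each term across all clusters
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total_counts.values()) / len(clusters)
    return {
        cid: {t: tf * math.log(1 + avg_words / total_counts[t]) for t, tf in counts.items()}
        for cid, counts in term_counts.items()
    }

# Made-up clusters, each already merged into one list of tokens
toy_clusters = {
    0: "nurse turnover nurse burnout".split(),
    1: "student dropout student retention nurse".split(),
}
scores = c_tf_idf(toy_clusters)
top_term_0 = max(scores[0], key=scores[0].get)
top_term_1 = max(scores[1], key=scores[1].get)
print(top_term_0, top_term_1)
```

In the toy data, "nurse" appears in both clusters, so it is down-weighted in cluster 1, where it is less frequent than "student".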


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

*   Navigate to Edit→Notebook Settings
*   Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic
!pip install joblib==1.1.0

Installing BERTopic updates some packages that were already loaded, so restart the notebook before using them. **From the Menu: Runtime → Restart Runtime**

`joblib` is pinned to version 1.1.0 because of an issue in the library update that was not solved at the time of the analysis. See https://github.com/scikit-learn-contrib/hdbscan/issues/565
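
After the restart, it can be worth confirming which versions are actually installed. The sketch below uses only the standard library; the helper name `installed_version` is an illustrative assumption and not part of BERTopic.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Print the versions relevant to the pin above (None means "not installed")
for name in ("bertopic", "joblib", "hdbscan"):
    print(name, installed_version(name))
```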

# **Import Data**
Import the dataset for topic modeling. In this case we will analyze the scientific publications about attrition. The dataset includes the Title, Abstract, Author Keywords, Year, Number of Citations, and Authors of papers available on Scopus. The papers were checked manually to keep only those in scope for the analysis.

In [None]:
# Connect Google Drive (GDrive) with Colab
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
# Import the file from Google Drive
import pandas as pd
attrition_paper = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/input/paper_table_clean.csv')

Let's have a look at the dataset and at the string that we will use for topic modeling.

In [None]:
attrition_paper.head()

# **Text Preprocessing**

1. Merge text from Title, Author Keywords, and Abstracts.

In [None]:
attrition_paper[['Title', 'Abstract', 'Author Keywords']] = attrition_paper[['Title', 'Abstract', 'Author Keywords']].fillna('')
attrition_paper['Cited by'] = attrition_paper['Cited by'].fillna(0)
attrition_paper['text'] = attrition_paper['Title'] + ' ' + attrition_paper['Abstract'] + ' ' + attrition_paper['Author Keywords']

In [None]:
attrition_paper.head()

| Unnamed: 0.1 | Unnamed: 0 | Title | Authors | Year | Cited by | Author Keywords | Abstract | EID | text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The association between shift work disorder an... | Blytt K.M., Bjorvatn B., Moen B.E., Pallesen S... | 2022 | 0.0 | Nursing; Shift work disorder; Sleep; Turnover ... | Background: Shift work disorder (SWD) is highl... | 2-s2.0-85131318824 | The association between shift work disorder an... |
| 1 | 2 | How is leadership experienced in joy-of-life-n... | André B., Jacobsen F.F., Haugan G. | 2022 | 0.0 | Joy of life in nursing homes; Leadership; Nurs... | Background: Nursing homes are under strong pre... | 2-s2.0-85127415691 | How is leadership experienced in joy-of-life-n... |
| 2 | 3 | Explaining the consequences of missed nursing ... | Janatolmakan M., Khatony A. | 2022 | 0.0 | Missed nursing care; Nurse; Outcome; Qualitati... | Background: Missed nursing care is a global ch... | 2-s2.0-85126176254 | Explaining the consequences of missed nursing ... |
| 3 | 4 | Understanding the factors affecting attrition ... | Tekle M.G., Wolde H.M., Medhin G., Teklu A.M.,... | 2022 | 0.0 | Attrition; Ethiopia; HEWs; Intention to leave | Background: The Health Extension Program (HEP)... | 2-s2.0-85125002964 | Understanding the factors affecting attrition ... |
| 4 | 5 | Precursors and outcomes of work engagement amo... | Slåtten T., Lien G., Mutonyi B.R. | 2022 | 2.0 | Collaborative climate; Hospitals; Job satisfac... | Background: Health services organizations must... | 2-s2.0-85122186441 | Precursors and outcomes of work engagement amo... |
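
The `fillna('')` calls in the merge step matter because pandas propagates missing values through string concatenation: a single empty keyword field would otherwise turn the whole merged text into a missing value. A minimal sketch on a made-up two-row frame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Nurse turnover", "Student dropout"],
    "Abstract": ["Background: ...", "Background: ..."],
    "Author Keywords": ["Turnover; Nursing", None],  # second paper has no keywords
})

# Without fillna, the missing keyword makes the whole merged string missing
merged_raw = toy["Title"] + " " + toy["Abstract"] + " " + toy["Author Keywords"]

# With fillna(''), every row keeps its title and abstract
filled = toy.fillna("")
merged_filled = filled["Title"] + " " + filled["Abstract"] + " " + filled["Author Keywords"]

print(merged_raw.isna().tolist())
print(merged_filled.tolist())
```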


2. Lemmatize and clean the text by removing stopwords and the terms in the scientific-literature and domain blacklists.

In [None]:
# Import blacklists
# The literature blacklist contains common terms from the scientific literature
literature_blacklist = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/input/dictionaries/literature_blacklist.csv')
# The domain blacklist contains the terms used in the keywords, plus the most frequent and the rarest terms from the abstracts in the dataset
domain_blacklist = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/input/dictionaries/domain_blacklist.csv')

In [None]:
# Merge blacklist
blacklist = pd.concat([literature_blacklist, domain_blacklist])
len(blacklist)

5239

In [None]:
# Remove duplicates (if any)
blacklist.drop_duplicates(subset='value', inplace=True)
len(blacklist)

5237
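
The merge-and-deduplicate step can be mirrored in plain Python, which also shows a cheap way to speed up the later token filtering: a `set` gives constant-time membership checks, whereas a list is scanned term by term. The terms below are hypothetical placeholders, not entries from the actual blacklist files.

```python
literature_terms = ["study", "result", "method"]    # hypothetical entries
domain_terms = ["attrition", "turnover", "study"]   # "study" appears in both lists

merged = literature_terms + domain_terms
# dict.fromkeys keeps the first occurrence of each term, like drop_duplicates
deduplicated = list(dict.fromkeys(merged))
# A set gives constant-time membership checks during token filtering
blacklist_lookup = set(deduplicated)

print(len(merged), len(deduplicated))
```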

In [None]:
# Convert the blacklist column to a Python list with .tolist()
blacklist_list = blacklist["value"].tolist()

In [None]:
# Configure cleaning operations
config = {
    'remove_punct' : True,
    'remove_num' : True,
    'remove_stopwords' : True,
    'lemmatize' : True,
    'remove_blacklist' : blacklist_list
}

In [None]:
# Define preprocessing function
import spacy

nlp = spacy.load('en_core_web_sm')  # load the small English pipeline
blacklist_set = set(config['remove_blacklist'])  # set lookups are much faster than list lookups

def preprocess_txt(text):
    doc = nlp(text.lower())  # convert to lower case, then apply the language model
    tokens = list(doc)
    if config['remove_punct']:
        tokens = [token for token in tokens if not token.is_punct]
    if config['remove_num']:
        tokens = [token for token in tokens if not token.is_digit]
    if config['remove_stopwords']:
        tokens = [token for token in tokens if not token.is_stop]
    # Drop blacklisted surface forms before lemmatizing
    tokens = [token for token in tokens if token.text not in blacklist_set]
    # From here on we work with strings (lemmas, or raw text if lemmatization is off)
    words = [token.lemma_ if config['lemmatize'] else token.text for token in tokens]
    # Drop blacklisted lemmas as well
    words = [word for word in words if word not in blacklist_set]
    return ' '.join(words).strip()

In [None]:
# Apply preprocessing function to text
attrition_paper['text_preprocessed'] = attrition_paper['text'].apply(preprocess_txt)

In [None]:
attrition_paper.head()

| Unnamed: 0.1 | Unnamed: 0 | Title | Authors | Year | Cited by | Author Keywords | Abstract | EID | text | text_preprocessed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The association between shift work disorder an... | Blytt K.M., Bjorvatn B., Moen B.E., Pallesen S... | 2022 | 0.0 | Nursing; Shift work disorder; Sleep; Turnover ... | Background: Shift work disorder (SWD) is highl... | 2-s2.0-85131318824 | The association between shift work disorder an... | association shift work disorder nurse shift wo... |
| 1 | 2 | How is leadership experienced in joy-of-life-n... | André B., Jacobsen F.F., Haugan G. | 2022 | 0.0 | Joy of life in nursing homes; Leadership; Nurs... | Background: Nursing homes are under strong pre... | 2-s2.0-85127415691 | How is leadership experienced in joy-of-life-n... | leadership experience joy life nursing home co... |
| 2 | 3 | Explaining the consequences of missed nursing ... | Janatolmakan M., Khatony A. | 2022 | 0.0 | Missed nursing care; Nurse; Outcome; Qualitati... | Background: Missed nursing care is a global ch... | 2-s2.0-85126176254 | Explaining the consequences of missed nursing ... | consequence nursing care perspective nurse qua... |
| 3 | 4 | Understanding the factors affecting attrition ... | Tekle M.G., Wolde H.M., Medhin G., Teklu A.M.,... | 2022 | 0.0 | Attrition; Ethiopia; HEWs; Intention to leave | Background: The Health Extension Program (HEP)... | 2-s2.0-85125002964 | Understanding the factors affecting attrition ... | understand affect leave extension worker mixed... |
| 4 | 5 | Precursors and outcomes of work engagement amo... | Slåtten T., Lien G., Mutonyi B.R. | 2022 | 2.0 | Collaborative climate; Hospitals; Job satisfac... | Background: Health services organizations must... | 2-s2.0-85122186441 | Precursors and outcomes of work engagement amo... | precursor work engagement nursing professional... |
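
The preprocessing function filters against the blacklist twice, once on the surface form and once on the lemma. The toy sketch below shows why the second pass matters: a plural such as "factors" only matches a blacklisted "factor" after lemmatization. The lemma table, stopwords, and blacklist entries are hypothetical stand-ins for spaCy and for the real blacklist files.

```python
# Hypothetical stand-ins for spaCy's lemmatizer, its stopword list, and the blacklists
TOY_LEMMAS = {"factors": "factor", "affecting": "affect", "nurses": "nurse"}
TOY_STOPWORDS = {"the", "in"}
TOY_BLACKLIST = {"factor", "study"}

def toy_preprocess(text):
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]
    tokens = [t for t in tokens if t not in TOY_BLACKLIST]          # surface forms
    lemmas = [TOY_LEMMAS.get(t, t) for t in tokens]
    lemmas = [lemma for lemma in lemmas if lemma not in TOY_BLACKLIST]  # lemmas
    return " ".join(lemmas)

print(toy_preprocess("The factors affecting nurses in the study"))
# prints: affect nurse
```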


In [None]:
# Save results
attrition_paper.to_csv(r'/content/gdrive/MyDrive/Colab Notebooks/wip/attrition_paper_clean_preprocessed.csv', index = False, header=True)

# **Application of the BERTopic model**

Let's apply BERTopic using the techniques for improving topic representation (specifically, keeping stopwords out of the cluster names). We will customize UMAP only to set the random state, which ensures reproducibility.

Then we will visualize the topic hierarchy, to understand the structure of the clustering, and the UMAP projection, to assess the clustering. The latter shows how the embeddings are distributed across clusters in a two-dimensional space: each paper is a point, colored by the cluster it belongs to, and hovering over a point shows the title of the paper.

We will apply this pipeline both to raw and clean text.
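
The model configured in this section sets `diversity=0.2`, an option that the BERTopic documentation describes as using Maximal Marginal Relevance (MMR) to diversify the words that represent each topic. The pure-Python sketch below shows the idea on made-up 2-D vectors; the function names and vectors are assumptions for illustration, not BERTopic's implementation. In more recent BERTopic releases this option appears to have moved to a separate representation model, so the constructor call may need adapting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mmr(topic_vec, word_vecs, top_n, diversity):
    """Pick `top_n` words, balancing relevance to the topic against redundancy."""
    relevance = {w: cosine(vec, topic_vec) for w, vec in word_vecs.items()}
    selected = [max(relevance, key=relevance.get)]  # start from the most relevant word
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest already-selected word
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

# Made-up embeddings: "nurse" and "nurses" point in almost the same direction
toy_vectors = {
    "nurse": (1.0, 0.1),
    "nurses": (1.0, 0.15),
    "burnout": (0.6, 0.8),
}
topic = (1.0, 0.3)
low_diversity = mmr(topic, toy_vectors, top_n=2, diversity=0.0)
high_diversity = mmr(topic, toy_vectors, top_n=2, diversity=0.8)
print(low_diversity, high_diversity)
```

With no diversity penalty the two near-synonyms are both selected; with a strong penalty the second slot goes to the less redundant word.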

In [None]:
# Connect Google Drive (GDrive) with Colab
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
# Import pre-processed data
import pandas as pd
attrition_paper = pd.read_csv(r'/content/gdrive/MyDrive/Colab Notebooks/wip/attrition_paper_clean_preprocessed.csv')

In [None]:
# Set models
from scipy.cluster import hierarchy as sch
from bertopic import BERTopic
from umap import UMAP

# Set UMAP model
umap_model_new = UMAP(random_state=567)

# Set BERTopic model
topic_model_new = BERTopic(language="english", calculate_probabilities=True, verbose=True, diversity=0.2, n_gram_range =(1,2), umap_model=umap_model_new)

**Application to raw text**

In [None]:
# Apply model to raw text
topics_new_raw, probs_new_raw = topic_model_new.fit_transform(attrition_paper.text)
len(topic_model_new.get_topic_info())

In [None]:
freq_new_raw = topic_model_new.get_topic_info() 
freq_new_raw

In [None]:
df_new_raw = pd.DataFrame({'Topic': topics_new_raw, 'scopus_id': attrition_paper.EID, 'year':attrition_paper.Year})
df_new_raw.head()
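
One possible use of this per-paper table is counting papers per topic and year. A minimal sketch with `pd.crosstab` on made-up rows (the ids and topic assignments are placeholders, not results from the model):

```python
import pandas as pd

toy_assignments = pd.DataFrame({
    "Topic": [0, 0, 1, -1, 1],
    "scopus_id": ["a", "b", "c", "d", "e"],  # placeholder ids
    "year": [2021, 2022, 2022, 2022, 2021],
})

# Rows are publication years, columns are topic ids (-1 is the outlier topic)
papers_per_topic_year = pd.crosstab(toy_assignments["year"], toy_assignments["Topic"])
print(papers_per_topic_year)
```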

In [None]:
# Save results
freq_new_raw.to_csv(r'/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_raw_topic_freq2.csv', index=False, header=True)
df_new_raw.to_csv(r'/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_raw_paper2.csv', index=False, header=True)

In [None]:
# Hierarchical topics
hierarchical_topics_new_raw = topic_model_new.hierarchical_topics(attrition_paper.text)

In [None]:
# Visualize hierarchical topics in a tree
tree_new_raw = topic_model_new.get_topic_tree(hierarchical_topics_new_raw)
print(tree_new_raw)

# (copy and paste the printed tree into a .txt file to save the result)
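
As an alternative to copying the printed tree by hand, the string can be written to a file directly. The sketch below uses only the standard library; `save_text`, the example tree, and the temporary output path are illustrative assumptions. In the notebook, the tree string and any path under the mounted Drive folder could be passed instead.

```python
from pathlib import Path
import tempfile

def save_text(text, path):
    """Write `text` to `path` as UTF-8, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path

# Stand-in for the tree string; the box-drawing characters are why UTF-8 is set explicitly
example_tree = ".\n├─topic a\n└─topic b\n"
out_file = save_text(example_tree, Path(tempfile.gettempdir()) / "bertopic_demo" / "tree_new_raw.txt")
print(out_file.read_text(encoding="utf-8"))
```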

In [None]:
# Results from UMAP model

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Set UMAP model [note: only random_state ensures reproducibility]
umap_model_new = UMAP(random_state=567)

# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_raw = sentence_model.encode(attrition_paper.text, show_progress_bar=False)

# Train BERTopic
topic_model_new2_raw = BERTopic(language="english", calculate_probabilities=True, verbose=True, diversity=0.2, n_gram_range =(1,2), umap_model=umap_model_new).fit(attrition_paper.text, embeddings_raw)

# Run the visualization with the original embeddings
topic_model_new2_raw.visualize_documents(attrition_paper.text, embeddings=embeddings_raw)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings_raw = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_raw)


2022-11-02 15:50:47,389 - BERTopic - Reduced dimensionality
2022-11-02 15:50:48,242 - BERTopic - Clustered reduced embeddings


In [None]:
# Set only numbers as labels (for a better visualization)
topic_labels = [str(topic) for topic in range(-1, 52)]
  
topic_model_new2_raw.set_topic_labels(topic_labels)
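
The hard-coded `range(-1, 52)` only fits a model with exactly 53 topics, counting the outlier topic -1. A more defensive sketch derives the labels from whatever topic ids the model produced. The helper name `numeric_labels` and the example ids are assumptions; in the notebook the ids could be read from the `Topic` column of `get_topic_info()`.

```python
def numeric_labels(topic_ids):
    """Return one string label per topic id, in ascending topic order."""
    return [str(topic_id) for topic_id in sorted(topic_ids)]

# Hypothetical topic ids in arbitrary order
labels = numeric_labels([2, -1, 0, 1])
print(labels)
# prints: ['-1', '0', '1', '2']
```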

In [None]:
# Visualize plot
fig_UMAP_new_raw = topic_model_new2_raw.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_raw, hide_annotations = False, custom_labels= True, width = 800, height = 500)
fig_UMAP_new_raw

In [None]:
# Save results in html to have the interactive version
import plotly.express as px
fig_UMAP_new_raw.write_html("/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_raw_UMAP2.html", default_width = 1200, default_height = 1200)

**Application to clean text**

In [None]:
# Apply model to clean text
topics_new_clean, probs_new_clean = topic_model_new.fit_transform(attrition_paper.text_preprocessed)
len(topic_model_new.get_topic_info())

In [None]:
freq_new_clean = topic_model_new.get_topic_info() 
freq_new_clean

In [None]:
df_new_clean = pd.DataFrame({'Topic': topics_new_clean, 'scopus_id': attrition_paper.EID, 'year':attrition_paper.Year})
df_new_clean.head()

In [None]:
# Save results
freq_new_clean.to_csv(r'/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_clean_topic_freq2.csv', index=False, header=True)
df_new_clean.to_csv(r'/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_clean_paper2.csv', index=False, header=True)

In [None]:
# Hierarchical topics
hierarchical_topics_new_clean = topic_model_new.hierarchical_topics(attrition_paper.text_preprocessed)

In [None]:
# Visualize hierarchical topics in a tree
tree_new_clean = topic_model_new.get_topic_tree(hierarchical_topics_new_clean)
print(tree_new_clean)

# (copy and paste the printed tree into a .txt file to save the result)

In [None]:
# Visualize results from UMAP model

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Set UMAP model [note: only random_state ensures reproducibility]
umap_model_new = UMAP(random_state=567)

# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_clean = sentence_model.encode(attrition_paper.text_preprocessed, show_progress_bar=False)

# Train BERTopic
topic_model_new2_clean = BERTopic(language="english", calculate_probabilities=True, verbose=True, diversity=0.2, n_gram_range =(1,2), umap_model=umap_model_new).fit(attrition_paper.text_preprocessed, embeddings_clean)

# Run the visualization with the original embeddings
topic_model_new2_clean.visualize_documents(attrition_paper.text_preprocessed, embeddings=embeddings_clean)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings_clean = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_clean)
fig_UMAP_new_clean = topic_model_new2_clean.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_clean)

2022-11-02 14:22:10,084 - BERTopic - Reduced dimensionality
2022-11-02 14:22:10,864 - BERTopic - Clustered reduced embeddings


In [None]:
# Set only numbers as labels (for a better visualization)
topic_labels_clean = [str(topic) for topic in range(-1, 59)]

topic_model_new2_clean.set_topic_labels(topic_labels_clean)

In [None]:
# Visualize plot
fig_UMAP_new_clean = topic_model_new2_clean.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_clean, hide_annotations = False, custom_labels= True, width = 800, height = 500)
fig_UMAP_new_clean

In [None]:
# Save results in html to have the interactive version
import plotly.express as px
fig_UMAP_new_clean.write_html("/content/gdrive/MyDrive/Colab Notebooks/output/567BERTopics_new_clean_UMAP2.html", default_width = 1200, default_height = 1200)