<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Exercise 1 - Topic Modeling

Let us now put what we have learnt in the past two lessons to use. We will perform anomaly detection and topic modeling on the procurement dataset (procurementdata.csv). We will use the tender_description column as well as the agency column for performing these analyses.


Let us start with topic modeling. We will first load the data.

In [None]:
import pandas as pd
import numpy as np

In [None]:
csv_file = 'datasets/procurementdata.csv'

# Load the procurement data from the CSV file.
print("Loading data...")
data_df = pd.read_csv(csv_file)

In [None]:
data_df.head()

We want only the two relevant columns. Let us proceed to do the preprocessing as we have learnt.

In [None]:
tender=data_df['tender_description']
agency=data_df['agency']
dataset=tender + ' ' + agency
dataset[0]

In [None]:
import nltk
import re
#only need to do once
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

In [None]:
from nltk.tokenize import word_tokenize
dataset.apply(word_tokenize)

In [None]:
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string

stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()


In [None]:
def preprocess(doc):
    """Lowercase, strip punctuation, drop stopwords and lemmatize one document."""
    punc_free = ''.join([ch for ch in doc.lower() if ch not in exclude])
    stop_free = ' '.join([i for i in punc_free.split() if i not in stop])
    normalized = ' '.join(lemma.lemmatize(word) for word in stop_free.split())
    #stemmed = ' '.join(stemmer.stem(word) for word in normalized.split())
    return normalized

dataset2=dataset.apply(preprocess)

gensim expects each document as a list of tokens, so we split each preprocessed string into individual words as below. We can then proceed as in the Practical.

In [None]:
dataset3=dataset2.values.tolist()
processed_docs = [doc.split() for doc in dataset3]
print(processed_docs[0])

In [None]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(processed_docs)


In [None]:
bows = [dictionary.doc2bow(processed_doc) for processed_doc in processed_docs]

In [None]:
print(bows[5])
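
Each entry of `bows` is a list of `(token_id, count)` pairs. As a minimal sketch of the idea, the plain-Python mimic below builds the same kind of representation for a toy corpus. The helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and ids are assigned here in first-seen order, which is not necessarily how gensim numbers its tokens.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens of `doc` found in `vocab`."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Hypothetical toy documents, already tokenized like `processed_docs`.
toy_docs = [
    ["supply", "laptop", "laptop", "ministry"],
    ["supply", "catering", "service"],
]
toy_vocab = build_vocab(toy_docs)
toy_bows = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bows[0])
```

The repeated token shows up once with a count of 2, which is why a bag-of-words entry can be shorter than the document it came from.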

In [None]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(bows))

Let us now train the LDA model for five topics.

In [None]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 5
chunksize = 250
passes = 20
iterations = 50
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
id2word = dictionary

np.random.seed(10)

ldamodel = LdaModel(
    corpus=bows,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [None]:
ldamodel.print_topics()

In [None]:
print('\nDocument index and its topic distribution, sorted by probability:')
for index, doc in enumerate(processed_docs):
    bow = dictionary.doc2bow(doc)
    # Get the topic distribution of this document from the LDA model.
    t = ldamodel.get_document_topics(bow)
    # Sort by probability in descending order so the top contributing topic comes first.
    sorted_t = sorted(t, key=lambda x: x[1], reverse=True)
    # Print the document index followed by its (topic id, probability) pairs.
    print(index, sorted_t)

We can calculate the perplexity and coherence as before.

In [None]:
perplexity = ldamodel.log_perplexity(bows)
print(perplexity)
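
Note that gensim's `log_perplexity` returns a per-word likelihood bound (typically a negative number) rather than the perplexity itself; gensim's own log message reports the perplexity estimate as 2 raised to the negative of this bound. A minimal sketch of that conversion follows. The function name `bound_to_perplexity` and the example bound are hypothetical.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate, 2 ** (-bound)."""
    return 2.0 ** (-per_word_bound)


example_bound = -3.0  # hypothetical value, not a result from the procurement data
print(bound_to_perplexity(example_bound))  # 8.0
```

With the trained model this could be called as `bound_to_perplexity(ldamodel.log_perplexity(bows))`. A higher (less negative) bound corresponds to a lower perplexity.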

In [None]:
from gensim.models.coherencemodel import CoherenceModel

lda_coherence = CoherenceModel(model=ldamodel,
                               texts=processed_docs,
                               dictionary=dictionary,
                               coherence='c_v')
coherence_score = lda_coherence.get_coherence()
print(coherence_score)

As before, we can try to determine the best number of topics.

In [None]:
from tqdm import tqdm

def compute_coherence_values(id2word, corpus, texts, limit, start=2, step=3):
    """Train one LDA model per topic count and record its coherence and log perplexity."""
    coherence_values = []
    perplexity_values = []
    topics_num = []
    model_list = []

    for num_topics in tqdm(range(start, limit, step)):
        np.random.seed(10)
        ldamodel = LdaModel(
            corpus=corpus,  # use the argument, not the global `bows`
            id2word=id2word,
            chunksize=250,
            alpha='auto',
            eta='auto',
            iterations=50,
            num_topics=num_topics,
            passes=20,
            eval_every=None  # skip perplexity evaluation during training
        )
        model_list.append(ldamodel)
        coherencemodel = CoherenceModel(model=ldamodel, texts=texts,
                                        dictionary=id2word, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
        perplexity_values.append(ldamodel.log_perplexity(corpus))
        topics_num.append(num_topics)

    return model_list, coherence_values, perplexity_values, topics_num

In [None]:
from tqdm import tqdm
# search through k-topics in steps
start=1; limit=7; step=1;
#start=5; limit=50; step=5;

model_list, coherence_values, perplexity_values, topics_num = compute_coherence_values(id2word,
                                                                           corpus=bows,
                                                                           texts=processed_docs,
                                                                           start=start, limit=limit, step=step)

In [None]:
# Show perplexity and coherence against the number of topics
import matplotlib.pyplot as plt
x = range(start, limit, step)

fig, ax1 = plt.subplots()

color = 'tab:red'
ax1.set_xlabel('Num Topics')
ax1.set_ylabel('Log perplexity (per-word bound)', color=color)
ax1.plot(x, perplexity_values, color=color)
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx() #instantiate a second axes that share the same x-axis

color = 'tab:blue'
ax2.set_ylabel('Coherence Score', color=color)
ax2.plot(x, coherence_values, color=color)
ax2.tick_params(axis='y', labelcolor=color)

fig.tight_layout() # otherwise the right y-label is slightly clipped
plt.show()
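
One simple way to read the coherence curve is to take the topic count with the highest coherence. The sketch below does that under this assumption; other criteria, such as how interpretable the topics are, could reasonably be preferred. The function name `best_num_topics` and the toy values are hypothetical and are not results from the procurement data.

```python
def best_num_topics(topics_num, coherence_values):
    """Return the (topic count, coherence) pair with the highest coherence.

    Ties are resolved in favour of the earliest entry in the list.
    """
    best = max(range(len(topics_num)), key=lambda i: coherence_values[i])
    return topics_num[best], coherence_values[best]


# Hypothetical values shaped like the outputs of compute_coherence_values.
toy_topics = [1, 2, 3, 4, 5, 6]
toy_coherence = [0.31, 0.38, 0.44, 0.41, 0.44, 0.36]
print(best_num_topics(toy_topics, toy_coherence))
```

On the real run the call would be `best_num_topics(topics_num, coherence_values)`, and the matching model is the entry of `model_list` at the same position.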

Finally, we can visualize the topics using pyLDAvis.

In [None]:
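try:
    import pyLDAvis.gensim_models as gensimvis  # pyLDAvis 3.3 and later
except ImportError:
    import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases
import pyLDAvis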

pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(ldamodel, bows, dictionary)
pyLDAvis.display(vis_data)

# Exercise 2 - Anomaly Detection

Let us now proceed with anomaly detection using isolation forest. Depending on time available, we can also randomly take 10000 samples from the dataset.

In [None]:
dataset3=dataset2

In [None]:
#dataset3=dataset2.sample(10000, random_state=10)

We will use the CountVectorizer introduced in this lesson for preprocessing.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
doc_vec = vectorizer.fit_transform(dataset3)
print(vectorizer.get_feature_names_out())
print(doc_vec)

In [None]:
df_bow = pd.DataFrame(doc_vec.toarray(),columns=vectorizer.get_feature_names_out())
df_bow.head()

Let us now use Isolation Forest to detect anomalies.

In [None]:
from sklearn.ensemble import IsolationForest
forest = IsolationForest(random_state=0)
forest.fit(df_bow)

In [None]:
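# Score a random subset of 5000 rows to keep the run time manageable.
# A fixed random_state makes the subset reproducible.
df_bow = df_bow.sample(5000, random_state=10)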

In [None]:

scores = forest.score_samples(df_bow)
#print(scores)
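
To help read these scores: scikit-learn documents `score_samples` as the opposite of the anomaly score defined in the original Isolation Forest paper, so lower values mean more anomalous. The sketch below reproduces the paper's formula in plain Python to show how a short average path length maps to a low score. The function names are hypothetical and this is an illustration of the formula, not scikit-learn's actual code.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """Average path length c(n) of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def paper_anomaly_score(mean_path_length, n):
    """Anomaly score from the paper: 2 ** (-E[h(x)] / c(n)); values near 1 are anomalous."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # per-tree sub-sample size; max_samples='auto' uses min(256, n_samples)
for depth in (2.0, average_path_length(n), 20.0):
    # Negated to mirror the sign convention of score_samples (lower = more anomalous).
    print(round(depth, 2), round(-paper_anomaly_score(depth, n), 3))
```

A point isolated after only a couple of splits gets a score close to -1, while a point whose path length equals the average `c(n)` sits at -0.5.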

In [None]:
import matplotlib.pyplot as plt
plt.hist(scores, bins=50)
plt.ylabel('Number', fontsize=15)
plt.xlabel('Anomaly score', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)


In [None]:
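top_n_outliers = 5
# argpartition returns the positions (within df_bow) of the n lowest scores,
# i.e. the most anomalous rows, in no particular order.
top_n_outlier_indices = np.argpartition(scores, top_n_outliers)[:top_n_outliers].tolist()
top_outlier_features = df_bow.iloc[top_n_outlier_indices, :]
top_outlier_features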
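
`np.argpartition` gives the positions of the n lowest scores but does not sort them, and after sampling those positions are no longer the same as the original row labels. The plain-Python sketch below shows one way to return the lowest-scoring rows in order, together with their labels. The function name `lowest_scoring` and the toy labels and scores are hypothetical.

```python
import heapq


def lowest_scoring(labels, scores, n):
    """Return the n (label, score) pairs with the lowest scores, most anomalous first."""
    positions = heapq.nsmallest(n, range(len(scores)), key=lambda i: scores[i])
    return [(labels[i], scores[i]) for i in positions]


# Hypothetical row labels left after sampling, with their anomaly scores.
toy_labels = [10, 42, 7, 99, 3]
toy_scores = [-0.41, -0.62, -0.39, -0.71, -0.45]
print(lowest_scoring(toy_labels, toy_scores, 3))
```

With NumPy and pandas the same idea would be `np.argsort(scores)[:n]` for the ordered positions and `df_bow.index[positions]` for the labels.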

The top anomalies found are shown below.

In [None]:
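# The row labels of df_bow are positions in dataset3, so look up the text with iloc.
# Reading them from top_outlier_features avoids hard-coding indices from one run.
for position in top_outlier_features.index:
    print(dataset3.iloc[position])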

We can also try to use SHAP to gain some understanding about how the model generates the output. Bear in mind that for this dataset, due to the very large number of features, it is still not easy to interpret the model. But we can do this as an exercise.

In [None]:
#!pip install shap
import shap
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(df_bow)
features = df_bow

Let's see if we can gain any insight about the factors affecting the most significant outlier.

In [None]:
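from IPython.display import display, HTML
shap.initjs()

# Position (within df_bow) of the row with the lowest score, i.e. the most anomalous one.
top_pos = int(np.argmin(scores))
dis=shap.force_plot(explainer.expected_value, shap_values[top_pos, :], features.iloc[top_pos, :],matplotlib=False)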
shap_html = f"{shap.getjs()}{dis.html()}"

with open("shap_full.html", "w", encoding='utf8') as file:
    file.write(shap_html)

From the SHAP summary plot below, we see that the most important word is 'association', followed by 'year', and so on. We also see that larger counts of these words tend to reduce the score, making the document more likely to be flagged as an anomaly.

Discussion: Although there is some trend in the data, do note the nature of this dataset, as well as the preprocessing that we have done.

In [None]:
shap.summary_plot(shap_values, features)
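
The summary plot orders features by the overall magnitude of their SHAP values. As a rough sketch of that ordering, the plain-Python example below ranks features by mean absolute SHAP value on a small matrix. The function name `rank_features` is hypothetical and the numbers are made up for illustration; they are not SHAP values from this model.

```python
def rank_features(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    n_rows = len(shap_matrix)
    importance = {
        name: sum(abs(row[j]) for row in shap_matrix) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values: one row per document, one column per word.
toy_names = ["association", "year", "supply"]
toy_shap = [
    [-0.30, 0.10, 0.02],
    [-0.20, -0.15, 0.01],
    [0.25, 0.05, -0.03],
]
ranking = rank_features(toy_shap, toy_names)
print([name for name, _ in ranking])
```

The sign of each value is ignored for the ranking; the direction of the effect is what the colouring in the summary plot conveys.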

Let's take a look at the SHAP dependence plot for 'association'. As expected, the more often the word 'association' occurs, the more it lowers the anomaly score.

In [None]:
shap.dependence_plot(
 'association',
 shap_values,
 features,
 interaction_index=None,
 xmax='percentile(99)' #upper bound of plots x-axis
)

We can also look at the interaction between two features, for example 'association' and 'year'. It looks like 'year' does not have much interaction with 'association' that could affect the anomaly score.

In [None]:
shap.dependence_plot(
 'association',
 shap_values,
 features,
 interaction_index='year',
 xmax='percentile(99)'
)