# Introduction
Converting raw text into structured data for mindmap generation involves several steps. The text must first be identified and classified by subject, because the subject determines which group of tags and attributes the entities and relations in the text belong to, and so keeps tagging errors to a minimum. For example, in the context of Computer Science, the statement _"the root is present at the top of the tree"_ is true, while in Biology the same is not necessarily true. The classification of text based on subject is therefore the first step in our procedure.

# Text Preprocessing
The raw text collected from various sources is split into sentences, each of which is further split into a list of words. We can achieve this by using regular expressions that handle the common delimiters separating words. Note, however, that splitting on spaces alone is not enough. We also have to include quotation marks, question marks, colons, semicolons, tabs, newlines, etc.

In [None]:
import re

raw_text = """According to Peter Denning, the fundamental question underlying
computer science is, "What can be (efficiently) automated?" Theory of computation
is focused on answering fundamental questions about what can be computed and what amount
of resources are required to perform those computations. In an effort to answer the first
question, computability theory examines which computational problems are solvable on
various theoretical models of computation. The second question is addressed by computational
complexity theory, which studies the time and space costs associated with different approaches
to solving a multitude of computational problems.""" # source: https://en.wikipedia.org/wiki/Computer_science#Theory_of_computation

words = re.compile(r'[\s,:"\'\(\).;?!]+').split(raw_text)

print(words)

['According', 'to', 'Peter', 'Denning', 'the', 'fundamental', 'question', 'underlying', 'computer', 'science', 'is', 'What', 'can', 'be', 'efficiently', 'automated', 'Theory', 'of', 'computation', 'is', 'focused', 'on', 'answering', 'fundamental', 'questions', 'about', 'what', 'can', 'be', 'computed', 'and', 'what', 'amount', 'of', 'resources', 'are', 'required', 'to', 'perform', 'those', 'computations', 'In', 'an', 'effort', 'to', 'answer', 'the', 'first', 'question', 'computability', 'theory', 'examines', 'which', 'computational', 'problems', 'are', 'solvable', 'on', 'various', 'theoretical', 'models', 'of', 'computation', 'The', 'second', 'question', 'is', 'addressed', 'by', 'computational', 'complexity', 'theory', 'which', 'studies', 'the', 'time', 'and', 'space', 'costs', 'associated', 'with', 'different', 'approaches', 'to', 'solving', 'a', 'multitude', 'of', 'computational', 'problems', '']
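The cell above splits the whole text into words in one pass. As a minimal sketch of the two-stage split described earlier (sentences first, then words), the following uses only the standard library. The helper names `split_sentences` and `split_words` and the sample text are illustrative assumptions, and the sentence rule is deliberately naive: it would, for instance, split after abbreviations such as "e.g.".

```python
import re

# Delimiter pattern taken from the cell above.
WORD_DELIMITERS = re.compile(r'[\s,:"\'\(\).;?!]+')

def split_sentences(text):
    """Naively split text after '.', '?' or '!' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]

def split_words(sentence):
    """Split one sentence on the delimiters and drop empty strings."""
    return [w for w in WORD_DELIMITERS.split(sentence) if w]

# Hypothetical sample text, used only for illustration.
sample = "Trees have roots. Do graphs have cycles? Some do!"
sentences = split_sentences(sample)
words_per_sentence = [split_words(s) for s in sentences]
print(sentences)
print(words_per_sentence)
```

Filtering out empty strings also removes the trailing `''` that appears at the end of the word list printed above.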


Since entering all these delimiters manually is tedious and, quite frankly, unnecessary, we can use a library that takes care of them for us. Here, we will use the nltk library to "tokenize" the input text into a list of words, keeping only alphanumeric tokens and converting them to lower case.

In [None]:
from nltk.tokenize import word_tokenize

words = word_tokenize(raw_text)
words = [ w.lower() for w in words if w.isalnum() ]

print(words)

['according', 'to', 'peter', 'denning', 'the', 'fundamental', 'question', 'underlying', 'computer', 'science', 'is', 'what', 'can', 'be', 'efficiently', 'automated', 'theory', 'of', 'computation', 'is', 'focused', 'on', 'answering', 'fundamental', 'questions', 'about', 'what', 'can', 'be', 'computed', 'and', 'what', 'amount', 'of', 'resources', 'are', 'required', 'to', 'perform', 'those', 'computations', 'in', 'an', 'effort', 'to', 'answer', 'the', 'first', 'question', 'computability', 'theory', 'examines', 'which', 'computational', 'problems', 'are', 'solvable', 'on', 'various', 'theoretical', 'models', 'of', 'computation', 'the', 'second', 'question', 'is', 'addressed', 'by', 'computational', 'complexity', 'theory', 'which', 'studies', 'the', 'time', 'and', 'space', 'costs', 'associated', 'with', 'different', 'approaches', 'to', 'solving', 'a', 'multitude', 'of', 'computational', 'problems']


Now we can filter the stopwords from the list using the English stopword corpus from nltk. We need to download this corpus, along with the `punkt` tokenizer data used by `word_tokenize`, if we have not already done so. To do this, we just execute the following lines.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SAGNIK\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SAGNIK\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now we check each word in the list created in the previous step against the list of English stopwords and, if it is present there, remove it from our list. The words were already converted to lower case during tokenization, which keeps them uniform and lets them match the lower-case entries of the stopword list.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

filtered_words = [ w for w in words if w not in stop_words and w != '' ]  # compare strings with !=, not identity
print("Filtered words: ", filtered_words)

rejected_words = [ w for w in words if w not in filtered_words ]
print("Rejected words: ", rejected_words)

Filtered words:  ['according', 'peter', 'denning', 'fundamental', 'question', 'underlying', 'computer', 'science', 'efficiently', 'automated', 'theory', 'computation', 'focused', 'answering', 'fundamental', 'questions', 'computed', 'amount', 'resources', 'required', 'perform', 'computations', 'effort', 'answer', 'first', 'question', 'computability', 'theory', 'examines', 'computational', 'problems', 'solvable', 'various', 'theoretical', 'models', 'computation', 'second', 'question', 'addressed', 'computational', 'complexity', 'theory', 'studies', 'time', 'space', 'costs', 'associated', 'different', 'approaches', 'solving', 'multitude', 'computational', 'problems']
Rejected words:  ['to', 'the', 'is', 'what', 'can', 'be', 'of', 'is', 'on', 'about', 'what', 'can', 'be', 'and', 'what', 'of', 'are', 'to', 'those', 'in', 'an', 'to', 'the', 'which', 'are', 'on', 'of', 'the', 'is', 'by', 'which', 'the', 'and', 'with', 'to', 'a', 'of']


# Subject Classification
Let us first fix the subjects against which we will test our text. Here, we choose a set of 4 subjects from which to pick. Depending on the specific needs, additional steps may be required to handle text from subjects outside this set. One way to deal with that is to set a confidence threshold below which the program reports 'unknown' or 'other' instead of the best-scoring subject.

In [None]:
subjects = [ 'computer', 'physics', 'chemistry', 'biology' ]
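The confidence threshold mentioned above could look roughly like this. It is a sketch only: the function name `pick_subject`, the threshold value and both score lists are hypothetical and not tuned for any model.

```python
def pick_subject(scores, labels, threshold):
    """Return the best-scoring label, or 'unknown' if its score is below threshold."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < threshold:
        return 'unknown'
    return labels[best]

subjects = ['computer', 'physics', 'chemistry', 'biology']

# Hypothetical scores: one confident case and one where no subject stands out.
confident = pick_subject([0.62, 0.31, 0.28, 0.35], subjects, threshold=0.5)
uncertain = pick_subject([0.12, 0.10, 0.11, 0.13], subjects, threshold=0.5)
print(confident, uncertain)
```

A suitable threshold depends on the model and the scoring method, since different embeddings produce similarities on quite different scales.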

Now we need a model for comparing our words with the subjects. We could train our own model using a Machine Learning library such as sklearn or tf, but here we will use a pre-trained word2vec model trained on Google News, loaded through the Gensim API. We then compute the score for each subject as the median similarity between the subject and all the words in our list.

In [None]:
import numpy as np
import gensim.downloader as api

model = api.load('word2vec-google-news-300')

In [None]:
filtered_available_words = []
filtered_unavailable_words = []

for word in filtered_words:
    if word not in model:  # membership test on the KeyedVectors itself; the `vocab` attribute was removed in Gensim 4
        filtered_unavailable_words.append(word)
    else:
        filtered_available_words.append(word)

similarities = []

for i, subject in enumerate(subjects):
    similarities.append([])
    for word in filtered_available_words:
        value = model.similarity(subject, word)
        similarities[i].append(value)

# We adopt a median based measurement of score
similarities = np.array(similarities)
scores = np.median(similarities, axis=1)
print("Unknown words found: %d"%(len(filtered_unavailable_words)))
print("Scores: ", scores)

Unknown words found: 0
Scores:  [0.12262617 0.10500228 0.11704105 0.13229749]


As we can see, the score for the subject "computer" is not the highest: "biology" scores slightly higher, although "computer" is a close second. Even though the subject of the text is clearly "computer", the prediction goes awry and the scores for all the subjects are very close. We do not want that. We want a clear distinction between the different subjects. There might be a few different reasons for this behaviour -
1. The training dataset might not be ideal, or might be missing crucial words that can change the output significantly. In that case, we need to switch to a more suitable dataset.
2. The subjects belong to a similar group of words (here, for example, they are all different disciplines of "science") and therefore bear overlapping similarities with the words in our list. This calls for a technique that can separate closely related subjects.
3. Our scoring procedure might be flawed. We need to find a better scoring method.
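
To put a number on how close the scores are, one option is to look at the margin between the best subject and the runner-up. The sketch below copies the scores from the output above; the margin itself is only an illustrative measure, not part of the original procedure.

```python
subjects = ['computer', 'physics', 'chemistry', 'biology']
scores = [0.12262617, 0.10500228, 0.11704105, 0.13229749]  # copied from the output above

# Rank subjects from highest to lowest score.
ranked = sorted(zip(subjects, scores), key=lambda pair: pair[1], reverse=True)
(best, best_score), (runner_up, runner_score) = ranked[0], ranked[1]

margin = best_score - runner_score      # absolute gap between the top two
relative_margin = margin / best_score   # gap as a fraction of the top score
print(best, runner_up, margin, relative_margin)
```

A small relative margin signals that the ranking could easily flip with a slightly different text or model.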

Let us tackle the problems one by one.

## Problem 1: Training Dataset
We already found that, for our specific example, there are no unknown words in our list. We can therefore safely assume that the results are not affected by critical words missing from the dataset. However, this might not be true in all scenarios. Let us take this a step further. Suppose some word in the list of unknown words was highly significant to one of the subjects. If we are studying a large enough corpus of text and the subjects are dissimilar enough, the absence of a single word from the dataset will not drastically affect the topic extraction process as a whole. We therefore have to set proper guidelines for when the absence of critical words is impactful enough that we need to change our model. One way to achieve this is to set a threshold on the maximum contribution of these words to the text, measured below as the percentage of characters that belong to unknown words. Note that this measure is biased towards long words, regardless of whether they bear any semantic relation to the text, and is by no means meant for actual use; it is presented here for its simplicity and ease of implementation.

In [None]:
threshold = 5.

unknowns_contrib = 0

for word in filtered_words:
    if not word in model:
        unknowns_contrib += len(word)

total_size = len(''.join(filtered_words))

percentage = 100 * unknowns_contrib / total_size # Calculates between 0 and 100

print('Percentage: %f%%'%(percentage))

if percentage > threshold:
    print('The model is not suitable for this data')
else:
    print('The model is suitable for this data')

Percentage: 0.000000%
The model is suitable for this data
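
Since the character-based measure favours long words, a simple variation is to count unknown tokens instead of unknown characters. The sketch below assumes a plain Python set as a stand-in for the model vocabulary; the function name and the toy data are hypothetical.

```python
def unknown_token_percentage(words, vocabulary):
    """Percentage (0 to 100) of tokens in `words` that are missing from `vocabulary`."""
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in vocabulary)
    return 100 * unknown / len(words)

# Toy stand-ins for the model vocabulary and the filtered word list.
toy_vocabulary = {'theory', 'computation', 'complexity'}
toy_words = ['theory', 'computation', 'qubitization', 'complexity']

token_percentage = unknown_token_percentage(toy_words, toy_vocabulary)
print('Percentage: %f%%' % token_percentage)
```

Because the function relies only on the `in` test, the loaded word-vector model could presumably be passed in place of the set, as the cell above already does with `word in model`.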


Another problem related to the training dataset might be the word vectors themselves. We can get a comparison by running the same procedure with vectors trained on an entirely different source. In this example, we will consider GloVe vectors trained on Wikipedia and Gigaword text and compare the results with those from the Google News model above.

In [None]:
import numpy as np
import gensim.downloader as api

model = api.load('glove-wiki-gigaword-50')

filtered_available_words = []
filtered_unavailable_words = []

for word in filtered_words:
    if word not in model:  # membership test on the KeyedVectors itself; the `vocab` attribute was removed in Gensim 4
        filtered_unavailable_words.append(word)
    else:
        filtered_available_words.append(word)

similarities = []

for i, subject in enumerate(subjects):
    similarities.append([])
    for word in filtered_available_words:
        value = model.similarity(subject, word)
        similarities[i].append(value)

# We adopt a median based measurement of score
similarities = np.array(similarities)
scores = np.median(similarities, axis=1)
print("Unknown words found: %d"%(len(filtered_unavailable_words)))
print("Scores: ", scores)

Unknown words found: 0
Scores:  [0.48503646 0.3798096  0.36303478 0.39775032]


We observe that the results change significantly upon selecting a different dataset. This is somewhat expected, as the raw text was taken from the Wikipedia page on Computer Science. We also note that "computer" now has the highest score, ahead of "biology", which is the correct result. We will still follow through the next problems and see if (and how) we can improve on this result.

## Problem 2: Subject similarities
Let us now focus on the second part of our problem. If the subjects in our list are similar to each other and belong to the same broad category, there will be some overlap between them. This, as we shall see further into the paper, can be solved by focusing on select entities within the text rather than on the text as a whole.

## Problem 3: Unreliable Scoring Method
The scoring method we selected may not be accurate. Aggregating the similarities of each token to the subject (by taking their median, as above) is somewhat counter-intuitive, because what we want is to compare the document as a whole against the subject, not its individual tokens. It may also take longer to compute, as every token in the document is compared against every class. Instead, we can simply calculate the average (or median) vector of all the word vectors in the document and compare that single vector to the vector of every class.

In [None]:
import numpy as np  # used by avg() and med() below
from numpy import dot
from numpy.linalg import norm

def avg(model, words):
    if len(words) > 0:
        return np.mean(model[words], axis=0)
    else:
        return []

def med(model, words):
    if len(words) > 0:
        return np.median(model[words], axis=0)
    else:
        return []

def sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

doc_avg = avg(model, filtered_available_words)
doc_med = med(model, filtered_available_words)

subject_vecs = [ model[subject] for subject in subjects ]

scores_avg = [ sim(svc, doc_avg) for svc in subject_vecs ]
scores_med = [ sim(svc, doc_med) for svc in subject_vecs ]

print("Scores (based on Average): ", scores_avg)
print("Scores (based on Median): ", scores_med)

Scores (based on Average):  [0.7043402, 0.65672183, 0.61259687, 0.64599967]
Scores (based on Median):  [0.68439806, 0.6045788, 0.57215124, 0.61216646]
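
The two scoring approaches are not interchangeable: the mean of per-token similarities and the similarity of the mean vector can differ noticeably. The toy 2-dimensional vectors below are made up purely to illustrate this, and the helpers are plain-Python stand-ins for the numpy-based `sim` and `avg` defined above.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy data: two orthogonal "token" vectors and a "subject" lying between them.
token_vectors = [[1.0, 0.0], [0.0, 1.0]]
subject_vector = [1.0, 1.0]

per_token = [cosine(subject_vector, v) for v in token_vectors]
mean_of_similarities = sum(per_token) / len(per_token)
similarity_of_mean = cosine(subject_vector, mean_vector(token_vectors))
print(mean_of_similarities, similarity_of_mean)
```

Here each token is only partly aligned with the subject, yet the document as a whole points in exactly the subject's direction, which is the situation the document-level comparison is meant to capture.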


# Validation
Compared with the per-token approach, we now compute one similarity per subject instead of one per token-subject pair: for n tokens and k subjects, that is k similarity computations after a single O(n) pass to aggregate the document vector, rather than n·k. The result also stays consistent, with "computer" still scoring highest. However, this is just one very specific text. We need to apply our algorithm to a real-world dataset to know its validity. We will use the `bbc news articles` dataset from Kaggle ([download](https://www.kaggle.com/yufengdev/bbc-text-categorization)) to test the validity of our algorithm. Let us load the dataset into a dataframe.

In [None]:
import pandas as pd

df = pd.read_csv('bbc-text.csv')

Now we will filter out the rows that do not contain category or content and find all unique categories from the dataset that will represent our classes. We will classify each post based on this list of categories.

In [None]:
df = df[df['category'].notna()] # Filter out posts with no category
df = df[df.text.str.len() > 1] # Filter out posts with no content
categories = df.category.unique()
print(categories)

['tech' 'business' 'sport' 'entertainment' 'politics']


Now we will create the content vectors and category vectors.

In [None]:
def filter_text(raw):
    words = word_tokenize(raw)
    words = [ w.lower() for w in words if w.isalnum() ]
    filtered_words = [ w for w in words if w not in stop_words and w != '' ]
    filtered_available_words = []
    for word in filtered_words:
        if word in model:  # membership test works across Gensim versions, unlike `model.vocab`
            filtered_available_words.append(word)
    return filtered_available_words

category_tags_list = [ filter_text(category) for category in categories ]
print(category_tags_list)
print(df.shape)

[['tech'], ['business'], ['sport'], ['entertainment'], ['politics']]
(2225, 2)


In [None]:
filtered_categories = []
filtered_category_tags_list = []
for i, category_tags in enumerate(category_tags_list):
    if len(category_tags) > 0:
        filtered_categories.append(categories[i])
        filtered_category_tags_list.append(category_tags)
    else:
        df = df[df.category != categories[i]]
print(filtered_categories)
print(filtered_category_tags_list)

['tech', 'business', 'sport', 'entertainment', 'politics']
[['tech'], ['business'], ['sport'], ['entertainment'], ['politics']]


In [None]:
category_vecs_avg = [ avg(model, category_tags) for category_tags in filtered_category_tags_list ]
category_vecs_med = [ med(model, category_tags) for category_tags in filtered_category_tags_list ]

In [None]:
preds_avg = []
preds_med = []
for i, row in df.iterrows():
    filtered_text = filter_text(row['text'])
    content_vec_avg = avg(model, filtered_text)
    content_vec_med = med(model, filtered_text)
    scores = []
    for category_vec in category_vecs_avg:
        scores.append(sim(category_vec, content_vec_avg))
    max_id = np.array(scores).argmax()
    # The category vectors were built from filtered_categories, so index into that list
    preds_avg.append({ 'pred': filtered_categories[max_id], 'actual': row['category'], 'res': filtered_categories[max_id] == row['category'] })
    scores = []
    for category_vec in category_vecs_med:
        scores.append(sim(category_vec, content_vec_med))
    max_id = np.array(scores).argmax()
    preds_med.append({ 'pred': filtered_categories[max_id], 'actual': row['category'], 'res': filtered_categories[max_id] == row['category'] })

accuracy_avg = len(list(filter(lambda x: x['res'] == True, preds_avg))) / len(preds_avg)
accuracy_med = len(list(filter(lambda x: x['res'] == True, preds_med))) / len(preds_med)
print("Accuracy (based on Average): ", accuracy_avg)
print("Accuracy (based on Median):", accuracy_med)

Accuracy (based on Average):  0.3056179775280899
Accuracy (based on Median): 0.26202247191011235


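As we can see, the accuracy does not meet expectations for this dataset: roughly 31% with the average vectors and 26% with the median vectors, which is not far above the 20% that uniform random guessing among five categories would give.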
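
To see where the predictions go wrong, the overall accuracy can be broken down per category. The sketch below assumes prediction records shaped like the entries of `preds_avg` and `preds_med` above (keys `pred`, `actual`, `res`); the helper name and the toy records are hypothetical and are not results from the dataset.

```python
from collections import Counter

def per_category_accuracy(preds):
    """Fraction of correct predictions for each actual category."""
    total = Counter(p['actual'] for p in preds)
    correct = Counter(p['actual'] for p in preds if p['res'])
    return {category: correct[category] / total[category] for category in total}

# Toy records for illustration only.
toy_preds = [
    {'pred': 'tech', 'actual': 'tech', 'res': True},
    {'pred': 'sport', 'actual': 'tech', 'res': False},
    {'pred': 'sport', 'actual': 'sport', 'res': True},
    {'pred': 'sport', 'actual': 'business', 'res': False},
]

breakdown = per_category_accuracy(toy_preds)
predicted_counts = Counter(p['pred'] for p in toy_preds)
print(breakdown)
print(predicted_counts)
```

Counting the predicted labels as well shows whether one category absorbs most of the predictions, which is a common failure pattern when the category vectors are very similar to each other.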

# Fine-tuning a Pre-trained Transformer Model to our use case
Next, we shall focus on fine-tuning an existing pre-trained transformer model, the well-known BERT model, on our news dataset and see how that improves our predictions compared to the Gensim word embedding method discussed above. It is important to note that BERT also creates word embeddings internally. To abstract away the details, we will use the awesome `simpletransformers` library.

Since BERT and, as we shall see, almost all other transformer models expect labels to be encoded as integers, we will identify our categories and convert them to integer codes in order. To train our model, we will split our data into training (\~80% of total) and testing (\~20% of total) data.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('bbc-text.csv')

df = df[['text', 'category']]

df.columns = ['text', 'label_text']

df.label_text = pd.Categorical(df.label_text)
df['labels'] = df.label_text.cat.codes

print('Using Labels as:')
print(dict(enumerate(df.label_text.cat.categories)))

del df['label_text']

train, test = train_test_split(df, test_size=0.2)

print('Train:')
print(train.head())
print('Test:')
print(test.head())

Using Labels as:
{0: 'business', 1: 'entertainment', 2: 'politics', 3: 'sport', 4: 'tech'}
Train:
                                                   text  label
148   moya emotional at davis cup win carlos moya de...      3
833   wenger offers mutu hope arsenal boss arsene we...      3
813   us budget deficit to reach $368bn the us budge...      0
1062  a question of trust and technology a major gov...      4
825   england  to launch ref protest  england will p...      3
Test:
                                                   text  label
1294  o driscoll/gregan lead aid stars ireland s bri...      3
682   europe blames us over weak dollar european lea...      0
1172  yukos bankruptcy  not us matter  russian autho...      0
2088  o driscoll saves irish blushes two moments of ...      3
1376  ec calls truce in deficit battle the european ...      0


Now we train the BERT model on our dataset and evaluate the results.

In [5]:
from simpletransformers.classification import ClassificationModel

# Create a TransformerModel
model = ClassificationModel('bert', 'bert-base-cased', num_labels=5, args={'reprocess_input_data': True, 'overwrite_output_dir': True}, use_cuda=False)

# Train the model
model.train_model(train)
    
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test)

print(result)

Running loss: 1.263591




{'mcc': 0.9775141925216074, 'eval_loss': 0.07347185660286673}
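
The `mcc` value in the result is, as far as we can tell, the Matthews correlation coefficient, which ranges from -1 to 1 and equals 1 only for perfect predictions. The following standard-library sketch shows the multi-class form of the coefficient; the function name and the example labels are illustrative assumptions, not the library's own code.

```python
from collections import Counter
from math import sqrt

def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient for two equal-length label lists."""
    samples = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)   # how often each label actually occurs
    pred_counts = Counter(y_pred)   # how often each label is predicted
    labels = set(true_counts) | set(pred_counts)
    numerator = correct * samples - sum(true_counts[k] * pred_counts[k] for k in labels)
    denominator = sqrt(
        (samples ** 2 - sum(v * v for v in pred_counts.values()))
        * (samples ** 2 - sum(v * v for v in true_counts.values()))
    )
    # By convention, return 0 when the coefficient is undefined.
    return numerator / denominator if denominator else 0.0

perfect = multiclass_mcc([0, 1, 2, 2], [0, 1, 2, 2])
one_miss = multiclass_mcc([1, 1, 0, 0], [1, 0, 0, 0])
print(perfect, one_miss)
```

Unlike plain accuracy, this coefficient takes the distribution of both actual and predicted labels into account, so a model that always predicts one category scores 0 rather than the share of that category.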


Pretty decent: the `mcc` value (Matthews correlation coefficient) on the test data is about 0.98. But it still took nearly half an hour to load and then another 2 minutes to train. Let us see how using DistilBERT improves our loading and training times.

In [3]:
from simpletransformers.classification import ClassificationModel

# Create a TransformerModel
model = ClassificationModel('distilbert', 'distilbert-base-cased', num_labels=5, args={'reprocess_input_data': True, 'overwrite_output_dir': True}, use_cuda=False)

# Train the model
model.train_model(train)
    
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test)

print(result)

Running loss: 0.026052




{'mcc': 0.9746089072608206, 'eval_loss': 0.0755317120347172}


So, as advertised, DistilBERT really does retain most of BERT's predictive quality (an `mcc` of about 0.975 against 0.978) while dramatically reducing the loading and training time. We can now try our model on data that it has most likely never seen before. For instance, consider this paragraph from a BBC News article in the category tech (labelled as 4) published on 08-06-2020 (the same day that this cell was written). As expected, the model outputs the label corresponding to the category 'tech'.

In [4]:
model = ClassificationModel('distilbert', './outputs', num_labels=5, args={'reprocess_input_data': True}, use_cuda=False)

predictions, raw_outputs = model.predict(['These days, your shiny new gadget is likely to be rendered obsolete by software updates (or a lack of them) before it physically grinds to a halt. A recent report by the consumer campaign group Which? suggests the lifespan of a smart fridge could be just a few years if the brand behind it stops providing software support and updates. Meanwhile, Sonos has released new software for its internet-connected speakers that does not work on its own-branded older devices. And this prompted me to casually mention on Twitter that I have a 12-year-old TV. To make myself feel better, I also asked people to share their oldest working gadgets. And a floodgate opened. Made in the days before software updates, operating systems and security vulnerabilities were part of of the ecosystem, they\'re all still going strong.'])
print(predictions)
print(raw_outputs)


[4]
[[-1.124043  -1.36424   -1.0584428 -1.37986    4.5885267]]
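
The raw outputs above are easier to interpret once converted to probabilities. The sketch below assumes they are unnormalised logits, applies a softmax, and maps the winning index back to a category name using the label mapping printed earlier; the variable names are illustrative.

```python
from math import exp

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() numerically stable."""
    highest = max(logits)
    exps = [exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Label mapping and raw output copied from the outputs above.
labels = {0: 'business', 1: 'entertainment', 2: 'politics', 3: 'sport', 4: 'tech'}
raw_output = [-1.124043, -1.36424, -1.0584428, -1.37986, 4.5885267]

probabilities = softmax(raw_output)
best = max(range(len(probabilities)), key=lambda i: probabilities[i])
print(labels[best], probabilities[best])
```

These probabilities are also what a confidence threshold for reporting 'unknown' could be applied to, although softmax outputs of a fine-tuned model are often overconfident and should be read with care.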
