#### Zero-shot Topic Classification

In terms of accuracy, the most modern and reliable approach to topic classification is a pre-trained **zero-shot** classifier. It suits our task well: we have a considerable number of labels, but neither enough training data to back them up nor the computational resources to fully train a model.

Note: For performance reasons, a GPU is highly recommended for this type of model, so the notebook starts by checking whether one is available on the system.

In [8]:
import torch

print('Cuda Device Found? ', torch.cuda.is_available())
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if torch.cuda.is_available():
    print('Type of Cuda Device:', torch.cuda.get_device_name(device))

Cuda Device Found?  True
Type of Cuda Device: NVIDIA GeForce GTX 1650 Ti


In [1]:
import pandas as pd
from transformers import pipeline

df = pd.read_excel('data/textcat_fewshot_data.xlsx')
data = {}

for column in df.columns:
    data[column] = df[column].dropna().tolist()

print('Training Custom Dataset Loaded')
all_labels = list(data.keys())



Training Custom Dataset Loaded


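The pre-trained **zero-shot** classifier is loaded through the `transformers` pipeline, with PyTorch as the framework (`framework = "pt"`).

Set `device` to **-1** to run on the CPU or to **0** to run on the first GPU.

Note: Running on the CPU is 7x slower.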

In [2]:
zero_shot_classifier = pipeline("zero-shot-classification", model = "facebook/bart-large-mnli", device = 0, framework = "pt")
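
Instead of hard-coding `device = 0`, the index could be derived from the GPU check at the top of the notebook. A minimal sketch, assuming a hypothetical helper named `pipeline_device_index` (it is not part of `transformers`):

```python
def pipeline_device_index(cuda_available: bool) -> int:
    """Map a CUDA availability flag to the device convention described above.

    Hypothetical helper for illustration: 0 selects the first GPU, -1 the CPU.
    """
    return 0 if cuda_available else -1


# In the notebook, the flag would come from torch.cuda.is_available().
print(pipeline_device_index(True))
print(pipeline_device_index(False))
```

The result could then be passed as `device = pipeline_device_index(torch.cuda.is_available())` when building the pipeline, so the same notebook also runs on machines without a GPU.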

Compute scores for a batch of example sentences, using zero-shot classification with the chosen pre-trained model.

In [3]:
example_sentences = ["who developed these medical and bio sensors?", "AI will surely be the future of Digital Health.", 
                     "She works as PR to outreach communities in LinkedIn and organize workshops.", "Developing Image Segmentation systems is one my favourite hobbies.", 
                     "This digital application will accelerate communication in LMICs.", "Immigrant communities are often faced with racism.", 
                     "I might start a PhD related to EHRs in health systems.", "We performed statistical analysis between variables during this research.",
                     "Waste incineration is often overlooked as a cause of air pollution.", "Every human should have a right to universal healthcare access.", 
                     "Smoking and alcohol consumption often lead to obesity.", 
                     "Virtual doctor consultations are essential for remote areas where there is a lack of health professionals."]

all_scores = zero_shot_classifier(
    sequences = example_sentences,
    candidate_labels = all_labels,
    multi_label = True,
    src_lang="en",
)
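
To help interpret the scores: as we understand the `transformers` zero-shot pipeline, each candidate label is turned into a hypothesis with the default template `This example is {}.`, and the NLI model scores that hypothesis against the sentence. With `multi_label = True`, each label is scored independently as the softmax of its entailment logit against its contradiction logit, so the scores for one sentence need not sum to 1. The sketch below reproduces only that arithmetic in plain Python. The function names and the logits are illustrative assumptions, not library code or real model output.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def multi_label_score(contradiction_logit, entailment_logit):
    """Softmax over (contradiction, entailment); returns P(entailment)."""
    top = max(contradiction_logit, entailment_logit)  # subtract the max for numerical stability
    exp_contra = math.exp(contradiction_logit - top)
    exp_entail = math.exp(entailment_logit - top)
    return exp_entail / (exp_contra + exp_entail)


hypotheses = build_hypotheses(["Digital Health", "Discrimination"])
print(hypotheses)

# Made-up logits for the two hypotheses, as (contradiction, entailment) pairs.
made_up_logits = [(-2.0, 3.0), (1.5, -1.0)]
for hypothesis, (contra, entail) in zip(hypotheses, made_up_logits):
    print(hypothesis, round(multi_label_score(contra, entail), 3))
```

Because every label gets its own independent score, a single sentence can pass the threshold for several labels or for none.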

##### Output Scores (Multi-Label)

After multi-label predictions are made for each sentence, a scoring function assigns to each sentence every label whose score is at least 0.6 and counts how many sentences were assigned each label. The function is designed for our task of classifying a full document made up of hundreds of sentences.

In [4]:
def process_scores(pipeline_scores, topic_labels, threshold = 0.6):
    # Count, per topic, how many sentences were assigned that label (score >= threshold)
    topics_count = {label: 0 for label in topic_labels}

    for score in pipeline_scores:
        print("For Sentence: {}".format(score['sequence']))
        # Pair each label with its own score instead of relying on the pipeline's descending sort order
        assigned_labels = [label for label, proba in zip(score['labels'], score['scores']) if proba >= threshold]
        for label in assigned_labels:
            topics_count[label] += 1
        print("Assigned Labels: {}".format(assigned_labels))
        print()

    # Most frequently assigned topics first
    sorted_count = sorted(topics_count.items(), key = lambda x: x[1], reverse = True)
    print('Final Count: {}'.format(sorted_count))


process_scores(all_scores, all_labels)

For Sentence: who developed these medical and bio sensors?
Assigned Labels: []

For Sentence: AI will surely be the future of Digital Health.
Assigned Labels: ['Artificial Intelligence', 'Digital Health']

For Sentence: She works as PR to outreach communities in LinkedIn and organize workshops.
Assigned Labels: ['Community Outreach', 'Lifestyle Factors']

For Sentence: Developing Image Segmentation systems is one my favourite hobbies.
Assigned Labels: []

For Sentence: This digital application will accelerate communication in LMICs.
Assigned Labels: []

For Sentence: Immigrant communities are often faced with racism.
Assigned Labels: ['Discrimination']

For Sentence: I might start a PhD related to EHRs in health systems.
Assigned Labels: ['Electronic Health Records', 'eHealth']

For Sentence: We performed statistical analysis between variables during this research.
Assigned Labels: ['Data Analysis', 'Systematic Review']

For Sentence: Waste incineration is often overlooked as a cause o
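
For a full document it can be more convenient to return the counts than to print them. The sketch below is a hypothetical variant of `process_scores` that takes the same list of `sequence`/`labels`/`scores` dictionaries and returns the per-topic counts in descending order. The name `count_topics` and the example scores are made up for illustration.

```python
def count_topics(pipeline_scores, topic_labels, threshold=0.6):
    """Count how many sentences were assigned each topic at or above the threshold."""
    topics_count = {label: 0 for label in topic_labels}
    for result in pipeline_scores:
        for label, proba in zip(result["labels"], result["scores"]):
            if proba >= threshold:
                topics_count[label] += 1
    # Most frequent topics first; ties keep the order of topic_labels.
    return sorted(topics_count.items(), key=lambda item: item[1], reverse=True)


# Hand-written stand-in for the pipeline output (not real model scores).
fake_scores = [
    {"sequence": "sentence one", "labels": ["Digital Health", "Discrimination"], "scores": [0.91, 0.12]},
    {"sequence": "sentence two", "labels": ["Discrimination", "Digital Health"], "scores": [0.74, 0.65]},
]
print(count_topics(fake_scores, ["Discrimination", "Digital Health"]))
```

Returning the sorted list makes it straightforward to keep only the top few topics of a document or to compare counts across documents.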

##### Document Processing

In [5]:
import re
from nltk.tokenize import sent_tokenize
from process_single_document import extract_text_from_pdf, remove_references, try_finding_keywords

# Document extraction and pre-processing (removing references and, where possible, finding keywords)
with open('data/implementome_publications/test_miner/child_obesity_switzerland.pdf', 'rb') as single_pdf:
    doc_as_str, doc_as_list = extract_text_from_pdf(single_pdf)
doc_as_str = remove_references(doc_as_str)
doc_keywords = try_finding_keywords(doc_as_str)

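# Tokenize the string containing the entire document text into sentences
# Newlines are replaced by spaces, then the regular expression drops every hyphen that is not directly between two letters
# (e.g. a hyphen left at a line break); hyphenated words such as 'migration-related' are kept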
sentences = sent_tokenize(doc_as_str)
pattern = r'(?<![a-zA-Z])-|-(?![a-zA-Z])'
sentences = [re.sub(pattern, '', sentence.replace('\n', ' ')) for sentence in sentences]

fixed_sentences = [sentence for sentence in sentences if len(sentence) >= 50]

Text from PDF File (11 pages) extracted successfully.
Original document character length: 44197
References removed, new document character length: 36152
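
Note that the regular expression in the cell above drops stray hyphens but leaves a space where a word was split across a line break, and it also strips hyphens between digits (as in a DOI). If joining split words is the goal, one option is to handle the hyphen and the newline together before any other clean-up. A minimal sketch, assuming a hypothetical helper `join_line_break_hyphens` and a made-up input string:

```python
import re


def join_line_break_hyphens(text: str) -> str:
    """Join words hyphenated across a line break, then flatten remaining newlines."""
    # "implemen-\ntation" -> "implementation": a hyphen between letters, directly before a newline.
    text = re.sub(r'(?<=[A-Za-z])-[ \t]*\n[ \t]*(?=[A-Za-z])', '', text)
    # Every remaining newline (with surrounding whitespace) becomes a single space.
    return re.sub(r'\s*\n\s*', ' ', text)


sample = "implemen-\ntation of migration-related\nfactors (s12889-021)"
print(join_line_break_hyphens(sample))
```

One caveat of this approach: a genuine compound that happens to break at its hyphen (for example `migration-` at the end of a line and `related` on the next) would be merged into one word.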


In [6]:
import time

start = time.perf_counter()
doc_scores = zero_shot_classifier(
    sequences = fixed_sentences,
    candidate_labels = all_labels,
    multi_label = True,
    src_lang="en",
)
elapsed = time.perf_counter() - start

In [8]:
print("Topic Classification | Document Processing Finished | Elapsed Time: {:.4f}s | Time per Sentence: {:.4f}s".format(elapsed, elapsed / len(fixed_sentences)))

Topic Classification | Document Processing Finished | Elapsed Time: 410.3804s | Time per Sentence: 1.9730s
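
The timing logic can also be wrapped in a small reusable helper. The sketch below uses `time.perf_counter` and a stand-in classifier so that it runs without a model. `timed_call` and `fake_classifier` are hypothetical names introduced here for illustration.

```python
import time


def timed_call(classify, sentences):
    """Run classify(sentences) and return (results, elapsed seconds, seconds per sentence)."""
    start = time.perf_counter()
    results = classify(sentences)
    elapsed = time.perf_counter() - start
    # Guard against an empty document to avoid dividing by zero.
    per_sentence = elapsed / len(sentences) if sentences else 0.0
    return results, elapsed, per_sentence


def fake_classifier(sentences):
    """Stand-in for zero_shot_classifier: returns one empty result per sentence."""
    return [{"sequence": s, "labels": [], "scores": []} for s in sentences]


results, elapsed, per_sentence = timed_call(fake_classifier, ["first sentence", "second sentence"])
print("Elapsed Time: {:.4f}s | Time per Sentence: {:.4f}s".format(elapsed, per_sentence))
```

In the notebook, `classify` could be a `lambda` that forwards `candidate_labels = all_labels` and `multi_label = True` to `zero_shot_classifier`.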


In [9]:
process_scores(doc_scores, all_labels)

For Sentence: BMC Public Health          (2021) 21:243  https://doi.org/10.1186/s12889021102130 R E S E A R C H A R T I C L E Open Access The increase in child obesity in Switzerland is mainly due to migration from Southern Europe – a cross-sectional study Urs Eiholzer*, Chris Fritz and Anika Stephan Abstract Background: Novel height, weight and body mass index (BMI) references for children in Switzerland reveal an increase in BMI compared to former percentile curves.
Assigned Labels: ['Data Analysis']

For Sentence: This trend may be the result of children with parents originating from Southern European countries having a higher risk of being overweight compared to their peers with parents of Swiss origin.
Assigned Labels: ['Social Analysis']

For Sentence: We examined the association of generational, migration-related and socioeconomic factors on BMI in Switzerland and expect the results to lead to more targeted prevention programs.
Assigned Labels: ['Data Analysis']

For Sentence: M