This notebook covers the Topic Classification part of the project, specifically the **textcat** pipeline component that will be added to our spaCy model.

In [12]:
import spacy

nlp = spacy.load("en_core_web_lg")

# Test the model on some text
doc = nlp("This is a test article about technology and politics.")
# Placeholder labels only: a trained textcat component stores doc.cats as a dict of label -> score
doc.cats = ["label1", "label3", "label5", "label10", "label20"]
print(doc.cats)

['label1', 'label3', 'label5', 'label10', 'label20']


##### Automatic MeSH indexing

Map disease queries to their corresponding MeSH terms and, where possible, locate them under the **[C]** Diseases category.
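As a rough illustration of what resolving a term to its tree means, the sketch below walks the dotted prefixes of a MeSH-style tree number and collects the name at each level. It does not reproduce `get_mesh_tree`, whose implementation is not shown here. `TOY_TREE`, `tree_path` and `is_disease_category` are hypothetical names, and the tree numbers are illustrative values that should be checked against the current MeSH release.

```python
# Hypothetical toy excerpt: tree number -> descriptor name (illustrative values only).
TOY_TREE = {
    "C01": "Infections",
    "C01.610": "Parasitic Diseases",
    "C01.610.752": "Protozoan Infections",
    "C01.610.752.530": "Malaria",
}


def tree_path(tree_number, tree=TOY_TREE):
    """Return the descriptor names from the top-level category down to the term."""
    parts = tree_number.split(".")
    # Build every dotted prefix: "C01", "C01.610", "C01.610.752", ...
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [tree[prefix] for prefix in prefixes if prefix in tree]


def is_disease_category(tree_number):
    """Tree numbers in the Diseases category start with the letter C."""
    return tree_number.startswith("C")


path = tree_path("C01.610.752.530")
print(path)
```

An unknown tree number simply yields an empty list, which mirrors the empty results returned for queries such as `hiv` and `covid` in the output of the next cell.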

In [1]:
from query_for_mesh_terms import get_mesh_tree

queries = ['child obesity', 'aids', 'coronary heart disease', 'hiv', 'malaria', 'covid', 'asthma', 'hypertension', 'diabetes']
# queries = ['diabetes mellitus']
for query in queries:
    mesh_terms = get_mesh_tree(query)
    print('\nFor Queried Disease: {}'.format(query))
    print('Potential Corresponding MeSh Tree: {}'.format(mesh_terms))
    print('')


For Queried Disease: child obesity
Potential Corresponding MeSh Tree: ['Nutritional and Metabolic Diseases', 'Nutrition Disorders', 'Overnutrition', 'Overweight']


For Queried Disease: aids
Potential Corresponding MeSh Tree: ['Infections', 'Communicable Diseases', 'Blood-Borne Infections', 'HIV Infections']


For Queried Disease: coronary heart disease
Potential Corresponding MeSh Tree: ['Cardiovascular Diseases', 'Heart Diseases', 'Myocardial Ischemia', 'Coronary Disease']


For Queried Disease: hiv
Potential Corresponding MeSh Tree: []


For Queried Disease: malaria
Potential Corresponding MeSh Tree: ['Infections', 'Parasitic Diseases', 'Protozoan Infections', 'Malaria']


For Queried Disease: covid
Potential Corresponding MeSh Tree: []


For Queried Disease: asthma
Potential Corresponding MeSh Tree: ['Respiratory Tract Diseases', 'Bronchial Diseases', 'Asthma', 'Asthma, Aspirin-Induced']


For Queried Disease: hypertension
Potential Corresponding MeSh Tree: ['Cardiovascular Diseas

##### **Eurlex** Dataset Tryout

In [1]:
from datasets import load_dataset

dataset = load_dataset("eurlex")
dataset

No config specified, defaulting to: eurlex/eurlex57k
Found cached dataset eurlex (C:/Users/Genis/.cache/huggingface/datasets/eurlex/eurlex57k/1.1.0/d2fdeaa4fcb5f41394d2ed0317c8541d7f9be85d2d601b9fa586c8b461bc3a34)


DatasetDict({
    train: Dataset({
        features: ['celex_id', 'title', 'text', 'eurovoc_concepts'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['celex_id', 'title', 'text', 'eurovoc_concepts'],
        num_rows: 6000
    })
    validation: Dataset({
        features: ['celex_id', 'title', 'text', 'eurovoc_concepts'],
        num_rows: 6000
    })
})

In [3]:
import pandas as pd

# Read the JSON file into a DataFrame
df = pd.read_json('./data/eurovoc_concepts.jsonl', lines=True)
finetuned_df = pd.read_json('./data/eurovoc_concepts_finetuned.jsonl', lines=True)
converting_dictionary = finetuned_df.set_index('id')['title'].to_dict()

# Print the resulting dictionary
print('Mapping: ', converting_dictionary)

all_titles = {title: 0 for title in df['title']}
all_titles_finetuned = {title: 0 for title in finetuned_df['title']}

print('Eurlex Dataset Label Count - Original: {} - Curated: {}'.format(len(all_titles), len(all_titles_finetuned)))

Mapping:  {'3474': 'international affairs', '3363': 'union representative', '4488': 'data processing', '6303': 'scientific discovery', '3842': 'scientific apparatus', '538': 'rights of the individual', '1909': 'migration', '1600': 'self-defence', '1592': 'anti-trust legislation', '851': 'sociocultural facilities', '1478': 'Workers International', '2577': 'power of implementation', '1641': 'trade licence', '2566': 'postal and telecommunications services', '3586': 'scientific profession', '1760': 'mental illness', '6035': 'urban community', '1172': 'government', '2581': 'power to negotiate', '4536': 'transport under customs control', '6084': 'European Works Council', '5184': 'programmes industry', '5970': 'surgery', '1625': 'freedom of assembly', '7363': 'graphic illustration', '2916': 'applied research', '1941': 'national minority', '7392': 'urban sociology', '582': 'political right', '6401': 'environmental liability', '6744': 'eugenics', '6033': 'rural community', '5938': 'transport of

In [4]:
def adjust_original_concepts(list_of_concepts, mapping_dictionary):
    adjusted_concepts = []
    for concept in list_of_concepts:
        try: 
            adjusted_concepts.append(mapping_dictionary[concept])
        except KeyError:
            continue
    return adjusted_concepts
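A quick check of the mapping step on a toy input may help. The sketch below is an equivalent list-comprehension form of the function above, applied to two ids taken from the printed mapping plus one id (`"0000"`) that is assumed not to be in the curated set.

```python
def adjust_original_concepts(list_of_concepts, mapping_dictionary):
    """Translate concept ids to curated titles, dropping ids without a mapping."""
    return [mapping_dictionary[c] for c in list_of_concepts if c in mapping_dictionary]


# Two entries copied from the printed mapping; "0000" is a made-up id with no mapping.
toy_mapping = {"1909": "migration", "5970": "surgery"}
adjusted = adjust_original_concepts(["1909", "0000", "5970"], toy_mapping)
print(adjusted)
```

Ids without a curated title are silently skipped, so a document whose concepts all fall outside the curated set ends up with an empty list.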

In [15]:
from spacy.tokens import DocBin
import spacy

nlp = spacy.blank("en")
    
db = DocBin()
split_name = 'validation'
data = dataset[split_name]

count = 0
for i, example in enumerate(data):

    ##### Progress counter
    if i % 2500 == 0:
        print("{}/{}".format(i, len(data)))
        print("Selected Sentences: {} \n".format(count))

    ##### Keep only the concepts that map to one of our curated labels
    concepts = [converting_dictionary[concept] for concept in example['eurovoc_concepts'] if concept in converting_dictionary]

    ##### Skip documents that carry none of the curated labels
    if len(concepts) == 0:
        continue

    doc = nlp(example['text'])
    # print(doc)
    ##### Start with every curated label at 0, then set the matching ones to 1
    doc.cats = all_titles_finetuned.copy()
    # print('BEFORE: ', doc.cats)
    for concept in concepts:
        doc.cats[concept] = 1
    # print('After: ', doc.cats)
    # print('')
    count += 1

    db.add(doc)

# db.to_disk('./models/fine_tune/corpus/curated_eurlex_{}.spacy'.format(split_name))    

0/6000
Selected Sentences: 0 

2500/6000
Selected Sentences: 226 

5000/6000
Selected Sentences: 457 
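The label encoding used in the loop above is a multi-hot dictionary: every curated label is present, with 1 for the labels attached to the document and 0 for the rest. A minimal standalone sketch of that encoding follows. `build_cats` is a hypothetical helper, and the three labels are a small sample taken from the printed mapping.

```python
def build_cats(all_labels, positive_labels):
    """Return a label -> score dict with 1.0 for positive labels and 0.0 otherwise."""
    cats = {label: 0.0 for label in all_labels}
    for label in positive_labels:
        # Ignore labels outside the curated set instead of adding new keys.
        if label in cats:
            cats[label] = 1.0
    return cats


cats = build_cats(["migration", "surgery", "government"], ["surgery"])
print(cats)
```

Keeping every label in the dictionary, including the negatives, lets a multi-label classifier see explicit 0 targets for the categories that do not apply.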



##### Classy Classification - Few-shot Topic Classification 

The example below shows how the few-shot text classifier works, using a handful of training sentences for two labels: *politics* and *sports*.
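To build some intuition for few-shot classification over sentence embeddings, here is a deliberately simplified stand-in: a nearest-centroid classifier with cosine similarity on made-up 2-d vectors. It is not a description of how `classy_classification` works internally. The vectors and the helper names `cosine`, `centroid` and `classify` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify(vector, examples_by_label):
    """Score a vector against the centroid of each label's example vectors."""
    centroids = {label: centroid(vecs) for label, vecs in examples_by_label.items()}
    return {label: cosine(vector, c) for label, c in centroids.items()}


# Made-up 2-d "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_examples = {
    "politics": [[0.9, 0.1], [0.8, 0.2]],
    "sports": [[0.1, 0.9], [0.2, 0.8]],
}
scores = classify([0.7, 0.3], toy_examples)
best = max(scores, key=scores.get)
print(best, scores)
```

The point of the sketch is only that a few labelled sentences per class can be enough once a pretrained encoder places similar sentences close together.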

In [1]:
import spacy
import classy_classification

##### Prepare the training data in a dictionary format with keys as labels and a list of train sentences as values
data = {
    "politics": [
        "Putin orders troops into pro-Russian regions of eastern Ukraine.",
        "The president decided not to go through with his speech.",
        "There is much uncertainty surrounding the coming elections.",
        "Democrats are engaged in a ‘new politics of evasion’",
    ],
    "sports": [
        "The soccer team lost.",
        "The team won by two against zero.",
        "I love all sport.",
        "The olympics were amazing.",
        "Yesterday, the tennis players wrapped up wimbledon.",
    ],
}


##### Model Training & Configuration
nlp = spacy.load('en_core_web_md')
nlp.add_pipe("text_categorizer", 
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        # "model": "spacy",
        "multi_label": True,
        "device": "gpu",
    }
)



Testing the model

In [3]:
example_sentence = "Donald Trump is running again for president in 2024"
print('For Sentence: {}'.format(example_sentence))
print('Topic Classification Prediction: {}'.format(nlp(example_sentence)._.cats))
print('')

example_sentence = "Man City is moving to the finals of the UCL."
print('For Sentence: {}'.format(example_sentence))
print('Topic Classification Prediction: {}'.format(nlp(example_sentence)._.cats))
print('')

For Sentence: Donald Trump is running again for president in 2024
Topic Classification Prediction: {'politics': 0.9148040442917577, 'sports': 0.060666751741570296}

For Sentence: Man City is moving to the finals of the UCL.
Topic Classification Prediction: {'politics': 0.4187886550089133, 'sports': 0.5130324025304186}
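The second prediction is much less decisive than the first, so it can be useful to accept the top label only above a minimum score. The helper below is a small sketch of that idea. `top_label` and the `0.6` threshold are assumptions for illustration, and the score dictionaries are copied from the output above.

```python
def top_label(cats, min_score=0.0):
    """Return (label, score) for the best label, or (None, score) if below min_score."""
    label = max(cats, key=cats.get)
    score = cats[label]
    return (label, score) if score >= min_score else (None, score)


# Scores copied from the two predictions printed above.
trump = {'politics': 0.9148040442917577, 'sports': 0.060666751741570296}
man_city = {'politics': 0.4187886550089133, 'sports': 0.5130324025304186}

print(top_label(trump, min_score=0.6))
print(top_label(man_city, min_score=0.6))
```

With this threshold the first sentence is labelled *politics*, while the second is left unlabelled because its best score (about 0.51 for *sports*) is too close to the alternative.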



Training the model from scratch for our custom case

In [1]:
import pandas as pd
import spacy
import classy_classification


df = pd.read_excel('data/textcat_fewshot_data.xlsx')
data = {}

for column in df.columns:
    data[column] = df[column].dropna().tolist()

# Print the resulting dictionary
print(data)
print('Training Custom Dataset Loaded')



{'Artificial Intelligence': ['Machine learning algorithms employ advanced statistical techniques to analyze vast datasets.', 'Artificial intelligence is reshaping industries such as healthcare, finance, and autonomous vehicles.', 'Neural networks, specifically deep learning models, are pivotal in image recognition tasks.', 'Chatbots leverage natural language processing to provide conversational interactions with users.', 'Cutting-edge self-driving cars utilize computer vision and AI to navigate complex road environments.', 'Reinforcement learning enables AI agents to learn optimal decision-making through trial and error.', 'AI-powered virtual assistants like Siri and Alexa have become ubiquitous in our daily lives.', 'Image recognition technology has achieved remarkable accuracy in object detection and classification.', 'Ethical considerations surrounding AI development encompass issues of bias, privacy, and accountability.', 'Sentiment analysis algorithms can discern emotions and atti

In [3]:
##### Start from either a pretrained spaCy pipeline or a blank English one
# nlp = spacy.load('en_core_web_md')
nlp = spacy.blank("en")

##### Add Text Classifier Pipeline
nlp.add_pipe("text_categorizer", 
    config = { 
        "data": data,
        # "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        # "model": "spacy",
        "multi_label": True,
        "device": "gpu",
    }
)

<classy_classification.classifiers.classy_spacy.ClassySpacyExternalFewShot at 0x1af850b7370>

In [3]:
example_sentences = ["who developed these medical and bio sensors?", "AI will surely be the future of Digital Health.", 
                     "She works as PR to outreach communities in LinkedIn and organize workshops.", "Developing Image Segmentation systems is one my favourite hobbies.", 
                     "This digital application will accelerate communication in LMICs.", "Immigrant communities are often faced with racism.", 
                     "I might start a PhD related to EHRs in health systems.", "We performed statistical analysis between variables during this research.",
                     "Waste incineration is often overlooked as a cause of air pollution.", "Every human should have a right to universal healthcare access.", 
                     "Smoking and alcohol consumption often lead to obesity.", 
                     "Virtual doctor consultations are essential for remote areas where there is a lack of health professionals."]


print('Topic Classification')
for sentence in example_sentences:
    prediction = nlp(sentence)._.cats
    max_value_key = max(prediction, key = prediction.get)

    print('For Sentence: {}'.format(sentence))
    # print('Topic Classification Prediction: {}'.format(sorted(prediction.items(), key = lambda x: x[1], reverse = True)))
    print('Prediction - This sentence seems to be about: {} | {:.5f}'.format(max_value_key, prediction[max_value_key]))
    print('')

Topic Classification
For Sentence: who developed these medical and bio sensors?
Prediction - This sentence seems to be about: Biomedical Engineering | 0.82375

For Sentence: AI will surely be the future of Digital Health.
Prediction - This sentence seems to be about: Artificial Intelligence | 0.57328

For Sentence: She works as PR to outreach communities in LinkedIn and organize workshops.
Prediction - This sentence seems to be about: Community Outreach | 0.81108

For Sentence: Developing Image Segmentation systems is one my favourite hobbies.
Prediction - This sentence seems to be about: Computer Vision | 0.56477

For Sentence: This digital application will accelerate communication in LMICs.
Prediction - This sentence seems to be about: Digital Accelerators | 0.47649

For Sentence: Immigrant communities are often faced with racism.
Prediction - This sentence seems to be about: Discrimination | 1.00000

For Sentence: I might start a PhD related to EHRs in health systems.
Prediction - T

##### Custom Evaluation Set Score

In [4]:
##### Load the test set from the corresponding Excel file
import pandas as pd
import random

random.seed(596)
df = pd.read_excel('data/textcat_evaluation.xlsx')
evaluation_set = [tuple(row) for row in df.to_records(index = False)]
eval_length = len(evaluation_set)
random.shuffle(evaluation_set)

In [5]:
all_sentences = [item[0] for item in evaluation_set]

In [6]:
import time

start = time.time()
prediction_tuples = []
correct_count = 0
print('Topic Classification Evaluation Started...')
for i, sentence in enumerate(all_sentences):
    prediction = nlp(sentence)._.cats
    max_value_key = max(prediction, key = prediction.get)
    ##### Keep the five highest-scoring labels; a hit anywhere in this top 5 counts as correct
    ranked = sorted(prediction.items(), key = lambda x: x[1], reverse = True)
    top_labels = [label for label, _ in ranked[:5]]

    if evaluation_set[i][1] in top_labels:
        correct_count += 1

    # print('For Sentence: {}'.format(sentence))
    # print('Topic Classification Prediction: {}'.format(top_labels))
    # print('True: {} | Prediction: {} ({:.3f})'.format(evaluation_set[i][1], max_value_key, prediction[max_value_key]))
    # print('')
elapsed = time.time() - start

print("Topic Classification | Evaluation Finished | Elapsed Time: {:.4f}s | Time per Sentence: {:.4f}s".format(elapsed, elapsed / eval_length))
##### Note: the figure below is top-5 accuracy, not top-1 accuracy
print("Accuracy: {:1.2f}%".format((correct_count / eval_length) * 100))

Topic Classification Evaluation Started...
Topic Classification | Evaluation Finished | Elapsed Time: 12.5684s | Time per Sentence: 0.0157s
Accuracy: 31.00%
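The score above counts a prediction as correct when the true label appears among the five highest-scoring labels, so it is a top-5 accuracy. The sketch below separates that metric into a reusable function so top-1 and top-k can be compared side by side. `top_k_accuracy` and the toy predictions are hypothetical and unrelated to the evaluation file.

```python
def top_k_accuracy(predictions, gold_labels, k=5):
    """Fraction of items whose gold label is among the k highest-scoring labels."""
    if not gold_labels:
        return 0.0
    hits = 0
    for cats, gold in zip(predictions, gold_labels):
        # Labels sorted from highest to lowest score.
        ranked = sorted(cats, key=cats.get, reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_labels)


# Toy score dictionaries, one per sentence, with their gold labels.
toy_predictions = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.2, "B": 0.5, "C": 0.3},
]
toy_gold = ["B", "A"]

top1 = top_k_accuracy(toy_predictions, toy_gold, k=1)
top2 = top_k_accuracy(toy_predictions, toy_gold, k=2)
print(top1, top2)
```

Reporting top-1 next to top-5 makes it easier to judge how often the single best guess is right, which is what the per-sentence predictions printed earlier rely on.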


##### Test on Documents

In [4]:
import os
import re
from nltk.tokenize import sent_tokenize
from process_single_document import extract_text_from_pdf, remove_references, try_finding_keywords

pdf_dir = 'data/implementome_publications/test_miner/'
for file in os.listdir(pdf_dir):
    filename = os.fsdecode(file)
    print('For Document: {}'.format(filename))
    if filename.endswith('.pdf'):
        ##### Document extraction and pre-processing (removing references & potentially finding keywords)
        # To test a single file, open e.g. pdf_dir + 'child_obesity_switzerland.pdf' instead
        with open(pdf_dir + filename, 'rb') as single_pdf:
            doc_as_str, doc_as_list = extract_text_from_pdf(single_pdf)
        doc_as_str = remove_references(doc_as_str)
        doc_keywords = try_finding_keywords(doc_as_str)

        ##### Split the string containing the entire document text into sentences
        ##### Newlines become spaces; the regular expression then drops every hyphen that is not
        ##### directly between two letters (e.g. line-break hyphenation), keeping hyphens inside words
        sentences = sent_tokenize(doc_as_str)
        pattern = r'(?<![a-zA-Z])-|-(?![a-zA-Z])'
        sentences = [re.sub(pattern, '', sentence.replace('\n', ' ')) for sentence in sentences]

        # sorted_categories = sorted(predicted_categories.items(), key = lambda x: x[1], reverse=True)

        ##### Keep only sentences of at least 50 characters
        fixed_sentences = [sentence for sentence in sentences if len(sentence) >= 50]
        topics_count = {label: 0 for label in data.keys()}
        docs = list(nlp.pipe(fixed_sentences))

        for doc in docs:
            prediction = doc._.cats
            # prediction = classifier(doc.text)
            max_value_key = max(prediction, key = prediction.get)
            ##### Count the top label only when its score lies in the range [0.4, 0.95)
            if 0.95 > prediction[max_value_key] >= 0.4:
                topics_count[max_value_key] += 1

            # print('For Sentence: {}'.format(doc.text))
            # # print('Topic Classification Prediction: {}'.format(sorted(prediction.items(), key = lambda x: x[1], reverse = True)))
            # print('Prediction - This sentence seems to be about: {} | {:.5f}'.format(max_value_key, prediction[max_value_key]))
            # print('')

        print('Final Topic Classification Count: {}'.format(sorted(topics_count.items(), key = lambda x: x[1], reverse = True)))
        print('')

For Document: ai_and_surgical_decision_making.pdf
Text from PDF File (11 pages) extracted successfully.
Original document character length: 66451
References removed, new document character length: 50476
Final Topic Classification Count: [('Natural Language Processing', 15), ('Computer Vision', 9), ('eHealth', 3), ('Global Health', 2), ('Biomedical Engineering', 1), ('Discrimination', 1), ('Global Threats', 1), ('Healthcare Financing', 1), ('Medical Imaging', 1), ('Public Health Law', 1), ('Social Analysis', 1), ('Substance Abuse', 1), ('Artificial Intelligence', 0), ('Clinical Decision Support', 0), ('Community Outreach', 0), ('Data Analysis', 0), ('Decision Making', 0), ('Diagnostic Assessments', 0), ('Digital Accelerators', 0), ('Digital Health', 0), ('Electronic Health Records', 0), ('Empirical Study', 0), ('Environmental Conservation', 0), ('Evidence From Patients', 0), ('Human Rights', 0), ('Humanitarian Response', 0), ('Knowledge Management', 0), ('Lifestyle Factors', 0), ('Liter
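To see what the sentence clean-up step in the cell above actually does, the sketch below applies the same newline replacement and regular expression to a few made-up strings. `clean_sentence` is a hypothetical wrapper name; the pattern is copied from the cell.

```python
import re

# Same pattern as in the document-processing cell: matches a hyphen that is
# not preceded by a letter, or not followed by a letter.
pattern = r'(?<![a-zA-Z])-|-(?![a-zA-Z])'


def clean_sentence(sentence):
    """Replace newlines with spaces, then drop hyphens not enclosed by letters."""
    return re.sub(pattern, '', sentence.replace('\n', ' '))


samples = ["implemen-\ntation", "state-of-the-art", "COVID-19"]
cleaned = [clean_sentence(sample) for sample in samples]
print(cleaned)
```

Two side effects are worth noting. A word hyphenated across a line break keeps a space between its halves, because the newline is turned into a space before the hyphen is removed. Hyphens next to digits are also removed, so a term like `COVID-19` loses its hyphen.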