# Custom NER for Identifying Diseases and Treatments

This notebook implements a custom Named Entity Recognition (NER) system to identify diseases and treatments from a medical dataset. The dataset is provided in tokenized format, where each word is associated with a label:
- `O` indicates "Other"
- `D` indicates "Disease"
- `T` indicates "Treatment"

## Steps in this Notebook
1. **Data Preprocessing:** Reconstruct sentences and labels from the tokenized dataset.
2. **Concept Identification:** Identify key concepts in the dataset using PoS tagging.
3. **Defining Features for CRF:** Create features for training the CRF model.
4. **Getting Features for Words and Sentences:** Apply feature definitions to all sentences.
5. **Defining Input and Target Variables:** Prepare input features and labels for training and testing.
6. **Building the Model:** Train the CRF model on the training dataset.
7. **Evaluating the Model:** Evaluate the model on the test dataset using F1 score and classification metrics.
8. **Identifying Diseases and Predicted Treatments:** Extract relationships between diseases and treatments using the trained model.


## Step 1: Data Preprocessing
The dataset is provided in tokenized format, where each word is stored on a separate line, and sentences are separated by blank lines. In this step, I will:
1. Reconstruct sentences and labels from the training and testing datasets.
2. Count the number of sentences and labels in the processed datasets.


In [1]:
# Paths to the dataset files
train_sent_path = 'data/train_sent'
train_label_path = 'data/train_label'
test_sent_path = 'data/test_sent'
test_label_path = 'data/test_label'

def process_data(file_path):
    """
    Read a dataset file and reconstruct sentences or labels.

    Parameters:
    file_path (str): Path to the file containing data in tokenized format.

    Returns:
    list: A list of sentences or labels reconstructed from the file.
    """
    sentences = []
    current_sentence = []

    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            if line == "":  # A blank line indicates the end of a sentence
                if current_sentence:
                    sentences.append(current_sentence)
                    current_sentence = []
            else:
                current_sentence.append(line)
        if current_sentence:  # Add the last sentence if the file does not end with a blank line
            sentences.append(current_sentence)

    return sentences

# Process train and test datasets
train_sentences = process_data(train_sent_path)
train_labels = process_data(train_label_path)
test_sentences = process_data(test_sent_path)
test_labels = process_data(test_label_path)

# Verify by printing counts
print(f"Number of sentences in train dataset: {len(train_sentences)}")
print(f"Number of sentences in test dataset: {len(test_sentences)}")
print(f"Number of label lines in train dataset: {len(train_labels)}")
print(f"Number of label lines in test dataset: {len(test_labels)}")


Number of sentences in train dataset: 2599
Number of sentences in test dataset: 1056
Number of label lines in train dataset: 2599
Number of label lines in test dataset: 1056
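
Equal sentence and label counts do not by themselves guarantee one label per token within each sentence. A minimal sketch of a per-sentence check follows. `find_misaligned` is a hypothetical helper, and the toy lists stand in for `train_sentences` and `train_labels`.

```python
def find_misaligned(sentences, labels):
    """Return indices of sentences whose token count differs from their label count."""
    return [
        i
        for i, (tokens, tags) in enumerate(zip(sentences, labels))
        if len(tokens) != len(tags)
    ]

# Toy data standing in for the reconstructed datasets (not taken from the real files)
toy_sentences = [["Aspirin", "reduces", "stroke", "risk"], ["No", "effect"]]
toy_labels = [["T", "O", "D", "O"], ["O"]]  # second sequence is one label short

misaligned = find_misaligned(toy_sentences, toy_labels)
print(misaligned)
```

An empty result on the real data would mean every token has exactly one label, which the CRF requires.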


## Step 2: Concept Identification
In this step, I will identify key concepts (e.g., diseases and treatments) from the dataset by:
1. Performing Part-of-Speech (PoS) tagging on the text data.
2. Extracting tokens with PoS tags corresponding to nouns (`NOUN` and `PROPN`).
3. Counting the frequency of these tokens across the entire dataset (both training and testing data).
4. Printing the top 25 most frequently mentioned concepts.


In [None]:
# For formatting outputs
from tabulate import tabulate

In [3]:
import spacy
from collections import Counter

# Load spaCy model for PoS tagging
nlp = spacy.load("en_core_web_sm")

def extract_noun_phrases(sentences):
    """
    Extract nouns and proper nouns from the given sentences.

    Parameters:
    sentences (list): A list of tokenized sentences.

    Returns:
    list: A list of nouns and proper nouns extracted from the sentences.
    """
    nouns = []
    for sentence in sentences:
        doc = nlp(" ".join(sentence))
        for token in doc:
            if token.pos_ in ["NOUN", "PROPN"]:  # Select nouns and proper nouns
                nouns.append(token.text.lower())
    return nouns

# Combine training and testing sentences for concept identification
all_sentences = train_sentences + test_sentences

# Extract nouns and calculate their frequencies
nouns = extract_noun_phrases(all_sentences)
noun_frequencies = Counter(nouns)

# Print the top 25 most common nouns
table_data = [[concept, freq] for concept, freq in noun_frequencies.most_common(25)]
print(tabulate(table_data, headers=["Concept", "Frequency"], tablefmt="github"))

| Concept      |   Frequency |
|--------------|-------------|
| patients     |         507 |
| treatment    |         304 |
| %            |         247 |
| cancer       |         211 |
| therapy      |         177 |
| study        |         174 |
| disease      |         149 |
| cell         |         142 |
| lung         |         118 |
| results      |         116 |
| group        |         111 |
| effects      |          99 |
| gene         |          91 |
| chemotherapy |          91 |
| use          |          87 |
| effect       |          82 |
| women        |          81 |
| analysis     |          76 |
| risk         |          74 |
| surgery      |          73 |
| cases        |          72 |
| p            |          72 |
| rate         |          68 |
| survival     |          67 |
| response     |          66 |
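
The table includes tokens such as `%` and `p`, which the tagger labels as nouns but which are not concepts. One possible cleanup is sketched below, assuming that only alphabetic tokens of at least three characters are of interest. `filter_concepts` and `min_length` are hypothetical names, and the counts are copied from a few rows of the table above.

```python
from collections import Counter

def filter_concepts(frequencies, min_length=3):
    """Keep only alphabetic tokens with at least `min_length` characters."""
    return Counter({
        token: count
        for token, count in frequencies.items()
        if token.isalpha() and len(token) >= min_length
    })

# A few rows copied from the frequency table above
sample = Counter({"patients": 507, "treatment": 304, "%": 247, "cancer": 211, "p": 72})

for concept, freq in filter_concepts(sample).most_common():
    print(concept, freq)
```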


## Step 3: Defining Features for CRF
This step involves defining the features for training the Conditional Random Field (CRF) model. The features will capture:
1. Word-level attributes (e.g., lowercase form, capitalization, title-case, digits).
2. Part-of-Speech (PoS) tags for the current word, as well as preceding and succeeding words.
3. Contextual information, such as bigrams and sentence boundaries (start and end indicators).

The features are essential for capturing the relationships and contexts necessary for NER.


In [None]:
def word2features(sentence, i, doc=None):
    """
    Generate features for a single word in a sentence with context relationships.

    Parameters:
    sentence (list): A list of tokens (words) in the sentence.
    i (int): Index of the word in the sentence.
    doc (spacy.tokens.Doc, optional): The parsed sentence. If omitted, the sentence is parsed here.

    Returns:
    dict: A dictionary of features for the word.
    """
    word = sentence[i]
    features = {
        'word.lower()': word.lower(),  # Lowercase of the word
        'word.isupper()': word.isupper(),  # Is the word in uppercase
        'word.istitle()': word.istitle(),  # Is the word title-cased
        'word.isdigit()': word.isdigit(),  # Is the word a digit
    }

    # PoS tagging using spaCy. Parsing is expensive, so reuse the Doc passed in by
    # sent2features instead of re-parsing the whole sentence for every word.
    # Note: spaCy re-tokenizes the joined string, so doc[i] matches sentence[i]
    # only when spaCy does not split any token further (e.g. at punctuation).
    if doc is None:
        doc = nlp(" ".join(sentence))
    pos_tag = doc[i].pos_
    features['pos'] = pos_tag  # PoS tag of the current word

    # Features for the beginning of a sentence
    if i == 0:
        features['BOS'] = True  # Beginning of a sentence
    else:
        features['BOS'] = False  # Not the beginning
        features['prev_word.lower()'] = sentence[i-1].lower()  # Lowercase of the previous word
        features['prev_word.pos'] = doc[i-1].pos_  # PoS tag of the previous word

    # Features for the end of a sentence
    if i == len(sentence) - 1:
        features['EOS'] = True  # End of a sentence
    else:
        features['EOS'] = False  # Not the end
        features['next_word.lower()'] = sentence[i+1].lower()  # Lowercase of the next word
        features['next_word.pos'] = doc[i+1].pos_  # PoS tag of the next word

    # Bigram features: Combine current word with previous and next words
    if i > 0:
        features['bigram.prev'] = sentence[i-1].lower() + "_" + word.lower()
    if i < len(sentence) - 1:
        features['bigram.next'] = word.lower() + "_" + sentence[i+1].lower()

    return features

def sent2features(sentence):
    """
    Generate features for all words in a sentence.

    Parameters:
    sentence (list): A list of tokens (words) in the sentence.

    Returns:
    list: A list of dictionaries, each containing features for a word.
    """
    doc = nlp(" ".join(sentence))  # Parse once per sentence and share the result
    return [word2features(sentence, i, doc) for i in range(len(sentence))]
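
The feature functions above look up `doc[i]` using the position of the word in the whitespace-tokenized sentence. spaCy re-tokenizes the joined string and may split a token further (for example at hyphens or punctuation), in which case `doc[i]` no longer refers to `sentence[i]`. One possible way to keep the two aligned, sketched here as an assumption rather than as this notebook's method, is to build the `Doc` directly from the existing tokens and then run the pipeline components on it. `tag_pretokenized` is a hypothetical helper. The demo uses `spacy.blank("en")` so that it runs without a model download, which means `pos_` stays empty there; passing the `en_core_web_sm` pipeline instead would populate the tags.

```python
import spacy
from spacy.tokens import Doc

def tag_pretokenized(nlp, words):
    """Build a Doc with exactly one token per input word, then run the pipeline on it."""
    doc = Doc(nlp.vocab, words=words)
    for _, component in nlp.pipeline:  # (name, component) pairs; empty for a blank pipeline
        doc = component(doc)
    return doc

# Tokenizer-only pipeline for the demo; swap in spacy.load("en_core_web_sm") for PoS tags
demo_nlp = spacy.blank("en")
words = ["non-small", "cell", "lung", "cancer", "(", "NSCLC", ")"]

aligned = tag_pretokenized(demo_nlp, words)   # one spaCy token per dataset token
retokenized = demo_nlp(" ".join(words))       # spaCy's own tokenization of the joined text

print(len(words), len(aligned), len(retokenized))
```

If the last two numbers differ on a real sentence, positional lookups into the re-tokenized `Doc` are reading tags from the wrong tokens.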

## Step 4: Getting Features for Words and Sentences
Using the feature extraction functions defined earlier, I will generate features for all sentences in the training and testing datasets. This involves:
1. Applying `sent2features` to each sentence.
2. Preparing the data in a format suitable for training and evaluating the CRF model.


In [5]:
def prepare_features_and_labels(sentences, labels):
    """
    Generate features and labels for all sentences in the dataset.

    Parameters:
    sentences (list): A list of sentences, where each sentence is a list of tokens (words).
    labels (list): A list of label sequences, where each sequence corresponds to a sentence.

    Returns:
    tuple: A tuple containing:
        - features (list): A list of feature dictionaries for each sentence.
        - labels (list): A list of label sequences for each sentence.
    """
    features = [sent2features(sentence) for sentence in sentences]
    return features, labels

# Prepare features and labels for the train dataset
train_features, train_labels = prepare_features_and_labels(train_sentences, train_labels)

# Prepare features and labels for the test dataset
test_features, test_labels = prepare_features_and_labels(test_sentences, test_labels)


## Step 5: Defining Input and Target Variables
In this step, I will define the input features and target labels for the CRF model:
1. Input Variables: Features extracted for each word in the sentences.
2. Target Variables: Corresponding labels (`O`, `D`, `T`) for each word in the sentences.

Additionally, I will display a random example from the training dataset in a tabular format to inspect the features and labels.


In [16]:
import random
from tabulate import tabulate

# Display the number of samples for training and testing
print(f"Number of training samples: {len(train_features)}")
print(f"Number of testing samples: {len(test_features)}")


Number of training samples: 2599
Number of testing samples: 1056
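
Because the targets are per-token label sequences, class balance is measured over tokens rather than sentences. A small sketch of that count follows. `label_distribution` is a hypothetical helper and the toy sequences stand in for `train_labels`.

```python
from collections import Counter
from itertools import chain

def label_distribution(label_sequences):
    """Count how often each label occurs across all tokens."""
    return Counter(chain.from_iterable(label_sequences))

# Toy label sequences (not taken from the dataset)
toy_labels = [["O", "O", "D", "D", "O", "T"], ["O", "T", "O"]]
counts = label_distribution(toy_labels)
total = sum(counts.values())

for label, count in counts.most_common():
    print(f"{label}: {count} ({count / total:.1%})")
```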


In [20]:
# Function to display features and labels in a tabular format
def display_random_example(features, labels, sentences):
    """
    Display a random example from the dataset in a tabular format.

    Parameters:
    features (list): List of feature dictionaries for the dataset.
    labels (list): List of label sequences corresponding to the features.
    sentences (list): List of tokenized sentences.
    """
    # Select a random example
    random_index = random.randint(0, len(features) - 1)
    example_features = features[random_index]
    example_labels = labels[random_index]
    example_sentence = sentences[random_index]
    
    # Prepare the data for tabulation
    table_data = []
    for i, (word, label, feature) in enumerate(zip(example_sentence, example_labels, example_features)):
        row = [i + 1, word, label] + [f"{key}: {value}" for key, value in feature.items()]
        table_data.append(row)
    
    # Define headers for the table
    headers = ["Index", "Word", "Label"] + [f"Feature {i + 1}" for i in range(len(example_features[0]))]
    
    # Display the table using tabulate
    print(f"\nRandom Example from Training Set (Index {random_index}):")
    print(tabulate(table_data, headers=headers, tablefmt="github"))

# Display a random example from the training set
display_random_example(train_features, train_labels, train_sentences)



Random Example from Training Set (Index 1892):
|   Index | Word           | Label   | Feature 1                    | Feature 2             | Feature 3             | Feature 4             | Feature 5   | Feature 6   | Feature 7                         | Feature 8             | Feature 9          | Feature 10                     |
|---------|----------------|---------|------------------------------|-----------------------|-----------------------|-----------------------|-------------|-------------|-----------------------------------|-----------------------|--------------------|--------------------------------|
|       1 | Immunogenicity | O       | word.lower(): immunogenicity | word.isupper(): False | word.istitle(): True  | word.isdigit(): False | pos: NOUN   | BOS: True   | EOS: False                        | next_word.lower(): of | next_word.pos: ADP | bigram.next: immunogenicity_of |
|       2 | of             | O       | word.lower(): of             | word.isupper(): False | word.i

## Step 6: Building the Model

In this step, I will build and train a Conditional Random Field (CRF) model for the custom NER task. A CRF scores the whole label sequence of a sentence jointly: it learns weights for the per-token features and for the transitions between adjacent labels (`O`, `D`, `T`). The model is trained on the features and labels prepared in the previous steps.
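
For reference, a linear-chain CRF is usually written as follows. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $x$ is the token sequence of length $n$, $y$ the label sequence, $f_k$ the feature functions (derived from dictionaries such as those returned by `word2features`, plus label transitions), and $w_k$ the learned weights.

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{n} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{n} \sum_{k} w_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$

Training with L1 and L2 penalties then minimizes a regularized negative log-likelihood. The form below assumes the usual elastic-net objective, with $c_1$ and $c_2$ playing the role of the `c1` and `c2` arguments, up to constant scaling conventions:

$$
\min_{w} \; -\sum_{i} \log p\left(y^{(i)} \mid x^{(i)}\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$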


In [7]:
from sklearn_crfsuite import CRF

# Initialize the CRF model
crf_model = CRF(
    algorithm='lbfgs',  # Optimization algorithm
    c1=0.1,             # Coefficient for L1 regularization
    c2=0.1,             # Coefficient for L2 regularization
    max_iterations=100, # Maximum number of iterations
    all_possible_transitions=True  # Allow transitions between all states
)

# Train the CRF model on the training data
crf_model.fit(train_features, train_labels)

crf_model

## Step 7: Evaluating the Model
In this step, I will evaluate the CRF model's performance using the test dataset. The model will:
1. Predict labels for each token in the test sentences.
2. Calculate the support-weighted F1 score across all labels (`average='weighted'`) as a measure of overall performance.
3. Display a detailed classification report to analyze the model's predictions for each label (`O`, `D`, `T`).


In [8]:
from sklearn_crfsuite import metrics

# Predict labels for the test dataset
test_predictions = crf_model.predict(test_features)

# Evaluate the model using the F1 score
f1_score = metrics.flat_f1_score(
    test_labels, test_predictions, average='weighted', labels=crf_model.classes_
)

print(f"F1 Score: {f1_score:.2f}")

# Print classification report for detailed evaluation
classification_report = metrics.flat_classification_report(
    test_labels, test_predictions, labels=crf_model.classes_, digits=3
)
print("Classification Report:")
print(classification_report)


F1 Score: 0.91
Classification Report:
              precision    recall  f1-score   support

           O      0.931     0.983     0.956     16127
           D      0.828     0.560     0.668      1450
           T      0.813     0.468     0.594      1041

    accuracy                          0.922     18618
   macro avg      0.857     0.670     0.739     18618
weighted avg      0.916     0.922     0.914     18618
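
The summary rows of the report can be reproduced from the per-label rows. The sketch below recomputes them from the rounded values printed above, so the last digit may differ slightly from the report. `f1` is a hypothetical helper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) copied from the classification report above
report = {
    "O": (0.931, 0.983, 16127),
    "D": (0.828, 0.560, 1450),
    "T": (0.813, 0.468, 1041),
}

per_label = {label: f1(p, r) for label, (p, r, _) in report.items()}
total_support = sum(support for _, _, support in report.values())

# Macro average: every label counts equally
macro_f1 = sum(per_label.values()) / len(per_label)
# Weighted average: every label counts in proportion to its number of tokens
weighted_f1 = sum(per_label[label] * report[label][2] for label in report) / total_support

print({label: round(score, 3) for label, score in per_label.items()})
print(f"macro: {macro_f1:.3f}, weighted: {weighted_f1:.3f}")
```

The weighted score is dominated by the `O` label, which accounts for 16127 of the 18618 test tokens, so the macro average and the per-label rows for `D` and `T` say more about entity recognition quality.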



## Step 8: Identifying Diseases and Predicted Treatments
In this step, I will extract diseases and their corresponding treatments from the test dataset using the trained CRF model. The output will be structured as a dictionary, where:
- Each disease (label `D`) is a key.
- Treatments (label `T`) associated with the disease are the values.

Additionally, the results for the specific disease "hereditary retinoblastoma" will be explicitly extracted to meet the assignment's requirements.


In [13]:
from collections import defaultdict
import spacy
import re

# Load spaCy's small English model for dependency parsing
nlp = spacy.load("en_core_web_sm")

def extract_diseases_and_treatments(sentences, predictions):
    """
    Extract diseases and treatments, including descriptive multi-word entities,
    with reduced noise using dependency parsing and validation.

    Parameters:
    sentences (list): A list of tokenized sentences.
    predictions (list): A list of predicted label sequences for each sentence.

    Returns:
    dict: A dictionary where keys are diseases (D) with descriptors and values are lists of treatments (T).
    """
    disease_treatment_map = defaultdict(list)

    def is_valid_entity(entity):
        """
        Validate if the extracted entity is meaningful.

        Parameters:
        entity (str): The entity to validate.

        Returns:
        bool: True if the entity is valid, False otherwise.
        """
        # Disallow entities that contain parentheses or digits, and empty entities
        if re.search(r"[()\d]", entity) or len(entity.split()) < 1:
            return False
        # Exclude overly generic terms
        if entity.lower() in ["disease", "cancer", "advanced disease"]:
            return False
        return True

    def is_valid_treatment(treatment):
        """
        Validate if the extracted treatment is meaningful.

        Parameters:
        treatment (str): The treatment to validate.

        Returns:
        bool: True if the treatment is valid, False otherwise.
        """
        # Exclude generic terms and overly short treatments
        invalid_terms = {"and", "with", "the", "of"}
        return treatment.isalpha() and len(treatment) > 2 and treatment.lower() not in invalid_terms

    for sentence, prediction in zip(sentences, predictions):
        # Convert the tokenized sentence into a spaCy Doc object for dependency parsing
        doc = nlp(" ".join(sentence))

        current_disease = None
        for idx, (word, label) in enumerate(zip(sentence, prediction)):
            if label == "D":  # Identify disease
                # Start forming a multi-word entity
                token = doc[idx]
                descriptor = set()

                # Add adjectives or compound descriptors linked to the disease
                for child in token.children:
                    if child.dep_ in ["amod", "compound"] and child.pos_ in ["ADJ", "NOUN"]:
                        descriptor.add(child.text)

                # Check for preceding descriptors in the sentence
                j = idx - 1
                while j >= 0 and prediction[j] == "O":
                    prev_token = doc[j]
                    if prev_token.dep_ in ["amod", "compound"] and prev_token.pos_ in ["ADJ", "NOUN"]:
                        descriptor.add(sentence[j])
                    j -= 1

                # Combine descriptor with the disease
                descriptor_list = list(descriptor)
                current_disease = " ".join(descriptor_list + [word])

                # Include subsequent words labeled as `D` to form a multi-word entity
                k = idx + 1
                while k < len(sentence) and prediction[k] == "D":
                    current_disease += f" {sentence[k]}"
                    k += 1

                # Note: the enclosing loop still visits each remaining `D` token of this
                # span, so the entity is rebuilt starting from every later token in it.

                # Validate disease entity
                if not is_valid_entity(current_disease):
                    current_disease = None

            elif label == "T" and current_disease:  # Associate treatment with the disease
                if is_valid_treatment(word):
                    disease_treatment_map[current_disease].append(word)

    # Post-process the map: deduplicate treatments and re-validate each disease
    final_map = {}
    for disease, treatments in disease_treatment_map.items():
        meaningful_treatments = list(set(t for t in treatments if is_valid_treatment(t)))  # Deduplicate treatments
        if is_valid_entity(disease):
            final_map[disease] = meaningful_treatments

    return final_map
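
Consecutive tokens that share a label can also be merged into spans before any pairing logic runs, which keeps the original word order and visits each multi-word entity once. The following is a simplified alternative sketch, not the function above: it omits the dependency-based descriptors and the validation rules, and `group_entities` is a hypothetical helper.

```python
from itertools import groupby

def group_entities(tokens, labels):
    """Merge runs of consecutive tokens with the same non-`O` label into (label, text) spans."""
    spans = []
    pairs = zip(tokens, labels)
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label != "O":
            spans.append((label, " ".join(token for token, _ in run)))
    return spans

# Toy sentence with hand-written labels (not model output)
tokens = ["Radiotherapy", "for", "hereditary", "retinoblastoma", "was", "studied"]
labels = ["T", "O", "D", "D", "O", "O"]

print(group_entities(tokens, labels))
```

Pairing each `D` span with the `T` spans of the same sentence could then be done on the returned list rather than on individual tokens.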


In [14]:
# Extract diseases and treatments using test sentences and predictions
disease_treatment_dict = extract_diseases_and_treatments(test_sentences, test_predictions)

In [15]:
# Display the complete disease-treatment mapping
print("\nComplete Disease-Treatment Dictionary:")
table_data = [[disease, ', '.join(treatments)] for disease, treatments in disease_treatment_dict.items()]
print(tabulate(table_data, headers=["Disease", "Treatments"], tablefmt="github"))


Complete Disease-Treatment Dictionary:
| Disease                                   | Treatments                                                              |
|-------------------------------------------|-------------------------------------------------------------------------|
| diabetes gestational cases                | control, good, glycemic                                                 |
| hereditary retinoblastoma                 | radiotherapy                                                            |
| myocardial infarction                     | aspirin, warfarin                                                       |
| hemorrhagic stroke                        | alteplase, infusion, accelerated                                        |
| proteinuric hypertension                  | insemination, donor, intrauterine, sperm                                |
| insemination partner preeclampsia         | insemination, donor                                                     |


In [12]:
# Display results for "hereditary retinoblastoma"
specific_disease = "hereditary retinoblastoma"
specific_treatments = disease_treatment_dict.get(specific_disease, [])

if specific_treatments:
    print(f"Predicted treatments for the disease '{specific_disease}': {', '.join(specific_treatments)}")
else:
    print(f"No treatments found for the disease '{specific_disease}'.")

Predicted treatments for the disease 'hereditary retinoblastoma': radiotherapy
