# Named Entity Recognition for Medical Data

## Objective
In this notebook, we aim to build a custom Named Entity Recognition (NER) model for identifying diseases and their treatments from medical data. This involves the following key steps:
1. Data preprocessing
2. Concept identification
3. Defining features for a Conditional Random Field (CRF) model
4. Extracting features and labels for sentences (the input and target variables)
5. Training a CRF model
6. Evaluating the model
7. Extracting diseases and their treatments using custom NER
8. Predicting the treatment for a given disease

## Dataset Description
The dataset comprises individual words and their corresponding labels:
- **Train Sentences:** `train_sent`
- **Train Labels:** `train_label`
- **Test Sentences:** `test_sent`
- **Test Labels:** `test_label`

Each file holds one word (or one label) per line, and a blank line marks the end of a sentence. The labels are:
- `D`: Disease
- `T`: Treatment
- `O`: Other
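
As a rough illustration of this layout, the sketch below parses a tiny made-up sample held in memory rather than the real files. The helper name `read_blocks` and the sample words are assumptions for illustration only; it keeps each sentence as a list of items instead of joining them into a string.

```python
import io

def read_blocks(handle):
    """Group one-item-per-line input into lists, splitting on blank lines."""
    blocks, current = [], []
    for line in handle:
        item = line.strip()
        if item:
            current.append(item)
        elif current:  # a blank line closes the current sentence
            blocks.append(current)
            current = []
    if current:  # input did not end with a blank line
        blocks.append(current)
    return blocks

# Made-up sample, not taken from the dataset
sample_words = io.StringIO("Aspirin\nrelieves\nheadache\n\nRest\nhelps\n")
sample_tags = io.StringIO("T\nO\nD\n\nO\nO\n")

word_blocks = read_blocks(sample_words)
tag_blocks = read_blocks(sample_tags)
print(word_blocks)
print(tag_blocks)
```

Each inner list of words lines up position by position with the inner list of labels for the same sentence.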

Our goal is to process this data, train a CRF model, and extract insights in the form of disease-treatment mappings.


## Step 1: Data Preprocessing

### Objective:
- Construct proper sentences from individual words by processing the train and test datasets.
- Count the number of sentences and label lines in both the train and test datasets.

### Approach:
1. Load the datasets from the provided files.
2. Process the data to reconstruct sentences and their corresponding labels.
3. Count and print the required statistics.


In [1]:
# !pip install -U spacy
# !python -m spacy download en_core_web_sm
# !pip install sklearn-crfsuite
# !pip install tabulate

In [2]:
# Function to read and process the dataset into sentences or labels
def process_data(file_path):
    sentences = []
    current_sentence = []
    
    with open(file_path, 'r') as file:
        for line in file:
            line = line.strip()
            if line == "":  # A blank line indicates the end of a sentence
                if current_sentence:
                    sentences.append(" ".join(current_sentence))
                    current_sentence = []
            else:
                current_sentence.append(line)
        if current_sentence:  # Add the last sentence if the file does not end with a blank line
            sentences.append(" ".join(current_sentence))
    
    return sentences

In [3]:
# Paths to the dataset files
train_sent_path = 'data/train_sent'
train_label_path = 'data/train_label'
test_sent_path = 'data/test_sent'
test_label_path = 'data/test_label'

# Process the datasets
train_sentences = process_data(train_sent_path)
train_labels = process_data(train_label_path)
test_sentences = process_data(test_sent_path)
test_labels = process_data(test_label_path)

# Print counts for verification
print(f"Number of sentences in train dataset: {len(train_sentences)}")
print(f"Number of label lines in train dataset: {len(train_labels)}")
print(f"Number of sentences in test dataset: {len(test_sentences)}")
print(f"Number of label lines in test dataset: {len(test_labels)}")


Number of sentences in train dataset: 2599
Number of label lines in train dataset: 2599
Number of sentences in test dataset: 1056
Number of label lines in test dataset: 1056
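
Matching sentence counts do not by themselves show that every word has a label. A minimal check, assuming sentences and label lines are space-joined strings as produced by `process_data`, could look like the sketch below. The function name `find_length_mismatches` and the demo data are hypothetical.

```python
def find_length_mismatches(sentences, labels):
    """Return indices where the token count and the label count differ."""
    return [
        i
        for i, (sentence, label_line) in enumerate(zip(sentences, labels))
        if len(sentence.split()) != len(label_line.split())
    ]

# Made-up data: the second pair is deliberately one label short
demo_sentences = ["Aspirin relieves headache", "Rest helps recovery"]
demo_labels = ["T O D", "O O"]
print(find_length_mismatches(demo_sentences, demo_labels))  # [1]
```

On the notebook's data the call would be `find_length_mismatches(train_sentences, train_labels)`; an empty list means every token has exactly one label.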


In [4]:
from tabulate import tabulate

In [5]:
# Function to display sentence and labels in tabular format
def display_sentence_and_labels(sentences, labels, n=5):
    print(f"\nDisplaying the first {n} sentences with their labels in tabular format:\n")
    
    for i in range(n):
        words = sentences[i].split()
        label_list = labels[i].split()
        table = [[word, label] for word, label in zip(words, label_list)]
        print(f"Sentence {i+1}:")
        print(tabulate(table, headers=["Word", "Label"], tablefmt="grid"))
        print("-" * 50)

# Display the first sentence and its labels
display_sentence_and_labels(train_sentences, train_labels, n=1)


Displaying the first 1 sentences with their labels in tabular format:

Sentence 1:
+-----------------+---------+
| Word            | Label   |
| All             | O       |
+-----------------+---------+
| live            | O       |
+-----------------+---------+
| births          | O       |
+-----------------+---------+
| >               | O       |
+-----------------+---------+
| or              | O       |
+-----------------+---------+
| =               | O       |
+-----------------+---------+
| 23              | O       |
+-----------------+---------+
| weeks           | O       |
+-----------------+---------+
| at              | O       |
+-----------------+---------+
| the             | O       |
+-----------------+---------+
| University      | O       |
+-----------------+---------+
| of              | O       |
+-----------------+---------+
| Vermont         | O       |
+-----------------+---------+
| in              | O       |
+-----------------+---------+
| 1995          

## Step 2: Concept Identification

### Objective:
- Extract tokens with `NOUN` or `PROPN` PoS tags from the combined dataset (train + test sentences).
- Identify and count the frequency of these tokens.
- Print the top 25 most frequent concepts.

### Approach:
1. Use spaCy for PoS tagging.
2. Combine the train and test datasets.
3. Identify tokens with `NOUN` or `PROPN` PoS tags and compute their frequencies.
4. Print the top 25 most frequent concepts.
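
The counting step can be sketched without spaCy by assuming the tokens are already tagged. The `(text, pos)` pairs below are hand-tagged, made-up stand-ins for spaCy output, and `count_concepts` is a name introduced only for this illustration.

```python
from collections import Counter

def count_concepts(tagged_tokens, keep=("NOUN", "PROPN")):
    """Count lowercased tokens whose PoS tag is in `keep`."""
    return Counter(text.lower() for text, pos in tagged_tokens if pos in keep)

# Hand-tagged, made-up tokens standing in for spaCy output
tagged = [
    ("Patients", "NOUN"), ("received", "VERB"), ("chemotherapy", "NOUN"),
    ("for", "ADP"), ("cancer", "NOUN"), ("patients", "NOUN"),
]
concept_counts = count_concepts(tagged)
print(concept_counts.most_common(3))
```

Lowercasing before counting merges "Patients" and "patients" into one concept, which is the same choice the notebook makes.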


In [6]:
import spacy
from collections import Counter

# Load the spaCy language model
nlp = spacy.load("en_core_web_sm")

# Combine train and test sentences
all_sentences = train_sentences + test_sentences

# Function to extract NOUN and PROPN tokens and their frequencies
def extract_nouns_and_proper_nouns(sentences):
    noun_counter = Counter()
    # nlp.pipe processes the sentences in batches, which is faster than one nlp() call per sentence
    for doc in nlp.pipe(sentences):
        for token in doc:
            if token.pos_ in ("NOUN", "PROPN"):
                noun_counter[token.text.lower()] += 1
    return noun_counter

# Extract and count NOUN/PROPN tokens
noun_counts = extract_nouns_and_proper_nouns(all_sentences)

In [7]:
# Display the top 25 most frequent NOUN/PROPN tokens
def display_top_tokens(token_counts, n=25):
    print(f"\nTop {n} most frequent NOUN/PROPN tokens:\n")
    
    # Prepare the data for tabulation
    token_table = [["Token", "Frequency"]] + token_counts.most_common(n)
    print(tabulate(token_table, headers="firstrow", tablefmt="grid"))

# Display the top 25 tokens in a tabular format
display_top_tokens(noun_counts, n=25)


Top 25 most frequent NOUN/PROPN tokens:

+--------------+-------------+
| Token        |   Frequency |
| patients     |         507 |
+--------------+-------------+
| treatment    |         304 |
+--------------+-------------+
| %            |         247 |
+--------------+-------------+
| cancer       |         211 |
+--------------+-------------+
| therapy      |         177 |
+--------------+-------------+
| study        |         174 |
+--------------+-------------+
| disease      |         149 |
+--------------+-------------+
| cell         |         142 |
+--------------+-------------+
| lung         |         118 |
+--------------+-------------+
| results      |         116 |
+--------------+-------------+
| group        |         111 |
+--------------+-------------+
| effects      |          99 |
+--------------+-------------+
| gene         |          91 |
+--------------+-------------+
| chemotherapy |          91 |
+--------------+-------------+
| use          |          87

## Step 3: Defining Features for the CRF Model

### Objective:
- Define features for each token in a sentence for use in the CRF model.
- Include the following as features:
  1. Lowercased form, casing, and digit flags of the token.
  2. PoS tag of the token.
  3. Lowercased form and casing of the preceding and following words.
  4. Markers for the beginning and end of a sentence.

### Approach:
1. Define a function to extract features for a single word in a sentence.
2. Include word-level and sentence-level features for better model performance.


In [8]:
# Function to extract features for a single word in a sentence
def word2features(sentence, index, doc=None):
    word = sentence[index]
    features = {
        'word.lower()': word.lower(),
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }

    # Add PoS tag as a feature using spaCy. The sentence is parsed here only
    # when the caller has not passed an already parsed doc.
    # Caveat: doc[index] lines up with sentence[index] only while spaCy's
    # tokenization matches the whitespace split.
    if doc is None:
        doc = nlp(" ".join(sentence))
    token = doc[index]
    features['pos'] = token.pos_

    # Features for the previous word
    if index > 0:
        prev_word = sentence[index - 1]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:word.isupper()': prev_word.isupper(),
            '-1:word.istitle()': prev_word.istitle(),
        })
    else:
        features['BOS'] = True  # Beginning of Sentence

    # Features for the next word
    if index < len(sentence) - 1:
        next_word = sentence[index + 1]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:word.isupper()': next_word.isupper(),
            '+1:word.istitle()': next_word.istitle(),
        })
    else:
        features['EOS'] = True  # End of Sentence

    return features

# Function to extract features for all words in a sentence
def sentence2features(sentence):
    # Parse once per sentence instead of once per word
    doc = nlp(" ".join(sentence))
    return [word2features(sentence, i, doc) for i in range(len(sentence))]

# Example: Extract features for a sample sentence
sample_sentence = train_sentences[0].split()
sample_features = sentence2features(sample_sentence)
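
`word2features` looks up the PoS tag with `doc[index]`, where `index` comes from the whitespace split. That only works when the tagger's tokenizer produces exactly one token per whitespace-separated word. The sketch below shows one way to spot words that would break this assumption. The regex tokenizer is a stand-in and not spaCy's tokenizer, and checking each word in isolation is only an approximation of how a tokenizer behaves in context.

```python
import re

def misaligned_positions(words, tokenize):
    """Return indices of whitespace-split words that `tokenize` splits further."""
    return [i for i, word in enumerate(words) if len(tokenize(word)) != 1]

def simple_tokenize(text):
    """Stand-in tokenizer (an assumption, not spaCy): splits punctuation from word characters."""
    return re.findall(r"\w+|[^\w\s]", text)

# Made-up sentence for illustration
words = "Survival was 50 % at follow-up ( n=12 )".split()
problem_indices = misaligned_positions(words, simple_tokenize)
print(problem_indices)                      # [5, 7]
print([words[i] for i in problem_indices])  # ['follow-up', 'n=12']
```

With spaCy, a callable such as `lambda w: [t.text for t in nlp.make_doc(w)]` could be passed as `tokenize` instead of the stand-in.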


In [9]:
# Display features of the first sentence in a tabular format
def display_features(sentence, features):
    print("Features for the first sentence in tabular format:\n")
    
    # Prepare the data for tabulation
    feature_table = []
    for i, word_features in enumerate(features):
        row = [f"Word {i+1}", sentence[i]] + [f"{key}: {value}" for key, value in word_features.items()]
        feature_table.append(row)
    
    # Tabulate the results
    headers = ["Index", "Word"] + [f"Feature {i+1}" for i in range(len(features[0]))]
    print(tabulate(feature_table, headers=headers, tablefmt="grid"))

# Display features for the first sentence
display_features(sample_sentence, sample_features)

Features for the first sentence in tabular format:

+---------+-----------------+-------------------------------+-----------------------+-----------------------+-----------------------+-------------+----------------------------------+--------------------------+--------------------------+----------------------------------+
| Index   | Word            | Feature 1                     | Feature 2             | Feature 3             | Feature 4             | Feature 5   | Feature 6                        | Feature 7                | Feature 8                | Feature 9                        |
| Word 1  | All             | word.lower(): all             | word.isupper(): False | word.istitle(): True  | word.isdigit(): False | pos: DET    | BOS: True                        | +1:word.lower(): live    | +1:word.isupper(): False | +1:word.istitle(): False         |
+---------+-----------------+-------------------------------+-----------------------+-----------------------+-----------------------

## Step 4: Extracting Features and Labels for Sentences

### Objective:
- Extract features for each token in every sentence.
- Extract labels for each sentence based on the processed label lines.

### Approach:
1. Use the `sentence2features` function defined earlier to extract features for each sentence.
2. Use the processed label lines to retrieve labels for each sentence.
3. Prepare feature sets and labels for the train and test datasets.


In [10]:
# Function to convert a label string to a list of labels
def sentence2labels(label_string):
    return label_string.split()

# Extract features and labels for the train dataset
train_features = [sentence2features(sentence.split()) for sentence in train_sentences]
train_targets = [sentence2labels(labels) for labels in train_labels]

# Extract features and labels for the test dataset
test_features = [sentence2features(sentence.split()) for sentence in test_sentences]
test_targets = [sentence2labels(labels) for labels in test_labels]


In [11]:
# Function to display features and labels in tabular format
def display_features_and_labels(features, labels):
    print("Sample train sentence features and labels in tabular format:\n")
    
    # Prepare the data for tabulation
    table_data = []
    for word_features, label in zip(features, labels):
        row = [label] + [f"{key}: {value}" for key, value in word_features.items()]
        table_data.append(row)
    
    # Tabulate the results
    headers = ["Label"] + [f"Feature {i+1}" for i in range(len(features[0]))]
    print(tabulate(table_data, headers=headers, tablefmt="grid"))

# Display sample features and labels for the first train sentence
display_features_and_labels(train_features[0], train_targets[0])

Sample train sentence features and labels in tabular format:

+---------+-------------------------------+-----------------------+-----------------------+-----------------------+-------------+----------------------------------+--------------------------+--------------------------+----------------------------------+
| Label   | Feature 1                     | Feature 2             | Feature 3             | Feature 4             | Feature 5   | Feature 6                        | Feature 7                | Feature 8                | Feature 9                        |
| O       | word.lower(): all             | word.isupper(): False | word.istitle(): True  | word.isdigit(): False | pos: DET    | BOS: True                        | +1:word.lower(): live    | +1:word.isupper(): False | +1:word.istitle(): False         |
+---------+-------------------------------+-----------------------+-----------------------+-----------------------+-------------+----------------------------------+------------

## Step 5: Building the CRF Model

### Objective:
- Train a CRF model using the features and target labels from the training dataset.

### Approach:
1. Use the `sklearn-crfsuite` library to train the CRF model.
2. Configure parameters for the CRF model.
3. Fit the model using the training dataset.

### Note:
- Ensure that the `sklearn-crfsuite` library is installed. If not, install it using `pip install sklearn-crfsuite`.
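
For orientation, a linear-chain CRF and its regularized training objective are commonly written as below. The symbols $w_k$ (feature weights), $f_k$ (feature functions), and $Z(x)$ (normalizer) are notation introduced here. The coefficients $c_1$ and $c_2$ correspond to the `c1` and `c2` parameters; the exact scaling of the penalty terms inside CRFsuite may differ by constant factors, so treat this as a sketch rather than the library's definition.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} w_k \, f_k(y_{t-1}, y_t, x, t) \Big)

\min_{w} \; -\sum_{i} \log p\big(y^{(i)} \mid x^{(i)}\big)
  + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
```

Larger `c1` values push more weights to exactly zero, while larger `c2` values shrink all weights toward zero.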


In [12]:
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Initialize the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',  # Use L-BFGS optimization
    c1=0.1,             # Coefficient for L1 regularization
    c2=0.1,             # Coefficient for L2 regularization
    max_iterations=100, # Maximum number of iterations
    all_possible_transitions=True
)

# Train the CRF model
print("Training the CRF model...")
crf.fit(train_features, train_targets)
print("CRF model training complete.")


Training the CRF model...
CRF model training complete.


## Step 6: Evaluating the Model

### Objective:
- Evaluate the CRF model using the test dataset.
- Predict the labels for each token in the test sentences.
- Calculate the weighted-average F1 score over all labels (`O`, `D`, and `T`) from the actual and predicted labels.

### Approach:
1. Use the trained CRF model to predict labels for the test dataset.
2. Compare the predicted labels with the actual labels.
3. Compute the F1 score to evaluate model performance.

In [13]:
# Predict the labels for the test dataset
print("Predicting labels for the test dataset...")
test_predictions = crf.predict(test_features)

# Calculate and print the F1 score
f1_score = metrics.flat_f1_score(test_targets, test_predictions, average='weighted', labels=crf.classes_)
print(f"F1 Score: {f1_score:.4f}")

# Print classification report
print("\nClassification Report:")
report = metrics.flat_classification_report(test_targets, test_predictions, labels=crf.classes_, digits=4)
print(report)


Predicting labels for the test dataset...
F1 Score: 0.9167

Classification Report:
              precision    recall  f1-score   support

           O     0.9365    0.9782    0.9569     16127
           D     0.8118    0.5890    0.6827      1450
           T     0.7573    0.5245    0.6198      1041

    accuracy                         0.9225     18618
   macro avg     0.8352    0.6972    0.7531     18618
weighted avg     0.9167    0.9225    0.9167     18618
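
The weighted average in the report can be reproduced from the per-label rows, which also shows how much the `O` class contributes to it. The sketch below uses only the F1 and support values printed above; the helper name `weighted_f1` is introduced here for illustration.

```python
# (label, f1, support) rows copied from the classification report above
report_rows = [("O", 0.9569, 16127), ("D", 0.6827, 1450), ("T", 0.6198, 1041)]

def weighted_f1(rows):
    """Support-weighted mean of per-label F1 scores."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total

all_labels_f1 = weighted_f1(report_rows)
entity_f1 = weighted_f1([row for row in report_rows if row[0] != "O"])

print(f"Weighted F1 over O, D, T: {all_labels_f1:.4f}")
print(f"Weighted F1 over D and T only: {entity_f1:.4f}")
```

The first value matches the reported 0.9167. The second, roughly 0.656, covers only the entity labels and is computed from the rounded per-label scores. Passing `labels=['D', 'T']` to `metrics.flat_f1_score` would be one way to obtain an entity-only figure directly from the predictions.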



## Step 7: Extracting Diseases and Treatments

### Objective:
- Extract and map diseases (`D`) to their predicted treatments (`T`) from the test dataset.
- Present the mappings in a dictionary format.

### Approach:
1. Iterate through the test sentences and their predicted labels.
2. Identify tokens with the `D` (Disease) and `T` (Treatment) labels.
3. Create a dictionary where each disease token is a key and the `T` tokens that follow it in the same sentence are its treatments.

### Note:
- Treatments are stored as a de-duplicated list, so a disease can map to several treatments, or to an empty list when no treatment token follows it.


In [14]:
# Function to extract disease-treatment pairs from sentences and labels
def extract_disease_treatment_pairs(sentences, predictions):
    disease_treatment_mapping = {}
    
    for sentence, prediction in zip(sentences, predictions):
        tokens = sentence.split()
        current_disease = None
        
        for token, label in zip(tokens, prediction):
            if label == 'D':  # Disease label
                current_disease = token
                if current_disease not in disease_treatment_mapping:
                    disease_treatment_mapping[current_disease] = []
            elif label == 'T' and current_disease:  # Treatment label
                disease_treatment_mapping[current_disease].append(token)
    
    # Remove duplicates in treatment lists
    for disease in disease_treatment_mapping:
        disease_treatment_mapping[disease] = list(set(disease_treatment_mapping[disease]))
    
    return disease_treatment_mapping

# Extract disease-treatment mappings
disease_treatment_dict = extract_disease_treatment_pairs(test_sentences, test_predictions)

In [15]:
# Function to display disease-treatment mappings in tabular format
def display_disease_treatment_mappings(mappings):
    print("Extracted Disease-Treatment Mappings in Tabular Format:\n")
    
    # Prepare the data for tabulation
    table_data = [[disease, ", ".join(treatments)] for disease, treatments in mappings.items()]
    print(tabulate(table_data, headers=["Disease", "Treatments"], tablefmt="grid"))

# Display the extracted disease-treatment mappings
display_disease_treatment_mappings(disease_treatment_dict)

Extracted Disease-Treatment Mappings in Tabular Format:

+----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Disease                    | Treatments                                                                                                                                                                                                                         |
| dehydration                |                                                                                                                                                                                                                                    |
+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------
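
The mapping above is built token by token, so a multi-word disease ends up as several separate keys and a treatment mentioned before the disease is not attached to it. One possible alternative, sketched below under the assumption that consecutive tokens with the same label form one phrase, groups label runs with `itertools.groupby`. The function names and the demo sentence are hypothetical, and this is not the method used in the notebook.

```python
from itertools import groupby

def extract_spans(tokens, labels, target):
    """Join each run of consecutive `target`-labelled tokens into one phrase."""
    paired = zip(tokens, labels)
    return [
        " ".join(token for token, _ in run)
        for label, run in groupby(paired, key=lambda pair: pair[1])
        if label == target
    ]

def map_phrases(sentences, predictions):
    """Map each disease phrase to the treatment phrases found in the same sentence."""
    mapping = {}
    for sentence, labels in zip(sentences, predictions):
        tokens = sentence.split()
        diseases = extract_spans(tokens, labels, "D")
        treatments = extract_spans(tokens, labels, "T")
        for disease in diseases:
            known = mapping.setdefault(disease, [])
            # Keep first-seen order and skip treatments already recorded
            known.extend(t for t in treatments if t not in known)
    return mapping

# Made-up sentence and labels, not model output
demo_sentences = ["Radiation therapy is used for hereditary retinoblastoma"]
demo_predictions = [["T", "T", "O", "O", "O", "D", "D"]]
phrase_mapping = map_phrases(demo_sentences, demo_predictions)
print(phrase_mapping)
```

Pairing every disease phrase with every treatment phrase in a sentence is a simplification; sentences that mention several diseases would need a finer rule.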

## Step 8: Predicting the Treatment for "Hereditary Retinoblastoma"

### Objective:
- Predict the treatment for the disease named "hereditary retinoblastoma" using the trained CRF model and extracted mappings.

### Approach:
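1. Check whether the disease name exists as a key in the extracted mappings. The mapping from Step 7 is keyed by single tokens, so a multi-word name such as "hereditary retinoblastoma" cannot match a key exactly.
2. Print the corresponding treatments, or a message when no entry is found.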


In [16]:
# Predict the treatment for "hereditary retinoblastoma"
disease_to_predict = "hereditary retinoblastoma"

if disease_to_predict in disease_treatment_dict:
    treatments = disease_treatment_dict[disease_to_predict]
    print(f"Predicted treatments for '{disease_to_predict}': {', '.join(treatments)}")
else:
    print(f"No treatments found for the disease '{disease_to_predict}'.")

No treatments found for the disease 'hereditary retinoblastoma'.


In [17]:
# Predict the treatment for "restenosis"
disease_to_predict = "restenosis"

if disease_to_predict in disease_treatment_dict:
    treatments = disease_treatment_dict[disease_to_predict]
    print(f"Predicted treatments for '{disease_to_predict}': {', '.join(treatments)}")
else:
    print(f"No treatments found for the disease '{disease_to_predict}'.")

Predicted treatments for 'restenosis': coronary, angioplasty
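
Because the dictionary keys are single tokens, an exact lookup fails for multi-word disease names. A minimal workaround, assuming the token-keyed shape of `disease_treatment_dict`, is to look up each token of the name separately. The helper `lookup_treatments` and the demo mapping below are made up for illustration.

```python
def lookup_treatments(mapping, disease_name):
    """Collect treatments stored under any token of `disease_name` (case-insensitive)."""
    lowered = {key.lower(): value for key, value in mapping.items()}
    found = []
    for token in disease_name.lower().split():
        for treatment in lowered.get(token, []):
            if treatment not in found:  # keep first-seen order, no duplicates
                found.append(treatment)
    return found

# Made-up mapping in the same token-keyed shape as disease_treatment_dict
demo_mapping = {
    "restenosis": ["coronary", "angioplasty"],
    "Retinoblastoma": ["radiotherapy"],
}
multi_word_result = lookup_treatments(demo_mapping, "hereditary retinoblastoma")
single_word_result = lookup_treatments(demo_mapping, "restenosis")
print(multi_word_result)
print(single_word_result)
```

Matching on any token is deliberately loose: a generic word in the name could pull in treatments for an unrelated disease, so results should be read with care.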
