# Disease Mention Detection from Medical Abstracts
This project demonstrates how to programmatically collect medical research articles, filter for relevant content, and build a text classification model for disease mention detection. Using COVID-19 as an example, the workflow can easily be adapted to detect mentions of other diseases such as influenza, malaria, or dengue.

## Objective
* Scrape medical abstracts from PubMed.
* Filter and label abstracts that mention the disease of interest (COVID-19 is used as an example).
* Clean and preprocess the text data to ensure accuracy and usability.
* Build and train a simple classification model to detect disease mentions.
* Evaluate the performance of the trained classification model.
* Demonstrate that the pipeline is adaptable to other diseases.

## Dataset
Source: 
* The dataset was collected from PubMed, a publicly accessible repository of biomedical literature.
* Abstracts were retrieved using keyword-based queries focused on COVID-19 and general medical research.

Structure:
* Each data point consists of a PubMed abstract and a binary label: 1 (COVID-19 mentioned) or 0 (COVID-19 not mentioned).
* The final dataset contained 381 abstracts, with a significant class imbalance favoring non-COVID articles.
* Due to the class imbalance, SMOTE (Synthetic Minority Oversampling Technique) was applied to the training data to balance the classes.

Preprocessing:
* Abstracts were lowercased, stripped of punctuation, tokenized, and vectorized using TF-IDF.
* Labels were assigned automatically based on the presence of disease keywords in the cleaned abstract.

### Data Collection
We start by retrieving medical abstracts from PubMed using the `Bio.Entrez` module from the Biopython library. After fetching the articles, we extract the abstract text from the XML-formatted responses using Python's built-in `xml.etree.ElementTree` module. These abstracts form the dataset for this project.

In [1]:
from Bio import Entrez
import pandas as pd
import xml.etree.ElementTree as ET

Entrez.email = 'rmhirla@ucl.ac.uk'

# Step 1: Search PubMed for the IDs of matching articles
terms = 'covid OR covid19 OR covid-19 OR sars-cov-2 OR sarscov2 OR coronavirus OR pandemic OR outbreak OR respiratory infection'
handle = Entrez.esearch(db = 'pubmed', term = terms, retmax = 400)
record = Entrez.read(handle)
handle.close()
id_list = record['IdList']

# Step 2: Fetch article data in XML format
handle = Entrez.efetch(db = 'pubmed', id = id_list, rettype = 'xml', retmode = 'xml')
xml_data = handle.read()
handle.close()

# Step 3: Save data into an XML file
with open('pubmed_data.xml', 'wb') as f:
    f.write(xml_data)

# Step 4: Extract the abstract texts from XML data
tree = ET.parse('pubmed_data.xml')
root = tree.getroot()

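abstracts = []
for article in root.findall('.//PubmedArticle'):
    abstract_texts = article.findall('.//AbstractText')
    # itertext() also keeps text inside nested inline tags (e.g. italics, superscripts)
    parts = [''.join(elem.itertext()).strip() for elem in abstract_texts]
    text = ' '.join(part for part in parts if part)
    abstracts.append(text)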
        
# Step 5: Convert the list into a pandas dataframe
df = pd.DataFrame(abstracts, columns = ['abstracts'])
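
PubMed abstracts can contain inline markup and several labelled sections, so it can help to check how text is collected from each `AbstractText` element. The snippet below is a minimal sketch on a hand-written XML fragment; the sample content and the helper name `extract_abstracts` are illustrative assumptions, not part of the dataset. It compares `elem.text`, which stops at the first nested tag, with `itertext()`, which walks the whole element.

```python
import xml.etree.ElementTree as ET

# Hand-written sample that mimics the structure of a PubMed efetch response
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND">Infection with <i>SARS-CoV-2</i> causes COVID-19.</AbstractText>
          <AbstractText Label="RESULTS">Symptoms were mild.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article />
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def extract_abstracts(xml_string):
    """Return one abstract string per PubmedArticle (empty if none is present)."""
    root = ET.fromstring(xml_string)
    abstracts = []
    for article in root.findall('.//PubmedArticle'):
        parts = []
        for elem in article.findall('.//AbstractText'):
            # itertext() also yields text inside and after nested tags such as <i>
            parts.append(''.join(elem.itertext()).strip())
        abstracts.append(' '.join(p for p in parts if p))
    return abstracts

abstracts = extract_abstracts(SAMPLE_XML)
first_elem = ET.fromstring(SAMPLE_XML).find('.//AbstractText')
print(abstracts[0])
print(repr(first_elem.text))  # only the text before the first nested tag
```

The second article in the sample has no abstract and yields an empty string, which is worth keeping in mind when counting usable abstracts.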

### Data Cleaning

In [4]:
import re

# Remove duplicates
df = df.drop_duplicates(subset = 'abstracts').reset_index(drop = True)

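# Clean data: lowercase, unify disease-name variants, strip punctuation
def clean_text(text):
    text = text.lower()
    text = re.sub(r'covid[\s\-]?19', 'covid19', text)           # covid-19, covid 19 -> covid19
    text = re.sub(r'sars[\s\-]?cov[\s\-]?2', 'sarscov2', text)  # sars-cov-2, sars cov 2 -> sarscov2
    text = re.sub(r'[^\w\s]', ' ', text)                        # replace punctuation with spaces
    text = re.sub(r'\s+', ' ', text)                            # collapse repeated whitespace
    return text

df['abstracts_cleaned'] = df['abstracts'].apply(clean_text)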
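
To make the effect of the cleaning step concrete, here is a small worked example. It repeats the `clean_text` function from the cell above so that it runs on its own; the sample sentence is made up for illustration.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'covid[\s\-]?19', 'covid19', text)
    text = re.sub(r'sars[\s\-]?cov[\s\-]?2', 'sarscov2', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Made-up sentence containing several spelling variants of the disease name
sample = 'COVID-19 (caused by SARS-CoV-2) spread rapidly; Covid 19 cases rose'
cleaned = clean_text(sample)
print(cleaned)
# covid19 caused by sarscov2 spread rapidly covid19 cases rose
```

Collapsing the spelling variants into single tokens such as `covid19` and `sarscov2` means the later keyword matching and TF-IDF steps only need to look for one form of each term.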

### Data Filtering and Labeling

In [5]:
keywords = ['coronavirus', 'covid19', 'sarscov2', 'covid']

def auto_label(abstract, keywords):
    # Return 1 if any keyword occurs as a whole word in the abstract, else 0
    abstract = str(abstract).lower()
    return int(any(re.search(rf'\b{re.escape(kw)}\b', abstract) for kw in keywords))

df['label'] = df['abstracts_cleaned'].apply(lambda x: auto_label(x, keywords))
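
The labelling rule relies on whole-word matching, which the following sketch illustrates on a few made-up cleaned abstracts. The function mirrors `auto_label` from the cell above; the example sentences are assumptions chosen to show one case per outcome.

```python
import re

KEYWORDS = ['coronavirus', 'covid19', 'sarscov2', 'covid']

def auto_label(abstract, keywords):
    """Return 1 if any keyword appears as a whole word, otherwise 0."""
    abstract = str(abstract).lower()
    return int(any(re.search(rf'\b{re.escape(kw)}\b', abstract) for kw in keywords))

examples = [
    'covid19 vaccination uptake in older adults',   # keyword present
    'a novel coronavirus was identified',           # keyword present
    'seasonal influenza outbreak in care homes',    # no keyword
    'postcovid fatigue clinics',                    # "covid" only inside a longer word
]
labels = [auto_label(text, KEYWORDS) for text in examples]
print(labels)  # [1, 1, 0, 0]
```

The third example also shows why negatives appear in the dataset at all: the PubMed query includes broader terms such as "outbreak" and "respiratory infection" that are not labelling keywords.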

After labelling the data, we observed that the majority of the retrieved abstracts did not mention COVID-19, resulting in a highly imbalanced dataset. To address this, we oversample the minority class by applying SMOTE to the training data, which balances the classes and is intended to improve the model's ability to generalize to underrepresented cases.
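
A quick way to quantify the imbalance is to count the labels. The sketch below uses a small hypothetical label series rather than the project's data; in the notebook the same calls would be applied to `df['label']`.

```python
import pandas as pd

# Hypothetical labels for illustration only (not the project's data)
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='label')

# Count each class and compute the share of the smaller class
counts = labels.value_counts().sort_index()
minority_share = counts.min() / counts.sum()

print(counts.to_dict())
print(f'Minority class share: {minority_share:.0%}')
```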

### Text Preprocessing

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Restrict the TF-IDF features to the disease keywords
custom_vocab = ['coronavirus', 'covid19', 'sarscov2', 'covid']

vectorizer = TfidfVectorizer(vocabulary = custom_vocab)
vectorized = vectorizer.fit_transform(df['abstracts_cleaned'])
X = pd.DataFrame(vectorized.toarray(), columns = vectorizer.get_feature_names_out())

### Oversample Minority Class with SMOTE

In [7]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE(random_state = 1)

# Split data into train and test data
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

# Apply SMOTE to training data only
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
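
SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step in plain Python; it is not the `imblearn` implementation, the neighbour is picked by hand instead of by a nearest-neighbour search, and the two feature rows are hypothetical.

```python
import random

def smote_like_sample(point, neighbour, rng):
    """Create one synthetic point on the line segment between two minority samples."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(1)

# Hypothetical TF-IDF rows for two minority-class abstracts
minority_a = [0.0, 1.0, 0.0, 0.0]
minority_b = [0.0, 0.6, 0.8, 0.0]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print([round(v, 3) for v in synthetic])
```

Each synthetic feature value lies between the values of the two original samples, which is why SMOTE is applied to the training split only: the test set should contain real abstracts, not interpolated ones.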

## Modelling

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

model = LogisticRegression(random_state = 1)
model.fit(X_resampled, y_resampled)

y_pred = model.predict(X_test)
pred_prob = model.predict_proba(X_test)[:,1]

# Confusion matrix
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')

# Classification report
target_names = ['Did not mention COVID-19', 'Mentioned COVID-19']
print(f'Classification Report:\n{classification_report(y_test, y_pred, target_names = target_names)}\n')

# ROC-AUC score: compare the true labels with the predicted probabilities
print(f'ROC-AUC Score: {roc_auc_score(y_test, pred_prob)}')

Confusion Matrix:
[[96  0]
 [ 3 16]]

Classification Report:
                          precision    recall  f1-score   support

Did not mention COVID-19       0.97      1.00      0.98        96
      Mentioned COVID-19       1.00      0.84      0.91        19

                accuracy                           0.97       115
               macro avg       0.98      0.92      0.95       115
            weighted avg       0.97      0.97      0.97       115


ROC-AUC Score: 1.0


## Results
* Confusion matrix: The model correctly identified 96 abstracts that did not mention COVID-19 (True Negatives) and made no incorrect positive predictions (False Positives = 0). It missed 3 abstracts that did mention COVID-19 (False Negatives) and correctly classified 16 abstracts as mentioning COVID-19 (True Positives).
* Precision: Every time the model predicted a COVID-19 mention, it was correct (100%).
* Recall: The model identified 84% (16 of 19) of all actual COVID-19 mentions; 3 were missed.
* F1 Score: The F1 score of 0.91 for the positive class reflects a strong balance between precision and recall.
* ROC-AUC: The reported score is 1.0. This value reflects discriminatory power only when it is computed from the true labels (`y_test`) and the predicted probabilities; computed against the model's own predictions (`y_pred`), it equals 1.0 by construction.
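
The precision, recall, F1 and accuracy figures can be re-derived directly from the printed confusion matrix. scikit-learn's `confusion_matrix` puts the true classes in rows and the predicted classes in columns, with classes in sorted order (0, then 1), so the arithmetic can be sketched as:

```python
# Confusion matrix as printed above: rows are true classes, columns are predictions
matrix = [[96, 0],
          [3, 16]]

tn, fp = matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = matrix[1]  # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}')
# precision=1.00 recall=0.84 f1=0.91 accuracy=0.97
```

These values match the "Mentioned COVID-19" row and the accuracy line of the classification report.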

Overall, the model reported a ROC-AUC of 1.0 alongside 97% accuracy, suggesting strong predictive ability on this test set.
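
Since scikit-learn's `roc_auc_score` expects the true labels as its first argument and the scores as its second, the choice of first argument matters. The sketch below uses a plain-Python pairwise definition of ROC-AUC for the binary case (the helper `roc_auc` and the toy labels and probabilities are assumptions for illustration, not the project's data) to show the difference between scoring against the true labels and scoring against the model's own thresholded predictions.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and predicted probabilities (not the project's data)
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.40, 0.70, 0.35, 0.80, 0.90]
y_pred = [int(p >= 0.5) for p in probs]  # predictions thresholded at 0.5

auc_vs_truth = roc_auc(y_true, probs)
auc_vs_preds = roc_auc(y_pred, probs)
print(round(auc_vs_truth, 3), auc_vs_preds)
```

Scoring the probabilities against predictions derived from those same probabilities always ranks every predicted positive above every predicted negative, so that variant is 1.0 regardless of how well the model separates the true classes.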

## Key Insights
* The logistic regression model achieved an accuracy of 97% and a reported ROC-AUC score of 1.0, indicating strong performance in distinguishing abstracts that mention COVID-19 from those that do not.
* The 100% precision suggests that the model is reliable when predicting positive cases.
* The 84% recall suggests that the model has strong sensitivity, though a few COVID-19 mentions were missed.

These results indicate that the current model is effective on this keyword-labelled test set of 115 abstracts.

## Next Steps

* Upgrade to BERT/BioBERT to enhance semantic understanding and classification performance by incorporating contextual embeddings.
* Extend the current model to other diseases (e.g. dengue) to evaluate the model's ability to adapt and generalize across different medical contexts.
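
As a rough sketch of what adapting the pipeline to another disease could involve, the labelling step can be parameterized by a keyword list. The dictionary, the dengue terms and the helper `build_labeller` below are hypothetical and would need domain review; the PubMed search terms and the TF-IDF vocabulary would have to be changed in the same way.

```python
import re

# Hypothetical keyword sets; the dengue terms are illustrative only
DISEASE_KEYWORDS = {
    'covid19': ['coronavirus', 'covid19', 'sarscov2', 'covid'],
    'dengue': ['dengue', 'denv'],
}

def build_labeller(keywords):
    """Compile one whole-word pattern for a keyword list and return a labelling function."""
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(kw) for kw in keywords) + r')\b')
    return lambda text: int(bool(pattern.search(str(text).lower())))

label_dengue = build_labeller(DISEASE_KEYWORDS['dengue'])
label_covid = build_labeller(DISEASE_KEYWORDS['covid19'])

print(label_dengue('dengue fever incidence rose after the monsoon'))  # 1
print(label_dengue('covid19 vaccination uptake in older adults'))     # 0
```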

## Conclusion
* Abstracts from medical research articles were scraped from PubMed and parsed for analysis.
* Data was labelled, filtered and preprocessed for model training.
* To address class imbalance, SMOTE was applied to oversample the minority class in the training data.
* A logistic regression model was trained and evaluated, demonstrating strong predictive performance in detecting COVID-19 mentions in medical abstracts.