<a href="https://colab.research.google.com/github/nbil-s/CS1/blob/main/169393.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**MAKE A COPY OF THIS COLAB NOTEBOOK BEFORE STARTING**

## Introduction


This notebook contains code that loads a medical abstracts dataset and deliberately introduces several real-world data quality issues that you will need to handle. The code **loads data from the Hugging Face medical abstracts dataset (train + test combined)**.




## Your Task
- Build a complete NLP classification pipeline using the techniques covered in class over the last 2 weeks
- Evaluate your model performance with appropriate metrics
- Deploy the final model with a Gradio UI where users can input medical abstracts and get predictions. The title of your Gradio app must be your first name
- You are free to use any algorithm and any feature extraction method you see fit, given the data and context of this problem


## Deliverables
- Once done, use the class attendance [Google form](https://forms.gle/ThaqeLtnHB7ui4rE9) to upload your file as well as answer some questions based on what you built
- All links close after 9:45 am, November 10th, 2025
- In case of any issues, any of the class reps can reach out via email or phone.

In [None]:
'''
- DO NOT MODIFY ANY CODE IN THIS CELL.
- MAKING ANY CHANGES WILL RESULT IN FAILING THIS EXERCISE.

'''

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

splits = {
    'train': 'data/train-00000-of-00001.parquet',
    'test': 'data/test-00000-of-00001.parquet'
}

df = pd.concat(
    [
        pd.read_parquet("hf://datasets/TimSchopf/medical_abstracts/" + splits[s])
        for s in ["train", "test"]
    ]
)

np.random.seed(42)

df.loc[np.random.choice(df.index, size=int(0.15 * len(df)), replace=False), 'medical_abstract'] = np.nan
df.loc[np.random.choice(df.index, size=int(0.05 * len(df)), replace=False), 'condition_label'] = np.nan
df['medical_abstract'] = df['medical_abstract'].str.replace(' ', '  ', regex=False)
df = pd.concat([df, df.sample(frac=0.05)], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
df['condition_label'] = (df['condition_label']).astype(str)
df['medical_abstract'] = df['medical_abstract'].astype(str)


df.head()

| | condition_label | medical_abstract |
|---|---|---|
| 0 | 5.0 | Sudden death caused by coronary artery a... |
| 1 | 3.0 | Motor unit discharge characteristics and ... |
| 2 | 4.0 | Prevalence of coronary heart disease in ... |
| 3 | | Light microscopic diagnosis of human micr... |
| 4 | 5.0 | Use of a knee-brace for control of tibi... |


# Start writing your code below; add more cells if needed.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
import gradio as gr
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
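import nltk

# Fetch the NLTK resources used below (stop word list and WordNet lemmatizer data)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)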



In [None]:
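# Data Cleaning
# Caveat: the setup cell cast both columns to str, so missing values are now the
# literal string 'nan' and these dropna() calls do not remove them.
df = df.dropna(subset=['medical_abstract'])
df = df.dropna(subset=['condition_label'])
# Collapse the doubled spaces injected by the setup cell and trim both ends
df['medical_abstract'] = df['medical_abstract'].str.replace(r'\s+', ' ', regex=True).str.strip()
# Drop the exact duplicate rows added by the setup cell
df = df.drop_duplicates()
# Resources used by preprocess_text
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()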
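
The classification report further down lists a `nan` class with 157 test rows, which suggests that placeholder strings survived the cleaning step. A minimal sketch of one way to handle this is shown here on a toy frame; the frame `toy` and its values are made up for illustration, and the approach assumes that the literal string `'nan'` never occurs as a genuine abstract or label.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the notebook's columns after the setup cell's astype(str)
toy = pd.DataFrame({
    "condition_label": ["5.0", "nan", "3.0", "1.0"],
    "medical_abstract": ["Sudden  death", "Light  microscopy", "nan", "Knee  brace"],
})

# dropna() only sees strings here, so no row is removed
rows_after_plain_dropna = len(toy.dropna())

# Turn the placeholder strings back into real missing values, then drop them
cleaned = toy.replace("nan", np.nan).dropna(
    subset=["condition_label", "medical_abstract"]
)

print(rows_after_plain_dropna)
print(cleaned)
```

Applying the same `replace` to `df` before the `dropna` calls would change the class distribution, so the metrics shown later in this notebook would no longer apply and the model would need to be retrained and re-evaluated.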

In [None]:
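def preprocess_text(text):
    """Lowercase, strip non-letters, drop stop words, and lemmatize the rest."""
    text = text.lower()
    # Keep only letters and whitespace; digits and punctuation are deleted
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    words = text.split()
    # Remove English stop words, then reduce each remaining word to its lemma
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)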
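
One side effect of deleting punctuation outright is that hyphenated terms are fused into a single token. The sketch below isolates that regex step on a made-up sentence and compares it with a hypothetical variant that substitutes a space instead; the variant is not part of the notebook, only an option to consider.

```python
import re

sample = "Use of a knee-brace for post-operative control (n=12)."

# Same pattern as preprocess_text: delete every non-letter, non-whitespace character
deleted = re.sub(r"[^a-zA-Z\s]", "", sample.lower())

# Hypothetical variant: substitute a space, then collapse repeated whitespace
spaced = re.sub(r"[^a-zA-Z\s]", " ", sample.lower())
spaced = re.sub(r"\s+", " ", spaced).strip()

print(deleted)  # use of a kneebrace for postoperative control n
print(spaced)   # use of a knee brace for post operative control n
```

Which behaviour is preferable depends on the vocabulary: fused tokens such as `kneebrace` stay distinct features, while split tokens share statistics with `knee` and `brace` elsewhere in the corpus.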

In [None]:
df['processed_abstract'] = df['medical_abstract'].apply(preprocess_text)

In [None]:
X = df['processed_abstract']
y = df['condition_label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
# Build the pipeline:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000, ngram_range=(1,2))),
    ('clf', LogisticRegression(random_state=42, max_iter=1000))
])

# Train the model
pipeline.fit(X_train, y_train)


In [None]:
# Evaluate the model
y_pred = pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Function for prediction
def predict_abstract(abstract):
    processed = preprocess_text(abstract)
    prediction = pipeline.predict([processed])[0]
    return f"Predicted Condition: {prediction}"



Accuracy: 0.5357297531398874
Classification Report:
               precision    recall  f1-score   support

         1.0       0.63      0.72      0.67       467
         2.0       0.48      0.39      0.43       222
         3.0       0.51      0.40      0.45       298
         4.0       0.62      0.68      0.65       451
         5.0       0.45      0.55      0.49       714
         nan       0.00      0.00      0.00       157

    accuracy                           0.54      2309
   macro avg       0.45      0.46      0.45      2309
weighted avg       0.50      0.54      0.51      2309

Confusion Matrix:
 [[334  16  16  11  90   0]
 [ 22  87   8   7  98   0]
 [ 21   9 119  34 114   1]
 [ 10   4  14 307 116   0]
 [ 98  54  66 106 390   0]
 [ 46  13   9  30  59   0]]
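
The headline figures can be recomputed from the confusion matrix by hand, which also shows how much the `nan` row weighs on them. The sketch below copies the matrix from the output above and assumes the scikit-learn convention that rows are true labels and columns are predicted labels. The last figure is only a re-reading of the same predictions; a model retrained without the `nan` rows would produce different numbers.

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted)
labels = ["1.0", "2.0", "3.0", "4.0", "5.0", "nan"]
cm = [
    [334, 16, 16, 11, 90, 0],
    [22, 87, 8, 7, 98, 0],
    [21, 9, 119, 34, 114, 1],
    [10, 4, 14, 307, 116, 0],
    [98, 54, 66, 106, 390, 0],
    [46, 13, 9, 30, 59, 0],
]

# Accuracy = correctly classified rows (the diagonal) / all rows
correct = sum(cm[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in cm)
print(f"accuracy: {correct / total:.4f}")

# Recall per class = diagonal entry / row total
for i, label in enumerate(labels):
    print(f"recall {label}: {cm[i][i] / sum(cm[i]):.2f}")

# Accuracy restricted to test rows whose true label is not 'nan'
labelled_total = total - sum(cm[-1])
labelled_correct = correct - cm[-1][-1]
print(f"accuracy on labelled rows: {labelled_correct / labelled_total:.4f}")
```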


In [None]:
# Gradio UI
iface = gr.Interface(
    fn=predict_abstract,
    inputs=gr.Textbox(lines=5, placeholder="Enter medical abstract here..."),
    outputs="text",
    title="Bilgis Nzembi Medical Abstract Classifier",
    description="Input a medical abstract to predict the condition label."
)
iface.launch()

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://7ab58e15c12cb616dd.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


