# Text Classification for Medical Reports
---
This notebook builds a machine learning model that classifies medical transcriptions by specialty. Such a classifier could support documentation automation, data management, medical research, and decision-making. The accuracy reported below is low (roughly 0.23 to 0.27), but the notebook still walks through the essential steps of developing, evaluating, and refining a text classification model. Further tuning would be needed before the system is accurate and reliable enough to benefit healthcare applications.


## 1. Importing Libraries
The necessary libraries for data manipulation, model training, and evaluation are imported.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder


## 2. Loading the Dataset
The dataset is loaded from a CSV file into a Pandas DataFrame.

In [None]:
# Load the dataset
file_path = '/content/mtsamples.csv'
data = pd.read_csv(file_path)

## 3. Initial Data Inspection
The first few rows of the dataset are displayed to understand its structure, followed by an information summary and checking for missing values.

In [None]:
# Display the first few rows of the dataset
print(data.head())

# Check the structure of the dataset
print(data.info())

# Check for missing values
print(data.isnull().sum())

   Unnamed: 0                                        description  \
0           0   A 23-year-old white female presents with comp...   
1           1           Consult for laparoscopic gastric bypass.   
2           2           Consult for laparoscopic gastric bypass.   
3           3                             2-D M-Mode. Doppler.     
4           4                                 2-D Echocardiogram   

             medical_specialty                                sample_name  \
0         Allergy / Immunology                         Allergic Rhinitis    
1                   Bariatrics   Laparoscopic Gastric Bypass Consult - 2    
2                   Bariatrics   Laparoscopic Gastric Bypass Consult - 1    
3   Cardiovascular / Pulmonary                    2-D Echocardiogram - 1    
4   Cardiovascular / Pulmonary                    2-D Echocardiogram - 2    

                                       transcription  \
0  SUBJECTIVE:,  This 23-year-old white female pr...   
1  PAST MEDICAL 

In [None]:
data.head()

| | Unnamed: 0 | description | medical_specialty | sample_name | transcription | keywords |
|---|---|---|---|---|---|---|
| 0 | 0 | A 23-year-old white female presents with comp... | Allergy / Immunology | Allergic Rhinitis | SUBJECTIVE:, This 23-year-old white female pr... | allergy / immunology, allergic rhinitis, aller... |
| 1 | 1 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 2 | PAST MEDICAL HISTORY:, He has difficulty climb... | bariatrics, laparoscopic gastric bypass, weigh... |
| 2 | 2 | Consult for laparoscopic gastric bypass. | Bariatrics | Laparoscopic Gastric Bypass Consult - 1 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... | bariatrics, laparoscopic gastric bypass, heart... |
| 3 | 3 | 2-D M-Mode. Doppler. | Cardiovascular / Pulmonary | 2-D Echocardiogram - 1 | 2-D M-MODE: , ,1. Left atrial enlargement wit... | cardiovascular / pulmonary, 2-d m-mode, dopple... |
| 4 | 4 | 2-D Echocardiogram | Cardiovascular / Pulmonary | 2-D Echocardiogram - 2 | 1. The left ventricular cavity size and wall ... | cardiovascular / pulmonary, 2-d, doppler, echo... |
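
Because the target is a multi-class label, it can also help to look at how many rows each specialty has before modeling. The following is a minimal sketch on a toy DataFrame (the rows and the `toy` name are illustrative assumptions, not the real data); the same `value_counts()` call could be applied to `data['medical_specialty']`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; these rows are illustrative only.
toy = pd.DataFrame({
    "medical_specialty": [
        "Cardiovascular / Pulmonary",
        "Cardiovascular / Pulmonary",
        "Cardiovascular / Pulmonary",
        "Bariatrics",
        "Autopsy",
    ],
})

# Count rows per class, most frequent first.
class_counts = toy["medical_specialty"].value_counts()
print(class_counts)

# Share of all rows taken by the largest class.
majority_share = class_counts.iloc[0] / class_counts.sum()
print(f"Majority class share: {majority_share:.2f}")
```

A heavily skewed distribution is one plausible reason for the many all-zero rows in the classification reports later in the notebook.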


## 4. Displaying Basic Statistics
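Basic statistical details of the dataset are displayed. By default, `describe()` summarizes only numeric columns, so the output covers just the 'Unnamed: 0' index column.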

In [None]:
# Display basic statistics
print(data.describe())

        Unnamed: 0
count  4999.000000
mean   2499.000000
std    1443.231328
min       0.000000
25%    1249.500000
50%    2499.000000
75%    3748.500000
max    4998.000000


## 5. Preprocessing
### 5.1 Removing Unnecessary Columns

If there is an index column named 'Unnamed: 0', it is dropped.

In [None]:
# Drop index column if it exists
if 'Unnamed: 0' in data.columns:
    data = data.drop(columns=['Unnamed: 0'])


### 5.2 Defining Features and Labels

The features (X) and labels (y) are defined. Here, 'description' is used as the feature and 'medical_specialty' as the label.

In [None]:
# Define features and labels
X = data['description']
y = data['medical_specialty']


## 6. Splitting the Data
The dataset is split into training (80%) and testing (20%) sets. Setting `random_state=42` makes the split reproducible.

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
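
The split above is purely random, so rare specialties can end up with very few test rows. One possible alternative is a stratified split, sketched below on toy data. The `min_rows` threshold and the toy labels are assumptions made for illustration; stratification requires at least two rows per class, so rarer classes are dropped first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: two frequent classes and one class with a single row.
toy = pd.DataFrame({
    "description": [f"note {i}" for i in range(21)],
    "medical_specialty": ["A"] * 10 + ["B"] * 10 + ["C"],
})

# Drop classes with fewer than min_rows rows (illustrative threshold).
min_rows = 2
counts = toy["medical_specialty"].value_counts()
kept = toy[toy["medical_specialty"].isin(counts[counts >= min_rows].index)]

X_tr, X_te, y_tr, y_te = train_test_split(
    kept["description"],
    kept["medical_specialty"],
    test_size=0.2,
    random_state=42,
    stratify=kept["medical_specialty"],  # keep class proportions in both splits
)
print(y_te.value_counts())
```

Whether dropping rare classes is acceptable depends on the application; merging them into an "other" class would be another option.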

## 7. Text Vectorization
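A TF-IDF vectorizer converts the text into numerical features. It is fit on the training text only and then applied to the test text, so the vocabulary and IDF weights are learned without seeing the test set.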

In [None]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
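
To make the fit/transform distinction concrete, here is a small sketch on made-up sentences (the `train_docs` and `test_docs` lists are assumptions for illustration). It shows that the test matrix reuses the training vocabulary and that words unseen during fitting are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["consult for gastric bypass", "2-d echocardiogram"]
test_docs = ["gastric bypass followup"]

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_docs)        # reuses them; unseen words are dropped

# The default tokenizer keeps tokens of two or more characters,
# so "2" and "d" from "2-d" are not part of the vocabulary.
print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to score the other.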

## 8. Model Training and Evaluation
### 8.1 Logistic Regression Model

A Logistic Regression model is initialized, trained on the training data, and used to make predictions on the test data.

In [None]:
# Initialize and train the classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

# Make predictions
y_pred = model.predict(X_test_tfidf)


### 8.2 Model Evaluation

The model's performance is evaluated using accuracy score and classification report.

In [None]:
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.234
Classification Report:
                                 precision    recall  f1-score   support

          Allergy / Immunology       0.00      0.00      0.00         1
                       Autopsy       0.00      0.00      0.00         2
                    Bariatrics       0.00      0.00      0.00         3
    Cardiovascular / Pulmonary       0.19      0.22      0.20        69
                  Chiropractic       0.00      0.00      0.00         1
    Consult - History and Phy.       0.21      0.39      0.28       107
    Cosmetic / Plastic Surgery       0.00      0.00      0.00         4
                     Dentistry       0.00      0.00      0.00         8
                   Dermatology       0.00      0.00      0.00         3
          Diets and Nutritions       0.00      0.00      0.00         1
             Discharge Summary       0.50      0.05      0.09        21
          ENT - Otolaryngology       0.25      0.04      0.07        25
        Emergency Room 

  _warn_prf(average, modifier, msg_start, len(result))
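
The `_warn_prf` lines appear to come from scikit-learn's undefined-metric warning, which is raised when a class receives no predictions and its precision is therefore 0/0. As a hedged sketch on toy labels (not the notebook's data), the `zero_division` argument of `classification_report` sets the value to report in that case and silences the warning.

```python
from sklearn.metrics import classification_report

y_true = ["A", "A", "B", "C"]
y_hat = ["A", "A", "A", "A"]  # the model never predicts B or C

# zero_division=0 reports 0.0 for undefined precision instead of warning.
report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)
print(report["A"])
print(report["B"])
```

Silencing the warning does not change the underlying problem: classes that are never predicted still score zero.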


## 9. Encoding Labels
The 'medical_specialty' column is encoded to numerical labels using LabelEncoder.

In [None]:
# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the 'medical_specialty' column
data['medical_specialty'] = le.fit_transform(data['medical_specialty'])

# Inspect the transformed labels
print(data['medical_specialty'].unique())


[ 0  2  3 22  7 39 15 38 37 35 36 34 33 32 31 30 29 28 27 26 25 24 23 21
 20 19 18 17 16 14 11 13 12 10  9  8  6  5  4  1]
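
Encoded labels are harder to read in later reports, so it can be useful to keep the encoder around and map codes back to names. A minimal sketch with a few specialty names (the `labels` list and the `enc` name are illustrative assumptions):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Bariatrics", "Autopsy", "Bariatrics", "Dentistry"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)  # classes are numbered in sorted order

print(list(enc.classes_))                   # code -> name lookup table
print(list(codes))
print(list(enc.inverse_transform([0, 2])))  # map codes back to names
```

`classification_report` also accepts a `target_names` argument for display; when some classes are missing from the test split, a matching `labels` argument is needed alongside it.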


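## 10. Re-Vectorization with Additional Options
The TF-IDF vectorizer is re-initialized with English stop-word removal and a cap of 1,000 features, then applied to the 'description' text. Unlike Section 7, it is fit on the full dataset before the split in Section 11, so vocabulary and IDF statistics from the eventual test rows influence the features.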

In [None]:
# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit and transform the 'description' text
X = vectorizer.fit_transform(data['description'])

# Inspect the feature names
print(vectorizer.get_feature_names_out())


['10' '100' '11' '12' '14' '15' '16' '17' '18' '20' '21' '25' '26' '30'
 '300' '32' '40' '51' '52' '53' '62' '66' '67' '69' 'abdomen' 'abdominal'
 'ablation' 'abnormal' 'abscess' 'abuse' 'access' 'accident' 'activity'
 'acute' 'adenocarcinoma' 'adenoid' 'adenoidectomy' 'adhesions' 'adjacent'
 'admission' 'admitted' 'adrenal' 'advanced' 'age' 'aged' 'ago' 'air'
 'airway' 'alcohol' 'allergic' 'allergies' 'allograft' 'anastomosis'
 'anemia' 'anesthesia' 'aneurysm' 'angina' 'angiogram' 'angiography'
 'angioplasty' 'ankle' 'answers' 'anterior' 'antibiotic' 'anxiety'
 'aortic' 'aphasia' 'apnea' 'aponeurosis' 'appendectomy' 'appendicitis'
 'application' 'approach' 'approximately' 'area' 'areas' 'arm' 'arms'
 'arterial' 'arteriovenous' 'artery' 'arthritis' 'arthrodesis'
 'arthroplasty' 'arthroscopic' 'arthroscopy' 'aspect' 'aspiration'
 'assess' 'assist' 'assisted' 'associated' 'atrial' 'attempt' 'attempted'
 'atypical' 'austin' 'autologous' 'axial' 'axillary' 'baby' 'basal' 'base'
 'bengal' '

## 11. Splitting Data and Training the Model Again
The data is split again, this time with 70% for training and 30% for testing, and the Logistic Regression model is retrained with the new TF-IDF features.

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, data['medical_specialty'], test_size=0.3, random_state=42)

# Initialize and train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
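
One way to guarantee that the vectorizer only ever sees training text is to wrap it and the classifier in a scikit-learn pipeline and split the raw text first. The sketch below uses a tiny made-up corpus (the `texts`, `labels`, and `pipe` names are assumptions for illustration), so its predictions are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "gastric bypass consult", "bypass surgery weight loss",
    "echocardiogram doppler", "left atrial enlargement doppler",
    "gastric bypass followup", "2-d echocardiogram",
]
labels = ["Bariatrics", "Bariatrics", "Cardio", "Cardio", "Bariatrics", "Cardio"]

# Split the raw text first, so nothing is fit on the test rows.
txt_tr, txt_te, lab_tr, lab_te = train_test_split(
    texts, labels, test_size=2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=1000),
    LogisticRegression(max_iter=1000),
)
pipe.fit(txt_tr, lab_tr)
preds = pipe.predict(txt_te)
print(list(preds))
```

The same pipeline object can be passed to cross-validation utilities, which refit the vectorizer inside each fold.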


## 12. Final Model Evaluation
The final evaluation of the model is conducted.

#### Purpose: To evaluate the quality of the predictions for each class (medical specialty).
#### Interpretation:
1. Precision: The ratio of correctly predicted positive observations to the total predicted positives. High precision means few false positives.
2. Recall: The ratio of correctly predicted positive observations to all observations in the actual class. High recall means few false negatives.
3. F1-Score: The harmonic mean of precision and recall. It gives a single score that balances the two.
4. Support: The number of actual occurrences of the class in the test set.
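
As a worked example of these definitions, the sketch below computes the three scores for a single class from true-positive, false-positive, and false-negative counts. The counts are illustrative and not taken from the notebook's output; F1 is computed as 2 * precision * recall / (precision + recall).

```python
# Illustrative counts for one class (not from the notebook's results).
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40: share of predicted positives that are correct
recall = tp / (tp + fn)     # 30 / 50: share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.60 f1=0.67
```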

In [None]:
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.2693333333333333
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         3
           2       0.00      0.00      0.00         3
           3       0.27      0.26      0.26       105
           4       0.00      0.00      0.00         3
           5       0.22      0.46      0.30       157
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00        12
           8       0.00      0.00      0.00         6
           9       0.00      0.00      0.00         2
          10       0.33      0.06      0.10        33
          11       0.33      0.06      0.10        35
          12       0.00      0.00      0.00        24
          13       0.00      0.00      0.00         4
          14       0.17      0.09      0.12        68
          15       0.14      0.09      0.11        74
          16       0.00     

  _warn_prf(average, modifier, msg_start, len(result))


The full classification report is organized as follows:
1. Rows for each specialty: Individual precision, recall, F1-score, and support for that class.
2. Overall accuracy: The fraction of test samples predicted correctly.
3. Macro average: The unweighted mean of each metric across classes, so every class counts equally regardless of its size.
4. Weighted average: The mean of each metric across classes weighted by support, so larger classes contribute more.
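
The difference between the two averages is easiest to see with numbers. In this sketch, the per-class F1 scores and supports are made up for illustration: one large class that is predicted well and one small class that is predicted poorly.

```python
# Per-class F1 scores and supports; illustrative values only.
f1_scores = [0.90, 0.10]
supports = [90, 10]

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro={macro_f1:.2f} weighted={weighted_f1:.2f}")
# prints: macro=0.50 weighted=0.82
```

With imbalanced classes, a macro average well below the weighted average suggests that small classes are being predicted poorly.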