# Structuring and Cleaning

---

In this notebook I clean and structure the discharge notes using Python
and a handful of third-party libraries. The packages I use, along with the 
important notes about them, are listed below. 

## Important Notes 

- **pandas:** for loading and manipulating tabular data (DataFrames).
- **re:** for regex pattern matching.
- **torch:** tensor operations and automatic differentiation. 
- **torch.nn:** neural network building blocks.
- **transformers:** provides pre-trained models and their tokenizers (BERT is used here).

---

### nn.Module

In PyTorch, nn.Module is a base class for all neural network modules. It is a fundamental building block for creating neural network architectures. When you create a custom neural network in PyTorch, you typically subclass nn.Module and define the network's layers and operations within this class.

### nn.Dropout

The `nn.Dropout` layer is a regularization technique commonly used in neural networks to prevent overfitting. It works by randomly setting a fraction of input units to zero during training, which helps prevent the network from relying too much on any particular set of features. 

The argument `0.1` in `nn.Dropout(0.1)` specifies the probability of dropping out each neuron during training. In this case, `0.1` means that each neuron in the input will be set to zero with a probability of 0.1 (or 10%) during each forward pass through the network. 

During inference (i.e., when the model is used to make predictions), dropout is turned off. PyTorch's `nn.Dropout` already scales the surviving activations by `1 / (1 - p)` during training, so the expected output stays the same and no extra scaling is needed at inference. Calling `model.eval()` switches the layer to this pass-through behaviour, and `model.train()` switches it back.

In summary, `nn.Dropout(0.1)` introduces random dropout with a probability of 0.1 during training to prevent 
overfitting and improve the generalization ability of the neural network.
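
A small sketch of the training versus evaluation behaviour described above. The tensor of ones and the seed are arbitrary choices made only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random mask is repeatable

dropout = nn.Dropout(p=0.1)
x = torch.ones(1000)

# Training mode: roughly 10% of the entries are zeroed and the
# surviving entries are scaled by 1 / (1 - p).
dropout.train()
y_train = dropout(x)
kept = y_train[y_train != 0]

# Evaluation mode: the layer passes its input through unchanged.
dropout.eval()
y_eval = dropout(x)

print("fraction zeroed:", (y_train == 0).float().mean().item())
print("value of kept entries:", kept[0].item())
print("eval output unchanged:", torch.equal(y_eval, x))
```

The fraction zeroed will be close to, but not exactly, 0.1 because the mask is random.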

In [1]:
## NOTE: can ignore the UserWarning as it is an internal pytorch issue
# import important libraries and packages

import pandas as pd 
import re 
import torch 
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertModel, BertTokenizer
from torch.utils.data import DataLoader




In [2]:
# Step 1: Read the Data
df = pd.read_csv("discharge.csv", nrows=500)

In [3]:
df.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text
0,10000032-DS-21,10000032,22595853,DS,21,2180-05-07 00:00:00,2180-05-09 15:26:00,\nName: ___ Unit No: _...
1,10000032-DS-22,10000032,22841357,DS,22,2180-06-27 00:00:00,2180-07-01 10:15:00,\nName: ___ Unit No: _...
2,10000032-DS-23,10000032,29079034,DS,23,2180-07-25 00:00:00,2180-07-25 21:42:00,\nName: ___ Unit No: _...
3,10000032-DS-24,10000032,25742920,DS,24,2180-08-07 00:00:00,2180-08-10 05:43:00,\nName: ___ Unit No: _...
4,10000084-DS-17,10000084,23052089,DS,17,2160-11-25 00:00:00,2160-11-25 15:09:00,\nName: ___ Unit No: __...


In [4]:
# Checking the data.
df['text'][33]

' \nName:  ___                    Unit No:   ___\n \nAdmission Date:  ___              Discharge Date:   ___\n \nDate of Birth:  ___             Sex:   F\n \nService: MEDICINE\n \nAllergies: \n___\n \nAttending: ___.\n \nChief Complaint:\nAbdominal pain \n \nMajor Surgical or Invasive Procedure:\nColonoscopy with biopsy ___\n\n \nHistory of Present Illness:\nThis patient is a ___ year old female with Hx of sigmoid \ndiverticulitis s/p resection in ___, who complains of RLQ \nabdominal pain. The patient states that her pain began yesterday \nafternoon, worsened overnight and causing her to present to the \nED around 3AM. She describes it as a "gnawing" pain, \nnonradiating, constant and ___ intensity. She states this \nfeels similar to her episode of diverticulitis several years \nago, only is present on the other side of her abdomen. She \ndenies any fever, nausea, vomiting, SOB, Chest pain, BRBPR. She \ndoes endorse subjective feeling of chills.  \n.  \nPrior to the current episode, t

### Compute Number of Titles

The classifier further down needs to know how many distinct section titles (labels) there are. 
This count is derived from the titles found in one sample note. Below we will: 

1. Define the __regex__ pattern for titles. 
2. Find all the titles matching the pattern. 
3. Remove duplicates and normalise the list. 

In [5]:
# Initialise the text sample
text_sample = df['text'][33]
text_sample = text_sample.strip()

# Remove newline characters (lines are joined without a space, so some words run together)
text_sample = re.sub(r'\n', '', text_sample)
text_sample = text_sample.strip()

# Define the title pattern
title_pattern = r'(\b[A-Za-z\s]+):\s*'

# Find all the titles matching the pattern
titles = re.findall(title_pattern, text_sample.strip())

# Normalise the titles
normalized_titles = list(set(titles))
normalized_titles = sorted(normalized_titles)

# Display the normalised titles
for title in normalized_titles:
    print("Title: ", title)

# Define number of labels 
num_labels = len(normalized_titles)
print("Number of Labels (num_labels):", num_labels)

Title:                      Unit No
Title:                Discharge Date
Title:               Sex
Title:    Anxiety  Allergic rhinitis  GERD  Eczema  Migraine headaches  Eustacian tube dysfunction   Social History
Title:    Physical Exam
Title:   Admission Date
Title:   Attending
Title:   BP
Title:   Date of Birth
Title:   History
Title:   History of Present Illness
Title:   P
Title:   R
Title:   fatty acids     Capsule Sig
Title:   mg Tablet Sig
Title:   unit Capsule Sig
Title:   weeks ago   Discharge Medications
Title:  Abdominal pain
Title:  Abdominal pain  Major Surgical or Invasive Procedure
Title:  Activity Status
Title:  Brief Hospital Course
Title:  Cecal Mass
Title:  Cecal MassHemorrhagic ovarian cyst Discharge Condition
Title:  Chief Complaint
Title:  DIAGNOSIS
Title:  Discharge Disposition
Title:  Discharge Instructions
Title:  F Service
Title:  FINDINGS
Title:  Findings
Title:  Followup Instructions
Title:  Home Discharge Diagnosis
Title:  IMPRESSION
Title:  Imaging
Title: 
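
Many of the entries printed above are fragments of body text rather than real section titles, because the sample was flattened into a single line before matching. One possible refinement, shown here on a made-up sample rather than the real note, is to keep the newlines and only accept a title at the start of a line. The names `sample` and `line_title` are introduced for this sketch only:

```python
import re

# Made-up sample that mimics the layout of a discharge note.
sample = (
    "Name:  ___          Unit No:   ___\n"
    " \n"
    "Chief Complaint:\n"
    "Abdominal pain\n"
    " \n"
    "History of Present Illness:\n"
    "Pain began 3 weeks ago: worse at night.\n"
)

# A title is a run of letters and spaces at the start of a line,
# immediately followed by a colon.
line_title = re.compile(r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]*:", re.MULTILINE)

titles = sorted(set(line_title.findall(sample)))
print(titles)
# ['Chief Complaint', 'History of Present Illness', 'Name']
```

The trade-off is that titles sharing a line with another title, such as `Unit No` here, are not picked up by a line-anchored pattern.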

In [6]:
# Define the normalized titles
normalized_titles = [
    "Name", "Unit No", "Admission Date", "Discharge Date", "Date of Birth", 
    "Sex", "Service", "Allergies", "Attending", "Chief Complaint", 
    "Major Surgical or Invasive Procedure", "History of Present Illness", 
    "Past Medical History", "Social History", "Family History", 
    "Physical Exam", "Discharge", "Pertinent Results", "CXR", 
    "U/S", "Brief Hospital Course", "Medications on Admission", 
    "Discharge Medications", "Discharge Disposition", "Discharge Diagnosis", 
    "Discharge Condition", "Discharge Instructions", "Followup Instructions"
]

# Normalize titles in the text
for title in normalized_titles:
    # Build a case-insensitive, whole-word pattern for the title;
    # re.escape guards against regex metacharacters in a title
    pattern = re.compile(r'\b' + re.escape(title) + r'\b', re.IGNORECASE)
    # Replace every occurrence with the upper-case version of the title
    text_sample = pattern.sub(title.upper(), text_sample)

# Print the updated text sample
print(text_sample)

NAME:  ___                    UNIT NO:   ___ ADMISSION DATE:  ___              DISCHARGE DATE:   ___ DATE OF BIRTH:  ___             SEX:   F SERVICE: MEDICINE ALLERGIES: ___ ATTENDING: ___. CHIEF COMPLAINT:Abdominal pain  MAJOR SURGICAL OR INVASIVE PROCEDURE:Colonoscopy with biopsy ___ HISTORY OF PRESENT ILLNESS:This patient is a ___ year old female with Hx of sigmoid diverticulitis s/p resection in ___, who complains of RLQ abdominal pain. The patient states that her pain began yesterday afternoon, worsened overnight and causing her to present to the ED around 3AM. She describes it as a "gnawing" pain, nonradiating, constant and ___ intensity. She states this feels similar to her episode of diverticulitis several years ago, only is present on the other side of her abdomen. She denies any fever, nausea, vomiting, SOB, Chest pain, BRBPR. She does endorse subjective feeling of chills.  .  Prior to the current episode, the patient reports having a "sinus infection" about 3 weeks ago that

### Extract Subsections

At this point I think it's a good idea to extract the subsections.
However, I don't think the data has been cleaned well enough so I am going to 
continue filtering and structuring the data above.

**Update 1:** 

I have now managed to (mostly) normalise the subsection titles, so extraction should be easier.


__Update 2:__ 

I should be applying the above code to the entire dataframe and creating a new column so that
I can extract titles and content together. 
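
A minimal sketch of writing the result to a new column instead of overwriting the original text. The toy DataFrame, the stand-in `normalize` function and the column name `text_normalized` are all hypothetical:

```python
import pandas as pd

def normalize(text: str) -> str:
    # Hypothetical stand-in for the notebook's title normalisation step.
    return text.strip().upper()

toy = pd.DataFrame({"text": [" Name: a ", "Sex: f"]})

# Keep the raw text and store the processed version alongside it.
toy["text_normalized"] = toy["text"].apply(normalize)

print(toy)
```

Keeping the raw column makes it possible to re-run the cleaning with a different pattern without reloading the CSV.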

In [7]:
# Let's modify the original df
# Compile the title pattern once, outside the function, so it is not rebuilt for every row
compiled_pattern = re.compile(title_pattern)

# Create a function to replace the text in each row with the new structured one
def normalize_titles(text):
    # Strip the text once
    text_sample = text.strip()
    
    # Find all the titles matching the pattern
    titles = compiled_pattern.findall(text_sample)
    
    # Normalize all the titles
    normalized_text = text_sample
    for title in titles:
        # Replace every occurrence of the title (case-insensitive) with its upper-case version
        normalized_text = re.sub(r'\b' + re.escape(title) + r'\b', title.upper(), normalized_text, flags=re.IGNORECASE)
    
    return normalized_text

# Apply to every row of the 'text' column
df['text'] = df['text'].apply(normalize_titles)

# Check that the application has been executed properly
df.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text
0,10000032-DS-21,10000032,22595853,DS,21,2180-05-07 00:00:00,2180-05-09 15:26:00,NAME: ___ UNIT NO: ___\...
1,10000032-DS-22,10000032,22841357,DS,22,2180-06-27 00:00:00,2180-07-01 10:15:00,NAME: ___ UNIT NO: ___\...
2,10000032-DS-23,10000032,29079034,DS,23,2180-07-25 00:00:00,2180-07-25 21:42:00,NAME: ___ UNIT NO: ___\...
3,10000032-DS-24,10000032,25742920,DS,24,2180-08-07 00:00:00,2180-08-10 05:43:00,NAME: ___ UNIT NO: ___\...
4,10000084-DS-17,10000084,23052089,DS,17,2160-11-25 00:00:00,2160-11-25 15:09:00,NAME: ___ UNIT NO: ___\n...


In [12]:
print(df['text'][33])

NAME:  Peatah Puhkah                    UNIT NO:   Peatah Puhkah
 
ADMISSION DATE:  Peatah Puhkah              DISCHARGE DATE:   Peatah Puhkah
 
DATE OF BIRTH:  Peatah Puhkah             SEX:   F
 
SERVICE: MEDICINE
 
ALLERGIES: 
Peatah Puhkah
 
ATTENDING: Peatah Puhkah.
 
CHIEF COMPLAINT:
ABDOMINAL PAIN 
 
MAJOR SURGICAL OR INVASIVE PROCEDURE:
Colonoscopy with biopsy Peatah Puhkah

 
HISTORY OF PRESENT ILLNESS:
This patient is a Peatah Puhkah year old female with Hx of sigmoid 
diverticulitis s/p resection in Peatah Puhkah, who complains of RLQ 
ABDOMINAL PAIN. The patient states that her pain began yesterday 
afternoon, worsened overnight and causing her to present to the 
ED around 3AM. She describes it as a "gnawing" pain, 
nonradiating, constant and Peatah Puhkah intensity. She states this 
feels similar to her episode of diverticulitis several years 
ago, only is present on the other side of her abdomen. She 
denies any fever, nausea, vomiting, SOB, Chest pain, BRBPR. She 
does e

In [9]:
print(df['text'][len(df) - 1])

NAME:  ___                 UNIT NO:   ___
 
ADMISSION DATE:  ___              DISCHARGE DATE:   ___
 
DATE OF BIRTH:  ___             SEX:   M
 
SERVICE: ORTHOPAEDICS
 
ALLERGIES: 
VANCOCIN
 
ATTENDING: ___.
 
CHIEF COMPLAINT:
RIGHT KNEE PAIN
 
MAJOR SURGICAL OR INVASIVE PROCEDURE:
complex revision right total knee replacement, re-implant of 
components - ROTATING HINGE

 
HISTORY OF PRESENT ILLNESS:
Mr ___ underwent R ___ on ___ that ultimately got 
infected and was resected.  He has undergone 6 WEEKS of IV abx 
successfully.  His most recent labs values and aspirate of the 
knee are reassuring that the infection has been cleared.  He 
elects to proceed with re-implantation of components.
 
PAST MEDICAL HISTORY:
Hypertension
OA of bt. KNEES
 
SOCIAL HISTORY:
___
FAMILY HISTORY:
Positive for cancer, nonspecific.
 
PHYSICAL EXAM:
well appearing, well nourished ___ YEAR OLD MALE
ALERT AND ORIENTED
NO ACUTE DISTRESS
RLE:
 -dressing-c/d/i
 -incision-c/d/i, no erythema.  mild bloody drainag

### Cleaning 2.0

At this stage we have structured the subsection titles/labels. Next we need to clean up the remaining noise, 
such as the special symbols `#`, `[]` and `___`. 

To do this, let's create a `clean_text` function that handles these cases and apply it to every row.

In [10]:
def clean_text(text):
    # Define the list of symbols to deal with
    symbols_to_replace = ['#', '[]', '___']
    placeholder = 'Peatah Puhkah'
    
    # Loop over each symbol and replace it with what we want
    for symbol in symbols_to_replace:
        # '___' marks a redacted value (name, date, ...), so swap in the placeholder
        if symbol == '___':
            text = text.replace(symbol, placeholder)
        else: # otherwise just remove it from the text
            text = text.replace(symbol, "")
            
    return text

# Apply changes to the dataframe
df['text'] = df['text'].apply(clean_text)

# Check application successful 
print(df['text'][54])

NAME:  Peatah Puhkah             UNIT NO:   Peatah Puhkah
 
ADMISSION DATE:  Peatah Puhkah              DISCHARGE DATE:   Peatah Puhkah
 
DATE OF BIRTH:  Peatah Puhkah             SEX:   F
 
SERVICE: MEDICINE
 
ALLERGIES: 
IV Dye, Iodine Containing Contrast Media / Oxycodone / 
cilostazol / VARENICLINE
 
ATTENDING: Peatah Puhkah
 
CHIEF COMPLAINT:
Dyspnea, ATRIAL FIBRILLATION
 
MAJOR SURGICAL OR INVASIVE PROCEDURE:
NONE

 
HISTORY OF PRESENT ILLNESS:
Peatah Puhkah F with pmhx of COPD (nighttime O2), htn, afib who presents 
with dyspnea, currently being treated for COPD and admitted for 
Afib with RVR.  
 The patient went to the ED on Peatah Puhkah and was diagnosed with a 
COPD flare. She was discharged with a prednisone taper 
(currently on 60mg) and azithromycin. This AM she initially felt 
well, then developed dyspnea at rest, worsening with exertion. 
Her inhalers improved her SOB. She felt that these symptoms were 
consistent with her COPD. She saw her PCP Peatah Puhkah today in 


## Extracting Subsection Titles and Content

**Steps:**
1. Create a function to extract the subsections and their corresponding content.
2. Make a function that finds the row and column to enter the data. 
3. Make a function that inserts the data into that position in the dataframe.

In [11]:
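## NOTE: the first version of this pattern used the look-behind (?<=^|\n), which
## Python's re rejects with "error: look-behind requires fixed-width pattern".
## Anchoring with ^ under re.MULTILINE avoids the look-behind.

# Function to extract the subsection titles and corresponding content.
def extract_subsections_and_content(text):
    # A title is an upper-case run at the start of a line followed by a colon;
    # its content runs until the next such line or the end of the text.
    # Titles that share a line with another title (e.g. UNIT NO) stay inside the content.
    pattern = re.compile(
        r'^[ \t]*([A-Z][A-Z /]*?)[ \t]*:[ \t]*([\s\S]*?)(?=^[ \t]*[A-Z][A-Z /]*?[ \t]*:|\Z)',
        re.MULTILINE,
    )
    subsections = pattern.findall(text)
    subsections_dict = {title.strip(): content.strip() for title, content in subsections}
    return subsections_dict

# Function to add subsections to the dataframe as new columns.
def add_subsections_to_dataframe(df):
    # Build one dict per row, then join the resulting columns in a single step
    records = [extract_subsections_and_content(text) for text in df['text']]
    sections = pd.DataFrame(records, index=df.index)
    return df.join(sections)

df_with_subsections = add_subsections_to_dataframe(df)

print(df_with_subsections)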
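
Python's built-in `re` module only accepts fixed-width look-behind assertions, which is why a construct such as `(?<=^|\n)` fails to compile. A minimal sketch of one alternative, using a line-anchored header pattern with `re.MULTILINE` and slicing the text between consecutive header matches. The toy note and the helper name `split_sections` are made up for illustration and are not the notebook's data:

```python
import re

# Line-anchored header: upper-case words at the start of a line, then a colon.
HEADER = re.compile(r"^([A-Z][A-Z ]*?)[ \t]*:[ \t]*", re.MULTILINE)

def split_sections(text):
    """Return {title: content} for each line-initial upper-case header."""
    matches = list(HEADER.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # Content runs from the end of this header to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections

toy_note = (
    "SERVICE: MEDICINE\n"
    " \n"
    "CHIEF COMPLAINT:\n"
    "Abdominal pain\n"
    " \n"
    "ALLERGIES: \n"
    "None known\n"
)

print(split_sections(toy_note))
# {'SERVICE': 'MEDICINE', 'CHIEF COMPLAINT': 'Abdominal pain', 'ALLERGIES': 'None known'}
```

Slicing between matches keeps the pattern simple, at the cost of treating any upper-case line ending in a colon as a section header.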

In [None]:
# Set up model parameters
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)

In [None]:
# Create class for Title Classification

class TitleClassifier(nn.Module):
    def __init__(self, bert_model, num_labels):
        super(TitleClassifier, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
        
    def forward(self, input_ids, attention_mask): 
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

Breaking down each line of the code to understand what it does:

```python
class TitleClassifier(nn.Module):
```
- This line defines a new class called `TitleClassifier`, which is a subclass of `nn.Module`. This class will serve as our title classifier neural network model.

```python
    def __init__(self, bert_model, num_labels):
```
- This line defines the constructor method `__init__()` for the `TitleClassifier` class. It initializes the class with two parameters: `bert_model` and `num_labels`.
- `bert_model`: This parameter is the pre-trained BERT model that will be used for classification.
- `num_labels`: This parameter specifies the number of output labels for classification.

```python
        super(TitleClassifier, self).__init__()
```
- This line calls the constructor of the superclass `nn.Module` to initialize the `TitleClassifier` class.

```python
        self.bert = bert_model
```
- This line assigns the pre-trained BERT model (`bert_model`) to the `bert` attribute of the `TitleClassifier` class. This allows the classifier to use BERT's pre-trained weights and layers.

```python
        self.dropout = nn.Dropout(0.1)
```
- This line creates a dropout layer with a dropout probability of 0.1 (10%) and assigns it to the `dropout` attribute of the `TitleClassifier` class. Dropout is a regularization technique used to prevent overfitting by randomly setting some output features to zero during training.

```python
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
```
- This line creates a fully connected linear layer (`nn.Linear`) with input size equal to the hidden size of the BERT model (`bert_model.config.hidden_size`) and output size equal to the number of labels (`num_labels`). This layer will be used to classify the input into different categories.

```python
    def forward(self, input_ids, attention_mask): 
```
- This line defines the forward method for the `TitleClassifier` class. The forward method specifies how input data should be processed through the neural network during the forward pass.

```python
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
```
- This line passes the input token IDs (`input_ids`) and attention mask (`attention_mask`) to the pre-trained BERT model (`self.bert`) and obtains the model outputs, including the hidden states and pooled output.

```python
        pooled_output = outputs.pooler_output
```
- This line extracts the pooled output from the BERT model outputs. The pooled output is a summary representation of the input sequence: the last-layer hidden state of the first token (`[CLS]`), passed through a linear layer and a tanh activation.

```python
        pooled_output = self.dropout(pooled_output)
```
- This line applies dropout regularization to the pooled output obtained from BERT by passing it through the dropout layer (`self.dropout`). This helps prevent overfitting during training.

```python
        logits = self.classifier(pooled_output)
```
- This line passes the dropout output (`pooled_output`) through the linear classifier (`self.classifier`) to obtain the logits, which are unnormalized scores for each class label. Applying a softmax to the logits turns them into predicted probabilities.

```python
        return logits
```
- Finally, this line returns the logits from the forward pass as the output of the `forward` method. These logits will be used to compute the loss and perform backpropagation during training.

Overall, the `TitleClassifier` class defines a neural network model for classifying input sequences into different categories using a pre-trained BERT model for feature extraction and a linear classifier for classification. The model applies dropout regularization to prevent overfitting and returns the logits for classification.
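
To check the shapes without downloading BERT, here is a rough sketch that plugs a tiny stand-in encoder into the same classifier layout. `ToyEncoder` is hypothetical and only mimics the two things the classifier relies on, `config.hidden_size` and `pooler_output`; the vocabulary size, hidden size and batch shape are arbitrary:

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BertModel (not the real architecture)."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.config = SimpleNamespace(hidden_size=hidden_size)
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                 # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()    # (batch, seq, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)  # masked mean over tokens
        return SimpleNamespace(pooler_output=torch.tanh(pooled))


class TitleClassifier(nn.Module):
    # Same layout as the class defined in the notebook.
    def __init__(self, bert_model, num_labels):
        super().__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = self.dropout(outputs.pooler_output)
        return self.classifier(pooled_output)


model = TitleClassifier(ToyEncoder(), num_labels=28)
model.eval()  # disable dropout for a deterministic forward pass

input_ids = torch.randint(0, 100, (4, 12))            # 4 sequences of 12 token ids
attention_mask = torch.ones(4, 12, dtype=torch.long)  # no padding in this toy batch

with torch.no_grad():
    logits = model(input_ids, attention_mask)

print(logits.shape)  # torch.Size([4, 28])
```

The value 28 matches the length of the hand-written `normalized_titles` list; each row of `logits` holds one score per title.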

In [None]:
# Fine-tune the BERT Model 
num_labels = len(normalized_titles)
model = TitleClassifier(bert_model, num_labels)

# Define loss function and optimizer 
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
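
As a quick illustration of what the loss function expects, the sketch below feeds `nn.CrossEntropyLoss` raw logits and integer class indices, then reproduces the value by hand. The numbers are made up:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Raw, unnormalised scores: 2 examples, 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
# Integer class indices, one per example.
labels = torch.tensor([0, 2])

loss = criterion(logits, labels)

# Same quantity by hand: softmax, pick the true class, negative log, mean.
probs = torch.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), labels]).mean()

print(loss.item(), manual.item())
```

Because the softmax is applied inside the loss, the model's `forward` should return logits rather than probabilities.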

### Create a Training Loop 

This is where we fine-tune the model on the textual data. 
Each epoch is one full pass over the training set, and we train for several epochs so the model learns to recognise the section titles. 

In [None]:
# Training Loop

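# Number of full passes (epochs) over the training data
num_epochs = 5

# NOTE: train_loader is not defined in this notebook yet. It is expected to be a
# DataLoader that yields (input_ids, attention_mask, labels) batches.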

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        optimizer.zero_grad() 
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
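
The loop above relies on a `train_loader` that yields `(input_ids, attention_mask, labels)` batches. One way such a loader could be assembled is sketched below with placeholder tensors; the sizes, the random values and the name `train_dataset` are assumptions for illustration only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors. In practice these would come from something like
# tokenizer(list_of_titles, padding=True, truncation=True, return_tensors="pt")
# together with one integer label per title.
num_examples, seq_len, num_labels = 10, 8, 28
input_ids = torch.randint(0, 1000, (num_examples, seq_len))
attention_mask = torch.ones(num_examples, seq_len, dtype=torch.long)
labels = torch.randint(0, num_labels, (num_examples,))

train_dataset = TensorDataset(input_ids, attention_mask, labels)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Each batch unpacks into the three tensors the training loop expects.
for batch_ids, batch_mask, batch_labels in train_loader:
    print(batch_ids.shape, batch_mask.shape, batch_labels.shape)
```

With 10 examples and a batch size of 4, the loader produces two full batches and one final batch of 2.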

In [None]:
def clean_text(text): 
    # replacing the underscores with placeholder names
    cleaned_text = re.sub(r'___', 'PythonSQL', text)
    return cleaned_text