# Structuring and Cleaning

---

In this notebook I clean and structure the data using Python and a few
third-party libraries. The packages I use, along with some important notes on
them, are listed below.

## Important Notes 

- **pandas:** for loading and manipulating tabular data (DataFrames).
- **re:** for regex pattern matching.
- **torch:** tensor operations and automatic differentiation.
- **torch.nn:** neural network building blocks.
- **transformers:** provides pre-trained models and tokenizers, including models for sequence classification.

---

### nn.Module

In PyTorch, `nn.Module` is the base class for all neural network modules. It is a fundamental building block for creating neural network architectures. When you create a custom neural network in PyTorch, you typically subclass `nn.Module`, define the network's layers in `__init__`, and define the computation in `forward`.
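
As a minimal sketch of that subclassing pattern (the class name `TinyNet` and the layer sizes are made up for illustration and are not part of this notebook's model):

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical two-line network, used only to show the nn.Module pattern."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Layers assigned as attributes are registered automatically
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        # forward() defines what happens when the module is called
        return torch.relu(self.linear(x))


net = TinyNet(4, 2)
out = net(torch.zeros(3, 4))  # calling the module runs forward()
param_names = [name for name, _ in net.named_parameters()]
print(out.shape, param_names)
```

Because the linear layer is stored as an attribute, its weights show up in `net.parameters()`, which is what an optimizer later receives.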

### nn.Dropout

The `nn.Dropout` layer is a regularization technique commonly used in neural networks to prevent overfitting. It works by randomly setting a fraction of input units to zero during training, which helps prevent the network from relying too much on any particular set of features. 

The argument `0.1` in `nn.Dropout(0.1)` specifies the probability of zeroing each element of the input during training. In this case, `0.1` means that each element of the input tensor will be set to zero with a probability of 0.1 (or 10%) during each forward pass through the network.

During inference (i.e., when the model is used to make predictions), dropout is turned off. PyTorch's `nn.Dropout` keeps the expected output the same by scaling the surviving elements by `1 / (1 - p)` during training, so no rescaling is needed at inference. Calling `model.eval()` makes the layer pass its input through unchanged.

In summary, `nn.Dropout(0.1)` introduces random dropout with a probability of 0.1 during training to prevent 
overfitting and improve the generalization ability of the neural network.
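
The difference between training and evaluation mode can be checked directly. The sketch below uses a tensor of ones as a stand-in input (an assumption for illustration): in training mode each element is either zeroed or rescaled by `1 / (1 - p)`, and in evaluation mode the layer returns its input unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

p = 0.1
dropout = nn.Dropout(p)
x = torch.ones(1000)  # stand-in input, not real model activations

# Training mode: elements are dropped at random, survivors are rescaled
dropout.train()
train_out = dropout(x)
is_zero = train_out == 0
is_scaled = torch.isclose(train_out, torch.tensor(1.0 / (1.0 - p)))
dropped_fraction = is_zero.float().mean().item()

# Evaluation mode: dropout is a no-op
dropout.eval()
eval_out = dropout(x)

print("fraction dropped in training:", dropped_fraction)
print("eval output equals input:", torch.equal(eval_out, x))
```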

In [27]:
## NOTE: can ignore the UserWarning as it is an internal pytorch issue
# import important libraries and packages

import pandas as pd 
import re 
import torch 
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertModel, BertTokenizer


In [28]:
# Step 1: Read the Data
df = pd.read_csv("discharge.csv", nrows=500)

In [29]:
df.head()

| Unnamed: 0 | note_id | subject_id | hadm_id | note_type | note_seq | charttime | storetime | text |
|---|---|---|---|---|---|---|---|---|
| 0 | 10000032-DS-21 | 10000032 | 22595853 | DS | 21 | 2180-05-07 00:00:00 | 2180-05-09 15:26:00 | `\nName: ___ Unit No: _...` |
| 1 | 10000032-DS-22 | 10000032 | 22841357 | DS | 22 | 2180-06-27 00:00:00 | 2180-07-01 10:15:00 | `\nName: ___ Unit No: _...` |
| 2 | 10000032-DS-23 | 10000032 | 29079034 | DS | 23 | 2180-07-25 00:00:00 | 2180-07-25 21:42:00 | `\nName: ___ Unit No: _...` |
| 3 | 10000032-DS-24 | 10000032 | 25742920 | DS | 24 | 2180-08-07 00:00:00 | 2180-08-10 05:43:00 | `\nName: ___ Unit No: _...` |
| 4 | 10000084-DS-17 | 10000084 | 23052089 | DS | 17 | 2160-11-25 00:00:00 | 2160-11-25 15:09:00 | `\nName: ___ Unit No: __...` |


In [30]:
# Checking the data.
df['text'][0]

' \nName:  ___                     Unit No:   ___\n \nAdmission Date:  ___              Discharge Date:   ___\n \nDate of Birth:  ___             Sex:   F\n \nService: MEDICINE\n \nAllergies: \nNo Known Allergies / Adverse Drug Reactions\n \nAttending: ___\n \nChief Complaint:\nWorsening ABD distension and pain \n \nMajor Surgical or Invasive Procedure:\nParacentesis\n\n \nHistory of Present Illness:\n___ HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD, \nbioplar, PTSD, presented from OSH ED with worsening abd \ndistension over past week.  \nPt reports self-discontinuing lasix and spirnolactone ___ weeks \nago, because she feels like "they don\'t do anything" and that \nshe "doesn\'t want to put more chemicals in her." She does not \nfollow Na-restricted diets. In the past week, she notes that she \nhas been having worsening abd distension and discomfort. She \ndenies ___ edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, \ndysuria. She had food poisoning a week ago from eatin

### Compute Number of Titles

The classifier defined later needs to know how many distinct titles (labels) it should predict. Here that number is taken from the
section titles found in a sample note. Below we will: 

1. define the __regex__ pattern for titles. 
2. find all the titles matching the pattern. 
3. remove duplicates and normalise the list. 

In [47]:
# Initialise the text sample
text_sample = df['text'][0]

# Remove newlines and replace the de-identification blanks (___) with a placeholder
text_sample = re.sub(r'\n', '', text_sample)
text_sample = re.sub(r'___', 'PLACEHOLDERNAME', text_sample)

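# Define the title pattern
# (note: the leading group consumes one character before each title,
#  so captured titles can lose their first letter)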
title_pattern = r'(?:(?!PLACEHOLDERNAME).)([A-Za-z\s]+):'

# Find all the titles matching the pattern
titles = re.findall(title_pattern, text_sample)

# Remove duplicate titles (no further normalisation is applied here)
normalized_titles = list(set(titles))

# Display the normalised titles
print("Normalised Titles:", normalized_titles)

# Define number of labels 
num_labels = len(normalized_titles)
print("Number of Labels (num_labels):", num_labels)

Normalised Titles: ['HYSICAL EXAMINATION', ' GU', 'H SOB  Discharge Medications', 'scites from Portal HTN Discharge Condition', '   Social History', ' PLACEHOLDERNAME                     Unit No', 'ental Status', 'XII intact  Discharge', ' Tablet Refills', 'RA  General', 'MEDICINE Allergies', 'aracentesis History of Present Illness', ' no LAD  CV', 'XII intact   Pertinent Results', '   Physical Exam', 'in NAD  HEENT', 'S', 'g  Lungs', '  PLACEHOLDERNAME Admission Date', ' clubbing  Neuro', 'AP PLACEHOLDERNAME PLACEHOLDERNAME', 'Name', 'ACEHOLDERNAMEFamily History', 'spleen edge PLACEHOLDERNAME distension  GU', ' SBP negativediuretics', '   Past Medical History', 'H', 'ome Discharge Diagnosis', ' PLACEHOLDERNAME              Discharge Date', '  F Service', 'Activity Status', 'General', ' OP clear  Neck', 'RN pain  Discharge Disposition', ' Discharge Instructions', 'no foley  Ext', '     Medications on Admission', '   Followup Instructions', ' Adverse Drug Reactions Attending', ' Brief H
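
Several of the titles printed above are missing their first letter or are fused with text from the preceding line (for example `'HYSICAL EXAMINATION'` and `'aracentesis History of Present Illness'`). This appears to follow from two choices in the cell: newlines are removed before matching, so line boundaries are lost, and the pattern consumes one character before each captured title. One possible alternative, sketched below under the assumption that section titles start a line and end with a colon, is to match on the original text with `re.MULTILINE`. The names `sample` and `line_title_pattern` are hypothetical, and `sample` is a shortened excerpt of the first note shown earlier.

```python
import re

# Shortened excerpt of the first discharge note (newlines kept intact)
sample = (
    " \nName:  ___                     Unit No:   ___\n \n"
    "Service: MEDICINE\n \n"
    "Allergies: \nNo Known Allergies / Adverse Drug Reactions\n \n"
    "Chief Complaint:\nWorsening ABD distension and pain \n \n"
    "Major Surgical or Invasive Procedure:\nParacentesis\n"
)

# A title is assumed to start at the beginning of a line and end at a colon.
# The character class excludes newlines, so a match cannot span two lines.
line_title_pattern = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z /]*?)[ \t]*:", re.MULTILINE
)

titles = [m.group(1) for m in line_title_pattern.finditer(sample)]
unique_titles = sorted(set(titles))

print(unique_titles)
print("num_labels:", len(unique_titles))
```

A limitation of this sketch is that it only finds the first title on each line, so a second field on the same line (such as `Unit No:` after `Name:`) is not captured and would need separate handling.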

In [31]:
# Set up model parameters
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)

In [32]:
# Create class for Title Classification

class TitleClassifier(nn.Module):
    def __init__(self, bert_model, num_labels):
        super(TitleClassifier, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
        
    def forward(self, input_ids, attention_mask): 
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

Breaking down each line of the code to understand what it does:

```python
class TitleClassifier(nn.Module):
```
- This line defines a new class called `TitleClassifier`, which is a subclass of `nn.Module`. This class will serve as our title classifier neural network model.

```python
    def __init__(self, bert_model, num_labels):
```
- This line defines the constructor method `__init__()` for the `TitleClassifier` class. It initializes the class with two parameters: `bert_model` and `num_labels`.
- `bert_model`: This parameter is the pre-trained BERT model that will be used for classification.
- `num_labels`: This parameter specifies the number of output labels for classification.

```python
        super(TitleClassifier, self).__init__()
```
- This line calls the constructor of the superclass `nn.Module` to initialize the `TitleClassifier` class.

```python
        self.bert = bert_model
```
- This line assigns the pre-trained BERT model (`bert_model`) to the `bert` attribute of the `TitleClassifier` class. This allows the classifier to use BERT's pre-trained weights and layers.

```python
        self.dropout = nn.Dropout(0.1)
```
- This line creates a dropout layer with a dropout probability of 0.1 (10%) and assigns it to the `dropout` attribute of the `TitleClassifier` class. Dropout is a regularization technique used to prevent overfitting by randomly setting some output features to zero during training.

```python
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
```
- This line creates a fully connected linear layer (`nn.Linear`) with input size equal to the hidden size of the BERT model (`bert_model.config.hidden_size`) and output size equal to the number of labels (`num_labels`). This layer will be used to classify the input into different categories.

```python
    def forward(self, input_ids, attention_mask): 
```
- This line defines the forward method for the `TitleClassifier` class. The forward method specifies how input data should be processed through the neural network during the forward pass.

```python
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
```
- This line passes the input token IDs (`input_ids`) and attention mask (`attention_mask`) to the pre-trained BERT model (`self.bert`) and obtains the model outputs, including the hidden states and pooled output.

```python
        pooled_output = outputs.pooler_output
```
- This line extracts the pooled output from the BERT model outputs. In `BertModel`, the pooled output is the last-layer hidden state of the first token (`[CLS]`), passed through a linear layer and a tanh activation. It serves as a fixed-size summary representation of the input sequence.
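
As a rough sketch of what the pooler computes, the snippet below applies a dense layer and tanh to the first token's hidden state. The hidden states here are random stand-ins rather than real BERT output, the dense layer is randomly initialised (in BERT it is learned), and the sizes are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 2, 5, 8  # illustrative sizes only

# Stand-in for BERT's last-layer hidden states: (batch, seq_len, hidden)
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

dense = nn.Linear(hidden_size, hidden_size)  # learned inside BERT, random here

cls_state = last_hidden_state[:, 0]           # hidden state of the first token
pooled_output = torch.tanh(dense(cls_state))  # (batch, hidden)

print(pooled_output.shape)
```

The result has one vector per input sequence, which is why it can be fed straight into a linear classification layer.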

```python
        pooled_output = self.dropout(pooled_output)
```
- This line applies dropout regularization to the pooled output obtained from BERT by passing it through the dropout layer (`self.dropout`). This helps prevent overfitting during training.

```python
        logits = self.classifier(pooled_output)
```
- This line passes the dropout output (`pooled_output`) through the linear classifier (`self.classifier`) to obtain the logits, which are unnormalized scores, one per class label. They are not probabilities themselves; applying a softmax converts them into predicted probabilities.
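
To illustrate the relationship between logits and probabilities, here is a small sketch with made-up logit values for three classes. It also checks that `nn.CrossEntropyLoss` works on the raw logits, which is why the model does not need to apply softmax itself.

```python
import torch
import torch.nn as nn

# Made-up logits for one example and three classes
logits = torch.tensor([[2.0, 0.5, -1.0]])

# Softmax turns unnormalized scores into probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)
predicted = probs.argmax(dim=-1)  # same index as logits.argmax(dim=-1)

# CrossEntropyLoss takes raw logits and applies log-softmax internally
labels = torch.tensor([0])
loss_from_logits = nn.CrossEntropyLoss()(logits, labels)
manual_loss = -torch.log(probs[0, 0])  # negative log-probability of the true class

print(probs, predicted, loss_from_logits, manual_loss)
```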

```python
        return logits
```
- Finally, this line returns the logits from the forward pass as the output of the `forward` method. These logits will be used to compute the loss and perform backpropagation during training.

Overall, the `TitleClassifier` class defines a neural network model for classifying input sequences into different categories using a pre-trained BERT model for feature extraction and a linear classifier for classification. The model applies dropout regularization to prevent overfitting and returns the logits for classification.

In [None]:
# Fine-tune the BERT Model 
num_labels = len(normalized_titles)
model = TitleClassifier(bert_model, num_labels)

# Define loss function and optimizer 
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

### Create a Training Loop 

This is where we fine-tune the model on the textual data. 
Each epoch is one full pass over the training data, during which the model's weights are updated batch by batch. 

In [None]:
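# Training Loop
# NOTE: num_epochs and train_loader are not defined in the cells above;
# train_loader is expected to yield (input_ids, attention_mask, labels) batches.

for epoch in range(num_epochs):
    model.train()  # training mode: dropout is active
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        optimizer.zero_grad()  # clear gradients from the previous step
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)
        loss.backward()  # backpropagate the loss
        optimizer.step()  # update the weights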
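
The loop above relies on `num_epochs` and `train_loader`, which do not appear in the cells shown. The sketch below shows one way they could be set up, using randomly generated token IDs and labels as hypothetical stand-in data. To keep it runnable without downloading BERT, a small `ToyClassifier` (an assumption, not this notebook's model) replaces `TitleClassifier`; it takes the same `(input_ids, attention_mask)` inputs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: 8 sequences of 6 token IDs, 3 classes
num_examples, seq_len, vocab_size, num_labels = 8, 6, 50, 3
input_ids = torch.randint(0, vocab_size, (num_examples, seq_len))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, num_labels, (num_examples,))

dataset = TensorDataset(input_ids, attention_mask, labels)
train_loader = DataLoader(dataset, batch_size=4, shuffle=True)


class ToyClassifier(nn.Module):
    """Stand-in for TitleClassifier: embedding, masked mean, linear layer."""

    def __init__(self, vocab_size, hidden_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.classifier(pooled)                # (batch, num_labels)


model = ToyClassifier(vocab_size, 16, num_labels)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

num_epochs = 3  # arbitrary value for the sketch
epoch_losses = []
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch_ids, batch_mask, batch_labels in train_loader:
        optimizer.zero_grad()
        logits = model(batch_ids, batch_mask)
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Average loss over the batches of this epoch
    epoch_losses.append(running_loss / len(train_loader))

print(epoch_losses)
```

With real data, the tensors would instead come from running the tokenizer over the note text and mapping each title to an integer label.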

In [33]:
def clean_text(text): 
    # replacing the underscores with placeholder names
    cleaned_text = re.sub(r'___', 'PythonSQL', text)
    