# Structuring and Cleaning

---

In this notebook I clean and structure the data using Python and a few
third-party libraries. The packages I use, along with some important notes on
them, are listed below.

## Important Notes 

- **pandas:** for loading and manipulating tabular data (DataFrames).
- **re:** for regex pattern matching.
- **torch:** tensor operations and automatic differentiation.
- **torch.nn:** neural network building blocks.
- **transformers:** provides pre-trained models and tokenizers, for example for sequence classification.

---

### nn.Module

In PyTorch, `nn.Module` is the base class for all neural network modules. It is a fundamental building block for creating neural network architectures. When you create a custom neural network in PyTorch, you typically subclass `nn.Module` and define the network's layers and operations within this class.

### nn.Dropout

The `nn.Dropout` layer is a regularization technique commonly used in neural networks to prevent overfitting. It works by randomly setting a fraction of input units to zero during training, which helps prevent the network from relying too much on any particular set of features. 

The argument `0.1` in `nn.Dropout(0.1)` specifies the probability of dropping out each neuron during training. In this case, `0.1` means that each neuron in the input will be set to zero with a probability of 0.1 (or 10%) during each forward pass through the network. 

During inference (i.e., when the model is used to make predictions), dropout is turned off. PyTorch's `nn.Dropout` keeps the expected output the same by scaling the surviving activations by `1 / (1 - p)` during training, so no extra scaling is needed at inference time: after calling `model.eval()`, the layer simply passes its input through unchanged.

In summary, `nn.Dropout(0.1)` introduces random dropout with a probability of 0.1 during training to prevent 
overfitting and improve the generalization ability of the neural network.
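
To make the train/eval difference concrete, here is a minimal pure-Python sketch of the "inverted dropout" idea. It is only an illustration of the behaviour, not PyTorch's implementation; the function name `inverted_dropout` and its arguments are my own.

```python
import random

def inverted_dropout(values, p, training, rng):
    """Illustrative dropout: zero each value with probability p while training."""
    if not training or p == 0.0:
        # Evaluation mode: the input passes through unchanged
        return list(values)
    # Scale the survivors so the expected value of each output stays the same
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0] * 10_000

train_out = inverted_dropout(activations, p=0.1, training=True, rng=rng)
eval_out = inverted_dropout(activations, p=0.1, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(f"dropped: {dropped_fraction:.3f}, mean while training: {train_mean:.3f}")
```

Roughly 10% of the values are zeroed while training, yet the mean stays close to 1.0 because the survivors are scaled up. In evaluation mode nothing changes.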

In [1]:
%pip install Flask
%pip install sqlalchemy
%pip install --upgrade transformers

Note: you may need to restart the kernel to use updated packages.


## Import Important Libraries

In [2]:
# NOTE: the UserWarning raised on import can be ignored; it is an internal PyTorch issue
# Import the libraries and packages used in this notebook

import re

import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertModel, BertTokenizer, BioGptForCausalLM
from flask import Flask, render_template
# from flask_sqlalchemy import SQLAlchemy

## Raw Data

--- 

Read in the data from the `discharge.csv` file, then clean it by removing stray `\n` characters, runs of underscores (`___` or longer) and `#` symbols.

In [3]:
# Step 1: Read the Data
df = pd.read_csv("discharge.csv", nrows=50)

In [4]:
df.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text
0,10000032-DS-21,10000032,22595853,DS,21,2180-05-07 00:00:00,2180-05-09 15:26:00,\nName: ___ Unit No: _...
1,10000032-DS-22,10000032,22841357,DS,22,2180-06-27 00:00:00,2180-07-01 10:15:00,\nName: ___ Unit No: _...
2,10000032-DS-23,10000032,29079034,DS,23,2180-07-25 00:00:00,2180-07-25 21:42:00,\nName: ___ Unit No: _...
3,10000032-DS-24,10000032,25742920,DS,24,2180-08-07 00:00:00,2180-08-10 05:43:00,\nName: ___ Unit No: _...
4,10000084-DS-17,10000084,23052089,DS,17,2160-11-25 00:00:00,2160-11-25 15:09:00,\nName: ___ Unit No: __...


### Cleaning 2.0

At this stage we have structured the subsection titles/labels. Next we need to clean the messy data by
removing special symbols such as `#`, `[]` and `___`.

To do this, let's create a `clean_text` function that deals with these nuances and apply it to every row.

In [5]:
def clean_text(text):
    # Placeholder name that stands in for the '___' blanks
    placeholder = 'Peatah Puhkah'

    # Map each symbol to its replacement: '#' and '[]' are removed,
    # '___' becomes the placeholder name and newlines become spaces
    replacements = {'#': '', '[]': '', '___': placeholder, '\n': ' '}

    # Loop over each symbol and replace it with what we want
    for symbol, replacement in replacements.items():
        text = text.replace(symbol, replacement)

    # Trim leading and trailing whitespace
    return text.strip()

# Apply changes to the dataframe
df['text'] = df['text'].apply(clean_text)

# Check application successful 
print(df['text'][0])

Name:  Peatah Puhkah                     Unit No:   Peatah Puhkah   Admission Date:  Peatah Puhkah              Discharge Date:   Peatah Puhkah   Date of Birth:  Peatah Puhkah             Sex:   F   Service: MEDICINE   Allergies:  No Known Allergies / Adverse Drug Reactions   Attending: Peatah Puhkah   Chief Complaint: Worsening ABD distension and pain    Major Surgical or Invasive Procedure: Paracentesis    History of Present Illness: Peatah Puhkah HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD,  bioplar, PTSD, presented from OSH ED with worsening abd  distension over past week.   Pt reports self-discontinuing lasix and spirnolactone Peatah Puhkah weeks  ago, because she feels like "they don't do anything" and that  she "doesn't want to put more chemicals in her." She does not  follow Na-restricted diets. In the past week, she notes that she  has been having worsening abd distension and discomfort. She  denies Peatah Puhkah edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, 

In [6]:
# Checking the data.
for i in range(3):
    print(df['text'][i])

Name:  Peatah Puhkah                     Unit No:   Peatah Puhkah   Admission Date:  Peatah Puhkah              Discharge Date:   Peatah Puhkah   Date of Birth:  Peatah Puhkah             Sex:   F   Service: MEDICINE   Allergies:  No Known Allergies / Adverse Drug Reactions   Attending: Peatah Puhkah   Chief Complaint: Worsening ABD distension and pain    Major Surgical or Invasive Procedure: Paracentesis    History of Present Illness: Peatah Puhkah HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD,  bioplar, PTSD, presented from OSH ED with worsening abd  distension over past week.   Pt reports self-discontinuing lasix and spirnolactone Peatah Puhkah weeks  ago, because she feels like "they don't do anything" and that  she "doesn't want to put more chemicals in her." She does not  follow Na-restricted diets. In the past week, she notes that she  has been having worsening abd distension and discomfort. She  denies Peatah Puhkah edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, 

## Regex 

---

Here is where we create the regex to pattern match the desired titles so that we can normalise them and extract 
the subsection titles and their corresponding content.

**Past Medical History - Regex:**
- `(?i)`: A flag that makes the pattern case-insensitive, so it matches regardless of whether the text is in uppercase, lowercase, or mixed case.
- `past\s+medical\s+history`: Matches the subsection title "Past Medical History". Each `\s+` matches one or more whitespace characters (space, tab, etc.) between the words.
- `.*?`: Matches any characters in a non-greedy way until it encounters the next subsection title or the end of the text. Note that `.` only matches newline characters when the `re.DOTALL` flag is set.
- `(?=(?:^\s*\w+:|\Z))`: A positive lookahead assertion that checks for either the start of a new line followed by optional whitespace and a word followed by a colon (indicating a new subsection title), or the end of the text (`\Z`). The `^` only matches at the start of each line when the `re.MULTILINE` flag is set.

**Family History:**
- `(?i)`: This flag makes the pattern case-insensitive.
- `family\s+history`: Matches the subsection title "Family History" in any casing.
- `.*?` and `(?=(?:^\s*\w+:|\Z))`: These behave exactly as described for Past Medical History.

The backslashes are doubled (`\\s`, `\\w`, `\\Z`) only when the pattern is written in a non-raw Python string. The patterns defined in the next cell are simpler, case-sensitive variants of this idea.
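
As a quick check of the pieces described above, here is a small sketch that assembles them into one pattern. The colon after the title, the capture group and the sample note are my own assumptions for illustration; they are not taken from the dataset.

```python
import re

# Assumed pattern, assembled from the pieces described above
PAST_MEDICAL_HISTORY = re.compile(
    r"(?i)past\s+medical\s+history:(.*?)(?=^\s*\w+:|\Z)",
    re.DOTALL | re.MULTILINE,  # let '.' cross newlines and '^' match at line starts
)

# Made-up note used only for this example
sample_note = (
    "PAST MEDICAL HISTORY:\n"
    "1. HCV Cirrhosis\n"
    "2. COPD\n"
    "Allergies: Percocet\n"
)

match = PAST_MEDICAL_HISTORY.search(sample_note)
past_medical_history = match.group(1).strip()
print(past_medical_history)
```

The lazy `.*?` keeps consuming lines until the lookahead sees `Allergies:` at the start of a line, so only the two numbered items are captured.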

In [7]:
# Define title patterns
chief_complaint_pattern = r'Chief Complaint:(.*?)(?=(?:[A-Z][a-z]*\s*:|$))'
his_of_pres_ill_pattern = r'History of Present Illness:(.*?)(?=(?:[A-Z][a-z]*\s*:|$))'
family_hist_pattern = r'Family History:(.*?)(?=(?:[A-Z][a-z]*\s*:|$))'
past_med_hist_pattern = r'Past Medical History:(.*?)(?=(?:[A-Z][a-z]*\s*:|$))'

desired_titles = [chief_complaint_pattern, his_of_pres_ill_pattern, family_hist_pattern, past_med_hist_pattern]

# Function to extract and normalize titles
def extract_normalize_titles(text):
    # Clean up once: turn newlines into spaces and collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text.strip())

    for desired_title in desired_titles:
        # findall returns the captured group, i.e. the content that follows each title
        matches = re.findall(desired_title, text)

        # Deduplicate the matches while keeping their original order
        for match in dict.fromkeys(matches):
            pattern = re.compile(r'\b' + re.escape(match) + r'\b', re.IGNORECASE)
            # Replace every case variant with the first-seen version;
            # a function replacement stops backslashes being read as escapes
            text = pattern.sub(lambda _m, match=match: match, text)
    return text
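
Note that `re.findall` with these patterns returns the captured content rather than the titles themselves. If the goal is to force the titles into one canonical spelling, a sketch along the following lines might work. The `CANONICAL_TITLES` list, the function name `normalise_titles` and the sample string are assumptions made for this example only.

```python
import re

# Assumed canonical spellings of the titles we care about
CANONICAL_TITLES = [
    "Chief Complaint",
    "History of Present Illness",
    "Family History",
    "Past Medical History",
]

def normalise_titles(text, titles=CANONICAL_TITLES):
    """Rewrite any casing/spacing variant of a known title to its canonical form."""
    for title in titles:
        # Allow any run of whitespace between the words and before the colon
        words = [re.escape(word) for word in title.split()]
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\s*:", re.IGNORECASE)
        text = pattern.sub(title + ":", text)
    return text

# Made-up input with inconsistent casing and spacing
sample = "CHIEF COMPLAINT: abdominal pain PAST  MEDICAL history : HCV"
normalised = normalise_titles(sample)
print(normalised)
```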


In [8]:
# Test that the function works (a single call is enough)
test_sample = df['text'][1]
test_sample_normalised = extract_normalize_titles(test_sample)
print(test_sample_normalised)

Name: Peatah Puhkah Unit No: Peatah Puhkah Admission Date: Peatah Puhkah Discharge Date: Peatah Puhkah Date of Birth: Peatah Puhkah Sex: F Service: MEDICINE Allergies: Percocet Attending: Peatah Puhkah. Chief Complaint: abdominal fullness and discomfort Major Surgical or Invasive Procedure: Peatah Puhkah diagnostic paracentesis Peatah Puhkah therapeutic paracentesis History of Present Illness: Peatah Puhkah with HIV on HAART, COPD, HCV cirrhosis complicated by ascites and HE admitted with abdominal distention and pain. She was admitted to Peatah Puhkah for the same symptoms recently and had 3L fluid removed (no SBP) three days ago and felt better. Since discharge, her abdomen has become increasingly distended with pain. This feels similar to prior episodes of ascites. Her diuretics were recently decreased on Peatah Puhkah due to worsening hyponatremia 128 and hyperkalemia 5.1. Patient states she has been compliant with her HIV and diuretic medications but never filled out the lactulose

In [9]:
# Apply extract_normalize_titles(text) to the df
df['text'] = df['text'].apply(extract_normalize_titles)

In [10]:
# Print the first 10 rows to check the result
for i in range(10): 
    print(df['text'][i], '\n')

Name: Peatah Puhkah Unit No: Peatah Puhkah Admission Date: Peatah Puhkah Discharge Date: Peatah Puhkah Date of Birth: Peatah Puhkah Sex: F Service: MEDICINE Allergies: No Known Allergies / Adverse Drug Reactions Attending: Peatah Puhkah Chief Complaint: Worsening ABD distension and pain Major Surgical or Invasive Procedure: Paracentesis History of Present Illness: Peatah Puhkah HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD, bioplar, PTSD, presented from OSH ED with worsening abd distension over past week. Pt reports self-discontinuing lasix and spirnolactone Peatah Puhkah weeks ago, because she feels like "they don't do anything" and that she "doesn't want to put more chemicals in her." She does not follow Na-restricted diets. In the past week, she notes that she has been having worsening abd distension and discomfort. She denies Peatah Puhkah edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, dysuria. She had food poisoning a week ago from eating stale cake (n/v 20 min afte

### Extract Subsections

At this point I think it's a good idea to extract the subsections.
However, I don't think the data has been cleaned well enough, so I am going to 
continue filtering and structuring the data above.

**Update 1:** 

I have now managed to (mostly) normalise the subsection titles, so extraction should be straightforward.


**Update 2:** 

I should apply the above code to the entire dataframe and create a new column so that
I can extract titles and content together. 

**Update 3:**

Extracting subsections for the desired titles.

In [11]:
# Map each subsection title to its regex pattern
# (a new name, so the desired_titles list used by extract_normalize_titles is not overwritten)
section_patterns = {
    "Chief Complaint": chief_complaint_pattern,
    "History of Present Illness": his_of_pres_ill_pattern,
    "Family History": family_hist_pattern,
    "Past Medical History": past_med_hist_pattern
}

# Function to extract the subsections and their corresponding content
def extract_subsections(text): 
    subsections = {}
    for title, pattern in section_patterns.items():
        matches = re.search(pattern, text, re.DOTALL)
        if matches:
            content = matches.group(1).strip()
            print(f"Title: {title}, Content: {content}")
            subsections[title] = content
    return subsections


extracted_subsections = extract_subsections(df['text'][0])
print(extracted_subsections)

Title: Chief Complaint, Content: Worsening ABD distension and pain Major Surgical or Invasive
Title: History of Present Illness, Content: Peatah Puhkah HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD, bioplar, PTSD, presented from OSH ED with worsening abd distension over past week. Pt reports self-discontinuing lasix and spirnolactone Peatah Puhkah weeks ago, because she feels like "they don't do anything" and that she "doesn't want to put more chemicals in her." She does not follow Na-restricted diets. In the past week, she notes that she has been having worsening abd distension and discomfort. She denies Peatah Puhkah edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, dysuria. She had food poisoning a week ago from eating stale cake (n/v 20 min after food ingestion), which resolved the same day. She denies other recent illness or sick contacts. She notes that she has been noticing gum bleeding while brushing her teeth in recent weeks. she denies easy bruising, melena, BRBPR
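
In the output above, the Chief Complaint content ends with "Major Surgical or Invasive". The lookahead `[A-Z][a-z]*\s*:` only recognises a single capitalised word before the colon, so the match stops at "Procedure:" instead of at the start of the full title. One possible fix, sketched below, is to stop at any title from a known list. The `KNOWN_HEADERS` list and `build_section_pattern` are hypothetical, and the real notes probably contain more headers than listed here.

```python
import re

# Hypothetical list of section headers; extend it to cover the real notes
KNOWN_HEADERS = [
    "Chief Complaint",
    "Major Surgical or Invasive Procedure",
    "History of Present Illness",
    "Past Medical History",
    "Family History",
]

def build_section_pattern(title, headers=KNOWN_HEADERS):
    """Capture the text after `title:` up to the next known header or the end."""
    others = "|".join(re.escape(h) for h in headers if h != title)
    return re.compile(
        rf"{re.escape(title)}:(.*?)(?=(?:{others}):|\Z)",
        re.DOTALL,
    )

sample = (
    "Chief Complaint: Worsening ABD distension and pain "
    "Major Surgical or Invasive Procedure: Paracentesis "
    "History of Present Illness: HCV cirrhosis"
)

match = build_section_pattern("Chief Complaint").search(sample)
chief_complaint = match.group(1).strip()
print(chief_complaint)
```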

After testing `extract_subsections` on a single row, we now apply it to every row of text and add each subsection
as a new column of the dataframe. 

In [12]:
# Define function to apply subsection extraction and df mutation 
def apply_subsection_extraction():
    for index, row in df.iterrows():
        subsections = extract_subsections(row['text'])
        for title, content in subsections.items():
            print(f"Title: {title}, content: {content}")
            df.loc[index, title] = content
            
apply_subsection_extraction()

Title: Chief Complaint, Content: Worsening ABD distension and pain Major Surgical or Invasive
Title: History of Present Illness, Content: Peatah Puhkah HCV cirrhosis c/b ascites, hiv on ART, h/o IVDU, COPD, bioplar, PTSD, presented from OSH ED with worsening abd distension over past week. Pt reports self-discontinuing lasix and spirnolactone Peatah Puhkah weeks ago, because she feels like "they don't do anything" and that she "doesn't want to put more chemicals in her." She does not follow Na-restricted diets. In the past week, she notes that she has been having worsening abd distension and discomfort. She denies Peatah Puhkah edema, or SOB, or orthopnea. She denies f/c/n/v, d/c, dysuria. She had food poisoning a week ago from eating stale cake (n/v 20 min after food ingestion), which resolved the same day. She denies other recent illness or sick contacts. She notes that she has been noticing gum bleeding while brushing her teeth in recent weeks. she denies easy bruising, melena, BRBPR

In [13]:
df.head()

Unnamed: 0,note_id,subject_id,hadm_id,note_type,note_seq,charttime,storetime,text,Chief Complaint,History of Present Illness,Family History,Past Medical History
0,10000032-DS-21,10000032,22595853,DS,21,2180-05-07 00:00:00,2180-05-09 15:26:00,Name: Peatah Puhkah Unit No: Peatah Puhkah Adm...,Worsening ABD distension and pain Major Surgic...,"Peatah Puhkah HCV cirrhosis c/b ascites, hiv o...","She a total of five siblings, but she is not t...",1. HCV Cirrhosis 2. No history of abnormal Pap...
1,10000032-DS-22,10000032,22841357,DS,22,2180-06-27 00:00:00,2180-07-01 10:15:00,Name: Peatah Puhkah Unit No: Peatah Puhkah Adm...,abdominal fullness and discomfort Major Surgic...,"Peatah Puhkah with HIV on HAART, COPD, HCV cir...","She a total of five siblings, but she is not t...",1. HCV Cirrhosis 2. No history of abnormal Pap...
2,10000032-DS-23,10000032,29079034,DS,23,2180-07-25 00:00:00,2180-07-25 21:42:00,Name: Peatah Puhkah Unit No: Peatah Puhkah Adm...,altered mental status Major Surgical or Invasive,Mrs. Peatah Puhkah is a Peatah Puhkah female w...,"She a total of five siblings, but she is not t...",- HCV
3,10000032-DS-24,10000032,25742920,DS,24,2180-08-07 00:00:00,2180-08-10 05:43:00,Name: Peatah Puhkah Unit No: Peatah Puhkah Adm...,Abdominal pain Major Surgical or Invasive,"Peatah Puhkah w/ HIV on HAART, COPD on 3L home...","She a total of five siblings, but she is not t...",- HCV
4,10000084-DS-17,10000084,23052089,DS,17,2160-11-25 00:00:00,2160-11-25 15:09:00,Name: Peatah Puhkah Unit No: Peatah Puhkah Adm...,Visual hallucinations Major Surgical or Invasive,"Peatah Puhkah male with Peatah Puhkah disease,...","His mother died at age Peatah Puhkah of ""old a...",Peatah Puhkah disease Peatah Puhkah Body Demen...


In [14]:
print(list(df.columns))

['note_id', 'subject_id', 'hadm_id', 'note_type', 'note_seq', 'charttime', 'storetime', 'text', 'Chief Complaint', 'History of Present Illness', 'Family History', 'Past Medical History']


In [15]:
print(df['History of Present Illness'][20])

Peatah Puhkah year old female with history of HTN, HLD, hx of CVA, CAD s/p BMS to circumflex and POBA (Peatah Puhkah), on Aspirin and Plavix, CHF (EF 45% in Peatah Puhkah, diabetes, presenting with acute onset shortness of breath and substernal chest tightness since Peatah Puhkah evening. Patient notes that Peatah Puhkah evening had a large seafood dinner which is not usual for her, and then later around 10pm had acute onset of SOB feeling like she could not take deep breaths with chest tightness (patient notes this is her "angina"). Denies pleuritic component to CP, described as central and across lower rib cage, persistent since onset, no radiation to the shoulders/jaw/back, no diaphoreses. Worsens with activity, improves somewhat with rest. Patient does not it feels like other episodes when she then required her stent placement. Took a SLNx1, which improved her symptoms though these persisted, but almost immediately led to abdominal discomfort with vomiting x1, nonbloody with dinner

# Finished structuring up to this point @Ovyl

### Next Steps

---

At this stage the subsections have been extracted into their own columns, so all that is left is to summarise each of the
subsections (the new columns). This will allow us to export a cleaned summary to a CSV (comma-separated values) file.

**Summarise using BioGPT:**
1. Add import statements to the import cell.
2. Create a function to summarise text. 
3. Run this summary on every row of the extracted subsection columns (the columns after `text`).

In [16]:
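# BioGPT: create the tokenizer and model once, rather than on every call
biogpt_tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
biogpt_model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

def summarise_using_biogpt(text, max_new_tokens=200):
    # Tokenise the prompt, leaving room for the generated tokens
    # inside the 1024-token limit used for the input
    inputs = biogpt_tokenizer("summarise: " + text, return_tensors='pt',
                              max_length=1024 - max_new_tokens, truncation=True)

    # Generate with beam search; no gradients are needed at inference time
    with torch.no_grad():
        summary_ids = biogpt_model.generate(**inputs, max_new_tokens=max_new_tokens,
                                            length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = summary_ids[0][inputs['input_ids'].shape[1]:]
    return biogpt_tokenizer.decode(new_tokens, skip_special_tokens=True)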

### Speeding up the process using Pool

Import `Pool` from `multiprocessing` so the columns can be summarised in parallel.

In [17]:
from multiprocessing import Pool

In [None]:
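def apply_summarisation(values):
    # Runs in a worker process: summarise every entry of one column
    return [summarise_using_biogpt(text) for text in values]

num_processes = 4

# The extracted subsection columns (everything after 'text')
subsection_columns = ["Chief Complaint", "History of Present Illness",
                      "Family History", "Past Medical History"]
column_values = [df[column].fillna("").tolist() for column in subsection_columns]

# NOTE: with the 'spawn' start method (the default on macOS and Windows), workers
# cannot load functions defined inside a notebook, which causes the AttributeError
# shown below; the worker function needs to live in an importable .py module.
with Pool(num_processes) as pool:
    # Summarise each column in parallel and collect the results in the parent
    results = pool.map(apply_summarisation, column_values)

# Workers operate on copies of the data, so write the results back here
for column, summaries in zip(subsection_columns, results):
    df[column] = summaries

print(df.head())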

Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "/Users/alex/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/alex/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/alex/anaconda3/lib/python3.11/multiprocessing/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/Users/alex/anaconda3/lib/python3.11/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'apply_summarisation' on <module '__main__' (built-in)>
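
The error above comes from the worker processes failing to locate a function that only exists inside the notebook session. One workaround I could try is a thread pool: threads share the notebook's memory, so nothing has to be pickled. Whether this is actually faster depends on how much of the work releases the GIL, so treat it as a sketch. `placeholder_summarise` and the sample texts are made-up stand-ins for `summarise_using_biogpt` and the real columns.

```python
from concurrent.futures import ThreadPoolExecutor

def placeholder_summarise(text):
    # Hypothetical stand-in for summarise_using_biogpt: keeps the first five words
    return " ".join(text.split()[:5])

def summarise_column(item):
    # item is a (column name, list of texts) pair
    column, values = item
    return column, [placeholder_summarise(value) for value in values]

# Toy data standing in for the subsection columns of the dataframe
columns = {
    "Chief Complaint": ["Worsening ABD distension and pain over the past week"],
    "Family History": ["No family history of liver disease was reported by the patient"],
}

with ThreadPoolExecutor(max_workers=4) as executor:
    # map preserves input order, so each result lines up with its column
    summarised = dict(executor.map(summarise_column, columns.items()))

print(summarised)
```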


## Extracting Subsection Titles and Content

**Steps:**
1. Create a function to extract the subsections and their corresponding content.
2. Make a function that finds the row and column to enter the data. 
3. Make a function that inserts the data into that position in the dataframe.

In [None]:
# Set up model parameters
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)

In [None]:
# Create class for Title Classification

class TitleClassifier(nn.Module):
    def __init__(self, bert_model, num_labels):
        super(TitleClassifier, self).__init__()
        self.bert = bert_model
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
        
    def forward(self, input_ids, attention_mask): 
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

Let's break down each line of the code to understand what it does:

```python
class TitleClassifier(nn.Module):
```
- This line defines a new class called `TitleClassifier`, which is a subclass of `nn.Module`. This class will serve as our title classifier neural network model.

```python
    def __init__(self, bert_model, num_labels):
```
- This line defines the constructor method `__init__()` for the `TitleClassifier` class. It initializes the class with two parameters: `bert_model` and `num_labels`.
- `bert_model`: This parameter is the pre-trained BERT model that will be used for classification.
- `num_labels`: This parameter specifies the number of output labels for classification.

```python
        super(TitleClassifier, self).__init__()
```
- This line calls the constructor of the superclass `nn.Module` to initialize the `TitleClassifier` class.

```python
        self.bert = bert_model
```
- This line assigns the pre-trained BERT model (`bert_model`) to the `bert` attribute of the `TitleClassifier` class. This allows the classifier to use BERT's pre-trained weights and layers.

```python
        self.dropout = nn.Dropout(0.1)
```
- This line creates a dropout layer with a dropout probability of 0.1 (10%) and assigns it to the `dropout` attribute of the `TitleClassifier` class. Dropout is a regularization technique used to prevent overfitting by randomly setting some output features to zero during training.

```python
        self.classifier = nn.Linear(bert_model.config.hidden_size, num_labels)
```
- This line creates a fully connected linear layer (`nn.Linear`) with input size equal to the hidden size of the BERT model (`bert_model.config.hidden_size`) and output size equal to the number of labels (`num_labels`). This layer will be used to classify the input into different categories.

```python
    def forward(self, input_ids, attention_mask): 
```
- This line defines the forward method for the `TitleClassifier` class. The forward method specifies how input data should be processed through the neural network during the forward pass.

```python
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
```
- This line passes the input token IDs (`input_ids`) and attention mask (`attention_mask`) to the pre-trained BERT model (`self.bert`) and obtains the model outputs, including the hidden states and pooled output.

```python
        pooled_output = outputs.pooler_output
```
- This line extracts the pooled output from the BERT model outputs. For `BertModel`, the pooled output is the final hidden state of the first token (`[CLS]`), passed through a linear layer and a tanh activation. It serves as a summary representation of the input sequence.

```python
        pooled_output = self.dropout(pooled_output)
```
- This line applies dropout regularization to the pooled output obtained from BERT by passing it through the dropout layer (`self.dropout`). This helps prevent overfitting during training.

```python
        logits = self.classifier(pooled_output)
```
- This line passes the pooled output (after dropout) through the linear classifier (`self.classifier`) to obtain the logits: unnormalized scores, one per class label, that a softmax would turn into predicted probabilities.

```python
        return logits
```
- Finally, this line returns the logits from the forward pass as the output of the `forward` method. These logits will be used to compute the loss and perform backpropagation during training.

Overall, the `TitleClassifier` class defines a neural network model for classifying input sequences into different categories using a pre-trained BERT model for feature extraction and a linear classifier for classification. The model applies dropout regularization to prevent overfitting and returns the logits for classification.
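
To make the step from logits to probabilities and loss concrete, here is a small pure-Python sketch. It mirrors what `nn.CrossEntropyLoss` computes for a single example (a log-softmax followed by the negative log-likelihood of the target class), but it is an illustration with made-up logits, not the PyTorch implementation.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-probability assigned to the correct class
    return -math.log(softmax(logits)[target_index])

# Made-up logits for four labels, e.g. the four desired titles
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
predicted = probs.index(max(probs))
loss = cross_entropy(logits, target_index=0)

print([round(p, 3) for p in probs], predicted, round(loss, 3))
```

The loss is small when the target class already has the largest logit and grows as the model puts more probability on the wrong labels.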

In [None]:
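# Fine-tune the BERT Model 
# One label per desired subsection title (normalized_titles only exists inside
# extract_normalize_titles, so it cannot be used here)
num_labels = len(desired_titles)
model = TitleClassifier(bert_model, num_labels)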

# Define loss function and optimizer 
criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

### Create a Training Loop 

This is where we fine-tune the model on the textual data. 
Each epoch is one full pass over the training data, and we train for several epochs so the model learns to classify the sequences. 

In [None]:
# Training Loop

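# Train for 5 epochs (full passes over the training data)
num_epochs = 5

# NOTE: train_loader is not defined in this notebook yet; it is assumed to be a
# DataLoader that yields (input_ids, attention_mask, labels) batches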

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        input_ids, attention_mask, labels = batch
        optimizer.zero_grad() 
        logits = model(input_ids, attention_mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

In [None]:
def clean_text(text): 
    # replacing the underscores with placeholder names
    cleaned_text = re.sub(r'___', 'PythonSQL', text)
    return cleaned_text