# Example of how to use Deep Learning for text classification

Click [Text Classification Example](https://colab.research.google.com/github/magic-lantern/coprh-nlp-2021/blob/main/text_classification_colab.ipynb) to open this notebook in Google Colab - no local setup required, results saved to your Google Drive.

In Google Colab, make sure you change the runtime to a GPU instance; this vastly improves run times for the deep learning steps. On CPU-only hardware, some steps take about 40 minutes per training epoch, versus about 1 minute on a system with a GPU. To change the runtime:

    Runtime > Change runtime type > Select 'GPU' in Hardware accelerator box

Click 'Save' to switch from the default runtime to the GPU-accelerated runtime.
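
To confirm that the runtime switch took effect, a quick check like the one below can be run in a code cell. This is a minimal sketch and not part of the original notebook: `describe_device` is a hypothetical helper name, and it assumes PyTorch is importable (it returns a message instead of failing if it is not).

```python
def describe_device():
    """Return a short description of the accelerator PyTorch can see."""
    try:
        import torch  # assumption: PyTorch is installed in the runtime
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        # Name of the first visible CUDA device
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; running on CPU"

print(describe_device())
```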

In [None]:
print('this is a new python cell')

this is a new markdown cell

In [None]:
# this cell makes sure Google Colab version has latest software dependencies
!pip install -Uqq fastbook fastai

In [None]:
import fastbook
import gc

In [None]:
# this cell sets up a path that lets you load and save files on your Google Drive
# you will be prompted to log in with a Google account.
fastbook.setup_book()

In [None]:
from fastai.text.all import *

In [None]:
Path.cwd()

In [None]:
data_path = fastbook.gdrive / Path('data')
data_file = data_path / 'mtsamples.csv'

In [None]:
data_file

In [None]:
if data_file.is_file():
    print('Already downloaded')
else:
    print('downloading data file')
    download_data(
        url='https://github.com/socd06/medical-nlp/raw/master/data/mtsamples.csv',
        fname = data_file
    )
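
The download-if-missing pattern above can also be sketched with only the Python standard library, which may help when running outside fastai. Everything here is an illustration under stated assumptions: `fetch_if_missing` is a hypothetical helper, not a fastai function, and the `downloader` parameter exists only so the download step can be swapped out.

```python
from pathlib import Path
from urllib.request import urlretrieve

def fetch_if_missing(url, dest, downloader=urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download ran, False if the file was already there.
    """
    dest = Path(dest)
    if dest.is_file():
        return False
    # Create the target folder (e.g. the 'data' folder) if needed
    dest.parent.mkdir(parents=True, exist_ok=True)
    downloader(url, str(dest))
    return True

# Usage in this notebook would look like (not executed here):
# fetch_if_missing(
#     'https://github.com/socd06/medical-nlp/raw/master/data/mtsamples.csv',
#     data_file)
```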

In [None]:
mtsamples = pd.read_csv(data_file)
mtsamples.head()

In [None]:
# find rows where number of words is very low

In [None]:
word_len = mtsamples.transcription.str.split(' ').str.len()

In [None]:
mtsamples['num_words'] = word_len

In [None]:
word_len.median()

In [None]:
word_len.min()

A note on data cleanup. In the best case, a machine learning model can only be as good as the data provided. If data is of poor quality or inconsistent, it will be difficult to get good accuracy with your model.

Let's review the really short transcription notes.

In [None]:
print(f"There are {len(mtsamples[mtsamples['num_words'] < 5])} records with short notes.")
mtsamples[mtsamples['num_words'] < 5].head()

In [None]:
# drop transcripts with fewer than 5 words - they are not good notes
mtsamples = mtsamples[mtsamples['num_words'] >= 5]
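
One detail of the word count used above: splitting on a single space, as in `split(' ')`, also counts the empty strings between repeated spaces, so notes with irregular spacing look longer than they are. The plain-Python sketch below shows the difference. `count_words` is a hypothetical helper, and the sample note is made up for illustration.

```python
def count_words(text):
    """Count whitespace-delimited tokens; non-strings (e.g. NaN) count as 0."""
    if not isinstance(text, str):
        return 0
    # split() with no argument collapses runs of whitespace
    return len(text.split())

note = "HISTORY:  Patient  seen today"  # made-up note with double spaces
print(len(note.split(' ')))  # 6: includes two empty strings
print(count_words(note))     # 4: actual words
```

If the stricter count is preferred, it could be applied with `mtsamples.transcription.apply(count_words)`; whether that changes which rows fall under the 5-word threshold depends on the data.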

In [None]:
mtsamples.medical_specialty.value_counts()

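The cell below shows one way to load the data for training. Its result is not assigned to a variable; the rest of the notebook builds its data loaders with the `DataBlock` API instead.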

In [None]:
TextDataLoaders.from_df(mtsamples,
                        valid_pct=0.2,
                        seed=42,
                        text_col='transcription',
                        label_col='medical_specialty',
                        seq_len=72)

In [None]:
# dls_lm = DataBlock(
#     blocks=(TextBlock.from_df('transcription', is_lm=True), CategoryBlock),
#     get_x=ColReader('transcription'), 
#     get_y=ColReader('medical_specialty'),
#     splitter=TrainTestSplitter(test_size=0.2,
#                                random_state=42,
#                                stratify=mtsamples.medical_specialty)
# )

In [None]:
dls_lm = DataBlock(
    blocks=TextBlock.from_df('transcription', is_lm=True),
    get_x=ColReader('text'),
    splitter=TrainTestSplitter(test_size=0.2,
                               random_state=42)
)

In [None]:
# if changes in the above were made (such as batch size), it may be beneficial to purge GPU memory
# gc.collect()
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()
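
The commented-out cleanup above can be wrapped in a small helper so it is easy to call after changing batch size or rebuilding a learner. This is a sketch, not part of the original notebook: `free_gpu_memory` is a hypothetical name, and the function skips the CUDA step when PyTorch or a GPU is not available.

```python
import gc

def free_gpu_memory():
    """Run garbage collection, then release cached CUDA memory if a GPU is visible.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    try:
        import torch  # assumption: PyTorch is installed in the runtime
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Hand cached, unused blocks back to the GPU driver
        torch.cuda.empty_cache()
    return collected

free_gpu_memory()
```

Note that memory still referenced by live objects (for example an existing learner) is not released; delete those references first.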

In [None]:
# This cell takes 1 - 2 minutes
# with seq_len=256, bs=128 fits in GPU memory; bs=256 does not
dls_lm = dls_lm.dataloaders(mtsamples, bs=128, seq_len=256)
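
The batch-size note above can be made concrete with a little arithmetic. As a rough rule of thumb (an assumption, not a measurement from this notebook), activation memory grows in proportion to the number of tokens in a batch, which is `bs * seq_len`. `tokens_per_batch` is a hypothetical helper used only for illustration.

```python
def tokens_per_batch(bs, seq_len):
    """Number of tokens the model processes in one batch."""
    return bs * seq_len

for bs in (128, 256):
    # seq_len=256 matches the dataloaders call above
    print(f"bs={bs}: {tokens_per_batch(bs, 256)} tokens per batch")
```

Doubling `bs` from 128 to 256 doubles the tokens per batch from 32768 to 65536. If memory runs out, lowering either `bs` or `seq_len` reduces the load.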

In [None]:
dls_lm.show_batch(max_n=2)

In [None]:
lm_learn = language_model_learner(
    dls_lm,
    AWD_LSTM,
    model_dir=fastbook.gdrive / Path('data'),
    drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
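
For reading the training output: perplexity is the exponential of the mean cross-entropy loss, so the two columns always move together. The sketch below shows the relationship in plain Python. The `perplexity` function here is a hypothetical stand-in for illustration, not fastai's `Perplexity` class.

```python
import math

def perplexity(mean_cross_entropy):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)

# A loss of ln(50) corresponds to a perplexity of 50: on average the model is
# as uncertain as if it were choosing uniformly among 50 tokens.
print(perplexity(math.log(50)))
```

Lower is better; a perplexity of 1 would mean the model predicts every next token with certainty.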

In [None]:
# on a MacBook (CPU) this takes about 40 minutes
# on GPU machine this takes about 1 minute
lm_learn.fit_one_cycle(1, 2e-2)

In [None]:
lm_learn.save('1stepoch')
# an already saved model can be reloaded with
# lm_learn.load('1stepoch')

In [None]:
# if changes in the above were made (such as batch size), it may be beneficial to purge GPU memory
# gc.collect()
# torch.cuda.empty_cache()
# gc.collect()
# torch.cuda.empty_cache()

In [None]:
lm_learn.unfreeze()
lm_learn.fit_one_cycle(10, 2e-3)

In [None]:
lm_learn.save_encoder('finetuned')
# save_encoder stores only the encoder weights, so reload them with
# lm_learn.load_encoder('finetuned')

## Now build a text classifier

Now that we have fine-tuned a language model on our own text, let's build a classifier to identify the medical specialty of each transcription note. As the specialty counts shown earlier indicate, some specialties are better represented than others, so the split is stratified by specialty: the training and validation sets keep roughly the same specialty proportions.
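
To illustrate what a stratified split does, here is a plain-Python sketch. It is not how `TrainTestSplitter` is implemented (fastai delegates that to scikit-learn's `train_test_split`). `stratified_split` is a hypothetical helper, and the label counts are toy values, not the real dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with about test_size of each label in the test set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Take the same fraction from every label
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced labels for illustration only
labels = ['Surgery'] * 50 + ['Radiology'] * 20 + ['Urology'] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Each specialty contributes 20% of its rows to the held-out set, so a rare specialty cannot end up entirely on one side of the split by chance.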

In [None]:
# this cell takes about 2 minutes to run
dls_clas = DataBlock(
    blocks=(TextBlock.from_df('transcription', vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('medical_specialty'),
    splitter=TrainTestSplitter(test_size=0.2,
                               random_state=42,
                               stratify=mtsamples.medical_specialty)
).dataloaders(mtsamples, bs=128, seq_len=256)

In [None]:
dls_clas.show_batch(max_n=2)

In [None]:
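learn = text_classifier_learner(
    dls_clas,
    AWD_LSTM,
    # same model_dir as the language model, so load_encoder can find 'finetuned'
    model_dir=fastbook.gdrive / Path('data'),
    drop_mult=0.5,
    metrics=accuracy).to_fp16()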

In [None]:
# load the encoder used for the language model - must be the same to build off language model.
learn = learn.load_encoder('finetuned')

In [None]:
learn.fit_one_cycle(1, 2e-2)