<a href="https://colab.research.google.com/github/petroslamb/my-things/blob/master/Alzheimer's_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Part of this notebook is based on work from Jay Alammar*

In [1]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/13/33/ffb67897a6985a7b7d8e5e7878c3628678f553634bd3836404fef06ef19b/transformers-2.5.1-py3-none-any.whl (499kB)
[K     |▋                               | 10kB 16.5MB/s eta 0:00:01[K     |█▎                              | 20kB 2.2MB/s eta 0:00:01[K     |██                              | 30kB 3.2MB/s eta 0:00:01[K     |██▋                             | 40kB 2.1MB/s eta 0:00:01[K     |███▎                            | 51kB 2.6MB/s eta 0:00:01[K     |████                            | 61kB 3.1MB/s eta 0:00:01[K     |████▋                           | 71kB 3.6MB/s eta 0:00:01[K     |█████▎                          | 81kB 4.0MB/s eta 0:00:01[K     |██████                          | 92kB 4.5MB/s eta 0:00:01[K     |██████▋                         | 102kB 3.4MB/s eta 0:00:01[K     |███████▏                        | 112kB 3.4MB/s eta 0:00:01[K     |███████▉                        | 122kB 3.4M

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [16]:
# Make data directory if it doesn't exist
!mkdir -p data
!unzip -n -d data ./AssignmentData.zip


Archive:  ./AssignmentData.zip


In [8]:
!ls ./data/AssignmentData/Assigment/AD/

01.txt	04.txt	07.txt	10.txt	13.txt	16.txt	19.txt	22.txt	25.txt	28.txt
02.txt	05.txt	08.txt	11.txt	14.txt	17.txt	20.txt	23.txt	26.txt	29.txt
03.txt	06.txt	09.txt	12.txt	15.txt	18.txt	21.txt	24.txt	27.txt	30.txt


Load data and create dataframe for those tested as AD positive.

In [17]:
import glob
import os

ad_path = './data/AssignmentData/Assigment/AD'
ad_filenames = glob.glob(os.path.join(ad_path, '*.txt'))
ad_data = []

for fn in ad_filenames:
  with open(fn, 'r') as f:
    ad_data.append(f.read().splitlines())

# See the form of the imported data
print(ad_data)
# Count the number of words per patient file
print([sum([len(s.split(' ')) for s in p]) for p in ad_data])

[['oh boy .', 'alright .', 'family is in the kitchen .', "the mother's washing dishes .", 'and her sink is overflowing .', "and she's looking out the window .", "and the two kids are taking they're stealing cookies off the out_of the cupboard .", "and the boy looks like he's gonna  fall down and hurt himself or fall against his mother .", "and the girl is whispering “don't make too much noise” to him .", "she's &let or else she's laughin at him .", 'they got the cookies .', "alright now though the window, let's see .", "there's a nice look outside, real nice .", 'I told you the water was running over and splashing onto the floor .', "and the mother doesn't seem too too affected by it .", "she's dryin a dish or wiping it .", "let's see .", "I guess the girl is laughing at her brother because he's going to fall .", 'looks like a nice house .', "there is a little bit of very little  but I don't think that's meant for this .", 'the corner that got to the corner .', 'so this is a corner her

All the above passages are within the 512 Bert input limit. Which is good as we can use Bert easily.

In [23]:
import pandas as pd

ad_data_concat = [(" ").join(s) for s in ad_data]
print(ad_data_concat)

ad_df = pd.DataFrame(ad_data_concat, columns=['text'])
ad_df['label'] = pd.Series([1 for x in range(len(ad_df.index))], index=ad_df.index)
ad_df

["oh boy . alright . family is in the kitchen . the mother's washing dishes . and her sink is overflowing . and she's looking out the window . and the two kids are taking they're stealing cookies off the out_of the cupboard . and the boy looks like he's gonna  fall down and hurt himself or fall against his mother . and the girl is whispering “don't make too much noise” to him . she's &let or else she's laughin at him . they got the cookies . alright now though the window, let's see . there's a nice look outside, real nice . I told you the water was running over and splashing onto the floor . and the mother doesn't seem too too affected by it . she's dryin a dish or wiping it . let's see . I guess the girl is laughing at her brother because he's going to fall . looks like a nice house . there is a little bit of very little  but I don't think that's meant for this . the corner that got to the corner . so this is a corner here . and that goes back into there but that's  . do you see what 

Unnamed: 0,text,label
0,oh boy . alright . family is in the kitchen . ...,1
1,what's happening there ? oh my . poor kids ...,1
2,oh little boy's in the cookie jar . the girl's...,1
3,there's a little girl reaching for the cookie...,1
4,here's a cookie jar . and the lid is off the c...,1
5,first of all the little girl's saying . and a...,1
6,look down to talk to you ? the little girl ...,1
7,can I look at it and tell you ? oh okay . w...,1
8,alright . I see the little boy stealing cookie...,1
9,there's a little girl . and a little boy stand...,1


Let's do the same for the negative samples with label 0 this time.

In [20]:
import glob
import os

nc_path = './data/AssignmentData/Assigment/NC'
nc_filenames = glob.glob(os.path.join(nc_path, '*.txt'))
nc_data = []

for fn in nc_filenames:
  with open(fn, 'r') as f:
    nc_data.append(f.read().splitlines())

# See the form of the imported data
print(nc_data)
# Count the number of words per patient file
print([sum([len(s.split(' ')) for s in p]) for p in nc_data])

[['a boy is getting cookies out of the cookie jar . ', "he's standing on a stool that's gonna fall . ", 'the girl is reaching for a cookie . ', "the mother's drying dishes . ", "the faucet's running water . ", "it's dripping out of the sink . ", 'spilling onto the floor . ', 'dishes are on the counter . ', 'window is open .', 'must be summertime . ', 'the girl is laughing . ', "looks like she's laughing . ", "that's about it . "], ["the boy's getting cookies out o  the cookie jar . ", "he's handing one to a girl . ", "the the stool he's standing on is falling . ", "the lady's drying dishes . ", 'the sink is running over . ', "the water's turned on full . ", 'cups are sitting on the counter, plates sitting on the counter . ', "puddle of water's on the floor . ", 'little girl is sayin  . ', "don't tell anybody  . ", "and the cookie jar looks like it's ready to fall out . ", 'and the cookie jar is full, clear full . ', "that's about all I see that's goin  on .", '', ' '], ['oh goody . ', 

In [0]:
import pandas as pd

nc_data_concat = [(" ").join(s) for s in nc_data]
print(nc_data_concat)

nc_df = pd.DataFrame(nc_data_concat, columns=['text'])
nc_df['label'] = pd.Series([0 for x in range(len(nc_df.index))], index=nc_df.index)
nc_df

In [0]:
batch_1 = ad_df.append(nc_df, ignore_index=True)

We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled "negative" (having the value 0)

In [30]:
batch_1['label'].value_counts()

1    30
0    30
Name: label, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [31]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

HBox(children=(IntProgress(value=0, description='Downloading', max=231508, style=ProgressStyle(description_wid…




HBox(children=(IntProgress(value=0, description='Downloading', max=546, style=ProgressStyle(description_width=…




HBox(children=(IntProgress(value=0, description='Downloading', max=267967963, style=ProgressStyle(description_…




Right now, the variable `model` holds a pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to so some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into word and subwords in the format BERT is comfortable with.

In [0]:
tokenized = batch_1['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentences is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [0]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Our dataset is now in the `padded` variable, we can view its dimensions below:

In [0]:
np.array(padded).shape

(2000, 59)

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="http://127.0.0.1:4000/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Let's slice only the part of the output that we need. That is the output corresponding the first token of each sentence. The way BERT does sentence classification, is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

We'll save those in the `features` variable, as they'll serve as the features to our logitics regression model.

In [0]:
features = last_hidden_states[0][:,0,:].numpy()

The labels indicating which sentence is positive and negative now go into the `labels` variable

In [0]:
labels = batch_1[1]

## Model #2: Train/Test Split
Let's now split our datset into a training set and testing set (even though we're using 2,000 sentences from the SST2 training set).

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

<img src="http://127.0.0.1:4000/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" />

### [Bonus] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which determines regularization strength.

In [0]:
# parameters = {'C': np.linspace(0.0001, 100, 20)}
# grid_search = GridSearchCV(LogisticRegression(), parameters)
# grid_search.fit(train_features, train_labels)

# print('best parameters: ', grid_search.best_params_)
# print('best scrores: ', grid_search.best_score_)

We now train the LogisticRegression model. If you've chosen to do the gridsearch, you can plug the value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).

In [0]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

<img src="http://127.0.0.1:4000/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model #2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [0]:
lr_clf.score(test_features, test_labels)

0.824

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [0]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.527 (+/- 0.05)


So our model clearly does better than a dummy classifier. But how does it compare against the best models?

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.



And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from distilBERT to BERT and see how that works.