# Computational Exercise 9: Bag of Words Models

**Please note that this exercise may optionally be completed in groups of two students.**

---
In this exercise, we'll convert sentences from different sections of medical abstracts (e.g. background, methods, etc) into bag of words feature vectors. In a subsequent exercise, we'll then use these feature vectors to develop and test a predictive model.

Goals are as follows:

- Further improve your understanding of count-based text features
- Learn how to convert text data into features that can be used to develop a predictive model

We'll begin by importing the usual libraries in addition to `requests`, which will help us load the dataset from a URL. Later on, we'll also import a new one, the **natural language toolkit (nltk)**, which will help us preprocess our text data.

- numpy for efficient math operations
- pandas for data and dataframe manipulations
- matplotlib for visualization/plotting
- requests to load data from a URL
- **nltk for text pre-processing**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

## Load the dataset

We'll be working with the "PubMed 200k RCT" dataset developed by Franck Dernoncourt; the addresses below point to its smaller PubMed 20k RCT subset. This dataset contains sentences from different sections of PubMed abstracts along with labels indicating which section they're from. The sections are:

- OBJECTIVE
- BACKGROUND
- METHODS
- RESULTS
- CONCLUSIONS

Over the next few exercises, our goal will be to develop a classifier that assigns each sentence to the correct label. The classifier itself is not especially useful in practice, but it shows that natural language processing can be effective even for text with complex terminology, including clinical notes. The training, validation, and test data are found at the following addresses:

In [3]:
train_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/train.txt?raw=true'
val_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/dev.txt?raw=true'
test_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/test.txt?raw=true'

We'll begin by defining a function to read these data. Much like in previous exercises, **the details here are *not* important to our goals;** we just need the data. The cell below loads the training, validation, and test sets (sentences as `sentences_train`, `sentences_val`, and `sentences_test`; labels as `y_train`, `y_val`, and `y_test`). In later exercises, we can reuse these addresses and the function below to load the same data again.

In [4]:
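import requests

def read_pubmed_rct(url):
    """Download a PubMed RCT file and return (sentences, labels) as parallel lists."""

    labels = []
    sentences = []
    
    with requests.get(url) as r:
        # Fail early if the download did not succeed
        r.raise_for_status()
        for line in r.iter_lines():
            # Data lines have the form "LABEL<tab>sentence"; all other lines are skipped
            fields = line.decode('utf-8').strip().split('\t')
            if len(fields) == 2:
                labels.append(fields[0])
                sentences.append(fields[1])
                
    return sentences, labels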

sentences_train, y_train = read_pubmed_rct(train_url)
print('There are %i sentences in the training set' % len(sentences_train))

sentences_val, y_val = read_pubmed_rct(val_url)
print('There are %i sentences in the validation set' % len(sentences_val))

sentences_test, y_test = read_pubmed_rct(test_url)
print('There are %i sentences in the test set' % len(sentences_test))

There are 180040 sentences in the training set
There are 30212 sentences in the validation set
There are 30135 sentences in the test set
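
To make the parsing step concrete, here is a minimal sketch of the same filtering logic applied to a few made-up lines. The sample text and the helper name `parse_lines` are assumptions for illustration only. Like the function above, the sketch treats a line as data only if it splits into exactly two tab-separated fields, so other lines (for example an ID marker or a blank line) are skipped.

```python
def parse_lines(lines):
    """Split 'LABEL<tab>sentence' lines into parallel lists, skipping all other lines."""
    labels, sentences = [], []
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) == 2:
            labels.append(fields[0])
            sentences.append(fields[1])
    return sentences, labels

# Hypothetical sample input: one marker line, two data lines, one blank line
sample_lines = [
    "###00000001",
    "OBJECTIVE\tTo assess the effect of a hypothetical treatment .",
    "METHODS\tPatients were randomized to two groups .",
    "",
]

toy_sentences, toy_labels = parse_lines(sample_lines)
print(toy_labels)     # the two labels from the data lines
print(toy_sentences)  # the two matching sentences
```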


## Import the Natural Language Toolkit (NLTK) for text processing

We can now import NLTK. We'll first make sure it's installed, since it's not part of the Anaconda base environment. We'll also import:
- `word_tokenize`, which splits a sentence into a list of *tokens* (e.g. words, numbers, punctuation)
- `stopwords`, a list of commonly used words that we can safely ignore when processing our text
- `PorterStemmer`, which will convert words into stems, as described in the lecture

We'll also download the Punkt tokenizer models ('punkt'), which `word_tokenize` relies on, and the stopword lists ('stopwords'). We then create `sw`, a set containing all the English stopwords, and `ps`, an instance of `PorterStemmer` that we can apply to our words. **Before moving on, take a look at** `sw` **and try out** `word_tokenize` **on a few different sentences.**

In [5]:
!pip install nltk
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

sw = set(stopwords.words('english'))
ps = PorterStemmer()



[nltk_data] Downloading package punkt to /Users/mme/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mme/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
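def tokenize(sentence):
    # Keep alphabetic tokens (one apostrophe allowed), drop stopwords,
    # then lowercase and stem whatever remains.
    # Note: the stopword check uses the token's original case and the entries
    # in `sw` are lowercase, so capitalized stopwords such as "The" are kept.
    return [
        ps.stem(w.lower())
        for w in word_tokenize(sentence)
        if w.replace("'", "", 1).isalpha() and (w not in sw)
    ]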
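
The filter inside `tokenize` can be tried out without NLTK. The sketch below is only an illustration: `toy_stopwords` and `toy_tokens` are made-up stand-ins for `sw` and for the output of `word_tokenize`, and stemming is left out. It shows which tokens survive the filter as written, and how the result changes under one possible variant (not what the cell above does) that lowercases each token before the stopword check.

```python
# Hypothetical stand-ins so the sketch runs without NLTK or its downloads
toy_stopwords = {"the", "of", "was", "in"}
toy_tokens = ["The", "efficacy", "of", "drug", "X", "was", "n't", "tested", "in", "2013", "."]

def is_word(token):
    """True for alphabetic tokens, allowing a single apostrophe (e.g. "n't")."""
    return token.replace("'", "", 1).isalpha()

# Same logic as the filter in `tokenize`: the stopword check is case-sensitive
kept = [t.lower() for t in toy_tokens if is_word(t) and t not in toy_stopwords]

# Variant (an assumption): lowercase before checking against the stopword set
kept_lowered = [t.lower() for t in toy_tokens if is_word(t) and t.lower() not in toy_stopwords]

print(kept)          # "The" passes the case-sensitive check and appears as "the"
print(kept_lowered)  # "The" is removed along with the other stopwords
```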

## Exercise 8.2: Process all sentences

You can now use a single list comprehension to apply `tokenize` to *all* of the sentences, resulting in a list of 180,040 stemmed, tokenized sentences.

In [7]:
tokens_train = [tokenize(s) for s in sentences_train]
tokens_val = [tokenize(s) for s in sentences_val]
tokens_test = [tokenize(s) for s in sentences_test]

## Exercise 8.3: Create your vocabulary

We're now ready to create our vocabulary using the approach described in the [bag of words lecture](https://github.com/mengelhard/bsrt_ml4h/blob/master/lectures/al10.pdf). You'll need to complete the following steps:
- Put the stemmed tokens from *all* sentences together in a single list or array. This can be done with a list comprehension or `np.concatenate`.
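- Count the number of occurrences of each distinct token. This can be done with `np.unique` (use `return_counts=True`) or with the `value_counts` method of a pandas Series. (The top-level `pd.value_counts` function also works, but it is deprecated in recent pandas versions.)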
- Remove those that occur fewer than 50 times. This can be done using boolean indexing: if we have the arrays `words` and `word_counts`, for example, we can write `vocabulary = words[word_counts >= 50]`. Later on, we'll explore how making this number larger or smaller affects model performance.

The resulting list (or array) is your vocabulary, which defines the features for our bag of words model.
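
As a rough illustration of the three steps above, the sketch below uses the `np.unique` route on a tiny made-up token list. The names `toy_tokens` and `min_count` are assumptions, and the threshold is set to 2 instead of 50 so that the toy data produces a non-empty vocabulary.

```python
import numpy as np

# Hypothetical tokenized sentences (already stemmed)
toy_tokens = [["pain", "score", "pain"], ["score", "pain"], ["trial"]]
min_count = 2  # stands in for the threshold of 50 used in the exercise

# Step 1: put the tokens from all sentences into a single array
all_tokens = np.array([w for s in toy_tokens for w in s])

# Step 2: count occurrences of each distinct token (words come back sorted)
words, word_counts = np.unique(all_tokens, return_counts=True)

# Step 3: keep tokens that occur at least `min_count` times via boolean indexing
toy_vocabulary = words[word_counts >= min_count]

print(dict(zip(words.tolist(), word_counts.tolist())))
print(toy_vocabulary)
```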

In [8]:
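# Flatten all training tokens into one Series and count each distinct token
vcs = pd.Series([w for s in tokens_train for w in s]).value_counts()
# Keep only the tokens that occur at least 50 times
vocabulary = vcs.index.values[vcs >= 50]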
print('There are %i words in our vocabulary' % len(vocabulary))

There are 3986 words in our vocabulary


## Define a function to create features

Finally, we can use (a) the vocabulary, and (b) our list of stemmed, tokenized sentences to create numeric features corresponding to each sentence. The block below defines a function `create_features` and shows how it can be applied to a sample list of tokenized sentences along with a sample vocabulary. **You do not need to make changes to this block, but please take a look at the code and verify that it is creating feature vectors using the approach described in our lecture.**

In [10]:
def create_features(tokenized_sentences, vocabulary):
    
    vocab_dict = {v:i for i, v in enumerate(vocabulary)}
    
    features = np.zeros((len(tokenized_sentences), len(vocabulary)))
    
    for i, tokenized_sentence in enumerate(tokenized_sentences):
        for word in tokenized_sentence:
            # Check membership in the dictionary (fast lookup) rather than scanning the vocabulary array
            if word in vocab_dict:
                features[i, vocab_dict[word]] += 1
            
    return features
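
The toy example below shows what the function returns for a small input. The sample vocabulary and sentences are made up and unstemmed, and the function is restated (with a dictionary membership test) only so that the block runs on its own.

```python
import numpy as np

def create_features(tokenized_sentences, vocabulary):
    """Return a (sentences x vocabulary) matrix of token counts."""
    vocab_dict = {v: i for i, v in enumerate(vocabulary)}
    features = np.zeros((len(tokenized_sentences), len(vocabulary)))
    for i, tokenized_sentence in enumerate(tokenized_sentences):
        for word in tokenized_sentence:
            if word in vocab_dict:  # words outside the vocabulary are ignored
                features[i, vocab_dict[word]] += 1
    return features

# Hypothetical sample data: three vocabulary words, three tokenized sentences
sample_vocabulary = np.array(['patients', 'pain', 'score'])
sample_sentences = [['pain', 'score', 'pain'], ['patients', 'trial'], []]

sample_features = create_features(sample_sentences, sample_vocabulary)
print(sample_features)
```

Each row corresponds to one sentence and each column to one vocabulary word, so the first row counts 'pain' twice and 'score' once, while 'trial' contributes nothing because it is not in the vocabulary.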

## Exercise 8.4: Create the feature vectors

*Your* list and vocabulary have been stemmed, so they'll look different than those in the example above. What's important is that the format of tokens in the vocabulary matches the format in the tokenized sentences, which should be the case if you've followed the steps outlined above.

In the block below, apply `create_features` to your tokenized sentence list and vocabulary to create `x_train`, which we'll use to train a predictive model in our next computational exercise. This may take a few minutes.

In [11]:
x_train = create_features(tokens_train, vocabulary)
x_val = create_features(tokens_val, vocabulary)
x_test = create_features(tokens_test, vocabulary)

In [13]:
x_train.shape

(180040, 3986)
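
A matrix of this shape is large when stored densely. The back-of-the-envelope sketch below estimates its memory footprint for a few element types; `np.zeros` defaults to `float64`, and the smaller dtypes are listed only as possible alternatives (an assumption, not something the exercise requires). The helper name `dense_gigabytes` is hypothetical.

```python
import numpy as np

n_rows, n_cols = 180040, 3986  # shape of x_train reported above

def dense_gigabytes(rows, cols, dtype):
    """Memory needed for a dense rows x cols array of the given dtype, in GB (1e9 bytes)."""
    return rows * cols * np.dtype(dtype).itemsize / 1e9

# float64 is what np.zeros produces by default; uint8 would cap counts at 255
for dtype in (np.float64, np.float32, np.uint8):
    print('%s: %.2f GB' % (np.dtype(dtype).name, dense_gigabytes(n_rows, n_cols, dtype)))
```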

## Exercise 9.1: Create a Logistic Regression Model

- train it on the training set
- evaluate it on the validation set

In [14]:
y_train[:10]

['OBJECTIVE',
 'METHODS',
 'METHODS',
 'METHODS',
 'METHODS',
 'METHODS',
 'RESULTS',
 'RESULTS',
 'RESULTS',
 'RESULTS']

In [18]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression().fit(x_train, y_train)

print('The accuracy is %.1f' % (100 * np.mean(lr_model.predict(x_val) == y_val)))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The accuracy is 75.9
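
Overall accuracy can hide differences between sections. The sketch below computes overall and per-class accuracy with NumPy alone. The labels and predictions are made up for illustration; in the notebook, `y_val` and `lr_model.predict(x_val)` would take their place.

```python
import numpy as np

# Hypothetical true labels and predictions
y_true = np.array(['METHODS', 'RESULTS', 'METHODS', 'OBJECTIVE', 'RESULTS', 'METHODS'])
y_pred = np.array(['METHODS', 'RESULTS', 'RESULTS', 'BACKGROUND', 'RESULTS', 'METHODS'])

# Overall accuracy: fraction of sentences whose predicted label matches the true one
overall = float(np.mean(y_pred == y_true))
print('Overall accuracy: %.1f%%' % (100 * overall))

# Per-class accuracy: among sentences that truly belong to a class,
# the fraction that the model assigned to that class
per_class = {
    str(label): float(np.mean(y_pred[y_true == label] == label))
    for label in np.unique(y_true)
}
for label, acc in per_class.items():
    print('%-10s %.1f%%' % (label, 100 * acc))
```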


## Exercise 9.2: Figure out which words are most important
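
One possible approach, sketched here under stated assumptions, is to rank vocabulary words by their coefficients for each class. For a fitted multiclass scikit-learn `LogisticRegression`, `coef_` has one row per class (in the order given by `classes_`) and one column per feature. The arrays below are invented stand-ins for `lr_model.coef_`, `lr_model.classes_`, and `vocabulary`, and `top_words` is a hypothetical helper name.

```python
import numpy as np

# Invented stand-ins for vocabulary, lr_model.classes_, and lr_model.coef_
toy_vocabulary = np.array(['aim', 'random', 'significantli', 'conclud'])
toy_classes = np.array(['METHODS', 'OBJECTIVE', 'RESULTS'])
toy_coef = np.array([
    [-0.5,  2.0,  0.1, -1.0],   # METHODS
    [ 1.8, -0.3, -0.9,  0.2],   # OBJECTIVE
    [-0.7,  0.4,  2.5, -0.2],   # RESULTS
])

def top_words(coef, classes, vocabulary, k=2):
    """For each class, return the k words with the largest (most positive) coefficients."""
    top = {}
    for row, label in zip(coef, classes):
        order = np.argsort(row)[::-1][:k]  # indices sorted from largest to smallest
        top[str(label)] = [str(vocabulary[j]) for j in order]
    return top

toy_top = top_words(toy_coef, toy_classes, toy_vocabulary)
for label, words_for_label in toy_top.items():
    print(label, words_for_label)
```

Large positive coefficients mark words that push a sentence toward a class; sorting in the opposite direction would instead show the words that count most strongly against it.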

## Exercise 9.3: Tune the model and evaluate it on the test set

## Once you've completed these exercises, please turn in the assignment as follows:

If you're using Anaconda on your local machine:
- download your notebook as html (see File > Download as > HTML (.html))
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS

If you're using Google Colab:
- download your notebook as .ipynb (see File > Download > Download .ipynb)
- if you have nbconvert installed, convert it to .html; if not, leave it as .ipynb
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS