# Computational Exercise 9: Bag of Words Models

**Please note that this exercise may optionally be completed in groups of 2 students.**

---
In this exercise, we'll use the feature vectors we constructed in the previous exercise ([CE8](https://github.com/mengelhard/bsrt_ml4h/blob/master/notebooks/ce8.ipynb)) to develop and test a predictive model.

Goals are as follows:

- Fully implement a bag of words model
- Explain the model's predictions
- Continue to gain experience with the model development process
- Explore how hyperparameter settings affect performance

We'll begin by importing the usual libraries in addition to `requests`, which will help us load the dataset from a URL. We'll also import a new one, the **natural language toolkit (nltk)**, which will help us preprocess our text data.

- numpy for efficient math operations
- pandas for data and dataframe manipulations
- matplotlib for visualization/plotting
- requests to load data from a URL
- **nltk for text pre-processing**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests

!pip install nltk
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')

sw = set(stopwords.words('english'))
ps = PorterStemmer()



[nltk_data] Downloading package punkt to /Users/mme/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mme/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load and preprocess the dataset

In the following block, we'll prepare the PubMed 200k RCT dataset for model development; as the URLs below show, we load its smaller `PubMed_20k_RCT` subset. Please review [CE8](https://github.com/mengelhard/bsrt_ml4h/blob/master/notebooks/ce8.ipynb) if/as needed to understand this process. **Please note that this block may take a few minutes to run.**

Steps:
1. Load and tokenize all sentences (train, val, test)
2. Create the vocabulary (**note: you may want to revisit this part later on**)
3. Create features based on the sentences + vocabulary
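
Before running the full pipeline, it may help to see steps 2 and 3 on a toy example. The sketch below is only an illustration: the sentences, `TOY_MIN_COUNT`, and the `toy_*` names are made up here and are not part of the dataset or of CE8.

```python
from collections import Counter

import numpy as np

# Toy, already-tokenized sentences (hypothetical data, not from PubMed RCT)
toy_sentences = [
    ["patient", "receiv", "placebo"],
    ["patient", "improv"],
    ["placebo", "group", "patient", "patient"],
]

# Step 2: keep words that occur at least TOY_MIN_COUNT times overall
TOY_MIN_COUNT = 2
counts = Counter(w for s in toy_sentences for w in s)
toy_vocabulary = sorted(w for w, c in counts.items() if c >= TOY_MIN_COUNT)

# Step 3: one row per sentence, one column per vocabulary word
word_to_col = {w: j for j, w in enumerate(toy_vocabulary)}
toy_features = np.zeros((len(toy_sentences), len(toy_vocabulary)))
for i, sentence in enumerate(toy_sentences):
    for w in sentence:
        if w in word_to_col:  # words outside the vocabulary are ignored
            toy_features[i, word_to_col[w]] += 1

print(toy_vocabulary)
print(toy_features)
```

Raising the count threshold shrinks the vocabulary and therefore the number of feature columns, which is the trade-off to keep in mind when revisiting `MIN_COUNT` later.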

In [2]:
# STEP ONE: LOAD AND TOKENIZE

train_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/train.txt?raw=true'
val_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/dev.txt?raw=true'
test_url = 'https://github.com/Franck-Dernoncourt/pubmed-rct/raw/master/PubMed_20k_RCT/test.txt?raw=true'

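def tokenize(sentence):
    # Keep alphabetic tokens (allowing one apostrophe), drop stopwords,
    # then lowercase and stem what remains.
    # Note: the stopword check runs on the original-case token, so
    # capitalized stopwords (e.g. "The") are kept. Using
    # `w.lower() not in sw` would drop them, but changes the counts below.
    return [
        ps.stem(w.lower())
        for w in word_tokenize(sentence)
        if w.replace("'", "", 1).isalpha() and (w not in sw)
    ]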

def read_and_tokenize_pubmed_rct(url):

    labels = []
    tokenized_sentences = []
    
    with requests.get(url) as r:
        for line in r.iter_lines():
            fields = line.decode('utf-8').strip().split('\t')
            if len(fields) == 2:
                labels.append(fields[0])
                tokenized_sentences.append(tokenize(fields[1]))
                
    return tokenized_sentences, labels

s_train, y_train = read_and_tokenize_pubmed_rct(train_url)
print('There are %i sentences in the training set' % len(s_train))

s_val, y_val = read_and_tokenize_pubmed_rct(val_url)
print('There are %i sentences in the validation set' % len(s_val))

s_test, y_test = read_and_tokenize_pubmed_rct(test_url)
print('There are %i sentences in the test set' % len(s_test))

There are 180040 sentences in the training set
There are 30212 sentences in the validation set
There are 30135 sentences in the test set


In [3]:
### STEP ONE AND A HALF: CONVERT THE LABELS TO INTEGERS

sections = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
section_to_idx = {s: i for i, s in enumerate(sections)}

y_train = [section_to_idx[l] for l in y_train]
y_val = [section_to_idx[l] for l in y_val]
y_test = [section_to_idx[l] for l in y_test]

In [4]:
# STEP TWO: CREATE THE VOCABULARY

MIN_COUNT = 50

# pd.value_counts is deprecated in recent pandas; use the Series method instead
vcs = pd.Series([w for s in s_train for w in s]).value_counts()
vocabulary = vcs.index.values[vcs >= MIN_COUNT]
print('There are %i words in our vocabulary' % len(vocabulary))

There are 3986 words in our vocabulary


In [5]:
# STEP THREE: CREATE FEATURES

def create_features(tokenized_sentences, vocabulary):
    
    vocab_dict = {v:i for i, v in enumerate(vocabulary)}
    
    features = np.zeros((len(tokenized_sentences), len(vocabulary)))
    
    for i, tokenized_sentence in enumerate(tokenized_sentences):
        for word in tokenized_sentence:
            # dict lookup is O(1); `word in vocabulary` scans the whole array
            if word in vocab_dict:
                features[i, vocab_dict[word]] += 1
            
    return features

x_train = create_features(s_train, vocabulary)
print('The training set has shape', x_train.shape)

x_val = create_features(s_val, vocabulary)
print('The validation set has shape', x_val.shape)

x_test = create_features(s_test, vocabulary)
print('The test set has shape', x_test.shape)

The training set has shape (180040, 3986)
The validation set has shape (30212, 3986)
The test set has shape (30135, 3986)


## Exercise 9.1: A first bag of words model

In this part of the exercise, you should create a logistic regression model that predicts the PubMed abstract section associated with a given sentence. Then, evaluate it on the **validation** set. We'll save the test set for later. This is going to take a while; you may want to either (a) limit the number of iterations, or (b) train on only a subset of the training set.
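
To illustrate the two speed-up options, here is a minimal sketch on synthetic data. It is not a solution to the exercise: the `make_toy_counts` helper, the `toy_*` names, and the specific sizes are assumptions made for illustration, and in the notebook you would use `x_train`, `y_train`, `x_val`, and `y_val` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's count features (assumption:
# 3 classes, 12 features, each class favoring its own block of 4 columns)
def make_toy_counts(n, n_classes=3, n_features=12):
    y = rng.integers(0, n_classes, size=n)
    rates = np.full((n_classes, n_features), 0.2)
    for k in range(n_classes):
        rates[k, k * 4:(k + 1) * 4] = 2.0
    x = rng.poisson(rates[y])
    return x.astype(float), y

toy_x_train, toy_y_train = make_toy_counts(600)
toy_x_val, toy_y_val = make_toy_counts(200)

# Option (b): train on a random subset of the training rows
subset = rng.choice(len(toy_x_train), size=300, replace=False)

# Option (a): cap the number of solver iterations with max_iter
toy_model = LogisticRegression(max_iter=200)
toy_model.fit(toy_x_train[subset], toy_y_train[subset])

# .score returns the mean accuracy on the given data
val_accuracy = toy_model.score(toy_x_val, toy_y_val)
print('Validation accuracy: %.3f' % val_accuracy)
```

If the solver stops before converging, scikit-learn emits a `ConvergenceWarning`; the model is still usable, but its accuracy may improve with more iterations.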

In [6]:
from sklearn.linear_model import LogisticRegression

### CREATE AND TRAIN THE MODEL ###


### EVALUATE ACCURACY ON THE VALIDATION SET ###



## Exercise 9.2: Important words

Now, we can inspect the parameters of our trained model to determine which words increase the log-odds most for each section. The parameters can be accessed via the `.coef_` attribute of the trained model. Similar to activity 10, we can use a `pandas` series to sort words in our vocabulary.

The block below contains code to determine which words increase the log-odds of the 'BACKGROUND' section most. Note that you'll need to uncomment that line and change `lr_model` to the name of your model from the previous code block. In this block, you should extend the code to the remaining four sections.
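
The sketch below shows the idea on made-up numbers, so that the link between rows of the coefficient matrix and sections is explicit. The vocabulary, the coefficient values, and the `top_words` helper are all hypothetical; for a fitted scikit-learn model, the rows of `.coef_` follow the order of the model's `.classes_` attribute, which matches the integer labels 0 to 4 used here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `vocabulary` and a trained model's `.coef_`
# (shape: n_classes x n_words); these values are invented for illustration
toy_sections = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
toy_vocabulary = np.array(['aim', 'random', 'signific', 'suggest', 'unknown'])
toy_coef = np.array([
    [ 0.1, -0.5, -0.9,  0.0,  1.2],   # BACKGROUND
    [ 1.5, -0.3, -0.7, -0.2,  0.1],   # OBJECTIVE
    [-0.6,  1.4, -0.1, -0.8, -0.4],   # METHODS
    [-0.9, -0.2,  1.6, -0.5, -0.6],   # RESULTS
    [-0.4, -0.6,  0.2,  1.3, -0.3],   # CONCLUSIONS
])

def top_words(words, coefs, n=2):
    # Index the coefficients by word, then sort from largest to smallest
    return pd.Series(coefs, index=words).sort_values(ascending=False).head(n)

# Row i of the coefficient matrix holds the weights for class label i
top_by_section = {
    section: list(top_words(toy_vocabulary, toy_coef[i]).index)
    for i, section in enumerate(toy_sections)
}
for section, words in top_by_section.items():
    print(section, words)
```

A large positive coefficient means that each additional occurrence of the word raises the log-odds of that section; a large negative one lowers them.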

In [7]:
def sort_arr_by_vals(arr, vals):
    return pd.Series(vals, index=arr).sort_values(ascending=False)

### DETERMINE WHICH WORDS ARE MOST PREDICTIVE OF BACKGROUND ###
#sort_arr_by_vals(vocabulary, lr_model.coef_[0])

### DETERMINE WHICH WORDS ARE MOST PREDICTIVE OF OBJECTIVE ###


### DETERMINE WHICH WORDS ARE MOST PREDICTIVE OF METHODS ###


### DETERMINE WHICH WORDS ARE MOST PREDICTIVE OF RESULTS ###


### DETERMINE WHICH WORDS ARE MOST PREDICTIVE OF CONCLUSIONS ###



## Exercise 9.3: Tune the model and evaluate it on the test set

We can probably build a better model. In the following block, you should:
1. explore at least one modification to the previous model
2. compare the performance of both/all models on the validation set
3. choose the one that performs best on the validation set as your final model
4. evaluate the accuracy of your final model on the test set

Here are some modifications you might try:
- Make the vocabulary larger or smaller by changing `MIN_COUNT`, then generating an updated set of features
- Use tf-idf features instead of raw counts (see `sklearn.feature_extraction.text.TfidfTransformer`)
- Increase or decrease the regularization penalty (via the `C` parameter) of your logistic regression model
- Instead of logistic regression, use an `MLPClassifier` or other classification model
- (challenge) include 2-grams in your vocabulary

You don't need to try all of these or even most of them, but you do need to make at least one modification to the model and/or preprocessing that you believe is likely to improve performance.
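
As one possible shape for this workflow, the sketch below combines two of the suggested modifications (tf-idf features and a sweep over `C`) on synthetic data. It is a sketch under stated assumptions rather than a reference solution: `make_toy_counts`, the `toy_*` splits, and the candidate `C` values are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic count features standing in for the notebook's three splits
def make_toy_counts(n, n_classes=3, n_features=12):
    y = rng.integers(0, n_classes, size=n)
    rates = np.full((n_classes, n_features), 0.2)
    for k in range(n_classes):
        rates[k, k * 4:(k + 1) * 4] = 2.0
    return rng.poisson(rates[y]).astype(float), y

toy_x_train, toy_y_train = make_toy_counts(600)
toy_x_val, toy_y_val = make_toy_counts(200)
toy_x_test, toy_y_test = make_toy_counts(200)

# Fit the idf weights on the training counts only, then reuse them
tfidf = TfidfTransformer()
x_tr = tfidf.fit_transform(toy_x_train)
x_va = tfidf.transform(toy_x_val)
x_te = tfidf.transform(toy_x_test)

# Compare regularization strengths on the validation set
# (smaller C means a stronger penalty)
val_scores = {}
for C in [0.01, 1.0, 100.0]:
    candidate = LogisticRegression(C=C, max_iter=500).fit(x_tr, toy_y_train)
    val_scores[C] = candidate.score(x_va, toy_y_val)

best_C = max(val_scores, key=val_scores.get)
print('Validation accuracy by C:', val_scores)

# Refit with the selected setting and evaluate on the test set once
final_model = LogisticRegression(C=best_C, max_iter=500).fit(x_tr, toy_y_train)
test_accuracy = final_model.score(x_te, toy_y_test)
print('Test accuracy with C=%s: %.3f' % (best_C, test_accuracy))
```

The test set is used only after the choice has been made on the validation set, so the reported test accuracy is not biased by the model selection.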

In [8]:
### YOUR CODE HERE ###



## Exercise 9.4: Plot and label the confusion matrix for your final model

So far, we've been using accuracy as a crude measure of performance, but it'd be better to break down prediction performance between each of the five abstract sections. In this section, you should use the `confusion_matrix` function from `sklearn` (e.g. `confusion_matrix(y_test, y_test_pred)`) to create the confusion matrix, then plot it with `plt.matshow`.

(optional) **challenge**: In a separate code block, plot the ROC curve for a single section (e.g. BACKGROUND vs all other sections)
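
For orientation, here is a minimal sketch of the plotting steps using hypothetical labels. The `toy_y_true` and `toy_y_pred` arrays are invented and are not model output; in the notebook they would be replaced by `y_test` and your final model's test predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

toy_sections = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']

# Hypothetical true and predicted labels (integers 0-4)
toy_y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 4])
toy_y_pred = np.array([0, 1, 1, 0, 2, 2, 3, 3, 3, 4])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(toy_y_true, toy_y_pred,
                      labels=list(range(len(toy_sections))))

plt.matshow(cm, cmap='Blues')
plt.colorbar()

# Replace the numeric ticks with section names
ticks = range(len(toy_sections))
plt.xticks(ticks, toy_sections, rotation=45, ha='left')
plt.yticks(ticks, toy_sections)
plt.xlabel('Predicted section')
plt.ylabel('True section')

# Write each count inside its cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, str(cm[i, j]), ha='center', va='center')
```

The diagonal holds the correctly classified sentences, so dividing it by the row sums (`cm.diagonal() / cm.sum(axis=1)`) gives the per-section recall.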

In [9]:
from sklearn.metrics import confusion_matrix

### CREATE THE CONFUSION MATRIX ###


### PLOT IT USING plt.matshow ###


### CHANGE THE TICKS FROM NUMBERS TO SECTION LABELS ###



## Once you've completed these exercises, please turn in the assignment as follows:

If you're using Anaconda on your local machine:
- download your notebook as html (see File > Download as > HTML (.html))
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS

If you're using Google Colab:
- download your notebook as .ipynb (see File > Download > Download .ipynb)
- if you have nbconvert installed, convert it to .html; if not, leave it as .ipynb
- .zip the file (i.e. place it in a .zip archive)
- submit the .zip file in Talent LMS