This Jupyter notebook contains exercises on experimental design in machine learning. We will introduce common evaluation measures for supervised machine learning. You will also learn how to split a dataset into training, development and test sets, and how cross-validation works.




## EXPERIMENTAL DESIGN

---

First, we import the libraries that we are going to use, including as usual numpy (vector manipulation), nltk (text preprocessing) and scikit-learn (machine learning).

**Note:** All these libraries need to be downloaded beforehand if not using Google Colab. Check their official websites for details on how to install them.

In [1]:
import numpy as np
import nltk
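import sklearn
import sklearn.svm # Needed later for sklearn.svm.SVC (submodules are not loaded by "import sklearn" alone)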
import operator
import requests
nltk.download('stopwords') # If needed
nltk.download('punkt') # If needed
nltk.download('wordnet') # If needed

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tianbai\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tianbai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tianbai\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## A) TRAIN, DEVELOPMENT AND TEST SPLITS

To start with, we are going to work with the same sentiment analysis dataset used in the previous session, i.e., RT-polarity. First, as usual, we need to load the dataset in Python. We are going to load it directly from the internet, but remember from the previous session that you can also load your dataset locally if you wish:






In [2]:
url_pos="http://josecamachocollados.com/rt-polarity.pos.txt" # Containing all positive reviews, one review per line
url_neg="http://josecamachocollados.com/rt-polarity.neg.txt" # Containing all negative reviews, one review per line

#Load positive reviews
response_pos = requests.get(url_pos)
dataset_file_pos = response_pos.text.split("\n")

#Load negative reviews
response_neg = requests.get(url_neg)
dataset_file_neg = response_neg.text.split("\n")

Now we are going to split the dataset into training and test splits. First, we need to put together positive and negative reviews into a single list. 

In [3]:
dataset_full=[]
for pos_review in dataset_file_pos:
  dataset_full.append((pos_review,1))
for neg_review in dataset_file_neg:
  dataset_full.append((neg_review,0))

**Note:** Remember that positive reviews are labelled as "1" and negative reviews as "0". To store reviews with their corresponding labels, we have used tuples of the form `(review,label)`.

With the full dataset stored in a single list, we are going to split our dataset into training and test, by following a standard 80%/20% distribution. We are going to randomly extract examples from the original dataset, 80% for the training set, and 20% for the test set.

In [4]:
from sklearn.model_selection import train_test_split
import random

In [9]:
size_dataset_full=len(dataset_full)
size_test=int(round(size_dataset_full*0.2,0))

list_test_indices=random.sample(range(size_dataset_full), size_test)

print(type(list_test_indices))
print(list_test_indices)
train_set=[]
test_set=[]
set_test_indices=set(list_test_indices) # Membership checks on a set are much faster than on a list
for i,example in enumerate(dataset_full):
  if i in set_test_indices: test_set.append(example)
  else: train_set.append(example)

<class 'list'>
[1724, 5623, 9346, 5248, 955, 8003, 9725, 2467, 9460, 1433, 1969, 902, 372, 9046, 3033, 9330, 5858, 2536, 9748, 7059, 9358, 1213, 3076, 1770, 2785, 119, 17, 3800, 3571, 9052, 6867, 4184, 6292, 577, 2258, 3762, 816, 5235, 4251, 3555, 4989, 3468, 7368, 4734, 9712, 6194, 3881, 7907, 3342, 1201, 9140, 6448, 9127, 5482, 2032, 9025, 6204, 1018, 99, 1151, 8858, 2464, 2814, 1582, 3274, 772, 3286, 3462, 6147, 1093, 5653, 8754, 692, 6662, 9572, 127, 2544, 10188, 8968, 5848, 10329, 6135, 6672, 10257, 7623, 4449, 7766, 9655, 9119, 732, 3860, 5018, 4284, 6636, 762, 1662, 7193, 4747, 1867, 5359, 10355, 58, 9187, 7005, 4884, 4667, 8812, 4002, 842, 4280, 8732, 1566, 3533, 7541, 3158, 303, 9301, 5982, 3758, 2783, 9146, 5285, 7824, 7021, 1015, 2028, 4749, 3594, 1509, 8190, 1733, 3190, 5729, 7389, 2969, 6105, 8480, 3099, 2863, 537, 9519, 1583, 8788, 765, 9364, 2369, 3480, 9429, 606, 6842, 5028, 4368, 3937, 8473, 2661, 4863, 9362, 3586, 515, 3742, 8260, 3110, 4497, 9804, 1896, 5894, 2979, 2

**Exercise (optional):**
Use the function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from sklearn to split the original RT-polarity dataset into training and test sets. More information in this [blog post](https://medium.com/@contactsunny/how-to-split-your-dataset-to-train-and-test-datasets-using-scikit-learn-e7cf6eb5e0d).


To double-check that we have split the dataset as planned, let's check the final sizes. We are also going to shuffle the examples in each of the splits (using the function `random.shuffle`), as is recommended in many cases.

In [6]:
random.shuffle(train_set)
random.shuffle(test_set)

In [7]:
print ("Size dataset full: "+str(size_dataset_full))
print ("Size training set: "+str(len(train_set)))
print ("Size test set: "+str(len(test_set)))

Size dataset full: 10664
Size training set: 8531
Size test set: 2133
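
Because `random.sample` draws different indices on every run, the exact contents of the two splits change each time the cell is executed. The following sketch is not part of the original session: it shows one possible way to make such a split repeatable by fixing a seed. The toy dataset, the seed value `42` and the helper name `seeded_test_indices` are assumptions chosen only for illustration.

```python
import random

# Hypothetical toy dataset of (review,label) tuples standing in for dataset_full
toy_dataset = [("review " + str(i), i % 2) for i in range(10)]

def seeded_test_indices(size_dataset, size_test, seed):
  # A dedicated Random instance keeps the seed local to this split
  rng = random.Random(seed)
  return set(rng.sample(range(size_dataset), size_test))

size_test_toy = int(round(len(toy_dataset) * 0.2, 0))
first_draw = seeded_test_indices(len(toy_dataset), size_test_toy, seed=42)
second_draw = seeded_test_indices(len(toy_dataset), size_test_toy, seed=42)

toy_train = [example for i, example in enumerate(toy_dataset) if i not in first_draw]
toy_test = [example for i, example in enumerate(toy_dataset) if i in first_draw]

print("Same indices on both draws: " + str(first_draw == second_draw))
print("Size training set: " + str(len(toy_train)))
print("Size test set: " + str(len(toy_test)))
```

Using the same seed yields the same test indices, so results obtained on the split can be compared across runs.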


**Exercise 1:** Given a dataset represented as a list of instances (e.g. our `dataset_full` for the RT-polarity dataset) and the proportion of the dataset to reserve for the test set (e.g. `0.2`) as input, create a function that splits the given dataset into training and test sets of the corresponding sizes. Check your function with our RT-polarity dataset (i.e. `dataset_full`) and `0.2` as inputs.

In [0]:
def get_train_test_split(dataset_full,ratio):
  pre_train_set=[]
  pre_test_set=[]
  # To complete...

  return pre_train_set,pre_test_set


Now we have our dataset split into training and test. However, in many cases we would also need a development set, which can be used to tune our model. To get the development set, we can split the test set in half, and therefore obtain a standard train/dev/test split of 80%/10%/10%.

In [0]:
original_size_test=len(test_set)
size_dev=int(round(original_size_test*0.5,0))
list_dev_indices=random.sample(range(original_size_test), size_dev)
new_dev_set=[]
new_test_set=[]
for i,example in enumerate(test_set):
  if i in list_dev_indices: new_dev_set.append(example)
  else: new_test_set.append(example)
new_train_set=train_set
random.shuffle(new_train_set)
random.shuffle(new_dev_set)
random.shuffle(new_test_set)
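
The same 80%/10%/10% idea can also be sketched with sklearn's `train_test_split`, applied twice: once to hold out 20% of the data, and once more to cut the held-out part in half. This is only an illustration on a hypothetical toy dataset of 100 items; the variable names and the `random_state=0` value are assumptions, not part of the session's code.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset of 100 (review,label) tuples
toy_dataset = [("review " + str(i), i % 2) for i in range(100)]

# First split: 80% for training, 20% held out
toy_train, toy_heldout = train_test_split(toy_dataset, test_size=0.2, random_state=0)

# Second split: half of the held-out part becomes dev, the other half test
toy_dev, toy_test = train_test_split(toy_heldout, test_size=0.5, random_state=0)

print("Size training set: " + str(len(toy_train)))
print("Size development set: " + str(len(toy_dev)))
print("Size test set: " + str(len(toy_test)))
```

`train_test_split` shuffles the data by default, so no separate call to `random.shuffle` is needed in this variant.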

Our dataset is now split into training, development and test. Let's check some examples from each of the splits.

In [0]:
print ("TRAINING SET")
print ("Size training set: "+str(len(new_train_set)))
for example in new_train_set[:3]:
  print (example)
print ("    \n-------\n")
print ("DEV SET")
print ("Size development set: "+str(len(new_dev_set)))
for example in new_dev_set[:3]:
  print (example)
print ("    \n-------\n")
print ("TEST SET")
print ("Size test set: "+str(len(new_test_set)))
for example in new_test_set[:3]:
  print (example)




## B) EVALUATION MEASURES

---


In this section we will evaluate our linear SVM binary classifier (similar to the one we trained in the previous session) on the RT-polarity dataset. We will first train the model on the training set, and then evaluate it on the test set. To this end, we will use functions from the previous sessions, slightly modified to be more general and cover this case.


In [0]:
lemmatizer = nltk.stem.WordNetLemmatizer()
stopwords=set(nltk.corpus.stopwords.words('english'))
stopwords.add(".")
stopwords.add(",")
stopwords.add("--")
stopwords.add("``")

# Function taken from Session 1
def get_list_tokens(string): # Function to retrieve the list of tokens from a string
  sentence_split=nltk.tokenize.sent_tokenize(string)
  list_tokens=[]
  for sentence in sentence_split:
    list_tokens_sentence=nltk.tokenize.word_tokenize(sentence)
    for token in list_tokens_sentence:
      list_tokens.append(lemmatizer.lemmatize(token).lower())
  return list_tokens

# Function taken from Session 2
def get_vector_text(list_vocab,string):
  vector_text=np.zeros(len(list_vocab))
  list_tokens_string=get_list_tokens(string)
  for i, word in enumerate(list_vocab):
    if word in list_tokens_string:
      vector_text[i]=list_tokens_string.count(word)
  return vector_text


# Functions slightly modified from Session 2

def get_vocabulary(training_set, num_features): # Function to retrieve vocabulary
  dict_word_frequency={}
  for instance in training_set:
    sentence_tokens=get_list_tokens(instance[0])
    for word in sentence_tokens:
      if word in stopwords: continue
      if word not in dict_word_frequency: dict_word_frequency[word]=1
      else: dict_word_frequency[word]+=1
  sorted_list = sorted(dict_word_frequency.items(), key=operator.itemgetter(1), reverse=True)[:num_features]
  vocabulary=[]
  for word,frequency in sorted_list:
    vocabulary.append(word)
  return vocabulary

def train_svm_classifier(training_set, vocabulary): # Function for training our svm classifier
  X_train=[]
  Y_train=[]
  for instance in training_set:
    vector_instance=get_vector_text(vocabulary,instance[0])
    X_train.append(vector_instance)
    Y_train.append(instance[1])
  # Finally, we train the SVM classifier 
  svm_clf=sklearn.svm.SVC(kernel="linear",gamma='auto')
  svm_clf.fit(np.asarray(X_train),np.asarray(Y_train))
  return svm_clf

In [0]:
vocabulary=get_vocabulary(new_train_set, 1000)  # We use the get_vocabulary function to retrieve the vocabulary

In [0]:
svm_clf=train_svm_classifier(new_train_set, vocabulary) # We finally use the function to train our SVM classifier. This can take a while...

We can now test our model with an example.

In [0]:
print (svm_clf.predict([get_vector_text(vocabulary,"Fascinating!")]))

Once we have trained our SVM classifier, we can evaluate our model on the test set. To that end, we need to convert the test set into two lists (`X_test` and `Y_test`), just as we did with the training set.

In [0]:
X_test=[]
Y_test=[]
for instance in new_test_set:
  vector_instance=get_vector_text(vocabulary,instance[0])
  X_test.append(vector_instance)
  Y_test.append(instance[1])
X_test=np.asarray(X_test)
Y_test_gold=np.asarray(Y_test)

We refer to the labels in the test set as `Y_test_gold` to distinguish them from our predictions (*gold standard* refers to the ground truth, i.e., the labels that are known to be correct). Now we can evaluate our model on the test set using `predict` (to obtain the predictions of our model) and `classification_report` (to get the results) from sklearn.

In [0]:
from sklearn.metrics import classification_report

In [0]:
Y_text_predictions=svm_clf.predict(X_test)

In [0]:
print(classification_report(Y_test_gold, Y_text_predictions))

We can also compute accuracy and macro-averaged precision, recall and F-score individually.

In [0]:
from sklearn.metrics import precision_score,recall_score,f1_score,accuracy_score

In [0]:
precision=precision_score(Y_test_gold, Y_text_predictions, average='macro')
recall=recall_score(Y_test_gold, Y_text_predictions, average='macro')
f1=f1_score(Y_test_gold, Y_text_predictions, average='macro')
accuracy=accuracy_score(Y_test_gold, Y_text_predictions)

print ("Precision: "+str(round(precision,3)))
print ("Recall: "+str(round(recall,3)))
print ("F1-Score: "+str(round(f1,3)))
print ("Accuracy: "+str(round(accuracy,3)))
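
To make macro-averaging concrete, here is a small worked sketch on hypothetical toy labels (not taken from RT-polarity). Precision is computed by hand for each class, the two values are averaged with equal weight, and the result is compared with `precision_score(..., average='macro')`. The helper `precision_for_label` is an illustrative name, not an sklearn function.

```python
from sklearn.metrics import precision_score

# Hypothetical gold labels and predictions for ten examples
toy_gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

def precision_for_label(gold, pred, label):
  # Precision = correct predictions of `label` / all predictions of `label`
  gold_of_predicted = [g for g, p in zip(gold, pred) if p == label]
  return gold_of_predicted.count(label) / len(gold_of_predicted)

precision_pos = precision_for_label(toy_gold, toy_pred, 1) # 3 of the 4 predicted positives are correct
precision_neg = precision_for_label(toy_gold, toy_pred, 0) # 5 of the 6 predicted negatives are correct

# Macro-average: unweighted mean of the per-class scores
macro_by_hand = (precision_pos + precision_neg) / 2
macro_sklearn = precision_score(toy_gold, toy_pred, average='macro')

print("Macro precision (by hand): " + str(round(macro_by_hand, 3)))
print("Macro precision (sklearn): " + str(round(macro_sklearn, 3)))
```

Since every class contributes equally to the average regardless of how many examples it has, macro-averaged scores are sensitive to performance on the smaller class.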

To understand better the source of the error made by the model, we can get a confusion matrix (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for more details on confusion matrices in sklearn).

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
print (confusion_matrix(Y_test_gold, Y_text_predictions))
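
As a reading aid for the matrix printed above, the following sketch builds a confusion matrix from hypothetical toy labels. In sklearn, rows correspond to the gold labels and columns to the predicted labels, both in sorted label order (here `0`, then `1`), so for a binary task the four cells can be unpacked as true negatives, false positives, false negatives and true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical gold labels and predictions for eight examples
toy_gold = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 0, 0, 1, 1]

toy_matrix = confusion_matrix(toy_gold, toy_pred)
print(toy_matrix)

# Row = gold label, column = predicted label
tn, fp, fn, tp = toy_matrix.ravel()
print("TN: " + str(tn) + " FP: " + str(fp) + " FN: " + str(fn) + " TP: " + str(tp))

# Accuracy is the diagonal (correct predictions) over the total number of examples
accuracy_from_matrix = (tn + tp) / toy_matrix.sum()
print("Accuracy: " + str(round(accuracy_from_matrix, 3)))
```

In this toy case the classifier misses two positive examples (false negatives) but mislabels only one negative example (false positive), a difference that accuracy alone would hide.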

Depending on your split, your results may vary a bit. As you may have realized, we have not made use of our **development set**! Let's tune our model on the development set, as that can help improve it overall. On the development set we can tune anything we want, from the model itself to its parameters or features. In our case, let's tune the number of features on the development set. We can try fewer than 1000 features, which was the size of our vocabulary. For example, let's try `num_features=250`, `num_features=500`, `num_features=750` and `num_features=1000`. We can then compare the resulting models and keep the one with the highest accuracy.

In [0]:
# We first get the gold standard labels from the development set

Y_dev=[]
for instance in new_dev_set:
  Y_dev.append(instance[1])
Y_dev_gold=np.asarray(Y_dev)

# Now we can train one model per number of features, and test each of them on the dev set

list_num_features=[250,500,750,1000]
best_accuracy_dev=0.0
for num_features in list_num_features:
  # First, we get the vocabulary from the training set and train our svm classifier
  vocabulary=get_vocabulary(new_train_set, num_features)  
  svm_clf=train_svm_classifier(new_train_set, vocabulary)
  # Then, we transform our dev set into vectors and make the prediction on this set
  X_dev=[]
  for instance in new_dev_set:
    vector_instance=get_vector_text(vocabulary,instance[0])
    X_dev.append(vector_instance)
  X_dev=np.asarray(X_dev)
  Y_dev_predictions=svm_clf.predict(X_dev)
  # Finally, we get the accuracy results of the classifier
  accuracy_dev=accuracy_score(Y_dev_gold, Y_dev_predictions)
  print ("Accuracy with "+str(num_features)+": "+str(round(accuracy_dev,3)))
  if accuracy_dev>=best_accuracy_dev:
    best_accuracy_dev=accuracy_dev
    best_num_features=num_features
    best_vocabulary=vocabulary
    best_svm_clf=svm_clf
print ("\n Best accuracy overall in the dev set is "+str(round(best_accuracy_dev,3))+" with "+str(best_num_features)+" features.")

Let's now check the performance of the best model on the test set.

**Note:** The best model on the development set does not always lead to the best results on the test set.

In [0]:
X_test=[]
Y_test=[]
for instance in new_test_set:
  vector_instance=get_vector_text(best_vocabulary,instance[0])
  X_test.append(vector_instance)
  Y_test.append(instance[1])
best_X_test=np.asarray(X_test)
Y_test_gold=np.asarray(Y_test)

best_Y_text_predictions=best_svm_clf.predict(best_X_test)
print(classification_report(Y_test_gold, best_Y_text_predictions))

**Note:** We have made use of the test set only once; we have not evaluated more than one model on it. This is important: any tuning should be done on the development set if we want our method to generalize well and to be comparable to other models. If we evaluate many times on the test set, we risk overfitting our model to the test set.

**Exercise 2:** Tune the same classifier, this time with `num_features=100`, `num_features=500` and `num_features=1000`, and optimize it for macro-averaged F1-score instead of accuracy. Then evaluate the classifier that performs best on the development set (in terms of F1-score) on the test set.

In [0]:
list_num_features=[100,500,1000]
# To complete

**Exercise (optional):** Think about other elements to tune in the development set. For example, parameters in the SVM (e.g., smaller values of the [C regularization parameter](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel), more information about the parameters of the SVM [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)), other vocabulary sizes or features, feature selection methods, etc.

## C) CROSS-VALIDATION

In addition to the usual train, development and test splits, there is an alternative called cross-validation. With this technique we use a single set with all our examples, and create several different train/test splits (or train/dev/test splits). This has the advantage of testing on a wider range of examples (especially useful when your dataset is not very large), but the disadvantages of being computationally more expensive and less easily reproducible.

We are going to start with 5-fold cross-validation, i.e., the dataset is split into five parts, each of which is used once as the test set while the remaining four parts are used for training. Let's evaluate our model with 500 features on the full RT-polarity dataset using 5-fold cross-validation.
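
Before running this on the real data, it can help to see which indices `KFold` produces. The sketch below uses a hypothetical toy dataset of ten items; without shuffling, `KFold(n_splits=5)` assigns consecutive blocks of indices to the test part of each fold, and every index ends up in exactly one test fold.

```python
from sklearn.model_selection import KFold

# Hypothetical toy dataset with ten examples
toy_dataset = ["example " + str(i) for i in range(10)]

kf_toy = KFold(n_splits=5)
test_indices_seen = []
for fold, (train_index, test_index) in enumerate(kf_toy.split(toy_dataset)):
  # train_index and test_index are numpy arrays of positions in toy_dataset
  print("Fold " + str(fold) + " train: " + str(train_index.tolist()) + " test: " + str(test_index.tolist()))
  test_indices_seen.extend(test_index.tolist())

# Every example is used for testing exactly once across the five folds
print(sorted(test_indices_seen) == list(range(len(toy_dataset))))
```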
 

In [0]:
from sklearn.model_selection import KFold

In [0]:
kf = KFold(n_splits=5)
random.shuffle(dataset_full)
accuracy_total=0.0 # Initialised once, before the loop, so that it accumulates across folds
for train_index, test_index in kf.split(dataset_full):
  train_set_fold=[]
  test_set_fold=[]
  set_train_index=set(train_index.tolist()) # Membership checks on a set are much faster than on an array
  for i,instance in enumerate(dataset_full):
    if i in set_train_index:
      train_set_fold.append(instance)
    else:
      test_set_fold.append(instance)
  # The vocabulary is extracted from the training part of the fold only
  vocabulary_fold=get_vocabulary(train_set_fold, 500)
  svm_clf_fold=train_svm_classifier(train_set_fold, vocabulary_fold)
  X_test_fold=[]
  Y_test_fold=[]
  for instance in test_set_fold:
    vector_instance=get_vector_text(vocabulary_fold,instance[0])
    X_test_fold.append(vector_instance)
    Y_test_fold.append(instance[1])
  Y_test_fold_gold=np.asarray(Y_test_fold)
  X_test_fold=np.asarray(X_test_fold)
  Y_test_predictions_fold=svm_clf_fold.predict(X_test_fold)
  accuracy_fold=accuracy_score(Y_test_fold_gold, Y_test_predictions_fold)
  accuracy_total+=accuracy_fold
  print ("Fold completed.")
average_accuracy=accuracy_total/5
print ("\nAverage Accuracy: "+str(round(average_accuracy,3))) # Average over the five folds, not the last fold only

**Note:** Sklearn contains the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function, which is very convenient for evaluating a model in a cross-validation setting. However, we cannot apply this function directly to feature vectors computed beforehand when the features depend on the dataset itself, as is the case for the RT-polarity dataset (the vocabulary must be extracted from the training part of each fold).
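
The limitation in the note concerns feature vectors that are built once, outside the cross-validation loop. One possible workaround, sketched here under several assumptions and not the approach followed in this session, is to wrap the feature extraction and the classifier in an sklearn `Pipeline`, so that the vocabulary is re-extracted from the training part of every fold. The sketch substitutes sklearn's `CountVectorizer` for our own `get_vocabulary` and `get_vector_text` functions (its tokenization differs: no lemmatization and no stopword removal by default), and the toy reviews are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy reviews (1 = positive, 0 = negative)
toy_reviews = [
  "a wonderful and moving film",
  "great acting and a great story",
  "a wonderful story with great acting",
  "moving and wonderful",
  "a dull and boring film",
  "terrible acting and a boring story",
  "a dull story with terrible acting",
  "boring and terrible",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fitted inside each fold, on that fold's training part only
toy_pipeline = make_pipeline(CountVectorizer(max_features=500), SVC(kernel="linear"))
scores = cross_val_score(toy_pipeline, toy_reviews, toy_labels, cv=2, scoring="accuracy")

print("Accuracy per fold: " + str(scores))
print("Average accuracy: " + str(round(scores.mean(), 3)))
```

With only eight toy reviews the scores themselves are not meaningful; the point is only that no vocabulary information from a test fold reaches the model trained for that fold.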

**Exercise (optional):** Use the `cross_val_score` function from sklearn to evaluate an SVM classifier on the Diabetes dataset (Session 2) using 10-fold cross-validation.

**Exercise 3:** Use 3-fold cross-validation to evaluate the SVM classifier with 1000 features (instead of 500). Print the accuracy of the classifier in each of the three folds, and the average accuracy at the end.

In [0]:
# To complete