This Jupyter notebook is a first approximation to Machine Learning using the library scikit-learn. In this notebook we will train a machine learning algorithm for the first time! Feature engineering and feature selection are also included below.

Let's start with a simple introduction to scikit-learn, which is the Machine Learning library we are going to use during the whole first semester.





## INTRODUCTION TO SCIKIT-LEARN


---

First, we import the libraries that we are going to use: numpy (vector manipulation), nltk (text preprocessing) and scikit-learn (machine learning).

**Note:** All these libraries need to be downloaded beforehand if not using Google Colab. Check their official websites for details on how to install them.

In [1]:
import numpy as np
import nltk
import sklearn
import operator
import requests
nltk.download('stopwords') # If needed
nltk.download('punkt') # If needed
nltk.download('wordnet') # If needed

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

We are going to work with a binary classification dataset, named "Diabetes", with the goal of predicting whether a person has diabetes or not. First, we need to load the dataset in Python. There are three different ways to download the dataset:


1.   (General) Load directly from the web. 
2.   (Google Colab) Download manually the dataset from the web or Learning Central, add it to your Google Drive and load it from there.
3.   (Local) Download manually the dataset from the web or Learning Central, and load it directly from your hard drive.

Choose your favorite method below, and un/comment out the two other methods that you will not be using (remember you can comment lines of code by adding *#* at the beginning).






In [0]:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

#Method 1
response = requests.get(url)
dataset_file = response.text.split("\n")

#Method 2 - Google Colab
##from google.colab import drive
##drive.mount('/content/drive/')
##path= '/content/drive/My Drive/pima-indians-diabetes.data.csv'
##dataset_file=open(path).readlines()


#Method 3 - Local
##path='/home/user/Downloads/pima-indians-diabetes.data.csv'
##dataset_file=open(path).readlines()

The dataset is stored as a .csv (comma-separated) file. In the following we are going to access the data, each file corresponding to a patient and their diagnostic measures (we will refer to these as features). In total, there are eight features, sorted from left to right:

1.   Number of times pregnant.
2.   Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
3.   Diastolic blood pressure (mm Hg).
4.   Triceps skinfold thickness (mm).
5.   2-Hour serum insulin (mu U/ml).
6.   Body mass index (weight in kg/(height in m)^2).
7.   Diabetes pedigree function.
8.   Age (years).

The last column corresponds to whether the patient has diabetes (1) or not (0). This is the feature we want to predict. 

Let's check how the data looks like by, for example, checking the number of patients overall and the features of the first five patients of the dataset:



In [3]:
print ("Number of patients: "+str(len(dataset_file))+"\n")
for patient_line in dataset_file[:5]:
  print (patient_line)

Number of patients: 768

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1


As we can observed in this small sample of five patients, the first, third and fifth got diabetes (last column=1), while the second and fourth did not (last column=0). 
**Note:** This dataset contains a few missing values, which are set as zeroes.

**Excercise (optional):**
Try to load and process the csv file using [pandas](https://pandas.pydata.org). Pandas is a library to process data structures (e.g. csv files) and also provides useful data analysis tools.

To train our machine learning classifier, we first need to convert the input features of each person into vectors (numpy arrays) and keep that information into a list. Similary, we keep the output for each person (1 or 0 depending whether the person has diabetes or not) in another list.



In [0]:
X_train=[]
Y_train=[]
for patient_line in dataset_file:
  patient_linesplit=patient_line.split(",")
  vector_patient_features=np.zeros(len(patient_linesplit)-1)
  for i in range(len(patient_linesplit)-1):
    vector_patient_features[i]=float(patient_linesplit[i])
  X_train.append(vector_patient_features)
  Y_train.append(int(patient_linesplit[-1]))

Once we preprocessed the data, we are ready to train our first machine learning algorithm! In this case we are going to use an SVM binary classifier (we will see more details about machine learning algorithms from Session 4). As a binary classifier, for training we should provide the features as input and "1" or "0 as output. The function to train a machine learning model in sklearn is `.fit`.

In [5]:
X_train_diabetes=np.asarray(X_train)
Y_train_diabetes=np.asarray(Y_train) # This step is really not necessary, but it is recommended to work with numpy arrays instead of Python lists.

svm_clf_diabetes=sklearn.svm.SVC() # Initialize the SVM model
svm_clf_diabetes.fit(X_train_diabetes,Y_train_diabetes) # Train the SVM model



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

We have already trained our first supervised machine learning classifier! Let's check now how it works with two random patients:

In [6]:
patient_1=['0', '100', '86', '20', '39', '35.1', '0.242', '21']
patient_2=['1', '197', '70', '45', '543', '30.5', '0.158', '51']
print (svm_clf_diabetes.predict([patient_1]))
print (svm_clf_diabetes.predict([patient_2]))

[0]
[1]


**Excercise 1:**
Choose three features from the eight features of the "Diabetes" dataset and learn the same binary SVM classifier. Check how the classifier works with one example, i.e., choose random values for your three features and check the prediction of your SVM classifier.

In [0]:
X_train=[]
Y_train=[]
#To complete




## Feature engineering

---


In Machine Learning, the process of feature engineering consists of transforming data into features. In the previous examples with the "Diabetes" dataset the features were already given, but in most cases we should extract the features ourselves. In this case, we are going to deal with examples with textual data. To extract features from textual content, we can make use of what we learned from the exercises of Session 1.

For these exercises we will be using a dataset for *sentiment analysis*. Sentiment analysis is the automatic process of classifying opinions as positive or negative (there are other definitions of sentiment analysis which are more general as well). To do so, we are going to make use of the RT-polarity dataset. Let's first download it and inspect the data. This time we are going to load the dataset directly from the web, but feel free to use the method of your choice to load the data, as explained above: 



In [0]:
url_pos="http://josecamachocollados.com/rt-polarity.pos.txt" # Containing all positive reviews, one review per line
url_neg="http://josecamachocollados.com/rt-polarity.neg.txt" # Containing all negative reviews, one review per line
#Load positive reviews
response_pos = requests.get(url_pos)
dataset_file_pos = response_pos.text.split("\n")

#Load negative reviews
response_pos = requests.get(url_pos)
dataset_file_neg = response_pos.text.split("\n")

Let's inspect a bit the dataset, by printing the first five positive and negative reviews.

In [9]:
print ("Positive reviews:\n")
for pos_review in dataset_file_pos[:5]:
  print (pos_review)
print ("\n   ------\n")  
print ("Negative reviews:\n")
for neg_review in dataset_file_neg[:5]:
  print (neg_review)
 

Positive reviews:

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 
effective but too-tepid biopic
if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 
emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 

   ------

Negative reviews:

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately de

Now we are going to try to define a vocabulary which can be used to transform sentences (strings) into text. Let's take, for example, the 1000 most frequent words in the dataset, excluding stopwords.

**Note:** Stopwords are generally short function words that do not provide a specific meaning without context (e.g. articles such as "the" or prepositions such as "on"). They can be different depending on the purpose. In our case we will be using the English stopwords as given by NLTK.



In [0]:
lemmatizer = nltk.stem.WordNetLemmatizer()

# Function taken from Session 1
def get_list_tokens(string):
  sentence_split=nltk.tokenize.sent_tokenize(string)
  list_tokens=[]
  for sentence in sentence_split:
    list_tokens_sentence=nltk.tokenize.word_tokenize(sentence)
    for token in list_tokens_sentence:
      list_tokens.append(lemmatizer.lemmatize(token).lower())
  return list_tokens

In [11]:
# First, we get the stopwords list from nltk
stopwords=set(nltk.corpus.stopwords.words('english'))
# We can add more words to the stopword list, like punctuation marks
stopwords.add(".")
stopwords.add(",")
stopwords.add("--")
stopwords.add("``")

# Now we create a frequency dictionary with all words in the dataset
# This can take a few minutes depending on your computer, since we are processing more than ten thousand sentences

dict_word_frequency={}
for pos_review in dataset_file_pos:
  sentence_tokens=get_list_tokens(pos_review)
  for word in sentence_tokens:
    if word in stopwords: continue
    if word not in dict_word_frequency: dict_word_frequency[word]=1
    else: dict_word_frequency[word]+=1
for neg_review in dataset_file_neg:
  sentence_tokens=get_list_tokens(neg_review)
  for word in sentence_tokens:
    if word in stopwords: continue
    if word not in dict_word_frequency: dict_word_frequency[word]=1
    else: dict_word_frequency[word]+=1
      
# Now we create a sorted frequency list with the top 1000 words, using the function "sorted". Let's see the 15 most frequent words
sorted_list = sorted(dict_word_frequency.items(), key=operator.itemgetter(1), reverse=True)[:1000]
i=0
for word,frequency in sorted_list[:15]:
  i+=1
  print (str(i)+". "+word+" - "+str(frequency))
  
# Finally, we create our vocabulary based on the sorted frequency list 
vocabulary=[]
for word,frequency in sorted_list:
  vocabulary.append(word)

1. 's - 3626
2. film - 2016
3. movie - 1300
4. ha - 768
5. one - 766
6. n't - 694
7. make - 590
8. story - 574
9. like - 566
10. ' - 480
11. performance - 468
12. character - 460
13. comedy - 430
14. time - 428
15. work - 412


Once we have our vocabulary, we can transform sentences into vectors as we saw in Session 1, using the function below.

In [0]:
def get_vector_text(list_vocab,string):
  vector_text=np.zeros(len(list_vocab))
  list_tokens_string=get_list_tokens(string)
  for i, word in enumerate(list_vocab):
    if word in list_tokens_string:
      vector_text[i]=list_tokens_string.count(word)
  return vector_text

Using this function we can now load our training features, as we did with the "Diabetes" dataset. In this case, we will label positive reviews as "1" and negative reviews as "0".

In [0]:
# This can take a while, as we are converting more than ten thousand sentences into vectors!
X_train=[]
Y_train=[]
for pos_review in dataset_file_pos:
  vector_pos_review=get_vector_text(vocabulary,pos_review)
  X_train.append(vector_pos_review)
  Y_train.append(1)
for neg_review in dataset_file_neg:
  vector_neg_review=get_vector_text(vocabulary,neg_review)
  X_train.append(vector_neg_review)
  Y_train.append(0)

**Exercise (optional):** Try transforming the sentences into weighted frequency features using [TFidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). This function uses a weighted scheme called [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (term frequency-inverse document frequency) which basically penalizes words that are repeated across many documents (e.g. frequent words such as "the" or "a").

Once we have loaded all the feature vectors, we can now train our SVM binary classifier! 

In [14]:
X_train_sentanalysis=np.asarray(X_train)
Y_train_sentanalysis=np.asarray(Y_train)

svm_clf_sentanalysis=sklearn.svm.SVC()
svm_clf_sentanalysis.fit(X_train_sentanalysis,Y_train_sentanalysis) # Train the SVM model. This may also take a while.




SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

Let's try how it works with some examples!

In [21]:
sentence_1="It was fascinating, probably one of the best movies I've ever seen."
sentence_2="Bad movie, probably one of the worst I have ever seen."
print (svm_clf_sentanalysis.predict([get_vector_text(vocabulary,sentence_1)]))
print (svm_clf_sentanalysis.predict([get_vector_text(vocabulary,sentence_2)]))

[1]
[0]


It seems to be working! However, this is a very simple classifier and is definetely not perfect. You can try other examples yourself to see how the model behaves, find weaknesses and try to improve it with better features!

**Excercise 2:**
Based on this example, create a function that, given two files of positive and negative reviews (one sentence per line as in our RT-polarity dataset) and an integer number X as input, it returns the vocabulary and a binary SVM classifier similar to what we learned, using the X most frequent words as features. Check how the classifier works with X=1200. You can check the predictions with the same sample sentences as above.

**Note:** You can use auxiliary functions if needed (not mandatory but can be useful). For example, a function that first retrieves the vocabulary given the datasets and X.

In [0]:
def train_svm_classifier(dataset_file_pos, dataset_file_neg, x):
  #To complete

**Exercise (optional):** Think about different features that can be useful for sentiment analysis and add it to our frequency vector. Some ideas: (1) use a dictionary of positive or negative words (some dictionaries available [here](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon)); (2) use n-gram features (n-grams are sequence of n-words as opposed or a single word, e.g., "cardiff university" would be a bigram); (3) use only verbs and adjectives as features (see [PoS tagging](https://www.nltk.org/book/ch05.html) in NLTK)...


## Feature selection

---


The process of feature selection consists of selecting a subset of relevant features. For example, in the sentiment analysis example above, we selected 1000 features based on the 1000 most frequent words. However, not all words may be equally relevant. For example, "film" is the second most frequent word but may appear equally in positive and negative reviews, therefore it is not a very relevant feature for our task.


In this notebook we are going to use the [chi-squared test](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) method, available in sklearn. This method basically removes the features that appear to be irrelevant to a given class (in our case positive or negative). For example, words that do not express sentiment are expected to be removed from the set. Let's apply this feature selection method to our RT-polarity dataset to keep only the 500 most relevant features.

In [0]:
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

In [24]:
fs_sentanalysis=SelectKBest(chi2, k=500).fit(X_train_sentanalysis, Y_train_sentanalysis)
X_train_sentanalysis_new = fs_sentanalysis.transform(X_train_sentanalysis)
#X_train_new = SelectKBest(chi2, k=500).fit_transform(X_train, Y_train)
print ("Size original training matrix: "+str(X_train_sentanalysis.shape))
print ("Size new training matrix: "+str(X_train_sentanalysis_new.shape))

Size original training matrix: (10664, 1000)
Size new training matrix: (10664, 500)


Now we can train again our SVM classifier with the 500 most relevant features, replacing the old one.

In [25]:
svm_clf_sentanalysis_=sklearn.svm.SVC() # Change the name here, e.g. 'new sentanalysis_svm_clf', and below if you don't want to replace your old classifier.
svm_clf_sentanalysis_.fit(X_train_sentanalysis_new,Y_train_sentanalysis) #Train the new SVM model. This may take a while.



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

And now we can test our classifier with some new examples.

**Note**: To transform the original 1000 features into our reduced 500 features, we use the function `.transform`. This function is very common in sklearn.

In [26]:
sentence_3="Highly recommended: I enjoyed the movie from the beginning to the end."
sentence_4="I got a bit bored, it was not what I was expecting."
print (svm_clf_sentanalysis_.predict(fs_sentanalysis.transform([get_vector_text(vocabulary,sentence_3)])))
print (svm_clf_sentanalysis_.predict(fs_sentanalysis.transform([get_vector_text(vocabulary,sentence_4)])))

[1]
[0]


**Exercise 3:** Apply the same chi-squared feature selection method to select the seven most relevant features from the Diabetes dataset. Check your method with some sample input features (you can use the same "patient_1" and "patient_2" examples).

In [29]:
# To complete

[0]
[1]




**Exercise (optional):** Check other feature selection methods in skelarn (feature selection methods available [here](https://scikit-learn.org/stable/modules/feature_selection.html)) and try one of them with our sentiment analysis dataset.