# Machine Learning Tutorial $-$ Hack Rice 2017 
### Marc E. Canby

This tutorial will teach you how to:
* Train a binary classifier in Python using a classical machine learning algorithm
* Process text using the Python library sklearn
* Select an appropriate error metric for your problem

This tutorial will not cover:
* Neural networks
* Unsupervised learning
* Big data computing algorithms such as Spark or MapReduce
* Big data computing resources such as AWS or Google Cloud

If you plan to run the code locally, you will need to have Python 3 installed, as well as the following libraries:
* numpy
* pandas
* sklearn

## Step 1: Read in the data
Before training a model, we need some data. For our purposes, we can use a dataset built into sklearn.

In [1]:
from sklearn.datasets import fetch_20newsgroups

random_seed = 1147
twenty_train = fetch_20newsgroups(shuffle=True, random_state=random_seed)

document_1 = twenty_train.data[0]
document_1_class = twenty_train.target[0]
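# Look up the name of this document's class (index 0 would just be the first class name)
document_1_class_name = twenty_train.target_names[document_1_class]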
print("Document 1:", document_1)
print("Document 1 Class:", document_1_class)
print("Document 1 Class Name:", document_1_class_name)

('Document 1:', u"From: jodfishe@silver.ucs.indiana.edu (joseph dale fisher)\nSubject: Re: prayers and advice requested on family problem\nOrganization: Indiana University\nLines: 34\n\nJulie, it is a really trying situation that you have described.  My\nbrother was living with someone like that and things were almost as bad\n(although he left after a considerably shorter amount of time due to\nother problems with the relationship).  Anyway, the best thing to do\nwould be to get everyone in the same room together (optimally in a room\nwith nothing breakable), lock the door behind you, throw the key out\nunderneath the door (just as far as the longest hand can reach.  You\nwould like to get out after the conclusion, I would imagine), and hash\nthings out.  More than likely, there will be screaming, crying, and\npossibly hitting (unless of course someone decided to bring some rope to\ntie people down).  Some of the best strategies in keeping things calmer\nwould include:\n   have each in

It is also helpful to get a sense of the dataset as a whole.

In [2]:
print("All classes:", twenty_train.target_names)
print("Number of documents:", len(twenty_train.data))

('All classes:', ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc'])
('Number of documents:', 11314)


Our goal will be to predict the class of a document from only the text within it. We will build a classifier to do this.

## Step 2: Split the data into a train and a test set

In machine learning, it is important to partition data into a training and a test set. The training set will be used to train the model (i.e. we will feed examples in the training set into the model). The test set will be used to score the model.

It is typical to use 80% of the data we have as the training set and 20% as the test set.

In [3]:
from sklearn.model_selection import train_test_split

train_idxes, test_idxes = train_test_split(range(len(twenty_train.data)), test_size=0.20, random_state=random_seed)
# Convert to sets so each membership check below is fast (a list lookup scans every index)
train_idxes, test_idxes = set(train_idxes), set(test_idxes)
train_data = [twenty_train.data[i] for i in range(len(twenty_train.data)) if i in train_idxes]
test_data = [twenty_train.data[i] for i in range(len(twenty_train.data)) if i in test_idxes]
train_labels = [twenty_train.target[i] for i in range(len(twenty_train.target)) if i in train_idxes]
test_labels = [twenty_train.target[i] for i in range(len(twenty_train.target)) if i in test_idxes]

print("Number of training examples:", len(train_data))
print("Number of testing examples:", len(test_data))

('Number of training examples:', 9051)
('Number of testing examples:', 2263)
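
As an aside, `train_test_split` also accepts the documents and labels directly, and it can stratify on the labels so that both sets keep the same class proportions. The sketch below is only an illustration on a made-up toy corpus (`toy_docs` and `toy_labels` are hypothetical names, not part of the tutorial's data). Note that the returned lists come back in shuffled order, unlike the index-filtering approach above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 short documents, 5 per class.
toy_docs = ["doc %d" % i for i in range(10)]
toy_labels = [0] * 5 + [1] * 5

# Pass data and labels together; stratify keeps the class balance in both splits.
docs_tr, docs_te, labels_tr, labels_te = train_test_split(
    toy_docs, toy_labels, test_size=0.20, random_state=1147, stratify=toy_labels
)

print("Training examples:", len(docs_tr))
print("Testing examples:", len(docs_te))
```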


## Step 3: Transform the data into a workable format

In order to build a model from the data, we need to get the data into a numerical format. We will construct a matrix in which each row represents one of our documents, and where each column represents a "feature" of that document.

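In our case, the training matrix will have 9,051 rows (one for each training document) and the test matrix will have 2,263 rows. Both will have a column for each unique word found in the training documents; words that appear only in the test documents are ignored. An entry at row $i$ and column $j$ of the matrix will represent the number of times word $j$ occurs in document $i$.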
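
To make the matrix concrete, here is a minimal sketch that builds it by hand for a made-up two-document corpus (`toy_docs` is a hypothetical name used only for illustration). It splits on whitespace, which is simpler than what a real text library does.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_docs = ["the cat sat", "the dog sat on the cat"]

# One column per unique word, in sorted order.
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})

# Entry [i][j] is the number of times word j occurs in document i.
count_matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Here the second row has a 2 in the column for "the", because that word occurs twice in the second document.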

Fortunately, we don't have to do this counting ourselves. There is a handy Python library that can do it.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X_train_counts = cv.fit_transform(train_data)
print("Shape of training matrix: " + str(X_train_counts.shape))

X_test_counts = cv.transform(test_data)
print("Shape of test matrix: " + str(X_test_counts.shape))

Shape of training matrix: (9051, 115114)
Shape of test matrix: (2263, 115114)


Our training matrix now has a row for each document, and each column represents the number of times a word appears in a given document. But what if some documents have only a couple of words, and some have thousands? The counts for longer documents will naturally be higher, causing our data to be dependent on document length. Can we normalize the data?

Yes we can. We can calculate the term frequency of word $j$ in document $i$. This is given by

$$TF(i,j)=\frac{\textrm{Number of times word } j \textrm{ appears in document } i}{\textrm{Number of words in document }i}$$

But we can do even better. Are all words equally useful in determining what class a document belongs to? No. Common words like "the", "a", and "where" occur very frequently and are not likely to be very predictive. We can penalize such words using what's called the inverse document frequency of word $j$:

$$IDF(j) = \log \frac{\textrm{Number of documents}}{\textrm{Number of documents containing word } j}$$

Multiplying $TF(i,j)$ and $IDF(j)$ gives the TFIDF value of document $i$ and word $j$, and we can use these values in our matrix instead of the counts:

$$TFIDF(i,j)=TF(i,j)\times IDF(j)$$
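
As a worked illustration, the sketch below computes these quantities by hand on a made-up corpus (`toy_docs` and the helper functions `tf`, `idf`, and `tfidf` are hypothetical names). It follows the textbook formulas above literally. Sklearn's `TfidfTransformer` uses a smoothed IDF and rescales each row by default, so its values will not match these exactly.

```python
import math

# Hypothetical toy corpus, already split into words.
toy_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "on", "the", "cat"],
    ["the", "graphics", "card"],
]

def tf(word, doc):
    # Fraction of the document's words that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(number of documents / number of documents containing `word`).
    # Assumes `word` appears in at least one document.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(tfidf("the", toy_docs[0], toy_docs))       # "the" is in every document
print(tfidf("graphics", toy_docs[2], toy_docs))  # "graphics" is in only one document
```

A word that appears in every document gets an IDF of $\log 1 = 0$, so its TFIDF value is zero no matter how often it occurs, while a rarer word keeps a positive weight.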

We don't have to perform any of this math ourselves. There is a built-in method to do it.

In [11]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train = tfidf_transformer.fit_transform(X_train_counts)
print("Shape of training matrix: " + str(X_train.shape))

X_test = tfidf_transformer.transform(X_test_counts)
print("Shape of test matrix: " + str(X_test.shape))

Shape of training matrix: (9051, 115114)
Shape of test matrix: (2263, 115114)


## Step 4: Training a binary classifier
To simplify things, we will start with training a binary classifier. A binary classifier is a function that will classify a given document as "yes" or "no".

For the moment, we will be interested in determining if a document is about computer graphics. All documents in the `comp.graphics` class will be given the label $1$ and all other documents will be given the label $0$.

In [14]:
y_train = [1 if train_labels[i] == twenty_train.target_names.index('comp.graphics') else 0 for i in range(len(train_labels))]
y_test = [1 if test_labels[i] == twenty_train.target_names.index('comp.graphics') else 0 for i in range(len(test_labels))]

print("Proportion of 1's in training data: " + str(sum(y_train)) + " / " + str(len(y_train)) + " = " + str(1.*sum(y_train)/len(y_train)))
print("Proportion of 1's in testing data: " + str(sum(y_test)) + " / " + str(len(y_test)) + " = " + str(1.*sum(y_test)/len(y_test)))

Proportion of 1's in training data: 486 / 9051 = 0.0536957242294
Proportion of 1's in testing data: 98 / 2263 = 0.0433053468847


We are now ready to train a binary classifier using logistic regression. Logistic regression is a modification of linear regression that produces an output between $0$ and $1$. This output is interpreted as the probability that a given document has label $1$.

Typically, if the output is greater than $0.5$, label $1$ is predicted. Otherwise, label $0$ is predicted.
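
A minimal sketch of that idea follows: a linear score is passed through the sigmoid function and then thresholded at $0.5$. The weights, bias, and feature values are made up for illustration, and `predict_label` is a hypothetical helper, not how sklearn implements it internally.

```python
import math

def sigmoid(z):
    # Squashes any real number into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    # Linear score, as in linear regression, passed through the sigmoid.
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return (1 if probability > threshold else 0), probability

# Hypothetical weights and feature values, chosen only for illustration.
label, probability = predict_label([2.0, -1.0], -0.5, [1.0, 0.5])
print(label, probability)
```

Training a logistic regression model amounts to finding the weights and bias that make these probabilities agree with the training labels as well as possible.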

Sklearn offers a convenient way to perform logistic regression.

In [15]:
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)

y_predicted = lr_classifier.predict(X_test)
print("Proportion of predictions that are 1: " + str(sum(y_predicted)) + " / " + str(len(y_predicted)) + " = " + str(1.*sum(y_predicted)/len(y_predicted)))
print("Accuracy of prediction: " + str(1.*sum(y_predicted == y_test) / len(y_predicted)))

Proportion of predictions that are 1: 17 / 2263 = 0.00751215201061
Accuracy of prediction: 0.962439239947


We get 96% accuracy!! But is accuracy really a good metric here?

## Step 5: Evaluating a binary classifier
In order to say whether our classifier is good or not, we need to find a numerical way to evaluate it. The first thing that comes to mind is accuracy, which is defined as

$$ Accuracy = \frac{\textrm{Number of correctly classified documents}}{\textrm{Number of classified documents}}$$

But we saw that only about 5% of our data has label $1$ (5.4% of the training set and 4.3% of the test set). Therefore, if we always predicted label $0$, we would have about 95% accuracy without learning anything! Clearly, accuracy is not a good error metric if we wish to predict a rare event. So, we have to do something more clever.
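
The baseline can be checked with a couple of lines of arithmetic, using the test-set counts printed in Step 4:

```python
# Counts taken from the test-set output in Step 4: 98 of 2263 documents have label 1.
n_test = 2263
n_positive = 98

# A "classifier" that always predicts 0 is right on every negative document.
baseline_accuracy = (n_test - n_positive) / n_test
print("Always-0 baseline accuracy:", baseline_accuracy)
```

This comes out just below the accuracy our logistic regression achieved, which shows how little the accuracy figure tells us here.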

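Each prediction falls into one of four categories, depending on its predicted label and its actual label:

<table>
<tr>
<td>&nbsp;</td>
<td>Predicted Positive</td>
<td>Predicted Negative</td>
</tr>
<tr>
<td>Actual Positive</td>
<td>True Positives ($TP$)</td>
<td>False Negatives ($FN$)</td>
</tr>
<tr>
<td>Actual Negative</td>
<td>False Positives ($FP$)</td>
<td>True Negatives ($TN$)</td>
</tr>
</table>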

We can now define **precision** as the proportion of all documents predicted positive that actually are positive:

$$ Precision = \frac{TP}{TP+FP}$$

And we can define **recall** as the proportion of all documents that actually are positive that are predicted positive:

$$ Recall = \frac{TP}{TP+FN}$$

If we can somehow balance the precision and recall, then we know we will have a good classifier that can accurately identify rare events. We balance these two metrics by taking the harmonic mean of them to get what is known as the **f1-score**:

$$f_1 = 2 \cdot \frac{Precision \times Recall}{Precision + Recall}$$

Instead of accuracy, we will use the f1-score as our metric. We are finally ready to see how our classifier did.

In [17]:
from sklearn.metrics import precision_recall_fscore_support

(precision, recall, f1, support) = precision_recall_fscore_support(y_test, y_predicted, average = 'binary')

print("Precision: " + str(precision))
print("Recall: " + str(recall))
print("f1-score: " + str(f1))

Precision: 0.882352941176
Recall: 0.15306122449
f1-score: 0.260869565217


We can see that of all the documents we predicted as label $1$, 88% actually have label $1$. However, we only found 15% of the documents whose true label is $1$. This results in our overall f1-score being $0.26$. There is definitely room for improvement!
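
As a sanity check, the reported values can be recomputed by hand. The counts below are not printed by the notebook; they are inferred from the outputs above (17 predicted positives, 98 actual positives, 2263 test documents), so treat them as an assumption.

```python
# Counts inferred from the outputs above (an assumption, not printed by the notebook):
# 17 predicted positives of which 15 are correct, and 98 actual positives in 2263 documents.
tp, fp, fn, tn = 15, 2, 83, 2163

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print("Precision:", precision)
print("Recall:", recall)
print("f1-score:", f1)
print("Accuracy:", accuracy)
```

The 83 false negatives are what drag the recall, and with it the f1-score, down, even though the accuracy stays high.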

## Next Steps and Final Thoughts
We ran a very basic logistic regression on our data, but there is much more we can do. Here are some things we could consider trying next:
* Improve our text processing. It is often helpful to throw out **stop words** such as "the", "and", and "where". It also helps to **stem** or **lemmatize** words, which maps different forms of a word to one form (e.g. "cat" and "cats" are both mapped to "cat", or, with lemmatization, "go" and "went" are both mapped to "go"). We can combine **named entities** into a single word, such as "Microsoft Office" or "MacBook Pro". The nltk library provides many functions to help with this kind of text processing.
* Tune parameters of the model. If you look in sklearn's documentation for logistic regression you will see that you can pass many parameters into the LogisticRegression object, such as whether or not to include **regularization** or how to optimize the cost function.
* Perform cross-validation. Instead of having one train and test set, you could partition the train set into $k$ **folds**, train on $k-1$ of those, and test on the remaining one, repeating so that each fold serves as the test fold once. This can help reduce bias in parameter tuning.
* Consider alternate binary classification algorithms. While logistic regression works well, there are other common ones: Naive Bayes, support vector machines, decision trees, and random forests. Sklearn offers support for all of these and they are just as easy to use as logistic regression (one call to fit and one call to predict).
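
Several of these ideas can be combined in a few lines. The sketch below is only an illustration on a made-up toy corpus (`toy_docs` and `toy_labels` are hypothetical, and the value `C=10.0` is arbitrary rather than tuned): it removes English stop words, sets the regularization strength of the logistic regression, and scores the model with 3-fold cross-validation using the f1-score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 1 for graphics-like text, 0 otherwise.
toy_docs = [
    "rendering polygons with a graphics card",
    "the image uses a 3d graphics shader",
    "texture mapping and image rendering",
    "a shader for polygons and texture",
    "graphics card rendering a 3d image",
    "polygons and shader texture mapping",
    "the team won the hockey game",
    "a baseball game and the season",
    "the car engine and the motorcycle",
    "hockey season and the baseball team",
    "the motorcycle engine and the car",
    "the team and the game this season",
]
toy_labels = [1] * 6 + [0] * 6

# Stop-word removal and TF-IDF in one step, then a regularized logistic regression.
# C is the inverse regularization strength; 10.0 is an arbitrary illustrative value.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(C=10.0),
)

# 3-fold cross-validation, scored with the f1-score instead of accuracy.
scores = cross_val_score(model, toy_docs, toy_labels, cv=3, scoring="f1")
print("f1 per fold:", scores)
```

To try a different algorithm, the `LogisticRegression(...)` step could be swapped for another sklearn classifier such as `MultinomialNB()` or `LinearSVC()` without changing the rest of the pipeline.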

If you would rather do multi-class classification, sklearn offers support for that as well $-$ for example, multinomial Naive Bayes or softmax regression. This would allow you to solve the problem of predicting the most likely of the 20 labels for a given document.
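
A minimal sketch of the multi-class case, assuming a made-up three-class toy corpus (`toy_docs` and `toy_labels` are hypothetical) and multinomial Naive Bayes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with three classes instead of two.
toy_docs = [
    "graphics card rendering polygons",
    "shader texture image rendering",
    "hockey team won the game",
    "baseball season and the team",
    "rocket launch into orbit",
    "satellite orbit and the moon",
]
toy_labels = ["graphics", "graphics", "sport", "sport", "space", "space"]

# The same fit/predict calls work with more than two labels.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_docs, toy_labels)

predictions = model.predict(["the rocket reached orbit", "a new graphics shader"])
print(predictions)
```

On the real dataset, the labels would be `train_labels` from Step 2 rather than the binary `y_train`.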

Even though we only looked at an example in text processing, the techniques we have covered can be applied to many other kinds of data. Consider image classification for example. If we design a matrix such that each row represents an image and each column represents the red, green, or blue value of a pixel, we could use the same algorithms and error metrics to classify pictures! As long as we can get a data matrix in this form, we can apply these machine learning models to it.
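
As a rough sketch of what that matrix looks like, the snippet below flattens a batch of random stand-in "images" with numpy. The image size and count are arbitrary assumptions made for illustration.

```python
import numpy as np

# Hypothetical batch of 5 random RGB images, each 8 x 8 pixels.
rng = np.random.RandomState(1147)
images = rng.randint(0, 256, size=(5, 8, 8, 3))

# Flatten each image into one row: one column per pixel per color channel.
X_images = images.reshape(len(images), -1)
print("Shape of image matrix:", X_images.shape)
```

Each row now plays the same role as a document's row in Step 3, and could be passed to `fit` and `predict` in the same way.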

Hope you enjoyed! Have fun at the hackathon!