# Introduction to Machine Learning
This tutorial is based on the official tutorial titled [Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) by the `scikit-learn` team and a similarly-minded [tutorial](https://github.com/jonhare/WAIS-ML101) conducted by Dr Jonathon Hare in 2016. Compared to those versions, the current tutorial features a dataset that was published by [Kaggle](https://www.kaggle.com) as part of a competition that they organised on [Detecting Insults in Social Commentary](https://www.kaggle.com/c/detecting-insults-in-social-commentary). Furthermore, a number of changes have been made to the code of the above-mentioned tutorials in order to include `pandas` in the data pre-processing stage, and to ensure compatibility with the updated version of `scikit-learn`.

In this tutorial you will learn how to:
* Extract feature vectors from text documents
* Load, inspect and pre-process a dataset of comments on social media
* Train a classifier to predict whether a comment on social media is insulting or not
* Use Grid Search to better tune the hyper-parameters of your Machine Learning pipeline

In order to run this IPython Notebook (Python 2), [Jupyter](http://jupyter.org/) should be installed on your machine. Besides Jupyter, the following Python packages should also be installed: (i) `pandas` and (ii) `scikit-learn`. The easiest way to install all of these together is with [Anaconda](https://www.anaconda.com/) (Windows, macOS and Linux installers available).

# Loading the Kaggle Dataset
For the purposes of this tutorial, we will be using a dataset of comments on social media along with their classification labels (i.e. "insult" or "neutral"). The dataset is stored as a binary `pickle` file at `./Kaggle/train.p`.

In [1]:
kaggle_dataset_location = './Kaggle/train.p' # The location of the Kaggle dataset

In [2]:
import cPickle as pickle

# Loading the binary-encoded pickle files from the designated location.
with open(kaggle_dataset_location, 'rb') as f:
    kaggle_dataset = pickle.load(f)

The `kaggle_dataset` variable contains the dataset as a Python dictionary of lists. We will be using the `pandas` library in order to transform this structure into a `pandas.DataFrame`, which will simplify data inspection and pre-processing.

In [3]:
import pandas as pd

kaggle_dataset_df = pd.DataFrame(kaggle_dataset)
print len(kaggle_dataset_df) # Number of rows in the loaded DataFrame

3947


We print the last 10 rows of the `DataFrame` in order to get an understanding of the structure of the dataset.

In [4]:
print kaggle_dataset_df.tail(n=10)

        Class                                            Comment  \
3937  neutral  Your Yellowstone Fly Fishing Report:\n\n.. The...   
3938  neutral  MrO,\n\nProof is shown by liberals not wanting...   
3939  neutral  The only ignorant person here is you, who thin...   
3940  neutral               oh i had many cars like this before.   
3941  neutral  @Sara Besleaga Griji, doruri sau dorin\\xc8\\x...   
3942   insult    you are both morons and that is never happening   
3943  neutral  Many toolbars include spell check, like Yahoo ...   
3944  neutral  @LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux Fa...   
3945  neutral  How about Felix? He is sure turning into one h...   
3946  neutral  You're all upset, defending this hipster band....   

                 Date  
3937  20120619145323Z  
3938  20120612001129Z  
3939  20120619205630Z  
3940  20120610114639Z  
3941              NaN  
3942  20120502172717Z  
3943  20120528164814Z  
3944  20120620142813Z  
3945  20120528205648Z  
3946  20120

In [5]:
print kaggle_dataset_df['Comment'][1]
print kaggle_dataset_df['Comment'][10]

i really don't understand your point.\xa0 It seems that you are mixing apples and oranges.
@jdstorm dont wish him injury but it happened on its OWN and i DOUBT he's injured, he looked embarrassed to me


In [6]:
# Print all the available columns of the dataset.
print(kaggle_dataset_df.columns)
# Define the target column which we want to predict.
target_column = u'Class'

Index([u'Class', u'Comment', u'Date'], dtype='object')


In [7]:
# The comments are the inputs of our model.
inputs = kaggle_dataset_df['Comment']
# The category codes of the classes in the target column are the outputs.
outputs = kaggle_dataset_df['Class'].astype('category').cat.codes

In [8]:
class_names = kaggle_dataset_df['Class'].astype('category').cat.categories.tolist()
print class_names

[u'insult', u'neutral']


Before moving further, it is very important to gain a basic understanding of how imbalanced our dataset is towards certain classes.

In [9]:
for cl in class_names:
    print '%d comments that are labeled as %s.' % (len(kaggle_dataset_df[kaggle_dataset_df['Class'] == cl]), cl)

1049 comments that are labeled as insult.
2898 comments that are labeled as neutral.
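
One reason the imbalance matters is that it sets a floor for accuracy. The short sketch below uses only the two class counts reported above to work out how often a trivial model that always predicts the majority class would be right on the full dataset. The variable names are chosen for illustration only.

```python
# Class counts reported above for the full dataset.
class_counts = {'insult': 1049, 'neutral': 2898}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / float(total)

print('Always predicting "%s" would be right for %.1f%% of the comments.'
      % (majority_class, 100 * baseline_accuracy))
```

Any classifier we train later should be judged against this baseline rather than against 50%.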


In order to evaluate the performance of our algorithm, we should test it on data that it hasn't *seen* during training. Luckily, `scikit-learn` includes an appropriate function that splits the items of a dataset into random train and test subsets.

We set the portion of the original dataset that will be used for testing.

In [10]:
from sklearn.model_selection import train_test_split

test_size = 0.2
# Set a random_state number for replicability of the experiments.
random_state = 10
# Split the dataset into training and testing according to the test_size variable.
# y_train and y_test are lists containing the classes' indices. Splitting the inputs
# and the outputs in a single call keeps each comment aligned with its label.
input_train, input_test, y_train, y_test = train_test_split(inputs.tolist(), outputs.tolist(),
                                                            test_size=test_size, random_state=random_state)
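
The idea behind such a split can be sketched in plain Python. This is a simplified illustration and not `scikit-learn`'s implementation: the helper `simple_train_test_split`, the miniature dataset and the rounding rule are all assumptions made for the example. The key point is that one shuffled set of indices is applied to both the comments and the labels, so the two stay aligned.

```python
import random

def simple_train_test_split(items, labels, test_size, seed):
    # Shuffle the row indices once, reproducibly, with a seeded generator.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_size * len(items)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Apply the same indices to items and labels so that they stay aligned.
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Hypothetical miniature dataset: ten comments with alternating labels.
comments = ['comment %d' % i for i in range(10)]
labels = [i % 2 for i in range(10)]
c_train, c_test, l_train, l_test = simple_train_test_split(
    comments, labels, test_size=0.2, seed=10)
print('%d training and %d test comments' % (len(c_train), len(c_test)))
```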

# Extracting Features from Text: The Bag-of-Words Approach
In order to be able to use text documents$^1$ as input to Machine Learning algorithms, we need to follow a process that turns them into numerical feature vectors. We generally refer to this process as *vectorisation*. The most intuitive way to do so is the **bag-of-words** approach, which is carried out as follows:
1. Identify all the words that occur in the documents of a training set.
2. Assign a fixed integer ID to each one of those words. For example, in Python you could build a dictionary that maps each word to its corresponding integer ID:
 ```python
 dictionary = {'I': 1,
               'study': 2,
               'machine': 3,
               'learning': 4,
               # ... and so on for the remaining words
               }
 ```
3. For each document in the training set, we count the number of occurrences of each word, and we store it in $X[d, w]$, as the value of the $w$-th feature for the $d$-th document, where $w$ is the index of the word in the dictionary.
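
The three steps above can be sketched in plain Python. This is a minimal illustration and not how `scikit-learn` implements it: the two-document `docs` list, the whitespace tokenisation and the variable names are assumptions made for the example, and the IDs start from 0 here so that they can be used directly as column indices.

```python
# Hypothetical two-document training set, tokenised naively on whitespace.
docs = ['I study machine learning',
        'I love machine learning and deep learning']
tokenised = [doc.lower().split() for doc in docs]

# Steps 1 and 2: collect the distinct words and give each a fixed integer ID.
vocabulary = sorted(set(word for tokens in tokenised for word in tokens))
dictionary = dict((word, idx) for idx, word in enumerate(vocabulary))

# Step 3: X[d][w] counts how often word w occurs in document d.
X = [[0] * len(dictionary) for _ in tokenised]
for d, tokens in enumerate(tokenised):
    for word in tokens:
        X[d][dictionary[word]] += 1

print(vocabulary)
for row in X:
    print(row)
```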

The bag-of-words representation implies that the total number of features is the number of distinct words in the corpus, which for large corpora is typically larger than 100,000.

While storing all these values in a `numpy` array would require a substantial amount of memory, most values in $X$ will be zeros, since a given document contains only a small subset of the distinct words in the dataset. For this reason, we say that bag-of-words representations are typically high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the feature vectors. `scipy.sparse` matrices are data structures that do exactly this, and `scikit-learn` has built-in support for these structures.
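
To make the saving concrete, here is a rough plain-Python sketch of the idea: only the non-zero cells are kept, keyed by their position, and every other cell is treated as zero. The matrix values and the `lookup` helper are hypothetical, and `scipy.sparse` uses more compact storage formats than a dictionary; the sketch only conveys the principle.

```python
# A small dense count matrix with mostly zero entries (hypothetical values).
dense = [[0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
         [0, 0, 2, 0, 1, 1, 0, 0, 0, 0],
         [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]]

# Keep only the non-zero cells, keyed by (row, column).
sparse = {}
for r, row in enumerate(dense):
    for c, value in enumerate(row):
        if value != 0:
            sparse[(r, c)] = value

n_cells = len(dense) * len(dense[0])
print('%d of %d cells are stored' % (len(sparse), n_cells))

def lookup(r, c):
    # Any cell that is not stored is implicitly zero.
    return sparse.get((r, c), 0)
```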

In `scikit-learn`, text preprocessing, tokenising and stop-word filtering (e.g. of "and", "or" and "that") are included in a high-level component that is able to build a dictionary of features and transform documents into feature vectors:

$^1$ Text documents can vary substantially in length and writing style. In our case, the text documents are the short comments of our Kaggle dataset, but the techniques presented in this tutorial could also work on much longer texts, such as articles or books.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
# The lines below load the TweetTokenizer from the nltk library.
# You can uncomment them, along with the tokenizer argument of
# the CountVectorizer, should you like to see the results with
# a different tokeniser.
# from nltk.tokenize import TweetTokenizer
# word_tokenizer = TweetTokenizer(preserve_case=False, 
#                                 strip_handles=True, 
#                                 reduce_len=True).tokenize

count_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                   stop_words=None,
                                   # tokenizer=word_tokenizer,
                                   # Ignore terms that have a document frequency strictly higher than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   max_df=1.0,
                                   # Ignore terms that have a document frequency strictly lower than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   min_df=1)
count_vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

## Working on a Toy Example
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

In [12]:
toy_corpus = [u'You have studied Machine Learning',
              u'I love learning Machine Learning',
              u'Looking forward to #WAISAwayDay',
              u'Have you studied Machine Learning']

# Fits the vectoriser and transforms the corpus into its bag-of-words representation.
toy_count = count_vectorizer.fit_transform(toy_corpus) 

During the fitting process each term is assigned a unique integer index corresponding to a column in the resulting `toy_count` matrix (i.e. the equivalent of the $X$ matrix mentioned in the description of this section of the tutorial).

In [13]:
print toy_count # This is the memory-efficient representation of a sparse matrix.
print toy_count.toarray()

  (0, 2)	1
  (0, 5)	1
  (0, 6)	1
  (0, 1)	1
  (0, 9)	1
  (1, 4)	1
  (1, 2)	2
  (1, 5)	1
  (2, 8)	1
  (2, 7)	1
  (2, 0)	1
  (2, 3)	1
  (3, 2)	1
  (3, 5)	1
  (3, 6)	1
  (3, 1)	1
  (3, 9)	1
[[0 1 1 0 0 1 1 0 0 1]
 [0 0 2 0 1 1 0 0 0 0]
 [1 0 0 1 0 0 0 1 1 0]
 [0 1 1 0 0 1 1 0 0 1]]


You can see that the first and the last rows of the array are identical. This happens because they correspond to comments that contain the same words, only in a different order. The bag-of-words representation discards word order, so both are encoded as equal vectors, which leads to a loss of valuable information. `CountVectorizer` also supports counts of n-grams of words or consecutive characters. N-grams are runs of consecutive characters or words; in the case of word bi-grams, for example, every consecutive pair of words would be a feature. Support for n-grams can be enabled by adjusting the `ngram_range` parameter during the initialisation of the `CountVectorizer`.
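
The effect of word bi-grams on the two identical rows can be previewed with a few lines of plain Python. The helper `word_ngrams` is hypothetical and tokenises naively on whitespace, so it is only an approximation of what `CountVectorizer` does internally.

```python
def word_ngrams(text, n):
    # Naive whitespace tokenisation; CountVectorizer's own tokeniser differs.
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = 'You have studied Machine Learning'
last = 'Have you studied Machine Learning'

# The single words are the same in both comments...
print(sorted(word_ngrams(first, 1)) == sorted(word_ngrams(last, 1)))
# ...but the pairs of consecutive words are not.
print(word_ngrams(first, 2))
print(word_ngrams(last, 2))
```

Because the two comments share only some of their bi-grams, adding bi-gram features lets the vectoriser tell them apart.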

In the initialisation of `CountVectorizer` set the `ngram_range` variable to `(1, 2)`, and check the resulting `toy_count` matrix by running `toy_count.toarray()`. Do the results make sense?

The interpretation of the columns can be retrieved as follows:

In [14]:
count_vectorizer.get_feature_names()

[u'forward',
 u'have',
 u'learning',
 u'looking',
 u'love',
 u'machine',
 u'studied',
 u'to',
 u'waisawayday',
 u'you']

Once the vectoriser is fitted, you can retrieve the index (starting from zero) of a particular word in the dictionary by simply calling:

In [15]:
count_vectorizer.vocabulary_.get(u'machine')

5

For further details about the functionality of `CountVectorizer`, please refer to its [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer).

## Working on the Kaggle Dataset

In [16]:
# Print a small part of the comments that will be used for training.
input_train[:5] 

[u"Only if Ronald McDonald gives him time off from his regular job as a McSlave at McDoonald's.",
 u"You're a fucking joke.",
 u'Magic Number, Magic Underpants = No Big Deal',
 u"YOU HAVE A BEAUTIFUL BODY. What's wrong with a man looking at a woman with a beautiful body ESPECIALLY NICE HIPS? If I'm stalker for that then every dude I know is a stalker as well.",
 u'So, now we know what you are by your own identification, fool, will you kindly moveon.org and stop cluttering up the place and so we intelligent posters can post?']

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
# The lines below load the TweetTokenizer from the nltk library.
# You can uncomment them, along with the tokenizer argument of
# the CountVectorizer, should you like to see the results with
# a different tokeniser.
# from nltk.tokenize import TweetTokenizer
# word_tokenizer = TweetTokenizer(preserve_case=False, 
#                                 strip_handles=True, 
#                                 reduce_len=True).tokenize

count_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                   stop_words=None,
                                   # tokenizer=word_tokenizer,
                                   # Ignore terms that have a document frequency strictly higher than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   max_df=1.0,
                                   # Ignore terms that have a document frequency strictly lower than the given threshold.
                                   # If float, the parameter represents a proportion of documents, integer absolute counts.
                                   min_df=1)
count_vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [18]:
# Fits the vectoriser and transforms the corpus into its bag-of-words representation.
X_train_count = count_vectorizer.fit_transform(input_train)

During the fitting process each term is assigned a unique integer index corresponding to a column in the resulting `X_train_count` matrix (i.e. the equivalent of the $X$ matrix mentioned in the description of this section of the tutorial).

In [19]:
print X_train_count.shape
print X_train_count.toarray()

(3157, 14315)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


### From Occurrences To Frequencies
Occurrence count is a good start. However, longer documents will have higher average count values than shorter documents, even though they might talk about similar topics. To avoid these potential discrepancies, it suffices to divide the number of occurrences of each word in a document by the total number of words in the document. The number of times a term occurs in a document, divided by the number of terms in a document is called the **term frequency** (**tf**).
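
As a small worked example, the term frequencies of one document can be computed by hand. The counts below are the second row of the `toy_count` matrix printed earlier (the comment "I love learning Machine Learning"); the variable names are chosen for illustration.

```python
# Raw counts for 'I love learning Machine Learning' (second row of toy_count).
counts = [0, 0, 2, 0, 1, 1, 0, 0, 0, 0]

# Divide each count by the total number of counted terms in the document.
total = float(sum(counts))
tf = [count / total for count in counts]
print(tf)
```

The term frequencies of a document always sum to 1, which makes long and short documents comparable.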

Another refinement on top of term frequency is to downscale the weights of words that occur in many documents in the corpus, and are therefore less informative than those that occur only in a smaller portion of the corpus. In order to achieve this we can weight terms on the basis of the **inverse document frequency** (**idf**). The *document frequency* is the number of documents a given word occurs in; the inverse document frequency is often defined as the logarithm of the total number of documents in the corpus divided by the document frequency.

Combining tf and idf results in a *family of weightings* (tf is usually multiplied by idf, but there are a few different variations of how idf is computed) known as **term frequency-inverse document frequency** (**tf–idf**).
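
To make one member of this family concrete, the sketch below computes tf–idf by hand on the toy count matrix. It assumes the idf variant $\ln(N / \text{df}) + 1$, which is the one `scikit-learn` documents for `smooth_idf=False`, followed by an L1 normalisation of each row; other variants would give different numbers. The variable names are chosen for illustration.

```python
import math

# Count matrix of the toy corpus, as printed earlier by toy_count.toarray().
counts = [[0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
          [0, 0, 2, 0, 1, 1, 0, 0, 0, 0],
          [1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
          [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[t] > 0) for t in range(n_terms)]
# Assumed idf variant: natural log of (N / df), plus one.
idf = [math.log(n_docs / float(d)) + 1.0 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = sum(weighted)  # L1 norm; every weight is non-negative here
    tfidf.append([round(v / norm, 8) for v in weighted])

print(tfidf[1])
```

Note how "love", which occurs in a single document, ends up with almost the same weight as "learning", even though "learning" occurs twice in that comment.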

Both tf and tf–idf on our `toy_corpus` can be computed using `scikit-learn` as follows:

In [20]:
toy_corpus

[u'You have studied Machine Learning',
 u'I love learning Machine Learning',
 u'Looking forward to #WAISAwayDay',
 u'Have you studied Machine Learning']

In [21]:
from sklearn.feature_extraction.text import TfidfTransformer

# Computing tf using the counts that have been computed from the CountVectorizer.
tf_transformer = TfidfTransformer(use_idf=False, norm='l1', smooth_idf=False)
X_train_tf = tf_transformer.fit_transform(toy_count)

print X_train_tf.shape
print X_train_tf.toarray()

(4, 10)
[[0.   0.2  0.2  0.   0.   0.2  0.2  0.   0.   0.2 ]
 [0.   0.   0.5  0.   0.25 0.25 0.   0.   0.   0.  ]
 [0.25 0.   0.   0.25 0.   0.   0.   0.25 0.25 0.  ]
 [0.   0.2  0.2  0.   0.   0.2  0.2  0.   0.   0.2 ]]


In [22]:
toy_count.toarray()

array([[0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 2, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
       [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]])

In [23]:
# Computing tf-idf using the counts that have been computed from the CountVectorizer.
tfidf_transformer = TfidfTransformer(use_idf=True, norm='l1', smooth_idf=False)
X_train_tfidf = tfidf_transformer.fit_transform(toy_count)

print X_train_tfidf.shape
print X_train_tfidf.toarray()

(4, 10)
[[0.         0.22118748 0.16821878 0.         0.         0.16821878
  0.22118748 0.         0.         0.22118748]
 [0.         0.         0.41210174 0.         0.38184739 0.20605087
  0.         0.         0.         0.        ]
 [0.25       0.         0.         0.25       0.         0.
  0.         0.25       0.25       0.        ]
 [0.         0.22118748 0.16821878 0.         0.         0.16821878
  0.22118748 0.         0.         0.22118748]]


Rather than transforming the raw counts with the `TfidfTransformer`, it is alternatively possible to use the `TfidfVectorizer` to directly parse the dataset. We compute tf-idf scores on the Kaggle dataset as follows:

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer(ngram_range=(1, 1),
                             stop_words='english',
                             # tokenizer=word_tokenizer,
                             # Ignore terms that have a document frequency strictly higher than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             max_df=0.75,
                             # Ignore terms that have a document frequency strictly lower than the given threshold.
                             # If float, the parameter represents a proportion of documents, integer absolute counts.
                             min_df=5,
                             # tf-idf parameters
                             use_idf=True, norm='l1', smooth_idf=False)

X_train_tfidf = tfidf_vect.fit_transform(input_train)

print X_train_tfidf.shape
print X_train_tfidf.toarray()

(3157, 1853)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Try filtering out terms that are either too frequent or too infrequent in the dataset by adjusting the `max_df` and `min_df` parameters respectively. This is an easy way of not only filtering out the less informative words but also reducing the number of features (and hence the storage requirements).
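
The filtering rule itself is simple enough to sketch in plain Python. The four tokenised documents and the thresholds below are hypothetical; the sketch follows the rule described in the code comments above, where an integer threshold is an absolute document count and a float is a proportion of documents.

```python
# Hypothetical tokenised corpus of four short documents.
docs = [['you', 'are', 'great'],
        ['you', 'are', 'wrong'],
        ['you', 'were', 'great'],
        ['you', 'are', 'here']]

# Document frequency: the number of documents that contain each term.
df = {}
for tokens in docs:
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

min_df = 2      # integer: absolute number of documents
max_df = 0.75   # float: proportion of documents
max_count = max_df * len(docs)

# Keep a term unless it is strictly rarer than min_df or
# strictly more frequent than max_df.
kept = sorted(term for term, count in df.items()
              if min_df <= count <= max_count)
print(kept)
```

Here "you" is dropped for appearing in every document, while "wrong", "were" and "here" are dropped for appearing in only one.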

# Building a Predictive Model using K-Nearest-Neighbours
Now that we have our training features and the labels of each post, we can train a classifier to predict whether a message is insulting or not. Let's start with a KNN classifier, which provides a simple baseline, although it is perhaps not the best classifier for this task:

In [25]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_train_tfidf, y_train)

To predict the outcome on a new comment, we need to extract its features using almost the same feature-extraction chain as before. The differences are that we call (i) `transform` instead of `fit_transform` on the transformer or vectoriser, and (ii) `predict` instead of `fit` on the classifier, since both have already been fit to the training set.
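
Why `transform` and not `fit_transform`? Fitting fixes the vocabulary, and therefore the meaning of each column; new comments must be mapped onto those same columns. The class below is a hypothetical, stripped-down stand-in for a count vectoriser, written only to illustrate this; it is not `scikit-learn`'s implementation.

```python
class TinyCountVectoriser(object):
    """Hypothetical, stripped-down stand-in for a count vectoriser."""

    def fit(self, docs):
        # Learn the vocabulary (word -> column index) from the training data.
        words = sorted(set(w for doc in docs for w in doc.lower().split()))
        self.vocabulary_ = dict((w, i) for i, w in enumerate(words))
        return self

    def transform(self, docs):
        # Count words using the vocabulary learnt during fit.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:  # unseen words are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

vectoriser = TinyCountVectoriser().fit(['you are great', 'you are wrong'])
print(vectoriser.transform(['you are a moron']))
```

Words that were never seen during fitting ("a" and "moron" in this toy case) have no column and are silently ignored, so the new vector has the same length as the training vectors.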

You can test your own comments by changing the text in the `test_comment` variable. Does your classifier identify all the insults properly?

In [26]:
test_comment = 'you are a moron'
test_comment_tfidf = tfidf_vect.transform([test_comment])
y_pred = knn_clf.predict(test_comment_tfidf)

print '%s: %s' % (test_comment, class_names[y_pred[0]])

you are a moron: insult


## Evaluating the Performance on the Test Set
We will be evaluating the performance of our KNN classifier on the *unseen* data of the test set based on the accuracy metric. In a binary classification task, such as ours, the accuracy with which a model predicts a specific class $c$ (e.g. insults) is formally defined as:

\begin{align}
\frac{\sum \text{TP} + \sum \text{TN}}{\sum \text{TP} + \sum \text{FP} + \sum \text{TN} + \sum \text{FN}}
\end{align}
where:
* $\text{TP}$ refers to True Positive predictions: both the predicted and the empirical labels are $c$
* $\text{TN}$ refers to True Negative predictions: both the predicted and the empirical labels are $\neq c$
* $\text{FP}$ refers to False Positive predictions: the predicted label is $c$ but the empirical label $\neq c$
* $\text{FN}$ refers to False Negative predictions: the predicted label is $\neq c$ but the empirical label is $c$
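
The definition can be checked on a handful of made-up predictions. In the sketch below the label lists are hypothetical and `c = 0` plays the role of the class of interest.

```python
# Hypothetical true and predicted labels; 0 stands for class c (insult).
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_hat = [0, 1, 1, 1, 0, 0, 1, 1]
c = 0

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == c and p == c)
tn = sum(1 for t, p in pairs if t != c and p != c)
fp = sum(1 for t, p in pairs if t != c and p == c)
fn = sum(1 for t, p in pairs if t == c and p != c)

accuracy = (tp + tn) / float(tp + tn + fp + fn)
print('TP=%d TN=%d FP=%d FN=%d accuracy=%.2f' % (tp, tn, fp, fn, accuracy))
```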

In [27]:
from sklearn import metrics

X_test_tfidf = tfidf_vect.transform(input_test)
y_pred = knn_clf.predict(X_test_tfidf)
print 'Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred))


Accuracy: 0.77


### Building a Pipeline
In order to make our pipeline (i.e. vectoriser or transformer $\rightarrow$ classifier) easier to work with, `scikit-learn` provides the `Pipeline` class that behaves like a compound classifier.

In [28]:
from sklearn.pipeline import Pipeline
clf_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 1),
                                               stop_words='english',
                                               # tokenizer=word_tokenizer,
                                               # Ignore terms that have a document frequency strictly higher than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               max_df=1.0,
                                               # Ignore terms that have a document frequency strictly lower than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               min_df=5,
                                               # tf-idf parameters
                                               use_idf=True, norm='l1', smooth_idf=False)),
                         ('clf', KNeighborsClassifier(n_neighbors=3))])

The names `tfidf` and `clf` (classifier) are arbitrary. We shall see their use in the section on grid search, below. We can now train (on the training set) and test (on the test set) the model in a similar fashion to when we had all the different components separate.

In [29]:
# Model Training
clf_pipeline.fit(input_train, y_train)
y_pred = clf_pipeline.predict(input_test) # We are making predictions on the test set.
print 'Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.77


Let's see if we can do better with a linear Support Vector Machine (SVM). We can change the learner by just plugging a different classifier object into our pipeline as follows:

In [30]:
from sklearn.linear_model import SGDClassifier
clf_pipeline = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 1),
                                               stop_words='english',
                                               # tokenizer=word_tokenizer,
                                               # Ignore terms that have a document frequency strictly higher than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               max_df=0.75,
                                               # Ignore terms that have a document frequency strictly lower than the given threshold.
                                               # If float, the parameter represents a proportion of documents, integer absolute counts.
                                               min_df=3,
                                               # tf-idf parameters
                                               use_idf=True, norm='l2', smooth_idf=False)),
                         ('clf', SGDClassifier(loss='hinge',
                                           penalty='l2',
                                           tol=1e-5,
                                           random_state=random_state))])
# Model Training
clf_pipeline.fit(input_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.75, max_features=None, min_df=3,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=F...om_state=10, shuffle=True, tol=1e-05,
       validation_fraction=0.1, verbose=0, warm_start=False))])

In [31]:
y_pred = clf_pipeline.predict(input_test) # We are making predictions on the test set.
print 'Accuracy: %.2f' % (metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.81


`scikit-learn` further provides utilities for a more detailed performance analysis of the results using different metrics (i.e. precision, recall, and f1-score). Support refers to the number of samples that belong to each particular class.

For further details about those you can have a look [here](https://en.wikipedia.org/wiki/Precision_and_recall).
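
As a quick reference, the three metrics can be computed from the true positive, false positive and false negative counts of a class. The helper `precision_recall_f1` and the counts passed to it are hypothetical and serve only to illustrate the formulas.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: of everything predicted as class c, how much was right?
    precision = tp / float(tp + fp)
    # Recall: of everything that truly is class c, how much was found?
    recall = tp / float(tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for one class.
p, r, f1 = precision_recall_f1(tp=30, fp=20, fn=10)
print('precision=%.2f recall=%.2f f1=%.2f' % (p, r, f1))
```

Unlike accuracy, none of these three metrics rewards true negatives, which makes them more informative for the minority class of an imbalanced dataset.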

In [32]:
print(metrics.classification_report(y_test, 
                                    y_pred,
                                    target_names=class_names))

              precision    recall  f1-score   support

      insult       0.60      0.62      0.61       186
     neutral       0.88      0.87      0.88       604

   micro avg       0.81      0.81      0.81       790
   macro avg       0.74      0.75      0.74       790
weighted avg       0.82      0.81      0.82       790



As we did before, we can test how well our classifier is doing by inputting our own comments.

In [33]:
test_comment = 'You are such a moron'
y_pred = clf_pipeline.predict([test_comment])

print '%s: %s' % (test_comment, class_names[y_pred[0]])

You are such a moron: insult


Try experimenting with different parameters (e.g. `ngram_range`, `tokenizer`, `max_df` or `min_df`) to see whether you achieve any better accuracy.

# Parameter Tuning using Grid Search
We have already encountered some parameters such as `use_idf` in the `TfidfTransformer` (and `TfidfVectorizer`). Classifiers tend to have many parameters as well. For example, `KNeighborsClassifier` includes a parameter for the number of neighbours, and `SGDClassifier` has a regularisation parameter `alpha` as well as configurable loss and penalty terms in the objective function.

Instead of tweaking the parameters of the various components of the chain, it is possible to run an exhaustive search of the best parameters on a grid of possible values. Let's use this to explore whether we can make the KNeighborsClassifier perform as well as our linear SVM.

In [34]:
knn_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', KNeighborsClassifier(n_neighbors=3))])

from sklearn.model_selection import GridSearchCV
parameters = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
              'tfidf__use_idf': (True, False),
              'clf__n_neighbors': (1, 3, 5, 7, 9)}

Obviously, such an exhaustive search can be expensive: the grid above contains $4 \times 2 \times 5 = 40$ parameter combinations. If we have multiple CPU cores at our disposal, we can tell the grid searcher to try these combinations in parallel with the `n_jobs` parameter. If we give this parameter a value of -1, grid search will detect how many cores are installed and use them all.
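
The size of a grid is the product of the number of values tried for each parameter, which is easy to verify by enumerating the combinations. The sketch below repeats the grid so that it is self-contained, and assumes 3-fold cross-validation when counting model fits.

```python
from itertools import product

# Same grid as above, written out so the block is self-contained.
parameters = {'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
              'tfidf__use_idf': (True, False),
              'clf__n_neighbors': (1, 3, 5, 7, 9)}

# Enumerate every combination of one value per parameter.
names = sorted(parameters)
combinations = [dict(zip(names, values))
                for values in product(*(parameters[name] for name in names))]

cv_folds = 3  # assumed number of cross-validation folds
print('%d combinations, %d cross-validation fits'
      % (len(combinations), len(combinations) * cv_folds))
```

Adding a single extra value to any parameter multiplies the cost, so it pays to keep each list short.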

In [35]:
knn_pipeline = GridSearchCV(knn_pipeline, param_grid=parameters, cv=3, n_jobs=-1)

The grid search instance behaves like a normal `scikit-learn` model.

In [36]:
knn_pipeline.fit(input_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smooth_idf=T...ki',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform'))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'tfidf__use_idf': (True, False), 'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)], 'clf__n_neighbors': (1, 3, 5, 7, 9)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

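We can inspect the scores of every parameter combination through the object's `cv_results_` attribute, which is a dictionary of arrays describing each candidate. To get the best-scoring parameters and the corresponding score, we use the `best_params_` and `best_score_` attributes as follows: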

In [37]:
for param_name in knn_pipeline.best_params_:
    print("%s: %r" % (param_name, knn_pipeline.best_params_[param_name]))
print 'The best achieved accuracy is %.2f' % knn_pipeline.best_score_

tfidf__use_idf: True
tfidf__ngram_range: (1, 1)
clf__n_neighbors: 9
The best achieved accuracy is 0.81


Let's test and see how well new comments are classified on our `knn_pipeline`.

In [38]:
test_comment = 'I do not agree!'
y_pred = knn_pipeline.predict([test_comment])

print '%s: %s' % (test_comment, class_names[y_pred[0]])

I do not agree!: neutral


# Exploring K-Means Clustering
Now that we have extracted features from our training documents, we're in a position to experiment with clustering. We will use K-Means as it's one of the most intuitive clustering methods, although it does have a few limitations.

K-Means clustering with 5 clusters can be achieved as follows:

In [39]:
from sklearn.cluster import KMeans
num_clusters = 5
k_means = KMeans(num_clusters)
k_means.fit(X_train_tfidf)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

The assignment of the original posts to cluster IDs is given by `k_means.labels_` once `k_means.fit(...)` has been called. The centroids of the clusters are given by `k_means.cluster_centers_`. Intuitively, the vector that describes the centre of a cluster is just like any other feature vector. An interesting way to explore what each cluster represents is to calculate and print the top weighted (either by occurrence or tf-idf) terms for that cluster:

In [40]:
order_centroids = k_means.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vect.get_feature_names()
for i in range(num_clusters):
    print "Cluster %d:" % i,
    for ind in order_centroids[i, :5]:
        print ' %s' % terms[ind],
    print  # End the line once all the top terms of the cluster are printed.

Cluster 0:  sick  fuck  zero  forgive  followers
Cluster 1:  xa0  like  just  don  fucking
Cluster 2:  fuck  google  fag  nget  dad
Cluster 3:  shit  holy  nice  did  zero
Cluster 4:  idiot  wrong  need  fucking  don


A number of different metrics exist that allow us to measure how well the clusters fit the known distribution of the underlying classes. One such metric is homogeneity, which measures how pure the clusters are with respect to the known labels (e.g. insult or neutral).

In [41]:
print "Homogeneity: %0.3f" % metrics.homogeneity_score(y_train, k_means.labels_)

Homogeneity: 0.012


Homogeneity scores vary between 0 and 1; a score of 1 indicates that the clusters match the original label distribution exactly.
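
For intuition, homogeneity can be written in terms of entropies as $1 - H(C|K) / H(C)$, where $C$ are the known classes and $K$ the cluster assignments. The function below is a plain-Python sketch of that formula with hypothetical inputs; it is meant to illustrate the definition, not to replace `metrics.homogeneity_score`.

```python
import math
from collections import Counter

def homogeneity(labels_true, labels_pred):
    n = float(len(labels_true))
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # Entropy of the classes, H(C).
    h_c = -sum((count / n) * math.log(count / n)
               for count in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous

    # Conditional entropy of the classes given the clusters, H(C|K).
    h_c_given_k = -sum((count / n) * math.log(count / float(cluster_counts[k]))
                       for (c, k), count in joint_counts.items())
    return 1.0 - h_c_given_k / h_c

print(homogeneity([0, 0, 1, 1], [0, 0, 1, 1]))  # every cluster is pure
print(homogeneity([0, 0, 1, 1], [0, 1, 0, 1]))  # every cluster mixes both classes
```

A score close to 0, such as the one obtained above, therefore means that knowing the cluster of a comment tells us almost nothing about whether it is an insult.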

Explore what happens if you make the number of clusters larger. What do you notice? Do the clusters begin to make more intuitive sense?

# Next Steps

If you have enjoyed this tutorial, the exercises in [Working With Text Data](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) are a good next step.