# C2 : Classification

In [1]:
from sklearn.neighbors import KNeighborsClassifier

In [2]:
classifier = KNeighborsClassifier(n_neighbors=1)

In [5]:
from sklearn.model_selection import KFold

In [106]:
data[0]

array([ 15.26 ,  14.84 ,   0.871,   5.763,   3.312,   2.221,   5.22 ,
           nan])

In [107]:
import os

In [117]:
import numpy as np

def load_dataset(path, dataset_name):
    '''
    data, labels = load_dataset(path, dataset_name)

    Load a tab-separated dataset whose last column is the label

    Returns
    -------
    data : numpy ndarray of float
    labels : numpy ndarray of str
    '''
    data = []
    labels = []
    with open(os.path.join(path, dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels

In [118]:
features, labels = load_dataset("./BuildingMachineLearningSystemsWithPython"
                     "/ch02/data/", "seeds.tsv")

In [119]:
print(features.shape)
print(labels.shape)

(210, 7)
(210,)


In [120]:
labels

array(['Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama',
       'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Kama', 'Rosa', 'Rosa',
       'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa',
       'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa',
       'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa',
       'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa',
       'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa', 'Rosa',
      

In [57]:
feature_names = [
    'area',
    'perimeter',
    'compactness',
    'length of kernel',
    'width of kernel',
    'asymmetry coefficient',
    'length of kernel groove',
]

- If you studied physics (and remember your lessons), you may have noticed that we have been summing lengths, areas, and dimensionless quantities, mixing up our units (something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far it lies from the mean, in units of standard deviation: subtract the feature's mean, then divide by its standard deviation. In scikit-learn, StandardScaler performs this operation.
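
- A minimal sketch of the z-score computation, to make the operation concrete. The `toy` matrix is made up for illustration (it is not the seeds data), and the comparison with StandardScaler assumes its default settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up toy matrix: 4 samples, 2 features on very different scales.
toy = np.array([[15.0, 0.80],
                [14.0, 0.90],
                [17.0, 0.85],
                [18.0, 0.85]])

# z-score by hand: subtract each column's mean, divide by its standard deviation.
z_manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# StandardScaler with default settings should produce the same values.
z_scaler = StandardScaler().fit_transform(toy)

print(np.allclose(z_manual, z_scaler))
print(z_manual)
```

- After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance used by the nearest neighbor classifier.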

In [58]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [103]:
train_features.dtype

dtype('float64')

In [104]:
train_labels.dtype

dtype('float64')

In [121]:
classifier = KNeighborsClassifier(n_neighbors=1)

kf = KFold(n_splits=5, shuffle=True)
means = []
for train_index, test_index in kf.split(features):
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]
    
    classifier.fit(train_features, train_labels)
    prediction = classifier.predict(test_features)

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction ==  test_labels)
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))

Mean accuracy: 89.5%


- The Pipeline constructor takes a list of pairs (str,clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. Advanced usage of the object uses these names to refer to different steps. 

In [122]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

classifier = KNeighborsClassifier(n_neighbors=1)
classifier = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

kf = KFold(n_splits=5, shuffle=True)
means = []
for train_index, test_index in kf.split(features):
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]
    
    classifier.fit(train_features, train_labels)
    prediction = classifier.predict(test_features)

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction ==  test_labels)
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))

Mean accuracy: 93.3%
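
- The step names given to the Pipeline constructor can be used to reach individual steps later. A small sketch of that "advanced usage", built on the same two steps as above; the variable name `pipe` is only illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('norm', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=1))])

# Look up a step by the name given in the constructor.
print(pipe.named_steps['knn'])

# Change a parameter of one step with the '<step name>__<parameter>' syntax.
pipe.set_params(knn__n_neighbors=3)
print(pipe.named_steps['knn'].n_neighbors)
```

- The same double-underscore names are what a parameter search would use to tune a step inside the pipeline.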


## Binary and multiclass classification

- It is often simpler to define a binary method than one that works on multiclass problems. However, we can reduce any multiclass problem to a series of binary decisions. This is what we did earlier with the Iris dataset, in a haphazard way: we observed that it was easy to separate one of the initial classes and focused on the other two, reducing the problem to two binary decisions.

- **The simplest reduction is to use a series of one-versus-the-rest classifiers.**
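
- A hedged sketch of the one-versus-the-rest idea using scikit-learn's OneVsRestClassifier. The choice of LogisticRegression as the underlying binary classifier and the use of the bundled Iris data are assumptions made for illustration; they are not taken from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# One binary classifier is trained per class:
# "this class" versus "everything else".
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_iris, y_iris)

print(len(ovr.estimators_))    # number of fitted binary classifiers
print(ovr.predict(X_iris[:3]))
```

- At prediction time, the class whose binary classifier is most confident wins.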

# Clustering - Finding Related Posts

- Without labels, we cannot learn a classification model. Still, we can look for patterns within the data itself; that is, let the data describe itself. This is what we will do in this chapter, where we consider the challenge of a question and answer website.

- The naïve approach is to take the post, calculate its similarity to all other posts, and display the top n most similar posts as links on the page. This quickly becomes very costly. Instead, we need a method that quickly finds all the related posts.

## Measuring the relatedness of posts

#### How not to do it

- One text similarity measure is the Levenshtein distance, also known as edit distance. Say we have two words, "machine" and "mchiene". The similarity between them can be expressed as the minimum number of edits needed to turn one word into the other. In this case, the edit distance is 2, as we have to add an "a" after the "m" and delete the first "e". This algorithm is, however, quite costly: its running time is bounded by the length of the first word times the length of the second word.

- For example, comparing 'abcd' with 'abcde' costs on the order of 4 * 5 = 20 steps, i.e. O(4 * 5).
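
- A minimal dynamic-programming sketch of the edit distance, to show where the "length times length" cost comes from. The function name `levenshtein` is just a label for this example.

```python
def levenshtein(a, b):
    """Edit distance; the nested loops do len(a) * len(b) cell updates."""
    previous = list(range(len(b) + 1))         # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("machine", "mchiene"))  # 2
print(levenshtein("abcd", "abcde"))       # 1
```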

- But even if it were fast enough, there is another problem. In the earlier post, the word "format" accounts for an edit distance of 2, due to deleting it first, then adding it back. So our distance does not seem robust enough to take word reordering into account.

#### How to do it

- More robust than edit distance is the so-called bag-of-words approach. It ignores the order of words entirely and uses word counts as its basis. For each word in the post, its occurrences are counted and noted in a vector. Not surprisingly, this step is also called vectorization.

- The columns Occurrences in post 1 and Occurrences in post 2 can now be treated as simple vectors. We can simply calculate the Euclidean distance between the vectors of all posts and take the nearest one (too slow, as we have found out earlier).
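
- A plain-Python sketch of the bag-of-words idea, using the two example titles that appear later in these notes. Splitting on whitespace is a simplification assumed here; the real tokenization is done by CountVectorizer.

```python
from collections import Counter
import math

post1 = "How to format my hard disk"
post2 = "Hard disk format problems"

# Count lowercased words; word order is discarded.
counts1 = Counter(post1.lower().split())
counts2 = Counter(post2.lower().split())

# One vector position per word in the combined vocabulary.
vocabulary = sorted(set(counts1) | set(counts2))
vec1 = [counts1[w] for w in vocabulary]
vec2 = [counts2[w] for w in vocabulary]

# Euclidean distance between the two count vectors.
distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec1, vec2)))
print(vocabulary)
print(vec1, vec2, distance)
```

- The two vectors differ in four positions ("how", "my", "problems", "to"), so the distance is the square root of 4, which is 2.0. Reordering the words would not change it.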

- Extract salient features from each post and store them as a vector per post.
- Then compute a clustering on the vectors.
- Determine the cluster for the post in question.
- From this cluster, fetch a handful of posts with varying similarity to the post in question. This will increase diversity.
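
- The four steps above can be sketched end to end. This is only an illustration under stated assumptions: the posts in `toy_posts` are made up, and KMeans with two clusters is an assumed choice of clustering algorithm, since these notes do not name one at this point.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Made-up posts: three about disks, three about machine learning.
toy_posts = [
    "hard disk format problems",
    "how to format my hard disk",
    "hard disk drive format error",
    "machine learning model training",
    "training a machine learning classifier",
    "machine learning model evaluation",
]
query = "format hard disk"

# 1. One count vector per post.
vec = CountVectorizer()
X_toy = vec.fit_transform(toy_posts)

# 2. Cluster the vectors (KMeans is an assumption, see above).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X_toy)

# 3. Determine the cluster of the post in question.
query_label = km.predict(vec.transform([query]))[0]

# 4. Candidate related posts are the other members of that cluster.
related = [i for i, label in enumerate(km.labels_) if label == query_label]
print(related)
```

- Only the posts in `related` need a detailed similarity comparison, which is what makes this cheaper than comparing against every post.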

## Preprocessing - similarity measured as a similar number of common words

### Converting raw text into a bag of words

In [123]:
from sklearn.feature_extraction.text import CountVectorizer

In [124]:
vectorizer = CountVectorizer(min_df=1)

- The min_df parameter (minimum document frequency) determines how CountVectorizer treats rare words. If it is set to an integer, all words occurring in fewer documents than that value will be dropped. The max_df parameter works in a similar manner for very frequent words.
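
- A small sketch of the effect of min_df on two short documents. The names `docs`, `keep_all`, and `keep_shared` are illustrative; the fitted vocabulary is read from the `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["How to format my hard disk", "Hard disk format problems"]

# Keep every word that appears in at least one document.
keep_all = CountVectorizer(min_df=1).fit(docs)
# Keep only words that appear in at least two documents.
keep_shared = CountVectorizer(min_df=2).fit(docs)

print(sorted(keep_all.vocabulary_))
print(sorted(keep_shared.vocabulary_))
```

- With min_df=2, only "disk", "format", and "hard" survive, because they are the only words found in both documents.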

In [125]:
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


- We see that, as expected, the counting is done at word level (analyzer='word') and words are determined by the regular expression pattern token_pattern. It will, for example, tokenize "cross-validated" into "cross" and "validated".
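
- To check the tokenization directly, the vectorizer's analyzer can be called on a string. A minimal sketch with default settings; the name `analyze` is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that turns raw text into tokens.
analyze = CountVectorizer().build_analyzer()

print(analyze("cross-validated"))             # ['cross', 'validated']
print(analyze("How to format my hard disk"))
```

- The default pattern only keeps tokens of two or more word characters, so single letters such as "a" are dropped.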

In [130]:
content = ["How to format my hard disk", "Hard disk format problems "]

In [131]:
X = vectorizer.fit_transform(content)

In [132]:
vectorizer.get_feature_names()

['disk', 'format', 'hard', 'how', 'my', 'problems', 'to']

In [133]:
print(X.toarray().transpose())

[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]


- This means that the first sentence contains all the words except "problems", while the second contains all but "how", "my", and "to". In fact, these are exactly the same columns as we have seen in the preceding table. From X, we can extract a feature vector that we will use to compare two documents with each other. 

### Counting words

In [160]:
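DIR = "./BuildingMachineLearningSystemsWithPython/ch03/data/toy"
posts = []
# Sort the file names so the post order is reproducible, and close each file.
for fname in sorted(os.listdir(DIR)):
    with open(os.path.join(DIR, fname)) as f:
        posts.append(f.read())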

In [161]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

In [162]:
X_train = vectorizer.fit_transform(posts)
num_samples, num_features = X_train.shape
print("#samples: {}, #features: {}".format(num_samples, num_features))

#samples: 5, #features: 25


In [163]:
print(vectorizer.get_feature_names())

['about', 'actually', 'capabilities', 'contains', 'data', 'databases', 'images', 'imaging', 'interesting', 'is', 'it', 'learning', 'machine', 'most', 'much', 'not', 'permanently', 'post', 'provide', 'save', 'storage', 'store', 'stuff', 'this', 'toy']


In [164]:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

- Note that the count vectors returned by the transform method are sparse. That is, each vector does not store one count value for each word, as most of those counts will be zero (the post does not contain the word). Instead, it uses a more memory-efficient SciPy sparse matrix that stores only the nonzero entries (the book names coo_matrix, for "COOrdinate"; recent scikit-learn versions return a CSR matrix instead).

In [143]:
print(new_post_vec)

  (0, 5)	1
  (0, 7)	1


In [144]:
print(new_post_vec.toarray())

[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


- We need the full array if we want to use it as a vector for similarity calculations. For the (naïve) similarity measurement, we calculate the Euclidean distance between the count vectors of the new post and all the old posts:


In [150]:
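import scipy as sp
import scipy.linalg  # make sure the linalg submodule is loaded

def dist_raw(v1, v2):
    # Euclidean distance between two sparse row vectors
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())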

- The norm() function calculates the Euclidean norm (shortest distance). This is just one obvious first pick and there are many more interesting ways to calculate the distance. 

- http://ojs.pythonpapers.org/index.php/tppsc/article/view/135/144

In [151]:
best_dist = float('inf')
best_i = None
for i, post in enumerate(posts):
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i

print("Best post is %i with dist=%.2f" % (best_i, best_dist))

=== Post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=1.73: Imaging databases provide storage capabilities.
=== Post 2 with dist=2.00: Most imaging databases save images permanently.

=== Post 3 with dist=1.41: Imaging databases store data.
=== Post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41


In [152]:
print(X_train.getrow(3).toarray())
print(X_train.getrow(4).toarray())

[[0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
[[0 0 0 0 3 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0]]
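
- The two rows printed above explain the raw distances: post 4 is post 3 repeated three times, so every count is tripled and the vector moves away from the new post. A small NumPy sketch, restricted to the four words that occur in these posts (data, databases, imaging, store); the variable names are illustrative.

```python
import numpy as np

# Counts for (data, databases, imaging, store), taken from the rows above.
post3_counts = np.array([1, 1, 1, 1])
post4_counts = np.array([3, 3, 3, 3])
query_counts = np.array([0, 1, 1, 0])   # "imaging databases"

d3 = np.linalg.norm(post3_counts - query_counts)   # sqrt(1 + 0 + 0 + 1)
d4 = np.linalg.norm(post4_counts - query_counts)   # sqrt(9 + 4 + 4 + 9)
print("%.2f %.2f" % (d3, d4))                      # 1.41 5.10
```

- These match the distances reported for posts 3 and 4, even though both posts contain exactly the same words.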


### Normalizing word count vectors

In [153]:
def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
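
- Dividing each vector by its length keeps only its direction, so repeating a post's text no longer changes its distance to other posts. A NumPy sketch with the count vectors of posts 3 and 4, restricted to the words (data, databases, imaging, store); the helper name `unit` is illustrative.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

post3_counts = np.array([1, 1, 1, 1])
post4_counts = np.array([3, 3, 3, 3])   # post 3 repeated three times
query_counts = np.array([0, 1, 1, 0])   # "imaging databases"

d3 = np.linalg.norm(unit(post3_counts) - unit(query_counts))
d4 = np.linalg.norm(unit(post4_counts) - unit(query_counts))
print("%.2f %.2f" % (d3, d4))           # 0.77 0.77
```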

In [154]:
best_dist = float('inf')
best_i = None
for i, post in enumerate(posts):
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i

print("Best post is %i with dist=%.2f" % (best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.92: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77


### Removing less important words

- Words such as "most" appear very often in all sorts of different contexts. They do not carry as much information and thus should not be weighed as much as words such as "images", which do not occur often in different contexts. The best option is to remove all the words that are so frequent that they do not help to distinguish between different texts. These words are called stop words.

In [169]:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

- If you have a clear picture of what kind of stop words you want to remove, you can also pass a list of them. Setting stop_words to 'english' will use a set of 318 English stop words. To find out which ones, you can use get_stop_words():


In [170]:
sorted(vectorizer.get_stop_words())[0:20]

['a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amoungst']
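
- Passing your own list works the same way. A minimal sketch in which the three-word list is a hypothetical choice made only for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["How to format my hard disk", "Hard disk format problems"]

# Hypothetical custom list: drop only these three words.
custom = CountVectorizer(stop_words=["how", "to", "my"]).fit(docs)

print(sorted(custom.vocabulary_))
print(sorted(custom.get_stop_words()))
```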

In [171]:
best_dist = float('inf')
best_i = None
for i, post in enumerate(posts):
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i

print("Best post is %i with dist=%.2f" % (best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.86: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77


### Stemming

- Natural Language Toolkit (NLTK)

### Installing and using NLTK 
- http://nltk.org/install.html

In [172]:
import nltk

In [173]:
import nltk.stem

In [174]:
s = nltk.stem.SnowballStemmer('english')
s.stem("graphics")


'graphic'

In [175]:
s.stem('imaging')

'imag'

In [176]:
s.stem('image')

'imag'

In [177]:
s.stem('imagination')

'imagin'

In [178]:
s.stem('imagine')


'imagin'

In [179]:
s.stem('buys')
s.stem('buying')

'buy'

In [180]:
s.stem('bought')

'bought'

### Extending the vectorizer with NLTK's stemmer

- The preprocessor and tokenizer can be set as parameters in the constructor. We do not want to place the stemmer into either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the build_analyzer method:


In [181]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the parent's analyzer (lowercasing, tokenizing, stop words),
        # then stem every token it produces.
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [182]:
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')

- The first step is lowercasing the raw post in the preprocessing step (done in the parent class).
- The second step is extracting all individual words in the tokenization step (done in the parent class).
- The final step converts each word into its stemmed version.
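
- The three steps can be observed without NLTK by plugging a stand-in stemmer into the same build_analyzer override. In this sketch `toy_stem` and `ToyStemmedCountVectorizer` are hypothetical names, and the suffix stripping is a crude substitute for the Snowball stemmer, not an equivalent.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

class ToyStemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Steps 1 and 2 (lowercasing, tokenizing, stop-word removal) come
        # from the parent class; step 3 applies the stemmer to each token.
        analyzer = super().build_analyzer()
        return lambda doc: [toy_stem(w) for w in analyzer(doc)]

analyze = ToyStemmedCountVectorizer(stop_words='english').build_analyzer()
print(analyze("Most imaging databases save images"))
```

- "Most" is removed as a stop word, and "imaging" and "images" end up as the same token, which is the effect the stemmed vectorizer relies on.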

In [185]:
X_train = vectorizer.fit_transform(posts)
print(vectorizer.get_feature_names())
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])

['actual', 'capabl', 'contain', 'data', 'databas', 'imag', 'interest', 'learn', 'machin', 'perman', 'post', 'provid', 'save', 'storag', 'store', 'stuff', 'toy']


In [186]:
best_dist = float('inf')
best_i = None
for i, post in enumerate(posts):
    if post == new_post:
        continue
    post_vec = X_train.getrow(i)
    d = dist_norm(post_vec, new_post_vec)
    print("=== Post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i

print("Best post is %i with dist=%.2f" % (best_i, best_dist))

=== Post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
=== Post 1 with dist=0.86: Imaging databases provide storage capabilities.
=== Post 2 with dist=0.63: Most imaging databases save images permanently.

=== Post 3 with dist=0.77: Imaging databases store data.
=== Post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 2 with dist=0.63


## Stop words on steroids
