# Notes

Hello, here I take notes while reading the book: <br>
Richert, W., & Coelho, L.P. (2013). **Building Machine Learning Systems with Python**. Birmingham: Packt Publishing.

Obs: Some of the examples are from the book and others are from my own interpretation of a model.

# Supervised Learning

## 1. Regression

### 1.1. Simple Linear Regression

## 2. Classification

### 2.1. A Brief Introduction of a Nearest Neighbor Classifier
A new sample is classified by calculating the distance to the nearest training case; the label of that point then determines the classification of the sample.<br>
If we consider that each sample is represented by its features (in mathematical terms, as a point in N-dimensional space), we can compute the distance between samples. <br>
Euclidean distance = $\sqrt{\sum_{i=1}^{n} (p_i-q_i)^2}$ <br>
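
As a small worked example of the formula above, the sketch below computes the distance between two made-up points in 4-dimensional space. The values of `p` and `q` are illustrative only and are not taken from the book.

```python
import numpy as np

# Two hypothetical samples, each described by four features.
p = np.array([5.1, 3.5, 1.4, 0.2])
q = np.array([6.1, 3.5, 1.4, 0.2])

# Square the per-feature differences, sum them, and take the square root.
euclidean = np.sqrt(np.sum((p - q) ** 2))
print(euclidean)

# np.linalg.norm of the difference vector computes the same quantity.
print(np.linalg.norm(p - q))
```

The two points differ only in the first feature, so the distance is simply the absolute difference in that feature.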

### The dataset (Iris dataset)
Overall, the dataset includes 150 samples, 50 in each of three classes. <br>Attribute information, features: **sepal length, sepal width, petal length, petal width**, and classes: **Iris-Setosa, Iris-Versicolour, Iris-Virginica**.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from math import sqrt

Loads the data with load_iris from sklearn

In [2]:
iris_data = load_iris()

In [3]:
features = iris_data['data']
target = iris_data['target']
target_names = iris_data['target_names']

Converts the data into a data frame for a better understanding

In [4]:
target = target.reshape(150,1)
data = np.hstack([features, target])

In [5]:
df = pd.DataFrame(data=data, columns=['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width','Species'])

In [6]:
df.describe()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667,1.0
std,0.828066,0.433594,1.76442,0.763161,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


### Modeling
Calculates the Euclidean distance between two points in an N-dimensional space.

In [7]:
def distance(p, q):
    'Computes the Euclidean distance between p and q'
    return sqrt(np.sum((p-q)**2))

Now when classifying, we adopt a simple rule: given a new sample, we look at
the dataset for the point that is closest to it (its nearest neighbor) and look at its label:

In [8]:
def nn_classefy(training_set, training_labels, new_sample):
    # Distance from the new sample to every training point
    dists = np.array([distance(t, new_sample) for t in training_set])
    # Index of the closest training point
    nearest = dists.argmin()
    return target_names[training_labels[nearest]]

In [9]:
'Predicts the class of a given sample'
new_sample = np.array([3,0,2,0.4])
label = nn_classefy(features, target, new_sample).take(0)
print('The given sample better fit in the %s class' %(label))

The given sample better fit in the setosa class


Now, note that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples have exactly the same features but different labels, which can happen). Therefore, it is essential to test using a cross-validation protocol.
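
One possible cross-validation protocol is leave-one-out: each sample is held out in turn and classified using all the others. The sketch below is my own illustration rather than the book's code; the helper name `nearest_label` is hypothetical, and it reloads the Iris data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

def nearest_label(train_X, train_y, sample):
    # Label of the training point closest to `sample` (Euclidean distance).
    dists = np.sqrt(((train_X - sample) ** 2).sum(axis=1))
    return train_y[dists.argmin()]

# Leave-one-out: hold out each sample in turn and classify it
# using all of the remaining samples.
correct = 0
for i in range(len(X)):
    mask = np.ones(len(X), dtype=bool)
    mask[i] = False
    correct += int(nearest_label(X[mask], y[mask], X[i]) == y[i])

loo_accuracy = correct / len(X)
print("Leave-one-out accuracy: %.3f" % loo_accuracy)
```

Because the held-out sample is never part of the training set, it can no longer be its own nearest neighbor, so this estimate is more honest than the training accuracy.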

**PS:** As you may have noticed, we did not take into account the units of the features. This can be a problem, because we may be summing up and mixing different kinds of quantities, such as lengths, areas, and dimensionless quantities (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this simple pair of operations: <br>
- subtract the mean for each feature:
        features -= features.mean(axis=0)
- divide each feature by its standard deviation:
        features /= features.std(axis=0)

Independent of what the original values were, after Z-scoring, a value of zero is the mean and positive values are above the mean and negative values are below it. The nearest neighbor classifier is simple, but sometimes good enough.
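
To check the claim about Z-scores on something small, here is a sketch with a made-up feature matrix whose two columns live on very different scales. The numbers are hypothetical.

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
# measured on very different scales.
features = np.array([[1.0, 200.0],
                     [2.0, 400.0],
                     [3.0, 600.0]])

# Subtract the per-column mean, then divide by the per-column standard deviation.
z = (features - features.mean(axis=0)) / features.std(axis=0)

print(z)
print(z.mean(axis=0))  # approximately 0 for every column
print(z.std(axis=0))   # approximately 1 for every column
```

After Z-scoring, both columns contain the same values, so neither feature dominates the Euclidean distance just because of its unit.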

### 2.2 k-NN Classifier
The k-NN classifier extends the idea previously discussed in **2.1 A Brief Introduction of a Nearest Neighbor Classifier** by considering not just the closest point but the k closest points. All k neighbors vote to select the label. k is typically a small number and odd to break ties, such as 3 or 5, but can be larger, particularly if the dataset is very large. Larger k values help reduce the effects of noisy points within the training data set, and the choice of k is often performed through cross-validation.
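
A minimal sketch of the voting idea, assuming Euclidean distance and a simple majority vote. The function name `knn_classify` and the toy data are my own and do not come from the book.

```python
from collections import Counter
import numpy as np

def knn_classify(training_set, training_labels, new_sample, k=3):
    # Distances from the new sample to every training point.
    dists = np.sqrt(((training_set - new_sample) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those k neighbors.
    votes = Counter(training_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated groups in 2-D.
train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

print(knn_classify(train, labels, np.array([0.3, 0.3]), k=3))
print(knn_classify(train, labels, np.array([4.8, 5.0]), k=3))
```

With k=1 this reduces to the nearest neighbor classifier from section 2.1.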

### Starting with k-nearest neighbor (k-NN) algorithm

### 2.3 Logistic Regression

# Unsupervised Learning

## 1. Clustering

### 1.1. Finding Related Posts with a Naive Approach (Bag-of-words)
The bag-of-words approach uses simple word counts as its basis. For each word in the post, its occurrence is counted and noted in a vector. Not surprisingly, this step is also called vectorization. The vector is typically huge as it contains as many elements as the words that occur in the whole dataset. <br>
So let us pick a random post, for which we will then create the count vector. We will then compare its distance to all the count vectors and fetch the post with the smallest one.
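
Before turning to the real posts, the vectorization step can be sketched in plain Python on a tiny made-up corpus. This is only an illustration of the idea; splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]

# The vocabulary is every distinct word across the whole corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)
for doc in docs:
    print(count_vector(doc), doc)
```

Every document gets a vector of the same length (the vocabulary size), which is what makes distances between posts computable.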

### The dataset
Let us play with the dataset consisting of the following posts:

In [10]:
posts = {
    0: "I can only imagine how difficult this is for you.",
    1: "Can you imagine that?",
    2: "I can't imagine what he was thinking to hide a thing like that from you.",
    3: "Imagine that you personally had to create everything you wanted to use.",
    4: "He could imagine her horror when she discovered what he planned.",
    5: "Then imagine if you shared your Digital Echo with a billion other people on the planet.",
    6: "He cannot imagine how very, very happy he will be when he can tell us his thoughts, and we can tell him how we have loved him so long.", 
    7: "I imagine it would taste mighty good.",
    8: "I can just imagine what a funny figure that policeman cut!",
    9: "The winter's better here than Europe, I imagine, he said with a smile.",
    10: "Can you imagine a world without poverty?",
    11: "I couldn't imagine you'd take that long for a dog walk."
}

In this post dataset, we want to find the most similar post for the short given post "Imagine all the people in the world smiling?"

### Converting text to vectors

Extending the vectorizer with NLTK's stemmer. We need to stem the posts before we feed them into **TfidfVectorizer**. Notice we could use just **CountVectorizer**; however, plain counts do not weigh a term's frequency within a post against the number of posts it appears in, so terms that occur in many posts would not be discounted. <br>
In other words, we want a high value for a given term in a given post if that term occurs often in that particular post and very rarely anywhere else. <br>
The resulting document vectors will not contain counts any more. Instead, they will contain the individual TF-IDF values per term. <br>
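
To make the idea concrete, here is a sketch of the plain textbook definition (term frequency times the logarithm of the inverse document frequency) on a toy corpus. The function `tfidf` and the corpus are illustrative. scikit-learn's **TfidfVectorizer** uses a smoothed IDF and normalizes each vector by default, so its numbers will differ from these.

```python
import math

def tfidf(term, doc, corpus):
    # Term frequency: share of the document's tokens that are `term`.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms found in few documents score higher.
    # Assumes `term` occurs in at least one document of the corpus.
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Toy corpus of already-tokenized documents.
a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
corpus = [a, abb, abc]

print(tfidf("a", a, corpus))    # "a" occurs in every document
print(tfidf("b", abb, corpus))
print(tfidf("c", abc, corpus))  # "c" occurs in only one document
```

The term "a" carries no information because it appears everywhere, while "c" scores highest even though it occurs only once in its document.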

In [11]:
'Equivalent to CountVectorizer followed by TfidfTransformer'
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

Our current text preprocessing phase includes the following steps:
1. Lower casing the raw post in the preprocessing step (done in the parent class).
2. Extracting all individual words in the tokenization step (done in the parent class).
3. Converting each word into its stemmed version.
4. Throwing away words that occur way too often to be of any help in detecting relevant posts.
5. Throwing away words that occur so seldom that there is only a small chance that they occur in future posts.
6. Counting the remaining words.
7. Calculating TF-IDF values from the counts, considering the whole text corpus.
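
Steps 4 and 5 amount to filtering the vocabulary by document frequency, which in the vectorizer is controlled by the `max_df` and `min_df` parameters. The sketch below shows that filtering in plain Python under my own simplifying assumptions; the function name `filter_by_document_frequency` and the sample documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    # Count in how many documents each word appears at least once.
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    # Keep words seen in at least `min_df` documents (an absolute count)
    # and in at most a `max_df` fraction of the documents.
    return sorted(w for w, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df)

# Made-up, already stemmed documents.
docs = [["imagin", "world"], ["imagin", "peopl"],
        ["imagin", "world", "smile"], ["imagin", "dog"]]

print(filter_by_document_frequency(docs, min_df=2, max_df=0.9))
```

Here "imagin" is dropped because it occurs in every document, and the words that occur in a single document are dropped as too rare.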

In [12]:
vectorizer = StemmedTfidfVectorizer(min_df=1, 
                                    stop_words='english', 
                                    decode_error='ignore')
X_train = vectorizer.fit_transform(posts.values())
num_samples, num_features = X_train.shape
print("samples: %d, features: %d" % (num_samples, num_features))

samples: 12, features: 42


This means we have 12 posts with a total of 42 different words. The following words that have been tokenized will be counted:

In [13]:
print(vectorizer.get_feature_names())

['better', 'billion', 'couldn', 'creat', 'cut', 'difficult', 'digit', 'discov', 'dog', 'echo', 'europ', 'figur', 'funni', 'good', 'happi', 'hide', 'horror', 'imagin', 'just', 'like', 'long', 'love', 'mighti', 'peopl', 'person', 'plan', 'planet', 'policeman', 'poverti', 'said', 'share', 'smile', 'tast', 'tell', 'thing', 'think', 'thought', 'use', 'walk', 'want', 'winter', 'world']


Picks a random new post to find related posts

In [14]:
new_post = "Imagine all the people in the world smiling?"
new_post_vec = vectorizer.transform([new_post])
print("Coordinate matrix")
print(new_post_vec)
print()
print("Full array")
print(new_post_vec.toarray())

Coordinate matrix
  (0, 41)	0.5660249087784507
  (0, 31)	0.5660249087784507
  (0, 23)	0.5660249087784507
  (0, 17)	0.1970974579415979

Full array
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.19709746
  0.         0.         0.         0.         0.         0.56602491
  0.         0.         0.         0.         0.         0.
  0.         0.56602491 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.56602491]]


Notice that the count vectors returned by the **transform** method are sparse. That is, each vector does not store one count value for each word, as most of those counts would be zero (the post does not contain the word).

Obviously, using only the counts of the raw words is too simple. We will have to normalize them to get vectors of unit length.

In [15]:
'calculates the Euclidean distance between the count vectors of the new post and all the old posts'
import scipy as sp
import scipy.linalg  # makes sure the sp.linalg submodule is loaded

def dist_norm(v1, v2):
    v1_normalized = v1/sp.linalg.norm(v1.toarray())
    v2_normalized = v2/sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
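
As a quick illustration of why the vectors are scaled to unit length first, the sketch below uses dense NumPy arrays instead of sparse matrices. The name `dist_norm_dense` and the two vectors are made up for this example.

```python
import numpy as np

def dist_norm_dense(v1, v2):
    # Scale both vectors to unit length, then take the Euclidean distance.
    v1_unit = v1 / np.linalg.norm(v1)
    v2_unit = v2 / np.linalg.norm(v2)
    return np.linalg.norm(v1_unit - v2_unit)

# Hypothetical count vectors: `doubled` is the same text repeated twice.
short = np.array([1.0, 1.0, 0.0])
doubled = 2 * short

raw = np.linalg.norm(short - doubled)
normalized = dist_norm_dense(short, doubled)
print(raw)         # raw distance is non-zero
print(normalized)  # normalized distance is approximately 0
```

A post and the same post repeated twice point in the same direction, so after normalization they are treated as identical.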

The **norm( )** function calculates the Euclidean norm (shortest distance). With **dist_norm**, we just need to iterate over all the posts and remember the nearest one:

In [16]:
import sys

best_doc = None
best_dist = sys.maxsize
best_i = None

for i in range(num_samples):
    post = posts[i]
    post_vec = X_train.getrow(i)
    d = dist_norm(new_post_vec, post_vec)
    print("Post %i with dist %.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("\nBest post is %i with dist %.2f"%(best_i, best_dist))

Post 0 with dist 1.37: I can only imagine how difficult this is for you.
Post 1 with dist 1.27: Can you imagine that?
Post 2 with dist 1.39: I can't imagine what he was thinking to hide a thing like that from you.
Post 3 with dist 1.39: Imagine that you personally had to create everything you wanted to use.
Post 4 with dist 1.39: He could imagine her horror when she discovered what he planned.
Post 5 with dist 1.22: Then imagine if you shared your Digital Echo with a billion other people on the planet.
Post 6 with dist 1.40: He cannot imagine how very, very happy he will be when he can tell us his thoughts, and we can tell him how we have loved him so long.
Post 7 with dist 1.39: I imagine it would taste mighty good.
Post 8 with dist 1.39: I can just imagine what a funny figure that policeman cut!
Post 9 with dist 1.20: The winter's better here than Europe, I imagine, he said with a smile.
Post 10 with dist 1.06: Can you imagine a world without poverty?
Post 11 with dist 1.39: I couldn't imagine you'd take that long for a dog walk.

Best post is 10 with dist 1.06

With this process, we are able to convert a bunch of noisy text into a concise representation of feature values.
But, as simple and as powerful as the bag-of-words approach with its extensions is, it has some drawbacks that we should be aware of. They are as follows:
- It does not cover word relations. With the previous vectorization approach, the text "Car hits wall" and "Wall hits car" will both have the same feature vector.
- It totally fails with misspelled words. Although it is clear to the readers that "database" and "databas" convey the same meaning, our approach will treat them as totally different words.
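
Two of these drawbacks can be reproduced with a very small sketch. The helper `bag_of_words` below is a hypothetical stand-in for the vectorizer that only lower-cases and splits on whitespace.

```python
from collections import Counter

def bag_of_words(text):
    # Lower-case and split on whitespace; word order is discarded.
    return Counter(text.lower().split())

# Same words in a different order give the same representation.
print(bag_of_words("Car hits wall") == bag_of_words("Wall hits car"))

# A misspelled word is treated as an entirely different word.
print(bag_of_words("database") == bag_of_words("databas"))
```

The first comparison is true and the second is false, matching the two limitations described above.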

### 1.2. KMeans
KMeans is the most widely used flat clustering algorithm. After it is initialized with the desired number of clusters, num_clusters, it maintains that number of so-called cluster centroids. Initially, it picks any num_clusters posts and sets the centroids to their feature vectors. Then it goes through all other posts and assigns each of them to its nearest centroid as its current cluster. Next, it moves each centroid to the middle of all the vectors assigned to that cluster. This, of course, changes the cluster assignment: some posts are now nearer to another centroid, so it updates the assignments for those posts. This is repeated as long as the centroids move a considerable amount. After some iterations, the movements fall below a threshold and we consider the clustering to have converged.
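
The loop described above can be sketched in NumPy as follows. This is a simplified illustration under my own assumptions (random initial points, a fixed seed, empty clusters left in place), not scikit-learn's implementation; the function name `kmeans` and the toy points are made up.

```python
import numpy as np

def kmeans(points, num_clusters, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Start from `num_clusters` distinct points picked at random.
    start = rng.choice(len(points), size=num_clusters, replace=False)
    centroids = points[start].copy()
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (a centroid with no points is left where it is).
        new_centroids = centroids.copy()
        for c in range(num_clusters):
            members = points[labels == c]
            if len(members) > 0:
                new_centroids[c] = members.mean(axis=0)
        # Stop once the centroids barely move any more.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Toy data (made up): two tight groups in 2-D.
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                   [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(points, num_clusters=2)
print(labels)
print(centroids)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.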

### The dataset (20newsgroup)
One standard dataset in machine learning is the 20newsgroup dataset, which contains 18,826 posts from 20 different newsgroups. Among the groups' topics are technical ones such as comp.sys.mac.hardware or sci.crypt as well as more politics- and religion-related ones such as talk.politics.guns or soc.religion.christian.

In [17]:
from sklearn.datasets import fetch_20newsgroups

groups = ['comp.graphics', 'comp.os.ms-windows.misc', 
          'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 
          'comp.windows.x', 'sci.space']

train_data = fetch_20newsgroups(subset='train', categories=groups)

### Clustering posts

In [18]:
vectorizer = StemmedTfidfVectorizer(min_df=10, 
                                    max_df=0.5, 
                                    stop_words='english',
                                    decode_error='ignore')
vectorized = vectorizer.fit_transform(train_data.data)
num_samples, num_features = vectorized.shape
print("samples: %d, features: %d" % (num_samples, num_features))

samples: 3529, features: 4712


We now have a pool of **3529** posts and have extracted for each of them a feature vector of **4712** dimensions.

In [38]:
from sklearn.cluster import KMeans

'fix the cluster size to 50'
num_clusters = 50
km = KMeans(n_clusters=num_clusters, init='random', n_init=1,
   verbose=1)
km.fit(vectorized)

Initialization complete
Iteration  0, inertia 5915.679
Iteration  1, inertia 3218.997
Iteration  2, inertia 3180.855
Iteration  3, inertia 3162.343
Iteration  4, inertia 3152.263
Iteration  5, inertia 3145.155
Iteration  6, inertia 3140.431
Iteration  7, inertia 3136.516
Iteration  8, inertia 3133.709
Iteration  9, inertia 3131.616
Iteration 10, inertia 3130.326
Iteration 11, inertia 3129.639
Iteration 12, inertia 3129.083
Iteration 13, inertia 3128.305
Iteration 14, inertia 3127.844
Iteration 15, inertia 3127.494
Iteration 16, inertia 3127.022
Iteration 17, inertia 3126.544
Iteration 18, inertia 3126.386
Iteration 19, inertia 3126.245
Iteration 20, inertia 3126.132
Iteration 21, inertia 3126.067
Iteration 22, inertia 3125.982
Iteration 23, inertia 3125.868
Iteration 24, inertia 3125.730
Iteration 25, inertia 3125.619
Iteration 26, inertia 3125.596
Converged at iteration 26: center shift 0.000000e+00 within tolerance 2.069005e-08


KMeans(algorithm='auto', copy_x=True, init='random', max_iter=300,
    n_clusters=50, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)

For every vectorized post that has been fit, there is a corresponding integer label in **km.labels_**.

In [33]:
print(km.labels_)
print(km.labels_.shape)

[34 37 27 ... 23 13 17]
(3529,)


### Solving our initial challenge

We assign a cluster to a newly arriving post using **km.predict**.

In [57]:
new_post = "Disk drive problems. Hi, I have a problem with my hard disk. After 1 year it is working only sporadically now. I tried to format it, but now it doesn't boot any more. Any ideas? Thanks."
print(new_post)

Disk drive problems. Hi, I have a problem with my hard disk. After 1 year it is working only sporadically now. I tried to format it, but now it doesn't boot any more. Any ideas? Thanks.


We vectorize this post before we predict its label.

In [58]:
new_post_vec = vectorizer.transform([new_post])
new_post_label = km.predict(new_post_vec)[0]

Now that we have the clustering, we do not need to compare **new_post_vec** to all post vectors. Instead, we can focus only on the posts of the same cluster. <br>
The comparison in the bracket results in a Boolean array, and **nonzero** converts that
array into a smaller array containing the indices of the **True** elements.
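
On a toy array the two steps look like this. The label values are made up for the illustration.

```python
import numpy as np

# Hypothetical cluster labels for six posts.
labels = np.array([2, 0, 2, 1, 2, 0])

mask = labels == 2            # Boolean array, one entry per post
indices = mask.nonzero()[0]   # positions where the mask is True

print(mask)
print(indices)
```

**nonzero** returns a tuple with one array per dimension, which is why the code takes element `[0]`.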

In [63]:
similar_indices = (km.labels_==new_post_label).nonzero()[0]
print(similar_indices)
print(len(similar_indices))

[  66   69   71  125  157  201  213  214  225  233  247  308  351  354
  359  370  384  392  395  463  479  531  533  565  580  581  618  676
  689  714  731  779  806  807  905  935  939  944  961  964  976  987
 1005 1114 1228 1242 1246 1266 1286 1313 1316 1388 1389 1427 1431 1481
 1486 1487 1512 1519 1538 1548 1624 1637 1670 1716 1747 1752 1769 1840
 1843 1852 1893 1986 1990 2010 2013 2061 2085 2139 2151 2223 2235 2257
 2270 2277 2306 2347 2351 2400 2412 2414 2436 2447 2463 2475 2493 2512
 2516 2525 2539 2565 2590 2612 2624 2651 2667 2678 2705 2745 2752 2791
 2815 2842 2852 2951 2956 2964 2993 3018 3065 3145 3173 3186 3192 3202
 3214 3219 3225 3285 3289 3296 3309 3321 3437 3450 3458]
137


Using **similar_indices**, we then simply have to build a list of posts together with
their similarity scores (here, Euclidean distances to the new post).

In [66]:
similar = []

for i in similar_indices:
    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
    similar.append((dist, train_data.data[i])) 
    
similar = sorted(similar)
print(len(similar))

137


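We found **137** posts in the cluster of our post. To give the user a quick idea of what kind of similar posts are available, we can now present the most similar post (position 1), one from the middle of the ranking (position 2), and the least similar post in the cluster (position 3).

The following lines show the posts together with their similarity values (Euclidean distances, so smaller means more similar):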

In [150]:
show_at_1 = similar[0]
show_at_2 = similar[len(similar)//2]
show_at_3 = similar[-1]

print('Position', 1)
print('Similarity', show_at_1[0])
print()
print(show_at_1[1])
print('#'*108)
print('Position', 2)
print('Similarity', show_at_2[0])
print()
print(show_at_2[1])
print('#'*108)
print('Position', 3)
print('Similarity', show_at_3[0])
print()
print(show_at_3[1])

Position 1
Similarity 1.0378441731334074

From: Thomas Dachsel <GERTHD@mvs.sas.com>
Subject: BOOT PROBLEM with IDE controller
Nntp-Posting-Host: sdcmvs.mvs.sas.com
Organization: SAS Institute Inc.
Lines: 25

Hi,
I've got a Multi I/O card (IDE controller + serial/parallel
interface) and two floppy drives (5 1/4, 3 1/2) and a
Quantum ProDrive 80AT connected to it.
I was able to format the hard disk, but I could not boot from
it. I can boot from drive A: (which disk drive does not matter)
but if I remove the disk from drive A and press the reset switch,
the LED of drive A: continues to glow, and the hard disk is
not accessed at all.
I guess this must be a problem of either the Multi I/o card
or floppy disk drive settings (jumper configuration?)
Does someone have any hint what could be the reason for it.
Please reply by email to GERTHD@MVS.SAS.COM
Thanks,
Thomas
+-------------------------------------------------------------------+
| Thomas Dachsel                                           

### 1.3 LDA - Latent Dirichlet Allocation (Topic Model)

In [159]:
from gensim import corpora, models, similarities

corpus = corpora.BleiCorpus('ap/ap.dat', 'ap/vocab.txt')

In [236]:
num_topics= 15
model = models.ldamodel.LdaModel(corpus,
                                 num_topics=num_topics,
                                 id2word=corpus.id2word)

In [221]:
topics = [model[c] for c in corpus]
print(topics[2])

[(3, 0.029852249), (26, 0.034497377), (29, 0.01256772), (35, 0.027599448), (36, 0.07910758), (38, 0.21753761), (45, 0.039460044), (51, 0.024642373), (61, 0.01032077), (64, 0.3754707), (80, 0.10773616)]


In [233]:
model.get_document_topics(topics[400])

[(59, 0.50087535)]

In [252]:
sorted(model.get_topic_terms(num_topics-1))

[(0, 0.0040408657),
 (1, 0.0074477717),
 (2, 0.0063218917),
 (4, 0.0048942105),
 (7, 0.0049159424),
 (9, 0.004210106),
 (13, 0.003475071),
 (18, 0.0043182923),
 (143, 0.0035379431),
 (1622, 0.0037720285)]

In [259]:
[model.get_document_topics[topic] for topic in topics if len(model.ge == 3]

TypeError: 'method' object is not subscriptable

In [220]:
model.get_term_topics(0)

[(8, 0.018344613),
 (20, 0.018087087),
 (27, 0.013613409),
 (36, 0.01096429),
 (86, 0.01764328)]

## 2. Association

In [263]:
car_data = {"car": ["a", "b", "a", "c", "b"], "class": [1,2,1,3,2]}

In [335]:
df = pd.DataFrame(car_data)
df

Unnamed: 0,car,class
0,a,1
1,b,2
2,a,1
3,c,3
4,b,2


In [338]:
df.to_dict('list')

{'car': ['a', 'b', 'a', 'c', 'b'], 'class': [1, 2, 1, 3, 2]}

In [353]:
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore
0,Nighthawks,1st,Miller,4,25
1,Nighthawks,1st,Jacobson,24,94
2,Nighthawks,2nd,Ali,31,57
3,Nighthawks,2nd,Milner,2,62
4,Dragoons,1st,Cooze,3,70
5,Dragoons,1st,Jacon,4,25
6,Dragoons,2nd,Ryaner,24,94
7,Dragoons,2nd,Sone,31,57
8,Scouts,1st,Sloan,2,62
9,Scouts,1st,Piger,3,70
10,Scouts,2nd,Riani,2,62
11,Scouts,2nd,Ali,3,70


In [354]:
x = df.groupby('regiment')[['company','name']].apply(lambda x: x.to_dict(orient='list')).to_dict()
print(x)

{'Dragoons': {'company': ['1st', '1st', '2nd', '2nd'], 'name': ['Cooze', 'Jacon', 'Ryaner', 'Sone']}, 'Nighthawks': {'company': ['1st', '1st', '2nd', '2nd'], 'name': ['Miller', 'Jacobson', 'Ali', 'Milner']}, 'Scouts': {'company': ['1st', '1st', '2nd', '2nd'], 'name': ['Sloan', 'Piger', 'Riani', 'Ali']}}


In [355]:
new_df = pd.DataFrame(x)
new_df

Unnamed: 0,Dragoons,Nighthawks,Scouts
company,"[1st, 1st, 2nd, 2nd]","[1st, 1st, 2nd, 2nd]","[1st, 1st, 2nd, 2nd]"
name,"[Cooze, Jacon, Ryaner, Sone]","[Miller, Jacobson, Ali, Milner]","[Sloan, Piger, Riani, Ali]"


In [356]:
new_df.transpose()

Unnamed: 0,company,name
Dragoons,"[1st, 1st, 2nd, 2nd]","[Cooze, Jacon, Ryaner, Sone]"
Nighthawks,"[1st, 1st, 2nd, 2nd]","[Miller, Jacobson, Ali, Milner]"
Scouts,"[1st, 1st, 2nd, 2nd]","[Sloan, Piger, Riani, Ali]"
