## A bit of feedback on the last assignment

The analysis of the songs might take a bit too long. The question now is how to speed things up.

One way to get results more quickly is to make better use of your multi-core processor. By default, a Python program runs on a single processor core, as the program below demonstrates.

Because the calls are executed sequentially, sleeping 40 times for one second each takes about 40 seconds in total.

In [86]:
from uuid import uuid4
from time import sleep


def long_running_function(word_to_find):
    # me = uuid4()
    # print(f'Started {me}')
    sleep(1)
    # print(f'Done with {me}')
    

def run():
    for idx in range(40):
        long_running_function('')

%timeit run()

40.1 s ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


To use all cores of your machine in parallel, you can use the `multiprocessing` module to create a pool of worker processes, one per core.

In [88]:
from multiprocessing import Pool, cpu_count


pool = Pool(processes=cpu_count())
%timeit r = pool.map(long_running_function, range(40))


6.01 s ± 735 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
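
The 6 s measured above is a little more than the 5 s that 40 one-second tasks spread evenly over 8 cores would suggest. A plausible explanation is that `pool.map` hands out work in chunks rather than task by task. The sketch below is a rough model, not a measurement: it assumes an 8-core machine (the core count is not stated here) and the default chunk size rule found in CPython's `multiprocessing.pool`, which is an implementation detail rather than a documented guarantee. `default_chunksize` and `estimate_map_seconds` are hypothetical helper names.

```python
import heapq


def default_chunksize(n_tasks, n_workers):
    # Assumed rule: aim for roughly four chunks per worker, rounding up.
    chunksize, extra = divmod(n_tasks, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


def estimate_map_seconds(n_tasks, n_workers, task_seconds=1.0, chunksize=None):
    """Estimate the wall time of pool.map for equally long tasks."""
    if n_tasks == 0:
        return 0.0
    if chunksize is None:
        chunksize = default_chunksize(n_tasks, n_workers)

    # Split the tasks into chunks; the last chunk may be shorter.
    chunk_lengths = [min(chunksize, n_tasks - start)
                     for start in range(0, n_tasks, chunksize)]

    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for length in chunk_lengths:
        start = heapq.heappop(free_at)  # the next free worker takes the chunk
        heapq.heappush(free_at, start + length * task_seconds)
    return max(free_at)


print(estimate_map_seconds(40, 8))               # default chunking
print(estimate_map_seconds(40, 8, chunksize=1))  # one task per chunk
```

Under these assumptions the model gives 6 s with default chunking (20 chunks of two tasks, handled in three rounds) and 5 s with one task per chunk, so passing `chunksize=1` to `pool.map` may be worth trying when the tasks are few and long.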


In [3]:
%matplotlib notebook
import matplotlib.pyplot as plt

# Making Sense of Data ...


For this notebook you need the `scikit-learn` module (http://scikit-learn.org/stable/index.html) installed. It is part of a normal Anaconda installation.

# Feature Spaces

## One-dimensional Feature Space

In [76]:
%matplotlib notebook
import numpy as np
from sklearn.datasets import make_blobs


centers = [[60], [120], [170], [240]]
data, _ = make_blobs(n_samples=400, n_features=1, cluster_std=4, centers=centers)

data_1d = np.rint(data).astype(np.uint8)
plt.xlim(0, 255)
plt.ylim(-0.3, 0.3)

y = np.zeros(np.shape(data))
plt.plot(data_1d , y, 'b|', ms=50)
plt.axis('off')
plt.show()


[[240.83432089]
 [235.34666441]
 [241.99251085]
 [237.87530887]
 [173.01114543]
 ...

## Two-dimensional Feature Space

In [61]:
%matplotlib notebook
from sklearn.datasets import make_blobs

centers = [[2, 1], [0, 0], [1, -1]]
data_2d, _ = make_blobs(n_samples=2500, centers=centers, 
                        cluster_std=0.37)

plt.plot(data_2d[:,0], data_2d[:,1], 'b.')

plt.title('Two-dimensional sample data')
plt.show()


## Three-dimensional Feature Space

In [75]:
%matplotlib notebook
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import make_blobs


centers = [[2, 1, 0], [0, -1, -1], [1, -1, 3]]
data_3d, _ = make_blobs(n_samples=2500, centers=centers, cluster_std=0.37)
x, y, z = data_3d[:,0], data_3d[:,1], data_3d[:,2]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z, linewidth=0.2)

plt.show()


# How do we find clusters?

Humans are quite good at quickly spotting clusters in the data visualizations above. The question is how to make a machine find them.

The code below follows the scikit-learn mean shift example: http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html#sphx-glr-auto-examples-cluster-plot-mean-shift-py

In [58]:
from sklearn.cluster import MeanShift, estimate_bandwidth


def mean_shift(data, n_samples=1000):
    bandwidth = estimate_bandwidth(data, quantile=0.2, 
                                   n_samples=n_samples)

    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(data)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_

    labels_unique = np.unique(labels)
    n_clusters = len(labels_unique)

    print('Number of estimated clusters : {}'.format(n_clusters))
    
    return labels, cluster_centers, n_clusters

In [59]:
%matplotlib notebook
from itertools import cycle


labels, cluster_centers, n_clusters = mean_shift(data_1d)

plt.cla()
plt.xlim(0, 255)
plt.ylim(-0.3, 0.3)

colors = cycle('bgrcmy')
for k, col in zip(range(n_clusters), colors):
    my_members = (labels == k)
    cluster_center = cluster_centers[k]
    
    x = data_1d[my_members, 0]
    y = np.zeros(np.shape(x))

    plt.plot(x , y, col + '|', ms=50)
    plt.plot(cluster_center[0] , 0, 'k|', ms=70)

plt.show()

Number of estimated clusters : 4



In [62]:
%matplotlib notebook
from itertools import cycle


labels, cluster_centers, n_clusters = mean_shift(data_2d)

fig = plt.figure()
ax = fig.add_subplot(111)

colors = cycle('bgrcmy')
for k, col in zip(range(n_clusters), colors):
    my_members = (labels == k)
    cluster_center = cluster_centers[k]
    
    x, y = data_2d[my_members,0], data_2d[my_members,1]
    ax.scatter(x, y, c=col,  linewidth=0.2)
    ax.scatter(cluster_center[0], cluster_center[1], c='k', s=50, linewidth=0.2)
    
plt.title('Estimated number of clusters: {}'.format(n_clusters))
plt.show()

Number of estimated clusters : 3



In [25]:
%matplotlib notebook
from itertools import cycle


labels, cluster_centers, n_clusters = mean_shift(data_3d)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

colors = cycle('bgrcmy')
for k, col in zip(range(n_clusters), colors):
    my_members = (labels == k)
    cluster_center = cluster_centers[k]
    
    x, y, z = data_3d[my_members,0], data_3d[my_members,1], data_3d[my_members,2]
    ax.scatter(x, y, z, c=col,  linewidth=0.2, alpha=0.1)
    ax.scatter(cluster_center[0], cluster_center[1], cluster_center[2], s=150, c='k')
    
plt.title('Estimated number of clusters: {}'.format(n_clusters))
plt.show()

Number of estimated clusters : 3



In [81]:
import pandas as pd


filename = './iris_data.csv'
df = pd.read_csv(filename)
df

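| Unnamed: 0 | Sepal length | Sepal width | Petal length | Petal width | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | I. setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | I. setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | I. setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | I. setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | I. setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | I. setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | I. setosa |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | I. setosa |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | I. setosa |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | I. setosa |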


In [28]:
labels = np.unique(df['Species'])

fig = plt.figure()
ax = fig.add_subplot(111)

colors = cycle('bgrcmy')
for label, col in zip(labels, colors):
    print(label, col)
    x = df[df['Species'] == label]['Sepal length']
    y = df[df['Species'] == label]['Sepal width']

    ax.scatter(x, y, c=col,  linewidth=0.2)
    
plt.title('Sepal length vs. width')
plt.show()


I. setosa b
I. versicolor g
I. virginica r


In [84]:
data_2d = df[['Sepal length', 'Sepal width']].to_numpy()
labels, cluster_centers, n_clusters = mean_shift(data_2d)

fig = plt.figure()
ax = fig.add_subplot(111)

colors = cycle('bgrcmy')
for k, col in zip(range(n_clusters), colors):
    my_members = (labels == k)
    cluster_center = cluster_centers[k]
    
    x, y = data_2d[my_members,0], data_2d[my_members,1]
    ax.scatter(x, y, c=col,  linewidth=0.2)
    ax.scatter(cluster_center[0], cluster_center[1], c='k', s=50, linewidth=0.2)
    
plt.title('Estimated number of clusters: {}'.format(n_clusters))
plt.show()

Number of estimated clusters : 3



# How does this work?

  * http://stackoverflow.com/a/17912660
  * http://www.chioka.in/meanshift-algorithm-for-the-rest-of-us-python/
  * https://en.wikipedia.org/wiki/Mean_shift
  * https://github.com/mattnedrich/MeanShift_py

```bash
git clone https://github.com/mattnedrich/MeanShift_py.git
```

The following is the Stack Overflow answer http://stackoverflow.com/a/17912660, which explains the mean shift algorithm. It is copied here for convenience and readability.

### The image data is converted into feature space
![](https://i.stack.imgur.com/80Uaa.jpg)
In case of a gray-scale image, all you have are intensity values, so feature space will only be one-dimensional. (You might compute some texture features, for instance, and then your feature space would be two dimensional – and you would be segmenting based on intensity and texture)
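
To make the conversion into feature space concrete, here is a small sketch using NumPy. The 3x4 image is made up, and the horizontal gradient is only a crude stand-in for a texture feature; both are assumptions for illustration and not part of the original answer.

```python
import numpy as np

# Hypothetical 3x4 gray-scale image; every entry is one pixel intensity.
gray = np.array([[ 10,  12, 200, 205],
                 [ 11, 198, 202,  90],
                 [ 95,  92,  88,  13]], dtype=np.uint8)
height, width = gray.shape

# Intensity only: one row per pixel, one column per feature.
features_1d = gray.reshape(-1, 1).astype(float)

# Assumed texture feature: magnitude of the horizontal intensity gradient.
texture = np.abs(np.gradient(gray.astype(float), axis=1))

# Intensity plus texture: two columns, so the feature space is two-dimensional.
features_2d = np.column_stack([gray.ravel(), texture.ravel()])

# Per-pixel results (for example cluster labels) map back to image
# coordinates by undoing the reshape.
restored = features_1d[:, 0].reshape(height, width)

print(features_1d.shape, features_2d.shape)
```

Each row of the feature array is one pixel, which is the layout that clustering routines such as `MeanShift.fit` expect.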

### Search windows are distributed over the feature space 
![](https://i.stack.imgur.com/52HFX.jpg)
The number of windows, window size, and initial locations are arbitrary for this example – something that can be fine-tuned depending on specific applications

Mean-Shift iterations:

### 1.) The MEANs of the data samples within each window are computed 
![](https://i.stack.imgur.com/JaOo5.jpg)

### 2.) The windows are SHIFTed to the locations equal to their previously computed means 
![](https://i.stack.imgur.com/7Nk75.jpg)

### Steps 1.) and 2.) are repeated until convergence, i.e. all windows have settled on final locations 
![](https://i.stack.imgur.com/5127k.jpg)

### The windows that end up on the same locations are merged 
![](https://i.stack.imgur.com/SryA8.jpg)

### The data is clustered according to the window traversals 
![](https://i.stack.imgur.com/A871k.jpg)

That is, all data that was traversed by windows that ended up at, say, location "2", will form a cluster associated with that location.

So, this segmentation will (coincidentally) produce three groups. Choosing different window sizes and initial locations might produce different results.
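
The steps above can be sketched in a few lines of plain Python for a one-dimensional feature space. This is a minimal illustration under stated assumptions, not the implementation used by scikit-learn: the window is a simple interval of half-width `window_radius`, one window is started on every sample, and each sample is labeled by its nearest final window location instead of by tracking window traversals. The name `mean_shift_1d` and the sample values are hypothetical.

```python
def mean_shift_1d(points, window_radius, tol=1e-6, max_iter=100):
    """Flat-window mean shift on a list of numbers (illustrative sketch)."""
    # Assumption: one search window is started on every sample.
    windows = [float(p) for p in points]

    for _ in range(max_iter):
        shifted = []
        for center in windows:
            # Step 1: MEAN of the samples that fall inside the window.
            inside = [p for p in points if abs(p - center) <= window_radius]
            # Step 2: SHIFT the window to that mean.
            shifted.append(sum(inside) / len(inside) if inside else center)
        largest_move = max(abs(new - old) for new, old in zip(shifted, windows))
        windows = shifted
        if largest_move < tol:  # converged: no window moved noticeably
            break

    # Merge windows that settled on (nearly) the same location.
    modes = []
    for center in sorted(windows):
        if not modes or center - modes[-1] > window_radius:
            modes.append(center)

    # Simplification: label each sample by its nearest final location.
    labels = [min(range(len(modes)), key=lambda k: abs(p - modes[k]))
              for p in points]
    return modes, labels


samples = [58, 60, 61, 62, 118, 120, 121, 123, 168, 170, 172, 238, 240, 241]
modes, labels = mean_shift_1d(samples, window_radius=15)
print(modes)
print(labels)
```

With a radius of 15 the windows settle on four locations, one per group of samples. With a radius of 200 every window covers all samples from the start, so all windows collapse onto the overall mean and only a single cluster remains, which illustrates how much the result depends on the window size.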

# What else can I use this for?

For example, for image segmentation.

In [63]:
%matplotlib notebook
import os
import cv2
import webget


# url = 'https://github.com/opencv/opencv/raw/master/samples/data/rubberwhale2.png'
url = 'https://github.com/opencv/opencv/raw/master/samples/data/baboon.jpg'
# url = 'https://github.com/mattnedrich/MeanShift_py/raw/master/sample_images/mean_shift_image.jpg'
webget.download(url)

img = cv2.imread(os.path.basename(url))
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
height, width = img.shape[:2]

plt.imshow(img, interpolation='none')

Downloading file to ./baboon.jpg



<matplotlib.image.AxesImage at 0x1c29cdeeb8>

In [64]:
%matplotlib notebook


lab_image = cv2.cvtColor(img, cv2.COLOR_RGB2Lab)
img = cv2.medianBlur(lab_image, 5)
    
img_lst = img.reshape((img.shape[0] * img.shape[1], 3))
img_lst_orig = np.copy(img_lst)

labels, cluster_centers, n_clusters = mean_shift(img_lst)

label_img = labels.reshape(height, width)
for l in range(n_clusters):
    img[label_img == l] = cluster_centers[l]

rgb_segments = cv2.cvtColor(img, cv2.COLOR_Lab2RGB)
plt.imshow(rgb_segments, interpolation='none')

Number of estimated clusters : 3



<matplotlib.image.AxesImage at 0x1c2e5172e8>

In [66]:
%matplotlib notebook
from itertools import cycle


labels, cluster_centers, n_clusters = mean_shift(img_lst_orig)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

colors = cycle('cmybgr')
for k, col in zip(range(n_clusters), colors):
    my_members = (labels == k)
    cluster_center = cluster_centers[k]
    
    x, y, z = img_lst_orig[my_members,0], img_lst_orig[my_members,1], img_lst_orig[my_members,2]
    ax.scatter(x, y, z, c=col,  linewidth=0.2, alpha=0.1)
    ax.scatter(cluster_center[0], cluster_center[1], cluster_center[2], s=150, c='k')
    
plt.title('Estimated number of clusters: {}'.format(n_clusters))
plt.show()

Number of estimated clusters : 3



In [45]:
print(cluster_centers)

[[128.94711591 120.63186471 140.11071404]
 [177.45388819 121.67841037 123.62975733]
 [144.70102024 183.51332999 170.7958187 ]]
