In [1]:
import numpy as np

## NumPy broadcasting
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.  
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when
 - they are equal, or  
 - one of them is 1.   

If these conditions are not met, NumPy raises `ValueError: operands could not be broadcast together`, indicating that the arrays have incompatible shapes.   
* Note that missing (leading) dimensions are assumed to have size one.  

For example, if you have a 256x256x3 array of RGB values and want to scale each color channel by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible:  

	Image  (3d array): 256 x 256 x 3  
	Scale  (1d array):             3  => 1 x 1 x 3 => each 1 stretched (copied) 
	Result (3d array): 256 x 256 x 3  
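
The image-scaling case above can be sketched as follows. This is a minimal illustration: the random array only stands in for a real image, and the values in `scale` are hypothetical per-channel factors.

```python
import numpy as np

# Stand-in for a 256 x 256 x 3 RGB image (random values, for illustration only)
rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))

# Hypothetical per-channel scale factors for R, G, B
scale = np.array([0.5, 1.0, 2.0])

# (256, 256, 3) * (3,): scale is treated as (1, 1, 3) and stretched
scaled = image * scale
print(scaled.shape)
```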

When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.   
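
A small sketch of this stretching, using two made-up 1-D arrays `a` and `b`: adding them directly fails because 4 and 3 are neither equal nor 1, while giving `a` a trailing axis of size 1 lets both arrays be stretched.

```python
import numpy as np

a = np.array([0.0, 10.0, 20.0, 30.0])   # shape (4,)
b = np.array([1.0, 2.0, 3.0])           # shape (3,)

# (4,) + (3,): trailing dimensions 4 and 3 are incompatible
try:
    a + b
    failed = False
except ValueError as err:
    failed = True
    print('ValueError:', err)

# a[:, np.newaxis] has shape (4, 1); its size-1 axis is copied across b's 3 values,
# and b, treated as (1, 3), is copied across a's 4 rows
table = a[:, np.newaxis] + b
print(table.shape)
```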
In the first example below, both the A and B arrays have axes with length one that are expanded to a larger size during the broadcast operation; the remaining examples show other compatible shape pairs:  

```
	A      (4d array):  8 x 1 x 6 x 1   
	B      (3d array):      7 x 1 x 5    
	Result (4d array):  8 x 7 x 6 x 5    

	A      (2d array):  5 x 4  
	B      (1d array):      1  
	Result (2d array):  5 x 4  

	A      (2d array):  5 x 4  
	B      (1d array):      4  
	Result (2d array):  5 x 4  

	A      (3d array):  15 x 3 x 5  
	B      (3d array):  15 x 1 x 5  
	Result (3d array):  15 x 3 x 5  

	A      (3d array):  15 x 3 x 5  
	B      (2d array):       3 x 5  
	Result (3d array):  15 x 3 x 5  

	A      (3d array):  15 x 3 x 5  
	B      (2d array):       3 x 1  
	Result (3d array):  15 x 3 x 5  
```
To control which dimension is copied, give that dimension size 1 (for example with `np.newaxis` or `reshape`).

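The shape pairs listed above can be checked without doing any arithmetic. The sketch below uses `np.broadcast` on empty arrays; the `cases` list is simply the table retyped as tuples.

```python
import numpy as np

# (shape of A, shape of B, expected result shape), copied from the table above
cases = [
    ((8, 1, 6, 1), (7, 1, 5), (8, 7, 6, 5)),
    ((5, 4), (1,), (5, 4)),
    ((5, 4), (4,), (5, 4)),
    ((15, 3, 5), (15, 1, 5), (15, 3, 5)),
    ((15, 3, 5), (3, 5), (15, 3, 5)),
    ((15, 3, 5), (3, 1), (15, 3, 5)),
]

results = []
for shape_a, shape_b, expected in cases:
    # np.broadcast reports the broadcast shape of its arguments
    result = np.broadcast(np.empty(shape_a), np.empty(shape_b)).shape
    results.append(result)
    print(shape_a, shape_b, '->', result)
```
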
In [22]:
# 2 data x 3 features
np.random.seed(1)
arr1 = np.random.randint(1,5, size=(2,3))
print("points: ", arr1, sep='\n')

# their center: 1 data x 3 features
centroid = arr1.mean(axis=0)
print("center: ", centroid, sep='\n')

# distance from each point (x1, x2, x3) to the center (c1, c2, c3):
# sqrt((x1-c1)**2 + (x2-c2)**2 + (x3-c3)**2)
# arr1 is (2,3) and centroid is (3,), so centroid is broadcast across both rows
square_diff = np.power(arr1 - centroid, 2)
print("square_diff : ", square_diff, sep='\n')
distance = np.sqrt(np.sum(square_diff, axis=1))
print('distance: ', distance, sep='\n')
# same result in a single call
np.linalg.norm(arr1-centroid, axis=1)

points: 
[[2 4 1]
 [1 4 2]]
center: 
[1.5 4.  1.5]
square_diff : 
[[0.25 0.   0.25]
 [0.25 0.   0.25]]
distance: 
[0.70710678 0.70710678]


array([0.70710678, 0.70710678])

In [36]:
# (1,3) - (2,3) => (2,3); a plain (3,) array would broadcast the same way
# broadcasting
a1 = np.array([2,4,1]).reshape((-1,3))
a2 = np.array([[4, 1, 3], [1, 1, 2]])
print(a1.shape)
print(a1 - a2)
print(a2 - a1)

(1, 3)
[[-2  3 -2]
 [ 1  3 -1]]
[[ 2 -3  2]
 [-1 -3  1]]


In [99]:
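# points: (5, 3) - centroids: (2, 3)
# add an axis to centroids: (5, 3) - (2, 1, 3)
# => (1, 5, 3) - (2, 1, 3)   (missing leading dimension of points treated as 1)
# => (2, 5, 3) - (2, 5, 3)   (size-1 axes stretched)
# => (2, 5, 3)

# distance from points to centroids (norm over the feature axis)
# => (2, 5)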

# points (m x n), m: number of points, n: features
m = 10
n = 3

# data
points = np.random.random(size=(m,n))
print('points:', points, sep='\n')


points:
[[0.96826158 0.31342418 0.69232262]
 [0.87638915 0.89460666 0.08504421]
 [0.03905478 0.16983042 0.8781425 ]
 [0.09834683 0.42110763 0.95788953]
 [0.53316528 0.69187711 0.31551563]
 [0.68650093 0.83462567 0.01828828]
 [0.75014431 0.98886109 0.74816565]
 [0.28044399 0.78927933 0.10322601]
 [0.44789353 0.9085955  0.29361415]
 [0.28777534 0.13002857 0.01936696]]


In [98]:
# find the nearest centroid for each point
# (assumes `points` (m, n) and `centroids` (k, n) are already defined)
diff = points - centroids[:,np.newaxis]
print('points - centroids: ', diff, 'diff shape:',diff.shape, sep='\n')
distances = np.linalg.norm(diff, axis=2)
print('distances: clusters x points ', distances, sep='\n')
cluster_labels = np.argmin(distances, axis=0)
print('nearest cluster of the points: ', cluster_labels, sep='\n')


points - centroids: 
[[[1 1 1]
  [1 1 1]
  [1 1 1]
  [1 1 1]
  [1 1 1]]

 [[0 0 0]
  [0 0 0]
  [0 0 0]
  [0 0 0]
  [0 0 0]]]
diff shape:
(2, 5, 3)
distances: clusters x points 
[[1.73205081 1.73205081 1.73205081 1.73205081 1.73205081]
 [0.         0.         0.         0.         0.        ]]
nearest cluster of the points: 
[1 1 1 1 1]
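
The nearest-centroid step can also be written as a self-contained sketch. The helper name `assign_clusters` and the small 2-feature data set are made up for illustration; they are not part of the notebook above.

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return the index of the nearest centroid for each point.

    points: (m, n), centroids: (k, n). Hypothetical helper for illustration.
    """
    # (m, n) against (k, 1, n) broadcasts to (k, m, n)
    diff = points - centroids[:, np.newaxis, :]
    # Euclidean norm over the feature axis gives (k, m)
    distances = np.linalg.norm(diff, axis=2)
    # For each point (column), pick the cluster (row) with the smallest distance
    return np.argmin(distances, axis=0)

# Made-up data: two points near the origin, two near (5, 5)
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = assign_clusters(points, centroids)
print(labels)
```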


In [29]:
# number of cluster k 
k = 2 

# 10 data x 3 features
np.random.seed(1)
arr1 = np.random.randint(1,5, size=(10,3))
print("points: ", arr1, sep='\n')
centroids = arr1[np.random.choice(
    arr1.shape[0], size=k, replace=False)]
print(f'{k} groups centers: ', centroids, sep='\n')

# difference between the points and the k centers
# points: 10x3, centroids: 2x3
# this fails: the leading dimensions (10 vs 2) are neither equal nor 1
diff = arr1 - centroids
print("points - centroids: ", diff, sep='\n')

points: 
[[2 4 1]
 [1 4 2]
 [4 2 4]
 [1 1 2]
 [1 4 2]
 [1 3 2]
 [3 1 3]
 [2 3 1]
 [4 1 3]
 [1 2 3]]
2 groups centers: 
[[4 1 3]
 [1 1 2]]


ValueError: operands could not be broadcast together with shapes (10,3) (2,3) 

In [52]:

# number of cluster k
k = 2

# 10 data x 3 features
np.random.seed(1)
arr1 = np.random.randint(1, 5, size=(10, 3))
print("points:", arr1, sep='\n')

centroids = arr1[np.random.choice(arr1.shape[0], size=k, replace=False)]
print(f'{k} groups centers:', centroids, sep='\n')

# difference between the points and the k centers
# points: 10x1x3 (after np.newaxis), centroids: 2x3
# diff: 10x2x3 (10 data x 2 clusters x 3 features)
diff = arr1[:, np.newaxis, :] - centroids
print("points - centroids:", diff, sep='\n')

points:
[[2 4 1]
 [1 4 2]
 [4 2 4]
 [1 1 2]
 [1 4 2]
 [1 3 2]
 [3 1 3]
 [2 3 1]
 [4 1 3]
 [1 2 3]]
2 groups centers:
[[4 1 3]
 [1 1 2]]
points - centroids:
[[[-2  3 -2]
  [ 1  3 -1]]

 [[-3  3 -1]
  [ 0  3  0]]

 [[ 0  1  1]
  [ 3  1  2]]

 [[-3  0 -1]
  [ 0  0  0]]

 [[-3  3 -1]
  [ 0  3  0]]

 [[-3  2 -1]
  [ 0  2  0]]

 [[-1  0  0]
  [ 2  0  1]]

 [[-2  2 -2]
  [ 1  2 -1]]

 [[ 0  0  0]
  [ 3  0  1]]

 [[-3  1  0]
  [ 0  1  1]]]


In [51]:
# number of cluster k
k = 2

# 10 data x 3 features
np.random.seed(1)
arr1 = np.random.randint(1, 5, size=(10, 3))
print("points:", arr1, sep='\n')

centroids = arr1[np.random.choice(arr1.shape[0], size=k, replace=False)]
print(f'{k} groups centers:', centroids, sep='\n')

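# To ensure broadcasting, add a new axis to centroids
centroids = centroids[:, np.newaxis, :]
print("Reshaped centroids:", centroids, sep='\n')

# points: 10x3, centroids: 2x1x3
# arr1 - centroids would now broadcast to 2x10x3 (2 clusters x 10 data x 3 features)
# the line below is left commented out: with the reshaped centroids,
# (10,1,3) - (2,1,3) cannot be broadcast (10 vs 2)
# diff = arr1[:, np.newaxis, :] - centroids
# note: `diff` is not recomputed in this cell, so this prints a 10x2x3 array
# left over from an earlier run
print("points - centroids:", diff, sep='\n')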

points:
[[2 4 1]
 [1 4 2]
 [4 2 4]
 [1 1 2]
 [1 4 2]
 [1 3 2]
 [3 1 3]
 [2 3 1]
 [4 1 3]
 [1 2 3]]
2 groups centers:
[[4 1 3]
 [1 1 2]]
Reshaped centroids:
[[[4 1 3]]

 [[1 1 2]]]
points - centroids:
[[[-2  3 -2]
  [ 1  3 -1]]

 [[-3  3 -1]
  [ 0  3  0]]

 [[ 0  1  1]
  [ 3  1  2]]

 [[-3  0 -1]
  [ 0  0  0]]

 [[-3  3 -1]
  [ 0  3  0]]

 [[-3  2 -1]
  [ 0  2  0]]

 [[-1  0  0]
  [ 2  0  1]]

 [[-2  2 -2]
  [ 1  2 -1]]

 [[ 0  0  0]
  [ 3  0  1]]

 [[-3  1  0]
  [ 0  1  1]]]
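
The two ways of adding the size-1 axis give the same differences in different layouts. The sketch below checks this on arbitrary data; taking the first two points as centers is an assumption made only to keep the example deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.integers(1, 5, size=(10, 3))   # 10 data x 3 features
centroids = points[:2]                       # 2 centers, picked arbitrarily here

# New axis on the points: result is indexed [point, cluster, feature]
by_point = points[:, np.newaxis, :] - centroids       # (10, 2, 3)
# New axis on the centroids: result is indexed [cluster, point, feature]
by_cluster = points - centroids[:, np.newaxis, :]     # (2, 10, 3)

# Swapping the first two axes turns one layout into the other
same = np.array_equal(by_point, by_cluster.transpose(1, 0, 2))
print(by_point.shape, by_cluster.shape, same)
```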
