### Q1: Is the mean of the distances to a cluster's points the same as the distance to the cluster centroid?
Let's check it out.

In [42]:
from sklearn.datasets import make_blobs
import pandas as pd
import numpy as np

# create data
x, y = make_blobs(n_samples=100, centers=5, n_features=2, random_state=42, cluster_std=0.5)
df = pd.DataFrame(dict(x1=x[:,0], x2=x[:,1], y=y))

# define a sample point
sample = df.sample(1, random_state=42)

# define the cluster
label = 4
cluster = df[df.y == label]

In [43]:
def l2(x1, x2, y1, y2):
    return np.sqrt((x1 - x2)**2 + (y1 - y2)**2)

def l1(x1, x2, y1, y2):
    return np.abs(x1 - x2) + np.abs(y1 - y2)

In [45]:
summation = 0
s_x1 = sample.x1
s_x2 = sample.x2
for c_x1, c_x2 in zip(cluster.x1, cluster.x2):
    summation += l2(c_x1, s_x1, c_x2, s_x2)
print("Mean of distances: %.2f" % (summation.item() / cluster.shape[0]))

# the distance to the centroid is a single value, so no loop or averaging is needed
cm_x1 = cluster.x1.mean()
cm_x2 = cluster.x2.mean()
print("Distance to cluster centroid: %.2f" % l2(cm_x1, s_x1, cm_x2, s_x2).item())

Mean of distances: 3.26
Distance to cluster centroid: 3.23


The two values are close but not the same. It seems that as the cluster's standard deviation (`cluster_std`) decreases, the two distances get closer to each other.
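By the triangle inequality, the L2 distance from the sample to the centroid can never exceed the mean of the L2 distances to the individual points, so the gap above is expected. The following is a minimal sketch of that check for several spreads. It uses plain NumPy Gaussian clusters instead of `make_blobs`, and the helper name, the cluster center, and the sample point are made up for illustration.

```python
import numpy as np

def mean_distance_vs_centroid_distance(points, sample):
    """Return (mean of L2 distances to each point, L2 distance to the centroid)."""
    mean_of_distances = np.linalg.norm(points - sample, axis=1).mean()
    distance_to_centroid = np.linalg.norm(points.mean(axis=0) - sample)
    return mean_of_distances, distance_to_centroid

rng = np.random.default_rng(42)
center = np.array([3.0, 0.0])  # hypothetical cluster center
sample = np.array([0.0, 0.0])  # hypothetical sample point

results = {}
for std in (2.0, 1.0, 0.5, 0.1):
    # isotropic Gaussian cluster with the given standard deviation
    points = center + rng.normal(scale=std, size=(100, 2))
    results[std] = mean_distance_vs_centroid_distance(points, sample)
    print("std=%.1f  mean of distances: %.3f  distance to centroid: %.3f"
          % (std, *results[std]))
```

In this setup the mean of distances stays at or above the distance to the centroid for every spread, and the difference shrinks as `std` gets smaller.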

### Q2: Is the test sampling in the recommendation matrix correct?

Let's check it out.

In [1]:
import pandas as pd
import numpy as np

In [3]:
r = np.array([[7, 6, 7, 4, 5, 4],
              [6, 7, np.nan, 4, 3, 4],
              [np.nan, 3, 3, 1, 1, np.nan],
              [1, 2, 3, 3, 3, 4],
              [1, np.nan, 1, 2, 3, 3]])

idx = np.random.choice(np.arange(6), 2, replace=False)

irow, jcol = np.where(~np.isnan(r))

r_copy1 = r.copy()
r_copy2 = r.copy()

test_irow = irow[idx]
test_jcol = jcol[idx]

for i in test_irow:
    for j in test_jcol:
        r_copy1[i][j] = np.nan

for (i, j) in zip(test_irow, test_jcol):
    r_copy2[i][j] = np.nan

print(np.nansum(r_copy1), np.nansum(r_copy2))

82.0 82.0


In [4]:
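# a plain '\t' separator (instead of the regex r'\t') avoids the fallback to the slower python parser engine
df = pd.read_csv('https://files.grouplens.org/datasets/movielens/ml-100k/u.data', sep='\t',
                 names=['user_id', 'item_id', 'rating', 'timestamp'])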

r = df.pivot(index='user_id', columns='item_id', values='rating').values

irow, jcol = np.where(~np.isnan(r))

r_copy1 = r.copy()
r_copy2 = r.copy()

idx = np.random.choice(np.arange(100_000), 1000, replace=False)

test_irow = irow[idx]
test_jcol = jcol[idx]

for i in test_irow:
    for j in test_jcol:
        r_copy1[i][j] = np.nan

for (i, j) in zip(test_irow, test_jcol):
    r_copy2[i][j] = np.nan

print(np.nansum(r_copy1), np.nansum(r_copy2))

174331.0 349381.0


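The two approaches agree on the small data, but only because `idx` is drawn from the first six observed entries, which all lie in row 0. On the large data they are not the same thing: the sum of the remaining ratings with `zip` is almost twice the sum left by the nested loops. The nested loops mask every combination of the sampled rows and columns, so for two sampled pairs `(i1, j1)` and `(i2, j2)` they also mask `(i1, j2)` and `(i2, j1)`. The code in the lesson also chooses these entries on the opposite side of the diagonal.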
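The difference between the two masking strategies can be sketched on a small dense matrix by counting how many cells each one hides. This is only an illustration under assumptions: the matrix is random and hypothetical, and `np.ix_` stands in for the nested loops since it selects the same row-by-column cross product.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.integers(1, 6, size=(5, 6)).astype(float)  # hypothetical dense rating matrix

# positions of all observed ratings, then a random sample of three of them
irow, jcol = np.where(~np.isnan(r))
idx = rng.choice(len(irow), size=3, replace=False)
test_irow, test_jcol = irow[idx], jcol[idx]

# pairwise masking: equivalent to the zip loop, hides exactly the sampled cells
r_pairs = r.copy()
r_pairs[test_irow, test_jcol] = np.nan

# cross-product masking: equivalent to the nested loops,
# hides every (row, column) combination of the sampled indices
r_cross = r.copy()
r_cross[np.ix_(test_irow, test_jcol)] = np.nan

n_pairs = int(np.isnan(r_pairs).sum())
n_cross = int(np.isnan(r_cross).sum())
print("masked with pairs:", n_pairs, "masked with cross product:", n_cross)
```

The pairwise version always hides exactly as many cells as were sampled, while the cross-product version hides (unique sampled rows) × (unique sampled columns) cells, which grows quickly with the sample size.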