Matrix factorization can find relationships between users and products without explicit relationship information.

Matrix factorization in recommendation systems is generally considered a collaborative filtering technique, not a content-based filtering approach. Here are the key points to understand this:

1. Collaborative filtering:

- Relies on past user behavior and interactions
- Doesn't require explicit content information about items
- Finds patterns in user-item interactions
2. Matrix factorization as collaborative filtering:
- Uses a user-item interaction matrix
- Decomposes this matrix into lower-dimensional user and item latent factor matrices
- These latent factors capture underlying patterns in user preferences and item characteristics
- Doesn't rely on explicit item features or content
3. How it works:
- Starts with a sparse user-item interaction matrix
- Factorizes this matrix into two lower-dimensional matrices (user factors and item factors)
- These factors represent latent features that explain user preferences and item attributes 
- Predictions are made by multiplying these factor matrices
4. Advantages:
- Can capture complex patterns in user behavior
- Doesn't require content information about items
- Can handle large, sparse datasets efficiently
5. Contrast with content-based filtering:
- Content-based filtering uses explicit features of items (e.g., genre, actors for movies)
- Matrix factorization doesn't require this explicit content information
6. Hybrid approaches:
- Some advanced techniques combine matrix factorization with content-based features for improved performance
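
The "How it works" steps above can be sketched with a toy example. The factor values are hand-picked for illustration rather than learned, and the names `user_factors` and `item_factors` are assumptions of this sketch.

```python
import numpy as np

# Hypothetical latent factors (hand-picked, not learned): 3 users, 2 items, 2 factors.
user_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.5, 0.5]])
item_factors = np.array([[5.0, 1.0],
                         [1.0, 5.0]])

# Multiplying the two factor matrices gives a dense matrix of predicted ratings,
# including user-item pairs that were never observed.
predicted = user_factors @ item_factors.T

# A single prediction is the dot product of one user row and one item row.
single = user_factors[2] @ item_factors[0]

print(predicted)
print(single)
```

In a real system the factor values are learned from the observed ratings; only the final multiplication step looks like this.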

- Uses matrix factorization and KMeans for movie predictions
- Dataset: MovieLens. F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. <a href="https://doi.org/10.1145/2827872">Link to Paper</a>
- Associated YouTube video: <a href="https://youtu.be/G4MBc40rQ2k">Supplemental Material</a>

In [1]:
# Data Citation:
# F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on 
# Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. 

! curl http://files.grouplens.org/datasets/movielens/ml-latest-small.zip -o ml-latest-small.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  955k  100  955k    0     0  1133k      0 --:--:-- --:--:-- --:--:-- 1131k


In [2]:
import zipfile
with zipfile.ZipFile('ml-latest-small.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

In [1]:
# import the dataset
import pandas as pd
movies_df = pd.read_csv('data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('data/ml-latest-small/ratings.csv')

In [2]:
print('The dimensions of movies dataframe are:', movies_df.shape,'\nThe dimensions of ratings dataframe are:', ratings_df.shape)


The dimensions of movies dataframe are: (9742, 3) 
The dimensions of ratings dataframe are: (100836, 4)


In [3]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


#### Movie ID to movie name mapping

In [22]:
movie_names = movies_df.set_index('movieId')['title'].to_dict()
n_users = ratings_df.userId.nunique()
n_items = ratings_df.movieId.nunique()
print("# users: ", n_users, "\n# items: ", n_items, "\nThe full rating matrix will have: ", n_users*n_items, ' elements.')
print(f"# ratings: {len(ratings_df)}\nTherefore: {len(ratings_df)/(n_users*n_items)*100} % of the matrix is filled.")

# users:  610 
# items:  9724 
The full rating matrix will have:  5931640  elements.
# ratings: 100836
Therefore: 1.6999683055613624 % of the matrix is filled.


We have an <b><i>incredibly sparse matrix </i></b> to work with here.

As you can imagine, the full matrix has n_users × n_items cells, so its size grows with the product of the two counts as users and products are added.

At global scale, storing the full matrix in memory would be a challenge.

One <b>advantage</b> here is that <u>matrix factorization</u> represents the rating matrix implicitly through the two factor matrices, so the full matrix never has to be materialized.
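
To put rough numbers on this, the sketch below compares the dense matrix with the two factor tables, using the counts printed above. The 4-byte value size is an assumption (float32 storage), and `n_factors = 8` is the value used for the model later in this notebook.

```python
# Figures taken from the output above.
n_users, n_items, n_ratings = 610, 9724, 100836
n_factors = 8          # the value used for the model later in this notebook
bytes_per_value = 4    # assumption: float32 storage

dense_cells = n_users * n_items                  # every user-item pair
factor_params = (n_users + n_items) * n_factors  # two embedding tables

print(f"dense matrix:   {dense_cells:,} cells, {dense_cells * bytes_per_value / 1e6:.1f} MB")
print(f"factor tables:  {factor_params:,} values, {factor_params * bytes_per_value / 1e6:.2f} MB")
print(f"observed cells: {n_ratings / dense_cells:.2%}")
```

The dense size grows with the product of the two counts, while the factor tables grow with their sum.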

In [6]:
import torch
import numpy as np

class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors=20):
        super().__init__()
        # create embeddings for users
        self.user_factor = torch.nn.Embedding(n_users, n_factors) # it's like a look up table for the input
        # Embedding for items
        self.item_factor = torch.nn.Embedding(n_items, n_factors)
        self.user_factor.weight.data.uniform_(0,0.05)
        self.item_factor.weight.data.uniform_(0,0.05)
        
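    def forward(self, data):
        # data has shape (batch, 2): column 0 holds user indices, column 1 item indices
        users, items = data[:,0], data[:,1]
        # row-wise dot product of the looked-up user and item factors
        return (self.user_factor(users)*self.item_factor(items)).sum(1)
    
    def forward2(self, user, item):
        # same computation with separate index tensors; the product is
        # parenthesized so the sum runs over the factors of each pair
        return (self.user_factor(user)*self.item_factor(item)).sum(1)
    
    def predict(self, user, item):
        # forward() takes one (batch, 2) tensor, so delegate to forward2()
        return self.forward2(user, item)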
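
The `forward` method multiplies the looked-up rows element-wise and sums over the factor axis, which is a per-pair dot product rather than a full matrix multiplication. A small NumPy sketch (array names and values are illustrative assumptions, with plain arrays standing in for the embedding tables) shows that the result matches the diagonal of the full product:

```python
import numpy as np

rng = np.random.default_rng(0)
user_table = rng.uniform(0, 0.05, size=(4, 3))   # 4 users, 3 factors
item_table = rng.uniform(0, 0.05, size=(5, 3))   # 5 items, 3 factors

# A batch of (user index, item index) pairs, like the rows the DataLoader yields.
batch = np.array([[0, 2], [3, 1], [1, 4]])
users, items = batch[:, 0], batch[:, 1]

# Row lookup, element-wise product, sum over factors: one score per pair.
pair_scores = (user_table[users] * item_table[items]).sum(axis=1)

# The full matrix product scores every user in the batch against every item
# in the batch; only its diagonal corresponds to the requested pairs.
all_scores = user_table[users] @ item_table[items].T

print(pair_scores.shape, all_scores.shape)
```

Computing only the requested pairs keeps the work proportional to the batch size instead of its square.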

### Creating the dataloader in pytorch

In [7]:
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader 

class Loader(Dataset):
    def __init__(self):
        self.ratings = ratings_df.copy()
        
        # Extract all user IDs and movie IDs
        users = ratings_df.userId.unique()
        movies = ratings_df.movieId.unique()
        
        #--- Producing new continuous IDs for users and movies ---
        
        # Map each raw ID to a contiguous index (raw ID -> index)
        self.userid2idx = {o:i for i,o in enumerate(users)}
        self.movieid2idx = {o:i for i,o in enumerate(movies)}
        
        # Reverse maps (index -> raw ID), used later to recover movie titles
        self.idx2userid = {i:o for o,i in self.userid2idx.items()}
        self.idx2movieid = {i:o for o,i in self.movieid2idx.items()}
        
        # Replace the raw IDs in the ratings table with their contiguous indices
        self.ratings.movieId = ratings_df.movieId.apply(lambda x: self.movieid2idx[x])
        self.ratings.userId = ratings_df.userId.apply(lambda x: self.userid2idx[x])
        
        
        self.x = self.ratings.drop(['rating', 'timestamp'], axis=1).values
        self.y = self.ratings['rating'].values
        self.x, self.y = torch.tensor(self.x), torch.tensor(self.y) # Transforms the data to tensors (ready for torch models.)

    def __getitem__(self, index):
        return (self.x[index], self.y[index])

    def __len__(self):
        return len(self.ratings)
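
The remapping in `Loader` is needed because `torch.nn.Embedding` expects indices in the range `0..n-1`, while raw MovieLens IDs have gaps (the ratings sample above jumps from movie 6 to 47). A minimal sketch of the same idea in plain Python, using a made-up list of IDs:

```python
raw_movie_ids = [1, 3, 6, 47, 50, 3, 1]   # made-up sample with gaps and repeats

# dict.fromkeys keeps first-seen order and drops duplicates.
unique_ids = list(dict.fromkeys(raw_movie_ids))

movieid2idx = {movie_id: idx for idx, movie_id in enumerate(unique_ids)}
idx2movieid = {idx: movie_id for movie_id, idx in movieid2idx.items()}

encoded = [movieid2idx[m] for m in raw_movie_ids]
decoded = [idx2movieid[i] for i in encoded]

print(encoded)   # contiguous indices
print(decoded)   # round-trips to the raw IDs
```

The reverse map is what lets cluster members be translated back to raw movie IDs, and from there to titles.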

### Preparing the training parameters and tools

In [8]:
num_epochs = 128
cuda = torch.cuda.is_available()

print("Is running on GPU:", cuda)

model = MatrixFactorization(n_users, n_items, n_factors=8)
print(model)
# for name, param in model.named_parameters():
#     if param.requires_grad:
#         print(name, param.data)
# GPU enable if you have a GPU...
if cuda:
    model = model.cuda()

# MSE loss
loss_fn = torch.nn.MSELoss()

# Adam optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train data
train_set = Loader()
train_loader = DataLoader(train_set, 128, shuffle=True)

Is running on GPU: True
MatrixFactorization(
  (user_factor): Embedding(610, 8)
  (item_factor): Embedding(9724, 8)
)




In [9]:
from tqdm import tqdm
for it in tqdm(range(num_epochs)):
    losses = []
    for x, y in train_loader:
        if cuda:
            x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        outputs = model(x)
        loss = loss_fn(outputs.squeeze(), y.type(torch.float32))
        losses.append(loss.item())
        loss.backward()
        optimizer.step()
    print("iter #{}".format(it), "Loss:", sum(losses) / len(losses))
     

iter #0 Loss: 11.063573892951617
iter #1 Loss: 4.747916829162443
iter #2 Loss: 2.476611403190545
iter #3 Loss: 1.722014089374978
iter #4 Loss: 1.3462853428676043
iter #5 Loss: 1.1287034175874013
iter #6 Loss: 0.9916549667262183
iter #7 Loss: 0.9004620889267946
iter #8 Loss: 0.8372120944225243
iter #9 Loss: 0.7923444573800575
iter #10 Loss: 0.7592515959428047
iter #11 Loss: 0.7348318553061655
iter #12 Loss: 0.716253412329606
iter #13 Loss: 0.7017597187745389
iter #14 Loss: 0.6904315062737102
iter #15 Loss: 0.6818492265446537
iter #16 Loss: 0.6749020837965956
iter #17 Loss: 0.6698954220728826
iter #18 Loss: 0.6658838670417137
iter #19 Loss: 0.6629400907207261
iter #20 Loss: 0.6606530618622218
iter #21 Loss: 0.6588392979299962
iter #22 Loss: 0.6575668083154006
iter #23 Loss: 0.6566906307781408
iter #24 Loss: 0.6558078140171651
iter #25 Loss: 0.6550135704769096
iter #26 Loss: 0.6540080409515933
iter #27 Loss: 0.653331786885782
iter #28 Loss: 0.6520605890203248
iter #29 Loss: 0.6510417556323981
iter #30 Loss: 0.6494638351014423
iter #31 Loss: 0.6477866210232531
iter #32 Loss: 0.6453777202570499
iter #33 Loss: 0.6425141768558377
iter #34 Loss: 0.6393514826591244
iter #35 Loss: 0.6352594585588136
iter #36 Loss: 0.6302398298748859
iter #37 Loss: 0.624462768397658
iter #38 Loss: 0.617964971035265
iter #39 Loss: 0.6101860784485861
iter #40 Loss: 0.6023945927922496
iter #41 Loss: 0.5936383745316322
iter #42 Loss: 0.5845031061677763
iter #43 Loss: 0.5750925666487157
iter #44 Loss: 0.565863612453042
iter #45 Loss: 0.5560428280440078
iter #46 Loss: 0.546473297559973
iter #47 Loss: 0.537038322821789
iter #48 Loss: 0.5278839129330543
iter #49 Loss: 0.5187877588314453
iter #50 Loss: 0.5097662834497878
iter #51 Loss: 0.5011509834661096
iter #52 Loss: 0.4928647421186951
iter #53 Loss: 0.48508087783900616
iter #54 Loss: 0.4774417973638791
iter #55 Loss: 0.47027393088198555
iter #56 Loss: 0.463314883108369
iter #57 Loss: 0.45685123659783816
iter #58 Loss: 0.4508332490618459
iter #59 Loss: 0.44523390397656387
iter #60 Loss: 0.4398912843789546
iter #61 Loss: 0.4345819679689286
iter #62 Loss: 0.43002639173856244
iter #63 Loss: 0.42530597950601334
iter #64 Loss: 0.4207819109300369
iter #65 Loss: 0.41697768723299056
iter #66 Loss: 0.41319475254885435
iter #67 Loss: 0.40927709813045365
iter #68 Loss: 0.40601458616063074
iter #69 Loss: 0.40263649758878095
iter #70 Loss: 0.39930354694121983
iter #71 Loss: 0.3963605581503834
iter #72 Loss: 0.3935012433696822
iter #73 Loss: 0.3907095291060845
iter #74 Loss: 0.3883046552463231
iter #75 Loss: 0.38584540954549906
iter #76 Loss: 0.38337191097854356
iter #77 Loss: 0.3808787355416922
iter #78 Loss: 0.3787766049242564
iter #79 Loss: 0.37672356296765624
iter #80 Loss: 0.3749277426431022
iter #81 Loss: 0.37302621968355276
iter #82 Loss: 0.37105374811642666
iter #83 Loss: 0.36917162104155204
iter #84 Loss: 0.3674945010411255
iter #85 Loss: 0.36592464307843126
iter #86 Loss: 0.36432325548780753
iter #87 Loss: 0.36289932919183965
iter #88 Loss: 0.3612512306021857
iter #89 Loss: 0.35990218761457404
iter #90 Loss: 0.35855108987407636
iter #91 Loss: 0.3571487407809889
iter #92 Loss: 0.3559282099270276
iter #93 Loss: 0.3546745123218764
iter #94 Loss: 0.35343895962244365
iter #95 Loss: 0.3523075644702173
iter #96 Loss: 0.3511196159507115
iter #97 Loss: 0.34999926027080736
iter #98 Loss: 0.3491189095952789
iter #99 Loss: 0.34814902058410163
iter #100 Loss: 0.34707768042075454
iter #101 Loss: 0.34622198623146505
iter #102 Loss: 0.345082392490606
iter #103 Loss: 0.344296695295778
iter #104 Loss: 0.3432754522690616
iter #105 Loss: 0.34267002122883267
iter #106 Loss: 0.3417090610766471
iter #107 Loss: 0.34091715607319384
iter #108 Loss: 0.34024724190307754
iter #109 Loss: 0.3394328419705333
iter #110 Loss: 0.3386275561781704
iter #111 Loss: 0.33790423004382153
iter #112 Loss: 0.3371430547142089
iter #113 Loss: 0.3366369243692323
iter #114 Loss: 0.33574132384912014
iter #115 Loss: 0.3352578627465643
iter #116 Loss: 0.33457679520857514
iter #117 Loss: 0.3338580181075232
iter #118 Loss: 0.33357080444693565
iter #119 Loss: 0.3327278375323049
iter #120 Loss: 0.33218453039750834
iter #121 Loss: 0.33166381561665365
iter #122 Loss: 0.33111858693173696
iter #123 Loss: 0.330774118301227
iter #124 Loss: 0.33010784404229393
iter #125 Loss: 0.32959684479962753
iter #126 Loss: 0.32910189817323904
iter #127 Loss: 0.3287796936361923

100%|██████████| 128/128 [04:33<00:00,  2.14s/it]
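
Each pass of the training loop nudges the two embedding tables to reduce the squared error on the sampled ratings. The sketch below does the same thing by hand for a single rating with plain gradient descent. This is a deliberate simplification: the notebook uses Adam on mini-batches, and the vectors, rating, and learning rate here are made up.

```python
import numpy as np

user_vec = np.array([0.1, 0.2, 0.3])   # made-up user factors
item_vec = np.array([0.3, 0.1, 0.2])   # made-up item factors
rating, lr = 4.0, 0.05                 # made-up target and learning rate

def squared_error(u, v):
    return (u @ v - rating) ** 2

before = squared_error(user_vec, item_vec)
for _ in range(200):
    err = user_vec @ item_vec - rating
    # Gradients of (u.v - r)^2 with respect to u and v.
    grad_u, grad_v = 2 * err * item_vec, 2 * err * user_vec
    user_vec, item_vec = user_vec - lr * grad_u, item_vec - lr * grad_v
after = squared_error(user_vec, item_vec)

print(before, after)
```

Note that the loss printed by the notebook is measured on the same ratings the model is trained on, so it shows fit to the training data rather than accuracy on unseen ratings.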





### Fit the clusters based on the movie weights

In [16]:
trained_movie_embeddings = model.item_factor.weight.data.cpu().numpy()

In [17]:
len(trained_movie_embeddings) # unique movie factor weights

9724

In [18]:
trained_movie_embeddings

array([[0.7210944 , 0.4119839 , 0.85131705, ..., 0.44539163, 0.32182238,
        0.33362895],
       [0.3040543 , 0.7816727 , 0.59153974, ..., 0.00246055, 0.29792678,
        0.5620495 ],
       [0.70916766, 0.34539557, 0.55498797, ..., 0.38553005, 0.44276807,
        0.240126  ],
       ...,
       [0.3520334 , 0.35725495, 0.35788986, ..., 0.3598867 , 0.376347  ,
        0.35604864],
       [0.4135562 , 0.41313234, 0.41165113, ..., 0.40019327, 0.43589655,
        0.3934105 ],
       [0.39046398, 0.3834934 , 0.41870788, ..., 0.38096786, 0.40370354,
        0.43732196]], dtype=float32)

In [19]:
# !pip install threadpoolctl==3.1.0
import sklearn
sklearn.show_versions()


System:
    python: 3.8.13 (default, Mar 28 2022, 11:38:47)  [GCC 7.5.0]
executable: /home/mmoha014/.conda/envs/tf-torch/bin/python
   machine: Linux-4.18.0-513.9.1.el8_9.x86_64-x86_64-with-glibc2.17

Python dependencies:
      sklearn: 1.1.2
          pip: 22.2.2
   setuptools: 63.4.1
        numpy: 1.19.5
        scipy: 1.9.1
       Cython: None
       pandas: 1.5.0
   matplotlib: 3.7.1
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
         prefix: libmkl_rt
       filepath: /home/mmoha014/.conda/envs/tf-torch/lib/libmkl_rt.so.1
        version: 2021.4-Product
threading_layer: intel
    num_threads: 3

       user_api: openmp
   internal_api: openmp
         prefix: libiomp
       filepath: /home/mmoha014/.conda/envs/tf-torch/lib/libiomp5.so
        version: None
    num_threads: 3

       user_api: openmp
   internal_api: openmp
         prefix: libgomp
       filepath: /home/mmoha014/.conda/envs/t

In [20]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10)#, random_state=0)
if trained_movie_embeddings is not None and len(trained_movie_embeddings)>0:
    kmeans.fit(trained_movie_embeddings)
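
For intuition about what the clustering step does with the embedding rows, here is a toy version of Lloyd's algorithm in NumPy. It is a stand-in for illustration, not scikit-learn's implementation, and the function name `kmeans_sketch` and the synthetic `fake_embeddings` array are assumptions of this sketch.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=20, seed=0):
    """Toy Lloyd's algorithm; a stand-in for illustration, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center: shape (n_points, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Made-up stand-in for the (n_movies, n_factors) embedding matrix: two obvious groups.
rng = np.random.default_rng(1)
fake_embeddings = np.vstack([
    rng.normal(0.0, 0.05, size=(50, 8)),
    rng.normal(1.0, 0.05, size=(50, 8)),
])
labels, centers = kmeans_sketch(fake_embeddings, k=2)
print(np.bincount(labels))
```

The clustering only ever sees rows of numbers, which is why any grouping it finds has to come from the rating behavior encoded in the embeddings.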

The listing below shows that movies in the same cluster tend to have similar genres.

Note that the algorithm never sees movie titles or genres; it derives these relationships only from the numbers that record how users rated the movies.

In [23]:
for cluster in range(10):
  print("Cluster #{}".format(cluster))
  movs = []
  for movidx in np.where(kmeans.labels_ == cluster)[0]:
    movid = train_set.idx2movieid[movidx]
    rat_count = (ratings_df['movieId'] == movid).sum()  # number of ratings for this movie
    movs.append((movie_names[movid], rat_count))
  for mov in sorted(movs, key=lambda tup: tup[1], reverse=True)[:10]:
    print("\t", mov[0])

Cluster #0
	 Batman & Robin (1997)
	 Super Mario Bros. (1993)
	 Joe Dirt (2001)
	 Speed 2: Cruise Control (1997)
	 Rocky V (1990)
	 Superman IV: The Quest for Peace (1987)
	 Nutty Professor II: The Klumps (2000)
	 Karate Kid, Part III, The (1989)
	 Shark Tale (2004)
	 Dungeons & Dragons (2000)
Cluster #1
	 Forrest Gump (1994)
	 Shawshank Redemption, The (1994)
	 Silence of the Lambs, The (1991)
	 Matrix, The (1999)
	 Star Wars: Episode IV - A New Hope (1977)
	 Braveheart (1995)
	 Fight Club (1999)
	 Seven (a.k.a. Se7en) (1995)
	 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
	 Star Wars: Episode VI - Return of the Jedi (1983)
Cluster #2
	 Schindler's List (1993)
	 Toy Story (1995)
	 Aladdin (1992)
	 Back to the Future (1985)
	 Mask, The (1994)
	 Beauty and the Beast (1991)
	 Princess Bride, The (1987)
	 Beautiful Mind, A (2001)
	 E.T. the Extra-Terrestrial (1982)
	 Willy Wonka & the Chocolate Factory (1971)
Cluster #3
	 Iron Man (2008)
	 Starship Trooper