In [1]:
import os
import pandas as pd
import time

start = time.time()
data_path = os.getcwd()[:os.getcwd().find("Code")] + "Data/netflix-prize/complete-csv/all_samples.csv"
df = pd.read_csv(data_path)
del df["Unnamed: 0"]
del df["date"]
print("The loading process took", round(time.time() - start), "seconds")

The loading process took 97 seconds


In [2]:
df.shape

(100477253, 3)

In [3]:
df.head()

|   | movie_id | user_id | rating |
|---|----------|---------|--------|
| 0 | 1        | 1488844 | 3      |
| 1 | 1        | 822109  | 5      |
| 2 | 1        | 885013  | 4      |
| 3 | 1        | 30878   | 4      |
| 4 | 1        | 823519  | 3      |


## Clustering Distance Metric:

$$ d(A, B) = \frac{1}{n\cdot5^2} \sum_i (r_{Ai} - r_{Bi})^2 $$

where $r_{Ai}$ is the rating user $A$ gave to movie $i$, the sum runs over the movies rated by both users, and $n$ is the number of movies that both user $A$ and user $B$ have rated.
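
To make the metric concrete, here is a minimal sketch on toy data. The function name `rating_distance`, the `max_rating` parameter, and the two dictionaries are illustrative assumptions and do not appear elsewhere in this notebook.

```python
def rating_distance(ratings_a, ratings_b, max_rating=5):
    """Mean squared rating difference over co-rated movies, scaled by max_rating**2.

    ratings_a, ratings_b: dicts mapping movie_id -> rating (hypothetical toy inputs).
    Returns None when the two users share no movies, since n would be 0.
    """
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return None
    total = sum((ratings_a[m] - ratings_b[m]) ** 2 for m in common)
    return total / (len(common) * max_rating ** 2)


# Toy example: the two users share movies 1 and 3.
user_a = {1: 3, 2: 5, 3: 4}
user_b = {1: 5, 3: 4, 7: 1}
d = rating_distance(user_a, user_b)
print(d)  # (2**2 + 0**2) / (2 * 25) = 0.08
```

Since ratings range from 1 to 5, the largest per-movie difference is 4, so under this normalization the distance cannot exceed 16/25 = 0.64.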

In [4]:
len(df["user_id"].unique())

480189

In [9]:
# This is too inefficient -- don't run!
def first_try():
    """Naive distance between two users: rescans the full dataframe for every common movie."""
    start = time.time()
    a, b = 1488844, 6
    rate, d = 0, 0  # rate counts the common movies, d accumulates the squared differences
    for movie_id in set(df[df["user_id"] == a]["movie_id"]) & set(df[df["user_id"] == b]["movie_id"]):
        r_a = df[(df["user_id"] == a) & (df["movie_id"] == movie_id)]["rating"].iloc[0]
        r_b = df[(df["user_id"] == b) & (df["movie_id"] == movie_id)]["rating"].iloc[0]
        d += (r_a - r_b)**2
        rate += 1
    distance = d/(rate*(5**2))
    print(time.time() - start)
    return distance

#first_try()
# Output -> 285 secs

284.9979431629181


Let's try extracting each user's dataframe first, so the full dataframe is only scanned twice.

In [17]:
start = time.time()
a, b = 1488844, 6
df_a = df[df["user_id"] == a]
df_b = df[df["user_id"] == b]
d = 0
common_movies = set(df_a["movie_id"]) & set(df_b["movie_id"])  # movies rated by both users
for movie_id in common_movies:
    d += (df_a[df_a["movie_id"] == movie_id]["rating"].iloc[0] - df_b[df_b["movie_id"] == movie_id]["rating"].iloc[0])**2
distance = d/(len(common_movies)*(5**2))
print(time.time() - start)
distance

0.8659329414367676


0.043485838779956425

Wow... But even so, at 0.87 seconds per pair and with 480189 users, it would take about 116 hours to compute the distances from a single user to all the others. We need to remove both the users and the movies with few ratings, but even then this seems infeasible. We can also split into training and testing sets now, or rather, we should; a 50-50 split is reasonable for such a large dataset. Without removing irrelevant users, that brings it down to about 58 hours for only one user, which is still too much, even if we process multiple users in parallel.
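
The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the helper name `hours_for_one_user` is made up for illustration, and it assumes every pair costs the same as the one pair timed above.

```python
seconds_per_pair = 0.87   # measured above for a single pair of users
n_users = 480189          # number of unique user_id values


def hours_for_one_user(seconds_per_pair, n_other_users):
    """Hours needed to compare one user against n_other_users, one pair at a time."""
    return seconds_per_pair * n_other_users / 3600


full = hours_for_one_user(seconds_per_pair, n_users)
half = hours_for_one_user(seconds_per_pair, n_users / 2)  # after a 50-50 split
print(round(full), round(half))  # 116 58
```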

Using NumPy arrays instead of a Python loop will probably speed it up even more... let's try.

In [50]:
start = time.time()
a, b = 1488844, 6
df_a = df[df["user_id"] == a]
df_b = df[df["user_id"] == b]
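common_movies = set(df_a["movie_id"]) & set(df_b["movie_id"])
# NOTE: the element-wise difference below assumes df_a and df_b list the common
# movies in the same order, which holds if df is sorted by movie_id (as df.head() suggests).
ratings_a = df_a[df_a["movie_id"].isin(common_movies)]["rating"].values
ratings_b = df_b[df_b["movie_id"].isin(common_movies)]["rating"].values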
print(time.time() - start)
((ratings_a - ratings_b)**2).mean()/5**2

0.23475003242492676


0.04348583877995643

Still... at 0.23 seconds per pair, this means about 31 hours for one user against all 480189 users, or roughly 15 hours after the 50-50 split. Maybe after removing non-relevant users and movies, this will become feasible if we do it in parallel and on the cloud. Need your thoughts on this before trying it though, as I am not very optimistic about it. TODO. Aaron's code will come in handy here.
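
One thing the vectorized cell relies on is that `ratings_a` and `ratings_b` list the common movies in the same order. A possible way to make that alignment explicit is to inner-join the two users' rows on `movie_id`. The sketch below uses a toy frame in the same layout as `df`; the helper name `pair_distance` is an assumption, and whether this is faster on the full data has not been measured.

```python
import pandas as pd


def pair_distance(df_a, df_b, max_rating=5):
    """Distance between two users given their rating rows (movie_id, rating).

    An inner merge on movie_id keeps only co-rated movies and pairs each rating
    of A with the rating of B for the same movie, whatever the row order.
    """
    merged = df_a.merge(df_b, on="movie_id", suffixes=("_a", "_b"))
    if merged.empty:
        return float("nan")  # no co-rated movies, so the distance is undefined
    diff = merged["rating_a"] - merged["rating_b"]
    return float((diff ** 2).mean() / max_rating ** 2)


# Toy data in the same layout as df (movie_id, user_id, rating).
toy = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3, 7],
    "user_id":  [10, 20, 10, 20, 10, 20],
    "rating":   [3, 5, 5, 4, 4, 1],
})
df_a = toy.loc[toy["user_id"] == 10, ["movie_id", "rating"]]
df_b = toy.loc[toy["user_id"] == 20, ["movie_id", "rating"]]
d_toy = pair_distance(df_a, df_b)
print(d_toy)  # users 10 and 20 share movies 1 and 3: (4 + 0) / (2 * 25) = 0.08
```

The merge does more work per pair than the `isin` filter, so this trades some speed for not depending on the sort order of the input.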

The last approach would be KMeans via sklearn, but for that we need to create a matrix like

| users | movie_1 | movie_2 | movie_3 | .... |
|-------|---------|---------|---------|------|
| a     | 3       | 4       | 1       | ...  |
| b     | 2       | 2       | 5       | ...  |
| ...   | ...     | ...     | ...     | ...  |

but we would get a lot of NaNs for the users who didn't rate movie_1, movie_2, etc. We could trim it down to include as features only the movies that were rated by all users, but I do not know how good the clustering would be at that point...
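
As a small illustration of the matrix and of where the NaNs come from, here is a sketch on a toy frame with the same columns as `df`. The toy values are made up, and a dense pivot like this would likely not fit in memory on the full dataset, so it is only meant to show the shape of the problem.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3, 7],
    "user_id":  [10, 20, 10, 20, 10, 20],
    "rating":   [3, 5, 5, 4, 4, 1],
})

# One row per user, one column per movie; unrated movies become NaN.
matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")
missing_fraction = float(matrix.isna().to_numpy().mean())

# Columns without any NaN are the movies that every user rated.
rated_by_all = matrix.columns[matrix.notna().all().to_numpy()].tolist()
print(missing_fraction, rated_by_all)  # 0.25 [1, 3]
```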

In [10]:
movies_user = set(df[df["user_id"] == df["user_id"].unique()[0]]["movie_id"])
count = 0
for user in df["user_id"].unique()[1:]:
    movies_user = movies_user & set(df[df["user_id"] == user]["movie_id"])
    count += 1
    if count == 6:
        break

In [11]:
len(movies_user)

1

Yep, this is even worse than I thought. For the first 7 users (the first one plus the 6 the loop intersects with), there is only one movie that all of them rated!

In [8]:
movies_user

{1}

After removing users with few ratings, this will probably improve, but still, we have too many users...
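
A minimal sketch of that filtering step, on toy data, could look as follows. The helper name `drop_sparse_users` and the `min_ratings` threshold are hypothetical; the notebook does not fix a cutoff, and the same pattern would apply to movies by counting `movie_id` instead.

```python
import pandas as pd


def drop_sparse_users(ratings, min_ratings):
    """Keep only the rows of users who have at least min_ratings ratings.

    min_ratings is a hypothetical threshold chosen by the caller.
    """
    counts = ratings["user_id"].value_counts()
    keep = counts[counts >= min_ratings].index
    return ratings[ratings["user_id"].isin(keep)]


toy = pd.DataFrame({
    "movie_id": [1, 1, 2, 3, 3, 7, 2],
    "user_id":  [10, 20, 10, 20, 10, 20, 30],
    "rating":   [3, 5, 5, 4, 4, 1, 2],
})
filtered = drop_sparse_users(toy, min_ratings=2)
remaining_users = sorted(filtered["user_id"].unique().tolist())
print(remaining_users)  # user 30 has a single rating and is dropped: [10, 20]
```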