### Table to Dictionary
* In code, I want to ask questions like:
    * Given user i, which movies j did they rate?
    * Given movie j, which users i have rated it?
    * Given user i and movie j, what is the rating?
* Theoratically, Pandas Dataframe is like an SQL table, so we should be able to write "queries" to grab this info?
* Python dictionaries are already a key>value lookup
    * user2movie: user ID -> movie ID
    * movie2user: movieID -> user ID
    * usermovie2rating: (user ID, movie ID) -> rating
* Why dictionaries?    
    * Looping through the array would be O(MN)    
    * Looping through dictionary is O(|omega|), omega: set pf our ratings

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

In [9]:
df = pd.read_csv('./Data/very_small_rating.csv')

In [10]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,userId,movieId,rating,movie_idx
0,960,960,7307,1,4.5,10
1,961,961,7307,10,2.5,68
2,962,962,7307,19,3.5,143
3,963,963,7307,32,5.0,19
4,964,964,7307,39,4.5,85


In [20]:
N = df.userId.max()+1
M = df.movie_idx.max()+1

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X = df.drop('rating',axis=1)
y = df['rating']

In [49]:
# train test split
df = shuffle(df)
cutoff = int(0.8*len(df))
df_train = df.iloc[:cutoff]
df_test = df.iloc[cutoff:]

In [59]:
user2movie=dict()
movie2user=dict()
usermovie2rating=dict()

In [64]:
count=0
def update_user2movie_and_movie2user(row):
    global count # using global count here
    count+=1
    if count%100000 == 0:
        print("preprocessed: %.3f" % (float(count)/cutoff))
    i = int(row.userId)
    j = int(row.movie_idx)
    if i not in user2movie:
        user2movie[i]=[j]
    else:
        user2movie[i].append(j)
    if j not in movie2user:
        movie2user[j]=[i]
    else:
        movie2user[j].append(i)
    
    usermovie2rating[(i,j)] = row.rating

In [75]:
df_train.apply(update_user2movie_and_movie2user,axis=1)

preprocessed: 0.510
preprocessed: 0.533
preprocessed: 0.556
preprocessed: 0.580
preprocessed: 0.603
preprocessed: 0.626
preprocessed: 0.649
preprocessed: 0.672
preprocessed: 0.695
preprocessed: 0.719
preprocessed: 0.742
preprocessed: 0.765
preprocessed: 0.788
preprocessed: 0.811
preprocessed: 0.835
preprocessed: 0.858
preprocessed: 0.881
preprocessed: 0.904
preprocessed: 0.927
preprocessed: 0.950
preprocessed: 0.974
preprocessed: 0.997
preprocessed: 1.020
preprocessed: 1.043
preprocessed: 1.066
preprocessed: 1.090
preprocessed: 1.113
preprocessed: 1.136
preprocessed: 1.159
preprocessed: 1.182
preprocessed: 1.205
preprocessed: 1.229
preprocessed: 1.252
preprocessed: 1.275
preprocessed: 1.298
preprocessed: 1.321
preprocessed: 1.345
preprocessed: 1.368
preprocessed: 1.391
preprocessed: 1.414
preprocessed: 1.437
preprocessed: 1.460
preprocessed: 1.484


832817     None
541292     None
5232619    None
3534255    None
1193841    None
           ... 
1591962    None
5295376    None
4243952    None
4248271    None
1471064    None
Length: 4313620, dtype: object

In [67]:
# test ratings dictionary
usermovie2rating_test = dict()


In [68]:
count=0
def update_usermovie2rating_test(row):
    global count
    count+=1
    if count % 100000 == 0:
        print("preprocessed: %.3f" % (float(count)/cutoff))
    i = int(row.userId)
    j = int(row.movie_idx)
    usermovie2rating_test[(i,j)] = row.rating

In [70]:
df_test.apply(update_usermovie2rating_test,axis=1)

preprocessed: 0.255
preprocessed: 0.278
preprocessed: 0.301
preprocessed: 0.325
preprocessed: 0.348
preprocessed: 0.371
preprocessed: 0.394
preprocessed: 0.417
preprocessed: 0.440
preprocessed: 0.464
preprocessed: 0.487


4618337    None
3917868    None
5126314    None
5319645    None
2496638    None
           ... 
4246265    None
1102791    None
4319043    None
3755660    None
3009782    None
Length: 1078405, dtype: object

In [73]:
import pickle
with open('user2movie.json','wb') as f:
    pickle.dump(user2movie,f)
with open('movie2user.json','wb') as f:
    pickle.dump(movie2user,f)    
with open('usermovie2rating.json','wb') as f:
    pickle.dump(usermovie2rating,f)
with open('usermovie2rating_test.json','wb') as f:
    pickle.dump(usermovie2rating_test,f)