### Table to Dictionary
* In code, I want to ask questions like:
    * Given user i, which movies j did they rate?
    * Given movie j, which users i have rated it?
    * Given user i and movie j, what is the rating?
* Theoretically, a pandas DataFrame is like an SQL table, so we should be able to write "queries" to grab this info.
* Python dictionaries are already a key -> value lookup
    * user2movie: user ID -> list of movie IDs rated by that user
    * movie2user: movie ID -> list of user IDs who rated that movie
    * usermovie2rating: (user ID, movie ID) -> rating
* Why dictionaries?
    * Looping through the full user-movie array would be O(NM), with N users and M movies
    * Looping through a dictionary is O(|Ω|), where Ω is the set of observed ratings
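
To make the three lookups concrete, here is a minimal sketch on a toy table. The rating values are made up for illustration, and `dict.setdefault` is used only to keep the example short.

```python
# Toy ratings as (user ID, movie ID, rating); values are made up for illustration.
ratings = [
    (0, 10, 4.0),
    (0, 11, 2.5),
    (1, 10, 5.0),
    (2, 12, 3.0),
]

user2movie = {}
movie2user = {}
usermovie2rating = {}

for i, j, r in ratings:
    user2movie.setdefault(i, []).append(j)   # user -> movies they rated
    movie2user.setdefault(j, []).append(i)   # movie -> users who rated it
    usermovie2rating[(i, j)] = r             # (user, movie) -> rating

# Given user i, which movies j did they rate?
print(user2movie[0])              # [10, 11]
# Given movie j, which users i have rated it?
print(movie2user[10])             # [0, 1]
# Given user i and movie j, what is the rating?
print(usermovie2rating[(1, 10)])  # 5.0

# Visiting every observed rating touches len(usermovie2rating) entries,
# not (number of users) x (number of movies) cells.
n_users, n_movies = len(user2movie), len(movie2user)
print(len(usermovie2rating), "observed vs", n_users * n_movies, "cells")
```

Even in this tiny case only 4 of the 9 possible user-movie cells hold a rating, and the gap grows quickly on real, sparse rating data.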

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

In [2]:
df = pd.read_csv('../Data/very_small_rating.csv')

In [3]:
df.head()

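| Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | userId | movieId | rating | movie_idx |
|---|---|---|---|---|---|---|
| 0 | 13174 | 13174 | 886 | 1 | 3.0 | 11 |
| 1 | 13175 | 13175 | 886 | 2 | 2.0 | 125 |
| 2 | 13177 | 13177 | 886 | 6 | 1.5 | 107 |
| 3 | 13180 | 13180 | 886 | 10 | 2.0 | 68 |
| 4 | 13181 | 13181 | 886 | 11 | 2.0 | 188 |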


In [4]:
# Largest ID plus one: the number of users / movies if IDs start at 0
N = df.userId.max()+1
M = df.movie_idx.max()+1

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# train test split
df = shuffle(df)
cutoff = int(0.8*len(df))
df_train = df.iloc[:cutoff]
df_test = df.iloc[cutoff:]

In [7]:
user2movie=dict()
movie2user=dict()
usermovie2rating=dict()

In [8]:
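count=0
def update_user2movie_and_movie2user(row):
    # Called once per training row; fills the three lookup dictionaries as a side effect
    global count
    count+=1
    if count%100000 == 0:
        # progress as a fraction of the training set (cutoff == len(df_train))
        print("preprocessed: %.3f" % (float(count)/cutoff))
    i = int(row.userId)
    j = int(row.movie_idx)
    # user -> list of movies they rated
    if i not in user2movie:
        user2movie[i]=[j]
    else:
        user2movie[i].append(j)
    # movie -> list of users who rated it
    if j not in movie2user:
        movie2user[j]=[i]
    else:
        movie2user[j].append(i)

    # (user, movie) -> rating
    usermovie2rating[(i,j)] = row.rating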

In [9]:
df_train.apply(update_user2movie_and_movie2user,axis=1)

preprocessed: 0.767


16247     None
133253    None
35716     None
152989    None
33912     None
          ... 
138269    None
111279    None
97022     None
111602    None
129756    None
Length: 130307, dtype: object
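
The `apply` call returns a Series of `None` because the function works only through side effects on the global dictionaries. As a possible alternative, the sketch below builds the same three lookups inside a function and uses `collections.defaultdict` to drop the `if`/`else` branches. The helper name `build_lookups` and the plain-tuple row format are assumptions for illustration, not part of the notebook; the first two rows reuse values from `df.head()` and the third is made up.

```python
from collections import defaultdict

def build_lookups(rows):
    """rows: iterable of (user_id, movie_idx, rating) tuples (hypothetical input format)."""
    user2movie = defaultdict(list)
    movie2user = defaultdict(list)
    usermovie2rating = {}
    for i, j, r in rows:
        user2movie[i].append(j)   # a missing key starts as an empty list
        movie2user[j].append(i)
        usermovie2rating[(i, j)] = r
    # Convert back to plain dicts so a missing key raises KeyError later
    return dict(user2movie), dict(movie2user), usermovie2rating

rows = [(886, 11, 3.0), (886, 125, 2.0), (5, 11, 4.5)]
u2m, m2u, um2r = build_lookups(rows)
print(u2m)   # {886: [11, 125], 5: [11]}
print(m2u)   # {11: [886, 5], 125: [886]}
```

With the DataFrame used here, the rows could plausibly come from `df_train[['userId', 'movie_idx', 'rating']].itertuples(index=False)`, though that variant has not been run against this dataset.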

In [10]:
# test ratings dictionary
usermovie2rating_test = dict()


In [11]:
count=0
def update_usermovie2rating_test(row):
    global count
    count+=1
    if count % 100000 == 0:
        print("preprocessed: %.3f" % (float(count)/len(df_test)))
    i = int(row.userId)
    j = int(row.movie_idx)
    usermovie2rating_test[(i,j)] = row.rating

In [13]:
df_test.apply(update_usermovie2rating_test,axis=1)

47257     None
161728    None
160769    None
47917     None
54546     None
          ... 
54575     None
91244     None
95011     None
35794     None
25206     None
Length: 32577, dtype: object

In [15]:
usermovie2rating_test

{(389, 195): 4.5,
 (278, 129): 2.0,
 (192, 136): 4.0,
 (593, 29): 4.5,
 (801, 141): 3.5,
 (887, 176): 2.5,
 (353, 160): 4.0,
 (771, 103): 4.0,
 (836, 6): 3.5,
 (700, 139): 4.5,
 (864, 175): 2.0,
 (520, 60): 4.0,
 (523, 94): 2.0,
 (109, 115): 3.0,
 (351, 112): 4.0,
 (635, 20): 2.0,
 (285, 20): 4.0,
 (804, 30): 4.5,
 (112, 101): 2.0,
 (68, 95): 5.0,
 (279, 196): 4.0,
 (98, 0): 5.0,
 (781, 154): 5.0,
 (368, 31): 4.0,
 (541, 69): 4.0,
 (953, 78): 5.0,
 (149, 75): 3.5,
 (170, 35): 4.5,
 (389, 23): 5.0,
 (812, 34): 4.0,
 (690, 85): 3.0,
 (648, 54): 4.5,
 (304, 132): 2.5,
 (678, 136): 4.0,
 (907, 107): 3.0,
 (162, 116): 5.0,
 (5, 6): 4.5,
 (994, 92): 4.0,
 (623, 93): 1.0,
 (902, 24): 4.0,
 (707, 115): 4.0,
 (177, 90): 3.5,
 (315, 164): 3.5,
 (185, 108): 4.0,
 (607, 37): 2.0,
 (846, 9): 2.0,
 (853, 194): 3.0,
 (300, 193): 5.0,
 (130, 147): 4.5,
 (271, 133): 5.0,
 (39, 162): 4.0,
 (827, 42): 4.0,
 (24, 186): 4.0,
 (612, 93): 5.0,
 (193, 159): 3.5,
 (175, 20): 4.0,
 (888, 163): 3.0,
 (629, 156):

In [17]:
import pickle
# Note: these files are pickle binaries despite the .json extension
with open('user2movie.json','wb') as f:
    pickle.dump(user2movie,f)
with open('movie2user.json','wb') as f:
    pickle.dump(movie2user,f)
with open('usermovie2rating.json','wb') as f:
    pickle.dump(usermovie2rating,f)
with open('usermovie2rating_test.json','wb') as f:
    pickle.dump(usermovie2rating_test,f)
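
The saved files have to be read back with `pickle.load` in binary mode. The minimal sketch below shows the round trip, and also why plain `json` would not work directly for the rating dictionaries: their keys are tuples. The temporary path and the toy values are assumptions for illustration.

```python
import json
import os
import pickle
import tempfile

usermovie2rating = {(886, 11): 3.0, (886, 125): 2.0}  # toy values

# json.dumps accepts only str, int, float, bool or None as keys, so tuples fail.
try:
    json.dumps(usermovie2rating)
except TypeError as e:
    print("json failed:", e)

# pickle round-trips the tuple keys unchanged.
path = os.path.join(tempfile.mkdtemp(), "usermovie2rating.json")
with open(path, "wb") as f:
    pickle.dump(usermovie2rating, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == usermovie2rating)  # True
```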