### Table to Dictionary
* In code, I want to ask questions like:
    * Given user i, which movies j did they rate?
    * Given movie j, which users i have rated it?
    * Given user i and movie j, what is the rating?
* Theoretically, a pandas DataFrame is like an SQL table, so we should be able to write "queries" to grab this info.
* Python dictionaries already provide key -> value lookup, so we build three of them:
    * user2movie: user ID -> list of movie IDs that user rated
    * movie2user: movie ID -> list of user IDs who rated that movie
    * usermovie2rating: (user ID, movie ID) -> rating
* Why dictionaries?
    * Looping through the full user-by-movie array would be O(MN)
    * Looping through a dictionary is O(|Ω|), where Ω is the set of ratings we actually have
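
The three lookups and the complexity argument above can be sketched on a toy example. This is a minimal illustration, not the notebook's own code: the data is made up, and the names `n_users`, `n_movies`, `ratings`, `by_user`, and `by_movie` are hypothetical stand-ins chosen so they do not clash with the variables used in the cells below.

```python
# Made-up toy data: 3 users, 4 movies, and only three observed ratings.
n_users, n_movies = 3, 4
ratings = {(0, 1): 5.0, (0, 3): 3.5, (2, 1): 4.0}  # (user, movie) -> rating

# Derive the two index dictionaries from the rating keys.
by_user, by_movie = {}, {}
for (i, j) in ratings:
    by_user.setdefault(i, []).append(j)    # user -> movies they rated
    by_movie.setdefault(j, []).append(i)   # movie -> users who rated it

movies_of_user_0 = by_user[0]       # given user 0, which movies did they rate?
users_of_movie_1 = by_movie[1]      # given movie 1, which users rated it?
rating_0_3 = ratings[(0, 3)]        # given user 0 and movie 3, what is the rating?

# Work done by each looping strategy.
dense_steps = n_users * n_movies    # visit every (user, movie) cell: O(MN)
sparse_steps = len(ratings)         # visit only observed ratings: O(|Ω|)
```

Here the dense loop would touch 12 cells while the dictionary holds only 3 entries; on a real ratings table, where most users have rated only a small fraction of the movies, the gap is much larger.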

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

In [2]:
df = pd.read_csv('./Data/very_small_rating.csv')

In [3]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,userId,movieId,rating,movie_idx
0,19846,19846,190,1,5.0,10
1,19847,19847,190,2,5.0,125
2,19851,19851,190,6,4.0,104
3,19854,19854,190,10,4.0,68
4,19855,19855,190,11,5.0,186


In [4]:
N = df.userId.max()+1
M = df.movie_idx.max()+1

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
# train test split
df = shuffle(df)
cutoff = int(0.8*len(df))
df_train = df.iloc[:cutoff]
df_test = df.iloc[cutoff:]

In [7]:
user2movie=dict()
movie2user=dict()
usermovie2rating=dict()

In [8]:
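count=0
def update_user2movie_and_movie2user(row):
    # Called once per training row; fills all three lookup dictionaries.
    global count # progress counter shared across calls
    count+=1
    if count%100000 == 0:
        # fraction of the training rows (cutoff) processed so far
        print("preprocessed: %.3f" % (float(count)/cutoff))
    i = int(row.userId)
    j = int(row.movie_idx)
    # user -> list of movies that user rated
    if i not in user2movie:
        user2movie[i]=[j]
    else:
        user2movie[i].append(j)
    # movie -> list of users who rated it
    if j not in movie2user:
        movie2user[j]=[i]
    else:
        movie2user[j].append(i)

    # (user, movie) -> rating
    usermovie2rating[(i,j)] = row.rating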

In [9]:
df_train.apply(update_user2movie_and_movie2user,axis=1)

preprocessed: 0.748


48683     None
166238    None
14850     None
121967    None
85546     None
          ... 
101473    None
33281     None
50428     None
69305     None
76651     None
Length: 133628, dtype: object
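
The same three dictionaries could also be built without a global counter or a row-wise `apply`. The following is a sketch under assumptions, not the notebook's method: `build_lookups` is a hypothetical helper, and the sample rows are illustrative stand-ins for what `zip(df_train.userId, df_train.movie_idx, df_train.rating)` would yield.

```python
from collections import defaultdict

def build_lookups(triples):
    """Build the three lookups from an iterable of (user, movie, rating) triples."""
    user2movie = defaultdict(list)
    movie2user = defaultdict(list)
    usermovie2rating = {}
    for i, j, r in triples:
        i, j = int(i), int(j)
        user2movie[i].append(j)
        movie2user[j].append(i)
        usermovie2rating[(i, j)] = r
    # Convert back to plain dicts so a missing key raises KeyError
    # instead of silently creating an empty list.
    return dict(user2movie), dict(movie2user), usermovie2rating

# Illustrative rows standing in for
# zip(df_train.userId, df_train.movie_idx, df_train.rating)
sample = [(190, 10, 5.0), (190, 125, 5.0), (7, 10, 3.0)]
u2m, m2u, um2r = build_lookups(sample)
```

Because the helper returns its results instead of mutating module-level dictionaries, it can be called separately on the training and test rows without resetting any shared state.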

In [10]:
# test ratings dictionary
usermovie2rating_test = dict()


In [11]:
count=0
def update_usermovie2rating_test(row):
    # Test set only needs the (user, movie) -> rating lookup.
    global count
    count+=1
    if count % 100000 == 0:
        # progress is reported every 100000 rows
        print("preprocessed: %.3f" % (float(count)/len(df_test)))
    i = int(row.userId)
    j = int(row.movie_idx)
    usermovie2rating_test[(i,j)] = row.rating

In [12]:
df_test.apply(update_usermovie2rating_test,axis=1)

76183     None
104245    None
157931    None
98088     None
96592     None
          ... 
117772    None
107594    None
88631     None
33300     None
21228     None
Length: 33407, dtype: object

In [13]:
import pickle
# Note: these files are written with pickle (binary), not JSON, despite the
# .json extension. Load them back with pickle.load and mode 'rb'.
with open('user2movie.json','wb') as f:
    pickle.dump(user2movie,f)
with open('movie2user.json','wb') as f:
    pickle.dump(movie2user,f)
with open('usermovie2rating.json','wb') as f:
    pickle.dump(usermovie2rating,f)
with open('usermovie2rating_test.json','wb') as f:
    pickle.dump(usermovie2rating_test,f)
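
Reading the saved files back might look like the sketch below. It writes to a temporary directory with made-up data so that it stays self-contained; in the notebook you would open the files saved above instead. It also shows one likely reason for using pickle here: the standard `json` module cannot serialise dictionaries with tuple keys.

```python
import json
import os
import pickle
import tempfile

# Made-up lookup with the same (user ID, movie ID) -> rating shape as above.
ratings = {(190, 10): 5.0, (190, 125): 5.0}

with tempfile.TemporaryDirectory() as tmp:
    # The file holds pickle data even though the name ends in .json.
    path = os.path.join(tmp, 'usermovie2rating.json')
    with open(path, 'wb') as f:
        pickle.dump(ratings, f)
    with open(path, 'rb') as f:   # pickle files must be opened in binary mode
        restored = pickle.load(f)

# json.dumps rejects tuple keys, so this dictionary cannot be dumped as-is.
try:
    json.dumps(ratings)
    json_ok = True
except TypeError:
    json_ok = False
```

Only unpickle files you created yourself or otherwise trust, since `pickle.load` can execute arbitrary code from a malicious file.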