## Download the MovieLens 1M dataset (https://grouplens.org/datasets/movielens/)

In [1]:
!mkdir -p data
!wget -q http://www.grouplens.org/system/files/ml-1m.zip
!unzip -o ml-1m.zip -d data

Archive:  ml-1m.zip
   creating: data/ml-1m/
  inflating: data/ml-1m/movies.dat   
  inflating: data/ml-1m/ratings.dat  
  inflating: data/ml-1m/README       
  inflating: data/ml-1m/users.dat    


## Transform ml-1m dataset into Matrix Market Form

If you are not familiar with the Matrix Market (MM) format, see [this overview](http://networkrepository.com/mtx-matrix-market-format.html).

For more on how Buffalo handles data, see the [Buffalo database documentation](https://buffalo-recsys.readthedocs.io/en/latest/intro.html#database).
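
As a quick illustration of the format, the sketch below writes a tiny sparse matrix with `mmwrite` to an in-memory buffer and prints the resulting text. The matrix values are made up for the example. The text starts with a `%%MatrixMarket` header line, may contain `%` comment lines, then gives a size line (`rows cols nonzeros`), followed by one `row col value` entry per line with 1-based indices.

```python
import io

import numpy as np
from scipy.io import mmread, mmwrite
from scipy.sparse import csr_matrix

# Toy ratings matrix (made-up values): 2 users x 3 items, 3 nonzero entries.
toy = csr_matrix(np.array([[5, 0, 3],
                           [0, 4, 0]]))

# mmwrite accepts a file path or a binary file-like object.
buf = io.BytesIO()
mmwrite(buf, toy)
mm_text = buf.getvalue().decode("ascii")
print(mm_text)

# Reading the same bytes back recovers the original matrix.
restored = mmread(io.BytesIO(buf.getvalue()))
assert (restored.toarray() == toy.toarray()).all()
```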

In [2]:
import numpy as np
import pandas as pd
from scipy.io import mmwrite
from scipy.io import mmread
from scipy.sparse import csr_matrix

In [3]:
ratings = pd.read_csv("data/ml-1m/ratings.dat", header=None, sep="::", engine='python')
ratings.columns = ["uid", "iid", "rating", "timestamp"]

In [4]:
movies = pd.read_csv('data/ml-1m/movies.dat', header=None, sep="::", engine='python')
movies.columns = ['iid', 'movie_name', 'genre']

Buffalo does not support item IDs (iid) that contain spaces or non-ASCII characters.

Therefore, we replace spaces with underscores and drop any non-ASCII characters from the movie names.

In [5]:
def parse_moviename(movie_name):
    return movie_name.replace(' ', '_').encode('utf-8').decode('ascii', 'ignore')
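
To make the effect of `parse_moviename` concrete, here is a small sketch that repeats the function and runs it on a few made-up titles (they are illustrative and not taken from `movies.dat`). Note that non-ASCII characters are dropped rather than transliterated, so two different titles could in principle collapse to the same ID.

```python
def parse_moviename(movie_name):
    # Same logic as the cell above: spaces -> underscores, non-ASCII dropped.
    return movie_name.replace(' ', '_').encode('utf-8').decode('ascii', 'ignore')

# "\u00e9" is the precomposed character e-acute.
plain = parse_moviename("Toy Story (1995)")
accented = parse_moviename("Am\u00e9lie (2001)")
print(plain)     # Toy_Story_(1995)
print(accented)  # Amlie_(2001)

# Dropping characters can make distinct titles collide.
assert parse_moviename("Caf\u00e9 (2000)") == parse_moviename("Caf (2000)")
```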

In [6]:
iid_to_movie_name = dict(zip(movies.iid.tolist(), movies.movie_name.tolist()))
iid_to_movie_name = {iid: parse_moviename(movie_name) for (iid, movie_name) in iid_to_movie_name.items()}

In [7]:
uid_to_idx = {uid: idx for (idx, uid) in enumerate(ratings.uid.unique().tolist())}
iid_to_idx = {iid: idx for (idx, iid) in enumerate(ratings.iid.unique().tolist())}
idx_to_movie_name = {idx:iid_to_movie_name[iid] for (iid, idx) in iid_to_idx.items()}

In [8]:
print("Examples of movie names\n")

for i in range(30, 35):
    print("[index %d] movie_name: %s" % (i, idx_to_movie_name[i]))

Examples of movie names

[index 30] movie_name: Antz_(1998)
[index 31] movie_name: Girl,_Interrupted_(1999)
[index 32] movie_name: Hercules_(1997)
[index 33] movie_name: Aladdin_(1992)
[index 34] movie_name: Mulan_(1998)


In [9]:
row, col, dat = ratings.uid.tolist(), ratings.iid.tolist(), ratings.rating.tolist()
row = [uid_to_idx[r] for r in row]
col = [iid_to_idx[c] for c in col]

In [10]:
train_matrix = csr_matrix((dat, (row,col)), shape=(1 + np.max(row), 1 + np.max(col)))

In [11]:
print(train_matrix.shape)

(6040, 3706)


#### To write the CSR matrix in Matrix Market format, we use `mmwrite` (Matrix Market write)

In [12]:
mmwrite('data/ml-1m/main', train_matrix)

In [13]:
with open("data/ml-1m/uid", "w") as f:
    for uid in uid_to_idx:
        print(uid, file=f)

with open("data/ml-1m/iid", "w") as f:
    # Keys are column indices in ascending order, so line k names the item in column k.
    for idx, movie_name in idx_to_movie_name.items():
        print(movie_name, file=f)
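
Before handing these files to Buffalo, it can be worth checking that they agree with each other. The sketch below reads the matrix dimensions from the file header with `scipy.io.mminfo` and compares them with the line counts of the ID files. `check_id_files` and `count_lines` are hypothetical helpers written for this example, not part of Buffalo. The demo runs on a made-up toy matrix in a temporary directory so it does not depend on the downloaded data.

```python
import os
import tempfile

from scipy.io import mminfo, mmwrite
from scipy.sparse import csr_matrix


def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)


def check_id_files(main_path, uid_path, iid_path):
    """Hypothetical helper: compare matrix dimensions with ID file lengths."""
    # Assumption: if the bare path does not exist, mmwrite appended ".mtx".
    if not os.path.exists(main_path):
        main_path = main_path + ".mtx"
    rows, cols, entries, *_ = mminfo(main_path)
    n_uid, n_iid = count_lines(uid_path), count_lines(iid_path)
    assert rows == n_uid, f"{rows} matrix rows but {n_uid} uid lines"
    assert cols == n_iid, f"{cols} matrix columns but {n_iid} iid lines"
    return rows, cols, entries


# Self-contained demo on a toy matrix (made-up data).
with tempfile.TemporaryDirectory() as tmp:
    main = os.path.join(tmp, "main.mtx")
    mmwrite(main, csr_matrix([[5, 0, 3], [0, 4, 0]]))
    for name, ids in (("uid", ["u1", "u2"]), ("iid", ["a", "b", "c"])):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("\n".join(ids) + "\n")
    demo_result = check_id_files(
        main, os.path.join(tmp, "uid"), os.path.join(tmp, "iid")
    )

print(demo_result)
```

In this notebook the call would be `check_id_files("data/ml-1m/main", "data/ml-1m/uid", "data/ml-1m/iid")`, and the first two returned values should match the shape printed earlier.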

## Transform ml-1m dataset into Stream format

The stream file format used in Buffalo consists of lines, each a space-delimited list of items.

Each line is the list of items that one user interacted with, ordered by time.

This is useful when the order of interactions matters (e.g., word2vec, Cofactor).

See `2. Cofactor` or `3. Word2vec` for cases where stream format data is used.

For more on the stream format, see the [Buffalo database documentation](https://buffalo-recsys.readthedocs.io/en/latest/intro.html#database).
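
The idea can be sketched without pandas on a toy interaction log. The events below are made up for illustration and are not real MovieLens data. The sketch sorts by timestamp, collects each user's items in that order, and joins them with spaces, which is why item IDs must not contain spaces.

```python
from collections import defaultdict

# Made-up interaction log: (user, item, timestamp).
events = [
    ("u1", "Toy_Story_(1995)", 30),
    ("u2", "Fargo_(1996)", 10),
    ("u1", "Fargo_(1996)", 10),
    ("u1", "Big_(1988)", 20),
]

# Sort by time, then append each item to its user's sequence.
per_user = defaultdict(list)
for user, item, _ in sorted(events, key=lambda e: e[2]):
    per_user[user].append(item)

# One line per user, items separated by single spaces.
stream_lines = [" ".join(items) for _, items in sorted(per_user.items())]
for line in stream_lines:
    print(line)
```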

In [14]:
ratings_as_list = ratings.sort_values(by='timestamp').groupby('uid').iid.apply(list).reset_index()
uid = ratings_as_list.uid.tolist()
seen_iids = ratings_as_list.iid.tolist()

In [15]:
seen_iids = [' '.join([iid_to_movie_name[iid] for iid in iids]) for iids in seen_iids]

In [16]:
print(seen_iids[0])

Girl,_Interrupted_(1999) Titanic_(1997) Back_to_the_Future_(1985) Cinderella_(1950) Meet_Joe_Black_(1998) Last_Days_of_Disco,_The_(1998) Erin_Brockovich_(2000) To_Kill_a_Mockingbird_(1962) Christmas_Story,_A_(1983) Star_Wars:_Episode_IV_-_A_New_Hope_(1977) Wallace_&_Gromit:_The_Best_of_Aardman_Animation_(1996) One_Flew_Over_the_Cuckoo's_Nest_(1975) Wizard_of_Oz,_The_(1939) Fargo_(1996) Run_Lola_Run_(Lola_rennt)_(1998) Rain_Man_(1988) Saving_Private_Ryan_(1998) Awakenings_(1990) Gigi_(1958) Sound_of_Music,_The_(1965) Driving_Miss_Daisy_(1989) Mary_Poppins_(1964) Bambi_(1942) Apollo_13_(1995) E.T._the_Extra-Terrestrial_(1982) My_Fair_Lady_(1964) Ben-Hur_(1959) Big_(1988) Dead_Poets_Society_(1989) Sixth_Sense,_The_(1999) James_and_the_Giant_Peach_(1996) Ferris_Bueller's_Day_Off_(1986) Secret_Garden,_The_(1993) Toy_Story_2_(1999) Airplane!_(1980) Dumbo_(1941) Pleasantville_(1998) Princess_Bride,_The_(1987) Snow_White_and_the_Seven_Dwarfs_(1937) Miracle_on_34th_Street_(1947) Ponette_(1996) 

In [17]:
with open("data/ml-1m/stream", "w") as f:
    for iid_list in seen_iids:
        print(iid_list, file=f)