This notebook downloads the MovieLens 100K and LastFM 1K User datasets in their raw form and preprocesses them into `[user_id, item_id]` pairs, in which the rows of each `user_id` are grouped together and that user's `item_id` values are sorted by ascending timestamp.
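
As a minimal sketch of the target format, the toy example below sorts a few interactions by user and then by timestamp, and keeps only the `[user_id, item_id]` columns. The values are made up for illustration and do not come from either dataset.

```python
import pandas as pd

# Toy interactions; the values are illustrative, not taken from either dataset.
toy = pd.DataFrame({
    'user_id':   [2, 1, 2, 1],
    'item_id':   [10, 30, 20, 40],
    'timestamp': [500, 300, 100, 200],
})

# Group each user's rows together and order them by ascending timestamp.
toy_sorted = toy.sort_values(by=['user_id', 'timestamp'])

# Keep only the [user_id, item_id] pairs.
pairs = toy_sorted[['user_id', 'item_id']].values.tolist()
print(pairs)
```

Each user's items now appear in the order in which that user interacted with them.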

In [4]:
import os
import time
import copy
import random
import zipfile, tarfile
import pandas as pd
import numpy as np

# MovieLens 100K Dataset

## Load Raw Data

In [1]:
!wget https://files.grouplens.org/datasets/movielens/ml-100k.zip

--2024-05-13 03:48:45--  https://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2024-05-13 03:48:47 (5.70 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]



## To DataFrame

In [2]:
def read_data_ml100k():
    # Extract the archive next to the zip file (the current directory here)
    base_dir = os.path.dirname('ml-100k.zip') or '.'
    data_dir, ext = os.path.splitext('ml-100k.zip')
    with zipfile.ZipFile('ml-100k.zip', 'r') as fp:
        fp.extractall(base_dir)
    # u.data is tab-separated and has no header row
    names = ['user_id', 'item_id', 'rating', 'timestamp']
    data = pd.read_csv(os.path.join(data_dir, 'u.data'), sep='\t',
                       names=names, engine='python')
    num_users = data.user_id.nunique()
    num_items = data.item_id.nunique()
    return data, num_users, num_items

In [5]:
data, num_users, num_items = read_data_ml100k()
sparsity = 1 - len(data) / (num_users * num_items)
print(f'number of users: {num_users}, number of items: {num_items}')
print(f'matrix sparsity: {sparsity:f}')
print(data.head(5))

number of users: 943, number of items: 1682
matrix sparsity: 0.936953
   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596
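
The sparsity figure above can be checked by hand. The sketch below repeats the calculation with the user and item counts printed above and the 100,000 ratings that MovieLens 100K contains.

```python
# User and item counts are those printed above; MovieLens 100K has 100,000 ratings.
num_ratings = 100_000
num_users, num_items = 943, 1682

# Fraction of user-item pairs that have no rating.
sparsity = 1 - num_ratings / (num_users * num_items)
print(f'{sparsity:f}')
```

Roughly 93.7% of all possible user-item pairs have no rating.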


## Sort DataFrame and Save

Sort the rows by `user_id` and then by `timestamp`.

Save the (`user_id`, `item_id`) pairs to a tab-separated .txt file for reuse.

File path: `data/movielens_100k.txt`

In [7]:
data_name = 'movielens_100k'
# Map raw ids to contiguous integers starting at 1; sorting keeps the mapping deterministic
user_map = {user: u + 1 for u, user in enumerate(sorted(data['user_id'].unique()))}
item_map = {item: i + 1 for i, item in enumerate(sorted(data['item_id'].unique()))}

# Remap the ids, then sort by user_id and, within each user, by timestamp
data["user_id"] = data["user_id"].map(user_map)
data["item_id"] = data["item_id"].map(item_map)
data = data.sort_values(by=["user_id", "timestamp"])

# Keep only the user_id and item_id columns for the saved dataset
data_to_file = data.drop(columns=['timestamp', 'rating'])
os.makedirs('data', exist_ok=True)
data_to_file.to_csv(os.path.join('data', data_name + '.txt'), sep="\t", header=False, index=False)
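
For reuse, a file in this layout can be read back into per-user item sequences. The sketch below is one possible loader, not part of the original notebook: `load_sequences` is a hypothetical helper, and the demo runs on a small stand-in file with made-up rows rather than on `data/movielens_100k.txt`.

```python
import csv
import os
import tempfile
from collections import defaultdict

def load_sequences(path):
    """Read tab-separated (user_id, item_id) rows into per-user item lists."""
    sequences = defaultdict(list)
    with open(path, newline='') as f:
        for user_id, item_id in csv.reader(f, delimiter='\t'):
            # Rows are already in time order, so appending preserves the sequence.
            sequences[int(user_id)].append(int(item_id))
    return dict(sequences)

# Demo on a small stand-in file (made-up rows) written in the same layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'w', newline='') as f:
        f.write('1\t40\n1\t30\n2\t20\n')
    sequences = load_sequences(demo_path)

print(sequences)
```

Because the file was written after sorting by `user_id` and `timestamp`, each list is in chronological order without needing the timestamps themselves.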

# LastFM 1K User Dataset

This is a much larger dataset: the archive is 642 MB and contains about 19 million listening records.

## Load Raw Data

In [9]:
!wget http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz

--2024-05-13 04:13:46--  http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz
Resolving mtg.upf.edu (mtg.upf.edu)... 84.89.139.55
Connecting to mtg.upf.edu (mtg.upf.edu)|84.89.139.55|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 672741554 (642M) [application/octet-stream]
Saving to: ‘lastfm-dataset-1K.tar.gz’


2024-05-13 04:16:06 (4.61 MB/s) - ‘lastfm-dataset-1K.tar.gz’ saved [672741554/672741554]



## To DataFrame and Sort

As with MovieLens, the records are sorted by `user_id` and then by `timestamp`.

In [10]:
base_dir = os.path.dirname('lastfm-dataset-1K.tar.gz') or '.'
with tarfile.open('lastfm-dataset-1K.tar.gz') as file:
    file.extractall(base_dir)
# Strip both extensions (.gz, then .tar) to get the extracted directory name
data_dir, ext = os.path.splitext('lastfm-dataset-1K.tar.gz')
data_dir, ext = os.path.splitext(data_dir)
filepath = os.path.join(data_dir, 'userid-timestamp-artid-artname-traid-traname.tsv')
df = pd.read_csv(
    filepath, sep='\t', header=None,
    names=[
        'user_id', 'timestamp', 'artist_id', 'artist_name', 'track_id', 'track_name'
    ],
    # skiprows takes 0-based line indices, hence the -1 on each 1-based line number
    skiprows=[
        2120260-1, 2446318-1, 11141081-1,
        11152099-1, 11152402-1, 11882087-1,
        12902539-1, 12935044-1, 17589539-1
    ]
)
df["timestamp"] = pd.to_datetime(df.timestamp)
df.sort_values(['user_id', 'timestamp'], ascending=True, inplace=True)
print(f'Number of Records: {len(df):,}\nUnique Users: {df.user_id.nunique()}\nUnique Artist:{df.artist_id.nunique():,}')
df.head(5)

Number of Records: 19,098,853
Unique Users: 992
Unique Artist:107,295


| | user_id | timestamp | artist_id | artist_name | track_id | track_name |
|---|---|---|---|---|---|---|
| 16684 | user_000001 | 2006-08-13 13:59:20+00:00 | 09a114d9-7723-4e14-b524-379697f6d2b5 | Plaid & Bob Jaroc | c4633ab1-e715-477f-8685-afa5f2058e42 | The Launching Of Big Face |
| 16683 | user_000001 | 2006-08-13 14:03:29+00:00 | 09a114d9-7723-4e14-b524-379697f6d2b5 | Plaid & Bob Jaroc | bc2765af-208c-44c5-b3b0-cf597a646660 | Zn Zero |
| 16682 | user_000001 | 2006-08-13 14:10:43+00:00 | 09a114d9-7723-4e14-b524-379697f6d2b5 | Plaid & Bob Jaroc | aa9c5a80-5cbe-42aa-a966-eb3cfa37d832 | The Return Of Super Barrio - End Credits |
| 16681 | user_000001 | 2006-08-13 14:17:40+00:00 | 67fb65b5-6589-47f0-9371-8a40eb268dfb | Tommy Guerrero | d9b1c1da-7e47-4f97-a135-77260f2f559d | Mission Flats |
| 16680 | user_000001 | 2006-08-13 14:19:06+00:00 | 1cfbc7d1-299c-46e6-ba4c-1facb84ba435 | Artful Dodger | 120bb01c-03e4-465f-94a0-dce5e9fac711 | What You Gonna Do? |
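
The `skiprows` list in the cell above drops specific lines of the raw file by position; the notebook does not say why, but presumably those lines are malformed. As a small sketch of how positional skipping behaves, the example below uses made-up TSV content in which file line 3 (counting from 1) stands in for a bad record. The column names here are illustrative only.

```python
import io
import pandas as pd

# Made-up TSV content; file line 3 (1-based) stands in for a malformed record.
raw = 'u1\ta\nu2\tb\nBROKEN LINE\nu3\tc\n'

# skiprows takes 0-based line indices, so 1-based line 3 is passed as 3 - 1.
toy = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                  names=['user_id', 'item'], skiprows=[3 - 1])
print(toy['user_id'].tolist())
```

Only the three well-formed rows remain, which matches the `N-1` pattern used for the LastFM file.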


## String to Int, Remove Redundant Info

Convert the string identifiers to integer ids.

In [11]:
df['new_user_id'] = df.groupby(['user_id']).ngroup()
df['new_artist_id'] = df.groupby(['artist_id']).ngroup()
df['new_artist_name'] = df.groupby(['artist_name']).ngroup()
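
To illustrate what `groupby(...).ngroup()` produces, the sketch below applies it to a toy frame. The artist names are borrowed from the preview above, but the rows themselves are made up.

```python
import pandas as pd

# Toy frame; artist names are borrowed from the preview, the rows are made up.
toy = pd.DataFrame({
    'artist_name': ['Plaid', 'Artful Dodger', 'Plaid', 'Tommy Guerrero'],
})

# ngroup() numbers the groups 0, 1, 2, ... in sorted key order,
# so every row with the same artist_name gets the same integer.
toy['new_artist_name'] = toy.groupby('artist_name').ngroup()
print(toy['new_artist_name'].tolist())
```

These ids start at 0, whereas the MovieLens cell above assigns ids starting at 1.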

Users interact with both artists and tracks. There are far more tracks than artists, an artist usually has a consistent style, and tracks can be viewed as a sub-category under their artist, so we keep only the user-artist interactions.

In [12]:
df = df[['new_user_id', 'new_artist_name']]

## Save to File

In [13]:
data_name = 'lastfm_1k_user'
os.makedirs('data', exist_ok=True)
df.to_csv(os.path.join('data', data_name + '.txt'), sep="\t", header=False, index=False)