<h1>Dataset Composition</h1>
<blockquotes>Last.FM dataset of users' listening histories</blockquotes>

<h2>Preparation</h2>
<blockquotes>Import the necessary modules and define some constants</blockquotes>

In [1]:
import os
import pandas as pd

from dotenv import load_dotenv
load_dotenv()

True

In [2]:
file_name = {
  "listening_history": "userid-timestamp-artid-artname-traid-traname.tsv",
  "user_profile": "userid-profile.tsv",
  "listening_history_parquet": "listening_history.snappy.parquet",
  "user_profile_parquet": "user_data.snappy.parquet",
}

dirname = os.getenv('DATASET_FOLDER_PATH')
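`os.getenv` returns `None` when the variable is not set, and the later `os.path.join(dirname, ...)` calls would then fail with a `TypeError`. The following is a minimal sketch of an early check. The helper name `resolve_dataset_dir` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_dataset_dir(env_var: str = "DATASET_FOLDER_PATH") -> Path:
    """Return the dataset folder named by env_var, or raise a clear error.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    value = os.getenv(env_var)
    if not value:
        # Missing or empty variable: fail here instead of inside os.path.join.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = Path(value)
    if not path.is_dir():
        raise FileNotFoundError(f"Dataset folder does not exist: {path}")
    return path
```

Calling `resolve_dataset_dir()` right after `load_dotenv()` would surface a missing `.env` entry before any file is read.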

<h2>Get Listening History Dataset</h2>
<blockquotes>Load the listening history dataset and save it as a Parquet file</blockquotes>

In [3]:
filename_listening_history = os.path.join(dirname, file_name['listening_history'])
save_listening_history = os.path.join(dirname, file_name['listening_history_parquet'])

In [4]:
df = pd.read_csv(
  filename_listening_history, 
  sep='\t', 
  header=None,
  names=[
    'user_id', 'timestamp', 'artist_id', 'artist_name', 'track_id', 'track_name'
  ],
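  # Line numbers (1-based) of rows to skip, presumably rows that fail to parse;
  # each is shifted by one because skiprows expects 0-based line indices.
  skiprows=[
    2120260-1, 2446318-1, 11141081-1, 11152099-1, 11152402-1, 11882087-1, 12902539-1, 12935044-1, 17589539-1
  ]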
)

df['timestamp'] = pd.to_datetime(df.timestamp)
df.sort_values(['user_id', 'timestamp'], ascending=True, inplace=True)

print(f'Number of Records: {len(df):,}\nUnique Users: {df.user_id.nunique()}\nUnique Artists: {df.artist_id.nunique():,}')
df.head(5)

Number of Records: 19,098,853
Unique Users: 992
Unique Artists: 107,295


Unnamed: 0,user_id,timestamp,artist_id,artist_name,track_id,track_name
16684,user_000001,2006-08-13 13:59:20+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,c4633ab1-e715-477f-8685-afa5f2058e42,The Launching Of Big Face
16683,user_000001,2006-08-13 14:03:29+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,bc2765af-208c-44c5-b3b0-cf597a646660,Zn Zero
16682,user_000001,2006-08-13 14:10:43+00:00,09a114d9-7723-4e14-b524-379697f6d2b5,Plaid & Bob Jaroc,aa9c5a80-5cbe-42aa-a966-eb3cfa37d832,The Return Of Super Barrio - End Credits
16681,user_000001,2006-08-13 14:17:40+00:00,67fb65b5-6589-47f0-9371-8a40eb268dfb,Tommy Guerrero,d9b1c1da-7e47-4f97-a135-77260f2f559d,Mission Flats
16680,user_000001,2006-08-13 14:19:06+00:00,1cfbc7d1-299c-46e6-ba4c-1facb84ba435,Artful Dodger,120bb01c-03e4-465f-94a0-dce5e9fac711,What You Gonna Do?
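The `skiprows` list in the `read_csv` call hard-codes nine line numbers, and the notebook does not say how they were found or what is wrong with those rows. One plausible way to locate such rows is to scan for lines whose tab-separated field count differs from the expected six. The sketch below rests on that assumption; the function name `find_irregular_rows` and the sample rows are made up for illustration.

```python
import io
from typing import Iterable, List


def find_irregular_rows(
    lines: Iterable[str], expected_fields: int = 6, sep: str = "\t"
) -> List[int]:
    """Return 0-based indices of lines whose field count differs from expected_fields."""
    irregular = []
    for index, line in enumerate(lines):
        if len(line.rstrip("\r\n").split(sep)) != expected_fields:
            irregular.append(index)
    return irregular


# Tiny in-memory stand-in for the TSV file (made-up rows, not real data).
sample = io.StringIO(
    "u1\t2009-01-01T00:00:00Z\ta1\tArtist\tt1\tTrack\n"
    "u1\t2009-01-01T00:05:00Z\ta1\tArtist\n"  # only four fields
    "u2\t2009-01-02T00:00:00Z\ta2\tOther\tt2\tSong\n"
)
bad_rows = find_irregular_rows(sample)
```

Because the returned indices are 0-based, they could be passed to `skiprows` directly, without the `-1` adjustment. Rows that are broken for another reason, such as unbalanced quotes, would not be caught by this check.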


In [5]:
df.to_parquet(save_listening_history, compression='snappy', index=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19098853 entries, 16684 to 19080480
Data columns (total 6 columns):
 #   Column       Dtype              
---  ------       -----              
 0   user_id      object             
 1   timestamp    datetime64[ns, UTC]
 2   artist_id    object             
 3   artist_name  object             
 4   track_id     object             
 5   track_name   object             
dtypes: datetime64[ns, UTC](1), object(5)
memory usage: 1020.0+ MB
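Five of the six columns are stored as `object`, which is a large part of why the frame is reported at over 1 GB. With only 992 unique users, `user_id` is a natural candidate for the `category` dtype. The sketch below shows the effect on a made-up miniature frame; the saving on the real data was not measured here, and whether the categorical dtype survives the Parquet round trip depends on the engine in use.

```python
import pandas as pd

# Made-up miniature frame; the real one has about 19 million rows.
demo = pd.DataFrame({
    "user_id": ["user_000001"] * 500 + ["user_000002"] * 500,
    "track_name": [f"track {i}" for i in range(1000)],
})

# deep=True counts the Python string objects, not just the pointers to them.
before = demo.memory_usage(deep=True).sum()
demo["user_id"] = demo["user_id"].astype("category")
after = demo.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```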


In [6]:
del df

<h2>Get Users Information Dataset</h2>
<blockquotes>Load the users' information dataset and save it as a Parquet file</blockquotes>

In [7]:
filename_user_profile = os.path.join(dirname, file_name['user_profile'])
save_user_profile = os.path.join(dirname, file_name['user_profile_parquet'])

In [8]:
df = pd.read_csv(
  filename_user_profile, 
  sep='\t', 
  header=0,
  names=[
    'user_id', 'gender', 'age', 'country', 'signup',
  ],
)

df.drop('signup', axis=1, inplace=True)

print(f'Number of Records: {len(df):,}')
df.head(5)

Number of Records: 992


Unnamed: 0,user_id,gender,age,country
0,user_000001,m,,Japan
1,user_000002,f,,Peru
2,user_000003,m,22.0,United States
3,user_000004,f,,
4,user_000005,m,,Bulgaria


In [9]:
df.to_parquet(save_user_profile, compression='snappy', index=False)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 992 entries, 0 to 991
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   user_id  992 non-null    object 
 1   gender   884 non-null    object 
 2   age      286 non-null    float64
 3   country  907 non-null    object 
dtypes: float64(1), object(3)
memory usage: 31.1+ KB
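The non-null counts in the `df.info()` output show that the profile columns are far from complete, `age` in particular. The sketch below turns those counts into missing-value shares. The counts are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above.
non_null = pd.Series({"user_id": 992, "gender": 884, "age": 286, "country": 907})
total_rows = 992

# Share of missing values per column, rounded to three decimals.
missing_share = (1 - non_null / total_rows).round(3)
print(missing_share)
```

On the frame itself, `df.isna().mean()` computes the same shares without copying the counts by hand.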


In [10]:
del df