## Processing MusicMicro Dataset for JUST

## MusicMicro Dataset

This page contains the MusicMicro 11.11-09.12 data set; a [paper](http://www.cp.jku.at/people/schedl/Research/Publications/pdf/schedl_ecir_2013.pdf) describing it was accepted at ECIR 2013.

The data set contains listening histories inferred from microblogs. Each listening event identified via twitter-id and user-id is annotated with temporal (month and weekday) and spatial (longitude, latitude, country, and city) information. In addition, pointers to artist and track are provided as a matter of course.

- **listening_data.txt**	-> twitter-id user-id month weekday longitude latitude country-id city-id artist-id track-id
- **artist_mapping.txt**	-> artist-id artist
- **track_mapping.txt**	-> track-id track
- **country_mapping.txt**	-> country-id country
- **city_mapping.txt**	-> city-id city
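
As a minimal sketch of how the mapping files relate to the listening data, the snippet below resolves artist IDs to artist names with a join. The two rows are copied from the outputs shown further down this page; the variable names (toy_listening, toy_artist_mapping, resolved) are illustrative only and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy rows copied from the outputs on this page (IDs kept as strings)
toy_listening = pd.DataFrame({
    'user_id': ['74717431', '127821914'],
    'artist_id': ['450514', '202085'],
})
toy_artist_mapping = pd.DataFrame({
    'artist-id': ['450514', '202085'],
    'artist': ['Tihuana', 'James Morrison'],
})

# A left join keeps every listening event, even when an ID has no mapping entry
resolved = toy_listening.merge(
    toy_artist_mapping, left_on='artist_id', right_on='artist-id', how='left'
)
print(resolved[['user_id', 'artist']])
```

The same pattern applies to the track, city, and country mappings.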

### Step 1: Loading from .txt files into Pandas DataFrames

Pandas DataFrames are tables of data that can be created from [many input sources](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html), such as [CSV files](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [SQL databases](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html).

In [1]:
import pandas as pd

#Set correct directory paths to files
listening_filename = '../musicmicro/listening_data.txt'
artist_filename = '../musicmicro/artist_mapping.txt'
track_filename = '../musicmicro/track_mapping.txt'
city_filename = '../musicmicro/city_mapping.txt'
country_filename = '../musicmicro/country_mapping.txt'

# Read the four mapping files (artist, track, city, country) and create the node DataFrames
artists_df = pd.read_csv(artist_filename, delimiter='\t', dtype = str, header=0, encoding = 'ISO-8859-1')
tracks_df = pd.read_csv(track_filename, delimiter='\t', dtype = str, header=0, encoding = 'ISO-8859-1')
cities_df = pd.read_csv(city_filename, delimiter='\t', dtype = str, header=0, encoding = 'ISO-8859-1')
countries_df = pd.read_csv(country_filename, delimiter='\t', dtype = str, header=0, encoding = 'ISO-8859-1')

# Read the listening_data.txt file and create Edge Dataframes
column_names = [
    'twitter_id', 'user_id', 'month', 'weekday', 
    'longitude', 'latitude', 'country_id', 'city_id', 
    'artist_id', 'track_id'
]

listening_df = pd.read_csv(listening_filename, names=column_names, delimiter='\t', dtype={'user_id': str, 'city_id': str, 'country_id': str, 'artist_id': str, 'track_id': str}, encoding = 'ISO-8859-1')
# Drop the first row: with names= given, the file's own header line is read as data
listening_df = listening_df.drop(listening_df.index[0])

# Remove unnecessary columns in Dataframes as needed
listening_df_keep = ['twitter_id', 'user_id', 'month', 'weekday',  'country_id', 'city_id', 'artist_id', 'track_id']

listening_df = listening_df[listening_df_keep]

# Display the first few rows
display(listening_df)
print("\n")
print(artists_df)
print("\n")
print(tracks_df)
print("\n")
print(cities_df)
print("\n")
print(countries_df)



Unnamed: 0,twitter_id,user_id,month,weekday,country_id,city_id,artist_id,track_id
1,134243699369590784,74717431,11,2,0,0,450514,7748381
2,134243700380401664,127821914,11,2,1,1,202085,3529910
3,134243869201154048,174194590,11,2,2,2,330061,5762915
4,134244034020524032,141847381,11,2,1,3,404350,6987845
5,134244371557122048,87215499,11,2,3,4,227460,4082536
...,...,...,...,...,...,...,...,...
594302,250849073052143616,134108604,9,2,3,548,283679,5020296
594303,250849250706092032,219099036,9,2,3,5091,68747,1224442
594304,250849785316249600,192959967,9,2,2,2700,152498,2592493
594305,250849797165182976,226486664,9,2,1,1381,375884,6536720




      artist-id                  artist
0        450514                 Tihuana
1        202085          James Morrison
2        330061           Pet Shop Boys
3        404350                   Suede
4        227460                 Kaskade
...         ...                     ...
19524    190059                    ILYA
19525    287958          Milla Jovovich
19526    269025            Manu Dibango
19527    340887  Prolyphic & Reanimator
19528     64964        California Wives

[19529 rows x 2 columns]


      track-id                     track
0      1692316             Avec le temps
1      8030594                Wild Night
2      1201495  Le plus beau du quartier
3      6129176        Probably a Robbery
4      9474108     Sadness Is a Blessing
...        ...                       ...
71405  8535638                     Bebas
71406  8276352                     Rondo
71407  6118830                 Levemente
71408    68841    Running With the Light
71409  5582091       Surviving Disaster

### Creating Remaining Node and Edge Dataframes

Because 'listening_data.txt' records every listening event linking users, tracks, artists, cities, and countries, we need to extract the edge lists from its DataFrame. The artist, track, city, and country node DataFrames are already loaded; the user node DataFrame still has to be built.


In [2]:
# Assuming you have already loaded listening_df, artists_df, and tracks_df

# Create a DataFrame of unique user IDs
users_df = pd.DataFrame(listening_df['user_id'].unique(), columns=['user_id'])

# Optionally, reset the index for cleanliness
users_df.reset_index(drop=True, inplace=True)

print(f"Number of unique user IDs: {len(users_df)}")

Number of unique user IDs: 136867


We also need the following edge DataFrames:

**Edge DFs**
- user_track_df
- track_artist_df
- user_city_df
- city_country_df

In [3]:

# Creating user_track_df
user_track_df = listening_df[['user_id', 'track_id']].drop_duplicates()

# Creating track_artist_df
track_artist_df = listening_df[['track_id', 'artist_id']].drop_duplicates()

# Creating user_city_df
user_city_df = listening_df[['user_id', 'city_id']].drop_duplicates()

# Creating city_country_df
city_country_df = listening_df[['city_id', 'country_id']].drop_duplicates()

# Display the first few rows for verification
print("\nUser-Track DataFrame:")
display(user_track_df)
print("\nTrack-Artist DataFrame:")
display(track_artist_df)
print("\nUser-City DataFrame:")
display(user_city_df)
print("\nCity-Country DataFrame:")
display(city_country_df)



User-Track DataFrame:


Unnamed: 0,user_id,track_id
1,74717431,7748381
2,127821914,3529910
3,174194590,5762915
4,141847381,6987845
5,87215499,4082536
...,...,...
594302,134108604,5020296
594303,219099036,1224442
594304,192959967,2592493
594305,226486664,6536720



Track-Artist DataFrame:


Unnamed: 0,track_id,artist_id
1,7748381,450514
2,3529910,202085
3,5762915,330061
4,6987845,404350
5,4082536,227460
...,...,...
594283,4175404,232731
594297,4949647,279207
594298,9530975,301871
594303,1224442,68747



User-City DataFrame:


Unnamed: 0,user_id,city_id
1,74717431,0
2,127821914,1
3,174194590,2
4,141847381,3
5,87215499,4
...,...,...
594292,34778757,42
594295,222512324,214
594296,240531876,549
594298,70563820,1907



City-Country DataFrame:


Unnamed: 0,city_id,country_id
1,0,0
2,1,1
3,2,2
4,3,1
5,4,3
...,...,...
594098,20717,3
594156,20718,9
594172,20719,5
594182,20720,3


Please note that each user can be associated with multiple cities. This could occur if users are moving and listening from different locations over time, as shown below:

In [4]:
user_city_counts = listening_df.groupby('user_id')['city_id'].nunique()
user_city_counts.sort_values(ascending=False).head()

user_id
99960172     140
261816444     34
178114579     32
115730516     28
15276542      28
Name: city_id, dtype: int64
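
If a single city per user were preferred, one option (not what this notebook does) would be to keep each user's most frequent city. The sketch below uses made-up toy data; the names toy_listening, city_counts, and home_city_df are hypothetical.

```python
import pandas as pd

# Hypothetical toy listening events: user 'A' listens mostly from city '1'
toy_listening = pd.DataFrame({
    'user_id': ['A', 'A', 'A', 'B'],
    'city_id': ['1', '1', '2', '3'],
})

# Count events per (user, city) pair, most frequent city first within each user
city_counts = (
    toy_listening.groupby(['user_id', 'city_id'])
    .size()
    .reset_index(name='n_events')
    .sort_values(['user_id', 'n_events'], ascending=[True, False])
)

# Keep only the first (most frequent) city for each user
home_city_df = (
    city_counts.drop_duplicates('user_id')[['user_id', 'city_id']]
    .reset_index(drop=True)
)
print(home_city_df)
```

Keeping all user-city pairs, as the notebook does, preserves more of the graph structure at the cost of a less literal 'lives in' relationship.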

In [5]:
duplicates_before_prefix = city_country_df.duplicated(keep=False)
print(f"Duplicates before adding prefix: {duplicates_before_prefix.any()}")
if duplicates_before_prefix.any():
    print(city_country_df[duplicates_before_prefix])

Duplicates before adding prefix: False


### Step 2: Defining Node Data

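Assuming the IDs within each node type are unique, we can use them to index that node type's DataFrame. Each ID also gets a type prefix ('a', 't', 'u', 'ci', 'co') so that IDs from different node types cannot collide. We also remove the name columns from each DataFrame, as StellarGraph accepts only numeric node data.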
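
To illustrate why a per-type prefix matters, the sketch below shows that raw city and country IDs overlap (both start at 0 in the outputs above), while prefixed IDs do not. The toy ID values are illustrative.

```python
import pandas as pd

# Raw IDs overlap across node types, e.g. city '0' and country '0'
toy_city_ids = pd.Index(['0', '1', '2'])
toy_country_ids = pd.Index(['0', '1', '2'])
raw_overlap = len(toy_city_ids.intersection(toy_country_ids))

# Prefixing by node type ('ci' for city, 'co' for country) makes the ID sets disjoint
prefixed_cities = toy_city_ids.map(lambda x: f'ci{x}')
prefixed_countries = toy_country_ids.map(lambda x: f'co{x}')
prefixed_overlap = len(prefixed_cities.intersection(prefixed_countries))

print(f"Overlap before prefixing: {raw_overlap}, after prefixing: {prefixed_overlap}")
```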

In [6]:
# Copy original node DataFrames
processed_artists_df = artists_df.copy()
processed_tracks_df = tracks_df.copy()
processed_users_df = users_df.copy()
processed_cities_df = cities_df.copy()
processed_countries_df = countries_df.copy()

# Copy original edge DataFrames
processed_user_track_df = user_track_df.copy()
processed_track_artist_df = track_artist_df.copy()
processed_user_city_df = user_city_df.copy()
processed_city_country_df = city_country_df.copy()

# Apply processing steps
# Set ID columns as index and add prefix in Node Dataframes
processed_artists_df.set_index('artist-id', inplace=True)
processed_artists_df.index = processed_artists_df.index.map(lambda x: f'a{x}')

processed_tracks_df.set_index('track-id', inplace=True)
processed_tracks_df.index = processed_tracks_df.index.map(lambda x: f't{x}')

processed_users_df.set_index('user_id', inplace=True)
processed_users_df.index = processed_users_df.index.map(lambda x: f'u{x}')

processed_cities_df.set_index('city-id', inplace=True)
processed_cities_df.index = processed_cities_df.index.map(lambda x: f'ci{x}')

processed_countries_df.set_index('country-id', inplace=True)
processed_countries_df.index = processed_countries_df.index.map(lambda x: f'co{x}')

# Update Edge DataFrames
processed_user_track_df['user_id'] = processed_user_track_df['user_id'].apply(lambda x: f'u{x}'.strip())
processed_user_track_df['track_id'] = processed_user_track_df['track_id'].apply(lambda x: f't{x}'.strip())

processed_track_artist_df['artist_id'] = processed_track_artist_df['artist_id'].apply(lambda x: f'a{x}'.strip())
processed_track_artist_df['track_id'] = processed_track_artist_df['track_id'].apply(lambda x: f't{x}'.strip())

processed_user_city_df['user_id'] = processed_user_city_df['user_id'].apply(lambda x: f'u{x}'.strip())
processed_user_city_df['city_id'] = processed_user_city_df['city_id'].apply(lambda x: f'ci{x}'.strip())

processed_city_country_df['city_id'] = processed_city_country_df['city_id'].apply(lambda x: f'ci{x}'.strip())
processed_city_country_df['country_id'] = processed_city_country_df['country_id'].apply(lambda x: f'co{x}'.strip())

# Remove columns with string values, as StellarGraph accepts only numeric data
processed_artists_df.drop(columns=['artist'], inplace=True, errors='ignore')
processed_tracks_df.drop(columns=['track'], inplace=True, errors='ignore')
processed_cities_df.drop(columns=['city'], inplace=True, errors='ignore')
processed_countries_df.drop(columns=['country'], inplace=True, errors='ignore')

#Rename Edge DataFrames columns as source and target respectively
processed_user_track_df.rename(columns={'user_id': 'source', 'track_id': 'target'}, inplace=True)
processed_track_artist_df.rename(columns={'track_id': 'source', 'artist_id': 'target'}, inplace=True)
processed_user_city_df.rename(columns={'user_id': 'source', 'city_id': 'target'}, inplace=True)
processed_city_country_df.rename(columns={'city_id': 'source', 'country_id': 'target'}, inplace=True)

print(processed_artists_df)
print("\n")
print(processed_tracks_df)
print("\n")
print(processed_cities_df)
print("\n")
print(processed_countries_df)
print("\n")
print(processed_users_df)
print("\n")
print(processed_user_track_df)
print("\n")
print(processed_track_artist_df)
print("\n")
print(processed_user_city_df)
print("\n")
print(processed_city_country_df)
print("\n")

# Check for unique indices in node DataFrames
print("Duplicate indices in artists:", processed_artists_df.index.duplicated().any())
print("Duplicate indices in tracks:", processed_tracks_df.index.duplicated().any())
print("Duplicate indices in users:", processed_users_df.index.duplicated().any())
print("Duplicate indices in cities:", processed_cities_df.index.duplicated().any())
print("Duplicate indices in countries:", processed_countries_df.index.duplicated().any())

Empty DataFrame
Columns: []
Index: [a450514, a202085, a330061, a404350, a227460, a320569, a439113, a220615, a471013, a322177, a317460, a98394, a197860, a272025, a240055, a274007, a92107, a233683, a417979, a360143, a155809, a459099, a478007, a151666, a117659, a58322, a270067, a135108, a335480, a167669, a229442, a122991, a233682, a110583, a34087, a356772, a56531, a213787, a386633, a9192, a361211, a476146, a281988, a250929, a59333, a197223, a367391, a246486, a324021, a77470, a301735, a49788, a56373, a42944, a104957, a325402, a259514, a40551, a263077, a302277, a413819, a52897, a302854, a30731, a133938, a384551, a74752, a211444, a287024, a2338, a149709, a37490, a51307, a280254, a193089, a241718, a388020, a343878, a483729, a247855, a301448, a220806, a45556, a128744, a347468, a474988, a48184, a466569, a289233, a409372, a194226, a56822, a298166, a34173, a251952, a320064, a487164, a208968, a253391, a356219, ...]

[19529 rows x 0 columns]


Empty DataFrame
Columns: []
Index: [t1692316, t8030594,

### Checking for Duplicates in Edge DFs

Identical edges within an edge DF must be resolved. We first check each edge DF for duplicates and try to make sense of their occurrences.

In [7]:
# Check for duplicates in 'listens to' relationship
if processed_user_track_df.duplicated().any():
    print("There are duplicate edges in 'listens to' relationship.")
    duplicate_edges_listens = processed_user_track_df[processed_user_track_df.duplicated(keep=False)]
    print("Duplicate edges in 'listens to' relationship:")
    print(duplicate_edges_listens)
else:
    print("No duplicate edges in 'listens to' relationship.")

# Check for duplicates in 'produced by' relationship
if processed_track_artist_df.duplicated().any():
    print("There are duplicate edges in 'produced by' relationship.")
    duplicate_edges_produced = processed_track_artist_df[processed_track_artist_df.duplicated(keep=False)]
    print("Duplicate edges in 'produced by' relationship:")
    print(duplicate_edges_produced)
else:
    print("No duplicate edges in 'produced by' relationship.")

# Check for duplicates in 'lives in' relationship
if processed_user_city_df.duplicated().any():
    print("There are duplicate edges in 'lives in' relationship.")
    duplicate_edges_lives = processed_user_city_df[processed_user_city_df.duplicated(keep=False)]
    print("Duplicate edges in 'lives in' relationship:")
    print(duplicate_edges_lives)
else:
    print("No duplicate edges in 'lives in' relationship.")

# Check for duplicates in 'located in' relationship
if processed_city_country_df.duplicated().any():
    print("There are duplicate edges in 'located in' relationship.")
    duplicate_edges_located = processed_city_country_df[processed_city_country_df.duplicated(keep=False)]
    print("Duplicate edges in 'located in' relationship:")
    print(duplicate_edges_located)
else:
    print("No duplicate edges in 'located in' relationship.")


No duplicate edges in 'listens to' relationship.
No duplicate edges in 'produced by' relationship.
No duplicate edges in 'lives in' relationship.
No duplicate edges in 'located in' relationship.
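
The four checks above follow the same pattern, so they could also be written as a loop over a dictionary of edge DataFrames. This is a compact sketch on toy data, where the 'listens to' frame deliberately contains one repeated edge; edge_dfs and duplicate_counts are hypothetical names.

```python
import pandas as pd

# Toy edge DataFrames keyed by relationship name
edge_dfs = {
    'listens to': pd.DataFrame({'source': ['u1', 'u1', 'u2'], 'target': ['t1', 't1', 't2']}),
    'produced by': pd.DataFrame({'source': ['t1', 't2'], 'target': ['a1', 'a1']}),
}

# duplicated() flags every repeat of an earlier identical row
duplicate_counts = {name: int(df.duplicated().sum()) for name, df in edge_dfs.items()}

for name, count in duplicate_counts.items():
    if count:
        print(f"{count} duplicate edge(s) in '{name}' relationship.")
    else:
        print(f"No duplicate edges in '{name}' relationship.")
```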


### Checking for Unconnected Nodes

In order for the random walks to cover ALL nodes and generate embeddings for each one, only nodes that are connected to others as part of the graph can be considered. We check for unconnected nodes by inspecting the edge dataframes and cross-checking with the node dataframes. 

Any node whose index appears in its node DF but in none of the edge DFs is unconnected and should be dropped.

In [8]:
# Users not connected to tracks or cities
isolated_users = processed_users_df.index[
    ~processed_users_df.index.isin(processed_user_track_df['source']) &
    ~processed_users_df.index.isin(processed_user_city_df['source'])
]
if not isolated_users.empty:
    print("Isolated user nodes:")
    print(isolated_users)
else:
    print("No isolated user nodes.")

# Artists not connected to tracks
isolated_artists = processed_artists_df.index[
    ~processed_artists_df.index.isin(processed_track_artist_df['target'])
]
if not isolated_artists.empty:
    print("Isolated artist nodes:")
    print(isolated_artists)
else:
    print("No isolated artist nodes.")

# Cities not connected to countries
isolated_cities = processed_cities_df.index[
    ~processed_cities_df.index.isin(processed_city_country_df['source'])
]
if not isolated_cities.empty:
    print("Isolated city nodes:")
    print(isolated_cities)
else:
    print("No isolated city nodes.")

# Tracks not connected to artists
isolated_tracks = processed_tracks_df.index[
    ~processed_tracks_df.index.isin(processed_track_artist_df['source'])
]
if not isolated_tracks.empty:
    print("Isolated track nodes:")
    print(isolated_tracks)
else:
    print("No isolated track nodes.")

No isolated user nodes.
No isolated artist nodes.
No isolated city nodes.
No isolated track nodes.
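
No isolated nodes were found here, so nothing needs to be dropped. Had the check reported any, they could be removed along the lines of the sketch below, which uses toy data and hypothetical names (toy_users_df, connected).

```python
import pandas as pd

# Toy user node DataFrame (index only) and two toy edge DataFrames
toy_users_df = pd.DataFrame(index=pd.Index(['u1', 'u2', 'u3'], name='user_id'))
toy_user_track_df = pd.DataFrame({'source': ['u1'], 'target': ['t1']})
toy_user_city_df = pd.DataFrame({'source': ['u2'], 'target': ['ci1']})

# A user is connected if it appears as a source in at least one edge DataFrame
connected = (
    toy_users_df.index.isin(toy_user_track_df['source'])
    | toy_users_df.index.isin(toy_user_city_df['source'])
)

# Keep only connected users; 'u3' has no edges and is dropped
toy_users_df = toy_users_df.loc[connected]
print(toy_users_df.index.tolist())
```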


### Step 3: Checking for Invalid Edges 

Each edge must also connect two nodes that exist in the node DFs (i.e., both endpoints must point to valid nodes); otherwise it is considered an invalid edge.

In [9]:
# Check for invalid edge references in 'listens to' relationship (User-Track)
invalid_edges_listens = processed_user_track_df[
    ~processed_user_track_df['source'].isin(processed_users_df.index) |
    ~processed_user_track_df['target'].isin(processed_tracks_df.index)  # Assuming tracks_df index already prefixed
]
if not invalid_edges_listens.empty:
    print("Invalid edges in 'listens to' relationship:")
    print(invalid_edges_listens)
else:
    print("No invalid edges in 'listens to' relationship.")

# Check for invalid edge references in 'produced by' relationship (Track-Artist)
invalid_edges_produced = processed_track_artist_df[
    ~processed_track_artist_df['source'].isin(processed_tracks_df.index) |
    ~processed_track_artist_df['target'].isin(processed_artists_df.index)  # Assuming artists_df index already prefixed
]
if not invalid_edges_produced.empty:
    print("Invalid edges in 'produced by' relationship:")
    print(invalid_edges_produced)
else:
    print("No invalid edges in 'produced by' relationship.")

# Check for invalid edge references in 'lives in' relationship (User-City)
invalid_edges_lives = processed_user_city_df[
    ~processed_user_city_df['source'].isin(processed_users_df.index) |
    ~processed_user_city_df['target'].isin(processed_cities_df.index)  # Assuming cities_df index already prefixed
]
if not invalid_edges_lives.empty:
    print("Invalid edges in 'lives in' relationship:")
    print(invalid_edges_lives)
else:
    print("No invalid edges in 'lives in' relationship.")

# Check for invalid edge references in 'located in' relationship (City-Country)
invalid_edges_located = processed_city_country_df[
    ~processed_city_country_df['source'].isin(processed_cities_df.index) |
    ~processed_city_country_df['target'].isin(processed_countries_df.index)  # Assuming countries_df index already prefixed
]
if not invalid_edges_located.empty:
    print("Invalid edges in 'located in' relationship:")
    print(invalid_edges_located)
else:
    print("No invalid edges in 'located in' relationship.")

Invalid edges in 'listens to' relationship:
            source    target
186      u33288782  t2106409
250     u126896322  t8193670
278     u282398048  t2106409
314     u223394258  t2743928
366     u215181798  t8144350
...            ...       ...
594297  u452477250  t4949647
594299  u199622717   t798930
594301  u429236647  t9524166
594304  u192959967  t2592493
594306   u46547043  t1050122

[118287 rows x 2 columns]
Invalid edges in 'produced by' relationship:
          source   target
186     t2106409  a122991
191     t1347540   a75948
250     t8193670  a478007
314     t2743928  a159714
366     t8144350  a474988
...          ...      ...
594259  t6580018  a379005
594267  t4657814  a261232
594283  t4175404  a232731
594297  t4949647  a279207
594304  t2592493  a152498

[27633 rows x 2 columns]
No invalid edges in 'lives in' relationship.
No invalid edges in 'located in' relationship.


### Removing Invalid Edges

There are invalid edges in both the 'listens to' and 'produced by' edge DFs. Cross-checking with the original .txt files shows that the missing node IDs exist in listening_data.txt but not in the corresponding mapping file. For this reason, we drop these edges from the DataFrames.
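
One way to see which IDs are missing from a mapping is an index difference between the edge endpoints and the node index. The sketch below uses a few IDs taken from the outputs above; the toy_ names are illustrative.

```python
import pandas as pd

# Toy track node index and toy 'listens to' edges (IDs taken from the outputs above)
toy_tracks_index = pd.Index(['t7748381', 't3529910'])
toy_user_track_df = pd.DataFrame({
    'source': ['u74717431', 'u33288782'],
    'target': ['t7748381', 't2106409'],
})

# Track IDs referenced by edges but absent from the track mapping
missing_tracks = pd.Index(toy_user_track_df['target'].unique()).difference(toy_tracks_index)
print(f"Track IDs missing from the mapping: {list(missing_tracks)}")
```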


In [10]:
# Removing invalid 'listens to' edges
processed_user_track_df = processed_user_track_df[
    processed_user_track_df['source'].isin(processed_users_df.index) & 
    processed_user_track_df['target'].isin(processed_tracks_df.index)
]

# Removing invalid 'produced by' edges
processed_track_artist_df = processed_track_artist_df[
    processed_track_artist_df['source'].isin(processed_tracks_df.index) & 
    processed_track_artist_df['target'].isin(processed_artists_df.index)
]

# Removing invalid 'lives in' edges
processed_user_city_df = processed_user_city_df[
    processed_user_city_df['source'].isin(processed_users_df.index) & 
    processed_user_city_df['target'].isin(processed_cities_df.index)
]

# Removing invalid 'located in' edges
processed_city_country_df = processed_city_country_df[
    processed_city_country_df['source'].isin(processed_cities_df.index) & 
    processed_city_country_df['target'].isin(processed_countries_df.index)
]

In [11]:
print(processed_user_track_df)
print("\n")
print(processed_track_artist_df)
print("\n")
print(processed_user_city_df)
print("\n")
print(processed_city_country_df)
print("\n")

            source    target
1        u74717431  t7748381
2       u127821914  t3529910
3       u174194590  t5762915
4       u141847381  t6987845
5        u87215499  t4082536
...            ...       ...
594298   u70563820  t9530975
594300  u370457976  t3848019
594302  u134108604  t5020296
594303  u219099036  t1224442
594305  u226486664  t6536720

[372834 rows x 2 columns]


          source   target
1       t7748381  a450514
2       t3529910  a202085
3       t5762915  a330061
4       t6987845  a404350
5       t4082536  a227460
...          ...      ...
594221  t9062706   a64964
594223  t4702581  a264449
594280  t5255922  a298430
594298  t9530975  a301871
594303  t1224442   a68747

[70959 rows x 2 columns]


            source  target
1        u74717431     ci0
2       u127821914     ci1
3       u174194590     ci2
4       u141847381     ci3
5        u87215499     ci4
...            ...     ...
594292   u34778757    ci42
594295  u222512324   ci214
594296  u240531876   ci549
594298   u705

In [13]:
# Check if all IDs are strings and have been correctly prefixed
for df, prefix in zip([processed_users_df, processed_tracks_df, processed_artists_df, processed_cities_df, processed_countries_df], ['u', 't', 'a', 'ci', 'co']):
    if all(df.index.astype(str).str.startswith(prefix)):
        print(f"All IDs in {df.index.name} correctly prefixed with {prefix}.")
    else:
        print(f"ID prefix issues found in {df.index.name}.")

# Check for any null or missing values that could indicate data issues
for df_name, df in zip(['Users', 'Tracks', 'Artists', 'Cities', 'Countries'], [processed_users_df, processed_tracks_df, processed_artists_df, processed_cities_df, processed_countries_df]):
    if df.isnull().any().any():
        print(f"Null or missing values found in {df_name} DataFrame.")
    else:
        print(f"No null or missing values in {df_name} DataFrame.")

All IDs in user_id correctly prefixed with u.
All IDs in track-id correctly prefixed with t.
All IDs in artist-id correctly prefixed with a.
All IDs in city-id correctly prefixed with ci.
All IDs in country-id correctly prefixed with co.
No null or missing values in Users DataFrame.
No null or missing values in Tracks DataFrame.
No null or missing values in Artists DataFrame.
No null or missing values in Cities DataFrame.
No null or missing values in Countries DataFrame.


Next, we assign a unique numerical ID to every edge across all edge DataFrames. To do this, we concatenate the edge DataFrames, number the rows consecutively, and then split them back by edge type, so that each edge in the entire dataset has a unique numerical identifier.

In [14]:
# Concatenate all edge DataFrames, tagging each row with its edge type
all_edges = pd.concat([
    processed_user_track_df.assign(edge_type="listens to"),
    processed_track_artist_df.assign(edge_type="produced by"),
    processed_user_city_df.assign(edge_type="lives in"),
    processed_city_country_df.assign(edge_type="located in")
]).reset_index(drop=True)

# Adding unique numerical edge IDs
all_edges['edge_id'] = range(1, all_edges.shape[0] + 1)  # Starting IDs from 1

# Splitting the DataFrames back
processed_user_track_df = all_edges[all_edges['edge_type'] == "listens to"].drop('edge_type', axis=1)
processed_track_artist_df = all_edges[all_edges['edge_type'] == "produced by"].drop('edge_type', axis=1)
processed_user_city_df = all_edges[all_edges['edge_type'] == "lives in"].drop('edge_type', axis=1)
processed_city_country_df = all_edges[all_edges['edge_type'] == "located in"].drop('edge_type', axis=1)

print(all_edges)


            source    target   edge_type  edge_id
0        u74717431  t7748381  listens to        1
1       u127821914  t3529910  listens to        2
2       u174194590  t5762915  listens to        3
3       u141847381  t6987845  listens to        4
4        u87215499  t4082536  listens to        5
...            ...       ...         ...      ...
641279     ci20717       co3  located in   641280
641280     ci20718       co9  located in   641281
641281     ci20719       co5  located in   641282
641282     ci20720       co3  located in   641283
641283     ci20721       co4  located in   641284

[641284 rows x 4 columns]


### Step 4: Input DF into JUST 

JUST requires the following input formats: 

1) Edgelist format: the prefix of each node ID should indicate the node type.
    - typeID1 typeID2

2) Node_types: the structure of the heterogeneous graph, listing which node types connect to which. E.g.:
    - user : track
    - track : user
    - track : artist
    - artist : track
    - user : city
    - city : user
    - city : country
    - country : city

So we extract the source and target columns from the all_edges DataFrame and save them as the edgelist.

In [15]:
all_edges[['source', 'target']].to_csv('Just_Inputs/micromusic.edgelist', sep=' ', index=False, header=False)
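
Before running JUST, it can be worth confirming that every edge connects an expected pair of node types. The sketch below is a sanity check of our own, not part of JUST: it reads the type from the alphabetic prefix of each node ID and compares the observed pairs with the connections listed above. The helper node_prefix and the ALLOWED_PAIRS set are assumptions made for illustration, and the toy edges reuse IDs from the outputs on this page.

```python
import re
import pandas as pd

# Expected (source, target) type pairs, using the prefixes defined in Step 2
ALLOWED_PAIRS = {('u', 't'), ('t', 'a'), ('u', 'ci'), ('ci', 'co')}

def node_prefix(node_id):
    """Return the leading alphabetic prefix of a node ID, e.g. 'ci' for 'ci42'."""
    return re.match(r'[a-z]+', node_id).group(0)

# Toy edges, one per relationship
toy_edges = pd.DataFrame({
    'source': ['u74717431', 't7748381', 'u74717431', 'ci0'],
    'target': ['t7748381', 'a450514', 'ci0', 'co0'],
})

# Collect the distinct type pairs that occur and flag any that are not expected
observed_pairs = set(zip(toy_edges['source'].map(node_prefix),
                         toy_edges['target'].map(node_prefix)))
unexpected_pairs = observed_pairs - ALLOWED_PAIRS
print(f"Unexpected type pairs: {unexpected_pairs or 'none'}")
```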

### Step 5: Run JUST on Terminal

Run the following command:

python src/main.py --input Just_Inputs/micromusic.edgelist --node_types Just_Inputs/micromusic_node_types.txt --dimensions 128 --walk_length 100 --num_walks 10 --window-size 10 --alpha 0.5 --output JUST_Outputs/micromusic.embeddings
