Now that we have written and run several flows, we can use Metaflow's Client API as a handy way to fetch results, analyze performance, and decide how to iterate on embeddings, modeling approaches, and experiment design. You can follow along in [this notebook](https://github.com/outerbounds/tutorials/blob/main/recsys/recsys-1.ipynb) as we load and analyze flow results, and then use [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to produce a data visualization.

First we import the packages we need and define some config variables:

In [1]:
from metaflow import Flow
import numpy as np
from random import choice
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.manifold import TSNE

In [2]:
FLOW_NAME = 'RecSysTuningFlow'

Let's retrieve the artifacts from the latest successful run.
The `get_latest_successful_run` function uses the `metaflow.Flow` object to look up runs by the (class) name of your flow and returns the first one that finished successfully.

In [3]:
def get_latest_successful_run(flow_name: str):
    "Gets the latest successful run."
    for r in Flow(flow_name).runs():
        if r.successful: 
            return r

In [4]:
latest_run = get_latest_successful_run(FLOW_NAME)
latest_model = latest_run.data.final_vectors
latest_dataset = latest_run.data.final_dataset
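
Note that the helper above returns `None` when no run has succeeded, so the attribute access on `latest_run.data` would fail with a less obvious error. As a minimal sketch (the name `first_successful` and the stand-in run objects are hypothetical, used only so the snippet runs without Metaflow), the same loop can raise explicitly instead. With Metaflow you would pass it `Flow(FLOW_NAME).runs()`, which iterates from the most recent run backwards; the Client API also exposes `Flow(FLOW_NAME).latest_successful_run` as a built-in shortcut.

```python
from types import SimpleNamespace


def first_successful(runs):
    """Return the first run whose `successful` flag is True, or raise."""
    for run in runs:
        if run.successful:
            return run
    raise ValueError("no successful run found")


# Stand-in objects (hypothetical) that only mimic the `successful` attribute
# of a run; they are ordered newest first, like Flow(...).runs().
fake_runs = [
    SimpleNamespace(id="3", successful=False),
    SimpleNamespace(id="2", successful=True),
    SimpleNamespace(id="1", successful=True),
]

picked = first_successful(fake_runs)
print(picked.id)
```
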

First, check that everything is in order by printing a few rows of the dataset and its size:

In [5]:
latest_dataset.head(3)

Unnamed: 0,playlist_id,artist_sequence,track_sequence,track_test_x,track_test_y,predictions,hit
3716,3d0f759337e6aa576c75ecd3fbf14968-March 2013,"[Miguel, A$AP Rocky, Drake, Justin Timberlake,...","[Miguel|||Do You..., A$AP Rocky|||F**kin' Prob...","[Miguel|||Do You..., A$AP Rocky|||F**kin' Prob...",Bruno Mars|||When I Was Your Man,"[Justin Timberlake|||Strawberry Bubblegum, Jus...",0
10837,cf1a043e8f7f6fe0d8d2741e1791e4bb-float shuffle...,"[P!nk, Amy Winehouse, Sara Bareilles, The Chem...","[P!nk|||'Cuz I Can, Amy Winehouse|||'Round Mid...","[P!nk|||'Cuz I Can, Amy Winehouse|||'Round Mid...",Beifus|||lava,"[This Mortal Coil|||Song To The Siren, The Che...",0
6140,cae6ff399c07ca2b9e2f3a47f5175958-Cheap Diamonds,"[Phoenix, The Neighbourhood, St. Lucia, Panic!...","[Phoenix|||1901, The Neighbourhood|||Afraid, S...","[Phoenix|||1901, The Neighbourhood|||Afraid, S...",Sir Sly|||You Haunt Me,"[Hillsong Young & Free|||Wake - Live, Robbie W...",0


In [6]:
len(latest_dataset)

2172
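
The `hit` column suggests a quick summary statistic. The sketch below assumes, without confirmation from the flow code shown here, that `hit` is 1 when the held-out track in `track_test_y` appears among `predictions`; the function name `hit_rate` and the toy rows are made up for illustration.

```python
def hit_rate(rows):
    """Fraction of rows whose held-out track appears among the predictions."""
    if not rows:
        return 0.0
    hits = sum(1 for target, predictions in rows if target in predictions)
    return hits / len(rows)


# Toy (target, predictions) pairs, not taken from the real dataset
toy_rows = [
    ("a|||song 1", ["a|||song 1", "b|||song 2"]),  # hit
    ("c|||song 3", ["d|||song 4", "e|||song 5"]),  # miss
    ("f|||song 6", ["g|||song 7", "f|||song 6"]),  # hit
    ("h|||song 8", []),                            # miss
]

rate = hit_rate(toy_rows)
print(rate)
```

Under the same assumption, `latest_dataset['hit'].mean()` would give the equivalent number directly on the dataframe.
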

Now, let's turn our attention to the model - the embedding space we trained: let's check how big it is and use it to make a test prediction.

In [7]:
print("# track vectors in the space: {}".format(len(latest_model)))
test_track = choice(list(latest_model.index_to_key))
print("Example track: '{}'".format(test_track))
test_vector = latest_model[test_track]
print("Test vector for '{}': {}".format(test_track, test_vector[:5]))
test_sims = latest_model.most_similar(test_track, topn=3)
print("Similar songs to '{}': {}".format(test_track, test_sims))

# track vectors in the space: 27419
Example track: 'The Box Tops|||The Letter'
Test vector for 'The Box Tops|||The Letter': [-0.58819056  0.14980857  0.58946514 -0.21597098  1.6484333 ]
Similar songs to 'The Box Tops|||The Letter': [('The Rolling Stones|||The Last Time', 0.9870179891586304), ('B.B. King|||The Thrill Is Gone', 0.9854589104652405), ('The Platters|||The Great Pretender', 0.9850826859474182)]


The skip-gram model we trained is an embedding space: if we did our job correctly, the space is such that tracks closer in the space are actually similar, and tracks that are far apart are pretty unrelated.
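
"Closer in the space" is measured here with cosine similarity, which is the score that `most_similar` reports. A small worked example on toy 3-dimensional vectors (the real vectors have 48 dimensions) shows the two extremes; the helper name `cosine_similarity` is just illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


track_a = [1.0, 2.0, 0.0]
track_b = [2.0, 4.0, 0.0]  # same direction as track_a, different length
track_c = [0.0, 0.0, 3.0]  # orthogonal to track_a

sim_ab = cosine_similarity(track_a, track_b)  # close to 1: "similar"
sim_ac = cosine_similarity(track_a, track_c)  # 0: "unrelated"
print(sim_ab, sim_ac)
```

Because only the direction of the vectors matters, scores near 1 (like the 0.98 values printed above) indicate tracks that occur in very similar contexts.
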

[Judging the quality of "fantastic embeddings" is hard](https://arxiv.org/abs/2007.14906), but we point here to some common qualitative checks you can run.

In [8]:
# qualitative check, make sure to change with a song that is in the set
test_track = 'Daft Punk|||Get Lucky - Radio Edit'
test_sims = latest_model.most_similar(test_track, topn=3)
print("Similar songs to '{}': {}".format(test_track, test_sims))

Similar songs to 'Daft Punk|||Get Lucky - Radio Edit': [("deadmau5|||Ghosts 'n' Stuff - feat. Rob Swire", 0.9701901078224182), ('PSY|||Gangnam Style (강남스타일)', 0.9514472484588623), ('PSY|||Gentleman', 0.9451815485954285)]


If you use 'Daft Punk|||Get Lucky - Radio Edit' as the query item in the space, you may discover an interesting problem: the dataset unfortunately contains many duplicates, that is, songs that are technically distinct entries but semantically the same, e.g. Daft Punk|||Get Lucky - Radio Edit vs Daft Punk|||Get Lucky.

This is a problem because i) working with dirty data may be misleading, and ii) duplicates split the same song's occurrences across several entries, which makes data sparsity worse and the task harder for our model. That said, it is cool that nearest-neighbor search (KNN) in the embedding space can be used to quickly identify and potentially remove duplicates, depending on your dataset and use cases.
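
One possible way to turn this idea into code is sketched below. It assumes a simple heuristic that is not part of the original notebook: two tracks are probable duplicates when they share an artist and the title matches once a trailing ` - <variant>` suffix is dropped. The helper names and the neighbour list are made up; with the real model the neighbours could come from `latest_model.most_similar(query, topn=10)`.

```python
def base_title(track):
    """Drop a trailing ' - <variant>' suffix from 'Artist|||Title - Variant'."""
    artist, _, title = track.partition("|||")
    return artist + "|||" + title.split(" - ")[0].strip()


def probable_duplicates(query, neighbours):
    """Neighbours that differ from the query but share its base title."""
    target = base_title(query).lower()
    return [
        name for name in neighbours
        if name != query and base_title(name).lower() == target
    ]


# Toy neighbour list (made up for illustration)
neighbours = [
    "Daft Punk|||Get Lucky",
    "PSY|||Gentleman",
    "Daft Punk|||Get Lucky - Radio Edit",
]

dupes = probable_duplicates("Daft Punk|||Get Lucky - Radio Edit", neighbours)
print(dupes)
```

The heuristic is deliberately crude: a ` - Live` version, for example, may or may not count as a duplicate for your use case.
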

Let's map some tracks to known categories: the intuition is that songs that are similar will be colored in the same way in the chart, and so we will expect them to be close in the embedding space.

In [9]:
track_sequence = latest_dataset['track_sequence'] 
songs = [item for sublist in track_sequence for item in sublist]
song_counter = Counter(songs)

In [10]:
# we downsample the vector space to the TOP_N_TRACKS most common songs to avoid crowding the plot / analysis
TOP_N_TRACKS = 250
# a set makes the membership test below fast
top_tracks = {track for track, _ in song_counter.most_common(TOP_N_TRACKS)}
tracks = [track for track in latest_model.index_to_key if track in top_tracks]

assert TOP_N_TRACKS == len(tracks)

In [11]:
# 'unknown' is the default category for tracks we have not tagged yet
tracks_to_category = {t: 'unknown' for t in tracks}

In [12]:
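# we tag songs based on keywords found in the playlist name. Of course, better heuristics are possible ;-)
# playlist_id has the form '<hash>-<name>': split only on the first '-' so names containing hyphens stay whole
all_playlists_names = set(latest_dataset['playlist_id'].apply(lambda r: r.split('-', 1)[1].lower().strip()))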
target_categories = [
    'rock',
    'rap',
]

In [13]:
# while not pretty, this selects the playlists containing the target keyword and marks their tracks
# as belonging to that category

def tag_tracks_with_category(_df, target_word, tracks_to_category):
    _df = _df[_df['playlist_id'].str.contains(target_word)]
    # debug
    print(len(_df))
    # unnest the list
    songs = [item for sublist in _df['track_sequence'] for item in sublist]
    for song in songs:
        if song in tracks_to_category and tracks_to_category[song] == 'unknown':
            tracks_to_category[song] = target_word
    
    return tracks_to_category


for cat in target_categories:
    print("Processing {}".format(cat))
    tracks_to_category = tag_tracks_with_category(latest_dataset, cat, tracks_to_category)

Processing rock
4
Processing rap
7
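
Two properties of this tagging scheme are worth keeping in mind: matching is by substring, and a track keeps the first category it receives, so the order of `target_categories` matters. The toy sketch below illustrates both. It is a simplified stand-in rather than the notebook's function: `tag_tracks` and the playlists are invented, and unlike the original it lower-cases playlist names before matching.

```python
def tag_tracks(playlists, categories):
    """Assign each track the first category whose keyword matches a playlist name."""
    tags = {}
    for category in categories:
        for name, track_list in playlists:
            if category not in name.lower():
                continue
            for track in track_list:
                # setdefault keeps an existing tag: the first category wins
                tags.setdefault(track, category)
    return tags


# Made-up playlists: (name, tracks)
toy_playlists = [
    ("Classic Rock", ["artist a|||song 1", "artist b|||song 2"]),
    ("Rap & Rock mix", ["artist b|||song 2", "artist c|||song 3"]),
    ("Morning rap", ["artist c|||song 3", "artist d|||song 4"]),
]

tags = tag_tracks(toy_playlists, ["rock", "rap"])
print(tags)
```

Here `artist c|||song 3` ends up as `rock` even though it also appears in a rap playlist, because `rock` is processed first. Substring matching also means a keyword like `rap` would match a playlist called `trap`.
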


Note: to visualize an n-dimensional space, we first need to project it down to 2D. We can use a dimensionality reduction technique like [TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) for this.

In [14]:
def tsne_analysis(embeddings, perplexity=50, n_iter=1000):
    """
    TSNE dimensionality reduction of track embeddings - it may take a while!
    """
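    # note: scikit-learn 1.5 renamed `n_iter` to `max_iter`; on recent versions pass `max_iter=n_iter` instead
    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=n_iter, verbose=1, learning_rate='auto', init='random')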

    return tsne.fit_transform(embeddings)

In [15]:
# add all the tagged tracks to the embedding space, on top of the popular tracks
for track, cat in tracks_to_category.items():
    # add a track if we have a tag, if not there already, if we have a vector for it
    if cat in target_categories and track in latest_model.index_to_key and track not in tracks:
        tracks.append(track)
    
print(len(tracks)) 

250


In [16]:
# extract the vectors from the model and project them in 2D
embeddings = np.array([latest_model[t] for t in tracks])
# debug, print out embedding shape
print(embeddings.shape)
tsne_results = tsne_analysis(embeddings)
assert len(tsne_results) == len(tracks)

(250, 48)
[t-SNE] Computing 151 nearest neighbors...
[t-SNE] Indexed 250 samples in 0.000s...
[t-SNE] Computed neighbors for 250 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 250 / 250
[t-SNE] Mean sigma: 6.607835
[t-SNE] KL divergence after 250 iterations with early exaggeration: 48.096527
[t-SNE] KL divergence after 1000 iterations: 0.297004
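
The "151 nearest neighbors" in the log follows from the perplexity we chose. As a sketch, assuming scikit-learn's default Barnes-Hut method considers roughly three times the perplexity plus one neighbours per point, capped at the number of other samples, the arithmetic is as follows (`barnes_hut_neighbours` is a hypothetical helper, not a scikit-learn function):

```python
def barnes_hut_neighbours(n_samples, perplexity):
    """Neighbours per point, assuming the 3 * perplexity + 1 rule with a cap."""
    return min(n_samples - 1, int(3.0 * perplexity + 1))


k = barnes_hut_neighbours(n_samples=250, perplexity=50)
print(k)
```

With 250 tracks and `perplexity=50` this gives 151, matching the log. It also shows why the perplexity should stay well below the number of points being plotted.
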


Now we can define a function to plot the 2D representations produced by the TSNE algorithm.

In [25]:
def plot_scatterplot_with_lookup(
    title: str, 
    items: list, 
    items_to_target_cat: dict,
    vectors: list,
    output_path: str = './song_TSNE.png'
):
    """
    Plot the 2-D vectors in the space, and use the mapping items_to_target_cat
    to color-code the points for convenience
    """
    
    plt.ioff()
    
    groups = {}
    for item, target_cat in items_to_target_cat.items():
        if item not in items:
            continue

        item_idx = items.index(item)
        x = vectors[item_idx][0]
        y = vectors[item_idx][1]
        if target_cat in groups:
            groups[target_cat]['x'].append(x)
            groups[target_cat]['y'].append(y)
        else:
            groups[target_cat] = {
                'x': [x], 'y': [y]
                }
    
    fig, ax = plt.subplots(figsize=(6,6))
    for group, data in groups.items():
        ax.scatter(data['x'], data['y'], 
                   alpha=0.1 if group == 'unknown' else 0.8, 
                   edgecolors='none', 
                   s=25, 
                   marker='o',
                   label=group)
        
    # hide the frame and the ticks: t-SNE axes have no interpretable units
    for side in ['top', 'bottom', 'left', 'right']:
        ax.spines[side].set_visible(False)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(title)
    ax.legend(loc=2)
    fig.savefig(output_path)

Finally, we are ready to plot the latent space!

In [26]:
plot_scatterplot_with_lookup(
    'Music in (latent) space', 
    tracks, 
    tracks_to_category, 
    tsne_results)

![](./song_TSNE.png)

So far, you have trained embeddings and models, tuned them to find the most promising candidates, and analyzed the results using Metaflow's Client API. In the final episode of this tutorial, we will write another `FlowSpec` that shows how you can combine these processes with SageMaker's convenient deployment tools. The end result will be a recommender system you can use to serve real-time predictions about what song to suggest next to a user of an app. See you there!