# Analysis: do a song's lyrics inform playlist formation?

*Do distances between lyrics embeddings correlate with distances between song embeddings?*

**Lucas De Oliveira**

**ML Algorithms:** Word2Vec, LinearRegression

## Computing word embeddings

We have previously saved lyrics for about 138,000 unique songs in the Spotify playlist data. We will perform the following tasks:

* Load the DataFrame from MongoDB
* Extract the words from the lyrics
* Use Word2Vec to train word embeddings
* For each song, average the embeddings of the words in its lyrics to obtain the song's "lyrical embedding"
* Save the resulting Spark DataFrame to MongoDB

In [None]:
# Libraries
import os
from datetime import datetime

from pyspark.ml.feature import Word2Vec
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

# Start Spark session
spark = SparkSession.builder\
                    .appName("spotify")\
                    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1")\
                    .config("spark.mongodb.input.uri", "mongodb+srv://**********")\
                    .config("spark.mongodb.output.uri", "mongodb+srv://**********")\
                    .config("spark.network.timeout", "7200s")\
                    .config("spark.executor.heartbeatInterval", "1200s")\
                    .config("spark.databricks.adaptive.autoOptimizeShuffle.enabled", True)\
                    .getOrCreate()

# MongoDB connection info
database = '*****'
collection = 'songs_lyrics'
user_name = '*****'
password = '*****'
address = '**********.mongodb.net'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"

In [None]:
# Read song lyrics data from MongoDB (pre-processed)
df = spark.read.format('mongo').option("uri", connection_string).load()
df.show(5)

In the following code blocks we first inspect the lyrics DataFrame, then train word embeddings using Word2Vec on a small subset of the data to validate the pipeline before running it on the full data.

In [None]:
df_lyrics = df.select("track_uri", "lyrics_words").distinct().where(df['lyrics_words'].isNotNull()).cache()

In [None]:
print(df_lyrics.count())
df_lyrics.show(5)

In [None]:
df_test = df_lyrics.limit(10).cache()
df_test.show()

In [None]:
word2Vec = Word2Vec(vectorSize=32, seed=42, inputCol="lyrics_words", outputCol="model")
word2Vec.setMaxIter(10)
model = word2Vec.fit(df_test)

In [None]:
vectors_test = model.getVectors().cache()
vectors_test.show()

In [None]:
# Fit Word2Vec on entire dataframe (original tokenization with Tokenizer in preprocessing)
word2Vec = Word2Vec(vectorSize=32, seed=42, inputCol="lyrics_words", outputCol="model")
word2Vec.setMaxIter(10)
model = word2Vec.fit(df_lyrics)

In [None]:
word_vectors = model.getVectors().cache()

In [None]:
word_vectors.show(10)

The words created by the Tokenizer during the preprocessing stage end up giving us some strange results (see DataFrame above). Instead, we'll grab the lyrics from the DataFrame and directly extract our own words using the **lyrics2words** function below.

In [None]:
# Function to clean words
def lyrics2words(lyrics):
    chars = [c for c in lyrics.lower() if c.isalnum() or c.isspace()]
    return ''.join(chars).split()

To make sure that our function can handle non-English characters, we inspect the following:

In [None]:
'ã'.isalnum()

In [None]:
'諺文'.isalnum()
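
As a quick sanity check, the function can be run on a made-up string (the sample text is an assumption for illustration, not taken from the dataset). Note that punctuation is deleted rather than replaced with a space, so contractions and hyphenated words are merged into a single token.

```python
def lyrics2words(lyrics):
    # Lowercase, keep only alphanumeric characters and whitespace, then split on whitespace
    chars = [c for c in lyrics.lower() if c.isalnum() or c.isspace()]
    return ''.join(chars).split()

# Hypothetical sample with an apostrophe, a hyphen, accented and non-Latin characters
sample = "Don't stop, São Paulo!\nHip-hop 諺文"
words = lyrics2words(sample)
print(words)
```

If merged tokens such as `dont` or `hiphop` are undesirable, one option would be to replace punctuation with a space instead of dropping it; that is a design choice this notebook does not make.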

Next, we keep only the songs that have lyrics and apply the lyrics2words function to extract a words column.

In [None]:
# Distinct rows for songs with lyrics (filter first, because lyrics_words is not among the selected columns)
songs_w_lyrics = df.where(df['lyrics_words'].isNotNull()).select("track_uri", "lyrics").distinct().cache()
songs_w_lyrics.show(5)

In [None]:
udf_lyrics2songs = udf(lambda x: lyrics2words(x), returnType=ArrayType(StringType()))
songs_df = songs_w_lyrics.select("track_uri", "lyrics", udf_lyrics2songs(songs_w_lyrics['lyrics']).alias("words")).cache()

In [None]:
songs_df.show(5)

In [None]:
# Fit Word2Vec on the newly extracted words column
word2Vec = Word2Vec(vectorSize=32, seed=42, inputCol="words", outputCol="model")
word2Vec.setMaxIter(10)
model = word2Vec.fit(songs_df)

In [None]:
word_vectors = model.getVectors().cache()

In [None]:
word_vectors.show(5)

In [None]:
word_vectors.count()

In [None]:
word_vectors.printSchema()

In [None]:
# Create UDF to convert the vector to an array for storage in MongoDB
vec2arr = udf(lambda v: v.toArray().tolist(), ArrayType(FloatType()))

In [None]:
wordvec_df = word_vectors.select("word", vec2arr(word_vectors["vector"]).alias("vector"))

In [None]:
wordvec_df.show(5)

In [None]:
wordvec_df.printSchema()

Because we have already saved the word vectors to MongoDB, the code below is commented out. Uncomment it to save them the first time.

In [None]:
# Uncomment to save word, vector pairs in mongodb
# collection = 'word_vectors'
# connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
# wordvec_df.write.format("mongo").option("uri", connection_string).mode("append").save()

### Computing lyrics embeddings

Below we compute the lyrical embedding for each song as the average of the embeddings of all words in its lyrics.
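
To make the averaging concrete, here is a minimal pure-Python sketch on a toy vocabulary. The function name `average_embedding`, the two-dimensional vectors, and the choice to return `None` only when no word is in the vocabulary are assumptions for illustration; the real pipeline uses the 32-dimensional Word2Vec vectors.

```python
def average_embedding(words, word_vecs):
    """Element-wise mean of the vectors of all in-vocabulary words, or None if there are none."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Hypothetical 2-dimensional word vectors
toy_vecs = {"love": [1.0, 0.0], "you": [0.0, 1.0], "baby": [1.0, 1.0]}

# "zzz" is out of vocabulary and is simply skipped
emb = average_embedding(["love", "you", "baby", "zzz"], toy_vecs)
print(emb)

# A song whose words are all out of vocabulary gets no embedding
print(average_embedding(["zzz"], toy_vecs))
```

Songs without any in-vocabulary word end up with a null embedding in Spark, which is one way nulls can enter later stages.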

In [None]:
# Get word vector dictionary for faster processing
word_vec_dict = wordvec_df.toPandas().set_index("word").to_dict()
word_vecs = word_vec_dict['vector']
word_vecs

In [None]:
word_vecs

In [None]:
songs_df.show(5)

In [None]:
word_vecs['issues']

In [None]:
# Compute average embedding for each song
import numpy as np

def get_song_avg(words, word_vecs):
    all_vecs = [word_vecs[w] for w in words if w in word_vecs]
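    # Average whenever at least one word is in the vocabulary; otherwise return None (null in Spark)
    if all_vecs:
        return np.mean(all_vecs, axis=0).tolist()
    return None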

words2songvec = udf(lambda row: get_song_avg(row, word_vecs), returnType=ArrayType(FloatType()))

song_embs = songs_df.select("track_uri", "lyrics", "words", words2songvec(songs_df['words']).alias("embedding"))

In [None]:
song_embs.show(5)

In [None]:
song_embs.count()

In [None]:
# Uncomment below to write lyrics vectors to new collection in MongoDB
# collection = 'lyrics_vectors'
# connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
# song_embs.write.format("mongo").option("uri", connection_string).mode("append").save()

## Computing lyrical distance between songs

In an effort to understand how similar two songs are *lyrically*, we will take the Euclidean distance between the lyrical embeddings for all pairs of songs in the dataset. We will later compare this distance to the distance between the corresponding song embeddings, which were computed from the songs' co-occurrence in playlists. In this section we take the following steps:

* Read lyrical embeddings DataFrame back from MongoDB
* Get new dataframe with all pairs of songs
* Merge this dataframe with the lyrical embeddings of each song
* Calculate the distance between the two lyrical embeddings for each row

In [None]:
# Read lyrical embeddings from MongoDB
collection = 'lyrics_vectors'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
lyrics_df = spark.read.format('mongo').option("uri", connection_string).load()

In [None]:
lyrics_df.show(5)

In [None]:
lyrics_df = lyrics_df.select("track_uri", "embedding")
lyrics_df.show(5)

In [None]:
lyrics_df.count()

In [None]:
# Get all song, song pairs in dataframe
pairs = lyrics_df.alias("left")\
         .join(lyrics_df.alias("right"), col("left.track_uri") < col("right.track_uri"), "inner")\
         .select(col("left.track_uri").alias("track_1"), col("right.track_uri").alias("track_2"))
pairs.count()

Initially, I attempted to run the analysis on all song pairs in the DataFrame above (about 9.5 billion observations). This proved too slow and ultimately too expensive. Instead, we will proceed by sampling 0.1% of song pairs, which still gives us a sizeable dataset of about 9.5 million pairs.
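
The arithmetic behind these numbers can be checked with a short sketch. The four-element `uris` list is a made-up example; the figure of 138,000 songs is the approximate count quoted earlier, so the result only roughly matches the exact pair count printed by the notebook.

```python
from itertools import combinations
from math import comb

# A self-join on "left < right" keeps each unordered pair exactly once
uris = ["a", "b", "c", "d"]
pairs_explicit = [(x, y) for x in uris for y in uris if x < y]
print(pairs_explicit == list(combinations(sorted(uris), 2)))

# n songs give n * (n - 1) / 2 unordered pairs
n_pairs_138k = comb(138_000, 2)
print(n_pairs_138k)

# Expected size of a 0.1% sample of the pair count used in the notebook
expected_sample = round(0.001 * 9_538_912_269)
print(expected_sample)
```

Spark's `sample` draws each row independently, so the actual sample size will be close to, but generally not exactly, this expected value.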

In [None]:
# Running the analysis on ~9.5 billion song pairs is too slow, cumbersome, and expensive
# Instead, sub-sample 0.1% of the pairs dataframe (about 10M pairs)
print(10_000_000 / 9_538_912_269)

pairs = pairs.sample(fraction=0.001, seed=42)
pairs.count()

In [None]:
pairs.show(5)

Next, we join our pairs dataframe with the lyrical embeddings dataframe to add the lyrical embeddings for each track.

In [None]:
# Add track 1's lyrical embedding
pairs_embs = pairs.alias("left")\
                 .join(lyrics_df.alias("right"), col("left.track_1") == col("right.track_uri"), "left")\
                 .select(col("left.track_1").alias("track_1"), col("left.track_2").alias("track_2"), 
                         col("right.embedding").alias("track_1_lyrics_emb"))

In [None]:
# Add track 2's lyrical embedding
pairs_embs = pairs_embs.alias("left")\
                 .join(lyrics_df.alias("right"), col("left.track_2") == col("right.track_uri"), "left")\
                 .select(col("left.track_1").alias("track_1"), col("left.track_2").alias("track_2"), 
                         col("left.track_1_lyrics_emb").alias("track_1_lyrics_emb"), col("right.embedding").alias("track_2_lyrics_emb"))
pairs_embs.show(10)

In [None]:
pairs_embs.printSchema()

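The schema above marks the lyrics embedding columns as nullable. Nulls can genuinely occur here: `get_song_avg` returns `None` for a song when too few of its words appear in the Word2Vec vocabulary (Word2Vec drops words that occur fewer than `minCount` times, 5 by default), and those nulls are stored in and read back from MongoDB. We verify whether there are in fact nulls below: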

In [None]:
pairs_embs.filter("track_1_lyrics_emb is NULL").show()

In [None]:
# Drop rows with nulls in any column
pairs_embs = pairs_embs.na.drop("any")
pairs_embs.filter("track_1_lyrics_emb is NULL").show()

In [None]:
pairs_embs.filter("track_2_lyrics_emb is NULL").show()

In [None]:
# Check processed data size
pairs_embs.count()

In [None]:
pairs_embs.show(10)

Victory! We were able to remove the null values and still maintain 9.46 million rows. Below, we calculate the distance between each pair's lyrics embeddings.
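
For reference, the Euclidean distance between two embeddings is the square root of the sum of squared coordinate differences. The sketch below is a pure-Python illustration with a hypothetical helper name, `euclidean_distance`, and toy two-dimensional inputs; it returns `None` for missing or mismatched inputs, which is an assumption about how bad rows should be treated.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length lists, or None if an input is unusable."""
    if a is None or b is None or len(a) != len(b):
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Classic 3-4-5 right triangle
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))

# Identical embeddings are at distance zero
print(euclidean_distance([1.0, 2.0], [1.0, 2.0]))

# A missing embedding yields None instead of a number
print(euclidean_distance(None, [1.0, 2.0]))
```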

In [None]:
import numpy as np

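def calculate_distance(arr_1, arr_2):
    # Euclidean distance between two equal-length lists; None if either input is missing or malformed.
    # Returning None (rather than a numeric sentinel) lets na.drop() remove bad rows downstream.
    if arr_1 is None or arr_2 is None:
        return None
    try:
        arr_1, arr_2 = np.array(arr_1, dtype=float), np.array(arr_2, dtype=float)
        return float(np.linalg.norm(arr_1 - arr_2))
    except (TypeError, ValueError):
        return None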

# Define distance UDF and calculate lyrics distance --> new dataframe
distance = udf(lambda x, y: calculate_distance(x, y), FloatType())
lyrics_dist = pairs_embs.select("track_1", "track_2", 
                                distance(pairs_embs["track_1_lyrics_emb"],
                                         pairs_embs["track_2_lyrics_emb"]).alias("lyrics_distance")).cache()
lyrics_dist.show(10)

In [None]:
lyrics_dist.count()

## Load song embeddings from MongoDB

Previously, Chandrish computed song embeddings using Word2Vec, which he saved in MongoDB. We will load this into a Spark DataFrame and join with the lyrics distance DataFrame.

In [None]:
# Read song vectors from MongoDB
collection = 'song_vectors'
connection_string = f"mongodb+srv://{user_name}:{password}@{address}/{database}.{collection}"
song_embs = spark.read.format('mongo').option("uri", connection_string).load().select("track_uri", "values")
song_embs.show(5)

In [None]:
# Join lyrics distances with song embeddings (track 1)
combined = lyrics_dist.alias("left")\
                .join(song_embs.alias("right"), col("left.track_1") == col("right.track_uri"), "left")\
                .select(col("left.track_1").alias("track_1"), col("left.track_2").alias("track_2"),
                        col("left.lyrics_distance").alias("lyrics_distance"), col("right.values").alias("track_1_song_emb"))
combined.show(5)

In [None]:
# Join with track 2 embedding
combined = combined.alias("left")\
                .join(song_embs.alias("right"), col("left.track_2") == col("right.track_uri"), "left")\
                .select(col("left.track_1").alias("track_1"), 
                        col("left.track_2").alias("track_2"),
                        col("left.lyrics_distance").alias("lyrics_distance"),
                        col("left.track_1_song_emb").alias("track_1_song_emb"),
                        col("right.values").alias("track_2_song_emb"))
combined.show(5)

In [None]:
# Compute song vector distances
ml_df = combined.select("track_1", "track_2", "lyrics_distance",
                          distance(combined["track_1_song_emb"],
                                   combined["track_2_song_emb"]).alias("song_distance")).cache()
ml_df.show()

In [None]:
ml_df.count()

In [None]:
# Drop any NAs that may exist
ml_df = ml_df.na.drop("any").cache()
ml_df.count()

In [None]:
ml_df.show()

## Linear regression

Finally, we're able to run the linear regression to see how much variance in song distance can be explained by changes in lyrics distance (or how much of song similarity is explained by lyrics similarity).
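
As a reminder of what "variance explained" means, the sketch below fits a one-feature least-squares line on toy data and computes R² by hand. The data and helper names (`fit_line`, `r_squared`, `pearson`) are made up for illustration. For a single feature with an intercept, the in-sample R² equals the squared Pearson correlation; on a held-out validation set, as used in this notebook, that identity holds only approximately.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(ys, preds):
    """1 - (residual sum of squares / total sum of squares)."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy stand-ins for lyrics_distance (xs) and song_distance (ys)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
r2 = r_squared(ys, preds)
corr = pearson(xs, ys)
print(slope, intercept, r2, corr ** 2)
```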

In [None]:
# Create training and test data
splits = ml_df.randomSplit([0.8, 0.2], 1)
train = splits[0].cache()
valid = splits[1].cache()

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

va = VectorAssembler(outputCol="features", inputCols=["lyrics_distance"])
train_va = va.transform(train).select("features", "song_distance").withColumnRenamed("song_distance", "label").cache()
valid_va = va.transform(valid).select("features", "song_distance").withColumnRenamed("song_distance", "label").cache()

In [None]:
train_va.show()

In [None]:
# Fit linear regression to data
lr = LinearRegression(regParam=0.0)
model = lr.fit(train_va)

In [None]:
# Get validation predictions
valid_predictions = model.transform(valid_va)

In [None]:
valid_predictions.show()

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator()
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(valid_predictions)))

In [None]:
evaluator.setMetricName("r2")
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(valid_predictions)))

In [None]:
# Scale and run regression again
from pyspark.ml.feature import MinMaxScaler

mmScaler = MinMaxScaler(outputCol="lyrics_dist_scaled", inputCol="features")

lyrics_scaler = mmScaler.fit(train_va)

In [None]:
print("Original lyrics (min, max):", (lyrics_scaler.originalMin, lyrics_scaler.originalMax))

In [None]:
row1 = ml_df.agg({"song_distance": "max"}).collect()[0]
print(row1)

In [None]:
row1 = ml_df.agg({"song_distance": "min"}).collect()[0]
print(row1)

In [None]:
train_va_scaled = lyrics_scaler.transform(train_va)
train_va_scaled.show()

In [None]:
train_va_scaled = train_va_scaled.select("lyrics_dist_scaled", "label")\
                                 .withColumnRenamed("lyrics_dist_scaled", "features")\
                                 .cache()

In [None]:
train_va_scaled.show()

In [None]:
# Fit linear regression to *scaled* data
lr = LinearRegression(regParam=0.0)
model = lr.fit(train_va_scaled)

In [None]:
valid_va_scaled = lyrics_scaler.transform(valid_va)\
                               .select("lyrics_dist_scaled", "label")\
                               .withColumnRenamed("lyrics_dist_scaled", "features")\
                               .cache()

In [None]:
valid_preds_scaled = model.transform(valid_va_scaled)

In [None]:
# Get results
evaluator = RegressionEvaluator()
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(valid_preds_scaled)))
evaluator.setMetricName("r2")
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(valid_preds_scaled)))

So far, linear regression suggests little to no relationship for either the scaled or the unscaled lyrical distances. Next, we fit with a Huber loss function to see whether a more outlier-robust fit gives better results.
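
It is worth noting why rescaling alone should not be expected to help. Min-max scaling is an affine map of the single feature, and an unregularized least-squares line with an intercept simply absorbs it into the slope and intercept, leaving the fitted values and R² unchanged. The sketch below checks this on made-up data; `ols_r2` and `min_max` are hypothetical helpers, not part of the Spark pipeline.

```python
def ols_r2(xs, ys):
    """In-sample R^2 of a one-feature least-squares fit with intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def min_max(xs):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Toy feature and label values
xs = [0.5, 2.0, 3.5, 7.0, 9.0]
ys = [1.0, 0.2, 2.5, 1.1, 3.0]

r2_raw = ols_r2(xs, ys)
r2_scaled = ols_r2(min_max(xs), ys)
print(r2_raw, r2_scaled)
```

Scaling matters more when regularization is applied or when several features on different scales are combined, neither of which is the case here.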

In [None]:
# Fit with huber loss
lr_huber = LinearRegression(regParam=0.0, loss="huber")
model_huber = lr_huber.fit(train_va_scaled)

In [None]:
# Predict on validation
huber_preds = model_huber.transform(valid_va_scaled)
# Evaluate and print results
evaluator = RegressionEvaluator()
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(huber_preds)))
evaluator.setMetricName("r2")
print(evaluator.getMetricName() + ": " + str(evaluator.evaluate(huber_preds)))
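
For intuition about what the Huber loss changes, here is the textbook form: quadratic for small residuals and linear for large ones, so outliers pull on the fit less than under squared error. The threshold `delta=1.35` mirrors the default of Spark's `epsilon` parameter; to my understanding Spark's objective also estimates a scale parameter jointly, so this sketch is the standard definition rather than Spark's exact objective.

```python
def huber_loss(residual, delta=1.35):
    """Textbook Huber loss: quadratic within +/- delta, linear outside."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

def squared_loss(residual):
    """Half squared error, for comparison."""
    return 0.5 * residual ** 2

# Small residual: both losses agree; large residual: Huber grows only linearly
for r in (1.0, 10.0):
    print(r, huber_loss(r), squared_loss(r))
```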

## Conclusion

The lyrics distances account for only about 3% of the variance in the distance between songs, which is a very weak relationship. This makes intuitive sense when you consider the range of themes, dialects, and languages that often coexist in the same playlist.

Future work on this question could weight the word embeddings by TF-IDF score when averaging them into lyrical embeddings. Tree-based models or neighbor-based models (e.g., KNN regressors) could also be tried in case the relationship between lyrics and song placement is not linear.
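
A rough sketch of what TF-IDF-weighted averaging could look like is given below, in pure Python on toy data. Everything here is an assumption for illustration: the helper names `idf_weights` and `weighted_embedding`, the smoothed IDF formula (one common variant among several), and the two-dimensional vectors. Term frequency enters implicitly because each occurrence of a word contributes its own weighted vector.

```python
import math

def idf_weights(docs):
    """Smoothed inverse document frequency for every word seen in docs."""
    n = len(docs)
    doc_freq = {}
    for words in docs:
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log((n + 1) / (c + 1)) + 1 for w, c in doc_freq.items()}

def weighted_embedding(words, word_vecs, idf):
    """IDF-weighted mean of word vectors; None if no word carries weight."""
    pairs = [(idf.get(w, 0.0), word_vecs[w]) for w in words if w in word_vecs]
    total = sum(wt for wt, _ in pairs)
    if total == 0:
        return None
    dim = len(pairs[0][1])
    return [sum(wt * v[i] for wt, v in pairs) / total for i in range(dim)]

# Toy corpus: "love" appears in every song, "rain" in only one
docs = [["love", "you"], ["love", "baby"], ["love", "rain"]]
toy_vecs = {"love": [1.0, 0.0], "rain": [0.0, 1.0]}

idf = idf_weights(docs)
emb = weighted_embedding(["love", "rain"], toy_vecs, idf)
print(idf["love"], idf["rain"], emb)
```

Compared with the plain average, the rarer word "rain" pulls the embedding further toward its own vector, which is the intended effect of down-weighting words that appear in nearly every song.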

In truth, the lyrics embedding should be fed as input to a much more sophisticated model that takes into account other metrics in order to produce anything meaningful.

In [None]:
spark.stop()