<h1>Spotify Song Classification by Camilo and Joy</h1>

In this notebook, we implement a logistic regression model to determine whether a song is suitable for listening to while reading. Our training and validation data consist of a "positive label" dataframe (songs good for reading) and a "negative label" dataframe (songs not good for reading). The positive dataframe is compiled from Spotify playlists that were curated by Spotify or by Spotify users specifically for reading and that have received hundreds of thousands of likes. Examples include 'The Ultimate Reading Playlist', 'Reading Chill Out', and 'Quiet Music for Reading'. The negative dataframe is built the same way from playlists such as 'Unlistenable: The World's Worst Playlist' and 'Worst Songs Ever Heard'. Each record has audio features such as danceability, energy, key, loudness, mode, speechiness, instrumentalness, liveness, valence, and tempo.

In [0]:

from pyspark.sql import SparkSession

mongo_username = "chuang86"
mongo_password = "msds697"
mongo_ip_address = "msds697-cluster.sig8e.mongodb.net"
database_name = "test_db"
collection_name = "SpotifyPositive"
collection_name2 = "SpotifyNegative"
collection_name3 = "Spotify_Batches"

# Build one URI per collection from the settings above instead of repeating the literal string.
base_uri = f"mongodb+srv://{mongo_username}:{mongo_password}@{mongo_ip_address}/{database_name}"
connection_string = f"{base_uri}.{collection_name}"
connection_string2 = f"{base_uri}.{collection_name2}"
connection_string3 = f"{base_uri}.{collection_name3}"

# Name the session `spark` so that the reads below refer to a defined variable.
spark = SparkSession.builder.getOrCreate()

# Load each collection through the MongoDB Spark connector.
df_positive = spark.read.format("mongo").option("uri", connection_string).load()
df_negative = spark.read.format("mongo").option("uri", connection_string2).load()
df_test = spark.read.format("mongo").option("uri", connection_string3).load()
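
The connection strings above embed the username and password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are exported as environment variables: the names `MONGO_USERNAME` and `MONGO_PASSWORD` and the helper `build_mongo_uri` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database, collection):
    """Build a mongodb+srv URI of the form used in this notebook (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or ':' do not break the URI.
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{host}/{database}.{collection}"


# Hypothetical environment variable names; the fallback values are placeholders.
username = os.environ.get("MONGO_USERNAME", "example_user")
password = os.environ.get("MONGO_PASSWORD", "example_password")

uri = build_mongo_uri(username, password, "msds697-cluster.sig8e.mongodb.net",
                      "test_db", "SpotifyPositive")
```

The resulting `uri` could then be passed to `.option("uri", uri)` in the same way as the connection strings in the cell above.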

In [0]:
df_positive.show()

In [0]:
df_negative.show()

In [0]:
df_test.show()

In [0]:
df_test.count()

In [0]:
df_negative.count()

In [0]:
df_positive.count()

<h2>Preprocessing & EDA</h2>

In [0]:
# Remove duplicate rows from each dataframe (the label column is added in the next cell).
df_positive = df_positive.distinct() 
df_negative = df_negative.distinct()
df_test = df_test.distinct()

In [0]:
from pyspark.sql.functions import lit, monotonically_increasing_id

# Label the songs: 1 = good for reading, 0 = not good for reading.
df_positive = df_positive.withColumn("target", lit(1))
df_negative = df_negative.withColumn("target", lit(0))

In [0]:
df_positive.show(1)

In [0]:
df_train = df_positive.union(df_negative)

In [0]:
df_train.printSchema()

In [0]:
# Check the most common artists, keys, and time signatures.
df_train.groupBy(df_train["artist"]).count().orderBy("count",ascending=False).show()
df_train.groupBy(df_train["key"]).count().orderBy("count",ascending=False).show()
df_train.groupBy(df_train["time_signature"]).count().orderBy("count",ascending=False).show()


#df_train.groupBy(df_train["artist"]).avg('energy').orderBy("energy",ascending=False).show()

In [0]:
# Remove rows that contain null values. dropna() returns a new DataFrame, so reassign it.
df_train = df_train.dropna()

<h3> Create Feature Vector </h3>

In [0]:
# Merging the data with Vector Assembler.
from pyspark.ml.feature import VectorAssembler
input_cols= ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

# VectorAssembler takes a list of column names (inputCols) and an output column name (outputCol),
# and transforms a DataFrame by assembling the values in inputCols into a single vector column named outputCol.
va = VectorAssembler(outputCol="features", inputCols=input_cols)
# lpoints: labeled data.
lpoints = va.transform(df_train).select("features", "target").withColumnRenamed("target", "label")

In [0]:
lpoints.filter('label==0').show()

In [0]:
lpoints_test = va.transform(df_test).select("features")

In [0]:
# Divide the dataset into training and validation sets.
splits = lpoints.randomSplit([0.8, 0.2])

# cache(): the algorithm is iterative, so the training and validation sets are reused many times.
spoti_train = splits[0].cache()
spoti_valid = splits[1].cache()
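
`randomSplit` assigns rows to the splits at random, so the 80/20 proportions are approximate, and the split changes between runs unless a seed is supplied (`randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates seeded, per-row assignment. It is a conceptual illustration with a hypothetical helper `seeded_split`, not Spark's actual implementation.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or validation independently, using a seeded generator."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid


rows = list(range(1000))  # stand-in for the labeled rows
train_a, valid_a = seeded_split(rows, 0.8, seed=42)
train_b, valid_b = seeded_split(rows, 0.8, seed=42)

print(len(train_a), len(valid_a))  # roughly 800 and 200, not exactly
print(train_a == train_b)          # the same seed reproduces the same split
```

Fixing the seed makes a validation score comparable across reruns, because the model is always evaluated on the same held-out rows.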

In [0]:
spoti_train

<h2>Train the model</h2>

In [0]:

from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(regParam=0.01, maxIter=1000, fitIntercept=True)
lrmodel = lr.fit(spoti_train)


In [0]:
print(lrmodel.coefficients)
print(lrmodel.intercept)
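
The coefficients and intercept printed above define the fitted model. For a feature vector x, binary logistic regression computes the score z = w·x + b and maps it to a probability with the sigmoid function. A minimal sketch in plain Python, using made-up weights and not the values fitted in this notebook:

```python
import math


def predict_probability(coefficients, intercept, features):
    """Return P(label = 1) for one feature vector under a logistic regression model."""
    # Linear score: dot product of weights and features, plus the intercept.
    z = sum(w * x for w, x in zip(coefficients, features)) + intercept
    # The sigmoid maps the score into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two features, for illustration only.
example_coefficients = [2.0, -1.0]
example_intercept = 0.5

p = predict_probability(example_coefficients, example_intercept, [0.5, 1.5])
print(p)  # z = 1.0 - 1.5 + 0.5 = 0.0, so the probability is 0.5
```

In the output of `lrmodel.transform`, the `probability` column holds this value for label 1 alongside its complement for label 0.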

In [0]:
validpredicts = lrmodel.transform(spoti_valid)
validpredicts.show()

<h2>Evaluate the model (AUC)</h2>

<h3>Analytical Goal Conclusion</h3>
The output of this model is intended as the first step of the Song-2-Book and Song-2-Article recommendation pipeline. According to the AUC score on the validation set (computed in the cell below), the model ranks songs suitable for reading above unsuitable ones with high probability. Downstream models will use our logistic regression model to filter their song data before mapping a song to a list of articles or books.

In [0]:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The default metric of BinaryClassificationEvaluator is areaUnderROC.
bceval = BinaryClassificationEvaluator()
print(f"{bceval.getMetricName()}: {bceval.evaluate(validpredicts)}")
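
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. A minimal sketch of that pairwise definition in plain Python, on toy scores that are not taken from this dataset (Spark computes the metric from the ROC curve, not with this loop):

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p_score in pos:
        for n_score in neg:
            if p_score > n_score:
                wins += 1.0   # positive ranked above negative
            elif p_score == n_score:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2]
print(pairwise_auc(toy_labels, toy_scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC suits a filtering step that depends on ranking songs by score.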

<h2>Predict on Test Set</h2>

In [0]:
predictions = lrmodel.transform(lpoints_test)

In [0]:
predictions.show()

In [0]:
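# Score the complete test dataframe so each prediction stays on the same row as its song.
# Matching rows across two dataframes by monotonically_increasing_id() is unreliable,
# because the generated IDs depend on partitioning and need not line up between dataframes.
scored_test = lrmodel.transform(va.transform(df_test))

In [0]:
# Add a row id column to the final output.
final = scored_test.withColumn('id', monotonically_increasing_id())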

In [0]:
final.show(1)

In [0]:
cols = ("features","rawPrediction","probability")

final1 = final.drop(*cols)

In [0]:
final1.show(2)