In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession \
    .builder \
    .appName("SongClassification") \
    .getOrCreate()

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/d/Hadoop/spark-3.1.3-bin-hadoop3.2/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/d/Hadoop/hadoop-3.3.3/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2022-08-25 09:58:51,990 WARN util.Utils: Your hostname, DESKTOP-RUMNQVP resolves to a loopback address: 127.0.1.1; using 172.30.39.228 instead (on interface eth0)
2022-08-25 09:58:51,990 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
2022-08-25 09:58:54,446 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(new

## 1. Preparing our dataset
<p><em>These recommendations are so on point! How does this playlist know me so well?</em></p>
<p><img src="https://assets.datacamp.com/production/project_449/img/iphone_music.jpg" alt="Project Image Record" width="600px"></p>
<p>Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. But at the same time, the sheer amount of music on offer can mean users might be a bit overwhelmed when trying to look for newer music that suits their tastes.</p>
<p>For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics. Today, we'll be examining data compiled by a research group known as The Echo Nest. Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will learn how to clean our data, do some exploratory data visualization, and use feature reduction towards the goal of feeding our data through some simple machine learning algorithms, such as decision trees and logistic regression.</p>
<p>To begin with, let's load the metadata about our tracks alongside the track metrics compiled by The Echo Nest. A song is about more than its title, artist, and number of listens. We have another dataset that has musical features of each track such as <code>danceability</code> and <code>acousticness</code> on a scale from -1 to 1. These exist in two different files, which are in different formats - CSV and JSON. While CSV is a popular file format for denoting tabular data, JSON is another common file format in which databases often return the results of a given query.</p>
<p>Let's start by creating two pandas <code>DataFrames</code> out of these files that we can merge so we have features and labels (often also referred to as <code>X</code> and <code>y</code>) for the classification later on.</p>

In [9]:
import pandas as pd

# Read in track metadata with genre labels
tracks = pd.read_csv("datasets/fma-rock-vs-hiphop.csv")

# Read in track metrics with the features
echonest_metrics = pd.read_json("datasets/echonest-metrics.json",precise_float=True)

# Merge the relevant columns of tracks and echonest_metrics
echo_tracks = echonest_metrics.merge(right=tracks[['track_id','genre_top']], on="track_id")

# Inspect the resultant dataframe
echo_tracks.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4802 entries, 0 to 4801
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          4802 non-null   int64  
 1   acousticness      4802 non-null   float64
 2   danceability      4802 non-null   float64
 3   energy            4802 non-null   float64
 4   instrumentalness  4802 non-null   float64
 5   liveness          4802 non-null   float64
 6   speechiness       4802 non-null   float64
 7   tempo             4802 non-null   float64
 8   valence           4802 non-null   float64
 9   genre_top         4802 non-null   object 
dtypes: float64(8), int64(1), object(1)
memory usage: 412.7+ KB


In [18]:
# Convert the current pandas dataframe to pyspark dataframe
spark_tracks = spark.createDataFrame(echo_tracks)

spark_tracks.printSchema()
spark_tracks.show()

root
 |-- track_id: long (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- genre_top: string (nullable = true)

+--------+------------+------------+------------+----------------+------------+------------+-------+------------+---------+
|track_id|acousticness|danceability|      energy|instrumentalness|    liveness| speechiness|  tempo|     valence|genre_top|
+--------+------------+------------+------------+----------------+------------+------------+-------+------------+---------+
|       2|0.4166752327|0.6758939853|0.6344762684|    0.0106280683|0.1776465712|0.1593100648|165.922| 0.576660988|  Hip-Hop|
|       3|0.3744077685|0.5286430621|0.8174611317|    0.0018511032|0.1058799438|0.46181

In [36]:
# Filter interested collumns 
spark_tracks = spark_tracks.drop('track_id')

spark_tracks.printSchema()

root
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- genre_top: string (nullable = true)



## 2. Pairwise relationships between continuous variables
<p>We typically want to avoid using variables that have strong correlations with each other -- hence avoiding feature redundancy -- for a few reasons:</p>
<ul>
<li>To keep the model simple and improve interpretability (with many features, we run the risk of overfitting).</li>
<li>When our datasets are very large, using fewer features can drastically speed up our computation time.</li>
</ul>
<p>To get a sense of whether there are any strongly correlated features in our data, we will use built-in functions in the <code>pandas</code> package.</p>

In [20]:
# Create a correlation matrix
corr_metrics = echo_tracks.corr()
corr_metrics

Unnamed: 0,track_id,acousticness,danceability,energy,instrumentalness,liveness,speechiness,tempo,valence
track_id,1.0,-0.372282,0.049454,0.140703,-0.275623,0.048231,-0.026995,-0.025392,0.01007
acousticness,-0.372282,1.0,-0.028954,-0.281619,0.19478,-0.019991,0.072204,-0.02631,-0.013841
danceability,0.049454,-0.028954,1.0,-0.242032,-0.255217,-0.106584,0.276206,-0.242089,0.473165
energy,0.140703,-0.281619,-0.242032,1.0,0.028238,0.113331,-0.109983,0.195227,0.038603
instrumentalness,-0.275623,0.19478,-0.255217,0.028238,1.0,-0.091022,-0.366762,0.022215,-0.219967
liveness,0.048231,-0.019991,-0.106584,0.113331,-0.091022,1.0,0.041173,0.002732,-0.045093
speechiness,-0.026995,0.072204,0.276206,-0.109983,-0.366762,0.041173,1.0,0.008241,0.149894
tempo,-0.025392,-0.02631,-0.242089,0.195227,0.022215,0.002732,0.008241,1.0,0.052221
valence,0.01007,-0.013841,0.473165,0.038603,-0.219967,-0.045093,0.149894,0.052221,1.0


## 3. Splitting our data
<p>As mentioned earlier, it can be particularly useful to simplify our models and use as few features as necessary to achieve the best result. Since we didn't find any particularly strong correlations between our features, we can now split our data into an array containing our features, and another containing the labels - the genre of the track. </p>
<p>Once we have split the data into these arrays, we will perform some preprocessing steps to optimize our model development.</p>

In [119]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=['acousticness',
               'danceability',
               'energy',
               'instrumentalness',
               'liveness',
               'speechiness',
               'tempo',
               'valence'],
    outputCol='features'
)


tracks_assembled = assembler.transform(spark_tracks)

tracks_assembled = tracks_assembled.select('features','genre_top')

tracks_assembled.show(5)

tracks_train, tracks_test = tracks_assembled.randomSplit([.75,.25],seed=10)

tracks_train.show(5)
tracks_test.show(5)

+--------------------+---------+
|            features|genre_top|
+--------------------+---------+
|[0.4166752327,0.6...|  Hip-Hop|
|[0.3744077685,0.5...|  Hip-Hop|
|[0.0435668989,0.7...|  Hip-Hop|
|[0.4522173071,0.5...|  Hip-Hop|
|[0.9883055496,0.2...|     Rock|
+--------------------+---------+
only showing top 5 rows

+--------------------+---------+
|            features|genre_top|
+--------------------+---------+
|[7.8526E-6,0.2510...|     Rock|
|[2.58553E-5,0.376...|     Rock|
|[3.66041E-5,0.267...|     Rock|
|[3.87096E-5,0.397...|     Rock|
|[6.41779E-5,0.287...|     Rock|
+--------------------+---------+
only showing top 5 rows

+--------------------+---------+
|            features|genre_top|
+--------------------+---------+
|[1.26439E-5,0.433...|     Rock|
|[2.73313E-5,0.446...|     Rock|
|[5.82508E-5,0.245...|     Rock|
|[1.543012E-4,0.26...|     Rock|
|[2.430156E-4,0.22...|     Rock|
+--------------------+---------+
only showing top 5 rows



## 4. Normalizing the feature data
<p>As mentioned earlier, it can be particularly useful to simplify our models and use as few features as necessary to achieve the best result. Since we didn't find any particular strong correlations between our features, we can instead use a common approach to reduce the number of features called <strong>principal component analysis (PCA)</strong>. </p>
<p>It is possible that the variance between genres can be explained by just a few features in the dataset. PCA rotates the data along the axis of highest variance, thus allowing us to determine the relative contribution of each feature of our data towards the variance between classes. </p>
<p>However, since PCA uses the absolute variance of a feature to rotate the data, a feature with a broader range of values will overpower and bias the algorithm relative to the other features. To avoid this, we must first normalize our train and test features. There are a few methods to do this, but a common way is through <em>standardization</em>, such that all features have a mean = 0 and standard deviation = 1 (the resultant is a z-score). </p>

In [120]:
from pyspark.ml.feature import StandardScaler

# Scale the features and set the values to a new variable
scaler = StandardScaler(inputCol='features',outputCol="scaler_features")

scaler_model = scaler.fit(tracks_train)

tracks_train_scaler = scaler_model.transform(tracks_train)

tracks_test_scaler = scaler_model.transform(tracks_test)

tracks_train_scaler.show(5)
tracks_test_scaler.show(5)

+--------------------+---------+--------------------+
|            features|genre_top|     scaler_features|
+--------------------+---------+--------------------+
|[7.8526E-6,0.2510...|     Rock|[2.13851778600045...|
|[2.58553E-5,0.376...|     Rock|[7.04123715869616...|
|[3.66041E-5,0.267...|     Rock|[9.96848418237770...|
|[3.87096E-5,0.397...|     Rock|[1.05418801529382...|
|[6.41779E-5,0.287...|     Rock|[1.74777246540201...|
+--------------------+---------+--------------------+
only showing top 5 rows

+--------------------+---------+--------------------+
|            features|genre_top|     scaler_features|
+--------------------+---------+--------------------+
|[1.26439E-5,0.433...|     Rock|[3.44334424705334...|
|[2.73313E-5,0.446...|     Rock|[7.44319985285308...|
|[5.82508E-5,0.245...|     Rock|[1.58635829978294...|
|[1.543012E-4,0.26...|     Rock|[4.20212236203567...|
|[2.430156E-4,0.22...|     Rock|[6.61810333998384...|
+--------------------+---------+--------------------+
onl

## 5. Projecting on to our features

<p>We can use 6 components to perform PCA and reduce the dimensionality of our train and test features.</p>

In [121]:
from pyspark.ml.feature import PCA

# Get our explained variance ratios from PCA using all features
pca = PCA(k=6,inputCol='scaler_features',outputCol='pca_features')

pca_model = pca.fit(tracks_train_scaler)

tracks_train_pca = pca_model.transform(tracks_train_scaler)

tracks_test_pca = pca_model.transform(tracks_test_scaler)

tracks_train_pca.show(5)
tracks_test_pca.show(5)

+--------------------+---------+--------------------+--------------------+
|            features|genre_top|     scaler_features|        pca_features|
+--------------------+---------+--------------------+--------------------+
|[7.8526E-6,0.2510...|     Rock|[2.13851778600045...|[0.29384819551725...|
|[2.58553E-5,0.376...|     Rock|[7.04123715869616...|[1.07031586272625...|
|[3.66041E-5,0.267...|     Rock|[9.96848418237770...|[1.13362984781904...|
|[3.87096E-5,0.397...|     Rock|[1.05418801529382...|[-0.0674061346715...|
|[6.41779E-5,0.287...|     Rock|[1.74777246540201...|[1.23401135254151...|
+--------------------+---------+--------------------+--------------------+
only showing top 5 rows

+--------------------+---------+--------------------+--------------------+
|            features|genre_top|     scaler_features|        pca_features|
+--------------------+---------+--------------------+--------------------+
|[1.26439E-5,0.433...|     Rock|[3.44334424705334...|[1.11301714124127...|


## 6. Encode labels

Let's use StringIndexer function to replace string to number, because to make the prediction model we can only work with numbers

In [126]:
from pyspark.ml.feature import StringIndexer

# Get our explained variance ratios from PCA using all features
label_indexer = StringIndexer(inputCol='genre_top',outputCol='genre_index')

label_indexer_model = label_indexer.fit(tracks_train_pca)

tracks_train_indexed = label_indexer_model.transform(tracks_train_pca)

tracks_test_indexed = label_indexer_model.transform(tracks_test_pca)

tracks_train_indexed.select('genre_top','genre_index').distinct().show()
tracks_test_indexed.select('genre_top','genre_index').distinct().show()

+---------+-----------+
|genre_top|genre_index|
+---------+-----------+
|  Hip-Hop|        1.0|
|     Rock|        0.0|
+---------+-----------+

+---------+-----------+
|genre_top|genre_index|
+---------+-----------+
|  Hip-Hop|        1.0|
|     Rock|        0.0|
+---------+-----------+



## 7. Train a decision tree to classify genre
<p>Now we can use the lower dimensional PCA projection of the data to classify songs into genres. </p>
<p>Here, we will be using a simple algorithm known as a decision tree. Decision trees are rule-based classifiers that take in features and follow a 'tree structure' of binary decisions to ultimately classify a data point into one of two or more categories. In addition to being easy to both use and interpret, decision trees allow us to visualize the 'logic flowchart' that the model generates from the training data.</p>
<p>Here is an example of a decision tree that demonstrates the process by which an input image (in this case, of a shape) might be classified based on the number of sides it has and whether it is rotated.</p>
<p><img src="https://assets.datacamp.com/production/project_449/img/simple_decision_tree.png" alt="Decision Tree Flow Chart Example" width="350px"></p>

In [138]:
from pyspark.ml.classification import DecisionTreeClassifier

tree = DecisionTreeClassifier(
            featuresCol='pca_features',
            labelCol='genre_index',
            predictionCol='pred_tree')

tree_model = tree.fit(tracks_train_indexed)

pred_tree = tree_model.transform(tracks_test_indexed)

## 8. Compare our decision tree to a logistic regression
<p>Although our tree's performance is decent, it's a bad idea to immediately assume that it's therefore the perfect tool for this job -- there's always the possibility of other models that will perform even better! It's always a worthwhile idea to at least test a few other algorithms and find the one that's best for our data.</p>
<p>Sometimes simplest is best, and so we will start by applying <strong>logistic regression</strong>. Logistic regression makes use of what's called the logistic function to calculate the odds that a given data point belongs to a given class. Once we have both models, we can compare them on a few performance metrics, such as false positive and false negative rate (or how many points are inaccurately classified). </p>

In [159]:
from pyspark.ml.classification import LogisticRegression

logreg = LogisticRegression(
            featuresCol='pca_features',
            labelCol='genre_index',
            predictionCol='pred_logreg')

logreg_model = logreg.fit(tracks_train_indexed)

pred_logreg = logreg_model.transform(tracks_test_indexed)

In [165]:
def evaluator(data,pred_col):
    TN = data.filter(f'{pred_col} = 0 AND genre_index = {pred_col}').count()
    TP = data.filter(f'{pred_col} = 1 AND genre_index = {pred_col}').count()
    FN = data.filter(f'{pred_col} = 0 AND genre_index != {pred_col}').count()
    FP = data.filter(f'{pred_col} = 1 AND genre_index != {pred_col}').count()
    print(f'Precision = {TP/(TP/FP)}')
    print(f'Recall = {TP/(TP/FN)}')
    print(f'Accuracy = {(TP+TN)/(TP+TN+FP+FN)}')

In [166]:
print('Decision Tree Classifier results:')
evaluator(pred_tree,'pred_tree')

print('\nLogistic Regression results:')
evaluator(pred_logreg,'pred_logreg')

Decision Tree Classifier results:
Precision = 44.0
Recall = 87.0
Accuracy = 0.889451476793249

Logistic Regression results:
Precision = 43.0
Recall = 101.99999999999999
Accuracy = 0.8776371308016878
