# Task 4 - Group 7
## April Qianyun Li - lastfm Preprocessing

#### Preprocessing Goals:
* Read the JSON files from all folders in the S3 bucket into a single DataFrame
* Extract the top 10 similar tracks and unpack their track IDs and similarity scores into separate columns
* Extract the top tag from the tags array
* Drop tracks with no tags
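
To make these goals concrete, here is a minimal plain-Python sketch of the same flattening applied to a single record. It assumes each `similars` entry is a `[track_id, score]` pair and each `tags` entry starts with the tag name. The helper `flatten_record` and the sample record are hypothetical and only illustrate the intended output shape, not the Spark job itself.

```python
def flatten_record(record, top_n=10):
    """Flatten one Last.fm-style record; return None if it would be dropped."""
    tags = record.get("tags") or []
    if not tags:
        return None  # goal: drop tracks with no tags

    flat = {
        "track_id": record["track_id"],
        "title": record["title"],
        "artist": record["artist"],
        "top_tag": tags[0][0],  # first element of the first tags entry
    }
    similars = record.get("similars") or []
    for i in range(top_n):
        # Missing slots are filled with None when fewer than top_n exist.
        pair = similars[i] if i < len(similars) else (None, None)
        flat[f"similar_{i + 1}_trackid"] = pair[0]
        flat[f"similar_{i + 1}_score"] = pair[1]
    return flat


# Hypothetical record, for illustration only.
sample = {
    "track_id": "TR_EXAMPLE",
    "title": "Example Title",
    "artist": "Example Artist",
    "tags": [["rock", "100"], ["indie", "40"]],
    "similars": [["TR_OTHER_1", 0.9], ["TR_OTHER_2", 0.5]],
}
flat = flatten_record(sample, top_n=2)
print(flat["top_tag"], flat["similar_1_trackid"], flat["similar_2_score"])
```

With the default `top_n=10`, this sample would carry `None` in slots 3 through 10, which is the case to keep in mind when deciding how nulls should be handled downstream.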

#### Possible Analytics Goals with this dataset:
ML Related:
* Tag classification using similar tracks' tags and similarity scores, compared with tag classification using lyrics
* Sentiment classification using tags, similar tracks' tags, and similarity scores, compared with sentiment classification using lyrics

Non-ML Related:
* Artist Sentiment Analysis
* Tag Sentiment Analysis

In [1]:
# Only SparkSession is needed below; wildcard imports from
# pyspark.sql.functions would shadow Python builtins such as sum and max.
from pyspark.sql import SparkSession

ss = SparkSession.builder.getOrCreate()
sc = ss.sparkContext

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
13,application_1614556309029_0014,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


In [2]:
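def similarity_preprocessing(df):
    """Flatten the top 10 similar tracks and the top tag into scalar columns."""
    df = df.select("track_id", "title", "artist", "tags", "similars")
    # Each entry of "similars" is indexed as [track_id, score].
    for i in range(10):
        df = df.withColumn(f"similar_{i+1}_trackid", df["similars"][i][0])
        df = df.withColumn(f"similar_{i+1}_score", df["similars"][i][1])

    # Keep the first element of the first "tags" entry as the top tag.
    # dropna() removes rows with a null in any column, so besides untagged
    # tracks it also drops tracks with fewer than 10 similar tracks (assuming
    # out-of-range array lookups yield null, Spark's default non-ANSI behavior).
    df = df.withColumn("top_tag", df["tags"][0][0]).drop("similars").dropna()
    return df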


# Data Cleaning by Folder

In [3]:
file_list = ["A", "B", "C", "D", "E",
             "F", "G", "H", "I", "J",
             "K", "L", "M", "N", "O",
             "P", "Q", "R", "S", "T",
             "U", "V", "W", "X", "Y", "Z"]
for letter in file_list:
    # Read every JSON file two levels below this top-level folder.
    lastfm_train_path = f"s3a://msds694-final-group7/lastfm_train/{letter}/*/*/"
    lastfm_test_path = f"s3a://msds694-final-group7/lastfm_test/{letter}/*/*/"
    lastfm_train = ss.read.json(lastfm_train_path)
    lastfm_test = ss.read.json(lastfm_test_path)

    lastfm_train_clean = similarity_preprocessing(lastfm_train)
    lastfm_test_clean = similarity_preprocessing(lastfm_test)

    # Write one cleaned Parquet dataset per folder. With the default save
    # mode, the write fails if the output path already exists.
    output_train_file = f"s3a://msds694-final-group7/lastfm_train_clean/{letter}"
    output_test_file = f"s3a://msds694-final-group7/lastfm_test_clean/{letter}"
    lastfm_train_clean.write.parquet(output_train_file)
    lastfm_test_clean.write.parquet(output_test_file)
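
As a side note, the per-folder input and output paths used in the loop above can be generated rather than typed out. The following is a minimal sketch under that assumption; `BUCKET`, `folder_paths`, and `jobs` are hypothetical names introduced here for illustration, and the path layout is copied from the cell above.

```python
import string

BUCKET = "s3a://msds694-final-group7"


def folder_paths(split, letter):
    """Return (input glob, output path) for one split and one folder letter."""
    source = f"{BUCKET}/lastfm_{split}/{letter}/*/*/"
    target = f"{BUCKET}/lastfm_{split}_clean/{letter}"
    return source, target


# One (split, letter, source, target) tuple per folder and split.
jobs = [
    (split, letter, *folder_paths(split, letter))
    for letter in string.ascii_uppercase
    for split in ("train", "test")
]
print(len(jobs))
print(jobs[0])
```

Each tuple could then be fed to `ss.read.json(source)` and `write.parquet(target)` in a single loop, which keeps the train and test handling symmetric.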


# Combine the Per-Folder Parquet Outputs

In [2]:
# The trailing * matches every per-folder output directory.
train_parquets_path = "s3a://msds694-final-group7/lastfm_train_clean/*"
test_parquets_path = "s3a://msds694-final-group7/lastfm_test_clean/*"
train = ss.read.parquet(train_parquets_path)
test = ss.read.parquet(test_parquets_path)


In [7]:
train.count()


347324

In [5]:
test.count()


46239
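
As a quick sanity check on the two counts above, the share of rows in each split can be computed directly. This is plain arithmetic on the printed numbers; the variable names are introduced here for illustration.

```python
train_rows = 347324  # from train.count()
test_rows = 46239    # from test.count()

total_rows = train_rows + test_rows
train_share = train_rows / total_rows
test_share = test_rows / total_rows

print(f"total={total_rows}, train={train_share:.2%}, test={test_share:.2%}")
```

The cleaned data works out to roughly an 88/12 train/test split.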

In [8]:
train.write.parquet("s3a://msds694-final-group7/lastfm_train_clean/All")
test.write.parquet("s3a://msds694-final-group7/lastfm_test_clean/All")
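
One thing to watch: `All` is written under the same prefix that the `lastfm_train_clean/*` glob reads from, so re-running the combine step after this write would presumably pick up `All` as well and duplicate rows. The sketch below uses Python's `fnmatch` as a stand-in for the S3 glob (an approximation, not Hadoop's own matcher) to show how a single-letter pattern such as `[A-Z]` would exclude it. The folder list and the `matching` helper are hypothetical.

```python
from fnmatch import fnmatchcase
import string

# Hypothetical listing of the output prefix after the "All" write.
folders = list(string.ascii_uppercase) + ["All"]


def matching(pattern):
    """Return the folder names that match a shell-style pattern."""
    return [name for name in folders if fnmatchcase(name, pattern)]


wildcard_matches = matching("*")
letter_matches = matching("[A-Z]")

print("All" in wildcard_matches)
print("All" in letter_matches)
print(len(letter_matches))
```

Whether the same character-range pattern behaves identically in an `s3a://` path should be verified against the Spark and Hadoop versions in use; writing the combined output to a separate prefix would sidestep the question entirely.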


# Test Reading the Parquet Output From S3

In [2]:
train_path = "s3a://msds694-final-group7/lastfm_train_clean/All"
test_path = "s3a://msds694-final-group7/lastfm_test_clean/All"
train = ss.read.parquet(train_path)


In [5]:
train.show(1)


+------------------+--------+--------------+--------------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+----------------+--------+
|          track_id|   title|        artist|                tags| similar_1_trackid|similar_1_score| similar_2_trackid|similar_2_score| similar_3_trackid|similar_3_score| similar_4_trackid|similar_4_score| similar_5_trackid|similar_5_score| similar_6_trackid|similar_6_score| similar_7_trackid|similar_7_score| similar_8_trackid|similar_8_score| similar_9_trackid|similar_9_score|similar_10_trackid|similar_10_score| top_tag|
+------------------+--------+--------------+--------------------+------------------+---------------+------------------+---------------+-------------

In [3]:
test = ss.read.parquet(test_path)


In [4]:
test.show(1)


+------------------+------------+---------------+--------------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+---------------+------------------+----------------+----------+
|          track_id|       title|         artist|                tags| similar_1_trackid|similar_1_score| similar_2_trackid|similar_2_score| similar_3_trackid|similar_3_score| similar_4_trackid|similar_4_score| similar_5_trackid|similar_5_score| similar_6_trackid|similar_6_score| similar_7_trackid|similar_7_score| similar_8_trackid|similar_8_score| similar_9_trackid|similar_9_score|similar_10_trackid|similar_10_score|   top_tag|
+------------------+------------+---------------+--------------------+------------------+---------------+------------------+----------