
# TikTok tracks analysis

|          |            |
|:---------|------------|
| **Team** | Foraneos   |
| **Date** | 02/19/2025 |
| **Lab**  | 04         |

## Description 

In teams, write a Jupyter Notebook (in the directory `spark_cluster/notebooks/labs/lab05`) with an efficient PySpark solution to the following analysis tasks for the TikTok dataset:

- Filter and Count Popular Tracks. Filter songs with a popularity score greater than 80 and count the number of such tracks.
- Calculate Average Duration of Songs by Genre. Group songs by genre and calculate the average `duration_mins` for each genre.
- Find the Top 5 Most Energetic Songs. Sort songs by energy in descending order and retrieve the top 5 songs.
- Calculate the Total Duration of Songs in Each Playlist. Group songs by playlist name and calculate the total `duration_mins` for each playlist.

## Deliverable

When you complete your notebook, please submit a PR link with the solution.

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQL-Transformations-Actions") \
    .master("spark://0638c7435d1d:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/25 01:41:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
import importlib
import foraneos.spark_utils as SpU
from pyspark.sql.functions import col, sum, avg

importlib.reload(SpU)

columns_info = [ ("track_id", "string"),
                ("track_name", "string"),
                ("artist_id", "string"),
                ("artist_name", "string"),
                ("album_id", "string"),
                ("duration", "int"),
                ("release_date", "timestamp"),
                ("popularity", "int"),
                ("danceability", "double"),
                ("energy", "double"),
                ("key", "int"),
                ("loudness", "double"),
                ("mode", "int"),
                ("speechiness", "double"),
                ("acousticness", "double"),
                ("instrumentalness", "double"),
                ("liveness", "double"),
                ("valence", "double"),
                ("tempo", "double"),
                ("playlist_id", "string"),
                ("playlist_name", "string"),
                ("duration_mins", "double"),
                ("genre", "string")]

schema = SpU.SparkUtils.generate_schema(columns_info)



In [4]:
# Create DataFrame
tiktok_df = spark \
                .read \
                .schema(schema) \
                .option("header", "true") \
                .csv("/home/jovyan/notebooks/data/tiktok.csv")
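
Because an explicit schema is supplied, Spark applies it to the CSV columns by position and, with the default `enforceSchema=true`, does not validate the header names against it. The outputs later in this notebook (playlist-like labels under `genre`, ID-like strings under `playlist_name`) suggest that a quick check of the header order against `columns_info` may be worthwhile. The sketch below is a hypothetical driver-side helper (`header_mismatches` is not part of `foraneos.spark_utils`), and the sample header is made up for illustration.

```python
import csv
import io


def header_mismatches(header_line, expected_names):
    """Return (position, found, expected) for every column that differs."""
    found_names = next(csv.reader(io.StringIO(header_line)))
    mismatches = []
    for i in range(max(len(found_names), len(expected_names))):
        found = found_names[i].strip() if i < len(found_names) else None
        expected = expected_names[i] if i < len(expected_names) else None
        if found != expected:
            mismatches.append((i, found, expected))
    return mismatches


# Hypothetical header in which the last two columns are swapped
# relative to the schema order (not the real tiktok.csv header).
sample_header = "track_id,track_name,genre,playlist_name"
expected = ["track_id", "track_name", "playlist_name", "genre"]

for position, found, wanted in header_mismatches(sample_header, expected):
    print(f"column {position}: file has {found!r}, schema expects {wanted!r}")
```

In the notebook, the expected names could be built as `[name for name, _ in columns_info]`, with the header line taken from the first line of the CSV file.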

## Filter and Count Popular Tracks

In [None]:
filtered_tiktok = tiktok_df.select("popularity")                                # keep only the column the filter needs
filtered_popularity_df = filtered_tiktok.filter(col("popularity") > 80)         # keep tracks with popularity strictly above 80
pop80_count = filtered_popularity_df.count()                                    # count the matching rows
print(pop80_count)



1023
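
The filter is strict: a track with a popularity of exactly 80 is not counted, and rows whose `popularity` is null are dropped because a comparison with null is never true in Spark SQL. A plain-Python reference on a made-up sample makes these edge cases explicit; the rows and the helper name `count_popular` are illustrative assumptions, not data from `tiktok.csv`.

```python
def count_popular(rows, threshold=80):
    """Count rows whose popularity is strictly greater than threshold.

    None stands in for a Spark null and never passes the filter.
    """
    matching = [
        row for row in rows
        if row["popularity"] is not None and row["popularity"] > threshold
    ]
    return len(matching)


# Made-up rows covering the boundary and null cases.
sample_rows = [
    {"track_name": "a", "popularity": 95},
    {"track_name": "b", "popularity": 80},    # boundary value: excluded
    {"track_name": "c", "popularity": None},  # null: excluded
    {"track_name": "d", "popularity": 81},
]

print(count_popular(sample_rows))  # 2
```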

## Calculate Average Duration of Songs by Genre

In [None]:
filtered_tiktok_genre = tiktok_df.select("genre", "duration_mins")          # keep only the columns the aggregation needs
genre_names = filtered_tiktok_genre.groupBy("genre")                        # group rows by genre
avg_genre_duration_mins = genre_names.agg(avg(col("duration_mins")))        # average duration (minutes) per genre
avg_genre_duration_mins.show()

+------------------+------------------+
|             genre|avg(duration_mins)|
+------------------+------------------+
|TIKTOK PHILIPPINES|3.2801328435737513|
|      TIKTOK DANCE| 3.015020713916861|
|           _TIKTOK| 3.251196442168827|
|        TIKTOK OPM| 4.257192861885788|
+------------------+------------------+
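
Conceptually, a grouped average does not require moving every row to one place: each partition can keep a running (sum, count) pair per genre, and the pairs are merged before the final division. The plain-Python sketch below illustrates that idea on made-up partitions. It is a simplified model rather than Spark's actual implementation, and all names and values in it are assumptions for illustration. Null durations are skipped, in line with `avg` ignoring nulls.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Build {genre: [running_sum, count]} for one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for genre, duration in partition:
        totals = acc[genre]  # creates the group even if every duration is null
        if duration is not None:
            totals[0] += duration
            totals[1] += 1
    return acc


def merge_and_average(partials):
    """Merge per-partition (sum, count) pairs and divide once at the end."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for genre, (total, count) in partial.items():
            merged[genre][0] += total
            merged[genre][1] += count
    # A group with no non-null values gets None, standing in for a null average.
    return {
        genre: (total / count if count else None)
        for genre, (total, count) in merged.items()
    }


# Two made-up partitions of (genre, duration_mins) pairs.
partitions = [
    [("pop", 3.0), ("rock", 4.0), ("pop", None)],
    [("pop", 2.0), ("rock", 5.0)],
]

averages = merge_and_average(partial_aggregate(p) for p in partitions)
print(averages)  # {'pop': 2.5, 'rock': 4.5}
```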



## Find the Top 5 Most Energetic Songs


In [None]:
filtered_tiktok_energy = tiktok_df.select("track_name", "energy")           # keep only the columns to display
songs_desc_energy = filtered_tiktok_energy.orderBy(col("energy").desc())    # sort by energy, highest first
songs_desc_energy.show(n=5)                                                 # show the top 5 rows

+--------------------+------------------+
|          track_name|            energy|
+--------------------+------------------+
|       Kiat Jud Dong|0.9990000000000001|
|       Bukan untukku|             0.998|
|    Ritmo Envolvente|             0.995|
|Tante Culik Aku Dong|             0.995|
|Biarlah Semua Ber...|             0.995|
+--------------------+------------------+
only showing top 5 rows
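
Three of the rows above share an energy of 0.995, so which tied songs make the cut depends on how Spark happens to order equal keys. Adding a secondary sort key makes the top 5 reproducible. The plain-Python sketch below shows the idea on made-up rows, breaking ties by track name; the helper name `top_n_energetic` and the sample values are assumptions for illustration.

```python
def top_n_energetic(rows, n=5):
    """Return the n rows with the highest energy, ties broken by track name.

    Assumes every row has a non-null numeric energy.
    """
    ranked = sorted(rows, key=lambda row: (-row["energy"], row["track_name"]))
    return ranked[:n]


# Made-up rows with a three-way tie at 0.995.
sample = [
    {"track_name": "C", "energy": 0.995},
    {"track_name": "A", "energy": 0.995},
    {"track_name": "D", "energy": 0.999},
    {"track_name": "B", "energy": 0.995},
    {"track_name": "E", "energy": 0.5},
]

for row in top_n_energetic(sample, n=3):
    print(row["track_name"], row["energy"])
```

In PySpark, the same tie-break could be expressed as `orderBy(col("energy").desc(), col("track_name").asc())`.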



## Calculate the Total Duration of Songs in Each Playlist


In [None]:
filtered_tiktok_playlist = tiktok_df.select("playlist_name", "duration_mins")       # keep only the columns the aggregation needs
playlist_names = filtered_tiktok_playlist.groupBy("playlist_name")                  # group rows by playlist
total_playlist_duration_mins = playlist_names.agg(sum(col("duration_mins")))        # total duration (minutes) per playlist

#total_playlist_duration_mins_sorted = total_playlist_duration_mins.orderBy(col("sum(duration_mins)").desc())

total_playlist_duration_mins.show()

+--------------------+------------------+
|       playlist_name|sum(duration_mins)|
+--------------------+------------------+
|5IZc3KIVFhjzJ0L2k...| 7.474666666666667|
|08ia51KbTcfs4QVT5...|            4.1485|
|7xVLFuuYdAvcTfcP3...| 9.456433333333333|
|2RBILNmyq8p4fqVWO...| 2.162933333333333|
|6GdDjthxbTGBV9rl2...|3.3209166666666667|
|7krYEnB1OI1RbnJBa...|2.0957666666666666|
|1FgPyHX7HruKDL4Tx...|            2.4448|
|62RtxFf9epYNWOUHJ...|2.6694333333333335|
|5ow0sNF1zSqp71Ix5...|2.7334833333333335|
|0LlJbV4lyzJYE14YC...|10.709133333333334|
|6NFKf8vBApSvtzkap...|3.7074333333333334|
|5P8lyudWE7HQxb4lu...| 4.250666666666667|
|2BgEsaKNfHUdlh97K...| 3.116433333333333|
|7F9vK8hNFMml4GtHs...| 3.173783333333333|
|4vVTI94F9uJ8lHNDW...|3.3657666666666666|
|2uULRpRtKhCdojXwo...|               2.2|
|1tRlGMHsf21FDo6pj...|           1.79005|
|215fAfwkWtlj30ofd...|2.3214166666666665|
|3bidbhpOYeV4knp8A...|            4.3057|
|0YFocHKmrMme7Isel...| 4.810383333333333|
+--------------------+------------------+

In [27]:
# Stop the SparkContext
sc.stop()