The Python script below converts the played_at field from a Unix timestamp (stored in milliseconds, hence the division by 1000) to a readable date and time.
It then writes the converted data to a CSV output, which is overwritten every time the script is run.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize Spark Session
spark = SparkSession.builder.appName("SpotifyDataNormalization").getOrCreate()

# Read data into a dataframe
df = spark.read.option("multiline", "true").json("all_recent_tracks.json")

# Convert timestamp to normal date and time 
df = df.withColumn("played_at", F.from_unixtime(F.col("played_at") / 1000, "yyyy-MM-dd HH:mm:ss"))

# Creating A CSV file of the transformed data
df.coalesce(1).write.option("header", "true").mode("overwrite").csv("clean_tracks.csv")

# Show the transformed column 'played_at'
df.select("played_at").show(truncate=False)


+-------------------+
|played_at          |
+-------------------+
|2025-03-29 01:30:46|
|2025-03-29 01:32:49|
|2025-03-29 01:35:48|
|2025-03-29 01:37:49|
|2025-03-29 01:42:10|
|2025-03-29 01:45:10|
|2025-03-29 01:48:15|
|2025-03-29 01:51:23|
|2025-03-29 01:54:55|
|2025-03-29 01:57:53|
|2025-03-29 02:00:28|
|2025-03-29 02:02:31|
|2025-03-29 02:05:23|
|2025-03-29 02:07:52|
|2025-03-29 02:10:10|
|2025-03-29 02:12:11|
|2025-03-29 02:16:11|
|2025-03-29 02:19:38|
|2025-03-29 02:22:25|
|2025-03-29 02:25:22|
+-------------------+
only showing top 20 rows
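
To make the conversion above easier to check by hand, here is a minimal plain-Python sketch of the same arithmetic. The helper name ms_to_datetime_string and the sample value 1743211846000 are hypothetical, and the sketch assumes UTC; Spark's from_unixtime formats in the session time zone, so results on another setup may differ.

```python
from datetime import datetime, timezone


def ms_to_datetime_string(played_at_ms):
    """Format a Unix timestamp given in milliseconds as 'yyyy-MM-dd HH:mm:ss'.

    Hypothetical helper for illustration; assumes the result should be in UTC.
    """
    # Drop the millisecond part, mirroring the division by 1000 in the Spark cell
    seconds = played_at_ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# Hypothetical sample value, picked so that it lands on the first row shown above in UTC
print(ms_to_datetime_string(1743211846000))  # 2025-03-29 01:30:46
```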



Although it is commented out, this script plays a crucial role in some projects: its intended purpose is to move the needed CSV file out of the folder PySpark creates when it writes the transformed data.
Why is it commented out? The name of the part file changes each time the script is re-run, so the path of the file keeps changing, and since the data is meant to keep growing, hard-coding that path is not feasible for this project.

In [None]:
""" import shutil
# Move the part file to desired location
part_file = "clean_tracks.csv\\part-00000-e767b6d4-f389-44a7-91b0-dec958759f38-c000.csv"
destination = "clean_recent_tracks.csv"
shutil.move(part_file,destination) """



' import shutil\n# Move the part file to desired location\npart_file = "clean_tracks.csv\\part-00000-e767b6d4-f389-44a7-91b0-dec958759f38-c000.csv"\ndestination = "clean_recent_tracks.csv"\nshutil.move(part_file,destination) '
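
One possible way around the changing file name described above is to look the part file up by pattern instead of hard-coding it. This is only a sketch, not what the project currently does: the function name move_part_file is hypothetical, and it assumes the output folder holds exactly one part-*.csv file, which is what coalesce(1) is expected to produce.

```python
import glob
import os
import shutil


def move_part_file(output_dir, destination):
    """Move the single part-*.csv file found in output_dir to destination.

    Hypothetical helper; assumes Spark wrote exactly one part file.
    """
    # Match the part file by pattern so the random suffix in its name does not matter
    matches = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(matches) != 1:
        raise RuntimeError(
            f"expected exactly one part file in {output_dir}, found {len(matches)}"
        )
    shutil.move(matches[0], destination)
    return destination
```

With the folder and file names used in this notebook, the call would be move_part_file("clean_tracks.csv", "clean_recent_tracks.csv"), run after the CSV write has finished.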

The script below trims each track's id down to the bare identifier by dropping the spotify:track: prefix (yes, that prefix may arguably be part of the id).
I used this as a way to stretch my skills in splitting entries and indexing.
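
Before the Spark cell, here is a small plain-Python sketch of the same split-and-index idea. The function name extract_track_id and the sample id are hypothetical. Returning None when there is no third part is an assumption meant to mirror the null that Spark's getItem typically yields for a missing index.

```python
def extract_track_id(uri):
    """Return the third ':'-separated part of a URI such as 'spotify:track:<id>'.

    Hypothetical helper for illustration; returns None if there is no third part.
    """
    parts = uri.split(":")
    # Index 2 is the last of the three parts: 'spotify', 'track', '<id>'
    return parts[2] if len(parts) > 2 else None


# Hypothetical sample id, not taken from the data
print(extract_track_id("spotify:track:abc123"))  # abc123
```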

In [12]:
df2 = spark.read.option("multiline", "true").json("all_top_tracks.json")

# Split by ":" and get just the id which is the last part
df2 = df2.withColumn("id", F.split("id", ":").getItem(2))

df2.coalesce(1).write.option("header", "true").mode("overwrite").csv("clean_top_tracks.csv")
df2.show()

+---------------+--------+--------------------+----+----------+--------------------+
|    artist_name|duration|                  id|link|popularity|          track_name|
+---------------+--------+--------------------+----+----------+--------------------+
|  Claire Leslie|  180006|4HR5BN6hc4AmcPO1N...|NULL|        45|                24/7|
|     Grace Marr|  155494|5BH8UixV8wu3FR5xJ...|NULL|         2|              Belong|
|    Cade Kellam|  242886|7LcJx95sWXpqCMLxQ...|NULL|        44|             Blessed|
|           OAKS|  196130|2TUW3yDwsfNcBOtk8...|NULL|        27|               Clean|
|Gabriel Eziashi|  817850|2SHWUh366VIpgpDHx...|NULL|        36|Contemporary Prai...|
|      ONE HOUSE|  304015|4Cr2oltVAPC6rAUMT...|NULL|        29|    Down In My Heart|
|Gabriel Eziashi|  178512|3zfPYqEqCoO6ud6PM...|NULL|        42|           Dry Bones|
|  Dâmares Gomes|  198688|1V9AYsVkGvDL0XrkY...|NULL|        37|         Even Better|
|    Cade Kellam|  294186|4sZLEFie6r4WxCjCU...|NULL|        25|  