# Spark Session

All distributed operations in Spark go through a so-called Spark session. Usually, one has already been created for you by your cluster's administrator.

In [0]:
spark

# Wasabi dataset

For this course, we will use the [Wasabi dataset](https://doi.org/10.5281/zenodo.5603369), the same one you use in the data visualisation course.

In [0]:
!pwd

In [0]:
# to download it faster, we use a Google Cloud Storage bucket
# here we copy the archive to the driver machine (Linux filesystem)
dbutils.fs.cp("gs://wasabi-dataset/wasabi-2-0.tar", "file:///databricks/driver")

In [0]:
# we need the local copy because archive tools generally don't work on distributed file systems
# notice the ``json.zip`` archive containing the data we need
!tar -xvf /databricks/driver/wasabi-2-0.tar

In [0]:
# we plan to work with songs data
!unzip /databricks/driver/json/json.zip

In [0]:
# this file is on the driver (linux) filesystem (non-distributed)
song_path = "/databricks/driver/song.json"

In [0]:
# notice that song.json is one large JSON document (a list of dictionaries)
# this layout is not particularly friendly to distributed computation
# also notice that different songs are separated by a line containing only
# three symbols: },{
!head -500 {song_path}

In [0]:
from pyspark import pandas as pd

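# depending on the Spark version, the default index of the pandas API on Spark
# may be computed on a single node; the "distributed" index type avoids that
# (its values are increasing but not consecutive)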
pd.set_option("compute.default_index_type", "distributed")

In [0]:
# now we upload our songs JSON to DBFS (Databricks File System)
# it plays a role similar to GFS (Google File System), HDFS (Hadoop Distributed File System), and Amazon S3
dbutils.fs.cp(f"file://{song_path}", "/")

In [0]:
# check that the songs data appeared in the DBFS
dbutils.fs.ls("dbfs:///")

In [0]:
# Spark can read simple text files and split them into lines
# using any character sequence you wish

song_lines = spark.read.text(
    "dbfs:/song.json",
    lineSep="\n},{\n"
).to_pandas_on_spark()
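
To see what the custom `lineSep` does to the file, here is a minimal plain-Python sketch on a tiny made-up JSON list (the sample string and variable names are illustrative, not taken from the dataset). It only mimics the splitting with `str.split`; Spark's text reader applies the same separator, but over the whole distributed file.

```python
# a toy stand-in for song.json: a JSON list whose items are separated
# by a line containing only "},{" (hypothetical sample data)
sample = '[{\n"title": "A"\n},{\n"title": "B"\n},{\n"title": "C"\n}]'

# same separator as the one passed to ``lineSep``
fragments = sample.split("\n},{\n")

for fragment in fragments:
    print(repr(fragment))
```

The first fragment keeps the leading `[{`, the last one keeps the trailing `}]`, and the fragments in between have lost their outer braces, so none of them is valid JSON yet.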

In [0]:
# now we have one song per row (not yet valid JSON: the outer braces are missing)

song_lines.head()

Unnamed: 0,value
0,"[{\n ""_id"": {\n ""$oid"": ""5714dec325ac0d8ae..."
1,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
2,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
3,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
4,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."


In [0]:
# and around 2.1M songs in total

song_lines.count()

In [0]:
# notice that ``song_lines`` is not a real ``pandas.DataFrame``

type(song_lines)

In [0]:
# it's only a wrapper around a Spark DataFrame

print(song_lines.spark.frame())
type(song_lines.spark.frame())

In [0]:
# which in turn is a wrapper around something called RDD
# (Resilient Distributed Dataset)

print(song_lines.spark.frame().rdd)
type(song_lines.spark.frame().rdd)

In [0]:
# RDD is distributed
# here it was spread over several partitions of roughly the same size

song_lines.spark.frame().rdd.getNumPartitions()

In [0]:
# an RDD is resilient: it does not hold the data itself,
# only the sequence of instructions needed to (re)compute it, the so-called lineage

song_lines.spark.explain()
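
The idea of lineage can be illustrated with a toy class in plain Python. This is only a conceptual sketch, not how Spark implements RDDs: `ToyDataset` and its methods are hypothetical names invented for the illustration. The object stores a source and a list of recorded steps, and computes nothing until `collect` is called, so a lost result can always be rebuilt by replaying the steps.

```python
from typing import Callable, Iterable, List, Tuple


class ToyDataset:
    """A hypothetical, single-machine illustration of lineage."""

    def __init__(self, source: Callable[[], Iterable], steps: Tuple = ()):
        self.source = source  # how to (re)load the input
        self.steps = steps    # recorded transformations, nothing computed yet

    def map(self, func: Callable) -> "ToyDataset":
        # record the instruction and return a new dataset description
        return ToyDataset(self.source, self.steps + (("map", func),))

    def filter(self, predicate: Callable) -> "ToyDataset":
        return ToyDataset(self.source, self.steps + (("filter", predicate),))

    def collect(self) -> List:
        # replay the whole lineage, starting from the source
        data = iter(self.source())
        for kind, func in self.steps:
            data = map(func, data) if kind == "map" else filter(func, data)
        return list(data)


numbers = ToyDataset(lambda: range(10))
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print([kind for kind, _ in even_squares.steps])  # the recorded lineage
print(even_squares.collect())                    # computed only now
```

Calling `collect` a second time recomputes everything from the source, which is the property that lets a lost partition be rebuilt instead of being stored redundantly.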

In [0]:
# the ``to_pandas`` method collects the data from all workers to the driver
# on a large dataset this usually kills the driver with an OOM (out-of-memory) error
# but if we take only the head, that's OK

pandas_df = song_lines.head().to_pandas()
print(type(pandas_df))
pandas_df

Unnamed: 0,value
0,"[{\n ""_id"": {\n ""$oid"": ""5714dec325ac0d8ae..."
1,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
2,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
3,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."
4,"""_id"": {\n ""$oid"": ""5714dec325ac0d8aee380..."


In [0]:
def to_a_valid_json(
    pandas_df: pd.DataFrame["value": str]
) -> pd.DataFrame["value": str]:
    """
    :param pandas_df: a ``pandas.DataFrame`` with only one string column
    :return: the same dataframe, but where all lines are valid JSON strings
    """
    pandas_df["value"] = pandas_df["value"].str.replace("[{\n", " ", regex=False)
    pandas_df["value"] = pandas_df["value"].str.replace("}]", " ", regex=False)
    pandas_df["value"] = pandas_df["value"].str.replace("\n", " ", regex=False)
    pandas_df["value"] = "{" + pandas_df["value"] + "}"
    return pandas_df
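
The three replacements are easier to follow on plain strings. The sketch below applies the same steps to three made-up fragments (the first, a middle, and the last piece of a toy JSON list) and checks the result with `json.loads`. `repair_fragment` is a hypothetical helper used only for this illustration.

```python
import json


def repair_fragment(fragment: str) -> str:
    """Apply the same clean-up steps as ``to_a_valid_json`` to one string."""
    fragment = fragment.replace("[{\n", " ")  # opening of the whole list
    fragment = fragment.replace("}]", " ")    # closing of the whole list
    fragment = fragment.replace("\n", " ")    # keep every record on one line
    return "{" + fragment + "}"               # restore the outer braces


# made-up fragments, shaped like the result of splitting on "\n},{\n"
fragments = ['[{\n"title": "A"', '"title": "B"', '"title": "C"\n}]']
records = [json.loads(repair_fragment(f)) for f in fragments]
print(records)
```

Note that these are plain substring replacements: a record that itself contained the sequence `}]` would be altered as well. Whether that happens in the dataset is not checked here.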

In [0]:
# let's test that our function works well on a real pandas DataFrame
import json
import pandas

# ``DataFrame.append`` was removed in pandas 2.0, so we use ``pandas.concat``
pandas_df = pandas.concat([song_lines.head().to_pandas(), song_lines.tail().to_pandas()])
transformed_df = to_a_valid_json(pandas_df)
assert isinstance(transformed_df, pandas.DataFrame)
for _, row in transformed_df.iterrows():
    json.loads(row.value)

In [0]:
# now we can apply our function in batch mode to the Spark DataFrame
json_lines = song_lines.pandas_on_spark.transform_batch(to_a_valid_json)
json_lines.head()

Unnamed: 0,value
0,"{ ""_id"": { ""$oid"": ""5714dec325ac0d8aee38..."
1,"{ ""_id"": { ""$oid"": ""5714dec325ac0d8aee380..."
2,"{ ""_id"": { ""$oid"": ""5714dec325ac0d8aee380..."
3,"{ ""_id"": { ""$oid"": ""5714dec325ac0d8aee380..."
4,"{ ""_id"": { ""$oid"": ""5714dec325ac0d8aee380..."


In [0]:
# now we write the Spark DataFrame to disk as plain text files,
# one JSON document per line
json_lines.spark.frame().write.mode("overwrite").text("song.json_lines")

In [0]:
# notice that the output is a directory rather than a single file

dbutils.fs.ls("song.json_lines")
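
Because each part file holds one JSON document per line, the directory can be consumed piece by piece. The following plain-Python sketch builds a small temporary directory that imitates such an output and parses every line of every part file. The file names and contents are made up for the illustration; real Spark part files have longer names, and the extra marker files may differ by platform.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "song.json_lines"
    out_dir.mkdir()

    # hypothetical part files, one JSON document per line
    (out_dir / "part-00000.txt").write_text('{"title": "A"}\n{"title": "B"}\n')
    (out_dir / "part-00001.txt").write_text('{"title": "C"}\n')
    (out_dir / "_SUCCESS").write_text("")  # empty marker file, ignored below

    # read only the part files, in a stable order
    songs = [
        json.loads(line)
        for part in sorted(out_dir.glob("part-*"))
        for line in part.read_text().splitlines()
        if line
    ]

print(songs)
```

With Spark itself there should be no need to loop over the files: passing the directory path to `spark.read.json` is expected to read all part files at once.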