---
layout: post
title:  Spark Dataset APIs
date:   2025-11-07
categories: [Spark, Scala]
mermaid: true
maths: true
typora-root-url: /Users/ojitha/GitHub/ojitha.github.io
typora-copy-images-to: ../../blog/assets/images/${filename}
---

<style>
/* Styles for the two-column layout */
.image-text-container {
    display: flex; /* Enables flexbox */
    flex-wrap: wrap; /* Allows columns to stack on small screens */
    gap: 20px; /* Space between the image and text */
    align-items: center; /* Vertically centers content in columns */
    margin-bottom: 20px; /* Space below this section */
}

.image-column {
    flex: 1; /* Allows this column to grow */
    min-width: 250px; /* Minimum width for the image column before stacking */
    max-width: 40%; /* Maximum width for the image column to not take up too much space initially */
    box-sizing: border-box; /* Include padding/border in element's total width/height */
}

.text-column {
    flex: 2; /* Allows this column to grow more (e.g., twice as much as image-column) */
    min-width: 300px; /* Minimum width for the text column before stacking */
    box-sizing: border-box;
}

</style>

<div class="image-text-container">
    <div class="image-column">
        <img src="https://raw.githubusercontent.com/ojitha/blog/master/assets/images/2025-10027-Scala-2-Collections/scala-collections-illustration.svg" alt="Scala Functors" width="150" height="150">
    </div>
    <div class="text-column">
<p>TBC</p>
    </div>
</div>

<!--more-->

------

* TOC
{:toc}
------



## Introduction

### What are Datasets?

Apache Spark Datasets are the foundational type in Spark's Structured APIs, providing a **type-safe**, distributed collection of strongly typed JVM objects. While DataFrames are Datasets of type `Row`, Datasets allow you to define custom domain-specific objects that each row will consist of, combining the benefits of RDDs (type safety, custom objects) with the optimizations of DataFrames (Catalyst optimizer, Tungsten execution).

**Key Characteristics:**

1. **Type Safety**: Compile-time type checking catches many type errors before the job runs
2. **Encoders**: Special serialization mechanism that maps domain-specific types to Spark's internal binary format
3. **Catalyst Optimization**: Benefits from Spark SQL's query optimizer
4. **JVM Language Feature**: Available only in Scala and Java (not Python or R)
5. **Functional API**: Supports functional transformations like `map`, `filter`, `flatMap`

**Dataset[T]**: A distributed collection of data elements of type `T`, where `T` is a domain-specific class (case class in Scala, JavaBean in Java) that Spark can encode and optimize.

$$
\text{Dataset}[T] = \{t_1, t_2, \ldots, t_n\} \text{ where } t_i \in T
$$

Translation: A Dataset of type T is a collection of n elements, where each element belongs to type T.

**Encoder[T]**: A mechanism that converts between JVM objects of type `T` and Spark SQL's internal binary format (InternalRow).

$$
\text{Encoder}[T]: T \leftrightarrow \text{InternalRow}
$$

Translation: An Encoder for type T provides bidirectional conversion between objects of type T and Spark's internal row representation.

### Mathematical Foundations

Datasets embody key functional programming concepts:

1. **Functor Laws** (for `map`):
    - Identity: `ds.map(x => x) = ds`
    - Composition: `ds.map(f).map(g) = ds.map(x => g(f(x)))`

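2. **Monad Laws** (for `flatMap`). These hold by analogy only, because `flatMap` takes a function that returns a local collection (`TraversableOnce[U]`) rather than a `Dataset[U]`; below, `Dataset(x)` stands for a single-element collection:
    - Left identity: `Dataset(x).flatMap(f) = f(x)`
    - Right identity: `ds.flatMap(x => Dataset(x)) = ds`
    - Associativity: `ds.flatMap(f).flatMap(g) = ds.flatMap(x => f(x).flatMap(g))`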
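
The laws above can be spot-checked on a tiny Dataset. The following is a minimal sketch, not a proof: it assumes a local Spark session (named `session` here so it does not clash with the notebook's `spark`), and the functions `f`, `g`, `dup`, and `upTo` are hypothetical helpers introduced only for illustration. Results are sorted before comparison because a Dataset does not guarantee element order in general.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("laws-sketch").getOrCreate()
import session.implicits._

val ds = Seq(1, 2, 3, 4).toDS()
val f: Int => Int = _ + 1
val g: Int => String = n => s"n=$n"

// Functor identity: mapping the identity function leaves the elements unchanged.
val identityHolds =
  ds.map(x => x).collect().sorted.toSeq == ds.collect().sorted.toSeq

// Functor composition: two chained maps equal one map of the composed function.
val compositionHolds =
  ds.map(f).map(g).collect().sorted.toSeq ==
    ds.map(x => g(f(x))).collect().sorted.toSeq

// flatMap associativity, using functions that return local collections.
val dup: Int => Seq[Int] = n => Seq(n, n)
val upTo: Int => Seq[Int] = n => 1 to n
val associativityHolds =
  ds.flatMap(dup).flatMap(upTo).collect().sorted.toSeq ==
    ds.flatMap(x => dup(x).flatMap(upTo)).collect().sorted.toSeq
```

All three flags should evaluate to `true` for this input; a check on one small Dataset illustrates the laws but does not establish them in general.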

### The MovieLens Dataset

The examples use the MovieLens dataset, specifically the small version [recommended for education and development](https://grouplens.org/datasets/movielens/){:target="_blank"}, for simplicity.

```mermaid
---
config:
  look: neo
  theme: default
---
erDiagram
    Movies ||--o{ Ratings : "receives"
    Movies ||--o{ Tags : "has"
    Movies ||--|| Links : "references"
    
    Movies {
        int movieId PK "Primary Key"
        string title "Movie title with year"
        string genres "Pipe-separated genres"
    }
    
    Ratings {
        int userId FK "Foreign Key to User"
        int movieId FK "Foreign Key to Movie"
        float rating "Rating value (0.5-5.0)"
        long timestamp "Unix timestamp"
    }
    
    Tags {
        int userId FK "Foreign Key to User"
        int movieId FK "Foreign Key to Movie"
        string tag "User-generated tag"
        long timestamp "Unix timestamp"
    }
    
    Links {
        int movieId PK, FK "Primary Key and Foreign Key to Movie"
        string imdbId "IMDB identifier"
        string tmdbId "TMDB identifier"
    }
```

#### Entities and Attributes

1.  **Movies** (9,742 movies)
    -   `movieId` (Primary Key)
    -   `title` (includes release year)
    -   `genres` (pipe-separated list)
2.  **Ratings** (100,836 ratings)
    -   `userId` (Foreign Key)
    -   `movieId` (Foreign Key)
    -   `rating` (0.5 to 5.0 stars)
    -   `timestamp` (Unix timestamp)
3.  **Tags** (3,683 tags)
    -   `userId` (Foreign Key)
    -   `movieId` (Foreign Key)
    -   `tag` (user-generated metadata)
    -   `timestamp` (Unix timestamp)
4.  **Links** (9,742 links)
    -   `movieId` (Primary Key & Foreign Key)
    -   `imdbId` (IMDB identifier)
    -   `tmdbId` (The Movie Database identifier)

#### Relationships

-   **Movies ↔ Ratings**: One-to-Many (a movie can have multiple ratings)
-   **Movies ↔ Tags**: One-to-Many (a movie can have multiple tags)
-   **Movies ↔ Links**: One-to-One (each movie has one set of external links)



In [1]:
// Configure Coursier to fetch doc JARs
interp.repositories() ++= Seq(
coursierapi.MavenRepository.of("https://repo1.maven.org/maven2")
)

// Enable the compiler to use the Java classpath
interp.configureCompiler(c => {
c.settings.usejavacp.value = true
})

// Import Spark
import $ivy.`org.apache.spark::spark-sql:3.3.1` 
import org.apache.logging.log4j.{LogManager, Level}
import org.apache.logging.log4j.core.config.Configurator

// Set log levels BEFORE creating SparkSession
Configurator.setRootLevel(Level.WARN)
Configurator.setLevel("org.apache.spark", Level.WARN)
Configurator.setLevel("org.apache.spark.executor.Executor", Level.WARN)

[32mimport [39m[36m$ivy.$[39m
[32mimport [39m[36morg.apache.logging.log4j.{LogManager, Level}[39m
[32mimport [39m[36morg.apache.logging.log4j.core.config.Configurator[39m

In [2]:
import org.apache.spark.sql._

val spark = {
  NotebookSparkSession.builder()
    .master("local[*]")
    .getOrCreate()
}



[32mimport [39m[36morg.apache.spark.sql._[39m
[36mspark[39m: [32mSparkSession[39m = org.apache.spark.sql.SparkSession@2645c652

In [3]:
import spark.implicits._

[32mimport [39m[36mspark.implicits._[39m

Let's define the Case class

In [4]:
case class Movie(
  movieId: Int,
  title: String,
  genres: String
)

defined [32mclass[39m [36mMovie[39m

Create a DataSet using the above Case class:

In [5]:
// Read CSV and convert to Dataset
val moviesDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ml-latest-small/movies.csv")
  .as[Movie]

// Example queries
moviesDS.show(2)


+-------+----------------+--------------------+
|movieId|           title|              genres|
+-------+----------------+--------------------+
|      1|Toy Story (1995)|Adventure|Animati...|
|      2|  Jumanji (1995)|Adventure|Childre...|
+-------+----------------+--------------------+
only showing top 2 rows



[36mmoviesDS[39m: [32mDataset[39m[[32mMovie[39m] = [movieId: int, title: string ... 1 more field]

**Key Points:**

- Case classes must be serializable
- All fields should have Spark-compatible types
- The `.as[T]` method performs the conversion from DataFrame to Dataset
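
To make the `.as[T]` step concrete without depending on the CSV file, here is a minimal sketch using in-memory data. `MovieRow` is a hypothetical stand-in for the `Movie` case class and `session` is an assumed local session. `.as[T]` matches DataFrame columns to case class fields by name, so the column names passed to `toDF` must line up with the field names.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("as-sketch").getOrCreate()
import session.implicits._

// Hypothetical stand-in for the Movie case class used in this post.
case class MovieRow(movieId: Int, title: String, genres: String)

// An untyped DataFrame whose column names match the case class fields.
val df: DataFrame = Seq(
  (1, "Toy Story (1995)", "Adventure|Animation"),
  (2, "Jumanji (1995)", "Adventure|Children")
).toDF("movieId", "title", "genres")

// Convert to a typed Dataset; fields are now accessible as plain Scala members.
val typed: Dataset[MovieRow] = df.as[MovieRow]
val titles: Seq[String] = typed.map(_.title).collect().sorted.toSeq
```

If a column required by the case class is missing, the conversion is rejected when Spark analyses the query rather than silently producing nulls.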

##### Understanding Encoders

Encoders are a critical component of the Dataset API. They provide:

1. <span>Efficient Serialisation</span>{:gtxt}: Convert JVM objects to Spark's internal Tungsten binary format
2. <span>Schema Generation</span>{:gtxt}: Automatically infer schema from case class structure
3. <span>Code Generation</span>{:gtxt}: Enable whole-stage code generation for better performance
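
The schema-generation role of an encoder can be seen directly, without running a job. This is a small sketch that builds an encoder explicitly with `Encoders.product` (normally `spark.implicits._` supplies it implicitly); `Pet` is a hypothetical case class used only for illustration.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical case class, used only to illustrate schema derivation.
case class Pet(name: String, age: Int)

// Build the encoder explicitly instead of relying on implicits.
val petEncoder: Encoder[Pet] = Encoders.product[Pet]

// The encoder carries the Spark SQL schema derived from the case class fields.
val fieldNames: Seq[String] = petEncoder.schema.fieldNames.toSeq
val fieldTypes: Seq[String] = petEncoder.schema.fields.map(_.dataType.simpleString).toSeq
```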


In [6]:
import org.apache.spark.sql.Dataset
// for primitive types
val intDS : Dataset[Int] = Seq(1,2,3).toDS()

[32mimport [39m[36morg.apache.spark.sql.Dataset[39m
[36mintDS[39m: [32mDataset[39m[[32mInt[39m] = [value: int]

In [7]:
val tupleDS: Dataset[(String, Int)] = Seq(("a",1), ("b", 2)).toDS

[36mtupleDS[39m: [32mDataset[39m[([32mString[39m, [32mInt[39m)] = [_1: string, _2: int]

Using Case classes:

In [8]:
case class Dog(name: String, age: Int)

val dogsDS: Dataset[Dog] = Seq(Dog("Liela",3), Dog("Tommy", 5)).toDS

defined [32mclass[39m [36mDog[39m
[36mdogsDS[39m: [32mDataset[39m[[32mDog[39m] = [name: string, age: int]

In [9]:
dogsDS.show()

+-----+---+
| name|age|
+-----+---+
|Liela|  3|
|Tommy|  5|
+-----+---+



## Dataset Transformations

### map Transformation

The `map` transformation applies a function to each element in the Dataset, producing a new Dataset with transformed elements. It's a **narrow transformation** (no shuffle required) and maintains a **one-to-one relationship** between input and output elements.

Signature:

```scala
def map[U](func: T => U)(implicit encoder: Encoder[U]): Dataset[U]
```
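`func`: the function applied to each element; `encoder`: the implicit `Encoder[U]` for the result type (normally supplied by `import spark.implicits._`).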

For example, to extract the movie title:


In [10]:
moviesDS.map(m => m.title).show(3, truncate=false)

+-----------------------+
|value                  |
+-----------------------+
|Toy Story (1995)       |
|Jumanji (1995)         |
|Grumpier Old Men (1995)|
+-----------------------+
only showing top 3 rows



In [11]:
def extractMovieInfoFun(movie: Movie): (String, String) = (movie.title, movie.genres)
moviesDS.map(extractMovieInfoFun)

defined [32mfunction[39m [36mextractMovieInfoFun[39m
[36mres11_1[39m: [32mDataset[39m[([32mString[39m, [32mString[39m)] = [_1: string, _2: string]

As shown above, you can create a function.

Or you can create a anonymous function as follows:

In [12]:
val extractMovieInfoAnonymousFun: Movie => (String, String) = movie => (movie.title, movie.genres)
moviesDS.map(extractMovieInfoAnonymousFun)

[36mextractMovieInfoAnonymousFun[39m: [32mMovie[39m => ([32mString[39m, [32mString[39m) = ammonite.$sess.cmd12$Helper$$Lambda$6521/992100117@655d1ff7
[36mres12_1[39m: [32mDataset[39m[([32mString[39m, [32mString[39m)] = [_1: string, _2: string]

Above can be directly written in the `map` function:

In [13]:
moviesDS.map(movie => (movie.title, movie.genres))

[36mres13[39m: [32mDataset[39m[([32mString[39m, [32mString[39m)] = [_1: string, _2: string]

### flatMap Transformation

The `flatMap` transformation applies a function to each element and **flattens** the results. Each input element can produce **zero, one, or multiple output elements**. This is essential for transformations like tokenization, exploding nested structures, or filtering with expansion.

Signature:

```scala
def flatMap[U](func: T => TraversableOnce[U])(implicit encoder: Encoder[U]): Dataset[U]
```

Translation: Given a function that transforms each element of type `T` into a collection of type `U`, flatten all collections into a single Dataset of type `U`.

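- **Monad Operation**: `flatMap` enables chaining transformations that produce collections
- **One-to-Many Mapping**: each input element can expand into several output elements (for example, one movie into one row per genre)
- **Flattening**: the per-element collections are concatenated into a single Dataset rather than kept nested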

For a Dataset with $n$ elements, where each element produces $m_i$ results:

$$
|\text{flatMap}(ds, f)| = \sum_{i=1}^{n} m_i
$$

Translation: The size of the flatMapped Dataset equals the sum of results from each element's transformation.
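
The size formula can be checked on a small example. The sketch below assumes a local session (`session`) and a hypothetical `GenreRow` case class; it computes each $m_i$ with `map`, sums them on the driver, and compares the total with the row count after `flatMap`.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("flatmap-sketch").getOrCreate()
import session.implicits._

// Hypothetical case class for illustration only.
case class GenreRow(id: Int, genres: String)

val rows = Seq(
  GenreRow(1, "Adventure|Animation|Children"), // m_1 = 3
  GenreRow(2, "Comedy|Romance"),               // m_2 = 2
  GenreRow(3, "Drama")                         // m_3 = 1
).toDS()

// m_i: number of results each element produces.
val perElement: Array[Int] = rows.map(r => r.genres.split("\\|").length).collect()
val expectedSize: Long = perElement.sum.toLong

// Size of the flattened Dataset.
val actualSize: Long = rows.flatMap(r => r.genres.split("\\|").toSeq).count()
```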

In [14]:
case class MovieGenres (id: Int, genres: String)
val genres = moviesDS.map { movie =>
    MovieGenres(movie.movieId, movie.genres)
}

defined [32mclass[39m [36mMovieGenres[39m
[36mgenres[39m: [32mDataset[39m[[32mMovieGenres[39m] = [id: int, genres: string]

In [15]:
genres.show(3, truncate=false)

+---+-------------------------------------------+
|id |genres                                     |
+---+-------------------------------------------+
|1  |Adventure|Animation|Children|Comedy|Fantasy|
|2  |Adventure|Children|Fantasy                 |
|3  |Comedy|Romance                             |
+---+-------------------------------------------+
only showing top 3 rows



In [16]:
val genresDS = genres.flatMap(m => m.genres.split("\\|"))
genresDS.show(5)

+---------+
|    value|
+---------+
|Adventure|
|Animation|
| Children|
|   Comedy|
|  Fantasy|
+---------+
only showing top 5 rows



[36mgenresDS[39m: [32mDataset[39m[[32mString[39m] = [value: string]

> The `split()` method takes a *regex pattern, and `|` is a special character in regex meaning "OR"*{:rtxt}. So `split("|")` matches the empty string and splits between every character instead of at the pipes. *Instead, escape the pipe with `split("\\|")`*{:gtxt}.
{:.yellow}
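
The difference is easy to see in plain Scala, with no Spark involved. This short sketch compares the unescaped call with three ways of splitting on a literal pipe: escaping it, passing a `Char`, or quoting it with `Pattern.quote`.

```scala
import java.util.regex.Pattern

val genres = "Comedy|Romance"

// Escaped regex: splits on the literal pipe.
val escaped: Seq[String] = genres.split("\\|").toSeq

// Char overload: the separator is treated literally.
val byChar: Seq[String] = genres.split('|').toSeq

// Pattern.quote turns any string into a literal regex.
val quoted: Seq[String] = genres.split(Pattern.quote("|")).toSeq

// Unescaped: "|" is an empty alternation, so the string is split between characters.
val unescaped: Seq[String] = genres.split("|").toSeq
```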

Complex Example: Nested Structure Explosion 

In [22]:
case class GenreOccurences(id: Int, words: Seq[String], occurrences: Seq[Int])
// companion object for the above case class
object GenreOccurences {
  def fromMovie(movie: Movie): GenreOccurences = {
    val id = movie.movieId
    val text = movie.genres
    
    // Extract words
    val words = text.split("\\|").toSeq
    
    // Count occurrences of each word
    val wordCounts = words.groupBy(identity).mapValues(_.size).toMap
    val occurrences = words.map(word => wordCounts(word))
    
    GenreOccurences(id, words, occurrences)
  }
}

defined [32mclass[39m [36mGenreOccurences[39m
defined [32mobject[39m [36mGenreOccurences[39m

In [23]:
val genreOccurencesDS: Dataset[GenreOccurences] = moviesDS.map(GenreOccurences.fromMovie)

[36mgenreOccurencesDS[39m: [32mDataset[39m[[32mGenreOccurences[39m] = [id: int, words: array<string> ... 1 more field]

In [24]:
genreOccurencesDS.show(3)

+---+--------------------+---------------+
| id|               words|    occurrences|
+---+--------------------+---------------+
|  1|[Adventure, Anima...|[1, 1, 1, 1, 1]|
|  2|[Adventure, Child...|      [1, 1, 1]|
|  3|   [Comedy, Romance]|         [1, 1]|
+---+--------------------+---------------+
only showing top 3 rows



In [25]:
genreOccurencesDS.flatMap { genreOccurence =>
    genreOccurence.words.zip(genreOccurence.occurrences).map { case (word, numOccured) =>
        (genreOccurence.id, word, numOccured)
        
    }
}.show(5)

+---+---------+---+
| _1|       _2| _3|
+---+---------+---+
|  1|Adventure|  1|
|  1|Animation|  1|
|  1| Children|  1|
|  1|   Comedy|  1|
|  1|  Fantasy|  1|
+---+---------+---+
only showing top 5 rows



Another simple example:

In [26]:
case class Sentence(id: Int, text: String)

defined [32mclass[39m [36mSentence[39m

Create a sample Dataset[^2]:

In [43]:
val sentences = Seq(
    Sentence(1, "Australia is a large continent and a island"),
    Sentence(2, "Sri Lanka is not a continent but a island"),
).toDS

[36msentences[39m: [32mDataset[39m[[32mSentence[39m] = [id: int, text: string]

if you use map:

In [44]:
val words = sentences.map( s => s.text.split("\\s+"))
words.show(truncate=false)

+----------------------------------------------------+
|value                                               |
+----------------------------------------------------+
|[Australia, is, a, large, continent, and, a, island]|
|[Sri, Lanka, is, not, a, continent, but, a, island] |
+----------------------------------------------------+



[36mwords[39m: [32mDataset[39m[[32mArray[39m[[32mString[39m]] = [value: array<string>]

As shown above, after splitting, the data is stored as an `Array` of `String`s.

if you use the `flatmap`:

In [45]:
val wordsFlat = sentences.flatMap( s => s.text.split("\\s+"))
wordsFlat.show(5, truncate=false)

+---------+
|value    |
+---------+
|Australia|
|is       |
|a        |
|large    |
|continent|
+---------+
only showing top 5 rows



[36mwordsFlat[39m: [32mDataset[39m[[32mString[39m] = [value: string]

### join Transformation

The `join` transformation combines two Datasets based on a join condition (typically equality on one or more columns). This is a **wide transformation** that usually requires a shuffle to co-locate matching keys (unless Spark can broadcast one side). The result is a **DataFrame** (untyped), so the compile-time type information is lost.

Signature:

```scala
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
```

Translation: Join this Dataset with another Dataset using a join expression and join type, returning a DataFrame.

#### Join Types

| Join Type             | Description                    | Behavior                                                   |
| --------------------- | ------------------------------ | ---------------------------------------------------------- |
| `inner`               | Inner join (default)           | Returns only matching rows from both Datasets              |
| `left`/`left_outer`   | Left outer join                | Returns all rows from left, nulls for non-matches on right |
| `right`/`right_outer` | Right outer join               | Returns all rows from right, nulls for non-matches on left |
| `full`/`full_outer`   | Full outer join                | Returns all rows from both, nulls for non-matches          |
| `left_semi`           | Left semi join                 | Returns rows from left that have matches in right          |
| `left_anti`           | Left anti join                 | Returns rows from left that don't have matches in right    |
| `cross`               | Cross join (Cartesian product) | Returns all combinations of rows                           |

Table📝[^1]: Join Types
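
A small in-memory example can make the join types in the table tangible. This is a sketch under stated assumptions: `session` is a local session, and `Film` and `Score` are hypothetical case classes. One film has two ratings, one has a single rating, and one has none, so each join type gives a different result.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("join-types-sketch").getOrCreate()
import session.implicits._

// Hypothetical case classes for illustration only.
case class Film(movieId: Int, title: String)
case class Score(userId: Int, movieId: Int, rating: Double)

val films = Seq(Film(1, "Toy Story (1995)"), Film(2, "Jumanji (1995)"), Film(3, "Heat (1995)")).toDS()
val scores = Seq(Score(10, 1, 4.0), Score(11, 1, 5.0), Score(10, 2, 3.5)).toDS()

val cond = films("movieId") === scores("movieId")

// inner: only matching pairs (2 for film 1, 1 for film 2).
val innerCount = films.join(scores, cond, "inner").count()

// left: all inner rows plus film 3 padded with nulls.
val leftCount = films.join(scores, cond, "left").count()

// left_semi: films that have at least one rating, each listed once.
val semiCount = films.join(scores, cond, "left_semi").count()

// left_anti: films with no rating at all.
val unratedTitles = films.join(scores, cond, "left_anti").select("title").as[String].collect().toSeq
```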

In [46]:
case class Rating(
  userId: Int,
  movieId: Int,
  rating: Double,
  timestamp: Long
)

defined [32mclass[39m [36mRating[39m

In [47]:
val ratingsDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ml-latest-small/ratings.csv")
  .as[Rating]

[36mratingsDS[39m: [32mDataset[39m[[32mRating[39m] = [userId: int, movieId: int ... 2 more fields]

The join is performed on the common `movieId` column that exists in both datasets.

In [48]:
val movieRatingsDS = ratingsDS.join(moviesDS, "movieId")

[36mmovieRatingsDS[39m: [32mDataFrame[39m = [movieId: int, userId: int ... 4 more fields]

In [49]:
movieRatingsDS.show(5, truncate=false)

+-------+------+------+---------+---------------------------+-------------------------------------------+
|movieId|userId|rating|timestamp|title                      |genres                                     |
+-------+------+------+---------+---------------------------+-------------------------------------------+
|1      |1     |4.0   |964982703|Toy Story (1995)           |Adventure|Animation|Children|Comedy|Fantasy|
|3      |1     |4.0   |964981247|Grumpier Old Men (1995)    |Comedy|Romance                             |
|6      |1     |4.0   |964982224|Heat (1995)                |Action|Crime|Thriller                      |
|47     |1     |5.0   |964983815|Seven (a.k.a. Se7en) (1995)|Mystery|Thriller                           |
|50     |1     |5.0   |964982931|Usual Suspects, The (1995) |Crime|Mystery|Thriller                     |
+-------+------+------+---------+---------------------------+-------------------------------------------+
only showing top 5 rows



In [55]:
val avgRatingsDS = ratingsDS.groupBy("movieId").avg("rating")
avgRatingsDS.show(5, truncate = false)

+-------+-----------------+
|movieId|avg(rating)      |
+-------+-----------------+
|1580   |3.487878787878788|
|2366   |3.64             |
|3175   |3.58             |
|1088   |3.369047619047619|
|32460  |4.25             |
+-------+-----------------+
only showing top 5 rows



[36mavgRatingsDS[39m: [32mDataFrame[39m = [movieId: int, avg(rating): double]

> üíÅüèª‚Äç‚ôÇÔ∏è Important to notice that the join output is a Dataframe(`Dataset[Row]`), not a Dataset.

Above `avgRatingsDS` Dataframe can be joined with `moviesDS` Dataset, but the result is Dataframe `Dataset[Row]`: 

In [57]:
val avgMovieRatingsDS =avgRatingsDS.join(moviesDS, "movieId")
    .select("movieId", "Title", "avg(rating)")
    .orderBy("avg(rating)")

[36mavgMovieRatingsDS[39m: [32mDataset[39m[[32mRow[39m] = [movieId: int, Title: string ... 1 more field]

In [58]:
avgMovieRatingsDS.show(5, truncate=false)

+-------+-----------------------+-----------+
|movieId|Title                  |avg(rating)|
+-------+-----------------------+-----------+
|138186 |Sorrow (2015)          |0.5        |
|5105   |Don't Look Now (1973)  |0.5        |
|89386  |Pearl Jam Twenty (2011)|0.5        |
|72424  |Derailed (2002)        |0.5        |
|134246 |Survivor (2015)        |0.5        |
+-------+-----------------------+-----------+
only showing top 5 rows
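
When the untyped result of `join` is needed as a typed Dataset again, one option is to describe the joined row with a case class and call `.as[T]`. The sketch below assumes a local session (`session`); `Film`, `Score`, and `ScoredFilm` are hypothetical case classes whose field names match the joined columns.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("retype-sketch").getOrCreate()
import session.implicits._

// Hypothetical case classes for illustration only.
case class Film(movieId: Int, title: String)
case class Score(userId: Int, movieId: Int, rating: Double)
case class ScoredFilm(movieId: Int, userId: Int, rating: Double, title: String)

val films = Seq(Film(1, "Toy Story (1995)"), Film(2, "Jumanji (1995)")).toDS()
val scores = Seq(Score(10, 1, 4.0), Score(10, 2, 3.5)).toDS()

// join on a shared column name returns an untyped DataFrame.
val joined: DataFrame = scores.join(films, "movieId")

// Recover static types; fields are matched to columns by name.
val typed: Dataset[ScoredFilm] = joined.as[ScoredFilm]
val summary: Seq[(String, Double)] =
  typed.map(s => (s.title, s.rating)).collect().sortBy(_._1).toSeq
```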



### joinWith Transformation

The `joinWith` transformation is a **type-safe** alternative to the standard join. Unlike `join`, it returns a **Dataset of tuples** `Dataset[(T, U)]`, preserving type information from both Datasets. This is roughly equivalent to a **co-group** operation in RDD terminology.

#### Signature

```scala
def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
```

Translation: Join this `Dataset[T]` with another `Dataset[U]` using a condition, returning a Dataset of tuples in which each row holds one element from each side as a nested struct (`_1` and `_2`).

#### Key Differences from join

| Aspect        | join                | joinWith                  |
| ------------- | ------------------- | ------------------------- |
| Return Type   | DataFrame (untyped) | Dataset[(T, U)] (typed)   |
| Type Safety   | ❌ Lost              | ✅ Preserved               |
| Column Access | By name (string)    | By object fields          |
| Use Case      | SQL-style queries   | Type-safe transformations |

Table📝[^1]: Key differences
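
The contrast in the table can be sketched side by side. The example assumes a local session (`session`) and hypothetical `Film` and `Score` case classes; the explicit type annotations show what each operation returns, and the last two values show string-based versus field-based access.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("joinwith-sketch").getOrCreate()
import session.implicits._

// Hypothetical case classes for illustration only.
case class Film(movieId: Int, title: String)
case class Score(userId: Int, movieId: Int, rating: Double)

val films = Seq(Film(1, "Toy Story (1995)"), Film(2, "Jumanji (1995)")).toDS()
val scores = Seq(Score(10, 1, 4.0), Score(11, 2, 3.5)).toDS()
val cond = films("movieId") === scores("movieId")

// join: untyped result, columns addressed by name.
val untyped: DataFrame = films.join(scores, cond)
val titlesByName: Seq[String] = untyped.select("title").as[String].collect().sorted.toSeq

// joinWith: typed pairs, fields addressed as Scala members.
val typedPairs: Dataset[(Film, Score)] = films.joinWith(scores, cond)
val labels: Seq[String] =
  typedPairs.map { case (f, s) => s"${f.title}: ${s.rating}" }.collect().sorted.toSeq
```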

#### JoinWith Examples

In [89]:
val movieRatingsDS = moviesDS.joinWith(
    avgRatingsDS, moviesDS("movieId") === avgRatingsDS("movieId") )
    .orderBy(avgRatingsDS("avg(rating)").desc)

[36mmovieRatingsDS[39m: [32mDataset[39m[([32mMovie[39m, [32mRow[39m)] = [_1: struct<movieId: int, title: string ... 1 more field>, _2: struct<movieId: int, avg(rating): double>]

In [90]:
movieRatingsDS.show(5, truncate=false)

+-----------------------------------------------------------------+-------------+
|_1                                                               |_2           |
+-----------------------------------------------------------------+-------------+
|{142444, The Editor (2015), Comedy|Horror|Mystery}               |{142444, 5.0}|
|{152711, Who Killed Chea Vichea? (2010), Documentary}            |{152711, 5.0}|
|{157775, Tenchi Muyô! In Love (1996), Animation|Comedy}          |{157775, 5.0}|
|{496, What Happened Was... (1994), Comedy|Drama|Romance|Thriller}|{496, 5.0}   |
|{8911, Raise Your Voice (2004), Romance}                         |{8911, 5.0}  |
+-----------------------------------------------------------------+-------------+
only showing top 5 rows



In [91]:
movieRatingsDS.printSchema()

root
 |-- _1: struct (nullable = false)
 |    |-- movieId: integer (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- genres: string (nullable = true)
 |-- _2: struct (nullable = false)
 |    |-- movieId: integer (nullable = true)
 |    |-- avg(rating): double (nullable = true)



> 💁🏻‍♂️ Note that the return type of the `joinWith` operation is a `Dataset`, not a DataFrame.

Field access on the ten top-rated films (the `Movie` side is typed, while the `Row` side still needs `getAs`):

In [93]:
movieRatingsDS.map{ case (m, r) => 
    s"avg ratings for the ${m.title} is ${r.getAs[Double]("avg(rating)")} " }.show(10, truncate=false)

+------------------------------------------------------------+
|value                                                       |
+------------------------------------------------------------+
|avg ratings for the Awfully Big Adventure, An (1995) is 5.0 |
|avg ratings for the What Happened Was... (1994) is 5.0      |
|avg ratings for the Strictly Sexual (2008) is 5.0           |
|avg ratings for the The Love Bug (1997) is 5.0              |
|avg ratings for the The Editor (2015) is 5.0                |
|avg ratings for the Tenchi Muyô! In Love (1996) is 5.0      |
|avg ratings for the Raise Your Voice (2004) is 5.0          |
|avg ratings for the One I Love, The (2014) is 5.0           |
|avg ratings for the Empties (2007) is 5.0                   |
|avg ratings for the Who Killed Chea Vichea? (2010) is 5.0   |
+------------------------------------------------------------+
only showing top 10 rows
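
Because the averages come from an untyped aggregation, the right-hand side of the pair above is a `Row`. One way to keep both sides typed, sketched here under the assumption of a local session (`session`) and hypothetical case classes (`Film`, `Score`, `AvgScore`), is to alias the aggregate column and convert it with `.as[T]` before calling `joinWith`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

// Assumption: a local session; getOrCreate() reuses an existing one if present.
val session = SparkSession.builder().master("local[*]").appName("typed-avg-sketch").getOrCreate()
import session.implicits._

// Hypothetical case classes for illustration only.
case class Film(movieId: Int, title: String)
case class Score(userId: Int, movieId: Int, rating: Double)
case class AvgScore(movieId: Int, avgRating: Double)

val films = Seq(Film(1, "Toy Story (1995)"), Film(2, "Jumanji (1995)")).toDS()
val scores = Seq(Score(10, 1, 4.0), Score(11, 1, 5.0), Score(10, 2, 3.5)).toDS()

// Alias the aggregate so the column name matches the case class field.
val avgScores: Dataset[AvgScore] =
  scores.groupBy("movieId").agg(avg("rating").as("avgRating")).as[AvgScore]

// Both sides of the pair are now typed, so no getAs is needed.
val pairs: Dataset[(Film, AvgScore)] =
  films.joinWith(avgScores, films("movieId") === avgScores("movieId"))

val summary: Seq[(String, Double)] =
  pairs.map { case (f, a) => (f.title, a.avgRating) }.collect().sortBy(_._1).toSeq
```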



Next, let's load the Tags data and join it with the ratings:

In [94]:
case class Tag(
  userId: Int,
  movieId: Int,
  tag: String,
  timestamp: Long
)

defined [32mclass[39m [36mTag[39m

In [98]:
val tagsDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ml-latest-small/tags.csv")
  .as[Tag]

[36mtagsDS[39m: [32mDataset[39m[[32mTag[39m] = [userId: int, movieId: int ... 2 more fields]

In [99]:
tagsDS.show(3, truncate=false)

+------+-------+---------------+----------+
|userId|movieId|tag            |timestamp |
+------+-------+---------------+----------+
|2     |60756  |funny          |1445714994|
|2     |60756  |Highly quotable|1445714996|
|2     |60756  |will ferrell   |1445714992|
+------+-------+---------------+----------+
only showing top 3 rows



It helps to build the key columns and the join condition programmatically first, so that the key is easy to extend:

In [100]:
val keyCols = Seq("movieId", "userId")
val keyCondition = keyCols.map(col => tagsDS(col) === ratingsDS(col)).reduce( _ && _ ) 

[36mkeyCols[39m: [32mSeq[39m[[32mString[39m] = [33mList[39m([32m"movieId"[39m, [32m"userId"[39m)
[36mkeyCondition[39m: [32mColumn[39m = ((movieId = movieId) AND (userId = userId))

First, join the `ratingsDS` and the `tagsDS`:

In [116]:
val tags4RatingsDS = ratingsDS.joinWith(tagsDS, keyCondition, "left")

[36mtags4RatingsDS[39m: [32mDataset[39m[([32mRating[39m, [32mTag[39m)] = [_1: struct<userId: int, movieId: int ... 2 more fields>, _2: struct<userId: int, movieId: int ... 2 more fields>]

In [117]:
tags4RatingsDS.printSchema()

root
 |-- _1: struct (nullable = false)
 |    |-- userId: integer (nullable = true)
 |    |-- movieId: integer (nullable = true)
 |    |-- rating: double (nullable = true)
 |    |-- timestamp: integer (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- userId: integer (nullable = true)
 |    |-- movieId: integer (nullable = true)
 |    |-- tag: string (nullable = true)
 |    |-- timestamp: integer (nullable = true)



In [118]:
tags4RatingsDS.show(3)

+--------------------+----+
|                  _1|  _2|
+--------------------+----+
|{1, 1, 4.0, 96498...|null|
|{1, 3, 4.0, 96498...|null|
|{1, 6, 4.0, 96498...|null|
+--------------------+----+
only showing top 3 rows



Second, join the result with `moviesDS`, using the `movieId` of the rating (`_1.movieId`) as the foreign key:

In [123]:
val tags4RatingsWithMoviesDS = tags4RatingsDS.joinWith(moviesDS, 
                        tags4RatingsDS("_1.movieId") === moviesDS("movieId"))

[36mtags4RatingsWithMoviesDS[39m: [32mDataset[39m[(([32mRating[39m, [32mTag[39m), [32mMovie[39m)] = [_1: struct<_1: struct<userId: int, movieId: int ... 2 more fields>, _2: struct<userId: int, movieId: int ... 2 more fields>>, _2: struct<movieId: int, title: string ... 1 more field>]

In [124]:
tags4RatingsWithMoviesDS.printSchema()

root
 |-- _1: struct (nullable = false)
 |    |-- _1: struct (nullable = false)
 |    |    |-- userId: integer (nullable = true)
 |    |    |-- movieId: integer (nullable = true)
 |    |    |-- rating: double (nullable = true)
 |    |    |-- timestamp: integer (nullable = true)
 |    |-- _2: struct (nullable = true)
 |    |    |-- userId: integer (nullable = true)
 |    |    |-- movieId: integer (nullable = true)
 |    |    |-- tag: string (nullable = true)
 |    |    |-- timestamp: integer (nullable = true)
 |-- _2: struct (nullable = false)
 |    |-- movieId: integer (nullable = true)
 |    |-- title: string (nullable = true)
 |    |-- genres: string (nullable = true)



```mermaid
---
config:
  look: neo
  theme: default
---
graph LR
    root[root]
    root --> _1_outer["_1: struct"]
    root --> _2_outer["_2: struct"]
    
    _1_outer --> _1_inner["_1: struct"]
    _1_outer --> _2_inner["_2: struct"]
    
    _1_inner --> userId1["userId: integer"]
    _1_inner --> movieId1["movieId: integer"]
    _1_inner --> rating["rating: double"]
    _1_inner --> timestamp1["timestamp: integer"]
    
    _2_inner --> userId2["userId: integer"]
    _2_inner --> movieId2["movieId: integer"]
    _2_inner --> tag["tag: string"]
    _2_inner --> timestamp2["timestamp: integer"]
    
    _2_outer --> movieId3["movieId: integer"]
    _2_outer --> title["title: string"]
    _2_outer --> genres["genres: string"]
    
    style root fill:#e1f5ff
    style _1_outer fill:#fff4e6
    style _2_outer fill:#fff4e6
    style _1_inner fill:#f0f0f0
    style _2_inner fill:#f0f0f0
```

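Pattern matching on the nested tuple extracts all of its elements. Because the tags were attached with a left join, the `Tag` element can be `null`, so it is wrapped in an `Option` when reading `tags4RatingsWithMoviesDS`: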

In [125]:
case class UserRatingTag(userId: Int, 
                         movie: String, 
                         rating: Double, 
                         tag: Option[String] 
                        )

val userRatingTagDS = tags4RatingsWithMoviesDS.map {

    case ((r, t), m) => UserRatingTag(
        r.userId,
        m.title,
        r.rating,
        Option(t).map(_.tag)
    )
}

defined [32mclass[39m [36mUserRatingTag[39m
[36muserRatingTagDS[39m: [32mDataset[39m[[32mUserRatingTag[39m] = [userId: int, movie: string ... 2 more fields]

In [127]:
userRatingTagDS.filter(_.tag.isDefined).show(truncate=false)

+------+-------------------------------+------+-----------------+
|userId|movie                          |rating|tag              |
+------+-------------------------------+------+-----------------+
|2     |Step Brothers (2008)           |5.0   |will ferrell     |
|2     |Step Brothers (2008)           |5.0   |Highly quotable  |
|2     |Step Brothers (2008)           |5.0   |funny            |
|2     |Warrior (2011)                 |5.0   |Tom Hardy        |
|2     |Warrior (2011)                 |5.0   |MMA              |
|2     |Warrior (2011)                 |5.0   |Boxing story     |
|2     |Wolf of Wall Street, The (2013)|5.0   |Martin Scorsese  |
|2     |Wolf of Wall Street, The (2013)|5.0   |Leonardo DiCaprio|
|2     |Wolf of Wall Street, The (2013)|5.0   |drugs            |
|7     |Departed, The (2006)           |1.0   |way too long     |
|18    |Carlito's Way (1993)           |4.0   |mafia            |
|18    |Carlito's Way (1993)           |4.0   |gangster         |
|18    |Ca

[^1]: Chambers, B., Zaharia, M., 2018. Spark: The Definitive Guide. Ch. 11: "Datasets"

[^2]: Karau, H., Warren, R., 2017. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. Ch. 3: "DataFrames, Datasets, and Spark SQL"

[^3]: Chambers, B., Zaharia, M., 2018. Spark: The Definitive Guide. Ch. 13: "Advanced RDDs"

[^4]: Karau, H., Warren, R., 2017. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. Ch. 4: "Joins (SQL and Core)"

[^5]: Karau, H., Warren, R., 2017. High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark. Ch. 6: "Working with Key/Value Data"

[^6]: Ryza, S., Laserson, U., Owen, S., Wills, J., 2017. Advanced Analytics with Spark, 2nd Edition. Ch. 2: "Introduction to Data Analysis with Scala and Spark"

[^7]: [Apache Spark Dataset API Documentation](https://spark.apache.org/docs/2.4.8/api/scala/index.html#org.apache.spark.sql.Dataset) - Scala 2.x API

{:gtxt: .message color="green"}
{:ytxt: .message color="yellow"}
{:rtxt: .message color="red"}

In [21]:
scala.util.Properties.versionString


[36mres21[39m: [32mString[39m = [32m"version 2.12.20"[39m

In [21]:
// spark.stop()