# Spark basics

One of the most popular ways to use Spark is with the Python library `pyspark`.

# Installation

PySpark has many dependencies: not only other Python packages, but also other modules that cannot be easily installed with the convenient `pip install` command. While you can install pyspark using `pip install pyspark`, this is probably not going to be enough. Therefore, we recommend that you follow these steps:

1. Visit the [Spark download page](https://spark.apache.org/downloads.html) and:
- Choose latest release
- Download package locally

2. Create a folder (for example `spark`)  in a directory that you know will be safe. `~/` is usually a good option. 
3. Extract the files from the downloaded archive into the folder you created. At the time of writing, the latest version was Spark 3.1.2; if you use the same version, your directory will look like this:
```
~/
│
├── spark/
│   └── spark-3.1.2-bin-hadoop3.2  <--- SPARK_HOME
│         ├── bin
│         ├── conf
│         ├── data
... 
```
4. It is important that you set this directory as `SPARK_HOME`; otherwise, PySpark won't know where to find the corresponding commands. To do so, set it as an environment variable by adding the following command to your `~/.bashrc` file:

`export SPARK_HOME=<path to your home directory>/spark/spark-3.1.2-bin-hadoop3.2`

_Note: The command above depends on where you extracted the files you downloaded and the version_

> Don't skip this step. Having an incorrectly set `SPARK_HOME` environment variable is the cause of many common issues with Spark

5. Save your `~/.bashrc` and open a new terminal (or run `source ~/.bashrc`) so the change takes effect. You should be able to use PySpark now! If not, try restarting VS Code, and if that doesn't work, try restarting your computer.

6. To check if the installation was successful, you can install findspark (`pip install findspark`) and run the following cell

In [2]:
import findspark

findspark.init()

If `SPARK_HOME` does not point to your Spark folder, the cell fails with an error like this one:

Exception: Unable to find py4j in ~/tools/spark-3.2.1-bin-hadoop3.2/python, your SPARK_HOME may not be configured correctly


<details>
  <summary> <font size=+1> For Windows Users </font> </summary>
  
  Depending on your environment, the last steps might not work. In that case, you have to set the environment variable manually. Look at the following gif to see how to do it:

  <p align=center><img src=images/Spark_home.gif></p>

  If this still doesn't work

</details>


## findspark

Spark might not be discoverable from within a script or a notebook, so you can use `findspark`, which makes `pyspark` importable there so that you can keep using Spark interactively. Remember that:

1. Inside the script you are going to define the instructions
2. Those instructions will be orchestrated amongst the executors using Spark
3. PySpark is the API that lets you write those instructions in Python. They are then translated so that Spark actually understands them

Thus, you will create the script using PySpark, and then, you will send that script to Spark, usually using spark-submit, which we will see later in this notebook.

`findspark` is useful while you are developing your application, to check that Spark responds the way you expect as you write your code.

- Run `findspark.init()` (which sets up the necessary environment variables so `pyspark` can connect to the JVM)
- You can also run `findspark.find()` to see the directory that `SPARK_HOME` points to

In [1]:
import findspark

findspark.init()
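
If `findspark.init()` fails, it can help to inspect `SPARK_HOME` directly. The following is a minimal sketch using only the standard library; the helper name `check_spark_home` is hypothetical, and the assumption that a Spark distribution contains `bin` and `python` sub-folders is taken from the directory tree and the error message shown earlier.

```python
import os
from pathlib import Path


def check_spark_home(env=None):
    """Return a list of problems found with SPARK_HOME (empty if none)."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    root = Path(spark_home).expanduser()
    if not root.is_dir():
        return [f"{root} is not a directory"]

    # Sub-folders we assume every Spark distribution contains
    return [
        f"missing folder: {root / sub}"
        for sub in ("bin", "python")
        if not (root / sub).is_dir()
    ]


print(check_spark_home() or "SPARK_HOME looks fine")
```

An empty list means the variable points to a folder that at least looks like a Spark distribution; it does not prove that the versions of Spark and `pyspark` match.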

## Spark config

Given all of the steps above, we can set up Spark's distributed processing engine using:
- A programmatic interface (`pyspark` in our case) - usable for application specific tasks and varying configuration
- Command line - usable for `spark-submit` and __overriding default values__
- Config file - usable as a base config and __when we submit job to the cluster__
- Global config file

> The list above is also a priority list: each option overrides the configuration set by the ones below it
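
To make the priority order concrete, here is a plain-Python sketch (it does not use Spark at all). Each dictionary stands for one configuration source, and all of the values are made up for illustration; `collections.ChainMap` simply returns the first match when looking a key up from left to right.

```python
from collections import ChainMap

# Hypothetical values, one dict per configuration source
global_defaults = {"spark.executor.memory": "1g", "spark.eventLog.enabled": "false"}
config_file = {"spark.executor.memory": "2g"}
command_line = {"spark.executor.memory": "4g", "spark.app.name": "FromCli"}
programmatic = {"spark.app.name": "TestApp"}

# Sources listed from highest to lowest priority
effective = ChainMap(programmatic, command_line, config_file, global_defaults)
resolved = dict(effective)

print(resolved["spark.app.name"])          # TestApp (programmatic beats command line)
print(resolved["spark.executor.memory"])   # 4g (command line beats the config files)
print(resolved["spark.eventLog.enabled"])  # false (only set in the global defaults)
```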

In [None]:
import multiprocessing

import pyspark

cfg = (
    pyspark.SparkConf()
    # Setting where master node is located [cores for multiprocessing]
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    # Setting application name
    .setAppName("TestApp")
    # Setting config value via string
    .set("spark.eventLog.enabled", False)
    # Setting environment variables for executors to use
    .setExecutorEnv(pairs=[("VAR3", "value3"), ("VAR4", "value4")])
    # Setting memory if this setting was not set previously
    .setIfMissing("spark.executor.memory", "1g")
)

# Getting a single variable
print(cfg.get("spark.executor.memory"))
# Listing all of them in string readable format
print(cfg.toDebugString())

1g
spark.master=local[8]
spark.app.name=TestApp
spark.eventLog.enabled=False
spark.executorEnv.VAR3=value3
spark.executorEnv.VAR4=value4
spark.executor.memory=1g


# Sessions

> PySpark's session object provides a unified connection to our Spark cluster.

There are a few ways to set up the Spark session:
- directly through named/unnamed arguments
- using `SparkConf` object (which we created and will use)
- Providing `SparkContext` with settings (this is deprecated, avoid doing this)

The Spark session is used to:
- create `DataFrame`s (the main object containing data within cluster)
- broadcast variables to machines within the cluster
- Run operations across HDFS enabled systems

Spark and `pyspark` provide a few objects that can be used to interact with the Spark engine:
- `pyspark.SparkContext`
- `org.apache.spark.sql.SQLContext` (Only for Scala)
- `org.apache.spark.sql.hive.HiveContext` (Only Scala)
- `pyspark.sql.SparkSession`

What are these different options and why do they exist?

### SparkContext

> `SparkContext` is the object used by any driver to communicate with the cluster manager, execute and coordinate jobs

This object is always used under the hood, if not directly, to interact with the cluster. Direct use of the Spark context is deprecated and should be avoided.

### SQLContext

Previously, you had to provide `SparkContext` to this object in order to interact with SQL-like capabilities (e.g. creating a `DataFrame`) using the `SparkSQL` library

### HiveContext

> __Extension of SQLContext providing gateway to Hive__

Hive is similar in structure to SQL but provides capabilities for data warehousing and is better suited for analyzing large scale data

## SparkSession

In Spark `v2.0`, one object to rule them all was introduced: `SparkSession`. It wraps the functionality of all the contexts introduced above (SparkContext, SQLContext, HiveContext) into one API.

In `pyspark` one can use it via `pyspark.sql.SparkSession`.

Its `builder` attribute has methods to obtain the appropriate `SparkSession`.

The builder's `config` method is used first to set the configuration.

The `getOrCreate` method does the following:
- If no global `Session` exists create a new one with specified config
- If global `Session` exists:
    - Get an instance of it
    - Apply the new configuration to it

This approach is safe, as using multiple contexts is a bad practice (although possible)


This SparkSession can be used just like the other context objects were historically.

In [7]:
session = pyspark.sql.SparkSession.builder.config(conf=cfg).getOrCreate()

# Data Structures

Before diving in, we need to talk about the `3` data structures available in Spark:
- `RDD` - Resilient Distributed Dataset - fault-tolerant collection of elements that can be operated on in parallel
- `DataFrame` -  dataset organised into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimisations under the hood.
- `Dataset` - distributed collection of data. Provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine

![](./images/rdd_df_dataset_history.png)

# RDD & Core Spark API

> __The core building block of Spark applications, offering "low-level" operations__

> __Fault-tolerant collection of elements that can be operated on in parallel.__

This structure provides strong typing (via `JVM` objects) and can be constructed in two ways:
- __parallelizing existing collection__ (e.g. Python's `list`)
- __referencing dataset in external storage__ (anything compatible with Hadoop's InputFormat like HDFS, HBase, Amazon S3, text files etc.)

Let's see these options:

In [8]:
rddDistributedData = session.sparkContext.parallelize([1, 2, 3, 4, 5])
rddDistributedFile = session.sparkContext.textFile("lorem.txt")

__Things to note for files__:
- __Each file has to be in the same path on each worker node!__ (in our case we are running locally hence this is fine)
- All file-based methods operate on:
    - directories - `textFile("/my/directory")`
    - wildcards - `textFile("/my/directory/*.txt")`
    - compressed files - `textFile("/my/directory/*.gz")`
- We can change number of partitions created for this file
- See API [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.textFile.html#pyspark.SparkContext.textFile)

> __Other ways to create RDD from file can be seen in [Spark Context API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#spark-context-apis), e.g. a way to create it from `pickle`__

## Lazy Evaluation

> Created RDDs __ARE NOT FILES__; they are merely a description of operations that __have to be run at some point__

What we did above means:
- Describe how to parallelize a `list`
- Describe how to read from the text file `lorem.txt` (__but the read wasn't performed!__)

> All of the operations will be run when we __request an ACTION__

Actions may include:
- return number of lines in file (whole map-reduce went through)
- sum the list and return the result
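
Python generators behave in a similar way, so they make a handy analogy. The sketch below does not use Spark: `read_lines` is a hypothetical stand-in for reading `lorem.txt`, and `log` records when the "read" really happens.

```python
log = []  # records when the (simulated) file read actually happens


def read_lines():
    # The body of a generator only runs once something iterates over it
    log.append("read lorem.txt")
    yield from ["lorem ipsum", "dolor sit amet"]


# "Transformations": build the pipeline, nothing is read yet
words = (word for line in read_lines() for word in line.split())
reads_before_action = list(log)

# "Action": asking for a result forces the whole pipeline to run
word_count = sum(1 for _ in words)

print(reads_before_action)  # []
print(word_count, log)      # 5 ['read lorem.txt']
```

The analogy is loose: Spark additionally plans and distributes the work, but the key idea (describing work first, running it only when a result is requested) is the same.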

## Persist

> Persisting is used in order to speed-up computations (saving intermediate results in memory)

If we run the line below it means:

> Read data file and cache read contents in the memory (if possible)

> __If we run "action" on the file it will use the cached data (faster) rather than loading data from disk once again!__

Rule of thumb: 

> Use cache when the lineage (operations to run on certain RDD) of your RDD branches out or when an RDD is used multiple times like in a loop.
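
To see why caching pays off when a dataset is used more than once, consider this toy, Spark-free sketch. `LazyDataset` is a hypothetical class written for this illustration only; it recomputes its source on every "action" unless `cache()` was called, and it counts how many times that happens.

```python
class LazyDataset:
    """Toy stand-in for an RDD: recomputed on every action unless cached."""

    def __init__(self, compute):
        self._compute = compute
        self._use_cache = False
        self._cached = None
        self.computations = 0

    def cache(self):
        # Like in Spark, this only marks the dataset; nothing is computed yet
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def collect(self):
        return self._materialize()


plain = LazyDataset(lambda: ["lorem", "ipsum"])
plain.count(), plain.collect()
cached = LazyDataset(lambda: ["lorem", "ipsum"]).cache()
cached.count(), cached.collect()

print(plain.computations, cached.computations)  # 2 1
```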

In [9]:
# All of the operations return self
# This allows us to chain operations (we will see it in the next cell)

rddDistributedFile = rddDistributedFile.cache()

> __`.cache()` is the same as `.persist()` with `StorageLevel.MEMORY_ONLY`__

There are a few other options to store the data:
- `MEMORY_ONLY` - keep everything we can in memory; partitions that do not fit are not cached and are recomputed when needed
- `MEMORY_AND_DISK` - keep everything we can in memory otherwise serialize to disk (__encouraged for long running computations we would like to cache__)
- `DISK_ONLY` - cache everything on disk, nothing in memory (__discouraged__)
- `MEMORY_ONLY_2` - same as `MEMORY_ONLY` but replicates cache on two cluster nodes for improved fault tolerance (`DISK_ONLY_2` is also available)

In [7]:
pyspark.StorageLevel.DISK_ONLY

StorageLevel(True, False, False, False, 1)

## MapReduce operations

> Given parallelized data structure we can run map-reduce operations on it

All of them can be seen [in the documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#rdd-apis), a few interesting ones:
- [`rdd.checkpoint()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.checkpoint.html#pyspark.RDD.checkpoint) - mark the RDD for checkpointing: it will be saved in the checkpoint directory and all the operations creating it __are discarded__ (not an action itself; the data is written when an action computes the RDD)
- [`rdd.collect()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.collect.html#pyspark.RDD.collect) - __return the structure__ (collect it after operations) (action)
- [`rdd.count()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.count.html#pyspark.RDD.count) - count elements in the structure (action)
- [`rdd.countByKey()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countByKey.html#pyspark.RDD.countByKey) - count number of elements for each `key` in `(key, value)` pairs (similar to what the graphic before did)
- [`rdd.countByValue()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.countByValue.html#pyspark.RDD.countByValue) - count __how many times each unique value occurs__ in this structure (returned as a `{value: count}` dictionary) (action)

__And the essential ones we will use are:__
- [`rdd.map(f)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html#pyspark.RDD.map) - apply function __to each element in the collection__
- [`rdd.filter(f)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.filter.html#pyspark.RDD.filter) - __choose values which fulfill `f` function__
- [`rdd.flatMap(f)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.flatMap.html#pyspark.RDD.flatMap) - __apply function to each element and `flatten` the list if necessary__
- [`rdd.fold(neutralValue, f)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.fold.html#pyspark.RDD.fold) - __given an associative function (like `add`) and a neutral value (like `0`), combines the elements pairwise and returns a single result__ (action)
- [`rdd.sortBy(keyfunction)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.sortBy.html#pyspark.RDD.sortBy) - sort by specific function which returns some value from the `(key, value)` pair

> __PLEASE REFER TO DOCUMENTATION WHEN LOOKING FOR AN OPERATOR! MANY OF THEM ARE ALREADY IMPLEMENTED!__

> __TAKE TIME TO COME UP WITH THE OPERATORS NEEDED! EACH OPERATION SAVED MIGHT IMPROVE RUNTIME TREMENDOUSLY!__

Let's see an example of chaining operations on data:

In [8]:
# sc is standard name for sparkContext
# it will be easier to use from now on

sc = session.sparkContext

In [9]:
import operator

data = list(range(10,-11,-1))
print(data)

result = (
    sc.parallelize(data)
    .filter(lambda val: val % 3 == 0)
    .map(operator.abs)
    .fold(0, operator.add)
)

result

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6, -7, -8, -9, -10]


36
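
The same pipeline can be emulated in plain Python to show why `fold` needs a *neutral* value. This is only a sketch: `fold_like` is a hypothetical helper, and the split into two partitions is an assumption made for illustration. It follows the pattern `fold` uses, where the zero value is applied once per partition and once more when the partial results are merged.

```python
import operator
from functools import reduce


def fold_like(partitions, zero, op):
    """Fold each partition, then fold the partial results together."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


data = list(range(10, -11, -1))
kept = [abs(v) for v in data if v % 3 == 0]  # filter + map
partitions = [kept[:3], kept[3:]]            # hypothetical split into 2 partitions

print(fold_like(partitions, 0, operator.add))  # 36, same as the result above
print(fold_like(partitions, 1, operator.add))  # 39, the 1 is added three times
```

With a value that is not neutral for the function, the result depends on the number of partitions, which is rarely what you want.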

In [10]:
sc.parallelize(["b", "a", "c"]).count()

3

In [10]:
rddDistributedFile.flatMap(lambda text: text.split()).countByValue()

defaultdict(int,
            {'Lorem': 3,
             'ipsum': 11,
             'dolor': 10,
             'sit': 50,
             'amet,': 1,
             'consectetur': 12,
             'adipiscing': 23,
             'elit,': 1,
             'sed': 66,
             'do': 1,
             'eiusmod': 1,
             'tempor': 11,
             'incididunt': 1,
             'ut': 48,
             'labore': 1,
             'et': 34,
             'dolore': 1,
             'magna': 15,
             'aliqua.': 1,
             'Quam': 5,
             'lacus': 21,
             'suspendisse': 17,
             'faucibus': 25,
             'interdum': 12,
             'posuere.': 3,
             'Dui': 5,
             'accumsan': 10,
             'amet': 53,
             'nulla': 31,
             'facilisi': 11,
             'morbi': 25,
             'tempus.': 2,
             'Lobortis': 4,
             'scelerisque': 23,
             'fermentum': 17,
             'dui': 21,
             'in': 48
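
For comparison, here is what `flatMap` followed by `countByValue` computes, sketched in plain Python on a tiny, made-up stand-in for `lorem.txt`. It also shows why the output above contains both `'amet,'` and `'amet'`: `split()` only splits on whitespace, so punctuation stays attached to the word.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the lines of lorem.txt
lines = ["Lorem ipsum dolor sit amet,", "sit amet ipsum"]

# flatMap: split every line, then flatten into one stream of words
words = chain.from_iterable(line.split() for line in lines)

# countByValue: number of occurrences of each distinct word
counts = Counter(words)

print(counts["sit"], counts["amet"], counts["amet,"])  # 2 1 1
```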

# Spark SQL

## Dataset and DataFrame

Dataset is a distributed collection of data which provides:
- strong typing and powerful lambda functions from `RDD`
- __allows for Spark SQL optimized execution engine__

It can be created from JVM objects __and manipulated in the same functional manner__.

> __`pyspark` has no Dataset API, but many benefits of `Dataset` are available for `DataFrame`s DUE TO PYTHON'S DYNAMIC NATURE__

DataFrame shortcomings include:
- No compile-time safety, hence __you cannot manipulate data whose structure is not specified__

> DataFrame is a Dataset organised into named columns (__same as for `pd.DataFrame`__)

From now on we will use `DataFrame`s (__not `Dataset`, also due to the Python community's familiarity with `pd.DataFrame`__) to keep our records.

See [this discussion](https://stackoverflow.com/questions/31508083/difference-between-dataframe-dataset-and-rdd-in-spark) for an extended description.

## Creating DataFrames

> __For all of the operations we can use `SparkSession` directly to interact with the cluster!__

There are a few options usable for us to read data residing on clusters (__for each node it has to be at the same location if reading from file!__):
- [`session.createDataFrame`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html#pyspark.sql.SparkSession.createDataFrame) - create `pyspark.sql.DataFrame` from:
    - `RDD`
    - `list`
    - `pandas.DataFrame`
    - __Optionally: with `schema`__ which specifies datatypes and format for data contained within it. See documentation for more info.
    - By default `schema` is inferred if possible
- [`session.range`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.range.html#pyspark.sql.SparkSession.range) - works like Python's range but distributed and as a `pyspark.sql.DataFrame`
- [`session.sql(query)`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.sql.html#pyspark.sql.SparkSession.sql) - __return DataFrame which represents result of `sql` query__
- [`session.read.{how_to_read}()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame) - __`session.read` is a `DataFrameReader` object__ whose methods allow us to read a `df` from:
    - `json`
    - `parquet`
    - `csv`
    - and many more
- [`session.readStream`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.readStream.html#pyspark.sql.SparkSession.readStream) - __used for streaming, we will see it a little later__

Let's see some code with `pyspark.sql.DataFrame`:

In [12]:
import numpy as np
import pandas as pd

df = session.createDataFrame(
    pd.DataFrame(
        np.random.randint(0, 100, size=(100, 4)),
        columns=list("ABCD"),
    )
)

df.show()

+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
| 10|  1| 63| 97|
| 15| 84| 33| 85|
| 50| 33| 73|  9|
| 49|  3| 12| 52|
| 49|  4| 35| 51|
| 31| 28| 76| 42|
| 99|  5| 49| 19|
| 71| 66| 68| 50|
| 45| 48|  7| 32|
| 92| 37| 84| 19|
| 40| 53| 63| 30|
| 43| 22| 64| 46|
| 66| 49|  6| 67|
| 20| 41| 21| 71|
| 40| 93| 81|  9|
| 50| 78|  6| 31|
| 64| 62| 18| 43|
| 52| 33| 37| 50|
| 75| 88| 65| 82|
| 90|  8| 87|  5|
+---+---+---+---+
only showing top 20 rows



In [13]:
df.printSchema()

root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
 |-- D: long (nullable = true)



In [14]:
# show() is an action: it triggers the computation and prints the rows
# Without it, select() only returns a description of what will happen
df.select("A").show()

+---+
|  A|
+---+
| 10|
| 15|
| 50|
| 49|
| 49|
| 31|
| 99|
| 71|
| 45|
| 92|
| 40|
| 43|
| 66|
| 20|
| 40|
| 50|
| 64|
| 52|
| 75|
| 90|
+---+
only showing top 20 rows



In [15]:
df.select(df["A"], df["B"] + 1)

DataFrame[A: bigint, (B + 1): bigint]

In [16]:
# Increase column value by one
# This operation is shown in the output

df.select(df["A"], df["B"] + 1).show()

+---+-------+
|  A|(B + 1)|
+---+-------+
| 10|      2|
| 15|     85|
| 50|     34|
| 49|      4|
| 49|      5|
| 31|     29|
| 99|      6|
| 71|     67|
| 45|     49|
| 92|     38|
| 40|     54|
| 43|     23|
| 66|     50|
| 20|     42|
| 40|     94|
| 50|     79|
| 64|     63|
| 52|     34|
| 75|     89|
| 90|      9|
+---+-------+
only showing top 20 rows



In [17]:
counted = df.groupby("B").count().persist()
counted.filter(counted["count"] > 1).show()

+---+-----+
|  B|count|
+---+-----+
| 26|    3|
| 84|    2|
| 98|    2|
| 71|    2|
|  6|    2|
| 27|    2|
| 51|    3|
| 41|    2|
| 33|    2|
| 28|    3|
| 88|    3|
| 48|    5|
| 44|    2|
|  3|    3|
| 37|    2|
| 62|    3|
| 59|    2|
| 15|    2|
| 38|    2|
| 46|    4|
+---+-----+
only showing top 20 rows



## Operations on DataFrame

> __`pyspark.sql.DataFrame` supports most of the `pd.DataFrame` operations + the RDD ones__

You can see the whole list [here](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html#pyspark.sql.DataFrame)

> __In general one can work with it similarly to how one works with `pd.DataFrame` objects__

There are a few exceptions though...

## Running SQL queries

> In order to run SQL queries against a DataFrame, __we have to register it as a temporary view__

Properties of temporary views:
- __Session scoped__ - when the session that created them ends, so do the views registered for it
- One can share a `DataFrame` across every `SparkSession` in the application by using `df.createGlobalTempView("name_of_view")`; such a view is queried through the `global_temp` database, e.g. `global_temp.name_of_view`

After that, we can run SQL queries against __distributed data across nodes__:

In [18]:
df.createOrReplaceTempView("any_name")

# WE USE SESSION TO RUN QUERIES!
sqlDf = session.sql("SELECT * FROM any_name")
sqlDf.show()

+---+---+---+---+
|  A|  B|  C|  D|
+---+---+---+---+
| 10|  1| 63| 97|
| 15| 84| 33| 85|
| 50| 33| 73|  9|
| 49|  3| 12| 52|
| 49|  4| 35| 51|
| 31| 28| 76| 42|
| 99|  5| 49| 19|
| 71| 66| 68| 50|
| 45| 48|  7| 32|
| 92| 37| 84| 19|
| 40| 53| 63| 30|
| 43| 22| 64| 46|
| 66| 49|  6| 67|
| 20| 41| 21| 71|
| 40| 93| 81|  9|
| 50| 78|  6| 31|
| 64| 62| 18| 43|
| 52| 33| 37| 50|
| 75| 88| 65| 82|
| 90|  8| 87|  5|
+---+---+---+---+
only showing top 20 rows



# Spark-Submit

The work you have seen in this notebook sent applications to a cluster interactively, meaning that you were running all cells sequentially.

In a production environment, you are more likely to launch applications from a script that contains all the PySpark operations.

To do so, you can use `spark-submit`, which can be run from the terminal to _submit_ your Spark applications. The syntax is as follows:

```
spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

where:

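- __class__: the entry point for your application (needed for Java/Scala applications, not for Python scripts)
- __master__: the URL of your cluster. You can set it to `local` to run it locally
- __deploy-mode__: whether to deploy your driver on a worker node (`cluster`) or locally as an external client (`client`)
- __conf__: configuration of the Spark application in `key=value` form
- __application-jar__: path to your application (a `.py` file in the case of PySpark)

Among the other options, you can specify the number of executors or the number of cores per executor (which of these apply depends on the cluster manager):

- __--num-executors__
- __--executor-cores__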
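
When a submission is launched from another Python script, it is often convenient to assemble the command as a list of arguments. The helper below is a hypothetical sketch (the name `build_submit_command` and its parameters are not part of Spark); it only builds the command from the `--master`, `--deploy-mode` and `--conf` options described above and does not run anything.

```python
import shlex


def build_submit_command(app, master="local[*]", deploy_mode=None, conf=None, app_args=()):
    """Assemble a spark-submit argument list (does not execute it)."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application and its own arguments always come last
    return cmd + [app, *app_args]


cmd = build_submit_command("example.py", conf={"spark.executor.memory": "1g"})
print(shlex.join(cmd))  # shell-quoted version, handy for logging
```

Assuming `spark-submit` is on your `PATH`, the resulting list could be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell-quoting problems with values such as `local[*]`.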


In this case, we are going to submit the same example we were working with. This application will print out the words in `lorem.txt` and the number of occurrences of each word

In [None]:
# example.py

import pyspark
from pyspark import SparkConf

if __name__ == "__main__":

    # create a Spark session from a Spark configuration
    conf = SparkConf().setAppName("Word Count - Python").setMaster("local[*]")
    session = pyspark.sql.SparkSession.builder.config(conf=conf).getOrCreate()

    # describe reading the text file, one RDD element per line
    rddDistributedFile = session.sparkContext.textFile("lorem.txt")
    rddDistributedFile = rddDistributedFile.cache()
    # split each line into words and count the occurrences of each word
    print(rddDistributedFile.flatMap(lambda text: text.split()).countByValue())

    # release the resources held by the session
    session.stop()

In this repo, you will find an `example.py` file that you can use to try submitting your application. On Windows, you can run:

`<SPARK_HOME>/bin/spark-submit.cmd example.py`

On Linux or macOS, use `<SPARK_HOME>/bin/spark-submit` instead. On Windows, if you encounter an error, you might need to add the file `winutils.exe` before running the command above. You can download the corresponding version [here](https://github.com/steveloughran/winutils).

Your directory should look like this:

```
~/
│
├── spark/
│   └── spark-3.1.2-bin-hadoop3.2  <--- SPARK_HOME
│         ├── bin
│         ├── conf
│         ├── data
│         ├── examples
│         ├── hadoop               <--- Add this new folder = HADOOP_HOME
│         │    └── bin
│         │         └── winutils.exe
... 
```

Then, you'll have to set a new environment variable `HADOOP_HOME` pointing to the `hadoop` folder

## Next steps

- Check out [`rdd.aggregate`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.aggregate.html#pyspark.RDD.aggregate) method for RDDs.
- What is the difference between `foreach` and `map`? Check [this StackOverflow answer](https://stackoverflow.com/questions/354909/is-there-a-difference-between-foreach-and-map) if in doubt
- What is the difference between `reduce` and `fold`? Check [this StackOverflow answer](https://stackoverflow.com/a/36060141/10886420). Which one is "safer" to use?
- Which operations on RDDs induce a `shuffle` and why is it a problem? See [here](https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations) for more info
- Check how to use [Hive](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html) with PySpark. What is Hive and how does it differ from SQL?
- Check out how to specify a schema programmatically (presented [in this tutorial](https://spark.apache.org/docs/latest/sql-getting-started.html#programmatically-specifying-the-schema)). What are the upsides/downsides of using it?

- Read more about multiple `SparkContext`s and `SparkSession`s and why we would need them in some... contexts. Check it [over here](https://www.waitingforcode.com/apache-spark-sql/multiple-sparksession-one-sparkcontext/read)
- What is [`rdd.meanApprox`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.meanApprox.html#pyspark.RDD.meanApprox) and why might we need it?
- Generally discouraged, but what are the options to share data between tasks and nodes in the cluster? Check out [this part of RDD tutorial](https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables)
- Check [performance tuning options for `spark.sql`](https://spark.apache.org/docs/latest/sql-performance-tuning.html). One can use them when creating `pyspark.SparkConf()` object