# Introduction

### What is PySpark?

PySpark is the Python API for Spark.

### What is Spark?

It is a “unified analytics engine for large-scale data processing”. It can be described as an *analytics factory*. Spark can process an increasingly vast amount of data by scaling out (across multiple smaller machines) instead of scaling up (adding more resources, such as CPU, RAM, and disk space, to a single machine).

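**PySpark** is expressive:
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Entry point for the DataFrame API; reuses an existing session if there is one.
spark = SparkSession.builder.getOrCreate()

(
    spark.read.csv("./data/list_of_numbers/sample.csv", header=True)
    # Flag rows above 10 with 10, everything else with 0.
    .withColumn(
        "new_column", F.when(F.col("old_column") > 10, 10).otherwise(0)
    )
    .where("old_column > 8")
    .groupby("new_column")
    .count()
    .write.csv("updated_frequencies.csv", mode="overwrite")
)
```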

### How PySpark works

![How spark works](images/how_spark_works.png)

You have some workbenches that some workers are assigned to. The workbenches are like the computers in our Spark cluster: there is a fixed number of them. The workers are called executors in Spark’s literature: they perform the actual work on the machines/nodes. One figure in the picture wears a top hat that makes him stand out from the crowd. In our data factory, he’s the manager of the work floor. In Spark terms, we call this the master. The master here sits on one of the workbenches/machines, but it can also sit on a distinct machine (or even your computer!) depending on the cluster manager and deployment mode.

Upon reception of the task, which is called a driver program in the Spark world, the factory starts running. This doesn’t mean that we get straight to processing. Before that, the cluster needs to plan the capacity it will allocate for your program. The entity or program taking care of this is aptly called the cluster manager.

Let's look at an *example*.

Contents of `sample.csv`:
```
old_column
1
4
4
5
7
7
7
10
14
1
4
8
```
In the case of computing the average, each worker independently computes the sum of its values and their count. It then sends only that intermediate result, not all the data, to a single worker (or to the master directly, when the intermediate result is really small), which combines the partial results into a single number: the average.

![Simple averaging of data](images/how_spark_works_1.png)
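The averaging scheme described above can be sketched in plain Python, without Spark. This is only a toy model under stated assumptions: the three-way contiguous split and the helper names `split_into_partitions`, `partial_aggregate`, and `combine` are hypothetical and chosen for illustration; Spark decides for itself how data is partitioned.

```python
# Toy model of distributed averaging in plain Python (no Spark involved).
values = [1, 4, 4, 5, 7, 7, 7, 10, 14, 1, 4, 8]  # old_column from sample.csv


def split_into_partitions(data, num_workers):
    """Hypothetical helper: hand each worker a contiguous slice of the rows."""
    size = -(-len(data) // num_workers)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_aggregate(partition):
    """What each worker computes locally: a (sum, count) pair."""
    return sum(partition), len(partition)


def combine(partials):
    """What the final worker (or the master) does with the small partial results."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


partitions = split_into_partitions(values, num_workers=3)
partials = [partial_aggregate(p) for p in partitions]
average = combine(partials)

print(partials)  # [(14, 4), (31, 4), (27, 4)]
print(average)   # 6.0
```

Only three small `(sum, count)` pairs travel between workers, rather than all twelve rows. Note that averaging the per-worker averages would be wrong in general when partitions have different sizes, which is why the sums and counts are carried separately.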

### Laziness: A Spark feature

Laziness, meaning that Spark postpones the actual work until a result is requested, is (in part) how Spark achieves its incredible processing speed. Just like in a large-scale factory, you don’t go to each employee and give them a list of tasks. No, here, the master/manager is responsible for the workers. The driver is where the action happens. Think of a driver as a floor lead: you provide them your list of steps and let them deal with it. In Spark, the driver/floor lead takes your instructions (carefully written in Python code), translates them into Spark steps, and then processes them across the workers. The driver also manages which worker/table has which slice of the data, and makes sure you don’t lose some bits in the process.

## Summary

- The master is like the factory owner, allocating resources as needed to complete the jobs.
- The driver is responsible for completing a given job. It requests resources from the master as needed.
- A worker is a set of computing/memory resources, like a workbench in our factory.
- Executors sit atop a worker and perform the work sent by the driver, like employees at a workbench.

Every instruction you provide to Spark falls into one of two categories: transformations and actions. Actions are what many programming languages would consider I/O. The most typical actions are the following:
- Printing information on the screen
- Writing data to a hard drive or cloud bucket
- Counting the number of records

In Spark, we’ll see those instructions most often via the `show()` and `count()` methods and the `write` attribute of a data frame. Transformations are pretty much everything else. Some examples of transformations are as follows:
- Adding a column to a table
- Performing an aggregation according to certain keys
- Computing summary statistics
- Training a machine learning model
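
To make the split between transformations and actions more concrete, here is a toy sketch in plain Python. It is not how Spark is implemented; the `LazyTable` class and its `where`, `with_column`, `count`, and `collect` methods are hypothetical names invented for this illustration. The idea it demonstrates is the one described above: transformations are only recorded, and nothing runs until an action asks for a result.

```python
class LazyTable:
    """Hypothetical toy class: records transformations, runs them on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # pending (kind, function) steps, in order

    # --- Transformations: return a new LazyTable without touching the data ---
    def where(self, predicate):
        return LazyTable(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, func):
        def add_column(row):
            return {**row, name: func(row)}
        return LazyTable(self._rows, self._plan + (("map", add_column),))

    # --- Actions: execute the recorded plan and return a result ---
    def _execute(self):
        rows = iter(self._rows)
        for kind, func in self._plan:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)

    def count(self):
        return len(self._execute())

    def collect(self):
        return self._execute()


calls = []  # records every row the column function actually processes


def capped(row):
    calls.append(row["old_column"])
    return 10 if row["old_column"] > 10 else 0


data = [{"old_column": v} for v in [1, 4, 4, 5, 7, 7, 7, 10, 14, 1, 4, 8]]
pipeline = (
    LazyTable(data)
    .where(lambda row: row["old_column"] > 8)
    .with_column("new_column", capped)
)

calls_before_action = len(calls)  # 0: only a plan exists so far
n = pipeline.count()              # the action triggers the work

print(calls_before_action)  # 0
print(n)                    # 2
print(calls)                # [10, 14]
```

Because the plan is known in full before anything runs, the column function only ever sees the two rows that survive the filter. Deferring work this way is what lets an engine look at the whole chain of steps before deciding how to run it.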