# **Introduction to PySpark**

## Spark
Spark is a platform that makes it easier to work with large datasets by spreading the data and computations across **clusters** made up of multiple **nodes**. Each node is like a small computer that works on only a subset of the data, so the dataset as a whole can be processed in **parallel**.
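
The idea of splitting data across workers and combining partial results can be sketched in plain Python. This is only an analogy, not Spark itself: the `partition` and `parallel_sum` helpers are hypothetical names, and threads on a single machine stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split a list into n_parts roughly equal, contiguous chunks."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks get one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, n_workers=4):
    """Sum each chunk on its own worker, then combine the partial sums."""
    chunks = partition(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


numbers = list(range(1_000))
total = parallel_sum(numbers)
```

Each worker only ever sees its own chunk, which is roughly how a Spark node only sees its own partition of the data.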

### Spark in Python
A Spark cluster consists of one **master** node, which controls multiple **worker** nodes. In practice, the cluster is hosted on a remote machine (e.g. in Azure or Databricks).

A connection to the Spark cluster is created with the `SparkContext` class.


In [1]:
from pyspark import SparkContext

sc = SparkContext()

print(sc)
print(sc.version)

<SparkContext master=local[*] appName=pyspark-shell>
3.5.0


### DataFrames
The core data structure in Spark is the **Resilient Distributed Dataset (RDD)**, a low-level object that lets Spark distribute data across the nodes in the cluster. **DataFrames** are an abstraction built on top of RDDs that makes the data easier to work with. For complex operations they can even be faster than RDDs.

Spark DataFrames are similar to pandas DataFrames, especially in their syntax. There are some key differences, though.
|1|2|
|-|-|
|2|3|

Estimate pi on the Spark cluster with a Monte Carlo simulation:

In [1]:
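import random
from pyspark import SparkContext

# Reuse the running SparkContext if there is one; calling SparkContext()
# a second time in the same session raises an error.
sc = SparkContext.getOrCreate()

def inside(_):
    # Draw a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

n_samples = 100000000
# Distribute the sample indices over the cluster and count the hits.
count = sc.parallelize(range(0, n_samples)).filter(inside).count()
# The hit ratio approximates pi/4.
pi = 4 * count / n_samples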

print("Pi is roughly ", pi)

Pi is roughly  3.1415276
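
The factor of 4 follows from a simple area argument. Each sample $(x, y)$ is drawn uniformly from the unit square, and `inside` keeps it when it falls within the quarter of the unit circle that lies in that square. Writing $k$ for `count` and $n$ for `n_samples`:

$$
P(x^2 + y^2 < 1) = \frac{\text{area of quarter circle}}{\text{area of unit square}} = \frac{\pi/4}{1}
\quad\Longrightarrow\quad
\frac{k}{n} \approx \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx \frac{4k}{n}
$$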


In [2]:
# Verify SparkContext
print(sc)

# Print Spark version
print(sc.version)

<SparkContext master=local[*] appName=pyspark-shell>
3.5.0
