## Resilient Distributed Datasets (RDDs)

RDDs are the building blocks of Spark. They are useful for processing unstructured data, such as text or images.

### RDDs have three main properties:

1. **Resilient**: RDDs are fault-tolerant, meaning that Spark can recompute data lost to node failures from the recorded lineage of transformations.
2. **Distributed**: The data in an RDD is split into partitions that are spread across multiple nodes in a cluster.
3. **Operated in parallel**: The partitions of an RDD can be processed in parallel.

## Start Coding with PySpark

`SparkSession` is the entry point to Spark SQL.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

`sparkContext` within `SparkSession` is the connection to the Spark cluster and can be used to create and transform RDDs.

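We can create an RDD from a local in-memory collection, such as a Python list, using `sparkContext.parallelize()`. An optional second argument (`numSlices`) specifies the number of partitions to split the data into. If it is omitted, Spark uses its default parallelism, which in local mode is the number of cores on the machine.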

In [None]:
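# dataset_name stands for any local collection; the list below is sample data
dataset_name = [1, 2, 3, 4, 5]
# default setting: no partition argument
rdd_par = spark.sparkContext.parallelize(dataset_name)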

If we are working with external data, we can use `sparkContext.textFile()` to create an RDD from a text file.

In [None]:
# request at least 10 partitions (the second argument is minPartitions)
rdd_txt = spark.sparkContext.textFile("file_name.txt", 10)

Verify the number of partitions using `getNumPartitions()`.

In [None]:
rdd_txt.getNumPartitions()
# output: 10 (typical; minPartitions is a suggested minimum, so the count can differ)

To end the Spark session, use `spark.stop()`.

In [None]:
spark.stop()