## From RDDs to DataFrames in Spark

Spark's core data structure is the Resilient Distributed Dataset (RDD), a low-level object that distributes data and computation across the nodes of a Spark cluster. Because RDDs can be difficult to manipulate directly, Spark also provides a DataFrame abstraction built on top of them.

The Spark DataFrame has some advantages over an RDD:
<ul>
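    <li>Simpler to understand: a Spark DataFrame organises data into named columns, much like a SQL table.</li>
    <li>Optimised for more complex operations: DataFrame queries are planned by Spark's Catalyst optimizer before they run.</li>
    <li>The optimisation works 'out-of-the-box': it is applied automatically, with no extra code required.</li>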
</ul>

A SparkSession object is the entry point to all Spark SQL functionality. Spark SQL is the Spark module that enables working with Spark DataFrames. There are two ways to create a SparkSession object. The first uses the SparkSession builder. The second, more elaborate way (which starts from the lower-level object used for RDD work) creates the SparkSession from an existing SparkContext. In the second case, the SparkContext is the connection to the cluster, while the SparkSession provides an interface to that cluster.

In [1]:

from pyspark import SparkConf, SparkContext

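# Run Spark locally and give the application a name
conf = SparkConf().setMaster("local").setAppName("SparkSession2")
# The SparkContext is the connection to the cluster
sc = SparkContext(conf=conf)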

22/05/03 20:41:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [2]:

# Create SparkSession using a SparkContext object
from pyspark.sql import SparkSession

spark = SparkSession(sc)

In [3]:
spark