### Resilient Distributed Dataset (RDD)
RDDs have been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable, distributed collection of data elements, partitioned across the nodes in your cluster, that can be operated on in parallel through a low-level API offering transformations and actions.

Consider using RDDs when:

* you want low-level transformations and actions, and fine-grained control over your dataset;
* your data is unstructured, such as media streams or streams of text;
* you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
* you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
* you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

## Spark SQL and DataFrames

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API. When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.

### SQL
One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section. When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.

### Datasets and DataFrames
A Dataset is a distributed collection of data. Dataset is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python’s dynamic nature, many of its benefits are already available (for example, you can access a field of a row by name with `row.columnName`). The case for R is similar.

__A DataFrame is a Dataset organized into named columns.__ It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of `Dataset[Row]`, whereas in the Java API users need to use `Dataset<Row>` to represent a DataFrame.

Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.

source: https://spark.apache.org/docs/latest/sql-programming-guide.html

**Additional Reading:** [A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

In [3]:
# Similar to SparkContext, for SparkSQL you need a SparkSession
from pyspark.sql import SparkSession
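# pyspark.sql.functions holds column functions such as col, avg and count;
# select, where and groupBy are DataFrame methods and need no import.
# Note: the wildcard import shadows Python built-ins such as sum, min and max.
from pyspark.sql.functions import *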

In [4]:
# instantiate spark session
spark = SparkSession.builder.getOrCreate()

### Loading Data to DataFrame
Once you have a Spark session, you can read CSV, JSON, or Parquet files. Let's start by reading our movie ratings CSV data.

In [6]:
ratings_df = spark.read.csv("/FileStore/tables/movielens/ratings.csv", header=True)

In [7]:
# You can use show(n) to take a look into the dataframe
ratings_df.show(5)

In [8]:
# as we have many columns in our table, the result above is a bit messy! We can use printSchema() to print out the schema of our table
ratings_df.printSchema()

In [9]:
# Without inferSchema every CSV column is read as a string; inferSchema=True makes Spark scan the data to infer column types
ratings_df = spark.read.csv("/FileStore/tables/movielens/ratings.csv", header=True, inferSchema=True)

In [10]:
# inside databricks, we can use display() to have a nicer view of our dataframe
display(ratings_df)

### DataFrame Operations
DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.

As mentioned above, in Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.

Here are some basic examples of structured data processing using DataFrames:

In [12]:
# Select only the "movieId" and "rating" columns
display(ratings_df.select(['movieId','rating']))

In [13]:
# Count the number of ratings per movie
display(ratings_df.groupBy("movieId").count())

### Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

In [15]:
# Register the DataFrame as a SQL temporary view
ratings_df.createOrReplaceTempView("ratings_df")

In [16]:
# Write the query in SQL
sql_ratings_df = spark.sql("SELECT movieId, COUNT(*) FROM ratings_df GROUP BY movieId")

In [17]:
# display the results
display(sql_ratings_df)

In [18]:
# Also, remember that you can convert a DataFrame back to an RDD (of Row objects) at any time
ratings_rdd = ratings_df.rdd
ratings_rdd.take(2)

In [19]:
# Each element is a Row, so its fields can be accessed by name:
ratings_rdd.map(lambda row: (row.movieId, row.rating)).take(10)

#### For this part of the course, I will stick with Spark's DataFrame operations. I find them a little more readable than SQL statements.
### Let's do more complicated DataFrame operations in the next notebook!