# Spark DataFrame Basics

Spark DataFrames are the workhorse and the main way of working with Spark and Python since Spark 2.0. DataFrames act as powerful versions of tables, with rows and columns, and easily handle large datasets. The shift to DataFrames provides many advantages:
* **A much simpler syntax**
* **Ability to use SQL directly in the dataframe**
* **Operations are automatically distributed across RDDs**
    
If you've used R or the pandas library with Python, you are probably already familiar with the concept of DataFrames. Spark DataFrames expand on a lot of these concepts, so you can transfer that knowledge easily once you understand the simple syntax of Spark DataFrames. Remember that the main advantage of using Spark DataFrames over those other tools is that **Spark can handle data distributed across many partitions and machines, huge data sets that would never fit on a single computer**. That comes at a slight cost of some "peculiar" syntax choices, but after this course you will feel very comfortable with all those topics!

Let's get started!

## Creating a DataFrame

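In a Databricks notebook a SparkSession is already available as `spark`, so we can read a JSON file straight into a DataFrame:

In [0]:
df = spark.read.json("dbfs:/databricks-datasets/structured-streaming/events/file-0.json")  # path on the Databricks File System (DBFS)

Then check the type of the result: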

In [0]:
# dbutils.fs.ls ("FileStore/tables")
type(df)

Out[2]: pyspark.sql.dataframe.DataFrame

You will always need to get the data from a file first (or connect to a large distributed file system like HDFS; we'll talk about this later once we move to larger datasets on AWS EC2).

### Showing the data

In [0]:
df.show()

+------+----------+
|action|      time|
+------+----------+
|  Open|1469501107|
|  Open|1469501147|
|  Open|1469501202|
|  Open|1469501219|
|  Open|1469501225|
|  Open|1469501234|
|  Open|1469501245|
|  Open|1469501246|
|  Open|1469501248|
|  Open|1469501256|
|  Open|1469501264|
|  Open|1469501266|
|  Open|1469501267|
|  Open|1469501269|
|  Open|1469501271|
|  Open|1469501282|
|  Open|1469501285|
|  Open|1469501291|
|  Open|1469501297|
|  Open|1469501303|
+------+----------+
only showing top 20 rows



In [0]:
df.printSchema()

root
 |-- action: string (nullable = true)
 |-- time: long (nullable = true)



In [0]:
df.columns

Out[5]: ['action', 'time']

In [0]:
df.describe()

Out[6]: DataFrame[summary: string, action: string, time: string]

Some data formats make it easier to infer a schema (for example tabular formats such as CSV, which we will show later).

However, you often have to set the schema yourself when you are using a .read method that has no schema inference (the `inferSchema` option) built in.

Spark has all the tools you need for this; it just requires a very specific structure:

In [0]:
from pyspark.sql.types import StructField, StringType, LongType, StructType  # we no longer infer the data types; we state them explicitly

Next we need to create the list of `StructField` objects. Each one takes three parameters:
* `name`: string, name of the field.
* `dataType`: the `DataType` of the field.
* `nullable`: boolean, whether the field can be null (None) or not.

In [0]:
data_schema = [StructField("time", LongType(), True),
              StructField("action", StringType(), True)]

In [0]:
final_struc = StructType(fields=data_schema)  # pass the list as the fields argument; the StructType is the final schema object

In [0]:
df = spark.read.json("dbfs:/databricks-datasets/structured-streaming/events/file-0.json", schema = final_struc)

In [0]:
df.printSchema()  # with an explicit schema Spark neither spends time inferring types nor has to guess

root
 |-- time: long (nullable = true)
 |-- action: string (nullable = true)



### Grabbing the data

In [0]:
df["time"]

Out[12]: Column<'time'>

In [0]:
type(df["time"])

Out[13]: pyspark.sql.column.Column

In [0]:
df.select("time")

Out[14]: DataFrame[time: bigint]

In [0]:
type(df.select("time"))

Out[15]: pyspark.sql.dataframe.DataFrame

In [0]:
df.select("time").show()  # shows a new DataFrame

+----------+
|      time|
+----------+
|1469501107|
|1469501147|
|1469501202|
|1469501219|
|1469501225|
|1469501234|
|1469501245|
|1469501246|
|1469501248|
|1469501256|
|1469501264|
|1469501266|
|1469501267|
|1469501269|
|1469501271|
|1469501282|
|1469501285|
|1469501291|
|1469501297|
|1469501303|
+----------+
only showing top 20 rows



In [0]:
df.head(2)

Out[17]: [Row(time=1469501107, action='Open'), Row(time=1469501147, action='Open')]

In [0]:
df.select(df.columns).show()

+----------+------+
|      time|action|
+----------+------+
|1469501107|  Open|
|1469501147|  Open|
|1469501202|  Open|
|1469501219|  Open|
|1469501225|  Open|
|1469501234|  Open|
|1469501245|  Open|
|1469501246|  Open|
|1469501248|  Open|
|1469501256|  Open|
|1469501264|  Open|
|1469501266|  Open|
|1469501267|  Open|
|1469501269|  Open|
|1469501271|  Open|
|1469501282|  Open|
|1469501285|  Open|
|1469501291|  Open|
|1469501297|  Open|
|1469501303|  Open|
+----------+------+
only showing top 20 rows



Multiple Columns:

In [0]:
df.select("time", "action")

Out[19]: DataFrame[time: bigint, action: string]

In [0]:
df.select("time", "action").show()

+----------+------+
|      time|action|
+----------+------+
|1469501107|  Open|
|1469501147|  Open|
|1469501202|  Open|
|1469501219|  Open|
|1469501225|  Open|
|1469501234|  Open|
|1469501245|  Open|
|1469501246|  Open|
|1469501248|  Open|
|1469501256|  Open|
|1469501264|  Open|
|1469501266|  Open|
|1469501267|  Open|
|1469501269|  Open|
|1469501271|  Open|
|1469501282|  Open|
|1469501285|  Open|
|1469501291|  Open|
|1469501297|  Open|
|1469501303|  Open|
+----------+------+
only showing top 20 rows



### Creating new columns

In [0]:
df.withColumn("newtime", df["time"] + 5).show()

+----------+------+----------+
|      time|action|   newtime|
+----------+------+----------+
|1469501107|  Open|1469501112|
|1469501147|  Open|1469501152|
|1469501202|  Open|1469501207|
|1469501219|  Open|1469501224|
|1469501225|  Open|1469501230|
|1469501234|  Open|1469501239|
|1469501245|  Open|1469501250|
|1469501246|  Open|1469501251|
|1469501248|  Open|1469501253|
|1469501256|  Open|1469501261|
|1469501264|  Open|1469501269|
|1469501266|  Open|1469501271|
|1469501267|  Open|1469501272|
|1469501269|  Open|1469501274|
|1469501271|  Open|1469501276|
|1469501282|  Open|1469501287|
|1469501285|  Open|1469501290|
|1469501291|  Open|1469501296|
|1469501297|  Open|1469501302|
|1469501303|  Open|1469501308|
+----------+------+----------+
only showing top 20 rows



In [0]:
df.show()

+----------+------+
|      time|action|
+----------+------+
|1469501107|  Open|
|1469501147|  Open|
|1469501202|  Open|
|1469501219|  Open|
|1469501225|  Open|
|1469501234|  Open|
|1469501245|  Open|
|1469501246|  Open|
|1469501248|  Open|
|1469501256|  Open|
|1469501264|  Open|
|1469501266|  Open|
|1469501267|  Open|
|1469501269|  Open|
|1469501271|  Open|
|1469501282|  Open|
|1469501285|  Open|
|1469501291|  Open|
|1469501297|  Open|
|1469501303|  Open|
+----------+------+
only showing top 20 rows



In [0]:
df.withColumnRenamed("action", "superaction").show()

+----------+-----------+
|      time|superaction|
+----------+-----------+
|1469501107|       Open|
|1469501147|       Open|
|1469501202|       Open|
|1469501219|       Open|
|1469501225|       Open|
|1469501234|       Open|
|1469501245|       Open|
|1469501246|       Open|
|1469501248|       Open|
|1469501256|       Open|
|1469501264|       Open|
|1469501266|       Open|
|1469501267|       Open|
|1469501269|       Open|
|1469501271|       Open|
|1469501282|       Open|
|1469501285|       Open|
|1469501291|       Open|
|1469501297|       Open|
|1469501303|       Open|
+----------+-----------+
only showing top 20 rows



More complicated operations to create new columns:

In [0]:
df.withColumn("doubletime", df["time"]*2).show(5)

+----------+------+----------+
|      time|action|doubletime|
+----------+------+----------+
|1469501107|  Open|2939002214|
|1469501147|  Open|2939002294|
|1469501202|  Open|2939002404|
|1469501219|  Open|2939002438|
|1469501225|  Open|2939002450|
+----------+------+----------+
only showing top 5 rows



In [0]:
df.withColumn("add_one_time", df["time"] + 1).show()

+----------+------+------------+
|      time|action|add_one_time|
+----------+------+------------+
|1469501107|  Open|  1469501108|
|1469501147|  Open|  1469501148|
|1469501202|  Open|  1469501203|
|1469501219|  Open|  1469501220|
|1469501225|  Open|  1469501226|
|1469501234|  Open|  1469501235|
|1469501245|  Open|  1469501246|
|1469501246|  Open|  1469501247|
|1469501248|  Open|  1469501249|
|1469501256|  Open|  1469501257|
|1469501264|  Open|  1469501265|
|1469501266|  Open|  1469501267|
|1469501267|  Open|  1469501268|
|1469501269|  Open|  1469501270|
|1469501271|  Open|  1469501272|
|1469501282|  Open|  1469501283|
|1469501285|  Open|  1469501286|
|1469501291|  Open|  1469501292|
|1469501297|  Open|  1469501298|
|1469501303|  Open|  1469501304|
+----------+------+------------+
only showing top 20 rows



In [0]:
df.withColumn("half_time", df["time"]/2).show(5)

+----------+------+-------------+
|      time|action|    half_time|
+----------+------+-------------+
|1469501107|  Open|7.347505535E8|
|1469501147|  Open|7.347505735E8|
|1469501202|  Open| 7.34750601E8|
|1469501219|  Open|7.347506095E8|
|1469501225|  Open|7.347506125E8|
+----------+------+-------------+
only showing top 5 rows



In [0]:
df.withColumn("half_time", df["time"]/2)

Out[32]: DataFrame[time: bigint, action: string, half_time: double]

We'll discuss much more complicated operations later on!

### Using SQL

To use SQL queries directly with the DataFrame, you will need to register it as a temporary view:

In [0]:
df.createOrReplaceTempView("IoT")

In [0]:
sql_results = spark.sql("SELECT * FROM IoT")

In [0]:
sql_results.show()

+----------+------+
|      time|action|
+----------+------+
|1469501107|  Open|
|1469501147|  Open|
|1469501202|  Open|
|1469501219|  Open|
|1469501225|  Open|
|1469501234|  Open|
|1469501245|  Open|
|1469501246|  Open|
|1469501248|  Open|
|1469501256|  Open|
|1469501264|  Open|
|1469501266|  Open|
|1469501267|  Open|
|1469501269|  Open|
|1469501271|  Open|
|1469501282|  Open|
|1469501285|  Open|
|1469501291|  Open|
|1469501297|  Open|
|1469501303|  Open|
+----------+------+
only showing top 20 rows



We won't really be focusing on the SQL syntax in this course, but keep in mind it is always there to get you out of a bind quickly with your SQL skills!

Alright, that is all we need to know for now!