# DataFrames
To run PySpark in a Jupyter notebook, Python first needs to locate the PySpark installation. The findspark package does this. Because it is a third-party package, install it before use (for example, with pip install findspark).

In [1]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()

## Create an Empty DataFrame
### Create Empty RDD in PySpark
Create an empty RDD by calling emptyRDD() on the SparkContext, for example spark.sparkContext.emptyRDD().<br>
Alternatively, you can get an empty RDD by calling spark.sparkContext.parallelize([]).<br>
Note: Some operations fail on an empty RDD. For example, first() raises ValueError("RDD is empty").

In [2]:
#Creates Empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

EmptyRDD[0] at emptyRDD at NativeMethodAccessorImpl.java:0


In [3]:
#Creates Empty RDD using parallelize
rdd2= spark.sparkContext.parallelize([])
print(rdd2)

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:262


### Create Empty DataFrame with Schema (StructType)
To create an empty PySpark DataFrame manually with a schema (column names and data types), first create the schema using StructType and StructField.

In [4]:
#Create Schema
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

Now use the empty RDD created above and pass it to createDataFrame() of SparkSession along with the schema for column names & data types.

In [5]:
#Create empty DataFrame from empty RDD
df = spark.createDataFrame(emptyRDD,schema)
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



### Convert Empty RDD to DataFrame
You can also create an empty DataFrame by converting an empty RDD to a DataFrame using toDF().

In [6]:
#Convert empty RDD to Dataframe
df1 = emptyRDD.toDF(schema)
df1.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



### Create Empty DataFrame with Schema
We can also create an empty DataFrame with a schema directly, without going through an RDD.

In [7]:
#Create empty DataFrame directly.
df2 = spark.createDataFrame([], schema)
df2.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



### Create Empty DataFrame without Schema (no columns)

In [8]:
#Create empty DataFrame with no schema (no columns)
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()

root



## Convert PySpark RDD to DataFrame
In PySpark, the toDF() function of an RDD converts the RDD to a DataFrame. Converting is worthwhile because a DataFrame offers advantages over an RDD: it is a distributed collection of data organized into named columns, similar to a database table, and it benefits from Spark's query optimization and the resulting performance improvements.
### Create PySpark RDD

In [9]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]
rdd = spark.sparkContext.parallelize(dept)

### Convert PySpark RDD to DataFrame
A PySpark RDD can be converted to a DataFrame using either toDF() or createDataFrame().
#### Using rdd.toDF() function
The RDD toDF() function converts an RDD into a DataFrame. By default, toDF() names the columns “_1”, “_2”, and so on, in positional order.

In [10]:
df = rdd.toDF()
df.printSchema()
df.show(truncate=False)

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|_1       |_2 |
+---------+---+
|Finance  |10 |
|Marketing|20 |
|Sales    |30 |
|IT       |40 |
+---------+---+



toDF() also accepts a list of column names as an argument.

In [11]:
deptColumns = ["dept_name","dept_id"]
df2 = rdd.toDF(deptColumns)
df2.printSchema()
df2.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



#### Using PySpark createDataFrame() function
The SparkSession class provides the createDataFrame() method, which accepts an RDD as its data argument and returns a DataFrame.

In [12]:
deptDF = spark.createDataFrame(rdd, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



##### Using createDataFrame() with StructType schema
When the schema is inferred, the data type of each column is derived from the data and nullable is set to true for all columns. We can change this behavior by supplying a schema using StructType, where we specify a column name, data type, and nullable flag for each field.

In [13]:
from pyspark.sql.types import StructType, StructField, StringType
deptSchema = StructType([       
    StructField('dept_name', StringType(), True),
    StructField('dept_id', StringType(), True)
])

deptDF1 = spark.createDataFrame(rdd, schema = deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: string (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



## Convert PySpark DataFrame to Pandas
A PySpark DataFrame can be converted to a Python pandas DataFrame using the toPandas() function.<br>
pandas runs operations on a single machine, whereas PySpark distributes work across multiple cores and machines. For datasets that are too large to process comfortably on one node, PySpark is therefore typically faster than pandas.<br>
This matters, for example, in Machine Learning applications that deal with large datasets.<br>
After processing data in PySpark, we often need to convert the result back to a pandas DataFrame for further processing in a Machine Learning application or any other Python application.
### Prepare PySpark DataFrame

In [15]:
data = [("James","","Smith","36636","M",60000),
        ("Michael","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0)]

columns = ["first_name","middle_name","last_name","dob","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob  |gender|salary|
+----------+-----------+---------+-----+------+------+
|James     |           |Smith    |36636|M     |60000 |
|Michael   |Rose       |         |40288|M     |70000 |
|Robert    |           |Williams |42114|      |400000|
|Maria     |Anne       |Jones    |39192|F     |500000|
|Jen       |Mary       |Brown    |     |F     |0     |
+----------+-----------+---------+-----+------+------+



### Convert PySpark Dataframe to Pandas DataFrame
PySpark DataFrame provides the toPandas() method to convert it to a Python pandas DataFrame. toPandas() collects all records of the PySpark DataFrame to the driver program, so it should be used only on a small subset of the data. Running it on a large dataset can cause an out-of-memory error and crash the application. To handle a larger dataset, you can also try increasing the memory available to the driver.

In [16]:
pandasDF = pysparkDF.toPandas()
print(pandasDF)

  first_name middle_name last_name    dob gender  salary
0      James                 Smith  36636      M   60000
1    Michael        Rose            40288      M   70000
2     Robert              Williams  42114         400000
3      Maria        Anne     Jones  39192      F  500000
4        Jen        Mary     Brown             F       0


Note that pandas adds a sequence number to the result as the row index. You can rename pandas columns by using the rename() function.
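
A small sketch of rename() follows. To keep it runnable without a Spark session, it builds a stand-in pandas DataFrame directly instead of calling toPandas(); the data and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the result of toPandas(), built directly so the
# snippet runs without a Spark session.
pandas_df = pd.DataFrame({
    "first_name": ["James", "Michael"],
    "salary": [60000, 70000],
})

# rename() returns a new DataFrame; the original keeps its column names.
renamed = pandas_df.rename(columns={"first_name": "firstName", "salary": "annual_salary"})

print(list(renamed.columns))
print(list(renamed.index))
```

Because rename() returns a copy by default, assign the result to a variable to keep the new names.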

### Convert Spark Nested Struct DataFrame to Pandas 
Data in a PySpark DataFrame is often nested, meaning that one column contains other columns, so it is worth seeing how such a DataFrame converts to pandas. In the following example, firstname, middlename, and lastname are fields of the name struct column.

In [17]:
# Nested structure elements
from pyspark.sql.types import StructType, StructField, StringType
dataStruct = [(("James","","Smith"),"36636","M","3000"),
      (("Michael","Rose",""),"40288","M","4000"),
      (("Robert","","Williams"),"42114","M","4000"),
      (("Maria","Anne","Jones"),"39192","F","4000"),
      (("Jen","Mary","Brown"),"","F","-1")
]

schemaStruct = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
          StructField('dob', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', StringType(), True)
         ])

df = spark.createDataFrame(data=dataStruct, schema = schemaStruct)
df.printSchema()

pandasDF2 = df.toPandas()
print(pandasDF2)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: string (nullable = true)

                   name    dob gender salary
0      (James, , Smith)  36636      M   3000
1     (Michael, Rose, )  40288      M   4000
2  (Robert, , Williams)  42114      M   4000
3  (Maria, Anne, Jones)  39192      F   4000
4    (Jen, Mary, Brown)             F     -1


## PySpark show()
PySpark DataFrame show() is used to display the contents of the DataFrame in a Table Row and Column Format. By default, it shows only 20 Rows, and the column values are truncated at 20 characters.

In [18]:
# Default - displays 20 rows and
# 20 characters from each column value
df.show()

#Display full column contents
df.show(truncate=False)

# Display 2 rows and full column contents
df.show(2,truncate=False) 

# Display 2 rows & column values 25 characters
df.show(2,truncate=25) 

# Display DataFrame rows & columns vertically
df.show(n=3,truncate=25,vertical=True)

+--------------------+-----+------+------+
|                name|  dob|gender|salary|
+--------------------+-----+------+------+
|    [James, , Smith]|36636|     M|  3000|
|   [Michael, Rose, ]|40288|     M|  4000|
|[Robert, , Williams]|42114|     M|  4000|
|[Maria, Anne, Jones]|39192|     F|  4000|
|  [Jen, Mary, Brown]|     |     F|    -1|
+--------------------+-----+------+------+

+--------------------+-----+------+------+
|name                |dob  |gender|salary|
+--------------------+-----+------+------+
|[James, , Smith]    |36636|M     |3000  |
|[Michael, Rose, ]   |40288|M     |4000  |
|[Robert, , Williams]|42114|M     |4000  |
|[Maria, Anne, Jones]|39192|F     |4000  |
|[Jen, Mary, Brown]  |     |F     |-1    |
+--------------------+-----+------+------+

+-----------------+-----+------+------+
|name             |dob  |gender|salary|
+-----------------+-----+------+------+
|[James, , Smith] |36636|M     |3000  |
|[Michael, Rose, ]|40288|M     |4000  |
+-----------------+-----+------+------+

### show() Syntax
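The signature of show() is given below. n is the number of rows to display, truncate is either a boolean or a maximum number of characters per column value, and vertical prints each row as a separate record.
def show(self, n=20, truncate=True, vertical=False):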

### PySpark show() To Display Contents
Use the show() method to display the contents of a DataFrame and the printSchema() method to print its schema. By default, show() displays only the first 20 rows and truncates column values to 20 characters.


In [19]:
columns = ["Seqno","Quote"]
data = [("1", "Be the change that you wish to see in the world"),
    ("2", "Everyone thinks of changing the world, but no one thinks of changing himself."),
    ("3", "The purpose of our lives is to be happy."),
    ("4", "Be cool.")]
df = spark.createDataFrame(data,columns)
df.show()

+-----+--------------------+
|Seqno|               Quote|
+-----+--------------------+
|    1|Be the change tha...|
|    2|Everyone thinks o...|
|    3|The purpose of ou...|
|    4|            Be cool.|
+-----+--------------------+



The values in the Quote column are truncated at 20 characters.

In [20]:
#Display full column contents
df.show(truncate=False)

+-----+-----------------------------------------------------------------------------+
|Seqno|Quote                                                                        |
+-----+-----------------------------------------------------------------------------+
|1    |Be the change that you wish to see in the world                              |
|2    |Everyone thinks of changing the world, but no one thinks of changing himself.|
|3    |The purpose of our lives is to be happy.                                     |
|4    |Be cool.                                                                     |
+-----+-----------------------------------------------------------------------------+



In [21]:
# Display 2 rows and full column contents
df.show(2,truncate=False) 

+-----+-----------------------------------------------------------------------------+
|Seqno|Quote                                                                        |
+-----+-----------------------------------------------------------------------------+
|1    |Be the change that you wish to see in the world                              |
|2    |Everyone thinks of changing the world, but no one thinks of changing himself.|
+-----+-----------------------------------------------------------------------------+
only showing top 2 rows



### Show() with Truncate Column Values
You can also truncate column values at a chosen length. By default, show() truncates values longer than 20 characters; use truncate=False to display the full contents, or truncate=n to truncate at n characters.

In [22]:
# Display 2 rows & column values 25 characters
df.show(2,truncate=25) 

+-----+-------------------------+
|Seqno|                    Quote|
+-----+-------------------------+
|    1|Be the change that you...|
|    2|Everyone thinks of cha...|
+-----+-------------------------+
only showing top 2 rows



### Display Contents Vertically
Finally, let’s see how to display the DataFrame vertically record by record.

In [24]:
# Display DataFrame rows & columns vertically
df.show(n=3,truncate=25,vertical=True)

-RECORD 0--------------------------
 Seqno | 1                         
 Quote | Be the change that you... 
-RECORD 1--------------------------
 Seqno | 2                         
 Quote | Everyone thinks of cha... 
-RECORD 2--------------------------
 Seqno | 3                         
 Quote | The purpose of our liv... 
only showing top 3 rows

