In [None]:
When processing files, an expected file sometimes does not arrive, yet we still need a DataFrame with the schema we expect, 
so we have to create it manually. If the schema differs, later operations and transformations on the DataFrame (such as unions) 
fail because they refer to columns that may not be present.

To handle situations like these, we always create a DataFrame with the same schema, meaning the same column names 
and data types, regardless of whether the file exists or is empty.

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Creates Empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

#Creates Empty RDD using parallelize
rdd2 = spark.sparkContext.parallelize([])
print(rdd2)

#Some actions on an empty RDD, such as first(), raise ValueError("RDD is empty").

EmptyRDD[0] at emptyRDD at NativeMethodAccessorImpl.java:0
ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274


In [2]:
#Create Empty DataFrame with Schema (StructType)
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True)
])
#Create empty DataFrame from empty RDD
df = spark.createDataFrame(emptyRDD, schema)
df.printSchema()


root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



In [3]:
#Convert Empty RDD to DataFrame
df1 = emptyRDD.toDF(schema)
df1.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



In [4]:
#Create empty DataFrame directly.
df2 = spark.createDataFrame([], schema)
df2.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



In [5]:
#Create empty DataFrame with no schema (no columns)
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()

root

