# DataFrames

In [35]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dataframes").getOrCreate()


25/08/08 15:52:31 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


# Create an Empty DataFrame & RDD

Create an empty PySpark DataFrame/RDD manually, with or without a schema (column names).
While working with files, we sometimes do not receive a file for processing, yet we still need to create a DataFrame manually with the schema we expect. If we don’t create it with the same schema, operations and transformations on the DataFrame (such as unions) fail because they refer to columns that may not be present.

To handle situations like these, we always need to create a DataFrame with the same schema, meaning the same column names and data types, regardless of whether the file exists or is empty.

### Create Empty RDD in PySpark

In [36]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

#Creates Empty RDD
emptyRDD = spark.sparkContext.emptyRDD()
print(emptyRDD)

#Displays
#EmptyRDD[188] at emptyRDD

# Alternatively you can also get empty RDD by using spark.sparkContext.parallelize([]).

#Creates Empty RDD using parallelize
rdd2= spark.sparkContext.parallelize([])
print(rdd2)

EmptyRDD[216] at emptyRDD at NativeMethodAccessorImpl.java:0
ParallelCollectionRDD[217] at readRDDFromFile at PythonRDD.scala:297




### Create Empty DataFrame with Schema (StructType)
Create a schema using StructType and StructField.

In [37]:
#Create Schema
from pyspark.sql.types import StructType,StructField, StringType
schema = StructType([
  StructField('firstname', StringType(), True),
  StructField('middlename', StringType(), True),
  StructField('lastname', StringType(), True)
  ])

Now use the empty RDD created above and pass it to createDataFrame() of SparkSession along with the schema for column names & data types.

In [38]:
#Create empty DataFrame from empty RDD
df = spark.createDataFrame(emptyRDD,schema)
df.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



### Convert Empty RDD to DataFrame
You can also create an empty DataFrame by converting an empty RDD to a DataFrame using toDF().

In [39]:
#Convert empty RDD to Dataframe
df1 = emptyRDD.toDF(schema)
df1.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



In [40]:
#Create empty DataFrame directly.
df2 = spark.createDataFrame([], schema)
df2.printSchema()

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)



### Create Empty DataFrame without Schema (no columns)
To create an empty DataFrame without a schema (no columns), create an empty schema and use it while creating the PySpark DataFrame.

In [41]:
#Create empty DataFrame with no schema (no columns)
df3 = spark.createDataFrame([], StructType([]))
df3.printSchema()

root



### Create DataFrame from RDD
Create a PySpark DataFrame from an existing RDD. First, let’s create a Spark RDD from a Python list by calling the parallelize() function of SparkContext.

In [42]:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

dept = [("Finance",10),("Marketing",20),("Sales",30),("IT",40)]
rdd2 = spark.sparkContext.parallelize(dept)

#### Using toDF() function
PySpark RDD’s toDF() method creates a DataFrame from an existing RDD. Since an RDD doesn’t have column names, the DataFrame is created with the default column names “_1” and “_2”, as we have two columns.

In [43]:
dfFromRDD1 = rdd.toDF()
dfFromRDD1.printSchema()
dfFromRDD1.show(truncate=False)

dfFromRDD2 = rdd2.toDF()
dfFromRDD2.printSchema()
dfFromRDD2.show(truncate=False)

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)

+------+------+
|_1    |_2    |
+------+------+
|Java  |20000 |
|Python|100000|
|Scala |3000  |
+------+------+

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)

+---------+---+
|_1       |_2 |
+---------+---+
|Finance  |10 |
|Marketing|20 |
|Sales    |30 |
|IT       |40 |
+---------+---+



To give the DataFrame column names, pass them to the toDF() method as arguments, as shown below.

In [44]:
columns = ["language","users_count"]
dfFromRDD1 = rdd.toDF(columns)
dfFromRDD1.printSchema()
dfFromRDD1.show(truncate=False)

deptColumns = ["dept_name","dept_id"]
df2 = rdd2.toDF(deptColumns)
df2.printSchema()
df2.show(truncate=False)

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)

+--------+-----------+
|language|users_count|
+--------+-----------+
|Java    |20000      |
|Python  |100000     |
|Scala   |3000       |
+--------+-----------+

root
 |-- dept_name: string (nullable = true)
 |-- dept_id: long (nullable = true)

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|20     |
|Sales    |30     |
|IT       |40     |
+---------+-------+



#### Using createDataFrame() from SparkSession
Using createDataFrame() from SparkSession is another way to create a DataFrame manually. It takes an RDD object as an argument, and you can chain it with toDF() to name the columns.

In [45]:
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.printSchema()
dfFromRDD2.show(truncate=False)

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)

+--------+-----------+
|language|users_count|
+--------+-----------+
|Java    |20000      |
|Python  |100000     |
|Scala   |3000       |
+--------+-----------+



### Create DataFrame from List Collection
Here we use the list data object instead of the “rdd” object to create the DataFrame.

#### Using createDataFrame() from SparkSession
Calling createDataFrame() from SparkSession is another way to create a PySpark DataFrame manually. It takes a list object as an argument, and you can chain it with toDF() to name the columns.

In [46]:
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2.printSchema()
dfFromData2.show(truncate=False)

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)

+--------+-----------+
|language|users_count|
+--------+-----------+
|Java    |20000      |
|Python  |100000     |
|Scala   |3000       |
+--------+-----------+



#### Using createDataFrame() with the Row type
createDataFrame() has another signature in PySpark that takes a collection of Row objects and a schema of column names as arguments. To use it, we first need to convert our “data” object from a list of tuples to a list of Row objects.

In [47]:
from pyspark.sql import Row

rowData = map(lambda x: Row(*x), data) 
dfFromData3 = spark.createDataFrame(rowData,columns)
dfFromData3.printSchema()
dfFromData3.show(truncate=False)

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)

+--------+-----------+
|language|users_count|
+--------+-----------+
|Java    |20000      |
|Python  |100000     |
|Scala   |3000       |
+--------+-----------+



#### Create DataFrame with Schema
If you want to specify the column names along with their data types, create the StructType schema first and then pass it when creating the DataFrame.

In [48]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)
         ]

# If you wanted to specify the column names along with their data types, you should
# create the StructType schema first and then assign this while creating a DataFrame.
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])


df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



### Create DataFrame from Data sources
You can manually create a PySpark DataFrame using the toDF() and createDataFrame() methods.
DataFrames can also be created from data sources such as TXT, CSV, JSON, ORC, Avro, Parquet, and XML files read from HDFS, S3, DBFS, or Azure Blob file systems, and from RDBMS and NoSQL databases.

#### Creating DataFrame from CSV
Use the csv() method of the DataFrameReader object to create a DataFrame from a CSV file. You can also provide options such as the delimiter to use, whether the data is quoted, date formats, schema inference, and many more.

In [49]:
df2 = spark.read.csv("./resources/zipcodes.csv", header=True)
df2.show(truncate=True)

+------------+-------+-----------+-------------------+-----+--------------+-----+------+-----+-----+-----+-----------+-------+--------------------+--------------------+--------------------+---------------+-------------------+----------+-----+
|RecordNumber|Zipcode|ZipCodeType|               City|State|  LocationType|  Lat|  Long|Xaxis|Yaxis|Zaxis|WorldRegion|Country|        LocationText|            Location|       Decommisioned|TaxReturnsFiled|EstimatedPopulation|TotalWages|Notes|
+------------+-------+-----------+-------------------+-----+--------------+-----+------+-----+-----+-----+-----------+-------+--------------------+--------------------+--------------------+---------------+-------------------+----------+-----+
|           1|    704|   STANDARD|        PARC PARQUE|   PR|NOT ACCEPTABLE|17.96|-66.22| null|-0.87|  0.3|         NA|     US|         Parc Parque|                  PR|NA-US-PR-PARC PARQUE|          false|               null|      null| null|
|           2|    704|   STA

### Convert PySpark DataFrame to Pandas
The main difference between pandas and PySpark is where operations run: pandas runs on a single node, whereas PySpark distributes work across multiple cores and machines and executes it in parallel.

For large datasets, such as those in many machine learning applications, this lets PySpark process operations many times faster than pandas.

#### Prepare PySpark DataFrame

In [50]:
data = [("James","","Smith","36636","M",60000),
        ("Michael","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0)]

columns = ["first_name","middle_name","last_name","dob","gender","salary"]
pysparkDF = spark.createDataFrame(data = data, schema = columns)
pysparkDF.printSchema()
pysparkDF.show(truncate=False)

root
 |-- first_name: string (nullable = true)
 |-- middle_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+----------+-----------+---------+-----+------+------+
|first_name|middle_name|last_name|dob  |gender|salary|
+----------+-----------+---------+-----+------+------+
|James     |           |Smith    |36636|M     |60000 |
|Michael   |Rose       |         |40288|M     |70000 |
|Robert    |           |Williams |42114|      |400000|
|Maria     |Anne       |Jones    |39192|F     |500000|
|Jen       |Mary       |Brown    |     |F     |0     |
+----------+-----------+---------+-----+------+------+



#### Convert PySpark DataFrame to Pandas DataFrame
PySpark DataFrame provides a toPandas() method to convert it to a pandas DataFrame.

In [51]:
pandasDF = pysparkDF.toPandas()
print(pandasDF)

  first_name middle_name last_name    dob gender  salary
0      James                 Smith  36636      M   60000
1    Michael        Rose            40288      M   70000
2     Robert              Williams  42114         400000
3      Maria        Anne     Jones  39192      F  500000
4        Jen        Mary     Brown             F       0


### Convert Spark Nested Struct DataFrame to Pandas
Data in a PySpark DataFrame is often nested, meaning one column contains other columns, so let’s see how it converts to pandas. Here is an example with a nested struct where firstname, middlename and lastname are part of the name column.

In [52]:
from pyspark.sql.types import StructType, StructField, StringType,IntegerType
dataStruct = [(("James","","Smith"),"36636","M","3000"),
      (("Michael","Rose",""),"40288","M","4000"),
      (("Robert","","Williams"),"42114","M","4000"),
      (("Maria","Anne","Jones"),"39192","F","4000"),
      (("Jen","Mary","Brown"),"","F","-1")
]

schemaStruct = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('middlename', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('dob', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', StringType(), True)
])
df = spark.createDataFrame(data=dataStruct, schema = schemaStruct)
df.printSchema()
df.show(truncate=False)
pandasDF2 = df.toPandas()
print(pandasDF2)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: string (nullable = true)

+--------------------+-----+------+------+
|name                |dob  |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3000  |
|{Michael, Rose, }   |40288|M     |4000  |
|{Robert, , Williams}|42114|M     |4000  |
|{Maria, Anne, Jones}|39192|F     |4000  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+

                   name    dob gender salary
0      (James, , Smith)  36636      M   3000
1     (Michael, Rose, )  40288      M   4000
2  (Robert, , Williams)  42114      M   4000
3  (Maria, Anne, Jones)  39192      F   4000
4    (Jen, Mary, Brown)             F     -1
