### Overview

DataFrames are the most commonly used structured data object in Spark. This notebook covers ways to create DataFrames, supply the input data, and explicitly declare column names and data types.

### Creating Spark DataFrames

##### Explicitly State Data and Columns:

In [0]:
# Example 1: Pass data and schema as keyword arguments
sdf = spark.createDataFrame(
  data = [
    (1001,'Chicago',535),          # each tuple is one row of values
    (1002,'Boston',495),
    (1003,'Seattle',318),
  ],
  schema = ['station_id','city','rainfall']  # column names
)

display(sdf)

# Example 2: Pass data and schema positionally (keyword names omitted)
sdf = spark.createDataFrame(
  [                                # <-- this line no longer has a "data = "
    (1001,'Chicago',535),
    (1002,'Boston',495),
    (1003,'Seattle',318),
  ],
  ['station_id','city','rainfall'] # <-- this line no longer has a "schema = "
)

display(sdf)

station_id,city,rainfall
1001,Chicago,535
1002,Boston,495
1003,Seattle,318


station_id,city,rainfall
1001,Chicago,535
1002,Boston,495
1003,Seattle,318


##### Implicitly State Data and Columns Using Lists:
Alternatively, you can assign the rows and the column names to variables first, then pass those variables to the createDataFrame() function:

In [0]:
# Example 1: Define data and columns as variables, then pass them by keyword
data = [
    (1001,'Chicago',535),            
    (1002,'Boston',495),           
    (1003,'Seattle',318),          
]

columns = ['station_id','city','rainfall']

sdf = spark.createDataFrame(data=data, schema=columns) # <-- the data and column objects referenced here 

display(sdf)

# Example 2: Same variables, passed positionally
data = [
    (1001,'Chicago',535),            
    (1002,'Boston',495),           
    (1003,'Seattle',318),          
]

columns = ['station_id','city','rainfall']

sdf = spark.createDataFrame(data, columns)             # <-- "data =" & "schema =" no longer stated

display(sdf)

##### Creating a Spark DataFrame with Explicitly Stated Formatting
The pyspark.sql.types module provides classes that let you define each column's data type when creating a Spark DataFrame.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType # <-- import only the datatypes used below

data = [
    (1001,'Chicago',535 , None),    ##  
    (1002,'Boston' ,495 , None),     # Values
    (1003,'Seattle',318 , None),     #
    (None,None     ,None, None),    ##
]

schema = StructType([
   StructField("station_id", IntegerType(), True),   # StructField(column name, data type, nullable)
   StructField("city"      , StringType() , True),
   StructField("rainfall"  , IntegerType(), True),
   StructField("comments"  , StringType() , True),   # nullable = True permits the None values above
])

sdf = spark.createDataFrame(data, schema)

display(sdf)

station_id,city,rainfall,comments
1001.0,Chicago,535.0,
1002.0,Boston,495.0,
1003.0,Seattle,318.0,
,,,


### Viewing Data with the display() and show() Functions:
The display() function, which is provided by the Databricks notebook environment, is only one way to view data within a Spark DataFrame. The show() method is part of PySpark itself and provides some additional flexibility, such as truncating text and printing each row vertically.

In [0]:
# Method 1: display() function, which can be called two ways
sdf.display()
# or 
display(sdf)

##################################################
# Method 2: show() function
# 

sdf.show(n        = 2    ,  # number of rows 
         truncate = 3)      # Uses truncate = 3 to limit cell output to 3 characters in length; 20 characters by default

# Uses truncate = False to eliminate cell truncation
sdf.show(n        = 2    ,  
         truncate = False)  

# Uses vertical = True to display the data vertically
sdf.show(vertical  = True)  

station_id,city,rainfall,comments
1001.0,Chicago,535.0,
1002.0,Boston,495.0,
1003.0,Seattle,318.0,
,,,


station_id,city,rainfall,comments
1001.0,Chicago,535.0,
1002.0,Boston,495.0,
1003.0,Seattle,318.0,
,,,


+----------+----+--------+--------+
|station_id|city|rainfall|comments|
+----------+----+--------+--------+
|       100| Chi|     535|     nul|
|       100| Bos|     495|     nul|
+----------+----+--------+--------+
only showing top 2 rows

+----------+-------+--------+--------+
|station_id|city   |rainfall|comments|
+----------+-------+--------+--------+
|1001      |Chicago|535     |null    |
|1002      |Boston |495     |null    |
+----------+-------+--------+--------+
only showing top 2 rows

-RECORD 0-------------
 station_id | 1001    
 city       | Chicago 
 rainfall   | 535     
 comments   | null    
-RECORD 1-------------
 station_id | 1002    
 city       | Boston  
 rainfall   | 495     
 comments   | null    
-RECORD 2-------------
 station_id | 1003    
 city       | Seattle 
 rainfall   | 318     
 comments   | null    
-RECORD 3-------------
 station_id | null    
 city       | null    
 rainfall   | null    
 comments   | null    



### Some Additional Tips and Tricks:

The ".columns" attribute returns a list of column names:

In [0]:
sdf.columns # obtain list of column names

Out[5]: ['station_id', 'city', 'rainfall', 'comments']

Python lets you split a chain of method calls across multiple lines by ending each line with a backslash " \ " :

In [0]:
sdf \
  .show(truncate=False)

+----------+-------+--------+--------+
|station_id|city   |rainfall|comments|
+----------+-------+--------+--------+
|1001      |Chicago|535     |null    |
|1002      |Boston |495     |null    |
|1003      |Seattle|318     |null    |
|null      |null   |null    |null    |
+----------+-------+--------+--------+



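Spark DataFrames have no built-in equivalent of the Pandas "shape" attribute as of the time of this writing. However, the shape can be obtained by combining count() for the rows with the length of ".columns" for the columns. Note that count() runs a Spark job over the data, so it can be slow on large DataFrames: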

In [0]:
def sdf_shape(sdf):
    print(sdf.count(),len(sdf.columns))
    
sdf_shape(sdf)

4 4
