# Schemas and Creating DataFrames

A schema in Spark defines the column names and associated data types for a DataFrame.

Defining a schema up front, as opposed to taking a schema-on-read approach, offers three benefits:

1. You relieve Spark from the onus of inferring data types.
2. You prevent Spark from creating a separate job just to read a large portion of your file to ascertain the schema, which for a large data file can be expensive and time-consuming.
3. You can detect errors early if the data doesn’t match the schema.

In [1]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession

In [2]:
# defining schema programmatically

schema = StructType([StructField("author", StringType(), False),
                     StructField("title", StringType(), False),
                     StructField("pages", IntegerType(), False)])

In [3]:
# defining schema using DDL

schema_ddl = "author STRING, title STRING, pages INT"


In [4]:
data = [["john", "book-1", 123],
        ["jane", "book-2", 234],
        ["smith", "who are you", 121]]

In [5]:
spark = SparkSession.builder.appName("schema").getOrCreate()

df = spark.createDataFrame(data, schema_ddl)



In [6]:
df.show()

+------+-----------+-----+
|author|      title|pages|
+------+-----------+-----+
|  john|     book-1|  123|
|  jane|     book-2|  234|
| smith|who are you|  121|
+------+-----------+-----+



In [7]:
# print schema
df.printSchema()

root
 |-- author: string (nullable = true)
 |-- title: string (nullable = true)
 |-- pages: integer (nullable = true)



In [8]:
# show the schema as a StructType object, which can be referenced and reused later

df.schema

StructType([StructField('author', StringType(), True), StructField('title', StringType(), True), StructField('pages', IntegerType(), True)])