## Lab Overview
- In this lab, you will work with structured data in PySpark. You will define a schema using StructType and StructField to specify the column names and data types of your data. Then, you will create a PySpark DataFrame by passing this schema to the createDataFrame() method and populate it with sample student data containing various data types (string, integer, and float). Finally, you will learn how to display the schema and its fields and print the schema in a tree format for better readability.

- We will create the data as a Python list containing five rows and six columns, and assign the following column names and types: "rollno" (string), "name" (string), "age" (integer), "height" (float), "weight" (integer), and "address" (string). Finally, we will display the DataFrame schema using the schema attribute and the printSchema() method.

### Learning Objectives
- Define the schema for a PySpark DataFrame using StructType and StructField.
- Create a DataFrame with columns containing various data types (string, integer, and float).
- Display the schema and its fields, and print the schema in a tree format.

#### Example 1

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

In [3]:
spark_app = SparkSession.builder.appName('sparkdemo').getOrCreate()

# Create student data with 5 rows and 6 attributes
students = [
    ['001', 'john', 23, 5.79, 67, 'NY'],
    ['002', 'James', 18, 3.79, 34, 'NY'],
    ['003', 'Eric', 17, 2.79, 17, 'NJ'],
    ['004', 'Shahparan', 19, 3.69, 28, 'NJ'],
    ['005', 'Flex', 37, 5.59, 54, 'Dallas'],
]

# Define the schema: one StructField(name, data type, nullable) per column
schema = StructType([
    StructField("rollno", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("height", FloatType(), True),
    StructField("weight", IntegerType(), True),
    StructField("address", StringType(), True),
])

# Create the DataFrame and apply the schema to it
df = spark_app.createDataFrame(students, schema=schema)
df.show()
df.printSchema()


+------+---------+---+------+------+-------+
|rollno|     name|age|height|weight|address|
+------+---------+---+------+------+-------+
|   001|     john| 23|  5.79|    67|     NY|
|   002|    James| 18|  3.79|    34|     NY|
|   003|     Eric| 17|  2.79|    17|     NJ|
|   004|Shahparan| 19|  3.69|    28|     NJ|
|   005|     Flex| 37|  5.59|    54| Dallas|
+------+---------+---+------+------+-------+

root
 |-- rollno: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: float (nullable = true)
 |-- weight: integer (nullable = true)
 |-- address: string (nullable = true)



#### Example 2 - Print DF Schema
- The schema attribute returns the DataFrame's StructType, which lists each column's name, data type, and nullability.

In [3]:
df.schema

StructType([StructField('rollno', StringType(), True), StructField('name', StringType(), True), StructField('age', IntegerType(), True), StructField('height', FloatType(), True), StructField('weight', IntegerType(), True), StructField('address', StringType(), True)])

In [None]:
# Display fields
df.schema.fields

[StructField('rollno', StringType(), True),
 StructField('name', StringType(), True),
 StructField('age', IntegerType(), True),
 StructField('height', FloatType(), True),
 StructField('weight', IntegerType(), True),
 StructField('address', StringType(), True)]

In [None]:
# Display schema in tree format
df.printSchema()

root
 |-- rollno: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- height: float (nullable = true)
 |-- weight: integer (nullable = true)
 |-- address: string (nullable = true)

