## Row class usage for DataFrame and RDD
The `Row` class, imported from `pyspark.sql`, represents a single record (row) in a DataFrame. You can create a `Row` object with named arguments, or define a custom `Row`-like class with a fixed set of field names.

In [1]:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName('row').getOrCreate()



### Create a Row Object
`Row` extends `tuple`, so it accepts a variable number of arguments; calling `Row()` creates the row object. Once the row is created, its values can be retrieved by index, just as with a tuple.

In [2]:
from pyspark.sql import Row

row = Row("James",40)
print(row[0] +","+str(row[1]))

James,40


A `Row` can also be created with named arguments. The benefit is that each value can then be accessed by field name, for example `row.name`.

In [3]:
row = Row(name="Alice", age=11)
print(row.name) 

Alice


### Create Custom Class from Row
You can also create a `Row`-like class, for example “Person”, and use it the same way as a `Row` object. This is helpful when you want to create several records that share the same fields and refer to their properties by name.

In [4]:
Person = Row("name", "age")
p1 = Person("James", 40)
p2 = Person("Alice", 35)
print(p1.name +","+p2.name)

James,Alice


### Using Row class on PySpark RDD
The `Row` class can be used with a PySpark RDD. When you build an RDD from `Row` objects, collecting the data returns the results as `Row` objects.

In [5]:
data = [Row(name="James,,Smith",lang=["Java","Scala","C++"],state="CA"), 
        Row(name="Michael,Rose,",lang=["Spark","Java","C++"],state="NJ"),
        Row(name="Robert,,Williams",lang=["CSharp","VB"],state="NV")]
rdd=spark.sparkContext.parallelize(data)
print(rdd.collect())

[Row(name='James,,Smith', lang=['Java', 'Scala', 'C++'], state='CA'), Row(name='Michael,Rose,', lang=['Spark', 'Java', 'C++'], state='NJ'), Row(name='Robert,,Williams', lang=['CSharp', 'VB'], state='NV')]


Collect the data and access each value through its field name:

In [6]:
collData=rdd.collect()
for row in collData:
    print(row.name + "," +str(row.lang))

James,,Smith,['Java', 'Scala', 'C++']
Michael,Rose,,['Spark', 'Java', 'C++']
Robert,,Williams,['CSharp', 'VB']


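Alternatively, you can create the records with a `Row`-like class “Person”. Here the list is passed to `createDataFrame`, which infers the column names and types from the rows.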

In [7]:
Person=Row("name","lang","state")
data = [Person("James,,Smith",["Java","Scala","C++"],"CA"), 
    Person("Michael,Rose,",["Spark","Java","C++"],"NJ"),
    Person("Robert,,Williams",["CSharp","VB"],"NV")]

df=spark.createDataFrame(data)
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- lang: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- state: string (nullable = true)



                                                                                

+----------------+------------------+-----+
|            name|              lang|state|
+----------------+------------------+-----+
|    James,,Smith|[Java, Scala, C++]|   CA|
|   Michael,Rose,|[Spark, Java, C++]|   NJ|
|Robert,,Williams|      [CSharp, VB]|   NV|
+----------------+------------------+-----+



### Using Row class on PySpark DataFrame
The `Row` class can also be used with a PySpark DataFrame; by default, the data in a DataFrame is represented as `Row` objects. To demonstrate, I will reuse the same data as in the RDD section and rename the columns with `toDF`.

In [8]:
columns = ["name","languagesAtSchool","currentState"]
df = spark.createDataFrame(data).toDF(*columns)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- languagesAtSchool: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)



### Create Nested Struct Using Row Class
A struct column can be created by nesting one `Row` inside another.

In [9]:
data = [Row(name="James",prop=Row(hair="black",eye="blue")),
        Row(name="Ann",prop=Row(hair="grey",eye="black"))]
df = spark.createDataFrame(data)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- prop: struct (nullable = true)
 |    |-- hair: string (nullable = true)
 |    |-- eye: string (nullable = true)

