# Managing Big Data for Connected Devices

## 420-N63-NA

## Kawser Wazed Nafi
 ----------------------------------------------------------------------------------------------------------------------------------
    
## StructType & StructField

PySpark's StructType and StructField classes are used to specify a DataFrame's schema programmatically and to create complex columns such as nested struct, array, and map columns. A StructType is a collection of StructField objects, each of which defines a column name, a column data type, a boolean indicating whether the field is nullable, and optional metadata.

When data arrives without an explicit structure, StructType and StructField let us describe that structure and load the data into a PySpark DataFrame. With named, typed columns, the analysis that follows is easier to reason about and less error-prone.

At the time of creating a PySpark Dataframe, we can specify the structure of the data using StructType and StructField.
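
To make the idea of a schema concrete, here is a small plain-Python sketch. It is not PySpark code and not how Spark implements schemas internally; it only models each field as a (name, type, nullable) triple, mirroring the main StructField arguments. The helper `check_row` and the non-nullable `salary` field are hypothetical and exist only for illustration.

```python
from typing import Any, List, Tuple

# Each entry mirrors the StructField arguments: (name, Python type, nullable).
Field = Tuple[str, type, bool]

schema: List[Field] = [
    ("firstname", str, True),
    ("id", str, True),
    ("salary", int, False),  # hypothetical: salary must not be missing
]

def check_row(row: Tuple[Any, ...], fields: List[Field]) -> List[str]:
    """Return the problems found in one row; an empty list means the row fits."""
    if len(row) != len(fields):
        return [f"expected {len(fields)} values, got {len(row)}"]
    problems = []
    for value, (name, expected_type, nullable) in zip(row, fields):
        if value is None:
            if not nullable:
                problems.append(f"{name}: null not allowed")
        elif not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems

print(check_row(("James", "36636", 3000), schema))  # fits the schema
print(check_row(("Jen", "", None), schema))         # missing salary
print(check_row(("Maria", 39192, 4000), schema))    # id given as a number
```

The point of the sketch is that a schema is a contract: once names, types, and nullability are declared, every row can be checked against them before any analysis starts.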

## StructType Example 1

Consider input data that carries no structure of its own. Using StructType, we can give each column a name and define its data type.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType

ss = SparkSession.builder.master("local[4]") \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

data = [("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
  ]

# Line-continuation backslashes are not needed inside brackets
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
 
dataframe = ss.createDataFrame(data=data,schema=schema)
dataframe.printSchema()
dataframe.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



## StructType Example 2

Now suppose the same input data holds the first name, middle name, and last name together as a tuple. A tuple inside each record makes the data a nested structure. To describe such data in your program, you have to use a nested StructType object.

In [2]:
structureData = [
    (("James","","Smith"),"36636","M",3100),
    (("Michael","Rose",""),"40288","M",4300),
    (("Robert","","Williams"),"42114","M",1400),
    (("Maria","Anne","Jones"),"39192","F",5500),
    (("Jen","Mary","Brown"),"","F",-1)
  ]
structureSchema = StructType([
        StructField('name', StructType([
             StructField('firstname', StringType(), True),
             StructField('middlename', StringType(), True),
             StructField('lastname', StringType(), True)
             ])),
         StructField('id', StringType(), True),
         StructField('gender', StringType(), True),
         StructField('salary', IntegerType(), True)
         ])

dataframe2 = ss.createDataFrame(data=structureData,schema=structureSchema)
dataframe2.printSchema()
dataframe2.show(truncate=False)

root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+--------------------+-----+------+------+
|name                |id   |gender|salary|
+--------------------+-----+------+------+
|{James, , Smith}    |36636|M     |3100  |
|{Michael, Rose, }   |40288|M     |4300  |
|{Robert, , Williams}|42114|M     |1400  |
|{Maria, Anne, Jones}|39192|F     |5500  |
|{Jen, Mary, Brown}  |     |F     |-1    |
+--------------------+-----+------+------+
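
The following plain-Python sketch illustrates what the nested schema adds: it pairs each positional value with its field name, so a nested value can be reached through a dotted path such as `name.lastname`. This is only a model of the idea, not PySpark's implementation; the helpers `to_named` and `get_path` are hypothetical names introduced for this example.

```python
# Field names for the nested "name" struct and for the top-level record.
name_fields = ["firstname", "middlename", "lastname"]
top_fields = [("name", name_fields), ("id", None), ("gender", None), ("salary", None)]

def to_named(row, fields):
    """Pair each positional value with its field name, one nesting level deep."""
    named = {}
    for value, (field, children) in zip(row, fields):
        named[field] = dict(zip(children, value)) if children else value
    return named

def get_path(record, dotted):
    """Resolve a dotted path such as 'name.firstname' against a nested dict."""
    for part in dotted.split("."):
        record = record[part]
    return record

row = (("James", "", "Smith"), "36636", "M", 3100)
record = to_named(row, top_fields)
print(record)
print(get_path(record, "name.lastname"))
```

In PySpark itself, the same kind of dotted path can be used to select a nested field, for example `dataframe2.select("name.lastname")`.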



## Exercise 1

From our latest Movie dataset we got the following data:
    
data = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]

The data is divided into the following structure: (userID, rating), (movieID, genres)

Please structure the data and load it into a DataFrame for further study.

     

In [7]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, FloatType

sps = SparkSession.builder.master("local[4]").appName("dataframe-app").getOrCreate()
data = [((1,4.2),(1,"funny")), ((2,4.5),(3,"funny")), ((1,4.0),(6,"funny")), ((3,5.0),(47,"action")), ((4,4.3),(50,"romantic")), ((3,3.2),(70,"biography")), ((4,5.0),(101,"biography")), ((4,4.6),(110,"Scientific")), ((1,5.0),(151,"action")), ((1,4.6),( 157,"action")), ((2,3.5),(167,"funny")), ((1,4.1),(172,"funny")), ((3,4.7),(181,"action")), ((4,3.9),(192,"romantic")), ((3,3.8),(201,"biography")), ((4,5.0),(211,"biography")), ((4,4.6),(224,"Scientific")), ((1,5.0),(231,"action"))]

dataSchema = StructType([
    StructField('user', StructType([
            StructField('userID', IntegerType(), True),
            StructField('rating', FloatType(), True)
        ])),
    StructField('movie', StructType([
            StructField('movieID', IntegerType(), True),
            StructField('genres', StringType(), True)
        ]))
])

dataDF = sps.createDataFrame(data=data, schema=dataSchema)
dataDF.printSchema()
dataDF.show()

root
 |-- user: struct (nullable = true)
 |    |-- userID: integer (nullable = true)
 |    |-- rating: float (nullable = true)
 |-- movie: struct (nullable = true)
 |    |-- movieID: integer (nullable = true)
 |    |-- genres: string (nullable = true)

+--------+-----------------+
|    user|            movie|
+--------+-----------------+
|{1, 4.2}|       {1, funny}|
|{2, 4.5}|       {3, funny}|
|{1, 4.0}|       {6, funny}|
|{3, 5.0}|     {47, action}|
|{4, 4.3}|   {50, romantic}|
|{3, 3.2}|  {70, biography}|
|{4, 5.0}| {101, biography}|
|{4, 4.6}|{110, Scientific}|
|{1, 5.0}|    {151, action}|
|{1, 4.6}|    {157, action}|
|{2, 3.5}|     {167, funny}|
|{1, 4.1}|     {172, funny}|
|{3, 4.7}|    {181, action}|
|{4, 3.9}|  {192, romantic}|
|{3, 3.8}| {201, biography}|
|{4, 5.0}| {211, biography}|
|{4, 4.6}|{224, Scientific}|
|{1, 5.0}|    {231, action}|
+--------+-----------------+



## Transform

Another API that Spark provides for preparing a DataFrame for further analysis is pyspark.sql.DataFrame.transform(). It is used to chain custom transformations: each call passes the DataFrame to the given function and returns the new DataFrame that the function produces.
The result has the same number of rows as the input unless the custom function itself filters or adds rows.

### Syntax

DataFrame.transform(func: Callable[[…], DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame
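
The signature above can be read as "call `func` with the DataFrame as its first argument, forward any extra arguments, and return whatever DataFrame `func` produces". The plain-Python sketch below models that contract. The class `TinyFrame` and the three helper functions are hypothetical stand-ins written for this illustration; they are not part of PySpark.

```python
class TinyFrame:
    """Minimal stand-in for a DataFrame: just a list of values."""

    def __init__(self, rows):
        self.rows = list(rows)

    def transform(self, func, *args, **kwargs):
        # Call func with this frame first, forward the extra arguments,
        # and hand back the frame that func returns.
        result = func(self, *args, **kwargs)
        assert isinstance(result, TinyFrame), "func must return a TinyFrame"
        return result

def double_all(frame):
    return TinyFrame(value * 2 for value in frame.rows)

def add(frame, amount):
    return TinyFrame(value + amount for value in frame.rows)

def keep_above(frame, minimum):
    # A transformation is free to drop rows.
    return TinyFrame(value for value in frame.rows if value > minimum)

result = (
    TinyFrame([1, 2, 3])
    .transform(double_all)               # [2, 4, 6]
    .transform(add, 10)                  # [12, 14, 16]
    .transform(keep_above, minimum=13)   # [14, 16]
)
print(result.rows)
```

Because every step returns a new frame, the calls chain naturally, and each custom function stays small enough to test on its own.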


In [5]:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession
ss = SparkSession.builder \
            .appName('SparkByExamples.com') \
            .getOrCreate()

# Prepare Data
simpleData = (("Java",4000,5), \
    ("Python", 4600,10),  \
    ("Scala", 4100,15),   \
    ("Scala", 4500,15),   \
    ("PHP", 3000,20),  \
  )
columns= ["CourseName", "fee", "discount"]

# Create DataFrame
dataframe = ss.createDataFrame(data = simpleData, schema = columns)
dataframe.printSchema()
dataframe.show(truncate=False)


root
 |-- CourseName: string (nullable = true)
 |-- fee: long (nullable = true)
 |-- discount: long (nullable = true)

+----------+----+--------+
|CourseName|fee |discount|
+----------+----+--------+
|Java      |4000|5       |
|Python    |4600|10      |
|Scala     |4100|15      |
|Scala     |4500|15      |
|PHP       |3000|20      |
+----------+----+--------+



We can define custom transformation functions in our program, pass our DataFrame through them one after another, and get the fully transformed data at the end.

In [9]:
# Custom transformation 1
from pyspark.sql.functions import upper
def to_upper_str_columns(dataframe):
    return dataframe.withColumn("CourseName",upper(dataframe.CourseName))

# Custom transformation 2
def reduce_price(dataframe,reduceBy):
    return dataframe.withColumn("new_fee",dataframe.fee - reduceBy)

# Custom transformation 3
def apply_discount(dataframe):
    return dataframe.withColumn("discounted_fee",  \
             dataframe.new_fee - (dataframe.new_fee * dataframe.discount) / 100)

# Convert all the course names to uppercase, reduce every course price by 1000 CAD, then apply the discount.
dataframe2 =  dataframe.transform(to_upper_str_columns) \
        .transform(reduce_price,1000) \
        .transform(apply_discount)
dataframe2.show()

+----------+----+--------+-------+--------------+
|CourseName| fee|discount|new_fee|discounted_fee|
+----------+----+--------+-------+--------------+
|      JAVA|4000|       5|   3000|        2850.0|
|    PYTHON|4600|      10|   3600|        3240.0|
|     SCALA|4100|      15|   3100|        2635.0|
|     SCALA|4500|      15|   3500|        2975.0|
|       PHP|3000|      20|   2000|        1600.0|
+----------+----+--------+-------+--------------+
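
As a sanity check on the table above, the same arithmetic can be redone in plain Python without Spark. This is only a hand verification of the two formulas used in `reduce_price` and `apply_discount`; the variable names are chosen for this example.

```python
# (CourseName, fee, discount in percent), copied from the input data
courses = [
    ("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
]

results = []
for name, fee, discount in courses:
    new_fee = fee - 1000                                # reduce_price with reduceBy=1000
    discounted_fee = new_fee - (new_fee * discount) / 100  # apply_discount
    results.append(discounted_fee)
    print(name.upper(), fee, discount, new_fee, discounted_fee)
```

The division by 100 produces a floating-point value, which is why `discounted_fee` is shown with a decimal point while `new_fee` stays an integer.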



## Exercise 2

From Exercise 1, we have the following dataset:
    
data = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]

The data is divided into the following structure: (userID, rating), (movieID, genres)

The system had an issue: for some unexplained reason, ratings for movies in the "funny" genre were reduced by 15\% and the reduced values were recorded. The reduction did not affect all the ratings; it happened only for ratings lower than 4.5.

Please increase the ratings by 15\% for the "funny" genre movies whose recorded ratings are lower than 4.5. List the old and new ratings side by side, as shown in the examples.

In [10]:

# Custom transformation: add a column with the rating raised by `by` percent
def increase_funny(df, by):
    return df.withColumn("new_rating", df.user.rating * (1 + by / 100))

# Keep only the "funny" movies whose rating is below 4.5
funnyDF = dataDF.where((dataDF.user.rating < 4.5) & (dataDF.movie.genres == "funny"))
funnyDF.show(truncate=False)

transformedDF = funnyDF.transform(increase_funny, 15)
transformedDF.show(truncate=False)


+--------+------------+
|user    |movie       |
+--------+------------+
|{1, 4.2}|{1, funny}  |
|{1, 4.0}|{6, funny}  |
|{2, 3.5}|{167, funny}|
|{1, 4.1}|{172, funny}|
+--------+------------+

+--------+------------+------------------+
|user    |movie       |new_rating        |
+--------+------------+------------------+
|{1, 4.2}|{1, funny}  |4.829999780654907 |
|{1, 4.0}|{6, funny}  |4.6               |
|{2, 3.5}|{167, funny}|4.0249999999999995|
|{1, 4.1}|{172, funny}|4.714999890327453 |
+--------+------------+------------------+
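
The long decimals in `new_rating` (for example 4.829999780654907 instead of 4.83) are most likely a side effect of the schema, which declares `rating` as FloatType, a 32-bit float. The value 4.2 cannot be stored exactly in 32 bits, and the stored approximation is then multiplied by a 64-bit factor. The plain-Python sketch below reproduces the effect with the standard `struct` module; the helper `as_float32` is a hypothetical name used only here.

```python
import struct

def as_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", value))[0]

factor = 1 + 15 / 100  # same expression as in increase_funny

new_ratings = {}
for rating in [4.2, 4.0, 3.5, 4.1]:
    stored = as_float32(rating)   # what a 32-bit float column can hold
    new_ratings[rating] = stored * factor
    print(rating, stored, new_ratings[rating])
```

Ratings such as 4.0 and 3.5 are exactly representable in 32 bits, so they show no extra noise from storage. If tidier output is needed, two options worth trying are declaring the field as DoubleType or rounding the new column with `pyspark.sql.functions.round`.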



## Union, UnionAll and UnionByName

PySpark union() and unionAll() transformations merge two or more DataFrames that have the same schema or structure. The union operation appends the rows of the second DataFrame to the rows of the first, matching columns by position; it does not inspect the data, so duplicate rows are kept.




In [32]:
simpleData = [("James","Sales","NY",90000,34,10000), \
    ("Michael","Sales","NY",86000,56,20000), \
    ("Robert","Sales","CA",81000,30,23000), \
    ("Maria","Finance","CA",90000,24,23000) \
  ]

columns= ["employee_name","department","state","salary","age","bonus"]
dataframe = ss.createDataFrame(data = simpleData, schema = columns)
dataframe.printSchema()
dataframe.show(truncate=False)


simpleData2 = [("James","Sales","NY",90000,34,10000), \
    ("Maria","Finance","CA",90000,24,23000), \
    ("Jen","Finance","NY",79000,53,15000), \
    ("Jeff","Marketing","CA",80000,25,18000), \
    ("Kumar","Marketing","NY",91000,50,21000) \
  ]
columns2= ["employee_name","department","state","salary","age","bonus"]

dataframe2 = ss.createDataFrame(data = simpleData2, schema = columns2)

dataframe2.printSchema()
dataframe2.show(truncate=False)


unionDF = dataframe.union(dataframe2)
unionDF.printSchema()
unionDF.show(truncate=False)


unionAllDF = dataframe.unionAll(dataframe2)
unionAllDF.printSchema()
unionAllDF.show(truncate=False)

# Print the unionByName result (the original cell printed unionAllDF again by mistake)
unionByNameDF = dataframe.unionByName(dataframe2)
unionByNameDF.printSchema()
unionByNameDF.show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
+-------------+----------+-----+------+---+-----+

root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----

## Exercise 3

Can you see the differences between Union, UnionAll and UnionByName? Please state them here.

| .union() | .unionAll() | .unionByName() |
|----------|-------------|----------------|
| Matches columns by their position in the DataFrame | An alias for .union(), more widely used in older versions of PySpark | Matches columns by their names |
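
The difference only becomes visible when the two inputs list their columns in a different order. The plain-Python sketch below models both behaviours on tiny tables made of a column list and a list of row tuples. The functions `union_by_position` and `union_by_name` are hypothetical helpers written for this illustration, not PySpark calls, and the two-column data is made up for the example.

```python
def union_by_position(cols_a, rows_a, cols_b, rows_b):
    """Append rows_b under rows_a as they are; names come from the first table."""
    if len(cols_a) != len(cols_b):
        raise ValueError("both tables need the same number of columns")
    return cols_a, rows_a + rows_b

def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder each row of the second table to the first table's column order."""
    if sorted(cols_a) != sorted(cols_b):
        raise ValueError("both tables need the same column names")
    order = [cols_b.index(name) for name in cols_a]
    reordered = [tuple(row[i] for i in order) for row in rows_b]
    return cols_a, rows_a + reordered

cols_a, rows_a = ["employee_name", "state"], [("James", "NY")]
cols_b, rows_b = ["state", "employee_name"], [("CA", "Maria"), ("NY", "James")]

_, by_position = union_by_position(cols_a, rows_a, cols_b, rows_b)
_, by_name = union_by_name(cols_a, rows_a, cols_b, rows_b)

print(by_position)  # states end up in the employee_name column
print(by_name)      # values line up, and the duplicate James row is kept

# Neither variant removes duplicates; that is a separate step.
deduplicated = list(dict.fromkeys(by_name))
print(deduplicated)
```

In PySpark, the de-duplication step corresponds to calling `distinct()` on the result of the union.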

## Exercise 4
Consider the two datasets given below. Perform the Union, UnionAll and UnionByName operations on them.

data1 = [((1,4.2),(1,"funny")),
((2,4.5),(3,"funny")),
((1,4.0),(6,"funny")),
((3,5.0),(47,"action")),
((4,4.3),(50,"romantic")),
((3,3.2),(70,"biography")),
((4,5.0),(101,"biography")),
((4,4.6),(110,"Scientific")),
((1,5.0),(151,"action")),
((1,4.6),( 157,"action")),
((2,3.5),(167,"funny")),
((1,4.1),(172,"funny")),
((3,4.7),(181,"action")),
((4,3.9),(192,"romantic")),
((3,3.8),(201,"biography")),
((4,5.0),(211,"biography")),
((4,4.6),(224,"Scientific")),
((1,5.0),(231,"action"))]


data2 = [((2,4.1),(1,"funny")),
((1,4.2),(3,"funny")),
((2,4.0),(6,"funny")),
((1,4.0),(47,"action")),
((2,4.7),(50,"romantic")),
((1,3.6),(70,"biography")),
((2,4.2),(101,"biography")),
((4,4.7),(111,"Scientific")),
((3,5.0),(151,"action")),
((2,4.6),( 157,"action")),
((1,3.5),(167,"funny")),
((5,4.1),(172,"funny")),
((2,4.5),(181,"action")),
((3,4.3),(192,"romantic")),
((2,4.2),(201,"biography")),
((3,5.0),(211,"biography")),
((3,4.6),(224,"Scientific")),
((2,5.0),(231,"action"))]



In [33]:
data1 = [((1,4.2),(1,"funny")), ((2,4.5),(3,"funny")), ((1,4.0),(6,"funny")), ((3,5.0),(47,"action")), ((4,4.3),(50,"romantic")), ((3,3.2),(70,"biography")), ((4,5.0),(101,"biography")), ((4,4.6),(110,"Scientific")), ((1,5.0),(151,"action")), ((1,4.6),( 157,"action")), ((2,3.5),(167,"funny")), ((1,4.1),(172,"funny")), ((3,4.7),(181,"action")), ((4,3.9),(192,"romantic")), ((3,3.8),(201,"biography")), ((4,5.0),(211,"biography")), ((4,4.6),(224,"Scientific")), ((1,5.0),(231,"action"))]

data2 = [((2,4.1),(1,"funny")), ((1,4.2),(3,"funny")), ((2,4.0),(6,"funny")), ((1,4.0),(47,"action")), ((2,4.7),(50,"romantic")), ((1,3.6),(70,"biography")), ((2,4.2),(101,"biography")), ((4,4.7),(111,"Scientific")), ((3,5.0),(151,"action")), ((2,4.6),( 157,"action")), ((1,3.5),(167,"funny")), ((5,4.1),(172,"funny")), ((2,4.5),(181,"action")), ((3,4.3),(192,"romantic")), ((2,4.2),(201,"biography")), ((3,5.0),(211,"biography")), ((3,4.6),(224,"Scientific")), ((2,5.0),(231,"action"))]

In [39]:
dataSchema = StructType([
    StructField('user', StructType([
            StructField('userID', IntegerType(), True),
            StructField('rating', FloatType(), True)
        ])),
    StructField('movie', StructType([
            StructField('movieID', IntegerType(), True),
            StructField('genres', StringType(), True)
        ]))
])

dfOne = sps.createDataFrame(data=data1, schema=dataSchema)
dfTwo = sps.createDataFrame(data=data2, schema=dataSchema)
unionDF = dfOne.union(dfTwo)
unionDF.printSchema()
unionDF.show()

unionAllDF = dfOne.unionAll(dfTwo)
unionAllDF.printSchema()
unionAllDF.show()

unionByNameDF = dfOne.unionByName(dfTwo)
unionByNameDF.printSchema()
unionByNameDF.show()

root
 |-- user: struct (nullable = true)
 |    |-- userID: integer (nullable = true)
 |    |-- rating: float (nullable = true)
 |-- movie: struct (nullable = true)
 |    |-- movieID: integer (nullable = true)
 |    |-- genres: string (nullable = true)

+--------+-----------------+
|    user|            movie|
+--------+-----------------+
|{1, 4.2}|       {1, funny}|
|{2, 4.5}|       {3, funny}|
|{1, 4.0}|       {6, funny}|
|{3, 5.0}|     {47, action}|
|{4, 4.3}|   {50, romantic}|
|{3, 3.2}|  {70, biography}|
|{4, 5.0}| {101, biography}|
|{4, 4.6}|{110, Scientific}|
|{1, 5.0}|    {151, action}|
|{1, 4.6}|    {157, action}|
|{2, 3.5}|     {167, funny}|
|{1, 4.1}|     {172, funny}|
|{3, 4.7}|    {181, action}|
|{4, 3.9}|  {192, romantic}|
|{3, 3.8}| {201, biography}|
|{4, 5.0}| {211, biography}|
|{4, 4.6}|{224, Scientific}|
|{1, 5.0}|    {231, action}|
|{2, 4.1}|       {1, funny}|
|{1, 4.2}|       {3, funny}|
+--------+-----------------+
only showing top 20 rows

root
 |-- user: struct (n