Tasks:
* Create custom schema for json files
* Read files 
* Add new column via UDF - timestamp 
* Add new column - Soldier's high salary 
* Rename column 
* Append rows (concatenation) 
* Join all file types 
* Write to JSON 
* Filtering 
* Sorting 
* Generate new rows 
* Aggregations
* Grouping 

### Import PySpark package

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

### Initialize SparkSession

In [2]:
spark = SparkSession.builder\
	.appName("Lesson 2 Spark Exercise 01")\
	.getOrCreate()

### Task: Read Event types CSV file 

Requirements: 
* Read source data from S3 bucket: event_types.csv -> path: s3a://wix-pyspark-labs/data/war-data/event_types.csv
* Use the `FAILFAST` read mode, which throws an exception when it meets corrupted records.
* Set the delimiter option to '|'
* Rename 'event type' column to 'event_type'

Expected table:
```
+---+--------------+
| id|    event_type|
+---+--------------+
|  1|          kill|
|...|          ... |
+---+--------------+
```

In [3]:
dfEventTypes = spark.read \
.format("csv") \
.option("header","true") \
.option("mode", "FAILFAST") \
.option("delimiter", "|") \
.load("event_types.csv") \
.withColumnRenamed("event type", "event_type")
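
The `mode` option decides how the reader reacts to malformed records: `PERMISSIVE` (the default) keeps such rows and fills unparseable fields with nulls, `DROPMALFORMED` drops them, and `FAILFAST` raises an exception. Below is a minimal sketch that gathers the reader options in one place so the same settings can be reused. The helper names `csv_read_options` and `read_csv` are hypothetical and belong neither to the exercise nor to PySpark.

```python
VALID_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def csv_read_options(mode="FAILFAST", delimiter="|", header=True):
    """Build the option dict for the CSV reader (hypothetical helper)."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {
        "header": "true" if header else "false",
        "mode": mode,
        "delimiter": delimiter,
    }


def read_csv(spark, path, **kwargs):
    """Read a CSV file with the options above; `spark` is an existing SparkSession."""
    return spark.read.format("csv").options(**csv_read_options(**kwargs)).load(path)
```

With these helpers, `read_csv(spark, "event_types.csv").withColumnRenamed("event type", "event_type")` should behave like the cell above.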

Task: Show the existing schema on the current DataFrame. Then print all the data. 

Please provide the code for the following task:

In [4]:
dfEventTypes.printSchema()
dfEventTypes.show()

root
 |-- id: string (nullable = true)
 |-- event_type: string (nullable = true)

+---+--------------+
| id|    event_type|
+---+--------------+
|  1|          kill|
|  2|         wound|
|  3|           hit|
|  4|          shot|
|  5|       misfire|
|  6|   close range|
|  7|avgerage range|
|  8|    long range|
+---+--------------+



### Task: Read Weapon types CSV file 

Requirements: 
* Read source data from S3 bucket: weapon_types.csv -> path: `s3a://wix-pyspark-labs/data/war-data/weapon_types.csv`
* Use the `FAILFAST` read mode, which throws an exception when it meets corrupted records.
* Add custom schema: 
    * 'in range' should be int type value
* Rename 'name', 'in range' columns to 'weapon_name','weapon_range'.

Expected table:
```
+---+--------------+------------+
| id|   weapon_name|weapon_range|
+---+--------------+------------+
|  1|          m 16|        2000|
|  2|           ...|         ...|
+---+--------------+------------+
```

Task: Create a new custom schema 'weaponTypesSchema' on the current DataFrame

Please provide the code for the following task:

In [5]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Field names follow the source file header; the columns are renamed after loading
weaponTypesSchema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("in range", IntegerType(), True)
])

Task: Read Weapon types CSV file 

In [6]:
dfWeaponTypes = spark.read \
.format("csv") \
.option("header","true") \
.option("mode", "FAILFAST") \
.schema(weaponTypesSchema) \
.load("weapon_types.csv") \
.withColumnRenamed("in range", "weapon_range") \
.withColumnRenamed("name", "weapon_name")

Task: Show the existing schema on the current DataFrame. Then print all the data.

Please provide the code for the following task:

In [7]:
dfWeaponTypes.printSchema()
dfWeaponTypes.show()

root
 |-- id: integer (nullable = true)
 |-- weapon_name: string (nullable = true)
 |-- weapon_range: integer (nullable = true)

+---+--------------+------------+
| id|   weapon_name|weapon_range|
+---+--------------+------------+
|  1|          m 16|        2000|
|  2|           uzi|         200|
|  3|           akm|        2200|
|  4|      revolver|         100|
|  5|Smith & Wesson|         150|
+---+--------------+------------+



### Task: Read Soldiers JSON file 


Requirements: 
* Read source data from S3 bucket: soldiers.json -> path: S3://
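* Use the `FAILFAST` read mode, which throws an exception when it meets corrupted records.
* Let Spark infer the schema (the JSON reader does this by default when no schema is supplied).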
* Rename 'name' column to 'soldier_name'.

Expected table:
```
+---+-------------------+------+
| id|       soldier_name|salary|
+---+-------------------+------+
|  1|   Haegon Blackfyre| 18477|
+---+-------------------+------+
```

In [8]:
dfSoldiers = spark.read \
.format("json") \
.option("mode", "FAILFAST") \
.option("inferSchema", "true") \
.load("soldiers.json") \
.withColumnRenamed("name", "soldier_name")

Task: Show the existing schema on the current DataFrame. Then print all the data.

Please provide the code for the following task:

In [9]:
dfSoldiers.printSchema()
dfSoldiers.show()

root
 |-- id: long (nullable = true)
 |-- soldier_name: string (nullable = true)
 |-- salary: long (nullable = true)

+---+-------------------+------+
| id|       soldier_name|salary|
+---+-------------------+------+
|  1|   Haegon Blackfyre| 18477|
|  2|   Walder Goodbrook| 11371|
|  3|              Quent| 18689|
|  4|        Androw Frey| 13961|
|  5|         Blind Doss| 18662|
|  6|    Victaria Tyrell| 13073|
|  7|Belaquo Bonebreaker| 16006|
|  8|       Mariya Darry| 17818|
|  9|    Alyn Connington| 18486|
| 10|             Lharys| 11102|
+---+-------------------+------+



### Task: Read raw data JSON file 


Requirements: 
* Read source data from S3 bucket: raw_data.json -> path: S3://
* Use the `FAILFAST` read mode, which throws an exception when it meets corrupted records.
* Add custom schema.



Expected schema:
```
root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- type: integer (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- when: double (nullable = true)
```

Task: Create a new custom schema for the raw data file

Please provide the code for the following task:

In [10]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Spark reports these columns as nullable after loading (see the printed schema),
# so nullable is set to True to match the expected schema
rawDataSchema = StructType([
    StructField("distance", DoubleType(), True),
    StructField("eventId", IntegerType(), True),
    StructField("soldierId", IntegerType(), True),
    StructField("type", IntegerType(), True),
    StructField("weaponId", IntegerType(), True),
    StructField("when", DoubleType(), True)
])

Task: Read raw data JSON file 

In [11]:
dfRawData = spark.read \
.format("json") \
.option("mode", "FAILFAST") \
.schema(rawDataSchema) \
.load("raw_data.json") 

Task: Show the existing schema on the current DataFrame. Then print all data.

Please provide the code for the following task:

In [12]:
dfRawData.printSchema()
dfRawData.show()

root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- type: integer (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- when: double (nullable = true)

+------------------+-------+---------+----+--------+-----------------+
|          distance|eventId|soldierId|type|weaponId|             when|
+------------------+-------+---------+----+--------+-----------------+
| 699.2676057033572|      1|        9|   8|       1|1.563061295302E12|
| 235.0068658232094|      2|        4|   4|       2|1.563061295458E12|
|345.48506612195996|      3|        9|   4|       2|1.563061295561E12|
|391.70559819126726|      4|        1|   4|       3|1.563061295661E12|
| 827.2495155710195|      5|        2|   7|       3|1.563061295778E12|
| 78.52588556109919|      6|        8|   1|       4|1.563061295884E12|
| 90.40024953001647|      7|        2|   3|       2|1.563061295985E12|
| 28.29980077487848|      8|        3|   8|     

### Task: Rename multiple columns at once
* `eventId` into event_id
* `soldierId` into soldier_id
* `type` into event_type_id
* `weaponId` into weapon_id
* `when` into epochTimestamp

In [13]:
dfRawDataRenamed = dfRawData \
.withColumnRenamed("eventId", "event_id") \
.withColumnRenamed("soldierId", "soldier_id") \
.withColumnRenamed("type", "event_type_id") \
.withColumnRenamed("weaponId", "weapon_id") \
.withColumnRenamed("when", "epochTimestamp")
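
Chaining `withColumnRenamed` works, but the renames can also be driven by a single mapping. The sketch below uses `DataFrame.toDF`, which replaces all column names positionally. `RENAMES`, `renamed_columns`, and `rename_columns` are hypothetical names introduced here for illustration.

```python
RENAMES = {
    "eventId": "event_id",
    "soldierId": "soldier_id",
    "type": "event_type_id",
    "weaponId": "weapon_id",
    "when": "epochTimestamp",
}


def renamed_columns(columns, mapping):
    """Return the column names with the mapping applied; unmapped names are kept."""
    return [mapping.get(name, name) for name in columns]


def rename_columns(df, mapping):
    """Rename several columns of a Spark DataFrame in one call."""
    return df.toDF(*renamed_columns(df.columns, mapping))
```

Calling `rename_columns(dfRawData, RENAMES)` should give the same result as the chained version. Spark 3.4 and later also offer `DataFrame.withColumnsRenamed`, which accepts such a mapping directly.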

Task: Show the existing schema on the current DataFrame. Then print all data.

Please provide the code for the following task:

In [14]:
dfRawDataRenamed.printSchema()
dfRawDataRenamed.show()

root
 |-- distance: double (nullable = true)
 |-- event_id: integer (nullable = true)
 |-- soldier_id: integer (nullable = true)
 |-- event_type_id: integer (nullable = true)
 |-- weapon_id: integer (nullable = true)
 |-- epochTimestamp: double (nullable = true)

+------------------+--------+----------+-------------+---------+-----------------+
|          distance|event_id|soldier_id|event_type_id|weapon_id|   epochTimestamp|
+------------------+--------+----------+-------------+---------+-----------------+
| 699.2676057033572|       1|         9|            8|        1|1.563061295302E12|
| 235.0068658232094|       2|         4|            4|        2|1.563061295458E12|
|345.48506612195996|       3|         9|            4|        2|1.563061295561E12|
|391.70559819126726|       4|         1|            4|        3|1.563061295661E12|
| 827.2495155710195|       5|         2|            7|        3|1.563061295778E12|
| 78.52588556109919|       6|         8|            1|        4|1.563061

### Task: Add a new `timestamp` column to dfRawData via UDF


In [15]:
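import pyspark.sql.functions as F
# epochTimestamp holds milliseconds; divide by 1000 to get the seconds that to_timestamp expects
dfRawData = dfRawDataRenamed.withColumn("timestamp", F.to_timestamp(dfRawDataRenamed["epochTimestamp"] / 1000))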
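
The cell above relies on the built-in `F.to_timestamp` rather than a user-defined function. If the task is read as strictly requiring a UDF, one possible sketch follows. The names `epoch_millis_to_datetime` and `add_timestamp_via_udf` are hypothetical, and the `epochTimestamp` column is assumed to hold epoch milliseconds, as in the data shown.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_datetime(epoch_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime (None stays None)."""
    if epoch_ms is None:
        return None
    return EPOCH + timedelta(milliseconds=epoch_ms)


def add_timestamp_via_udf(df, source_col="epochTimestamp", target_col="timestamp"):
    """Add `target_col` to a Spark DataFrame by wrapping the function above in a UDF."""
    # Imported here so the conversion function can be used and tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import TimestampType

    to_timestamp_udf = F.udf(epoch_millis_to_datetime, TimestampType())
    return df.withColumn(target_col, to_timestamp_udf(F.col(source_col)))
```

Python UDFs are generally slower than built-in functions, so the built-in version is usually preferable outside an exercise. Note that `show()` renders timestamps in the session time zone: the first row is 23:41:35.302 on 2019-07-13 in UTC, which the notebook output displays as 02:41:35.302 on 2019-07-14.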

Task: Print all the data with `truncate` disabled, so that column values are shown in full

Please provide the code for the following task:

In [16]:
dfRawData.show(truncate=False)

+------------------+--------+----------+-------------+---------+-----------------+-----------------------+
|distance          |event_id|soldier_id|event_type_id|weapon_id|epochTimestamp   |timestamp              |
+------------------+--------+----------+-------------+---------+-----------------+-----------------------+
|699.2676057033572 |1       |9         |8            |1        |1.563061295302E12|2019-07-14 02:41:35.302|
|235.0068658232094 |2       |4         |4            |2        |1.563061295458E12|2019-07-14 02:41:35.458|
|345.48506612195996|3       |9         |4            |2        |1.563061295561E12|2019-07-14 02:41:35.561|
|391.70559819126726|4       |1         |4            |3        |1.563061295661E12|2019-07-14 02:41:35.661|
|827.2495155710195 |5       |2         |7            |3        |1.563061295778E12|2019-07-14 02:41:35.778|
|78.52588556109919 |6       |8         |1            |4        |1.563061295884E12|2019-07-14 02:41:35.884|
|90.40024953001647 |7       |2       

### Task: Register the created DataFrames as temporary views
Please provide the code for the following task:

In [17]:
dfWeaponTypes.createOrReplaceTempView("weaponTypes")
dfEventTypes.createOrReplaceTempView("eventTypes")
dfSoldiers.createOrReplaceTempView("soldiers")
dfRawData.createOrReplaceTempView("rawData")

### Task: Create new rows based on existing row data

Create the rows based on the event types:

```
+---+--------------+
| id|    event_type|
+---+--------------+
|  1|          kill|
|  2|         wound|
|  3|           hit|
|  4|          shot|
|  5|       misfire|
|  6|   close range|
|  7|avgerage range|
|  8|    long range|
+---+--------------+
```

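Task: Create a `findEventType` function. The function should filter the rows by event type, replace the `event_type_id` value, and return the resulting DataFrame.

It takes 3 parameters: df - the raw data DataFrame, eventTypes - an array of event type ids to filter on, selected_eventType - the event type id to assign.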

In [18]:
def findEventType(df, eventTypes, selected_eventType):
    return df.filter((df["event_type_id"]).isin(eventTypes)).withColumn("event_type_id",F.lit(selected_eventType))

Task: Create new rows based on existing row data using the function above

In [19]:

dfEventType3 = findEventType(dfRawData, [1,2], 3)
dfEventType4 = findEventType(dfRawData, [1,2,3,6,7,8], 4)


dfRawAddedData = dfRawData \
    .union(dfEventType3) \
    .union(dfEventType4) 
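
To see what `findEventType` plus `union` produces, here is a plain-Python sketch of the same filter, overwrite, and append steps on a list of dicts. The sample uses only the `event_type_id` values of the first eight rows visible in the raw data output; `find_event_type` and `sample_rows` are hypothetical names, and the real DataFrame may hold more rows.

```python
def find_event_type(rows, event_types, selected_event_type):
    """Keep rows whose event_type_id is in event_types, then overwrite that field."""
    return [
        {**row, "event_type_id": selected_event_type}
        for row in rows
        if row["event_type_id"] in event_types
    ]


# event_type_id values of the first eight raw rows, keyed by event_id 1..8
sample_rows = [
    {"event_id": i, "event_type_id": t}
    for i, t in enumerate([8, 4, 4, 4, 7, 1, 3, 8], start=1)
]

rows_type3 = find_event_type(sample_rows, [1, 2], 3)
rows_type4 = find_event_type(sample_rows, [1, 2, 3, 6, 7, 8], 4)

# List concatenation plays the role of DataFrame.union: rows are appended, not deduplicated.
all_rows = sample_rows + rows_type3 + rows_type4
print(len(rows_type3), len(rows_type4), len(all_rows))  # 1 5 14
```

Like this sketch, `DataFrame.union` keeps duplicates and matches columns by position, so every input must share the same column order.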

### Task: Join all file types
* Create the joins and drop the 'id' column after each join
* Join the raw data (including the generated rows) with dfSoldiers, dfWeaponTypes and dfEventTypes
* Use the raw data as the left DataFrame and put the lookup DataFrame on the right side of each join expression


In [20]:
# Reference the join keys through dfRawAddedData, the DataFrame actually being joined
dfRawDataJoined = dfRawAddedData \
.join(dfSoldiers, dfRawAddedData["soldier_id"] == dfSoldiers["id"]) \
.drop("id") \
.join(dfWeaponTypes, dfRawAddedData["weapon_id"] == dfWeaponTypes["id"]) \
.drop("id") \
.join(dfEventTypes, dfRawAddedData["event_type_id"] == dfEventTypes["id"]) \
.drop("id") 

Task: Print total count

Then print selected columns: 
* distance
* soldier_name
* event_id
* event_type_id
* event_type
* weapon_id
* weapon_name

In [None]:
print(dfRawDataJoined.count())
dfRawDataJoined.select(
    "distance",
    "soldier_name",
    "event_id",
    "event_type_id",
    "event_type",
    "weapon_id",
    "weapon_name"
).show(n=1000, truncate=False)

### Task: Write JSON files 

* Create a `single JSON file` from multiple partitions in Amazon S3
* Overwrite files 
* S3 path: S3 


In [None]:
pathtarget = 'enriched_data.json'

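# coalesce(1) collapses the data into one partition, so Spark writes a single part file;
# .json() already selects the JSON format
dfRawDataJoined \
.coalesce(1) \
.write \
.mode('overwrite') \
.json(pathtarget)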
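
`coalesce(1)` makes Spark write a single part file, but the target path is still a directory that holds that part file next to marker files such as `_SUCCESS`. Below is a sketch for locating the one data file afterwards, assuming the output was written to a local filesystem rather than S3. `find_single_part_file` is a hypothetical helper.

```python
from pathlib import Path


def find_single_part_file(output_dir, suffix=".json"):
    """Return the only part-* data file inside a Spark output directory."""
    part_files = sorted(Path(output_dir).glob(f"part-*{suffix}"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    return part_files[0]
```

For example, `find_single_part_file("enriched_data.json")` would return the path of the part file inside the output directory written above.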