Tasks:
* Create a custom schema for JSON files
* Read files
* Add a new column via UDF - timestamp
* Add a new column - soldier's high salary
* Rename a column
* Append rows (concatenation)
* Join all file types
* Write to JSON and Parquet
* Filter
* Sort
* Generate new rows
* Get unique rows
* Aggregations
* Grouping

### Import the PySpark package

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

### Initialize SparkSession

In [2]:
spark = SparkSession.builder\
	.appName("Lesson 2 Spark Exercise 01")\
	.getOrCreate()

### Read the event types CSV file

* Read from S3 buckets
* Use read modes
* Set the delimiter option to '|'

Files:
* event_types.csv -> path: S3://

In [3]:
dfEventTypes = spark.read \
.format("csv") \
.option("header","true") \
.option("mode", "FAILFAST") \
.option("delimiter", "|") \
.load("event_types.csv") 
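
The `mode` option controls what happens when a record does not match the expected layout: `PERMISSIVE` (Spark's default) keeps the row and nulls what it cannot fill, `DROPMALFORMED` skips the row, and `FAILFAST` raises an error. The sketch below is a plain-Python emulation of that idea for a pipe-delimited file, not Spark's implementation. The function name `read_pipe_delimited`, the sample text, and the simplified rule "wrong number of fields means malformed" are assumptions made for illustration.

```python
import csv
import io


def read_pipe_delimited(text, mode="PERMISSIVE"):
    """Parse pipe-delimited text with a header row, emulating Spark's read modes."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    header = next(reader)
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            if mode == "FAILFAST":
                # Stop at the first malformed record
                raise ValueError(f"malformed record on line {line_no}: {fields}")
            if mode == "DROPMALFORMED":
                # Silently skip the malformed record
                continue
            # PERMISSIVE: pad missing fields with None and drop extra ones
            fields = (fields + [None] * len(header))[:len(header)]
        rows.append(dict(zip(header, fields)))
    return rows


# Hypothetical sample: the second data line is missing its "event type" field
sample = "id|event type\n1|kill\n2\n3|hit\n"

permissive = read_pipe_delimited(sample, "PERMISSIVE")
dropped = read_pipe_delimited(sample, "DROPMALFORMED")
print(permissive[1])   # {'id': '2', 'event type': None}
print(len(dropped))    # 2
```

With `FAILFAST`, as used in this notebook, the same sample would raise on line 3 instead of returning rows, which is useful when bad input should stop the job rather than pass through unnoticed.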

In [4]:
# Task: Show the existing schema of the current DataFrame
# Please provide the code for the following task:
    
dfEventTypes.printSchema()

root
 |-- id: string (nullable = true)
 |-- event type: string (nullable = true)



In [5]:
# Task: Print all the data
# Please provide the code for the following task:

dfEventTypes.show()

+---+--------------+
| id|    event type|
+---+--------------+
|  1|          kill|
|  2|         wound|
|  3|           hit|
|  4|          shot|
|  5|       misfire|
|  6|   close range|
|  7|avgerage range|
|  8|    long range|
+---+--------------+



### Read the weapon types CSV file

* Read from S3 buckets
* Use read modes
* Add a custom schema: `in range` should be an int value (the read below does not apply one, so `in range` is loaded as a string)

Files:
* weapon_types.csv -> path: S3://

In [6]:
dfWeaponTypes = spark.read \
.format("csv") \
.option("header","true") \
.option("mode", "FAILFAST") \
.load("weapon_types.csv") 

In [7]:
# Task: Show the existing schema of the current DataFrame
# Please provide the code for the following task:
    
dfWeaponTypes.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- in range: string (nullable = true)



In [8]:
# Task: Print all the data
# Please provide the code for the following task:

dfWeaponTypes.show()

+---+--------------+--------+
| id|          name|in range|
+---+--------------+--------+
|  1|          m 16|    2000|
|  2|           uzi|     200|
|  3|           akm|    2200|
|  4|      revolver|     100|
|  5|Smith & Wesson|     150|
+---+--------------+--------+



### Rename the column 'in range' to 'range'

In [9]:
dfWeaponTypes = dfWeaponTypes.withColumnRenamed("in range", "range")

### Read the soldiers JSON file

* Read from S3 buckets
* Use read modes
* Use inferSchema

Files:
* soldiers.json -> path: S3://

In [10]:
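# JSON reads infer the schema by default; `inferSchema` is documented as a CSV option,
# so it most likely has no effect here and is kept only to match the task description.
dfSoldiers = spark.read \
.format("json") \
.option("mode", "FAILFAST") \
.option("inferSchema", "true") \
.load("soldiers.json")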

In [11]:
# Task: Show the existing schema of the current DataFrame
# Please provide the code for the following task:
dfSoldiers.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)



In [12]:
# Task: Print all the data
# Please provide the code for the following task:

dfSoldiers.show()

+---+-------------------+------+
| id|               name|salary|
+---+-------------------+------+
|  1|   Haegon Blackfyre| 18477|
|  2|   Walder Goodbrook| 11371|
|  3|              Quent| 18689|
|  4|        Androw Frey| 13961|
|  5|         Blind Doss| 18662|
|  6|    Victaria Tyrell| 13073|
|  7|Belaquo Bonebreaker| 16006|
|  8|       Mariya Darry| 17818|
|  9|    Alyn Connington| 18486|
| 10|             Lharys| 11102|
+---+-------------------+------+



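### Read the raw data JSON file

* Read from S3 buckets
* Use read modes
* Add a custom schema: `when` should be a numeric value (it holds epoch milliseconds, which do not fit a 32-bit int, so the schema below reads it as a double)

Files:
* raw_data.json -> path: S3://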

In [13]:
# Task: Create a new custom schema for the raw data DataFrame
# Please provide the code for the following task:

In [14]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

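# Every field is declared with nullable=False, but printSchema() below still reports
# nullable = true: Spark appears to treat columns read from file sources such as JSON as nullable.
rawDataSchema = StructType([
    StructField("distance", DoubleType(), False),
    StructField("eventId", IntegerType(), False),
    StructField("soldierId", IntegerType(), False),
    StructField("type", IntegerType(), False),
    StructField("weaponId", IntegerType(), False),
    StructField("when", DoubleType(), False)
])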
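
The task description asks for `when` to be an int, while the schema uses `DoubleType`. A quick arithmetic check, sketched in plain Python, shows why a 32-bit `IntegerType` could not hold these values, and that a 64-bit long or a double can represent them exactly. The value `1563061295302` is the first `when` value shown in the output further down (`1.563061295302E12`); the constant names are only illustrative.

```python
INT32_MAX = 2**31 - 1      # largest value of a 32-bit signed integer (Spark IntegerType)
INT64_MAX = 2**63 - 1      # largest value of a 64-bit signed integer (Spark LongType)
when_ms = 1563061295302    # epoch milliseconds, taken from the first raw data row

fits_int = when_ms <= INT32_MAX
fits_long = when_ms <= INT64_MAX
# Doubles represent every integer below 2**53 exactly
exact_as_double = when_ms < 2**53 and float(when_ms) == when_ms

print(fits_int, fits_long, exact_as_double)   # False True True
```

Under these assumptions, `LongType` would be the integer type to choose if `when` should stay integral; the double used here loses nothing for values of this size.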

In [15]:
dfRawData = spark.read \
.format("json") \
.option("mode", "FAILFAST") \
.schema(rawDataSchema) \
.load("raw_data.json") 

In [16]:
# Task: Show the existing schema of the current DataFrame
# Please provide the code for the following task:
    
dfRawData.printSchema()

root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- type: integer (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- when: double (nullable = true)



In [17]:
# Task: Print all the data
# Please provide the code for the following task:

dfRawData.show()

+------------------+-------+---------+----+--------+-----------------+
|          distance|eventId|soldierId|type|weaponId|             when|
+------------------+-------+---------+----+--------+-----------------+
| 699.2676057033572|      1|        9|   8|      10|1.563061295302E12|
| 235.0068658232094|      2|        4|   4|       2|1.563061295458E12|
|345.48506612195996|      3|        9|   4|       2|1.563061295561E12|
|391.70559819126726|      4|        1|   4|       3|1.563061295661E12|
| 827.2495155710195|      5|        2|   7|       3|1.563061295778E12|
| 78.52588556109919|      6|        8|   1|       7|1.563061295884E12|
| 90.40024953001647|      7|        2|   3|       9|1.563061295985E12|
| 28.29980077487848|      8|        3|   8|       9|1.563061296089E12|
| 690.8141219504818|      9|        6|   5|       7|1.563061296191E12|
| 715.5096147977694|     10|        2|   4|       4|1.563061296303E12|
|235.18340224223178|     11|        5|   7|       7|1.563061296419E12|
| 50.8

### Rename multiple columns at once

* `type` to `weaponType`
* `when` to `epochTimestamp`

In [18]:
dfRawDataRenamed = dfRawData \
.withColumnRenamed("type", "weaponType") \
.withColumnRenamed("when", "epochTimestamp")

In [19]:
dfRawDataRenamed.printSchema()

root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- weaponType: integer (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- epochTimestamp: double (nullable = true)



In [20]:
dfRawDataRenamed.show()

+------------------+-------+---------+----------+--------+-----------------+
|          distance|eventId|soldierId|weaponType|weaponId|   epochTimestamp|
+------------------+-------+---------+----------+--------+-----------------+
| 699.2676057033572|      1|        9|         8|      10|1.563061295302E12|
| 235.0068658232094|      2|        4|         4|       2|1.563061295458E12|
|345.48506612195996|      3|        9|         4|       2|1.563061295561E12|
|391.70559819126726|      4|        1|         4|       3|1.563061295661E12|
| 827.2495155710195|      5|        2|         7|       3|1.563061295778E12|
| 78.52588556109919|      6|        8|         1|       7|1.563061295884E12|
| 90.40024953001647|      7|        2|         3|       9|1.563061295985E12|
| 28.29980077487848|      8|        3|         8|       9|1.563061296089E12|
| 690.8141219504818|      9|        6|         5|       7|1.563061296191E12|
| 715.5096147977694|     10|        2|         4|       4|1.563061296303E12|

### Add new column via UDF - timestamp 


In [21]:
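# epochTimestamp is in milliseconds; dividing by 1000 gives seconds since the epoch.
# This uses the built-in F.to_timestamp (F is imported in the first cell), not a UDF.
dfRawData = dfRawDataRenamed.withColumn(
    "timestamp", F.to_timestamp(dfRawDataRenamed["epochTimestamp"] / 1000))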

In [22]:
# Task: Print all the data without truncating column values (`truncate=False`)
# Please provide the code for the following task:
dfRawData.show(truncate=False)

+------------------+-------+---------+----------+--------+-----------------+-----------------------+
|distance          |eventId|soldierId|weaponType|weaponId|epochTimestamp   |timestamp              |
+------------------+-------+---------+----------+--------+-----------------+-----------------------+
|699.2676057033572 |1      |9        |8         |10      |1.563061295302E12|2019-07-14 02:41:35.302|
|235.0068658232094 |2      |4        |4         |2       |1.563061295458E12|2019-07-14 02:41:35.458|
|345.48506612195996|3      |9        |4         |2       |1.563061295561E12|2019-07-14 02:41:35.561|
|391.70559819126726|4      |1        |4         |3       |1.563061295661E12|2019-07-14 02:41:35.661|
|827.2495155710195 |5      |2        |7         |3       |1.563061295778E12|2019-07-14 02:41:35.778|
|78.52588556109919 |6      |8        |1         |7       |1.563061295884E12|2019-07-14 02:41:35.884|
|90.40024953001647 |7      |2        |3         |9       |1.563061295985E12|2019-07-14 02:4
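
The conversion above can be checked outside Spark. The sketch below uses only the standard library to turn the first `epochTimestamp` value (1563061295302 ms) into a datetime. The helper name `epoch_millis_to_datetime` is hypothetical, and the UTC+3 offset is an inference from the displayed output rather than something stated in the notebook.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_datetime(millis, tz=timezone.utc):
    """Convert epoch milliseconds to a timezone-aware datetime."""
    # Integer timedelta arithmetic avoids floating-point rounding of the milliseconds
    return (EPOCH + timedelta(milliseconds=millis)).astimezone(tz)


utc_value = epoch_millis_to_datetime(1563061295302)
plus3_value = epoch_millis_to_datetime(1563061295302, timezone(timedelta(hours=3)))

print(utc_value.isoformat())    # 2019-07-13T23:41:35.302000+00:00
print(plus3_value.isoformat())  # 2019-07-14T02:41:35.302000+03:00
```

The UTC+3 result matches the `2019-07-14 02:41:35.302` shown in the first row, which suggests the Spark session that produced the output rendered timestamps in a UTC+3 time zone.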

### Task: Register the DataFrames created above as temporary views
Please provide the code for the following task:

In [23]:
dfWeaponTypes.createOrReplaceTempView("weaponTypes")
dfEventTypes.createOrReplaceTempView("eventTypes")
dfSoldiers.createOrReplaceTempView("soldiers")
dfRawData.createOrReplaceTempView("rawData")

### Join all file types

In [24]:
# Create inner joins and drop the 'id' columns after the join.
# Inner join is the default join type, so we only need the left DataFrame
# plus the right DataFrame and the join expression passed to join().

dfRawData1 = (
    dfRawData
    .join(dfEventTypes, dfRawData["eventId"] == dfEventTypes["id"])
    .join(dfWeaponTypes, dfRawData["weaponId"] == dfWeaponTypes["id"])
    .join(dfSoldiers, dfRawData["soldierId"] == dfSoldiers["id"])
    .drop("id")
    .select(
        F.col("distance"),
        F.col("eventId"),
        F.col("event type").alias("event_type"),
        F.col("soldierId"),
        # dfWeaponTypes and dfSoldiers both have a `name` column, so qualify it
        dfSoldiers["name"].alias("soldier_name"),
        F.col("weaponId"),
        F.col("weaponType"),
        F.col("timestamp"),
    )
)

dfRawData1.printSchema()
dfRawData1.show(n=200, truncate=False)

root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- event_type: string (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- soldier_name: string (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- weaponType: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)

+------------------+-------+----------+---------+----------------+--------+----------+-----------------------+
|distance          |eventId|event_type|soldierId|soldier_name    |weaponId|weaponType|timestamp              |
+------------------+-------+----------+---------+----------------+--------+----------+-----------------------+
|235.0068658232094 |2      |wound     |4        |Androw Frey     |2       |4         |2019-07-14 02:41:35.458|
|345.48506612195996|3      |hit       |9        |Alyn Connington |2       |4         |2019-07-14 02:41:35.561|
|391.70559819126726|4      |shot      |1        |Haegon Blackfyre|3       |4         |2019-07-14 02:41:
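
An inner join keeps only the rows whose key finds a match on both sides. The plain-Python sketch below illustrates this with the weapon IDs from the weapon types table (1 to 5) and the first six `eventId`/`weaponId` pairs from the raw data output above. It is a stand-in for the join semantics, not Spark code, and the variable names are illustrative.

```python
# Weapon types keyed by id, copied from the weapon types output
weapon_types = {1: "m 16", 2: "uzi", 3: "akm", 4: "revolver", 5: "Smith & Wesson"}

# First six raw data rows, reduced to the columns relevant for the join
raw_rows = [
    {"eventId": 1, "weaponId": 10},
    {"eventId": 2, "weaponId": 2},
    {"eventId": 3, "weaponId": 2},
    {"eventId": 4, "weaponId": 3},
    {"eventId": 5, "weaponId": 3},
    {"eventId": 6, "weaponId": 7},
]

# Inner join: keep a row only when its weaponId exists in weapon_types
joined = [
    {**row, "weapon_name": weapon_types[row["weaponId"]]}
    for row in raw_rows
    if row["weaponId"] in weapon_types
]

print([row["eventId"] for row in joined])   # [2, 3, 4, 5]
```

Events 1 and 6 reference weapon IDs 10 and 7, which have no entry in the weapon types table, so they drop out. This is consistent with `eventId` 1 being absent from the joined output above.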