Tasks:
* Create a custom schema for JSON files
* Read files
* Add new column via UDF - timestamp
* Add new column - Soldier's high salary
* Rename column
* Append rows (concatenation)
* Join all file types 
* Write to JSON 
* Filtering 
* Sorting 
* Generate new rows 
* Aggregations
* Grouping 

### Import PySpark packages

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.functions import udf

### Initialize SparkSession

In [None]:
spark = SparkSession.builder\
	.appName("Lesson 2 Spark Exercise 01")\
	.getOrCreate()

### Task 01: Read Event types CSV file 

Requirements: 
* Read source data from S3 bucket: event_types.csv -> path: s3a://wix-pyspark-labs/data/war-data/event_types.csv
* Use a read mode that throws an exception when it encounters corrupted records.
* Set the delimiter option to '|'.
* Rename the 'event type' column to 'event_type'.

Expected table:
```
+---+--------------+
| id|    event_type|
+---+--------------+
|  1|          kill|
|...|          ... |
+---+--------------+
```

Show the existing schema on the current DataFrame. Then print all the data. 

Please provide the code for the following task:
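
One possible approach is sketched below. The helper name `read_event_types` and the `EVENT_TYPES_OPTIONS` dictionary are hypothetical, and `header` is an assumption: the task renames an 'event type' column, which suggests the file has a header row. `FAILFAST` is the Spark read mode that raises an exception on a malformed record.

```python
# Reader options for Task 01.
EVENT_TYPES_OPTIONS = {
    "header": "true",    # assumed: the first line holds the column names
    "sep": "|",          # pipe-delimited fields
    "mode": "FAILFAST",  # raise an exception on a corrupted record
}

EVENT_TYPES_PATH = "s3a://wix-pyspark-labs/data/war-data/event_types.csv"


def read_event_types(spark, path=EVENT_TYPES_PATH):
    """Read the event types CSV and rename 'event type' to 'event_type'."""
    return (
        spark.read
        .options(**EVENT_TYPES_OPTIONS)
        .csv(path)
        .withColumnRenamed("event type", "event_type")
    )
```

With the session created earlier, `dfEventTypes = read_event_types(spark)` followed by `dfEventTypes.printSchema()` and `dfEventTypes.show()` would cover the two display steps.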

### Task 02: Read Weapon types CSV file 

Requirements: 
* Read source data from S3 bucket: weapon_types.csv -> path: `s3a://wix-pyspark-labs/data/war-data/weapon_types.csv`
* Use a read mode that throws an exception when it encounters corrupted records.
* Add a custom schema:
    * 'in range' should be of type int
* Rename the 'name' and 'in range' columns to 'weapon_name' and 'weapon_range'.

Expected table:
```
+---+--------------+------------+
| id|   weapon_name|weapon_range|
+---+--------------+------------+
|  1|          m 16|        2000|
|  2|           ...|         ...|
+---+--------------+------------+
```

Create a new custom schema `weaponTypesSchema` for the current DataFrame.

Please provide the code for the following task:

Read Weapon types CSV file 

Show the existing schema on the current DataFrame. Then print all the data.

Please provide the code for the following task:
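
A minimal sketch of the schema and the read for this task follows. Only the int type of 'in range' comes from the requirements. The types of 'id' and 'name', the header row, and the default comma delimiter are assumptions, and the helper names are hypothetical.

```python
# (name, type) pairs for the source columns.
WEAPON_TYPES_FIELDS = [
    ("id", "int"),        # assumed type
    ("name", "string"),   # assumed type
    ("in range", "int"),  # required by the task
]


def build_weapon_types_schema():
    """Build the StructType for weapon_types.csv from WEAPON_TYPES_FIELDS."""
    # Imported here so the field list can be inspected without a Spark install.
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark_types = {"int": IntegerType(), "string": StringType()}
    return StructType(
        [StructField(name, spark_types[kind], True) for name, kind in WEAPON_TYPES_FIELDS]
    )


def read_weapon_types(spark, path):
    """Read weapon_types.csv with the custom schema and rename two columns."""
    weaponTypesSchema = build_weapon_types_schema()
    return (
        spark.read
        .option("header", "true")    # assumed: the file has a header row
        .option("mode", "FAILFAST")  # raise an exception on a corrupted record
        .schema(weaponTypesSchema)
        .csv(path)
        .withColumnRenamed("name", "weapon_name")
        .withColumnRenamed("in range", "weapon_range")
    )
```

If the file turns out to be pipe-delimited like `event_types.csv`, an extra `.option("sep", "|")` would be needed.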

### Task 03: Read Soldiers JSON file 


Requirements: 
* Read source data from S3 bucket: soldiers.json -> path: S3://
* Use a read mode that throws an exception when it encounters corrupted records.
* Use inferSchema.
* Rename 'name' column to 'soldier_name'.

Expected table:
```
+---+-------------------+------+
| id|       soldier_name|salary|
+---+-------------------+------+
|  1|   Haegon Blackfyre| 18477|
+---+-------------------+------+
```

Show the existing schema on the current DataFrame. Then print all the data.

Please provide the code for the following task:
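
The sketch below shows one way to do the read. The function name `read_soldiers` is hypothetical, and `path` stays a parameter because the S3 path above is not given in full. For JSON sources Spark infers the schema whenever none is supplied, so no explicit option is set for that requirement.

```python
def read_soldiers(spark, path):
    """Read soldiers.json in FAILFAST mode and rename 'name' to 'soldier_name'."""
    return (
        spark.read
        .option("mode", "FAILFAST")  # raise an exception on a corrupted record
        .json(path)                  # no schema supplied, so Spark infers one
        .withColumnRenamed("name", "soldier_name")
    )
```

Calling `printSchema()` and `show()` on the returned DataFrame covers the display step.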

### Task 04: Read raw data JSON file 


Requirements: 
* Read source data from S3 bucket: raw_data.json -> path: S3://
* Use a read mode that throws an exception when it encounters corrupted records.
* Add custom schema.



Expected schema:
```
root
 |-- distance: double (nullable = true)
 |-- eventId: integer (nullable = true)
 |-- soldierId: integer (nullable = true)
 |-- type: integer (nullable = true)
 |-- weaponId: integer (nullable = true)
 |-- when: double (nullable = true)
```

Task: Create a new custom schema on the current DataFrame

Please provide the code for the following task:
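
The expected schema above can be written out as a `StructType`. This is a sketch, with `RAW_DATA_FIELDS` and `build_raw_data_schema` as hypothetical names. The field names, types, and nullability are copied from the expected schema.

```python
# (name, type) pairs in the order shown in the expected schema.
RAW_DATA_FIELDS = [
    ("distance", "double"),
    ("eventId", "integer"),
    ("soldierId", "integer"),
    ("type", "integer"),
    ("weaponId", "integer"),
    ("when", "double"),
]


def build_raw_data_schema():
    """Build the StructType for raw_data.json; every field is nullable."""
    # Imported here so the field list can be inspected without a Spark install.
    from pyspark.sql.types import DoubleType, IntegerType, StructField, StructType

    spark_types = {"double": DoubleType(), "integer": IntegerType()}
    return StructType(
        [StructField(name, spark_types[kind], True) for name, kind in RAW_DATA_FIELDS]
    )
```

The schema would then be passed to the reader, for example `spark.read.schema(build_raw_data_schema()).option("mode", "FAILFAST").json(path)`, where `path` is the raw data location.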

Task: Read raw data JSON file 

Task: Show the existing schema on the current DataFrame. Then print all data.

Please provide the code for the following task:

### Task 05: Rename multiple columns at once
* `eventId` into event_id
* `soldierId` into soldier_id
* `type` into event_type_id
* `weaponId` into weapon_id
* `when` into epochTimestamp

Task: Show the existing schema on the current DataFrame. Then print all data.

Please provide the code for the following task:
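
One way to rename several columns in a single step is to fold a mapping over `withColumnRenamed`. The names `RAW_DATA_RENAMES` and `rename_columns` are hypothetical. The mapping itself is taken from the list above.

```python
from functools import reduce

# Old name -> new name, as listed in the task.
RAW_DATA_RENAMES = {
    "eventId": "event_id",
    "soldierId": "soldier_id",
    "type": "event_type_id",
    "weaponId": "weapon_id",
    "when": "epochTimestamp",
}


def rename_columns(df, renames):
    """Return a DataFrame with every key in `renames` renamed to its value."""
    return reduce(
        lambda acc, pair: acc.withColumnRenamed(pair[0], pair[1]),
        renames.items(),
        df,
    )
```

Usage would look like `dfRawData = rename_columns(dfRawData, RAW_DATA_RENAMES)`. Columns that are not in the mapping, such as `distance`, are left unchanged.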

### Task 06: Add a new `timestamp` column to dfRawData via a UDF


Task: Print all the data using the `truncate` option

Please provide the code for the following task:
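
The sketch below wraps a plain Python function in a UDF. It assumes that `epochTimestamp` holds seconds since the Unix epoch and that a UTC string in `YYYY-MM-DD HH:MM:SS` form is wanted. Neither is stated in the task, and both helper names are hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_timestamp(epoch_seconds):
    """Format epoch seconds as a UTC 'YYYY-MM-DD HH:MM:SS' string (None stays None)."""
    if epoch_seconds is None:
        return None
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def add_timestamp_column(df):
    """Add a string `timestamp` column derived from `epochTimestamp`."""
    # Imported here so epoch_to_timestamp can be tested without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    to_timestamp_udf = F.udf(epoch_to_timestamp, StringType())
    return df.withColumn("timestamp", to_timestamp_udf(F.col("epochTimestamp")))
```

After `dfRawData = add_timestamp_column(dfRawData)`, calling `dfRawData.show(truncate=False)` prints the rows without cutting off long values.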

### Task 07: Register the created DataFrame as a temporary view 'rawTable'
Please provide the code for the following task:

### Task 08: Create new rows based on existing row data

Create the rows based on the event types:

```
+---+--------------+
| id|    event_type|
+---+--------------+
|  1|          kill|
|  2|         wound|
|  3|           hit|
|  4|          shot|
|  5|       misfire|
|  6|   close range|
|  7|avgerage range|
|  8|    long range|
+---+--------------+
```

Task: Create a `findEventType` function. The function should apply a filter, change the value, and then return the DataFrame.

It takes 3 parameters: df - the raw data DataFrame, eventTypes - an array of events, selected_eventType.

Task: Create new rows based on existing row data using the created function

### Task 09: Join all file types
* Join the DataFrames and drop the 'ID' columns after the join
* Join dfRawData with dfSoldiers, dfWeaponTypes and dfEventTypes
* Use dfRawData as the left DataFrame and the other DataFrames as the right side in the join expressions


Print total count

Then print selected columns: 
* distance
* soldier_name
* event_id
* event_type_id
* event_type
* weapon_id
* weapon_name
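
A sketch of the three joins follows. It assumes the renamed columns from Task 05 (`soldier_id`, `weapon_id`, `event_type_id`) and an `id` column in each lookup DataFrame. The task does not name a join type, so the sketch uses Spark's default inner join. `join_all` and `SELECTED_COLUMNS` are hypothetical names.

```python
# Columns to print after the join, as listed in the task.
SELECTED_COLUMNS = [
    "distance",
    "soldier_name",
    "event_id",
    "event_type_id",
    "event_type",
    "weapon_id",
    "weapon_name",
]


def join_all(df_raw, df_soldiers, df_weapons, df_events):
    """Join the raw data (left side) with the three lookup DataFrames."""
    return (
        df_raw
        # Each lookup 'id' column is dropped right after its join.
        .join(df_soldiers, df_raw["soldier_id"] == df_soldiers["id"])
        .drop(df_soldiers["id"])
        .join(df_weapons, df_raw["weapon_id"] == df_weapons["id"])
        .drop(df_weapons["id"])
        .join(df_events, df_raw["event_type_id"] == df_events["id"])
        .drop(df_events["id"])
    )
```

With `dfJoined = join_all(dfRawData, dfSoldiers, dfWeaponTypes, dfEventTypes)`, the total is `dfJoined.count()` and the selection is `dfJoined.select(*SELECTED_COLUMNS).show()`.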

### Task 10: Write JSON files 

* Create a `single JSON file` from multiple partitions in Amazon S3
* Overwrite files 
* S3 path: S3 
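
One common way to get a single output file is to collapse the DataFrame to one partition before writing. The sketch below uses a hypothetical helper `write_single_json`. `path` stays a parameter because the S3 path is not given in full.

```python
def write_single_json(df, path):
    """Write `df` to `path` as one JSON part file, replacing existing output."""
    (
        df.coalesce(1)      # collapse to a single partition, hence one part file
        .write
        .mode("overwrite")  # replace whatever is already at `path`
        .json(path)
    )
```

Spark still writes a directory at `path`. With one partition it contains a single `part-*.json` file next to the `_SUCCESS` marker.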


### Task 11: Group Soldiers and sort by rating


* Add a UDF function that calculates the rating of shooting skills.
* The rating is calculated from the event type:
    * 1 for event_type = 6 (close range)
    * 2 for event_type = 7 (avgerage range)
    * 3 for event_type = 8 (long range)
* Add a new column `rating`
* Group by soldier id and count the rating values
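
The rating rules above can be sketched as a lookup wrapped in a UDF. Returning `None` for event types other than 6, 7 and 8, the output column name `rating_count`, and the descending sort are assumptions. The helper names are hypothetical.

```python
# Event type id -> rating, as listed in the task.
RATING_BY_EVENT_TYPE = {6: 1, 7: 2, 8: 3}


def shooting_rating(event_type_id):
    """Return the rating for a range event type, or None for any other type."""
    return RATING_BY_EVENT_TYPE.get(event_type_id)


def rate_soldiers(df):
    """Add `rating`, then count the non-null ratings per soldier."""
    # Imported here so shooting_rating can be tested without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    rating_udf = F.udf(shooting_rating, IntegerType())
    return (
        df.withColumn("rating", rating_udf(F.col("event_type_id")))
        .groupBy("soldier_id")
        .agg(F.count("rating").alias("rating_count"))  # count() skips nulls
        .orderBy(F.col("rating_count").desc())         # assumed sort order
    )
```

Because `F.count` ignores nulls, rows with other event types do not contribute to a soldier's count.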



### Task 12: Create validations to check for invalid data


* Add two validations:
    * Weapons validations 
    * Event validations
* Weapons validations:
    * There are only 5 weapons but in raw data we have more than 5. 
    * Add a new column `weapon_validation` where values: `1` - invalid, `2` - valid
* Event validations:
    * Compare distance with event types.
    * Event types are:
        * 6 - close range
        * 7 - avgerage range
        * 8 - long range
    * Add a new column `event_validation` where values:
        * `1` - The shot was recorded as `close range` but fired from a `long distance` (distance more than 100).
        * `2` - The shot was recorded as `avgerage range` but fired from a `long distance` (distance less than 100).
        * `3` - The shot was recorded as `long range` but fired from an `average distance` (distance more than 500).
        * `4` - The shot was recorded as `long range` but fired from a `close distance` (distance less than 500).
* Print columns: distance, event_id, event_type_id, weapon_id, weapon_validation, event_validation. 
* Use the raw data DataFrame without joins.
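
The weapon check can be sketched as below. It covers only `weapon_validation`, and it assumes that the five valid weapons have ids 1 to 5, which the task does not state. The helper names are hypothetical.

```python
# Assumed: the five known weapons carry ids 1..5.
VALID_WEAPON_IDS = frozenset(range(1, 6))


def weapon_validation(weapon_id):
    """Return 2 for a known weapon id and 1 otherwise (including None)."""
    return 2 if weapon_id in VALID_WEAPON_IDS else 1


def add_weapon_validation(df):
    """Add the `weapon_validation` column to the raw data DataFrame."""
    # Imported here so weapon_validation can be tested without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    validation_udf = F.udf(weapon_validation, IntegerType())
    return df.withColumn("weapon_validation", validation_udf(F.col("weapon_id")))
```

The valid ids could instead be collected from dfWeaponTypes and passed in, which would avoid hard-coding the range while still not joining the DataFrames.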