<h1 align='center'> Challenge Presentation </h1>

<div align='center'>
<img src="..\..\aux_files\img\UK_Traffic_Data\1.png" height='20%' width='20%'/>
<img src="..\..\aux_files\img\UK_Traffic_Data\2.png" height='20%' width='20%'/>
<img src="..\..\aux_files\img\UK_Traffic_Data\3.png" height='20%' width='20%'/>
</div>
<div align='center'>
<img src="..\..\aux_files\img\UK_Traffic_Data\4.png" height='20%' width='20%'/>
<img src="..\..\aux_files\img\UK_Traffic_Data\5.png" height='20%' width='20%'/>
<img src="..\..\aux_files\img\UK_Traffic_Data\6.png" height='20%' width='20%'/>
</div>
<div align='center'><img src="..\..\aux_files\img\UK_Traffic_Data\7.png" height='20%' width='20%'/></div>



### Importing necessary libraries

In [72]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, split, weekofyear, month, lpad

### Defining Spark Session

In [73]:
spark = SparkSession.builder.appName("myapp").getOrCreate()

### Consuming data and cleaning column names

In [74]:
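# Read the raw traffic counts from the local Parquet file
df = spark.read.parquet(r'..\..\aux_files\source\UK_Traffic_Data\dft_traffic_counts_raw_counts\dft_traffic_counts_raw_counts.parquet')
# Lowercase every column name so later references are consistent
new_columns = [col(column).alias(column.lower()) for column in df.columns]
df = df.select(*new_columns)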

### First of all, let's check our data

AADF: Annual average daily flows
* Count_point_id – a unique reference for the road link that links the AADFs to the road network.
* Direction_of_travel – Direction of travel.
* Year – Counts are shown for each year from 2000 onwards.
* Count_date – the date when the actual count took place.
* Hour – the hour in which the count took place, where 7 represents 7am to 8am and 17 represents 5pm to 6pm.
* Region_id – Website region identifier.
* Region_name – the name of the region that the count point (CP) sits within.
* Region_ons_code – the Office for National Statistics code identifier for the region.
* Local_authority_id – Website local authority identifier.
* Local_authority_name – the local authority that the CP sits within.
* Local_authority_code – the Office for National Statistics code identifier for the local authority.
* Road_name – this is the road name (for instance M25 or A3).
* Road_category – the classification of the road type (see data definitions for the full list).
* Road_type – Whether the road is a ‘major’ or ‘minor’ road.
* Start_junction_road_name – the road name of the start junction of the link.
* End_junction_road_name – the road name of the end junction of the link.
* Easting – Easting coordinates of the CP location.
* Northing – Northing coordinates of the CP location.
* Latitude – Latitude of the CP location.
* Longitude – Longitude of the CP location.
* Link_length_km – Total length of the network road link for that CP (in kilometres).
* Link_length_miles – Total length of the network road link for that CP (in miles).
* Pedal_cycles – Counts for pedal cycles.
* Two_wheeled_motor_vehicles – Counts for two-wheeled motor vehicles.
* Cars_and_taxis – Counts for cars and taxis.
* Buses_and_coaches – Counts for buses and coaches.
* LGVs – Counts for LGVs.
* HGVs_2_rigid_axle – Counts for two-rigid axle HGVs.
* HGVs_3_rigid_axle – Counts for three-rigid axle HGVs.
* HGVs_4_or_more_rigid_axle – Counts for four or more rigid axle HGVs.
* HGVs_3_or_4_articulated_axle – Counts for three or four-articulated axle HGVs.
* HGVs_5_articulated_axle – Counts for five-articulated axle HGVs.
* HGVs_6_articulated_axle – Counts for six-articulated axle HGVs.
* All_HGVs – Counts for all HGVs.
* All_motor_vehicles – Counts for all motor vehicles.
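
To make the `hour` convention in the list above concrete, here is a minimal sketch in plain Python (no Spark needed). The helper name `hour_to_interval` is hypothetical and is not part of the dataset or of PySpark. It only assumes, as the column description states, that each value marks the start of a one-hour counting window.

```python
def hour_to_interval(hour: int) -> str:
    """Return the one-hour window that an `hour` value stands for.

    Hypothetical helper for illustration: assumes each value marks the
    start of a one-hour window, as in the column description.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")

    def fmt(h: int) -> str:
        # Convert a 24-hour value to a 12-hour label such as "7am" or "5pm"
        suffix = "am" if h % 24 < 12 else "pm"
        return f"{(h % 12) or 12}{suffix}"

    return f"{fmt(hour)}-{fmt(hour + 1)}"


print(hour_to_interval(7))   # 7am-8am
print(hour_to_interval(17))  # 5pm-6pm
```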

In [75]:
df.show()

+--------------+-------------------+----+-------------------+----+---------+-----------+---------------+------------------+--------------------+--------------------+---------+-------------+---------+------------------------+----------------------+-------+--------+------------+------------+--------------+-------------------+------------+--------------------------+--------------+-----------------+----+-----------------+-----------------+-------------------------+----------------------------+-----------------------+-----------------------+--------+------------------+
|count_point_id|direction_of_travel|year|         count_date|hour|region_id|region_name|region_ons_code|local_authority_id|local_authority_name|local_authority_code|road_name|road_category|road_type|start_junction_road_name|end_junction_road_name|easting|northing|    latitude|   longitude|link_length_km|  link_length_miles|pedal_cycles|two_wheeled_motor_vehicles|cars_and_taxis|buses_and_coaches|lgvs|hgvs_2_rigid_axle|hgvs_3_r

### We don't need to distinguish between vehicle types, so we keep only the "all_motor_vehicles" and "pedal_cycles" count columns.
### We also drop every other column that this challenge doesn't need.

In [76]:
df = df.select(["year", "count_date", "hour", "pedal_cycles", "all_motor_vehicles"])

In [77]:
df.show()

+----+-------------------+----+------------+------------------+
|year|         count_date|hour|pedal_cycles|all_motor_vehicles|
+----+-------------------+----+------------+------------------+
|2014|2014-06-25 00:00:00|   7|           0|               935|
|2014|2014-06-25 00:00:00|   8|           0|              1102|
|2014|2014-06-25 00:00:00|   9|           0|               773|
|2014|2014-06-25 00:00:00|  10|           0|               778|
|2014|2014-06-25 00:00:00|  11|           0|               875|
|2014|2014-06-25 00:00:00|  12|           0|               754|
|2014|2014-06-25 00:00:00|  13|           0|               896|
|2014|2014-06-25 00:00:00|  14|           0|               990|
|2014|2014-06-25 00:00:00|  15|           0|               972|
|2014|2014-06-25 00:00:00|  16|           0|              1184|
|2014|2014-06-25 00:00:00|  17|           0|              1271|
|2014|2014-06-25 00:00:00|  18|           0|               958|
|2014|2014-06-25 00:00:00|   7|         

### Let's apply the transformations needed for this challenge

In [78]:
df = (
    df
    # Keep only the date part of the timestamp and parse it as a date
    .withColumn("count_date", to_date(split(col("count_date"), ' ').getItem(0), format='yyyy-MM-dd'))
    .withColumnRenamed("count_date", "date")
    .withColumn("pedal_cycles", col("pedal_cycles").cast('integer'))
    .withColumn("all_motor_vehicles", col("all_motor_vehicles").cast('integer'))
    # Zero-padded month and week of year, stored as two-character strings
    .withColumn("month", lpad(month(col("date")).cast('string'), 2, '0'))
    .withColumn("weekNumber", lpad(weekofyear(col("date")).cast('string'), 2, '0'))
)
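
As a quick sanity check on what the `month` and `weekNumber` columns should contain, the sketch below reproduces the same logic in plain Python for one date from the sample output, 2014-06-25. It assumes that Spark's `weekofyear` follows ISO 8601 week numbering (weeks start on Monday), which matches Python's `date.isocalendar()`. The function name `month_and_week` is hypothetical.

```python
from datetime import date
from typing import Tuple


def month_and_week(d: date) -> Tuple[str, str]:
    """Return (month, ISO week number) as zero-padded two-character strings."""
    month = str(d.month).zfill(2)            # plays the role of lpad(month(...), 2, '0')
    week = str(d.isocalendar()[1]).zfill(2)  # plays the role of lpad(weekofyear(...), 2, '0')
    return month, week


print(month_and_week(date(2014, 6, 25)))  # ('06', '26')
```

Under ISO numbering, dates in early January can fall in week 52 or 53 of the previous year, so `weekNumber` on its own may not identify a week unambiguously across year boundaries.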

In [79]:
df.select(col("hour")).distinct().show()

+----+
|hour|
+----+
|   7|
|  15|
|  11|
|   3|
|   8|
|  16|
|   0|
|   5|
|  18|
|  17|
|   9|
|   1|
|  10|
|   4|
|  12|
|  13|
|  14|
+----+

