WeatherData JSON Source File Path : "abfss://bronze@datalakestorageaccountname.dfs.core.windows.net/weather-data/"

- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join" target="_blank">**DataFrame Joins** </a>

##### Step 1: Define the Variables to read weather-data ingested in bronze Layer

1. Replace <datalakestorageaccountname> with the name of the ADLS account created in your account, and set weatherDataSourceStorageAccountName in the cell below to the same name


In [0]:
weatherDataSourceLayerName = 'bronze'
weatherDataSourceStorageAccountName = 'lckudadatalakehousedev'
weatherDataSourceFolderName = 'weather-data'

weatherDataSourceFolderPath = f"abfss://{weatherDataSourceLayerName}@{weatherDataSourceStorageAccountName}.dfs.core.windows.net/{weatherDataSourceFolderName}"
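
The path assembled above follows the pattern abfss://<container>@<account>.dfs.core.windows.net/<folder>. As a minimal sketch, the same construction can be wrapped in a small helper that rejects a placeholder that was never replaced. The function name `build_abfss_path` and its checks are illustrative assumptions, not part of the notebook.

```python
def build_abfss_path(layer_name: str, storage_account_name: str, folder_name: str) -> str:
    """Return the abfss:// URI for a folder in an ADLS container.

    Hypothetical helper: the notebook builds this string inline with an f-string.
    """
    for label, value in (("layer", layer_name),
                         ("storage account", storage_account_name),
                         ("folder", folder_name)):
        # Catch empty values and leftover <placeholder> text early.
        if not value or "<" in value or ">" in value:
            raise ValueError(f"{label} name is empty or still a placeholder: {value!r}")
    # Strip stray slashes so the folder joins cleanly onto the host part.
    folder = folder_name.strip("/")
    return f"abfss://{layer_name}@{storage_account_name}.dfs.core.windows.net/{folder}"


# Same values as the cell above.
print(build_abfss_path("bronze", "lckudadatalakehousedev", "weather-data"))
```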

##### Step 2: Create Spark Dataframe For weather-data in JSON form stored in bronze layer

1. Define Spark Dataframe variable name as weatherDataBronzeDF
1. Use the spark.read.json method to read the source data from the path stored in the variable weatherDataSourceFolderPath defined above
1. Call display on the new Spark Dataframe to view its columns and data before further processing


In [0]:
weatherDataBronzeDF = (spark
                       .read
                       .json(weatherDataSourceFolderPath))

display(weatherDataBronzeDF)

##### Step 3: Convert Weather Date Values in ARRAY format to ROWS Using Explode

1. Import all functions from the pyspark.sql.functions package
1. Define a new Spark Dataframe variable named weatherDataDailyDateTransDF
1. Use the Dataframe select method to select the columns listed below from the source Spark Dataframe variable weatherDataBronzeDF
1. The first selected column is "daily.time": apply the explode function to it and alias the exploded values column as "weatherDate"
1. Along with the exploded column, select the columns "marketName" , "latitude" , "longitude" from the source Spark Dataframe
1. The last column in the select is a running sequence id generated by the Spark function monotonically_increasing_id(), aliased as 'sequenceId'
1. Call display on the new Spark Dataframe to view its columns and data before further processing


In [0]:
from pyspark.sql.functions import *
weatherDataDailyDateTransDF = (weatherDataBronzeDF
                          .select(
                          explode("daily.time").alias("weatherDate")
                          ,col("marketName")
                          ,col("latitude").alias("latitude")
                          ,col("longitude").alias("longitude")
                          ,monotonically_increasing_id().alias('sequenceId')
                          ))

display(weatherDataDailyDateTransDF)
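
To make the effect of explode concrete, here is a plain-Python model (not Spark code) of what happens to one row: each array element becomes its own row and the scalar columns are repeated. The sample record and the helper name `explode_rows` are made up for illustration; the field names follow the columns selected above. Like Spark's explode, the model emits no rows for an empty or null array, whereas Spark's explode_outer would keep such a row with a null value.

```python
from typing import Any, Dict, Iterator, List


def explode_rows(record: Dict[str, Any], array_field: str, alias: str,
                 keep: List[str]) -> Iterator[Dict[str, Any]]:
    """Yield one output row per element of record[array_field]."""
    values = record.get(array_field) or []  # None or [] yields no rows
    for value in values:
        row = {alias: value}
        # Scalar columns are copied unchanged onto every exploded row.
        row.update({name: record[name] for name in keep})
        yield row


# Hypothetical record, already flattened so "daily.time" is a plain key.
sample = {
    "marketName": "Market A",
    "latitude": 12.5,
    "longitude": 77.25,
    "daily.time": ["2023-01-01", "2023-01-02", "2023-01-03"],
}

rows = list(explode_rows(sample, "daily.time", "weatherDate",
                         ["marketName", "latitude", "longitude"]))
for row in rows:
    print(row)
```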

##### Step 4: Convert Maximum Temperature Values in ARRAY format to ROWS Using Explode

1. Define a new Spark Dataframe variable named weatherDataMaxTemparatureTransDF
1. Use the Dataframe select method to select the columns listed below from the source Spark Dataframe variable weatherDataBronzeDF
1. The first selected column is "daily.temperature_2m_max": apply the explode function to it and alias the exploded values column as "maximumTemparature"
1. Along with the exploded column, select the columns "marketName" , "latitude" , "longitude" from the source Spark Dataframe
1. Next, add a running sequence id generated by the Spark function monotonically_increasing_id(), aliased as 'sequenceId'
1. Finally, add the column "daily_units.temperature_2m_max" from the source Spark Dataframe, aliased as "unitOfTemparature"
1. Call display on the new Spark Dataframe to view its columns and data before further processing

In [0]:
weatherDataMaxTemparatureTransDF = (weatherDataBronzeDF
                          .select(
                          explode("daily.temperature_2m_max").alias("maximumTemparature")
                          ,col("marketName")
                          ,col("latitude").alias("latitude")
                          ,col("longitude").alias("longitude")
                          ,monotonically_increasing_id().alias('sequenceId')
                          ,col("daily_units.temperature_2m_max").alias("unitOfTemparature")

                          ))

display(weatherDataMaxTemparatureTransDF)

##### Step 5: Convert Minimum Temperature Values in ARRAY format to ROWS Using Explode

1. Define a new Spark Dataframe variable named weatherDataMinTemparatureTransDF
1. Use the Dataframe select method to select the columns listed below from the source Spark Dataframe variable weatherDataBronzeDF
1. The first selected column is "daily.temperature_2m_min": apply the explode function to it and alias the exploded values column as "minimumTemparature"
1. Along with the exploded column, select the columns "marketName" , "latitude" , "longitude" from the source Spark Dataframe
1. The last column in the select is a running sequence id generated by the Spark function monotonically_increasing_id(), aliased as 'sequenceId'
1. Call display on the new Spark Dataframe to view its columns and data before further processing

In [0]:
weatherDataMinTemparatureTransDF = (weatherDataBronzeDF
                          .select(
                          explode("daily.temperature_2m_min").alias("minimumTemparature")
                          ,col("marketName")
                          ,col("latitude").alias("latitude")
                          ,col("longitude").alias("longitude")                          
                          ,monotonically_increasing_id().alias('sequenceId')

                          ))

display(weatherDataMinTemparatureTransDF)

##### Step 6: Convert Rain Fall Values in ARRAY format to ROWS Using Explode

1. Define a new Spark Dataframe variable named weatherDataRainFallTransDF
1. Use the Dataframe select method to select the columns listed below from the source Spark Dataframe variable weatherDataBronzeDF
1. The first selected column is "daily.rain_sum": apply the explode function to it and alias the exploded values column as "rainFall"
1. Along with the exploded column, select the columns "marketName" , "latitude" , "longitude" from the source Spark Dataframe
1. Next, add a running sequence id generated by the Spark function monotonically_increasing_id(), aliased as 'sequenceId'
1. Finally, add the column "daily_units.rain_sum" from the source Spark Dataframe, aliased as "unitOfRainFall"
1. Call display on the new Spark Dataframe to view its columns and data before further processing

In [0]:
weatherDataRainFallTransDF = (weatherDataBronzeDF
                          .select(
                          explode("daily.rain_sum").alias("rainFall")
                          ,col("marketName")
                          ,col("latitude").alias("latitude")
                          ,col("longitude").alias("longitude")                          
                          ,monotonically_increasing_id().alias('sequenceId')
                          ,col("daily_units.rain_sum").alias("unitOfRainFall")

                          ))

display(weatherDataRainFallTransDF)

##### Step 7: Join All Intermediate Dataframes To Merge All Data & Write Into Silver Layer

1. Define a new Spark Dataframe variable named weatherDataTransDF
1. Join weatherDataDailyDateTransDF with weatherDataMaxTemparatureTransDF using the joining columns ['marketName','latitude','longitude','sequenceId']
1. Join the result with weatherDataMinTemparatureTransDF using the same joining columns ['marketName','latitude','longitude','sequenceId']
1. Join the result with weatherDataRainFallTransDF using the same joining columns ['marketName','latitude','longitude','sequenceId']
1. Select the columns "marketName" , "weatherDate" , "unitOfTemparature" , "maximumTemparature" , "minimumTemparature" , "unitOfRainFall" , "rainFall" , "latitude" and "longitude" as the final output columns to write into the silver layer

In [0]:
weatherDataTransDF = (weatherDataDailyDateTransDF
                      .join(weatherDataMaxTemparatureTransDF, ['marketName','latitude','longitude','sequenceId'])
                      .join(weatherDataMinTemparatureTransDF, ['marketName','latitude','longitude','sequenceId'])
                      .join(weatherDataRainFallTransDF, ['marketName','latitude','longitude','sequenceId'])
                      .select(col("marketName")
                              ,col("weatherDate")
                              ,col("unitOfTemparature")
                              ,col("maximumTemparature")
                              ,col("minimumTemparature")
                              ,col("unitOfRainFall")
                              ,col("rainFall")
                              ,col("latitude")
                              ,col("longitude"))
                     
)
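
The joins above only line up if every intermediate Dataframe assigns the same sequenceId to the same array position. Spark documents monotonically_increasing_id() as unique and increasing but not consecutive: the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below models that layout (the function name `model_monotonic_id` is an assumption for illustration) to show that the generated value depends on how the rows are partitioned.

```python
def model_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Model of the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Six rows held in a single partition.
one_partition = [model_monotonic_id(0, i) for i in range(6)]

# The same six rows split across two partitions of three rows each.
two_partitions = ([model_monotonic_id(0, i) for i in range(3)]
                  + [model_monotonic_id(1, i) for i in range(3)])

print(one_partition)   # [0, 1, 2, 3, 4, 5]
print(two_partitions)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

Because of this dependence, it may be worth comparing the row count of weatherDataTransDF with that of weatherDataDailyDateTransDF after the join. If the ids ever fail to line up, pairing the values by array position, for example with Spark's arrays_zip followed by a single explode, is one way to avoid relying on generated ids.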

In [0]:
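# Collect the joined result to the driver as a pandas DataFrame,
# then rebuild a Spark Dataframe from the collected rows.
# This only suits results small enough to fit in driver memory.
pdf = weatherDataTransDF.toPandas()
weatherDataTransDF = spark.createDataFrame(pdf)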

##### Step 8: Write the Final Transformed Dataframe Into Silver Layer As Delta Table

1. Write the final Spark Dataframe weatherDataTransDF using the Dataframe write method
1. Use overwrite as the write mode
1. Write the data into the Datalake table "pricing_analytics.silver.weather_data_silver" using the saveAsTable method

In [0]:
(weatherDataTransDF
 .write
 .mode("overwrite")  
 .saveAsTable("pricing_analytics.silver.weather_data_silver"))

##### Step 9: Test The Data Stored in Transformed Silver Layer Table
1. Write a SELECT query to select the data from the pricing_analytics.silver.weather_data_silver table
1. Check that the data for any one market matches the source data in the complex JSON format

In [0]:
spark.sql("SELECT * FROM pricing_analytics.silver.weather_data_silver").show()
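
For the comparison in the second step above, one option is to flatten a single source record by hand and compare it with the table rows for that market. The sketch below does this in plain Python, pairing the daily arrays by position. The sample values and the helper name `expected_silver_rows` are invented for illustration, while the field and column names are those used in the earlier steps.

```python
from typing import Any, Dict, List


def expected_silver_rows(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one bronze weather record into the rows expected in silver."""
    daily = record["daily"]
    units = record["daily_units"]
    rows = []
    # zip pairs the i-th date with the i-th max, min and rain value.
    for day, t_max, t_min, rain in zip(daily["time"],
                                       daily["temperature_2m_max"],
                                       daily["temperature_2m_min"],
                                       daily["rain_sum"]):
        rows.append({
            "marketName": record["marketName"],
            "weatherDate": day,
            "unitOfTemparature": units["temperature_2m_max"],
            "maximumTemparature": t_max,
            "minimumTemparature": t_min,
            "unitOfRainFall": units["rain_sum"],
            "rainFall": rain,
            "latitude": record["latitude"],
            "longitude": record["longitude"],
        })
    return rows


# Made-up record in the shape the notebook reads from the bronze layer.
sample_record = {
    "marketName": "Market A",
    "latitude": 12.5,
    "longitude": 77.25,
    "daily_units": {"temperature_2m_max": "C", "rain_sum": "mm"},
    "daily": {
        "time": ["2023-01-01", "2023-01-02"],
        "temperature_2m_max": [29.1, 30.4],
        "temperature_2m_min": [17.8, 18.2],
        "rain_sum": [0.0, 1.2],
    },
}

for row in expected_silver_rows(sample_record):
    print(row)
```

The printed rows can then be compared with the result of the SELECT query restricted to the same marketName.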
