### Notebook for analyzing the sales data and extracting insights
* 1) Calculate total sales amount per customer
* 2) Determine the average order quantity per product
* 3) Identify the top-selling products or customers
* 4) Analyze sales trends over time (e.g., monthly or quarterly sales)
* 5) Include any other aggregations or data manipulations that you think are relevant
* 6) Include weather data in the analysis (e.g., average sales amount per weather condition)


### Import required libraries

In [39]:
from pyspark.sql import SparkSession
# Namespace the functions module so that sum does not shadow Python's built-in sum
from pyspark.sql import functions as F

### Create Spark Session

In [15]:
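# Create (or reuse) the Spark session; the PostgreSQL JDBC driver is pulled in via spark.jars.packages
spark = SparkSession.builder.appName("AIQ Assignment") \
            .config("spark.jars.packages","org.postgresql:postgresql:42.5.4") \
            .getOrCreate()

# The SparkSession is itself the SQL entry point; the sqlContext name is kept
# because the cells below call sqlContext.sql(...)
sqlContext = spark

# Suppress warnings; log errors only
spark.sparkContext.setLogLevel("ERROR")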

### Read Transformed Data from table

In [13]:
# Load the transformed sales table from PostgreSQL over JDBC
trSales_df = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://192.168.5.154:5432/postgres") \
  .option("driver", "org.postgresql.Driver") \
  .option("dbtable", "aiq.tra_sales") \
  .option("user", "postgres") \
  .option("password", "postgres") \
  .load()

# Register a temp view so the cells below can query the data with Spark SQL
trSales_df.createOrReplaceTempView("tra_sales")
trSales_df.show(2)

+--------+-----------+----------+--------+-----+-------------------+----------------+--------+--------------------+--------------+--------+---------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|order_id|customer_id|product_id|quantity|price|         order_date|            name|username|               email|          city|     lat|      lng|store_address|         temperature|            temp_min|            temp_max|            pressure|            humidity|         description|
+--------+-----------+----------+--------+-----+-------------------+----------------+--------+--------------------+--------------+--------+---------+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    5862|          5|        14|       2|29.35|2022-12-12 00:00:00|Chelsey Dietrich|  Kamren|Lucio_Hettinger@a...|    Roscoeview|
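
The connection settings in the read cell are hard-coded. One way to keep credentials out of the notebook is to read them from environment variables. The sketch below builds the JDBC option dictionary that way. The helper `jdbc_options` and the variable names (`AIQ_DB_HOST`, `AIQ_DB_USER`, and so on) are hypothetical, and the defaults simply mirror the values used in the read cell.

```python
import os


def jdbc_options(env=os.environ):
    """Build JDBC reader options from environment variables (names are hypothetical)."""
    host = env.get("AIQ_DB_HOST", "192.168.5.154")
    port = env.get("AIQ_DB_PORT", "5432")
    database = env.get("AIQ_DB_NAME", "postgres")
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "dbtable": env.get("AIQ_DB_TABLE", "aiq.tra_sales"),
        "user": env.get("AIQ_DB_USER", "postgres"),
        "password": env.get("AIQ_DB_PASSWORD", "postgres"),
    }


# Example with an explicit mapping instead of the real environment
options = jdbc_options({"AIQ_DB_HOST": "db.example.internal"})
print(options["url"])
```

With a Spark session available, the dictionary could be passed to the reader as `spark.read.format("jdbc").options(**jdbc_options()).load()`.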

### 1.) Calculate total sales amount per customer

In [43]:
totalSalPerCust_df = sqlContext.sql('''
			WITH salesRes
			AS(
				-- Group by customer_id as well, so customers who share a name are not merged
				SELECT customer_id,
					name,
					SUM(quantity * price) AS total_sales
				FROM tra_sales
				GROUP BY customer_id, name
			)
			SELECT name, ROUND(total_sales, 2) AS total_sales_amount
			FROM salesRes
			ORDER BY total_sales DESC
        ''')
		
totalSalPerCust_df.show()

+--------------------+------------------+
|                name|total_sales_amount|
+--------------------+------------------+
|  Clementina DuBuque|          36704.17|
|        Ervin Howell|          33147.26|
|     Glenna Reichert|          33040.69|
|Nicholas Runolfsd...|          31860.25|
|    Chelsey Dietrich|          31156.73|
|    Clementine Bauch|           31018.8|
|Mrs. Dennis Schulist|          30168.84|
|     Kurtis Weissnat|          28737.81|
|    Patricia Lebsack|          28625.48|
|       Leanne Graham|          24680.98|
+--------------------+------------------+
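
A per-customer total depends on which column identifies a customer. The sketch below uses an in-memory SQLite table with made-up rows (not the notebook's data, and SQLite rather than Spark SQL) to show that grouping by `name` alone merges two customers who share a name, while grouping by `customer_id` keeps them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tra_sales (customer_id INTEGER, name TEXT, quantity INTEGER, price REAL)"
)
# Made-up rows: customers 1 and 2 share the same name.
conn.executemany(
    "INSERT INTO tra_sales VALUES (?, ?, ?, ?)",
    [(1, "Alex Kim", 2, 10.0), (2, "Alex Kim", 1, 5.0), (3, "Sam Lee", 3, 4.0)],
)

by_name = conn.execute(
    "SELECT name, SUM(quantity * price) FROM tra_sales GROUP BY name ORDER BY name"
).fetchall()
by_id = conn.execute(
    "SELECT customer_id, name, SUM(quantity * price) FROM tra_sales "
    "GROUP BY customer_id, name ORDER BY customer_id"
).fetchall()

print(by_name)  # two rows: both "Alex Kim" customers collapse into one total
print(by_id)    # three rows: one total per customer_id
conn.close()
```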



### 2.) Determine the average order quantity per product

In [37]:
avgOrderPerProduct_df = sqlContext.sql('''
			SELECT product_id,
				ROUND(AVG(quantity), 2) AS avg_quantity
			FROM tra_sales
			GROUP BY product_id
			ORDER BY product_id
        ''')
		
avgOrderPerProduct_df.show()

+----------+------------+
|product_id|avg_quantity|
+----------+------------+
|         1|         5.0|
|         2|        5.92|
|         3|        6.31|
|         4|        6.75|
|         5|         5.0|
|         6|        5.46|
|         7|        4.87|
|         8|        6.22|
|         9|        5.13|
|        10|        4.95|
|        11|        5.03|
|        12|        6.35|
|        13|        6.09|
|        14|        7.19|
|        15|         5.5|
|        16|         5.2|
|        17|        5.56|
|        18|         6.0|
|        19|        6.23|
|        20|         6.0|
+----------+------------+
only showing top 20 rows



### 3.) Identify the top-selling products

In [45]:
# Top 5 selling products, ranked by total units sold
top5SellingProducts_df = sqlContext.sql('''
			WITH topSellingProducts
			AS(
				SELECT product_id,
					SUM(quantity) AS total_quantity
				FROM tra_sales
				GROUP BY product_id
			)
			SELECT *
			FROM topSellingProducts
			ORDER BY total_quantity DESC
			LIMIT 5
        ''')
		
top5SellingProducts_df.show()

+----------+--------------+
|product_id|total_quantity|
+----------+--------------+
|        11|           181|
|        36|           159|
|        23|           156|
|        26|           155|
|        44|           151|
+----------+--------------+
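
`LIMIT 5` always returns exactly five rows, so if several products tie on `total_quantity` at the cutoff, which of them appears is arbitrary. The sketch below shows one tie-inclusive alternative in plain Python. The helper `top_n_with_ties` and the sample totals are made up for illustration and are not taken from the dataset.

```python
def top_n_with_ties(totals, n):
    """Return (product_id, total) pairs whose total is at least the n-th largest total."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]
    return [item for item in ranked if item[1] >= cutoff]


# Hypothetical totals: products 7 and 9 tie for the last place in a top 2.
sample = {3: 50, 7: 40, 9: 40, 1: 10}
print(top_n_with_ties(sample, 2))  # [(3, 50), (7, 40), (9, 40)]
```

In SQL the same idea is usually expressed with a `RANK()` window function and a filter on the rank instead of `LIMIT`.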



### 4.) Analyze sales trends over time (e.g., monthly or quarterly sales)

In [47]:
# Build a temp view with one row per order line, tagged with year, month and quarter.
# The view has no ORDER BY: the aggregations below sort their own results.
# The returned DataFrame is empty; only the registered view is used afterwards.
salesTrend_df = sqlContext.sql('''
			CREATE OR REPLACE TEMP VIEW salesTrend
			AS
			SELECT YEAR(order_date) AS year_num,
				MONTH(order_date) AS month_num,
				CASE
					WHEN MONTH(order_date) BETWEEN 1 AND 3 THEN 'Q1'
					WHEN MONTH(order_date) BETWEEN 4 AND 6 THEN 'Q2'
					WHEN MONTH(order_date) BETWEEN 7 AND 9 THEN 'Q3'
					WHEN MONTH(order_date) BETWEEN 10 AND 12 THEN 'Q4'
				END AS quarter_num,
				(quantity * price) AS total_amount
			FROM tra_sales
        ''')
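
The quarter label in the view follows the calendar mapping of months 1 to 3 onto Q1, 4 to 6 onto Q2, and so on. The same mapping can be written as integer arithmetic. The helper name `quarter_label` is an assumption made for this sketch.

```python
def quarter_label(month):
    """Map a calendar month (1-12) to a label 'Q1'..'Q4'."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"Q{(month - 1) // 3 + 1}"


print([quarter_label(m) for m in (1, 3, 4, 9, 10, 12)])
# ['Q1', 'Q1', 'Q2', 'Q3', 'Q4', 'Q4']
```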

##### Monthly Sales Trends

In [50]:
monthlySalesTrend_df = sqlContext.sql(''' 
			SELECT year_num,
				month_num,
				ROUND(SUM(total_amount), 2) AS monthly_sales_amount
			FROM salesTrend
			GROUP BY year_num, month_num
			ORDER BY year_num DESC, month_num DESC 
		''')
monthlySalesTrend_df.show()

+--------+---------+--------------------+
|year_num|month_num|monthly_sales_amount|
+--------+---------+--------------------+
|    2023|        6|              427.45|
|    2023|        5|            22360.13|
|    2023|        4|            27713.28|
|    2023|        3|            24914.67|
|    2023|        2|            27308.78|
|    2023|        1|            22724.84|
|    2022|       12|            29656.92|
|    2022|       11|            22888.05|
|    2022|       10|            25448.66|
|    2022|        9|            26819.41|
|    2022|        8|            25121.81|
|    2022|        7|            28217.77|
|    2022|        6|            25539.24|
+--------+---------+--------------------+



##### Quarterly Sales Trends

In [52]:
quarterlySalesTrend_df = sqlContext.sql(''' 
			SELECT year_num,
				quarter_num,
				ROUND(SUM(total_amount), 2) AS quarterly_sales_amount
			FROM salesTrend
			GROUP BY year_num, quarter_num
			ORDER BY year_num DESC, quarter_num DESC 
		''')
quarterlySalesTrend_df.show()

+--------+-----------+----------------------+
|year_num|quarter_num|quarterly_sales_amount|
+--------+-----------+----------------------+
|    2023|         Q2|              50500.86|
|    2023|         Q1|              74948.29|
|    2022|         Q4|              77993.63|
|    2022|         Q3|              80158.99|
|    2022|         Q2|              25539.24|
+--------+-----------+----------------------+
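
As a consistency check, the monthly figures can be rolled up into quarters and compared with the quarterly figures. The sketch below does this in plain Python, using only the numbers printed in the two outputs and `Decimal` to avoid floating-point noise. Because both tables are rounded to two decimals, a one-cent difference could appear in general; for these figures the tables agree exactly.

```python
from decimal import Decimal

# (year, month) -> amount, copied from the monthly output
monthly = {
    (2023, 6): "427.45", (2023, 5): "22360.13", (2023, 4): "27713.28",
    (2023, 3): "24914.67", (2023, 2): "27308.78", (2023, 1): "22724.84",
    (2022, 12): "29656.92", (2022, 11): "22888.05", (2022, 10): "25448.66",
    (2022, 9): "26819.41", (2022, 8): "25121.81", (2022, 7): "28217.77",
    (2022, 6): "25539.24",
}
# (year, quarter) -> amount, copied from the quarterly output
quarterly = {
    (2023, "Q2"): "50500.86", (2023, "Q1"): "74948.29",
    (2022, "Q4"): "77993.63", (2022, "Q3"): "80158.99",
    (2022, "Q2"): "25539.24",
}

# Roll the monthly amounts up into (year, quarter) buckets
rollup = {}
for (year, month), amount in monthly.items():
    key = (year, f"Q{(month - 1) // 3 + 1}")
    rollup[key] = rollup.get(key, Decimal("0")) + Decimal(amount)

# Collect every quarter where the rollup and the quarterly table disagree
mismatches = {
    key: (rollup.get(key), Decimal(value))
    for key, value in quarterly.items()
    if rollup.get(key) != Decimal(value)
}
print(mismatches)  # {}
```

The monthly output also shows that Q2 2022 contains only June and that June 2023 appears to be a partial month, so those two quarters are not directly comparable with the full ones.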



### 5.) Other Relevant Statistics

##### Sales By Fictional Store Address

In [60]:
salesByStoreAddress = sqlContext.sql(''' 
			SELECT store_address,
				ROUND(SUM((quantity * price))) AS sales_by_store_address
			FROM tra_sales
			GROUP BY store_address
		''')
salesByStoreAddress.show()

+-------------+----------------------+
|store_address|sales_by_store_address|
+-------------+----------------------+
|      Beijing|               36704.0|
|       London|               28625.0|
|       Mumbai|               30169.0|
|          Goa|               31860.0|
|        Paris|               31157.0|
|       Riyadh|               31019.0|
|    Abu Dhabi|               24681.0|
|    Hyderabad|               28738.0|
|        Dubai|               33147.0|
|    Washigton|               33041.0|
+-------------+----------------------+
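
The store totals look like the per-customer totals from section 1 rounded to whole numbers, which suggests that in this dataset each store address is tied to a single customer. The sketch below checks that observation using only the figures printed in the two outputs. It says nothing about how the addresses were assigned upstream.

```python
# Per-customer totals, copied from the section 1 output
customer_totals = [
    36704.17, 33147.26, 33040.69, 31860.25, 31156.73,
    31018.8, 30168.84, 28737.81, 28625.48, 24680.98,
]
# Per-store totals, copied from the output above (printed there as 36704.0 etc.)
store_totals = {
    "Beijing": 36704, "London": 28625, "Mumbai": 30169, "Goa": 31860,
    "Paris": 31157, "Riyadh": 31019, "Abu Dhabi": 24681,
    "Hyderabad": 28738, "Dubai": 33147, "Washigton": 33041,
}

rounded_customers = sorted(round(total) for total in customer_totals)
matches = rounded_customers == sorted(store_totals.values())
print(matches)  # True
```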



### 6.) Average sales amount by weather condition

In [58]:
# Some rows hold a serialized-object string (containing "net.razorvine.pickle...") in the
# description column instead of a weather condition; those rows are grouped under 'NA'
avgSalesAmountByWeather_df = sqlContext.sql(''' 
			WITH salesTrendsByWeather
			AS
			(
				SELECT CASE 
							WHEN description LIKE '%net.razorvine.pickle.objects.ClassDictConstructor%' THEN 'NA'
							ELSE description 
						END AS weather_condition,
					(quantity * price) AS sales_amount
				FROM tra_sales
			)
			SELECT weather_condition,
				ROUND(AVG(sales_amount)) AS avg_sales_amount
			FROM salesTrendsByWeather
			GROUP BY weather_condition
		''')
avgSalesAmountByWeather_df.show()		

+-----------------+----------------+
|weather_condition|avg_sales_amount|
+-----------------+----------------+
|          drizzle|           305.0|
|               NA|           312.0|
|    broken clouds|           295.0|
|        clear sky|           306.0|
|  overcast clouds|           316.0|
| scattered clouds|           312.0|
|             mist|           314.0|
+-----------------+----------------+
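
The `CASE` expression in the query maps any description containing the `net.razorvine.pickle.objects.ClassDictConstructor` marker to `'NA'` and leaves other values unchanged. The same rule is sketched below as a plain Python function, which could be handy for cleaning the column before it reaches the table. The names `ARTIFACT_MARKER` and `weather_condition` are assumptions made for this sketch.

```python
ARTIFACT_MARKER = "net.razorvine.pickle.objects.ClassDictConstructor"


def weather_condition(description):
    """Return 'NA' when the description contains the artifact marker.

    Any other value, including None, is returned unchanged, which matches
    how the SQL CASE expression treats NULL descriptions.
    """
    if description is not None and ARTIFACT_MARKER in description:
        return "NA"
    return description


print(weather_condition("clear sky"))                      # clear sky
print(weather_condition(f"<{ARTIFACT_MARKER} object>"))    # NA
```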

