In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

In [3]:
spark = SparkSession.builder.master("local[4]").appName("1251").getOrCreate()
spark

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/07/22 01:23:16 WARN Utils: Your hostname, de24, resolves to a loopback address: 127.0.1.1; using 192.168.0.102 instead (on interface enp0s3)
25/07/22 01:23:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/22 01:23:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
'''
Table: Prices

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| start_date    | date    |
| end_date      | date    |
| price         | int     |
+---------------+---------+
(product_id, start_date, end_date) is the primary key (combination of columns with unique values) 
for this table.
Each row of this table indicates the price of the product_id in the period from start_date to 
end_date.
For each product_id there will be no two overlapping periods. That means there will be no two 
intersecting periods for the same product_id.
 

Table: UnitsSold

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| product_id    | int     |
| purchase_date | date    |
| units         | int     |
+---------------+---------+
This table may contain duplicate rows.
Each row of this table indicates the date, units, and product_id of each product sold. 
 

Write a solution to find the average selling price for each product. average_price should be 
rounded to 2 decimal places.

Return the result table in any order.

The result format is in the following example.

 

Example 1:

Input: 
Prices table:
+------------+------------+------------+--------+
| product_id | start_date | end_date   | price  |
+------------+------------+------------+--------+
| 1          | 2019-02-17 | 2019-02-28 | 5      |
| 1          | 2019-03-01 | 2019-03-22 | 20     |
| 2          | 2019-02-01 | 2019-02-20 | 15     |
| 2          | 2019-02-21 | 2019-03-31 | 30     |
+------------+------------+------------+--------+
UnitsSold table:
+------------+---------------+-------+
| product_id | purchase_date | units |
+------------+---------------+-------+
| 1          | 2019-02-25    | 100   |
| 1          | 2019-03-01    | 15    |
| 2          | 2019-02-10    | 200   |
| 2          | 2019-03-22    | 30    |
+------------+---------------+-------+
Output: 
+------------+---------------+
| product_id | average_price |
+------------+---------------+
| 1          | 6.96          |
| 2          | 16.96         |
+------------+---------------+
Explanation: 
Average selling price = Total Price of Product / Number of products sold.
Average selling price for product 1 = ((100 * 5) + (15 * 20)) / 115 = 6.96
Average selling price for product 2 = ((200 * 15) + (30 * 30)) / 230 = 16.96
'''
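
The arithmetic in the explanation can be reproduced in plain Python before involving Spark. This is only a minimal sketch: the helper name `average_prices` and the variables `sample_prices` and `sample_units` are hypothetical, and the rows are copied from Example 1. It assumes dates stay as ISO-8601 strings, which compare correctly as text.

```python
from collections import defaultdict

# Rows copied from Example 1 (dates kept as ISO-8601 strings).
sample_prices = [
    (1, "2019-02-17", "2019-02-28", 5),
    (1, "2019-03-01", "2019-03-22", 20),
    (2, "2019-02-01", "2019-02-20", 15),
    (2, "2019-02-21", "2019-03-31", 30),
]
sample_units = [
    (1, "2019-02-25", 100),
    (1, "2019-03-01", 15),
    (2, "2019-02-10", 200),
    (2, "2019-03-22", 30),
]


def average_prices(prices, units_sold):
    """Hypothetical helper: units-weighted average price per product."""
    revenue = defaultdict(int)      # sum of price * units per product
    units_total = defaultdict(int)  # sum of units per product
    for product_id, purchase_date, units in units_sold:
        for pid, start_date, end_date, price in prices:
            # A sale matches the price period that contains its purchase date.
            if pid == product_id and start_date <= purchase_date <= end_date:
                revenue[product_id] += price * units
                units_total[product_id] += units
    return {pid: round(revenue[pid] / units_total[pid], 2) for pid in units_total}


result = average_prices(sample_prices, sample_units)
print(result)
```

Products with no matching sale are simply absent from the result, which mirrors an inner join.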

In [4]:
prices = [
(1,'2019-02-17','2019-02-28',5 ),
(1,'2019-03-01','2019-03-22',20),
(2,'2019-02-01','2019-02-20',15),
(2,'2019-02-21','2019-03-31',30)
]
prices_schema = ['product_id','start_date','end_date','price']

unitssold = [
(1,'2019-02-25',100),
(1,'2019-03-01',15 ),
(2,'2019-02-10',200),
(2,'2019-03-22',30 )
]
unitssold_schema = ['product_id','purchase_date','units']

In [7]:
df_prices = spark.createDataFrame(data=prices,schema=prices_schema)
df_unitssold = spark.createDataFrame(data=unitssold,schema=unitssold_schema)

In [16]:
# Match each sale to the price period that contains its purchase date, then
# compute the units-weighted average price per product.
# The dates are ISO-8601 strings here, so they compare correctly as text.
(
    df_prices.alias("p")
    .join(df_unitssold.alias("u"), F.col("p.product_id") == F.col("u.product_id"), "inner")
    .where(F.col("u.purchase_date").between(F.col("p.start_date"), F.col("p.end_date")))
    .groupBy(F.col("u.product_id"))
    .agg(
        F.round(
            F.sum(F.col("p.price") * F.col("u.units")) / F.sum(F.col("u.units")), 2
        ).alias("average_price")
    )
    .show()
)

+----------+-------------+
|product_id|average_price|
+----------+-------------+
|         1|         6.96|
|         2|        16.96|
+----------+-------------+



## SQL Solution 

<pre>
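    -- The WHERE filter on u.purchase_date discards unmatched rows, so this is effectively an inner join.
    SELECT u.product_id, ROUND(SUM(p.price * u.units) / SUM(u.units), 2) AS average_price
    FROM Prices p
    JOIN UnitsSold u ON p.product_id = u.product_id
    WHERE u.purchase_date BETWEEN p.start_date AND p.end_date
    GROUP BY u.product_id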
</pre>
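
One way to sanity-check the SQL without a database server is to run a close variant of it through Python's built-in `sqlite3` module. This is a sketch under stated assumptions: SQLite stands in for whatever engine the query targets, dates are stored as ISO-8601 text, and the `1.0 *` factor is added only because SQLite would otherwise perform integer division on two integer sums.

```python
import sqlite3

# In-memory database loaded with the rows from Example 1.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Prices (product_id INT, start_date TEXT, end_date TEXT, price INT);
    CREATE TABLE UnitsSold (product_id INT, purchase_date TEXT, units INT);
    """
)
conn.executemany(
    "INSERT INTO Prices VALUES (?, ?, ?, ?)",
    [
        (1, "2019-02-17", "2019-02-28", 5),
        (1, "2019-03-01", "2019-03-22", 20),
        (2, "2019-02-01", "2019-02-20", 15),
        (2, "2019-02-21", "2019-03-31", 30),
    ],
)
conn.executemany(
    "INSERT INTO UnitsSold VALUES (?, ?, ?)",
    [
        (1, "2019-02-25", 100),
        (1, "2019-03-01", 15),
        (2, "2019-02-10", 200),
        (2, "2019-03-22", 30),
    ],
)

# Assumption: 1.0 * forces floating-point division in SQLite.
query = """
SELECT u.product_id,
       ROUND(1.0 * SUM(p.price * u.units) / SUM(u.units), 2) AS average_price
FROM Prices p
JOIN UnitsSold u ON p.product_id = u.product_id
WHERE u.purchase_date BETWEEN p.start_date AND p.end_date
GROUP BY u.product_id
ORDER BY u.product_id
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The `ORDER BY` is there only to make the printed rows deterministic; the problem accepts any order.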