## 1445. Apples & Oranges

### Table: Sales

| Column Name | Type |
|-------------|------|
| sale_date   | date |
| fruit       | enum |
| sold_num    | int  |

(sale_date, fruit) is the primary key for this table.  
This table contains the sales of "apples" and "oranges" sold each day.

Write an SQL query to report the difference between the number of apples and the number of oranges sold each day.  

Return the result table ordered by sale_date, with dates formatted as 'YYYY-MM-DD'.

#### Sales table:

| sale_date  | fruit   | sold_num |
|------------|---------|----------|
| 2020-05-01 | apples  | 10       |
| 2020-05-01 | oranges | 8        |
| 2020-05-02 | apples  | 15       |
| 2020-05-02 | oranges | 15       |
| 2020-05-03 | apples  | 20       |
| 2020-05-03 | oranges | 0        |
| 2020-05-04 | apples  | 15       |
| 2020-05-04 | oranges | 16       |

#### Result table:

| sale_date  | diff |
|------------|------|
| 2020-05-01 | 2    |
| 2020-05-02 | 0    |
| 2020-05-03 | 20   |
| 2020-05-04 | -1   |

Day 2020-05-01, 10 apples and 8 oranges were sold (Difference 10 - 8 = 2).  
Day 2020-05-02, 15 apples and 15 oranges were sold (Difference 15 - 15 = 0).  
Day 2020-05-03, 20 apples and 0 oranges were sold (Difference 20 - 0 = 20).  
Day 2020-05-04, 15 apples and 16 oranges were sold (Difference 15 - 16 = -1).
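The arithmetic above can be checked with a few lines of plain Python before involving Spark. This is only a sketch for sanity-checking the expected output; the names `sales`, `diff_by_day`, and `result` are illustrative and do not appear in the original notebook.

```python
from collections import defaultdict

# Sample rows copied from the Sales table above: (sale_date, fruit, sold_num).
sales = [
    ("2020-05-01", "apples", 10),
    ("2020-05-01", "oranges", 8),
    ("2020-05-02", "apples", 15),
    ("2020-05-02", "oranges", 15),
    ("2020-05-03", "apples", 20),
    ("2020-05-03", "oranges", 0),
    ("2020-05-04", "apples", 15),
    ("2020-05-04", "oranges", 16),
]

# Keep a signed total per day: apples add to it, oranges subtract from it.
diff_by_day = defaultdict(int)
for sale_date, fruit, sold_num in sales:
    sign = 1 if fruit == "apples" else -1
    diff_by_day[sale_date] += sign * sold_num

# ISO-formatted date strings sort chronologically, so a plain sort is enough.
result = sorted(diff_by_day.items())
for sale_date, diff in result:
    print(sale_date, diff)
```

The signed-sum idea is the same one the conditional-aggregation query at the end of this notebook relies on: each row contributes to its day's total according to its fruit.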

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

# Create Spark session
spark = SparkSession.builder.appName("FruitSalesDiff").getOrCreate()

# Define schema
schema = StructType([
    StructField("sale_date", DateType(), True),
    StructField("fruit", StringType(), True),
    StructField("sold_num", IntegerType(), True)
])

from datetime import datetime

# Sample data, with date strings parsed into datetime.date objects
data = [
    (datetime.strptime("2020-05-01", "%Y-%m-%d").date(), "apples", 10),
    (datetime.strptime("2020-05-01", "%Y-%m-%d").date(), "oranges", 8),
    (datetime.strptime("2020-05-02", "%Y-%m-%d").date(), "apples", 15),
    (datetime.strptime("2020-05-02", "%Y-%m-%d").date(), "oranges", 15),
    (datetime.strptime("2020-05-03", "%Y-%m-%d").date(), "apples", 20),
    (datetime.strptime("2020-05-03", "%Y-%m-%d").date(), "oranges", 0),
    (datetime.strptime("2020-05-04", "%Y-%m-%d").date(), "apples", 15),
    (datetime.strptime("2020-05-04", "%Y-%m-%d").date(), "oranges", 16)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Register temp view
df.createOrReplaceTempView("Sales")



In [0]:
from pyspark.sql.functions import col

# Split the sales into one DataFrame per fruit, renaming columns to avoid name clashes
apples = df.filter(col("fruit") == "apples").selectExpr("sale_date as a_sd", "sold_num as a_sn")
oranges = df.filter(col("fruit") == "oranges").selectExpr("sale_date as o_sd", "sold_num as o_sn")

# Full outer join on the date; a fruit with no row on a given day is treated as 0 sold
apples.join(oranges, col("a_sd") == col("o_sd"), "full_outer")\
    .selectExpr("coalesce(a_sd, o_sd) as sale_date",
                "coalesce(a_sn, 0) - coalesce(o_sn, 0) as diff")\
    .orderBy("sale_date")\
    .display()

In [0]:
%sql

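-- Filter each side before joining: a WHERE on both sides after a full outer join
-- would discard the unmatched rows and reduce it to an inner join.
SELECT COALESCE(a.sale_date, o.sale_date) AS sale_date,
       COALESCE(a.sold_num, 0) - COALESCE(o.sold_num, 0) AS diff
FROM (SELECT sale_date, sold_num FROM Sales WHERE fruit = 'apples') a
FULL OUTER JOIN (SELECT sale_date, sold_num FROM Sales WHERE fruit = 'oranges') o
  ON a.sale_date = o.sale_date
ORDER BY sale_date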
 

In [0]:
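# Conditional aggregation: a single pass over Sales with no self-join.
# (sale_date, fruit) is the primary key, so each CASE matches at most one row per day.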
query = """
SELECT
  sale_date,
  MAX(CASE WHEN fruit = 'apples' THEN sold_num ELSE 0 END) -
  MAX(CASE WHEN fruit = 'oranges' THEN sold_num ELSE 0 END) AS diff
FROM Sales
GROUP BY sale_date
ORDER BY sale_date
"""

# Execute and display
result = spark.sql(query)
display(result)
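
The conditional-aggregation query can also be cross-checked without a Spark cluster. The sketch below runs the same SQL against an in-memory SQLite database from Python's standard library. SQLite stands in for Spark SQL here purely for illustration, and it stores the dates as text rather than as a date type; the names `conn`, `rows`, and `output` are hypothetical.

```python
import sqlite3

# In-memory database with the same shape as the Sales table (dates kept as ISO text).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales ("
    " sale_date TEXT, fruit TEXT, sold_num INTEGER,"
    " PRIMARY KEY (sale_date, fruit))"
)

# Sample rows copied from the problem statement.
rows = [
    ("2020-05-01", "apples", 10), ("2020-05-01", "oranges", 8),
    ("2020-05-02", "apples", 15), ("2020-05-02", "oranges", 15),
    ("2020-05-03", "apples", 20), ("2020-05-03", "oranges", 0),
    ("2020-05-04", "apples", 15), ("2020-05-04", "oranges", 16),
]
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", rows)

# Same conditional aggregation as the Spark SQL cell above.
query = """
SELECT
  sale_date,
  MAX(CASE WHEN fruit = 'apples' THEN sold_num ELSE 0 END) -
  MAX(CASE WHEN fruit = 'oranges' THEN sold_num ELSE 0 END) AS diff
FROM Sales
GROUP BY sale_date
ORDER BY sale_date
"""
output = conn.execute(query).fetchall()
for sale_date, diff in output:
    print(sale_date, diff)
conn.close()
```

Because of the `ELSE 0` branch, a day with rows for only one fruit would still produce a difference rather than NULL, which is the main practical advantage of this form over a plain inner self-join.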