# Supercloud Customer  
**Microsoft SQL Interview Question**

---

### Question  
A Microsoft Azure Supercloud customer is defined as a customer who has purchased at least one product from every product category listed in the products table.  

Write a query that identifies the customer IDs of these Supercloud customers.

---

### Tables  

#### `customer_contracts` Table:
| Column Name    | Type    |
|----------------|---------|
| customer_id    | integer |
| product_id     | integer |
| amount         | integer |

**Example Input:**
| customer_id | product_id | amount |
|-------------|------------|--------|
| 1           | 1          | 1000   |
| 1           | 3          | 2000   |
| 1           | 5          | 1500   |
| 2           | 2          | 3000   |
| 2           | 6          | 2000   |

---

#### `products` Table:
| Column Name      | Type    |
|------------------|---------|
| product_id       | integer |
| product_category | string  |
| product_name     | string  |

**Example Input:**
| product_id | product_category | product_name              |
|------------|------------------|---------------------------|
| 1          | Analytics        | Azure Databricks          |
| 2          | Analytics        | Azure Stream Analytics    |
| 3          | Containers       | Azure Kubernetes Service  |
| 4          | Containers       | Azure Service Fabric      |
| 5          | Compute          | Virtual Machines          |
| 6          | Compute          | Azure Functions           |

---

### Example Output:
| customer_id |
|-------------|
| 1           |

---

### Explanation:  
Customer 1 purchased at least one product from each of **Analytics**, **Containers**, and **Compute**. That covers every category, so customer 1 is a Supercloud customer.  
Customer 2 purchased only from Analytics and Compute. With no Containers product, customer 2 is excluded.
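
The reasoning above can be checked with a minimal plain-Python sketch that needs no Spark session. It assumes product IDs 1 through 6, matching the sample data used in the notebook cells. The function name `supercloud_customers` and the dictionary `product_category` are hypothetical helpers introduced only for this illustration.

```python
from collections import defaultdict

# Sample contract rows: (customer_id, product_id, amount)
customer_contracts = [
    (1, 1, 1000),
    (1, 3, 2000),
    (1, 5, 1500),
    (2, 2, 3000),
    (2, 6, 2000),
]

# Assumed catalog lookup: product_id -> product_category
product_category = {
    1: "Analytics",
    2: "Analytics",
    3: "Containers",
    4: "Containers",
    5: "Compute",
    6: "Compute",
}


def supercloud_customers(contracts, category_by_product):
    """Return sorted customer IDs whose purchases cover every category."""
    all_categories = set(category_by_product.values())
    seen = defaultdict(set)
    for customer_id, product_id, _amount in contracts:
        category = category_by_product.get(product_id)
        if category is not None:  # skip products missing from the catalog
            seen[customer_id].add(category)
    return sorted(c for c, cats in seen.items() if cats == all_categories)


print(supercloud_customers(customer_contracts, product_category))  # [1]
```

Comparing sets of categories rather than counting rows means repeat purchases within one category cannot inflate a customer's coverage. This is the same job `countDistinct` does in the Spark solutions.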


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

# Initialize Spark session
spark = SparkSession.builder.master('local[1]').getOrCreate()

# Sample data for customer_contracts
customer_contracts_data = [
    (1, 1, 1000),
    (1, 3, 2000),
    (1, 5, 1500),
    (2, 2, 3000),
    (2, 6, 2000)
]

customer_contracts_columns = ["customer_id", "product_id", "amount"]

customer_contracts_df = spark.createDataFrame(customer_contracts_data, customer_contracts_columns)

# Sample data for products
products_data = [
    (1, "Analytics", "Azure Databricks"),
    (2, "Analytics", "Azure Stream Analytics"),
    (3, "Containers", "Azure Kubernetes Service"),
    (4, "Containers", "Azure Service Fabric"),
    (5, "Compute", "Virtual Machines"),
    (6, "Compute", "Azure Functions")
]

products_columns = ["product_id", "product_category", "product_name"]

products_df = spark.createDataFrame(products_data, products_columns)

# Show DataFrames (optional)
customer_contracts_df.show()
products_df.show()


+-----------+----------+------+
|customer_id|product_id|amount|
+-----------+----------+------+
|          1|         1|  1000|
|          1|         3|  2000|
|          1|         5|  1500|
|          2|         2|  3000|
|          2|         6|  2000|
+-----------+----------+------+

+----------+----------------+--------------------+
|product_id|product_category|        product_name|
+----------+----------------+--------------------+
|         1|       Analytics|    Azure Databricks|
|         2|       Analytics|Azure Stream Anal...|
|         3|      Containers|Azure Kubernetes ...|
|         4|      Containers|Azure Service Fabric|
|         5|         Compute|    Virtual Machines|
|         6|         Compute|     Azure Functions|
+----------+----------------+--------------------+



In [2]:
# Number of distinct categories in the product catalog
total_categories = products_df.agg(countDistinct('product_category')).collect()[0][0]

# Count the distinct categories each customer bought from; keep those covering all
(customer_contracts_df
    .join(products_df, on='product_id')
    .groupBy('customer_id')
    .agg(countDistinct('product_category').alias('cnt'))
    .where(col('cnt') == total_categories)
    .select('customer_id')
    .show())

+-----------+
|customer_id|
+-----------+
|          1|
+-----------+



In [3]:
customer_contracts_df.createOrReplaceTempView('customer_contracts')
products_df.createOrReplaceTempView('products')

spark.sql(
'''
SELECT c.customer_id
FROM customer_contracts c
JOIN products p
  ON c.product_id = p.product_id
GROUP BY c.customer_id
HAVING COUNT(DISTINCT p.product_category) = (SELECT COUNT(DISTINCT product_category) FROM products)
''').show()

+-----------+
|customer_id|
+-----------+
|          1|
+-----------+
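
As an alternative to comparing counts, the same question can be phrased as relational division: keep a customer only if there is no category they have not bought from. The sketch below is an assumption-laden illustration, not part of the original notebook. It uses Python's built-in `sqlite3` module as a stand-in engine so it runs without Spark, and it recreates the sample data with product IDs 1 through 6.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark temp views
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_contracts (customer_id INTEGER, product_id INTEGER, amount INTEGER);
CREATE TABLE products (product_id INTEGER, product_category TEXT, product_name TEXT);
INSERT INTO customer_contracts VALUES
    (1, 1, 1000), (1, 3, 2000), (1, 5, 1500), (2, 2, 3000), (2, 6, 2000);
INSERT INTO products VALUES
    (1, 'Analytics', 'Azure Databricks'),
    (2, 'Analytics', 'Azure Stream Analytics'),
    (3, 'Containers', 'Azure Kubernetes Service'),
    (4, 'Containers', 'Azure Service Fabric'),
    (5, 'Compute', 'Virtual Machines'),
    (6, 'Compute', 'Azure Functions');
""")

# Double NOT EXISTS: no catalog category exists for which the customer
# lacks a matching purchase.
query = """
SELECT DISTINCT c.customer_id
FROM customer_contracts AS c
WHERE NOT EXISTS (
    SELECT 1
    FROM products AS p
    WHERE NOT EXISTS (
        SELECT 1
        FROM customer_contracts AS c2
        JOIN products AS p2 ON p2.product_id = c2.product_id
        WHERE c2.customer_id = c.customer_id
          AND p2.product_category = p.product_category
    )
)
ORDER BY c.customer_id
"""

supercloud_ids = [row[0] for row in conn.execute(query)]
print(supercloud_ids)  # [1]
```

The count-based `HAVING` query is usually shorter and easier to read. The `NOT EXISTS` form states the "every category" requirement directly, without comparing two separately computed counts.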

