# Histogram of Users and Purchases
**Walmart SQL Interview Question**

### Question
Assume you are given a table of Walmart user transactions. For each user, find their most recent transaction date and count the number of products they bought on that date.

Output the user's most recent transaction date, user ID, and the number of products, sorted in chronological order by the transaction date.

### user_transactions Table:
| Column Name      | Type       |
|------------------|------------|
| product_id       | integer    |
| user_id          | integer    |
| spend            | decimal    |
| transaction_date | timestamp  |

### user_transactions Example Input:
| product_id | user_id | spend   | transaction_date       |
|------------|---------|---------|------------------------|
| 3673       | 123     | 68.90   | 07/08/2022 12:00:00    |
| 9623       | 123     | 274.10  | 07/08/2022 12:00:00    |
| 1467       | 115     | 19.90   | 07/08/2022 12:00:00    |
| 2513       | 159     | 25.00   | 07/08/2022 12:00:00    |
| 1452       | 159     | 74.50   | 07/10/2022 12:00:00    |

### Example Output:
| transaction_date        | user_id | purchase_count |
|-------------------------|---------|----------------|
| 07/08/2022 12:00:00     | 115     | 1              |
| 07/08/2022 12:00:00     | 123     | 2              |
| 07/10/2022 12:00:00     | 159     | 1              |

### Explanation:
- User 115 made 1 purchase on 07/08/2022.
- User 123 made 2 purchases on 07/08/2022.
- User 159 made 1 purchase on 07/10/2022, their most recent transaction date; the earlier purchase on 07/08/2022 is not counted.
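
To make the expected output concrete before turning to Spark, here is a minimal plain-Python sketch that reproduces the rows above from the example input. The function name `latest_purchase_counts` and the `user_id` tie-breaker in the sort are assumptions made for illustration; the question itself only asks for chronological order.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the example input: (product_id, user_id, spend, transaction_date)
rows = [
    (3673, 123, 68.90, datetime(2022, 7, 8, 12)),
    (9623, 123, 274.10, datetime(2022, 7, 8, 12)),
    (1467, 115, 19.90, datetime(2022, 7, 8, 12)),
    (2513, 159, 25.00, datetime(2022, 7, 8, 12)),
    (1452, 159, 74.50, datetime(2022, 7, 10, 12)),
]


def latest_purchase_counts(rows):
    """Return (transaction_date, user_id, purchase_count) for each user's latest date."""
    # Step 1: find each user's most recent transaction timestamp.
    latest = {}
    for _, user_id, _, ts in rows:
        if user_id not in latest or ts > latest[user_id]:
            latest[user_id] = ts

    # Step 2: count only the rows that fall on that timestamp.
    counts = defaultdict(int)
    for _, user_id, _, ts in rows:
        if ts == latest[user_id]:
            counts[user_id] += 1

    # Step 3: sort chronologically; user_id is an assumed tie-breaker for a stable order.
    return sorted((latest[u], u, c) for u, c in counts.items())


result = latest_purchase_counts(rows)
for ts, user_id, purchase_count in result:
    print(ts, user_id, purchase_count)
```

The same three steps (latest date per user, filter to that date, count) are what the window-function solutions express in a single pass.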


In [6]:
from datetime import datetime

from pyspark.sql import SparkSession
# Import only the functions used in this notebook; a wildcard import would shadow Python built-ins
from pyspark.sql.functions import col, count, rank

# Initialize Spark session
spark = SparkSession.builder.master('local[1]').appName("WalmartUserTransactions").getOrCreate()

# Define the data
data = [
    (3673, 123, 68.90, datetime(2022, 7, 8, 12, 0, 0)),
    (9623, 123, 274.10, datetime(2022, 7, 8, 12, 0, 0)),
    (1467, 115, 19.90, datetime(2022, 7, 8, 12, 0, 0)),
    (2513, 159, 25.00, datetime(2022, 7, 8, 12, 0, 0)),
    (1452, 159, 74.50, datetime(2022, 7, 10, 12, 0, 0))
]

# Define the column names
columns = ['product_id', 'user_id', 'spend', 'transaction_date']

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show(truncate=False)


+----------+-------+-----+-------------------+
|product_id|user_id|spend|transaction_date   |
+----------+-------+-----+-------------------+
|3673      |123    |68.9 |2022-07-08 12:00:00|
|9623      |123    |274.1|2022-07-08 12:00:00|
|1467      |115    |19.9 |2022-07-08 12:00:00|
|2513      |159    |25.0 |2022-07-08 12:00:00|
|1452      |159    |74.5 |2022-07-10 12:00:00|
+----------+-------+-----+-------------------+



In [19]:
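from pyspark.sql.window import Window

# Rank each user's rows from newest to oldest; rank() gives tied timestamps the same rank
winspec = Window.partitionBy('user_id').orderBy(col('transaction_date').desc())

# Keep only the rows on each user's latest date, count them, then sort chronologically
# (user_id is added as a tie-breaker so the row order is deterministic)
df\
    .withColumn('rnk', rank().over(winspec))\
    .where('rnk = 1')\
    .groupBy('transaction_date', 'user_id')\
    .agg(count('product_id').alias('purchase_count'))\
    .orderBy('transaction_date', 'user_id')\
    .show()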

+-------------------+-------+--------------+
|   transaction_date|user_id|purchase_count|
+-------------------+-------+--------------+
|2022-07-08 12:00:00|    115|             1|
|2022-07-08 12:00:00|    123|             2|
|2022-07-10 12:00:00|    159|             1|
+-------------------+-------+--------------+
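
The cell above relies on `rank()` assigning the same rank to rows that share a timestamp, which is why both of user 123's purchases survive the `rnk = 1` filter. The sketch below emulates that behaviour in plain Python and contrasts it with a `row_number()`-style position. It is an illustration only, not Spark's implementation, and the helper name `rank_desc` is hypothetical.

```python
from collections import Counter
from datetime import datetime
from itertools import groupby

# (product_id, user_id, transaction_date) taken from the example input
rows = [
    (3673, 123, datetime(2022, 7, 8, 12)),
    (9623, 123, datetime(2022, 7, 8, 12)),
    (1467, 115, datetime(2022, 7, 8, 12)),
    (2513, 159, datetime(2022, 7, 8, 12)),
    (1452, 159, datetime(2022, 7, 10, 12)),
]


def rank_desc(rows):
    """Emulate rank() and row_number() per user, ordered by transaction_date descending.

    Returns a list of (row, rank, row_number) tuples.
    """
    ranked = []
    by_user = sorted(rows, key=lambda r: r[1])
    for _, group in groupby(by_user, key=lambda r: r[1]):
        ordered = sorted(group, key=lambda r: r[2], reverse=True)
        for position, row in enumerate(ordered, start=1):
            if position > 1 and row[2] == ordered[position - 2][2]:
                # Tied timestamp: reuse the rank of the previous row.
                rnk = ranked[-1][1]
            else:
                rnk = position
            ranked.append((row, rnk, position))
    return ranked


ranked = rank_desc(rows)

# Rows kept per user when filtering on rank = 1 versus row_number = 1
kept_by_rank = Counter(row[1] for row, rnk, _ in ranked if rnk == 1)
kept_by_row_number = Counter(row[1] for row, _, pos in ranked if pos == 1)

print(dict(kept_by_rank))
print(dict(kept_by_row_number))
```

Filtering on the rank keeps two rows for user 123, while filtering on the row number keeps only one, so a `row_number()`-based filter would undercount that user's purchases.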



In [7]:
df.createOrReplaceTempView('user_transactions')
spark.sql("""
WITH cte AS (
    SELECT *,
           RANK() OVER (PARTITION BY user_id ORDER BY transaction_date DESC) AS rnk
    FROM user_transactions
)
SELECT transaction_date,
       user_id,
       COUNT(product_id) AS purchase_count
FROM cte
WHERE rnk = 1
GROUP BY transaction_date, user_id
ORDER BY transaction_date, user_id
""").show()

+-------------------+-------+--------------+
|   transaction_date|user_id|purchase_count|
+-------------------+-------+--------------+
|2022-07-08 12:00:00|    115|             1|
|2022-07-08 12:00:00|    123|             2|
|2022-07-10 12:00:00|    159|             1|
+-------------------+-------+--------------+

