# User Shopping Sprees
**Amazon SQL Interview Question**

### Question
In an effort to identify high-value customers, Amazon asked for your help to obtain data about users who go on shopping sprees. A shopping spree occurs when a user makes purchases on 3 or more consecutive days.

List the IDs of users who have gone on at least one shopping spree, sorted in ascending order.

### transactions Table:
| Column Name       | Type      |
|-------------------|-----------|
| user_id           | integer   |
| amount            | float     |
| transaction_date  | timestamp |

### transactions Example Input:
| user_id | amount  | transaction_date        |
|---------|---------|-------------------------|
| 1       | 9.99    | 08/01/2022 10:00:00     |
| 1       | 55      | 08/17/2022 10:00:00     |
| 2       | 149.5   | 08/05/2022 10:00:00     |
| 2       | 4.89    | 08/06/2022 10:00:00     |
| 2       | 34      | 08/07/2022 10:00:00     |

### Example Output:
| user_id |
|---------|
| 2       |

### Explanation:
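In this example, user_id 2 is the only one who has gone on a shopping spree: they made purchases on three consecutive days (08/05, 08/06, and 08/07). The two purchases by user_id 1 are 16 days apart, so they do not form a spree.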
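The spree definition can be checked without Spark. Below is a minimal plain-Python sketch, assuming the rows have already been reduced to `(user_id, date)` pairs; the helper name `spree_users` is hypothetical and not part of the original question.

```python
from datetime import date, timedelta

# Sample rows mirroring the example input: (user_id, purchase date).
rows = [
    (1, date(2022, 8, 1)),
    (1, date(2022, 8, 17)),
    (2, date(2022, 8, 5)),
    (2, date(2022, 8, 6)),
    (2, date(2022, 8, 7)),
]

def spree_users(rows):
    """Return sorted user IDs with purchases on 3 consecutive calendar days."""
    # Collect the distinct purchase days per user (hypothetical helper logic).
    days_by_user = {}
    for user_id, day in rows:
        days_by_user.setdefault(user_id, set()).add(day)

    one_day = timedelta(days=1)
    # A user qualifies if some day d has purchases on d+1 and d+2 as well.
    return sorted(
        user_id
        for user_id, days in days_by_user.items()
        if any(d + one_day in days and d + 2 * one_day in days for d in days)
    )

print(spree_users(rows))  # prints [2]
```

Storing each user's days in a set means several purchases on the same day count once, which matches the "consecutive days" wording of the question.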


In [4]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, TimestampType
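# Import only the functions used below; a wildcard import would shadow Python builtins such as sum and max
from pyspark.sql.functions import col, date_add, to_date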
from datetime import datetime

# Create Spark session
spark = SparkSession.builder.master('local[1]').appName("UserShoppingSprees").getOrCreate()

# Define the schema for transactions table
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("amount", FloatType(), True),
    StructField("transaction_date", TimestampType(), True)
])

# Define the data for transactions table with datetime objects
data = [
    (1, 9.99, datetime.strptime("2022-08-01 10:00:00", "%Y-%m-%d %H:%M:%S")),
    (1, 55.00, datetime.strptime("2022-08-17 10:00:00", "%Y-%m-%d %H:%M:%S")),
    (2, 149.5, datetime.strptime("2022-08-05 10:00:00", "%Y-%m-%d %H:%M:%S")),
    (2, 4.89, datetime.strptime("2022-08-06 10:00:00", "%Y-%m-%d %H:%M:%S")),
    (2, 34.00, datetime.strptime("2022-08-07 10:00:00", "%Y-%m-%d %H:%M:%S"))
]



# Create the Spark DataFrame
transactions_df = spark.createDataFrame(data, schema=schema)
transactions_df = transactions_df.withColumn("transaction_date", to_date("transaction_date"))
# Show the DataFrame
transactions_df.show(truncate=False)



+-------+------+----------------+
|user_id|amount|transaction_date|
+-------+------+----------------+
|1      |9.99  |2022-08-01      |
|1      |55.0  |2022-08-17      |
|2      |149.5 |2022-08-05      |
|2      |4.89  |2022-08-06      |
|2      |34.0  |2022-08-07      |
+-------+------+----------------+



In [5]:
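# Self-join the table twice: t2 must hold a purchase one day before t1's,
# and t3 a purchase two days before, so t1 marks the third day of a spree.
transactions_df.alias('t1').join(
    transactions_df.alias('t2'),
    (col('t1.user_id') == col('t2.user_id')) & (col('t1.transaction_date') == date_add(col('t2.transaction_date'), 1)),
    'inner')\
    .join(transactions_df.alias('t3'),
          (col('t1.user_id') == col('t3.user_id')) & (col('t1.transaction_date') == date_add(col('t3.transaction_date'), 2)),
    'inner')\
    .select('t1.user_id')\
    .distinct()\
    .orderBy('user_id')\
    .show()
# distinct() keeps a user from appearing more than once when they have
# several sprees or several purchases on the same day.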

+-------+
|user_id|
+-------+
|      2|
+-------+



In [6]:
transactions_df.createOrReplaceTempView('transactions')

spark.sql(
    """
    -- DISTINCT removes repeated users; ORDER BY gives the requested ascending order
    SELECT DISTINCT t1.user_id
    FROM transactions t1
    JOIN transactions t2
      ON t1.user_id = t2.user_id
     AND t1.transaction_date = date_add(t2.transaction_date, 1)
    JOIN transactions t3
      ON t1.user_id = t3.user_id
     AND t1.transaction_date = date_add(t3.transaction_date, 2)
    ORDER BY user_id
    """
).show()

+-------+
|user_id|
+-------+
|      2|
+-------+
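The self-join hard-codes a spree length of three days. As an alternative, the "gaps and islands" idea generalizes to any length: within a run of consecutive days, the day's ordinal minus its position in the sorted list stays constant. The sketch below shows that idea in plain Python rather than Spark; the function name `longest_streaks` and the `sample` list are hypothetical, and this is a swapped-in technique, not the approach used in the cells above.

```python
from datetime import date
from itertools import groupby

def longest_streaks(rows):
    """Map each user_id to the length of their longest run of consecutive purchase days."""
    # Distinct purchase days per user, so same-day purchases count once.
    days_by_user = {}
    for user_id, day in rows:
        days_by_user.setdefault(user_id, set()).add(day)

    result = {}
    for user_id, days in days_by_user.items():
        ordered = sorted(days)
        # (day ordinal - position) is constant inside a run of consecutive days,
        # so grouping on that key splits the dates into "islands".
        islands = groupby(enumerate(ordered), key=lambda p: p[1].toordinal() - p[0])
        result[user_id] = max(sum(1 for _ in group) for _, group in islands)
    return result

# Same data as the example input, reduced to (user_id, date) pairs.
sample = [
    (1, date(2022, 8, 1)),
    (1, date(2022, 8, 17)),
    (2, date(2022, 8, 5)),
    (2, date(2022, 8, 6)),
    (2, date(2022, 8, 7)),
]

streaks = longest_streaks(sample)
print(sorted(user_id for user_id, n in streaks.items() if n >= 3))  # prints [2]
```

Changing the `n >= 3` threshold is enough to look for longer or shorter sprees, whereas the self-join would need one extra join per additional day.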

