# User's Third Transaction

## Uber SQL Interview Question

### Question  
This is the same question as problem #11 in the SQL Chapter of *Ace the Data Science Interview!*

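Assume you are given the table below containing Uber transactions made by users. Write a query to obtain the **third transaction** of every user, ordered by `transaction_date`. Users with fewer than three transactions do not appear in the result.  
Output the `user_id`, `spend`, and `transaction_date`.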

---

### `transactions` Table:

| Column Name       | Type      |
|-------------------|-----------|
| user_id           | integer   |
| spend             | decimal   |
| transaction_date  | timestamp |

---

### Example Input:

| user_id | spend  | transaction_date     |
|---------|--------|----------------------|
| 111     | 100.50 | 01/08/2022 12:00:00  |
| 111     | 55.00  | 01/10/2022 12:00:00  |
| 121     | 36.00  | 01/18/2022 12:00:00  |
| 145     | 24.99  | 01/26/2022 12:00:00  |
| 111     | 89.60  | 02/05/2022 12:00:00  |

---

### Example Output:

| user_id | spend  | transaction_date     |
|---------|--------|----------------------|
| 111     | 89.60  | 02/05/2022 12:00:00  |

---

### Explanation:

User **111** is the only one with three or more transactions.  
Their **third transaction** occurred on **02/05/2022** with a spend of **89.60**.
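
The reasoning above can be sketched in plain Python before moving to Spark: group the rows by user, sort each group chronologically, and keep the third row where one exists. This is only an illustrative sketch; the helper name `third_transactions` is hypothetical, and the `MM/DD/YYYY` date format is an assumption inferred from the example input.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the example input: (user_id, spend, transaction_date)
rows = [
    (111, 100.50, "01/08/2022 12:00:00"),
    (111, 55.00, "01/10/2022 12:00:00"),
    (121, 36.00, "01/18/2022 12:00:00"),
    (145, 24.99, "01/26/2022 12:00:00"),
    (111, 89.60, "02/05/2022 12:00:00"),
]


def third_transactions(rows):
    """Return each user's third transaction, ordered by transaction date."""
    by_user = defaultdict(list)
    for user_id, spend, date_str in rows:
        # Assumed format: month/day/year, as in the example tables
        ts = datetime.strptime(date_str, "%m/%d/%Y %H:%M:%S")
        by_user[user_id].append((ts, spend, date_str))

    result = []
    for user_id, txns in sorted(by_user.items()):
        txns.sort(key=lambda t: t[0])  # chronological order
        if len(txns) >= 3:  # users with fewer than three rows are skipped
            _, spend, date_str = txns[2]
            result.append((user_id, spend, date_str))
    return result


print(third_transactions(rows))
# [(111, 89.6, '02/05/2022 12:00:00')]
```

Sorting per user and indexing position 2 is the same idea that a `row_number()` window with a filter on 3 expresses declaratively.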



In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.types import (
    FloatType, IntegerType, StringType, StructField, StructType, TimestampType
)

# Create Spark session
spark = SparkSession.builder.master('local[1]').appName("UsersThirdTransaction").getOrCreate()

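# Define schema: spend uses FloatType for simplicity (the problem lists it as decimal),
# and transaction_date is loaded as a string, then cast to a timestamp below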
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("spend", FloatType(), True),
    StructField("transaction_date", StringType(), True),
])

# Sample data
data = [
    (111, 100.50, "2022-01-08 12:00:00"),
    (111, 55.00, "2022-01-10 12:00:00"),
    (121, 36.00, "2022-01-18 12:00:00"),
    (145, 24.99, "2022-01-26 12:00:00"),
    (111, 89.60, "2022-02-05 12:00:00"),
]

# Create DataFrame
transactions_df = spark.createDataFrame(data, schema)
# Cast the string column to a timestamp so that ordering is chronological
transactions_df = transactions_df.withColumn(
    "transaction_date", col("transaction_date").cast(TimestampType())
)
# Show DataFrame
transactions_df.show(truncate=False)


+-------+-----+-------------------+
|user_id|spend|transaction_date   |
+-------+-----+-------------------+
|111    |100.5|2022-01-08 12:00:00|
|111    |55.0 |2022-01-10 12:00:00|
|121    |36.0 |2022-01-18 12:00:00|
|145    |24.99|2022-01-26 12:00:00|
|111    |89.6 |2022-02-05 12:00:00|
+-------+-----+-------------------+



In [8]:
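from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# Number each user's transactions in chronological order
win_spec = Window.partitionBy("user_id").orderBy("transaction_date")

(
    transactions_df
    .withColumn("rnk", row_number().over(win_spec))
    .where("rnk = 3")  # keep only the third transaction per user
    .drop("rnk")
    .show()
)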

+-------+-----+-------------------+
|user_id|spend|   transaction_date|
+-------+-----+-------------------+
|    111| 89.6|2022-02-05 12:00:00|
+-------+-----+-------------------+



In [5]:
transactions_df.createOrReplaceTempView('transactions')

spark.sql(
'''
WITH cte AS (
  SELECT
    user_id,
    spend,
    transaction_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS rnk
  FROM transactions
)

SELECT
  user_id,
  spend,
  transaction_date
FROM cte
WHERE rnk = 3
'''
).show()

+-------+-----+-------------------+
|user_id|spend|   transaction_date|
+-------+-----+-------------------+
|    111| 89.6|2022-02-05 12:00:00|
+-------+-----+-------------------+
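
The same CTE can also be checked without a Spark session. The sketch below runs it against an in-memory SQLite database through Python's built-in `sqlite3` module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions) and stores `transaction_date` as ISO-formatted text, which sorts chronologically as a string. It is meant as a quick sanity check, not as a replacement for the Spark solution.

```python
import sqlite3

# In-memory database holding the same sample rows as the Spark DataFrame
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (user_id INTEGER, spend REAL, transaction_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (111, 100.50, "2022-01-08 12:00:00"),
        (111, 55.00, "2022-01-10 12:00:00"),
        (121, 36.00, "2022-01-18 12:00:00"),
        (145, 24.99, "2022-01-26 12:00:00"),
        (111, 89.60, "2022-02-05 12:00:00"),
    ],
)

# Same logic as the Spark SQL query: number rows per user, keep row 3
query = """
WITH cte AS (
  SELECT
    user_id,
    spend,
    transaction_date,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY transaction_date) AS rnk
  FROM transactions
)
SELECT user_id, spend, transaction_date
FROM cte
WHERE rnk = 3
"""

third = conn.execute(query).fetchall()
print(third)
# [(111, 89.6, '2022-02-05 12:00:00')]
conn.close()
```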

