**Problem:**  
You have a DataFrame `df` with the following data:

| transaction_id | product_id | quantity | price |
|----------------|------------|----------|-------|
| 1              | 1          | 2        | 1000  |
| 2              | 2          | 3        | 500   |
| 3              | 1          | 1        | 1000  |
| 4              | 3          | 4        | 300   |

**Task:** Calculate the total sales amount for each product. Ensure that each product appears only once in the result.


In [None]:
import pandas as pd

data = {
    'transaction_id': [1, 2, 3, 4],
    'product_id': [1, 2, 1, 3],
    'quantity': [2, 3, 1, 4],
    'price': [1000, 500, 1000, 300]
}

df = pd.DataFrame(data)

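# Revenue per transaction: units sold times unit price
df['total_sales'] = df['quantity'] * df['price']

# Sum revenue per product; reset_index() turns product_id back into a column,
# so each product appears exactly once as its own row
result = df.groupby('product_id')['total_sales'].sum().reset_index()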
print(result)
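
As a cross-check on the solution above, the per-product totals can be worked out by hand: product 1 is 2 × 1000 + 1 × 1000 = 3000, product 2 is 3 × 500 = 1500, and product 3 is 4 × 300 = 1200. The sketch below is one possible variant that computes the same totals without adding a column to the original frame. The names `sales` and `totals` are illustrative and not part of the original solution.

```python
import pandas as pd

# Same sample data as the problem statement; `sales` is an illustrative name
sales = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4],
    'product_id': [1, 2, 1, 3],
    'quantity': [2, 3, 1, 4],
    'price': [1000, 500, 1000, 300],
})

# assign() returns a new frame, so `sales` itself is left unchanged;
# as_index=False keeps product_id as a regular column in the output
totals = (
    sales.assign(total_sales=sales['quantity'] * sales['price'])
         .groupby('product_id', as_index=False)['total_sales']
         .sum()
)
print(totals)
```

Whether to mutate the frame or chain with `assign` is mostly a matter of taste; the chained form is convenient when the intermediate `total_sales` column is not needed afterwards.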


**Problem:**  
You have a PySpark DataFrame `df` with the following data:

| order_id | customer_id | order_date        | amount |
|----------|-------------|-------------------|--------|
| 1        | 101         | 2024-08-01 12:00  | 200    |
| 2        | 102         | 2024-08-02 14:00  | NULL   |
| 3        | 101         | 2024-08-03 09:00  | 300    |
| 4        | 103         | 2024-08-04 10:00  | NULL   |

**Task:** Fill the missing values in the `amount` column with the average amount for each customer. Then, calculate the running total of the amount for each customer, ordered by `order_date`.


In [None]:
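from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # namespaced import avoids shadowing Python's built-in sum()
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

data = [
    (1, 101, '2024-08-01 12:00', 200),
    (2, 102, '2024-08-02 14:00', None),
    (3, 101, '2024-08-03 09:00', 300),
    (4, 103, '2024-08-04 10:00', None)
]

columns = ['order_id', 'customer_id', 'order_date', 'amount']
df = spark.createDataFrame(data, schema=columns)

# Average amount per customer. avg() ignores NULLs, so a customer whose
# amounts are all NULL (102 and 103 here) gets a NULL average.
avg_amount_df = df.groupBy('customer_id').agg(F.avg('amount').alias('avg_amount'))

# Fill missing amounts with the customer's average, then drop the helper column.
# Rows for customers 102 and 103 stay NULL because there is no average to fill with.
df_filled = (
    df.join(avg_amount_df, on='customer_id', how='left')
      .withColumn('amount', F.coalesce(F.col('amount'), F.col('avg_amount')))
      .drop('avg_amount')
)

# Running total per customer, ordered by order_date. The dates are strings in
# 'yyyy-MM-dd HH:mm' format, which sort chronologically as text.
window_spec = (
    Window.partitionBy('customer_id')
          .orderBy('order_date')
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df_result = df_filled.withColumn('running_total', F.sum('amount').over(window_spec))

df_result.show()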
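
To sanity-check the expected values without a Spark session, the same fill-then-accumulate logic can be sketched in pandas. This is only a cross-check under the assumption that the sample data is small enough to fit in memory; it is not a replacement for the PySpark solution, and the names `orders`, `customer_mean`, and `result` are illustrative.

```python
import pandas as pd

# Same sample data as the PySpark problem; None becomes NaN in the float column
orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 101, 103],
    'order_date': pd.to_datetime([
        '2024-08-01 12:00', '2024-08-02 14:00',
        '2024-08-03 09:00', '2024-08-04 10:00',
    ]),
    'amount': [200, None, 300, None],
})

# Per-customer mean broadcast back to every row; NaN values are skipped,
# so customers with no known amount get NaN as their mean
customer_mean = orders.groupby('customer_id')['amount'].transform('mean')

# Fill gaps with the customer's mean, then order rows within each customer
result = (
    orders.assign(amount=orders['amount'].fillna(customer_mean))
          .sort_values(['customer_id', 'order_date'])
)

# Running total per customer in date order
result['running_total'] = result.groupby('customer_id')['amount'].cumsum()
print(result)
```

Customer 101 has no missing amounts, so the running total goes from 200 to 500. Customers 102 and 103 have no known amounts at all, so there is no average to fill with and both `amount` and `running_total` stay missing, which matches the NULLs the PySpark version produces for those rows.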
