## 1211. Queries Quality and Percentage
## Table: Queries

| Column Name | Type    |
|-------------|---------|
| query_name  | varchar |
| result      | varchar |
| position    | int     |
| rating      | int     |

### Primary Key:
This table may have duplicate rows.

### Column Descriptions:
- `query_name`: Name of the query.
- `result`: Result returned by the query.
- `position`: Position of the result in the list (1 to 500).
- `rating`: Rating of the result (1 to 5). A query with a rating less than 3 is a poor query.

---

This table contains information collected from some queries on a database.

We define query quality as:

The average of the ratio between query rating and its position.

We also define poor query percentage as:

The percentage of all queries with rating less than 3.

Write a solution to find each query_name, the quality and poor_query_percentage.

Both quality and poor_query_percentage should be rounded to 2 decimal places.

Return the result table in any order.

The result format is in the following example.

---

### Example 1:

**Input:**

Queries table:  
| query_name | result            | position | rating |
|------------|-------------------|----------|--------|
| Dog        | Golden Retriever  | 1        | 5      |
| Dog        | German Shepherd   | 2        | 5      |
| Dog        | Mule              | 200      | 1      |
| Cat        | Shirazi           | 5        | 2      |
| Cat        | Siamese           | 3        | 3      |
| Cat        | Sphynx            | 7        | 4      |

**Output:**

| query_name | quality | poor_query_percentage |
|------------|---------|-----------------------|
| Dog        | 2.50    | 33.33                 |
| Cat        | 0.66    | 33.33                 |

**Explanation:**  
Dog queries quality is ((5 / 1) + (5 / 2) + (1 / 200)) / 3 = 2.50  
Dog queries poor_query_percentage is (1 / 3) * 100 = 33.33  

Cat queries quality equals ((2 / 5) + (3 / 3) + (4 / 7)) / 3 = 0.66  
Cat queries poor_query_percentage is (1 / 3) * 100 = 33.33
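
The arithmetic in the explanation can be checked without Spark. The following is a minimal plain-Python sketch that assumes only the six sample rows above; the helper name `query_stats` is hypothetical and not part of the original problem.

```python
from collections import defaultdict

# Sample rows from Example 1: (query_name, result, position, rating)
ROWS = [
    ("Dog", "Golden Retriever", 1, 5),
    ("Dog", "German Shepherd", 2, 5),
    ("Dog", "Mule", 200, 1),
    ("Cat", "Shirazi", 5, 2),
    ("Cat", "Siamese", 3, 3),
    ("Cat", "Sphynx", 7, 4),
]


def query_stats(rows):
    """Return {query_name: (quality, poor_query_percentage)}, rounded to 2 places."""
    groups = defaultdict(list)
    for query_name, _result, position, rating in rows:
        groups[query_name].append((position, rating))

    stats = {}
    for query_name, pairs in groups.items():
        # quality: average of rating / position over the group's rows
        quality = sum(rating / position for position, rating in pairs) / len(pairs)
        # poor_query_percentage: share of rows with rating < 3, as a percentage
        poor = sum(1 for _position, rating in pairs if rating < 3)
        stats[query_name] = (round(quality, 2), round(100 * poor / len(pairs), 2))
    return stats


print(query_stats(ROWS))
# {'Dog': (2.5, 33.33), 'Cat': (0.66, 33.33)}
```

Python prints the float `2.5` rather than `2.50`; the value is the same, only the displayed scale differs.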

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, avg, round, count, sum, when

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("query_name", StringType(), True),
    StructField("result", StringType(), True),
    StructField("position", IntegerType(), True),
    StructField("rating", IntegerType(), True)
])

# Sample data
data = [
    ("Dog", "Golden Retriever", 1, 5),
    ("Dog", "German Shepherd", 2, 5),
    ("Dog", "Mule", 200, 1),
    ("Cat", "Shirazi", 5, 2),
    ("Cat", "Siamese", 3, 3),
    ("Cat", "Sphynx", 7, 4)
]

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView("Queries")

# SQL logic
spark.sql("""
    SELECT
        query_name,
        ROUND(AVG(rating * 1.0 / position), 2) AS quality,
        ROUND(SUM(CASE WHEN rating < 3 THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS poor_query_percentage
    FROM Queries
    GROUP BY query_name
""").createOrReplaceTempView("QueryStats")

# Display result
display(spark.sql("SELECT * FROM QueryStats"))
```

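```python
# DataFrame API equivalent of the SQL above.
# Casting to decimal(5,2) fixes the scale at two places and leaves room for 100.00.
df_agg = df.groupBy("query_name").agg(
    avg(col("rating") * 1.0 / col("position")).cast("decimal(5,2)").alias("quality"),
    (sum(when(col("rating") < 3, 1).otherwise(0)) * 100.0 / count("*"))
    .cast("decimal(5,2)")
    .alias("poor_query_percentage"),
)

# Display in a separate step so df_agg holds the DataFrame,
# not the return value of .display().
display(df_agg)
```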

Great question, Shubham — you're close, but there are a few subtle mistakes in your PySpark chain. Let’s break it down and fix it step by step:

---

### ❌ Mistakes in Your Code

```python
df_2 = df.groupBy(col("query_name")).agg(
    avg(1.0 * col("rating") / col("position")).alias("quality").cast("decimal(4,2)")
).withColumn(
    "poor_query_percentage", 
    sum(when(col("rating") < 3, 1).otherwise 0 )/count(*)
).display()
```

#### 🔴 Mistake 1: `.cast()` is applied **after** `.alias()`
- The chain calls `.cast()` on the already-aliased column, so the cast wraps the alias and the output column typically no longer carries the name `quality`.
- Apply `.cast()` first and `.alias()` last.

#### 🔴 Mistake 2: `otherwise 0` is missing parentheses
- `otherwise` is a method — it needs to be called like `otherwise(0)`.

#### 🔴 Mistake 3: `count(*)` is not valid Python
- A bare `*` is SQL syntax; inside a Python call it is a syntax error. Pass the star as a string, `count("*")`, or as a column, `count(col("*"))`.

#### 🔴 Mistake 4: `withColumn()` after `groupBy().agg()` won’t work as expected
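- After aggregation, only `query_name` and the aggregated columns remain, so `rating` can no longer be resolved.
- You need to compute both metrics **inside the same `agg()` call**.
- The ratio also has to be multiplied by 100 to express it as a percentage.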

---

### ✅ Corrected Version

```python
from pyspark.sql.functions import col, avg, sum, count, when, round

df_2 = df.groupBy("query_name").agg(
    round(avg(col("rating") * 1.0 / col("position")), 2).alias("quality"),
    round(
        sum(when(col("rating") < 3, 1).otherwise(0)) * 100.0 / count("*"),
        2
    ).alias("poor_query_percentage")
)

display(df_2)
```

---

### 🧠 Pro Tip
If you really want to cast to `DecimalType(4,2)` instead of rounding, you can use:

```python
from pyspark.sql.types import DecimalType

(col("rating") * 1.0 / col("position")).cast(DecimalType(4,2))
```

But for most reporting and display purposes, `round(..., 2)` is simpler and safer: it returns a double, so there is no declared precision to overflow.
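
One practical difference between the two approaches is the displayed scale. As a plain-Python analogy (this is not Spark itself, and the variable names are illustrative), rounding a float does not keep trailing zeros, while a fixed-scale decimal does:

```python
from decimal import Decimal, ROUND_HALF_UP

# Dog quality from Example 1
quality = (5 / 1 + 5 / 2 + 1 / 200) / 3

# Rounding a float keeps the value a float; trailing zeros are not stored.
rounded_float = round(quality, 2)

# Quantizing to two places yields a fixed-scale decimal, analogous to DECIMAL(p, 2).
fixed_scale = Decimal(str(quality)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(rounded_float)  # 2.5
print(fixed_scale)    # 2.50
```

If the expected output must show exactly two decimal places, the decimal form is the closer match; if only the numeric value matters, rounding is enough.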

Let me know if you want this wrapped into a reusable function or added to a pipeline!


Thanks for sharing the full error, Shubham — this is a classic precision mismatch in PySpark when casting floating-point values to a `DecimalType`.

---

### ⚠️ Root Cause

You're casting a `Double` (from `avg(...)` or a division) to `Decimal(3,2)`, which means:
- **3 total digits**, **2 after the decimal** → the largest value allowed is `9.99`.

`quality = 2.50` fits in that range, but `poor_query_percentage = 33.33` does not. `Decimal(4,2)` would hold `33.33`, but the percentage can reach `100.00`, so it needs at least `Decimal(5,2)`.
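
The range of a `DECIMAL(precision, scale)` type can be worked out by hand. The sketch below uses Python's standard `decimal` module to mirror that range check; `fits_decimal` is a hypothetical helper written for illustration, not a Spark API, and it assumes half-up rounding to the target scale before the range test.

```python
from decimal import Decimal, ROUND_HALF_UP


def fits_decimal(value, precision, scale):
    """Check whether value, rounded to `scale` places, fits DECIMAL(precision, scale)."""
    step = Decimal(1).scaleb(-scale)                     # e.g. 0.01 for scale 2
    largest = Decimal(10) ** (precision - scale) - step  # e.g. 9.99 for (3, 2)
    rounded = Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP)
    return abs(rounded) <= largest


print(fits_decimal(2.50, 3, 2))    # True
print(fits_decimal(33.33, 3, 2))   # False
print(fits_decimal(33.33, 4, 2))   # True
print(fits_decimal(100.0, 4, 2))   # False
print(fits_decimal(100.0, 5, 2))   # True
```

The last two lines show why `decimal(5,2)` is the smallest type that is safe for a percentage: a group in which every row is poor yields `100.00`.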

---

### ✅ Fix Options

#### Option 1: Use a wider decimal type
```python
.cast("decimal(5,2)")
```
This allows values up to `999.99`, which safely covers your use case.

#### Option 2: Use `round()` instead of `cast()`
```python
round(avg(...), 2)
```
This avoids precision errors and is more display-friendly.

#### Option 3: Disable ANSI mode (not recommended unless you're okay with silent nulls)
```python
spark.conf.set("spark.sql.ansi.enabled", "false")
```
This will suppress the error and return `null` for out-of-range casts — but can hide bugs.

---

### 🔧 Recommended Fix

Here’s your corrected snippet:

```python
df_agg = df.groupBy("query_name").agg(
    (avg(col("rating") * 1.0 / col("position")).cast("decimal(5,2)")).alias("quality"),
    ((sum(when(col("rating") < 3, 1).otherwise(0)) * 100.0 / count("*")).cast("decimal(5,2)")).alias("poor_query_percentage")
)

display(df_agg)
```

Let me know if you want to format the output as strings like `"2.50"` or `"33.33"` for presentation or export!
