# IBM Db2 Product Analytics  
**IBM SQL Interview Question**

---

## Question

IBM is analyzing how its employees use the Db2 database by tracking the SQL queries they execute.  

The objective is to generate data to populate a histogram that shows the **number of unique queries run by employees during the third quarter of 2023 (July to September)**. Additionally, it should **count the number of employees who did not run any queries** during this period.

---

## Task

Display:
- The number of **unique queries** (as histogram categories)
- The **count of employees** who executed that number of unique queries

---

## Schema

### `queries` Table:
| Column Name      | Type      | Description                                |
|------------------|-----------|--------------------------------------------|
| employee_id      | integer   | The ID of the employee who ran the query   |
| query_id         | integer   | The unique identifier for each query       |
| query_starttime  | datetime  | The timestamp when the query started       |
| execution_time   | integer   | The duration of the query in seconds       |

### `employees` Table:
| Column Name   | Type     | Description                            |
|---------------|----------|----------------------------------------|
| employee_id   | integer  | The ID of the employee                 |
| full_name     | string   | The full name of the employee          |
| gender        | string   | The gender of the employee             |

---

## Example Input

### `queries`
| employee_id | query_id | query_starttime        | execution_time |
|-------------|----------|------------------------|----------------|
| 226         | 856987   | 07/01/2023 01:04:43    | 2698           |
| 132         | 286115   | 07/01/2023 03:25:12    | 2705           |
| 226         | 33683    | 07/01/2023 04:34:38    | 91             |
| 226         | 17745    | 07/01/2023 14:33:47    | 2093           |
| 132         | 413477   | 07/02/2023 10:55:14    | 470            |

### `employees`
| employee_id | full_name         | gender |
|-------------|-------------------|--------|
| 132         | Judas Beardon     | Male   |
| 226         | Lainey Franciotti | Female |
| 349         | Ashbey Strahan    | Male   |
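
The sample timestamps above are written as `MM/DD/YYYY HH:MM:SS`, and the quarter boundary is easiest to test on parsed values. The following is a minimal sketch, assuming that format holds for every row; the helper name `in_q3_2023` is hypothetical and not part of the original question.

```python
from datetime import datetime

# Half-open interval covering July through September 2023
Q3_START = datetime(2023, 7, 1)
Q3_END = datetime(2023, 10, 1)


def in_q3_2023(raw: str) -> bool:
    """Return True if a 'MM/DD/YYYY HH:MM:SS' timestamp falls in Q3 2023."""
    started = datetime.strptime(raw, "%m/%d/%Y %H:%M:%S")
    return Q3_START <= started < Q3_END


print(in_q3_2023("07/01/2023 01:04:43"))  # True: first sample row
print(in_q3_2023("10/01/2023 00:00:00"))  # False: first instant after the quarter
```

Using a half-open interval (`<` on 1 October) avoids having to spell out the last second of 30 September.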

---

## Example Output

| unique_queries | employee_count |
|----------------|----------------|
| 0              | 1              |
| 2              | 1              |
| 3              | 1              |

---

## Explanation

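- In the example, employee **349** ran no queries between **July and September 2023**, so the `0` category has a count of 1.
- Employee **132** ran **2 unique queries** and employee **226** ran **3**, so each of those categories also has a count of 1.
- A larger dataset reads the same way: a row `0 | 191` would mean that **191 employees** did not run any queries, a row `1 | 46` would mean that **46 employees** ran **exactly 1 unique query**, and so on.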

The result is suitable for building a **histogram** to visualize employee activity across the quarter.
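
The counting logic can be sketched in plain Python before moving to Spark. This is an illustrative sketch under stated assumptions, not the reference solution: the function name `query_histogram` is hypothetical, and timestamps are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
from collections import Counter

# Sample rows: (employee_id, query_id, query_starttime)
queries = [
    (226, 856987, "2023-07-01 01:04:43"),
    (132, 286115, "2023-07-01 03:25:12"),
    (226, 33683, "2023-07-01 04:34:38"),
    (226, 17745, "2023-07-01 14:33:47"),
    (132, 413477, "2023-07-02 10:55:14"),
]
employee_ids = [132, 226, 349]


def query_histogram(employee_ids, queries, start="2023-07-01", end="2023-10-01"):
    """Return sorted (unique_queries, employee_count) pairs for [start, end)."""
    # Start every employee with an empty set so non-users are counted as 0
    seen = {emp: set() for emp in employee_ids}
    for emp, query_id, started in queries:
        # ISO-formatted strings sort chronologically, so string comparison works
        if emp in seen and start <= started < end:
            seen[emp].add(query_id)
    counts = Counter(len(ids) for ids in seen.values())
    return sorted(counts.items())


histogram = query_histogram(employee_ids, queries)
print(histogram)  # [(0, 1), (2, 1), (3, 1)]
```

Seeding the dictionary from the employee list, rather than from the query rows, is what makes the zero category appear.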


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Explicit imports avoid shadowing Python built-ins such as sum and max
from pyspark.sql.functions import col, count, countDistinct


# Initialize Spark session
spark = SparkSession.builder.master('local[1]').appName("IBMDb2Analytics").getOrCreate()

# Sample data for `queries` table
queries_data = [
    (132, 286115, "2023-07-01 03:25:12", 2705),
    (226, 33683,  "2023-07-01 04:34:38", 91),
    (226, 856987, "2023-07-01 01:04:43", 2698),
    (226, 17745,  "2023-07-01 14:33:47", 2093),
    (132, 413477, "2023-07-02 10:55:14", 470),
]

# Schema for `queries`
queries_schema = StructType([
    StructField("employee_id", IntegerType(), True),
    StructField("query_id", IntegerType(), True),
    StructField("query_starttime", StringType(), True),
    StructField("execution_time", IntegerType(), True),
])

# Create `queries` DataFrame
queries_df = spark.createDataFrame(data=queries_data, schema=queries_schema)

# Sample data for `employees` table
employees_data = [
    (132, "Judas Beardon", "Male"),
    (226, "Lainey Franciotti", "Female"),
    (349, "Ashbey Strahan", "Male"),
]

# Schema for `employees`
employees_schema = StructType([
    StructField("employee_id", IntegerType(), True),
    StructField("full_name", StringType(), True),
    StructField("gender", StringType(), True),
])

# Create `employees` DataFrame
employees_df = spark.createDataFrame(data=employees_data, schema=employees_schema)

# Show DataFrames
queries_df.show()
employees_df.show()


+-----------+--------+-------------------+--------------+
|employee_id|query_id|    query_starttime|execution_time|
+-----------+--------+-------------------+--------------+
|        132|  286115|2023-07-01 03:25:12|          2705|
|        226|   33683|2023-07-01 04:34:38|            91|
|        226|  856987|2023-07-01 01:04:43|          2698|
|        226|   17745|2023-07-01 14:33:47|          2093|
|        132|  413477|2023-07-02 10:55:14|           470|
+-----------+--------+-------------------+--------------+

+-----------+-----------------+------+
|employee_id|        full_name|gender|
+-----------+-----------------+------+
|        132|    Judas Beardon|  Male|
|        226|Lainey Franciotti|Female|
|        349|   Ashbey Strahan|  Male|
+-----------+-----------------+------+



In [2]:
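# Restrict to Q3 2023 before joining, so the left join still keeps employees with no queries
q3_df = queries_df.filter((col('query_starttime') >= '2023-07-01') & (col('query_starttime') < '2023-10-01'))
employees_df.join(q3_df, ['employee_id'], 'left')\
    .groupBy('employee_id').agg(countDistinct('query_id').alias('unique_queries'))\
    .groupBy('unique_queries').agg(count('employee_id').alias('employee_count'))\
    .orderBy('unique_queries').show()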

+--------------+--------------+
|unique_queries|employee_count|
+--------------+--------------+
|             0|             1|
|             2|             1|
|             3|             1|
+--------------+--------------+
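
The rows above list only the categories that occur, whereas a histogram usually also shows the empty ones in between. The following is a small sketch of a text rendering that fills those gaps; the function name `render_histogram` and the bar character are hypothetical choices for illustration.

```python
def render_histogram(rows, bar="#"):
    """Format (unique_queries, employee_count) pairs as one text bar per category."""
    counts = dict(rows)
    if not counts:
        return ""
    lines = []
    # Walk every category from 0 to the maximum so empty ones are shown too
    for unique_queries in range(max(counts) + 1):
        employee_count = counts.get(unique_queries, 0)
        lines.append(f"{unique_queries:>3} | {bar * employee_count} ({employee_count})")
    return "\n".join(lines)


# Rows copied from the output above
rows = [(0, 1), (2, 1), (3, 1)]
print(render_histogram(rows))
```

For the sample rows this prints four lines, including a line for 1 unique query with a count of 0.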



In [3]:
employees_df.createOrReplaceTempView('employees')
queries_df.createOrReplaceTempView('queries')

spark.sql(
'''
with cte as
(select e.employee_id, count(distinct q.query_id) as unique_queries
from employees e left join queries q
-- the Q3 2023 filter sits in the join condition so employees without queries are kept
on e.employee_id = q.employee_id
and q.query_starttime >= '2023-07-01'
and q.query_starttime < '2023-10-01'
group by e.employee_id)

select unique_queries, count(employee_id) as employee_count
from cte
group by 1
order by 1
'''
).show()

+--------------+--------------+
|unique_queries|employee_count|
+--------------+--------------+
|             0|             1|
|             2|             1|
|             3|             1|
+--------------+--------------+
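
When the quarter filter is written in SQL, its position matters for a left join: in the join condition it only limits which queries match, while in a `WHERE` clause it also discards employees who have no matching query. The following sketch checks this with Python's built-in `sqlite3` module instead of Spark; the helper `histogram` and the placeholder condition `1 = 1` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, full_name TEXT, gender TEXT);
    CREATE TABLE queries (employee_id INTEGER, query_id INTEGER,
                          query_starttime TEXT, execution_time INTEGER);
    INSERT INTO employees VALUES
        (132, 'Judas Beardon', 'Male'),
        (226, 'Lainey Franciotti', 'Female'),
        (349, 'Ashbey Strahan', 'Male');
    INSERT INTO queries VALUES
        (226, 856987, '2023-07-01 01:04:43', 2698),
        (132, 286115, '2023-07-01 03:25:12', 2705),
        (226, 33683,  '2023-07-01 04:34:38', 91),
        (226, 17745,  '2023-07-01 14:33:47', 2093),
        (132, 413477, '2023-07-02 10:55:14', 470);
""")

# Q3 2023 as a half-open interval on ISO-formatted text timestamps
Q3 = "q.query_starttime >= '2023-07-01' AND q.query_starttime < '2023-10-01'"


def histogram(join_filter: str, where_filter: str):
    """Run the histogram query with the given join and WHERE conditions."""
    sql = f"""
        WITH per_employee AS (
            SELECT e.employee_id, COUNT(DISTINCT q.query_id) AS unique_queries
            FROM employees e
            LEFT JOIN queries q
              ON q.employee_id = e.employee_id AND {join_filter}
            WHERE {where_filter}
            GROUP BY e.employee_id
        )
        SELECT unique_queries, COUNT(*) AS employee_count
        FROM per_employee
        GROUP BY unique_queries
        ORDER BY unique_queries
    """
    return conn.execute(sql).fetchall()


filter_in_join = histogram(join_filter=Q3, where_filter="1 = 1")
filter_in_where = histogram(join_filter="1 = 1", where_filter=Q3)
print(filter_in_join)   # [(0, 1), (2, 1), (3, 1)]
print(filter_in_where)  # [(2, 1), (3, 1)]
```

With the filter in `WHERE`, employee 349 disappears because the joined `query_starttime` is `NULL` and fails the comparison, so the zero category is lost.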

