# HackerRank Lead Data Engineer Interview

## ETL Pipeline Development

During this interview, we aim to evaluate several skills:
1. Transform data using SQL
2. Manage dataframes with PySpark or similar technologies
3. Identify and troubleshoot inconsistencies in data

## Problem Statement

When you take a test on HackerRank, the platform collects clickstream data on certain user actions. For example, we receive ping data when you run code, submit code, or view different questions. In this notebook, you will manipulate a similar set of clickstream data to extract certain features.

## Import

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

## Create Spark Session

In [5]:
spark = SparkSession.builder \
                    .master("local") \
                    .appName("etl_pipeline_development") \
                    .enableHiveSupport() \
                    .getOrCreate()

## Questions

### Relevant Tables

\* **Perform any cleaning, exploratory analysis, and/or visualizations for the provided data as needed.**

This exercise uses two CSV files:
1. `ping_events.csv`
2. `company_candidates.csv`

### Question 1:

`ping_events` schema:

|  column name  |  data type  |
|---------------|-------------|
|  attempt_id   |  int        |
|  event_id     |  int        |
|  inserttime   |  datetime   |
|  metadata     |  string     |

In the `metadata` column, `qno` represents the question number within the test and `question_id` represents the question id stored in our database.
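
As a small illustration of the `metadata` layout, the sketch below parses two of the sample JSON strings with Python's standard `json` module. The helper name `parse_metadata` is hypothetical. Note that the sample rows with `qno` 0 carry no `question_id`, so the lookup uses `.get` and returns `None` for the missing key.

```python
import json


def parse_metadata(raw):
    """Return (qno, question_id) from a metadata JSON string.

    question_id is None when the key is absent, as in the qno 0 sample rows.
    """
    parsed = json.loads(raw)
    return parsed.get("qno"), parsed.get("question_id")


print(parse_metadata('{"question_id": 101, "qno": 1}'))  # (1, 101)
print(parse_metadata('{"qno": 0}'))                      # (0, None)
```

In Spark SQL, `get_json_object(metadata, '$.qno')` plays a similar role and returns the value as a string.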

Sample rows from `ping_events`:

|  attempt_id  |  event_id  |  inserttime  |  metadata                      |
|:------------:|:----------:|:------------:|:-------------------------------|
|      15      |  2         |  05:40       | {"question_id": 101, "qno": 1} |
|      15      |  2         |  05:45       | {"question_id": 103, "qno": 3} |
|      15      |  3         |  05:46       | {"question_id": 204, "qno": 2} |
|      15      |  3         |  05:50       | {"question_id": 101, "qno": 1} |
|      15      |  1         |  05:55       | {"qno": 0}                     |
|      16      |  3         |  06:20       | {"question_id": 103, "qno": 2} |
|      16      |  3         |  06:50       | {"question_id": 101, "qno": 1} |
|      16      |  1         |  07:10       | {"qno": 0}                     |

Write executable PySpark SQL that groups the rows by `attempt_id` and collects each row's `event_id`, `inserttime`, and `metadata` into an array of tuples in a column named `data`:

\* **Note that each `attempt_id` has a single row whose `data` column holds the array.**

|   attempt_id   |   data                                       |
|:--------------:|:---------------------------------------------|
|       15       | [                                            |
|                |  (2, 05:40, {"question_id": 101, "qno": 1}), |
|                |  (2, 05:45, {"question_id": 103, "qno": 3}), |
|                |  (3, 05:46, {"question_id": 204, "qno": 2}), |
|                |  (3, 05:50, {"question_id": 101, "qno": 1}), |
|                |  (1, 05:55, {"qno": 0})                      |
|                | ]                                            |
|       16       | [                                            |
|                |  (3, 06:20, {"question_id": 103, "qno": 2}), |
|                |  (3, 06:50, {"question_id": 101, "qno": 1}), |
|                |  (1, 07:10, {"qno": 0})                      |
|                | ]                                            |

### Solution:
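
One possible approach, offered as a sketch rather than a reference answer: load the CSV, register it as a temporary view, and aggregate with `collect_list` over a `struct`. The helper name `build_attempt_data`, the default file path, and the CSV options (a header row, and double-quote escaping for the JSON in `metadata`) are assumptions, not details given above. `collect_list` does not guarantee element order, so later steps should sort by `inserttime` rather than rely on array position.

```python
# Spark SQL that folds each attempt's pings into one array-of-structs column.
Q1_SQL = """
    SELECT attempt_id,
           collect_list(struct(event_id, inserttime, metadata)) AS data
    FROM ping_events
    GROUP BY attempt_id
"""


def build_attempt_data(spark, path="ping_events.csv"):
    """Return one row per attempt_id with a `data` array of structs.

    ASSUMPTIONS: `spark` is an active SparkSession, the file has a header
    row matching the schema table, and embedded quotes inside the JSON
    `metadata` field are escaped by doubling them.
    """
    ping_events = spark.read.csv(
        path, header=True, inferSchema=True, quote='"', escape='"'
    )
    # Expose the DataFrame to Spark SQL under the name used in Q1_SQL.
    ping_events.createOrReplaceTempView("ping_events")
    return spark.sql(Q1_SQL)
```

With the session created earlier, `build_attempt_data(spark).show(truncate=False)` should display one row per `attempt_id`.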

### Question 2:

Using PySpark, process the resulting table above to determine the time spent on each question. Store the result for each attempt as a dictionary of the form `{question_id: time_in_seconds}`. Do not use Pandas to perform this transformation.

Each ping's `inserttime` marks a user action. We compute `time_spent` from the difference between consecutive `inserttime` values within an attempt, crediting each gap to the question of the earlier ping. For attempt 15, the candidate first viewed `qno 1` at 05:40 and then `qno 3` at 05:45, so the candidate has spent 5 minutes on `qno 1` so far.

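Using this method, the table above should be transformed into the following. Note that this example keys each dictionary by `qno` and lists durations in minutes, reading the sample `inserttime` values as HH:MM (in seconds, attempt 15 would read `{"1": 600, "2": 240, "3": 60}`):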

|   attempt_id   |   time_spent              |
|:--------------:|:--------------------------|
|     15         | {"1": 10, "2": 4, "3": 1} |
|     16         | {"1": 20, "2": 30}        |
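
The expected values can be reproduced with a few lines of plain Python. This sketch is only an arithmetic check of the rule described above, not the Spark solution. It assumes the sample timestamps are HH:MM and reports minutes keyed by `qno`, which is what the table shows. `SAMPLE`, `to_minutes`, and `time_spent` are illustrative names.

```python
from collections import defaultdict

# Sample pings per attempt as (inserttime "HH:MM", qno), taken from the
# sample data and labeled with the attempt ids of the expected output.
SAMPLE = {
    15: [("05:40", 1), ("05:45", 3), ("05:46", 2), ("05:50", 1), ("05:55", 0)],
    16: [("06:20", 2), ("06:50", 1), ("07:10", 0)],
}


def to_minutes(hhmm):
    """Convert an "HH:MM" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_spent(pings):
    """Credit the gap between consecutive pings to the earlier ping's qno."""
    totals = defaultdict(int)
    ordered = sorted(pings, key=lambda p: to_minutes(p[0]))
    for (start, qno), (end, _next_qno) in zip(ordered, ordered[1:]):
        totals[str(qno)] += to_minutes(end) - to_minutes(start)
    return dict(sorted(totals.items()))


for attempt_id, pings in SAMPLE.items():
    print(attempt_id, time_spent(pings))
# 15 {'1': 10, '2': 4, '3': 1}
# 16 {'1': 20, '2': 30}
```

The final ping of each attempt (`qno` 0) has no successor, so it contributes no duration.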

### Solution:
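
A hedged sketch of one way to do this with the PySpark DataFrame API: explode the `data` array, use a `lead` window to find the next ping's time, sum the gaps per question, and fold them into a map. The function name `compute_time_spent` and the input name `attempt_data` (the Question 1 result) are hypothetical. The sketch assumes `inserttime` can be cast to a timestamp, reports seconds, and keys the map by `qno` to mirror the expected table; swapping the JSON path to `$.question_id` would key it by question ID instead.

```python
def compute_time_spent(attempt_data):
    """Return one row per attempt_id with a `time_spent` map of qno -> seconds.

    ASSUMPTION: `attempt_data` has columns `attempt_id` and `data`, where
    `data` is an array of structs with `inserttime` and `metadata` fields.
    """
    # Local import keeps this sketch importable where pyspark is not installed.
    from pyspark.sql import Window, functions as F

    # One row per ping again, since array order is not guaranteed.
    events = attempt_data.select("attempt_id", F.explode("data").alias("e"))
    pings = events.select(
        "attempt_id",
        F.get_json_object("e.metadata", "$.qno").alias("qno"),
        # Epoch seconds; assumes inserttime is castable to a timestamp.
        F.col("e.inserttime").cast("timestamp").cast("long").alias("ts"),
    )

    # Gap to the next ping in the same attempt, credited to the current qno.
    by_attempt = Window.partitionBy("attempt_id").orderBy("ts")
    gaps = pings.withColumn("secs", F.lead("ts").over(by_attempt) - F.col("ts"))

    per_question = (
        gaps
        # The last ping has no successor; map keys must not be null.
        .where(F.col("secs").isNotNull() & F.col("qno").isNotNull())
        .groupBy("attempt_id", "qno")
        .agg(F.sum("secs").alias("secs"))
    )
    return per_question.groupBy("attempt_id").agg(
        F.map_from_entries(F.collect_list(F.struct("qno", "secs"))).alias("time_spent")
    )
```

Because the work stays in Spark columns and window functions, no Pandas conversion is involved.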

### Question 3:

Next, extend the table with a `company_id` column:

|   company_id   |   attempt_id   |   time_spent              |
|:--------------:|:--------------:|:--------------------------|
|   1            |     15         | {"1": 10, "2": 4, "3": 1} |
|   2            |     16         | {"1": 20, "2": 30}        |

To get the `company_id` associated with each `attempt_id`, use `company_candidates.csv`. After generating the table above, store it as an interim table called `attempt_times`.

### Solution:
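
A minimal sketch of the join and the save step, under stated assumptions: the schema of `company_candidates.csv` is not given above, so the code assumes it has a header row with `company_id` and `attempt_id` columns. The function name `build_attempt_times` and the parameter `time_spent_df` (the Question 2 result) are hypothetical. A left join is used so that attempts with no matching company remain visible with a null `company_id`.

```python
def build_attempt_times(spark, time_spent_df, path="company_candidates.csv"):
    """Attach company_id to each attempt and save the result as `attempt_times`.

    ASSUMPTIONS: `spark` is an active SparkSession, `time_spent_df` has
    `attempt_id` and `time_spent` columns, and the CSV has a header row
    with `company_id` and `attempt_id` columns.
    """
    company_candidates = spark.read.csv(path, header=True, inferSchema=True)

    attempt_times = (
        time_spent_df
        .join(
            company_candidates.select("company_id", "attempt_id"),
            on="attempt_id",
            how="left",  # keep attempts that have no company match
        )
        .select("company_id", "attempt_id", "time_spent")
    )

    # Persist as a managed table so later steps can query it by name.
    attempt_times.write.mode("overwrite").saveAsTable("attempt_times")
    return attempt_times
```

If `company_candidates.csv` lists an `attempt_id` more than once, the join will duplicate rows, so it is worth checking for duplicates before saving.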