# Introduction
The Didomi challenge is the recruiting challenge for data engineers who want to work at Didomi. You can check
the original [git repo](https://github.com/pengfei99/challenges/blob/master/data/README.md).

## The challenge

In some specific cases, companies need to collect consent from consumers before using their data. For instance,
app users might need to explicitly consent to share their geolocation before a company can use it for advertising.

As users interact with the Didomi platform, we collect different types of events like:

- "Page view" when a user visits a webpage
- "Consent asked" when a user is asked for consent (i.e. a consent notice is displayed)
- "Consent given" when a user gives consent (i.e. has clicked on Agree or Disagree in the notice)

The goal of this challenge is to build a very simple Spark app that processes events and summarizes various metrics as time-series data.


## Input

### Format

Events are stored as JSON Lines files with the following format:

```js
{
    "timestamp": "2020-01-21T15:19:34Z",
    "id": "94cabac0-088c-43d3-976a-88756d21132a",
    "type": "pageview",
    "domain": "www.website.com",
    "user": {
        "id": "09fcb803-2779-4096-bcfd-0fb1afef684a",
        "country": "US",
        "token": "{\"vendors\":{\"enabled\":[\"vendor\"],\"disabled\":[]},\"purposes\":{\"enabled\":[\"analytics\"],\"disabled\":[]}}",
    }
}
```

| Property       | Values                                       | Description                          |
| -------------- | -------------------------------------------- | ------------------------------------ |
| 'timestamp'    | ISO 8601 date                                | Date of the event                    |
| 'id'           | UUID                                         | Unique event ID                      |
| 'type'         | 'pageview', 'consent.given', 'consent.asked' | Event type                           |
| 'domain'       | Domain name                                  | Domain where the event was collected |
| 'user.id'      | UUID                                         | Unique user ID                       |
| 'user.token'   | JSON-String                                  | Contains status of purposes/vendors  |
| 'user.country' | ISO 3166-1 alpha-2 country code              | Country of the user                  |

### Consent status

We consider an event as positive consent when at least one purpose is enabled.
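
As a minimal illustration of this rule, the sketch below checks a single token string in plain Python. The helper name `has_consent` is hypothetical, and treating a missing or malformed token as "no consent" is an assumption, not something the challenge statement specifies.

```python
import json


def has_consent(token):
    """Return True when the token lists at least one enabled purpose."""
    try:
        parsed = json.loads(token)
    except (TypeError, ValueError):
        # Assumption: a missing or malformed token counts as "no consent".
        return False
    if not isinstance(parsed, dict):
        return False
    purposes = parsed.get("purposes")
    if not isinstance(purposes, dict):
        return False
    enabled = purposes.get("enabled")
    return isinstance(enabled, list) and len(enabled) > 0


# Token taken from the example event above.
example_token = (
    '{"vendors":{"enabled":["vendor"],"disabled":[]},'
    '"purposes":{"enabled":["analytics"],"disabled":[]}}'
)
print(has_consent(example_token))
```

Only the `purposes.enabled` list matters here; the `vendors` part of the token does not affect the consent status.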

### Partitioning

The data is partitioned by date/hour with Hive partition structure.
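
In a Hive-style layout, each partition is a directory named `column=value`. The sketch below shows how such a path encodes the partition value; the helper `parse_hive_partitions` and the file name in the example path are hypothetical, and only the `datehour=...` directory naming comes from the input data used in this notebook.

```python
def parse_hive_partitions(path):
    """Extract column=value pairs from the directory segments of a path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


# Hypothetical file inside one date/hour partition directory.
example_path = "/data/input/datehour=2021-01-23-10/part-0000.json"
print(parse_hive_partitions(example_path))
```

Spark's file readers can perform this kind of partition discovery themselves when they are pointed at the root directory of the dataset. When the paths passed to the reader point below the partition directories, the partition column is typically not added to the schema unless the `basePath` option is set.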

## Output

The Spark job is expected to compute the following metrics:

| Metric                        | Description                                                                      |
| ----------------------------- | -------------------------------------------------------------------------------- |
| 'pageviews'                   | Number of events of type 'pageview'                                              |
| 'pageviews_with_consent'      | Number of events of type 'pageview' with consent (at least one enabled purpose)  |
| 'consents_asked'              | Number of events of type 'consent.asked'                                         |
| 'consents_given'              | Number of events of type 'consent.given'                                         |
| 'consents_given_with_consent' | Number of events of type 'consent.given' with consent                            |
| 'avg_pageviews_per_user'      | Average number of events of type 'pageview' per user                             |

The metrics should be grouped by the following dimensions:

- Date and hour (YYYY-MM-DD-HH)
- Domain
- User country

## Processing

On top of computing the metrics listed above, the following operations must be run:

- Deduplication of events based on event ID

## Step 0 Load the data

In [32]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import to_timestamp, col, from_json, date_format, size, count

In [4]:
spark = SparkSession.builder \
        .master("local[4]") \
        .appName("didomi_challenge") \
        .config("spark.driver.memory","4g") \
        .getOrCreate()
    
root_path = "/home/pliu/data_set/didomi_challenge_input"
data_path1 = "{}/datehour=2021-01-23-10".format(root_path)
data_path2 = "{}/datehour=2021-01-23-11".format(root_path)

df = spark.read.json(['{}/*.json'.format(data_path1), '{}/*.json'.format(data_path2)])
df.printSchema()
df.show(5, truncate=False)


root
 |-- datetime: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- token: string (nullable = true)

+-------------------+---------------+------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|datetime           |domain         |id                                  |type         |user                                                                                                                                            |
+-------------------+---------------+------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------------------------

## Step 1 Data preprocessing

After step 0, we can already notice some problems:
- datetime is a string and its format does not meet the requirements (it needs to be parsed as a timestamp, then formatted as YYYY-MM-DD-HH so that events can be grouped by hour)
- user is a struct type (it needs to be flattened into plain columns, which are simpler to work with)
- event IDs may be duplicated (a hint from the introduction), so events need to be deduplicated

So the data preprocessing focuses on these three points.


In [5]:
# 1.1 Remove duplicate event based on event id 
# row number before dedup
print("Row number before dedup: {}".format(df.count()))

# remove duplicate based on event id
df_dedup = df.dropDuplicates(["id"])
print("Row number after dedup: {}".format(df_dedup.count()))

Row number before dedup: 62
Row number after dedup: 57

In [6]:
# 1.2 Convert date to required format
# parse the string date as a timestamp, then format it as yyyy-MM-dd-HH
# note: "dd" is day-of-month; "DD" would be day-of-year
df_cleaned = df_dedup.withColumn("datetime", to_timestamp(col("datetime"), "yyyy-MM-dd HH:mm:ss")) \
        .withColumn("datetime", date_format("datetime", "yyyy-MM-dd-HH"))
df_cleaned.printSchema()
df_cleaned.show(5, truncate=False)

root
 |-- datetime: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- token: string (nullable = true)

+-------------+--------------------+------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------------------+
|datetime     |domain              |id                                  |type         |user                                                                                                                                            |
+-------------+--------------------+------------------------------------+-------------+------------------------------------------------------------------------------------------------------------------------------------
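
One thing to watch with datetime patterns of this kind: in Spark's pattern syntax, `dd` means day-of-month while `DD` means day-of-year. The two coincide for January dates, such as the sample data here, and diverge from February onward. The plain-Python sketch below mimics both behaviors to make the difference visible. The helper `hour_bucket` is hypothetical, and the input format `YYYY-MM-DD HH:MM:SS` is an assumption based on the pattern used in the cell above.

```python
from datetime import datetime


def hour_bucket(text, day_of_year=False):
    """Format a 'YYYY-MM-DD HH:MM:SS' string as a year-month-day-hour bucket.

    With day_of_year=True the day field is the day of the year,
    which mimics what a "DD" pattern letter would produce.
    """
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    day = ts.timetuple().tm_yday if day_of_year else ts.day
    return f"{ts:%Y-%m}-{day:02d}-{ts:%H}"


for text in ("2021-01-23 10:15:00", "2021-02-01 10:15:00"):
    print(text, hour_bucket(text), hour_bucket(text, day_of_year=True))
```

For the January timestamp both variants give `2021-01-23-10`, but for the February one the day-of-year variant gives `2021-02-32-10`, which is not a valid date bucket.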

In [8]:
# 1.3 Flatten the user struct into top-level columns
df_flat = df_cleaned.withColumn("user_country", df_cleaned.user.country) \
        .withColumn("user_id", df_cleaned.user.id) \
        .withColumn("user_token", df_cleaned.user.token) \
        .drop("user")

df_flat.printSchema()
df_flat.show(5, truncate=False)

root
 |-- datetime: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_country: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- user_token: string (nullable = true)

+-------------+--------------------+------------------------------------+-------------+------------+------------------------------------+----------------------------------------------------------------------------------------------------+
|datetime     |domain              |id                                  |type         |user_country|user_id                             |user_token                                                                                          |
+-------------+--------------------+------------------------------------+-------------+------------+------------------------------------+----------------------------------------------------------------------------------------------------+
|202

In [21]:
# The user_token column is a JSON string, so it needs to be flattened as well.
# from_json requires a schema, so we first infer it by reading the token strings as JSON.
json_df = spark.read.json(df_flat.select("user_token").rdd.map(lambda row: row.user_token))
json_df.show()
json_df.printSchema()
json_schema = json_df.schema




+-----------------+---------------+
|         purposes|        vendors|
+-----------------+---------------+
|{[], [analytics]}|{[], [Vendor1]}|
|{[analytics], []}|{[], [Vendor1]}|
|         {[], []}|       {[], []}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[analytics], []}|{[], [Vendor1]}|
|         {[], []}|       {[], []}|
|         {[], []}|       {[], []}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[], [analytics]}|{[], [Vendor1]}|
|         {[], []}|       {[], []}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[analytics], []}|{[], [Vendor1]}|
|         {[], []}|       {[], []}|
|         {[], []}|       {[], []}|
|{[], [analytics]}|{[], [Vendor1]}|
|{[], [analytics]}|{[], [Vendor1]}|
|         {[], []}|       {[], []}|
|{[], [analytics]}|{[], [Vendor1]}|
+-----------------+---------------+
only showing top 20 rows

root
 |-- purposes: struct (nullable = true)
 |    |-- disabled: array (nullable = true)
 |    |    |-- elemen



In [11]:
# we convert the token json string to a struct column.
df_token_flat = df_flat.withColumn("token", from_json(col("user_token"), schema=json_schema)).drop("user_token")
df_token_flat.printSchema()
df_token_flat.show(2)


root
 |-- datetime: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_country: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- token: struct (nullable = true)
 |    |-- purposes: struct (nullable = true)
 |    |    |-- disabled: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- enabled: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |-- vendors: struct (nullable = true)
 |    |    |-- disabled: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- enabled: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

+-------------+--------------------+--------------------+-------------+------------+--------------------+--------------------+
|     datetime|              domain|                  id|         type|user_country|

In [15]:
# We consider an event as positive consent when at least one purpose is enabled.
# Here we take the size of the enabled purposes list: a size of 1 or more means consent was given.
df_ready = df_token_flat.withColumn("consent", size(df_token_flat.token.purposes.enabled)).drop("token")
# Data preparation is finished; df_ready is ready for computing the metrics.
df_ready.printSchema()
df_ready.show(5, truncate=False)
df_ready.cache()

root
 |-- datetime: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_country: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- consent: integer (nullable = false)





+-------------+--------------------+------------------------------------+-------------+------------+------------------------------------+-------+
|datetime     |domain              |id                                  |type         |user_country|user_id                             |consent|
+-------------+--------------------+------------------------------------+-------------+------------+------------------------------------+-------+
|2021-01-23-11|www.domain-A.eu     |515d7d1a-2465-4ad1-91ba-2081867521c6|pageview     |ES          |8bfa9ffb-97d5-46f1-b712-215374a30e96|1      |
|2021-01-23-10|my-other-website.com|b6f9743e-a43e-44a8-b1e8-4cd4ed2ff02e|consent.given|FR          |cf73342a-9edb-4312-ba75-123a0bb5d701|0      |
|2021-01-23-10|www.mywebsite.com   |639b3171-a205-4be9-b744-adf10e7b2c61|consent.asked|FR          |d902eb4a-bac4-4bd5-acf4-9004ed770bf6|0      |
|2021-01-23-10|www.mywebsite.com   |85f451dc-368a-4d8e-abf2-c93efb9a37fc|pageview     |FR          |8a082392-1ce7-4957-ab7f-

DataFrame[datetime: string, domain: string, id: string, type: string, user_country: string, user_id: string, consent: int]

## Step 2 Get the metrics

In [17]:
# Count page views grouped by date, by country, and by domain

def group_by_count(df: DataFrame, group_col_name: str, count_alias: str) -> DataFrame:
    # count rows per distinct value of group_col_name and name the count column count_alias
    return df.groupBy(group_col_name).count().withColumnRenamed("count", count_alias)

df_page_view = df_ready.filter(col("type") == "pageview")
df_page_view_groupby_date = group_by_count(df_page_view, "datetime", "pageviews")
df_page_view_groupby_country = group_by_count(df_page_view, "user_country", "pageviews")
df_page_view_groupby_domain = group_by_count(df_page_view, "domain", "pageviews")

df_page_view_groupby_domain.show()
df_page_view_groupby_country.show()
df_page_view_groupby_date.show()

                                                                                

+--------------------+---------+
|              domain|pageviews|
+--------------------+---------+
|   www.mywebsite.com|       13|
|     www.domain-A.eu|       17|
|my-other-website.com|        1|
+--------------------+---------+

+------------+---------+
|user_country|pageviews|
+------------+---------+
|          DE|        4|
|          ES|       13|
|          FR|       14|
+------------+---------+



                                                                                

+-------------+---------+
|     datetime|pageviews|
+-------------+---------+
|2021-01-23-11|       15|
|2021-01-23-10|       16|
+-------------+---------+



In [20]:
# Page views with consent, grouped by date, by country, and by domain
df_page_view_with_consent = df_ready.filter((col("type") == "pageview") & (col("consent") > 0))
df_page_view_consent_date = group_by_count(df_page_view_with_consent, "datetime", "pageviews_with_consent")
df_page_view_consent_country = group_by_count(df_page_view_with_consent, "user_country", "pageviews_with_consent")
df_page_view_consent_domain = group_by_count(df_page_view_with_consent, "domain", "pageviews_with_consent")

df_page_view_consent_date.show()
df_page_view_consent_domain.show()
df_page_view_consent_country.show()

                                                                                

+-------------+----------------------+
|     datetime|pageviews_with_consent|
+-------------+----------------------+
|2021-01-23-11|                     8|
|2021-01-23-10|                     8|
+-------------+----------------------+

+-----------------+----------------------+
|           domain|pageviews_with_consent|
+-----------------+----------------------+
|www.mywebsite.com|                     8|
|  www.domain-A.eu|                     8|
+-----------------+----------------------+



                                                                                

+------------+----------------------+
|user_country|pageviews_with_consent|
+------------+----------------------+
|          DE|                     3|
|          ES|                     5|
|          FR|                     8|
+------------+----------------------+



In [22]:
# total consents_asked groupBy
df_consents_asked = df_ready.filter(col("type") == "consent.asked")
df_consents_asked_date = group_by_count(df_consents_asked, "datetime", "consents_asked")
df_consents_asked_country = group_by_count(df_consents_asked, "user_country", "consents_asked")
df_consents_asked_domain = group_by_count(df_consents_asked, "domain", "consents_asked")

In [23]:
# total consents_given
df_consents_given = df_ready.filter(col("type") == "consent.given")
df_consents_given_date = group_by_count(df_consents_given, "datetime", "consents_given")
df_consents_given_country = group_by_count(df_consents_given, "user_country", "consents_given")
df_consents_given_domain = group_by_count(df_consents_given, "domain", "consents_given")
# df_consents_given.show()

In [27]:
# total consents_given with consent
df_consents_given_with_consent = df_ready.filter((col("type") == "consent.given")&(col("consent")>0))
df_consents_given_with_consent_date = group_by_count(df_consents_given_with_consent, "datetime", "consents_given_with_consent")
df_consents_given_with_consent_country = group_by_count(df_consents_given_with_consent, "user_country", "consents_given_with_consent")
df_consents_given_with_consent_domain = group_by_count(df_consents_given_with_consent, "domain","consents_given_with_consent")

df_consents_given_with_consent_domain.show()

+--------------------+---------------------------+
|              domain|consents_given_with_consent|
+--------------------+---------------------------+
|   www.mywebsite.com|                          6|
|     www.domain-A.eu|                          5|
|my-other-website.com|                          1|
+--------------------+---------------------------+



In [29]:
def get_group_distinct_user_number(df: DataFrame, group_col_name, spark: SparkSession) -> DataFrame:
    result = []
    group_distinct_val_list = list(df.select(group_col_name).distinct().toPandas()[group_col_name])
    for group_distinct_val in group_distinct_val_list:
        # get distinct user number of the specific col value
        user_number = df.filter(col(group_col_name) == group_distinct_val).select("user_id").distinct().count()
        # print("{}:{}".format(group_distinct_val, user_number))
        result.append((group_distinct_val, user_number))
    return spark.createDataFrame(result, schema=[group_col_name, 'distinct_user_number'])


# note: the two DataFrames must result from grouping by the same column (e.g. country, domain, date)
def get_avg_view_per_user(df_distinct_user_number: DataFrame, df_view_number: DataFrame,
                          join_cond_col_name) -> DataFrame:
    df_tmp = df_distinct_user_number.join(df_view_number, join_cond_col_name, "inner")
    return df_tmp.withColumn("avg_pageviews_per_user", col("pageviews") / col("distinct_user_number")) \
        .drop("distinct_user_number").drop("pageviews")


# Average number of events of type pageviews per user
df_distinct_user_by_country = get_group_distinct_user_number(df_page_view, "user_country", spark)
df_avg_by_country = get_avg_view_per_user(df_distinct_user_by_country, df_page_view_groupby_country, "user_country")
df_distinct_user_by_country.show()
df_avg_by_country.show()

df_distinct_user_by_date = get_group_distinct_user_number(df_page_view, "datetime", spark)
df_avg_by_date = get_avg_view_per_user(df_distinct_user_by_date, df_page_view_groupby_date, "datetime")
# df_avg_by_date.show()

df_distinct_user_by_domain = get_group_distinct_user_number(df_page_view, "domain", spark)
df_avg_by_domain = get_avg_view_per_user(df_distinct_user_by_domain, df_page_view_groupby_domain, "domain")
# df_avg_by_domain.show()

+------------+--------------------+
|user_country|distinct_user_number|
+------------+--------------------+
|          DE|                   1|
|          ES|                   3|
|          FR|                   4|
+------------+--------------------+

+------------+----------------------+
|user_country|avg_pageviews_per_user|
+------------+----------------------+
|          DE|                   4.0|
|          ES|     4.333333333333333|
|          FR|                   3.5|
+------------+----------------------+



In [30]:
# join page view with consent
df_metric_by_date = df_page_view_groupby_date.join(df_page_view_consent_date, "datetime", "left")
df_metric_by_country = df_page_view_groupby_country.join(df_page_view_consent_country, "user_country", "left")
df_metric_by_domain = df_page_view_groupby_domain.join(df_page_view_consent_domain, "domain", "left")

# join consents_asked
df_metric_by_date = df_metric_by_date.join(df_consents_asked_date, "datetime", "left")
df_metric_by_country = df_metric_by_country.join(df_consents_asked_country, "user_country", "left")
df_metric_by_domain = df_metric_by_domain.join(df_consents_asked_domain, "domain", "left")

# join consents_given
df_metric_by_date = df_metric_by_date.join(df_consents_given_date, "datetime", "left")
df_metric_by_country = df_metric_by_country.join(df_consents_given_country, "user_country", "left")
df_metric_by_domain = df_metric_by_domain.join(df_consents_given_domain, "domain", "left")

# join consents_given_with_consents
df_metric_by_date = df_metric_by_date.join(df_consents_given_with_consent_date, "datetime", "left")
df_metric_by_country = df_metric_by_country.join(df_consents_given_with_consent_country, "user_country", "left")
df_metric_by_domain = df_metric_by_domain.join(df_consents_given_with_consent_domain, "domain", "left")

# join avg page views per user
df_metric_by_date = df_metric_by_date.join(df_avg_by_date, "datetime", "left")
df_metric_by_country = df_metric_by_country.join(df_avg_by_country, "user_country", "left")
df_metric_by_domain = df_metric_by_domain.join(df_avg_by_domain, "domain", "left")

############################ Show the final metrics ############################################
df_metric_by_date.show()
df_metric_by_country.show()
df_metric_by_domain.show()

                                                                                

+-------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+
|     datetime|pageviews|pageviews_with_consent|consents_asked|consents_given|consents_given_with_consent|avg_pageviews_per_user|
+-------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+
|2021-01-23-11|       15|                     8|             3|             3|                          2|                  3.75|
|2021-01-23-10|       16|                     8|             8|            12|                         10|                   3.2|
+-------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+

+------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+
|user_country|pageviews|pageviews_with_consent|consents_asked|consents_given|consents_give

                                                                                

+--------------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+
|              domain|pageviews|pageviews_with_consent|consents_asked|consents_given|consents_given_with_consent|avg_pageviews_per_user|
+--------------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+
|   www.mywebsite.com|       13|                     8|             5|             7|                          6|                  3.25|
|     www.domain-A.eu|       17|                     8|             4|             6|                          5|                  4.25|
|my-other-website.com|        1|                  null|             2|             2|                          1|                   1.0|
+--------------------+---------+----------------------+--------------+--------------+---------------------------+----------------------+



## Use window function



In [35]:
from pyspark.sql.window import Window
# window covering all rows that share the same user_country
win_country = Window.partitionBy("user_country")

df_win_metric_by_country = df_ready.filter(col("type") == "pageview") \
                       .withColumn("page_views_sum", count("*").over(win_country)) \
                       .select("user_country", "page_views_sum") \
                       .distinct()

df_win_metric_by_country.show()


+------------+--------------+
|user_country|page_views_sum|
+------------+--------------+
|          DE|             4|
|          ES|            13|
|          FR|            14|
+------------+--------------+

