# *Step 1: Feature Engineering (offers.json)*

By the end of processing, we should have the following variables:

* ``offer_id``: identification code of the offer
* ``offer_type``: the type of offer (BOGO, discount, informational)
* ``offer_min_value``: minimum value required to activate the offer
* ``offer_discount_value``: discount value of the offer
* ``offer_duration``: duration of the offer
* ``channel_mobile``: dummy variable indicating that the offer was delivered via 'mobile'
* ``channel_email``: dummy variable indicating that the offer was delivered via 'email'
* ``channel_social``: dummy variable indicating that the offer was delivered via 'social'
* ``channel_web``: dummy variable indicating that the offer was delivered via 'web'
* ``offer_type_index``: a numeric code for the offer type
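
The target column list above can double as a checklist at the end of the notebook. The sketch below is a minimal, pure-Python helper; `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Expected output columns, copied from the list above.
EXPECTED_COLUMNS = [
    "offer_id", "offer_type", "offer_min_value", "offer_discount_value",
    "offer_duration", "channel_mobile", "channel_email", "channel_social",
    "channel_web", "offer_type_index",
]


def check_columns(actual_columns):
    """Return (missing, unexpected) column names relative to EXPECTED_COLUMNS."""
    actual = list(actual_columns)
    missing = [c for c in EXPECTED_COLUMNS if c not in actual]
    unexpected = [c for c in actual if c not in EXPECTED_COLUMNS]
    return missing, unexpected
```

Calling `check_columns(df_offers.columns)` just before the final write should return two empty lists if every transformation ran as intended.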

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySparkTest').getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/11 00:21:52 WARN Utils: Your hostname, N0L144853, resolves to a loopback address: 127.0.1.1; using 192.168.68.107 instead (on interface wlp0s20f3)
25/08/11 00:21:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/08/11 00:21:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Read Files

In [3]:
path_json = "../data/raw/offers.json"
output_table = "../data/trusted/offers"

schema = T.StructType([
    T.StructField("id", T.StringType(), True),
    T.StructField("offer_type", T.StringType(), True),
    T.StructField("min_value", T.IntegerType(), True),
    T.StructField("discount_value", T.IntegerType(), True),
    T.StructField("duration", T.DoubleType(), True),
    T.StructField("channels", T.ArrayType(T.StringType()), True),
])

df_offers = spark.read.schema(schema).json(path_json)
df_offers.show(5, truncate=False)

+--------------------------------+-------------+---------+--------------+--------+----------------------------+
|id                              |offer_type   |min_value|discount_value|duration|channels                    |
+--------------------------------+-------------+---------+--------------+--------+----------------------------+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo         |10       |10            |7.0     |[email, mobile, social]     |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo         |10       |10            |5.0     |[web, email, mobile, social]|
|3f207df678b143eea3cee63160fa8bed|informational|0        |0             |4.0     |[web, email, mobile]        |
|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo         |5        |5             |7.0     |[web, email, mobile]        |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount     |20       |5             |10.0    |[web, email]                |
+--------------------------------+-------------+---------+--------------+--------+----------------------------+
only showing top 5 rows

In [4]:
df_offers.describe(['min_value', 'discount_value', 'duration']).show()

+-------+-----------------+-----------------+------------------+
|summary|        min_value|   discount_value|          duration|
+-------+-----------------+-----------------+------------------+
|  count|               10|               10|                10|
|   mean|              7.7|              4.2|               6.5|
| stddev|5.831904586934796|3.583914681524163|2.3213980461973533|
|    min|                0|                0|               3.0|
|    max|               20|               10|              10.0|
+-------+-----------------+-----------------+------------------+
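
As a quick cross-check of the summary above, the `min_value` statistics can be reproduced by hand. The sketch below assumes the ten `min_value` entries listed in the per-type tables of this notebook and uses only Python's standard library. It illustrates that the `stddev` row reported by `describe()` is the sample standard deviation (dividing by n - 1), not the population one.

```python
import statistics

# min_value of the ten offers, as listed in the per-type tables of this notebook:
# 4 bogo, 2 informational, 4 discount
min_values = [10, 10, 5, 5, 0, 0, 20, 7, 10, 10]

mean = statistics.mean(min_values)
sample_std = statistics.stdev(min_values)       # divides by n - 1
population_std = statistics.pstdev(min_values)  # divides by n

print(mean, sample_std, population_std)
```

Here `sample_std` agrees with the 5.8319... shown in the table, while `population_std` is slightly smaller.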



In [5]:
df_offers.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_offers.columns]).show()

+---+----------+---------+--------------+--------+--------+
| id|offer_type|min_value|discount_value|duration|channels|
+---+----------+---------+--------------+--------+--------+
|  0|         0|        0|             0|       0|       0|
+---+----------+---------+--------------+--------+--------+



# Data Understanding

In [6]:
df_offers.groupBy("offer_type").count().show()

+-------------+-----+
|   offer_type|count|
+-------------+-----+
|     discount|    4|
|informational|    2|
|         bogo|    4|
+-------------+-----+



In [7]:
df_offers.filter("offer_type='bogo'").show(truncate=False)

+--------------------------------+----------+---------+--------------+--------+----------------------------+
|id                              |offer_type|min_value|discount_value|duration|channels                    |
+--------------------------------+----------+---------+--------------+--------+----------------------------+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo      |10       |10            |7.0     |[email, mobile, social]     |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo      |10       |10            |5.0     |[web, email, mobile, social]|
|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo      |5        |5             |7.0     |[web, email, mobile]        |
|f19421c1d4aa40978ebb69ca19b0e20d|bogo      |5        |5             |5.0     |[web, email, mobile, social]|
+--------------------------------+----------+---------+--------------+--------+----------------------------+



In [8]:
df_offers.filter("offer_type='informational'").show(truncate=False)

+--------------------------------+-------------+---------+--------------+--------+-----------------------+
|id                              |offer_type   |min_value|discount_value|duration|channels               |
+--------------------------------+-------------+---------+--------------+--------+-----------------------+
|3f207df678b143eea3cee63160fa8bed|informational|0        |0             |4.0     |[web, email, mobile]   |
|5a8bc65990b245e5a138643cd4eb9837|informational|0        |0             |3.0     |[email, mobile, social]|
+--------------------------------+-------------+---------+--------------+--------+-----------------------+



In [9]:
df_offers.filter("offer_type='discount'").show(truncate=False)

+--------------------------------+----------+---------+--------------+--------+----------------------------+
|id                              |offer_type|min_value|discount_value|duration|channels                    |
+--------------------------------+----------+---------+--------------+--------+----------------------------+
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount  |20       |5             |10.0    |[web, email]                |
|2298d6c36e964ae4a3e7e9706d1fb8c2|discount  |7        |3             |7.0     |[web, email, mobile, social]|
|fafdcd668e3743c1bb461111dcafc2a4|discount  |10       |2             |10.0    |[web, email, mobile, social]|
|2906b810c7d4411798c6938adc9daaa5|discount  |10       |2             |7.0     |[web, email, mobile]        |
+--------------------------------+----------+---------+--------------+--------+----------------------------+



In [10]:
df_offers.select("channels").distinct().show(truncate=False)

+----------------------------+
|channels                    |
+----------------------------+
|[web, email]                |
|[email, mobile, social]     |
|[web, email, mobile]        |
|[web, email, mobile, social]|
+----------------------------+



# Data Cleaning

In [11]:
df_offers.show(5, truncate=False)

+--------------------------------+-------------+---------+--------------+--------+----------------------------+
|id                              |offer_type   |min_value|discount_value|duration|channels                    |
+--------------------------------+-------------+---------+--------------+--------+----------------------------+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo         |10       |10            |7.0     |[email, mobile, social]     |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo         |10       |10            |5.0     |[web, email, mobile, social]|
|3f207df678b143eea3cee63160fa8bed|informational|0        |0             |4.0     |[web, email, mobile]        |
|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo         |5        |5             |7.0     |[web, email, mobile]        |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount     |20       |5             |10.0    |[web, email]                |
+--------------------------------+-------------+---------+--------------+--------+----------------------------+
only showing top 5 rows

##### Transforming the channels column

In [12]:
channels_df = df_offers.select(F.explode(F.col("channels")).alias("channels")).distinct()
distinct_channels = [ch.channels for ch in channels_df.collect()]
distinct_channels

['mobile', 'email', 'social', 'web']

In [13]:
for channel in distinct_channels:
    df_offers = df_offers.withColumn(channel, F.when(F.array_contains(F.col("channels"), channel), 1).otherwise(0))
df_offers = df_offers.drop("channels")
df_offers.show(5, truncate=False)

+--------------------------------+-------------+---------+--------------+--------+------+-----+------+---+
|id                              |offer_type   |min_value|discount_value|duration|mobile|email|social|web|
+--------------------------------+-------------+---------+--------------+--------+------+-----+------+---+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo         |10       |10            |7.0     |1     |1    |1     |0  |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo         |10       |10            |5.0     |1     |1    |1     |1  |
|3f207df678b143eea3cee63160fa8bed|informational|0        |0             |4.0     |1     |1    |0     |1  |
|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo         |5        |5             |7.0     |1     |1    |0     |1  |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount     |20       |5             |10.0    |0     |1    |0     |1  |
+--------------------------------+-------------+---------+--------------+--------+------+-----+------+---+
only showing top 5 rows
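
To make the encoding logic explicit outside Spark, here is a minimal pure-Python sketch of the same idea: one 0/1 flag per known channel. The function name `encode_channels` is hypothetical. Two choices are assumptions of this sketch rather than what the notebook does: the channel list is sorted so the column order does not depend on the order in which the distinct values were collected, and the final `channel_` prefix is applied directly instead of in a later renaming step.

```python
def encode_channels(channels, all_channels):
    """Return one 0/1 flag per known channel for a single offer."""
    present = set(channels)
    return {f"channel_{name}": int(name in present) for name in all_channels}


# Sorting fixes the column order regardless of how the distinct values were collected.
all_channels = sorted(["mobile", "email", "social", "web"])

# First offer shown above: delivered via email, mobile and social, but not web.
row = encode_channels(["email", "mobile", "social"], all_channels)
# row == {"channel_email": 1, "channel_mobile": 1, "channel_social": 1, "channel_web": 0}
```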


##### Standardizing the offer-related column names

In [14]:
df_offers = df_offers.withColumnRenamed("id", "offer_id")\
                     .withColumnRenamed("min_value", "offer_min_value")\
                     .withColumnRenamed("discount_value", "offer_discount_value")\
                     .withColumnRenamed("duration", "offer_duration")\
                     .withColumnRenamed("mobile", "channel_mobile")\
                     .withColumnRenamed("email", "channel_email")\
                     .withColumnRenamed("social", "channel_social")\
                     .withColumnRenamed("web", "channel_web")
df_offers.show(5, truncate=False)

+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+
|offer_id                        |offer_type   |offer_min_value|offer_discount_value|offer_duration|channel_mobile|channel_email|channel_social|channel_web|
+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo         |10             |10                  |7.0           |1             |1            |1             |0          |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo         |10             |10                  |5.0           |1             |1            |1             |1          |
|3f207df678b143eea3cee63160fa8bed|informational|0              |0                   |4.0           |1             |1            |0             |1          |
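|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo         |5              |5                   |7.0           |1             |1            |0             |1          |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount     |20             |5                   |10.0          |0             |1            |0             |1          |
+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+
only showing top 5 rows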

##### Encoding the offer type

In [15]:
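# monotonically_increasing_id() yields unique, increasing IDs that are not
# guaranteed to be consecutive; the values depend on how the data is partitioned.
df_offer_type = df_offers.select(F.col("offer_type")).distinct()\
                         .withColumn("offer_type_index", F.monotonically_increasing_id())\
                         .select('offer_type', 'offer_type_index')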
df_offer_type.show(5, truncate=False)

+-------------+----------------+
|offer_type   |offer_type_index|
+-------------+----------------+
|discount     |0               |
|informational|1               |
|bogo         |2               |
+-------------+----------------+
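
If the index needs to be identical across runs, one option is to derive it from the sorted type names instead of from row order. The sketch below shows that idea in plain Python; `build_type_index` is a hypothetical helper, not part of the notebook, and the mapping it produces (alphabetical order) differs from the one in the table above.

```python
def build_type_index(offer_types):
    """Map each distinct offer type to a consecutive integer, in alphabetical order."""
    return {name: idx for idx, name in enumerate(sorted(set(offer_types)))}


# Offer types of the first five rows shown earlier in this notebook.
offer_type_index = build_type_index(
    ["bogo", "bogo", "informational", "bogo", "discount"]
)
# offer_type_index == {"bogo": 0, "discount": 1, "informational": 2}
```

Within Spark, a comparable result could presumably be obtained with a ranking window ordered by `offer_type` or with `StringIndexer` from `pyspark.ml.feature`. Either would change the index values saved downstream, so treat this as a design option rather than a drop-in replacement.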



In [16]:
df_offers = df_offers.join(df_offer_type, on='offer_type', how='inner')
df_offers.show(5, truncate=False)

+-------------+--------------------------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
|offer_type   |offer_id                        |offer_min_value|offer_discount_value|offer_duration|channel_mobile|channel_email|channel_social|channel_web|offer_type_index|
+-------------+--------------------------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
|bogo         |ae264e3637204a6fb9bb56bc8210ddfd|10             |10                  |7.0           |1             |1            |1             |0          |2               |
|bogo         |4d5c57ea9a6940dd891ad53e9dbe8da0|10             |10                  |5.0           |1             |1            |1             |1          |2               |
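|informational|3f207df678b143eea3cee63160fa8bed|0              |0                   |4.0           |1             |1            |0             |1          |1               |
|bogo         |9b98b8c7a33c4b65b9aebfe6a799e6d9|5              |5                   |7.0           |1             |1            |0             |1          |2               |
|discount     |0b1e1539f2cc45b7b9fa7c272da2e1d7|20             |5                   |10.0          |0             |1            |0             |1          |0               |
+-------------+--------------------------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
only showing top 5 rows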

##### Saving the offers table

In [17]:
df_offers = df_offers.select('offer_id', 'offer_type', 'offer_min_value', 'offer_discount_value', 'offer_duration', 'channel_mobile', 
                             'channel_email', 'channel_social', 'channel_web', 'offer_type_index')
df_offers.show(5, truncate=False)

+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
|offer_id                        |offer_type   |offer_min_value|offer_discount_value|offer_duration|channel_mobile|channel_email|channel_social|channel_web|offer_type_index|
+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
|ae264e3637204a6fb9bb56bc8210ddfd|bogo         |10             |10                  |7.0           |1             |1            |1             |0          |2               |
|4d5c57ea9a6940dd891ad53e9dbe8da0|bogo         |10             |10                  |5.0           |1             |1            |1             |1          |2               |
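|3f207df678b143eea3cee63160fa8bed|informational|0              |0                   |4.0           |1             |1            |0             |1          |1               |
|9b98b8c7a33c4b65b9aebfe6a799e6d9|bogo         |5              |5                   |7.0           |1             |1            |0             |1          |2               |
|0b1e1539f2cc45b7b9fa7c272da2e1d7|discount     |20             |5                   |10.0          |0             |1            |0             |1          |0               |
+--------------------------------+-------------+---------------+--------------------+--------------+--------------+-------------+--------------+-----------+----------------+
only showing top 5 rows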

In [18]:
df_offers.write.format("parquet").mode("overwrite").save(output_table)
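
After the write, a lightweight sanity check on the output directory can catch a failed or partial job before downstream steps read it. The sketch below is a heuristic only. It assumes a local filesystem path and Spark's default output committer behavior, which leaves a `_SUCCESS` marker next to the `part-*.parquet` files. `looks_like_complete_parquet_output` is a hypothetical helper name.

```python
from pathlib import Path


def looks_like_complete_parquet_output(path):
    """Heuristic check: a `_SUCCESS` marker plus at least one Parquet part file."""
    out = Path(path)
    has_marker = (out / "_SUCCESS").is_file()
    has_parts = any(out.glob("part-*.parquet"))
    return has_marker and has_parts
```

In this notebook the call would be `looks_like_complete_parquet_output(output_table)`. Reading the table back with `spark.read.parquet(output_table)` and comparing row counts would be a stronger check.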