# EDA for events
Exploratory analysis of the holidays and events data, to understand it before feature engineering.
Some EDA was already done in `src/eda/eda_main`.

In [None]:
from config import proj
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [2]:
holidays = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("holidays_events.parquet")))
holidays.show(5)

                                                                                

+----------+-------+--------+-----------+--------------------+-----------+
|      date|   type|  locale|locale_name|         description|transferred|
+----------+-------+--------+-----------+--------------------+-----------+
|2012-03-02|Holiday|   Local|      Manta|  Fundacion de Manta|      false|
|2012-04-01|Holiday|Regional|   Cotopaxi|Provincializacion...|      false|
|2012-04-12|Holiday|   Local|     Cuenca| Fundacion de Cuenca|      false|
|2012-04-14|Holiday|   Local|   Libertad|Cantonizacion de ...|      false|
|2012-04-21|Holiday|   Local|   Riobamba|Cantonizacion de ...|      false|
+----------+-------+--------+-----------+--------------------+-----------+
only showing top 5 rows



In [3]:
holidays.select(["locale_name", "locale"]).distinct().show(30)

+--------------------+--------+
|         locale_name|  locale|
+--------------------+--------+
|             Ecuador|National|
|             Cayambe|   Local|
|           Guayaquil|   Local|
|              Cuenca|   Local|
|           Latacunga|   Local|
|                Loja|   Local|
|            Riobamba|   Local|
|                Puyo|   Local|
|              Ibarra|   Local|
|            Cotopaxi|Regional|
|             Machala|   Local|
|             Quevedo|   Local|
|              Ambato|   Local|
|Santo Domingo de ...|Regional|
|       Santo Domingo|   Local|
|           El Carmen|   Local|
|               Manta|   Local|
|             Salinas|   Local|
|            Guaranda|   Local|
|          Esmeraldas|   Local|
|         Santa Elena|Regional|
|            Libertad|   Local|
|            Imbabura|Regional|
|               Quito|   Local|
+--------------------+--------+



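`locale_name` depends on `locale`: National rows name the country, Regional rows name a state, and Local rows name a city.

The holidays will need to be split by locale so that each group can be handled separately.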

In [4]:
holidays_nat = holidays.filter("locale == 'National'")
holidays_reg = holidays.filter("locale == 'Regional'")
holidays_loc = holidays.filter("locale == 'Local'")

### Understanding transferred

In [5]:
holidays.filter("transferred == TRUE").collect()[0]

Row(date=datetime.date(2012, 10, 9), type='Holiday', locale='National', locale_name='Ecuador', description='Independencia de Guayaquil', transferred=True)

In [6]:
holidays.filter("description == 'Independencia de Guayaquil'").show()

+----------+-------+--------+-----------+--------------------+-----------+
|      date|   type|  locale|locale_name|         description|transferred|
+----------+-------+--------+-----------+--------------------+-----------+
|2012-10-09|Holiday|National|    Ecuador|Independencia de ...|       true|
|2013-10-09|Holiday|National|    Ecuador|Independencia de ...|       true|
|2014-10-09|Holiday|National|    Ecuador|Independencia de ...|       true|
|2015-10-09|Holiday|National|    Ecuador|Independencia de ...|      false|
|2016-10-09|Holiday|National|    Ecuador|Independencia de ...|      false|
|2017-10-09|Holiday|National|    Ecuador|Independencia de ...|      false|
+----------+-------+--------+-----------+--------------------+-----------+



In [7]:
print("Transferred events: " + str(holidays.filter("type == 'Transfer'").count()))
holidays.filter("type == 'Transfer'").collect()[0]

Transferred events: 12


Row(date=datetime.date(2012, 10, 12), type='Transfer', locale='National', locale_name='Ecuador', description='Traslado Independencia de Guayaquil', transferred=False)

In [8]:
holidays.groupby("type").count().show()

+----------+-----+
|      type|count|
+----------+-----+
|     Event|   56|
|   Holiday|  221|
|  Transfer|   12|
|    Bridge|    5|
|Additional|   51|
|  Work Day|    5|
+----------+-----+



In [9]:
holidays.filter("type == 'Transfer'").show()

+----------+--------+--------+-----------+--------------------+-----------+
|      date|    type|  locale|locale_name|         description|transferred|
+----------+--------+--------+-----------+--------------------+-----------+
|2012-10-12|Transfer|National|    Ecuador|Traslado Independ...|      false|
|2013-10-11|Transfer|National|    Ecuador|Traslado Independ...|      false|
|2014-10-10|Transfer|National|    Ecuador|Traslado Independ...|      false|
|2016-05-27|Transfer|National|    Ecuador|Traslado Batalla ...|      false|
|2016-07-24|Transfer|   Local|  Guayaquil|Traslado Fundacio...|      false|
|2016-08-12|Transfer|National|    Ecuador|Traslado Primer G...|      false|
|2017-01-02|Transfer|National|    Ecuador|Traslado Primer d...|      false|
|2017-04-13|Transfer|   Local|     Cuenca| Fundacion de Cuenca|      false|
|2017-05-26|Transfer|National|    Ecuador|Traslado Batalla ...|      false|
|2017-08-11|Transfer|National|    Ecuador|Traslado Primer G...|      false|
|2017-09-29|

There are only 12 transferred events, some of which fall before the start date of the training data.

Most of them are national holidays, so they will probably have an impact.

The only difference in a transfer row is the `type`: instead of Holiday, Event or Bridge, it says Transfer.

To start, we'll assume holidays and events have a similar effect, so the transfer rows don't need much manipulation or matching back to the original holiday. That matching can be added later if we want the model to better capture event types, but the working assumption is that locale makes more of a difference than event type.

If manipulation is required, the matching can be done with a regex, since the descriptions look very similar but start with "Traslado".
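A minimal sketch of what that regex matching could look like, in plain Python. `TRANSFER_PREFIX` and `original_description` are hypothetical names, not part of the project. The prefix is treated as optional because one Transfer row in the output above (2017-04-13, Cuenca) keeps the plain description "Fundacion de Cuenca".

```python
import re

# A leading "Traslado " marks a transfer row; the rest is the holiday's own name.
TRANSFER_PREFIX = re.compile(r"^Traslado\s+")


def original_description(description: str) -> str:
    """Strip a leading 'Traslado ' so a transfer row can be matched to its holiday.

    Descriptions without the prefix are returned unchanged.
    """
    return TRANSFER_PREFIX.sub("", description)


# Descriptions taken from the rows shown above.
matched = original_description("Traslado Independencia de Guayaquil")
unchanged = original_description("Fundacion de Cuenca")
```

The stripped description could then be used as a join key against the Holiday rows, if matching turns out to be needed.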

For the moment, a holiday flagged as transferred will be treated as a normal day, and the day it is transferred to will be treated the same as a holiday/event.
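A minimal sketch of this rule, written in plain Python on a few rows copied from the outputs above (as dictionaries rather than the Spark DataFrame) so the logic can be checked without a Spark session. `observed_holidays` is a hypothetical helper name.

```python
from datetime import date


def observed_holidays(rows):
    """Keep only the rows for dates on which a holiday is actually observed.

    Rows with transferred == True are dropped (the original date becomes a
    normal day). Transfer rows have transferred == False, so they are kept
    and treated like any other holiday/event.
    """
    return [row for row in rows if not row["transferred"]]


# Rows for 'Independencia de Guayaquil', copied from the outputs above.
rows = [
    {"date": date(2012, 10, 9), "type": "Holiday", "transferred": True},
    {"date": date(2012, 10, 12), "type": "Transfer", "transferred": False},
    {"date": date(2015, 10, 9), "type": "Holiday", "transferred": False},
]

observed_dates = [row["date"] for row in observed_holidays(rows)]
```

On the DataFrame itself, the same rule would presumably be a single filter such as `holidays.filter("transferred == FALSE")`, mirroring the filter used in `In [5]`.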

### How to join to the correct `locale_name`, city or state?

In [10]:
holidays_nat = holidays.filter("locale == 'National'")  # national holidays affect all stores
holidays_reg = holidays.filter("locale == 'Regional'")
holidays_loc = holidays.filter("locale == 'Local'")

In [11]:
stores = spark.read.parquet(str(proj.Config.paths.get("data_proc").joinpath("stores.parquet")))
stores.show()

+---------+-------------+--------------------+----+-------+
|store_nbr|         city|               state|type|cluster|
+---------+-------------+--------------------+----+-------+
|        1|        Quito|           Pichincha|   D|     13|
|        2|        Quito|           Pichincha|   D|     13|
|        3|        Quito|           Pichincha|   D|      8|
|        4|        Quito|           Pichincha|   D|      9|
|        5|Santo Domingo|Santo Domingo de ...|   D|      4|
|        6|        Quito|           Pichincha|   D|     13|
|        7|        Quito|           Pichincha|   D|      8|
|        8|        Quito|           Pichincha|   D|      8|
|        9|        Quito|           Pichincha|   B|      6|
|       10|        Quito|           Pichincha|   C|     15|
|       11|      Cayambe|           Pichincha|   B|      6|
|       12|    Latacunga|            Cotopaxi|   C|     15|
|       13|    Latacunga|            Cotopaxi|   C|     15|
|       14|     Riobamba|          Chimb

In [12]:
states = stores.select("state").distinct()
holidays_reg.select("locale_name").distinct()\
    .join(states, holidays_reg.locale_name == states.state, "left")\
    .show()

+--------------------+--------------------+
|         locale_name|               state|
+--------------------+--------------------+
|            Cotopaxi|            Cotopaxi|
|         Santa Elena|         Santa Elena|
|            Imbabura|            Imbabura|
|Santo Domingo de ...|Santo Domingo de ...|
+--------------------+--------------------+



In [13]:
cities = stores.select("city").distinct()
# Keep the joined DataFrame; assigning the result of .show() would store None.
holidays_city = holidays_loc.select("locale_name").distinct()\
    .join(cities, holidays_loc.locale_name == cities.city, "left")
holidays_city.show()

+-------------+-------------+
|  locale_name|         city|
+-------------+-------------+
|      Quevedo|      Quevedo|
|       Cuenca|       Cuenca|
|     Guaranda|     Guaranda|
|Santo Domingo|Santo Domingo|
|         Puyo|         Puyo|
|        Quito|        Quito|
|        Manta|        Manta|
|    Latacunga|    Latacunga|
|    Guayaquil|    Guayaquil|
|         Loja|         Loja|
|       Ibarra|       Ibarra|
|    El Carmen|    El Carmen|
|       Ambato|       Ambato|
|      Machala|      Machala|
|      Cayambe|      Cayambe|
|      Salinas|      Salinas|
|     Libertad|     Libertad|
|     Riobamba|     Riobamba|
|   Esmeraldas|   Esmeraldas|
+-------------+-------------+



### Join by
- Local: `locale_name` = `city`
- Regional: `locale_name` = `state`
- National: all stores
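
The join rule above can be sketched in plain Python on a few store rows copied from the `stores` output. `JOIN_KEY` and `affected_stores` are hypothetical names, and treating National holidays as applying to every store is an assumption carried over from the comment in `In [10]`.

```python
# Which store column each locale's locale_name should be matched against.
JOIN_KEY = {"Local": "city", "Regional": "state"}


def affected_stores(holiday, stores):
    """Return the store numbers that a holiday row applies to."""
    if holiday["locale"] == "National":
        # Assumption: national holidays apply to every store.
        return [s["store_nbr"] for s in stores]
    key = JOIN_KEY[holiday["locale"]]
    return [s["store_nbr"] for s in stores if s[key] == holiday["locale_name"]]


# A few store rows copied from the stores output above.
stores = [
    {"store_nbr": 1, "city": "Quito", "state": "Pichincha"},
    {"store_nbr": 11, "city": "Cayambe", "state": "Pichincha"},
    {"store_nbr": 12, "city": "Latacunga", "state": "Cotopaxi"},
]

local_hit = affected_stores({"locale": "Local", "locale_name": "Quito"}, stores)
regional_hit = affected_stores({"locale": "Regional", "locale_name": "Cotopaxi"}, stores)
```

In Spark this would likely become two separate joins (one on `city`, one on `state`) plus a cross join or broadcast for the national holidays.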