# EDA & Feature Engineering on holidays_events

## Summary
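- Problem: several holidays fall on the same date, so merging with the train dataset increases the number of rows in the dataset.
- Solution: split the holidays by 'locale' into National / Regional / Local, remove the duplicated entries, and keep only the meaningful information.
- The resulting code is shown below.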

In [None]:
import pandas as pd
import csv

df_holidays = pd.read_csv('./data/holidays_events.csv')

# convert holidays_event
df_holidays_sort=df_holidays.sort_values(by='locale')

df_holidays_L = df_holidays_sort.loc[:66]
df_holidays_N = df_holidays_sort.loc[227:154]
df_holidays_R = df_holidays_sort.loc[334:278]
# drop the rows whose 'date' duplicates another row
df_holidays_N = df_holidays_N.drop([35, 40, 156, 235, 242, 245])

# keep only the first word of description (drop the trailing part)
def desc_transfer(data_set):
    desc_use = data_set['description'].str.split(' ').str[0]
    df_base = data_set.drop('description', axis=1)
    df_desc_transfered = pd.concat([df_base, desc_use], axis=1)
    return df_desc_transfered

df_holidays_L_use = desc_transfer(df_holidays_L)
df_holidays_N_use = desc_transfer(df_holidays_N)
df_holidays_R_use = desc_transfer(df_holidays_R)

# keep only the columns of holidays that will be used
df_h_n = df_holidays_N_use[['date', 'type', 'description']]
df_h_r = df_holidays_R_use[['date', 'locale_name', 'description']]
df_h_l = df_holidays_L_use[['date', 'locale_name', 'description']]

# in the National descriptions, strip the +/- suffix attached to extended holidays
drop_add_plus = df_h_n['description'].str.split('+').str[0]
drop_add_base_p = df_h_n.drop('description', axis=1)
df_h_n_drop_plus = pd.concat([drop_add_base_p, drop_add_plus], axis=1)

drop_add_minus = df_h_n_drop_plus['description'].str.split('-').str[0]
drop_add_base_m = df_h_n_drop_plus.drop('description', axis=1)
df_h_n_drop = pd.concat([drop_add_base_m, drop_add_minus], axis=1)

# rename locale_name so that Regional (state) and Local (city) are easy to tell apart
df_h_r_rename = df_h_r.rename(index=str, columns={"locale_name": "state"})
df_h_l_rename = df_h_l.rename(index=str, columns={"locale_name": "city"})

# during the merge, 2016-07-24 turned out to be duplicated for the same city in df_h_l_rename; drop one of the two rows
df_h_l_rename_drop = df_h_l_rename.drop(['265'])



## Process

In [2]:
import pandas as pd
import csv

df_holidays = pd.read_csv('./data/holidays_events.csv')

# reference : https://pandas.pydata.org/pandas-docs/stable/options.html
pd.options.display.max_rows = 350

In [3]:
# check whether dates and holidays map one-to-one
df_holidays['date'].value_counts()

2014-06-25    4
2013-06-25    3
2017-06-25    3
2016-06-25    3
2012-06-25    3
2015-06-25    3
2016-11-12    2
2016-07-24    2
2016-05-07    2
2016-05-12    2
2013-12-22    2
2013-05-12    2
2017-12-08    2
2014-12-26    2
2015-07-03    2
2017-07-03    2
2016-07-03    2
2014-12-22    2
2012-12-22    2
2016-05-08    2
2015-12-22    2
2013-07-03    2
2017-04-14    2
2017-12-22    2
2012-07-03    2
2016-05-01    2
2012-12-24    2
2012-12-31    2
2016-12-22    2
2016-04-21    2
2014-07-03    2
2017-02-27    1
2013-12-24    1
2014-05-12    1
2016-05-06    1
2015-07-23    1
2017-11-10    1
2014-04-14    1
2012-11-12    1
2014-06-12    1
2013-12-31    1
2012-04-21    1
2015-11-10    1
2016-05-04    1
2014-08-10    1
2017-04-12    1
2013-02-12    1
2014-06-20    1
2012-11-07    1
2012-08-10    1
2015-12-21    1
2017-08-05    1
2016-08-10    1
2017-11-12    1
2017-11-03    1
2016-04-14    1
2014-07-09    1
2016-12-24    1
2015-06-23    1
2015-02-17    1
2013-11-03    1
2015-12-23    1
2016-11-

Most dates map to exactly one holiday, but 31 dates map to more than one.
If the merge is run in this state, the dataset grows, so the original index no longer maps one-to-one to the index of the merged dataset.
Feature engineering is therefore applied so that each date maps to a single holiday.
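The row growth described above can be reproduced on toy data. The following is a minimal sketch, assuming a hypothetical `train` frame with one row per date and a hypothetical `holidays` frame in which one date appears twice; none of the values come from the real files.

```python
import pandas as pd

# Hypothetical stand-in for the train data: one row per date.
train = pd.DataFrame({
    "date": ["2016-05-07", "2016-05-08", "2016-05-09"],
    "sales": [10, 20, 30],
})

# Hypothetical holidays table in which one date appears twice.
holidays = pd.DataFrame({
    "date": ["2016-05-07", "2016-05-07", "2016-05-08"],
    "description": ["holiday_a", "holiday_b", "holiday_c"],
})

# A left merge repeats the train row once per matching holiday row.
merged = train.merge(holidays, on="date", how="left")
print(len(train), len(merged))

# Dates that would cause the growth: every date occurring more than once.
dup_dates = holidays.loc[holidays["date"].duplicated(keep=False), "date"].unique()
print(dup_dates)
```

Listing `dup_dates` before merging gives the same information as scanning `value_counts()` for counts above 1.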

In [4]:
df_holidays['description'].value_counts()

Carnaval                                           10
Fundacion de Cuenca                                 7
Fundacion de Ibarra                                 7
Navidad                                             6
Cantonizacion de Guaranda                           6
Cantonizacion del Puyo                              6
Navidad-2                                           6
Navidad+1                                           6
Cantonizacion de Cayambe                            6
Cantonizacion de Latacunga                          6
Cantonizacion de Quevedo                            6
Cantonizacion de Riobamba                           6
Fundacion de Ambato                                 6
Fundacion de Esmeraldas                             6
Navidad-3                                           6
Fundacion de Manta                                  6
Cantonizacion de Libertad                           6
Dia de Difuntos                                     6
Fundacion de Quito          

In [5]:
df_holidays

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
1,2012-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
2,2012-04-12,Holiday,Local,Cuenca,Fundacion de Cuenca,False
3,2012-04-14,Holiday,Local,Libertad,Cantonizacion de Libertad,False
4,2012-04-21,Holiday,Local,Riobamba,Cantonizacion de Riobamba,False
5,2012-05-12,Holiday,Local,Puyo,Cantonizacion del Puyo,False
6,2012-06-23,Holiday,Local,Guaranda,Cantonizacion de Guaranda,False
7,2012-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
8,2012-06-25,Holiday,Local,Latacunga,Cantonizacion de Latacunga,False
9,2012-06-25,Holiday,Local,Machala,Fundacion de Machala,False


In [6]:
df_holidays = pd.read_csv('./data/holidays_events.csv')
df_holidays['locale'].value_counts()

National    174
Local       152
Regional     24
Name: locale, dtype: int64

In [7]:
df_holidays_sort=df_holidays.sort_values(by='locale')
df_holidays_sort

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
126,2014-07-23,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
128,2014-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
129,2014-08-05,Holiday,Local,Esmeraldas,Fundacion de Esmeraldas,False
273,2016-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False
131,2014-08-15,Holiday,Local,Riobamba,Fundacion de Riobamba,False
132,2014-08-24,Holiday,Local,Ambato,Fundacion de Ambato,False
133,2014-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
134,2014-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False


The content of description is organized by locale. For example, Fundacion appears only in Local and Navidad appears only in National.
So the data is first split and cleaned up per locale.
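One way to check this observation without reading the table by eye is to cross-tabulate the first word of description against locale. This is a sketch on a handful of rows copied from the outputs shown in this notebook; `sample`, `first_word`, and `table` are illustrative names.

```python
import pandas as pd

# A few rows taken from the displayed outputs (not the full file).
sample = pd.DataFrame({
    "locale": ["Local", "Local", "Regional", "National", "National"],
    "description": [
        "Fundacion de Manta",
        "Cantonizacion de Cayambe",
        "Provincializacion de Cotopaxi",
        "Navidad-4",
        "Terremoto Manabi+7",
    ],
})

# First word of each description, cross-tabulated against locale.
first_word = sample["description"].str.split(" ").str[0]
table = pd.crosstab(first_word, sample["locale"])
print(table)
```

A first word that is specific to one locale shows a non-zero count in only one column.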

In [6]:
df_holidays_sort.head(153)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
126,2014-07-23,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
128,2014-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
129,2014-08-05,Holiday,Local,Esmeraldas,Fundacion de Esmeraldas,False
273,2016-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False
131,2014-08-15,Holiday,Local,Riobamba,Fundacion de Riobamba,False
132,2014-08-24,Holiday,Local,Ambato,Fundacion de Ambato,False
133,2014-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
134,2014-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False


In [7]:
df_holidays_sort.head(328)

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
126,2014-07-23,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
128,2014-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
129,2014-08-05,Holiday,Local,Esmeraldas,Fundacion de Esmeraldas,False
273,2016-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False
131,2014-08-15,Holiday,Local,Riobamba,Fundacion de Riobamba,False
132,2014-08-24,Holiday,Local,Ambato,Fundacion de Ambato,False
133,2014-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
134,2014-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False


In [8]:
df_holidays_sort.tail()

Unnamed: 0,date,type,locale,locale_name,description,transferred
302,2017-04-01,Holiday,Regional,Cotopaxi,Provincializacion de Cotopaxi,False
76,2013-11-06,Holiday,Regional,Santo Domingo de los Tsachilas,Provincializacion de Santo Domingo,False
77,2013-11-07,Holiday,Regional,Santa Elena,Provincializacion Santa Elena,False
112,2014-06-25,Holiday,Regional,Imbabura,Provincializacion de Imbabura,False
278,2016-11-06,Holiday,Regional,Santo Domingo de los Tsachilas,Provincializacion de Santo Domingo,False


In [8]:
# after sorting by locale, split into one dataset per locale

df_holidays_L = df_holidays_sort.loc[:66]
df_holidays_N = df_holidays_sort.loc[227:154]
df_holidays_R = df_holidays_sort.loc[334:278]
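The label slices above depend on the exact row order produced by `sort_values`, which may not be identical in every environment. As an alternative sketch (not the method used in this notebook), the same split can be written with boolean masks or `groupby`, which do not rely on row order. The `sample` frame is toy data.

```python
import pandas as pd

# Toy frame with the same 'locale' column as holidays_events.csv.
sample = pd.DataFrame({
    "date": ["2012-03-02", "2012-04-01", "2012-08-10", "2012-04-12"],
    "locale": ["Local", "Regional", "National", "Local"],
})

# Boolean masks select rows by value, independent of ordering.
local = sample[sample["locale"] == "Local"]
national = sample[sample["locale"] == "National"]
regional = sample[sample["locale"] == "Regional"]

# Equivalent one-pass split into a dict keyed by locale.
by_locale = {name: group for name, group in sample.groupby("locale")}
print({name: len(group) for name, group in by_locale.items()})
```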

In [9]:
# check that the split worked
df_holidays_N

Unnamed: 0,date,type,locale,locale_name,description,transferred
227,2016-04-23,Event,National,Ecuador,Terremoto Manabi+7,False
343,2017-12-21,Additional,National,Ecuador,Navidad-4,False
223,2016-04-20,Event,National,Ecuador,Terremoto Manabi+4,False
243,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
225,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False
226,2016-04-22,Event,National,Ecuador,Terremoto Manabi+6,False
239,2016-05-04,Event,National,Ecuador,Terremoto Manabi+18,False
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
229,2016-04-25,Event,National,Ecuador,Terremoto Manabi+9,False


In [10]:
# check for duplicated dates within each locale
df_holidays_L['date'].value_counts()

2012-07-03    2
2016-07-03    2
2013-07-03    2
2017-06-25    2
2015-06-25    2
2015-07-03    2
2016-06-25    2
2013-06-25    2
2014-06-25    2
2017-07-03    2
2012-06-25    2
2014-07-03    2
2017-12-08    2
2016-07-24    2
2013-10-07    1
2013-07-23    1
2016-12-08    1
2015-04-12    1
2012-09-28    1
2016-10-07    1
2015-11-12    1
2013-05-12    1
2014-12-08    1
2013-08-15    1
2016-04-12    1
2014-03-02    1
2013-12-06    1
2015-03-02    1
2015-12-06    1
2015-06-23    1
2016-03-02    1
2017-12-22    1
2016-07-25    1
2013-03-02    1
2012-12-08    1
2013-08-05    1
2014-09-28    1
2016-12-22    1
2016-12-05    1
2013-11-10    1
2017-12-06    1
2016-06-23    1
2014-04-21    1
2017-12-05    1
2017-08-24    1
2012-11-10    1
2012-08-15    1
2017-06-23    1
2012-08-24    1
2014-12-06    1
2015-07-25    1
2015-09-28    1
2015-07-23    1
2015-10-07    1
2017-07-23    1
2016-04-21    1
2014-10-07    1
2014-07-23    1
2015-04-14    1
2015-12-22    1
2012-08-05    1
2017-03-02    1
2013-04-

In [9]:
df_holidays_N['date'].value_counts()

2012-12-31    2
2016-05-07    2
2012-12-24    2
2014-12-26    2
2016-05-08    2
2016-05-01    2
2016-01-01    1
2016-05-03    1
2013-05-24    1
2014-12-20    1
2015-01-02    1
2013-04-29    1
2017-12-23    1
2012-10-12    1
2017-04-14    1
2017-10-09    1
2014-12-21    1
2015-05-01    1
2017-12-24    1
2013-12-23    1
2016-03-25    1
2016-04-28    1
2014-05-11    1
2013-10-11    1
2015-12-26    1
2013-10-09    1
2013-01-05    1
2016-05-05    1
2014-07-12    1
2013-08-10    1
2016-04-23    1
2012-11-02    1
2015-02-17    1
2014-07-09    1
2015-12-23    1
2013-01-01    1
2016-10-09    1
2014-12-01    1
2012-12-22    1
2014-06-15    1
2016-04-22    1
2014-10-09    1
2015-08-10    1
2015-12-31    1
2015-12-21    1
2014-11-03    1
2014-06-12    1
2016-11-25    1
2014-04-18    1
2015-05-10    1
2016-11-02    1
2012-12-25    1
2014-11-28    1
2015-12-24    1
2014-05-01    1
2017-05-26    1
2016-05-13    1
2016-12-31    1
2014-06-28    1
2014-06-29    1
2014-12-22    1
2014-06-30    1
2013-12-

In [12]:
# sort by date to see what kinds of holidays are duplicated
df_holidays_N_sort = df_holidays_N.sort_values(by='date')
df_holidays_N_sort

Unnamed: 0,date,type,locale,locale_name,description,transferred
14,2012-08-10,Holiday,National,Ecuador,Primer Grito de Independencia,False
19,2012-10-09,Holiday,National,Ecuador,Independencia de Guayaquil,True
20,2012-10-12,Transfer,National,Ecuador,Traslado Independencia de Guayaquil,False
21,2012-11-02,Holiday,National,Ecuador,Dia de Difuntos,False
22,2012-11-03,Holiday,National,Ecuador,Independencia de Cuenca,False
31,2012-12-21,Additional,National,Ecuador,Navidad-4,False
33,2012-12-22,Additional,National,Ecuador,Navidad-3,False
34,2012-12-23,Additional,National,Ecuador,Navidad-2,False
35,2012-12-24,Bridge,National,Ecuador,Puente Navidad,False
36,2012-12-24,Additional,National,Ecuador,Navidad-1,False


In [10]:
df_holidays_R['date'].value_counts()

2012-04-01    1
2012-11-06    1
2014-04-01    1
2013-04-01    1
2012-06-25    1
2015-06-25    1
2017-06-25    1
2017-11-06    1
2014-11-07    1
2012-11-07    1
2015-11-06    1
2015-04-01    1
2014-06-25    1
2016-11-06    1
2013-11-07    1
2017-11-07    1
2014-11-06    1
2013-11-06    1
2016-11-07    1
2016-06-25    1
2015-11-07    1
2013-06-25    1
2016-04-01    1
2017-04-01    1
Name: date, dtype: int64

In [14]:
# Drop one of each pair of duplicated National holidays.
# df_holidays_N_use_1 = df_holidays_N.drop([8, 12, 71, 116, 123, 125])
# Note: the index used by loc and drop is the label assigned in the DataFrame, not the row's position in the data.

df_holidays_N_use_2 = df_holidays_N.drop([35, 40, 156, 235, 242, 245])
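The label-versus-position distinction noted in the comment can be seen on a small frame. This is a minimal sketch with made-up values; only the behaviour of `loc`, `iloc`, and `drop` is the point.

```python
import pandas as pd

# Toy frame whose index labels are not 0, 1, 2.
df = pd.DataFrame({"date": ["d0", "d1", "d2"]}, index=[35, 36, 40])

by_label = df.loc[35]      # row whose index label is 35
by_position = df.iloc[0]   # first row, whatever its label is

# drop() also works on labels: this removes the rows labelled 35 and 40.
dropped = df.drop([35, 40])
print(dropped)
```

After a sort or a filter, labels and positions diverge, which is why the positional list in the commented-out line did not target the intended rows.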

In [15]:
# check that the rows were dropped
df_holidays_N_use_2['date'].value_counts()

2015-12-26    1
2017-10-09    1
2013-10-09    1
2017-12-24    1
2013-10-11    1
2013-05-24    1
2014-12-20    1
2015-01-02    1
2013-04-29    1
2017-12-23    1
2014-12-21    1
2016-04-17    1
2013-12-23    1
2016-03-25    1
2016-04-28    1
2014-05-11    1
2016-01-01    1
2016-05-03    1
2015-05-01    1
2017-04-14    1
2015-12-21    1
2015-12-31    1
2015-08-10    1
2014-10-09    1
2013-11-03    1
2014-06-12    1
2013-02-12    1
2016-05-05    1
2014-07-12    1
2013-08-10    1
2016-04-23    1
2012-11-02    1
2015-02-17    1
2014-07-09    1
2015-12-23    1
2013-01-01    1
2016-10-09    1
2014-12-01    1
2012-12-22    1
2014-06-15    1
2016-04-22    1
2012-10-12    1
2013-01-05    1
2014-11-03    1
2015-02-16    1
2015-05-10    1
2016-11-02    1
2012-12-25    1
2014-11-28    1
2015-12-24    1
2014-05-01    1
2017-05-26    1
2016-05-13    1
2016-12-31    1
2014-06-28    1
2014-06-29    1
2016-05-07    1
2014-12-22    1
2014-06-30    1
2013-12-22    1
2017-02-28    1
2014-11-02    1
2014-04-

In [16]:
# For efficiency, the steps above are applied to a single dataset that contains both National and Regional holidays.
df_holidays_NR = df_holidays_sort.loc[227:278]
df_holidays_NR = df_holidays_NR.drop([35, 40, 156, 235, 242, 245])
df_holidays_NR

Unnamed: 0,date,type,locale,locale_name,description,transferred
227,2016-04-23,Event,National,Ecuador,Terremoto Manabi+7,False
343,2017-12-21,Additional,National,Ecuador,Navidad-4,False
223,2016-04-20,Event,National,Ecuador,Terremoto Manabi+4,False
243,2016-05-07,Event,National,Ecuador,Terremoto Manabi+21,False
347,2017-12-24,Additional,National,Ecuador,Navidad-1,False
225,2016-04-21,Event,National,Ecuador,Terremoto Manabi+5,False
226,2016-04-22,Event,National,Ecuador,Terremoto Manabi+6,False
239,2016-05-04,Event,National,Ecuador,Terremoto Manabi+18,False
345,2017-12-22,Additional,National,Ecuador,Navidad-3,False
229,2016-04-25,Event,National,Ecuador,Terremoto Manabi+9,False


In [17]:
df_holidays_L

Unnamed: 0,date,type,locale,locale_name,description,transferred
0,2012-03-02,Holiday,Local,Manta,Fundacion de Manta,False
126,2014-07-23,Holiday,Local,Cayambe,Cantonizacion de Cayambe,False
127,2014-07-24,Additional,Local,Guayaquil,Fundacion de Guayaquil-1,False
128,2014-07-25,Holiday,Local,Guayaquil,Fundacion de Guayaquil,False
129,2014-08-05,Holiday,Local,Esmeraldas,Fundacion de Esmeraldas,False
273,2016-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False
131,2014-08-15,Holiday,Local,Riobamba,Fundacion de Riobamba,False
132,2014-08-24,Holiday,Local,Ambato,Fundacion de Ambato,False
133,2014-09-28,Holiday,Local,Ibarra,Fundacion de Ibarra,False
134,2014-10-07,Holiday,Local,Quevedo,Cantonizacion de Quevedo,False


In [18]:
# The output above shows that the first part of description carries the meaning, while the rest only names the place.
# The place is already given in another column, so only the first part of description is kept.
# reference : https://pandas.pydata.org/pandas-docs/stable/text.html

df_holidays_L_desc = df_holidays_L['description'].str.split(' ').str[0]

df_holidays_L_desc
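How `.str.split(' ').str[0]` behaves can be checked on a few descriptions copied from the output above. This is a small sketch; the `n=1` variant is an optional alternative that stops after the first space.

```python
import pandas as pd

# Descriptions copied from the Local rows shown above, with their index labels.
desc = pd.Series(
    ["Fundacion de Manta", "Cantonizacion de Cayambe", "Fundacion de Guayaquil-1"],
    index=[0, 126, 127],
)

# Split on spaces, then take the first token of each list.
first = desc.str.split(" ").str[0]

# Same result, splitting at most once.
first_alt = desc.str.split(" ", n=1).str[0]
print(first)
```

The result keeps the original index labels, which is what lets it be aligned back onto the frame with `pd.concat(..., axis=1)`.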

In [19]:
df_holidays_L_desc

0          Fundacion
126    Cantonizacion
127        Fundacion
128        Fundacion
129        Fundacion
273    Cantonizacion
131        Fundacion
132        Fundacion
133        Fundacion
134    Cantonizacion
272        Fundacion
271        Fundacion
270        Fundacion
267        Fundacion
141    Independencia
142    Independencia
143    Independencia
266        Fundacion
280    Independencia
265         Traslado
281    Independencia
118    Cantonizacion
86     Cantonizacion
305    Cantonizacion
304        Fundacion
303        Fundacion
301        Fundacion
93         Fundacion
97         Fundacion
98     Cantonizacion
100    Cantonizacion
104    Cantonizacion
291    Cantonizacion
288        Fundacion
109    Cantonizacion
111        Fundacion
287        Fundacion
286        Fundacion
282    Independencia
119        Fundacion
307    Cantonizacion
146        Fundacion
148        Fundacion
183        Fundacion
184        Fundacion
186        Fundacion
187        Fundacion
188        Fu

In [23]:
df_h_L_drop = df_holidays_L.drop('description', axis=1)

In [24]:
# Build the Local holidays dataset by attaching the description column that keeps only the first word.
# reference : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html

df_holidays_L_use = pd.concat([df_h_L_drop, df_holidays_L_desc], axis=1)
df_holidays_L_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
0,2012-03-02,Holiday,Local,Manta,False,Fundacion
126,2014-07-23,Holiday,Local,Cayambe,False,Cantonizacion
127,2014-07-24,Additional,Local,Guayaquil,False,Fundacion
128,2014-07-25,Holiday,Local,Guayaquil,False,Fundacion
129,2014-08-05,Holiday,Local,Esmeraldas,False,Fundacion
273,2016-10-07,Holiday,Local,Quevedo,False,Cantonizacion
131,2014-08-15,Holiday,Local,Riobamba,False,Fundacion
132,2014-08-24,Holiday,Local,Ambato,False,Fundacion
133,2014-09-28,Holiday,Local,Ibarra,False,Fundacion
134,2014-10-07,Holiday,Local,Quevedo,False,Cantonizacion


Using only the first part of description seems appropriate not just for Local holidays but also for National and Regional ones, because the first part alone conveys enough information about the holiday.

In [25]:
# The steps above, wrapped into a function.

def desc_transfer(data_set):
    desc_use = data_set['description'].str.split(' ').str[0]
    df_base = data_set.drop('description', axis=1)
    df_desc_transfered = pd.concat([df_base, desc_use], axis=1)
    return df_desc_transfered
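As a variation on `desc_transfer`, the same transformation can be sketched with `DataFrame.assign`, which overwrites the column in a copy and keeps the original column order instead of moving description to the end. The name `desc_first_word` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

def desc_first_word(data_set: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which 'description' holds only its first word."""
    return data_set.assign(
        description=data_set["description"].str.split(" ").str[0]
    )

# Toy rows copied from the Local output shown earlier.
sample = pd.DataFrame({
    "date": ["2012-03-02", "2014-07-23"],
    "description": ["Fundacion de Manta", "Cantonizacion de Cayambe"],
    "transferred": [False, False],
})

out = desc_first_word(sample)
print(out)
```

The input frame is left unchanged, as with the `concat`-based version.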

In [27]:
df_holidays_L_use = desc_transfer(df_holidays_L)
df_holidays_L_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
0,2012-03-02,Holiday,Local,Manta,False,Fundacion
126,2014-07-23,Holiday,Local,Cayambe,False,Cantonizacion
127,2014-07-24,Additional,Local,Guayaquil,False,Fundacion
128,2014-07-25,Holiday,Local,Guayaquil,False,Fundacion
129,2014-08-05,Holiday,Local,Esmeraldas,False,Fundacion
273,2016-10-07,Holiday,Local,Quevedo,False,Cantonizacion
131,2014-08-15,Holiday,Local,Riobamba,False,Fundacion
132,2014-08-24,Holiday,Local,Ambato,False,Fundacion
133,2014-09-28,Holiday,Local,Ibarra,False,Fundacion
134,2014-10-07,Holiday,Local,Quevedo,False,Cantonizacion


In [28]:
df_holidays_NR_use = desc_transfer(df_holidays_NR)
df_holidays_NR_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
227,2016-04-23,Event,National,Ecuador,False,Terremoto
343,2017-12-21,Additional,National,Ecuador,False,Navidad-4
223,2016-04-20,Event,National,Ecuador,False,Terremoto
243,2016-05-07,Event,National,Ecuador,False,Terremoto
347,2017-12-24,Additional,National,Ecuador,False,Navidad-1
225,2016-04-21,Event,National,Ecuador,False,Terremoto
226,2016-04-22,Event,National,Ecuador,False,Terremoto
239,2016-05-04,Event,National,Ecuador,False,Terremoto
345,2017-12-22,Additional,National,Ecuador,False,Navidad-3
229,2016-04-25,Event,National,Ecuador,False,Terremoto


In [1]:
# The steps so far can be summarized as follows.

# holidays data engineering

import pandas as pd
import csv

df_holidays = pd.read_csv('./data/holidays_events.csv')
df_holidays_sort=df_holidays.sort_values(by='locale')

# Split the dataset into Local / National / Regional.
df_holidays_L = df_holidays_sort.loc[:66]
df_holidays_N = df_holidays_sort.loc[227:154]
df_holidays_R = df_holidays_sort.loc[334:278]
# drop the rows whose 'date' duplicates another row
df_holidays_N = df_holidays_N.drop([35, 40, 156, 235, 242, 245])

# Keep only the first part of the description column.
def desc_transfer(data_set):
    desc_use = data_set['description'].str.split(' ').str[0]
    df_base = data_set.drop('description', axis=1)
    df_desc_transfered = pd.concat([df_base, desc_use], axis=1)
    return df_desc_transfered

df_holidays_L_use = desc_transfer(df_holidays_L)
df_holidays_N_use = desc_transfer(df_holidays_N)
df_holidays_R_use = desc_transfer(df_holidays_R)

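# The 'locale_name' column of df_holidays_L_use would need to be renamed to 'city'. -> Not strictly necessary: merge can use left_on / right_on instead.
# Still, naming it 'city' should make the columns easier to tell apart when everything is viewed together later, so it is renamed anyway.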
# reference : https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html

df_holidays_L_use.rename(index=str, columns={"locale_name": "city"})

Unnamed: 0,date,type,locale,city,transferred,description
0,2012-03-02,Holiday,Local,Manta,False,Fundacion
126,2014-07-23,Holiday,Local,Cayambe,False,Cantonizacion
127,2014-07-24,Additional,Local,Guayaquil,False,Fundacion
128,2014-07-25,Holiday,Local,Guayaquil,False,Fundacion
129,2014-08-05,Holiday,Local,Esmeraldas,False,Fundacion
273,2016-10-07,Holiday,Local,Quevedo,False,Cantonizacion
131,2014-08-15,Holiday,Local,Riobamba,False,Fundacion
132,2014-08-24,Holiday,Local,Ambato,False,Fundacion
133,2014-09-28,Holiday,Local,Ibarra,False,Fundacion
134,2014-10-07,Holiday,Local,Quevedo,False,Cantonizacion


In [None]:
df_1 = pd.merge(df, df_holidays_NR_use, on='date', how='left')

In [None]:
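# df_holidays_L_use still names the city column 'locale_name', so the keys are paired with left_on / right_on
df_2 = pd.merge(df_1, df_holidays_L_use, left_on=['date', 'city'], right_on=['date', 'locale_name'], how='left')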
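Since the goal is a merge that does not add rows, it may help to let pandas enforce that. The sketch below uses the `validate` argument of `merge`, which raises `pandas.errors.MergeError` when the right-hand keys are not unique. The `train` and `local` frames are toy data, and it is an assumption here that the train data already carries a `city` column.

```python
import pandas as pd

# Toy train data; the 'city' column is assumed to have been joined in already.
train = pd.DataFrame({
    "date": ["2014-07-25", "2014-07-25", "2014-07-26"],
    "city": ["Guayaquil", "Quito", "Guayaquil"],
    "sales": [5, 7, 9],
})

# Toy Local holidays with unique (date, locale_name) keys.
local = pd.DataFrame({
    "date": ["2014-07-25"],
    "locale_name": ["Guayaquil"],
    "description": ["Fundacion"],
})

keys = dict(left_on=["date", "city"], right_on=["date", "locale_name"])

# many_to_one: each train row may match at most one holiday row.
merged = train.merge(local, how="left", validate="many_to_one", **keys)

# With duplicated keys on the right, the same call raises instead of growing.
local_dup = pd.concat([local, local], ignore_index=True)
caught = False
try:
    train.merge(local_dup, how="left", validate="many_to_one", **keys)
except pd.errors.MergeError:
    caught = True
print(len(merged), caught)
```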

In [2]:
df_holidays_N_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
227,2016-04-23,Event,National,Ecuador,False,Terremoto
343,2017-12-21,Additional,National,Ecuador,False,Navidad-4
223,2016-04-20,Event,National,Ecuador,False,Terremoto
243,2016-05-07,Event,National,Ecuador,False,Terremoto
347,2017-12-24,Additional,National,Ecuador,False,Navidad-1
225,2016-04-21,Event,National,Ecuador,False,Terremoto
226,2016-04-22,Event,National,Ecuador,False,Terremoto
239,2016-05-04,Event,National,Ecuador,False,Terremoto
345,2017-12-22,Additional,National,Ecuador,False,Navidad-3
229,2016-04-25,Event,National,Ecuador,False,Terremoto


In [3]:
df_holidays_R_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
334,2017-11-06,Holiday,Regional,Santo Domingo de los Tsachilas,False,Provincializacion
24,2012-11-07,Holiday,Regional,Santa Elena,False,Provincializacion
279,2016-11-07,Holiday,Regional,Santa Elena,False,Provincializacion
216,2016-04-01,Holiday,Regional,Cotopaxi,False,Provincializacion
7,2012-06-25,Holiday,Regional,Imbabura,False,Provincializacion
1,2012-04-01,Holiday,Regional,Cotopaxi,False,Provincializacion
335,2017-11-07,Holiday,Regional,Santa Elena,False,Provincializacion
23,2012-11-06,Holiday,Regional,Santo Domingo de los Tsachilas,False,Provincializacion
194,2015-11-07,Holiday,Regional,Santa Elena,False,Provincializacion
139,2014-11-06,Holiday,Regional,Santo Domingo de los Tsachilas,False,Provincializacion


In [4]:
df_holidays_L_use

Unnamed: 0,date,type,locale,locale_name,transferred,description
0,2012-03-02,Holiday,Local,Manta,False,Fundacion
126,2014-07-23,Holiday,Local,Cayambe,False,Cantonizacion
127,2014-07-24,Additional,Local,Guayaquil,False,Fundacion
128,2014-07-25,Holiday,Local,Guayaquil,False,Fundacion
129,2014-08-05,Holiday,Local,Esmeraldas,False,Fundacion
273,2016-10-07,Holiday,Local,Quevedo,False,Cantonizacion
131,2014-08-15,Holiday,Local,Riobamba,False,Fundacion
132,2014-08-24,Holiday,Local,Ambato,False,Fundacion
133,2014-09-28,Holiday,Local,Ibarra,False,Fundacion
134,2014-10-07,Holiday,Local,Quevedo,False,Cantonizacion


The outputs above show that only a few columns in each dataset carry meaningful information.
The following cells keep only those columns.

In [8]:
df_h_n = df_holidays_N_use[['date', 'type', 'description']]

In [9]:
df_h_n

Unnamed: 0,date,type,description
227,2016-04-23,Event,Terremoto
343,2017-12-21,Additional,Navidad-4
223,2016-04-20,Event,Terremoto
243,2016-05-07,Event,Terremoto
347,2017-12-24,Additional,Navidad-1
225,2016-04-21,Event,Terremoto
226,2016-04-22,Event,Terremoto
239,2016-05-04,Event,Terremoto
345,2017-12-22,Additional,Navidad-3
229,2016-04-25,Event,Terremoto


In [10]:
df_h_r = df_holidays_R_use[['date', 'locale_name', 'description']]
df_h_l = df_holidays_L_use[['date', 'locale_name', 'description']]

In [12]:
df_h_l

Unnamed: 0,date,locale_name,description
0,2012-03-02,Manta,Fundacion
126,2014-07-23,Cayambe,Cantonizacion
127,2014-07-24,Guayaquil,Fundacion
128,2014-07-25,Guayaquil,Fundacion
129,2014-08-05,Esmeraldas,Fundacion
273,2016-10-07,Quevedo,Cantonizacion
131,2014-08-15,Riobamba,Fundacion
132,2014-08-24,Ambato,Fundacion
133,2014-09-28,Ibarra,Fundacion
134,2014-10-07,Quevedo,Cantonizacion


In [13]:

drop_add_plus = df_h_n['description'].str.split('+').str[0]
drop_add_base_p = df_h_n.drop('description', axis=1)
df_h_n_1 = pd.concat([drop_add_base_p, drop_add_plus], axis=1)

drop_add_minus = df_h_n_1['description'].str.split('-').str[0]
drop_add_base_m = df_h_n.drop('description', axis=1)
df_h_n_2 = pd.concat([drop_add_base_m, drop_add_minus], axis=1)

df_h_n_2

Unnamed: 0,date,type,description
227,2016-04-23,Event,Terremoto
343,2017-12-21,Additional,Navidad
223,2016-04-20,Event,Terremoto
243,2016-05-07,Event,Terremoto
347,2017-12-24,Additional,Navidad
225,2016-04-21,Event,Terremoto
226,2016-04-22,Event,Terremoto
239,2016-05-04,Event,Terremoto
345,2017-12-22,Additional,Navidad
229,2016-04-25,Event,Terremoto
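The two split steps above can also be sketched as a single regular-expression replacement that removes only a trailing `+N` or `-N`. This is an alternative, not the method used in this notebook. The difference matters only if a name itself contained `+` or `-`; the last value below is a hypothetical example of that case.

```python
import pandas as pd

# The first four values follow the patterns seen in the outputs;
# "Puente-Largo-1" is a hypothetical name that contains a hyphen of its own.
desc = pd.Series(["Terremoto", "Navidad-4", "Navidad+1", "Navidad", "Puente-Largo-1"])

# Remove a sign followed by digits, only at the end of the string.
base = desc.str.replace(r"[+-]\d+$", "", regex=True)

# The split-based approach cuts at the first '+' and then the first '-'.
split_based = desc.str.split("+").str[0].str.split("-").str[0]
print(pd.DataFrame({"regex": base, "split": split_based}))
```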


In [3]:
df_holidays_sort=df_holidays.sort_values(by='locale')

df_holidays_L = df_holidays_sort.loc[:66]
df_holidays_N = df_holidays_sort.loc[227:154]
df_holidays_R = df_holidays_sort.loc[334:278]
# drop the rows whose 'date' duplicates another row
df_holidays_N = df_holidays_N.drop([35, 40, 156, 235, 242, 245])

def desc_transfer(data_set):
    desc_use = data_set['description'].str.split(' ').str[0]
    df_base = data_set.drop('description', axis=1)
    df_desc_transfered = pd.concat([df_base, desc_use], axis=1)
    return df_desc_transfered

df_holidays_L_use = desc_transfer(df_holidays_L)
df_holidays_N_use = desc_transfer(df_holidays_N)
df_holidays_R_use = desc_transfer(df_holidays_R)

# keep only the columns of holidays that will be used
df_h_n = df_holidays_N_use[['date', 'type', 'description']]
df_h_r = df_holidays_R_use[['date', 'locale_name', 'description']]
df_h_l = df_holidays_L_use[['date', 'locale_name', 'description']]

# strip the trailing +/- suffix from description
drop_add_plus = df_h_n['description'].str.split('+').str[0]
drop_add_base_p = df_h_n.drop('description', axis=1)
df_h_n_drop_plus = pd.concat([drop_add_base_p, drop_add_plus], axis=1)

drop_add_minus = df_h_n_drop_plus['description'].str.split('-').str[0]
drop_add_base_m = df_h_n_drop_plus.drop('description', axis=1)
df_h_n_drop = pd.concat([drop_add_base_m, drop_add_minus], axis=1)

df_h_r_rename = df_h_r.rename(index=str, columns={"locale_name": "state"})
df_h_l_rename = df_h_l.rename(index=str, columns={"locale_name": "city"})

# final use
# df_h_n_drop
# df_h_r_rename
# df_h_l_rename

In [4]:
df_h_n_drop

Unnamed: 0,date,type,description
227,2016-04-23,Event,Terremoto
343,2017-12-21,Additional,Navidad
223,2016-04-20,Event,Terremoto
243,2016-05-07,Event,Terremoto
347,2017-12-24,Additional,Navidad
225,2016-04-21,Event,Terremoto
226,2016-04-22,Event,Terremoto
239,2016-05-04,Event,Terremoto
345,2017-12-22,Additional,Navidad
229,2016-04-25,Event,Terremoto


In [5]:
df_h_r_rename

Unnamed: 0,date,state,description
334,2017-11-06,Santo Domingo de los Tsachilas,Provincializacion
24,2012-11-07,Santa Elena,Provincializacion
279,2016-11-07,Santa Elena,Provincializacion
216,2016-04-01,Cotopaxi,Provincializacion
7,2012-06-25,Imbabura,Provincializacion
1,2012-04-01,Cotopaxi,Provincializacion
335,2017-11-07,Santa Elena,Provincializacion
23,2012-11-06,Santo Domingo de los Tsachilas,Provincializacion
194,2015-11-07,Santa Elena,Provincializacion
139,2014-11-06,Santo Domingo de los Tsachilas,Provincializacion


In [6]:
df_h_l_rename

Unnamed: 0,date,city,description
0,2012-03-02,Manta,Fundacion
126,2014-07-23,Cayambe,Cantonizacion
127,2014-07-24,Guayaquil,Fundacion
128,2014-07-25,Guayaquil,Fundacion
129,2014-08-05,Esmeraldas,Fundacion
273,2016-10-07,Quevedo,Cantonizacion
131,2014-08-15,Riobamba,Fundacion
132,2014-08-24,Ambato,Fundacion
133,2014-09-28,Ibarra,Fundacion
134,2014-10-07,Quevedo,Cantonizacion


While merging with the train dataset, it turned out that even within Local there was one city with two holidays on the same date, so one of them was removed.

In [20]:
df_h_l_rename_drop = df_h_l_rename.drop(['265'])
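Duplicates like this can be located before merging instead of being discovered during it. The following sketch uses `duplicated` and `drop_duplicates` on the (`date`, `city`) pair. The rows, city name, and index labels are toy values, and keeping the first row of each pair is an assumption; the notebook chooses the row to drop by hand.

```python
import pandas as pd

# Toy Local holidays with string index labels, as after rename(index=str).
local = pd.DataFrame(
    {
        "date": ["2016-07-24", "2016-07-24", "2016-07-25"],
        "city": ["city_a", "city_a", "city_a"],
        "description": ["Fundacion", "Traslado", "Fundacion"],
    },
    index=["10", "11", "12"],
)

# Show every row that shares its (date, city) with another row.
dups = local[local.duplicated(subset=["date", "city"], keep=False)]
print(dups)

# Keep the first row of each (date, city) pair.
deduped = local.drop_duplicates(subset=["date", "city"], keep="first")
print(deduped.index.tolist())
```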

In [21]:
df_h_l_rename_drop

Unnamed: 0,date,city,description
0,2012-03-02,Manta,Fundacion
126,2014-07-23,Cayambe,Cantonizacion
127,2014-07-24,Guayaquil,Fundacion
128,2014-07-25,Guayaquil,Fundacion
129,2014-08-05,Esmeraldas,Fundacion
273,2016-10-07,Quevedo,Cantonizacion
131,2014-08-15,Riobamba,Fundacion
132,2014-08-24,Ambato,Fundacion
133,2014-09-28,Ibarra,Fundacion
134,2014-10-07,Quevedo,Cantonizacion
