# Data preprocessing: OHE - One Hot Encoding

**NOTEBOOK GOAL**: Encode categorical variables

**DATASET TRANSFORMATION**: `preprocessed1_imputation_(train|test).csv` >> `preprocessed2_OHE_(train|test).csv`

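One Hot Encoding (OHE) turns each categorical variable into a set of 0/1 columns, so that it can be used for regression or any other kind of numerical computation.
Special attention must be paid to Events: a single value can combine several weather events separated by a dash (e.g. `Fog-Rain`), so each event gets its own column.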

#### Attributes to be encoded

- StoreType
- AssortmentType
- Events

**INDEX**
- [Unique values of attributes to encode](#Unique-values-of-attributes-to-encode)
- [OHE for StoreType](#OHE-for-StoreType)
- [OHE for AssortmentType](#OHE-for-AssortmentType)
- [OHE for Events](#OHE-for-Events)
- [Final check](#Final-check)

In [14]:
# switch the commented line to work on the train or test dataset
work_on = 'test'
#work_on = 'train'

In [15]:
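# import_man is a project-local helper module; the cells below expect it to provide pd (pandas) and pprint
from import_man import *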

df = pd.read_csv('./dataset/preprocessed1_imputation_' + work_on + '.csv')

### Unique values of attributes to encode

In [16]:
print("Unique values of StoreType")
pprint(df.StoreType.unique())

print("\nUnique values of AssortmentType")
pprint(df.AssortmentType.unique())

print("\nUnique values of Events")
pprint(df.Events.unique())

Unique values of StoreType
array(['Hyper Market', 'Super Market', 'Standard Market',
       'Shopping Center'], dtype=object)

Unique values of AssortmentType
array(['General', 'With Non-Food Department', 'With Fish Department'],
      dtype=object)

Unique values of Events
array(['Rain', 'Fog-Rain', 'Rain-Snow', 'Rain-Thunderstorm', 'None',
       'Fog', 'Rain-Snow-Hail', 'Fog-Rain-Thunderstorm',
       'Rain-Snow-Hail-Thunderstorm', 'Rain-Hail-Thunderstorm',
       'Rain-Hail', 'Fog-Rain-Snow', 'Snow', 'Rain-Snow-Thunderstorm',
       'Fog-Rain-Hail-Thunderstorm'], dtype=object)


In [17]:
df.head()

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,Region_AreaKM2,...,Min_Sea_Level_PressurehPa,Min_TemperatureC,Min_VisibilitykM,Precipitationmm,WindDirDegrees,D_Day,D_DayOfYear,D_Month,D_Year,D_DayOfweek
0,1000,01/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,1011,2,0.0,0.0,180,1,60,3,2018,3
1,1000,02/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,1009,3,1.0,5.08,315,2,61,3,2018,4
2,1000,03/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,1013,-2,2.0,0.0,210,3,62,3,2018,5
3,1000,04/03/2018,0,0,0,Hyper Market,General,326,7,9643,...,1002,1,3.0,3.05,193,4,63,3,2018,6
4,1000,05/03/2018,0,1,1,Hyper Market,General,326,7,9643,...,1000,2,4.0,0.25,247,5,64,3,2018,0


### OHE for StoreType

In [18]:
# one indicator function per StoreType category: 1 if the value matches, 0 otherwise
def OHE_StoreType_SuperMarket(value):
    return 1 if value=='Super Market' else 0

def OHE_StoreType_HyperMarket(value):
    return 1 if value=='Hyper Market' else 0

def OHE_StoreType_StandardMarket(value):
    return 1 if value=='Standard Market' else 0

def OHE_StoreType_ShoppingCenter(value):
    return 1 if value=='Shopping Center' else 0

In [19]:
df['StoreType_SuperMarket'] = df.StoreType.apply(OHE_StoreType_SuperMarket)
df['StoreType_HyperMarket'] = df.StoreType.apply(OHE_StoreType_HyperMarket)
df['StoreType_StandardMarket'] = df.StoreType.apply(OHE_StoreType_StandardMarket)
df['StoreType_ShoppingCenter'] = df.StoreType.apply(OHE_StoreType_ShoppingCenter)

### OHE for AssortmentType

In [20]:
# one indicator function per AssortmentType category: 1 if the value matches, 0 otherwise
def OHE_AssortmentType_General(value):
    return 1 if value=='General' else 0

def OHE_AssortmentType_WithNFDepartment(value):
    return 1 if value=='With Non-Food Department' else 0

def OHE_AssortmentType_WithFishDepartment(value):
    return 1 if value=='With Fish Department' else 0


In [21]:
df['AssortmentType_General'] = df.AssortmentType.apply(OHE_AssortmentType_General)
df['AssortmentType_WithNFDept'] = df.AssortmentType.apply(OHE_AssortmentType_WithNFDepartment)
df['AssortmentType_WithFishDept'] = df.AssortmentType.apply(OHE_AssortmentType_WithFishDepartment)
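The hand-written indicator functions above could also be produced by `pandas.get_dummies`. The sketch below is an alternative, not the method used in this notebook. It works on a toy frame (`toy`, a hypothetical name) holding the category labels printed earlier, and its column names are derived from those labels, so they differ from the hand-picked ones (for example `AssortmentType_WithNon-FoodDepartment` instead of `AssortmentType_WithNFDept`).

```python
import pandas as pd

# Toy frame (hypothetical) with the category labels printed earlier in the notebook.
toy = pd.DataFrame({
    "StoreType": ["Hyper Market", "Super Market", "Standard Market", "Shopping Center"],
    "AssortmentType": ["General", "With Non-Food Department", "With Fish Department", "General"],
})

# One 0/1 column per category; the prefix is taken from the source column name.
dummies = pd.get_dummies(toy[["StoreType", "AssortmentType"]], dtype=int)

# Remove spaces so the names resemble the notebook's style (e.g. StoreType_HyperMarket).
dummies.columns = [c.replace(" ", "") for c in dummies.columns]

# Keep the original columns and append the encoded ones, as the notebook does.
encoded = pd.concat([toy, dummies], axis=1)
print(encoded.columns.tolist())
```

One caveat with this approach: `get_dummies` only creates columns for the categories present in the frame it is given. Since this notebook is run separately on the train and test files, the hand-written functions have the advantage of always producing the same set of columns.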

In [22]:
df.head()

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,Region_AreaKM2,...,D_Month,D_Year,D_DayOfweek,StoreType_SuperMarket,StoreType_HyperMarket,StoreType_StandardMarket,StoreType_ShoppingCenter,AssortmentType_General,AssortmentType_WithNFDept,AssortmentType_WithFishDept
0,1000,01/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,3,2018,3,0,1,0,0,1,0,0
1,1000,02/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,3,2018,4,0,1,0,0,1,0,0
2,1000,03/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,3,2018,5,0,1,0,0,1,0,0
3,1000,04/03/2018,0,0,0,Hyper Market,General,326,7,9643,...,3,2018,6,0,1,0,0,1,0,0
4,1000,05/03/2018,0,1,1,Hyper Market,General,326,7,9643,...,3,2018,0,0,1,0,0,1,0,0


Notice that, for instance, AssortmentType_WithFishDept is 1 in only a few rows: its mean over all examples is 0.004. Since every store has more than 700 tuples (one per observation day), this means that only a few stores (e.g. 2 or 3) have a fish department. We could discard cases like this, or we could treat those stores separately.
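The reasoning above can be checked directly by comparing the share of rows with the number of distinct stores. The snippet below is a minimal sketch on made-up data (`toy` and its values are assumptions, not figures from the real dataset); on the notebook's `df` the same two expressions would apply to the real columns.

```python
import pandas as pd

# Made-up data: three stores with two observation days each; only store 1002 has a fish department.
toy = pd.DataFrame({
    "StoreID": [1000, 1000, 1001, 1001, 1002, 1002],
    "AssortmentType_WithFishDept": [0, 0, 0, 0, 1, 1],
})

# Share of rows (store-days) flagged with a fish department.
row_share = toy["AssortmentType_WithFishDept"].mean()

# Number of distinct stores behind those rows.
stores_with_fish = toy.loc[toy["AssortmentType_WithFishDept"] == 1, "StoreID"].nunique()
total_stores = toy["StoreID"].nunique()

print(f"share of rows: {row_share:.3f}")
print(f"stores with fish department: {stores_with_fish} of {total_stores}")
```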

### OHE for Events

In [23]:
# Possible single values
# 'Fog', 'Hail', 'None', 'Rain', 'Snow', 'Thunderstorm'
# 'None' gets no column of its own: it corresponds to all event columns being 0

events = [
    'Fog',
    'Hail',
    'Rain',
    'Snow',
    'Thunderstorm'
]

df.columns

Index(['StoreID', 'Date', 'IsHoliday', 'IsOpen', 'HasPromotions', 'StoreType',
       'AssortmentType', 'NearestCompetitor', 'Region', 'Region_AreaKM2',
       'Region_GDP', 'Region_PopulationK', 'CloudCover', 'Events',
       'Max_Dew_PointC', 'Max_Humidity', 'Max_Sea_Level_PressurehPa',
       'Max_TemperatureC', 'Max_VisibilityKm', 'Max_Wind_SpeedKm_h',
       'Mean_Dew_PointC', 'Mean_Humidity', 'Mean_Sea_Level_PressurehPa',
       'Mean_TemperatureC', 'Mean_VisibilityKm', 'Mean_Wind_SpeedKm_h',
       'Min_Dew_PointC', 'Min_Humidity', 'Min_Sea_Level_PressurehPa',
       'Min_TemperatureC', 'Min_VisibilitykM', 'Precipitationmm',
       'WindDirDegrees', 'D_Day', 'D_DayOfYear', 'D_Month', 'D_Year',
       'D_DayOfweek', 'StoreType_SuperMarket', 'StoreType_HyperMarket',
       'StoreType_StandardMarket', 'StoreType_ShoppingCenter',
       'AssortmentType_General', 'AssortmentType_WithNFDept',
       'AssortmentType_WithFishDept'],
      dtype='object')

In [24]:
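# one 0/1 column per event; a substring test is enough here because
# no event name is contained in another event name
for event in events:
    df["Events_" + event] = df.Events.apply(lambda x: 1 if event in x else 0)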
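As an alternative to the loop above, pandas can split dash-separated values itself with `Series.str.get_dummies`. The sketch below is not the notebook's method; it uses a small sample of the Events values printed earlier (`events_sample` is a hypothetical name) and assumes every value is a string.

```python
import pandas as pd

# A few of the Events values printed earlier in the notebook.
events_sample = pd.Series(
    ["Rain", "Fog-Rain", "None", "Rain-Snow-Hail", "Fog-Rain-Thunderstorm"],
    name="Events",
)

# Split each value on '-' and create one 0/1 column per distinct token.
flags = events_sample.str.get_dummies(sep="-")

# 'None' means "no event", so it is dropped rather than encoded as its own column.
flags = flags.drop(columns="None", errors="ignore").add_prefix("Events_")

print(flags)
```

Unlike the substring test, this splits on the separator, so it would stay correct even if one event name were contained in another. If a pandas version parses the literal `None` as a missing value when reading the CSV, the column would need something like `fillna('None')` before either approach is applied.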

### Final check

Let's check which attributes the dataset now contains.

In [25]:
print("\n\nSORTED ATTRIBUTES LIST:")
pprint(sorted(list(df.columns)))

df.head()



SORTED ATTRIBUTES LIST:
['AssortmentType',
 'AssortmentType_General',
 'AssortmentType_WithFishDept',
 'AssortmentType_WithNFDept',
 'CloudCover',
 'D_Day',
 'D_DayOfYear',
 'D_DayOfweek',
 'D_Month',
 'D_Year',
 'Date',
 'Events',
 'Events_Fog',
 'Events_Hail',
 'Events_Rain',
 'Events_Snow',
 'Events_Thunderstorm',
 'HasPromotions',
 'IsHoliday',
 'IsOpen',
 'Max_Dew_PointC',
 'Max_Humidity',
 'Max_Sea_Level_PressurehPa',
 'Max_TemperatureC',
 'Max_VisibilityKm',
 'Max_Wind_SpeedKm_h',
 'Mean_Dew_PointC',
 'Mean_Humidity',
 'Mean_Sea_Level_PressurehPa',
 'Mean_TemperatureC',
 'Mean_VisibilityKm',
 'Mean_Wind_SpeedKm_h',
 'Min_Dew_PointC',
 'Min_Humidity',
 'Min_Sea_Level_PressurehPa',
 'Min_TemperatureC',
 'Min_VisibilitykM',
 'NearestCompetitor',
 'Precipitationmm',
 'Region',
 'Region_AreaKM2',
 'Region_GDP',
 'Region_PopulationK',
 'StoreID',
 'StoreType',
 'StoreType_HyperMarket',
 'StoreType_ShoppingCenter',
 'StoreType_StandardMarket',
 'StoreType_SuperMarket',
 'WindDirDegre

Unnamed: 0,StoreID,Date,IsHoliday,IsOpen,HasPromotions,StoreType,AssortmentType,NearestCompetitor,Region,Region_AreaKM2,...,StoreType_StandardMarket,StoreType_ShoppingCenter,AssortmentType_General,AssortmentType_WithNFDept,AssortmentType_WithFishDept,Events_Fog,Events_Hail,Events_Rain,Events_Snow,Events_Thunderstorm
0,1000,01/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,0,0,1,0,0,0,0,1,0,0
1,1000,02/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,0,0,1,0,0,0,0,1,0,0
2,1000,03/03/2018,0,1,0,Hyper Market,General,326,7,9643,...,0,0,1,0,0,1,0,1,0,0
3,1000,04/03/2018,0,0,0,Hyper Market,General,326,7,9643,...,0,0,1,0,0,0,0,1,0,0
4,1000,05/03/2018,0,1,1,Hyper Market,General,326,7,9643,...,0,0,1,0,0,0,0,1,1,0
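Beyond listing the columns, a quick consistency check can confirm that the StoreType and AssortmentType encodings are truly one-hot, i.e. that exactly one indicator is 1 in every row. The helper below (`one_hot_is_consistent`, a hypothetical name) is a minimal sketch shown on a toy frame. It does not apply to the Events columns, where several events, or none, can be active on the same day.

```python
import pandas as pd

def one_hot_is_consistent(frame, prefix):
    """Return True if, in every row, exactly one column starting with `prefix` equals 1."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    return bool((frame[cols].sum(axis=1) == 1).all())

# Toy frame (hypothetical) mimicking the encoded StoreType columns.
toy = pd.DataFrame({
    "StoreType": ["Hyper Market", "Super Market"],
    "StoreType_SuperMarket": [0, 1],
    "StoreType_HyperMarket": [1, 0],
    "StoreType_StandardMarket": [0, 0],
    "StoreType_ShoppingCenter": [0, 0],
})

# The trailing underscore keeps the original 'StoreType' column out of the check.
print(one_hot_is_consistent(toy, "StoreType_"))
```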


### Write to file

In [26]:
# save the encoded dataset
df.to_csv('./dataset/preprocessed2_OHE_' + work_on + '.csv', index=False)