nave91 edited this page Sep 2, 2014 · 2 revisions

Preprocessing

Data can be preprocessed according to Menzies' guidelines. A mark in each column name declares how that column is treated:

       mark | type  | nump | wordp |
------------|-------|------|-------|-------------------
       ?    |       |      |       | a column to ignore
------------|-------|------|-------|-------------------
dep    =    | klass |      |   X   | a label to predict
       +    | more  |   X  |       | a goal to maximize
       -    | less  |   X  |       | a goal to minimize
------------|-------|------|-------|-------------------
indep  $    | num   |   X  |       | non-goal number 
       else | term  |      |   X   | non-goal non-number

For example, consider the weather dataset:

#data/weather1.csv
outlook,        # forecast
?+$temperature, # degrees Fahrenheit
-$humidity,     # % of dewpoint
windy,          # boolean
=play           # goal
#################################################
sunny       ,85     ,90     ,FALSE  ,no
sunny       ,80     ,90     ,TRUE   ,no
overcast    ,83     ,86     ,FALSE  ,yes
rainy       ,70     ,96     ,FALSE  ,yes
rainy       ,68     ,80     ,FALSE  ,yes
rainy       ,65     ,?      ,TRUE   ,no
overcast    ,64     ,65     ,TRUE   ,yes
sunny       ,72     ,?      ,FALSE  ,no
sunny       ,69     ,70     ,FALSE  ,yes
rainy       ,75     ,80     ,FALSE  ,yes
sunny       ,75     ,70     ,TRUE   ,yes
overcast    ,72     ,90     ,TRUE   ,yes
overcast    ,81     ,75     ,FALSE  ,yes
rainy       ,71     ,90     ,TRUE   ,no
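
A file in this layout could be read along the following lines. This is only a minimal sketch, not the project's own reader: the function name `read_rows` is hypothetical, and it assumes that `#` starts a comment running to the end of the line and that a line ending in a comma continues on the next line (which is how the multi-line header above appears to work).

```python
def read_rows(text):
    """Yield one list of cells per record; the first record is the header.

    Assumptions (not confirmed by the source): '#' begins a comment,
    and a trailing comma means the record continues on the next line.
    """
    pending = ""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comment and padding
        if not line:
            continue  # blank line or comment-only line
        pending += line
        if pending.endswith(","):
            continue  # record is continued on the next line
        yield [cell.strip() for cell in pending.split(",")]
        pending = ""


# A shortened copy of the weather data, for illustration only.
sample = """
#data/weather1.csv
outlook,        # forecast
?+$temperature, # degrees Fahrenheit
-$humidity,     # % of dewpoint
windy,          # boolean
=play           # goal
#################################################
sunny       ,85     ,90     ,FALSE  ,no
rainy       ,65     ,?      ,TRUE   ,no
"""

rows = list(read_rows(sample))
header, data = rows[0], rows[1:]
print(header)
for row in data:
    print(row)
```

Cells holding `?` (as in the humidity column) are left untouched here; how missing values are handled is up to the caller.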

The header can be formatted as follows:

outlook,        # forecast
?+$temperature, # degrees Fahrenheit
-$humidity,     # % of dewpoint
windy,          # boolean
=play           # goal
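
To illustrate how the marks in such a header map onto the table of types, here is a small sketch. The names `classify` and `RULES` are hypothetical, and two things are assumed rather than taken from the source: marks are read only from the start of a column name, and when a name carries several marks the one listed first in the table wins (so `?+$temperature` is treated as ignored).

```python
MARKS = "?=+-$"

# (mark, type, numeric?) in the order the table lists them.
# numeric is None for ignored columns, which the table leaves blank.
RULES = [
    ("?", "ignore", None),
    ("=", "klass", False),
    ("+", "more", True),
    ("-", "less", True),
    ("$", "num", True),
]


def classify(name):
    """Return (bare_name, type, numeric) for one header name."""
    name = name.strip()
    bare = name.lstrip(MARKS)                 # name without leading marks
    prefix = name[: len(name) - len(bare)]    # the leading marks themselves
    for mark, kind, numeric in RULES:
        if mark in prefix:
            return bare, kind, numeric
    return bare, "term", False                # no mark: non-goal non-number


header = ["outlook", "?+$temperature", "-$humidity", "windy", "=play"]
for column in header:
    print(classify(column))
```

Under these assumptions `outlook` and `windy` come out as `term`, `humidity` as `less`, and `play` as `klass`.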

Share and Enjoy!
