# Yearly cadence of an employee

Most of the workforce, especially office workers, follows the same rhythm of life. They have a weekend of one or two days, they take holidays off, and the rest of the time they follow the schedule of their job. I probably didn't do a good job of explaining this, but you get it. The machines don't get it, though! In this notebook, I show an example of how to explain this to a machine by extracting features that will help it find such patterns in the data.

## Explaining the day of year to a machine

In the statistics literature, features are called explanatory variables. I like this name because it can be understood in two ways (and I like puns):

- The dry understanding is that those variables explain the target or response variable: they sit on the RHS of the equation and explain the outcome on the LHS.
- The pun is that those variables explain some concepts to the machine: by deciding to include them during feature engineering, we are telling it to explore the importance of those concepts.
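
As a concrete illustration of the dry reading, assume a plain linear model (an assumption made only for this illustration; the notebook does not commit to a particular model at this point). The response $y$ is then written in terms of $p$ explanatory variables $x_1, \dots, x_p$:

$$
y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \varepsilon
$$

Here the $\beta_i$ are the coefficients to be learned and $\varepsilon$ is a noise term. These symbols are introduced only for this illustration.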

Let's adopt the second understanding of explanatory variables and see how to explain the yearly cadence of an employee to the machine. One might think of features such as `IsWeekend`, which is true if the day of the week is Saturday or Sunday, and `IsThanksGiving`, which is true on American Thanksgiving Day. This is OK if the user base is strictly in the USA. However, I think these features encode too many of the programmer's biases. I prefer to engineer features that do not hard-code my biases but allow the machine to find the biases in the training data. This is another play on words, and not an exactly accurate use of the word bias as in bias vs. variance. I don't think it is blatantly wrong either, even though a more accurate word in this case is priors. Anyway, back to our example. The code snippet below defines the explanatory variables I will use:

In [14]:
// First variable: a one-hot-encoded explanatory variable for the day of week.
// This should help the machine discover the fact that there is a weekly 
// cycle to the lives of many workers, if the data supports this.
final val dayOfWeekFeautreNames = Array("IsMonday",
                                        "IsTueday",
                                        "IsWednesday",
                                        "IsThursday",
                                        "IsFriday",
                                        "IsSaturday",
                                        "IsSunday")


// Second variable: a one-hot-encoded explanatory variable for the month of year.
// This should help the machine discover the yearly cycle in the lives
// of many people, if the subjects creating the dataset have one.
final val monthOfYearFeatureNames = Array("IsJan",
                                          "IsFeb",
                                          "IsMar",
                                          "IsApr",
                                          "IsMay",
                                          "IsJun",
                                          "IsJul",
                                          "IsAug",
                                          "IsSep",
                                          "IsOct",
                                          "IsNov",
                                          "IsDec")


// Third variable: a one-hot-encoded explanatory variable for the week of the month.
// This should help the machine explain any patterns that follow the cycle
// of paychecks, or more generally any monthly cycle. Notice that I could have
// used 31 features, one for each day. Do this if there will be enough data to find
// patterns on such fine-grained features. You can use a Kalman filter to test.
final val weekOfMonthFeatureNames = Array("IsWeek1",
                                          "IsWeek2",
                                          "IsWeek3",
                                          "IsWeek4",
                                          "IsWeek5")


// Fourth, we need some explanatory variables to explain trends.
// We will use only one, for the year over year trend.
final val trendFeatureNames = Array("Year")


// Finally, we need some explanatory variables to capture the effect of
// domain-specific factors which we believe should affect the target variable.
final val domainSpecificFeatureNames = Array("IsHoliday",
                                             "IsHolidayEve",
                                             "IsWorkDay")

// Now let's put everything together, and to help us create feature vectors
// we will create a Map from a feature to a column index.
final val dayOfYearExplanatoryVariables = dayOfWeekFeautreNames.++(
                                          monthOfYearFeatureNames).++(
                                          weekOfMonthFeatureNames).++(
                                          trendFeatureNames).++(
                                          domainSpecificFeatureNames)
                                          
final val featureToCol = dayOfYearExplanatoryVariables.zipWithIndex.toMap
                                          
println("The timestamp of each datapoint will be converted to the following 28 features: \n\t" +
         dayOfYearExplanatoryVariables.zipWithIndex.map(kv => kv._1 + " -> col" + kv._2).mkString("\n\t"))

The timestamp of each datapoint will be converted to the following 28 features: 
	IsMonday -> col0
	IsTueday -> col1
	IsWednesday -> col2
	IsThursday -> col3
	IsFriday -> col4
	IsSaturday -> col5
	IsSunday -> col6
	IsJan -> col7
	IsFeb -> col8
	IsMar -> col9
	IsApr -> col10
	IsMay -> col11
	IsJun -> col12
	IsJul -> col13
	IsAug -> col14
	IsSep -> col15
	IsOct -> col16
	IsNov -> col17
	IsDec -> col18
	IsWeek1 -> col19
	IsWeek2 -> col20
	IsWeek3 -> col21
	IsWeek4 -> col22
	IsWeek5 -> col23
	Year -> col24
	IsHoliday -> col25
	IsHolidayEve -> col26
	IsWorkDay -> col27
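
To make the mapping above concrete, here is a minimal sketch of how one date could be turned into a 28-element feature vector. It is not the notebook's actual conversion code. The object `FeatureVectorSketch` and the function `toFeatures` are hypothetical names, and the feature names are repeated inside the object only so that the block stands alone (including the `IsTueday` spelling, so the column indices match the listing above). The inputs are plain integers using the ISO conventions that Joda-Time's getters follow (Monday = 1, January = 1). Two rules are assumptions of mine: the week of the month is counted in 7-day blocks starting on day 1, and a work day is a Monday to Friday that is not a holiday.

```scala
// Hypothetical sketch: the names mirror the cell above so that the
// column indices line up with col0..col27 in the printed listing.
object FeatureVectorSketch {
  val dayOfWeekNames = Array("IsMonday", "IsTueday", "IsWednesday", "IsThursday",
                             "IsFriday", "IsSaturday", "IsSunday")
  val monthNames = Array("IsJan", "IsFeb", "IsMar", "IsApr", "IsMay", "IsJun",
                         "IsJul", "IsAug", "IsSep", "IsOct", "IsNov", "IsDec")
  val weekOfMonthNames = Array("IsWeek1", "IsWeek2", "IsWeek3", "IsWeek4", "IsWeek5")

  val featureToCol: Map[String, Int] =
    (dayOfWeekNames ++ monthNames ++ weekOfMonthNames ++ Array("Year") ++
      Array("IsHoliday", "IsHolidayEve", "IsWorkDay")).zipWithIndex.toMap

  // dayOfWeekIso: 1 = Monday .. 7 = Sunday; month: 1..12; dayOfMonth: 1..31.
  def toFeatures(year: Int, month: Int, dayOfMonth: Int, dayOfWeekIso: Int,
                 isHoliday: Boolean, isHolidayEve: Boolean): Array[Double] = {
    val v = Array.fill(featureToCol.size)(0.0)
    // One-hot encodings: exactly one column per group is set to 1.
    v(featureToCol(dayOfWeekNames(dayOfWeekIso - 1))) = 1.0
    v(featureToCol(monthNames(month - 1))) = 1.0
    // Assumption: days 1-7 are week 1, 8-14 week 2, ..., 29-31 week 5.
    v(featureToCol(weekOfMonthNames((dayOfMonth - 1) / 7))) = 1.0
    // The trend feature keeps the raw year value.
    v(featureToCol("Year")) = year.toDouble
    if (isHoliday) v(featureToCol("IsHoliday")) = 1.0
    if (isHolidayEve) v(featureToCol("IsHolidayEve")) = 1.0
    // Assumption: a work day is Monday to Friday and not a holiday.
    if (dayOfWeekIso <= 5 && !isHoliday) v(featureToCol("IsWorkDay")) = 1.0
    v
  }
}

// Thursday 2016-11-24, which was Thanksgiving Day in the USA.
val example = FeatureVectorSketch.toFeatures(2016, 11, 24, 4,
                                             isHoliday = true, isHolidayEve = false)
```

In the notebook itself, the integer arguments would presumably come from a Joda-Time `DateTime` (`getYear`, `getMonthOfYear`, `getDayOfMonth`, `getDayOfWeek`), and the holiday flags from the holiday library described in the next section.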


## Dependencies

The code depends on Joda-Time and Jollyday:

- Joda-Time was the de facto standard date and time library for Java before Java 8.
- Jollyday is a library for detecting holidays. It worked well for North American holidays, and it should work for other countries as well. Version 0.4.9 is the latest version that does not use Java 8 features, so I am using it to avoid the issues of running Scala on Java 8.

PS: Jollyday pulls in a version of Joda-Time (2.4, judging by the download log below).

In [5]:
%AddDeps de.jollyday jollyday 0.4.9 --transitive --verbose
import de.jollyday.{HolidayCalendar, HolidayManager}
import org.joda.time.format.ISODateTimeFormat
import org.joda.time.{DateTime, DateTimeConstants, DateTimeZone, Days}

Marking de.jollyday:jollyday:0.4.9 for download
Preparing to fetch from:
-> file:/var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/
-> https://repo1.maven.org/maven2
=> https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom: Found at /var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/https/repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom
=> https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom.sha1: Found at /var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/https/repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom.sha1
=> https://repo1.maven.org/maven2/de/

Marking de.jollyday:jollyday:0.4.9 for download
Preparing to fetch from:
-> file:/var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/
-> https://repo1.maven.org/maven2
=> 1 (): Downloading https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom.sha1
=> 2 (): Downloading https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom
=> 1 (jollyday-0.4.9.pom.sha1): Finished downloading
=> 2 (jollyday-0.4.9.pom): Finished downloading
=> 3 (): Downloading https://repo1.maven.org/maven2/joda-time/joda-time/2.4/joda-time-2.4.pom.sha1
=> 4 (): Downloading https://repo1.maven.org/maven2/joda-time/joda-time/2.4/joda-time-2.4.pom
=> 5 (): Downloading https://repo1.maven.org/maven2/javax/xml/bind/jaxb-api/2.2.7/jaxb-api-2.2.7.pom.sha1
=> 6 (): Downloading https://repo1.maven.org/maven2/javax/xml/bind/jaxb-api/2.2.7/jaxb-api-2.2.7.pom
=> 3 (joda-time-2.4.pom.sha1): Finished downloading
=> 6 (jaxb-api-2.2.7.pom): Finished downloading