## Yearly cadence of an employee

The majority of the workforce, especially office workers, follows the same rhythm of life. They have a weekend of one or two days, they take holidays off, and otherwise they follow the schedule of their job. I probably didn't do a good job explaining this, but you get it. The machines don't get it though! In this notebook, I show an example of how to explain this to a machine by extracting features that will help it find such patterns in the data.

## Explaining the day of year to a machine

In the statistics literature, features are called explanatory variables. I like this name because it can be understood in two ways (and I like puns):

- The dry understanding is that those variables explain the target or response variable: they sit on the RHS of the equation and explain the outcome on the LHS.
- The pun is that those variables explain some concepts to the machine, and by deciding to include them during feature engineering we are telling it to explore the importance of this concept.
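
To make the dry reading concrete, here is the usual linear-regression form, used purely as an illustration (this notebook does not commit to a particular model). The response $y$ sits on the left, and the explanatory variables $x_1, \dots, x_p$ sit on the right with their coefficients $\beta_i$ and an error term $\varepsilon$:

$$
y = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \varepsilon
$$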

Let's adopt the second understanding of explanatory variables, and let's see how to explain to the machine the concept of the yearly cadence of an employee. One might think of features such as `IsWeekend`, which will be true if the day of week is Saturday or Sunday, and `IsThanksGiving`, which will be true for American Thanksgiving Day. This is OK if the user base is strictly in the USA. However, I think that there is too much of the programmer's biases encoded into these features. I prefer to engineer features that do not hard-code my biases, but allow the machine to find the biases in the training data. This is another play on words, and not an exactly accurate use of the word bias as in bias vs. variance. However, I think it is not blatantly wrong either, even though a more accurate word to use in this case is priors. Anyway, back to our example. The code snippet below defines the explanatory variables I will use:

In [15]:
// First variable: a one-hot-encoded explanatory variable for the day of week.
// This should help the machine discover the fact that there is a weekly 
// cycle to the lives of many workers, if the data supports this.
final val dayOfWeekFeautreNames = Array("IsMonday",
                                        "IsTueday",
                                        "IsWednesday",
                                        "IsThursday",
                                        "IsFriday",
                                        "IsSaturday",
                                        "IsSunday")


// Second variable: a one-hot-encoded explanatory variable for the month of year.
// This should help the machine discover the yearly cycle in the lives
// of many people, if the subjects creating the dataset have one.
final val monthOfYearFeatureNames = Array("IsJan",
                                          "IsFeb",
                                          "IsMar",
                                          "IsApr",
                                          "IsMay",
                                          "IsJun",
                                          "IsJul",
                                          "IsAug",
                                          "IsSep",
                                          "IsOct",
                                          "IsNov",
                                          "IsDec")


// Third variable: a one-hot-encoded explanatory variable for the week of the month.
// This should help the machine explain any patterns that follow the cycle
// of paychecks, or more generally any monthly cycle. Notice that I could have
// used 31 features, one for each day. Do this if there will be enough data to find
// patterns on such fine grained features. You can use a Kalman filter to test.
final val weekOfMonthFeatureNames = Array("IsWeek1",
                                          "IsWeek2",
                                          "IsWeek3",
                                          "IsWeek4",
                                          "IsWeek5")


// Fourth, we need some explanatory variables to explain trends.
// We will use only one, for the year over year trend.
final val yearFeatureName = "Year"
final val trendFeatureNames = Array(yearFeatureName)


// Finally, we need some explanatory variables to capture the effect of
// domain specific factors which we believe should affect the target variable.
final val isHolidayFeatureName = "IsHoliday"
final val isHolidayEveFeatureName = "IsHolidayEve"
final val isWorkDayFeatureName = "IsWorkDay"
final val domainSpecificFeatureNames = Array(isHolidayFeatureName,
                                             isHolidayEveFeatureName,
                                             isWorkDayFeatureName)

// Now let's put everything together, and to help us create feature vectors
// we will create a Map from a feature to a column index.
final val dayOfYearExplanatoryVariables = dayOfWeekFeautreNames.++(
                                          monthOfYearFeatureNames).++(
                                          weekOfMonthFeatureNames).++(
                                          trendFeatureNames).++(
                                          domainSpecificFeatureNames)
                                          
final val featureToCol = dayOfYearExplanatoryVariables.zipWithIndex.toMap
                                          
println("The timestamp of each datapoint will be converted to the following 28 features: \n\t" +
         dayOfYearExplanatoryVariables.zipWithIndex.map(kv => kv._1 + " -> col" + kv._2).mkString("\n\t"))

The timestamp of each datapoint will be converted to the following 28 features: 
	IsMonday -> col0
	IsTueday -> col1
	IsWednesday -> col2
	IsThursday -> col3
	IsFriday -> col4
	IsSaturday -> col5
	IsSunday -> col6
	IsJan -> col7
	IsFeb -> col8
	IsMar -> col9
	IsApr -> col10
	IsMay -> col11
	IsJun -> col12
	IsJul -> col13
	IsAug -> col14
	IsSep -> col15
	IsOct -> col16
	IsNov -> col17
	IsDec -> col18
	IsWeek1 -> col19
	IsWeek2 -> col20
	IsWeek3 -> col21
	IsWeek4 -> col22
	IsWeek5 -> col23
	Year -> col24
	IsHoliday -> col25
	IsHolidayEve -> col26
	IsWorkDay -> col27


## Feature extraction

Now that we know what features we want to use, we need to write code to convert a timestamp to all those features. The output is going to be a sequence of (feature column index, value) pairs. This can then be easily converted to a Vector representation suitable for the machine learning library of your choice.
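
As a rough sketch of that last conversion step, the sparse pairs can be expanded into a dense array of doubles. The helper name `toDenseVector` is my own invention for illustration, and most machine learning libraries ship their own sparse and dense vector types that you would use instead.

```scala
// Hypothetical helper, not part of the original notebook.
// Expands sparse (column index, value) pairs into a dense array of length numFeatures.
def toDenseVector(pairs: Seq[(Int, Double)], numFeatures: Int): Array[Double] = {
  val dense = Array.fill(numFeatures)(0.0)
  pairs.foreach { case (col, value) =>
    require(col >= 0 && col < numFeatures, s"column index $col is out of range")
    dense(col) = value
  }
  dense
}

// Example: a Monday in October, using the column layout printed earlier
// (IsMonday -> col0, IsOct -> col16) and 28 features in total.
val sparsePairs = Seq((0, 1.0), (16, 1.0))
val denseVector = toDenseVector(sparsePairs, 28)
println(denseVector.mkString(", "))
```

Every column that does not appear in the pairs stays at 0.0, which is exactly what one-hot encoding expects.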

### Dependencies

The code depends on Joda Time and Jolly Day:

- Joda Time was the de facto standard date and time library for Java before Java 8.
- Jolly Day is a library to detect holidays. It worked well for North American holidays, and it should work for other countries as well! Version 0.4.9 is the latest version that does not use Java 8 features, so I am using it to avoid dealing with issues of running Scala on Java 8. 

PS: Jolly Day pulls in a version of Joda Time.

In [16]:
%AddDeps de.jollyday jollyday 0.4.9 --transitive --verbose
import de.jollyday.{HolidayCalendar, HolidayManager}
import org.joda.time.format.ISODateTimeFormat
import org.joda.time.{DateTime, DateTimeConstants, DateTimeZone, Days, LocalDate}

Marking de.jollyday:jollyday:0.4.9 for download
Preparing to fetch from:
-> file:/var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/
-> https://repo1.maven.org/maven2
=> https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom: Found at /var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/https/repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom
=> https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom: Found at /var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/https/repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom
=> https://repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom.sha1: Found at /var/folders/pd/3vnhbdwj3wx1nm58z5lnyyqm0000gn/T/toree_add_deps8827194317915505009/https/repo1.maven.org/maven2/de/jollyday/jollyday/0.4.9/jollyday-0.4.9.pom.sha1
=> https://repo1.maven.org/maven2/de/jollyday/j

### A note on code organization

The code in this notebook should really be organized into a feature extractor class and its companion object. I am pointing this out because I am still put off by the notebook style of programming, where everything is global. I know that I can define a class in a code cell, but then how do I insert markdown cells in the middle? Anyway, back to the example; at least Scala makes it very easy to code in several styles. Let's code!

In [25]:
import scala.collection.mutable.ListBuffer

def extractDayOfYearFeatures(timeInZone: DateTime, countryCode: String, stateOpt: Option[String]): Seq[(Int, Double)] = {
    
    val dayOfWeekFeatureName = dayOfWeekFeautreNames(timeInZone.getDayOfWeek - 1)
    val dayOfWeekFeatureIx = featureToCol(dayOfWeekFeatureName)

    val monthOfYearFeatureName = monthOfYearFeatureNames(timeInZone.getMonthOfYear - 1)
    val monthOfYearFeatureIx = featureToCol(monthOfYearFeatureName)

    // The raw year (e.g. 2015) is kept as-is here, but it should be normalized to the 0..1 range before training.
    // See http://www.machine-wisdom.com/blog/post/002_linear-regression-poor-perf/ for reasons why
    val year: Double = timeInZone.getYear
    
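    // Days 1-7 map to week 1 (index 0), days 8-14 to week 2, ..., days 29-31 to week 5 (index 4).
    // Subtracting 1 first keeps day 7 in week 1 and day 28 in week 4.
    val weekOfMonthNumber: Int = (timeInZone.getDayOfMonth - 1) / 7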
    val weekOfMonthFeatureName = weekOfMonthFeatureNames(weekOfMonthNumber)
    val weekOfMonthFeatureIx = featureToCol(weekOfMonthFeatureName)
    
    // The holiday and workday features are a little bit more interesting
    val holidayManager = countryCode match {
      case "CA" => HolidayManager.getInstance(HolidayCalendar.CANADA)
      // TODO: Add all the countries in which you operate
      case _ => HolidayManager.getInstance(HolidayCalendar.UNITED_STATES)
    }

    def isHoliday(localDate: LocalDate): Boolean = stateOpt match {
      case Some(state) => holidayManager.isHoliday(localDate, state)
      case None =>        holidayManager.isHoliday(localDate)
    }

    // TODO: Make sure this is correct in all countries in which you operate
    val weekendDays = Set(DateTimeConstants.SATURDAY, DateTimeConstants.SUNDAY)

    val features: ListBuffer[(Int, Double)] = ListBuffer((dayOfWeekFeatureIx, 1.0),
                                                          (monthOfYearFeatureIx, 1.0),
                                                          (featureToCol(yearFeatureName), year),
                                                          (weekOfMonthFeatureIx, 1.0))

    if (isHoliday(timeInZone.toLocalDate)) {
      features.+=((featureToCol(isHolidayFeatureName), 1.0))
    } else if (!weekendDays.contains(timeInZone.dayOfWeek().get())) {
      features.+=((featureToCol(isWorkDayFeatureName), 1.0))
    }

    if (isHoliday(timeInZone.plusDays(1).toLocalDate)) {
      features.+=((featureToCol(isHolidayEveFeatureName), 1.0))
    }

    features.toSeq.sorted
}
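
The comment inside the function says that the year should be normalized to the 0..1 range, but the function returns the raw year. Here is a minimal sketch of one way to do that afterwards with min-max scaling. The helper name `normalizeYear` and the 2010 to 2020 range are assumptions I made up for illustration; in practice you would take the range from your training data.

```scala
// Hypothetical helper, not part of the original notebook.
// Rescales the value stored in yearCol to [0, 1], given an assumed range of years.
def normalizeYear(features: Seq[(Int, Double)],
                  yearCol: Int,
                  minYear: Int,
                  maxYear: Int): Seq[(Int, Double)] = {
  require(maxYear > minYear, "maxYear must be greater than minYear")
  features.map {
    case (col, value) if col == yearCol =>
      (col, (value - minYear) / (maxYear - minYear).toDouble)
    case other => other
  }
}

// Column 24 is the Year column in the layout printed earlier.
// The 2010..2020 range is an arbitrary assumption for this example.
val rawFeatures = Seq((0, 1.0), (16, 1.0), (24, 2015.0))
val scaledFeatures = normalizeYear(rawFeatures, yearCol = 24, minYear = 2010, maxYear = 2020)
println(scaledFeatures)
```

All other columns pass through untouched, so the one-hot features keep their 0/1 values.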

### Unit tests
Let's test whether the code above works. Again, this code should go into a unit test. But when in Rome, do as the Romans do, so when in a notebook, do as people who code in a browser do.

In [30]:
val canadianThanksGiving = ISODateTimeFormat.date().parseLocalDateTime("2015-10-12").toDateTime(DateTimeZone.forID("CST6CDT"))
val canadianThanksGivingActual = extractDayOfYearFeatures(canadianThanksGiving, "CA", Some("ON"))
val canadianThanksGivingExpected = Seq((featureToCol("IsMonday"), 1.0),
                                       (featureToCol("IsHoliday"), 1.0),
                                       (featureToCol("IsOct"), 1.0),
                                       (featureToCol("IsWeek2"), 1.0),
                                       (featureToCol("Year"), 2015.0)
                                       ).sorted
println("Expected: " + canadianThanksGivingExpected)
println("Actual: " + canadianThanksGivingActual)
assert(canadianThanksGivingActual == canadianThanksGivingExpected)


Expected: List((0,1.0), (16,1.0), (20,1.0), (24,2015.0), (25,1.0))
Actual: List((0,1.0), (16,1.0), (20,1.0), (24,2015.0), (25,1.0))


In [32]:
val usChristmasEve = ISODateTimeFormat.date().parseLocalDateTime("2014-12-24").toDateTime(DateTimeZone.forID("CST6CDT"))
val usChristmasEveActual = extractDayOfYearFeatures(usChristmasEve, "US", None)
val usChristmasEveExpected = Seq((featureToCol("IsWednesday"), 1.0),
                                 (featureToCol("IsHolidayEve"), 1.0),
                                 (featureToCol("IsWorkDay"), 1.0),
                                 (featureToCol("IsDec"), 1.0),
                                 (featureToCol("IsWeek4"), 1.0),
                                 (featureToCol("Year"), 2014.0)
                                 ).sorted
println("Expected: " + usChristmasEveExpected)
println("Actual: " + usChristmasEveActual)
assert(usChristmasEveActual == usChristmasEveExpected)


Expected: List((2,1.0), (18,1.0), (22,1.0), (24,2014.0), (26,1.0), (27,1.0))
Actual: List((2,1.0), (18,1.0), (22,1.0), (24,2014.0), (26,1.0), (27,1.0))


## Conclusion

We have seen one way of extracting useful features out of a timestamp. Now append this sequence to the other features that you use, add a response variable column, maybe also add the value of the response at some lag days, and let the learning begin... wait, NO! You must first check whether, based on your data, there are any correlated features (especially the holiday and workday features). Some algorithms will produce poor results if they are given correlated features. Also, you may [need to normalize the Year column](http://www.machine-wisdom.com/blog/post/002_linear-regression-poor-perf/).
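
One simple way to look for correlated features is the Pearson correlation between two feature columns. The sketch below is only an illustration: the `pearson` helper and the seven-day sample are made up, and a real check would run over the columns of your own dataset, probably with your library's statistics routines.

```scala
// Hypothetical helper, not part of the original notebook.
// Pearson correlation between two equally long columns of values.
// Returns NaN if either column is constant (zero variance).
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.nonEmpty && xs.length == ys.length, "columns must be non-empty and equally long")
  val meanX = xs.sum / xs.length
  val meanY = ys.sum / ys.length
  val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// Made-up sample of seven days: one holiday, four work days, two weekend days.
val holidayColumn = Seq(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val workDayColumn = Seq(0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0)
val correlation = pearson(holidayColumn, workDayColumn)
println(f"correlation(IsHoliday, IsWorkDay) = $correlation%.3f")
```

Because the extractor never sets `IsHoliday` and `IsWorkDay` on the same day, the two columns come out negatively correlated in this sample. How strong a correlation is a problem depends on the algorithm you use.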

If you would like to discuss this notebook, please [join the conversation on Gitter](https://gitter.im/machine-wisdom/Lobby).