Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize xts' index so users can provide their own index class #190

Open
joshuaulrich opened this issue Jun 2, 2017 · 7 comments
Open
Labels
enhancement Enhancement to existing feature

Comments

@joshuaulrich
Copy link
Owner

joshuaulrich commented Jun 2, 2017

The xts class was originally implemented with the index class and index timezone attributes attached to the xts object itself. That means the xts class needs to know about all possible index classes, which makes it difficult to extend xts to use a new index class.

Jeff started to move the index-specific attributes from the xts object to the index object. See the commit message for e4edc53:

o  added tzone and tclass functions as aliases to
   indexTZ and indexClass, respectively. Eventually
   will Deprecate/Defunct the former.

Most places currently set attributes on both objects, because it's not clear which components of xts depend on the index-specific attributes being attached to either object (xts or index).

We would first need to ensure that all attempts to access the index class or timezone are getting and returning values from the index object (note that this applies to R and C code). Then we would need to remove all attempts to set attributes on the xts object... and see what breaks.

Next we need to define an API for any potential index to implement. If I recall correctly, this was part of the idea behind xtime. The thing I remember most was that Jeff said, "xtime is for date-time classes what xts is for time-series classes". With an "xtime API", xts could program to it and you could use any class as an xts index as long as it implemented the xtime API. (Note that "xtime" may not be good name, since it is a Cox Automotive brand).

This API may need more than "index class" and "index timezone". The current xts API will need to be examined to see which functions rely on specific attributes and/or functionalities of the index. At the C level, the xts index can be either INTSXP, REALSXP, or an S4 object (thanks to timeDate).

@joshuaulrich joshuaulrich added the enhancement Enhancement to existing feature label Jun 2, 2017
@ghost
Copy link

ghost commented Jun 2, 2017

The idea is a right direction, but realizing it will be painfully difficult. I quickly looked at C code and functions like do_merge_xts() or do_rbind_xts() use coredata and index. xts C types expects INTSXP or REALSXP, which most often correspond to Date or POSIXct in R, and using index and coredata in C gives xts its speed (rbind, merge).

xtime is based on POSIXct (REALSXP), so it is correctly supported by xts (C). I guess these changes are supposed to allow to use nanotime in the index. I am afraid, that there will be no simple way to operate on an nanotime based index from C (xts), eg. how to distinguish nanotime from POSIXct? - both will be seen as double/REALSXP.

@joshuaulrich
Copy link
Owner Author

eg. how to distinguish nanotime from POSIXct? - both will be seen as double/REALSXP.

The point of an API is that there should be no need to distinguish between any potential index class. xts shouldn't know or care the class of the index, as long as the class supplies either an INTSXP or REALSXP for the index, and whatever other components are determined to be necessary (e.g. a timezone attribute, a print method, etc).

@lsilvest
Copy link

lsilvest commented Jun 5, 2017

@silentbits has a point though, at C level nanotime will look like a REALSXP. But in fact, behind the scenes, it's the 64-bit value of this floating point that will be interpreted as a 64-bit integer. The issue in doing this is that without that cast the comparison operators used in functions like do_merge_xts() or do_rbind_xts() will be incorrect. So nanotime will represent a special case. Ugly to say the least. We can once more lament the fact that R doesn't provide a 64-bit integer and all the hacks that derive from this lack.

Here is an example of what I'm saying about ordering:

// -*- compile-command: "gcc -std=c99 -g prec.c -o prec; ./prec" -*- 

#include <stdio.h>
#include <stdint.h>


int main() {
  int64_t a = -1;
  int64_t b = 1;

  double* ad = (double*)(&a);
  double* bd = (double*)(&b);

  printf("ad: %lf, bd: %lf\n", *ad, *bd);
  printf("ad < bd: %d\n", *ad < *bd);
  
  return 0;
}

Which outputs:

ad: -nan, bd: 0.000000
ad < bd: 0

@ghost
Copy link

ghost commented Jun 5, 2017

There is a simple solution that would solve all these problems: we should drop POSIXct as the xts index. It sounds like madness, but (after understanding) my solution is very logical and reasonable. Currently, xts internally for the index always expects INTSXP (Date) and REALSXP (POSIXct). POSIXct can be replaced by nanotime and then before every .Call() (if necessary) xts will convert POSIXct to nanotime (as.nanotime()). In C code instead of using double pointers we can use union:

typedef union {
   double *r_storage;
   int64_t *xindex;
} i64_index_t;

and then map REALSXP pointer to our xts union, eg (UNTESTED):

https://github.com/joshuaulrich/xts/blob/master/src/any.c

SEXP any_negative (SEXP x)
{
	int i;
	int len = length(x);
	int *int_x = NULL;
	i64_index_t idx; /* nanotime index union */

	switch (TYPEOF(x)) {
	case INTSXP:
		int_x = INTEGER(x);
		for (i = 0; i < len; i++) {
			if (int_x[i] >= 0)
				continue;
			return ScalarLogical(1);
		}
		break;
	case REALSXP:
		/* get the REALSXP data pointer */
		idx.r_storage = REAL(x);
		for (i = 0; i < len; i++) {
			/* use index data as int64_t (nanotime) */
			if (idx.xindex[i] >= 0)
				continue;
			return ScalarLogical(1);
		}
		break;
	default:
		error("unsupported type");
	}
	return ScalarLogical(0);
}

It's a small change, but the most important is that it does not break the internal logic of C code (do_merge_xts(), do_rbind_xts(). An additional benefit is that we can hide this change, that it will not be visible outside xts (eg. reclass()).

If you are interested, I would be very happy to help with such an implementation.

Best regards,
Daniel

@ghost
Copy link

ghost commented Jun 6, 2017

btw. R's TypeTable for TYPEOF() or struct sxpinfo->type:

https://github.com/wch/r-source/blob/trunk/src/main/util.c#L196

it means that the mark for a special case for integer64 type is necessary in R's core sources (hard cake). And here's an example that bit64 (and nanotime) uses double (REALSXP) as internal storage.

> x <- as.integer64(123)
> x
integer64
[1] 123
> storage.mode(x)
[1] "double"

@joshuaulrich
Copy link
Owner Author

@silentbits I'm replying to your comment in issue 241 here, to keep the conversation in one place.

I showed a possible solution to this problem [above] and yes, it is a lot of work, but in my opinion it is worth taking up this effort. The key is to not break xts R API for external packages.

If you decide on the proposed solution, then of course I will be very committed to help as much as possible (especially in the case of C code).

I do like your idea, and I've been using a similar approach to clean up the merge C code. The idea is that using your approach to clean up the existing code should make it easier to add another index type. My changes are in the generic_index branch. You may also be interested in some of the really crazy stuff in the tight-loop branch (which has no hope of being merged without a solution that uses dynamic arrays to store the nodes).

I'm sorry, I didn't realize you were willing to help. I'm glad to have whatever help you can give! I will work on writing a roadmap and some coding standards. I've also started a project for this feature, since it's bigger than just one issue, but I'm not sure if it's publicly visible.

I'm going to be busy with R/Finance this week, so I won't be able to make much progress. But we will need plenty of unit tests and performance benchmark tests before making any changes, so you could work on those. It would help if you could look into setting up Rperform for benchmark regression testing.

I just created an issue (245) for one thing that needs to be done before we can make significant progress on adding nanosecond resolution. I have been working on that in the insane_refactor branch.

@ghost
Copy link

ghost commented Jun 16, 2018

My opinion is a bit controversial: we should drop POSIXct as the xts index. And the main motivation is not the ability to use nanoseconds. It's simply: POSIXct as a xts index is broken by design. A simple demo where is the problem:

> getSymbols('AAPL')
[1] "AAPL"
> # extract last close price
> v1 <- AAPL[nrow(AAPL),4]
> v1
           AAPL.Close
2018-06-15     188.84
> # get it as numeric
> v1 <- as.numeric(v1)
> v1
[1] 188.84
> # now let's assign '188.84' to v2
> v2 <- 188.84
> cat(v1, v2)
188.84 188.84
> # are they really equal?
> v1 == v2
[1] FALSE
> # so where is the problem?
> as.character(v1)
[1] "188.839996"
> as.character(v2)
[1] "188.84"

We had the conversion "188.84" character to double, during which we lost a precision. As a result, our comparison returned FALSE. So we have unexpected behavior bug (our code is correct). Why it's important for xts index? Each time we want to use the POSIXct index, we start to play this unexpected behavior bugs game (eg. xts_obj["datetime_string"]). Another example:

> Sys.setenv(TZ='UTC')
> require(nanotime)
> # date + 100 ms
> str(nanotime("2018-06-16T00:00:00.100000000+00:00"))
integer64 1529107200100000000
> print(as.numeric(as.POSIXct("2018-06-16 00:00:00.100")), digits = 20)
[1] 1529107200.0999999046

So integer and integer64 is the only right solution for the xts index.

This is a very big change. That's why I think we should ask others for an opinion on this issue (Brian, Dirk, Jeff etc. etc.). If they also support this idea, then we can get to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement to existing feature
Projects
Development

No branches or pull requests

2 participants