gegen total vs. egen total #72
Comments
I do not think it's specific to I cannot change the default behavior of FYI this works:
PS: Sorts are not stable by default and the sort order of your data will affect it. You ought to do something like
I see, makes sense that this problem is not specific to gegen total. If I've understood gtools correctly, its improvement over egen disappears when gegen is combined with the by-prefix (and not the by()-option), right? I.e.
Isn't necessarily any faster than using egen?
I agree: If it's not easily fixed, then a warning might help other users to be aware of this potential problem. Never mind the mistake in the toy example. I guess the problem here is what happens when gtools tries to call the _n==0 observation when using by(). (Btw: Thanx for a very good package. gtools gives me several hours of extra coding time each week).
Right; the The issue is that gtools is computing the expression for the whole data, whereas
Hm, haven't read enough about gtools to understand exactly what's going on here. But yeah, if I replace the toy example above with
I get a data set where cat is always equal to "one". Then gtools produces
Which is another surprising result. So, what's the moral here? Subscripting with gtools should be used with caution (or not at all)?
@adamreir The lesson is this:
Aha, now I understand what's going on! Will keep this in mind. Thanx a lot!
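A minimal sketch of the pattern discussed in this thread, assuming gtools is installed. The toy data and the variable names `is_new` and `n_runs` are hypothetical and are not the original example from this issue. The idea is to evaluate the subscripted expression with the `by` prefix first, so that `[_n-1]` is resolved within each group, and only then pass the resulting plain variable to `gegen`:

```stata
* Hypothetical toy data (not the original example from this issue)
clear
input byte id str3 cat
1 "one"
1 "one"
2 "two"
2 "two"
3 "two"
end

* Sorts are not stable by default, so request a stable sort explicitly
sort id, stable

* Evaluate the subscripted expression within each id group:
* for the first observation of a group, cat[_n-1] is cat[0], i.e. ""
by id: generate byte is_new = cat != cat[_n-1]

* Pass the precomputed variable (no subscripts) to gegen
gegen n_runs = total(is_new), by(id)

list, sepby(id)
```

With this toy data, `n_runs` should equal 1 in every group, because the comparison for the first observation of each group is made against an empty string rather than against the last value of the preceding group.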
Features
- Added `gglm` to estimate GLM models, including `logit`.

Bug Fixes
- Closes #78: if now passed raw/in double-quotes throughout the pipeline
- Closes #75: gunique returns 0s in r() when there are no obs
- Closes #74: gstats transform parses abbreviated targets
- Closes #72: Warning for gegen expressions without by group
- Fixed GLM issues generating de-meaned variables
- Fixed gegen nunique with multiple inputs
- Fixed bug in `gpoisson` where internal weights not copied correctly in loop.
- Various fixes to the docs.
I came across a discrepancy between egen total and gegen total when comparing the value of a variable with an observation that is out of bounds (i.e. _n==0).
It seems that when subscripting a variable in Stata with [_n-1], egen treats the 0th observation as missing (var[0]==""). However, when gegen total is combined with by(), gegen seems to look at the last value (_n==_N) of the preceding group (as defined by by()).
Here is a simple toy example:
Description: gegen/egen total is used to calculate the total number of distinct values in cat by id (Edit: I know that gegen unique() executes this task, but believe this example might highlight a more fundamental problem with gegen).
Output egen_vs_gegen.log:
. list
gegen returns 0 for the third group (id==3), probably because the last value of cat in the second group (id==2) is identical to the value of cat in the third group (id==3).
egen, for comparison, returns 1 for all observations, which is as expected.
Version info