gegen total vs. egen total #72

Closed
adamreir opened this issue Feb 22, 2021 · 6 comments

Comments

@adamreir

adamreir commented Feb 22, 2021

I came across a discrepancy between egen total and gegen total when comparing the value of a variable with an observation that is out of bounds (i.e. _n==0).

It seems that when subscripting a variable in Stata with [_n-1], egen assumes the 0th observation is missing (var[0]==""). However, when gegen total is combined with by(), gegen seems to look at the last value (_n==_N) of the previous group (as defined by by()).

Here is a simple toy example:

clear
set obs 9
gen id=strofreal(floor((_n+2)/3)) 

g cat="none" if id=="1"
replace cat="one" if id!="1"

gegen gtot=total(cat!=cat[_n-1]), by(id)
egen tot=total(cat!=cat[_n-1]), by(id)

Description: gegen/egen total is used to calculate the total number of distinct values in cat by id (Edit: I know that gegen unique() executes this task, but believe this example might highlight a more fundamental problem with gegen).

Output egen_vs_gegen.log:

. list

id   cat    gtot   tot
1    none   1      1
1    none   1      1
1    none   1      1
2    one    1      1
2    one    1      1
2    one    1      1
3    one    0      1
3    one    0      1
3    one    0      1

gegen returns 0 for the third group (id==3), probably because the last value of cat in the second group (id==2) is identical to the value of cat in the third group (id==3).

egen, for comparison, returns 1 for all observations, which is as expected.

Version info

  • OS: Windows 10
  • Version (gtools): version 1.7.5 18Apr2020
@mcaceresb
Owner

mcaceresb commented Feb 22, 2021

I do not think it's specific to total, and the issue is not quite what you describe so it affects every command. If gegen is called without a by prefix, then the expression is computed without by. If you want to compute the expression by a set of variables, you need to use the by prefix.

I cannot change the default behavior of gegen. Computations are done in C and I cannot parse Stata's syntax there. However, I can try to print a warning here and in the documentation? The whole point of gtools is that the data needn't be sorted; I didn't realize that egen computed stuff after the sort even when by is not a prefix.

FYI this works:

bys id: gegen gtot = total(cat!=cat[_n-1])

PS: Sorts are not stable by default, so the sort order of your data will affect the result. You ought to do something like

bys id (subid): gegen gtot = total(cat!=cat[_n-1])
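The toy data has no subid variable, so the line above cannot be run as is. A minimal sketch of one way to supply it, assuming gtools is installed and that the current row order is the order you want within each id; rowid and subid are hypothetical names introduced here for illustration:

```stata
* Rebuild the toy data from the first comment.
clear
set obs 9
gen id = strofreal(floor((_n + 2) / 3))
gen cat = "none" if id == "1"
replace cat = "one" if id != "1"

* Hypothetical helper variables: record the current row order, then
* number the rows within each id so the within-group order is explicit.
gen long rowid = _n
bysort id (rowid): gen subid = _n

* Variables in parentheses fix the order within groups; the groups
* themselves are still defined by id alone.
bysort id (subid): gegen gtot = total(cat != cat[_n-1])
```

With the by prefix, cat[_n-1] on the first row of each group is the empty string, so each group counts exactly one change in this data.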

@adamreir
Author

I see, makes sense that this problem is not specific to gegen total.

If I've understood gtools correctly, its improvement over egen disappears when gegen is combined with the by prefix (and not the by() option), right?

I.e.

bys id: gegen gtot = total(cat!=cat[_n-1])

Isn't necessarily any faster than using egen?

gsort id
by id: egen gtot = total(cat!=cat[_n-1])

I agree: If it's not easily fixed, then a warning might help other users to be aware of this potential problem.

Never mind the mistake in the toy example. I guess the problem here is what happens when gtools tries to call the _n==0 observation when using by().

(Btw: Thanx for a very good package. gtools gives me several hours of extra coding time each week).

@mcaceresb
Owner

Right; the by prefix eliminates much of the speed gains. It may still be faster in some cases, but it could also be slower in others.

The issue is that gtools computes the expression over the whole data, whereas egen computes it by group; it is not really about the 0th observation. If cat were all the same, gtools would give yet another answer, though egen's would not change.
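The difference can be made visible with built-in commands only, by splitting each call into its two steps. This is a sketch of the behavior described in this thread, not of gtools internals; the variable names changed_all, changed_grp, gtot_like and tot_like are made up for illustration:

```stata
* Rebuild the toy data from the first comment (already ordered by id).
clear
set obs 9
gen id = strofreal(floor((_n + 2) / 3))
gen cat = "none" if id == "1"
replace cat = "one" if id != "1"

* Step 1, whole-data version (what gegen ..., by(id) is described as doing):
* plain gen, so cat[_n-1] reaches back into the previous group.
gen byte changed_all = cat != cat[_n-1]

* Step 1, by-group version (what egen ..., by(id) does):
* cat[_n-1] is "" on the first row of every id group.
bysort id: gen byte changed_grp = cat != cat[_n-1]

* Step 2 is identical in both cases: sum the 0/1 flags within each id.
egen gtot_like = total(changed_all), by(id)
egen tot_like  = total(changed_grp), by(id)

list id cat changed_all changed_grp gtot_like tot_like, sepby(id)
```

In this data, changed_all is 1 only on rows 1 and 4, so the third group sums to 0, while changed_grp is 1 on the first row of every group, which matches the gtot and tot columns reported above.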

@adamreir
Author

adamreir commented Feb 22, 2021

Hm, haven't read enough about gtools to understand exactly what's going on here. But yeah, if I replace the toy example above with

(...)
replace cat="one" //if id!="1"
(...)

I get a data set where cat is always equal to "one". Then gtools produces

id   cat   gtot   tot
1    one   1      1
1    one   1      1
1    one   1      1
2    one   0      1
2    one   0      1
2    one   0      1
3    one   0      1
3    one   0      1
3    one   0      1

Which is another surprising result.

So, what's the moral here? Subscripting with gtools should be used with caution (or not at all)?

@mcaceresb
Owner

mcaceresb commented Feb 23, 2021

@adamreir The lesson is this:

  1. gtools internals are in C and cannot parse Stata syntax. If you are creating variables inside of gegen, they are being created before gtools internals are called.
  2. Therefore you should think of variables created inside gegen functions as equivalent to using gen, because that is what it is doing.
    • If you call by ...: gegen then the variable creation will be equivalent to by ...: gen
    • If you call gegen then the variable creation will be equivalent to simply calling gen.
  3. After variable creation, gtools is called and the function is invoked correctly by group.
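The equivalences in the list above can be checked directly. The sketch below assumes gtools is installed and uses the variant of the toy data in which cat is always "one"; the names flag_all, flag_grp and the g_* variables are hypothetical. If the description in this thread holds, both assertions should pass:

```stata
clear
set obs 9
gen id = strofreal(floor((_n + 2) / 3))
gen cat = "one"

* Form 1: by() option. The expression is created as plain gen would
* create it, then total is applied by group.
gegen g_opt = total(cat != cat[_n-1]), by(id)
gen byte flag_all = cat != cat[_n-1]
gegen g_opt_manual = total(flag_all), by(id)
assert g_opt == g_opt_manual

* Form 2: by prefix. The expression is created as by id: gen would
* create it, then total is applied by group.
bysort id: gegen g_pre = total(cat != cat[_n-1])
bysort id: gen byte flag_grp = cat != cat[_n-1]
gegen g_pre_manual = total(flag_grp), by(id)
assert g_pre == g_pre_manual
```

Creating the flag variable yourself and passing a plain variable to gegen keeps the subscripting step explicit, so it is clear which of the two forms you are getting.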

@adamreir
Author

Aha, now I understand what's going on! Will keep this in mind.

Thanx a lot!

mcaceresb added a commit that referenced this issue Aug 28, 2021
Features

- Added `gglm` to estimate GLM models, including `logit`.

Bug Fixes

- Closes #78: if now passed raw/in double-quotes throughout the pipeline
- Closes #75: gunique returns 0s in r() when there are no obs
- Closes #74: gstats transform parses abbreviated targets
- Closes #72: Warning for gegen expressions without by group
- Fixed GLM issues generating de-meaned variables
- Fixed gegen nunique with multiple inputs
- Fixed bug in `gpoisson` where internal weights not copied correctly in loop.
- Various fixes to the docs.