Sort group variables internally in C #6

Closed
mcaceresb opened this issue Jul 27, 2017 · 0 comments

A major enhancement to the plugin would be to sort the groups internally in C. This would afford a major speed-up when processing a large number of groups (vs sorting in Stata) and allow gegen to be used as an adequate replacement for egen.

In particular, one issue raised in #4 is that gegen does not produce IDs in the order in which the groups would be sorted. For instance,

sysuse auto, clear
egen id1 = group(turn trunk)
fegen id2 = group(turn trunk)
gegen id3 = group(turn trunk)

assert id1==id2
assert id1==id3

Instead, gegen produces IDs in the order the groups appear. While this is by design, it is not the behavior of egen. Sorting groups internally would allow solving this issue as well.
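A minimal sketch of what sorted group IDs could look like in C, assuming two integer group variables. The names `obs_t`, `cmp_obs`, and `sorted_group_ids` are hypothetical and are not taken from the plugin's source; this only illustrates the idea of sorting first and numbering groups afterwards.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: two integer group variables plus the original row. */
typedef struct {
    int turn;
    int trunk;
    size_t row;
} obs_t;

/* Lexicographic order on (turn, trunk); ties are broken by row so the
 * result is deterministic even though qsort is not stable. */
static int cmp_obs(const void *a, const void *b)
{
    const obs_t *x = a;
    const obs_t *y = b;
    if (x->turn != y->turn)   return (x->turn > y->turn) - (x->turn < y->turn);
    if (x->trunk != y->trunk) return (x->trunk > y->trunk) - (x->trunk < y->trunk);
    return (x->row > y->row) - (x->row < y->row);
}

/* Writes the 1-based group ID of row i into ids[i], numbering groups in
 * sorted order. Returns the number of groups, or 0 if n is 0 or the
 * allocation fails. */
size_t sorted_group_ids(const int *turn, const int *trunk, size_t n, size_t *ids)
{
    if (n == 0) return 0;
    assert(turn != NULL && trunk != NULL && ids != NULL);

    obs_t *obs = malloc(n * sizeof *obs);
    if (obs == NULL) return 0;

    for (size_t i = 0; i < n; i++) {
        obs[i] = (obs_t){ turn[i], trunk[i], i };
    }
    qsort(obs, n, sizeof *obs, cmp_obs);

    /* Walk the sorted rows; a new ID starts whenever the key changes. */
    size_t id = 1;
    ids[obs[0].row] = id;
    for (size_t i = 1; i < n; i++) {
        if (obs[i].turn != obs[i - 1].turn || obs[i].trunk != obs[i - 1].trunk) {
            id++;
        }
        ids[obs[i].row] = id;
    }

    free(obs);
    return id;
}
```

With rows (40, 11), (35, 12), (40, 5), (35, 12), this assigns IDs 3, 1, 2, 1, whereas numbering by order of appearance would give 1, 2, 3, 2.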

mcaceresb self-assigned this Jul 27, 2017
mcaceresb changed the title from "Sort group variables in ternally in C" to "Sort group variables internally in C" Jul 27, 2017
mcaceresb added a commit that referenced this issue Sep 29, 2017
Features

* `gisid` is added as a working replacement for `isid` and `isid, missok`.
  `gisid` takes `if` and `in` statements; however, it does not implement
  `isid, sort` or `isid using`.
* `glevelsof` is added as a working replacement for `levelsof`.
  All `levelsof` features are available.
* Temporary variable no longer created for `egen, tag` or `egen, group`
* Fixes #6
    * Variables are sorted internally for `egen, group`, which matches `egen`.
    * Variables are sorted internally for `gcollapse`, which is faster.
* Various internal enhancements:
    * The hash is validated faster
    * Hash validation is also used to read in group variables
    * Integer bijection now sorts by the integers correctly,
      obviating the need for a second sort.
    * No need to validate the hash with integer bijection.
    * The memory usage is marginally leaner.
    * Reorganized all the files, making the code-base easier to maintain.
* Various commented internal code deleted.
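
The note on integer bijection can be illustrated with a mixed-radix key. This is a sketch under assumptions: the commit message does not describe the plugin's actual mapping, and `bijection_key` is a hypothetical name. Treating the first variable as the most significant digit makes the numeric order of the keys equal to the lexicographic order of the groups, which is why a second sort would be unnecessary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: maps k integer variables, each known to lie in
 * [mins[j], maxs[j]], to a single unsigned key. The first variable is
 * the most significant digit, so comparing keys compares the tuples
 * lexicographically.
 * Returns 0 on success, or -1 if a value is out of range or the key
 * for this tuple would not fit in 64 bits. */
int bijection_key(const int64_t *vals, const int64_t *mins,
                  const int64_t *maxs, size_t k, uint64_t *key)
{
    uint64_t out = 0;

    for (size_t j = 0; j < k; j++) {
        if (vals[j] < mins[j] || vals[j] > maxs[j]) return -1;

        /* Number of distinct values variable j can take. */
        uint64_t range = (uint64_t) maxs[j] - (uint64_t) mins[j] + 1;
        if (range == 0) return -1; /* range spans all 2^64 values */

        uint64_t offset = (uint64_t) vals[j] - (uint64_t) mins[j];

        /* Reject if out * range + offset would overflow. */
        if (out > (UINT64_MAX - offset) / range) return -1;
        out = out * range + offset;
    }

    *key = out;
    return 0;
}
```

For example, with assumed ranges [31, 51] and [5, 23], the tuples (35, 12), (40, 5), and (40, 11) map to 83, 171, and 177, which is the same order as sorting the tuples directly.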

Enhancements

* Fixes #13 so
  `gcollapse` maintains source formats on targets.
* Improved internal handling of if conditions for `egen`.
* `egen` now only processes observations in range for `id, group`
* `egen, group` now marginally faster when all vars are integers

Bug fixes

* Prior versions de-facto used a 64-bit hash instead of a 128-bit hash.
  The new version should use the 128-bit hash correctly.
* Prior versions would fail if there was only 1 observation.
* Fixes #15
  which was introduced trying to fix
  #15

Backwards-incompatible

* `gcollapse, unsorted` no longer supported (due to internal sorting)