Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gtools does not group strL variables correctly #39

Closed
ractingmatrix opened this issue May 2, 2018 · 6 comments
Closed

gtools does not group strL variables correctly #39

ractingmatrix opened this issue May 2, 2018 · 6 comments
Assignees

Comments

@ractingmatrix
Copy link

Hi, when doing gcollapse by groups with some group being data type of "strL", the results seem to be problematic. Reproduced as follows:


clear *
set obs 4
g strL name = strofreal(round(_n,2)) 
g id = round(_n,2)
g value = _n
gcollapse (max) value = value , by(name id)
@mcaceresb
Copy link
Owner

Actually, this is a limitation of the stata plugin interface version I am using. I wrote the plugin using version 2, which makes no mention of strL variables being an issue (here). However, version 3 notes that the API method to access string variables only works for non-strL variables, and they added special functions to deal with strL variables (here).

First of all, I have to throw an error if the variable type is strL. I don't think version 2.0 of the plugin can deal with it correctly. The error will note:

  • strL not supported by the Stata C API 2.0
  • Try compress varname

That's done (will merge to master soon). I now need to compile a separate version of the plugin for Stata 14 and above, but I don't own Stata 14.I might be able to do it through servers or my library's computers, so It'll take me a bit longer to fix. For now the function just throws an error for strL variables.

mcaceresb added a commit that referenced this issue May 2, 2018
Pending

- For now, throws error for strL variable: #39

Plugin version 2.0 makes no mention of strL variables; however, plugin
version 3.0 notes that they are actually not supported by the macros
that I use to access strings. I think I'll have to compile a sepparate
version of the plugin for Stata 13 and Stata 14 and above. The Stata
13 version will throw an error for strL variables. This is currently
the only version until I figure out where to test Stata 14 and how to
implement this switch.

Bug fixes

- Fixes #38

When the quantile requested was N - 1 out of an array of length N, I
makde an exception so the function picked up the largest value, instead
of passing N - 1 to the selection algorithm. However, I made a mistake
and quantiles between 100 - 150 / N and 100 - 100 / N, left-inclusive
right-exclusive, would give the wrong quantile.
@mcaceresb mcaceresb self-assigned this May 2, 2018
@mcaceresb mcaceresb changed the title issue in gcollapse gtools does not group strL variables correctly May 2, 2018
mcaceresb added a commit that referenced this issue May 2, 2018
Pending

- For now, throws error for strL variable; see
  #39

Plugin version 2.0 makes no mention of strL variables; however, plugin
version 3.0 notes that they are actually not supported by the macros
that I use to access strings. I think I'll have to compile a sepparate
version of the plugin for Stata 13 and Stata 14 and above. The Stata
13 version will throw an error for strL variables. This is currently
the only version until I figure out where to test Stata 14 and how to
implement this switch.

Bug fixes

- Fixes #38

When the quantile requested was N - 1 out of an array of length N, I
makde an exception so the function picked up the largest value, instead
of passing N - 1 to the selection algorithm. However, I made a mistake
and quantiles between 100 - 150 / N and 100 - 100 / N, left-inclusive
right-exclusive, would give the wrong quantile.
@mcaceresb
Copy link
Owner

Actually, as I'm getting around to this I've noticed that I can't write to a strL variable, even with the 3.0 interface. I think I can implement this for most functions anyway, but, crucially, not for gcollapse (or gcontract, for that matter).

@sergiocorreia
Copy link

Would it even make sense to collapse a strL? One alternative is to just recast it as a str#, and if not possible (max length is too long, to just raise an error)

@mcaceresb
Copy link
Owner

mcaceresb commented Jul 18, 2018

This is basically what happens at the moment. Though I don't recast as str# (I'm not entirely sure I should; though I suppose I could...), I just suggest the user try that.

It's just annoying that I can't entirely support strL variables because of a plugin limitation.

@mcaceresb
Copy link
Owner

I guess I could add an option, force or something, to treat strL as str# wherever possible...seems like the better compromise.

@mcaceresb
Copy link
Owner

This turned out to be way more complicated than I thought because I have to make a special case for when strL contains binary data; in that case all the string functions I use will give the wrong answer. I'd have to create a special array that contains the number of bytes in each entry, and change strcmp to memcmp and sprintf to snprintf for binary data.

At any rate, I will implement support for very long strings in version 0.14 but not cases when strL contains binary data (at least not yet). This will throw an error.

mcaceresb added a commit that referenced this issue Jul 21, 2018
Features

* Adds `gduplicates` as a replacement of `duplicates`. This is
  basically a wrapper for `gegen tag` and `gegen count`.
* `fasterxtile` and `gquantiles` now accept weights (including `by()`)
* Note that stata might not handle weights accurately in pctile
  and xtile (or, at the very least, it does not seem to follow its
  documented formulas in all cases).
* `strL` partial support (long strings only).
* Option `compress` tries to recast strL variables as str#
* Option `forcestrl` ignores binary values check (use with caution).
* Fixes #31
  (added option `mlast` to `hashsort` to imitate `gsort` when sorting
  numbers in inverse order and they contain missing values).

Bug Fixes

* `semean` returns missing with fewer than 2 observations.
* If `strL` contains binary data `gtools` functions now throw an error.
* `strL` missing values now read correctly.
* Added `strL`-specific tests
* Closes #39 `strL` partial
  support.
* Closes #41 `wild` with
  existing variables gets a warning.
* Fixes #42 `gunique` typo.

Enhancements

* `_gtools_internals.ado` exits when `GTOOLS_CALLER` is empty.

Tests

* All passing in Stata 13 in Linux, Windows
* All passing in Stata 15 in Linux, Windows
* All passing in Stata 14 in OSX
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants