Calculation of frequencies during ingest #5236

lubitchv · 2018-10-25T18:41:09Z

Frequencies are not calculated during upload and ingest of a file into dataverse. As a result export DDI metadata does not have frequency field in catStat in exported xml DDI file. This information is needed for new Data Curation tool. Data Curation Tool is discussed in #4174, #4448, #3604

djbrooke · 2018-10-26T13:37:38Z

Hi @lubitchv, we should catch up about Scholar's Portal's schedule for the Data Curation Tool development and what the dependencies are. I'll get in touch with @amberleahey to schedule.

djbrooke · 2018-11-05T20:28:55Z

ping @landreev - we should talk about this one. Are we OK with ingest adding frequencies to the DB?

@scolapasta @jggautier @kcondon

djbrooke · 2018-11-05T22:31:53Z

Thanks @lubitchv for the PR with the proposed changes. Moving to code review.

landreev · 2018-11-09T17:34:17Z

@lubitchv
Hi,
Thanks for the PR! It was an omission on our part - not calculating the frequencies, for the categorical values.
(it's something we used to do in much older versions; but we must have stopped in Dataverse 4.0; and it looks like nobody has had a practical use case of needing that field in the DDI as of late; so nobody else has complained).

Looking at the PR, couple of things:

The method calculateFrequencies() in the DDIExport bean seems to mostly duplicate the code in IngestServiceBean.produceFrequencyStatistics()... The former should probably only download the file, and then just call the latter on it, instead? - just a straightforward refactoring, that'll save us keeping these few lines of the same code in 2 places.
This one may be a bit less straightforward; and we can discuss if we want to address it right away. As implemented now, when you count the distinct values, you rely on a Hashtable<String, Double> and go through the tabular column as a vector of strings... This is definitely the right thing to do for strings (dataVariable.isTypeCharacter()). But we should probably be more careful for numeric variables. Meaning, if, for example, the variable is a float (dataVariable.isTypeNumeric() && "float".equals(dataVariable.getFormat()), we probably don't want to count "0" and "0.0" as 2 different values - ?
In practical terms, all the currently implemented ingest plugins guarantee that the numbers in the tabular files that we produce are formatted uniformly. So this should not be a an immediate problem in real life for your Dataverse... But I don't think we should assume that this will always be the case. We may soon have a use case of working with tabular files produced outside Dataverse, by some other tool where this cannot be guaranteed. And we may have some legacy tabular files here in our Dataverse at IQSS where this could be a problem too.

So a really clean implementation should probably include separate cases for strings, floats, doubles and integers - just like elsewhere in IngestService, where we calculate summary stats - and maybe use Hashtable<Object, Double> to count.

landreev · 2018-11-09T17:40:16Z

oh, and there may be
3) - we may need a couple of lines of extra logic for counting missing values too (as in, if (variableVector[j] == null)?). But I'll double-check on that and confirm.

landreev · 2018-11-09T17:47:25Z

BTW, one thing I'm a little confused about: why is VariableCategory.frequency a float, and not an integer?? That class file is very old, legacy code... I'm guessing maybe the original intent back in the day was to actually store frequencies, and not the numbers of occurrences? (as in, number of occurrences/total number of cases)
What you are doing is definitely correct; we want to store the number of occurrences in that field. But in case you were wondering about it being a float - I'm just as puzzled about it.

lubitchv · 2018-11-09T20:35:43Z

Hi,
@landreev
Thank you for the notes.

You are right, there is a duplication of code in ingest bean and DDI Export, so I refactored that part of code as you advised. Now in Export DDI the function produceFrequencies is called from the ingest.
It is quite valid concern. I need to think how to implement it, I think you are right and it should be implemented with different types of objects (int, float, char)

Regarding frequencies as Double, it is indeed confusing. My guess, frequency is Double in dataverse because of weighted frequencies. There is a flag that says if variable is weighted. Weighted frequencies are real numbers. There is a problem here, since we will need both frequencies and weighted frequencies. The weighted frequency field does not exist at the moment in dataverse. Data Curation Tool will allow users to weight variables and calculate weighted frequencies. Then an api should write it back to dataverse database. But anyway, weighted frequencies are for another issue, latter.

landreev · 2018-11-09T21:41:32Z

@lubitchv
Thanks for the quick reply;
Yes, let's revisit the issue 2) next week. I realize that the example I gave, of a potential issue was probably not the best one. At the same time, the more I think about it, I'm not really positive if this is something we even need to address at this point.

lubitchv · 2018-11-12T16:21:59Z

@landreev
I talked to @amberleahey about catValu and it seems that catValu behaves more like a label (string). In this case "0" a "0.0" may mean two different things.

landreev · 2018-11-13T04:22:37Z

@lubitchv
Hi Victoria,
I'm not sure I got that... I'll write more on this tomorrow.
(today was a holiday in this country).

landreev · 2018-11-15T02:35:37Z

@lubitchv
Hi,
We store catValues in the database as strings; but for numeric vectors, these strings are representations of the specific numeric values. Neither SPSS nor Stata allows to create more than one category label for the same numeric value. So it should never be the case in Dataverse either, to have a numeric floating point variable, and have categories for both "1" and "1.0" - as these strings represent the same floating point value.

The way I first described this problem was: what if we have a tab file where the same value is represented differently on different lines... like "0" and "0.0". On a second thought, let's not even worry about this use case. The tab files that our ingest produces should have all the numeric values formatted the same way. If we discover a tab file where this is not the case - we'll fix it.

What we should worry about is the case where the formatting of the numeric value, as it appears in the tab file, is different from the way it's represented in the category value, as stored in the database (this is VariableCategory.getValue()). Because our ingest does not actually guarantee that. Does that make sense?

A normal example of a variable category: A researcher has a numeric, integer column in their data file. She assigns categorical labels: "Urban" to 0 and "Rural" to 1. In this, trivial example, when this data file is ingested into Dataverse, "0" and "1" are saved as cat values, and "Urban" and "Rural" - as category labels. And the tabular file has a column of "0"s and "1"s. The current code in the PR works just fine for this example. When you open the tab file and subset this variable column, as a vector of Strings, you get back an array of "0"s and "1"s, you compare them to the .getValue() strings and you get the correct counts of the values...

But here's another example: In Stata you can assign value labels to integers, and also to floats (but only to values with zero fractional part). So I created this Stata file (attached; I had to change the extension .dta to .txt, in order for GitHub to accept it), with the single variable column of type float, made of values 0.0 and 1.0. I assigned the labels "Urban" as "Rural", as above. When ingested into Dataverse, the entries in the tab file are formatted as floats:
0.0
1.0
0.0
...
But the categorical values are still saved in the database as "0" and "1". (exported as ddi:

<catgry><catValu>0</catValu><labl level="category">Urban</labl>...</catgry> ...

) This is because Stata treats all these categorical values as integers internally; and because we never thought it was important to format the values exactly the same way as the entries in the tab file... So the code currently in the PR will produce fequency = 0 for each of the 2 categories.
We can of course rewrite the ingest plugins to format the cat values according to the variable type (and then go through the existing databases and re-format all the ingested cat values...) But it should be way easier to just assume that the formatting may be mismatched. And then, when counting the frequency for this variable, convert the values to floats (new Float(variableCategory.getValue()?); then use the TabularSubsetGenerator.subsetFloatVector() method to extract the variable values from the tab file - and compare the values as floats, not as strings.
(I'm pretty sure there are similar use cases with SPSS ingest too...)

test_cat_values.txt

lubitchv · 2018-11-15T19:17:07Z

@landreev
Thank you, Leonid. I see the issue now. I will try to fix it in the way you have described.

lubitchv · 2018-11-16T17:26:07Z

@landreev
Leonid, in case of dataVariable.isTypeCharacter() should we trim spaces in a vector that one gets from a Tabular file? Potentially, since the number of characters in a variable is fixed, spss, for example, will add blank spaces to character data, then catValues may mismatch the data if user is not careful enough creating catvalues and catlabels. On the other hand by trimming the data we are not comparing original data.

lubitchv · 2018-11-20T17:37:22Z

@landreev
Leonid, I introduced the changes that you have asked for (separate treatment for string and numerical category values). I also added in unit testing additional testing for stata (numeric, since as I understand stata does not allow character category values) and spss (numeric and character category values).

landreev · 2018-11-21T01:01:39Z

@lubitchv
Great, thank you.
I'm looking at it now, will move it along, but QA and merging will have to wait till next week; since most of the team are already out for (the U.S.) Thanksgiving.

lubitchv mentioned this issue Nov 5, 2018

5236 frequency ingest #5278

Merged

5 tasks

djbrooke added Status: Community Dev and removed Status: Community Dev labels Nov 5, 2018

djbrooke assigned landreev Nov 6, 2018

landreev added Status: Community Dev and removed Status: Code Review labels Nov 9, 2018

pdurbin added Status: Code Review and removed Status: Community Dev labels Nov 20, 2018

pdurbin unassigned landreev Nov 20, 2018

landreev self-assigned this Nov 21, 2018

landreev added Status: QA and removed Status: Code Review labels Nov 24, 2018

landreev removed their assignment Nov 24, 2018

landreev assigned kcondon Nov 24, 2018

kcondon closed this as completed in d221385 Nov 29, 2018

kcondon removed the Status: QA label Nov 29, 2018

djbrooke added this to the 4.10 - Additional Data Transfer Options milestone Dec 11, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Calculation of frequencies during ingest #5236

Calculation of frequencies during ingest #5236

lubitchv commented Oct 25, 2018

djbrooke commented Oct 26, 2018

djbrooke commented Nov 5, 2018

djbrooke commented Nov 5, 2018

landreev commented Nov 9, 2018

landreev commented Nov 9, 2018

landreev commented Nov 9, 2018

lubitchv commented Nov 9, 2018

landreev commented Nov 9, 2018

lubitchv commented Nov 12, 2018

landreev commented Nov 13, 2018

landreev commented Nov 15, 2018 •

edited

lubitchv commented Nov 15, 2018

lubitchv commented Nov 16, 2018

lubitchv commented Nov 20, 2018

landreev commented Nov 21, 2018

Calculation of frequencies during ingest #5236

Calculation of frequencies during ingest #5236

Comments

lubitchv commented Oct 25, 2018

djbrooke commented Oct 26, 2018

djbrooke commented Nov 5, 2018

djbrooke commented Nov 5, 2018

landreev commented Nov 9, 2018

landreev commented Nov 9, 2018

landreev commented Nov 9, 2018

lubitchv commented Nov 9, 2018

landreev commented Nov 9, 2018

lubitchv commented Nov 12, 2018

landreev commented Nov 13, 2018

landreev commented Nov 15, 2018 • edited

lubitchv commented Nov 15, 2018

lubitchv commented Nov 16, 2018

lubitchv commented Nov 20, 2018

landreev commented Nov 21, 2018

landreev commented Nov 15, 2018 •

edited