Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculation of frequencies during ingest #5236

Closed
lubitchv opened this issue Oct 25, 2018 · 15 comments
Closed

Calculation of frequencies during ingest #5236

lubitchv opened this issue Oct 25, 2018 · 15 comments
Assignees

Comments

@lubitchv
Copy link
Contributor

Frequencies are not calculated during upload and ingest of a file into dataverse. As a result export DDI metadata does not have frequency field in catStat in exported xml DDI file. This information is needed for new Data Curation tool. Data Curation Tool is discussed in #4174, #4448, #3604

@djbrooke
Copy link
Contributor

Hi @lubitchv, we should catch up about Scholar's Portal's schedule for the Data Curation Tool development and what the dependencies are. I'll get in touch with @amberleahey to schedule.

@djbrooke
Copy link
Contributor

djbrooke commented Nov 5, 2018

ping @landreev - we should talk about this one. Are we OK with ingest adding frequencies to the DB?

@scolapasta @jggautier @kcondon

@djbrooke
Copy link
Contributor

djbrooke commented Nov 5, 2018

Thanks @lubitchv for the PR with the proposed changes. Moving to code review.

@landreev
Copy link
Contributor

landreev commented Nov 9, 2018

@lubitchv
Hi,
Thanks for the PR! It was an omission on our part - not calculating the frequencies, for the categorical values.
(it's something we used to do in much older versions; but we must have stopped in Dataverse 4.0; and it looks like nobody has had a practical use case of needing that field in the DDI as of late; so nobody else has complained).

Looking at the PR, couple of things:

  1. The method calculateFrequencies() in the DDIExport bean seems to mostly duplicate the code in IngestServiceBean.produceFrequencyStatistics()... The former should probably only download the file, and then just call the latter on it, instead? - just a straightforward refactoring, that'll save us keeping these few lines of the same code in 2 places.

  2. This one may be a bit less straightforward; and we can discuss if we want to address it right away. As implemented now, when you count the distinct values, you rely on a Hashtable<String, Double> and go through the tabular column as a vector of strings... This is definitely the right thing to do for strings (dataVariable.isTypeCharacter()). But we should probably be more careful for numeric variables. Meaning, if, for example, the variable is a float (dataVariable.isTypeNumeric() && "float".equals(dataVariable.getFormat()), we probably don't want to count "0" and "0.0" as 2 different values - ?
    In practical terms, all the currently implemented ingest plugins guarantee that the numbers in the tabular files that we produce are formatted uniformly. So this should not be a an immediate problem in real life for your Dataverse... But I don't think we should assume that this will always be the case. We may soon have a use case of working with tabular files produced outside Dataverse, by some other tool where this cannot be guaranteed. And we may have some legacy tabular files here in our Dataverse at IQSS where this could be a problem too.

So a really clean implementation should probably include separate cases for strings, floats, doubles and integers - just like elsewhere in IngestService, where we calculate summary stats - and maybe use Hashtable<Object, Double> to count.

@landreev
Copy link
Contributor

landreev commented Nov 9, 2018

oh, and there may be
3) - we may need a couple of lines of extra logic for counting missing values too (as in, if (variableVector[j] == null)?). But I'll double-check on that and confirm.

@landreev
Copy link
Contributor

landreev commented Nov 9, 2018

BTW, one thing I'm a little confused about: why is VariableCategory.frequency a float, and not an integer?? That class file is very old, legacy code... I'm guessing maybe the original intent back in the day was to actually store frequencies, and not the numbers of occurrences? (as in, number of occurrences/total number of cases)
What you are doing is definitely correct; we want to store the number of occurrences in that field. But in case you were wondering about it being a float - I'm just as puzzled about it.

@lubitchv
Copy link
Contributor Author

lubitchv commented Nov 9, 2018

Hi,
@landreev
Thank you for the notes.

  1. You are right, there is a duplication of code in ingest bean and DDI Export, so I refactored that part of code as you advised. Now in Export DDI the function produceFrequencies is called from the ingest.

  2. It is quite valid concern. I need to think how to implement it, I think you are right and it should be implemented with different types of objects (int, float, char)

Regarding frequencies as Double, it is indeed confusing. My guess, frequency is Double in dataverse because of weighted frequencies. There is a flag that says if variable is weighted. Weighted frequencies are real numbers. There is a problem here, since we will need both frequencies and weighted frequencies. The weighted frequency field does not exist at the moment in dataverse. Data Curation Tool will allow users to weight variables and calculate weighted frequencies. Then an api should write it back to dataverse database. But anyway, weighted frequencies are for another issue, latter.

@landreev
Copy link
Contributor

landreev commented Nov 9, 2018

@lubitchv
Thanks for the quick reply;
Yes, let's revisit the issue 2) next week. I realize that the example I gave, of a potential issue was probably not the best one. At the same time, the more I think about it, I'm not really positive if this is something we even need to address at this point.

@lubitchv
Copy link
Contributor Author

@landreev
I talked to @amberleahey about catValu and it seems that catValu behaves more like a label (string). In this case "0" a "0.0" may mean two different things.

@landreev
Copy link
Contributor

@lubitchv
Hi Victoria,
I'm not sure I got that... I'll write more on this tomorrow.
(today was a holiday in this country).

@landreev
Copy link
Contributor

landreev commented Nov 15, 2018

@lubitchv
Hi,
We store catValues in the database as strings; but for numeric vectors, these strings are representations of the specific numeric values. Neither SPSS nor Stata allows to create more than one category label for the same numeric value. So it should never be the case in Dataverse either, to have a numeric floating point variable, and have categories for both "1" and "1.0" - as these strings represent the same floating point value.

The way I first described this problem was: what if we have a tab file where the same value is represented differently on different lines... like "0" and "0.0". On a second thought, let's not even worry about this use case. The tab files that our ingest produces should have all the numeric values formatted the same way. If we discover a tab file where this is not the case - we'll fix it.

What we should worry about is the case where the formatting of the numeric value, as it appears in the tab file, is different from the way it's represented in the category value, as stored in the database (this is VariableCategory.getValue()). Because our ingest does not actually guarantee that. Does that make sense?

A normal example of a variable category: A researcher has a numeric, integer column in their data file. She assigns categorical labels: "Urban" to 0 and "Rural" to 1. In this, trivial example, when this data file is ingested into Dataverse, "0" and "1" are saved as cat values, and "Urban" and "Rural" - as category labels. And the tabular file has a column of "0"s and "1"s. The current code in the PR works just fine for this example. When you open the tab file and subset this variable column, as a vector of Strings, you get back an array of "0"s and "1"s, you compare them to the .getValue() strings and you get the correct counts of the values...

But here's another example: In Stata you can assign value labels to integers, and also to floats (but only to values with zero fractional part). So I created this Stata file (attached; I had to change the extension .dta to .txt, in order for GitHub to accept it), with the single variable column of type float, made of values 0.0 and 1.0. I assigned the labels "Urban" as "Rural", as above. When ingested into Dataverse, the entries in the tab file are formatted as floats:
0.0
1.0
0.0
...
But the categorical values are still saved in the database as "0" and "1". (exported as ddi:

<catgry><catValu>0</catValu><labl level="category">Urban</labl>...</catgry> ...

) This is because Stata treats all these categorical values as integers internally; and because we never thought it was important to format the values exactly the same way as the entries in the tab file... So the code currently in the PR will produce fequency = 0 for each of the 2 categories.
We can of course rewrite the ingest plugins to format the cat values according to the variable type (and then go through the existing databases and re-format all the ingested cat values...) But it should be way easier to just assume that the formatting may be mismatched. And then, when counting the frequency for this variable, convert the values to floats (new Float(variableCategory.getValue()?); then use the TabularSubsetGenerator.subsetFloatVector() method to extract the variable values from the tab file - and compare the values as floats, not as strings.
(I'm pretty sure there are similar use cases with SPSS ingest too...)

test_cat_values.txt

@lubitchv
Copy link
Contributor Author

@landreev
Thank you, Leonid. I see the issue now. I will try to fix it in the way you have described.

@lubitchv
Copy link
Contributor Author

@landreev
Leonid, in case of dataVariable.isTypeCharacter() should we trim spaces in a vector that one gets from a Tabular file? Potentially, since the number of characters in a variable is fixed, spss, for example, will add blank spaces to character data, then catValues may mismatch the data if user is not careful enough creating catvalues and catlabels. On the other hand by trimming the data we are not comparing original data.

@lubitchv
Copy link
Contributor Author

@landreev
Leonid, I introduced the changes that you have asked for (separate treatment for string and numerical category values). I also added in unit testing additional testing for stata (numeric, since as I understand stata does not allow character category values) and spss (numeric and character category values).

@landreev
Copy link
Contributor

@lubitchv
Great, thank you.
I'm looking at it now, will move it along, but QA and merging will have to wait till next week; since most of the team are already out for (the U.S.) Thanksgiving.

@landreev landreev removed their assignment Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants