4676 mixed labels in r #4814

Merged
merged 28 commits into from
Aug 1, 2018
Commits
a7cbdef
cleaned up RemoteDataFrameService.java
oscardssmith Jun 25, 2018
bdf593d
some more RemoteDataFrameService cleanups
oscardssmith Jun 26, 2018
51d8967
more minor cleanups
oscardssmith Jun 27, 2018
3b20175
gave DataVariable lables and a constructor that sets up stuff that ev…
oscardssmith Jun 28, 2018
0f87e24
add lables to ejb
oscardssmith Jun 29, 2018
29fdd23
Merge branch '2301-stata' into 4676-mixed-labels-in-R
oscardssmith Jun 29, 2018
e53e775
updated spss, stata readers to set labled data as labled instead of c…
oscardssmith Jun 29, 2018
066333a
set variable as labled once instead of once per variableCategory
oscardssmith Jun 29, 2018
2cb8e1e
added empty constructor to placate glassfish
oscardssmith Jun 29, 2018
3249bb9
Merge branch 'develop' into 4676-DataConverter-cleenup
oscardssmith Jun 29, 2018
403a184
merged my branch of niceness improvements
oscardssmith Jun 29, 2018
d6ff4fc
halfway done with changes
oscardssmith Jul 2, 2018
d6588f3
use Haven to convert from Stata/SPSS to R
oscardssmith Jul 2, 2018
3d47ad0
more stuff
oscardssmith Jul 3, 2018
f5d2b63
Uses Haven to export stata, spss to R
oscardssmith Jul 5, 2018
eec68d7
undo accidental change to dataverse_r_functions.R
oscardssmith Jul 5, 2018
ff2e041
removed useless line
oscardssmith Jul 5, 2018
0f63eb6
Merge branch 'develop' into 4676-mixed-labels-in-R
oscardssmith Jul 10, 2018
fa1f34c
Merge branch 'develop' into 4676-mixed-labels-in-R
oscardssmith Jul 10, 2018
a65427f
Added the db migration script for the "islabled" field in the datavar…
landreev Jul 13, 2018
835ad49
first update pass on guides
oscardssmith Jul 16, 2018
5771979
more docs changes
oscardssmith Jul 16, 2018
823ebd2
Merge branch 'develop' into 4676-mixed-labels-in-R
oscardssmith Jul 17, 2018
c3246a7
updated docs with more info on tabular data download options
oscardssmith Jul 18, 2018
03c8051
clean up sql update (now a boolean called "factor") #4676
pdurbin Jul 23, 2018
39a5a63
Merge branch 'develop' into 4676-mixed-labels-in-R #4676
pdurbin Jul 23, 2018
a34034f
Fixed the newlines (removed the DOS-style "\r"s) in the dataverse_r_f…
landreev Jul 31, 2018
42965b8
Fixed DOS newlines in the preprocess.R file too. (#4676)
landreev Aug 1, 2018
19 changes: 17 additions & 2 deletions doc/sphinx-guides/source/user/find-use-data.rst
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@ Finding and Using Data
+++++++++++++++++++++++

.. contents:: |toctitle|
:local:
:local:

Finding Data
============
@@ -31,7 +31,6 @@ Other basic search features:
- Sorting results: search results can be sorted by name (A-Z or Z-A), by date (newest or oldest), or by relevancy of results. The sort button can be found above the search results, in the top right.
- Bookmarkable URLs: search URLs can be copied and sent to a fellow researcher, or can be bookmarked for future sessions.


Advanced Search
---------------

@@ -81,6 +80,22 @@ You may also download a file from its file page by clicking the Download button

Tabular data files offer additional options: You can explore using the TwoRavens data visualization tool (or other :doc:`/installation/external-tools` if they have been enabled) by clicking the Explore button, or choose from a number of tabular-data-specific download options available as a dropdown under the Download button.


Tabular Data
------------

Ingested files can be downloaded in several different ways.

- The default option is to download a tab-separated-value file, which is a free standard that is easy to use.

- The original file, which may be in a proprietary format that requires special software.

- RData format, if the installation has configured this.

- The variable metadata for the file, in DDI format.

- A subset of the columns of the data.

.. _rsync_download:

Downloading a Dataverse Package via rsync
16 changes: 8 additions & 8 deletions doc/sphinx-guides/source/user/tabulardataingest/rdata.rst
@@ -4,7 +4,7 @@ R Data Format
Support for R (.RData) files has been introduced in DVN 3.5.

.. contents:: |toctitle|
:local:
:local:

Overview.
===========
@@ -43,7 +43,7 @@ a missing value, as stored in a TAB file, is an empty string, an not "NA" as in

In addition to Missing Values, R recognizes "Not a Value" (NaN) and
positive and negative infinity for floating point variables. These
-are now properly supported by the DVN.
+are now properly supported by the Dataverse.

Also note that, unlike Stata, which does recognize "float" and "double"
as distinct data types, all floating point values in R are in fact
@@ -52,12 +52,12 @@ doubles.
R Factors
---------

-These are ingested as "Categorical Values" in the DVN.
+These are ingested as "Categorical Values" in the Dataverse.

One thing to keep in mind: in both Stata and SPSS, the actual value of
a categorical variable can be both character and numeric. In R, all
factor values are strings, even if they are string representations of
-numbers. So the values of the resulting categoricals in the DVN will
+numbers. So the values of the resulting categoricals in the Dataverse will
always be of string type too.

Another thing to note is that R factors have no builtin support for
@@ -88,7 +88,7 @@ Limitations of R, as compared to SPSS and STATA.
------------------------------------------------

Most noticeably, R lacks a standard mechanism for defining descriptive
-labels for the data frame variables. In the DVN, similarly to
+labels for the data frame variables. In the Dataverse, similarly to
both Stata and SPSS, variables have distinct names and labels; with
the latter reserved for longer, descriptive text.
With variables ingested from R data frames the variable name will be
@@ -103,7 +103,7 @@ Similarly, R categorical values (factors) lack descriptive labels too.
**Note:** This is potentially confusing, since R factors do
actually have "labels". This is a matter of terminology - an R
factor's label is in fact the same thing as the "value" of a
-categorical variable in SPSS or Stata and DVN; it contains the actual
+categorical variable in SPSS or Stata and Dataverse; it contains the actual
meaningful data for the given observation. It is NOT a field reserved
for explanatory, human-readable text, such as the case with the
SPSS/Stata "label".
@@ -174,13 +174,13 @@ discussed in depth on R-related forums, and no attempt is made to
summarize it all in any depth here; this is just to make you aware of
this being a potentially complex issue!)

-An important thing to keep in mind, in connection with the DVN ingest
+An important thing to keep in mind, in connection with the Dataverse ingest
of R files, is that it will **reject** an R data file with any time
values that have time zones that we can't recognize. This is done in
order to avoid (some) of the potential issues outlined above.

It is also recommended that any vectors containing time values
-ingested into the DVN are reviewed, and the resulting entries in the
+ingested into the Dataverse are reviewed, and the resulting entries in the
TAB files are compared against the original values in the R data
frame, to make sure they have been ingested as expected.

1 change: 1 addition & 0 deletions scripts/database/upgrades/upgrade_v4.9.1_to_v4.9.2.sql
@@ -0,0 +1 @@
ALTER TABLE datavariable ADD COLUMN factor BOOLEAN;
176 changes: 94 additions & 82 deletions src/main/java/edu/harvard/iq/dataverse/dataaccess/DataConverter.java
@@ -47,6 +47,9 @@
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Level;



@@ -99,39 +102,7 @@ public static StorageIO<DataFile> performFormatConversion(DataFile file, Storage
// If not cached, run the conversion:
if (convertedFileStream == null) {

File tabFile = null;

boolean tempFilesRequired = false;

try {
Path tabFilePath = storageIO.getFileSystemPath();
tabFile = tabFilePath.toFile();
} catch (UnsupportedDataAccessOperationException uoex) {
// this means there is no direct filesystem path for this object; it's ok!
logger.fine("Could not open source file as a local Path - will go the temp file route.");
tempFilesRequired = true;
} catch (IOException ioex) {
// this is likely a fatal condition, as in, the file is unaccessible:
return null;
}

if (tempFilesRequired) {
ReadableByteChannel tabFileChannel = null;
try {
logger.fine("opening datafFileIO for the source tabular file...");
storageIO.open();
tabFileChannel = storageIO.getReadChannel();

FileChannel tempFileChannel;
tabFile = File.createTempFile("tempTabFile", ".tmp");
tempFileChannel = new FileOutputStream(tabFile).getChannel();
tempFileChannel.transferFrom(tabFileChannel, 0, storageIO.getSize());
} catch (IOException ioex) {
logger.warning("caught IOException trying to store tabular file " + storageIO.getDataFile().getStorageIdentifier() + " as a temp file.");

return null;
}
}
File tabFile = downloadFromStorageIO(storageIO);

if (tabFile == null) {
return null;
@@ -187,23 +158,50 @@ public static StorageIO<DataFile> performFormatConversion(DataFile file, Storage
}

return null;
} // end of performformatconversion();
}

private static File downloadFromStorageIO(StorageIO<DataFile> storageIO) {
if (storageIO.isLocalFile()){
try {
Path tabFilePath = storageIO.getFileSystemPath();
return tabFilePath.toFile();
} catch (IOException ioex) {
// this is likely a fatal condition, as in, the file is unaccessible:
}
} else {
try {
storageIO.open();
return downloadFromByteChannel(storageIO.getReadChannel(), storageIO.getSize());
} catch (IOException ex) {
logger.warning("caught IOException trying to store tabular file " + storageIO.getDataFile().getStorageIdentifier() + " as a temp file.");
}
}
return null;
}

private static File downloadFromByteChannel(ReadableByteChannel tabFileChannel, long size) {
try {
logger.fine("opening datafFileIO for the source tabular file...");

File tabFile = File.createTempFile("tempTabFile", ".tmp");
FileChannel tempFileChannel = new FileOutputStream(tabFile).getChannel();
tempFileChannel.transferFrom(tabFileChannel, 0, size);
return tabFile;
} catch (IOException ioex) {
logger.warning("caught IOException trying to store tabular file as a temp file.");
}
return null;
}

// Method for (subsettable) file format conversion.
// The method needs the subsettable file saved on disk as in the
// TAB-delimited format.
// Meaning, if this is a remote subsettable file, it needs to be downloaded
// and stored locally as a temporary file; and if it's a fixed-field file, it
// needs to be converted to TAB-delimited, before you can feed the file
// to this method. (See performFormatConversion() method)
// and stored locally as a temporary file (See performFormatConversion() method)
// The method below takes the tab file and sends it to the R server
// (possibly running on a remote host) and gets back the transformed copy,
// providing error-checking and diagnostics in the process.
// This is mostly Akio Sone's code from DVN3.
// (hence some obsolete elements in the comment above: ALL of the tabular
// data files in Dataverse are saved in tab-delimited format - we no longer
// support fixed-field files!

// This is mostly Akio Sone's code from DVN3.
private static File runFormatConversion (DataFile file, File tabFile, String formatRequested) {

if ( formatRequested.equals (FILE_TYPE_TAB) ) {
@@ -220,28 +218,47 @@ private static File runFormatConversion (DataFile file, File tabFile, String for
return tabFile;
}

File formatConvertedFile = null;
File formatConvertedFile;
// create the service instance
RemoteDataFrameService dfs = new RemoteDataFrameService();

if ("RData".equals(formatRequested)) {
List<DataVariable> dataVariables = file.getDataTable().getDataVariables();
Map<String, Map<String, String>> vls = null;

vls = getValueTableForRequestedVariables(dataVariables);
logger.fine("format conversion: variables(getDataVariableForRequest())=" + dataVariables + "\n");
logger.fine("format conversion: variables(dataVariables)=" + dataVariables + "\n");
logger.fine("format conversion: value table(vls)=" + vls + "\n");
RJobRequest sro = new RJobRequest(dataVariables, vls);

sro.setTabularDataFileName(tabFile.getAbsolutePath());
sro.setRequestType(SERVICE_REQUEST_CONVERT);
sro.setFormatRequested(FILE_TYPE_RDATA);



// execute the service
Map<String, String> resultInfo = dfs.execute(sro);
String origFormat = file.getOriginalFileFormat();
Map<String, String> resultInfo;
if (origFormat.contains("stata") || origFormat.contains("spss")){
if (origFormat.contains("stata")){
origFormat = "dta";
} else if (origFormat.contains("sav")){
origFormat = "sav";
} else if (origFormat.contains("por")){
origFormat = "por";
}

try {
StorageIO<DataFile> storageIO = file.getStorageIO();
long size = storageIO.getAuxObjectSize("orig");
File origFile = downloadFromByteChannel((ReadableByteChannel) storageIO.openAuxChannel("orig"), size);
Contributor Author

I don't know whether this cast to ReadableByteChannel should instead be a change to StorageIO, as all current ones return ReadableByteChannels. I figured that that is a bigger change, and probably needs an OK from @landreev and/or @scolapasta first.
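
The trade-off raised in this comment can be sketched in isolation. The following is a minimal, hypothetical Java sketch, not the Dataverse code: `AuxSource`, `openReadable`, and `copyToTempFile` are illustrative names, and `AuxSource` only stands in for a storage abstraction whose auxiliary-channel method returns a plain `Channel`. It shows one way to keep the cast local and checked, assuming the storage interface itself is left unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.Channel;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;

public class AuxChannelSketch {

    // Hypothetical stand-in for the storage abstraction; NOT the real StorageIO API.
    interface AuxSource {
        Channel openAuxChannel(String tag) throws IOException;
    }

    // Checks the runtime type instead of casting blindly, so an unexpected
    // channel type surfaces as an IOException rather than a ClassCastException.
    static ReadableByteChannel openReadable(AuxSource source, String tag) throws IOException {
        Channel channel = source.openAuxChannel(tag);
        if (channel instanceof ReadableByteChannel) {
            return (ReadableByteChannel) channel;
        }
        channel.close();
        throw new IOException("Auxiliary channel '" + tag + "' is not readable");
    }

    // Copies up to `size` bytes into a temp file; both channels are closed on exit.
    static File copyToTempFile(ReadableByteChannel in, long size) throws IOException {
        File tmp = File.createTempFile("tempOrigFile", ".tmp");
        try (ReadableByteChannel src = in;
             FileChannel out = new FileOutputStream(tmp).getChannel()) {
            long written = 0;
            while (written < size) {
                long n = out.transferFrom(src, written, size - written);
                if (n <= 0) {
                    break; // source exhausted before `size` bytes were read
                }
                written += n;
            }
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes();
        // In-memory source used only for this demonstration.
        AuxSource source = tag -> Channels.newChannel(new ByteArrayInputStream(payload));
        File copy = copyToTempFile(openReadable(source, "orig"), payload.length);
        System.out.println(copy.length());
        copy.delete();
    }
}
```

If the storage interface were instead changed to return `ReadableByteChannel` directly, the `instanceof` check would become unnecessary; that is the larger change the comment defers to the maintainers.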

resultInfo = dfs.directConvert(origFile, origFormat);
} catch (IOException ex) {
ex.printStackTrace();
return null;
}

} else{
List<DataVariable> dataVariables = file.getDataTable().getDataVariables();
Map<String, Map<String, String>> vls = getValueTableForRequestedVariables(dataVariables);
logger.fine("format conversion: variables(getDataVariableForRequest())=" + dataVariables + "\n");
logger.fine("format conversion: variables(dataVariables)=" + dataVariables + "\n");
logger.fine("format conversion: value table(vls)=" + vls + "\n");
RJobRequest sro = new RJobRequest(dataVariables, vls);

sro.setTabularDataFileName(tabFile.getAbsolutePath());
sro.setRequestType(SERVICE_REQUEST_CONVERT);
sro.setFormatRequested(FILE_TYPE_RDATA);

// execute the service
resultInfo = dfs.execute(sro);
}

//resultInfo.put("offlineCitation", citation);
logger.fine("resultInfo="+resultInfo+"\n");
@@ -251,12 +268,11 @@ private static File runFormatConversion (DataFile file, File tabFile, String for
if ("true".equals(resultInfo.get("RexecError"))){
logger.fine("R-runtime error trying to convert a file.");
return null;
} else {
String dataFrameFileName = resultInfo.get("dataFrameFileName");
logger.fine("data frame file name: "+dataFrameFileName);

formatConvertedFile = new File(dataFrameFileName);
}
String dataFrameFileName = resultInfo.get("dataFrameFileName");
logger.fine("data frame file name: "+dataFrameFileName);

formatConvertedFile = new File(dataFrameFileName);
} else if ("prep".equals(formatRequested)) {
formatConvertedFile = dfs.runDataPreprocessing(file);
} else {
@@ -265,32 +281,28 @@ private static File runFormatConversion (DataFile file, File tabFile, String for
}


if (formatConvertedFile != null && formatConvertedFile.exists()) {
logger.fine("frmtCnvrtdFile:length=" + formatConvertedFile.length());
} else {
if (formatConvertedFile == null || !formatConvertedFile.exists()) {
logger.warning("Format-converted file was not properly created.");
return null;
}

logger.fine("frmtCnvrtdFile:length=" + formatConvertedFile.length());
return formatConvertedFile;
}

private static Map<String, Map<String, String>> getValueTableForRequestedVariables(List<DataVariable> dvs){
Map<String, Map<String, String>> vls = new LinkedHashMap<>();
for (DataVariable dv : dvs){
List<VariableCategory> varCat = new ArrayList<>();
varCat.addAll(dv.getCategories());
Map<String, String> vl = new HashMap<>();
for (VariableCategory vc : varCat){
if (vc.getLabel() != null){
vl.put(vc.getValue(), vc.getLabel());
private static Map<String, Map<String, String>> getValueTableForRequestedVariables(List<DataVariable> dataVariables){
Map<String, Map<String, String>> allVarLabels = new LinkedHashMap<>();
for (DataVariable dataVar : dataVariables){
Map<String, String> varLabels = new HashMap<>();
for (VariableCategory varCatagory : dataVar.getCategories()){
if (varCatagory.getLabel() != null){
varLabels.put(varCatagory.getValue(), varCatagory.getLabel());
}
}
if (vl.size() > 0){
vls.put("v"+dv.getId(), vl);
if (!varLabels.isEmpty()){
allVarLabels.put("v"+dataVar.getId(), varLabels);
}
}
return vls;
return allVarLabels;
}

private static String generateAltFileName(String formatRequested, String xfileId) {
@@ -299,7 +311,7 @@ private static String generateAltFileName(String formatRequested, String xfileId
if (altFileName == null || altFileName.isEmpty()) {
altFileName = "Converted";
}

// Fixme:" should this be else if?
if ( formatRequested != null ) {
altFileName = FileUtil.replaceExtension(altFileName, formatRequested);
}
Original file line number Diff line number Diff line change
@@ -91,10 +91,7 @@ private Map<String, DataTable> processDataDscr(XMLStreamReader xmlr) throws XMLS
}

private void processVar(XMLStreamReader xmlr, Map<String, DataTable> dataTablesMap, Map<String, Integer> varsPerFileMap) throws XMLStreamException {
DataVariable dv = new DataVariable();
dv.setInvalidRanges(new ArrayList<>());
dv.setSummaryStatistics(new ArrayList<>());
dv.setCategories(new ArrayList<>());
DataVariable dv = new DataVariable(0,null);
dv.setName( xmlr.getAttributeValue(null, "name") );

try {