Merged
Commits
35 commits
5f1cab3
Tests for old csvparser
oscardssmith Jun 26, 2017
60caade
This has tests asserting the old behavior.
oscardssmith Jun 26, 2017
b435ada
Updates to CSVFileReader
oscardssmith Jun 26, 2017
c137ba7
Updates to CSVFileReaderTest
oscardssmith Jun 26, 2017
ee34c02
fixed new-line parsing
oscardssmith Jun 26, 2017
932754c
harder test file
oscardssmith Jun 26, 2017
487b84b
Fixed tab files
oscardssmith Jun 26, 2017
f6defb5
Fixes to allow weird unicode white space characters
oscardssmith Jun 27, 2017
3e68a9c
Made the tests just a bit more evil. Now null strings and strings tha…
oscardssmith Jun 27, 2017
cbb7afa
Added the first Leonid test, and removed the now redundant HardRead test
oscardssmith Jun 27, 2017
606458f
Actually doing what I said I was doing earlier
oscardssmith Jun 27, 2017
158091d
Added messages to the asserts to remove the need for most separate l…
oscardssmith Jun 28, 2017
e96ba16
Fixed testSubset to reflect the new behavior of counting large integr…
oscardssmith Jun 28, 2017
406906d
Added testVariableUNFs
oscardssmith Jun 28, 2017
8651826
Revived my test API for sending files through the ingest process with…
landreev Jun 29, 2017
ae00815
CSV Ingest Doc Update [#3767]
dlmurphy Jun 30, 2017
fd5bbf5
Small url fix
dlmurphy Jun 30, 2017
830fca6
Allows empty strings to be parsed as integer values to preserve compa…
oscardssmith Jul 5, 2017
bbfa4b9
Merge branch '3767-CSV-injest-code' of https://github.com/IQSS/datave…
oscardssmith Jul 5, 2017
784edb7
Removed tons of logging that didn't really help much.
oscardssmith Jul 6, 2017
40ef40a
lets retry this
oscardssmith Jul 6, 2017
61c74c7
Merge branch 'develop' into 3767-CSV-injest-code
oscardssmith Jul 7, 2017
144468c
Updated IngestReport to use an unlimited length for errors to make er…
oscardssmith Jul 11, 2017
deb85ae
updated comments to better reflect status
oscardssmith Jul 13, 2017
9fbff77
Merge branch 'develop' into 3767-CSV-injest-code
oscardssmith Jul 17, 2017
68f3826
fix merge conflict
oscardssmith Jul 17, 2017
e9f4da4
Fixed really stupid csv bug causing csv's with duplicate headers to b…
oscardssmith Jul 18, 2017
6601ea2
Slightly nicer fix
oscardssmith Jul 18, 2017
5a3c86f
refix
oscardssmith Jul 18, 2017
0cdc2c0
Merge branch 'develop' into 3767-CSV-injest-code
oscardssmith Jul 20, 2017
867485f
Merge branch 'develop' into 3767-CSV-injest-code
oscardssmith Jul 24, 2017
fbd5789
Typo fix [ref: #3767]
dlmurphy Jul 31, 2017
310b86b
some additions to the documentation page for the CSV ingest
landreev Aug 2, 2017
145608c
shortened one sentence in the csv doc (#3767)
landreev Aug 2, 2017
4e45054
Docs reviewed [ref: #3767]
dlmurphy Aug 2, 2017
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/user/appendix.rst
Original file line number Diff line number Diff line change
@@ -24,6 +24,6 @@ Detailed below are what metadata schemas we support for Citation and Domain Spec
: These metadata elements can be mapped/exported to the International Virtual Observatory Alliance’s (IVOA)
`VOResource Schema format <http://www.ivoa.net/documents/latest/RM.html>`__ and is based on
`Virtual Observatory (VO) Discovery and Provenance Metadata <http://perma.cc/H5ZJ-4KKY>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/astrophysics.tsv>`__).
- `Life Sciences Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=2>`__: based on `ISA-Tab Specification <http://isatab.sourceforge.net/format.html>`__, along with controlled vocabulary from subsets of the `OBI Ontology <http://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__).
- `Life Sciences Metadata <https://docs.google.com/spreadsheet/ccc?key=0AjeLxEN77UZodHFEWGpoa19ia3pldEFyVFR0aFVGa0E#gid=2>`__: based on `ISA-Tab Specification <http://isa-tools.org/format/specification/>`__, along with controlled vocabulary from subsets of the `OBI Ontology <http://bioportal.bioontology.org/ontologies/OBI>`__ and the `NCBI Taxonomy for Organisms <http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/>`__ (`see .tsv version <https://github.com/IQSS/dataverse/blob/master/scripts/api/data/metadatablocks/biomedical.tsv>`__).

See also the `Dataverse 4.0 Metadata Crosswalk: DDI, DataCite, DC, DCTerms, VO, ISA-Tab <https://docs.google.com/spreadsheets/d/10Luzti7svVTVKTA-px27oq3RxCUM-QbiTkm8iMd5C54/edit?usp=sharing>`__ document.
75 changes: 67 additions & 8 deletions doc/sphinx-guides/source/user/tabulardataingest/csv.rst
@@ -7,24 +7,83 @@ CSV
Ingest of Comma-Separated Values files as tabular data.
-------------------------------------------------------

Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data.
Dataverse will make an attempt to turn CSV files uploaded by the user into tabular data, using the `Apache CSV parser <https://commons.apache.org/proper/commons-csv/>`_.

Main formatting requirements:
Main formatting requirements:
-----------------------------

The first line must contain a comma-separated list of the variable names;
The first row in the document will be treated as the CSV's header, containing variable names for each column.

All the lines that follow must contain the same number of comma-separated values as the first, variable name line.
Each following row must contain the same number of comma-separated values ("cells") as that header.

Limitations:
As of the Dataverse 4.8 release, we allow ingest of CSV files with commas and line breaks within cells. A string with any number of commas and line breaks enclosed within double quotes is recognized as a single cell. Double quotes can be encoded as two double quotes in a row (``""``).

For example, the following lines:

.. code-block:: none

a,b,"c,d
efgh""ijk""l",m,n

are recognized as a **single** row with **5** comma-separated values (cells):

.. code-block:: none

a
b
c,d\nefgh"ijk"l
m
n

(where ``\n`` is a new line character)

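The quoting behavior described above can be checked with any RFC 4180-style parser. As a sketch (an assumption on my part — the guide's actual parser is Apache Commons CSV in Java, not Python), Python's standard ``csv`` module follows the same convention and parses the example to a single five-cell record:

```python
import csv
import io

# The two physical lines from the example above form one logical CSV row:
raw = 'a,b,"c,d\nefgh""ijk""l",m,n\n'

# A double-quoted cell may contain commas and line breaks, and ""
# encodes a literal double quote inside the cell.
rows = list(csv.reader(io.StringIO(raw)))
assert len(rows) == 1           # one record, despite the embedded newline
assert len(rows[0]) == 5        # five cells
assert rows[0][2] == 'c,d\nefgh"ijk"l'
```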

Limitations:
------------

Except for the variable names supplied in the top line, very little information describing the data can be obtained from a CSV file. We strongly recommend using one of the supported rich files formats (Stata, SPSS and R) to provide more descriptive metadata (informatinve lables, categorical values and labels, and more) that cannot be encoded in a CSV file.
Compared to other formats, relatively little information about the data ("variable-level metadata") can be extracted from a CSV file. Aside from the variable names supplied in the top line, the ingest will make an educated guess about the data type of each comma-separated column. One of the supported rich file formats (Stata, SPSS and R) should be used if you need to provide more descriptive variable-level metadata (variable labels, categorical values and labels, explicitly defined data types, etc.).

Recognized data types and formatting:
-------------------------------------

The application will attempt to recognize numeric, string, and date/time values in the individual comma-separated columns.


For dates, the ``yyyy-MM-dd`` format is recognized.

For date-time values, the following 2 formats are recognized:

``yyyy-MM-dd HH:mm:ss``

``yyyy-MM-dd HH:mm:ss z`` (same format as the above, with the time zone specified)

For numeric variables, the following special values are recognized:

``inf``, ``+inf`` - as a special IEEE 754 "positive infinity" value;

``NaN`` - as a special IEEE 754 "not a number" value;

An empty value (i.e., a comma followed immediately by another comma, or the line end), or ``NA`` - as a *missing value*.

``null`` - as a numeric *zero*.

(any combinations of lower and upper cases are allowed in the notations above).

In character strings, an empty value (a comma followed by another comma, or the line end) is treated as an empty string (NOT as a *missing value*).

Any non-Latin characters are allowed in character string values, **as long as the encoding is UTF8**.

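The numeric-recognition rules above can be sketched as follows. This is a toy illustration in Python, not Dataverse's implementation, and ``guess_column_type`` is a hypothetical helper; it does capture the key behavior that a single improperly-formatted value turns the whole column into character strings:

```python
def guess_column_type(values):
    """Toy sketch of the type-guessing rules (NOT Dataverse's actual code):
    a column is numeric only if every cell is a number, a special IEEE 754
    notation, or a missing-value token; one bad cell makes it character."""
    # Special notations are matched case-insensitively, per the rules above.
    specials = {"", "na", "nan", "inf", "+inf", "null"}
    for v in values:
        s = v.strip().lower()
        if s in specials:
            continue
        try:
            float(s)
        except ValueError:
            return "character"
    return "numeric"

print(guess_column_type(["1", "2.5", "NA", "null"]))   # numeric
print(guess_column_type(["1", "oops", "3"]))           # character
```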

**Note:** When the ingest recognizes a CSV column as a numeric vector, or as a date/time value, this information is reflected and saved in the database as the *data variable metadata*. To inspect that metadata, click on the *Download* button next to a tabular data file, and select *Variable Metadata*. This will export the variable records in the DDI XML format. (Alternatively, this metadata fragment can be downloaded via the Data Access API; for example: ``http://localhost:8080/api/access/datafile/<FILEID>/metadata/ddi``).

The most immediate implication is in the calculation of the UNF signatures for the data vectors, as different normalization rules are applied to numeric, character, and date/time values. (see the :doc:`/developers/unf/index` section for more information). If it is important to you that the UNF checksums of your data are accurately calculated, check that the numeric and date/time columns in your file were recognized as such (as ``type=numeric`` and ``type=character, category=date(time)``, respectively). If, for example, a column that was supposed to be numeric is recognized as a vector of character values (strings), double-check that the formatting of the values is consistent. Remember, a single improperly-formatted value in the column will turn it into a vector of character strings, and result in a different UNF. Fix any formatting errors you find, delete the file from the dataset, and try to ingest it again.

The application will however make an attempt to recognize numeric, string and date/time values in CSV files.

Tab-delimited Data Files:
-------------------------

Tab-delimited files could be ingested by replacing the TABs with commas.
Presently, tab-delimited files can be ingested by replacing the TABs with commas.
(We are planning to add direct support for tab-delimited files in an upcoming release).


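A naive search-and-replace of TABs with commas would corrupt any cell that itself contains a comma. A safer conversion re-quotes cells as needed; here is a minimal sketch using Python's standard ``csv`` module (``tsv_to_csv`` is a hypothetical helper, not a Dataverse tool):

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    # Rewrite a tab-delimited file as CSV; csv.writer re-quotes any cell
    # containing commas, quotes, or line breaks so the data survives intact.
    with open(tsv_path, newline="", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```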

6 changes: 5 additions & 1 deletion src/main/java/Bundle.properties
@@ -1689,4 +1689,8 @@ authenticationProvider.name.github=GitHub
authenticationProvider.name.google=Google
authenticationProvider.name.orcid=ORCiD
authenticationProvider.name.orcid-sandbox=ORCiD Sandbox
authenticationProvider.name.shib=Shibboleth
authenticationProvider.name.shib=Shibboleth
ingest.csv.invalidHeader=Invalid header row. One of the cells is empty.
ingest.csv.lineMismatch=Mismatch between line counts in first and final passes: {0} found on first pass, but {1} found on second.
ingest.csv.recordMismatch=Reading mismatch, line {0} of the Data file: {1} delimited values expected, {2} found.
ingest.csv.nullStream=Stream can't be null.
28 changes: 9 additions & 19 deletions src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java
@@ -7,14 +7,14 @@
package edu.harvard.iq.dataverse.api;

import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.DataFileServiceBean;
import edu.harvard.iq.dataverse.DataTable;
import edu.harvard.iq.dataverse.DatasetServiceBean;
import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.ingest.IngestServiceBean;
import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataFileReader;
import edu.harvard.iq.dataverse.ingest.tabulardata.TabularDataIngest;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.StringUtil;
import java.io.BufferedInputStream;
import java.util.logging.Logger;
import javax.ejb.EJB;
@@ -32,6 +32,7 @@
import javax.ws.rs.core.HttpHeaders;
import javax.ws.rs.core.UriInfo;
import javax.servlet.http.HttpServletResponse;
import javax.ws.rs.QueryParam;



@@ -56,49 +57,38 @@
public class TestIngest {
private static final Logger logger = Logger.getLogger(TestIngest.class.getCanonicalName());

@EJB
DataFileServiceBean dataFileService;
@EJB
DatasetServiceBean datasetService;
@EJB
IngestServiceBean ingestService;

//@EJB

@Path("test/{fileName}/{fileType}")
@Path("test/file")
@GET
@Produces({ "text/plain" })
public String datafile(@PathParam("fileName") String fileName, @PathParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ {
public String datafile(@QueryParam("fileName") String fileName, @QueryParam("fileType") String fileType, @Context UriInfo uriInfo, @Context HttpHeaders headers, @Context HttpServletResponse response) /*throws NotFoundException, ServiceUnavailableException, PermissionDeniedException, AuthorizationRequiredException*/ {
String output = "";

if (fileName == null || fileType == null || "".equals(fileName) || "".equals(fileType)) {
output = output.concat("Usage: java edu.harvard.iq.dataverse.ingest.IngestServiceBean <file> <type>.");
if (StringUtil.isEmpty(fileName) || StringUtil.isEmpty(fileType)) {
output = output.concat("Usage: /api/ingest/test/file?fileName=PATH&fileType=TYPE");
return output;
}

BufferedInputStream fileInputStream = null;

String absoluteFilePath = null;
if (fileType.equals("x-stata")) {
absoluteFilePath = "/usr/share/data/retest_stata/reingest/" + fileName;
} else if (fileType.equals("x-spss-sav")) {
absoluteFilePath = "/usr/share/data/retest_sav/reingest/" + fileName;
} else if (fileType.equals("x-spss-por")) {
absoluteFilePath = "/usr/share/data/retest_por/reingest/" + fileName;
}

try {
fileInputStream = new BufferedInputStream(new FileInputStream(new File(absoluteFilePath)));
fileInputStream = new BufferedInputStream(new FileInputStream(new File(fileName)));
} catch (FileNotFoundException notfoundEx) {
fileInputStream = null;
}

if (fileInputStream == null) {
output = output.concat("Could not open file "+absoluteFilePath+".");
output = output.concat("Could not open file "+fileName+".");
return output;
}

fileType = "application/"+fileType;
TabularDataFileReader ingestPlugin = ingestService.getTabDataReaderByMimeType(fileType);

if (ingestPlugin == null) {
@@ -123,7 +113,7 @@ public String datafile(@PathParam("fileName") String fileName, @PathParam("fileT
&& tabFile != null
&& tabFile.exists()) {

String tabFilename = FileUtil.replaceExtension(absoluteFilePath, "tab");
String tabFilename = FileUtil.replaceExtension(fileName, "tab");

java.nio.file.Files.copy(Paths.get(tabFile.getAbsolutePath()), Paths.get(tabFilename), StandardCopyOption.REPLACE_EXISTING);

78 changes: 40 additions & 38 deletions src/main/java/edu/harvard/iq/dataverse/ingest/IngestReport.java
@@ -15,6 +15,7 @@
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.JoinColumn;
import javax.persistence.Lob;
import javax.persistence.ManyToOne;
import javax.persistence.Table;
import javax.persistence.Temporal;
@@ -39,86 +40,87 @@ public Long getId() {
public void setId(Long id) {
this.id = id;
}
public static int INGEST_TYPE_TABULAR = 1;
public static int INGEST_TYPE_METADATA = 2;
public static int INGEST_STATUS_INPROGRESS = 1;
public static int INGEST_STATUS_SUCCESS = 2;
public static int INGEST_STATUS_FAILURE = 3;

public static int INGEST_TYPE_TABULAR = 1;
public static int INGEST_TYPE_METADATA = 2;

public static int INGEST_STATUS_INPROGRESS = 1;
public static int INGEST_STATUS_SUCCESS = 2;
public static int INGEST_STATUS_FAILURE = 3;

@ManyToOne
@JoinColumn(nullable=false)
private DataFile dataFile;

private String report;

private int type;


@Lob
private String report;

private int type;

private int status;

@Temporal(value = TemporalType.TIMESTAMP)
private Date startTime;
private Date startTime;

@Temporal(value = TemporalType.TIMESTAMP)
private Date endTime;
private Date endTime;

public int getType() {
return type;
return type;
}

public void setType(int type) {
this.type = type;
}

public int getStatus() {
return status;
return status;
}

public void setStatus(int status) {
this.status = status;
}

public boolean isFailure() {
return status == INGEST_STATUS_FAILURE;
}

public void setFailure() {
this.status = INGEST_STATUS_FAILURE;
}

public String getReport() {
return report;
}

public void setReport(String report) {
this.report = report;
this.report = report;
}

public DataFile getDataFile() {
return dataFile;
}

public void setDataFile(DataFile dataFile) {
this.dataFile = dataFile;
this.dataFile = dataFile;
}

public Date getStartTime() {
return startTime;
return startTime;
}

public void setStartTime(Date startTime) {
this.startTime = startTime;
}

public Date getEndTime() {
return endTime;
return endTime;
}

public void setEndTime(Date endTime) {
this.endTime = endTime;
}

@Override
public int hashCode() {
int hash = 0;
@@ -143,5 +145,5 @@
public String toString() {
return "edu.harvard.iq.dataverse.ingest.IngestReport[ id=" + id + " ]";
}

}
@@ -544,8 +544,8 @@ public void produceContinuousSummaryStatistics(DataFile dataFile, File generated
for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) {
if (dataFile.getDataTable().getDataVariables().get(i).isIntervalContinuous()) {
logger.fine("subsetting continuous vector");
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
if ("float".equals(dataFile.getDataTable().getDataVariables().get(i).getFormat())) {
Float[] variableVector = TabularSubsetGenerator.subsetFloatVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
logger.fine("Calculating summary statistics on a Float vector;");
@@ -576,8 +576,8 @@ public void produceDiscreteNumericSummaryStatistics(DataFile dataFile, File gene
if (dataFile.getDataTable().getDataVariables().get(i).isIntervalDiscrete()
&& dataFile.getDataTable().getDataVariables().get(i).isTypeNumeric()) {
logger.fine("subsetting discrete-numeric vector");
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
Long[] variableVector = TabularSubsetGenerator.subsetLongVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
// We are discussing calculating the same summary stats for
// all numerics (the same kind of sumstats that we've been calculating
@@ -610,8 +610,8 @@ public void produceCharacterSummaryStatistics(DataFile dataFile, File generatedT

for (int i = 0; i < dataFile.getDataTable().getVarQuantity(); i++) {
if (dataFile.getDataTable().getDataVariables().get(i).isTypeCharacter()) {
DataFileIO dataFileIO = dataFile.getDataFileIO();
dataFileIO.open();
//DataFileIO dataFileIO = dataFile.getDataFileIO();
//dataFileIO.open();
logger.fine("subsetting character vector");
String[] variableVector = TabularSubsetGenerator.subsetStringVector(new FileInputStream(generatedTabularFile), i, dataFile.getDataTable().getCaseQuantity().intValue());
//calculateCharacterSummaryStatistics(dataFile, i, variableVector);