8524 adding mechanism for storing tab. files with variable headers #10282

Merged
merged 17 commits into from
Feb 7, 2024
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
Tabular Data Ingest can now save the generated archival files with the list of variable names added as the first tab-delimited line. The most significant effect of this feature is that
the Access API can take advantage of Direct Download for tabular files saved with these headers on S3, since the header line no longer has to be generated and added to the streamed content on the fly.

This behavior is controlled by the new setting `:StoreIngestedTabularFilesWithVarHeaders`. It is false by default, preserving the legacy behavior. When enabled, Dataverse handles both newly ingested files and already-existing legacy files stored without these headers, transparently to the user. For example, the Access API will continue delivering tab-delimited files **with** this header line, either adding it dynamically for legacy files or reading the complete file directly from storage for files stored with it.
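
As a rough illustration of the paragraph above, the sketch below models the two delivery paths and shows that both yield the same content. The class and method names are hypothetical and are not part of the Dataverse code base. The header format (variable names joined by tabs and terminated by a newline) is an assumption based on the description of the stored line.

```java
import java.util.List;

// Hypothetical illustration only; not code from this pull request.
final class VarHeaderDeliverySketch {

    // Assumed header format: variable names joined by tabs, ending in a newline.
    static String headerLine(List<String> variableNames) {
        return String.join("\t", variableNames) + "\n";
    }

    // Legacy files get the header prepended on the fly;
    // files stored with the header are returned unchanged.
    static String deliver(String storedContent,
                          boolean storedWithVariableHeader,
                          List<String> variableNames) {
        if (storedWithVariableHeader) {
            return storedContent;
        }
        return headerLine(variableNames) + storedContent;
    }
}
```

Under these assumptions, a legacy file stored as two data rows and a new file stored as the header plus the same two rows produce identical output from `deliver`, which is what "transparently to the user" amounts to here.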

An API for converting existing legacy tabular files will be added separately. [this line will need to be changed if we have time to add said API before 6.2 is released].
22 changes: 22 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4151,3 +4151,25 @@ A true/false (default) option determining whether the dataset datafile table dis

.. _supported MicroProfile Config API source: https://docs.payara.fish/community/docs/Technical%20Documentation/MicroProfile/Config/Overview.html


.. _:UseStorageQuotas:

:UseStorageQuotas
+++++++++++++++++

Enables storage use quotas in collections. See the :doc:`/api/native-api` for details.


.. _:StoreIngestedTabularFilesWithVarHeaders:

:StoreIngestedTabularFilesWithVarHeaders
++++++++++++++++++++++++++++++++++++++++

With this setting enabled, tabular files produced during Ingest will
be stored with the list of variable names added as the first
tab-delimited line. The most significant effect of this feature is
that the Access API can take advantage of Direct Download for tabular
files saved with these headers on S3, since the header line no longer
has to be generated and added to the streamed file on the fly.

The setting is ``false`` by default, preserving the legacy behavior.
18 changes: 18 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataTable.java
Original file line number Diff line number Diff line change
Expand Up @@ -112,6 +112,16 @@ public DataTable() {
@Column( nullable = true )
private String originalFileName;


/**
* The physical tab-delimited file is in storage with the list of variable
* names saved as the 1st line. This means that we do not need to generate
* this line on the fly. (It also means that the direct download mechanism
* can be used for this file.)
*/
@Column(nullable = false)
private boolean storedWithVariableHeader = false;

/*
* Getter and Setter methods:
*/
Expand Down Expand Up @@ -206,6 +216,14 @@ public void setOriginalFileName(String originalFileName) {
this.originalFileName = originalFileName;
}

public boolean isStoredWithVariableHeader() {
return storedWithVariableHeader;
}

public void setStoredWithVariableHeader(boolean storedWithVariableHeader) {
this.storedWithVariableHeader = storedWithVariableHeader;
}

/*
* Custom overrides for hashCode(), equals() and toString() methods:
*/
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,6 @@
import jakarta.ws.rs.ext.Provider;

import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.dataaccess.*;
import edu.harvard.iq.dataverse.datavariable.DataVariable;
import edu.harvard.iq.dataverse.engine.command.Command;
Expand Down Expand Up @@ -104,8 +103,10 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
String auxiliaryTag = null;
String auxiliaryType = null;
String auxiliaryFileName = null;

// Before we do anything else, check if this download can be handled
// by a redirect to remote storage (only supported on S3, as of 5.4):

if (storageIO.downloadRedirectEnabled()) {

// Even if the above is true, there are a few cases where a
Expand Down Expand Up @@ -159,7 +160,7 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
}

} else if (dataFile.isTabularData()) {
// Many separate special cases here.
// Many separate special cases here.

if (di.getConversionParam() != null) {
if (di.getConversionParam().equals("format")) {
Expand All @@ -180,12 +181,26 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
redirectSupported = false;
}
}
} else if (!di.getConversionParam().equals("noVarHeader")) {
// This is a subset request - can't do.
} else if (di.getConversionParam().equals("noVarHeader")) {
// This will work just fine, if the tab. file is
// stored without the var. header. Throw "unavailable"
// exception otherwise.
// @todo: should we actually drop support for this "noVarHeader" flag?
if (dataFile.getDataTable().isStoredWithVariableHeader()) {
throw new ServiceUnavailableException();
}
// ... defaults to redirectSupported = true
} else {
// This must be a subset request then - can't do.
redirectSupported = false;
}
} else {
// "straight" download of the full tab-delimited file.
// can redirect, but only if stored with the variable
// header already added:
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
redirectSupported = false;
}
} else {
redirectSupported = false;
}
}
}
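
The tabular branch of the redirect check in the hunk above can be summarized as a small decision function. This is a simplified sketch: the class, enum, and method names are hypothetical, and the "format" conversion case is left out on purpose because its full logic is not visible in this hunk.

```java
// Hypothetical model of the tabular-file redirect decision; not code from this pull request.
final class RedirectDecisionSketch {

    enum Outcome { REDIRECT, STREAM, UNAVAILABLE }

    static Outcome decide(String conversionParam, boolean storedWithVariableHeader) {
        if (conversionParam == null) {
            // Plain download: the stored object is complete only if it
            // already contains the variable header line.
            return storedWithVariableHeader ? Outcome.REDIRECT : Outcome.STREAM;
        }
        if (conversionParam.equals("noVarHeader")) {
            // A header-less copy can only be served if the file is stored that way.
            return storedWithVariableHeader ? Outcome.UNAVAILABLE : Outcome.REDIRECT;
        }
        if (conversionParam.equals("format")) {
            // Deliberately not modeled: the relevant code is elided in the diff.
            throw new IllegalArgumentException("format conversions are not modeled in this sketch");
        }
        // Any other conversion parameter is treated as a subset request,
        // which cannot be handled by a redirect.
        return Outcome.STREAM;
    }
}
```

Note the inversion between the two main cases: a file stored with the header can be redirected for a plain download but not for a `noVarHeader` request, while a legacy file behaves the other way around.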
Expand Down Expand Up @@ -247,11 +262,16 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
// finally, issue the redirect:
Response response = Response.seeOther(redirect_uri).build();
logger.fine("Issuing redirect to the file location.");
// Yes, this throws an exception. It's not an exception
// as in, "bummer, something went wrong". This is how a
// redirect is produced here!
throw new RedirectionException(response);
}
throw new ServiceUnavailableException();
}

// Past this point, this is a locally served/streamed download

if (di.getConversionParam() != null) {
// Image Thumbnail and Tabular data conversion:
// NOTE: only supported on local files, as of 4.0.2!
Expand Down Expand Up @@ -285,9 +305,14 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
// request any tabular-specific services.

if (di.getConversionParam().equals("noVarHeader")) {
logger.fine("tabular data with no var header requested");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
logger.fine("tabular data with no var header requested");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
} else {
logger.fine("can't serve request for tabular data without varheader, since stored with it");
throw new ServiceUnavailableException();
}
} else if (di.getConversionParam().equals("format")) {
// Conversions, and downloads of "stored originals" are
// now supported on all DataFiles for which StorageIO
Expand Down Expand Up @@ -329,11 +354,10 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
if (variable.getDataTable().getDataFile().getId().equals(dataFile.getId())) {
logger.fine("adding variable id " + variable.getId() + " to the list.");
variablePositionIndex.add(variable.getFileOrder());
if (subsetVariableHeader == null) {
subsetVariableHeader = variable.getName();
} else {
subsetVariableHeader = subsetVariableHeader.concat("\t");
subsetVariableHeader = subsetVariableHeader.concat(variable.getName());
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
subsetVariableHeader = subsetVariableHeader == null
? variable.getName()
: subsetVariableHeader.concat("\t" + variable.getName());
}
} else {
logger.warning("variable does not belong to this data file.");
Expand All @@ -346,16 +370,29 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
try {
File tempSubsetFile = File.createTempFile("tempSubsetFile", ".tmp");
TabularSubsetGenerator tabularSubsetGenerator = new TabularSubsetGenerator();
tabularSubsetGenerator.subsetFile(storageIO.getInputStream(), tempSubsetFile.getAbsolutePath(), variablePositionIndex, dataFile.getDataTable().getCaseQuantity(), "\t");

long numberOfLines = dataFile.getDataTable().getCaseQuantity();
if (dataFile.getDataTable().isStoredWithVariableHeader()) {
numberOfLines++;
}

tabularSubsetGenerator.subsetFile(storageIO.getInputStream(),
tempSubsetFile.getAbsolutePath(),
variablePositionIndex,
numberOfLines,
"\t");

if (tempSubsetFile.exists()) {
FileInputStream subsetStream = new FileInputStream(tempSubsetFile);
long subsetSize = tempSubsetFile.length();

InputStreamIO subsetStreamIO = new InputStreamIO(subsetStream, subsetSize);
logger.fine("successfully created subset output stream.");
subsetVariableHeader = subsetVariableHeader.concat("\n");
subsetStreamIO.setVarHeader(subsetVariableHeader);

if (subsetVariableHeader != null) {
subsetVariableHeader = subsetVariableHeader.concat("\n");
subsetStreamIO.setVarHeader(subsetVariableHeader);
}

String tabularFileName = storageIO.getFileName();

Expand All @@ -380,8 +417,13 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
} else {
logger.fine("empty list of extra arguments.");
}
// end of tab. data subset case
} else if (dataFile.getDataTable().isStoredWithVariableHeader()) {
logger.fine("tabular file stored with the var header included, no need to generate it on the fly");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
}
}
} // end of tab. data file case

if (storageIO == null) {
//throw new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE);
Expand Down
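
The subset changes above read one extra line when the file is stored with its header, and skip building `subsetVariableHeader` in that case. A plausible reading is that the stored header row passes through the same column selection as the data rows, so the subset header comes out for free. The sketch below illustrates that idea with an in-memory stand-in for `TabularSubsetGenerator.subsetFile`, whose real implementation is not shown in this diff. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory illustration; not code from this pull request.
final class SubsetLineCountSketch {

    // One extra line must be read when the header is part of the stored file.
    static long linesToRead(long caseQuantity, boolean storedWithVariableHeader) {
        return storedWithVariableHeader ? caseQuantity + 1 : caseQuantity;
    }

    // Keeps only the requested columns from the first numberOfLines lines.
    static List<String> subset(List<String> storedLines,
                               List<Integer> columns,
                               long numberOfLines) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < numberOfLines && i < storedLines.size(); i++) {
            String[] cells = storedLines.get(i).split("\t", -1);
            List<String> picked = new ArrayList<>();
            for (int c : columns) {
                picked.add(cells[c]);
            }
            out.add(String.join("\t", picked));
        }
        return out;
    }
}
```

With a stored header row, selecting columns 0 and 2 from a three-column file yields a header containing just those two variable names, followed by the matching data cells. Without the extra line in the count, the last data row would be dropped.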
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,7 @@ public String datafile(@QueryParam("fileName") String fileName, @QueryParam("fil
TabularDataIngest tabDataIngest = null;

try {
tabDataIngest = ingestPlugin.read(fileInputStream, null);
tabDataIngest = ingestPlugin.read(fileInputStream, false, null);
} catch (IOException ingestEx) {
output = output.concat("Caught an exception trying to ingest file " + fileName + ": " + ingestEx.getLocalizedMessage());
return output;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -120,7 +120,8 @@ public void open (DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -450,8 +450,12 @@ public void open(DataAccessOption... options) throws IOException {
this.setSize(retrieveSizeFromMedia());
}
// Only applies for the S3 Connector case (where we could have run an ingest)
if (dataFile.getContentType() != null && dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData() && dataFile.getDataTable() != null && (!this.noVarHeader())) {
if (dataFile.getContentType() != null
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -124,8 +124,12 @@ public void open(DataAccessOption... options) throws IOException {
logger.fine("Setting size");
this.setSize(retrieveSizeFromMedia());
}
if (dataFile.getContentType() != null && dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData() && dataFile.getDataTable() != null && (!this.noVarHeader())) {
if (dataFile.getContentType() != null
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -225,7 +225,8 @@ public void open(DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -142,7 +142,8 @@ public void open(DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
Expand Down
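
The five `open()` hunks above all add the same extra clause to the same condition. Written as a standalone predicate, that condition looks roughly like the sketch below. The class and method names are hypothetical, and the boolean parameters stand in for the `dataFile` and `DataTable` calls used in the real code.

```java
// Hypothetical restatement of the repeated condition; not code from this pull request.
final class HeaderOnOpenSketch {

    // True only when the storage driver should generate the variable header
    // and prepend it to the stream it serves.
    static boolean shouldGenerateHeader(String contentType,
                                        boolean isTabularData,
                                        boolean hasDataTable,
                                        boolean noVarHeaderRequested,
                                        boolean storedWithVariableHeader) {
        return "text/tab-separated-values".equals(contentType) // null-safe comparison
                && isTabularData
                && hasDataTable
                && !noVarHeaderRequested
                && !storedWithVariableHeader; // the clause added in this change
    }
}
```

The only behavioral change is the last clause: a file that already carries its header in storage never gets a second, generated one.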