Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-19204 Spark-HPCC class(es) to store data on an RDD and a Dataframe to THOR #28

Merged
merged 1 commit into from
Oct 5, 2018

Conversation

jpmcmu
Copy link
Contributor

@jpmcmu jpmcmu commented Sep 21, 2018

@rpastrana please review

@jpmcmu jpmcmu changed the title HPCC-19204 WIP HPCC-19204 Spark-HPCC class(es) to store data on an RDD and a Dataframe to THOR Oct 4, 2018
Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu looks good. a few comments.


public class HpccFileWriter implements Serializable {
static private final long serialVersionUID = 1L;
static private final int DefaultExpiryTimeSecs = 300;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the same as the fileaccess expiry, we should consider sharing the default.
HPCCFile will have static private final int DEFAULT_ACCESS_EXPIRY_SECONDS = 120;
Perhaps it belongs in wsclient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we may want to have different expiry times for write vs read. I would imagine read might benefit from longer expiration times. Maybe this is something we can review when we look into access token renewal?

}

public enum HpccCluster {
THOR("thor");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HPCCWsclient getAvailableClusterNames might provide this set of clusters


/**
* HpccFileWriter Constructor
* Attempts to open a connection to the specified HPCC cluster and validate the user.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets mention that the target host and port are eclwatch host and port

* HpccFileWriter Constructor
* Attempts to open a connection to the specified HPCC cluster and validate the user.
* @param connectionString of format {http|https}://{HOST}:{PORT}
* @param user a valid account
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ecwatch account and pass

*/
public HpccFileWriter(String connectionString, String user, String pass) throws Exception {
// Verify connection & password
Pattern connectionRegex = Pattern.compile("(http|https)://([^:]+):([0-9]+)",Pattern.CASE_INSENSITIVE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to compile the regex once for the class? final static?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to make it final but left it in the constructor because it is not used elsewhere in the class at the moment

DFUFileCopyWrapper[] filePartCopies = fileParts[i].getCopies();
if (filePartCopies.length == 0) {
abortFileCreation();
throw new Exception("File creation error: File part: " + i + " does not have an file copies associated with it. Aborting write.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i + 1?

Function2<Integer, Iterator<Row>, Iterator<FilePartWriteResults>> writeFunc =
(Integer partitionIndex, Iterator<Row> it) -> {

InetAddress localAddress = getLocalAddress();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible localAddress.getHostAddress() could be == to the 2nd NIC's address?
if so, perhaps we need a isLocalAddress(String ip) , which iterates over all available NICs?

Copy link
Contributor Author

@jpmcmu jpmcmu Oct 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would mean that the system NIC order changed in between getting the partition map and starting the write. If this were to happen I think it would be better to bail

FilePartWriteResults result = new FilePartWriteResults();

// Make the parent dir if it does not exist
String fileWriteDir = filePartPath.substring(0,filePartPath.lastIndexOf('/'));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using String separator =System.getProperty("path.separator");


// calculated size includes the data size field & and padding byte if present
// We want only the data size of the child array
dataSetSize -= 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's create a final static int paddingsize = 4


this.buffer.clear();
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, this might need new line at the end of the file

Copy link
Member

@rpastrana rpastrana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmcmu looks good.

@richardkchapman
Copy link
Member

@jpmcmu Please squash

…me to THOR

- Created HpccFileWriter class that allows RDDs to be written to HPCC

Signed-off-by: James McMullan <James.McMullan@lexisnexis.com>
@jpmcmu
Copy link
Contributor Author

jpmcmu commented Oct 5, 2018

@richardkchapman squashed & ready for merge

@richardkchapman richardkchapman merged commit e194441 into hpcc-systems:master Oct 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants