Investigate Data Lake integration #134

Open

isaacabraham opened this issue Jan 15, 2016 · 7 comments
@isaacabraham
Contributor

https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-overview/

Data Lake is Microsoft's new "big data" store, with auto-scaling, HDFS support and so on. It has several components, such as Data Lake Analytics (i.e. U-SQL), but we should look into the possibility of hooking an MBrace cluster up to the Data Lake store in addition to plain blob storage.

@isaacabraham changed the title from "Data Lake compatibility" to "Investigate Data Lake integration" on Jan 15, 2016
@mathias-brandewinder

Isaac: agreed. To frame it differently, having a 'real big data' example would be great. What do you think would be the best way to approach that? Perhaps work through an example?

@isaacabraham
Contributor Author

That's a good question :-) I'm speaking to the Data Lake guys to see what the story is for plugging third-party components onto the Data Lake store (MBrace has CloudFlow, so there's no need for the U-SQL side of things). I do think the HDFS side is worth spending some time on, though. cc: @palladin @dsyme @eiriktsarpalis
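To make the CloudFlow point concrete, here's a minimal sketch of the kind of query-style job that would otherwise push us towards U-SQL. It assumes a connected MBrace.Azure cluster handle and the `CloudFlow.OfCloudFileByLine` combinator from MBrace.Flow; the file path and record layout are purely illustrative.

```fsharp
// Hedged sketch: a simple aggregation over a file already in the cluster's store,
// expressed with CloudFlow rather than U-SQL. Path and column layout are made up.
open MBrace.Core
open MBrace.Flow

let countBySeverity (cluster : MBrace.Azure.AzureCluster) =
    CloudFlow.OfCloudFileByLine "/container/logs/2016-01-15.log"          // hypothetical path
    |> CloudFlow.filter (fun line -> not (System.String.IsNullOrWhiteSpace line))
    |> CloudFlow.map (fun line -> line.Split('\t').[0])                   // assume first column is severity
    |> CloudFlow.countBy id
    |> CloudFlow.toArray
    |> cluster.Run
```

The interesting part of this issue is then really the ingestion step: today that path has to live in the cluster's own blob store, and Data Lake / WASB / HDFS support would widen what it can point at.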

@mathias-brandewinder

Totally agree on the HDFS side. Having guidance / a story on how to work against 'stuff in HDFS' would be awesome.

@palladin
Member

What kind of HDFS support do you have in mind? Because AFAIK HDInsight's HDFS acts as an access interface over blob storage.

@isaacabraham
Contributor Author

@palladin there's definitely a WASB / HDFS interop story that lets HDFS talk to blob storage without realising it. I'm talking about the other side of the fence, i.e. letting people access resources in MBrace via HDFS or WASB. Currently the mechanism for accessing blobs in MBrace has two problems:

  1. One storage account only.
  2. Limited / somewhat inconsistent way to navigate to blobs.

Support for WASB addressing would allow us to address blobs consistently, particularly when connecting multiple storage accounts. IMHO this is a really important feature because it lets us create clusters and perform data analysis on other storage accounts (e.g. customer data) without writing MBrace-specific data to their storage account.
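For reference, a WASB address carries the account, the container and the blob path in a single URI, which is what makes consistent multi-account addressing possible. A rough, self-contained sketch of pulling those parts out (the helper is hypothetical, not an existing MBrace API):

```fsharp
// Hedged sketch: decomposing a wasb:// URI into storage account, container and blob path.
// The wasb addressing format is the standard Azure/HDInsight one; the helper itself is made up.
open System

let parseWasb (uri : string) =
    let u = Uri uri                          // e.g. wasb://container@account.blob.core.windows.net/data/file.csv
    let container = u.UserInfo               // the part before '@'
    let account = u.Host.Split('.').[0]      // "account" out of "account.blob.core.windows.net"
    let path = u.AbsolutePath.TrimStart('/')
    account, container, path

// parseWasb "wasb://input@contosodata.blob.core.windows.net/customers/january/orders.csv"
// --> ("contosodata", "input", "customers/january/orders.csv")
```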

For me this issue is about looking at Data Lake integration - one way is via WASB, another is HDFS, and another is Data Lake's own ADL naming format.

The HDFS part would be interesting if you have, say, an HDFS cluster running somewhere with data on it - can we access it? Maybe that's another issue (there is definitely one either here or on MBrace.Core about this) - but the ability to index files based on wildcard paths such as "data/customers/january/*" using HDFS notation, rather than explicitly providing a list of files to operate over, would be great.
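To illustrate the wildcard idea, here's a small, self-contained sketch of expanding a glob pattern over a known list of paths. It's not an existing MBrace API; a real implementation would enumerate the store itself rather than take a list.

```fsharp
// Hedged sketch: HDFS-style wildcard matching over path strings. Illustrative only.
open System.Text.RegularExpressions

/// Match a single-segment wildcard pattern ('*' does not cross '/' boundaries).
let matchesGlob (pattern : string) (path : string) =
    let regex = "^" + Regex.Escape(pattern).Replace(@"\*", "[^/]*").Replace(@"\?", ".") + "$"
    Regex.IsMatch(path, regex)

let expandGlob (pattern : string) (paths : string list) =
    paths |> List.filter (matchesGlob pattern)

// expandGlob "data/customers/january/*"
//     [ "data/customers/january/orders.csv"; "data/customers/february/orders.csv" ]
// --> [ "data/customers/january/orders.csv" ]
```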

@palladin
Member

@isaacabraham Support for wasb URLs is certainly useful, and Eirik actually has some ideas for multiple storage-account management that would enable the account-resolution part of wasb URLs.
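As a sketch of what that resolution might look like (purely hypothetical, not a committed design): keep a registry of connection strings keyed by storage account name, and look the account up from the host portion of the wasb URI.

```fsharp
// Hedged, hypothetical sketch: resolving the account part of a wasb URI against
// a registry of known storage accounts. Not an existing MBrace feature.
open System

type StorageAccountRegistry = Map<string, string>     // account name -> connection string

let resolveAccount (registry : StorageAccountRegistry) (wasbUri : string) : string option =
    let account = (Uri wasbUri).Host.Split('.').[0]    // "account" from account.blob.core.windows.net
    registry.TryFind account

// let registry = Map.ofList [ "contosodata", "DefaultEndpointsProtocol=https;AccountName=contosodata;..." ]
// resolveAccount registry "wasb://input@contosodata.blob.core.windows.net/orders.csv"
```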

@isaacabraham
Contributor Author

That's great. @eiriktsarpalis and I have had some chats about this. Whilst we're on the subject, something that would be valuable is the ability to clearly segregate MBrace's internal Store data from user data access. I'm thinking here of running MBrace on Service Fabric, which has built-in support for local, replicated state across a cluster. That could be a perfect fit for MBrace's internal state and could bring large performance benefits for things like persisted cloud flows.
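As a purely hypothetical sketch of the separation being suggested (the names are illustrative, not MBrace's actual store abstractions): the runtime's internal state and user-facing data access could sit behind distinct capabilities, so a Service Fabric-backed replicated store could satisfy the former without ever touching the customer's storage account.

```fsharp
// Hedged, hypothetical sketch of segregating runtime state from user data access.
// MBrace's real store abstractions differ; this only illustrates the shape of the split.
type IRuntimeStateStore =
    /// Persist internal runtime state (e.g. persisted CloudFlow metadata), keyed by id.
    abstract WriteState : key:string * payload:byte[] -> Async<unit>
    abstract ReadState  : key:string -> Async<byte[] option>

type IUserDataStore =
    /// Read user-owned data addressed by a full path or wasb-style URI.
    abstract ReadAllBytes : path:string -> Async<byte[]>
    abstract Enumerate    : directory:string -> Async<string list>

// A Service Fabric reliable-collection implementation could back IRuntimeStateStore,
// while IUserDataStore keeps pointing at whichever blob storage account holds the data.
```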
