
Updating RDDs incrementally #79

Open
ashrowty opened this issue Oct 25, 2014 · 15 comments

@ashrowty

Hi all,

I'm exploring the use of Spark as a query engine for an OLAP-style dashboard, with Cassandra as the persistent store. How do I update the RDDs as the data on disk is constantly being refreshed? The dashboard needs to show data that is up to the minute. Is this an appropriate use case for Spark if the underlying data is changing?

Thanks,
Ashish

@velvia
Contributor

velvia commented Oct 25, 2014

I would say Spark tends not to be as good for constantly changing data. You can use Spark to repeatedly query Cassandra, but the latencies might be pretty high.

-Evan
"Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead

On Oct 25, 2014, at 3:53 PM, Ashish notifications@github.com wrote:

Hi all,

Trying to explore use of Spark as a query engine for an OLAP style dashboard. Planning on the persistent store being Cassandra. How do I update the RDDs as the data is constantly refreshing on disk? The UI dashboard needs to show data upto the minute. Is this an appropriate use case for Spark if the underlying data is changing?

Thanks,
Ashish


Reply to this email directly or view it on GitHub.

@ashrowty
Author

Thanks for your response, Evan. Any suggestions for an alternative architecture? One option would be to design an OLAP-style schema in Cassandra itself and query it directly. Parquet also looks quite interesting. How do you guys do real-time dashboards at Ooyala?

-Ashish

@epugh

epugh commented Oct 27, 2014

Out of curiosity, isn't this what Spark Streaming is for? I haven't done much with it yet, but I would look at it as an option.

When we've had to do more real-time analytics, we've used Spark to do the heavy lifting and then either CQL or Solr for the more “realtime” querying. Yes, you still have a bit of latency between when the data comes in and when the Spark job finishes, but the rest of the interface, based on CQL or Solr, is blazingly fast.

When folks say “real time”, sometimes what they mean is that they want to explore the data without running a batch job just to refresh the screen!


@velvia
Contributor

velvia commented Oct 28, 2014

I think @epugh is right. One approach is to use a streaming processor to compute predefined answers, then pull them out directly using K-V lookups in Cassandra.

An intermediate option is to use Elasticsearch / Solr, which can update search indexes and allow you to do richer queries than pure K-V lookups.

I'm guessing, @ashrowty, that you want Spark/OLAP to let the user do rich queries in "real time" on changing data, right? That was never the case at Ooyala, where everything was append-only aggregates. Changing data is much harder.

Another idea: use Spark Streaming to process incoming data, reducing it to as small an aggregate table as you can; the delta aggregates then update a single-node, in-memory database (H2, HSQLDB), or something like Vertica if you want OLAP.
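A minimal sketch of that last idea, as a Spark Streaming job. The socket text source, the 60-second batch interval, the "dimensionKey,count" line format, and the driver-side upsert target are all assumptions for illustration, not a prescribed design:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (not needed on Spark 1.3+)

object DeltaAggregates {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("delta-aggregates"), Seconds(60))

    // Hypothetical source: lines of "dimensionKey,count" arriving on a socket.
    val deltas = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(key, n) = line.split(",")
        (key, n.toLong)
      }
      .reduceByKey(_ + _) // one row per dimension key per batch

    // Each batch's delta table is tiny, so it can be collected on the driver
    // and upserted into a small store (H2, HSQLDB, a Cassandra K-V table, ...).
    deltas.foreachRDD { rdd =>
      rdd.collect().foreach { case (key, count) =>
        println(s"delta: $key -> $count") // replace with the hypothetical upsert
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```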

@ashrowty
Author

@velvia - I should correct my earlier statement: I am not truly looking for 'changing' data or updates. These are mostly append-only events. Essentially I am looking to display a dashboard that shows counts and percentages of impressions by various dimensions and hierarchies.
My thought was to use Spark Streaming to denormalize data on its way in, store it in Cassandra, and then use Spark to provide fast query capabilities on that data, using JobServer to service incoming dashboard requests. I think this could work; however, data is coming in constantly and the RDDs need to be updated to show the latest counts. One thought is to periodically load new RDDs every hour or so and release the older ones. My data size is not very big (at this point); I'm looking at approximately 1B rows a year.
I am trying to avoid pre-computing as much as possible, but I realize that I may need to in the longer term.
@epugh - CQL does not support GROUP BY or aggregates, so that kind of rules it out; I am looking at Solr as well.
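A rough sketch of the hourly reload-and-swap idea mentioned above, using the spark-cassandra-connector in a long-running context (as spark-jobserver provides). The `analytics.impressions` keyspace/table and the refresh trigger are assumptions for illustration:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector._ // spark-cassandra-connector

// Holds the currently cached snapshot and swaps in a fresh one on demand
// (e.g. from a timer, or a job running in a shared spark-jobserver context).
class RefreshableImpressions(sc: SparkContext) {
  @volatile private var current: RDD[CassandraRow] = load()

  private def load(): RDD[CassandraRow] = {
    val rdd = sc.cassandraTable("analytics", "impressions").cache()
    rdd.count() // force materialization of the new cache
    rdd
  }

  def rdd: RDD[CassandraRow] = current // queries read whatever snapshot is current

  def refresh(): Unit = {
    val fresh = load()
    val stale = current
    current = fresh
    stale.unpersist(blocking = false) // release the old snapshot
  }
}
```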

@epugh

epugh commented Oct 28, 2014

I'm doing a project with DSE and using Spark. I went looking for the REST service for submitting jobs with DSE and didn't find anything. And it kind of makes sense, right? Spark is built around workers, and they are the limited resource. You don't have guaranteed resources, because you submit the job and then it's picked up and run by the Spark master.

So if you are looking for the real-time responsiveness of a RESTful service, then you really need to look at how you partition out your workers and make sure you have enough workers to handle the number of jobs being submitted.

@ashrowty, the inverted index data format that Lucene (which underpins Solr) uses makes grouping and aggregations very fast. Look at some of the grouping, facet, and pivot facet functions that Solr provides.

Eric
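For concreteness, a facet/pivot query of the kind mentioned above might look like the sketch below. The Solr URL, the `impressions` core, and the `campaign`/`region` fields are made up for illustration:

```scala
import scala.io.Source

object SolrFacetExample extends App {
  // Counts grouped by campaign, plus a campaign -> region pivot, without
  // returning any documents (rows=0). Core and field names are hypothetical.
  val url = "http://localhost:8983/solr/impressions/select" +
    "?q=*:*&rows=0&wt=json" +
    "&facet=true&facet.field=campaign" +
    "&facet.pivot=campaign,region"

  val response = Source.fromURL(url).mkString
  println(response) // JSON containing facet_counts / facet_pivot sections
}
```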


@smagadi

smagadi commented Oct 1, 2015

Try using Elasticsearch.

@epugh

epugh commented Oct 1, 2015

@smagadi I agree that Elasticsearch is really good for streaming aggregations; that has been very much their focus with all the work on Kibana and Logstash.

Having said that, I find that at times you need a true computational engine, like Spark, to roll up your data. For example, let's say you are doing an Internet of Things type of application and you are picking up millions of events, each covering a very small increment of time. Yes, you could put all of that into ES, but you are really going to tax the inverted index. Instead, dump the events into something that is really meant for lots of writes (Cassandra, or another write-oriented system), and then run your streaming Spark over them to aggregate to a point where a search-based interface can easily work over them. Plus, if you want to do more sophisticated analytics, like Complex Event Processing, then you need Spark.

If you didn't want something like Cassandra at all, you could use Spark to recalculate data (http://opensourceconnections.com/blog/2015/09/24/using-dse-to-run-solr-rdd/ or http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html) over a search index as well, but it still means you are introducing Spark and need a way to run it.
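A rough sketch of the "land events in a write-oriented store, roll them up later" pattern, using the spark-cassandra-connector's streaming support. The event layout, socket source, `iot.raw_events` table, and Cassandra host are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._            // connector implicits
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

case class Event(deviceId: String, ts: Long, value: Double)

object IngestEvents {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ingest")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical host
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical feed of "deviceId,timestamp,value" lines; each micro-batch
    // is written straight to a write-optimized table, and a separate Spark
    // job can aggregate it later for the search-facing layer.
    ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(id, ts, v) = line.split(",")
        Event(id, ts.toLong, v.toDouble)
      }
      .saveToCassandra("iot", "raw_events")

    ssc.start()
    ssc.awaitTermination()
  }
}
```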

@ashrowty
Author

ashrowty commented Oct 1, 2015

I ended up implementing a Spark batch job for now that does the heavy lifting of rollups and segment calculations, and then using Solr to do the basic slicing and dicing on the summarized data. Going into production in the next couple of weeks.

The current job takes a long time to complete since I am pre-generating the output space for all possible permutations and combinations. When I have some time I want to try the JobServer to do a hybrid approach, where some of the calculations are pre-done but others are real time. That way the system only needs to compute results for specific user requests.

Spark Streaming is interesting but probably won't work, since the calculations require a full year's worth of data to compute segments and then aggregates ... unless I can find a way to do incremental updates.
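One possible shortcut for the "all permutations and combinations" step: Spark 1.4+ DataFrames have `cube()`, which generates the whole rollup space in one pass. A minimal sketch, where the input path and column names are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object PrecomputeRollups {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rollups"))
    val sqlContext = new SQLContext(sc)

    val impressions = sqlContext.read.parquet("/data/impressions") // hypothetical input

    // cube() emits every combination of the listed dimensions (including the
    // grand total), so the output space does not have to be enumerated by hand.
    val rollups = impressions
      .cube("campaign", "region", "device")
      .agg(count(lit(1)).as("impressions"), sum("revenue").as("revenue"))

    rollups.write.mode("overwrite").parquet("/data/impression_rollups")
  }
}
```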


@velvia
Contributor

velvia commented Oct 1, 2015

Since we are discussing streaming aggregations, I'd like to take the opportunity to point out my other OSS

-Evan
"Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead

On Oct 1, 2015, at 6:11 AM, Ashish notifications@github.com wrote:

I ended up implementing a Spark batch job for now that does the heavy
lifting of rollups and segment calculations and then using Solr to do the
basic slicing and dicing on the summarized data. Going into production in
the next couple of weeks.

The current job takes a long time to complete since I am pre-generating the
output space for all possible permutations and combinations. When I have
some time I want to try the JobServer to do a hybrid approach ... where
some of the calcs are predone but others are real time. That way the system
only needs to compute results for specific user requests.

Spark streaming is interesting but probably won't work since the
calculations require a full 1 year worth of data to compute segments and
then aggregates ... unless I can find a way to do incremental updates.

On Thu, Oct 1, 2015 at 7:49 AM Eric Pugh notifications@github.com wrote:

@smagadi https://github.com/smagadi I agree that Elasticsearch is
really good for streaming aggregations, that has been very much their
focus.

Having said that, I find that at times you need a true computational
engine, like Spark, to roll up your data. For example, let's say you are
doing an internet of things type application, and you are picking up
millions of events, each of them very small increments of time. Yes, you
could put all that into ES, but you are really going to tax the inverted
index. Instead, dump the events into something that is really mean for lots
of writes [Cassandra, or other write oriented system], and then run your
streaming Spark over them to aggregate to a point where a search based
interface easily parses over them. Plus, if you want to do more
sophisticated analytics, like Complex Event Processing, then you need
Spark.

If you didn't want something like Cassandra at all, you could use Spark to
recalculate data (
http://opensourceconnections.com/blog/2015/09/24/using-dse-to-run-solr-rdd/
or
http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html)
over a search index as well, but it stills means you are introducing Spark
and need a way to run it.


Reply to this email directly or view it on GitHub
#79 (comment)
.


Reply to this email directly or view it on GitHub.

@velvia
Contributor

velvia commented Oct 1, 2015

Oops, didn't finish my thought. I'd like to point out my other OSS

-Evan
"Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead

On Oct 1, 2015, at 6:11 AM, Ashish notifications@github.com wrote:

I ended up implementing a Spark batch job for now that does the heavy
lifting of rollups and segment calculations and then using Solr to do the
basic slicing and dicing on the summarized data. Going into production in
the next couple of weeks.

The current job takes a long time to complete since I am pre-generating the
output space for all possible permutations and combinations. When I have
some time I want to try the JobServer to do a hybrid approach ... where
some of the calcs are predone but others are real time. That way the system
only needs to compute results for specific user requests.

Spark streaming is interesting but probably won't work since the
calculations require a full 1 year worth of data to compute segments and
then aggregates ... unless I can find a way to do incremental updates.

On Thu, Oct 1, 2015 at 7:49 AM Eric Pugh notifications@github.com wrote:

@smagadi https://github.com/smagadi I agree that Elasticsearch is
really good for streaming aggregations, that has been very much their
focus.

Having said that, I find that at times you need a true computational
engine, like Spark, to roll up your data. For example, let's say you are
doing an internet of things type application, and you are picking up
millions of events, each of them very small increments of time. Yes, you
could put all that into ES, but you are really going to tax the inverted
index. Instead, dump the events into something that is really mean for lots
of writes [Cassandra, or other write oriented system], and then run your
streaming Spark over them to aggregate to a point where a search based
interface easily parses over them. Plus, if you want to do more
sophisticated analytics, like Complex Event Processing, then you need
Spark.

If you didn't want something like Cassandra at all, you could use Spark to
recalculate data (
http://opensourceconnections.com/blog/2015/09/24/using-dse-to-run-solr-rdd/
or
http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html)
over a search index as well, but it stills means you are introducing Spark
and need a way to run it.


Reply to this email directly or view it on GitHub
#79 (comment)
.


Reply to this email directly or view it on GitHub.

@velvia
Contributor

velvia commented Oct 1, 2015

Check out my other OSS project, FiloDB (GitHub.com/TupleJump/FiloDB). It allows you to dump event data into Cassandra and query it in an ad hoc way using Spark very efficiently, at Parquet performance levels. Sorry for the other emails, and thanks for the plug :)

-Evan
"Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead

On Oct 1, 2015, at 6:11 AM, Ashish notifications@github.com wrote:

I ended up implementing a Spark batch job for now that does the heavy
lifting of rollups and segment calculations and then using Solr to do the
basic slicing and dicing on the summarized data. Going into production in
the next couple of weeks.

The current job takes a long time to complete since I am pre-generating the
output space for all possible permutations and combinations. When I have
some time I want to try the JobServer to do a hybrid approach ... where
some of the calcs are predone but others are real time. That way the system
only needs to compute results for specific user requests.

Spark streaming is interesting but probably won't work since the
calculations require a full 1 year worth of data to compute segments and
then aggregates ... unless I can find a way to do incremental updates.

On Thu, Oct 1, 2015 at 7:49 AM Eric Pugh notifications@github.com wrote:

@smagadi https://github.com/smagadi I agree that Elasticsearch is
really good for streaming aggregations, that has been very much their
focus.

Having said that, I find that at times you need a true computational
engine, like Spark, to roll up your data. For example, let's say you are
doing an internet of things type application, and you are picking up
millions of events, each of them very small increments of time. Yes, you
could put all that into ES, but you are really going to tax the inverted
index. Instead, dump the events into something that is really mean for lots
of writes [Cassandra, or other write oriented system], and then run your
streaming Spark over them to aggregate to a point where a search based
interface easily parses over them. Plus, if you want to do more
sophisticated analytics, like Complex Event Processing, then you need
Spark.

If you didn't want something like Cassandra at all, you could use Spark to
recalculate data (
http://opensourceconnections.com/blog/2015/09/24/using-dse-to-run-solr-rdd/
or
http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html)
over a search index as well, but it stills means you are introducing Spark
and need a way to run it.


Reply to this email directly or view it on GitHub
#79 (comment)
.


Reply to this email directly or view it on GitHub.

@Prasidhdh

I am working with Spark to get refreshed data. I have source data in the form of MySQL tables, and I am using a JDBC connection to load that data into Spark. I then do some filtering and mapping, and finally a join between two RDDs, and then apply an action on the joined RDD. Next, my program sleeps (Thread.sleep) for some period of time, during which I delete data from the original source tables and insert new data. After the sleep, I call an action on the same joined RDD again, so it is a second job on the same RDDs.

As a result, I need the merged data (the joined result from before the sleep plus the joined result from after it), so I can see both the deleted data and the newly inserted data.

In this case, Spark uses the sort shuffle mechanism and internally creates data files and index files for each shuffle map task. During the second job, when I am getting new data, I want to merge or append this new data to the previously stored data file. I tried appending the new data to the data file, but I do not have read or write access to the previously created data and index files. So how can I merge the results of both joins?

Please help me to solve my problem.

Thanks in advance!

@velvia
Contributor

velvia commented May 24, 2016

In general you don't have access to Spark's internal shuffle files. You can, however, persist the result of the first join (in HDFS, in memory, etc.) and append the results of the second join, or combine them however you wish. Does this not work?

Also, be aware that you are on the Ooyala job server repo, but the project now lives at spark-jobserver/spark-jobserver.
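A minimal sketch of that suggestion: cache (or otherwise persist) the first join's result before the source tables change, then union it with the second join. This is generic RDD code, not a spark-jobserver API; the helper name is made up for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Materialize the first join while the original MySQL data still exists,
// then merge it with the result of the second join.
def snapshotAndMerge[K, V](firstJoin: RDD[(K, V)], secondJoin: RDD[(K, V)]): RDD[(K, V)] = {
  val snapshot = firstJoin.persist(StorageLevel.MEMORY_AND_DISK)
  snapshot.count()           // force evaluation before the source tables change
  snapshot.union(secondJoin) // combined view of both runs
}
```

If the snapshot needs to survive the driver, writing it out explicitly (for example with `saveAsObjectFile` to HDFS) and reading it back would work the same way.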


@Prasidhdh

@velvia Thank you for your response. But I want to do it internally, by merging the data files.
