Updating RDDs incrementally #79
I would say Spark tends to not be as good for constantly changing data. You can use Spark to repeatedly query Cassandra but the latencies might be pretty high. -Evan
Thanks for your response Evan. Any suggestions for an alternate architecture? One option would be to design an OLAP-style schema in Cassandra itself and query it directly. Parquet also looks quite interesting. How do you guys do real-time dashboards at Ooyala? -Ashish
Out of curiosity, isn’t that what Spark Streaming would be for? I haven’t done much there yet, but I would look at it as an option. When we’ve had to do more real-time analytics, we’ve used Spark to do the heavy lifting, and then either CQL or Solr for more “realtime” querying. Yes, you still have a bit of latency between when the data comes in and when the Spark job finishes, but the rest of the interface, based on CQL or Solr, is blazingly fast. When folks say “real time”, sometimes what they mean is that they want to explore the data without running a batch job just to refresh the screen! On Oct 25, 2014, at 11:46 PM, Ashish notifications@github.com wrote:
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy |
I think @epugh is right. One approach is to use a streaming processor to compute predefined answers, then pull them out directly with K-V lookups in Cassandra. An intermediate option is ElasticSearch / Solr, which can update search indexes and allows richer queries than pure K-V. I'm guessing @ashrowty you want Spark/OLAP to let the user do rich queries in "real time" on changing data, right? This was never the case at Ooyala, which was append-only aggregates. Changing data is much harder. Another idea: use Spark / Streaming to process incoming data, reducing it to as small an aggregate table as you can; the delta aggregates then update a single-node, in-memory database (H2, HSql) or something like Vertica, if you want OLAP.
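The delta-aggregate idea above can be sketched without any of the actual systems involved; here a plain Python dict stands in for the in-memory database (H2/HSql), and the small batch dicts stand in for the pre-reduced output of a Spark Streaming micro-batch (all names and data are hypothetical):

```python
from collections import defaultdict

# Stand-in for the single-node in-memory aggregate table (H2/HSql in
# Evan's suggestion): (dimension tuple) -> running count.
agg_table = defaultdict(int)

def apply_delta(delta_batch):
    """Merge a small, pre-reduced delta (e.g. the output of one
    streaming micro-batch) into the running aggregate table."""
    for key, count in delta_batch.items():
        agg_table[key] += count

# Two hypothetical micro-batches of pre-aggregated deltas:
apply_delta({("US", "video_start"): 10, ("DE", "video_start"): 3})
apply_delta({("US", "video_start"): 5})
```

The point of the pattern is that the streaming layer shrinks the data before it ever reaches the query store, so the store only has to absorb small deltas rather than the raw event firehose.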
@velvia - I should correct my earlier statement: I am not truly looking for 'changing' data or updates. These are mostly append-only events. Essentially I am looking to display a dashboard that shows counts and percentages of impressions by various dimensions and hierarchies.
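For what Ashish describes, the per-dimension counts and percentage shares are straightforward to compute once the events are in hand; a minimal stdlib-only sketch, with made-up event fields standing in for the real dimensions:

```python
from collections import Counter

# Hypothetical append-only impression events, each carrying dimensions.
events = [
    {"country": "US", "device": "mobile"},
    {"country": "US", "device": "desktop"},
    {"country": "IN", "device": "mobile"},
    {"country": "US", "device": "mobile"},
]

def breakdown(events, dimension):
    """Counts and percentage share of impressions for one dimension."""
    counts = Counter(e[dimension] for e in events)
    total = sum(counts.values())
    return {k: (n, round(100.0 * n / total, 1)) for k, n in counts.items()}

print(breakdown(events, "country"))  # {'US': (3, 75.0), 'IN': (1, 25.0)}
```

In the thread's proposed architectures, this computation would run in the streaming/batch layer, with only the small result table landing in the serving store.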
I’m doing a project with DSE, and using Spark. I went looking for the REST service for submitting jobs with DSE, and didn’t find anything. And it kind of makes sense, right? Spark is built around workers, and they are the limited resource. You don’t have guaranteed resources, because you submit the job, and then it’s picked up and run by the Spark master. So if you are looking for the real-time response of a RESTful service, then you really need to look at how you partition out your workers, and make sure you have enough workers to meet the number of jobs submitted. @ashrowty, the inverted-index data format of Lucene (which underpins Solr) makes grouping and aggregations very fast. Look at some of the grouping, facet, and pivot facet functions that Solr provides. Eric
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy http://tinyurl.com/eric-cal
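The Solr grouping and faceting Eric mentions are driven by request parameters (`facet`, `facet.field`, and `facet.pivot` are standard Solr query parameters); a small sketch building such a request URL with Python's stdlib, assuming a hypothetical local core named `impressions` with `country` and `device` fields:

```python
from urllib.parse import urlencode

# The facet parameters (q, rows, facet, facet.field, facet.pivot) are
# standard Solr; the core and field names here are made up.
params = urlencode({
    "q": "*:*",
    "rows": 0,                       # only facet counts, no documents
    "facet": "true",
    "facet.field": "country",
    "facet.pivot": "country,device"  # hierarchical counts: country -> device
})
url = "http://localhost:8983/solr/impressions/select?" + params
print(url)
```

Pivot facets are what give the hierarchical drill-down the dashboard use case needs, without running a batch job per screen refresh.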
Try using Elasticsearch.
@smagadi I agree that Elasticsearch is really good for streaming aggregations; that has been very much their focus with all the work on Kibana and Logstash. Having said that, I find that at times you need a true computational engine, like Spark, to roll up your data. For example, let's say you are doing an Internet of Things type application, and you are picking up millions of events, each covering a very small increment of time. Yes, you could put all of that into ES, but you are really going to tax the inverted index. Instead, dump the events into something that is really meant for lots of writes (Cassandra, or another write-oriented system), and then run your streaming Spark job over them to aggregate to a point where a search-based interface easily parses over them. Plus, if you want to do more sophisticated analytics, like Complex Event Processing, then you need Spark. If you didn't want something like Cassandra at all, you could use Spark to recalculate data (http://opensourceconnections.com/blog/2015/09/24/using-dse-to-run-solr-rdd/ or http://chapeau.freevariable.com/2015/04/elasticsearch-and-spark-1-dot-3.html) over a search index as well, but it still means you are introducing Spark and need a way to run it.
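The write-store-then-rollup pattern above can be illustrated without Spark or Cassandra: pre-aggregate tiny events into per-minute documents so the search index absorbs one write per (sensor, minute) instead of millions of raw events. A stdlib-only sketch with made-up event tuples:

```python
from collections import defaultdict

# Hypothetical raw IoT events: (epoch_seconds, sensor_id, value).
events = [
    (1443700000, "s1", 1.0),
    (1443700010, "s1", 3.0),
    (1443700065, "s1", 2.0),
]

def rollup_per_minute(events):
    """Aggregate raw events into per-minute stats. In the thread's
    architecture this step would be a Spark Streaming job running over
    data landed in Cassandra; only the rolled-up documents would be
    indexed for search."""
    buckets = defaultdict(lambda: [0, 0.0])  # (sensor, minute) -> [count, sum]
    for ts, sensor, value in events:
        minute = ts - ts % 60
        b = buckets[(sensor, minute)]
        b[0] += 1
        b[1] += value
    return {k: {"count": c, "avg": s / c} for k, (c, s) in buckets.items()}

rolled = rollup_per_minute(events)
```

The rollup keeps the inverted index small: the index only ever sees summary documents, while the raw firehose stays in the write-optimized store.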
I ended up implementing a Spark batch job for now that does the heavy … The current job takes a long time to complete since I am pre-generating the … Spark streaming is interesting but probably won't work since the … On Thu, Oct 1, 2015 at 7:49 AM Eric Pugh notifications@github.com wrote:
Since we are discussing streaming aggregations, I'd like to take the opportunity to point out my other OSS -Evan
Oops, didn't finish my thought. I'd like to point out my other OSS -Evan
Check out my other OSS project, FiloDB (GitHub.com/TupleJump/FiloDB). It allows you to dump event data into C* (Cassandra) and query it in an ad hoc way using Spark very efficiently, at Parquet-level performance. Sorry for the other emails and thanks for the plug :) -Evan
I am working on Spark to get refreshed data. I have source data in the form of MySQL tables, which I pull into Spark over a JDBC connection. Then I do some filtering and mapping, and finally a join between two RDDs, and apply an action on that joined RDD.

Now, my program sleeps (thread sleep) for some time period, and during that time I delete data from the original source tables and insert new data. After the sleep I call the action on the same joined RDD again, so it is a second job on the same RDDs. As a result, I need the merged data (the joined result from before the sleep plus the joined result from after it) so I can get both the deleted data and the newly inserted data.

In this case Spark uses its sort-shuffle mechanism and internally creates data files and index files for each shuffle map task. During the second job, when I get new data, I want to merge or append this new data to the previously stored data file. I tried to append the new data to the data file, but I do not have read or write access to the previously created data file and index file. So, how can I merge the results of both joins? Please help me solve my problem. Thanks in advance!
In general you don’t have access to Spark’s internal shuffle files. You can, however, persist the result of the first join (in HDFS, in memory, etc.) and append the results of the second join, or combine them however you wish. Does this not work? Also beware: you are on the Ooyala job server mailing list, but the project is now at spark-jobserver/spark-jobserver.
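The persist-and-combine approach can be sketched without Spark at all; plain dicts keyed by row ID stand in for the persisted first and second join results (all data is hypothetical):

```python
# Stand-ins for the persisted outputs of the two join runs: row_id -> payload.
# Rather than touching Spark's internal shuffle files, you persist the
# first result yourself and combine it with the second.
first_run  = {1: "a", 2: "b", 3: "c"}
second_run = {2: "b", 3: "c2", 4: "d"}

def merge_runs(before, after):
    """Combine two keyed snapshots, also surfacing what disappeared
    and what is new between the runs."""
    deleted  = {k: before[k] for k in before.keys() - after.keys()}
    inserted = {k: after[k] for k in after.keys() - before.keys()}
    merged   = {**before, **after}   # the later run wins on overlapping keys
    return merged, deleted, inserted

merged, deleted, inserted = merge_runs(first_run, second_run)
```

In Spark terms, `first_run` would be the first join's result persisted to HDFS or memory, and the merge would be a keyed union (e.g. a full outer join on the row key) rather than any file-level append.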
@velvia Thank you for your response. But I want to do it internally, like merging the data files.
Hi all,
Trying to explore the use of Spark as a query engine for an OLAP-style dashboard, with Cassandra as the persistent store. How do I update the RDDs as the data is constantly refreshing on disk? The UI dashboard needs to show data up to the minute. Is this an appropriate use case for Spark if the underlying data is changing?
Thanks,
Ashish