SQL performance in feed_model #93

Closed · henols opened this issue Oct 10, 2012 · 4 comments

henols commented Oct 10, 2012

Background:
Creating a db query is often more costly than executing a more complex query.
The strategy in emoncms seems to be creating many db queries instead of doing complex ones, and that kills performance. On a fast system it may not be noticeable, but on slow boxes, and I guess on a Raspberry Pi, it is very noticeable.
Performance is a must, and to keep emoncms in the pole position this has to be done right...

The function get_user_feeds creates 1 + nr_of_feeds * 2 queries and can be done in 1; this function seems to be hit every 10 seconds per open page:

function get_user_feeds($userid, $status)
{
    // One query: join the feed relations with information_schema so the
    // size of each feed table comes back in the same pass
    $result = db_query("SELECT id, name, tag, time, value, DATA_LENGTH + INDEX_LENGTH size, datatype, status ".
    "FROM feeds f, feed_relation r, information_schema.tables i ".
    "WHERE userid = '$userid' and r.feedid = f.id and status = $status ".
    "and table_name = concat('feed_' , f.id) and table_schema = DATABASE()");

    $feeds = array();

    if ($result) {
        while ($row = db_fetch_array($result)) {
            $feeds[] = array($row['id'], $row['name'], $row['tag'], strtotime($row['time'])*1000, $row['value'], $row['size'], $row['datatype'], $row['status']);
        }
    }

    usort($feeds, 'compare');       // Sort feeds by tag
    return $feeds;
}

It's the same with get_user_feeds_size, and it can be taken down to 2 queries:

function get_user_feeds_size($userid)
{
    $result = db_query("SELECT * FROM feed_relation WHERE userid = '$userid'");
    $total = 0;
    if ($result) {
        // Build a "table_name = 'feed_N' or ..." filter for all of the user's feeds
        $feeds = "";
        while ($row = db_fetch_array($result)) {
            if (!empty($feeds)) {
                $feeds .= " or";
            }
            $feeds .= " table_name = 'feed_".$row['feedid']."'";
        }
        if (!empty($feeds)) {
            // Second query: sum the data and index sizes of all feed tables at once
            $result = db_query("SELECT SUM(DATA_LENGTH + INDEX_LENGTH) TOTAL_LENGTH FROM information_schema.tables WHERE (".$feeds.") and table_schema = DATABASE()");
            $row = db_fetch_array($result);
            $total = $row['TOTAL_LENGTH'];
        }
    }

    return $total;
}
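
For what it's worth, the same information_schema join used in get_user_feeds above could take this down to a single query. A rough, untested sketch (the function name is invented for illustration):

function get_user_feeds_size_single($userid)
{
    // One query: join feed_relation directly against information_schema
    $result = db_query("SELECT SUM(DATA_LENGTH + INDEX_LENGTH) TOTAL_LENGTH ".
    "FROM feed_relation r, information_schema.tables i ".
    "WHERE r.userid = '$userid' ".
    "and table_name = concat('feed_', r.feedid) and table_schema = DATABASE()");
    $row = db_fetch_array($result);
    return empty($row['TOTAL_LENGTH']) ? 0 : $row['TOTAL_LENGTH'];
}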

The function get_feed_data creates 1000 queries per feed if you display a multi-graph chart!
It can be reduced to 1 query. It's not doing 100% the same thing, but in my opinion it is more correct to get an average value within each time slot:

function get_feed_data($feedid, $start, $end, $dp)
{
    if ($end == 0) $end = time()*1000;

    $feedname = "feed_".trim($feedid);
    $start = $start/1000; $end = $end/1000;   // incoming times are in milliseconds

    $data = array();
    if ($dp <= 0) {
        $dp = 1;
    }

    $range = $end - $start;
    $td = $range / $dp;   // width of one time slot

    // One query: bucket the rows into $dp slots and average time and data per slot
    $result = db_query("SELECT convert(avg(time), UNSIGNED) time, avg(data) data FROM $feedname WHERE `time` > $start AND `time` < $end group by convert(time / $td, UNSIGNED) order by time asc");
    while ($row = db_fetch_array($result)) {
        $data[] = array($row['time'] * 1000, $row['data']);
    }
    return $data;
}

There is much more to be done, and I will create issues when I find show stoppers.

@TrystanLea (Member) commented

Hello Henols

Thanks a lot for your insight. I can see the benefit of the first two; it would be interesting to see the timing difference.

I did quite a bit of work on establishing the best method for get_feed_data, of which creating 1000s of queries actually came out best.

Here's the blog post I did detailing the investigation: http://openenergymonitor.blogspot.co.uk/2012/04/speeding-up-emoncms-feed-data-requests.html

Do you think your query will be faster at selecting 400 datapoints over months of 5s time series data?

I was wondering if creating many queries was an issue, hence my initial use of a MySQL-only resolution query for get_feed_data, but after the test it seemed that, as long as the data was indexed, it was actually faster for certain queries than the alternative.
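
For readers comparing the two approaches: the "many queries" method is roughly one indexed lookup per requested datapoint. A minimal sketch (not the actual emoncms implementation; the name get_feed_data_stepped and the exact per-slot query are invented here for illustration):

function get_feed_data_stepped($feedid, $start, $end, $dp)
{
    $feedname = "feed_".trim($feedid);
    $start = $start/1000; $end = $end/1000;   // incoming times are in milliseconds
    $td = ($end - $start) / $dp;              // width of one time slot

    $data = array();
    $lasttime = 0;
    for ($i = 0; $i < $dp; $i++) {
        $t = $start + $td * $i;
        // Pick the first stored point at or after the slot boundary;
        // fast when `time` is indexed, but it samples rather than averages
        $result = db_query("SELECT time, data FROM $feedname WHERE `time` >= $t ORDER BY `time` asc LIMIT 1");
        if ($row = db_fetch_array($result)) {
            $time = $row['time'] * 1000;
            if ($time != $lasttime) {   // empty slots would repeat the same point; skip duplicates
                $data[] = array($time, $row['data']);
                $lasttime = $time;
            }
        }
    }
    return $data;
}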

Let's get some timing data on this.

Thanks a lot

Trystan

henols commented Oct 12, 2012

Hi Trystan
I did some timing on the two different approaches, but we have to consider that they aren't doing exactly the same thing.
The single query averages the points within each time slot, while the 1000 queries pick 'random' points, so they may be suited to different purposes. Deviations might show up with the 1000-queries method, but they are 'assimilated' (smoothed out) by the single query.

The test data covers 5 months of a 10-second time series.

1000 queries

millis  days    rows    dp
  638.6     1    400     400
 1549.4     1   1000    1000
  642.9     2    400     400
 1575.5     2   1000    1000
  732.6     3    400     400
 1537.2     3   1000    1000
  652.3     4    400     400
 1585.9     4   1000    1000
  678.5     5    400     400
 1567.7     5   1000    1000

1 query

millis  days    rows    dp
  156.3     1    401     400
  235       1   1001    1000
  250.6     2    401     400
  279.4     2   1001    1000
  360.5     3    401     400
  404.5     3   1001    1000
  637.4     4    401     400
  725.4     4   1001    1000
  945.3     5    401     400
 1052       5   1001    1000

1000 queries

 millis month   rows    dp
  783.9     1    395     400
 1573.5     1    988    1000
  671.7     2    398     400
 1554.4     2    995    1000
  652.9     3    399     400
 1560.2     3    996    1000
  721.6     4    387     400
 1541       4    966    1000
  662.7     5    390     400
 1539.2     5    972    1000

1 query

millis  month   rows    dp
 4546.2     1    396     400
 4380.3     1    989    1000
 5371.1     2    399     400
 5319.3     2    995    1000
 6289.4     3    400     400
 6693.8     3    997    1000
 8479.5     4    389     400
 8859.7     4    966    1000
 8099.7     5    390     400
 8173.7     5    974    1000

The single query performs faster when the data series contains fewer than about 45000 records.

Maybe there should be an option to choose how the data is collected, or a query could first check the number of rows and then decide which approach to use.
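
A minimal sketch of that row-count check (illustration only, building on the helpers above; get_feed_data_stepped refers to the many-queries sketch earlier in this thread, and 45000 is the crossover observed in the timings):

function get_feed_data_auto($feedid, $start, $end, $dp)
{
    $feedname = "feed_".trim($feedid);
    $s = $start/1000; $e = $end/1000;   // feed tables store seconds

    // Cheap count first, then pick the strategy
    $result = db_query("SELECT COUNT(*) cnt FROM $feedname WHERE `time` > $s AND `time` < $e");
    $row = db_fetch_array($result);

    if ($row && $row['cnt'] < 45000) {                           // crossover point from the timings above
        return get_feed_data($feedid, $start, $end, $dp);        // single averaging query
    }
    return get_feed_data_stepped($feedid, $start, $end, $dp);    // one query per datapoint
}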

Henrik

henols commented Oct 14, 2012

Forgot to add the JavaScript that I did the timings with.
Just change the apiKey and url and hopefully it will work out of the box.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <meta http-equiv="content-script-type" content="text/javascript">
    <script language="javascript" type="text/javascript" src="/emon/Includes/Vis/flot/jquery.js"></script>
  </head>
  <body>
    <script type="text/javascript">
var url = "/emon/feed/data.json";
var apiKey ="33c23d935b49746f8b5efe3e2a9d3763";
var id = 3;

$(function() {
    var end = ((new Date()).getTime());
    document.write("<table><tr><th>millis</th><th>month</th><th>rows</th><th>dp</th></tr>");

    callApi(1,end, 400);
    callApi(1,end, 1000);
    callApi(2,end, 400);
    callApi(2,end, 1000);
    callApi(3,end, 400);
    callApi(3,end, 1000);
    callApi(4,end, 400);
    callApi(4,end, 1000);
    callApi(5,end, 400);
    callApi(5,end, 1000);

    document.write("</table>");
});

function callApi(month, end, dp){
    var start = end - 3600000 * 24 * 30 * month; 
    var payLoad = buildPayLoad(apiKey, id, start, end, dp);
    var totalTime = 0;
    var dataLength = 0;
    for (var i = 1; i <= 10; i++) {
        var startCall = ((new Date()).getTime());
        $.ajax({
            url: url,
            data: payLoad,
            async: false,
            type: 'GET',
            dataType: 'json', 
            success: function(dataVal){
                totalTime += (new Date()).getTime() - startCall;
                dataLength = dataVal.length;
            }
        });
    }
    document.write("<tr><td align='right'>"+ (totalTime/10)+ "</td><td align='right'>"+month+"</td><td  align='right'>"+dataLength+"</td><td align='right'>"+dp+"</td></tr>");
}

function buildPayLoad(apiKey, id, start, end, dp){
    var apiKeyData ="&apikey="+apiKey;
    var idData = "&id=" + id;
    var endData = "&end=" + end;
    var startData = "&start=" + start;
    var dpData = "&dp=" + dp;
    return apiKeyData+idData+startData+endData+dpData;
}
    </script>
  </body>
</html>

Henrik

@TrystanLea (Member) commented

Hello Henols, that's fantastic!! A definite improvement in the below-45000-records range, as you say. Coupled with averaging being a technically more correct implementation, great! Let's implement it this way.
