Feature: Add filter to /messages. Add 'contains_url' to filter. #922

Merged: 5 commits into develop, Jul 20, 2016

erikjohnston (Owner) commented Jul 14, 2016

No description provided.

NegativeMjark commented on the diff Jul 14, 2016

synapse/storage/stream.py
+    )
+    args.extend(event_filter.senders)
+
+    for sender in event_filter.not_senders:
+        clauses.append("sender != ?")
+        args.append(sender)
+
+    if event_filter.rooms:
+        clauses.append(
+            "(%s)" % " OR ".join("room_id = ?" for _ in event_filter.rooms)
+        )
+        args.extend(event_filter.rooms)
+
+    for room_id in event_filter.not_rooms:
+        clauses.append("room_id != ?")
+        args.append(room_id)
NegativeMjark (Contributor) · Jul 14, 2016

Is it worth adding the rooms filter for /messages? Either the room is filtered out by the filter or it isn't.

erikjohnston (Owner) · Jul 14, 2016

I'd rather have this function do the expected thing and filter on room_ids too, so it's usable elsewhere. Separately, I guess we could check that the room_id matches the filter, but even if it doesn't I think the query would be relatively quick.

NegativeMjark (Contributor) · Jul 14, 2016

I guess I'd be impressed if postgres had a hard time optimising room_id = '1' AND room_id != '1', so this seems fine.
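
(Editor's sketch: the clause-building pattern in the diff composes into a WHERE fragment roughly as follows. filter_to_clause and the rooms/not_rooms attributes come from the excerpt; the standalone harness and the example room IDs are hypothetical, and only the room clauses are shown.)

```python
from types import SimpleNamespace


def filter_to_clause(event_filter):
    # Build a SQL WHERE fragment plus bound args from a filter object
    # (sketch of the pattern in the diff; rooms/not_rooms only).
    if not event_filter:
        return "", []

    clauses = []
    args = []

    if event_filter.rooms:
        # One "room_id = ?" per allowed room, OR'd together.
        clauses.append(
            "(%s)" % " OR ".join("room_id = ?" for _ in event_filter.rooms)
        )
        args.extend(event_filter.rooms)

    for room_id in event_filter.not_rooms:
        clauses.append("room_id != ?")
        args.append(room_id)

    # Clauses are AND'd, so excluded rooms always lose.
    return " AND ".join(clauses), args


# Hypothetical filter: include one room, exclude another.
f = SimpleNamespace(rooms=["!a:example.org"], not_rooms=["!b:example.org"])
sql, args = filter_to_clause(f)
# sql: "(room_id = ?) AND room_id != ?"
# args: ["!a:example.org", "!b:example.org"]
```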

NegativeMjark commented on the diff Jul 14, 2016

synapse/storage/schema/delta/33/event_fields.py
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from synapse.storage.prepare_database import get_statements
+
+import logging
+import ujson
+
+logger = logging.getLogger(__name__)
+
+
+ALTER_TABLE = """
+ALTER TABLE events ADD COLUMN sender TEXT;
+ALTER TABLE events ADD COLUMN contains_url BOOLEAN;
NegativeMjark (Contributor) · Jul 14, 2016 (edited)

Maybe add an index on (room_id, contains_url, ...whatever columns the pagination queries can be indexed against)?
It's probably not necessary if people are posting things with URLs frequently enough.

erikjohnston (Owner) · Jul 14, 2016

Maybe, though I'm slightly loath to put too many indices on the events table. I'd also want to check which queries are actually common, as e.g. file pagination would require a (room_id, topological, stream_ordering, contains_url) index.

NegativeMjark (Contributor) · Jul 14, 2016

Don't you mean a (room_id, contains_url, topological, stream_ordering) index?

erikjohnston (Owner) · Jul 14, 2016

Err, yes, quite.
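
(Editor's sketch: the corrected column order matters because the equality columns, room_id and contains_url, must lead the index for the ordering columns to remain usable for pagination. A quick illustration with sqlite3 rather than postgres; the table layout, index name, and column names here are hypothetical simplifications.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (room_id TEXT, contains_url BOOLEAN,"
    " topological_ordering INTEGER, stream_ordering INTEGER)"
)
# Equality columns first, range/ordering columns after, so a lookup on
# (room_id, contains_url) can walk the ordering columns in index order.
conn.execute(
    "CREATE INDEX events_url_idx ON events"
    " (room_id, contains_url, topological_ordering, stream_ordering)"
)
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events"
    " WHERE room_id = ? AND contains_url = ?"
    " ORDER BY topological_ordering, stream_ordering",
    ("!room:example.org", 1),
).fetchall()
# The plan's detail column should show the index being used.
plan_text = " ".join(str(row[-1]) for row in plan)
```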

NegativeMjark (Contributor) commented Jul 14, 2016

Amusingly postgres doesn't optimise out queries with c = 0 AND c != 0.

DROP TABLE IF EXISTS t;
CREATE TABLE t (c INT, a INT);

INSERT INTO t (c, a) SELECT 0, data.* FROM
    (SELECT * FROM generate_series(1,1000000)) AS data;

CREATE INDEX t_c_a ON t USING btree (c, a);

ANALYZE t;

EXPLAIN ANALYSE
    SELECT c, a FROM t WHERE c = 0 AND c != 0;


EXPLAIN ANALYSE
    SELECT c, a FROM t WHERE c = 0 AND false;

SELECT VERSION();

Gives:


DROP TABLE
CREATE TABLE
INSERT 0 1000000
CREATE INDEX
ANALYZE
                                           QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=1 width=8) (actual time=59.416..59.416 rows=0 loops=1)
   Filter: ((c <> 0) AND (c = 0))
   Rows Removed by Filter: 1000000
 Planning time: 0.139 ms
 Execution time: 59.434 ms
(5 rows)

                                       QUERY PLAN                                       
----------------------------------------------------------------------------------------
 Result  (cost=0.00..14425.00 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
   One-Time Filter: false
   ->  Seq Scan on t  (cost=0.00..14425.00 rows=1 width=8) (never executed)
 Planning time: 0.032 ms
 Execution time: 0.009 ms
(5 rows)

                                             version                                             
-------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)

NegativeMjark (Contributor) commented Jul 14, 2016

Hmm, but it does optimise "c = 0 AND c = 1". How peculiar.

NegativeMjark commented on the diff Jul 15, 2016

synapse/storage/stream.py
@@ -95,6 +95,50 @@ def upper_bound(token, engine, inclusive=True):
)
+def filter_to_clause(event_filter):
+    if not event_filter:
+        return "", []
+
+    clauses = []
+    args = []
+
+    if event_filter.types:
+        clauses.append(
+            "(%s)" % " OR ".join("type = ?" for _ in event_filter.types)
+        )
+        args.extend(event_filter.types)
NegativeMjark (Contributor) · Jul 15, 2016

We currently allow wildcard matches for event types.

"types": lambda v: _matches_wildcard(event_type, v)
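
(Editor's sketch: to honour those wildcards, the "type = ?" equality clauses would need to become LIKE patterns. type_to_sql_pattern is a hypothetical helper illustrating the translation, not Synapse's actual code.)

```python
def type_to_sql_pattern(event_type):
    # Escape SQL LIKE metacharacters first, then map the filter's '*'
    # wildcard onto SQL's '%' (hypothetical helper).
    escaped = (
        event_type.replace("\\", "\\\\")
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    return escaped.replace("*", "%")


# "m.room.*" should match m.room.message, m.room.member, etc.
pattern = type_to_sql_pattern("m.room.*")
# Each clause would then be "type LIKE ? ESCAPE '\'" instead of "type = ?".
# pattern: "m.room.%"
```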

NegativeMjark (Contributor) commented Jul 15, 2016 (edited)

Also postgresql seems to optimise a < 0 OR (0 < a AND a < 2) OR 2 < a better than a != 0 AND a != 2, FWIW.
(Not sure that's actually useful for us, but it seemed interesting.)

Running:

DROP TABLE IF EXISTS t;
CREATE TABLE t (a INT, b INT);
INSERT INTO t (a, b) SELECT 0, data.* FROM
    (SELECT * FROM generate_series(1,500000)) AS data;
INSERT INTO t (a, b) SELECT 2, data.* FROM
    (SELECT * FROM generate_series(1,500000)) AS data;
CREATE INDEX t_a_b ON t USING btree (a, b);
ANALYZE t;

EXPLAIN ANALYSE
    SELECT a,b FROM t WHERE a = 1;

EXPLAIN ANALYSE
    SELECT a,b FROM t WHERE a != 0 AND a != 2;

EXPLAIN ANALYSE
    SELECT a,b FROM t WHERE a < 0 OR (0 < a AND a < 2) OR 2 < a;

SELECT VERSION();

Gives:

DROP TABLE
CREATE TABLE
INSERT 0 500000
INSERT 0 500000
CREATE INDEX
ANALYZE
                                                  QUERY PLAN                                                  
--------------------------------------------------------------------------------------------------------------
 Index Only Scan using t_a_b on t  (cost=0.42..6.19 rows=1 width=8) (actual time=0.018..0.018 rows=0 loops=1)
   Index Cond: (a = 1)
   Heap Fetches: 0
 Planning time: 0.151 ms
 Execution time: 0.038 ms
(5 rows)

                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..19425.00 rows=249998 width=8) (actual time=74.798..74.798 rows=0 loops=1)
   Filter: ((a <> 0) AND (a <> 2))
   Rows Removed by Filter: 1000000
 Planning time: 0.035 ms
 Execution time: 74.810 ms
(5 rows)

                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=13.30..17.32 rows=1 width=8) (actual time=0.021..0.021 rows=0 loops=1)
   Recheck Cond: ((a < 0) OR ((0 < a) AND (a < 2)) OR (2 < a))
   ->  BitmapOr  (cost=13.30..13.30 rows=1 width=0) (actual time=0.021..0.021 rows=0 loops=1)
         ->  Bitmap Index Scan on t_a_b  (cost=0.00..4.43 rows=1 width=0) (actual time=0.010..0.010 rows=0 loops=1)
               Index Cond: (a < 0)
         ->  Bitmap Index Scan on t_a_b  (cost=0.00..4.43 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
               Index Cond: ((0 < a) AND (a < 2))
         ->  Bitmap Index Scan on t_a_b  (cost=0.00..4.43 rows=1 width=0) (actual time=0.007..0.007 rows=0 loops=1)
               Index Cond: (2 < a)
 Planning time: 0.071 ms
 Execution time: 0.035 ms
(11 rows)

                                             version                                             
-------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
(1 row)
erikjohnston (Owner) commented Jul 20, 2016

(The failing tests are just jenkins being unhappy.)

NegativeMjark (Contributor) commented Jul 20, 2016

LGTM

erikjohnston merged commit 1e2a7f1 into develop Jul 20, 2016

8 of 10 checks passed

Sytest Dendron (Commit): Build #289 origin/erikj/file_api2 failed in 1 min 15 sec
Sytest Dendron (Merged PR): Build finished.
Flake8 + Packaging (Commit): Build #1175 origin/erikj/file_api2 succeeded in 34 sec
Flake8 + Packaging (Merged PR): Build finished.
Sytest Postgres (Commit): Build #1126 origin/erikj/file_api2 succeeded in 6 min 31 sec
Sytest Postgres (Merged PR): Build finished.
Sytest SQLite (Commit): Build #1148 origin/erikj/file_api2 succeeded in 5 min 59 sec
Sytest SQLite (Merged PR): Build finished.
Unit Tests (Commit): Build #1214 origin/erikj/file_api2 succeeded in 1 min 34 sec
Unit Tests (Merged PR): Build finished.

richvdh deleted the erikj/file_api2 branch Dec 1, 2016
