-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sync test lists #50
Sync test lists #50
Conversation
Write the last synched commit to the sync_test_lists table
); | ||
""" | ||
|
||
# XXX can a url belong to more than one category? (probably not) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, it can (unless we've cleaned up that sort of weeds from test-lists).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not possible according to the citizenlab/test-lists format. This comment was mostly there if we maybe wanted to have some extra set of categories (multiple) for our own use, but I think that for the time being this is not that important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's to some extent possible with the citizenlab/test-lists format. One just adds same URL twice with different category codes:
ae.csv:http://www.haya.org,HUMR,Human Rights Issues,2014-04-15,citizenlab,
bh.csv:http://www.haya.org,REL,Religion,2014-04-15,citizenlab,"Religious conversion, commentary and criticism"
global.csv:https://www.facebook.com,GRP,Social Networking,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
pk.csv:https://www.facebook.com,NEWS,News Media,2014-04-15,citizenlab,
id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,
global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,
I'm unsure if test-lists should be updated or data model should be extended...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is, however, according to the data model of the DB, two distinct entries. The same URL present in two country specific lists is a different URL.
In general I think we should be doing better consistency checks in the test-lists URL to avoid dupe categories, but I don't see a problem in the DB datamodel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It leads to several interesting questions:
- when the orchestra gives URLs out and same URLs have different categories, should it give same URL twice? Probably, not.
- if the user in
SG
does not want to testPORN
but wants to testNEWS
, should the user testcrazyshit.com
or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed: citizenlab/test-lists#400
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is third possible inconsistency: same url with trailing /
for bare domain and without it.
bh.csv:http://arabi21.com/,NEWS,News Media,2017-11-13,citizenlab,
eg.csv:http://arabi21.com,NEWS,News Media,2017-05-24,ADEF,blocked as of date added
It also makes processing a bit more complex than it should be. Canonical form for URL with bare domain...
- has NO trailing slash when DISPLAYED in Firefox and Chrome UI
- HAS trailing slash when COPIED from Firefox and Chrome UI
IIRC, we already had this discussion on canonical form of bare-domain URL, so let's take that decision and suggest it as canonical form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list-lists.py
does have checks to avoid duplicates inside of the same list based on the canonical representation of the URL. We consider the canonical representation to be that without the trailing slash: https://github.com/citizenlab/test-lists/blob/master/scripts/lint-lists.py#L138
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue with trailing /
is that http://example.com/
and http://example.com
are same URLs, but http://example.com/foo/
and http://example.com/foo
are different ones, so that specific heuristics may be wrong.
Example: https://vigruzki.rkn.gov.ru/auto != https://vigruzki.rkn.gov.ru/auto/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, yet from the perspective of duplicate detection it's pretty likely that the user wanted to test only one of those two representations and I have yet to come across a case in which both representations should be tested. When that will happen I guess we can improve the heuristic.
scripts/sync-test-lists.py
Outdated
' VALUES (%s, %s, %s, %s, %s, %s, %s)' | ||
' ON CONFLICT DO NOTHING RETURNING url_no', | ||
(url, cat_no, country_no, date_added, source, notes, True)) | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One URL ~ one SQL request. If the code is run in HKG and db is in AMS, then it's 250ms per URL, that's ~2h8m of runtime to import all lists (30780 lines in lists/*). It sounds suboptimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you suggest doing it instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COPY to temporary table, do manipulations, INSERT into main table. Tell me if an example is useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes an example for this would be useful.
scripts/sync-test-lists.py
Outdated
with pgconn, pgconn.cursor() as c: | ||
c.execute('INSERT INTO sync_test_lists (executed_at, commit_hash)' | ||
' VALUES (NOW(), %s)', | ||
(last_commit_hash, )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write_sync_table()
writes meaningful data in separate transaction. It looks like a bug to me and other functions also follow same pattern of strange usage of transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a good point. I will see if I can do some refactoring to wrap things in transactions in a better way.
import psycopg2 | ||
import git | ||
|
||
TEST_LISTS_GIT_URL = 'https://github.com/citizenlab/test-lists/' | ||
COUNTRY_UTIL_URL = 'https://github.com/hellais/country-util' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we run code instead of pulling from personal repos?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this question. That repo contains JSON data for country codes and country names.
executed_at TIMESTAMP WITH TIME ZONE, | ||
commit_hash VARCHAR PRIMARY KEY | ||
); | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, either all DDLs should be here or none of them (IMHO, second is easier if we have more than one app speaking to the DB). Having some subset of tables being DDLed here and other subset somewhere else looks strange to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to have this in here was that I didn't want to have to write a new migration script for orchestra and also that this table is never used by the orchestra app, but is only used by the sync_test_lists script.
scripts/sync-test-lists.py
Outdated
c.execute('INSERT INTO urls (url, cat_no, country_no, date_added, source, notes, active)' | ||
' VALUES (%s, %s, %s, %s, %s, %s, %s)' | ||
' ON CONFLICT DO NOTHING RETURNING url_no', | ||
(url, cat_no, country_no, date_added, source, notes, True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, country_iso2
is natural PK for the data, isn't it? TBH, I don't quite understand why country_no
is used instead of country_iso2
. Also, I'm unsure if name
and full_name
should be stored in database or not. Are these strings ever going to be translated? If they are, they probably should not be fetched from DB.
Same logic applies to category codes as well.
On the other hand, these questions are not that important, that's just a suggestion for reflection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point for country_iso2. I will see how tricky it is to do a schema migration on that table and if so do that.
WRT name
and full_name
, these are the canonical english names (full_name is the name according to iso, while name is the name according to cldr) for those countries and I think putting them in the DB is a reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure if country_iso2
is a good point or not. We have cis.csv
, you also add EU
preudo-country somewhere...
I agree that putting names in the DB is reasonable for some use-cases. But we will eventually want to localize those names in UI, that's why I say that fetching those names from DB may be unreasonable in some cases. That's why I ask question if we have a use-case when we actually want to put a name into DB :)
@@ -309,18 +304,15 @@ def dirname(s): | |||
|
|||
def parse_args(): | |||
p = argparse.ArgumentParser(description='test-lists: perform operations related to test-list synchronization') | |||
p.add_argument('--working-dir', metavar='DIR', type=dirname, help='where we should be cloning the git repository to', required=True) | |||
p.add_argument('--init', action='store_true', help='if we should be running the initialisation routine') | |||
p.add_argument('--working-dir', metavar='DIR', type=dirname, help='The working directory for this script. It should be preserved across runs.', required=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's easier to avoid working-dir being cached and rely on git ls-remote https://github.com/citizenlab/test-lists/ HEAD
to understand if we have to re-fetch or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if gitpython supports this. Did you see it in their docs? Help on doing this would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems, it does not. subprocess.check_output
with direct call to git cli or git.cmd
is good enough.
Another random idea: we don't import URLs on commit-by-commit basis (do we want to?), so it may be easier to import all lists/* to temporary tables and act on "latest snapshot" and have a shallow clone instead of full clone...
Okay, I'll add ls-remote and then we'll decide about packaging this stuff to run on top of airflow. |
Yeah. It should be fixed in the latest commit. Seems related to this: golang/lint#397. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks reasonable to me
That's python2.7 as cStringIO is gone in 3.x :-)
This makes some changes to the sync-test-lists script which populates the relevant tables in the orchestra database for enabling the /api/v1/urls endpoint to work.
The workflow of this script is the following:
sync_test_lists
table, if not it is considered to be a first run in which case it will populate the tables for the category codes, countries and writes all the URLs in the country lists.workingdir/test-lists
sync_test_lists
contains a history, it will take the last commit hash and do a diff of thatactive = false
)