Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync test lists #50

Merged
merged 13 commits into from
Oct 22, 2018
Merged

Sync test lists #50

merged 13 commits into from
Oct 22, 2018

Conversation

hellais
Copy link
Member

@hellais hellais commented Oct 9, 2018

This makes some changes to the sync-test-lists script which populates the relevant tables in the orchestra database for enabling the /api/v1/urls endpoint to work.

The workflow of this script is the following:

  1. It checks if there is a history inside of the sync_test_lists table, if not it is considered to be a first run in which case it will populate the tables for the category codes, countries and writes all the URLs in the country lists.
  2. It will pull (or clone if it's the first run) the citizenlab/test-list repository into the workingdir/test-lists
  3. If the sync_test_lists contains a history, it will take the last commit hash and do a diff of that
  4. Based on the diff it will look at which paths have changed and figure out if there are new URLs to add or if some were deleted and mark them as inactive (active = false)

@hellais hellais changed the base branch from master to feature/signing October 9, 2018 15:48
@hellais hellais changed the base branch from feature/signing to master October 9, 2018 15:48
Write the last synched commit to the sync_test_lists table
);
"""

# XXX can a url belong to more than one category? (probably not)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, it can (unless we've cleaned up that sort of weeds from test-lists).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible according to the citizenlab/test-lists format. This comment was mostly there if we maybe wanted to have some extra set of categories (multiple) for our own use, but I think that for the time being this is not that important.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's to some extent possible with the citizenlab/test-lists format. One just adds same URL twice with different category codes:

ae.csv:http://www.haya.org,HUMR,Human Rights Issues,2014-04-15,citizenlab,
bh.csv:http://www.haya.org,REL,Religion,2014-04-15,citizenlab,"Religious conversion, commentary and criticism"

global.csv:https://www.facebook.com,GRP,Social Networking,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
pk.csv:https://www.facebook.com,NEWS,News Media,2014-04-15,citizenlab,

id.csv:http://denypagetests.netsweeper.com,NEWS,News Media,2014-04-15,citizenlab,
kw.csv:http://denypagetests.netsweeper.com,CTRL,Control content,2014-04-15,citizenlab,

global.csv:http://www.crazyshit.com,PORN,Pornography,2014-04-15,citizenlab,Updated by OONI on 2017-02-14
sg.csv:http://www.crazyshit.com,NEWS,News Media,2014-04-15,citizenlab,

I'm unsure if test-lists should be updated or data model should be extended...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, however, according to the data model of the DB, two distinct entries. The same URL present in two country specific lists is a different URL.

In general I think we should be doing better consistency checks in the test-lists URL to avoid dupe categories, but I don't see a problem in the DB datamodel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It leads to several interesting questions:

  • when the orchestra gives URLs out and same URLs have different categories, should it give same URL twice? Probably, not.
  • if the user in SG does not want to test PORN but wants to test NEWS, should the user test crazyshit.com or not?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@darkk darkk Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is third possible inconsistency: same url with trailing / for bare domain and without it.

bh.csv:http://arabi21.com/,NEWS,News Media,2017-11-13,citizenlab,
eg.csv:http://arabi21.com,NEWS,News Media,2017-05-24,ADEF,blocked as of date added

It also makes processing a bit more complex than it should be. Canonical form for URL with bare domain...

  • has NO trailing slash when DISPLAYED in Firefox and Chrome UI
  • HAS trailing slash when COPIED from Firefox and Chrome UI

IIRC, we already had this discussion on canonical form of bare-domain URL, so let's take that decision and suggest it as canonical form.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list-lists.py does have checks to avoid duplicates inside of the same list based on the canonical representation of the URL. We consider the canonical representation to be that without the trailing slash: https://github.com/citizenlab/test-lists/blob/master/scripts/lint-lists.py#L138

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue with trailing / is that http://example.com/ and http://example.com are same URLs, but http://example.com/foo/ and http://example.com/foo are different ones, so that specific heuristics may be wrong.
Example: https://vigruzki.rkn.gov.ru/auto != https://vigruzki.rkn.gov.ru/auto/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, yet from the perspective of duplicate detection it's pretty likely that the user wanted to test only one of those two representations and I have yet to come across a case in which both representations should be tested. When that will happen I guess we can improve the heuristic.

' VALUES (%s, %s, %s, %s, %s, %s, %s)'
' ON CONFLICT DO NOTHING RETURNING url_no',
(url, cat_no, country_no, date_added, source, notes, True))
except:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One URL ~ one SQL request. If the code is run in HKG and db is in AMS, then it's 250ms per URL, that's ~2h8m of runtime to import all lists (30780 lines in lists/*). It sounds suboptimal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you suggest doing it instead?

Copy link
Contributor

@darkk darkk Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COPY to temporary table, do manipulations, INSERT into main table. Tell me if an example is useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes an example for this would be useful.

with pgconn, pgconn.cursor() as c:
c.execute('INSERT INTO sync_test_lists (executed_at, commit_hash)'
' VALUES (NOW(), %s)',
(last_commit_hash, ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write_sync_table() writes meaningful data in separate transaction. It looks like a bug to me and other functions also follow same pattern of strange usage of transactions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a good point. I will see if I can do some refactoring to wrap things in transactions in a better way.

import psycopg2
import git

TEST_LISTS_GIT_URL = 'https://github.com/citizenlab/test-lists/'
COUNTRY_UTIL_URL = 'https://github.com/hellais/country-util'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we run code instead of pulling from personal repos?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this question. That repo contains JSON data for country codes and country names.

executed_at TIMESTAMP WITH TIME ZONE,
commit_hash VARCHAR PRIMARY KEY
);
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, either all DDLs should be here or none of them (IMHO, second is easier if we have more than one app speaking to the DB). Having some subset of tables being DDLed here and other subset somewhere else looks strange to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to have this in here was that I didn't want to have to write a new migration script for orchestra and also that this table is never used by the orchestra app, but is only used by the sync_test_lists script.

c.execute('INSERT INTO urls (url, cat_no, country_no, date_added, source, notes, active)'
' VALUES (%s, %s, %s, %s, %s, %s, %s)'
' ON CONFLICT DO NOTHING RETURNING url_no',
(url, cat_no, country_no, date_added, source, notes, True))
Copy link
Contributor

@darkk darkk Oct 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, country_iso2 is natural PK for the data, isn't it? TBH, I don't quite understand why country_no is used instead of country_iso2. Also, I'm unsure if name and full_name should be stored in database or not. Are these strings ever going to be translated? If they are, they probably should not be fetched from DB.
Same logic applies to category codes as well.
On the other hand, these questions are not that important, that's just a suggestion for reflection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point for country_iso2. I will see how tricky it is to do a schema migration on that table and if so do that.

WRT name and full_name, these are the canonical english names (full_name is the name according to iso, while name is the name according to cldr) for those countries and I think putting them in the DB is a reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure if country_iso2 is a good point or not. We have cis.csv, you also add EU preudo-country somewhere...
I agree that putting names in the DB is reasonable for some use-cases. But we will eventually want to localize those names in UI, that's why I say that fetching those names from DB may be unreasonable in some cases. That's why I ask question if we have a use-case when we actually want to put a name into DB :)

@@ -309,18 +304,15 @@ def dirname(s):

def parse_args():
p = argparse.ArgumentParser(description='test-lists: perform operations related to test-list synchronization')
p.add_argument('--working-dir', metavar='DIR', type=dirname, help='where we should be cloning the git repository to', required=True)
p.add_argument('--init', action='store_true', help='if we should be running the initialisation routine')
p.add_argument('--working-dir', metavar='DIR', type=dirname, help='The working directory for this script. It should be preserved across runs.', required=True)
Copy link
Contributor

@darkk darkk Oct 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easier to avoid working-dir being cached and rely on git ls-remote https://github.com/citizenlab/test-lists/ HEAD to understand if we have to re-fetch or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if gitpython supports this. Did you see it in their docs? Help on doing this would be useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems, it does not. subprocess.check_output with direct call to git cli or git.cmd is good enough.

Another random idea: we don't import URLs on commit-by-commit basis (do we want to?), so it may be easier to import all lists/* to temporary tables and act on "latest snapshot" and have a shallow clone instead of full clone...

@darkk
Copy link
Contributor

darkk commented Oct 12, 2018

Okay, I'll add ls-remote and then we'll decide about packaging this stuff to run on top of airflow.
BTW, something is wrong with travis... I bet, it has nothing to do with this branch, but maybe it's worth looking.

@hellais
Copy link
Member Author

hellais commented Oct 12, 2018

BTW, something is wrong with travis... I bet, it has nothing to do with this branch, but maybe it's worth looking.

Yeah. It should be fixed in the latest commit. Seems related to this: golang/lint#397.

Copy link
Member Author

@hellais hellais left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me

That's python2.7 as cStringIO is gone in 3.x :-)
@hellais hellais merged commit 798f07f into master Oct 22, 2018
@bassosimone bassosimone deleted the sync-test-lists branch December 16, 2019 07:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants