Create MVP of URL prioritisation backend #361

Closed
hellais opened this issue Mar 16, 2020 · 7 comments
Assignees
Labels
ooni/backend Issues related to https://github.com/ooni/backend ooni/pipeline Issues related to https://github.com/ooni/pipeline priority/high

Comments

@hellais
Member

hellais commented Mar 16, 2020

During the team meeting we discussed an MVP for how a URL prioritisation system could work.

Meeting notes here: https://docs.google.com/document/d/1fCTU_P6nQVGS1TQ60EnMxWR6Ex5nw2Ug94PfP78uiMc/edit#.

For starters, we can experiment with sending updated lists only to our own personal probes (rather than testing with all probes out there).

Priorities to start with:

  • Test social media sites around the world. Get them from the URLs included in the “Social Media” button here: https://ooni.org/get-involved/run/
  • Test “News Media”, “VPNs” and “Human Rights” from here: https://ooni.org/get-involved/run/
  • Prioritize URLs from the Citizen Lab test lists that fall under “NEWS”, “HUMR”, etc. over other categories (such as “PORN”), since this is more in line with our mission.

Criteria for prioritization (globally):

  • The type of content is frequently blocked around the world (e.g. social media).
  • Blocking this content has a bigger impact on the public and on human rights (e.g. human rights websites, news media, etc.).

Criteria for country-specific prioritization:

  • If the type of content is illegal/banned, it is more likely to be blocked.
  • If a specific website or type of content is known to be blocked in a country, or was reportedly blocked in the past.
  • If certain types of websites are likely to get blocked in correlation with specific events (for example, testing political party websites in country X leading up to and during elections). These types of sites are often categorized as “POLR”. We may want to flag election-related sites in an extra column in the Citizen Lab test lists, where we can add the “election” tag.

We should consider the locality of priorities: there are different clusters and groups that have a global ordering (e.g. social media > porn), but this ordering may change locally.

Essentially we are talking about multipliers: each target has a default weight, and multipliers are applied on top of it. For example:

  • A target has a value (e.g. default = 100)
  • A category has a +10% multiplier
  • A category+country pair has a +20% multiplier

Perhaps a lot of this should be probabilistic. Some noise should be added to ensure the tested URLs are spread across the probes. If a user can test 100 URLs, we probably want those to include a mix of different types of websites, which should follow from the weights. If you just apply the weights uniformly without taking the local clustering into account, the result may not be evenly distributed.
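One way the probabilistic selection could look is weighted sampling without replacement. The sketch below uses the Efraimidis-Spirakis keying trick (key = u^(1/w)), which is my choice for illustration and not something decided in the meeting; the function name `sample_urls` and the example weights are hypothetical.

```python
import random


def sample_urls(weights, k, rng=None):
    """Pick up to k distinct URLs, favouring higher weights.

    Each URL gets a random key u ** (1 / weight); the k largest keys win.
    URLs with a weight of zero or less are never selected.
    """
    rng = rng or random.Random()
    keyed = []
    for url, weight in weights.items():
        if weight <= 0:
            continue
        keyed.append((rng.random() ** (1.0 / weight), url))
    keyed.sort(reverse=True)
    return [url for _, url in keyed[:k]]


if __name__ == "__main__":
    # Hypothetical weights, only to show the call shape.
    example = {"twitter.com": 360, "foo.com": 50, "bar.example": 100}
    print(sample_urls(example, k=2))
```

Because every probe draws its own random keys, two probes with the same weights will usually receive different subsets, while heavily weighted URLs still show up most of the time.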

We would change the testing of the probes depending on what other probes have tested, to ensure that the same URLs get tested across networks and that we have good coverage in general.
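A minimal sketch of how coverage could feed back into the weights, assuming we keep a per-network counter of recent measurements and damp the weight of URLs that a network has already tested. The `CoverageTracker` class, the use of an ASN string as the network key, and the `1 / (1 + count)` damping are all assumptions made for illustration.

```python
from collections import Counter


class CoverageTracker:
    """In-memory count of recent measurements per (network, URL) pair."""

    def __init__(self):
        self._counts = Counter()

    def record(self, asn, url):
        """Note that a probe on network `asn` has tested `url`."""
        self._counts[(asn, url)] += 1

    def adjust(self, asn, weights):
        """Return weights damped by how often `asn` already tested each URL."""
        return {
            url: weight / (1 + self._counts[(asn, url)])
            for url, weight in weights.items()
        }


if __name__ == "__main__":
    tracker = CoverageTracker()
    tracker.record("AS1234", "twitter.com")
    print(tracker.adjust("AS1234", {"twitter.com": 200, "foo.com": 50}))
```

A network that has not tested a URL yet keeps the full weight for it, so the same URL still tends to get measured on every network.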

We can start a spreadsheet to categorize the Citizen Lab category codes: https://docs.google.com/spreadsheets/d/17nwpu_lrNMyB0nGjU79Z5K42O_ymkoFcS1_-6i0-WNQ/edit#gid=0

We can add a tag that flags which categories should have a higher weight in specific countries. Some categories can also have no weight at all.

Examples:

  • Foo.com: weight 50 (lower than the default)
  • Twitter.com: weight 200 (higher than the default)
  • Category SOCIAL: +50% (increase weight for the whole category)
  • Category SOCIAL + Country:IT: +20% (increase weight for the whole category in one country)
  • Foo.com + Country:UK: -30% (lower weight for a specific target in one country)

→ Twitter in Italy gets a weight of 200 + 50% + 20%. Applying the multipliers in sequence gives 200 × 1.5 × 1.2 = 360, which means the site is tested about 3.6 times more often than a website with the default weight of 100.
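The example rules can be sketched as a small rule table. This assumes the percentages compound (each matching rule multiplies the running weight); under that assumption Twitter in Italy works out to 360, while an additive reading would give 340. The `Rule` class, `compute_weight`, and the `MISC` category used for foo.com are hypothetical names, not part of any existing OONI code.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_WEIGHT = 100


@dataclass(frozen=True)
class Rule:
    """A percentage adjustment; a field left as None matches anything."""
    percent: float
    domain: Optional[str] = None
    category: Optional[str] = None
    country: Optional[str] = None

    def matches(self, domain, category, country):
        return (
            self.domain in (None, domain)
            and self.category in (None, category)
            and self.country in (None, country)
        )


def compute_weight(domain, category, country, base_weights, rules):
    """Start from the per-target weight and compound every matching rule."""
    weight = float(base_weights.get(domain, DEFAULT_WEIGHT))
    for rule in rules:
        if rule.matches(domain, category, country):
            weight *= 1 + rule.percent / 100
    return weight


BASE_WEIGHTS = {"foo.com": 50, "twitter.com": 200}
RULES = [
    Rule(percent=50, category="SOCIAL"),
    Rule(percent=20, category="SOCIAL", country="IT"),
    Rule(percent=-30, domain="foo.com", country="UK"),
]

if __name__ == "__main__":
    print(round(compute_weight("twitter.com", "SOCIAL", "IT", BASE_WEIGHTS, RULES)))
    print(round(compute_weight("foo.com", "MISC", "UK", BASE_WEIGHTS, RULES)))
```

Keeping rules as data rather than code means a new category or country adjustment is just another row in the repository/database.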

Example of proactive prioritization: before an election we create a bunch of rules:

Election website 1 + Country of election → +100%
Election website 2 + Country of election → +100%
Election website 3 + Country of election → +100%
Election website 4 + Country of election → +100%

...in the repository/database
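As a rough sketch, the election rules could be generated from a list of sites instead of being written by hand. The validity window (`valid_from` / `valid_until`) is an assumption added here so that rules stop applying after the election; the dictionary layout, function names, domains, and dates are all hypothetical.

```python
from datetime import date


def election_rules(country, domains, start, end, percent=100):
    """Build one +percent rule per election-related domain for one country."""
    return [
        {
            "domain": domain,
            "country": country,
            "percent": percent,
            "valid_from": start,
            "valid_until": end,
        }
        for domain in domains
    ]


def active_rules(rules, today):
    """Keep only the rules whose validity window includes `today`."""
    return [r for r in rules if r["valid_from"] <= today <= r["valid_until"]]


if __name__ == "__main__":
    # Placeholder domains and dates, for illustration only.
    rules = election_rules(
        "IT",
        ["party-a.example", "party-b.example"],
        start=date(2020, 5, 1),
        end=date(2020, 6, 15),
    )
    print(len(active_rules(rules, date(2020, 6, 1))))
```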

Support for tags in CitizenLab test-lists: citizenlab/test-lists#264

Full notes available here: https://docs.google.com/document/d/1fCTU_P6nQVGS1TQ60EnMxWR6Ex5nw2Ug94PfP78uiMc/edit#

@FedericoCeratto
Contributor

TODO: add category priorities to citizenlab list and then update fastpath/domain_input.py to handle them

FedericoCeratto pushed a commit to ooni/pipeline that referenced this issue Apr 16, 2020
FedericoCeratto pushed a commit to ooni/pipeline that referenced this issue Apr 16, 2020
FedericoCeratto pushed a commit to ooni/pipeline that referenced this issue Apr 22, 2020
FedericoCeratto pushed a commit to ooni/pipeline that referenced this issue Apr 22, 2020
FedericoCeratto pushed a commit to ooni/pipeline that referenced this issue Apr 22, 2020
* fastpath 0.25 URL prioritization citizenlab table ooni/backend#361
* analysis 0.14: URL prioritization MVP
@FedericoCeratto
Contributor

PR merged

@FedericoCeratto
Contributor

TODO:

  • configure local Nginx to serve files based on query
  • implement fallback to static files if the backend stops updating the dynamic files
  • switch traffic from orchestration nginx

@FedericoCeratto
Contributor

@hellais can you please provide detailed specs on which calls in https://github.com/ooni/orchestra/blob/master/docs/orchestrate-swagger.yml need to be implemented, and their semantics? The previous discussion around the MVP did not include enough details.

@FedericoCeratto
Contributor

FedericoCeratto commented Apr 23, 2020

Update after discussion on Slack:

  • honor country_code and category_codes and limit https://github.com/ooni/orchestra/blob/master/docs/orchestrate-swagger.yml#L95
  • implement a dedicated component that serves that entry point
  • cache database contents in the component
  • refresh data from the database at least every 5m
  • the database should not be a hard dependency: if unresponsive, use cached data
  • support active/standby model: if the main instance is not responsive, an external LB can fall back to a second instance
  • some counters are in-memory only (preventing active/active)
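The caching behaviour in the list above could be sketched roughly as follows: rows are kept in memory, refreshed at most every 5 minutes, and the last good copy keeps being served if the database call fails. The class name `CachedUrlList`, the `fetch` callable, and the row keys (`url`, `country_code`, `category_code`, `weight`) are assumptions for illustration, not the actual component.

```python
import time

REFRESH_INTERVAL_S = 300  # refresh from the database at least every 5 minutes


class CachedUrlList:
    """Serve URL rows from memory; the database is only a soft dependency."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch  # hypothetical callable returning a list of row dicts
        self._clock = clock
        self._rows = []
        self._last_refresh = None

    def _maybe_refresh(self):
        now = self._clock()
        fresh = (
            self._last_refresh is not None
            and now - self._last_refresh < REFRESH_INTERVAL_S
        )
        if fresh:
            return
        try:
            self._rows = self._fetch()
            self._last_refresh = now
        except Exception:
            # Database unresponsive: keep the cached rows and retry
            # on the next request.
            pass

    def query(self, country_code, category_codes=None, limit=100):
        """Honor country_code, category_codes and limit, highest weight first."""
        self._maybe_refresh()
        rows = [
            r for r in self._rows
            if r["country_code"] == country_code
            and (not category_codes or r["category_code"] in category_codes)
        ]
        rows.sort(key=lambda r: r["weight"], reverse=True)
        return rows[:limit]
```

Because the cache and any counters live in the process memory, two instances would drift apart, which is consistent with running them active/standby rather than active/active.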

@FedericoCeratto
Contributor

Second iteration PR: ooni/pipeline#312

@hellais
Member Author

hellais commented Apr 27, 2020

This is great. I approved the relevant PR and I would say we can consider this issue closed. Let's open some follow up issues for the next sprint.
