This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Description of what different services do. #734

Closed
esirK opened this issue Oct 16, 2020 · 3 comments

esirK commented Oct 16, 2020

Hello team.
My team and I would like to request a description of what the different services do. This way, we will be able to make informed modifications while performing our own deployment. Thank you for your time.

pypt self-assigned this on Oct 16, 2020
pypt added the question label on Oct 16, 2020

pypt commented Oct 16, 2020

Hi there!

  • cliff-annotator - HTTP service with a machine learning model that does named entity recognition (names, places, brand names, etc. - see the tests for a list); if you're dealing with English-language content, you probably want this one; requires 8 GB of RAM (IIRC) for a single container to run;
  • cliff-fetch-annotation - fetches named entities from cliff-annotator, stores them in a PostgreSQL table as a compressed blob; you want this one if you run CLIFF;
  • cliff-update-story-tags - fetches compressed blob from PostgreSQL table, parses JSON, tags stories with entities;
  • crawler-ap - worker that (used to) fetch content via the Associated Press API; long broken, but could serve as a reference for how to ingest content from various "custom" APIs; you probably don't need this one;
  • crawler-fetcher - despite the name, doesn't actually crawl anything; polls the PostgreSQL table managed by crawler-provider and fetches downloads enqueued on that table, stores their content, adds extract-and-vector jobs to get the content extracted; you want this one;
  • crawler-provider - using the feeds table, manages PostgreSQL table with a queue of (RSS) feeds / news articles (stories) to be fetched; you want this one;
  • create-missing-partitions - lingers around and tries to create missing PostgreSQL table partitions for the upcoming rows; you want this one;
  • cron-generate-daily-rss-dumps - dumps all stories collected every day into an RSS feed for external users to download; you probably don't need this one as you don't have external users who would like to ingest your data;
  • cron-generate-media-health - generates daily (?) media health reports, i.e. tries to find out which media sources are dead; it's optional, but you might find it useful;
  • cron-generate-user-summary - generates a daily report of new users who have signed up; it's supposed to email this daily report, but I think it just logs it into the Docker container logs; totally optional for you;
  • cron-print-long-running-job-states - periodically tries to identify which Celery jobs have been running for a long time and, if so, which "state" they reported last; I think it's broken but I'm not sure; you probably don't want this one;
  • cron-refresh-stats - updates some daily stats, not sure if broken; it shouldn't hurt if you decide to run it;
  • cron-rescrape-due-media - periodically adds new rescrape-media Celery jobs for every media source so that we would become aware of new / updated / deleted RSS feeds in each of these media sources; you want this one;
  • cron-rescraping-changes - prints reports on how well cron-rescrape-due-media did its job; it might be broken, but I'm not sure as I don't think we read these reports anymore; you can skip this, but it's not a particularly intensive process anyway;
  • cron-set-media-primary-language - identifies media sources for which we haven't determined their "primary language" (e.g. English for BBC UK or French for Le Monde) and tries to do that; not sure if we use this, but maybe we do? you might keep it just in case;
  • cron-set-media-subject-country - same as -primary-language, just with countries;
  • dump-table - utility script that dumps a huge table in parallel; not a service, just a container image that you can use if you want to dump some tables;
  • export-tables-to-backup-crawler - utility script that's long broken;
  • extract-and-vector - tries to extract plain text from HTML pages of news articles (stories) fetched by crawler-fetcher, determines each story's language, tokenizes it into sentences to do deduplication; you want this one;
  • extract-article-from-page - HTTP service that does the actual HTML -> plain text extracting; you want this one;
  • facebook-fetch-story-stats - fetches Facebook statistics (IIRC share and comment counts) for newly added stories;
  • import-solr-data - daemon process which periodically imports new stories to Solr; you probably want this one;
  • import-solr-data-for-testing - service that we use for running tests against import-solr-data;
  • import-stories-feedly - one-off script (not a service) that imports stories from Feedly;
  • import-stories-scrapehtml - I don't even know what this one does; I think it imports stories by scraping an HTML page with a bunch of links? also a one-off script;
  • mail-opendkim-server - signs emails sent out with mail-postfix-server with DKIM; as you'll be using a third-party service for sending out email, you probably won't need this one;
  • mail-postfix-server - SMTP server which listens on port 25 and sends emails from the rest of the system (registration form, various periodic reports, monitoring alerts, etc.); on second thought, this should be called just mail-server; if you'd like to send email from AWS, you might need to rewrite this one for it to relay emails submitted to it via port 25 to some sort of third-party email service, as AWS blocks outbound SMTP by default (alternatively, there's a method described in the linked page on how to convince AWS to let you send email, so you can try that too); see the Postfix sketch after this list;
  • mirror-cpan-on-s3 - a tool (not a service) to fetch Perl dependency modules from CPAN (it's like PyPI for Perl) and upload them to Amazon S3, thus making it into a poor man's CPAN mirror;
  • munin-cron - Munin's (our monitoring system's) Cron script which fetches the monitored stats every 5 minutes; if you'd like to monitor how well your system is doing, you might want to run these, it's just that the limits that various Munin plugins expect (e.g. new stories per day) are hardcoded, so you'll need to change those;
  • munin-fastcgi-graph - FastCGI workers for Munin's HTTP webapp;
  • munin-httpd - Munin's webapp;
  • munin-node - Munin's stat collector (I think it gets called from munin-cron, or maybe it's the other way around);
  • nytlabels-annotator - somewhat similar to CLIFF, this service tries to guess what the story is about, e.g. US elections, gardening, Nairobi, the Moon, etc.; works with English language content only; requires 8 GB of RAM for a single instance to run;
  • nytlabels-fetch-annotation - same like with cliff-fetch-annotation, just with NYTLabels;
  • nytlabels-update-story-tags - same like with cliff-update-story-tags, just with NYTLabels;
  • podcast-fetch-episode - fetches podcast audio files (from podcast feeds), transcodes them to something that Google's speech recognition can understand (e.g. WAV), tries to guess the language from the podcast episode's description, uploads the audio to Google Cloud's object store, and adds a podcast-submit-operation job; if you'll be doing podcast transcripts, you might want this one; however, if speakers on your podcasts have English accents other than American or British English, you might want to make the language detection code a bit smarter;
  • podcast-fetch-transcript - fetches transcripts from Google Cloud's speech API, stores them locally; if the transcript is not done yet, makes sure (I forgot how) that podcast-poll-due-operations will add a new podcast-fetch-transcript job after a few minutes;
  • podcast-poll-due-operations - polls a PostgreSQL table for speech transcription operations (added by podcast-submit-operation) which should be done by now (it assumes that it will take Google 1 min to transcribe 1 min of speech), and adds jobs to podcast-fetch-transcript; see the polling sketch after this list;
  • podcast-submit-operation - adds a new speech transcription job for each podcast story;
  • postgresql-pgadmin - pgAdmin4 instance to manage PostgreSQL; if your SQL is rusty or you don't like dealing with psql, you can choose to run this one; a bit buggy but more or less works;
  • postgresql-pgbouncer - pgBouncer instance that does connection pooling in front of postgresql-server; you don't need this one, but it's a very light process, and otherwise you'd have to update a bunch of code for it to point to postgresql-server directly, so I'd say you leave it;
  • postgresql-server - PostgreSQL server; you need this one as we store pretty much everything here;
  • purge-object-caches - we use (or at least used to use) a few PostgreSQL tables to do some object caching (think Redis), so this script periodically cleans up said cache; you might want this one;
  • rabbitmq-server - RabbitMQ server, the backbone of the Celery-based jobs system; you want this one;
  • rabbitmq-server-webapp-proxy - Nginx proxy to RabbitMQ's web interface; you might want this one, unless you're really good at writing commands to the terminal;
  • rescrape-media - for each new / updated (by cron-rescrape-due-media service) media source, crawls their webpage a bit in an attempt to find RSS / Atom feeds that we could then set up for periodic fetching; you want this one;
  • sitemap-fetch-media-pages - part of our sitemap XML ingestion experiment; you can skip this one;
  • solr-shard - Solr shard instance; you want this one, although with your dataset I'd reduce the number of shards from 24 to - dunno - two? maybe four? you can keep it simple and just run a single Solr shard (although you might want to get rid of solr-zookeeper then as it won't have anything to, eh, "zookeep" in that case);
  • solr-shard-webapp-proxy - webproxy to Solr's web interface; keep this one;
  • solr-zookeeper - ZooKeeper which manages Solr's configuration, keeps track of live shards, etc.;
  • tools - various one-off tools;
  • topics-extract-story-links - this and the other topics-* services make up the topics system; my knowledge of how it all works is somewhat limited, and it changes often, so @hroberts would be better placed to tell you all about it; we might have a doc on it (or two) somewhere too!;
  • topics-fetch-link -
  • topics-fetch-twitter-urls -
  • topics-map -
  • topics-mine -
  • topics-mine-public -
  • topics-snapshot -
  • webapp-api - API's FastCGI worker; keep this one;
  • webapp-httpd - Nginx for API; keep this one;
  • word2vec-generate-snapshot-model - generates word2vec models for topic snapshots; admittedly, I once knew exactly what that means (as I was the one who wrote this service), but I've since forgotten; sorry!
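
Regarding mail-postfix-server above: here's a minimal sketch of what the relay change could look like in Postfix's main.cf, assuming you relay through a third-party SMTP service on the submission port with SASL authentication; the relay hostname and the credentials file path are placeholders, not anything that ships with the image:

```
# Hypothetical /etc/postfix/main.cf fragment: relay all outbound mail through
# a third-party SMTP service on port 587 instead of delivering it directly
# over port 25 (which AWS blocks by default).
relayhost = [smtp.example-relay.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt
```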
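
And regarding podcast-poll-due-operations: a rough sketch of the "poll for due operations" idea described above; the table, columns and helper names here are made up for illustration and are not the actual Media Cloud schema or API:

```python
# Illustrative sketch only: 'podcast_episode_operations', its columns, and
# 'add_fetch_transcript_job' are placeholders, not the real Media Cloud
# schema or job helpers.
import time

POLL_INTERVAL_SECONDS = 60


def poll_due_operations(db, add_fetch_transcript_job) -> None:
    """Periodically enqueue podcast-fetch-transcript jobs for operations that
    should be finished by now (roughly 1 min of transcription per 1 min of
    speech)."""
    while True:
        due_rows = db.query("""
            SELECT operation_id
            FROM podcast_episode_operations
            WHERE transcript_fetched = 'f'
              AND submitted_at + duration_seconds * INTERVAL '1 second' < NOW()
        """).fetchall()

        for row in due_rows:
            add_fetch_transcript_job(operation_id=row['operation_id'])

        time.sleep(POLL_INTERVAL_SECONDS)
```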

Please note that some services add jobs for other services to do, for example:

  1. crawler-fetcher adds extract-and-vector jobs for newly fetched news articles (stories);
  2. extract-and-vector adds cliff-fetch-annotation jobs;
  3. cliff-fetch-annotation adds cliff-update-story-tags jobs;
  4. cliff-update-story-tags adds nytlabels-fetch-annotation jobs;
  5. nytlabels-fetch-annotation adds nytlabels-update-story-tags jobs;
  6. finally, nytlabels-update-story-tags calls mark_as_processed() to trigger Solr import of the story.

This chain is hardcoded, meaning that if you decide to skip parts of it in your processing pipeline, you'll need to make sure that the chain doesn't break at some point; e.g. if you'd like to skip CLIFF and NYTLabels processing, you'll probably want to update extract-and-vector's code so that it calls mark_as_processed() itself instead of leaving that to nytlabels-update-story-tags (see the sketch below).
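
To make that last point concrete, here's a rough Python sketch of the kind of change meant above, assuming you drop CLIFF and NYTLabels entirely; enqueue_job and mark_as_processed are passed in as callables because the real helpers have different names and signatures - only mark_as_processed() itself is the call referred to above:

```python
# Rough sketch only: the callables are stand-ins for the real Media Cloud
# helpers; mark_as_processed() is what ultimately triggers the Solr import
# of a story.

def finish_story(db, stories_id: int, enqueue_job, mark_as_processed,
                 run_annotators: bool = False) -> None:
    """Hypothetical tail end of extract-and-vector's story handler."""
    if run_annotators:
        # Full chain: hand the story to the next worker; the chain will
        # eventually reach nytlabels-update-story-tags, which calls
        # mark_as_processed() itself.
        enqueue_job('cliff-fetch-annotation', stories_id=stories_id)
    else:
        # Shortened chain: mark the story as processed right away so that
        # import-solr-data picks it up without waiting for annotators that
        # never run.
        mark_as_processed(db, stories_id)
```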

Feel free to reopen this issue if something is unclear or if you have other questions.

esirK commented Nov 9, 2020

Hi @pypt, is the WORD_EMBEDDINGS_SERVER_URL setting (https://github.com/mediacloud/web-tools/blob/main/config/app.config.template#L35) part of the backend, or how has it been configured?

pypt commented Nov 9, 2020

I think it's this one:

https://github.com/mediacloud/word-embeddings-server
