
How can we share Sequelize models between podverse-web and podverse-feedparser as separate apps? #1

Closed
mitchdowney opened this issue Jan 4, 2017 · 9 comments

Comments

@mitchdowney
Member

mitchdowney commented Jan 4, 2017

First, a rough idea of how I imagine podverse-feedparser working:

  1. podverse-feedparser, podverse-web, and the podverse PostgreSQL database all listen on their separate ports, deployed on their separate servers.

  2. Every few hours or so, a cron job triggers podverse-feedparser to query for all podcast RSS feed URLs in the database.

  3. The parseFeeds method is called with the array of all the feed URLs currently in the db. parseFeeds adds each of these feeds on a queue to be parsed.

  4. The parseFeeds queue runs sequentially, calling the parseFeed method with each URL until finished. As it goes, parseFeed writes the updated podcast and episode data to the PostgreSQL db. (This parseFeed function already exists in podverse-web here.) A rough sketch of this loop is below.
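
To make step 4 concrete, here's a minimal sketch of the kind of sequential loop I have in mind. The queue here is just an in-process loop, and parseFeed stands in for the existing podverse-web function:

```js
// Rough sketch only: parse every feed URL one at a time.
// parseFeed stands in for the existing function in podverse-web.
async function parseFeeds (feedUrls, parseFeed) {
  for (const url of feedUrls) {
    try {
      await parseFeed(url); // writes updated podcast/episode data to PostgreSQL
    } catch (err) {
      console.error(`Failed to parse ${url}:`, err);
      // keep going -- one bad feed shouldn't stop the rest of the queue
    }
  }
}
```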

I feel confident I can write code to make each of these things happen, but I am not sure how to elegantly reuse the podverse-web repositories/sequelize/engineFactory.js and models in the separate podverse-feedparser app.

I considered using npm install git://podverse-web as a dependency in podverse-feedparser, then somehow loading the models within podverse-feedparser by loading podverse-web files available in node_modules...but I'm not quite sure how I'd do that yet, and I wonder if I'm heading down the wrong path.
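
Something like this is roughly what I'm picturing, though the package.json entry, the require paths, and the loadModels export are all guesses based on podverse-web's layout rather than anything that exists yet:

```js
// package.json in podverse-feedparser (guessing at the git URL format):
//   "dependencies": { "podverse-web": "git://github.com/podverse/podverse-web.git" }

// Then reach into the installed package for the engine factory and models.
// These paths mirror podverse-web's repositories/sequelize layout, and the
// loadModels export + engineFactory signature are guesses on my part.
const engineFactory = require('podverse-web/repositories/sequelize/engineFactory');
const loadModels = require('podverse-web/repositories/sequelize/models');

const sqlEngine = engineFactory({ databaseUri: process.env.DATABASE_URL });
const { Podcast, Episode } = loadModels(sqlEngine);
```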

Having two separate apps that share a PostgreSQL db is new territory for me. Any tips on how to architect this stuff?

@mitchdowney
Member Author

I'm also considering whether there is a 4th app that needs to be here, podverse-api?

In that case I imagine I would remove all references to models and database stuff in podverse-web and podverse-feedparser, then whenever a database interaction needs to happen, I would make a request to podverse-api.

Decoupling all the db stuff from -web and -feedparser sounds like more work than I prefer right now, buuut if that is a good pattern then I am up for doing it.

@scvnc
Contributor

scvnc commented Jan 4, 2017

In the spirit of open source and reusable tools, here's how I'd take a stab at architecting it.

High Level

There is a podcast-db application. It is the authority for the Podcast and Episode models, their associated RSS links, and all that. It knows nothing about MediaRefs / clips / playlists etc. It exposes two APIs: one is a RESTful API and the other is the Sequelize models that it owns. Focus on the latter because it is more relevant for podverse-web.

Podverse-web consumes the podcast-db application. This means exposing an npm package... it could use a git:// url for now. Essentially podcast-db exposes Sequelize models that know how to interact with a Postgres instance containing all the podcast information. Podverse-web additionally contains information about MediaRefs, Clips, Playlists, etc. Podverse-api could happen, but possibly later. Because of this architecture, a MediaRef wouldn't necessarily be directly linked to an Episode in the database, but that's fine for the sake of being more decentralized and decoupled.
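
Very roughly, podcast-db's entry point could look something like this. All the model and field names here are placeholders, not a spec:

```js
// podcast-db/index.js -- placeholder sketch of the package's public surface
const Sequelize = require('sequelize');

module.exports = function connectPodcastDb (postgresUri) {
  const sequelize = new Sequelize(postgresUri);

  // Only the models podcast-db is the authority for; it knows nothing
  // about MediaRefs / clips / playlists.
  const Podcast = sequelize.define('podcast', {
    title: Sequelize.TEXT,
    feedUrl: { type: Sequelize.TEXT, unique: true },
    lastUpdated: Sequelize.DATE
  });

  const Episode = sequelize.define('episode', {
    title: Sequelize.TEXT,
    mediaUrl: Sequelize.TEXT,
    pubDate: Sequelize.DATE
  });

  Podcast.hasMany(Episode);
  Episode.belongsTo(Podcast);

  return { sequelize, Podcast, Episode };
};
```

podverse-web (and the feed parsing routines) would then npm install it and call it with their own Postgres connection string.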

podcast-db (lower level)

It doesn't need to run on a port or a server like with express/feathers. Really it is a set of routines and some sort of job queue mechanism. These routines would probably be executed as a command line application or something similar. No need to get HTTP involved to invoke these routines (beyond fetching the RSS feeds themselves).

Routine: update rss feed

Given an RSS feed, update the podcast-db.
Tonnes of logic and conditions would be required here to make it robust. What if the RSS feed is garbage or too big? What if it's a dead link? How should this be reported? What should happen if the episode name is changed? What should happen if the media URL is changed? Plus many other cases we haven't thought of yet.

Routine: add rss feed

Given an RSS feed, add a podcast/episode to the podcast-db.

Routine: determine which rss feeds need to be updated

When executed (perhaps hourly), it should result in adding the set of routines that need to be executed. It could be a query like "give me all podcast RSS URLs whose last updated date is older than 48 hrs".
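
As a sketch, that query could be as simple as the following. Model and field names are placeholders, and the operator syntax depends on the Sequelize version:

```js
// Sketch: find podcasts whose feeds haven't been refreshed in the last 48 hours.
// Model/field names are placeholders; Sequelize v4+ uses Op, older versions use $lt.
const { Op } = require('sequelize');
const FORTY_EIGHT_HOURS = 48 * 60 * 60 * 1000;

function findStaleFeeds (Podcast) {
  return Podcast.findAll({
    attributes: ['id', 'feedUrl'],
    where: {
      lastUpdated: { [Op.lt]: new Date(Date.now() - FORTY_EIGHT_HOURS) }
    }
  });
  // each result would then become an "update rss feed" job on the queue
}
```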

job queue mechanism

There are a lot of these. Some you can run yourself, and for some you can leverage a cloud service: Amazon has SQS (https://aws.amazon.com/sqs/) and Azure has one too. Fundamentally it is the orchestrator of a task, ensuring that it is queued up and that it is completed in a robust way. It would take orders to execute routines and make sure that they get executed.
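
For example, with the aws-sdk SQS client, enqueueing an "update rss feed" job is roughly this. The queue URL and message shape are placeholders:

```js
// Sketch of enqueueing an "update rss feed" job with the aws-sdk SQS client.
// The queue URL and message shape are placeholders.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'us-east-1' });

function enqueueFeedUpdate (feedUrl) {
  return sqs.sendMessage({
    QueueUrl: process.env.FEED_QUEUE_URL,
    MessageBody: JSON.stringify({ routine: 'updateRssFeed', feedUrl })
  }).promise();
}
```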

But maybe we don't need to build podcast-db so quickly

Consider using the audiosear.ch API... You can send a term such as the podcast "Invisibilia" to its API and it will return a JSON payload that has everything we need to get podcast/episode information based on searching for it. It is essentially the podcast-db piece, but via a RESTful API.

@mitchdowney
Member Author

mitchdowney commented Jan 5, 2017

I just played around with the audiosear.ch API, and one issue I have is that a few podcasts I listen to a lot (Rubin Report, #WeThePeople Live, Peace Propaganda) are not in the system. It looks like this doesn't have to be a showstopper though, as the website takes suggestions:

[image: screenshot of the audiosear.ch podcast suggestion form]

When I query for a podcast by ID (887 for "Waking Up with Sam Harris"), I see 39 episode IDs, but there are 61 episodes in Sam Harris's RSS feed, so Audiosear.ch apparently can return inaccurate data.

Furthermore, I do not see a way to request both a podcast AND have its episode information included. It appears we need to individually query for each episode by its ID if we want to know anything about the episode, such as its title or lastPubDate. This seems to me like it would be unusable for the Podcast page in Podverse.

Unless there is a workaround for handling displaying all episodes that I am not thinking of...

If there are not workarounds for these issues, then I am leaning towards building podcast-db...

@mitchdowney
Member Author

mitchdowney commented Jan 5, 2017

Ooo Iiiii seee now. After rereading your proposal, audiosear.ch sounds very viable to me. Why run all these RSS parsing jobs ourselves, when someone else is already doing it? Audiosear.ch can take a huge load off our backs for the podcasts they support.

I'd still like to have our own RSS feed parser though, one we use just to fill in the gaps for feeds that audiosear.ch doesn't provide yet (like Waking Up with Sam Harris and Peace Propaganda). By drastically limiting the number of feeds we parse ourselves, keeping our feed parser robust should be much more manageable...

Whooops didn't mean to press close

@scvnc
Contributor

scvnc commented Jan 5, 2017

There may be some dealbreakers for sure. Having a podcast and its episodes in one request is not one of them; it's not a big deal in the grand scheme of things to have that as two requests at the moment. Not having control of which feeds show up, though, could be one.

It could be that audiosear.ch has strict standards to keep their database clean, and so they gatekeep which RSS feeds get in.

It is an illustration that this piece really is a whole other app, sort of unrelated to the core podverse MVP. I'd be happier with a micro-service running on Lambda or something that took an RSS feed, converted it into JSON, and shoved it back to the client to store in local storage. No need to maintain a database of podcasts/episodes.

@mitchdowney
Member Author

I hear ya on how not storing podcasts / episodes to db would simplify things greatly.

The only thing is I cannot imagine a UX where I would want to track down an RSS feed link every time I want to use the web clipper. More so, I can't imagine most of my friends / family ever doing that. I can, however, imagine myself and others clicking a Search icon, typing in the name of the podcast we are looking for, and then listening to an episode and making clips that way.

Also I want a web clipper with a good UX because I don't want users to be totally limited to iOS (if we were to go solely the mobile app route).

Sooo the reason I'm not going the localStorage route is 100% UX related. If I am misunderstanding and there is a way to accomplish this UX without a podcast-db then I am interested in simplifying things.

In the meantime I'll be working on getting podcast-db to work with podverse-web locally today.

@mitchdowney
Member Author

Basically what I am planning on doing today is:

  1. Move podcast and episode models, services, and tests into podcast-db.
  2. Make podcast-db a node_module dependency of podverse-web.
  3. Write a podcast-db script that extracts podcast data from audiosear.ch and stores it in the db.
  4. Write a podcast-db script that extracts episode data from audiosear.ch and stores it in the db.
  5. Make sure the previous feedparser.js still works for outlier podcasts that audiosear.ch does not support.

@mitchdowney
Member Author

@scvnc podverse-web and podcast-db have been decoupled. podverse-web fires up for me when I run npm start, and all the features seem to be working.

I can populate the db with feeds / podcasts / episodes with this CLI command:

node -e 'require("./src/tasks/feedParser.js").parseFullFeedIfFeedHasBeenUpdated("http://joeroganexp.joerogan.libsynpro.com/rss")'

There's some hardcoded db stuff going on in podverse-web and podcast-db that I assume will have to get cleaned up for deployment.

I'm thinking I'll work on the shell scripts and queue stuff next. That stuff is more new territory for me. I'll look up tutorials on SQS and see what I can do...

@scvnc
Contributor

scvnc commented Jan 10, 2017

With SQS and WebFaction we would need to have a cron job on WebFaction invoke every 5 minutes (or another considered interval). It would connect to SQS, retrieve one message (which is a task for feed parsing), and then do the feed parsing job (add or update a feed).

After it's done (or errored!!) it would have to interact with SQS again. If it's done, then it has to tell SQS to delete the message from the queue because it was successfully parsed. If it errored, then we need to log that somewhere nice and then delete the message from SQS.

The other task is "determine which rss feeds need to be updated"... which should probably be a daily script that combs through the lastUpdated on the podcasts and adds appropriate update jobs to the SQS queue. They will later get picked up by the previously illustrated cron job.
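
A sketch of what that cron-invoked worker could look like, using the aws-sdk SQS client and the existing feedParser task (assuming parseFullFeedIfFeedHasBeenUpdated returns a promise; the queue URL and message shape are placeholders matching the earlier enqueue sketch):

```js
// Invoked by cron every ~5 minutes: pull one message, parse the feed,
// then delete the message whether it succeeded or failed (errors just get logged).
const AWS = require('aws-sdk');
const feedParser = require('./src/tasks/feedParser.js');

const sqs = new AWS.SQS({ region: 'us-east-1' });
const QueueUrl = process.env.FEED_QUEUE_URL; // placeholder

async function runOnce () {
  const { Messages } = await sqs.receiveMessage({ QueueUrl, MaxNumberOfMessages: 1 }).promise();
  if (!Messages || Messages.length === 0) { return; }

  const message = Messages[0];
  const { feedUrl } = JSON.parse(message.Body);

  try {
    await feedParser.parseFullFeedIfFeedHasBeenUpdated(feedUrl);
  } catch (err) {
    console.error(`feed parse failed for ${feedUrl}`, err); // log somewhere nicer eventually
  } finally {
    await sqs.deleteMessage({ QueueUrl, ReceiptHandle: message.ReceiptHandle }).promise();
  }
}

runOnce().catch(err => { console.error(err); process.exit(1); });
```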
