Database schema #1

Closed
mplorentz opened this Issue Apr 9, 2013 · 13 comments

Comments

Projects
None yet
4 participants
Owner

mplorentz commented Apr 9, 2013

I think it would be really cool if we could have a table of users and auth tokens (or however that works) and store everything else in a use'rs Tent server. Is that possible?

We could create tagged posts in the users Tent servers for their preferences, I can't think of anything that couldn't be stored that way.

The upshot is that we would have a lightweight database (SQLite) which makes things easier for us and anyone who wants to self-host the app.

The downside is that we'd need more code to prevent people who tinker with their Tent databases from getting smoke-signals to behave in undesirably ways (i.e. creating an rss feed out of their tent posts and then making tent posts out of the rss feed).

Collaborator

seanmonstar commented Apr 9, 2013

We certainly could. What sort of data do you imagine needing to be stored?

Keeping a local list of [RSS feeds]:[accounts publishing them] will let us show people before they create yet another account publishing the same feed.

Owner

mplorentz commented Apr 10, 2013

Hmm ya, that would be a great feature. Maybe we could just tag the preferences onto the users table. Just have columns like entity, auth token, type (rss2tent or tent2rss), and feed URL, etc.

Collaborator

bnjbvr commented Apr 10, 2013

I thought about this for the rss2tent part:

User:

  • app_id
  • app_mac_key
  • app_mac_key_id
  • user_mac_key
  • user_mac_key_id
  • entity

Feed:

  • name char256
  • URL char256
  • lastFetchDate date

FeedItem:

  • guid text
  • feed_id: id of belonging feed

UserFeed : many to many table

Collaborator

seanmonstar commented Apr 10, 2013

@benjbouv and how do you determine a new item? Feed items that have a published date of later than lastFetchDate?

For the tent2rss part:

Most of the above, but add to User last_post_id.

Collaborator

bnjbvr commented Apr 10, 2013

@seanmonstar New items are determined the following way:

  • either the item contains a pubDate and we compare pubDate and lastFetchDate. FYI, RSS2 feed items are described over there: http://cyber.law.harvard.edu/rss/rss.html
  • or it doesn't, and in this case we just compare the guid with the guid of feed items belonging to the current RSS feed. A guid which is not present in the set of guids of retrieved items should be considered to be a new item ( http://www.詹姆斯.com/blog/2006/08/rss-dup-detection ). Of course, a dictionnary should help finding guid for a given feed, so as to avoid fetching the database everytime.
Collaborator

seanmonstar commented Apr 10, 2013

Since we know some people are horrid at making RSS feeds, what if pubDates change to be later than lastFetchDate? And for guid resolution: some feeds use the same guid for every item :(. And some use the same URL for every item too.

Also, would there ever be concern of storage space if we hold on to every feed item? Maybe we could clean up feed items every month, removing all that are older than a month (or pick a better time span)?

Collaborator

bnjbvr commented Apr 10, 2013

The uniqueness problem seems particularly complicated to deal with. Having an algorithm that decides which field to use in the feed seems to be overkill. What about just using the link as a unique identifier? Are there some stats about how bad people make RSS feeds? :)

Cleaning up feed items regularly seems indeed a good idea. A beta test with real world data should show us how fast the database gets filled and give us an estimate of a right time span.

Collaborator

seanmonstar commented Apr 10, 2013

An addition to the Feed table: add an etag field

Collaborator

florianjacob commented Apr 11, 2013

Hm, how do other feed readers solve the uniqueness problem? What about throwing all fields togehter and generate a single hash out of it? That's hopefully unique in every case. ;) And it would give us a post update detection mechanism e.g. when typos got corrected.

Owner

mplorentz commented Apr 18, 2013

I'm going to write up an sqlite3 schema based on our discussion. I'll ask for feedback when it's up.

Owner

mplorentz commented Apr 22, 2013

Should the mac_key_ids be char256 columns in the database? I've never worked with auth tokens before.

While I'm at it, what is the mac_key and mac_key_id. Related to the MAC address?

Collaborator

seanmonstar commented Apr 22, 2013

Think of them like public/private key-pairs. The mac_key is the private key
that is used to sign the request, and the id (mac_key_id) is passed so
people can verify the signature.

Owner

mplorentz commented Apr 22, 2013

Ok guys the initial database schema is up. The feed, feeditems, and the user's preferences will all be stored in the user's Tent server. We can go ahead and create tables for these as we need them since we don't have 0.3 servers to test against yet.

As a followup on determining new RSS posts here's my proposal:
We make a dict of the last 200 or so FeedItems we've seen from a particular feed, by hashing together a bunch of the RSS item's fields. We store this (pickled) dict in the user's Tent server and when we fetch the RSS feed we search for each of the items in the dict. Does anyone see major problems with that? It solves the uniqueness problem without requiring us to sift through a large amount of posts.

mplorentz closed this Jun 6, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment