-
Notifications
You must be signed in to change notification settings - Fork 0
Technical Design and Challenges
Most of my big projects start on a whiteboard, and this is no different :) I find that once I plan out the tricky stuff, everything else follows fairly quickly. It's best to get this step out of the way early!
When I was given this problem to work on, it initially perplexed me, but I quickly realized I was making something similar to email, so a lot of the architecture stemmed from that. As soon as I noticed this, the problem became translating email (which is normally contained into a file system) into a REST API.
The solution to this problem was inspired by the email-style architecture. I quickly understood that the hardest hit parts of the application are going to be the send and receive message calls. Making these work quickly is important. I started from the bottom up designing the database.
The first thing I noticed is even though a sender can send a message to multiple recipients, those recipients all have the exact same message. This means the message content doesn't have to be duplicated for each recipient, it can just sit in a messageBody table. Separating this into a table makes sharding easier and drastically reduces storage requirements from O(users * messageSize) to O(messageSize).
After breaking this off into a seperate table, I noticed that recipients all have the same sender information as well (sender username, sender id, subject, etc.). I chose to break sender information into a separate table from messageBody for lookup efficiency. Searches in the messageSent table (which in retrospect should really be named "outbox") are sped up and more entries can fit in the messageSent table without needing to shard the database.
Next I needed a place to put the recipient information. A recipient needs to:
- Know if a message is in his or her inbox.
- Delete a message from his or her inbox.
- Read a message
So I created a messageRecieve to hold recipient-specific data about messages.
A message can be sent to all users (broadcast), meaning if there are n users, then there need to be n entries made in the messageRecieve table. My first instinct was to grab a list of all the users in the application code and then insert entries for each of those users. This is not going to work, because storing so many users' information on the server would cause it to run out of memory very quickly.
To make things more efficient, I moved the work to the database:
INSERT INTO messageReceive (messageId, toUser, `read`)
SELECT DISTINCT ?, userId, 0
FROM `user`
WHERE userId != ?
This query is still taxing, but it's still more efficient than maintaining a list of all users on the application server. Since userId is a key, it is indexed, making retrieval in the subselect quick.
Two things this API sorely needs:
- User authentication -- Right now anybody can send a message as any user. This is obviously a problem, but I decided to exclude it for this version so my friends at Udacity could play around with the API without having to worry about a complex authentication scheme.
- Caching -- Speedy in-memory access to frequently used users and groups would speed things up a bit.
- Storing the Message Content in a Distributed System -- Right now, message content is stored in the relational DB. This is bad, and relational DBs aren't meant for this. The data should be moved to some distributed key-value storage service, like S3 or Riak. Content body storage is (I suspect) one of the major slow points in this application. Allowing to scale storage and lookup to near linear time is very important.
