# System Design Interview: Notification/Alerting Service
Hello! I'm Emily, a senior systems engineer conducting your system design interview today. I'll be guiding you through designing a notification/alerting service based on the case study you've shared.
## Problem Statement
We need to design a notification/alerting service that can send notifications through various channels (browser, email, SMS, phone calls, push notifications, in-app messages) to billions of recipients daily. This service should be highly available, scalable, and support various features like templates, scheduled notifications, and recipient group management.
Let's start by clarifying the requirements and constraints before diving into the architecture.
## Requirements Clarification
Could you help me understand some of the key priorities for this notification service?
For example:

- Which notification channels are most critical to support in the initial implementation?
- Are there any specific performance requirements beyond the "delivered within seconds" mentioned in the case?
- Any specific security or compliance requirements we should prioritize?

Thank you for the questions. I also have questions about the requirements. To me, the prioritization of the notification channels depends on how the organization will use the tool most: what types of users we have, whether those users are people or systems that want to push notifications, and whether notifications are triggered automatically by events or sent manually by users. It also depends on whether the notifications are urgent or not.

I'm going to make the assumption that the notifications are going to be urgent, since generally we don't push notifications that demand others' attention and potentially disrupt them unless they require short-term action. By that logic, I would prioritize browser, SMS, push, and in-app notifications. These are the channels that will be noticed and acted on fastest.

What other performance requirements should we focus on? Should we plan for a particular volume of messages in a particular time window? Doing some napkin math: if the maximum number of messages sent in a day could be in the billions, how many billions? Let's assume 2 billion if we don't have a specific target. That is 2,000,000,000 a day as a possible average load, but would it be evenly distributed throughout the day, or could we expect stress on the system where, for example, all 2,000,000,000 messages arrive in one 10-second interval and then nothing for the rest of the day? If we assume an even distribution, that is 2,000,000,000 / 86,400 seconds, or roughly 23,000 per second (call it 20,000 as a round number). But I'd imagine it really matters whether that number goes up or down by orders of magnitude over the course of a day.
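The napkin math can be checked with a few lines of Python. This is only a sketch of the arithmetic; the 2 billion daily volume and the 10-second window are the assumptions from the discussion, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: int) -> float:
    """Messages per second if traffic were spread evenly over a day."""
    return messages_per_day / SECONDS_PER_DAY


def burst_rate(messages: int, window_seconds: int) -> float:
    """Messages per second if all messages arrive within one window."""
    return messages / window_seconds


daily_volume = 2_000_000_000  # assumed daily volume, not a confirmed target

print(f"even spread: {average_rate(daily_volume):,.0f} msg/s")         # 23,148 msg/s
print(f"all within 10 s: {burst_rate(daily_volume, 10):,.0f} msg/s")  # 200,000,000 msg/s
```

The gap between the two results (four orders of magnitude) is why the traffic shape matters as much as the daily total.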

For security, I would imagine we want authentication and authorization on this application. We can set up an identity provider and use OAuth for user-level permissions and OpenID Connect for authentication. We should discuss what roles users should have in the system. Should all users be allowed to configure notifications? Additionally, for compliance, are there any standards bodies governing the types or sizes of notification messages, and what types of data should be allowed in a notification? PII and SPII, for example, should likely be excluded.

I'm also wondering whether the notification service can have several concurrent client services. What kinds of conditions trigger a notification? Are notifications triggered through a synchronous request? Can one system send multiple notification requests at once? What should the format of the request be? Is there a way to cancel a request once it has been sent? Can there be requests sent at intervals that can be disabled or ignored? Is this for services doing system monitoring, or is it more for communications?



Thank you for those thoughtful questions and assumptions. You've raised excellent points that will help us define the scope and requirements more clearly. Let's build on your insights to establish a solid foundation for our design.
## Requirements Clarification (continued)
Based on your assumptions about urgency, I agree with prioritizing browser notifications, SMS, push notifications, and in-app messages for the initial implementation. These channels provide immediate attention-grabbing capabilities that are essential for time-sensitive notifications.
For volume handling, your napkin math of around 20,000 notifications per second (for 2 billion daily) as an average is reasonable. However, we need to design for peak loads rather than averages. Let's assume:

- Normal operation: 20,000 notifications/second
- Peak loads: 200,000 notifications/second (10x normal)
- Burst capacity: Up to 500,000 notifications/second for short periods

For security and compliance:

- We should implement OAuth 2.0 for authorization and OpenID Connect for authentication
- Role-based access control (RBAC) to define who can configure, view, and send notifications
- PII/SPII policies to limit sensitive data in notifications
- Opt-in/opt-out mechanisms to comply with anti-spam regulations
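As a rough illustration of how RBAC and opt-out checks could combine before a send, here is a minimal sketch. The role names, the permission set, and the `may_send` helper are all hypothetical; a real system would take roles from the identity provider's token claims rather than a hard-coded table.

```python
from enum import Enum


class Permission(Enum):
    CONFIGURE = "configure"
    VIEW = "view"
    SEND = "send"


# Hypothetical role table; the actual roles are still an open question.
ROLE_PERMISSIONS = {
    "admin": {Permission.CONFIGURE, Permission.VIEW, Permission.SEND},
    "operator": {Permission.VIEW, Permission.SEND},
    "viewer": {Permission.VIEW},
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Unknown roles get no permissions."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def may_send(role: str, recipient_id: str, channel: str,
             opt_outs: set[tuple[str, str]]) -> bool:
    """A send needs the SEND permission and a recipient who has not opted out of the channel."""
    if not is_allowed(role, Permission.SEND):
        return False
    return (recipient_id, channel) not in opt_outs


opt_outs = {("user-42", "sms")}  # user-42 opted out of SMS only
print(may_send("operator", "user-42", "sms", opt_outs))   # False
print(may_send("operator", "user-42", "push", opt_outs))  # True
print(may_send("viewer", "user-7", "push", opt_outs))     # False
```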

## System Constraints
Let's define some key constraints:

- Notification size limit: 1MB (as mentioned in the case)
- Delivery timeline: Within seconds for high-priority notifications
- High availability requirement: Five 9s (99.999% uptime)
- Fault tolerance: Guaranteed delivery with retry mechanisms
- Data retention: Let's assume notifications are stored for 30 days
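To see what these constraints imply in numbers, here is a small worked calculation. It is a sketch only: the 2 billion daily volume is the earlier assumption, the 1 KB average size is a guess introduced here for illustration, and 1 MB is treated as 1,000,000 bytes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def retained_bytes(messages_per_day: int, avg_bytes: int, retention_days: int) -> int:
    """Total stored bytes if every notification is kept for the retention period."""
    return messages_per_day * avg_bytes * retention_days


print(f"five 9s: {downtime_budget_minutes(0.99999):.2f} min/year")  # 5.26 min/year

daily_volume = 2_000_000_000                               # assumed daily volume
typical = retained_bytes(daily_volume, 1_000, 30)          # assumed 1 KB average size
worst_case = retained_bytes(daily_volume, 1_000_000, 30)   # every message at the 1 MB limit
print(f"typical: {typical / 10**12:.0f} TB")        # 60 TB
print(f"worst case: {worst_case / 10**15:.0f} PB")  # 60 PB
```

The thousandfold spread between the two storage figures suggests the average payload size is worth pinning down early.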



## High-Level Architecture, My Answer

The notification service will need a way to handle events coming in from other sources. This could be:

- a message queue
- Kafka
- some bespoke solution where we essentially recreate one of the two above in a pub/sub fashion

I know I need pub/sub. Kafka offers durability, so I'm thinking we should take that because the system needs to be highly available.

The other non-functional requirement to consider is scalability, and I believe Kafka will achieve that best, with the ability to partition topics and have consumer groups that scale with the volume of messages. If a consumer ever fails, consumption resumes from the group's last committed offset, so no messages are lost, though some may be delivered more than once.
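The resume-after-failure behavior can be illustrated with a toy in-memory model. This is not the Kafka client API; `PartitionLog` and `GroupConsumer` are invented names that only mimic the idea of a committed offset per consumer group.

```python
class PartitionLog:
    """Append-only list standing in for a single Kafka partition."""

    def __init__(self) -> None:
        self.records: list[str] = []

    def append(self, record: str) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record


class GroupConsumer:
    """Reads a partition on behalf of a consumer group."""

    def __init__(self, log: PartitionLog, committed: dict[str, int], group: str) -> None:
        self.log = log
        self.committed = committed  # shared store: group name -> next offset to read
        self.group = group
        # A new or restarted consumer starts at the group's committed offset.
        self.position = committed.get(group, 0)

    def poll(self, max_records: int) -> list[str]:
        batch = self.log.records[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self) -> None:
        self.committed[self.group] = self.position


log = PartitionLog()
for i in range(5):
    log.append(f"n{i}")

committed: dict[str, int] = {}
first = GroupConsumer(log, committed, "sms-senders")
first.poll(2)   # processes n0, n1
first.commit()  # committed offset is now 2
first.poll(2)   # processes n2, n3, then "crashes" before committing

replacement = GroupConsumer(log, committed, "sms-senders")
redelivered = replacement.poll(10)
print(redelivered)  # ['n2', 'n3', 'n4']
```

Nothing is lost, but `n2` and `n3` are processed twice, so the channel senders would need to tolerate or deduplicate repeated notifications.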

For feature configuration we will likely need a UI, which I think could be a static web app with an administrator dashboard.

So client systems will publish events to a Kafka topic whenever something is amiss, and then the notification service, which could potentially be serverless to save on cost, or maybe just a container running in a virtual machine somewhere, will ingest these messages and publish them out to users through the various channels. So there is a backend and a frontend to that as well.
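One possible shape for the event a client system publishes is sketched below. The field names, the JSON encoding, and the `NotificationEvent` class are assumptions for illustration; the only value taken from the discussion is the 1 MB size limit, treated here as 1,000,000 bytes.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

MAX_PAYLOAD_BYTES = 1_000_000  # 1 MB notification size limit from the constraints


@dataclass
class NotificationEvent:
    source: str        # client system that raised the event
    channel: str       # e.g. "sms", "push", "browser", "in_app"
    recipient_id: str
    body: str
    priority: str = "high"
    # A unique id lets downstream consumers detect duplicate deliveries.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_bytes(self) -> bytes:
        """Serialize to JSON and enforce the size limit before publishing."""
        data = json.dumps(asdict(self)).encode("utf-8")
        if len(data) > MAX_PAYLOAD_BYTES:
            raise ValueError(f"event is {len(data)} bytes, limit is {MAX_PAYLOAD_BYTES}")
        return data


event = NotificationEvent("billing-monitor", "sms", "user-42", "Disk usage above 90%")
payload = event.to_bytes()  # bytes that would be published to the topic
print(json.loads(payload)["channel"])  # sms
```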

It will likely be necessary to scale the notification service backend according to the number of incoming messages, and it will be a good idea to load balance these in some fashion: maybe have certain topics consumed only by certain instances of the service, or maybe distribute them across all instances.
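The "distribute across all instances" option can be sketched as a round-robin assignment of partitions to service instances. This is a simplified stand-in for what a consumer group coordinator does, not Kafka's actual assignor; the CRC32 hash is also an assumption (Kafka's own partitioner uses a different hash function), and the instance names are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition so the same key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions: list[int], instances: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin over the running service instances."""
    assignment: dict[str, list[int]] = {instance: [] for instance in instances}
    for i, partition in enumerate(sorted(partitions)):
        assignment[instances[i % len(instances)]].append(partition)
    return assignment


partitions = list(range(6))
print(assign_partitions(partitions, ["svc-a", "svc-b", "svc-c"]))
# {'svc-a': [0, 3], 'svc-b': [1, 4], 'svc-c': [2, 5]}

# If one instance goes away, its partitions are redistributed among the rest.
print(assign_partitions(partitions, ["svc-a", "svc-b"]))
# {'svc-a': [0, 2, 4], 'svc-b': [1, 3, 5]}
```

One consequence of this model is that the partition count caps useful parallelism: with six partitions, a seventh instance would sit idle.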



## High-Level Architecture, Claude's answer
Based on our discussion, let's sketch out a high-level architecture:

- Client-Facing Layer:
  - API Gateway for external requests
  - Web UI for manual notification configuration and monitoring
  - Authentication and authorization
- Core Services:
  - Notification Management Service
  - Template Service
  - Address Group Service
  - Scheduler Service
- Distribution Layer:
  - Kafka messaging infrastructure
  - Consumer clusters for each notification channel
- Channel-Specific Services:
  - Browser Notification Service
  - SMS Service
  - Push Notification Service
  - In-App Notification Service
  - Email Service
  - Phone Call Service
- Supporting Infrastructure:
  - Object Store for large media files
  - Metadata Service
  - Monitoring and Alerting Service
  - Dead Letter Queues for retry logic
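The retry logic behind the dead letter queues could look roughly like the sketch below: retry with exponential backoff, then park the message once attempts run out. `DeliveryError`, the attempt count, and the delay values are assumptions for illustration, and a plain list stands in for the dead letter queue.

```python
from typing import Callable


class DeliveryError(Exception):
    """Raised by a channel sender when a delivery attempt fails."""


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   factor: float = 2.0) -> list[float]:
    """Delays to wait between attempts: base, base*factor, base*factor^2, ..."""
    return [base_seconds * factor ** i for i in range(max_attempts - 1)]


def deliver_with_retry(message: str, send: Callable[[str], None], max_attempts: int,
                       dead_letters: list[str],
                       sleep: Callable[[float], None] = lambda seconds: None) -> bool:
    """Try to send; after max_attempts failures, move the message to the dead letter list.

    The default sleep is a no-op so the example runs instantly; pass time.sleep for real waits.
    """
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except DeliveryError:
            if attempt < len(delays):
                sleep(delays[attempt])
    dead_letters.append(message)
    return False


def make_flaky_sender(failures: int) -> Callable[[str], None]:
    """Build a fake sender that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures}

    def send(message: str) -> None:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise DeliveryError("temporary failure")

    return send


dead_letters: list[str] = []
print(backoff_delays(4))                                                     # [1.0, 2.0, 4.0]
print(deliver_with_retry("sms-1", make_flaky_sender(2), 4, dead_letters))   # True
print(deliver_with_retry("sms-2", make_flaky_sender(10), 4, dead_letters))  # False
print(dead_letters)                                                          # ['sms-2']
```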



Let me first ask: Does this high-level architecture align with your thinking? Would you like to discuss any specific component in more detail first, or should we start diving into the overall system flow?