* Similar services: pastebin.com, pasted.co, chopapp.com
* Users can store plain text. A user enters a piece of text and gets a randomly generated URL to access it.

# Requirement
* Users should be able to upload or “paste” their data and get a unique URL to access it.
* Users will only be able to upload text.
* Data and links will expire after a specific timespan automatically; users should also be able to specify expiration time.
* Users should optionally be able to pick a custom alias for their paste.
* The system does not support user accounts or editing documents
* Old documents are deleted after not being accessed for a long time
* User enters a paste's url and views the contents
* User is anonymous
* Service tracks analytics of pages
    * Monthly visit stats

* Out of scope
    - User registers for an account
        - User verifies an email
    - User logs into a registered account
        - User edits a document
    - User can set visibility

* The system should be highly reliable, any data uploaded should not be lost.
* Users should be able to access their Pastes in real-time with minimum latency.
* Paste links should not be guessable
* Analytics, e.g., how many times a paste was accessed?
* Our service should also be accessible through REST APIs by other services.
* Analytics can be accessed by stats button on each page

* We can limit users not to have Pastes bigger than 10MB to stop the abuse of the service.
* Since our service supports custom URLs, users can pick any URL that they like, but providing a custom URL is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on custom URLs, so that we have a consistent URL database.
* The system gets heavy traffic and contains millions of docs
* Traffic is NOT evenly distributed across documents: some documents are hot.

# Capacity Estimation and Constraints

* Our services will be read-heavy; there will be more read requests compared to new Pastes creation. We can assume a 5:1 ratio between read and write.
* Traffic
    - 1M new pastes per day
    - Means 1M * 5 = 5M reads per day
    - 1M / (3600 * 24) ~= 12 pastes/sec
    - 5M / (3600 * 24) ~= 58 reads/sec
* Storage
    - Users can upload up to 10 MB of data, but the source code, configs, or logs that users typically share are not huge. Avg size = 10 KB
    - 1M * 10 KB = 10 GB/day
    - To store it for 10 years: 10 GB * 365 * 10 ~= 36 TB
* With 1M pastes every day we will have 3.6 billion Pastes in 10 years. We need to generate and store keys to uniquely identify these pastes. If we use base64 encoding ([A-Z, a-z, 0-9, ., -]) we would need six-letter strings:
    - 64^6 ~= 68.7 billion unique strings
* If it takes one byte to store one character, total size required to store 3.6B keys would be:
    - 3.6B * 6 => 22 GB
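The key-space arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the variable names are illustrative, and the inputs are the assumptions already stated (1M pastes per day, a 10-year horizon, a 64-symbol alphabet, one byte per character).

```python
# Rough check of the key-space estimate, using the assumptions from the text.
PASTES_PER_DAY = 1_000_000
YEARS = 10
ALPHABET_SIZE = 64   # base64-style alphabet
KEY_LENGTH = 6       # characters per key, one byte each

total_pastes = PASTES_PER_DAY * 365 * YEARS        # pastes created over 10 years
key_space = ALPHABET_SIZE ** KEY_LENGTH            # number of distinct six-letter keys
key_storage_gb = total_pastes * KEY_LENGTH / 1e9   # bytes needed for the keys, in GB

print(f"pastes in {YEARS} years: {total_pastes:,}")
print(f"possible keys:      {key_space:,}")
print(f"key storage:        {key_storage_gb:.1f} GB")
print(f"key space is {key_space / total_pastes:.1f}x the number of pastes")
```

The exact figures (3.65 billion pastes, about 21.9 GB of keys) round to the 3.6B and 22 GB used in the text.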

* We could generate a random 128-bit GUID. It is not guaranteed to be unique, but the odds of a collision are low enough that we can treat it as unique. It is not pretty for users, though. Truncating it to a shorter value makes it prettier but increases the chance of collision.
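To get a feel for how much truncation raises the collision risk, the standard birthday approximation p ≈ 1 − exp(−n(n−1)/(2N)) can be evaluated for both key sizes. This is an illustrative sketch: the function name is hypothetical, and it assumes keys are drawn uniformly at random.

```python
import math

def collision_probability(num_keys: int, key_space: int) -> float:
    """Birthday approximation: chance that at least two of num_keys
    uniformly random keys drawn from key_space are identical."""
    return -math.expm1(-num_keys * (num_keys - 1) / (2 * key_space))

n = 3_650_000_000  # pastes over 10 years at 1M per day

p_guid = collision_probability(n, 2 ** 128)   # full 128-bit GUID
p_short = collision_probability(n, 64 ** 6)   # truncated to six base64 characters

print(f"128-bit GUID collision probability: {p_guid:.3e}")
print(f"six-character key collision probability: {p_short:.3f}")
```

With six random characters a collision somewhere in the data set is practically certain, which is why short keys need either a duplicate check on insert or pre-generated unique keys.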

* Size per paste
    - 1 KB content per paste
    - shortlink - 7 bytes
    - expiration_length_in_minutes - 4 bytes
    - created_at - 5 bytes
    - paste_path - 255 bytes
    - total = ~1.27 KB

* 22GB is negligible compared to 36TB. To keep some margin, we will assume a 70% capacity model (meaning we don’t want to use more than 70% of our total storage capacity at any point), which raises our storage needs to 51.4TB.

* Bandwidth
    - For write requests, we expect 12 new pastes per second, resulting in 120KB of ingress per second.
        - 12 * 10KB => 120 KB/s
    - For read requests, we expect 58 requests per second. Therefore, total data egress (sent to users) will be 0.6 MB/s.
        - 58 * 10KB => 0.6 MB/s
* Memory
    - 20% of hot pastes generate 80% of traffic, we would like to cache these 20% pastes
    - Since we have 5M read requests per day, to cache 20% of these requests, we would need:
        - 0.2 * 5M * 10KB ~= 10 GB
    - As documents cannot be edited, we do not have to worry about cache invalidation.
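The estimates in this section can be reproduced with a short back-of-the-envelope script. It is a sketch under the assumptions stated above (1M writes per day, 5:1 read/write ratio, 10 KB average paste, 10-year retention, 70% capacity model); the variable names are illustrative and decimal units are used throughout.

```python
SECONDS_PER_DAY = 24 * 3600

writes_per_day = 1_000_000
read_write_ratio = 5
avg_paste_kb = 10
years = 10

# Traffic
reads_per_day = writes_per_day * read_write_ratio
writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = reads_per_day / SECONDS_PER_DAY

# Storage
storage_per_day_gb = writes_per_day * avg_paste_kb / 1e6
storage_total_tb = storage_per_day_gb * 365 * years / 1e3
provisioned_tb = storage_total_tb / 0.7   # never fill more than 70% of capacity

# Bandwidth
ingress_kb_s = writes_per_sec * avg_paste_kb
egress_kb_s = reads_per_sec * avg_paste_kb

# Cache for the hottest 20% of daily reads
cache_gb = 0.2 * reads_per_day * avg_paste_kb / 1e6

print(f"writes/s: {writes_per_sec:.0f}, reads/s: {reads_per_sec:.0f}")
print(f"storage: {storage_per_day_gb:.0f} GB/day, {storage_total_tb:.1f} TB in {years} years")
print(f"provisioned at 70% fill: {provisioned_tb:.1f} TB")
print(f"ingress: {ingress_kb_s:.0f} KB/s, egress: {egress_kb_s:.0f} KB/s")
print(f"cache: {cache_gb:.0f} GB")
```

The script keeps 36.5 TB unrounded, so the provisioned figure comes out slightly above the 51.4 TB obtained in the text from the rounded 36 TB.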

# System API

```
addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None, expire_date=None)
```
* api_dev_key: The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
* A successful insertion returns the URL through which the paste can be accessed, otherwise, it will return an error code.

```
getPaste(api_dev_key, api_paste_key)
deletePaste(api_dev_key, api_paste_key)
```
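One way the api_dev_key could be used for throttling is a fixed-window counter per key. The sketch below is an assumption, not part of the original design: the class name, the window length, and the quota values are all hypothetical, and a production service would more likely keep these counters in a shared store than in process memory.

```python
import time
from collections import defaultdict
from typing import Optional

class QuotaThrottle:
    """Fixed-window request counter per api_dev_key (illustrative only)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # api_dev_key -> (window start time, requests seen in that window)
        self._windows = defaultdict(lambda: (0.0, 0))

    def allow(self, api_dev_key: str, now: Optional[float] = None) -> bool:
        """Return True if this request fits in the caller's quota."""
        now = time.monotonic() if now is None else now
        start, count = self._windows[api_dev_key]
        if now - start >= self.window_seconds:
            start, count = now, 0   # the previous window is over, start a new one
        if count >= self.max_requests:
            self._windows[api_dev_key] = (start, count)
            return False
        self._windows[api_dev_key] = (start, count + 1)
        return True

# Usage: allow at most 2 calls per 60 seconds for each developer key.
throttle = QuotaThrottle(max_requests=2, window_seconds=60)
for attempt in range(3):
    print(attempt, throttle.allow("example-dev-key"))
```

addPaste, getPaste, and deletePaste would each call allow(api_dev_key) first and return an error code when it yields False.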

# Database Design
* We need to store billions of records.
* Each metadata object we are storing would be small (less than 1KB).
* Each paste object we are storing can be of medium size (it can be a few MB).
* There are no relationships between records, except if we want to store which user created what Paste.
* Our service is read-heavy.
![](images/pastebin1.PNG)

```
shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)
```
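The schema above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, assuming an in-memory SQLite database stands in for the real SQL store; the helper names insert_paste and is_expired and the example paste_path value are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pastes (
        shortlink CHAR(7) NOT NULL,
        expiration_length_in_minutes INT NOT NULL,
        created_at DATETIME NOT NULL,
        paste_path VARCHAR(255) NOT NULL,
        PRIMARY KEY (shortlink)
    )
""")

def insert_paste(shortlink, minutes, created_at, paste_path):
    """Insert one metadata row; a duplicate shortlink raises sqlite3.IntegrityError."""
    conn.execute(
        "INSERT INTO pastes VALUES (?, ?, ?, ?)",
        (shortlink, minutes, created_at.isoformat(" ", "seconds"), paste_path),
    )

def is_expired(shortlink, now):
    """Return True once created_at + expiration_length_in_minutes has passed."""
    row = conn.execute(
        "SELECT expiration_length_in_minutes, created_at FROM pastes WHERE shortlink = ?",
        (shortlink,),
    ).fetchone()
    if row is None:
        raise KeyError(shortlink)
    minutes, created_at = row
    return datetime.fromisoformat(created_at) + timedelta(minutes=minutes) <= now

created = datetime(2024, 1, 1, 12, 0, 0)
insert_paste("foobar1", 60, created, "pastes/foobar1")
print(is_expired("foobar1", created + timedelta(minutes=30)))  # False
print(is_expired("foobar1", created + timedelta(minutes=61)))  # True
```

Because shortlink is the primary key, the database itself rejects duplicate keys, which is what a retry-on-collision write path can rely on.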

* ‘ContentKey’ (paste_path in the schema above) is a reference to an external object storing the contents of the paste

* We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer

# High level Design

* We need an application layer that will serve all the read and write requests. The application layer will talk to a storage layer to store and retrieve data. We can segregate our storage layer, with one database storing metadata related to each paste, users, etc., and another storing the paste contents in some object storage (like Amazon S3) or a NoSQL document store. This division of data will also allow us to scale them individually.

![](images/pastebin2.SVG)

![](images/pastebin4.PNG)

* The Client sends a create paste request to the Web Server, running as a reverse proxy
* The Web Server forwards the request to the Write API server
* The Write API server does the following:
    * Generates a unique url
        * Checks if the url is unique by looking at the SQL Database for a duplicate
        * If the url is not unique, it generates another url
        * If we supported a custom url, we could use the user-supplied (also check for a duplicate)
    * Saves to the SQL Database pastes table
    * Saves the paste data to the Object Store
    * Returns the url
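The write steps above can be sketched as a single function. This is an illustration under simplifying assumptions: plain dicts stand in for the SQL Database and the Object Store, keys use a 62-character alphabet, and the names create_paste, DuplicateKeyError, and MAX_ATTEMPTS are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
KEY_LENGTH = 6
MAX_ATTEMPTS = 5  # hypothetical cap so the retry loop cannot run forever

class DuplicateKeyError(Exception):
    """Raised when no unused key could be assigned to the paste."""

def random_key():
    return "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))

def create_paste(contents, metadata_db, object_store, custom_key=None, make_key=random_key):
    """Pick a unique key, save metadata and contents, and return the key."""
    if custom_key is not None:
        # A user-supplied key is never regenerated; a clash is reported back.
        if custom_key in metadata_db:
            raise DuplicateKeyError(f"custom key already taken: {custom_key}")
        key = custom_key
    else:
        for _ in range(MAX_ATTEMPTS):
            key = make_key()
            if key not in metadata_db:   # duplicate check against the metadata store
                break
        else:
            raise DuplicateKeyError("could not find a free key")
    paste_path = f"pastes/{key}"
    metadata_db[key] = {"paste_path": paste_path}   # stands in for the pastes table
    object_store[paste_path] = contents             # stands in for the Object Store
    return key

metadata_db, object_store = {}, {}
key = create_paste("Hello World!", metadata_db, object_store)
print(key, object_store[metadata_db[key]["paste_path"]])
```

In a real deployment the check and the insert would need to be atomic, for example by relying on the primary-key constraint and retrying on a constraint violation.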

* Our application layer will process all incoming and outgoing requests. The application servers will be talking to the backend data store components to serve the requests.
* Upon receiving a write request, our application server will generate a six-letter random string, which will serve as the key of the paste (if the user has not provided a custom key). The application server will then store the contents of the paste and the generated key in the database. After the successful insertion, the server can return the key to the user. One possible problem here is that the insertion fails because of a duplicate key. Since we are generating a random key, there is a possibility that the newly generated key matches an existing one. In that case, we should generate a new key and try again, and keep retrying until the insertion no longer fails due to a duplicate key. If the custom key the user has provided is already present in our database, we should return an error to the user instead.
* Alternatively, we can run a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to store a new paste, we will just take one of the already generated keys and use it. This approach will make things quite simple and fast since we will not be worrying about duplications or collisions. KGS will make sure all the keys inserted in key-DB are unique. KGS can use two tables to store keys, one for keys that are not used yet and one for all the used keys. As soon as KGS gives some keys to an application server, it can move these to the used keys table. KGS can always keep some keys in memory so that whenever a server needs them, it can quickly provide them. As soon as KGS loads some keys in memory, it can move them to the used keys table; this way we can make sure each server gets unique keys. If KGS dies before using all the keys loaded in memory, we will be wasting those keys. We can ignore these keys given that we have a huge number of them.

* Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS; whenever the primary server dies, the standby can take over to generate and provide keys.

* Can each app server cache some keys from key-DB? Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have 68B unique six-letter keys, which is a lot more than we require.
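The KGS idea, including per-server key caching, might look roughly like this. It is a toy sketch: two in-memory sets stand in for the unused and used tables of key-DB, and the class names, pool size, and batch size are hypothetical.

```python
import secrets
import string
import threading

ALPHABET = string.ascii_letters + string.digits

class KeyGenerationService:
    """Pre-generates unique keys and hands each one out at most once."""

    def __init__(self, pool_size, key_length=6):
        self._lock = threading.Lock()
        self._unused = set()   # stands in for the "not used yet" table
        self._used = set()     # stands in for the "used" table
        while len(self._unused) < pool_size:   # the set silently drops duplicates
            self._unused.add("".join(secrets.choice(ALPHABET) for _ in range(key_length)))

    def take_keys(self, count):
        """Hand out up to `count` keys and mark them used immediately."""
        with self._lock:
            batch = [self._unused.pop() for _ in range(min(count, len(self._unused)))]
            self._used.update(batch)
            return batch

class AppServerKeyCache:
    """Each application server keeps a small local batch of keys."""

    def __init__(self, kgs, batch_size=100):
        self._kgs = kgs
        self._batch_size = batch_size
        self._keys = []

    def next_key(self):
        if not self._keys:
            self._keys = self._kgs.take_keys(self._batch_size)
        if not self._keys:
            raise RuntimeError("KGS has no unused keys left")
        return self._keys.pop()

kgs = KeyGenerationService(pool_size=1000)
server_a, server_b = AppServerKeyCache(kgs), AppServerKeyCache(kgs)
print(server_a.next_key(), server_b.next_key())
```

Keys are marked used the moment they leave the KGS, so a server that crashes with keys still in its local batch wastes them but can never cause a duplicate.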

* How does it handle a paste read request? Upon receiving a read paste request, the application service layer contacts the datastore. The datastore searches for the key, and if it is found, returns the paste’s contents. Otherwise, an error code is returned.

* To generate random URL
    - Take the MD5 hash of the user's ip_address + timestamp
        - MD5 is a widely used hashing function that produces a 128-bit hash value
        - MD5 is uniformly distributed
        - Alternatively, we could also take the MD5 hash of randomly-generated data
* Base 62 encode the MD5 hash
    - Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
    - There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
    - Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
    - The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:

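```
import string

# 62 symbols: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base_encode(num, base=62):
    """Encode a non-negative integer as a Base 62 string."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, base)
        digits.append(ALPHABET[remainder])
    return ''.join(reversed(digits))
```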

* Take the first 7 characters of the output, which results in 62^7 possible values 
```
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
```
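A self-contained, runnable version of the one-liner above could look like the following sketch. The function names, the digit ordering of the alphabet, and the string form of the timestamp are assumptions made for illustration.

```python
import hashlib
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols
URL_LENGTH = 7

def base62_encode(num: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def make_shortlink(ip_address: str, timestamp: str) -> str:
    """MD5 the input, read the 128-bit digest as an integer, Base 62 encode it,
    and keep the first URL_LENGTH characters."""
    digest = hashlib.md5((ip_address + timestamp).encode("utf-8")).digest()
    return base62_encode(int.from_bytes(digest, "big"))[:URL_LENGTH]

print(make_shortlink("127.0.0.1", "2024-01-01T00:00:00"))
```

MD5 is used here only to spread inputs over the key space, not for security. Truncation means two different inputs can still map to the same shortlink, so the duplicate check in the write path remains necessary.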

* REST API
```
$ curl -X POST --data '{ "expiration_length_in_minutes": "60", "paste_contents": "Hello World!" }' \
    https://pastebin.com/api/v1/paste
```

* Response

```
{
    "shortlink": "foobar"
}
```

* Rest API for read
```
$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
```

* Response
```
{
    "paste_contents": "Hello World!",
    "created_at": "YYYY-MM-DD HH:MM:SS",
    "expiration_length_in_minutes": "60"
}
```

# Database Layer

![](images/pastebin3.PNG)

* Metadata database: We can use a relational database like MySQL or a Distributed Key-Value store like Dynamo or Cassandra.
* Object storage: We can store our contents in an Object Storage like Amazon’s S3. Whenever we approach full capacity on content storage, we can easily increase it by adding more servers.

* Sharding database:
    - URL hash code % number of servers yields the server where we store the data, which allows us to quickly locate the database that has the file.
    - We can skip the db entirely and let the URL say which server has the file, but when we change the number of servers it will be difficult to redistribute files.
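The modulo scheme and its redistribution problem can be illustrated with a small experiment. This is a sketch with assumed names (shard_for, moved_fraction) and synthetic shortlinks; MD5 is used instead of Python's built-in hash() because the latter is randomized per process for strings.

```python
import hashlib

def shard_for(shortlink: str, num_shards: int) -> int:
    """Map a shortlink to a shard index with hash % num_shards."""
    digest = hashlib.md5(shortlink.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards

def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards))
    return moved / len(keys)

keys = [f"paste{i:05d}" for i in range(10_000)]
print("shard of paste00000 with 4 shards:", shard_for("paste00000", 4))
print(f"4 -> 5 shards: {moved_fraction(keys, 4, 5):.0%} of keys move")
```

Going from 4 to 5 shards is expected to relocate about 80% of the keys, since a key stays put only when its hash gives the same remainder under both moduli. Schemes such as consistent hashing are commonly used to reduce this movement.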

# Analytics

* We want the number of visits, possibly broken down by location and time

* Store raw data vs store data which we need
    - Raw is better, we might need some more info in future

* Simply log each visit to a file and back it up later

* Reading the log file on every stats request is not good; we can have a separate database that stores month and year, URL, and visit count. Every time a URL is visited, update the count.
* The load is not heavy because stats are not displayed on the main page; we can still use a cache if needed

* Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
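The map and reduce steps for monthly hit counts can be sketched in plain Python. The log format used here (an ISO timestamp followed by the requested path) is an assumption for illustration; real web server logs would need their own parser, and a real job would run on a MapReduce framework rather than in one process.

```python
from collections import Counter

def map_visit(log_line: str):
    """Map step: emit ((year_month, url), 1) for one log line."""
    timestamp, url = log_line.split()[:2]
    return (timestamp[:7], url), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted values per (year_month, url) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

logs = [
    "2016-01-03T10:00:00 /foobar",
    "2016-01-15T11:30:00 /foobar",
    "2016-01-20T09:00:00 /bazqux",
    "2016-02-01T08:00:00 /foobar",
]

monthly_hits = reduce_counts(map_visit(line) for line in logs)
for (year_month, url), hits in sorted(monthly_hits.items()):
    print(year_month, url, hits)
```

The resulting (month, URL, count) rows are what the stats view would read.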

![](images/pastebin8.PNG)

* The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.

* An Object Store such as Amazon S3 can comfortably handle the constraint of about 10 GB of new content per day (roughly 300 GB per month).

* To address the 58 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
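A read-through cache with least-recently-used eviction is one common way to serve hot pastes from memory. The sketch below is an assumption about how the Memory Cache might be used: the LRUCache class and read_paste function are hypothetical, and a plain dict stands in for the database and object store.

```python
from collections import OrderedDict

class LRUCache:
    """Small least-recently-used cache keyed by shortlink."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry

def read_paste(shortlink, cache, datastore):
    """Serve from the cache when possible, otherwise load and cache the paste."""
    contents = cache.get(shortlink)
    if contents is None:
        contents = datastore[shortlink]       # raises KeyError for unknown pastes
        cache.put(shortlink, contents)
    return contents

datastore = {"foobar": "Hello World!", "bazqux": "another paste"}
cache = LRUCache(capacity=1)
print(read_paste("foobar", cache, datastore))   # miss, loaded from the datastore
print(read_paste("foobar", cache, datastore))   # hit, served from the cache
```

Since pastes cannot be edited, cached contents never go stale; only expired or deleted pastes would need to be removed from the cache.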

* 12 average paste writes per second (higher at peak) should be doable for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:

    - Federation
    - Sharding
    - Denormalization
    - SQL Tuning
* We should also consider moving some data to a NoSQL Database.