# <font color='blue'> Table Of Contents </font>

## <font color='blue'> MongoDB: Air Quality Problem </font>
## <font color='blue'> Database Design </font>
## <font color='blue'> Access Control </font>
## <font color='blue'> Database Reports </font>
## <font color='blue'> Twitter: Redis Database </font>
## <font color='blue'> Netflix: Apache Cassandra </font>
## <font color='blue'> Neo4j </font>

# <font color='blue'> MongoDB: Air Quality Problem </font>

We extend the discussion of the city Air Quality database started in an earlier session with MySQL. 

Let us implement the same example in MongoDB, and understand the differences between the two databases. 

We collect data by placing sensor devices at different replocations (reporting locations).

A collection of replocations makes up a polygon area on the map.

This database needs to host many kinds of users in the organization, with different roles.

We store the collected data in ```aq_data```.

We periodically aggregate the data, and create reports on different parameters such as location, area and device.

## <font color='blue'> Basic Collections </font>

MongoDB is a document database: every record (document) is inserted into a collection. It supports replication for high availability and sharding for horizontal scaling.

A collection is similar to a table in MySQL.

Each document in a collection is uniquely identified by its ```_id``` field.

```location_id```, ```area_id```, ```user_id``` and ```device_id``` are stored as ```_id``` - in their core collections.  

### <font color='blue'> user </font>

* The ```user``` collection has basic information on users: ```user_name```, ```email```, ```phone_no```, ```address```, etc. 

* ```_id``` is unique and stores the ```user_id```

### <font color='blue'> device </font>

* The ```device``` collection stores information in fields such as ```serial_no```, ```device_type```, ```manufacturer```, etc. ```_id``` stores ```device_id```, ensuring unique collection entries. 

* An additional field maps each device (```_id```) to its current ```location_id```. Each device has a one-to-one relationship with its location.

* We also maintain a record of installation information, such as ```installation_timestamp``` and ```is_active```, for each device. This is stored in the ```device``` collection itself.


### <font color='blue'> replocation: Reporting Location </font>

A collection of identified potential locations on the device installation map. 

* Every ```_id``` (unique location) is tied to a specific ```device_id```.

* An additional field stores the relationship between a location and the areas it is part of.

* An array of ```area_id``` is stored with each document. Area and replocation have a many-to-many relationship. 


### <font color='blue'> area </font>

Areas can be defined amorphously, or strictly as a polygon on the map (outside the scope of this discussion).

* To store the mapping to locations, an additional field holds an array of ```location_id``` values.


### <font color='blue'> aq_data </font>

* Store all data points collected from devices. 

* Store Air quality index data such as pm25, pm10, co, so2 and o3 levels.

### <font color='blue'> Core Infrastructure Data </font>

The ```device```, ```replocation```, ```area```, ```user``` collections are designed to store core data. 

In ```device```, we also store installation data, and ```is_active``` information of each device, along with ```installation_timestamp```. 

In ```replocation```, a group of areas is stored for every ```location_id```.

In the ```area``` collection, we store a group of ```location_id``` in each document.   

Each device should be connected to only one location at a time.
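
The core collections described above can be sketched as plain Python dictionaries, which is also how a driver such as PyMongo would represent documents. This is a minimal sketch under assumptions: the array field names ```area_ids``` and ```location_ids```, and all IDs and values, are hypothetical and only illustrate the shape of the data. The helper checks that the two sides of the many-to-many mapping agree, since MongoDB does not enforce this for us.

```python
from datetime import datetime, timezone

# Hypothetical sample documents; all IDs and values are illustrative.
device = {
    "_id": "dev-001",                # device_id
    "serial_no": "SN-48213",
    "device_type": "pm_sensor",
    "manufacturer": "ExampleCorp",
    "location_id": "loc-101",        # current location (one at a time)
    "installation_timestamp": datetime(2023, 1, 15, tzinfo=timezone.utc),
    "is_active": True,
}

replocations = [
    {"_id": "loc-101", "device_id": "dev-001", "area_ids": ["area-1", "area-2"]},
    {"_id": "loc-102", "device_id": None, "area_ids": ["area-2"]},
]

areas = [
    {"_id": "area-1", "location_ids": ["loc-101"]},
    {"_id": "area-2", "location_ids": ["loc-101", "loc-102"]},
]


def links_are_consistent(replocations, areas):
    """Return True if both sides of the area/location mapping agree."""
    from_locations = {
        (loc["_id"], area_id) for loc in replocations for area_id in loc["area_ids"]
    }
    from_areas = {
        (loc_id, area["_id"]) for area in areas for loc_id in area["location_ids"]
    }
    return from_locations == from_areas


print(links_are_consistent(replocations, areas))
```

Storing the mapping on both sides makes lookups cheap in either direction, at the cost of having to update two documents whenever a location joins or leaves an area.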


### <font color='blue'> Actual Sensor Data </font>
 
This data is stored in the ```aq_data``` collection. We store the air quality index parameters, and the timestamp when that data was received.
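
A single sensor reading might look like the document built below. This is a sketch, not a fixed schema: the helper ```make_reading``` and the field names ```device_id```, ```location_id``` and ```timestamp``` are assumptions, while the pollutant names come from the list above. Only the parameters a device actually reports are stored.

```python
from datetime import datetime, timezone

# Pollutant parameters named in this section.
POLLUTANTS = ("pm25", "pm10", "co", "so2", "o3")


def make_reading(device_id, location_id, received_at, **levels):
    """Build one aq_data document (hypothetical helper and field names)."""
    unknown = set(levels) - set(POLLUTANTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    doc = {
        "device_id": device_id,
        "location_id": location_id,
        "timestamp": received_at,   # when the data was received
    }
    doc.update(levels)              # only the reported parameters are stored
    return doc


reading = make_reading(
    "dev-001",
    "loc-101",
    datetime(2023, 1, 15, 9, 30, tzinfo=timezone.utc),
    pm25=42.0,
    pm10=80.5,
)
print(reading)
```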

### <font color='blue'> Reporting Collections </font>

Reports are generated using the ```aq_data``` collection based on the location, device and area. 

## <font color='blue'> Access Control </font>

Every user has a certain level of access to devices, locations and areas. 

MongoDB stores data as documents in collections, so we can design different groupings/lists to define user access levels, based on different parameters such as:

* Device
* Location
* Area


The three main access patterns are:
* Find all entities where a particular user has access.
* Find if a user has access to a particular entity.
* Find all users that have access to a particular entity.


We can use multikey indexes (indexes on array fields) for faster retrieval based on these access patterns.

One option is to store the access level information in the ```user``` collection itself, with different fields for different entities.

But since the access information is not required as frequently as the basic user information, we can store it in a separate collection (```user_access```) for better performance.

<img src="http://drive.google.com/uc?export=view&id=1aeEo4jIzFLRzPFQoSYsMHGqyiUBKxUP4" width=800px>

We define ```_id``` as ```user_id```, and store its access role for all relevant locations, devices, and areas. 

A ```location_access``` array is maintained for each ```_id```, and it can hold multiple embedded objects.

This is helpful in building a multikey index on the collection.  

Each object in the array stores a ```location_id``` and its ```access_type```. 

```access_type``` can have values Admin/Normal. 

Similarly, we can have two more arrays for ```area``` and ```device```, for a single ```_id``` in the same document.
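
The three access patterns can be sketched in plain Python over sample ```user_access``` documents. The IDs are hypothetical and the helper functions are illustrative, not part of any driver. In MongoDB itself, a filter such as ```{"location_access.location_id": "loc-101"}``` would express the third pattern, and an index on ```location_access.location_id``` would be a multikey index.

```python
# Hypothetical user_access documents, one per user_id.
user_access = [
    {
        "_id": "user-1",
        "location_access": [
            {"location_id": "loc-101", "access_type": "Admin"},
            {"location_id": "loc-102", "access_type": "Normal"},
        ],
    },
    {
        "_id": "user-2",
        "location_access": [
            {"location_id": "loc-101", "access_type": "Normal"},
        ],
    },
]


def locations_for_user(docs, user_id):
    """Pattern 1: all locations a user can access, with the access type."""
    for doc in docs:
        if doc["_id"] == user_id:
            return {e["location_id"]: e["access_type"] for e in doc["location_access"]}
    return {}


def user_has_access(docs, user_id, location_id):
    """Pattern 2: does this user have access to this location?"""
    return location_id in locations_for_user(docs, user_id)


def users_for_location(docs, location_id):
    """Pattern 3: all users with access to a location."""
    return [
        doc["_id"]
        for doc in docs
        if any(e["location_id"] == location_id for e in doc["location_access"])
    ]


print(locations_for_user(user_access, "user-1"))
print(user_has_access(user_access, "user-2", "loc-102"))
print(users_for_location(user_access, "loc-101"))
```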



Alternatively, we can create a single field to store all entities and their respective access levels.

<img src="http://drive.google.com/uc?export=view&id=1_LuP-Ynli1io0anOh1caL1Yh5PrD-wE4" width=800px>

The above collection has a record of each ```_id``` (```user_id```), and its access role. 

We maintain a combined ```entity_list``` array for each ```_id```. 

The idea is to maintain a record of all the permissions for each user, in a single field. 

This makes it easier to search and update the record of any ```_id```.

Every array element will have three fields: ```entity_type```, ```entity_id``` and ```access_type```. 

```entity_type``` and ```entity_id``` identify whether the entry refers to a location, area or device, and which one. 
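
A rough sketch of the combined ```entity_list``` layout follows, again with hypothetical IDs and illustrative helper functions. The point to notice is that a lookup must match ```entity_type``` and ```entity_id``` within the same array element; in a MongoDB query this is what the ```$elemMatch``` operator is for.

```python
# Hypothetical documents using a single combined entity_list per user.
user_access = [
    {
        "_id": "user-1",
        "entity_list": [
            {"entity_type": "location", "entity_id": "loc-101", "access_type": "Admin"},
            {"entity_type": "device", "entity_id": "dev-001", "access_type": "Normal"},
        ],
    },
    {
        "_id": "user-2",
        "entity_list": [
            {"entity_type": "area", "entity_id": "area-1", "access_type": "Normal"},
            {"entity_type": "location", "entity_id": "loc-101", "access_type": "Normal"},
        ],
    },
]


def access_type_for(docs, user_id, entity_type, entity_id):
    """Return the user's access type for one entity, or None if there is none."""
    for doc in docs:
        if doc["_id"] != user_id:
            continue
        for entry in doc["entity_list"]:
            # Both fields must match in the SAME array element.
            if entry["entity_type"] == entity_type and entry["entity_id"] == entity_id:
                return entry["access_type"]
    return None


def users_with_access(docs, entity_type, entity_id):
    """Return the ids of all users with any access to the given entity."""
    return [
        doc["_id"]
        for doc in docs
        if access_type_for(docs, doc["_id"], entity_type, entity_id) is not None
    ]


print(access_type_for(user_access, "user-1", "location", "loc-101"))
print(users_with_access(user_access, "location", "loc-101"))
```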


## <font color='blue'> Database Reports </font>

We can generate reports based on the following parameters:

* Location - Users can ask for reports and alerts for a particular location (East Railway Station).
* Area - Users can ask for reports and alerts in a particular area (Central Business District). This requires aggregating data across multiple locations.

The statistics (over a time period) that users might request:

* Maximum value
* Minimum value
* Average
* 90th percentile value
* 95th percentile value

The time quanta users might be interested in:

* daily
* weekly
* monthly
* yearly
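
As a sketch of how the statistics above could be computed for the daily, location-wise case, the Python below groups raw readings by location and calendar day. It makes several assumptions: readings carry ```location_id``` and ```timestamp``` fields, the output keys mirror the report field names used in this section, and percentiles use the nearest-rank method, which is only one of several common definitions.

```python
import math
from collections import defaultdict
from datetime import datetime


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def daily_location_report(readings, datatype):
    """Aggregate one parameter (e.g. 'pm25') per location and calendar day."""
    buckets = defaultdict(list)
    for r in readings:
        if datatype in r:  # a reading may not carry every parameter
            buckets[(r["location_id"], r["timestamp"].date())].append(r[datatype])
    return [
        {
            "locationid": location_id,
            "datatype": datatype,
            "day": day,
            "minvalue": min(vals),
            "maxvalue": max(vals),
            "avgvalue": sum(vals) / len(vals),
            "90percentile": percentile(vals, 90),
            "95percentile": percentile(vals, 95),
        }
        for (location_id, day), vals in sorted(buckets.items())
    ]


# Hypothetical readings for one location on one day.
readings = [
    {"location_id": "loc-101", "timestamp": datetime(2023, 1, 15, h), "pm25": v}
    for h, v in [(6, 10.0), (9, 20.0), (12, 30.0), (18, 40.0)]
]
report = daily_location_report(readings, "pm25")
print(report[0])
```

Weekly, monthly and yearly reports follow the same pattern with a different bucket key.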


Some potential collection structures for reports:

* aq_data_daily_locationwise(aqdataid, **locationid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, day, year)
* aq_data_weekly_locationwise(aqdataid, **locationid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, week, year)
* aq_data_monthly_areawise(aqdataid, **areaid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, month, year)
* aq_data_monthly_locationwise(aqdataid, **locationid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, month, year)

We can combine the area and location collections by using another field:

* aq_data_daily(aqdataid, **entity_type, entityid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, day, year)
* aq_data_weekly(aqdataid, **entity_type, entityid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, week, year)
* aq_data_monthly(aqdataid, **entity_type, entityid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, month, year)

Here entity_type can be device/location/area.


Unlike columns in MySQL, not every field needs to be present in every document. 

This could lead to another interesting variant where every type of report can be stored in just one collection:

* aq_report_data(aqdataid, **aggregate_type, entity_type, entityid**, datatype, minvalue, maxvalue, avgvalue, 90percentile, 95percentile, **day, week, month**, year)

Here aggregate_type could be daily/weekly/monthly. Based on this type, either day, week, or month will be present in that specific document.
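
The single-collection variant can be sketched as follows. The helper ```make_report_doc``` and the sample values are hypothetical; the field names follow the ```aq_report_data``` structure above. Each document carries only the period field that matches its ```aggregate_type```.

```python
# Which period field a document carries, depending on its aggregate_type.
PERIOD_FIELD = {"daily": "day", "weekly": "week", "monthly": "month"}


def make_report_doc(aggregate_type, entity_type, entity_id, datatype, stats, period, year):
    """Build one aq_report_data document (hypothetical helper)."""
    doc = {
        "aggregate_type": aggregate_type,
        "entity_type": entity_type,    # device / location / area
        "entityid": entity_id,
        "datatype": datatype,
        "year": year,
    }
    doc.update(stats)                  # minvalue, maxvalue, avgvalue, percentiles
    # Only one of day / week / month is present in any given document.
    doc[PERIOD_FIELD[aggregate_type]] = period
    return doc


stats = {"minvalue": 10.0, "maxvalue": 40.0, "avgvalue": 25.0,
         "90percentile": 40.0, "95percentile": 40.0}

daily = make_report_doc("daily", "location", "loc-101", "pm25", stats, 15, 2023)
monthly = make_report_doc("monthly", "area", "area-1", "pm25", stats, 1, 2023)
print(sorted(daily))
print(sorted(monthly))
```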

Indexing would help speed up retrieval in any of these variants.

# <font color='blue'> Twitter: Redis Database </font>

At Twitter, Redis drives the Timeline service: an index of tweets keyed by an ```id```. Chaining tweets together in a list produces the Home Timeline.

##  <font color='blue'> The Network Bandwidth Problem </font>

Memcached did not work as well as Redis for Timeline, due to a fanout problem.

Twitter reads and writes happen incrementally and are fairly small, but the timelines themselves are fairly large, so rewriting a whole timeline for every small update wastes network bandwidth.
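
A back-of-the-envelope sketch of the bandwidth difference is given below. It is a simplified model, not a measurement: it assumes each Tweet ID takes 8 bytes, that a plain key-value store must read and rewrite the whole timeline value to append one ID, and that a list-aware store only needs the new ID sent over the network. The function names and numbers are illustrative.

```python
ID_BYTES = 8  # assumption: each Tweet ID is serialized as 8 bytes


def blob_update_bytes(timeline_len):
    """Read-modify-write: fetch the whole timeline, then write it back with one more ID."""
    read = timeline_len * ID_BYTES
    write = (timeline_len + 1) * ID_BYTES
    return read + write


def list_push_bytes():
    """Incremental update: only the new ID crosses the network."""
    return ID_BYTES


def fanout_bytes(followers, timeline_len, incremental):
    """Bytes transferred to deliver one tweet to every follower's timeline."""
    per_timeline = list_push_bytes() if incremental else blob_update_bytes(timeline_len)
    return followers * per_timeline


# One tweet fanned out to 1000 followers, each with an 800-entry timeline.
print(fanout_bytes(1000, 800, incremental=False))  # whole-value rewrite
print(fanout_bytes(1000, 800, incremental=True))   # list push
```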

## <font color='blue'> Long Common Prefix Problem </font>

An object has attributes which may or may not exist. A separate key can be created for each individual attribute, and every such key repeats the same long prefix. 

We then need to send a separate request for each individual attribute, and not all attributes may be in the cache.
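
The problem can be illustrated with two in-memory layouts, simulated here with plain Python dictionaries. The key format ```user:42:<field>``` and the sample attributes are hypothetical. The second layout groups the attributes under one key, in the spirit of a Redis hash, so the prefix is stored once and a single request returns every attribute that exists.

```python
attributes = {"name": "Ada", "bio": "Engineer", "location": "London"}
PREFIX = "user:42"  # hypothetical object key

# Layout 1: one key per attribute, each repeating the common prefix.
flat_store = {f"{PREFIX}:{field}": value for field, value in attributes.items()}

# Layout 2: one key per object, with the attributes as fields under it.
hash_store = {PREFIX: dict(attributes)}


def flat_fetch(store, prefix, fields):
    """One request per attribute; some attributes may be missing."""
    found, requests = {}, 0
    for field in fields:
        requests += 1
        key = f"{prefix}:{field}"
        if key in store:
            found[field] = store[key]
    return found, requests


def hash_fetch(store, prefix):
    """A single request returns every attribute that exists."""
    return dict(store.get(prefix, {})), 1


def key_bytes_flat(store):
    return sum(len(key) for key in store)


def key_bytes_hash(store):
    return sum(len(key) + sum(len(f) for f in fields) for key, fields in store.items())


print(flat_fetch(flat_store, PREFIX, ["name", "bio", "website"]))
print(hash_fetch(hash_store, PREFIX))
print(key_bytes_flat(flat_store), key_bytes_hash(hash_store))
```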

##  <font color='blue'>  Block Diagram of Twitter timeline </font>

The Timeline is a list of Tweet IDs, that is, a list of small integers.

<img src="http://drive.google.com/uc?export=view&id=1uQlSLAVvnrlRGCQrc1jNShNvjPBgb8xg" width=800px>

<img src="http://drive.google.com/uc?export=view&id=1M2-mzgUl4h4_DJOV6vcMQ2Wcul6BOm5_" width=600px>

## <font color='blue'> References </font>

1. https://www.infoq.com/presentations/Real-Time-Delivery-Twitter/#downloadPdf/
2. http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html


# <font color='blue'> Netflix: Apache Cassandra </font>

Cassandra can scale horizontally and dynamically by adding more servers, without the need to re-shard or reboot.

Netflix uses Cassandra for its scalability, its lack of single points of failure, and its support for cross-regional deployments.

In this deployment, Cassandra uses local quorum consistency for write operations and consistency level one for reads.

For multi-region deployments, the entire cluster is replicated in each region. 
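
To make the consistency levels concrete, the sketch below computes how many replicas must acknowledge an operation. It relies on the standard Cassandra definition of a quorum, floor(RF / 2) + 1, applied to the local datacenter for ```LOCAL_QUORUM``` and to the sum of all replication factors for ```QUORUM```. The region names and replication factors are hypothetical, and the function is illustrative rather than part of any Cassandra driver.

```python
def quorum(replication_factor):
    """A quorum is a strict majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level, rf_by_dc, local_dc):
    """Number of replica acknowledgements needed at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])        # majority in the local datacenter only
    if level == "QUORUM":
        return quorum(sum(rf_by_dc.values()))    # majority across all datacenters
    raise ValueError(f"unsupported consistency level: {level}")


# Hypothetical two-region cluster with three replicas per region.
rf_by_dc = {"us-east": 3, "eu-west": 3}

for level in ("ONE", "LOCAL_QUORUM", "QUORUM"):
    print(level, replicas_required(level, rf_by_dc, "us-east"))
```

With three replicas per region, a local quorum write waits for two replicas in the coordinator's own region, so it does not block on cross-region latency.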

<img src="http://drive.google.com/uc?export=view&id=1Nty2c7hQKy2eksXU5n4TBM_LwUAlMcX3" width=800px>

<img src="http://drive.google.com/uc?export=view&id=1uHK521lTFh0F_OUmJ4__D99k8erj_icJ" width=800px>

## <font color='blue'> References </font>

1. https://www.slideshare.net/VinayKumarChella/how-netflix-manages-petabyte-scale-apache-cassandra-in-the-cloud
2. https://www.youtube.com/watch?v=RMSNLP_ORg8
3. https://www.slideshare.net/acunu/cassandra-eu-2012-netflixs-cassandra-architecture-and-open-source-efforts

# <font color='blue'> Neo4j </font>

## <font color='blue'> eBay </font>

eBay has a huge set of both ecommerce and world knowledge data, from different sources.

They use the Apache Airflow scheduler, together with Google Cloud and Spark, to combine this data into a knowledge graph. 

The data is then modeled as a graph and pushed to Neo4j.

<img src="http://drive.google.com/uc?export=view&id=126GHL7WUt0Y1fdnHnYi3T-EmyOAyZ2wN" width=800px>

<img src="http://drive.google.com/uc?export=view&id=1_52RX1sMgaMtSobcy1xP8iSPkC6HU7RN" width=800px>

## <font color='blue'> Airbnb </font>

Information about the creator of a resource is as important as the resource itself. 

If one knows exactly who creates and updates data, data relationships can be easily maintained.

There are two types of relationship:

* Persistent relationship: a snapshot of the relationship 
* Transient relationship: represents random events in the relationship

Airbnb prefers Neo4j due to the graph-like nature of the data involved. 

Neo4j also integrates well with Python and Elasticsearch.

<img src="http://drive.google.com/uc?export=view&id=1YWPkpCwk7_k_NaQcfW5bl1qrW5E31GXO" width=800px>

<img src="http://drive.google.com/uc?export=view&id=1QvMJWUibCIvbmYGffBEHpEuYEfjsAl3N" width=800px>

## <font color='blue'> References </font>

1. https://neo4j.com/blog/ebay-shopbot-graph-powered-conversational-commerce/
2. https://www.slideshare.net/neo4j/graphconnect-europe-2017-democratizing-data-at-airbnb