* Relational databases have
    - Scalability problems.
    - Slow joins on large data sets.
    - Consistency that depends on transactions, which require locking some portion of the database.
 
* Vertical scaling
    - Add memory, a more powerful processor, and more storage capacity to the existing machine.
* Horizontal scaling
    - Add more machines; replication and consistency problems arise.
* To access data quickly we need to duplicate it (denormalise).
* An RDBMS supports transactions: changes are applied tentatively, and the user can roll them back if something goes wrong.
    - A transaction is a transformation of state that has the ACID properties.
    - Atomic: all or nothing. Every update in the transaction must succeed for it to count as completed. Example: a transfer of money between two accounts.
    - Consistent: data moves from one correct state to another correct state. Users should not see different values for the same data.
    - Isolated: transactions executed concurrently should not interfere with each other; each runs in its own space. If two transactions update the same data, one must wait until the other finishes.
    - Durable: once a transaction completes, its changes persist, even after a failure.
    - If an RDBMS is distributed across multiple nodes, it needs a transaction manager to keep the ACID properties. The usual process is two-phase commit (2PC), which locks every resource that needs updating, so other clients have to wait.
    - Another approach is compensation, which is useful in web applications: the transaction is committed immediately, and if an error happens a compensating transaction is executed to revert its effect.
    - 2PC introduces loss of availability and higher latency during partial failures.
* Because a strict schema spreads data across normalised tables, an RDBMS often needs complex JOINs, which are slow.
* Sharding and shared-nothing architecture:
    - Split the data across servers instead of holding it all on one, to support fast access. We need a good key to shard by. Examples: 26 machines, each storing the data for one starting letter; or divide by region, by phone number, or by member-since date.
    - Feature-based (functional) sharding: put features that do not overlap in separate databases. For example, the customer table on one node, items on another, comments on a third.
    - Key-based sharding: as described above.
    - Lookup table: one node records where each piece of data lives. That node is a single point of failure and receives a large load of requests.
    - This is a shared-nothing architecture: there is no centralized state, each node is independent, and clients do not contend for shared resources. The advantage is that we can scale out as much as we want. Examples: Cassandra, MapReduce, Bigtable.
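
Key-based sharding can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular database does it: the function names, the choice of MD5 as a stable hash, and the four-shard layout are all assumptions made for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def shard_by_first_letter(name: str) -> int:
    """The 26-machine example: one shard per starting letter a-z."""
    first = name[:1].lower()
    if not ("a" <= first <= "z"):
        raise ValueError("name must start with a letter a-z")
    return ord(first) - ord("a")


# Distribute a few keys over four hypothetical shards.
shards = {i: [] for i in range(4)}
for user in ["purvil", "japan", "bhavika", "kamil"]:
    shards[shard_for(user, 4)].append(user)
```

Hashing the key spreads data more evenly than the first-letter scheme, where popular letters would overload their machines.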

### NoSQL
* Key-value stores
    - Each data item is a key with a set of attributes. All data relevant to a key is stored with that key.
    - Amazon DynamoDB; caching technologies like Redis
* Column stores
    - Cassandra, Bigtable, HBase
* Document stores
    - The basic unit of storage is an entire document, stored as JSON or XML.
    - MongoDB, CouchDB
* Graph databases:
    - A network of nodes and the edges that connect them. Both nodes and edges can have properties.
    - Neo4j, FlockDB
* Object database:
* XML database:

* The main advantages of NoSQL are availability and horizontal scalability: the database is distributed without central control.
* Cassandra is
    - Distributed
        - Capable of running on multiple machines while appearing to users as a unified whole.
    - Decentralized
        - Every node is identical; there is no special node that performs management tasks.
        - It uses a peer-to-peer protocol and gossip to maintain and keep in sync the list of nodes that are alive.
        - No single point of failure. In contrast, Bigtable and MySQL use a master/slave mechanism: we keep multiple copies of the data, the master sends update commands to the slaves, and each slave updates itself. MongoDB also has a master/slave architecture. Because Cassandra avoids this, it is highly available.
    - Elastically scalable
        - Vertical scaling: adding more hardware capability to an existing machine.
        - Horizontal scaling: multiple machines hold all or some of the data, so the load is distributed among them. With elastic scalability the cluster scales up and down as needed: we can add a node at any time without disturbing running processes, and remove one in the same way.
    - Highly available and fault-tolerant
        - Availability is a measure of the system's ability to fulfil requests.
        - We can replace a failed node with no downtime, and replicate data to multiple data centers to improve availability.
    - Tuneably consistent
        - Consistency means a read always returns the most recently written value. Example: one item is left in stock and two customers try to buy it.
        - Cassandra is eventually consistent by default: it trades some consistency to achieve availability, and lets the client tune the trade-off.
        - Strict consistency: a read will ALWAYS return the most recent data. In a distributed system this needs a global clock that timestamps each operation, and all replicas must be locked until the operation completes.
        - Causal consistency: related (causally dependent) writes must be read in sequence.
        - Weak (eventual) consistency: an update propagates to all replicas in the distributed system, but this may take some time. Eventually every replica will be consistent.
        - Replication factor: the number of nodes in the cluster that an update should propagate to, i.e. the number of copies of each row.
        - Consistency level: specified by the client on EVERY operation; it sets how many replicas must acknowledge the operation for it to count as successful.
    - Row-oriented
        - Cassandra is a partitioned row store.
        - A given row can have one or more columns; unlike in a relational database, a row does not need to have every column.
        - Partitioned means each row has a unique key that makes its data accessible, and that key is used to distribute rows across multiple data stores.
* Cassandra takes its distributed design from Amazon's Dynamo and its data model from Google's Bigtable.
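
The interplay between replication factor and consistency level can be made concrete with a little arithmetic. The sketch below assumes the commonly quoted rule that a read is guaranteed to see the latest write when the number of replicas written plus the number of replicas read exceeds the replication factor. The helper names are hypothetical, and only three of Cassandra's consistency levels (`ONE`, `QUORUM`, `ALL`) are modelled.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor
```

With a replication factor of 3, writing and reading at `QUORUM` touches 2 + 2 replicas, so at least one replica in every read has the latest value; `ONE` plus `ONE` gives no such guarantee.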

### CAP (Brewer's) theorem
* Consistency: all database clients read the same value for the same query, even under concurrent updates.
* Availability: all database clients are always able to read and write.
* Partition tolerance: the database can be split across multiple machines and keeps working even in the face of network partitions.
* At any given time a system can guarantee at most two of the three.
* CA:
    - Uses two-phase commit for distributed transactions. The system blocks when a network partition occurs, so it is limited to a single data center cluster.
    - Relational databases
* CP:
    - Shards data to scale. Data stays consistent, but becomes unavailable when a node fails.
    - Neo4j, Bigtable, MongoDB, HBase, Redis
* AP:
    - May sometimes return inaccurate data, but is always available.
    - Amazon DynamoDB, Cassandra, CouchDB
    
### Schema
* Cassandra has a flexible schema.
* It used to be schema-free. Without a schema it is difficult to determine the structure of the data and to perform complex queries.
* CQL allows us to define a schema. Dynamic columns could also be created using the Thrift-based API.
* CQL collections such as lists, sets and maps provide a way to add content in a less structured way. We can also change the type of a column in certain instances, and store JSON.

-----------
* Cassandra is well suited to write-intensive applications such as logging and user activity updates.

### Start cassandra
* Start server
![](images/cassandra7.jpg)
* Start client
![](images/cassandra8.jpg)

### `cqlsh`
* `HELP`: for help
![](images/cassandra.jpg)

### Keyspace
* A keyspace is like a database in a relational system. It contains one or more tables (column families).
![](images/cassandra1.jpg)
![](images/cassandra2.jpg)

In [2]:
from cassandra.cluster import Cluster

In [3]:
cluster = Cluster(['127.0.0.1'])

* We only need to pass the initial contact points. After the driver connects to one of those nodes, it automatically discovers the other nodes in the cluster and connects to them.

In [4]:
session = cluster.connect('my_keyspace')  # Establish a connection with a default keyspace

* 'my_keyspace' will be the default keyspace for this session. To change it, use either of the following:

```
session.set_keyspace('new_keyspace')
# or
session.execute('USE new_keyspace')
```

In [8]:
session.execute("""
        INSERT INTO user (first_name, last_name) VALUES ('purvil', 'dave')
    """)

InvalidRequest: Error from server: code=2200 [Invalid query] message="unconfigured table user"

![](images/cassandra3.jpg)
![](images/cassandra4.jpg)

![](images/cassandra6.jpg)

* In the relational model, the database is the outermost container. It contains tables; each table has a name and several named columns. If a row has no value for a column, we store null.
* In Cassandra, if a row has no value for a column we simply do not store it. A row can be wide or skinny depending on the number of columns it has; a wide row might have millions of columns.

![](images/cassandra_rows.png)

#### Cassandra wide rows
![](images/cassandra_wide_row.png)

* Cassandra uses a composite primary key to represent wide rows. It consists of a partition key and an optional set of clustering columns.
* We cannot change the primary key after the table is created, because it controls how the data is stored in the cluster.
* The partition key determines the node on which a row is stored.
* A static column stores data that is not part of the primary key but is shared by every row in a partition.
* The partition key identifies each partition uniquely. The clustering key identifies rows within the partition.

* Column: a name/value pair.
* Row: a container for columns, referenced by a primary key. It holds an ordered collection of columns.
* Table: a container for an ordered collection of rows.
* Keyspace: a container for tables.
* Cluster: a container for keyspaces that spans one or more nodes. Also known as a ring, because Cassandra assigns data to nodes in the cluster by arranging them in a ring.
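
To make the roles of the partition key and clustering columns concrete, here is a small in-memory model. It is a simplified sketch, not Cassandra's storage engine: the `WideRowTable` class and the sample keys are hypothetical, and node placement is left out entirely.

```python
from collections import defaultdict


class WideRowTable:
    """Toy table: rows grouped by partition key, sorted by clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, columns), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, **columns):
        rows = self._partitions[partition_key]
        for i, (existing_key, existing_cols) in enumerate(rows):
            if existing_key == clustering_key:
                # Same primary key: merge columns, like an upsert.
                rows[i] = (existing_key, {**existing_cols, **columns})
                return
        rows.append((clustering_key, dict(columns)))
        rows.sort(key=lambda row: row[0])  # clustering order within partition

    def partition(self, partition_key):
        """Return all rows of one partition in clustering order."""
        return list(self._partitions.get(partition_key, []))


table = WideRowTable()
table.insert("eiffel_tower", "h2", name="Hotel B")
table.insert("eiffel_tower", "h1", name="Hotel A")
table.insert("eiffel_tower", "h1", phone="555-0100")
```

Reading `table.partition("eiffel_tower")` touches a single partition and returns its rows already ordered by the clustering key, which is the access pattern wide rows are designed for.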

In [5]:
session.execute("""
    CREATE TABLE user (first_name text, last_name text, 
                       PRIMARY KEY (first_name));
""")

<cassandra.cluster.ResultSet at 0x1f55c141da0>

In [6]:
session.execute("""
    INSERT INTO user (first_name, last_name) VALUES ('purvil', 'dave')
""")

<cassandra.cluster.ResultSet at 0x1f55bd615c0>

In [9]:
session.execute("""
    INSERT INTO user (first_name, last_name) VALUES ('japan', 'dave');
    INSERT INTO user (first_name, last_name) VALUES ('bhavika', 'joshi');
""")

<cassandra.cluster.ResultSet at 0x1f55c642128>

In [10]:
session.execute("""
    INSERT INTO user (first_name, last_name) VALUES ('bhavika', 'joshi');
""")

<cassandra.cluster.ResultSet at 0x1f55c6423c8>

In [16]:
rows = session.execute('SELECT * FROM user')

In [17]:
for row in rows:
    print(row)

Row(first_name='japan', last_name='dave')
Row(first_name='purvil', last_name='dave')
Row(first_name='bhavika', last_name='joshi')


In [18]:
rows = session.execute('SELECT * FROM user')

In [19]:
for row in rows:
    print(row.first_name)

japan
purvil
bhavika


In [20]:
rows = session.execute('SELECT * FROM user')

In [21]:
for (first, last) in rows:
    print(first, last)

japan dave
purvil dave
bhavika joshi


In [22]:
rows = session.execute('SELECT * FROM user')

In [23]:
for row in rows:
    print(row[0])

japan
purvil
bhavika


#### Passing parameters

In [24]:
session.execute("""
    INSERT INTO user (first_name, last_name) VALUES (%s, %s)
""", ('kamil', 'patel'))

<cassandra.cluster.ResultSet at 0x1f55c647b00>

* The `%s` placeholder is used for EVERY argument type.

In [26]:
session.execute("""INSERT INTO user (first_name) VALUES (%s)""", ('harshil',))

<cassandra.cluster.ResultSet at 0x1f55c38ca58>

* Parameters must be passed as a sequence, so a single value needs a one-element tuple such as `('harshil',)`.

In [27]:
rows = session.execute('SELECT * FROM user')

In [28]:
for row in rows:
    print(row)

Row(first_name='japan', last_name='dave')
Row(first_name='purvil', last_name='dave')
Row(first_name='kamil', last_name='patel')
Row(first_name='harshil', last_name=None)
Row(first_name='bhavika', last_name='joshi')


In [29]:
session.execute("""ALTER TABLE user ADD title text""")

<cassandra.cluster.ResultSet at 0x1f55c651d68>

In [31]:
rows = session.execute('SELECT * FROM user')

In [32]:
for row in rows:
    print(row)

Row(first_name='japan', last_name='dave', title=None)
Row(first_name='purvil', last_name='dave', title=None)
Row(first_name='kamil', last_name='patel', title=None)
Row(first_name='harshil', last_name=None, title=None)
Row(first_name='bhavika', last_name='joshi', title=None)


### Timestamps 

* Cassandra records a timestamp for each column value whenever it is written. When writes conflict, the one with the most recent timestamp wins.
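
The "most recent timestamp wins" rule can be sketched as a merge of two replicas' views of the same row. This is a simplified model under stated assumptions: each cell is a `(value, write_time_in_microseconds)` pair, the `merge` function is hypothetical, and ties simply keep the first replica's cell, which is not necessarily how Cassandra breaks ties.

```python
def merge(replica_a: dict, replica_b: dict) -> dict:
    """Merge two column -> (value, write_time) maps, newest write wins."""
    merged = dict(replica_a)
    for column, cell in replica_b.items():
        # Take the other replica's cell only if it is strictly newer.
        if column not in merged or cell[1] > merged[column][1]:
            merged[column] = cell
    return merged


# Two replicas disagree about last_name; timestamps are illustrative.
replica_1 = {"last_name": ("patel", 1555003083521009)}
replica_2 = {"last_name": ("mane", 1555003083521555), "title": ("Mr", 1)}
resolved = merge(replica_1, replica_2)
```

Because resolution happens per column, a row can end up with values that came from different writes.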

In [33]:
rows = session.execute("""SELECT first_name, writetime(last_name) FROM user""")

In [35]:
for row in rows:
    print(row)

Row(first_name='japan', writetime_last_name=1555002558416269)
Row(first_name='purvil', writetime_last_name=1555001358641094)
Row(first_name='kamil', writetime_last_name=1555003083521009)
Row(first_name='harshil', writetime_last_name=None)
Row(first_name='bhavika', writetime_last_name=1555002571475594)


* `writetime()` can NOT be used on primary key columns.

* We can also specify the timestamp manually in a write. Make sure it is higher than the existing one, otherwise the update will not take effect.

In [36]:
session.execute("""
    UPDATE user USING timestamp 1555003083521555
    SET last_name = 'mane' WHERE first_name = 'kamil';
""")

<cassandra.cluster.ResultSet at 0x1f55ca82a90>

In [37]:
rows = session.execute("SELECT first_name, last_name, writetime(last_name) FROM user")

In [38]:
for row in rows:
    print(row)

Row(first_name='japan', last_name='dave', writetime_last_name=1555002558416269)
Row(first_name='purvil', last_name='dave', writetime_last_name=1555001358641094)
Row(first_name='kamil', last_name='mane', writetime_last_name=1555003083521555)
Row(first_name='harshil', last_name=None, writetime_last_name=None)
Row(first_name='bhavika', last_name='joshi', writetime_last_name=1555002571475594)


### Time to live (TTL)
* TTL expires data that is no longer needed. It works at the level of individual column values and tells Cassandra how long to retain that value.
![](images/cassandra_ttl.jpg)

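* TTL cannot be set on primary key columns, and there is no separate row-level TTL: a TTL given with `USING TTL` on an `INSERT` or `UPDATE` applies to the non-primary-key columns written by that statement.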
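
The per-column nature of TTL can be illustrated with a small model of an expiring cell. This is a sketch only: the `TtlCell` class is hypothetical, and the `now` parameter is an assumption added so that the clock can be controlled in examples instead of waiting for real time to pass.

```python
import time


class TtlCell:
    """A single column value with an optional time to live, in seconds."""

    def __init__(self, value, ttl_seconds=None, now=None):
        written_at = time.time() if now is None else now
        self.value = value
        # None means the value never expires.
        self.expires_at = None if ttl_seconds is None else written_at + ttl_seconds

    def read(self, now=None):
        """Return the value, or None once the TTL has elapsed."""
        current = time.time() if now is None else now
        if self.expires_at is not None and current >= self.expires_at:
            return None
        return self.value

    def remaining_ttl(self, now=None):
        """Seconds left before expiry, or None if no TTL or already expired."""
        current = time.time() if now is None else now
        if self.expires_at is None or current >= self.expires_at:
            return None
        return int(self.expires_at - current)


last_name = TtlCell("dave", ttl_seconds=3600, now=1000)
first_name = TtlCell("purvil", now=1000)  # no TTL, like a key column
```

After the hour has passed, `last_name.read()` yields `None` while `first_name` is still readable, which mirrors how an expired column reads as null while the rest of the row remains.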

## CQL Types

### Numeric Types
* `int`: 32-bit signed integer
* `bigint`: 64-bit signed integer
* `smallint`: 16-bit signed integer
* `tinyint`: 8-bit signed integer
* `varint`: variable-precision signed integer
* `float`: 32-bit floating point
* `double`: 64-bit floating point
* `decimal`: variable-precision decimal

### Textual data types
* `text`, `varchar`: UTF-8 (Unicode) character string
* `ascii`: ASCII character string

### Time and Identity types
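* `timestamp`: every column value already carries a last-modified timestamp, but we can also store a timestamp in a column of its own. It is encoded as a 64-bit signed integer.
* `date`, `time`: a date without a time of day, and a time of day without a date.
* `uuid`: universally unique identifier, 128 bits. `uuid()` generates a Type 4 (random) UUID.
* `timeuuid`: a Type 1 (time-based) UUID. `now()` generates one.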


In [39]:
session.execute('ALTER TABLE user ADD id uuid;')

<cassandra.cluster.ResultSet at 0x1f55c8eb6a0>

In [41]:
session.execute("UPDATE user SET id = uuid() WHERE first_name = 'purvil'")

<cassandra.cluster.ResultSet at 0x1f55cb12518>

In [42]:
rows = session.execute('SELECT * FROM user')

In [43]:
for row in rows:
    print(row)

Row(first_name='japan', id=None, last_name='dave', title=None)
Row(first_name='purvil', id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', title=None)
Row(first_name='kamil', id=None, last_name='mane', title=None)
Row(first_name='harshil', id=None, last_name=None, title=None)
Row(first_name='bhavika', id=None, last_name='joshi', title=None)


* `boolean`: true/false
* `blob`: binary large object, an arbitrary array of bytes, e.g. for storing media files. To store text data as a blob use `textAsBlob()`.
* `inet`: IPv4 or IPv6 address
* `counter`: 64-bit signed integer. Its value cannot be set directly, only incremented or decremented. Useful for counts such as page views or retweets. A counter cannot be part of the primary key, and if a table has a counter column then every column outside the primary key must be a counter.

### Collections

#### set:
* An unordered collection of elements. cqlsh returns the elements in sorted order.


In [44]:
session.execute('ALTER TABLE user ADD emails set<text>;')

<cassandra.cluster.ResultSet at 0x1f55c1dd438>

In [48]:
session.execute("UPDATE user SET emails = {'davepurvil@gmail.com'} WHERE first_name = 'purvil'")

<cassandra.cluster.ResultSet at 0x1f55ca6e048>

In [59]:
rows = session.execute("""SELECT * FROM user WHERE first_name = 'purvil';""")

In [60]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', title=None)]


In [62]:
session.execute("""UPDATE user SET emails = emails + {'dpurvil@gmail.com'} WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55ca71be0>

In [63]:
rows = session.execute("""SELECT * FROM user WHERE first_name = 'purvil';""")

In [64]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', title=None)]


* To remove an element from a set use `SET emails = emails - {'dpurvil@gmail.com'}`; to clear the whole set use `SET emails = {}`.

#### list:
* An ordered collection: values are stored by position, in the order they were inserted.

In [65]:
session.execute("""ALTER TABLE user ADD phones list<text>""")

<cassandra.cluster.ResultSet at 0x1f55c1d39e8>

In [66]:
session.execute("""UPDATE user SET phones = ['8582848142'] WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55c6459b0>

In [68]:
rows = session.execute("SELECT * FROM user WHERE first_name = 'purvil'")

In [69]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', phones=['8582848142'], title=None)]


In [70]:
session.execute("""UPDATE user SET phones = phones + ['6196458947'] WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55ca8da90>

In [71]:
rows = session.execute("SELECT * FROM user WHERE first_name = 'purvil'")

In [72]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', phones=['8582848142', '6196458947'], title=None)]


In [73]:
session.execute("""UPDATE user SET phones[1] = '6193056075' WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55c1dd9e8>

In [74]:
rows = session.execute("SELECT * FROM user WHERE first_name = 'purvil'")

In [75]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', phones=['8582848142', '6193056075'], title=None)]


In [76]:
session.execute("""DELETE phones[1] from user WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55cb43630>

In [77]:
rows = session.execute("SELECT * FROM user WHERE first_name = 'purvil'")

In [78]:
print([*rows])

[Row(first_name='purvil', emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', phones=['8582848142'], title=None)]


#### map
* A collection of key/value pairs. Keys and values can be of any type except `counter`.

In [79]:
session.execute("""ALTER TABLE user ADD login map<timeuuid, int>;""")

<cassandra.cluster.ResultSet at 0x1f55c1e0f98>

In [81]:
session.execute("""UPDATE user SET login = {now():13, now():18} WHERE first_name = 'japan'""")

<cassandra.cluster.ResultSet at 0x1f55c3920b8>

In [82]:
rows = session.execute("SELECT * FROM user WHERE first_name = 'japan'")

In [83]:
[*rows]

[Row(first_name='japan', emails=None, id=None, last_name='dave', login=OrderedMapSerializedKey([(UUID('0984f5c0-5c92-11e9-b6d5-bf15282b75bb'), 13), (UUID('0984f5c1-5c92-11e9-b6d5-bf15282b75bb'), 18)]), phones=None, title=None)]

### User defined type

In [84]:
session.execute("""CREATE TYPE address (
                                        street text,
                                        city text,
                                        state text,
                                        zip_code int);""")

<cassandra.cluster.ResultSet at 0x1f55c1e0438>

In [86]:
session.execute("""ALTER TABLE user ADD addresses map<text, frozen<address>>;""")

<cassandra.cluster.ResultSet at 0x1f55cb51da0>

In [88]:
session.execute("""UPDATE user SET addresses = addresses + {'home':{street:'11145 Camino', city: 'SD', state:'CA', zip_code:92126}} WHERE first_name = 'purvil'""")

<cassandra.cluster.ResultSet at 0x1f55ce25208>

![](images/describe.jpg)

### Secondary Index
* Filtering on a non-primary-key column is not allowed by default.
* To do so, create a secondary index.

In [89]:
session.execute("""CREATE INDEX ON user (last_name);""")

<cassandra.cluster.ResultSet at 0x1f55ce25320>

In [90]:
rows = session.execute("""SELECT * FROM user WHERE last_name = 'dave'""")

In [91]:
[*rows]

[Row(first_name='japan', addresses=None, emails=None, id=None, last_name='dave', login=OrderedMapSerializedKey([(UUID('0984f5c0-5c92-11e9-b6d5-bf15282b75bb'), 13), (UUID('0984f5c1-5c92-11e9-b6d5-bf15282b75bb'), 18)]), phones=None, title=None),
 Row(first_name='purvil', addresses=OrderedMapSerializedKey([('home', address(street='11145 Camino', city='SD', state='CA', zip_code=92126))]), emails=SortedSet(['davepurvil@gmail.com', 'dpurvil@gmail.com']), id=UUID('22addefc-0998-430d-9279-a63bfe7de7d4'), last_name='dave', login=None, phones=['8582848142'], title=None)]

* We can index set, list and map columns too. To index the keys of a map use `KEYS(map)`.
* `DROP INDEX user_last_name_idx` drops the index.

* Secondary indexes are not recommended for
    - Columns with high cardinality. Indexing on address is expensive because most of the values are unique.
    - Columns with very low cardinality. The index rows become huge; imagine indexing on title.
    - Columns that are frequently updated or deleted.

* For optimal read performance, a denormalized table design or a materialized view is preferable.

#### SSTable attached secondary index (SASI)
* A SASI index is calculated and stored as part of each SSTable file, whereas a regular secondary index is stored in a separate hidden table.

In [93]:
session.execute("""CREATE CUSTOM INDEX user_last_name_sasi_idx ON user (last_name) USING 'org.apache.cassandra.index.sasi.SASIIndex'""")

<cassandra.cluster.ResultSet at 0x1f55ca71748>

* With SASI we can do inequality searches (`>`, `<`) and use the `LIKE` keyword.

### Hotel reservation model
![](images/hotel.png)

#### RDBMS design:
* We model the data as normalized tables and use foreign keys to reference related data in other tables.
* There may be a couple of join tables to represent many-to-many relationships.
![](images/RDBMS_hotel.png)

* We can NOT perform joins in Cassandra. Either do the join on the client side or create a denormalized table (preferable).
* No referential integrity. We can store IDs that refer to rows in other tables, but there are NO cascading deletes.
* Denormalization: JOINs are expensive on large data sets, so we denormalize data for frequent queries. In a relational database this violates Codd's normal forms, but in Cassandra denormalization is normal.
* Query-first design: in an RDBMS we model the database around the data. We use domains, attributes, real-world entities, primary keys, foreign keys, and join tables for many-to-many relationships. If the model is right, we can get any data using complex subqueries and JOINs.
* In Cassandra we model the queries first and let the data be organized around them. Think about the most common query paths the application will use, and create tables to accommodate them.
* Physical storage: in an RDBMS we rarely care how data is stored in a table, because there is a standard way that every table follows. In Cassandra each table is stored in separate files on disk, so keep related columns in the same table. We also want to minimize the number of partitions that must be searched to answer a query. A partition is a unit of storage that is not divided across nodes, so searching fewer partitions means higher performance.
* Sorting: in a relational database we can ORDER BY any column, but in Cassandra we must declare the clustering columns at CREATE TABLE time, and we can order data only by those columns.

#### Defining application queries
* Q1. Find hotels near a given point of interest.
* Q2. Find information about a given hotel, such as its name and location.
* Q3. Find points of interest near a given hotel.
* Q4. Find an available room in a given date range.
* Q5. Find the rate and amenities for a room.
* Q6. Lookup a reservation by confirmation number.
* Q7. Lookup a reservation by hotel, date, and guest name.
* Q8. Lookup all reservations by guest name.
* Q9. View guest details.

![](images/hotel_cassandra.png)

* Each box accomplishes a task that unlocks the subsequent steps. With Q1 we get the IDs of hotels near a POI. Using a hotel ID we can fetch the hotel's information in Q2. Booking a room causes writes to the reservation and guest records.

#### Logical data model
* We create a table for each query, capturing the entities and relationships from the conceptual model above.
* We name each table after its primary entity. To build the primary key, we choose the partition key from the attributes the query requires, then add clustering columns to guarantee uniqueness and support the desired sort order. We add further attributes if the query needs them. If an additional attribute has the same value for every row of a partition, we make that column static.

![](images/chebotko.png)

![](images/physical.png)

#### Hotel logical

![](images/hotel_logical.png)

#### Hotel physical

![](images/hotel_physical.png)

* There are no dedicated rooms or amenities tables as there would be in a relational database.
* In Q1 we know the name of the POI, so we use it as the partition key. Many hotels can be near one POI, so to give each hotel its own row within the partition we use hotel_id as a clustering column.

In [94]:
session.execute("""CREATE KEYSPACE hotel WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};""")

<cassandra.cluster.ResultSet at 0x1f55cb650f0>

In [96]:
session.set_keyspace('hotel')

In [97]:
session.execute("""
        CREATE TYPE address (
            street text,
            city text,
            state text,
            zip_code text,
            country text
        );
""")

<cassandra.cluster.ResultSet at 0x1f55cb51c50>

In [102]:
session.execute("""DROP TABLE hotels_by_poi""")

<cassandra.cluster.ResultSet at 0x1f55cb51ef0>

In [103]:
session.execute("""
        CREATE TABLE hotels_by_poi(
            poi_name text,
            hotel_id text,
            name text,
            phone text,
            address frozen<address>,
            PRIMARY KEY ((poi_name), hotel_id)
        ) WITH comment = 'Q1 Find hotels near given poi' AND CLUSTERING ORDER BY (hotel_id ASC);
""")

<cassandra.cluster.ResultSet at 0x1f55ce17b70>

* From Q1 we already know the hotel_id, so in Q2 we can fetch the hotel's data directly by ID.

In [104]:
session.execute("""
    CREATE TABLE hotels (
        id text PRIMARY KEY,
        name text,
        phone text,
        address frozen<address>,
        pois set<text>
    ) WITH comment = 'Q2. Find information about a hotel';
""")

<cassandra.cluster.ResultSet at 0x1f55ce17ac8>

* Q3 is the reverse of Q1.

In [105]:
session.execute("""
    CREATE TABLE pois_by_hotel (
        poi_name text,
        hotel_id text,
        description text,
        PRIMARY KEY ((hotel_id), poi_name)
    ) WITH comment = 'Q3. Find pois near a hotel';
""")

<cassandra.cluster.ResultSet at 0x1f55cb605f8>

* In Q4 we have the hotel ID and want to find available rooms.
* hotel_id is the partition key, which groups all rooms of a single hotel in a single partition.
* date and room_number are the clustering columns.

In [106]:
session.execute("""
    CREATE TABLE available_rooms_by_hotel_date (
        hotel_id text,
        date date,
        room_number smallint,
        is_available boolean,
        PRIMARY KEY ((hotel_id), date, room_number)
    ) WITH comment = 'Q4. Find available rooms by hotel / date';

""")

<cassandra.cluster.ResultSet at 0x1f55c9019b0>

In [107]:
session.execute("""
    CREATE TABLE hotel.amenities_by_room (
        hotel_id text,
        room_number smallint,
        amenity_name text,
        description text,
        PRIMARY KEY ((hotel_id, room_number), amenity_name)
    ) WITH comment = 'Q5. Find amenities for a room';

""")

<cassandra.cluster.ResultSet at 0x1f55ce52fd0>

#### Reservation logical data model

![](images/reservation_logical.png)

#### Reservation Physical

![](images/reservation_physical.png)

* The same data is accessed in different ways with different keys.
* We can accomplish this with denormalized tables or materialized views.
* A materialized view stores a precomputed view of a base table and supports queries on an additional column that is not part of the base table's primary key.
* When we create denormalized tables ourselves, it is our responsibility to keep them in sync. Materialized views are kept in sync with the base table by Cassandra, so each write to the base table pays a performance cost.
* Because confirm_number has high cardinality, reservations_by_confirmation is a great candidate for a materialized view.
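
When we maintain denormalized tables ourselves, every write has to touch each copy. The sketch below models that responsibility with two in-memory dictionaries standing in for tables. It is illustrative only: the dictionary layouts, the function names, and the sample reservation are assumptions, and a real application would issue CQL writes (possibly in a batch) instead.

```python
from collections import defaultdict

# (hotel_id, start_date) -> {room_number: reservation}
by_hotel_date = defaultdict(dict)
# guest_last_name -> {confirm_number: reservation}
by_guest = defaultdict(dict)


def add_reservation(reservation: dict) -> None:
    """Write the same reservation to both query-specific tables."""
    key = (reservation["hotel_id"], reservation["start_date"])
    by_hotel_date[key][reservation["room_number"]] = reservation
    by_guest[reservation["guest_last_name"]][reservation["confirm_number"]] = reservation


def cancel_reservation(reservation: dict) -> None:
    """Remove the reservation from both tables so they stay in sync."""
    key = (reservation["hotel_id"], reservation["start_date"])
    by_hotel_date[key].pop(reservation["room_number"], None)
    by_guest[reservation["guest_last_name"]].pop(reservation["confirm_number"], None)


booking = {
    "hotel_id": "AZ123", "start_date": "2019-04-12", "room_number": 101,
    "confirm_number": "RS2G0Z", "guest_last_name": "dave",
}
add_reservation(booking)
```

Forgetting one of the two writes in either function is exactly the kind of drift that a materialized view avoids, at the price of slower writes to the base table.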

In [108]:
session.execute("""
    CREATE KEYSPACE reservation
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
""")

<cassandra.cluster.ResultSet at 0x1f55ce520b8>

In [109]:
session.set_keyspace('reservation')

In [110]:
session.execute("""
    CREATE TYPE address (
        street text,
        city text,
        state_or_province text,
        postal_code text,
        country text
    );
""")

<cassandra.cluster.ResultSet at 0x1f55ceb4d68>

In [113]:
session.execute("""
    CREATE TABLE reservations_by_hotel_date (
        hotel_id text,
        start_date date,
        end_date date,
        room_number smallint,
        confirm_number text,
        guest_id uuid,
        PRIMARY KEY ((hotel_id, start_date), room_number)
    ) WITH comment = 'Q7. Find reservations by hotel and date';
""")

<cassandra.cluster.ResultSet at 0x1f55ce7ad68>

In [114]:
session.execute("""
        CREATE MATERIALIZED VIEW reservations_by_confirmation AS
        SELECT * FROM reservation.reservations_by_hotel_date
        WHERE confirm_number IS NOT NULL and hotel_id IS NOT NULL and
            start_date IS NOT NULL and room_number IS NOT NULL
        PRIMARY KEY (confirm_number, hotel_id, start_date, room_number);
""")



<cassandra.cluster.ResultSet at 0x1f55ce9a278>

In [115]:
session.execute("""
    CREATE TABLE reservations_by_guest (
        guest_last_name text,
        hotel_id text,
        start_date date,
        end_date date,
        room_number smallint,
        confirm_number text,
        guest_id uuid,
        PRIMARY KEY ((guest_last_name), hotel_id)
    ) WITH comment = 'Q8. Find reservations by guest name';
""")

<cassandra.cluster.ResultSet at 0x1f55ce9aeb8>

In [116]:
session.execute("""
    CREATE TABLE guests (
        guest_id uuid PRIMARY KEY,
        first_name text,
        last_name text,
        title text,
        emails set<text>,
        phone_numbers list<text>,
        addresses map<text, frozen<address>>,
        confirm_number text
    ) WITH comment = 'Q9. Find guest by ID';
""")

<cassandra.cluster.ResultSet at 0x1f55cea0a20>

* The time series pattern is a special case of the wide row pattern: a series of measurements taken at specific time intervals is stored in a wide row, with the measurement time used as part of the partition key.
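
One common way to apply the time series pattern is to bucket measurements by day, so that no single partition grows without bound. The sketch below models that with a dictionary. The names (`record`, `readings_for_day`, the sensor ID) and the choice of a one-day bucket are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# (sensor_id, day) -> list of (measured_at, value), sorted by time
readings = defaultdict(list)


def record(sensor_id: str, measured_at: datetime, value: float) -> None:
    """Store a measurement in the partition for its sensor and day."""
    bucket = (sensor_id, measured_at.date().isoformat())
    readings[bucket].append((measured_at, value))
    readings[bucket].sort(key=lambda reading: reading[0])  # clustering order


def readings_for_day(sensor_id: str, day: str) -> list:
    """Read one partition: all measurements for a sensor on a given day."""
    return list(readings.get((sensor_id, day), []))


record("s1", datetime(2019, 4, 11, 12, 30), 21.5)
record("s1", datetime(2019, 4, 11, 9, 0), 19.0)
record("s1", datetime(2019, 4, 12, 9, 0), 18.2)
```

Here the day is part of the partition key and the exact measurement time orders rows inside the partition, so a query for one day reads a single partition.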

## Architecture

* Rack: a logical set of nodes in close proximity.
* Data center: a logical set of racks in the same building, connected by a reliable network.
* The default configuration is a single data center, DC1, with a single rack, RAC1.
* Cassandra stores copies in multiple data centers to maximize availability and partition tolerance.

### Gossip and failure detection
* To support decentralization and fault tolerance, Cassandra uses a gossip protocol, which runs every second.
* The Gossiper class maintains the list of alive and dead nodes.
* Once per second the gossiper chooses a random node in the cluster and starts a gossip session with it:
    - The initiator sends a GossipDigestSynMessage.
    - When the other node receives the message, it replies with a GossipDigestAckMessage.
    - When the initiator receives that, it replies with a GossipDigestAck2Message.
* If a node does not reply, the initiator marks it as dead in its local list.
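
The three-message exchange can be simulated in plain Python. This is a much simplified sketch and not Cassandra's implementation: the `Node` class is hypothetical, each node's state is reduced to a map from endpoint name to a heartbeat version, and failure detection and random peer selection are left out.

```python
class Node:
    """Toy gossip participant holding endpoint -> heartbeat version."""

    def __init__(self, name):
        self.name = name
        self.state = {name: 1}

    def beat(self):
        self.state[self.name] += 1  # local heartbeat advances

    def make_syn(self):
        # Syn: the initiator advertises the versions it knows.
        return dict(self.state)

    def handle_syn(self, digests):
        # Ack: send what we know that is newer, ask for what we lack.
        newer = {ep: v for ep, v in self.state.items() if v > digests.get(ep, 0)}
        wanted = [ep for ep, v in digests.items() if v > self.state.get(ep, 0)]
        return newer, wanted

    def handle_ack(self, newer, wanted):
        # Ack2: apply the peer's newer entries, return the requested ones.
        self._merge(newer)
        return {ep: self.state[ep] for ep in wanted}

    def handle_ack2(self, requested):
        self._merge(requested)

    def _merge(self, updates):
        for endpoint, version in updates.items():
            if version > self.state.get(endpoint, 0):
                self.state[endpoint] = version


def gossip_round(initiator, peer):
    """Run one Syn / Ack / Ack2 exchange between two nodes."""
    newer, wanted = peer.handle_syn(initiator.make_syn())
    peer.handle_ack2(initiator.handle_ack(newer, wanted))


a, b, c = Node("a"), Node("b"), Node("c")
a.beat()
gossip_round(a, b)
gossip_round(b, c)
```

After the two rounds, node `c` has learned about `a` without ever talking to it directly, which is how cluster membership spreads without a central coordinator.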