# Understanding NoSQL Databases with a Focus on MongoDB


### What is a NoSQL Database?
A NoSQL database environment is, simply put, a non-relational and largely distributed database system that enables rapid, ad-hoc organization and analysis of extremely high-volume, disparate data types. NoSQL databases are sometimes referred to as cloud databases, non-relational databases, Big Data databases, and a myriad of other terms. They were developed in response to the sheer volume of data being generated, stored, and analyzed by modern users (user-generated data) and their applications (machine-generated data).

### Types of NoSQL:
There are 4 basic types of NoSQL databases:

   1. Key-Value Store: a big hash table of keys and values. {Example: Riak, Amazon DynamoDB}
   2. Document-based Store: stores documents made up of tagged elements. {Example: MongoDB}
   3. Column-based Store: each storage block contains data from only one column. {Example: HBase, Cassandra}
   4. Graph-based Store: a network database that uses edges and nodes to represent and store data. {Example: Neo4j}
   
   
   
##### Key Value Store NoSQL Database:

The schema-less format of a key-value database like Riak suits simple storage needs. The key can be synthetic or auto-generated, while the value can be a string, JSON, a BLOB (binary large object), etc.

A key-value store essentially uses a hash table in which each unique key points to a particular item of data. A bucket is a logical group of keys; buckets do not physically group the data. Identical keys can exist in different buckets.

Performance is greatly enhanced by the cache mechanisms that accompany the mappings. To read a value you need to know both the key and the bucket, because the real key is a hash of (Bucket + Key).

The key-value model has little complexity and is easy to implement. It is not ideal, however, if you only need to update part of a value or query the database by anything other than the key.

Reflecting on the CAP theorem, key-value stores are typically strong on Availability and Partition tolerance but weaker on Consistency.

A key-value database allows clients to read and write values using a key, as follows:

    Get(key), returns the value associated with the provided key.
    Put(key, value), associates the value with the key.
    Multi-get(key1, key2, .., keyN), returns the list of values associated with the list of keys.
    Delete(key), removes the entry for the key from the data store.
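
The four operations above can be sketched with a plain Python dictionary. This is only an illustrative in-memory toy, not how Riak or any real store is implemented; the class name `KeyValueStore` and the bucket names are assumptions made for the example. It also shows why the bucket is needed for a lookup: the real key is the (bucket, key) pair.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple


class KeyValueStore:
    """Toy in-memory key-value store; buckets are logical namespaces only."""

    def __init__(self) -> None:
        # One hash table for everything: the real key is (bucket, key).
        self._data: Dict[Tuple[str, Hashable], Any] = {}

    def put(self, bucket: str, key: Hashable, value: Any) -> None:
        """Associate the value with the key inside the bucket."""
        self._data[(bucket, key)] = value

    def get(self, bucket: str, key: Hashable) -> Optional[Any]:
        """Return the value for the key, or None if it is absent."""
        return self._data.get((bucket, key))

    def multi_get(self, bucket: str, keys: List[Hashable]) -> List[Optional[Any]]:
        """Return the values for several keys, in the order requested."""
        return [self.get(bucket, key) for key in keys]

    def delete(self, bucket: str, key: Hashable) -> None:
        """Remove the entry for the key; do nothing if it is absent."""
        self._data.pop((bucket, key), None)


store = KeyValueStore()
store.put("listings", 14852, {"City": "Austin"})
store.put("reviews", 14852, "Great stay")  # same key, different bucket
```

Note that the value is opaque to the store: changing only the city of the listing means reading the whole value, modifying it, and writing it back.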

While a key-value database is helpful in some cases, it also has weaknesses. First, the model does not provide traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously). Such capabilities must be provided by the application itself.

Second, as the volume of data increases, maintaining unique values as keys may become more difficult; addressing this issue requires the introduction of some complexity in generating character strings that will remain unique among an extremely large set of keys.
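
One common way to generate such strings is a random UUID, which needs no central counter. The sketch below is an assumption about how an application might do this, not something a particular key-value store requires; the helper name `new_key` and the `review` prefix are made up for illustration. Collisions between random UUIDs are extremely unlikely rather than strictly impossible.

```python
import uuid


def new_key(prefix: str) -> str:
    """Return a key made of a readable prefix plus 32 random hex characters."""
    return f"{prefix}:{uuid.uuid4().hex}"


# Generate a batch of keys; the set would shrink if any two collided.
keys = {new_key("review") for _ in range(10_000)}
```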


##### Document Store NoSQL Database:

A document store holds data as collections of key-value pairs and is quite similar to a key-value store. The difference is that the stored values (referred to as “documents”) provide some structure and encoding of the managed data. XML, JSON (JavaScript Object Notation), and BSON (a binary encoding of JSON objects) are some common standard encodings.

One key difference between a key-value store and a document store is that the latter embeds attribute metadata associated with stored content, which essentially provides a way to query the data based on the contents. 

Data and relationships are not stored in tables, as is the norm with conventional relational databases, but as a collection of independent documents.
Because document-style databases are schema-less, adding fields to JSON documents is a simple task that does not require defining changes first.
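
Both ideas, querying by content and adding fields without a schema change, can be illustrated with ordinary Python dictionaries standing in for documents. This is a rough sketch, not MongoDB's query engine; the helper `find`, the "Pets Allowed" field, and the two hypothetical listings are assumptions for the example.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]


def find(collection: List[Document], criteria: Dict[str, Any]) -> List[Document]:
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


listings: List[Document] = [
    {"Name": "Architectural Gem on Lake Austin", "City": "Austin", "State": "TX"},
    # This document carries an extra field; no schema change was needed.
    {"Name": "Hypothetical Loft", "City": "Austin", "Pets Allowed": True},
    {"Name": "Hypothetical Cabin", "City": "Denver"},
]

austin = find(listings, {"City": "Austin"})
pet_friendly = find(listings, {"Pets Allowed": True})
```

Because the field names travel with each document, the store can match on them; a pure key-value store would see only an opaque value.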


##### Column Store NoSQL Database:

In a column-oriented NoSQL database, data is stored in cells grouped in columns of data rather than as rows of data. Columns are logically grouped into column families. Column families can contain a virtually unlimited number of columns, which can be created at runtime or in the schema definition. Reads and writes are done using columns rather than rows.

In comparison, most relational DBMSs store data in rows. The benefit of storing data in columns is fast search/access and data aggregation. Relational databases store a single row as a contiguous disk entry, and different rows are stored in different places on disk. Columnar databases store all the cells corresponding to a column as a contiguous disk entry, which makes search/access faster.
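
The difference between the two layouts can be sketched in memory with Python lists. The listing ids, cities, and prices below are made-up sample values, and real engines work with disk blocks rather than lists, so this only illustrates which data an aggregation has to touch.

```python
# Row layout: each record is stored together.
rows = [
    {"listing_id": 1, "city": "Austin", "price": 120},
    {"listing_id": 2, "city": "Denver", "price": 95},
    {"listing_id": 3, "city": "Austin", "price": 150},
]

# Column layout: all the cells of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregating over rows has to visit every full record.
avg_from_rows = sum(row["price"] for row in rows) / len(rows)

# Aggregating over columns reads only the "price" column.
avg_from_columns = sum(columns["price"]) / len(columns["price"])
```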

###### Data Model
    ColumnFamily: a single structure that can group Columns and SuperColumns.
    Key: the permanent name of the record. Different keys can have different numbers of columns, so the database can scale in an irregular way.
    Keyspace: the outermost level of organization, typically named after the application.
    Column: an ordered tuple of elements with a name and a value defined.
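
One way to picture how these pieces nest is a dictionary of dictionaries: keyspace, then column family, then row key, then columns. This is a simplified mental model under that assumption, not the storage format of HBase or Cassandra; the "Listings" family, the second row, and the helper `add_column` are hypothetical.

```python
from typing import Dict

# column family -> row key -> column name -> value
Keyspace = Dict[str, Dict[str, Dict[str, str]]]

# The keyspace is the outermost container, named after the application.
airbnb: Keyspace = {
    "Listings": {
        "14852": {"Name": "Architectural Gem on Lake Austin", "City": "Austin"},
        "99999": {"Name": "Hypothetical Cabin"},  # this key has fewer columns
    }
}


def add_column(ks: Keyspace, family: str, key: str, column: str, value: str) -> None:
    """Add a column to one row at runtime, creating the family or row if needed."""
    ks.setdefault(family, {}).setdefault(key, {})[column] = value


add_column(airbnb, "Listings", "99999", "State", "CO")
```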


##### Graph-Based NoSQL Database:

A graph-based NoSQL database does not use the rigid format of SQL or a tables-and-columns representation. Instead, it uses a flexible graph representation that is well suited to addressing scalability concerns. Graph structures consist of nodes, edges, and properties, and provide index-free adjacency. Data can be easily transformed from one model to another using a graph-based NoSQL database.

![graph.PNG](attachment:graph.PNG)

These databases use edges and nodes to represent and store data.
The nodes are organised by their relationships with one another, which are represented by edges between the nodes.
Both the nodes and the relationships have defined properties.
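
A small sketch of nodes, edges, and properties in plain Python, using two reviewers and one listing from the AirBnB data used later in this document. The node ids, the `REVIEWED` relationship name, and the helper functions are assumptions for illustration. The helpers scan a flat edge list for simplicity, whereas a real graph database such as Neo4j follows direct references between nodes.

```python
from typing import Any, Dict, List, Tuple

# Nodes keyed by id, each with its own properties.
nodes: Dict[str, Dict[str, Any]] = {
    "ron": {"label": "Reviewer", "name": "Ron"},
    "antje": {"label": "Reviewer", "name": "Antje"},
    "14852": {"label": "Listing", "name": "Architectural Gem on Lake Austin"},
}

# Each edge: (source id, relationship, target id, edge properties).
edges: List[Tuple[str, str, str, Dict[str, Any]]] = [
    ("ron", "REVIEWED", "14852", {"date": "2015-03-18"}),
    ("antje", "REVIEWED", "14852", {"date": "2015-05-17"}),
]


def neighbours(node_id: str, relationship: str) -> List[str]:
    """Return the ids of nodes reached from node_id over the relationship."""
    return [dst for src, rel, dst, _ in edges if src == node_id and rel == relationship]


def reviewers_of(listing_id: str) -> List[str]:
    """Return the names of reviewers with a REVIEWED edge to the listing."""
    return [
        nodes[src]["name"]
        for src, rel, dst, _ in edges
        if dst == listing_id and rel == "REVIEWED"
    ]
```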

### CAP Theorem:


The theorem states that networked shared-data systems can only guarantee (strongly support) two of the following three properties:

Consistency - A guarantee that every node in a distributed cluster returns the same, most recent, successful write. Consistency refers to every client having the same view of the data. There are various types of consistency models. Consistency in CAP (used to prove the theorem) refers to linearizability or sequential consistency, a very strong form of consistency.

Availability - Every non-failing node returns a response for all read and write requests in a reasonable amount of time. The key word here is every. To be available, every node (on either side of a network partition) must be able to respond in a reasonable amount of time.

Partition Tolerance - The system continues to function and upholds its consistency guarantees in spite of network partitions. Network partitions are a fact of life. Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once the partition heals.

Note that the C and A in ACID represent different concepts than the C and A in the CAP theorem.

The CAP theorem categorizes systems into three categories:

CP (Consistent and Partition Tolerant) - At first glance, the CP category is confusing: it suggests a system that is consistent and partition tolerant but never available. In fact, CP refers to a category of systems where availability is sacrificed only in the case of a network partition.

CA (Consistent and Available) - CA systems are consistent and available in the absence of any network partition. Single-node DB servers are often categorized as CA systems, since they do not need to deal with partition tolerance. The only hole in this theory is that single-node DB systems are not a network of shared-data systems and thus do not fall under the purview of CAP.

AP (Available and Partition Tolerant) - These are systems that are available and partition tolerant but cannot guarantee consistency.
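
The CP and AP trade-off can be made concrete with a deliberately simplified two-replica toy. Everything here (the `Cluster` class, the `partitioned` flag, the rule that a CP cluster rejects all writes during a partition) is an assumption for illustration; real CP systems usually keep serving on the side that still holds a majority, and real AP systems reconcile diverged replicas after the partition heals.

```python
from typing import List, Optional


class Cluster:
    """Two replicas of one value; mode is "CP" or "AP"."""

    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.replicas: List[Optional[str]] = [None, None]
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica: int, value: str) -> bool:
        """Try to write through one replica; return whether it was accepted."""
        if not self.partitioned:
            # Healthy network: update both replicas together.
            self.replicas = [value, value]
            return True
        if self.mode == "CP":
            # Cannot reach the other replica, so refuse (gives up availability).
            return False
        # AP: accept locally and let the replicas diverge (gives up consistency).
        self.replicas[replica] = value
        return True

    def read(self, replica: int) -> Optional[str]:
        """Return the value currently held by one replica."""
        return self.replicas[replica]


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "v1")
    cluster.partitioned = True

cp_accepted = cp.write(0, "v2")  # rejected, both replicas still agree on "v1"
ap_accepted = ap.write(0, "v2")  # accepted, replica 1 still holds "v1"
```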


###### Refer to the links below for more details:

https://www.3pillarglobal.com/insights/just-say-yes-to-nosql

https://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases

https://academy.datastax.com/planet-cassandra/what-is-nosql



### Introduction to MongoDB:

Refer to https://youtu.be/-0X8mr6Q8Ew for MongoDB installation and a basic tutorial.

MongoDB is a document store NoSQL database. 
Before exploring schema design, the table below provides a useful reference for translating terminology from the relational to the MongoDB world.

![Terminology.PNG](attachment:Terminology.PNG)


The data model of MongoDB is shown below:

![MongoDataModel.PNG](attachment:MongoDataModel.PNG)


#### Using the commands below, we will create the "AirBnB" database.
Get Data for Airbnb in JSON, CSV or Excel from: https://public.opendatasoft.com/explore/?sort=modified&q=Airbnb

##### Database:
In MongoDB, databases hold collections of documents.
To select a database to use, issue the use command followed by the database name in the mongo shell:

    use airbnb;

If a database does not exist, MongoDB creates the database when you first store data for that database. As such, you can switch to a non-existent database and perform the following operation in the mongo shell.

##### Collections:
MongoDB stores documents in collections. Collections are analogous to tables in relational databases.

Create a Collection

If a collection does not exist, MongoDB creates the collection when you first store data for that collection.

    db.TweetDetails.insertOne({'cre_tms': '14-04-2018 18:14',
      'fav_cnt': 0,
      'hashtags': ['rooftop', ' beachhouse', ' beachlifestyle'],
      'retweet_cnt': 0,
      'tweet_id': 985220000000000000,
      'tweet_txt': 'Waves crashing on shore and sand for miles. Views from our rooftop. #rooftop #beachhouse #beachlifestyle… https://t.co/9jyDBh4ldx',
      'urls': ['https://twitter.com/i/web/status/985219856701239298'],
      'user_loc': null,
      'user_name': 'VillaRocas_Baja'})
    

db.TweetDetails.insertOne() is one of the methods available in the mongo shell.

"db" refers to the current database. "TweetDetails" is the name of the collection. If the mongo shell does not accept the name of a collection, you can use the alternative db.getCollection() syntax. For instance, if a collection name contains a space or hyphen, use db.getCollection() as shown below:

    db.getCollection("Property Reviews")
    
As with the dot syntax, the collection is created when you first store data in it if it does not already exist.


##### Insert Documents:

Insert a Single Document

db.collection.insertOne() inserts a single document into a collection.

The following example inserts a new document into the TweetDetails collection. If the document does not specify an _id field, MongoDB adds the _id field with an ObjectId value to the new document.

    db.TweetDetails.insertOne({'cre_tms': '14-04-2018 18:14',
      'fav_cnt': 0,
      'hashtags': ['rooftop', ' beachhouse', ' beachlifestyle'],
      'retweet_cnt': 0,
      'tweet_id': 985220000000000000,
      'tweet_txt': 'Waves crashing on shore and sand for miles. Views from our rooftop. #rooftop #beachhouse #beachlifestyle… https://t.co/9jyDBh4ldx',
      'urls': ['https://twitter.com/i/web/status/985219856701239298'],
      'user_loc': null,
      'user_name': 'VillaRocas_Baja'})


Insert Multiple Documents

db.collection.insertMany() can insert multiple documents into a collection. Pass an array of documents to the method.

The following example inserts two new documents into the TweetDetails collection. If the documents do not specify an _id field, MongoDB adds the _id field with an ObjectId value to each document.

    db.TweetDetails.insertMany([
      {'cre_tms': '14-04-2018 18:14',
       'fav_cnt': 0,
       'hashtags': ['rooftop', ' beachhouse', ' beachlifestyle'],
       'retweet_cnt': 0,
       'tweet_id': 985220000000000000,
       'tweet_txt': 'Waves crashing on shore and sand for miles. Views from our rooftop. #rooftop #beachhouse #beachlifestyle… https://t.co/9jyDBh4ldx',
       'urls': ['https://twitter.com/i/web/status/985219856701239298'],
       'user_loc': null,
       'user_name': 'VillaRocas_Baja'},
      {'cre_tms': '14-04-2018 18:14',
       'fav_cnt': 0,
       'hashtags': ['VRBO', ' AIRBNB'],
       'retweet_cnt': 0,
       'tweet_id': 985220000000000000,
       'tweet_txt': '1523729689 Exceptional Value, Beautiful, and Spaciously Comfortable Too https://t.co/RKLuDiMU7T #VRBO #AIRBNB… https://t.co/y33TnzSH99',
       'urls': ['https://uvhr.jimdo.com'],
       'user_loc': null,
       'user_name': 'TaylorSlowe'}
    ])

MongoDB Tutorial 2 : Insert, Update, Remove, Query https://youtu.be/CB9G5Dvv-EE

MongoDB Tutorial 3 : Indexing and aggregating. https://youtu.be/mpkmFGuC9NQ


###### Design decisions taken while creating the AirBnB database in MongoDB:

Schema design requires a change in perspective: a) from the legacy relational data model that flattens data into rigid two-dimensional tabular structures of rows and columns, b) to a rich and dynamic document data model with embedded sub-documents and arrays.
The project team should start the schema design process by considering the application’s requirements. It should model the data in a way that takes advantage of the document model’s flexibility. In schema migrations, it may be easy to mirror the relational database’s flat schema to the document model. However, this approach negates the advantages enabled by the document model’s rich, embedded data structures.

In the RDBMS structure, the field "Listing Id" is used to fetch the reviews for each property in the Properties table. In the document model, embedded sub-documents and arrays effectively pre-JOIN data by combining related fields in a single data structure. Rows and columns that were traditionally normalized and distributed across separate tables can now be stored together in a single document, eliminating the need to JOIN separate tables when the application has to retrieve complete records. Modeling the same data in MongoDB lets us create a schema in which an array of sub-documents, "Reviews", is embedded directly within each Properties document, as shown below:


    {'City': 'Austin',
      'Country': 'United States',
      'House Rules': '*Guests must be over the age of 25 and will be an occupant of the unit during the entire reserved period. *No parties, excessive noise, or any illegal activity shall take place at property. *No pets are allowed unless otherwise noted for specific properties. * No smoking is allowed in or around any property. Cleaning: Each property will be cleaned and inspected after your departure. We ask you to help us enable the cleaning crews as much as possible. Simple things like leaving the property tidy, running the dishwasher, starting to wash the towels, and taking out the trash are expected of our guests. If excessive wear and tear is found on property, or additional cleaning is necessary due to spills, trash left on site, unclean dishes, stains to furniture, carpeting, linens, paint, wallpaper, or flooring, Guest authorizes the Owner to bill them for additional fees. If at any time the maximum number of occupants is exceeded or if the Owner receives information about excessive noise, Owner',
      'Listing Id': 14852,
      'Name': 'Architectural Gem on Lake Austin',
      'Neighbourhood': 'Westlake Hills',
      'Neighbourhood Cleansed': '78746',
      'State': 'TX',
      'Street': 'Lakeshore Drive, Austin, TX 78746, United States',
      'Zipcode': '78746',
      
      'Reviews': [ {'comments': 'The reservation was canceled 10 days before arrival. This is an automated posting.',
      'date': '2015-03-18',
      'id': 28115511,
      'listing_id': 14852,
      'reviewer_id': 6443763,
      'reviewer_name': 'Ron'},
     {'comments': 'Hetty ist eine wunderbare Gastgeberin!\r\nWir haben uns rundum wohl gefühlt, vielen Dank dafür!',
      'date': '2015-05-17',
      'id': 32374947,
      'listing_id': 14852,
      'reviewer_id': 13633695,
      'reviewer_name': 'Antje'},
     {'comments': 'The host canceled this reservation 60 days before arrival. This is an automated posting.',
      'date': '2016-03-07',
      'id': 64749139,
      'listing_id': 14852,
      'reviewer_id': 60599013,
      'reviewer_name': 'Eric'} ]
    }
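
The embedding step itself can be sketched in plain Python, before anything is sent to MongoDB. This is a minimal illustration under the assumption that properties and reviews have already been loaded as lists of dictionaries with the field names used above; the helper `embed_reviews` and the review for listing 99999 are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List

Document = Dict[str, Any]


def embed_reviews(properties: List[Document], reviews: List[Document]) -> List[Document]:
    """Return property documents with their reviews embedded as an array."""
    # Group reviews by the listing they belong to (the former JOIN key).
    by_listing: Dict[Any, List[Document]] = defaultdict(list)
    for review in reviews:
        by_listing[review["listing_id"]].append(review)
    # Copy each property and attach its reviews (an empty list if it has none).
    return [
        {**prop, "Reviews": by_listing.get(prop["Listing Id"], [])}
        for prop in properties
    ]


properties = [{"Listing Id": 14852, "Name": "Architectural Gem on Lake Austin"}]
reviews = [
    {"id": 28115511, "listing_id": 14852, "reviewer_name": "Ron"},
    {"id": 32374947, "listing_id": 14852, "reviewer_name": "Antje"},
    {"id": 1, "listing_id": 99999, "reviewer_name": "Hypothetical"},
]

documents = embed_reviews(properties, reviews)
```

The resulting list could then be passed to an insert call such as insert_many, so that each property is stored with its reviews in one document.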
    
    
The SocialTags table contains a number of tag fields to store the hashtags present in the posts. In an RDBMS, this is done to keep the table in a normalized state. In MongoDB, we can store all of these values in a single field called hashtags, which is an array containing all the tags in the post.

     {'cre_tms': '03-04-2018 00:47',
      'fav_cnt': 0,
      'hashtags': ['Airbnb'],
      'retweet_cnt': 1,
      'tweet_id': 980970000000000000,
      'tweet_txt': "RT @FIVRE604: I take it you're on this @CityofVancouver ?\r\n#Airbnb\r\nEasy to get addresses from George https://t.co/xyHOnE5WxP",
      'urls': ['https://twitter.com/george_affleck/status/980514295581569024'],
      'user_name': 'vanprole'}
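
Collapsing several tag columns into one array field might look like the sketch below. The column names `tag1` to `tag4` are assumptions, since the actual SocialTags column names are not shown in this document, and `collapse_tags` is a hypothetical helper. It also strips the stray leading spaces seen in some of the hashtag values above.

```python
from typing import Any, Dict, List


def collapse_tags(row: Dict[str, Any], tag_columns: List[str]) -> Dict[str, Any]:
    """Replace separate tag columns with a single 'hashtags' array field."""
    # Keep every field that is not one of the tag columns.
    doc = {key: value for key, value in row.items() if key not in tag_columns}
    # Collect the non-empty tags in column order, trimming whitespace.
    doc["hashtags"] = [row[col].strip() for col in tag_columns if row.get(col)]
    return doc


row = {
    "user_name": "VillaRocas_Baja",
    "tag1": "rooftop",
    "tag2": " beachhouse",
    "tag3": " beachlifestyle",
    "tag4": None,  # unused tag column in the relational row
}
doc = collapse_tags(row, ["tag1", "tag2", "tag3", "tag4"])
```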


For further reference:
https://www.mongodb.com/thank-you/white-paper/migration-rdbms-nosql-mongodb

https://apiko.com/blog/opting-for-the-database-type-for-online-marketplace-app-how-to-migrate-mysql-database-to-mongodb/

##### Python Code Snippets:

###### For connecting to MongoDB:
        # Prerequisite: the MongoDB server must be running before executing this code.
        # Reference: MongoDB Tutorial https://youtu.be/-0X8mr6Q8Ew
        from pymongo import MongoClient

        client = MongoClient()  # connects to localhost:27017 by default
        db = client.airbnb
        
###### Converting CSV to BSON (the type of data loaded into MongoDB) and inserting data into MongoDB:
        import json

        import pandas as pd
        from bson import json_util
        from pymongo import MongoClient

        filename = 'myFile.csv'
        df = pd.read_csv(filename, sep=',')
        # Turn each row into a dict, serialize to JSON text, and parse it back
        # into plain Python objects ready for insertion.
        rec_td = json_util.loads(json.dumps(df.to_dict(orient='records')))

        client = MongoClient()
        db = client.airbnb
        db.TweetDetails.insert_many(rec_td)
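
As an alternative sketch (not the approach used above), the records can be prepared with only the standard library, which avoids NaN values for empty cells by mapping them to None. The helper `csv_to_records` and the sample CSV text are made up for illustration, and all non-empty values stay strings, so numeric columns would still need converting before insertion.

```python
import csv
import io
from typing import Dict, List, Optional


def csv_to_records(text: str) -> List[Dict[str, Optional[str]]]:
    """Parse CSV text into a list of dicts, turning empty cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {field: (value if value != "" else None) for field, value in row.items()}
        for row in reader
    ]


# Made-up sample: the first row has no user location.
sample = "user_name,user_loc,fav_cnt\nVillaRocas_Baja,,0\nTaylorSlowe,Somewhere,1\n"
records = csv_to_records(sample)
```

A list built this way can be handed to insert_many in the same manner as rec_td above.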
        
        
#### License

The text in the document by Kaushik Paranjpe is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/us/

The code in the document by Kaushik Paranjpe is licensed under the MIT License https://opensource.org/licenses/MIT
