# NoSQL and MongoDB

#### Flexible and distributed data storage ...

## Outline

* Prelude: XML 
* NoSQL Databases
* MongoDB
    * JSON
* Use Case: Restaurant Rating Site

## XML
<img src="IMG/xml.png" width=200>

***Extensible Markup Language (XML)*** is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. 
The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. 

### XML Tree Representation of Data
<img SRC="IMG/xml_tree.gif" width=800>

### Another XML Example
```
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
```


### XML with ***Python***

In [1]:
#in colab, we need to clone the data from the repo
!git clone https://github.com/keuperj/DATA.git

Cloning into 'DATA'...
remote: Enumerating objects: 126, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 126 (delta 11), reused 39 (delta 11), pack-reused 87
Receiving objects: 100% (126/126), 185.56 MiB | 7.03 MiB/s, done.
Resolving deltas: 100% (32/32), done.
Updating files: 100% (86/86), done.


In [3]:
import xml.etree.ElementTree as ET
tree = ET.parse('DATA/example.xml') #parse xml document
root = tree.getroot() #get tree root


In [4]:
#get first elements of the tree
for child in root:
    print( child.tag, child.attrib)

country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}


In [5]:
#iterate over all neighbor elements and print their attributes
for neighbor in root.iter('neighbor'):
    print (neighbor.attrib)

{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}


In [6]:
#get all country nodes and extract the name attribute and the rank
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print (name, rank)

Liechtenstein 1
Singapore 4
Panama 68
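
Besides plain tag names, `find` and `findall` also accept a limited XPath subset. The following is a minimal sketch, not part of the original notebook: it parses an inline copy of the example document with `ET.fromstring`, so it does not depend on `DATA/example.xml`, and the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# Inline copy of the example document (assumption: same content as DATA/example.xml)
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)  # parse from a string instead of a file

# element text is always a string, so numbers have to be converted explicitly
gdp_by_country = {c.get('name'): int(c.find('gdppc').text)
                  for c in root.findall('country')}

# XPath predicate on the text of a child element: countries whose <year> is 2011
countries_2011 = [c.get('name') for c in root.findall("./country[year='2011']")]

# XPath predicate on an attribute: all neighbors with direction "W"
western_neighbors = [n.get('name')
                     for n in root.findall(".//neighbor[@direction='W']")]

print(gdp_by_country)
print(countries_2011)
print(western_neighbors)
```

Note that `rank`, `year` and `gdppc` arrive as text. XML itself carries no type information, which is one difference to the JSON documents used later.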


#### more on the ***Python*** ***XML*** API: https://docs.python.org/3/library/xml.etree.elementtree.html

<H3>Document-Based Databases</H3>
<BR>
<img src="IMG/MongoDB.png">
    
* Data stored in documents (files)
* Flexible structure in documents (like XML)
* Queries similar to those in SQL
* Support for distributed operations (***MapReduce***)
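
Two of these points can be illustrated without a database. The sketch below is plain Python and does not use the MongoDB API: it keeps documents with different fields in one list (flexible structure) and counts them per borough in a map, group and reduce style. The documents are trimmed, partly hypothetical stand-ins, and the function names `map_phase` and `reduce_phase` are chosen for illustration.

```python
from collections import defaultdict

# Trimmed stand-in documents: the same "collection", but not every document has the same fields
documents = [
    {"name": "Morris Park Bake Shop", "borough": "Bronx", "cuisine": "Bakery"},
    {"name": "Mee Sum Coffee Shop", "borough": "Manhattan",
     "grades": [{"grade": "A", "score": 10}]},                 # nested list, no "cuisine"
    {"name": "Lady M Confections", "cuisine": "Bakery"},       # no "borough" (hypothetical gap)
]

def map_phase(doc):
    """Emit one (key, value) pair per document; missing fields get a default key."""
    yield doc.get("borough", "unknown"), 1

def reduce_phase(key, values):
    """Combine all values that were emitted for one key."""
    return sum(values)

# group step: collect the emitted values by key
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

counts = {key: reduce_phase(key, values) for key, values in grouped.items()}
print(counts)
```

Because `map_phase` only looks at one document at a time, the map step could run on different nodes in parallel. Only the grouped values need to be brought together for the reduce step.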


<H2>Use Case:</H2>
<H3>A Restaurant rating system:</H3>
<img src="IMG/TA.png" width="65%">


<H3>Implementation with MongoDB</H3>
<BR>
<img src="IMG/MongoDB.png">
    
* Properties of MongoDB
    * Document oriented DB
        * Structure description in JSON
        <img src="IMG/json.jpg">
   


* Data: open data set with restaurants and ratings:
    * https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json

<H3>Example: JSON Document for a Restaurant</H3>

```
{
  "address": {
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
     { "date": { "$date": 1393804800000 }, "score": 2 },
     { "date": { "$date": 1378857600000 }, "score": 6 },
     { "date": { "$date": 1358985600000 }, "score": 10 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}
```
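
Such a document maps directly onto Python dictionaries and lists. The following minimal sketch parses the restaurant document with the standard `json` module and accesses nested fields. Standard JSON does not allow a comma after the last array element, so the string below has none. The helper `to_datetime` is hypothetical and assumes that the `$date` value is the number of milliseconds since the Unix epoch, as in MongoDB Extended JSON.

```python
import json
from datetime import datetime, timezone

# The restaurant document as a JSON string
doc_text = """
{
  "address": {
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
     { "date": { "$date": 1393804800000 }, "score": 2 },
     { "date": { "$date": 1378857600000 }, "score": 6 },
     { "date": { "$date": 1358985600000 }, "score": 10 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}
"""

doc = json.loads(doc_text)  # nested dicts and lists

def to_datetime(extended_date):
    """Convert {"$date": <milliseconds since epoch>} into a UTC datetime (illustrative helper)."""
    return datetime.fromtimestamp(extended_date["$date"] / 1000, tz=timezone.utc)

street = doc["address"]["street"]                       # nested object access
scores = [grade["score"] for grade in doc["grades"]]    # iterate over the embedded list
first_grade_date = to_datetime(doc["grades"][0]["date"])

print(street, scores, first_grade_date.date())
```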

<H2>Hands on!</H2>

In [25]:
#NOTE: this will only work if you have a local MongoDB Server running 

#import MongoDB client module
from pymongo import MongoClient
import warnings
warnings.filterwarnings('ignore') 
#connect to MongoDB on localhost
client = MongoClient()


In [27]:
#how many worker nodes are working in the MongoDB cluster?
client.nodes

frozenset({('localhost', 27017)})

<H3>What Data is on the Cluster?</H3>

In [33]:
#see what databases are available
client.list_database_names()

['admin', 'config', 'demo', 'local']

In [34]:
#generate reference to "demo" database
db = client.demo

In [35]:
#list all collections 
db.list_collection_names()

['restaurants']

<H2>MongoDB Queries</H2>


In [37]:
db.restaurants.find()

<pymongo.cursor.Cursor at 0x7fd7a5b6d990>

In [38]:
db.restaurants.find()[129]

{'_id': 129,
 'address': {'building': '26',
  'coord': [-73.9983, 40.715051],
  'street': 'Pell Street',
  'zipcode': '10013'},
 'borough': 'Manhattan',
 'cuisine': 'Café/Coffee/Tea',
 'grades': [{'date': {'$date': 1404950400000}, 'grade': 'A', 'score': 10},
  {'date': {'$date': 1373587200000}, 'grade': 'A', 'score': 10},
  {'date': {'$date': 1360540800000}, 'grade': 'A', 'score': 9},
  {'date': {'$date': 1357776000000}, 'grade': 'P', 'score': 4},
  {'date': {'$date': 1343347200000}, 'grade': 'A', 'score': 12},
  {'date': {'$date': 1330300800000}, 'grade': 'A', 'score': 11},
  {'date': {'$date': 1313107200000}, 'grade': 'B', 'score': 24}],
 'name': 'Mee Sum Coffee Shop',
 'restaurant_id': '40365904'}

<H3>Structured Queries</H3>

* Number of restaurants in the borough of Queens

In [43]:
db.restaurants.count_documents({"borough": "Queens"})

5656

* All restaurants with at least one score > 10 and ZIP code 10075

In [45]:
db.restaurants.count_documents({"grades.score": {"$gt": 10}, "address.zipcode": "10075"})

79
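
The filter `{"grades.score": {"$gt": 10}, "address.zipcode": "10075"}` combines two conditions with a logical AND. Since `grades` is a list, the dotted path `grades.score` matches as soon as any element of the list has a score above 10. The sketch below spells out this reading in plain Python. It is a hand-written illustration of the semantics for this one filter, not how MongoDB evaluates queries, and the three sample documents are hypothetical.

```python
def matches(doc):
    """Plain-Python reading of the filter
    {"grades.score": {"$gt": 10}, "address.zipcode": "10075"}."""
    has_score_above_10 = any(
        isinstance(grade.get("score"), (int, float)) and grade["score"] > 10
        for grade in doc.get("grades", [])
    )
    in_zip_code = doc.get("address", {}).get("zipcode") == "10075"
    return has_score_above_10 and in_zip_code

# hypothetical sample documents
docs = [
    {"name": "A", "address": {"zipcode": "10075"}, "grades": [{"score": 5}, {"score": 12}]},
    {"name": "B", "address": {"zipcode": "10075"}, "grades": [{"score": 5}, {"score": 9}]},
    {"name": "C", "address": {"zipcode": "10013"}, "grades": [{"score": 24}]},
]

matched = [doc["name"] for doc in docs if matches(doc)]
print(matched)
```

Document `A` matches although one of its scores is below 10, because a single qualifying list element is enough.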

<H3>Iterators</H3>

* e.g. all bakeries in ZIP code 10075

In [46]:
cursor=db.restaurants.find({"cuisine": "Bakery","address.zipcode": "10075"})
for doc in cursor:
    print (doc["name"])
    

Annelies Pastries
Lady M Confections
Butterfield Express
The Belgian Cupcake


## Discussion

In [1]:
#install db locally
# MongoDB download and installation
# latest binary from: https://www.mongodb.com/try/download/community
!wget  https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz # Downloads MongoDB from official repository
!tar xfv mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz     # Unpack compressed file
!rm mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz         # Removes downloaded file

# MongoDB's default database location is "/data/db"; here a local "data/db" folder is used instead
!mkdir -p data/db                                    # create the data folder (including parent)

--2024-04-01 18:25:07--  https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz
Resolving fastdl.mongodb.org (fastdl.mongodb.org)... 2600:9000:2761:ba00:16:717d:fc40:93a1, 2600:9000:2761:a800:16:717d:fc40:93a1, 2600:9000:2761:3800:16:717d:fc40:93a1, ...
Connecting to fastdl.mongodb.org (fastdl.mongodb.org)|2600:9000:2761:ba00:16:717d:fc40:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85286525 (81M) [application/gzip]
Saving to: ‘mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz’


2024-04-01 18:25:15 (11,3 MB/s) - ‘mongodb-linux-x86_64-ubuntu2204-7.0.7.tgz’ saved [85286525/85286525]

mongodb-linux-x86_64-ubuntu2204-7.0.7/LICENSE-Community.txt
mongodb-linux-x86_64-ubuntu2204-7.0.7/MPL-2
mongodb-linux-x86_64-ubuntu2204-7.0.7/README
mongodb-linux-x86_64-ubuntu2204-7.0.7/THIRD-PARTY-NOTICES
mongodb-linux-x86_64-ubuntu2204-7.0.7/bin/install_compass
mongodb-linux-x86_64-ubuntu2204-7.0.7/bin/mongod
mongodb-linux-x86_64-ubuntu2204-7.0.7/bin/mongos


In [24]:
#start MongoDB server in the background and discard its log output
import subprocess
p = subprocess.Popen(['mongodb-linux-x86_64-ubuntu2204-7.0.7/bin/mongod', '--dbpath', 'data/db'],
                     stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

In [5]:
!pip install pymongo

Collecting pymongo
  Obtaining dependency information for pymongo from https://files.pythonhosted.org/packages/00/07/9b7612de2ac167d1dee7d18fa4e37fa830e7242c3f249f5d824931dcd26d/pymongo-4.6.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading pymongo-4.6.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Obtaining dependency information for dnspython<3.0.0,>=1.16.0 from https://files.pythonhosted.org/packages/87/a1/8c5287991ddb8d3e4662f71356d9656d91ab3a36618c3dd11b280df0d255/dnspython-2.6.1-py3-none-any.whl.metadata
  Downloading dnspython-2.6.1-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.6.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (680 kB)
Downloading dnspython-2.6.1-py3

In [31]:
#get data
!wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json

--2024-04-01 18:38:22--  https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8000::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11874761 (11M) [text/plain]
Saving to: ‘primer-dataset.json’


2024-04-01 18:38:24 (9,18 MB/s) - ‘primer-dataset.json’ saved [11874761/11874761]



In [32]:
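import json

#collection that will hold the restaurant documents
restaurants = db.restaurants

#the file contains one JSON document per line
with open('primer-dataset.json', "r") as file:
    for i, line in enumerate(file):
        data = json.loads(line)
        data['_id'] = i  #use the line number as document id
        restaurants.insert_one(data)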
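
Inserting every document with its own `insert_one` call means one round trip to the server per line. A possible alternative, sketched here under assumptions, is to collect the parsed lines into batches first. The generator `batched_documents` and the batch size are hypothetical, and an in-memory `io.StringIO` stands in for the real file so that the example runs without a server.

```python
import io
import json
from itertools import islice

def batched_documents(lines, batch_size=1000):
    """Yield lists of parsed documents; '_id' is the line number, as in the loop above."""
    numbered = enumerate(lines)
    while True:
        batch = []
        for i, line in islice(numbered, batch_size):
            data = json.loads(line)
            data['_id'] = i
            batch.append(data)
        if not batch:
            return
        yield batch

# stand-in for the real file: three hypothetical one-line documents
sample_file = io.StringIO('{"name": "A"}\n{"name": "B"}\n{"name": "C"}\n')

batches = list(batched_documents(sample_file, batch_size=2))
print([len(batch) for batch in batches])
```

With a running server, each batch could then be passed to `restaurants.insert_many(batch)` instead of calling `insert_one` per document.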

In [19]:
!ps -A |grep mongo


  46422 ?        00:00:05 mongod <defunct>


In [20]:
!kill 46422
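
The `<defunct>` entry in the `ps` output marks a process that has already exited but whose exit status has not yet been collected by its parent, here the notebook kernel that started it with `Popen`. A sketch of a cleaner shutdown is to stop the server through the `Popen` handle and wait for it. In the example below a short-lived Python child process stands in for `mongod` (an assumption, so that the code runs without MongoDB installed).

```python
import subprocess
import sys

# stand-in child process (assumption: takes the place of the mongod process started via Popen)
p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

p.terminate()                  # ask the process to stop (SIGTERM on Linux)
try:
    p.wait(timeout=10)         # wait() collects the exit status, so no <defunct> entry remains
except subprocess.TimeoutExpired:
    p.kill()                   # force-stop if it did not exit in time
    p.wait()

print("exit code:", p.returncode)
```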