# Data Storage and Handling

#### A little excursion on how to store, organize and handle large data sets for analytics ... 

#### So far we have used:
* ***CSV*** files: text representation of tables
* ***Relational Data Bases***
* ***JSON*** files (from REST communications)


## Outline
* HDF5
* XML 
* JSON
* NoSQL Data Bases
* Use Case: Restaurant Rating Site

## The HDF5 Data Container Format
<img src="IMG/HDF_logo.png">

Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data with APIs for many programming languages.

#### HDF5 Structure
<img src="IMG/hdf5-folder.png" width=800>
<font size=5>[Image Source: https://www.sphenisc.com/doku.php/software/development/hdf5-phdf5]</font>

### HDF5 Key Features:
* POSIX-like path syntax for internal data structures, e.g. `/path/to/resource`
    * folders
    * meta data
    * comments (even code)
    * arrays 
* fast $n$-D data access 
* data compression
* APIs for many programming languages 

### In Python:
* ***h5py***: http://docs.h5py.org/en/stable/index.html
* ***HDF5 Docs:*** https://portal.hdfgroup.org/display/support

In [None]:
import h5py
import numpy as np
d1 = np.random.random(size = (1000,20))
d2 = np.random.random(size = (1000,200))


In [None]:
#create h5 file
hf = h5py.File('data.h5', 'w')
#write data
hf.create_dataset('dataset_1', data=d1)
hf.create_dataset('dataset_2', data=d2)
hf.close()

In [None]:
#read data
hf = h5py.File('data.h5', 'r') 
#get dataset
n1 = hf.get('dataset_1')
n1

In [None]:
#convert to NumPy
np.array(n1)
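
The key features listed above (folders, meta data, compression, fast partial access) can be sketched with h5py as follows. This is a minimal illustration, not a recommended layout: the file name `demo.h5`, the group `experiment_1`, the dataset `measurements` and the attribute names are made up for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical file in a temporary directory, so nothing existing is overwritten
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(1000 * 20, dtype=np.float64).reshape(1000, 20)

with h5py.File(path, "w") as hf:
    # groups behave like folders
    grp = hf.create_group("experiment_1")
    # chunked, gzip-compressed dataset inside the group
    dset = grp.create_dataset("measurements", data=data,
                              compression="gzip", chunks=True)
    # meta data is attached as attributes
    dset.attrs["unit"] = "seconds"
    hf.attrs["comment"] = "demo file for the lecture"

with h5py.File(path, "r") as hf:
    # POSIX-like path to the resource
    dset = hf["/experiment_1/measurements"]
    # slicing reads only the requested part from disk
    block = dset[10:15, :3]
    unit = dset.attrs["unit"]
    members = list(hf["experiment_1"].keys())

print(block.shape, unit, members)
```

Using `with` closes the file automatically, which avoids the explicit `hf.close()` call.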

#### more on HDF5 in the lab session...

## XML
<img src="IMG/xml.png" width=200>

***Extensible Markup Language (XML)*** is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. 
The design goals of XML emphasize simplicity, generality, and usability across the Internet. It is a textual data format with strong support via Unicode for different human languages. Although the design of XML focuses on documents, the language is widely used for the representation of arbitrary data structures such as those used in web services. 

### XML Tree Representation of Data
<img SRC="IMG/xml_tree.gif" width=800>

### Another XML Example
```
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
```


### XML with ***Python***

In [None]:
#get the data
!git clone https://github.com/keuperj/DATA.git

In [None]:
import xml.etree.ElementTree as ET
tree = ET.parse('DATA/example.xml') #parse xml document
root = tree.getroot() #get tree root


In [None]:
#get first elements of the tree
for child in root:
    print( child.tag, child.attrib)

In [None]:
#iterate over all neighbor elements and print their attributes
for neighbor in root.iter('neighbor'):
    print (neighbor.attrib)

In [None]:
#get all country nodes, then read the rank child element and the name attribute
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    print (name, rank)
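
If the `DATA` repository is not available, the same API can be tried on an XML string. The following is a small sketch that uses a shortened version of the country example from above. It converts each `country` element into a Python dictionary and selects one element with an XPath-style predicate. The variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

# parse from a string instead of a file
root = ET.fromstring(xml_text)

# element text is always a string, so numbers need an explicit conversion
records = [
    {
        "name": country.get("name"),
        "rank": int(country.findtext("rank")),
        "gdppc": int(country.findtext("gdppc")),
        "neighbors": [n.get("name") for n in country.findall("neighbor")],
    }
    for country in root.findall("country")
]

# select a single element by attribute value
singapore = root.find("country[@name='Singapore']")

print(records)
print(singapore.findtext("rank"))
```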

#### more on the ***Python*** ***XML*** API: https://docs.python.org/2/library/xml.etree.elementtree.html#

## NoSQL

<H2>NoSQL Data Bases: "<font color=red>N</font>ot <font color=red>O</font>nly SQL"</H2>
<H3>Requirements driven by Big Data and Analytics...</H3>

* Scalability
* Flexibility
* Throughput

### Typical Types of NoSQL Data Bases
* Document based Data Bases
* Key-Value Stores
* Column oriented Data Bases
* Graph Data Bases
* ...

<H3>Document based Data Bases</H3>
<BR>
<img src="IMG/MongoDB.png">
    
* Data stored in documents (files)
* Flexible structure in documents (like XML)
* Rich queries, comparable to SQL
* Support distributed operations (***MapReduce***)  


<H3>Key-Value Stores</H3>
<BR>
<img src="IMG/Riak.png">
    
* Simple Data Tuple: #Key : Value
* Very high throughput 
* Very low latency  


<H3>Column oriented Data Base</H3>
<BR>
<img src="IMG/Casandra.jpg">
    
* Data in tables
* Column first data access 
    * very good performance for many analytic use cases
    * e.g. aggregation operations 
* good compression support  
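
To illustrate why column-first access helps with aggregations, here is a minimal pure-Python sketch. It is not how a real column store is implemented. The sample rows are made up. The point is only that an aggregation over one column touches a single contiguous list instead of every full record.

```python
# row-oriented layout: one record per entry (hypothetical sample data)
rows = [
    {"name": "Shop A", "borough": "Bronx", "score": 2},
    {"name": "Shop B", "borough": "Queens", "score": 6},
    {"name": "Shop C", "borough": "Bronx", "score": 10},
]

# column-oriented layout: one list per column
columns = {key: [row[key] for row in rows] for key in rows[0]}

# aggregation on the row layout has to walk through all records
mean_from_rows = sum(row["score"] for row in rows) / len(rows)

# aggregation on the column layout reads only the 'score' column
scores = columns["score"]
mean_from_columns = sum(scores) / len(scores)

print(columns["score"], mean_from_rows, mean_from_columns)
```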

<H3>Graph based Data Bases</H3>
<BR>
<img src="IMG/Neo4J.png">
    
* Data structure: Graphs {vertices, edges}
* Applications: e.g. Social Networks, ...
* Queries like "find friends of friends" ... 
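
A "friends of friends" query can be sketched in plain Python on an adjacency structure. This only shows the idea of a graph traversal. A graph database would use its own query language and storage. The names and the helper `friends_of_friends` are made up for this example.

```python
# undirected friendship graph as adjacency sets (hypothetical data)
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

def friends_of_friends(graph, person):
    """Return everyone exactly two hops away from person."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    # drop the person and the direct friends
    return two_hops - direct - {person}

print(sorted(friends_of_friends(friends, "alice")))  # ['dave', 'erin']
```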



#### Disadvantages of NoSQL
* Relational DBs have a solid theory 
    * ACID 
    * relational algebra as mathematical foundation
        * allows proofs over DB queries   

* Is this true for NoSQL DBs?

<b>Only with constraints!</b>

<H2>CAP Theorem </H2>
<BR>
Basic properties for DB systems [Brewer]
<img src="IMG/CAP.png" width=800>

<img src="IMG/CAP2.png" width=800>

<H2>BASE Criteria for (NoSQL) Databases</H2>
<H3><font color="red">Ba</font>sically available, <font color="red">S</font>oft-State, <font color="red">E</font>ventual Consistency</H3>

* BASE derived from CAP-Theorem 
* Replaces ACID for distributed DBs
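
What "eventual consistency" means can be shown with a toy simulation. This sketch is a strong simplification and does not model any real database. The classes `Replica` and `Cluster` and the key name are invented for illustration. A write is accepted by the primary immediately, and the secondary only sees it after replication has run.

```python
class Replica:
    """One node holding a copy of the data."""
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    """Primary accepts writes, secondary is updated later."""
    def __init__(self):
        self.primary = Replica("primary")
        self.secondary = Replica("secondary")
        self.log = []  # writes not yet replicated

    def write(self, key, value):
        self.primary.data[key] = value
        self.log.append((key, value))

    def read_secondary(self, key):
        return self.secondary.data.get(key)

    def replicate(self):
        # apply all pending writes to the secondary
        for key, value in self.log:
            self.secondary.data[key] = value
        self.log.clear()

cluster = Cluster()
cluster.write("rating:30075445", 10)
stale = cluster.read_secondary("rating:30075445")  # not replicated yet
cluster.replicate()
fresh = cluster.read_secondary("rating:30075445")  # now consistent

print(stale, fresh)
```

The system stays available for reads the whole time, but a read from the secondary can return outdated data until replication has caught up.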

<H2>Use Case:</H2>
<H3>A Restaurant rating system:</H3>
<img src="IMG/TA.png" width="65%">


<H3>Implementation with MongoDB</H3>
<BR>
<img src="IMG/MongoDB.png">
    
* Properties of MongoDB
    * Document oriented DB
        * Structure description in JSON
        <img src="IMG/json.jpg">
   


* Data: open data set with restaurants and ratings:
    * https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json

<H3>Example: JSON Document for a restaurant </H3>

```
{
  "address": {
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
     { "date": { "$date": 1393804800000 }, "score": 2 },
     { "date": { "$date": 1378857600000 }, "score": 6 },
     { "date": { "$date": 1358985600000 }, "score": 10 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}
```
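
Such a document can be read in Python with the standard `json` module. The sketch below parses the restaurant example, accesses nested fields and converts the `$date` values, assuming they are milliseconds since the Unix epoch (as in MongoDB's extended JSON export). The variable names are chosen for illustration.

```python
import json
from datetime import datetime, timezone

doc_text = """
{
  "address": {
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakery",
  "grades": [
     { "date": { "$date": 1393804800000 }, "score": 2 },
     { "date": { "$date": 1378857600000 }, "score": 6 },
     { "date": { "$date": 1358985600000 }, "score": 10 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}
"""

# JSON objects become dicts, JSON arrays become lists
doc = json.loads(doc_text)

# nested access: first coordinate is the longitude, second the latitude
lon, lat = doc["address"]["coord"]

# convert millisecond timestamps into ISO dates (UTC)
grades = [
    (datetime.fromtimestamp(g["date"]["$date"] / 1000, tz=timezone.utc).date().isoformat(),
     g["score"])
    for g in doc["grades"]
]
mean_score = sum(score for _, score in grades) / len(grades)

print(doc["name"], lon, lat)
print(grades, mean_score)
```

Note that strict JSON does not allow a trailing comma after the last array element, so `json.loads` would reject such input.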

<H2>Hands on!</H2>

In [None]:
#NOTE: this will only work if you have a local MongoDB Server running 

#import MongoDB client module
from pymongo import MongoClient
import warnings
warnings.filterwarnings('ignore') 
#connect to MongoDB on localhost
client = MongoClient()


In [None]:
#how many worker nodes are working in the MongoDB cluster?
client.nodes

<H3>What Data is on the  Cluster?</H3>

In [None]:
#see what databases are available
client.list_database_names()

In [None]:
#generate reference to "demo" database
db = client.demo

In [None]:
#list all collections 
db.list_collection_names()

<H2>MongoDB Queries</H2>


In [None]:
db.restaurants.count_documents({})

In [None]:
db.restaurants.find()[129]

<H3>Structured Queries</H3>

* Number of restaurants in Manhattan

In [None]:
db.restaurants.count_documents({"borough": "Manhattan"})

* All entries with at least one score > 10 and ZIP code 10075

In [None]:
db.restaurants.count_documents({"grades.score": {"$gt": 10}, "address.zipcode": "10075"})
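
In the query above, `grades` is an array of embedded documents. A condition on `grades.score` matches a restaurant as soon as at least one of its grades satisfies it. The following pure-Python sketch mimics that filter on a few made-up documents. It needs no MongoDB server, and the helper `matches` is hypothetical.

```python
# hypothetical sample documents with the same structure as the data set
restaurants = [
    {"name": "A", "address": {"zipcode": "10075"},
     "grades": [{"score": 5}, {"score": 12}]},
    {"name": "B", "address": {"zipcode": "10075"},
     "grades": [{"score": 3}, {"score": 9}]},
    {"name": "C", "address": {"zipcode": "10462"},
     "grades": [{"score": 20}]},
]

def matches(doc):
    """ZIP code must match AND any single grade must have a score > 10."""
    return (doc["address"]["zipcode"] == "10075"
            and any(g["score"] > 10 for g in doc["grades"]))

hits = [doc["name"] for doc in restaurants if matches(doc)]
print(hits, len(hits))  # ['A'] 1
```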

<H3>Iterators</H3>

* e.g. all bakeries in ZIP code 10075

In [None]:
cursor=db.restaurants.find({"cuisine": "Bakery","address.zipcode": "10075"})
for doc in cursor:
    print (doc["name"])
    

<H2>Map-Reduce with MongoDB</H2>
<H3>Compute histogram of review scores</H3>

In [None]:
from bson.code import Code
#map function
map = Code("function () {"
            "  this.grades.forEach(function(z) {"
            "    emit(z.score, 1);"
            "  });"
            "}")

#reduce function
reduce = Code("function (key, values) {"
              "  var total = 0;"
              "  for (var i = 0; i < values.length; i++) {"
              "    total += values[i];"
              "  }"
              "  return total;"
              "}")

#NOTE: Collection.map_reduce requires PyMongo 3.x (it was removed in PyMongo 4.0)
result = db.restaurants.map_reduce(map, reduce, "myresults")
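
The logic of the two JavaScript functions can be reproduced in plain Python to see what happens in each phase. This is only a single-process sketch of the map, group and reduce steps on made-up documents. MongoDB distributes these steps over the cluster. The names `map_doc` and `reduce_values` are chosen for this example.

```python
from collections import defaultdict

# hypothetical sample documents, reduced to the 'grades' field
docs = [
    {"grades": [{"score": 2}, {"score": 6}, {"score": 10}]},
    {"grades": [{"score": 6}, {"score": 10}]},
    {"grades": [{"score": 10}]},
]

def map_doc(doc):
    """Map phase: emit (score, 1) for every grade of a document."""
    for grade in doc["grades"]:
        yield grade["score"], 1

def reduce_values(key, values):
    """Reduce phase: add up all counts emitted for one score."""
    return sum(values)

# group phase: collect all emitted values per key
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_doc(doc):
        grouped[key].append(value)

histogram = {key: reduce_values(key, values)
             for key, values in sorted(grouped.items())}
print(histogram)  # {2: 1, 6: 2, 10: 3}
```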

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure()
df=pd.DataFrame(list(result.find()))
plt.bar(np.arange(20),df[0:20].value )
plt.xlabel('score')
plt.ylabel('# votes')
plt.title('Review Scores')

## Discussion