# Basic storage formats for unstructured data

Before deciding how to store unstructured data, you need to understand the difference between structured and unstructured data.

## Structured vs Unstructured vs Semi-structured.

Structured Data:
- Row and column format (or easily converted to rows and columns)
- Fixed length/width
- Missing values = blank
- Storage formats: CSV, TXT, XLS.

Semi-Structured or Unstructured Data:
- Contains tags, keys or other markers.
- Nested or hierarchical data.
- Avoids messy translations into a relational data model.
- Storage formats: JSON, XML, HTML, Pickle.
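
As a minimal sketch of the difference, using only the standard library: a flat record maps directly to a CSV row, while a nested record is more naturally kept as JSON. Both records below are made up for illustration.

```python
import csv
import io
import json

# Hypothetical flat record: every field is a single value, so it fits one CSV row.
flat = {"title": "This is paper 1", "year": 2020, "pages": 12}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat))
writer.writeheader()
writer.writerow(flat)
csv_text = buffer.getvalue()
print(csv_text)

# Hypothetical nested record: "authors" holds a list of dicts, which has no
# natural single-cell representation in a row/column layout.
nested = {
    "title": "This is paper 1",
    "authors": [
        {"name": "Auteur1", "affiliations": ["University of Mannheim"]},
        {"name": "Auteur2", "affiliations": ["University of Strasbourg"]},
    ],
}
json_text = json.dumps(nested, indent=2)
print(json_text)
```

Flattening the nested record into rows would mean repeating the title for every author or splitting the data over several tables, which is the "messy translation" mentioned in the list.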

N.B.: "Unstructured data" might be an abuse of language. All data is structured in some way, or your computer would not be able to read it, so data is at least semi-structured. Information, on the other hand, can be unstructured. An image has structured data (pixel values, author, date, type, ...), but the information in the image is unstructured: the computer does not understand that the image shows a "red car". As an observer, you structure this information by grouping the pixels into objects.

## Dict

A dictionary is a changeable collection of key-value pairs in which each value is indexed by its key (i.e. it can store semi-structured data). Since Python 3.7, a dict also preserves insertion order.

In [1]:
paper = {"authors" : ["Auteur1","Auteur2","Auteur3"],
         "title" : "This is paper 1",
         "affiliations" : ["University of Mannheim","University of Strasbourg"],
         "ref" : ["This is ref 1","This is ref 2","This is ref 3"]}
print(paper)
print(type(paper))
print(dir(paper))

{'authors': ['Auteur1', 'Auteur2', 'Auteur3'], 'title': 'This is paper 1', 'affiliations': ['University of Mannheim', 'University of Strasbourg'], 'ref': ['This is ref 1', 'This is ref 2', 'This is ref 3']}
<class 'dict'>
['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
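
The "changeable" and "indexed" properties can be sketched as follows. The record is a shortened, made-up variant of `paper`, and the `year` field is an assumption added for illustration.

```python
paper = {"title": "This is paper 1", "authors": ["Auteur1", "Auteur2"]}

# Indexed: look a value up by its key.
first_author = paper["authors"][0]

# get() returns a default instead of raising KeyError when the key is missing.
year = paper.get("year", "unknown")

# Changeable: add a new key and modify an existing value in place.
paper["year"] = 2020
paper["authors"].append("Auteur3")

# Iterating over a dict yields its keys.
keys = list(paper)
print(first_author, year, keys)
```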


### TODO 1: Create a dict from the papers of [LeCun et al.](https://www.researchgate.net/publication/277411157_Deep_Learning) and [Goodfellow et al.](https://arxiv.org/abs/1406.2661) with authors, title, affiliations.

There are multiple options for saving a dict.

In [3]:
# JSON
import json

# Write the dict to a JSON text file
with open('data/data.json', 'w') as fp:
    json.dump(paper, fp)

# Read the file back into a new dict
with open('data/data.json', 'r') as fp:
    test = json.load(fp)

print(test)
print(type(test))
print(dir(test))


{'authors': ['Auteur1', 'Auteur2', 'Auteur3'], 'title': 'This is paper 1', 'affiliations': ['University of Mannheim', 'University of Strasbourg'], 'ref': ['This is ref 1', 'This is ref 2', 'This is ref 3']}
<class 'dict'>
['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']


### TODO 2: Save your dict with the 2 papers and load it back

In [5]:
# Pickle
import pickle

# Pickle is a binary format, so the file is opened in 'wb' / 'rb' mode
with open('data/data.pickle', 'wb') as fp:
    pickle.dump(paper, fp)

with open('data/data.pickle', 'rb') as fp:
    test = pickle.load(fp)

print(test)
print(type(test))
print(dir(test))

{'authors': ['Auteur1', 'Auteur2', 'Auteur3'], 'title': 'This is paper 1', 'affiliations': ['University of Mannheim', 'University of Strasbourg'], 'ref': ['This is ref 1', 'This is ref 2', 'This is ref 3']}
<class 'dict'>
['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']
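
Both formats give back the same dict for `paper`, but they do not behave the same for every Python object. The sketch below uses a made-up record with an integer key and a tuple value, and round-trips it in memory rather than through a file.

```python
import json
import pickle

# Hypothetical record using an integer key and a tuple value.
record = {1: ("Auteur1", "Auteur2"), "title": "This is paper 1"}

# JSON is a text format: keys become strings and tuples become lists.
from_json = json.loads(json.dumps(record))

# Pickle is a Python-specific binary format: the original types are restored.
from_pickle = pickle.loads(pickle.dumps(record))

print(from_json)
print(from_pickle)
print(from_json == record, from_pickle == record)
```

JSON files can be read from other languages and opened in a text editor, while pickle files are Python-only and should only be loaded from sources you trust.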


In [7]:
# XML (Extensible Markup Language) has a tree-like structure.
# Several Python packages work with XML: lxml, xml.dom.minidom, xml.etree.ElementTree
# We will use lxml, but make sure to look at the docs of the others.

import lxml.etree

# Load the data
xml_file = "data/xml_file.nxml"
root = lxml.etree.parse(xml_file)

# Prettify = make it human readable
# print(lxml.etree.tostring(root, encoding="unicode", pretty_print=True))

# XML format: <TAG1 ATTRIBUTE1="..."><TAG2 ATTRIBUTE2="..."> TEXT </TAG2></TAG1>
# TAG2 is a child node of TAG1
# Read more about it here: https://www.w3schools.com/html/html_elements.asp

# To access elements we use a query language called XPath (attributes are prefixed with @)
# xpath("//TAG1[@ATTRIBUTE1='something']/TAG2[@ATTRIBUTE2='something']/text()")

abstract = root.xpath("//abstract//text()")
body = root.xpath("//body//text()")
title = root.xpath("//title-group//text()")
figures = root.xpath("//fig//text()")

aff = root.xpath("//aff/text()")
aff = [i for i in aff if not i.startswith((' ', '\t'))]
aff_label = root.xpath("//aff/label/text()")

mails = root.xpath("//author-notes/corresp")[0]
list(mails)  # child elements of <corresp>; getchildren() is deprecated in lxml

xref = {}
for affiliation,label in zip(aff,aff_label):
    xref[label]= affiliation

authors = root.xpath("//contrib")
authors = [list(i) for i in authors]  # child elements of each <contrib>
for author in authors:
    # Each <contrib> holds a <name> element plus <xref> affiliation markers
    names = [list(i) for i in author if i.tag == "name"][0]
    surname = [i.text for i in names if i.tag == "surname"]
    name = [i.text for i in names if i.tag == "given-names"]
    xrefs = [i.text for i in author if i.tag == "xref"]
    print(name, surname, xrefs)


['Zhuolin'] ['Yong'] ['1', '†']
['Linmei'] ['Zhuang'] ['1', '†']
['Yi'] ['Liu'] ['1']
['Xin'] ['Deng'] ['2']
['Dingde'] ['Xu'] ['3', '*']
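
For comparison, here is a minimal sketch of the same kind of extraction with `xml.etree.ElementTree` from the standard library, which supports a limited subset of XPath. The XML fragment is made up and only loosely modelled on the tags queried above; it is not the content of `xml_file.nxml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment, loosely modelled on the tags queried above.
xml_text = """
<article>
  <aff id="aff1"><label>1</label>University of Mannheim</aff>
  <aff id="aff2"><label>2</label>University of Strasbourg</aff>
  <contrib>
    <name><surname>Doe</surname><given-names>Jane</given-names></name>
    <xref>1</xref>
  </contrib>
</article>
"""
root = ET.fromstring(xml_text)

# ".//aff" finds every <aff> element below the root, at any depth.
# The institution name follows the <label> child, so it is the label's tail.
affiliations = {}
for aff in root.findall(".//aff"):
    label = aff.find("label")
    affiliations[label.text] = label.tail.strip()

# "[@attribute='value']" keeps only the elements whose attribute matches.
second = root.find(".//aff[@id='aff2']/label").tail

# Resolve each author's <xref> markers to the affiliation names.
authors = []
for contrib in root.findall(".//contrib"):
    surname = contrib.findtext("name/surname")
    given = contrib.findtext("name/given-names")
    refs = [x.text for x in contrib.findall("xref")]
    authors.append((given, surname, [affiliations[ref] for ref in refs]))

print(affiliations)
print(second)
print(authors)
```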


### TODO 3: Do the same for xml_file2, put the information in a dict and save it in a JSON file

Now imagine you have a huge amount of semi-structured data in the form of a dict. For example, you could request 30M tweets with the name of the person who tweeted, the text, the date, the language, the comments, ...

You could store it in a JSON or a Pickle file. Problems arise when you try to open it again. You may not want to load every tweet into memory (RAM and time issues), but a single JSON or Pickle file has to be loaded in full. You may also want to store these 30M tweets on another machine and connect to it from a small laptop. All of this shows the limitations of storing semi-structured data in files.
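
A single JSON document is parsed in one go. One partial workaround, which is an assumption of this sketch and not something these notes rely on, is to write one JSON object per line (often called JSON Lines) so the file can be read record by record. The tweets and the file name below are made up.

```python
import json
import os
import tempfile

# Hypothetical tweets; a real collection would be far larger.
tweets = [
    {"user": "alice", "lang": "en", "text": "hello"},
    {"user": "bob", "lang": "fr", "text": "bonjour"},
    {"user": "carol", "lang": "en", "text": "good morning"},
]

path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")

# Write one JSON document per line instead of one big JSON array.
with open(path, "w", encoding="utf-8") as fp:
    for tweet in tweets:
        fp.write(json.dumps(tweet) + "\n")

# Read the file back line by line: only one tweet is parsed at a time.
english_users = []
with open(path, "r", encoding="utf-8") as fp:
    for line in fp:
        tweet = json.loads(line)
        if tweet["lang"] == "en":
            english_users.append(tweet["user"])

print(english_users)
```

This keeps memory use low, but every query still scans the whole file and nothing here helps with access from another machine.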

Just as you went from CSV to SQL DBs for structured data, you can go from JSON to NoSQL DBs for semi-structured data. That is what MongoDB is, and we will learn to use it in the next chapter.

## NoSQL types

Multiple types of NoSQL DB exist. The most popular are listed below.

- Key-value stores 
- Document stores 
- Wide column stores
- Graph databases 

### Key-value stores

A key-value store is the simplest possible data model: it is a storage system that stores values indexed by a key (much like a dict). The key is generally an id, identifier or primary key, and the associated value is a binary object that the system does not inspect (a black box). CRUD operations exist on key-value stores (more on CRUD operations in Chap II).

pros: scalable and fast
cons: only suited to simple data and simple queries (queries are limited to keys since values are black boxes)
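
The black-box idea can be sketched with a toy in-memory store. `ToyKeyValueStore` and its method names are hypothetical and do not correspond to the API of any real key-value database.

```python
import pickle


class ToyKeyValueStore:
    """Hypothetical in-memory key-value store; values are kept as opaque bytes."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        # The store keeps only bytes, so it cannot filter on what is inside them.
        self._data[key] = pickle.dumps(value)

    def read(self, key):
        return pickle.loads(self._data[key])

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = pickle.dumps(value)

    def delete(self, key):
        del self._data[key]


store = ToyKeyValueStore()
store.create("user:1", {"name": "Auteur1", "lang": "fr"})
store.create("user:2", {"name": "Auteur2", "lang": "en"})
store.update("user:2", {"name": "Auteur2", "lang": "de"})
store.delete("user:1")

print(store.read("user:2"))
```

Every operation goes through the key. Finding "all users whose `lang` is `de`" would require reading and decoding every value on the client side.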

### Document stores

A document-oriented database extends the key-value model in the sense that values are stored in a structured format, called a document, that the database can understand (i.e. it is no longer a black box). You are therefore no longer limited in your queries: you can perform CRUD operations on keys but also on values. This allows the user to fetch entire pages of information (for example, blogs that contain a specific keyword) and is much appreciated by websites storing a lot of information.
The structure of a document store is as follows: DB-collection-document. CRUD operations are also available.

pros: extension of key-value (values can be examined), complex queries.
cons: slow for updating (not a problem as long as you can keep your index in RAM), difficult to query when keys are constantly changing.
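
A toy illustration of querying on values, with a plain list of dicts standing in for a collection. The documents and the `find` helper are hypothetical and far simpler than what a real document store offers.

```python
# Hypothetical collection: a list of documents (dicts) that need not share keys.
collection = [
    {"_id": 1, "title": "Intro to NoSQL", "tags": ["database", "nosql"]},
    {"_id": 2, "title": "Cooking pasta", "tags": ["food"]},
    {"_id": 3, "title": "MongoDB basics", "tags": ["database"], "author": "Auteur1"},
]


def find(documents, field, value):
    """Return the documents whose field equals value or, for lists, contains it."""
    results = []
    for doc in documents:
        stored = doc.get(field)
        if stored == value or (isinstance(stored, list) and value in stored):
            results.append(doc)
    return results


database_posts = find(collection, "tags", "database")
print([doc["title"] for doc in database_posts])
```

Because the store can look inside each document, the query is expressed on a field of the value instead of on the key.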


### Wide column stores

"It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table" (Wikipedia), i.e. it is more flexible than the typical SQL DB, but that is not the only difference. Some wide column DBs require you to specify fields (Cassandra), others do not. CRUD operations are available. Data is read by row (a key is assigned to each row).

pros: works well with flat data with a similar schema
cons: no ACID transactions, no complex queries (average, sum, ...), hard to change the structure.
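
The "columns can vary from row to row" idea can be sketched with a dict of dicts. This is only a toy model of the data layout, with made-up rows, and says nothing about how a real wide column store organises data on disk.

```python
# Hypothetical table: each row key maps to its own set of column names.
table = {
    "row1": {"name": "Auteur1", "university": "University of Mannheim"},
    "row2": {"name": "Auteur2", "email": "auteur2@example.org", "country": "FR"},
}

# Rows are read by key.
row = table["row2"]

# Columns can differ from row to row, so collect the names actually in use.
all_columns = sorted({column for columns in table.values() for column in columns})

# A missing column is simply absent; no placeholder is stored for it.
email_row1 = table["row1"].get("email")

print(row)
print(all_columns)
print(email_row1)
```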

### Graph DBs

A graph DB consists of nodes (individuals/agents) connected by edges (relations between two individuals). Each node and each edge has properties (e.g. individuals have different characteristics, and they are connected in a certain way defined by the edge's properties).

The pros and cons can be summed up by the fact that graph DBs are suited to graph data and nothing else.
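
A toy version of the node/edge model in plain Python. The people, relations and the `neighbours` helper are hypothetical, and edges are treated as undirected for simplicity.

```python
# Hypothetical graph: nodes and edges both carry properties.
nodes = {
    "alice": {"age": 30},
    "bob": {"age": 25},
    "carol": {"age": 41},
}
edges = [
    {"source": "alice", "target": "bob", "relation": "friend", "since": 2015},
    {"source": "alice", "target": "carol", "relation": "colleague", "since": 2019},
    {"source": "bob", "target": "carol", "relation": "friend", "since": 2020},
]


def neighbours(node, relation):
    """Return the nodes linked to `node` by an edge with the given relation."""
    found = []
    for edge in edges:
        if edge["relation"] != relation:
            continue
        if edge["source"] == node:
            found.append(edge["target"])
        elif edge["target"] == node:
            found.append(edge["source"])
    return found


print(neighbours("bob", "friend"))
```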



Read more about the types of NoSQL DBs here: https://www.ksi.mff.cuni.cz/~svoboda/courses/2016-2-MIE-PDB/lectures/Lecture-06-NoSQL.pdf

Refer to this website if you are interested in the most used DBs of each type: https://db-engines.com/en/ranking

- Key-value: #1 is Redis
- Document-oriented: #1 is MongoDB
- Wide column: #1 is Cassandra
- Graph: #1 is Neo4j
