# Basic storage format for unstructured data.

The first thing to do before thinking about how to store unstructured data is to understand the differences between structured and unstructured data. 

## Structured vs Unstructured vs Semi-structured.

There is a lot of disagreement about what exactly counts as structured and semi-structured data, so here are my two cents.

Structured Data:
- Row and column format (or can be easily converted to row and column)
- Fixed length/width
- Missing values = Blank
- Storing tools: CSV, XLS.

Semi-Structured or Unstructured Data:
- Contains tags, keys or other markers but follows no fixed structure.
- Nested or hierarchical data.
- Avoids messy translations into a relational data model.
- Storing tools: JSON, XML, HTML.

"Unstructured data" is an abuse of language. All data is structured in some way, or your computer would not be able to read it, so data is at least semi-structured. Information, on the other hand, can be unstructured. An image has structured data (pixel values, author, date, type, ...), but the information in the image is unstructured: the computer does not know that the image shows a "red car". As an observer, you structure this information yourself by grouping the pixels into objects.

## Text

Strings are the main source of semi-structured data. Hence emails, logs, Word documents, ... are all usually considered unstructured data.

For persistent storage of text you can use a .txt file, a log file, ...

In [1]:
# simple example of appending to a .txt file

import tqdm

# tqdm displays a progress bar for the loop
for i in tqdm.tqdm(range(10000)):
    # "with open" closes the file automatically, so no explicit close() is needed
    # 'a+' = append, and create the file if it does not exist
    # (reopening the file on every iteration keeps the example simple but is slow)
    with open('data/Chap2/example1.txt', 'a+') as f:
        # write i as a str followed by a newline. \n is an escape sequence, like \t
        # to keep \n as literal text, prefix the string with r (raw string)
        f.write(str(i) + "\n")
        

100%|██████████| 10000/10000 [02:24<00:00, 68.97it/s]
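
Reading the text back is the other half of persistent storage. The sketch below is only an illustration: it writes to a hypothetical scratch file in a temporary directory (not `data/Chap2/example1.txt`), opens the file once for all writes, and then reads it line by line.

```python
import os
import tempfile

# Hypothetical scratch file, so the sketch does not touch the course data folder
path = os.path.join(tempfile.mkdtemp(), "example_read.txt")

# Open the file once and write several lines
with open(path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(str(i) + "\n")

# Read the file back line by line; strip() removes the trailing "\n"
with open(path, "r", encoding="utf-8") as f:
    numbers = [int(line.strip()) for line in f]

print(numbers)
```

Iterating over the file object reads one line at a time, so the whole file never has to sit in memory at once.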


## Dict

A dictionary is a mutable collection of key-value pairs, indexed by key (i.e. it can store semi-structured data). Since Python 3.7 it also preserves insertion order.

In [None]:
# Dict format in Python: {key: value}. The value can be a list or a dict.
# Keys must be hashable; use str keys if you plan to save the dict as JSON, and avoid "." in keys

paper = {"authors" : ["Auteur1","Auteur2","Auteur3"],
         "title" : "This is paper 1",
         "affiliations" : ["University of Mannheim","University of Strasbourg"],
         "ref" : ["This is ref 1","This is ref 2","This is ref 3"]}

print(paper)
print(type(paper))
print(dir(paper))

There are multiple options to save a dict. Below is one of them, using the JSON format.
What exactly is JavaScript Object Notation (JSON)?

"JSON is an open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types." (Wikipedia)

Two important things from this definition: attribute-value pairs and human-readable text.
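
To make these two points concrete, here is a minimal sketch using the standard `json` module. The `record` dict is a hypothetical example, not part of the course data. It shows the text that JSON produces and what happens to a non-string key on the way back.

```python
import json

# Hypothetical record; note the int key 1 next to the str keys
record = {"title": "This is paper 1", "authors": ["Auteur1", "Auteur2"], 1: "an int key"}

# json.dumps returns the JSON text as a str; indent=2 makes it easier to read
text = json.dumps(record, indent=2)
print(text)

# Loading the text back gives a dict again, but the int key 1 is now the str "1"
restored = json.loads(text)
print(list(restored.keys()))
```

JSON object keys are always strings, which is one reason to stick to str keys in dicts that you intend to save as JSON.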

In [None]:
# JSON
# Same pattern as for the txt file, except that we use json.dump and json.load instead of write and read

import json
with open('data/Chap2/data.json', 'w') as fp:
    json.dump(paper, fp)
    
with open('data/Chap2/data.json', 'r') as fp:
    test = json.load(fp)

print(test)
print(type(test))
print(dir(test))


We continue to work on JSON data, this time with a dataset from [here](https://data.world/doc/commerce-enterprise-data).

In [None]:
# CRUD operations on JSON

# READ
import json
with open('data/Chap2/data_world.json', 'r', encoding= "utf-8") as fp:
    docs = json.load(fp)
    
print(docs[0].keys())

In [None]:
# Update
import tqdm

for doc in tqdm.tqdm(docs):
    if doc["publisher"]["@type"] == "org:Organization":
        doc['title'] = "Hello world"

print(docs[1])

"""
with open('data/Chap2/data.json', 'w') as fp:
    json.dump(paper, fp)
"""

In [None]:
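# Delete

# Remove the "title" key from every doc published by an organization
for doc in docs:
    if doc["publisher"]["@type"] == "org:Organization":
        doc.pop("title", None)  # None avoids a KeyError if "title" is already gone

print(docs[0])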

JSON is an alternative to XML. "Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable." My recommendation: use JSON. However, if you encounter an XML file in your line of work, here is how you can work with it:

In [None]:
# XML, Extensible Markup Language: a tree-like structure.
# Several packages let you work with XML in Python: lxml, xml.dom.minidom, xml.etree.ElementTree
# We will use lxml, but make sure to look at the docs of the others.

import lxml.etree

# Load the data
xml_file = "data/Chap2/xml_file.nxml"
root = lxml.etree.parse(xml_file)



In [None]:
# Prettify = Make it human readable
print(lxml.etree.tostring(root, encoding="unicode", pretty_print=True)) 

In [None]:
"""
Some explanation of the XML format

XML format: <TAG1 ATTRIBUTE1="..."><TAG2 ATTRIBUTE2="..."> TEXT </TAG2></TAG1>
TAG2 is a child node of TAG1
read more about it here https://www.w3schools.com/html/html_elements.asp

To access elements we use a query language called XPath (attributes are prefixed with @)
xpath("//TAG1[@ATTRIBUTE1='something']/TAG2[@ATTRIBUTE2='something']/text()")
"""

In [None]:
abstract = root.xpath("//abstract//text()")
body = root.xpath("//body//text()")
title = root.xpath("//title-group//text()")
figures = root.xpath("//fig//text()")

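# Affiliation text sits directly inside <aff>; drop fragments that start with a space or tab
aff = root.xpath("//aff/text()")
aff = [i for i in aff if not i.startswith((' ', '\t'))]
aff_label = root.xpath("//aff/label/text()")

# Children of the first <corresp> element (list(element) replaces the deprecated getchildren())
mails = root.xpath("//author-notes/corresp")[0]
list(mails)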

print(abstract)
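
To see how tags, attributes and text relate without the course file, here is a small sketch on a hypothetical miniature document. It is an assumption-laden illustration: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (attribute filters with `@`, but no `text()` step), whereas lxml accepts the full syntax used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, loosely shaped like an article file
xml_text = """
<article>
  <front>
    <aff id="a1"><label>1</label>University of Mannheim</aff>
    <aff id="a2"><label>2</label>University of Strasbourg</aff>
  </front>
</article>
"""

root_element = ET.fromstring(xml_text)

# .//aff       -> every <aff> element anywhere below the root
# [@id='a2']   -> keep only those whose attribute id equals "a2"
# /label       -> then step down to their <label> child
labels = root_element.findall(".//aff[@id='a2']/label")
print([label.text for label in labels])

# Text that follows a child element is stored in that child's .tail
affiliations = {aff.get("id"): aff.find("label").tail for aff in root_element.iter("aff")}
print(affiliations)
```

The `.tail` attribute is how ElementTree and lxml expose text that comes after a child tag, which is why the affiliation name is not found in `aff.text` here.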

In [None]:
# Map each affiliation label to its affiliation text
xref = {}
for affiliation, label in zip(aff, aff_label):
    xref[label] = affiliation

# For every contributor, print given names, surname, first affiliation and all labels
authors = root.xpath("//contrib")
authors = [list(i) for i in authors]  # children of each <contrib>
for author in authors:
    names = [list(i) for i in author if i.tag == "name"][0]
    surname = [i.text for i in names if i.tag == "surname"]
    name = [i.text for i in names if i.tag == "given-names"]
    xrefs = [i.text for i in author if i.tag == "xref"]
    print(name, surname, xref[xrefs[0]], xrefs)


## Sound and Image

Working with sound and images is out of the scope of this course, but they are an important source of unstructured data (self-driving cars, Google Images, YouTube, Twitch, ...). You already know the persistent storage formats for images and sound (png, jpg, wav, mp3, ...).

In [None]:
from PIL import Image
import requests

im = Image.open(requests.get("https://www.actuia.com/wp-content/uploads/2020/09/BETA-Logo.png", stream=True).raw)
im.save("data/Chap2/image_beta.png", "png")

In [None]:
import numpy as np
from scipy.io.wavfile import read

# read returns a tuple: (sample rate, array of samples)
a = read("data/Chap3/sound2mongo.wav")
values = np.array(a[1], dtype=float)

## Scale it up

Now imagine you have a huge amount of semi-structured data in the form of dicts. For example, you could request 30M tweets, each with the name of the person who tweeted, the text, the date, the language, the comments, ...

You could store them in a JSON or a Pickle file. Problems arise when you try to load the file back. You may not want to load every tweet into memory (RAM and time issues), but a single JSON or Pickle file is read back as a whole, so these formats leave you no other choice. You may also want to store these 30M tweets on another machine and connect to it from a small laptop. All of this shows the limits of storing data without a database system.

Just as you went from CSV to SQL DBs for structured data, you can go from JSON to NoSQL DBs. MongoDB is one such database, and we will learn to use it in the next chapter.

## Vocabulary

Before going into more depth on the different types of NoSQL DBs, we will talk a bit about the vocabulary used in the database context. Don't worry, you do not need to know all of it by heart or understand everything at this point, but when you encounter these terms later you can come back here if needed:

- Database system: a collection of software programs and data structures designed to store, organize, manage, and retrieve large amounts of data. It typically consists of two main components: the database itself, which stores the data, and the database management system (DBMS).
- Database management system (DBMS): the software responsible for tasks such as data integrity, security, backup and recovery, performance tuning, and query optimization.
- Persistent storage: "a system that outlives (persists more than) the process that created it" (Wikipedia). You create a Python object, say a list or a dict; when you close the Python session you lose this object. In other words, it is not persistent. If you save it as a pickle or .txt file before closing the session, it becomes persistent.
- CRUD: Create, Read, Update, Delete (CRUD) are the most basic operations you can perform on persistent storage. For example, in a txt file you can append new observations and read existing ones, but also update and delete specific rows (the last two operations are a bit tedious in a simple .txt file but straightforward in DBs).
- Database transaction: a group of operations executed as a single unit. The bank example is often used: first you debit one account (1st operation), then you credit the other account (2nd operation). The whole is a transaction.
- Distributed data store: data is stored on different nodes (computers, if you want). If one node is down you can still access your data on other nodes (you usually put replica sets on different nodes).
- ACID:
    - Atomicity: either every operation in the transaction completes or none does.
    - Consistency: DBs live by a set of rules (primary keys, constraints, ...). At the end of a transaction these rules are still respected.
    - Isolation: when transactions run concurrently, the result is the same as if they had run sequentially.
    - Durability: once a transaction is committed, it is persistent.

Read more about ACID compliance and NoSQL ([1](https://dba.stackexchange.com/questions/185763/why-are-nosql-databases-not-acid-compliant),[2](https://stackoverflow.com/questions/2608103/is-there-any-nosql-data-store-that-is-acid-compliant))
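
The bank example and the atomicity property can be sketched with the standard library's `sqlite3` module. This is only an illustration under stated assumptions: the `account` table, the names and the `transfer` helper are all hypothetical, and SQLite stands in for any ACID-compliant DB.

```python
import sqlite3

# In-memory database with a hypothetical two-account table;
# the CHECK constraint is the "rule" that must hold after every transaction
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Credit and debit inside one transaction; undo both if either fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

In the second call the credit to bob succeeds first, but the debit violates the constraint, so the whole transaction is rolled back and neither balance changes.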

- CAP theorem: Consistency, Availability and Partition tolerance. "it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees" (Wikipedia). Well explained [here](https://www.youtube.com/watch?v=k-Yaq8AHlFA)

Read more on the CAP theorem ([1](https://en.wikipedia.org/wiki/CAP_theorem),[2](https://www.bmc.com/blogs/cap-theorem/))

This list is not set in stone, I might update it later.

## NoSQL types

Multiple types of NoSQL DBs exist. The most popular are listed below:

- Key-value stores 
- Document stores 
- Wide column stores
- Graph databases 

### Key-value stores

A key-value store is the simplest possible data model: it is a storage system that stores values indexed by a key (kind of like a dict). The key is generally an id, an identifier or a primary key, and the associated value is a binary object that the system does not inspect (a black box). CRUD operations exist on key-value stores.

pros: scalable and fast

cons: suited only to simple data and simple queries (queries are limited to the key, since values are black boxes)
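
The black-box idea can be sketched in a few lines. `TinyKeyValueStore` below is a hypothetical toy class, not how any real key-value DB is implemented: it keeps values as pickled bytes, so the store itself can only look things up by key.

```python
import pickle


class TinyKeyValueStore:
    """Toy in-memory key-value store: values are kept as opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Serialise to bytes: the store never looks inside the value
        self._data[key] = pickle.dumps(value)

    def get(self, key):
        return pickle.loads(self._data[key])

    def delete(self, key):
        del self._data[key]

    def keys(self):
        return list(self._data)


store = TinyKeyValueStore()
store.put("user:1", {"name": "Ada", "lang": "en"})
store.put("user:2", {"name": "Linus", "lang": "fi"})

first_user = store.get("user:1")  # lookup by key is direct
store.delete("user:2")
print(first_user, store.keys())
```

Finding "all users whose lang is en" would require fetching and decoding every value on the client side, which is the query limitation mentioned in the cons.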

### Document stores

A document-oriented database extends the key-value model in the sense that values are stored in a structured format, called a document, that the database can understand (i.e. it is no longer a black box). You are therefore no longer limited in your queries, and you can perform CRUD operations on keys but also on values. This allows the user to fetch entire pages of information (for example, blog posts that contain a specific keyword), which is much appreciated by websites storing a lot of information.
The structure of a document store: DB-collection-document.

pros: extension of key-value (values can be examined), complex queries.

cons: slow for updating (not a problem as long as your index fits in RAM), difficult to query when keys are constantly changing.
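
Querying on values rather than keys can be imitated with plain Python. The collection and the `find` helper below are hypothetical and only mimic the idea; a real document store would use indexes instead of scanning every document.

```python
# Hypothetical collection: a list of documents (dicts) with nested fields
collection = [
    {"_id": 1, "title": "Intro to NoSQL", "tags": ["db", "nosql"], "author": {"name": "Ada"}},
    {"_id": 2, "title": "Cooking pasta", "tags": ["food"], "author": {"name": "Linus"}},
    {"_id": 3, "title": "Graphs 101", "tags": ["db", "graph"]},
]


def find(collection, predicate):
    """Return the documents for which predicate(document) is true."""
    return [doc for doc in collection if predicate(doc)]


# Query on a value inside a list field
db_posts = find(collection, lambda doc: "db" in doc.get("tags", []))

# Query on a nested field that not every document has
by_ada = find(collection, lambda doc: doc.get("author", {}).get("name") == "Ada")

print([doc["_id"] for doc in db_posts])
print([doc["title"] for doc in by_ada])
```

Document 3 has no `author` field, which is fine here: documents in the same collection do not have to share the same keys.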


### Wide column stores

"It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table" (Wikipedia), i.e. it is more flexible than the typical SQL DB, although that is not the only difference. In some wide column DBs you need to declare the fields (Cassandra), in others you don't. CRUD operations are available. Data is read by row (a key is assigned to each row).

pros: works well with flat data that shares a similar schema

cons: no complex queries (average, sum, ...), hard to change the structure.
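
The "columns can vary from row to row" idea can be pictured as a dict of dicts. This is a rough mental model with hypothetical data and helper names, not a description of how Cassandra or any other wide column DB stores data internally.

```python
# Row key -> {column name: value}; each row may carry a different set of columns
table = {
    "user:1": {"name": "Ada", "email": "ada@example.org"},
    "user:2": {"name": "Linus", "city": "Helsinki", "last_login": "2020-09-01"},
}


def read_row(table, row_key):
    """Reading by row key is the natural access path."""
    return table.get(row_key, {})


def read_column(table, column):
    """Collect one column across rows, skipping rows that do not have it."""
    return {key: row[column] for key, row in table.items() if column in row}


row = read_row(table, "user:2")
cities = read_column(table, "city")
print(row)
print(cities)
```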

### Graph DBs

A DB that consists of nodes (individuals/agents) connected by edges (relations between two individuals). Each node and each edge has properties (e.g. individuals have different characteristics, and they are connected in a certain way defined by the edge's properties).

The pros and cons can be summed up by the fact that they are specific to graph data and nothing else.
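
A minimal sketch of nodes and edges that both carry properties, using plain Python structures. The people, relations and the `neighbours` helper are hypothetical; a graph DB would offer a dedicated query language for this kind of traversal.

```python
# Nodes and edges both carry properties
nodes = {"ada": {"age": 36}, "linus": {"age": 51}, "grace": {"age": 45}}
edges = [
    ("ada", "linus", {"relation": "colleague"}),
    ("ada", "grace", {"relation": "friend"}),
    ("grace", "linus", {"relation": "friend"}),
]


def neighbours(edges, node, relation=None):
    """Nodes directly connected to `node`, optionally filtered by an edge property."""
    result = []
    for source, target, props in edges:
        if relation is not None and props["relation"] != relation:
            continue
        if source == node:
            result.append(target)
        elif target == node:
            result.append(source)
    return result


all_of_ada = neighbours(edges, "ada")
friends_of_linus = neighbours(edges, "linus", relation="friend")
print(all_of_ada, friends_of_linus)
```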



## Sum up

Read more about the types of NoSQL DBs here: https://www.ksi.mff.cuni.cz/~svoboda/courses/2016-2-MIE-PDB/lectures/Lecture-06-NoSQL.pdf

Refer to this website if you want to see the most used DBs of each type: https://db-engines.com/en/ranking

- Key-value: #1 ranked is Redis
- Document-oriented: #1 is MongoDB
- Wide column: #1 is Cassandra
- Graph: #1 is Neo4j


### Exercise

**1**: Lorem Ipsum is placeholder text that developers use (especially in web development) when they don't have the real text and just want to test a feature. Using Python, write a [Lorem Ipsum](https://www.lipsum.com/) of 3 paragraphs to a txt file, with paragraphs separated by two newlines.

**2**: Update the txt file by removing the first paragraph.

**3**: Create a dict from the papers of [LeCun et al.](https://www.researchgate.net/publication/277411157_Deep_Learning) and [Goodfellow et al.](https://arxiv.org/abs/1406.2661) with authors, title, affiliations.

**4**: Save the previously created dict in the JSON format and load it back.

**5**: Save the previously created dict in the pickle format. Try to open it manually (i.e. with a text editor): is it human-readable?

**6**: Parse xml_file2 in the same way as in the lecture. Put the information in a dict and save it in a JSON file.

**7**: Download an image of your choice and save it in either jpg or png.

**8**: From the data/Chap2/data_world.json file, create a set of publisher types.

**9**: From the data/Chap2/data_world.json file, delete the key of your choice and save the new dict as data_world_cleaned.json.

**10**: From the data/Chap2/data_world.json file, create the co-occurrence matrix between "accessLevel" and "accrualPeriodicity".

