### Learning Objectives

This tutorial has the following learning objectives:

* Become familiar with some popular data formats
     * Columnar and row-wise formatting of data
     * How different data formats affect the wrangling of big data
     * Pros and cons of different file formats


### File Format: A Quick Intuition

* In big data, choosing the right storage format is paramount for achieving performance, saving space, and making certain operations possible.
* The right format can save time and cost, reduce computation time, etc.

* We're accustomed to row-based formats
  * For example, an Excel-like file where each row is a table entry

| Transaction Date 	| Nb Items 	| Total   	|
|------------------	|----------	|---------	|
| 01/01/2001       	| 4        	| 1852.14 	|
| 01/01/2001       	| 3        	| 968.00  	|
| `...`             | `...`     | `...`     |

### File Format: A Quick Intuition - Cont'd
    
* This format may be inappropriate for certain types of data or operations

* Data: Imagine that the sales info above contains hundreds of millions of transactions, with hundreds of thousands of transactions per day
    * The same transaction date will be unnecessarily duplicated hundreds of thousands of times.
    * Perhaps a dictionary-like format, where the key is the date, would be more appropriate.
 
```python
# notice the two ellipses.
{"01/01/2001": ((4, 1852.14), (3,  968.00), ...), ...}
```
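
As a rough sketch of how row-wise records could be regrouped into such a date-keyed structure (the third row is a made-up value added only for illustration):

```python
from collections import defaultdict

# Row-wise records: (transaction date, number of items, total)
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
    ("01/02/2001", 1, 25.50),  # hypothetical extra row, not from the table above
]

# Group by date so that each date string is stored once, as a key
by_date = defaultdict(list)
for date, nb_items, total in rows:
    by_date[date].append((nb_items, total))

print(dict(by_date))
```

Each date now appears once no matter how many transactions share it, at the cost of a structure that is harder to read line by line.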

* Operation: Imagine that the objective is to compute the total sales
  * We need to read millions of lines to compute a single value.
  * Perhaps we can store each variable as a row instead. Reading a single line is then sufficient to compute the total.

|              |      |  |   |
| :---              |    :----:  | :--------: |:---:|
| **Totals** 	        | 1852.14    | 968.00     | `...` |
| **Transaction Dates** | 01/01/2001 | 01/01/2001 | `...` |
| **Items**       	    | 4        	 | 3	      | `...` |


* Question: Can you think of a scenario where the data format above is not ideal?

### File Formats

* There are 4 considerations when selecting file formats:
    * Row vs Column
    * Schema Management
    * Splitability
    * Compression


### 1- Row- and Column-Based Formats


* An important consideration when selecting a big data format

* Row-based: ideal when using all the data
  * For example, building a machine learning model that requires all the features
  * If only a subset of the data is needed, unnecessary data can be removed
    * For large datasets, unneeded data can be dropped after each read.
      * This avoids loading the complete dataset in RAM


* Column-based storage: useful when performing operations on a subset of columns
  * Computing total sales, or totals aggregated by date, etc.
  


### Row-Based Formats

* Simplest form of data
* Used in many applications, from web log files to highly structured database systems like MySQL and Oracle.

* A row is a new instance, i.e., a single object containing values for all the variables (or features).

* Processing the data requires reading all inputs line by line


* This is commonly used for Online Transactional Processing (OLTP).
  * OLTP systems usually process CRUD queries (Create, Read, Update and Delete) at the record level.
  * The main emphasis of OLTP systems is maintaining data integrity in multi-access environments; their effectiveness is measured by the number of transactions per second
    * More on this when we discuss big data platforms



### Column Based Formats


* Data is grouped by columns
* Easy to focus computation on specific columns of data
  * Ex. Searching for the largest value is easier since data is stored sequentially by column.
  
* Ideal for compression
  * Compression codecs (ex. GZIP) achieve a higher compression ratio when compressing sequences of similar data.
  * Let's do an experiment
    * We'll do such small experiments a lot, to get a handle on Python.
    
  * Typically, the slowest components in a large distributed system are the disk and the network
      * Using compression reduces read I/O and transfers, thus speeding up the analysis.
    
* This way of processing data is usually called an OLAP (OnLine Analytical Processing) query.
  * OLAP is an approach designed to quickly answer analytics queries involving multiple dimensions
    


In [27]:
import random
random.choices([1,2,3,4], k=6)

[1, 4, 4, 1, 1, 3]

In [28]:
import random
random.choices("ACGT", k=6)

['T', 'C', 'T', 'T', 'A', 'G']

In [29]:
import string 
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [30]:
import zlib 
import string
# Randomly generate two sequences of 1000 characters: printable ASCII and digits

random_ASCII = random.choices(string.printable, k=1000)
random_numbers = random.choices(string.digits, k=1000)

# Only the last expression in a cell is displayed, so the output below is the size for the digits
len(zlib.compress(str.encode("".join(random_ASCII))))
len(zlib.compress(str.encode("".join(random_numbers))))

503
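
To see both sizes side by side, the experiment can be rerun with explicit `print` calls. This is a small sketch; the fixed seed is an addition so that the run is repeatable, and the exact byte counts will differ from the output above.

```python
import random
import string
import zlib

random.seed(0)  # fixed seed (an assumption, not in the original cell) for repeatability

# 1000 characters drawn from ~100 printable symbols vs. from only 10 digits
random_ascii = "".join(random.choices(string.printable, k=1000))
random_digits = "".join(random.choices(string.digits, k=1000))

ascii_size = len(zlib.compress(random_ascii.encode()))
digits_size = len(zlib.compress(random_digits.encode()))

print(f"printable: {ascii_size} bytes, digits: {digits_size} bytes")
```

The digits string compresses to fewer bytes because each character carries less information: there are only 10 possible symbols instead of about 100.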

### Column-based formats: Advantages and Disadvantages

Advantages:
* Columnar storage of data can sometimes yield a 100x-1000x performance improvement, specifically on data with hundreds of columns
    * Such data is called wide data

Disadvantages:
  * Hard for a human to read, but how useful is it to read a large row-based file by eye anyway?
  * Can be more CPU-intensive to write for very large data.
    * Need to collect the data for each column before writing it to file
  * Difficult to access a single instance (an entry across all columns)
  * Not efficient with CRUD queries

### 2- Datatype & Schema Enforcement and Evolution

* "Schema", in a database context, means the structure and organization of the data
    * Structure: datatypes, missing values, primary keys, indices, etc.
    * Organization: relationships across tables.

* Here, we mostly refer to the data type

* With a simple text format (e.g., a table with values separated by spaces), datatypes cannot be declared or enforced

* Declaring the type of a value provides some advantages.
  * Storage requirements: a string requires more storage than a boolean (the text `False` takes 5 bytes, whereas a boolean fits in a single byte)
  * Data validity: guarantees that the values in the dataset have the expected type
  * Compression: we know how to compress different types.




### 2- Datatype & Schema Enforcement and Evolution  - Cont'd
Unless your data is guaranteed to never change, you’ll need to think about schema evolution, or how your data schema changes over time. How will your file format manage fields that are added or deleted? When evaluating schema evolution specifically, there are a few key questions to ask of any data format:

* How easy is it to update a schema (such as adding, removing, or renaming a field)?
* How will different versions of the schema “talk” to each other?
* Is it human-readable? Does it need to be?
* How fast can the schema be processed?
* How does it impact the size of the data?

Example, for a dictionary ... 


### 3- Splitability

* Big data can often comprise many millions of records, often split across a large number of files
  * Think, for instance, of monthly logs, yearly transactions, or daily airplane sensor readings
* It is often useful to split the data across multiple machines and execute each computation separately

* Some file formats are more amenable to splitting than others.


### 3- Splitability - Row-based

Row-based formats can be split along row boundaries

```
# file 1  with n lines
01/01/2001       	4        	1852.14
01/01/2001       	3        	968.00
... 
```

* Splitting `file 1` with `n` lines across `m` total machines is easy.
  * `m-1` machines get `round(n/m)` unique lines each; the last machine gets the remaining lines
      * Machine 1 gets lines 1 through $\frac{n}{m}$
      * Machine 2 gets lines $\frac{n}{m}+1$ through $2 * \frac{n}{m}$
      * Machine x gets lines $(x-1) * \frac{n}{m}+1$ through $x * \frac{n}{m}$
      
      
* Challenges:
    * Partitioning on a particular column value can be difficult if the data is stored in a random order.
    * E.g.: splitting on the number of items requires sorting the data on the 2nd column first, then splitting after reading
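
The line assignment described above can be sketched as follows. The function name `split_lines` is made up for this example, it assumes `n >= m`, and it uses floor division rather than `round` so that the last machine is never left with a negative share.

```python
def split_lines(n, m):
    """Return one (first_line, last_line) pair per machine, 1-indexed and inclusive.

    The first m-1 machines get n // m lines each; the last machine
    also takes whatever remains.
    """
    size = n // m
    ranges = []
    for machine in range(m):
        start = machine * size + 1
        end = n if machine == m - 1 else (machine + 1) * size
        ranges.append((start, end))
    return ranges

print(split_lines(10, 3))  # [(1, 3), (4, 6), (7, 10)]
```

No line needs to be parsed to compute these ranges, which is what makes row-based files easy to split.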

### 3- Splitability: Row-based, nested

* Some file formats are more amenable to splitting than others.


``` 
file 2
{"01/01/2001": [(4, 1852.14), (3, 968.00)], ....}
```

* You cannot easily split this file format without parsing the file first.
  * Need to read the complete file to split it into chunks.
    * Data may need to be loaded in RAM first.




### 3- Splitability: Column-based, nested 


* A column-based format can be split if the computation is column-specific.

``` 
# file 3
date: 01/01/2001, 01/01/2001
nb_items: 4, 3
totals: 1852.14, 968.00
```

* With the example above, each machine is concerned with a computation on a specific variable. For example:
  * Machine 1 takes the `date` data and computes the number of sales per month
  * Machine 2 takes the `nb_items` data and computes the total number of items sold
  * Machine 3 takes the `totals` data and computes the total sales value
  
* Machines don't have any knowledge of variables that they are not given.
  * E.g., machine 3 is not given the date info and cannot compute, for example, the monthly or weekly sales average.
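
A small sketch of the same idea in Python, using a dict of lists as a stand-in for a column-based file (the values are those of `file 3`):

```python
# Column-oriented copy of the sample data
columns = {
    "date": ["01/01/2001", "01/01/2001"],
    "nb_items": [4, 3],
    "totals": [1852.14, 968.00],
}

# Each of these computations touches only the one column it needs
total_sales = sum(columns["totals"])
total_items = sum(columns["nb_items"])

# A computation spanning two columns needs both of them in the same place
sales_by_date = {}
for date, total in zip(columns["date"], columns["totals"]):
    sales_by_date[date] = sales_by_date.get(date, 0.0) + total

print(total_sales, total_items, sales_by_date)
```

The last loop is the kind of work a machine holding only `totals` cannot do, since it has no access to the `date` column.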



### 4- Compression


* When working on a distributed system, data transfers can be a serious bottleneck
* Compression can substantially improve runtime and storage requirements

* Columnar data can achieve better compression rates than row-based data
  * Simple way to think about it: a column will have a lot more duplicate values:
      * Ex. Age column: 21,22,21,24,25,21,22,21,19,21,21,22, ....
      
* Using complex compression algorithms on very large files can save space but substantially increase compute time.
    * Decompression/re-compression needs to occur every time you need to access the data.
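
One way to explore the row-versus-column claim is to compress the same synthetic values in both layouts. This is only a sketch: the data is randomly generated, the column names are made up, and `zlib` stands in for whatever codec a real format would use.

```python
import random
import zlib

random.seed(1)  # fixed seed so the run is repeatable
n = 2000
ages = [random.choice([19, 21, 22, 24, 25]) for _ in range(n)]
cities = [random.choice(["HONOLULU", "MONTREAL", "NY"]) for _ in range(n)]
totals = [f"{random.uniform(1, 2000):.2f}" for _ in range(n)]

# Row layout: one record per line
row_text = "\n".join(f"{a},{c},{t}" for a, c, t in zip(ages, cities, totals))

# Column layout: one column per line, same values and same number of characters
col_text = "\n".join([
    ",".join(map(str, ages)),
    ",".join(cities),
    ",".join(totals),
])

row_size = len(zlib.compress(row_text.encode()))
col_size = len(zlib.compress(col_text.encode()))
print(f"raw: {len(row_text)} bytes, row layout: {row_size}, column layout: {col_size}")
```

On data like this the column layout often comes out smaller, because similar values sit next to each other, but the exact numbers depend on the data and on the codec.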


### Standardization and File Formats
* Naturally, one can choose to structure a file (either binary or text) in one's own format
  * Many companies may choose to do so internally for many reasons
  * E.g.:

```
RECORD 1: 
FIRST_NAME_1\sLAST_NAME_1\tFIRST_NAME_2\sLAST_NAME_2\tFIRST_NAME_3\sLAST_NAME_3...
POSITION_1\sSALARY_1\tPOSITION_2\sSALARY_2\tPOSITION_3\sSALARY_3
```

* However, there are many benefits to using a standard file format. E.g.:
  * Clarity - eliminates the need for guesswork or extra searching
  * Quality - designed by large teams and used extensively, which provides opportunities to find and correct bugs
  * Productivity - no need to maintain internal documentation; easier to get answers online when issues arise.
  * Interoperability - your data is no longer locked to your company and can be used across platforms

* Some of the most used formats are CSV, JSON, Parquet, AVRO, and HDF5
  * Very well supported in Python



### CSV File Format

* Files in the CSV (comma-separated values) format are usually used to exchange tabular data
  * Plain-text file (readable characters)
  
* CSV is a row-based file format: each row of the file is a separate data instance
  * May or may not contain a header
* Structure is conveyed through explicit commas
  * Fields whose text contains commas are enclosed in double quotes

```
Title,Author,Genre,Height,Publisher
"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson
Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress
Making Software,"Oram, Andy",computer_science,232,O'Reilly
....
```
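
The effect of the double quotes can be checked with Python's `csv` module. This sketch parses the first rows of the sample above from an in-memory string and compares the result with a naive split on commas.

```python
import csv
import io

sample = '''Title,Author,Genre,Height,Publisher
"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson
Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress
'''

rows = list(csv.reader(io.StringIO(sample)))
header, first = rows[0], rows[1]
print(len(first), first[0])

# A naive split on commas breaks the quoted fields apart
naive = sample.splitlines()[1].split(",")
print(len(naive))
```

The `csv` reader returns 5 fields for the first book, matching the header, while the naive split yields 7 pieces because it cuts inside the title and the author name.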


### CSV File Format - Cont'd

* The CSV format is not fully standardized
  * Fields may be separated by other characters, such as tabs (TSV) or spaces (SSV)
  
* Connections between data are usually established using multiple CSV files.
   * Uses foreign keys (specific columns) across files
   * The connection is not expressed in the file format
  
* Data structure is conveyed through redundant values across files

* Native support in Python
```python
import csv
# use csv ...
```



In [10]:
# All_Time_Worldwide_Box_Office_partial.csv
import csv

# The with block closes the file once reading is done
with open('../Data/All_Time_Worldwide_Box_Office_partial.csv', newline='') as csvfile:
    movies_file = csv.reader(csvfile, delimiter=',', quotechar='"')
    # Each line is returned as a list of strings
    for i, line in enumerate(movies_file):
        print(f"Line {i}: {line}")

Line 0: ['Rank', 'Year', 'Movie', 'WorldwideBox Office', 'DomesticBox Office', 'InternationalBox Office']
Line 1: ['1', '2009', 'Avatar', '$2,845,899,541', '$760,507,625', '$2,085,391,916']
Line 2: ['2', '2019', 'Avengers: Endgame', '$2,797,800,564', '$858,373,000', '$1,939,427,564']
Line 3: ['3', '1997', 'Titanic', '$2,207,986,545', '$659,363,944', '$1,548,622,601']
Line 4: ['4', '2015', 'Star Wars Ep. VII: The Force Awakens', '$2,064,615,817', '$936,662,225', '$1,127,953,592']
Line 5: ['5', '2018', 'Avengers: Infinity War', '$2,044,540,523', '$678,815,482', '$1,365,725,041']
Line 6: ['6', '2015', 'Jurassic World', '$1,669,979,967', '$652,306,625', '$1,017,673,342']
Line 7: ['7', '2019', 'The Lion King', '$1,654,367,425', '$543,638,043', '$1,110,729,382']
Line 8: ['8', '2015', 'Furious 7', '$1,516,881,526', '$353,007,020', '$1,163,874,506']
Line 9: ['9', '2012', 'The Avengers', '$1,515,100,211', '$623,357,910', '$891,742,301']
Line 10: ['10', '2019', 'Frozen II', '$1,446,925,396', '$4

In [11]:
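# All_Time_Worldwide_Box_Office_partial.csv
import csv

# The with block closes the file once reading is done
with open('../Data/All_Time_Worldwide_Box_Office_partial.csv', newline='') as csvfile:
    movies_file = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    # Each line is returned as a dict keyed by the header fields
    for i, line in enumerate(movies_file):
        print(f"Line {i}: {line}")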

Line 0: {'Rank': '1', 'Year': '2009', 'Movie': 'Avatar', 'WorldwideBox Office': '$2,845,899,541', 'DomesticBox Office': '$760,507,625', 'InternationalBox Office': '$2,085,391,916'}
Line 1: {'Rank': '2', 'Year': '2019', 'Movie': 'Avengers: Endgame', 'WorldwideBox Office': '$2,797,800,564', 'DomesticBox Office': '$858,373,000', 'InternationalBox Office': '$1,939,427,564'}
Line 2: {'Rank': '3', 'Year': '1997', 'Movie': 'Titanic', 'WorldwideBox Office': '$2,207,986,545', 'DomesticBox Office': '$659,363,944', 'InternationalBox Office': '$1,548,622,601'}
Line 3: {'Rank': '4', 'Year': '2015', 'Movie': 'Star Wars Ep. VII: The Force Awakens', 'WorldwideBox Office': '$2,064,615,817', 'DomesticBox Office': '$936,662,225', 'InternationalBox Office': '$1,127,953,592'}
Line 4: {'Rank': '5', 'Year': '2018', 'Movie': 'Avengers: Infinity War', 'WorldwideBox Office': '$2,044,540,523', 'DomesticBox Office': '$678,815,482', 'InternationalBox Office': '$1,365,725,041'}
Line 5: {'Rank': '6', 'Year': '2015',
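
Note that every value read from the CSV file is a string, including ranks, years, and dollar amounts. Below is a minimal sketch of converting one record by hand; `parse_dollars` is a hypothetical helper written for this example.

```python
def parse_dollars(text):
    """Convert a string such as '$2,845,899,541' into an int."""
    return int(text.replace("$", "").replace(",", ""))

# One record copied from the output above
record = {"Rank": "1", "Year": "2009", "Movie": "Avatar",
          "WorldwideBox Office": "$2,845,899,541"}

typed = {
    "Rank": int(record["Rank"]),
    "Year": int(record["Year"]),
    "Movie": record["Movie"],
    "WorldwideBox Office": parse_dollars(record["WorldwideBox Office"]),
}
print(typed)
```

This conversion has to be written and maintained by the reader of the file, since CSV itself carries no type information.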

### CSV Pros and Cons
➕ CSV is human-readable and easy to edit manually;

➕ CSV provides a simple schema;

➕ CSV can be processed by almost all existing applications;

➕ CSV is easy to implement and parse;

➕ CSV is compact. In XML, you open and close a tag for each column in each row. In CSV, the column headers are written only once;

➖ No guarantee that data won't be missing or won't be in a different format;

➖ Complex data structures need to be implemented by referencing separate files;

➖ There is no standard way to represent binary data;

➖ Problems with CSV import (for example, no distinction between NULL and an empty quoted string);

➖ Poor support for special characters;

➖ Lack of a universal standard.

### JSON File Format

* JSON (JavaScript Object Notation): an open standard file format that uses human-readable text
  * Typically stored using the .json extension.
* Became popular as a space-saving alternative to XML
* Inspired by JavaScript objects, but is a language-independent data format.
* Very similar to Python's lists and dicts
* Also supported natively in Python
  ```python
  import json
  # Do something with the json library
  ```
* The de facto language of the web
  * Supported in all modern languages, and particularly web languages.

### JSON File Structure



JSON values can be of the following types.

* Scalar values

    * `Number`: e.g. 3
    
    * `String`: Sequence of Unicode characters surrounded by double quotation marks.
    
    * `Boolean`: `true` or `false`.

    * `null`: represents a missing value.

* Collections:

    * `Array`: A list of values surrounded by square brackets, for example `[1, "two", false]`
    * `Object`: key-value pairs separated by commas (,) and surrounded by curly braces
      *  Keys are strings. Values can be any valid scalar or collection

* See the following for more details: https://docs.fileformat.com/web/json/ 
* See the following useful validator for checking JSON files or records: https://jsonformatter.curiousconcept.com/#


In [34]:
my_data = [ 
    {'First Name': "John", "Occupation": "Student", "Salary": 120_000, "volunteer": False}, 
    {'First Name': "John", "Occupation": "Student", "salary": None, "volunteer": True}
]
my_data

[{'First Name': 'John',
  'Occupation': 'Student',
  'Salary': 120000,
  'volunteer': False},
 {'First Name': 'John',
  'Occupation': 'Student',
  'salary': None,
  'volunteer': True}]

In [36]:
import json


json.dumps(my_data)
# Note the changes between the Python dict and the JSON string

'[{"First Name": "John", "Occupation": "Student", "Salary": 120000, "volunteer": false}, {"First Name": "John", "Occupation": "Student", "salary": null, "volunteer": true}]'
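
The conversion also works in the other direction with `json.loads`. The short sketch below uses a made-up record to show which Python values survive a round trip unchanged and which do not.

```python
import json

# Hypothetical record mixing None, a boolean, and a tuple
record = {"First Name": "John", "Salary": None, "volunteer": True, "scores": (1, 2)}

text = json.dumps(record)
print(text)

restored = json.loads(text)
print(restored)

# The tuple does not survive: JSON only has arrays, which load back as lists
print(restored == record)
```

`None` and `True` become `null` and `true` in the JSON text and come back unchanged, whereas the tuple comes back as the list `[1, 2]`, so the final comparison prints `False`.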

### Working with the Python `json` library


* `All_Time_Worldwide_Box_Office_partial.json`  structure
```json
[
 {
  "Rank": "1",
  "Year": "2009",
  "Movie": "Avatar",
  "WorldwideBox Office": "$2,845,899,541",
  "DomesticBox Office": "$760,507,625",
  "InternationalBox Office": "$2,085,391,916"
 },
 {
  "Rank": "2",
  "Year": "2019",
  "Movie": "Avengers: Endgame",
  "WorldwideBox Office": "$2,797,800,564",
  "DomesticBox Office": "$858,373,000",
  "InternationalBox Office": "$1,939,427,564"
 },
 ...
]
```

In [39]:
import json

# The with block closes the file once it has been parsed
with open('../Data/All_Time_Worldwide_Box_Office_partial.json') as json_file:
    movies_data = json.load(json_file)
movies_data


[{'Rank': '1',
  'Year': '2009',
  'Movie': 'Avatar',
  'WorldwideBox Office': '$2,845,899,541',
  'DomesticBox Office': '$760,507,625',
  'InternationalBox Office': '$2,085,391,916'},
 {'Rank': '2',
  'Year': '2019',
  'Movie': 'Avengers: Endgame',
  'WorldwideBox Office': '$2,797,800,564',
  'DomesticBox Office': '$858,373,000',
  'InternationalBox Office': '$1,939,427,564'},
 {'Rank': '3',
  'Year': '1997',
  'Movie': 'Titanic',
  'WorldwideBox Office': '$2,207,986,545',
  'DomesticBox Office': '$659,363,944',
  'InternationalBox Office': '$1,548,622,601'},
 {'Rank': '4',
  'Year': '2015',
  'Movie': 'Star Wars Ep. VII: The Force Awakens',
  'WorldwideBox Office': '$2,064,615,817',
  'DomesticBox Office': '$936,662,225',
  'InternationalBox Office': '$1,127,953,592'},
 {'Rank': '5',
  'Year': '2018',
  'Movie': 'Avengers: Infinity War',
  'WorldwideBox Office': '$2,044,540,523',
  'DomesticBox Office': '$678,815,482',
  'InternationalBox Office': '$1,365,725,041'},
 {'Rank': '6',
  

In [40]:
type(movies_data)

list

In [41]:
type(movies_data[0])

dict

In [46]:
for record in movies_data:
    print(f"The movie {record['Movie']}, grossed {record['WorldwideBox Office']} in {record['Year']}")

The movie Avatar, grossed $2,845,899,541 in 2009
The movie Avengers: Endgame, grossed $2,797,800,564 in 2019
The movie Titanic, grossed $2,207,986,545 in 1997
The movie Star Wars Ep. VII: The Force Awakens, grossed $2,064,615,817 in 2015
The movie Avengers: Infinity War, grossed $2,044,540,523 in 2018
The movie Jurassic World, grossed $1,669,979,967 in 2015
The movie The Lion King, grossed $1,654,367,425 in 2019
The movie Furious 7, grossed $1,516,881,526 in 2015
The movie The Avengers, grossed $1,515,100,211 in 2012
The movie Frozen II, grossed $1,446,925,396 in 2019


### JSON Pros and Cons

* Pros:
    * Very well supported in modern languages, technologies, and infrastructures
    * Can be used as the basis for more performance-optimized formats such as Parquet or Avro (discussed next)
    * Supports hierarchical structures, removing the need for complex relationships
    * The *de facto* standard in NoSQL databases
* Cons:
    * Much smaller footprint than XML, but still fairly large due to repeated field names
    * Difficult to split without loading into memory first
    * Not easy to index
    * Some attempts have been made to add a schema, but they are not commonly used
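
The cost of repeated field names can be estimated with a quick experiment. The records below are made up for illustration; the same values are serialized once as a list of objects and once with each field name written a single time.

```python
import json

# Hypothetical data: 1000 small records sharing the same three fields
records = [{"Rank": i, "Year": 2000 + i % 20, "Movie": f"Movie {i}"} for i in range(1000)]

# Same values, but each field name appears only once (column-oriented)
columns = {key: [r[key] for r in records] for key in records[0]}

as_records = json.dumps(records)
as_columns = json.dumps(columns)
print(len(as_records), len(as_columns))
```

The record-oriented text is longer because `"Rank"`, `"Year"`, and `"Movie"` are spelled out 1000 times each.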


### AVRO File Format

* The AVRO format can be seen as a more advanced relative of the JSON format
    * Leverages some of the advantages of JSON while mitigating some of its disadvantages
* Uses a JSON definition (schema) and description in addition to the data, which is stored without the repeated field names.
  * Said to be self-describing because the schema and documentation can be included in the header of the file containing the data
  * Is row-oriented; each entry is an instance of the data

* Released by the Hadoop working group in 2009 for use with Hadoop systems
* It is a row-based format that is highly splittable
* Provides a mechanism to manage schema evolution

* Python needs a library that understands the binary format used.
  * Libraries are available for most modern languages, including Python
  * We will use the `avro` library in Python

### AVRO Pros and Cons

* Pros:
    * Binary data minimizes file size and maximizes efficiency.
    * Avro has reliable support for schema evolution by managing added, missing, and changed fields.
    * This allows old software to read new data, and new software to read old data; it is a critical feature if your data can change.

* Cons:
    * Data is not human-readable

In [47]:
!pip install avro

Collecting avro
  Downloading avro-1.10.2.tar.gz (68 kB)
     |████████████████████████████████| 68 kB 811 kB/s eta 0:00:01
Building wheels for collected packages: avro
  Building wheel for avro (setup.py) ... done
  Created wheel for avro: filename=avro-1.10.2-py3-none-any.whl size=96830 sha256=f956e98a2543b7c14c1ad70713c410a1d0e9aafb0bfdd61c27743181f24b73e5
  Stored in directory: /Users/mahdi/Library/Caches/pip/wheels/66/b5/b3/185a0da0ecbc3e902e24d1e2fa415db0c7342d6e3633c49d30
Successfully built avro
Installing collected packages: avro
Successfully installed avro-1.10.2


In [None]:
# ADD CODE HERE


### PARQUET Format


* Parquet was developed by Twitter and Cloudera as a columnar data store
* Parquet is especially useful with wide datasets (datasets with many columns)

* Optimized for reading, and therefore ideal for read-intensive workloads
* Parquet was also designed to support partitioning on column values
    * Splitting the data based on value similarity
    * E.g.: split on the values of MONTH in a transactions dataset
    * Splits can be nested by splitting on a second attribute. This results in a nested folder hierarchy,
        making reading a subset of keys very easy
      
```      
    MONTH=JANUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..
        
    MONTH=FEBRUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..

    ...
      
```  
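
The folder layout above can be imitated with the standard library alone. This sketch writes small CSV files rather than real Parquet files, and the rows are made up, but it shows why reading one month only requires visiting that month's folders.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical transactions: (month, city, total)
rows = [
    ("JANUARY", "HONOLULU", 1852.14),
    ("JANUARY", "MONTREAL", 968.00),
    ("FEBRUARY", "NY", 25.50),
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Write each row under MONTH=<value>/CITY=<value>/, mirroring the tree above
    for month, city, total in rows:
        folder = root / f"MONTH={month}" / f"CITY={city}"
        folder.mkdir(parents=True, exist_ok=True)
        with open(folder / "data.csv", "a", newline="") as f:
            csv.writer(f).writerow([total])

    # Reading one month only touches that month's folders
    january_files = sorted(root.glob("MONTH=JANUARY/*/data.csv"))
    january_cities = [p.parent.name for p in january_files]

print(january_cities)  # ['CITY=HONOLULU', 'CITY=MONTREAL']
```

The February data is never opened, which is the benefit the nested hierarchy is meant to provide.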

https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce
 
 

### PARQUET PROS and CONS

Pros:
 * Highly compressible since data is stored column-wise (compression rates up to 75%)
   * Can use different compression algorithms for different datatypes
 * Seamless splittability across columns.
 * Optimized for reading data and ideal for read-intensive tasks
   * Can use parallelization to read different columns.
 * Data is self-describing, i.e., the schema is included with the data

Cons:
 * Very slow at writing data, so not well suited to write-intensive applications
 * Does not support in-place updates, as Parquet files are immutable.
