### Popular Big Data Formats

* Data format is an important aspect of working with big data

* The recurring theme is "There ain't no such thing as a free lunch"

```"There ain't no such thing as a free lunch" (TANSTAAFL), also known as "there is no such thing as a free lunch" (TINSTAAFL), is an expression that describes the cost of decision-making and consumption. The expression conveys the idea that things appearing free always have some cost paid by somebody, or that nothing in life is truly free. ``` **https://www.investopedia.com/terms/t/tanstaafl.asp**


### Popular Big Data Formats

* Some of the issues that arise when working with data formats are:

1. Compression
  * Not all file formats are equally compressible with the same algorithm.
2. Splittability
  * How splittable is a file format?
  * Being able to split a file and process it across multiple machines can be critical in some instances
3. Columnar and row-wise data formats
  * The need to compute column-based statistics can force the adoption of a format that makes it easier to extract column data
  * How different data formats affect the wrangling of big data
4. Data Types and Schema Evolution.
  * Do we need to enforce data types?
  * Will my data format change over time?

### File Format: a Quick intuition

* In big data, the right storage format is paramount for achieving performance, saving space and making certain operations possible.

* The right format can save time and cost, and improve computation time.

* We're accustomed to row-based formats

  * Like an MS Excel sheet, where each row is a table entry

| Transaction Date     | Nb Items     | Total       |
|------------------    |----------    |---------    |
| 01/01/2001           | 4            | 1852.14     |
| 01/01/2001           | 3            | 968.00      |
| `...`             | `...`     | `...`     |

### File Format: a Quick intuition - Cont'd
    
* This format may be inappropriate for certain types of data or operations

* Imagine that the sales info above contains hundreds of millions of transactions, with hundreds of thousands of transactions each day
    * The same transaction dates will be unnecessarily duplicated hundreds of thousands of times.
    * Perhaps a dictionary-like format, where the key is the date, would help save on storage
 
```python
# notice the two ellipses.
{"01/01/2001": ((4, 1852.14), (3,  968.00), ...), ...}
```
  * This layout also makes per-day operations more efficient to compute
   * E.g. count the number of transactions or the total sales per day
   * What about computing the running sales total?
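
The date-keyed layout can be sketched in a few lines of Python. This is only a minimal illustration: the `rows` list is a hypothetical stand-in for the sales file, and the variable names are ours.

```python
from collections import defaultdict

# Hypothetical sample rows: (transaction date, number of items, total)
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
    ("01/02/2001", 1, 20.50),
]

# Group the rows by date so that each date is stored only once
by_date = defaultdict(list)
for day, nb_items, total in rows:
    by_date[day].append((nb_items, total))

# Per-day statistics only need the entries stored under that key
transactions_per_day = {day: len(entries) for day, entries in by_date.items()}
sales_per_day = {day: sum(total for _, total in entries) for day, entries in by_date.items()}

print(transactions_per_day)
print(sales_per_day)
```

A running total over all transactions still has to visit every entry, so this layout does not help with that question.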

### File Format: a Quick intuition - Cont'd

* Imagine that the objective is to compute the total sales
  * We need to read millions of lines to compute a single value
* Perhaps we can store each column as a single line instead. Reading one line is then sufficient to compute the total.

|              |      |  |   |
| :---              |    :----:  | :--------: |:---:|
| **Totals**             | 1852.14    | 968.00     | `...` |
| **Transaction Dates** | 01/01/2001 | 01/01/2001 | `...` |
| **Nb Items**               | 4             | 3          | `...` |




In [7]:
import random
random.randint(0,40)

23

In [9]:
import random
from datetime import date

totals = []
transaction_dates = []
nb_items = []

for i in range(10):
    totals.append(random.uniform(10, 10000))
    transaction_dates.append(str(date.today()))
    nb_items.append(random.randint(0,40))

print(f"Totals are: {totals}")
print(f"The transaction : {transaction_dates}")
print(f"The number items: {nb_items}")

Totals are: [8410.51478950048, 270.75336741050046, 4681.8751930660665, 2213.7419153776336, 364.7501964370208, 2971.793305945331, 9781.556556740885, 273.1911617646127, 5446.13524865059, 3727.1516001030936]
The transaction : ['2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08', '2021-09-08']
The number items: [2, 23, 24, 6, 21, 40, 9, 24, 4, 0]


In [12]:
sum(totals)

38141.46333499621

In [11]:
list(zip(transaction_dates, nb_items, totals))


[('2021-09-08', 2, 8410.51478950048),
 ('2021-09-08', 23, 270.75336741050046),
 ('2021-09-08', 24, 4681.8751930660665),
 ('2021-09-08', 6, 2213.7419153776336),
 ('2021-09-08', 21, 364.7501964370208),
 ('2021-09-08', 40, 2971.793305945331),
 ('2021-09-08', 9, 9781.556556740885),
 ('2021-09-08', 24, 273.1911617646127),
 ('2021-09-08', 4, 5446.13524865059),
 ('2021-09-08', 0, 3727.1516001030936)]

### Decisions for File Formats

* There are 4 considerations when selecting file formats:
    1. Row vs Column
    2. Schema Management
    3. Splittability
    4. Compression


### 1. Row- and Column-Based Formats


* An important consideration when selecting a big data format

* Row-based: Ideal when using all the data
  * For example, building a machine learning model that requires all the features and all the instances
    * Avoid reading the whole dataset into RAM by loading chunks at a time

* Column-based storage: useful when performing operations on a subset of columns
  * Computing total sales, or computing a total aggregated by date, etc.
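
Loading a row-based file in chunks can be sketched with the standard library. This is a minimal example, not a prescribed method: `read_in_chunks` is a hypothetical helper, and an in-memory `io.StringIO` stands in for a large CSV file on disk.

```python
import csv
import io
from itertools import islice

def read_in_chunks(file_obj, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each."""
    reader = csv.reader(file_obj)
    header = next(reader)  # keep the header out of the data chunks
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield header, chunk

# Hypothetical in-memory file standing in for a large CSV on disk
sample = io.StringIO("date,nb_items,total\n" + "01/01/2001,4,1852.14\n" * 5)

chunk_sizes = [len(chunk) for _, chunk in read_in_chunks(sample, chunk_size=2)]
print(chunk_sizes)
```

Only one chunk is held in memory at a time, which is why this pattern suits row-based files: every chunk is a set of complete records.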


### Row-Based Formats

* Simplest form of data
* Used in most mainstream applications, from web log files to highly-structured database systems like MySQL and Oracle.

* Processing all the data would require reading all inputs line by line

* This is commonly used for Online Transactional Processing (OLTP).
  * OLTP systems usually process CRUD queries (Create, Read, Update and Delete) at a record level.
  * The main emphasis for OLTP systems is maintaining data integrity in multi-access environments
* Effectiveness measured by the number of transactions per second
   * More on this when we discuss big data platforms

### Column Based Formats


* Data is grouped by columns
* Easy to focus computation on specific columns of data
  * E.g., searching for the largest value in a column is easier since the data is stored sequentially by column.
 
* Ideal for compression
  * Compression codecs (e.g., GZIP, PKZIP, etc.) achieve a higher compression ratio when compressing sequences of similar data.
  
```[1,2,...], ["John", "Janet"], ["Doe", "Smith",...], ["125,000", "195,129", ...]```

* Compressing the column-wise layout is typically much more efficient than compressing the row-wise layout:

```[[1, "John", "Doe", "125,000"], [2, "Janet", "Smith", "195,129"], ...]```

* This way of processing data is usually called OLAP (OnLine Analytical Processing)   
 * OLAP is an approach designed to quickly answer analytics queries involving multiple dimensions (features)
   




### Compression of Row vs. Columnar Data

  * Let's do an experiment
    * We'll perform such small experiments a lot to get a feel for Python, theory and the concepts covered.
    
  * Typically, the slowest components in large distributed systems are the disk and network
      * Using compression reduces read IO and transfers, thus speeding up the analysis.


In [27]:
import random
random.choices([1,2,3,4], k=6)

[1, 4, 4, 1, 1, 3]

In [28]:
import random
random.choices("ACGT", k=6)

['T', 'C', 'T', 'T', 'A', 'G']

In [14]:
import string 

print(string.printable)
print(string.digits)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	

0123456789


In [19]:
import zlib 
import string

# let's randomly generate two sequences of 10,000 characters: one of printable ASCII, one of digits

random_ASCII = random.choices(string.printable, k=10_000)
random_numbers = random.choices(string.digits, k=10_000)


In [20]:
print(len(zlib.compress( str.encode("".join(random_ASCII)))))
print(len(zlib.compress( str.encode("".join(random_numbers)))))

8412
5085
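
The experiment above compares two alphabets rather than two layouts. A complementary sketch is to serialize the same records row-wise and column-wise and compress both. The records here are synthetic and hypothetical (an id, a city from a small set, an age), and JSON is used only as a convenient text serialization.

```python
import json
import random
import zlib

random.seed(0)  # fixed seed so repeated runs give the same sizes

# Hypothetical records: an id, a city drawn from a small set, and an age
cities = ["Honolulu", "Montreal", "New York"]
records = [[i, random.choice(cities), random.randint(18, 30)] for i in range(10_000)]

# Row-wise layout: one list per record
row_bytes = json.dumps(records).encode()

# Column-wise layout: one list per field
columns = [list(column) for column in zip(*records)]
column_bytes = json.dumps(columns).encode()

print("row-wise compressed:   ", len(zlib.compress(row_bytes)))
print("column-wise compressed:", len(zlib.compress(column_bytes)))
```

On data like this we would expect the column-wise layout to come out smaller, but the exact numbers depend on the data and the codec, so it is worth running the cell rather than trusting the intuition.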


### Column-based formats: Advantages and Disadvantages


<u>Advantages</u>:
* Columnar storage can sometimes yield a 100x-1000x performance improvement, specifically on wide data
    * Wide data = data with a very large number of columns (or more columns than observations)

<u>Disadvantages</u>:
  * Hard to read by a human
  * Can be more CPU intensive to write for very large data.
    * Need to collect the data for each column before writing it to file
  * Difficult to access a single instance (one entry across all columns)
   * Need to parse all columns to position $i$
  * Not efficient with CRUD operations

### 2- Datatype and Schema Enforcement and Evolution

* "Schema" in a database context, means the structure and organization of the data  
    * Structure: datatypes, missing values, primary keys, indices, etc.
    * Organization: relationships across tables.

* Here, we mostly refer to the data type

* In a text format (e.g., a table with values separated by spaces), the datatype cannot be declared or enforced

* Declaring the type of a value provides some advantages.
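  * Storage requirements: a boolean stored as the text `True` or `False` takes 4-5 bytes, whereas a typed boolean fits in a single byte or less
  * Data validity: values that do not match the declared type can be rejected
  * Compression: knowing the type lets us pick a compression scheme suited to it.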
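
The storage and validity points can be illustrated with the standard `struct` module. This is a small sketch under our own assumptions: the `flags` column is hypothetical, and `struct` stands in for whatever typed binary encoding a real format would use.

```python
import struct

flags = [True, False, True, True]  # hypothetical boolean column

flags_as_text = ",".join(str(f) for f in flags).encode()  # b"True,False,True,True"
flags_as_bytes = struct.pack(f"{len(flags)}?", *flags)    # one byte per boolean

print(len(flags_as_text), "bytes as text")
print(len(flags_as_bytes), "bytes as typed booleans")

# A declared type also rejects invalid values at write time
try:
    struct.pack("i", "not a number")
    rejected = False
except struct.error:
    rejected = True
print("invalid value rejected:", rejected)
```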

### 2- Datatype & Schema Enforcement and Evolution - Cont'd


* If there is no guarantee that the structure of your data will stay the same in the future, you need to consider schema evolution.


* When evaluating schema evolution, there are a few key questions to ask of any data format:
  * How easy is it to update a schema (such as adding a field, removing or renaming a field)?
  * How will different versions of the schema “talk” to each other?
  * How fast can the schema be processed?
  * How does it impact the size of data?
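
The question of how schema versions "talk" to each other can be made concrete with a toy example. This sketch is not how any particular format implements schema evolution: `SCHEMA_V1`, `SCHEMA_V2`, and `read_record` are hypothetical names, and a schema is reduced to a mapping from field name to default value.

```python
# Hypothetical schema versions: field name -> default value
SCHEMA_V1 = {"date": None, "total": 0.0}
SCHEMA_V2 = {"date": None, "total": 0.0, "currency": "USD"}  # v2 adds a field

def read_record(record, reader_schema):
    """Project a record onto the reader's schema.

    Fields the reader does not know are dropped; fields missing from the
    record are filled with the reader's default.
    """
    return {field: record.get(field, default) for field, default in reader_schema.items()}

old_record = {"date": "01/01/2001", "total": 1852.14}
new_record = {"date": "01/02/2001", "total": 20.5, "currency": "CAD"}

# New software reading old data: the missing field gets its default
print(read_record(old_record, SCHEMA_V2))
# Old software reading new data: the unknown field is ignored
print(read_record(new_record, SCHEMA_V1))
```

Adding a field with a default is the easy case. Renaming or removing a field needs extra rules, which is where formats differ.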



### 3- Splittability

* Big data can often comprise many millions of records.
  * Think, for instance, of monthly logs, yearly transactions, or daily airplane sensor recordings

* It is often useful to split the data across multiple machines and run the computation on each part separately

* Some file formats are more amenable to splitting than others.



### 3- Splittability - Row-based

Row-based formats can be split along row boundaries

```
# file 1  with n lines
01/01/2001           4            1852.14
01/01/2001           3            968.00
...
```

* Splitting is straightforward
  * Randomly splitting `file 1` with `n` observations across `m` total machines is easy.

    * Each machine gets `ceiling(n/m)` unique lines; the last machine gets the remaining lines

 * To group items across one or more fields:
 
    * Partitioning over particular column values can be difficult if data is stored in a random order.
    * May require sorting the data first
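
The `ceiling(n/m)` rule can be sketched as follows. `split_rows` is a hypothetical helper, and a list of strings stands in for the lines of `file 1`.

```python
import math

def split_rows(lines, m):
    """Split lines into at most m chunks of ceiling(n / m) lines each."""
    chunk_size = max(1, math.ceil(len(lines) / m))
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

# Hypothetical file with n = 10 lines, split across m = 3 machines
lines = [f"line {i}" for i in range(10)]
chunks = split_rows(lines, m=3)

print([len(chunk) for chunk in chunks])
```

Each chunk holds complete rows, so every machine can start working without knowing anything about the other chunks.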

### 3- Splittability - Row-based, nested

* Some file formats are more amenable to splitting than others.


``` 
file 2
{"01/01/2001": [(4, 1852.14), (3, 968.00)], ....}
```

* You cannot easily split this file format without parsing the file first.
  * Need to read the complete file to split it into chunks.
    * Data may need to be loaded in RAM first.


### 3- Splittability: Column-based, nested


* A column-based format can be split if the computation is column-specific.

```
# file 3
date: 01/01/2001, 01/01/2001
nb_items: 4, 3
totals: 1852.14, 968.00
```

Splitting can only be done column-wise:
* In the example above, each machine is concerned with a computation on a specific variable. For example:
  * Machine 1 takes `date` data and computes the number of sales per month
  * Machine 2 takes the `nb_items` data and computes the total number of sales
  * Machine 3 takes the `totals` data and computes the total sales values
 
* Machines don't have any knowledge of variables that are not given.
  * E.g. machine three is not given date info and cannot compute, for example, the monthly or weekly sales average.


### 4- Compression


* When working on a distributed system, data transfer can be a serious bottleneck
* Compression can substantially improve runtime and storage requirements

* We illustrated "naively" that columnar data can achieve better compression rates than row-based data
  * Simple way to think about it: a column will have a lot more duplicate values:
      * Ex. Age Column: 21, 22, 21, 24, 25, 21, 22, 21, 19, 21, 21, 22, ....
      
* Note that complex compression algorithms on very large files can save space but substantially increase compute time.
    * Decompression/re-compression needs to occur every time you need to access the data.
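
The space versus time trade-off can be observed directly with `zlib` compression levels. This is a rough sketch on a synthetic, hypothetical payload. Timings vary from machine to machine and the size differences depend on the data.

```python
import random
import time
import zlib

random.seed(0)  # fixed seed so repeated runs compress the same bytes

# Hypothetical payload: 100,000 synthetic "date,nb_items,total" lines
lines = [
    f"01/{random.randint(1, 28):02d}/2001,{random.randint(0, 40)},{random.uniform(10, 10000):.2f}"
    for _ in range(100_000)
]
payload = "\n".join(lines).encode()

sizes = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    sizes[level] = len(compressed)
    print(f"level {level}: {len(compressed):>9,} bytes in {elapsed:.3f} s")

# Whatever the level, reading the data back costs a decompression step
restored = zlib.decompress(compressed)
```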


### Standardization and File Formats

* Naturally, one can choose their own file format
  * Many companies may choose to do so internally for many reasons.
  * E.g.:

```
FIRST_NAME_1\sLAST_NAME_1\tFIRST_NAME_2\sLAST_NAME_2\tFIRST_NAME_3\sLAST_NAME_3...
JOBTITLE_1\sSALARY_1\tJOBTITLE_2\sSALARY_2\tJOBTITLE_3\sSALARY_3
```

* However, there are many benefits to using a standard file format. E.g.:
  * Clarity: eliminating the need for guesswork or extra searching
  * Quality: standard formats are designed by large teams and used extensively, which provides opportunities to find and correct bugs
  * Productivity: no need to maintain internal documentation, and it is easier to get answers online when issues arise.
  * Interoperability: your data is no longer locked to your company. It can be used across platforms.

* Some of the most used formats are CSV, JSON, Parquet, AVRO, and HDF5
  * Very well supported in Python



### CSV File Format

* Files in the CSV (Comma-separated values) format are usually used to exchange tabular data
  * Plain-text file (readable characters)
 
* CSV is a row-based file format: each row of the file is a separate data instance
  * May or may not contain a header
* Structure is conveyed through explicit commas
  * Fields that contain commas are enclosed in double quotes

```
Title,Author,Genre,Height,Publisher
"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson
Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress
Making Software,"Oram, Andy",computer_science,232,O'Reilly
...
```
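
The quoting rule can be checked with Python's `csv` module. This minimal sketch writes a few of the rows above to an in-memory buffer (an assumption made to avoid touching the disk) and reads them back.

```python
import csv
import io

rows = [
    ["Title", "Author", "Genre", "Height", "Publisher"],
    ["Computer Vision, A Modern Approach", "Forsyth, David", "data_science", 255, "Pearson"],
    ["Making Software", "Oram, Andy", "computer_science", 232, "O'Reilly"],
]

# The default quoting only quotes fields that contain the delimiter,
# the quote character, or a line break
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original fields, commas included
parsed = list(csv.reader(io.StringIO(text)))
print(parsed[1])
```

Note that the height `255` comes back as the string `'255'`: CSV carries no type information.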

### CSV File Format

* CSV format is not fully standardized
  * Fields may be separated by other characters such as tabs (TSV) or spaces (SSV)
 
* Data connections are usually established using multiple CSV files.
   * Uses foreign keys (specific columns) across files
   * Connection not expressed in the file format
 
* Data structure is conveyed through redundant values across files

* Native support in Python
```python
import csv
csv.reader(csvfile, delimiter=',', quotechar='"')
# use csv ...
```



In [25]:
# All_Time_Worldwide_Box_Office_partial.csv
import csv

# newline='' is what the csv module recommends when opening files
with open('data/All_Time_Worldwide_Box_Office.csv', newline='') as csvfile:
    movies_file = csv.reader(csvfile, delimiter=',', quotechar='"')
    # print only the first 10 lines
    for i, line in enumerate(movies_file):
        if i == 10:
            break
        print(f"Line {i}: {line}")
    

Line 0: ['Rank', 'Year', 'Movie', 'WorldwideBox Office', 'DomesticBox Office', 'InternationalBox Office']
Line 1: ['1', '2009', 'Avatar', '$2,845,899,541', '$760,507,625', '$2,085,391,916']
Line 2: ['2', '2019', 'Avengers: Endgame', '$2,797,800,564', '$858,373,000', '$1,939,427,564']
Line 3: ['3', '1997', 'Titanic', '$2,207,986,545', '$659,363,944', '$1,548,622,601']
Line 4: ['4', '2015', 'Star Wars Ep. VII: The Force Awakens', '$2,064,615,817', '$936,662,225', '$1,127,953,592']
Line 5: ['5', '2018', 'Avengers: Infinity War', '$2,044,540,523', '$678,815,482', '$1,365,725,041']
Line 6: ['6', '2015', 'Jurassic World', '$1,669,979,967', '$652,306,625', '$1,017,673,342']
Line 7: ['7', '2019', 'The Lion King', '$1,654,367,425', '$543,638,043', '$1,110,729,382']
Line 8: ['8', '2015', 'Furious 7', '$1,516,881,526', '$353,007,020', '$1,163,874,506']
Line 9: ['9', '2012', 'The Avengers', '$1,515,100,211', '$623,357,910', '$891,742,301']


In [27]:
# All_Time_Worldwide_Box_Office_partial.csv
import csv

# DictReader uses the first line as field names for every following row
with open('data/All_Time_Worldwide_Box_Office.csv', newline='') as csvfile:
    movies_file = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    # print only the first 10 records
    for i, line in enumerate(movies_file):
        if i == 10:
            break
        print(f"Line {i}: {line}")

Line 0: {'Rank': '1', 'Year': '2009', 'Movie': 'Avatar', 'WorldwideBox Office': '$2,845,899,541', 'DomesticBox Office': '$760,507,625', 'InternationalBox Office': '$2,085,391,916'}
Line 1: {'Rank': '2', 'Year': '2019', 'Movie': 'Avengers: Endgame', 'WorldwideBox Office': '$2,797,800,564', 'DomesticBox Office': '$858,373,000', 'InternationalBox Office': '$1,939,427,564'}
Line 2: {'Rank': '3', 'Year': '1997', 'Movie': 'Titanic', 'WorldwideBox Office': '$2,207,986,545', 'DomesticBox Office': '$659,363,944', 'InternationalBox Office': '$1,548,622,601'}
Line 3: {'Rank': '4', 'Year': '2015', 'Movie': 'Star Wars Ep. VII: The Force Awakens', 'WorldwideBox Office': '$2,064,615,817', 'DomesticBox Office': '$936,662,225', 'InternationalBox Office': '$1,127,953,592'}
Line 4: {'Rank': '5', 'Year': '2018', 'Movie': 'Avengers: Infinity War', 'WorldwideBox Office': '$2,044,540,523', 'DomesticBox Office': '$678,815,482', 'InternationalBox Office': '$1,365,725,041'}
Line 5: {'Rank': '6', 'Year': '2015',

### CSV Pros and Cons
<u>Pros:</u>
* CSV is human-readable and easy to edit manually
* CSV provides a simple scheme
* CSV can be processed by almost all existing applications
* CSV is easy to implement and parse
* CSV is compact (compared to, for instance, JSON or XML)
* Column headers are written only once

<u>Cons:</u>
* No guarantees that data won't be missing or won't be in a different format.
* No way to implement complex data structures
  * May need to reference other files to implement nesting
* There is no standard way to represent binary data
* Poor support for special characters
* Lack of a universal standard
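
Because CSV stores everything as text, types have to be restored by the reader. A minimal sketch, using values copied from the box-office rows shown earlier; `parse_dollars` is a hypothetical helper of ours.

```python
def parse_dollars(text):
    """Convert a string such as '$2,845,899,541' to an int."""
    return int(text.replace("$", "").replace(",", ""))

# One record as returned by csv.DictReader: every value is a string
row = {"Rank": "1", "Year": "2009", "Movie": "Avatar", "WorldwideBox Office": "$2,845,899,541"}

typed_row = {
    "Rank": int(row["Rank"]),
    "Year": int(row["Year"]),
    "Movie": row["Movie"],
    "WorldwideBox Office": parse_dollars(row["WorldwideBox Office"]),
}
print(typed_row)
```

Nothing in the file guarantees that every row follows this pattern, so a real loader would also need to handle missing or malformed values.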

### JSON File Format

* JSON (JavaScript Object Notation)

* Open standard file format that uses human-readable text
  * File typically stored using the `.json` extension.

* Became popular as a space-saving alternative to XML

* Inspired by JavaScript objects but is a language-independent data format.

* Very similar to the combination of Python's lists and dicts

* Also supported natively in Python
  ```python
  import json
  json.load(...)
  ```
* The de facto language of the web
  * Supported in all modern languages and particularly web languages.

### JSON File Structure

* A JSON value must be one of the following types.

* Scalar values

    * `Number`: e.g. 3
    
    * `String`: Sequence of Unicode characters surrounded by double quotation marks.
    
    * `Boolean`: `true` or `false`.

    * `null`: the absence of a value.

* Collections:

    * `Array`: A list of values surrounded by square brackets `[]`
    * `Object` (dictionary): `key: value` pairs separated by commas (,) and enclosed in `{}`
      *  Keys are strings. Values can be any valid scalar or collection

* See the following for more details: https://docs.fileformat.com/web/json/
* See the following validator, which is useful for checking JSON files or records: https://jsonformatter.curiousconcept.com/#

In [15]:
my_data = [ 
    {'First Name': "John", "Occupation": "Student", "Salary": 120_000, "volunteer": False}, 
    {'First Name': "John", "Occupation": "Student", "salary": None, "volunteer": True}
]
my_data

[{'First Name': 'John',
  'Occupation': 'Student',
  'Salary': 120000,
  'volunteer': False},
 {'First Name': 'John',
  'Occupation': 'Student',
  'salary': None,
  'volunteer': True}]

In [None]:
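# json.dump(obj, fp)  : write obj as JSON to an open file
# json.dumps(obj)     : return obj as a JSON string
# json.load(fp)       : read JSON from an open file
# json.loads(s)       : parse a JSON string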

In [30]:
import json


json_representation = json.dumps(my_data)
print(json_representation)
# Note the changes between the Python dict and the JSON string

[{"First Name": "John", "Occupation": "Student", "Salary": 120000, "volunteer": false}, {"First Name": "John", "Occupation": "Student", "salary": null, "volunteer": true}]
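
Going the other way, `json.loads` turns the string back into Python objects. The round trip below is a small sketch with a hypothetical `record`. It shows which Python types survive and which do not.

```python
import json

record = {"name": "John", "scores": (90, 85), "volunteer": True, "salary": None}

text = json.dumps(record)
restored = json.loads(text)

print(text)
print(restored)
# The tuple does not survive the round trip: JSON only has arrays
print(type(record["scores"]), type(restored["scores"]))
```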


### Working with the Python `json` library


* `All_Time_Worldwide_Box_Office_partial.json`  structure
```json
[
 {
  "Rank": "1",
  "Year": "2009",
  "Movie": "Avatar",
  "WorldwideBox Office": "$2,845,899,541",
  "DomesticBox Office": "$760,507,625",
  "InternationalBox Office": "$2,085,391,916"
 },
 {
  "Rank": "2",
  "Year": "2019",
  "Movie": "Avengers: Endgame",
  "WorldwideBox Office": "$2,797,800,564",
  "DomesticBox Office": "$858,373,000",
  "InternationalBox Office": "$1,939,427,564"
 },
 ...
]
```

In [16]:
import json

# the context manager closes the file once the data is loaded
with open('data/All_Time_Worldwide_Box_Office_partial.json') as json_file:
    movies_data = json.load(json_file)
movies_data[0:3]


[{'Rank': '1',
  'Year': '2009',
  'Movie': 'Avatar',
  'WorldwideBox Office': '$2,845,899,541',
  'DomesticBox Office': '$760,507,625',
  'InternationalBox Office': '$2,085,391,916'},
 {'Rank': '2',
  'Year': '2019',
  'Movie': 'Avengers: Endgame',
  'WorldwideBox Office': '$2,797,800,564',
  'DomesticBox Office': '$858,373,000',
  'InternationalBox Office': '$1,939,427,564'},
 {'Rank': '3',
  'Year': '1997',
  'Movie': 'Titanic',
  'WorldwideBox Office': '$2,207,986,545',
  'DomesticBox Office': '$659,363,944',
  'InternationalBox Office': '$1,548,622,601'}]

In [40]:
type(movies_data)

list

In [41]:
type(movies_data[0])

dict

In [46]:
for record in movies_data:
    print(f"The movie {record['Movie']}, grossed {record['WorldwideBox Office']} in {record['Year']}")

The movie Avatar, grossed $2,845,899,541 in 2009
The movie Avengers: Endgame, grossed $2,797,800,564 in 2019
The movie Titanic, grossed $2,207,986,545 in 1997
The movie Star Wars Ep. VII: The Force Awakens, grossed $2,064,615,817 in 2015
The movie Avengers: Infinity War, grossed $2,044,540,523 in 2018
The movie Jurassic World, grossed $1,669,979,967 in 2015
The movie The Lion King, grossed $1,654,367,425 in 2019
The movie Furious 7, grossed $1,516,881,526 in 2015
The movie The Avengers, grossed $1,515,100,211 in 2012
The movie Frozen II, grossed $1,446,925,396 in 2019


### JSON Pros and Cons

* Pros:
    * Very well supported in modern languages, technologies and infrastructures
    * Can be used as the basis for more performance-optimized formats such as Parquet or Avro (discussed next)
    * Supports hierarchical structures, removing the need for complex relationships across files
    * The *de facto* standard in NoSQL databases
* Cons:
    * Much smaller footprint than XML but still fairly large due to repeated field names
    * Difficult to split without loading into memory first
    * Not easy to index
    * There have been attempts to add a schema (e.g., JSON Schema), but they are not commonly used
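
The cost of repeated field names is easy to measure. This sketch serializes the same hypothetical records as JSON and as CSV, where the field names appear only once, and compares the sizes.

```python
import csv
import io
import json

# Hypothetical records with the same fields in every entry
records = [{"Rank": i, "Year": 2000 + i % 20, "Movie": f"Movie {i}"} for i in range(1, 1001)]

json_text = json.dumps(records)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Rank", "Year", "Movie"])
writer.writeheader()       # field names are written once
writer.writerows(records)
csv_text = buffer.getvalue()

print("JSON characters:", len(json_text))
print("CSV characters: ", len(csv_text))
```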

### AVRO File Format

* AVRO is a binary format that builds on JSON
    * Leverages some of the advantages of JSON while mitigating some of its disadvantages
* Uses a JSON definition (schema) to describe the data, which is then stored without the repeated field names.
  * Said to be self-describing because the schema and documentation are included in the header of the file containing the data
  * Is row-oriented; each entry is an instance of the data

* Released by the Hadoop working group in 2009 to use with Hadoop Systems
* It is a row-based format that is highly splittable
* Provides a mechanism to manage schema evolution

* Python needs a library that understands the binary format used.
  * Libraries exist for most modern languages, including Python
  * We will use the `avro` library in Python

### AVRO Pros and Cons

* Pros:
    * Binary data minimizes file size and maximizes efficiency
    * Avro has reliable support for schema evolution
      * Supports new, missing, or changed fields.
      * This allows old software to read new data, and new software to read old data
      * It is a critical feature if your data can change.
* Cons:
    * Data is not human readable


In [17]:
# You can install from Jupyter Notebook
!pip install avro



In [None]:
# ADD CODE HERE


### PARQUET Format


* Parquet was developed by Twitter and Cloudera as a columnar data store
* Parquet is especially useful with wide datasets (datasets with many columns)

* Optimized for reading and is therefore ideal for read-intensive workloads
* Parquet was also designed to support columnar partitions
    * Splitting the data based on value similarity, which results in a folder hierarchy
      * E.g.: split on the similar values of the month or department
    * Splits can be nested by splitting on a second attribute.
      * Will result in a nested folder hierarchy
     
```      
    MONTH=JANUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..
        
    MONTH=FEBRUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..

    ...
      
```  
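
The folder hierarchy above can be mimicked with the standard library. This is only a sketch of the layout: CSV files stand in for the Parquet data files to keep the example dependency-free, the rows are hypothetical, and everything is written to a temporary directory.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows: (month, city, total)
rows = [
    ("JANUARY", "HONOLULU", 1852.14),
    ("JANUARY", "MONTREAL", 968.00),
    ("FEBRUARY", "HONOLULU", 20.50),
]

root = Path(tempfile.mkdtemp())

for month, city, total in rows:
    # One folder level per partitioning column, named COLUMN=VALUE
    folder = root / f"MONTH={month}" / f"CITY={city}"
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "data.csv", "a", newline="") as f:
        csv.writer(f).writerow([total])

partitions = sorted(p.relative_to(root).as_posix() for p in root.rglob("data.csv"))
print(partitions)

# A query on one month only has to open that month's folder
january_files = list((root / "MONTH=JANUARY").rglob("data.csv"))
print(len(january_files))
```

The benefit is that a filter on a partitioning column becomes a choice of folders, so data for other months or cities is never read.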

https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce
 
 

### PARQUET Pros and Cons

* Pros: 
 * Highly compressible since data is stored column-wise (compression rates up to 75%)
   * Can use a different compression algorithm for each datatype
 * Seamless splittability across columns.
 * Optimized for reading data and ideal for read-intensive tasks
   * Can use parallelization to read different columns.
 * Data is self-describing, i.e., the schema is included with the data


* Cons:
 * Slow at writing data, so not well suited to write-intensive applications
 * Does not support in-place updates, as Parquet files are immutable.
