### Learning Objectives

This tutorial is designed to accomplish the following learning objectives:

* Some of the popular data formats
     * Columnar and row-wise formatting of data
     * How different data formats affect the wrangling of big data
     * Pros and cons of different file formats


### File Format: A Quick Intuition

* In big data, the right storage format is paramount for achieving performance, saving space, and making certain operations possible.
* It can save time, reduce cost, and speed up computation.

* We're accustomed to row-based formats
  * An Excel-like file, where each row is a table entry

| Transaction Date 	| Nb Items 	| Total   	|
|------------------	|----------	|---------	|
| 01/01/2001       	| 4        	| 1852.14 	|
| 01/01/2001       	| 3        	| 968.00  	|
| `...`             | `...`     | `...`     |

### File Format: A Quick Intuition - Cont'd
    
* This format may be inappropriate for certain types of data or operations

* Data: Imagine that the sales info above contains hundreds of millions of transactions, with hundreds of thousands of transactions per day
    * The same transaction dates will be unnecessarily duplicated hundreds of thousands of times.
    * Perhaps a dictionary-like format, where the key is the date, would be more appropriate.
 
```python
# notice the two ellipses.
{"01/01/2001": ((4, 1852.14), (3,  968.00), ...), ...}
```
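The dictionary layout can be built from row-wise records in a few lines. This is a minimal sketch, assuming the rows are available as `(date, nb_items, total)` tuples; the names `rows` and `by_date` and the third row are illustrative and not part of the original data.

```python
from collections import defaultdict

# Row-wise records: (transaction date, number of items, total).
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
    ("02/01/2001", 1, 20.50),  # made-up extra row for illustration
]

# Group by date so that each date string is stored only once, as a key.
by_date = defaultdict(list)
for date, nb_items, total in rows:
    by_date[date].append((nb_items, total))

print(dict(by_date))
```

Each date now appears once, however many transactions it has.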

* Operation: Imagine that the objective is to compute the total sales
  * We need to read millions of lines to compute a single value.
  * Perhaps we can store each variable on its own line. Reading a single line is then sufficient to compute the total.

|              |      |  |   |
| :---              |    :----:  | :--------: |:---:|
| **Totals** 	        | 1852.14    | 968.00     | `...` |
| **Transaction Dates** | 01/01/2001 | 01/01/2001 | `...` |
| **Items**       	    | 4        	 | 3	      | `...` |


* Question: Can you think of a scenario where the data format above is not ideal?
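To make the two layouts concrete, here is a small sketch that stores the same two transactions row-wise and column-wise and computes the total sales from each. The variable names and the dictionary-of-lists representation are assumptions chosen for illustration.

```python
# Row-wise layout: one tuple per transaction.
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
]

# Column-wise layout: one sequence per variable.
columns = {
    "transaction_date": [r[0] for r in rows],
    "nb_items": [r[1] for r in rows],
    "total": [r[2] for r in rows],
}

# Row-wise: every record is visited, even though only one field is needed.
total_from_rows = sum(total for _date, _nb_items, total in rows)

# Column-wise: only the "total" column is touched.
total_from_columns = sum(columns["total"])

print(total_from_rows, total_from_columns)
```

Both computations give the same result; the difference is how much data has to be read to get it.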

### File Formats

* There are 4 considerations when selecting file formats:
    * Row vs Column
    * Schema Management
    * Splitability
    * Compression


### 1- Row- and Column-Based Formats


* An important consideration when selecting a big data format

* Row-based: ideal when using all the data
  * Example: building a machine learning model that requires all the features
  * If only a subset is needed, unnecessary data can be removed
    * Can drop data after each read for large datasets.
      * Avoids loading the complete dataset in RAM


* Column-based storage: useful when performing operations on a subset of columns
  * Computing total sales, or totals aggregated by date, etc.
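With a row-based file, dropping unneeded data after each read might look like the sketch below. It assumes a CSV with a header row; the in-memory `io.StringIO` object stands in for a large file on disk, and the column names are hypothetical.

```python
import csv
import io

# Stand-in for a large file on disk; with a real file, use open(path, newline="").
raw = io.StringIO(
    "transaction_date,nb_items,total\n"
    "01/01/2001,4,1852.14\n"
    "01/01/2001,3,968.00\n"
)

running_total = 0.0
for record in csv.DictReader(raw):
    # Each row is read in full, but only the needed field is kept;
    # the rest of the record is discarded before the next read.
    running_total += float(record["total"])

print(running_total)
```

Only one record is held in memory at a time, but every field of every row is still read.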
  


### Row-Based Formats

* Simplest form of data 
* Used in many applications, from web log files to highly structured database systems like MySQL and Oracle.

* A row is a new instance, i.e., a single object containing values for all the variables (or features).

* Processing the data requires reading all inputs line by line


* This is commonly used for Online Transactional Processing (OLTP). 
  * OLTP systems usually process CRUD queries (Create, Read, Update and Delete) at a record level.
  * The main emphasis of OLTP systems is on maintaining data integrity in multi-access environments, with effectiveness measured by the number of transactions per second
   * More on this when we discuss big data platforms



### Column-Based Formats


* Data is grouped by columns
* Easy to focus computation on specific columns of data
  * Ex.: searching for the largest value is easier since data is stored sequentially by column.
  
* Ideal for compression
  * Compression codecs (e.g., GZIP) achieve a higher compression ratio when compressing sequences of similar data.
  * Let's do an experiment
    * We'll do such small experiments a lot, to get a handle on Python.
    
  * Typically, the slowest components in large distributed systems are the disk and the network
      * Using compression reduces read IO and transfers, thus speeding up the analysis.
    
* This way of processing data is usually referred to as OLAP (OnLine Analytical Processing).
  * OLAP is an approach designed to quickly answer analytics queries involving multiple dimensions
    


In [27]:
import random
random.choices([1,2,3,4], k=6)

[1, 4, 4, 1, 1, 3]

In [28]:
import random
random.choices("ACGT", k=6)

['T', 'C', 'T', 'T', 'A', 'G']

In [29]:
import string 
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [30]:
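import random
import string
import zlib

# Randomly generate two sequences of 1000 characters:
# one drawn from all printable ASCII characters, one from digits only.
random_ASCII = random.choices(string.printable, k=1000)
random_numbers = random.choices(string.digits, k=1000)

# Compressed size, in bytes, of each sequence.
# The notebook displays only the value of the last expression (the digits);
# wrap each line in print() to see both sizes.
len(zlib.compress("".join(random_ASCII).encode()))
len(zlib.compress("".join(random_numbers).encode()))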

503
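The same idea can be applied per column. The sketch below compresses three hypothetical columns separately, loosely modelled on the sales table: a repeated date, a small item count, and a total with many distinct values. The column contents, the seed, and the value ranges are assumptions made for illustration, and exact sizes depend on the data and the codec.

```python
import random
import zlib

random.seed(0)  # fixed seed so repeated runs produce the same data
n = 10_000

# Hypothetical columns, stored as text.
columns = {
    # a single repeated value
    "transaction_date": ["01/01/2001"] * n,
    # only four distinct values
    "nb_items": [str(random.choice([1, 2, 3, 4])) for _ in range(n)],
    # many distinct values
    "total": [f"{random.uniform(1, 2000):.2f}" for _ in range(n)],
}

sizes = {}
for name, values in columns.items():
    raw = ",".join(values).encode()
    sizes[name] = len(zlib.compress(raw))
    print(f"{name:17s} raw={len(raw):7d}  compressed={sizes[name]:6d}")
```

The more similar the values within a column, the smaller its compressed size: the date column shrinks the most and the totals the least.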

### Column-based formats: Advantages and Disadvantages

Advantages: 
* Columnar storage of data can sometimes yield 100x-1000x performance improvements, specifically on data with hundreds of columns
    * Such data is called wide data

Disadvantages:
  * Hard for a human to read, although reading a row-based file by eye is of limited use for big data anyway
  * Can be more CPU-intensive to write for very large data.
    * Need to collect the data for each column before writing it to file
  * Difficult to access a single instance (an entry across all columns)
  * Not efficient with CRUD queries
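The last two disadvantages can be seen in a small sketch. It assumes a column-wise layout held as a dictionary of lists; the names and the appended record are made up for illustration.

```python
# Column-wise layout: one list per variable.
columns = {
    "transaction_date": ["01/01/2001", "01/01/2001"],
    "nb_items": [4, 3],
    "total": [1852.14, 968.00],
}

# Reading one instance means touching every column at the same index.
record_1 = {name: values[1] for name, values in columns.items()}

# Creating one instance means appending to every column.
new_record = {"transaction_date": "02/01/2001", "nb_items": 1, "total": 20.50}  # made-up values
for name, value in new_record.items():
    columns[name].append(value)

print(record_1)
```

A record-level operation that is a single read or write in a row-based layout becomes one access per column here.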

### 2- Datatype & Schema Enforcement and Evolution 

* "Schema" in a database context, means structure and organization of the data  
    * Structure: datatypes, missing values, primary keys, indices, etc.
    * Organization: relationships across tables.

* Here, we mostly refer to the data type

* With a simple text format (e.g., a table with values separated by spaces), the datatype cannot be declared or enforced

* Declaring the type of a value provides some advantages.
  * Storage requirements: a value stored as text, e.g. the string `False` (5 bytes in ASCII), requires more storage than the same value stored as a boolean (typically 1 byte)
  * Data validity: helps guarantee that the dataset is valid
  * Compression: we know how to compress different types.
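The storage and validity points can be illustrated with the standard library. This sketch uses `struct.pack` as one possible way to store a typed boolean; the helper `parse_nb_items` is hypothetical.

```python
import struct

# The same value stored as text and as a typed boolean.
as_text = str(False).encode("ascii")  # the five characters of "False"
as_bool = struct.pack("?", False)     # a single byte

print(len(as_text), len(as_bool))


def parse_nb_items(field: str) -> int:
    """Parse a text field as an integer, raising ValueError if it is not one."""
    return int(field)


# A declared type rejects invalid values at parse time.
try:
    parse_nb_items("four")
    parsed_ok = True
except ValueError:
    parsed_ok = False

print(parsed_ok)
```

Without a declared type, the invalid value `"four"` would be stored silently and only cause problems later, during analysis.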





### 2- Datatype & Schema Enforcement and Evolution  - Cont'd
Unless your data is guaranteed to never change, you’ll need to think about schema evolution, or how your data schema changes over time. How will your file format manage fields that are added or deleted? When evaluating schema evolution specifically, there are a few key questions to ask of any data format:

* How easy is it to update a schema (such as adding a field, removing or renaming a field)?
* How will different versions of the schema “talk” to each other?
* Is it human-readable? Does it need to be?
* How fast can the schema be processed?
* How does it impact the size of data?

Example, for a dictionary ... 


### 3- Splitability

* Big data can often comprise many millions of records, often split across a large number of files
  * Think, for instance, of monthly logs, yearly transactions, or daily airplane sensor readings
* It is often useful to split the data across multiple machines and execute each computation separately

* Some file formats are more amenable to splitting than others.


### 3- Splitability - Row-based

Row-based formats can be split along row boundaries

```
# file 1  with n lines
01/01/2001       	4        	1852.14
01/01/2001       	3        	968.00
... 
```

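* Splitting `file 1` with `n` lines across `m` machines is easy.
  * The first `m-1` machines each get `floor(n/m)` unique lines; the last machine gets the remaining lines
      * Machine 1 gets lines 1 through $\lfloor \frac{n}{m} \rfloor$
      * Machine 2 gets lines $\lfloor \frac{n}{m} \rfloor + 1$ through $2 \lfloor \frac{n}{m} \rfloor$
      * Machine $x$ (for $x < m$) gets lines $(x-1) \lfloor \frac{n}{m} \rfloor + 1$ through $x \lfloor \frac{n}{m} \rfloor$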
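The line ranges can be computed with a short helper. This is a sketch, assuming `n >= m` and 1-based, inclusive line numbers; it uses floor division for the chunk size so that the last machine never ends up with a negative number of lines. The function name `split_ranges` is illustrative.

```python
def split_ranges(n: int, m: int) -> list:
    """Return 1-based inclusive (first, last) line ranges for m machines."""
    chunk = n // m  # lines per machine for the first m-1 machines
    ranges = []
    for x in range(1, m + 1):
        first = (x - 1) * chunk + 1
        # The last machine also takes whatever is left over.
        last = x * chunk if x < m else n
        ranges.append((first, last))
    return ranges


print(split_ranges(10, 3))  # [(1, 3), (4, 6), (7, 10)]
```

Each machine only needs its own `(first, last)` pair to know which lines to read.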
      
      
* Challenges:
    * Partitioning on a particular column value can be difficult if the data is stored in a random order.
    * E.g., splitting on the number of items requires sorting the data on the 2nd column first, then splitting after reading

### 3- Splitability - Row-based, nested

* Some file formats are more amenable to splitting than others.


``` 
# file 2
{"01/01/2001": [(4, 1852.14), (3, 968.00)], ....}
```

* You cannot easily split this file format without parsing the file first.
  * Need to read the complete file to split it into chunks.
    * Data may need to be loaded in RAM first.




### 3- Splitability: Column-based, nested 


* A column-based format can be split if the computation is column-specific.

``` 
# file 3
date: 01/01/2001, 01/01/2001
nb_items: 4, 3
totals: 1852.14, 968.00
```

* With the example above, each machine is concerned with a computation on a specific variable. For example: 
  * Machine 1 takes the `date` data and computes the number of sales per month
  * Machine 2 takes the `nb_items` data and computes the total number of items sold
  * Machine 3 takes the `totals` data and computes the total sales value
  
* Machines don't have any knowledge of variables they are not given.
  * E.g., machine 3 is not given the date info and cannot compute, for example, the monthly or weekly sales average.
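A per-column split might be sketched as follows, with one function standing in for each machine. The function names are hypothetical, the dates are assumed to be `DD/MM/YYYY` strings, and the second worker simply sums the item counts, since that can be computed from `nb_items` alone.

```python
from collections import Counter

# Column-wise data from `file 3`.
columns = {
    "date": ["01/01/2001", "01/01/2001"],
    "nb_items": [4, 3],
    "totals": [1852.14, 968.00],
}


def sales_per_month(dates):
    """Count sales per month, assuming DD/MM/YYYY date strings."""
    return Counter(d[3:] for d in dates)  # keep the "MM/YYYY" part


def total_items(nb_items):
    return sum(nb_items)


def total_sales(totals):
    return sum(totals)


# Each "machine" is handed exactly one column and nothing else.
result_1 = sales_per_month(columns["date"])
result_2 = total_items(columns["nb_items"])
result_3 = total_sales(columns["totals"])

print(result_1, result_2, result_3)
```

None of the three functions could compute a monthly sales average, because that requires both the `date` and the `totals` columns.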



### CSV File Format

* Files in the CSV (comma-separated values) format are usually used to exchange tabular data
  * Plain-text file (readable characters)
  
* CSV is a row-based file format: each row of the file is a separate data instance
  * May or may not contain a header
* Structure is conveyed through explicit commas
  * Commas within text are enclosed in double quotes

```
Title,Author,Genre,Height,Publisher
"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson
Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress
Making Software,"Oram, Andy",computer_science,232,O'Reilly
....
```
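Python's standard `csv` module handles the quoting rule. The sketch below parses the first lines of the example from an in-memory string (a stand-in for a file) and contrasts the result with a naive split on commas.

```python
import csv
import io

# The first lines of the example, held in memory instead of a file.
sample = io.StringIO(
    'Title,Author,Genre,Height,Publisher\n'
    '"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson\n'
    'Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress\n'
)

reader = csv.reader(sample)
header = next(reader)  # the first row is the header
rows = list(reader)

# A naive split also cuts at the commas inside the quoted fields.
naive = '"Computer Vision, A Modern Approach","Forsyth, David"'.split(",")

print(header)
print(rows[0])
print(len(naive))
```

The `csv` reader returns five fields for the first book, while the naive split produces four pieces from its first two fields alone.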


* Data structure is conveyed through redundant values across rows (e.g., the repeated `data_science` genre)

* Data connections are usually established using multiple CSV files. 
   * Uses foreign keys (specific columns) across files
   * Connection not expressed in the file format 
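Because the connection is not part of the format, the reader has to reconstruct it. Below is a minimal sketch of joining two CSV files on a foreign key; the file contents, the `publisher_id` column, and its values are hypothetical.

```python
import csv
import io

# Two hypothetical CSV "files"; publisher_id is the foreign key.
books_csv = io.StringIO(
    "title,publisher_id\n"
    "Data Mining Handbook,2\n"
    "Making Software,3\n"
)
publishers_csv = io.StringIO(
    "publisher_id,name\n"
    "2,Apress\n"
    "3,O'Reilly\n"
)

# Index one file by its key, then look up each row of the other.
publishers = {
    row["publisher_id"]: row["name"] for row in csv.DictReader(publishers_csv)
}
joined = [
    (row["title"], publishers[row["publisher_id"]])
    for row in csv.DictReader(books_csv)
]

print(joined)
```

Nothing in either file says that `publisher_id` links them; that knowledge lives only in the code that reads them.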
  
* CSV format is not fully standardized
  * Values may be separated by other characters, such as tabs (TSV) or spaces (SSV)
  
  
