### Popular Big Data Formats

* Data format is an important aspect of working with big data

* The recurring theme is "There ain't no such thing as a free lunch"

```"There ain't no such thing as a free lunch" (TANSTAAFL), also known as "there is no such thing as a free lunch" (TINSTAAFL), is an expression that describes the cost of decision-making and consumption. The expression conveys the idea that things appearing free always have some cost paid by somebody, or that nothing in life is truly free. ``` **https://www.investopedia.com/terms/t/tanstaafl.asp**


### Popular Big Data Formats

* Some of the issues that arise when working with data formats are:

1. Compression
  * Not all file formats are equally compressible with the same algorithm.
2. Splittability
  * How splittable is a file format?
  * As we saw with split-apply-combine, being able to split a file and process the pieces on multiple machines can be critical in some instances
3. Columnar and row-wise data formats
  * Not all the columns (variables) in a dataset are equally valuable analytically.
  * The need to compute column-based statistics can favor a format that makes it easier to extract a single column
4. Data Types and Schema Evolution
  * Do we need to enforce data types?
  * Will my data format change over time?
  * With petabytes of data, it is not reasonable to think that we can just regenerate the files every time there is a change to the schema

### File Format: a Quick intuition

* In big data, the right storage format is paramount for achieving performance, saving space, and making certain operations possible.

* Can save time, minimize complexity and decrease cost.

* We're accustomed to row-based formats

  * MS Excel file-like where each row is a table entry

| Transaction Date     | Nb Items     | Total       |
|------------------    |----------    |---------    |
| 01/01/2001           | 4            | 1852.14     |
| 01/01/2001           | 3            | 968.00      |
| `...`             | `...`     | `...`     |

### File Format: a Quick intuition - Cont'd
    
* This format may be inappropriate for certain types of data or operations

* Imagine that the sales info above contains a very large number of transactions with hundreds of thousands of transactions each day
    * The same transaction dates will be unnecessarily duplicated hundreds of thousands of times.
    * Perhaps a dictionary-like format where the key is the date would help save on storage
 
```python
{"01/01/2001": ((4, 1852.14), (3,  968.00), ...), "01/02/2001":(...), ... }
```
* This will also be more efficient for computing operations on days
   * E.g. count number of transactions or total sales per day
   * What about computing the running sales total?
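
A minimal sketch of the dictionary idea above, assuming the transactions arrive as `(date, nb_items, total)` tuples. The two rows for `01/02/2001` are made up for illustration.

```python
from collections import defaultdict

# Row-wise records: (transaction date, number of items, total).
# The last two rows are hypothetical.
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
    ("01/02/2001", 1, 20.00),
    ("01/02/2001", 2, 35.50),
]

# Group by date so that each date is stored only once.
by_date = defaultdict(list)
for date, nb_items, total in rows:
    by_date[date].append((nb_items, total))

# Per-day statistics only touch the entries stored under that day.
transactions_per_day = {date: len(entries) for date, entries in by_date.items()}
sales_per_day = {date: sum(total for _, total in entries)
                 for date, entries in by_date.items()}

print(transactions_per_day)
print(sales_per_day)
```

A running sales total is less convenient in this layout: it depends on visiting the days in chronological order, which the dictionary keys do not guarantee by themselves.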

### File Format: a Quick intuition - Cont'd

* If the objective were to calculate the total sales, we would need to read millions of lines to compute a single value.
* Perhaps we can store each column as a single line. Reading a single line is then sufficient to compute the total or the average.

|              |      |  |   |
| :---              |    :----:  | :--------: |:---:|
| **Totals**             | 1852.14    | 968.00     | `...` |
| **Transaction Dates** | 01/01/2001 | 01/01/2001 | `...` |
| **Nb Items**               | 4             | 3          | `...` |
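
The same transposition can be sketched in Python, assuming the two sample rows from the earlier table. `zip(*rows)` turns a list of rows into one tuple per column.

```python
rows = [
    ("01/01/2001", 4, 1852.14),
    ("01/01/2001", 3, 968.00),
]

# Transpose: one tuple per column instead of one tuple per row.
dates, nb_items, totals = zip(*rows)

# Only the `totals` column is needed to compute the total or the average.
total_sales = sum(totals)
average_sale = total_sales / len(totals)
print(total_sales, average_sale)
```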

### File Formats Decisions

* There are four considerations when selecting file formats:
    1. Row vs Column
      * What kind of analytics are important?
    2. Schema Management
      * Will my data schemas evolve?
    3. Splittability
      * Can I store data across multiple files and potentially servers?
    4. Compression
      * How compressible is the format?

### 1. Row- and Column-Based Formats

* An important consideration when selecting a big data format

![](https://www.dropbox.com/s/an5fg7xl2uvnfb8/row_col_format.png?dl=1)

### 1. Row- and Column-Based Formats

* Row-based: Ideal when using all the data
  * Example: building a machine learning model that requires all the features and all the instances
    * Avoid reading the whole dataset into RAM by loading chunks at a time
    * Requires frequent conditional access to multiple columns

* Column-based storage: useful when performing operations on a subset of columns
  * Computing total sales, or computing a total aggregated by date, etc.


### Row-Based Formats

* Used in most mainstream applications, from web log files to highly-structured database systems like MySQL and Oracle.

* Processing all the data would require reading all inputs line by line

* This is commonly used for Online Transactional Processing (OLTP).

  * OLTP systems usually process CRUD queries (Create, Read, Update and Delete) at a record level.

  * The main emphasis for OLTP systems is maintaining data integrity in multi-access environments

* Effectiveness measured by the number of transactions per second

   * More on this when we discuss big data platforms

### Column-Based Formats


* The data is grouped by columns
* Easy to focus computation on specific columns of data
  * E.g.: compute the mean or standard deviation of a column, search for the largest
    * Why do you think this last operation is less computationally intensive on a column-based format?

* Ideal for compression
  * Compression codecs (e.g., GZIP, PKZIP) achieve a higher compression ratio when compressing sequences of values of the same data type.

  

```python
[[1,2,3,..], ["John", "Janet", "Michael", ...], ...]
```



* This is much more efficient to compress than the row-wise equivalent:

```python
[[1, "John", "Doe", "125,000"], [2, "Janet", "Smith", "195,129"], ...]
```

* Typically, the slowest components in large distributed systems are the disk and network
    * Using compression reduces read IO and transfers, thus speeding up the analysis.


* This way of processing data is usually called OLAP (Online Analytical Processing)   

 * OLAP is an approach designed to quickly answer analytics queries involving multiple features (variables), typically across database systems.

### Compression of Row vs. Columnar Data

  * Let's perform a quick experiment
    


In [5]:
import random
random.choices([1,2,3,4], k=6)

[3, 3, 4, 1, 2, 3]

In [6]:
import random
random.choices("ACGT", k=6)

['T', 'T', 'G', 'C', 'C', 'T']

In [7]:
import string 

print(string.printable)
print(string.digits)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	

0123456789


In [11]:
import zlib 
import string

# let's randomly generate two strings of 10,000 characters: one of printable ASCII, one of digits only

random_ASCII = random.choices(string.printable, k=10_000)
random_numbers = random.choices(string.digits, k=10_000)
print(len(zlib.compress( str.encode("".join(random_ASCII)))))
print(len(zlib.compress( str.encode("".join(random_numbers)))))

8414
5081


In [28]:
import numpy
ratios = []
for i in range(10):
    random_ASCII = random.choices(string.printable, k=10_000)
    random_numbers = random.choices(string.digits, k=10_000)
    len_ascii = len(zlib.compress( str.encode("".join(random_ASCII)))) 
    len_numbers = len(zlib.compress( str.encode("".join(random_numbers))))
    ratios.append(len_ascii/len_numbers)
    
numpy.mean(ratios)

1.6547445239154617

In [19]:
string.ascii_uppercase

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [29]:
ratios = []
for i in range(10):
    random_ASCII = random.choices(string.printable, k=10_000)
    random_uppercase = random.choices(string.ascii_uppercase, k=10_000)
    len_ascii = len(zlib.compress( str.encode("".join(random_ASCII)))) 
    len_uppercase = len(zlib.compress( str.encode("".join(random_uppercase))))
    ratios.append(len_ascii/len_uppercase)    
numpy.mean(ratios)

1.3462144408226877
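
As a complementary sketch, the same records can be serialized row-wise and column-wise and then compressed with `zlib`. The table is synthetic (the column names and value ranges are assumptions for illustration), and the resulting sizes depend on the data, so no particular ratio should be expected.

```python
import random
import zlib

random.seed(0)  # make the synthetic data reproducible

n = 10_000
ids = [str(i) for i in range(n)]
cities = random.choices(["HONOLULU", "MONTREAL", "NY"], k=n)
ages = [str(random.randint(18, 30)) for _ in range(n)]

# Row-wise: values of different kinds are interleaved on each line.
row_wise = "\n".join(",".join(record) for record in zip(ids, cities, ages))

# Column-wise: all the values of a column are stored next to each other.
column_wise = "\n".join(",".join(column) for column in (ids, cities, ages))

row_size = len(zlib.compress(row_wise.encode()))
col_size = len(zlib.compress(column_wise.encode()))
print(f"row-wise: {row_size} bytes, column-wise: {col_size} bytes")
```

Both strings hold the same values and have the same uncompressed length, so any difference in compressed size comes from the ordering of the values alone.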

### Examples of OLAP versus OLTP in Amazon

![](https://www.dropbox.com/s/cxhwtc5s582tnp2/amazon_olap_oltp.png?dl=1)

### Column-based formats: Advantages and Disadvantages

<u>Advantages</u>:
* Columnar storage of data can sometimes yield 100x-1000x performance improvements, particularly for wide datasets


<u>Disadvantages</u>:
  *  Not efficient with CRUD operations
  * Difficult to access all features of a single instance
    * Need to parse all columns to position $i$
  * Hard to read by a human
  * Can be more CPU intensive to write for very large data.



### 2- Datatype and Schema Enforcement and Evolution

* "Schema" in a database context, means the structure and organization of the data  
    * Structure: datatypes, missing values, primary keys, indices, etc.
    * Organization: relationships across tables.

* Here, we mainly refer to the data type
* In text format, (e.g.: table with values separated by space), datatype cannot be declared or enforced

* Declaring the type of a value provides some advantages.
  * Storage requirements: string categories require more storage than a boolean, which typically needs one byte or less
  * Data validity: Verifies the dataset is valid and prevents entry errors (e.g., age = Johnn)
  * Compression: there are good strategies for compressing different data types 
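
A small sketch of what type enforcement buys, assuming a hypothetical two-column schema (`name`, `age`) and records whose fields arrive as text. The `parse_record` helper is invented for this example.

```python
# Hypothetical schema: column name -> Python type used to parse the raw text.
schema = {"name": str, "age": int}

def parse_record(raw):
    """Convert raw text fields to the declared types, or raise ValueError."""
    return {column: column_type(raw[column])
            for column, column_type in schema.items()}

print(parse_record({"name": "Janet", "age": "34"}))

try:
    parse_record({"name": "John", "age": "Johnn"})
except ValueError as error:
    print(f"Rejected invalid record: {error}")
```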

### 2- Datatype & Schema Enforcement and Evolution - Cont'd


* If there is no guarantee that the structure of the data will stay the same, you need to plan for schema evolution.


* When evaluating schema evolution, there are a few key questions to ask of any data format:
  * How easy is it to update a schema (such as adding a field, removing or renaming a field)?
  * How will different versions of the schema impact applications?
  * How fast can the schema be processed?
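
One common way to handle an added field is for the newer schema to declare a default, so that records written under the old schema can still be read. The sketch below uses plain dictionaries; the field names and the `upgrade` helper are hypothetical.

```python
# Version 1 records have no "volunteer" field; version 2 adds it with a default.
schema_v2_defaults = {"name": None, "occupation": None, "volunteer": False}

def upgrade(record):
    """Return a copy of `record` with fields missing from older versions filled in."""
    return {**schema_v2_defaults, **record}

old_record = {"name": "John", "occupation": "Student"}
new_record = {"name": "Janet", "occupation": "Engineer", "volunteer": True}

print(upgrade(old_record))
print(upgrade(new_record))
```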


### 3- Splitability

* Big data, such as monthly logs, yearly transactions, or daily airplane sensor recordings, can often comprise many millions of records.

* Often useful to split the data across multiple machines and execute each computation separately

* Some file formats are more amenable to splitting than others.

### 3- Splitability - Row-based

Row-based formats can be split along row boundaries

```
# file 1 with n lines
01/01/2001           4            1852.14
01/01/2001           3            968.00
...
```

* Splitting can be done
  * Randomly splitting `file 1` with `n` observations across `m` total machines is easy.

    * Each machine gets `ceiling(n/m)` unique lines, last machine gets remaining lines

 * Splitting based on one or more fields: 
    * Partitioning a row-based file over particular column values can be difficult if data is stored in a random order.
    * May require sorting the data first
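
The simple split described above can be sketched as follows, assuming the file has already been read into a list of lines. `split_lines` is a hypothetical helper.

```python
import math

def split_lines(lines, m):
    """Split `lines` into at most `m` chunks of ceiling(n/m) lines each."""
    chunk_size = max(1, math.ceil(len(lines) / m))
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

lines = [f"line {i}" for i in range(10)]
chunks = split_lines(lines, 3)
print([len(chunk) for chunk in chunks])
```

With 10 lines and 3 machines, the first two chunks hold 4 lines each and the last one holds the remaining 2.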

### 3- Splitability row-based, nested 

* Large nested data can be more difficult to split

``` 
file 2
{"01/01/2001": [(4, 1852.14), (3, 968.00)], ....}
```

* You cannot easily split this file format without parsing the file first.
  * Need to read the complete file to split it into chunks.
    * Data may need to be loaded in RAM first.


### 3- Splitability: Column-based, nested


* A column-based format can be split if the computation is column-specific.

```
# file 3
date: 01/01/2001, 01/01/2001
nb_items: 4, 3
totals: 1852.14, 968.00
```

Splitting can only be done column-wise:
* In the example above, each machine is concerned with a computation on a specific variable. For example:
  * Machine 1 takes `date` data and computes the number of sales per month
  * Machine 2 takes the `nb_items` data and computes the total number of items sold
  * Machine 3 takes the `totals` data and computes the total sales values
 
* Machines don't have any knowledge of variables that are not given.
  * E.g., machine 3 is not given the date info, so it cannot compute the monthly or weekly sales average.


### 4- Compression


* When working on a distributed system, data transfer can be a serious bottleneck
* Compression can substantially improve runtime and storage requirements

* We illustrated "naively" that columnar data can achieve better compression rates than row-based data
  * Simple way to think about it: column will have a lot more duplicate values:
      * Ex. Age Column: 21, 22, 21, 24, 25, 21, 22, 21, 19, 21, 21, 22, ....
      
* Note that complex compression algorithms on very large files can save on space but substantially increase compute time.
    * Decompression/re-compression needs to occur every time you need to access the data.


### Standardization and File Formats

* One can always choose their own format for the file
  * Many companies may choose to do so internally for many reasons.
  * E.g.:

```
FIRST_NAME_1\sLAST_NAME_1\tFIRST_NAME_2\sLAST_NAME_2\tFIRST_NAME_3\sLAST_NAME_3...
JOBTITLE_1\sSALARY_1\tJOBTITLE_2\sSALARY_2\tJOBTITLE_3\sSALARY_3
```

* However, there are many benefits to using a standard file format. E.g.:
  * Clarity and productivity: a standard format eliminates guesswork and extra searching for answers. There is no need to maintain internal documentation, and it is easier to get answers online when issues arise.

  * Quality: standard formats are designed by large teams and used extensively, which provides opportunities to optimize them

  * Interoperability: your data is no longer locked to your company (or compartmentalized) and can be used across platforms.

* Some of the most used formats are CSV, JSON, Parquet, AVRO, HDF5
  * All very well supported in Python


### CSV File Format

* Files in the CSV (Comma-separated values) format are usually used to exchange tabular data
  * Plain-text file (readable characters)
 
* CSV is a row-based file format: each row of the file is a separate data instance
  * May or may not contain a header
* Structure is conveyed through explicit commas
  * Text commas are encapsulated in double quotes

```
Title,Author,Genre,Height,Publisher
"Computer Vision, A Modern Approach","Forsyth, David",data_science,255,Pearson
Data Mining Handbook,"Nisbet, Robert",data_science,242,Apress
Making Software,"Oram, Andy",computer_science,232,O'Reilly
...
```

### CSV File Format

* CSV format is not fully standardized
  * Other characters can be used to separate fields, such as tabs (tsv) or spaces (ssv)
 
* Data relationships across multiple CSV files are not expressed in the file format
  * Use same column names to indicate "foreign key" relationship
 

* Native support in Python
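```python
import csv
# "my_file.csv" is a placeholder file name
with open("my_file.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        ...  # use each row (a list of strings)
```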

In [36]:
# All_Time_Worldwide_Box_Office_partial.csv
import csv
with open('All_Time_Worldwide_Box_Office.csv')  as csvfile:
    movies_file = csv.reader(csvfile, delimiter=',', quotechar='"')
    i = 0 
    for line in movies_file:
        print(f"Line {i}: {line}")
        i+=1
        if i ==10:
            break


Line 0: ['Rank', 'Year', 'Movie', 'WorldwideBox Office', 'DomesticBox Office', 'InternationalBox Office']
Line 1: ['1', '2009', 'Avatar', '$2,845,899,541', '$760,507,625', '$2,085,391,916']
Line 2: ['2', '2019', 'Avengers: Endgame', '$2,797,800,564', '$858,373,000', '$1,939,427,564']
Line 3: ['3', '1997', 'Titanic', '$2,207,986,545', '$659,363,944', '$1,548,622,601']
Line 4: ['4', '2015', 'Star Wars Ep. VII: The Force Awakens', '$2,064,615,817', '$936,662,225', '$1,127,953,592']
Line 5: ['5', '2018', 'Avengers: Infinity War', '$2,044,540,523', '$678,815,482', '$1,365,725,041']
Line 6: ['6', '2015', 'Jurassic World', '$1,669,979,967', '$652,306,625', '$1,017,673,342']
Line 7: ['7', '2019', 'The Lion King', '$1,654,367,425', '$543,638,043', '$1,110,729,382']
Line 8: ['8', '2015', 'Furious 7', '$1,516,881,526', '$353,007,020', '$1,163,874,506']
Line 9: ['9', '2012', 'The Avengers', '$1,515,100,211', '$623,357,910', '$891,742,301']


In [39]:
# All_Time_Worldwide_Box_Office_partial.csv
import csv
with open('All_Time_Worldwide_Box_Office.csv')  as csvfile:
    movies_file = csv.DictReader(csvfile, delimiter=',', quotechar='"')
    i = 0 
    for line in movies_file:
        print(f"Line {i}: {line}")
        i+=1
        if i ==10:
            break

Line 0: {'Rank': '1', 'Year': '2009', 'Movie': 'Avatar', 'WorldwideBox Office': '$2,845,899,541', 'DomesticBox Office': '$760,507,625', 'InternationalBox Office': '$2,085,391,916'}
Line 1: {'Rank': '2', 'Year': '2019', 'Movie': 'Avengers: Endgame', 'WorldwideBox Office': '$2,797,800,564', 'DomesticBox Office': '$858,373,000', 'InternationalBox Office': '$1,939,427,564'}
Line 2: {'Rank': '3', 'Year': '1997', 'Movie': 'Titanic', 'WorldwideBox Office': '$2,207,986,545', 'DomesticBox Office': '$659,363,944', 'InternationalBox Office': '$1,548,622,601'}
Line 3: {'Rank': '4', 'Year': '2015', 'Movie': 'Star Wars Ep. VII: The Force Awakens', 'WorldwideBox Office': '$2,064,615,817', 'DomesticBox Office': '$936,662,225', 'InternationalBox Office': '$1,127,953,592'}
Line 4: {'Rank': '5', 'Year': '2018', 'Movie': 'Avengers: Infinity War', 'WorldwideBox Office': '$2,044,540,523', 'DomesticBox Office': '$678,815,482', 'InternationalBox Office': '$1,365,725,041'}
Line 5: {'Rank': '6', 'Year': '2015',

### CSV Pros and Cons
<u>Pros:</u>
* Human-readable and easy to edit manually
* Provides a simple scheme
* Can be processed by almost all existing applications
* Easy to implement and parse
* Compact (compared to, for instance, JSON or XML)
* Column headers are written only once

<u>Cons:</u>
* No guarantees about data integrity: values may be missing or of a different type than expected.
* Complex (e.g., nested) structures cannot be represented
  * May need to reference other files to implement nesting
* There is no standard way to present binary data
* Lack of a universal standard can cause compatibility issues between tools (e.g., different delimiters or quoting rules)
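
Because CSV carries no type information, every value is read as a string and must be converted explicitly. A minimal sketch, using an in-memory two-row excerpt of the box-office file shown earlier; `parse_dollars` is a hypothetical helper.

```python
import csv
import io

# In-memory excerpt of the file shown above (two data rows, four columns).
csv_text = '''Rank,Year,Movie,WorldwideBox Office
1,2009,Avatar,"$2,845,899,541"
3,1997,Titanic,"$2,207,986,545"
'''

def parse_dollars(text):
    """Convert a string such as '$2,845,899,541' to an integer."""
    return int(text.replace("$", "").replace(",", ""))

movies = []
for row in csv.DictReader(io.StringIO(csv_text)):
    movies.append({
        "rank": int(row["Rank"]),
        "year": int(row["Year"]),
        "movie": row["Movie"],
        "worldwide": parse_dollars(row["WorldwideBox Office"]),
    })

print(movies[0])
```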

### JSON File Format

* JSON (JavaScript Object Notation)

* Open standard file format that uses human-readable text
  * File typically stored using the `.json` extension
  
* Became popular as a space-saving alternative to Extensible Markup Language (XML)

* Inspired by JavaScript objects but is a language-independent data format

* Very similar to the combination of Python's lists and dicts

* Also supported natively in Python
  ```python
  import json
  json.load(...)
  ```
* The de facto data format of the web
  * Supported in all modern languages and particularly web languages.

### JSON File Structure

* JSON supports the following types.

* Scalar values
    * `Number`: e.g. 3
    * `String`: Sequence of Unicode characters surrounded by double quotation marks.
    * `Boolean`: `true` or `false`.
    * `Null`: `null`, an empty value.

* Collections:
    * `Array`: A list of values separated by commas and surrounded by square brackets `[]`
    * `Object` (dictionary): `"key": value` pairs separated by commas and enclosed in `{}`
      * Keys are strings and values can be any valid scalar or collection

* See the following for more details: https://docs.fileformat.com/web/json/
* A useful validator for JSON files or records: https://jsonformatter.curiousconcept.com/#

In [40]:
my_data = [ 
    {'First Name': "John", "Occupation": "Student", "Salary": 120_000, "volunteer": False}, 
    {'First Name': "John", "Occupation": "Student", "salary": None, "volunteer": True}
]
my_data


[{'First Name': 'John',
  'Occupation': 'Student',
  'Salary': 120000,
  'volunteer': False},
 {'First Name': 'John',
  'Occupation': 'Student',
  'salary': None,
  'volunteer': True}]

```python
json.load
json.loads
json.dump
json.dumps
```

In [49]:
import json
json_representation = json.dumps(my_data)
print(json_representation)
# Note the changes between the Python dict and the JSON string

[{"First Name": "John", "Occupation": "Student", "Salary": 120000, "volunteer": false}, {"First Name": "John", "Occupation": "Student", "salary": null, "volunteer": true}]
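
The four functions listed above come in pairs: `dumps`/`loads` work with strings, while `dump`/`load` work with file objects. A short round trip, using `io.StringIO` as a stand-in for a real file:

```python
import io
import json

record = {"First Name": "John", "Salary": None, "volunteer": True}

# String round trip: dict -> JSON text -> dict.
text = json.dumps(record)
assert json.loads(text) == record

# File round trip: json.dump writes to a file-like object, json.load reads from one.
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)
restored = json.load(buffer)

print(text)
print(restored)
```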


### Working with the Python `json` library


* `All_Time_Worldwide_Box_Office_partial.json`  structure
```json
[
 {
  "Rank": "1",
  "Year": "2009",
  "Movie": "Avatar",
  "WorldwideBox Office": "$2,845,899,541",
  "DomesticBox Office": "$760,507,625",
  "InternationalBox Office": "$2,085,391,916"
 },
 {
  "Rank": "2",
  "Year": "2019",
  "Movie": "Avengers: Endgame",
  "WorldwideBox Office": "$2,797,800,564",
  "DomesticBox Office": "$858,373,000",
  "InternationalBox Office": "$1,939,427,564"
 },
 ...
]
```

In [53]:
import json
# the context manager closes the file once the data is loaded
with open('data/All_Time_Worldwide_Box_Office_partial.json') as json_file:
    movies_data = json.load(json_file)
movies_data[0:3]


[{'Rank': '1',
  'Year': '2009',
  'Movie': 'Avatar',
  'WorldwideBox Office': '$2,845,899,541',
  'DomesticBox Office': '$760,507,625',
  'InternationalBox Office': '$2,085,391,916'},
 {'Rank': '2',
  'Year': '2019',
  'Movie': 'Avengers: Endgame',
  'WorldwideBox Office': '$2,797,800,564',
  'DomesticBox Office': '$858,373,000',
  'InternationalBox Office': '$1,939,427,564'},
 {'Rank': '3',
  'Year': '1997',
  'Movie': 'Titanic',
  'WorldwideBox Office': '$2,207,986,545',
  'DomesticBox Office': '$659,363,944',
  'InternationalBox Office': '$1,548,622,601'}]

In [54]:
type(movies_data)

list

In [55]:
type(movies_data[0])

dict

In [57]:
for record in movies_data:
    print(f"The movie {record['Movie']}, grossed {record['WorldwideBox Office']} in {record['Year']}")

The movie Avatar, grossed $2,845,899,541 in 2009
The movie Avengers: Endgame, grossed $2,797,800,564 in 2019
The movie Titanic, grossed $2,207,986,545 in 1997
The movie Star Wars Ep. VII: The Force Awakens, grossed $2,064,615,817 in 2015
The movie Avengers: Infinity War, grossed $2,044,540,523 in 2018
The movie Jurassic World, grossed $1,669,979,967 in 2015
The movie The Lion King, grossed $1,654,367,425 in 2019
The movie Furious 7, grossed $1,516,881,526 in 2015
The movie The Avengers, grossed $1,515,100,211 in 2012
The movie Frozen II, grossed $1,446,925,396 in 2019


### JSON Pros and Cons

* Pros:
    * Very well supported in modern languages, technologies and infrastructures
    * Can be used as the basis for more performance-optimized formats such as Parquet or Avro (discussed next)
    * Supports hierarchical structures abstracting the need for complex relationships
    * The *de facto* standard in NoSQL databases
* Cons:
    * Much smaller footprint than XML but still fairly large due to repeated field names
    * Not easy to index
    * There have been attempts to add a schema (e.g., JSON Schema), but they are not commonly used
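
The cost of repeated field names can be made concrete by serializing the same records as JSON and as CSV. The records are a two-row, three-column excerpt of the box-office data shown earlier.

```python
import csv
import io
import json

records = [
    {"Rank": "1", "Year": "2009", "Movie": "Avatar"},
    {"Rank": "3", "Year": "1997", "Movie": "Titanic"},
]

as_json = json.dumps(records)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Rank", "Year", "Movie"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

# JSON repeats every field name in every record; CSV writes them once.
print(len(as_json), len(as_csv))
```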

### AVRO File Format

* AVRO is an advanced form of the JSON format
    * Leverages some of the advantages of JSON while mitigating some of its disadvantages
* Uses a file definition (a schema itself written in JSON) and stores data without the repeated field names.
  * Said to be self-descriptive because you can include the schema and documentation in the header of the file containing the data

* Released by the Hadoop working group in 2009 to use with Hadoop Systems
* It is a row-based format that is highly splittable
* Provides a mechanism to manage schema evolution
* Supported in most modern languages, including Python via the `avro` library
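
For illustration, an Avro schema for the movie records used earlier might look like the sketch below. The record and field names are assumptions, and `Studio` is an invented field. The schema is built as a Python dictionary and printed as the JSON text that an Avro library would consume; no Avro library is called here.

```python
import json

# Hypothetical Avro schema for the movie records. "Studio" is an invented
# field added in a later version; its default keeps older data readable.
movie_schema = {
    "type": "record",
    "name": "Movie",
    "fields": [
        {"name": "Rank", "type": "int"},
        {"name": "Year", "type": "int"},
        {"name": "Movie", "type": "string"},
        {"name": "Studio", "type": ["null", "string"], "default": None},
    ],
}

print(json.dumps(movie_schema, indent=2))
```

Since the field names live in the schema, the data records themselves do not need to repeat them.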

### AVRO Pros and Cons

* Pros:
    * Binary data minimizes file size and maximizes efficiency
    * Reliable support for schema evolution
      * Supports new, missing, or changed fields.
      * This allows old software to read new data, and new software to read old data
      * It is a critical feature if your data can change.
* Cons:
    * Data is not human readable
    * All the cons of a row-based format

### PARQUET Format


* Parquet was developed by Twitter and Cloudera as a columnar data store
* Parquet is especially useful with wide datasets (datasets with many columns)

* Optimized for reading and is therefore ideal for read-intensive workloads
* Parquet was also designed to support columnar partitions
    * Splitting the data based on value similarity, which results in a folder hierarchy
      * E.g.: split on the values of MONTH or DEPARTMENT
    * Splits can be nested by splitting on a second attribute.
      * Will result in a nested folder hierarchy
     
```      
    MONTH=JANUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..
        
    MONTH=FEBRUARY
        CITY=HONOLULU
           data..
        CITY=MONTREAL
            data..
        CITY=NY
            data..

    ...
      
```  
https://blog.datasyndrome.com/python-and-parquet-performance-e71da65269ce
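
The folder hierarchy above can be imitated with the standard library alone. The sketch below writes plain CSV files rather than Parquet, purely to show the `KEY=VALUE` directory layout. The records are made up, and a real Parquet writer would produce the data files itself.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical records: (month, city, total).
records = [
    ("JANUARY", "HONOLULU", 1852.14),
    ("JANUARY", "MONTREAL", 968.00),
    ("FEBRUARY", "NY", 120.50),
]

root = Path(tempfile.mkdtemp())

for month, city, total in records:
    # One folder per distinct (MONTH, CITY) combination.
    folder = root / f"MONTH={month}" / f"CITY={city}"
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / "data.csv", "a", newline="") as handle:
        csv.writer(handle).writerow([total])

# A query about January only needs to open the MONTH=JANUARY folder.
january_files = sorted(path.relative_to(root).as_posix()
                       for path in (root / "MONTH=JANUARY").rglob("*.csv"))
print(january_files)
```

The benefit of this layout is that a filter on a partition column can skip entire folders without opening any of the files in them.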
 

### PARQUET PROS and CONS

* Pros: 
 * Highly compressible since data is stored column-wise (compression rates up to 75%)
   * Can use different compression algorithms with different datatypes
 * Seamless splittability across columns.
 * Optimized for reading data and ideal for read-intensive tasks
   * Can use parallelization to read different columns.



* Cons:
 * Very slow at writing data and not good with write-intensive applications
 * Does not support updates on the data as Parquet files are immutable.


### Memory Mapping for Working with Large Files

* Memory mapping is a powerful technique that maps part or all of a file on disk into a region of virtual memory.

* An application can access segments of a file without first reading the complete file into memory.

* Ideal for approaches that support streaming-like processing instead of indexing.
  * Think, for instance, of map-like operations.

* Allows you to use memory accesses rather than explicit read and write calls.

* The kernel schedules the reads and writes of physical pages.


### Better Memory Management

* Read only the data you need
  * A subset of columns or of files
* Specify the column types when reading so the data is not loaded with larger default types
  * Even if the final data fits, the read itself may take too much memory
* Use categories instead of strings when possible
  * Use appropriate data types: an int8 takes less memory than an int64
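
The effect of choosing a smaller integer type can be seen with the standard-library `array` module, used here as a stand-in for NumPy or pandas dtypes. Type code `'b'` stores signed 8-bit integers and `'q'` stores signed 64-bit integers.

```python
from array import array

ages = [21, 22, 21, 24, 25, 21, 22, 21, 19, 21, 21, 22]

as_int8 = array("b", ages)   # 1 byte per value
as_int64 = array("q", ages)  # 8 bytes per value

print(as_int8.itemsize * len(as_int8), "bytes as int8")
print(as_int64.itemsize * len(as_int64), "bytes as int64")
```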

### Memory Mapping using `mmap` 

* Locality in computing leads to huge performance gains.
  * Having data in registers or in RAM is computationally efficient
  * Swapping degrades performance

* `mmap` is a `C` function that maps or unmaps files or devices into memory

```
It is a method of memory-mapped file I/O. It implements demand paging because file contents are not read from disk directly and initially do not use physical RAM at all. The actual reads from disk are performed in a "lazy" manner, after a specific location is accessed.
```
* https://en.wikipedia.org/wiki/Memory-mapped_file

* Reading, seeking and writing are comparable to in-memory pointer operations


* Memory is allocated to a process
    * Processes may need to communicate via shared memory, and mmap makes that possible by letting them share a mapped file
    * Simplified task parallelization



### Advantages of mmap 

* No intermediate representation of the data
  * Less overhead for the data.
* Data sharing across processes.
* Modifications are immediate.
  * By contrast, modifying data read through a regular file handle does not modify the file; it still needs to be written back to the file
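
A self-contained sketch of these points, using a small temporary file so that it runs without any external dataset. The file contents are made up.

```python
import mmap
import os
import tempfile

# Create a small temporary file to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"id,city\n1,Honolulu\n2,Montreal\n")

with open(path, "r+b") as handle:
    # Length 0 maps the whole file.
    with mmap.mmap(handle.fileno(), 0) as mapped:
        header = mapped.readline()             # file-like access
        first_newline = mapped.find(b"\n", 0)  # search starting at offset 0
        mapped[0:2] = b"ID"                    # slice assignment writes into the mapping
        mapped.flush()                         # push the change to the file on disk

# Re-read the file the ordinary way to confirm the change reached it.
with open(path, "rb") as handle:
    contents = handle.read()

os.remove(path)
print(header, first_newline, contents)
```

Note that slice assignment must keep the length unchanged: the mapping can overwrite bytes in place, but it cannot grow or shrink the file this way.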

    


In [None]:
%%time
import pandas as pd
data = pd.read_csv("data/san_francisco_2015.csv")
data.info(memory_usage='deep')

In [None]:
import mmap
import os

In [None]:
%%time
file_handle = open("data/san_francisco_2015.csv", "r+b") 
mmap_file = mmap.mmap(file_handle.fileno(), 0)


In [None]:
line = mmap_file.readline()
print(f"Line 0:\t{line}")
line = mmap_file.readline()
print(f"Line 1:\t{line}")
line = mmap_file.readline()
print(f"Line 2:\t{line}")

In [None]:
mmap_file.seek(0)
line = mmap_file.readline()
print(f"Line:\t{line}")

In [None]:
mmap_file.seek(0)
mmap_file.find(b'\n')

In [None]:
mmap_file.seek(282)
mmap_file.find(b'\n')

In [None]:
mmap_file.readline()

In [None]:
mmap_file.tell()

In [None]:
mmap_file[0]

In [None]:
%%time
data = open("data/san_francisco_2015.csv", 'r').read()
len(data)

In [None]:
type(mmap_file[0])

In [None]:
chr(mmap_file[0])

In [None]:
chr(mmap_file[1])

In [None]:
mmap_file[0:9]


In [None]:
# almost equivalent to 
[x.to_bytes(1, 'big') for x in mmap_file[0:9]]


In [None]:
# equivalent to 
b''.join([x.to_bytes(1, 'big') for x in mmap_file[0:9]])

In [None]:
# almost equivalent to 
"".join([chr(x) for x in mmap_file[0:9]])

### The `datasets` Library
* Combines `Arrow`'s columnar format with memory mapping
* Using mmap, we can share data across processes
  * allows zero-copy reads which removes virtually all serialization overhead.
* Arrow is language-agnostic so it supports different programming languages.
* Arrow is column-oriented so it is faster at querying and processing slices or columns of data.
* Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow.
* Arrow supports many, possibly nested, column types.



In [1]:
# conda install -c conda-forge datasets

In [3]:
#!pip install datasets
from datasets import load_dataset
#my_data = load_dataset("csv", data_files="san_francisco_2015.csv")

ModuleNotFoundError: No module named 'datasets'
