# Saving different files in different formats

## ChatGPT advice
### Feather vs. Pickle
Feather and Pickle are two different file formats used to store data in Python. They have different design goals and use cases.  
Feather is a lightweight data interchange format that was developed by Wes McKinney. It is optimized for fast storage and retrieval of data frames, especially when working with data in R or Python. Feather uses the Apache Arrow columnar format and provides an interface for reading and writing data frames efficiently and with low overhead. Feather files are binary and thus not human-readable.  
Pickle, on the other hand, is a more general-purpose serialization format in Python. It allows you to save any Python object, including lists, dictionaries, functions, and custom classes, to a file and load it back later. Pickle uses a binary format and is more flexible than Feather, but it can also be slower and more memory-intensive, especially when storing complex data structures. Pickle files are also not human-readable.  
In general, if you are working with data frames and need fast storage and retrieval, Feather is a good choice. If you need to save more complex data structures or if you need to store data that can be read and manipulated by other Python programs, Pickle may be a better option.  
Note that Pickle is unsafe with untrusted input: unpickling can execute arbitrary code, so only load pickle files from sources you trust. Feather files hold data only and do not carry this code-execution risk, although validating data from unknown sources is still sensible.
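
As a minimal sketch of the flexibility described above, the snippet below round-trips a nested Python structure through `pickle` using only the standard library. The contents of `record` are made up for illustration.

```python
import pickle

# Hypothetical nested structure: it is not tabular data, so it is a case
# for Pickle rather than for a data frame format such as Feather.
record = {
    "name": "experiment-1",
    "params": {"lr": 0.01, "layers": [64, 32]},
    "tags": {"a", "b"},
    "shape": (3, 4),
}

# Serialize to bytes and restore; pickle.dump / pickle.load do the same
# with an open binary file object.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

# The restored object compares equal and keeps its container types.
print(restored == record)
print(type(restored["tags"]).__name__, type(restored["shape"]).__name__)
```

The set and the tuple come back as a set and a tuple, which a text format such as CSV would not preserve.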

### Adding CSV to the mix
Comma-Separated Values (CSV) is a simple text-based file format that stores tabular data. Unlike Feather and Pickle, CSV files are human-readable and can be opened and edited in any text editor or spreadsheet software.  
CSV is a widely supported format and is compatible with many programming languages, including Python. It can be used to store and exchange data between applications and platforms. However, CSV files can be large and slow to read and write, especially when working with large datasets. In addition, CSV files do not support complex data structures like nested lists or dictionaries.  
In general, if you need to exchange data with other applications or platforms, or if you need to store data in a format that is easily accessible and editable by non-technical users, CSV is a good choice. If you need to store and work with large datasets, or if you need to save more complex data structures, you may want to consider using a more efficient format like Feather or Pickle.  
In summary, CSV is a good choice for simple, human-readable data storage and exchange, while Feather is better suited to fast storage and retrieval of data frames and Pickle to saving arbitrary Python objects.
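
One consequence of CSV being plain text is that type information is not stored in the file. The following sketch, which uses only the standard library `csv` module and made-up rows, shows integers and booleans coming back as strings.

```python
import csv
import io

# Hypothetical rows with a string, an integer and a boolean column.
rows = [
    {"team": "red", "age": 31, "win": True},
    {"team": "blue", "age": 45, "win": False},
]

# Write to an in-memory buffer instead of a file to keep the sketch self-contained.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["team", "age", "win"])
writer.writeheader()
writer.writerows(rows)

# Read the same text back.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))

print(buffer.getvalue())
print(loaded[0])  # every value is now a str
```

`pandas.read_csv` infers types when reading, but choices such as `category` or `int16` are not recorded in the CSV and have to be reapplied after loading.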

### And Parquet
Parquet is a columnar storage format that is optimized for big data processing and storage. Whereas Apache Arrow defines a columnar format for in-memory processing, Parquet is an on-disk format, with compression and encoding designed to be efficient for large datasets.  
pyarrow can read and write Parquet files, which makes Parquet a good choice for storing and exchanging data between applications and platforms.
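
To make "columnar" concrete, here is a plain-Python sketch of the same small, made-up table held row by row and column by column. It only illustrates the layout idea; it is not how Parquet or Arrow actually encode bytes.

```python
# Hypothetical three-row table in a row-oriented layout.
row_layout = [
    {"team": "red", "age": 31, "prob": 0.25},
    {"team": "blue", "age": 45, "prob": 0.5},
    {"team": "red", "age": 22, "prob": 0.75},
]

# Column-oriented layout: one sequence per column.
column_layout = {
    name: [row[name] for row in row_layout]
    for name in row_layout[0]
}

# Reading one column from the row layout touches every record...
ages_from_rows = [row["age"] for row in row_layout]
# ...whereas the column layout already holds it as a single sequence.
ages_from_columns = column_layout["age"]

print(ages_from_columns)
print(sum(ages_from_columns) / len(ages_from_columns))
```

Keeping values of the same type next to each other is also what lets columnar formats compress well and read only the columns a query needs.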

# Some tests now
Courtesy of Rob Mulla:
- https://www.youtube.com/watch?v=u4rsA5ZiTls
- https://gist.github.com/RobMulla/738491f7bf7cfe79168c7e55c622efa5


For each format I will create a large pandas DataFrame and measure the read/write speed and the size on disk.

> Note: For the Feather format to work, you also need __pyarrow__ installed.
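
The cells below rely on the IPython `%time` magic. Outside a notebook, the same kind of measurement could be sketched with the standard library as follows. The helper names (`timed`, `write_pickle`, `read_pickle`) and the small stand-in payload are assumptions for illustration, not part of the original benchmark.

```python
import os
import pickle
import tempfile
import time

def timed(label, func):
    """Run func once, print the elapsed wall time, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result, elapsed

def write_pickle(obj, path):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def read_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Small stand-in payload; the notebook uses a pandas DataFrame instead.
data = [{"age": i % 50, "team": "red"} for i in range(10_000)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.pickle")
    _, write_s = timed("write", lambda: write_pickle(data, path))
    loaded, read_s = timed("read", lambda: read_pickle(path))
    size_bytes = os.path.getsize(path)  # space efficiency: bytes on disk

print(f"size on disk: {size_bytes:,} bytes")
```

A single run like this is noisy; repeating each measurement and keeping the best or median time gives steadier numbers.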

In [5]:
import pandas as pd
import numpy as np

SIZE_IN_RECS = 2_000_000
def get_dataset(size):
    # Create Fake Dataset
    df = pd.DataFrame()
    df['size'] = np.random.choice(['big','medium','small'], size)
    df['age'] = np.random.randint(1, 50, size)
    df['team'] = np.random.choice(['red','blue','yellow','green'], size)
    df['win'] = np.random.choice(['yes','no'], size)
    dates = pd.date_range('2020-01-01', '2022-12-31')
    df['date'] = np.random.choice(dates, size)
    df['prob'] = np.random.uniform(0, 1, size)
    return df

def set_dtypes(df):
    df['size'] = df['size'].astype('category')
    df['team'] = df['team'].astype('category')
    df['age'] = df['age'].astype('int16')
    df['win'] = df['win'].map({'yes':True, 'no': False})
    df['prob'] = df['prob'].astype('float32')
    return df
print('Reading and writing CSV')
df = get_dataset(SIZE_IN_RECS)
df = set_dtypes(df)
%time df.to_csv('test.csv')
%time df_csv = pd.read_csv('test.csv')

print('Reading and writing Pickle')
df = get_dataset(SIZE_IN_RECS)
df = set_dtypes(df)
%time df.to_pickle('test.pickle')
%time df_pickle = pd.read_pickle('test.pickle')

print('Reading and writing Feather')
df = get_dataset(SIZE_IN_RECS)
df = set_dtypes(df)
%time df.to_feather('test.feather')
%time df_feather = pd.read_feather('test.feather')

print('Reading and writing Parquet')
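# NOTE: this run uses 5,000,000 rows, not SIZE_IN_RECS (2,000,000),
# so the Parquet timings and file size are not directly comparable.
df = get_dataset(5_000_000)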
df = set_dtypes(df)
%time df.to_parquet('test.parquet')
%time df_parquet = pd.read_parquet('test.parquet')

Reading and writing CSV
CPU times: total: 8.53 s
Wall time: 11.8 s
CPU times: total: 1.52 s
Wall time: 1.92 s
Reading and writing Pickle
CPU times: total: 31.2 ms
Wall time: 36 ms
CPU times: total: 15.6 ms
Wall time: 37 ms
Reading and writing Feather
CPU times: total: 93.8 ms
Wall time: 81 ms
CPU times: total: 93.8 ms
Wall time: 74 ms
Reading and writing Parquet
CPU times: total: 906 ms
Wall time: 1.45 s
CPU times: total: 734 ms
Wall time: 676 ms


In [6]:
! dir

 Volume in drive E is Radical Labs
 Volume Serial Number is 442C-98BD

 Directory of e:\Work\A5\production_architecture\luigi\docs

09/02/2023  11:09    <DIR>          .
09/02/2023  11:09    <DIR>          ..
09/02/2023  10:38               256 File Storage.ipynb
09/02/2023  11:08        99,112,683 test.csv
09/02/2023  11:09        20,511,210 test.feather
09/02/2023  11:09        34,428,771 test.parquet
09/02/2023  11:09        34,001,704 test.pickle
               5 File(s)    188,054,624 bytes
               2 Dir(s)  267,657,674,752 bytes free
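
Because the benchmark cell writes the Parquet file from 5,000,000 rows and the other files from 2,000,000, raw file sizes are hard to compare. A rough sketch of a per-row comparison, assuming the files listed above were produced by the cell exactly as written:

```python
# File sizes in bytes, copied from the directory listing above.
sizes = {
    "csv": 99_112_683,
    "feather": 20_511_210,
    "parquet": 34_428_771,
    "pickle": 34_001_704,
}

# Row counts as written in the benchmark cell (assumption: the files on disk
# come from that exact run). Only the Parquet test used 5,000,000 rows.
rows = {
    "csv": 2_000_000,
    "feather": 2_000_000,
    "parquet": 5_000_000,
    "pickle": 2_000_000,
}

bytes_per_row = {fmt: sizes[fmt] / rows[fmt] for fmt in sizes}

# Print from most to least compact.
for fmt, value in sorted(bytes_per_row.items(), key=lambda item: item[1]):
    print(f"{fmt:8s} {value:6.2f} bytes/row")
```

Under that assumption the figures come out at roughly 6.89 bytes per row for Parquet, 10.26 for Feather, 17.00 for Pickle and 49.56 for CSV (the CSV also contains the index column).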


# Conclusions
- For now I will go with __pickle__ because it is the fastest and because it is not limited to pandas.
- __Parquet__ showed no advantage over the other formats in this test, so it is the first to be dropped. Note, however, that the Parquet run used 5,000,000 rows rather than 2,000,000, so its timings and file size are not directly comparable.
- __Feather__ offers compact storage on disk. This is important, but we must check whether pyarrow comes preinstalled in our environment.
- __CSV__ is a good alternative for small files that need to be read by humans, and there is also a performant visualization tool for it in Jupyter.