# File Comparison Test:
- File Size
- Speed to write
- Speed to read

In [1]:
# importing relevant libraries
import pandas as pd
import numpy as np

# Metadata
- 10 Million Data Points
- Fake datapoints generated through code
- 6 Columns: Size, Age, Team, Win, Date, Probability

| Column Name | Data Type      |
|-------------|----------------|
| Size        | category       |
| Age         | int16          |
| Team        | category       |
| Win         | bool           |
| Date        | datetime64[ns] |
| Probability | float32        |


In [2]:
def get_dataset(size):
    # Create Fake Dataset
    df = pd.DataFrame()
    df['size'] = np.random.choice(['big','medium','small'], size)
    df['age'] = np.random.randint(1, 50, size)
    df['team'] = np.random.choice(['red','blue','yellow','green'], size)
    df['win'] = np.random.choice(['yes','no'], size)
    dates = pd.date_range('2020-01-01', '2022-12-31')
    df['date'] = np.random.choice(dates, size)
    df['prob'] = np.random.uniform(0, 1, size)
    return df

# Defining data types
def set_dtypes(df):
    df['size'] = df['size'].astype('category')
    df['team'] = df['team'].astype('category')
    df['age'] = df['age'].astype('int16')
    df['win'] = df['win'].map({'yes':True, 'no': False})
    df['prob'] = df['prob'].astype('float32')
    return df
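
The effect of the dtype choices in `set_dtypes` can be checked with `DataFrame.memory_usage`. The following is a minimal sketch, not part of the original benchmark: it assumes a smaller sample of 100,000 rows and only three of the columns, and the names `raw` and `typed` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # assumption: a small sample so the check runs quickly

# Default dtypes: strings, int64 and float64
raw = pd.DataFrame({
    "size": rng.choice(["big", "medium", "small"], n),
    "age": rng.integers(1, 50, n),
    "prob": rng.uniform(0, 1, n),
})

# Same conversions as set_dtypes applies to these columns
typed = raw.copy()
typed["size"] = typed["size"].astype("category")
typed["age"] = typed["age"].astype("int16")
typed["prob"] = typed["prob"].astype("float32")

# deep=True counts the actual string storage, not just the pointers
raw_bytes = raw.memory_usage(deep=True).sum()
typed_bytes = typed.memory_usage(deep=True).sum()
print(f"raw:   {raw_bytes:,} bytes")
print(f"typed: {typed_bytes:,} bytes")
```

The narrower dtypes reduce the in-memory footprint, which in turn affects what the binary formats below have to write.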

# Pandas reading and writing to CSV file
- File Size: 488,292 KB

## Advantages
- Human Readable: CSV files are plain text files and thus human-readable.
- Simplicity: CSV is a straightforward file format with a simple structure that can be understood and edited in a text editor.

## Disadvantages
- No Type Information: CSV files do not contain any type or formatting information for data. This can lead to issues like the loss of leading zeros in data fields, and all data is treated as plain text.
- Inefficient with Large Datasets: When dealing with large datasets, the lack of compression and indexing can make CSVs slow to process.
- Not Suitable for Complex Data: CSV files are not designed to contain complex, hierarchical data structures. They work best with flat data structures and struggle with nested or multi-level data.
- Lack of Standardization: Although CSVs are simple, they lack a definitive standard, which can lead to inconsistencies in handling them. For example, data containing commas can lead to issues if not correctly escaped.
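
The missing type information can be seen in a small round trip. This sketch uses an in-memory buffer instead of a file, and the `zip_code` column is a hypothetical example that is not part of the benchmark dataset.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "zip_code": ["00501", "02134"],           # strings with leading zeros
    "team": pd.Categorical(["red", "blue"]),   # category dtype
    "date": pd.to_datetime(["2020-01-01", "2021-06-15"]),
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read back with defaults: pandas has to guess every dtype
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.dtypes)
print(restored["zip_code"].tolist())  # leading zeros are gone

# Read back with the types supplied by hand
buffer.seek(0)
fixed = pd.read_csv(
    buffer,
    dtype={"zip_code": str, "team": "category"},
    parse_dates=["date"],
)
print(fixed.dtypes)
```

With a CSV file, the type information has to live in the reading code, which is one reason the typed binary formats below are easier to round-trip.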


In [7]:
print('Reading and writing CSV')
df = get_dataset(10_000_000)
df = set_dtypes(df)
%time df.to_csv('test.csv')
%time df_csv = pd.read_csv('test.csv')

Reading and writing CSV
CPU times: total: 33 s
Wall time: 33.3 s
CPU times: total: 6.03 s
Wall time: 6.09 s


# Pandas reading and writing to Pickle file
- File Size: 166,018 KB

## Advantages
- Python Integration: Pickle is built-in with Python, so it's simple to use with no extra dependencies.
- Flexibility: Pickle can serialize nearly any Python object.

## Disadvantages
- Security: Loading a pickled file can execute arbitrary code, so it's not safe to load data that came from an untrusted source.
- Interoperability: Pickle is Python-specific and not suitable for cross-language data interchange.
- Version compatibility: A pickle may fail to load under a different Python version or under a different version of the library that defined the pickled objects (for example, pandas).
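
The security point can be illustrated with a harmless example. In this sketch the class `Payload` is hypothetical; it uses `__reduce__` to tell pickle which callable to invoke when the data is loaded. Here the callable only evaluates `6 * 7`, but it could be any importable function.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle runs a function call on load."""

    def __reduce__(self):
        # pickle stores (callable, args); pickle.loads calls callable(*args).
        # eval("6 * 7") is harmless, but the same hook could run anything.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading does not return a Payload instance: it returns the result of the
# call that the pickle asked for.
result = pickle.loads(data)
print(result)  # 42
```

This is why `pd.read_pickle` should only be used on files from a trusted source.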

In [8]:
print('Reading and writing Pickle')
df = get_dataset(10_000_000)
df = set_dtypes(df)
%time df.to_pickle('test.pickle')
%time df_pickle = pd.read_pickle('test.pickle')

Reading and writing Pickle
CPU times: total: 109 ms
Wall time: 777 ms
CPU times: total: 78.1 ms
Wall time: 86.4 ms


# Pandas reading and writing to Parquet file
- File Size: 66,624 KB

## Advantages
- Compression: Parquet offers very efficient compression and encoding schemes.
- Columnar Format: As a column-oriented format, it's well-suited for data analysis and queries, especially for big data workloads.
- Big Data Ecosystem: Parquet is widely supported across the big data ecosystem, including tools like Apache Spark, Apache Arrow, etc.
- Schema Evolution: Parquet supports schema evolution which allows the structure of tables to be modified over time without breaking older readers.

## Disadvantages
- Write Speed: Parquet was slower to write than Pickle and Feather in this test, largely because of its encoding and compression step.
- Complexity: The format is complex, and therefore the reading and writing software is correspondingly complex.
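
One practical consequence of the columnar layout is that a reader can load a subset of columns without parsing the rest of the file. The sketch below assumes a Parquet engine (`pyarrow` or `fastparquet`) is installed; the function name `parquet_column_demo` and the small dataset are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def parquet_column_demo(n_rows=1_000):
    """Write a small Parquet file and read back only two of its columns."""
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "team": pd.Categorical(rng.choice(["red", "blue", "yellow", "green"], n_rows)),
        "age": rng.integers(1, 50, n_rows).astype("int16"),
        "prob": rng.uniform(0, 1, n_rows).astype("float32"),
    })
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.parquet")
        df.to_parquet(path)  # raises ImportError if no Parquet engine is installed
        # Only the requested columns are read from disk
        subset = pd.read_parquet(path, columns=["team", "prob"])
    return df, subset
```

Calling `parquet_column_demo()` returns the full frame and the two-column subset, so the row counts and column names can be compared directly.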

In [9]:
print('Reading and writing Parquet')
df = get_dataset(10_000_000)
df = set_dtypes(df)
%time df.to_parquet('test.parquet')
%time df_parquet = pd.read_parquet('test.parquet')

Reading and writing Parquet
CPU times: total: 1.22 s
Wall time: 1.04 s
CPU times: total: 1.55 s
Wall time: 437 ms


# Pandas reading and writing to Feather file
- File Size: 100,123 KB

## Advantages
- Speed: Feather format is designed to be fast for reading and writing data.
- Interoperability: Provides a high degree of compatibility between Python (Pandas) and R.
- Schema Storage: Feather stores data type information, preserving the data types.

## Disadvantages
- Size: Feather files can take up more disk space than Parquet files (100,123 KB versus 66,624 KB in this test) because the format prioritizes speed over compression.
- Support: While popular in Pandas and R, Feather isn't as widely supported in big data systems.
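
The schema storage point can be checked by comparing dtypes before and after a round trip. This is a sketch under the assumption that `pyarrow` is installed (pandas uses it for Feather); the function name `feather_roundtrip_dtypes` and the three-row frame are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def feather_roundtrip_dtypes():
    """Return the dtype names of a small frame before and after a Feather round trip."""
    df = pd.DataFrame({
        "team": pd.Categorical(["red", "blue", "red"]),
        "age": np.array([10, 20, 30], dtype="int16"),
        "win": [True, False, True],
        "prob": np.array([0.1, 0.5, 0.9], dtype="float32"),
    })
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.feather")
        df.to_feather(path)  # raises ImportError if pyarrow is missing
        restored = pd.read_feather(path)
    before = df.dtypes.astype(str).to_dict()
    after = restored.dtypes.astype(str).to_dict()
    return before, after
```

If the two dictionaries returned by `feather_roundtrip_dtypes()` match, no dtype arguments are needed on read, unlike with CSV.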

In [10]:
print('Reading and writing Feather')
df = get_dataset(10_000_000)
df = set_dtypes(df)
%time df.to_feather('test.feather')
%time df_feather = pd.read_feather('test.feather')

Reading and writing Feather
CPU times: total: 859 ms
Wall time: 392 ms
CPU times: total: 547 ms
Wall time: 226 ms


# Summary
| File Format | File Size (KB) | Write Time (s) | Read Time (s) | Maximum Size |
| ------- | ------- | ------- | ------- | ------- |
| CSV  | 488,292  | 33.3  | 6.09 | Limited by the file system and available memory; performance issues may arise with large files (multi-GB range) |
| Pickle  | 166,018  | 0.777  | 0.0864 | Limited by available memory; very large objects may cause issues |
| Parquet  | 66,624  | 1.04  | 0.437 | No explicit limit; practical limit depends on the file system and the tool being used to process the data. Python libraries that handle Parquet files (like pyarrow and fastparquet) can handle very large files efficiently |
| Feather  | 100,123  | 0.392  | 0.226 | No explicit limit; as with Parquet, the practical limit depends on the file system and the tool used to process the data. The pyarrow library, which pandas uses for Feather, can handle very large files efficiently |
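
The three numbers in each row can be collected with one small helper instead of separate `%time` cells. The following is a minimal sketch rather than the code used for the table: the `benchmark` helper, the 10,000-row sample, and the restriction to CSV and Pickle (which need no extra dependencies) are all assumptions. It also writes the CSV with `index=False`, unlike the cell above.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark(df, writer, reader, path):
    """Return (write seconds, read seconds, file size in KB) for one format."""
    start = time.perf_counter()
    writer(df, path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    reader(path)
    read_s = time.perf_counter() - start

    size_kb = os.path.getsize(path) / 1024
    return write_s, read_s, size_kb


rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "age": rng.integers(1, 50, 10_000).astype("int16"),
    "prob": rng.uniform(0, 1, 10_000).astype("float32"),
})

# Each entry maps a format name to a (writer, reader) pair
formats = {
    "CSV": (lambda d, p: d.to_csv(p, index=False), pd.read_csv),
    "Pickle": (lambda d, p: d.to_pickle(p), pd.read_pickle),
}

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, (writer, reader) in formats.items():
        path = os.path.join(tmp, f"test_{name.lower()}")
        results[name] = benchmark(sample, writer, reader, path)

for name, (write_s, read_s, size_kb) in results.items():
    print(f"{name}: write {write_s:.4f} s, read {read_s:.4f} s, {size_kb:,.0f} KB")
```

Parquet and Feather could be added to `formats` in the same way when `pyarrow` is available. Timings from a single run vary between machines, so repeated runs give a more reliable comparison.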

### File Format Recommendations:

1. **CSV**: CSV is a good choice for small datasets and for scenarios where readability and simplicity are paramount. It is a universal format, meaning you can use it to move data between different programs, including non-programming tools like Excel. However, it may not be the best choice for large, complex datasets due to its lack of compression and data type preservation.

2. **Pickle**: Pickle is a suitable choice when working strictly within Python, especially when you need to store complex data types or custom objects. Keep in mind that pickled files pose a security risk if they come from untrusted sources, and they may not load correctly under a different Python or library version.

3. **Parquet**: Parquet is an excellent choice for big data processing scenarios. It provides efficient storage through its columnar format and data compression capabilities. Parquet is widely supported in the big data ecosystem (like Hadoop and Spark), making it a good choice for interoperability within these ecosystems.

4. **Feather**: Feather can be an optimal choice when you need speed for reading and writing data and interoperability between Python and R. It also preserves data types. However, be aware that Feather files may occupy more disk space than other file formats and it might not be as widely supported in big data systems.

The best file format to use depends heavily on your specific use case, including the size of your dataset, the complexity of your data, the programming languages you are using, and the specific requirements of your project.

