# Convert CSV to parquet

This script takes many `.CSV.gz` files from the previous script, and converts them into a small number of `.parquet` files. Combining many small files into few large files is known as "compaction", which improves performance.

At the same time we parse strings (all CSV data is string, really) into the proper datatype. (One advantage of parquet is that it has distinct, unambiguous types for int, float, datetime etc.) To do this, we take the known schema that we downloaded in a previous script.

* we parse timestamp strings into datetime values
* we add Nulls to files missing columns
* we cast numeric columns to the type given in the schema (e.g. float). If one CSV has only integers in a column, and another has floats/decimal/double in the same column, some tools get confused by that and throw an error.
* we extract the `TOP_TIMESTAMP` and `SCHEMA_VERSION` from the hive-style folder partitioning, and add them as columns.
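
As a rough illustration of the last bullet, the partition values can be pulled out of a hive-style path with a regular expression. This is only a sketch, not the exact code used further down: the helper name `parse_hive_partitions` and the example path are made up for this illustration.

```python
import re

def parse_hive_partitions(path: str) -> dict:
    """Return {key: value} for every `KEY=value` folder in a hive-style path."""
    # only look at the directory components, not the file name itself
    folders = path.split("/")[:-1]
    partitions = {}
    for folder in folders:
        match = re.fullmatch(r"([A-Za-z_]+)=(.+)", folder)
        if match:
            partitions[match.group(1)] = match.group(2)
    return partitions

# hypothetical path, following the folder layout written by the previous script
example = "SOMETABLE/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_09_07_04_47_00/file.CSV"
parts = parse_hive_partitions(example)
schema_version = int(parts["SCHEMA_VERSION"])  # stored as an integer column
top_timestamp = parts["TOP_TIMESTAMP"]         # kept as a string
```

The values are the same for every row of a given file, so they can be added as constant columns.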

Note that we can't use pandas for this, since pandas cannot handle empty values for most datatypes. We're trying [polars](https://docs.pola.rs/), which is touted as the successor to pandas. It is supposed to be very fast and powerful. But I kept encountering issues with it, so this script ended up slow and complex, sorry.

Note that some files are too large to fit into memory. So we need to use streaming/chunking approaches. Polars is a new library which is supposed to be great for this. That's what this script mostly uses. However I found some bugs with it. So this ended up being more messy than expected. (e.g. at the time of writing, Polars cannot process 17,000 files at once, because of a stack overflow error, that appears as a seg fault error. I reported this, and it should be fixed in future versions. But even when I compiled the latest un-released version, I had other issues of the script crashing.)

Note that polars can't stream from `.csv.gz`. So we extract them to `.csv`, and then stream to `.parquet`. But we cannot extract all files at once: that's 1.4TB at this point, and I don't have enough space for that. So we extract in batches. This is quite slow, so make sure your disk is fast. (e.g. my external hard drive is 10 or 100x slower than my internal SSD.) The script tries to intelligently check how much disk space you have, and chooses batch sizes based on that.
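
The batch sizing idea can be sketched roughly like this. It is a simplified, standalone version of the calculation, not the exact code below: the function name `choose_batch_size` is made up, the compression ratio of 30 is the empirical figure used later in this notebook, and halving the free space is just a safety margin.

```python
import math
import shutil

def choose_batch_size(num_files, compressed_bytes, free_bytes,
                      compression_ratio=30, max_batch_size=10_000):
    """Pick how many .csv.gz files to extract to disk at once."""
    # keep half of the free space in reserve for the parquet output and other files
    # (and never go below 1 byte, to avoid dividing by zero)
    usable_bytes = max(free_bytes / 2, 1)
    # estimate how large the CSVs will be once extracted
    estimated_csv_bytes = compressed_bytes * compression_ratio
    # number of batches needed so that each batch of extracted CSVs fits on disk
    num_batches = max(1, math.ceil(estimated_csv_bytes / usable_bytes))
    batch_size = math.ceil(num_files / num_batches)
    # at least one file per batch, and never more than the cap
    return max(1, min(batch_size, max_batch_size))

# example: 1000 files, 1 GB compressed in total, on whichever disk holds "."
free = shutil.disk_usage(".").free
batch_size = choose_batch_size(num_files=1000, compressed_bytes=10**9, free_bytes=free)
```

A smaller disk gives more, smaller batches, which costs time but avoids running out of space halfway through.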

For datetimes, we treat them as string when reading from CSV, then cast to datetime in memory, then write to parquet. This was to ensure timezone is handled correctly. (The raw data is in `Australia/Brisbane`, which is non-DST, UTC+10.)
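
To show what "handled correctly" means here, this is a small sketch using only the Python standard library (the script itself does this with polars). The function name `parse_aemo_datetime` and the sample timestamp are made up for the illustration; the format string is the one used later in this notebook. Because Brisbane has no daylight saving, a fixed UTC+10 offset is assumed to be enough.

```python
from datetime import datetime, timedelta, timezone

# Brisbane has no daylight saving, so a fixed UTC+10 offset is sufficient here
BRISBANE = timezone(timedelta(hours=10), name="Australia/Brisbane")

def parse_aemo_datetime(text: str) -> datetime:
    """Parse an AEMO-style timestamp string and attach the Brisbane offset."""
    naive = datetime.strptime(text, "%Y/%m/%d %H:%M:%S")
    # replace() attaches the zone without shifting the wall-clock time
    return naive.replace(tzinfo=BRISBANE)

# hypothetical sample value
local = parse_aemo_datetime("2012/08/01 00:05:00")
as_utc = local.astimezone(timezone.utc)
```

The key point is that the timezone is attached to the parsed value, not converted: midnight in the raw data stays midnight Brisbane time.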

We do not deduplicate yet.

For other motivations for choosing `.parquet`, see the `README.md`.

This script takes about 24h to run.

This does the same thing as the similarly named notebook, `01d-csv-to-parquet-pyarrow.ipynb`, but the output of that one was sometimes corrupt. (That one decompresses on the fly, without unzipping and writing `.CSV` to disk, so it should be better.)

In [1]:
import os
import gzip
import shutil
from multiprocessing import Pool
import re
import json
from time import time
import importlib
from random import shuffle
import math

from tqdm import tqdm
import polars as pl

# utils is our local utility module
# if we change utils.py, and re-run a normal 'import'
# python won't reload it by default. (Since it's already loaded.)
# So we force a reload
import utils
importlib.reload(utils)

<module 'utils' from '/home/matthew/Documents/TSE/AppliedEconometrics/repo/utils.py'>

In [2]:
# `./data` didn't work for some reason, so I hard coded it
repo_data_dir = '/home/matthew/Documents/TSE/AppliedEconometrics/repo/data/'

# you can point this at a disk that is large, but fast.
base_data_dir = '/home/matthew/data/'

# the output of the previous script
all_csv_gz_dir = os.path.join(base_data_dir, '01-C-split-mapped-csv')

# once each folder in all_csv_gz_dir is processed, it will be moved here.
# so if you stop and re-start the script, you don't waste time re-processing.
all_csv_gz_archive = os.path.join(base_data_dir, '01-D-split-mapped-csv-done')

# we extract batches of `.csv.gz` here
all_csv_dir = os.path.join(base_data_dir, '01-D-gunzipped-csv')

# we save batches of files here. We'll end up with a few files per AEMO 'table'
all_parquet_batch_dir = os.path.join(base_data_dir, '01-D-parquet-batches')

schema_path = os.path.join(repo_data_dir, 'schemas.json')

num_processes = os.cpu_count() - 2 # -2 because we assume hyperthreading (2 logical cores per physical core) and want one physical core spare to keep doing other stuff

## Preparation

In [3]:
logger = utils.Logger(os.path.join(repo_data_dir, 'logs.txt'), flush=True)
logger.info("Initialising Logger")

In [4]:
utils.renice()

In [5]:
version_col_name = 'SCHEMA_VERSION'
top_timestamp_col_name = 'TOP_TIMESTAMP'

## Schema preparation

In [6]:
with open(schema_path, 'r') as f:
    schemas = json.load(f)

In [7]:
# AEMO's schemas have Oracle SQL types
# map those to types polars can use
# if date_as_str, return string instead of datetime
# (because polars can't read datetimes when parsing from CSV)
def aemo_type_to_polars_type(t: str, tz=None, date_as_str=False):
    t = t.upper()
    if re.match(r"VARCHAR(2)?\(\d+\)", t):
        return pl.String()
    if re.match(r"CHAR\((\d+)\)", t):
        # single character
        # arrow has no dedicated type for that
        # so use string
        # (could use categorical?)
        return pl.String()
    elif t.startswith("NUMBER"):
        match = re.match(r"NUMBER ?\((\d+), ?(\d+)\)", t)
        if match:
            whole_digits = int(match.group(1))
            decimal_digits = int(match.group(2))
        else:
            # e.g. NUMBER(2)
            match = re.match(r"NUMBER ?\((\d+)", t)
            assert match, f"Unsure how to cast {t} to arrow type"
            whole_digits = int(match.group(1))
            decimal_digits = 0
            
        if decimal_digits == 0:
            # integer
            # we assume signed (can't tell unsigned from the schema)
            # but how many bits?
            max_val = 10**whole_digits

            if 2**(8-1) > max_val:
                return pl.Int8()
            elif 2**(16-1) > max_val:
                return pl.Int16()
            elif 2**(32-1) > max_val:
                return pl.Int32()
            else:
                return pl.Int64()
        else:
            # we could use pa.decimal128(whole_digits, decimal_digits)
            # but we don't need that much accuracy
            return pl.Float64()
    elif (t == 'DATE') or re.match(r"TIMESTAMP\((\d)\)", t):
        # watch out, when AEMO say "date" they mean "datetime"
        # for both dates and datetimes they say "date",
        # but both have a time component. (For actual dates, it's always midnight.)
        # and some dates go out as far as 9999-12-31 23:59:59.999
        # (and some dates are 9999-12-31 23:59:59.997)
        if date_as_str:
            return pl.String()
        else:
            return pl.Datetime(time_unit='ms', time_zone=tz)
    else:
        raise ValueError(f"Unsure how to convert AEMO type {t} to polars type")


In [8]:
# for the largest tables, far larger than the smallest 200 combined,
# let's drop the columns we know we won't use
# to make it faster
def get_cols_to_drop(table):
    if table == 'DISPATCHLOAD':
        to_drop = []
        cols = list(schemas[table]['columns'].keys())
        for c in cols:
            if any(c.upper().startswith(prefix) for prefix in ['RAISE', 'LOWER', 'VIOLATION', 'MARGINAL']) \
               and c not in schemas[table]['primary_keys']:
                to_drop.append(c)
        logger.info(f"Will drop {to_drop} from {table}")
        return to_drop
    else:
        # only drop columns for DISPATCHLOAD
        # the rest are too small to bother saving any space
        # and of the other large ones, we've already dropped the large tables we don't need
        # and the ones we kept, we need all columns
        return []

## Processing

In [31]:
def gunzip_file(source_path, dest_path):
    assert source_path.lower().endswith('.csv.gz')
    
    # create a destination folder
    utils.create_dir(file=dest_path)
    
    try:
        with gzip.open(source_path, 'r') as f_in:
            with open(dest_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    except Exception:
        # don't leave a corrupt file half-written
        if os.path.exists(dest_path):
            os.remove(dest_path)
        raise

def csv_gz_to_parquet(table):
    # source and dest schema are the same, except for datetimes
    source_schema = {c: aemo_type_to_polars_type(t['AEMO_type'], date_as_str=True, tz=None) for (c,t) in schemas[table]['columns'].items()}
    dest_schema = {c: aemo_type_to_polars_type(t['AEMO_type'], date_as_str=False, tz='Australia/Brisbane') for (c,t) in schemas[table]['columns'].items()}

    table_csv_gz_dir = os.path.join(all_csv_gz_dir, table)
    table_csv_gz_archive = os.path.join(all_csv_gz_archive, table)
    table_csv_dir = os.path.join(all_csv_dir, table)
    table_parquet_batched_dir = os.path.join(all_parquet_batch_dir, table)

    cols_to_drop = get_cols_to_drop(table)
    
    logger.info(f"Converting {table} from .csv.gz in {table_csv_gz_dir} to .csv in {table_csv_dir} then to .parquet in {table_parquet_batched_dir}")

    csv_gz_paths_unbatched = []
    csvgz_size = 0
    # `dirpath` rather than `dir`, to avoid shadowing the builtin
    for (dirpath, subdirs, files) in os.walk(table_csv_gz_dir):
        for file in files:
            path = os.path.join(dirpath, file)
            csv_gz_paths_unbatched.append(path)
            csvgz_size = csvgz_size + os.path.getsize(path)
    assert len(csv_gz_paths_unbatched) > 0, f"No files found in {table_csv_gz_dir}"
    # make batch sizes (by bytes) more consistent
    shuffle(csv_gz_paths_unbatched)

    # calculate batch size
    # min of:
    #  max_batch_size
    #  whatever results in uncompressed CSVs taking up too much space
    
    # empirically, this is the effectiveness of our gzip algorithm
    compression_ratio = 30
    
    # empirically, polars can't handle more than this many files at once
    # (approximate)
    # https://github.com/pola-rs/polars/issues
    # 10000 is the limit normally
    if table in ['DISPATCHLOAD', 'BIDDAYOFFER']:
        # these tables are huge. >10GB compressed
        # I'm not sure why, but polars was freezing a lot. Lowering the batch size helped,
        # at the cost of more disk usage, and slower writes and reads
        # so don't do this for most tables.
        # The resulting files were still a reasonable size.
        # (We don't want parquet <10MB)
        max_batch_size = 20
    else:
        max_batch_size = 10000

    # choose batch size based on how much free disk space we have
    free_bytes = shutil.disk_usage(all_csv_dir).free
    free_bytes = free_bytes / 2 # leave room for CSV + parquet + extra
    
    # figure out total uncompressed size
    uncompressed_size = csvgz_size * compression_ratio

    # divide to get number of batches
    num_batches = uncompressed_size / free_bytes

    batch_size = math.ceil(len(csv_gz_paths_unbatched) / num_batches)
    batch_size = min(batch_size, max_batch_size)
    num_batches = math.ceil(len(csv_gz_paths_unbatched) / batch_size)

    logger.info(f"For table {table}, {len(csv_gz_paths_unbatched)} .csv.gz files totaling {csvgz_size} bytes, estimating {uncompressed_size=}, {num_batches=}, so choosing {batch_size=}")

    # remove files from previous run
    shutil.rmtree(table_csv_dir, ignore_errors=True)
    shutil.rmtree(table_parquet_batched_dir, ignore_errors=True)
    utils.create_dir(table_parquet_batched_dir)
    
    for (batch_num, csv_gz_paths) in enumerate(utils.batched(csv_gz_paths_unbatched, batch_size), start=1):
        logger.info(f"Unzipping {len(csv_gz_paths)} CSV.gz in batch {batch_num}/{num_batches} for {table}\n")

        assert all(p.lower().endswith('.csv.gz') for p in csv_gz_paths), f"Found a file in {table_csv_gz_dir} which does not end in .csv.gz"
        csv_paths = [os.path.join(table_csv_dir, os.path.relpath(path=csv_gz_path, start=table_csv_gz_dir)).replace('.gz', '') for csv_gz_path in csv_gz_paths]

        with Pool() as p:
            p.starmap(gunzip_file, zip(csv_gz_paths, csv_paths))

        logger.info(f"Scanning {len(csv_gz_paths)} CSVs in batch {batch_num}/{num_batches} for {table}\n")
        datasets = []
        for (csv_gz_path, csv_path) in zip(csv_gz_paths, csv_paths):
            
            # raw f-string, so that \d is a regex escape rather than an (invalid) string escape
            match = re.search(rf"/{version_col_name}=(\d+)/", csv_path)
            assert match, f"Unable to extract schema version from {csv_path}"
            schema_version = int(match.group(1))
    
            match = re.search(rf"/{top_timestamp_col_name}=([\d_]+)/", csv_path)
            assert match, f"Unable to extract top_timestamp from {csv_path}"
            top_timestamp = match.group(1)

            # (csv_path was already extracted by the Pool above, so no need to gunzip again here)
    
            ds = pl.scan_csv(csv_path, dtypes=source_schema, try_parse_dates=False, low_memory=True)
            ds = ds.with_columns(pl.lit(schema_version, dtype=pl.UInt8()).alias(version_col_name))
            ds = ds.with_columns(pl.lit(top_timestamp, dtype=pl.String()).alias(top_timestamp_col_name))

            assert ds.fetch().shape[0] > 0, f"File {csv_path} = {csv_gz_path} has no data rows"
            
            datasets.append(ds)
        logger.info(f"Scanned {len(csv_gz_paths)} CSVs in batch {batch_num}/{num_batches} for {table}\n")
    
        assert len(datasets) > 0, f"No CSV files found for {table}?"
    
        ds = pl.concat(datasets, how='diagonal')

        # drop columns we don't need (in the biggest tables)
        for col in cols_to_drop:
            if col in ds.columns:
                ds = ds.select(pl.exclude(col))
    
        # parse datetimes into actual datetime
        for (c,s) in dest_schema.items():
            # dest_schema maps column name -> polars dtype (not the raw schema dict)
            if isinstance(s, pl.Datetime):
                ds = ds.with_columns(pl.col(c).str.strptime(pl.Datetime, "%Y/%m/%d %H:%M:%S", strict=False).dt.replace_time_zone(time_zone="Australia/Brisbane"))


        assert ds.fetch().shape[0] > 0, f"Combined df for {table} has no data rows"
        
        
        pq_path = os.path.join(table_parquet_batched_dir, f"{batch_num:04}.parquet")
    
        logger.info(f"Beginning processing of {len(datasets)} CSVs in batch {batch_num} for {table}\n")
        try:
            ds.sink_parquet(pq_path)
        except pl.ComputeError as ex:
            # tidy up. Don't leave a corrupt file there
            if os.path.exists(pq_path):
                os.remove(pq_path)
                
            # sometimes AEMO data has a float when we expect an int
            logger.exception(ex)
            match = re.search(r"could not parse `(\d+)\.(\d+)` as dtype `[iu]\d+` at column '([_\w]+)'", str(ex))
            if match:
                print(ex)
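                # use the number of digits, not the digit values themselves.
                # (otherwise a value like `12.0` would give NUMBER(12,0),
                # which is still an integer type, and we'd retry forever)
                whole_digits = len(match.group(1))
                decimal_digits = len(match.group(2))
                col = match.group(3)
                logger.warn(f"Changing column {col} in {table} from {schemas[table]['columns'][col]['AEMO_type']} to float")
                schemas[table]['columns'][col]['AEMO_type'] = f"NUMBER({whole_digits + decimal_digits},{decimal_digits})"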
                return csv_gz_to_parquet(table)
            else:
                raise
        else:
            # validate that what we just wrote was not corrupt
            # nor empty
            assert pl.scan_parquet(pq_path).fetch().shape[0] > 0, f"No rows found in {pq_path} from {csv_paths}"
            shutil.rmtree(table_csv_dir)

    
    # move the source .csv.gz files to another directory
    # so if we fail on the nth table and retry
    # we don't reprocess this table
    utils.create_dir(table_csv_gz_archive)
    os.rename(table_csv_gz_dir, table_csv_gz_archive)
    
    logger.info(f"Finished converting table {table} from .csv.gz to .parquet")


In [32]:
table = 'IRFMAMOUNT'
csv_gz_to_parquet(table)

AssertionError: No rows found in /home/matthew/data/01-D-parquet-batches/IRFMAMOUNT/0001.parquet from ['/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_09_07_04_47_00/PUBLIC_DVD_IRFMAMOUNT_201208010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_07_07_01_50_05/PUBLIC_DVD_IRFMAMOUNT_201406010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_01_00_35_04/PUBLIC_DVD_IRFMAMOUNT_201609010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_05_07_01_09_54/PUBLIC_DVD_IRFMAMOUNT_201204010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_01_21_00_30_06/PUBLIC_DVD_IRFMAMOUNT_202112010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_06_07_01_03_21/PUBLIC_DVD_IRFMAMOUNT_201005010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_10_07_00_30_06/PUBLIC_DVD_IRFMAMOUNT_201409010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_10_07_00_30_05/PUBLIC_DVD_IRFMAMOUNT_202109010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_01_07_01_10_01/PUBLIC_DVD_IRFMAMOUNT_201312010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_02_07_00_45_06/PUBLIC_DVD_IRFMAMOUNT_201601010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_11_07_00_01_56/PUBLIC_DVD_IRFMAMOUNT_201110010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2009_08_07_01_16_07/PUBLIC_DVD_IRFMAMOUNT_200907010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_04_07_00_35_09/PUBLIC_DVD_IRFMAMOUNT_201503010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_07_09_00_20_03/PUBLIC_DVD_IRFMAMOUNT_202006010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_01_07_01_01_58/PUBLIC_DVD_IRFMAMOUNT_201112010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_03_00_30_12/PUBLIC_DVD_IRFMAMOUNT_201610010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_04_07_01_01_50/PUBLIC_DVD_IRFMAMOUNT_201103010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_02_06_16_05_13/PUBLIC_DVD_IRFMAMOUNT_202101010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_09_07_00_58_21/PUBLIC_DVD_IRFMAMOUNT_201008010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_01_07_00_58_35/PUBLIC_DVD_IRFMAMOUNT_201012010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_05_07_00_57_54/PUBLIC_DVD_IRFMAMOUNT_201304010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_03_07_00_35_08/PUBLIC_DVD_IRFMAMOUNT_201602010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_08_07_12_20_00/PUBLIC_DVD_IRFMAMOUNT_201907010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_07_07_01_03_02/PUBLIC_DVD_IRFMAMOUNT_201106010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_11_08_00_05_04/PUBLIC_DVD_IRFMAMOUNT_201810010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_10_07_00_35_06/PUBLIC_DVD_IRFMAMOUNT_202209010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_12_07_02_00_10/PUBLIC_DVD_IRFMAMOUNT_201511010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_06_07_01_06_50/PUBLIC_DVD_IRFMAMOUNT_201105010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_02_07_01_08_40/PUBLIC_DVD_IRFMAMOUNT_201101010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_07_08_00_25_08/PUBLIC_DVD_IRFMAMOUNT_201706010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_09_08_00_40_12/PUBLIC_DVD_IRFMAMOUNT_202008010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_09_07_03_20_03/PUBLIC_DVD_IRFMAMOUNT_201708010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_07_07_01_17_02/PUBLIC_DVD_IRFMAMOUNT_201206010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_01_07_08_35_08/PUBLIC_DVD_IRFMAMOUNT_202212010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_05_08_00_35_05/PUBLIC_DVD_IRFMAMOUNT_201904010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_09_07_00_40_01/PUBLIC_DVD_IRFMAMOUNT_201808010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_06_07_01_01_11/PUBLIC_DVD_IRFMAMOUNT_201305010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_02_07_00_45_06/PUBLIC_DVD_IRFMAMOUNT_202001010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_07_12_13_35_06/PUBLIC_DVD_IRFMAMOUNT_202106010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_02_07_01_02_45/PUBLIC_DVD_IRFMAMOUNT_201301010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_12_07_01_01_46/PUBLIC_DVD_IRFMAMOUNT_201011010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_05_07_09_00_01/PUBLIC_DVD_IRFMAMOUNT_202104010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_09_07_01_01_57/PUBLIC_DVD_IRFMAMOUNT_201308010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_08_07_00_06_55/PUBLIC_DVD_IRFMAMOUNT_201207010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_08_07_00_40_00/PUBLIC_DVD_IRFMAMOUNT_201807010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_05_09_16_35_11/PUBLIC_DVD_IRFMAMOUNT_202304010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_08_08_02_20_08/PUBLIC_DVD_IRFMAMOUNT_202307010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_02_08_00_05_03/PUBLIC_DVD_IRFMAMOUNT_201801010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_11_07_01_03_05/PUBLIC_DVD_IRFMAMOUNT_201310010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_10_07_00_35_03/PUBLIC_DVD_IRFMAMOUNT_201509010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_03_02_00_45_05/PUBLIC_DVD_IRFMAMOUNT_201901010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_06_07_04_00_00/PUBLIC_DVD_IRFMAMOUNT_201705010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_07_13_00_35_02/PUBLIC_DVD_IRFMAMOUNT_202206010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_10_07_19_55_06/PUBLIC_DVD_IRFMAMOUNT_201709010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_07_07_00_56_02/PUBLIC_DVD_IRFMAMOUNT_201306010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_03_19_10_05_02/PUBLIC_DVD_IRFMAMOUNT_201402010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_02_01_16_50_04/PUBLIC_DVD_IRFMAMOUNT_202012010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_12_07_00_40_08/PUBLIC_DVD_IRFMAMOUNT_202211010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_07_07_18_20_05/PUBLIC_DVD_IRFMAMOUNT_201606010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_01_07_01_00_05/PUBLIC_DVD_IRFMAMOUNT_201912010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_05_07_00_58_22/PUBLIC_DVD_IRFMAMOUNT_201004010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_03_02_02_55_12/PUBLIC_DVD_IRFMAMOUNT_201812010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_02_07_00_27_44/PUBLIC_DVD_IRFMAMOUNT_201001010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_12_07_00_00_05/PUBLIC_DVD_IRFMAMOUNT_202111010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_04_07_00_30_00/PUBLIC_DVD_IRFMAMOUNT_202103010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_08_07_00_05_06/PUBLIC_DVD_IRFMAMOUNT_201607010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_06_06_16_50_05/PUBLIC_DVD_IRFMAMOUNT_202105010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_06_09_11_30_03/PUBLIC_DVD_IRFMAMOUNT_202305010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_04_07_00_35_05/PUBLIC_DVD_IRFMAMOUNT_201603010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_11_11_10_25_03/PUBLIC_DVD_IRFMAMOUNT_201710010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_08_07_00_40_05/PUBLIC_DVD_IRFMAMOUNT_201407010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_09_14_00_40_01/PUBLIC_DVD_IRFMAMOUNT_201608010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_11_07_05_15_05/PUBLIC_DVD_IRFMAMOUNT_202310010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_03_07_00_51_16/PUBLIC_DVD_IRFMAMOUNT_201002010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_04_07_01_04_58/PUBLIC_DVD_IRFMAMOUNT_201203010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2009_10_07_00_17_34/PUBLIC_DVD_IRFMAMOUNT_200909010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_04_07_00_55_06/PUBLIC_DVD_IRFMAMOUNT_202003010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_06_07_01_07_00/PUBLIC_DVD_IRFMAMOUNT_201205010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_08_09_00_00_11/PUBLIC_DVD_IRFMAMOUNT_202207010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_09_07_00_45_10/PUBLIC_DVD_IRFMAMOUNT_202208010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_03_07_00_58_46/PUBLIC_DVD_IRFMAMOUNT_201102010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_04_30_00_35_05/PUBLIC_DVD_IRFMAMOUNT_201903010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_12_12_00_05_00/PUBLIC_DVD_IRFMAMOUNT_201711010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_06_07_00_35_04/PUBLIC_DVD_IRFMAMOUNT_201405010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_03_06_15_10/PUBLIC_DVD_IRFMAMOUNT_201612010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_10_07_01_15_14/PUBLIC_DVD_IRFMAMOUNT_201309010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_12_07_00_35_00/PUBLIC_DVD_IRFMAMOUNT_201411010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_09_07_00_55_46/PUBLIC_DVD_IRFMAMOUNT_201108010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_10_07_01_42_05/PUBLIC_DVD_IRFMAMOUNT_201009010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_01_09_00_35_12/PUBLIC_DVD_IRFMAMOUNT_201712010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_05_07_00_35_01/PUBLIC_DVD_IRFMAMOUNT_201504010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_03_07_00_30_16/PUBLIC_DVD_IRFMAMOUNT_202302010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2009_12_09_14_45_25/PUBLIC_DVD_IRFMAMOUNT_200911010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_11_07_01_01_55/PUBLIC_DVD_IRFMAMOUNT_201210010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_10_07_01_01_53/PUBLIC_DVD_IRFMAMOUNT_201109010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_03_07_00_30_07/PUBLIC_DVD_IRFMAMOUNT_201502010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_03_07_00_09_53/PUBLIC_DVD_IRFMAMOUNT_201202010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_05_08_00_30_02/PUBLIC_DVD_IRFMAMOUNT_201804010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_03_07_00_57_21/PUBLIC_DVD_IRFMAMOUNT_201302010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_01_07_02_10_06/PUBLIC_DVD_IRFMAMOUNT_201412010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2009_11_07_00_59_00/PUBLIC_DVD_IRFMAMOUNT_200910010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_09_07_01_50_01/PUBLIC_DVD_IRFMAMOUNT_201508010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_12_07_00_35_12/PUBLIC_DVD_IRFMAMOUNT_202311010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_06_16_00_45_07/PUBLIC_DVD_IRFMAMOUNT_202005010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_09_07_00_00_06/PUBLIC_DVD_IRFMAMOUNT_202308010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_02_07_00_40_02/PUBLIC_DVD_IRFMAMOUNT_201401010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_05_07_00_40_08/PUBLIC_DVD_IRFMAMOUNT_202004010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_05_07_01_01_49/PUBLIC_DVD_IRFMAMOUNT_201104010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_03_06_15_45_04/PUBLIC_DVD_IRFMAMOUNT_202102010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_08_07_01_01_05/PUBLIC_DVD_IRFMAMOUNT_201307010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_04_12_00_40_04/PUBLIC_DVD_IRFMAMOUNT_201803010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_11_07_00_35_01/PUBLIC_DVD_IRFMAMOUNT_201910010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_06_07_00_35_01/PUBLIC_DVD_IRFMAMOUNT_201505010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_04_07_00_34_57/PUBLIC_DVD_IRFMAMOUNT_201003010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_09_07_00_50_09/PUBLIC_DVD_IRFMAMOUNT_202108010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_06_07_11_25_12/PUBLIC_DVD_IRFMAMOUNT_201905010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_10_07_00_56_55/PUBLIC_DVD_IRFMAMOUNT_201209010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_02_08_00_30_05/PUBLIC_DVD_IRFMAMOUNT_202201010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_11_07_00_40_06/PUBLIC_DVD_IRFMAMOUNT_201410010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_02_07_00_30_14/PUBLIC_DVD_IRFMAMOUNT_202301010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_04_12_00_30_01/PUBLIC_DVD_IRFMAMOUNT_201703010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_07_07_00_35_00/PUBLIC_DVD_IRFMAMOUNT_201506010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2009_09_07_00_13_33/PUBLIC_DVD_IRFMAMOUNT_200908010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_09_07_00_35_03/PUBLIC_DVD_IRFMAMOUNT_201408010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_04_07_01_04_59/PUBLIC_DVD_IRFMAMOUNT_201303010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_06_07_00_30_02/PUBLIC_DVD_IRFMAMOUNT_202205010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_07_07_00_00_42/PUBLIC_DVD_IRFMAMOUNT_201006010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_05_07_00_30_05/PUBLIC_DVD_IRFMAMOUNT_202204010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_05_09_00_00_08/PUBLIC_DVD_IRFMAMOUNT_201704010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_01_07_00_40_05/PUBLIC_DVD_IRFMAMOUNT_201512010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_03_07_03_45_04/PUBLIC_DVD_IRFMAMOUNT_202202010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_11_08_00_25_07/PUBLIC_DVD_IRFMAMOUNT_202210010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_08_07_00_55_17/PUBLIC_DVD_IRFMAMOUNT_201007010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_10_05_00_30_04/PUBLIC_DVD_IRFMAMOUNT_201909010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_03_09_00_30_07/PUBLIC_DVD_IRFMAMOUNT_201902010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_04_07_00_40_09/PUBLIC_DVD_IRFMAMOUNT_202303010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_08_07_00_05_05/PUBLIC_DVD_IRFMAMOUNT_202007010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_06_07_00_35_01/PUBLIC_DVD_IRFMAMOUNT_201805010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_12_07_00_40_01/PUBLIC_DVD_IRFMAMOUNT_201911010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_12_07_00_56_55/PUBLIC_DVD_IRFMAMOUNT_201211010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_12_07_02_25_03/PUBLIC_DVD_IRFMAMOUNT_201311010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2013_01_07_00_56_54/PUBLIC_DVD_IRFMAMOUNT_201212010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_03_09_08_55_08/PUBLIC_DVD_IRFMAMOUNT_202002010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_02_00_30_08/PUBLIC_DVD_IRFMAMOUNT_201701010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_10_11_12_55_08/PUBLIC_DVD_IRFMAMOUNT_202309010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2023_07_07_00_25_12/PUBLIC_DVD_IRFMAMOUNT_202306010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_10_07_00_35_00/PUBLIC_DVD_IRFMAMOUNT_202009010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_02_07_00_05_01/PUBLIC_DVD_IRFMAMOUNT_201501010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_11_07_00_35_00/PUBLIC_DVD_IRFMAMOUNT_202010010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_08_08_00_30_02/PUBLIC_DVD_IRFMAMOUNT_201707010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_03_08_09_05_03/PUBLIC_DVD_IRFMAMOUNT_201802010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2012_02_07_00_01_56/PUBLIC_DVD_IRFMAMOUNT_201201010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_07_08_09_50_04/PUBLIC_DVD_IRFMAMOUNT_201906010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_10_09_14_15_06/PUBLIC_DVD_IRFMAMOUNT_201809010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_08_07_02_20_08/PUBLIC_DVD_IRFMAMOUNT_201507010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_08_07_00_20_04/PUBLIC_DVD_IRFMAMOUNT_202107010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_06_07_00_40_07/PUBLIC_DVD_IRFMAMOUNT_201605010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_08_07_01_01_53/PUBLIC_DVD_IRFMAMOUNT_201107010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2020_12_08_00_45_06/PUBLIC_DVD_IRFMAMOUNT_202011010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_11_07_00_56_43/PUBLIC_DVD_IRFMAMOUNT_201010010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_05_19_16_30_00/PUBLIC_DVD_IRFMAMOUNT_201403010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2021_11_08_00_50_05/PUBLIC_DVD_IRFMAMOUNT_202110010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2010_01_07_00_55_28/PUBLIC_DVD_IRFMAMOUNT_200912010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_07_08_00_30_01/PUBLIC_DVD_IRFMAMOUNT_201806010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2019_09_07_00_35_04/PUBLIC_DVD_IRFMAMOUNT_201908010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2014_05_07_00_30_10/PUBLIC_DVD_IRFMAMOUNT_201404010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2018_12_07_00_40_03/PUBLIC_DVD_IRFMAMOUNT_201811010000.CSV', 
'/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2016_05_07_00_45_06/PUBLIC_DVD_IRFMAMOUNT_201604010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2015_11_07_00_05_05/PUBLIC_DVD_IRFMAMOUNT_201510010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2011_12_07_01_05_38/PUBLIC_DVD_IRFMAMOUNT_201111010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2022_04_07_10_25_08/PUBLIC_DVD_IRFMAMOUNT_202203010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_10_18_30_12/PUBLIC_DVD_IRFMAMOUNT_201702010000.CSV', '/home/matthew/data/01-D-gunzipped-csv/IRFMAMOUNT/SCHEMA_VERSION=1/TOP_TIMESTAMP=2017_03_03_03_05_12/PUBLIC_DVD_IRFMAMOUNT_201611010000.CSV']

In [82]:
csv_path = '/home/matthew/data/debug/test-in.CSV'
pq_path = '/home/matthew/data/debug/test-out.parquet'

# Minimal reproduction: write a small CSV, then stream it to parquet.
n_rows = 1025
df = pl.DataFrame({"x": n_rows * [1], "y": n_rows * [2]})
df.write_csv(csv_path)

lf = pl.scan_csv(csv_path, low_memory=low_memory)
lf.sink_parquet(pq_path)
assert lf.collect().shape[0] > 0, "source CSV is empty"
# Read the parquet file back and check that no rows were lost.
df = pl.read_parquet(pq_path)
assert df.shape[0] > 0, "Dest file is empty"
assert df.shape[0] == n_rows, "Dest file has wrong row count"


AssertionError: Dest file is empty

In [83]:
n_rows

1025
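
The reproduction above shows that a streaming write can silently produce an empty file. One possible guard is to check the row count after the streaming write and fall back to a non-streaming write if it is wrong. This is only a sketch, not what the script does: `write_with_fallback` and its parameters are hypothetical names. The writers and the row counter are passed in as plain callables, so the pattern does not depend on any particular polars call.

```python
from typing import Callable


def write_with_fallback(
    streaming_write: Callable[[], None],
    eager_write: Callable[[], None],
    count_rows: Callable[[], int],
    expected_rows: int,
) -> str:
    """Try the streaming write; if the row count is wrong, redo it eagerly.

    Returns "streaming" or "eager" to say which path produced the output.
    (Hypothetical helper, for illustration only.)
    """
    try:
        streaming_write()
        if count_rows() == expected_rows:
            return "streaming"
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        print(f"streaming write failed: {exc!r}")

    # Fallback: the slower, in-memory write. Only viable if the data fits in RAM.
    eager_write()
    actual = count_rows()
    if actual != expected_rows:
        raise RuntimeError(f"expected {expected_rows} rows, found {actual}")
    return "eager"


# Stand-in demo with no real files: the "streaming" writer silently writes nothing,
# which mimics the empty-output failure shown above.
store = {"rows": 0}
result = write_with_fallback(
    streaming_write=lambda: store.update(rows=0),
    eager_write=lambda: store.update(rows=1025),
    count_rows=lambda: store["rows"],
    expected_rows=1025,
)
print(result)
```

In the notebook, the callables could wrap `lf.sink_parquet(pq_path)`, `lf.collect().write_parquet(pq_path)`, and a row count read back from `pq_path`. The fallback only helps for files small enough to fit in memory, so it would not cover the largest tables.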

In [None]:
for table in tqdm(os.listdir(all_csv_gz_dir)):
    csv_gz_to_parquet(table)