Parses a specific part of a CSV file. Used for parallel CSV parsing.

PartialCsvParser

PartialCsvParser is a C++ CSV parser.

This parser is designed to parse a CSV file in parallel.

Installation

PartialCsvParser is a single-header library released into the public domain.

Just copy PartialCsvParser.hpp into your include path and include it. You can also git add the header file to your repository, and even modify it.

I appreciate your pull requests if you make some improvements :)

Features

  • Pretty good single-thread & multi-thread performance.

    • The following graphs compare sequential performance with other CSV parsers and evaluate scalability. See benchmark/ for a more detailed explanation of performance.

      [Figures: "Comparison of CSV parser's performance" and "Scalability on clokoap100"]

  • Reads CSV input from both files and memory.

  • Simple interface working with STL (Standard Template Library).

  • Column separator (, by default) and line separator (\n by default) are customizable.

    • Also usable for TSV parsing.
  • Parses CSV both with and without a header line.

  • UTF-8 support.

  • A range of offsets within a file can be specified to parse only part of a CSV file.

    • Data parallelism is easy to achieve by creating threads that each parse a different range.

Examples

Some examples are available in the example/ directory.

You can also build and run them quickly.

$ cd example/
$ cmake . && make
$ ./00_parse_with_1parser

Simplest example: Parse and print a CSV file

example/00_parse_with_1parser.cpp

/**
 * Parses a CSV file and prints the contents.
 */

#include <PartialCsvParser.hpp>
#include <vector>
#include <string>
#include <iostream>

int main() {
  PCP::CsvConfig csv_config("english.csv");

  // parse header line
  std::vector<std::string> headers = csv_config.get_headers();
  // print headers
  std::cout << "Headers:" << std::endl;
  for (size_t i = 0; i < headers.size(); ++i)
    std::cout << headers[i] << "\t";
  std::cout << std::endl << std::endl;

  // instantiate parser
  PCP::PartialCsvParser parser(csv_config);  // parses whole body of CSV without range options.

  // parse & print body lines
  std::vector<std::string> row;
  while (!(row = parser.get_row()).empty()) {
    std::cout << "Got a row: ";
    for (size_t i = 0; i < row.size(); ++i)
      std::cout << row[i] << "\t";
    std::cout << std::endl;
  }

  return 0;
}

Output:

$ ./00_parse_with_1parser
Headers:
Country Name    Style

Got a row: Japan        Shonan Gold     Fruit Beer
Got a row: Scotland     Punk IPA        IPA
Got a row: Germany      Franziskaner    Hefe-Weissbier

More examples

Anti-features

  • Parsing only. Writing CSV files is not supported.

  • Multi-byte line separators such as CRLF are not supported.

    • This may be easily fixed, thanks :D
  • Enclosure character (typically ") is not supported.

    • The following CSV file is recognized as having 2 rows and 2 columns, while it should have 1 row and 3 columns if " were treated as an enclosure character.

      aaa,"bbb
      ccccccc",ddd
      
    • The reason: Say you are a parser. Your range starts at the 3rd 'c'. You see a " ahead of you. Is it an opening quote or a closing quote? You cannot tell without parsing from the beginning of the file.
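
To make the 2-row, 2-column result concrete, here is a minimal sketch of a quote-unaware split. It is not PartialCsvParser's code: `split`, `parse_without_enclosure`, and `check_readme_example` are hypothetical names used only for illustration, and the sketch assumes the default separators (`,` and `\n`).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;

// Splits `text` on `sep`, treating every other byte (including '"') as data.
std::vector<std::string> split(const std::string& text, char sep) {
  std::vector<std::string> parts;
  std::size_t begin = 0;
  for (;;) {
    const std::size_t end = text.find(sep, begin);
    if (end == std::string::npos) {
      parts.push_back(text.substr(begin));
      return parts;
    }
    parts.push_back(text.substr(begin, end - begin));
    begin = end + 1;
  }
}

// Quote-unaware parse: rows are split on '\n', columns on ','.
// Empty lines (such as the one after a trailing '\n') are skipped.
std::vector<Row> parse_without_enclosure(const std::string& csv) {
  std::vector<Row> rows;
  const std::vector<std::string> lines = split(csv, '\n');
  for (std::size_t i = 0; i < lines.size(); ++i)
    if (!lines[i].empty()) rows.push_back(split(lines[i], ','));
  return rows;
}

// The example from the list above: one quoted field spanning two lines.
void check_readme_example() {
  const std::vector<Row> rows =
      parse_without_enclosure("aaa,\"bbb\nccccccc\",ddd\n");
  assert(rows.size() == 2);     // 2 rows, not 1
  assert(rows[0].size() == 2);  // aaa | "bbb
  assert(rows[1].size() == 2);  // ccccccc" | ddd
}
```

Because the `"` bytes are kept as ordinary data, the line break inside the quoted field ends the row, which is exactly the behavior described above.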

Reference manual

Reference manual powered by Doxygen is available.

Parser behaviors

All lines of CSV file are parsed exactly once

PCP::PartialCsvParser::PartialCsvParser() takes 2 offsets: parse_from and parse_to.

If you run multiple threads, each with a parser covering a different [parse_from, parse_to] range, the CSV file is parsed in parallel.
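
One way to produce such ranges is sketched below. `split_ranges` and the `Range` typedef are hypothetical helpers, not part of PartialCsvParser; the sketch only assumes that offsets are inclusive, as the [parse_from, parse_to] notation suggests, and that the body starts at `body_offset` and ends at `filesize - 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive range [first, second], mirroring [parse_from, parse_to].
typedef std::pair<std::size_t, std::size_t> Range;

// Hypothetical helper (not part of PartialCsvParser): splits the inclusive
// range [body_offset, filesize - 1] into at most n_parsers contiguous,
// non-overlapping, non-empty ranges of nearly equal length.
std::vector<Range> split_ranges(std::size_t body_offset, std::size_t filesize,
                                std::size_t n_parsers) {
  assert(n_parsers > 0 && body_offset < filesize);
  const std::size_t total = filesize - body_offset;
  // Never create more ranges than there are bytes to cover.
  const std::size_t n = n_parsers < total ? n_parsers : total;
  const std::size_t base = total / n;
  const std::size_t extra = total % n;  // first `extra` ranges get 1 more byte
  std::vector<Range> ranges;
  std::size_t from = body_offset;
  for (std::size_t i = 0; i < n; ++i) {
    const std::size_t len = base + (i < extra ? 1 : 0);
    ranges.push_back(Range(from, from + len - 1));
    from += len;
  }
  return ranges;
}
```

Each returned pair could then serve as the parse_from and parse_to of one parser instance running in its own thread.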

Every line of the CSV file is guaranteed to be parsed exactly once, provided that the [parse_from, parse_to] ranges of all parser instances together cover [ PCP::CsvConfig::body_offset() , PCP::CsvConfig::filesize() - 1] without gaps or overlaps (see the following diagram).

header1,header2 \n aaaaaaaa,bbbbbbbbbb \n ccccccccc,dddd \n
                   ^                                      ^
                   body_offset                            filesize - 1

                   <----------><-----------><------------->
                     parser1      parser2       parser3
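
The text above does not say how a parser treats a line that straddles a range boundary. As an illustration only, the sketch below uses one rule that would give exactly-once behavior: a range owns every line whose first byte lies inside it. This rule and the function `lines_owned_by` are assumptions made for this example, not a description of the library's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical ownership rule (an assumption, not taken from the library):
// the inclusive range [from, to] owns every line whose FIRST byte lies
// inside it. Returns the owned lines without their trailing '\n'.
std::vector<std::string> lines_owned_by(const std::string& csv,
                                        std::size_t from, std::size_t to) {
  assert(from <= to && to < csv.size());
  std::vector<std::string> lines;
  std::size_t pos = from;
  // If `from` falls in the middle of a line, skip to the start of the next
  // line; the line we landed in belongs to an earlier range.
  if (pos > 0 && csv[pos - 1] != '\n') {
    pos = csv.find('\n', pos);
    if (pos == std::string::npos) return lines;
    ++pos;
  }
  // Read every line that starts at or before `to`, even if it ends after it.
  while (pos <= to && pos < csv.size()) {
    std::size_t end = csv.find('\n', pos);
    if (end == std::string::npos) end = csv.size();
    lines.push_back(csv.substr(pos, end - pos));
    pos = end + 1;
  }
  return lines;
}
```

With this rule, a line cut by a boundary is read in full by the range that contains its first byte and skipped by the next range, so contiguous ranges never drop or duplicate a line.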

For developers

How to run test cases

  1. Get Google Test.
$ wget https://googletest.googlecode.com/files/gtest-1.7.0.zip
$ unzip gtest-1.7.0.zip
$ mv gtest-1.7.0 /path/to/PartialCsvParser/contrib/gtest
  2. Build the test executables.
$ cd test
$ cmake . && make
  3. Run the unit tests and integration tests.
$ ./run_unit_test
$ ./run_integrated_test

Author

Sho Nakatani, a.k.a. laysakura

LICENSE

This project is released into the public domain.

See UNLICENSE for more detailed explanation.
