Question on csv programming exercise and encoding rows missing specific header keys #176

Open
CoreyWinkelmannPP opened this issue Aug 12, 2019 · 2 comments · May be fixed by #219

Comments

CoreyWinkelmannPP commented Aug 12, 2019

Problem Description

I need to take a collection of CSV documents in a folder and merge them into one very large CSV document. The files share some columns, and each file also has columns of its own. The script reads each file, merges them, and writes out the new CSV file.

Solution I have working (but it seems a little slow compared to a Go or Rust implementation)

On this data set, the Rust and Go versions run the scenario in 100 to 200 ms. The Haskell version below takes 300 to 400 ms. A Python version also ran in the 300 to 400 ms range, which is why I think Haskell should be able to do this faster.

I wrote the code below. Originally I hoped to stream through the files with conduit, processing them and building up the result as I went, but I gave up on that and instead collect the file contents first and then process them one by one. I would like a more efficient and more idiomatic Haskell version, and I was wondering if anyone could give me some insight into what that might look like. One issue I ran into with the solution below: I had to change the cassava code so that it returns an empty string when the map lookup returns Nothing, instead of failing as the current version does.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Conduit
import System.FilePath (takeExtension)
import Data.Csv
import qualified Data.Vector as V
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map as M
import Data.Either
import Data.List (nub)
import Control.Monad

type Column = M.Map BS.ByteString LBS.ByteString
type Rows = V.Vector Column
type CsvDocument = (Header, Rows)
type CsvDocuments = V.Vector BS.ByteString
type ErrorMsg = String

getCsvDocuments :: ConduitM a c (ResourceT IO) CsvDocuments
getCsvDocuments = sourceDirectoryDeep True "."
        .| filterC (\fp -> takeExtension fp == ".csv")
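        -- Read each file whole: sourceFile streams a file in chunks, so a
        -- large file would arrive as several partial ByteStrings.
        .| mapMC (liftIO . BS.readFile)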
        .| sinkVector

mergeHeader :: Header -> Header -> Header
mergeHeader h1 h2 = V.fromList . nub . V.toList $ (h1 V.++ h2)

combineCsvDocuments :: CsvDocument -> BS.ByteString -> CsvDocument
combineCsvDocuments acc csv = (mergedHeader, mergedBody)
    where
        decodedCsv = fromRight (V.empty, V.empty) . decodeByName . LBS.fromStrict $ csv
        mergedHeader = mergeHeader (fst decodedCsv) (fst acc)
        mergedBody = snd acc V.++ snd decodedCsv

mapFiles :: CsvDocuments -> CsvDocument
mapFiles = V.foldl' combineCsvDocuments (V.empty, V.empty)

main :: IO ()
main = do
    files <- runConduitRes getCsvDocuments
    let document = mapFiles files
    LBS.writeFile
        "output/combined_response.csv"
        (encodeByName
            (fst document)
            (V.toList . snd $ document))
    return ()
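
A possible variation on the merge step, offered as a sketch rather than a measured fix: `V.++` copies both of its operands, so appending inside a fold re-copies the accumulated rows for every file, and `nub` is quadratic in the number of header names. The hypothetical helpers `mergeHeaders` and `mergeBodies` below (names invented for this example) build the header union with a `Set` and concatenate all row vectors in one `V.concat` call. Unlike `mergeHeader` above, this version keeps each column at the position where it first appears across the files. Whether it closes the gap to the Go and Rust versions is an assumption that would need profiling.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as B8
import qualified Data.Set as S
import qualified Data.Vector as V

-- | Order-preserving union of several headers (hypothetical helper).
-- A column name is kept at the position of its first occurrence.
mergeHeaders :: [V.Vector BS.ByteString] -> V.Vector BS.ByteString
mergeHeaders = V.fromList . go S.empty . concatMap V.toList
  where
    go _ [] = []
    go seen (x : xs)
      | x `S.member` seen = go seen xs
      | otherwise         = x : go (S.insert x seen) xs

-- | Concatenate every file's rows in a single pass (hypothetical helper),
-- instead of appending to an accumulator once per file.
mergeBodies :: [V.Vector row] -> V.Vector row
mergeBodies = V.concat

main :: IO ()
main = do
  let h1 = V.fromList (map B8.pack ["id", "name"])
      h2 = V.fromList (map B8.pack ["name", "email"])
  -- The merged header contains id, name, email in that order.
  print (mergeHeaders [h1, h2])
  -- Rows are stand-in Ints here; any row type works.
  print (mergeBodies [V.fromList [1, 2 :: Int], V.fromList [3]])
```

With these helpers, the per-file decode results would be collected in a list first, and the headers and bodies merged once at the end.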

About me

I am learning Haskell concepts but am still a beginner. I pick out challenges and try them in Haskell, and I am always looking for feedback and better approaches from more experienced people. In my current role I develop in object-oriented languages, which I understand well. I am trying to understand how functional programming can improve my development skills.

Thanks in advance for any help you can give!

@CoreyWinkelmannPP
Author

To get this working I had to compile a local copy of the library to bypass the hard-coded error path taken when a lookup on the map fails. The function below does the lookup and fails on Nothing. To make the example above work, I changed the Nothing case to return an empty string, which lets it build any number of columns whether or not every row has all of the expected data. I am still wondering how people here would approach this problem without changing the library's internals. Thanks again for any insights you have!

namedRecordToRecord :: Header -> NamedRecord -> Record
namedRecordToRecord hdr nr = V.map find hdr
  where
    find n = case HM.lookup n nr of
        Nothing -> moduleError "namedRecordToRecord" $
                   "header contains name " ++ show (B8.unpack n) ++
                   " which is not present in the named record"
        Just v  -> v
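
One possible workaround that leaves the library untouched, as a minimal sketch: pad every row with the full merged header before handing it to `encodeByName`, so the lookup in `namedRecordToRecord` never sees a missing name. `padRow` is a hypothetical helper, not a cassava function, and the sketch assumes an empty field is an acceptable value for an absent column. It uses cassava's `NamedRecord` (a `HashMap` of strict `ByteString`s) rather than the `Map` type from the original post.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Csv (Header, NamedRecord, encodeByName)
import qualified Data.ByteString.Lazy.Char8 as LBS8
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- | Hypothetical helper (not part of cassava): give a row an empty field
-- for every header name it lacks. HM.union is left-biased, so values
-- already present in the row take precedence over the empty defaults.
padRow :: Header -> NamedRecord -> NamedRecord
padRow hdr row = row `HM.union` defaults
  where
    defaults = HM.fromList [(name, "") | name <- V.toList hdr]

main :: IO ()
main = do
  let hdr :: Header
      hdr = V.fromList ["a", "b", "c"]
      rows :: [NamedRecord]
      rows =
        [ HM.fromList [("a", "1"), ("b", "2")] -- lacks column "c"
        , HM.fromList [("c", "3")]             -- lacks columns "a" and "b"
        ]
  -- Every padded row now has all three names, so encoding does not fail.
  LBS8.putStr (encodeByName hdr (map (padRow hdr) rows))
```

The cost is one extra map union per row; a configurable default inside the encoder, as suggested later in this thread, would avoid that extra pass.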

jkarni commented Mar 15, 2022

I've come across the same problem. Encoding different CSVs and then merging them is expensive. Having the ability to decide (in, presumably, EncodeOptions) how to encode headers that are missing in the data seems like an overall nicer experience. If such a change would be welcome, I could submit a PR.

lrworth linked a pull request on Nov 8, 2022 that will close this issue