
improve performance, using StringBuilder instead of string concat #7

Merged
merged 1 commit into from
Apr 14, 2017

Conversation

yelliver

I have a 100MB CSV file, and it takes 15 minutes to read it. After reading the source code I saw that the string concatenation is very terrible, so I replaced it with StringBuilder. Now it takes only a few seconds, compared to 15 minutes, to read the same file!
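The speedup described above comes from the asymptotics of the two accumulation patterns: building a field with `+=` copies the entire buffer on every append (quadratic in field length), while `StringBuilder` grows an internal buffer with amortized constant-time appends (linear). A minimal sketch of the difference (in Java for illustration; the library itself is .NET, where `System.Text.StringBuilder` behaves the same way):

```java
public class FieldAccumulation {
    // Quadratic: each += allocates a new String and copies all prior characters.
    static String concatField(char[] chars) {
        String field = "";
        for (char c : chars) {
            field += c; // O(n) copy per character -> O(n^2) total
        }
        return field;
    }

    // Linear: StringBuilder appends into a growable buffer, amortized O(1) each.
    static String builderField(char[] chars) {
        StringBuilder field = new StringBuilder();
        for (char c : chars) {
            field.append(c);
        }
        return field.toString();
    }

    public static void main(String[] args) {
        char[] data = new char[20_000];
        java.util.Arrays.fill(data, 'x');

        long t0 = System.nanoTime();
        String a = concatField(data);
        long t1 = System.nanoTime();
        String b = builderField(data);
        long t2 = System.nanoTime();

        // Both produce the same string; only the cost differs.
        System.out.println(a.equals(b));
        System.out.printf("concat: %d ms, builder: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

At 20,000 characters the concat version already does on the order of 10^8 character copies; at the multi-megabyte field sizes discussed later in this thread, the quadratic cost is what turns seconds into minutes.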

@ImrePyhvel ImrePyhvel merged commit e2c1051 into nortal:develop Apr 14, 2017
@ImrePyhvel
Collaborator

Yes, should you have a wide table with small values then StringBuilder can definitely have a huge impact. I'm actually surprised we haven't caught this before. Thanks for the catch!

NB! Be aware that your PR did not compile due to a missing using-statement, and it also changed the expectations about handling missing/empty values (3 tests broke). I fixed these issues in the develop branch (8cb19791462171358212e58745080c496d75d0fca).

FYI, the current algorithm is still simplistic and targets mostly CSV files that are relatively narrow; the streaming toolset is designed to help handle files with many rows. Regardless, parsing 100MB using the streaming API (parser.ReadNextRow()) should not have taken 15 minutes. I would appreciate it if you could make a repro file available to me for further investigation into handling such structures.
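The streaming pattern referred to above — consuming one row at a time so memory use stays constant regardless of file size — can be sketched as follows. This is an illustrative Java sketch of the general pattern, not the library's actual `ReadNextRow()` API, and the naive `split` deliberately ignores quoted fields:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StreamingCsv {
    // Processes rows one at a time; never holds the whole file in memory.
    // A real CSV lexer must also handle quoted fields containing commas
    // and embedded newlines, which this line-based sketch does not.
    static long countRows(BufferedReader reader) throws IOException {
        long rows = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",", -1); // per-row work only
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,b,c\n1,2,3\n4,5,6\n";
        try (BufferedReader r = new BufferedReader(new StringReader(csv))) {
            System.out.println(countRows(r)); // prints 3
        }
    }
}
```

With this shape, runtime should scale linearly with file size, which is why a 15-minute parse of 100MB points at a per-field cost blowup rather than the row loop itself.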

@yelliver
Author

Can you provide me your email? My CSV file has some sensitive data, so I cannot post it publicly here.

@ImrePyhvel
Collaborator

Send the link here: imre.pyhvel@gmail.com

@yelliver
Author

I forgot that the problem is not the large file but the large field!

@ImrePyhvel
Collaborator

Regardless, if possible please do provide the example file; 15 minutes is unacceptable. I assume this is some interesting edge case where single fields were split into too many pieces during the lexing phase; a typical use case would have no string concatenations in that step.

@yelliver
Author

I am surprised, too. I tried with OpenCSV in Java and it took more than 20 minutes. I will send you my file next week!
