
2019-05-16: Remove duplicate lines of a file preserving their order in Linux #3

Open · iridakos opened this issue May 16, 2019 · 6 comments · 4 participants
@iridakos (Owner) commented May 16, 2019

Feel free to write any comments you have for the post Remove duplicate lines of a file preserving their order in Linux

@iridakos (Owner, Author) commented May 16, 2019

awk
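Presumably this refers to the awk idiom the post is about. A minimal, self-contained sketch (the array name `visited` and the sample file are illustrative, not quoted from the post):

```shell
# Sample input with duplicates; the original order matters.
printf 'apple\nbanana\napple\ncherry\nbanana\n' > fruits.txt

# Print each line only the first time it is seen, preserving order.
# visited[$0]++ evaluates to 0 (false) the first time a line appears,
# so !visited[$0]++ is true and the line is printed; later occurrences
# yield a nonzero count and are suppressed.
awk '!visited[$0]++' fruits.txt
# apple
# banana
# cherry
```

The pattern has no action, so awk falls back to its default action of printing the current line whenever the pattern is true.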

@orachas commented May 18, 2019

One of the best references is the original book on "new awk" by the language authors. The original paperback was quite expensive, but a PDF is now freely available.

https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf

@iridakos (Owner, Author) commented May 18, 2019

@orachas Thank you! I will add it to the post 👍

@Narigo commented May 29, 2019

That's a nice way to remove duplicated lines! I've used hashes to count occurrences in scripts and other programs, so using one here actually feels quite natural - I didn't know you could set that up so easily with awk.

From my intuition, I'd suppose that using the awk approach might run into memory issues with very large files as it will keep a lot of content from the file in memory, right?

The cat | sort | sort | cut approach seems like something that might have the same problem. sort probably needs to take the whole file into consideration before it can sort it, so it might suffer from even higher memory consumption than the awk solution...
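For reference, that pipeline is usually written as below (the exact flags are my assumption, not a quote from the post): number the lines, sort by content with -u to drop duplicates, re-sort by the original line numbers, then cut the numbers off.

```shell
printf 'apple\nbanana\napple\ncherry\nbanana\n' > fruits.txt

# 1. cat -n     prepends original line numbers (tab-separated)
# 2. sort -uk2  sorts by content (field 2 onward) and keeps one copy
#               of each duplicate run
# 3. sort -nk1  restores the original order via the line numbers
# 4. cut -f2-   strips the line numbers again
cat -n fruits.txt | sort -uk2 | sort -nk1 | cut -f2-
```

With GNU sort, -u disables the last-resort whole-line comparison, so the first occurrence (lowest line number) of each duplicate run survives and the original order is preserved.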

@iridakos (Owner, Author) commented May 29, 2019

@Narigo Thank you! 😃

> From my intuition, I'd suppose that using the awk approach might run into memory issues with very large files as it will keep a lot of content from the file in memory, right?

Yes, you are correct.

> The cat | sort | sort | cut approach seems like something that might have the same problem.

I am not familiar with sort's internals but it seems that it does use intermediate temporary files. Check this comment thread on the HN post: https://news.ycombinator.com/item?id=20038605 for more details.
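GNU sort's external merge behavior can in fact be steered from the command line; the flags below are standard GNU coreutils options (the sample file is illustrative):

```shell
printf 'banana\napple\ncherry\n' > fruits.txt

# -S caps the in-memory buffer; once it fills, sort spills sorted runs
# to temporary files under the -T directory and merges them afterwards,
# so memory use stays bounded even for files larger than RAM.
sort -S 1M -T /tmp fruits.txt
# apple
# banana
# cherry
```

So a sort-based pipeline trades memory for disk I/O, which is why it can handle files the in-memory awk approach cannot.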

@mnault commented May 29, 2019

Wow! Excellent mastery of awk. I raise my hat to you, sir! :)
http://www.allitebooks.org/?s=awk
