Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use JSON instead of pickle for tld data #81

Closed
andresriancho opened this issue Dec 2, 2015 · 14 comments
Closed

Use JSON instead of pickle for tld data #81

andresriancho opened this issue Dec 2, 2015 · 14 comments

Comments

@andresriancho
Copy link

.tld_set_snapshot and .tld_set use pickle to store the TLD information. While this is perfectly fine in most cases it brings the following issues:

  • Binary data stored in your git repository is a bad practice
  • If this library gets packaged for ubuntu/debian, the maintainer will complain. I just started using tldextract in the w3af project which is part of ubuntu/debian. When the package maintainer packages the next version he'll most likely dislike the binary blob
  • Pickles are "executables". A specially crafted pickle can trigger an arbitrary remote command execution when unpickled. While I did review the library for bugs and backdoors before including it in w3af, I did not read the whole pickle; which is bad for w3af user's security.
@floer32
Copy link
Collaborator

floer32 commented Dec 2, 2015

This is a good idea.

@floer32
Copy link
Collaborator

floer32 commented Dec 2, 2015

I wasn't sure if performance would degrade, looks like it would actually improve. Haha.

Pickle vs JSON — Which is Faster?
If you’re here for the short answer — JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps).

source

@john-kurkowski
Copy link
Owner

👍

This could be the needle to finally cut a 2.0 release. If this gets done, can abandon pickle entirely, no need for backwards compatibility. (Then clean up other 1.x deprecations.)

@andresriancho
Copy link
Author

Awesome, happy to see you guys liked the idea

@mauricioabreu
Copy link
Contributor

Good idea. :-)
I already contribute to tldextract project. Maybe it is a chance to contribute again.

@john-kurkowski
Copy link
Owner

@mauricioabreu by all means!

I started a 2.0 branch. Target any work there. Then you can run free with this issue, with less worry about backwards compatibility.

@mauricioabreu
Copy link
Contributor

Could not it be a plain text file?

@mauricioabreu
Copy link
Contributor

I mean that JSON is good for structured data, no?
We don't have a structure here, just a file separated by new lines.

Python would support this by reading lines as a normal file. What do you think @andresriancho ?

@andresriancho
Copy link
Author

I usually use JSON even for very simple data because it allows me to add more "structure" to it later (if required by the next versions of the software) without rewriting the code; but a plain text file sounds good also.

@mauricioabreu
Copy link
Contributor

Good! Thanks for your opinion @andresriancho I will start it using plain text. :)

@john-kurkowski
Copy link
Owner

I would start it with JSON for to be more forward thinking, as mentioned.

As a concrete use case, metadata about each suffix will be required for #66, i.e. whether it's a private suffix or not. Teasing this from JSON is easy. Teasing from plain text is less fun.

@mauricioabreu
Copy link
Contributor

@john-kurkowski okay! I am happy with these two alternatives. :-)
Thanks! Going to use JSON then.

@john-kurkowski
Copy link
Owner

This weekend, I'll use #91 to work toward a 2.0 release, to close this.

john-kurkowski added a commit that referenced this issue Mar 13, 2016
@john-kurkowski
Copy link
Owner

Closed via #92.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants