Ch5: Can't download data because ClearBits permanently shut down #3

Closed
tcfuji opened this issue Jan 30, 2014 · 2 comments

Comments

@tcfuji

tcfuji commented Jan 30, 2014

According to this post (http://blog.stackexchange.com/category/cc-wiki-dump/), the download sites given in the book for the data are no longer valid. The blog says the data is now hosted at the Internet Archive, but I cannot locate the data needed for chapter 5.

@wrichert
Collaborator

Indeed, according to the website, the data was moved to the Internet Archive a week ago; the torrent is at
https://archive.org/download/stackexchange/stackexchange_archive.torrent.

Tonight I will download it, check the contents of the torrent, and report back in this thread.

Thanks for pointing this out, @TFGIT!

@wrichert
Collaborator

So, I looked into the torrent with ktorrent, and it contains stackoverflow-*.7z files of the proper sizes. In our case, we are interested in stackoverflow.com-Posts.7z. To convert it to the more digestible TSV format, you might want to change https://github.com/luispedro/BuildingMachineLearningSystemsWithPython/blob/master/ch05/so_xml_to_tsv.py#L27 accordingly.
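
For anyone who only wants a rough idea of what such a conversion involves, here is a minimal sketch. It is not the code in so_xml_to_tsv.py. It assumes the extracted archive holds one XML file whose posts are stored as `<row ... />` elements with attributes such as `Id`, `PostTypeId`, `ParentId`, `CreationDate`, `Score` and `Body`; please check those names against the actual dump. The function name `posts_xml_to_tsv` and the `FIELDS` tuple are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed attribute names; verify them against the real dump before relying on them.
FIELDS = ("Id", "PostTypeId", "ParentId", "CreationDate", "Score", "Body")


def posts_xml_to_tsv(xml_source, tsv_out, fields=FIELDS):
    """Stream <row .../> elements and write one tab-separated line per row.

    xml_source: a path or a binary file object with the posts XML.
    tsv_out: a text file object to write to.
    Returns the number of rows written.
    """
    count = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        # Missing attributes become empty strings. Runs of whitespace
        # (including tabs and newlines) are collapsed to single spaces
        # so that every post stays on exactly one line.
        values = [" ".join(elem.get(name, "").split()) for name in fields]
        tsv_out.write("\t".join(values) + "\n")
        count += 1
        elem.clear()  # drop the row's attributes once written
    return count
```

Assuming the extracted file is called Posts.xml (a guess, not something verified here), it could be called with `open("Posts.xml", "rb")` as the source and a file opened in text mode with UTF-8 encoding as the target. Parsing incrementally matters because the posts file is far too large to load as one tree.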
