Try out Python ETL scripts on Windows box in city #24

Closed
daguar opened this issue Mar 3, 2014 · 22 comments

daguar commented Mar 3, 2014

Lauren -- Adding this issue because I think a next step is to see if you can run my scripts that do the Netfile data ETL on your Windows box:

https://github.com/daguar/netfile-etl

An alternative is to run them on an external Unix-y server (like Heroku or elsewhere) and then set up a job to download them every day to a computer within the city.


daguar commented Mar 3, 2014

I can stop by and see if I can get this set up. Basically we'd just need to download/install:

  • Cygwin: http://www.cygwin.com/
  • Python: http://www.python.org/download/
  • Pip: https://sites.google.com/site/pydatalog/python/pip-for-windows
  • csvkit (just run pip install csvkit after Pip is installed)

Ugh, writing this all out makes me a sad panda. Maybe we will use Heroku.
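
If we do go the Windows route, here's a quick sanity check that the csvkit step took (a minimal sketch; assumes Python and pip from the list above are already on the PATH):

import csvkit  # an ImportError here means `pip install csvkit` didn't take
print("csvkit imported OK")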

sunnyrjuneja commented Mar 3, 2014

I'm not sure if you guys have used Cygwin before, but I personally did not have a good experience. I think it might be easier to set up PuTTY and a DigitalOcean instance for $5/mo.


lla2105 commented Mar 3, 2014

Hey Dave! I can download and install Cygwin, Python, and pip right now. We can experiment to see if the script works and whether there's anything else we need to do to get this all up and running. I'm free at work all day today and all day tomorrow ... when works best for you to stop by? Thanks everybody!!!

sunnyrjuneja commented Mar 3, 2014

@daguar I can volunteer my personal server to do this.


tdooner commented Mar 3, 2014

heroku++

tdooner closed this as completed Mar 3, 2014
tdooner reopened this Mar 3, 2014

tdooner commented Mar 4, 2014

(as it turns out, Heroku is a difficult platform to do Unixy things on, like wget, unzip, etc. I'm working on getting it to run on my branch of netfile-etl but it's proving to be an arduous process)

sunnyrjuneja commented Mar 4, 2014

@ted27 I think it honestly might be more work than it's worth, because that isn't exactly Heroku's use case. You could probably do it with a messaging queue and a worker dyno, but that's $30 a month. I think using a DigitalOcean instance or someone's personal server (like mine!) is the best way forward.


daguar commented Mar 4, 2014

@whatasunnyday:

that isn't exactly heroku's use case. you could probably do it with a messaging queue and worker dyno but that's $30 bucks a month.

Agreed it's not really Heroku's use case, but you can do it for free with the job scheduler (https://devcenter.heroku.com/articles/scheduler); the dyno cost is simply the time the job takes to run, so a nightly 5-minute task like this one will be way under the limit, and we could throw up a simple single-page Python service that just displays the contents of the S3 bucket it's saving to.
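
A minimal sketch of that single-page service (assumes Flask and boto3 are available; the bucket name is hypothetical):

import boto3
from flask import Flask

app = Flask(__name__)
BUCKET = "netfile-etl-output"  # hypothetical bucket name

@app.route("/")
def index():
    # List whatever the nightly job has uploaded so far
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return "<br>".join(keys) or "No files yet."

if __name__ == "__main__":
    app.run()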

@ted27: Thanks for getting started with an attempt on Heroku! I tried deploying and got wget and unzip not found, so, yeah, I think it's a problem with buildpack silliness.

sunnyrjuneja commented Mar 4, 2014

I had no idea Heroku had a free job scheduler. Cooool. Thanks for the share.


daguar commented Mar 4, 2014

Yeah, it's pretty badass.

This is actually the exact use-case of Docker. But I'm a little more comfortable having it on a service we know could be there forever and be free, so I think futzing around with Heroku is the right call.


daguar commented Mar 4, 2014

PS, @ted27: one of the issues with Heroku is the ephemeral, non-writeable disk. You can write to /tmp, however.

This means that (a) the scripts can't write any files to the folder they're located in [which is how it's currently written], and (b) any data written to /tmp will not be there after the script completes.

So the job I'd probably set up would be:

  • Copy all script files to /tmp
  • Run bash run_all.sh
  • Upload resulting CSVs in /tmp to S3
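
In Python, that job might look something like this (a sketch; boto3 and the bucket name are assumptions, run_all.sh is the existing entry point):

import glob, os, shutil, subprocess
import boto3

WORKDIR = "/tmp/netfile-etl"
shutil.copytree(".", WORKDIR, ignore=shutil.ignore_patterns(".git"))  # copy scripts to /tmp
subprocess.check_call(["bash", "run_all.sh"], cwd=WORKDIR)            # run the ETL there

s3 = boto3.client("s3")
for path in glob.glob(os.path.join(WORKDIR, "*.csv")):                # upload resulting CSVs
    s3.upload_file(path, "netfile-etl-output", os.path.basename(path))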

Alternatively, the scripts could be modified to always work in /tmp, but I think having the default be writing to the current folder is the more scripty and reasonable-to-expect way for it to work.


tdooner commented Mar 4, 2014

Agreed, @daguar, about the general methodology. By combining various buildpacks I was able to get wget and unzip to run. However, you have to compile everything yourself, and openssl (required to wget files from an SSL server) is not compiled in by default. So, yeah, lots of manual tweaking to get it working.

But the main benefit of Heroku (as I see it) is that we have shared ownership of a project - i.e. you can invite people to collaborate on the project with you. That way we don't rely on any single person's server or recurring attention for the data to populate!


daguar commented Mar 4, 2014

@ted27: Oh boy. So does your Heroku instance have it running now? (Just git clone-ing and pushing to Heroku out of the box didn't work for me, so maybe your compilation was done manually in the console?)

If you'll be around tonight we can hack on this and get it working.


daguar commented Mar 4, 2014

@migurski also pointed me to this, his notes on getting packaged binaries w/ Heroku: https://github.com/codeforamerica/heroku-buildpack-pygeo/blob/master/Build.md


migurski commented Mar 4, 2014

Heroku has curl already, which can be a fine wget replacement.


migurski commented Mar 4, 2014

Also, Python has baked-in support for zip files. It's pretty easy to use, so it's possible you could skip compiling binaries altogether.
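
For instance, downloading and extracting an archive without any external binaries (a minimal sketch, Python 3 shown; URL and paths are hypothetical):

import urllib.request, zipfile

urllib.request.urlretrieve("https://example.com/netfile.zip", "/tmp/netfile.zip")
with zipfile.ZipFile("/tmp/netfile.zip") as zf:
    zf.extractall("/tmp")  # no unzip binary needed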


daguar commented Mar 4, 2014

@migurski -- Thanks; and, yeah, most of this is my laziness (I wrote these scripts super quickly, and I know all of it could actually be done in pure Python).

I used wget because it handles 404s and 302s more simply (both of which are happening here in the naive implementation), but it looks like I can do curl -f -L now that I've taken the five minutes to read up.
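
And if we do eventually go pure Python, something like requests gives roughly the same behavior as curl -f -L (a sketch; the URL is hypothetical):

import requests

resp = requests.get("https://example.com/netfile.zip")  # follows 302 redirects by default
resp.raise_for_status()  # fails loudly on 4xx/5xx, like curl -f
with open("/tmp/netfile.zip", "wb") as f:
    f.write(resp.content)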


migurski commented Mar 4, 2014

Quick attempt to get unzip built on Heroku:

curl -L http://sourceforge.net/projects/infozip/files/UnZip%206.x%20%28latest%29/UnZip%206.0/unzip60.tar.gz/download | tar -xzvf -
cd unzip60
make -f unix/Makefile generic

…and unzip now works. At 156KB it should be fine to include in the Git repo. ldd suggests it's not linked too badly:

linux-vdso.so.1 =>  (0x00007fff6a1b9000)
libc.so.6 => /lib/libc.so.6 (0x00007f48da7c7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48dab60000)


migurski commented Mar 4, 2014

…and the result, which should Just Work™: http://dbox.teczno.com/unzip.gz


daguar commented Mar 4, 2014

Okay, @ted27, I've replaced wget with curl in my repo if you want to rebase. Will take a look at Python vs. the unzip buildpack shortly.


daguar commented Mar 7, 2014

I think I can actually save us all from the Heroku+S3 steps and run this on Lauren's comp, which now has Vagrant + an Ubuntu VM!

Documenting (incomplete) setup here: daguar/netfile-etl#2


daguar commented Mar 8, 2014

We've got it working on Lauren's comp!!!

Next steps for this are:

Dave:

  • Create a virtual machine custom for Lauren's PC (smallish, containing all dependencies)
  • Configure VM to dump to specified folder on Lauren's PC (local disk, confirm folder location with Lauren)
  • Set up a cron job for the script (see the crontab sketch below)

Lauren:

  • Move the DataSync jobs to her local computer (off of shared drive)
  • Edit the Windows Scheduler and DataSync jobs to look for new (local drive) folder
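
For the cron job in Dave's list, a minimal crontab entry inside the VM might look like this (schedule, paths, and log location are all hypothetical):

# Run the ETL nightly at 2am; output lands in the folder synced to the Windows host
0 2 * * * cd /home/vagrant/netfile-etl && bash run_all.sh >> /var/log/netfile-etl.log 2>&1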
