Try out Python ETL scripts on Windows box in city #24

Closed
daguar opened this issue Mar 3, 2014 · 22 comments

daguar commented Mar 3, 2014

Lauren -- Adding this issue because I think a next step is to see if you can run my scripts that do the Netfile data ETL on your Windows box:

https://github.com/daguar/netfile-etl

An alternative is to run them on an external Unix-y server (like Heroku or elsewhere) and then set up a job to download them every day to a computer within the city.


daguar commented Mar 3, 2014

I can stop by and see if I can get this set up. Basically we'd just need to download/install:

  • Cygwin: http://www.cygwin.com/
  • Python: http://www.python.org/download/
  • Pip: https://sites.google.com/site/pydatalog/python/pip-for-windows
  • csvkit (just run pip install csvkit after Pip is installed)

Ugh, writing this all out makes me a sad panda. Maybe we will use Heroku.
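
If we do go the Windows route, here's a quick sanity check that the csvkit step took (a minimal sketch; assumes Python and pip from the list above are already on the PATH):

import csvkit  # an ImportError here means `pip install csvkit` didn't take
print("csvkit imported OK")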

sunnyrjuneja commented Mar 3, 2014

I'm not sure if you guys have used Cygwin before, but I personally did not have a good experience. I think it might be easier to set up PuTTY and a DigitalOcean instance for $5/mo.


lla2105 commented Mar 3, 2014

Hey Dave! I can download and install Cygwin, Python, and pip right now. We can experiment to see if the script works and whether there's anything else we need to do to get this all up and running. I'm free at work all day today and all day tomorrow ... when works best for you to stop by? Thanks everybody!!!

sunnyrjuneja commented Mar 3, 2014

@daguar I can volunteer my personal server to do this.


tdooner commented Mar 3, 2014

heroku++

tdooner closed this as completed Mar 3, 2014
tdooner reopened this Mar 3, 2014

tdooner commented Mar 4, 2014

(as it turns out, Heroku is a difficult platform to do Unixy things on, like wget, unzip, etc. I'm working on getting it to run on my branch of netfile-etl but it's proving to be an arduous process)

sunnyrjuneja commented Mar 4, 2014

@ted27 I think it honestly might be more work than it's worth, because that isn't exactly Heroku's use case. You could probably do it with a messaging queue and a worker dyno, but that's $30 a month. I think using a DigitalOcean instance or someone's personal server (like mine!) is the best way forward.


daguar commented Mar 4, 2014

@whatasunnyday:

that isn't exactly heroku's use case. you could probably do it with a messaging queue and worker dyno but that's $30 bucks a month.

Agreed it's not really Heroku's use case, but you can do it for free with the job scheduler (https://devcenter.heroku.com/articles/scheduler); the dyno cost is simply the time the job takes to run, so a nightly 5-minute task like this one will be way under the limit, and we could throw up a simple single-page Python service that just displays the contents of the S3 bucket it's saving to.
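
A minimal sketch of that single-page service (assumes Flask and boto3 are available; the bucket name is hypothetical):

import boto3
from flask import Flask

app = Flask(__name__)
BUCKET = "netfile-etl-output"  # hypothetical bucket name

@app.route("/")
def index():
    # List whatever the nightly job has uploaded so far
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return "<br>".join(keys) or "No files yet."

if __name__ == "__main__":
    app.run()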

@ted27: Thanks for getting started with an attempt on Heroku! I tried deploying and got wget and unzip not found, so, yeah, I think it's a problem with buildpack silliness.

sunnyrjuneja commented Mar 4, 2014

I had no idea Heroku had a free job scheduler. Cooool. Thanks for the share.


daguar commented Mar 4, 2014

Yeah, it's pretty badass.

This is actually the exact use-case of Docker. But I'm a little more comfortable having it on a service we know could be there forever and be free, so I think futzing around with Heroku is the right call.


daguar commented Mar 4, 2014

PS, @ted27: one of the issues with Heroku is the ephemeral, non-writeable disk. You can write to /tmp, however.

This means that (a) the scripts can't write any files to the folder they're located in [which is how it's currently written], and (b) any data written to /tmp will not be there after the script completes.

So the job I'd probably set up would be:

  • Copy all script files to /tmp
  • Run bash run_all.sh
  • Upload resulting CSVs in /tmp to S3
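
In Python, that job might look something like this (a sketch; boto3 and the bucket name are assumptions, run_all.sh is the existing entry point):

import glob, os, shutil, subprocess
import boto3

WORKDIR = "/tmp/netfile-etl"
shutil.copytree(".", WORKDIR, ignore=shutil.ignore_patterns(".git"))  # copy scripts to /tmp
subprocess.check_call(["bash", "run_all.sh"], cwd=WORKDIR)            # run the ETL there

s3 = boto3.client("s3")
for path in glob.glob(os.path.join(WORKDIR, "*.csv")):                # upload resulting CSVs
    s3.upload_file(path, "netfile-etl-output", os.path.basename(path))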

Alternatively, the scripts could be modified to always work in /tmp, but I think having the default be writing to the current folder is the more scripty and reasonable-to-expect way for it to work.


tdooner commented Mar 4, 2014

Agreed, @daguar, about the general methodology. By combining various buildpacks I was able to get wget and unzip to run. However, you have to compile everything yourself, and openssl (required to wget files from an SSL server) is not compiled in by default. So, yeah, lots of manual tweaking to get it working.

But the main benefit of Heroku (as I see it) is that we have shared ownership of a project - i.e. you can invite people to collaborate on the project with you. That way we don't rely on any single person's server or recurring attention for the data to populate!


daguar commented Mar 4, 2014

@ted27: Oh boy. So does your Heroku instance have it running now? (Just git clone-ing and pushing to Heroku out of the box didn't work for me, so maybe your compilation was done manually in the console?)

If you'll be around tonight we can hack on this and get it working.


daguar commented Mar 4, 2014

@migurski also pointed me to this, his notes on getting packaged binaries w/ Heroku: https://github.com/codeforamerica/heroku-buildpack-pygeo/blob/master/Build.md


migurski commented Mar 4, 2014

Heroku has curl already, which can be a fine wget replacement.


migurski commented Mar 4, 2014

Also, Python has baked-in support for zip files. It's pretty easy to use, so it's possible you could skip compiling binaries altogether.
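
For instance, downloading and extracting an archive without any external binaries (a minimal sketch, Python 3 shown; URL and paths are hypothetical):

import urllib.request, zipfile

urllib.request.urlretrieve("https://example.com/netfile.zip", "/tmp/netfile.zip")
with zipfile.ZipFile("/tmp/netfile.zip") as zf:
    zf.extractall("/tmp")  # no unzip binary needed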


daguar commented Mar 4, 2014

@migurski -- Thanks; and, yeah, most of this is my laziness (I wrote these scripts super quickly, and I know all of it could actually be done in pure Python).

I used wget because it handles 404s and 302s more simply (both of which are happening here in the naive implementation), but it looks like I can do curl -f -L now that I've taken the five minutes to read up.
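
And if we do eventually go pure Python, something like requests gives roughly the same behavior as curl -f -L (a sketch; the URL is hypothetical):

import requests

resp = requests.get("https://example.com/netfile.zip")  # follows 302 redirects by default
resp.raise_for_status()  # fails loudly on 4xx/5xx, like curl -f
with open("/tmp/netfile.zip", "wb") as f:
    f.write(resp.content)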


migurski commented Mar 4, 2014

Quick attempt to get unzip built on Heroku:

curl -L http://sourceforge.net/projects/infozip/files/UnZip%206.x%20%28latest%29/UnZip%206.0/unzip60.tar.gz/download | tar -xzvf -
cd unzip60
make -f unix/Makefile generic

…and unzip now works. At 156KB it should be fine to include in the Git repo. ldd suggests it's not linked too badly:

linux-vdso.so.1 =>  (0x00007fff6a1b9000)
libc.so.6 => /lib/libc.so.6 (0x00007f48da7c7000)
/lib64/ld-linux-x86-64.so.2 (0x00007f48dab60000)


migurski commented Mar 4, 2014

…and the result, which should Just Work™: http://dbox.teczno.com/unzip.gz


daguar commented Mar 4, 2014

Okay, @ted27, I've replaced wget with curl in my repo if you want to rebase. Will take a look at Python vs. the unzip buildpack shortly.


daguar commented Mar 7, 2014

I think I can actually save us all from the Heroku+S3 steps and run this on Lauren's comp, which now has Vagrant + an Ubuntu VM!

Documenting (incomplete) setup here: daguar/netfile-etl#2


daguar commented Mar 8, 2014

We've got it working on Lauren's comp!!!

Next steps for this are:

Dave:

  • Create a virtual machine custom for Lauren's PC (smallish, containing all dependencies)
  • Configure VM to dump to specified folder on Lauren's PC (local disk, confirm folder location with Lauren)
  • Set up a cron job for the script (see the crontab sketch below)

Lauren:

  • Move the DataSync jobs to her local computer (off of shared drive)
  • Edit the Windows Scheduler and DataSync jobs to look for new (local drive) folder
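
For the cron job in Dave's list, a minimal crontab entry inside the VM might look like this (schedule, paths, and log location are all hypothetical):

# Run the ETL nightly at 2am; output lands in the folder synced to the Windows host
0 2 * * * cd /home/vagrant/netfile-etl && bash run_all.sh >> /var/log/netfile-etl.log 2>&1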
