# Retrieve data and move it to PUIdata

The PUIDATA environment variable holds the path to the PUIdata directory. First, make sure you create a PUIdata directory. _If you use jupyterhub_, the directory has to be in your home directory (on compute), since the path is set for everyone to $HOME/PUIdata.

You can retrieve the variable with os.getenv(), which returns None when the variable is not set.

In [1]:
from __future__ import print_function
import os
print (os.getenv("PUIDATA"))

None


In [2]:
# if PUIDATA is None, force the setup for this session
os.environ["PUIDATA"] = "%s/PUIdata"%os.getenv("HOME")
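
The setup above can be wrapped in a small helper that keeps an existing PUIDATA value, falls back to $HOME/PUIdata otherwise, and creates the directory if it is missing. This is a minimal sketch assuming Python 3; the name `ensure_puidata` is hypothetical and not part of any library.

```python
import os

def ensure_puidata(default_name="PUIdata"):
    """Return the PUIDATA path, setting the variable and creating the directory if needed.

    ensure_puidata is a hypothetical helper name used only for illustration.
    """
    path = os.getenv("PUIDATA")
    if path is None:
        # Fall back to <home directory>/PUIdata, the location used on jupyterhub
        path = os.path.join(os.path.expanduser("~"), default_name)
        os.environ["PUIDATA"] = path
    # Create the directory if it does not exist yet (no error if it does)
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_puidata()` at the top of a notebook would then give the same path whether or not the variable was already exported in the shell.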

# Downloading a single file and moving it to PUIdata. 

Note that if the file is on GitHub I need to download the __raw__ file, not the regular link, which returns a formatted HTML page.
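
As an illustration of the difference between the two links, the sketch below rewrites a regular GitHub file link into its raw counterpart. It assumes GitHub's usual URL layout (`github.com/<user>/<repo>/blob/<branch>/<path>` for the page and `raw.githubusercontent.com/<user>/<repo>/<branch>/<path>` for the raw file); `github_raw_url` is a hypothetical helper name.

```python
def github_raw_url(url):
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com equivalent.

    github_raw_url is a hypothetical helper. URLs that do not match the
    expected github.com/<user>/<repo>/blob/... pattern are returned unchanged.
    """
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url
    # Split "<user>/<repo>" from "<branch>/<path>" at the first "/blob/"
    user_repo, branch_and_path = url[len(prefix):].split("/blob/", 1)
    return "https://raw.githubusercontent.com/%s/%s" % (user_repo, branch_and_path)
```

The returned URL is the one to pass to a download tool or to pd.read_csv.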

In [13]:
#reading in directly from a url
import pandas as pd
pd.read_csv("https://serv.cusp.nyu.edu/~fbianco/PUIdata/311_Service_Requests_from_2010_to_Present_head.csv")

Unnamed: 0,Unique Key,Created Date,Closed Date,Agency,Agency Name,Complaint Type,Descriptor,Location Type,Incident Zip,Incident Address,...,Bridge Highway Name,Bridge Highway Direction,Road Ramp,Bridge Highway Segment,Garage Lot Name,Ferry Direction,Ferry Terminal Name,Latitude,Longitude,Location
0,34212771,09/01/2016 12:00:06 AM,09/01/2016 06:02:10 AM,NYPD,New York City Police Department,Noise - Residential,Loud Television,Residential Building/House,11413.0,137-47 CARSON STREET,...,,,,,,,,40.675208,-73.754949,"(40.67520813178531, -73.75494944502233)"
1,34212392,09/01/2016 12:00:35 AM,09/02/2016 04:53:48 PM,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,10469.0,943 EAST 217 STREET,...,,,,,,,,40.879986,-73.856707,"(40.87998640855491, -73.85670741555177)"
2,34214520,09/01/2016 12:00:37 AM,09/01/2016 01:05:43 AM,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,11369.0,98-02 25 AVENUE,...,,,,,,,,40.764632,-73.871623,"(40.76463242381882, -73.87162284918391)"
3,34212483,09/01/2016 12:00:54 AM,09/01/2016 03:15:32 AM,NYPD,New York City Police Department,Blocked Driveway,No Access,Street/Sidewalk,10466.0,1017 EAST 226 STREET,...,,,,,,,,40.885559,-73.850782,"(40.885558858976104, -73.85078238918491)"
4,34214231,09/01/2016 12:01:15 AM,09/06/2016 09:27:00 AM,DOT,Department of Transportation,Street Condition,Pothole,,10040.0,4700 BROADWAY,...,,,,,,,,40.864140,-73.929501,"(40.864140093130686, -73.92950060219349)"
5,34223410,09/01/2016 12:02:00 AM,09/01/2016 11:33:00 AM,DEP,Department of Environmental Protection,Water System,Hydrant Running Full (WA4),,10031.0,,...,,,,,,,,40.820098,-73.955076,"(40.820097574803015, -73.95507644617044)"
6,34219357,09/01/2016 12:02:00 AM,09/02/2016 12:00:00 PM,DSNY,BCC - Queens East,Sanitation Condition,12 Dead Animals,Sidewalk,11365.0,180-15 64 AVENUE,...,,,,,,,,40.737962,-73.792767,"(40.73796206550144, -73.79276687084037)"
7,34211394,09/01/2016 12:02:03 AM,,TLC,Taxi and Limousine Commission,Taxi Complaint,Driver Complaint,,11103.0,,...,,,,,,,,40.763893,-73.914994,"(40.76389308093824, -73.91499378600639)"
8,34217294,09/01/2016 12:02:07 AM,09/01/2016 03:46:46 AM,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,Residential Building/House,10468.0,2685 GRAND CONCOURSE,...,,,,,,,,40.866838,-73.893802,"(40.86683824468553, -73.89380189150275)"
9,34213164,09/01/2016 12:03:16 AM,09/01/2016 03:46:38 AM,NYPD,New York City Police Department,Noise - Commercial,Loud Music/Party,Store/Commercial,11222.0,284 DRIGGS AVENUE,...,,,,,,,,40.722566,-73.948944,"(40.7225662037401, -73.94894420645994)"
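
Reading from a URL loads the table into memory but does not store a copy in PUIdata. A minimal sketch of saving the file to disk first, assuming Python 3 and that PUIDATA is set (or a directory is passed explicitly); `fetch_to_puidata` is a hypothetical helper name.

```python
import os
from urllib.request import urlretrieve

def fetch_to_puidata(url, puidata=None):
    """Download url into the PUIdata directory and return the local path.

    fetch_to_puidata is a hypothetical helper. The file keeps the last
    component of the URL as its name and is not downloaded again if it
    is already present.
    """
    puidata = puidata or os.getenv("PUIDATA")
    if puidata is None:
        raise RuntimeError("PUIDATA is not set")
    os.makedirs(puidata, exist_ok=True)
    dest = os.path.join(puidata, os.path.basename(url))
    if not os.path.exists(dest):
        urlretrieve(url, dest)
    return dest
```

The returned path can be passed to pd.read_csv, so later runs of the notebook read the local copy instead of downloading the file again.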


## Downloading zipped data and unpacking it into PUIdata

The data are available on [Pronto's Website](http://www.prontocycleshare.com/datachallenge). The total download is about 70MB, and the unzipped files are around 900MB. This example is taken from Jake VanderPlas' Pythonic Perambulations blog: https://github.com/jakevdp/PythonicPerambulations/blob/master/content/downloads/notebooks/ProntoData.ipynb

In [4]:
# downloading a zipped file
!curl -O https://s3.amazonaws.com/pronto-data/open_data_year_one.zip
# unpacking into $PUIDATA
!unzip open_data_year_one.zip -d $PUIDATA

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70.8M  100 70.8M    0     0  11.8M      0  0:00:05  0:00:05 --:--:-- 11.5M
Archive:  open_data_year_one.zip
replace /nfshome/fb55/PUIdata/2015_station_data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [5]:
!ls $PUIDATA

2015_station_data.csv
2015_status_data.csv
2015_trip_data.csv
2015_weather_data.csv
311_Service_Requests_from_2010_to_Present_head.csv
311_Service_Requests_from_2010_to_Present_short.csv
dhsdaily.csv
NYPD_7_Major_Felony_Incidents.csv
README.txt
test.txt


Note that if I try to unzip again, the unzip command will ask whether I want to overwrite the existing files (just stop the command with control+c and remove the files). To run your notebook again, leave the lines of code in the notebook and comment them out.

In [6]:
# downloading a zipped file
!curl -O https://s3.amazonaws.com/pronto-data/open_data_year_one.zip
# unpacking into $PUIDATA
!unzip open_data_year_one.zip -d $PUIDATA

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70.8M  100 70.8M    0     0  83.5M      0 --:--:-- --:--:-- --:--:-- 83.5M
Archive:  open_data_year_one.zip
replace /nfshome/fb55/PUIdata/2015_station_data.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
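
One way to avoid the overwrite prompt when re-running the notebook is to unpack from Python and skip files that already exist. The sketch below uses the standard zipfile module and assumes Python 3; `unzip_missing` is a hypothetical helper name. From the shell, unzip's `-n` option (never overwrite existing files) has a similar effect.

```python
import os
import zipfile

def unzip_missing(zip_path, dest_dir):
    """Extract only the archive members that are not already in dest_dir.

    unzip_missing is a hypothetical helper. It returns the list of member
    names that were extracted on this call.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if os.path.exists(os.path.join(dest_dir, name)):
                continue  # already unpacked: skip instead of prompting
            archive.extract(name, dest_dir)
            extracted.append(name)
    return extracted
```

Note that this only checks whether a file with the same name exists; it does not compare contents, so a partially written file from an interrupted run would be kept.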


# Downloading data from the CUSP DF - you can only do it if you have a compute account

In [None]:
DFDATA = os.getenv("DFDATA")

# on the CUSP data facility (compute, jupyterhub) DFDATA should point to /gws/open/NYCOpenData/nycopendata/data/
# set it up on compute as export DFDATA="/gws/open/NYCOpenData/nycopendata/data/"
df_gas = pd.read_csv(DFDATA + "/uedp-fegm/1414245967/uedp-fegm")

In [None]:
df_gas
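
If DFDATA is not set, os.getenv returns None and the string concatenation above raises a TypeError. A minimal sketch of a more forgiving path builder, assuming the default location quoted in the comments above; `dfdata_path` and `DEFAULT_DFDATA` are hypothetical names.

```python
import os

# Default location on the CUSP data facility, as quoted in the notebook comments
DEFAULT_DFDATA = "/gws/open/NYCOpenData/nycopendata/data/"

def dfdata_path(*parts):
    """Build a path under DFDATA, falling back to the default CUSP location.

    dfdata_path is a hypothetical helper. os.path.join takes care of the
    separator, whether or not DFDATA ends with a slash.
    """
    base = os.getenv("DFDATA") or DEFAULT_DFDATA
    return os.path.join(base, *parts)
```

With this helper the read would become `pd.read_csv(dfdata_path("uedp-fegm", "1414245967", "uedp-fegm"))`, which still only works on a machine where that directory is mounted.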