# Data Retrieval

We will use data from [Galaxy Zoo 1](https://www.zooniverse.org/projects/zookeeper/galaxy-zoo/about/results#galaxy-zoo-1), which can be accessed through SQL queries in the [SDSS DR19 SkyServer](https://skyserver.sdss.org/dr19/en/home.aspx). You will need to create an account.

For our purposes we need three datasets:
* Galaxy morphology classifications (elliptical or spiral) from the [zooVotes table](https://skyserver.sdss.org/dr14/en/help/browser/browser.aspx#&&history=description+zooVotes+U) or the [zooSpec table](https://skyserver.sdss.org/dr14/en/help/browser/browser.aspx#&&history=description+zooSpec+U), to be used as the target when training the ML model.
* Photometry data for the objects that have a morphology classification, from [PhotoObjAll](https://skyserver.sdss.org/dr14/en/help/browser/browser.aspx#&&history=description+PhotoObjAll+U) or [PhotoObjDR7](https://skyserver.sdss.org/dr14/en/help/browser/browser.aspx#&&history=description+PhotoObjDR7+U). It can be used to select the objects of interest and also as features for the ML model.
* Images of the galaxies, from the [cutout service](https://skyserver.sdss.org/dr14/en/tools/chart/image.aspx), to be used as features for the ML model.


## Galaxy Zoo and SDSS Photometry data


Submit the query below as a [CasJobs](https://skyserver.sdss.org/CasJobs/default.aspx) job and download the resulting table in CSV format.

```
SELECT ZooSpec.*, PhotoObjDR7.*
INTO MyDB.ZooSpecPhoto
FROM ZooSpec
INNER JOIN PhotoObjDR7 ON PhotoObjDR7.dr7objid = ZooSpec.dr7objid
```




## Load the CSV data

In [1]:
import pandas as pd

In [2]:
ZOO_DATA = '/home/torradeflot/Downloads/ZooSpecPhotoDR19_torradeflot.csv'

In [3]:
df = pd.read_csv(ZOO_DATA)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 659272 entries, 0 to 659271
Data columns (total 90 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   specobjid      659272 non-null  int64  
 1   objid          659272 non-null  int64  
 2   dr7objid       659272 non-null  int64  
 3   ra             659272 non-null  float64
 4   dec            659272 non-null  float64
 5   rastring       659272 non-null  object 
 6   decstring      659272 non-null  object 
 7   nvote          659272 non-null  int64  
 8   p_el           659272 non-null  float64
 9   p_cw           659272 non-null  float64
 10  p_acw          659272 non-null  float64
 11  p_edge         659272 non-null  float64
 12  p_dk           659272 non-null  float64
 13  p_mg           659272 non-null  float64
 14  p_cs           659272 non-null  float64
 15  p_el_debiased  659272 non-null  float64
 16  p_cs_debiased  659272 non-null  float64
 17  spiral         659272 non-nul

In [26]:
f'{df.memory_usage().values.sum()/1024/1024:.2f} MiB'

'452.69 MiB'

In [29]:
df.head(2)

| | specobjid | objid | dr7objid | ra | dec | rastring | decstring | nvote | p_el | p_cw | ... | secTarget | extinction_u | extinction_g | extinction_r | extinction_i | extinction_z | htmID | fieldID | Column4 | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1578595830217074688 | 1237661463301455961 | 587735742076551256 | 233.8266 | 34.68083 | 15:35:18.38 | +34:40:51.0 | 29 | 0.207 | 0.0 | ... | 0 | 0.11352 | 0.083527 | 0.060581 | 0.045937 | 0.03257 | 15330488891386 | 587735742076551168 | 394855002142670848 | 6.489538 |
| 1 | 1578598304118237184 | 1237661463301456237 | 587735742076551517 | 233.7615 | 34.60428 | 15:35:02.75 | +34:36:15.4 | 40 | 0.0 | 1.0 | ... | 0 | 0.113007 | 0.08315 | 0.060307 | 0.045729 | 0.032422 | 15330488568818 | 587735742076551168 | 394855002180419584 | 4.706573 |


## Image data

Three different cutout services, with the approximate download time per image:

* [SDSS Cutout service](https://skyserver.sdss.org/dr19/VisualTools) (0.4 s/img)
* [HIPS2FITS](https://alasky.cds.unistra.fr/hips-image-services/hips2fits) (0.3 s/img)
* [Legacy Survey](https://www.legacysurvey.org/svtips/) (VERY SLOW! 2-4 s/img)


In [147]:
from pathlib import Path
import functools
import logging
import time
import urllib.request
import concurrent.futures

IMAGE_PIXSCALE = 0.4  # arcsec/pixel
IMAGE_SIZE_PX = 64
IMAGE_WIDTH_PX = IMAGE_SIZE_PX
IMAGE_HEIGHT_PX = IMAGE_SIZE_PX
#URL = 'http://skyserver.sdss.org/dr14/SkyServerWS/ImgCutout/getjpeg?ra={ra}&dec={dec}&scale={scale}&width={width}&height={height}'
URL = (
    'https://skyserver.sdss.org/DR19/SkyserverWS/ImgCutout/getjpeg?'
    'ra={ra}&dec={dec}&scale={scale}&width={width}&height={height}'
)

FOLDER_NAME = 'images/{scale}/{width}x{height}'
FILE_NAME = 'ra{ra}_dec{dec}.jpg'

SAMPLE_SIZE = 100
MAX_TRIES = 3

def timing(f):
    """Decorator that logs the wall-clock time of each call to f."""
    @functools.wraps(f)
    def wrap(*args, **kwargs):
        time1 = time.time()
        ret = f(*args, **kwargs)
        time2 = time.time()
        logging.info('%s function took %0.3f ms', f.__name__, (time2 - time1) * 1000.0)
        return ret
    return wrap

def download_image_bytes(ra, dec, scale=IMAGE_PIXSCALE, width=IMAGE_WIDTH_PX, height=IMAGE_HEIGHT_PX):
    """Request a cutout from the SDSS service and return the HTTP response, retrying on failure."""
    url = URL.format(ra=ra, dec=dec, scale=scale, width=width, height=height)
    n_tries = 0
    while n_tries < MAX_TRIES:
        try:
            # Return as soon as one request succeeds
            return urllib.request.urlopen(url)
        except Exception:
            # logging.exception records the message together with the traceback
            logging.exception('Download failed for %s', url)
            n_tries += 1
            if n_tries < MAX_TRIES:
                logging.info('Going to retry')
    raise Exception('Number of allowed retries exceeded!')

def download_image_jpg(
    objid, ra, dec, scale=IMAGE_PIXSCALE, width=IMAGE_WIDTH_PX,
    height=IMAGE_HEIGHT_PX, root_folder='.', service='SDSS'
):

    folder = str(objid % 100)

    p = Path(root_folder)
    child_folder = p / folder
    child_folder.mkdir(parents=True, exist_ok=True)
        
    whole_file_path = child_folder / f'{objid}.jpg'
    if whole_file_path.exists():
        print('file already exists: {}'.format(whole_file_path))
    else:
        n_tries = 0
        downloaded = False
        while n_tries < MAX_TRIES and (not downloaded):
            try:
                url = get_url(service, ra=ra, dec=dec, scale=scale, width=width, height=height)
                urllib.request.urlretrieve(url, whole_file_path)
                downloaded = True
            except Exception:
                # Count the failed attempt and pause briefly before retrying
                n_tries += 1
                time.sleep(0.1)
        if not downloaded:
            raise Exception(f'Number of tries exceeded for object {objid}')

def download_image_batch(
    images,
    scale=IMAGE_PIXSCALE, width=IMAGE_WIDTH_PX,
    height=IMAGE_HEIGHT_PX, root_folder='.', service='SDSS',
    max_workers=10
):
    # We can use a with statement to ensure threads are cleaned up promptly
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Start the download operations and map each future to its image record
        future_to_image = {executor.submit(
            download_image_jpg,
            **image,
            scale=scale, width=width, height=height,
            root_folder=root_folder,
            service=service
        ): image for image in images}
        for future in concurrent.futures.as_completed(future_to_image):
            image = future_to_image[future]
            try:
                # result() re-raises any exception from the worker thread
                future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (image, exc))

def get_url(service, *args, **kwargs):
    if service == 'SDSS':
        return get_SDSS_url(*args, **kwargs)
    elif service == 'HIPS2FITS':
        return get_HIPS2FITS_url(*args, **kwargs)
    elif service == 'LEGACYSURVEY':
        return get_LEGACYSURVEY_url(*args, **kwargs)
    else:
        raise Exception(f'Service={service} not allowed, choose one of [SDSS, HIPS2FITS, LEGACYSURVEY]')

def get_SDSS_url(ra, dec, scale, width, height):
    return URL.format(ra=ra, dec=dec, scale=scale, width=width, height=height)

HIPS2FITS_URL = (
    'https://alasky.cds.unistra.fr/hips-image-services/hips2fits?'
    'hips=CDS%2FP%2FSDSS9%2Fcolor'
    '&width={width}&height={height}&fov={fov}&projection=TAN'
    '&ra={ra}&dec={dec}'
    '&format=jpg'
)

def get_HIPS2FITS_url(ra, dec, scale, width, height):
    fov = max(width, height)*scale/3600
    return HIPS2FITS_URL.format(ra=ra, dec=dec, fov=fov, width=width, height=height)

LEGACYSURVEY_URL = (
    "https://www.legacysurvey.org/viewer/cutout.jpg?"
    "ra={ra}&dec={dec}&pixscale={scale}&layer=sdss&size={size}"
)
def get_LEGACYSURVEY_url(ra, dec, scale, width, height):
    return LEGACYSURVEY_URL.format(ra=ra, dec=dec, scale=scale, size=max(width, height))
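
The HIPS2FITS service takes a field of view in degrees rather than a pixel scale, which is why `get_HIPS2FITS_url` converts `scale` (arcsec/pixel) and the image size into `fov`. The sketch below works through that conversion for the default 64 px cutout at 0.4 arcsec/pixel. The helper name `cutout_fov_deg` is made up for this illustration and is not used elsewhere in the notebook.

```python
ARCSEC_PER_DEG = 3600

def cutout_fov_deg(scale_arcsec_per_px, width_px, height_px):
    """Angular size in degrees of the longest side of the cutout."""
    return max(width_px, height_px) * scale_arcsec_per_px / ARCSEC_PER_DEG

# Default cutout of this notebook: 64 x 64 px at 0.4 arcsec/pixel
fov_deg = cutout_fov_deg(0.4, 64, 64)
fov_arcsec = fov_deg * ARCSEC_PER_DEG
print(f'{fov_arcsec:.1f} arcsec = {fov_deg:.5f} deg')  # 25.6 arcsec = 0.00711 deg
```

Each image therefore covers about 25.6 arcsec on a side, whichever service delivers it.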


In [30]:
df[['objid', 'ra', 'dec']].head(5)

| | objid | ra | dec |
|---|---|---|---|
| 0 | 1237661463301455961 | 233.8266 | 34.68083 |
| 1 | 1237661463301456237 | 233.7615 | 34.60428 |
| 2 | 1237661463301521615 | 233.912 | 34.5407 |
| 3 | 1237661463301521650 | 233.9483 | 34.48045 |
| 4 | 1237661463301587266 | 234.0413 | 34.49664 |


In [70]:
print(df[['ra', 'dec', 'objid']].head(5).to_csv(index=False))

ra,dec,objid
233.8266,34.68083,1237661463301455961
233.7615,34.60428,1237661463301456237
233.912,34.5407,1237661463301521615
233.9483,34.48045,1237661463301521650
234.0413,34.49664,1237661463301587266



### Serial download

In [51]:
%%time
download_image_jpg(*df.iloc[10000][['objid', 'ra', 'dec']].values)

CPU times: user 6.07 ms, sys: 0 ns, total: 6.07 ms
Wall time: 535 ms


In [108]:
%%time
for i in range(100):
    download_image_jpg(*df.iloc[400 + i][['objid', 'ra', 'dec']].values,
                       root_folder='./imgs/SDSS', service='SDSS')

file already exists: imgs/SDSS/96/1237661463844028496.jpg
file already exists: imgs/SDSS/87/1237661463844028687.jpg
file already exists: imgs/SDSS/19/1237661463844094319.jpg
file already exists: imgs/SDSS/87/1237661463844159987.jpg
file already exists: imgs/SDSS/15/1237661463844160015.jpg
file already exists: imgs/SDSS/59/1237661463844225259.jpg
file already exists: imgs/SDSS/85/1237661463844225285.jpg
file already exists: imgs/SDSS/93/1237661463844225493.jpg
file already exists: imgs/SDSS/6/1237661463844225506.jpg
file already exists: imgs/SDSS/43/1237661463844290843.jpg
file already exists: imgs/SDSS/49/1237661463844290849.jpg
file already exists: imgs/SDSS/54/1237661463844290854.jpg
file already exists: imgs/SDSS/29/1237661463844290929.jpg
file already exists: imgs/SDSS/49/1237661463844290949.jpg
file already exists: imgs/SDSS/76/1237661463844421776.jpg
file already exists: imgs/SDSS/2/1237661463844421902.jpg
file already exists: imgs/SDSS/86/1237661463844421986.jpg
file already exi
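
The paths in this output show how `download_image_jpg` spreads files over subfolders: the folder name is `objid % 100`, so images land in at most 100 folders under `root_folder`. A minimal sketch of that path rule, using a hypothetical helper `shard_path` that only builds the path and downloads nothing:

```python
from pathlib import PurePosixPath

def shard_path(objid, root_folder='.'):
    """Path of the JPEG for objid, inside a subfolder named objid % 100."""
    return PurePosixPath(root_folder) / str(objid % 100) / f'{objid}.jpg'

# Two objids taken from the output above
print(shard_path(1237661463844028496, 'imgs/SDSS'))  # imgs/SDSS/96/1237661463844028496.jpg
print(shard_path(1237661463844225506, 'imgs/SDSS'))  # imgs/SDSS/6/1237661463844225506.jpg
```

Folder names are not zero-padded, so a remainder of 6 gives the folder `6`, not `06`.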

In [109]:
%%time
for i in range(100):
    download_image_jpg(*df.iloc[200+i][['objid', 'ra', 'dec']].values,
                       root_folder='./imgs/HIPS2FITS', service='HIPS2FITS')

CPU times: user 474 ms, sys: 94.2 ms, total: 568 ms
Wall time: 31.8 s


In [103]:
%%time
for i in range(100):
    download_image_jpg(*df.iloc[i][['objid', 'ra', 'dec']].values,
                       root_folder='./imgs/LEGACYSURVEY', service='LEGACYSURVEY')

CPU times: user 722 ms, sys: 121 ms, total: 843 ms
Wall time: 2min 57s
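
To put these serial timings in perspective, the sketch below converts the two measured wall times into a per-image cost and a rough projection for all 659272 rows of the catalogue. It assumes the per-image time stays constant over the whole run, which is an optimistic simplification. The SDSS loop is left out because its files were already on disk.

```python
N_OBJECTS = 659272  # rows reported by df.info()

# Wall times in seconds for 100 serial downloads, taken from the runs above
serial_wall_s = {'HIPS2FITS': 31.8, 'LEGACYSURVEY': 2 * 60 + 57}

projected_h = {}
for service, wall_s in serial_wall_s.items():
    per_image_s = wall_s / 100
    # Projected hours for the full catalogue at a constant rate
    projected_h[service] = per_image_s * N_OBJECTS / 3600
    print(f'{service}: {per_image_s:.2f} s/img, '
          f'about {projected_h[service]:.0f} h for the full catalogue')
```

At these rates a serial download of the full catalogue would take on the order of days.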


### Parallel download

In [152]:
%%time
BATCH_SIZE = 100
INIT_INDEX = 300
images = df.iloc[INIT_INDEX:INIT_INDEX + BATCH_SIZE][['objid', 'ra', 'dec']].to_dict(orient='records')

download_image_batch(images, root_folder='./imgs/HIPS2FITS', service='HIPS2FITS', max_workers=20)

CPU times: user 374 ms, sys: 94.5 ms, total: 469 ms
Wall time: 3.46 s
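
With 20 workers the same 100-image HIPS2FITS batch takes 3.46 s instead of 31.8 s, roughly a 9x speedup. One way to extend this to the whole catalogue is to split the records into fixed-size batches and pass each one to `download_image_batch`. The sketch below shows only the batching step on toy records. `iter_batches` is a hypothetical helper, not part of the notebook.

```python
import math

def iter_batches(records, batch_size):
    """Yield consecutive slices of records with at most batch_size items each."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Toy records standing in for df[['objid', 'ra', 'dec']].to_dict(orient='records')
records = [{'objid': i, 'ra': 0.0, 'dec': 0.0} for i in range(250)]
batches = list(iter_batches(records, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]

# Number of 100-image batches needed for the 659272 rows of the catalogue
n_batches = math.ceil(659272 / 100)
print(n_batches)  # 6593
```

If every batch took the 3.46 s measured above, 6593 batches would need a little over 6 hours. That is an extrapolation from a single run, and the service may throttle sustained traffic.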
