# Welcome to the SETI Institute Code Challenge!

This first tutorial explains what the data are and where to get them.

# Introduction

For the Code Challenge, you will be using what we call the **"primary" data set**. The primary data set has

    * 350,000 labeled, simulated signals
    * 7 different labels, or "signal classifications"
    * about 128 GB of data in total
    
This data set should be used to train your models. 

**You do not need to use all the data to train your models.** Use as much of the set as you want or need.

There are also **`small` and `medium` subsets** of these primary data files.

Read on to learn how to download these data.

## Simple Data Format

Each data file has a simple format: 

    * a JSON header in the first line that contains:
        * UUID
        * signal_classification (label)
    * followed by a stream of complex-valued time-series data.

The `ibmseti` Python package is available to assist in reading this data and performing some basic operations for you. 
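
As a rough illustration of the layout described above, the sketch below splits a file's bytes at the first newline and decodes the JSON header using only the standard library. The function name `split_header` and the sample bytes are hypothetical, and the header key spellings (`uuid`, `signal_classification`) follow the ones quoted in the test set section of this tutorial. The encoding of the time-series payload is not specified here, so the payload is returned as raw bytes; the `ibmseti` package remains the intended way to decode it.

```python
import json


def split_header(raw_bytes):
    """Split raw file contents into (header_dict, payload_bytes).

    Assumes the first line is a JSON object and everything after the
    first newline is the binary time-series payload.
    """
    header_line, _, payload = raw_bytes.partition(b"\n")
    header = json.loads(header_line.decode("utf-8"))
    return header, payload


# Hypothetical file contents, made up for illustration only.
sample = b'{"uuid": "example-uuid", "signal_classification": "example_class"}\n\x01\x02\x03\x04'
header, payload = split_header(sample)
print(header["signal_classification"], len(payload))
```

For a real file, the same function could be applied to `open(path, 'rb').read()`.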

## Basic Warmup Data Set

There is also a second data set that you may use for warmup, which we call the **"basic" data set**. It contains

    * 4 different signal classifications
    * 1000 simulation files for each class: 4000 files total
    * ~1 GB in total. 
       
This basic set should be used as a sanity check and for early-stage prototyping. 

### Basic Set versus Primary Set

> The difference between the `basic` and `primary` data sets is that the signals simulated in the `basic` set have, on average, a much higher signal-to-noise ratio (they are larger-amplitude signals). They also have other characteristics that make the different signal classes very distinguishable. **You should be able to get very high signal classification accuracy with the basic data set.** The signals in the primary data set have smaller amplitudes and can look more similar to each other, which makes high classification accuracy harder to achieve. There are also only 4 classes in the basic data set, versus 7 classes in the primary set.




# IBM Object Storage

The data are stored in `containers` on IBM Object Storage. You can access these data with HTTP calls. 

The URL for all data files is composed of

  `base_url/container/objectname`.
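
A minimal sketch of this composition is shown below. The helper name `object_url` and the placeholder base URL are assumptions made for illustration; the container and object names are ones used later in this tutorial.

```python
def object_url(base_url, container, objectname):
    """Join the three URL parts with single slashes."""
    return "/".join([base_url.rstrip("/"), container.strip("/"), objectname.lstrip("/")])


# Placeholder base URL for illustration; use the real `base_url` in practice.
example = object_url("https://example.org/v1/AUTH_placeholder", "simsignals_basic_v2", "basic4.zip")
print(example)
```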
 
The `base_url` is:

In [None]:
#If you are running this in IBM Apache Spark (via Data Science Experience)
base_url = 'https://dal05.objectstorage.service.networklayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#ELSE, if you are outside of IBM:
#base_url = 'https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b'

#NOTE: if you are outside of IBM, pulling down data will be slower. :/

In [None]:
#Defining a local data folder to dump data

import os

#Note for hackathon participants that use the Spark Enterprise Cluster: You MUST define a unique local data folder
# to save your work in order to avoid colliding with other teams using the enterprise clusters.
mydatafolder = 'my_team_name_data_folder'
if not os.path.exists(mydatafolder):
    os.makedirs(mydatafolder)

## Basic Data Set

We'll start with the basic data set.  Because the basic data set is small, we've created a `.zip` file of the full data set that you can download directly.  

In [None]:
import os

In [None]:
basic_container = 'simsignals_basic_v2'
basic4_zip_file = 'basic4.zip'

In [None]:
os.system('curl {}/{}/{} > {}'.format(base_url, basic_container, basic4_zip_file, mydatafolder + '/' + basic4_zip_file))

In [None]:
!ls -al my_team_name_data_folder/basic4.zip
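
Once the archive is on disk, it still needs to be unpacked. The sketch below uses the standard `zipfile` module; the helper name `extract_zip` and the `basic4` destination subfolder are assumptions, and the internal layout of the archive is not described here, so inspect the returned names before relying on a particular structure.

```python
import os
import zipfile


def extract_zip(zip_path, dest_folder):
    """Extract every member of zip_path into dest_folder; return member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_folder)
        return archive.namelist()


# Only runs if the archive has already been downloaded to the folder used above.
zip_path = os.path.join("my_team_name_data_folder", "basic4.zip")
if os.path.exists(zip_path):
    names = extract_zip(zip_path, os.path.join("my_team_name_data_folder", "basic4"))
    print("extracted {} files".format(len(names)))
```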

### Basic UUID/Class List

To assist with analysis, we've included a file that contains the `UUID, signal classification` for the basic data set. (The `signal_classification` is also in the header of each simulation file, if you wish to use that instead.)

* public_list_basic_v2_26may_2017.csv

https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_files/public_list_basic_v2_26may_2017.csv
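
One quick use of this list is to check how many files each class has. The sketch below assumes the CSV has a header row followed by rows whose first column is the UUID and whose second column is the signal classification; the helper name `count_classes`, the header names, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_classes(csv_text):
    """Return a Counter of signal classifications from the index CSV text.

    Assumes a header row followed by rows of `UUID, signal_classification`.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return Counter(row[1].strip() for row in reader if len(row) >= 2)


# Made-up rows for illustration; the real file has one row per simulation.
sample_csv = u"UUID,SIGNAL_CLASSIFICATION\nid-1,class_a\nid-2,class_b\nid-3,class_a\n"
print(count_classes(sample_csv))
```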

## Primary Data Set

The full primary data set is very large. Thus, we provide a `primary_small` subset and a `primary_medium` subset of the data.  The `primary_small` contains 1000 simulations of each of the seven classes and is 2.6 GB in size (1.9 GB zipped). The `primary_medium` subset contains 10000 simulations of each of the seven classes and is 26 GB in size (19 GB zipped).  
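
As a sanity check on these figures, the arithmetic below estimates the average file size from the `primary_small` numbers and scales it to the 350,000 files of the full set. Decimal units (1 GB = 1000 MB) are an assumption; the result is in line with the roughly 128 GB quoted in the introduction.

```python
# Figures quoted above (sizes in GB, decimal units assumed).
small_size_gb = 2.6
small_file_count = 1000 * 7
full_file_count = 350000

per_file_mb = small_size_gb * 1000 / small_file_count
full_estimate_gb = small_size_gb / small_file_count * full_file_count

print("about {:.2f} MB per file".format(per_file_mb))
print("about {:.0f} GB for the full set".format(full_estimate_gb))
```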

### Primary Small

The `primary_small` subset can be found in a zip file at

* container = 'simsignals_v2_zipped'
* objectname = 'primary_small.zip'

In [None]:
primary_small_url = '{}/simsignals_v2_zipped/primary_small.zip'.format(base_url)
os.system('curl {} > {}'.format(primary_small_url, mydatafolder +'/primary_small.zip'))
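
If `curl` is not available on your machine, the same download can be done in pure Python. This is a sketch only, written for Python 3 and its standard `urllib.request` module; the helper name `download` and the 30 second timeout are assumptions, not part of the challenge tooling.

```python
import shutil
from urllib.request import urlopen


def download(url, dest_path, timeout=30.0):
    """Stream the resource at url to dest_path without loading it all in memory."""
    with urlopen(url, timeout=timeout) as response, open(dest_path, "wb") as fout:
        shutil.copyfileobj(response, fout)
    return dest_path
```

For example, `download(primary_small_url, mydatafolder + '/primary_small.zip')` would mirror the `curl` call above.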

#### Primary Small UUID/Class List

As with the basic set, there is a CSV file containing the `UUID, signal classification` for each file in the `primary_small` subset.

In [None]:
primary_small_csv_url = '{}/simsignals_files/public_list_primary_v2_small_1june_2017.csv'.format(base_url)
os.system('curl {} > {}'.format(primary_small_csv_url, mydatafolder +'/public_list_primary_v2_small_1june_2017.csv'))

### Primary Medium

Similarly, the `primary_medium` subset can be found in a handful of zip files:

* container = 'simsignals_v2_zipped'
* objectname = 'primary_medium_N.zip'
* for N = 1, 2, 3, 4, 5, 6

In [None]:
med_N = '{}/simsignals_v2_zipped/primary_medium_{}.zip'

for i in range(1,7):
    med_url = med_N.format(base_url, i)
    output_file = mydatafolder + '/primary_medium_{}.zip'.format(i)
    print('GETing {}'.format(output_file))
    os.system('curl {} > {}'.format(med_url, output_file ))

#### Primary Medium UUID/Class List

Here too, there is a CSV file containing the `UUID, signal classification` for each file in the `primary_medium` subset.

In [None]:
med_csv_url = '{}/simsignals_files/public_list_primary_v2_medium_1june_2017.csv'.format(base_url)
os.system('curl {} > {}'.format(med_csv_url, mydatafolder + '/public_list_primary_v2_medium_1june_2017.csv'))

### Primary Full set

Because the full set is so large, we currently only have these 350,000 files available individually on object storage. (However, we are working to zip these files up to make them easier to consume.)

The `primary_full` list can be found here: 

In [None]:
prim_full = '{}/simsignals_files/public_list_primary_v2_full_1june_2017.csv'.format(base_url)
os.system('curl {} > {}'.format(prim_full, mydatafolder + '/public_list_primary_v2_full_1june_2017.csv'))

You can download this list and begin to pull down files individually if desired. Be warned, however: this will take approximately a billion years if you are not running on IBM Apache Spark, because IBM Apache Spark and Object Storage sit in the same data center and share a fast network connection.

The data are found in 

`base_url/simsignals_v2/<uuid>.dat`

For example:

https://dal.objectstorage.open.softlayer.com/v1/AUTH_cdbef52bdf7a449c96936e1071f0a46b/simsignals_v2/aa7d082f-9263-4533-a9d4-5595c5cdde25.dat

**We are working to make the primary full data set easier to consume as a set of .zip files, as the current setup is less than ideal. You will be notified. The data will already be available for participants of the hackathon, however.**

If you wish to begin downloading the full data set programmatically, you may use the following code.

In [None]:
import requests
import copy

In [None]:
file_list_container = 'simsignals_files'
file_list = 'public_list_primary_v2_full_1june_2017.csv'
primary_data_container = 'simsignals_v2'

In [None]:
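r = requests.get('{}/{}/{}'.format(base_url, file_list_container, file_list), timeout=(9.0, 21.0))
r.raise_for_status()  # stop early if the list could not be fetched
filecontents = r.text  # decoded text, so the string handling below works on Python 2 and 3

In [None]:
# one [uuid, signal_classification, ...] row per non-empty line
full_primary_files = [line.split(',') for line in filecontents.splitlines() if line.strip()]
print('header {}'.format(full_primary_files[0]))
full_primary_files = full_primary_files[1:] #strip the header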

In [None]:
#save your data into a local subfolder
save_to_folder = 'primary_data_set'
if not os.path.exists(save_to_folder):
    os.mkdir(save_to_folder)

In [None]:
total = len(full_primary_files)
for count, row in enumerate(full_primary_files):
    r = requests.get('{}/{}/{}.dat'.format(base_url, primary_data_container, row[0]), timeout=(9.0, 21.0))

    # report progress every 100 files
    if count % 100 == 0:
        print('done {} out of {}'.format(count, total))

    # the files contain binary data, so write in binary mode
    with open('{}/{}.dat'.format(save_to_folder, row[0]), 'wb') as fout:
        fout.write(r.content)
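
With 350,000 files, a download like this is likely to be interrupted at some point. One possible way to resume, sketched below, is to skip files that are already on disk. The helper name `pending_uuids` is hypothetical, and treating an empty file as "not downloaded" is an assumption about how a failed transfer might look.

```python
import os


def pending_uuids(uuids, folder):
    """Return the UUIDs whose `<uuid>.dat` file is missing or empty in folder."""
    pending = []
    for uuid in uuids:
        path = os.path.join(folder, uuid + ".dat")
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            pending.append(uuid)
    return pending
```

For example, `pending_uuids([row[0] for row in full_primary_files], save_to_folder)` gives the UUIDs that still need to be fetched by the loop above.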

# Test Data Set

Once you've trained your model, finished all of your testing and tweaks, and are ready to submit an entry to the contest, you'll need to download the test data set and apply your model to it.

The test data set is similar to the labeled data, except that the JSON header is missing the 'signal_classification' key, and just contains the 'uuid'. 

Like the other sets, this set is found in a `.zip` file in the `simsignals_v2_zipped` container:

In [None]:
test_set_url = '{}/simsignals_v2_zipped/primary_testset.zip'.format(base_url)
os.system('curl {} > {}'.format(test_set_url, mydatafolder +'/primary_testset.zip'))

There are approximately 1000 simulations of each of the 7 signal classes, but not exactly 1000 (± some largeish number), so you can't cheat :).

See the [Judging Criteria document](https://github.com/setiQuest/ML4SETI/blob/master/Judging_Criteria.ipynb) for more details.