# Planet: Understanding the Amazon from Space

## Getting the data

In [1]:
from fastai.vision import *
import os
import shutil
import sys  # used by the {sys.executable} and {sys.prefix} commands below

You can download the data directly using the [Kaggle API](https://github.com/Kaggle/kaggle-api). 

First, install the Kaggle API by uncommenting the following line and executing it, or by running it in your terminal. Depending on your platform, you may need to modify the command slightly: run `source activate fastai` (or similar) first, prefix `pip` with a path, or append `--user`. Have a look at how `conda install` is called for your platform.

In [2]:
# ! {sys.executable} -m pip install kaggle --upgrade

Then you need to upload your Kaggle credentials to your instance. Log in to [Kaggle](https://www.kaggle.com) and click on your profile picture in the top right corner, then 'My account'. Scroll down until you find a button named 'Create New API Token' and click on it. This will trigger the download of a file named 'kaggle.json'.

Upload this file to the directory this notebook is running in, by clicking "Upload" on your main Jupyter page, then uncomment and execute the next two commands (or run them in a terminal). For Windows, uncomment the last two commands.

In [3]:
# ! mkdir -p ~/.kaggle/
# ! mv kaggle.json ~/.kaggle/

# For Windows, uncomment these two commands
# ! mkdir %userprofile%\.kaggle
# ! move kaggle.json %userprofile%\.kaggle
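
If you would rather not depend on platform-specific shell commands, the same step can be sketched in Python. This is a minimal sketch, not part of the original notebook: the helper name `install_kaggle_token` and its `config_dir` parameter are hypothetical. It also restricts the file permissions, because the Kaggle client warns when the token is readable by other users.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_file="kaggle.json", config_dir=None):
    """Move the API token into the Kaggle config folder and restrict its permissions.

    `config_dir` is a hypothetical parameter; it defaults to ~/.kaggle,
    the same location the shell commands above use.
    """
    token_file = Path(token_file)
    if not token_file.is_file():
        raise FileNotFoundError(
            "{} not found; download it from your Kaggle account page first".format(token_file)
        )
    config_dir = Path(config_dir) if config_dir is not None else Path.home() / ".kaggle"
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.move(str(token_file), str(target))
    # Owner read/write only (like `chmod 600`); this has limited effect on Windows.
    os.chmod(str(target), stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token()` from the directory that contains the uploaded `kaggle.json` should have the same effect as the two `mkdir`/`mv` commands.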

You're all set to download the data from the [planet competition](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space). **You first need to go to its main page and accept its rules**, then run the two cells below (uncomment the shell commands to download and unzip the data). If you get a ```403 Forbidden``` error, it means you haven't accepted the competition rules yet (go to the competition page, click on the *Rules* tab, and scroll to the bottom to find the accept button).

In [4]:
path = Path('../data')
path.mkdir(parents=True, exist_ok=True)
path

PosixPath('../data')

In [5]:
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train-jpg.tar.7z -p {path}  
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f train_v2.csv -p {path}
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg.tar.7z -p {path}  
# ! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg-additional.tar.7z -p {path} 
# ! unzip -q -n {path}/train_v2.csv.zip -d {path}

If you have trouble downloading the data with the command line tool (there seems to have been an issue getting this particular dataset from the CLI), you can navigate to the [competition's data page](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/data) and download the files manually. You only need these files: ```train-jpg.tar.7z```, ```train_v2.csv.zip```, ```test-jpg.tar.7z```, and ```test-jpg-additional.tar.7z```. Then upload the files to the directory this notebook is running in by clicking "Upload" on your main Jupyter page, and run the cell below to move them into the data folder.

In [6]:
! mv train-jpg.tar.7z {path}
! mv test-jpg.tar.7z {path}
! mv test-jpg-additional.tar.7z {path}
! mv train_v2.csv.zip {path}
! unzip -q -n {path}/train_v2.csv.zip -d {path}
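
The `mv` commands above fail if one of the files was never uploaded, so it can help to check first. The following is a small sketch under that assumption; the helper `missing_downloads` is hypothetical, and the file names are the ones used in the cell above.

```python
from pathlib import Path

# File names taken from the `mv` commands in the cell above.
EXPECTED_FILES = [
    "train-jpg.tar.7z",
    "train_v2.csv.zip",
    "test-jpg.tar.7z",
    "test-jpg-additional.tar.7z",
]


def missing_downloads(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]
```

Running `missing_downloads('.')` before the move, or `missing_downloads(path)` after it, returns an empty list when everything is in place.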

To extract the contents of the `.tar.7z` archives, we'll need 7zip, so uncomment the following line if you need to install it.

In [7]:
# ! conda install --yes --prefix {sys.prefix} -c haasad eidl7zip
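
Before installing anything, you may want to check whether a 7-Zip executable is already on your `PATH`. This is a sketch using the standard library; the helper name `find_7zip` is hypothetical, and the extra candidate names `7z` and `7zr` are assumptions about what other installations provide (this notebook itself only calls `7za`).

```python
import shutil


def find_7zip(candidates=("7za", "7z", "7zr")):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None
```

If `find_7zip()` returns `None`, run the install command above. If it finds a binary other than `7za`, the extraction commands below would need to be adjusted to call that name instead.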

And now we can unpack the data (this might take a few minutes to complete).

In [8]:
! 7za -bd -y -so x {path}/train-jpg.tar.7z | tar xf - -C {path}
! 7za -bd -y -so x {path}/test-jpg.tar.7z | tar xf - -C {path}
! 7za -bd -y -so x {path}/test-jpg-additional.tar.7z | tar xf - -C {path}

In [9]:
! rm {path}/*.tar.7z
! rm {path}/*.zip

Having two separate test folders is a bit annoying, so I'll combine them from the start to make life easier moving forward.

In [10]:
source = str(path/'test-jpg-additional/')
dest = str(path/'test-jpg')
files = os.listdir(source)
for f in files:
    shutil.move(source+'/'+f, dest)

In [12]:
! rm -r {path}/'test-jpg-additional'
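
The same merge can also be written with `pathlib`, which avoids building paths by string concatenation. This is one possible variant rather than the notebook's own method; the helper name `merge_folder` is hypothetical, and it checks for name collisions before moving each file.

```python
import shutil
from pathlib import Path


def merge_folder(source, dest):
    """Move every entry from `source` into `dest` and return how many were moved.

    Raises FileExistsError instead of overwriting an entry with the same name.
    """
    source, dest = Path(source), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    moved = 0
    for item in sorted(source.iterdir()):
        target = dest / item.name
        if target.exists():
            raise FileExistsError("{} already exists".format(target))
        shutil.move(str(item), str(target))
        moved += 1
    return moved
```

For this dataset the call would be `merge_folder(path/'test-jpg-additional', path/'test-jpg')`, after which the emptied source folder can be removed with `(path/'test-jpg-additional').rmdir()`.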

In [13]:
print('# File sizes')
p = str(path)
for f in os.listdir(p):
    full = os.path.join(p, f)
    if not os.path.isdir(full):
        # Plain file: report its size in MB (1 MB = 1,000,000 bytes)
        print(f.ljust(30) + str(round(os.path.getsize(full) / 1000000, 2)) + 'MB')
    else:
        # Directory: sum the sizes of its direct children (not recursive)
        sizes = [os.path.getsize(os.path.join(full, x)) / 1000000 for x in os.listdir(full)]
        print(f.ljust(30) + str(round(sum(sizes), 2)) + 'MB' + '({} files)'.format(len(sizes)))

# File sizes
test-jpg                      958.88MB(61191 files)
train_v2.csv                  1.43MB
train-jpg                     634.68MB(40479 files)
densenet121.pkl               32.64MB
resnet50.pkl                  102.86MB
resnet152.pkl                 241.92MB
test_labels.csv               1.57MB
resnet101.pkl                 179.1MB
submissions                   33.03MB(14 files)
models                        3169.82MB(13 files)
addtl_labels.csv              0.79MB
densenet169.pkl               57.77MB
