# Areal Project

<div>
<img src="logo.jpg" width=150 ALIGN="left" border="20">
<h1> Starting Kit for raw data (images)</h1>
<br>This code was tested with <br>
Python 3.6.7 <br>
Created by Areal Team <br><br>
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL, 
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. 
</div>

<div>
    <h2>Introduction </h2>
     <br>
Aerial imagery has long been a primary source of geographic data. As technology has progressed, it has become practical for remote sensing: the science of obtaining information about an object, area or phenomenon from a distance.
Image recognition now has many uses, ranging from robotics and drone vision to autonomous vehicles and face detection.
<br>
In this challenge, we use pre-processed data derived from landscape images. The goal is to learn to distinguish common and uncommon landscapes such as a beach, a lake or a meadow.
    The data is a subset of the NWPU-RESISC45 data set, originally used in <a href="https://arxiv.org/pdf/1703.00121.pdf?fbclid=IwAR16qo-EX_Z05ZpxvWG8F-oBU0SlnY-3BPCWBVVOGPyJcVy7BBqCKjnsvJo">Remote Sensing Image Scene Classification</a>. The original data set contains 45 categories, of which we kept 13.

References and credits: 
Yuliya Tarabalka, Guillaume Charpiat, Nicolas Girard for the data sets presentation.<br>
Gong Cheng, Junwei Han, and Xiaoqiang Lu, for the original article on the chosen data set.
</div>

### Requirements 

The next cell installs all the required dependencies on your computer once you uncomment it. Replace pip with pip3 if pip points to Python 2.7 on your computer. Leave the cell commented out if you already have the dependencies or are running inside the challenge's Docker image (named areal/codalab:pytorch, if you know how to run Docker).

In [1]:
#!pip install --user -r requirements.txt

In [2]:
import numpy as np
import random
import re

In [3]:
model_dir = "sample_code_submission"
result_dir = 'sample_result_submission/' 
problem_dir = 'ingestion_program/'  
score_dir = 'scoring_program/'

In [4]:
import sys
sys.path.extend([model_dir, problem_dir, score_dir])  # make the model, ingestion and scoring code importable

Go through the challenge website and watch the trailer video.

#### Question 1: Briefly explain the problem.

TODO

#### Question 2: What is the scoring metric used to evaluate submissions?

TODO

<div>
    <h1> Step 1: Exploratory data analysis </h1>
<p>
We provide sample_data with the starting kit. To prepare your submission, download public_data from the challenge website and point data_dir to it.
</div>

In [7]:
data_dir = 'sample_data'  # change to the path of "public_data" once downloaded from the challenge website
data_name = 'Areal'

<h2 style="color:red " >Warning</h2>

<p style="font-style:italic"> In case you want to load the full data </p> 
The files are large, so make sure enough RAM is available: loading takes about 3-4 GB at its peak and about 1.5 GB once finished.

In [8]:
from ingestion_program.data_io import read_as_df
data = read_as_df(data_dir  + '/' + data_name)

Reading sample_data/Areal_train from AutoML format
Number of examples = 65
Number of features = 49152
        Class
0       beach
1   chaparral
2       cloud
3      desert
4      forest
5      island
6        lake
7      meadow
8    mountain
9       river
10        sea
11   snowberg
12    wetland
Number of classes = 13


In [9]:
data.shape

(65, 49153)

In [10]:
data.head()

Unnamed: 0,pixel_1_1_R,pixel_1_1_G,pixel_1_1_B,pixel_1_2_R,pixel_1_2_G,pixel_1_2_B,pixel_1_3_R,pixel_1_3_G,pixel_1_3_B,pixel_1_4_R,...,pixel_128_126_R,pixel_128_126_G,pixel_128_126_B,pixel_128_127_R,pixel_128_127_G,pixel_128_127_B,pixel_128_128_R,pixel_128_128_G,pixel_128_128_B,target
0,145,145,121,113,113,89,73,75,53,65,...,191,164,134,196,169,139,202,175,145,desert
1,193,168,138,191,166,136,201,176,146,194,...,196,171,140,197,172,141,201,176,145,desert
2,83,86,67,65,68,49,73,78,58,78,...,115,110,91,115,107,88,141,133,114,meadow
3,16,52,48,15,51,47,15,52,45,15,...,65,83,61,58,75,56,68,85,66,river
4,60,79,47,80,99,67,62,81,51,45,...,182,197,202,121,135,144,120,137,147,mountain
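
The column names above (pixel_1_1_R, pixel_1_1_G, pixel_1_1_B, pixel_1_2_R, ...) suggest that each row stores a 128x128 RGB image flattened row by row, with the colour channel varying fastest. Assuming that layout, the sketch below shows how a flat vector of 49152 values maps back to an image with NumPy. The helper `column_index` is hypothetical and is not part of the starting kit.

```python
import numpy as np

HEIGHT, WIDTH, CHANNELS = 128, 128, 3  # 128 * 128 * 3 = 49152 features

def column_index(row, col, channel):
    """0-based position of pixel_<row>_<col>_<channel> in the flat vector.

    Hypothetical helper: assumes 1-based row/col as in the column names
    and the channel order R, G, B.
    """
    return ((row - 1) * WIDTH + (col - 1)) * CHANNELS + "RGB".index(channel)

# Stand-in for one row of the data frame (without the target column)
flat = np.arange(HEIGHT * WIDTH * CHANNELS)
img = flat.reshape(HEIGHT, WIDTH, CHANNELS)

# pixel_1_2_G is the 5th pixel column (index 4) and lands at img[0, 1, 1]
print(column_index(1, 2, "G"), img[0, 1, 1])
print(img.shape)
```

Under this assumption, `reshape(128, 128, 3)` recovers the image directly, with no transposition needed.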


In [None]:
data.describe()

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

num_toshow = 6
fig, _axs = plt.subplots(nrows=2, ncols=3, figsize=(10,10))
fig.subplots_adjust(hspace=0.3)
axs = _axs.flatten()

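for i in range(num_toshow):
    # Every column except the last ("target") holds pixel values
    img = data.iloc[i].values[:-1].reshape(128, 128, 3)
    label = data.iloc[i].values[-1]  # scalar label, so the title reads "desert" rather than "['desert']"
    axs[i].set_title('Example of {}'.format(label))
    axs[i].imshow(img.astype(float) / 255)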

plt.show()

In [None]:
print(data.iloc[:, -1:])
X = data.iloc[:, :-1]
y = data.iloc[:, -1:]

In [None]:
np.unique(data["target"]).shape

In [None]:
data[data["target"]=="island"].shape

#### Code 1: compute statistics of the dataset.

* How many features?
* How many data points?
* How many classes?
* What is the most represented class?
* What is the least represented class?

In [None]:
#Features: 128*128*3
#Data points: 5200
#Classes: 13
#They are equally represented: all by 400 samples
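
One possible way to obtain such statistics with pandas is sketched below. It runs on a tiny stand-in frame (`toy`, built from the first pixel of the five rows shown by `data.head()`), which is an assumption made so that the example is self-contained. On the real data, the same calls would be applied to `data`.

```python
import pandas as pd

# Toy stand-in for the `data` frame: a few pixel columns plus a "target" column
toy = pd.DataFrame({
    "pixel_1_1_R": [145, 193, 83, 16, 60],
    "pixel_1_1_G": [145, 168, 86, 52, 79],
    "pixel_1_1_B": [121, 138, 67, 48, 47],
    "target": ["desert", "desert", "meadow", "river", "mountain"],
})

n_points = toy.shape[0]
n_features = toy.shape[1] - 1            # every column except "target"
class_counts = toy["target"].value_counts()

print("Features:", n_features)
print("Data points:", n_points)
print("Classes:", class_counts.size)
print("Most represented:", class_counts.idxmax(), class_counts.max())
print("Least represented:", class_counts.idxmin(), class_counts.min())
```

When several classes tie, as they do when every class has the same number of samples, `idxmax` and `idxmin` simply return one of the tied classes.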

# Step 2: Building a predictive model

<h2 style="color:red " >Warning</h2>

<p style="font-style:italic"> In case you want to load the full data </p> 
As before, make sure that at least 2-3 GB of RAM is available.

In [None]:
from data_manager import DataManager
D = DataManager(data_name, data_dir, replace_missing=False, verbose=True)
print(D)

In [None]:
X_train = D.data['X_train']
Y_train = D.data['Y_train']

### Processing

There are two main approaches:

* Use raw data as input. This is often a good choice for deep learning models, for instance.
* Do feature engineering: process the data to create features. You can then use these features as the input of your classifier (Random forest, SVM, etc.). An example of a feature is the number of blue pixels in the image. Feature extraction can also be done by a CNN.
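
As an illustration of the feature engineering approach, here is a minimal sketch that turns each flat RGB image into seven hand-made features: the mean and standard deviation of each channel, and the share of blue-dominant pixels. It assumes that images are stored as flat vectors of 128x128x3 values with the channel last. The function `extract_features` and the feature choice are hypothetical and are not part of the starting kit.

```python
import numpy as np

def extract_features(X, height=128, width=128):
    """Map flat RGB images (n_samples, height*width*3) to 7 hand-made features.

    Hypothetical helper, not part of the starting kit.
    """
    imgs = np.asarray(X, dtype=float).reshape(-1, height, width, 3) / 255.0
    mean_rgb = imgs.mean(axis=(1, 2))          # average R, G, B per image
    std_rgb = imgs.std(axis=(1, 2))            # colour spread per channel
    r, g, b = imgs[..., 0], imgs[..., 1], imgs[..., 2]
    blue_fraction = ((b > r) & (b > g)).mean(axis=(1, 2))  # share of blue-dominant pixels
    return np.column_stack([mean_rgb, std_rgb, blue_fraction])

# Two synthetic 128x128 images: one pure blue, one pure green
blue = np.zeros((128, 128, 3)); blue[..., 2] = 255
green = np.zeros((128, 128, 3)); green[..., 1] = 255
X_toy = np.stack([blue.ravel(), green.ravel()])

features = extract_features(X_toy)
print(features.shape)      # (2, 7)
print(features[:, -1])     # blue-dominant fraction per image
```

The resulting array could then be passed to a classical classifier in place of the raw pixels.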

In [None]:
# TODO (if you want)

### Use of the baseline model

Our BasicCNN model requires PyTorch.

If PyTorch is installed but you still encounter errors related to it, try upgrading it:

    pip install -U torch

Our model is a simple implementation of a Convolutional Neural Network (CNN).

More information on CNN:
* [Convolutional neural network on Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network)
* [A Comprehensive Guide to Convolutional Neural Networks (blog)](https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53)

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
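globals().pop('mod', None)  # drop any previous model; unlike `del mod`, this does not fail on a fresh kernel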

In [None]:
from sample_code_submission.model import BasicCNN, SimpleConvModel
import model

In [None]:
mod = SimpleConvModel()
mod

In [None]:
m = BasicCNN(verbose=True, use_cuda=torch.cuda.is_available())  # use the GPU only when one is available
print(m.model_conv)
trained_model_name = model_dir + '/' + data_name  # model_dir has no trailing slash

In [None]:
m.fit(X_train, Y_train)

In [None]:
Y_hat_train = m.predict(D.data['X_train'])
Y_hat_valid = m.predict(D.data['X_valid'])
Y_hat_test = m.predict(D.data['X_test'])

In [63]:
# m.save(trained_model_name)                 
result_name = result_dir + data_name
from data_io import write
write(result_name + '_train.predict', Y_hat_train)
write(result_name + '_valid.predict', Y_hat_valid)
write(result_name + '_test.predict', Y_hat_test)
!ls $result_name*

sample_result_submission/Areal_test.predict
sample_result_submission/Areal_train.predict
sample_result_submission/Areal_valid.predict


#### Question 3: What are the hyperparameters of a CNN?

TODO

#### Code 2: Edit model.py to vary the CNN's hyperparameters

In [28]:
#TODO in model.py

#### Code 3: Try another model (e.g. Random Forest, SVM, etc.)

In [29]:
#TODO in another model.py file
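
As a starting point, the sketch below wraps scikit-learn's `RandomForestClassifier` in a class exposing `fit` and `predict`, the two methods called on BasicCNN earlier in this notebook. The class name `RandomForestModel`, its defaults and the synthetic data are assumptions made for illustration. The interface that the ingestion program actually expects should be checked in model.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class RandomForestModel:
    """Hypothetical alternative model exposing fit/predict like BasicCNN."""

    def __init__(self, n_estimators=50, random_state=0):
        self.clf = RandomForestClassifier(n_estimators=n_estimators,
                                          random_state=random_state)

    def fit(self, X, y):
        # np.ravel accepts labels given as a column vector or a flat array
        self.clf.fit(np.asarray(X), np.ravel(y))
        return self

    def predict(self, X):
        return self.clf.predict(np.asarray(X))

# Synthetic stand-in data: 2 well-separated classes, 12 features each
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (20, 12)), rng.normal(5, 1, (20, 12))])
y_toy = np.array([0] * 20 + [1] * 20)

model_rf = RandomForestModel().fit(X_toy, y_toy)
train_accuracy = (model_rf.predict(X_toy) == y_toy).mean()
print(train_accuracy)  # training accuracy on the toy data
```

On the real images, a random forest would probably work better on a small set of engineered features than on the 49152 raw pixel values.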

# Scoring the result

When this is run on sample_data, which contains too few samples, the scores will not be very good.

In [34]:
from libscores import get_metric
metric_name, scoring_function = get_metric()
print('Using scoring metric:', metric_name)

Using scoring metric: accuracy


In [35]:
print('Ideal score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_train))
print('Training score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_hat_train))
if len(D.data['Y_valid']) > 0 and len(D.data['Y_test']) > 0:
    print('Valid score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_valid'], Y_hat_valid))
    print('Test score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_test'], Y_hat_test))

Ideal score for the accuracy metric = 1.0000
Training score for the accuracy metric = 0.4515


## Confusion matrix

In [36]:
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_train, Y_hat_train)

array([[  0,  74,   0,  20,  63,  72,  17,   7,   0,   0, 100,  47,   0],
       [  0, 378,   0,  19,   2,   0,   0,   0,   0,   0,   0,   1,   0],
       [  0,  51,   0,   5,  22,  58,  25,   0,   0,   0, 152,  87,   0],
       [  0,  36,   0, 351,   0,   0,   0,   7,   0,   0,   5,   1,   0],
       [  0,  75,   0,   6, 269,   0,  13,  37,   0,   0,   0,   0,   0],
       [  0,   6,   0,   1,  12, 318,  20,  12,   0,   0,  23,   8,   0],
       [  0,  64,   0,   2,  68,  12, 216,  20,   0,   0,  13,   5,   0],
       [  0,   9,   0,   2,  11,   0,   0, 378,   0,   0,   0,   0,   0],
       [  0, 190,   0,  33,  89,   2,  32,  36,   0,   0,  12,   6,   0],
       [  0,  48,   0,   3, 188,  10,  69,  36,   0,   0,  43,   3,   0],
       [  0,   0,   0,   0,   7,  33,   9,   0,   0,   0, 339,  12,   0],
       [  0,  42,   0,   0,  30,   5,   6,   0,   0,   0, 218,  99,   0],
       [  0, 144,   0,   2, 141,   2,  34,  62,   0,   0,  14,   1,   0]])
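
In scikit-learn's convention, entry (i, j) of this matrix counts the samples whose true class is i and whose predicted class is j. The sketch below copies the matrix printed above and derives a few summary numbers from it with NumPy. The variable names are chosen for illustration only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([
    [0,  74, 0,  20,  63,  72,  17,   7, 0, 0, 100, 47, 0],
    [0, 378, 0,  19,   2,   0,   0,   0, 0, 0,   0,  1, 0],
    [0,  51, 0,   5,  22,  58,  25,   0, 0, 0, 152, 87, 0],
    [0,  36, 0, 351,   0,   0,   0,   7, 0, 0,   5,  1, 0],
    [0,  75, 0,   6, 269,   0,  13,  37, 0, 0,   0,  0, 0],
    [0,   6, 0,   1,  12, 318,  20,  12, 0, 0,  23,  8, 0],
    [0,  64, 0,   2,  68,  12, 216,  20, 0, 0,  13,  5, 0],
    [0,   9, 0,   2,  11,   0,   0, 378, 0, 0,   0,  0, 0],
    [0, 190, 0,  33,  89,   2,  32,  36, 0, 0,  12,  6, 0],
    [0,  48, 0,   3, 188,  10,  69,  36, 0, 0,  43,  3, 0],
    [0,   0, 0,   0,   7,  33,   9,   0, 0, 0, 339, 12, 0],
    [0,  42, 0,   0,  30,   5,   6,   0, 0, 0, 218, 99, 0],
    [0, 144, 0,   2, 141,   2,  34,  62, 0, 0,  14,  1, 0],
])

accuracy = np.trace(cm) / cm.sum()                   # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)                # fraction of each true class found
never_predicted = np.where(cm.sum(axis=0) == 0)[0]   # columns that are entirely zero

print('accuracy = %5.4f' % accuracy)                 # 0.4515, the training score reported above
print('per-class recall:', np.round(recall, 2))
print('classes never predicted:', never_predicted)
```

The all-zero columns show that the model never predicts five of the thirteen classes, which explains much of the low training score.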

#### Question 4: What does the confusion matrix represent?

TODO

#### Code 4: display the confusion matrix with a colored heatmap

In [22]:
# TODO

## Cross validation

sample_data does not contain enough examples for the cross-validation scores to be meaningful.
Run the cells below with the full data to see meaningful values.

In [21]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

In [22]:
scores = cross_val_score(BasicCNN(), X_train, Y_train, cv=3, scoring=make_scorer(scoring_function))
print('\nCV score (95 perc. CI): %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))


CV score (95 perc. CI): 0.12 (+/- 0.12)
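
To make the procedure concrete, here is a minimal NumPy sketch of what 3-fold cross-validation does. The data is split into three folds, each fold serves once as the validation set, and the three scores are summarised by their mean and spread. The helper `kfold_indices`, the balanced toy labels and the majority-class "model" are assumptions made for illustration. They are not what `cross_val_score` or BasicCNN do internally.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=3, seed=0):
    """Yield (train_idx, valid_idx) pairs; hypothetical helper for illustration."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for i in range(n_folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

# Toy labels: 65 examples in 13 classes (5 per class is an assumption)
y_toy = np.repeat(np.arange(13), 5)

scores = []
for train_idx, valid_idx in kfold_indices(len(y_toy)):
    # Baseline "model": always predict the most frequent training label
    majority = np.bincount(y_toy[train_idx]).argmax()
    scores.append(np.mean(y_toy[valid_idx] == majority))

scores = np.array(scores)
print('CV score: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))
```

With only three folds, the "+/- 2 standard deviations" figure is a rough indication of spread rather than a true confidence interval.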


#### Question 5: Why is there a standard deviation associated with the cross-validation score?

TODO

# Submission

## Example

This example requires python3.

The next cell tests whether the submission works with the ingestion program.

In [23]:
!python3 $problem_dir/ingestion.py $data_dir $result_dir $problem_dir $model_dir

Using input_dir: /home/adrien/Documents/competitions/Image_recognition_challenge/image_recognition/public_data
Using output_dir: /home/adrien/Documents/competitions/Image_recognition_challenge/image_recognition/starting_kit/sample_result_submission
Using program_dir: /home/adrien/Documents/competitions/Image_recognition_challenge/image_recognition/starting_kit/ingestion_program
Using submission_dir: /home/adrien/Documents/competitions/Image_recognition_challenge/image_recognition/starting_kit/sample_code_submission


************************************************
******** Processing dataset Areal ********
************************************************
Info file found : /home/adrien/Documents/competitions/Image_recognition_challenge/image_recognition/public_data/Areal_public.info
[+] Success in  0.02 sec
[+] Success in 49.47 sec
[+] Success in  0.13 sec
[+] Success in 18.58 sec
[+] Success in  0.00 sec
[+] Success in 19.27 sec
[+] Success in  0.00 sec
DataManager : Areal
info:
	usag

### Test scoring program

In [24]:
scoring_output_dir = 'scoring_output'
!python3 $score_dir/score.py $data_dir $result_dir $scoring_output_dir



# Prepare the submission

In [25]:
import datetime 
from data_io import zipdir
the_date = datetime.datetime.now().strftime("%y-%m-%d-%H-%M")
sample_code_submission = './sample_code_submission_' + the_date + '.zip'
sample_result_submission = './sample_result_submission_' + the_date + '.zip'
zipdir(sample_code_submission, model_dir)
zipdir(sample_result_submission, result_dir)
print("Submit one of these files:\n" + sample_code_submission + "\n" + sample_result_submission)

Submit one of these files:
./sample_code_submission_20-12-02-15-35.zip
./sample_result_submission_20-12-02-15-35.zip
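
For readers curious about the archiving step, the sketch below builds a comparable zip file with Python's standard `zipfile` module. The function `zip_directory` is a hypothetical stand-in. The kit's own `zipdir` may behave differently, and whether files must sit at the root of the archive should be checked against the challenge instructions.

```python
import os
import tempfile
import zipfile

def zip_directory(archive_path, directory):
    """Zip the contents of `directory`; hypothetical stand-in for the kit's zipdir."""
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(directory):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to `directory` so files sit at the archive root
                zf.write(full_path, os.path.relpath(full_path, directory))

# Demonstration on a temporary folder containing a single prediction file
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'results')
    os.makedirs(src)
    with open(os.path.join(src, 'Areal_test.predict'), 'w') as f:
        f.write('0\n')
    archive = os.path.join(tmp, 'results.zip')
    zip_directory(archive, src)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)
```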


# Try submitting your files on Codalab!