## Multi-label prediction with Planet Amazon dataset

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from fastai.vision import *
from PIL import Image as pimage

In [None]:
path = Config.data_path()/'bengali'
path.mkdir(parents=True, exist_ok=True)
path
path_train = path/'train'
path_test = path/'test'
path_train.mkdir(parents=True, exist_ok=True)
path_test.mkdir(parents=True, exist_ok=True)

In [None]:
# df0 = pd.read_parquet(path/'train_image_data_0.parquet')
# df1 = pd.read_parquet(path/'train_image_data_1.parquet')
# df2 = pd.read_parquet(path/'train_image_data_2.parquet')
# df3 = pd.read_parquet(path/'train_image_data_3.parquet')


In [None]:
# df = df0
# df.append(df1).append(df2).append(df3)
# df.shape

In [None]:
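# Convert every row of the four training parquet files into a 137x236 grayscale PNG.
for fnum in range(4):
    df = pd.read_parquet(path/f'train_image_data_{fnum}.parquet')

    for index in range(df.shape[0]):
        # Column 0 holds the image id; the remaining columns are the flattened pixel values.
        tempimg = df.iloc[index,1:].to_numpy().astype(np.uint8).reshape(137,236)
        tempimg1 = pimage.fromarray(tempimg,'L')
        tempimg1.save(path_train/f'{df.iloc[index,0]}.png','PNG')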

In [None]:
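# Same conversion for the test parquet files, written to the test folder.
for fnum in range(4):
    df = pd.read_parquet(path/f'test_image_data_{fnum}.parquet')

    for index in range(df.shape[0]):
        # Column 0 holds the image id; the remaining columns are the flattened pixel values.
        tempimg = df.iloc[index,1:].to_numpy().astype(np.uint8).reshape(137,236)
        tempimg1 = pimage.fromarray(tempimg,'L')
        tempimg1.save(path_test/f'{df.iloc[index,0]}.png','PNG')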

In [None]:
# img = df.iloc[1,1:].to_numpy().astype(int).reshape(137,236)
# img1 = pimage.fromarray(img)
# df.iloc[2,0]

## Multi-label classification

Contrary to the pets dataset studied in the last lesson, here each picture can have multiple labels. If we take a look at the CSV file containing the labels ('train_v2.csv' here), we see that each 'image_name' is associated with several tags separated by spaces.
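
To make the idea of several tags per image concrete, here is a minimal sketch of how space-separated tags can be turned into 0/1 target vectors. The tag strings and the `multi_hot` helper are made up for illustration; they are not read from the actual CSV and are not part of fastai.

```python
import numpy as np

# Hypothetical rows mimicking a space-separated 'tags' column.
tags_column = ["haze primary", "agriculture clear primary water", "clear primary"]

# Sorted vocabulary of every distinct tag seen in the column.
classes = sorted({tag for row in tags_column for tag in row.split(" ")})

def multi_hot(row, classes):
    """Return a 0/1 vector with a 1 for every class present in `row`."""
    present = set(row.split(" "))
    return np.array([1.0 if c in present else 0.0 for c in classes], dtype=np.float32)

# One target vector per image; several entries can be 1 at the same time.
targets = np.stack([multi_hot(row, classes) for row in tags_column])
print(classes)
print(targets)
```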

In [None]:
df = pd.read_csv(path/'train_v2.csv')
df.head()

To put this in a `DataBunch` while using the [data block API](https://docs.fast.ai/data_block.html), we then need to use `ImageList` (and not `ImageDataBunch`). This will make sure the model created has the proper loss function to deal with the multiple classes.
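
As a rough sketch of what a multi-label loss looks like, assuming it is a binary cross-entropy applied independently to each class logit (an assumption here, not something confirmed from the library source), a plain NumPy version could be written as follows. The function name and the example arrays are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (image, class) pair, computed from raw logits."""
    x = np.asarray(logits, dtype=float)
    t = np.asarray(targets, dtype=float)
    # Numerically stable form of -[t*log(sigmoid(x)) + (1-t)*log(1-sigmoid(x))].
    per_element = np.maximum(x, 0) - x * t + np.log1p(np.exp(-np.abs(x)))
    return float(per_element.mean())

# Two images, three classes; each class is scored on its own.
example_logits = np.array([[2.0, -1.0, 0.5], [-3.0, 1.5, -0.2]])
example_targets = np.array([[1, 0, 0], [0, 1, 1]])
loss = bce_with_logits(example_logits, example_targets)
print(loss)
```

Unlike a softmax-based loss, nothing here forces the classes to compete, which is what lets several labels be active for the same image.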

In [None]:
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)

In [None]:
doc(ImageList.from_csv)

We use parentheses around the data block pipeline below, so that we can use a multiline statement without needing to add '\\'.
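
As a small illustration of that syntax point, here is the same method chain written both ways on a plain string (the string and variable names are made up for the example):

```python
# Without parentheses, every continued line needs a trailing backslash.
words_backslash = "  Multi Label  " \
    .strip() \
    .lower() \
    .split()

# Wrapped in parentheses, the chain can break across lines freely.
words_parens = ("  Multi Label  "
                .strip()
                .lower()
                .split())

print(words_backslash == words_parens)
```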

In [None]:
np.random.seed(42)
src = (ImageList.from_csv(path, 'train_v2.csv', folder='train-jpg', suffix='.jpg')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))

In [None]:
data = (src.transform(tfms, size=128)
        .databunch().normalize(imagenet_stats))

`show_batch` still works, and shows us the different labels separated by `;`.

In [None]:
data.show_batch(rows=3, figsize=(12,9))

In [None]:
??accuracy_thresh

To create a `Learner` we use the same function as in lesson 1. Our base architecture is resnet50 again, but the metrics are a little bit different: we use `accuracy_thresh` instead of `accuracy`. In lesson 1, we determined the prediction for a given class by picking the final activation that was the biggest, but here, each activation can be 0. or 1. `accuracy_thresh` selects the ones that are above a certain threshold (0.5 by default) and compares them to the ground truth.
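
The thresholding idea can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not fastai's own code: the function name, the choice to apply a sigmoid to raw activations first, and the example numbers are all assumptions made for the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def accuracy_thresh_sketch(logits, targets, thresh=0.5, apply_sigmoid=True):
    """Fraction of (image, class) pairs where the thresholded prediction matches the target."""
    probs = sigmoid(logits) if apply_sigmoid else logits
    predicted = probs > thresh
    return float((predicted == targets.astype(bool)).mean())

# Two images, three classes (made-up activations and labels).
logits = np.array([[2.0, -2.0, 0.5], [-3.0, 1.5, -0.2]])
targets = np.array([[1, 0, 0], [0, 1, 1]])

acc_at_05 = accuracy_thresh_sketch(logits, targets, thresh=0.5)
acc_at_02 = accuracy_thresh_sketch(logits, targets, thresh=0.2)
print(acc_at_05, acc_at_02)
```

Lowering the threshold marks more classes as present, so the score can change even though the activations are the same.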

As for Fbeta, it's the metric that was used by Kaggle on this competition. See [here](https://en.wikipedia.org/wiki/F1_score) for more details.
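
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1+\beta^2)\,P R / (\beta^2 P + R)$. The sketch below computes it per image and averages, assuming $\beta = 2$ and inputs that are already probabilities; the function name and the example arrays are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def fbeta_sketch(probs, targets, thresh=0.2, beta=2.0, eps=1e-9):
    """Per-image F-beta on thresholded predictions, averaged over images."""
    preds = (probs > thresh).astype(float)
    targets = targets.astype(float)
    true_pos = (preds * targets).sum(axis=1)
    precision = true_pos / (preds.sum(axis=1) + eps)
    recall = true_pos / (targets.sum(axis=1) + eps)
    beta2 = beta ** 2
    scores = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return float(scores.mean())

# Two images, three classes (made-up probabilities and labels).
probs = np.array([[0.9, 0.3, 0.1], [0.05, 0.6, 0.25]])
targets = np.array([[1, 0, 0], [0, 1, 1]])
score = fbeta_sketch(probs, targets)
print(score)
```

With $\beta = 2$, recall counts more than precision, so a low threshold such as 0.2 (which favours predicting extra tags) tends to be rewarded.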

In [None]:
arch = models.resnet50

In [None]:
acc_02 = partial(accuracy_thresh, thresh=0.2)
f_score = partial(fbeta, thresh=0.2)
learn = cnn_learner(data, arch, metrics=[acc_02, f_score])

We use the LR Finder to pick a good learning rate.
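
To give some intuition for what a learning rate range test does, here is a toy sketch on a one-parameter quadratic loss: the learning rate grows exponentially from step to step, the loss is recorded, and the run stops once the loss blows up. Everything in it (the function name, the start and end rates, the stopping rule, the toy loss) is an assumption for illustration and not the library's implementation.

```python
import numpy as np

def lr_range_test(start_lr=1e-4, end_lr=10.0, num_it=50):
    """Toy range test on loss(w) = w**2 with plain gradient descent."""
    # Learning rates spaced exponentially between start_lr and end_lr.
    lrs = start_lr * (end_lr / start_lr) ** (np.arange(num_it) / (num_it - 1))
    w = 5.0
    best = float("inf")
    history = []
    for lr in lrs:
        loss = w ** 2
        # Stop once the loss is clearly diverging.
        if loss > 4 * best:
            break
        best = min(best, loss)
        history.append((float(lr), loss))
        w -= lr * 2 * w  # gradient of w**2 is 2*w
    return history

history = lr_range_test()
print(len(history), history[0], history[-1])
```

Plotting the recorded loss against the learning rate gives the kind of curve one reads to choose a rate: somewhere in the region where the loss is still falling steeply, well before it diverges.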

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

Then we can fit the head of our network.

In [None]:
lr = 0.01

In [None]:
learn.fit_one_cycle(5, slice(lr))

In [None]:
learn.save('stage-1-rn50')

...And fine-tune the whole model:

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()
learn.recorder.plot()

In [None]:
learn.fit_one_cycle(5, slice(1e-6, lr/5))

In [None]:
learn.save('stage-2-rn50')

In [None]:
data = (src.transform(tfms, size=256)
        .databunch(bs=32).normalize(imagenet_stats))

learn.data = data
data.train_ds[0][0].shape

In [None]:
learn.freeze()

In [None]:
# learn.lr_find()
learn.recorder.plot()

In [None]:
lr=1e-6/2

In [None]:
learn.fit_one_cycle(2, slice(lr))

In [None]:
learn.save('stage-1-256-rn50')

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(5, slice(1e-5, lr/5))

In [None]:
learn.recorder.plot_losses()

In [None]:
learn.save('stage-2-256-rn50')

You won't really know how you're going until you submit to Kaggle, since the leaderboard isn't using the same subset as we have for training. But as a guide, 50th place (out of 938 teams) on the private leaderboard was a score of `0.930`.

In [None]:
learn.export()

## fin

(This section will be covered in part 2 - please don't ask about it just yet! :) )

In [None]:
#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg.tar.7z | tar xf - -C {path}
#! kaggle competitions download -c planet-understanding-the-amazon-from-space -f test-jpg-additional.tar.7z -p {path}  
#! 7za -bd -y -so x {path}/test-jpg-additional.tar.7z | tar xf - -C {path}

In [None]:
test = ImageList.from_folder(path/'test-jpg').add(ImageList.from_folder(path/'test-jpg-additional'))
len(test)

In [None]:
learn = load_learner(path, test=test)
preds, _ = learn.get_preds(ds_type=DatasetType.Test)

In [None]:
thresh = 0.2
labelled_preds = [' '.join([learn.data.classes[i] for i,p in enumerate(pred) if p > thresh]) for pred in preds]

In [None]:
labelled_preds[:5]

In [None]:
fnames = [f.name[:-4] for f in learn.data.test_ds.items]

In [None]:
df = pd.DataFrame({'image_name':fnames, 'tags':labelled_preds}, columns=['image_name', 'tags'])

In [None]:
df.to_csv(path/'submission.csv', index=False)

In [None]:
! kaggle competitions submit planet-understanding-the-amazon-from-space -f {path/'submission.csv'} -m "My submission"

Private Leaderboard score: 0.9296 (around 80th)