# Kaggle Dog Breed Identification Challenge
Check out the challenge [here](https://www.kaggle.com/c/dog-breed-identification).

Now we'll build the same kind of model as in the [Lesson1FULL(1)](https://github.com/neelkamalsb/DeepLearningWithFastai) notebook, but with a different architecture, different data, and as few steps as possible.

Earlier, the general approach was as follows:
1.  Set precompute=True
2.  Use lr_find() to find the highest learning rate where the loss is still clearly improving
3.  Train the last layer from precomputed activations for 1-2 epochs
4.  Train the last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
5.  Unfreeze all layers
6.  Set earlier layers to a 3x-10x lower learning rate than the next higher layer
7.  Use lr_find() again
8.  Train the full network with cycle_mult=2 until over-fitting
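
Step 6 comes down to simple arithmetic. The sketch below is plain Python for illustration, not fastai code: the helper `layer_group_lrs` and its parameters are hypothetical names, and it assumes three layer groups with a fixed factor between neighbouring groups.

```python
def layer_group_lrs(top_lr, n_groups=3, factor=10.0):
    """Return learning rates ordered from the earliest layer group to the last.

    Hypothetical helper: each earlier group gets a rate `factor` times
    lower than the group that follows it.
    """
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

# With a top rate of 1e-2 and a factor of 10, the earlier groups get
# roughly 1e-4 and 1e-3.
lrs = layer_group_lrs(1e-2, n_groups=3, factor=10.0)
print(lrs)
```

A factor closer to 3 is the other end of the range given in step 6; which one works better depends on how similar the new data is to the data the network was pretrained on.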



## Downloading the data
We download the data using [kaggle-cli](https://github.com/floydwch/kaggle-cli).
See the [fast.ai wiki page on the Kaggle CLI](http://wiki.fast.ai/index.php/Kaggle_CLI) for more details.

In [0]:
!pip install kaggle-cli
!kg config -g -u 'Your username' -p 'Your password' -c dog-breed-identification

In [0]:
!kg download

In [3]:
!ls

labels.csv.zip	sample_data  sample_submission.csv.zip	test.zip  train.zip


In [0]:
!sudo apt install unzip
!unzip train.zip
!unzip test.zip
!unzip labels.csv.zip

In [8]:
!ls

labels.csv	sample_submission.csv	   test.zip
labels.csv.zip	sample_submission.csv.zip  train
sample_data	test			   train.zip


In [9]:
!unzip sample_submission.csv.zip

Archive:  sample_submission.csv.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: sample_submission.csv   


In [10]:
!ls -l

total 733972
-rw-r--r-- 1 root root    482063 Sep 28  2017 labels.csv
-rw-r--r-- 1 root root    218954 Nov 23 09:43 labels.csv.zip
drwxr-xr-x 2 root root      4096 Nov 20 18:17 sample_data
-rw-r--r-- 1 root root  25200295 Sep 28  2017 sample_submission.csv
-rw-r--r-- 1 root root    288160 Nov 23 09:43 sample_submission.csv.zip
drwxr-xr-x 2 root root    688128 Sep 28  2017 test
-rw-r--r-- 1 root root 362738853 Nov 23 09:43 test.zip
drwxr-xr-x 2 root root    663552 Sep 28  2017 train
-rw-r--r-- 1 root root 361279070 Nov 23 09:44 train.zip


## Some Google Colab Setup

In [0]:
%matplotlib inline

In [0]:
!pip3 install fastai==0.7.0

In [0]:
!pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl 
!pip3 install torchvision

In [0]:
!pip install torchtext==0.2.3

In [0]:
!pip install pillow==4.0.0

## Let's GO!

In [0]:
from fastai.imports import *

In [0]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

In [22]:
torch.cuda.is_available()             #This should return True.

True

In [23]:
torch.backends.cudnn.enabled          #This should return True.

True

###  Get the weights

In [18]:
!wget http://files.fast.ai/models/weights.tgz

--2018-11-23 10:09:31--  http://files.fast.ai/models/weights.tgz
Resolving files.fast.ai (files.fast.ai)... 67.205.15.147
Connecting to files.fast.ai (files.fast.ai)|67.205.15.147|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1195411318 (1.1G) [text/plain]
Saving to: ‘weights.tgz’


2018-11-23 10:09:42 (101 MB/s) - ‘weights.tgz’ saved [1195411318/1195411318]



In [0]:
! tar -xzf weights.tgz 

In [34]:
!ls


labels.csv      sample_submission.csv      test.zip  train.zip
labels.csv.zip  sample_submission.csv.zip  tmp/      weights/
sample_data/    test/                      train/    weights.tgz


In [0]:
!mv weights /usr/local/lib/python3.6/dist-packages/fastai/

### Define some requirements and get a Cross validation set

In [0]:
PATH = './'
sz = 224
arch = resnext101_64
bs = 8

In [25]:
label_csv = f'{PATH}labels.csv'                                #Path to the labels file.
n = len(list(open(label_csv)))                                 #Number of lines in the file (this count includes the header row).
val_idxs = get_cv_idxs(n, cv_idx=3)                            #Get the indices of the validation set.
                                                               #'get_cv_idxs(n)' returns a numpy.ndarray of random
                                                               # indices into 'label_csv' for the validation set.
print('Total no. training Examples: ', n)                      #See how many examples there are.
print('No. of Cross Validation Set Examples: ', len(val_idxs)) #We get 20% in the validation set.
val_idxs

Total no. training Examples:  10223
No. of Cross Validation Set Examples:  2044


array([3386, 8026, 9503, ..., 7504, 8531, 6511])
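
To make the 20% split concrete, here is a minimal sketch of picking one fold of a shuffled index range. It is a stand-in for illustration only: `random_val_indices` and its parameters are hypothetical, it uses Python's `random` module, and the indices it returns will not match those from `get_cv_idxs`.

```python
import random

def random_val_indices(n, val_pct=0.2, fold=0, seed=42):
    """Pick one fold of a shuffled index range as the validation set.

    Hypothetical helper; the fold is a contiguous slice of the shuffled
    indices, so different folds never overlap.
    """
    n_val = int(val_pct * n)
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    start = fold * n_val
    return idxs[start:start + n_val]

val = random_val_indices(10223, fold=3)
print(len(val))
```

For n = 10223 and a 20% split, the fold size is int(0.2 * 10223) = 2044, which matches the count printed above.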

### Check out the data

In [0]:
label_df = pd.read_csv(label_csv)
label_df.head(10)

,id,breed
0,000bec180eb18c7604dcecc8fe0dba07,boston_bull
1,001513dfcb2ffafc82cccf4d8bbaba97,dingo
2,001cdf01b096e06d78e9e5112d419397,pekinese
3,00214f311d5d2247d5dfe4fe24b2303d,bluetick
4,0021f9ceb3235effd7fcde7f7538ed62,golden_retriever
5,002211c81b498ef88e1b40b9abf84e1d,bedlington_terrier
6,00290d3e1fdd27226ba27a8ce248ce85,bedlington_terrier
7,002a283a315af96eaea0e28e7163b21b,borzoi
8,003df8b8a8b05244b1d920bb6cf451f9,basenji
9,0042188c895a2f14ef64a918ed9c7b64,scottish_deerhound


In [0]:
label_df.pivot_table(index='breed', aggfunc=len).sort_values('id', ascending=False)

breed,id
scottish_deerhound,126
maltese_dog,117
afghan_hound,116
entlebucher,115
bernese_mountain_dog,114
shih-tzu,112
great_pyrenees,111
pomeranian,111
basenji,110
samoyed,109


### Build a function to get data at varying sizes easily

In [0]:
def get_data(sz, bs):
  tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
  data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', 
                                      test_name='test', num_workers=4, 
                                      val_idxs=val_idxs, suffix='.jpg', 
                                      tfms=tfms, bs=bs)
  return data if sz>300 else data.resize(340,'tmp')   #For small image sizes, work from resized copies stored in 'tmp'.

In [27]:
data = get_data(sz, bs)




### Learn!

In [36]:
learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5)


100%|██████████| 1023/1023 [05:35<00:00,  3.89it/s]
100%|██████████| 256/256 [01:23<00:00,  3.88it/s]
100%|██████████| 1295/1295 [07:02<00:00,  3.70it/s]


In [37]:
lrf=learn.lr_find()
lrf


 72%|███████▏  | 733/1023 [00:09<00:04, 63.52it/s, loss=7.67]


In [0]:
learn.lr_find(1e-7,1e2)
learn.sched.plot(n_skip=0, n_skip_end=0)
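
The idea behind a learning rate finder is to train briefly while the learning rate grows from a very small value to a very large one, recording the loss at each step. The sketch below only generates such a sweep on a log scale. It is not fastai's implementation: `lr_sweep` is a hypothetical helper, and the bounds simply mirror the `1e-7` and `1e2` passed to `lr_find` above.

```python
def lr_sweep(start_lr, end_lr, n_iters):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    if n_iters < 2:
        return [start_lr]
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (n_iters - 1)) for i in range(n_iters)]

# Ten steps across nine orders of magnitude: each step multiplies the
# learning rate by about 10.
lrs = lr_sweep(1e-7, 1e2, n_iters=10)
for lr in lrs:
    print(f'{lr:.0e}')
```

Plotting the recorded loss against these rates on a log axis is what makes the "still clearly improving" region visible.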

In [0]:
learn.sched.plot_lr()    #This does not produce a plot, and I have not found the reason.
                         #Suggestions are welcome!

In [0]:
learn.sched.plot()       #This does not produce a plot, and I have not found the reason.
                         #Suggestions are welcome!

In [42]:
learn.fit(1e-2, 3)


epoch      trn_loss   val_loss   accuracy   
    0      0.799682   0.405888   0.873777  
    1      0.632344   0.383956   0.887965  
    2      0.481198   0.343744   0.896771  



[array([0.34374]), 0.8967710371819961]

The training loss is noticeably higher than the validation loss, which suggests that our model is underfitting.

In [43]:
learn.precompute=False
learn.fit(1e-2, 5, cycle_len=1)


epoch      trn_loss   val_loss   accuracy   
    0      0.494789   0.298513   0.912427  
    1      0.481034   0.30267    0.91047   
    2      0.459537   0.300343   0.910959  
    3      0.442421   0.294351   0.904599  
    4      0.364932   0.296458   0.907534  



[array([0.29646]), 0.9075342465753424]
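
With `cycle_len=1`, the learning rate is reset at the start of every epoch and then annealed over that epoch (stochastic gradient descent with restarts). The sketch below shows a cosine-shaped schedule for one such cycle. It is an illustration under stated assumptions rather than fastai's code: `cosine_annealed_lr` is a hypothetical helper, and 1023 is taken from the number of training batches in the progress bars above.

```python
import math

def cosine_annealed_lr(max_lr, iteration, iters_per_cycle):
    """Learning rate at `iteration` within one cycle of `iters_per_cycle` steps."""
    return max_lr / 2 * (1 + math.cos(math.pi * iteration / iters_per_cycle))

# One cycle of one epoch: start at the full rate and decay towards zero.
schedule = [cosine_annealed_lr(1e-2, i, 1023) for i in range(1023)]
print(schedule[0], schedule[511], schedule[-1])
```

At the next epoch the iteration counter starts again from zero, so the rate jumps back to `max_lr`; that jump is the "restart".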

In [44]:
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)


epoch      trn_loss   val_loss   accuracy   
    0      0.340949   0.297825   0.908513  
    1      0.452497   0.301744   0.908023  
    2      0.341018   0.281875   0.914384  
    3      0.451636   0.349841   0.903131  
    4      0.307107   0.314065   0.907534  
    5      0.25724    0.282693   0.911937  
    6      0.267021   0.276607   0.914873  



[array([0.27661]), 0.9148727984344422]
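
The table above has seven epochs (0 to 6) even though `learn.fit` was asked for 3 cycles, because `cycle_mult=2` doubles the length of each successive cycle. A short worked calculation, using a hypothetical helper `total_epochs`:

```python
def total_epochs(n_cycles, cycle_len=1, cycle_mult=1):
    """Total epochs when each cycle is cycle_mult times longer than the last."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycles))

# Cycle lengths for 3 cycles with cycle_len=1 and cycle_mult=2.
cycle_lengths = [1 * 2 ** i for i in range(3)]
print(cycle_lengths, total_epochs(3, cycle_len=1, cycle_mult=2))  # [1, 2, 4] 7
```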

In [0]:
learn.save('224_pre')

In [0]:
learn.load('224_pre')

In [47]:
log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)



In [48]:
accuracy_np(probs, y)

0.9227005870841487
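
Test-time augmentation (TTA) predicts on several augmented versions of each image and averages the resulting probabilities, which is what `np.mean(np.exp(log_preds),0)` does along the first axis. The sketch below repeats that computation in plain Python on made-up numbers; `average_tta_probs` and `accuracy` are hypothetical helpers, and the toy data is not from the model.

```python
import math

def average_tta_probs(log_preds):
    """Average probabilities over augmentations.

    log_preds[a][i][c] is the log-probability of class c for image i
    under augmentation a.
    """
    n_aug = len(log_preds)
    n_img = len(log_preds[0])
    n_cls = len(log_preds[0][0])
    return [[sum(math.exp(log_preds[a][i][c]) for a in range(n_aug)) / n_aug
             for c in range(n_cls)]
            for i in range(n_img)]

def accuracy(probs, y):
    """Fraction of images whose highest-probability class equals the label."""
    hits = sum(1 for p, label in zip(probs, y) if p.index(max(p)) == label)
    return hits / len(y)

# Toy data: 2 augmentations, 2 images, 3 classes (made-up numbers).
toy = [
    [[math.log(0.7), math.log(0.2), math.log(0.1)],
     [math.log(0.3), math.log(0.4), math.log(0.3)]],
    [[math.log(0.5), math.log(0.3), math.log(0.2)],
     [math.log(0.2), math.log(0.2), math.log(0.6)]],
]
probs = average_tta_probs(toy)
acc = accuracy(probs, [0, 2])
print(probs, acc)
```

In the toy data the second image is classified as class 1 under the first augmentation alone, but the average over both augmentations picks class 2, which illustrates how averaging can correct a single unlucky view.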

In [0]:
preds = np.exp(learn.predict(is_test=True))

In [50]:
preds.shape

(10357, 120)

In [0]:
df = pd.DataFrame(preds)
df.columns = data.classes

In [54]:
df.columns


Index(['affenpinscher', 'afghan_hound', 'african_hunting_dog', 'airedale',
       'american_staffordshire_terrier', 'appenzeller', 'australian_terrier',
       'basenji', 'basset', 'beagle',
       ...
       'toy_poodle', 'toy_terrier', 'vizsla', 'walker_hound', 'weimaraner',
       'welsh_springer_spaniel', 'west_highland_white_terrier', 'whippet',
       'wire-haired_fox_terrier', 'yorkshire_terrier'],
      dtype='object', length=120)

In [0]:
df.insert(0, 'id', [o[5:-4] for o in data.test_ds.fnames])
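
The slice `o[5:-4]` drops the first five and last four characters of each file name, which presumably correspond to a leading `test/` and a trailing `.jpg`. A sketch of an equivalent that does not depend on those exact lengths is shown below; `image_id` is a hypothetical helper, and the example path is assumed from the first id in the output that follows.

```python
import os

def image_id(fname):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(fname))[0]

# Assumed file name layout: 'test/<id>.jpg'.
fname = 'test/2162f28a3151f4ca907a8a9d79492618.jpg'

# Both approaches agree when the prefix is 'test/' and the suffix is '.jpg'.
assert fname[5:-4] == image_id(fname)
print(image_id(fname))
```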

In [57]:
df.head()

,id,affenpinscher,afghan_hound,african_hunting_dog,airedale,american_staffordshire_terrier,appenzeller,australian_terrier,basenji,basset,...,toy_poodle,toy_terrier,vizsla,walker_hound,weimaraner,welsh_springer_spaniel,west_highland_white_terrier,whippet,wire-haired_fox_terrier,yorkshire_terrier
0,2162f28a3151f4ca907a8a9d79492618,9.622564e-07,0.0001746814,1.03345e-06,2.247167e-06,3.342854e-08,1.207091e-07,1.257938e-06,4.863871e-08,2.688949e-07,...,1.694619e-06,2.411577e-07,1.539381e-05,3.109882e-07,4.684487e-07,3.708372e-05,2.786743e-08,2.014952e-08,7.040341e-07,6.49508e-06
1,d4592b2ce3f4b01ccf09e3dd066366cf,0.9999599,3.333576e-09,3.705285e-09,7.549974e-10,4.191065e-11,3.774746e-11,5.368271e-11,1.04609e-10,2.178532e-09,...,2.143924e-08,1.257417e-10,2.594498e-09,3.006835e-10,1.546833e-10,3.053817e-11,2.116222e-10,3.660598e-11,8.29291e-10,3.707443e-08
2,a04ec5d3e358109699247c1d60dd6d2e,9.586493e-11,1.43073e-11,2.484441e-11,7.330782e-12,9.430868e-11,5.596404e-11,2.236877e-11,2.454773e-09,4.731412e-08,...,8.831192e-10,1.271661e-11,9.172752e-05,6.681428e-08,1.283227e-11,8.227715e-10,1.585996e-11,1.766201e-11,1.016528e-13,4.297434e-10
3,32b09857c983edf030d8b27136a841cc,2.154818e-07,6.210754e-06,6.51991e-08,2.044864e-07,1.223535e-08,1.624409e-07,1.569498e-08,4.337887e-09,3.483875e-09,...,4.302683e-10,3.366662e-07,6.468835e-09,1.786672e-08,7.141433e-08,5.546946e-07,3.300172e-09,1.841218e-05,1.587495e-06,2.244094e-08
4,bc3a55555f74f590e7ddbf7a9e6475f7,1.502749e-06,3.379953e-06,1.399289e-07,1.313091e-09,3.488472e-08,1.643816e-08,9.654488e-09,1.338602e-08,8.626689e-09,...,1.012352e-07,6.708266e-09,1.1576e-08,4.279106e-09,1.345923e-09,6.166624e-10,1.666774e-09,1.433464e-08,4.794136e-09,3.153239e-07


In [0]:
SUBM = f'{PATH}SUBM/'
os.makedirs(SUBM, exist_ok=True)
df.to_csv(f'{SUBM}sub.gz', compression='gzip', index=False)

In [0]:
df.to_csv(f'{SUBM}sub.csv', index=False)
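
Before uploading, it can be worth checking that the file has an `id` column followed by one probability column per class, and that each row of probabilities sums to roughly 1. The sketch below does this with the standard library on a tiny made-up example; `check_submission` is a hypothetical helper, and the tolerance is an arbitrary choice.

```python
import csv
import io

def check_submission(csv_text, n_classes, tol=1e-3):
    """Return the number of data rows after checking their shape and sums."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    assert header[0] == 'id', 'first column should be the image id'
    assert len(header) == n_classes + 1, 'one column per class plus id'
    for row in body:
        total = sum(float(v) for v in row[1:])
        assert abs(total - 1.0) < tol, f'row {row[0]} sums to {total}'
    return len(body)

# Tiny made-up example with 3 classes instead of 120.
sample = "id,a,b,c\nimg1,0.7,0.2,0.1\nimg2,0.1,0.1,0.8\n"
n_rows = check_submission(sample, n_classes=3)
print(n_rows)
```

For the real file, the text could be read with `open(...).read()` and `n_classes` set to 120, the number of columns reported by `df.columns` above.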

In [0]:
from google.colab import files
files.download(f'{SUBM}sub.csv')

That's all folks!