# Kaggle Dog Breed Identification Challenge
Check out the challenge [here](https://www.kaggle.com/c/dog-breed-identification).

Now we'll build the same kind of classifier as in the [Lesson1FULL(1)](https://github.com/neelkamalsb/DeepLearningWithFastai) notebook, but with a different architecture, different data, and as few steps as possible.

Earlier we used the following general approach:
1.  precompute=True
2.  Use lr_find() to find highest learning rate where loss is still clearly improving
3.  Train last layer from precomputed activations for 1-2 epochs
4.  Train last layer with data augmentation (i.e. precompute=False) for 2-3 epochs with cycle_len=1
5.  Unfreeze all layers
6.  Set earlier layers to 3x-10x lower learning rate than next higher layer
7.  Use lr_find() again
8.  Train full network with cycle_mult=2 until over-fitting
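
Step 6 can be illustrated with a small, library-free sketch. The helper name `layer_group_lrs` and the choice of three layer groups are assumptions made for illustration and are not part of fastai; in the fastai 0.7 course notebooks, per-group rates like these are passed to `learn.fit` as an array.

```python
def layer_group_lrs(top_lr, n_groups=3, factor=10.0):
    """Return one learning rate per layer group, earliest group first.

    Each group gets a rate `factor` times lower than the group after it,
    so the last group trains at `top_lr`.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Hypothetical example: three layer groups, each 10x lower than the next.
lrs = layer_group_lrs(1e-2, n_groups=3, factor=10.0)
print(lrs)
```

A `factor` between 3 and 10 covers the range suggested in step 6; a smaller factor makes sense when the new data is less similar to the data the network was pretrained on.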



## Downloading the data
We download the data using [kaggle-cli](https://github.com/floydwch/kaggle-cli).
More details are available on the [fast.ai wiki](http://wiki.fast.ai/index.php/Kaggle_CLI).

In [0]:
!pip install kaggle-cli
!kg config -g -u YOUR_KAGGLE_USERNAME -p YOUR_KAGGLE_PASSWORD -c dog-breed-identification

In [0]:
!kg download

In [0]:
!ls

labels.csv.zip	sample_data  sample_submission.csv.zip	test.zip  train.zip


In [0]:
!sudo apt install unzip
!unzip train.zip
!unzip test.zip
!unzip labels.csv.zip

In [0]:
!ls

labels.csv	sample_data		   test      train
labels.csv.zip	sample_submission.csv.zip  test.zip  train.zip


In [0]:
!unzip sample_submission.csv.zip

In [0]:
!ls -l

total 709352
-rw-r--r-- 1 root root    482063 Sep 28  2017 labels.csv
-rw-r--r-- 1 root root    218954 Nov 23 02:03 labels.csv.zip
drwxr-xr-x 2 root root      4096 Nov 20 18:17 sample_data
-rw-r--r-- 1 root root    288160 Nov 23 02:03 sample_submission.csv.zip
drwxr-xr-x 2 root root    688128 Sep 28  2017 test
-rw-r--r-- 1 root root 362738853 Nov 23 02:03 test.zip
drwxr-xr-x 2 root root    663552 Sep 28  2017 train
-rw-r--r-- 1 root root 361279070 Nov 23 02:03 train.zip


## Some Google Colab Setup

In [0]:
%matplotlib inline

In [0]:
!pip3 install fastai==0.7.0

In [0]:
!pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl 
!pip3 install torchvision

In [0]:
!pip install torchtext==0.2.3

In [0]:
!pip install pillow==4.0.0

## Let's GO!

In [0]:
from fastai.imports import *

In [0]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

In [0]:
torch.cuda.is_available()             # This should return True.

True

In [0]:
torch.backends.cudnn.enabled          # This should return True.

True

### Define some settings and get a cross-validation set

In [0]:
PATH = './'
sz = 224
arch = resnet34
bs = 8

In [0]:
label_csv = f'{PATH}labels.csv'                                # Path to the labels file.
n = len(list(open(label_csv)))                                 # Number of lines in the file (this count includes the header row).
val_idxs = get_cv_idxs(n, cv_idx=3)                            # Get the validation set.
                                                               # 'get_cv_idxs(n)' returns a numpy.ndarray of random
                                                               # row indices into 'label_csv' for the validation set.
print('Total no. training Examples: ', n)                      # See how many examples there are.
print('No. of Cross Validation Set Examples: ', len(val_idxs)) # About 20% go into the validation set.
val_idxs

Total no. training Examples:  10223
No. of Cross Validation Set Examples:  2044


array([3386, 8026, 9503, ..., 7504, 8531, 6511])
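
For readers curious how such a split can be produced, here is a minimal NumPy sketch of a seeded random hold-out split. The function name `sketch_cv_idxs` and its defaults (`val_pct=0.2`, `seed=42`) are assumptions for illustration. It mirrors the idea behind `get_cv_idxs` rather than reproducing fastai's code, so the indices it returns will not necessarily match the array above.

```python
import numpy as np


def sketch_cv_idxs(n, cv_idx=0, val_pct=0.2, seed=42):
    """Return the indices of one validation fold out of a seeded shuffle.

    `cv_idx` selects which consecutive chunk of the shuffled indices is used,
    so different values of `cv_idx` give non-overlapping validation sets.
    """
    rng = np.random.RandomState(seed)   # local RNG, leaves global state untouched
    n_val = int(val_pct * n)            # size of the validation set
    start = cv_idx * n_val              # where this fold begins in the shuffle
    shuffled = rng.permutation(n)
    return shuffled[start:start + n_val]


# Same n and cv_idx as in the cell above (assumed inputs for the sketch).
idxs = sketch_cv_idxs(10223, cv_idx=3)
print(len(idxs))
```

Because the shuffle is seeded, rerunning the cell yields the same validation set, which keeps results comparable between training runs.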

### Check out the data

In [0]:
label_df = pd.read_csv(label_csv)
label_df.head(10)

|   | id | breed |
|---|---|---|
| 0 | 000bec180eb18c7604dcecc8fe0dba07 | boston_bull |
| 1 | 001513dfcb2ffafc82cccf4d8bbaba97 | dingo |
| 2 | 001cdf01b096e06d78e9e5112d419397 | pekinese |
| 3 | 00214f311d5d2247d5dfe4fe24b2303d | bluetick |
| 4 | 0021f9ceb3235effd7fcde7f7538ed62 | golden_retriever |
| 5 | 002211c81b498ef88e1b40b9abf84e1d | bedlington_terrier |
| 6 | 00290d3e1fdd27226ba27a8ce248ce85 | bedlington_terrier |
| 7 | 002a283a315af96eaea0e28e7163b21b | borzoi |
| 8 | 003df8b8a8b05244b1d920bb6cf451f9 | basenji |
| 9 | 0042188c895a2f14ef64a918ed9c7b64 | scottish_deerhound |


In [0]:
label_df.pivot_table(index='breed', aggfunc=len).sort_values('id', ascending=False)

| breed | id |
|---|---|
| scottish_deerhound | 126 |
| maltese_dog | 117 |
| afghan_hound | 116 |
| entlebucher | 115 |
| bernese_mountain_dog | 114 |
| shih-tzu | 112 |
| great_pyrenees | 111 |
| pomeranian | 111 |
| basenji | 110 |
| samoyed | 109 |


### Build a function to get the data at varying sizes easily

In [0]:
def get_data(sz, bs):
  tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
  data = ImageClassifierData.from_csv(PATH, 'train', f'{PATH}labels.csv', 
                                      test_name='test', num_workers=4, 
                                      val_idxs=val_idxs, suffix='.jpg', 
                                      tfms=tfms, bs=bs)
  # For small target sizes, pre-resize the images to 340px and cache them in 'tmp'.
  return data if sz > 300 else data.resize(340, 'tmp')

In [0]:
data = get_data(sz, bs)



### Learn!

In [0]:
learn = ConvLearner.pretrained(arch, data, precompute=True, ps=0.5)


In [0]:
lrf=learn.lr_find()
lrf




In [0]:
learn.sched.plot_lr()    # This does not produce a plot, and I have not been able to find the reason.
                         # Suggestions are welcome!

In [0]:
learn.sched.plot()       # This does not produce a plot, and I have not been able to find the reason.
                         # Suggestions are welcome!

In [0]:
learn.fit(1e-2, 3)


epoch      trn_loss   val_loss   accuracy   
    0      1.526604   0.744559   0.770548  
    1      1.20965    0.711759   0.789628  
    2      0.993023   0.64217    0.802838  


[array([0.64217]), 0.8028375733855186]

The training loss is considerably higher than the validation loss, which suggests that the model is underfitting.

In [0]:
learn.precompute=False
learn.fit(1e-2, 5, cycle_len=1)


epoch      trn_loss   val_loss   accuracy   
    0      1.021099   0.571636   0.824853  
    1      0.949509   0.566267   0.827299  
    2      0.923297   0.549844   0.828767  
    3      0.805149   0.545509   0.831703  
    4      0.80754    0.546004   0.833659  


[array([0.546]), 0.8336594911937377]

The issue is still there, because we haven't actually addressed it yet!
So we keep searching for better optima, i.e. we let the learning rate decay over progressively longer cycles.
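
The effect of `cycle_len` and `cycle_mult` on the schedule can be sketched without fastai. The helper names below (`cycle_lengths`, `cosine_lr`) are hypothetical, and the cosine formula is the standard annealing rule used for SGD with warm restarts, assumed here to approximate what the library does rather than copied from it.

```python
import math


def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=1):
    """Length in epochs of each restart cycle."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


def cosine_lr(max_lr, progress):
    """Cosine-annealed learning rate at `progress` (0.0 to 1.0) through a cycle."""
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))


lengths = cycle_lengths(3, cycle_len=1, cycle_mult=2)
print(lengths, sum(lengths))  # [1, 2, 4] 7

# Learning rate at the start, middle and end of a cycle (it restarts at max_lr each cycle).
for progress in (0.0, 0.5, 1.0):
    print(progress, round(cosine_lr(1e-2, progress), 6))
```

With three cycles, `cycle_len=1` and `cycle_mult=2`, the cycles last 1, 2 and 4 epochs, which is why the call below runs for 7 epochs in total.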

In [0]:
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)


epoch      trn_loss   val_loss   accuracy   
    0      0.845072   0.544619   0.830235  
    1      0.829021   0.559367   0.825832  
    2      0.765292   0.529275   0.830724  
    3      0.867494   0.587093   0.82045   
    4      0.868209   0.535415   0.830724  
    5      0.684352   0.534892   0.826321  
    6      0.758954   0.52974    0.838063  


[array([0.52974]), 0.8380626223091977]

In [0]:
learn.save('224_pre')

In [0]:
learn.load('224_pre')

In [0]:
log_preds,y = learn.TTA()
probs = np.mean(np.exp(log_preds),0)



In [0]:
accuracy_np(probs, y)

0.8542074363992173
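
The averaging step above can be checked on toy data. In this sketch the array shape (augmented views, images, classes) and the numbers are made up for illustration, and `toy_accuracy` is a hypothetical stand-in for `accuracy_np`, assumed to compare the arg-max class with the label.

```python
import numpy as np

# Made-up class probabilities: 2 augmented views x 3 images x 2 classes.
view_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]],
    [[0.7, 0.3], [0.2, 0.8], [0.35, 0.65]],
])
log_preds = np.log(view_probs)   # the notebook works with log-probabilities
y = np.array([0, 1, 1])          # made-up true labels

# Convert back to probabilities, then average over the views (axis 0).
probs = np.mean(np.exp(log_preds), axis=0)


def toy_accuracy(probs, y):
    """Fraction of rows whose highest-probability class equals the label."""
    return float((np.argmax(probs, axis=1) == y).mean())


print(probs.shape)
print(toy_accuracy(view_probs[0], y), toy_accuracy(probs, y))
```

In this toy case the first view alone misclassifies the third image, while the average over both views gets it right, which is the kind of gain test-time augmentation aims for.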

In [0]:
preds = np.exp(learn.predict(is_test=True))

In [0]:
fnames = [x.split("/")[1].split(".")[0] for x in data.test_dl.dataset.fnames]

In [0]:
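final = pd.read_csv(f"{PATH}sample_submission.csv")
final.iloc[:, 1:] = preds    # Positional slice: every column after 'id' holds one breed probability.
                             # Assumes the breed columns are in the same order as the model's classes.
final["id"] = fnames         # Keep each row of predictions tied to the test image it was computed for.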

In [0]:
f = "submit1.csv"
final.to_csv(f"{PATH}{f}", index=False)   # Write a plain CSV so the contents match the '.csv' file name.

from IPython.display import FileLink
FileLink(f"{PATH}{f}")

In [0]:
from google.colab import files
files.download('submit1.csv') 
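
As a hedged alternative to overwriting the sample file, the submission table can be assembled directly from the image ids and the predicted probabilities, which keeps each row tied to its id. The ids, class names and probabilities below are made up; in the notebook they would presumably come from `fnames`, `data.classes` and `preds`.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-ins for the notebook's fnames, data.classes and preds.
toy_ids = ["img_a", "img_b"]
toy_classes = ["basenji", "borzoi", "dingo"]
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.1, 0.8]])

# One column per class, then the id column inserted at the front.
submission = pd.DataFrame(toy_probs, columns=toy_classes)
submission.insert(0, "id", toy_ids)

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```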