https://www.kaggle.com/c/leaf-classification

Can you see the random forest for the leaves?

There are estimated to be nearly half a million species of plant in the world. Classification of species has been historically problematic and often results in duplicate identifications. Automating plant recognition might have many applications, including:

    Species population tracking and preservation
    Plant-based medicinal research
    Crop and food supply management

The objective of this playground competition is to use binary leaf images and extracted features, including shape, margin & texture, to accurately identify 99 species of plants. Leaves, due to their volume, prevalence, and unique characteristics, are an effective means of differentiating plant species. They also provide a fun introduction to applying techniques that involve image-based features.

As a first step, try building a classifier that uses the provided pre-extracted features. Next, try creating a set of your own features. Finally, examine the errors you're making and see what you can do to improve.



The dataset consists of approximately 1,584 images of leaf specimens (16 samples each of 99 species), which have been converted to binary black leaves against white backgrounds. Three sets of features are also provided per image: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram. For each feature, a 64-attribute vector is given per leaf sample.

Note that of the original 100 species, we have eliminated one on account of incomplete associated data in the original dataset.

File descriptions

    train.csv - the training set
    test.csv - the test set
    sample_submission.csv - a sample submission file in the correct format
    images - the image files (each image is named with its corresponding id)

Data fields

    id - an anonymous id unique to an image
    margin_1, margin_2, margin_3, ..., margin_64 - the 64 attributes of the margin feature vector
    shape_1, shape_2, shape_3, ..., shape_64 - the 64 attributes of the shape feature vector
    texture_1, texture_2, texture_3, ..., texture_64 - the 64 attributes of the texture feature vector
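
The three feature families can be pulled apart by column-name prefix. The sketch below is only an illustration: `group_feature_columns` and the zero-filled `demo` frame are hypothetical stand-ins, not part of the competition data. The regular expression accepts both the `margin_1` spelling used in the field list and the `margin1` spelling that appears in the `train.head()` output further down.

```python
import re
from collections import defaultdict

import numpy as np
import pandas as pd


def group_feature_columns(columns):
    """Map each feature family (margin, shape, texture) to its column names."""
    groups = defaultdict(list)
    for name in columns:
        # Accept both "margin_1" and "margin1" style names.
        match = re.fullmatch(r"(margin|shape|texture)_?(\d+)", name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


# Hypothetical stand-in for train.csv: same column layout, zero-filled values.
columns = ["id", "species"] + [
    "{}{}".format(family, i)
    for family in ("margin", "shape", "texture")
    for i in range(1, 65)
]
demo = pd.DataFrame(np.zeros((3, len(columns))), columns=columns)

feature_groups = group_feature_columns(demo.columns)
n_features = sum(len(cols) for cols in feature_groups.values())
print({family: len(cols) for family, cols in feature_groups.items()}, n_features)
```

Three families of 64 attributes give 192 feature columns; `id` and `species` match no prefix and are left out.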


In [2]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')

In [5]:
#preview the first five rows of the training set
train.head()

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,Acer_Opalus,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.0,0.00293,0.00293,0.035156,0.0,0.0,0.004883,0.0,0.025391
1,2,Pterocarya_Stenoptera,0.005859,0.0,0.03125,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.0,0.0,0.000977,0.023438,0.0,0.0,0.000977,0.039062,0.022461
2,3,Quercus_Hartwissiana,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.1543,0.0,0.005859,0.000977,0.007812,0.0,0.0,0.0,0.020508,0.00293
3,5,Tilia_Tomentosa,0.0,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,...,0.0,0.000977,0.0,0.0,0.020508,0.0,0.0,0.017578,0.0,0.047852
4,6,Quercus_Variabilis,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,...,0.09668,0.0,0.021484,0.0,0.0,0.0,0.0,0.0,0.0,0.03125


In [6]:
train.describe()

Unnamed: 0,id,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
count,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,...,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0,990.0
mean,799.59596,0.017412,0.028539,0.031988,0.02328,0.014264,0.038579,0.019202,0.001083,0.007167,...,0.036501,0.005024,0.015944,0.011586,0.016108,0.014017,0.002688,0.020291,0.008989,0.01942
std,452.477568,0.019739,0.038855,0.025847,0.028411,0.01839,0.05203,0.017511,0.002743,0.008933,...,0.063403,0.019321,0.023214,0.02504,0.015335,0.060151,0.011415,0.03904,0.013791,0.022768
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,415.25,0.001953,0.001953,0.013672,0.005859,0.001953,0.0,0.005859,0.0,0.001953,...,0.0,0.0,0.000977,0.0,0.004883,0.0,0.0,0.0,0.0,0.000977
50%,802.5,0.009766,0.011719,0.025391,0.013672,0.007812,0.015625,0.015625,0.0,0.005859,...,0.004883,0.0,0.005859,0.000977,0.012695,0.0,0.0,0.003906,0.00293,0.011719
75%,1195.5,0.025391,0.041016,0.044922,0.029297,0.017578,0.056153,0.029297,0.0,0.007812,...,0.043701,0.0,0.022217,0.009766,0.021484,0.0,0.0,0.023438,0.012695,0.029297
max,1584.0,0.087891,0.20508,0.15625,0.16992,0.11133,0.31055,0.091797,0.03125,0.076172,...,0.42969,0.20215,0.17285,0.2002,0.10645,0.57813,0.15137,0.37598,0.086914,0.1416


In [14]:
#preview the first five rows of the test set (no species column here)
test.head()

Unnamed: 0,id,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,4,0.019531,0.009766,0.078125,0.011719,0.003906,0.015625,0.005859,0.0,0.005859,...,0.006836,0.0,0.015625,0.000977,0.015625,0.0,0.0,0.0,0.003906,0.053711
1,7,0.007812,0.005859,0.064453,0.009766,0.003906,0.013672,0.007812,0.0,0.033203,...,0.0,0.0,0.006836,0.001953,0.013672,0.0,0.0,0.000977,0.037109,0.044922
2,9,0.0,0.0,0.001953,0.021484,0.041016,0.0,0.023438,0.0,0.011719,...,0.12891,0.0,0.000977,0.0,0.0,0.0,0.0,0.015625,0.0,0.0
3,12,0.0,0.0,0.009766,0.011719,0.017578,0.0,0.003906,0.0,0.003906,...,0.012695,0.015625,0.00293,0.036133,0.013672,0.0,0.0,0.089844,0.0,0.008789
4,13,0.001953,0.0,0.015625,0.009766,0.039062,0.0,0.009766,0.0,0.005859,...,0.0,0.042969,0.016602,0.010742,0.041016,0.0,0.0,0.007812,0.009766,0.007812


In [11]:
#in the dataset, the species column is the label
#get all distinct labels (99, as mentioned in the introduction)
species_list=np.unique(train['species'])
print(species_list)
print(len(species_list))

['Acer_Capillipes' 'Acer_Circinatum' 'Acer_Mono' 'Acer_Opalus'
 'Acer_Palmatum' 'Acer_Pictum' 'Acer_Platanoids' 'Acer_Rubrum'
 'Acer_Rufinerve' 'Acer_Saccharinum' 'Alnus_Cordata' 'Alnus_Maximowiczii'
 'Alnus_Rubra' 'Alnus_Sieboldiana' 'Alnus_Viridis' 'Arundinaria_Simonii'
 'Betula_Austrosinensis' 'Betula_Pendula' 'Callicarpa_Bodinieri'
 'Castanea_Sativa' 'Celtis_Koraiensis' 'Cercis_Siliquastrum'
 'Cornus_Chinensis' 'Cornus_Controversa' 'Cornus_Macrophylla'
 'Cotinus_Coggygria' 'Crataegus_Monogyna' 'Cytisus_Battandieri'
 'Eucalyptus_Glaucescens' 'Eucalyptus_Neglecta' 'Eucalyptus_Urnigera'
 'Fagus_Sylvatica' 'Ginkgo_Biloba' 'Ilex_Aquifolium' 'Ilex_Cornuta'
 'Liquidambar_Styraciflua' 'Liriodendron_Tulipifera'
 'Lithocarpus_Cleistocarpus' 'Lithocarpus_Edulis' 'Magnolia_Heptapeta'
 'Magnolia_Salicifolia' 'Morus_Nigra' 'Olea_Europaea' 'Phildelphus'
 'Populus_Adenopoda' 'Populus_Grandidentata' 'Populus_Nigra' 'Prunus_Avium'
 'Prunus_X_Shmittii' 'Pterocarya_Stenoptera' 'Quercus_Afares'
 'Querc

In [15]:
#encode labels - convert categorical column to numerical
from sklearn.preprocessing import LabelEncoder

In [16]:
le=LabelEncoder()
le.fit(train.species)
#le.classes_ holds the sorted species names; each integer label is an index into it
labels=le.transform(train.species)
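
To make the encoding step concrete, here is a minimal sketch on a few species names from the outputs above. It uses `fit_transform`, which combines the fit and transform calls, and `inverse_transform`, which is what turns predicted integers back into species names. The `encoder`, `codes` and `recovered` names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A handful of species names that appear in the notebook outputs.
species = np.array(["Acer_Opalus", "Pterocarya_Stenoptera", "Acer_Opalus", "Acer_Mono"])

encoder = LabelEncoder()
codes = encoder.fit_transform(species)  # fit + transform in one call

# classes_ is sorted, so the integer codes follow alphabetical order of the names.
print(list(encoder.classes_))
print(codes)

# inverse_transform recovers the original strings from the integer codes.
recovered = encoder.inverse_transform(codes)
print(recovered)
```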

In [20]:
#remove species column from train dataset
#note: the id column is still present, so it is passed to the model as a feature
train=train.drop(['species'],axis=1)

In [21]:
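#split the training data into train and validation sets
#(sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it)
from sklearn.model_selection import train_test_split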

In [22]:
X_train, X_val, y_train, y_val = train_test_split(train, labels, test_size=0.2, random_state=37)
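
With 990 training rows spread over 99 species, each class averages only 10 rows, so a plain random 80/20 split can leave some species with very few validation rows, or none. One option is the `stratify` argument of `train_test_split`. The sketch below runs on a hypothetical random stand-in for the data (99 classes, exactly 10 rows each, which is an assumption about the class balance), not on `train.csv` itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the training data: 99 classes, 10 rows each,
# 192 random features (the real values would come from train.csv).
rng = np.random.RandomState(37)
n_classes, per_class, n_features = 99, 10, 192
X = rng.rand(n_classes * per_class, n_features)
y = np.repeat(np.arange(n_classes), per_class)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=37, stratify=y
)

# Count how many validation rows each class received.
val_counts = np.bincount(y_va, minlength=n_classes)
print(X_tr.shape, X_va.shape)
print(val_counts.min(), val_counts.max())
```

Under these assumptions every class contributes the same number of rows to the validation set, which makes the validation accuracy less dependent on which species happened to be sampled.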

In [26]:
#imports for classification
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
#from sklearn.tree import DecisionTreeClassifier

#run classification (default hyperparameters; no random_state is set, so the score varies between runs)
clf=RandomForestClassifier()
clf.fit(X_train,y_train)

#now, make predictions on the held-out validation split
y_predicts=clf.predict(X_val)
acc = accuracy_score(y_val, y_predicts)
print("Val acc:  {}".format(round(acc,3)))

Val acc:  0.828
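
A single 80/20 split leaves fewer than 200 validation rows, so the accuracy above is a fairly noisy estimate. A possible complement is stratified k-fold cross-validation, which reports one score per fold. The sketch below runs on hypothetical toy data (5 well-separated classes), and the `n_estimators=100` and `random_state=37` settings are illustrative choices, not the configuration used in the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical toy data: 5 well-separated classes, 20 rows each.
rng = np.random.RandomState(0)
n_classes, per_class, n_features = 5, 20, 8
y = np.repeat(np.arange(n_classes), per_class)
X = rng.normal(size=(len(y), n_features)) + y[:, None] * 3.0

# Fixing random_state makes both the forest and the fold assignment repeatable.
clf = RandomForestClassifier(n_estimators=100, random_state=37)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=37)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean +/- std: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

The spread across folds gives a rough idea of how much a single validation score can move from one split to another.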
