# Classifying Dog Breeds

The goal is to use the American Kennel Club (AKC) breed taxonomy to assign each breed in the dataset to one or more AKC classifications.

In [1]:
from fuzzywuzzy import fuzz
import json
import numpy as np
import pandas as pd
import random
import re
import wikipedia

The Wikipedia page *List of dog breeds recognized by the American Kennel Club* has a list of dog breeds along with their AKC classification. We can grab this data, and then use fuzzy text matching to match these to the breeds in our data.

In [2]:
akc_breeds = wikipedia.page("List of dog breeds recognized by the American Kennel Club")

# Manually identified links that are NOT breeds
exclude = [u'Dog breed', u'American Kennel Club', u'List of dog breeds', 
           u'List of dog breeds recognized by the Canadian Kennel Club']
breeds = [page for page in akc_breeds.links if page not in exclude]
page_content = akc_breeds.content

# Come up with a dictionary of breed to classification
def identify_classification(breed, page_content):
    # The classification is the text between the breed name (plus one
    # separator character) and the end of that line
    start = page_content.index(breed) + len(breed) + 1
    end = page_content.find('\n', start)
    if end == -1:
        # The breed is on the last line of the page
        end = len(page_content)
    return page_content[start:end].lstrip(' ')

classifications = {}
for breed in breeds:
    
    # Manual fixes, cases where the links and link text are not the same
    breed = (breed
             .replace(' (dog)', '')
             .replace(' (dog breed)', '')
             .replace('American Cocker Spaniel', 'Cocker Spaniel')
             .replace('American Eskimo Dog', 'American Eskimo Dog (Miniature)')
             .replace('Australian Silky Terrier', 'Silky Terrier')
             .replace('Bergamasco Shepherd', 'Bergamasco')
             .replace('English Mastiff', 'Mastiff')
             .replace('Griffon Bruxellois', 'Brussels Griffon')
             .replace('Hungarian Vizsla', 'Vizsla')
             .replace('Rough Collie', 'Collie'))
    try:
        classifications[breed] = identify_classification(breed, page_content)
    except ValueError:
        # str.index raises ValueError when the breed is not in the page text
        print 'No luck with', breed
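
As a rough illustration of the parsing step, the sketch below applies the same idea to a made-up snippet. The `sample` text and its "breed, classification" line layout are assumptions for illustration, not the actual page content, and `classification_after` is a hypothetical helper name.

```python
def classification_after(breed, text):
    """Return the rest of the line that follows `breed` in `text`."""
    # Skip the breed name plus one separator character (assumed layout)
    start = text.index(breed) + len(breed) + 1
    end = text.find('\n', start)
    if end == -1:
        # The breed sits on the last line of the text
        end = len(text)
    return text[start:end].strip()

# Hypothetical snippet; the real page content may be laid out differently
sample = "Affenpinscher, Toy\nAfghan Hound, Hound\nAkita, Working"

parsed = {b: classification_after(b, sample)
          for b in ["Affenpinscher", "Afghan Hound", "Akita"]}
```

A breed that does not appear in the text makes `str.index` raise `ValueError`, which is the case the loop above reports as 'No luck with'.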

Some breeds are assigned to more than one class, as seen below. This is fine: so many of the shelter dogs are mixes that each dog will be allowed to belong to multiple classes anyway.

In [3]:
print 'Classifications:'
set(classifications.values())

Classifications:


{u'Herding',
 u'Hound',
 u'Non-Sporting',
 u'Non-Sporting & Toy',
 u'Pequeno, Hound',
 u'Sporting',
 u'Terrier',
 u'Terrier & Toy',
 u'Toy',
 u'Working'}
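
Combined labels such as 'Non-Sporting & Toy' in the output above can be flattened into individual class names. Here is a minimal sketch, assuming the only separator is ' & '; `split_classes` is a hypothetical helper name, not part of the notebook.

```python
def split_classes(labels):
    """Flatten labels such as 'Terrier & Toy' into individual class names."""
    result = set()
    for label in labels:
        # A label without ' & ' comes back from split() unchanged
        for part in label.split(' & '):
            result.add(part.strip())
    return result

flat = split_classes(['Non-Sporting & Toy', 'Terrier', 'Terrier & Toy'])
```

Collecting the parts in a set means a class that appears in several labels is only kept once.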

Match the classifications to our data. Since so many dogs are mixed-breed, let's make it so that a dog can be part of multiple classes.

In [4]:
train = pd.read_csv('data/train.csv')
train_breeds = list(train['Breed'].unique())
train_breed_classifications = {}
for train_breed in train_breeds:
    classes = []
    
    # Remove the word 'Mix' and identify breeds separately for strings separated
    # by a /, since this often distinguishes between 2 different breeds
    train_breed_clean = train_breed.replace(' Mix', '')
    train_breed_split = train_breed_clean.split('/')
    
    # For each breed, assign the class of the mapped breed/classification w/
    # the highest similarity score (fuzz.ratio, 0-100) between breed names
    for partial_breed in train_breed_split:
        high_score, current_class = 0, None
        for classified_breed in classifications.keys():
            score = fuzz.ratio(partial_breed, classified_breed)
            if score > high_score:
                high_score = score
                current_class = classifications[classified_breed]
        classes.append(current_class)
        
    # Split combined classes of the form 'A & B' into A and B. Build a new
    # set rather than modifying the list while iterating over it
    split_classes = set()
    for myclass in classes:
        if myclass is not None:
            split_classes.update(myclass.split(' & '))
    train_breed_classifications[train_breed] = split_classes
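
The matching loop above boils down to "pick the known breed with the highest similarity score". Here is a minimal sketch of that idea, using `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio` (its scores may differ from fuzzywuzzy's). The `known` mapping and the function names are hypothetical and only for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale; a stand-in for fuzz.ratio."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def best_class(name, known):
    """Return the class of the key in `known` most similar to `name`."""
    best_key = max(known, key=lambda k: similarity(name, k))
    return known[best_key]

def classify(raw_breed, known):
    """Classify each part of a string like 'A/B Mix' separately."""
    parts = raw_breed.replace(' Mix', '').split('/')
    return set(best_class(p, known) for p in parts)

# Hypothetical mapping for illustration only
known = {'Border Collie': 'Herding',
         'Siberian Husky': 'Working',
         'Airedale Terrier': 'Terrier'}

result = classify('Siberian Husky/Border Collie', known)
```

Note that this approach always returns the closest match, however poor, so a breed with no good counterpart still gets a class. A minimum score threshold would be one way to flag those cases for manual review.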

Let's check what the final possible classes are:

In [5]:
final_classes = []
for breed, myclass in train_breed_classifications.items():
    for subclass in myclass:
        if subclass not in final_classes:
            final_classes.append(subclass)
print final_classes

[u'Herding', u'Terrier', u'Working', u'Toy', u'Hound', u'Sporting', u'Non-Sporting']


Let's do a few spot checks of the results.

In [6]:
# Sample 10 distinct breeds; random.randint would include an out-of-range
# upper bound and could pick the same index more than once
for k in random.sample(list(train_breed_classifications.keys()), 10):
    print 'Breed', k
    print 'Classification(s)', train_breed_classifications[k], '\n'

Breed Carolina Dog Mix
Classification(s) set([u'Working']) 

Breed Staffordshire/Boston Terrier
Classification(s) set([u'Non-Sporting', u'Terrier']) 

Breed Airedale Terrier Mix
Classification(s) set([u'Terrier']) 

Breed Lhasa Apso/Shih Tzu
Classification(s) set([u'Non-Sporting', u'Toy']) 

Breed Finnish Spitz Mix
Classification(s) set([u'Non-Sporting']) 

Breed Siberian Husky/Border Collie
Classification(s) set([u'Herding', u'Working']) 

Breed Chinese Sharpei
Classification(s) set([u'Toy']) 

Breed Cavalier Span/Papillon
Classification(s) set([u'Sporting', u'Toy']) 

Breed Nova Scotia Duck Tolling Retriever/Golden Retriever
Classification(s) set([u'Sporting']) 

Breed Jack Russell Terrier/Miniature Poodle
Classification(s) set([u'Toy', u'Terrier']) 



Write the final `train_breed_classifications` dictionary to a JSON file.

In [7]:
train_breed_classifications = dict(
    (k, list(v)) for k, v in train_breed_classifications.items())
with open('dogbreeds.json', 'w') as f:
    json.dump(train_breed_classifications, f)
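
JSON has no set type, which is why the sets are converted to lists before saving. When loading the file later, the lists can be turned back into sets. Here is a minimal sketch of that round trip, using made-up data and a temporary file (both assumptions for illustration) so that it does not touch `dogbreeds.json`.

```python
import json
import os
import tempfile

# Hypothetical data in the same shape as the saved dictionary
saved = {'Airedale Terrier Mix': ['Terrier'],
         'Lhasa Apso/Shih Tzu': ['Non-Sporting', 'Toy']}

# Write to a temporary file so the sketch leaves dogbreeds.json alone
path = os.path.join(tempfile.mkdtemp(), 'dogbreeds_example.json')
with open(path, 'w') as f:
    json.dump(saved, f)

# Load it back and restore each list of classes to a set
with open(path) as f:
    loaded = dict((k, set(v)) for k, v in json.load(f).items())
```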