## What's this skin thing?

We're first going to scrape images from [dermnet](http://www.dermnet.com/dermatology-pictures-skin-disease-pictures/), using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [27]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import collections

First, we'll initialize bs4 and grab the index page.

In [2]:
def urlToSoup(url):
    return BeautifulSoup(urlopen(url), 'html.parser')

# Downloading webpage and initializing soup
soup = urlToSoup('http://www.dermnet.com/dermatology-pictures-skin-disease-pictures/')

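# Note: soup.prettify is not called here, so this prints the repr of the bound
# method (which embeds the document). Use soup.prettify()[:1000] for indented output.
print(str(soup.prettify)[:1000])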

<bound method Tag.prettify of <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Dermnet: Dermatology Pictures - Skin Disease Pictures</title>
<meta content="document" name="resource-type"/>
<meta content="15 days" name="revisit-after"/>
<meta content="Dermatology Pictures - Skin Disease Photos." name="description"/>
<meta content="acne picture, psoriasis picture, eczema picture, herpes picture, actinic keratosis picture, alopecia areata picture, std picture, Atopic Dermatitis picture, Contact Dermatitis picture, tinea picture, Melanoma picture" name="keywords"/>
<meta content="all=index,follow" name="robots"/>
<meta content="Global" name="distribution"/>
<meta content="Safe For Kids" name="rating"/>
<meta content="Dermnet.com - all rights reserved" name="copyright"/>
<meta content="Dermnet

### Grabbing classes
Now, we'll find all of the sub-pages that contain specific disease types.

In [3]:
# Grabbing all urls on page and filtering to image subpages

# limiting to images under "complete list"
table = soup.find('table', {"width":'100%'})

# grabbing all links
url_elements = table.findChildren('a')

# getting link value
url_links = [elem.get('href') for elem in url_elements]

# url_links includes None elements
clean_url_links = filter(None, url_links)
image_links = [x for x in clean_url_links if '/images/' in x]

print("Sample:")
for link in image_links[:5]:
    print(link)
print("There are {} different categories".format(len(image_links)))

Sample:
/images/Acanthosis-Nigricans
/images/Accessory-Nipple
/images/Accessory-Trachus
/images/Acid-Burn
/images/Acne-Closed-Comedo
There are 643 different categories
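
The two-stage filter can be tried offline on a hand-written list. This is a minimal sketch: the `href` values are made up, and the `None` entry stands in for an `<a>` tag that has no `href` attribute.

```python
# Hypothetical href values, as elem.get('href') might return them.
sample_links = ['/images/Acid-Burn', None, '/about', '', '/images/Acne-Cystic']

# filter(None, ...) drops falsy entries: None and the empty string.
non_empty = filter(None, sample_links)

# Keep only links that point at an image category page.
image_only = [link for link in non_empty if '/images/' in link]

print(image_only)
```

Dropping `None` first matters because `'/images/' in None` would raise a `TypeError`.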


Now, for each of the categories, we need to know how many pages of images there are. 

### Helper functions

In [28]:
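from collections.abc import Iterable

def getMaxPages(image_link):
    """Return the highest page number in a category's pagination block (0 if none)."""
    soup = urlToSoup('http://www.dermnet.com' + image_link)
    page_elems = soup.find('div', {'class': 'pagination'})

    if page_elems is None:
        return 0

    page_elem_links = page_elems.findChildren('a')
    page_nums = [int(elem.text) for elem in page_elem_links if elem.text.isnumeric()]
    # Guard against a pagination block with no numeric links
    return page_nums[-1] if page_nums else 0

def flatten(l):
    """Recursively flatten nested iterables, treating str and bytes as leaves."""
    for el in l:
        # collections.Iterable was removed in Python 3.10; use collections.abc instead
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el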
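
To see what `flatten` does before it is used later on, here is a small standalone sketch. It is written against `collections.abc`, where `Iterable` lives in current Python versions, and the file names are placeholders.

```python
from collections.abc import Iterable

def flatten(items):
    """Yield the leaves of an arbitrarily nested iterable; strings count as leaves."""
    for el in items:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

# Placeholder data: one list of image names per page of a category.
nested = [['a.jpg', 'b.jpg'], ['c.jpg'], []]
flat = list(flatten(nested))
print(flat)
```

Without the `str`/`bytes` check, each URL would be split into single characters and recursion would never bottom out.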

## Counting pages per class

In [9]:
# warning, this takes a long time
links_and_pages = [(link, getMaxPages(link)) for link in image_links]

links_and_pages[:10]

[('/images/Acanthosis-Nigricans', 6),
 ('/images/Accessory-Nipple', 0),
 ('/images/Accessory-Trachus', 0),
 ('/images/Acid-Burn', 0),
 ('/images/Acne-Closed-Comedo', 4),
 ('/images/Acne-Cystic', 13),
 ('/images/Acne-Excoriated', 3),
 ('/images/Acne-Histology', 0),
 ('/images/Acne-Infantile', 2),
 ('/images/Acne-Keloidalis', 6)]
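
The page counts above come from the numeric labels in each category's pagination block. Here is a minimal offline sketch of that selection, using hypothetical link texts rather than the site's real markup. The helper name `max_page_from_labels` is made up, and it takes the maximum instead of the last element, which is a swapped-in choice that also tolerates an empty list.

```python
def max_page_from_labels(labels):
    """Return the largest numeric label, or 0 when there are none."""
    page_nums = [int(text) for text in labels if text.isnumeric()]
    return max(page_nums, default=0)

with_pages = max_page_from_labels(['1', '2', '3', 'Next'])
without_pages = max_page_from_labels([])
print(with_pages, without_pages)
```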

In [7]:
def takeSecond(elem):
    return elem[1]

links_and_pages.sort(key=takeSecond, reverse=True)
reduced = links_and_pages[:5]

In [10]:
reduced

[('/images/Seborrheic-Keratoses-Ruff', 43),
 ('/images/Herpes-Zoster', 37),
 ('/images/Atopic-Dermatitis-Adult-Phase', 28),
 ('/images/Psoriasis-Chronic-Plaque', 27),
 ('/images/Eczema-Hand', 26)]

## Grabbing all image links

### Helper functions

In [11]:
def getPageThumbnailElements(page_soup):
    return page_soup.findAll("div", {'class':'thumbnails'})


def getImageUrls(thumbnail):
    return thumbnail.findChild("a").findChild("img").attrs["src"]


def getImageLinks(page_url):
    soup = urlToSoup(page_url)
    thumbnail_elts = getPageThumbnailElements(soup)
    urls = [getImageUrls(thumbnail_elt) for thumbnail_elt in thumbnail_elts]
    return urls

# General URL structure:
# http://www.dermnet.com/images/Acanthosis-Nigricans/photos/1

def getPageImageLinks(page_url, page_num):
    full_page_url = page_url + "/photos/" + str(page_num)
    return getImageLinks(full_page_url)

# Quick check on the first page of the largest category
print(getImageLinks('http://www.dermnet.com' + reduced[0][0]))

['http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-00134.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-1.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-10.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-100.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-101.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-102.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-103.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-104.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-105.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-106.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-107.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-108.jpg']


### Grabbing image urls for each class

In [14]:
# this section grabs all of the image URLs for each of the conditions
# Note -- really slow to run

base = 'http://www.dermnet.com'

tagged_images = []

for condition in links_and_pages:
    condition_image_list = []
    # Categories without a pagination block report 0 pages but still have a first page
    for page_num in range(1, max(2, condition[1] + 1)):
        new_links = getPageImageLinks(base + condition[0], page_num)
        condition_image_list.extend(new_links)
    tagged_images.append((condition[0], condition_image_list))
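
The `max(2, ...)` in the loop is easy to misread, so here is a small offline sketch of the URL construction on its own. The helper name `page_urls` is hypothetical, and the URL pattern is the one noted in the comment above `getPageImageLinks`.

```python
BASE = 'http://www.dermnet.com'

def page_urls(category_link, max_pages):
    """Build the /photos/N URLs for one category.

    A category recorded with 0 pages still has a first page,
    so the range stop is never allowed to drop below 2.
    """
    stop = max(2, max_pages + 1)
    return [BASE + category_link + '/photos/' + str(n) for n in range(1, stop)]

single = page_urls('/images/Acid-Burn', 0)
several = page_urls('/images/Acanthosis-Nigricans', 6)
print(single)
print(len(several))
```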
    

### Evaluating

In [15]:
def print_counted_tags(tags):
    for tag, urls in tags:
        print("{} has {} images so far".format(tag, len(urls)))

print_counted_tags(tagged_images)

/images/Acanthosis-Nigricans has 64 images so far
/images/Accessory-Nipple has 10 images so far
/images/Accessory-Trachus has 6 images so far
/images/Acid-Burn has 2 images so far
/images/Acne-Closed-Comedo has 44 images so far
/images/Acne-Cystic has 147 images so far
/images/Acne-Excoriated has 28 images so far
/images/Acne-Histology has 3 images so far
/images/Acne-Infantile has 16 images so far
/images/Acne-Keloidalis has 61 images so far
/images/Acne-Mechanica has 1 images so far
/images/Acne-Open-Comedo has 72 images so far
/images/Acne-Primary-Lesions has 6 images so far
/images/Acne-Pustular has 72 images so far
/images/Acne-Scar has 11 images so far
/images/Acne-Steroid has 0 images so far
/images/Acquired-Digital-Fibrokeratoma has 7 images so far
/images/Acrocyanosis has 3 images so far
/images/Acrodermatitis-Chronica-Atrophicans has 2 images so far
/images/Acrodermatitis-Enteropathica has 6 images so far
/images/Actinic-Cheilitis-Squamous--Cell-Lip has 62 images so far
/imag

In [19]:
clean_tags[0]

('Acanthosis-Nigricans',
 ['http://www.dermnet.com/dn2/allJPG3/03dermatitisDrug112205-2.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-1.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-10.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-11.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-12.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-13.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-14.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-15.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-16.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-17.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-18.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-19.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-2.jpg',
  'http://www.dermnet.com/dn2/allJPG3/acanthosis-nigricans-20.jpg',
  'http://www.dermnet.co

### Cleaning urls / classes

In [16]:
# Little cleanup -- could refactor code later if needed
clean_tags = []

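for tag, urls in tagged_images:
    # Drop the leading '/images/' (8 characters) to get the bare class name
    clean_tag = tag[len('/images/'):]
    # Point each thumbnail URL at the full-size image directory instead
    clean_urls = [url.replace("allJPGThumb3", "allJPG3") for url in urls]
    clean_tags.append((clean_tag, clean_urls))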
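
To make the two string operations concrete, here is a sketch on a single tag and thumbnail URL of the kind scraped above. It relies on the notebook's own assumption that thumbnails and full-size images differ only in the directory name.

```python
PREFIX = '/images/'

tag = '/images/Acanthosis-Nigricans'
thumb_url = 'http://www.dermnet.com/dn2/allJPGThumb3/acanthosis-nigricans-1.jpg'

# len('/images/') is 8, which is where a slice index of 8 comes from.
clean_tag = tag[len(PREFIX):] if tag.startswith(PREFIX) else tag

# Swap the thumbnail directory for the full-size one.
full_url = thumb_url.replace('allJPGThumb3', 'allJPG3')

print(clean_tag)
print(full_url)
```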


In [17]:
print_counted_tags(clean_tags)

Acanthosis-Nigricans has 64 images so far
Accessory-Nipple has 10 images so far
Accessory-Trachus has 6 images so far
Acid-Burn has 2 images so far
Acne-Closed-Comedo has 44 images so far
Acne-Cystic has 147 images so far
Acne-Excoriated has 28 images so far
Acne-Histology has 3 images so far
Acne-Infantile has 16 images so far
Acne-Keloidalis has 61 images so far
Acne-Mechanica has 1 images so far
Acne-Open-Comedo has 72 images so far
Acne-Primary-Lesions has 6 images so far
Acne-Pustular has 72 images so far
Acne-Scar has 11 images so far
Acne-Steroid has 0 images so far
Acquired-Digital-Fibrokeratoma has 7 images so far
Acrocyanosis has 3 images so far
Acrodermatitis-Chronica-Atrophicans has 2 images so far
Acrodermatitis-Enteropathica has 6 images so far
Actinic-Cheilitis-Squamous--Cell-Lip has 62 images so far
Actinic-Comedones has 15 images so far
Actinic-Granuloma has 5 images so far
Actinic-Keratosis-5-FU has 61 images so far
Actinic-keratosis-Arm has 3 images so far
Actinic-Ke

### Combining duplicate classes

In [60]:
import pandas as pd

def firstkey(val):
    return val[0]

duplicates = {
    'Eczema-Fingertips': 'Eczema',
    'Eczema-Nummular': 'Eczema',
    'Stasis-Dermatitis': 'Eczema',
    'Eczema-Foot': 'Eczema',
    'Eczema-Hand': 'Eczema',
    'Eczema-Subacute': 'Eczema',
    'Atopic-Dermatitis-Adult-Phase': 'Eczema',
    'Atopic-Dermatitis-Childhood-Phase':'Eczema',
    'Atopic-Dermatitis-Infant-phase':'Eczema',
    'Eczema-Acute': 'Eczema', 
    'Eczema-Anal': 'Eczema', 
    'Eczema-Areola': 'Eczema', 
    'Eczema-Arms': 'Eczema', 
    'Eczema-Asteatotic': 'Eczema', 
    'Eczema-Axillae': 'Eczema', 
    'Eczema-Chronic': 'Eczema', 
    'Eczema-Ears': 'Eczema', 
    'Eczema-Face': 'Eczema', 
    'Eczema-Herpeticum': 'Eczema', 
    'Eczema-Histology': 'Eczema', 
    'Eczema-Hyperkeratotic': 'Eczema', 
    'Eczema-Impetiginized': 'Eczema', 
    'Eczema-Leg': 'Eczema', 
    'Eczema-Lids': 'Eczema', 
    'Eczema-Penis': 'Eczema', 
    'Eczema-Scrotum': 'Eczema',
    'Eczema-Trunk-Generalized': 'Eczema',
    'Eczema-Vaccinatum': 'Eczema',
    'Eczema-Vulva': 'Eczema',
    'Candidiasis-large-Skin-Folds': 'Intertrigo',
    'Seborrheic-Keratoses-Ruff':'Seborrheic-Dermatitis',
    'Seborrheic-Keratoses-Smooth':'Seborrheic-Dermatitis',
    'Seborrheic-Keratosis-Irritated':'Seborrheic-Dermatitis'
}
new_tags = []

for tag, imgs in clean_tags:
    # Map sub-classes onto their parent class; other tags pass through unchanged
    new_tags.append((duplicates.get(tag, tag), imgs))

print_counted_tags(new_tags)

Acanthosis-Nigricans has 64 images so far
Accessory-Nipple has 10 images so far
Accessory-Trachus has 6 images so far
Acid-Burn has 2 images so far
Acne-Closed-Comedo has 44 images so far
Acne-Cystic has 147 images so far
Acne-Excoriated has 28 images so far
Acne-Histology has 3 images so far
Acne-Infantile has 16 images so far
Acne-Keloidalis has 61 images so far
Acne-Mechanica has 1 images so far
Acne-Open-Comedo has 72 images so far
Acne-Primary-Lesions has 6 images so far
Acne-Pustular has 72 images so far
Acne-Scar has 11 images so far
Acne-Steroid has 0 images so far
Acquired-Digital-Fibrokeratoma has 7 images so far
Acrocyanosis has 3 images so far
Acrodermatitis-Chronica-Atrophicans has 2 images so far
Acrodermatitis-Enteropathica has 6 images so far
Actinic-Cheilitis-Squamous--Cell-Lip has 62 images so far
Actinic-Comedones has 15 images so far
Actinic-Granuloma has 5 images so far
Actinic-Keratosis-5-FU has 61 images so far
Actinic-keratosis-Arm has 3 images so far
Actinic-Ke

In [60]:
print_counted_tags(combined_classes_and_links)

Seborrheic-Dermatitis has 514 images so far
Eczema has 331 images so far
Eczema has 302 images so far
Seborrheic-Dermatitis has 288 images so far
Eczema has 236 images so far
Intertrigo has 193 images so far
Seborrheic-Dermatitis has 176 images so far
Eczema has 142 images so far
Seborrheic-Dermatitis has 134 images so far
Eczema has 124 images so far
Eczema has 113 images so far
Eczema has 97 images so far
Eczema has 92 images so far
Intertrigo has 94 images so far
Eczema has 26 images so far


In [69]:
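def first(tuples):
    """Return the key of every (key, value) pair."""
    return [k for k, v in tuples]


def values(key, tuples):
    """Yield every URL stored under `key`, flattened across duplicate entries."""
    value_list = [v for k, v in tuples if k == key]
    return flatten(value_list)


def combineByKey(d):
    """Merge (key, urls) pairs that share a key into one pair per key."""
    # Iterating a set means the output order is arbitrary
    unique_keys = set(first(d))
    new_tuples = []
    for key in unique_keys:
        new_values = list(values(key, d))
        new_tuples.append((key, new_values))
    return new_tuples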


combined_tags = combineByKey(new_tags)
print_counted_tags(combined_tags)

Notalgia-Paraesthetica has 5 images so far
Hydrocystoma has 10 images so far
Radiation-Dermatitis has 4 images so far
Chronic-Bullous-Dermatosis-Childhood has 1 images so far
Grovers-Disease has 59 images so far
Pediculosis-Lids has 5 images so far
Actinic-Keratosis-Ear has 14 images so far
Contact-Airborne has 3 images so far
Botryomycosis-Staph has 2 images so far
Pretibia-myxedema has 27 images so far
Epidermal-Nevus has 73 images so far
Dyshidrosis has 59 images so far
Pemphigus-Foliaceous has 30 images so far
Rosacea-Nose has 78 images so far
kerion has 18 images so far
Psoriasis-Vulva has 2 images so far
Congenital-Anomalies has 8 images so far
Psoriasis-Erythrodermic has 39 images so far
Candida-Groin has 13 images so far
Squamous-Cell-Carcinoma-Lip has 28 images so far
Tinea-Ringworm-Nigra has 4 images so far
Erythema-Annulare-Centrifugum has 18 images so far
Xanthomas has 83 images so far
Mongolian-Spot has 8 images so far
Epidermal-Nevus-Inflammatory has 6 images so far
Trich
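
The same grouping can be done in a single pass with `collections.defaultdict`. This is an alternative sketch, not the cell's own approach: the name `combine_by_key` and the sample data are made up. Unlike the set-based version, it keeps classes in order of first appearance.

```python
from collections import defaultdict

def combine_by_key(pairs):
    """Merge (key, urls) pairs that share a key, keeping first-seen key order."""
    grouped = defaultdict(list)
    for key, urls in pairs:
        grouped[key].extend(urls)
    return list(grouped.items())

# Placeholder data with one repeated class.
sample = [
    ('Eczema', ['a.jpg', 'b.jpg']),
    ('Intertrigo', ['c.jpg']),
    ('Eczema', ['d.jpg']),
]
combined = combine_by_key(sample)
print(combined)
```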

### Filtering out small classes

In [70]:
threshold = 100 # number of images / class

cleaned_classes_and_links = [(tag, urls) for tag, urls 
                             in combined_tags 
                             if len(urls) > threshold]


### Saving Work

In [71]:
# Saving work
import pickle

# The with-block closes the file even if pickling fails
with open("cleaned_classes_and_links.pickle", "wb") as pickle_out:
    pickle.dump(cleaned_classes_and_links, pickle_out)
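
A quick way to confirm the saved file is usable is to load it back and compare. The sketch below does a round trip with placeholder data in a temporary directory, so it does not touch the real pickle; the variable names are illustrative.

```python
import os
import pickle
import tempfile

# Placeholder data in the same (tag, urls) shape as the real list.
classes_and_links = [('Eczema', ['a.jpg', 'b.jpg'])]

path = os.path.join(tempfile.mkdtemp(), 'cleaned_classes_and_links.pickle')

with open(path, 'wb') as pickle_out:
    pickle.dump(classes_and_links, pickle_out)

with open(path, 'rb') as pickle_in:
    restored = pickle.load(pickle_in)

print(restored == classes_and_links)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.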