## What's this skin thing?

We're first going to scrape images from [dermnet](http://www.dermnet.com/dermatology-pictures-skin-disease-pictures/), using [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [3]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

First, we'll initialize bs4 and grab the index page.

In [4]:
def urlToSoup(url):
    return BeautifulSoup(urlopen(url), 'html.parser')

# Downloading webpage and initializing soup
soup = urlToSoup('http://www.dermnet.com/dermatology-pictures-skin-disease-pictures/')

# Peek at the first 1000 characters of the parsed page
print(str(soup)[:1000])

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Dermnet: Dermatology Pictures - Skin Disease Pictures</title>
<meta content="document" name="resource-type"/>
<meta content="15 days" name="revisit-after"/>
<meta content="Dermatology Pictures - Skin Disease Photos." name="description"/>
<meta content="acne picture, psoriasis picture, eczema picture, herpes picture, actinic keratosis picture, alopecia areata picture, std picture, Atopic Dermatitis picture, Contact Dermatitis picture, tinea picture, Melanoma picture" name="keywords"/>
<meta content="all=index,follow" name="robots"/>
<meta content="Global" name="distribution"/>
<meta content="Safe For Kids" name="rating"/>
<meta content="Dermnet.com - all rights reserved" name="copyright"/>
<meta content="Dermnet

Now, we'll find all of the sub-pages that contain specific disease types.

In [5]:
# Grabbing all urls on page and filtering to image subpages

# limiting to images under "complete list"
table = soup.find('table', {"width":'100%'})

# grabbing all links
url_elements = table.findChildren('a')

# getting link value
url_links = [elem.get('href') for elem in url_elements]

# url_links includes None elements, so drop them first
clean_url_links = filter(None, url_links)
image_links = [x for x in clean_url_links if '/images/' in x]

print("Sample:")
for link in image_links[:5]:
    print(link)
print("There are {} different categories".format(len(image_links)))

Sample:
/images/Acanthosis-Nigricans
/images/Accessory-Nipple
/images/Accessory-Trachus
/images/Acid-Burn
/images/Acne-Closed-Comedo
There are 643 different categories
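
The filtering step can be tried without touching the network. The following is a minimal sketch, assuming the hrefs have already been pulled out of the anchor tags; `filter_image_links` and the sample list are hypothetical names and values made up for illustration.

```python
def filter_image_links(hrefs):
    """Keep only hrefs that point at an image category page."""
    # Drop missing hrefs (None or empty string), then keep paths containing '/images/'
    return [href for href in hrefs if href and "/images/" in href]


# Hypothetical sample of what elem.get('href') might return for a mix of anchors
sample_hrefs = ["/images/Acid-Burn", None, "/about", "", "/images/Acne-Cystic"]

image_links = filter_image_links(sample_hrefs)
print(image_links)
```

Folding the `None` check into the comprehension avoids the intermediate `filter` object, which can only be iterated once.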


Next, for each category, we need to know how many pages of images it has. A category with no pagination block is recorded as having 0 pages.

In [6]:
def getMaxPages(image_link):
    soup = urlToSoup('http://www.dermnet.com' + image_link)
    page_elems = soup.find('div',{'class':'pagination'})
    
    if page_elems is None:
        max_pages = 0
    else:
        page_elem_links = page_elems.findChildren('a')
        page_nums = [int(elem.text) for elem in page_elem_links if elem.text.isnumeric()]
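        # the last numeric link is the highest page; fall back to 0 if none are numeric
        max_pages = page_nums[-1] if page_nums else 0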
    return max_pages

In [7]:
# Warning: this takes a long time (one HTTP request per category)
links_and_pages = [(link, getMaxPages(link)) for link in image_links]

links_and_pages[:10]

[('/images/Acanthosis-Nigricans', 6),
 ('/images/Accessory-Nipple', 0),
 ('/images/Accessory-Trachus', 0),
 ('/images/Acid-Burn', 0),
 ('/images/Acne-Closed-Comedo', 4),
 ('/images/Acne-Cystic', 13),
 ('/images/Acne-Excoriated', 3),
 ('/images/Acne-Histology', 0),
 ('/images/Acne-Infantile', 2),
 ('/images/Acne-Keloidalis', 6)]
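
The page-count logic can be checked in isolation on the link labels alone. This is a rough sketch, assuming the pagination links carry labels such as page numbers plus a non-numeric "Next" entry; `max_page_from_labels` and the sample labels are hypothetical.

```python
def max_page_from_labels(labels):
    """Return the largest numeric page label, or 0 if there is none."""
    # str.isdecimal() only accepts characters that int() can parse
    page_nums = [int(label) for label in labels if label.strip().isdecimal()]
    return max(page_nums, default=0)


# Hypothetical labels as they might appear inside a pagination block
with_pages = max_page_from_labels(["1", "2", "3", "Next"])
without_pages = max_page_from_labels([])
only_text = max_page_from_labels(["Next"])
print(with_pages, without_pages, only_text)
```

Using `max(..., default=0)` rather than indexing the last element means a pagination block with no numeric links yields 0 instead of raising an `IndexError`.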

In [8]:
def takeSecond(elem):
    return elem[1]

# Sort categories by page count, largest first, and keep the top five
links_and_pages.sort(key=takeSecond, reverse=True)
reduced = links_and_pages[:5]

In [9]:
reduced

[('/images/Seborrheic-Keratoses-Ruff', 43),
 ('/images/Herpes-Zoster', 37),
 ('/images/Atopic-Dermatitis-Adult-Phase', 28),
 ('/images/Psoriasis-Chronic-Plaque', 27),
 ('/images/Eczema-Hand', 26)]
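
As an aside, the same ranking can be written without a helper function. The sketch below uses `operator.itemgetter` and `sorted`, which returns a new list and leaves the original order alone; the data is a subset of the sample output shown earlier, and `top_two` is a name chosen for illustration.

```python
from operator import itemgetter

# A few (link, page count) pairs taken from the sample output earlier in this section
links_and_pages = [
    ("/images/Acanthosis-Nigricans", 6),
    ("/images/Accessory-Nipple", 0),
    ("/images/Acne-Closed-Comedo", 4),
    ("/images/Acne-Cystic", 13),
    ("/images/Acne-Excoriated", 3),
]

# Sort by the second tuple element (page count), largest first, and keep two
top_two = sorted(links_and_pages, key=itemgetter(1), reverse=True)[:2]
print(top_two)
```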

In [12]:
def getPageThumbnailElements(page_soup):
    return page_soup.findAll("div", {'class':'thumbnails'})


def getImageUrls(thumbnail):
    return thumbnail.findChild("a").findChild("img").attrs["src"]


def getImageLinks(page_url):
    soup = urlToSoup(page_url)
    thumbnail_elts = getPageThumbnailElements(soup)
    urls = [getImageUrls(thumbnail_elt) for thumbnail_elt in thumbnail_elts]
    return urls

# General URL structure:
# http://www.dermnet.com/images/Acanthosis-Nigricans/photos/1

def getPageImageLinks(page_url, page_num):
    full_page_url = page_url + "/photos/" + str(page_num)
    return getImageLinks(full_page_url)

# Thumbnails listed on the landing page of the largest category
print(getImageLinks('http://www.dermnet.com' + reduced[0][0]))


['http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-00134.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-1.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-10.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-100.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-101.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-102.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-103.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-104.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-105.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-106.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-107.jpg', 'http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-108.jpg']
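
To see the thumbnail extraction work offline, it can be run against a small inline snippet. This is only a sketch: the markup below is a hypothetical stand-in for the `div.thumbnails > a > img` structure that the functions above rely on, and `extract_thumbnail_urls` is an illustrative name. Unlike the chained `findChild` calls, it looks for any `img` inside the div and skips thumbnails that have none rather than raising an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the thumbnail structure assumed by the scraper
html = """
<div class="thumbnails"><a href="#"><img src="http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-1.jpg"/></a></div>
<div class="thumbnails"><a href="#"><span>no image here</span></a></div>
"""


def extract_thumbnail_urls(soup):
    """Collect the img src of every thumbnail div that actually contains an image."""
    urls = []
    for div in soup.find_all("div", class_="thumbnails"):
        img = div.find("img")
        # Skip thumbnails with no <img> or an <img> without a src attribute
        if img is not None and img.get("src"):
            urls.append(img["src"])
    return urls


soup = BeautifulSoup(html, "html.parser")
thumbnail_urls = extract_thumbnail_urls(soup)
print(thumbnail_urls)
```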


In [72]:
# this section grabs all of the image URLs for each of the conditions
# Note -- really slow to run

base = 'http://www.dermnet.com'

tagged_images = []

for condition in links_and_pages:
    condition_image_list = []
    for page_num in range(1, max(2,condition[1]+1)):
        new_links = getPageImageLinks(base + condition[0], page_num)
        condition_image_list.extend(new_links)
    tagged_images.append((condition[0], condition_image_list))
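
The `range(1, max(2, condition[1]+1))` expression is what makes categories with 0 recorded pages still fetch page 1. A small sketch of that logic on its own, following the `/photos/<n>` URL structure noted earlier; `page_numbers` and `page_urls` are hypothetical helper names.

```python
def page_numbers(max_pages):
    """Page numbers to request for a category; page 1 is always included."""
    # max(2, ...) keeps the range non-empty when max_pages is 0 or 1
    return list(range(1, max(2, max_pages + 1)))


def page_urls(base, category_path, max_pages):
    """Build the full URL for every page of a category."""
    return [f"{base}{category_path}/photos/{n}" for n in page_numbers(max_pages)]


single_page = page_urls("http://www.dermnet.com", "/images/Acid-Burn", 0)
print(page_numbers(0), page_numbers(3))
print(single_page)
```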
    

In [73]:
for tag, urls in tagged_images:
    print("{} has {} images so far".format(tag, len(urls)))

/images/Seborrheic-Keratoses-Ruff has 504 images so far
/images/Herpes-Zoster has 432 images so far
/images/Atopic-Dermatitis-Adult-Phase has 324 images so far
/images/Psoriasis-Chronic-Plaque has 312 images so far
/images/Eczema-Hand has 300 images so far
/images/Seborrheic-Dermatitis has 276 images so far
/images/Seborrheic-Dermatitis has 276 images so far
/images/Keratoacanthoma has 263 images so far
/images/Lichen-Planus has 264 images so far
/images/Lichen-Planus has 264 images so far
/images/Epidermal-Cyst has 240 images so far
/images/Eczema-Nummular has 228 images so far
/images/Tinea-Ringworm-Body has 216 images so far
/images/Tinea-Ringworm-Versicolor has 216 images so far
/images/Lichen-Simplex-Chronicus has 204 images so far
/images/Candidiasis-large-Skin-Folds has 192 images so far
/images/Psoriasis-Palms-Soles has 192 images so far
/images/Scabies has 192 images so far
/images/Granuloma-Annulare has 180 images so far
/images/Malignant-Melanoma has 180 images so far
/image

In [89]:
# Little cleanup -- could refactor code later if needed

clean_tags = []

for tag, urls in tagged_images:
    # drop the leading '/images/' (8 characters) to leave just the condition name
    clean_tag = tag[8:]
    # point each thumbnail URL at the full-size image directory instead
    clean_urls = [url.replace("allJPGThumb3", "allJPG3") for url in urls]
    clean_tags.append((clean_tag, clean_urls))
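
The two string transformations in the cleanup cell can be pulled into small functions and checked on one value each. This is a sketch under the same assumption the cell makes, namely that swapping `allJPGThumb3` for `allJPG3` yields the full-size image URL; `strip_tag_prefix` and `to_full_size` are hypothetical names.

```python
def strip_tag_prefix(tag, prefix="/images/"):
    """Turn '/images/Eczema-Hand' into 'Eczema-Hand'; leave other strings untouched."""
    return tag[len(prefix):] if tag.startswith(prefix) else tag


def to_full_size(url):
    """Rewrite a thumbnail URL to the directory assumed to hold full-size images."""
    return url.replace("allJPGThumb3", "allJPG3")


name = strip_tag_prefix("/images/Eczema-Hand")
full = to_full_size("http://www.dermnet.com/dn2/allJPGThumb3/seborrheic-keratoses-ruff-1.jpg")
print(name)
print(full)
```

Checking the prefix before slicing guards against tags that do not start with `/images/`, which a bare `tag[8:]` would silently truncate.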


In [91]:
# Saving work
import pickle

with open("tagged_image_urls.pickle", "wb") as pickle_out:
    pickle.dump(clean_tags, pickle_out)
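
Reading the saved list back in a later session might look like the sketch below. It round-trips a tiny made-up sample through a temporary directory so it can run anywhere; `save_tagged_urls`, `load_tagged_urls`, and the sample entry are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def save_tagged_urls(tagged, path):
    """Write the (tag, urls) list to disk."""
    with open(path, "wb") as f:
        pickle.dump(tagged, f)


def load_tagged_urls(path):
    """Read the (tag, urls) list back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up sample in the same (tag, [urls]) shape as clean_tags
sample = [("Eczema-Hand", ["http://www.dermnet.com/dn2/allJPG3/example.jpg"])]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tagged_image_urls.pickle"
    save_tagged_urls(sample, path)
    restored = load_tagged_urls(path)

print(restored == sample)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted data.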