# This Notebook

In this notebook we will explore scraping and Fish Species Database creation. From the intro notebook, we are interested in storing information per species on:

1. Scientific Name, Genus Species ex Holacanthus tricolor (str)
2. Fish Family ex Pomacanthidae
3. Common Name ex. Rock Beauty (str or list of strs)
4. Common Category ex Angelfishes (str)
5. range Range ex 12-20 (cm) (tuple of int)
6. Depth Occurence ex 3-25 (m) (tuple of int)
7. Known Distribution (here the distribution tag from the site could be combined after scraping with the location of the exemplar images and cleaned of vague terms)
8. Distinctive Features and Behaviors ex Black Lipstick

We've seen that some of the sites have geotags / coordinates and we could look into creating a map in the future, but for now we will try and create a string list of locations.

## Technologies

To start out with, we will try beautifulsoup4, requests, and eventually the google images API

In [1]:
import requests
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

# Scraping Sections by Website

## [reefguide.org](https://reefguide.org/carib/cat_sci.html)

So from this site I would like to start from their index page, which conveniently has been demarcated elsewhere to contain only caribbean species. Then, we will need to go through each entry in the index table, enter that page, find and store all the appropriate information as well as download any images. Let's get familiar with BS4 and see what the soup obj and parser looks like.

In [3]:
URL = 'https://reefguide.org/carib/cat_sci.html'
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

In [4]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Caribbean Reefs - Index of Species by Scientific Names
  </title>
  <script src="jquery/js/jquery-1.7.1.min.js" type="text/javascript">
  </script>
  <script src="jquery/js/jquery-ui-1.8.16.custom.min.js" type="text/javascript">
  </script>
  <script src="js/mainindex.js" type="text/javascript">
  </script>
  <link href="jquery/css/ui-lightness/jquery-ui-1.8.16.custom.css" rel="Stylesheet" type="text/css"/>
  <link href="css/all.css" rel="stylesheet" type="text/css"/>
  <link href="css/catalog.css" rel="stylesheet" type="text/css"/>
  <!-- Global site tag (gtag.js) - Google Analytics -->
  <script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-3281928-1">
  </script>
  <script>
   window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-3281928-1');
  </script>
 </head>
 <body>
  

In [5]:
print(soup.get_text())





Caribbean Reefs - Index of Species by Scientific Names














Florent's Guide To The Florida, Bahamas &
Caribbean Reefs


Home


Area


Caribbean


 Pacific


 South Florida


 Hawaii


 Worldwide




Index


 By Common Names


 By Categories


 By Scientific Names




Updates


About


RSS


Search






By Common Name | By Category | By Scientific Names
        




AAblennes hiansAbudefduf saxatilis          taurusAcanthemblemaria aspera          maria          spinosaAcanthopleura granulataAcanthostracion polygonius          quadricornisAcanthurus bahianus          chirurgus          coeruleusAcetabularia caliculus          crenulataAchelous ordwayi          sebae          spinimanusAcropora cervicornis          palmataActinoporus elongatusAcyrtus artiusAetobatus narinariAgaricia agaricites f. danai          agaricities f.          agaricities f.          fragilis          grahamae          humilis          lamarcki          tenuifoliaAgelas cerebrum          citrina    

In [6]:
soup.find_all('td')[0].prettify()

'<td>\n <div class="bigalpha">\n  A\n </div>\n <a class="tocnamesci" href="flatneedle.html">\n  Ablennes hians\n </a>\n <br/>\n <a class="tocnamesci" href="sergeantmajor.html">\n  Abudefduf saxatilis\n </a>\n <br/>\n <a class="tocnamesci" href="nightsergeant.html">\n  taurus\n </a>\n <br/>\n <a class="tocnamesci" href="roughheadblenny.html">\n  Acanthemblemaria aspera\n </a>\n <br/>\n <a class="tocnamesci" href="secretaryblenny.html">\n  maria\n </a>\n <br/>\n <a class="tocnamesci" href="spinyheadblenny.html">\n  spinosa\n </a>\n <br/>\n <a class="tocnamesci" href="chiton.html">\n  Acanthopleura granulata\n </a>\n <br/>\n <a class="tocnamesci" href="honeycomb.html">\n  Acanthostracion polygonius\n </a>\n <br/>\n <a class="tocnamesci" href="scrawledcowfish.html">\n  quadricornis\n </a>\n <br/>\n <a class="tocnamesci" href="surgeonfish.html">\n  Acanthurus bahianus\n </a>\n <br/>\n <a class="tocnamesci" href="doctorfish.html">\n  chirurgus\n </a>\n <br/>\n <a class="tocnamesci" href="blu

In [7]:
for result in soup.find_all('a'):
    try:
        print(result.get('href'))
    except:
        pass

../home.html
None
../carib/index1.html
../indopac/index1.html
../keys/index1.html
../hawaii/index1.html
../index1.html
None
cat.html
cat_grp.html
cat_sci.html
../latest.html
../about.html
http://reefguide.org/reefguide.xml
../search.html
cat.html
cat_grp.html
flatneedle.html
sergeantmajor.html
nightsergeant.html
roughheadblenny.html
secretaryblenny.html
spinyheadblenny.html
chiton.html
honeycomb.html
scrawledcowfish.html
surgeonfish.html
doctorfish.html
bluetang.html
greenmermaidswineglass.html
whitemermaidswineglass.html
redhairswimmingcrab.html
ocellateswimmingcrab.html
blotchedswimmingcrab.html
staghorn.html
elkhorncoral.html
elegantanemone.html
acyrtusartius.html
eagleray.html
scaledlettucecoral.html
lettucecoral.html
purplelettucecoral.html
fragilesaucercoral.html
dimpledsheetcoral.html
lowreliefletttuce.html
whitestarsheet.html
thinleaflettuce.html
agelascerebrum.html
agelascitrina.html
elephantear.html
agelasconifera.html
agelastubulata.html
brownclusteredtube.html
branchingtube

Okay so it seems that the href contained in the table entry needs to be appended to https://reefguide.org/carib/ and then we can follow it to the species page. So we will create a cleaned list of all the species from the index page to use for accessing all the pages.

In [8]:
table_contents = soup.find_all('td')
ref_lst = []
for result in table_contents:
    try:
        species = result.find_all('a')
        for specie in species:
            # print(specie.get('href'))
            ref_lst.append(specie.get('href'))
    except:
        pass

In [9]:
ref_lst

['flatneedle.html',
 'sergeantmajor.html',
 'nightsergeant.html',
 'roughheadblenny.html',
 'secretaryblenny.html',
 'spinyheadblenny.html',
 'chiton.html',
 'honeycomb.html',
 'scrawledcowfish.html',
 'surgeonfish.html',
 'doctorfish.html',
 'bluetang.html',
 'greenmermaidswineglass.html',
 'whitemermaidswineglass.html',
 'redhairswimmingcrab.html',
 'ocellateswimmingcrab.html',
 'blotchedswimmingcrab.html',
 'staghorn.html',
 'elkhorncoral.html',
 'elegantanemone.html',
 'acyrtusartius.html',
 'eagleray.html',
 'scaledlettucecoral.html',
 'lettucecoral.html',
 'purplelettucecoral.html',
 'fragilesaucercoral.html',
 'dimpledsheetcoral.html',
 'lowreliefletttuce.html',
 'whitestarsheet.html',
 'thinleaflettuce.html',
 'agelascerebrum.html',
 'agelascitrina.html',
 'elephantear.html',
 'agelasconifera.html',
 'agelastubulata.html',
 'brownclusteredtube.html',
 'branchingtube.html',
 'bonefish.html',
 'aligergallus.html',
 'queenconch.html',
 'redsnappingshrimp.html',
 'alpheuspolystictu

In [10]:
len(ref_lst)

685

Hmmmm so that was effective however I have realized that this list contains coral, sponges, invertebrates as well. let's look at a species page and see if there is anything we can use to ID fish entries.

In [11]:
needle_URL = 'https://reefguide.org/carib/flatneedle.html'
needle_page = requests.get(needle_URL)
needle_soup = BeautifulSoup(needle_page.content, "html.parser") # lol

In [12]:
print(needle_soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Flat Needlefish - Ablennes hians - Needlefishes -  - Caribbean Reefs
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Florent's Guide To The Caribbean Reefs - Flat Needlefish - Ablennes hians - Needlefishes -  - Needlefishes - Florida, Caribbean, Bermuda, Bahamas, Brazil, Gulf of Mexico, Indo-Pacific, Australia - " name="description"/>
  <meta content="Florent's Guide To The Caribbean Reefs - Flat Needlefish - Ablennes hians - Needlefishes -  - Needlefishes - Florida, Caribbean, Bermuda, Bahamas, Brazil, Gulf of Mexico, Indo-Pacific, Australia - " name="keywords"/>
  <meta content="Florent Charpin" name="author"/>
  <script src="jquery/js/jquery-1.7.1.min.js" type="text/javascript">
  </script>
  <script src="jquery/js/jquery-ui-1.8.16.custom.min.js" type="text/javascript">
  </script>
  <script src="js/mainindex.js" type="text/javascript">
  </script>
  <link href="jquery/css/ui-lightness/jqu

In [13]:
print(' '.join(needle_soup.get_text().split()))

Flat Needlefish - Ablennes hians - Needlefishes - - Caribbean Reefs Florent's Guide To The Florida, Bahamas & Caribbean Reefs Home Area Caribbean Pacific South Florida Hawaii Eastern Pacific French Polynesia Worldwide Index By Common Names By Categories By Scientific Names Updates About RSS Search Species Flat Needlefish St Kitts St Kitts Back to Needlefishes page Flat Needlefish Scientific Name: Ablennes hians Family: Belonidae Category: Needlefishes Size: 1 to 3 feet (30 to 90 cm) Depth: 0-30 ft. (0-10 m) Distribution: Florida, Caribbean, Bermuda, Bahamas, Brazil, Gulf of Mexico, Indo-Pacific, Australia Silvery fishes Needlefishes Flat NeedlefishRedfin Needlefish All Photographs© 2023 Florent Charpin


In [14]:
needle_soup.find_all('div',{'class':'infodetails'})

[<div class="infodetails"><span class="details">Scientific Name: </span><span class="sntitle">Ablennes hians</span></div>,
 <div class="infodetails"> </div>,
 <div class="infodetails"></div>,
 <div class="infodetails"><span class="details">Family: </span><span class="sntitle">Belonidae</span></div>,
 <div class="infodetails"></div>,
 <div class="infodetails"><span class="details">Category: </span><span class="details2">Needlefishes</span></div>,
 <div class="infodetails"></div>,
 <div class="infodetails"><span class="details">Size: </span><span class="details2">1 to 3 feet (30 to 90 cm)</span>  </div>,
 <div class="infodetails"><span class="details">Depth: </span><span class="details2">0-30 ft. (0-10 m)</span></div>,
 <div class="infodetails"><span class="details">Distribution: </span><span class="details2">Florida, Caribbean, Bermuda, Bahamas, Brazil, Gulf of Mexico, Indo-Pacific, Australia</span></div>]

In [15]:
[result.get_text() for result in needle_soup.find_all('div',{'class':'infodetails'})]

['Scientific Name: Ablennes hians',
 ' ',
 '',
 'Family: Belonidae',
 '',
 'Category: Needlefishes',
 '',
 'Size: 1 to 3 feet (30 to 90 cm)\xa0\xa0',
 'Depth: 0-30 ft. (0-10 m)',
 'Distribution: Florida, Caribbean, Bermuda, Bahamas, Brazil, Gulf of Mexico, Indo-Pacific, Australia']

In [16]:
details = [result.get_text() for result in needle_soup.find_all('div',{'class':'infodetails'})]
print(f"Scientific Name: {details[0].rsplit(': ')[-1]}")
print(f"Family Name: {details[3].rsplit(': ')[-1]}")
print(f"Category Name: {details[5].rsplit(': ')[-1]}")
range_idx_start = details[7].find('(') + 1
range_idx_end = details[7].find(')') - 3
print(f"range Range: {details[7][range_idx_start:range_idx_end].split()[0:3:2]}")
depth_idx_start = details[8].find('(') + 1
depth_idx_end = details[8].find(')') - 2
print(f"Depth Range: {details[8][depth_idx_start:depth_idx_end].split('-')}")
print(f"Distribution: {details[9].rsplit(': ')[-1].split(', ')}")

Scientific Name: Ablennes hians
Family Name: Belonidae
Category Name: Needlefishes
Size Range: ['30', '90']
Depth Range: ['0', '10']
Distribution: ['Florida', 'Caribbean', 'Bermuda', 'Bahamas', 'Brazil', 'Gulf of Mexico', 'Indo-Pacific', 'Australia']


In [17]:
[details[7].find('('),details[7].find(')')]

[18, 30]

In [18]:
needle_soup.find_all('a',{'class':'pixsel'})

[<a class="pixsel" href="pixhtml/flatneedle1.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle1.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>,
 <a class="pixsel" href="pixhtml/flatneedle2.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle2.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>]

In [19]:
[result for result in needle_soup.find_all('a',{'class':'pixsel'})]

[<a class="pixsel" href="pixhtml/flatneedle1.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle1.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>,
 <a class="pixsel" href="pixhtml/flatneedle2.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle2.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>]

In [20]:
needle_soup.find_all('img')

[<img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle1.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/>,
 <img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle2.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/>,
 <img alt="" src="../pix/thumb3/flatneedle1.jpg" title=""/>,
 <img alt="" src="../pix/thumb3/redfinneedlefish1.jpg" title=""/>]

In [21]:
[result.get('src') for result in needle_soup.find_all('img',{'class':'selframe'})]

['../pix/thumb2/flatneedle1.jpg', '../pix/thumb2/flatneedle2.jpg']

In [22]:
img_link = 'https://reefguide.org/pix/flatneedle1.jpg'
img_data = requests.get(img_link).content
with open('needlefish_test.jpg', 'wb') as handler:
    handler.write(img_data)

In [23]:
[result.get_text() for result in needle_soup.find_all('div',{'class':'main3'})]

['\xa0', '\xa0']

In [24]:
needle_soup.find_all('div',{'class':'galleryspan'})

[<div class="galleryspan">
 <a class="pixsel" href="pixhtml/flatneedle1.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle1.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>
 <div class="main2">St Kitts</div>
 <div class="main3"> </div>
 </div>,
 <div class="galleryspan">
 <a class="pixsel" href="pixhtml/flatneedle2.html"><img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle2.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/></a>
 <div class="main2">St Kitts</div>
 <div class="main3"> </div>
 </div>]

In [25]:
image_groups = needle_soup.find_all('div',{'class':'galleryspan'})
first_image = image_groups[0]
print(first_image.find('img'))
print(first_image.find('main3'))
print(first_image.find('img').get('src').rsplit('/',1)[-1])

<img alt="Flat Needlefish - Ablennes hians - St Kitts" class="selframe" src="../pix/thumb2/flatneedle1.jpg" title="Flat Needlefish - Ablennes hians - St Kitts"/>
None
flatneedle1.jpg


In [26]:
first_image.find('main3') == None

True

In [27]:
needle_soup.find_all('div',{'class':'typetitle'})[0].get_text()

'Flat Needlefish'

Okay.. so I think from this we have figured out how to get all the information we are currently interested in, plus a way to download and save the images. Lets put together a function that will do everything, taking in only the reference we got from the list of species above and a flag for whether or not we should download the images.

In [165]:
def reefguide_scrape(species_suffix,download_flag):
    ### This function takes in the end of a URL and uses it to access the appropriate webpage on reefguide.org
    ### For example, being passed 'flatneedle.html' will access the URL 'https://reefguide.org/carib/flatneedle.html'
    ### It will then collect the desired species information from the web page as well as download all the images
    ### If the download flag is boolean True using the function reefguide_images and store them in the appropriate folder
    ### on my HDD.

    # Get page HTML

    reefguide_prefix = 'https://reefguide.org/carib/'
    species_URL = reefguide_prefix + species_suffix
    species_page = requests.get(species_URL)
    species_soup = BeautifulSoup(species_page.content, "html.parser")
    
    # Get Common Name

    common_name = species_soup.find_all('div',{'class':'typetitle'})[0].get_text()

    # Get info from details box

    details = [result.get_text() for result in species_soup.find_all('div',{'class':'infodetails'})]

    # Extract Scientific Name, Scientific Family, and Common Family Name

    scientific_name = details[0].rsplit(': ')[-1]
    scientific_family = details[3].rsplit(': ')[-1]
    common_family = details[5].rsplit(': ')[-1]

    # Extract Size and Depth Ranges
    # There is a helper function defined below

    size_range_cm = range_handler(details[7].rsplit(': ')[-1])

    # couple cases where depth doesn't exist

    try:
        depth_range_cm = range_handler(details[8].rsplit(': ')[-1])
        depth_range_m = (depth_range_cm[0] / 100, depth_range_cm[1] / 100) 
    except:
        depth_range_m = (0,0)
        
    # Extract Distribution str list

    geodist = details[9].rsplit(': ')[-1].split(', ')

    if download_flag:
        reefguide_download(species_soup, scientific_family, scientific_name)
    else:
        pass

    return [ scientific_name, scientific_family, common_name, common_family, size_range_cm, depth_range_m, geodist ]

def reefguide_download(species_soup,scientific_family,scientific_name):
    ### This function downloads all images from the species page and stores them in a species labelled folder on my HDD
    ### The folder is named for the scientific name as well as fits into the scientific family heirarchy
    ### The function will separate images into different folders based on labels such as juvenile, initial

    reefguide_pix_prefix = 'https://reefguide.org/pix/'

    image_groups = species_soup.find_all('div',{'class':'galleryspan'})
    for image in image_groups:

        # Get Image URL and image data

        image_suffix = image.find('img').get('src').rsplit('/',1)[-1]
        img_link = reefguide_pix_prefix + image_suffix
        img_data = requests.get(img_link).content

        # Determine if image has phase label such as juvenile, initial, terminal. If not, leave blank
        try:
            phase_tag = image.find_all('div', {'class': 'main3'})[0].get_text()
        except:
            phase_tag = None

        if phase_tag == None:
            phase_tag = ''
        else:
            phase_tag = phase_tag.replace(' ','_')

        # Construct save path

        path_prefix = 'E:/LargeDatasets/SpeciesID-Images/'
        folder_path = path_prefix + scientific_family + '/' + scientific_name.replace(' ','_') + '_' + phase_tag
        img_path = folder_path + '/' + image_suffix

        # check if path exists and if not make it so

        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        # check if file already exists and if so skip

        if os.path.exists(img_path):
            continue
        
        with open(img_path, 'wb') as handler:
            handler.write(img_data)

    return

In [106]:
reefguide_scrape('rainbowparrot.html',1)

['Scarus guacamaia',
 'Scaridae',
 'Rainbow Parrotfish',
 'Parrotfishes',
 (45.0, 150.0),
 (3.0, 25.0),
 ['Caribbean', 'Bahamas', 'Florida', 'Bermuda']]

In [30]:
columns = [ 'scientific_name', 'scientific_family', 'common_name', 'common_family', 'range_cm', 'depth_range_m', 'geodist' ]
FishDF = pd.DataFrame(columns = columns)
FishDF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist


In [161]:
FishDF.loc[len(FishDF)] = reefguide_scrape('flatneedle.html',0)
FishDF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
0,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."
1,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
2,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."
3,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
4,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."


In [162]:
FishDF.loc[len(FishDF)] = reefguide_scrape('nurseshark.html',0)
FishDF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
0,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."
1,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
2,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."
3,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
4,Ablennes hians,Belonidae,Flat Needlefish,Needlefishes,"(30.0, 90.0)","(0.0, 10.0)","[Florida, Caribbean, Bermuda, Bahamas, Brazil,..."
5,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."


Yoohooo!!!!! It works, and downloading the images works too. So.. now I have to decide if we want to use all the refs from the html list above, or if I want to manually go through and clean the list for only fish. I think that is what I will do so, lets bring back the list.

In [33]:
ref_lst

['flatneedle.html',
 'sergeantmajor.html',
 'nightsergeant.html',
 'roughheadblenny.html',
 'secretaryblenny.html',
 'spinyheadblenny.html',
 'chiton.html',
 'honeycomb.html',
 'scrawledcowfish.html',
 'surgeonfish.html',
 'doctorfish.html',
 'bluetang.html',
 'greenmermaidswineglass.html',
 'whitemermaidswineglass.html',
 'redhairswimmingcrab.html',
 'ocellateswimmingcrab.html',
 'blotchedswimmingcrab.html',
 'staghorn.html',
 'elkhorncoral.html',
 'elegantanemone.html',
 'acyrtusartius.html',
 'eagleray.html',
 'scaledlettucecoral.html',
 'lettucecoral.html',
 'purplelettucecoral.html',
 'fragilesaucercoral.html',
 'dimpledsheetcoral.html',
 'lowreliefletttuce.html',
 'whitestarsheet.html',
 'thinleaflettuce.html',
 'agelascerebrum.html',
 'agelascitrina.html',
 'elephantear.html',
 'agelasconifera.html',
 'agelastubulata.html',
 'brownclusteredtube.html',
 'branchingtube.html',
 'bonefish.html',
 'aligergallus.html',
 'queenconch.html',
 'redsnappingshrimp.html',
 'alpheuspolystictu

In [34]:
reefguide_hrefs_fish = ['flatneedle.html',
 'sergeantmajor.html',
 'nightsergeant.html',
 'roughheadblenny.html',
 'secretaryblenny.html',
 'spinyheadblenny.html',
 'honeycomb.html',
 'scrawledcowfish.html',
 'surgeonfish.html',
 'doctorfish.html',
 'bluetang.html',
 'eagleray.html',
 'bonefish.html',
 'scrawledfile.html',
 'hawkfish.html',
 'pipehorse.html',
 'blackmargate.html',
 'porkfish.html',
 'whitestarcardinalfish.html',
 'apogonbinotatus.html',
 'whitestarcardinalfish.html',
 'flamefish.html',
 'palecardinalfish.html',
 'apogonpseudomaculatus.html',
 'apogonrobinsi.html',
 'beltedcardinalfish.html',
 'sheepshead.html',
 'blackfincardinalfish.html',
 'trumpet.html',
 'queentrigger.html',
 'largeeyetoadfish.html',
 'spotfinhogfish.html',
 'spanishhogfish.html',
 'peacock.html',
 'eyedflounder.html',
 'smallmouthgrunt.html',
 'joltheadporgy.html',
 'saucereyeporgy.html',
 'plumaporgy.html',
 'lancerdragonet.html',
 'whitespottedfile.html',
 'orangespottedfile.html',
 'oceantrigger.html',
 'sharpnosepuffer.html',
 'yellowjack.html',
 'barjack.html',
 'bluerunner.html',
 'horseeyejack.html',
 'blackjack.html',
 'bullshark.html',
 'reefshark.html',
 'loggerheadturtle.html',
 'cherubfish.html',
 'graysby.html',
 'coney.html',
 'chaenopsislimbaughi.html',
 'spadefish.html',
 'foureyebutterflyfish.html',
 'spotfinbutter.html',
 'reefbutterflyfish.html',
 'bandedbutterfly.html',
 'greenturtle.html',
 'bridledburrfish.html',
 'webburrfish.html',
 'chilomycterusreticulatus.html',
 'bluechromis.html',
 'chromisenchrysura.html',
 'sunshinefish.html',
 'brownchromis.html',
 'creolewrasse.html',
 'colongoby.html',
 'pallidgoby.html',
 'bridledgoby.html',
 'peppermintgoby.html',
 'glassgoby.html',
 'coryphopterustortugae.html',
 'coryphopterusvenezuelae.html',
 'dactylopterusvolitans.html',
 'southernray.html',
 'balloon.html',
 'porcupine.html',
 'diplectrumbivittatum.html',
 'diplectrumformosum.html',
 'spottailpinfish.html',
 'sharksucker.html',
 'echeneisneucratoides.html',
 'chainmoray.html',
 'brownencrustingoctopus.html',
 'caymancleaninggoby.html',
 'orangesidedgoby.html',
 'sharknosegoby.html',
 'cleaninggoby.html',
 'yellowlinegoby.html',
 'barsnoutgoby.html',
 'elacatinuslori.html',
 'spotlightgoby.html',
 'yellownosegoby.html',
 'neongoby.html',
 'rainbowrunner.html',
 'sailfinblenny.html',
 'flagfinblenny.html',
 'chestnutmoray.html',
 'mulattocongereel.html',
 'blackedgetriplefin.html',
 'eostichopusarnesoni.html',
 'redhind.html',
 'goliathgrouper.html',
 'nassau.html',
 'jackknife.html',
 'spotteddrum.html',
 'hawksbill.html',
 'fistulariatabacaria.html',
 'yellowfinmojarra.html',
 'nurseshark.html',
 'quillfin.html',
 'fairybasslet.html',
 'blackcapbasslet.html',
 'greenmoray.html',
 'goldentailmoray.html',
 'spottedmoray.html',
 'purplemouthmoray.html',
 'whitemargate.html',
 'tomtate.html',
 'caesargrunt.html',
 'frenchgrunt.html',
 'spanishgrunt.html',
 'cottonwick.html',
 'sailorchoice.html',
 'whitegrunt.html',
 'bluestripedgrunt.html',
 'slipperydick.html',
 'yellowcheekwrasse.html',
 'yellowheadwrasse.html',
 'clownwrasse.html',
 'rainbowwrasse.html',
 'blackearwrasse.html',
 'puddingwife.html',
 'ballyhoo.html',
 'browngardeneel.html',
 'glasseye.html',
 'caribbeanwhiptailstingray.html',
 'hippocampuserectus.html',
 'longsnoutseahorse.html',
 'blueangel.html',
 'queenangel.html',
 'rockbeauty.html',
 'squirrel.html',
 'longspinesquirrel.html',
 'floridaseacucumber.html',
 'donkeycucumber.html',
 'tigertailcucumber.html',
 'yellowbellyhamlet.html',
 'yellowtailhamlet.html',
 'bluehamlet.html',
 'shyhamlet.html',
 'indigohamlet.html',
 'blackhamlet.html',
 'barredhamlet.html',
 'tanhamlet.html',
 'hamlethyb.html',
 'butterhamlet.html',
 'boga.html',
 'threerowedcucumber.html',
 'kyphosuscinerascens.html',
 'chub.html',
 'kyphosusvaigiensis.html',
 'labrisomusconditus.html',
 'labrisomuscricota.html',
 'hairyblenny.html',
 'hogfish.html',
 'spottedtrunk.html',
 'buffalotrunkfish.html',
 'smoothtrunk.html',
 'candybasslet.html',
 'peppermintbasslet.html',
 'arrowblenny.html',
 'muttonsnapper.html',
 'schoolmaster.html',
 'blackfinsnapper.html',
 'cuberasnapper.html',
 'graysnapper.html',
 'dogsnapper.html',
 'mahogany.html',
 'lanesnapper.html',
 'sandtile.html',
 'goldlineblenny.html',
 'diamondblenny.html',
 'saddledblenny.html',
 'giantmanta.html',
 'tarpon.html',
 'blackdurgon.html',
 'micrognathuscrinitus.html',
 'yellowtaildamsel.html',
 'fringedfilefish.html',
 'slenderfile.html',
 'yellowgoat.html',
 'blackgrouper.html',
 'yellowmouthgrouper.html',
 'tiger.html',
 'yellowfingrouper.html',
 'sharptaileel.html',
 'goldspottedeel.html',
 'blackbarsoldier.html',
 'lesserray.html',
 'lemonshark.html',
 'longjawsquirrel.html',
 'neslongus.html',
 'reefoctopus.html',
 'octopusmacropus.html',
 'commonoctopus.html',
 'yellowtailsnapper.html',
 'ogcocephalusnasutus.html',
 'ophichthusophis.html',
 'redlipblenny.html',
 'jawfish.html',
 'bandedjawfish.html',
 'duskyjawfish.html',
 'parablenniusmarmoreus.html',
 'creolefish.html',
 'highhat.html',
 'parequesumbrosus.html',
 'glassysweeper.html',
 'grayangel.html',
 'frenchangel.html',
 'priacanthusarenatus.html',
 'rustygoby.html',
 'longsnoutbutter.html',
 'hiddenseacucumber.html',
 'spottedgoat.html',
 'commonlionfish.html',
 'whaleshark.html',
 'tuskedgoby.html',
 'freckledsoapfish.html',
 'greatersoapfish.html',
 'rypticussubbifrenatus.html',
 'sanopusastrifer.html',
 'splendidtoad.html',
 'reefsquirrel.html',
 'duskysquirrel.html',
 'midnightparrot.html',
 'blueparrot.html',
 'rainbowparrot.html',
 'stripedparrotfish.html',
 'princessparrot.html',
 'queenparrot.html',
 'cero.html',
 'scorpaenaalbifimbria.html',
 'scorpaenagrandicornis.html',
 'mushroomscorpionfish.html',
 'scorpion.html',
 'reefscorpionfish.html',
 'almacojack.html',
 'lanternbass.html',
 'serranussubligarius.html',
 'tobacco.html',
 'harlequinbass.html',
 'chalkbass.html',
 'greenblotchparrotfish.html',
 'redbandparrot.html',
 'redtailparrot.html',
 'yellowtailparrot.html',
 'stoplightparrotfish.html',
 'bandtailpuffer.html',
 'checkeredpuffer.html',
 'barracuda.html',
 'duskydamsel.html',
 'longfindamselfish.html',
 'beaugregory.html',
 'bicolordamsel.html',
 'threespotdamsel.html',
 'cocoadamselfish.html',
 'spinnerdolphin.html',
 'redfinneedlefish.html',
 'channelflounder.html',
 'symphurusdiomedeanus.html',
 'sanddiver.html',
 'bluestripedlizardfish.html',
 'bluehead.html',
 'tigrigobiusharveyi.html',
 'greenbandedgoby.html',
 'permit.html',
 'palometa.html',
 'westindianmanatee.html',
 'dolphin.html',
 'mottledmojarra.html',
 'yellowray.html',
 'sargassumtrigger.html',
 'rosyrazorfish.html',
 'pearlyrazorfish.html',
 'greenrazor.html',
]

In [35]:
len(reefguide_hrefs_fish)

290

In [36]:
reefguide_hrefs_NotFish = [ species for species in ref_lst if species not in reefguide_hrefs_fish]

In [37]:
print(len(ref_lst))
print(len(reefguide_hrefs_NotFish))

685
396


In [38]:
396 + 290

686

In [40]:
len(set(reefguide_hrefs_fish)) == len(reefguide_hrefs_fish)

False

In [41]:
reefguide_hrefs_fish = list(set(reefguide_hrefs_fish))

In [45]:
fish_refs_DF = pd.DataFrame(data = reefguide_hrefs_fish)
notFish_refs_DF = pd.DataFrame(data = reefguide_hrefs_NotFish)
fish_refs_DF

Unnamed: 0,0
0,slipperydick.html
1,freckledsoapfish.html
2,giantmanta.html
3,apogonbinotatus.html
4,mahogany.html
...,...
284,redtailparrot.html
285,cottonwick.html
286,palecardinalfish.html
287,caesargrunt.html


In [46]:
fish_refs_DF.to_pickle('./files/reefguide_refs_fish.pkl')
notFish_refs_DF.to_pickle('./files/reefguide_refs_notFish.pkl')


Okay! Now is the moment of truth.. lets scrape info and images for all the refs we have for fish species.

In [52]:
columns = [ 'scientific_name', 'scientific_family', 'common_name', 'common_family', 'size_range_cm', 'depth_range_m', 'geodist' ]
reefguide_fish_DF = pd.DataFrame(columns = columns)

for species in reefguide_hrefs_fish:

    species_info = reefguide_scrape(species,1)

    reefguide_fish_DF.loc[len(reefguide_fish_DF)] = species_info

ValueError: could not convert string to float: 'to'

In [53]:
reefguide_fish_DF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]"
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical]
3,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(10.0,)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]"
4,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]"
5,Lutjanus synagris,Lutjanidae,Lane Snapper,Snappers,"(20.0, 30.0)","(2.0, 40.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico, ..."
6,Synodus saurus,Synodontidae,Bluestriped Lizardfish,Lizardfishes,"(10.0, 18.0)","(0.0, 12.0)","[Caribbean, Bahamas]"
7,Trichechus manatus,Trichechidae,West Indian Manatee,Manatees,"(400.0,)","(0.0, 3.0)","[Florida, Caribbean, Brazil]"
8,Halichoeres maculipinna,Labridae,Clown Wrasse,Wrasses,"(7.0, 12.0)","(3.0, 12.0)","[Florida, Caribbean, Bahamas, Bermuda, Brazil]"
9,Holacanthus ciliaris,Pomacanthidae,Queen Angelfish,Angelfishes,"(20.0, 35.0)","(6.0, 25.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."


In [54]:
reefguide_hrefs_fish[15]

'tarpon.html'

It appears I need to fix a bug with how the sizes are grabbed. After much work and testing with cases below, I have created a generic handler that will interpret strings with ranges in a number of different formats found on the website. It should work for both depth and size. However I should note that it will return the range in cm, so for depth we will need to convert to m

In [145]:
case1 = 'Up to 13 ft. (4 m)'
case2 = 'Up to 4 in. (10 cm)'
case3 = '3 - 10 ft. (1 to 3 m)'
case4 = '0 - 1 ft. (10 to 100 cm)'
case5 = '2 - 5 ft. (60 cm to 1.5 m)'
case6 = '1 - 4 ft.'
case7 = '0.25-1 in.'
case8 = 'Up to 5 in.'

tests = [ case1, case2, case3, case4, case5, case6, case7, case8 ]

for test in tests:

    range_idx_start = test.find('(') + 1
    range_idx_end = test.find(')')

    if range_idx_end == -1:
        inside_parens = test
    else:
        inside_parens = test[range_idx_start:range_idx_end]
    contents_list = inside_parens.split()
    print(contents_list)

['4', 'm']
['10', 'cm']
['1', 'to', '3', 'm']
['10', 'to', '100', 'cm']
['60', 'cm', 'to', '1.5', 'm']
['1', '-', '4', 'ft.']
['0.25-1', 'in.']
['Up', 'to', '5', 'in.']


In [154]:
def range_handler(range_details):
    
    range_idx_start = range_details.find('(') + 1
    range_idx_end = range_details.find(')')

    if range_idx_end == -1:
        inside_parens = range_details
    else:
        inside_parens = range_details[range_idx_start:range_idx_end]
    contents_list = inside_parens.split()

    if len(contents_list[0].split('-')) > 1:
        contents_list = contents_list[0].split('-')+[contents_list[1]]
    elif 'Up' in contents_list:
        contents_list = contents_list[2:]

    if len(contents_list) == 2:
        if 'in.' in contents_list:
            range_in = [0,contents_list[0]]
            range_in = np.array(list(map(float,range_in)))
            range_cm = tuple(np.round(range_in * 2.54))
        elif 'ft.' in contents_list:
            range_ft = [0,contents_list[0]]
            range_ft = np.array(list(map(float,range_ft)))
            range_cm = tuple(np.round(range_ft * 12 * 2.54))
        elif 'm' in contents_list:
            range_m = [0,contents_list[0]]
            range_m = np.array(list(map(float,range_m)))
            range_cm = tuple(range_m * 100)
        else:
            range_cm = tuple(map(float,[0, contents_list[0]]))
    elif len(contents_list) == 3:
        if 'in.' in contents_list:
            range_in = [contents_list[0],contents_list[1]]
            range_in = np.array(list(map(float,range_in)))
            range_cm = tuple(np.round(range_in * 2.54))
        elif 'ft.' in contents_list:
            range_ft = [contents_list[0],contents_list[1]]
            range_ft = np.array(list(map(float,range_ft)))
            range_cm = tuple(np.round(range_ft * 12 * 2.54))
        elif 'm' in contents_list:
            range_m = [contents_list[0],contents_list[1]]
            range_m = np.array(list(map(float,range_m)))
            range_cm = tuple(range_m * 100)
        else:
            range_cm = tuple(map(float,[contents_list[0],contents_list[1]]))
    elif len(contents_list) == 4:
        if 'in.' in contents_list:
            range_in = [contents_list[0],contents_list[2]]
            range_in = np.array(list(map(float,range_in)))
            range_cm = tuple(np.round(range_in * 2.54))
        elif 'ft.' in contents_list:
            range_ft = [contents_list[0],contents_list[2]]
            range_ft = np.array(list(map(float,range_ft)))
            range_cm = tuple(np.round(range_ft * 12 * 2.54))
        elif 'm' in contents_list:
            range_m = [contents_list[0],contents_list[2]]
            range_m = np.array(list(map(float,range_m)))
            range_cm = tuple(range_m * 100)
        else:
            range_cm = tuple(map(float,[contents_list[0],contents_list[2]]))
    else:
        lower = float(contents_list[0])
        higher = float(contents_list[3])
        if contents_list[1] == 'in.':
            range_cm = (lower * 2.54, higher * 12 * 2.54)
        else:
            range_cm = (lower, higher * 100)

    return range_cm

Okay! Should be better.. lets try again.

In [166]:
columns = [ 'scientific_name', 'scientific_family', 'common_name', 'common_family', 'range_cm', 'depth_range_m', 'geodist' ]
reefguide_fish_DF = pd.DataFrame(columns = columns)

for species in reefguide_hrefs_fish[0:20]:

    species_info = reefguide_scrape(species,1)

    reefguide_fish_DF.loc[len(reefguide_fish_DF)] = species_info

In [167]:
reefguide_fish_DF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,range_cm,depth_range_m,geodist
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]"
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical]
3,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]"
4,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]"
5,Lutjanus synagris,Lutjanidae,Lane Snapper,Snappers,"(20.0, 30.0)","(2.0, 40.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico, ..."
6,Synodus saurus,Synodontidae,Bluestriped Lizardfish,Lizardfishes,"(10.0, 18.0)","(0.0, 12.0)","[Caribbean, Bahamas]"
7,Trichechus manatus,Trichechidae,West Indian Manatee,Manatees,"(0.0, 400.0)","(0.0, 3.0)","[Florida, Caribbean, Brazil]"
8,Halichoeres maculipinna,Labridae,Clown Wrasse,Wrasses,"(7.0, 12.0)","(3.0, 12.0)","[Florida, Caribbean, Bahamas, Bermuda, Brazil]"
9,Holacanthus ciliaris,Pomacanthidae,Queen Angelfish,Angelfishes,"(20.0, 35.0)","(6.0, 25.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."


Yay! I think its working. Let's let it rip for the whole ref_list so I can have it done.

In [168]:
columns = [ 'scientific_name', 'scientific_family', 'common_name', 'common_family', 'size_range_cm', 'depth_range_m', 'geodist' ]
reefguide_fish_DF = pd.DataFrame(columns = columns)

for species in reefguide_hrefs_fish:

    species_info = reefguide_scrape(species,1)

    reefguide_fish_DF.loc[len(reefguide_fish_DF)] = species_info

In [169]:
reefguide_fish_DF

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]"
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical]
3,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]"
4,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]"
...,...,...,...,...,...,...,...
284,Sparisoma chrysopterum,Scaridae,Redtail Parrotfish,Parrotfishes,"(35.0, 40.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda]"
285,Haemulon melanurum,Haemulidae,Cottonwick,Grunts,"(17.0, 25.0)","(3.0, 15.0)","[Caribbean, Florida, Bahamas, Bermuda, Brazil,..."
286,Apogon planifrons,Apogonidae,Pale Cardinalfish,Cardinalfishes,"(3.0, 10.0)","(3.0, 30.0)","[Caribbean, Bahamas, South Florida, Brazil]"
287,Haemulon carbonarium,Haemulidae,Caesar Grunt,Grunts,"(18.0, 30.0)","(3.0, 15.0)","[Caribbean, Bahamas, Florida, Bermuda]"


In [170]:
reefguide_fish_DF.to_pickle('./files/fishspecies_reefguide_info.pkl')

In [113]:
reefguide_hrefs_fish[188]

'peppermintgoby.html'

In [171]:
reefguide_fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 289 entries, 0 to 288
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    289 non-null    object
 1   scientific_family  289 non-null    object
 2   common_name        289 non-null    object
 3   common_family      289 non-null    object
 4   size_range_cm      289 non-null    object
 5   depth_range_m      289 non-null    object
 6   geodist            289 non-null    object
dtypes: object(7)
memory usage: 18.1+ KB


In [173]:
cat_columns = [ 'scientific_family', 'common_family']
for column in cat_columns:
    print('====================================================')
    print(reefguide_fish_DF[column].value_counts())

Serranidae        32
Gobiidae          22
Labridae          15
Pomacentridae     13
Haemulidae        12
                  ..
Congridae          1
Ogcocephalidae     1
Raspailiidae       1
Torpedinidae       1
Megalopidae        1
Name: scientific_family, Length: 72, dtype: int64
Gobies                 22
Wrasses                15
Grunts                 12
Parrotfishes           11
Hamlets                10
                       ..
Cornetfishes            1
Sweepers                1
Mackerels and Tunas     1
Eagle Rays              1
Bonnetmouths            1
Name: common_family, Length: 78, dtype: int64


## Smithsonian Institute

I think we will scrape this one next. As it has a large database it should include most species as well as a number of images. So the plan is to build off our reefguide image and dataset, using our scientific_name as the key and to help look up the fish on this and other websites.

In [112]:
reefguide_fish_DF = pd.read_pickle('./files/fishspecies_reefguide_info.pkl')

In [113]:
reefguide_fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 289 entries, 0 to 288
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    289 non-null    object
 1   scientific_family  289 non-null    object
 2   common_name        289 non-null    object
 3   common_family      289 non-null    object
 4   size_range_cm      289 non-null    object
 5   depth_range_m      289 non-null    object
 6   geodist            289 non-null    object
dtypes: object(7)
memory usage: 18.1+ KB


In [114]:
reefguide_fish_DF.head()

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]"
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical]
3,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]"
4,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]"


In [115]:
reefguide_fish_DF.scientific_name[0]

'Halichoeres bivittatus'

In [116]:
reefguide_fish_DF.loc[reefguide_fish_DF.scientific_name == 'Acanthurus bahianus']

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
155,Acanthurus bahianus,Acanthuridae,Ocean Surgeonfish,Surgeonfishes,"(15.0, 30.0)","(5.0, 25.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."


In [117]:
URL = 'https://biogeodb.stri.si.edu/caribbean/en/thefishes/systematic'
page = requests.get(URL)

smithsonian_soup = BeautifulSoup(page.content, "html.parser")

Okay so looking at the web page inspector, from the URL above we can get the href for all the species we currently have in our reefguide dataframe. Let's explore how to get the hrefs from the soup obj.

In [118]:
first_row = smithsonian_soup.find('a',{'style':'padding-left:80px;'})

In [119]:
print(first_row.prettify())
print(first_row.get_text())

<a class="internal" href="spe/5756" id="5756" style="padding-left:80px;">
 Chiloscyllium punctatum
</a>
Chiloscyllium punctatum


In [120]:
all_entries = smithsonian_soup.find_all('a',{'style':'padding-left:80px;'})

In [121]:
all_entries[0:5]

[<a class="internal" href="spe/5756" id="5756" style="padding-left:80px;">Chiloscyllium punctatum</a>,
 <a class="internal" href="spe/24" id="24" style="padding-left:80px;">Ginglymostoma cirratum</a>,
 <a class="internal" href="spe/26" id="26" style="padding-left:80px;">Rhincodon typus</a>,
 <a class="internal" href="spe/2674" id="2674" style="padding-left:80px;">Carcharias taurus</a>,
 <a class="internal" href="spe/31" id="31" style="padding-left:80px;">Odontaspis ferox</a>]

In [122]:
reefguide_fish_DF.loc[reefguide_fish_DF.scientific_name == all_entries[1].get_text()]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist
169,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of..."


In [123]:
all_entries[1].get('href')

'spe/24'

In [124]:
# get tuple of scientific name and species href for all species already in reefguide_df

reefguide_fish_DF['smithsonian_href'] = pd.Series(dtype = 'object')
for row in reefguide_fish_DF.iterrows():
    for entry in all_entries:
        if row[1].scientific_name == entry.get_text():
            row[1].smithsonian_href = entry.get('href')
            continue

In [125]:
reefguide_fish_DF.head()

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/3894
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]",spe/3537
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical],
3,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595
4,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",spe/3690


In [126]:
reefguide_fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 289 entries, 0 to 288
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    289 non-null    object
 1   scientific_family  289 non-null    object
 2   common_name        289 non-null    object
 3   common_family      289 non-null    object
 4   size_range_cm      289 non-null    object
 5   depth_range_m      289 non-null    object
 6   geodist            289 non-null    object
 7   smithsonian_href   251 non-null    object
dtypes: object(8)
memory usage: 28.4+ KB


Hmm.. so already 38 species not found in the database. Let's see which ones

In [127]:
reefguide_fish_DF.loc[reefguide_fish_DF.smithsonian_href.isna() ]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical],
7,Trichechus manatus,Trichechidae,West Indian Manatee,Manatees,"(0.0, 400.0)","(0.0, 3.0)","[Florida, Caribbean, Brazil]",
29,Eretmochelys imbricata,Cheloniidae,Hawksbill Turtle,Turtles,"(30.0, 107.0)","(0.0, 21.0)",[Circumtropical],
41,Micrognathus crinitus,Syngnathidae,Harlequin Pipefish,Pipefishes,"(12.0, 20.0)","(2.0, 18.0)","[Caribbean, Bahamas, South Florida, Brazil]",
46,Isostichopus badionotus,Stichopodidae,Three-Rowed Sea Cucumber,Sea Cucumbers,"(25.0, 40.0)","(0.0, 60.0)","[Caribbean, Bahamas, Florida, Bermuda]",
65,Elacatinus dilepis,Gobiidae,Orangesided Goby,Gobies,"(1.2, 1.8)","(8.0, 30.0)","[Bahamas, Caribbean]",
69,Stegastes variabilis,Pomacentridae,Cocoa Damselfish,Damselfishes,"(8.0, 10.0)","(5.0, 18.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",
72,Caretta caretta,Cheloniidae,Loggerhead Turtle,Turtles,"(30.0, 100.0)","(0.0, 60.0)",[Circumtropical],
78,Carangoides ruber,Carangidae,Bar Jack,Jacks,"(20.0, 35.0)","(0.0, 18.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",
82,Sargocentron vexillarium,Holocentridae,Dusky Squirrelfish,Squirrelfishes,"(8.0, 15.0)","(0.0, 15.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",


Soooo I think we will eliminate a number of these that aren't fish for now, since I didn't really need to include them in the first place. For the ones that are fish, we will look them up and try and see if there are inconsistincies with scientific names or other issues that would lead to us removing the row altogether.

1. Elacatinus dilepis --> According to fishbase.se, this is an outdated and invalid taxonomic identification. Replace with Tigrigobius dilepis
2. Stegastes variabilis	--> This is the correct name on fishbase, though it does have another entry for Stegastes xanthurus and they share the common name Cocoa Damselfish. Smithsonian uses that as well, so we will change it
3. Carangoides ruber --> fishbase says this is outdated, use Caranx ruber instead. Smithsonian agrees
4. Sargocentron vexillarium	--> fishbase and smithsonian say Neoniphon vexillarium
5. Kyphosus sectatrix/bigibbus	--> fishbase says bigibbus is a 'Brown Chub' endemic only to Pacific whereas sectatrix is know as the Bermuda sea chub found circumglobally
6. Coryphopterus personatus/hyalinus --> it seems like reefguide guy took the liberty of assuming that these are actually the same species. split this entry
7. Elactinus lobeli --> reefguide spelling mistake, should be Elacatinus
8. Carangoides bartholomaei	--> same as bar jack, should be Caranx not Carangoides
9. Acanthurus bahianus --> There seems to be argument over whether this is a distinct species from Acanthurus tractus, but since tractus is what smithsonian uses lets go with that
10. Inermia vittata	--> fishbase and smithsonian use Haemulon vittatum
11. Amphelikturus dendriticus --> fishbase thinks this is okay, smithsonian uses Acentronura dendritica which fishbase says is not the currently accepted name
12. Ulaema lefroyi --> fishbase likes, smithsonian uses outdated Eucinostomus lefroyi
13. Emblemariopsis signifer --> fishbase likes, smithsonian calls this the Caribbean Blenny, Emblemariopsis carib, which looks like a completely different fish on fishbase
14. Hypoplectrus sp. --> remove, this is a hybrid species
15. Paranthias furcifer --> fishbase likes, smithsonian uses dated Cephalopholis furcifer
16. Ophioblennius atlanticus --> fishbase likes, smithsonian uses Ophioblennius macclurei. Fishbase agrees this is also a species, and I think macclurei looks more like what I would choose for redlip blenny
17. Sargocentron coruscum --> fishbase likes, smithsonian uses Neoniphon coruscum, which fishbase says doesn't exist. A quick google search has me leaning toward Sargocentron coruscum

So, for the ones reefguide clearly has wrong, we will modify those entries, remembering to check our image structure. For the ones smithsonian has wrong, we know what they use so we can create a little dictionary for replacement just during lookup. The rest we will remove the rows from reefguide_fish_DF for now.

In [128]:
reefguide_fish_DF.scientific_name.loc[65]

'Elacatinus dilepis'

In [129]:
# items that just need renaming

reefguide_fish_DF.scientific_name.loc[65] = 'Tigrigobius dilepis'
reefguide_fish_DF.scientific_name.loc[69] = 'Stegastes xanthurus'
reefguide_fish_DF.scientific_name.loc[78] = 'Caranx ruber'
reefguide_fish_DF.scientific_name.loc[82] = 'Neoniphon vexillarium'
reefguide_fish_DF.scientific_name.loc[115] = 'Kyphosus sectatrix'
reefguide_fish_DF.common_name.loc[115] = 'Bermuda Chub'
reefguide_fish_DF.scientific_name.loc[142] = 'Elacatinus lobeli'
reefguide_fish_DF.scientific_name.loc[153] = 'Caranx bartholomaei'
reefguide_fish_DF.scientific_name.loc[155] = 'Acanthurus tractus'
reefguide_fish_DF.scientific_name.loc[161] = 'Haemulon vittatum'
reefguide_fish_DF.scientific_name.loc[250] = 'Ophioblennius macclurei'

# dictionary for when smithsonian is wrong

smith_interpreter = {'Amphelikturus dendriticus' : 'Acentronura dendritica',
                     'Ulaema lefroyi' : 'Eucinostomus lefroyi',
                     'Emblemariopsis signifer' : 'Emblemariopsis carib',
                     'Paranthias furcifer' : 'Cephalopholis furcifer',
                     'Sargocentron coruscum' : 'Neoniphon coruscum'}

# splitting random combined entry

reefguide_fish_DF.loc[289] = [ 'Coryphopterus hyalinus', reefguide_fish_DF.scientific_family.loc[137], 'Glass Goby', reefguide_fish_DF.common_family.loc[137], 
                               reefguide_fish_DF.size_range_cm.loc[137], reefguide_fish_DF.depth_range_m.loc[137], reefguide_fish_DF.geodist.loc[137], 
                               reefguide_fish_DF.smithsonian_href.loc[137] ]
reefguide_fish_DF.loc[137] = [ 'Coryphopterus personatus', reefguide_fish_DF.scientific_family.loc[137], 'Masked Goby', reefguide_fish_DF.common_family.loc[137], 
                               reefguide_fish_DF.size_range_cm.loc[137], reefguide_fish_DF.depth_range_m.loc[137], reefguide_fish_DF.geodist.loc[137], 
                               reefguide_fish_DF.smithsonian_href.loc[137] ]

# dropping hybrid species
# commenting because ran and index no longer exists
# reefguide_fish_DF.drop(index = 232, inplace = True)

  return asarray(a).ndim


In [130]:
reefguide_fish_DF.loc[reefguide_fish_DF.smithsonian_href.isna() ]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical],
7,Trichechus manatus,Trichechidae,West Indian Manatee,Manatees,"(0.0, 400.0)","(0.0, 3.0)","[Florida, Caribbean, Brazil]",
29,Eretmochelys imbricata,Cheloniidae,Hawksbill Turtle,Turtles,"(30.0, 107.0)","(0.0, 21.0)",[Circumtropical],
41,Micrognathus crinitus,Syngnathidae,Harlequin Pipefish,Pipefishes,"(12.0, 20.0)","(2.0, 18.0)","[Caribbean, Bahamas, South Florida, Brazil]",
46,Isostichopus badionotus,Stichopodidae,Three-Rowed Sea Cucumber,Sea Cucumbers,"(25.0, 40.0)","(0.0, 60.0)","[Caribbean, Bahamas, Florida, Bermuda]",
65,Tigrigobius dilepis,Gobiidae,Orangesided Goby,Gobies,"(1.2, 1.8)","(8.0, 30.0)","[Bahamas, Caribbean]",
69,Stegastes xanthurus,Pomacentridae,Cocoa Damselfish,Damselfishes,"(8.0, 10.0)","(5.0, 18.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",
72,Caretta caretta,Cheloniidae,Loggerhead Turtle,Turtles,"(30.0, 100.0)","(0.0, 60.0)",[Circumtropical],
78,Caranx ruber,Carangidae,Bar Jack,Jacks,"(20.0, 35.0)","(0.0, 18.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",
82,Neoniphon vexillarium,Holocentridae,Dusky Squirrelfish,Squirrelfishes,"(8.0, 15.0)","(0.0, 15.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",


In [131]:
reefguide_fish_DF['smithsonian_href'] = pd.Series(dtype = 'object')
for row in reefguide_fish_DF.iterrows():
    for entry in all_entries:
        
        if entry.get_text() in smith_interpreter.values():
            key_list = list(smith_interpreter.keys())
            val_list = list(smith_interpreter.values())
            pos = val_list.index(entry.get_text())
            smith_name = key_list[pos]

            for entry in all_entries:
                if smith_name in entry:
                    entry = entry
                    break

        if row[1].scientific_name == entry.get_text():
            row[1].smithsonian_href = entry.get('href')
            continue

In [132]:
reefguide_fish_DF.loc[reefguide_fish_DF.smithsonian_href.isna() ]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
2,Manta birostris,Myliobatidae,Giant Manta Ray,Manta Rays,"(180.0, 700.0)","(0.0, 12.0)",[Circumtropical],
7,Trichechus manatus,Trichechidae,West Indian Manatee,Manatees,"(0.0, 400.0)","(0.0, 3.0)","[Florida, Caribbean, Brazil]",
29,Eretmochelys imbricata,Cheloniidae,Hawksbill Turtle,Turtles,"(30.0, 107.0)","(0.0, 21.0)",[Circumtropical],
41,Micrognathus crinitus,Syngnathidae,Harlequin Pipefish,Pipefishes,"(12.0, 20.0)","(2.0, 18.0)","[Caribbean, Bahamas, South Florida, Brazil]",
46,Isostichopus badionotus,Stichopodidae,Three-Rowed Sea Cucumber,Sea Cucumbers,"(25.0, 40.0)","(0.0, 60.0)","[Caribbean, Bahamas, Florida, Bermuda]",
72,Caretta caretta,Cheloniidae,Loggerhead Turtle,Turtles,"(30.0, 100.0)","(0.0, 60.0)",[Circumtropical],
84,Tursiops truncatus,Delphinidae,Bottlenose Dolphin,Dolphins,"(152.0, 366.0)","(0, 0)",[Warm and temperate seas worldwide],
100,Holothuria mexicana,Holothuriidae,Donkey Dung Sea Cucumber,Sea Cucumbers,"(25.0, 35.0)","(0.0, 37.0)","[Caribbean, Bahamas, Florida Keys]",
101,Narcine brasiliensis,Torpedinidae,Lesser Electric Ray,Electric Rays,"(25.0, 45.0)","(0.0, 25.0)","[Florida, Gulf of Mexico, Southern Caribbean, ...",
106,Ectyoplasia ferox,Raspailiidae,Brown Encrusting Octopus Sponge,Common Sponges,"(15.0, 40.0)","(6.0, 25.0)","[Caribbean, Bahamas, Florida]",


In [133]:
reefguide_fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 289 entries, 0 to 289
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    289 non-null    object
 1   scientific_family  289 non-null    object
 2   common_name        289 non-null    object
 3   common_family      289 non-null    object
 4   size_range_cm      289 non-null    object
 5   depth_range_m      289 non-null    object
 6   geodist            289 non-null    object
 7   smithsonian_href   263 non-null    object
dtypes: object(8)
memory usage: 20.3+ KB


Nice! seems good.... so let's drop the remaining rows containing NaNs and then pickle the new reefguide_fish_DF as well as save changes we made to both.

In [141]:
reefguide_fish_DF_1b = reefguide_fish_DF.copy()
reefguide_fish_DF_1b = reefguide_fish_DF_1b.reset_index(drop = True)
reefguide_fish_DF_2 = reefguide_fish_DF_1b.dropna()
reefguide_fish_DF_2 = reefguide_fish_DF_2.reset_index(drop = True)

In [142]:
reefguide_fish_DF_1b.to_pickle('./files/fishspecies_reefguide_info_1b.pkl')
reefguide_fish_DF_2.to_pickle('./files/fishspecies_reefguide_info_2.pkl')

In [2]:
fish_DF = pd.read_pickle('./files/fishspecies_reefguide_info_2.pkl')

In [3]:
fish_DF.head()

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/3894
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]",spe/3537
2,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595
3,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",spe/3690
4,Lutjanus synagris,Lutjanidae,Lane Snapper,Snappers,"(20.0, 30.0)","(2.0, 40.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico, ...",spe/3692


In [4]:
fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    263 non-null    object
 1   scientific_family  263 non-null    object
 2   common_name        263 non-null    object
 3   common_family      263 non-null    object
 4   size_range_cm      263 non-null    object
 5   depth_range_m      263 non-null    object
 6   geodist            263 non-null    object
 7   smithsonian_href   263 non-null    object
dtypes: object(8)
memory usage: 16.6+ KB


Awesome! Now we have this to scrape images from, of course remembering to rename any of the folder structure from our initial reefguide scrape as necessary.

In [20]:
fish_DF.loc[fish_DF.scientific_name == 'Ginglymostoma cirratum']['smithsonian_href'].values[0].split('/')[-1]

'24'

Looking at the gallery pages for smithsonian, for some reason the href we found isn't what is used as the URL prefix. However, the number is correct as an ID so we can use that and append it to a standard URL.

In [6]:
nurse_gallery_URL = 'https://biogeodb.stri.si.edu/caribbean/en/gallery/specie/24'
page = requests.get(nurse_gallery_URL)
nurse_soup = BeautifulSoup(page.content, "html.parser")

In [24]:
blue_box = nurse_soup.find('div',{'class':'bg-blue'})
blue_box.find_all('img')[0].get('src')

'/caribbean/resources/img/images/species/24_247.jpg'

In [12]:
img_link = 'https://biogeodb.stri.si.edu/caribbean/resources/img/images/species/24_247.jpg'
img_data = requests.get(img_link).content
with open('nurse_test.jpg', 'wb') as handler:
    handler.write(img_data)

Okay.. so while there are some restrictions due to copyright on this site, preventing downloading via right clicking, it seems that I can still scrape the images just fine sooooo let's go!

In [65]:
def smith_scrape(scientific_name,scientific_family,smithsonian_href,download_flag):
    ### This function takes in the end of a URL and uses it to access the appropriate webpage on reefguide.org
    ### For example, being passed 'flatneedle.html' will access the URL 'https://reefguide.org/carib/flatneedle.html'
    ### It will then collect the desired species information from the web page as well as download all the images
    ### If the download flag is boolean True using the function reefguide_images and store them in the appropriate folder
    ### on my HDD.

    # Get species N from href

    species_N = smithsonian_href.split('/')[-1]

    # Get page HTML

    smith_prefix = 'https://biogeodb.stri.si.edu/caribbean/en/gallery/specie/'
    species_URL = smith_prefix + species_N
    species_page = requests.get(species_URL)
    species_soup = BeautifulSoup(species_page.content, "html.parser")
    
    # Get Img Links

    blue_box = species_soup.find('div',{'class':'bg-blue'})
    img_link_suffixes = blue_box.find_all('img')

    if download_flag:
        smith_download(img_link_suffixes, scientific_family, scientific_name)
    else:
        pass

    return

def smith_download(img_link_suffixes,scientific_family,scientific_name):
    ### This function downloads all images from the species page and stores them in a species labelled folder on my HDD
    ### The folder is named for the scientific name as well as fits into the scientific family heirarchy
    ### The function will separate images into different folders based on labels such as juvenile, initial

    smith_pix_prefix = 'https://biogeodb.stri.si.edu'

    for html_suffix in img_link_suffixes:

        # Get Image URL and image data

        suffix = html_suffix.get('src')
        img_link = smith_pix_prefix + suffix
        img_data = requests.get(img_link).content

        # Construct save path (for some reason there are spaces at the end of the folders after the reefguide scrape?)

        path_prefix = 'E:/LargeDatasets/SpeciesID-Images/'
        folder_path = path_prefix + scientific_family + '/' + scientific_name.replace(' ','_') + '_'
        filename = suffix.split('/')[-1]
        img_path = folder_path + '/' + filename

        # check if path exists and if not make it so

        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        # check if file already exists and if so skip

        if os.path.exists(img_path):
            continue
        
        with open(img_path, 'wb') as handler:
            handler.write(img_data)

    return

In [56]:
fish_DF.loc[fish_DF.scientific_name == 'Ginglymostoma cirratum']

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
158,Ginglymostoma cirratum,Ginglymostomatidae,Nurse Shark,Nurse Sharks,"(150.0, 350.0)","(4.0, 30.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/24


In [57]:
smith_scrape('Ginglymostoma cirratum','Ginglymostomatidae','spe/24',1)

'works'

In [37]:
base_dir = 'E:/LargeDatasets/SpeciesID-Images/'
os.listdir(base_dir)

['Acanthuridae',
 'Albulidae',
 'Apogonidae',
 'Aulostomidae',
 'Balistidae',
 'Batrachoididae',
 'Belonidae',
 'Blenniidae',
 'Bothidae',
 'Callionymidae',
 'Carangidae',
 'Carcharhinidae',
 'Chaenopsidae',
 'Chaetodontidae',
 'Cheloniidae',
 'Cirrhitidae',
 'Congridae',
 'Cynoglossidae',
 'Dactylopteridae',
 'Dasyatidae',
 'Delphinidae',
 'Diodontidae',
 'Echeneidae',
 'Ephippidae',
 'Fistulariidae',
 'Gerreidae',
 'Ginglymostomatidae',
 'Gobiidae',
 'Grammatidae',
 'Haemulidae',
 'Hemiramphidae',
 'Holocentridae',
 'Holothuriidae',
 'Inermiidae',
 'Kyphosinae',
 'Labridae',
 'Labrisomidae',
 'Lutjanidae',
 'Malacanthidae',
 'Megalopidae',
 'Monacanthidae',
 'Mullidae',
 'Muraenidae',
 'Myliobatidae',
 'Octopodidae',
 'Ogcocephalidae',
 'Ophichthidae',
 'Opisthognathidae',
 'Ostraciidae',
 'Paralichthyidae',
 'Pempheridae',
 'Pomacanthidae',
 'Pomacentridae',
 'Priacanthidae',
 'Raspailiidae',
 'Rhincodontidae',
 'Scaridae',
 'Sciaenidae',
 'Sclerodactylidae',
 'Scombridae',
 'Scorpa

In [48]:
os.listdir(base_dir+'Acanthuridae')

['Acanthurus_chirurgus_',
 'Acanthurus_coeruleus_',
 'Acanthurus_coeruleus_Intermediate_Phase\xa0',
 'Acanthurus_coeruleus_Juvenile\xa0',
 'Acanthurus_tractus_Post-Larval_phase\xa0',
 'Acanthurus_tractus_\xa0']

In [49]:
os.rename('E:/LargeDatasets/SpeciesID-Images/'+'Acanthuridae'+'/'+'Acanthurus_tractus_\xa0','E:/LargeDatasets/SpeciesID-Images/'+'Acanthuridae'+'/'+'Acanthurus_tractus_\xa0'.replace('\xa0',''))

In [53]:
base_dir = 'E:/LargeDatasets/SpeciesID-Images/'
family_dir = os.listdir(base_dir)

for family in family_dir:
    species_dir = os.listdir(base_dir+family)

    for species in species_dir:
        if '\xa0' in species:
            os.rename(base_dir+family+'/'+species,base_dir+family+'/'+species.replace('\xa0',''))

Okay, so there was an error somehow when I first scraped reefguide, likely due to how i handled 'phase_tag' that made it so there was a trailing whitespace on all my folder names. The above seems to have fixed that issue and was verified with the nurse shark download. Let's try the smith_scrape for all the species!

In [61]:
fish_DF.loc[fish_DF.scientific_name == 'Ginglymostoma cirratum'].apply(lambda x: smith_scrape(x.scientific_name, x.scientific_family, x.smithsonian_href,1), axis = 1)

158    works
dtype: object

In [60]:
fish_DF.loc[1:2]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]",spe/3537
2,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595


In [64]:
fish_DF.loc[1:2].apply(lambda x: smith_scrape(x.scientific_name, x.scientific_family, x.smithsonian_href,1), axis = 1)

1    works
2    works
dtype: object

In [66]:
fish_DF.apply(lambda x: smith_scrape(x.scientific_name, x.scientific_family, x.smithsonian_href,1), axis = 1)

0      None
1      None
2      None
3      None
4      None
       ... 
258    None
259    None
260    None
261    None
262    None
Length: 263, dtype: object

Hell yes!!!!! Looking through my images and seeing the new pictures all in the right folder is pretty bad ass. It really has improved the training set so far. Onto the next site!

## Snorkel STJ

As with Smithsonian, we need to figure out how to look up the fish we already have in our DF on this website. They are by local common name, which may prove a challenge, but we will try our best.

In [2]:
URL = 'https://www.snorkelstj.com/list-species.html'
page = requests.get(URL)

STJ_soup = BeautifulSoup(page.content, "html.parser")

In [3]:
fish_DF = pd.read_pickle('./files/fishspecies_reefguide_info_2.pkl')

In [4]:
fish_DF.head()

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/3894
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]",spe/3537
2,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595
3,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",spe/3690
4,Lutjanus synagris,Lutjanidae,Lane Snapper,Snappers,"(20.0, 30.0)","(2.0, 40.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico, ...",spe/3692


In [5]:
fish_DF.common_family.value_counts()

Gobies                 23
Wrasses                15
Grunts                 12
Parrotfishes           11
Jacks                   9
                       ..
Cornetfishes            1
Sweepers                1
Mackerels and Tunas     1
Round Stingrays         1
Bonnetmouths            1
Name: common_family, Length: 68, dtype: int64

Okay so the way that SnorkelSTJ is set up is that the page above has a list of species by common name. Once you go to a species page by clicking a common name, there is a header with the scientific name. So, what I think we are going to try first is to find all the common family divisions they have in the species list page, and cross reference that with our common family column. If there is acceptable overlap, we will get the HTML references from those family headers, then double check the scientific names when we actually access the pages themselves.

In [9]:
STJ_soup.find_all('span',{'class':'auto-style3'})

[<span class="auto-style3">snorkelstj.com</span>,
 <span class="auto-style3">FISH </span>,
 <span class="auto-style3">Angelfish</span>,
 <span class="auto-style3">Barracuda:</span>,
 <span class="auto-style3">Basslet:</span>,
 <span class="auto-style3">Batfish:</span>,
 <span class="auto-style3"> Blennies:</span>,
 <span class="auto-style3"><a href="bonefish.html">Bonefish</a></span>,
 <span class="auto-style3">Boxfish:</span>,
 <span class="auto-style3">Butterflyfish:</span>,
 <span class="auto-style3">Cardinal Fishes:</span>,
 <span class="auto-style3">Chub:</span>,
 <span class="auto-style3">Clingfish:</span>,
 <span class="auto-style3">Damselfish:</span>,
 <span class="auto-style3">Drums &amp; Croakers:</span>,
 <span class="auto-style3">Eels:</span>,
 <span class="auto-style3">Filefishes:</span>,
 <span class="auto-style3">Flounder:</span>,
 <span class="auto-style3">Goatfish:</span>,
 <span class="auto-style3"><a href="goby_gallery.html">Gobies:</a></span>,
 <span class="auto-sty

In [10]:
family_headers = STJ_soup.find_all('span',{'class':'auto-style3'})
text_list = []
for family in family_headers:
    text_list.append(family.get_text())

In [18]:
text_list = [text.strip().strip(':') for text in text_list]

In [19]:
text_list

['snorkelstj.com',
 'FISH',
 'Angelfish',
 'Barracuda',
 'Basslet',
 'Batfish',
 'Blennies',
 'Bonefish',
 'Boxfish',
 'Butterflyfish',
 'Cardinal Fishes',
 'Chub',
 'Clingfish',
 'Damselfish',
 'Drums & Croakers',
 'Eels',
 'Filefishes',
 'Flounder',
 'Goatfish',
 'Gobies',
 'Groupers & Sea Basses',
 'Flying Gurnards',
 'Grunts',
 'Hamlets',
 'Herring',
 'Jacks',
 'Jawfish',
 'Lionfish',
 'Lizardfish',
 'Mackerel',
 'Margates',
 'Mojarra',
 'Mullets',
 'Needlefish',
 'Halfbeaks',
 'Parrotfishes',
 'Pipefish',
 'Porcupinefishes',
 'Porgies',
 'Pufferfishes',
 'Rays',
 'Remoras',
 'Scorpionfish',
 'Sea Horses',
 'Sharks',
 'Snappers',
 'Snook',
 'Squirrelfishes',
 'Stingray',
 'Surgeonfish',
 'Sweepers',
 'Tarpon',
 'Triggerfish',
 'Tilefish',
 'Trumpetfish',
 'Trunkfish',
 'Wrasses',
 'Creatures',
 'Sea Anemones',
 'Barnacles',
 'Bivalves',
 'Bryozoans',
 'Chitons',
 'Comb Jellies',
 'Corallomorph',
 'Crabs',
 'Hydroids',
 'Isopod',
 'Jellyfish',
 'Lobster',
 'Limpets',
 'Nudibranch',


In [28]:
overlap = [ text for text in text_list if text in fish_DF.common_family.values ]
not_in = [ text for text in text_list if text not in fish_DF.common_family.values ]

In [29]:
overlap

['Filefishes',
 'Gobies',
 'Flying Gurnards',
 'Grunts',
 'Hamlets',
 'Jacks',
 'Halfbeaks',
 'Parrotfishes',
 'Porcupinefishes',
 'Porgies',
 'Pufferfishes',
 'Remoras',
 'Snappers',
 'Squirrelfishes',
 'Sweepers',
 'Wrasses']

In [30]:
not_in

['snorkelstj.com',
 'FISH',
 'Angelfish',
 'Barracuda',
 'Basslet',
 'Batfish',
 'Blennies',
 'Bonefish',
 'Boxfish',
 'Butterflyfish',
 'Cardinal Fishes',
 'Chub',
 'Clingfish',
 'Damselfish',
 'Drums & Croakers',
 'Eels',
 'Flounder',
 'Goatfish',
 'Groupers & Sea Basses',
 'Herring',
 'Jawfish',
 'Lionfish',
 'Lizardfish',
 'Mackerel',
 'Margates',
 'Mojarra',
 'Mullets',
 'Needlefish',
 'Pipefish',
 'Rays',
 'Scorpionfish',
 'Sea Horses',
 'Sharks',
 'Snook',
 'Stingray',
 'Surgeonfish',
 'Tarpon',
 'Triggerfish',
 'Tilefish',
 'Trumpetfish',
 'Trunkfish',
 'Creatures',
 'Sea Anemones',
 'Barnacles',
 'Bivalves',
 'Bryozoans',
 'Chitons',
 'Comb Jellies',
 'Corallomorph',
 'Crabs',
 'Hydroids',
 'Isopod',
 'Jellyfish',
 'Lobster',
 'Limpets',
 'Nudibranch',
 'Oysters:See bivalves\nOctopus',
 'Sea Cucumbers',
 'Sea Slugs',
 'Seahares',
 'Shrimp',
 'Snails',
 'Sponges',
 'Squid',
 'Starfish',
 'Sea Stars',
 'Tunicates',
 'Urchins',
 'Worms',
 'Zoanthids',
 'CORALS',
 'Fire Corals',
 

In [27]:
fish_DF.common_family.nunique()

68

Hmm.. not great. Let's just manual up this bad boy

In [34]:
fish_DF.common_family.value_counts().index

Index(['Gobies', 'Wrasses', 'Grunts', 'Parrotfishes', 'Jacks', 'Seabasses',
       'Groupers', 'Damselfishes', 'Hamlets', 'Snappers', 'Cardinalfishes',
       'Moray Eels', 'Labrisomid Blennies', 'Angelfishes', 'Tube Blennies',
       'Porcupinefishes', 'Porgies', 'Scorpionfishes', 'Butterflyfishes',
       'Filefishes', 'Boxfishes', 'Squirrelfishes', 'Triggerfishes', 'Drums',
       'Chromis', 'Snake Eels', 'Soapfishes', 'Chubs', 'Toadfishes',
       'Surgeonfishes', 'Jawfishes', 'Requiem Sharks', 'Pufferfishes',
       'Goatfishes', 'Combtooth Blennies', 'Needlefishes', 'Remoras',
       'Lefteye Flounders', 'Bigeyes', 'Lizardfishes', 'Basslets', 'Seahorses',
       'Spadefishes', 'Trumpetfishes', 'Flying Gurnards', 'Triplefin Blennies',
       'Tonguefishes', 'Lionfishes', 'Soldierfishes', 'Dragonets', 'Halfbeaks',
       'Mojarras', 'Barracudas', 'Tilefishes', 'Eagle Rays', 'Nurse Sharks',
       'Bonefishes', 'Whale Sharks', 'Tarpons', 'Sand Flounders',
       'Garden Eels', 'Hawk

In [35]:
manual_clean = [ 'Angelfish',
 'Barracuda',
 'Basslet',
 'Batfish',
 'Blennies',
 'Bonefish',
 'Boxfish',
 'Butterflyfish',
 'Cardinal Fishes',
 'Chub',
 'Clingfish',
 'Damselfish',
 'Drums & Croakers',
 'Eels',
 'Flounder',
 'Goatfish',
 'Groupers & Sea Basses',
 'Herring',
 'Jawfish',
 'Lionfish',
 'Lizardfish',
 'Mackerel',
 'Margates',
 'Mojarra',
 'Mullets',
 'Needlefish',
 'Pipefish',
 'Rays',
 'Scorpionfish',
 'Sea Horses',
 'Sharks',
 'Snook',
 'Stingray',
 'Surgeonfish',
 'Tarpon',
 'Triggerfish',
 'Tilefish',
 'Trumpetfish',
 'Trunkfish',
 'Creatures',
 'Sea Anemones',
 'Barnacles',
 'Bivalves',
 ]
desired_groups = list(set(overlap + manual_clean))

In [43]:
content = STJ_soup.find('section',{'id':'content'})
    

In [60]:
content.find('span').find_next('span')

<span class="auto-style3">Angelfish</span>

In [68]:
content.find('a')

<a href="french-angelfish.html">French</a>

In [76]:
content.find('li')

<li><a href="french-angelfish.html">French</a><br/>
<a href="gray-angelfish.html">Gray</a><br/>
<a href="queen-angelfish.html">Queen</a><br/>
<a href="rock-beauty-angelfish.html">Rock Beauty </a></li>

In [79]:
href_list = []
for a_sect in content.find_all('a'):
    href_list.append(a_sect.get('href'))

In [80]:
href_list

['french-angelfish.html',
 'gray-angelfish.html',
 'queen-angelfish.html',
 'rock-beauty-angelfish.html',
 'rock-beauty-angelfish.html',
 'barracuda.html',
 'fairy-basslet.html',
 'harleguin-bass.html',
 'shortnose-batfish.html',
 'blenny_gallery.html',
 'barfin-blenny.html',
 'blennies.html',
 'dusky-blenny.html',
 'goldline-blenny.html',
 'hairy-blenny.html',
 'mimic-blenny.html',
 'molly-miller.html',
 'orangespotted-blenny.html',
 'pearl-blenny.html',
 'puffcheek-blenny.html',
 'redlip-blenny.html',
 'rosy-blenny.html',
 'triplefin.html',
 'saddled-blenny.html',
 'seaweed-blenny.html',
 'secretary-blenny.html',
 'spinyhead-blenny.html',
 'spotcheek-blenny.html',
 'twinhorn-blenny.html',
 'bonefish.html',
 'honeycomb-cowfish.html',
 'scrawled-cowfish.html',
 'banded-butterflyfish.html',
 '4eye-butterfly.html',
 'spotfin-butterflyfish.html',
 'belted-cardinalfish.html',
 'blackfin-cardinalfish.html',
 'flamefish.html',
 'dusky-cardinalfish.html',
 'chub.html',
 'clingfish_gallery.htm

Sigh... looks like we will just clean the html list manually. The page structure is not well organized / consistent so trying to extract only the appropriate hrefs is getting annoying. Since I will need to perform a check on the scientific name once entering the species page, no need to do an extensive check here.

In [127]:
STJ_hrefs = ['french-angelfish.html',
 'gray-angelfish.html',
 'queen-angelfish.html',
 'rock-beauty-angelfish.html',
 'barracuda.html',
 'fairy-basslet.html',
 'harleguin-bass.html',
 'shortnose-batfish.html',
 'barfin-blenny.html',
 'dusky-blenny.html',
 'goldline-blenny.html',
 'hairy-blenny.html',
 'mimic-blenny.html',
 'molly-miller.html',
 'orangespotted-blenny.html',
 'pearl-blenny.html',
 'puffcheek-blenny.html',
 'redlip-blenny.html',
 'rosy-blenny.html',
 'triplefin.html',
 'saddled-blenny.html',
 'seaweed-blenny.html',
 'secretary-blenny.html',
 'spinyhead-blenny.html',
 'spotcheek-blenny.html',
 'twinhorn-blenny.html',
 'bonefish.html',
 'honeycomb-cowfish.html',
 'scrawled-cowfish.html',
 'banded-butterflyfish.html',
 '4eye-butterfly.html',
 'spotfin-butterflyfish.html',
 'belted-cardinalfish.html',
 'blackfin-cardinalfish.html',
 'flamefish.html',
 'dusky-cardinalfish.html',
 'chub.html',
 'clingfish_gallery.html',
 'beaugregory.html',
 'bi-colored-damselfish.html',
 'chromis.html',
 'chromis.html',
 'cocoa-damselfish.html',
 'dusky-damselfish.html',
 'longfin-damselfish.html',
 'night-sergeant.html',
 'threespot-damselfish.html',
 'sergeant-major.html',
 'yellowtail-damselfish.html',
 'spotted-drum.html',
 'highhat.html',
 'reef-croaker.html',
 'brown-garden-eel.html',
 'chain-moray-eel.html',
 'chestnut-moray.html',
 'goldentail-moray.html',
 'goldspotted-eel.html',
 'green-moray-eel.html',
 'purplemouth-moray.html',
 'spotted-moray-eel.html',
 'orangespotted-filefish.html',
 'scrawled-filefish.html',
 'slender-filefish.html',
 'whitespotted-filefish.html',
 'unicorn-filefish.html',
 'peacock-flounder.html',
 'spotted-goatfish.html',
 'yellow-goatfish.html',
 'cleaning-goby.html',
 'colon-goby.html',
 'dash-goby.html',
 'frillfin-goby.html',
 'masked-glass-goby.html',
 'goldspot-goby.html',
 'greenbanded-goby.html',
 'masked-glass-goby.html',
 'nineline_goby.html',
 'black-grouper.html',
 'coney.html',
 'graysby.html',
 'greater-soapfish.html',
 'mutton-hamlet.html',
 'nassau-grouper.html',
 'red-hind.html',
 'rock-hind.html',
 'tobaccofish.html',
 'flying-gurnard.html',
 'black-margate.html',
 'bluestriped-grunt.html',
 'caesar-grunt.html',
 'french-grunt.html',
 'porkfish.html',
 'sailors-choice.html',
 'smallmouth-grunt.html',
 'spanish-grunt.html',
 'tomtate-grunt.html',
 'white-margate.html',
 'white-grunt.html',
 'hamlets.html',
 'barred-hamlet.html',
 'hybrid-black-hamlet.html',
 'butter-hamlet.html',
 'indigo-hamlet.html',
 'mutton-hamlet.html',
 'tan-hamlet.html',
 'hybrid-yellowbelly-hamlet.html',
 'yellowtail-hamlet.html',
 'hybrid-yellowtail-hamlet.html',
 'red-ear-herring.html',
 'almaco-jack.html',
 'bar-jack.html',
 'bigeye-scad.html',
 'blue-runner-jack.html',
 'horse-eye-jack.html',
 'leather-jacket.html',
 'palometa-jack.html',
 'permit-jack.html',
 'yellow-jack.html',
 'banded-jawfish.html',
 'mottled-jawfish.html',
 'yellowhead-jawfish.html',
 'indo-pacific-lionfish.html',
 'sand-diver-lizardfish.html',
 'inshore-lizardfish.html',
 'cero-mackerel.html',
 'black-margate.html',
 'white-margate.html',
 'flagfin-mojarra.html',
 'mottled-mojarra.html',
 'yellowfin-mojarra.html',
 'white-mullet.html',
 'needlefish.html',
 'ballyhoo.html',
 'bucktooth-parrotfish.html',
 'princess-parrotfish.html',
 'queen-parrotfish.html',
 'rainbow-parrotfish.html',
 'redband-parrotfish.html',
 'redfin-yellowtail-parrotfish.html',
 'stoplight-parrotfish.html',
 'striped-parrotfish.html',
 'harlequin-pipefish.html',
 'shortfin-pipefish.html',
 'balloonfish.html',
 'bridled-burrfish.html',
 'porcupinefish.html',
 'sea-bream.html',
 'sheepshead-porgy.html',
 'silver-porgy.html',
 'bandtail-pufferfish.html',
 'checkered-pufferfish.html',
 'sharpnose-pufferfish.html',
 'southern-stingray.html',
 'spotted-eagle-ray.html',
 'sharksucker.html',
 'plumed-scorpionfish.html',
 'reef-scorpionfish.html',
 'spotted-scorpionfish.html',
 'blacktip-shark.html',
 'lemon-shark.html',
 'nurse-shark.html',
 'cubera-snapper.html',
 'dog-snapper.html',
 'glasseye-snapper.html',
 'gray-snapper.html',
 'lane-snapper.html',
 'mahogany-snapper.html',
 'mutton-snapper.html',
 'schoolmaster-snapper.html',
 'yellowtail-snapper.html',
 'common-snook.html',
 'blackbar-soldierfish.html',
 'common-squirrelfish.html',
 'dusky-squirrelfish.html',
 'longspine-squirrelfish.html',
 'reef-squirrelfish.html',
 'southern-stingray.html',
 'blue-tang.html',
 'doctorfish.html',
 'surgeonfish.html',
 'glassy-sweeper.html',
 'tarpon.html',
 'queen-triggerfish.html',
 'sand-tilefish.html',
 'trumpetfish.html',
 'buffalo-trunkfish.html',
 'smooth-trunkfish.html',
 'spotted-trunkfish.html',
 'blackear-wrasse.html',
 'bluehead-wrasse.html',
 'clown-wrasse.html',
 'creole-wrasse.html',
 'green-razorfish.html',
 'pearly-razorfish.html',
 'puddingwife.html',
 'rosy-razorfish.html',
 'slippery-dick-wrasse.html',
 'spanish-hogfish.html',
 'yellowhead-wrasse.html']

Okay, this should get us started at least.

In [111]:
STJ_page_prefix = 'https://www.snorkelstj.com/'
test_species_suffix = 'shortfin-pipefish.html'
page = requests.get(STJ_page_prefix+test_species_suffix)

test_STJ_soup = BeautifulSoup(page.content, "html.parser")

In [112]:
test_content = test_STJ_soup.find('section',{'id':'content'})

In [121]:
test_content.find('span',{'class':'auto-style2'}).get_text()

'Cosmocampus elucens'

In [91]:
test_content.find_all('span',{'class':'Italic'})[0].get_text() in fish_DF.scientific_name.values

True

Okay, the check is easy enough. Now just get the images and put them in the correct folder. First let's add the stj_hrefs to our fish_DF. We will do this by accessing all the pages, checking if the species is in our DF, and then storing it if yes.

In [190]:
STJ_page_prefix = 'https://www.snorkelstj.com/'

fish_DF['stj_href'] = pd.Series(dtype = 'object')
STJ_scientific_names = []
for STJ_href in STJ_hrefs:
    page = requests.get(STJ_page_prefix+STJ_href)
    soup = BeautifulSoup(page.content, "html.parser")
    content = soup.find('section',{'id':'content'})
    if STJ_href == 'unicorn-filefish.html':
        species_scientific = content.find('span',{'class':'auto-style3'}).get_text()
    elif STJ_href == 'shortfin-pipefish.html':
        species_scientific = content.find('span',{'class':'auto-style2'}).get_text()
    else:
        species_scientific = content.find('span',{'class':'Italic'}).get_text()

    second_species = None
    if '(' in species_scientific:
        species_scientific = species_scientific.split('(')[1].replace(')','')
    elif '/' in species_scientific:
        halves = species_scientific.split('/')
        halves[0] = halves[0].strip()
        genus = halves[0].split()[0]
        species = halves[0].split()[1]
        species_scientific = genus + ' ' + species
        if len(halves[1].split()) == 1:
            second_species = genus + ' ' + halves[1].strip()
        else:
            second_species = halves[1].strip()
    elif '-' in species_scientific:
        species_scientific = species_scientific.split('-')[1]
    elif '\xa0' in species_scientific:
        species_scientific = species_scientific.split('\xa0')[0]
    elif ',' in species_scientific:
        halves = species_scientific.split(',')
        species_scientific = halves[0]
        second_species = halves[1].strip()
        
    species_scientific = species_scientific.replace('\t','').replace('\n','').replace('\r','')
    species_scientific = species_scientific.replace('Most likely: ','')
    species_scientific = species_scientific.strip()    
    STJ_scientific_names.append(species_scientific)
    if second_species != None:
        STJ_scientific_names.append(second_species)

    if species_scientific in fish_DF.scientific_name.values:
        fish_DF.stj_href.loc[fish_DF.scientific_name == species_scientific] = STJ_href
    if (second_species != None) and (second_species in fish_DF.scientific_name.values):
        fish_DF.stj_href.loc[fish_DF.scientific_name == second_species] = STJ_href

    # check for common name match but keep current href if already found by scientific name
    common_name = ' '.join(STJ_href.replace('.html','').split('-'))
    common_name = ' '.join([text.capitalize() for text in common_name.split()])
    if (common_name in fish_DF.common_name.values) and (fish_DF.stj_href.loc[fish_DF.common_name == common_name].isnull().values.any()):
        fish_DF.stj_href.loc[fish_DF.common_name == common_name] = STJ_href


In [191]:
fish_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   scientific_name    263 non-null    object
 1   scientific_family  263 non-null    object
 2   common_name        263 non-null    object
 3   common_family      263 non-null    object
 4   size_range_cm      263 non-null    object
 5   depth_range_m      263 non-null    object
 6   geodist            263 non-null    object
 7   smithsonian_href   263 non-null    object
 8   stj_href           152 non-null    object
dtypes: object(9)
memory usage: 18.6+ KB


Yikes.. barely half? Let's look at some scientific names..

In [172]:
STJ_scientific_names

['Pomacanthus paru',
 'Pomacanthus arcuatus',
 'Holacanthus ciliaris',
 'Holacanthus tricolor',
 'Sphyraena barracuda',
 'Gramma loreto',
 'Serranus tigrinus',
 'Ogcocephalus nasutus',
 'Malacoctenus versicolor',
 'Malacoctenus gilli',
 'Malacoctenus aurolineatus',
 'Labrisomus nuchipinnis',
 'Labrisomus guppyi',
 'Scartella cristata',
 'Hypleurochilus springeri',
 'Entomacrodus nigricans',
 'Labrisomus bucciferus',
 'Ophioblennius macclurei',
 'Malacoctenus macropus',
 'Enneanectes spp.',
 'Malacoctenus triangulatus',
 'Parablennius marmoreus',
 'Acanthemblemaria maria',
 'Acanthemblemaria spinosa',
 'Labrisomus nigricinctus',
 'Coralliozetus cardonae',
 'Albula vulpes',
 'Acanthostracion polygonia',
 'Acanthostracion quadricornis',
 'Chaetodon striatus',
 'Chaetodon capistratus',
 'Chaetodon ocellatus',
 'Apogon townsendi',
 'Apogon binotatus',
 'Astrapogon puncticulatus',
 'Apogon maculatus',
 'Phaeoptyx pigmentaria',
 'Kyphosus sectatrix',
 'Kyphosus incisor',
 'Arcos macrophthalmu

In [167]:
len(STJ_scientific_names)

206

Alright so we know that we will be short at least ~60.. There are a lot that contained weird characters for some reason, so we cleaned those up Also there are a few pages that seem to be for multiple species that arent getting picked up well. So we split them up into individual species and dealt with those. Unfortunately, a lot of them are like 'Hypoplectrus sp.' which is really unhelpful. We could try and match by the common name as well... perhaps we can look for a common name match by extracting the common name from the href. Doing an exact match by common name netted us a dozen more references.. not bad. The two things left for improvement are: fuzzy matching by common name, and looking up scientific name inconsistencies or mistakes as we did for the smithsonian database. Seeing as I don't want to do that right now and we can only improve by about 50 species, we are going to move on.

In [206]:
def stj_scrape(scientific_name,scientific_family,stj_href,download_flag):
    ### This function takes in the end of a URL and uses it to access the appropriate webpage on reefguide.org
    ### For example, being passed 'french-angelfish.html' will access the URL 'https://www.snorkelstj.com/french-angelfish.html'
    ### It will then collect the desired species information from the web page as well as download all the images
    ### If the download flag is boolean True using the function reefguide_images and store them in the appropriate folder
    ### on my HDD.

    # Get page HTML

    stj_prefix = 'https://www.snorkelstj.com/'
    species_URL = stj_prefix + stj_href
    species_page = requests.get(species_URL)
    species_soup = BeautifulSoup(species_page.content, "html.parser")
    
    # Get Img Links

    content = species_soup.find('section',{'id':'content'})
    img_link_suffixes = content.find_all('img')

    if download_flag:
        stj_download(img_link_suffixes, scientific_family, scientific_name)
    else:
        pass

    return

def stj_download(img_link_suffixes,scientific_family,scientific_name):
    ### This function downloads all images from the species page and stores them in a species labelled folder on my HDD
    ### The folder is named for the scientific name as well as fits into the scientific family heirarchy
    ### The function will separate images into different folders based on labels such as juvenile, initial

    stj_pix_prefix = 'https://www.snorkelstj.com/'

    for html_suffix in img_link_suffixes:

        # Get Image URL and image data

        suffix = html_suffix.get('src')
        img_link = stj_pix_prefix + suffix
        img_data = requests.get(img_link).content

        # Construct save path

        path_prefix = 'E:/LargeDatasets/SpeciesID-Images/'
        folder_path = path_prefix + scientific_family + '/' + scientific_name.replace(' ','_') + '_'
        filename = suffix.split('/')[-1]
        img_path = folder_path + '/' + filename
        img_path = img_path.replace('\r','').replace('\n','')

        # check if path exists and if not make it so

        if not os.path.exists(folder_path):
            os.makedirs(folder_path)

        # check if file already exists and if so skip

        if os.path.exists(img_path):
            continue
        
        with open(img_path, 'wb') as handler:
            handler.write(img_data)

    return

In [198]:
fish_DF.stj_href.loc[fish_DF.common_name == 'Freckled Soapfish']

1    NaN
Name: stj_href, dtype: object

In [199]:
fish_DF.stj_href.loc[fish_DF.common_name == 'Freckled Soapfish'].isnull().values.any()

True

In [200]:
fish_DF.loc[fish_DF.stj_href.isnull() == False][:2]

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href,stj_href
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/3894,slippery-dick-wrasse.html
2,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595,belted-cardinalfish.html


In [207]:
fish_DF.loc[fish_DF.stj_href.isnull() == False].apply(lambda x: stj_scrape(x.scientific_name, x.scientific_family, x.stj_href,1), axis = 1)

0      None
2      None
3      None
4      None
6      None
       ... 
250    None
251    None
257    None
261    None
262    None
Length: 152, dtype: object

Nice! Now let's save the fish_DF with the stj_hrefs column added

In [208]:
fish_DF.to_pickle('./files/fishspecies_reefguide_info_3.pkl')

In [209]:
fish_DF = pd.read_pickle('./files/fishspecies_reefguide_info_3.pkl')

In [210]:
fish_DF.head()

Unnamed: 0,scientific_name,scientific_family,common_name,common_family,size_range_cm,depth_range_m,geodist,smithsonian_href,stj_href
0,Halichoeres bivittatus,Labridae,Slippery Dick,Wrasses,"(12.0, 20.0)","(2.0, 12.0)","[Caribbean, Bahamas, Florida, Bermuda, Gulf of...",spe/3894,slippery-dick-wrasse.html
1,Rypticus bistrispinus,Serranidae,Freckled Soapfish,Soapfishes,"(7.5, 13.0)","(3.0, 21.0)","[Caribbean, Bahamas, South Florida, Brazil]",spe/3537,
2,Apogon binotatus,Apogonidae,Barred Cardinalfish,Cardinalfishes,"(0.0, 10.0)","(1.0, 45.0)","[Caribbean, Bahamas, South Florida]",spe/3595,belted-cardinalfish.html
3,Lutjanus mahogoni,Lutjanidae,Mahogany Snapper,Snappers,"(18.0, 30.0)","(6.0, 18.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico]",spe/3690,mahogany-snapper.html
4,Lutjanus synagris,Lutjanidae,Lane Snapper,Snappers,"(20.0, 30.0)","(2.0, 40.0)","[Caribbean, Bahamas, Florida, Gulf of Mexico, ...",spe/3692,lane-snapper.html
