spiderfish 0.0.2

Author: Hannah Weller
Email: hannahiweller@gmail.com

Pipeline for downloading and sorting Fishbase images.

About

Uses Python and R to download all Fishbase images of a user-specified family, then sorts out standardized images (full-body lateral view, relatively uniform background) for easy digitization. The pipeline follows four steps after you give it a fish family:

  1. Downloads all images of that fish family that have confirmed identifications on Fishbase, using Scrapy in Python (http://scrapy.org/) to crawl species pages for photographs. See the 'fishbase' folder, and the first sketch after this list.

  2. Sorts images into 'Pass' or 'Fail' categories based on regionalized color histogram indices, using OpenCV in Python (http://opencv.org/). See the 'Sorting.py' script, and the second sketch after this list. Image classification parameters are explained on the 'Image Sorting' page of the Wiki, which is also the only page on the Wiki. Go hard.

  3. Fetches a list of all species in the family from Fishbase using the rfishbase package in R (https://github.com/ropensci/rfishbase).

  4. Outputs a list of species in the family that are missing images on Fishbase (if any), a list of species whose only images were rejected by the image classifier (not good candidates for morphometrics/digitization), and a list of which species correspond to which image names.
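
To give a sense of step 1, here is a minimal sketch of a Scrapy spider that crawls species pages for photographs. The class name, start URL, and CSS selectors are placeholders for illustration, not the actual code in the 'fishbase' folder:

import scrapy

class FishbaseImageSpider(scrapy.Spider):
    # Hypothetical sketch; the real spider lives in the 'fishbase' folder,
    # and the URL and selectors below are placeholders.
    name = 'fishbase_sketch'
    start_urls = ['https://www.fishbase.org/']  # placeholder family page

    def parse(self, response):
        # Follow links from the family page to individual species pages.
        for href in response.css('a.species-link::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_species)

    def parse_species(self, response):
        # Yield items with an 'image_urls' field so Scrapy's ImagesPipeline
        # can download the photographs automatically.
        yield {
            'species': response.css('h1::text').extract_first(),
            'image_urls': [response.urljoin(src) for src in
                           response.css('img::attr(src)').extract()],
        }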
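
And a rough sketch of the kind of regionalized histogram check step 2 performs. The region split and threshold here are illustrative assumptions, not the actual parameters in 'Sorting.py':

import cv2
import numpy as np

def background_is_uniform(image_path, threshold=0.5):
    # Illustrative stand-in for the classifier in 'Sorting.py': checks
    # whether a border strip (a rough proxy for the background) has a
    # concentrated color histogram. The region choice and threshold are
    # assumptions, not spiderfish's actual parameters.
    img = cv2.imread(image_path)
    if img is None:
        raise IOError('Could not read %s' % image_path)
    h, w = img.shape[:2]
    bh, bw = max(h // 10, 1), max(w // 10, 1)
    border = np.concatenate([
        img[:bh].reshape(-1, 3),      # top strip
        img[-bh:].reshape(-1, 3),     # bottom strip
        img[:, :bw].reshape(-1, 3),   # left strip
        img[:, -bw:].reshape(-1, 3),  # right strip
    ])
    # Coarse 8-bins-per-channel histogram of the border pixels.
    hist = cv2.calcHist([border.reshape(-1, 1, 3)], [0, 1, 2], None,
                        [8, 8, 8], [0, 256, 0, 256, 0, 256])
    hist /= hist.sum()
    # A uniform background concentrates most of the mass in a few bins.
    return hist.max() > threshold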

The results are stored in a 'FamilyName/' folder inside a directory specified by the user. The folder has three subfolders: 'Pass', 'Fail', and 'All', which contain images passed by the classifier, images rejected by the classifier, and duplicates of all the images in case you don't care about the classification, respectively. The folder also contains a JSON file with all the image URLs, image names, and species names, as well as the generated CSV files.
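
If you want to work with the JSON feed directly, it can be read with a few lines of Python. The path and field names below are assumptions for illustration, so inspect your own feed file to confirm its structure:

import json

# Hypothetical path and field names; check your own output file.
with open('Chaetodontidae/Chaetodontidae.json') as f:
    records = json.load(f)

for record in records:
    # Each record is assumed to pair a species name with its image URLs.
    print(record.get('species'), record.get('image_urls'))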

Usage

To clone the repository onto your computer:

git clone https://github.com/hiweller/spiderfish

The pipeline is run through a simple shell script that just calls the steps above in the right order. There are two options for running it:

  1. Add the 'spiderFish' command in 'bash_profile_pipeline.sh' to your bash profile (instructions are in the script).

  2. Source the 'shell_pipeline.sh' script after making it executable (instructions are in the script).

The two options are identical apart from how they're called. Personally, I prefer the bash profile option, since you can call it from any directory with the 'spiderFish' command instead of having to specify the exact filepath of the shell pipeline script.

Once called, spiderFish will prompt you for the name of a fish family (either all lowercase or with the first letter capitalized) and a destination for the results folder. For example:

user$ spiderFish
Fish family?
Chaetodontidae

Folder destination?
/Users/user/Desktop

If the family name is valid, Scrapy will start crawling Fishbase for images, starting with something like this:

2016-08-16 11:52:34 [scrapy] INFO: Scrapy 1.1.1 started (bot: fishbase)
2016-08-16 11:52:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'fishbase.spiders', 'FEED_URI': 'Chaetodontidae.json', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['fishbase.spiders'], 'BOT_NAME': 'fishbase', 'LOG_STDOUT': True, 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-08-16 11:52:34 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
...a bunch of other settings like that...
 'scrapy.pipelines.images.ImagesPipeline']
2016-08-16 12:55:47 [scrapy] INFO: Spider opened
2016-08-16 12:55:47 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-16 12:56:47 [scrapy] INFO: Crawled 209 pages (at 209 pages/min), scraped 80 items (at 80 items/min)

After 1-2 minutes of initialization, Scrapy usually downloads images from Fishbase at about 50-100 images per minute. This is by far the longest step. Once it finishes, the image classifier and species tallies take a few seconds to run, followed by output like this:

2016-08-16 13:01:47 [scrapy] INFO: Closing spider (finished)
2016-08-16 13:01:47 [scrapy] INFO: Stored json feed (549 items) in: Chaetodontidae.json
2016-08-16 13:01:47 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 503515,
 'downloader/request_count': 1616,
...a bunch of other log stats...
 'scheduler/enqueued/memory': 517,
 'start_time': datetime.datetime(2016, 8, 16, 17, 55, 29, 882400)}
2016-08-16 13:01:47 [scrapy] INFO: Spider closed (finished)
Loading required package: methods
2 species missing.
List of missing species is saved in /Users/user/Desktop/Chaetodontidae/Chaetodontidae_missingPics.csv.

127 species have at least one image.
Of these species, 2 species have only images rejected by the image classifier.
A list of these species is stored in /Users/user/Desktop/Chaetodontidae/Chaetodontidae_failOnly.csv.

Corresponding species and image names are saved in /Users/user/Desktop/Chaetodontidae/Chaetodontidae_speciesURLs.csv.
Finished. Display results? (y/n)

Entering 'y' will display all the images in the 'Pass' folder.
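
If you'd rather browse the 'Pass' folder yourself, a loop like this minimal OpenCV sketch works; the folder path is a placeholder:

import os
import cv2

# Placeholder path; point this at your own 'Pass' folder.
pass_dir = '/Users/user/Desktop/Chaetodontidae/Pass'

for name in sorted(os.listdir(pass_dir)):
    img = cv2.imread(os.path.join(pass_dir, name))
    if img is None:
        continue  # skip anything that isn't a readable image
    cv2.imshow(name, img)
    cv2.waitKey(0)           # press any key to advance
    cv2.destroyWindow(name)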

You can also call different pieces of the pipeline separately, for example if you want to scrape images but not sort them, or sort images in a different folder. This is most easily done by copying the commands for those steps from the pipeline scripts and substituting your own arguments.

Requirements

  • Python with Scrapy (http://scrapy.org/), OpenCV (http://opencv.org/), and PIL (see Troubleshooting below)
  • R with the rfishbase package (https://github.com/ropensci/rfishbase)

Updates

  • 0.0.2 (Dec. 29, 2016): OpenCV3 compatibility
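
The OpenCV 2-to-3 changes are mostly small API differences. A common compatibility pattern, shown here purely as an illustration rather than as what spiderfish does internally, is to unpack cv2.findContours by return length, since OpenCV 3 added an extra return value:

import cv2

def find_contours_compat(binary_img):
    # OpenCV 3 returns (image, contours, hierarchy) from cv2.findContours,
    # while OpenCV 2 (and 4) return (contours, hierarchy); unpack by length.
    result = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                              cv2.CHAIN_APPROX_SIMPLE)
    return result[0] if len(result) == 2 else result[1]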

Troubleshooting

  • If you get a 'No module named PIL' error, you may need to run $ pip install image
