Viper Scraper

Set-Up

Before using any script, run pipenv shell to enter the virtual environment.

Using the Twitter scraper requires registering as a Twitter developer and providing authentication keys. Place your keys in either .my_keys (in .gitignore) or config/keys.json. See the Twitter Developer page.
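As a rough sketch of how a script might locate these keys, the helper below checks the two locations named above in order. It assumes both files hold a JSON object (the format of .my_keys is not stated here), and the name load_keys is hypothetical rather than part of the scraper.

```python
import json
from pathlib import Path

# Candidate key files named in the set-up notes, checked in order.
KEY_PATHS = (Path(".my_keys"), Path("config/keys.json"))


def load_keys(paths=KEY_PATHS):
    """Return the parsed contents of the first key file that exists.

    Assumption: each file contains a JSON object. The field names inside
    are whatever the scraper expects and are not shown here.
    """
    for path in paths:
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                return json.load(handle)
    raise FileNotFoundError(
        "No key file found; expected one of: "
        + ", ".join(str(p) for p in paths)
    )
```

Raising an error when neither file exists makes a missing-credentials problem visible before any request is sent.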

Scraping Twitter

viper_scraper.py twitter [-h] [-d Data Directory] [-t Tracking File] 
                         [-l Limit] [--photos_as_limit]

-d Data Directory : Directory to save results to

-t Tracking File : Path to a text file containing a list of phrases to track, one per line. See the Twitter page for filtering realtime tweets.

-l Limit : The approximate number of images to scrape if --photos_as_limit is set; otherwise the approximate number of tweets to scrape.

--photos_as_limit : If present, Limit refers to the number of images to scrape rather than number of tweets
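The options above can be illustrated with a minimal sketch of how a tracking file might be read and how the limit could be checked. Both function names are hypothetical, and the scraper's actual handling may differ (the limit is described as approximate).

```python
from pathlib import Path


def load_tracking_phrases(path):
    """Read one phrase per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def limit_reached(tweets_seen, photos_seen, limit, photos_as_limit):
    """Return True once the counter selected by --photos_as_limit hits the limit."""
    # With --photos_as_limit the image count is compared; otherwise the tweet count.
    count = photos_seen if photos_as_limit else tweets_seen
    return count >= limit
```

For instance, with a limit of 1000 and --photos_as_limit set, a run that has seen 5000 tweets but saved only 400 images would keep going.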

The Twitter scraper filters realtime tweets using the Twitter API. Text, metadata, and references to downloaded images are stored in data.csv under the specified directory.
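To make the output layout concrete, here is a small sketch of appending one record to data.csv under a data directory. The column names and the append_row helper are hypothetical, since the real data.csv schema is not documented in this section.

```python
import csv
from pathlib import Path

# Hypothetical column names; the real data.csv schema is not documented here.
FIELDNAMES = ["tweet_id", "text", "image_path"]


def append_row(data_dir, row):
    """Append one record to data.csv, writing the header on first use."""
    directory = Path(data_dir)
    directory.mkdir(parents=True, exist_ok=True)
    csv_path = directory / "data.csv"
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
    return csv_path
```

Storing a relative image path in each row, rather than the image itself, keeps the CSV small and lets the images live alongside it in the same directory.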

YOLO integration with Twitter

python viper_scraper.py yolo ...

The VIPER scraper also integrates You Only Look Once (YOLO) real-time object detection.

For each tweet that passes the filter, the scraper will:

Download the original image, if present.
Save a version of the image with bounding boxes and predictions labelled.
Save a .json file containing the confidences for each class.
Save text and metadata, along with references to these files, in data.csv under the specified directory.

viper_scraper.py yolo [-h] [-d Data Directory] [-t Tracking File]
                      [-l Limit] [--photos_as_limit] --names NAMES
                      --config CONFIG --weights WEIGHTS [-c CONFIDENCE]
                      [-th THRESHOLD]

In addition to the arguments shared by the basic Twitter scraper, YOLO integration takes these additional arguments:

--names NAMES : A file containing the names, one per line, associated with the weights and config file for YOLO, e.g. coco.names.

--config CONFIG : Config file for YOLO, e.g. yolov3.cfg.

--weights WEIGHTS : Weights file for YOLO, e.g. yolov3.weights.

-c CONFIDENCE : Minimum confidence to filter weak detections, default 0.5.

-th THRESHOLD : Threshold when applying non-maxima suppression, default 0.3.
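The roles of -c and -th can be sketched in plain Python. This is a generic confidence filter followed by greedy non-maxima suppression, not the scraper's own code; the function names and the (x, y, width, height) box format are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, confidence=0.5, threshold=0.3):
    """Return indices of boxes kept after confidence filtering and greedy NMS."""
    # Drop weak detections first (the role of -c).
    candidates = [i for i, s in enumerate(scores) if s >= confidence]
    # Visit the remaining boxes from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap an already kept box
        # by more than the threshold (the role of -th).
        if all(iou(boxes[i], boxes[k]) <= threshold for k in kept):
            kept.append(i)
    return kept
```

Raising -c discards more low-confidence boxes, while lowering -th suppresses overlapping boxes more aggressively, so fewer near-duplicate detections of the same object survive.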

For example, to use the pretrained YOLO model (coco.names, yolov3.cfg, and yolov3.weights) with plane_tracking.txt, download the files and run:

python viper_scraper.py yolo -d data_yolo_planes -t config/plane_tracking.txt -l 1000 --names yolo/coco.names --config yolo/yolov3.cfg --weights yolo/yolov3.weights -c .5 -th .3

Scraping Instagram

python viper_scraper.py instagram ...

This script and its associated utility scripts are based on Antonie Lin's non-API Instagram scraper, released under the MIT license. Visit his (now-archived) repository at:

https://github.com/iammrhelo/InstagramCrawler

This is a non-API Instagram scraper using Selenium. As such, it is liable to break as Instagram changes its site. I will try to maintain its integrity, but please feel free to contribute.

Scrape n images and associated captions from either a user or a hashtag.

Before use, run

bash utils/get_gecko.sh
bash utils/get_phantomjs.sh
source utils/set_path.sh

Usage

viper_scraper.py instagram [-h] [-d DIR_PREFIX] [-q QUERY] [-n NUMBER] [-c caption] [-l Headless] [-a AUTHENTICATION] [-f FIREFOX_PATH]

-d Directory Prefix : The directory to save data to.

-q QUERY : The target (user or hashtag) to crawl. Add '#' for hashtags

-n NUMBER : The number of posts to download.

-c Caption : Add this flag to download captions when downloading photos.

-l Headless : If set, uses the PhantomJS driver to run the script headless.

-a AUTHENTICATION : Path to authentication JSON file - necessary for headless.

-f FIREFOX_PATH : Path to the Firefox installation for Selenium.
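How the -q value distinguishes a hashtag from a user can be sketched as follows. The parse_query helper is hypothetical, and the URL patterns are an assumption about Instagram's public pages rather than something taken from the scraper's code.

```python
def parse_query(query):
    """Classify a -q value as a hashtag or a user and build a page URL."""
    query = query.strip()
    if query.startswith("#"):
        # A leading '#' marks a hashtag; the rest is the tag name.
        tag = query[1:]
        return "hashtag", tag, f"https://www.instagram.com/explore/tags/{tag}/"
    # Anything else is treated as a username.
    return "user", query, f"https://www.instagram.com/{query}/"
```

On the command line, a hashtag query should be quoted (for example "#art"), because most shells treat an unquoted # at the start of a word as the beginning of a comment.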

Examples:

python viper_scraper.py instagram -d data_insta_test -q "#art" -c -n 100

This scrapes the first 100 photos and captions from the art hashtag.
