# Web scraping using Scrapy

In this notebook, we use Scrapy to collect thread headlines from the misc discussions forum on BlackHatWorld (https://www.blackhatworld.com/).
*Adapted from Jitse-Jan van Waterschoot's tutorial (https://www.jitsejan.com/using-scrapy-in-jupyter-notebook.html).*

## Pre-requisites
First, we set up the environment:

In [1]:
# Settings for notebook
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Import Scrapy & co.
import logging
import json
import pandas as pd
import scrapy
from scrapy.crawler import CrawlerProcess

## Set up a pipeline
Next, we create a simple pipeline that writes each scraped item as one line of a JSON Lines (.jl) file:

In [2]:
class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('blackhatworld_misc_discussions.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
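
To see what the pipeline produces without starting a crawl, its three hooks can be called by hand. The sketch below is an illustration only: the class name DemoJsonWriterPipeline, its path argument, and the sample items are hypothetical. Scrapy normally calls these methods itself and passes the running spider, which is replaced by None here.

```python
import json
import os
import tempfile


class DemoJsonWriterPipeline:
    """Stand-alone copy of the pipeline logic; the path argument exists only for this demo."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item: write it as one JSON document per line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item


# Hypothetical items in the same shape as the spider's output
demo_items = [
    {'title': 'First thread', 'author': 'alice', 'date': 'Jan 1, 2019', 'votes': '3'},
    {'title': 'Second thread', 'author': 'bob', 'date': 'Jan 2, 2019', 'votes': None},
]

demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jl')
pipeline = DemoJsonWriterPipeline(demo_path)
pipeline.open_spider(spider=None)
for item in demo_items:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

# Read the file back: each line parses independently as JSON
with open(demo_path) as f:
    written = [json.loads(line) for line in f]
print(written)
```

Because every line is a complete JSON document, the file can be appended to and read line by line, which is what pandas does later with lines=True.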

## Define the spider

In the ForumSpider class, we define which URLs to crawl: the first page of the forum plus pages 2 to 19. We set the crawler's log level to WARNING; otherwise the notebook is flooded with DEBUG messages. Finally, we define how the retrieved data is processed: for each thread, we extract the title, author, date and upvotes. The CSS selectors are determined from the page source. For help finding unique CSS selectors or XPath expressions, try Andrew Cantino's SelectorGadget: https://selectorgadget.com/.

In [3]:
class ForumSpider(scrapy.Spider):
    name = "blackhatworld_misc_discussions"
    start_urls = ['https://www.blackhatworld.com/forums/misc.18/'] + \
        ['https://www.blackhatworld.com/forums/misc.18/page-' + str(i) for i in range(2,20)]
        
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Register the pipeline defined above
    }
        
    def parse(self, response):
        # Each matched element is one thread row in the forum listing
        for thread in response.css('.visible .titleText'):
            yield {
                'title': thread.css('.PreviewTooltip::text').extract_first(),
                'author': thread.css('.username::text').extract_first(),
                'date': thread.css('.DateTime::text').extract_first(),
                'votes': thread.css('strong::text').extract_first(),
            }
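
The start_urls expression can be checked on its own. As a quick sketch, the snippet below rebuilds the list outside the spider (the name base_url is introduced only for this example) to confirm that range(2, 20) stops at page 19, so 19 listing pages are requested in total.

```python
base_url = 'https://www.blackhatworld.com/forums/misc.18/'

# First page has no suffix; later pages are addressed as page-2, page-3, ...
start_urls = [base_url] + [base_url + 'page-' + str(i) for i in range(2, 20)]

print(len(start_urls))   # number of listing pages requested
print(start_urls[1])     # first paginated URL
print(start_urls[-1])    # last paginated URL
```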

## Start the crawler
We run the crawler, setting a custom user agent:

In [4]:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(ForumSpider)
process.start()

2019-02-05 11:31:59 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2019-02-05 11:31:59 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 17.5.0, Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 18.0.0 (OpenSSL 1.1.1  11 Sep 2018), cryptography 2.4.2, Platform Darwin-18.2.0-x86_64-i386-64bit
2019-02-05 11:31:59 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 30, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}


<Deferred at 0x117064048>

## Check the files
We check that the output file has been created and look at the last few raw lines:

In [5]:
ll blackhatworld_misc_discussions.jl

-rw-r--r--  1 maurice  staff  42136 Feb  5 11:32 blackhatworld_misc_discussions.jl


In [6]:
!tail -n 3 blackhatworld_misc_discussions.jl

{"title": "Forum Bot - Auto-Captcha Forum Profile Bot", "author": "andreyg13", "date": "Oct 9, 2011", "votes": "7"}
{"title": "[$100 OFF] Keyword Scout - The Finest Keyword Research Tool - Amazon Scraper, Addons &More", "author": "macdonjo3", "date": "Oct 4, 2011", "votes": "11"}
{"title": "The Most Advanced WP Plugin For Getting Masses of Free Viral Traffic To Your Website", "author": "shezboy", "date": "Oct 3, 2011", "votes": "11"}


## Create dataframes
We read the results into a pandas DataFrame. To see the posts with the most upvotes in the retrieved data, we sort the frame by votes in descending order and display its first ten rows:

In [7]:
df = pd.read_json('blackhatworld_misc_discussions.jl', lines=True)
df["votes"] = df["votes"].fillna(0)  # treat missing vote counts as zero
df = df.sort_values(by=['votes'], ascending = False)
df.head(10).style.set_properties(**{'text-align': 'left'})

|  | author | date | title | votes |
| --- | --- | --- | --- | --- |
| 36 | secondeye | 2012-07-05 00:00:00 | PayPal Solutions - Send Receive & Withdrawal - Remove Limit from PayPal Easily | 200 |
| 27 | nuaru | 2011-05-02 00:00:00 | ZennoPoster 5 - Automate any task in the Internet | 94 |
| 121 | NatashaNixon | 2014-03-19 00:00:00 | FLOOD Your Site With Targeted TRAFFIC. FREE Reviews! | 88 |
| 108 | Typlo | 2011-08-15 00:00:00 | High Volumes of WEBSITE TRAFFIC for 40+ COUNTRIES Over 200K A DAY $1 per 1K VISITORS! | 77 |
| 38 | namhq89 | 2015-08-29 00:00:00 | [MONEY BACK GUARANTEE] Let Me Find You the RIGHT Keywords to DOMINATE your niche | 46 |
| 101 | MosesW | 2014-04-28 00:00:00 | PandaBot.Net Free SEO Software for Websites, YouTube Videos and Social Media | 46 |
| 286 | andreyg13 | 2011-10-17 00:00:00 | CAPTCHA SNIPER Your Auto Captcha Solving Software! | 45 |
| 125 | globolsales | 2009-02-17 00:00:00 | OUTSOURCE COMPANY - GET YOUR FULL TIME STAFF for 300 USD/month salary only | 44 |
| 95 | thetrustedzone | 2016-10-04 00:00:00 | Cheap Web Traffic - $4.99 for 100,000 visitors | 36 |
| 284 | gimme4free | 2012-03-16 00:00:00 | My Personal Pinterest.com Bot Collection | 29 |
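
The raw file stores vote counts as strings (for example "votes": "7"), and a thread for which the selector finds no vote count is written with a null value. The sketch below makes the numeric conversion explicit with pd.to_numeric rather than relying on type inference in read_json. The records are hypothetical stand-ins in the same shape as the scraped output, read from an in-memory string instead of the real file.

```python
import io

import pandas as pd

# Hypothetical JSON Lines content; the last record has no vote count
sample_jl = "\n".join([
    '{"title": "Thread A", "author": "alice", "date": "Oct 9, 2011", "votes": "7"}',
    '{"title": "Thread B", "author": "bob", "date": "Oct 4, 2011", "votes": "11"}',
    '{"title": "Thread C", "author": "carol", "date": "Oct 3, 2011", "votes": null}',
])

sample_df = pd.read_json(io.StringIO(sample_jl), lines=True)

# Convert votes to integers: unparseable or missing values become NaN, then 0
sample_df["votes"] = (
    pd.to_numeric(sample_df["votes"], errors="coerce").fillna(0).astype(int)
)

# Most upvoted threads first
sample_df = sample_df.sort_values(by="votes", ascending=False)
print(sample_df[["title", "votes"]])
```

Converting explicitly guarantees that the sort is numeric; sorting the raw strings would place "7" above "11".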


Let's see what the BlackHat community has to tell us about web scraping :)

In [8]:
df[df['title'].str.contains("scraping", case = False, na=False)].style.set_properties(**{'text-align': 'left'})

|  | author | date | title | votes |
| --- | --- | --- | --- | --- |
| 24 | outscrape | 2018-04-08 00:00:00 | MAKE MONEY WITH WEB SCRAPING (Even If You've Never Tried) - Web Scraping Secrets Exposed - 150 pages | 3 |
| 217 | Alex D. | 2018-04-13 00:00:00 | Web Scraping Service – High Quality Data / Cheap Prices | 3 |
| 118 | sendlerad | 2018-10-17 00:00:00 | ⚛️⚡Email Databases, Bulk Email Marketing Solutions, Data Cleaning, Scraping & MORE⭐✅ | 0 |
| 189 | proxygo | 2011-09-04 00:00:00 | ScrapeBox Proxies - Yahoo/Google Scraping | 0 |
| 188 | proxygo | 2013-04-13 00:00:00 | Gscraper Scraping Proxies 3k Per Day | 0 |
