Yellow Page Crawler

This crawler collects a list of organizations in Madagascar, along with their activity, address, and contact details, from Yellow Page Africa.
I made this script to help a friend of mine who recently set up his own startup running business call campaigns.

WARNING: Web scraping can put unwanted load on a website and may disrupt its services. I highly recommend first reading and agreeing to the website's Terms of Service, and experimenting with this script at your own risk.

Installation

This is a Python Scrapy web crawler that uses Splash to load and render the JavaScript embedded in pages. Splash runs in a Docker container and acts as middleware, serving the fully rendered pages (including the content produced by the JavaScript embedded within the different components) back to the crawler.

# For conda users
conda install -c conda-forge scrapy
# Or using pip
pip install Scrapy

# Install the scrapy-splash plugin using pip
pip install scrapy-splash
# Pull the Splash Docker image
sudo docker pull scrapinghub/splash
# Run the container
sudo docker run -p 8050:8050 scrapinghub/splash
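
Once the container is up, you can quickly verify that Splash renders pages before wiring it into Scrapy. The snippet below is just a quick sanity check of my own (not part of the repository); it hits Splash's render.html HTTP endpoint directly, and the target URL is only an example.

# quick check that the Splash container responds and renders JavaScript
# (assumes Splash is listening on localhost:8050; the target URL is just an example)
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://example.com', 'wait': 2},
    timeout=30,
)
resp.raise_for_status()
print(resp.text[:200])  # first characters of the fully rendered HTML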

To use Splash in the pga.py spider script, the following settings have to be present in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# For Windows & Mac, use the IP address instead of localhost 
SPLASH_URL = 'http://localhost:8050' 
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
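
With these settings in place, the spider issues SplashRequest objects instead of plain Scrapy requests. The snippet below is a minimal, illustrative sketch rather than the actual contents of pga.py; the start URL and CSS selector are placeholders.

import scrapy
from scrapy_splash import SplashRequest

class PgaSpider(scrapy.Spider):
    name = 'pga'

    def start_requests(self):
        # render the page through Splash and wait for the embedded JavaScript to run
        yield SplashRequest(
            'https://example.com/listings',  # placeholder - use the Yellow Page Africa listing URL
            callback=self.parse,
            args={'wait': 2},
        )

    def parse(self, response):
        # placeholder selector - the real spider extracts activity, address and contact details
        for org in response.css('div.listing'):
            yield {'name': org.css('::text').get()}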

(Optional) Depending on the scenario, basic access restrictions can be bypassed using the following packages and their settings. They can be used together or separately, since they already use distinct priority values, but make sure to add the corresponding entries to the DOWNLOADER_MIDDLEWARES section (see the combined sketch after this list).

  • The simplest way to make requests look legitimate to the website is to change the User-Agent. This can be done statically in the settings.py file by setting this parameter:
USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
  • scrapy-user-agents can be used to automatically pick a random User-Agent from a pool of pre-defined user agents. Install it via pip and add the following entries to the DOWNLOADER_MIDDLEWARES section:
pip install scrapy-user-agents
#....
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
#....
  • scrapy-proxy-pool can also be used to skirt firewall rules by routing requests through a pool of pre-defined random proxies. Install it via pip and add the following entries to the DOWNLOADER_MIDDLEWARES section:
pip install scrapy-proxy-pool
#....
'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
#....
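
Putting the optional pieces together, the resulting DOWNLOADER_MIDDLEWARES block in settings.py could look roughly like this. This is a sketch that assumes both optional packages are installed and keeps the priority values suggested above.

# settings.py - combined sketch, assuming scrapy-user-agents and scrapy-proxy-pool are installed
DOWNLOADER_MIDDLEWARES = {
    # Splash (required)
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # random User-Agent (optional)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    # proxy pool (optional)
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}
# scrapy-proxy-pool also expects this flag to be enabled
PROXY_POOL_ENABLED = True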

Usage

Before running the command below, make sure the Docker container is running. Go to the project directory and run the following command. The output will be written to a CSV file, yellow-page-data.csv. Take a look at the scraped data sample of 4K+ organizations available here

scrapy crawl pga -o yellow-page-data.csv
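
If you want a quick look at the output programmatically rather than in a spreadsheet, a short script like the one below works. The column names are an assumption based on the fields described above; check the actual header of yellow-page-data.csv.

# quick inspection of the crawled CSV
# (field names such as name/activity/address/contact are assumed - check the file header)
import csv

with open('yellow-page-data.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(f'{len(rows)} organizations crawled')
print(rows[0])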

Experiment on your own and generate your own spider with the following command. For more details, see the documentation

scrapy genspider yourspider yourdomain.com
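
For reference, the command above creates a spider skeleton under the project's spiders/ directory roughly like the one below, which you can then adapt to use SplashRequest as shown earlier.

# spiders/yourspider.py - roughly what the genspider command generates
import scrapy

class YourspiderSpider(scrapy.Spider):
    name = 'yourspider'
    allowed_domains = ['yourdomain.com']
    start_urls = ['http://yourdomain.com/']

    def parse(self, response):
        pass  # add your extraction logic here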

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.