
gismeteo-news-scraping

Scrapinghub project for scraping news from https://www.gismeteo.ua/news/

Usage

The project is written in Python 3 (tested on 3.5).

It has only one spider, gismeteo. It collects article URLs that have not been scraped yet, then follows them and writes the scraped articles to a Google Drive Sheet.

Google API key

Follow the steps from this page to generate a .json key, and save it to the scrapy_climate folder under the name client-secret.json.
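
For reference, here is a minimal sketch of how such a key is typically used to authorize Google Sheets access with gspread and oauth2client. The exact libraries and scopes used by this project are assumptions; check the project code:

from oauth2client.service_account import ServiceAccountCredentials
import gspread

# Scopes for Google Sheets and Drive access (assumed; adjust to your setup).
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name(
    'scrapy_climate/client-secret.json', scope)
client = gspread.authorize(credentials)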

Deploying

Deploying to Scrapy Cloud via the GitHub integration will not work because it does not support Python 3, so you need to use the shub command-line tool.

Pass your project ID to each shub deploy call, or set it in scrapinghub.yml.
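
For example, assuming your Scrapy Cloud project ID is 12345 (a placeholder):

shub login
shub deploy 12345

Or set it once in scrapinghub.yml so a bare shub deploy works:

projects:
  default: 12345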

Running

To run a spider you need to create options.json in the scrapy_climate folder. Use JSON format to define the variables like this:

{
  "SCRAPY_CLOUD_API_KEY": "<scrapy_cloud_api_key>",
  "SCRAPY_CLOUD_PROJECT_ID": "<scrapy_cloud_project_id>",
  "SPREADSHEET_TITLE": "<google_drive_spreadsheet_title>",
  "SPIDER_TO_WORKSHEET_DICTIONARY": {
    "gismeteo": 1
    "<spider>": <worksheet_id_starting_from_zero>
  }
}

This file is ignored by git, but will be deployed to ScrapingHub.
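
A minimal sketch of how these options can be read at runtime (the actual loading code in this project may differ):

import json
import os

# Load options.json from the scrapy_climate package folder.
_path = os.path.join(os.path.dirname(__file__), 'options.json')
with open(_path) as f:
    OPTIONS = json.load(f)

# Worksheet index (zero-based) assigned to a given spider name.
worksheet_id = OPTIONS['SPIDER_TO_WORKSHEET_DICTIONARY']['gismeteo']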

Storage

The pipeline passes items to a StorageMaster, which appends them to the Google Drive Sheet defined in the spider arguments. Each article becomes a row with its url, header, tags, and body in the worksheet assigned to the currently running spider in options.json (the worksheet must be created beforehand, or the spider will raise a RuntimeError). When all items have been added, the master finishes with a summary row containing the URL of the job on ScrapingHub, the CPU datetime, the number of scraped articles, and two ----- strings.
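
As a rough sketch, assuming the project uses gspread (the class and method names below are illustrative, not the actual implementation):

import datetime
import gspread

class StorageMaster:
    """Appends scraped items to the worksheet assigned to the spider."""

    def __init__(self, client, spreadsheet_title, worksheet_id):
        sheet = client.open(spreadsheet_title)
        # The worksheet must already exist; fail loudly if it does not.
        self.worksheet = sheet.get_worksheet(worksheet_id)
        if self.worksheet is None:
            raise RuntimeError('Worksheet #%d does not exist' % worksheet_id)
        self.count = 0

    def append(self, item):
        # One row per article: url, header, tags, body.
        self.worksheet.append_row(
            [item['url'], item['header'], item['tags'], item['body']])
        self.count += 1

    def close(self, job_url):
        # Summary row: job URL, CPU datetime, item count, two '-----' marks.
        self.worksheet.append_row(
            [job_url, str(datetime.datetime.now()), self.count,
             '-----', '-----'])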

How does it scrape only fresh articles?

When the spider scrapes the news page, it first fetches the list of indexes of articles scraped during the last week, using the Scrapy Cloud API. The spider then iterates over the article links and scrapes only those articles whose indexes are not in that list.
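
A sketch of this deduplication logic, assuming the scrapinghub Python client and an 'index' field on stored items (both are assumptions about the implementation):

import datetime
from scrapinghub import ScrapinghubClient

def fetch_recent_indexes(api_key, project_id):
    """Collect indexes of articles scraped during the last week."""
    client = ScrapinghubClient(api_key)
    project = client.get_project(project_id)
    week_ago = (datetime.datetime.now(datetime.timezone.utc)
                - datetime.timedelta(days=7))
    # Milliseconds since epoch, as expected by the jobs API.
    start_ts = int(week_ago.timestamp() * 1000)
    indexes = set()
    for summary in project.jobs.iter(spider='gismeteo', startts=start_ts):
        job = project.jobs.get(summary['key'])
        for item in job.items.iter():
            indexes.add(item['index'])
    return indexes

# In the spider, skip any article whose index is already known:
# if extract_index(link) not in indexes: yield a request for link.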

Inheriting

The scrapy_climate/spider.py module contains a TemplateSpider class that can be used as a parent for actually running spiders. To make a new spider, configure its selectors and define its name, domain, and relative path to the first page. More about it in the docstrings.
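
For instance, a new spider might look like this (attribute names other than name are illustrative assumptions; check the TemplateSpider docstrings for the real ones):

from scrapy_climate.spider import TemplateSpider

class ExampleNewsSpider(TemplateSpider):
    # Spider name used by Scrapy and in SPIDER_TO_WORKSHEET_DICTIONARY.
    name = 'example'
    # Site to crawl and the relative path to its first news page.
    domain = 'example.com'
    path = 'news/'

    # Selectors for the article parts (names here are hypothetical).
    _css_selector_article = 'div.article-item'
    _css_selector_title = 'h1::text'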
