This project is part of the Ruby module of the Microverse curriculum.
NewsScraper is an application that gathers content from news websites, shows the results organized in sections of interest, and supports text search.
It extracts the titles and descriptions of news articles by targeting particular mark-up elements. To scrape a website, a configuration file is required, which is added to the source modules list. The configuration file uses the format described in the Configuration and Expansion section of this document.
The current version is configured for two newspaper sites: Newsweek and The New York Times.
To scrape additional sites, create a new configuration module and add it to the source modules list.
- About the Project
- Application Instructions
- Configuration and Expansion
- Development
- Testing
- Built With
- Live Version
- Acknowledgements
- License
The project consists of the following files:
- The 'bin' folder
  - news_scraper
    The executable file that controls the program logic and the interface. It uses a loop that tracks the status of the program and shows the options available in each status.
- The 'lib' folder
  - source.rb
    This class controls gathering content from a particular website.
  - section.rb
    This class controls gathering content from a particular section.
  - article.rb
    This class controls gathering content from a particular article.
  - utils.rb
    This file contains helper methods for parsing text and formatting the interface.
  - string.rb
    This module provides extensions to the String class.
- The 'resources' folder
  - newsweek.rb
    This is a configuration file for the Newsweek website.
  - nytime.rb
    This is a configuration file for the New York Times website.
The program starts by listing the available sources.
To select a source, type its number; the list of configured sections will appear.
To select a section, type its number; the headers of the articles in that section will appear.
To select an article, type its number; the header and description of that article will appear.
Pressing the return key repeatedly returns to the initial screen with the list of sources.
Entering 's' lets you provide text to search for, either in all sources or in the currently selected source.
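The navigation described above can be pictured as a small state machine. The following is only a sketch under assumptions: the status names, the `next_status` method, and the rule that each press of the return key steps back exactly one level are hypothetical and may not match how news_scraper tracks its status internally.

```ruby
# Hypothetical status names; the real news_scraper may use different ones.
FORWARD = {
  sources: :sections,   # picking a source shows its sections
  sections: :articles,  # picking a section shows its article headers
  articles: :article    # picking an article shows its header and description
}.freeze

# Assumption: each press of the return key steps back one level.
BACK = {
  article: :articles,
  articles: :sections,
  sections: :sources,
  sources: :sources,
  search: :sources
}.freeze

# Returns the next status for the current status and one line of user input.
def next_status(status, input)
  input = input.strip
  return BACK.fetch(status) if input.empty?   # return key
  return :search if input.casecmp?('s')       # 's' starts a search
  return FORWARD.fetch(status, status) if input.match?(/\A\d+\z/)

  status # unrecognised input leaves the status unchanged
end
```

Keeping the transitions in plain hashes makes it easy to see, for each status, which inputs are meaningful; the main loop would then only read a line, call `next_status`, and redraw the screen.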
Configuration modules are extensions of Hash instances.
To extend a hash with a particular module, call the extend method on the hash instance, passing the name of the module. Then call the module's setup method on the hash.
The setup method of the extension fills the hash with three key-value pairs:
- The :caption key, whose value is an arbitrary entry that identifies the source
- The :url key, whose value is the URL of the targeted website
- The :section_hashes key, whose value is an array of hashes, each with the following entries
  - :section_id or :section_class to target a section in the mark-up, using either the id of the element or its CSS class
  - :title_tag or :title_class to target the element containing the title of the section, using either the element's tag or its CSS class. If neither :section_id nor :section_class is given, a :title key with an arbitrary value is required. In that case, articles are scraped from the whole mark-up of the site.
  - :article_tag or :article_class to target the elements of the section containing article content, using either the element's tag or its CSS class
  - :article_header_tag or :article_header_class to target the header of the article, using either the element's tag or its CSS class
  - :article_desc_tag or :article_desc_class to target the description of the article, using either the element's tag or its CSS class
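As a rough illustration of the layout described above, a configuration module could look like the sketch below. The module name `ExampleNews`, the URL, and every selector value are hypothetical placeholders; they are not taken from the project's newsweek.rb or nytime.rb.

```ruby
# Hypothetical configuration module: names, URL, and selectors are
# illustrative assumptions only.
module ExampleNews
  # Fills the extended hash with the three expected key-value pairs.
  def setup
    self[:caption] = 'Example News'
    self[:url] = 'https://news.example.com'
    self[:section_hashes] = [
      {
        section_id: 'world',            # section located by element id
        title_tag: 'h2',                # section title located by tag
        article_tag: 'article',
        article_header_tag: 'h3',
        article_desc_class: 'summary'
      },
      {
        # No :section_id or :section_class, so an arbitrary :title is required
        # and articles are looked up in the whole page.
        title: 'Front Page',
        article_class: 'story',
        article_header_class: 'story-title',
        article_desc_tag: 'p'
      }
    ]
  end
end

config = {}
config.extend(ExampleNews) # adds #setup to this one hash instance only
config.setup
```

Because `extend` affects only the single hash it is called on, other Hash instances in the program are left untouched.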
To expand the program so that it can scrape news from another website, create a configuration module modelled on the two provided templates and make the following changes in the news_scraper file:
- Add an entry after the existing 'require_relative' entries pointing to the new configuration file
require_relative '<new_configuration_file>'
- Add code similar to
<new_hash> = {}
<new_hash>.extend(<new_class>)
<new_hash>.setup
- Create an instance of the Source class, passing the new hash, and add the instance to the sources array
@sources << Source.new(<new_hash>)
- Clone the project
https://github.com/ioanniskousis/NewsScraper.git
- Run the Application
To run the application from a terminal, change to the bin folder, type news_scraper, and press Enter.
Unit tests are in the spec/scraper_spec.rb file.
The tests depend on the live content of the websites, which may vary.
Please note that news websites change their content at a fast pace. If some tests fail for this reason, you may apply the following changes in the spec/scraper_spec.rb file.
* Change lines 66, 67 and 68 so that the constant strings refer to content that currently exists in the article headings
* Change lines 94, 95 and 96 so that the constant strings refer to content that currently exists in the article descriptions
This project was built using these technologies.
- Ruby
- Rubocop
- VsCode
- Git-Flow
- nokogiri gem
👤
- Github: @ioanniskousis
- Twitter: @ioanniskousis
- Linkedin: Ioannis Kousis
- E-mail: jgkousis@gmail.com
📝 This project is MIT licensed.