flexes-feed


Generic structure for retrieving and processing regularly updated data from the web

Scraper Usage

To create a new scraper, create a class that inherits from the Scraper class and override the check() method.

Here is a quick pseudocode example:

import requests

from flexes_feed.scraper import NewFile, Scraper

class MyScraper(Scraper):
    def check(self):
        # Fetch the page that announces new data
        response = requests.get(self.channel)
        # Parse the content from the page to locate the data file
        # (placeholder URL; a real scraper extracts this from the response)
        file_url = 'http://somedata.com/data/latest.csv'
        # If the file has changed, return a list of NewFile objects
        return [NewFile(file_url, self.s3_folder)]

def run_scraper():
    s3_folder = 's3://bucket/path/to/store/data'
    channel = 'http://somedata.com'
    scraper = MyScraper(s3_folder, channel)
    scraper.run()

if __name__ == '__main__':
    run_scraper()

See examples/noaa_wind_scraper.py for a real example. That example additionally requires BeautifulSoup4 and lxml; to install them, run pip install beautifulsoup4 lxml.
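
As a slightly fuller sketch, a check() that scans an index page for data links might use BeautifulSoup as below. This is illustrative, not the library's API: the .csv filter and the choice to return every discovered link as a NewFile are assumptions.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

from flexes_feed.scraper import NewFile, Scraper

class IndexScraper(Scraper):
    def check(self):
        # Fetch the index page that lists the data files
        response = requests.get(self.channel)
        soup = BeautifulSoup(response.text, 'lxml')
        # Collect every link that looks like a data file
        # (the .csv filter is an assumption for this sketch)
        urls = [urljoin(self.channel, a['href'])
                for a in soup.find_all('a', href=True)
                if a['href'].endswith('.csv')]
        # Hand each candidate back as a NewFile to be stored in S3
        return [NewFile(url, self.s3_folder) for url in urls]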

Subscriber Usage

To create a new subscriber, create a class that inherits from the Subscriber class and override the process() method.

Here is a quick pseudocode example:

from flexes_feed.subscriber import Subscriber

class MySubscriber(Subscriber):
    def process(self, s3_uri):
        # Process the file(s) at s3_uri (placeholder logic)
        print(f'New data available at {s3_uri}')

def subscribe():
    channel = 'http://somedata.com'
    sub = MySubscriber(channel)
    sub.subscribe()

if __name__ == '__main__':
    subscribe()

See examples/noaa_wind_subscriber.py for a real example using the lanlytics API.
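
What process() does is entirely up to the subscriber; a common first step is pulling the object down from S3. Below is a minimal sketch using boto3. The bucket/key parsing and the assumption that s3_uri points at a single object are illustrative, not flexes_feed behavior.

import os
from urllib.parse import urlparse

import boto3

from flexes_feed.subscriber import Subscriber

class DownloadSubscriber(Subscriber):
    def process(self, s3_uri):
        # Split s3://bucket/key/to/file into bucket and key
        parsed = urlparse(s3_uri)
        bucket, key = parsed.netloc, parsed.path.lstrip('/')
        local_path = os.path.basename(key)
        # Download the object and work on it locally
        boto3.client('s3').download_file(bucket, key, local_path)
        print(f'Downloaded {s3_uri} to {local_path}')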
