Various programs to extract information from the web
Python
.gitattributes
.gitignore
README.md
Scrap_97TingsProgrammer.py
Scrap_GoogleImg.py
Scrap_Imgur.py
Scrap_Moodle.py
Scrap_Nordelec.py
Scrap_PolyEduportefolio.py
Scrap_PrenomMasc.py
Scrap_RSS_titles.py
Scrap_Reddit.py
Scrap_Tumblr.py
Scrap_WhoIS.py
scraptools.py
statedConnection.py

WebScraping Tools

This repo contains examples of web scraping. Dependencies are lxml for DOM traversal and feedparser for RSS.

There is also a utility class, "scraptools", that groups common scraping operations:

  • getDOM : Returns the DOM element of the page at the given URL
  • getElementsFromHTML : Returns the list of lxml elements in an HTML source that match a cssSelector
  • getElementsFromUrl : Returns the list of lxml elements in the page fetched from a URL that match a cssSelector
  • getUrlContent : Gets the content of a URL as a string
  • downloadResource : Downloads the content of a URL to disk
  • saveResource : Saves data to a file in binary write mode
  • urlIterator : Successively yields page URLs for as long as the cssSelector finds a next one
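
As a rough illustration of how helpers like these can fit together, here is a minimal sketch using only the Python standard library. It is not the actual scraptools code: the snake_case names, the signatures and the `find_next` callable are assumptions made for this example. In the repo the next page is located with a CSS selector over an lxml DOM; here a plain callable stands in for that lookup so the sketch runs without extra dependencies.

```python
import urllib.request
from typing import Callable, Iterator, Optional


def get_url_content(url: str, encoding: str = "utf-8") -> str:
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def save_resource(data: bytes, path: str) -> None:
    """Write raw bytes to a file opened in binary write mode."""
    with open(path, "wb") as handle:
        handle.write(data)


def download_resource(url: str, path: str) -> None:
    """Fetch a URL and store the raw response body at the given path."""
    with urllib.request.urlopen(url) as response:
        save_resource(response.read(), path)


def url_iterator(
    start_url: str,
    find_next: Callable[[str], Optional[str]],
) -> Iterator[str]:
    """Yield start_url, then every following page URL.

    find_next is an assumed stand-in for "fetch the page and look up the
    next link": it receives the current URL and returns the next one, or
    None when there is no next page. Already visited URLs end the loop so
    that a cycle of links cannot run forever.
    """
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen:
        seen.add(url)
        yield url
        url = find_next(url)


if __name__ == "__main__":
    # In-memory "site": each page maps to the page its next link points to.
    links = {"page1": "page2", "page2": "page3"}
    print(list(url_iterator("page1", links.get)))
    # prints ['page1', 'page2', 'page3']
```

Writing the iterator as a generator lets a caller stop paging early, for example after the first few pages, without fetching the rest.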

Examples:

  • Scrap_97ThingsProgrammer : Aggregates 97 good programming practices and generates a printer-friendly HTML page
  • Scrap_Eduportefolio : Get names of students attending Polytechnique Montreal
  • Scrap_GoogleImg : Download top images for a search on Google Images
  • Scrap_Imgur : Download individual images or a whole gallery
  • Scrap_Moodle : Recursively downloads all the files from the course pages on Moodle
  • Scrap_Nordelec : Get information about the companies inside this building
  • Scrap_PrenomMasc : Get first names from a website
  • Scrap_Reddit : Parses posts from a subreddit
  • Scrap_RSS_titles : Get the article titles of RSS feeds. Useful for a quick glance at the news from the console ;)
  • Scrap_Tumblr : Gets pictures based on their tags