Various programs to extract information from the web
Python
.gitattributes
.gitignore
README.md
Scrap_97TingsProgrammer.py
Scrap_GoogleImg.py
Scrap_Imgur.py
Scrap_Moodle.py
Scrap_Nordelec.py
Scrap_PolyEduportefolio.py
Scrap_PrenomMasc.py
Scrap_RSS_titles.py
Scrap_Reddit.py
Scrap_Tumblr.py
Scrap_WhoIS.py
scraptools.py
statedConnection.py

README.md

WebScraping Tools

This repo contains examples of web scraping. Dependencies are lxml for DOM traversal and feedparser for RSS.

There is also a utility class "scraptools" to group common scraping operations:

  • getDOM : Returns the DOM element of the page for the given url
  • getElementsFromHTML : Returns a list of lxml elements from html source corresponding to a cssSelector
  • getElementsFromUrl : Returns a list of lxml elements from the page fetched at url corresponding to a cssSelector
  • getUrlContent : Gets the content of a url as a string
  • downloadResource : Downloads the content of a url to disk
  • saveResource : Saves data to file in binary write mode
  • urlIterator : Successively yields page urls while there is a next one found by the cssSelector
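As an illustration of the pagination idea behind urlIterator, here is a minimal sketch. It is not the code from scraptools.py: the names `url_iterator`, `_NextLinkFinder`, `fetch` and `PAGES` are hypothetical, and it matches the "next" link by class name with the standard library's `html.parser` instead of an lxml cssSelector, so that it runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _NextLinkFinder(HTMLParser):
    """Records the href of the first <a> tag carrying the given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes:
            self.href = attributes.get("href")


def url_iterator(start_url, next_class, fetch):
    """Yield page urls, following the 'next' link until none is found.

    `fetch` is any callable that maps a url to its html source (assumption:
    in real use it would perform an HTTP request).
    """
    url = start_url
    seen = set()  # guards against pages that link back to each other
    while url and url not in seen:
        seen.add(url)
        yield url
        finder = _NextLinkFinder(next_class)
        finder.feed(fetch(url))
        # Resolve relative links against the current page url.
        url = urljoin(url, finder.href) if finder.href else None


# Hypothetical in-memory site standing in for real HTTP requests.
PAGES = {
    "http://example.com/p1": '<a class="next" href="/p2">Next</a>',
    "http://example.com/p2": '<a class="nav next" href="p3">Next</a>',
    "http://example.com/p3": "<p>Last page</p>",
}

urls = list(url_iterator("http://example.com/p1", "next", PAGES.__getitem__))
print(urls)
```

Passing `fetch` as a parameter keeps the iteration logic separate from the network code, which makes the generator easy to try against canned pages like the ones above.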

Examples:

  • Scrap_97ThingsProgrammer : Aggregates 97 good programming practices and generates a printer-friendly html page
  • Scrap_Eduportefolio : Gets the names of students attending Polytechnique Montreal
  • Scrap_GoogleImg : Downloads the top images for a search on Google Images
  • Scrap_Imgur : Downloads individual images or a whole gallery
  • Scrap_Moodle : Recursively downloads all the files from the course pages on Moodle
  • Scrap_Nordelec : Gets information about the companies inside this building
  • Scrap_PrenomMasc : Gets first names from a website
  • Scrap_Reddit : Parses posts from a subreddit
  • Scrap_RSS_titles : Gets the article titles of rss feeds. Useful for a quick glance at the news from the console ;)
  • Scrap_Tumblr : Gets pictures based on their tags
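To give a feel for what a title-listing script such as Scrap_RSS_titles does, here is a rough sketch. The repo relies on feedparser for RSS; this version swaps in the standard library's `xml.etree.ElementTree` and parses an inline sample feed so it can run offline. The function name `rss_titles` and the `SAMPLE_FEED` data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a feed fetched from the web.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First article</title><link>http://example.com/1</link></item>
    <item><title>Second article</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def rss_titles(xml_text):
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    # Only <item> elements are visited, so the channel title is skipped.
    return [item.findtext("title", default="").strip() for item in root.iter("item")]


titles = rss_titles(SAMPLE_FEED)
for title in titles:
    print(title)
```

Real feeds vary (Atom, namespaces, malformed XML), which is the kind of variation a dedicated library like feedparser is designed to absorb.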