Various programs to extract information from the web
Python
.gitattributes Initial Commit
.gitignore Add scrap whois website
README.md added comments for scrap_97things
Scrap_97TingsProgrammer.py added comments for scrap_97things
Scrap_GoogleImg.py added if main tests and comments
Scrap_Imgur.py Added Scrap_Moodle.py
Scrap_Moodle.py Fix selector to new moodle layout
Scrap_Nordelec.py added if main tests and comments
Scrap_PolyEduportefolio.py added if main tests and comments
Scrap_PrenomMasc.py added if main tests and comments
Scrap_RSS_titles.py added if main tests and comments
Scrap_Reddit.py Add general method to scrapImgur
Scrap_Tumblr.py Added Scrap_Moodle.py
Scrap_WhoIS.py Add scrap whois website
scraptools.py Add scrap whois website
statedConnection.py

README.md

WebScraping Tools

This repo contains examples of web scraping. Dependencies are lxml for DOM traversal and feedparser for RSS.

The utility class "scraptools" groups common scraping operations:

  • getDOM : Returns the DOM element of the page at the given URL
  • getElementsFromHTML : Returns a list of lxml elements from an HTML source that match a cssSelector
  • getElementsFromUrl : Returns a list of lxml elements from the page fetched at a URL that match a cssSelector
  • getUrlContent : Gets the content of a URL as a string
  • downloadResource : Downloads the content of a URL to disk
  • saveResource : Saves data to a file in binary write mode
  • urlIterator : Successively yields page URLs as long as the cssSelector finds a next one
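
As a rough illustration of the fetch, save, and pagination helpers listed above, here is a minimal standard-library sketch. It is not the code from scraptools.py: the snake_case names, the signatures, and the `find_next` callable are assumptions made for this example. In particular, `find_next` stands in for the cssSelector lookup that the real urlIterator performs, so the sketch does not depend on lxml.

```python
import urllib.request
from typing import Callable, Iterator, Optional


def get_url_content(url: str, encoding: str = "utf-8") -> str:
    """Fetch a URL and return its body decoded as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")


def save_resource(data: bytes, path: str) -> None:
    """Write raw bytes to disk in binary write mode."""
    with open(path, "wb") as handle:
        handle.write(data)


def download_resource(url: str, path: str) -> None:
    """Fetch a URL and store its raw body at the given path."""
    with urllib.request.urlopen(url) as response:
        save_resource(response.read(), path)


def url_iterator(start_url: str,
                 find_next: Callable[[str], Optional[str]]) -> Iterator[str]:
    """Yield page URLs, starting at start_url, until no next page is found.

    find_next is a hypothetical stand-in for the CSS-selector lookup:
    given the current page URL it returns the next page URL, or None.
    URLs already visited end the iteration, which guards against loops.
    """
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen:
        seen.add(url)
        yield url
        url = find_next(url)
```

In a real scraper, `find_next` would fetch the page, apply the selector for the "next" link, and return its `href`. Keeping that step as a parameter lets the pagination loop be tested without network access.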

Examples:

  • Scrap_97ThingsProgrammer : Aggregates 97 good programming practices and generates a printer-friendly HTML page
  • Scrap_Eduportefolio : Get names of students attending Polytechnique Montreal
  • Scrap_GoogleImg : Download top images for a search on Google Images
  • Scrap_Imgur : Download individual images or a whole gallery
  • Scrap_Moodle : Recursively downloads all the files from the course pages on Moodle
  • Scrap_Nordelec : Get information about the companies inside the Nordelec building
  • Scrap_PrenomMasc : Get first names from a website
  • Scrap_Reddit : Parses posts from a subreddit
  • Scrap_RSS_titles : Get the article titles of RSS feeds. Useful for a quick glance at the news from the console ;)
  • Scrap_Tumblr : Gets pictures based on their tags