Procedural reimplementation #1

Open: @globalcitizen wants to merge 3 commits

mostly, i wrote this because my g+ feed was filling up with updates
about this project and i wanted to see what the code looked like. then
i had an unquashable urge to rewrite it. i find OO PHP to be
irritatingly verbose, and it pains me that people use it so much.
in order to keep existing users happy, i also modified the existing
OO code to wrap this procedural codebase. this follows the SPOT
(single point of truth) rule, which also helps
maintainability going forward.

requires:

  • php's curl and json extensions enabled
  • write access to a cache directory (if using cache)

changes:

  • better inline and overall documentation
  • upgrade to byte and screen-efficient '#'-based comments ;)
  • cachefile is now based upon the actual google+ ID you grabbed data from, which fixes a bug in the original code and enables the caching of data from multiple google+ accounts
  • should now be cross platform in that windows-style directory separators are honoured.
  • removed expectation of user intelligence (re: remembering to include a trailing '/' on the cache dir, without which it will fail) ... ie: bug
  • slight improvement in codepath for cache (time calculation is avoided until certainly required)
  • now checks that the length of a supplied g+ id is sane before wasting time attempting to scrape a page (eg: if '0' supplied)
  • fixes possible security holes whereby input from the cache file is trusted. now validates and performs tag stripping through an internal function __gc_tr_fix() that is also used before storing to cache and returning non-cached values. (also provides safety against weird HTML tidbits when the scraper inevitably breaks)
  • removed acceptance of underscores in the number of circles
  • url is no longer cached to disk (cache size reduced)
  • added debugging function and messages
  • changed default cache dir to '.gc_cache' since someone out there is going to install this in a web-accessible dir, many http servers include or encourage disallowing dotfiles, and cache is collision-probable. (might fall back to gc_cache on windows - untested ;)
  • code now automatically attempts to create the cache dir if it doesn't exist
  • added some configuration options in the source
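
to make the cache-related items above concrete, here is a minimal sketch of the idea (per-ID cachefile, tolerant of a missing trailing separator, time only consulted once the file is known to exist). the function names example_cache_path() and example_cache_is_fresh() are made up for illustration and are not part of gc.php.

```php
<?php
# hypothetical helpers for illustration only; gc.php does this inline

# build a per-ID cachefile path; trailing '/' or '\' on the dir is optional
function example_cache_path($cache_dir, $googleplusid) {
    $dir = rtrim($cache_dir, '/\\');
    return $dir . DIRECTORY_SEPARATOR . $googleplusid;
}

# a cachefile is fresh if it was modified within the last $hours hours.
# the clock is only read once we know there is a readable file to compare.
function example_cache_is_fresh($cachefile, $hours, $now = null) {
    if (!is_file($cachefile) || !is_readable($cachefile)) { return false; }
    if ($now === null) { $now = time(); }
    return filemtime($cachefile) > ($now - $hours * 60 * 60);
}
```

because the path is derived from the ID, two different google+ accounts never share a cachefile.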

whinges:

  • 'count' is a really bad choice of identifier since it is far too generic and thus ambiguous. however, it has been left in place to preserve backwards compatibility.

procedural usage (OO usage remains unchanged):
require_once('gc.php');
$googleplusid = '123456789012345678901';
$info = google_plus_user_info($googleplusid);
print $info['name'] . ' has ' . $info['count'] . ' followers.';
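
callers who want to reject bad IDs before calling in can mirror the length check themselves. a rough sketch follows, assuming the 21-100 digit range noted in the gc.php comments; example_sane_gplus_id() is a hypothetical name, not something this PR adds.

```php
<?php
# hypothetical pre-check, mirroring the sanity test described for gc.php:
# strip non-digits, then require between 21 and 100 digits
function example_sane_gplus_id($id) {
    $digits = preg_replace('/[^0-9]/', '', (string)$id);
    $length = strlen($digits);
    if ($length < 21 || $length > 100) { return false; }
    return $digits;
}

var_dump(example_sane_gplus_id('0'));                      # bool(false)
var_dump(example_sane_gplus_id('123456789012345678901'));  # string(21) "123456789012345678901"
```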

globalcitizen added some commits
@globalcitizen globalcitizen Procedural re-implementation.
727b65b
@globalcitizen globalcitizen Modify to wrap gc.php, a newer procedural implementation with many bugfixes and improvements.
0ab9ec0
@globalcitizen globalcitizen Update documentation. e3b54f4
Commits on Jul 21, 2011
  1. @globalcitizen: Procedural re-implementation.
  2. @globalcitizen: Modify to wrap gc.php, a newer procedural implementation with many bugfixes and improvements.
  3. @globalcitizen: Update documentation.
Showing with 259 additions and 153 deletions.
  1. +68 −5 README
  2. +179 −0 gc.php
  3. +12 −148 googleCard.php
73 README
@@ -1,13 +1,76 @@
googleCard
====================
-A very quick and rough PHP class to scrape data from google+
-Copyright (C) 2011 Mabujo
-http://plusdevs.com
-http://plusdevs.com/googlecard-googleplus-php-scraper/
+Scrape data about a user from Google+.
-See index.php for sample implementation
+This project used to be "A very quick and rough PHP class to scrape data
+from google+" but is now a procedural codebase with a minimalist PHP class
+wrapper (for existing users, and all those poor, poor, OO PHP enthusiasts
+out there... ;) and a bunch of bugfixes / improvements.
+Authors:
+ - Original by Mabujo
+ - Rewrite by https://github.com/globalcitizen/
+
+URLs:
+ - http://plusdevs.com
+ - http://plusdevs.com/googlecard-googleplus-php-scraper/
+
+Example:
+ - See index.php for a sample implementation (OO)
+
+Requirements
+---------------------
+ - PHP with 'curl' and 'json' support compiled in (should be...)
+ - A writable cache directory, if you want to use caching (just say yes!)
+
+Installation
+---------------------
+ - The code should make its own cache directory if it has
+ rights, otherwise you may need to do this manually.
+
+Compatibility
+---------------------
+ - "Should" work on Windows as well as Unix platforms, however this is
+ untested
+
+Usage
+---------------------
+Basically it comes down to one function. You supply it with a Google+
+ID and it returns an array of information regarding that user. The
+information returned is as follows:
+
+ 'url' URL to the user's Google+ profile
+ 'img' URL to the user's Google+ profile image
+ 'count' Total number of followers the user has
+ 'name' The user's name
+
+The interface differs slightly based upon your implementation style.
+
+ Procedural ("Traditional / Unix style")
+ ---------------------------------------
+ require_once('gc.php');
+ $googleplusid = '123456789012345678901';
+ $info = google_plus_user_info($googleplusid);
+ print $info['name'] . ' has ' . $info['count'] . ' followers.';
+
+ Object Oriented ("Enterprise Platform Framework Integration style")*
+ -------------------------------------------------------------------
+ include_once('googleCard.php');
+ $googleplusid = '123456789012345678901';
+ $plus = new googleCard($googleplusid);
+ $data = $plus->googleCard();
+ print $data['name'] . ' has ' . $data['count'] . ' followers.';
+
+ * Also, usually a hell of a lot more typing, less concise code,
+ more implicit assumptions about integration paths, etc. See
+ 'Coders at Work' or 'The Art of UNIX Programming' for some
+ well-considered and even-handed criticisms of OO's overuse.
+
+Configuration
+-------------
+There are some options available within the source, however you
+are encouraged to use the defaults.
License
---------------------
179 gc.php
@@ -0,0 +1,179 @@
+<?php
+/**
+* Google Plus User Information Scraper (and 'Google Card' backend)
+* http://plusdevs.com
+* http://plusdevs.com/googlecard-googleplus-php-scraper/
+*
+* This program is free software: you can redistribute it and/or
+* modify
+* it under the terms of the GNU General Public License as published
+* by
+* the Free Software Foundation, either version 3 of the License, or
+* (at your option) any later version.
+*
+* This program is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+* GNU General Public License for more details.
+*
+* You should have received a copy of the GNU General Public License
+* along with this program. If not, see
+* <http://www.gnu.org/licenses/>.
+*/
+
+# procedural reimplementation of the 'googleCard' googleplus scraper
+# (see README for docs.)
+
+# init
+global $__gc;
+
+# settings
+$__gc = array(
+ # basics for http access
+ 'url' => 'http://plus.google.com/',
+ 'user_agent' => 'Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0',
+ # whether to cache or not
+ 'cache' => 1,
+ # number of hours to cache, if caching
+ 'cache_hours' => 2,
+ # whether to trust the cache (0 = more secure, 1 = faster)
+ 'trust_cache' => 0,
+ # whether to trust the web (0 = intelligent, 1 = fast but stupid)
+ 'trust_world' => 0,
+ # directory in which to store cachefile
+ 'cache_dir' => '.gc_cache',
+ # debug
+ 'debug' => 0
+ );
+# cachedir fix for windows platforms, which may break on '.blah'
+if(DIRECTORY_SEPARATOR == '\\') { $__gc['cache_dir'] = 'gc_cache'; }
+
+# check cache dir sanity
+# - exists?
+if(!is_dir($__gc['cache_dir'])) {
+ # attempt to create
+ if(!mkdir($__gc['cache_dir'])) {
+ print "ERROR: failed to create cache_dir '" . $__gc['cache_dir'] . "'\n";
+ die();
+ }
+}
+# - is writable?
+if(!is_writable($__gc['cache_dir'])) {
+ print "ERROR: cache_dir '" . $__gc['cache_dir'] . "' does not exist, or is not writable!\n";
+ die();
+}
+
+# acquire information regarding a user on google+ by screen scraping
+# arguments:
+# $googleplusid numeric google+ id of the user whose data you wish
+# to query. (note: this should be 21-100 numerals)
+# returns an array with the following elements:
+# count the number of followers the user has
+# img url to the user's image
+# url url to the user's google+ page
+# name the user's name
+# ... or false on failure.
+# note that cache behaviour is controlled via $__gc
+function google_plus_user_info($googleplusid) {
+ # sanitise
+ $googleplusid = preg_replace('/[^0-9]/','',$googleplusid);
+ $length = strlen($googleplusid);
+ if($length < 21 || $length > 100) { return false; }
+ # init
+ global $__gc;
+ $tr = array(); # data 'to return'
+ $url = $__gc['url'] . $googleplusid;
+ # first, handle the case of caching enabled
+ if($__gc['cache']) {
+ # build cachefile path (cross-platform)
+ $cachefile = $__gc['cache_dir'] . DIRECTORY_SEPARATOR . $googleplusid;
+ # does the cachefile exist?
+ if(file_exists($cachefile) && is_readable($cachefile)) {
+ # is cachefile fresh?
+ if(filemtime($cachefile) > (time() - ($__gc['cache_hours']*60*60))) {
+ # great! read the data.
+ $tr = json_decode(file_get_contents($cachefile),1);
+ __gc_debug("loaded data from cachefile '$cachefile'");
+ # sanitise if desired
+ if($__gc['trust_cache'] == 0) { $tr = __gc_tr_fix($tr); }
+ # append URL and return
+ $tr['url'] = $url;
+ return $tr;
+ }
+ }
+ }
+ # cache was either disabled or stale, so we need to fetch new data.
+ $html = __gc_http($url);
+ # attempt to extract the relevant information
+ # - number of followers ("in <x> circles")
+ preg_match('/<h4 class="a-c-ka-Sf">(.*?)<\/h4>/s',$html,$matches);
+ $tr['count'] = preg_replace('/[^0-9]/', '', $matches[1]);
+ if($tr['count']=='') { $tr['count'] = 0; }
+ # - user's name
+ preg_match('/<span class="fn">(.*?)<\/span>/s',$html,$matches);
+ $tr['name'] = $matches[1];
+ # - user's image URL
+ preg_match('/<div class="a-Ba-V-z-N">(.*?)<\/div>/s',$html,$matches);
+ $img_div_html = $matches[1]; # actually div data
+ preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i',$img_div_html,$matches);
+ $tr['img'] = 'http:' . $matches[1];
+ # sanitise if required (before caching, so the cache only holds clean data)
+ if(!$__gc['trust_world']) {
+ $tr = __gc_tr_fix($tr);
+ }
+ # finally, we handle saving to cache if required
+ if($__gc['cache']) {
+ file_put_contents($cachefile,json_encode($tr));
+ __gc_debug("stored data to cachefile '$cachefile'");
+ }
+ # append URL and return
+ $tr['url'] = $url;
+ return $tr;
+}
+
+# try to load a page
+function __gc_http($url) {
+ global $__gc;
+ $ch = curl_init($url);
+ curl_setopt($ch, CURLOPT_HEADER, 0);
+ curl_setopt($ch, CURLOPT_USERAGENT, $__gc['user_agent']);
+ curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
+ curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
+ return curl_exec($ch);
+}
+
+# internal function to validate information prior to returning
+# called from two places:
+# - after untrusted cache load
+# - prior to storing cache results
+function __gc_tr_fix($tr) {
+ __gc_debug("[gc_tr_fix] input was " . print_r($tr,1));
+ if(!is_array($tr)) { $tr = array(); }
+ # count must be numeric
+ if(!isset($tr['count']) || !is_numeric($tr['count'])) { $tr['count'] = 0; }
+ # img must be an http url
+ if(!isset($tr['img']) || substr($tr['img'],0,7) != 'http://') { $tr['img'] = ''; }
+ # if name is unset, set it blank
+ if(!isset($tr['name'])) { $tr['name'] = ''; }
+ # now strip_tags() on all data, simultaneously dropping unknown keys
+ $allowed_keys = array('name','img','count');
+ $keys = array_keys($tr);
+ foreach($keys as $key) {
+ if(!in_array($key,$allowed_keys)) {
+ unset($tr[$key]);
+ }
+ else {
+ $tr[$key] = strip_tags($tr[$key]);
+ }
+ }
+ __gc_debug("[gc_tr_fix] output was " . print_r($tr,1));
+ return $tr;
+}
+
+# debug
+function __gc_debug($str) {
+ global $__gc;
+ if($__gc['debug']) { print "DEBUG: " . $str . "\n"; }
+}
+
+?>
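
for comparison, the whitelist-and-strip approach used by __gc_tr_fix() could be written so that it never reads an unset key or does arithmetic on a non-numeric string. this is only a sketch under those assumptions: example_sanitise_info() is a hypothetical name, and unlike the code in this diff it returns 'count' as an integer.

```php
<?php
# hypothetical variant of the sanitiser: start from safe defaults and copy
# over only known keys that pass validation, so unknown keys are dropped
function example_sanitise_info($info) {
    if (!is_array($info)) { $info = array(); }
    $clean = array('name' => '', 'img' => '', 'count' => 0);
    # name: any string, with tags stripped
    if (isset($info['name']) && is_string($info['name'])) {
        $clean['name'] = strip_tags($info['name']);
    }
    # img: must be an http url
    if (isset($info['img']) && is_string($info['img'])
        && substr($info['img'], 0, 7) === 'http://') {
        $clean['img'] = strip_tags($info['img']);
    }
    # count: digits only
    if (isset($info['count']) && is_scalar($info['count'])
        && preg_match('/^[0-9]+$/', (string)$info['count'])) {
        $clean['count'] = (int)$info['count'];
    }
    return $clean;
}
```

building the result from defaults means a corrupt or tampered cachefile can at worst yield blank fields.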
160 googleCard.php
@@ -2,6 +2,7 @@
/**
* A very quick and rough PHP class to scrape data from google+
* Copyright (C) 2011 Mabujo
+* - 2011-07-21: rewrite by https://github.com/globalcitizen/
* http://plusdevs.com
* http://plusdevs.com/googlecard-googleplus-php-scraper/
*
@@ -19,152 +20,15 @@
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
-class googleCard
-{
- // The base g+ URL
- public $gplus_url = 'http://plus.google.com/';
-
- // set a plausible user agent
- public $user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:5.0) Gecko/20100101 Firefox/5.0';
-
- /*
- * whether to cache the data or not
- * no cache = 0
- * cache = 1
- */
- public $cache_data = '0';
-
- // how many hours to cache for
- public $cache_hours = '2';
-
- // cache file directory
- public $cache_dir = 'cache/';
-
- // cache file name
- public $cache_file = 'plus_card.txt';
-
- // constructor
- function __construct($id = '')
- {
- if (!empty($id) && is_numeric($id))
- {
- // build our google+ url
- $this->url = $this->gplus_url . $id;
- }
- }
-
- // main handler function, call it from your script
- public function googleCard()
- {
- // if we're using caching
- if ($this->cache_data > 0)
- {
- $html = $this->ghettoCache();
- return $html;
- }
- // don't cache
- else
- {
- $html = $this->parseHtml();
- return $html;
- }
- }
-
- // parses through the returned html
- protected function parseHtml()
- {
- // load the page
- $this->getPage();
-
- // parse the html to look for the h4 'have X in circles' element
- preg_match('/<h4 class="a-c-ka-Sf">(.*?)<\/h4>/s', $this->html, $matches);
- $count = $matches[1];
- $circles = preg_replace('/[^0-9_]/', '', $count);
- if (empty($circles))
- {
- $circles = 0;
- }
-
- // parse the html for the user's name
- preg_match('/<span class="fn">(.*?)<\/span>/s', $this->html, $matches);
- $name = $matches[1];
-
- // parse the html for the img div
- preg_match('/<div class="a-Ba-V-z-N">(.*?)<\/div>/s', $this->html, $matches);
- $img_div = $matches[1];
-
- // parse the img div for the image src
- preg_match('/< *img[^>]*src *= *["\']?([^"\']*)/i', $img_div, $matches);
- $img = 'http:' . $matches[1];
-
- // put the data in an array
- $return = array('count' => $circles, 'name' => $name, 'img' => $img, 'url' => $this->url);
-
- return $return;
- }
-
- // use curl to load the page
- protected function getPage()
- {
- // initiate curl with our url
- $this->curl = curl_init($this->url);
-
- // set curl options
- curl_setopt($this->curl, CURLOPT_HEADER, 0);
- curl_setopt($this->curl, CURLOPT_USERAGENT, $this->user_agent);
- curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
- curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
-
- // execute the call to google+
- $this->html = curl_exec($this->curl);
-
- curl_close($this->curl);
- }
-
- // caching
- protected function ghettoCache()
- {
- // our cache file
- $file = $this->cache_dir . $this->cache_file;
- $cache_time = ($this->cache_hours * 60) * 60;
-
- // if we have a cache file and it's within our expiry time
- if (file_exists($file) && (time() - $cache_time < filemtime($file)))
- {
- //open cached file
- $handle = fopen($file, "r");
-
- //read it
- $data = fgets($handle);
-
- //close it
- fclose($handle);
-
- // json decode, put into array and return
- return get_object_vars(json_decode($data));
- }
- // we don't have a cache file
- // call google+ and cache
- else
- {
- // get and parse the data
- $html = $this->parseHtml();
-
- // json encode the data
- $json = json_encode($html);
-
- // open the file
- $handle = fopen($file, 'w');
-
- // write data to file
- fwrite($handle, $json);
-
- // close file
- fclose($handle);
-
- // return data
- return $html;
- }
- }
+# now simply a wrapper for gc.php, which is a rewrite.
+class googleCard {
+ function __construct($id = '') {
+ require_once('gc.php'); # load the procedural codebase
+ $this->id = $id;
+ }
+ public function googleCard() {
+ return google_plus_user_info($this->id);
+ }
}
-?>
+
+?>