
initial xgoogle import, version v1.3

0 parents commit e3477ce68ffb77f4fbfbd268b87334cd34aa14bc @pkrumins committed
Showing with 3,029 additions and 0 deletions.
  1. +199 −0 readme.txt
  2. +1,931 −0 xgoogle/BeautifulSoup.py
  3. +34 −0 xgoogle/__init__.py
  4. +105 −0 xgoogle/browser.py
  5. +89 −0 xgoogle/googlesets.py
  6. +237 −0 xgoogle/search.py
  7. +235 −0 xgoogle/sponsoredlinks.py
  8. +199 −0 xgoogle/translate.py
199 readme.txt
@@ -0,0 +1,199 @@
+This is a Python library for Google services, called 'xgoogle'. The current version is 1.3.
+
+It's written by Peteris Krumins (peter@catonmat.net).
+His blog is at http://www.catonmat.net -- good coders code, great reuse.
+
+The code is licensed under the MIT license.
+
+--------------------------------------------------------------------------
+
+At the moment it contains:
+ * Google Search module xgoogle/search.py.
+ http://www.catonmat.net/blog/python-library-for-google-search/
+
+ * Google Sponsored Links Search module xgoogle/sponsoredlinks.py
+ http://www.catonmat.net/blog/python-library-for-google-sponsored-links-search/
+
+ * Google Sets module xgoogle/googlesets.py
+ http://www.catonmat.net/blog/python-library-for-google-sets/
+
+ * Google Translate module xgoogle/translate.py
+ http://www.catonmat.net/blog/python-library-for-google-translate/
+
+--------------------------------------------------------------------------
+
+Here is an example usage of the Google Search module:
+
+ >>> from xgoogle.search import GoogleSearch
+ >>> gs = GoogleSearch("catonmat")
+ >>> gs.results_per_page = 25
+ >>> results = gs.get_results()
+ >>> for res in results:
+ ... print res.title.encode('utf8')
+ ...
+
+ output:
+
+ good coders code, great reuse
+ MIT's Introduction to Algorithms, Lectures 1 and 2: Analysis of ...
+ catonmat - Google Code
+ ...
+
+The GoogleSearch object has several public methods and properties:
+
+ method get_results() - gets a page of results, returning a list of SearchResult objects.
+ property num_results - returns the number of search results found.
+ property results_per_page - sets/gets the number of results to get per page.
+ property page - sets/gets the search page number (zero-based).
+
+A SearchResult object has three attributes -- "title", "desc", and "url".
+They are Unicode strings, so do a proper encoding before outputting them.
+
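+Here is a sketch of paging through results with the "page" property (the
+query is reused from above; the result count shown is illustrative and
+actual numbers will vary):
+
+ >>> from xgoogle.search import GoogleSearch
+ >>> gs = GoogleSearch("catonmat")
+ >>> gs.page = 1   # zero-based, so this requests the second page
+ >>> results = gs.get_results()
+ >>> print gs.num_results
+ 3537
+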
+--------------------------------------------------------------------------
+
+Here is an example usage of the Google Sponsored Links Search module:
+
+ >>> from xgoogle.sponsoredlinks import SponsoredLinks, SLError
+ >>> sl = SponsoredLinks("video software")
+ >>> sl.results_per_page = 100
+ >>> results = sl.get_results()
+ >>> for result in results:
+ ... print result.title.encode('utf8')
+ ...
+
+ output:
+
+ Photoshop Video Software
+ Video Poker Software
+ DVD/Video Rental Software
+ ...
+
+The SponsoredLinks object has several public methods and properties:
+
+ method get_results() - gets a page of results, returning a list of SponsoredLink objects.
+ property num_results - returns the number of search results found.
+ property results_per_page - sets/gets the number of results to get per page.
+
+A SponsoredLink object has four attributes -- "title", "desc", "url", and "display_url".
+They are Unicode strings, don't forget to use a proper encoding before outputting them.
+
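+The module also provides a get_all_results() method (see xgoogle/sponsoredlinks.py)
+that keeps requesting the next page until the results run out. Passing
+GET_ALL_SLEEP_FUNCTION makes it pause a random 1-6 seconds between page requests.
+A sketch:
+
+ >>> from xgoogle.sponsoredlinks import SponsoredLinks, GET_ALL_SLEEP_FUNCTION
+ >>> sl = SponsoredLinks("video software")
+ >>> all_results = sl.get_all_results(GET_ALL_SLEEP_FUNCTION)
+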
+--------------------------------------------------------------------------
+
+Here is an example usage of the Google Sets module:
+
+ >>> from xgoogle.googlesets import GoogleSets
+ >>> gs = GoogleSets(['red', 'yellow'])
+ >>> results = gs.get_results()
+ >>> print len(results)
+ >>> for r in results:
+ ... print r.encode('utf8')
+ ...
+
+ output:
+
+ red
+ yellow
+ blue
+ white
+ ...
+
+The GoogleSets object has only one public method, get_results(set_type). The default
+value for set_type is SMALL_SET, which makes it return 15 related items or fewer.
+Use LARGE_SET to get more than 15 items. The get_results() method returns a list of
+related items represented as Unicode strings.
+Don't forget to do the proper encoding when outputting these strings!
+
+Here is an example showing differences between SMALL_SET and LARGE_SET:
+
+ >>> from xgoogle.googlesets import GoogleSets, LARGE_SET, SMALL_SET
+ >>> gs = GoogleSets(['python', 'perl'])
+ >>> results_small = gs.get_results() # SMALL_SET by default
+ >>> len(results_small)
+ 11
+ >>> results_small
+ [u'python', u'perl', u'php', u'ruby', u'java', u'javascript', u'c++', u'c',
+ u'cgi', u'tcl', u'c#']
+ >>>
+ >>> results_large = gs.get_results(LARGE_SET)
+ >>> len(results_large)
+ 46
+ >>> results_large
+ [u'perl', u'python', u'java', u'c++', u'php', u'c', u'c#', u'javascript',
+ u'howto', u'wiki', u'raid', u'dd', u'linux', u'ruby', u'language', u'xml',
+ u'sgml', u'svn', u'kernel', ...]
+
+
+--------------------------------------------------------------------------
+
+Here is an example usage of the Google Translate module:
+
+ >>> from xgoogle.translate import Translator
+ >>>
+ >>> translate = Translator().translate
+ >>> print translate("Mani sauc Pēteris", lang_to="ru").encode('utf-8')
+ Меня зовут Петр
+ >>> print translate("Mani sauc Pēteris", lang_to="en")
+ My name is Peter
+ >>> print translate("Меня зовут Петр")
+ My name is Peter
+
+The "translate" function takes three arguments - "message", "lang_from" and "lang_to".
+If "lang_from" is not given, Google's translation service auto-detects it.
+If "lang_to" is not given, it defaults to "en" (English).
+
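+For example, to force the source language instead of having it auto-detected
+(the output shown is illustrative):
+
+ >>> print translate("bonjour", lang_from="fr", lang_to="en")
+ hello
+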
+In case of an error the "translate" function throws a "TranslationError" exception.
+Make sure to wrap your code in a try/except block to catch it:
+
+ >>> from xgoogle.translate import Translator, TranslationError
+ >>>
+ >>> try:
+ ...     translate = Translator().translate
+ ...     print translate("")
+ ... except TranslationError, e:
+ ...     print e
+
+ Failed translating: invalid text
+
+
+The Google Translate module also provides a "LanguageDetector" class that can be
+used to detect the language of a given text.
+
+Here is an example usage of LanguageDetector:
+
+ >>> from xgoogle.translate import LanguageDetector, DetectionError
+ >>>
+ >>> detect = LanguageDetector().detect
+ >>> english = detect("This is a wonderful library.")
+ >>> english.lang_code
+ 'en'
+ >>> english.lang
+ 'English'
+ >>> english.confidence
+ 0.28078437000000001
+ >>> english.is_reliable
+ True
+
+The "DetectionError" may get raised if the detection failed.
+
+
+--------------------------------------------------------------------------
+
+
+Version history:
+
+v1.0: * initial release, xgoogle library contains just the Google Search module.
+v1.1: * added Google Sponsored Links Search.
+      * fixed a bug in browser.py that might have thrown an unexpected exception.
+v1.2: * added Google Sets module.
+v1.3: * added Google Translate module.
+      * fixed a bug in browser.py where KeyboardInterrupt did not get propagated.
+
+--------------------------------------------------------------------------
+
+That's it. Have fun! :)
+
+
+Sincerely,
+Peteris Krumins
+http://www.catonmat.net
+
1,931 xgoogle/BeautifulSoup.py
1,931 additions, 0 deletions not shown because the diff is too large.
34 xgoogle/__init__.py
@@ -0,0 +1,34 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# A Google Python library:
+# http://www.catonmat.net/blog/python-library-for-google-search/
+#
+# Distributed under MIT license:
+#
+# Copyright (c) 2009 Peteris Krumins
+#
+# Permission is hereby granted, free of charge, to any person
+# obtaining a copy of this software and associated documentation
+# files (the "Software"), to deal in the Software without
+# restriction, including without limitation the rights to use,
+# copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following
+# conditions:
+#
+# The above copyright notice and this permission notice shall be
+# included in all copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+# OTHER DEALINGS IN THE SOFTWARE.
+#
+
105 xgoogle/browser.py
@@ -0,0 +1,105 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# http://www.catonmat.net/blog/python-library-for-google-search/
+#
+# Code is licensed under MIT license.
+#
+
+import random
+import socket
+import urllib
+import urllib2
+import httplib
+
+BROWSERS = (
+ # Top most popular browsers in my access.log on 2009.02.12
+ # tail -50000 access.log |
+ # awk -F\" '{B[$6]++} END { for (b in B) { print B[b] ": " b } }' |
+ # sort -rn |
+ # head -20
+ 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
+ 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6',
+ 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)',
+ 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6',
+ 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
+ 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6 (.NET CLR 3.5.30729)',
+ 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.48 Safari/525.19',
+ 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)',
+ 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6',
+ 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.5) Gecko/2008121621 Ubuntu/8.04 (hardy) Firefox/3.0.5',
+ 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-us) AppleWebKit/525.27.1 (KHTML, like Gecko) Version/3.2.1 Safari/525.27.1',
+ 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)',
+ 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)',
+ 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
+)
+
+TIMEOUT = 5 # socket timeout
+
+class BrowserError(Exception):
+    def __init__(self, url, error):
+        self.url = url
+        self.error = error
+
+class PoolHTTPConnection(httplib.HTTPConnection):
+    def connect(self):
+        """Connect to the host and port specified in __init__."""
+        msg = "getaddrinfo returns an empty list"
+        for res in socket.getaddrinfo(self.host, self.port, 0,
+                socket.SOCK_STREAM):
+            af, socktype, proto, canonname, sa = res
+            try:
+                self.sock = socket.socket(af, socktype, proto)
+                if self.debuglevel > 0:
+                    print "connect: (%s, %s)" % (self.host, self.port)
+                self.sock.settimeout(TIMEOUT)
+                self.sock.connect(sa)
+            except socket.error, msg:
+                if self.debuglevel > 0:
+                    print 'connect fail:', (self.host, self.port)
+                if self.sock:
+                    self.sock.close()
+                self.sock = None
+                continue
+            break
+        if not self.sock:
+            raise socket.error, msg
+
+class PoolHTTPHandler(urllib2.HTTPHandler):
+    def http_open(self, req):
+        return self.do_open(PoolHTTPConnection, req)
+
+class Browser(object):
+    def __init__(self, user_agent=BROWSERS[0], debug=False, use_pool=False):
+        self.headers = {
+            'User-Agent': user_agent,
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
+            'Accept-Language': 'en-us,en;q=0.5'
+        }
+        self.debug = debug
+
+    def get_page(self, url, data=None):
+        handlers = [PoolHTTPHandler]
+        opener = urllib2.build_opener(*handlers)
+        if data: data = urllib.urlencode(data)
+        request = urllib2.Request(url, data, self.headers)
+        try:
+            response = opener.open(request)
+            return response.read()
+        except (urllib2.HTTPError, urllib2.URLError), e:
+            raise BrowserError(url, str(e))
+        except (socket.error, socket.sslerror), msg:
+            raise BrowserError(url, msg)
+        except socket.timeout, e:
+            raise BrowserError(url, "timeout")
+        except KeyboardInterrupt:
+            raise
+        except:
+            raise BrowserError(url, "unknown error")
+
+    def set_random_user_agent(self):
+        self.headers['User-Agent'] = random.choice(BROWSERS)
+        return self.headers['User-Agent']
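+
+# Example usage (a sketch; requires network access):
+#
+#   from xgoogle.browser import Browser, BrowserError
+#   b = Browser()
+#   b.set_random_user_agent()
+#   try:
+#       html = b.get_page('http://www.google.com/')
+#   except BrowserError, e:
+#       print e.url, e.error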
+
89 xgoogle/googlesets.py
@@ -0,0 +1,89 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# http://www.catonmat.net/blog/python-library-for-google-sets/
+#
+# Code is licensed under MIT license.
+#
+
+import re
+import urllib
+import random
+from htmlentitydefs import name2codepoint
+from BeautifulSoup import BeautifulSoup
+
+from browser import Browser, BrowserError
+
+class GSError(Exception):
+    """ Google Sets Error """
+    pass
+
+class GSParseError(Exception):
+    """
+    Parse error in Google Sets results.
+    self.msg attribute contains explanation why parsing failed
+    self.tag attribute contains BeautifulSoup object with the most relevant tag that failed to parse
+    Thrown only in debug mode
+    """
+
+    def __init__(self, msg, tag):
+        self.msg = msg
+        self.tag = tag
+
+    def __str__(self):
+        return self.msg
+
+    def html(self):
+        return self.tag.prettify()
+
+LARGE_SET = 1
+SMALL_SET = 2
+
+class GoogleSets(object):
+    URL_LARGE = "http://labs.google.com/sets?hl=en&q1=%s&q2=%s&q3=%s&q4=%s&q5=%s&btn=Large+Set"
+    URL_SMALL = "http://labs.google.com/sets?hl=en&q1=%s&q2=%s&q3=%s&q4=%s&q5=%s&btn=Small+Set+(15+items+or+fewer)"
+
+    def __init__(self, items, random_agent=False, debug=False):
+        self.items = items
+        self.debug = debug
+        self.browser = Browser(debug=debug)
+
+        if random_agent:
+            self.browser.set_random_user_agent()
+
+    def get_results(self, set_type=SMALL_SET):
+        page = self._get_results_page(set_type)
+        results = self._extract_results(page)
+        return results
+
+    def _maybe_raise(self, cls, *arg):
+        if self.debug:
+            raise cls(*arg)
+
+    def _get_results_page(self, set_type):
+        if set_type == LARGE_SET:
+            url = GoogleSets.URL_LARGE
+        else:
+            url = GoogleSets.URL_SMALL
+
+        safe_items = [urllib.quote_plus(i) for i in self.items]
+        blank_items = 5 - len(safe_items)
+        if blank_items > 0:
+            safe_items += [''] * blank_items
+
+        safe_url = url % tuple(safe_items)
+
+        try:
+            page = self.browser.get_page(safe_url)
+        except BrowserError, e:
+            raise GSError, "Failed getting %s: %s" % (e.url, e.error)
+
+        return BeautifulSoup(page)
+
+    def _extract_results(self, soup):
+        a_links = soup.findAll('a', href=re.compile('/search'))
+        ret_res = [a.string for a in a_links]
+        return ret_res
+
237 xgoogle/search.py
@@ -0,0 +1,237 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# http://www.catonmat.net/blog/python-library-for-google-search/
+#
+# Code is licensed under MIT license.
+#
+
+import re
+import urllib
+from htmlentitydefs import name2codepoint
+from BeautifulSoup import BeautifulSoup
+
+from browser import Browser, BrowserError
+
+class SearchError(Exception):
+    """
+    Base class for Google Search exceptions.
+    """
+    pass
+
+class ParseError(SearchError):
+    """
+    Parse error in Google results.
+    self.msg attribute contains explanation why parsing failed
+    self.tag attribute contains BeautifulSoup object with the most relevant tag that failed to parse
+    Thrown only in debug mode
+    """
+
+    def __init__(self, msg, tag):
+        self.msg = msg
+        self.tag = tag
+
+    def __str__(self):
+        return self.msg
+
+    def html(self):
+        return self.tag.prettify()
+
+class SearchResult:
+    def __init__(self, title, url, desc):
+        self.title = title
+        self.url = url
+        self.desc = desc
+
+    def __str__(self):
+        return 'Google Search Result: "%s"' % self.title
+
+class GoogleSearch(object):
+    SEARCH_URL_0 = "http://www.google.com/search?hl=en&q=%(query)s&btnG=Google+Search"
+    NEXT_PAGE_0 = "http://www.google.com/search?hl=en&q=%(query)s&start=%(start)d"
+    SEARCH_URL_1 = "http://www.google.com/search?hl=en&q=%(query)s&num=%(num)d&btnG=Google+Search"
+    NEXT_PAGE_1 = "http://www.google.com/search?hl=en&q=%(query)s&num=%(num)d&start=%(start)d"
+
+    def __init__(self, query, random_agent=False, debug=False):
+        self.query = query
+        self.debug = debug
+        self.browser = Browser(debug=debug)
+        self.results_info = None
+        self.eor = False # end of results
+        self._page = 0
+        self._results_per_page = 10
+        self._last_from = 0
+
+        if random_agent:
+            self.browser.set_random_user_agent()
+
+    @property
+    def num_results(self):
+        if not self.results_info:
+            page = self._get_results_page()
+            self.results_info = self._extract_info(page)
+            if self.results_info['total'] == 0:
+                self.eor = True
+        return self.results_info['total']
+
+    def _get_page(self):
+        return self._page
+
+    def _set_page(self, page):
+        self._page = page
+
+    page = property(_get_page, _set_page)
+
+    def _get_results_per_page(self):
+        return self._results_per_page
+
+    def _set_results_per_page(self, rpp):
+        self._results_per_page = rpp
+
+    results_per_page = property(_get_results_per_page, _set_results_per_page)
+
+    def get_results(self):
+        """ Gets a page of results """
+        if self.eor:
+            return []
+
+        page = self._get_results_page()
+        search_info = self._extract_info(page)
+        if not self.results_info:
+            self.results_info = search_info
+            if self.num_results == 0:
+                self.eor = True
+                return []
+        results = self._extract_results(page)
+        if not results:
+            self.eor = True
+            return []
+        if self._page > 0 and search_info['from'] == self._last_from:
+            self.eor = True
+            return []
+        if search_info['to'] == search_info['total']:
+            self.eor = True
+        self._page += 1
+        self._last_from = search_info['from']
+        return results
+
+    def _maybe_raise(self, cls, *arg):
+        if self.debug:
+            raise cls(*arg)
+
+    def _get_results_page(self):
+        if self._page == 0:
+            if self._results_per_page == 10:
+                url = GoogleSearch.SEARCH_URL_0
+            else:
+                url = GoogleSearch.SEARCH_URL_1
+        else:
+            if self._results_per_page == 10:
+                url = GoogleSearch.NEXT_PAGE_0
+            else:
+                url = GoogleSearch.NEXT_PAGE_1
+
+        safe_url = url % { 'query': urllib.quote_plus(self.query),
+                           'start': self._page * self._results_per_page,
+                           'num': self._results_per_page }
+
+        try:
+            page = self.browser.get_page(safe_url)
+        except BrowserError, e:
+            raise SearchError, "Failed getting %s: %s" % (e.url, e.error)
+
+        return BeautifulSoup(page)
+
+    def _extract_info(self, soup):
+        empty_info = {'from': 0, 'to': 0, 'total': 0}
+        div_ssb = soup.find('div', id='ssb')
+        if not div_ssb:
+            self._maybe_raise(ParseError, "Div with number of results was not found on Google search page", soup)
+            return empty_info
+        p = div_ssb.find('p')
+        if not p:
+            self._maybe_raise(ParseError, """<p> tag within <div id="ssb"> was not found on Google search page""", soup)
+            return empty_info
+        txt = ''.join(p.findAll(text=True))
+        txt = txt.replace(',', '')
+        matches = re.search(r'Results (\d+) - (\d+) of (?:about )?(\d+)', txt, re.U)
+        if not matches:
+            return empty_info
+        return {'from': int(matches.group(1)), 'to': int(matches.group(2)), 'total': int(matches.group(3))}
+
+    def _extract_results(self, soup):
+        results = soup.findAll('li', {'class': 'g'})
+        ret_res = []
+        for result in results:
+            eres = self._extract_result(result)
+            if eres:
+                ret_res.append(eres)
+        return ret_res
+
+    def _extract_result(self, result):
+        title, url = self._extract_title_url(result)
+        desc = self._extract_description(result)
+        if not title or not url or not desc:
+            return None
+        return SearchResult(title, url, desc)
+
+    def _extract_title_url(self, result):
+        #title_a = result.find('a', {'class': re.compile(r'\bl\b')})
+        title_a = result.find('a')
+        if not title_a:
+            self._maybe_raise(ParseError, "Title tag in Google search result was not found", result)
+            return None, None
+        title = ''.join(title_a.findAll(text=True))
+        title = self._html_unescape(title)
+        url = title_a['href']
+        match = re.match(r'/url\?q=(http[^&]+)&', url)
+        if match:
+            url = urllib.unquote(match.group(1))
+        return title, url
+
+    def _extract_description(self, result):
+        desc_div = result.find('div', {'class': re.compile(r'\bs\b')})
+        if not desc_div:
+            self._maybe_raise(ParseError, "Description tag in Google search result was not found", result)
+            return None
+
+        desc_strs = []
+        def looper(tag):
+            if not tag: return
+            for t in tag:
+                try:
+                    if t.name == 'br': break
+                except AttributeError:
+                    pass
+
+                try:
+                    desc_strs.append(t.string)
+                except AttributeError:
+                    desc_strs.append(t)
+
+        looper(desc_div)
+        looper(desc_div.find('wbr')) # BeautifulSoup does not self-close <wbr>
+
+        desc = ''.join(s for s in desc_strs if s)
+        return self._html_unescape(desc)
+
+    def _html_unescape(self, text):
+        """Replace numeric (&#NNN;) and named (&amp;) HTML entities in 'text'."""
+        def entity_replacer(m):
+            entity = m.group(1)
+            if entity in name2codepoint:
+                return unichr(name2codepoint[entity])
+            else:
+                return m.group(0)
+
+        def ascii_replacer(m):
+            cp = int(m.group(1))
+            if cp <= 255:
+                return unichr(cp)
+            else:
+                return m.group(0)
+
+        # compile with re.U; passing re.U as re.sub's 4th argument would set 'count'
+        s = re.compile(r'&#(\d+);', re.U).sub(ascii_replacer, text)
+        return re.compile(r'&([^;]+);', re.U).sub(entity_replacer, s)
+
235 xgoogle/sponsoredlinks.py
@@ -0,0 +1,235 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# http://www.catonmat.net/blog/python-library-for-google-sponsored-links-search/
+#
+# Code is licensed under MIT license.
+#
+
+import re
+import time
+import urllib
+import random
+from htmlentitydefs import name2codepoint
+from BeautifulSoup import BeautifulSoup
+
+from browser import Browser, BrowserError
+
+#
+# TODO: join GoogleSearch and SponsoredLinks classes under a single base class
+#
+
+class SLError(Exception):
+    """ Sponsored Links Error """
+    pass
+
+class SLParseError(Exception):
+    """
+    Parse error in Google results.
+    self.msg attribute contains explanation why parsing failed
+    self.tag attribute contains BeautifulSoup object with the most relevant tag that failed to parse
+    Thrown only in debug mode
+    """
+
+    def __init__(self, msg, tag):
+        self.msg = msg
+        self.tag = tag
+
+    def __str__(self):
+        return self.msg
+
+    def html(self):
+        return self.tag.prettify()
+
+GET_ALL_SLEEP_FUNCTION = object()
+
+class SponsoredLink(object):
+    """ a single sponsored link """
+    def __init__(self, title, url, display_url, desc):
+        self.title = title
+        self.url = url
+        self.display_url = display_url
+        self.desc = desc
+
+class SponsoredLinks(object):
+    SEARCH_URL_0 = "http://www.google.com/sponsoredlinks?q=%(query)s&btnG=Search+Sponsored+Links&hl=en"
+    NEXT_PAGE_0 = "http://www.google.com/sponsoredlinks?q=%(query)s&sa=N&start=%(start)d&hl=en"
+    SEARCH_URL_1 = "http://www.google.com/sponsoredlinks?q=%(query)s&num=%(num)d&btnG=Search+Sponsored+Links&hl=en"
+    NEXT_PAGE_1 = "http://www.google.com/sponsoredlinks?q=%(query)s&num=%(num)d&sa=N&start=%(start)d&hl=en"
+
+    def __init__(self, query, random_agent=False, debug=False):
+        self.query = query
+        self.debug = debug
+        self.browser = Browser(debug=debug)
+        self._page = 0
+        self.eor = False
+        self.results_info = None
+        self._results_per_page = 10
+
+        if random_agent:
+            self.browser.set_random_user_agent()
+
+    @property
+    def num_results(self):
+        if not self.results_info:
+            page = self._get_results_page()
+            self.results_info = self._extract_info(page)
+            if self.results_info['total'] == 0:
+                self.eor = True
+        return self.results_info['total']
+
+    def _get_results_per_page(self):
+        return self._results_per_page
+
+    def _set_results_per_page(self, rpp):
+        self._results_per_page = rpp
+
+    results_per_page = property(_get_results_per_page, _set_results_per_page)
+
+    def get_results(self):
+        if self.eor:
+            return []
+        page = self._get_results_page()
+        info = self._extract_info(page)
+        if self.results_info is None:
+            self.results_info = info
+        if info['to'] == info['total']:
+            self.eor = True
+        results = self._extract_results(page)
+        if not results:
+            self.eor = True
+            return []
+        self._page += 1
+        return results
+
+    def _get_all_results_sleep_fn(self):
+        return random.random()*5 + 1 # sleep from 1 - 6 seconds
+
+    def get_all_results(self, sleep_function=None):
+        # sleep_function returns the number of seconds to pause between
+        # page requests; pass GET_ALL_SLEEP_FUNCTION for a random 1-6 s delay
+        if sleep_function is GET_ALL_SLEEP_FUNCTION:
+            sleep_function = self._get_all_results_sleep_fn
+        if sleep_function is None:
+            sleep_function = lambda: None
+        ret_results = []
+        while True:
+            res = self.get_results()
+            if not res:
+                return ret_results
+            ret_results.extend(res)
+            delay = sleep_function()
+            if delay:
+                time.sleep(delay)
+
+    def _maybe_raise(self, cls, *arg):
+        if self.debug:
+            raise cls(*arg)
+
+    def _extract_info(self, soup):
+        empty_info = { 'from': 0, 'to': 0, 'total': 0 }
+        stats_span = soup.find('span', id='stats')
+        if not stats_span:
+            return empty_info
+        txt = ''.join(stats_span.findAll(text=True))
+        txt = txt.replace(',', '').replace("&nbsp;", ' ')
+        matches = re.search(r'Results (\d+) - (\d+) of (?:about )?(\d+)', txt)
+        if not matches:
+            return empty_info
+        return {'from': int(matches.group(1)), 'to': int(matches.group(2)), 'total': int(matches.group(3))}
+
+    def _get_results_page(self):
+        if self._page == 0:
+            if self._results_per_page == 10:
+                url = SponsoredLinks.SEARCH_URL_0
+            else:
+                url = SponsoredLinks.SEARCH_URL_1
+        else:
+            if self._results_per_page == 10:
+                url = SponsoredLinks.NEXT_PAGE_0
+            else:
+                url = SponsoredLinks.NEXT_PAGE_1
+
+        safe_url = url % { 'query': urllib.quote_plus(self.query),
+                           'start': self._page * self._results_per_page,
+                           'num': self._results_per_page }
+
+        try:
+            page = self.browser.get_page(safe_url)
+        except BrowserError, e:
+            raise SLError, "Failed getting %s: %s" % (e.url, e.error)
+
+        return BeautifulSoup(page)
+
+    def _extract_results(self, soup):
+        results = soup.findAll('div', {'class': 'g'})
+        ret_res = []
+        for result in results:
+            eres = self._extract_result(result)
+            if eres:
+                ret_res.append(eres)
+        return ret_res
+
+    def _extract_result(self, result):
+        title, url = self._extract_title_url(result)
+        display_url = self._extract_display_url(result)
+        desc = self._extract_description(result) # Warning: removes 'cite' from the result
+        if not title or not url or not display_url or not desc:
+            return None
+        return SponsoredLink(title, url, display_url, desc)
+
+    def _extract_title_url(self, result):
+        title_a = result.find('a')
+        if not title_a:
+            self._maybe_raise(SLParseError, "Title tag in sponsored link was not found", result)
+            return None, None
+        title = ''.join(title_a.findAll(text=True))
+        title = self._html_unescape(title)
+        url = title_a['href']
+        match = re.search(r'q=(http[^&]+)&', url)
+        if not match:
+            self._maybe_raise(SLParseError, "URL inside a sponsored link was not found", result)
+            return None, None
+        url = urllib.unquote(match.group(1))
+        return title, url
+
+    def _extract_display_url(self, result):
+        cite = result.find('cite')
+        if not cite:
+            self._maybe_raise(SLParseError, "<cite> not found inside result", result)
+            return None
+
+        return ''.join(cite.findAll(text=True))
+
+    def _extract_description(self, result):
+        cite = result.find('cite')
+        if not cite:
+            return None
+        cite.extract()
+
+        desc_div = result.find('div', {'class': 'line23'})
+        if not desc_div:
+            self._maybe_raise(SLParseError, "Description tag not found in sponsored link", result)
+            return None
+
+        desc_strs = desc_div.findAll(text=True)[0:-1]
+        desc = ''.join(desc_strs)
+        desc = desc.replace("\n", " ")
+        desc = desc.replace("  ", " ") # collapse double spaces
+        return self._html_unescape(desc)
+
+    def _html_unescape(self, text):
+        """Replace numeric (&#NNN;) and named (&amp;) HTML entities in 'text'."""
+        def entity_replacer(m):
+            entity = m.group(1)
+            if entity in name2codepoint:
+                return unichr(name2codepoint[entity])
+            else:
+                return m.group(0)
+
+        def ascii_replacer(m):
+            cp = int(m.group(1))
+            if cp <= 255:
+                return unichr(cp)
+            else:
+                return m.group(0)
+
+        # compile with re.U; passing re.U as re.sub's 4th argument would set 'count'
+        s = re.compile(r'&#(\d+);', re.U).sub(ascii_replacer, text)
+        return re.compile(r'&([^;]+);', re.U).sub(entity_replacer, s)
+
199 xgoogle/translate.py
@@ -0,0 +1,199 @@
+#!/usr/bin/python
+#
+# Peteris Krumins (peter@catonmat.net)
+# http://www.catonmat.net -- good coders code, great reuse
+#
+# http://www.catonmat.net/blog/python-library-for-google-translate/
+#
+# Code is licensed under MIT license.
+#
+
+from browser import Browser, BrowserError
+from urllib import quote_plus
+import simplejson as json
+
+
+class TranslationError(Exception):
+    pass
+
+class Translator(object):
+    translate_url = "http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=%(message)s&langpair=%(from)s%%7C%(to)s"
+
+    def __init__(self):
+        self.browser = Browser()
+
+    def translate(self, message, lang_to='en', lang_from=''):
+        """
+        Given a 'message' translate it from 'lang_from' to 'lang_to'.
+        If 'lang_from' is empty, auto-detects the language.
+        Returns the translated message.
+        """
+
+        if lang_to not in _languages:
+            raise TranslationError, "Language %s is not supported as lang_to." % lang_to
+        if lang_from not in _languages and lang_from != '':
+            raise TranslationError, "Language %s is not supported as lang_from." % lang_from
+
+        message = quote_plus(message)
+        real_url = Translator.translate_url % { 'message': message,
+                                                'from': lang_from,
+                                                'to': lang_to }
+
+        try:
+            translation = self.browser.get_page(real_url)
+            data = json.loads(translation)
+
+            if data['responseStatus'] != 200:
+                raise TranslationError, "Failed translating: %s" % data['responseDetails']
+
+            return data['responseData']['translatedText']
+        except BrowserError, e:
+            raise TranslationError, "Failed translating (getting %s failed): %s" % (e.url, e.error)
+        except ValueError, e:
+            raise TranslationError, "Failed translating (json failed): %s" % e.message
+        except KeyError, e:
+            raise TranslationError, "Failed translating, response didn't contain the translation"
+
+        return None
+
+class DetectionError(Exception):
+    pass
+
+class Language(object):
+    def __init__(self, lang, confidence, is_reliable):
+        self.lang_code = lang
+        self.lang = _languages[lang]
+        self.confidence = confidence
+        self.is_reliable = is_reliable
+
+    def __repr__(self):
+        return '<Language: %s (%s)>' % (self.lang_code, self.lang)
+
+class LanguageDetector(object):
+    detect_url = "http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&q=%(message)s"
+
+    def __init__(self):
+        self.browser = Browser()
+
+    def detect(self, message):
+        """
+        Given a 'message', detect its language.
+        Returns a Language object.
+        """
+
+        message = quote_plus(message)
+        real_url = LanguageDetector.detect_url % { 'message': message }
+
+        try:
+            detection = self.browser.get_page(real_url)
+            data = json.loads(detection)
+
+            if data['responseStatus'] != 200:
+                raise DetectionError, "Failed detecting language: %s" % data['responseDetails']
+
+            rd = data['responseData']
+            return Language(rd['language'], rd['confidence'], rd['isReliable'])
+
+        except BrowserError, e:
+            raise DetectionError, "Failed detecting language (getting %s failed): %s" % (e.url, e.error)
+        except ValueError, e:
+            raise DetectionError, "Failed detecting language (json failed): %s" % e.message
+        except KeyError, e:
+            raise DetectionError, "Failed detecting language, response didn't contain the necessary data"
+
+        return None
+
+
+_languages = {
+ 'af': 'Afrikaans',
+ 'sq': 'Albanian',
+ 'am': 'Amharic',
+ 'ar': 'Arabic',
+ 'hy': 'Armenian',
+ 'az': 'Azerbaijani',
+ 'eu': 'Basque',
+ 'be': 'Belarusian',
+ 'bn': 'Bengali',
+ 'bh': 'Bihari',
+ 'bg': 'Bulgarian',
+ 'my': 'Burmese',
+ 'ca': 'Catalan',
+ 'chr': 'Cherokee',
+ 'zh': 'Chinese',
+ 'zh-CN': 'Chinese_simplified',
+ 'zh-TW': 'Chinese_traditional',
+ 'hr': 'Croatian',
+ 'cs': 'Czech',
+ 'da': 'Danish',
+ 'dv': 'Dhivehi',
+ 'nl': 'Dutch',
+ 'en': 'English',
+ 'eo': 'Esperanto',
+ 'et': 'Estonian',
+ 'fi': 'Finnish',
+ 'fr': 'French',
+ 'gl': 'Galician',
+ 'ka': 'Georgian',
+ 'de': 'German',
+ 'el': 'Greek',
+ 'gn': 'Guarani',
+ 'gu': 'Gujarati',
+ 'iw': 'Hebrew',
+ 'hi': 'Hindi',
+ 'hu': 'Hungarian',
+ 'is': 'Icelandic',
+ 'id': 'Indonesian',
+ 'iu': 'Inuktitut',
+ 'ga': 'Irish',
+ 'it': 'Italian',
+ 'ja': 'Japanese',
+ 'kn': 'Kannada',
+ 'kk': 'Kazakh',
+ 'km': 'Khmer',
+ 'ko': 'Korean',
+ 'ku': 'Kurdish',
+ 'ky': 'Kyrgyz',
+ 'lo': 'Laothian',
+ 'lv': 'Latvian',
+ 'lt': 'Lithuanian',
+ 'mk': 'Macedonian',
+ 'ms': 'Malay',
+ 'ml': 'Malayalam',
+ 'mt': 'Maltese',
+ 'mr': 'Marathi',
+ 'mn': 'Mongolian',
+ 'ne': 'Nepali',
+ 'no': 'Norwegian',
+ 'or': 'Oriya',
+ 'ps': 'Pashto',
+ 'fa': 'Persian',
+ 'pl': 'Polish',
+ 'pt-PT': 'Portuguese',
+ 'pa': 'Punjabi',
+ 'ro': 'Romanian',
+ 'ru': 'Russian',
+ 'sa': 'Sanskrit',
+ 'sr': 'Serbian',
+ 'sd': 'Sindhi',
+ 'si': 'Sinhalese',
+ 'sk': 'Slovak',
+ 'sl': 'Slovenian',
+ 'es': 'Spanish',
+ 'sw': 'Swahili',
+ 'sv': 'Swedish',
+ 'tg': 'Tajik',
+ 'ta': 'Tamil',
+ 'tl': 'Tagalog',
+ 'te': 'Telugu',
+ 'th': 'Thai',
+ 'bo': 'Tibetan',
+ 'tr': 'Turkish',
+ 'uk': 'Ukrainian',
+ 'ur': 'Urdu',
+ 'uz': 'Uzbek',
+ 'ug': 'Uighur',
+ 'vi': 'Vietnamese',
+ 'cy': 'Welsh',
+ 'yi': 'Yiddish'
+}
