This repository has been archived by the owner on Mar 5, 2022. It is now read-only.

No result with Google's experimental layout #103

Closed
zmwangx opened this issue Jun 4, 2016 · 3 comments · Fixed by #107
@zmwangx
Collaborator

zmwangx commented Jun 4, 2016

As we all know, Google experiments with stuff all the time. Here's a layout that they've been experimenting with lately, which I've seen a few times:

[Screenshot of the experimental layout, 2016-06-04, 11:00:11 AM]

Sample response: https://git.io/vofW5

Our parser can't parse any result from this because the result wrapper here is <div class="g card-section">, and we've been limiting ourselves to precisely <div class="g"> in order to get rid of the occasional top card.

I'll see what I can do when I have time. (The basic idea is to open up the class restriction, and somehow still filter out the top card? I thought about enforcing nonempty abstracts, but that's not a good criterion: if you google "google", for instance, the second result is Google's Twitter account, and there's no abstract for that. If we can't discern the top card from everything else, I guess we'll have to include it after all.)
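To make the failure mode concrete, here is a minimal sketch that is independent of googler's actual parser. It counts `<div>` wrappers under two rules: an exact `class="g"` match (what we do now) and a looser "has a `g` token" match. The names `WrapperCounter` and `count_wrappers` are made up for illustration and do not exist in googler.

```python
from html.parser import HTMLParser


class WrapperCounter(HTMLParser):
    """Count <div> elements under two class-matching rules (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.exact = 0  # class attribute is precisely "g"
        self.token = 0  # class attribute contains "g" as one of its tokens

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        # The attribute value is None for a bare `class` attribute.
        class_attr = dict(attrs).get('class') or ''
        if class_attr == 'g':
            self.exact += 1
        if 'g' in class_attr.split():
            self.token += 1


def count_wrappers(html):
    """Return (exact_matches, token_matches) for the given HTML string."""
    counter = WrapperCounter()
    counter.feed(html)
    return counter.exact, counter.token
```

Fed one traditional wrapper and one experimental wrapper, the exact rule sees only the first, while the token rule sees both. The token rule alone would presumably also let the top card back in, which is the open question above.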

@zmwangx zmwangx added the bug label Jun 4, 2016
@jarun
Owner

jarun commented Jun 4, 2016

I'll take a look tomorrow.

@zmwangx
Collaborator Author

zmwangx commented Jun 4, 2016

By the way, I'm thinking about adding some dev scripts to facilitate development. Current idea:

> tree devbin
devbin
├── __pycache__
│   └── googler.cpython-35.pyc
├── googler.py -> ../googler
└── parse

1 directory, 3 files

Content of parse:

#!/usr/bin/env python3
"""Parse saved Google response bodies and print the results as JSON."""

import argparse
import json

# devbin/googler.py is a symlink to ../googler, which makes it importable.
import googler

def main():
    argparser = argparse.ArgumentParser(description='Parse Google responses.')
    argparser.add_argument('-N', '--news', action='store_true',
                           help='parse as Google News responses')
    argparser.add_argument('files', nargs='+', metavar='FILE',
                           help="HTML file with Google's response body")
    args = argparser.parse_args()
    for fn in args.files:
        with open(fn, encoding='utf-8') as fp:
            # Use a fresh parser per file so results do not accumulate.
            htmlparser = googler.GoogleParser(news=args.news)
            htmlparser.feed(fp.read())
            results_object = [r.jsonizable_object() for r in htmlparser.results]
            print(json.dumps(results_object, indent=2, sort_keys=True, ensure_ascii=False))

if __name__ == '__main__':
    main()

Invocation:

> devbin/parse -h
usage: parse [-h] [-N] FILE [FILE ...]

Parse Google responses.

positional arguments:
  FILE        HTML file with Google's response body

optional arguments:
  -h, --help  show this help message and exit
  -N, --news  parse as Google News responses
> devbin/parse /Volumes/ramdisk/googler-response-2qmy246b  # Good
[
  {
    "abstract": "HELLO! Online brings you the latest celebrity & royal news from the UK & around the world, magazine exclusives, celeb babies, weddings, pregnancies and ...",
    "title": "HELLO! Online: celebrity & royal news, magazine, babies, weddings ...",
    "url": "http://www.hellomagazine.com/"
  },
  {
    "abstract": "Hello, it's me. I was wondering if after all these years you'd like to meet. To go over everything. They say that time's supposed to heal ya. But I ain't done much ...",
    "title": "ADELE LYRICS - Hello - A-Z Lyrics",
    "url": "http://www.azlyrics.com/lyrics/adele/hello.html"
  },
  {
    "abstract": "hello connects you with people and content around your passions. Show the world who you are, express what you love, and create meaningful connections.",
    "title": "hello network",
    "url": "http://www.hello.com/"
  },
  {
    "abstract": "\"Hello\" is a song by English singer Adele. It was released on 23 October 2015 by XL Recordings as the lead single from her third studio album, 25 (2015). Adele ...",
    "title": "Hello (Adele song) - Wikipedia, the free encyclopedia",
    "url": "https://en.wikipedia.org/wiki/Hello_(Adele_song)"
  },
  {
    "abstract": "Watch Hello by Adele online at vevo.com. Discover the latest music videos by Adele on Vevo.",
    "title": "Hello - Adele - Vevo",
    "url": "http://www.vevo.com/watch/adele/Hello/GBH481500074"
  },
  {
    "abstract": "Make your phone smarter with Hello. Built by Messenger just for Android, Hello combines info from Facebook with the contact info on your phone so it's easy to ...",
    "title": "Hello — Caller ID & Blocking - Android Apps on Google Play",
    "url": "https://play.google.com/store/apps/details?id=com.facebook.phone&hl=en"
  },
  {
    "abstract": "HELLO! 1779396 likes · 92804 talking about this. The official Facebook page for HELLO! magazine & http://www.hellomagazine.com/ FOLLOW US @...",
    "title": "HELLO! - Facebook",
    "url": "https://www.facebook.com/hello/"
  },
  {
    "abstract": "Listen to Hello in full in the Spotify app. Play on Spotify. ℗ 2015 XL Recordings Ltd., under exclusive license to Columbia Records, a Division of Sony Music ...",
    "title": "Hello by Adele on Spotify",
    "url": "https://open.spotify.com/album/1Eo1pg2D4beFgf3HFTJTMc"
  }
]
> devbin/parse /Volumes/ramdisk/googler-response-unjwqp8g  # Bad
[]

@jarun
Owner

jarun commented Jun 5, 2016

👍

zmwangx added a commit to zmwangx/googler that referenced this issue Jun 5, 2016
Fixes jarun#103.

By comparing https://git.io/vofW5 (experimental layout in jarun#103) and
https://git.io/voJgB (traditional layout), we see two things in common:

- The mnr-c class is used for cards, regardless of whether it is a smart
  result presumably produced by Google's deep neural network;

- The g-blk class (presumably standing for g-block) coupled with the g
  class (result wrapper) is used exclusively for smart cards, either at
  the top or on the right (sample query: paris).

Therefore, we now test for the presence of `g' instead of requiring
`g'-only, and exclude `g-blk'.
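
The rule described in this commit message can be sketched as a small predicate on the `class` attribute. This is an illustration under stated assumptions, not the code merged in #107: the function name `is_result_wrapper` is hypothetical, and the real parser may structure the check differently.

```python
def is_result_wrapper(class_attr):
    """Return True if a <div> class attribute looks like a regular result wrapper.

    Hypothetical helper: accept any class list containing the `g` token,
    but reject smart cards, which also carry the `g-blk` token.
    """
    classes = (class_attr or '').split()
    return 'g' in classes and 'g-blk' not in classes
```

Under this sketch, both `g` and `g card-section` are accepted, while any class list that includes `g-blk` is rejected, whether the card sits at the top or on the right.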
@jarun jarun closed this as completed in #107 Jun 6, 2016
@lock lock bot locked and limited conversation to collaborators Nov 15, 2019