Skip to content
This repository has been archived by the owner on Mar 5, 2022. It is now read-only.

OO refresh #99

Merged
merged 15 commits into from
Jun 5, 2016
Merged

OO refresh #99

merged 15 commits into from
Jun 5, 2016

Conversation

zmwangx
Copy link
Collaborator

@zmwangx zmwangx commented May 30, 2016

This is a huge WIP, intended to group and modularize googler's various functional units, hence lifting googler from its quick-and-dirty origin. (Starting out quick-and-dirty is totally fine, but at some point we need to rethink the structure in order to not dissolve into a pile of unmaintainable mess when we add more features. That's just my opinion.)

The guiding principle here is to break down big functional units and dissociate state info so that every small blob of code should make as much sense as possible locally, instead of relying on memorizing the effect of another blob that is two hundred lines away. (Bad example: url = url.replace('start=%d&' % oldstart, 'start=%d&' % start, 1), which relies on storing a snapshot of start in oldstart many lines before actually using it; this sort of disconnectedness could be a source of bugs when part of the program is updated.) When we do need persistent state info, they should be packaged into well-defined and well-documented OO interfaces, complete with relevant state manipulation helpers.

A brief and incomplete checklist:

  • GooglerArgumentParser: absorb type guards;
  • GoogleUrl: URL constructor;
  • GoogleConnection: handle connection opening/renewal and page fetching (including redirection) (should absorb google_get, new_connection and a big chunk of fetch_results);
  • GooglerCmd: OO command line interpreter and executioner (model on cmd.Cmd?).

Classes should be global-free (except constants and logger):

  • GoogleUrl
  • GoogleConnection
  • GoogleParser
  • Resultlucky should be dealt with outside Result; colors, columns etc. that affect the output should be class variables; json should be dealt with inside Result; oh, and think of a better name for urlindex (urltable maybe?)...
  • GooglerCmd
  • GooglerArgumentParser

I'm busy recently so it will take a while.

The finished parts could be reviewed and/or regression-tested and I'd like to have feedback.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

By the way, this is laying groundwork for both "Option-related improvements" and "Refactor and write unit tests" on the todo list.

@jarun
Copy link
Owner

jarun commented May 30, 2016

I'm busy recently so it will take a while.

Please take your time. This is massive improvement over what we have today.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

Pushed another commit, implementing GoogleConnection. That wraps up this week's work.

@jarun
Copy link
Owner

jarun commented May 30, 2016

👍 get some rest.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

👍 get some rest. Pick up another piece of work.

FTFY 😛

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

Seems that Google raised their bar of suspicious activity, or we have some unknown competitor testing Google search on Travis at the same time...

@jarun
Copy link
Owner

jarun commented May 30, 2016

Seems that Google raised their bar of suspicious activity

We can experiment with the cookie stuff our user share on Gitter. I'll check it next weekend.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

We can experiment with the cookie stuff our user share on Gitter.

You mean grab a cookie from the browser?

  1. I don't think spamming hundreds requests with cookies is not suspicious from Google's point of view;
  2. Even if it does prevent Google's block, it only works for so long before it expires;
  3. I would be hesitant to make my cookies (signed in) public, because that may open up a channel for session hijacking attacks against my Google account.

It doesn't hurt to try though.

@jarun
Copy link
Owner

jarun commented May 30, 2016

No no... This one: https://docs.python.org/3.0/library/http.cookiejar.html

Now we are all-Python3. No double maintenance.
If we do it, we handle ourselves.

@jarun
Copy link
Owner

jarun commented May 30, 2016

I don't think spamming hundreds requests with cookies is not suspicious from Google's point of view

We can check this by firing 100 requests from the browser. I can't do it right now.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

Oh, you mean just keeping the cookies? Not sure that would help. I would say keeping the cookies would be more suspicious, because you're clearly spamming from the same session, rather than having multiple people searching from the same IP, which is more excusable.

What I had in mind was cookies from a signed-in session, because I've never been greeted by a reCAPTCHA when I'm signed in, but occasionally do when I search from Incognito/Private, especially when using my PIA VPN.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

What I had in mind was cookies from a signed-in session

Obviously we don't want to introduce signing in.

Anyway, it's not a big deal. Also, my plans for testing would render this spamming a thing of the past.

@jarun
Copy link
Owner

jarun commented May 30, 2016

Not sure that would help.

This is what I wanna test, from a fresh, cookie-cleared browser session.

Obviously we don't want to introduce signing in.

We can't. Google has added restrictions recently. Many third-party clients can't anymore.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

We can't.

Plus that hardly helps our users. Only helps with automated testing.

@zmwangx
Copy link
Collaborator Author

zmwangx commented May 30, 2016

This is what I wanna test, from a fresh, cookie-cleared browser session.

And I just tried.

for i in {1..500}; do echo -n $'\r'$i; chrome-cli open 'https://google.com/search?q=hello' >/dev/null; sleep 1; chrome-cli close; done

All pages are opened in a fresh Incognito session (same session). Got reCAPTCHA'ed on the ~100th request.

@jarun
Copy link
Owner

jarun commented May 30, 2016

Got ReCAPTCHA'ed on the ~100th request.

LoL!!! Doesn't help then.

@zmwangx zmwangx force-pushed the oo-refresh branch 3 times, most recently from 74b59c6 to cbcd55c Compare June 2, 2016 00:43
This is Python 3, dude!
- Make previously global type guard functions into static methods of
  GooglerArgumentParser (cleaner global namespace, easier-to-locate type
  guards);

- Add two new type guards: positive_int: for --num; nonnegative_int: for
  --start (technically a breaking change, realistic no one should ever
  type in a negative start index or a non-positive number of results per
  page).

- Drop Python 2 legacy (super).
The performance gain of compiling r'\s+' is negligible, which is
completely outweighed by the awkwardness of the variable and the dev
time wasted in looking for the definition of another variable (although
it could be easily located by automated tools).
The URL construction interface is now fully OO. It is to some extend
modeled on urllib.parse.ParseResult for familiarity, but it takes care
of all aspects of URL construction (even pagination) and defines a
easy-to-use and flexity interface with properties and methods. See the
docstring of the GoogleUrl class for details, as well as the docstrings
of individual user-facing methods.

As a side effect, we have reduced the number of globals, which are now
grouped into three categories: state carriers (which, as the name
suggests, carry global states), output/user interaction options, and
cosmetic options.
move GooglerArgumentParser to the end of classes.

- Logically GoogleUrl should precede GoogleParser because we can't parse
  anything if we haven't constructed the request URL in the first place.

  Once we have GoogleConnection (OO version of the current conn global),
  it will go between GoogleUrl and GoogleParser.

- GooglerArgumentParser is a class.
GoogleConnection absorbed new_connection, google_get, and the connecting
and fetching parts of fetch_results. The new OO interface is nicely
abstract and fully documented.

Note that we have now registered an exit handler for closing the
connection. This way we can exit the program at any point using sys.exit
instead of our kind of awkward quit_program.

There's one breaking change which should not have much of a consequence:
we no longer modify the host segment of the global URL (now managed by
google_url) after a redirection. This is partly stipulated by the OO
interface (GoogleConnection class should not have access to the global
google_url) and partly because users should get what they specified when
they ask for the URL (with the 'o' command, for instance), not what
Google forced on them with a redirection.
@zmwangx zmwangx force-pushed the oo-refresh branch 2 times, most recently from d30b4a2 to ca33fda Compare June 2, 2016 21:25
1. I'm trying to eliminate globals. `debug' is a non-idiomatic global,
   whereas `logger' is idiomatic.
2. I cringe at `debugp'.
3. Having to maintain home-grown messaging functions sucks.

Also moving import tempfile and import json to where they're actually
used for maintainability. Doing the same to webbrowser because it's also
an optional component that's only used once.
We already have a global variable `colors' which we can test for
truthiness, so another boolean global is totally unnecessary.

Note that colors will eventually be absorbed into class variables of
classes that rely on it (Result and GooglerCmd), but for now it is kept.
immediately after constants.

These global helpers are called from classes, so logically they should
appear first.
The `columns' global variable has been eliminated (when I say eliminated
I mean it can be wrapped into a main function and is referenced nowhere
else). `colors' is currently used in `read_next_command' but will be
eliminated once we pack up the REPL.

Also restructured the opt handling part a bit to make them more
organized and more sensible.
instead of Result.print_entry.

Opening the first result really has nothing to do with "print entry".

While we're at it, we
- Make --lucky imply --noprompt. (Previously, there was a pointless
  blank line printed due to not in noninteractive mode.)
- Print a warning of "No results" when no results were found unless in
  JSON mode (where an empty array should be obvious enough).
The name "urlindex" is so weird, considering it is a dict of index: url
pairs.
This way we
- Better adapt to terminal resize;
- Get rid of one kind of awkward class variable.

We also document the class variables of the Result class.

Note that os.get_terminal_size is *fast*. I did some benchmarking:

    termsize.py:

    #!/usr/bin/env python3
    import os
    import sys

    try:
        count = int(sys.argv[1])
    except:
        count = 1

    for _ in range(count):
        try:
            columns, lines = os.get_terminal_size()
        except:
            pass

With Python 3.5.1 on OS X 10.11.5 on my Mid-2015 15'' rMBP with 2.5 GHz
quad-core Core i7, 1E7 calls take this long:

    zsh:

    > time (./termsize.py 10000000)
    ( ./termsize.py 10000000; )  5.02s user 3.22s system 99% cpu 8.263 total
    > time (./termsize.py 10000000 &>/dev/null </dev/null)
    ( ./termsize.py 10000000 &> /dev/null < /dev/null; )  9.13s user 2.50s system 99% cpu 11.667 total
    > unset COLUMNS LINES; time (./termsize.py 10000000)
    ( ./termsize.py 10000000; )  5.18s user 3.25s system 99% cpu 8.458 total

    bash:

    > time (./termsize.py 10000000)

    real	0m8.459s
    user	0m5.186s
    sys	        0m3.246s
    > time (./termsize.py 10000000 &>/dev/null </dev/null)

    real	0m13.320s
    user	0m10.363s
    sys	        0m2.899s
    > unset COLUMNS LINES; time (./termsize.py 10000000)

    real	0m8.177s
    user	0m5.006s
    sys	        0m3.152s
There's one breaking change:

- The `n' command now navigates to the next page even when the current
  page seems to be empty. The reason is not being able to find any
  result (that matches our criteria) on the page doesn't necessarily
  mean it is the last page, especially when --count is small.

Other improvements other than the obvious include

- Global variables are finally eliminated, and program no longer has
  side effects;
- Argument parsing code has been factored into a function parse_args()
  to be forward-looking;
- All exceptions are now captured. In non-debug mode we only print the
  error message if there's an otherwise uncaught exception; in debug
  mode we re-raise to produce a traceback.
Previously I forgot to handle the `site' key in GoogleUrl.update().

Also ordered keys in update (we need some easy-to-follow order so that
we won't need to worry about where to insert a new key in the future).
@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 5, 2016

Lazy import done. Rebasing was kind of a nightmare.

Any other comments or replies to my comments?

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 5, 2016

Rebasing this on master won't be fun, because that cyan change would result in amending almost every single commit. If you're happy with this branch, I'll do a manual merge instead, which is easy.

@jarun
Copy link
Owner

jarun commented Jun 5, 2016

Lazy import done. Rebasing was kind of a nightmare.

I guessed it but I really crave for speed.

I'll do a manual merge instead, which is easy.

Go as you like :)

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 5, 2016

but I really crave for speed.

Needed to rebase anyway. No problem.

@zmwangx zmwangx merged commit baedc3b into jarun:master Jun 5, 2016
@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 5, 2016

Merged.

@zmwangx zmwangx deleted the oo-refresh branch June 5, 2016 08:18
@jarun
Copy link
Owner

jarun commented Jun 5, 2016

👍

@jarun
Copy link
Owner

jarun commented Jun 9, 2016

I'm planning to revert this commit. Earlier I was thinking of supporting all options at the prompt and this commit would help with that but I have dumped the plan.

This commit takes re-directions to a new level. Just to figure out where program args (opts) are being parsed I had to go though 4 functions where opts is being passed around!

Let's keeps things simpler. So much for object-oriented.

@jarun
Copy link
Owner

jarun commented Jun 9, 2016

OK, that came in with GooglerCmd.

@jarun
Copy link
Owner

jarun commented Jun 9, 2016

And we couldn't live without re???

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 9, 2016

Earlier I was thinking of supporting all options at the prompt and this commit would help with that but I have dumped the plan.

I think it really makes sense to support that. But if you don't want to, we can dump GooglerCmd. Other classes can stay.

This commit takes re-directions to a new level.

Eliminating side effects come with a price. Otherwise something like devbin/parse would not work (or it could work, but you have global everywhere, which would be insane).

Just to figure out where program args (opts) are being parsed I had to go though 4 functions where opts is being passed around!

Dynamic URL construction makes sense even if you don't want to support dynamically updating options during the execution. Previously you need to understand all URL manipulation code scattered around just to make sure your redirection code won't break anything (and you actually can't). Now the complexity has been contained, and you don't have to understand anything about URL construction unless you need to.

Plus, tracebacks in the OO setting are arguably more helpful. You have the entire detailed callstack, while you could also quickly isolate the issue, rather than worrying about states.

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 9, 2016

In short, if you confirm that the plan to support dynamic options has been dropped, then I'll go ahead and try to eliminate GooglerCmd, because that's the only class that's adding complexity (in addition to wrapping a few globals) if we don't want to actually achieve something with it.

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 9, 2016

Forgot to reply to re comment.

Using re to replace consecutive Unicode whitespace

  • Results in cleaner queries;
  • Makes it more flexible for the users who might type in whitespace characters other than the space (e.g., when you use a Chinese input method).

But it is by no means necessary.

@jarun
Copy link
Owner

jarun commented Jun 9, 2016

Other classes can stay.

For generating a URL (which could have been done easily with string manipulation and class variables) we have used dictionary, multiple levels of API calls to the extent that it becomes boring to understand the code... it's everywhere! This has now become a Python-internal DS usage demonstration rather than a utility one would want to go through and contribute to. Why couldn't we keep things simpler?

Dynamic URL construction makes sense even if you don't want to support dynamically updating options during the execution.

I'll get back to this but if this is what it takes to re-form and generate a URL, I'm not sure I'll carry it along.

googler doesn't have to be that complex a utility. We don't want to turn it into a full-fledged enterprise database management system loaded with cannons.

@zmwangx
Copy link
Collaborator Author

zmwangx commented Jun 9, 2016

we have used dictionary

Because URL query is a dictionary. It's the natural data structure to use.

multiple levels of API calls

To construct, you call the constructor with options; to update, you call update; to get the URL, you access the url property, or full() (same thing, just for clarity in certain use cases) or relative() depending on use case. Navigation is handled too. What is multi-leveled? If you don't need dynamic manipulation of options, just ignore some of the methods. You don't have to use all methods, and existence of a few possibly extraneous methods doesn't make the class more complex.

googler doesn't have to be that complex a utility.

Number of lines isn't equal to complexity. Complexity is nothing if it's contained and split up.

@mberg2007 mberg2007 mentioned this pull request Jun 9, 2017
@lock lock bot locked and limited conversation to collaborators Nov 15, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants