
Scrape fails be javascript (Mouser/Farnell) & web module enhancement #242

Closed
mmmaisel opened this issue Apr 22, 2018 · 11 comments
Assignees
Labels
bug Bugs that impact main KiCost functionality. discussion Discussion about implementation and new features. enhancement Improvements on features that already exist.

Comments

@mmmaisel (Contributor)

Scraping some Mouser parts fails with "Unknown error for from mouser" while other parts or distributors just work fine. This was tested with git revision e7e4243 on Ubuntu 16.04 64bit (Python 2.7) and Ubuntu 18.04 64bit (Python 3.6).

KiCost was invoked with the following command:

~/.local/bin/kicost -i test.xml -d99 -w -s --throttling_delay 1

PDB shows that the variables "html" and "tree" in distributors/mouser/mouser.py:178 contain some javascript garbage instead of the intended HTML source code.

The following minimal KiCad xml demonstrates the error:
test.xml.txt

Terminal output:
term_output.txt

@hildogjr (Owner)

I could not reproduce this error; for me the result was just fine: test.xlsx.
Waiting for more tests / someone else to confirm (I'm not a specialist in web / javascript).

@mmmaisel (Contributor, Author)

It looks like the root cause of this issue is that Mouser returns a captcha after scraping some parts.

I did some further experiments and figured out that if you correctly handle cookies, the captcha can be avoided. The following (partial?) solution worked for me if the -s and --throttling_delay=3 options are used:

  • Create a http.cookiejar object and a urllib.request.HTTPCookieProcessor and install it as global opener (urllib.request.install_opener).
  • Init the Request object with user-agent and keep this user-agent for the whole scraping process.
  • Access www.mouser.com without a search query and store all cookies for subsequent scraping passes.
  • Scrape all parts and update cookies after every part if required.

See attached mouser.py.txt for an example implementation of those hacks.
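A minimal sketch of those steps using only the Python 3 standard library; the search URL and user-agent string below are placeholders, not KiCost's actual code:

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener, install_opener, urlopen

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'  # chosen once and reused for the whole run

# Cookie jar installed as the global opener, as described in the list above.
cookie_jar = CookieJar()
install_opener(build_opener(HTTPCookieProcessor(cookie_jar)))

def fetch(url):
    # Same user-agent on every request; cookies are stored/updated in cookie_jar automatically.
    req = Request(url, headers={'User-Agent': USER_AGENT})
    return urlopen(req).read()

fetch('https://www.mouser.com/')  # warm-up request without a search query, stores the session cookies
html = fetch('https://www.mouser.com/Search/Refine.aspx?Keyword=LM358')  # later passes reuse them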

Now, the question is how to cleanly implement the following in KiCost's code:

  • Keep state information between calls to get_part_html_tree and between different threads.
  • Avoid global opener object, maybe use the opener object directly instead of urlopen.
  • Add support for the mouser preferences='ps=www2&pl=en-US&pc_www2=USDe' cookie (this is not supported by my hack; a sketch of pre-loading such a cookie follows this list).
  • Implement this for all distributors.
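For the preferences cookie specifically, one possible approach is to pre-load it into the cookie jar before the first request. This is illustrative only and not tested against Mouser; the domain and path values are assumptions:

from http.cookiejar import Cookie, CookieJar

def add_mouser_pref_cookie(jar):
    # Build the cookie by hand and inject it into the jar.
    cookie = Cookie(
        version=0, name='preferences', value='ps=www2&pl=en-US&pc_www2=USDe',
        port=None, port_specified=False,
        domain='.mouser.com', domain_specified=True, domain_initial_dot=True,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )
    jar.set_cookie(cookie)

jar = CookieJar()
add_mouser_pref_cookie(jar)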

@hildogjr (Owner)

@mmmaisel thanks for the debugging and discussion. We (@devbisme and I) know that this multiple sequential scraping could be interpreted as a robot and blocked by the distributors.
Debugging by users like this helps KiCost to improve.

(Since the last code modifications in #223, something could be implemented in fake_browser(foo).)

@mmmaisel (Contributor, Author)

@hildogjr yes, fake_browser (as a class) would be a good place for storing various things like the cookiejar, selected user_agent, domain name, urllib opener, scrape retries and so on.
However, this fake_browser object has to be passed to every function that accesses URLs, like get_part_html_tree and the various AJAX-using sub-functions of some distributors. It also has to be instantiated somewhere before the scraping starts.
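A rough sketch of such a stateful fake_browser; the class and method names are illustrative, not the final API:

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

class FakeBrowser:
    # Holds everything that must survive between requests to one distributor:
    # cookie jar, fixed user-agent, domain, opener and the retry limit.
    def __init__(self, domain, user_agent, retries=3):
        self.domain = domain
        self.user_agent = user_agent
        self.retries = retries
        self.cookie_jar = CookieJar()
        self.opener = build_opener(HTTPCookieProcessor(self.cookie_jar))

    def get(self, url):
        req = Request(url, headers={'User-Agent': self.user_agent})
        return self.opener.open(req).read()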

At this point, I think the code needs some redesign, because currently KiCost first iterates over all parts (possibly distributed over several threads) and then iterates over all known distributors for a single part (with a random user-agent and without cookies from previous requests). This needs to be changed to iterate over distributors first, then parts, to keep the state information.
I think one thread per distributor is enough, as the scrape operation is limited by network I/O speed and the throttling_delay setting, not by CPU speed. Allowing multiple threads per distributor would only increase state synchronization complexity and the risk of getting banned.
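A sketch of the "one thread per distributor" idea using only the standard library; the distributor objects are assumed to expose a scrape_part() method, and all names are illustrative:

from concurrent.futures import ThreadPoolExecutor

def scrape_distributor(dist, parts):
    # All requests of one distributor stay in this single thread, so its cookies and
    # user-agent never need to be shared or locked between threads.
    return [dist.scrape_part(p) for p in parts]

def scrape_all(distributors, parts):
    with ThreadPoolExecutor(max_workers=len(distributors)) as pool:
        futures = {pool.submit(scrape_distributor, d, parts): d for d in distributors}
        return {dist.name: fut.result() for fut, dist in futures.items()}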

Maybe it would also make sense to change all distributor modules into classes which hold all required state information and common values like the fake_browser object, currency settings and maybe others. This would reduce the number of function arguments that have to be passed around as well.

While I'm not a Python specialist, I can help implement the changes if desired.

@hildogjr (Owner) commented May 30, 2018

@mmmaisel nice to hear some tips. I am not an expert either, but any standardization that helps the next user / developer to add and correct features is welcome.
When I joined the KiCost project I started talking with @devbisme and we split the code into several modules (before that, KiCost was "like one big code file"). This allowed us to improve the spreadsheet, expand to more EDAs, add the multiple-parts component...
Since web scraping is not my strongest Python skill, I am focused on the other parts of the code (spreadsheet, BOM readers, BOM merging, ...).

If @devbisme (owner of the package) agrees and passes us some directives, we could use the pickle_module or a reformulation_classes branch in my Git for this development, to be released when stable.

In my view, the point of this implementation (plus your tips) is to allow enhancements #4 and #65 (in beta on Digikey). Also, if it allows searching for a part by "value/footprint/tolerance/and so on" and picking, for example, the cheapest match, it could help to solve #17 in the future (for resistors and capacitors at least).

Yes, I agree, the distributor modules are now the "bottleneck" of KiCost.

@hildogjr hildogjr added bug Bugs that impact main KiCost functionality. enhancement Improvements on features that already exist. discussion Discussion about implementation and new features. labels May 31, 2018
@hildogjr hildogjr changed the title from "Mouser scraping fails for some parts" to "Scrape fails be javascript (Mouser/Farnell) & web module enhancement" on May 31, 2018
@hildogjr (Owner)

@mmmaisel, what do you think:

  1. A main class with the browser definitions, maximum number of retries, cookies, and the AJAX / JavaScript plugins used to interpret pages (beautifulsoup4 also);
  2. Each distributor inherits from (1) and adds language, currency, and the method used to match among table results (closest match as now, or cheapest, prioritizing parts recommended for layout); see the sketch after this list;
  3. Change the way of iterating, if possible (first by distributor, then by part), as you said, to decrease the ban risk;
  4. Return extra data from the component page (footprint, image link, ...), as already done for Digikey;
  5. The distributor classes have to keep the automatic-import functionality (today, just adding a subfolder with an __init__.py under distributors is recognized as a new distributor module).
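A rough sketch of points (1)-(3); every name here is illustrative, not the final KiCost API:

class DistributorBase:
    # Point (1): common browser definitions, retry limit and cookie state shared by one distributor.
    def __init__(self, name, domain, max_retries=3):
        self.name = name
        self.domain = domain
        self.max_retries = max_retries

    def scrape_part(self, part):
        raise NotImplementedError  # each distributor implements its own page handling

class Mouser(DistributorBase):
    # Point (2): the subclass adds language, currency and its own matching policy.
    def __init__(self):
        super().__init__('mouser', 'https://www.mouser.com')
        self.language = 'en-US'
        self.currency = 'USD'

    def scrape_part(self, part):
        return {'dist': self.name, 'part': part}  # placeholder result

# Point (3): iterate distributors first, then parts, so per-distributor state is preserved.
for dist in (Mouser(),):
    for part in ('LM358', 'NE555'):
        dist.scrape_part(part)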

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 1, 2018
Goals are:
- Add stateful user-agent and cookie handling in fake_browser
- Scrape parts in distributor then part order
- Use a class inheritance approach to distributor modules, this allows
  adding state information and reduces the number of variables passed around.
- One scraping thread per distributor, simplify locking

Implemented in this commit:
- Used class approach to distributors and fake_browser
- Parts are scraped in distributor -> part order
- One (IO limited) python thread per distributor
- Simplified locking
@mmmaisel (Contributor, Author) commented Jun 1, 2018

I started implementing the changes in my local fork.
First tests with mouser work. However, there may be new bugs introduced by refactoring.

Currently implemented are:

  • Class approach for all distributors and fake_browser.
  • Parts are scraped in distributor -> part order.
  • One (IO limited) python thread per distributor and simplified locking.
  • Automatic module import is now handled from kicost.py (see the sketch after this list).
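A possible shape of that automatic import from kicost.py, assuming the distributors live as subpackages of kicost.distributors (illustrative, not the exact code):

import importlib
import pkgutil

import kicost.distributors

# Import every subpackage found under kicost/distributors/ so that simply adding a new
# subfolder with an __init__.py is enough to register a new distributor module.
for mod_info in pkgutil.iter_modules(kicost.distributors.__path__):
    importlib.import_module('kicost.distributors.' + mod_info.name)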

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 1, 2018
@hildogjr hildogjr assigned hildogjr and mmmaisel and unassigned xesscorp Jun 2, 2018
@hildogjr (Owner) commented Jun 2, 2018

It is reaching a new level of code organization, nice. I will test it to keep us updated about compatibility and other issues.
@mmmaisel, it would also be interesting to make it possible to import the distributor class in the Python terminal and do a "scrape by PART1"; this will make future debugging easier.
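Something like the following session is what that would enable; the import path and class name are assumptions about the new layout, not the actual module structure:

# Hypothetical debug usage; the module path and class name are assumptions.
from kicost.distributors.mouser import Mouser

dist = Mouser()
result = dist.scrape_part('PART1')  # scrape a single part directly, without a full BOM run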

Tip: handle the difference between the py2 / py3 imports like this:

from sys import version_info as py_version
if py_version >= (3, 0):
    import py3  # Python 3 specific module
else:
    import py2  # Python 2 specific module

This makes it possible to raise the import error (instead of using try/except around the import to deal with the py2/3 difference) and may be more pythonic, as used in some files already revised, line 70.

Just some other guidance about the dependencies used: pycountry gives a country-currency dictionary and, in the future, a correlation to use in the configuration; CurrencyConverter is used to convert the price to a currency not available at that distributor.
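A quick sketch of how those two dependencies are typically used; the currency codes here are just examples:

import pycountry
from currency_converter import CurrencyConverter

# pycountry resolves currency codes/names that can be used in the configuration.
eur = pycountry.currencies.lookup('EUR')
print(eur.alpha_3, eur.name)  # 'EUR', 'Euro'

# CurrencyConverter translates a price into a currency the distributor does not list.
cc = CurrencyConverter()
print(round(cc.convert(12.34, 'USD', 'EUR'), 2))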

@hildogjr (Owner) commented Jun 2, 2018

@mmmaisel, it appears that the Farnell issue was fixed on your branch.
I just have these issues / reports:

  1. KiCost is spending almost 30 s to start scraping on my computer (maybe related to the class importation and initial configuration?);
  2. Digikey is not working (it always returns empty). But I saw that there is a #TODO in the file.
  3. TME got this error:

Traceback (most recent call last):
  File "/home/h/.local/bin/kicost", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/kicost/main.py", line 274, in main
    local_currency=args.currency)
  File "/usr/local/lib/python3.5/dist-packages/kicost/kicost.py", line 315, in kicost
    res_dist = res_proc.get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/kicost/kicost.py", line 300, in mt_scrape_part
    retval.append(inst.scrape_part(i, parts[i]))
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/distributor.py", line 160, in scrape_part
    qty_avail = self.dist_get_qty_avail(html_tree)
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/tme/tme.py", line 149, in dist_get_qty_avail
    ajax_tree, qty_str = self.__ajax_details(pn)
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/tme/tme.py", line 76, in __ajax_details
    r = r.decode('utf-8') # Convert bytes to string in Python 3.
UnboundLocalError: local variable 'r' referenced before assignment

@mmmaisel (Contributor, Author) commented Jun 3, 2018

@hildogjr

  1. Init time issue: should be fixed now.
  2. I'm now working on the digikey issue.
  3. The TME "use of undefined local variable 'r'" error was already there before my refactoring (see line 66); a generic sketch of the fix is below.
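Without quoting the actual tme.py code, the generic shape of that bug and a guard for it looks roughly like this (illustrative only, not KiCost's implementation):

import urllib.error
import urllib.request

def ajax_details(url):
    try:
        r = urllib.request.urlopen(url).read()
    except urllib.error.URLError:
        return None  # bail out here instead of falling through with 'r' never assigned
    return r.decode('utf-8')  # safe: 'r' is always bound when this line runs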

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 4, 2018
@hildogjr (Owner) commented Jun 6, 2018

Fixed on #242, merging soon.
This discussion can be reopened at any moment for more improvements.

@hildogjr hildogjr closed this as completed Jun 6, 2018