
Scrape fails be javascript (Mouser/Farnell) & web module enhancement #242

Closed
mmmaisel opened this issue Apr 22, 2018 · 11 comments
Assignees
Labels
bug Bugs that impact main KiCost functionality. discussion Discussion about implementation and new features. enhancement Improvements on features that already exist.

Comments

@mmmaisel (Contributor)

Scraping some Mouser parts fails with "Unknown error for from mouser" while other parts or distributors just work fine. This was tested with git revision e7e4243 on Ubuntu 16.04 64bit (Python 2.7) and Ubuntu 18.04 64bit (Python 3.6).

KiCost was invoked with the following command:

~/.local/bin/kicost -i test.xml -d99 -w -s --throttling_delay 1

PDB shows that the variables "html" and "tree" in distributors/mouser/mouser.py:178 contain some javascript garbage instead of the intended HTML source code.

The following minimal KiCad xml demonstrates the error:
test.xml.txt

Terminal output:
term_output.txt

@hildogjr (Owner)

I could not reproduce this error; for me the result was just fine: test.xlsx.
Waiting for more tests / someone else to confirm (I'm not a specialist in web / javascript).

@mmmaisel (Contributor, Author)

It looks like the root cause of this issue is that Mouser returns a captcha after scraping some parts.

I did some further experiments and figured out that if you correctly handle cookies, the captcha can be avoided. The following (partial?) solution worked for me if the -s and --throttling_delay=3 options are used:

  • Create a http.cookiejar object and a urllib.request.HTTPCookieProcessor and install it as global opener (urllib.request.install_opener).
  • Init the Request object with user-agent and keep this user-agent for the whole scraping process.
  • Access www.mouser.com without a search query and store all cookies for subsequent scraping passes.
  • Scrape all parts and update cookies after every part if required.

See attached mouser.py.txt for an example implementation of those hacks.
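A minimal sketch of those steps using only the Python 3 standard library; the search URL and user-agent string below are placeholders, not KiCost's actual code:

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener, install_opener, urlopen

USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'  # chosen once and reused for the whole run

# Cookie jar installed as the global opener, as described in the list above.
cookie_jar = CookieJar()
install_opener(build_opener(HTTPCookieProcessor(cookie_jar)))

def fetch(url):
    # Same user-agent on every request; cookies are stored/updated in cookie_jar automatically.
    req = Request(url, headers={'User-Agent': USER_AGENT})
    return urlopen(req).read()

fetch('https://www.mouser.com/')  # warm-up request without a search query, stores the session cookies
html = fetch('https://www.mouser.com/Search/Refine.aspx?Keyword=LM358')  # later passes reuse them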

Now, the question is how to cleanly implement the following in KiCost's code:

  • Keep state information between calls to get_part_html_tree and between different threads.
  • Avoid global opener object, maybe use the opener object directly instead of urlopen.
  • Add support for the mouser preferences='ps=www2&pl=en-US&pc_www2=USDe' cookie (this is not supported by my hack; a sketch of pre-loading such a cookie follows this list).
  • Implement this for all distributors.
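For the preferences cookie specifically, one possible approach is to pre-load it into the cookie jar before the first request. This is illustrative only and not tested against Mouser; the domain and path values are assumptions:

from http.cookiejar import Cookie, CookieJar

def add_mouser_pref_cookie(jar):
    # Build the cookie by hand and inject it into the jar.
    cookie = Cookie(
        version=0, name='preferences', value='ps=www2&pl=en-US&pc_www2=USDe',
        port=None, port_specified=False,
        domain='.mouser.com', domain_specified=True, domain_initial_dot=True,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )
    jar.set_cookie(cookie)

jar = CookieJar()
add_mouser_pref_cookie(jar)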

@hildogjr (Owner)

@mmmaisel thanks for the debugging and discussion. We (@devbisme and I) know that this multiple sequential scraping could be interpreted as a robot and blocked by the distributors.
Debugging by users like this helps KiCost to improve.

(Since the last code modifications in #223, something could be implemented in fake_browser(foo).)

@mmmaisel (Contributor, Author)

@hildogjr yes, fake_browser (as a class) would be a good place for storing various things like the cookiejar, selected user_agent, domain name, urllib opener, scrape retries and so on.
However, this fake_browser object has to be passed to every function that accesses URLs, like get_part_html_tree and the various AJAX-using sub-functions of some distributors. It also has to be instantiated somewhere before the scraping starts.
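A rough sketch of such a stateful fake_browser; the class and method names are illustrative, not the final API:

from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener

class FakeBrowser:
    # Holds everything that must survive between requests to one distributor:
    # cookie jar, fixed user-agent, domain, opener and the retry limit.
    def __init__(self, domain, user_agent, retries=3):
        self.domain = domain
        self.user_agent = user_agent
        self.retries = retries
        self.cookie_jar = CookieJar()
        self.opener = build_opener(HTTPCookieProcessor(self.cookie_jar))

    def get(self, url):
        req = Request(url, headers={'User-Agent': self.user_agent})
        return self.opener.open(req).read()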

At this point, I think the code needs some redesign, because currently KiCost first iterates over all parts (possibly distributed over several threads) and then iterates over all known distributors for a single part (with a random user-agent and without cookies from previous requests). This needs to be changed to iterate over distributors first, then parts, to keep the state information.
I think one thread per distributor is enough, as the scrape operation is limited by network I/O speed and the throttling_delay setting, not by CPU speed. Allowing multiple threads per distributor would only increase state synchronization complexity and the risk of getting banned.
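A sketch of the "one thread per distributor" idea using only the standard library; the distributor objects are assumed to expose a scrape_part() method, and all names are illustrative:

from concurrent.futures import ThreadPoolExecutor

def scrape_distributor(dist, parts):
    # All requests of one distributor stay in this single thread, so its cookies and
    # user-agent never need to be shared or locked between threads.
    return [dist.scrape_part(p) for p in parts]

def scrape_all(distributors, parts):
    with ThreadPoolExecutor(max_workers=len(distributors)) as pool:
        futures = {pool.submit(scrape_distributor, d, parts): d for d in distributors}
        return {dist.name: fut.result() for fut, dist in futures.items()}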

Maybe it would also make sense to change all distributor modules into classes which hold all required state information and common values like the fake_browser object, currency settings and maybe others. This would reduce the number of function arguments that have to be passed around as well.

While I'm not a Python specialist, I can help implement the changes if desired.

@hildogjr (Owner) commented May 30, 2018

@mmmaisel nice to hear some tips. I am not an expert either, but any standardization that helps the next user / developer to add and correct features is welcome.
When I joined the KiCost project I started talking with @devbisme and we split the code into several modules (before that, KiCost was "like one big code file"). This allowed us to improve the spreadsheet, expand to more EDAs, add the multiple-parts component...
Since web scraping is not my strongest Python skill, I am focused on the other parts of the code (spreadsheet, BOM readers, BOM merging, ...).

If @devbisme (owner of the package) agrees and passes us some directives, we could use the pickle_module or a reformulation_classes branch in my Git for this development, to be released when stable.

In my view, the point of this implementation (plus your tips) is to allow enhancements #4 and #65 (in beta on Digikey). Also, if it allows searching for a part by "value/footprint/tolerance/and so on" and picking, for example, the cheapest match, it could help to solve #17 in the future (for resistors and capacitors at least).

Yes, I agree, the distributor modules are now the "bottleneck" of KiCost.

@hildogjr hildogjr added bug Bugs that impact main KiCost functionality. enhancement Improvements on features that already exist. discussion Discussion about implementation and new features. labels May 31, 2018
@hildogjr hildogjr changed the title from "Mouser scraping fails for some parts" to "Scrape fails be javascript (Mouser/Farnell) & web module enhancement" on May 31, 2018
@hildogjr (Owner)

@mmmaisel, what do you think:

  1. A main class with the browser definitions, maximum number of retries, cookies, and the AJAX / JavaScript plugins used to interpret pages (beautifulsoup4 also);
  2. Each distributor inherits from (1) and adds language, currency, and the method used to match among table results (closest match as now, or cheapest, prioritizing parts recommended for layout); see the sketch after this list;
  3. Change the way of iterating, if possible (first by distributor, then by part), as you said, to decrease the ban risk;
  4. Return extra data from the component page (footprint, image link, ...), as already done for Digikey;
  5. The distributor classes have to keep the automatic-import functionality (today, just adding a subfolder with an __init__.py under distributors is recognized as a new distributor module).
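A rough sketch of points (1)-(3); every name here is illustrative, not the final KiCost API:

class DistributorBase:
    # Point (1): common browser definitions, retry limit and cookie state shared by one distributor.
    def __init__(self, name, domain, max_retries=3):
        self.name = name
        self.domain = domain
        self.max_retries = max_retries

    def scrape_part(self, part):
        raise NotImplementedError  # each distributor implements its own page handling

class Mouser(DistributorBase):
    # Point (2): the subclass adds language, currency and its own matching policy.
    def __init__(self):
        super().__init__('mouser', 'https://www.mouser.com')
        self.language = 'en-US'
        self.currency = 'USD'

    def scrape_part(self, part):
        return {'dist': self.name, 'part': part}  # placeholder result

# Point (3): iterate distributors first, then parts, so per-distributor state is preserved.
for dist in (Mouser(),):
    for part in ('LM358', 'NE555'):
        dist.scrape_part(part)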

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 1, 2018
Goals are:
- Add stateful user-agent and cookie handling in fake_browser
- Scrape parts in distributor then part order
- Use a class inheritance approach to distributor modules, this allows
  adding state information and reduces the number of variables passed around.
- One scraping thread per distributor, simplify locking

Implemented in this commit:
- Used class approach to distributors and fake_browser
- Parts are scraped in distributor -> part order
- One (IO limited) python thread per distributor
- Simplified locking
@mmmaisel (Contributor, Author) commented Jun 1, 2018

I started implementing the changes in my local fork.
First tests with mouser work. However, there may be new bugs introduced by refactoring.

Currently implemented are:

  • Class approach for all distributors and fake_browser.
  • Parts are scraped in distributor -> part order.
  • One (IO limited) python thread per distributor and simplified locking.
  • Automatic module import is now handled from kicost.py (see the sketch after this list).
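A possible shape of that automatic import from kicost.py, assuming the distributors live as subpackages of kicost.distributors (illustrative, not the exact code):

import importlib
import pkgutil

import kicost.distributors

# Import every subpackage found under kicost/distributors/ so that simply adding a new
# subfolder with an __init__.py is enough to register a new distributor module.
for mod_info in pkgutil.iter_modules(kicost.distributors.__path__):
    importlib.import_module('kicost.distributors.' + mod_info.name)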

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 1, 2018
@hildogjr hildogjr assigned hildogjr and mmmaisel and unassigned xesscorp Jun 2, 2018
@hildogjr (Owner) commented Jun 2, 2018

It is reaching a new level of code organization, nice. I will test it to keep us updated about compatibility and other issues.
@mmmaisel, it would also be interesting to make it possible to import the distributor class in the Python terminal and do a "scrape by PART1"; this will make future debugging easier.
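Something like the following session is what that would enable; the import path and class name are assumptions about the new layout, not the actual module structure:

# Hypothetical debug usage; the module path and class name are assumptions.
from kicost.distributors.mouser import Mouser

dist = Mouser()
result = dist.scrape_part('PART1')  # scrape a single part directly, without a full BOM run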

Tip: handle the difference between the py2 / py3 imports like this:

from sys import version_info as py_version
if py_version >= (3, 0):
    import py3  # Python 3 specific module
else:
    import py2  # Python 2 specific module

This makes it possible to raise the import error (instead of using try/except around the import to deal with the py2/3 difference) and may be more pythonic, as used in some files already revised, line 70.

Just some other guidance about the dependencies used: pycountry gives a country-currency dictionary and, in the future, a correlation to use in the configuration; CurrencyConverter is used to convert the price to a currency not available at that distributor.
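A quick sketch of how those two dependencies are typically used; the currency codes here are just examples:

import pycountry
from currency_converter import CurrencyConverter

# pycountry resolves currency codes/names that can be used in the configuration.
eur = pycountry.currencies.lookup('EUR')
print(eur.alpha_3, eur.name)  # 'EUR', 'Euro'

# CurrencyConverter translates a price into a currency the distributor does not list.
cc = CurrencyConverter()
print(round(cc.convert(12.34, 'USD', 'EUR'), 2))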

@hildogjr (Owner) commented Jun 2, 2018

@mmmaisel, it appears that the Farnell issue was fixed on your branch.
I just have these issues / reports:

  1. KiCost is spending almost 30 s to start scraping on my computer (maybe related to the class importation and initial configuration?);
  2. Digikey is not working (it always returns empty). But I saw that there is a #TODO in the file.
  3. TME got this error:

Traceback (most recent call last):
  File "/home/h/.local/bin/kicost", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/kicost/main.py", line 274, in main
    local_currency=args.currency)
  File "/usr/local/lib/python3.5/dist-packages/kicost/kicost.py", line 315, in kicost
    res_dist = res_proc.get()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.5/dist-packages/kicost/kicost.py", line 300, in mt_scrape_part
    retval.append(inst.scrape_part(i, parts[i]))
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/distributor.py", line 160, in scrape_part
    qty_avail = self.dist_get_qty_avail(html_tree)
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/tme/tme.py", line 149, in dist_get_qty_avail
    ajax_tree, qty_str = self.__ajax_details(pn)
  File "/usr/local/lib/python3.5/dist-packages/kicost/distributors/tme/tme.py", line 76, in __ajax_details
    r = r.decode('utf-8') # Convert bytes to string in Python 3.
UnboundLocalError: local variable 'r' referenced before assignment

@mmmaisel (Contributor, Author) commented Jun 3, 2018

@hildogjr

  1. Init time issue: should be fixed now.
  2. I'm now working on the digikey issue.
  3. The TME "use of undefined local variable 'r'" error was already there before my refactoring (see line 66); a generic sketch of the fix is below.
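Without quoting the actual tme.py code, the generic shape of that bug and a guard for it looks roughly like this (illustrative only, not KiCost's implementation):

import urllib.error
import urllib.request

def ajax_details(url):
    try:
        r = urllib.request.urlopen(url).read()
    except urllib.error.URLError:
        return None  # bail out here instead of falling through with 'r' never assigned
    return r.decode('utf-8')  # safe: 'r' is always bound when this line runs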

mmmaisel added a commit to mmmaisel/KiCost that referenced this issue Jun 4, 2018
@hildogjr (Owner) commented Jun 6, 2018

Fixed on #242, merging soon.
This discussion can be reopened at any moment for more improvements.

@hildogjr hildogjr closed this as completed Jun 6, 2018