Issue parsing files with MSDN_crawler #2

wzr · 2014-09-16T06:14:29Z

I seem to have hit an issue with file parsing. I tried this with IDA 6.6 on x64 and IDA 6.5 on x86.

C:\Users\luser\Desktop\IDA stuff\flare-ida\MSDN_crawler [master]> python .\msdn_crawler.py 'C:\\sdk_help\\' 'C:\\Program Files\\IDA 6.5\\tilib.exe' 'C:\\Program Files\\IDA 6.5\\til\\pc'
MSDN crawler based on zynamics msdn-crawler - Copyright 2010
Traceback (most recent call last):
  File ".\msdn_crawler.py", line 413, in <module>
    main()
  File ".\msdn_crawler.py", line 398, in main
    (file_counter, results) = parse_files(msdn_directory, tilib_exe, til_dir)
  File ".\msdn_crawler.py", line 371, in parse_files
    result = parse_file(os.path.join(root, file), const_enum)
  File ".\msdn_crawler.py", line 276, in parse_file
    return parse_new_style(file, content, const_enum)
  File ".\msdn_crawler.py", line 183, in parse_new_style
    parsed_html.find_all(width='60%')]
TypeError: 'NoneType' object is not callable

(This happens after a few minutes of processing.) Running the same command with -v, I get:

Lots of these "Could not retrieve function description..." messages, which I figure are okay since not all files will be relevant to the script.

[...] 
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_help\\abff2e90-4c42-4c07-816f-efde05343e03.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abff323b-e6c6-45e0-93bd-eeb68bca80e0.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abff3c41-301f-4273-9103-8e6197ba41fe.htm
Traceback (most recent call last):
  File "c:\python27\lib\logging\__init__.py", line 842, in emit
    msg = self.format(record)
  File "c:\python27\lib\logging\__init__.py", line 719, in format
    return fmt.format(record)
  File "c:\python27\lib\logging\__init__.py", line 464, in format
    record.message = record.getMessage()
  File "c:\python27\lib\logging\__init__.py", line 328, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file msdn_crawler.py, line 118
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_help\\abff3c41-301f-4273-9103-8e6197ba41fe.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abffd0fe-d047-4670-a728-eea8253f3f2d.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_activate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_getautohidebar.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_getstate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_gettaskbarpos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_new.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_querypos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_remove.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setautohidebar.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setpos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setstate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_windowposchanged.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abnormaltermination.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_fullscreenapp.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_poschanged.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_statechange.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_windowarrange.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abort.htm
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_help\\abort.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abortall.htm
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_help\\abortall.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abortdoc.htm
Traceback (most recent call last):
  File ".\msdn_crawler.py", line 413, in <module>
    main()
  File ".\msdn_crawler.py", line 398, in main
    (file_counter, results) = parse_files(msdn_directory, tilib_exe, til_dir)
  File ".\msdn_crawler.py", line 371, in parse_files
    result = parse_file(os.path.join(root, file), const_enum)
  File ".\msdn_crawler.py", line 276, in parse_file
    return parse_new_style(file, content, const_enum)
  File ".\msdn_crawler.py", line 183, in parse_new_style
    parsed_html.find_all(width='60%')]
TypeError: 'NoneType' object is not callable
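
A note on the "not all arguments converted during string formatting" traceback in the verbose log: that message typically appears when a logging call is given extra arguments but its message string has no matching `%s` placeholder. I have not checked what line 118 of msdn_crawler.py actually contains, so the following is only a minimal sketch of the pattern; the logger name, the `ListHandler` helper, and the example path are all made up for illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores formatted messages in a list."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def format_like_logging(msg, args):
    """Mimic LogRecord.getMessage: apply % only when args are given."""
    return msg % args if args else msg


logger = logging.getLogger("msdn_crawler_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

path = "C:\\sdk_help\\example.htm"  # hypothetical file name

# Correct: one %s placeholder for the one extra argument.
logger.debug("Parsing %s", path)

# Incorrect pattern: an argument but no placeholder, so the % operator
# raises the TypeError seen in the log above.
try:
    format_like_logging("Parsing file", (path,))
except TypeError as exc:
    failure = str(exc)
```

If line 118 does follow the second pattern, adding a `%s` to the message string should silence that traceback. It is separate from the crash in parse_new_style.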

For a second pair of eyes, here is my setup:
I am running the Windows version of IDA (tried both 64-bit and 32-bit hosts).
Python is always 32-bit.
I pip-installed "beautifulsoup" (not beautifulsoup4).
I decompressed all the HxS help files to a flat directory, i.e. all htm files in the same directory.

P.S. As a side note, I get tilib.exe errors on different files, on clean installs, on pretty much every version from 6.1 to 6.6 except 6.5 (with the tilib.exe version matching that of the IDA distribution). Is anyone else experiencing this?


williballenthin · 2014-09-16T13:39:23Z

Can you try with beautifulsoup4 (via pip)? I think bs4 is the correct dependency, though I realize this isn't documented well. If that works, I'll update all the docs to clearly point it out.
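
For context, one plausible reading of the `'NoneType' object is not callable` error: Beautiful Soup 3 spells the method `findAll`, not `find_all`, and it treats an unknown attribute as a child-tag lookup that returns `None` when no such tag exists. Calling that `None` would produce exactly this TypeError. The sketch below illustrates the mechanism with a hypothetical `LegacyTag` stand-in (not the real library class) and a hypothetical `find_all_compat` helper that tolerates either spelling.

```python
class LegacyTag(object):
    """Hypothetical stand-in that mimics Beautiful Soup 3's attribute fallback."""

    def findAll(self, **attrs):
        # A real parser would return matching tags; an empty list is enough here.
        return []

    def find(self, name):
        # No child tag with this name exists in the stand-in.
        return None

    def __getattr__(self, name):
        # Unknown attributes are treated as child-tag lookups.
        return self.find(name)


def find_all_compat(node, **attrs):
    """Call find_all if it is a real method, otherwise fall back to findAll."""
    finder = getattr(node, "find_all", None)
    if not callable(finder):
        finder = node.findAll
    return finder(**attrs)


tag = LegacyTag()

try:
    tag.find_all(width='60%')  # find_all resolves to None, which is then called
except TypeError as exc:
    error_text = str(exc)

cells = find_all_compat(tag, width='60%')
```

If this is what is happening, installing beautifulsoup4 and importing from `bs4` would make `find_all` a real method.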

wzr · 2014-09-16T16:06:13Z

I think we can discard that one. (I really hope I am not fat-fingering something silly).

In the meantime I have wrapped the parse_file call in a try/except block:

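        if file.endswith('htm'):
            file_counter += 1
            try:
                result = parse_file(os.path.join(root, file), const_enum)
            except Exception:
                # Record the failing file and clear result so that a stale
                # value from a previous file is not appended again below.
                error_files.append(file)
                result = None
            if result:
                results.append(result)
print 'ERROR processing %d files' % len(error_files)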

Which resulted in:

ERROR processing 21828 files
Parsed 329993 files
Extracted information about 15263 functions

Does this correspond with your numbers?

Thanks a lot for the effort you put into the tool, it is great! (And also the evtx and registry modules! Big fan of those!)
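
To narrow down why so many files fail, the try/except above could also tally the exception types instead of only collecting file names. This is a rough sketch under assumptions, not the crawler's actual code: `parse_tree` and its `parse_one` parameter are hypothetical names, and `parse_one` stands in for whatever callable does the per-file parsing.

```python
import collections
import os


def parse_tree(directory, parse_one):
    """Walk directory, parse every *htm file, and count failures by exception type.

    parse_one is a hypothetical callable taking a file path and returning
    a parsed result (or a falsy value when nothing was extracted).
    """
    results = []
    errors = collections.Counter()
    file_counter = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith('htm'):
                continue
            file_counter += 1
            path = os.path.join(root, name)
            try:
                result = parse_one(path)
            except Exception as exc:
                # Keep going, but remember what kind of error this was.
                errors[type(exc).__name__] += 1
                continue
            if result:
                results.append(result)
    return file_counter, results, errors
```

Printing `errors.most_common()` afterwards would show whether the failures are all the same TypeError or a mix of different problems.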

williballenthin · 2014-09-16T16:56:34Z

I'm putting together a testing environment to triage these issues now.

I'll also split out the file processing issues into a separate issue so we can track and discuss it more clearly.

peta909 · 2014-10-01T07:27:15Z

I solved the file parsing issues by using Beautiful Soup 3 instead of 4.

mr-tz · 2017-05-15T14:13:26Z

Closing this old issue. Please reopen if it's still not working for you.

williballenthin mentioned this issue Sep 16, 2014

MSDN_Crawler: many tilib.exe errors #3

Closed

mr-tz closed this as completed May 15, 2017
