Issue parsing files with MSDN_crawler #2

Closed
wzr opened this issue Sep 16, 2014 · 5 comments

@wzr

wzr commented Sep 16, 2014

I seem to have hit an issue parsing the files. I tried this with IDA 6.6 on x64 and IDA 6.5 on x86.

C:\Users\luser\Desktop\IDA stuff\flare-ida\MSDN_crawler [master]> python .\msdn_
crawler.py 'C:\\sdk_help\\' 'C:\\Program Files\\IDA 6.5\\tilib.exe' 'C:\\Program
 Files\\IDA 6.5\\til\\pc'
MSDN crawler based on zynamics msdn-crawler - Copyright 2010
Traceback (most recent call last):
  File ".\msdn_crawler.py", line 413, in <module>
    main()
  File ".\msdn_crawler.py", line 398, in main
    (file_counter, results) = parse_files(msdn_directory, tilib_exe, til_dir)
  File ".\msdn_crawler.py", line 371, in parse_files
    result = parse_file(os.path.join(root, file), const_enum)
  File ".\msdn_crawler.py", line 276, in parse_file
    return parse_new_style(file, content, const_enum)
  File ".\msdn_crawler.py", line 183, in parse_new_style
    parsed_html.find_all(width='60%')]
TypeError: 'NoneType' object is not callable

(This happens after a few minutes of processing.) Running the same command with -v, I get:

Lots of "Could not retrieve function description..." messages, which I figure are okay since not all files will be relevant to the script.

[...] 
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_
help\\abff2e90-4c42-4c07-816f-efde05343e03.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abff323b-e6c6-45e0-93bd-eeb68bca80e0.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abff3c41-301f-4273-9103-8e6197ba41fe.htm
Traceback (most recent call last):
  File "c:\python27\lib\logging\__init__.py", line 842, in emit
    msg = self.format(record)
  File "c:\python27\lib\logging\__init__.py", line 719, in format
    return fmt.format(record)
  File "c:\python27\lib\logging\__init__.py", line 464, in format
    record.message = record.getMessage()
  File "c:\python27\lib\logging\__init__.py", line 328, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file msdn_crawler.py, line 118
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_
help\\abff3c41-301f-4273-9103-8e6197ba41fe.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abffd0fe-d047-4670-a728-eea8253f3f2d.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_activate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_getautohidebar.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_getstate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_gettaskbarpos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_new.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_querypos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_remove.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setautohidebar.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setpos.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_setstate.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abm_windowposchanged.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abnormaltermination.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_fullscreenapp.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_poschanged.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_statechange.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abn_windowarrange.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abort.htm
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_
help\\abort.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abortall.htm
DEBUG:__main__:Error: Could not retrieve function description from file C:\\sdk_
help\\abortall.htm
DEBUG:__main__:Parsing C:\\sdk_help\\abortdoc.htm
Traceback (most recent call last):
  File ".\msdn_crawler.py", line 413, in <module>
    main()
  File ".\msdn_crawler.py", line 398, in main
    (file_counter, results) = parse_files(msdn_directory, tilib_exe, til_dir)
  File ".\msdn_crawler.py", line 371, in parse_files
    result = parse_file(os.path.join(root, file), const_enum)
  File ".\msdn_crawler.py", line 276, in parse_file
    return parse_new_style(file, content, const_enum)
  File ".\msdn_crawler.py", line 183, in parse_new_style
    parsed_html.find_all(width='60%')]
TypeError: 'NoneType' object is not callable
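
A side note on the "not all arguments converted during string formatting" traceback in the log above: that is what the logging module reports when a call passes arguments but the message has no placeholder for them. The sketch below is not line 118 of msdn_crawler.py (which is not shown in this thread); `render` and the message text are illustrative assumptions that reproduce the same formatting step logging performs.

```python
import logging


def render(msg, *args):
    """Format msg with args the same way logging does when a record is emitted."""
    record = logging.LogRecord("demo", logging.DEBUG, "demo.py", 0, msg, args, None)
    return record.getMessage()  # internally: msg % args


# Placeholder present: the argument is substituted into the message.
ok = render("Error: Could not retrieve function description from file %s",
            "abort.htm")

# No placeholder for the extra argument: formatting raises TypeError.
try:
    render("Error: Could not retrieve function description from file",
           "abort.htm")
    failed = False
except TypeError:
    failed = True
```

If the call at line 118 has this shape, adding the missing `%s` (or dropping the extra argument) would likely silence the traceback; the crawler itself keeps running either way, as the log shows.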

Just so there is a second pair of eyes on my setup:
I am running the Windows version of IDA (tried on both 64-bit and 32-bit hosts).
Python is always 32-bit.
I pip-installed "beautifulsoup" (not beautifulsoup4).
I decompressed all the HxS help files to a flat directory, i.e. all htm files are in the same directory.

PS. As a side note, I get tilib.exe errors on different files, on clean installs, on pretty much every version from 6.1 to 6.6 except 6.5 (with the tilib.exe version matching that of the IDA distribution). Is anyone else experiencing this?

@williballenthin
Contributor

Can you try with beautifulsoup4 (via pip)? I think bs4 is the correct dependency, though I realize this isn't documented well. If that works, I'll update all the docs to clearly point it out.
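
For context, one way the exact error in the traceback can arise: Beautiful Soup 3 names the method `findAll`, and it resolves unknown attributes as a search for a child tag of that name, so `parsed_html.find_all` evaluates to None and calling it raises "'NoneType' object is not callable". Below is a minimal sketch of a helper that tolerates both spellings. `find_all_compat` and `FakeBS3Tag` are hypothetical names, and the fake class only mimics the attribute fallback so the example runs without either package installed.

```python
def find_all_compat(node, *args, **kwargs):
    """Call node.find_all if it is a real method, else fall back to node.findAll."""
    method = getattr(node, "find_all", None)
    if not callable(method):
        # Beautiful Soup 3 style object: find_all is not a method there
        method = getattr(node, "findAll")
    return method(*args, **kwargs)


class FakeBS3Tag(object):
    """Hypothetical stand-in that mimics BS3's attribute fallback."""

    def findAll(self, **attrs):
        return ("findAll", attrs)

    def __getattr__(self, name):
        # Only reached for attributes that do not exist on the object.
        if name.startswith("__"):
            raise AttributeError(name)
        return None  # BS3 would search for a child tag <name> and find none


tag = FakeBS3Tag()
direct = tag.find_all            # None, so tag.find_all(...) would raise TypeError
via_helper = find_all_compat(tag, width="60%")
```

With a real bs4 object the helper simply uses `find_all`, so the same call site works with whichever package ends up installed.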

@wzr
Author

wzr commented Sep 16, 2014

I think we can discard that one. (I really hope I am not fat-fingering something silly).

image

In the meantime I have wrapped the parse_file call in a try/except block:

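        if file.endswith('htm'):
            file_counter += 1
            # reset so a failed parse cannot re-append the previous file's result
            result = None
            try:
                result = parse_file(os.path.join(root, file), const_enum)
            except Exception:
                # remember the file that failed and keep going
                error_files.append(file)
            if result:
                results.append(result)
print 'ERROR processing %d files' % len(error_files)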

Which resulted in:

ERROR processing 21828 files
Parsed 329993 files
Extracted information about 15263 functions
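
To see what makes up an error count like the one above, the failures can be tallied by exception type instead of only being counted. This is a standalone sketch under assumptions: `parse_all`, `summarize`, and `fake_parse` are hypothetical names, and `parse_one` stands in for whatever per-file parser is used.

```python
from collections import Counter


def parse_all(paths, parse_one):
    """Run parse_one on each path, collecting results and (path, error name) pairs."""
    results, errors = [], []
    for path in paths:
        try:
            result = parse_one(path)
        except Exception as exc:
            errors.append((path, type(exc).__name__))
            continue  # skip to the next file; nothing to append
        if result:
            results.append(result)
    return results, errors


def summarize(errors):
    """Count failures per exception type."""
    return Counter(name for _, name in errors)


def fake_parse(path):
    """Hypothetical parser used only to exercise the loop."""
    if path.startswith("bad"):
        raise TypeError("'NoneType' object is not callable")
    return {"file": path} if path.endswith(".htm") else None


results, errors = parse_all(["a.htm", "bad.htm", "notes.txt"], fake_parse)
summary = summarize(errors)
```

A tally dominated by a single exception type would point at one root cause rather than thousands of unrelated bad files.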

Does this correspond with your numbers?

Thanks a lot for the effort you put into the tool, it is great! (And also the evtx and registry modules! Big fan of those!)

@williballenthin
Contributor

I'm putting together a testing environment to triage these issues now.

I'll also split out the file processing issues into a separate issue so we can track and discuss it more clearly.

@peta909

peta909 commented Oct 1, 2014

I solved the file parsing issues by using Beautiful Soup 3 instead of 4.

@mr-tz
Contributor

mr-tz commented May 15, 2017

Closing this old issue. Please reopen if it's still not working for you.

@mr-tz mr-tz closed this as completed May 15, 2017