MSDN_Crawler: many tilib.exe errors #3

Closed
williballenthin opened this issue Sep 16, 2014 · 17 comments

Comments

@williballenthin
Contributor

as reported by @wzr in #2.

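In the meantime I have wrapped the parse_file call in a try/except block:

    if file.endswith('htm'):
        file_counter += 1
        result = None  # reset so a failed parse cannot reuse the previous file's result
        try:
            result = parse_file(os.path.join(root, file), const_enum)
        except Exception:
            # remember the failing file and keep going
            error_files.append(file)
        if result:
            results.append(result)
    print 'ERROR processing %d files' % len(error_files)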

Which resulted in:

ERROR processing 21828 files
Parsed 329993 files
Extracted information about 15263 functions

Does this correspond with your numbers?
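
For anyone adapting this, here is a minimal Python 3 sketch of the same collect-and-continue pattern. `parse_all` and `fake_parse_file` are hypothetical stand-ins, not the crawler's real functions; the point is only that `result` is reset on every iteration, so a file that raises is counted once as an error and never contributes a stale result.

```python
def parse_all(paths, parse_file):
    """Parse every path, collecting failures instead of aborting."""
    results, error_files = [], []
    for path in paths:
        result = None  # reset on every iteration
        try:
            result = parse_file(path)
        except Exception:
            error_files.append(path)
        if result:
            results.append(result)
    return results, error_files


def fake_parse_file(path):
    # Hypothetical stand-in for the crawler's parser.
    if "bad" in path:
        raise ValueError("cannot parse " + path)
    return {"name": path}


results, error_files = parse_all(["a.htm", "bad.htm", "c.htm"], fake_parse_file)
print("ERROR processing %d files" % len(error_files))
print("Extracted information about %d functions" % len(results))
```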

@peta909

peta909 commented Oct 1, 2014

I am facing the same issue: multiple tilib.exe errors.

@zer0pwned

I used BeautifulSoup 3.x, debugged msdn_crawler.py, and fixed some errors, for example:

constant_names = re.findall(
    "<dl><dt>(.*?)</dt>", descriptions[i])
if not constant_names:
    continue
constant_names = [strip_html(unicode(c, 'utf-8')).encode('utf-8')
                  for c in constant_names]
parsed_html = BeautifulSoup(descriptions[i])
constant_descriptions = []
for string in parsed_html.findAll(width='60%'):
    constant_descriptions.append(strip_html(string.text.encode('ascii')).encode('utf-8'))

The result is:

Parsed 341278 files
Extracted information about 34218 functions
ERROR processing 197 files

The size of msdn_data_nn.xml is 33.7 MB. What sizes are others getting?

I'll upload this file to my blog so people can download it. For now I'm focusing on the remaining files that still fail.
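
To make the `re.findall` step above concrete, here is a small Python 3 sketch of what that pattern extracts. The `description` fragment is made up for illustration and real MSDN pages may be laid out differently; only the regular expression is taken from the snippet.

```python
import re

# Hypothetical description fragment, not copied from a real MSDN page.
description = (
    "<dl><dt>GENERIC_READ</dt><dt>0x80000000</dt></dl>"
    "<dl><dt>GENERIC_WRITE</dt><dt>0x40000000</dt></dl>"
)

# Only the first <dt> directly after each <dl> is captured.
constant_names = re.findall(r"<dl><dt>(.*?)</dt>", description)
print(constant_names)

# A description without that markup yields an empty list,
# which is the case the snippet skips with `continue`.
no_names = re.findall(r"<dl><dt>(.*?)</dt>", "<p>No constants here.</p>")
print(no_names)
```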

@flypuma

flypuma commented Feb 9, 2015

It still doesn't work, although I changed the .py file.

@zer0pwned

@flypuma can you paste the error output in your post?

@flypuma

flypuma commented Feb 9, 2015

@niklaus520

Traceback (most recent call last):
  File "C:\flare-ida-master\MSDN_crawler\msdn_crawler.py", line 414, in <module>
    main()
  File "C:\flare-ida-master\MSDN_crawler\msdn_crawler.py", line 399, in main
    (file_counter, results) = parse_files(msdn_directory, tilib_exe, til_dir)
  File "C:\flare-ida-master\MSDN_crawler\msdn_crawler.py", line 372, in parse_files
    result = parse_file(os.path.join(root, file), const_enum)
  File "C:\flare-ida-master\MSDN_crawler\msdn_crawler.py", line 277, in parse_file
    return parse_new_style(file, content, const_enum)
  File "C:\flare-ida-master\MSDN_crawler\msdn_crawler.py", line 185, in parse_new_style
    constant_descriptions.append(strip_html(string.text.encode('ascii')).encode('utf-8'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 64: ordinal not in range(128)
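
The character in the error, u'\xa0', is a non-breaking space, which the ASCII codec cannot represent. A small Python 3 sketch for locating the offending character in a string follows; the sample `text` is made up, and only the exception attributes (`start`, `object`, `reason`) come from Python itself.

```python
# Made-up sample text containing a non-breaking space (u'\xa0').
text = u"Reserved;\xa0must be zero."

position = None
offending = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    position = exc.start                # index of the first unencodable character
    offending = exc.object[exc.start]   # the character itself
    print("ascii failed at %d: %r (%s)" % (position, offending, exc.reason))
```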

@zer0pwned

@flypuma which lines did you change? How many files failed during processing?

@flypuma

flypuma commented Feb 9, 2015

@niklaus520 Just the lines you pasted, starting at line 176. If I don't change these lines, I get the same issue as in #2. I didn't print the number of failed files.

@flypuma

flypuma commented Feb 9, 2015

@niklaus520 I ran the original msdn_crawler.py and extracted about 33984 functions, close to your number. How can I get the file from your blog?

@zer0pwned

@flypuma http://blog.depressedmarvin.com/upload/2015/02/09/msdn_data_nn.xml

You can just wget it.

@flypuma

flypuma commented Feb 10, 2015

@niklaus520 Thanks a lot. Could you upload the file msdn_crawler.py?

@zer0pwned

http://blog.depressedmarvin.com/upload/2015/02/10/msdn_crawler.py

Well, now you can try my script and see if there are still errors.

Then you can compare the two; maybe some lines are different.

@thansau239

I got the following error when running the annotate_IDB_MSDN Python script. Please help me.

Traceback (most recent call last):
  File "C:/Program Files/IDAPro6.6/python/flare/annotate_IDB_MSDN.py", line 117, in on_ok_button
    IDB_MSDN_Annotator.main(config)
  File "C:/Program Files/IDAPro6.6/python/flare\IDB_MSDN_Annotator\__init__.py", line 523, in main
    functions_map = parse_xml_data_files(msdn_data_dir)
  File "C:/Program Files/IDAPro6.6/python/flare\IDB_MSDN_Annotator\__init__.py", line 486, in parse_xml_data_files
    additional_functions = xml_parser.parse(xml_file)
  File "C:/Program Files/IDAPro6.6/python/flare\IDB_MSDN_Annotator\xml_parser.py", line 283, in parse
    parser.parse(xmlfile)
  File "C:\Program Files\IDAPro6.6\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Program Files\IDAPro6.6\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Program Files\IDAPro6.6\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Program Files\IDAPro6.6\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: C:\Program Files\IDAPro6.6\python\flare\annotate_IDB_MSDN.py:1:2: not well-formed (invalid token)

Thank you very much!
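
The exception message names annotate_IDB_MSDN.py, not an XML file, as the document that failed at line 1, column 2, which suggests that a non-XML file in the data directory was handed to the SAX parser. A minimal Python 3 sketch for checking a directory up front follows. It assumes the data files use a `.xml` extension, and `find_wellformed_xml` is a hypothetical helper, not part of flare-ida.

```python
import os
import tempfile
import xml.sax


def find_wellformed_xml(directory):
    """Return (good, bad) name lists for the .xml files in directory."""
    good, bad = [], []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".xml"):
            continue  # skip scripts and other non-XML files
        path = os.path.join(directory, name)
        try:
            with open(path, "rb") as handle:
                xml.sax.parse(handle, xml.sax.ContentHandler())
            good.append(name)
        except xml.sax.SAXParseException:
            bad.append(name)
    return good, bad


# Demo with throwaway files: one valid, one broken, one non-XML.
demo_dir = tempfile.mkdtemp()
samples = [
    ("ok.xml", "<msdn><function/></msdn>"),
    ("broken.xml", "<msdn><function></msdn>"),
    ("script.py", "import os\n"),
]
for name, body in samples:
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(body)

good, bad = find_wellformed_xml(demo_dir)
print("well-formed:", good)
print("not well-formed:", bad)
```

Pointing the annotator at a directory that contains only the files reported as well-formed would be one way to test this theory.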

@I-VANN

I-VANN commented Dec 13, 2015

To niklaus520:
If you change this statement as follows, you can eliminate all the Unicode-related errors:

        for string in parsed_html.findAll(width='60%'):
            try:
                constant_descriptions.append(strip_html(string.text.encode('ascii')).encode('utf-8'))
            except UnicodeError:
                # text is not pure ASCII (e.g. u'\xa0'): encode it as UTF-8 instead
                constant_descriptions.append(strip_html(string.text.encode('utf-8')))

Please add this to your uploaded code so everyone can download it.

Ivan
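
As a side note on the fallback above: ASCII is a subset of UTF-8, so encoding straight to UTF-8 gives the same bytes for ASCII-only text and also handles characters such as u'\xa0'. That may make the try/except unnecessary. A Python 3 sketch, where `to_utf8` is a hypothetical helper and not part of msdn_crawler.py, and the sample strings are made up:

```python
def to_utf8(text):
    """Encode text as UTF-8 bytes (same bytes as ASCII for ASCII-only input)."""
    return text.encode("utf-8")


ascii_only = u"Opens the file for reading."
with_nbsp = u"Reserved;\xa0must be zero."

# ASCII-only text encodes identically under both codecs.
same_bytes = to_utf8(ascii_only) == ascii_only.encode("ascii")
print(same_bytes)

# Text with a non-breaking space encodes without raising.
encoded = to_utf8(with_nbsp)
print(encoded)
```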

@zer0pwned

@I-VANN

I-VANN commented Dec 13, 2015

I've already modified my copy of the file with this suggestion; I posted it for the benefit of other people.
So if you think this change is acceptable for your modified file, you can apply it for everyone.
Thank you for your help.

@zer0pwned

@I-VANN cool, thanks for your suggestion

@mr-tz
Contributor

mr-tz commented May 15, 2017

Closing this old issue. Please check if the following file works for you after unzipping it.
https://github.com/mr-tz/flare-ida/blob/master/MSDN_data/msdn_data.zip
Please reopen this issue if you need further assistance.

@mr-tz mr-tz closed this as completed May 15, 2017