XPath results contain namespace in the keys #20
Comments
Thank you! And sorry for the late reply. Hopefully I can be of some
That would be amazing! I'll go ahead and point you to a few resources that
This is due to a patch required to properly handle namespaces without lxml. You can see if installing riko with the lxml parser fixes the issue since it works a bit differently than the native Python xml parser (ElementTree).
Could you please provide an example of the desired output?
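For readers unfamiliar with the behaviour mentioned above, here is a minimal sketch using only the standard library. It shows how ElementTree reports namespaced tags in `{uri}local` form, which is where the prefix in the keys comes from. The XHTML fragment and the `local_name` helper are made up for illustration; they are not part of riko.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with a default XHTML namespace (illustration only).
XHTML = """<li xmlns="http://www.w3.org/1999/xhtml">
  <span class="date">6 Kasım 2016</span>
  <p>Amok</p>
</li>"""

root = ET.fromstring(XHTML)

# ElementTree qualifies every tag with its namespace URI in braces.
tags = [child.tag for child in root]
print(tags)


def local_name(tag):
    """Drop a leading '{uri}' from an ElementTree tag, if present."""
    return tag.rsplit('}', 1)[-1]


print([local_name(tag) for tag in tags])
```

lxml uses the same `{uri}local` notation for tags, so whether switching parsers changes the keys depends on how riko post-processes them, not on the notation itself.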
Hello, Thank you for your detailed response. Unfortunately, the deadline for the tutorial was a week ago and I submitted my tutorial. I believe it turned out pretty good and I hope it'll be useful to someone. The tutorial is available here. I'll still check out everything you linked to. Thank you very much.
Well, glad you were able to get along without me. I really do need to start replying to my emails/gh issues in a more timely manner :). It's soooo cool to have proof that someone else besides me is using this library! Please let me know if there was any [other] part of riko you found confusing. I'm going through your notebook now, very impressive!! I'll submit an issue with typo corrections later. Also, feel free to submit a PR to the readme linking to your notebook.
Thank you very much, I'm glad that you liked it! I just sent a pull request... I think :D I might've messed it up, it's been some time since I opened a pull request. Please tell me if I messed up and I'll re-send it.
Did a bit more investigating and this issue is compounded in certain cases (search for
{
    '{http://www': {
        'w3': {
            'org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
            'org/1999/xhtml}p': 'Amok'}}}}
This is due to the original key
So, the upshot of all this is that I need to figure out how to remove the namespace from being included in the result that
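The nested keys in that output are exactly what you get when a namespaced tag is split on `.` and each piece becomes one level of a dictionary. The sketch below reproduces the shape with a hypothetical helper, `nest_by_dots`; that riko builds the structure this way internally is an assumption, not something confirmed in this thread.

```python
def nest_by_dots(key, value):
    """Hypothetical helper: build nested dicts by splitting `key` on '.'."""
    nested = value
    for part in reversed(key.split('.')):
        nested = {part: nested}
    return nested


# The URI contains dots, so the namespaced key is broken into three levels.
namespaced = nest_by_dots('{http://www.w3.org/1999/xhtml}p', 'Amok')
print(namespaced)

# With the namespace removed first, the key has no dots and stays flat.
plain = nest_by_dots('p', 'Amok')
print(plain)
```

If that reading is right, stripping the namespace before any dot-based key handling would avoid both the prefix and the extra nesting.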
Hello,
First of all, commendable job. Thank you for your work.
I'm working on a Jupyter notebook, which will be a tutorial on how to use Riko to access unstructured website data in a structured manner. When I finish it, I will send you a pull request with the notebook (or get it to you in an alternative way), as I think it could be a great beginner's guide for everyone who'd like to use Riko.
As I was preparing the notebook, I ran into an interesting situation: when I parse <li> elements using xpathfetchpage, and those elements have other elements nested underneath them, the keys for those nested elements have a weird {http://www.w3.org/1999/xhtml} prefix. The following code snippet can illustrate it:
This prints:
for the fetched structure:
(This page is updated daily so the exact output might differ when you run it but the structure remains the same)
I was unable to figure out why there's that '{http://www.w3.org/1999/xhtml}' prefix on the nested keys or how to get rid of it. I understand that it differentiates between the attributes of a tag and the nested elements, but maybe there is a flag (that I was unable to find) to retrieve them as a list under a key like 'child' in the top-level dictionary.
Thank you for your assistance.
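Until there is a fix or a flag in the library, one possible workaround is to clean the keys after fetching. The sketch below is a post-processing step written for this thread, not a riko feature: `strip_ns` is a hypothetical helper, and `item` is a hand-written stand-in for one fetched result, assuming its keys are strings.

```python
import re

# Matches a leading '{namespace-uri}' at the start of a key.
NS_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_ns(obj):
    """Recursively remove '{uri}' prefixes from dict keys (hypothetical helper)."""
    if isinstance(obj, dict):
        return {NS_PREFIX.sub('', key): strip_ns(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [strip_ns(value) for value in obj]
    return obj


# Hand-written stand-in for one fetched <li> item (illustration only).
item = {
    'class': 'entry',
    '{http://www.w3.org/1999/xhtml}span': {'class': 'date', 'content': '6 Kasım 2016'},
    '{http://www.w3.org/1999/xhtml}p': 'Amok',
}

clean = strip_ns(item)
print(clean)
```

One caveat: if an element has an attribute and a child element with the same name, the stripped keys collide and one value overwrites the other, which is presumably the ambiguity the prefix guards against.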