Can I extract attribute values using OpenScraping? #6

Closed
Porkstone opened this issue Mar 15, 2016 · 19 comments

Comments

@Porkstone

<a href="test">dfsdfsdf</a>

I have tried syntax like this: //a/@href
It just returns the contents of the anchor tag, but I'm looking for "test" in the href attribute.

Is this possible?

@zmarty
Contributor

zmarty commented Mar 15, 2016

It should be possible. I will give it a try.

@zmarty zmarty closed this as completed Mar 17, 2016
@zmarty
Contributor

zmarty commented Mar 17, 2016

Sorry, I mistakenly closed this item. This is still a problem.

@zmarty zmarty reopened this Mar 17, 2016
@zmarty
Contributor

zmarty commented Mar 17, 2016

OpenScraping depends on Html Agility Pack, which does not support attribute selection, as explained here.

What I can do is modify the code to support simple attribute selection, so your particular example would work. Let me know if you still need this and I will make the change.
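
For readers hitting the same limitation, here is a minimal sketch of the usual way around it when calling Html Agility Pack directly: select the element with XPath, then read the attribute from the returned node. The class name `AttributeDemo` and the helper `GetAttribute` are hypothetical names used only for illustration; they are not part of OpenScraping or Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public static class AttributeDemo
{
    // Hypothetical helper: selects the first node matching an element XPath
    // and returns one of its attributes, or the fallback when the node or
    // the attribute is missing.
    public static string GetAttribute(string html, string elementXPath, string attrName, string fallback)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode node = doc.DocumentNode.SelectSingleNode(elementXPath);
        return node == null ? fallback : node.GetAttributeValue(attrName, fallback);
    }

    public static void Main()
    {
        // Select the <a> element itself (not //a/@href), then read href from it.
        string html = "<a href=\"test\">dfsdfsdf</a>";
        Console.WriteLine(GetAttribute(html, "//a", "href", ""));
    }
}
```

The key point is that the XPath targets the element (`//a`), and the attribute lookup happens afterwards on the node object.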

@nickhaughton

nickhaughton commented Dec 13, 2016

I could use this; I'm happy to do the work if need be. @zmarty

@mcskelle

I'm pretty happy using the library, but I really need to extract links using /a/@href or some other method in the lib. Any chance this will be available soon?

@cbracht

cbracht commented Jan 29, 2017

@zmarty The library is working great so far, but I also need to extract the href attribute of a link. A workaround would be great.

@zmarty
Contributor

zmarty commented Feb 2, 2017

@cbracht Acknowledged, will work on it.

@cbracht

cbracht commented Feb 2, 2017

Thank you, your work is appreciated!

@marcel-silva

I made it work like this:

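using System.Collections.Generic;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;
using OpenScraping.Transformations; // assumed namespace of ITransformationFromHtml

// Returns the value of an attribute (default: href) of the matched node.
public class LinkAttributeValue : ITransformationFromHtml
{
    public object Transform(Dictionary<string, object> settings, HtmlNode node, List<HtmlNode> logicalParents)
    {
        if (node == null)
        {
            return null;
        }

        // Defaults, overridable through '_attrName' and '_fallBack' in the config.
        string attrName = GetStringSetting(settings, "_attrName") ?? "href";
        string fallBack = GetStringSetting(settings, "_fallBack") ?? "#";

        return node.GetAttributeValue(attrName, fallBack);
    }

    // TryGetValue avoids the KeyNotFoundException that the dictionary indexer
    // throws when a setting is omitted from the config.
    private static string GetStringSetting(Dictionary<string, object> settings, string key)
    {
        if (settings != null && settings.TryGetValue(key, out object value)
            && value is JValue jValue && jValue.Type == JTokenType.String)
        {
            return jValue.ToObject<string>();
        }

        return null;
    }
}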

And to use:

var html = "<html><body><h1>Article title</h1><div class='article'><a href='link1.com'>link1</a></div><div class='article'><a href='link2.com'>link2</a></div></body></html>";

var configJson = @"{
    'teams': {
        '_xpath': '//div[contains(@class, \'article\')]',
        'name': './/a',
        'link': {
            '_xpath': './/a',
            '_transformations': [
                {
                    '_type': 'LinkAttributeValue',
                    '_attrName': 'href',
                    '_fallBack': '#'
                }
            ]
        }
    }
}";

@zmarty
Contributor

zmarty commented Feb 22, 2017

Thanks for the workaround! I am working on a permanent solution in another branch. Will merge to master as soon as it's ready.

@avi22228

avi22228 commented Jul 9, 2017

@zmarty Do you have a fix for this?

Thanks for the good work.

@zmarty
Contributor

zmarty commented Jul 9, 2017

Not yet, sorry. I did not have time to finish it.

@avi22228

avi22228 commented Jul 9, 2017

No worries. Thanks for the prompt reply. The hack posted by @marcel-silva is working fine for now.

When you release the next version, please also update the dependencies. Currently they interfere with other stuff.

Thank you.

@zmarty zmarty mentioned this issue Jul 18, 2017
@pldmgg

pldmgg commented Sep 16, 2017

Just wanted to chime in that I just ran into this as well. Using @marcel-silva's workaround for the time being. Thanks again for making this! It's very handy.

@shawnshaddock
Contributor

I also really need this fix. Is it still being worked on?

@shawnshaddock
Contributor

I have fixed this issue in pull request #14.

@zmarty
Contributor

zmarty commented Aug 23, 2018

Thank you @shawnshaddock, this is now in master.

@zmarty zmarty closed this as completed Aug 23, 2018
@zmarty
Contributor

zmarty commented Dec 5, 2018

@shawnshaddock Reopening since checkin #14 breaks transformations such as CastToIntegerTransformation. I am working on a fix, since I need this for a project.

@zmarty zmarty reopened this Dec 5, 2018
@zmarty
Contributor

zmarty commented Dec 5, 2018

Fixed (hopefully permanently) through #21.

@zmarty zmarty closed this as completed Dec 5, 2018
9 participants