Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TextField strips strings which may not be desirable #84

Closed
abmyii opened this issue Jan 6, 2020 · 9 comments · Fixed by #85
Closed

TextField strips strings which may not be desirable #84

abmyii opened this issue Jan 6, 2020 · 9 comments · Fixed by #85

Comments

@abmyii
Copy link
Contributor

abmyii commented Jan 6, 2020

strings = [node.strip() for node in element.itertext()]

My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.

@howie6879
Copy link
Owner

Hi @abmyii :

Maybe we should remove .srtip()

@abmyii
Copy link
Contributor Author

abmyii commented Jan 7, 2020

Maybe we should remove .strip()

Will do.

BTW, this is my paragraph field, in case it is useful:

class ParagraphField(TextField):
    """
    This field is used to get paragraphs.
    """

    def _parse_element(self, element):
        strings = [node for node in element.itertext()]
        string = "".join(strings)
        return string if string else self.default

    def extract(self, *args, **kwargs):
        # Join lines (after stripping them)
        return "\n".join(
            _.strip() for _ in super().extract(*args, **kwargs) if _.strip()
        ).strip()

@howie6879
Copy link
Owner

howie6879 commented Jan 7, 2020

I think getting customized target content should be implemented by the developer in the clean function, for example:

class TargetItem(Item):
    paragraph = TextField(css_selector="")

    async def clean_paragraph(self, value):
        # TODO
        pass

@abmyii
Copy link
Contributor Author

abmyii commented Jan 7, 2020

I did that before, but having ~3/4 paragraph fields all with an identical clean function (except different name) isn't very pleasing:

async def clean_<name1>(self, values):
    return "\n".join(_.strip() for _ in values if _.strip()).strip()

async def clean_<name2>(self, values):
    return "\n".join(_.strip() for _ in values if _.strip()).strip()

async def clean_<name3>(self, values):
    return "\n".join(_.strip() for _ in values if _.strip()).strip()
...

@howie6879
Copy link
Owner

If repeated often, I recommend that developers do this themselves, just like what you do now:

class ParagraphField(TextField):
    """
    This field is used to get paragraphs.
    """

    def _parse_element(self, element):
        strings = [node for node in element.itertext()]
        string = "".join(strings)
        return string if string else self.default

    def extract(self, *args, **kwargs):
        # Join lines (after stripping them)
        return "\n".join(
            _.strip() for _ in super().extract(*args, **kwargs) if _.strip()
        ).strip()

It might not be a good idea if it came directly from Ruia, Ruia should only provide basic functions, what do you think about it?

@abmyii
Copy link
Contributor Author

abmyii commented Jan 7, 2020

Ruia should only provide basic functions

True. I can't tell if this should be or not, but I'm sure you know better. I'll submit the PR for removing .strip and this should be done.
Thanks for your input!

@howie6879
Copy link
Owner

Thank you for your understanding. You can write a plug-in. This can also achieve what you want and help more people.

@howie6879
Copy link
Owner

howie6879 commented Jan 7, 2020

if you are interested in this job, we can do this in python-ruia

@abmyii
Copy link
Contributor Author

abmyii commented Jan 7, 2020

Sure! As I said, I'll be working on this project and once I collect enough to warrant a plug-in, I'll submit one. Right now I only have ParagraphField and I don't think that's enough to make a plug-in for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants