New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TextField
strips strings which may not be desirable
#84
Comments
Hi @abmyii : Maybe we should remove |
Will do. BTW, this is my paragraph field, in case it is useful:
|
I think getting customized target content should be implemented by the developer in the class TargetItem(Item):
paragraph = TextField(css_selector="")
async def clean_paragraph(self, value):
# TODO
pass |
I did that before, but having ~3/4 paragraph fields all with an identical clean function (except different name) isn't very pleasing:
|
If repeated often, I recommend that developers do this themselves, just like what you do now: class ParagraphField(TextField):
"""
This field is used to get paragraphs.
"""
def _parse_element(self, element):
strings = [node for node in element.itertext()]
string = "".join(strings)
return string if string else self.default
def extract(self, *args, **kwargs):
# Join lines (after stripping them)
return "\n".join(
_.strip() for _ in super().extract(*args, **kwargs) if _.strip()
).strip() It might not be a good idea if it came directly from Ruia, Ruia should only provide basic functions, what do you think about it? |
True. I can't tell if this should be or not, but I'm sure you know better. I'll submit the PR for removing |
Thank you for your understanding. You can write a plug-in. This can also achieve what you want and help more people. |
if you are interested in this job, we can do this in python-ruia |
Sure! As I said, I'll be working on this project and once I collect enough to warrant a plug-in, I'll submit one. Right now I only have |
ruia/ruia/field.py
Line 120 in 8a91c01
My use case is extracting paragraphs which have newlines between them, and these are stripped out by
TextField
. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.The text was updated successfully, but these errors were encountered: