Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split cell into new rows, not just new columns. #386

Closed
pjakobsen opened this issue Sep 8, 2016 · 3 comments · Fixed by #430

Comments

@pjakobsen
Copy link
Contributor

commented Sep 8, 2016

I propose a feature that would allow the use of regex to split a fields into new rows. Currently, it's only possible to create new columns afaik, which causes a potential problem when the split yields different numbers of new data elements for every case. Simple Example:

 [['ID','AMOUNT','RECIPIENT'],
    ['1','1000','Joe: 25%, Jane: 75%'],
    ['2','2000','Joe: 25%, Jess: 25%, Jenny: 25%, Jim: 25%']]

Desired Result:

 [['ID','AMOUNT','RECIPIENT','PERCENT],
    ['1','1000','Joe', 25%],
    ['1','1000','Jane',75%'],
    ['2','2000','Joe', '25%'],
    ['2','2000', 'Jess','25%']
    ['2','2000', 'Jenny','25%']
    ['2','2000', 'Jim','25%']
]

I'd be happy to give it a try if it doesn't already exist. Looking for suggestion on how to implement, but I imagine it as some type of parameter you could pass to the split method.

@alimanfoo

This comment has been minimized.

Copy link
Collaborator

commented Sep 21, 2016

Hi @hexatonic,

Apologies for the slow reply, just back from conferences.

I think this is a nice idea, very happy to review a PR.

I'd suggest a new function called something like "splitdown" or "splitmany"
within the petl.transform.regex module.

Function signature something like:

def splitdown(table, field, pattern, maxsplit=0, flags=0)

If you look at the source of the petl.transform.regex module you'll see the
general pattern. There is a public function like "splitdown", then a class
returned from that function like "SplitDownView", then I often break out
the row iterator implementation into a separate function like
"itersplitdown" to ensure iterators are completely independent.

In this case you want a row iterator function something like (untested):

def itersplitdown(source, field, pattern, maxsplit, flags):
# setup source iterator
it = iter(source)

# setup regex
prog = re.compile(pattern, flags)

# obtain source header row
hdr = next(it)
flds = list(map(text_type, hdr))

# handle field argument
if isinstance(field, int) and field < len(hdr):
    field_index = field
    field = hdr[field_index]
elif field in flds:
    field_index = flds.index(field)
else:
    raise ArgumentError('field invalid: must be either field name or

index')

# yield output header - same as input
yield tuple(hdr)

# yield data rows
for row in it:
    value = row[field_index]
    for v in prog.split(value, maxsplit):
        outrow = tuple(v if i == field_index else row[i] for i in

range(len(hdr)))
yield outrow

Hope that's comprehensible! Obviously a unit test would be very nice too.

Cheers,
Alistair

On Thursday, September 8, 2016, Hexatonic notifications@github.com wrote:

I propose a feature that would allow the use of regex to split a fields
into new rows. Currently, it's only possible to create new columns afaik,
which causes a potential problem when the split yields different numbers of
new data elements for every case. Simple Example:

[['ID','AMOUNT','RECIPIENT'],
['1','1000','Joe: 25%, Jane: 75%'],
['2','2000','Joe: 25%, Jess: 25%, Jenny: 25%, Jim: 25%']]

Desired Result:

[['ID','AMOUNT','RECIPIENT','PERCENT],
['1','1000','Joe', 25%],
['1','1000','Jane',75%'],
['2','2000','Joe', '25%'],
['2','2000', 'Jess','25%']
['2','2000', 'Jenny','25%']
['2','2000', 'Jim','25%']
]

I'd be happy to give it a try if it doesn't already exist. Looking for
suggestion on how to implement, but I imagine it as some type of parameter
you could pass to the split method.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#386, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAq8QtIq1lzB3S5g4u-P4UloFfCNSFo1ks5qoDkLgaJpZM4J4Mfr
.

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alimanfoo@googlemail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

@John-Dennert

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2017

Hi @alimanfoo,
I'm working on a PR for this and was wondering what you thought about similar functionality for "splittingdown" a field that contained a list or a tuple (or may even a dict).

Thanks,
John

@John-Dennert

This comment has been minimized.

Copy link
Contributor

commented Sep 5, 2017

Hi @hexatonics,
For the example you provided, what regex would you expect to supply to produce the desired result?

Thanks,
John

alimanfoo added a commit that referenced this issue Aug 6, 2019
Add splitdown function (#430)
* Added splitdown function (resolves #386, #386).

Allows the use of a regex to split a field into new rows.

* Added unit test for splitdown function.

All tests pass for py26, py27 and py36.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.