Split cell into new rows, not just new columns. #386

pjakobsen · 2016-09-08T16:34:19Z

I propose a feature that would allow the use of regex to split a fields into new rows. Currently, it's only possible to create new columns afaik, which causes a potential problem when the split yields different numbers of new data elements for every case. Simple Example:

 [['ID','AMOUNT','RECIPIENT'],
    ['1','1000','Joe: 25%, Jane: 75%'],
    ['2','2000','Joe: 25%, Jess: 25%, Jenny: 25%, Jim: 25%']]

Desired Result:

 [['ID','AMOUNT','RECIPIENT','PERCENT],
    ['1','1000','Joe', 25%],
    ['1','1000','Jane',75%'],
    ['2','2000','Joe', '25%'],
    ['2','2000', 'Jess','25%']
    ['2','2000', 'Jenny','25%']
    ['2','2000', 'Jim','25%']
]

I'd be happy to give it a try if it doesn't already exist. Looking for suggestion on how to implement, but I imagine it as some type of parameter you could pass to the split method.

The text was updated successfully, but these errors were encountered:

alimanfoo · 2016-09-21T18:21:23Z

Hi @hexatonic,

Apologies for the slow reply, just back from conferences.

I think this is a nice idea, very happy to review a PR.

I'd suggest a new function called something like "splitdown" or "splitmany"
within the petl.transform.regex module.

Function signature something like:

def splitdown(table, field, pattern, maxsplit=0, flags=0)

If you look at the source of the petl.transform.regex module you'll see the
general pattern. There is a public function like "splitdown", then a class
returned from that function like "SplitDownView", then I often break out
the row iterator implementation into a separate function like
"itersplitdown" to ensure iterators are completely independent.

In this case you want a row iterator function something like (untested):

def itersplitdown(source, field, pattern, maxsplit, flags):
# setup source iterator
it = iter(source)

# setup regex
prog = re.compile(pattern, flags)

# obtain source header row
hdr = next(it)
flds = list(map(text_type, hdr))

# handle field argument
if isinstance(field, int) and field < len(hdr):
    field_index = field
    field = hdr[field_index]
elif field in flds:
    field_index = flds.index(field)
else:
    raise ArgumentError('field invalid: must be either field name or

index')

# yield output header - same as input
yield tuple(hdr)

# yield data rows
for row in it:
    value = row[field_index]
    for v in prog.split(value, maxsplit):
        outrow = tuple(v if i == field_index else row[i] for i in

range(len(hdr)))
yield outrow

Hope that's comprehensible! Obviously a unit test would be very nice too.

Cheers,
Alistair

On Thursday, September 8, 2016, Hexatonic notifications@github.com wrote:

I propose a feature that would allow the use of regex to split a fields
into new rows. Currently, it's only possible to create new columns afaik,
which causes a potential problem when the split yields different numbers of
new data elements for every case. Simple Example:

[['ID','AMOUNT','RECIPIENT'],
['1','1000','Joe: 25%, Jane: 75%'],
['2','2000','Joe: 25%, Jess: 25%, Jenny: 25%, Jim: 25%']]

Desired Result:

[['ID','AMOUNT','RECIPIENT','PERCENT],
['1','1000','Joe', 25%],
['1','1000','Jane',75%'],
['2','2000','Joe', '25%'],
['2','2000', 'Jess','25%']
['2','2000', 'Jenny','25%']
['2','2000', 'Jim','25%']
]

I'd be happy to give it a try if it doesn't already exist. Looking for
suggestion on how to implement, but I imagine it as some type of parameter
you could pass to the split method.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#386, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAq8QtIq1lzB3S5g4u-P4UloFfCNSFo1ks5qoDkLgaJpZM4J4Mfr
.

Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health http://cggh.org
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alimanfoo@googlemail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

John-Dennert · 2017-09-05T03:07:19Z

Hi @alimanfoo,
I'm working on a PR for this and was wondering what you thought about similar functionality for "splittingdown" a field that contained a list or a tuple (or may even a dict).

Thanks,
John

John-Dennert · 2017-09-05T05:14:53Z

Hi @hexatonics,
For the example you provided, what regex would you expect to supply to produce the desired result?

Thanks,
John

* Added splitdown function (resolves #386, #386). Allows the use of a regex to split a field into new rows. * Added unit test for splitdown function. All tests pass for py26, py27 and py36.

John-Dennert mentioned this issue Sep 5, 2017

Add splitdown function #430

Merged

alimanfoo closed this as completed in #430 Aug 6, 2019

alimanfoo added a commit that referenced this issue Aug 6, 2019

Add splitdown function (#430)

d819ceb

* Added splitdown function (resolves #386, #386). Allows the use of a regex to split a field into new rows. * Added unit test for splitdown function. All tests pass for py26, py27 and py36.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Split cell into new rows, not just new columns. #386

Split cell into new rows, not just new columns. #386

pjakobsen commented Sep 8, 2016

alimanfoo commented Sep 21, 2016

John-Dennert commented Sep 5, 2017

John-Dennert commented Sep 5, 2017

Split cell into new rows, not just new columns. #386

Split cell into new rows, not just new columns. #386

Comments

pjakobsen commented Sep 8, 2016

alimanfoo commented Sep 21, 2016

John-Dennert commented Sep 5, 2017

John-Dennert commented Sep 5, 2017