Skip to content

Conversation

@moritzschaefer
Copy link

@moritzschaefer moritzschaefer commented Nov 3, 2017

Right now only the last value is returned in the DF if a given attribute key exists more than once in one GTF-line. It doesn't make sense to just return that last value, so I changed the code such that a comma-separeted string is returned for these attributes instead.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 92.216% when pulling afa721a on moritzschaefer:master into b923298 on hammerlab:master.

@moritzschaefer
Copy link
Author

The failing CI build is unrelated to my changes and results from the use of the Deprecated Series.irow method

@iskandr
Copy link
Contributor

iskandr commented Feb 19, 2018

Hey @moritzschaefer I just noticed this PR -- is it valid to have a key occur more than once? And when that happens, are you sure that appending the values into a string representation of comma-separated list is the only/best action?

@iskandr
Copy link
Contributor

iskandr commented Feb 19, 2018

The irow problem should be fixed by #7, can you rebase off master after I merge that PR and also increase the version of gtfparse? It should be good to go after that.

@iskandr
Copy link
Contributor

iskandr commented Feb 19, 2018

@moritzschaefer I think this issue #2 runs into the same problem as you but their suggested solution is keeping a list of options. That seems safer to me than a comma separated string -- what do you think of changing this PR to keep lists of multi-values?

@iskandr
Copy link
Contributor

iskandr commented Feb 19, 2018

OK, just implement the list-accumulating version of this PR here: #8

Curious whether this works for @moritzschaefer

@iskandr iskandr closed this Feb 19, 2018
@moritzschaefer
Copy link
Author

Gencode gtf has keys used more than once a lot. This does not mean, it is valid, though it means it is the case.

The reason I used a comma-separated list, is that when using read_gtf_as_dataframe, I wanted to have a single value field at each position in the DataFrame (for performance reasons). I think it's also feasible to convert the list objects to strings after read_gtf_as_dataframe is performed.

Did you test your PR on real data? (https://github.com/openvax/gtfparse/pull/8/files)

@iskandr
Copy link
Contributor

iskandr commented Feb 20, 2018

@moritzschaefer The str vs. list representation seems to be a memory vs. runtime tradeoff. It's cheaper to make lists but I think will use more memory than a concatenated string. I'm going with lists for now since it feels cleaner. I just added a unit test in this PR: #9

(as well as fixing deploy, so hopefully this will actually go to PyPI now)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants