crunch throws an error with failed d45 type #5

japhir · 2021-09-10T09:22:44Z

Hi Mathieu!

I've been trying out D47crunch for our data!

The first issue I had was that our UIDs and Sessions were integers and datetimes respectively, and that I had to cast them to character to get mydata.crunch() to work. Otherwise it would throw this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pygzCIH1", line 9, in <module>
  File "<string>", line 1, in <module>
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1274, in crunch
    self.compute_bulk_and_clumping_deltas(r)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1349, in compute_bulk_and_clumping_deltas
    R45 = (1 + r['d45'] / 1000) * R45_wg
TypeError: unsupported operand type(s) for /: 'str' and 'int'

I think I managed to fix this by first casting them to characters and then running it again.

The text was updated successfully, but these errors were encountered:

mdaeron · 2021-09-10T10:05:22Z

Good point about the documentation.

We could use str() when reading the Session field, with the following results:

from datetime import date, datetime
import arrow # another popular option

str(date.today())   # returns '2021-09-10'
str(datetime.now()) # returns '2021-09-10 12:03:37.940927'
str(arrow.now())    # returns '2021-09-10T12:04:02.268544+02:00'

Would that work for your use case?

japhir · 2021-09-10T10:16:49Z

Am I correct in that arrow includes the timezone and datetime relies on UTC?

I am an absolute beginner in python, so have no idea of what packages are out there and what they do, and how to modify the examples you listed in the docs to my liking ;-). I basically used R to generate the csv in the desired structure and then ran below in sequence:

  import D47crunch
  mydata = D47crunch.D47data()
  mydata.read('/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/rawdata.csv')
  mydata.wg()  # as an aside, where do I tell it the d13C and d18O of the working gas?
  mydata.plot_distribution_of_analyses() # gives an unusable plot if you have many measurements, in my case 1779 measurements only (that's a tiny subset of everything since may 2021)
  mydata.crunch() # this gave me the error if I didn't make sure UID and Session were characters
  mydata.standardize() # this still gives me the error in #6 
  mydata.summary(verbose = True, save_to_file = False) # can only reach this step with the simulated data

the #6 in the comment of the code block doesn't create a link, so here it is.

mdaeron · 2021-09-10T12:56:52Z

Going back to your original first message above, the error you report has nothing to do (I think) with the Session field. Instead, it's complaining that r['d45'] is a string. After mydata.read(), what does type(mydata[0]['d45']) return?
mydata.wg() is only used if you want to compute the WG composition based on your δ13C and δ18Ο standards. If you want to specify arbitrary values for the WG, add fields d13Cwg_VPDB and d18Owg_VSMOW to your csv file. This used to be included in the documentation but it was dropped at some point; I'll put it back in. I personally advise to use D47data.wg() rather than nominal WG values, but YMMV of course.
plot_distribution_of_analyses(): maybe open another issue with an example of the output, and suggestions re: how to improve the plot. I've only used it once, for the data in Fiebig et al. (2021), but the plot could be heavily customized using function keywords if needed.
crunch(): I don't understand how UID and Session end up not being strings, unless you used the 'session' keyword in D47data.read()? In that case, yes, the Session field needs to be a string. That doesn't explain why UID is not a string. The relevant code is below:

data = [{k: v if k in ['UID', 'Session', 'Sample'] else smart_type(v) for k,v in zip(txt[0], l) if v != ''} for l in txt[1:]]

if session != '':
	for r in data:
		r['Session'] = session

japhir · 2021-09-10T14:31:59Z

I was able to get the code running only if I took a small subset of my whole dataset. I thought that there were some rows that force the whole d45 column to be a string, however, for one of the subsets that breaks (n = 1000) I can get the same error while type(mydata[0]['d45']) <class 'float'>
thanks for the explanation! I'll have a look at whether the wg computed value is very different from the one specified by our gas provider.
Ok I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper in stead. This will make the codebase easier to maintain.
I think I'm mixing issues here, sorry about that. If I filter to rows where the above error doesn't take place, this error pops up with

>>>  mydata.standardize()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pyPjNI3W", line 3, in <module>
  File "/tmp/babel-yqvP9s/python-GXTFNT", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1618, in standardize
    params.add(f'a_{s}', value = 0.9)
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'a_2020_01_03T00:00:00Z' is not a valid Parameters name"

If I then re-implement my manual fix to convert the Session to character prior to export to csv and re-run, I get the other issue #6 where the Sample names cause issues

>>>   mydata.standardize()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pykIhOHv", line 3, in <module>
  File "/tmp/babel-yqvP9s/python-EUrohX", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1625, in standardize
    params.add(f'D{self._4x}_{pf(sample)}', value = 0.5)
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'D47_AU002_(2)' is not a valid Parameters name"

mdaeron · 2021-09-10T14:52:23Z

I was able to get the code running only if I took a small subset of my whole dataset. I thought that there were some rows that force the whole d45 column to be a string, however, for one of the subsets that breaks (n = 1000) I can get the same error while type(mydata[0]['d45']) <class 'float'>

mydata[0] is only the first line. To find the offending lines, you can try:

for k,r in enumerate(mydata):
    try:
        r['d45'] / 1000
    except:
        print(k, r['d45'], type(r['d45'])

This will print out the index, r['d45'] and type for the offending lines.

mdaeron · 2021-09-10T14:58:32Z

Ok I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper in stead. This will make the codebase easier to maintain.

Agreed in principle, but such a plot could be pretty useful for anybody, there's nothing specific to Jens' study here. Again, if it works poorly for your use case, let me know and we can improve it if you feel it's worth it.

japhir · 2021-09-10T15:06:54Z

Aargh now It's breaking again but this time type(mydata[0]['d45']) is <class 'str'>.

I think it's because the Sample names in the subset that is causing issues are things like "ETM-2_690_5-83,85(Ntr)", which needs to be quoted, while some others are not quoted.

I've looked into how my R csv export decides to quote and/or escape things. By default it quotes "as needed", meaning that the above one with parentheses etc. gets quoted, but a simple ETH-3 doesn't.
Maybe the offender is "ETM-2_690_4-138,145(Ntr)[2]"? With quotes, parentheses AND square brackets this one is a winner in messing up computers ;-). I found this blogpost about reading/writing csv files with edge cases. For now I will simply transform these names manually?

After specifically filtering out any Sample with (, ), [, ] I think it's running smoothly now!
filter(!str_detect(identifier_1, "\\(|\\)|\\[|\\]")) %>%

mydata[0] is only the first line. To find the offending lines, you can try:

wait, so python's rows are by default lists that can contain different types for each row? o.O

mdaeron · 2021-09-10T16:36:22Z

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default so that inserts spurious columns in your data. Using tabs or semicolons instead should help a lot.

wait, so python's rows are by default lists that can contain different types for each row? o.O

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int, or str), with each key associated to an arbitrary object (which could be a list another dictionary, among others). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (=analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses., and test for the presence of these fields using the following code:

for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')

japhir · 2021-09-10T17:13:13Z

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default so that inserts spurious columns in your data. Using tabs or semicolons instead should help a lot.

ah yes, that must be it. Is there any escape syntax that you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax right? The link I shared before has some python csv package code that specifies how to handle quote syntax etc.

japhir · 2021-09-10T17:13:55Z

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int, or str), with each key associated to an arbitrary object (which could be a list another dictionary, among others). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (=analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses., and test for the presence of these fields using the following code:
for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')

Thanks for the elaboration :)

mdaeron · 2021-09-10T17:19:15Z

ah yes, that must be it. Is there any escape syntax that you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax right? The link I shared before has some python csv package code that specifies how to handle quote syntax etc.

I confess that I never took the time to use a real csv parser as you suggest. I'll add this as a todo item for the dev version.

japhir mentioned this issue Sep 10, 2021

standardize throws an error because of package lmfit #6

Closed

japhir changed the title ~~Make required data structure/types more explicit in docs~~ crunch throws an error with failed d45 type Sep 10, 2021

japhir mentioned this issue Sep 10, 2021

Improve plot_distribution_of_analyses or throw warnings in case there are many sessions/samples #7

Closed

mdaeron closed this as completed Sep 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

crunch throws an error with failed d45 type #5

crunch throws an error with failed d45 type #5

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021 •

edited

Loading

japhir commented Sep 10, 2021 •

edited

Loading

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

crunch throws an error with failed d45 type #5

crunch throws an error with failed d45 type #5

Comments

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021 • edited Loading

japhir commented Sep 10, 2021 • edited Loading

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

japhir commented Sep 10, 2021

japhir commented Sep 10, 2021

mdaeron commented Sep 10, 2021

mdaeron commented Sep 10, 2021 •

edited

Loading

japhir commented Sep 10, 2021 •

edited

Loading