Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table with text columns containing and starting with missing values produce "NA" values where it should be NA #111

Open
stoeter opened this issue Mar 16, 2023 · 5 comments

Comments

@stoeter
Copy link
Collaborator

stoeter commented Mar 16, 2023

When a table contains text columns starting in the first rows with missing values then in R theres are correcty parsed to NA.
However, when a row contains a text value, then the previous row is parsed as "NA". In consequese the table retured to KNIME contains "NA" entries where before R snippet there was a missing value. This is a bug.

Then checked on R side this happens already in the parsed kIn table (from the data provided below):
kIn$Name (Sense)[1:20]
[1] NA NA NA NA NA NA NA NA NA
[10] NA NA NA "NA" "seq_001" "seq_002" "seq_003" "seq_004" "seq_005"
[19] "seq_006" "seq_007"

Sorting the table in KNIME in such a way that missing values are at the end of the table solves the issue for that column.

Attached KNIME workflow with data showing the problem...

Win7
KNIME 4.5.2
R version 3.6.1 (2019-07-05)
Rserve 1.8-6

R snippted problem with NAs.zip

@stoeter
Copy link
Collaborator Author

stoeter commented Mar 16, 2023

Correction: the issue cannot be solved by pre-sorting. It looks like the last row is also parsed as "NA" instead of NA

tail(kIn)
plateRow plateColumn Name (Sense) Sense sequence Name (Antisense) Reverse complement
Row90#1_? 8 7
Row91#1_? 8 8
Row92#1_? 8 9
Row93#1_? 8 10
Row94#1_? 8 11
Row95#1_? 8 12 NA NA NA NA
sample number sample number.text sample type plate number plate number.text
Row90#1_? NA H2O 2 002
Row91#1_? NA H2O 2 002
Row92#1_? NA H2O 2 002
Row93#1_? NA H2O 2 002
Row94#1_? NA H2O 2 002
Row95#1_? NA H2O 2 002
sample number in plate plateRow384 plateColumn384 plateNumber384
Row90#1_? NA 15 14 1
Row91#1_? NA 15 16 1
Row92#1_? NA 15 18 1
Row93#1_? NA 15 20 1
Row94#1_? NA 15 22 1
Row95#1_? NA 15 24 1
kIn$Name (Sense)[180:192]
[1] NA NA NA NA NA NA NA NA NA NA NA NA "NA"

@stoeter
Copy link
Collaborator Author

stoeter commented Mar 16, 2023

pasted code from command line was not properliy displayed in previous comment:

tail(kIn)
          plateRow plateColumn Name (Sense) Sense sequence Name (Antisense) Reverse complement
Row90#1_?        8           7         <NA>           <NA>             <NA>               <NA>
Row91#1_?        8           8         <NA>           <NA>             <NA>               <NA>
Row92#1_?        8           9         <NA>           <NA>             <NA>               <NA>
Row93#1_?        8          10         <NA>           <NA>             <NA>               <NA>
Row94#1_?        8          11         <NA>           <NA>             <NA>               <NA>
Row95#1_?        8          12           NA             NA               NA                 NA
          sample number sample number.text sample type plate number plate number.text
Row90#1_?            NA               <NA>         H2O            2               002
Row91#1_?            NA               <NA>         H2O            2               002
Row92#1_?            NA               <NA>         H2O            2               002
Row93#1_?            NA               <NA>         H2O            2               002
Row94#1_?            NA               <NA>         H2O            2               002
Row95#1_?            NA               <NA>         H2O            2               002
          sample number in plate plateRow384 plateColumn384 plateNumber384
Row90#1_?                     NA          15             14              1
Row91#1_?                     NA          15             16              1
Row92#1_?                     NA          15             18              1
Row93#1_?                     NA          15             20              1
Row94#1_?                     NA          15             22              1
Row95#1_?                     NA          15             24              1
> kIn$`Name (Sense)`[180:192]
 [1] NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   "NA"

@stoeter
Copy link
Collaborator Author

stoeter commented Mar 16, 2023

Sorry, the bug is even more severe. It looks like the last row with a value that is followed by a missing value is parsed as a missing value. In this example there are 10 sequences with their names. seq_010 was replaced by a missing value, NA, in kIn.

In addition what is very strange is that the column sampleNumber.text, which is also a text column looks fine!???

At this point is is also not clear to me what NA in R means:
: missing value / not available (ok)
"NA" : text "NA" (ok)
NA : appears in kIn$Name (Sense)[1:30] (, but not in kIn$sample number.text[1:30]), whereas that data displayed as kIn[1:30,] shows in table (why is it NA and not when printed as vector? does this have anything to do with the problem?)

Here is the example:

> kIn[1:30,]
              plateRow plateColumn Name (Sense)  Sense sequence Name (Antisense) Reverse complement
Row0#0_?             1           1         <NA>            <NA>             <NA>               <NA>
Row1#0_?             1           2         <NA>            <NA>             <NA>               <NA>
Row2#0_?             1           3         <NA>            <NA>             <NA>               <NA>
Row3#0_?             1           4         <NA>            <NA>             <NA>               <NA>
Row4#0_?             1           5         <NA>            <NA>             <NA>               <NA>
Row5#0_?             1           6         <NA>            <NA>             <NA>               <NA>
Row6#0_?             1           7         <NA>            <NA>             <NA>               <NA>
Row7#0_?             1           8         <NA>            <NA>             <NA>               <NA>
Row8#0_?             1           9         <NA>            <NA>             <NA>               <NA>
Row9#0_?             1          10         <NA>            <NA>             <NA>               <NA>
Row10#0_?            1          11         <NA>            <NA>             <NA>               <NA>
Row11#0_?            1          12         <NA>            <NA>             <NA>               <NA>
Row12#0_?            2           1           NA              NA               NA                 NA
Row13#0_Row0         2           2      seq_001 GAGAAAAGGAAGCTG     anti-seq_001    CAGCTTCCTTTTCTC
Row14#0_Row1         2           3      seq_002 GAGGAGTTGAAGCTG     anti-seq_002    CAGCTTCAACTCCTC
Row15#0_Row2         2           4      seq_003 GAGGAAAGGAAGCTG     anti-seq_003    CAGCTTCCTTTCCTC
Row16#0_Row3         2           5      seq_004 GAGATCAGAAAGCTG     anti-seq_004    CAGCTTTCTGATCTC
Row17#0_Row4         2           6      seq_005 GAGGGGTGGGGGCTG     anti-seq_005    CAGCCCCCACCCCTC
Row18#0_Row5         2           7      seq_006 GAGAAGAGGAAGCTG     anti-seq_006    CAGCTTCCTCTTCTC
Row19#0_Row6         2           8      seq_007 GAGAAATGGAGGCTG     anti-seq_007    CAGCCTCCATTTCTC
Row20#0_Row7         2           9      seq_008 GAGAAGAGGGAGCTG     anti-seq_008    CAGCTCCCTCTTCTC
Row21#0_Row8         2          10      seq_009 GAGAGGAGGGAGCTG     anti-seq_009    CAGCTCCCTCCTCTC
Row22#0_Row9         2          11         <NA>            <NA>             <NA>               <NA>
Row23#0_?            2          12         <NA>            <NA>             <NA>               <NA>
Row24#0_?            3           1           NA              NA               NA                 NA
Row25#0_Row10        3           2      seq_011 GAGGGAAGGAGGCTG     anti-seq_011    CAGCCTCCTTCCCTC
Row26#0_Row11        3           3      seq_012 GAGGGATGGAGGCTG     anti-seq_012    CAGCCTCCATCCCTC
Row27#0_Row12        3           4      seq_013 GAGGGGAGGGAGCTG     anti-seq_013    CAGCTCCCTCCCCTC
Row28#0_Row13        3           5      seq_014 GAGGGGTGGAGGCTG     anti-seq_014    CAGCCTCCACCCCTC
Row29#0_Row14        3           6      seq_015 GAGGGAGCTTGGCTG     anti-seq_015    CAGCCAAGCTCCCTC
              sample number sample number.text sample type plate number plate number.text
Row0#0_?                 NA               <NA>         H2O            1               001
Row1#0_?                 NA               <NA>         H2O            1               001
Row2#0_?                 NA               <NA>         H2O            1               001
Row3#0_?                 NA               <NA>         H2O            1               001
Row4#0_?                 NA               <NA>         H2O            1               001
Row5#0_?                 NA               <NA>         H2O            1               001
Row6#0_?                 NA               <NA>         H2O            1               001
Row7#0_?                 NA               <NA>         H2O            1               001
Row8#0_?                 NA               <NA>         H2O            1               001
Row9#0_?                 NA               <NA>         H2O            1               001
Row10#0_?                NA               <NA>         H2O            1               001
Row11#0_?                NA               <NA>         H2O            1               001
Row12#0_?                NA               <NA>         H2O            1               001
Row13#0_Row0              1                001       dsDNA            1               001
Row14#0_Row1              2                002       dsDNA            1               001
Row15#0_Row2              3                003       dsDNA            1               001
Row16#0_Row3              4                004       dsDNA            1               001
Row17#0_Row4              5                005       dsDNA            1               001
Row18#0_Row5              6                006       dsDNA            1               001
Row19#0_Row6              7                007       dsDNA            1               001
Row20#0_Row7              8                008       dsDNA            1               001
Row21#0_Row8              9                009       dsDNA            1               001
Row22#0_Row9             10                010       dsDNA            1               001
Row23#0_?                NA               <NA>         H2O            1               001
Row24#0_?                NA               <NA>         H2O            1               001
Row25#0_Row10            11                011       dsDNA            1               001
Row26#0_Row11            12                012       dsDNA            1               001
Row27#0_Row12            13                013       dsDNA            1               001
Row28#0_Row13            14                014       dsDNA            1               001
Row29#0_Row14            15                015       dsDNA            1               001

> kIn$`Name (Sense)`[1:30]
 [1] NA        NA        NA        NA        NA        NA        NA        NA        NA       
[10] NA        NA        NA        "NA"      "seq_001" "seq_002" "seq_003" "seq_004" "seq_005"
[19] "seq_006" "seq_007" "seq_008" "seq_009" NA        NA        "NA"      "seq_011" "seq_012"
[28] "seq_013" "seq_014" "seq_015"

> kIn$`sample number.text`[1:30]
 [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 001  002  003  004  005  006 
[20] 007  008  009  010  <NA> <NA> 011  012  013  014  015 
96 Levels: 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 ... 096

@stoeter
Copy link
Collaborator Author

stoeter commented Mar 21, 2024

Also saw this today again in newer software versions. To me this is a severe bug, because it changes the data where it should not, and the user might not even notice! (=> changed Priority)

Win11 Pro
KNIME 4.7.8
R version 4.3.2 (2023-10-31 ucrt)
Rserve 1.8-13

Current example of string column parsed to R from KNIME and back to KNIME (all "NA" were previously ?):
TDS-DB.ID
NA
101_MSD_3_5
101_MSD_3_6
101_MSD_3_7
?
NA
101_MSD_3_9
101_MSD_3_10
101_MSD_3_11
?
NA
101_MSD_3_13
101_MSD_3_14
101_MSD_3_15
101_MSD_3_16
?
NA
101_MSD_3_18

@stoeter
Copy link
Collaborator Author

stoeter commented Mar 21, 2024

Ok, just realized the full problem of the bug again:
a) certain missing values are parsed to R as "NA" (minor bug, annoying, but one could correct this)
b) certain values (strings, rows, that are followed by a missing value) are parsed to R as NA (missing value in R) and are therefore lost! (severe bug!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants