
Response to Yiyi Huang's post #4

Open
HaolingZHANG opened this issue May 6, 2024 · 0 comments

HaolingZHANG commented May 6, 2024

Hi all,

We recently became aware that this post on ResearchGate accuses us of using improper data in our work. After going through Huang's report, we found that the erroneous conclusions stem mainly from the confusion of different concepts and from improper use of our software package. We therefore provide the following detailed reply.

We have also sent this reply to the editor of Nature Computational Science. The editor notified us that our response had been forwarded to Huang by 18 April 2024. As no further response has been received so far, we have decided to respond here as a comment.

Here we summarize the issues raised by Huang under three main points:

  1. The conflict between the theoretical information density (1.95 bits/nt) and the actual information density (1.75-1.78 bits/nt) obtained by simulation (referring to sections 2 and 3 of the post)

In general, we believe Huang has conflated the distinct concepts of information density, net information density, and coding potential (i.e., the theoretical upper bound). As Huang pointed out in the report, one of these is "coding potential" (Erlich et al., 2017), which indicates the theoretical upper bound of coding capacity. In that previous paper, the "coding potential" was 1.98 bits/nt, while the "net information density" was 1.57 bits/nt; the two values differ precisely because the two terms are defined differently. Moreover, even for an identical coding scheme, the "net information density" varies with file size, experimental design, etc., which is why we did not adopt this term as our major evaluation indicator.

In the comparison between information densities of different coding schemes (Table 1), we used "information density" instead of "coding potential" because, for a broader readership, "information density" is much easier to understand. In this table, we used the exact data from Erlich's paper (e.g., 1.98 bits/nt for DNA Fountain) as the information density, which is not the so-called "net information density" (Huang also showed the original table in the post, Fig. 3). The actual information density in our work (1.75-1.78 bits/nt) is theoretically calculated by considering only the indices and data payload, without the flanking regions and error-correction codes, which are more flexible in real practice. The formula for this calculation can be found in the supplementary information, Equation 11 (shown as Fig. 6 in Huang's post). In the point-by-point response of the second-round revision, we also explained this difference to Reviewer 4 (Response 12), and it was accepted.
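As a hedged illustration of the distinction (this is not the exact Equation 11 from the supplementary information, and the figures below are hypothetical), the net information density counts only the index and data-payload nucleotides:

```python
# Illustrative sketch only; the exact formula is Equation 11 in the
# supplementary information. Net information density counts the index and
# data payload, excluding flanking regions and error-correction codes.

def net_information_density(payload_bits, index_nt, payload_nt):
    """User data bits per nucleotide of index + payload."""
    return payload_bits / (index_nt + payload_nt)

# Hypothetical segment: 256 payload bits transcoded into 128 nt (2 bits/nt),
# plus a 16 nt index.
print(net_information_density(256, 16, 128))  # ~1.78 bits/nt
```

Adding flanking primers or error-correction symbols to the denominator would lower the figure further, which is why the net value always sits below the coding potential.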

Meanwhile, in the first paragraph of the section "Calculation of Theoretical information density of YYC", we already stated that "By applying the above equations, it is easy to evaluate the difference between actual information density interval and theoretical upper bound for a transcoding algorithm", showing that we are fully aware of the difference between coding potential and actual information density (i.e., net information density). As the screenshot (Fig. 2) in Huang's post shows, we explicitly differentiated the terms $ d_{\text{ACTUAL}} $, $ d_{\text{THEORY}}^i $, $ d_{\text{BASELINE}}^i $, etc. We believe Huang overlooked this point.

In addition, Huang confused the theoretical calculation with the simulation data. In Supplementary Tables 3 and 4 (Figs. 4 and 5 in Huang's post), the information density is calculated from transcoding simulation. It may be confusing that Supplementary Table 3 uses the term "information density", but from the context and the data in the tables it should be easy to see that the values there (e.g., 1.8037 bits/nt for Mona Lisa) correspond not to the coding potential but to the net information density.

In Supplementary Table 4, the purpose of the comparison is to demonstrate that we can always find a proper coding rule that yields a relatively high information density for various types of files. For example, for the file "Exiting the Factory.flv" mentioned by Huang, other coding rules yield ~1.4 bits/nt because of the file's biased byte frequency, while YYC rule No. 888 still gives 1.719 bits/nt. In the post, Huang appears to assume that we are claiming a coding scheme with higher information density than previous ones (e.g., DNA Fountain). In fact, the point we are making is that although we try to maintain a relatively high information density, robustness is what we care about most, as shown in Figures 2, 3 and 4 of our paper.

The calculation of physical information density is stated in the Methods section under "Data analysis"; the same calculation method was used previously by Organick et al., 2020. One of the most important parameters in this calculation is the average copy number. In Erlich et al., 2017, the copy number is ~1300, which is how they achieved 215 Pbytes/g, over two orders of magnitude higher than previous reports. In our work, the average copy number is ~100, approximately one order of magnitude lower than DNA Fountain, so the physical density of our work is about 10-fold that of DNA Fountain. Huang neglected the fact that DNA Fountain achieved a much higher physical density than earlier coding schemes precisely because of its lower copy number, so the assumption that 2.25 Ebytes/g is impossible in our work is arbitrary.
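The inverse relationship between copy number and physical density can be checked with a back-of-the-envelope scaling using the numbers quoted above (this is a rough sketch, not the full calculation described in the Methods section):

```python
# Back-of-the-envelope scaling, not the full Methods calculation:
# physical density is inversely proportional to average copy number,
# all else being equal.
erlich_copy_number = 1300          # Erlich et al., 2017
erlich_density_pb_per_g = 215      # 215 Pbytes/g reported there
our_copy_number = 100              # approximate average copy number in our work

scaling = erlich_copy_number / our_copy_number   # ~13x
approx_density_pb_per_g = erlich_density_pb_per_g * scaling
print(approx_density_pb_per_g)  # 2795.0 PB/g, i.e. ~2.8 EB/g
```

The result lands on the same order of magnitude as the 2.25 Ebytes/g figure under discussion, which is the point of the scaling argument.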

  2. Problems in implementing the encoding process using the YYC package released on GitHub (referring to section 4 of the post)

Since YYC uses a pseudo-random incorporation strategy and a screening-based approach, unless the random seed is fixed (which is not done in our code), the information density will fluctuate within a certain interval. As Huang showed in sections 4.1-4.6 of the post, the corresponding information density data in Supplementary Tables 3 and 4 of our work (Figs. 4 and 5 in the post) is "quite close to" that of Huang's test. This simply shows that YYC performs robustly in terms of information density.
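The fluctuation can be illustrated with a minimal sketch of pseudo-random choice (the seed value here is hypothetical and the released package deliberately does not fix one):

```python
import random

# Sketch: a pseudo-random incorporation strategy makes slightly different
# choices on each run unless the seed is fixed, so the achieved information
# density fluctuates within an interval rather than being a single value.
random.seed(2024)  # hypothetical seed, for illustration only
run_a = [random.choice("ACGT") for _ in range(8)]
random.seed(2024)
run_b = [random.choice("ACGT") for _ in range(8)]
print(run_a == run_b)  # True: identical seeds reproduce identical choices
```

Without the `random.seed` calls, `run_a` and `run_b` would generally differ, which is why two independent runs of the encoder report close but not identical densities.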

In addition, regarding the concern about success rate, we went through the details of Huang's test code provided in the GitHub repository and found that the index length setting in Huang's implementation was not appropriate. For ease of use, the default setting chooses the index length automatically to fit the file size, using the minimum index length available. However, if the index length is set too short, the implementation is affected and the success rate will be lower than expected.
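The underlying constraint can be sketched as follows (the function name is ours for illustration, not the package's API): the index must be long enough to address every binary segment of the file.

```python
import math

# Hypothetical sketch, not the actual YYC API: the minimum index length is
# the number of bits needed to address every binary segment of the file.
def min_index_length(segment_count):
    return max(1, math.ceil(math.log2(segment_count)))

print(min_index_length(10_000))   # 14 bits to address 10,000 segments
print(min_index_length(100_000))  # 17 bits to address 100,000 segments
```

Forcing a shorter index than this means some segments cannot be uniquely addressed, which is consistent with the lowered success rate observed in the test.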

Regarding the time complexity issue, we never claimed a linear time complexity in our paper. On the contrary, we stated that "It implies that with the binary segment contains extreme 0/1 ratio, the encoding process will be significantly time consuming." in the supplementary information (section "Detailed workflow and features of YYC system", paragraph one). Furthermore, screening-based coding schemes such as YYC and DNA Fountain differ in time complexity from constrained codes such as those of Church et al. and Goldman et al. As shown in the supplementary information of Erlich et al., 2017, section 1.3.5, the time complexity is affected by many factors, such as file size, restrictions, and so on. Under an improper configuration, the time complexity of the encoding process can reach an exponential level.
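Why screening slows down on unfavorable input can be shown with a one-line probabilistic argument (a hedged illustration, not a complexity analysis of the actual package): if each candidate sequence passes the validity screen with probability p, the expected number of attempts per segment follows a geometric distribution with mean 1/p.

```python
# Hedged illustration: expected screening attempts per segment is 1/p
# (geometric distribution). Biased 0/1 input lowers the pass probability p,
# so the encoding time grows sharply rather than linearly.
for p in (0.5, 0.1, 0.01):
    print(f"pass probability {p}: ~{1 / p:.0f} expected attempts")
```

As p approaches zero for extreme 0/1 ratios, the expected attempt count blows up, matching the statement quoted above from our supplementary information.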

We believe that some of the problems arising in the implementation might come from misuse of our package. For instance, we serialize our algorithm object into a byte stream using the standard Python library "pickle", allowing us to efficiently store its various hyperparameters. This serialized form can then be deserialized to reload the algorithm object, eliminating the need for users to manually re-enter the parameters of the Yin-Yang Code during the decoding process. This is a common model-storage technique in the field of machine learning.
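A minimal sketch of this pattern (the dictionary of hyperparameters below is hypothetical, not the actual object layout of the YYC package):

```python
import pickle

# Illustrative only: a hypothetical set of coder hyperparameters,
# serialized once at encoding time and reloaded at decoding time.
coder_config = {"rule_index": 888, "index_length": 14, "search_count": 100}

blob = pickle.dumps(coder_config)   # byte stream stored alongside the data
restored = pickle.loads(blob)       # deserialized during decoding
print(restored == coder_config)     # True: no manual re-entry of parameters
```

Loading such a file with anything other than `pickle.load`/`pickle.loads` (for example, treating it as plain text) will fail, which is one way the package can be misused.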

  3. Problems with the robustness test (referring to section 5 of the post)

First of all, the criterion used to evaluate robustness is stated in the Methods section under "Data analysis", and we quote it here: "The data recovery rate was calculated using $ \frac{\text{successful recovered binary segments}}{\text{total number of binary segments}} $". The issue arises mainly because Huang used an inappropriate criterion in the post, namely exact equality between the recovered binary matrix and the source binary matrix, to evaluate robustness.
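The difference between the two criteria can be made concrete with hypothetical numbers. Suppose 98 of 100 binary segments are successfully recovered:

```python
# Hypothetical data illustrating the two criteria discussed above.
recovered = [True] * 98 + [False] * 2   # 98 of 100 segments recovered

# Criterion stated in our Methods section: data recovery rate.
recovery_rate = sum(recovered) / len(recovered)
print(recovery_rate)   # 0.98

# Criterion used in Huang's post: exact matrix equality, which reports
# total failure even when a single segment is lost.
print(all(recovered))  # False
```

Under the first criterion this run is 98% successful; under the second it is counted as a complete failure, which explains the divergent conclusions.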

Second, YYC serves as a bit-to-base coding scheme, not an error-correcting algorithm. Bit-to-base transcoding and error correction are two different research topics, and there are precedents for both independent and combined usage. In recent practice, people tend to use combinations such as HEDGES (Press et al., 2020, PNAS) and SPIDER-WEB (our recent work). In our paper, we clearly stated that a Reed-Solomon (RS) code was used together with YYC for robust data recovery. Moreover, in the validation sections of our work, to avoid biased analysis, we compared DNA Fountain and YYC with the RS code applied to both implementations. When we checked the test code in Huang's GitHub repository, we found that no error-correction code was used in the tests. The difference between Huang's results and ours is therefore expected.

We thank Huang for the interest in our work. However, the post is biased, and the language used in it is improper and disrespectful. We believe all of this could have been avoided if Huang had simply contacted us in a proper manner to discuss the problems encountered while using YYC.

@HaolingZHANG HaolingZHANG added the help wanted Extra attention is needed label May 6, 2024
@HaolingZHANG HaolingZHANG pinned this issue May 6, 2024