for(i in array) not sequential #171

Closed
hallenstal opened this issue Feb 10, 2023 · 5 comments

@hallenstal

On macOS, awk version 20200816:
echo "one;three;54;3;86;seven" | awk '{split($0,a,";");for(i in a){print "a[" i "]=" a[i] }}'
a[2]=three
a[3]=54
a[4]=3
a[5]=86
a[6]=seven
a[1]=one

@aksr

aksr commented Feb 10, 2023

Nice catch, can confirm (5e49ea4).

@arnoldrobbins
Collaborator

awk purposely does not define the order in which a for (i in array) loop goes through the array. You cannot depend on it to be "sequential", and different implementations will go through the loop in different orders. If you require sequential traversal, do it like so:

n = length(array)
for (i = 1; i <= n; i++)
    print array[i]    # or any other use of array[i]

This should only be used when you know for sure that the indices are sequential (such as with split()) since indices can be strings, or even be missing.
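For the split() case from the original report, the counted loop can be sketched as below. This is a minimal illustration, not the maintainers' code: it uses the return value of split() (the number of elements) rather than length(array), and the function name `print_in_order` is made up for the example.

```shell
# print_in_order is a hypothetical helper name used only for this illustration.
print_in_order() {
  awk '{
    n = split($0, a, ";")        # split() returns the number of elements it stored
    for (i = 1; i <= n; i++)     # indices created by split() run from 1 to n
      print "a[" i "]=" a[i]
  }'
}

echo "one;three;54;3;86;seven" | print_in_order
```

Because the loop counts from 1 to n itself, the output order no longer depends on how the implementation walks its hash table.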

Closing this issue.

@hallenstal
Author

hallenstal commented Feb 10, 2023 via email

@ryenus

ryenus commented Jan 29, 2024

Possible to revisit the decision here?

I'd argue for several points:

  • Given that awk arrays are actually associative, like maps, the keys can be numbers or strings, or even a series of numbers with skipped values (holes). It is therefore preferable to use for(var in array) to loop over an array.

  • Making things worse, the original awk doesn't even provide a builtin array length function. To be able to iterate through a properly indexed array incrementally, one has to first loop through the array using for(var in array) to count the array length, then loop the array again with for(i=0;i<length;i++), to get the order right. This also applies to some other awk distributions.

    • Even worse, if the array contains string keys, then array[pos] would NOT work because the key at position pos could be a string instead of the natural number, causing pos to be an invalid index.

      $ echo "one;three;54;3;86;seven" | /usr/bin/awk '{split($0,a,";");a["k"]="v";len=length(a); print "length:"len; for(i=1;i<len;i++) {print "a:"i, a[i]} }'
      length:7   # due to the extra entry `a["k"]="v"`
      a:1 one
      a:2 three
      a:3 54
      a:4 3
      a:5 86
      a:6 seven
      # a:k v is missed
  • With for (var in array), the array is iterated almost in sequential order, except that the first element is always iterated last. Doesn't that look like a suspicious off-by-one bug somewhere?

    echo "one;three;54;3;86;seven" | awk '{split($0,a,";");for(i in a){print "a[" i "]=" a[i] }}'
    a[2]=three
    a[3]=54
    a[4]=3
    a[5]=86
    a[6]=seven
    a[1]=one

@arnoldrobbins
Collaborator

Hello.

> Possible to revisit the decision here?

Not really, no. The array management isn't going to change.

> I'd argue for several points:
>
>   • Given that awk arrays are actually associative, like maps, the keys can be numbers or strings, or even a series of numbers with skipped values (holes). It is therefore preferable to use for(var in array) to loop over an array.

So this is arguing against ordered traversal of the array.

>   • Making things worse, the original awk doesn't even provide a builtin array length function.

If by "original" you mean this version, you are incorrect. It has supported length(array) since January of 2002, over 20 years.

> To be able to iterate through a properly indexed array incrementally, one has to first loop through the array using for(var in array) to count the array length, then loop the array again with for(i=0;i<length;i++), to get the order right. This also applies to some other awk distributions.

This isn't necessary. If you know that an array is indexed from 1 to N, you can do this:

for (i = 1; i in array; i++) ...
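
A runnable sketch of that membership-test loop, assuming the same input as the report. The function name `walk_from_one` and the extra `a["k"]` entry are only for illustration; the loop condition is written as `(i in a)`, which is the same test with explicit parentheses.

```shell
# walk_from_one is a hypothetical helper name used only for this illustration.
walk_from_one() {
  awk '{
    split($0, a, ";")
    a["k"] = "v"                   # a string key; the loop below never visits it
    for (i = 1; (i in a); i++)     # stops at the first missing integer index
      print "a[" i "]=" a[i]
  }'
}

echo "one;three;54;3;86;seven" | walk_from_one
```

No length is needed, and a stray string key does not disturb the count. The loop does stop early if the integer indices have a hole.
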
>   • Even worse, if the array contains string keys, then array[pos] would NOT work because the key at position pos could be a string instead of the natural number, causing pos to be an invalid index.

So this also argues against trying to provide ordered traversal of arrays.

>   • With for (var in array), the array is iterated almost in sequential order, except that the first element is always iterated last. Doesn't that look like a suspicious off-by-one bug somewhere?
>
>     echo "one;three;54;3;86;seven" | awk '{split($0,a,";");for(i in a){print "a[" i "]=" a[i] }}'
>     a[2]=three
>     a[3]=54
>     a[4]=3
>     a[5]=86
>     a[6]=seven
>     a[1]=one

Arrays are implemented using hash tables. What you're seeing is how things hash. Since the number of items in the array is small, it looks like it's sequential, but if you put in a lot of elements (say 100), you'll see that the order isn't sequential at all. In short, there's no bug here.
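
One way to see this is to insert 100 keys in ascending order and print them in whatever order the for-in loop produces. This is only a sketch: the function name `show_order` is made up, and no expected output is given because the order depends on the awk implementation and version.

```shell
# show_order is a hypothetical helper name used only for this illustration.
show_order() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++) a[i] = i   # insert keys 1..n in ascending order
    for (k in a) print k                # visit order is up to the implementation
  }'
}

show_order 100
```

Every key is printed exactly once, but nothing guarantees that the keys come out as 1, 2, 3 and so on.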

As described, ordered traversal isn't so simple. Gawk provides ways to do it. It isn't the default in awk both because it's difficult to define what the ordering should be when numbers and strings are mixed, and also because it adds an extra expensive step to the process: sorting. The cost for setting up an ordered traversal through a hash table, particularly when there are lots of elements, can be measured and it can be expensive. Making ordered traversal the default means that users are paying for a feature they rarely need, and that's not a nice way to write software.
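
When ordered traversal is needed in an awk without such a feature, the sort step can be written by hand. The sketch below is an assumption-laden illustration, not something from this repository: it assumes every key is numeric (hence the `+ 0` comparisons), uses a simple insertion sort suited to small arrays, and the name `print_sorted` is made up. In gawk, the documented alternatives include setting PROCINFO["sorted_in"] or calling asorti().

```shell
# print_sorted is a hypothetical helper name used only for this illustration.
print_sorted() {
  awk '{
    split($0, a, ";")
    n = 0
    for (k in a) keys[++n] = k             # collect keys in whatever order awk yields
    for (i = 2; i <= n; i++) {             # insertion sort; fine for small arrays
      key = keys[i]
      # Adding 0 forces a numeric comparison; keys are otherwise compared as strings.
      for (j = i - 1; j >= 1 && (keys[j] + 0) > (key + 0); j--)
        keys[j + 1] = keys[j]
      keys[j + 1] = key
    }
    for (i = 1; i <= n; i++)               # traverse in sorted key order
      print "a[" keys[i] "]=" a[keys[i]]
  }'
}

echo "one;three;54;3;86;seven" | print_sorted
```

The explicit sort is the extra cost described above, and the `+ 0` shows why mixed numeric and string keys have no single obvious ordering.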

I hope all this helps. Thanks.


4 participants