A need I see often is to take a collection of files, each holding a list of attributes, and turn them into a "booleanized" or count table. Each file becomes a column and each distinct item found across the files becomes a row. A cell is either 1 or 0, or a value if the files have a second column.

For example, a file called "apple" contains "red" and "fruit" and a file called "orange" contains "orange" and "fruit". The resulting table would look something like...

| | apple | orange |
|-|-|-|
| red | 1 | 0 |
| orange | 0 | 1 |
| fruit | 1 | 1 |


In [1]:
cat > apple <<-EOF
red
fruit
EOF

In [2]:
cat apple

red
fruit


In [3]:
cat > orange <<-EOF
orange
fruit
EOF

In [4]:
cat orange

orange
fruit


In gawk it would look something like this...

In [5]:
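gawk 'FNR == 1 {
  # New input file: drop anything after the first "." and use the rest as the column name
  split(FILENAME, f, ".")
  filename = f[1]
  filenames[filename]++
}
{
  attribute = $1
  attributes[attribute]++
  b[filename][attribute]++   # per-file count of each attribute
}
END {
  # asorti() replaces each array with its sorted indices (keyed 1..n) and returns n
  nattr  = asorti(attributes)
  nfiles = asorti(filenames)
  printf "-"
  # Loop by number: "for (k in array)" does not guarantee sorted order
  for (j = 1; j <= nfiles; j++) printf ",%s", filenames[j]
  printf "\n"
  for (i = 1; i <= nattr; i++) {
    printf "%s", attributes[i]   # explicit format, so a "%" in the data is printed literally
    # "in" tests membership without creating empty elements in b
    for (j = 1; j <= nfiles; j++) printf ",%d", ((attributes[i] in b[filenames[j]]) ? 1 : 0)
    printf "\n"
  }
}' apple orange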

-,apple,orange
fruit,1,1
orange,0,1
red,1,0
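
The introduction mentions that a cell can hold a value when the files have a second column. A minimal sketch of that variant follows, assuming two whitespace-separated columns (attribute, then value) and that a missing attribute should be shown as 0. The function name `value_table` is hypothetical, and like the version above it keeps everything in memory.

```shell
#!/bin/bash
# Sketch only: same in-memory approach, but each cell holds the value from the
# second column of the file (0 when the attribute is absent from that file).
# Assumes gawk (arrays of arrays, asorti) and lines of the form: attribute value
# value_table is a hypothetical name used for illustration.
value_table() {
  gawk 'FNR == 1 {
    split(FILENAME, f, ".")
    filename = f[1]
    filenames[filename]          # a bare reference is enough to create the key
  }
  {
    attributes[$1]
    b[filename][$1] = $2         # keep the value instead of a count
  }
  END {
    nattr  = asorti(attributes)
    nfiles = asorti(filenames)
    printf "-"
    for (j = 1; j <= nfiles; j++) printf ",%s", filenames[j]
    printf "\n"
    for (i = 1; i <= nattr; i++) {
      attr = attributes[i]
      printf "%s", attr
      for (j = 1; j <= nfiles; j++) {
        file = filenames[j]
        printf ",%s", ((attr in b[file]) ? b[file][attr] : 0)
      }
      printf "\n"
    }
  }' "$@"
}
```

Calling `value_table apple orange` on two-column versions of the example files would print the same layout as above, with the stored values in place of the 1s.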


For the most part this is just fine; however, it requires that all of the files be read and loaded into memory. Suppose you have a couple hundred files, each attribute is a string of, say, 10 characters, and each file contains about 20,000 attributes.

In [6]:
echo $(( 200 * 10 * 8 * 20000 ))

320000000


320MB isn't so bad. But if you have 450 files with 1 to 2 million attributes each (taking 1.5 million as the midpoint) ...

In [7]:
echo $(( 450 * 10 * 8 * 1500000 ))

54000000000


54GB just to load the data files. When I loaded this data set it took 120GB because the data is actually in 2 arrays.

In [8]:
\rm apple
\rm orange

I need to devise a way to do this using much less memory.

My plan is to read all of the files once to collect the attributes (or supply a separate list of all the attributes), then process each file one at a time and build the table transposed, so that each file is a row and each attribute is a column. Once that is complete, transpose the table into its final orientation using either `datamash -t, transpose` or a "cut and paste" script like


```
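#!/bin/bash
# Usage: <this script> FILE DELIM
# Transpose a delimited table by cutting out each column and pasting it back as a row.

# Column count = number of delimiters in the first line + 1
# (-F matches the delimiter literally instead of as a regular expression)
numc=$(( $(head -n 1 "$1" | grep -oF -- "$2" | wc -l) + 1 ))
for ((i = 1; i <= numc; i++)); do
  cut -d "$2" -f "$i" "$1" | paste -s -d "$2"
done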
```
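
The plan described above could be sketched roughly as follows. This is a sketch under assumptions, not a finished tool: the function name `build_transposed` is hypothetical, attributes are assumed to be the first whitespace-separated field of each line and to contain no commas, and plain `awk` is used because nothing here needs gawk extensions. The attribute list lives in a temporary file, and only one data file's attributes are held in memory at a time.

```shell
#!/bin/bash
# Sketch only: emit the table transposed (one row per input file).
# build_transposed is a hypothetical name used for illustration.
build_transposed() {
  local attrs file
  attrs=$(mktemp) || return 1

  # Pass 1: sorted, de-duplicated list of every attribute in every file.
  awk '{ print $1 }' "$@" | LC_ALL=C sort -u > "$attrs"

  # Header row: "-" followed by the attributes in sorted order.
  awk 'BEGIN { printf "-" } { printf ",%s", $1 } END { printf "\n" }' "$attrs"

  # Pass 2: one output row per file, written as the attribute list is walked.
  for file in "$@"; do
    awk -v name="${file%%.*}" '
      BEGIN { printf "%s", name }
      FILENAME == ARGV[1] { seen[$1] = 1; next }   # first input: this data file only
      { printf ",%d", (($1 in seen) ? 1 : 0) }     # second input: the attribute list
      END { printf "\n" }
    ' "$file" "$attrs"
  done

  rm -f -- "$attrs"
}
```

For files like the apple and orange examples, `build_transposed apple orange` prints the header `-,fruit,orange,red` followed by `apple,1,0,1` and `orange,1,1,0`. Transposing that output, with `datamash -t, transpose` or the cut-and-paste script, gives the file-per-column table.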