#### Sources
- [awk manual](https://www.gnu.org/software/gawk/manual/html_node/Changing-Fields.html#Changing-Fields)

#### Output field separator (OFS) and the comma in `print`
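| Variable | Meaning |
|---|---|
| **FS** | input field separator (default: runs of white space) |
| **OFS** | output field separator, i.e. the string placed between fields when printing (default: a single space) |
| **RS** | record separator, i.e. the character that input records are split on (default: newline) |
| **ORS** | output record separator, i.e. the string printed after each record (default: newline) |
| **NR** | number of records read so far (the line number by default; the total record count inside `END`) |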
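
A minimal sketch of how these variables interact. The inline input and the separator values are made up for illustration and are not taken from the files in `data/`. The first command should print `1-b-a;2-d-c;` and the second should print `a` and `c` on separate lines.

```shell
# Hypothetical two-record input with comma-separated fields.
# FS splits each record on commas, OFS joins the comma-separated print arguments,
# ORS is written after each printed record, and NR counts the records read so far.
swapped=$(printf 'a,b\nc,d\n' | awk 'BEGIN { FS = ","; OFS = "-"; ORS = ";" } { print NR, $2, $1 }')
echo "$swapped"

# RS changes what counts as a record: here records are split on "|" instead of newline.
first_fields=$(printf 'a b|c d' | awk 'BEGIN { RS = "|" } { print $1 }')
echo "$first_fields"
```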

In [31]:
cat data/anno.gtf | head | awk '{print $2,$1}' - 

2 ##gff-version
rtracklayer ##source-version
2022-04-07 ##date
ensembl_havana 17
ensembl_havana 17
ensembl_havana 17
ensembl_havana 17
ensembl_havana 17
ensembl_havana 17
ensembl_havana 17


In [28]:
# record: row
# field: column
cat data/anno.gtf | head | awk '{print $2 $1}' - # without a comma the fields are concatenated: the literal space in the program is ignored and no OFS is inserted

2##gff-version
rtracklayer##source-version
2022-04-07##date
ensembl_havana17
ensembl_havana17
ensembl_havana17
ensembl_havana17
ensembl_havana17
ensembl_havana17
ensembl_havana17


In [32]:
cat data/anno.gtf | head | awk '{OFS="_____"; print $2,$1}' - 

2_____##gff-version
rtracklayer_____##source-version
2022-04-07_____##date
ensembl_havana_____17
ensembl_havana_____17
ensembl_havana_____17
ensembl_havana_____17
ensembl_havana_____17
ensembl_havana_____17
ensembl_havana_____17


#### Change csv to tsv

In [40]:
cat data/enhancers.csv | head | awk 'BEGIN{FS=","; OFS="\t"}{$1=$1;print $0}' - 

chr1	1015066	1015266	HSE897	9.5706324138
chr1	1590473	1590673	HSE853	17.3206898329
chr1	2120861	2121064	HSE86	66.0424471614
chr1	6336418	6336624	HSE394	14.8892086906
chr1	7404594	7404794	HSE315	24.228051023
chr1	11941325	11941525	HSE322	17.5192630328
chr1	15055555	15055755	HSE962	11.7065788965
chr1	15478024	15478224	HSE354	17.5771586196
chr1	16065307	16065508	HSE264	23.9092446094
chr1	23244249	23244449	HSE434	17.1331422162


##### [about `$1 = $1`](https://unix.stackexchange.com/questions/568666/how-does-awk-1-1-remove-extra-spaces)
> It is a common error to try to change the field separators in a record simply by setting FS and OFS, and then expecting a plain ‘print’ or ‘print $0’ to print the modified record.

In [42]:
echo "0    0" | awk '{$1 = $1}'1 # the trailing 1 is an always-true pattern; its default action prints the rebuilt record

0 0


In [43]:
echo "0    0" | awk '{$1 = $1}' # no output: the record is rebuilt but never printed
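
A small sketch of the pitfall described in the quote above, using a made-up one-line CSV record (the values are assumptions for illustration). Setting `OFS` alone leaves `$0` untouched. Assigning to a field, even to itself, makes awk rebuild `$0` with `OFS`.

```shell
# Hypothetical one-line CSV record.
line='chr1,100,200'

# OFS is set but no field is modified: $0 is printed exactly as it was read.
unchanged=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ","; OFS = "\t" } { print $0 }')

# Assigning to a field forces awk to rebuild $0, joining the fields with OFS.
rebuilt=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ","; OFS = "\t" } { $1 = $1; print $0 }')

printf '%s\n%s\n' "$unchanged" "$rebuilt"
```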

#### Arrange the column

In [23]:
# Print the whole record unchanged, for comparison
cat data/anno.gtf | head -4 | awk '{print $0}' -

##gff-version 2
##source-version rtracklayer 1.54.0
##date 2022-04-07
17	ensembl_havana	gene	1	3410	.	-	.	gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";


In [25]:
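cat data/anno.gtf | head -4 | awk '{$1=""; print $0}' -  # print everything except the 1st column; a leading separator remains and tabs become single spaces because $0 is rebuilt with the default OFS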

 2
 rtracklayer 1.54.0
 2022-04-07
 ensembl_havana gene 1 3410 . - . gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding";
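
One possible way to drop the first column without keeping the leading separator, shown on a made-up record. It assumes the default `OFS` of a single space. After `$1 = ""` the rebuilt record starts with that space, and `sub()` removes it.

```shell
# Hypothetical whitespace-separated record.
# $1 = "" empties the first field and rebuilds $0 as " alpha beta";
# sub(/^ /, "") then strips the leading space from $0 before printing.
rest=$(echo 'id1 alpha beta' | awk '{ $1 = ""; sub(/^ /, ""); print }')
echo "$rest"
```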


#### Mutate and transmute

In [56]:
cat data/inventory-shipped | head | awk '{nboxes = $3; $3 = $3 - 10; print nboxes, $3}' 

25 15
32 22
24 14
52 42
34 24
42 32
34 24
34 24
55 45
54 44


In [61]:
cat data/inventory-shipped | head | awk '{$6 = ($2 + $3 + $4 + $5); print $6}' - 

168
297
301
566
287
640
561
412
382
676
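
The cell above prints only the new column. As a sketch, printing `$0` instead keeps the original columns and appends the computed one. The sample row is an assumption: a month label followed by four numeric columns, which is the layout the commands above imply for `data/inventory-shipped`.

```shell
# Hypothetical sample row: label plus four numeric columns.
# Assigning to $6 (one past the last field) extends the record, and print $0 shows all columns.
with_total=$(echo 'Jan 13 25 15 115' | awk '{ $6 = $2 + $3 + $4 + $5; print $0 }')
echo "$with_total"
```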


#### Filter

In [90]:
cat data/anno.gtf | awk 'BEGIN {FS = "\t"; OFS = "\t"} {if($1 == "17" && $4 > 300 && $5 < 1190) print $0}' - 

17	ensembl_havana	exon	977	1169	.	-	.	gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENST00000225655"; transcript_version "6"; transcript_name "PFN1-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "1 (assigned to previous version 5)"; exon_number "2"; exon_id "ENSE00000676461"; exon_version "1"; ccds_id "CCDS11061"
17	ensembl_havana	CDS	977	1169	.	-	0	gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENST00000225655"; transcript_version "6"; transcript_name "PFN1-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "1 (assigned to previous version 5)"; exon_number "2"; protein_id "ENSP00000225655"; protein_version "5"; ccds_id "CCDS11061"
17	havana	exon	1042	1169	.	-	.	gene_id 

In [89]:
cat data/anno.gtf | head -20 | awk 'BEGIN {FS = "\t"; OFS = "\t"} {if($3 ~ /CDS/) print $0}' - 

17	ensembl_havana	CDS	2612	2743	.	-	0	gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENST00000225655"; transcript_version "6"; transcript_name "PFN1-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "1 (assigned to previous version 5)"; exon_number "1"; protein_id "ENSP00000225655"; protein_version "5"; ccds_id "CCDS11061"
17	ensembl_havana	CDS	977	1169	.	-	0	gene_id "ENSG00000108518"; gene_version "8"; gene_name "PFN1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_id "ENST00000225655"; transcript_version "6"; transcript_name "PFN1-201"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "1 (assigned to previous version 5)"; exon_number "2"; protein_id "ENSP00000225655"; protein_version "5"; ccds_id "CCDS11061"
17	ensembl_havana	CDS	252	346	.	-
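
The same kind of filter can also be written as a bare pattern, since awk prints matching records when no action is given. A minimal sketch on made-up tab-separated rows (only the first row satisfies all three conditions):

```shell
# Three hypothetical tab-separated rows: chromosome, source, feature, start, end.
# A pattern with no action prints every record for which the pattern is true.
kept=$(printf '17\ts\texon\t977\t1169\n17\ts\texon\t100\t200\n18\ts\texon\t977\t1169\n' |
  awk 'BEGIN { FS = "\t" } $1 == "17" && $4 > 300 && $5 < 1190')
echo "$kept"
```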

In [99]:
# GTF coordinates are 1-based and inclusive, BED coordinates are 0-based and half-open, hence $4-1
# grep -v '^#' data/anno.gtf | awk 'BEGIN {FS = "\t"; OFS = "\t"} {print $1, $4-1, $5}' - 
cat data/anno.gtf | awk 'BEGIN {FS = "\t"; OFS = "\t"} !/^#/{print $1, $4-1, $5}' - 

17	0	3410
17	0	2879
17	2611	2879
17	2611	2743
17	2740	2743
17	976	1169
17	976	1169
17	0	346
17	251	346
17	248	251
17	2743	2879
17	0	248
17	15	1818
17	976	1818
17	976	1193
17	1190	1193
17	15	346
17	251	346
17	248	251
17	1193	1818
17	15	248
17	1041	3410
17	3292	3410
17	3292	3368
17	3365	3368
17	2611	2904
17	2611	2904
17	1041	1169
17	1041	1169
17	3368	3410


In [109]:
cat data/anno.gtf | awk 'BEGIN {FS = "\t"; OFS = "\t"} {if($3 ~ /CDS/) print $1, $4-1, $5}' - > data/CDS.bed # GTF to BED for CDS only

In [112]:
cat data/anno.gtf | awk 'BEGIN {FS = "\t"; OFS = "\t"} {if($3 ~ /CDS/) print $1, $4-1, $5}' - > data/CDS.bed
# mean CDS length: in END, NR is the total number of BED records
cat data/CDS.bed | awk 'BEGIN {FS = "\t"; sum = 0} {len = $3-$2; sum = sum + len} END{print sum/NR}' - 

153.625
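
The conversion and the averaging can also be done in one pass without an intermediate file. This is a sketch on made-up GTF-like rows, not the author's pipeline. It keeps its own counter `n` because `NR` would also count the header and non-CDS rows. The two CDS rows below have lengths 193 and 95, so the mean is 144.

```shell
# Hypothetical input: one header line, two CDS rows and one exon row (tab-separated).
# BED start is GTF start minus 1, so each length is $5 - ($4 - 1).
mean=$(printf '##header\n17\tsrc\tCDS\t977\t1169\t.\t-\t0\tx\n17\tsrc\texon\t977\t1169\t.\t-\t.\tx\n17\tsrc\tCDS\t252\t346\t.\t-\t0\tx\n' |
  awk 'BEGIN { FS = "\t" } $3 == "CDS" { sum += $5 - ($4 - 1); n++ } END { if (n > 0) print sum / n }')
echo "$mean"
```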


In [86]:
# reads = lines / 4; head keeps only the first 10 lines here, hence 2.5 rather than a whole number
cat data/SRR13345674.fastq | head | awk 'END{print NR/4}' -

2.5
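
For comparison, a sketch of the same count on a complete made-up FASTQ snippet. It assumes the usual four lines per read; the read names and sequences are invented. Two full records give a whole number.

```shell
# Two hypothetical FASTQ records: header, sequence, "+", and quality line for each.
reads=$(printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | awk 'END { print NR / 4 }')
echo "$reads"
```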


In [None]:
# GTF to BED with BEDOPS gtf2bed; workaround for records that lack transcript_id: https://www.biostars.org/p/206342/ 
awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' input.gtf | gtf2bed - > output.bed