# Sorting and counting utilities 

## Preparing environment 

Let's create a file with repeated lines:

In [99]:
! rm -f uniq_example.txt

! seq 1 2 10 > uniq_example.txt
! cat uniq_example.txt

! echo ""

! seq 1 10 >> uniq_example.txt
! cat uniq_example.txt

1
3
5
7
9

1
3
5
7
9
1
2
3
4
5
6
7
8
9
10
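The cell above uses two of the argument forms that `seq` accepts. As a small sketch in plain shell (in a notebook cell each command would need the `!` prefix), the three forms look like this; the variable names are only illustrative:

```shell
# seq LAST: count from 1 up to LAST
first_three=$(seq 3)
# seq FIRST LAST: count from FIRST up to LAST
four_to_six=$(seq 4 6)
# seq FIRST INCREMENT LAST: start at 1, add 2 each step, stop at or before 10
odd_numbers=$(seq 1 2 10)

echo "$first_three"
echo "$four_to_six"
echo "$odd_numbers"
```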


## Examples 

* `-n` sorts numerically (by default, sort compares lines as text; `-d` restricts that text comparison to blanks and alphanumeric characters)

In [100]:
! sort -n uniq_example.txt

1
1
2
3
3
4
5
5
6
7
7
8
9
9
10
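To see why `-n` matters, it may help to compare it with the default text comparison on a few sample values. This is a minimal sketch with made-up input; `LC_ALL=C` is set only to make the text ordering predictable across machines.

```shell
# Three sample values (hypothetical data, not taken from uniq_example.txt)
values=$(printf '10\n9\n2\n')

# Default comparison is textual: "10" sorts before "2" because '1' < '2'
text_order=$(printf '%s\n' "$values" | LC_ALL=C sort)
# -n compares the numeric value of each line
numeric_order=$(printf '%s\n' "$values" | LC_ALL=C sort -n)

echo "text:    $(echo $text_order)"
echo "numeric: $(echo $numeric_order)"
```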


* `-r` reverses the result

In [101]:
! sort -nr uniq_example.txt

10
9
9
8
7
7
6
5
5
4
3
3
2
1
1


* We can pass several files to sort; their lines are sorted together

In [102]:
! sort -n uniq_example.txt uniq_example.txt

1
1
1
1
2
2
3
3
3
3
4
4
5
5
5
5
6
6
7
7
7
7
8
8
9
9
9
9
10
10


* -u (unique) to remove duplicates

In [103]:
! sort -nu uniq_example.txt
!rm -f uniq_example.txt ; echo removing uniq_example.txt

1
2
3
4
5
6
7
8
9
10
removing uniq_example.txt
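One detail worth knowing is that `-u` decides what counts as a duplicate using the same comparison as the sort itself. The sketch below uses made-up input in which the same number is written three different ways; the variable names are illustrative.

```shell
# Hypothetical input: the value one written three ways, plus a 2
samples=$(printf '1\n01\n1.0\n2\n')

# Plain -u compares whole lines as text, so all four lines are distinct
text_unique=$(printf '%s\n' "$samples" | LC_ALL=C sort -u | wc -l | tr -d ' ')
# -nu compares numeric values, so 1, 01 and 1.0 count as duplicates
numeric_unique=$(printf '%s\n' "$samples" | LC_ALL=C sort -nu | wc -l | tr -d ' ')

echo "distinct as text:    $text_unique"
echo "distinct as numbers: $numeric_unique"
```

Which of the equal lines survives with `-nu` is best not relied upon; only the number of surviving lines is checked here.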


* `-t "d"`: use "d" as the field delimiter. By default, sort splits fields at the transition from non-blank to blank characters (whitespace).
* `-k M[,N]` (same as `--key=M[,N]`): the sort key runs from field M to field N inclusive (or to the end of the line if N is omitted)
* `sort -t"," -k1,2 -k3n,3 file`: sort a file by its 1st and 2nd fields, then numerically by the 3rd field.
* `sort -t"," -k1,1 -u file`: remove duplicates from the file based on the 1st field
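The `-t` and `-k` options listed above can be tried on a small sample before running them on real data. The following sketch assumes a hypothetical `scores.csv` with the columns name, team and score, created in a temporary directory.

```shell
# Hypothetical comma-separated sample: name,team,score
workdir=$(mktemp -d)
cat > "$workdir/scores.csv" <<'EOF'
bob,red,7
alice,blue,12
carol,red,3
dave,blue,12
EOF

# Sort by team (field 2) as text, then by score (field 3) numerically
by_team_then_score=$(LC_ALL=C sort -t ',' -k 2,2 -k 3,3n "$workdir/scores.csv")
echo "$by_team_then_score"

# Keep one line per team: -u compares only the key given by -k
one_per_team=$(LC_ALL=C sort -t ',' -k 2,2 -u "$workdir/scores.csv")
echo "$one_per_team"

rm -r "$workdir"
```

Lines that tie on every key (alice and dave here) are ordered by comparing the whole line as a last resort.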

* Sort in reverse on the 6th, 7th and 8th columns (the key `-k 6r` runs from the 6th column to the end of the line), using ^ as the separator

In [104]:
! sort -t "^" -k 6r ~/Data/opentraveldata/optd_aircraft.csv | head

APF^BAe^ATP Freighter^APF^2T^ZZZZ^2^T
BTA^^Business Turbo-Prop Aircraft^BTA^2T^ZZZZ^2^T
351^Airbus^A350-1000^350^2J^ZZZZ^2^J
358^Airbus^A350-800^350^2J^ZZZZ^2^J
359^Airbus^A350-900^350^2J^ZZZZ^2^J
A58^Antonov^An-158^A58^2J^ZZZZ^2^J
CRF^Canadair^CRJ Freighter^CRF^2J^ZZZZ^2^J
CS1^Bombardier^CSeries CS100^CSB^2J^ZZZZ^2^J
CS3^Bombardier^CS300^CSB^2J^ZZZZ^2^J
H20^Hawker-Beechcraft^Hawker 200^HBA^2J^ZZZZ^2^J


* Sort in reverse on the 6th column only (`-k 6r,6` ends the key at the 6th column)

In [105]:
! sort -t "^" -k 6r,6 ~/Data/opentraveldata/optd_aircraft.csv | head

351^Airbus^A350-1000^350^2J^ZZZZ^2^J
358^Airbus^A350-800^350^2J^ZZZZ^2^J
359^Airbus^A350-900^350^2J^ZZZZ^2^J
A58^Antonov^An-158^A58^2J^ZZZZ^2^J
APF^BAe^ATP Freighter^APF^2T^ZZZZ^2^T
BTA^^Business Turbo-Prop Aircraft^BTA^2T^ZZZZ^2^T
CRF^Canadair^CRJ Freighter^CRF^2J^ZZZZ^2^J
CS1^Bombardier^CSeries CS100^CSB^2J^ZZZZ^2^J
CS3^Bombardier^CS300^CSB^2J^ZZZZ^2^J
H20^Hawker-Beechcraft^Hawker 200^HBA^2J^ZZZZ^2^J
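The difference between `-k 6r` and `-k 6r,6` is easier to see on a tiny input. This is a sketch with made-up three-column rows, so the key starts at field 2 instead of field 6; the behaviour shown for ties is the one observed with GNU sort and matches the two outputs above.

```shell
# Hypothetical rows: id^group^rank
rows=$(printf 'x^B^1\ny^B^2\nz^A^9\n')

# -k 2r: the key runs from field 2 to the end of the line,
# so in reverse order "B^2" comes before "B^1", which comes before "A^9"
open_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t '^' -k 2r)
# -k 2r,2: the key is field 2 only; the two "B" rows tie on the key
# and are then ordered by comparing the whole line (ascending)
closed_key=$(printf '%s\n' "$rows" | LC_ALL=C sort -t '^' -k 2r,2)

echo "open key:   $(echo $open_key)"
echo "closed key: $(echo $closed_key)"
```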


* Sort by the 2nd column and remove duplicates based on the 2nd column. Note, for example, that there is just one _Airbus_ row

In [106]:
! sort -t "^" -k 2,2 -u ~/Data/opentraveldata/optd_aircraft.csv | head

BTA^^Business Turbo-Prop Aircraft^BTA^2T^ZZZZ^2^T
NDC^Aerospatiale / SNIAS^SN.601 Corvette^NDC^2J^S601^2^J
AGH^Agusta / AgustaWestland^A-109^AGH^H^A109^^H
AWH^AgustaWestland^AW139^AWH^2T^A139^2^T
19D^Airbus^A319^^^^^
L4F^Aircraft Industries (LET)^L-410 Freighter^L4F^2T^L410^2^T
CS5^Airtech^CN-235^CS5^2T^CN35^2^T
A22^Antonov^An-22^A22^4T^AN22^4^T
A26^Antonow / Antonov^An-26^AN6^2T^AN26^2^T
AT4^ATR^ATR 42-300 / 320^ATR^2T^AT43^2^T
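Since `-u` silently drops the other rows that share a key, it can be useful to check how many rows each key had. A possible sketch, using a few made-up rows in the same ^-separated layout and `cut` to extract the 2nd column:

```shell
# Hypothetical rows: code^manufacturer^model
rows=$(printf '319^Airbus^A319\n320^Airbus^A320\nAT4^ATR^ATR 42\n')

# sort -u -k 2,2 keeps a single row per manufacturer
one_per_maker=$(printf '%s\n' "$rows" | LC_ALL=C sort -t '^' -k 2,2 -u)
# cut + sort + uniq -c shows how many rows each manufacturer had
rows_per_maker=$(printf '%s\n' "$rows" | cut -d '^' -f 2 | LC_ALL=C sort | uniq -c)

echo "$one_per_maker"
echo "$rows_per_maker"
```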


## Exercises

* Find the top 10 files by size in your home directory, including subdirectories. Sort them by size and print the result, including the size and the name of each file (hint: use find with the -size and -exec ls -s parameters)

In [107]:
% cd 
! find -type f -size +10000k -exec ls -s {}  \;  | sort -k 1rn,1 | head -n 10

/home/dsc
541968 ./Data/challenge/bookings.csv.bz2
471872 ./Data/challenge/searches.csv.bz2
189092 ./anaconda3/pkgs/mkl-2018.0.1-h19d6760_4.tar.bz2
72644 ./anaconda3/lib/libmkl_avx512_mic.so
72644 ./anaconda3/pkgs/mkl-2018.0.1-h19d6760_4/lib/libmkl_avx512_mic.so
63860 ./anaconda3/lib/libmkl_avx512.so
63860 ./anaconda3/pkgs/mkl-2018.0.1-h19d6760_4/lib/libmkl_avx512.so
58148 ./anaconda3/bin/pandoc
58148 ./anaconda3/pkgs/pandoc-1.19.2.1-hea2e7c5_1/bin/pandoc
53944 ./Data/airline_tickets/sales_segments.csv.gz
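The `-size +10000k` filter above keeps the input to sort small, but it would return fewer than 10 files if not enough files exceed that threshold. A variant without the filter is sketched below on a throwaway directory tree (the file names are made up). Note that `ls -s` reports allocated blocks, whose unit can differ between systems.

```shell
# Build a throwaway tree so the sketch does not depend on your home directory
workdir=$(mktemp -d)
mkdir "$workdir/sub"
seq 1 200000 > "$workdir/sub/big.txt"   # roughly 1.3 MB of text
seq 1 2000 > "$workdir/small.txt"       # a few KB

# No -size filter: list every regular file, then let sort pick the largest.
# "{} +" passes many files to a single ls call instead of one call per file.
largest=$(cd "$workdir" && find . -type f -exec ls -s {} + | sort -k 1,1nr | head -n 10)
echo "$largest"

rm -r "$workdir"
```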


*Create a dummy file with this command: seq 15 > 20lines.txt; seq 9 1 20 >> 20lines.txt; echo "20\n20" >> 20lines.txt; (check the content of the file first)*

In [108]:
! rm -f 20lines.txt
! seq 15 >> 20lines.txt
! seq 9 1 20 >> 20lines.txt
! echo "20\n20" >> 20lines.txt
! cat 20lines.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
9
10
11
12
13
14
15
16
17
18
19
20
20
20
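Whether `echo "20\n20"` prints two lines depends on the shell that runs it: some shells (dash, for example) expand `\n`, while the bash builtin prints the backslash literally unless `-e` is given. The output above shows that the expansion happened in this environment. A more portable sketch uses `printf`, which interprets `\n` in its format string in any POSIX shell:

```shell
# printf expands \n in the format string regardless of the shell
two_lines=$(printf '20\n20\n')
line_count=$(printf '20\n20\n' | wc -l | tr -d ' ')

echo "$two_lines"
echo "lines written: $line_count"
```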


* Sort the lines of the file in dictionary order, that is, based on alphanumeric characters

In [109]:
! sort -d 20lines.txt

1
10
10
11
11
12
12
13
13
14
14
15
15
16
17
18
19
2
20
20
20
3
4
5
6
7
8
9
9
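On a file that contains only digits, `-d` gives the same order as a plain text sort, so its effect is not visible above. The sketch below uses made-up lines that start with punctuation to show what `-d` changes; the expected orders are what GNU sort produces in the C locale, and other implementations or locales may differ.

```shell
# Hypothetical lines that start with punctuation
lines=$(printf '%s\n' '-b' 'a' '+c')

# Default text sort in the C locale compares every byte: '+' < '-' < 'a'
byte_order=$(printf '%s\n' "$lines" | LC_ALL=C sort | tr '\n' ',')
# -d ignores everything except blanks and alphanumerics: compares b, a, c
dict_order=$(printf '%s\n' "$lines" | LC_ALL=C sort -d | tr '\n' ',')

echo "default: $byte_order"
echo "with -d: $dict_order"
```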


* Sort the lines of the file based on numeric values and eliminate the duplicates

In [110]:
! sort -nu 20lines.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


* Print just the duplicated lines of the file. Hint: uniq is the only one which can print only duplicates, but it needs its input to be sorted first.

In [111]:
! sort -n 20lines.txt | uniq -d 

9
10
11
12
13
14
15
20
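The reason for sorting first is that `uniq` only compares each line with the one immediately before it. A minimal sketch with made-up input, where the two `a` lines are not adjacent:

```shell
# Hypothetical input: "a" appears twice, but not on consecutive lines
unsorted=$(printf 'a\nb\na\n')

# Without sorting, uniq never sees the two "a" lines next to each other
without_sort=$(printf '%s\n' "$unsorted" | uniq -d)
# After sorting, equal lines are adjacent and uniq -d reports them
with_sort=$(printf '%s\n' "$unsorted" | LC_ALL=C sort | uniq -d)

echo "without sort: '$without_sort'"
echo "with sort:    '$with_sort'"
```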


* Print the line which has the most repetitions (count the occurrences with uniq -c, then sort by the count in reverse)

In [112]:
! sort -n 20lines.txt | uniq -c | sort -k 1nr,1 | head -n 1

      3 20


* Print unique lines with the number of repetitions sorted by the number of repetitions from lowest to highest

In [113]:
! sort -n 20lines.txt | uniq -c | sort -k 1n,1

      1 1
      1 16
      1 17
      1 18
      1 19
      1 2
      1 3
      1 4
      1 5
      1 6
      1 7
      1 8
      2 10
      2 11
      2 12
      2 13
      2 14
      2 15
      2 9
      3 20
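In the output above, lines with the same count are ordered as text (16 comes before 2) because sort falls back to comparing the whole line when the keys tie. If a numeric order within each count is preferred, a second key can be added. A sketch on a few made-up values:

```shell
# Hypothetical values: 9 and 10 appear twice, 1 and 2 once
counted=$(printf '%s\n' 9 10 10 2 9 1 | sort -n | uniq -c)

# Count only: ties are broken by comparing the whole line as text
text_ties=$(printf '%s\n' "$counted" | LC_ALL=C sort -k 1,1n)
# Count, then value: both keys are compared numerically
numeric_ties=$(printf '%s\n' "$counted" | LC_ALL=C sort -k 1,1n -k 2,2n)

echo "$text_ties"
echo "$numeric_ties"
```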


*Create another file with this command: seq 0 2 40 > 20lines2.txt*

In [114]:
! rm -f 20lines2.txt
! seq 0 2 40 > 20lines2.txt
! cat 20lines2.txt

0
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40


* Create a third file combining the first two files (20lines.txt and 20lines2.txt) but without duplicates. Hint: remember that sort can accept several files as input.

In [115]:
! rm -f other_file.txt
! sort -n -u 20lines.txt 20lines2.txt  > other_file.txt
! cat other_file.txt

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
22
24
26
28
30
32
34
36
38
40
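Instead of redirecting with `>`, sort can write the result itself with `-o`. The sketch below uses two small made-up files in a temporary directory. One practical difference is that `-o` may name one of the input files, whereas `sort file > file` would empty the file before sort reads it.

```shell
workdir=$(mktemp -d)
seq 1 3 > "$workdir/a.txt"
seq 2 4 > "$workdir/b.txt"

# -o names the output file; the inputs are combined and deduplicated
sort -n -u -o "$workdir/merged.txt" "$workdir/a.txt" "$workdir/b.txt"
merged=$(cat "$workdir/merged.txt")

# sort reads all of its input before writing, so -o can overwrite an input file
sort -n -r -o "$workdir/a.txt" "$workdir/a.txt"
reversed=$(cat "$workdir/a.txt")

echo "merged:   $(echo $merged)"
echo "reversed: $(echo $reversed)"
rm -r "$workdir"
```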


* Merge the content of 20lines.txt and 20lines2.txt into 40lines.txt. Print the unique lines of 40lines.txt together with their number of occurrences, sorted by line content.

In [116]:
! rm -f 40lines.txt
! cat 20lines.txt 20lines2.txt > 40lines.txt

! sort 40lines.txt | uniq -c | sort -k 2,2

      1 0
      1 1
      3 10
      2 11
      3 12
      2 13
      3 14
      2 15
      2 16
      1 17
      2 18
      1 19
      2 2
      4 20
      1 22
      1 24
      1 26
      1 28
      1 3
      1 30
      1 32
      1 34
      1 36
      1 38
      2 4
      1 40
      1 5
      2 6
      1 7
      2 8
      2 9


* How would you get the same result without passing through the intermediary file 40lines.txt?

In [117]:
! sort 20lines.txt 20lines2.txt | uniq -c | sort -k 2,2

      1 0
      1 1
      3 10
      2 11
      3 12
      2 13
      3 14
      2 15
      2 16
      1 17
      2 18
      1 19
      2 2
      4 20
      1 22
      1 24
      1 26
      1 28
      1 3
      1 30
      1 32
      1 34
      1 36
      1 38
      2 4
      1 40
      1 5
      2 6
      1 7
      2 8
      2 9
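The final `sort -k 2,2` orders the values as text, which is why 10 appears before 2 in the output above. If the values should instead be ordered numerically, the `n` modifier can be attached to the key. A sketch with two small made-up files:

```shell
workdir=$(mktemp -d)
printf '%s\n' 2 10 2 > "$workdir/a.txt"
printf '%s\n' 10 3 > "$workdir/b.txt"

# Text order on the value column: 10 sorts before 2
by_text=$(LC_ALL=C sort "$workdir/a.txt" "$workdir/b.txt" | uniq -c | LC_ALL=C sort -k 2,2)
# Numeric order on the value column
by_value=$(LC_ALL=C sort "$workdir/a.txt" "$workdir/b.txt" | uniq -c | LC_ALL=C sort -k 2,2n)

echo "$by_text"
echo "$by_value"
rm -r "$workdir"
```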


* Go to ~/Data/opentraveldata. Get the line with the highest number of engines from optd_aircraft.csv by using sort.

In [118]:
! head -n 1 ~/Data/opentraveldata/optd_aircraft.csv

! echo ""
! sort -t "^" -k 7nr,7 ~/Data/opentraveldata/optd_aircraft.csv | head -1

iata_code^manufacturer^model^iata_group^iata_category^icao_code^nb_engines^aircraft_type

A5F^Antonov^An-225^A5F^6J^A225^6^J
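In the command above the header line takes part in the sort. It does no harm here because `-n` treats a non-numeric field such as `nb_engines` as 0, so the header ends up at the bottom of a descending sort. Skipping the header explicitly with `tail -n +2` does not rely on that. A sketch on a small made-up file with the same ^-separated layout:

```shell
workdir=$(mktemp -d)
cat > "$workdir/planes.csv" <<'EOF'
code^model^nb_engines
AAA^Alpha^2
BBB^Bravo^6
CCC^Charlie^4
EOF

# tail -n +2 starts at the second line, so the header never reaches sort
top_row=$(tail -n +2 "$workdir/planes.csv" | sort -t '^' -k 3,3nr | head -n 1)
echo "$top_row"

rm -r "$workdir"
```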


<div style="text-align:right">
Juan Luis García López (@huanlui)
<a href="https://github.com/huanlui" class="fa fa-github"> Github </a>
<a href="https://twitter.com/huanlui" class="fa fa-twitter"> Twitter </a>
<a href="https://www.linkedin.com/in/juan-luis-garcía-lópez-99057138" class="fa fa-linkedin"> Linkedin </a>
</div>