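# Chapter 2: UNIX Command Basics

[popular-names.txt](popular-names.txt) is a tab-separated file that stores the "name", "sex", "count", and "year" of babies born in the United States. Write programs that perform the following tasks and run them with [popular-names.txt](popular-names.txt) as the input file. Then perform the same tasks with UNIX commands and check them against the output of your programs.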

In [1]:
data_path = 'data/popular-names.txt'

In [2]:
!cp data/popular-names.txt data/tmp_popular-names.txt

In [3]:
!ls data/

popular-names.txt  tmp_popular-names.txt


In [51]:
!head data/popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880
Ida	F	1472	1880
Alice	F	1414	1880
Bertha	F	1320	1880
Sarah	F	1288	1880


## 10. Counting lines
Count the number of lines. Use the wc command to verify.

In [5]:
with open(data_path, 'r') as f:
    print(len(f.readlines()))

2740


- with statement https://www.sejuku.net/blog/24672
- File operations http://programming-study.com/technology/python-file/
- Reading text files http://www.yukun.info/blog/2008/06/python-file.html

In [6]:
!wc -l < data/popular-names.txt

2740
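`f.readlines()` builds a list of every line only to take its length. A minimal sketch of counting while streaming is shown here; `count_lines` and the three-line sample file are names made up for illustration and are not part of the original notebook.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file instead of loading it all."""
    with open(path, 'r') as f:
        return sum(1 for _ in f)


# Made-up three-line sample standing in for popular-names.txt.
sample = 'Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n'
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w') as f:
        f.write(sample)
    n_lines = count_lines(sample_path)

print(n_lines)
```

Note that `wc -l` counts newline characters, so for a file whose last line has no trailing newline it reports one less than this loop does.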


## 11. Replacing tabs with spaces
Replace each tab character with one space. Use the sed, tr, or expand command to verify.

In [7]:
with open(data_path, 'r') as f:
    with open('work/q11.txt', 'w') as fw:
        for line in f:
            fw.write(line.replace('\t', ' '))

In [8]:
!head -5 work/q11.txt

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880


In [9]:
!head -5 data/popular-names.txt | sed 's/\t/ /g'

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880


In [10]:
!head -5 data/popular-names.txt | tr '\t' ' '

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880


In [11]:
!head -5 data/popular-names.txt | expand -t 1

Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
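`expand -t 1` matches the other two commands only because the tab stop is set to 1. The sketch below uses Python's `str.expandtabs` to show the same effect on a single line taken from the head of the file; the variable names are chosen for illustration.

```python
line = 'Mary\tF\t7065\t1880\n'

# One space per tab, as in the replace-based solution above.
replaced = line.replace('\t', ' ')

# Tab stops every 1 column give the same result as plain replacement.
expanded_1 = line.expandtabs(1)

# Tab stops every 8 columns pad each field up to the next multiple of 8.
expanded_8 = line.expandtabs(8)

print(repr(replaced))
print(repr(expanded_1))
print(repr(expanded_8))
```

With a tab stop of 8 the fields are padded with several spaces, so the result no longer satisfies "one space per tab".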


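## 12. Saving column 1 to col1.txt and column 2 to col2.txt
Extract only the first column of each line and save it to col1.txt, and extract only the second column and save it to col2.txt. Use the cut command to verify.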

In [12]:
with open(data_path, 'r') as f:
    with open('work/col1.txt', 'w') as f1:
        with open('work/col2.txt', 'w') as f2:
            for line in f:
                name, sex, *_ = line.split()
                f1.write(name + '\n')
                f2.write(sex + '\n')

In [13]:
!head -5 work/col1.txt

Mary
Anna
Emma
Elizabeth
Minnie


In [14]:
!head -5 work/col2.txt

F
F
F
F
F


In [15]:
!head -5 data/popular-names.txt | cut -f1

Mary
Anna
Emma
Elizabeth
Minnie


In [16]:
!head -5 data/popular-names.txt | cut -f2

F
F
F
F
F
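`cut -f` uses the tab as its default delimiter, whereas `line.split()` with no argument splits on any whitespace. The two agree as long as no field contains a space. A hedged sketch that splits on tabs only follows; `split_columns` and the second sample row (a name containing a space) are made up for illustration.

```python
import io


def split_columns(lines, out1, out2):
    """Write column 1 of each tab-separated line to out1 and column 2 to out2."""
    for line in lines:
        # Split on tabs only, so a field containing a space stays intact.
        name, sex, *_ = line.rstrip('\n').split('\t')
        out1.write(name + '\n')
        out2.write(sex + '\n')


# Made-up sample; the second name contains a space on purpose.
sample = ['Mary\tF\t7065\t1880\n', 'Mary Ann\tF\t10\t1880\n']
col1, col2 = io.StringIO(), io.StringIO()
split_columns(sample, col1, col2)

print(repr(col1.getvalue()))
print(repr(col2.getvalue()))
```

The function takes any writable objects, so the same code works with the files opened in the cell above.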


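## 13. Merging col1.txt and col2.txt
Combine the col1.txt and col2.txt created in problem 12 to produce a text file in which the first and second columns of the original file are placed side by side, separated by a tab. Use the paste command to verify.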

In [17]:
with open('work/col1.txt', 'r') as f1:
    with open('work/col2.txt', 'r') as f2:
        with open('work/q13.txt', 'w') as fw:
            for name, sex in zip(f1, f2):
                fw.write('{}\t{}\n'.format(name.rstrip(), sex.rstrip()))

In [18]:
!head -5 work/q13.txt

Mary	F
Anna	F
Emma	F
Elizabeth	F
Minnie	F
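One way to check the merged result without leaving Python is to rebuild columns 1 and 2 from the original lines and compare. This is only a sketch: the `paste` helper is a made-up name that mimics the command for two inputs of equal length, and the sample rows are the first two lines of the file.

```python
def paste(lines_a, lines_b, sep='\t'):
    """Join corresponding lines with sep, one output line per input pair."""
    return [a.rstrip('\n') + sep + b.rstrip('\n') + '\n'
            for a, b in zip(lines_a, lines_b)]


# Small sample standing in for the original file and the two column files.
original = ['Mary\tF\t7065\t1880\n', 'Anna\tF\t2604\t1880\n']
col1 = ['Mary\n', 'Anna\n']
col2 = ['F\n', 'F\n']

merged = paste(col1, col2)

# Columns 1 and 2 of the original, rebuilt for comparison.
expected = ['\t'.join(line.rstrip('\n').split('\t')[:2]) + '\n'
            for line in original]

print(merged == expected)
```

`zip` stops at the shorter input, while the paste command keeps going and leaves the missing field empty, so the two differ when the files have different lengths.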


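## 14. Printing the first N lines
Receive a natural number N by some means such as a command-line argument, and print only the first N lines of the input. Use the head command to verify.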

### input

In [19]:
N = int(input())

with open(data_path, 'r') as f:
    for i, line in enumerate(f):
        if i == N:
            break
        print(line, end='')

5
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880


### sys.argv

In [20]:
%%file src/q014_sys.py

import sys
data_path = 'data/popular-names.txt'

N = int(sys.argv[1])

with open(data_path, 'r') as f:
    for i, line in enumerate(f):
        if i == N:
            break
        print(line, end='')

Overwriting src/q014_sys.py


In [21]:
!python src/q014_sys.py 3

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880


### argparse
https://www.sejuku.net/blog/23647

### if __name__ == '__main__'
https://docs.python.org/ja/3/library/__main__.html

In [22]:
%%file src/q014_argparse.py

import argparse

def main(args):
    with open(args.data_path, 'r') as f:
        for i, line in enumerate(f):
            if i == args.lines:
                break
            print(line, end='')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        prog='q014_argparse',
        usage='Print the first N lines of a file.',
        description='description',
        epilog='end',
        add_help=True,
    )

    parser.add_argument('data_path')
    parser.add_argument('-N', '--lines', type=int, default=5)

    args = parser.parse_args()

    main(args)

Overwriting src/q014_argparse.py


In [23]:
!python src/q014_argparse.py -h

usage: Print the first N lines of a file.

description

positional arguments:
  data_path

optional arguments:
  -h, --help            show this help message and exit
  -N LINES, --lines LINES

end


In [24]:
!python src/q014_argparse.py data/popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880


In [25]:
!python src/q014_argparse.py data/popular-names.txt --lines=10

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880
Ida	F	1472	1880
Alice	F	1414	1880
Bertha	F	1320	1880
Sarah	F	1288	1880


In [26]:
!head -5 data/popular-names.txt

Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
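The `enumerate` and `break` pattern used above can also be written with `itertools.islice`, which stops reading after N items. A minimal sketch, with `head` as a made-up helper name and an in-memory sample in place of the data file:

```python
import io
from itertools import islice


def head(lines, n):
    """Return the first n lines of an iterable of lines."""
    return list(islice(lines, n))


# In-memory sample standing in for an open file.
sample = io.StringIO('Mary\nAnna\nEmma\nElizabeth\n')
first_two = head(sample, 2)

print(''.join(first_two), end='')
```

If the input has fewer than N lines, `islice` simply returns what is there, which matches the behavior of the loops above.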


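## 15. Printing the last N lines
Receive a natural number N by some means such as a command-line argument, and print only the last N lines of the input. Use the tail command to verify.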

In [27]:
# Brute force: this opens the file twice
N = int(input())

with open(data_path, 'r') as f:
    file_len = len(f.readlines())
    
with open(data_path, 'r') as f:
    for i, line in enumerate(f):
        if i >= file_len-N:
            print(line, end='')

5
Benjamin	M	14569	2016
Jacob	M	14416	2016
Michael	M	13998	2016
Elijah	M	13764	2016
Ethan	M	13758	2016


### collections.deque

In [28]:
import collections

N = int(input())
with open(data_path, 'r') as f:
    print(''.join(collections.deque(f, maxlen=N)))

5
Benjamin	M	14569	2016
Jacob	M	14416	2016
Michael	M	13998	2016
Elijah	M	13764	2016
Ethan	M	13758	2016



In [29]:
!tail -5 data/popular-names.txt

Benjamin	M	14569	2016
Jacob	M	14416	2016
Michael	M	13998	2016
Elijah	M	13764	2016
Ethan	M	13758	2016


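## 16. Splitting a file into N parts
Receive a natural number N by some means such as a command-line argument, and split the input file into N parts at line boundaries. Achieve the same result with the split command.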

In [38]:
# Scrapped attempt (kept for reference, not executed)
"""
N = int(input())

with open(data_path, 'r') as f:
    file_len = len(f.readlines())
    
quot = int(file_len/N)
rem = file_len%N
    
with open(data_path, 'r') as f:
    i = 0
    for line in f:
        with open('work/q16_{}.txt'.format(i), 'w') as fw:
            if rem > 0:
                for j in range(quot+1):
                    fw.write(line.rstrip())
                rem -= 1
            else:
                for j in range(quot):
                    fw.write(line.rstrip())
            i += 1
"""

In [39]:
def divide(L, N):
    return ((L+i)//N for i in reversed(range(N)))

In [34]:
!ls ./work

col1.txt  col2.txt  q11.txt  q13.txt
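The notebook defines `divide` but stops before using it. One possible way to finish the task is sketched below: `divide(L, N)` yields N chunk sizes that sum to L, and `islice` takes that many lines at a time. `split_lines` and the ten-line sample are made up for illustration, and writing each chunk to a file is left out.

```python
from itertools import islice


def divide(L, N):
    """Yield N chunk sizes that sum to L, larger chunks first."""
    return ((L + i) // N for i in reversed(range(N)))


def split_lines(lines, N):
    """Split a list of lines into N consecutive chunks of near-equal size."""
    it = iter(lines)
    return [list(islice(it, size)) for size in divide(len(lines), N)]


# Made-up ten-line sample standing in for the data file.
sample = ['line{}\n'.format(i) for i in range(10)]
chunks = split_lines(sample, 3)

print([len(chunk) for chunk in chunks])
```

For 10 lines and N = 3 the sizes are 4, 3, 3. On GNU coreutils, `split -n l/N` looks like the closest counterpart, but it balances parts by byte size rather than by line count, so the boundaries may not match.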


## 17. Distinct strings in the first column
Find the set of distinct strings in the first column. Use the cut, sort, and uniq commands to verify.

In [42]:
uniq_names = sorted({line.rstrip() for line in open('./work/col1.txt')})

In [44]:
uniq_names[:10]

['Abigail',
 'Aiden',
 'Alexander',
 'Alexis',
 'Alice',
 'Amanda',
 'Amy',
 'Andrew',
 'Angela',
 'Anna']

In [45]:
len(uniq_names)

132

In [46]:
!cat work/col1.txt | sort | uniq | head

Abigail
Aiden
Alexander
Alexis
Alice
Amanda
Amy
Andrew
Angela
Anna


In [49]:
!cat work/col1.txt | sort | uniq | wc -l

132


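## 18. Sorting lines in descending order of the numeric value in column 3
Sort the lines in descending order of the numeric value in the third column (note: reorder the lines without changing their contents). Use the sort command to verify (for this problem, the result need not match the output of the command).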

In [52]:
sorted_by_quantity = sorted(
    (line.rstrip() for line in open('./data/popular-names.txt')),
    key=lambda line: int(line.split()[2]),
    reverse=True,
)

In [53]:
sorted_by_quantity[:10]

['Linda\tF\t99685\t1947',
 'Linda\tF\t96210\t1948',
 'James\tM\t94762\t1947',
 'Michael\tM\t92716\t1957',
 'Robert\tM\t91641\t1947',
 'Linda\tF\t91013\t1949',
 'Michael\tM\t90620\t1956',
 'Michael\tM\t90512\t1958',
 'James\tM\t88584\t1948',
 'Michael\tM\t88525\t1954']

### Stable sort

In [54]:
!sort -k3 -n -r -s data/popular-names.txt | head

Linda	F	99685	1947
Linda	F	96210	1948
James	M	94762	1947
Michael	M	92716	1957
Robert	M	91641	1947
Linda	F	91013	1949
Michael	M	90620	1956
Michael	M	90512	1958
James	M	88584	1948
Michael	M	88525	1954
sort: write failed: 'standard output': Broken pipe
sort: write error
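Python's `sorted` is stable, and it stays stable with `reverse=True`: rows with equal keys keep their original relative order. The sketch below shows this on four made-up rows, two of which tie on the third column.

```python
# Made-up rows; 'A' and 'C' share the same count in column 3.
rows = [
    'A\tF\t10\t1880',
    'B\tM\t30\t1880',
    'C\tF\t10\t1881',
    'D\tM\t20\t1881',
]

# Descending by column 3; ties keep their input order.
by_count_desc = sorted(rows,
                       key=lambda row: int(row.split('\t')[2]),
                       reverse=True)

for row in by_count_desc:
    print(row)
```

'A' still comes before 'C' in the result. This is the tie behavior that the `-s` option requests from the sort command, which otherwise falls back to comparing whole lines.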


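## 19. Finding the frequency of each string in column 1 and sorting by descending frequency
Find the frequency of each string in the first column of each line, and print them in descending order of frequency. Use the cut, uniq, and sort commands to verify.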

In [59]:
from collections import Counter

c = Counter(line.rstrip() for line in open('./work/col1.txt'))
most_common_names = c.most_common()

In [61]:
most_common_names[:10]

[('James', 116),
 ('William', 109),
 ('John', 108),
 ('Robert', 108),
 ('Mary', 92),
 ('Charles', 75),
 ('Michael', 74),
 ('Elizabeth', 73),
 ('Joseph', 71),
 ('Margaret', 60)]

In [62]:
len(most_common_names)

132

In [64]:
!cat work/col1.txt | sort | uniq -c | sort -k1 -n -r -s | head

    116 James
    109 William
    108 John
    108 Robert
     92 Mary
     75 Charles
     74 Michael
     73 Elizabeth
     71 Joseph
     60 Margaret
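The two approaches can order ties differently. `Counter.most_common` lists equal counts in the order first encountered, while the pipeline sorts names first and then sorts stably by count, so ties come out alphabetically. The sketch below mirrors the pipeline with `itertools.groupby` on a made-up list of names to make the difference visible.

```python
from collections import Counter
from itertools import groupby

# Made-up sample; 'Mary' and 'John' tie with two occurrences each.
names = ['Mary', 'John', 'Mary', 'Anna', 'John', 'Zoe']

# `sort | uniq -c`: sort so equal names are adjacent, then count each run.
counts = [(name, len(list(group))) for name, group in groupby(sorted(names))]

# Stable sort by count, descending; ties keep alphabetical order.
ranked_like_shell = sorted(counts, key=lambda pair: pair[1], reverse=True)

# Counter keeps ties in the order first encountered.
ranked_by_counter = Counter(names).most_common()

print(ranked_like_shell)
print(ranked_by_counter)
```

Both rankings contain the same counts; only the order of 'John' and 'Mary' differs here.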
