## From BNC to Ngram 

### BNC Data:  
https://drive.google.com/file/d/1mKX1DLHDIqKph4e4k1MnYOV3iWtvT7-E/view?usp=sharing

### 1. Extract lines containing id, title, classcode, keywords, and sentences from each BNC part

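grep (global regular expression print)
grep is a very common and frequently used command. Its main job is to compare string data against a pattern and print the strings that meet the user's criteria. Note, however, that grep filters data line by line: when it searches for a string, the unit of selection is the whole line.

egrep (grep with extended regular expressions, equivalent to `grep -E`)
In the command below, `-o` prints only the matching part of each line and `-h` omits the file name prefix.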

Reference
https://www.twblogs.net/a/5d26d705bd9eee1e5c84509d
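
To make the whole-line behavior concrete, here is a minimal Python sketch of what `grep -E` and `grep -E -o` print. The helper names `grep_lines` and `grep_only_matching` and the sample sentences are made up for illustration; this only models grep's output and is not how grep is implemented.

```python
import re

def grep_lines(pattern, lines):
    """Model of `grep -E pattern`: keep every line that contains a match."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def grep_only_matching(pattern, lines):
    """Model of `grep -E -o pattern`: keep only the matched parts, one per match."""
    regex = re.compile(pattern)
    return [m.group(0) for line in lines for m in regex.finditer(line)]

# Hypothetical sample input, not taken from the BNC.
sample = [
    'a fatal accident on the road',
    'nothing happened here',
    'a big accident and a serious accident',
]

whole_lines = grep_lines(r'(big|serious|fatal) accident', sample)
matches = grep_only_matching(r'(big|serious|fatal) accident', sample)

print(whole_lines)  # the lines that contain a match, printed in full
print(matches)      # just the matched substrings
```

Step 1 relies on the second behavior: with `-o`, every matched XML element ends up on its own output line.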

In [1]:
! time for name in {A,B,C,D,E,F,G,H,J,K}; do egrep -o -h \
'(<idno type="bnc">.*?</idno>|<title>.*?</title>|<classCode.*?</classCode>|<keywords>.*?</keywords>|<s n=".*?">|<w c5=".*?" hw=".*?" pos=".*?">.*?</w>|<c c5=".*?">.*?</c>|</s>|<p>|</p>)' \
BNC/Texts/*/*/$name*.xml > BNC.$name.txt; done
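
Whether `.*?` is treated as a non-greedy match depends on the grep implementation, since POSIX extended regular expressions leave it undefined. As a portable cross-check, the same extraction can be sketched with Python's `re` module, where `.*?` is non-greedy. The function name `extract_elements` and the sample line are assumptions for illustration; the pattern is the one from the command above.

```python
import re

# Same alternation as the egrep command in Step 1, split over several lines.
BNC_PATTERN = re.compile(
    r'<idno type="bnc">.*?</idno>|<title>.*?</title>|<classCode.*?</classCode>'
    r'|<keywords>.*?</keywords>|<s n=".*?">'
    r'|<w c5=".*?" hw=".*?" pos=".*?">.*?</w>|<c c5=".*?">.*?</c>'
    r'|</s>|<p>|</p>'
)

def extract_elements(text):
    """Return every matched element in order, like `egrep -o` would print them."""
    return BNC_PATTERN.findall(text)

# Hypothetical line in the style of the BNC XML; not copied from the corpus.
sample_line = (
    '<p><s n="1"><w c5="AJ0" hw="fatal" pos="ADJ">Fatal </w>'
    '<w c5="NN1" hw="accident" pos="SUBST">accident</w>'
    '<c c5="PUN">.</c></s></p>'
)

elements = extract_elements(sample_line)
for element in elements:
    print(element)
```

Each printed line corresponds to one line of `BNC.A.txt` and its siblings, which is the format `line_to_token` expects in Step 2.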


#### Repeat Step 1 for all sections A, B, C, D, E, F, G, H, J, and K

### 2. Convert sentences to bigrams (for all sections A to K; there is no section I)
### 2.1 Convert lines to word tokens

In [2]:
import re
from pprint import pprint

def line_to_token(line):
    if line.startswith('<s'):
        return ('<s>', '<s>', '<s>')
    elif line.startswith('</s'):
        return ('</s>', '</s>', '</s>') 
    elif line.startswith('<w'):
        # <w c5="VVN" hw="discount" pos="VERB">discounted </w>
        match = re.findall('<w c5="(.*?)" hw="(.*?)" pos=".*?">(.*?)</w>', line)
        return (match[0][2].strip(), match[0][0].upper(), match[0][1]) # word, tag, lemma
    elif line.startswith('<c'):
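        # Only <c c5="PUN"> is parsed; other punctuation classes (PUQ, PUL, PUR) get a placeholder.
        match = re.findall('<c c5="PUN">(.*?)</c>', line)
        if not match:
            return ('?', '?', '?')  # placeholder (word, tag, lemma) for unparsed punctuation
        return (match[0], match[0], match[0])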

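def tokens_to_bigram(tokens):
    """Return one 'word1 word2<TAB>tag1 tag2<TAB>lemma1 lemma2' string per bigram."""
    result = []
    for i in range(len(tokens)-1):
        if i == 1:
            # tokens[1] is the sentence-initial word (tokens[0] is <s>): lowercase it.
            # Note that all three fields are lowercased, the tag included.
            word2tag2lemma2 = [tokens[i][j].lower()+' '+tokens[i+1][j] for j in range(3)]
        else:
            word2tag2lemma2 = [tokens[i][j]+' '+tokens[i+1][j] for j in range(3)]
        # Keep only bigrams whose first word starts with a letter or with '<' (a sentence marker).
        if word2tag2lemma2[0][0].isalpha() or word2tag2lemma2[0][0] == '<':
            result.append('\t'.join(word2tag2lemma2))
    return result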
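
As a worked example of the bigram format, the sketch below runs a condensed copy of the `tokens_to_bigram` logic on one sentence. The token tuples are hand-written in the `(word, tag, lemma)` shape that `line_to_token` returns, the sentence itself is invented, and the `<s>` token is assumed to carry no trailing space.

```python
def tokens_to_bigram(tokens):
    """Condensed copy of the step 2.1 logic, kept here so the example runs on its own."""
    result = []
    for i in range(len(tokens) - 1):
        left = tokens[i]
        if i == 1:
            # Sentence-initial word: every field is lowercased, the tag included.
            left = tuple(field.lower() for field in left)
        fields = [left[j] + ' ' + tokens[i + 1][j] for j in range(3)]
        # Keep bigrams whose first word starts with a letter or with '<'.
        if fields[0][0].isalpha() or fields[0][0] == '<':
            result.append('\t'.join(fields))
    return result

# Hypothetical sentence, already converted to (word, tag, lemma) tuples.
tokens = [
    ('<s>', '<s>', '<s>'),
    ('Fatal', 'AJ0', 'fatal'),
    ('accidents', 'NN2', 'accident'),
    ('happen', 'VVB', 'happen'),
    ('.', '.', '.'),
    ('</s>', '</s>', '</s>'),
]

bigrams = tokens_to_bigram(tokens)
for bigram in bigrams:
    print(bigram)
```

The bigram that starts with the full stop is dropped by the filter, and the second bigram carries the lowercased tag `aj0`, which is worth keeping in mind when reading the counts in Step 3.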

### 2.2 Convert token stream to bigram stream

In [3]:
def word_to_bigram(wordfile, bigramfile):

    def Batch_to_ngram(batch, fileout):
        # Read one section file (e.g. BNC.A.txt) and write its bigrams to fileout.
        with open(wordfile.format(batch)) as filein:
            lines = filein.readlines()
            for i, line in enumerate(lines):
                if line.startswith('<s'):
                    sent_start = i  # index of the opening <s n="..."> line
                elif line.startswith('</s'):
                    # Slice out the whole sentence, from <s ...> through </s>.
                    sentence = [raw.strip() for raw in lines[sent_start:i+1]]
                    tokens = [line_to_token(item) for item in sentence]
                    #pprint (tokens)
                    bigram = tokens_to_bigram(tokens)
                    print('\n'.join(bigram), file=fileout)

    with open(bigramfile, 'w') as fileout:
        for batch in 'ABCDEFGHJK':  # sections A to K; there is no section I
            Batch_to_ngram(batch, fileout)

word_to_bigram('BNC.{0}.txt', 'BNC.2w.txt')
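
The sentence-grouping step can also be written as a generator that consumes lines one at a time, so a section file does not have to be loaded with `readlines()` first. This is only a sketch of an alternative, not the notebook's method; the name `iter_sentences` and the sample lines are hypothetical.

```python
def iter_sentences(lines):
    """Yield each sentence as a list of stripped lines, from '<s ...>' through '</s>'."""
    sentence = None  # None means we are currently outside a sentence
    for line in lines:
        line = line.strip()
        if line.startswith('</s'):
            if sentence is not None:
                sentence.append(line)
                yield sentence
            sentence = None
        elif line.startswith('<s'):
            sentence = [line]
        elif sentence is not None:
            sentence.append(line)
        # Lines outside a sentence (<title>, <p>, </p>, ...) are skipped.

# Hypothetical extract in the format produced by Step 1.
sample = [
    '<title>Sample</title>\n',
    '<p>\n',
    '<s n="1">\n',
    '<w c5="AT0" hw="the" pos="ART">The </w>\n',
    '<w c5="NN1" hw="end" pos="SUBST">end</w>\n',
    '</s>\n',
    '<s n="2">\n',
    '<w c5="VVN" hw="do" pos="VERB">Done</w>\n',
    '</s>\n',
    '</p>\n',
]

sentences = list(iter_sentences(sample))
for sentence in sentences:
    print(sentence)
```

Because an open file object is itself an iterator over lines, `iter_sentences(filein)` would work in place of the `sample` list.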

### 3. Sort and count bigrams (word1 word2 \<tab\> tag1 tag2 \<tab\> lemma1 lemma2 \<tab\> count)

In [4]:
#1 BNC.2w.txt ==> BNC.2w.c.txt
! time sort BNC.2w.txt | uniq -c | \
awk '{ gsub(/^[ ]*/, ""); print }' | awk '{print substr($0, index($0, " ")+1) "\t" $1}' > BNC.2w.c.txt

sort BNC.2w.txt  738.99s user 154.28s system 90% cpu 16:30.41 total
uniq -c  40.17s user 1.52s system 4% cpu 16:30.41 total
awk '{ gsub(/^[ ]*/, ""); print }'  69.71s user 0.72s system 7% cpu 16:30.41 total
awk '{print substr($0, index($0, " ")+1) "\t" $1}' > BNC.2w.c.txt  55.50s user 2.35s system 5% cpu 16:30.41 total
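
For small inputs, the `sort | uniq -c | awk` pipeline can be modeled in Python with `collections.Counter`. The sketch below is an assumption-laden illustration: the function name `count_lines` and the sample rows are made up, the counter keeps every distinct line in memory, and Python's sort order may differ from `sort` depending on the locale.

```python
from collections import Counter

def count_lines(lines):
    """Count identical lines and return 'line<TAB>count' strings in sorted order."""
    counts = Counter(line.rstrip('\n') for line in lines)
    return [f'{line}\t{n}' for line, n in sorted(counts.items())]

# Hypothetical bigram lines in the format written to BNC.2w.txt.
sample = [
    'fatal accident\tAJ0 NN1\tfatal accident\n',
    'big accident\tAJ0 NN1\tbig accident\n',
    'fatal accident\tAJ0 NN1\tfatal accident\n',
]

counted = count_lines(sample)
for row in counted:
    print(row)
```

For the full corpus the shell pipeline may remain the better choice, since `sort` can spill to disk while the counter cannot.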


In [5]:
! egrep '^(big|serious|fatal) accident\t' BNC.2w.c.txt

big accident	AJ0 NN1	big accident	3
fatal accident	AJ0 NN1	fatal accident	82
fatal accident	aj0 NN1	fatal accident	3
serious accident	AJ0 NN1	serious accident	61
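
The `aj0 NN1` row arises because `tokens_to_bigram` lowercases every field of a sentence-initial word, the tag included, so the same bigram is counted under two tag spellings. If combined totals are wanted, one possible approach is to fold the rows together after counting. The sketch below uses the four rows shown above; the function name `merge_tag_case` is hypothetical.

```python
from collections import defaultdict

# The four rows printed by the egrep query above.
rows = [
    'big accident\tAJ0 NN1\tbig accident\t3',
    'fatal accident\tAJ0 NN1\tfatal accident\t82',
    'fatal accident\taj0 NN1\tfatal accident\t3',
    'serious accident\tAJ0 NN1\tserious accident\t61',
]

def merge_tag_case(rows):
    """Sum counts of rows that differ only in the letter case of the tag field."""
    totals = defaultdict(int)
    for row in rows:
        words, tags, lemmas, count = row.split('\t')
        totals[(words, tags.upper(), lemmas)] += int(count)
    return dict(totals)

merged = merge_tag_case(rows)
for (words, tags, lemmas), count in sorted(merged.items()):
    print(words, tags, lemmas, count, sep='\t')
```

With these rows, the two `fatal accident` entries collapse into a single one with a count of 85.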


Target output:  
https://drive.google.com/file/d/1xM46aaDIeu4Z0FkikGOcmDoq7u2O47tY/view?usp=sharing