# 3   Processing Raw Text

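**Learning objectives**
1. How do we access text from the web or from local files?
2. How do we split a document into individual words and punctuation symbols?
3. How do we save the processed results?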

## 3.1   Accessing Text from the Web and from Disk

### Electronic Books

Project Gutenberg 
- 25,000 free online books 
- http://www.gutenberg.org/catalog/
    

In [3]:
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
type(raw)

str

In [4]:
raw[:75]

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

**Tokenization:** splitting a string into words and punctuation marks.

In [8]:
import nltk
tokens = nltk.word_tokenize(raw)
tokens[:10]

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']
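
As a rough illustration of what tokenization produces, the sketch below splits a string into word and punctuation tokens using only the standard `re` module. The pattern and the `simple_tokenize` helper are assumptions made for illustration; `nltk.word_tokenize` uses a more elaborate procedure and treats cases such as contractions differently.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Hypothetical helper for illustration; not the algorithm used by NLTK.
    """
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Crime and Punishment, by Fyodor Dostoevsky"
demo_tokens = simple_tokenize(sample)
print(demo_tokens)
```

Note that the comma becomes a token of its own, just as in the NLTK output above.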

In [17]:
# Wrap the tokens in an nltk.Text object so that NLTK's text-processing methods can be used
text = nltk.Text(tokens) 
type(text)

nltk.text.Text

In [10]:
text[1024:1062]

['CHAPTER',
 'I',
 'On',
 'an',
 'exceptionally',
 'hot',
 'evening',
 'early',
 'in',
 'July',
 'a',
 'young',
 'man',
 'came',
 'out',
 'of',
 'the',
 'garret',
 'in',
 'which',
 'he',
 'lodged',
 'in',
 'S.',
 'Place',
 'and',
 'walked',
 'slowly',
 ',',
 'as',
 'though',
 'in',
 'hesitation',
 ',',
 'towards',
 'K.',
 'bridge',
 '.']

In [12]:
text.collocations() # find the collocations (word pairs that frequently occur together) in the document

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market


In [16]:
raw.find("PART I")  # index of the first occurrence, searching from the start of the document: 5338
raw.rfind("End of Project Gutenberg's Crime")  # index of the last occurrence, searching from the end: 1157746
raw = raw[5338:1157746]  # keep only the body of the book
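
The same trimming can be done without hard-coding the two indexes. The sketch below is a minimal version on a made-up miniature document; `raw_demo` and its marker strings are hypothetical and only stand in for the Gutenberg header and footer.

```python
# Hypothetical miniature document with a header, a body and a footer.
raw_demo = (
    "Some header text\n"
    "PART I\n"
    "The body of the book.\n"
    "End of the demo book\n"
    "Licence text\n"
)

start = raw_demo.find("PART I")               # first occurrence, or -1 if absent
end = raw_demo.rfind("End of the demo book")  # last occurrence, or -1 if absent

# Guard against a missing marker: find() and rfind() return -1 rather than raising.
if start != -1 and end != -1:
    body = raw_demo[start:end]
else:
    body = raw_demo

print(body)
```

Checking for -1 matters because a slice such as `raw_demo[-1:-1]` would silently return an empty string.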

### Dealing with HTML

In [18]:
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [22]:
# Extract the text from the HTML with BeautifulSoup
from bs4 import BeautifulSoup
raw = BeautifulSoup(html, "lxml").get_text()
tokens = nltk.word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '12:51'

In [23]:
text = nltk.Text(tokens)
text.concordance('gene') # show every occurrence of the word together with its surrounding context

Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 


### Reading Local Files

In [24]:
f = open('document.txt') # fails here because the file does not exist yet

FileNotFoundError: [Errno 2] No such file or directory: 'document.txt'
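
One way to cope with a missing file is to catch the exception explicitly. The following is a minimal sketch; `read_text_or_none` and the file name `no_such_document.txt` are hypothetical names chosen for illustration.

```python
def read_text_or_none(filename, encoding="utf8"):
    """Return the file's contents, or None if the file does not exist."""
    try:
        # The with statement closes the file even if reading fails.
        with open(filename, encoding=encoding) as f:
            return f.read()
    except FileNotFoundError:
        return None

# Hypothetical file name, chosen so that the lookup fails.
missing = read_text_or_none("no_such_document.txt")
print(missing)
```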

## The NLP Pipeline

1. Access a document on the web or in a file.
2. Tokenize it and wrap the tokens in an nltk.Text object.
3. Build the vocabulary.
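
The three steps can be sketched end to end on a small in-memory string. This is only an outline under stated assumptions: `raw_demo` is a hypothetical input, and a simple regular expression stands in for `nltk.word_tokenize` so that the example needs no downloads.

```python
import re

# Step 1: hypothetical input standing in for text fetched from the web or read from disk.
raw_demo = "The cat sat on the mat. The dog sat too."

# Step 2: tokenize (a simple regular expression stands in for nltk.word_tokenize).
demo_tokens = re.findall(r"\w+|[^\w\s]", raw_demo)

# Step 3: lowercase the alphabetic tokens and build the sorted vocabulary.
words = [t.lower() for t in demo_tokens if t.isalpha()]
vocab = sorted(set(words))

print(vocab)
```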

<img src = "images/pipeline.png">

## 3.2   Strings: Text Processing at the Lowest Level

### Basic Operations with Strings

In [3]:
# To include an apostrophe ('), enclose the string in double quotes (") or escape it with a backslash (\)
circus = "Monty Python's Flying Circus"
circus = 'Monty Python\'s Flying Circus'

In [8]:
# To write a long string across two lines, use a backslash or parentheses. The result is still a single line.
couplet = "Rough winds do shake the darling buds of May,"\
"And Summer's lease hath all too short a date:"
couplet = ("Rough winds do shake the darling buds of May,"
"And Summer's lease hath all too short a date:")
print(couplet)

Rough winds do shake the darling buds of May,And Summer's lease hath all too short a date:


In [9]:
# To keep the line breaks exactly as typed, use triple quotes (""" or ''').
couplet = """Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:"""
couplet = '''Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:'''
print(couplet)

Shall I compare thee to a Summer's day?
Thou are more lovely and more temperate:


In [10]:
# Strings support + (concatenation) and * (repetition); - and / are not supported
'very' + 'very' + 'very'
'very' * 3

'veryveryvery'

### Accessing Substrings

String indexing; the details are omitted here (see the figure).

<img src = "images/index.png">
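
Since the details are omitted, here is a brief sketch of indexing and slicing, using the `'Monty Python'` string that also appears in the summary at the end of these notes.

```python
monty = 'Monty Python'

first = monty[0]       # first character (indexes count from zero)
last = monty[-1]       # last character (negative indexes count from the end)
middle = monty[6:10]   # characters at positions 6, 7, 8 and 9
head = monty[:5]       # from the start up to, but not including, position 5
tail = monty[6:]       # from position 6 to the end

print(first, last, middle, head, tail, len(monty))
```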

### More operations on strings

<img src = "images/operations.png">
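
A few commonly used string methods, as a quick sketch. The selection is an assumption about what is most useful here and may not match the figure exactly.

```python
s = '  Monty Python  '

stripped = s.strip()                                # remove leading and trailing whitespace
lowered = stripped.lower()                          # lowercased copy
position = stripped.find('Python')                  # index of first occurrence, -1 if not found
replaced = stripped.replace('Monty', 'Full Monty')  # replace every occurrence
parts = stripped.split()                            # split on whitespace into a list
joined = '-'.join(parts)                            # join a list back into a string

print(lowered, position, replaced, parts, joined)
```

All of these return new values; strings are immutable, so `s` itself is never changed.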

## 3.3   Text Processing with Unicode

### What is Unicode?

**Unicode**: a character standard designed to represent the characters of all the world's writing systems
- translation into Unicode is called **decoding** 
- translation out of Unicode is called **encoding**
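
In Python 3 terms, decoding turns `bytes` into `str` and encoding turns `str` into `bytes`. The sketch below round-trips a Polish word through two encodings; the variable names are illustrative.

```python
text = 'Państwowa'

# Encoding: Unicode string -> bytes in a particular encoding.
as_latin2 = text.encode('latin2')
as_utf8 = text.encode('utf8')

# The same text can occupy a different number of bytes in each encoding.
print(len(text), len(as_latin2), len(as_utf8))

# Decoding: bytes -> Unicode string. The codec must match the one used to encode.
print(as_latin2.decode('latin2') == text)
print(as_utf8.decode('utf8') == text)
```

Decoding bytes with the wrong codec either raises `UnicodeDecodeError` or yields garbled characters, which is why the encoding of a file has to be known or guessed before reading it.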

<img src = "images/unicode.png">

### Extracting encoded text from files

In [18]:
# Polish text encoded in 'latin2' (ISO-8859-2)
import nltk
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')
f = open(path, encoding='latin2')
for line in f:
   line = line.strip()
   print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [24]:
# View the result of encoding with 'unicode_escape' (non-ASCII characters are shown as \xNN or \uNNNN escape sequences)
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [25]:
# A Unicode character can be written as \u followed by 4 hexadecimal digits (code points above U+FFFF need the 8-digit \U form)
nacute = '\u0144'
nacute

'ń'

In [26]:
# We can also see how this character is represented as a sequence of bytes inside a text file:
nacute.encode('utf8')

b'\xc5\x84'
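
To inspect a character further, the standard library offers `ord()` and the `unicodedata` module. The sketch below also shows a code point above U+FFFF, which needs the 8-digit `\U` escape; the Gothic letter is an arbitrary example chosen for illustration.

```python
import unicodedata

nacute = '\u0144'

print(ord(nacute))               # integer code point
print(hex(ord(nacute)))          # the same value in hexadecimal
print(unicodedata.name(nacute))  # official Unicode character name

# Code points above U+FFFF need the 8-digit \U escape instead of \u.
gothic = '\U00010330'
print(hex(ord(gothic)))
print(len(gothic.encode('utf8')))  # number of bytes in UTF-8
```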

### Using your local encoding in Python

To use a specific source-file encoding in Python, put **'# -*- coding: <coding> -*-'** on the first line of the file (or on the second line, after a shebang line).

## 3.4   Regular Expressions for Detecting Word Patterns



### Using Basic Meta-Characters


Basic Regular Expression Meta-Characters, Including Wildcards, Ranges and Closures

<img src = "images/rebasic.png">

In [27]:
import re
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]

# Find the words that end in 'ed'
[w for w in wordlist if re.search('ed$', w)] 

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

In [28]:
# Find the 8-letter words whose 3rd letter is j and whose 6th letter is t
[w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']
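
The cells above exercise the anchors `^` and `$` and the wildcard `.`. Ranges and closures work in the same way; the sketch below tries them on a small, hypothetical word list (`demo_words`) rather than on the NLTK word corpus.

```python
import re

# Hypothetical word list; the cells above use nltk.corpus.words instead.
demo_words = ['gold', 'golf', 'hole', 'hold', 'mine', 'nine', 'miiine', 'me', 'abc']

# Ranges: each [...] matches exactly one character from the set.
ranges = [w for w in demo_words if re.search('^[ghi][mno][jlk][def]$', w)]

# Closure +: one or more of the preceding item.
plus = [w for w in demo_words if re.search('^mi+ne$', w)]

# Closure *: zero or more of the preceding item.
star = [w for w in demo_words if re.search('^mi*e$', w)]

print(ranges, plus, star)
```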

## 3.5   Useful Applications of Regular Expressions

### Extracting Word Pieces

In [29]:
# Pick out the vowels (a, e, i, o, u) in a word
word = 'supercalifragilisticexpialidocious'
re.findall(r'[aeiou]', word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']
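
`re.findall()` combines naturally with a frequency count. As a sketch, the code below counts runs of two or more vowels across a hypothetical word list, with `collections.Counter` standing in for `nltk.FreqDist`.

```python
import re
from collections import Counter

# Hypothetical word list standing in for a corpus.
demo_words = ['beautiful', 'queue', 'sound', 'about', 'idea']

# Find every run of two or more vowels and count how often each one occurs.
counts = Counter(
    seq
    for w in demo_words
    for seq in re.findall(r'[aeiou]{2,}', w)
)
print(counts.most_common())
```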

### Doing More with Word Pieces

In [31]:
# Keep a vowel sequence at the start of the word, a vowel sequence at the end, and every non-vowel; word-internal vowels are dropped
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'
def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

english_udhr = nltk.corpus.udhr.words('English-Latin1')
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


### Finding Word Stems

In [35]:
# Stemming: strip the suffix from a word, leaving only the stem
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = nltk.word_tokenize(raw)
[stem(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut',
 'sword',
 '...',
 'i',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'Supreme',
 'execut',
 'power',
 'deriv',
 'from',
 '...',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']
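
The same idea can be expressed with a single regular expression. This is an illustrative alternative to the loop above, not a better stemmer: `regex_stem` is a hypothetical name, and it over-strips in the same way (for example `lying` becomes `ly`).

```python
import re

def regex_stem(word):
    """Split off at most one suffix with a regular expression."""
    # .*? is non-greedy, so the suffix group gets a chance to match;
    # the trailing ? makes the suffix optional so that every word matches.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

stems = [regex_stem(w) for w in ['processes', 'lying', 'government', 'women']]
print(stems)
```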

## 3.6   Normalizing Text

### Stemmers

Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. 

In [37]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

In [38]:
# Porter stemmer
porter = nltk.PorterStemmer()
[porter.stem(t) for t in tokens]

['denni',
 ':',
 'listen',
 ',',
 'strang',
 'women',
 'lie',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'basi',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'power',
 'deriv',
 'from',
 'a',
 'mandat',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcic',
 'aquat',
 'ceremoni',
 '.']

In [39]:
# Lancaster stemmer
lancaster = nltk.LancasterStemmer()
[lancaster.stem(t) for t in tokens]

['den',
 ':',
 'list',
 ',',
 'strange',
 'wom',
 'lying',
 'in',
 'pond',
 'distribut',
 'sword',
 'is',
 'no',
 'bas',
 'for',
 'a',
 'system',
 'of',
 'govern',
 '.',
 'suprem',
 'execut',
 'pow',
 'der',
 'from',
 'a',
 'mand',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'som',
 'farc',
 'aqu',
 'ceremony',
 '.']

### Lemmatization

The WordNet lemmatizer only removes affixes if the resulting word is in its dictionary. 

This additional checking process makes the lemmatizer slower than the above stemmers. 

In [40]:
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'woman',
 'lying',
 'in',
 'pond',
 'distributing',
 'sword',
 'is',
 'no',
 'basis',
 'for',
 'a',
 'system',
 'of',
 'government',
 '.',
 'Supreme',
 'executive',
 'power',
 'derives',
 'from',
 'a',
 'mandate',
 'from',
 'the',
 'mass',
 ',',
 'not',
 'from',
 'some',
 'farcical',
 'aquatic',
 'ceremony',
 '.']
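
The dictionary check can be illustrated with a toy example. The sketch below is not how the WordNet lemmatizer is implemented; `TOY_DICTIONARY`, `TOY_EXCEPTIONS` and `toy_lemmatize` are hypothetical and exist only to show why `ponds` is reduced while `basis` and `derives` are left alone.

```python
# Toy stand-in for a dictionary; WordNet itself is far larger.
TOY_DICTIONARY = {'pond', 'sword', 'mass', 'basis', 'woman'}
# Toy table of irregular forms, assumed for illustration only.
TOY_EXCEPTIONS = {'women': 'woman'}

def toy_lemmatize(word):
    """Strip a plural suffix only if the result is a known dictionary word."""
    if word in TOY_EXCEPTIONS:
        return TOY_EXCEPTIONS[word]
    for suffix in ('es', 's'):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if candidate in TOY_DICTIONARY:
                return candidate
    return word  # leave the word unchanged when no dictionary form is found

lemmas = [toy_lemmatize(w) for w in ['ponds', 'masses', 'basis', 'women', 'derives']]
print(lemmas)
```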

## 3.7   Regular Expressions for Tokenizing Text

Regular Expression Symbols

<img src = "images/resimbols.png">

In [49]:
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x)        # set flag to allow verbose regexps
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.              # ellipsis
    | [][.,;"'?():-_`]    # these are separate tokens; includes ], [
    '''
# Non-capturing groups (?:...) are needed here: with capturing groups (...),
# regexp_tokenize returns tuples of the captured groups instead of the tokens.
nltk.regexp_tokenize(text, pattern)

['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
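
Whether the groups in a tokenizer pattern are capturing or not changes what findall-style matching returns. The sketch below shows the difference with the standard `re` module on one alternative of the pattern; it assumes that `nltk.regexp_tokenize` applies the pattern in the same findall manner, which is consistent with tuples of groups appearing in place of tokens when capturing groups are used.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With a capturing group, findall returns the group contents, not the whole match.
capturing = r'\w+(-\w+)*'
with_groups = re.findall(capturing, text)
print(with_groups)

# With a non-capturing group (?:...), findall returns the full matched text.
non_capturing = r'\w+(?:-\w+)*'
without_groups = re.findall(non_capturing, text)
print(without_groups)
```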

## 3.9   Formatting: From Lists to Strings

In [50]:
# From Lists to Strings: join() method
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
' '.join(silly)

'We called him Tortoise because he taught us .'

### Strings and Formats

In [54]:
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in sorted(fdist):
   print(word, '->', fdist[word], end='; ')

cat -> 3; dog -> 4; snake -> 1; 

In [57]:
#  string formatting {}
for word in sorted(fdist):
   print('{}->{};'.format(word, fdist[word]), end=' ')

cat->3; dog->4; snake->1; 

In [56]:
'I want a {} right now'.format('coffee')

'I want a coffee right now'

### Lining Things Up

In [58]:
'{:6}'.format('dog')

'dog   '

In [59]:
'{:>6}'.format('dog')

'   dog'

In [60]:
import math
'{:.4f}'.format(math.pi)

'3.1416'

In [61]:
count, total = 3205, 9375
"accuracy for {} words: {:.4%}".format(total, count / total)

'accuracy for 9375 words: 34.1867%'
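
Width and alignment specifiers are most useful when several rows have to line up. A minimal sketch follows, using a plain dictionary with the same counts as the small `FreqDist` example earlier in this section; the column widths 8 and 3 are arbitrary choices.

```python
# Hypothetical word counts, matching the small FreqDist example in this section.
counts = {'cat': 3, 'dog': 4, 'snake': 1}

lines = []
for word in sorted(counts):
    # {:<8} left-aligns the word in 8 columns; {:>3} right-aligns the count in 3.
    lines.append('{:<8}{:>3}'.format(word, counts[word]))

print('\n'.join(lines))
```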

### Writing Results to a File

In [75]:
words = set(nltk.corpus.genesis.words('english-kjv.txt'))
# The with statement closes the file (and flushes the output) when the block ends
with open('output.txt', 'w') as output_file:
    for word in sorted(words):
        print(word, file=output_file)
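
To check that the output was written as intended, the file can be read back. The sketch below uses a temporary directory and a tiny hypothetical vocabulary so that it runs without the Genesis corpus and leaves no files behind.

```python
import os
import tempfile

words = {'beginning', 'God', 'heaven', 'earth'}  # hypothetical stand-in vocabulary

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'output.txt')
    with open(path, 'w', encoding='utf8') as output_file:
        for word in sorted(words):
            print(word, file=output_file)

    # Read the file back: one word per line.
    with open(path, encoding='utf8') as input_file:
        lines = input_file.read().splitlines()

print(lines)
```

Capitalized words sort before lowercase ones because `sorted()` compares code points.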

## 3.10   Summary

- In this book we view a text as a list of words. A "raw text" is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.
- A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".
- The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[0] gives the value M. The length of a string is found using len().
- Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.
- Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives  'Monty/Python'.
- We can read text from a file input.txt using text = open('input.txt').read(). We can read text from url using  text = request.urlopen(url).read().decode('utf8'). We can iterate over the lines of a text file using for line in open(f).
- We can write text to a file by opening the file for writing output_file = open('output.txt', 'w'), then adding content to the file  print("Monty Python", file=output_file).
- Texts found on the web may contain unwanted material (such as headers, footers, markup), that need to be removed before we do any linguistic processing.
- Tokenization is the segmentation of a text into basic units — or tokens — such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer nltk.word_tokenize().
- Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g. appear).
- Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.
- If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.
- When backslash is used before certain characters, e.g. \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g. \., \|, \$, these characters lose their special meaning and are matched literally.
- A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2d.
