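## Chapter 6: Processing English Text

Perform the following tasks on the English text [nlp.txt](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt).

## 50. Sentence segmentation
Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase letter as a sentence boundary, and output the input document with one sentence per line.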


In [3]:
import re

def process_lines():
    fname = 'nlp.txt'
    with open(fname) as lines:

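        # Compile the sentence-splitting regex: the shortest prefix ending in . ; : ? or !,
        # then one whitespace character, then a remainder starting with an uppercase letter.
        pattern = re.compile(r'(^.*?[.;:?!])\s([A-Z].*)', re.DOTALL)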

        for line in lines:
            line = line.strip()
            while len(line) > 0:
                match = pattern.match(line)
                if match:
                    yield match.group(1)
                    line = match.group(2)
                else:
                    yield line
                    line = ''

for line in process_lines():
    print(line)

Natural language processing
From Wikipedia, the free encyclopedia
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
History
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a 
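
The same boundary rule can also be written as a single split, which may be easier to try out on a short string. This is a minimal sketch and not the cell above: `split_sentences` and `sample` are names made up for illustration, it works on an in-memory string rather than on `nlp.txt`, and it accepts a run of whitespace (`\s+`) where the cell above expects exactly one whitespace character.

```python
import re

# Boundary: closing punctuation, then whitespace, then an uppercase letter.
# The lookbehind and lookahead keep the punctuation and the capital in the output.
BOUNDARY = re.compile(r'(?<=[.;:?!])\s+(?=[A-Z])')


def split_sentences(text):
    """Return the sentences found in `text`, one string per sentence."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


# Hypothetical sample text, not taken from nlp.txt
sample = 'NLP is a field. It began in the 1950s: Turing wrote on it! was it easy? Yes.'
for sentence in split_sentences(sample):
    print(sentence)
```

Note that `! was` does not start a new sentence, because the character after the whitespace is lowercase.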

## 51. Word segmentation
Treat whitespace as the word delimiter, take the output of 50 as input, and output one word per line. Output an empty line at the end of each sentence.

In [5]:
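def process_words():
    for sentence in process_lines():
        for word in sentence.split():
            # Strip trailing punctuation so that only the word itself remains
            yield word.rstrip('.,:;?!')
        # An empty string marks the end of a sentence
        yield ''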

for i in process_words():
    print(i)

Natural
language
processing

From
Wikipedia
the
free
encyclopedia

Natural
language
processing
(NLP)
is
a
field
of
computer
science
artificial
intelligence
and
linguistics
concerned
with
the
interactions
between
computers
and
human
(natural)
languages

As
such
NLP
is
related
to
the
area
of
humani-computer
interaction

Many
challenges
in
NLP
involve
natural
language
understanding
that
is
enabling
computers
to
derive
meaning
from
human
or
natural
language
input
and
others
involve
natural
language
generation

History

The
history
of
NLP
generally
starts
in
the
1950s
although
work
can
be
found
from
earlier
periods

In
1950
Alan
Turing
published
an
article
titled
"Computing
Machinery
and
Intelligence"
which
proposed
what
is
now
called
the
Turing
test
as
a
criterion
of
intelligence

The
Georgetown
experiment
in
1954
involved
fully
automatic
translation
of
more
than
sixty
Russian
sentences
into
English

The
authors
claimed
that
within
three
or
five
years
machine
translation
would
be
a
solved


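## 52. Stemming
Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the [stemming](https://pypi.python.org/pypi/stemming) module is a convenient implementation of Porter's stemming algorithm.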

In [6]:
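import snowballstemmer as stm

# snowballstemmer is used here in place of the stemming module;
# its 'english' stemmer is the Snowball English (Porter2) revision of Porter's algorithm.
stemmer = stm.stemmer('english')
for word in process_words():
    print('{}\t{}'.format(word, stemmer.stemWord(word)))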

Natural	Natur
language	languag
processing	process
	
From	From
Wikipedia	Wikipedia
the	the
free	free
encyclopedia	encyclopedia
	
Natural	Natur
language	languag
processing	process
(NLP)	(NLP)
is	is
a	a
field	field
of	of
computer	comput
science	scienc
artificial	artifici
intelligence	intellig
and	and
linguistics	linguist
concerned	concern
with	with
the	the
interactions	interact
between	between
computers	comput
and	and
human	human
(natural)	(natural)
languages	languag
	
As	As
such	such
NLP	NLP
is	is
related	relat
to	to
the	the
area	area
of	of
humani-computer	humani-comput
interaction	interact
	
Many	Mani
challenges	challeng
in	in
NLP	NLP
involve	involv
natural	natur
language	languag
understanding	understand
that	that
is	is
enabling	enabl
computers	comput
to	to
derive	deriv
meaning	mean
from	from
human	human
or	or
natural	natur
language	languag
input	input
and	and
others	other
involve	involv
natural	natur
language	languag
generation	generat
	
History	Histori
	
The	The
history	histori
of	of


## 53. Tokenization
Use [Stanford Core NLP](https://nlp.stanford.edu/software/corenlp.shtml) to obtain the analysis of the input text in XML format. Then read this XML file and output the input text with one word per line.

In [None]:
# preparation
!curl -O http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
!unzip stanford-corenlp-full-2017-06-09.zip
!java -cp "stanford-corenlp-full-2017-06-09/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file nlp.txt

In [13]:
import xml.etree.ElementTree as ET

xml_file = "nlp.txt.xml"

tree = ET.parse(xml_file)
root = tree.getroot()

for word in root.iter('word'):
    print(word.text)


Natural
language
processing
From
Wikipedia
,
the
free
encyclopedia
Natural
language
processing
-LRB-
NLP
-RRB-
is
a
field
of
computer
science
,
artificial
intelligence
,
and
linguistics
concerned
with
the
interactions
between
computers
and
human
-LRB-
natural
-RRB-
languages
.
As
such
,
NLP
is
related
to
the
area
of
humani-computer
interaction
.
Many
challenges
in
NLP
involve
natural
language
understanding
,
that
is
,
enabling
computers
to
derive
meaning
from
human
or
natural
language
input
,
and
others
involve
natural
language
generation
.
History
The
history
of
NLP
generally
starts
in
the
1950s
,
although
work
can
be
found
from
earlier
periods
.
In
1950
,
Alan
Turing
published
an
article
titled
``
Computing
Machinery
and
Intelligence
''
which
proposed
what
is
now
called
the
Turing
test
as
a
criterion
of
intelligence
.
The
Georgetown
experiment
in
1954
involved
fully
automatic
translation
of
more
than
sixty
Russian
sentences
into
English
.
The
authors
claimed
that
within
three
or
five

## 54. Part-of-speech tagging
Read the Stanford Core NLP analysis XML and output each word, lemma, and part of speech in tab-separated format.

In [14]:
import xml.etree.ElementTree as ET

xml_file = "nlp.txt.xml"

tree = ET.parse(xml_file)
root = tree.getroot()

for token in root.iter('token'):
    word = token.findtext('word')
    lemma = token.findtext('lemma')
    pos = token.findtext('POS')
    print('{}\t{}\t{}'.format(word, lemma, pos))

Natural	natural	JJ
language	language	NN
processing	processing	NN
From	from	IN
Wikipedia	Wikipedia	NNP
,	,	,
the	the	DT
free	free	JJ
encyclopedia	encyclopedia	NN
Natural	natural	JJ
language	language	NN
processing	processing	NN
-LRB-	-lrb-	-LRB-
NLP	nlp	NN
-RRB-	-rrb-	-RRB-
is	be	VBZ
a	a	DT
field	field	NN
of	of	IN
computer	computer	NN
science	science	NN
,	,	,
artificial	artificial	JJ
intelligence	intelligence	NN
,	,	,
and	and	CC
linguistics	linguistics	NNS
concerned	concern	VBN
with	with	IN
the	the	DT
interactions	interaction	NNS
between	between	IN
computers	computer	NNS
and	and	CC
human	human	JJ
-LRB-	-lrb-	-LRB-
natural	natural	JJ
-RRB-	-rrb-	-RRB-
languages	language	NNS
.	.	.
As	as	IN
such	such	JJ
,	,	,
NLP	nlp	NN
is	be	VBZ
related	relate	VBN
to	to	TO
the	the	DT
area	area	NN
of	of	IN
humani-computer	humani-computer	JJ
interaction	interaction	NN
.	.	.
Many	many	JJ
challenges	challenge	NNS
in	in	IN
NLP	nlp	NN
involve	involve	VBP
natural	natural	JJ
language	language	NN
understanding	unde

## 55. Named entity recognition
Extract all person names in the input text.

In [28]:
import xml.etree.ElementTree as ET

xml_file = 'nlp.txt.xml'

tree = ET.parse(xml_file)
root = tree.getroot()

for token in root.iter('token'):
    # The NER element holds the named-entity label of the token
    ner = token.findtext('NER')
    if ner == 'PERSON':
        print(token.findtext('word'))

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore
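
The output above lists one token per line, so a name such as "Alan Turing" is split over two lines. One possible refinement is to join runs of adjacent `PERSON` tokens. The sketch below assumes the same `token` / `word` / `NER` elements the cell above reads; `SAMPLE_XML` is a hand-written miniature for illustration, not real CoreNLP output, and `person_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical miniature of the token layout (not real CoreNLP output)
SAMPLE_XML = """
<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>Alan</word><NER>PERSON</NER></token>
  <token id="2"><word>Turing</word><NER>PERSON</NER></token>
  <token id="3"><word>met</word><NER>O</NER></token>
  <token id="4"><word>Joseph</word><NER>PERSON</NER></token>
  <token id="5"><word>Weizenbaum</word><NER>PERSON</NER></token>
</tokens></sentence></sentences></document></root>
"""


def person_names(root):
    """Join runs of adjacent PERSON tokens within each sentence."""
    names = []
    for sentence in root.iterfind('./document/sentences/sentence'):
        tokens = sentence.iterfind('./tokens/token')
        # groupby splits the token stream into runs of PERSON / non-PERSON tokens
        for is_person, run in groupby(tokens, key=lambda t: t.findtext('NER') == 'PERSON'):
            if is_person:
                names.append(' '.join(t.findtext('word') for t in run))
    return names


print(person_names(ET.fromstring(SAMPLE_XML)))
```

Grouping per sentence keeps a name at the end of one sentence from being merged with a name at the start of the next.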


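## 56. Coreference resolution
Based on the Stanford Core NLP coreference resolution results, replace each mention in the text with its representative mention. When replacing, keep the original mention visible, as in "representative mention (mention)".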

In [101]:
import xml.etree.ElementTree as ET

xml_file = 'nlp.txt.xml'

tree = ET.parse(xml_file)
root = tree.getroot()

# Map (sentence id, start token id) -> (end token id, representative mention text)
coreference_dict = {}

for coreference in root.iterfind('./document/coreference/coreference'):
    representative = coreference.findtext('./mention[@representative="true"]/text')

    for mention in coreference.iterfind('./mention'):
        if mention.get('representative', 'false') == 'false':
            sentence_id = int(mention.findtext('sentence'))
            start_token_id = int(mention.findtext('start'))
            end_token_id = int(mention.findtext('end'))

            # Keep only the first mention registered for a given start position
            if (sentence_id, start_token_id) not in coreference_dict:
                coreference_dict[(sentence_id, start_token_id)] = (end_token_id, representative)

for sentence in root.iterfind('./document/sentences/sentence'):
    sentence_id = int(sentence.get('id'))
    counter = 0  # tokens left in the mention currently being printed

    for token in sentence.iterfind('./tokens/token'):
        token_id = int(token.get('id'))

        # At the start of a mention, print the representative and open the parenthesis
        if counter == 0 and (sentence_id, token_id) in coreference_dict:
            (end, representative) = coreference_dict[(sentence_id, token_id)]
            print(representative, end='')
            print('(', end='')
            # `end` is exclusive, so this is the number of tokens in the mention
            counter = end - token_id

        print(token.findtext('word'), end='')

        # Close the parenthesis after the last token of the mention
        if counter:
            counter -= 1
            if not counter:
                print(')', end='')

        print(' ', end='')

    print()

Natural language processing From Wikipedia , the free encyclopedia Natural language processing -LRB- NLP -RRB- is the free encyclopedia Natural language processing -LRB- NLP -RRB-(a field of computer science) , artificial intelligence , and linguistics concerned with the interactions between computers and human -LRB- natural -RRB- languages . 
As such , NLP is related to the area of humani-computer interaction . 
Many challenges in NLP involve natural language understanding , that is , enabling computers(computers) to derive meaning from human or natural language input , and others involve natural language generation . 
History The history of NLP generally starts in the 1950s , although work can be found from earlier periods . 
In 1950 , Alan Turing published an article titled `` Computing Machinery and Intelligence '' which proposed what is now called the Alan Turing(Turing) test as a criterion of intelligence . 
The Georgetown experiment in 1954 involved fully automatic translation o

## 57. Dependency parsing
Visualize the Stanford Core NLP dependency parse (collapsed-dependencies) as a directed graph. For visualization, it is convenient to convert the dependency tree to the [DOT language](https://ja.wikipedia.org/wiki/DOT%E8%A8%80%E8%AA%9E) and use [Graphviz](http://www.graphviz.org/). To visualize a directed graph directly from Python, use [pydot](https://github.com/erocarrera/pydot).

In [102]:
import xml.etree.ElementTree as ET
import pydot_ng as pydot

tree = ET.parse(xml_file)
root = tree.getroot()

for sentence in root.iterfind('./document/sentences/sentence'):
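
The cell above stops at the loop header, so the following is only a sketch of one way the conversion to DOT might look, not the original continuation. It builds the DOT text with the standard library alone instead of pydot. The element and attribute names (`dependencies`, `dep`, `governor`, `dependent`, `idx`) are assumptions about the CoreNLP XML layout, `SAMPLE_XML` is a hand-written miniature, and `node` and `sentence_to_dot` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the dependency layout (not real CoreNLP output)
SAMPLE_XML = """
<root><document><sentences><sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">published</dependent></dep>
    <dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
    <dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
    <dep type="det"><governor idx="4">article</governor><dependent idx="3">an</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>
"""


def node(element):
    """Quoted DOT node id: word plus token index, so repeated words stay distinct."""
    text = '{}_{}'.format(element.text, element.get('idx'))
    return '"{}"'.format(text.replace('\\', '\\\\').replace('"', '\\"'))


def sentence_to_dot(sentence):
    """Return DOT source with one labelled edge per dependency (governor -> dependent)."""
    deps = sentence.find('./dependencies[@type="collapsed-dependencies"]')
    lines = ['digraph sentence_{} {{'.format(sentence.get('id'))]
    for dep in deps.iterfind('./dep'):
        lines.append('    {} -> {} [label="{}"];'.format(
            node(dep.find('governor')), node(dep.find('dependent')), dep.get('type')))
    lines.append('}')
    return '\n'.join(lines)


root = ET.fromstring(SAMPLE_XML)
for sentence in root.iterfind('./document/sentences/sentence'):
    print(sentence_to_dot(sentence))
```

The printed text can be saved to a file and rendered with Graphviz, for example `dot -Tpng graph.dot -o graph.png`.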
    

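## 58. Tuple extraction
Based on the Stanford Core NLP dependency parse (collapsed-dependencies), output "subject predicate object" triples in tab-separated format. Use the following definitions of subject, predicate, and object.

+ Predicate: a word that has children (dependents) in both the nsubj and dobj relations
+ Subject: the child (dependent) in the nsubj relation from the predicate
+ Object: the child (dependent) in the dobj relation from the predicate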
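
The three definitions above could be sketched as follows. This is an illustration under assumptions, not a reference solution: the element and attribute names (`dependencies`, `dep`, `governor`, `dependent`, `idx`) are assumed to match the CoreNLP XML, `SAMPLE_XML` is a hand-written miniature, and `extract_svo` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of the dependency layout (not real CoreNLP output)
SAMPLE_XML = """
<root><document><sentences><sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">published</dependent></dep>
    <dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
    <dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
    <dep type="det"><governor idx="4">article</governor><dependent idx="3">an</dependent></dep>
  </dependencies>
</sentence></sentences></document></root>
"""


def extract_svo(sentence):
    """Return (subject, predicate, object) triples for one sentence element."""
    deps = sentence.find('./dependencies[@type="collapsed-dependencies"]')
    subjects, objects, words = defaultdict(list), defaultdict(list), {}
    for dep in deps.iterfind('./dep'):
        governor, dependent = dep.find('governor'), dep.find('dependent')
        idx = governor.get('idx')  # token index identifies the governor word
        words[idx] = governor.text
        if dep.get('type') == 'nsubj':
            subjects[idx].append(dependent.text)
        elif dep.get('type') == 'dobj':
            objects[idx].append(dependent.text)
    # A predicate is a governor that has both an nsubj child and a dobj child
    triples = []
    for idx in subjects:
        for subject in subjects[idx]:
            for obj in objects.get(idx, []):
                triples.append((subject, words[idx], obj))
    return triples


root = ET.fromstring(SAMPLE_XML)
for sentence in root.iterfind('./document/sentences/sentence'):
    for triple in extract_svo(sentence):
        print('\t'.join(triple))
```

Keying on the governor's token index rather than its text keeps two occurrences of the same verb in one sentence apart.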

## 59. Parsing S-expressions
Read the Stanford Core NLP phrase structure parse (S-expression) and display every noun phrase (NP) in the text. Display nested noun phrases as well.
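
One way to approach this is a small recursive reader that turns the S-expression into nested lists and then walks the tree. The sketch below is an illustration under assumptions: `sample` is a hand-written parse string rather than real CoreNLP output, and `parse_sexp`, `leaves`, and `noun_phrases` are hypothetical helpers. With the real XML, the string would presumably come from each sentence's `parse` element.

```python
import re

# Tokens are '(' , ')' , or any run of characters without whitespace or parentheses
TOKEN = re.compile(r'\(|\)|[^\s()]+')


def parse_sexp(text):
    """Parse one S-expression into nested lists of the form [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != '(':
            return token  # a label or a word
        node = []
        while tokens[pos] != ')':
            node.append(read())
        pos += 1  # skip the closing ')'
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


def noun_phrases(node):
    """Yield the text of every NP, outer phrases before the ones nested in them."""
    if isinstance(node, str):
        return
    if node[0] == 'NP':
        yield ' '.join(leaves(node))
    for child in node[1:]:
        yield from noun_phrases(child)


# Hypothetical parse string for illustration
sample = ('(ROOT (S (NP (NNP Turing)) (VP (VBD published) '
          '(NP (NP (DT an) (NN article)) (PP (IN on) (NP (NN intelligence))))) (. .)))')
for phrase in noun_phrases(parse_sexp(sample)):
    print(phrase)
```

Because the walk continues into the children of an NP after yielding it, nested noun phrases such as "an article" inside "an article on intelligence" are reported as well.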