# Chapter 6: Processing English Text
Perform the following processing on the English text ([nlp.txt](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt)).

## 50. Sentence segmentation
Treat the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence boundary, and output the input document with one sentence per line.

In [30]:
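import re

def output_line(text="nlp.txt"):
    """Return the sentences in the file, split at [.;:?!] + space + uppercase letter."""
    output = []
    # A sentence is captured only when a space and an uppercase letter follow it,
    # so the last sentence on each line is not matched by this pattern.
    prog = re.compile(r"(.*?[.;:?!]) (?=[A-Z])")
    with open(text, 'r') as f:
        for line in f:
            output += prog.findall(line)
    return output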

In [31]:
for line in output_line("nlp.txt"):
    print(line)

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
As such, NLP is related to the area of humani-computer interaction.
The history of NLP generally starts in the 1950s, although work can be found from earlier periods.
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English.
The authors claimed that within three or five years, machine translation would be a solved problem.
However, real progress was much slower, and after the ALPAC report in 1966, which found that ten year long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psych
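
The `findall` pattern in the cell above captures a sentence only when another sentence follows it on the same line, which is why sentences at the end of a paragraph are missing from the output. A minimal alternative sketch, assuming the same boundary rule, splits at the boundary space instead of matching the text before it. `split_sentences` and `BOUNDARY` are hypothetical names that do not appear in the original notebook.

```python
import re

# Split at a space that follows one of . ; : ? ! and precedes an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.;:?!]) (?=[A-Z])")

def split_sentences(text):
    """Return the sentences in text, including the last one on each line."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip empty lines between paragraphs
            sentences.extend(BOUNDARY.split(line))
    return sentences

sample = "NLP is a field. As such, it is broad: Many tasks exist. Is it solved? No!"
for sentence in split_sentences(sample):
    print(sentence)
```

Because the split keeps whatever remains after the final boundary, single-sentence lines such as headings are also returned.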

## 51. Word extraction
Treating whitespace as the word delimiter, take the output of 50 as input and output one word per line. Output a blank line at the end of each sentence.

In [45]:
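def output_word():
    """Return the words of every sentence from output_line(), with a newline marker after each sentence."""
    output = []
    # Remove every character that is neither alphanumeric nor whitespace.
    prog = re.compile(r"[^a-zA-Z0-9\s]")
    for line in output_line():
        line = prog.sub('', line)
        # The '\n' entry marks the sentence end. print() appends its own newline,
        # so each sentence is followed by two blank lines in the output.
        output += line.split(' ') + ['\n']
    return output


for word in output_word():
    print(word)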

Natural
language
processing
NLP
is
a
field
of
computer
science
artificial
intelligence
and
linguistics
concerned
with
the
interactions
between
computers
and
human
natural
languages


As
such
NLP
is
related
to
the
area
of
humanicomputer
interaction


The
history
of
NLP
generally
starts
in
the
1950s
although
work
can
be
found
from
earlier
periods


The
Georgetown
experiment
in
1954
involved
fully
automatic
translation
of
more
than
sixty
Russian
sentences
into
English


The
authors
claimed
that
within
three
or
five
years
machine
translation
would
be
a
solved
problem


However
real
progress
was
much
slower
and
after
the
ALPAC
report
in
1966
which
found
that
ten
year
long
research
had
failed
to
fulfill
the
expectations
funding
for
machine
translation
was
dramatically
reduced


Some
notably
successful
NLP
systems
developed
in
the
1960s
were
SHRDLU
a
natural
language
system
working
in
restricted
blocks
worlds
with
restricted
vocabularies
and
ELIZA
a
simulation
of
a
Rogerian
psychotherapist
wr
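
The cell above strips punctuation and, because the `'\n'` marker is passed to `print()`, emits two blank lines per sentence. A minimal sketch that follows the task statement more literally (split on whitespace only, one blank line per sentence) could look like this. `words_with_blank_lines` is a hypothetical helper, and the two input sentences are made up for illustration.

```python
def words_with_blank_lines(sentences):
    """Yield each word, then an empty string after every sentence."""
    for sentence in sentences:
        # split() without arguments also collapses repeated whitespace.
        yield from sentence.split()
        # An empty string prints as exactly one blank line.
        yield ""

for word in words_with_blank_lines(["NLP is fun.", "It is hard."]):
    print(word)
```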

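## 52. Stemming
Take the output of 51 as input, apply the Porter stemming algorithm, and output each word and its stem in tab-separated format. In Python, the [stemming](https://pypi.python.org/pypi/stemming) module is a convenient implementation of the Porter stemming algorithm.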

In [51]:
from stemming.porter2 import stem

for word in output_word():
    print(word,stem(word),sep='\t')

Natural	Natur
language	languag
processing	process
NLP	NLP
is	is
a	a
field	field
of	of
computer	comput
science	scienc
artificial	artifici
intelligence	intellig
and	and
linguistics	linguist
concerned	concern
with	with
the	the
interactions	interact
between	between
computers	comput
and	and
human	human
natural	natur
languages	languag

	

As	As
such	such
NLP	NLP
is	is
related	relat
to	to
the	the
area	area
of	of
humanicomputer	humanicomput
interaction	interact

	

The	The
history	histori
of	of
NLP	NLP
generally	general
starts	start
in	in
the	the
1950s	1950s
although	although
work	work
can	can
be	be
found	found
from	from
earlier	earlier
periods	period

	

The	The
Georgetown	Georgetown
experiment	experi
in	in
1954	1954
involved	involv
fully	fulli
automatic	automat
translation	translat
of	of
more	more
than	than
sixty	sixti
Russian	Russian
sentences	sentenc
into	into
English	English

	

The	The
authors	author
claimed	claim
that	that
within	within
three	three
or	or
five	five
years	year
machine	mac

## 53. Tokenization
Use [Stanford Core NLP](https://stanfordnlp.github.io/CoreNLP/) to obtain the analysis of the input text in XML format. Then read this XML file and output the input text with one word per line.

In [1]:
!cat nlp.txt | sh ./stanford-corenlp-full-2018-10-05/corenlp.sh > 53.xml

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Searching for resource: StanfordCoreNLP.properties ... found.
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.3 sec].
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7

In [4]:
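import re

# This relies on each <word> element sitting on its own line in 53.xml,
# so a line-by-line regular-expression search finds every token.
prog = re.compile(r"<word>(?P<word>.*?)</word>")
with open("53.xml", 'r') as f:
    for line in f:
        result = prog.search(line)
        if result:
            print(result.group("word"))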

Natural
language
processing
From
Wikipedia
,
the
free
encyclopedia
Natural
language
processing
-LRB-
NLP
-RRB-
is
a
field
of
computer
science
,
artificial
intelligence
,
and
linguistics
concerned
with
the
interactions
between
computers
and
human
-LRB-
natural
-RRB-
languages
.
As
such
,
NLP
is
related
to
the
area
of
humani-computer
interaction
.
Many
challenges
in
NLP
involve
natural
language
understanding
,
that
is
,
enabling
computers
to
derive
meaning
from
human
or
natural
language
input
,
and
others
involve
natural
language
generation
.
History
The
history
of
NLP
generally
starts
in
the
1950s
,
although
work
can
be
found
from
earlier
periods
.
In
1950
,
Alan
Turing
published
an
article
titled
``
Computing
Machinery
and
Intelligence
''
which
proposed
what
is
now
called
the
Turing
test
as
a
criterion
of
intelligence
.
The
Georgetown
experiment
in
1954
involved
fully
automatic
translation
of
more
than
sixty
Russian
sentences
into
English
.
The
authors
claimed
that
within
three
or
five
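
The regular expression above depends on how the XML file is laid out line by line. A minimal sketch using the standard library's `xml.etree.ElementTree` reads the same `<word>` elements without that dependency. `SAMPLE_XML` is a small hand-written stand-in whose layout is assumed to mirror 53.xml (it is not real CoreNLP output), and `iter_words` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for 53.xml (assumed layout, not real output).
SAMPLE_XML = """<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>Natural</word><lemma>natural</lemma><POS>JJ</POS></token>
    <token id="2"><word>language</word><lemma>language</lemma><POS>NN</POS></token>
  </tokens></sentence>
</sentences></document></root>"""

def iter_words(root):
    """Yield the text of every <word> element in document order."""
    for word in root.iter("word"):
        yield word.text

root = ET.fromstring(SAMPLE_XML)
# For the real file, something like: root = ET.parse("53.xml").getroot()
for w in iter_words(root):
    print(w)
```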

## 54. Part-of-speech tagging
Read the analysis result XML from [Stanford Core NLP](https://stanfordnlp.github.io/CoreNLP/) and output the word, lemma, and part of speech in tab-separated format.

In [56]:
# import re
# pattern = r"(<word>(?P<word>.*?)</word>|<lemma>(?P<lemma>.*?)</lemma>|<POS>(?P<POS>.*?)</POS>)"
# prog = re.compile(pattern)
# with open("53.xml",'r') as f:
#     for line in f.readlines():
#         if line.find("<token ") != -1:
#             while not line.find("<\token ") != 1:
#                 result = prog.search(line)
#                 if result.group("word"):
#                     word = result.group("word")
#                 elif result.group("lemma"):
#                     lemma = result.group("lemma")
#                 elif result.group("POS"):
#                     POS = result.group("POS")
#             print("word:{}\tlemma{}\tPOS{}".format(word,lemma,POS))

from bs4 import BeautifulSoup

# The 'lxml' HTML parser lowercases tag names, so <POS> is accessed as token.pos.
with open('53.xml', 'r') as file:
    soup = BeautifulSoup(file, 'lxml')
for token in soup.find_all("token"):
    word, lemma, pos = token.word.text, token.lemma.text, token.pos.text
    print("word:{:<20}\tlemma:{:<20}\tpos:{:<20}".format(word, lemma, pos))

word:Natural             	lemma:natural             	pos:JJ                  
word:language            	lemma:language            	pos:NN                  
word:processing          	lemma:processing          	pos:NN                  
word:From                	lemma:from                	pos:IN                  
word:Wikipedia           	lemma:Wikipedia           	pos:NNP                 
word:,                   	lemma:,                   	pos:,                   
word:the                 	lemma:the                 	pos:DT                  
word:free                	lemma:free                	pos:JJ                  
word:encyclopedia        	lemma:encyclopedia        	pos:NN                  
word:Natural             	lemma:natural             	pos:JJ                  
word:language            	lemma:language            	pos:NN                  
word:processing          	lemma:processing          	pos:NN                  
word:-LRB-               	lemma:-lrb-               	pos:-LRB-  

word:its                 	lemma:its                 	pos:PRP$                
word:conference          	lemma:conference          	pos:NN                  
word:CoNLL               	lemma:conll               	pos:NN                  
word:and                 	lemma:and                 	pos:CC                  
word:peak                	lemma:peak                	pos:NN                  
word:body                	lemma:body                	pos:NN                  
word:SIGNLL              	lemma:SIGNLL              	pos:NNP                 
word:are                 	lemma:be                  	pos:VBP                 
word:sponsored           	lemma:sponsor             	pos:VBN                 
word:by                  	lemma:by                  	pos:IN                  
word:ACL                 	lemma:acl                 	pos:NN                  
word:,                   	lemma:,                   	pos:,                   
word:recognizing         	lemma:recognize           	pos:VBG    
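
The task asks for plain tab-separated columns, while the cell above adds labels and padding. A minimal sketch with `xml.etree.ElementTree`, which keeps tag names case-sensitive (so the part of speech is read from `POS`, not `pos`), is shown below. The sample XML is hand-written under the assumption that it mirrors the token layout of 53.xml, and `token_rows` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for 53.xml (assumed layout, not real output).
SAMPLE_XML = """<root><document><sentences><sentence id="1"><tokens>
  <token id="1"><word>computers</word><lemma>computer</lemma><POS>NNS</POS></token>
  <token id="2"><word>are</word><lemma>be</lemma><POS>VBP</POS></token>
</tokens></sentence></sentences></document></root>"""

def token_rows(root):
    """Return one 'word<TAB>lemma<TAB>POS' string per <token> element."""
    rows = []
    for token in root.iter("token"):
        fields = (token.findtext("word"),
                  token.findtext("lemma"),
                  token.findtext("POS"))
        rows.append("\t".join(fields))
    return rows

for row in token_rows(ET.fromstring(SAMPLE_XML)):
    print(row)
```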

## 55. Named entity extraction
Extract every person name from the input text.

In [57]:
for token in soup.find_all("token"):
    if token.ner.text == "PERSON":
        print(token.word.text)

Alan
Turing
Joseph
Weizenbaum
MARGIE
Schank
Wilensky
Meehan
Lehnert
Carbonell
Lehnert
Racter
Jabberwacky
Moore
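
The output above lists one token per line, so a full name such as "Alan Turing" is split across two lines. A minimal sketch that joins consecutive PERSON tokens is given below. `person_names` is a hypothetical helper, and the token list is a hand-made example rather than the parsed file.

```python
from itertools import groupby

def person_names(tokens):
    """Join runs of consecutive PERSON tokens into full names.

    tokens is a list of (word, ner_tag) pairs in document order.
    """
    names = []
    for tag, group in groupby(tokens, key=lambda pair: pair[1]):
        if tag == "PERSON":
            names.append(" ".join(word for word, _ in group))
    return names

tokens = [("In", "O"), ("1950", "DATE"), (",", "O"),
          ("Alan", "PERSON"), ("Turing", "PERSON"), ("published", "O")]
print(person_names(tokens))
```

With the `soup` object from the earlier cell, the input could be built as `[(t.word.text, t.ner.text) for t in soup.find_all("token")]`. Note that this simple grouping would also merge two names that happen to be adjacent without any token between them.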


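## 56. Coreference resolution
Based on the coreference resolution results from [Stanford Core NLP](https://stanfordnlp.github.io/CoreNLP/), replace each mention in the text with its representative mention. When replacing, keep the original mention visible, for example as "representative mention (mention)".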

In [121]:
# for coreference in soup.find_all("coreference"):
#     for i,mention in enumerate(coreference.find_all("mention")):
#         if i==0:
#             representative = mention.text
#         else:
            
for root in soup.find_all("root"):
    sentence_list = root.find_all("sentence")
    for i in range(len(sentence_list)):
        for j,token in enumerate(sentence_list[i].find_all("token")):
            print(i,j,token.word.text)
        for mention in root.find_all("mention"):
            print(mention)
#     for sentence in root.find_all("sentence"):
#         for i,token in enumerate(sentence.find_all("token")):
#             print(i,token.word)
#     for mention in root.find_all("mention"):
#         for 
        

0 0 Natural
0 1 language
0 2 processing
0 0 From
0 1 Wikipedia
0 2 ,
0 3 the
0 4 free
0 5 encyclopedia
0 0 Natural
0 1 language
0 2 processing
0 3 -LRB-
0 4 NLP
0 5 -RRB-
0 6 is
0 7 a
0 8 field
0 9 of
0 10 computer
0 11 science
0 12 ,
0 13 artificial
0 14 intelligence
0 15 ,
0 16 and
0 17 linguistics
0 18 concerned
0 19 with
0 20 the
0 21 interactions
0 22 between
0 23 computers
0 24 and
0 25 human
0 26 -LRB-
0 27 natural
0 28 -RRB-
0 29 languages
0 30 .
<mention representative="true">
<sentence>1</sentence>
<start>5</start>
<end>6</end>
          5
          <text>NLP</text>
</mention>
<mention>
<sentence>2</sentence>
<start>4</start>
<end>5</end>
          4
          <text>NLP</text>
</mention>
<mention>
<sentence>3</sentence>
<start>4</start>
<end>5</end>
          4
          <text>NLP</text>
</mention>
1 0 As
1 1 such
1 2 ,
1 3 NLP
1 4 is
1 5 related
1 6 to
1 7 the
1 8 area
1 9 of
1 10 humani-computer
1 11 interaction
1 12 .
<mention representative="true">
<sentence>1</sentence>


0 11 '
0 12 ,
0 13 which
0 14 structured
0 15 real-world
0 16 information
0 17 into
0 18 computer-understandable
0 19 data
0 20 .
<mention representative="true">
<sentence>2</sentence>
<start>14</start>
<end>15</end>
          14
          <text>1978</text>
</mention>
<mention>
<sentence>2</sentence>
<start>21</start>
<end>22</end>
          21
          <text>1978</text>
</mention>
1 0 Examples
1 1 are
1 2 MARGIE
1 3 -LRB-
1 4 Schank
1 5 ,
1 6 1975
1 7 -RRB-
1 8 ,
1 9 SAM
1 10 -LRB-
1 11 Cullingford
1 12 ,
1 13 1978
1 14 -RRB-
1 15 ,
1 16 PAM
1 17 -LRB-
1 18 Wilensky
1 19 ,
1 20 1978
1 21 -RRB-
1 22 ,
1 23 TaleSpin
1 24 -LRB-
1 25 Meehan
1 26 ,
1 27 1976
1 28 -RRB-
1 29 ,
1 30 QUALM
1 31 -LRB-
1 32 Lehnert
1 33 ,
1 34 1977
1 35 -RRB-
1 36 ,
1 37 Politics
1 38 -LRB-
1 39 Carbonell
1 40 ,
1 41 1979
1 42 -RRB-
1 43 ,
1 44 and
1 45 Plot
1 46 Units
1 47 -LRB-
1 48 Lehnert
1 49 1981
1 50 -RRB-
1 51 .
<mention representative="true">
<sentence>2</sentence>
<start>14</start>
<end>15</end>
    

1 23 annotated
1 24 and
1 25 non-annotated
1 26 data
1 27 .
2 0 Generally
2 1 ,
2 2 this
2 3 task
2 4 is
2 5 much
2 6 more
2 7 difficult
2 8 than
2 9 supervised
2 10 learning
2 11 ,
2 12 and
2 13 typically
2 14 produces
2 15 less
2 16 accurate
2 17 results
2 18 for
2 19 a
2 20 given
2 21 amount
2 22 of
2 23 input
2 24 data
2 25 .
3 0 However
3 1 ,
3 2 there
3 3 is
3 4 an
3 5 enormous
3 6 amount
3 7 of
3 8 non-annotated
3 9 data
3 10 available
3 11 -LRB-
3 12 including
3 13 ,
3 14 among
3 15 other
3 16 things
3 17 ,
3 18 the
3 19 entire
3 20 content
3 21 of
3 22 the
3 23 World
3 24 Wide
3 25 Web
3 26 -RRB-
3 27 ,
3 28 which
3 29 can
3 30 often
3 31 make
3 32 up
3 33 for
3 34 the
3 35 inferior
3 36 results
3 37 .
0 0 NLP
0 1 using
0 2 machine
0 3 learning
0 0 Modern
0 1 NLP
0 2 algorithms
0 3 are
0 4 based
0 5 on
0 6 machine
0 7 learning
0 8 ,
0 9 especially
0 10 statistical
0 11 machine
0 12 learning
0 13 .
<mention representative="true">
<sentence>4</sentence>
<start>1</start>
<end>4</

<mention>
<sentence>5</sentence>
<start>7</start>
<end>8</end>
          7
          <text>they</text>
</mention>
<mention representative="true">
<sentence>1</sentence>
<start>5</start>
<end>8</end>
          7
          <text>machine learning algorithms</text>
</mention>
<mention>
<sentence>2</sentence>
<start>1</start>
<end>3</end>
          2
          <text>These algorithms</text>
</mention>
<mention>
<sentence>3</sentence>
<start>3</start>
<end>6</end>
          5
          <text>the earliest-used algorithms</text>
</mention>
<mention representative="true">
<sentence>5</sentence>
<start>1</start>
<end>3</end>
          2
          <text>Such models</text>
</mention>
<mention>
<sentence>5</sentence>
<start>7</start>
<end>8</end>
          7
          <text>they</text>
</mention>
<mention representative="true">
<sentence>1</sentence>
<start>5</start>
<end>8</end>
          7
          <text>machine learning algorithms</text>
</mention>
<mention>
<sentence>2</sentence>
<start>1</star

</mention>
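
The cell above only prints tokens and mentions and stops short of the replacement itself. One possible way to finish the task is sketched below with `xml.etree.ElementTree`. It assumes what the printed mentions suggest: `<sentence>` and `<start>` are 1-based and `<end>` is exclusive. The sample XML is hand-written (not real CoreNLP output), the element paths are assumptions about the file layout, and `resolve_coreferences` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the CoreNLP XML (assumed layout, not real output).
SAMPLE_XML = """<root><document>
<sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>NLP</word></token>
    <token id="2"><word>is</word></token>
    <token id="3"><word>hard</word></token>
    <token id="4"><word>.</word></token>
  </tokens></sentence>
  <sentence id="2"><tokens>
    <token id="1"><word>It</word></token>
    <token id="2"><word>needs</word></token>
    <token id="3"><word>data</word></token>
    <token id="4"><word>.</word></token>
  </tokens></sentence>
</sentences>
<coreference><coreference>
  <mention representative="true"><sentence>1</sentence><start>1</start><end>2</end><head>1</head><text>NLP</text></mention>
  <mention><sentence>2</sentence><start>1</start><end>2</end><head>1</head><text>It</text></mention>
</coreference></coreference>
</document></root>"""

def resolve_coreferences(root):
    """Return the sentences with each non-representative mention
    rewritten as 'representative (mention)'."""
    # words[sentence_id] holds the word strings of that sentence.
    words = {}
    for sentence in root.findall("./document/sentences/sentence"):
        words[int(sentence.get("id"))] = [
            t.findtext("word") for t in sentence.iter("token")]

    # Map (sentence_id, start) to (end, replacement text).
    replacements = {}
    for chain in root.findall("./document/coreference/coreference"):
        rep = chain.find("mention[@representative='true']").findtext("text")
        for mention in chain.findall("mention"):
            if mention.get("representative") == "true":
                continue
            key = (int(mention.findtext("sentence")),
                   int(mention.findtext("start")))
            text = "{} ({})".format(rep, mention.findtext("text"))
            replacements[key] = (int(mention.findtext("end")), text)

    resolved = []
    for sent_id in sorted(words):
        out, i = [], 1  # i is the 1-based token position
        while i <= len(words[sent_id]):
            if (sent_id, i) in replacements:
                end, text = replacements[(sent_id, i)]
                out.append(text)
                i = max(end, i + 1)  # skip the tokens the mention covers
            else:
                out.append(words[sent_id][i - 1])
                i += 1
        resolved.append(" ".join(out))
    return resolved

for line in resolve_coreferences(ET.fromstring(SAMPLE_XML)):
    print(line)
```

Overlapping or nested mentions are resolved by whichever mention starts first at a given position. A fuller solution would need an explicit policy for them.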


## 57. Dependency parsing
Visualize the dependency parse from [Stanford Core NLP](https://stanfordnlp.github.io/CoreNLP/) (collapsed-dependencies) as a directed graph. For the visualization, convert the dependency tree to the [DOT language](http://ja.wikipedia.org/wiki/DOT%E8%A8%80%E8%AA%9E) and render it with [Graphviz](http://www.graphviz.org/). To visualize a directed graph directly from Python, [pydot](https://code.google.com/p/pydot/) is a good choice.
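
No solution is recorded for this exercise. The following is a minimal sketch of the DOT conversion step only, using the standard library. The sample `<dependencies>` block is hand-written under the assumption that it mirrors the CoreNLP XML (a `type` attribute on `<dep>`, an `idx` attribute on `<governor>` and `<dependent>`). `to_dot` and `node` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one <sentence> element (assumed layout).
SAMPLE_XML = """<sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">published</dependent></dep>
    <dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
    <dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
    <dep type="det"><governor idx="4">article</governor><dependent idx="3">an</dependent></dep>
  </dependencies>
</sentence>"""

def node(elem):
    """Return a quoted DOT node id; the index keeps repeated words distinct."""
    return '"{}_{}"'.format(elem.get("idx"), elem.text.replace('"', r'\"'))

def to_dot(sentence):
    """Return DOT source for the collapsed dependencies of one <sentence>."""
    lines = ["digraph sentence {"]
    deps = sentence.find("dependencies[@type='collapsed-dependencies']")
    for dep in deps.findall("dep"):
        # Edges run from the governor (head) to the dependent.
        lines.append('  {} -> {} [label="{}"];'.format(
            node(dep.find("governor")), node(dep.find("dependent")),
            dep.get("type")))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(ET.fromstring(SAMPLE_XML))
print(dot)
```

Writing the returned string to a file such as `sentence.dot` and running `dot -Tpng sentence.dot -o sentence.png` lets Graphviz render it.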

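## 58. Tuple extraction
Based on the dependency parse from [Stanford Core NLP](http://nlp.stanford.edu/software/corenlp.shtml) (collapsed-dependencies), output "subject predicate object" triples in tab-separated format. Use the following definitions of subject, predicate, and object.

- Predicate: a word that has children (dependents) in both an nsubj relation and a dobj relation
- Subject: the child (dependent) in an nsubj relation with the predicate
- Object: the child (dependent) in a dobj relation with the predicate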
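
The three definitions above can be sketched as follows: collect the nsubj and dobj children of each governor, then keep the governors that have both. The sample XML is hand-written under the assumption that it mirrors the CoreNLP `<dependencies>` layout, and `extract_svo` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written stand-in for one <sentence> element (assumed layout).
SAMPLE_XML = """<sentence id="1">
  <dependencies type="collapsed-dependencies">
    <dep type="root"><governor idx="0">ROOT</governor><dependent idx="2">published</dependent></dep>
    <dep type="nsubj"><governor idx="2">published</governor><dependent idx="1">Turing</dependent></dep>
    <dep type="dobj"><governor idx="2">published</governor><dependent idx="4">article</dependent></dep>
    <dep type="det"><governor idx="4">article</governor><dependent idx="3">an</dependent></dep>
  </dependencies>
</sentence>"""

def extract_svo(sentence):
    """Return (subject, predicate, object) triples for one <sentence>."""
    subjects = defaultdict(list)  # governor idx -> nsubj children
    objects = defaultdict(list)   # governor idx -> dobj children
    predicates = {}               # governor idx -> governor word
    deps = sentence.find("dependencies[@type='collapsed-dependencies']")
    for dep in deps.findall("dep"):
        gov, child = dep.find("governor"), dep.find("dependent")
        if dep.get("type") == "nsubj":
            subjects[gov.get("idx")].append(child.text)
            predicates[gov.get("idx")] = gov.text
        elif dep.get("type") == "dobj":
            objects[gov.get("idx")].append(child.text)
    # A predicate is a governor that has both kinds of children.
    return [(s, predicates[idx], o)
            for idx in subjects if idx in objects
            for s in subjects[idx] for o in objects[idx]]

for triple in extract_svo(ET.fromstring(SAMPLE_XML)):
    print("\t".join(triple))
```

Keying on the governor's `idx` rather than its text keeps two occurrences of the same verb in one sentence apart.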

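## 59. S-expression parsing
Read the phrase structure parse (S-expression) from Stanford Core NLP and display every noun phrase (NP) in the text. Nested noun phrases must be displayed as well.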
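
No solution is recorded for this exercise. A minimal sketch of one approach is a stack-based scan over the bracketed parse: each opening bracket pushes a constituent, and each closing bracket pops it and hands its words to the parent. `noun_phrases` is a hypothetical helper, and the parse string is a hand-written example in Penn Treebank style rather than real CoreNLP output.

```python
import re

def noun_phrases(sexpr):
    """Return the words of every NP in a bracketed parse, nested NPs included."""
    tokens = re.findall(r"\(|\)|[^\s()]+", sexpr)
    phrases = []
    stack = []  # one [label, words] entry per open bracket
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token right after "(" is the constituent label.
            stack.append([tokens[i + 1], []])
            i += 2
            continue
        if tok == ")":
            label, words = stack.pop()
            if label == "NP":
                phrases.append(" ".join(words))
            if stack:
                stack[-1][1].extend(words)  # pass the words up to the parent
        else:
            stack[-1][1].append(tok)  # a terminal word
        i += 1
    return phrases

parse = ("(ROOT (S (NP (NP (DT the) (NN history)) (PP (IN of) (NP (NNP NLP))))"
         " (VP (VBZ starts))))")
for phrase in noun_phrases(parse):
    print(phrase)
```

Noun phrases are reported in the order their brackets close, so inner phrases appear before the phrase that contains them.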