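Use CaboCha to run dependency parsing on the text of Natsume Soseki's novel "I Am a Cat" (neko.txt), and save the result to a file named neko.txt.cabocha. Using this file, implement programs that answer the following questions.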

In [1]:
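import requests
import shutil

# Download the text of "I Am a Cat" used throughout this chapter.
r = requests.get("http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt", stream=True)

with open("neko.txt", 'wb') as f:
    # Let urllib3 undo any gzip/deflate content encoding while streaming.
    r.raw.decode_content = True
    shutil.copyfileobj(r.raw, f)

# Run CaboCha in lattice format (-f1) and save the parse.
! cabocha -f1 neko.txt > neko.txt.cabocha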

---

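### 40. Reading the dependency parse results (morphemes)
Implement a class Morph that represents a morpheme. The class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1) as member variables. Then read the CaboCha parse (neko.txt.cabocha), represent each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.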

In [2]:
import re

class Morph:
    """A single morpheme from the CaboCha (MeCab) output."""

    def __init__(self, surface, pos, pos1, base):
        self.surface = surface  # surface form
        self.base = base        # base (dictionary) form
        self.pos = pos          # part of speech
        self.pos1 = pos1        # part-of-speech subcategory 1

    def __str__(self):
        return ("Morph: {\"surface\": \"%s\", \"base\": \"%s\" ,\"pos\": \"%s\", \"pos1\": \"%s\"}" %
                (self.surface, self.base, self.pos, self.pos1))

    def __repr__(self):
        return self.__str__()

def make_morph(word):
    # Fields after splitting: surface, pos, pos1, pos2, pos3,
    # conjugation type, conjugation form, base form, ...
    splitted = re.split('[\t,]', word)
    return Morph(surface=splitted[0], pos=splitted[1], pos1=splitted[2], base=splitted[7])

def make_morph_list(file_path):
    with open(file_path, "rt") as neko:
        # Sentences are delimited by EOS lines; chunk header lines start with "*".
        return [[make_morph(word) for word in line.split('\n')
                 if len(word) >= 4 and not re.match(r'^\*.*', word)]
            for line in re.split(r'EOS\n', neko.read())
                if len(line) != 0]

print(make_morph_list("neko.txt.cabocha")[2])

[Morph: {"surface": "名前", "base": "名前" ,"pos": "名詞", "pos1": "一般"}, Morph: {"surface": "は", "base": "は" ,"pos": "助詞", "pos1": "係助詞"}, Morph: {"surface": "まだ", "base": "まだ" ,"pos": "副詞", "pos1": "助詞類接続"}, Morph: {"surface": "無い", "base": "無い" ,"pos": "形容詞", "pos1": "自立"}, Morph: {"surface": "。", "base": "。" ,"pos": "記号", "pos1": "句点"}]
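
Splitting a line on every tab and comma misaligns the fields when the surface form is itself a comma. A minimal sketch of a more defensive split, assuming the MeCab IPA-dictionary feature order (pos, pos1, pos2, pos3, conjugation type, conjugation form, base form, ...). The names `MorphFields` and `parse_morph_line` and the sample lines are illustrative and not part of the notebook:

```python
from typing import NamedTuple


class MorphFields(NamedTuple):
    surface: str
    base: str
    pos: str
    pos1: str


def parse_morph_line(line: str) -> MorphFields:
    # Split off the surface at the first tab only, so a surface such as ","
    # cannot shift the feature columns.
    surface, features = line.rstrip("\n").split("\t", 1)
    fields = features.split(",")
    # Assumed feature order: pos, pos1, pos2, pos3, conjugation type,
    # conjugation form, base form, reading, pronunciation.
    base = fields[6] if len(fields) > 6 else surface
    return MorphFields(surface, base, fields[0], fields[1])


# Hand-written sample lines in the assumed format.
print(parse_morph_line("名前\t名詞,一般,*,*,*,*,名前,ナマエ,ナマエ"))
print(parse_morph_line(",\t名詞,サ変接続,*,*,*,*,*"))
```

Splitting on the first tab before touching the commas keeps the surface intact whatever characters it contains.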


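### 41. Reading the dependency parse results (chunks and dependencies)
In addition to 40, implement a class Chunk that represents a chunk (bunsetsu). The class has a list of morphemes (Morph objects) (morphs), the index of the destination chunk (dst), and a list of the indices of the source chunks (srcs) as member variables. Then read the CaboCha parse of the input text, represent each sentence as a list of Chunk objects, and display the text and the destination of each chunk in the eighth sentence. Use the program written here for the remaining problems in Chapter 5.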

In [93]:
from collections import defaultdict
import pprint

class Chunk:
    """A chunk (bunsetsu): its morphemes, destination index, and source indices."""

    def __init__(self, morphs, dst, srcs):
        self.morphs = morphs
        self.dst = dst
        self.srcs = srcs

    def __str__(self):
        return ("Chunk: {\"morphs\": %s, \"dst\": %s, \"srcs\": %s}" %
                (self.morphs, self.dst, self.srcs))

    def __repr__(self):
        return self.__str__()

def list_sentences(string):
    return re.split(r'EOS\n', string)

def list_chunks(sentence):
    # A chunk is a header line ("* <id> <dst>D ...") followed by its morpheme lines.
    pattern = re.compile(r'^(\*.*)$((\n([^*].*))+)', re.MULTILINE)
    chunk_list = []
    # Maps a chunk index to the indices of the chunks that depend on it.
    dst_src_dict = defaultdict(list)

    def filter_morphs(morphs):
        return [m for m in morphs.split('\n') if not len(m) == 0 and m != 'EOS']

    def make_morph(word):
        # Fields after splitting: surface, pos, pos1, ..., base form at index 7.
        splitted = re.split('[\t,]', word)
        return Morph(surface=splitted[0], pos=splitted[1], pos1=splitted[2], base=splitted[7])

    def make_chunk(phrase, morph_list):
        idnt, dst = map(int, re.match(r'^\*\s(\d+)\s(-?\d+)D', phrase).groups())
        # srcs is the same list object that dst_src_dict holds, so every index
        # appended to it is visible through chunk.srcs.
        srcs = dst_src_dict[idnt]
        morphs = [make_morph(x) for x in morph_list]
        return (Chunk(morphs, dst, srcs), idnt, dst)

    for phrase, morphs, _, _ in [phrase.groups() for phrase in re.finditer(pattern, sentence)]:
        chunk, idnt, dst = make_chunk(phrase, filter_morphs(morphs))
        dst_src_dict[dst].append(idnt)
        chunk_list.append(chunk)

    return chunk_list

with open("neko.txt.cabocha", "rt") as neko:
    pp = pprint.PrettyPrinter(indent=4)
    # Empty sentences are kept in the list, so index 8 is its ninth entry.
    pp.pprint([list_chunks(sentence) for sentence in list_sentences(neko.read())][8])

[   Chunk: {"morphs": [Morph: {"surface": "しかも", "base": "しかも" ,"pos": "接続詞", "pos1": "*"}], "dst": 8, "srcs": []},
    Chunk: {"morphs": [Morph: {"surface": "あと", "base": "あと" ,"pos": "名詞", "pos1": "一般"}, Morph: {"surface": "で", "base": "で" ,"pos": "助詞", "pos1": "格助詞"}], "dst": 2, "srcs": []},
    Chunk: {"morphs": [Morph: {"surface": "聞く", "base": "聞く" ,"pos": "動詞", "pos1": "自立"}, Morph: {"surface": "と", "base": "と" ,"pos": "助詞", "pos1": "接続助詞"}], "dst": 8, "srcs": [1]},
    Chunk: {"morphs": [Morph: {"surface": "それ", "base": "それ" ,"pos": "名詞", "pos1": "代名詞"}, Morph: {"surface": "は", "base": "は" ,"pos": "助詞", "pos1": "係助詞"}], "dst": 8, "srcs": []},
    Chunk: {"morphs": [Morph: {"surface": "書生", "base": "書生" ,"pos": "名詞", "pos1": "一般"}, Morph: {"surface": "という", "base": "という" ,"pos": "助詞", "pos1": "格助詞"}], "dst": 5, "srcs": []},
    Chunk: {"morphs": [Morph: {"surface": "人間", "base": "人間" ,"pos": "名詞", "pos1": "一般"}, Morph: {"surface": "中", "base": "中" ,"pos": "名詞", "pos1": "接尾"}, Morph: 
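
In CaboCha's lattice output (`-f1`), each chunk starts with a header line of the form `* <chunk id> <dst>D <head>/<func> <score>`, and `dst` is `-1` for the root chunk. The following sketch shows how `srcs` can be derived from the `dst` values alone. The header strings and their scores are placeholders typed in for illustration, and `parse_header` and `build_srcs` are hypothetical helper names:

```python
import re
from collections import defaultdict

HEADER = re.compile(r"^\* (\d+) (-?\d+)D")


def parse_header(line):
    """Return (chunk_id, dst) from a chunk header line."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("not a chunk header: %r" % line)
    return int(match.group(1)), int(match.group(2))


def build_srcs(dsts):
    """Map each chunk index to the indices of the chunks that depend on it."""
    srcs = defaultdict(list)
    for idx, dst in enumerate(dsts):
        if dst != -1:  # the root chunk has no destination
            srcs[dst].append(idx)
    return [srcs[i] for i in range(len(dsts))]


# Illustrative headers (scores are placeholders, not real CaboCha output).
headers = ["* 0 2D 0/1 0.0", "* 1 2D 0/1 0.0", "* 2 -1D 0/0 0.0"]
dsts = [parse_header(h)[1] for h in headers]
print(build_srcs(dsts))  # [[], [], [0, 1]]
```

Computing `srcs` in a separate pass over the `dst` values makes the result independent of the order in which chunks are created.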

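### 42. Displaying source and destination chunks
Extract the text of every source chunk and its destination chunk in tab-separated format. Do not output symbols such as punctuation marks.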

In [105]:
with open("neko.txt.cabocha", "rt") as neko:
    # Only the sentence at index 8 is processed here.
    sentences = [list_chunks(sentence) for sentence in list_sentences(neko.read())[8:9]]
    # Pair every morpheme of each source chunk with the text of its destination chunk.
    print([(srcmorph.surface, "".join(m.surface for m in chunk.morphs))
           for sentence in sentences
           for chunk in sentence
           for srcidx in chunk.srcs
           for srcmorph in sentence[srcidx].morphs
    ])

[('あと', '聞くと'), ('で', '聞くと'), ('書生', '人間中で'), ('という', '人間中で'), ('一番', '獰悪な'), ('しかも', '種族であったそうだ。'), ('聞く', '種族であったそうだ。'), ('と', '種族であったそうだ。'), ('それ', '種族であったそうだ。'), ('は', '種族であったそうだ。'), ('人間', '種族であったそうだ。'), ('中', '種族であったそうだ。'), ('で', '種族であったそうだ。'), ('獰悪', '種族であったそうだ。'), ('な', '種族であったそうだ。')]
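
The task asks for chunk-level text with symbols removed, one tab-separated pair per line. A minimal sketch on hand-built data: the POS tags follow the output shown for the third sentence in section 40, while the chunk boundaries and `dst` values are assumptions made for this illustration. `chunk_text` and `dependency_pairs` are hypothetical helpers, not the notebook's code:

```python
# Each chunk: (list of (surface, pos) pairs, dst index); -1 marks the root.
# Chunk boundaries and dst values are assumed for illustration.
chunks = [
    ([("名前", "名詞"), ("は", "助詞")], 2),
    ([("まだ", "副詞")], 2),
    ([("無い", "形容詞"), ("。", "記号")], -1),
]


def chunk_text(morphs):
    """Concatenate surfaces, skipping morphemes tagged as symbols."""
    return "".join(surface for surface, pos in morphs if pos != "記号")


def dependency_pairs(chunks):
    """Yield 'source<TAB>destination' lines for every non-root chunk."""
    for morphs, dst in chunks:
        if dst == -1:
            continue
        src_text = chunk_text(morphs)
        dst_text = chunk_text(chunks[dst][0])
        if src_text and dst_text:  # skip chunks that were only symbols
            yield src_text + "\t" + dst_text


for line in dependency_pairs(chunks):
    print(line)
```

Filtering at the morpheme level, before joining, keeps a chunk's text while dropping only its punctuation.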
