### Relationship between interpretive diversity and other variables, including topic SND and abstractness

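This notebook documents how we examined the following condition, reported in the paper, under which the negative correlation between interpretive diversity and the SND of the topic is strongest:
1. Number of neighboring words: 3000
2. Dimensionality of the language model: 200
3. Type of language model: word2vec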

Loading the libraries

In [1]:
import numpy as np
import pandas as pd
import tqdm

import gensim
from gensim.models import KeyedVectors



Loading the metaphors and the interpretive diversity data

In [2]:
# Define the list of metaphors directly in the notebook
_metaphor_lst = ["皮肉は短刀だ",
"夕日は銅貨だ",
"研究は登山だ",
"蝶は踊り子だ",
"時間は洪水だ",
"暴動は嵐だ",
"批判はメスだ",
"煙草は時限爆弾だ",
"真珠は水滴だ",
"つららは水晶細工だ",
"雨雲は薄墨だ",
"建物は角砂糖だ",
"衝撃は電気だ",
"学校は工場だ",
"子猫は王女だ",
"楽譜は暗号だ",
"煙突は石碑だ",
"空は鏡だ",
"子供は天使だ",
"柿はちょうちんだ",
"デモは雪崩だ",
"疑惑は腫瘍だ",
"教育は階段だ",
"微風は吐息だ",
"山は王様だ",
"仕事は牢獄だ",
"愛は季節だ",
"利息は年貢だ",
"椿はひとだまだ",
"審判は天秤だ",
"雑巾は亀だ",
"不安は濃霧だ",
"理論は建物だ",
"希望は灯火だ",
"運命は通り魔だ",
"沈黙は海底だ",
"霧はベールだ",
"激怒は噴火だ",
"眼は湖だ",
"唇は蛭だ",
"笑顔は花だ",
"心臓は時計だ",
"麦畑は海だ",
"香水は花束だ",
"礼儀は檻だ"]

In [3]:
# Interpretive diversity data
_id_df = pd.read_csv("../data/MetaphorSimileDiversity.csv")
_id_df.QID = _id_df.QID.astype(str).str.zfill(2)  # convert to zero-padded strings
_id_df["Questions"] = _id_df["Presentation"] + _id_df["QID"]

# Build the Metaphor column from QID
qid2id = dict()
for i, qid in enumerate(_id_df.QID.unique()):  # `i` avoids shadowing the built-in `id`
    qid2id[qid] = i
f_qid2metaphor = lambda x: _metaphor_lst[qid2id[x]]

# Add the metaphor, topic, and vehicle columns
_id_df['Metaphor'] = _id_df.QID.map(f_qid2metaphor)
_id_df["Topic"] = _id_df["Metaphor"].str.split('(.*)は(.*)だ', expand=True)[1]
_id_df["Vehicle"] = _id_df["Metaphor"].str.split('(.*)は(.*)だ', expand=True)[2]
_id_df.head()

Unnamed: 0,QID,Presentation,Diversity,Questions,Metaphor,Topic,Vehicle
0,1,TV,3.762489,TV01,皮肉は短刀だ,皮肉,短刀
1,1,S,3.441501,S01,皮肉は短刀だ,皮肉,短刀
2,2,TV,3.481136,TV02,夕日は銅貨だ,夕日,銅貨
3,2,S,3.72741,S02,夕日は銅貨だ,夕日,銅貨
4,3,TV,4.031975,TV03,研究は登山だ,研究,登山
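
As a side note, the topic and vehicle could also be pulled out in a single pass with `Series.str.extract`, which returns one column per capture group. The following is only a sketch of that alternative, not the code used for the reported results; the variable names `metaphors` and `parts` are made up for illustration, and the three strings are taken from `_metaphor_lst`.

```python
import pandas as pd

# Three metaphors taken from _metaphor_lst above
metaphors = pd.Series(["皮肉は短刀だ", "椿はひとだまだ", "つららは水晶細工だ"])

# Named capture groups become the column names of the resulting DataFrame
parts = metaphors.str.extract(r"^(?P<Topic>.*)は(?P<Vehicle>.*)だ$")
print(parts)
```

Anchoring the pattern with `^` and `$` makes the trailing `だ` of the copula the one that is stripped, so a vehicle such as `ひとだま` keeps its internal `だ`.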


Loading AWD and the metaphor-related measures of Nakamoto & Kusumi (2004)

In [4]:
_dat_awd_nakamoto = pd.read_csv("../data/SubsetAwdNakamoto.csv")

# Columns to copy over from the Nakamoto & Kusumi (2004) / AWD data
_rating_cols = ["Comprehensive", "Similarity", "Creative", "Funniness", "TopicAWD", "VehicleAWD"]
for _col in _rating_cols:
    _id_df[_col] = 0.0  # float placeholder so the ratings keep a float dtype

for idx, row in _id_df.iterrows():
    # Topic, vehicle, and presentation type used as the lookup key
    tmp_snd_topic = row.Topic
    tmp_snd_vehicle = row.Vehicle
    tmp_presentation = row.Presentation
    
    # Look up the Nakamoto data by topic, vehicle, and presentation
    search_result = _dat_awd_nakamoto.query('Topic == @tmp_snd_topic and Vehicle == @tmp_snd_vehicle and Presentation == @tmp_presentation')
    
    # Write the first match back with .loc; the earlier chained form
    # (_id_df.Comprehensive.iloc[...] = ...) triggered SettingWithCopyWarning
    if not search_result.empty:
        _id_df.loc[idx, _rating_cols] = search_result[_rating_cols].iloc[0].to_numpy()


In [5]:
_dat_awd_nakamoto.head()

Unnamed: 0,QID,Presentation,Diversity,Questions,Metaphor,Topic,Vehicle,Comprehensive,Similarity,Creative,Funniness,TopicAWD,VehicleAWD
0,1,TV,3.762489,TV01,皮肉は短刀だ,皮肉,短刀,5.22,3.12,4.89,4.51,2.74,1.71
1,1,S,3.441501,S01,皮肉は短刀だ,皮肉,短刀,5.22,3.12,4.89,4.51,2.74,1.71
2,2,TV,3.481136,TV02,夕日は銅貨だ,夕日,銅貨,4.33,2.58,4.67,3.89,1.66,1.88
3,2,S,3.72741,S02,夕日は銅貨だ,夕日,銅貨,4.33,2.58,4.67,3.89,1.66,1.88
4,3,TV,4.031975,TV03,研究は登山だ,研究,登山,6.51,4.35,4.56,4.85,2.41,1.78
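
The row-by-row lookup on `Topic`, `Vehicle`, and `Presentation` is in effect a left join, so it could also be written as a single `DataFrame.merge`. The sketch below illustrates that idea on two hypothetical miniature frames (`left` and `right` stand in for `_id_df` and `_dat_awd_nakamoto`; only two rating columns are shown). It is an alternative formulation under those assumptions, not the procedure used in this notebook.

```python
import pandas as pd

# Hypothetical miniature stand-ins for _id_df and _dat_awd_nakamoto
left = pd.DataFrame({
    "Topic": ["皮肉", "皮肉", "夕日"],
    "Vehicle": ["短刀", "短刀", "銅貨"],
    "Presentation": ["TV", "S", "TV"],
})
right = pd.DataFrame({
    "Topic": ["皮肉", "皮肉"],
    "Vehicle": ["短刀", "短刀"],
    "Presentation": ["TV", "S"],
    "Comprehensive": [5.22, 5.22],
    "TopicAWD": [2.74, 2.74],
})

keys = ["Topic", "Vehicle", "Presentation"]
# how="left" keeps every row of `left`; rows without a match receive NaN.
# validate="many_to_one" raises if a key combination is duplicated in `right`.
merged = left.merge(right, on=keys, how="left", validate="many_to_one")
print(merged)
```

A merge leaves unmatched rows as `NaN` rather than as a numeric placeholder, which makes missing ratings easy to spot before any correlation is computed.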


Preparing the function that computes SND

In [6]:
def word2snd(word, model, nn=3000):
    """
    Compute the SND of a word.
    
    Parameters
    ----------
    word: str
        Target (center) word
        
    model : gensim.models.keyedvectors.KeyedVectors
        Language model that provides the word vectors (word2vec-like)
        
    nn : int
        Number of words most similar to `word` that are used to compute SND

    Returns
    -------
    snd : numpy.float64
        SND (mean similarity between `word` and its nn nearest neighbors;
        similarity is cosine similarity)
    """
    nn_most_similars = model.most_similar(word, topn=nn)
    # most_similar returns (word, similarity) pairs; average the similarities
    snd = np.mean([similarity for _, similarity in nn_most_similars])
    
    return snd
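
To make the quantity concrete, the sketch below recomputes the same kind of value by hand with NumPy: the mean cosine similarity between a center vector and its `nn` most similar neighbors. The function name `snd_from_vectors` and the 2-dimensional vectors are hypothetical and chosen only so that the similarities can be checked by hand; the notebook itself relies on gensim's `most_similar`.

```python
import numpy as np

def snd_from_vectors(center, others, nn):
    """Mean cosine similarity between `center` and its `nn` most similar rows of `others`."""
    center = center / np.linalg.norm(center)
    others = others / np.linalg.norm(others, axis=1, keepdims=True)
    sims = others @ center            # cosine similarity to every candidate
    top = np.sort(sims)[::-1][:nn]    # the nn largest similarities
    return float(top.mean())

# Hypothetical 2-d vectors (not taken from any trained model)
center = np.array([1.0, 0.0])
others = np.array([
    [4.0, 3.0],    # cos = 0.8
    [3.0, 4.0],    # cos = 0.6
    [0.0, 1.0],    # cos = 0.0
    [-1.0, 0.0],   # cos = -1.0
])
snd = snd_from_vectors(center, others, nn=2)
print(snd)  # mean of 0.8 and 0.6
```

A word whose nearest neighbors are all very close to it gets a high value, while a word whose neighbors are spread out gets a lower one.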

Appending the computed SND values to `_id_df`

In [10]:
import tqdm

snd_lst = [3000]  # numbers of neighboring words
dim_lst = [200]  # dimensionalities of the language model

# 200 dimensions
model_dir_lst = ['../model/entity_vectors_200d.txt']
model_dir2model_name = {'../model/entity_vectors_200d.txt': 'w2v200'}

# For debugging
# model_dir_lst = ['/Users/Ryunosuke/Desktop/Programming_local/entity_vector/entity_vector.model_200.bin']               
# model_dir2model_name = {'/Users/Ryunosuke/Desktop/Programming_local/entity_vector/entity_vector.model_200.bin': 'w2v200'}
# binary = True

# Loop over models
for _model_dir in model_dir_lst:
    # Load the model
    _model = KeyedVectors.load_word2vec_format(_model_dir, binary=False)    
    _model_name = model_dir2model_name[_model_dir]
    
    # Loop over dimensionalities
    for _dim in dim_lst:
        # Loop over neighborhood sizes
        for _snd in snd_lst:
            print(f'model: {_model_name}, dim: {_dim}, snd: {_snd}')
            
            _topic_col = f"{_model_name}_dim{_dim}_snd{_snd}_topic"
            _vehicle_col = f"{_model_name}_dim{_dim}_snd{_snd}_vehicle"
            _id_df[_topic_col] = 0.0
            _id_df[_vehicle_col] = 0.0
            
            for idx, row in tqdm.tqdm(_id_df.iterrows()):
                try:
                    topic_snd = word2snd(word=row.Topic, model=_model, nn=_snd)
                    vehicle_snd = word2snd(word=row.Vehicle, model=_model, nn=_snd)
                    _id_df.loc[idx, _topic_col] = topic_snd
                    _id_df.loc[idx, _vehicle_col] = vehicle_snd
                except KeyError:
                    # gensim raises KeyError for out-of-vocabulary words;
                    # such rows keep the 0.0 placeholder in both columns
                    print("OOV!: ", row.Topic, row.Vehicle)

model: w2v200, dim: 200, snd: 3000


20it [00:02, 11.95it/s]

OOV!:  つらら 水晶細工
OOV!:  つらら 水晶細工


58it [00:06, 12.54it/s]

OOV!:  椿 ひとだま
OOV!:  椿 ひとだま


90it [00:09,  9.27it/s]


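Description of the columns of `_id_df`
- QID: number of the figurative expression (each QID is mapped to an entry of `_metaphor_lst`, in order, in the loading step)
- Presentation: type of figurative expression
- Questions: expression number prefixed with TV or S (TV: metaphor, S: simile)
- Metaphor: the expression in metaphor form
- Topic: topic of the figurative expression
- Vehicle: vehicle of the figurative expression
- Comprehensive: comprehensibility, from Nakamoto & Kusumi (2004)
- Similarity: similarity between the constituent words, from Nakamoto & Kusumi (2004)
- Creative: originality, from Nakamoto & Kusumi (2004)
- Funniness: interestingness, from Nakamoto & Kusumi (2004)
- TopicAWD: abstractness of the topic in AWD-JEX
- VehicleAWD: abstractness of the vehicle in AWD-JEX
- w2v200_dim200_snd3000_topic: SND of the topic
- w2v200_dim200_snd3000_vehicle: SND of the vehicle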

In [11]:
# Inspect the data after merging
_id_df.head()

Unnamed: 0,QID,Presentation,Diversity,Questions,Metaphor,Topic,Vehicle,Comprehensive,Similarity,Creative,Funniness,TopicAWD,VehicleAWD,w2v200_dim200_snd3000_topic,w2v200_dim200_snd3000_vehicle
0,1,TV,3.762489,TV01,皮肉は短刀だ,皮肉,短刀,5.22,3.12,4.89,4.51,2.74,1.71,0.399821,0.507033
1,1,S,3.441501,S01,皮肉は短刀だ,皮肉,短刀,5.22,3.12,4.89,4.51,2.74,1.71,0.399821,0.507033
2,2,TV,3.481136,TV02,夕日は銅貨だ,夕日,銅貨,4.33,2.58,4.67,3.89,1.66,1.88,0.457839,0.395922
3,2,S,3.72741,S02,夕日は銅貨だ,夕日,銅貨,4.33,2.58,4.67,3.89,1.66,1.88,0.457839,0.395922
4,3,TV,4.031975,TV03,研究は登山だ,研究,登山,6.51,4.35,4.56,4.85,2.41,1.78,0.429301,0.377516


In [12]:
# Export the data
_id_df.to_csv("../result/DatForCorrelation.csv", header=True, index=False, encoding="utf-8")
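
The exported file is intended for correlation analysis. Because the SND columns are initialized to 0.0 and out-of-vocabulary rows are never overwritten, those rows still hold 0.0 in the export. The sketch below shows one possible way to mask such placeholders before computing a Pearson correlation. The column name `topic_snd` and all values are hypothetical and are not the study's data; this is an illustration, not the analysis reported in the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not the study's data
df = pd.DataFrame({
    "Diversity": [3.0, 3.5, 4.0, 4.5, 3.2],
    "topic_snd": [0.50, 0.45, 0.40, 0.35, 0.0],  # 0.0 marks an OOV placeholder
})

# Treat the 0.0 placeholder as missing so it does not enter the correlation
snd = df["topic_snd"].replace(0.0, np.nan)

# Series.corr excludes pairs that contain NaN before computing Pearson's r
r = df["Diversity"].corr(snd)
print(r)
```

In this toy data the four remaining points lie on a straight line with negative slope, so the masked correlation is at the negative extreme; leaving the 0.0 row in would give a different value.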