# [Kozlowski, Taddy, & Evans (2019)](https://journals.sagepub.com/doi/full/10.1177/0003122419877135)


## Building dimensions for understanding complex concepts with word-vector arithmetic

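The measurement of "cultural dimensions" exploits the ability of word embeddings to solve analogy problems.

A cultural dimension can be measured through vector arithmetic on pairs of words that reflect a cultural concept.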

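For example, the vector $\vec{male}-\vec{female}$ expresses the concept of gender. Likewise, a word pair such as $\vec{king}-\vec{queen}$ can also be taken to reflect gender.

Similarly, word pairs such as $\vec{rich}-\vec{poor}$ and $\vec{affluence}-\vec{poverty}$ can be taken to reflect the concept of affluence.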

In [5]:
#!pip install --upgrade gensim

In [6]:
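import gensim.downloader
# Pretrained Google News word2vec vectors (300 dimensions); downloaded on first use, so this can take a while
model = gensim.downloader.load('word2vec-google-news-300')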

In [7]:
rich_list=["rich","richer","affluence","luxury"]
poor_list=["poor","poorer","poverty","cheap"]

In [8]:
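import numpy as np

# Difference vector (rich word minus poor word) for each pair
affluence_vec=[]
for i,j in zip(rich_list,poor_list):
    affluence_vec.append(model[i]-model[j])
# Average the pair differences into a single "affluence" dimension vector
affluence_vec=np.mean(np.array(affluence_vec),axis=0)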
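The averaging step above can be wrapped in a small helper. The sketch below is not part of the notebook: `build_dimension` is a hypothetical name, skipping pairs with out-of-vocabulary words is an added convenience, and the optional rescaling of each difference vector to unit length before averaging is an assumption rather than something the cells above do. It uses invented 3-dimensional toy vectors in a plain dict so that it runs without downloading the model; a gensim `KeyedVectors` object supports the same `word in vectors` and `vectors[word]` operations.

```python
import numpy as np

def build_dimension(vectors, pairs, normalize=False):
    """Average the difference vectors of (positive, negative) word pairs.

    `vectors` is any mapping from word to 1-D array (a dict here).
    Pairs containing a word missing from `vectors` are skipped.
    """
    diffs = []
    for pos, neg in pairs:
        if pos not in vectors or neg not in vectors:
            continue  # skip out-of-vocabulary pairs instead of raising KeyError
        diff = np.asarray(vectors[pos], dtype=float) - np.asarray(vectors[neg], dtype=float)
        if normalize:
            norm = np.linalg.norm(diff)
            if norm == 0:
                continue  # identical vectors carry no direction
            diff = diff / norm  # assumption: give every pair equal weight
        diffs.append(diff)
    if not diffs:
        raise ValueError("no usable word pairs")
    return np.mean(diffs, axis=0)

# Toy vectors, invented purely for illustration
toy_vectors = {
    "rich": np.array([2.0, 1.0, 0.0]),
    "poor": np.array([0.0, 1.0, 0.0]),
    "luxury": np.array([1.0, 0.0, 1.0]),
    "cheap": np.array([0.0, 0.0, 1.0]),
}
# The last pair is deliberately missing from toy_vectors and gets skipped
toy_pairs = [("rich", "poor"), ("luxury", "cheap"), ("affluence", "poverty")]

toy_dim = build_dimension(toy_vectors, toy_pairs)
print(toy_dim)
```

With the real model, the same call would look like `build_dimension(model, list(zip(rich_list, poor_list)))`.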

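How a word is interpreted along a cultural dimension can be assessed by computing the angle between the word's vector $V$ and the cultural dimension vector $D$.

An angle below 90° means the word leans toward the positive pole of the dimension (here, "rich"), an angle above 90° means it leans toward the negative pole ("poor"), and an angle near 90° means the word has little association with the dimension. In this way, the cultural connotations and semantic nuances of words can be analyzed quantitatively.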


$$\cos(\theta)=\frac{D \cdot V}{|D||V|}$$
$$\theta = \arccos\left(\frac{D \cdot V}{|D||V|}\right)$$
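As a quick numeric check of the two formulas, the sketch below evaluates them on invented 2-dimensional vectors and then converts one cosine value reported in this notebook (0.10311404, for "tennis" against the affluence dimension) into degrees. The helper name `cosine_and_angle` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_and_angle(d, v):
    """Return (cosine, angle in degrees) between vectors d and v."""
    d = np.asarray(d, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_theta = np.dot(d, v) / (np.linalg.norm(d) * np.linalg.norm(v))
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # guard against rounding error
    return cos_theta, np.degrees(np.arccos(cos_theta))

# Toy 2-D vectors, invented for illustration; the dimension is the x-axis
cos_same, angle_same = cosine_and_angle([1, 0], [1, 1])   # leans toward the positive pole
cos_orth, angle_orth = cosine_and_angle([1, 0], [0, 2])   # unrelated to the dimension
cos_opp, angle_opp = cosine_and_angle([1, 0], [-1, 1])    # leans toward the negative pole
print(angle_same, angle_orth, angle_opp)

# Converting the cosine reported for "tennis" in this notebook into degrees
tennis_angle = np.degrees(np.arccos(0.10311404))
print(tennis_angle)
```

The three toy angles come out at roughly 45°, 90°, and 135°, and the tennis cosine corresponds to an angle of about 84.08°, which agrees with the value printed by the notebook.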

In [10]:
def get_consine(vector, dimension):
    """
    Calculate the cosine of the angle between the vector and the given dimension
    """
    v_dot_d = np.dot(vector, dimension)
    v_d = np.linalg.norm(vector) * np.linalg.norm(dimension)  # product of the two vector lengths
    return v_dot_d / v_d

In [11]:
get_consine(model["tennis"],affluence_vec)

0.10311404

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
# Cross-check with scikit-learn; cosine_similarity expects 2-D arrays, hence reshape(1,-1)
cosine_similarity(model["tennis"].reshape(1,-1),affluence_vec.reshape(1,-1))

array([[0.10311404]], dtype=float32)

In [13]:
def get_angle(vector, dimension,degree=False):
    """
    Calculate the angle between the vector and the given dimension.
    Returns radians by default, or degrees if degree=True.
    """
    c = get_consine(vector, dimension)
    # Clip to [-1, 1] so floating-point rounding cannot push arccos out of its domain
    angle = np.arccos(np.clip(c, -1, 1))
    return np.degrees(angle) if degree else angle

In [14]:
sports=["tennis","soccer","basketball","boxing","golf","swimming","volleyball","camping","weightlifting","hiking","hockey"]

In [15]:
for sport in sports:
    print(sport,get_angle(model[sport],affluence_vec,degree=True))

tennis 84.08148088988168
soccer 86.44827879084316
basketball 87.49476268241891
boxing 96.19771940639099
golf 81.23037992187086
swimming 87.66950350249788
volleyball 84.87990835557244
camping 92.6404688343294
weightlifting 92.84652219656259
hiking 89.06679353599412
hockey 88.23169120649816
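In the output above, golf has the smallest angle to the affluence dimension and boxing the largest. To make such comparisons easier to read, the words can be sorted by angle. The following is a minimal sketch with invented 2-dimensional vectors and placeholder word names; `angle_to_dimension` and `rank_along_dimension` are hypothetical helpers, not functions from the notebook or from gensim.

```python
import numpy as np

def angle_to_dimension(vector, dimension):
    """Angle in degrees between a word vector and a dimension vector."""
    vector = np.asarray(vector, dtype=float)
    dimension = np.asarray(dimension, dtype=float)
    cos_theta = np.dot(vector, dimension) / (
        np.linalg.norm(vector) * np.linalg.norm(dimension)
    )
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def rank_along_dimension(vectors, dimension):
    """Return (word, angle) pairs sorted from smallest to largest angle."""
    scored = [(word, angle_to_dimension(vec, dimension)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1])

# Toy data, invented for illustration; the dimension is the x-axis
toy_dimension = [1.0, 0.0]
toy_words = {
    "word_b": [0.0, 2.0],   # perpendicular to the dimension
    "word_c": [-2.0, 1.0],  # points away from the positive pole
    "word_a": [3.0, 1.0],   # points toward the positive pole
}

ranking = rank_along_dimension(toy_words, toy_dimension)
for word, angle in ranking:
    print(f"{word}: {angle:.2f} degrees")
```

With the real model, the input mapping could be built as `{s: model[s] for s in sports}` and ranked against `affluence_vec`.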


## Understanding relationships between dimensions with word-vector arithmetic

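The concept of "class" is made up of multiple dimensions, and its composition keeps changing over time.

By constructing cultural dimensions, the "meaning" of each component of class can be measured quantitatively.

For example, the relationship between affluence and the other components helps reveal the structure of meaning behind the concept of class.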

In [16]:
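def create_vector(word_pair):
    """
    Build a dimension vector by averaging the difference vectors
    of (positive word, negative word) pairs
    """
    vec=[]
    for i in word_pair:
        vec.append(model[i[0]]-model[i[1]])  # positive pole minus negative pole
    vec=np.array(vec)
    vec=np.mean(vec,axis=0)
    return vec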

In [17]:
education_pair=[("educated","uneducated"),("learned","unlearned"),("taught","untaught"),
                ("schooled","unschooled"),("trained","untrained"),("lettered","unlettered"),
                ("tutored","untutored"),("literate","illiterate")]

In [18]:
education_vec=create_vector(education_pair)

In [19]:
gender_pair=[("man","woman"),("men","women"),("he","she"),("him","her"),
             ("his","her"),("boy","girl"),("male","female"),("masculine","feminine")]

In [20]:
gender_vec=create_vector(gender_pair)

In [21]:
cosine_similarity(gender_vec.reshape(1,-1),affluence_vec.reshape(1,-1))

array([[-0.04156307]], dtype=float32)

In [22]:
cosine_similarity(education_vec.reshape(1,-1),affluence_vec.reshape(1,-1))

array([[0.20604998]], dtype=float32)
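In the two results above, the affluence dimension is nearly orthogonal to the gender dimension (cosine about -0.04) and moderately aligned with the education dimension (cosine about 0.21). When more than two dimensions are compared, it can be convenient to compute all pairwise cosines at once. The sketch below does this with NumPy on invented 3-dimensional toy vectors; `pairwise_cosine` is a hypothetical helper and the toy values are not derived from the model.

```python
import numpy as np

def pairwise_cosine(dimensions):
    """Return (names, matrix) where matrix[i, j] is the cosine between dimensions i and j."""
    names = list(dimensions)
    mat = np.array([np.asarray(dimensions[name], dtype=float) for name in names])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # scale each row to unit length
    return names, unit @ unit.T

# Toy dimension vectors, invented for illustration
toy_dimensions = {
    "affluence": [1.0, 0.0, 0.0],
    "education": [1.0, 1.0, 0.0],
    "gender": [0.0, 0.0, 1.0],
}

names, sim = pairwise_cosine(toy_dimensions)
for name, row in zip(names, sim):
    print(name, np.round(row, 3))
```

With the notebook's vectors, the input would be `{"affluence": affluence_vec, "education": education_vec, "gender": gender_vec}`.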