# Assignment 1119

- Train a Word2Vec model on the Japanese corpus data (```dokujo-tsushin.txt```).
- Check which words are most similar to the distributed representation of the word 「男」 ("man").

In [3]:
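from gensim.models import Word2Vec
import MeCab

# Build the tagger once; "-Owakati" makes MeCab output space-separated tokens.
mecab = MeCab.Tagger("-Owakati")

def tokenize(text):
    """Run morphological analysis on the text and return a list of tokens."""
    return mecab.parse(text).strip().split()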

In [4]:
with open('./Data/dokujo-tsushin.txt', 'r', encoding='utf-8') as file:
    corpus = [line.strip() for line in file if line.strip()]  # skip blank lines

In [5]:
tokenized_documents = [tokenize(doc) for doc in corpus]

In [6]:
model_jp = Word2Vec(sentences=tokenized_documents, vector_size=100, window=5, min_count=1, workers=4)

In [7]:
model_jp.wv['日本']

array([ 0.05526981,  0.29628485,  0.22448467, -0.18924157,  0.07167253,
       -0.79247266,  0.18809025,  0.6216907 , -0.5830733 , -0.37403762,
       -0.62924904, -0.30915678,  0.8130054 ,  0.09221639, -0.5097909 ,
       -0.5612721 ,  0.17909306, -1.1302615 , -0.7696172 , -1.0021309 ,
       -0.08317469,  0.58873475,  0.16525558, -0.60036147,  0.7626564 ,
        0.21417902, -0.14295635,  0.3433072 , -1.4546875 , -0.05863817,
       -0.22373101, -0.1569698 ,  0.5722233 , -0.34782213, -0.8870419 ,
        0.94012415, -0.0236854 , -0.80377483, -0.03074291, -0.5488695 ,
       -0.14457558, -0.4310017 ,  0.2791916 , -0.39405406,  0.49133328,
        0.3447529 , -0.38607582,  0.10320242,  0.29659846,  0.4268469 ,
       -0.28107315, -0.0024615 , -0.47937745,  0.12797731, -0.90314204,
       -0.01519119,  0.656861  ,  0.4687069 ,  0.00193253, -0.12906523,
        0.373461  ,  0.11021631, -0.19255552,  0.3118072 , -0.31508872,
        0.89075845, -0.23715876, -0.11035468,  0.16879605,  0.61

In [8]:
model_jp.wv.most_similar("男")

[('女', 0.9132813215255737),
 ('友達', 0.7762773633003235),
 ('モテ', 0.7702534794807434),
 ('女の子', 0.7469756603240967),
 ('男性', 0.7417479753494263),
 ('理想', 0.7405964136123657),
 ('彼女', 0.7361493110656738),
 ('恋人', 0.7316548824310303),
 ('やっぱり', 0.7310003042221069),
 ('言う', 0.7202484011650085)]
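
The scores returned by `most_similar` are cosine similarities between word vectors, so a value near 1 means the two vectors point in almost the same direction. A minimal sketch of that ranking follows, using made-up 2-dimensional vectors; the `toy` table and both helper functions are hypothetical stand-ins, not gensim's implementation.

```python
import math

# Hypothetical 2-d vectors; the model trained above uses 100 dimensions.
toy = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0], "d": [-1.0, 0.0]}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("a", toy))
```

Here "b" ranks first because it points in nearly the same direction as "a", while "d" points the opposite way and ranks last.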

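Using pretrained word embeddings, carry out the analysis below.

- Following the method proposed by [Kozlowski et al., 2019](https://journals.sagepub.com/doi/10.1177/0003122419877135), construct a gender dimension.
- Measure the angle between the gender dimension and the vectors of sports-related words, and interpret the result.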



In [9]:
import gensim.downloader
model = gensim.downloader.load('word2vec-google-news-300')

In [10]:
sports=["tennis","soccer","basketball","boxing","golf","swimming","volleyball","camping","weightlifting","hiking","hockey"]



In [11]:
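# The lists are paired by position: male_list[i] is contrasted with female_list[i],
# so "his" appears twice in order to pair with both "her" and "hers".
male_list = ["man", "men", "his", "his", "he", "male", "masculine"]
female_list = ["woman", "women", "her", "hers", "she", "female", "feminine"]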

In [12]:
import numpy as np

# Gender dimension: mean of the (male - female) difference vectors over all pairs.
male_vec = np.mean(
    [model[m] - model[f] for m, f in zip(male_list, female_list)],
    axis=0,
)
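
The cell above averages raw difference vectors, so a pair whose vectors have larger norms contributes more to the dimension. One possible variant scales each word vector to unit length before taking the differences. The sketch below compares the two on made-up 3-dimensional vectors; the `toy_vectors` table, the helper names, and the normalization step itself are assumptions for illustration and are not confirmed here as the paper's exact procedure.

```python
import math

# Hypothetical 3-d "embeddings" standing in for model[word]; not real data.
toy_vectors = {
    "man": [2.0, 0.0, 1.0],
    "woman": [-2.0, 0.0, 1.0],
    "he": [0.5, 0.1, 0.0],
    "she": [-0.5, 0.1, 0.0],
}

def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def build_dimension(pairs, vectors, normalize=True):
    """Average the (first - second) difference vectors over all word pairs."""
    diffs = []
    for a, b in pairs:
        va, vb = vectors[a], vectors[b]
        if normalize:
            va, vb = unit(va), unit(vb)
        diffs.append([x - y for x, y in zip(va, vb)])
    return [sum(col) / len(diffs) for col in zip(*diffs)]

pairs = [("man", "woman"), ("he", "she")]
raw_dim = build_dimension(pairs, toy_vectors, normalize=False)
unit_dim = build_dimension(pairs, toy_vectors, normalize=True)
print(raw_dim)
print(unit_dim)
```

In the raw version the man/woman pair contributes four times as much as the he/she pair along the first axis (4.0 versus 1.0). After normalization the two contributions are roughly equal (about 1.79 versus 1.96). Whether that matters for the real model depends on how much the norms of the chosen words differ.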

In [16]:
def get_cosine(vector, dimension):
    """
    Calculate the cosine similarity between the vector and the given dimension
    """
    v_dot_d = np.dot(vector, dimension)
    v_d = np.linalg.norm(vector) * np.linalg.norm(dimension)
    return v_dot_d / v_d

In [17]:
def get_angle(vector, dimension, degree=False):
    """
    Calculate the angle between the vector and the given dimension
    (in radians by default, in degrees if degree=True)
    """
    # Clip to guard against rounding that pushes the cosine outside [-1, 1]
    c = np.clip(get_cosine(vector, dimension), -1, 1)
    angle = np.arccos(c)
    return np.degrees(angle) if degree else angle

In [18]:
for sport in sports:
    print(sport,get_angle(model[sport],male_vec,degree=True))

tennis 97.40878845685319
soccer 91.80367294960222
basketball 92.62550153560498
boxing 88.10336068157848
golf 89.92070043788232
swimming 95.5663930896144
volleyball 102.37309292835671
camping 90.67382084420579
weightlifting 90.39126386418617
hiking 93.58319273998538
hockey 89.98937626434014
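
Because `male_vec` is built as male minus female, an angle below 90° corresponds to a positive cosine (a lean toward the male pole) and an angle above 90° to a negative cosine (a lean toward the female pole). The sketch below re-reads the printed angles that way. The values are copied from the output above and rounded to two decimals; the `lean` helper and its 1° neutral band are illustrative assumptions, not thresholds from the paper.

```python
import math

# Angles in degrees, rounded from the output above.
angles = {
    "tennis": 97.41, "soccer": 91.80, "basketball": 92.63, "boxing": 88.10,
    "golf": 89.92, "swimming": 95.57, "volleyball": 102.37, "camping": 90.67,
    "weightlifting": 90.39, "hiking": 93.58, "hockey": 89.99,
}

def lean(angle_deg, band=1.0):
    """Label an angle relative to 90 degrees; `band` is an arbitrary neutral zone."""
    if angle_deg < 90.0 - band:
        return "male"
    if angle_deg > 90.0 + band:
        return "female"
    return "neutral"

# Sort from most male-leaning (smallest angle) to most female-leaning.
ranked = sorted(angles.items(), key=lambda kv: kv[1])
for word, angle in ranked:
    cosine = math.cos(math.radians(angle))
    print(f"{word:14s} {angle:7.2f}  cos={cosine:+.3f}  {lean(angle)}")
```

Read this way, boxing is the only word that leans toward the male pole by more than a degree. Golf, hockey, weightlifting, and camping sit almost orthogonal to the gender dimension, and volleyball and tennis lean furthest toward the female pole. Every deviation from 90° is modest (at most about 12°), so these are weak associations rather than strong ones.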


# Bonus: Computing Gender Bias

In [19]:
import numpy as np
from gensim.downloader import load
from numpy.linalg import norm

# Load the model
model = load("word2vec-google-news-300")

# Word lists representing each gender
female_words = ["she", "female", "woman", "girl"]
male_words = ["he", "male", "man", "boy"]

# Occupation words
occupation_words = ["engineer", "nurse", "housekeeper"]

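# Function that computes the gender bias
def calculate_gender_bias(occupation, female_words, male_words, model):
    """Mean Euclidean distance to the female words minus that to the male words.

    Negative values mean the occupation lies closer to the female words.
    """
    # Distances to the female words
    female_distances = [norm(model[occupation] - model[female]) for female in female_words]
    female_mean = np.mean(female_distances)

    # Distances to the male words
    male_distances = [norm(model[occupation] - model[male]) for male in male_words]
    male_mean = np.mean(male_distances)

    # Gender bias
    gender_bias = female_mean - male_mean
    return gender_bias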

# Compute the gender bias for each occupation
bias_results = {}
for occupation in occupation_words:
    bias = calculate_gender_bias(occupation, female_words, male_words, model)
    bias_results[occupation] = bias

In [20]:
# Print the results
for occupation, bias in bias_results.items():
    relation = "closer to women" if bias < 0 else "closer to men"
    print(f"Gender Bias for {occupation}: {bias:.4f} ({relation})")

Gender Bias for engineer: 0.2160 (closer to men)
Gender Bias for nurse: -0.2625 (closer to women)
Gender Bias for housekeeper: -0.1960 (closer to women)
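
The bias above is based on Euclidean distance, which depends on the lengths of the vectors as well as their directions. A cosine-based variant looks at direction only, and the two can disagree when the reference words have very different norms. The sketch below shows such a case with made-up 2-dimensional vectors; the `vecs` table and both functions are hypothetical and only illustrate the difference between the measures.

```python
import math

# Hypothetical 2-d vectors, chosen only to illustrate the two measures.
vecs = {
    "she": [0.0, 3.0],   # deliberately longer than "he"
    "he": [1.0, 0.0],
    "job": [0.6, 0.8],   # points more toward "she" than toward "he"
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def euclidean_bias(word, female, male, vectors):
    """Mean distance to female words minus mean distance to male words."""
    f = mean(math.dist(vectors[word], vectors[w]) for w in female)
    m = mean(math.dist(vectors[word], vectors[w]) for w in male)
    return f - m

def cosine_bias(word, female, male, vectors):
    """Mean cosine to male words minus mean cosine to female words.

    The sign convention matches euclidean_bias: negative means closer to women.
    """
    f = mean(cosine(vectors[word], vectors[w]) for w in female)
    m = mean(cosine(vectors[word], vectors[w]) for w in male)
    return m - f

print(euclidean_bias("job", ["she"], ["he"], vecs))
print(cosine_bias("job", ["she"], ["he"], vecs))
```

With these toy values the Euclidean measure is positive (closer to "he") while the cosine measure is negative (closer to "she"), purely because "she" was given a longer vector. This is a constructed example, so it says nothing about whether the results above would change under a cosine-based measure.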
