# Lexicon generation

Heilman (2012) proposes four pairs of traits for women and men. The following table lists three example words for each group.


| Masculine group            | Example words                     | Feminine group         | Example words                        |
| :--                        | :--:                              | :--                    | :--:                                 |
| achievement orientation    | competent, ambitious, achievement | concern for others     | kind, caring, considerate            |
| inclination to take charge | forceful, dominant, assertive     | affiliative tendencies | warm, collaborative, friendly        |
| autonomy                   | independent, decisive, autonomous | deference              | obedient, respectful, receptive      |
| rationality                | analytical, logical, objective    | emotional sensitivity  | perceptive, understanding, intuitive |



I first experimented with `Empath` and `ChatGPT` for synonym extraction. However, the former can generate many words on irrelevant topics, and the latter often gives unstable responses. More importantly, neither of them provides a similarity score that can be thresholded for a more accurate measure of relevance.

Therefore, extracting words by part of speech and identifying synonyms through pre-trained embeddings is a better choice. This notebook uses `word2vec-google-news-300`, a set of 300-dimensional vectors pre-trained on about 100 billion words from Google News.
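The advantage of embeddings is that relevance becomes a number that can be compared against a threshold. The sketch below illustrates the idea with cosine similarity, the measure that gensim's `similarity` method returns. The three-dimensional vectors and the `THRESHOLD` name are made up for illustration; they are not taken from the real model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors, invented for this example (the real model has 300 dimensions)
toy_vectors = {
    "kind":    [1.0, 0.0, 0.0],
    "caring":  [1.0, 1.0, 0.0],
    "logical": [0.0, 0.0, 1.0],
}

THRESHOLD = 0.4  # hypothetical cut-off for "close enough"

for word in ("caring", "logical"):
    score = cosine_similarity(toy_vectors["kind"], toy_vectors[word])
    print(word, round(score, 3), score > THRESHOLD)
```

With these toy vectors, "caring" passes the cut-off relative to "kind" and "logical" does not.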

In [1]:
import numpy as np
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from gensim import downloader
import json

In [2]:
ggl_model = downloader.load('word2vec-google-news-300')

In [3]:
PATH = ''
TRAINED_MODEL_NAME = '4th_400_hst'
FILE_NAMES = ['h1-100', 's1-100', 't1-100', 'h101-200','s101-200', 't101-200',\
              'h201-300', 's201-300', 't201-300', 'h301-400', 's301-400', 't301-400']

### Extract nouns and adjectives from the training set
The first step is to keep all nouns and adjectives of a meaningful length (at least four characters).
I also create a list of irrelevant words to improve the filtering.
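As a compact illustration of this filtering step, the sketch below applies the same three criteria (part-of-speech tag, minimum length, stop list) to a list of already tagged tokens. The helper name `keep_candidates` and the sample sentence are hypothetical; the tags follow the Penn Treebank convention used by `nltk.pos_tag`, where `NN` is a singular noun and adjective tags start with `J`.

```python
def keep_candidates(tagged, irrelevant=frozenset(), min_len=4):
    """Split (word, tag) pairs into nouns and adjectives worth keeping."""
    nouns, adjectives = set(), set()
    for word, tag in tagged:
        # drop short words and words on the hand-made stop list
        if len(word) < min_len or word in irrelevant:
            continue
        if tag == "NN":              # singular common noun
            nouns.add(word)
        elif tag.startswith("J"):    # JJ, JJR, JJS
            adjectives.add(word)
    return nouns, adjectives

# Hand-tagged sample, so the example runs without downloading NLTK data
tagged = [("She", "PRP"), ("is", "VBZ"), ("a", "DT"), ("decisive", "JJ"),
          ("leader", "NN"), ("with", "IN"), ("great", "JJ"), ("wit", "NN")]

nouns, adjectives = keep_candidates(tagged, irrelevant={"great"})
print(nouns, adjectives)
```

Here "wit" is dropped for being shorter than four characters and "great" for being on the stop list.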

In [4]:
data = []
for f_name in FILE_NAMES:
    with open(PATH + f_name +'.txt') as f:
        file = f.read().replace('\n','').split('.')
        data += file

In [5]:
adj = set()
nnn = set()
for sentence in data:
    tagged = pos_tag(word_tokenize(sentence))
    n = set(word for word,tag in tagged if tag == "NN")
    a = set(word for word,tag in tagged if tag[0] == "J")
    nnn = nnn | n
    adj = adj | a

In [6]:
irrelevant = set(['very','thing','something','anything','everything','nothing','final','whole','little',\
                 'easy','ruthless','cold','disrespectful','type','mindset','sort','insight','swift','rude',\
                 'demeanor','dramatic','cool','functional','accurate','simple','understood','learn','seamless',\
                 'particular','great','real','explanation','directness','imagine','wonderful','clever',\
                 'uncaring','truly','gentleman','quiet','calm','honorable','honest','candid','intrigued',\
                 'dismissive','satisfied','surprised','optimistic','apprehensive','attractive','hesitant', \
                 'impressed', 'uncomfortable', 'interested', 'skeptical', 'hopeful', 'unorthodox', 'attitude', \
                 'demeanor', 'tempting', 'compelling','achieve','experience','efficient','innovative','understand',\
                 'achieved','achieving','dominated','strive','cared','thought','knowledge', 'smart','brilliant',\
                 'volunteering'])
adj = adj - irrelevant
nnn = nnn - irrelevant

In [7]:
la = list(adj)
ln = list(nnn)

In [8]:
la = [i for i in la if len(i)>=4]
ln = [i for i in ln if len(i)>=4]

### Define a threshold of synonym identification
Computing the similarity between the first two example words of each group in the Google word2vec model shows that most of the example words are **synonyms semantically but not close in the embedding space**. The greatest value, about 0.46, gives a tentative reference for the cosine-similarity threshold for synonyms.
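One possible extension is to look at every pair within a group rather than a single pair. The sketch below does this with `itertools.combinations`. The helper `pairwise_similarities` is hypothetical, and the scores in `toy_scores` are placeholders, not model output.

```python
from itertools import combinations

def pairwise_similarities(words, similarity):
    """Return {(a, b): score} for every unordered pair of words."""
    return {(a, b): similarity(a, b) for a, b in combinations(words, 2)}

# Placeholder scores, invented for this example (NOT produced by word2vec)
toy_scores = {
    frozenset(("obedient", "respectful")): 0.5,
    frozenset(("obedient", "receptive")): 0.2,
    frozenset(("respectful", "receptive")): 0.3,
}

def toy_similarity(a, b):
    # order-independent lookup in the placeholder table
    return toy_scores[frozenset((a, b))]

pairs = pairwise_similarities(["obedient", "respectful", "receptive"], toy_similarity)
largest = max(pairs.values())
print(pairs, largest)
```

With the real model loaded, the same helper could be called as `pairwise_similarities(s_defe, ggl_model.similarity)`, assuming all three words are in the vocabulary.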

In [9]:
# list of base attributes
h_achi = ['competent', 'ambitious', 'achievement']
h_char = ['forceful', 'dominant', 'assertive']
h_auto = ['independent', 'decisive', 'autonomous']
h_ratn = ['analytical', 'logical', 'objective']
s_cocn = ['kind', 'caring', 'considerate']
s_affi = ['warm', 'collaborative', 'friendly']
s_defe = ['obedient', 'respectful', 'receptive']
s_emot = ['perceptive', 'understanding', 'intuitive']

c_h_achi = ggl_model.similarity(h_achi[0], h_achi[1])
c_h_char = ggl_model.similarity(h_char[0], h_char[1])
c_h_auto = ggl_model.similarity(h_auto[0], h_auto[1])
c_h_ratn = ggl_model.similarity(h_ratn[0], h_ratn[1])
c_s_cocn = ggl_model.similarity(s_cocn[0], s_cocn[1])
c_s_affi = ggl_model.similarity(s_affi[0], s_affi[1])
c_s_defe = ggl_model.similarity(s_defe[0], s_defe[1])
c_s_emot = ggl_model.similarity(s_emot[0], s_emot[1])

In [10]:
c_h_achi, c_h_char, c_h_auto, c_h_ratn, c_s_cocn, c_s_affi, c_s_defe, c_s_emot

(0.22667369,
 0.29560408,
 0.12045363,
 0.18539125,
 0.17777112,
 0.102216855,
 0.45882586,
 0.26906115)

### Establish a rule of synonym identification
The functions below identify synonyms by measuring the similarity between a particular word and the three example words of each group.

To identify more traits while keeping them relevant to the example words:
- Compute the pairwise similarities between a single word and the three example words of a group.
- Use the greatest of those three values to describe the relevance of the word to that group.
- Obtain one such similarity score for each of the eight groups.
- Use 0.4 as the closeness threshold: if the word's highest group score is greater than 0.4, the word is selected for the lexicon of that group.
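The rule above can be illustrated on plain numbers, independently of the embedding model. The sketch below is a simplified stand-in: the function name `assign_group` is hypothetical, only three of the eight groups are shown, and all similarity scores are made up.

```python
def assign_group(scores_by_group, threshold=0.4):
    """Pick the most relevant group for one word, or None if none is close.

    scores_by_group maps a group name to the word's similarities
    with that group's example words.
    """
    # relevance to a group = greatest similarity to any of its example words
    best_per_group = {g: max(s) for g, s in scores_by_group.items()}
    group, score = max(best_per_group.items(), key=lambda kv: kv[1])
    # keep the word only if its best group clears the threshold
    return (group, score) if score > threshold else None

# Made-up similarities for one candidate word
scores = {
    "h_achi": [0.10, 0.25, 0.31],
    "s_cocn": [0.55, 0.62, 0.48],
    "s_affi": [0.45, 0.41, 0.30],
}

result = assign_group(scores)
weak = assign_group({"h_achi": [0.10, 0.20, 0.30]})
print(result, weak)
```

Two groups exceed 0.4 in the first call, so the word goes to the one with the higher score; in the second call no group is close enough and the word is discarded.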

In [11]:
# measure the similarity between a single attribute and a list of base attributes

def sim(attr, base_attr, model=ggl_model):
    try:
        s = -1 * np.ones(len(base_attr))
        for i in range(len(base_attr)):
            s[i] = model.similarity(attr, base_attr[i])
        return max(s)     # can be changed
    except KeyError:     # attr or a base attribute is not in the model's vocabulary
        return -1
    
    
# assign a group for a single attribute from eight groups above

def find_group(attr, threshold=0.4):
    s = np.zeros(8)
    l = np.array([h_achi, h_char, h_auto, h_ratn, s_cocn, s_affi, s_defe, s_emot], dtype=object)
    l_name = np.array(['h_achi', 'h_char', 'h_auto', 'h_ratn', 's_cocn', 's_affi', 's_defe', 's_emot'])
    for i, base_attr in enumerate(l):
        s[i] = sim(attr, base_attr)
    close = s > threshold
    if close.sum() == 0:
        return 0
    elif close.sum() == 1:
        return attr, l_name[close][0], s[close][0]
    else:
        return attr, l_name[np.argmax(s)], s.max()
        

In [12]:
l_name = ['h_achi', 'h_char', 'h_auto', 'h_ratn', 's_cocn', 's_affi', 's_defe', 's_emot']

tagged_attr = []
tagged_grup = []
tagged_siml = []
attr_dict = {}

# process adjectives first, then nouns
for i in la + ln:
    gr = find_group(i, 0.4)
    if gr != 0:
        tagged_attr.append(gr[0])
        tagged_grup.append(gr[1])
        tagged_siml.append(gr[2])

        # group name -> list of words assigned to it
        attr_dict.setdefault(gr[1], []).append(gr[0])

In [13]:
attr_sim = {tagged_attr[i]:tagged_siml[i] for i in range(len(tagged_siml))}

In [14]:
Counter(tagged_grup)

Counter({'s_defe': 22,
         'h_char': 28,
         'h_achi': 22,
         's_cocn': 33,
         'h_auto': 10,
         's_affi': 16,
         's_emot': 20,
         'h_ratn': 12})

In [15]:
# use context managers so both files are flushed and closed
with open("attr_dict.json", 'w') as f:
    json.dump(attr_dict, f)
with open("attr_sim.json", 'w') as f:
    json.dump(attr_sim, f)
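A saved lexicon of this shape can later be read back and inverted into a word-to-group lookup. The sketch below shows one way to do that; the helper `invert_lexicon` and the toy lexicon are hypothetical, and a temporary directory is used so the example does not touch the real output files.

```python
import json
import os
import tempfile

def invert_lexicon(attr_dict):
    """Turn {group: [words]} into {word: group}."""
    return {word: group for group, words in attr_dict.items() for word in words}

# Toy lexicon, invented for this example (not the notebook's actual output)
toy = {"s_cocn": ["gentle", "helpful"], "h_char": ["bold"]}

path = os.path.join(tempfile.mkdtemp(), "attr_dict.json")
with open(path, "w") as f:
    json.dump(toy, f)

with open(path) as f:
    loaded = json.load(f)

word_to_group = invert_lexicon(loaded)
print(word_to_group)
```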