# Word Embedding Association Test (WEAT)

- Evaluate Implicit Association between social categories and attributes.
  - The conventional Implicit Association Test (IAT) is a well-known psychological assessment intended to detect subconscious associations between mental representations of objects (concepts) in memory.
  - Easier pairings that result in faster responses are interpreted as more strongly associated in memory.
- WEAT: Comparing the association between two sets of target words and two sets of attributes.
  - Target words: Social categories, such as female/male and races.
  - Attribute words: Merits (honest), characteristics (pleasant), and occupations (programmers and nurses).
  - Association: Cosine similarity
  - Statistical test: Independent/Paired t-test

_Reference: Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186. https://doi.org/10.1126/science.aal4230_

## Gender Bias
1. Prepare two lists of female-typed and male-typed names
2. Target two gender-typed attributes, i.e. family vs career
3. Compile a sufficient number of keywords that are representative of the attribute, i.e. family-related words vs career-related words
4. Calculate the average cosine similarity (association) between a specific name and family/career words
   - Load a pretrained word embedding model
   - Extract the word vectors for the names and attribute words
   - Stack them into a matrix where rows represent words and columns represent latent dimensions
   - `sklearn.metrics.pairwise.cosine_similarity(X, Y)` computes the cosine similarity between every pair of rows in X and Y.
5. Subtract the career association from the family association for a given name -> Relative strength of **family association**, in contrast to the career association (to what extent the name's family association exceeds its career association)
6. Compare the relative family association of the female and male groups -> t-test
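
The steps above can be sketched end to end with toy data. This is a minimal pure-Python illustration, assuming made-up two-dimensional vectors; the names `toy_vectors`, `cosine`, `mean_association`, and `relative_family_association` are hypothetical and not part of the workshop files. Real vectors would come from a pretrained embedding model.

```python
import math

# Toy 2-D "embeddings" invented for illustration only
toy_vectors = {
    'amy':    [0.9, 0.1],
    'john':   [0.2, 0.8],
    'home':   [1.0, 0.0],   # stands in for a family-related attribute word
    'office': [0.0, 1.0],   # stands in for a career-related attribute word
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_association(name, attribute_words):
    """Average cosine similarity between one name and a list of attribute words."""
    sims = [cosine(toy_vectors[name], toy_vectors[w]) for w in attribute_words]
    return sum(sims) / len(sims)

def relative_family_association(name, family_words, career_words):
    """Family association minus career association for one name."""
    return mean_association(name, family_words) - mean_association(name, career_words)

amy_score = relative_family_association('amy', ['home'], ['office'])
john_score = relative_family_association('john', ['home'], ['office'])
print(amy_score, john_score)
```

A positive score means the name sits closer to the family words than to the career words in this toy space; a negative score means the opposite.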

Download this file: https://juniorworld.github.io/python-workshop/doc/weat_gender.json

In [None]:
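import json
# Open the file in a with-block so that it is closed automatically after loading
with open('./doc/weat_gender.json', 'r', encoding='utf-8') as f:
    weat_dict = json.load(f)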

In [None]:
weat_dict.keys()

In [None]:
# Each dictionary element contains a list of words associated with a certain social category (key)
weat_dict['Female']

In [None]:
# Overall 19 female-typed names/pronouns are selected.
len(weat_dict['Female'])

In [None]:
weat_dict['Family']

In [None]:
import collections.abc
#Python 3.10 removed these four aliases from `collections`; restore them manually for older packages that still rely on them.
collections.Iterable = collections.abc.Iterable
collections.Mapping = collections.abc.Mapping
collections.MutableSet = collections.abc.MutableSet
collections.MutableMapping = collections.abc.MutableMapping

In [None]:
# Show all available models in gensim-data
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

Format: **Model_Name**-**Training_source**-**Dimensions**

Model name:
1. `fasttext` invented at Facebook
2. `word2vec` invented at Google
3. `GloVe` invented at Stanford

Training sources:
1. Twitter
2. Google News
3. Wikipedia articles

Dimensions: Vector Size
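
As a small illustration of this naming format, the sketch below splits a model name into its three parts. The helper `parse_model_name` is hypothetical, and it assumes the name really follows the Model-Source-Dimensions pattern; some entries in gensim-data may not.

```python
def parse_model_name(name):
    """Split a model name into (model, training source, dimensions)."""
    model, rest = name.split('-', 1)          # text before the first hyphen
    source, dimensions = rest.rsplit('-', 1)  # number after the last hyphen
    return model, source, int(dimensions)

print(parse_model_name('glove-wiki-gigaword-200'))
print(parse_model_name('word2vec-google-news-300'))
```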

In [None]:
# Load the 200-dimensional GloVe model trained on Wikipedia and Gigaword
glove = gensim.downloader.load('glove-wiki-gigaword-200')

In [None]:
# Check if a word exists in the pretrained word embedding
'she' in glove

In [None]:
# Get the word vector for "she"
glove['she']

In [None]:
# Yuner does not exist in the pretrained word embedding 
'yuner' in glove

In [None]:
# Install sklearn if you haven't done so
! pip3 install scikit-learn

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def words_to_vectors(words, word_embedding):
    # Start from an empty (0 x dimensions) numpy array; the width is read from the model instead of being hardcoded
    vectors = np.empty(shape=(0, word_embedding.vector_size))
    for word in words:
        if word in word_embedding:  # words missing from the embedding are skipped
            vectors = np.vstack([vectors, word_embedding[word]])  # stack arrays vertically, one row per word
    return vectors
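
Because `words_to_vectors` silently skips words that are missing from the embedding, row `i` of the result may no longer correspond to word `i` of the input list. One way to keep track, sketched here under the assumption that a plain dict can stand in for the pretrained model, is to return the kept words alongside the vectors. The function `words_to_vectors_with_labels` and the toy data are hypothetical.

```python
def words_to_vectors_with_labels(words, word_embedding):
    """Return (kept_words, vectors), skipping words missing from the embedding."""
    kept_words, vectors = [], []
    for word in words:
        if word in word_embedding:
            kept_words.append(word)
            vectors.append(word_embedding[word])
    return kept_words, vectors

# A plain dict stands in for the pretrained model in this toy example
toy_embedding = {'amy': [0.9, 0.1], 'lisa': [0.8, 0.3]}
kept, vecs = words_to_vectors_with_labels(['amy', 'yuner', 'lisa'], toy_embedding)
print(kept)
```

Printing scores against `kept` rather than against the original word list avoids mislabeling rows when some words are out of vocabulary.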

In [None]:
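# Convert female, male, family, and career words into vectors
# (the key names are assumed to match those listed by weat_dict.keys())
Female_vectors = words_to_vectors(weat_dict['Female'], glove)
Male_vectors = words_to_vectors(weat_dict['Male'], glove)
Family_vectors = words_to_vectors(weat_dict['Family'], glove)
Career_vectors = words_to_vectors(weat_dict['Career'], glove)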

In [None]:
Female_vectors.shape

In [None]:
Family_vectors.shape

In [None]:
Female_Family_association = cosine_similarity(Female_vectors, Family_vectors)

In [None]:
Female_Family_association.shape

In [None]:
Female_Family_mean = np.mean(Female_Family_association, axis=1)

In [None]:
Female_Family_mean.shape

In [None]:
Female_Family_mean

In [None]:
Female_Career_association = cosine_similarity(Female_vectors, Career_vectors)
Female_Career_mean = np.mean(Female_Career_association, axis=1)

In [None]:
Female_Career_mean

In [None]:
# What does this block of scripts mean?
for i in range(len(Female_Family_mean)):
    print(weat_dict['Female'][i],Female_Family_mean[i]-Female_Career_mean[i])

In [None]:
# Which name is the most family-oriented name in the current word embedding model?


In [None]:
Male_Family_association = cosine_similarity(Male_vectors, Family_vectors)
Male_Family_mean = np.mean(Male_Family_association, axis=1)
Male_Career_association = cosine_similarity(Male_vectors, Career_vectors)
Male_Career_mean = np.mean(Male_Career_association, axis=1)

In [None]:
# What does this block of scripts mean?
for i in range(len(Male_Family_mean)):
    print(weat_dict['Male'][i],Male_Family_mean[i]-Male_Career_mean[i])
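
The per-name differences printed by these loops can also be ranked to find the most family-oriented name. The sketch below uses hypothetical names and scores purely for illustration; with real data, the two lists would be the names that were found in the embedding and their family-minus-career differences.

```python
# Hypothetical names and relative family associations, for illustration only
names = ['amy', 'lisa', 'sarah']
scores = [0.05, 0.12, -0.02]

# Pair each name with its score, then sort from most to least family-oriented
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
most_family_oriented = ranked[0][0]
print(ranked)
```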

## Student's T-test
- Student's t-test is a statistical test used to test whether the difference between two groups is statistically significant or not.
- https://www.youtube.com/watch?v=pTmLQvMM-1M
- Different variants:
  - Independent t-test: Two independent groups
  - Paired sample t-test: Two matched groups associated with identical or similar subjects
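
To make the independent t-test less of a black box, here is a minimal sketch of the pooled-variance (Student's) t statistic computed by hand, which is the variant `scipy.stats.ttest_ind` uses by default. The function name `independent_t` and the sample numbers are made up for illustration, and the p-value is not computed here.

```python
import math
import statistics

def independent_t(sample_a, sample_b):
    """Student's independent two-sample t statistic with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, n_a + n_b - 2  # t statistic and degrees of freedom

t_stat, dof = independent_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), dof)
```

The further the t statistic is from zero, given the degrees of freedom, the less plausible it is that the two groups share the same mean.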

In [None]:
Female_association = Female_Family_mean-Female_Career_mean
Male_association = Male_Family_mean-Male_Career_mean

In [None]:
# Compare group means
np.mean(Female_association)>np.mean(Male_association)

In [None]:
from scipy import stats
stats.ttest_ind(Female_association, Male_association)

## Exercise
Write a program to evaluate the association between gender and art/science. <br>
In this word embedding, are female names more strongly associated with art (relative to science) than male names are?




In [None]:
# Write your code here



# Chinese NLP

### What's Special about Chinese Language?
1. First-run Cleaning: No need to convert <font style="color: blue">letter case</font>
    - A data_cleaning() function written solely for Chinese would be simpler than the English one, since it does not need the text.lower() line
    - However, we typically use the English version of data_cleaning(), in case the text contains some English characters
2. Tokenization: No <font style="color: blue">natural delimiter</font>, like the space in English. We need to rely on a language model to split text into words.
3. Second-run Cleaning: No need to <font style="color: blue">stem/lemmatize</font> words
4. Vectorization: Identical to English

#### 1. First-run Data Cleaning
- Main task: Remove punctuation and special elements such as hashtags and hyperlinks
- Use regular expressions for pattern matching
- No need to convert cases
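
The cleaning steps listed above can be tried one pattern at a time. The sketch below runs on a made-up mixed Chinese/English string; the sample text and the URL are invented for illustration. In Python 3, `\W` on a `str` pattern is Unicode-aware, so Chinese characters count as word characters while full-width punctuation does not.

```python
import re

# Made-up sample text containing a mention, a hashtag, a link, and punctuation
text = '歡迎 @user 查看 #新聞 https://example.com/a 謝謝！'

text = re.sub(r'@\S+', '', text)         # remove @mentions
text = re.sub(r'#\S+', '', text)         # remove hashtags
text = re.sub(r'https?:\S+', '', text)   # remove http/https links
text = re.sub(r'\W+', ' ', text)         # collapse punctuation and spaces into one space
cleaned = text.strip()
print(cleaned)
```

Raw strings (`r'...'`) keep Python from interpreting the backslash in `\W` or `\S` as a string escape.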

In [None]:
#Also works for Chinese text
import re
re.sub(r'\W+',' ','普京表示，歡迎中方在化解危機中的建設性角色！')

In [None]:
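def data_cleaning(text):
    text=text.lower()                     #lowercase any English letters
    text=re.sub(r'[0-9]+','',text)        #remove digits
    text=re.sub(r'@[^ ]+','',text)        #remove @mentions
    text=re.sub(r'#[^ ]+','',text)        #remove hashtags
    text=re.sub(r'https?:[^ ]+','',text)  #remove http/https links
    text=re.sub(r'\W+',' ',text)          #replace punctuation and other non-word characters with a space
    text=text.strip()
    return text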

In [None]:
#test the data_cleaning() function with a Weibo post
a="各國應轟炸俄羅斯境內“暈輸線”……當年“炮擊金門”很久，最後因美國切斷了“廈門車站”運輸線，炮擊金門才止。（而不應去烏克蘭建軍工廠：俄會集中火力轟炸。）@美国驻华大使馆 @英國駐華使館 @歐盟在中國 @烏克蘭信使"
data_cleaning(a)

#### 2. Tokenization

We will use the package "jieba" to tokenize Chinese text.<br>
<br>
**Why jieba?**
- It adopts a hybrid method that combines statistical/probabilistic inference with dictionary-based pattern matching.
    - capable of recognizing words that exist in the pre-defined dictionary
    - capable of finding new words
- Two dictionaries:
    - System dictionary
        - Simplified Chinese
        - Simplified+Traditional Chinese
    - User dictionary
- Syntax:
```python
jieba.cut(sentence) #returns a generator of words; wrap it in list() to get a list
```
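
To give a feel for the dictionary-based half of this approach, here is a simplified sketch of forward maximum matching, a classic greedy segmentation technique. It is not jieba's actual algorithm, which also relies on probabilistic modelling; the function `forward_max_match` and the tiny dictionary are hypothetical.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy dictionary-based segmentation: always take the longest matching word."""
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink until a dictionary word is found
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)  # a single character is the fallback
                i += size
                break
    return words

toy_dictionary = {'工会', '号召', '静坐', '谈判', '搁置'}
segmented = forward_max_match('工会号召静坐', toy_dictionary)
print(segmented)
```

A purely dictionary-based method like this cannot recognize words it has never seen, which is why a statistical component is useful for finding new words.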

In [None]:
! pip3 install jieba

In [None]:
import jieba

In [None]:
list(jieba.cut('你好，这是一个简单的句子。'))

In [None]:
#it can also segment traditional Chinese text by using statistical inference.
list(jieba.cut('你好，這是一個簡單的句子。'))

In [None]:
#however, statistical inference is not always perfect.
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
list(jieba.cut('谈判搁置，工会号召静坐。'))

How could we improve statistical inference for tokenization?<br>
**Human in the loop**: Provide a human-defined dictionary to constrain and fine-tune the statistical inference.

##### Solution: Configure Dictionaries
- Two types of dictionaries:
    1. System dictionary: General purpose
    2. User dictionary: Special context, e.g. dictionaries for emotion, incivility, war
- What does the dictionary look like?
    - Don't confuse it with the Python data type “dictionary”
    - A dictionary is a plain text file
    - One keyword per line, similar to the stop word list/file
    - [Optional] Words may also be weighted, each followed by a number indicating the importance of the word
```
#Way 1: no weight: all words are created equal
China
People's Republic of China
China Central Television
```

```
#Way 2: with weights: words are treated unequally. Higher weight, higher priority
China,3
People's Republic of China,4
China Central Television,4
```
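
Note that jieba's own user dictionary files separate the fields with spaces rather than commas: each line holds a word, optionally followed by a frequency and a part-of-speech tag. The sketch below writes such a file to a temporary folder; the entries, the frequency value 10, and the file name `user_dict_demo.txt` are made up for illustration.

```python
import os
import tempfile

# Hypothetical entries: (word, optional frequency); None means "no weight given"
entries = [('林鄭月娥', None), ('汶萊達魯薩蘭國', 10)]

lines = []
for word, freq in entries:
    # Space-separated fields: word, then the optional frequency
    lines.append(word if freq is None else f'{word} {freq}')

path = os.path.join(tempfile.mkdtemp(), 'user_dict_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')  # one entry per line

with open(path, encoding='utf-8') as f:
    content = f.read()
print(content)
```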


- In jieba, you can load dictionaries using the following syntax:
```python
jieba.set_dictionary("path_of_system_dict")
jieba.load_userdict("path_of_user_dict")
```


To better segment traditional Chinese text, we need to upgrade the system dictionary to include traditional Chinese words.<br>
Download the system dictionary from this link: https://github.com/fxsjy/jieba/raw/master/extra_dict/dict.txt.big

In [None]:
#load traditional Chinese system dictionary
jieba.set_dictionary('./doc/dict.txt.big')

In [None]:
#try tokenizing this sentence again
list(jieba.cut('談判擱置，工會號召靜坐。'))

In [None]:
#Some names and special terminologies cannot be properly identified.
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('蔡英文日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #special terminologies

In [None]:
#Use a for loop to build your user dictionary (time-consuming)
keywords=['林鄭月娥','蔡英文','韓國瑜','汶萊達魯薩蘭國']
#The with-block closes the file automatically once the loop finishes
with open('user_dict.txt','w',encoding='utf-8') as file:
    for keyword in keywords:
        file.write(keyword+'\n') #one keyword per line

In [None]:
#Use your user dictionary
jieba.load_userdict('user_dict.txt')

In [None]:
#After loading user dictionary:
print(list(jieba.cut('中央上周二向特首林鄭月娥發公函'))) #very long name
print(list(jieba.cut('蔡英文日前表示希望與日本舉行安保對話'))) #names including frequently used words
print(list(jieba.cut('高雄市長韓國瑜本月稍後訪問港澳深圳廈門四市'))) #names including frequently used words
print(list(jieba.cut('汶萊的全稱為汶萊達魯薩蘭國。'))) #terminologies

#### 3. Remove stop words

Chinese stop words file: https://juniorworld.github.io/python-workshop/doc/stop_words_chi.txt

In [None]:
#load stop word list
with open('./doc/stop_words_chi.txt','r',encoding='utf-8') as file_chi:
    stopwords=[i.strip() for i in file_chi.readlines()]

In [None]:
len(stopwords) #much longer and more detailed than the English stop word list

In [None]:
#have a look at the stop word list
stopwords[34:39]

In [None]:
def remove_stopwords(words):
    #keep only the tokens that are not in the stop word list
    words_rm=[word for word in words if word not in stopwords]
    return words_rm
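
For long texts, looking up each token in a list can be slow, and tokenized Chinese text often contains whitespace-only tokens. A minimal sketch of a variant that addresses both is shown below; the function `remove_stopwords_fast` and the toy tokens are hypothetical, not part of the workshop code.

```python
def remove_stopwords_fast(words, stopword_list):
    """Filter tokens using a set, and also drop whitespace-only tokens."""
    stopword_set = set(stopword_list)  # set membership tests are fast on average
    return [w for w in words if w.strip() and w not in stopword_set]

tokens = ['我們', ' ', '的', '平台', ' ', '很', '開放']
filtered = remove_stopwords_fast(tokens, ['的', '很'])
print(filtered)
```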

In [None]:
paragraph='Facebook CEO 馬克·朱克伯格（Mark Zuckerberg）週三發布了一篇長文，闡述了要將 Facebook 打造成「以隱私為中心的平台」的願景，並表示將打通 Messenger、Instagram 和 WhatsApp 用戶之間的交流阻礙。朱克伯格表示，他相信未來人們的溝通行為會更多轉向私人加密服務，也未必希望他們分享的所有內容都被永遠保存在互聯網上——後者對於每個人來說，既可能是財富，也可能是負擔。因此，儘管 Facebook 長期以來專注於打造開放、分享的社區平台，但他認為，以隱私為中心的通信平台會比當今的開放平台更加重要。'
text_clean=data_cleaning(paragraph)
words=jieba.cut(text_clean)
words_rm=remove_stopwords(words)

In [None]:
# Containing too many meaningless whitespaces
words_rm

In [None]:
# Add whitespace to the stopword lists


In [None]:
# Rerun the code
text_clean=data_cleaning(paragraph)
words=jieba.cut(text_clean)
words_rm=remove_stopwords(words)

In [None]:
words_rm

In [None]:
import pandas as pd
#count word frequency
pd.Series(words_rm).value_counts()

In [None]:
pd.Series(words_rm).value_counts(normalize=True)

<h3 style='color:blue'>Exercise</h3>

Find the top 10 fade-in and top 10 fade-out words in the speeches.<br>
The magnitude of difference is measured by the change in their relative frequencies:<br>
<p style='text-align:center;font-size:15px;'>Relative Freq (RF) = word frequency / sum of word frequencies</p>
<p style='text-align:center;font-size:15px;'>Difference = RF<font size='2px'>2019</font> - RF<font size='2px'>2009</font></p>

Options:<br>
- Chinese: Annual government work reports, <a href="https://juniorworld.github.io/python-workshop/doc/2019_Government_Work_Report.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop/doc/2009_Government_Work_Report.txt">2009</a>
- English: State of the Union address, <a href="https://juniorworld.github.io/python-workshop/doc/2019_SoU.txt">2019</a> vs <a href="https://juniorworld.github.io/python-workshop/doc/2009_SoU.txt">2009</a><br>

*Hint:*<br>
*1. Use `pd.concat([df1,df2],axis=1)` to combine two dataframes by columns and `pd.concat([df1,df2],axis=0)` to combine two dataframes by rows*<br>
*2. Use `df[column_name].value_counts()` to count the items in a column.*<br>
*3. Use `df.sort_values(column_name,ascending=True)` to sort a table by a certain column. To sort in descending order, set ascending=False* <br>
*4. Use `df.fillna(0)` to replace NaN values with 0.*
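
The relative-frequency comparison defined above can also be sketched without pandas, which may help to check your results. The example below uses `collections.Counter` on two made-up token lists standing in for the tokenized reports; the function `relative_freq_of` and all data are hypothetical.

```python
from collections import Counter

def relative_freq_of(words):
    """Map each word to its count divided by the total number of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Made-up token lists standing in for the two tokenized reports
words_2009 = ['economy', 'economy', 'economy', 'growth']
words_2019 = ['innovation', 'innovation', 'economy', 'growth']

rf_2009, rf_2019 = relative_freq_of(words_2009), relative_freq_of(words_2019)
vocabulary = set(rf_2009) | set(rf_2019)
# A word absent from one year gets relative frequency 0 there
diff = {w: rf_2019.get(w, 0) - rf_2009.get(w, 0) for w in vocabulary}

fade_in = sorted(diff, key=diff.get, reverse=True)[:1]   # largest increase
fade_out = sorted(diff, key=diff.get)[:1]                # largest decrease
print(fade_in, fade_out)
```

Using `.get(w, 0)` plays the same role as `fillna(0)` in the pandas version: a word missing from one year is treated as having zero frequency.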

In [None]:
#Read Chinese files
Chi_file_2019=open('./doc/2019_Government_Work_Report.txt','r',encoding='utf-8')
Chi_file_2009=open('./doc/2009_Government_Work_Report.txt','r',encoding='utf-8')

In [None]:
#CHI 2019
#Step1: Clean text
#Step2: Tokenize text
#Step3: Remove stopwords
#Step4: Add the current word list to Chi_words_2019
#---------------------------------------------------
Chi_words_2019=[]




In [None]:
#CHI 2009
Chi_words_2009=[]



In [None]:
#Count relative word frequencies



In [None]:
#Combine relative_freq_2019 and relative_freq_2009 into relative_freq, using pd.concat() function
relative_freq = pd.concat([relative_freq_2019, relative_freq_2009], axis=1)

In [None]:
#Fill out missing values with 0
relative_freq=relative_freq.fillna(0)

In [None]:
relative_freq.head()

In [None]:
#Calculate the frequency difference
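#first column is assumed to hold the 2019 values and the second column the 2009 values
relative_freq['diff']=relative_freq.iloc[:,0]-relative_freq.iloc[:,1]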

In [None]:
#Change column names
relative_freq.columns=

In [None]:
#Sort table by column 'diff'
#Fade in words: words that are more common in 2019 report



In [None]:
#Fade out words: words that are more common in 2009 report



# Hackathon & Team Project

1. Team Size = 4 people
2. Random Grouping or Self Grouping?
3. Format:
   - 3-hour Hackathon: Submit a draft (no word limit) explaining your team plan, division of labour, and the preliminary results you have derived within the 3 hours. You will also need to turn in your Jupyter notebook.
   - 1-week Extended work: Extend the draft into a full paper of no fewer than 3200 words.
4. Datasets:
   - Social media postings
   - News articles
   - Reviews/Comments
   - You have leeway to add new datasets to your study
5. Grade breakdowns:
   - Hackathon: 10 points
     - Group grading: 5 points
     - Individual grading: 5 points
   - Extended paper: 10 points
     - Group grading: 5 points
     - Individual grading: 5 points
   - Peer Evaluation: 3 points
     - Evaluations lacking variation, such as exclusively consisting of either 5-star or 1-star ratings, will result in disqualification.
6. Evaluation criteria:
   - Effort-based grading
   - Sophistication/Richness of results
   - Computational thinking & Code quality
   - Storytelling skills: describe the data clearly, explain the rationale for the study, incorporate visualizations and statistics smartly, and derive insightful findings from the analyses
   - Teamwork