## Testing Additional Biases in Word Embeddings

We test word embeddings for biases along three additional axes: socioeconomic status, nationality, and age.
This notebook is based on the example notebook provided by the authors of **Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings** (https://arxiv.org/abs/1607.06520).

The scripts and the example IPython notebook are available in their GitHub repository at https://github.com/tolga-b/debiaswe


In [1]:
from __future__ import print_function, division
%matplotlib inline
from matplotlib import pyplot as plt
import json
import random
import numpy as np

import debiaswe as dwe
import debiaswe.we as we
from debiaswe.we import WordEmbedding
from debiaswe.data import load_professions

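# Load the reduced Google News word2vec embedding
E = WordEmbedding('./embeddings/w2v_gnews_small.txt')
# Note: E_nat and E_age are aliases of E, not copies, so every debias() call
# below modifies the same embedding
E_nat = E
E_age = E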

# load professions
# professions = load_professions()
# profession_words = [p[0] for p in professions]

*** Reading data from ./embeddings/w2v_gnews_small.txt
(26423, 300)
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
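
Because `E_nat = E` and `E_age = E` bind additional names to the same object, any in-place change to `E` is visible through all three names. The sketch below illustrates this with a hypothetical stand-in class (`ToyEmbedding` is not part of debiaswe) and shows `copy.deepcopy` as one possible way to get independent copies, assuming the real embedding object can be deep-copied.

```python
import copy


class ToyEmbedding:
    """Hypothetical stand-in for an embedding object (not part of debiaswe)."""

    def __init__(self, vecs):
        self.vecs = vecs  # dict mapping word -> list of floats


E = ToyEmbedding({"rich": [1.0, 0.0], "poor": [-1.0, 0.0]})
E_alias = E                # second name for the same object
E_copy = copy.deepcopy(E)  # independent copy with its own vectors

# Simulate an in-place modification such as a debiasing step.
E.vecs["rich"][0] = 0.0

print(E_alias is E)          # the alias is the very same object
print(E_alias.vecs["rich"])  # the alias sees the modification
print(E_copy.vecs["rich"])   # the deep copy keeps the original values
```

If the three experiments are meant to start from the same unmodified embedding, each would need its own copy (or a fresh load from disk) before `debias` is called.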


### Defining Socioeconomic Direction

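We define the socioeconomic direction from the word pairs "wealthy-impoverished", "rich-poor", and "wealth-poverty". The vectors on each side of the pairs are summed and normalized, and the direction is the normalized difference between the two sums.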

In [2]:
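# Pairs are laid out flat: even indices are "rich" terms, odd indices are "poor" terms
terms = ["wealthy", "impoverished", "rich", "poor", "wealth", "poverty"]
terms_group1 = [terms[2 * i] for i in range(len(terms) // 2)]
terms_group2 = [terms[2 * i + 1] for i in range(len(terms) // 2)]

# Sum each group's vectors (group 2 first), then normalize each sum to unit length
vs = [sum(E.v(w) for w in group) for group in (terms_group2, terms_group1)]
vs = [v / np.linalg.norm(v) for v in vs]

# Unit vector pointing from the "poor" group toward the "rich" group
v_eco = vs[1] - vs[0]
v_eco = v_eco / np.linalg.norm(v_eco)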
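
The same construction can be tried without the embedding file. The following is a minimal sketch on made-up 3-dimensional vectors; the dictionary `toy` and the helpers `unit` and `pair_direction` are hypothetical names introduced for illustration, not part of debiaswe.

```python
import numpy as np

# Made-up toy vectors standing in for real 300-dimensional embeddings.
toy = {
    "rich": np.array([0.9, 0.1, 0.2]),
    "poor": np.array([-0.8, 0.2, 0.1]),
    "wealth": np.array([0.7, 0.3, 0.0]),
    "poverty": np.array([-0.9, 0.1, 0.3]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def pair_direction(pairs, vec):
    """Unit vector pointing from the second words of the pairs toward the first."""
    first = unit(sum(vec[a] for a, _ in pairs))
    second = unit(sum(vec[b] for _, b in pairs))
    return unit(first - second)


v_toy = pair_direction([("rich", "poor"), ("wealth", "poverty")], toy)

# Words from the first group project positively, the second group negatively.
print(v_toy.dot(toy["rich"]), v_toy.dot(toy["poor"]))
```

Passing the pairs explicitly, rather than slicing a flat list by even and odd indices, makes it harder to misalign the two groups when the word list is edited.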

### Generating Socioeconomic Status Based Analogies of "Rich : x :: Poor : y"

In [3]:
# socioeconomic analogies before debiasing
a_eco = E.best_analogies_dist_thresh(v_eco)

for (a,b,c) in a_eco:
    print(a+"-"+b)

Computing neighbors
Mean: 10.219732808538016
Median: 7.0
wealthiest-poorest
wealthier-poorer
luxuries-basic_necessities
inequality-poverty
educated-illiterate
maximize-improve
real_estate-housing
yachts-fishing_boats
bitterness-hopelessness
spacious-cramped
rosy-dismal
distinguished-exemplary
outrageous-deplorable
ample-insufficient
opulent-austere
untapped-underdeveloped
stellar-subpar
capitalists-peasants
wealthy-wealthier
civil_liberties-human_rights
renovated-dilapidated
arrogance-apathy
sons-children
enhanced-improved
advantages-disadvantages
ludicrous-appalling
corporations-organizations
hubris-ineptitude
greedy-corrupt
able-unable
gullible-uneducated
enhancing-improving
clueless-hopeless
buoyant-sluggish
dehydrated-malnourished
ridiculous-atrocious
valuable-crucial
insiders-observers
strolled-trudged
hypocritical-pathetic
sunny-dreary
fireplaces-stoves
villas-dwellings
socialite-actress
great-terrible
perpetually-chronically
profits-profit
paintings-poems
impressive-unimpressive
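
The paper scores a candidate pair (x, y) by the cosine between x - y and the bias direction, and only considers pairs whose vectors lie within a distance threshold of each other. The sketch below is a simplified illustration of that scoring rule on made-up vectors; `score_pairs` is a hypothetical helper and is not the implementation behind `best_analogies_dist_thresh`.

```python
import itertools

import numpy as np

# Made-up toy vectors; the first axis plays the role of the bias direction.
toy = {
    "yacht": np.array([0.6, 0.8, 0.0]),
    "boat": np.array([0.0, 1.0, 0.0]),
    "villa": np.array([0.8, 0.0, 0.6]),
    "hut": np.array([0.0, 0.0, 1.0]),
}


def score_pairs(vec, direction, thresh=1.0):
    """Score ordered word pairs by cos(x - y, direction), keeping nearby pairs only."""
    direction = direction / np.linalg.norm(direction)
    scored = []
    for x, y in itertools.permutations(vec, 2):
        diff = vec[x] - vec[y]
        dist = np.linalg.norm(diff)
        if 0 < dist <= thresh:  # skip pairs that are too far apart to be related
            scored.append((x, y, float(diff.dot(direction) / dist)))
    return sorted(scored, key=lambda item: -item[2])


ranked = score_pairs(toy, np.array([1.0, 0.0, 0.0]))
pairs = [(x, y) for x, y, score in ranked if score > 0]
print(pairs)
```

The distance threshold is what keeps the pairs semantically related: without it, any two words with very different projections onto the direction would score highly.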

### Debiasing Word Embedding of Socioeconomic Status Bias
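We use the debias function from debiaswe/debias.py. It takes the embedding, a seed list of status-specific words that should keep their socioeconomic component, the definitional pairs used to estimate the bias direction, and the pairs to equalize.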

In [4]:
from debiaswe.debias import debias

definitional_pairs = [['rich', 'poor'], ['wealthy', 'impoverished'], ['educated', 'uneducated'],
                      ['millionaires', 'migrant_workers'], ['mansions', 'affordable_housing'],
                      ['limo', 'bus'], ['luxuries', 'basic_necessities'],
                      ['wealthiest', 'poorest'], ['wealthier', 'poorer']]
equalized_pairs = [['rich', 'poor'], ['wealthy', 'impoverished']]
# Tokens in this embedding join multi-word phrases with underscores
eco_specific_seed = ['mansion', 'yachts', 'villas', 'homelessness', 'millionaires', 'slums',
                     'migrant_workers', 'rich', 'affordable_housing', 'homeless_shelter',
                     'motel', 'limo']

debias(E, eco_specific_seed, definitional_pairs, equalized_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
{('wealthy', 'impoverished'), ('rich', 'poor'), ('WEALTHY', 'IMPOVERISHED'), ('RICH', 'POOR'), ('Wealthy', 'Impoverished'), ('Rich', 'Poor')}
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
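
The two core steps described in the paper, neutralizing a word and equalizing a pair, can be sketched on toy vectors. This is an illustration under stated assumptions: the vectors are made up, the direction is given rather than estimated from definitional pairs, and `drop` is a local helper defined here for the projection step.

```python
import numpy as np


def drop(u, v):
    """Return u with its component along v removed."""
    return u - v * u.dot(v) / v.dot(v)


# Toy unit-length bias direction and word vectors (made-up values).
direction = np.array([1.0, 0.0, 0.0])
neutral_word = np.array([0.6, 0.8, 0.0])
pair_a = np.array([0.6, 0.8, 0.0])
pair_b = np.array([-0.8, 0.6, 0.0])

# Neutralize: project the bias component out, then renormalize.
neutralized = drop(neutral_word, direction)
neutralized = neutralized / np.linalg.norm(neutralized)

# Equalize: give both words the same bias-free part y and
# opposite components of equal size along the direction.
y = drop((pair_a + pair_b) / 2, direction)
z = np.sqrt(1 - np.linalg.norm(y) ** 2)
if (pair_a - pair_b).dot(direction) < 0:
    z = -z  # keep each word on its original side of the direction
equal_a = z * direction + y
equal_b = -z * direction + y

print(neutralized.dot(direction))
print(equal_a.dot(direction), equal_b.dot(direction))
```

After these steps a neutralized word has no component along the direction, and an equalized pair differs only along it, so every neutralized word is equally close to both members of the pair.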


### Generating Socioeconomic Status Based Analogies After Debiasing

In [5]:
# socioeconomic analogies after debiasing
a_eco_debiased = E.best_analogies_dist_thresh(v_eco)

for (a,b,c) in a_eco_debiased:
    print(a+"-"+b)

Computing neighbors
Mean: 10.235249593157477
Median: 7.0
wealthy-impoverished
rich-poorest
poorer-poor
villas-mansions
millionaires-billionaires
yachts-boats
shelter-homeless_shelter
limo-limousine
mansion-penthouse
metropolis-slums
mental_illness-homelessness
inequality-poverty
multifamily-affordable_housing
thirst-hunger
surplus-deficit
sensitive-touchy
opportunity-chance
real_estate-condo
prominence-obscurity
surpluses-deficits
grief-despair
employment-unemployment
wide_variety-ranging
vast-colossal
renewables-greenhouse_gases
renewable-carbon_neutral
reasonably_priced-pricey
job_seekers-unemployed
transaction-tender_offer
borders-enclaves
broadcast-televised
rainfall-drought
valuable-prized
reigns-reigned
wallet-pocket
influenza-malaria
wisdom-humility
realtors-condos
travels-trips
favorable-unfavorable
wastewater-sewage
urgency-desperation
richness-splendor
investors-analysts
wealth-riches
bulk-mostly
strengthen-improve
stellar-subpar
mature-immature
buck-bucks
sadness-hopelessnes
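
To see what debiasing changed, the two analogy lists can be compared as sets. The helper below is a hypothetical convenience function, shown on a handful of pairs copied from the outputs above.

```python
def compare_analogies(before, after):
    """Split analogy pairs into those kept, dropped, and newly introduced."""
    before_set, after_set = set(before), set(after)
    return {
        "kept": sorted(before_set & after_set),
        "dropped": sorted(before_set - after_set),
        "new": sorted(after_set - before_set),
    }


# A few pairs taken from the printed lists before and after debiasing.
before = [("wealthiest", "poorest"), ("inequality", "poverty"),
          ("stellar", "subpar"), ("yachts", "fishing_boats")]
after = [("wealthy", "impoverished"), ("inequality", "poverty"),
         ("stellar", "subpar"), ("yachts", "boats")]

result = compare_analogies(before, after)
for label, pairs in result.items():
    print(label, pairs)
```

In the notebook the inputs could be built as `[(a, b) for a, b, _ in a_eco]` and `[(a, b) for a, b, _ in a_eco_debiased]`, assuming each entry is a 3-tuple as the loops above suggest.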

### Defining Nationality Direction

In [6]:
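# Pairs are laid out flat: even indices are "native" terms, odd indices are "foreign" terms
terms = ["national", "international", "citizen", "immigrant", "citizenship", "visa", "native", "alien", "domestic", "foreign"]
terms_group1 = [terms[2 * i] for i in range(len(terms) // 2)]
terms_group2 = [terms[2 * i + 1] for i in range(len(terms) // 2)]

# Sum each group's vectors (group 2 first), then normalize each sum to unit length
vs = [sum(E_nat.v(w) for w in group) for group in (terms_group2, terms_group1)]
vs = [v / np.linalg.norm(v) for v in vs]

# Unit vector pointing from the "foreign" group toward the "native" group
v_nat = vs[1] - vs[0]
v_nat = v_nat / np.linalg.norm(v_nat)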

### Generating Nationality Based Analogies of "Native : x :: Foreign : y"

In [7]:
# nationality analogies
a_nat = E_nat.best_analogies_dist_thresh(v_nat)

for (a,b,c) in a_nat:
    print(a+"-"+b)

citizenship-visa
national-international
guardsmen-marines
creature-alien
citizenry-elites
democracy-dictators
goats-camels
socialism-imperialist
doctorate-postgraduate
deplorable-inhuman
nation-continent
while-whilst
residents-locals
homeowners-landlords
watershed-estuary
permits-visas
rifle-machine_guns
proclamation-edict
mass_transit-subway
vitality-dynamism
woman-prostitute
steelmaker-iron_ore
halfback-flyhalf
woodlands-woodland
lieutenants-henchmen
native-natives
cutting-squeezing
patriots-revolutionaries
pipped-fancied
reform-liberalization
identity_theft-fraudsters
nationally-internationally
dubious-dodgy
motorist-tow_truck
roadways-parking_lots
nationals-foreigners
conservatism-fundamentalism
lowest-weakest
monument-tomb
communities-enclaves
championship-postseason
drivers-taxi_drivers
democratic-repressive
states-countries
heartwarming-captivating
integrator-integrators
fishes-sharks
broadband-telecom_operators
businessman-businessmen
waitress-waiters
academic_excellence-academ

### Debiasing Word Embedding of Nationality Bias

In [8]:
from debiaswe.debias import debias

nat_definitional_pairs = [['national','international'],['domestic','foreign'], ['native','alien'], ['citizenship','visa'], ['citizen','immigrant'],['familiar','exotic'], ['home','overseas'], ['national', 'abroad'], ['nationals','foreigners'], ['citizens','immigrants']]
nat_specific_seed = []
equalized_pairs = []

debias(E_nat, nat_specific_seed, nat_definitional_pairs, equalized_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
set()
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine


### Generating Nationality Based Analogies After Debiasing

In [9]:
# nationality analogies after debiasing
a_nat_debiased = E_nat.best_analogies_dist_thresh(v_nat)

for (a,b,c) in a_nat_debiased:
    print(a+"-"+b)

Computing neighbors
Mean: 10.24100215721152
Median: 7.0
citizenship-visa
abroad-national
skiing-ski
roadways-parking_lots
illegal_aliens-aliens
cutting-squeezing
astronauts-spaceship
reside-inhabit
internationally-international
citizen-foreigner
democracy-civil_society
regionally-regional
recuperating-sidelined
lieutenants-henchmen
graduate-student
native-natives
doctorate-professor
water_heaters-heaters
pears-pumpkin
woodlands-wooded
globally-global
guardsmen-marines
animals-creatures
wastewater-septic
mental_illness-psychotic
fiancee-pal
resides-sits
horses-horsemen
printed-pasted
corporal-commander
movies-sci_fi
minor_leagues-minor_league
demand-demands
expatriates-expatriate
missionary-evangelical
produces-creates
lifelong-longtime
largely-completely
destinations-attractions
sea_turtles-sharks
breeder-horse
fulfill-meet
oysters-crab
eggs-egg
ancestors-prehistoric
townhouse-storey
ammunition-machine_guns
plants-greenhouses
loved_ones-grieving
polymers-molecules
knee_surgery-groin
in

### Defining Age Direction

In [10]:
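# Pairs are laid out flat: even indices are "old" terms, odd indices are "young" terms
terms = ['old', 'young', 'parent', 'child', 'elderly', 'youth', 'past', 'future', 'old', 'new', 'death', 'birth']
terms_group1 = [terms[2 * i] for i in range(len(terms) // 2)]
terms_group2 = [terms[2 * i + 1] for i in range(len(terms) // 2)]

# Sum each group's vectors (group 2 first), then normalize each sum to unit length
vs = [sum(E_age.v(w) for w in group) for group in (terms_group2, terms_group1)]
vs = [v / np.linalg.norm(v) for v in vs]

# Unit vector pointing from the "young" group toward the "old" group
v_age = vs[1] - vs[0]
v_age = v_age / np.linalg.norm(v_age)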

### Generating Age Based Analogies of "Old : x :: Young : y"

In [11]:
# age analogies
a_age = E_age.best_analogies_dist_thresh(v_age)
for (a,b,c) in a_age:
    print(a+"-"+b)

middle_aged-young
old-olds
olds-youngsters
grandma-kids
daughter-child
last-this
bothers-excites
aunt-parents
grandmother-grandparents
woman-women
mother-mothers
eventful-exciting
girl-girls
caring-nurturing
unleashed-unleash
accidentally-unintentionally
mild_mannered-charismatic
executed-execute
upstart-fledgling
stepdaughter-daughters
nurse-midwives
entrepreneur-entrepreneurial
opposite_direction-direction
apparently-presumably
later-shortly_thereafter
killed-martyred
speculated-hinted
entertainer-entertainers
carer-childcare
reservist-commanding_officer
teenagers-youth
veteran-veterans
eyesore-amenity
retiree-retirees
after-shortly
sexually_assaulting-sexual_misconduct
recent-latest
tacklers-playmakers
smoker-smoking
disgruntled-disaffected
learned-learn
survive-thrive
unsuccessful-successful
apples-fruits
triplets-birth
cranky-restless
happened-happening
prowess-talents
grandmothers-fathers
gelding-stallion
businessman-businessmen
discovered-discover
identified-identify
pensioner-h

### Debiasing Word Embedding of Age Bias

In [12]:
from debiaswe.debias import debias

age_definitional_pairs = [['old','young'],['adult','child'], ['father','son'], ['mother','son'], ['father','daughter'],['mother','daughter'], ['grandpa','grandson'], ['man', 'boy'], ['grandmother','father'], ['older','younger']]
age_specific_seed = []
equalized_pairs = []

debias(E_age, age_specific_seed, age_definitional_pairs, equalized_pairs)

26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine
set()
26423 words of dimension 300 : in, for, that, is, ..., Jay, Leroy, Brad, Jermaine


### Generating Age Based Analogies After Debiasing

In [13]:
# age analogies after debiasing
a_age_debiased = E_age.best_analogies_dist_thresh(v_age)

for (a,b,c) in a_age_debiased:
    print(a+"-"+b)

Computing neighbors
Mean: 10.21186087877985
Median: 7.0
old-young
caregiver-child
mother-baby
grandma-kids
upstart-fledgling
caring-nurturing
last-next
recent-latest
uncle-son
bothers-excites
aunt-grandchild
irate-livid
pensioner-mum
menopause-pregnancy
olds-youngsters
widower-daughter
speculated-hinted
eventful-exciting
vegetables-fruits
relatives-parents
mild_mannered-charismatic
grandmothers-mothers
dispose-disposal
toddler-newborn
appreciative-excited
births-birth
unemployed-employment
consisted-includes
good_natured-playful
retired-retires
carer-childcare
killed-martyred
mentally_ill-mental_health
occasionally-whenever
eyesore-redeveloped
competed-compete
nurse-midwife
polite-forthright
drug_trafficking-human_trafficking
musicals-musical
mainstay-cornerstone
taught-teach
upscale-luxurious
angry-anxious
unleashed-unleash
geologist-geological
after-shortly
renews-unveils
apparently-presumably
analog-digital
upperclassmen-incoming_freshmen
stench-scent
hurt-hamper
tacklers-playmakers