# INFORMATION RETRIEVAL PROJECT
# 2. ANALYSIS OF GENDER STEREOTYPES BY YEARS - PRELIMINARY ANALYSIS

---
## Gender stereotypes in parliamentary speeches

In word embedding models, each word is mapped to a high-dimensional vector such that the geometry of the vectors captures semantic relations between the words: for example, vectors that lie closer together have been shown to correspond to more similar words. Recent work in machine learning demonstrates that word embeddings also capture common stereotypes, as these stereotypes are likely to be present, even if subtly, in the large corpora of training texts. The embedding algorithm learns these stereotypes automatically, which can be problematic in many contexts if the embedding is then used for sensitive applications such as search rankings, product recommendations, or translations. Developing algorithms to debias word embeddings is therefore an important direction of research.
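As a minimal illustration of what "closer together" means, the sketch below computes the cosine similarity between toy 3-dimensional vectors. The vectors and their word labels are invented for the example and are not taken from the project's models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (made up for illustration, not real embeddings)
toy_vectors = {
    "madre": [0.9, 0.1, 0.2],
    "padre": [0.8, 0.2, 0.3],
    "bilancio": [0.1, 0.9, 0.1],
}

sim_related = cosine_similarity(toy_vectors["madre"], toy_vectors["padre"])
sim_unrelated = cosine_similarity(toy_vectors["madre"], toy_vectors["bilancio"])
print(sim_related > sim_unrelated)
```

A value near 1 means the two vectors point in almost the same direction, while a value near 0 means they are close to orthogonal.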

This project uses word embeddings to study historical trends, specifically trends in gender and ethnic stereotypes in Italian parliamentary speeches from 1948 to 2020.

In [1]:
import numpy as np
import pandas as pd
import gensim
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
import pickle
import os
from collections import defaultdict, OrderedDict
from tqdm.auto import tqdm
import itertools
from itertools import product
import json
from sklearn.feature_extraction.text import TfidfVectorizer

from INFORET_project import load_embed_model
# import matplotlib.pylab as plt
pd.set_option("display.max_rows", 100, "display.max_columns", 100)

Load a different model for each time period

In [13]:
from INFORET_project import YEARS

In [14]:
YEARS

['1948_1968', '1968_1985', '1985_2000', '2000_2020']

In [None]:
model = load_embed_model(YEARS[0])

In [7]:
model = load_embed_model(YEARS[1])

In [8]:
model = load_embed_model(YEARS[2])

In [16]:
model = load_embed_model(YEARS[3])

---

## 1) PRELIMINARY ANALYSIS
Create a group of gendered words and compute their mean vector, then retrieve the words most similar to that mean vector. This gives a first hint of the words most closely related to each gender.
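The idea can be sketched with NumPy on a toy vocabulary. This is not the project's `similar_to_avg_vector` implementation. The function name `nearest_to_mean`, the toy vectors, the unit-normalisation before averaging, and the exclusion of the seed words from the result are all assumptions made for illustration.

```python
import numpy as np

def nearest_to_mean(vectors, seed_words, topn=3):
    """Rank vocabulary words by cosine similarity to the mean of the seed vectors."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Normalise rows so that a dot product equals cosine similarity
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    seed_idx = [words.index(w) for w in seed_words]
    mean_vec = matrix[seed_idx].mean(axis=0)
    mean_vec /= np.linalg.norm(mean_vec)
    sims = matrix @ mean_vec
    # Sort by decreasing similarity and skip the seed words themselves
    ranked = [(words[i], float(sims[i])) for i in np.argsort(-sims)
              if words[i] not in seed_words]
    return ranked[:topn]

# Toy vectors (made up for illustration, not real embeddings)
toy = {
    "donna": [0.9, 0.1, 0.0],
    "madre": [0.8, 0.2, 0.1],
    "ragazza": [0.85, 0.15, 0.05],
    "bilancio": [0.0, 0.1, 0.9],
}
result = nearest_to_mean(toy, ["donna", "madre"], topn=2)
print(result)
```

With real models, the vocabulary is far larger and noisier, which is why the nearest neighbours can include rare tokens and proper names.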

In [58]:
# returns the words nearest to the averaged vector of a group of words (cosine similarity is used)
from INFORET_project import similar_to_avg_vector
# import the dictionary of useful words
from INFORET_project.data import gendered_neutral_words

In [31]:
# For every time period, print the words most similar to the averaged vector of each gender

for year in YEARS:
    print(f"\nYears: {year}")
    model = load_embed_model(year)
    
    for gender in ['male','female']:
        print(f"\nMost similar words to {gender} vector:")
        _ = similar_to_avg_vector(model.wv, gendered_neutral_words[gender]) 


Years: 1948_1968

Most similar words to male vector:


[('Zappelli', 0.5698506832122803),
 ('Settembrini', 0.5646949410438538),
 ('lei', 0.562781572341919),
 ('Lho', 0.5611302256584167),
 ('maschiare', 0.5558352470397949),
 ('divulgatore', 0.5550841093063354),
 ('Uscito', 0.5509503483772278),
 ('brav', 0.5507361888885498),
 ('piglio', 0.5503260493278503),
 ('Spellanzon', 0.5499021410942078),
 ('Montemartini', 0.5494170188903809),
 ('Maffi', 0.545027494430542),
 ('Roggero', 0.5446869134902954),
 ('possedette', 0.543790876865387),
 ('casata', 0.5433430075645447),
 ('Olgiati', 0.5430217981338501),
 ('causidica', 0.5428649187088013),
 ('novantenne', 0.5408570170402527),
 ('Ricordati', 0.539644181728363),
 ('Filpo', 0.5394607186317444)]


Most similar words to female vector:


[('maschio', 0.6027352809906006),
 ('coniugato', 0.5403891801834106),
 ('bambino', 0.5359815359115601),
 ('madre', 0.5300886034965515),
 ('A4110ra', 0.5290261507034302),
 ('Duro', 0.5266222953796387),
 ('pizza', 0.5260553359985352),
 ('traviata', 0.5229917764663696),
 ('caramellare', 0.5219253301620483),
 ('Speravano', 0.519498884677887),
 ('educabili', 0.5186226963996887),
 ('piazzarsi', 0.5178252458572388),
 ('tenerissima', 0.5173928141593933),
 ('ragazza', 0.516727864742279),
 ('PerchP', 0.5149996280670166),
 ('Parlavamo', 0.5093346238136292),
 ('epoi', 0.5088962316513062),
 ('magliaie', 0.5078704953193665),
 ('Nbn', 0.507202684879303),
 ('ragazzina', 0.5067441463470459)]


Years: 1968_1985

Most similar words to male vector:


[('imparentare', 0.5919234752655029),
 ('dabbene', 0.5873482823371887),
 ('Cascavilla', 0.5757697224617004),
 ('moraleggiare', 0.57254558801651),
 ('Pascal', 0.5718967914581299),
 ('degnissima', 0.5694552063941956),
 ('toccarla', 0.5629111528396606),
 ('incestuoso', 0.5626682639122009),
 ('Mahler', 0.5580506324768066),
 ('Pagheranno', 0.5543779730796814),
 ('Ponzo', 0.5536849498748779),
 ('Studenti', 0.5516732931137085),
 ('andarvi', 0.5507052540779114),
 ('integerrimo', 0.5496920943260193),
 ('focoso', 0.5495660305023193),
 ('sottoccupato', 0.5480859279632568),
 ('peccatore', 0.5471469759941101),
 ('svillaneggiare', 0.5464788675308228),
 ('suggestivamente', 0.5458648800849915),
 ('violabili', 0.545601487159729)]


Most similar words to female vector:


[('emancipato', 0.6487357020378113),
 ('sessualmente', 0.6152198314666748),
 ('maschiare', 0.6015832424163818),
 ('giovanissime', 0.5884872674942017),
 ('nubile', 0.5852416157722473),
 ('focomelico', 0.5786728858947754),
 ('55°', 0.5733279585838318),
 ('evirare', 0.5684111714363098),
 ('lei', 0.5655562281608582),
 ('lesbica', 0.565072774887085),
 ('abortire', 0.5640296339988708),
 ('ragazza', 0.5634415149688721),
 ('sgraziato', 0.5633205771446228),
 ('incinto', 0.5623881220817566),
 ('Resti', 0.5620538592338562),
 ('puerpera', 0.561506450176239),
 ('Vapona', 0.5599506497383118),
 ('malformato', 0.5562317371368408),
 ('nutrice', 0.5554565191268921),
 ('incestuoso', 0.5538656711578369)]


Years: 1985_2000

Most similar words to male vector:


[('capoccia', 0.5751026272773743),
 ('supplicare', 0.5648277401924133),
 ('evangelista', 0.5597399473190308),
 ('coglione', 0.5544094443321228),
 ('prostituire', 0.5524687767028809),
 ('tenerezza', 0.5514913201332092),
 ('aguzzino', 0.5504509210586548),
 ('Nietzsche', 0.5490137338638306),
 ('immodestamente', 0.545851469039917),
 ('sentirla', 0.5446656942367554),
 ('battagliero', 0.5441685914993286),
 ('Votano', 0.5432336330413818),
 ('squartare', 0.542973518371582),
 ('Parlava', 0.5422471165657043),
 ('abate', 0.5420100092887878),
 ('cannonata', 0.5413274765014648),
 ('ubriacare', 0.5363138318061829),
 ('motoretta', 0.5358209609985352),
 ('Arbore', 0.5346950888633728),
 ('Milton', 0.5343030691146851)]


Most similar words to female vector:


[('divorziato', 0.5739867091178894),
 ('menopausa', 0.5716233849525452),
 ('giovanissime', 0.5603377819061279),
 ('incinto', 0.558162271976471),
 ('partoriente', 0.5513758659362793),
 ('sieropositivo', 0.551144540309906),
 ('monoparentali', 0.5507063269615173),
 ('empowerment', 0.5504602789878845),
 ('motoretta', 0.5501375198364258),
 ('bambino', 0.548194408416748),
 ('single', 0.546561062335968),
 ('bambina', 0.5425636768341064),
 ('persona', 0.5407829284667969),
 ('ragazza', 0.5405844449996948),
 ('accudire', 0.5405091643333435),
 ('adulto', 0.5365861654281616),
 ('giovane', 0.5364285707473755),
 ('procreare', 0.5351355075836182),
 ('studentessa', 0.5350608229637146),
 ('maciullare', 0.5350210070610046)]


Years: 2000_2020

Most similar words to male vector:


[('moderatore', 0.5495796799659729),
 ('magistrate', 0.537135124206543),
 ('Uomini', 0.525381326675415),
 ('probo', 0.5215122103691101),
 ('anonimamente', 0.518482506275177),
 ('gesuita', 0.517815887928009),
 ('Mensorio', 0.5054284930229187),
 ('Franchi', 0.5021278262138367),
 ('PadoaSchioppa', 0.5015751719474792),
 ('Masih', 0.5006906390190125),
 ('Rapisarda', 0.49734416604042053),
 ('Restituiamo', 0.49563267827033997),
 ('Pinchera', 0.4954855442047119),
 ('eroicamente', 0.4952343702316284),
 ('icari', 0.49391478300094604),
 ('Castiglioni', 0.49375438690185547),
 ('semianalfabeta', 0.4926125407218933),
 ('onorabile', 0.49101564288139343),
 ('Sponziello', 0.4908043444156647),
 ('Dracula', 0.4868161976337433)]


Most similar words to female vector:


[('incinto', 0.61652010679245),
 ('maschiare', 0.5876336693763733),
 ('bambina', 0.5753710865974426),
 ('spose', 0.5685369372367859),
 ('uomini', 0.5532326102256775),
 ('Uomini', 0.5529881715774536),
 ('picchiata', 0.5529308915138245),
 ('normodotate', 0.5496178865432739),
 ('genealogia', 0.5464109182357788),
 ('ragazza', 0.5432000756263733),
 ('monoparentali', 0.5316020846366882),
 ('fattrice', 0.5290380716323853),
 ('puerpera', 0.5267908573150635),
 ('femminilita', 0.5266191363334656),
 ('giovanissime', 0.5249960422515869),
 ('persona', 0.5235183835029602),
 ('bambino', 0.5234599113464355),
 ('bisessuale', 0.5211064219474792),
 ('giovanissima', 0.5206022262573242),
 ('menopausa', 0.5193460583686829)]

## 2) TF-IDF BASELINE

Use TF-IDF to create a baseline for assessing the performance of the other methods. <br>
First, compute the TF-IDF of the words in the documents, split by speaker gender and time period. Then, for each time period, compute the absolute difference between the TF-IDF of male and female speakers. The larger this difference for a word, the higher its bias. <br>
Compute the TF-IDF only for the words contained in the word groups used for this analysis. Then compute the average bias for each group of words and use it to rank the groups.

The TF-IDF vectorisation was run on the ISLab virtual machine, so the results are copy-pasted from the shell to avoid downloading large data to the local machine.
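To make the baseline concrete, here is a small pure-Python sketch of TF-IDF with a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation, which is the weighting scikit-learn's `TfidfVectorizer` applies by default. The function name `tfidf` and the two toy documents are assumptions for illustration; the real corpus has one document per gender and time period.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed IDF and L2 row normalisation.

    docs: list of non-empty token lists.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# Two toy "documents": all speeches by male and by female speakers in one period
male_doc = "legge bilancio donna legge governo".split()
female_doc = "donna famiglia donna legge lavoro".split()
male_row, female_row = tfidf([male_doc, female_doc])

# Absolute difference in TF-IDF for one word, the quantity used as bias here
bias_donna = abs(male_row.get("donna", 0.0) - female_row.get("donna", 0.0))
print(round(bias_donna, 5))
```

Because each row is L2-normalised, scores are comparable across documents of very different lengths, which matters here since the male and female sub-corpora differ greatly in size.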

In [None]:
# create the corpus, where each document contains all the documents of a specific
# gender and time period

basepath = '/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481'

YEARS = [ "1948_1968", "1968_1985", "1985_2000", "2000_2020" ]
GENDER = ["male","female"]
corpus = []

for years,gender in tqdm(product(YEARS,GENDER),
                        total=len(YEARS)*len(GENDER)):
    print(f'YEARS: {years}, GENDER: {gender}')
    with open(os.path.join(basepath,f'docs_by_years_gender_{years}_{gender}.pickle'), "rb") as fin:
        docs = pickle.load(fin)
        # append to the corpus the flattened list of documents of each gender and time period
        corpus.append(list(itertools.chain.from_iterable(docs)))

Order of documents is:

```
  0%|                                                     | 0/8 [00:00<?, ?it/s]
YEARS: 1948_1968, GENDER: male
 12%|█████▋                                       | 1/8 [00:22<02:39, 22.78s/it]
YEARS: 1948_1968, GENDER: female
 25%|███████████▎                                 | 2/8 [00:25<01:06, 11.03s/it]
YEARS: 1968_1985, GENDER: male
 38%|████████████████▉                            | 3/8 [00:39<01:01, 12.25s/it]
YEARS: 1968_1985, GENDER: female
 50%|██████████████████████▌                      | 4/8 [00:44<00:38,  9.61s/it]
YEARS: 1985_2000, GENDER: male
 62%|████████████████████████████▏                | 5/8 [01:00<00:35, 11.85s/it]
YEARS: 1985_2000, GENDER: female
 75%|█████████████████████████████████▊           | 6/8 [01:05<00:18,  9.49s/it]
YEARS: 2000_2020, GENDER: male
 88%|███████████████████████████████████████▍     | 7/8 [01:31<00:15, 15.01s/it]
YEARS: 2000_2020, GENDER: female
100%|█████████████████████████████████████████████| 8/8 [01:35<00:00, 11.92s/it]
```

In [None]:
# Compute TF-IDF 
vectorizer = TfidfVectorizer()
corpus = [' '.join(doc) for doc in corpus] 
X = vectorizer.fit_transform(corpus)
# retrieve the words in the corpus as a list
# (scikit-learn >= 1.0; get_feature_names() was removed in scikit-learn 1.2)
features = list(vectorizer.get_feature_names_out())

In [None]:
print(X.shape)

Shape of X is: ```(8, 500080)```

There are 8 documents (the corpus split by gender and time period) and 500080 distinct words.

In [4]:
# Return the TF-IDF score of a word in each document (one row per document)

def word_to_tfidf(X, features, word):
    index = features.index(word)
    return X[:,index].toarray()

In [None]:
word_to_tfidf(X, features, 'donna')

Output is:
```
array([[0.00071704],
       [0.01754887],
       [0.00133872],
       [0.01613872],
       [0.00077914],
       [0.01025392],
       [0.00139078],
       [0.01102728]])
```
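The `word_to_tfidf` helper above calls `features.index(word)`, which scans the whole feature list on every call and can be slow with about 500,000 features. One possible alternative, sketched here on a toy dense matrix, is to build a word-to-column dictionary once. The names `make_lookup` and `column_for` are hypothetical. With a fitted `TfidfVectorizer`, the `vocabulary_` attribute already provides such a term-to-index mapping.

```python
def make_lookup(features):
    """Map each feature name to its column index (built once, constant-time lookups after)."""
    return {word: i for i, word in enumerate(features)}

def column_for(rows, lookup, word):
    """Return the scores of `word` in every row, or None if the word is unknown."""
    index = lookup.get(word)
    if index is None:
        return None
    return [row[index] for row in rows]

# Toy stand-ins for the real feature list and TF-IDF matrix
toy_features = ["donna", "legge", "uomo"]
toy_rows = [[0.1, 0.7, 0.3],
            [0.6, 0.5, 0.0]]

lookup = make_lookup(toy_features)
donna_scores = column_for(toy_rows, lookup, "donna")
missing = column_for(toy_rows, lookup, "parola_assente")
print(donna_scores, missing)
```

Returning `None` for unknown words also avoids relying on an exception to detect words that are missing from the vocabulary.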

In [None]:
with open(os.path.join(basepath,'gendered_neutral_words.json')) as fin:
    gendered_neutral_words = json.load(fin)

WORDS_GROUP = list(gendered_neutral_words.keys())[4:]

In [None]:
# create and populate a dataframe to store the TF-IDF of all the relevant words
# for the different genders and time periods

# sorted() gives a deterministic column order
columns = sorted({w for group in WORDS_GROUP for w in gendered_neutral_words[group]})
tfidf_words = pd.DataFrame(columns=columns, index=list(product(YEARS,GENDER)))

for word in tfidf_words.columns:
    try:
        tfidf = word_to_tfidf(X, features, word)
        tfidf_words[word] = np.round(tfidf,5)
    except ValueError:
        # the word is not in the TF-IDF vocabulary: leave its column empty
        pass
    
tfidf_words.to_csv(os.path.join(basepath,'tfidf_words_dataframe'))

In [None]:
#!/bin/bash
BASEPATH_src=/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481
file=$BASEPATH_src'/tfidf_words_dataframe'
scp -P 22 student@***.**.**.**:$file ~/Gender-stereotypes-in-parliamentary-speeches-with-Word-Embedding/misc

---
Load the data in the notebook

In [60]:
from INFORET_project import WORDS_GROUP
from INFORET_project.data import gendered_neutral_words

In [61]:
tfidf_words = pd.read_csv('misc/tfidf_words_dataframe',
                         index_col='Unnamed: 0').T

In [62]:
# for each time period, create a new column containing the absolute difference in TF-IDF
# between male and female speakers

for year in YEARS:
    tfidf_words[f'bias_{year}'] = abs(tfidf_words[f"('{year}', 'male')"] - tfidf_words[f"('{year}', 'female')"])

In [64]:
tfidf_words.head()

| Unnamed: 0 | ('1948_1968', 'male') | ('1948_1968', 'female') | ('1968_1985', 'male') | ('1968_1985', 'female') | ('1985_2000', 'male') | ('1985_2000', 'female') | ('2000_2020', 'male') | ('2000_2020', 'female') | bias_1948_1968 | bias_1968_1985 | bias_1985_2000 | bias_2000_2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| timido | 9e-05 | 6e-05 | 8e-05 | 0.0001 | 7e-05 | 9e-05 | 9e-05 | 0.0001 | 3e-05 | 2e-05 | 2e-05 | 1e-05 |
| ambizioso | 5e-05 | 0.0 | 6e-05 | 3e-05 | 7e-05 | 0.00014 | 0.00018 | 0.00021 | 5e-05 | 3e-05 | 7e-05 | 3e-05 |
| sensuale | 0.0 | 1e-05 | 0.0 | 2e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 1e-05 | 2e-05 | 0.0 | 0.0 |
| maschile | 5e-05 | 0.00091 | 6e-05 | 0.00041 | 4e-05 | 0.00041 | 4e-05 | 0.00033 | 0.00086 | 0.00035 | 0.00037 | 0.00029 |
| bello | 0.0006 | 0.00089 | 0.00041 | 0.00041 | 0.00044 | 0.00047 | 0.00063 | 0.0007 | 0.00029 | 0.0 | 3e-05 | 7e-05 |


For each time period and each group of words, store the mean bias of the words within the group. Then rank the groups of words according to their average bias.
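The averaging and ranking step can be illustrated on toy data with the standard library alone. The function name `rank_groups`, the bias values, and the group memberships below are invented for the example; the notebook itself works on the `tfidf_words` dataframe.

```python
from statistics import mean

def rank_groups(word_bias, groups):
    """Average the per-word bias within each group and sort groups from most to least biased."""
    averages = {
        name: mean(word_bias[w] for w in words if w in word_bias)
        for name, words in groups.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Toy values (made up for illustration)
toy_bias = {"madre": 0.004, "figlio": 0.002, "ministro": 0.001, "timido": 0.00002}
toy_groups = {
    "family": ["madre", "figlio"],
    "career": ["ministro"],
    "passive": ["timido"],
}
ranking = rank_groups(toy_bias, toy_groups)
print(ranking)
```

Note that a group mean is sensitive to a single high-bias word, so groups with few members can move a lot between periods.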

In [49]:
tfidf_dict = defaultdict(lambda: defaultdict(int))

for group in WORDS_GROUP:
    rows = gendered_neutral_words[group]
    data = tfidf_words.loc[rows]
    for year in YEARS:
        tfidf_dict[f'{year}'][f'{group}'] = data[f'bias_{year}'].mean()

In [57]:
for year in YEARS:
    print(f'YEAR: {year}')
    diction = tfidf_dict[f'{year}']
    # sort the groups of words by their bias, from most to least biased
    ranked = sorted(diction.items(), key=lambda x: x[1], reverse=True)
    print('\nTOP BIASED TOPICS:')
    display(ranked[:6])
    print('LEAST BIASED TOPICS:')
    display(ranked[6:])

YEAR: 1948_1968

TOP BIASED TOPICS:


[('family', 0.003511666666666667),
 ('career', 0.0021229999999999995),
 ('gendered_words', 0.001305),
 ('intelligence', 8.500000000000006e-05),
 ('kindness', 8.428571428571435e-05),
 ('female_stereotypes', 7.76923076923077e-05)]

LEAST BIASED TOPICS:


[('rage', 7.333333333333333e-05),
 ('passive', 5.399999999999999e-05),
 ('active', 5.000000000000002e-05),
 ('adj_appearence', 3.833333333333334e-05),
 ('male_stereotypes', 3.8000000000000016e-05),
 ('dumbness', 2.4000000000000007e-05)]

YEAR: 1968_1985

TOP BIASED TOPICS:


[('career', 0.002613999999999999),
 ('family', 0.001535),
 ('gendered_words', 0.000555),
 ('kindness', 0.00012714285714285716),
 ('intelligence', 0.00011666666666666665),
 ('female_stereotypes', 9.461538461538461e-05)]

LEAST BIASED TOPICS:


[('passive', 8.599999999999999e-05),
 ('active', 8.000000000000002e-05),
 ('male_stereotypes', 7.300000000000001e-05),
 ('dumbness', 3.5999999999999994e-05),
 ('rage', 2.9999999999999997e-05),
 ('adj_appearence', 1.4999999999999999e-05)]

YEAR: 1985_2000

TOP BIASED TOPICS:


[('career', 0.0022179999999999986),
 ('family', 0.0013283333333333333),
 ('gendered_words', 0.00041100000000000007),
 ('active', 0.000126),
 ('intelligence', 0.00011),
 ('passive', 9.4e-05)]

LEAST BIASED TOPICS:


[('male_stereotypes', 8.1e-05),
 ('female_stereotypes', 5.384615384615385e-05),
 ('kindness', 3.571428571428574e-05),
 ('dumbness', 2.8000000000000003e-05),
 ('rage', 2.3333333333333332e-05),
 ('adj_appearence', 8.333333333333329e-06)]

YEAR: 2000_2020

TOP BIASED TOPICS:


[('career', 0.0016339999999999996),
 ('family', 0.001555),
 ('gendered_words', 0.00042400000000000006),
 ('rage', 4.333333333333333e-05),
 ('passive', 4.2e-05),
 ('intelligence', 4.166666666666677e-05)]

LEAST BIASED TOPICS:


[('active', 3.999999999999993e-05),
 ('female_stereotypes', 3.923076923076923e-05),
 ('kindness', 3.71428571428571e-05),
 ('male_stereotypes', 2.7999999999999966e-05),
 ('adj_appearence', 1.2499999999999997e-05),
 ('dumbness', 1.2000000000000004e-05)]