# INFORMATION RETRIEVAL PROJECT

---
## Gender stereotypes in parliamentary speeches

In word embedding models, each word is mapped to a high-dimensional vector such that the geometry of the vectors captures semantic relations between the words; for example, vectors that lie closer together have been shown to correspond to more similar words. Recent work in machine learning shows that word embeddings also capture common stereotypes, since these stereotypes are likely to be present, even if subtly, in the large corpora used for training. The embedding algorithm learns these stereotypes automatically, which can be problematic in many contexts when the embedding is then used for sensitive applications such as search rankings, product recommendations, or translations. An important research direction is therefore the development of algorithms that debias word embeddings.
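
As a minimal illustration of the idea that closer vectors correspond to more similar words, the sketch below computes the cosine similarity between a few vectors. The three-dimensional vectors and the words attached to them are made up for the example; they are not taken from the models trained in this project.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
toy_vectors = {
    "deputato": [0.9, 0.1, 0.3],
    "deputata": [0.8, 0.3, 0.3],
    "bilancio": [0.1, 0.9, 0.7],
}

sim_close = cosine_similarity(toy_vectors["deputato"], toy_vectors["deputata"])
sim_far = cosine_similarity(toy_vectors["deputato"], toy_vectors["bilancio"])
print(f"deputato/deputata: {sim_close:.3f}")
print(f"deputato/bilancio: {sim_far:.3f}")
```

With real embeddings the same comparison would be run on the 300-dimensional vectors produced by the models trained below.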

This project uses word embeddings to study historical trends, specifically trends in gender and ethnic stereotypes in Italian parliamentary speeches from 1948 to 2020.

In [1]:
import pymongo
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
from tqdm.auto import tqdm
import pickle
import os
from itertools import product



# 1. DOWNLOAD DATA AND TRAIN EMBEDDING MODELS

Data exploration and training of the embedding models were run on the ISLab virtual machine; the outputs shown below are copied from its shell, so that no data has to be downloaded to the local machine.

In [None]:
db = pymongo.MongoClient()['gender_politics']
collection = db.list_collection_names(include_system_collections=False)

In [None]:
for name in collection:
    print(name)

Collections are:
- ```'tf-gender-legislature'```: tf values of tokens divided by gender (male/female) and legislature (1-18)
- ```'tokenization'```
- ```'tfidf-gender-legislature'```: tf-idf values of tokens divided by gender (male/female) and legislature (1-18)
- ```'tfidf-year'```: tf-idf values of tokens divided by year (from 1948 to 2020)
- ```'tfidf-deputy-year'```: tf-idf values of tokens divided by deputy (website in format http://dati.camera.it/ocd/persona.rdf/...) and by year (from 1948 to 2020)
- ```'tfidf-gender'```: tf-idf values of tokens divided by gender (male/female)
- ```'mi-year'```
- ```'mi-gender-legislature'```

In [None]:
# to show the fields of each collection
cursor=db['tokenization'].find_one()
cursor.keys()

The fields in the collection ```tokenization``` are: 

```
['_id', 'segment', 'tag', 'start', 'president', 'page', 'dep', 'surname', 'name', 'len', 'score', 'cognome', 'nome', 'info', 'dataNascita', 'luogoNascita', 'inizioMandato', 'fineMandato', 'collegio', 'numeroMandati', 'aggiornamento', 'year', 'month', 'day', 'id', 'convocation', 'date', 'title', 'text', 'speech', 'legislature', 'gender', 'groupname', 'rdfid', 'presidency', 'group_cluster', 'match_validation', 'paragraphs']
```

In [None]:
# to count the total number of documents
db.tokenization.estimated_document_count()

Total number of documents is ```1197023```

In [None]:
# to return the text
# the field 'paragraphs' contains a list of dictionaries, each holding a 'text'

cursor = db.tokenization.find(
    {},
    { "_id": 0, "paragraphs":1}
)

for i in cursor:
    print(i)
    break

Part of the returned output: 

```
{'paragraphs': [{'text': 'Signor Presidente signor ministro , il provvedimento che stiamo affrontando e quanto di piu improvvisato , confuso , rabberciato e contraddittorio ci potessimo attendere .', 'ents': [{'start': 0, 'end': 33, 'label': 'MISC', 'text': 'Signor Presidente signor ministro'}], 'id': 0, 'tokens': [{'id': 0, 'start': 0, 'end': 6, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'Signor', 'dep': 'vocative', 'head': 23, 'is_stop': False, 'is_oov': False, 'stem': 'signor'}, {'id': 1, 'start': 7, 'end': 17, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'Presidente', 'dep': 'compound', 'head': 0, 'is_stop': False, 'is_oov': False, 'stem': 'president'}, {'id': 2, 'start': 18, 'end': 24, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'signore', 'dep': 'nmod', 'head': 0, 'is_stop': False, 'is_oov': False, 'stem': 'signor'}, {'id': 3, 'start': 25, 'end': 33, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Number=Sing', 'lemma': 'ministrare', 'dep': 'compound', 'head': 2, 'is_stop': False, 'is_oov': False, 'stem': 'ministr'}, {'id': 4, 'start': 34, 'end': 35, 'tag': 'FF', 'pos': 'PUNCT', 'morph': '', 'lemma': ',', 'dep': 'punct', 'head': 0, 'is_stop': False, 'is_oov': False, 'stem': ','}, {'id': 5, 'start': 36, 'end': 38, 'tag': 'RD', 'pos': 'DET', 'morph': 'Definite=Def|Gender=Masc|Number=Sing|PronType=Art', 'lemma': 'il', 'dep': 'det', 'head': 6, 'is_stop': True, 'is_oov': False, 'stem': 'il'}, {'id': 6, 'start': 39, 'end': 52, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'provvedimento', 'dep': 'nsubj', 'head': 23, 'is_stop': False, 'is_oov': False, 'stem': 'provved'}, {'id': 7, 'start': 53, 'end': 56, 'tag': 'PR', 'pos': 'PRON', 'morph': 'PronType=Rel', 'lemma': 'che', 'dep': 'nsubj', 'head': 9, 'is_stop': True, 'is_oov': False, 'stem': 'che'}, {'id': 8, 'start': 57, 'end': 63, 'tag': 'VA', 'pos': 'AUX', 'morph': 
'Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin', 'lemma': 'stare', 'dep': 'aux', 'head': 9, 'is_stop': True, 'is_oov': False, 'stem': 'stiam'}, {'id': 9, 'start': 64, 'end': 75, 'tag': 'V', 'pos': 'VERB', 'morph': 'VerbForm=Ger', 'lemma': 'affrontare', 'dep': 'acl:relcl', 'head': 6, 'is_stop': False, 'is_oov': False, 'stem': 'affront'}, {'id': 10, 'start': 76, 'end': 77, 'tag': 'CC', 'pos': 'CCONJ', 'morph': '', 'lemma': 'e', 'dep': 'cc', 'head': 11, 'is_stop': False, 'is_oov': False, 'stem': 'e'}, {'id': 11, 'start': 78, 'end': 84, 'tag': 'B', 'pos': 'ADV', 'morph': '', 'lemma': 'quanto', 'dep': 'conj', 'head': 6, 'is_stop': True, 'is_oov': False, 'stem': 'quant'}, {'id': 12, 'start': 85, 'end': 87, 'tag': 'E', 'pos': 'ADP', 'morph': '', 'lemma': 'di', 'dep': 'case', 'head': 14, 'is_stop': True, 'is_oov': False, 'stem': 'di'}, {'id': 13, 'start': 88, 'end': 91, 'tag': 'B', 'pos': 'ADV', 'morph': '', 'lemma': 'piu', 'dep': 'advmod', 'head': 14, 'is_stop': True, 'is_oov': False, 'stem': 'piu'}, {'id': 14, 'start': 92, 'end': 104, 'tag': 'A', 'pos': 'ADJ', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'improvvisare', 'dep': 'nmod', 'head': 11, 'is_stop': False, 'is_oov': False, 'stem': 'improvvis'}, {'id': 15, 'start': 105, 'end': 106, 'tag': 'FF', 'pos': 'PUNCT', 'morph': '', 'lemma': ',', 'dep': 'punct', 'head': 16, 'is_stop': False, 'is_oov': False, 'stem': ','}, {'id': 16, 'start': 107, 'end': 114, 'tag': 'A', 'pos': 'ADJ', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'confondere', 'dep': 'amod', 'head': 6, 'is_stop': False, 'is_oov': False, 'stem': 'confus'}, {'id': 17, 'start': 115, 'end': 116, 'tag': 'FF', 'pos': 'PUNCT', 'morph': '', 'lemma': ',', 'dep': 'punct', 'head': 18, 'is_stop': False, 'is_oov': False, 'stem': ','}, {'id': 18, 'start': 117, 'end': 128, 'tag': 'V', 'pos': 'VERB', 'morph': 'Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part', 'lemma': 'rabberciare', 'dep': 'acl', 'head': 6, 'is_stop': False, 'is_oov': False, 'stem': 'rabberc'}, 
{'id': 19, 'start': 129, 'end': 130, 'tag': 'CC', 'pos': 'CCONJ', 'morph': '', 'lemma': 'e', 'dep': 'cc', 'head': 20, 'is_stop': False, 'is_oov': False, 'stem': 'e'}, {'id': 20, 'start': 131, 'end': 146, 'tag': 'A', 'pos': 'ADJ', 'morph': 'Gender=Masc|Number=Sing', 'lemma': 'contraddittorio', 'dep': 'conj', 'head': 6, 'is_stop': False, 'is_oov': False, 'stem': 'contraddittor'}, {'id': 21, 'start': 147, 'end': 149, 'tag': 'PC', 'pos': 'PRON', 'morph': 'Clitic=Yes|Number=Plur|Person=1|PronType=Prs', 'lemma': 'ci', 'dep': 'iobj', 'head': 23, 'is_stop': True, 'is_oov': False, 'stem': 'ci'}, {'id': 22, 'start': 150, 'end': 159, 'tag': 'VM', 'pos': 'AUX', 'morph': 'Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin', 'lemma': 'potere', 'dep': 'aux', 'head': 23, 'is_stop': False, 'is_oov': False, 'stem': 'potessim'}, {'id': 23, 'start': 160, 'end': 169, 'tag': 'V', 'pos': 'VERB', 'morph': 'VerbForm=Inf', 'lemma': 'attendere', 'dep': 'ROOT', 'head': 23, 'is_stop': False, 'is_oov': False, 'stem': 'attend'}, {'id': 24, 'start': 170, 'end': 171, 'tag': 'FS', 'pos': 'PUNCT', 'morph': '', 'lemma': '.', 'dep': 'punct', 'head': 23, 'is_stop': False, 'is_oov': False, 'stem': '.'}]}, ... 
```

In [None]:
# paragraphs is a list of objects, so we need to unwind it in order to retrieve
# the single elements

cursor = db.tokenization.aggregate([ 
    { "$unwind" : "$paragraphs" }, 
    {"$project": {"_id":0, "paragraphs":1}}
])

for diction in cursor:
    print(diction['paragraphs'].keys())
    break

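Objects contained in ```paragraphs``` are:
- ```'text'```: text of the intervention
- ```'ents'```: named entities found in the text, each with ```start```, ```end```, ```label``` and ```text```
- ```'id'```: id of the paragraph
- ```'tokens'```: list of dictionaries, one per token, holding the token-level information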
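
To make the role of ```'ents'``` concrete, the sketch below reads the entity spans from a paragraph dictionary. The sample is a shortened copy of the output shown above (the text is truncated and the tokens are omitted); ```entity_spans``` is a hypothetical helper name, not part of the project code.

```python
# Shortened copy of one element of 'paragraphs' (text truncated, tokens omitted)
paragraph = {
    "text": "Signor Presidente signor ministro , il provvedimento",
    "ents": [
        {"start": 0, "end": 33, "label": "MISC",
         "text": "Signor Presidente signor ministro"},
    ],
    "id": 0,
    "tokens": [],
}

def entity_spans(paragraph):
    """Return (label, surface text) pairs, slicing the text with the stored offsets."""
    return [
        (ent["label"], paragraph["text"][ent["start"]:ent["end"]])
        for ent in paragraph["ents"]
    ]

print(entity_spans(paragraph))
```

The slice ```text[start:end]``` reproduces the ```text``` field stored with the entity, which suggests that ```end``` is an exclusive offset.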

In [None]:
# unwind tokens to retrieve the keys in tokens
cursor = db.tokenization.aggregate([ 
    {'$unwind' : "$paragraphs" }, 
    {'$unwind' : "$paragraphs.tokens"},
    {'$project': {"_id":0, "paragraphs.tokens":1}}
])


for diction in cursor:
    print(diction['paragraphs']['tokens'].keys())
    break

Objects contained in ```tokens``` are:
- ```'id'```: index of the token within the paragraph
- ```'start'```: character offset of the first character of the token in the text
- ```'end'```: character offset just after the last character of the token in the text
- ```'tag'```: fine-grained POS tag
- ```'pos'```: coarse-grained POS tag
- ```'morph'```: morphological features
- ```'lemma'```: lemma (base form) of the word
- ```'dep'```: dependency label
- ```'head'```: integer index of the token's dependency head, referring to the position of the head token in the text
- ```'is_stop'```: whether the token is a stop word
- ```'is_oov'```: whether the token is out-of-vocabulary (i.e. it has no word vector)
- ```'stem'```: stem of the word (the word with its ending stripped)
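
The token fields can be combined to recover surface forms or to filter tokens. The sketch below uses the first seven tokens of the sample output, reduced to the fields it needs. Note that the pipeline in this notebook keeps every lemma; dropping stop words and punctuation, as ```content_lemmas``` does, is only an assumption about a possible preprocessing step, and both helper names are hypothetical.

```python
text = "Signor Presidente signor ministro , il provvedimento"

# First tokens of the sample output, reduced to the fields used here
tokens = [
    {"id": 0, "start": 0, "end": 6, "pos": "NOUN", "lemma": "Signor", "is_stop": False},
    {"id": 1, "start": 7, "end": 17, "pos": "NOUN", "lemma": "Presidente", "is_stop": False},
    {"id": 2, "start": 18, "end": 24, "pos": "NOUN", "lemma": "signore", "is_stop": False},
    {"id": 3, "start": 25, "end": 33, "pos": "NOUN", "lemma": "ministrare", "is_stop": False},
    {"id": 4, "start": 34, "end": 35, "pos": "PUNCT", "lemma": ",", "is_stop": False},
    {"id": 5, "start": 36, "end": 38, "pos": "DET", "lemma": "il", "is_stop": True},
    {"id": 6, "start": 39, "end": 52, "pos": "NOUN", "lemma": "provvedimento", "is_stop": False},
]

def surface_form(text, token):
    """Recover the original word from the character offsets."""
    return text[token["start"]:token["end"]]

def content_lemmas(tokens):
    """Lower-cased lemmas of tokens that are neither stop words nor punctuation."""
    return [
        token["lemma"].lower()
        for token in tokens
        if not token["is_stop"] and token["pos"] != "PUNCT"
    ]

print(surface_form(text, tokens[3]))
print(content_lemmas(tokens))
```

The example also shows a limitation of the automatic lemmatisation: the noun "ministro" is stored with the lemma "ministrare".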

In [None]:
# example of objects contained in tokens
for diction in cursor:
    print(diction['paragraphs']['tokens'])
    break

Returned output: 

```
{'id': 3, 'start': 25, 'end': 33, 'tag': 'S', 'pos': 'NOUN', 'morph': 'Number=Sing', 'lemma': 'ministrare', 'dep': 'compound', 'head': 2, 'is_stop': False, 'is_oov': False, 'stem': 'ministr'}
```

In [None]:
# unwind paragraphs and tokens, then collect the lemmas of each paragraph text
# for all the documents in the time span [start_year, end_year)

def get_lemmas_by_years(start_year, end_year):
    
    cursor = db.tokenization.aggregate([ 
        {"$match": {"$and": [
            {"year": {"$gte":start_year}},
            {"year": {"$lt": end_year}}
             ]}
        }, 
        {"$unwind" : "$paragraphs" },
        {"$unwind" : "$paragraphs.tokens" },
        {"$group" : {"_id": {"text":"$paragraphs.text"},
                     "lemma" : {"$push": "$paragraphs.tokens.lemma"}}}
    ], 
        allowDiskUse=True)
    
    return cursor

In [None]:
cursor = get_lemmas_by_years(1948, 1960)

In [None]:
for diction in cursor:
    print(diction)
    break

Returned output:

```
{'_id': {'text': ' 1 La Camera rilevato : 10 chele attuali leggi in materia di risarcimento dei danni di guerra ai privati non sono fra di loro coordinate e in ogni modo non permettono di addivenire .'}, 'lemma': [' ', '1', 'La', 'Camera', 'rilevare', ':', '10', 'chela', 'attuale', 'leggere', 'in', 'materia', 'di', 'risarcimento', 'dio', 'danno', 'di', 'guerra', 'al', 'privato', 'non', 'essere', 'fra', 'di', 'loro', 'coordinato', 'e', 'in', 'ogni', 'modo', 'non', 'permettere', 'di', 'addivenire', '.']}
```
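
To clarify what the aggregation returns, here is a pure-Python emulation of its stages on toy data. The documents below are invented and contain only the fields the pipeline touches; ```lemmas_by_paragraph``` is a hypothetical name. One consequence worth noting is that, because the grouping key is the paragraph text, paragraphs with identical text end up in the same group and their lemma lists are concatenated.

```python
from collections import defaultdict

def lemmas_by_paragraph(documents, start_year, end_year):
    """Emulate the $match / $unwind / $group stages on a list of dictionaries."""
    grouped = defaultdict(list)
    for doc in documents:
        # $match: keep years in the half-open interval [start_year, end_year)
        if not (start_year <= doc["year"] < end_year):
            continue
        # the two $unwind stages yield one row per (paragraph, token) pair
        for paragraph in doc["paragraphs"]:
            for token in paragraph["tokens"]:
                # $group with $push: append the lemma to the list keyed by the text
                grouped[paragraph["text"]].append(token["lemma"])
    return dict(grouped)

# Invented toy documents: the same paragraph appears in 1950 and in 1955
repeated = {
    "text": "La Camera approva .",
    "tokens": [{"lemma": "La"}, {"lemma": "Camera"}, {"lemma": "approvare"}, {"lemma": "."}],
}
toy_documents = [
    {"year": 1950, "paragraphs": [repeated]},
    {"year": 1955, "paragraphs": [repeated]},
    {"year": 1970, "paragraphs": [{"text": "Grazie .", "tokens": [{"lemma": "grazie"}, {"lemma": "."}]}]},
]

result = lemmas_by_paragraph(toy_documents, 1948, 1968)
for paragraph_text, lemmas in result.items():
    print(paragraph_text, lemmas)
```

In this toy run the two occurrences of the repeated paragraph produce a single entry with eight lemmas, while the 1970 document is excluded by the year filter.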

In [None]:
years_period = [ [1948,1968], [1968,1985], [1985,2000], [2000,2020] ]

for years in tqdm(years_period):
    # return the lemmas for each document in the selected time span
    cursor = get_lemmas_by_years(years[0], years[1])
    # store the list of lemmatised documents for pickling
    docs = [text['lemma'] for text in cursor]
    # pickle file
    basepath = '/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481'
    with open(os.path.join(basepath,f'docs_by_years_{years[0]}_{years[1]}.pickle'), "wb") as output:
        pickle.dump(docs, output)

In [None]:
def count_gender_document_by_years(start_year, end_year):
    
    cursor = db.tokenization.aggregate([ 
        {"$match": {"$and": [
            {"year": {"$gte": start_year}},
            {"year": {"$lt": end_year}}
             ]}},
        {"$unwind" : "$paragraphs" },
        {"$unwind" : "$paragraphs.text" },
        {"$project": {"_id":0, "paragraphs.text":1, "gender":1}},
        {"$group" : {"_id": {"gender":"$gender"},
                     "n documents" : {"$sum" : 1}}}
    ] , 
        allowDiskUse=True)
    
    return cursor

In [None]:
years_period = [ [1948,1968], [1968,1985], [1985,2000], [2000,2020] ]

for years in tqdm(years_period):
    # count the documents by gender of the speaker in the selected time span
    cursor = count_gender_document_by_years(years[0], years[1])
    for i in cursor:
        print(f"number of documents by gender of speakers for years {years}\n{i}")

Returned output:

```
  0%|                                                     | 0/4 [00:00<?, ?it/s]
number of documents by gender of speakers for years [1948, 1968]
{'_id': {'gender': 'female'}, 'n documents': 47209}
number of documents by gender of speakers for years [1948, 1968]
{'_id': {'gender': 'male'}, 'n documents': 1648571}
 25%|███████████                                 | 1/4 [03:09<09:27, 189.12s/it]
number of documents by gender of speakers for years [1968, 1985]
{'_id': {'gender': 'male'}, 'n documents': 967690}
number of documents by gender of speakers for years [1968, 1985]
{'_id': {'gender': 'female'}, 'n documents': 46964}
 50%|██████████████████████                      | 2/4 [05:06<04:53, 146.96s/it]
number of documents by gender of speakers for years [1985, 2000]
{'_id': {'gender': 'female'}, 'n documents': 97727}
number of documents by gender of speakers for years [1985, 2000]
{'_id': {'gender': 'male'}, 'n documents': 900474}
 75%|█████████████████████████████████           | 3/4 [06:54<02:08, 129.00s/it]
number of documents by gender of speakers for years [2000, 2020]
{'_id': {'gender': 'female'}, 'n documents': 263610}
number of documents by gender of speakers for years [2000, 2020]
{'_id': {'gender': 'male'}, 'n documents': 1083830}
100%|████████████████████████████████████████████| 4/4 [09:21<00:00, 140.39s/it]
```
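
The raw counts are easier to compare as proportions. The sketch below takes the counts printed above and computes the share of paragraphs spoken by women in each period; ```female_share``` is a hypothetical helper name.

```python
# Counts copied from the output above
counts = {
    (1948, 1968): {"female": 47209, "male": 1648571},
    (1968, 1985): {"female": 46964, "male": 967690},
    (1985, 2000): {"female": 97727, "male": 900474},
    (2000, 2020): {"female": 263610, "male": 1083830},
}

def female_share(period_counts):
    """Fraction of documents whose speaker is female."""
    total = period_counts["female"] + period_counts["male"]
    return period_counts["female"] / total

for (start_year, end_year), period_counts in counts.items():
    print(f"{start_year}-{end_year}: {female_share(period_counts):.1%}")
```

The share grows from about 2.8% in 1948-1968 to about 19.6% in 2000-2020, so the corpus is strongly imbalanced towards male speakers in every period, which should be kept in mind when interpreting the embeddings.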

### TRAIN WORD2VEC MODEL

In [None]:
# SKIP-GRAM model
# sg ({0, 1}, optional) – Training algorithm: 1 for skip-gram; otherwise CBOW.

basepath = '/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481'
years_period = [ [1948,1968], [1968,1985], [1985,2000], [2000,2020] ]

for years in tqdm(years_period):
    with open(os.path.join(basepath,f'docs_by_years_{years[0]}_{years[1]}.pickle'), "rb") as f:
        docs = pickle.load(f)
    model = Word2Vec(sentences=docs, vector_size=300, window=5, min_count=5, sg=1, epochs=10) 
    model.save(os.path.join(basepath, f'W2V_by_years_{years[0]}_{years[1]}'))

In [None]:
#!/bin/bash

# copy the trained models from the virtual machine to the local repository;
# the glob is quoted so that it is expanded on the remote side
BASEPATH_src=/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481
scp -P 22 "student@172.20.27.83:$BASEPATH_src/W2V_by_years_*" ~/Gender-stereotypes-in-parliamentary-speeches-with-Word-Embedding/we_models

---

### TRAIN GLOVE MODEL

The four main tools in the GloVe package are:

1) vocab_count

This tool requires an input corpus that already consists of whitespace-separated tokens; use something like the Stanford Tokenizer first on raw text. It constructs unigram counts from the corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.

2) cooccur

Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle

Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove

Train the GloVe model on the specified cooccurrence data, which typically will be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
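
To give an intuition of what the cooccur step computes, here is a simplified pure-Python sketch. It is not the tool's actual C implementation: it assumes a symmetric context window and a weight of 1/distance for each pair, which to my knowledge are the tool's default settings, and it processes a single tokenised line. ```cooccurrence_counts``` is a hypothetical name.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window_size=5):
    """Weighted word-word co-occurrence counts within a symmetric window.

    Each pair of words at distance d <= window_size contributes 1/d,
    counted in both directions.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # look back at most window_size positions
        for distance in range(1, window_size + 1):
            j = i - distance
            if j < 0:
                break
            context = tokens[j]
            weight = 1.0 / distance
            counts[(word, context)] += weight
            counts[(context, word)] += weight
    return dict(counts)

toy_counts = cooccurrence_counts(["la", "camera", "approva"], window_size=2)
for pair, value in sorted(toy_counts.items()):
    print(pair, value)
```

Adjacent words receive weight 1.0, while "la" and "approva", two positions apart, receive 0.5. The glove tool then fits the word vectors to these weighted counts.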

In [None]:
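# prepare the text files for GloVe: one document per line, tokens separated by whitespace
years_period = [ [1948,1968], [1968,1985], [1985,2000], [2000,2020] ]

for years in years_period:
    with open(os.path.join(basepath,f'docs_by_years_{years[0]}_{years[1]}.pickle'), "rb") as f:
        docs = pickle.load(f)
    with open(os.path.join(basepath,f'docs_by_years_{years[0]}_{years[1]}.txt'), 'w') as f:
        for item in docs:
            # join the lemmas and normalise any whitespace inside them
            f.write(" ".join(" ".join(item).split()) + "\n")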

In [None]:
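# inspect the first lines of the text file (the whole file is too large to print)
with open(os.path.join(basepath,'docs_by_years_2000_2020.txt'), "r") as f:
    for _ in range(5):
        print(f.readline())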

In [None]:
#!/bin/bash

# clone repository
git clone https://github.com/stanfordnlp/glove
cd glove && make

The vector size and the window size are the same as for the word2vec models.

In [None]:
#!/bin/bash
set -e

# Builds the vocabulary and the cooccurrence statistics, then trains a GloVe model for each time period.
# Run it from the root of the cloned glove repository, after make (the tools are in ./build).

BASEPATH=/home/student/Desktop/COGNOMEnomeMATRICOLA/FORMENTInicole941481

for i in '1948_1968' '1968_1985' '1985_2000' '2000_2020'
do
    CORPUS="$BASEPATH/docs_by_years_$i.txt"
    VOCAB_FILE=vocab.txt
    COOCCURRENCE_FILE=cooccurrence.bin
    COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
    BUILDDIR=build
    SAVE_FILE="$BASEPATH/GLOVE_by_years_$i"
    VERBOSE=2
    MEMORY=4.0
    VOCAB_MIN_COUNT=1
    VECTOR_SIZE=300
    MAX_ITER=15
    WINDOW_SIZE=5
    BINARY=2
    NUM_THREADS=8
    X_MAX=10

    if hash python 2>/dev/null; then
        PYTHON=python
    else
        PYTHON=python3
    fi

    echo
    echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
    $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE

    echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
    $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE

    echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
    $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE

    echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
    $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
done
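
Once training has finished, the vectors can be read back in Python. The sketch below is a minimal loader that assumes the text output of the glove tool has one word per line followed by its vector components, separated by spaces, with no header line. It also assumes the file name is the save file plus a ```.txt``` extension (for example ```GLOVE_by_years_2000_2020.txt```). ```load_glove_text``` is a hypothetical helper, and the demo runs on a small temporary file with invented values rather than on a trained model.

```python
import os
import tempfile

def load_glove_text(path):
    """Read 'word v1 v2 ... vn' lines into a dict mapping word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip empty or malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors

# Demo on a temporary file with invented 3-dimensional vectors
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo_vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("camera 0.1 0.2 0.3\n")
        handle.write("presidente -0.4 0.5 0.6\n")
    demo_vectors = load_glove_text(demo_path)

print(demo_vectors["camera"])
```

For the real 300-dimensional models, loading the vectors into a dedicated structure such as gensim's ```KeyedVectors``` is likely to be more convenient than a plain dictionary.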