<a href="https://colab.research.google.com/github/nicolashernandez/teaching_nlp/blob/main/06_biasandethics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

--
# Bias in data-driven models 

For the following questions, explain your approach and provide code that supports your observations.

You may need to switch the runtime type to GPU.

## Static word models and word similarity

[**Word2Vec** (Google)](https://github.com/tmikolov/word2vec), [**GloVe** (Stanford)](https://nlp.stanford.edu/projects/glove/), [**FastText** (Facebook)](https://github.com/facebookresearch/fastText)... are methods for building semantic word representations from corpora. Some of them use global word co-occurrence information; others are more sensitive to morphological variation. All these methods are appealing because the word vectors are dense and have few dimensions compared with the vocabulary size. Their major drawback is that the representations are not contextual: a word keeps the same vector whatever context it appears in.

[**gensim**](https://radimrehurek.com/gensim/) is a library that lets you experiment with pre-trained models for word or document similarity tasks, or build your own models from your data.

### QUESTION

Have a look at the [gensim-data repository](https://github.com/RaRe-Technologies/gensim-data) and check whether it contains models built from Twitter data. If so, give the name of one. The number at the end of a model name corresponds to the number of dimensions used to describe a word.

In [None]:
import gensim.downloader as api

api.info()  # show info about available models/datasets

Load some models

In [None]:
wiki_model_50 = api.load("glove-wiki-gigaword-50")
wiki_model_200 = api.load("glove-wiki-gigaword-200")
#twitter_model = api.load("TODO")

### Get the similar words 

For each question below, play the game and take the time to make suggestions for answers before running the code that will allow you to look up the model's knowledge and find out what it would answer.

If I tell you 'king', what do you think of? Suggest a few synonyms or semantically close, substitutable words. The `most_similar` method returns the 10 words closest to a given word, ordered from most to least similar, each with its similarity score to the given word (so the scores decrease).
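
The similarity score reported by `most_similar` is a cosine similarity between word vectors. Here is a minimal sketch of that computation, assuming three made-up 3-dimensional vectors (real models use 50 to 300 learned dimensions). The names `toy_vectors`, `cosine` and `rank_neighbours` are illustrative and not part of gensim.

```python
import math

# Toy vectors invented for illustration; they do not come from a trained model.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_neighbours(word, vectors):
    """Return the other words sorted by decreasing cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(rank_neighbours("king", toy_vectors))
```

With these toy values, "queen" ranks above "apple" because its vector points in almost the same direction as "king".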

Compare the knowledge of distinct models in terms of size and data genre.

In [None]:
wiki_model_50.most_similar("king")

Compare with a larger model:

In [None]:
wiki_model_200.most_similar("king")

Compare with a model from a distinct genre:

In [None]:
twitter_model.most_similar("king")

If I ask you for words related to both 'palace' and 'paris', what comes to mind? Note that the method also accepts a list of words as its argument.

In [None]:
# get the words most similar to a list of words
wiki_model_200.most_similar(['palace', 'paris'])

If I add the king and woman vectors and remove the man vector what do I get? Answer before running the code below.

In [None]:
# If I add the king and woman vectors and remove the man vector, what do I get?
wiki_model_200.most_similar(positive = ['king', 'woman'], negative = ['man'])
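
To see what the `positive` and `negative` arguments do, here is a minimal sketch of the vector arithmetic on made-up 2-dimensional vectors (axis 0 loosely stands for "royalty", axis 1 for "gender"). The values and the helper names `unit` and `analogy` are assumptions for illustration. The sketch mimics the general recipe (add and subtract length-normalized vectors, then rank the remaining words by cosine similarity), not gensim's exact code.

```python
import math

# Toy 2-d vectors invented for illustration; not taken from any trained model.
toy_vectors = {
    "king":   [0.9, 0.7],
    "queen":  [0.9, -0.7],
    "prince": [0.8, 0.6],
    "man":    [0.1, 0.8],
    "woman":  [0.1, -0.8],
    "apple":  [-0.5, 0.0],
}

def unit(v):
    """Scale v to length 1 so that a dot product equals a cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # The input words themselves are never proposed as answers.
    inputs = set(positive) | set(negative)
    scores = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in inputs]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(analogy(["king", "woman"], ["man"], toy_vectors))
```

Because the input words are excluded from the candidates, the answer can never be 'king' itself. Keep this in mind when you interpret analogies that seem to expose a bias.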


### QUESTION

* Same question, but this time add the vectors of 'paris' and 'japan' and remove the vector of 'france'. Make a guess, then write the code to check it.

In [None]:
# TODO

### QUESTION 
* Play with the operation `most_similar` on static embedding vectors. Take the word "human", remove "male" and add "job"...
* Do you see any situations in operations that expose sexist, racist, religious or other biases? You may compare the various genre models. Give an example of each bias you find.

In [None]:
# TODO

### 3D Visualization of word embeddings with _tensorflow projector_

1. Open http://projector.tensorflow.org/
2. Select available tensors
3. and PROFIT!



### QUESTION

* Do you observe areas that are denser than others? What does this mean? 
* Test the 3D labels, click on a point/word (set the neighborhood to the minimum value) to observe the illumination of an area, search for a word, view 'isolate 6 points'.

## Building word2vec and fasttext models with gensim


Both Word2Vec and FastText take a normalized corpus segmented into sentences and tokenized into words. 

We could very well use spaCy or nltk to do this, but this kind of pre-processing takes "a little time". We will directly use a corpus from the nltk database available with the segmentation into sentences and the tokenization into words.

The code below uses the selection of the gutenberg corpus segmented into sentences and tokens by nltk.

In [1]:
import re
import nltk
nltk.download('gutenberg')
nltk.download('punkt')

nltk_gutenberg_corpus = list()
word_counter = 0

for fileid in nltk.corpus.gutenberg.fileids():
  segmented_and_tokenized_doc = nltk.corpus.gutenberg.sents(fileid)
  for sent in segmented_and_tokenized_doc:
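    # Keep ASCII letters only, then lowercase. The flags must be passed by
    # keyword: the fourth positional argument of re.sub is `count`, not `flags`.
    words = [re.sub(r'[^a-zA-Z\s]', '', word, flags=re.I|re.A).lower() for word in sent]
    # 98552 2621785
    # Drop words of three characters or fewer
    words = [word for word in words if len(word) > 3]
    # 95804 1154977
    if len(words) > 0: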
      nltk_gutenberg_corpus.append(words)
      word_counter += len(words)
      
print ('sentences_len:', len(nltk_gutenberg_corpus), 'words_len:', word_counter)

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


sentences_len: 95801 words_len: 1154977


The most common hyper-parameters of
[Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) :

* `sentences`: List of tokenized sentences (the corpus)
* `size`: Dimensionality of the word vectors (default: 100); renamed `vector_size` in gensim 4.0 and later
* `window`: Maximum distance between the current and predicted word within a sentence
* `sg`: Training algorithm: 1 for skip-gram; otherwise CBOW
* `iter`: Number of iterations (epochs) over the corpus; renamed `epochs` in gensim 4.0 and later
* `workers`: Number of worker threads used to train the model (faster training on multicore machines)
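
To make the role of `window` and `sg` concrete, here is a simplified sketch of how skip-gram turns a tokenized sentence into (center word, context word) training pairs. The function name `skipgram_pairs` and the example sentence are assumptions for illustration. The real implementation adds refinements such as frequent-word subsampling and negative sampling.

```python
def skipgram_pairs(sentence, window):
    """Return every (center, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs

# Hypothetical tokenized sentence, in the same shape as one entry of the corpus list.
sentence = ["emma", "loved", "long", "walks"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A larger `window` produces more pairs per sentence and tends to capture topical rather than strictly syntactic similarity.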





In [2]:
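# Set values for the training hyper-parameters
lr = 0.05     # Initial learning rate
dim = 100     # Word vector dimensionality
ws = 5        # Context window size
epoch = 5     # Number of passes over the corpus
minCount = 5  # Ignore words that occur fewer than this many times
neg = 5       # Number of negative samples per positive example
loss = 'ns'   # Negative sampling (kept for reference, not passed to gensim)
t = 1e-4      # Downsampling threshold for frequent words
#sample = 1e-3   # Downsample setting for frequent words
sg = 1        # 1 = skip-gram, 0 = CBOW

# Keyword names follow gensim 3.x; gensim >= 4.0 renames
# 'size' to 'vector_size' and 'iter' to 'epochs'.
params = {
    'alpha': lr,
    'size': dim,
    'window': ws,
    'iter': epoch,
    'min_count': minCount,
    'sample': t,
    'sg': sg,
    'hs': 0,  # 0 = no hierarchical softmax, so negative sampling is used
    'negative': neg
}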

Construction of the models. 
Observe the speed given the number of words.

In [3]:
from gensim.models import Word2Vec, KeyedVectors

# 
%time w2v_model = Word2Vec(nltk_gutenberg_corpus, **params) 

# save the model
!mkdir -p models
w2v_model_path = 'models/w2v_nltk-gutenberg_100_5_5_sg.gensim-bin'
w2v_model.save(w2v_model_path)

CPU times: user 28.2 s, sys: 134 ms, total: 28.3 s
Wall time: 17.9 s


In [4]:
from gensim.models.fasttext import FastText

#
%time ft_model = FastText(nltk_gutenberg_corpus, **params)

# save the model
!mkdir -p models
ft_model_path = 'models/ft_nltk-gutenberg_100_5_5_sg.gensim-bin'
ft_model.save(ft_model_path)

CPU times: user 55.6 s, sys: 352 ms, total: 56 s
Wall time: 32.4 s


Have a look at the models

In [9]:
w2v_model.wv.most_similar("love")



[('dearly', 0.7445521354675293),
 ('unfeigned', 0.7432754039764404),
 ('forbearing', 0.7375692129135132),
 ('uprightness', 0.7363731861114502),
 ('integrity', 0.7290647029876709),
 ('brotherly', 0.7261554002761841),
 ('forego', 0.7226293087005615),
 ('solace', 0.7222168445587158),
 ('bountifully', 0.7199023365974426),
 ('meekness', 0.7172915935516357)]

In [10]:
ft_model.wv.most_similar("love")



[('lover', 0.7898315191268921),
 ('loves', 0.7840328216552734),
 ('glove', 0.7417938709259033),
 ('loved', 0.7322890162467957),
 ('lovers', 0.7271362543106079),
 ('lovest', 0.6935369968414307),
 ('beloved', 0.6896727085113525),
 ('loveit', 0.6596423387527466),
 ('swerve', 0.6556262969970703),
 ('loveliest', 0.6489120721817017)]
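
FastText represents a word through its character n-grams, which is one way to read neighbours such as 'glove' in the output above. The sketch below lists the n-grams that two words share, assuming n-gram lengths from 3 to 6 and `<`, `>` as word boundary markers. The helper name `char_ngrams` is illustrative, and the hashing of n-grams into buckets that the library performs is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word` wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

love = char_ngrams("love")
glove = char_ngrams("glove")
print(sorted(love & glove))
```

The more n-grams two words share, the more of their representation is built from the same sub-vectors, whether or not the words are related in meaning.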

If you want to visualize the model built and saved in gensim format with the TensorFlow projector, run the following cell to export it as TSV files, then load them in the projector described in the earlier section "3D Visualization of word embeddings with _tensorflow projector_".

In [None]:
import gensim
from gensim.models import KeyedVectors, Word2Vec
from gensim.scripts.word2vec2tensor import word2vec2tensor

def convert_gensim_w2v_to_w2v (gensim_w2v_in_path, w2v_out_path):
  """
  convert a model from gensim_w2v format to w2v (original) format
  """
  # load the full model saved with Word2Vec.save and keep only its word vectors
  w2v_model = Word2Vec.load(gensim_w2v_in_path)
  vectors = w2v_model.wv
  # save memory
  # del model

  # The trained word vectors can also be stored/loaded from a format compatible
  # with the original word2vec implementation via Word2Vec.wv.save_word2vec_format 
  # and gensim.models.keyedvectors.KeyedVectors.load_word2vec_format().
  vectors.save_word2vec_format(w2v_out_path, binary = True)

def convert_w2v_to_tsv (w2v_in_path, tsv_out_path):
  """
  convert a model from w2v original format to tsv format
  """
  # Running word2vec2tensor directly on a file written by
  # save_word2vec_format raises:
  # UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 0: invalid start byte
  # Workaround: reload the file with load_word2vec_format (ignoring
  # undecodable bytes) and save it again with save_word2vec_format.
  w2v_model = gensim.models.KeyedVectors.load_word2vec_format(w2v_in_path, binary=True, unicode_errors='ignore')
  # load_word2vec_format already returns a KeyedVectors object, so no `.wv` is needed
  w2v_model.save_word2vec_format(w2v_in_path+".tmp", binary = True)
  word2vec2tensor(w2v_in_path+".tmp", tsv_out_path,  binary = True)

def convert_gensim_w2v_to_tsv (gensim_w2v_in_path, tsv_out_path):
  """
  convert a model from gensim w2v format to tsv format
  """
  convert_gensim_w2v_to_w2v (gensim_w2v_in_path, gensim_w2v_in_path+".tmp")
  convert_w2v_to_tsv (gensim_w2v_in_path+".tmp", tsv_out_path)

w2v_model_path = 'models/w2v_nltk-gutenberg_100_5_5_sg.gensim-bin'
tensor_filename = 'models/tensor_nltk-gutenberg_100_5_5_sg.tsv'

convert_gensim_w2v_to_tsv(w2v_model_path, tensor_filename)



## Models comparison and evaluation

[`gensim` implements model comparison](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb) according to the **analogical reasoning** task as described in [section 4.1 of the 2013 paper by Mikolov et al.](https://arxiv.org/pdf/1301.3781v3.pdf).

```
:capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
...
:capital-world
Algiers Algeria Baghdad Iraq
Ankara Turkey Dublin Ireland
...
: city-in-state
Chicago Illinois Houston Texas
Chicago Illinois Philadelphia Pennsylvania
...
: gram1-adjective-to-adverb
amazing amazingly apparent apparently
amazing amazingly calm calmly
...
```
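
The question file shown above groups analogies into sections introduced by a line starting with `:`. Here is a minimal sketch of how such a file can be parsed and scored, assuming a small inline sample instead of the real file. The names `parse_analogies`, `section_accuracy` and `always_iraq` are hypothetical helpers, not gensim functions.

```python
def parse_analogies(text):
    """Map each section name to its list of (a, b, c, expected) questions."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            sections[current] = []
        else:
            a, b, c, expected = line.split()
            sections[current].append((a, b, c, expected))
    return sections

def section_accuracy(questions, predict):
    """Fraction of questions for which predict(a, b, c) returns the expected word."""
    correct = sum(1 for a, b, c, expected in questions
                  if predict(a, b, c) == expected)
    return correct / len(questions)

sample = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: gram1-adjective-to-adverb
amazing amazingly calm calmly
"""

sections = parse_analogies(sample)

def always_iraq(a, b, c):
    """Hypothetical stand-in for a model: it answers 'Iraq' to every question."""
    return "Iraq"

print(section_accuracy(sections["capital-common-countries"], always_iraq))
```

With a trained model, `predict` could instead wrap a call such as `model.wv.most_similar(positive=[b, c], negative=[a], topn=1)` and return the top word.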

Other intrinsic evaluations are possible such as the [computation of a correlation coefficient between a similarity rate computed on the basis of human judgment and a cosine similarity score between Word2Vec representations](https://nlp-ensae.github.io/materials/course2/).




Below we implement the analogical reasoning task of Mikolov et al.

In [None]:
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/tmikolov/word2vec/master/questions-words.txt

In [None]:
# take a look at the first lines of the file
!head questions-words.txt

Definition of the method for computing the resolution performance of the analogy task

In [None]:
def print_accuracy(model, questions_file):
    print('Evaluating...\n')
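    # evaluate_word_analogies returns (overall_score, sections); it replaces
    # the deprecated model.accuracy, which was removed in gensim 4.0
    _, acc = model.wv.evaluate_word_analogies(questions_file)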

    sem_correct = sum((len(acc[i]['correct']) for i in range(5)))
    sem_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5))
    sem_acc = 100*float(sem_correct)/sem_total
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, sem_acc))
    
    syn_correct = sum((len(acc[i]['correct']) for i in range(5, len(acc)-1)))
    syn_total = sum((len(acc[i]['correct']) + len(acc[i]['incorrect'])) for i in range(5,len(acc)-1))
    syn_acc = 100*float(syn_correct)/syn_total
    print('Morphologic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, syn_acc))
    return (sem_acc, syn_acc)

Run the evaluation

In [None]:
#
word_analogies_file = 'questions-words.txt'

print('\nLoading Word2Vec embeddings')
w2v_model = Word2Vec.load(w2v_model_path)
print('Accuracy for Word2Vec:')
print_accuracy(w2v_model, word_analogies_file)

print('\nLoading FastText embeddings')
ft_model = FastText.load(ft_model_path)
print('Accuracy for FastText (with n-grams):')
print_accuracy(ft_model, word_analogies_file)

#### QUESTION
* Which of the two models gives the best results on morphological analysis? On semantic analysis? Is this consistent with what you know about the models? 
* Rerun the construction of the models and then compare them. Do you get the same performance scores? Why?
* The training data are classic novels from the Gutenberg collection. If the data had been taken from Wikipedia, which results would have changed? If you want to test, below I give you a snippet of code that retrieves a normalized version of Wikipedia and builds the w2v and ft models. It will take a few minutes...
* In your opinion, from a model comparison perspective, is it important to build them on the same data?

## Text generation

[Hugging Face](https://huggingface.co/models) plays the role of a "GitHub" for pre-trained and fine-tuned language models.

The code below allows you to use the gpt2 model and to test the generation of text in English.

For your information, [BLOOM](https://huggingface.co/bigscience/bloom), which stands for BigScience Large Open-science Open-access Multilingual Language Model, is one of the most recent auto-regressive models. It takes more than 50 GB to load, so we won't use it here.

In [None]:
!pip install transformers

In [None]:
import transformers

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

set_seed(42)

print(generator("The White man worked as a", max_length=10, num_return_sequences=5))
print(generator("The Black man worked as a", max_length=10, num_return_sequences=5))

### QUESTION

* Among the Hugging Face community resources, look for the text generation resource that uses the `gpt-fr-cased-base` model for French. Give the link to the page and implement the code provided on the page.
* Provide sentence beginnings to prompt the generation. Do you find situations that reveal sexist, racist, religious or other biases? Give examples of each.


In [None]:
#TODO

### Question answering



* Same question as before, but for question answering models. Find a model on Hugging Face and come up with some questions that could reveal bias.

In [None]:
# TODO

## Translation (Google)


> *She is a doctor. He is a nurse.*



### QUESTIONS

* Open [Google Translate in your browser](https://translate.google.fr/?hl=fr&sl=en&tl=fr&text=She%20is%20a%20doctor.%20He%20is%20a%20nurse.&op=translate)
* Translate from English (source language) to French (target language). Click twice on "Switch languages" (to translate once to French and then to translate back from French to English). Do you notice anything?
* Do the same thing using Hungarian as the target language. Do you observe anything?

**TODO**



## Automatic detection of bias

Hugging Face hosts the model *d4data/bias-detection-model* for detecting bias in news. This model is part of the [research topic "Bias and Fairness in AI" conducted by Deepak John Reji and Shaina Raza](https://github.com/dreji18/Fairness-in-AI).



#### QUESTION
* Give the URL of the model. Test the bias prediction model (starting from the example code) and qualitatively assess the limitations. Does it detect any bias? Give false positive/negative examples.

**TODO**

## Translation with sequence to sequence T5

The following code shows how to [use the *t5-small* prompt-based seq-to-seq model available on Hugging Face](https://huggingface.co/t5-small).


In [None]:
!pip install transformers


from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('t5-small')

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

# Named `text` rather than `input` so the built-in input() stays usable in later cells
text = "My name is Azeem and I live in India"

# You can also use "translate English to French" and "translate English to German"
input_ids = tokenizer("translate English to Romanian: "+text, return_tensors="pt").input_ids  # Batch size 1

outputs = model.generate(input_ids)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(decoded)

#### QUESTION
* Test the translation model to translate from/to your favorite languages and qualitatively assess the limitations. 
* Test translating from English to Russian and from Russian to English... e.g. "The mind is strong but the flesh is weak".
* Develop an application that predicts bias in your favorite language through translation from English. You may need to find an adequate translation model. How well does the bias detection model perform through translation?


**TODO**

# ELIZA: a very basic Rogerian psychotherapist chatbot

> [ELIZA](https://en.wikipedia.org/wiki/ELIZA)  was made to respond like a Rogerian psychotherapist. In this instance, the therapist "reflects" on questions by turning the questions back at the patient. Created to demonstrate the superficiality of communication between humans and machines, Eliza simulated conversation by using a "pattern matching" and substitution methodology that gave users an illusion of understanding on the part of the program, but had no built in framework for contextualizing events. An [example of ELIZA conversation here](https://upload.wikimedia.org/wikipedia/commons/7/79/ELIZA_conversation.png) and a [ELIZA demo there](http://psych.fullerton.edu/mbirnbaum/psych101/eliza.htm). 

Write your own psychotherapist chatbot. Using the models available on Hugging Face or another NLP technology (such as [spaCy](https://spacy.io/)), extend the following simple chatbot with new abilities such as:
- evaluating the sentiment of your messages and giving feedback about it
- generating questions from the noun phrases in your utterances
- recognizing named entities and generating questions about them
- classifying your message into a topic category and generating questions about it
- whatever you want... even making two agents talk to each other...

In [None]:
print ('Good morning, my name is Eliza. Is something troubling you ?')
message = input()
while message != 'stop':
  print('Why do you say', message)
  message = input()
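
As a possible starting point for extending the loop above, here is a minimal sketch of ELIZA's "pattern matching" and substitution idea. The rules, the pronoun table and the function names `reflect` and `respond` are invented for this example and are far simpler than the original ELIZA script.

```python
import re

# Pronoun swaps applied to the user's words before echoing them back.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "I", "your": "my"}

# (pattern, response template) pairs tried in order; the last one always matches.
RULES = [
    (re.compile(r"i feel (.*)", re.I), "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"(.*)", re.I), "Why do you say {0}?"),
]

def reflect(fragment):
    """Swap first and second person words so the fragment can be echoed back."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)

def respond(message):
    """Return the response of the first rule whose pattern matches the message."""
    message = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(message)
        if match:
            return template.format(reflect(match.group(1)))

print(respond("I feel ignored by my boss."))
```

Replacing the `print('Why do you say', message)` line of the loop with `print(respond(message))` would plug these rules into the chatbot. Abilities such as sentiment analysis could then be added as further rules or pre-processing steps.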

# Calculate the CO2 impact of your GPU usage in this course

To do this, use the [Machine Learning has a carbon footprint](https://mlco2.github.io/impact) application.

Start by identifying your GPU, then approximate the time spent on GPU and calculate... observe the equivalences.
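
For a rough order of magnitude before using the application, the estimate can be sketched as energy used (GPU power times duration) multiplied by the carbon intensity of the electricity. The function name `co2_kg` and all input values below are hypothetical placeholders, and the application may apply further factors (provider, region, offsets) that this sketch ignores.

```python
def co2_kg(gpu_power_watts, hours, carbon_intensity_kg_per_kwh):
    """Estimate emissions in kg CO2-equivalent from power draw, duration and grid intensity."""
    energy_kwh = gpu_power_watts / 1000 * hours
    return energy_kwh * carbon_intensity_kg_per_kwh

# Hypothetical inputs: replace them with your own GPU power, duration and region.
estimate = co2_kg(gpu_power_watts=70, hours=10, carbon_intensity_kg_per_kwh=0.5)
print(f"{estimate:.2f} kg CO2eq")
```

In Colab, running `!nvidia-smi` in a cell shows which GPU is attached to the runtime.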




**TODO**