<a href="https://colab.research.google.com/github/nicolashernandez/teaching_nlp/blob/main/06_biasandethics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

--
# Bias in data-driven models 

For the following questions, explain your approach and provide code that supports your observations.

You may need to switch the runtime type to GPU.

## Static word models and word similarity

[**Word2Vec** (Google)](https://github.com/tmikolov/word2vec), [**GloVe** (Stanford)](https://nlp.stanford.edu/projects/glove/), [**FastText** (Facebook)](https://github.com/facebookresearch/fastText)... are methods for building semantic word representations from corpora. Some of them use global word co-occurrence information; others are more sensitive to morphological variation. All these methods are appealing because the word vectors are dense and have few dimensions compared to the vocabulary size. Their major drawback is that the representations are not contextual: a word keeps the same vector whatever context it appears in.

[**gensim**](https://radimrehurek.com/gensim/) is a library that lets you experiment with pre-trained models for word or document similarity tasks, or build your own models from your data.

### QUESTION

Have a look at the [gensim-data repository](https://github.com/RaRe-Technologies/gensim-data) and check whether it contains models built from Twitter data. If so, name one. The number at the end of a model name corresponds to the number of dimensions used to describe a word.

In [11]:
import gensim.downloader as api

api.info()  # show info about available models/datasets

{'corpora': {'semeval-2016-2017-task3-subtaskBC': {'num_records': -1,
   'record_format': 'dict',
   'file_size': 6344358,
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py',
   'license': 'All files released for the task are free for general research use',
   'fields': {'2016-train': ['...'],
    '2016-dev': ['...'],
    '2017-test': ['...'],
    '2016-test': ['...']},
   'description': 'SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.',
   'checksum': '701ea67acd82e75f95e1d8e62fb0ad29',
   'file_name': 'se

Load some models

In [17]:
wiki_model_50 = api.load("glove-wiki-gigaword-50")
wiki_model_200 = api.load("glove-wiki-gigaword-200")
twitter_model = api.load("glove-twitter-50")



### Get the similar words 

For each question below, play the game and take the time to make suggestions for answers before running the code that will allow you to look up the model's knowledge and find out what it would answer.

If I tell you 'king', what do you think of? Make a few suggestions of synonyms or semantically close substitutable words. The `most_similar` method will display the 10 closest words to a given word, from the most similar to the least similar, with for each a similarity score with the given word (thus decreasing scores).

Compare the knowledge of distinct models in terms of size and data genre.

In [18]:
wiki_model_50.most_similar("king")

[('prince', 0.8236179351806641),
 ('queen', 0.7839042544364929),
 ('ii', 0.7746230363845825),
 ('emperor', 0.7736247181892395),
 ('son', 0.766719400882721),
 ('uncle', 0.7627150416374207),
 ('kingdom', 0.7542160749435425),
 ('throne', 0.7539913654327393),
 ('brother', 0.7492411136627197),
 ('ruler', 0.7434253096580505)]

Compare with a larger model:

In [19]:
wiki_model_200.most_similar("king")

[('prince', 0.6854566931724548),
 ('queen', 0.6665197610855103),
 ('kingdom', 0.6303209662437439),
 ('monarch', 0.6224350929260254),
 ('ii', 0.6146443486213684),
 ('throne', 0.6074705123901367),
 ('reign', 0.5911680459976196),
 ('iii', 0.583712637424469),
 ('crown', 0.579647958278656),
 ('emperor', 0.5552704334259033)]

Compare with a model from a distinct genre:

In [7]:
twitter_model.most_similar("king")

[('prince', 0.8582575917243958),
 ('jack', 0.8346865177154541),
 ('aka', 0.832629382610321),
 ('mr.', 0.8078049421310425),
 ('the', 0.8043432235717773),
 ('john', 0.8034776449203491),
 ("'s", 0.7829487919807434),
 ('jackson', 0.779657781124115),
 ('from', 0.7796491384506226),
 ('legend', 0.7787959575653076)]
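
The rankings above come from cosine similarity between word vectors. As a minimal sketch of that idea (not gensim's actual implementation), the snippet below ranks words by cosine similarity. The 3-dimensional `toy_vectors` are invented for illustration and do not come from any model; `cosine` and `nearest` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, invented for illustration (assumption: not real embeddings).
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "prince": [0.8, 0.9, 0.2],
    "queen":  [0.9, 0.2, 0.8],
    "banana": [0.0, 0.1, 0.9],
}

print(nearest("king", toy_vectors))
```

With these toy values, 'prince' ranks first and 'banana' last. With a real model, the scores depend on the training corpus, which is why the Wikipedia and Twitter lists differ.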

If I ask you to give me words related to both 'palace' and 'paris', what comes to mind? Note that the method also accepts a list of words as its parameter.

In [None]:
# get the words most similar to a list of words
wiki_model_200.most_similar(['palace', 'paris'])

If I add the king and woman vectors and remove the man vector what do I get? Answer before running the code below.

In [None]:
# If I add the 'king' and 'woman' vectors and subtract the 'man' vector, what do I get?
wiki_model_200.most_similar(positive = ['king', 'woman'], negative = ['man'])
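
To make the vector arithmetic concrete, here is a simplified sketch of what an analogy query computes: add the normalized positive vectors, subtract the normalized negative ones, then rank the remaining words by cosine similarity to the result. This is an approximation written for illustration, not gensim's code. The 2-dimensional `toy` vectors (one "royalty" axis, one "gender" axis) and the helper names are assumptions.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine(u, v):
    """Cosine similarity, computed on the normalized vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

def analogy(vectors, positive, negative, topn=1):
    """Add positive vectors, subtract negative ones, rank the other words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    excluded = set(positive) | set(negative)  # query words are not candidates
    scores = [(w, cosine(target, vec))
              for w, vec in vectors.items() if w not in excluded]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors [royalty, gender], invented for illustration only.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

Excluding the query words from the candidates matters: without it, the nearest neighbour of the result is often one of the input words.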


### QUESTION

* Same question, but this time add the vectors of 'paris' and 'japan' and remove the vector of 'france'. Make a proposal and write the code to check it.

In [None]:
# TODO

### QUESTION 
* Play with the operations on static embedding vectors. Take the word "human", remove "male" and add "job"...
* Do you see any situations in operations that expose sexist, racist, religious or other biases? Give an example of each.
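
One possible way to probe such biases, sketched here under assumptions, is to measure how strongly a word aligns with the direction between two reference words (for example 'he' minus 'she'). The `toy` vectors are invented for illustration, and `direction_score` is a hypothetical helper, not a gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def direction_score(vectors, word, pole_a, pole_b):
    """Cosine between `word` and the direction pole_a - pole_b.

    Positive: the word leans towards pole_a; negative: towards pole_b.
    """
    direction = [a - b for a, b in zip(vectors[pole_a], vectors[pole_b])]
    return cosine(vectors[word], direction)

# Toy 2-dimensional vectors, invented for illustration (not real embeddings).
toy = {
    "he": [1.0, 0.2],
    "she": [-1.0, 0.2],
    "engineer": [0.6, 0.8],
    "nurse": [-0.7, 0.7],
    "table": [0.0, 1.0],
}

for word in ("engineer", "nurse", "table"):
    print(word, round(direction_score(toy, word, "he", "she"), 3))
```

Since a gensim model also supports `model[word]` lookups, the same function should work when passed one of the models loaded earlier instead of `toy`; the scores obtained that way reflect the training corpus rather than these made-up numbers.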

In [None]:
# TODO

## Text generation

[Hugging Face](https://huggingface.co/models) plays the role of a "GitHub" for pre-trained and fine-tuned language models.

The code below allows you to use the gpt2 model and to test the generation of text in English. 

For your information, [BLOOM](https://huggingface.co/bigscience/bloom), which stands for BigScience Large Open-science Open-access Multilingual Language Model, is one of the most recent auto-regressive models. It takes more than 50 GB to load, so we won't use it here.

In [20]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.24.0


In [22]:
import transformers

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')

set_seed(42)

print(generator("The White man worked as a", max_length=10, num_return_sequences=5))
print(generator("The Black man worked as a", max_length=10, num_return_sequences=5))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The White man worked as a clerk at the old'}, {'generated_text': 'The White man worked as a salesman in Mexico and'}, {'generated_text': 'The White man worked as a lawyer in the White'}, {'generated_text': 'The White man worked as a clerk for the store'}, {'generated_text': 'The White man worked as a barkeep and was'}]
[{'generated_text': 'The Black man worked as a prostitute. In the'}, {'generated_text': 'The Black man worked as a security guard and was'}, {'generated_text': 'The Black man worked as a guard inside of a'}, {'generated_text': 'The Black man worked as a barkeeper in Miami'}, {'generated_text': 'The Black man worked as a clerk for several American'}]
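
To compare the two sets of completions more systematically, one option is to tally the first word generated after each prompt. The sketch below does this on the ten outputs shown above, copied verbatim; `next_words` is a hypothetical helper, and taking only the first word is a simplification (it splits "security guard" into just "security").

```python
from collections import Counter

def next_words(prompt, generations):
    """Count the first word generated right after `prompt`."""
    counts = Counter()
    for item in generations:
        continuation = item["generated_text"][len(prompt):].split()
        if continuation:
            counts[continuation[0].strip(".,").lower()] += 1
    return counts

# Outputs copied from the generation run above.
white = [
    {'generated_text': 'The White man worked as a clerk at the old'},
    {'generated_text': 'The White man worked as a salesman in Mexico and'},
    {'generated_text': 'The White man worked as a lawyer in the White'},
    {'generated_text': 'The White man worked as a clerk for the store'},
    {'generated_text': 'The White man worked as a barkeep and was'},
]
black = [
    {'generated_text': 'The Black man worked as a prostitute. In the'},
    {'generated_text': 'The Black man worked as a security guard and was'},
    {'generated_text': 'The Black man worked as a guard inside of a'},
    {'generated_text': 'The Black man worked as a barkeeper in Miami'},
    {'generated_text': 'The Black man worked as a clerk for several American'},
]

white_counts = next_words("The White man worked as a", white)
black_counts = next_words("The Black man worked as a", black)
print(white_counts)
print(black_counts)
```

Five samples per prompt are far too few to draw conclusions; increasing `num_return_sequences` and varying the seed would give a more reliable picture.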


### QUESTION

* Among the Hugging Face community resources, look for the text generation resource that uses the `gpt-fr-cased-base` model for French. Give the link to the page and implement the code provided.
* Provide sentence beginnings as prompts for the generation. Do you find situations that reveal sexist, racist, religious or other biases? Give an example of each.


In [None]:
#TODO

## Translation (Google)


> *She is a doctor. He is a nurse.*



### QUESTIONS

* Open [Google Translate in your browser](https://translate.google.fr/?hl=fr&sl=en&tl=fr&text=She%20is%20a%20doctor.%20He%20is%20a%20nurse.&op=translate)
* Translate from English (source language) to French (target language). Click twice on "Switch languages" (to translate once to French and then to translate back from French to English). Do you notice anything?
* Do the same thing using Hungarian as the target language. Do you observe anything?

**TODO**



## Automatic detection of bias

Hugging Face hosts the *d4data/bias-detection-model* model for detecting bias in news. This model is part of the [research topic "Bias and Fairness in AI" conducted by Deepak John Reji and Shaina Raza](https://github.com/dreji18/Fairness-in-AI).



#### QUESTION
* Give the URL of the model. Test the bias prediction model (starting from the example code) and qualitatively assess the limitations. Does it detect any bias? Give false positive/negative examples.
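
When collecting false positives and false negatives, it can help to keep a small hand-labelled list and sort the model's mistakes automatically. The sketch below assumes a `predict` callable that returns `True` for "biased". Here `keyword_predict` is a deliberately naive stand-in, not the Hugging Face model, and the example sentences and their labels are invented for illustration.

```python
def error_analysis(examples, predict):
    """Split hand-labelled examples into false positives and false negatives.

    examples: list of (text, is_biased) pairs labelled by hand.
    predict:  callable mapping a text to True ("biased") or False.
    """
    false_positives, false_negatives = [], []
    for text, gold in examples:
        guess = predict(text)
        if guess and not gold:
            false_positives.append(text)
        elif gold and not guess:
            false_negatives.append(text)
    return false_positives, false_negatives

def keyword_predict(text):
    """Naive stand-in classifier (assumption): flags sweeping quantifiers."""
    words = set(text.lower().rstrip(".!?").split())
    return bool(words & {"all", "always", "never"})

# Invented examples with hand-assigned labels.
examples = [
    ("All politicians are liars.", True),
    ("The train never arrived on time today.", False),
    ("Nurses are women, obviously.", True),
    ("The council met on Tuesday.", False),
]

fp, fn = error_analysis(examples, keyword_predict)
print("False positives:", fp)
print("False negatives:", fn)
```

To evaluate the real model, replace `keyword_predict` with a function that wraps the model's prediction and returns a boolean.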

**TODO**

## Translation with sequence to sequence T5

The following code lets you [use the *t5-small* prompt-based seq-to-seq model available on Hugging Face](https://huggingface.co/t5-small).


In [None]:
!pip install transformers


from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained('t5-small')

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

# Named `text` rather than `input` to avoid shadowing Python's built-in input()
text = "My name is Azeem and I live in India"

# You can also use the prefixes "translate English to French: " and "translate English to German: "
input_ids = tokenizer("translate English to Romanian: " + text, return_tensors="pt").input_ids  # Batch size 1

outputs = model.generate(input_ids)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(decoded)

#### QUESTION
* Test the translation model to translate from/to your favorite languages and qualitatively assess the limitations. 
* Test translating from English to Russian and from Russian to English... e.g. "The mind is strong but the flesh is weak".
* Develop an application that predicts bias in texts written in your favorite language by going through a translation step with English. You may need to find a suitable translation model. How well does the bias detection model perform on translated text?
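
One possible reading of the application above, assuming the bias detector expects English input, is a two-step pipeline: translate the text into English, then classify the translation. The sketch below only shows how the two steps are wired together. `toy_translate` and `toy_detect` are hypothetical stand-ins to be replaced by real models.

```python
def bias_via_translation(text, translate_to_english, detect_bias):
    """Translate `text` into English, then run the bias detector on the result."""
    english = translate_to_english(text)
    return {"source": text, "english": english, "label": detect_bias(english)}

# Hypothetical stand-ins so the sketch runs without downloading any model.
TOY_DICTIONARY = {"Tous les politiciens mentent.": "All politicians lie."}

def toy_translate(text):
    """Look the sentence up in a tiny dictionary; return it unchanged otherwise."""
    return TOY_DICTIONARY.get(text, text)

def toy_detect(english_text):
    """Placeholder rule (assumption): flag sentences starting with 'All'."""
    return "Biased" if english_text.startswith("All ") else "Non-biased"

print(bias_via_translation("Tous les politiciens mentent.", toy_translate, toy_detect))
```

Keeping the intermediate English text in the result makes it easier to tell whether an error comes from the translation step or from the detector.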


**TODO**

# ELIZA: a very basic Rogerian psychotherapist chatbot

> [ELIZA](https://en.wikipedia.org/wiki/ELIZA) was made to respond like a Rogerian psychotherapist. In this instance, the therapist "reflects" on questions by turning the questions back at the patient. Created to demonstrate the superficiality of communication between humans and machines, Eliza simulated conversation by using a "pattern matching" and substitution methodology that gave users an illusion of understanding on the part of the program, but had no built-in framework for contextualizing events. See an [example of ELIZA conversation here](https://upload.wikimedia.org/wikipedia/commons/7/79/ELIZA_conversation.png) and an [ELIZA demo there](http://psych.fullerton.edu/mbirnbaum/psych101/eliza.htm).

Write your own psychotherapist chatbot. Using models available on Hugging Face or other NLP technology (such as [spaCy](https://spacy.io/)), extend the following simple chatbot by adding new abilities such as:
- evaluate the sentiment of your message and give feedback about it
- generate questions that take noun phrases from your utterances as input
- recognize named entities and generate questions about them
- classify your message into a topic category and generate questions about it
- whatever you want... even make two agents talk to each other...

In [2]:
print('Good morning, my name is Eliza. Is something troubling you ?')
message = input()
# Keep reflecting the user's message back until they type 'stop'
while message != 'stop':
  print('Why do you say that', message)
  message = input()

Good morning, my name is Eliza. Is something troubling you ?
I am sad
Why do you say that I am sad
Because
Why do you say that Because


KeyboardInterrupt: ignored
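
The "pattern matching" and substitution approach described above can be sketched with a few regular expressions and a pronoun-swapping table. This is a minimal illustration, not the original ELIZA script; the rules, the `REFLECTIONS` table and the function names are assumptions chosen for the example.

```python
import re

# Swap first- and second-person words so the reply addresses the user.
REFLECTIONS = {
    "i": "you", "am": "are", "my": "your", "me": "you",
    "you": "I", "your": "my",
}

# (pattern, reply template) pairs, tried in order.
RULES = [
    (re.compile(r"i am (.*)", re.IGNORECASE), "Why do you say that you are {0}?"),
    (re.compile(r"i feel (.*)", re.IGNORECASE), "How long have you felt {0}?"),
]

def reflect(fragment):
    """Replace each word found in REFLECTIONS, leaving other words unchanged."""
    return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())

def respond(message):
    """Return the reply of the first matching rule, or a generic prompt."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please tell me more."

print(respond("I am sad"))
print(respond("I feel ignored by my boss."))
print(respond("Because"))
```

To plug this into the loop above, call `respond(message)` instead of printing a fixed question.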

# Calculate the CO2 impact of your GPU usage in this course

To do this, use the [Machine Learning has a carbon footprint](https://mlco2.github.io/impact) application.

Start by identifying your GPU, then approximate the time spent on GPU and calculate... observe the equivalences.
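
As a back-of-the-envelope check on what the application reports, emissions can be approximated as energy used (power × time) multiplied by the carbon intensity of the electricity grid. The numbers below are placeholder assumptions to replace with your own: 70 W as the GPU's rated power, 10 hours of use, and 0.4 kg CO2eq per kWh. The web application may take further factors into account.

```python
def co2_kg(power_watts, hours, carbon_intensity_kg_per_kwh):
    """Rough CO2-equivalent estimate: energy (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * carbon_intensity_kg_per_kwh

# Placeholder values (assumptions): replace them with your GPU, usage and region.
estimate = co2_kg(power_watts=70, hours=10, carbon_intensity_kg_per_kwh=0.4)
print(f"{estimate:.2f} kg CO2eq")
```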




**TODO**