# 🌐 Multilingual Gender Classification with Zero-Shot Learning 🚀

Welcome to this Google Colab notebook, where we demonstrate a multilingual gender classification task using a zero-shot learning approach. Our goal is to classify sentences into four categories: sentences with a male subject 👨, sentences with a female subject 👩, neutral sentences with an inanimate or non-gendered subject 🔄, and hybrid sentences containing both male and female subjects 👫.

We use the Hugging Face Transformers library 🤗 to build a classification pipeline, leveraging a pre-trained zero-shot classifier model (BART Large MNLI). The classifier is applied to a small dataset of sentences in both English and Spanish 🌎 to evaluate its performance in determining the gender or neutrality of the subjects.

The notebook also tackles the challenge of correctly identifying hybrid sentences containing multiple subjects of different genders. To address this, we perform a two-step classification process, first excluding neutral sentences and then classifying the remaining sentences as either single-gender or hybrid 🧩.

We compute confusion matrices 📊 to assess the accuracy of the classifier in this task, enabling us to understand its effectiveness in differentiating between male, female, neutral, and hybrid subjects across multiple languages.

Explore this notebook to see how zero-shot learning can be applied to a multilingual classification task without any task-specific training data. On the small dataset used here, both confusion matrices come out fully diagonal 🌟.


# install transformers lib

In [None]:
!pip install transformers



# create simple dataset of gendered and neutral sentences

In [None]:
# Define the sentences in each language
male_sentences_en = ["John is a great athlete.", "Bob loves to play video games.", "He is a doctor at the hospital.", "David is a great cook.", "My brother is an engineer."]
female_sentences_en = ["Samantha is a great dancer.", "Emily loves to read books.", "She is a teacher at the school.", "Laura is a great singer.", "My sister is a nurse."]
neutral_sentences_en = ["The sun is shining.", "The book is on the table.", "The car is parked outside.", "The tree is tall.", "The coffee is hot."]
hybrid_sentences_en = ["Alex and Taylor went to the store.", "Jordan and Kim are both in the same class.", "Sam and Jamie are best friends.", "Taylor and Jordan went on a hike together.", "Jordan and Alex work at the same company."]


male_sentences_es = ["Juan es un gran atleta.", "Roberto ama jugar videojuegos.", "Él es un doctor en el hospital.", "David es un gran cocinero.", "Mi hermano es un ingeniero."]
female_sentences_es = ["Samantha es una gran bailarina.", "Emily ama leer libros.", "Ella es una maestra en la escuela.", "Laura es una gran cantante.", "Mi hermana es una enfermera."]
neutral_sentences_es = ["El sol está brillando.", "El libro está en la mesa.", "El auto está estacionado afuera.", "El árbol es alto.", "El café está caliente."]
hybrid_sentences_es = ["Juan y Maria fueron al cine.", "Roberto y Sofia son amigos de la infancia.", "Ella y él trabajan en la misma empresa.", "David y Laura cocinaron juntos.", "Mi hermana y mi hermano son muy cercanos."]


# Combine the sentences into a single list for each category
male_sentences = male_sentences_en + male_sentences_es
female_sentences = female_sentences_en + female_sentences_es
neutral_sentences = neutral_sentences_en + neutral_sentences_es
hybrid_sentences = hybrid_sentences_en + hybrid_sentences_es

# load classifier and classify on gender

### deactivate warnings

In [None]:
import warnings
# Set a global warning filter to ignore the UserWarning generated by the pipeline
warnings.filterwarnings("ignore", message="Length of IterableDataset")

### process dataset

In [None]:
from transformers import pipeline

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli', tokenizer='facebook/bart-large-mnli')

# Define the candidate labels for the classification task
label_1 = "human male subject"
label_2 = "human female subject"
label_3 = "neutral or inanimate subject"

candidate_labels = [label_1, label_2, label_3]

# Classify the male sentences
male_results = classifier(male_sentences, candidate_labels)

# Classify the female sentences
female_results = classifier(female_sentences, candidate_labels)

# Classify the neutral sentences
neutral_results = classifier(neutral_sentences, candidate_labels)
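
For context, a zero-shot classifier built on an NLI model scores each candidate label by turning it into a hypothesis sentence and checking whether the input entails it. The sketch below only illustrates how such hypotheses are formed. The template string is what we understand to be the pipeline's default, so treat it as an assumption, and `build_hypotheses` is a hypothetical helper that is not part of Transformers.

```python
from typing import List

# Assumed default template of the zero-shot pipeline (an assumption, not verified here).
TEMPLATE = "This example is {}."


def build_hypotheses(candidate_labels: List[str], template: str = TEMPLATE) -> List[str]:
    """Turn each candidate label into the hypothesis sentence scored by the NLI model."""
    return [template.format(label) for label in candidate_labels]


hypotheses = build_hypotheses(
    ["human male subject", "human female subject", "neutral or inanimate subject"]
)
for hypothesis in hypotheses:
    print(hypothesis)
```

The pipeline call also accepts a `hypothesis_template` argument, so a different wording can be tried, for instance `classifier(sentence, candidate_labels, hypothesis_template="This sentence is about a {}.")`. Whether that helps on this dataset has not been tested here.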


# Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Combine the results into a single list of predictions and ground truth labels
predictions = []
labels = []
for result, category in [(male_results, label_1), (female_results, label_2), (neutral_results, label_3)]:
    for r in result:
        predictions.append(r["labels"][0])
        labels.append(category)

# Compute the confusion matrix
cm = confusion_matrix(labels, predictions, labels=[label_1, label_2, label_3])
print(cm)


[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
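
As a small worked example, the matrix above can be reduced to an overall accuracy and per-class recall and precision. This is a minimal sketch in plain Python that copies the printed values by hand; the variable names are ours, and in `sklearn` confusion matrices the rows are the true labels and the columns are the predictions.

```python
# Values copied from the printed confusion matrix
# (rows: true labels, columns: predicted labels).
cm_values = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 0, 10],
]

n_classes = len(cm_values)
total = sum(sum(row) for row in cm_values)
correct = sum(cm_values[i][i] for i in range(n_classes))

# Overall accuracy: correctly classified sentences over all sentences.
accuracy = correct / total

# Recall per class: diagonal entry over its row sum (assumes no empty row).
recall = [cm_values[i][i] / sum(cm_values[i]) for i in range(n_classes)]

# Precision per class: diagonal entry over its column sum (assumes no empty column).
precision = [
    cm_values[i][i] / sum(row[i] for row in cm_values) for i in range(n_classes)
]

print(accuracy, recall, precision)
```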


# Working examples

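The following single-sentence examples follow the same logic as the dataset. The last one is in French, a language that does not appear in the dataset.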

In [None]:
classifier("this is a simple test", candidate_labels)['labels'][0]

'neutral or inanimate subject'

In [None]:
classifier("Sarah is a very nice person", candidate_labels)['labels'][0]

'human female subject'

In [None]:
classifier("Nicolas is a very nice person", candidate_labels)['labels'][0]

'human male subject'

In [None]:
classifier("il fait beau", candidate_labels)['labels'][0]

'neutral or inanimate subject'

# Multiple subjects problem

When a sentence contains both male and female subjects, it is misclassified: the pipeline has no "hybrid" label, so it cannot give the correct answer.  

Adding a 4th label to the pipeline degrades the results on the other three labels.  

Instead, we run a second inference on sentences with animate subjects:
- if the sentence is neutral -> done
- if the sentence is male or female -> check that only a single gender is present

Here is an example of a wrong label. The correct answer would be something like "both male and female human subjects":

In [None]:
classifier("Mesdames et messieurs les députés", candidate_labels)['labels'][0]

'human female subject'
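
Looking only at the top label hides how close the runner-up was. The pipeline result is a dictionary that also carries a `scores` list aligned with `labels`, so the full ranking can be inspected. The helpers below are a hypothetical sketch of that idea (`ranked_labels` and `top_margin` are our own names, not library functions).

```python
from typing import Dict, List, Tuple


def ranked_labels(result: Dict) -> List[Tuple[str, float]]:
    """Pair each label with its score, in the order returned by the pipeline."""
    return list(zip(result["labels"], result["scores"]))


def top_margin(result: Dict) -> float:
    """Gap between the best and second-best score; a small gap suggests ambiguity."""
    scores = sorted(result["scores"], reverse=True)
    return scores[0] - scores[1]
```

For instance, `top_margin(classifier("Mesdames et messieurs les députés", candidate_labels))` would show how narrowly the female label won over the male one, if at all.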

# Phase 2: Filter again

We drop the neutral label and instead distinguish sentences with a single gendered subject from sentences with multiple human subjects.

In [None]:
# Define the candidate labels for phase 2
label_1 = "a single male subject"
label_2 = "a single female subject"
label_3 = "multiple human subjects"

candidate_labels = [label_1, label_2, label_3]

In [None]:
# Classify the male sentences
male_results = classifier(male_sentences, candidate_labels)

# Classify the female sentences
female_results = classifier(female_sentences, candidate_labels)

# Classify the hybrid sentences
hybrid_results = classifier(hybrid_sentences, candidate_labels)


In [None]:
from sklearn.metrics import confusion_matrix

# Combine the results into a single list of predictions and ground truth labels
predictions = []
labels = []
for result, category in [(male_results, label_1), (female_results, label_2), (hybrid_results, label_3)]:
    for r in result:
        predictions.append(r["labels"][0])
        labels.append(category)

# Compute the confusion matrix
cm = confusion_matrix(labels, predictions, labels=[label_1, label_2, label_3])
print(cm)


[[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]
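
The two phases can be chained into a single function that returns one of four final categories. This is a minimal sketch under stated assumptions: the label strings are copied from the two phases of this notebook, while `two_step_classify`, `FINAL_NAMES`, and the short category names are hypothetical. The `classify` argument is any callable with the pipeline's calling convention, which keeps the function usable with a stand-in during testing.

```python
from typing import Callable, Dict, List

# Label sets copied from phase 1 and phase 2 of this notebook.
PHASE1_LABELS = ["human male subject", "human female subject", "neutral or inanimate subject"]
PHASE2_LABELS = ["a single male subject", "a single female subject", "multiple human subjects"]

# Hypothetical mapping from phase-2 labels to short final category names.
FINAL_NAMES = {
    "a single male subject": "male",
    "a single female subject": "female",
    "multiple human subjects": "hybrid",
}


def two_step_classify(sentence: str, classify: Callable[[str, List[str]], Dict]) -> str:
    """Return 'neutral', 'male', 'female' or 'hybrid' for one sentence."""
    # Phase 1: stop early when the top label is the neutral one.
    first = classify(sentence, PHASE1_LABELS)
    if first["labels"][0] == "neutral or inanimate subject":
        return "neutral"
    # Phase 2: re-classify the animate sentence as single-gender or multiple subjects.
    second = classify(sentence, PHASE2_LABELS)
    return FINAL_NAMES[second["labels"][0]]
```

With the pipeline loaded earlier, a call would look like `two_step_classify("Juan y Maria fueron al cine.", classifier)`. Note that "multiple human subjects" does not by itself guarantee mixed genders, so mapping it to "hybrid" is a simplification.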
