# 🚀 Gender Classification with Advanced Zero-Shot Learning 🎯

Welcome to this Google Colab notebook, where we demonstrate an advanced gender classification task using a zero-shot learning approach! The objective is to classify sentences into four categories: sentences with a single male subject 👨, sentences with a single female subject 👩, neutral sentences with an inanimate or non-gendered subject 🔄, and hybrid sentences containing multiple human subjects of different genders 👫.

We use the Hugging Face Transformers library 🤗 to build a classification pipeline around a pre-trained zero-shot classifier (`facebook/bart-large-mnli`). The classifier is then applied to a small dataset of English and Spanish sentences to evaluate how well it determines the gender or neutrality of the subjects.
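To make the zero-shot idea concrete, here is a minimal sketch of how an NLI-based zero-shot classifier scores candidate labels: each label is placed into a hypothesis sentence, the model rates how strongly the input entails each hypothesis, and the entailment scores are normalized across labels. The template string is the pipeline's default; the logits below are invented for illustration and are not real model output.

```python
import math

# Default hypothesis template of the Transformers zero-shot pipeline.
HYPOTHESIS_TEMPLATE = "This example is {}."

def build_hypotheses(labels, template=HYPOTHESIS_TEMPLATE):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def softmax(logits):
    """Normalize entailment logits across labels so they sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["human male subject", "human female subject", "neutral or inanimate subject"]
hypotheses = build_hypotheses(labels)

# Hypothetical entailment logits for "She is a teacher at the school."
# (made up for this sketch, not produced by a model).
entailment_logits = [-1.2, 3.4, -0.5]
scores = softmax(entailment_logits)

# Labels are ranked by score; the top-ranked label is the prediction.
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
print(hypotheses[1])
print(ranked[0][0])
```

In the real pipeline the logits come from the model's entailment output for each (sentence, hypothesis) pair; only the ranking step is reproduced here.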

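To refine the classification, we run two classification passes with different label sets. The first pass separates male, female, and neutral subjects; the second separates a single male subject, a single female subject, and multiple human subjects. The two results are then combined into one final label per sentence.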
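The way two such passes can be reconciled is sketched below as a small pure function. The label strings are the ones used later in this notebook; the function name `reconcile` and the short category names (`male`, `female`, `neutral`, `hybrid`) are assumptions made for this illustration only.

```python
# Candidate labels of the two passes, as defined later in this notebook.
PHASE_1 = ("human male subject", "human female subject", "neutral or inanimate subject")
PHASE_2 = ("a single male subject", "a single female subject", "multiple human subjects")

def reconcile(phase_1_label, phase_2_label):
    """Combine the top label of each pass into one of four categories.

    Returns None when the two passes contradict each other.
    """
    male_1, female_1, neutral_1 = PHASE_1
    male_2, female_2, multiple_2 = PHASE_2
    if (phase_1_label, phase_2_label) == (male_1, male_2):
        return "male"
    if (phase_1_label, phase_2_label) == (female_1, female_2):
        return "female"
    if phase_1_label == neutral_1:
        return "neutral"
    if phase_2_label == multiple_2:
        return "hybrid"
    return None

# A sentence tagged male in pass 1 but "multiple" in pass 2 ends up hybrid.
print(reconcile("human male subject", "multiple human subjects"))
# Contradicting passes yield None.
print(reconcile("human male subject", "a single female subject"))
```

Returning `None` for contradictions keeps disagreements visible instead of silently forcing them into one of the four categories.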

Finally, we compute a confusion matrix 📊 to assess the accuracy of the classifier in this particular task, enabling us to understand its effectiveness in differentiating between single male, single female, neutral, and hybrid subjects.
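As a rough sketch of what such an evaluation involves, the snippet below builds a confusion matrix from paired true and predicted categories using only the standard library. The category names and the example labels are hypothetical, invented only to show the layout; they are not results from this notebook.

```python
from collections import Counter

CATEGORIES = ["male", "female", "neutral", "hybrid"]

def confusion_matrix(true_labels, predicted_labels, categories=CATEGORIES):
    """Return a matrix with rows = true category and columns = predicted category."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in categories] for t in categories]

# Hypothetical labels, made up for this sketch.
true_labels = ["male", "male", "female", "neutral", "hybrid", "hybrid"]
predicted_labels = ["male", "male", "female", "neutral", "hybrid", "male"]

matrix = confusion_matrix(true_labels, predicted_labels)

# Accuracy is the share of predictions on the diagonal.
correct = sum(matrix[i][i] for i in range(len(CATEGORIES)))
accuracy = correct / len(true_labels)

for name, row in zip(CATEGORIES, matrix):
    print(f"{name:>8}: {row}")
print(f"accuracy = {accuracy:.2f}")
```

Off-diagonal cells show which categories get confused with each other, for example hybrid sentences predicted as a single male subject.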

Dive into this notebook to explore how zero-shot classification can be applied to a real-world labeling task without any task-specific training data 🌟.


# install transformers lib

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.3-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)

# create simple dataset of gendered and neutral sentences

In [None]:
# Define the sentences in each language
male_sentences_en = ["John is a great athlete.", "Bob loves to play video games.", "He is a doctor at the hospital.", "David is a great cook.", "My brother is an engineer."]
female_sentences_en = ["Samantha is a great dancer.", "Emily loves to read books.", "She is a teacher at the school.", "Laura is a great singer.", "My sister is a nurse."]
neutral_sentences_en = ["The sun is shining.", "The book is on the table.", "The car is parked outside.", "The tree is tall.", "The coffee is hot."]
hybrid_sentences_en = ["Alex and Taylor went to the store.", "Jordan and Kim are both in the same class.", "Sam and Jamie are best friends.", "Taylor and Jordan went on a hike together.", "Jordan and Alex work at the same company."]


male_sentences_es = ["Juan es un gran atleta.", "Roberto ama jugar videojuegos.", "Él es un doctor en el hospital.", "David es un gran cocinero.", "Mi hermano es un ingeniero."]
female_sentences_es = ["Samantha es una gran bailarina.", "Emily ama leer libros.", "Ella es una maestra en la escuela.", "Laura es una gran cantante.", "Mi hermana es una enfermera."]
neutral_sentences_es = ["El sol está brillando.", "El libro está en la mesa.", "El auto está estacionado afuera.", "El árbol es alto.", "El café está caliente."]
hybrid_sentences_es = ["Juan y Maria fueron al cine.", "Roberto y Sofia son amigos de la infancia.", "Ella y él trabajan en la misma empresa.", "David y Laura cocinaron juntos.", "Mi hermana y mi hermano son muy cercanos."]


# Combine the sentences into a single list for each category
male_sentences = male_sentences_en + male_sentences_es
female_sentences = female_sentences_en + female_sentences_es
neutral_sentences = neutral_sentences_en + neutral_sentences_es
hybrid_sentences = hybrid_sentences_en + hybrid_sentences_es

# load classifier and classify on gender

### deactivate warnings

In [None]:
import warnings
# Set a global warning filter to ignore the UserWarning generated by the pipeline
warnings.filterwarnings("ignore", message="Length of IterableDataset")

### process dataset

In [None]:
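import torch
from transformers import pipeline

# Load the zero-shot classification pipeline on GPU 0 when one is available, otherwise on the CPU (-1)
pipeline_device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("zero-shot-classification", model='facebook/bart-large-mnli', tokenizer='facebook/bart-large-mnli', device=pipeline_device)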


In [None]:
dataset_list = [male_sentences, female_sentences, neutral_sentences, hybrid_sentences]

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using device: {torch.cuda.get_device_name(device)}")
else:
    print("No GPU available, using CPU instead.")


Using device: Tesla T4


In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm

tqdm.pandas()

In [None]:
# filter #1
label_11 = "human male subject"
label_12 = "human female subject"
label_13 = "neutral or inanimate subject"
label_list_1 = [label_11, label_12, label_13]

# filter #2
label_21 = "a single male subject"
label_22 = "a single female subject"
label_23 = "multiple human subjects"
label_list_2 = [label_21, label_22, label_23]

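def label_gender(sentence_list, label_list):
  """Classify each sentence against label_list and keep only the top-scoring label."""
  # The pipeline runs on the device chosen when it was created,
  # so no device argument is needed at call time.
  sentence_list_results = classifier(sentence_list, label_list)

  # result["labels"] is sorted by descending score, so index 0 is the prediction
  result_list = []
  for result in sentence_list_results:
    result_list.append([result["sequence"], result["labels"][0]])

  return pd.DataFrame(result_list, columns=['sentence', 'label'])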

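def get_final_label(label_x, label_y):
  """Combine the phase 1 label (label_x) and the phase 2 label (label_y)."""
  if (label_x == label_11) and (label_y == label_21):
    return label_x  # both phases agree on a single male subject
  elif (label_x == label_12) and (label_y == label_22):
    return label_x  # both phases agree on a single female subject
  elif (label_x == label_13):
    return label_x  # phase 1 found a neutral or inanimate subject
  elif (label_y == label_23):
    return label_y  # phase 2 found multiple human subjects
  else:
    # the two phases contradict each other
    return 'error'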

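def get_result_df(sentence_list, label_list_1, label_list_2):
  """Run both classification phases and return one final label per sentence."""
  # phase 1: male / female / neutral
  result_df_1 = label_gender(sentence_list, label_list_1)
  # phase 2: single male / single female / multiple subjects
  result_df_2 = label_gender(sentence_list, label_list_2)
  # merging on 'sentence' yields the columns label_x (phase 1) and label_y (phase 2)
  result_df = pd.merge(result_df_1, result_df_2, on='sentence')
  result_df['label'] = result_df.progress_apply(lambda row: get_final_label(row['label_x'], row['label_y']), axis=1)
  # keep only the sentence and its final label
  result_df = result_df.drop(columns=['label_x', 'label_y'])

  return result_df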

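# Label each category's sentences, then stack the four result frames
result_list = []
for sentence_list in dataset_list:
  result_list.append(get_result_df(sentence_list, label_list_1, label_list_2))
# ignore_index=True gives the combined frame one continuous index instead of repeating 0-9
double_label_sentence_df = pd.concat(result_list, ignore_index=True)
double_label_sentence_df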



| | sentence | label |
|---|---|---|
| 0 | John is a great athlete. | human male subject |
| 1 | Bob loves to play video games. | human male subject |
| 2 | He is a doctor at the hospital. | human male subject |
| 3 | David is a great cook. | human male subject |
| 4 | My brother is an engineer. | human male subject |
| 5 | Juan es un gran atleta. | human male subject |
| 6 | Roberto ama jugar videojuegos. | human male subject |
| 7 | Él es un doctor en el hospital. | human male subject |
| 8 | David es un gran cocinero. | human male subject |
| 9 | Mi hermano es un ingeniero. | human male subject |


In [None]:
get_result_df(dataset_list[0], label_list_1, label_list_2)



| | sentence | label |
|---|---|---|
| 0 | John is a great athlete. | human male subject |
| 1 | Bob loves to play video games. | human male subject |
| 2 | He is a doctor at the hospital. | human male subject |
| 3 | David is a great cook. | human male subject |
| 4 | My brother is an engineer. | human male subject |
| 5 | Juan es un gran atleta. | human male subject |
| 6 | Roberto ama jugar videojuegos. | human male subject |
| 7 | Él es un doctor en el hospital. | human male subject |
| 8 | David es un gran cocinero. | human male subject |
| 9 | Mi hermano es un ingeniero. | human male subject |
