<a href="https://colab.research.google.com/github/meg-huggingface/bias-testing/blob/main/fineweb_bias_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Load packages

In [74]:
!pip install datasets
!pip install datatrove
import datasets
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from datatrove.pipeline.readers import ParquetReader



## Methodology

To measure bias in the dataset, we use a simple [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)-based approach. The idea is that the specificity of a term (in our case, how `biased` it is) can be quantified as an inverse function of the number of documents in which it occurs.
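
As a rough illustration of that inverse relationship, the sketch below computes a smoothed inverse document frequency on a hypothetical four-document toy corpus. The formula `ln((1 + n) / (1 + df)) + 1` is the smoothed variant that scikit-learn documents as its default; the corpus and the helper name `smoothed_idf` are assumptions made for this example and are not part of the notebook.

```python
import math

# Hypothetical toy corpus (not taken from FineWeb).
toy_docs = [
    "the doctor met the nurse",
    "the nurse met the engineer",
    "the engineer wrote code",
    "the doctor wrote a report",
]


def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, with df = documents containing the term."""
    n = len(docs)
    df = sum(term in doc.split() for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1


idf_common = smoothed_idf("the", toy_docs)   # appears in all 4 documents
idf_rare = smoothed_idf("report", toy_docs)  # appears in 1 document

print(f"the: {idf_common:.3f}, report: {idf_rare:.3f}")  # the: 1.000, report: 1.916
```

A term that appears in every document gets the minimum weight, while a term confined to a single document is weighted almost twice as heavily.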

Given a dataset and terms for a subpopulation (gender) of interest:
1. Compute inverse document frequencies on the full dataset
2. Compute the average TF-IDF vector over the documents that mention each subpopulation term (here, `man` and `woman`)
3. Sort the terms by their variance across subpopulations to surface words that are much more likely to appear for one subpopulation than for the other




### Load Fineweb


In [75]:
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb/data/CC-MAIN-2024-10", progress=True)  # add limit=10000 for a quick test run
corpus = map(lambda doc: doc.text, data_reader())
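
Note that `map` returns a lazy, single-pass iterator, so `corpus` can be consumed only once; this is presumably why later cells call `data_reader()` again instead of reusing `corpus`. A minimal sketch of that behavior, using a hypothetical stand-in generator `toy_reader` rather than the datatrove reader:

```python
# Hypothetical stand-in for the reader: each call yields a fresh stream of documents.
def toy_reader():
    yield from ["first document", "second document", "third document"]


toy_corpus = map(str.upper, toy_reader())

first_pass = list(toy_corpus)   # consumes the iterator
second_pass = list(toy_corpus)  # nothing left to yield

print(len(first_pass), len(second_pass))  # prints "3 0"

# Calling the reader again produces a fresh stream.
fresh_pass = list(map(str.upper, toy_reader()))
```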

### Compute frequencies

In [None]:
# Create a CountVectorizer object
#count_vect = CountVectorizer(stop_words='english')
# Fit and transform the data
#counts = count_vect.fit_transform(corpus)
# Create a TfidfTransformer object
#tfidf_transformer = TfidfTransformer()
# Fit and transform the data
#tfidf = tfidf_transformer.fit_transform(counts)


# Step 1: get Inverse document frequencies for the dataset
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
full_tfidf = vectorizer.fit_transform(corpus)
tfidf_feature_names = np.array(vectorizer.get_feature_names_out())


0it [00:00, ?it/s] 2024-05-31 00:42:48.092 | INFO     | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00000.parquet
1000983it [20:49, 1889.51it/s] 2024-05-31 01:03:37.225 | INFO     | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00001.parquet
1998983it [43:10, 1417.98it/s] 2024-05-31 01:25:58.477 | INFO     | datatrove.pipeline.readers.base:read_files_shard:193 - Reading input file 000_00002.parquet
2428949it [52:56, 1163.31it/s]

### Bias analysis: Gender tf-idf

In [None]:
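# Step 2: get average TF-IDF vectors **for each gender**
# A document counts as mentioning a gender only if the exact lowercase token
# appears after whitespace splitting (so "Man" or "man," do not match).
# data_reader() is called again to get a fresh pass over the data.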
woman_docs = map(lambda doc: doc.text, filter(lambda doc: "woman" in doc.text.split(), data_reader()))
man_docs = map(lambda doc: doc.text, filter(lambda doc: "man" in doc.text.split(), data_reader()))
tfidf_by_gender = {}
tfidf_by_gender["man"] = np.asarray(vectorizer.transform(man_docs).mean(axis=0))[0]
tfidf_by_gender["woman"] = np.asarray(vectorizer.transform(woman_docs).mean(axis=0))[0]

In [None]:
print(tfidf_by_gender)

In [None]:
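# Step 3: for each term, compute the spread across genders
# (square root of the summed squared deviations from the per-term mean)
all_tfidf = np.array(list(tfidf_by_gender.values()))
# Center on the mean across genders (the sum is not a valid center for a variance)
tf_idf_var = all_tfidf - all_tfidf.mean(axis=0, keepdims=True)
tf_idf_var = np.sqrt((tf_idf_var * tf_idf_var).sum(axis=0))
sort_by_variance = tf_idf_var.argsort()[::-1]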

In [None]:
print(all_tfidf)

In [None]:
# Create the data structure for the visualization,
# showing the highest variance words for each gender,
# and how they deviate from the mean
pre_pandas_lines = [
    {
        "word": tfidf_feature_names[w],
        "man": all_tfidf[0, w],
        "woman": all_tfidf[1, w],
        "man+": all_tfidf[0, w] - all_tfidf[:, w].mean(),
        "woman+": all_tfidf[1, w] - all_tfidf[:, w].mean(),
        "variance": tf_idf_var[w],
        "total": all_tfidf[:, w].sum(),
    }
    for w in sort_by_variance[:50]
]

### Results

In [None]:
# Display the highest-variance terms as a styled table
df = pd.DataFrame(pre_pandas_lines)
df.style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)

#### Sorting by bias

To better surface biases, we can sort the table by how much a term is over-represented for one gender relative to the mean across genders (the `man+` and `woman+` columns).

In this case, we see that documents mentioning `man` are more likely to include `god` than those mentioning `woman`, which in turn are more likely to include `cancer`.
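
To make the sort keys concrete, the sketch below computes deviations from the per-term mean for a few made-up scores. The values and the names `toy_terms` and `toy_scores` are hypothetical and are not results from FineWeb. With exactly two groups the deviations mirror each other, so the two sorted views list the same terms in opposite order.

```python
import numpy as np

# Hypothetical average TF-IDF scores; rows are ("man", "woman"), columns are terms.
toy_terms = np.array(["term_a", "term_b", "term_c"])
toy_scores = np.array([
    [0.10, 0.02, 0.05],  # documents mentioning "man"
    [0.04, 0.08, 0.05],  # documents mentioning "woman"
])

term_mean = toy_scores.mean(axis=0)
man_plus = toy_scores[0] - term_mean
woman_plus = toy_scores[1] - term_mean

# With exactly two groups, one deviation is the negative of the other.
assert np.allclose(man_plus, -woman_plus)

print(toy_terms[np.argsort(-man_plus)])    # most over-represented for "man" first
print(toy_terms[np.argsort(-woman_plus)])  # most over-represented for "woman" first
```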

In [None]:
df.sort_values('man+', ascending=False).style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)

In [None]:
df.sort_values('woman+', ascending=False).style.background_gradient(
    axis=None,
    vmin=0,
    vmax=0.2,
    cmap="YlGnBu"
).format(precision=2)