# Gender biases in Wikipedia 
## todo: find fancy title

Wikipedia has become a very large source of information. As of November 2019, the English Wikipedia had more than 5M entries [[1]](https://en.wikipedia.org/wiki/Wikipedia:Statistics#Page_views), and it keeps growing at an average rate of 500 entries per day.

In a previous study, Wagner et al. [[2]](https://arxiv.org/abs/1501.06307) showed how gender biases manifest in Wikipedia in the way women and men are portrayed. In another study, Graells-Garrido et al. [[3]](https://labtomarket.files.wordpress.com/2018/01/wiki_gender_bias.pdf) showed that women's biographies are more likely to contain sex-related content. Along with these, several other studies have examined topic-related biases in the way women are portrayed, but we can also look at the problem from a linguistic perspective.

Linguistic bias is defined as a systematic asymmetry in word choice that reflects the social-category cognitions that are applied to the described group or individual(s) [[4]](https://oxfordre.com/communication/communication/view/10.1093/acrefore/9780190228613.001.0001/acrefore-9780190228613-e-439). We want to analyze how men and women are portrayed and, more specifically, the adjectives used to describe them, with the aim of spotting possible gender biases from a linguistic perspective. To do so, we will use the overview of the biographies in the English Wikipedia together with other characteristics of the people we are analysing.

We will start by exploring the dataset, e.g. the ratio of male and female entries, the presence of other genders, etc. Later, we will explore the language used in the overviews by focusing on the adjectives. We restrict the analysis to adjectives given the level of abstraction they provide [[5]](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/download/10539/10513). This analysis will be conducted by first extracting the most frequent adjectives from all the biographies used. We will build a vocabulary based on the most frequent adjectives used in male biographies and the most frequent ones used in female biographies. Using this vocabulary, we will create a representation of each person based on the adjectives in our vocabulary that appear in their biography.
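
As a rough illustration of this representation, the sketch below builds a vocabulary from the most frequent adjectives of each gender and encodes one biography as a binary vector. The toy adjective lists and the helper names (`top_adjectives`, `encode`) are hypothetical and only meant to make the idea concrete; the actual pipeline appears later in the notebook.

```python
from collections import Counter


def top_adjectives(adjective_lists, k):
    """Return the k most frequent adjectives across a list of biographies."""
    counts = Counter(adj for adjectives in adjective_lists for adj in adjectives)
    return [adj for adj, _ in counts.most_common(k)]


def encode(adjectives, vocabulary):
    """Binary vector: 1 if the vocabulary adjective appears in the biography."""
    present = set(adjectives)
    return [int(adj in present) for adj in vocabulary]


# Hypothetical toy data: one list of adjectives per biography.
male_bios = [["american", "former"], ["american", "former"], ["professional"]]
female_bios = [["american", "first"], ["first", "british"], ["first", "american"]]

# Union of the top-k adjectives of each gender, sorted for a stable column order.
vocabulary = sorted(set(top_adjectives(male_bios, 2)) | set(top_adjectives(female_bios, 2)))

# Adjectives outside the vocabulary (here "young") are simply ignored.
vector = encode(["first", "american", "young"], vocabulary)
```

Taking the union of both top-k lists means the vocabulary can hold fewer than 2k adjectives, since frequent adjectives are often shared by both genders.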

Once we have a vector representation of each person, we will create a model using logistic regression that will try to predict whether a biography belongs to a man or a woman. If this task turns out to be feasible, it means there is a pattern in the usage of language that allows us to distinguish between genders, highlighting the presence of a bias. Will our model succeed in its task? Continue with us to discover our results!
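
To make the idea concrete, here is a minimal, dependency-free sketch of logistic regression trained by batch gradient descent on binary adjective vectors. It is not the model used in this notebook (which relies on scikit-learn's `LogisticRegression` and a held-out test set); the toy data, the function names and the hyperparameters are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Plain batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, target in zip(X, y):
            # error = predicted probability of "female" minus the true label
            error = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) - target
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict(weights, bias, row):
    """Return 1 (female) if the predicted probability is at least 0.5, else 0 (male)."""
    return int(sigmoid(sum(w * x for w, x in zip(weights, row)) + bias) >= 0.5)


# Hypothetical toy vocabulary of two adjectives and a balanced sample
# (label 1 = female, 0 = male). Only the first adjective separates the classes.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

weights, bias = fit_logistic(X, y)
accuracy = sum(predict(weights, bias, row) == t for row, t in zip(X, y)) / len(y)
```

On a balanced sample, guessing at random gives about 50% accuracy, so an accuracy clearly above that level on unseen biographies is what would point to a gender-dependent pattern in adjective use. The sign and size of each learned weight then hint at which adjectives drive the prediction.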

[[1] Wikipedia Statistics](https://en.wikipedia.org/wiki/Wikipedia:Statistics#Page_views)

[[2] It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia](https://arxiv.org/abs/1501.06307)

[[3] First Women, Second Sex: Gender Bias in Wikipedia](https://labtomarket.files.wordpress.com/2018/01/wiki_gender_bias.pdf)

[[4] Oxford Research Encyclopedia](https://oxfordre.com/communication/communication/view/10.1093/acrefore/9780190228613.001.0001/acrefore-9780190228613-e-439)

[[5] Linguistic Bias in Collaboratively Produced Biographies: Crowdsourcing Social Stereotypes?](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/download/10539/10513)

In [None]:
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
import os
import json
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

## Data preprocessing and extraction

As stated before, we are only interested in analyzing the biographies in Wikipedia, so we need to filter them. More precisely, we will use the overview of the biographies in the English Wikipedia. To do that, we followed these steps:

1. We use the [Wikidata Human Gender Indicators (WHGI)](http://whgi.wmflabs.org) dataset, which contains all the biography articles in all Wikipedias and is updated weekly. From this dataset (November 2019 version), we get all the biographies that are in the English Wikipedia and that have a gender. For each entry, we get the Q-id, gender and occupation. We use this dataset because it is more up to date than the Wikidata one in the cluster (dated 2017). You can see the code [here](createDataset/1_extract_qid_wikidata.py).


2. Then, we need to link the previous information with the Wikipedia article. For that, we use the Wikidata dataset found in the cluster. We keep only the entries obtained in the previous step and retrieve the name of each entry in the English Wikipedia. You can see the code [here](createDataset/2_extract_people_wikidata.py).


3. Next, we have to obtain the biographies from the Wikipedia dataset. To do that, we simply join the English Wikipedia dataset (also found in the cluster) with the one obtained in the previous step on the Wikipedia title, which is unique. You can see the code [here](createDataset/3_filter_people_enwiki.py).


4. Next, we need to extract and clean the overview of the Wikipedia text. First, we find the end of the overview (the text that follows it usually starts with either `==` or `[[Category:`). Then we remove the references, the comments from the editors, the quotes and the text inside parentheses or brackets. You can see the code [here](createDataset/4_extract_overview_enwiki.py).


5. Finally, as will be shown later in the analysis, we filter the dataset according to the gender of the people. We keep only the male and female entries, as the other genders represent less than 1% of the whole dataset. You can see the code [here](createDataset/5_filter_female_male.py).
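
The cleaning described in step 4 could look roughly like the sketch below. It is not the code from `4_extract_overview_enwiki.py`: the function name, the regular expressions and the sample text are assumptions, and the removal of quotes is left out.

```python
import re

# Innermost parenthesised or bracketed group (no nested brackets inside).
BRACKETED = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")


def extract_overview(wikitext):
    """Keep the text before the first section or category, then clean it."""
    # The overview ends where the first section heading or category link starts.
    end = re.search(r"==|\[\[Category:", wikitext)
    overview = wikitext[: end.start()] if end else wikitext

    # Remove editor comments and references (self-closing tags first).
    overview = re.sub(r"<!--.*?-->", "", overview, flags=re.DOTALL)
    overview = re.sub(r"<ref[^>]*/>", "", overview)
    overview = re.sub(r"<ref[^>]*>.*?</ref>", "", overview, flags=re.DOTALL)

    # Repeatedly remove innermost bracketed groups so nested ones disappear too.
    previous = None
    while previous != overview:
        previous = overview
        overview = BRACKETED.sub("", overview)

    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", overview).strip()


# Hypothetical article text used only to exercise the function.
sample = (
    "Ada Lovelace (1815-1852) was an English mathematician"
    "<ref>Some source</ref> and writer.<!-- editor note -->\n"
    "== Biography ==\nShe was born in London."
)
cleaned = extract_overview(sample)
```

Note that this sketch also strips wiki links such as `[[London]]`, because they sit inside square brackets; whether that is desirable depends on how the raw text is stored.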

The final dataset ... 
#### ADD THE SCHEMA AND MAYBE SOME INFO ON THE LENGTH



## Gender analysis

In [None]:
LOCAL_PATH = "../data/"
WIKI_DATA = os.path.join(LOCAL_PATH, "overview_wikipedia.json")

In [None]:
# create the session
spark = SparkSession.builder.getOrCreate()

# create the context
sc = spark.sparkContext

# load data
df = spark.read.json(WIKI_DATA)

# parse the gender column (a stringified list) and keep only the first gender listed for each person
df = df.withColumn("gender", split(regexp_replace(regexp_replace(regexp_replace(regexp_replace(df['gender'], \
                                                            '\\[', ''), '\\]', ''), ' ', ''),"'", ""), ","))
df = df.withColumn("gender", df['gender'][0])
df.show()

In [None]:
gender_counts = df.groupBy("gender").agg(count("*").alias("count")).sort(desc("count"))
gender_counts.show()

In [None]:
print("In total there are {} different genders".format(gender_counts.count()))

In [None]:
# Open dictionary to match id to gender name
with open('../data/dict_genders.json') as json_file:
    line = json_file.readline()
    dict_genders = json.loads(line)

In [None]:
def get_gender(gender):
    return dict_genders.get(gender, "other")

# get the gender (male, female or other) from the id
udf_get_gender = udf(get_gender)
gender_counts = gender_counts.withColumn("gender", udf_get_gender("gender"))

In [None]:
# group the other genders
gender_counts_grouped = gender_counts.groupBy("gender").agg(sum("count").alias("count")).sort(desc("count"))
gender_counts_grouped.show()

In [None]:
# dataframe to pandas
gender_counts_pd = gender_counts_grouped.toPandas()

pl = gender_counts_pd.plot(kind="bar", x="gender", y="count", figsize=(15, 7), log=True, alpha=0.5, color="green")
pl.set_xlabel("Gender")
pl.set_ylabel("Number of biographies (Log scale)")
pl.set_title("Number of biographies by gender");

**Attention**: The y-axis is in log scale!

In [None]:
n_total = gender_counts_pd['count'].sum()
n_male = gender_counts_pd[gender_counts_pd['gender'] == 'male']['count'].values[0]
n_female = gender_counts_pd[gender_counts_pd['gender'] == 'female']['count'].values[0]
n_other = n_total - n_male - n_female

print("{:.2f}% of the entries are male".format(n_male/n_total*100))
print("{:.2f}% of the entries are female".format(n_female/n_total*100))
print("{:.2f}% of the entries are of another gender".format(n_other/n_total*100))

Based on these numbers, we decide to **drop the other genders** and continue our analysis with only female and male.

## Data extraction
Explain process of extraction of the data; combination of wikimedia and wikipedia; preprocessing steps.

In [None]:
LOCAL_PATH = "../data/"
WIKI_DATA = os.path.join(LOCAL_PATH, "wikipedia_male_female.json")

In [None]:
# Load data frame
df = spark.read.json(WIKI_DATA)
df.show()

The previous table shows an example of the data we are working with. Each row represents the article of a given person with the following information associated:
- Name
- ID (Wikidata)
- Wiki-title (Wikipedia)
- Gender
- Occupation
- Overview

In the following steps we are going to explore how women are represented in Wikipedia. First, we will start with some basic statistics, such as the fraction of entries that corresponds to each gender and how it varies across occupations. Afterwards, we will move on to the core analysis of the project: the language used to present the different people. The idea is to focus on the adjectives used in the overviews and look for a bias between male and female representations.
**ADD SOME MORE DETAILS AND INTRODUCTION**

## Data translation
Translate the Wikidata codes into their actual meaning in terms of gender and occupation.
**Note:** Since we are interested in both gender and occupation, people without an associated occupation are dropped from the dataset when the translation from Wikidata codes to words is performed.
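
In plain Python, the translation step amounts to something like the following sketch. The codes, labels and helper names are placeholders made up for illustration (they are not real Wikidata ids), and the occupation field is assumed to be stored as a stringified list.

```python
import re

# Placeholder codes and labels, for illustration only (not real Wikidata ids).
GENDERS = {"G1": "male", "G2": "female"}
OCCUPATIONS = {"O1": "politician", "O2": "painter"}


def parse_codes(raw):
    """Turn a stringified list such as "['O1', 'O2']" into ['O1', 'O2']."""
    cleaned = re.sub(r"[\[\]' ]", "", raw)
    return [code for code in cleaned.split(",") if code]


def translate_person(person):
    """Yield one translated row per occupation; people without one yield nothing."""
    gender = GENDERS.get(person["gender"])
    for code in parse_codes(person["occupation"]):
        # Codes missing from the dictionary translate to None.
        yield {"id": person["id"], "gender": gender, "occupation": OCCUPATIONS.get(code)}


people = [
    {"id": "P1", "gender": "G2", "occupation": "['O1', 'O2']"},
    {"id": "P2", "gender": "G1", "occupation": "[]"},
]
rows = [row for person in people for row in translate_person(person)]
```

Because a person with several occupations produces several rows, later counts need to use distinct ids to avoid counting the same person twice.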

In [None]:
# Open Gender dictionary
with open('../data/dict_genders.json') as json_file:
    line = json_file.readline()
    dict_genders = json.loads(line)
    
# Open occupations dictionary
dict_occupations = {}
with open('../data/dict_occupations.json') as json_file:
    content = json_file.readlines()
    for line in content:
        occ = json.loads(line)
        dict_occupations.update(occ)
        
# Observation: We need dict_categories_occupations.json in the data folder
# Open occupations categories dictionary
with open('../data/dict_categories_occupations.json') as json_file:
    line = json_file.readline()
    dict_cat_occ = json.loads(line)

# Create function to translate a code into a category
def translate(mapping):
    def translate_(col):
        return mapping.get(col)
    return udf(translate_, StringType())

In [None]:
# Translate gender and occupations codes into corresponding labels
df = df.withColumn('gender', translate(dict_genders)('gender'))\
       .withColumn('occupation', explode(split(regexp_replace(regexp_replace(regexp_replace\
                                (regexp_replace(df['occupation'], '\\[', ''), '\\]', ''), ' ', ''),"'", ""), ",")))\
       .filter(col('occupation') != '')\
       .withColumn('occupation', translate(dict_occupations)('occupation'))\
       .withColumn('field', translate(dict_cat_occ)('occupation'))

df.show()

## TODO: solve display of table

### Gender distribution

In [None]:
# Query to count how many distinct males and females are in the data frame
# Observation: entries without an occupation label were dropped during the translation step, which is why there are fewer males and females than before
df.createOrReplaceTempView("df")

query = """
SELECT gender, count(DISTINCT id) as count
FROM df
GROUP BY gender
ORDER BY count DESC
"""

gender_counts = spark.sql(query)
gender_counts = gender_counts.toPandas()
gender_counts

In [None]:
pl = gender_counts.plot(kind="bar", x="gender", y="count", figsize=(15, 7), log=False, \
                        alpha=0.5, color="green", rot=0)
pl.set_xlabel("Gender")
pl.set_ylabel("Number of biographies")
pl.set_title("Number of biographies by gender");

### Occupation distribution

In [None]:
df.createOrReplaceTempView("df")

query = """
SELECT field, count(DISTINCT id) as count
FROM df
WHERE field IS NOT NULL
GROUP BY field
ORDER BY count DESC
"""

occu_cat_counts = spark.sql(query)
occu_cat_counts = occu_cat_counts.toPandas()
occu_cat_counts.head()

In [None]:
pl = occu_cat_counts.plot(kind="bar", x="field", y="count", figsize=(15, 7), log=False, \
                          alpha=0.5, color="green", rot=0)
pl.set_xlabel("Field of occupation")
pl.set_ylabel("Number of biographies")
pl.set_title("Number of biographies by field of occupation");

The most common occupation among our characters is **Sports** followed by **Artist** and **Politics**. The group **None** represents all those occupations that did not match any of the previous groups. 

In [None]:
n_total = occu_cat_counts['count'].sum()
n_sports = occu_cat_counts[occu_cat_counts['field'] == 'Sports']['count'].values[0]
n_artist = occu_cat_counts[occu_cat_counts['field'] == 'Artist']['count'].values[0]
n_politics = occu_cat_counts[occu_cat_counts['field'] == 'Politics']['count'].values[0]

print("{:.2f}% of the entries work in the sports field".format(n_sports/n_total*100))
print("{:.2f}% of the entries work in the artistic field".format(n_artist/n_total*100))
print("{:.2f}% of the entries work in the politics field".format(n_politics/n_total*100))

### Gender by occupation

How are the distinct genders represented within the different occupational groups? Is there any group where women have a greater representation than men?

In [None]:
df.createOrReplaceTempView("df")

query = """
SELECT field, gender, count(DISTINCT id) as count
FROM df
WHERE field IS NOT NULL
GROUP BY field, gender
ORDER BY field, gender
"""

occu_gender_counts = spark.sql(query)
occu_gender_counts = occu_gender_counts.toPandas()
occu_gender_counts.head()

From the plots below we can point out several details:
- Female biographies are fewer than male ones in all fields except **Model**, which is associated with the fashion industry. In this case, for every 5 female biographies there is one male biography.
- **Religion** and **Military** are the fields where the male:female ratio is largest. In religion-related biographies, for each female we find 69 males. In military-related ones, for each female we find 62 males.
- The most balanced occupational field is **Artist**, where the female:male ratio is 1:3.
   

In [None]:
male_count = occu_gender_counts[occu_gender_counts['gender'] == 'male']['count'].tolist()
female_count = occu_gender_counts[occu_gender_counts['gender'] == 'female']['count'].tolist()
index = occu_gender_counts['field'].unique().tolist()
occ_by_gender = pd.DataFrame({'male': male_count, 'female': female_count}, index=index)

fig, ax = plt.subplots(2, 1, figsize=(15, 14))
pl = occ_by_gender.plot(kind="bar", log=False, alpha=0.5, color=["green", "red"], rot=0, ax=ax[0])
pl.set_xlabel("Field of occupation")
pl.set_ylabel("Number of biographies")
pl.set_title("Number of biographies by gender and field of occupation");
             
occ_by_gender['ratio'] = occ_by_gender.apply(lambda x: x.male / x.female, axis=1)
pl = occ_by_gender.plot(kind="bar", y='ratio', alpha=0.5, color='green', rot=0, ax=ax[1])
for p in ax[1].patches:
    disp= '{:.1f}'.format(p.get_height())
    ax[1].annotate(disp, (p.get_x() * 1.005, p.get_height() +0.5))
pl.set_xlabel("Field of occupation")
pl.set_ylabel("Male biographies per female biography")
pl.set_title("Ratio of male:female biographies by field of occupation");

## Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [None]:
ADJ_MALE = os.path.join(LOCAL_PATH, "count_male_adjectives.json")
ADJ_FEM = os.path.join(LOCAL_PATH, "count_female_adjectives.json")

In [None]:
most_common_adj_male = spark.read.json(ADJ_MALE)
most_common_adj_fem = spark.read.json(ADJ_FEM)

In [None]:
most_common_adj_male = most_common_adj_male.orderBy(desc("count"))
most_common_adj_male_pd = most_common_adj_male.toPandas()
first_100_adj_male = most_common_adj_male_pd[:100].copy()
first_100_adj_male.head()

In [None]:
most_common_adj_fem = most_common_adj_fem.orderBy(desc("count"))
most_common_adj_fem_pd = most_common_adj_fem.toPandas()
first_100_adj_fem = most_common_adj_fem_pd[:100].copy()
first_100_adj_fem.head()

In [None]:
most_common_adj = set()
most_common_adj.update(first_100_adj_male['adjectives'].tolist())
most_common_adj.update(first_100_adj_fem['adjectives'].tolist())
most_common_adj = list(most_common_adj)
len(most_common_adj)

In [None]:
WIKI_MALE = os.path.join(LOCAL_PATH, "wikipedia_male_adjectives.json")
WIKI_FEM = os.path.join(LOCAL_PATH, "wikipedia_female_adjectives.json")

In [None]:
df_male = spark.read.json(WIKI_MALE)
df_fem = spark.read.json(WIKI_FEM)

In [None]:
def get_n_adjs(list_adj):
    return len(list_adj)

udf_get_n_adjs = udf(get_n_adjs)

In [None]:
df_male_model = df_male.select("id", "gender", "adjectives")
df_male_model = df_male_model.withColumn("n-adjs", udf_get_n_adjs("adjectives"))
df_male_model = df_male_model.withColumn("gender", udf_get_gender("gender"))
df_male_model.show()

In [None]:
df_fem_model = df_fem.select("id", "gender", "adjectives")
df_fem_model = df_fem_model.withColumn("n-adjs", udf_get_n_adjs("adjectives"))
df_fem_model = df_fem_model.withColumn("gender", udf_get_gender("gender"))
df_fem_model.show()

In [None]:
df_male_pd = df_male_model.toPandas()
df_fem_pd = df_fem_model.toPandas()

In [None]:
def encode_input(list_words_present, list_adj_to_encode):
    encoding = np.zeros(len(list_adj_to_encode))
    for i, adj in enumerate(list_adj_to_encode):
        if adj in list_words_present:
            encoding[i] = 1
    return encoding

In [None]:
def encode_output(gender):
    return int(gender == 'female')

In [None]:
df_male_pd['input'] = df_male_pd.adjectives.map(lambda x: encode_input(x, most_common_adj))
df_male_pd['output'] = df_male_pd.gender.map(lambda x: encode_output(x))
df_male_pd.head()

In [None]:
df_fem_pd['input'] = df_fem_pd.adjectives.map(lambda x: encode_input(x, most_common_adj))
df_fem_pd['output'] = df_fem_pd.gender.map(lambda x: encode_output(x))
df_fem_pd.head()

In [None]:
n_male = len(df_male_pd)
n_fem = len(df_fem_pd)
assert n_male > n_fem

n_train = np.round(0.7 * n_fem).astype(np.uint)
print("Number of entries for each gender on train: {}".format(n_train))

n_test = (n_fem - n_train).astype(np.uint)
print("Number of entries for each gender on test: {}".format(n_test))

In [None]:
np.random.seed(8)

train_indices_fem = np.random.choice(range(n_fem), n_train, replace=False)
test_indices_fem = np.setdiff1d(range(n_fem), train_indices_fem)

train_indices_male = np.random.choice(range(n_male), n_train, replace=False)
left_indices_male = np.setdiff1d(range(n_male), train_indices_male)
test_indices_male = np.random.choice(left_indices_male, n_test, replace=False)

In [None]:
assert len(train_indices_fem) == len(train_indices_male)
assert len(test_indices_fem) == len(test_indices_male)

In [None]:
df_fem_train = df_fem_pd.iloc[train_indices_fem]
df_male_train = df_male_pd.iloc[train_indices_male]

In [None]:
X_train_fem = np.stack(df_fem_train.input)
y_train_fem = np.stack(df_fem_train.output)

X_train_male = np.stack(df_male_train.input)
y_train_male = np.stack(df_male_train.output)

X_train = np.concatenate((X_train_fem, X_train_male), axis=0)
y_train = np.concatenate((y_train_fem, y_train_male), axis=0)

In [None]:
print("Shape of train input: {}".format(X_train.shape))
print("Shape of train output: {}".format(y_train.shape))

In [None]:
lr = LogisticRegression()
# train the model
lr.fit(X_train, y_train)

In [None]:
df_fem_test = df_fem_pd.iloc[test_indices_fem]
df_male_test = df_male_pd.iloc[test_indices_male]

In [None]:
X_test_fem = np.stack(df_fem_test.input)
y_test_fem = np.stack(df_fem_test.output)

X_test_male = np.stack(df_male_test.input)
y_test_male = np.stack(df_male_test.output)

X_test = np.concatenate((X_test_fem, X_test_male), axis=0)
y_test = np.concatenate((y_test_fem, y_test_male), axis=0)

In [None]:
# predict
y_pred = lr.predict(X_test)

In [None]:
# confusion matrix (true - rows, pred - cols)
cm = confusion_matrix(y_test, y_pred)
cm
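
To read this matrix, recall that `encode_output` maps male to 0 and female to 1, so with rows as true labels and columns as predictions the layout is `[[TN, FP], [FN, TP]]` with "female" as the positive class. The sketch below derives a few summary metrics from such a matrix; the helper name `summarize` and the counts are hypothetical and are not results from this notebook.

```python
def summarize(cm):
    """cm = [[TN, FP], [FN, TP]] with rows = true labels, cols = predictions."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # share of male biographies correctly predicted as male
        "recall_male": tn / (tn + fp),
        # share of female biographies correctly predicted as female
        "recall_female": tp / (fn + tp),
        # share of "female" predictions that are actually female
        "precision_female": tp / (fp + tp),
    }


# Hypothetical counts, not results from this notebook.
metrics = summarize([[60, 40], [30, 70]])
```

Looking at the per-gender recall alongside the accuracy shows whether the model errs more often on one gender than on the other.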

In [None]:
# accuracy
lr.score(X_test, y_test)

In [None]:
# get probabilities
probs = lr.predict_proba(X_test)
probs_pd = pd.DataFrame(probs, columns=['prob_male', 'prob_female'])
probs_pd.head()

In [None]:
# get coefficients
lr.coef_[0]

In [None]:
# Create pandas DataFrame 
data_pd = {'adjective': most_common_adj, 'coefficient': lr.coef_[0].tolist()}   
df_coef = pd.DataFrame(data_pd) 
df_coef = df_coef.sort_values(by='coefficient', ascending=False).reset_index(drop=True)

In [None]:
df_coef.head(10)

In [None]:
df_coef.tail(10)

In [None]:
subjectivity_dictionary = {}
    
with open('../data/subjectivity_dictionary.json', 'r') as json_file:
    for item in eval(json_file.readline()):
        subjectivity_dictionary.update({item['word']: (item['strength'], item['subj'])})

In [None]:
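def get_subjectivity(adj):
    # adjectives missing from the lexicon return None instead of raising a TypeError
    return subjectivity_dictionary.get(adj, (None, None))[1]

def get_strength(adj):
    # adjectives missing from the lexicon return None instead of raising a TypeError
    return subjectivity_dictionary.get(adj, (None, None))[0]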

In [None]:
df_coef['subjectivity'] = df_coef['adjective'].map(lambda x: get_subjectivity(x))
df_coef['strength'] = df_coef['adjective'].map(lambda x: get_strength(x))

In [None]:
df_coef.head(10)

In [None]:
df_coef.tail(10)