### Conditional probabilities

Since we found a small gender segregation effect earlier, let's look at this again in a different way. The questions are:

- if the first author is female, does it affect the probability that another author is female?
- if the last author is female, does it affect the probability that another author is female?

We ask the same two questions for male authors.

In [None]:
# install the following packages in the environment:
# python3 -m pip install pandas
# python3 -m pip install seaborn

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import numpy as np
import json

import os

from read_jsondata import read_jsons

import time

## How can we check this?

We need to compute a conditional probability: given that the last author is female (or male), what is the probability that the first author is female?

We assume for the moment that authorship genders are independent of each other. 

We write: 
- first author female: A
- first author male: Y
- last author female: B
- last author male: Z

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid Y)\, P(Y)}$$


We can estimate every quantity in this equation from the data. For example, take all papers with a female last author, count them ($n_b$), and then count how many of these also have a female first author. The ratio of the two counts is an estimate of $P(A | B)$. The other quantities are estimated in the same way.
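
The counting estimate described above can be sketched on a small example. The `toy` dataframe and its column names are hypothetical and only illustrate the idea; the real analysis further down works with name-based gender probabilities rather than hard labels.

```python
import pandas as pd

# Hypothetical toy data: one row per paper, hard gender labels only.
toy = pd.DataFrame({
    "first": ["female", "male", "female", "male", "male", "female"],
    "last":  ["female", "female", "male", "male", "male", "female"],
})

# n_b: number of papers whose last author is female (event B).
female_last = toy[toy["last"] == "female"]
n_b = len(female_last)

# Among those papers, count the ones whose first author is also female (A and B).
n_ab = (female_last["first"] == "female").sum()

# Empirical estimate of P(A | B) as a ratio of counts.
p_a_given_b = n_ab / n_b
print(f"n_b = {n_b}, n_ab = {n_ab}, estimated P(A|B) = {p_a_given_b:.3f}")
```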

We have already estimated $P(A)$, $P(B)$, $P(Y)$ and $P(Z)$ earlier:

$$ P(A) = 0.2767 $$
$$ P(B) = 0.1908 $$
$$ P(Y) = 0.7233 $$
$$ P(Z) = 0.8092 $$
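
As a sanity check on the independence assumption, the short sketch below plugs the values quoted above into Bayes' theorem. It assumes $P(B | A) = P(B | Y) = P(B)$, which is what independence means here; in that case the formula should simply return $P(A)$. Any estimate from the data that differs from this baseline points to a dependence between the two positions.

```python
# Values quoted above (estimated earlier in the analysis).
p_a, p_y = 0.2767, 0.7233   # first author female / male
p_b = 0.1908                # last author female

# Assumption: under independence, P(B | A) = P(B | Y) = P(B).
p_b_given_a = p_b
p_b_given_y = p_b

# Bayes' theorem, with the law of total probability in the denominator.
evidence = p_b_given_a * p_a + p_b_given_y * p_y
p_a_given_b = p_b_given_a * p_a / evidence

print(f"P(A|B) under independence: {p_a_given_b:.4f}  (P(A) = {p_a})")
```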


In [None]:
p_a = 0.2767; p_b = 0.1908; p_y = 0.7233; p_z = 0.8092

In [None]:
# Define local paths

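root = os.getcwd()  # current working directory (same as the IPython shell call `! pwd`)

RAW_DIR = os.path.join(root, "author_allgenders", "")  # keeps the trailing separator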

if not os.path.exists(RAW_DIR):
    print("The directory {} does not exist.\nThere is no raw data for statistical analysis.".format(RAW_DIR))

In [None]:
df = read_jsons(RAW_DIR, columns=['journal', 'all_names', 'all_genders', 'all_percent', 'year'])  # read_jsons also handles the cleanup, the IF, and the removal of 2021
df

### Create new columns in the dataframe extracting useful information from list of coauthors

In [None]:
# Number of authors:

df['Number_authors'] = df['all_genders'].apply(lambda x: len(x)) #take the length of the list all_genders
df['Number_init'] = df['all_genders'].apply(lambda x: len([s for s in x if "init"==s]))


# First author's gender and percentage:

df['First_Author_gend'] = df['all_genders'].apply(lambda x: x[0]) #take the first element of the list all_genders
df['First_Author_perc'] = df['all_percent'].apply(lambda x: x[0])

# Last author's gender and percentage:

df['Last_Author_gend'] = df['all_genders'].apply(lambda x: x[-1]) #take the last element of the list all_genders
df['Last_Author_perc'] = df['all_percent'].apply(lambda x: x[-1])

### Dropping init (names given only as initials, gender unidentified)

In [None]:
df = df[df.Number_init==0].copy()

#### It is easier if all the probabilities refer to the same gender (female)

In [None]:
# prob(female) = 1 - prob(male)

# Prob last author female:

df['Last_Author_probF'] = df['Last_Author_perc']
df.loc[df['Last_Author_gend'] == 'male','Last_Author_probF'] = \
    1 - df.loc[df['Last_Author_gend'] == 'male','Last_Author_probF']

# Prob first author female:

df['First_Author_probF'] = df['First_Author_perc']
df.loc[df['First_Author_gend'] == 'male','First_Author_probF'] = \
    1 - df.loc[df['First_Author_gend'] == 'male','First_Author_probF']
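
The same conversion can be written in one step with `np.where`. This is only an equivalent sketch on hypothetical toy rows (`toy`, `gend`, `perc` and `probF` are illustrative names), assuming, as in the cell above, that the stored value is the confidence in the assigned label on a 0 to 1 scale.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; 'perc' is the confidence in the assigned label.
toy = pd.DataFrame({
    "gend": ["female", "male", "male"],
    "perc": [0.90, 0.80, 0.55],
})

# P(female) is 'perc' for a female label and 1 - 'perc' for a male label.
toy["probF"] = np.where(toy["gend"] == "male", 1 - toy["perc"], toy["perc"])
print(toy)
```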


In [None]:
p_ff = df['First_Author_probF'].sum()/df.shape[0]
p_mf = (1 - df['First_Author_probF']).sum()/df.shape[0]
p_fl = df['Last_Author_probF'].sum()/df.shape[0]
p_ml = (1 - df['Last_Author_probF']).sum()/df.shape[0]


print('Probability of having a female first author:', p_ff)
print('Probability of having a male first author:', p_mf)

print('Probability of having a female last author:', p_fl)
print('Probability of having a male last author:', p_ml)

In [None]:
# remove papers that have only one author for this analysis
df = df[df.Number_authors > 1].copy()


In [None]:
#Define functions to multiply probabilities in each row

def Prob_intersect(x,y, gender_first, gender_last):
    if x[0] == gender_first:
        prod = float(y[0])
    else:
        prod = 1 - float(y[0])
        
    if x[-1] == gender_last:
        prod *= float(y[-1])
    else:
        prod *= 1 - float(y[-1])
    return prod

# Create corresponding columns:

df['Prob_FF'] = df.apply(lambda x: Prob_intersect(x.all_genders, x.all_percent, 
                                                               "female", "female"), axis=1)
df['Prob_FM'] = df.apply(lambda x: Prob_intersect(x.all_genders, x.all_percent, 
                                                               "female", "male"), axis=1)
df['Prob_MF'] = df.apply(lambda x: Prob_intersect(x.all_genders, x.all_percent, 
                                                               "male", "female"), axis=1)
df['Prob_MM'] = df.apply(lambda x: Prob_intersect(x.all_genders, x.all_percent, 
                                                               "male", "male"), axis=1)

print('Prob first female cond on last female:', df['Prob_FF'].mean()/df['Last_Author_probF'].mean())
print('Prob last female cond on first female:', df['Prob_FF'].mean()/df['First_Author_probF'].mean())
print('Prob first female cond on last male:', df['Prob_FM'].mean()/(1 - df['Last_Author_probF']).mean())
print('Prob last female cond on first male:', df['Prob_MF'].mean()/(1 - df['First_Author_probF']).mean())
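
The ratios printed above are soft-count estimates: the mean joint probability divided by the mean probability of the conditioning event. A minimal sketch on hypothetical toy values (the column names are illustrative) shows the same computation, assuming the name-based guesses for the two authors of a paper are independent of each other.

```python
import pandas as pd

# Hypothetical toy data: P(female) for the first and last author of each paper.
toy = pd.DataFrame({
    "first_probF": [0.9, 0.1, 0.8, 0.2],
    "last_probF":  [0.7, 0.2, 0.1, 0.9],
})

# Per-paper joint probability that both the first and the last author are female.
toy["prob_FF"] = toy["first_probF"] * toy["last_probF"]

# Expected share of (female, female) papers divided by the expected share of
# papers with a female last author.
p_first_f_given_last_f = toy["prob_FF"].mean() / toy["last_probF"].mean()
print(round(p_first_f_given_last_f, 4))
```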


In [None]:
sns.set(style="ticks")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.5})

# Let's plot. Bars: First female when female last author, when  male last author
labels = ['Last female', 'Last male']

plt.figure(figsize=(6, 3))
to_plot = pd.DataFrame()
to_plot["gender"] = [0, 1]
to_plot["p_female"] = [df['Prob_FF'].mean()/df['Last_Author_probF'].mean(), \
                       df['Prob_FM'].mean()/(1 - df['Last_Author_probF']).mean()]
splot = sns.barplot(y="gender", x="p_female", data=to_plot, color="rebeccapurple", dodge=False, orient="h")

sns.despine()
plt.ylabel("")
plt.yticks([0, 1], labels)
plt.xlabel("Probability first author female")


#ax.grid(True)
#plt.axvline(x=p_ff, color="k", linestyle=":", linewidth=2)
plt.grid(alpha=0.5)
plt.xlim([0, 0.54])

plt.tight_layout()
plt.savefig("./Figures/conditional_on_last.pdf", dpi=300, bbox_inches="tight")
#plt.savefig("./Figures/female_senior_authors_increase_female_junior.jpg")
plt.show()

In [None]:
#bias
to_plot["p_female"][0] - to_plot["p_female"][1]

In [None]:
sns.set(style="ticks")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.5})

# Let's plot. Bars: Last female when female first author, when male first author
labels = ['First female', 'First male']

plt.figure(figsize=(6, 3))
to_plot = pd.DataFrame()
to_plot["gender"] = [0, 1]
to_plot["p_female"] = [df['Prob_FF'].mean()/df['First_Author_probF'].mean(), \
                       df['Prob_MF'].mean()/(1 - df['First_Author_probF']).mean()]
splot = sns.barplot(y="gender", x="p_female", data=to_plot, color="rebeccapurple", dodge=False, orient="h")

sns.despine()
plt.ylabel("")
plt.yticks([0, 1], labels)
plt.xlabel("Probability last author female")


#ax.grid(True)
#plt.axvline(x=p_fl, color="k", linestyle=":", linewidth=2)
plt.grid(alpha=0.5)

plt.xlim([0, 0.54])

plt.tight_layout()
plt.savefig("./Figures/conditional_on_first.pdf", dpi=300, bbox_inches="tight")
#plt.savefig("./Figures/female_senior_authors_increase_female_junior.png", dpi=450, bbox_inches="tight")
#plt.savefig("./Figures/female_senior_authors_increase_female_junior.jpg")
plt.show()

In [None]:
# bias

to_plot["p_female"][0] - to_plot["p_female"][1]

In [None]:
# keep only papers with more than two authors (at least one middle author) for this analysis
df = df[df.Number_authors > 2].copy()


In [None]:
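# Probability that at least one middle author has gender `gender_atleast`
# and that the last author has gender `gender_last`

def Prob_atLeastintersect(x, y, gender_atleast, gender_other, gender_last):
    # prod accumulates P(every middle author has gender_other)
    prod = 1
    for i, elem in enumerate(x):
        if 0 < i < len(x) - 1:  # middle authors only
            if elem == gender_other:
                prod *= float(y[i])
            elif elem == gender_atleast:
                prod *= 1 - float(y[i])

    # (1 - prod) = P(at least one middle author has gender_atleast);
    # multiply by P(last author has gender_last)
    if x[-1] == gender_last:
        prod = (1 - prod) * float(y[-1])
    else:
        prod = (1 - prod) * (1 - float(y[-1]))

    return prod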


# Create corresponding columns:

df['Prob_1FF'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'female',
                                                               "male", "female"), axis=1)
df['Prob_1FM'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'female',
                                                               "male", "male"), axis=1)

df['Prob_1MF'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'male',
                                                               "female", "female"), axis=1)
df['Prob_1MM'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'male',
                                                               "female", "male"), axis=1)

print('Prob at least one female cond on last female:', df['Prob_1FF'].mean()/df['Last_Author_probF'].mean())
print('Prob at least one female cond on last male:', df['Prob_1FM'].mean()/(1 - df['Last_Author_probF']).mean())
print('Prob at least one male cond on last female:', df['Prob_1MF'].mean()/df['Last_Author_probF'].mean())
print('Prob at least one male cond on last male:', df['Prob_1MM'].mean()/(1 - df['Last_Author_probF']).mean())

In [None]:
sns.set(style="ticks")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.5})

# Let's plot. Bars: at least one female middle author when female last author, when male last author
labels = ['Last female', 'Last male']

plt.figure(figsize=(6, 3))
to_plot = pd.DataFrame()
to_plot["gender"] = [0, 1]
to_plot["p_female"] = [df['Prob_1FF'].mean()/df['Last_Author_probF'].mean(), \
                       df['Prob_1FM'].mean()/(1 - df['Last_Author_probF']).mean()]
splot = sns.barplot(y="gender", x="p_female", data=to_plot, color="rebeccapurple", dodge=False, orient="h")

sns.despine()
plt.ylabel("")
plt.yticks([0, 1], labels)
plt.xlabel("Probability at least one coauthor female")

plt.grid(alpha=0.5)
plt.xlim([0, 0.54])

plt.tight_layout()
plt.savefig("./Figures/conditionalAtleastone_on_last.pdf", dpi=300, bbox_inches="tight")

In [None]:
# bias

to_plot["p_female"][0] - to_plot["p_female"][1]

In [None]:
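# Redefine the function so that it conditions on the first author instead of the last

def Prob_atLeastintersect(x, y, gender_atleast, gender_other, gender_first):
    # prod accumulates P(every middle author has gender_other)
    prod = 1
    for i, elem in enumerate(x):
        if 0 < i < len(x) - 1:  # middle authors only
            if elem == gender_other:
                prod *= float(y[i])
            elif elem == gender_atleast:
                prod *= 1 - float(y[i])

    # (1 - prod) = P(at least one middle author has gender_atleast);
    # multiply by P(first author has gender_first)
    if x[0] == gender_first:
        prod = (1 - prod) * float(y[0])
    else:
        prod = (1 - prod) * (1 - float(y[0]))

    return prod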


# Create corresponding columns:

df['Prob_F1F'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'female',
                                                               "male", "female"), axis=1)
df['Prob_M1F'] = df.apply(lambda x: Prob_atLeastintersect(x.all_genders, x.all_percent, 'female',
                                                               "male", "male"), axis=1)


# Use the joint probabilities with the FIRST author (Prob_F1F, Prob_M1F), not the last
print('Prob at least one female cond on first female:', df['Prob_F1F'].mean()/df['First_Author_probF'].mean())
print('Prob at least one female cond on first male:', df['Prob_M1F'].mean()/(1 - df['First_Author_probF']).mean())

In [None]:
sns.set(style="ticks")
sns.set_context("notebook", font_scale=1.3, rc={"lines.linewidth": 2.5})

# Let's plot. Bars: at least one female middle author when female first author, when male first author
labels = ['First female', 'First male']

plt.figure(figsize=(6, 3))
to_plot = pd.DataFrame()
to_plot["gender"] = [0, 1]
to_plot["p_female"] = [df['Prob_F1F'].mean()/df['First_Author_probF'].mean(), \
                       df['Prob_M1F'].mean()/(1 - df['First_Author_probF']).mean()]
splot = sns.barplot(y="gender", x="p_female", data=to_plot, color="rebeccapurple", dodge=False, orient="h")

sns.despine()
plt.ylabel("")
plt.yticks([0, 1], labels)
plt.xlabel("Probability at least one coauthor female")

plt.grid(alpha=0.5)
plt.xlim([0, 0.54])

plt.tight_layout()
plt.savefig("./Figures/conditionalAtleastone_on_first.pdf", dpi=300, bbox_inches="tight")

In [None]:
# bias

to_plot["p_female"][0] - to_plot["p_female"][1]