### This notebook looks at the number of papers each individual author publishes.

It compares how likely female and male authors are to be among the top 10, 30, 100, ... most productive authors (for both first and last authors).

The input data are the records in the directory author_allgenders.

We can conclude here that men are generally more strongly represented among highly productive authors: the overall share of women among first authors, for example, is around 25 %, but it is only around 15 % among the most prolific authors.
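
The comparison behind this statement can be sketched on a small example. The table below is hypothetical toy data (the column names `author`, `num_papers` and `gender` and the helper `female_share_top_n` are assumptions for illustration, not the notebook's actual variables); it only shows how an overall share is set against the share among the N most productive authors.

```python
import pandas as pd

# Hypothetical toy data: one row per author with a paper count and a binary gender label.
toy_authors = pd.DataFrame({
    "author": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "num_papers": [12, 9, 7, 3, 2, 2, 1, 1],
    "gender": ["male", "male", "female", "male", "female", "male", "female", "male"],
})

def female_share_top_n(authors, n):
    """Share of female authors among the n authors with the most papers."""
    top = authors.sort_values("num_papers", ascending=False).head(n)
    return (top["gender"] == "female").mean()

overall_share = (toy_authors["gender"] == "female").mean()
top4_share = female_share_top_n(toy_authors, 4)
print(f"overall: {overall_share:.3f}, top 4: {top4_share:.3f}")
```

In this toy table the share among the top 4 is lower than the overall share, which is the kind of gap the notebook looks for in the real records.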

That this pattern appears both for first and for last authors means that there are structural factors that hold female seismologists back from publishing as much as their male colleagues.

In the case of the first authors, this may be an illustration of the leaky pipeline. The frequency of female authors stabilizes where the number of papers per author over the past ten years reaches 2. One may hypothesize that many of these are papers by graduate students who left academia after their PhD. This would indicate that the share of female seismologists in permanent positions is well below 25 %.

For the last authors, it should be taken into account that the number of female faculty is rising. It is therefore likely that female seismologists are more frequently young faculty who took up their appointments in recent years and have not yet had the chance to publish as many senior-author papers as their male peers. However, there are thousands of last-authored papers by female seismologists who have published only one paper during the past ten years. These are unlikely to be senior-authored works by supervising female faculty, and more likely to be contributions where the last author is simply a contributor.

The histograms at the bottom are not really relevant; they don't seem to show much that is interesting. They were just an experiment.

In [None]:
# install the following packages in the environment:
# python3 -m pip install pandas
# python3 -m pip install seaborn

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import numpy as np
import json

import os

from read_jsondata import read_jsons

import time

In [None]:
# Define local paths

# os.getcwd() also works outside IPython, unlike the shell escape "! pwd"
root = os.getcwd()

RAW_DIR = os.path.join(root, "author_allgenders") + "/"

if not os.path.exists(RAW_DIR):
    print("The directory {} does not exist.\nThere is no raw data for statistical analysis.".format(RAW_DIR))

#### READ DATAFRAME

In [None]:
df = read_jsons(RAW_DIR, columns=['journal','all_names', 'all_genders','all_percent','year'])
df

##### Clean data

In [None]:
# clean some journal names

df.loc[df.journal=='E%26PSL','journal'] = 'EPSL'

df.loc[df.journal.str.contains("Bulletin"),'journal'] = 'BSSA'

df.loc[df.journal.str.contains("Seismological"),'journal'] = 'SRL'


df

In [None]:
# Include impact factor:

dict_IF = {'Nature': 46.486, 'Science': 41.845, 'NatureGeoscience': 16.103, 'EPSL': 4.823, 'GRL': 4.952, 
        'JGRSolidEarth': 4.191, 'G3': 3.721, 'SRL': 3.131, 'Tectp': 3.048, 'SolidEarth': 2.921, 
       'GEOPHYSICS': 3.093, 'GJI': 2.834, 'BSSA': 2.274, 'PEPI': 2.413}

df['IF'] = df['journal'].map(dict_IF)
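
A side effect of `Series.map` with a dictionary is that any journal name without a matching key gets `NaN` as its impact factor. The following is a minimal sketch for spotting such names, using a hypothetical two-entry lookup and a toy `journal` column rather than the notebook's data:

```python
import pandas as pd

# Hypothetical two-entry lookup and toy journal column (not the notebook's data).
toy_if = {"GRL": 4.952, "GJI": 2.834}
toy = pd.DataFrame({"journal": ["GRL", "GJI", "Geophys. J. Int.", "GRL"]})

toy["IF"] = toy["journal"].map(toy_if)

# Journal names with no key in the lookup end up with IF = NaN.
unmapped = sorted(toy.loc[toy["IF"].isna(), "journal"].unique())
print(unmapped)
```

Running the same check on the real `df` after the mapping would show whether any journal names still need cleaning.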

In [None]:
## Remove rows for papers from 2021

df = df[~df['year'].isin(['2021'])].copy()
df

##### Create new columns for statistics

In [None]:
# First author's gender and percentage:

df['First_Author'] = df['all_names'].apply(lambda x: x[0]) # take the first element of the list all_names
df['First_Author_gend'] = df['all_genders'].apply(lambda x: x[0]) # take the first element of the list all_genders
df['First_Author_gendprob'] = df['all_percent'].apply(lambda x: x[0]) # take the first element of the list all_percent



# Last author's gender and percentage:

df['Last_Author'] = df['all_names'].apply(lambda x: x[-1]) # take the last element of the list all_names
df['Last_Author_gend'] = df['all_genders'].apply(lambda x: x[-1]) # take the last element of the list all_genders
df['Last_Author_gendprob'] = df['all_percent'].apply(lambda x: x[-1]) # take the last element of the list all_percent

df

##### Clean names just in case

In [None]:
def Clean_names(x):
    """Reduce a full name to 'first last', dropping middle names and initials."""
    parts = x.split()
    first_name = parts[0]
    last_name = parts[-1]

    name = first_name + ' ' + last_name

    return name


df['First_Author_clean'] = df['First_Author'].apply(Clean_names)
df['Last_Author_clean'] = df['Last_Author'].apply(Clean_names)

df.drop(columns=['First_Author', 'Last_Author'],inplace = True)
df
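
To see what this cleaning step does to different name spellings, here is a small standalone sketch. The function `clean_name` mirrors the logic of `Clean_names` above, and the example names are made up for illustration.

```python
def clean_name(full_name):
    """Keep only the first and last whitespace-separated token of a name."""
    tokens = full_name.split()
    return tokens[0] + " " + tokens[-1]

# Made-up example names covering a middle initial, extra spaces,
# an abbreviated first name and a single-token name.
examples = ["Jane Q. Doe", "Jane   Doe", "J. Doe", "Doe"]
cleaned = [clean_name(name) for name in examples]
print(cleaned)
```

Two properties are worth keeping in mind: spellings that differ only in middle names collapse to the same key, while an abbreviated first name stays a separate key, and a single-token name is duplicated because the first and last token are the same.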

In [None]:
### Count number of papers for each author and create dictionary

dict_last = df.Last_Author_clean.value_counts().to_dict()
dict_first = df.First_Author_clean.value_counts().to_dict()

In [None]:
### Create new dataframes, one for first authors and another one for last authors

df_first = df[['First_Author_clean','First_Author_gend','First_Author_gendprob']].copy()
df_last = df[['Last_Author_clean','Last_Author_gend','Last_Author_gendprob']].copy()

df_first['Num_papers'] = df_first.First_Author_clean.map(dict_first) # create new column with number of papers
df_last['Num_papers'] = df_last.Last_Author_clean.map(dict_last)

In [None]:
### Drop duplicated names and sort in descending order of Num_papers

df_first2 = df_first.drop_duplicates('First_Author_clean').sort_values(by=['Num_papers'],ascending=False).reset_index(drop = True)

df_last2 = df_last.drop_duplicates('Last_Author_clean').sort_values(by=['Num_papers'],ascending=False).reset_index(drop = True)
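
The counting and deduplication steps can be checked on a toy table. The sketch below uses hypothetical data and column names (`first_author`, `gender`, `num_papers`); it follows the same `value_counts` plus `map` pattern and compares it with a `groupby` that yields the same counts.

```python
import pandas as pd

# Hypothetical toy table: one row per paper.
papers = pd.DataFrame({
    "first_author": ["A Smith", "B Jones", "A Smith", "C Lee", "A Smith", "B Jones"],
    "gender": ["female", "male", "female", "male", "female", "male"],
})

# Count papers per author and attach the count to every paper row.
counts = papers["first_author"].value_counts()
papers["num_papers"] = papers["first_author"].map(counts)

# Keep one row per author, most productive first.
per_author = (papers.drop_duplicates("first_author")
                    .sort_values("num_papers", ascending=False)
                    .reset_index(drop=True))

# Alternative: a single groupby gives the same counts.
alt = papers.groupby("first_author").size().sort_values(ascending=False)
print(per_author)
```

Note that `drop_duplicates` keeps the gender label of the first row it sees for each name, so two different people who share a name are merged into one author with a single label.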

In [None]:
## It is easier to have all probabilities with respect to female

# prob(female) = 1 - prob(male)

# Prob first author female:

df_first2['First_Author_probF'] = df_first2['First_Author_gendprob']

df_first2.loc[df_first2['First_Author_gend'] == 'male','First_Author_probF'] = \
    1 - df_first2.loc[df_first2['First_Author_gend'] == 'male','First_Author_probF']

# Prob last author female:

df_last2['Last_Author_probF'] = df_last2['Last_Author_gendprob']

df_last2.loc[df_last2['Last_Author_gend'] == 'male','Last_Author_probF'] = \
    1 - df_last2.loc[df_last2['Last_Author_gend'] == 'male','Last_Author_probF']
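
The same conversion can be written in one step with `numpy.where`. This is a minimal sketch on toy data, assuming (as the cell above does) that the stored probability refers to the assigned gender and lies between 0 and 1; the column names `gend`, `gendprob` and `probF` are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data: assigned gender and the probability of that assignment.
toy = pd.DataFrame({
    "gend": ["female", "male", "male", "female"],
    "gendprob": [0.98, 0.90, 0.60, 0.55],
})

# Keep the stored probability for non-male rows; take 1 - p for 'male' rows.
toy["probF"] = np.where(toy["gend"] == "male", 1.0 - toy["gendprob"], toy["gendprob"])
print(toy)
```

As in the cell above, only rows labelled 'male' are flipped; any other label keeps its stored probability unchanged.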


In [None]:
# Mean female probability among the N most productive first authors.
# .loc slices include the end label, so .iloc[:n] is used to select exactly n rows.
print('Probability female on top 30 first authors', df_first2['First_Author_probF'].iloc[:30].sum()/30)
print('Probability female on top 10 first authors', df_first2['First_Author_probF'].iloc[:10].sum()/10)
print("\n")
# NOTE: Xin Liu is misclassified. I think he is a postdoc at Stanford, and he is male.
# There is another Xin Liu in seismology in China, also a man. Their articles appear combined because their names are identical.
for n in [10, 20, 30, 40, 50, 100, 1000, 2000, 10000]:
    print('Probability female on top {} first authors when binary gender'.format(n),
          df_first2['First_Author_probF'].iloc[:n].round().sum()/n)
df_first2.iloc[0:30]

prob_f_in_n = []
prob_m_in_n = []
ns = np.arange(50, 5000, 10)
ps_per_a = []
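for n in ns:
    top_n = df_first2['First_Author_probF'].iloc[:n]  # exactly the n most productive authors
    prob_f_in_n.append(top_n.round().sum()/n)
    # the 0.001 offset pushes a probability of exactly 0.5 to the male side when rounding
    prob_m_in_n.append((1. - top_n + 0.001).round().sum()/n)
    ps_per_a.append(df_first2['Num_papers'].iloc[n-1])  # papers by the n-th most productive author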
# plt.plot(ns, prob_f_in_n)
# plt.plot(ns, prob_m_in_n)
# plt.ylabel("Frequency of gender of first author\n out of N most productive authors")
# plt.xlabel("Average first-authored papers")
# plt.legend(["Female", "Male"])
# plt.title("Highly productive authors are more commonly male")
# plt.grid()
# plt.xticks([50, 1050, 2050, 3050, 4050],
#            [ps_per_a[0], ps_per_a[100], ps_per_a[200],
#             ps_per_a[300], ps_per_a[400]])
# plt.show()


plt.plot(ns, np.array(prob_f_in_n) / (np.array(prob_f_in_n) + np.array(prob_m_in_n)))
plt.ylabel("Frequency of female first authors\n among N most productive first authors")
plt.xlabel("Number of first-author papers M by the N most productive authors")
#plt.legend(["Female", "Male"])
plt.title("Highly productive first authors are more commonly male")
plt.grid()
# plt.xticks([50, 1050, 2050, 3050, 4050],
#            [ps_per_a[0], ps_per_a[100], ps_per_a[200],
#             ps_per_a[300], ps_per_a[400]])
plt.xticks([50, 1050, 2050, 3050, 4050,],
           ["N={}\nM={}".format(50, ps_per_a[0]),
            "N={}\nM={}".format(1050, ps_per_a[100]),
            "N={}\nM={}".format(2050, ps_per_a[200]),
            "N={}\nM={}".format(3050, ps_per_a[300]),
            "N={}\nM={}".format(4050, ps_per_a[400]),
            ])
plt.savefig("most_productive_first.png", dpi=300)
plt.show()

prob_f_in_n = []
prob_m_in_n = []
ns = np.arange(50, 7000, 10)
ps_per_a = []
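for n in ns:
    top_n = df_last2['Last_Author_probF'].iloc[:n]  # exactly the n most productive authors
    prob_f_in_n.append(top_n.round().sum()/n)
    # the 0.001 offset pushes a probability of exactly 0.5 to the male side when rounding
    prob_m_in_n.append((1. - top_n + 0.001).round().sum()/n)
    ps_per_a.append(df_last2['Num_papers'].iloc[n-1])  # papers by the n-th most productive author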
plt.plot(ns, prob_f_in_n)
plt.plot(ns, prob_m_in_n)
plt.ylabel("Frequency of gender of last author\n out of N most productive authors")
plt.xlabel("Number of last-authored papers")
plt.legend(["Female", "Male"])
plt.title("Senior authors are more commonly male")
plt.grid()
plt.xticks([50, 1050, 2050, 3050, 4050],
           [ps_per_a[0], ps_per_a[100], ps_per_a[200],
            ps_per_a[300], ps_per_a[400]])
plt.show()

plt.plot(ns, np.array(prob_f_in_n) / (np.array(prob_f_in_n) + np.array(prob_m_in_n)))
plt.ylabel("Frequency of female last authors\n among N most productive last authors")
plt.xlabel("The N most productive authors with M papers each")
#plt.legend(["Female", "Male"])
plt.title("Senior authors are more commonly male")
plt.grid()
plt.xticks([50, 1050, 2050, 3050, 4050, 5050 , 6050],
           ["N={}\nM={}".format(50, ps_per_a[0]),
            "N={}\nM={}".format(1050, ps_per_a[100]),
            "N={}\nM={}".format(2050, ps_per_a[200]),
            "N={}\nM={}".format(3050, ps_per_a[300]),
            "N={}\nM={}".format(4050, ps_per_a[400]),
            "N={}\nM={}".format(5050, ps_per_a[500]),
            "N={}\nM={}".format(6050, ps_per_a[600]),
            ])
plt.savefig("most_productive_last.png", dpi=300)

plt.show()


In [None]:
print('Probability female on top 30 last authors', df_last2.loc[0:29,'Last_Author_probF'].sum()/30)
print('Probability female on top 10 last authors', df_last2.loc[0:9,'Last_Author_probF'].sum()/10)


df_last2.iloc[0:30,:]
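
The slices above use `.loc[0:29]` to get 30 rows because label-based slicing in pandas includes both endpoints, whereas position-based slicing with `.iloc` excludes the end. A small sketch on a toy Series with the default integer index shows the difference:

```python
import pandas as pd

s = pd.Series([0.9, 0.1, 0.2, 0.8, 0.3])  # default index 0..4

by_label = s.loc[0:3]      # label-based: both endpoints included
by_position = s.iloc[0:3]  # position-based: end excluded
print(len(by_label), len(by_position))
```

When a "top N" average divides by N, the slice has to contain exactly N rows, so either `.loc[0:N-1]` (with a freshly reset index) or `.iloc[:N]` is needed.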

### just for fun let's check some histograms

In [None]:
plt.hist(df_first2.Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("All First Authors")
plt.show()
plt.hist(df_first2[df_first2.First_Author_probF > 0.5].Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("Likely Female First Authors")
plt.show()
plt.hist(df_first2[df_first2.First_Author_probF < 0.5].Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("Likely Male First Authors")
plt.show()

In [None]:
plt.hist(df_last2.Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("All Last Authors")
plt.show()
plt.hist(df_last2[df_last2.Last_Author_probF > 0.5].Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("Likely Female Last Authors")
plt.show()
plt.hist(df_last2[df_last2.Last_Author_probF < 0.5].Num_papers, alpha=0.5, bins=range(1, 12))
plt.xlim(1, 15)
#plt.ylim(0, 1000)
plt.title("Likely Male Last Authors")
plt.show()

### --- ok nothing strange with the histograms