## Do women ask more questions at a seminar if a woman asks the 1st question?

Prompted by a tweet from Duncan Green referring to a post on his blog:
http://oxfamblogs.org/fp2p/how-to-stop-men-asking-all-the-questions-in-seminars-its-really-easy/

Original source for data:
"Women’s visibility in academic seminars: women ask fewer questions than
men," Alecia Carter, Alyssa Croft, Dieter Lukas, Gillian Sandstrom

RTB wrote it, 14 December 2017. RTB revised for TAD answers; original name was sexQuestionDemo.m. Translated to Python by RTB and EB, August 2021.

Concepts covered:
1. histograms for summarizing data
2. difference between 'sig' and 'not sig' is not itself nec. stat. sig.
3. d-prime as a measure of effect size
4. simulating data under H0
5. detecting and removing sorting biases
6. extracting p-values from simulations

## Set up the environment


In [None]:
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats
sns.set(style = 'white', font_scale = 1.2)
print("Setup Complete")

## Load data

Each datum represents a value derived from one academic seminar. Values
are the percentage of questions from women minus the percentage of seminar
attendees who were women. Positive values indicate that women asked more
questions than their share of the audience would predict; negative values
indicate that men asked more than their share. The two variables are the
values for each seminar when a woman asked the 1st question
(`woman_first`) vs. when a man asked the 1st question (`man_first`).

We then have these as columns in a pandas dataframe `data`. Note that we don't see that many benefits from using pandas with just two columns and the analyses we do here, but we will use it to familiarize ourselves with using pandas. Also (for those who attended the primer), this isn't tidy data! To make it tidy, we should have each row be a seminar and two columns, one with the percentage values, and one that represents whether a man or woman asked the first question. 

NOTE: I reverse-engineered the raw data based on the graphic in the piece
by The Economist, so the numbers might not be exactly correct.

In [None]:
woman_first = np.array([60,44,36,36,28,28,28,28,28,24,20,16,16,12,8,8,8,8,8,8,8,8,4,4,4,4,4,4,0,0,0,0,0,0,0,-4,-4,-4,-4,-4,-4,-8,-8,-8,-8,-8,-12,-12,-12,-12,-12,-12,-12,-12,-12,-16,-16,-16,-16,-16,-16,-16,-16,-16,-20,-20,-20,-20,-20,-20,-20,-20,-24,-24,-28,-32,-36,-40])

man_first = np.array([24,24,20,20,16,16,16,16,16,16,16,12,8,4,4,4,0,0,0,0,0,-4,-4,-4,-4,-4,-4,-4,-4,-4,-8,-8,-8,-8,-8,-8,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-12,-16,-16,-16,-16,-20,-20,-20,-20,-20,-20,-20,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-24,-28,-28,-28,-28,-28,-28,-28,-28,-28,-32,-32,-32,-32,-32,-32,-32,-32,-32,-32,-36,-36,-36,-36,-36,-36,-36,-36,-36,-36,-40,-40,-40,-40,-40,-40,-40,-40,-40,-40,-40,-44,-44,-44,-44,-44,-44,-44,-44,-44,-44,-44,-44,-48,-48,-48,-48,-48,-48,-48,-48,-52,-52,-52,-52,-52,-52,-52,-56,-56,-56,-56,-56,-56,-56,-56,-60,-60,-60,-60,-64,-64,-64,-64,-68,-72,-72,-76]).astype('int')

# Create pandas dataframe
data = pd.DataFrame({'man_first': pd.Series(man_first),
                     'woman_first': pd.Series(woman_first)})

# Show top 5 rows of data
data.head()
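
The tidy layout described above can be sketched with `DataFrame.melt`. The toy values and the column names `first_asker` and `pct_diff` are illustrative assumptions, not part of the original analysis:

```python
import pandas as pd

# Toy wide-format table (hypothetical values); the shorter column is padded with NaN
wide = pd.DataFrame({
    'man_first': pd.Series([24, -12, -40]),
    'woman_first': pd.Series([60, 8]),
})

# One row per seminar: who asked the 1st question, and the percentage-point difference
tidy = (
    wide.melt(var_name='first_asker', value_name='pct_diff')
        .dropna(subset=['pct_diff'])   # drop the NaN padding rows
        .reset_index(drop=True)
)

print(tidy)
```

In this form, each seminar is one observation, and grouping by `first_asker` replaces indexing by column name.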

Note that we have a different number of data points for `woman_first` vs `man_first`, as you can see below. Because `woman_first` has fewer data points, the extra rows in that column are filled with NaN values. We can use `data['woman_first'].dropna()` to get all the non-NaN values.

In [None]:
# Count number of non-NaN values in each column
data.count()

In [None]:
# A few useful numbers
num_woman_first = data['woman_first'].count()
num_man_first = data['man_first'].count()
num_total = num_woman_first + num_man_first

## Plot data

In [None]:
# Plot histogram of data
fig, ax = plt.subplots(1, 1, figsize = (10, 5))
ax = sns.histplot(data, binwidth = 4, ax = ax)

# Set labels/title
ax.set( xlabel = 'Percentage of questions asked by women minus \n percentage of attendees who were women (% points)', 
        ylabel = 'Number of seminars',
        title = 'University seminars, relative share of questions asked by women');

# Set legend
plt.legend(['Woman 1st', 'Man 1st']);

## Do women tend to ask more questions when a woman asks the first question?

NOTE: Here is the original tweet from Duncan Green:
In academic seminars, ‘Men are > 2.5 times more likely to pose questions to the speakers. This male skew was observable only in those seminars in which a man asked first question. When a woman did so, gender split disappeared’. CHAIRS PLEASE NOTE – FIRST Q TO A WOMAN – EVERY TIME.’

It was based on this story he had read in The Economist (https://www.economist.com/news/science-and-technology/21732082-there-easy-fix-women-ask-fewer-questions-men-seminars)
Print edition | Science and technology
December 7, 2017

"ONE theory to explain the low share of women in senior academic jobs is that they have less self-confidence than men. This hypothesis is supported by data in a new working paper, by a team of researchers from five universities in America and Europe. In this study, observers counted the attendees, and the questions they asked, at 247 departmental talks and seminars in biology, psychology and philosophy that took place at 35 universities in ten countries. On average, half of each seminar’s audience was female. Men, however, were over 2.5 times more likely to pose questions to the speakers—an action that may be viewed (rightly or wrongly) as a sign of greater competence.

This male skew in question-asking was observable, however, only in those seminars in which a man asked the first question. When a woman did so, the gender split in question-asking was, on average, proportional to that of the audience. Simply handing the microphone to a woman rather than a man when the floor is opened for questions may make a difference, however small, to one of academia’s most intractable problems."

This is a classic example of the fallacy that Gelman & Stern write about: The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant

In other words, what the blogger did was effectively two one-sample t-tests against 0. One showed that the man-first data are highly significantly different from 0; that is, men ask significantly more questions when a man asks the 1st Q. The other showed that the woman-first data are not significantly different from 0.

In [None]:
_, p_man_first = stats.ttest_1samp(data['man_first'].dropna(), popmean = 0)
print(f'The p-value of the test using data from man first seminars is {p_man_first:.2e}')

_, p_woman_first = stats.ttest_1samp(data['woman_first'].dropna(), popmean = 0)
print(f'The p-value of the test using data from woman first seminars is {p_woman_first:.2f}')

But this does not necessarily mean that the two groups are significantly different FROM EACH OTHER. This is a very common statistical error that is beautifully described in a classic paper:

Gelman A & Stern H (2006) "The Difference Between 'Significant' and 'Not Significant' is not Itself Statistically Significant", The American Statistician 60:328-331
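
To see how this error can arise, here is a minimal sketch using made-up effect estimates and standard errors for two independent studies (the numbers are purely illustrative, not taken from the seminar data). One estimate is significantly different from 0 and the other is not, yet the direct comparison between them is not significant:

```python
import numpy as np
from scipy import stats

# Illustrative (hypothetical) effect estimates and standard errors from two studies
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal."""
    return 2 * stats.norm.sf(abs(z))

p_a = two_sided_p(est_a / se_a)   # study A vs. 0
p_b = two_sided_p(est_b / se_b)   # study B vs. 0

# Direct comparison: difference of the estimates, with the
# independent standard errors added in quadrature
diff = est_a - est_b
se_diff = np.sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff / se_diff)

print(f'A vs. 0: p = {p_a:.3f}')
print(f'B vs. 0: p = {p_b:.3f}')
print(f'A vs. B: p = {p_diff:.3f}')
```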

To establish a difference between the two groups, you need to directly compare them. A simple way is with a 2-sample t-test:

In [None]:
t_statistic_two_sample, p_two_sample = stats.ttest_ind(data['woman_first'].dropna(), data['man_first'].dropna())
print(f'The t-statistic of the 2 sample test is {t_statistic_two_sample:.2f}')
print(f'The p-value of the 2 sample test is {p_two_sample:.2e}')

The above test assumes equal variances. We can test for this directly using a two-sample F-test for equal variances:

In [None]:
# Compute F statistic
F = data['woman_first'].var() / data['man_first'].var()

# Get number of data points
n_woman_first = data['woman_first'].count()
n_man_first = data['man_first'].count()

# Compute p value
p_value = 2 * min([stats.f.cdf(F, n_woman_first - 1 , n_man_first - 1), 1 - stats.f.cdf(F, n_woman_first - 1 , n_man_first - 1)])

print(f'The p-value is {p_value:.4f}')
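
If the F-test suggests unequal variances, one common alternative is Welch's t-test, which `scipy.stats.ttest_ind` performs when called with `equal_var=False`. A minimal sketch on synthetic samples (the group sizes, means, and spreads are assumptions for illustration, not the seminar data):

```python
import numpy as np
from scipy import stats

# Synthetic samples (hypothetical) with clearly different spreads and sizes
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=5.0, size=80)
group_b = rng.normal(loc=-5.0, scale=20.0, size=170)

# Student's t-test pools the two variances (scipy's default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test estimates each group's variance separately
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f'Pooled: t = {t_pooled:.2f}, p = {p_pooled:.2e}')
print(f'Welch:  t = {t_welch:.2f}, p = {p_welch:.2e}')
```

With the seminar data, the same call would take `data['woman_first'].dropna()` and `data['man_first'].dropna()` as its two arguments.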

Note that the t-statistic from the 2-sample t-test is also a kind of
effect size. It is normalized by a pooled estimate of the s.e.m., so it
will be larger than our d-prime, which is normalized by the pooled
estimate of the s.d.
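
To make that relationship concrete, here is a minimal sketch that assumes d-prime is the difference in means divided by the pooled standard deviation (the helper `d_prime` and the synthetic samples are hypothetical). Under that assumption, the equal-variance t-statistic is d-prime divided by sqrt(1/n1 + 1/n2):

```python
import numpy as np
from scipy import stats

def d_prime(x, y):
    """Difference in means divided by the pooled standard deviation (assumed definition)."""
    n_x, n_y = len(x), len(y)
    pooled_var = ((n_x - 1) * np.var(x, ddof=1)
                  + (n_y - 1) * np.var(y, ddof=1)) / (n_x + n_y - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Synthetic samples (hypothetical values, not the seminar data)
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=20.0, size=80)
y = rng.normal(loc=-25.0, scale=20.0, size=170)

d = d_prime(x, y)
t_stat, _ = stats.ttest_ind(x, y)  # equal-variance (pooled) t-test

# Dividing by the s.e.m. instead of the s.d. rescales d-prime by 1/sqrt(1/n_x + 1/n_y)
t_from_d = d / np.sqrt(1 / len(x) + 1 / len(y))

print(f'd-prime = {d:.2f}, t = {t_stat:.2f}, t from d-prime = {t_from_d:.2f}')
```

Unlike the t-statistic, d-prime does not grow with sample size, which is what makes it useful as a measure of effect size.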

If we suspect the data are not normally distributed, we can instead
compare the two groups with a non-parametric test based on the ranks of
the data. This is known as the Wilcoxon Rank-Sum Test (equivalent to the
Mann-Whitney U Test).


In [None]:
stats.ranksums(data['woman_first'].dropna(), data['man_first'].dropna())

In any case, there does appear to be a very real effect of who asks the
first question. So the conclusion is correct, even though it was
originally based on a faulty comparison.


## Bias produced by the way the data were sorted?
Note that there is a possible source of bias in the way the data have been stratified (a fancy statistical word for "sorted"). If there are a relatively small number of questions asked at any given seminar, then when you divide the data into two subsets where the sex of the 1st question asker is fixed, you create a bias in that direction. We can simulate this to see how big the bias effect is.
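
As a toy illustration of this bias, the sketch below enumerates every equally likely sequence of askers at a seminar with exactly 2 questions, assuming men and women are equally likely to ask each question. The 2-question case and the helper `mean_pct_women` are assumptions chosen for simplicity:

```python
from itertools import product

# Enumerate every equally likely sequence of askers at a 2-question seminar
# ('W' = woman, 'M' = man, each with probability 0.5)
sequences = list(product('WM', repeat=2))

def mean_pct_women(seqs):
    """Average percentage of questions asked by women across the given sequences."""
    return sum(100 * seq.count('W') / len(seq) for seq in seqs) / len(seqs)

# Sort the sequences by who asked the 1st question
woman_first_seqs = [s for s in sequences if s[0] == 'W']
man_first_seqs = [s for s in sequences if s[0] == 'M']

bias = mean_pct_women(woman_first_seqs) - mean_pct_women(man_first_seqs)

print(f'Woman first: {mean_pct_women(woman_first_seqs):.0f}%')
print(f'Man first:   {mean_pct_women(man_first_seqs):.0f}%')
print(f'Difference from sorting alone: {bias:.0f} percentage points')
```

Even though no one's behavior depends on the 1st question here, sorting by the 1st asker separates the two groups.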

Keys to the simulation:
1. Under what hypothesis should we perform our simulation? Ans.: H0
2. What is H0? Ans.: Men and women are equally likely to ask a question, regardless of who asks the 1st question.

How do we simulate this?

In [None]:
# Eventually, I would probably convert the script to a function, in which case, 
# the below would be variables passed to the function.

# 1. Number of questions asked at each seminar
num_q_per_seminar = 6

# 2. Number of seminars to simulate (249 in original study)
num_seminars = num_total

# 3. Number of simulations to run. When I am first coding up a simulation,
# I generally set this number to something smallish, like 1000. Then, when
# I am confident that it is running correctly, I increase the number to
# 10,000 or 100,000.
num_sims = 1000

**QUESTION (Q1)**: What is the effect size the authors actually obtained?

In [None]:
# Calculate the effect size the authors actually obtained
real_effect_size = ...

print(real_effect_size)

Below, we will set the random number generator seed to 12. This ensures that we will all get the exact same answer (provided we run the simulation the same number of times). You would ordinarily NOT do this. Why?

**Note: we use numpy, not pandas, below.**

In [None]:
# Set random seed
np.random.seed(12)

# Variables to hold the results of our simulations
all_effect_sizes = np.zeros((num_sims,))

# TODO: Simulate the experiment num_sims times under H0
for k in range(num_sims):

    # Simulate data: each row is a question, each column a seminar. Assume a value 
    # of '1' means a woman asked the question; '0' means a man asked it.
    ...

    # Sort according to who asked the 1st question
    ...

    # Now calculate our effect size metric. Assume that attendance is 50/50. So 
    # we want to know the proportion of 1's (woman-asked questions) in each column
    ...

    # Store the effect size for this simulation
    all_effect_sizes[k] = ...

## Plot the simulated distribution of effect sizes

In [None]:
# Plot results of our simulation
ax = sns.histplot(all_effect_sizes)

# Draw a solid black line for the actual effect size
ax.plot([real_effect_size, real_effect_size], [0, ax.get_ylim()[1]], 'k')

# Set labels & title
ax.set(xlabel = 'Effect size due to sorting bias', 
      ylabel = '# of simulations', 
      title = f'# of seminars: {num_seminars}, Questions per seminar: {num_q_per_seminar}');

**QUESTION (Q2)**: What is the mean effect size, rounded to the nearest whole
number, if there are only 3 questions asked at each seminar?

**QUESTION (Q3)**: What if there are 4 questions per seminar?

**QUESTION (Q4)**: What if there are 5 questions per seminar?

**QUESTION (Q5)**: Do you see a trend? Try smaller values for `num_q_per_seminar` and
think about what is going on.

**QUESTION (Q6)**: So then why do the simulation? What does it add?

**QUESTION (Q7)**: What would happen to our distribution of simulated effect sizes if there
were 127 seminars instead of 249? Simulate it and compare!

**QUESTION (Q8)**: What is the smallest number of questions per seminar for
which you would be 95% confident that the actual effect size obtained
(25) was not purely due to a sorting bias? Hint: For a given `num_q_per_seminar`,
look at the distribution of your simulated values, and see where the
actual effect size obtained falls within this distribution.
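
The general recipe for extracting a p-value from a simulated null distribution can be sketched as follows. The stand-in null distribution (normal draws with an arbitrary mean and spread) and the observed value are hypothetical; in practice you would use your own `all_effect_sizes` and `real_effect_size`:

```python
import numpy as np

# Stand-in null distribution (hypothetical): in practice, use the simulated
# effect sizes from your own H0 simulation
rng = np.random.default_rng(2)
null_effect_sizes = rng.normal(loc=17.0, scale=4.0, size=10_000)

observed_effect = 25.0  # hypothetical observed value

# One-sided p-value: fraction of null simulations at least as large as the observed effect
p_value = np.mean(null_effect_sizes >= observed_effect)

# Closely related view: compare the observed value with the 95th percentile of the null
threshold_95 = np.percentile(null_effect_sizes, 95)

print(f'p = {p_value:.4f}, 95th percentile of null = {threshold_95:.2f}')
```

The resolution of such a p-value is limited by the number of simulations, which is one reason to increase `num_sims` once the code is working.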

**QUESTION (Q9)**: How could you eliminate this bias?