# Divergence Analysis

In [1]:
from gender_history.datasets.dataset_journals import JournalsDataset
from gender_history.divergence_analysis.divergence_analysis import DivergenceAnalysis
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In this notebook, we will look at divergent uses of terms and topics between two corpora. 

What the two corpora are is flexible. We can compare:
<ul>
    <li>Male vs. Female Authors</li>
    <li>1990s vs. 2000s</li>
    <li>Male authors with female advisors vs. male authors with male advisors</li>
</ul>

In each case, we want to know what terms are over or under-represented in one of the corpora.

## Male vs. Female Authors
Let's make this concrete and first look at male vs. female authors across the roughly 9,500 articles 
for which we have gender data (excluding articles published by mixed-gender teams).

In [2]:
d = JournalsDataset()

# Create two sub-datasets, one for female authors and one for male authors
c1 = d.copy().filter(author_gender='female')
c2 = d.copy().filter(author_gender='male')

# Run the divergence analysis
div = DivergenceAnalysis(d, c1, c2, sub_corpus1_name='women', sub_corpus2_name='men', 
                        analysis_type='topics', sort_by='dunning')
div.run_divergence_analysis(number_of_terms_or_topics_to_print=12)
pass

             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1951-1954       21     0.0104167    236   0.0314415
1955-1959       41     0.0203373    344   0.04583
1960-1964       37     0.0183532    498   0.0663469
1965-1969       42     0.0208333    555   0.0739408
1970-1974       82     0.0406746    634   0.0844658
1975-1979      125     0.062004     746   0.0993872
1980-1984      164     0.0813492    710   0.094591
1985-1989      188     0.093254     727   0.0968558
1990-1994      281     0.139385     755   0.100586
1995-1999      284     0.140873     767   0.102185
2000-2004      222     0.110119     543   0.0723421
2005-2009      287     0.142361     582   0.077538
2010-2015      242     0.12004      409   0.0544897


Terms distinctive for Corpus 1: women. 2016 Documents

    term                                  dunning    frequency_score    freq both     f women       f men
--  ----------------------------------  ---------  ------------

Many numbers. Let's start at the top:
```
             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1951-1954       21     0.0104167    236   0.0314415
1955-1959       41     0.0203373    344   0.04583
1960-1964       37     0.0183532    498   0.0663469
1965-1969       42     0.0208333    555   0.0739408
1970-1974       82     0.0406746    634   0.0844658
1975-1979      125     0.062004     746   0.0993872
1980-1984      164     0.0813492    710   0.094591
1985-1989      188     0.093254     727   0.0968558
1990-1994      281     0.139385     755   0.100586
1995-1999      284     0.140873     767   0.102185
2000-2004      222     0.110119     543   0.0723421
2005-2009      287     0.142361     582   0.077538
2010-2015      242     0.12004      409   0.0544897
```

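This tells us how many articles we have in our dataset and how they are distributed over time: for each period, the article count for women and men, and the share of that gender's articles that falls into the period (the `freq` columns). 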

The next table tells us what topics are distinctive for female authors:
```
Terms distinctive for Corpus 1: women. 2016 Documents

    term                                  dunning    frequency_score    freq both     f women       f men
--  ----------------------------------  ---------  -----------------  -----------  ----------  ----------
83  (61) Gender and Feminism             471911             0.888413   0.0122003   0.0392632   0.00493157
82  (46) Family                          182113             0.758852   0.0158926   0.0343833   0.0109263
81  (32) Doctors & Patients               87796.3           0.765333   0.00716873  0.0158102   0.00484775
80  (76) Consumption and consumerism      59320.9           0.719713   0.0079228   0.0152741   0.00594836
79  (45) Cultural Turn                    35589.7           0.619321   0.0199978   0.028722    0.0176546
78  (71) Sexuality                        17988.9           0.62985    0.00836095  0.012387    0.00727961
77  (79) Legal History                    16332.7           0.592498   0.0160636   0.0213081   0.014655
76  (40) European Colonization of Asia    14058.2           0.654648   0.00438061  0.00698031  0.00368238
75  (54) Witchcraft and Magic             13663.4           0.640294   0.00532785  0.00813955  0.00457267
74  (43) Islamic History                  12776.7           0.613005   0.0081018   0.0114212   0.00721027
73  (39) Music                            12170.1           0.602071   0.00965676  0.0131793   0.00871065
72  (55) U.S. Civil Rights Movement       10806.7           0.606307   0.00784235  0.0108384   0.00703767

```

The table includes two measurements: `dunning`, the Dunning Log-Likelihood score, and `frequency_score`. We will cover both in a moment. The remaining columns are:
- freq both: the average weight of the topic across the male and female corpora combined
- f women: the average weight of the topic in the women corpus
- f men: the average weight of the topic in the men corpus


### Dunning's Log-Likelihood Test
`dunning` is the Dunning Log-Likelihood test (G2) for the value. 

For a longer discussion of this score, see https://de.dariah.eu/tatom/feature_selection.html#chi2

Here's the intuition: Given two text corpora, our null hypothesis is that a given term should appear with the
same frequency in both corpora. For example, we would expect the word "the" to make up about 1-2% of both corpora. 

Dunning's Log-Likelihood test gives us a way to test how far a term deviates from this null hypothesis, i.e.
how over or under-represented a term is in the female or male corpora (mathematically: how far the observed term counts
deviate from the expected term counts). 

Generally, terms will score high for G2 if a) they are frequent and b) they are heavily skewed towards one or another corpus. 
Ben Schmidt has a blog post which demonstrates how G2 captures a combination of additive difference (difference in absolute numbers) and
multiplicative difference (difference between frequencies): http://sappingattention.blogspot.com/2011/10/comparing-corpuses-by-word-use.html 

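A Dunning value of about 4 (more precisely, 3.84 for a chi-squared distribution with one degree of freedom) corresponds to a p value < 0.05. A Dunning value of 471911, as we find for "Gender and Feminism", is far beyond any conventional significance threshold. 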
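
To make the intuition concrete, here is a minimal sketch of a signed G2 computed from a 2x2 contingency table (term vs. all other tokens, corpus 1 vs. corpus 2). The function names and the sign convention (positive when the term is over-represented in corpus 1, negative otherwise, which matches the signs in the tables of this notebook) are assumptions for illustration. `DivergenceAnalysis` may compute the score differently, for example on topic weights rather than raw counts.

```python
import math


def dunning_g2(count_1, count_2, total_1, total_2):
    """Signed Dunning log-likelihood (G2) for one term.

    count_1, count_2: occurrences of the term in corpus 1 and corpus 2
    total_1, total_2: total number of tokens in corpus 1 and corpus 2
    """
    # Observed 2x2 table: rows = corpora, columns = (term, all other tokens)
    observed = [
        [count_1, total_1 - count_1],
        [count_2, total_2 - count_2],
    ]
    row_totals = [total_1, total_2]
    term_total = count_1 + count_2
    grand_total = total_1 + total_2
    col_totals = [term_total, grand_total - term_total]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            # Expected count under the null hypothesis of equal frequency
            e = row_totals[i] * col_totals[j] / grand_total
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    g2 *= 2

    # Assumed sign convention: negative if the term is rarer in corpus 1
    if count_1 / total_1 < count_2 / total_2:
        g2 = -g2
    return g2


def p_value_1df(g2):
    """p value of |G2| under a chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(abs(g2) / 2))


# Hypothetical counts: the term appears 50 times in 1000 tokens of corpus 1
# and 10 times in 1000 tokens of corpus 2.
score = dunning_g2(50, 10, 1000, 1000)
print(score, p_value_1df(score))
```

A term with identical relative frequency in both corpora scores 0, and swapping the two corpora flips the sign.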

### Frequency Score

The frequency score is another measure that we can use to measure how gendered a topic is. It goes from 
0 (only men write about a topic) to 1 (only women write about a topic).<br>
The math behind this score is simple:<br>
`frequency score(topic) = weight of topic among women / (weight of topic among women + weight of 
topic among men)`<br>
Here's an explanation that's more intuitive: Assume that we have an equal number of articles written
by men and women (we don't but let's presume that's the case for the moment). In that case, a
frequency score of 0.5 means that women contributed 50% of the weight to the topic and men the
remaining 50%. A score of 0.89 (the score of "Gender and Feminism") means that women contributed
89% of the weight and men 11%. <br>
The reality is more complicated simply because we have almost four times as many articles authored
by men as articles authored by women (7506 to 2016). Hence, men did contribute about one third of 
the total "Gender and Feminism" topic weight (37 to women's 79), simply because even when they are not gender
historians, they will occasionally use "Gender and Feminism" terms like "women," "male," or 
"female." <br>
If we look at the average weights, however, we see that among women, "Gender and 
Feminism" has an average weight of 3.93% whereas among men, it has an average weight of 0.49%. This
is the relationship that the frequency score expresses: 3.93% / (3.93% + 0.49%) = 0.89.
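
The arithmetic above can be checked with a few lines of Python. This is only a sketch using the numbers printed in the table. The helper name `frequency_score` is made up for illustration and is not necessarily how `DivergenceAnalysis` implements it.

```python
def frequency_score(weight_women, weight_men):
    """Share of the combined average topic weight that comes from women."""
    return weight_women / (weight_women + weight_men)


# Average weights of "(61) Gender and Feminism" from the table above
f_women, f_men = 0.0392632, 0.00493157
# Number of articles by women and by men
n_women, n_men = 2016, 7506

score = frequency_score(f_women, f_men)

# Total (not average) topic weight contributed by each group
total_women = f_women * n_women
total_men = f_men * n_men
men_share = total_men / (total_women + total_men)

print(round(score, 3))        # 0.888
print(round(total_women))     # 79
print(round(total_men))       # 37
print(round(men_share, 2))    # 0.32
```

The frequency score works on average weights, so it is not affected by the unequal number of articles, whereas the raw totals (79 vs. 37) are.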


So, in short, men and women work on different topics... very unexpected.

## Gender Differences in Military History

Let's look at something more interesting: differences within a topic. 

By Dunning score, "Military History" is the 12th most distinctive topic for men. So it's not quite at the top but I think within the paper it can stand
rhetorically as one of the most obviously male-coded topics. 

But women have also contributed substantially to the topic. 

What we can do is compare topics that men and women write about ONLY in a dataset of the articles that score in the top 5% for the military history topic.

In [3]:
d = JournalsDataset()

# retain only the articles scoring in the top 5% for topic 31 (military history)
d.topic_score_filter(31, min_percentile_score=95)

# Create two sub-datasets, one for female authors and one for male authors
c1 = d.copy().filter(author_gender='female')
c2 = d.copy().filter(author_gender='male')

div = DivergenceAnalysis(d, c1, c2, sub_corpus1_name='women', sub_corpus2_name='men',
                         analysis_type='topics', sort_by='dunning')
div.run_divergence_analysis(number_of_terms_or_topics_to_print=10)
pass

             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1951-1954        1     0.016129      23   0.0543735
1955-1959        0     0             17   0.0401891
1960-1964        1     0.016129      30   0.070922
1965-1969        1     0.016129      30   0.070922
1970-1974        1     0.016129      38   0.0898345
1975-1979        1     0.016129      36   0.0851064
1980-1984        2     0.0322581     26   0.0614657
1985-1989        3     0.0483871     42   0.0992908
1990-1994        8     0.129032      48   0.113475
1995-1999        9     0.145161      47   0.111111
2000-2004       11     0.177419      26   0.0614657
2005-2009       12     0.193548      38   0.0898345
2010-2015       12     0.193548      22   0.0520095


Terms distinctive for Corpus 1: women. 62 Documents

    term                                           dunning    frequency_score    freq both    f women       f men
--  -------------------------------------------  ------

First, of course, it's important to note the general imbalance: in the top 5% for military history, we have 423 articles from men and only 
62 from women. 

However, the differences are still quite interesting. 

"Gender and Feminism" and "Doctors & Patients" are generally more female, so finding them here is no surprise. But "France" and "British Early Modern Political History" are not. 

What articles are behind these numbers? Let's have a look.

In [4]:
div.print_articles_for_top_topics(top_terms_or_topics=4, articles_per_term_or_topic=4)


Sample articles of distinctive topics for women

Topic 61 (Gender and Feminism). Highest scoring items:
   (1994) Philippa Levine: "Walking the Streets in a Way No Decent Woman Should": Women Police in World War I
   (1996) Margaret H. Darrow: French Volunteer Nursing and the Myth of War Experience in World War I
   (1998) Sonya O. Rose: Sex, Citizenship, and the Nation in World War II Britain
   (1990) Drew Gilpin Faust: Altars of Sacrifice: Confederate Women and the Narratives of War

Topic 37 (France). Highest scoring items:
   (1968) Nuria Sales de Bohigas: Some Opinions on Exemption from Military Service in Nineteenth-Century Europe
   (1985) Barbara Diefendorf: Prologue to a Massacre: Popular Unrest in Paris, 1557-1572
   (1996) Margaret H. Darrow: French Volunteer Nursing and the Myth of War Experience in World War I
   (1993) Joanna Waley-Cohen: China and Western Technology in the Late Eighteenth Century

Topic 51 (British Early Modern Political History). Highest scoring items


So, women are writing women, gender and sexuality into military history. They are also the ones who primarily write about nursing and hospitals. 
I still don't know what to make of "France" and "British Early Modern Political History". It might just be that these are two areas where women 
are overrepresented. Interpretations welcome.

Instead of looking at the topics, we can also look at the terms that are most overrepresented for female military historians.

In [11]:
d = JournalsDataset()

# retain only the articles scoring in the top 5% for topic 31 (military history)
d.topic_score_filter(31, min_percentile_score=95)

# Create two sub-datasets, one for female authors and one for male authors
c1 = d.copy().filter(author_gender='female')
c2 = d.copy().filter(author_gender='male')

div = DivergenceAnalysis(d, c1, c2, sub_corpus1_name='women', sub_corpus2_name='men',
                         analysis_type='terms', sort_by='dunning')
div.run_divergence_analysis(number_of_terms_or_topics_to_print=10)
div.print_articles_for_top_topics(top_terms_or_topics=5, articles_per_term_or_topic=5)
pass

Generating document term matrix...
Vocabulary length: 365
Generating document term matrix...
Generating document term matrix...
             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1951-1954        1     0.016129      23   0.0543735
1955-1959        0     0             17   0.0401891
1960-1964        1     0.016129      30   0.070922
1965-1969        1     0.016129      30   0.070922
1970-1974        1     0.016129      38   0.0898345
1975-1979        1     0.016129      36   0.0851064
1980-1984        2     0.0322581     26   0.0614657
1985-1989        3     0.0483871     42   0.0992908
1990-1994        8     0.129032      48   0.113475
1995-1999        9     0.145161      47   0.111111
2000-2004       11     0.177419      26   0.0614657
2005-2009       12     0.193548      38   0.0898345
2010-2015       12     0.193548      22   0.0520095


Terms distinctive for Corpus 1: women. 62 Documents

     term         dunning    frequency_sc

   Count africa: 112. (1990) Philip D. Curtin: The End of the "White Man's Grave"? Nineteenth-Century Mortality in West Africa
   Count africa: 76. (1995) Jonathan Zimmerman: Beyond Double Consciousness: Black Peace Corps Volunteers in Africa, 1961-1971
   Count africa: 68. (2011) Neil Roos: Education, Sex And Leisure: Ideology, Discipline And The Construction Of Race Among South African Servicemen During The Second World War
   Count africa: 63. (1991) Christopher Merrett; Roger Gravil: Comparing Human Rights: South Africa and Argentina, 1976-1989
   Count africa: 58. (1979) Daniel R. Headrick: The Tools of Imperialism: Technology and the Expansion of European Colonial Empires in the Nineteenth Century


Implementation note: The vocabulary only contains the top 1000 most frequent terms that appear at least 1000 times overall (i.e. including authors with unknown genders).

Both of these limitations cut down the number of flukes, i.e. rare terms that only appear in one or two articles but appear in those so often that it seems as if they displayed an important pattern.
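
As a rough illustration of these two limits, here is a minimal sketch of a vocabulary filter. The function `build_vocabulary`, its parameters, and the whitespace tokenization are assumptions made for this example; the actual vocabulary construction in `DivergenceAnalysis` may work differently.

```python
from collections import Counter


def build_vocabulary(documents, max_terms=1000, min_count=1000):
    """Keep at most `max_terms` of the most frequent terms,
    and only those that occur at least `min_count` times overall."""
    counts = Counter()
    for text in documents:
        # Naive tokenization, for illustration only
        counts.update(text.lower().split())

    return [
        term
        for term, count in counts.most_common(max_terms)
        if count >= min_count
    ]


# Toy example with tiny thresholds
docs = ["war war war police", "war police rare"]
print(build_vocabulary(docs, max_terms=2, min_count=2))  # ['war', 'police']
```

With both thresholds in place, a term like "rare" that shows up only once never enters the vocabulary, no matter how concentrated it is in a single article.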

This gives us the following terms as distinctive for women:
```

Terms distinctive for Corpus 1: women. 62 Documents

     term         dunning    frequency_score    count both    c women    c men
---  ---------  ---------  -----------------  ------------  ---------  -------
364  women       4024.8             0.879919          4394       2509     1885
363  police       885.096           0.816475          1828        817     1011
362  her          882.265           0.790901          2372        966     1406
361  she          609.632           0.809659          1349        588      761
360  work         228.413           0.659912          3016        786     2230
359  family       225.294           0.728117          1207        395      812
358  la           175.092           0.656093          2448        630     1818
357  chinese      152.77            0.697762          1188        351      837
356  memory       147.632           0.713158           945        294      651
355  committee    137.357           0.667914          1607        430     1177
```

And here are sample articles:


Again, "women" and "her" are not surprising but they help to flesh out how women enter into military history both as authors and
actors.
```
 Term: women. Highest scoring items:
   Count women: 461. (1994) Philippa Levine: "Walking the Streets in a Way No Decent Woman Should": Women Police in World War I
   Count women: 293. (1996) Elizabeth Heineman: The Hour of the Woman: Memories of Germany's "Crisis Years" and West German National Identity
   Count women: 208. (1990) Drew Gilpin Faust: Altars of Sacrifice: Confederate Women and the Narratives of War
   
 Term: her. Highest scoring items:
   Count her: 110. (2011) Linde Apel: Voices From The Rubble Society: "Operation Gomorrah" And Its Aftermath
   Count her: 101. (1990) Drew Gilpin Faust: Altars of Sacrifice: Confederate Women and the Narratives of War
   Count her: 93. (1996) Margaret H. Darrow: French Volunteer Nursing and the Myth of War Experience in World War I
```

Police is interesting:
```
Term: police. Highest scoring items:
   Count police: 346. (1994) Philippa Levine: "Walking the Streets in a Way No Decent Woman Should": Women Police in World War I
   Count police: 189. (1995) Gerda W. Ray: From Cossack to Trooper: Manliness, Police Reform, and the State
   Count police: 135. (1985) Elaine Glovka Spencer: Police-Military Relations in Prussia, 1848-1914
   Count police: 29. (2010) Mary Louise Roberts: The Price of Discretion: Prostitution, Venereal Disease, and the American Military in France, 1944—1946
   Count police: 23. (1984) Altina L. Waller: Community, Class and Race in the Memphis Riot of 1866
```

And work seems mostly about the work of women during the war:
```
 Term: work. Highest scoring items:
   Count work: 121. (1994) Philippa Levine: "Walking the Streets in a Way No Decent Woman Should": Women Police in World War I
   Count work: 67. (2001) Cheryl A. Wells: Battle Time: Gender, Modernity, and Confederate Hospitals
   Count work: 67. (1997) Henriette Donner: Under the Cross: Why V.A.D.s Performed the Filthiest Task in the Dirtiest War: Red Cross Women Volunteers, 1914-1918
   Count work: 51. (2013) Krisztina Robert: Constructions Of "Home," "Front," And Women'S Military Employment In First-World-War Britain: A Spatial Interpretation
   Count work: 43. (1996) Elizabeth Heineman: The Hour of the Woman: Memories of Germany's "Crisis Years" and West German National Identity
```

"La" is there again because there are a good number of articles dealing with France, e.g. 
Count la: 139. (1996) Margaret H. Darrow: French Volunteer Nursing and the Myth of War Experience in World War I


One final caveat: even though, for example, "police" is distinctive for female authors, men have used the term "police" more overall. It's just that
the top 5% of military history articles have a gender balance of about 7:1 male/female. 
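
A quick check with the numbers from the table makes this caveat concrete. Normalizing by the number of articles is an assumption chosen for simplicity here and is not necessarily how `frequency_score` is normalized.

```python
# "police" counts from the terms table above
c_women, c_men = 817, 1011
# Articles in the top 5% for military history
n_women, n_men = 62, 423

per_article_women = c_women / n_women
per_article_men = c_men / n_men

print(round(per_article_women, 1))  # 13.2
print(round(per_article_men, 1))    # 2.4
```

Men use "police" more often in absolute terms (1011 vs. 817), but per article, women use it several times as often.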


## History of Sexuality

Note: Here, we'll limit the analysis to articles after 1990.

Our history of sexuality topic has two peaks: one in the 1970s with Freud, and another during and after the 1990s around
what we would now consider the history of sexuality.

Also, I'm less interested here in the connections that men and women make to other topics and more in differences within the topic, 
most notably the focus on gay history.

In [12]:
d = JournalsDataset()

# retain articles published from 1990 onward
d.filter(start_year=1990)

# retain only the articles scoring in the top 5% for topic 71 (sexuality)
d.topic_score_filter(71, min_percentile_score=95)

# Create two sub-datasets, one for female authors and one for male authors
c1 = d.copy().filter(author_gender='female')
c2 = d.copy().filter(author_gender='male')

div = DivergenceAnalysis(d, c1, c2, sub_corpus1_name='women', sub_corpus2_name='men',
                         analysis_type='terms', sort_by='dunning')
div.run_divergence_analysis(number_of_terms_or_topics_to_print=10)
div.print_articles_for_top_topics(top_terms_or_topics=5, articles_per_term_or_topic=5)
pass

Generating document term matrix...
Vocabulary length: 134
Generating document term matrix...
Generating document term matrix...
             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1990-1994       16      0.158416     28    0.21875
1995-1999       15      0.148515     33    0.257812
2000-2004       26      0.257426     26    0.203125
2005-2009       27      0.267327     27    0.210938
2010-2015       17      0.168317     14    0.109375


Terms distinctive for Corpus 1: women. 101 Documents

     term         dunning    frequency_score    count both    c women    c men
---  ---------  ---------  -----------------  ------------  ---------  -------
133  her          948.335           0.688922          6328       4216     2112
132  women        688.23            0.645146          7911       4913     2998
131  she          571.923           0.690267          3761       2511     1250
130  girls        283.166           0.723134          1335

The women terms are largely what I would have expected:
```
Terms distinctive for Corpus 1: women. 101 Documents

     term         dunning    frequency_score    count both    c women    c men
---  ---------  ---------  -----------------  ------------  ---------  -------
133  her          948.335           0.688922          6328       4216     2112
132  women        688.23            0.645146          7911       4913     2998
131  she          571.923           0.690267          3761       2511     1250
130  girls        283.166           0.723134          1335        937      398
129  sexuality    227.534           0.660657          2124       1353      771
128  mother       206.216           0.6738            1637       1065      572
127  home         183.081           0.657222          1787       1132      655
126  woman        144.855           0.6364            1892       1158      734
125  sex          127.511           0.603186          2939       1699     1240
124  female       112.849           0.619854          1919       1142      777
```

But the men's terms are interesting:
```
Terms distinctive for Corpus 2: men. 128 Documents

    term         dunning    frequency_score    count both    c women    c men
--  ---------  ---------  -----------------  ------------  ---------  -------
 0  freud       -444.088           0.185509          1092        186      906
 1  gay         -319.101           0.261703          1400        339     1061
 2  childhood   -221.618           0.289789          1257        338      919
 3  fear        -220.519           0.288809          1239        332      907
 4  education   -142.344           0.324912          1170        354      816
 5  war         -136.835           0.385621          2643        955     1688
 6  social      -133.017           0.432414          7349       2992     4357
 7  his         -127.999           0.447506         11711       4942     6769
 8  political   -123.353           0.372552          1919        669     1250
 9  black       -100.448           0.34427           1046        336      710
 ```

Freud remains a major touchstone even after 1990:
```
 Term: freud. Highest scoring items:
   Count freud: 477. (1991) John E. Toews: Historicizing Psychoanalysis: Freud in His Time and for Our Time
   Count freud: 92. (1993) Bruce Mazlish: A Triptych: Freud's The Interpretation of Dreams, Rider Haggard's She, and Bulwer-Lytton's The Coming Race
   Count freud: 62. (1995) Fred Weinstein: Psychohistory and the Crisis of the Social Sciences
   Count freud: 32. (2001) Andrew R. Heinze: Jews and American Popular Psychology: Reconsidering the Protestant Paradigm of Popular Thought
   Count freud: 30. (1997) John E. Talbott: Soldiers, Psychiatrists, and Combat Trauma
```

And gay history is strongly skewed towards men. The question for me is whether we can argue for a strong connection between gay history and women's and 
gender history, i.e. are they parallel developments? (Note to self: check if men writing on gay history were more often advised by women!) Here are the top articles for "gay":
```
 Term: gay. Highest scoring items:
   Count gay: 265. (2010) Daniel Rivers: "In The Best Interests Of The Child": Lesbian And Gay Parenting Custody Cases, 1967-1985
   Count gay: 250. (2011) Kevin J. Mumford: The Trouble with Gay Rights: Race and the Politics of Sexual Orientation in Philadelphia, 1969-1982
   Count gay: 175. (2010) David K. Johnson: Physique Pioneers: The Politics Of 1960S Gay Consumer Culture
   Count gay: 113. (1991) John E. Toews: Historicizing Psychoanalysis: Freud in His Time and for Our Time
   Count gay: 78. (2007) Craig M. Loftin: Unacceptable Mannerisms: Gender Anxieties, Homosexual Activism, and Swish in the United States, 1945-1965
```

Childhood surprises me as well. "children" and "child" both skew strongly female but "childhood" skews male (frequency score 0.29). 
```
 Term: childhood. Highest scoring items:
   Count childhood: 146. (2008) Patrick J. Ryan: How New Is the "New" Social Study of Childhood? The Myth of a Paradigm Shift
   Count childhood: 119. (1998) Hugh Cunningham: Histories of Childhood
   Count childhood: 80. (2005) Brian Platt: Japanese Childhood, Modern Childhood: The Nation-State, the School, and 19th-Century Globalization
   Count childhood: 74. (1991) Timothy Haggerty; Peter N. Stearns: The Role of Fear: Transitions in American Emotional Standards for Children, 1850-1950
   Count childhood: 61. (1995) Sigurður Gylfi Magnússon: From Children's Point of View: Childhood in Nineteenth-Century Iceland
```

Frankly, I wonder if this whole nexus of childhood, fear, psychology is still tied to Freud. At any rate, we also find it in the next two terms,
"fear" and "education":
```
Term: fear. Highest scoring items:
   Count fear: 364. (1991) Timothy Haggerty; Peter N. Stearns: The Role of Fear: Transitions in American Emotional Standards for Children, 1850-1950
   Count fear: 84. (2006) Peter N. Stearns: Fear and Contemporary History: A Review Essay
   Count fear: 76. (1993) Peter N. Stearns: Girls, Boys, and Emotions: Redefinitions and Historical Change
   Count fear: 30. (1999) Alan Hunt: Anxiety and Social Explanation: Some Anxieties about Anxiety
   Count fear: 21. (1995) Jeremy Krikler: Social Neurosis and Hysterical Pre-Cognition in South Africa: A Case-Study and Reflections

 Term: education. Highest scoring items:
   Count education: 182. (1996) Jeffrey P. Moran: "Modernism Gone Mad": Sex Education Comes to Chicago, 1913
   Count education: 181. (2005) Christopher P. Loss: "The Most Wonderful Thing Has Happened to Me in the Army": Psychology, Citizenship, and American Higher Education in World War II
   Count education: 61. (2005) Brian Platt: Japanese Childhood, Modern Childhood: The Nation-State, the School, and 19th-Century Globalization
   Count education: 48. (2004) Daryl Michael Scott: Postwar Pluralism, Brown v. Board of Education, and the Origins of Multicultural Education
   Count education: 39. (1998) David Yosifon; Peter N. Stearns: The Rise and Fall of American Posture
```
