# Divergence Analysis

In [10]:
from gender_history.datasets.dataset_journals import JournalsDataset
from gender_history.divergence_analysis.divergence_analysis import DivergenceAnalysis
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

In this notebook, we will look at divergent uses of terms and topics between two corpora. 

The choice of the two corpora is flexible. We can compare:
<ul>
    <li>Male vs. Female Authors</li>
    <li>1990s vs. 2000s</li>
    <li>Male authors with female advisors vs. male authors with male advisors</li>
    <li>Harvard authors vs. non-Harvard authors</li>
</ul>

In each case, we want to know which terms and topics are over- or under-represented in one corpus relative to the other.
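
The output tables in this notebook report two scores per term, `dunning` and `frequency_score`. The sketch below shows one plausible way to compute such scores; it is not taken from the `gender_history` source, and the function and argument names are hypothetical. It assumes that `frequency_score` is corpus 1's share of the two relative frequencies (this agrees with the values printed below, e.g. 0.0392632 / (0.0392632 + 0.00493157) ≈ 0.888 for Gender and Feminism) and that `dunning` is the standard Dunning log-likelihood statistic (G²) with a sign added to indicate direction.

```python
import math


def frequency_score(freq_c1: float, freq_c2: float) -> float:
    """Corpus 1's share of the combined relative frequency (hypothetical helper).

    0.5 means the term is equally frequent in both corpora; values near 1
    mean it is concentrated in corpus 1, values near 0 in corpus 2.
    Undefined (division by zero) if the term appears in neither corpus.
    """
    return freq_c1 / (freq_c1 + freq_c2)


def signed_dunning(count_c1: float, count_c2: float,
                   total_c1: float, total_c2: float) -> float:
    """Signed Dunning log-likelihood (G2) for one term (hypothetical helper).

    count_*: occurrences of the term in each corpus
    total_*: total number of tokens in each corpus
    Positive: over-represented in corpus 1; negative: in corpus 2.
    """
    # Expected counts if the term were equally frequent in both corpora.
    pooled_rate = (count_c1 + count_c2) / (total_c1 + total_c2)
    expected_c1 = total_c1 * pooled_rate
    expected_c2 = total_c2 * pooled_rate

    g2 = 0.0
    for observed, expected in ((count_c1, expected_c1), (count_c2, expected_c2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2

    # Assumption: the sign encodes which corpus uses the term more often.
    if count_c1 / total_c1 < count_c2 / total_c2:
        g2 = -g2
    return g2


# Relative frequencies of "(61) Gender and Feminism" from the output below.
print(round(frequency_score(0.0392632, 0.00493157), 3))  # 0.888

# Toy counts: a term used 30 times in corpus 1 and 10 times in corpus 2,
# with 100 tokens in each corpus.
print(signed_dunning(30, 10, 100, 100))
```

Read together, the two scores are complementary: the frequency score describes how lopsided the usage is regardless of how common the term is, while the Dunning score also grows with the amount of evidence, so frequent terms with moderate imbalances can outrank rare terms with extreme ones.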

## Male vs. Female Authors
To make this concrete, let's first compare male and female authors across the roughly 9,500 articles
for which we have gender data, excluding articles published by mixed-gender teams.

In [11]:
d = JournalsDataset(use_equal_samples_dataset=False)

# Create two sub-datasets, one for female authors and one for male authors
c1 = d.copy().filter(author_gender='female')
c2 = d.copy().filter(author_gender='male')

# Run the divergence analysis
div = DivergenceAnalysis(d, c1, c2, sub_corpus1_name='women', sub_corpus2_name='men')
div.run_divergence_analysis(analysis_type='topics')

             women    women freq    men    men freq
---------  -------  ------------  -----  ----------
1976-1984      269      0.133433   1318   0.175593
1985-1989      188      0.093254    727   0.0968558
1990-1994      281      0.139385    755   0.100586
1995-1999      284      0.140873    767   0.102185
2000-2004      222      0.110119    543   0.0723421
2005-2009      287      0.142361    582   0.077538
2010-2015      242      0.12004     409   0.0544897


Terms distinctive for Corpus 1: women. 2016 Documents

    term                                           dunning    frequency_score    freq both     f women       f men
--  ------------------------------------------  ----------  -----------------  -----------  ----------  ----------
89  (61) Gender and Feminism                    471911               0.888413  0.0122003    0.0392632   0.00493157
88  (32) Doctors & Patients                      87796.3             0.765333  0.00716873   0.0158102   0.00484775
87  (46) Family    

     index  term                                                     dunning  frequency_score     count both        c women          c men  freq both   f women     f men
---  -----  -------------------------------------------------  -------------  ---------------  -------------  -------------  -------------  ---------  --------  --------
  0     65  (66) German and Austro-Hungarian Diplomatic Hi...  -75177.420448         0.203818  448850.251471   28875.801039  419974.450433   0.011785  0.003581  0.013988
  1     81  (82) German Intellectual History                   -66958.325681         0.259937  581673.441300   50143.034431  531530.406868   0.015272  0.006218  0.017704
  2     25  (26) American Political Thought                    -46350.409319         0.262685  411023.879958   35895.806588  375128.073370   0.010791  0.004451  0.012494
  3     73  (74) U.S. Political Parties & Elections            -36589.471704         0.277716  364775.059494   34144.296160  330630.763334   0.009577  0.004234  0.011012
  4     63  (64) Political History of the Cold War             -31838.618007         0.298944  380100.515967   39059.338772  341041.177196   0.009980  0.004844  0.011359
...    ...  ...                                                          ...              ...            ...            ...            ...        ...       ...       ...
 85     16  (17) Nations and Boundaries                         67396.588658         0.683155  536532.437267  196761.482180  339770.955087   0.014087  0.024400  0.011317
 86     75  (76) Consumption and consumerism                    59320.869643         0.719713  301763.692566  123170.016114  178593.676452   0.007923  0.015274  0.005948
 87     45  (46) Family                                        182113.193924         0.758852  605318.839276  277266.995516  328051.843760   0.015893  0.034383  0.010926
 88     31  (32) Doctors & Patients                             87796.261872         0.765333  273042.470724  127493.658760  145548.811964   0.007169  0.015810  0.004848
