## Step 1: Computational Inductive Exploration

Analysis 2: Structural Topic Model  
The code below produces the output for Table 3 and Table 4.

1. [Compare models](#compare)
2. [Explore topics by organization](#explore)

<a id='compare'></a>
## Compare Models

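First, load the saved Structural Topic Model data, which contains models fitted with 20, 30, 40, and 50 topics. Then compare the top weighted words of a matching topic (here, the 'abortion' topic) across the models to judge which number of topics gives the most coherent result.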

The following code produces the output for Table 3.

In [26]:
#enable use of R in a Python kernel
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [78]:
%%R
library(stm)

#load the saved STM
load("../input_data/stm_1234.RData")

In [38]:
%%R
#Produces output for Table 3
#Top weighted words for the 'abortion' topic in the k=20 topic model
terms20 <- labelTopics(mod.20, n=20)
cat("Top weighted words for the 'abortion' topic in the k=20 topic model\n")
print(terms20$prob[16,])
cat("\nTop weighted words for the 'abortion' topic in the k=30 topic model\n")
terms30 <- labelTopics(mod.30, n=20)
print(terms30$prob[18,])
cat("\nTop weighted words for the 'abortion' topic in the k=40 topic model\n")
terms40 <- labelTopics(mod.40, n=20)
print(terms40$prob[2,])
cat("\nTop weighted words for the 'abortion' topic in the k=50 topic model\n")
terms50 <- labelTopics(mod.50, n=20)
print(terms50$prob[9,])

Top weighted words for the 'abortion' topic in the k=20 topic model
 [1] "abort"    "women"    "law"      "can"      "doctor"   "will"    
 [7] "control"  "one"      "infect"   "woman"    "may"      "medic"   
[13] "mani"     "use"      "state"    "birth"    "pregnanc" "right"   
[19] "pill"     "must"    

Top weighted words for the 'abortion' topic in the k=30 topic model
 [1] "abort"   "law"     "women"   "state"   "medic"   "doctor"  "legal"  
 [8] "right"   "will"    "court"   "woman"   "hospit"  "new"     "control"
[15] "can"     "decis"   "repeal"  "one"     "bill"    "want"   

Top weighted words for the 'abortion' topic in the k=40 topic model
 [1] "abort"   "women"   "law"     "doctor"  "medic"   "hospit"  "will"   
 [8] "woman"   "can"     "state"   "control" "one"     "legal"   "right"  
[15] "repeal"  "new"     "mani"    "perform" "clinic"  "want"   

Top weighted words for the 'abortion' topic in the k=50 topic model
 [1] "abort"    "hospit"   "center"   "medic"    "mater
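
Reading the lists side by side can be supplemented with a rough overlap score. The sketch below is not part of the original analysis: it copies the k=20, k=30, and k=40 word lists printed above and computes their Jaccard overlap, a measure chosen here purely for illustration. The k=50 list is left out because its printed output is cut off.

```python
# Top-20 stemmed words for the 'abortion' topic, copied from the printed output.
top_words = {
    20: ["abort", "women", "law", "can", "doctor", "will", "control", "one",
         "infect", "woman", "may", "medic", "mani", "use", "state", "birth",
         "pregnanc", "right", "pill", "must"],
    30: ["abort", "law", "women", "state", "medic", "doctor", "legal", "right",
         "will", "court", "woman", "hospit", "new", "control", "can", "decis",
         "repeal", "one", "bill", "want"],
    40: ["abort", "women", "law", "doctor", "medic", "hospit", "will", "woman",
         "can", "state", "control", "one", "legal", "right", "repeal", "new",
         "mani", "perform", "clinic", "want"],
}


def jaccard(a, b):
    """Return the share of distinct words that two word lists have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


overlap_30_40 = jaccard(top_words[30], top_words[40])
overlap_20_40 = jaccard(top_words[20], top_words[40])
print(f"k=30 vs k=40: {overlap_30_40:.2f}")
print(f"k=20 vs k=40: {overlap_20_40:.2f}")
```

A higher score only says that two models describe the topic with similar vocabulary; it does not replace a substantive reading of the words.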

In [39]:
%%R -o df_all 
#The above outputs the R variable df_all for use in Python cells below

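#merge theta (document-topic proportions) onto the original dataset by row ID
#this assumes the rows of theta are in the same order as the rows of meta
meta$ID <- seq.int(nrow(meta))
theta <- data.frame(mod.40$theta)
theta$ID <- seq.int(nrow(theta))
df_all <- merge(meta, theta, by="ID")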
df_all['ID'] <- NULL
df_all['X'] <- NULL
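
For readers more familiar with pandas, the same row-ID join can be sketched in Python. This is an illustrative analogue only, not the code used in the analysis: `meta` and `theta` here are small hypothetical tables standing in for the R objects, and the join relies on the same assumption that both tables list documents in the same order.

```python
import pandas as pd

# Hypothetical stand-ins for the R objects `meta` and `mod.40$theta`.
meta = pd.DataFrame({"doc": ["a.txt", "b.txt", "c.txt"],
                     "word_count": [120, 80, 200]})
theta = pd.DataFrame([[0.7, 0.3], [0.1, 0.9], [0.5, 0.5]],
                     columns=["X1", "X2"])

# Give both tables the same 1-based row ID, then join on it.
meta["ID"] = range(1, len(meta) + 1)
theta["ID"] = range(1, len(theta) + 1)

# validate="one_to_one" raises an error if an ID is duplicated on either side.
df_all = meta.merge(theta, on="ID", validate="one_to_one").drop(columns="ID")
print(df_all)
```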

<a id='explore'></a>
## Explore Topics by Organization

Once you decide on a model, you can calculate the prevalence of each topic across groups.
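
The calculation weights each document's topic proportion by its word count, sums those topic word counts within each organization, and divides by the organization's total word count. A minimal sketch of that arithmetic on toy data follows; the values and the names `toy`, `topic_cols`, and `prevalence` are hypothetical and do not come from the corpus.

```python
import pandas as pd

# Toy documents: two topics (X1, X2) whose proportions sum to 1 per document.
toy = pd.DataFrame({
    "org": ["a", "a", "b"],
    "word_count": [100, 300, 200],
    "X1": [0.1, 0.5, 0.25],
    "X2": [0.9, 0.5, 0.75],
})
topic_cols = ["X1", "X2"]

# Words attributed to each topic in each document: proportion * document length.
topic_words = toy[topic_cols].mul(toy["word_count"], axis=0)
topic_words["org"] = toy["org"]

# Sum topic words and total words within each organization.
org_topic_words = topic_words.groupby("org").sum()
org_total_words = toy.groupby("org")["word_count"].sum()

# Share of each organization's words associated with each topic.
prevalence = org_topic_words.div(org_total_words, axis=0)
print(prevalence)
```

Weighting by word count means that long documents contribute more to an organization's topic shares than short ones, which differs from taking a simple mean of the document-level proportions.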

The following code produces the output for Table 4

In [40]:
#Read df_all from R into a Pandas dataframe
import pandas
df = pandas.DataFrame(df_all)
df

Unnamed: 0,doc,city,publication,date,word_count,org,identifier,wave,text_string,X1,...,X31,X32,X33,X34,X35,X36,X37,X38,X39,X40
1,notessecondyear_70.txt,nyc,notessecondyear,1969,553,redstockings,1,2,1 1 1 1 1 10 11 2 2 2 2 3 3 3 4 5 6 7 8 9 A An...,0.000108,...,0.000030,0.000563,0.000168,0.000012,0.000080,1.601908e-07,0.000004,0.000547,4.344782e-04,0.001872
2,chicago.cwlu_womankind.1971.11.06.txt,chicago,cwlu_womankind,1971,890,cwlu,2,2,411 93 Actually Alice American American Any As...,0.000683,...,0.001261,0.072608,0.000659,0.000051,0.003934,2.370884e-05,0.000765,0.000813,8.932289e-05,0.000268
3,nyc.masses_1916.04.21.txt,nyc,masses,1916,425,heterodoxy,3,1,All Anarchist Anarchist And Birth Birth Birth ...,0.000013,...,0.000140,0.000385,0.036311,0.001572,0.000022,7.744428e-03,0.000628,0.001633,2.317906e-03,0.002181
4,nyc.redstockings.1973.mainardi.marriagequestio...,nyc,redstockings,1973,972,redstockings,4,2,1968 1968 50s 60s Although Although American A...,0.000106,...,0.000667,0.461181,0.000885,0.000008,0.000254,1.166657e-05,0.000083,0.023737,1.385348e-03,0.004373
5,chicago.cwlu_womankind.1972.01.01.txt,chicago,cwlu_womankind,1972,39,cwlu,5,2,1972 5 Ghots I January Womankind a bind by cro...,0.000028,...,0.001579,0.000261,0.024764,0.002650,0.012088,6.019012e-01,0.011103,0.000128,3.216998e-03,0.000334
6,notesfirstyear_30.txt,nyc,notesfirstyear,1968,442,redstockings,6,2,12 12 15 1868 1868 1968 28 A AUNT All Anybody ...,0.001169,...,0.014021,0.004732,0.003569,0.001689,0.019514,1.164017e-04,0.011646,0.006362,1.552764e-02,0.009689
7,chicago.cwlu_womankind.1972.05.14.txt,chicago,cwlu_womankind,1972,976,cwlu,7,2,1 1970 2 2 3 4 4 5 6 7 8 A AT Also Also Amer A...,0.000011,...,0.000075,0.000291,0.002064,0.000260,0.000511,3.852390e-04,0.001004,0.000943,4.548404e-04,0.001381
8,nyc.redstockings.1973.sarachild.programforcons...,nyc,redstockings,1973,785,redstockings,8,2,1 2 3 A A A APPENDIX And CONSCIOUSNESSRAISING ...,0.000096,...,0.000022,0.000557,0.000149,0.000010,0.000067,1.310724e-07,0.000004,0.000586,3.692345e-04,0.001492
9,chicago.cwlu_womankind.1972.11.11.txt,chicago,cwlu_womankind,1972,985,cwlu,9,2,1867 1972 A AND ARTICLES Adopt Affiar All Amaz...,0.000189,...,0.049130,0.008029,0.004333,0.000305,0.079622,1.280873e-03,0.000064,0.011200,3.911868e-02,0.011208
10,chicago.cwlu_womankind.1972.03.20.txt,chicago,cwlu_womankind,1972,369,cwlu,10,2,Above CWLU CWLU CWLU Chicago Chicago Discus Li...,0.000399,...,0.000106,0.001322,0.008745,0.000343,0.031367,6.535829e-04,0.001451,0.000028,3.536808e-05,0.001559


In [41]:
#add topic word count column by multiplying the topic weight by the word count for each document

#create column list for use later
col_list = []
for c in range(1,41):
    col = "X"+str(c)
    new_col = "X"+str(c)+"_wc"
    col_list.append(new_col)
    df[new_col] = df[col] * df['word_count']  #vectorized: topic weight * document word count
col_list.append('word_count')
col_list.append('org')

In [42]:
#keep only topic word count columns, the org column, and the document word count column
df_new = df[col_list]

In [43]:
#group the dataframe by organization and calculate the percent of words associated with each topic for each organization
#This output will be used in Table 4

#add total word count by organization and total word count for each topic by organization
grouped = df_new.groupby('org').sum().reset_index()
grouped.set_index('org', inplace=True)

#divide topic word count by total word count, by organization
for c in col_list[:-2]:
    grouped[c] = grouped[c]/grouped['word_count']
del grouped['word_count']
grouped

org,X1_wc,X2_wc,X3_wc,X4_wc,X5_wc,X6_wc,X7_wc,X8_wc,X9_wc,X10_wc,...,X31_wc,X32_wc,X33_wc,X34_wc,X35_wc,X36_wc,X37_wc,X38_wc,X39_wc,X40_wc
cwlu,0.000337,0.035401,0.011933,0.054308,0.019985,0.047512,0.074941,0.001916,0.07986,0.064614,...,0.026419,0.014523,0.026964,0.039432,0.048336,0.050851,0.038626,0.001312,0.00122,0.011161
heterodoxy,9.5e-05,0.000533,0.001758,0.013252,0.010775,0.053107,0.000121,0.222227,5e-06,0.000677,...,0.000439,0.013248,0.007853,0.015818,9e-06,0.001628,0.000262,0.010888,0.002258,0.025866
hullhouse,0.274824,0.000105,0.000102,0.006938,0.000469,4.3e-05,2.9e-05,0.003556,1.5e-05,0.001021,...,0.00135,0.00091,0.000627,0.000137,0.023953,0.001243,9.9e-05,0.001314,0.00037,0.00094
redstockings,0.000384,0.031346,0.040831,0.008958,0.012954,0.042884,0.002226,0.009477,0.002983,0.007501,...,0.059237,0.034534,0.005985,0.015679,0.002656,0.000606,0.001421,0.046479,0.105886,0.045484


In [44]:
########################################################
########################################################
#####rename top 12 topics to match labels in Table 4####
########################################################
########################################################

#Hull House Social Activities = X1
#Public Institutions = X28
#Hull House Practical Activities = X27
#Sanger and Birth Control = X8
#Women's Lives = X26
#Women's Resistance = X21
#Anti-War = X7
#Liberation School = X9
#Women's Sexual Health = X10
#Forms of Resistance = X25
#Movement Theory = X14
#Movement History = X39

#########################################################
#########################################################


grouped.rename(columns={'X1_wc': "Hull House Social Activities", 'X28_wc': 'Public Institutions', 'X27_wc': 'Hull House Practical Activities',
          'X8_wc': 'Sanger and Birth Control', 'X26_wc': "Women's Lives", 'X21_wc': "Women's Resistance",
          'X7_wc': "Anti-War", 'X9_wc': 'Liberation School', 'X10_wc': "Women's Sexual Health", 
           'X25_wc': 'Forms of Resistance', 'X14_wc': "Movement Theory", 'X39_wc': 'Movement History'}, inplace=True)
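
Because `DataFrame.rename` silently skips keys that do not match any column by default, a mistyped topic column in the mapping would go unnoticed. The sketch below shows one way to check a label mapping before applying it; `grouped_toy` and `labels` are hypothetical miniatures, and the `X99_wc` entry is deliberately wrong to show what the check reports.

```python
import pandas as pd

# Hypothetical miniature of the grouped prevalence table.
grouped_toy = pd.DataFrame({"X1_wc": [0.27], "X8_wc": [0.22]}, index=["org_a"])

# Hypothetical label mapping; X99_wc is not a column in grouped_toy.
labels = {"X1_wc": "Topic A", "X8_wc": "Topic B", "X99_wc": "Topic C"}

# Keys in the mapping that match no column would be ignored by rename().
missing = sorted(set(labels) - set(grouped_toy.columns))
print("Unmatched keys:", missing)

renamed = grouped_toy.rename(columns=labels)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is an alternative that stops with a `KeyError` instead of reporting the unmatched keys.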

In [46]:
#Calculate the top 3 most frequent topics for Hull House (Table 4)
#By percent of total words aligned with each topic for Hull House
grouped.loc['hullhouse'].sort_values(ascending=False)[:3]

Public Institutions                0.275350
Hull House Social Activities       0.274824
Hull House Practical Activities    0.178544
Name: hullhouse, dtype: float64
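
The same top-three lookup is repeated for each organization in the cells that follow. As an illustrative alternative, assuming a table shaped like `grouped` (organizations as rows, topics as columns), all organizations can be handled in one pass with `Series.nlargest`. The table `prev` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical prevalence table: rows are organizations, columns are topics.
prev = pd.DataFrame(
    {"T1": [0.40, 0.05], "T2": [0.10, 0.30],
     "T3": [0.30, 0.25], "T4": [0.20, 0.40]},
    index=["org_a", "org_b"],
)

# For each organization, keep the three topics with the largest share.
top_by_org = {org: row.nlargest(3) for org, row in prev.iterrows()}

for org, top in top_by_org.items():
    print(org)
    print(top.to_string())
```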

In [47]:
%%R
#Output top weighted words for top Hull House topics (Table 4)
terms40 <- labelTopics(mod.40, n=20)
cat("Top Words for 'Public Institutions' Topic\n")
print(terms40$prob[28,])
cat("\nTop Words for 'Hull House Social Activities' Topic\n")
print(terms40$prob[1,])
cat("\nTop Words for 'Hull House Practical Activities' Topic\n")
print(terms40$prob[27,])

Top Words for 'Public Institutions' Topic
 [1] "hullhous"     "miss"         "school"       "children"     "hous"        
 [6] "chicago"      "resid"        "year"         "work"         "offic"       
[11] "summer"       "public"       "citi"         "street"       "open"        
[16] "made"         "neighborhood" "investig"     "associ"       "visit"       

Top Words for 'Hull House Social Activities' Topic
 [1] "club"      "year"      "member"    "miss"      "boy"       "social"   
 [7] "mrs"       "hullhous"  "even"      "parti"     "meet"      "room"     
[13] "two"       "given"     "committe"  "present"   "danc"      "entertain"
[19] "mani"      "one"      

Top Words for 'Hull House Practical Activities' Topic
 [1] "hullhous" "play"     "year"     "given"    "greek"    "italian" 
 [7] "lectur"   "build"    "meet"     "organ"    "dramat"   "music"   
[13] "present"  "audienc"  "mani"     "school"   "russian"  "club"    
[19] "one"      "chicago" 


In [48]:
#Calculate the top 3 most frequent topics for Heterodoxy (Table 4)
#By percent of total words aligned with each topic for Heterodoxy
grouped.loc['heterodoxy'].sort_values(ascending=False)[:3]

Sanger and Birth Control    0.222227
Women's Resistance          0.216267
Women's Lives               0.086840
Name: heterodoxy, dtype: float64

In [49]:
%%R
#Output top weighted words for top Heterodoxy topics (Table 4)
cat("Top Words for 'Sanger and Birth Control' Topic\n")
print(terms40$prob[8,])
cat("\nTop Words for 'Women's Resistance' Topic\n")
print(terms40$prob[21,])
cat("\nTop Words for 'Women's Lives' Topic\n")
print(terms40$prob[26,])

Top Words for 'Sanger and Birth Control' Topic
 [1] "sanger"   "one"      "will"     "public"   "birth"    "inform"  
 [7] "can"      "time"     "year"     "new"      "give"     "life"    
[13] "control"  "mrs"      "law"      "book"     "make"     "pamphlet"
[19] "woman"    "case"    

Top Words for 'Women's Resistance' Topic
 [1] "woman"      "women"      "man"        "will"       "suffrag"   
 [6] "men"        "one"        "life"       "great"      "world"     
[11] "sex"        "home"       "suffragett" "like"       "say"       
[16] "vote"       "can"        "new"        "social"     "never"     

Top Words for 'Women's Lives' Topic
 [1] "one"    "love"   "will"   "mother" "life"   "day"    "littl"  "man"   
 [9] "know"   "billi"  "woman"  "work"   "came"   "take"   "well"   "ladi"  
[17] "like"   "mickey" "time"   "hand"  


In [50]:
#Calculate the top 3 most frequent topics for CWLU (Table 4)
#By percent of total words aligned with each topic for CWLU
grouped.loc['cwlu'].sort_values(ascending=False)[:3]

Liberation School        0.079860
Anti-War                 0.074941
Women's Sexual Health    0.064614
Name: cwlu, dtype: float64

In [51]:
%%R
#Output top weighted words for top CWLU topics (Table 4)
cat("Top Words for 'Liberation School' Topic\n")
print(terms40$prob[9,])
cat("\nTop Words for 'Anti-War' Topic\n")
print(terms40$prob[7,])
cat("\nTop Words for 'Women's Sexual Health' Topic\n")
print(terms40$prob[10,])

Top Words for 'Liberation School' Topic
 [1] "women"     "liber"     "work"      "cwlu"      "union"     "chicago"  
 [7] "call"      "offic"     "peopl"     "center"    "chang"     "will"     
[13] "legal"     "can"       "societi"   "womankind" "problem"   "come"     
[19] "togeth"    "abort"    

Top Words for 'Anti-War' Topic
 [1] "vietnam"   "vietnames" "peopl"     "war"       "american"  "south"    
 [7] "north"     "nixon"     "bomb"      "govern"    "will"      "prison"   
[13] "one"       "can"       "peac"      "agreement" "forc"      "militari" 
[19] "viet"      "saigon"   

Top Words for 'Women's Sexual Health' Topic
 [1] "women"      "gonorrhea"  "doctor"     "infect"     "can"       
 [6] "pain"       "treatment"  "drug"       "diseas"     "patient"   
[11] "caus"       "pill"       "bacteria"   "penicillin" "vagina"    
[16] "tube"       "symptom"    "uterus"     "birth"      "examin"    


In [52]:
#Calculate the top 3 most frequent topics for Redstockings (Table 4)
#By percent of total words aligned with each topic for Redstockings
grouped.loc['redstockings'].sort_values(ascending=False)[:3]

Movement History       0.105886
Movement Theory        0.093310
Forms of Resistance    0.085934
Name: redstockings, dtype: float64

In [53]:
%%R
#Output top weighted words for top Redstockings topics (Table 4)
cat("Top Words for 'Movement History' Topic\n")
print(terms40$prob[39,])
cat("\nTop Words for 'Movement Theory' Topic\n")
print(terms40$prob[14,])
cat("\nTop Words for 'Forms of Resistance' Topic\n")
print(terms40$prob[25,])

Top Words for 'Movement History' Topic
 [1] "movement" "women"    "feminist" "histori"  "radic"    "femin"   
 [7] "liber"    "polit"    "new"      "lesbian"  "even"     "one"     
[13] "first"    "time"     "idea"     "origin"   "now"      "year"    
[19] "media"    "write"   

Top Words for 'Movement Theory' Topic
 [1] "radic"             "liber"             "women"            
 [4] "polit"             "feminist"          "attack"           
 [7] "movement"          "group"             "consciousnessrais"
[10] "left"              "issu"              "power"            
[13] "action"            "person"            "peopl"            
[16] "problem"           "psycholog"         "theori"           
[19] "interest"          "oppress"          

Top Words for 'Forms of Resistance' Topic
 [1] "women"    "men"      "liber"    "movement" "male"     "group"   
 [7] "organ"    "struggl"  "revolut"  "oppress"  "work"     "fight"   
[13] "now"      "right"    "chang"    "equal"    "polit"    "mu