# Reproducible Exploration

The code below lets readers explore output from a 40-topic Structural Topic Model.

See [this repository](https://github.com/lknelson/computational-grounded-theory) for information on how these data and the topic model were constructed.

This is an example of what you might do if you want to make your exploratory analysis more reproducible.

In this particular example I assume the reader does not know anything about coding in Python.

## To explore the results:

All you need to do is change the string between the quotation marks in the first line of the code below and then run the cell. 

1. Change the string between the quotation marks to one of the following: "X1", "X2", "X3", ... "X40".  
2. Once you change the string, click on the code cell below. You'll know it's active if you see a blue border to the left of the cell.
3. Run the cell. To do this, click on the "play" button at the top of the screen.
4. You should see the output appear, or change.
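
The valid topic names follow a simple pattern, so a short check can catch a mistyped value before the main cell is run. The following is a minimal sketch; `N_TOPICS`, `valid_topics`, and `is_valid_topic` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helper, not part of the original notebook.
N_TOPICS = 40  # the model described above has 40 topics

# Build the list "X1", "X2", ..., "X40"
valid_topics = ["X" + str(i) for i in range(1, N_TOPICS + 1)]

def is_valid_topic(name):
    """Return True if name is one of the 40 topic column names."""
    return name in valid_topics

print(is_valid_topic("X10"))  # a valid name
print(is_valid_topic("x10"))  # not valid: the X must be upper case
print(is_valid_topic("X41"))  # not valid: there are only 40 topics
```

If the check returns `False`, correct the string before running the cell.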

## How to interpret the output:

You will see three different types of output on your screen:

1. The first output you see on your screen is the 20 highest-weighted words for your chosen topic. These words suggest what the topic is about.

2. The second output is the dataset sorted in descending order based on the topic weight for each document. This sorted dataframe suggests in which publication, city, and wave the topic was most prevalent.
    * Metadata include:
        * document name
        * city in which the document was published
        * name of the publication
        * date of publication 
        * the organization that produced the document
        * the feminist wave in which the document was produced
            * wave = 1 means the document was made in the first wave feminist movement
            * wave = 2 indicates the second wave feminist movement. 
        * The text_string column is the full text sorted alphabetically. Due to copyright restrictions I cannot release the full text. Ideally, a replication repository would be able to do so.
3. The third output is the percent of words aligned with the topic, calculated separately for each organization. This output suggests how prevalent the topic is for each organization.
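
To make the third output concrete, here is a minimal sketch of the same kind of calculation on toy data. The numbers and organization names are invented for illustration and do not come from the real dataset. The sketch assumes that each topic column holds the proportion (0 to 1) of a document's words assigned to that topic, which is what the document-level weights in the output suggest.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the real dataset).
toy = pd.DataFrame({
    "org": ["a", "a", "b"],
    "word_count": [100, 300, 200],
    "X1": [0.50, 0.10, 0.05],  # assumed: share of each document's words in the topic
})

# Estimated number of topic words in each document
toy["topic_words"] = toy["X1"] * toy["word_count"]

# Sum topic words and total words by organization
totals = toy.groupby("org")[["topic_words", "word_count"]].sum()

# Convert to a percent and sort from most to least prevalent
percent = (totals["topic_words"] / totals["word_count"] * 100).sort_values(ascending=False)
print(percent)
```

In this toy example, organization `a` has about 80 topic words out of 400 (roughly 20 percent), and organization `b` has about 10 out of 200 (roughly 5 percent). Note that this is not the same as averaging the document weights, because longer documents count for more.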

In [71]:
topic_number = "X10" #change the string between the quotation marks to explore different topics
                    #valid values: "X1", "X2", "X3", ... "X40"

import pandas #use pandas to structure data
from IPython.display import display, HTML #makes output look pretty

df = pandas.read_csv("data/comparativewomensmovement_dataset_withtopicweights.csv", index_col=0) #read in dataset, including text, metadata, and topic weights
topic_words = pandas.read_csv("data/comparativewomensmovement_topweightedtopicwords.csv", index_col=0) #read in dataset including top weighted words for each topic

#bunch of print statements to make output interpretable
print('==========================================================================================================')
print()
print("Top 20 Weighted Words for Topic %s:" %topic_number)
print("")
print(topic_words[topic_number]) #print top weighted words for chosen topic
print()
print('==========================================================================================================')
print()
print("Text and Metadata sorted by the topic weight for Topic %s:" % topic_number)

#show the 10 documents with the highest weight for the chosen topic
columns = ['doc', 'city', 'publication', 'date', 'org', 'wave', 'text_string', topic_number]
display(df[columns].sort_values(by=topic_number, ascending=False)[:10]) #sort data by topic weight

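#group the dataframe by organization and calculate the percent of words associated with the chosen topic for each organization

#assumption: each topic column holds the proportion (0 to 1) of a document's words assigned to that topic,
#so multiplying by word_count estimates the number of topic words in each document
df_new = df[['org', 'word_count']].copy()
df_new[topic_number] = df[topic_number] * df['word_count']

#add total word count by organization and total topic word count by organization
grouped = df_new.groupby('org').sum()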

#divide topic word count by total word count, by organization, for the chosen topic
grouped[topic_number] = (grouped[topic_number]/grouped['word_count']) * 100

#print results
print("==========================================================================================================")
print()
print("Percent of total words from each organization aligned with Topic %s:" %topic_number)
display(grouped[topic_number].sort_values(ascending=False))


Top 20 Weighted Words for Topic X10:

1          women
2      gonorrhea
3         doctor
4         infect
5            can
6           pain
7      treatment
8           drug
9         diseas
10       patient
11          caus
12          pill
13      bacteria
14    penicillin
15        vagina
16          tube
17       symptom
18        uterus
19         birth
20        examin
Name: X10, dtype: object


Text and Metadata sorted by the topic weight for Topic X10:


| | doc | city | publication | date | org | wave | text_string | X10 |
|---|---|---|---|---|---|---|---|---|
| 329 | chicago.cwlu_womankind.1973.01.10.txt | chicago | cwlu_womankind | 1973 | cwlu | 2 | 10 10 1020 15 1971 2 20 2448 3 40 40 44 60 60 ... | 0.967419 |
| 447 | chicago.cwlu_womankind.1972.12.10.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 100 102 2 20 200240 2030 30 50 80 810 9 99 A A... | 0.946059 |
| 858 | chicago.cwlu_womankind.1973.04.14.txt | chicago | cwlu_womankind | 1973 | cwlu | 2 | 0004 0015 0F 1 1 10 100 12 14 15100000 2 3 4 4... | 0.941231 |
| 205 | chicago.cwlu_womankind.1973.04.13.txt | chicago | cwlu_womankind | 1973 | cwlu | 2 | 1 1 12 1945 1950s 1972 1973 1973 2 2 20 24 24 ... | 0.933901 |
| 641 | chicago.cwlu_womankind.1972.12.09.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 10 12 140000 1960 1ies 20 20 20 2000000 25 25 ... | 0.861972 |
| 106 | chicago.cwlu_womankind.1972.07.06.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 1 1 10 10 10day 12 2 3 4 50 6 90 A A A A A Aft... | 0.820432 |
| 239 | chicago.cwlu_womankind.1972.10.12.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 1 100 2 3 30 4 5 5 A A Advantages Advantages A... | 0.817158 |
| 856 | chicago.cwlu_womankind.1972.07.07.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 1 10 12 2 3 3 4 5 A A A AVC After Another Avoi... | 0.812207 |
| 748 | chicago.cwlu_womankind.1972.10.13.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 1 1 100 100 12 125 14 15 1972 2 20 25 3 3 3 3 ... | 0.783695 |
| 78 | chicago.cwlu_womankind.1972.10.05.txt | chicago | cwlu_womankind | 1972 | cwlu | 2 | 000 05 05 100 100 100 1000 15 160 1970 20 20 2... | 0.780233 |



Percent of total words from each organization aligned with Topic X10:


org
cwlu            6.461351
redstockings    0.750059
hullhouse       0.102120
heterodoxy      0.067733
Name: X10, dtype: float64