# Reproducible Exploration

The code below lets readers explore the output of a 40-topic Structural Topic Model.

See [this repository](https://github.com/lknelson/computational-grounded-theory) for information on how these data and the topic model were constructed.

This is an example of what you might do if you want to make your exploratory analysis more reproducible.

In this particular example I assume the reader does not know anything about coding in Python.

## To explore the results:

All you need to do is change the string between the quotation marks in the first line of the code below and then run the cell. 

1. Change the string between the quotation marks to one of the following: "X1", "X2", "X3", ... "X40".
2. Once you change the string, click on the code cell below. You'll know it's active if you see a blue border to the left of the cell.
3. Run the cell. To do this, click the "play" button at the top of the screen.
4. You should see the output appear, or change.
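
The forty valid names follow one pattern, so they can be generated rather than typed out. The sketch below is only an illustration: `VALID_TOPICS` and `is_valid_topic` are hypothetical names that do not appear in the notebook, and the check assumes topic names are case-sensitive, as pandas column names are.

```python
# Hypothetical helper, not part of the notebook:
# build the 40 valid topic names "X1" ... "X40" listed in step 1.
VALID_TOPICS = [f"X{i}" for i in range(1, 41)]


def is_valid_topic(name):
    """Return True if name is one of the 40 topic names."""
    return name in VALID_TOPICS


print(is_valid_topic("X10"))  # a name inside the range
print(is_valid_topic("X41"))  # a name outside the range
print(is_valid_topic("x10"))  # lowercase "x" is a different string
```

A check like this would reject a lowercase "x10" or a number past 40 before any data is loaded.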

## How to interpret the output:

You will see three different types of output on your screen:

1. The first output is the 20 highest-weighted words for your chosen topic. These words suggest what the topic is about.

2. The second output is the dataset sorted in descending order by each document's topic weight; the ten highest-weighted documents are shown. This sorted dataframe suggests in which publication, city, and wave the topic was most prevalent.
    * Metadata include:
        * document name
        * city in which the document was published
        * name of the publication
        * date of publication 
        * the organization that produced the document
        * the feminist wave in which the document was produced
            * wave = 1 means the document was made in the first wave feminist movement
            * wave = 2 indicates the second wave feminist movement. 
        * The text_string column is the full text sorted alphabetically. Due to copyright restrictions, I cannot release the full text. Ideally, a replication repository would be able to do so.
3. The third output is the percent of words aligned with the topic, calculated separately for each organization. This output suggests the prevalence of the topic for each organization.
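
The arithmetic behind the third output can be sketched in plain Python. This is a toy illustration, not the notebook's code: the organization names and word counts are invented, and `percent_by_org` is a hypothetical helper. It assumes, as the code cell does, that each topic column holds the number of words in a document assigned to that topic.

```python
# Toy records: (organization, total words in document, words assigned to the topic).
# All values are invented for illustration; they are not from the real dataset.
toy_docs = [
    ("Org A", 100, 20),
    ("Org A", 300, 20),
    ("Org B", 200, 10),
]


def percent_by_org(docs):
    """Return {org: percent of that org's total words assigned to the topic}."""
    totals = {}  # org -> [total words, topic words]
    for org, word_count, topic_words in docs:
        sums = totals.setdefault(org, [0, 0])
        sums[0] += word_count
        sums[1] += topic_words
    # Pool the words within each organization first, then divide.
    return {org: topic * 100 / total for org, (total, topic) in totals.items()}


toy_percents = percent_by_org(toy_docs)
print(toy_percents)
```

Because the words are pooled before dividing, longer documents count for more. In the toy data, Org A's two documents are 20% and about 6.7% topic words, yet the pooled figure is 40 out of 400 words, or 10%, rather than the average of the two document-level percentages.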

In [None]:
topic_number = "X10" #change the string between the quotation marks to explore different topics
                    #valid values: "X1", "X2", "X3", ... "X40"

import pandas #use pandas to structure data
from IPython.display import display #renders dataframes as formatted tables

df = pandas.read_csv("A-data/comparativewomensmovement_dataset_wordcount.csv", index_col=0) #read in dataset, including text, metadata, and topic weights
topic_words = pandas.read_csv("A-data/comparativewomensmovement_topweightedtopicwords.csv", index_col=0) #read in dataset including top weighted words for each topic

try:
    df[topic_number] #test to make sure the string entered is valid
    #a bunch of print statements to make output interpretable
    print('==========================================================================================================')
    print()
    print("Top 20 Weighted Words for Topic %s:" %topic_number)
    print("")
    print(topic_words[topic_number]) #print top weighted words for chosen topic
    print()
    print('==========================================================================================================')
    print()
    print("Text and Metadata sorted by the topic weight for Topic %s:" % topic_number)

    #calculate the share of each document's words assigned to the chosen topic
    col_name = topic_number+"_prop"
    df[col_name] = df[topic_number] / df['word_count']
    display(df[['doc', 'city', 'publication', 'date', 'org', 'wave', 'text_string', col_name, topic_number]].sort_values(by=topic_number, ascending=False)[:10]) #sort data by topic weight

    #group the dataframe by organization and calculate the percent of words associated with each topic for each organization

    #sum the chosen topic's word count and the total word count within each organization
    #(only these two numeric columns are summed; text and date columns are left out)
    grouped = df.groupby('org')[[topic_number, 'word_count']].sum()

    #divide topic word count by total word count, by organization, for the chosen topic
    grouped[topic_number] = (grouped[topic_number]/grouped['word_count']) * 100

    #print results
    print("==========================================================================================================")
    print()
    print("Percent of total words from each organization aligned with Topic %s:" %topic_number)
    display(grouped[topic_number].sort_values(ascending=False))
except KeyError: #raised when the string is not a column in the dataset
    print("Oops! Something went wrong. Make sure you entered a valid string between the quotation marks in the first line of code.")
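
The cell above sorts documents by the raw topic weight and shows the per-document proportion next to it. The two orderings can differ, because a long document can carry a large raw weight while devoting only a small share of its words to the topic. A minimal sketch with invented values (the document names and counts are hypothetical, and plain Python lists stand in for the dataframe):

```python
# Toy documents; every value here is invented for illustration.
toy_docs = [
    {"doc": "long_doc", "word_count": 1000, "X10": 50},
    {"doc": "short_doc", "word_count": 100, "X10": 20},
    {"doc": "mid_doc", "word_count": 400, "X10": 40},
]

# Same calculation as the "X10_prop" column in the cell: topic words / total words.
for row in toy_docs:
    row["X10_prop"] = row["X10"] / row["word_count"]

# Order used by the cell: raw topic weight, highest first.
by_weight = [r["doc"] for r in sorted(toy_docs, key=lambda r: r["X10"], reverse=True)]

# Alternative order: share of the document's words, highest first.
by_prop = [r["doc"] for r in sorted(toy_docs, key=lambda r: r["X10_prop"], reverse=True)]

print(by_weight)
print(by_prop)
```

In the notebook itself, sorting by proportion instead would mean passing `col_name` rather than `topic_number` to `sort_values`.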