# Computational Grounded Theory Tutorial
### Author: Campbell Lund
### 6/6/2023
Computational techniques can help social researchers address problems of scale when analyzing large corpora. This tutorial follows Laura K. Nelson's proposed framework for computational grounded theory. It combines her code, found at this [link](https://github.com/lknelson/computational-grounded-theory/tree/master), with the written instructions from her paper: [Computational Grounded Theory: A Methodological Framework](https://journals.sagepub.com/doi/full/10.1177/0049124117729703).

### Table of contents:
1. [Difference of Proportions](#sec1)

## Section 1: Difference of Proportions <a name="sec1"></a>
Difference of proportions analysis calculates, for each word, the difference between its proportional frequency in one text and in another. In this section, we'll read and clean our data, then perform the analysis. Our goal is to identify the words that most distinguish each text, which point to its key themes or characteristics and help us understand the content without a thorough reading.
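
To make the arithmetic concrete, here is a minimal sketch on two made-up sentences that uses only the Python standard library. The toy texts and the helper name `proportions` are illustrative assumptions and are not part of Nelson's code; for simplicity, the sketch also skips stop-word removal.

```python
from collections import Counter


def proportions(text):
    """Return each word's share of the total word count in `text`."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# two made-up "documents" (hypothetical data, for illustration only)
text_a = "women demand equal pay and equal rights"
text_b = "women organize local childcare and local clinics"

prop_a = proportions(text_a)
prop_b = proportions(text_b)

# difference of proportions for every word that appears in either text
vocab = set(prop_a) | set(prop_b)
diff = {word: prop_a.get(word, 0.0) - prop_b.get(word, 0.0) for word in vocab}

# most positive values are distinctive of text_a, most negative of text_b
for word, value in sorted(diff.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:10s} {value:+.3f}")
```

Words that make up the same share of both texts, such as "women", score 0, while "equal" and "local" land at opposite ends of the ranking.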

In [None]:
# import libraries
import pandas
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# read data
df = pandas.read_csv("your_file_here")
df

In [None]:
# organize data based on topic
topic1 = df[df['topicName'] == 'topic1']
topic2 = df[df['topicName'] == 'topic2']
...

In [None]:
# initialize CountVectorizer, removing English stop words
countvec = CountVectorizer(stop_words="english")

In [None]:
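# perform analysis
# CountVectorizer expects one string per document, so join each topic's rows into a single string
# ASSUMPTION: the text lives in a column named 'text'; substitute your own column name
texts = [' '.join(topic1['text'].astype(str)), ' '.join(topic2['text'].astype(str))]
# fit data (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later)
topic1_topic2 = pandas.DataFrame(countvec.fit_transform(texts).toarray(), columns=countvec.get_feature_names_out())
# scale each row by its total word count (stop words excluded) to get proportions
topic1_topic2 = topic1_topic2.div(topic1_topic2.sum(axis=1), axis=0)
# calculate the difference between values in row 0 (topic1) and row 1 (topic2)
topic1_topic2.loc[2] = topic1_topic2.loc[0] - topic1_topic2.loc[1]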

In [None]:
# words with the highest difference of proportions are most distinctive of topic1
# words with the lowest (most negative) difference of proportions are most distinctive of topic2
topic1_topic2.loc[2].sort_values(axis=0, ascending=False)
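
To read both ends of the ranking at once, it can help to pull the top and bottom few words into separate tables. The sketch below uses a small hand-made Series standing in for `topic1_topic2.loc[2]`; the values and the cutoff `n` are made up for illustration.

```python
import pandas

# hypothetical stand-in for topic1_topic2.loc[2] (made-up values)
diff = pandas.Series(
    {"equal": 0.12, "rights": 0.05, "women": 0.0, "clinic": -0.04, "local": -0.10}
)

n = 2  # how many words to show from each end (arbitrary choice)
distinct_topic1 = diff.nlargest(n)   # largest positive differences
distinct_topic2 = diff.nsmallest(n)  # most negative differences

print(distinct_topic1)
print(distinct_topic2)
```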

In [None]:
# repeat the two cells above for each pair of topics you want to compare