<a href="https://colab.research.google.com/github/lizhuofan95/Scaling_Human_Coding/blob/master/ASA_Working_001_20220706.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop: Introducing Machine Learning into Qualitative Research

Coding allows us to see themes and patterns in qualitative data, but it can also get repetitive and dreary, especially with codes that are more informational than interpretative. 

Previous approaches to automating qualitative coding rely heavily on prespecified rules, statistical assumptions, or dictionaries of keywords, often at the price of interpretative adaptability. 

Recent developments in deep learning models of natural languages provide a promising opportunity for qualitative researchers to scale their coding without compromising adaptability. Instead of following prespecified rules, deep learning algorithms are designed to mimic human behavior using human-generated examples. 

After iteratively establishing a codebook and some learning examples by hand-coding a sample of qualitative data, researchers can use deep learning to quickly scale their initial codings to the remaining data, saving time and energy for thinking deeply about the data and the codes. 

This Google Colab notebook walks you through the deep-learning-powered workflow we developed in our own study for analyzing in-depth interview data. 

## Introduction to Python in Google Colab

You are in Google Colab, or "Colaboratory", which allows you to write and execute Python in your browser without having to set up a programming environment on your own device. It also provides access to computational power for deep learning models free of charge for noncommercial uses. 

In this workshop, we use Google Colab and publicly available data to demonstrate the workflow, but you should NOT upload your own human subject data without a plan to protect their confidentiality! 

Now you can click the first sidebar button on the left to access the _Table of contents_. Clicking "cell(s) hidden" will reveal more. You can also expand sections by clicking the small arrows to the left of headers. Try this now with the header "Running Python Code in Google Colab."  

### Running Python Code in Google Colab

You can execute embedded code by clicking on the arrow to the left of the cell. This is how we run programs or use functions. Try this below. 

In [None]:
print("Hello world, I am here to code your qualitative data more efficiently!")

Now run the simple program below. It will ask for your name, and say if it likes it.

In [None]:
name = input('What is your name? ')
if name == 'Avery':
  print('Awesome name!', name, 'is pretty cool.')
elif name == "Zhuofan":
  print('Good name!', name, 'is pretty cool.')
else:
  print(' Well', name, ', your name is good I guess.')

What is your name? Zhuofan
Good name! Zhuofan is pretty cool.


Now that you understand the basics of this interface, the rest of this notebook will walk you through our deep-learning-powered workflow. 

## Overview of the Workflow

(1) Developing Codebook V.1;

(2) Coding Training Data;

(3) **Importing data from ATLAS.ti**;

(4) **Preprocessing**;

(5) **Scaling**;

(6) **Recoding**;

(7) **Reimporting to ATLAS.ti**

(8) Examining Machine-Coded Data and Revising Codebook;

(9) Repeat (2)-(8) with Codebook V.2...

Assuming that you have done an initial coding on a sample of the data, the following Python code does (3)-(7) for you. 

## Part I: Importing data from ATLAS.ti

Many qualitative researchers use QDA software to code their data, so here we start with an Excel spreadsheet that mimics the export function in ATLAS.ti. 

Assuming we code on the level of paragraphs, each row should contain all relevant information about a single paragraph. 

Column-wise, this spreadsheet should include the following columns:

- Interview ID
- Paragraph ID
- Paragraph Content
- Initial Codings

Row-wise, it should include the paragraphs that you have already coded, which serve as training data, as well as some left-out paragraphs.

**\[Insert a screen shot of the Excel export\]**
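
As a rough illustration of the layout described above, the sketch below builds a tiny stand-in for such an export with `pandas`. The column names follow the ones the import code in the next cell expects (`Document`, `Reference`, `Quotation Content`, `Codes`). The interview IDs, speaker labels, text, and code names are made up for illustration, and the tab and newline separators are inferred from how the import code splits these fields.

```python
import pandas as pd

# Hypothetical rows; a real export would come from ATLAS.ti.
# Speaker label and text are separated by a tab, and multiple codes by a newline.
toy_export = pd.DataFrame({
    "Document": ["0001_CASE", "0001_CASE", "0002_CASE"],
    "Reference": ["1:1", "1:2", "2:1"],
    "Quotation Content": [
        "INT:\tCan you tell me about your diagnosis?",
        "SPK:\tI first noticed the symptoms two years ago.",
        "SPK:\tMy mother always told me to see a doctor early.",
    ],
    "Codes": ["#Question", "#IllnessNarrative", "#IllnessNarrative\n#Family"],
})

print(toy_export.shape)  # one row per paragraph, four columns
```

Saving such a frame with `toy_export.to_excel(...)` would give a small file for trying out the import code without touching real interview data.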

### Importing Data from ATLAS.ti

This cell loads a Python class for importing data from ATLAS.ti-generated Excel spreadsheets.

In [None]:
import pandas as pd  # The most commonly used library for data wrangling in Python

class read_ATLAS:
    """Loads data from ATLAS.ti-generated export files in .xls/.xlsx format"""
    
    def __init__(self, path):
        """
        Reads excel files. 

        Args:
            path: file path to an ATLAS export file. It must have one paragraph per row and for each paragraph 
                  include the following columns:

                - Document: unique interview id
                - Reference: unique paragraph id
                - Quotation Content: paragraph content
                - Codes: a list of codes that have been manually assigned to the paragraph
        """
        
        self.data = pd.read_excel(path)[['Document', 'Reference', 'Quotation Content', 'Codes']]
    
    def get_original(self):
        """
        Reads the original version of the data. 

        Returns:
            data: the original, abbreviated version of the data, which we will use for reimporting machine-generated codings back into ATLAS.ti.
        """
        
        data = self.data.copy()  # work on a copy so the loaded spreadsheet is left untouched
        
        # Split the newline-separated code list and strip the leading "#" from each code
        data['Codes'] = data['Codes'].apply(lambda x: [c.strip("#") for c in str(x).split("\n")])
        
        # One-hot encode: one 0/1 column per code
        data['Codes_Frozen'] = data['Codes'].apply(frozenset)
        for code in frozenset.union(*data.Codes_Frozen):
            data[code] = data['Codes_Frozen'].apply(lambda codes: int(code in codes))
        
        return data
    
    def get_training(self, min_length = -1):
        """
        Reads in the original data, but only keeps participants' speeches over a certain length for machine learning, 
        excluding researchers' notes, interviewers' speeches, and other functional text that are not being coded analytically. 

        This function also transforms the list of codes into one-hot encodings, such that the presence of any given code is 
        denoted by an entire column of 0s and 1s. 

        Args:
            min_length: the minimum number of characters that a paragraph must include to be included as valid data. This is to 
                        exclude filler sentences such as "yeah", "no", "I know" etc. Only comes into effect if you specify a 
                        value that is greater than or equal to 1. 

        Returns: 
            data: the abbreviated version of the data, which we will use for machine learning. 
        """
        data = self.data.copy()  # work on a copy so the loaded spreadsheet is left untouched
        
        def split_speaker(raw):
            """Split 'speaker:<tab>text' into (speaker, quote); without a tab there is no speaker."""
            line = str(raw).split("\t")
            if len(line) >= 2:
                # str.strip takes a set of characters, not a regex, so whitespace is listed explicitly
                return line[0].strip(": \t\n"), line[1].strip('\u202c')
            return "", line[0].strip()
        
        parsed = data['Quotation Content'].apply(split_speaker)
        data['Spk'] = parsed.apply(lambda x: x[0])
        data['Quote'] = parsed.apply(lambda x: x[1])
        
        # Split the newline-separated code list and strip the leading "#" from each code
        data['Codes'] = data['Codes'].apply(lambda x: [c.strip("#") for c in str(x).split("\n")])
    
        # One-hot encode: one 0/1 column per code
        data['Codes_Frozen'] = data['Codes'].apply(frozenset)
        for code in frozenset.union(*data.Codes_Frozen):
            data[code] = data['Codes_Frozen'].apply(lambda codes: int(code in codes))
    
        #data = data[data["Spk"].isin(["MSPKR","FSPKR"])].reset_index(drop = True)
        data = data[(data["Spk"] != "") & (data["INT_NEW"] != 1)].reset_index(drop = True)
        
        if min_length >= 0: 
            data = data[data['Quote'].map(len) >= min_length].reset_index(drop = True)
        
        cols = [col for col in data.columns if col not in ['Quotation Content', 'Codes', 'Spk', 'Codes_Frozen']]
    
        return data[cols]

This cell loads the actual Excel file from our GitHub repository and executes the code above to import the data into Python. 

In [None]:
path = "data/July_v3_INT_NEWisquestion.xlsx"  # forward slash so the path also works on Colab (Linux)

data_original = read_ATLAS(path = path).get_original()

data_ML = read_ATLAS(path = path).get_training(min_length = 10)
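
To make the filtering in `get_training` more concrete, here is a minimal standalone sketch of the same idea: split each paragraph into a speaker label and a quote at the tab character, then keep only paragraphs that have a speaker and meet the minimum length. The helper name `parse_paragraph`, the speaker label `SPK`, and the example sentences are assumptions made for illustration and are not part of the workshop data.

```python
def parse_paragraph(raw):
    """Split 'SPEAKER:<tab>text' into (speaker, quote); no tab means no speaker."""
    parts = str(raw).split("\t")
    if len(parts) >= 2:
        return parts[0].strip().rstrip(":"), parts[1].strip()
    return "", parts[0].strip()

# Hypothetical paragraphs: a substantive answer, a filler answer, and a note without a speaker
paragraphs = [
    "SPK:\tI was diagnosed in the spring of that year.",
    "SPK:\tYeah.",
    "Researcher note without a speaker",
]

parsed = [parse_paragraph(p) for p in paragraphs]

min_length = 10  # same role as the min_length argument above
kept = [(spk, quote) for spk, quote in parsed if spk != "" and len(quote) >= min_length]

print(kept)
```

Only the first paragraph survives: the second is shorter than `min_length`, and the third has no speaker label.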

This is what our original data should look like in Python:

In [None]:
data_original

This is what our data should look like in one-hot encoding, which we will use for machine learning:

In [None]:
data_ML
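
If one-hot encoding is new to you, the following pure-Python sketch shows the transformation on three made-up code cells. The function name `one_hot` and the code names are hypothetical; the splitting on newlines and stripping of `#` mirrors what the import code above does.

```python
def one_hot(code_cells):
    """Turn newline-separated '#Code' strings into a table of 0s and 1s."""
    parsed = [[c.strip("#") for c in cell.split("\n")] for cell in code_cells]
    all_codes = sorted({c for codes in parsed for c in codes})
    rows = [[int(c in codes) for c in all_codes] for codes in parsed]
    return all_codes, rows

# Hypothetical "Codes" cells for three paragraphs
code_cells = ["#IllnessNarrative\n#Family", "#Family", "#IllnessNarrative"]

all_codes, rows = one_hot(code_cells)
print(all_codes)
for row in rows:
    print(row)
```

Each code becomes its own column, and each paragraph gets a 1 in the columns of the codes assigned to it and a 0 elsewhere.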

## Preprocessing

This cell loads a function that translates our raw text data into something a computer can understand: numerical vectors. We use Bidirectional Encoder Representations from Transformers, or BERT, a deep learning language model pretrained on enormous volumes of text (for English, 3,300 million words from books and Wikipedia articles) to represent fine-grained sequential and contextual information at all levels of natural language in high-dimensional vectors. The code below loads DistilBERT (`distilbert-base-uncased`), a smaller, distilled variant of BERT. 

Our implementation uses `PyTorch`, a popular deep learning library developed by Facebook, and pretrained BERT models available through the `Transformers` library by another company called Hugging Face. 

In [None]:
!pip install transformers

import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

def text2feature(text, MAX_LENGTH = 512, BATCH_SIZE = 16, device = "cuda"):
    """
    Use pretrained BERT model to vectorize raw text. 
    
    Args:
        text: the text of interest, stored in the "Quote" column of our DataFrame "data_ML". 
        
        MAX_LENGTH: the maximum number of "wordpieces" (usually words but not always) in each paragraph to be used 
                    for representing the paragraph, capped at 512. 
        
        BATCH_SIZE: the maximum number of samples to be used in a single neural network iteration. It is recommended
                    to use a batch size of 32 or 64. We used 16 due to the limitation of our GPU memory. 
        
        device: "cpu" or "cuda". "cuda" is the architecture of the NVIDIA graphics processing unit (GPU) which is
                used to accelerate machine learning. Only use it if you either (1) have an NVIDIA GPU and have installed 
                the CUDA development tools or (2) use cloud computing (e.g. Google Colab). 
        
    Returns: 
        features: an N_row by 768 array in which each row is the 768-dimensional vector representing one paragraph.
    """

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    
    encodings = list(map(lambda t: tokenizer.encode(t, padding=True, truncation=True, max_length = MAX_LENGTH, add_special_tokens=True), text))
    
    # Length of the longest encoded paragraph; shorter ones are padded to this length below
    max_len = max((len(i) for i in encodings), default=0)

    encodings_padded = np.array([i + [0]*(max_len-len(i)) for i in encodings])
    attention_mask = [[float(i > 0) for i in ii] for ii in encodings_padded]
    
    dataset = TensorDataset(torch.tensor(encodings_padded, dtype = torch.int), torch.tensor(attention_mask))
    
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)
    
    model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)
    
    features = []

    with torch.no_grad():
        for step_num, batch_data in enumerate(dataloader):
            token_ids, masks = tuple(t.to(device) for t in batch_data)
            last_hidden_states = model(token_ids, masks)
            # The model produces a 768-dimensional vector for every "wordpiece" in each sequence. The first
            # token of every sequence, denoted by [CLS], is a special classification token whose vector can
            # be used as the aggregate sequence representation for classification tasks. An alternative is
            # to represent the sequence by averaging the vectors of all its wordpieces, which tends to
            # produce similar results.
            features.append(last_hidden_states[0][:,0,:].cpu().detach().numpy())
    features = np.vstack(features)
    
    return features
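
The padding, attention mask, and pooling steps in `text2feature` can be tried out without downloading a model. The sketch below uses `numpy` with made-up token ids and random numbers standing in for the model output, so the vectors themselves are meaningless; only the shapes and the two pooling options ([CLS] vector versus masked average) are the point.

```python
import numpy as np

# Hypothetical token-id sequences of unequal length (0 is used as the padding id here)
encodings = [[101, 7592, 102], [101, 2023, 2003, 2146, 102]]

max_len = max(len(seq) for seq in encodings)
padded = np.array([seq + [0] * (max_len - len(seq)) for seq in encodings])
mask = (padded > 0).astype(float)  # 1 for real tokens, 0 for padding

# Stand-in for the model output: one 768-dimensional vector per token position
rng = np.random.default_rng(0)
hidden = rng.normal(size=(len(encodings), max_len, 768))

# Option 1: take the vector at the first ([CLS]) position of every sequence
cls_vectors = hidden[:, 0, :]

# Option 2: average the vectors of the real (non-padding) tokens only
mean_vectors = (hidden * mask[:, :, None]).sum(axis=1) / mask.sum(axis=1, keepdims=True)

print(padded.shape, cls_vectors.shape, mean_vectors.shape)
```

Either option yields one 768-dimensional row per paragraph, which is the shape the classifier later expects.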

This cell executes the function above and transforms text data into vectors. We should obtain an N × 768 matrix that represents each row using a 768-dimensional vector. 

In [None]:
text = data_ML['Quote'].values.tolist()
features = text2feature(text)
features.shape

Our data look like this in vectors:

In [None]:
features

But before we extend our initial codings to all the remaining data, we want to have a sense of how reliable this would be. We do so by setting aside a small sample of coded data and comparing machine-generated codings against our own codings on this already coded sample. 

To make sure the accuracy of our predictions does not depend on the idiosyncrasies of any particular training/test split, we randomly split our coded cases into training and test data multiple times. We want as much training data as possible, but we also want to reserve about 15%-30% for testing. 

In [None]:
from sklearn.model_selection import ShuffleSplit

def split(all_cases, n_splits = 10, test_size = 0.25):
    """
    Split any coded data into training and test sets. 
    
    Args:
        all_cases: the list of coded cases (here, interview ids) to split. 
        
        n_splits: how many randomly reshuffled splits to generate. The default is 10. 
        
        test_size: how many cases to use as test data. 
                   - If between 0 and 1, represents the proportion of the dataset to include;
                   - If an integer, represents the absolute number of test samples.
                   - The default is 25%. 
    Returns:
        train_set, test_set
    
    """
    train_test_split = ShuffleSplit(n_splits = n_splits, test_size = test_size)
    train_set = []
    test_set = []
    for train_index, test_index in train_test_split.split(all_cases):
        train_set.append(train_index.tolist())
        test_set.append(test_index.tolist())
    
    return train_set, test_set
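
To see what repeated shuffle-splitting produces, here is a simplified pure-Python stand-in for the idea. It is not scikit-learn's `ShuffleSplit` (for example, it rounds fractional test sizes differently), and the function name `shuffle_split` and the `seed` argument are assumptions for illustration. Like the function above, it returns positions in the list of cases, not the case ids themselves.

```python
import random

def shuffle_split(n_cases, n_splits, test_size, seed=0):
    """Return n_splits (train_positions, test_positions) pairs from random reshuffles."""
    rng = random.Random(seed)
    # An integer is an absolute number of test cases; a fraction is a share of all cases
    n_test = test_size if isinstance(test_size, int) else round(n_cases * test_size)
    splits = []
    for _ in range(n_splits):
        order = list(range(n_cases))
        rng.shuffle(order)
        splits.append((sorted(order[n_test:]), sorted(order[:n_test])))
    return splits

# 14 interviews, 3 reshuffles, 4 interviews held out for testing each time
splits = shuffle_split(n_cases=14, n_splits=3, test_size=4)
for train, test in splits:
    print("train:", train, "test:", test)
```

Every reshuffle holds out a different set of four interviews, and no interview ever appears in both the training and the test side of the same split.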

This cell specifies the number of random shuffle-splits and the number of interviews to reserve for testing:

In [None]:
n_splits = 20
test_size = 4

This cell specifies a list of interviews to use for training/testing (from the "Document" column) and a list of codes to scale (from the "Codes" column):

In [None]:
all_cases = ['7069_CASE', 
             '4039_CASE', 
             '7061_CASE', 
             '7522_CASE', 
             '7068_CASE', 
             '7044_CASE', 
             '7049_CASE', 
             '10018_CASE', 
             '10002_CASE', 
             '10021_CASE', 
             '7035_CASE',
             '7028_CASE',
             '7512_CASE',
             '7031_CASE'
            ]

eval_codes = ['IllnessNarrative', 'SOURCESofCULTURE_FamilyOrCommunity', 'SOURCESofCULTURE_Medicine', 'HealthcareSystemsIssues']

This cell executes the shuffle-split and generates the training and testing datasets:

In [None]:
train_set, test_set = split(all_cases, n_splits, test_size)

## Scaling

BERT is typically deployed with much larger datasets and far more sophisticated deep learning models than our task requires. Here we simplify by feeding the BERT vectors into a regularized logistic regression, which, despite its strong assumption of linearity, has proven highly effective for smaller-scale data. 
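
The classifier defined in the next cell sets `class_weight = 'balanced'`. In scikit-learn, this option weights each class by `n_samples / (n_classes * count_of_that_class)`, so that a rare code is not drowned out by the many paragraphs without it. The sketch below reproduces that formula in plain Python; the function name and the 10-versus-90 label split are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical code applied to 10 out of 100 paragraphs
labels = [1] * 10 + [0] * 90

weights = balanced_class_weights(labels)
print(weights)
```

Here each coded paragraph counts nine times as much as an uncoded one during training, which is the kind of rebalancing that matters when a code is applied sparingly.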

In [None]:
from typing import Tuple

from numpy.typing import NDArray  # needed for the type hints below
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, cohen_kappa_score

class Classifier:
    def __init__(self):
        
        # L2-regularized logistic regression; 'balanced' reweights classes to offset rare codes
        self.clf = LogisticRegression(penalty = 'l2', solver = 'liblinear', C = 1, class_weight= 'balanced')
        
    def train(self, features: NDArray, labels: NDArray):

        self.clf.fit(features, labels)
    
    def predict(self, features: NDArray) -> Tuple[NDArray, NDArray]:

        predictions = self.clf.predict(features)
        predprob = self.clf.predict_proba(features)
    
        return predictions, predprob
    
    def code(self, train_features, test_features, train_labels):
        
        self.train(train_features, train_labels)
        self.predicted_labels, self.predicted_probabilities = self.predict(test_features)
        
        return self.predicted_labels, self.predicted_probabilities
    
    def metrics(self, test_labels):
        
        accuracy = accuracy_score(test_labels, self.predicted_labels)
        f1 = f1_score(test_labels, self.predicted_labels, pos_label=1)
        precision = precision_score(test_labels, self.predicted_labels)
        recall = recall_score(test_labels, self.predicted_labels)
        kappa = cohen_kappa_score(test_labels, self.predicted_labels)
        
        return [accuracy, f1, precision, recall, kappa]

def classify(classifier, data, eval_codes, train_set, test_set, method = 'pred'):
    """
    Train a machine learning model to predict human codings of interest based on each training/test split.
    
    Args:
        classifier: the machine learning model. 
        
        eval_codes: the list of codes to be scaled, corresponding to the "Codes" column from the ATLAS.ti export file.
        
        train_set/test_set: the ids of interviews to use as training/test data. 
        
        method: "pred" or "eval". 
                
                - The "pred" method predicts codings on uncoded data without returning metrics. 
                - The "eval" method predicts codings on coded data and compare the predictions against original human codings. 
        
    Returns:
        predictions: the predicted probability of a code on a paragraph.
        
        metrics: a list of performance metrics the "eval" method generates. 
    
    """
    
    # Collect one row per (code, split) and build the DataFrames once at the end;
    # DataFrame.append was removed in pandas 2.0.
    metric_columns = ['code', 'test_set', 'n', 'accuracy', 'f1', 'precision', 'recall', 'kappa']
    prediction_rows = []
    metric_rows = []
    
    for code in eval_codes:
        for train_ids, test_ids in zip(train_set, test_set):

            # Note: `all_cases` and `features` are the global variables defined in earlier cells.
            train_docs = [all_cases[i] for i in train_ids]
            test_docs = [all_cases[i] for i in test_ids]

            # Select test paragraphs by their own ids rather than as the complement of the training set
            train_data = data[data['Document'].isin(train_docs)]
            test_data = data[data['Document'].isin(test_docs)]

            train_features = features[train_data.index.tolist()]
            test_features = features[test_data.index.tolist()]

            train_labels = train_data[code].values.tolist()
            
            predicted_labels, predicted_probabilities = classifier.code(train_features, test_features, train_labels)

            prediction_rows.append({'code': code, 'test_set': test_ids, 'p': [x[1] for x in predicted_probabilities]})

            if method == 'eval':
                test_labels = test_data[code].values.tolist()
                # 'n' is the number of training paragraphs that carry this code
                metric_rows.append(dict(zip(metric_columns, [code, test_ids, sum(train_labels)] + classifier.metrics(test_labels))))
    
    predictions = pd.DataFrame(prediction_rows, columns = ['code', 'test_set', 'p'])
    metrics = pd.DataFrame(metric_rows, columns = metric_columns)
            
    return predictions, metrics

This cell executes the machine learning model and evaluates the results against test data: 

In [None]:
predictions, metrics = classify(Classifier(), data_ML, eval_codes, train_set, test_set, 'eval')

In [None]:
pd.DataFrame({'N': data_ML[eval_codes].sum()}).join(metrics.groupby('code')[['n', 'accuracy', 'f1', 'precision', 'recall', 'kappa']].mean())

In [None]:
metrics.sort_values(['recall', 'f1'], ascending = False).groupby('code').nth(0)
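
For readers who want to see what the reported metrics measure, this pure-Python sketch computes accuracy, precision, recall, F1, and Cohen's kappa by hand from two lists of 0/1 codings. The function name `binary_metrics` and the ten example labels are made up; returning 0.0 in the degenerate cases is a simplification chosen here, not necessarily what scikit-learn does.

```python
def binary_metrics(human, machine):
    """Compare machine codings against human codings (both lists of 0/1)."""
    pairs = list(zip(human, machine))
    tp = sum(h == 1 and m == 1 for h, m in pairs)  # both applied the code
    tn = sum(h == 0 and m == 0 for h, m in pairs)  # neither applied the code
    fp = sum(h == 0 and m == 1 for h, m in pairs)  # machine only
    fn = sum(h == 1 and m == 0 for h, m in pairs)  # human only
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1, "kappa": kappa}

# Hypothetical codings of ten paragraphs
human = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
machine = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

m = binary_metrics(human, machine)
print(m)
```

In this toy case accuracy is 0.8 while kappa is only about 0.52, which shows why accuracy alone can flatter a classifier when a code is rare.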

In [None]:
train_set = [[0, 1, 2, 3, 4, 5, 6, 7]]
test_set = [[8, 9, 10]]

predictions, _ = classify(Classifier(), data_ML, eval_codes, train_set, test_set)

## Recoding

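Now that we have cast a wide net that presumably catches about 70% of all paragraphs that would have been coded as "Educational Background", we want to quickly review the results and filter out obvious false positives.

This cell specifies which interviews to recode, how the codes are grouped into output spreadsheets, and the probability threshold below which predictions are pruned.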

In [None]:
recode_set = [8, 9, 10]
combine_codes = [0, 1, 2, 3]

sorting = [False, False, False, False]
sorting_vars = [['IllnessNarrative'], ['HealthcareSystemsIssues']]

pruning = [True, True, True, True]
pruning_thresholds = [0.5, 0.5, 0.5, 0.5]
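
To illustrate what pruning with a threshold means, the sketch below keeps only the paragraphs whose predicted probability reaches the threshold for at least one code, so that a reviewer sees fewer rows. The references and probabilities are invented, and this is a simplified view of the idea, not the exact logic of the function in the next cell.

```python
import pandas as pd

# Hypothetical predicted probabilities for four paragraphs and two codes
toy_predictions = pd.DataFrame({
    "Reference": ["8:1", "8:2", "8:3", "8:4"],
    "P_IllnessNarrative": [0.91, 0.12, 0.48, 0.05],
    "P_HealthcareSystemsIssues": [0.20, 0.77, 0.30, 0.10],
})

threshold = 0.5
prob_cols = [c for c in toy_predictions.columns if c.startswith("P_")]

# Keep a paragraph if any of its predicted probabilities reaches the threshold
keep = (toy_predictions[prob_cols] >= threshold).any(axis=1)
to_review = toy_predictions[keep]

print(to_review)
```

Lowering the threshold widens the net (more rows to review, fewer missed paragraphs); raising it does the opposite.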

In [None]:
import os

def gen_recode(data, predictions, all_cases, recode_set, combine_codes, sorting, pruning, sorting_vars = None, pruning_threshold = None):
    
    recode = data[data['Document'].isin([all_cases[id] for id in recode_set])][['Quote', 'Document', 'Reference']].reset_index(drop = True)
    
    # Note: `eval_codes` is the global list of codes defined in an earlier cell.
    for code in eval_codes:
        #recode_set = best.loc[code]['test_set']
        p = predictions.loc[(predictions['code'] == code) & (predictions['test_set'].apply(lambda x: x==recode_set))]['p'].values.tolist()[0]
        col = pd.DataFrame({str(combine_codes[eval_codes.index(code)]) + '_P_' + code :p})
        recode = recode.join(col)
    
    os.makedirs("recoding", exist_ok = True)  # output folder for the recoding spreadsheets
    
    for doc_id in range(max(combine_codes)+1):
        out = recode[['Document', 'Reference']].copy()
        name = "Pred_"

        rowids_to_keep = []

        for col in (col for col in recode if col.startswith(str(doc_id) + '_')):

            out = out.join(recode[[col]])

            colname = col.split('_',1)[1]        # e.g. "P_IllnessNarrative"
            codename = colname.split('_',1)[1]   # e.g. "IllnessNarrative"

            out.rename(columns={col: colname}, inplace=True)

            # Blank column for the human recoder to fill in; the predicted probability stays in `colname`
            out[codename] = ''

            name = name + colname + "_"

            if pruning[doc_id] == True:
                rowids_to_keep.append(out[colname].index[out[colname] >= pruning_threshold[doc_id]].tolist() )

        #if sorting[doc_id] == True:
        #    out.sort_values(by = ["P_" + v for v in sorting_vars[doc_id]], inplace = True, ascending = False)

        out = out.join(recode[['Quote']])

        if pruning[doc_id] == True:
            # Keep only rows whose probability reached the threshold for at least one code
            out = out.iloc[sorted(set(item for sublist in rowids_to_keep for item in sublist))]
        
        out.to_excel(os.path.join("recoding", name + ".xlsx"))

In [None]:
gen_recode(data_ML, predictions, all_cases, recode_set, combine_codes, sorting, pruning, sorting_vars, pruning_thresholds)

In [None]:
recoded_set = ["IllnessNarrative", "SOURCESofCULTURE_FamilyOrCommunity"]

recoded = pd.read_excel("recoding/Recoded_Pred_P_IllnessNarrative_P_SOURCESofCULTURE_FamilyOrCommunity_P_SOURCESofCULTURE_Medicine_.xlsx")

In [None]:
import krippendorff  # Krippendorff's alpha; if missing, install it first with: !pip install krippendorff

def recoding_metrics(data, recoded, recoded_set):
    # Note: `all_cases` and `recode_set` are the global variables defined in earlier cells.
    rows = []
    
    for code in recoded_set:
        # Paragraphs left blank by the human recoder fall back to the machine-predicted probability
        recoded.loc[recoded[code].isna(), code] = recoded.loc[recoded[code].isna(), "P_"+code]
        
        predicted_labels = data.loc[(data['Document'].isin([all_cases[id] for id in recode_set])), code].values.tolist()
        test_labels = [(x>=0.5)*1 for x in recoded[code].values.tolist()]
    
        accuracy = accuracy_score(test_labels, predicted_labels)
        f1 = f1_score(test_labels, predicted_labels, pos_label=1)
        precision = precision_score(test_labels, predicted_labels)
        recall = recall_score(test_labels, predicted_labels)
        alpha = krippendorff.alpha(np.stack((test_labels, predicted_labels)))
        kappa = cohen_kappa_score(test_labels, predicted_labels)
        
        rows.append({'code': code, 'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall, 
                     'alpha': alpha, 'kappa': kappa})
    
    # DataFrame.append was removed in pandas 2.0, so the table is built once from the collected rows
    return pd.DataFrame(rows, columns = ['code', 'accuracy', 'f1', 'precision', 'recall', 'alpha', 'kappa'])

In [None]:
metrics_x = recoding_metrics(data_ML, recoded, recoded_set)

metrics_x.to_excel("output/Recoding_metrics_all.xlsx")
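
The recoding step appears to rely on a simple convention: the reviewer fills in the code column only where they want to overrule the machine, and any cell left blank falls back to the predicted probability, which is then cut at 0.5. The sketch below shows that convention on four invented rows.

```python
import pandas as pd

nan = float("nan")

# Hypothetical recoding sheet: probabilities from the model, plus the reviewer's entries
recoded_toy = pd.DataFrame({
    "P_IllnessNarrative": [0.91, 0.62, 0.30, 0.55],
    "IllnessNarrative": [nan, 0, nan, 1],  # blank = reviewer left the machine's call in place
})

# Blank cells fall back to the predicted probability, then everything is cut at 0.5
final = recoded_toy["IllnessNarrative"].fillna(recoded_toy["P_IllnessNarrative"])
final_labels = (final >= 0.5).astype(int).tolist()

print(final_labels)
```

The second row shows the reviewer rejecting a machine-suggested code (0.62 overruled by 0), while the first and third rows simply inherit the machine's decision.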

## Reimporting to ATLAS.ti

In [None]:
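def reimport2ATLAS(full_data, recoded, recoded_set, recoded_cases):
    for code in recoded_set:
        # Paragraphs left blank by the human recoder fall back to the machine-predicted probability
        recoded.loc[recoded[code].isna(), code] = recoded.loc[recoded[code].isna(), "P_"+code]
        recoded[code] = recoded[code].apply(lambda x: x >= 0.5).astype(int)
        # The merge result was previously discarded. We assume the intent is to overwrite the original
        # coding with the recoded value wherever a paragraph was recoded (this relies on each
        # Document/Reference pair appearing only once in `recoded`).
        merged = full_data[['Document', 'Reference']].merge(recoded[['Document', 'Reference', code]], on = ['Document', 'Reference'], how = 'left')
        full_data[code] = merged[code].fillna(full_data[code]).values
    
    code_list = [col for col in full_data.columns.tolist() if col not in ['Document', 'Reference', 'Quote', 'Quotation Content', 'Codes', 'Codes_Frozen','nan']]
    
    # Append each assigned code to the paragraph text as a "#code" tag
    for i in range(len(full_data)):
        if full_data.loc[i,"Document"] in recoded_cases:
            for code in code_list:
                if(full_data.loc[i, code] >= 0.5): full_data.loc[i,'Quotation Content'] = full_data.loc[i,'Quotation Content'] + ' ' + '#' + code
        
    # Reshape to one row per interview and one column per paragraph (p0, p1, ...)
    full_data['idx'] = full_data.groupby('Document').cumcount()
    full_data['p_idx'] = 'p' + full_data['idx'].astype(str)
    full_pivot = full_data.pivot(index='Document',columns='p_idx', values = 'Quotation Content')
    full_pivot = full_pivot.reindex(sorted(full_pivot.columns, key=lambda x: float(x[1:])), axis=1)
    full_pivot.to_excel("output/Recoded_output_transformed_all_cases_all_codes.xlsx")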

In [None]:
reimport2ATLAS(data_original, recoded, recoded_set, all_cases)
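
The last step of `reimport2ATLAS` reshapes the data from one row per paragraph to one row per interview. The sketch below replays that reshaping on five invented paragraphs from two made-up interviews, `A` and `B`, so you can see what the exported spreadsheet looks like.

```python
import pandas as pd

# Hypothetical long-format data: one row per paragraph, with "#code" tags already appended
long_toy = pd.DataFrame({
    "Document": ["A", "A", "B", "A", "B"],
    "Quotation Content": ["a0", "a1 #Code", "b0", "a2", "b1"],
})

# Number the paragraphs within each interview: p0, p1, p2, ...
long_toy["idx"] = long_toy.groupby("Document").cumcount()
long_toy["p_idx"] = "p" + long_toy["idx"].astype(str)

# One row per interview, one column per paragraph position
wide = long_toy.pivot(index="Document", columns="p_idx", values="Quotation Content")
wide = wide.reindex(sorted(wide.columns, key=lambda c: int(c[1:])), axis=1)

print(wide)
```

Interviews with fewer paragraphs get empty cells in the trailing columns, and sorting the columns by their numeric part keeps `p10` from landing before `p2`.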