# Workshop: Introduction to Machine Learning for Qualitative Research

Thanks for checking out this interactive Python notebook! It runs in Google Colab, which lets us include bits of software alongside the text. The notebook walks you through the deep-learning-powered workflow we developed to help analyze large volumes of ethnographic data.

You can read about our team [here] and access our code repo [here]. Some more resources for integrating qualitative research and computation are available [here](https://cmabramson.com/resources).


## Introduction to Python in Google Colab

You are in Google Colaboratory (or "Colab"). Colab is free for noncommercial use. It allows you to write and execute Python code in your web browser without having to set up a programming environment on your computer. Colab also has tools for annotation, so we set it up like an interactive wiki.

You can **click the sidebar button** on the left to show or hide the _table of contents_. Clicking "**cell(s) hidden**" will reveal more. You can also expand sections by clicking the small arrows to the left of headers. Clicking the arrow again will collapse the section.

Try this now with the header "Practice Running Code."



### Practice Running Code 

You can execute embedded code by clicking on the arrow to the left of the cell. This is how we run programs or use functions. Try this below. 

In [None]:
print("Hello world, I am here to code your qualitative data more efficiently!")

Hello world, I am here to code your qualitative data more efficiently!


Now run the simple program below. It will ask for your name, and say if it likes it.

In [None]:
name = input('What is your name? ')
if name == 'Avery':
  print('Awesome name!', name, 'is pretty cool.')
elif name == "Zhuofan":
  print('Good name!', name, 'is pretty cool.')
elif name == "Dan":
  print('Good name!', name, 'is pretty cool.')
elif name == "Corey":
  print('Meh!')
else:
  print(' Well', name, ', your name is good I guess.')

What is your name? Corey
Meh!


Now that you understand the basics of this interface, the rest of this notebook will walk you through our deep-learning-powered workflow [new blog link]. You can expand the background section below to read more if you like!

## Background: Coding Qualitative Interviews and Fieldnotes with HHMLA

### What is Coding?

Coding is like adding hashtags (#'s) to the text in your qualitative data. In qualitative research, this usually involves a researcher saying that a paragraph of fieldnotes, an interview response, or a document is an example of *something*, like [#talk_of_morality](https://link.springer.com/article/10.1007/s11133-010-9175-8). Coding allows us to see themes and patterns in a study. It also helps retrieve key text when writing, allows more complex analyses or comparisons, and can help produce interesting visualizations. You can read about what coding is and is not [here](https://cmabramson.com/resources/f/qualitative-coding-simplified).




### Why Machine Learning?

One of the problems with coding data is that it takes a long time, but still requires flexibility to identify new patterns in data. Contemporary natural language processing (NLP) techniques use machine learning to speed up the process. The basic idea is to get a computer to help accurately classify paragraphs of text by [learning from human coders](https://cmabramson.com/resources/f/using-machine-learning-with-ethnographic-interviews) familiar with the data. You can read about technical details [here](https://osf.io/preprints/socarxiv/gpr4n/).

Over the course of our work with the [Medical Cultures Lab](https://www.cultureofmedicine.org/research/metholdology) at UCSF, we developed an approach called HHMLA (short for Hybrid Human-Machine Learning Approach) that combines traditional human coding and NLP to efficiently index a lot of qualitative data with high accuracy. It also identifies patterns human coders miss. Importantly, HHMLA maintains the flexibility and iteration that lead us to use qualitative research in the first place. As a bonus, thanks to advances in AI, it no longer requires a huge volume of data. You can read more about it [here](https://cmabramson.com/resources/f/using-machine-learning-with-ethnographic-interviews).

*NOTE: Machine learning is sometimes used as a way to bypass an interpretive human reading. Our goal here is to show how the process of human coding can be extended efficiently, not replaced. This reflects our epistemic commitments, but the workflow may work well for other approaches too.*

### How is this Different From Using a Dictionary Based Approach to Text Analysis?

Previous approaches to automating qualitative coding rely heavily on prespecified rules, statistical assumptions, or dictionaries of keywords, often at the price of interpretative adaptability. 

Recent developments in deep learning models of natural languages provide a promising opportunity for qualitative researchers to scale their coding without compromising adaptability. Instead of following prespecified rules, deep learning algorithms are designed to mimic human behavior using human-generated examples. 

For instance, a [dictionary based approach may miss 'kick the bucket'] when looking for talk of death. An approach using machine learning could learn this idiom from human coding.

After iteratively establishing a codebook and some learning examples by hand-coding a sample of qualitative data, researchers can use deep learning to quickly scale their initial codings to the remaining data, saving time and energy for thinking deeply about the data and the codes.
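To make the contrast concrete, here is a minimal sketch of a dictionary-based coder applied to the 'kick the bucket' example. The keyword list, the example sentences, and the helper `dictionary_code` are all invented for illustration and are not part of our workflow.

```python
import re

# Hypothetical keyword dictionary for a "talk of death" code (illustrative only).
DEATH_KEYWORDS = {"death", "die", "died", "dying", "passed away"}

def dictionary_code(paragraph, keywords=DEATH_KEYWORDS):
    """Return True if any keyword appears as a whole word or phrase."""
    text = paragraph.lower()
    return any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords)

examples = [
    "My father died the winter after he retired.",
    "He always joked he would kick the bucket before the mortgage was paid.",
]

for paragraph in examples:
    print(dictionary_code(paragraph), "-", paragraph)
```

The second paragraph is missed because the idiom is not on the list. A human coder would likely tag both, and a model trained on such codings has a chance of picking up the idiom without anyone adding it to a dictionary.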


### A Machine Learning Glossary 

- Machine learning: the ability of a computer algorithm to build models based on sample data and make predictions and decisions on unseen data without being explicitly programmed to do so. An example is Google Search's autocomplete: when you type something in the search bar, it will predict the next word you are going to type and make suggestions.

- Training data: a dataset of examples used by machine learning algorithms to learn patterns and build predictive models. 

- Test data: a dataset of examples used to provide an unbiased evaluation of how well trained models can make predictions. 

- Classification tasks: the task of assigning class labels to input examples. An example is spam detection: many email services classify incoming emails as spam or non-spam based on the content of the email.

- Recall: the share of all relevant cases that a model picks up in a classification task. Suppose you are in a restaurant: a waiter who misses three of the four items you ordered has a recall of 0.25, and of course you want a waiter with higher recall!

- Precision: the share of the cases a model picks up that are actually relevant in a classification task. Suppose you are in a restaurant and your waiter brings you four items, but three of them are items that you didn't order. This waiter has a precision of 0.25, and of course you want a waiter with higher precision!

- F-1: a single score that summarizes the overall performance of a machine learning model in classification tasks, computed as the harmonic mean of precision and recall.
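The three metrics above can be computed by hand from a list of human codings and a list of machine codings. The sketch below uses invented labels (1 means the code applies) and a hypothetical helper, `precision_recall_f1`; later in the notebook we rely on scikit-learn for the same quantities.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for binary labels (1 = code applies)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but irrelevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Invented example: 8 paragraphs, 4 of which a human coded as relevant.
human   = [1, 1, 1, 1, 0, 0, 0, 0]
machine = [1, 1, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(human, machine)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Here the machine finds three of the four relevant paragraphs (recall 0.75) but also flags two irrelevant ones (precision 0.60); F-1 sits between the two.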

## Overview of Workflow






*NOTES*

*   *This notebook will walk through a workflow for using machine learning to code in-depth interview data.*
*   *In our real-world application, we used qualitative data and codings from the Patient Deliberation Study described [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4730903/).*
*   *For the demo, we only use publicly available information from the oral history archive [here](https://ethw.org/Oral-History:List_of_all_Oral_Histories).*
*   *Uploading confidential human subjects data to Google Colab may create ethical and/or institutional issues, so we are not recommending its use in this capacity. Python can be configured to run in secure data environments.*
*   *The appendix at the end of the notebook has code that illustrates how we scraped the webpage.*



Our workflow here presupposes that you have some qualitative data and that it is coded in a consistent way. Although a simplified explanation of coding is provided [here](https://cmabramson.com/resources/f/qualitative-coding-simplified), collecting data, developing a codebook, and applying codes consistently to that data is a task beyond the scope of this tutorial. So we start at the point in a project where:
1. We have some relatively stable codes.
2. We have applied those codes to qualitative data (e.g. quotes from interviews).
3. We still have a bunch of text (maybe a few dozen or a few hundred interviews, or a year's worth of field notes) that needs to get coded.

At this point we need to do the following:
1. Export the coded data in a table.
2. Import the coded data into Python.
3. Train the AI to scale the codes to the remaining data.
4. Evaluate, tune, and possibly recode for better accuracy.
5. Export the data.
6. (Optionally) Import [link] the data into a QDA program like ATLAS.ti for more analytical options.
7. Make sense of the data and write up our findings.

We will cover steps 2-5.

Figure 1 provides a visualization. Yes, it is an attempt at a vaporwave aesthetic.
[link]

## Part I: Preparing Data 
Data from in-depth interviews should be formatted as follows:
1. Each row should be a unit of text (usually a paragraph or question response)
2. That text should be included in a column labeled 'Quotation Content'.
3. Another column should be labeled 'Codes' and have codes listed in [].
4. Additional columns can be used to identify which interview the text is part of and where in the interview it occurs (our import function expects these to be named 'Document' and 'Reference'). This allows reconstruction in other software.

This will look something like this:
[link]
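As a rough sketch of that layout, the table below is built directly in pandas. The rows are invented; the column names follow the ones our import function `read_ATLAS` reads, and separating multiple codes with a newline is an assumption based on how that function splits the 'Codes' cell, so check it against your own export.

```python
import pandas as pd

# Invented rows; one unit of text per row.
coded = pd.DataFrame({
    "Document": ["Interview_01", "Interview_01", "Interview_02"],   # which interview
    "Reference": ["1", "2", "1"],                                    # position within the interview
    "Quotation Content": [
        "I studied electrical engineering before the war.",
        "Okay.",
        "My first job was at a small radio laboratory.",
    ],
    # Assumption: several codes on one paragraph are separated by a newline.
    "Codes": ["Background", "", "Background\nCareer"],
})

print(coded)
```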

### Exporting Data from ATLAS.ti
We have a tutorial on how to do this using ATLAS.ti, the QDA software our team uses, [here](https://cmabramson.com/resources/f/sub-setting-qualitative-data-for-machine-learning), but you can export a similar table using other programs or compile text using Python.

### Importing Data into Python



After we export the interview data to an Excel file, we then tell the program where to find it. 

We can also specify the minimum length of each paragraph (in characters), meaning any paragraph shorter than this minimum will be ignored. We set the number to 50 characters, so that filler paragraphs such as "Yep", "Okay" and "I see" will not be used as training data.

In [None]:
path = "https://github.com/lizhuofan95/ASA2022_Workshop/blob/main/data/oralhistory_coded_short.xlsx?raw=true"

min_length = 50

The cell below loads the actual Excel file, which is based on the oral history project data and hosted on our GitHub repository. It imports the formatted data into Python.

*NOTE: The hash symbols (#) inside code cells denote comments that will not run when you run the whole cell.*

*Sometimes we use hash symbols to "comment out" code that we don't need for the time being but can be later "uncommented" and run in other circumstances.*

*Lines enclosed by triple quotation marks are docstrings that many developers use to describe in detail what the function does. These also will not run when you run the cell.*
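A tiny illustration of these conventions, using a made-up function (`count_words` is not part of our workflow):

```python
def count_words(paragraph):
    """
    Return the number of whitespace-separated words in a paragraph.

    This docstring describes the function; it is not run as code.
    """
    # This comment is ignored when the cell runs.
    words = paragraph.split()
    # print(words)  # "Commented out": delete the leading # to run this line.
    return len(words)

print(count_words("Coding is like adding hashtags to text."))
```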

In [None]:
import pandas as pd  # The most commonly used library for data wrangling in Python

def read_ATLAS(path, min_length = -1):
    """
      Loads ATLAS.ti-generated export files in .xlsx format (the openpyxl engine used below does not read legacy .xls files), 
      creates a copy that includes only paragraphs of at least a minimum length for machine learning, and transforms the 
      "Codes" column to one-hot encoding. 

      Args:
          path: file path to an ATLAS export file. It must have one paragraph per row and for each paragraph 
                include the following columns:

                    - Document: unique interview id
                    - Reference: unique paragraph id
                    - Quotation Content: paragraph content
                    - Codes: the codes that have been manually assigned to the paragraph, one per line within the cell

          min_length = the minimum number of characters that a paragraph must include to be included as valid data. This is to 
                       exclude filler sentences such as "yeah", "no", "I know" etc. Only comes into effect if you specify a 
                       value of 0 or more; the default of -1 disables the filter. 
      Returns: 
          original: the full table (all paragraphs, plus one 0/1 column per code), which we will use for reimporting machine-generated codings back into ATLAS.ti.
          training: the filtered version of the data (paragraphs of at least min_length characters), which we will use for machine learning. 
    """
    
    data = pd.read_excel(path, engine='openpyxl', dtype = str)[['Document', 'Reference', 'Quotation Content', 'Codes']]
    
    data['Codes'] = data['Codes'].apply(lambda x: str(x).split("\n"))
    data['Quote'] = data['Quotation Content']

    # One-hot encode the codes: add one 0/1 column per code found anywhere in the data.
    data['Codes_Frozen'] = data['Codes'].apply(frozenset)
    for code in frozenset.union(*data.Codes_Frozen):
        data[code] = data['Codes_Frozen'].apply(lambda codes: int(code in codes))
    
    original = data

    # data['Quotation Content'] = data['Quotation Content'].apply(lambda x: str(x).split("\t"))
    # N = len(data)
    # data['Spk'] = ""
    # data['Quote'] = ""
    # for i in range(N):
    #     line = data['Quotation Content'][i]
    #     if len(line) >= 2:
    #         data['Spk'][i] = line[0].strip(":\s")
    #         data['Quote'][i] = line[1].strip('\u202c')
    #     elif len(line) == 1:
    #         data['Spk'][i] = ""
    #         data['Quote'][i] = line[0].strip()
    #     else:
    #         data['Spk'][i] = data['Quote'] = ""
    #     data['Codes'][i] = [data['Codes'][i][j].strip("#") for j in range(len(data['Codes'][i]))]
    # data = data[data["Spk"].isin(["MSPKR","FSPKR"])].reset_index(drop = True)
    # data = data[(data["Spk"] != "") & (data["INT_NEW"] != 1)].reset_index(drop = True)
    
    if min_length >= 0: 
        # Filter out paragraphs below the minimum length. Using .str.len() means empty (NaN) cells are dropped instead of raising an error.
        data = data[data['Quote'].str.len() >= min_length].reset_index(drop = True)
        
    training = data[[col for col in data.columns if col not in ['Quotation Content', 'Codes', 'Spk', 'Codes_Frozen', 'nan']]]
    
    return original, training

original, training = read_ATLAS(path = path, min_length = min_length)

This is how our data should look after import: one unit of text per row, with identifiers and metadata. A wide range of computational tools will become available once you transform any text data into this format.

**p.s. Don't worry, you can transform it back later.**

In [None]:
training

| | Document | Reference | Quote | Background |
|---|---|---|---|---|
| 0 | Henry_B._Abajian | 0 | This is an interview with Henry Abajian on the... | 1 |
| 1 | Henry_B._Abajian | 1 | In 1938 I graduated with an electrical enginee... | 1 |
| 2 | Henry_B._Abajian | 2 | What were you particularly interested in at th... | 1 |
| 3 | Henry_B._Abajian | 3 | It was all power engineering, and the electron... | 0 |
| 4 | Henry_B._Abajian | 5 | Yes. How I got there was interesting. The head... | 0 |
| ... | ... | ... | ... | ... |
| 19976 | Anthony_Zimbalatti | 177036 | Before we talk about your Grumman experiences ... | 0 |
| 19977 | Anthony_Zimbalatti | 177038 | What at the time caused you and your colleague... | 0 |
| 19978 | Anthony_Zimbalatti | 177039 | I don't know. We always had these bull session... | 0 |
| 19979 | Anthony_Zimbalatti | 177041 | Both of us worked for Milt. I think he had an ... | 0 |
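On "transforming it back": the sketch below shows one way the 0/1 code columns could be collapsed back into a list of codes per paragraph. The small table and the names `one_hot` and `code_columns` are invented for illustration; this is not the export routine we use for ATLAS.ti.

```python
import pandas as pd

# Invented one-hot table in the same shape as `training` (one column per code).
one_hot = pd.DataFrame({
    "Document": ["Interview_01", "Interview_01", "Interview_02"],
    "Reference": ["1", "2", "1"],
    "Background": [1, 0, 1],
    "Career": [0, 0, 1],
})
code_columns = ["Background", "Career"]

# For every paragraph, collect the names of the code columns that are set to 1.
code_lists = [
    [code for code in code_columns if flags[code] == 1]
    for flags in one_hot[code_columns].to_dict("records")
]
one_hot["Codes"] = pd.Series(code_lists, index=one_hot.index)

print(one_hot[["Document", "Reference", "Codes"]])
```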


### Preprocessing

For machine learning models to learn how humans code qualitative data, we also need to transform raw text into something that a computer can understand: numbers. 

People have come up with many ways of using numbers to represent text, but most of those systems are static and purely representational (e.g. Morse code). They were not designed to capture semantics or to mimic how humans actually use language. 

Recent developments in natural language processing have nonetheless made many breakthroughs in this direction. Deep learning models trained on enormous volumes of text data generate high-dimensional vectors that represent raw text in a way that can be used to learn and mimic how language is used. 

And our goal is precisely to mimic how human researchers code interviews. 


_We use BERT (Bidirectional Encoder Representations from Transformers), a family of deep learning language models trained on billions of English words from sources like BooksCorpus (a collection of books) and English Wikipedia, to generate fine-grained and context-specific representations of language. The code below loads DistilBERT, a smaller distilled member of this family._
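To see what a static representation looks like, here is a minimal bag-of-words sketch. It is deliberately not what BERT does: the vocabulary, the sentences, and the helper `bag_of_words` are invented to show the limitation that contextual models try to overcome.

```python
from collections import Counter

def bag_of_words(paragraph, vocabulary):
    """Represent a paragraph as word counts over a fixed vocabulary."""
    counts = Counter(paragraph.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bank", "river", "money"]
print(bag_of_words("we sat on the river bank", vocabulary))
print(bag_of_words("the bank lent us money", vocabulary))
```

Both sentences get the same count for "bank", even though the word means something different in each. A contextual model such as BERT instead produces a different vector for a word depending on the words around it.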

## Part II: Train the AI

### Set up a Deep Learning Environment

This cell sets up the computational environment in Google Colab. This will allow us to use machine learning.

Before you run the cell below, go to "Runtime - Change Runtime Type - Hardware Accelerator" and select "GPU". This will speed up our model. 
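If you are unsure whether a GPU is attached, a small check like the following can pick the device automatically. This is an optional sketch, not part of the original cell, which sets `device = "cuda"` by hand.

```python
try:
    import torch
    # Use the GPU only if PyTorch can actually see one.
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # Fallback for environments where PyTorch is not installed.
    device = "cpu"

print("Running on:", device)
```

On a CPU the workflow still runs, but turning text into vectors will likely take considerably longer.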

In [None]:
device = "cuda" # or "cpu"

!pip install transformers

import numpy as np

import torch
from torch.utils.data import TensorDataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

def text2feature(text, MAX_LENGTH = 512, BATCH_SIZE = 16, device = device):
    """
    Use pretrained BERT model to vectorize raw text. 
    
    Args:
        text: the text of interest, stored in the "Quote" column of our DataFrame "training". 
        
        MAX_LENGTH: the maximum number of "wordpieces" (usually words but not always) in each paragraph to be used 
                    for representing the paragraph, capped at 512. 
        
        BATCH_SIZE: the maximum number of samples to be used in a single neural network iteration. It is recommended
                    to use a batch size of 32 or 64. We used 16 due to the limitation of our GPU memory. 
        
        device: "cpu" or "cuda". "cuda" is the architecture of the NVIDIA graphics processing unit (GPU) which is
                used to accelerate machine learning. Only use if you either (1) have an NVIDIA GPU and have installed 
                the CUDA development tools following the instruction or (2) use cloud computing (e.g. Google Colab). 
        
    Returns: 
        feature: an N_row by 768 array of vectors that represent each paragraph using a vector of 768.
    """

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    
    encodings = list(map(lambda t: tokenizer.encode(t, padding=True, truncation=True, max_length = MAX_LENGTH, add_special_tokens=True), text))
    
    # Length of the longest encoded paragraph (0 if there is no text at all)
    max_len = max((len(i) for i in encodings), default=0)

    encodings_padded = np.array([i + [0]*(max_len-len(i)) for i in encodings])
    attention_mask = [[float(i > 0) for i in ii] for ii in encodings_padded]
    
    dataset = TensorDataset(torch.tensor(encodings_padded, dtype = torch.int), torch.tensor(attention_mask))
    
    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE)
    
    model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)
    
    features = []

    with torch.no_grad():
        for step_num, batch_data in enumerate(dataloader):
            token_ids, masks = tuple(t.to(device) for t in batch_data)
            last_hidden_states = model(token_ids, masks)
            features.append(last_hidden_states[0][:,0,:].cpu().detach().numpy())
            # The model produces one 768-dimensional vector for every "wordpiece" position in a sequence 
            # (up to 512 positions). The first position always holds the special classification token [CLS], 
            # whose vector can be used as the aggregate sequence representation for classification tasks. 
            # An alternative is to represent the sequence by averaging the vectors of all positions, 
            # which tends to produce similar results.
    features = np.vstack(features)
    
    return features

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninsta
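The padding, attention mask, and pooling steps inside `text2feature` can be hard to picture, so here is a small NumPy sketch of the same ideas. The token ids are invented and the "model output" is random noise standing in for real DistilBERT hidden states, so only the shapes and the arithmetic are meaningful.

```python
import numpy as np

# Invented token ids for two short "paragraphs"; 0 is used as the padding id.
encodings = [[101, 7, 8, 102], [101, 9, 102]]

max_len = max(len(ids) for ids in encodings)
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in encodings])
attention_mask = (padded > 0).astype(float)   # 1.0 for real tokens, 0.0 for padding

# Stand-in for the model output: one random 768-dim vector per token position.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(len(encodings), max_len, 768))

# Option 1: take the vector at position 0, where the [CLS] token sits.
cls_vectors = hidden_states[:, 0, :]

# Option 2: average the token vectors, ignoring padded positions.
mask = attention_mask[:, :, None]
mean_vectors = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_vectors.shape, mean_vectors.shape)
```

Either option yields one 768-dimensional vector per paragraph, which is the shape the classifier expects later on.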

### From Text to Vectors Using Deep Learning

On average this will take about 5-7 minutes.  

In [None]:
text = training['Quote'].values.tolist()   # Extract text from the DataFrame
features = text2feature(text)              # This function transforms text data into vectors. 
features.shape                             # We should obtain a N × 768 matrix that represents each row using a 768-dimensional vector.

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


(19981, 768)

Our data look like this in vectors:

In [None]:
features

array([[-8.04037899e-02, -1.23658910e-01, -3.61755282e-01, ...,
        -4.42422088e-03,  5.73920608e-01,  4.51830387e-01],
       [ 1.52352005e-02,  7.93554708e-02, -3.65893543e-01, ...,
         2.96420068e-01,  3.53529871e-01,  6.38014317e-01],
       [ 1.67413235e-01, -4.30313461e-02, -2.37162262e-01, ...,
        -1.04779854e-01,  2.66423166e-01,  2.84095228e-01],
       ...,
       [-8.22876766e-02,  5.59788980e-02, -4.67753440e-01, ...,
         1.56504020e-01,  7.09437609e-01,  3.97252321e-01],
       [ 1.43082321e-01, -1.50504559e-01, -4.59693998e-01, ...,
        -5.96625090e-04,  4.93743062e-01,  5.25567591e-01],
       [-6.87612742e-02,  3.02690975e-02, -1.31134659e-01, ...,
        -1.04792535e-01,  1.61510170e-01,  2.77071506e-01]], dtype=float32)

Those numbers are not arbitrary! They provide relational representations, as good as we can currently get in computers, of the linguistic and semantic features of each paragraph as it is composed and used in context. These representations will help the algorithm find the paragraphs that are relevant to our coding. 
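One common way to read "relational" is that paragraphs with similar content tend to get vectors pointing in similar directions, which can be measured with cosine similarity. The sketch below uses invented 4-dimensional vectors in place of the real 768-dimensional ones; our workflow feeds the vectors to a classifier rather than comparing them directly.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented stand-ins for paragraph vectors.
paragraph_a = np.array([0.9, 0.1, 0.3, 0.0])
paragraph_b = np.array([0.8, 0.2, 0.4, 0.1])    # points in a similar direction to a
paragraph_c = np.array([-0.2, 0.9, -0.5, 0.7])  # points elsewhere

print(cosine_similarity(paragraph_a, paragraph_b))
print(cosine_similarity(paragraph_a, paragraph_c))
```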

### Generate Testing Samples for Evaluation

But before we use these vectors to extend our initial codings to all the remaining data, we want to have a sense of how reliable the result would be. We do so by setting aside a small sample of coded data and comparing machine-generated codings against our own codings on this already coded sample. And we want to do this multiple times to weed out the effect of outlier cases. 

We tell the algorithm how many times we want to test and what percentage of the data we want to use for testing. 

In [None]:
n_splits = 20       # how many randomly reshuffled splits to generate.
test_size = 0.75    # proportion of interviews to hold out as test data.

In [None]:
from sklearn.model_selection import ShuffleSplit

def split(all_cases, n_splits = 10, test_size = 0.25):
    """
    Split any coded data into training and test sets. 
    
    Args:
        all_cases: the list of unique interview ids to split.
        
        n_splits: how many randomly reshuffled splits to generate. The default is 10. 
        
        test_size: how many cases to use as test data. 
                   - If between 0 and 1, represents the proportion of the dataset to include;
                   - If an integer, represents the absolute number of test samples.
                   - The default is 25%. 
    Returns:
        train_set, test_set
    
    """
    train_test_split = ShuffleSplit(n_splits = n_splits, test_size = test_size)
    train_set = []
    test_set = []
    for train_index, test_index in train_test_split.split(all_cases):
        train_set.append(train_index.tolist())
        test_set.append(test_index.tolist())
    
    return train_set, test_set

all_cases = training['Document'].unique().tolist()

train_set, test_set = split(all_cases, n_splits, test_size)

_The more data we use for testing, the less data we have left for training._

_Here we are testing on all the remaining data because all the paragraphs are "coded" already in our "toy dataset". In actual research, we would code 25% of all data, use 15-20% to train the model, and 5%-10% to evaluate its performance, and then scale the codings to the remaining 75%._
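Note that the split above is made over interview ids (`all_cases`), not over individual paragraphs, so all paragraphs from one interview land on the same side. The pure-Python sketch below illustrates that idea with invented data; `split_by_document` is a simplified stand-in written for this example, not a replacement for scikit-learn's `ShuffleSplit`.

```python
import random

# Invented (interview id, paragraph) pairs.
paragraphs = [
    ("Interview_01", "p1"), ("Interview_01", "p2"),
    ("Interview_02", "p3"), ("Interview_03", "p4"),
    ("Interview_03", "p5"), ("Interview_04", "p6"),
]

def split_by_document(paragraphs, test_size=0.25, seed=0):
    """Randomly assign whole interviews (not single paragraphs) to train or test."""
    documents = sorted({doc for doc, _ in paragraphs})
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    rng.shuffle(documents)
    n_test = max(1, round(len(documents) * test_size))
    test_docs = set(documents[:n_test])
    train = [p for p in paragraphs if p[0] not in test_docs]
    test = [p for p in paragraphs if p[0] in test_docs]
    return train, test

train, test = split_by_document(paragraphs)
print(len(train), "training paragraphs;", len(test), "test paragraphs")
```

Splitting by interview keeps the evaluation honest: the model is always tested on people it has not seen during training.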

## Part III: Scaling

Now we train a classifier algorithm to predict our own codings using the vector representations and training examples we just obtained. 

First, we tell the algorithm which code we have already established on the training data and want to scale to the remaining data. 

In [None]:
eval_codes = ['Background']

In [None]:
# since BERT is typically deployed with data that are much bigger and deep 
# learning models much more sophisticated than required by our tasks, here 
# we simplify it by combining BERT vectors with a regularized logistic 
# regression, which we found to be highly effective for smaller scale data. 

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, cohen_kappa_score

from typing import Union
from scipy.sparse import spmatrix

NDArray = Union[np.ndarray, spmatrix]

class Classifier:
    def __init__(self):
        """
        Initializes a logistic regression classifier.
        """
        self.clf = LogisticRegression(penalty = 'l2', solver = 'liblinear', C = 1, class_weight= 'balanced')
        
    def train(self, features: NDArray, labels: NDArray):
        """
        Trains the classifier using the given training examples.

        Args: 
            features: A feature matrix, where each row represents a text.

            labels: A label vector, where each entry represents a label.
        """
        self.clf.fit(features, labels)
    
    def predict(self, features: NDArray) -> NDArray:
        """
        Makes predictions for each of the given examples.

        Args:
            features: A feature matrix, where each row represents a text.
                      Such matrices will typically be generated via text2feature.
        
        Returns: 
            predictions: A prediction vector, where each entry represents a label.
            predprob: A probability matrix with one row per example and one column per label; for 0/1 labels, column 1 holds the predicted probability that the code applies. 
        """


        predictions = self.clf.predict(features)
        predprob = self.clf.predict_proba(features)
    
        return predictions, predprob
    
    def code(self, train_features, test_features, train_labels):
        
        self.train(train_features, train_labels)
        self.predicted_labels, self.predicted_probabilities = self.predict(test_features)
        
        return self.predicted_labels, self.predicted_probabilities
    
    def metrics(self, test_labels):
        """
        Obtains evaluation metrics by comparing model predictions to original human coding on test data. 
        """

        accuracy = accuracy_score(test_labels, self.predicted_labels)
        f1 = f1_score(test_labels, self.predicted_labels, pos_label=1)
        precision = precision_score(test_labels, self.predicted_labels)
        recall = recall_score(test_labels, self.predicted_labels)
        kappa = cohen_kappa_score(test_labels, self.predicted_labels)
        
        return [accuracy, f1, precision, recall, kappa]

def classify(classifier, data, eval_codes, train_set, test_set, method = 'pred'):
    """
    Train a machine learning model to predict human codings of interest based on each training/test split.
    
    Args:
        classifier: the machine learning model. 
        
        eval_codes: the list of codes to be scaled, corresponding to the "Codes" column from the ATLAS.ti export file.
        
        train_set/test_set: the ids of interviews to use as training/test data. 
        
        method: "pred" or "eval". 
                
                - The "pred" method predicts codings on uncoded data without returning metrics. 
                - The "eval" method predicts codings on coded data and compares the predictions against original human codings. 
        
    Returns:
        predictions: the predicted probability of a code on a paragraph.
        
        metrics: a list of performance metrics the "eval" method generates. 
    
    """
    
    # Collect rows in plain lists and build the DataFrames once at the end
    # (DataFrame.append was removed in pandas 2.0).
    prediction_rows = []
    metric_rows = []
    
    for code in eval_codes:
        for train_ids, test_ids in zip(train_set, test_set):

            # NOTE: `all_cases` and `features` are notebook-level variables defined in earlier cells.
            # Every paragraph whose interview is not in the training split is used as test data.
            is_train = data['Document'].isin([all_cases[id] for id in train_ids])
            train_data = data[is_train]
            test_data = data[~is_train]   # "~" is the idiomatic way to negate a boolean mask

            train_id = train_data.index.tolist()
            test_id = test_data.index.tolist()

            train_features = features[train_id]
            test_features = features[test_id]

            train_labels = train_data[code].values.tolist()
            
            predicted_labels, predicted_probabilities = classifier.code(train_features, test_features, train_labels)

            # Keep only the probability of label 1 (the code applies) for each test paragraph.
            prediction_rows.append([code, test_ids, [x[1] for x in predicted_probabilities]])

            if method == 'eval':
                test_labels = test_data[code].values.tolist()
                # 'n' is the number of training paragraphs that carry the code.
                metric_rows.append([code, test_ids, train_data[code].sum()] + classifier.metrics(test_labels))

    predictions = pd.DataFrame(prediction_rows, columns = ['code', 'test_set', 'p'])
    metrics = pd.DataFrame(metric_rows, columns = ['code', 'test_set', 'n', 'accuracy', 'f1', 'precision', 'recall', 'kappa'])
            
    return predictions, metrics

predictions, metrics = classify(Classifier(), training, eval_codes, train_set, test_set, 'eval')

pd.DataFrame({'N': training[eval_codes].sum()}).join(metrics.groupby('code')[['n', 'accuracy', 'f1', 'precision', 'recall', 'kappa']].mean())

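To see the classifier in isolation, the sketch below fits the same kind of regularized logistic regression on synthetic vectors. Everything prefixed with `toy_` is invented: the data are random numbers shaped like paragraph vectors, not BERT output, so the predictions only illustrate the mechanics.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented stand-ins for paragraph vectors: 200 paragraphs, 10 dimensions,
# with roughly 1 in 5 paragraphs carrying the code (label 1).
toy_labels = (rng.random(200) < 0.2).astype(int)
toy_features = rng.normal(size=(200, 10)) + toy_labels[:, None] * 1.5  # shift coded rows

# L2 regularization is scikit-learn's default penalty; class_weight="balanced"
# gives the rarer class more weight during training.
clf = LogisticRegression(solver="liblinear", C=1, class_weight="balanced")
clf.fit(toy_features[:150], toy_labels[:150])

# Column 1 of predict_proba is the estimated probability that the code applies.
toy_probabilities = clf.predict_proba(toy_features[150:])[:, 1]
toy_predicted = (toy_probabilities >= 0.5).astype(int)
print(toy_predicted[:10], toy_probabilities[:10].round(2))
```

Balanced class weights matter here because most paragraphs do not carry any given code; without them the model could score well on accuracy while missing most relevant paragraphs.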

### Note

_Remember the definitions of recall and precision and the restaurant examples. The model is doing a particularly good job of casting a wide net and not missing too many orders: on average, machine learning catches about 70% of all relevant paragraphs, which is not too far below average human performance in qualitative coding._

_As a tradeoff, it is not very precise and returns a lot of items (paragraphs) that you did not order. We can easily fix this in our next step, because instead of the whole 20k-paragraph dataset, we are now looking at only about 20% of it._

- _Recall: the share of all relevant cases that a model picks up in a classification task. Suppose you are in a restaurant: a waiter who misses three of the four items you ordered has a recall of 0.25, and of course you want a waiter with higher recall!_

- _Precision: the share of the cases a model picks up that are actually relevant in a classification task. Suppose you are in a restaurant and your waiter brings you four items, but three of them are items that you didn't order. This waiter has a precision of 0.25, and of course you want a waiter with higher precision!_ 

- _F-1: a single score that summarizes the overall performance of a machine learning model in classification tasks, computed as the harmonic mean of precision and recall._

_For more detail, see https://journals.sagepub.com/doi/10.1177/23780231211062345_
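
The restaurant examples can be worked through in a few lines of Python. This is a minimal sketch with a made-up menu of seven items; the function name `precision_recall_f1` and the toy lists are illustrative assumptions, not part of the workshop code.

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Compute precision, recall, and F1 for binary labels (1 = relevant)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # ordered and brought
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # brought but not ordered
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # ordered but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical menu of seven items: you ordered the first four.
ordered = [1, 1, 1, 1, 0, 0, 0]
# The waiter brings one item you ordered and three you did not.
brought = [1, 0, 0, 0, 1, 1, 1]

precision, recall, f1 = precision_recall_f1(ordered, brought)
print(precision, recall, f1)
```

Here the waiter brings four items of which one was ordered (precision 0.25) and delivers one of the four ordered items (recall 0.25).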

## Part IV (OPTIONAL): Recoding

Now that we have caught about 70% of all paragraphs that would have been coded as "Educational Background", we want to quickly review the results and filter out obvious errors, especially the irrelevant paragraphs that the algorithm has mistaken for "Background".

The cell below creates customized spreadsheets that include the part of the original text that the algorithm has labeled as "Background". Researchers can then go through this list to correct any obvious _false positives_. 

In most cases, recoding machine predictions only takes a fraction of the time and effort required by full human coding, because the algorithm has filtered out the vast majority of "irrelevant" texts, as defined by _YOUR_ own coding.
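
To see why reviewing machine predictions is quicker than coding everything by hand, here is a minimal sketch of the pruning and sorting idea on a toy table. The quotes and probabilities are made up; the column name `Pred_Background` and the 0.5 threshold follow the conventions used elsewhere in this notebook.

```python
import pandas as pd

# Toy data: made-up paragraphs with made-up predicted probabilities.
toy = pd.DataFrame({
    'Quote': ['I grew up on a farm.',
              'The radar set failed.',
              'My father was a teacher.',
              'We shipped in June.'],
    'Pred_Background': [0.91, 0.12, 0.67, 0.48],
})

threshold = 0.5

# Keep only rows at or above the threshold, most confident predictions first.
to_review = (toy[toy['Pred_Background'] >= threshold]
             .sort_values('Pred_Background', ascending=False))

share_left = len(to_review) / len(toy)
print(to_review)
print(f"{share_left:.0%} of paragraphs left to review")
```

Only the rows that survive pruning need a human decision; everything below the threshold is treated as not relevant.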

We explained this point in a recent piece [here](https://osf.io/preprints/socarxiv/gpr4n/). The figure below reveals the 'punchline' of our tests.



In [None]:
from google.colab import files

test_set = metrics.sort_values(['recall', 'f1'], ascending = False).groupby('code').nth(0)['test_set'].tolist()[0]
train_set = list(set(range(len(all_cases)))-set(test_set))
predictions, _ = classify(Classifier(), training, eval_codes, [train_set], [test_set])

recode_set = test_set
combine_codes = [0]

sorting = [True]
sorting_vars = [['Background']]

pruning = [True]
pruning_thresholds = [0.5]

def gen_recode(data, predictions, all_cases, recode_set, combine_codes, sorting, pruning, sorting_vars = None, pruning_threshold = None):
    """
    Export all machine-coded rows in several Excel spreadsheets that facilitate human reviewing and recoding. 
    
    Args:
        data: a DataFrame that stores the original training data (i.e. "training"). 
        
        predictions: a DataFrame that stores a list of machine-generated codings on the target data for each code (i.e. "predictions"). 
        
        all_cases: the ids of interviews to use as training/test data. 

        recode_set: a List of the ids of interviews to be reviewed and recoded. 

        combine_codes: a List that defines which codes are to be exported and recoded together in one spreadsheet. The number of unique values is the number of spreadsheets to be generated and codes that are assigned the same value will appear in the same spreadsheet

        sorting: a List of boolean values (True or False) where sorting[i] defines whether spreadsheet[i] should be sorted by predicted probabilities or remain in chronological order. 

        sorting_vars: if sorting[i] = True, sorting_vars[i] defines by which code's predicted probabilities the whole spreadsheet will be sorted. 
        
        pruning: a List of boolean values (True or False) where pruning[i] defines whether spreadsheet[i] should be pruned by predicted probabilities.

        pruning_threshold: if pruning[i] = True, pruning_threshold[i] defines the threshold of pruning (rows with a predicted probability below the threshold will not appear in the recoding spreadsheet). 
        
    Returns:
        
        spreadsheets: a list of DataFrames that each stores one recoding spreadsheet. 
    
    """
    recode = data[data['Document'].isin([all_cases[id] for id in recode_set])][['Quote', 'Document', 'Reference']].reset_index(drop = True)
    
    for code in eval_codes:
        #recode_set = best.loc[code]['test_set']
        p = predictions.loc[(predictions['code'] == code) & (predictions['test_set'].apply(lambda x: x==recode_set))]['p'].values.tolist()[0]
        col = pd.DataFrame({str(combine_codes[eval_codes.index(code)]) + '_P_' + code :p})
        recode = recode.join(col)
    
    spreadsheets = []

    for doc_id in range(max(combine_codes)+1):
        out = recode[['Document', 'Reference']]
        name = "Pred_"

        rowids_to_keep = []

        for col in (col for col in recode if col.startswith(str(doc_id))):

            out = out.join(recode[[col]])

            colname = col.rsplit('_',1)[1]

            name = name + colname

            out.rename(columns={col: name}, inplace=True)

            out[colname+'?'] = ''

            if pruning[doc_id] == True:
                rowids_to_keep.append(out[name].index[out[name] >= pruning_threshold[doc_id]].tolist() )

        out = out.join(recode[['Quote']])

        if pruning[doc_id] == True:
            out = out.iloc[list(set([item for sublist in rowids_to_keep for item in sublist]))]
        
        if sorting[doc_id] == True:
            out = out.sort_values(by = ["Pred_" + v for v in sorting_vars[doc_id]], ascending = False)
        else:
            out = out.sort_index()
        
        spreadsheets.append(out)

        # Uncommenting the following two lines will generate actual Excel files and download them to your local device. 

        #out.to_excel(name + ".xlsx")
        #files.download(name + ".xlsx")

    return spreadsheets

recoding_spreadsheet = gen_recode(training, predictions, all_cases, recode_set, combine_codes, sorting, pruning, sorting_vars, pruning_thresholds)

recoding_spreadsheet[0]

Unnamed: 0,Document,Reference,Pred_Background,Background?,Quote
1494,Patricia_Brown,19501,0.999971,,"My father was William Madison Brown, and my mo..."
9899,Vincente_Ortega,119695,0.999954,,My father was in the hardware business. I have...
5250,Arminta_Harness,63468,0.999941,,"Mother graduated from high school there, and D..."
13879,Ben_Vester,161096,0.999926,,I was born down in North Carolina in a little ...
8633,Arthur_McComas,104792,0.999921,,My father was a building manager. Both my pare...
...,...,...,...,...,...
13279,Ralph_Strong,151383,0.500444,,We had a little arc furnace and a copper cruci...
6817,Margaret_Kipilo,86791,0.500353,,No. I never remember saying anything. I don’t ...
13821,Gottfried_Ungerboeck,159032,0.500263,,Was it a one year period you were in the milit...
6292,Milton_Kant,81923,0.500193,,The Navy was not using television. I think tha...


Now a human researcher can quickly review all the paragraphs the algorithm finds potentially relevant to our "Background" code and flag any that are not. 

In [None]:
recoding_spreadsheet[0].iloc[0]['Quote']

In [None]:
recoding_spreadsheet[0].iloc[1]['Quote']

In [None]:
recoding_spreadsheet[0].iloc[-1]['Quote']

## Part IV (OPTIONAL): Visualize

Here is a simple word cloud built from rows that our machine learning model thinks are relevant to our code "Background"! The bigger a word is, the more frequent it is in the text. 
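
Word size in a word cloud reflects how often each word occurs. The sketch below shows that underlying count using only Python's standard library; the two quotes are made up, and the simple tokenizer is an assumption for illustration rather than what the `wordcloud` package does internally (by default it also drops common stopwords).

```python
import re
from collections import Counter

# Made-up example quotes; in the notebook these would come from the 'Quote' column.
quotes = [
    "My father was an engineer and my mother was a teacher.",
    "My father worked in the hardware business.",
]

# Lowercase the text and split it into words.
words = re.findall(r"[a-z']+", " ".join(quotes).lower())

# Count how often each word appears; more frequent words would be drawn larger.
counts = Counter(words)
print(counts.most_common(3))
```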

In [None]:
!pip install wordcloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wordcloud = WordCloud(width=800, height=400, scale = 10).generate(recoding_spreadsheet[0]['Quote'].str.cat(sep=' '))
plt.figure( figsize=(20,10) )
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")


## Part V: Export for External Analysis

### Preparing for ATLAS.ti

Once you have confirmed that the coding is adequate (and, if necessary, reviewed and recoded the machine-generated codings for obvious errors), you can use the function below to export the coded data in a special format that can then be reimported into ATLAS.ti. 

*The exported data can also be imported into other QDA software, or analyzed in a spreadsheet or in Python.*

We add machine-generated and human-reviewed codings back to the original data (including all rows) in the original order. 

A tutorial can be found here. [link]
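
Before the full function, here is a minimal sketch of how human-reviewed and machine-generated codings can be combined and added back to the full data. It assumes, as the function below does, that the reviewed spreadsheet has a human column named after the code (`Background`) and a machine-probability column named `P_Background`; the rows themselves are made up.

```python
import numpy as np
import pandas as pd

# Toy reviewed spreadsheet: NaN in 'Background' means the reviewer left the row untouched.
recoded = pd.DataFrame({
    'Document': ['A', 'A', 'B'],
    'Reference': [1, 2, 3],
    'P_Background': [0.93, 0.71, 0.55],
    'Background': [np.nan, 0, np.nan],
})

# Fall back to the machine probability where there is no human decision, then binarize.
final = recoded['Background'].fillna(recoded['P_Background'])
recoded['Background'] = (final >= 0.5).astype(int)

# Toy full dataset, including a row (B, 4) that was never sent for review.
full = pd.DataFrame({'Document': ['A', 'A', 'B', 'B'], 'Reference': [1, 2, 3, 4]})

# A left merge keeps every original row in its original order.
full = full.merge(recoded[['Document', 'Reference', 'Background']],
                  on=['Document', 'Reference'], how='left')
print(full)
```

The second row shows a human decision overriding the machine: the model gave it 0.71, but the reviewer marked it 0. The unreviewed row stays empty after the merge.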

In [None]:
def reimport2ATLAS(full_data, recoded, recoded_set, recoded_cases):
    for code in recoded_set:
        # Rows the reviewer left blank fall back to the machine probability stored in "P_" + code.
        recoded.loc[recoded[code].isna(), code] = recoded.loc[recoded[code].isna(), "P_"+code]
        recoded[code] = recoded[code].apply(lambda x: x >= 0.5).astype(int)
        # merge() returns a new DataFrame, so keep the result.
        # Assumption: full_data has no column named `code` yet; otherwise pandas adds _x/_y suffixes.
        full_data = full_data.merge(recoded[['Document', 'Reference', code]], on = ['Document', 'Reference'], how = 'left')
   
    code_list = [col for col in full_data.columns.tolist() if col not in ['Document', 'Reference', 'Quote', 'Quotation Content', 'Codes', 'Codes_Frozen','nan']] 
    
    for i in range(len(full_data)):
        if full_data.loc[i,"Document"] in recoded_cases:
            for code in code_list:
                # Append a "#code" tag to the paragraph text for every code that applies.
                if(full_data.loc[i, code] >= 0.5): full_data.loc[i,'Quotation Content'] = full_data.loc[i,'Quotation Content'] + ' ' + '#' + code
        
    full_data['idx'] = full_data.groupby('Document').cumcount()
    full_data['p_idx'] = 'p' + full_data['idx'].astype(str)
    full_pivot = full_data.pivot(index='Document',columns='p_idx', values = 'Quotation Content')
    full_pivot = full_pivot.reindex(sorted(full_pivot.columns, key=lambda x: float(x[1:])), axis=1)
    full_pivot.to_excel("Recoded_output_transformed_all_cases_all_codes.xlsx")

# recoded_set = ["Background"]
# recoded = pd.read_excel("")
# reimport2ATLAS(original, recoded, recoded_set, all_cases)
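
The reshaping step at the end of `reimport2ATLAS` turns a long table (one row per paragraph) into a wide one (one row per document, one column per paragraph). A minimal sketch on made-up data, using the same column names as the function:

```python
import pandas as pd

# Toy long-format data: one row per paragraph, tagged text in 'Quotation Content'.
toy = pd.DataFrame({
    'Document': ['A', 'A', 'B', 'B', 'B'],
    'Quotation Content': ['a0', 'a1 #Background', 'b0', 'b1', 'b2 #Background'],
})

# Number the paragraphs within each document: 0, 1, 2, ...
toy['idx'] = toy.groupby('Document').cumcount()
toy['p_idx'] = 'p' + toy['idx'].astype(str)

# One row per document, one column per paragraph position.
wide = toy.pivot(index='Document', columns='p_idx', values='Quotation Content')

# Sort columns numerically so that 'p10' comes after 'p2' rather than before it.
wide = wide.reindex(sorted(wide.columns, key=lambda x: int(x[1:])), axis=1)
print(wide)
```

Documents with fewer paragraphs than the longest one get empty cells in the trailing columns.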

## Appendix: Scraping Interview Data

The interview data we use in this tutorial were scraped from The Institute of Electrical and Electronics Engineers History Center's [Engineering and Technology History Wiki](https://ethw.org/Oral-History:List_of_all_Oral_Histories), using the Python code below. 

Please note that no part of the data may be quoted for publication without the written permission of the Director of IEEE History Center. Request for permission to quote for publication should be addressed to the IEEE History Center Oral History Program, IEEE History Center, 445 Hoes Lane, Piscataway, NJ 08854 USA or ieee-history@ieee.org. 

### Install necessary libraries

In [None]:
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

from bs4 import BeautifulSoup
import re


### Get the Table of Contents

In [None]:
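# Start a headless Chrome session with the options defined above.
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://ethw.org/Oral-History:List_of_all_Oral_Histories")
content = driver.page_source
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(content, 'html.parser')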

links = []

for link in soup.find_all('a', attrs={'href': re.compile("/Oral-History")}):
    links.append(link.get('href'))

links.index("/Oral-History:Henry_B._Abajian")

links[48:]

['/Oral-History:Henry_B._Abajian',
 '/Oral-History:Willis_Adcock',
 '/Oral-History:Michael_Adler',
 '/Oral-History:Norbert_Adler',
 '/Oral-History:Roberto_Aguilera',
 '/Oral-History:William_Ross_Aiken',
 '/Oral-History:Rachid_Alami',
 '/Oral-History:Charles_Alexander',
 '/Oral-History:Royal_P._Allaire',
 '/Oral-History:David_W._Allan',
 '/Oral-History:Frances_%22Fran%22_Allen',
 '/Oral-History:Robert_Ambrose',
 '/Oral-History:Brian_Anderson',
 '/Oral-History:Deborah_Anderson',
 '/Oral-History:W._Cleon_Anderson',
 '/Oral-History:Wes_Anderson',
 '/Oral-History:Eva_Andrei',
 '/Oral-History:Fred_Andrews',
 '/Oral-History:Bruce_Angwin',
 '/Oral-History:David_Anthony',
 '/Oral-History:Diran_Apelian',
 '/Oral-History:Frank_F._Aplan',
 '/Oral-History:Tatsuo_Arai',
 '/Oral-History:Michael_Arbib',
 '/Oral-History:Ronald_Arkin',
 '/Oral-History:Ken_Arnold',
 '/Oral-History:Lyn_Arscott',
 '/Oral-History:Robert_Arzbaecher',
 '/Oral-History:Minoru_Asada',
 '/Oral-History:Eric_Ash',
 '/Oral-History:W
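
The link filter used above keeps every `<a>` tag whose `href` matches the regular expression `/Oral-History`. The sketch below reproduces that idea on a made-up HTML snippet using only Python's standard library instead of Selenium and BeautifulSoup, so it runs without a browser. The class `LinkCollector` and the names in the snippet are hypothetical.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that match a regular expression."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Keep the link if the pattern occurs anywhere in the href.
        if href and self.pattern.search(href):
            self.links.append(href)


# Made-up page fragment with one navigation link and two interview links.
html = '''
<ul>
  <li><a href="/Main_Page">Home</a></li>
  <li><a href="/Oral-History:Jane_Doe">Jane Doe</a></li>
  <li><a href="/Oral-History:John_Roe">John Roe</a></li>
</ul>
'''

collector = LinkCollector("/Oral-History")
collector.feed(html)
print(collector.links)
```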

### Get Individual Oral Histories

In [None]:
interviews = pd.DataFrame(columns = ['person', 'interview', 'bio', 'section', 'speaker', 'text'])

for link in links[48:50]:
    driver.get("https://ethw.org" + link)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    # One row per speaker turn: a <p> containing a bold speaker name, paired with the <p> that follows it.
    row = [[link.split(':')[1],
        soup.find('span', {'id': 'About_the_Interview'}).find_next('p').get_text() if soup.find('span', {'id': 'About_the_Interview'}) is not None else '', 
        soup.find_all('h2')[0].find_next('p').get_text() if soup.find_all('h2') else '', 
        p.find_previous('h3').find_next('span', {'class' : 'mw-headline'}).get_text() if p.find_previous('h3') is not None else '',
        p.find('b').get_text(), 
        p.find_next('p').get_text()
    ] for p in soup.find_all('p') if (p.find('b') is not None) and (p.find_next('p') is not None)]
    interviews = pd.concat([interviews, pd.DataFrame(row, columns = ['person', 'interview', 'bio', 'section', 'speaker', 'text'])], axis = 0)

# Replace line breaks with spaces and strip the colon after each speaker name.
interviews = interviews.replace(r'\n',' ', regex=True)

interviews['speaker'] = interviews['speaker'].replace(r':', '', regex=True)

In [None]:
interviews

Unnamed: 0,person,interview,bio,section,speaker,text
0,Henry_B._Abajian,HENRY B. ABAJIAN: An Interview Conducted by Fr...,Abajian got an Electrical Engineering degree i...,Educational Background,Nebeker,This is an interview with Henry Abajian on the...
1,Henry_B._Abajian,HENRY B. ABAJIAN: An Interview Conducted by Fr...,Abajian got an Electrical Engineering degree i...,Educational Background,Abajian,In 1938 I graduated with an electrical enginee...
2,Henry_B._Abajian,HENRY B. ABAJIAN: An Interview Conducted by Fr...,Abajian got an Electrical Engineering degree i...,Educational Background,Nebeker,What were you particularly interested in at th...
3,Henry_B._Abajian,HENRY B. ABAJIAN: An Interview Conducted by Fr...,Abajian got an Electrical Engineering degree i...,Radiation Lab,Abajian,"It was all power engineering, and the electron..."
4,Henry_B._Abajian,HENRY B. ABAJIAN: An Interview Conducted by Fr...,Abajian got an Electrical Engineering degree i...,Radiation Lab,Nebeker,That first year of operation.
...,...,...,...,...,...,...
141,Willis_Adcock,WILLIS ADCOCK: An Interview Conducted by David...,Adcock was born in Canada but moved to the US ...,Family and retirement,Morton,Okay.
142,Willis_Adcock,WILLIS ADCOCK: An Interview Conducted by David...,Adcock was born in Canada but moved to the US ...,Family and retirement,Adcock,What’s your area of interest?
143,Willis_Adcock,WILLIS ADCOCK: An Interview Conducted by David...,Adcock was born in Canada but moved to the US ...,Family and retirement,Morton,"I’m an historian. I work for the IEEE, but I w..."
144,Willis_Adcock,WILLIS ADCOCK: An Interview Conducted by David...,Adcock was born in Canada but moved to the US ...,Family and retirement,Adcock,Where?


In [None]:
#interviews.to_excel("scraped_interviews.xlsx")
#files.download("scraped_interviews.xlsx")