# ALIGN Tutorial Notebook: DEVIL'S ADVOCATE

This notebook provides an introduction to **ALIGN**, 
a tool for quantifying multi-level linguistic similarity 
between speakers, using the "Devil's Advocate" transcript data reported in Duran, Paxton, and Fusaroli: "ALIGN: Analyzing Linguistic Interactions with Generalizable techNiques - a Python Library".  This method was also introduced in Duran, Paxton, and Fusaroli (2019), which can be accessed here for reference: https://osf.io/kx8ur/. 

## Tutorial Overview

The Devil's Advocate ("DA") study examines interpersonal linguistic alignment between dyads across two conversations where participants either agreed or disagreed with each other (as a randomly assigned between-dyads condition) and where one of the conversations involved the truth and the other deception (as a within-subjects condition), with order of conversations counterbalanced across dyads. 

**Transcript Data:**

The complete de-identified dataset of raw conversational transcripts is hosted on a secure protected access data repository provided by the Inter-university Consortium for Political and Social Research (ICPSR). These transcripts need to be downloaded to use this tutorial. Please click on the link to the ICPSR repository to access: http://dx.doi.org/10.3886/ICPSR37124.v1. 

**Analysis:**

To replicate the results reported in Duran, Paxton, and Fusaroli (2019), or for an example of R code used to analyze the ALIGN output for this dataset, please visit the OSF repository for this project: https://osf.io/3TGUF/

***

## Table of Contents

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [Highest-level functions](#Highest-level-functions)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [Specify ALIGN path settings](#Specify-ALIGN-path-settings)
* [Phase 1: Prepare transcripts](#Phase-1:-Prepare-transcripts)
    * [Preparation settings](#Preparation-settings)
    * [Run preparation phase](#Run-preparation-phase)
* [Phase 2: Calculate alignment](#Phase-2:-Calculate-alignment)
    * [For real data: Alignment calculation settings](#For-real-data:-Alignment-calculation-settings)
    * [For real data: Run alignment calculation](#For-real-data:-Run-alignment-calculation)
    * [For surrogate data: Alignment calculation settings](#For-surrogate-data:-Alignment-calculation-settings)
    * [For surrogate data: Run alignment calculation](#For-surrogate-data:-Run-alignment-calculation)
* [ALIGN output overview](#ALIGN-output-overview)
    * [Speed calculations](#Speed-calculations)
    * [Printouts!](#Printouts!)

***

# Getting Started

## Preparing input data

**The transcript data used for this analysis adheres to the following requirements:**

* Each input text file contains a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row corresponds to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
            * **NOTE: For the DA dataset, a label of `0` indicates that the speaker did not receive any special assignment at the start of the experiment; a label of `1` indicates that the speaker was assigned the role of deceiver (i.e., “devil’s advocate”) at the start of the experiment.**
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
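
As a rough illustration of these requirements, the sketch below writes a toy conversation in the expected layout and runs a few basic checks on it. The helper names (`write_transcript`, `check_transcript`), the toy utterances, and the temporary file are assumptions made for this example; they are not part of ALIGN, and `prepare_transcripts` performs its own checks.

```python
import os
import tempfile

# Toy conversation; speaker labels and utterances are invented for illustration.
turns = [
    ("0", "i think the policy is a good idea"),
    ("1", "really i am not so sure about that"),
    ("0", "well it worked in other places"),
]

def write_transcript(path, turns):
    """Write (speaker, utterance) turns as a tab-delimited file with headers."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("participant\tcontent\n")
        for speaker, utterance in turns:
            f.write(f"{speaker}\t{utterance}\n")

def check_transcript(path):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    with open(path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t") for line in f if line.strip()]
    if not rows or rows[0] != ["participant", "content"]:
        problems.append("header must be: participant<TAB>content")
    previous = None
    for number, row in enumerate(rows[1:], start=1):
        if len(row) != 2:
            problems.append(f"turn {number}: expected 2 columns, found {len(row)}")
            continue
        if row[0] == previous:
            # Rows must alternate between speakers.
            problems.append(f"turn {number}: speaker {row[0]!r} repeats")
        previous = row[0]
    return problems

demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "dyad1_cond1-0-2.txt")
write_transcript(demo_path, turns)
problems = check_transcript(demo_path)
print(problems)
```

An empty list means the toy file has the right header, two columns per row, and alternating speakers.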

## Filename conventions

* Each conversation text file must be named consistently: a dyad prefix followed by the dyad identifier, and a condition prefix followed by the condition identifier, with the two parts separated by a unique character. By default, ALIGN looks for patterns that follow this convention: `dyad1-condA.txt`
    * However, users may choose to include any label for dyad or condition so long as the two labels are distinct from one another and are not subsets of any possible dyad or condition labels. Users may also use any character as a separator so long as it does not occur anywhere else in the filename.
    * The chosen file format **must** be used when saving **all** files for this analysis.

**NOTE: For the DA dataset, each conversation text file is saved in the format of: dyad_condX-Y-Z (e.g., dyad11_cond1-0-2).**

The X, Y, and Z condition codes are defined as follows:

* X = Indicates whether the dyad agreed or disagreed with each other: value of 1 indicates a disagreement conversation, value of 2 indicates an agreement conversation (e.g., `cond1` marks a disagreement conversation).
* Y = Indicates whether the conversation involved deception: value of 0 indicates truth, value of 1 indicates deception.
* Z = Indicates conversation order. Each dyad had two conversations: value of 2 indicates the conversation occurred first, value of 3 indicates the conversation occurred last.
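
To make the DA naming scheme concrete, here is a small sketch that splits a filename into its dyad number and condition codes. The regular expression, the lookup tables, and the function name `parse_da_filename` are assumptions written for this example (including the optional `.txt` extension); ALIGN does its own filename parsing through the label and separator parameters introduced later.

```python
import re

# Pattern for the DA naming scheme: dyad<ID>_cond<X>-<Y>-<Z>, optional ".txt".
DA_FILENAME = re.compile(
    r"^dyad(?P<dyad>\d+)_cond(?P<x>\d)-(?P<y>\d)-(?P<z>\d)(?:\.txt)?$"
)

# Code meanings as listed above.
AGREEMENT = {"1": "disagreement", "2": "agreement"}
VERACITY = {"0": "truth", "1": "deception"}
ORDER = {"2": "first", "3": "last"}

def parse_da_filename(filename):
    """Return the dyad number and decoded condition codes for a DA filename."""
    match = DA_FILENAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return {
        "dyad": int(match["dyad"]),
        "agreement": AGREEMENT[match["x"]],
        "veracity": VERACITY[match["y"]],
        "order": ORDER[match["z"]],
    }

parsed = parse_da_filename("dyad11_cond1-0-2.txt")
print(parsed)
```

A check like this can be run over the downloaded files before Phase 1 to catch names that do not follow the convention.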

## Highest-level functions

Given appropriately prepared transcript files, ALIGN can be run in 3 high-level functions:

**`prepare_transcripts`**: Pre-process each standardized 
conversation, checking it conforms to the requirements. 
Each utterance is tokenized and lemmatized and has 
POS tags added.

**`calculate_alignment`**: Generate turn-level and 
conversation-level alignment scores (lexical, 
conceptual, and syntactic) across a range of 
*n*-gram sequences.

**`calculate_baseline_alignment`**: Generate a surrogate corpus
and run alignment analysis (using identical specifications 
from `calculate_alignment`) on it to produce a baseline.

***

# Setup

## Import libraries

Install ALIGN if you have not already.

In [None]:
import sys
!{sys.executable} -m pip install align

Import packages we'll need to run ALIGN.

In [None]:
import align, os
import pandas as pd

Import `time` so that we can get a sense of how
long the ALIGN pipeline takes.

In [None]:
import time

Import `warnings` to flag us if required files aren't provided.

In [None]:
import warnings

## Install additional NLTK packages

Download some additional `nltk` packages for `align` to work.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

## Specify ALIGN path settings

ALIGN will need to know where the raw transcripts are stored, where to store the processed data, and where to read in any additional files needed for optional ALIGN parameters.

### Required directories

For the sake of this tutorial, specify a base path that will serve as our jumping-off point for our saved data. All of the shipped data will be called from the package directory but the DA transcripts will need to be added manually. 

**`BASE_PATH`**: Containing directory for this tutorial.

In [None]:
BASE_PATH = os.getcwd()

**`DA_EXAMPLE`**: Subdirectories for output and other
files for this tutorial. (We'll create a default directory
if one doesn't already exist.)

In [None]:
DA_EXAMPLE = os.path.join(BASE_PATH,
                          'DA/')

In [None]:
if not os.path.exists(DA_EXAMPLE):
    os.makedirs(DA_EXAMPLE)

**`TRANSCRIPTS`**: Transcript text files must be first downloaded from the ICPSR repository.

Next, set a variable with the folder name (as a string) giving the relative location of the folder into which the downloaded transcript files must be manually added. (We'll create a default directory if one doesn't already exist.)

In [None]:
TRANSCRIPTS = os.path.join(DA_EXAMPLE,
                           'DA-transcripts/')

In [None]:
if not os.path.exists(TRANSCRIPTS):
    os.makedirs(TRANSCRIPTS)

In [None]:
if not os.listdir(TRANSCRIPTS):
    warnings.warn('DA text files not found at the specified '
                  'location. Please download from '
                  'http://dx.doi.org/10.3886/ICPSR37124.v1 '
                  'and add to directory.')    

**`PREPPED_TRANSCRIPTS`**: Set variable for folder name 
(as string) for relative location of folder into which 
prepared transcript files will be saved. (We'll create
a default directory if one doesn't already exist.)

In [None]:
PREPPED_TRANSCRIPTS = os.path.join(DA_EXAMPLE,
                                   'DA-prepped/')

In [None]:
if not os.path.exists(PREPPED_TRANSCRIPTS):
    os.makedirs(PREPPED_TRANSCRIPTS)

**`ANALYSIS_READY`**: Set variable for folder name 
(as string) for relative location of folder into 
which analysis-ready dataframe files will be saved.
(We'll create a default directory if one doesn't
already exist.)

In [None]:
ANALYSIS_READY = os.path.join(DA_EXAMPLE,
                              'DA-analysis/')

In [None]:
if not os.path.exists(ANALYSIS_READY):
    os.makedirs(ANALYSIS_READY)

**`SURROGATE_TRANSCRIPTS`**: Set variable for folder name 
(as string) for relative location of folder into which all
prepared surrogate transcript files will be saved. (We'll
create a default directory if one doesn't already exist.)

In [None]:
SURROGATE_TRANSCRIPTS = os.path.join(DA_EXAMPLE,
                                     'DA-surrogate/')

In [None]:
if not os.path.exists(SURROGATE_TRANSCRIPTS):
    os.makedirs(SURROGATE_TRANSCRIPTS)
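
The directory cells above all follow the same pattern, so they could be condensed into a loop. The sketch below is one way to do that; it uses a temporary folder in place of `BASE_PATH` so that it runs without touching the tutorial's real directories, and the lowercase names (`base_path`, `da_example`, `paths`) are placeholders for this example only.

```python
import os
import tempfile

# A temporary folder stands in for BASE_PATH so the sketch has no side effects.
base_path = tempfile.mkdtemp()
da_example = os.path.join(base_path, "DA")

subfolders = ["DA-transcripts", "DA-prepped", "DA-analysis", "DA-surrogate"]
paths = {name: os.path.join(da_example, name) for name in subfolders}

for path in paths.values():
    # exist_ok=True makes the call a no-op if the folder is already there;
    # intermediate folders (here, DA) are created as needed.
    os.makedirs(path, exist_ok=True)

print(sorted(os.listdir(da_example)))
```

`os.makedirs(..., exist_ok=True)` replaces the separate `os.path.exists` check, which keeps each directory to a single line.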

### Paths for optional parameters

**`OPTIONAL_PATHS`**: If using Stanford POS tagger or
pretrained vectors, the path to these files. If these
files are provided in other locations, be sure to
change the file paths for them. (We'll create a default
directory if one doesn't already exist.)

In [None]:
OPTIONAL_PATHS = os.path.join(DA_EXAMPLE,
                             'optional_directories/')

In [None]:
if not os.path.exists(OPTIONAL_PATHS):
    os.makedirs(OPTIONAL_PATHS)

#### Stanford POS Tagger

The Stanford POS tagger **will not be used** by 
default in this example. However, you may use it
by uncommenting and providing the requested file 
paths in the cells in this section and then changing 
the relevant parameters in the ALIGN calls below.

If desired, we could use the Stanford part-of-speech 
tagger along with the Penn part-of-speech tagger
(which is always used in ALIGN). To do so, the files
will need to be downloaded separately: 
https://nlp.stanford.edu/software/tagger.shtml#Download

**`STANFORD_POS_PATH`**: If using Stanford POS tagger
with the Penn POS tagger, path to Stanford directory.

In [None]:
# STANFORD_POS_PATH = os.path.join(OPTIONAL_PATHS,
#                                  'stanford-postagger-full-2018-10-16/')

In [None]:
# if not os.path.exists(STANFORD_POS_PATH):
#     warnings.warn('Stanford POS directory not found at the specified '
#                       'location. Please update the file path with '
#                       'the folder that can be directly downloaded here: '
#                       'https://nlp.stanford.edu/software/stanford-postagger-full-2018-10-16.zip '
#                       '- Alternatively, comment out the '
#                       '`STANFORD_POS_PATH` information.')

**`STANFORD_LANGUAGE`**: If using Stanford tagger,
set language model to be used for POS tagging.

In [None]:
# STANFORD_LANGUAGE = 'models/english-left3words-distsim.tagger'

In [None]:
# if not os.path.exists(STANFORD_POS_PATH + STANFORD_LANGUAGE):
#     warnings.warn('Stanford tagger language not found at the specified '
#                       'location. Please update the file path or comment '
#                       'out the `STANFORD_POS_PATH` information.')

#### Google News pretrained vectors

The Google News pretrained vectors **will be used**
by default in this example. The file is available for
download here: https://code.google.com/archive/p/word2vec/

If desired, researchers may choose to read in pretrained
`word2vec` vectors rather than creating a semantic space
from the corpus provided. This may be especially useful 
for small corpora (i.e., fewer than 30k unique words),
although the choice of semantic space corpus should be
made with careful consideration about the nature of the
linguistic context (for further discussion, see Duran, 
Paxton, & Fusaroli, 2019).

**`PRETRAINED_INPUT_FILE`**: If using pretrained vectors, path
to pretrained vector files. You may choose to download the file
directly to this path or change the path to a different one.

In [None]:
PRETRAINED_INPUT_FILE = os.path.join(OPTIONAL_PATHS,
                            'GoogleNews-vectors-negative300.bin')

In [None]:
if not os.path.exists(PRETRAINED_INPUT_FILE):
    warnings.warn('Google News vector not found at the specified '
                  'location. Please update the file path with '
                  'the .bin file that can be accessed here: '
                  'https://code.google.com/archive/p/word2vec/ '
                  '- Alternatively, comment out the '
                  '`PRETRAINED_INPUT_FILE` information.')

***

# Phase 1: Prepare transcripts

In Phase 1, we take our raw transcripts and get them ready
for later ALIGN analysis.

## Preparation settings

There are a number of parameters that we can set for the
`prepare_transcripts()` function:

In [None]:
print(align.prepare_transcripts.__doc__)

For the sake of this demonstration, we'll keep everything as
defaults. Among other parameters, this means that:
* any turns with fewer than 2 words will be removed from the corpus
 (`minwords=2`),
* we'll be using regex to strip out any filler words
 (e.g., "uh," "um," "huh"; `use_filler_list=None`),
* if you prefer, you can skip the regex option and supply your own filler words as `use_filler_list=["string1", "string2"]`,
* you can also use the regex together with your own filler words, but be sure to also set `filler_regex_and_list=True`,
* we'll be using the Project Gutenberg corpus to create our 
 spell-checker algorithm (`training_dictionary=None`),
* we'll rely only on the Penn POS tagger because the Stanford tagger is quite slow 
 (`add_stanford_tags=False`), and
* our data will be saved both as individual conversation files
 and as a master dataframe of all conversation outputs
 (`save_concatenated_dataframe=True`).
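
To give a feel for what the filler and `minwords` settings do, here is a minimal sketch of that kind of cleaning. It is not ALIGN's implementation: the filler pattern, the function name `clean_turns`, and the choice to strip fillers before counting words are all assumptions made for this illustration.

```python
import re

# Hypothetical filler pattern for illustration; ALIGN's own regex may differ.
FILLER_PATTERN = re.compile(r"\b(?:u+h+|u+m+|h+u+h+|h+m+)\b", re.IGNORECASE)

def clean_turns(turns, minwords=2, fillers=FILLER_PATTERN):
    """Strip fillers from each (speaker, utterance) turn, then drop short turns."""
    kept = []
    for speaker, utterance in turns:
        words = fillers.sub(" ", utterance).split()
        # Keep only turns that still have at least `minwords` words.
        if len(words) >= minwords:
            kept.append((speaker, " ".join(words)))
    return kept

# Toy turns invented for this example.
turns = [
    ("0", "um i think that is right"),
    ("1", "uh huh"),
    ("0", "so uh what do you think"),
    ("1", "okay"),
]
cleaned = clean_turns(turns)
print(cleaned)
```

In this toy input, the two turns from speaker `1` are dropped: one consists only of fillers and the other has a single word. Dropping turns like this can leave consecutive turns from the same speaker; the sketch makes no attempt to handle that.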

## Run preparation phase

First, we prepare our transcripts by reading in individual `.txt`
files for each conversation, clean up undesired text and turns,
spell-check, tokenize, lemmatize, and add POS tags.

In [None]:
start_phase1 = time.time()

In [None]:
model_store = align.prepare_transcripts(
                        input_files=TRANSCRIPTS,
                        output_file_directory=PREPPED_TRANSCRIPTS,
                        run_spell_check=True,
                        minwords=2,
                        use_filler_list=None,
                        filler_regex_and_list=False,
                        training_dictionary=None,
                        add_stanford_tags=False,
                        ### if you want to run the Stanford POS tagger, be sure to uncomment the next two lines
                        # stanford_pos_path=STANFORD_POS_PATH,
                        # stanford_language_path=STANFORD_LANGUAGE,    
                        save_concatenated_dataframe=True)

In [None]:
end_phase1 = time.time()

***

# Phase 2: Calculate alignment

## For real data: Alignment calculation settings

There are a number of parameters that we can set for the
`calculate_alignment()` function:

In [None]:
print(align.calculate_alignment.__doc__)

For the sake of this tutorial, we'll keep everything as
defaults. Among other parameters, this means that we'll:
* use only unigrams and bigrams for our *n*-grams
 (`maxngram=2`),
* use pretrained vectors instead of creating our own
 semantic space, since our tutorial corpus is quite
 small (`use_pretrained_vectors=True` and
 `pretrained_input_file=PRETRAINED_INPUT_FILE`),
* ignore exact lexical duplicates when calculating
 syntactic alignment (`ignore_duplicates=True`),
* rely only on the Penn POS tagger 
 (`add_stanford_tags=False`), and
* implement high- and low-frequency cutoffs to clean
 our transcript data (`high_sd_cutoff=3` and 
 `low_n_cutoff=1`).
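
To show what "unigrams and bigrams" means in practice, the sketch below compares two toy turns by building *n*-gram counts up to `maxngram` and scoring their overlap with cosine similarity. This is a generic illustration rather than ALIGN's own code: the function names and example sentences are assumptions, and it skips the lemmatizing, POS tagging, and frequency cutoffs that the pipeline applies.

```python
from collections import Counter
from math import sqrt

def ngrams(tokens, n):
    """Return all contiguous n-token sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(counts_a, counts_b):
    """Cosine similarity between two n-gram count dictionaries."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[g] * counts_b[g] for g in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def ngram_similarity(tokens_a, tokens_b, maxngram=2):
    """Score two token lists at every n-gram size from 1 to maxngram."""
    return {
        n: cosine(Counter(ngrams(tokens_a, n)), Counter(ngrams(tokens_b, n)))
        for n in range(1, maxngram + 1)
    }

# Toy turns invented for this example.
turn_a = "i think it is a good idea".split()
turn_b = "i think it is a bad idea".split()
scores = ngram_similarity(turn_a, turn_b)
print(scores)
```

The two toy turns differ by one word, so they score about 0.86 on unigrams (6 of 7 shared) and about 0.67 on bigrams (4 of 6 shared). Longer *n*-grams are stricter, which is why the score drops as *n* grows.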

Whenever we calculate a baseline level of alignment,
we need to include the same parameter values for any
parameters that are present in both `calculate_alignment()`
(this step) and `calculate_baseline_alignment()`
(next step). As a result, we'll specify these here:

In [None]:
# set standards to be used for real and surrogate
INPUT_FILES = PREPPED_TRANSCRIPTS
MAXNGRAM = 2
USE_PRETRAINED_VECTORS = True
SEMANTIC_MODEL_INPUT_FILE = os.path.join(DA_EXAMPLE,
                                         'align_concatenated_dataframe.txt')
PRETRAINED_FILE_DIRECTORY = PRETRAINED_INPUT_FILE
ADD_STANFORD_TAGS = False
IGNORE_DUPLICATES = True
HIGH_SD_CUTOFF = 3
LOW_N_CUTOFF = 1

## For real data: Run alignment calculation

In [None]:
start_phase2real = time.time()

In [None]:
[turn_real,convo_real] = align.calculate_alignment(
                            input_files=INPUT_FILES,
                            maxngram=MAXNGRAM,   
                            use_pretrained_vectors=USE_PRETRAINED_VECTORS,
                            pretrained_input_file=PRETRAINED_INPUT_FILE,
                            semantic_model_input_file=SEMANTIC_MODEL_INPUT_FILE,
                            output_file_directory=ANALYSIS_READY,
                            add_stanford_tags=ADD_STANFORD_TAGS,
                            ignore_duplicates=IGNORE_DUPLICATES,
                            high_sd_cutoff=HIGH_SD_CUTOFF,
                            low_n_cutoff=LOW_N_CUTOFF)

In [None]:
end_phase2real = time.time()

## For surrogate data: Alignment calculation settings

For the surrogate or baseline data, we have many of the same
parameters for `calculate_baseline_alignment()` as we do for
`calculate_alignment()`:

In [None]:
print(align.calculate_baseline_alignment.__doc__)

As mentioned above, when calculating the baseline, it is **vital** 
to include the *same* parameter values for any parameters that 
are included  in both `calculate_alignment()` and 
`calculate_baseline_alignment()`. As a result, we re-use those
values here.

The DA filenames follow the `dyad_condX-Y-Z` format described 
earlier, so surrogate pairs are built from the `dyad` and `cond` 
labels. We also demonstrate how to generate a subset of surrogate 
pairings rather than all possible pairings.

In addition to the parameters that we're re-using from
the `calculate_alignment()` values (see above), we'll 
keep most parameters at their defaults by:
* preserving the turn order when creating surrogate
 pairs (`keep_original_turn_order=True`),
* specifying dyad with the `dyad` prefix
 (`dyad_label='dyad'`), and
* specifying condition with the `cond` prefix 
 (`condition_label='cond'`).
 
However, we will also change some of these defaults,
including:
* generating only a subset of surrogate data equal
 to the size of the real data (`all_surrogates=False`)
 and
* using an underscore, rather than the default hyphen,
 to separate the dyad and condition identifiers
 (`id_separator='\_'`), to match the DA filenames.
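
As a rough picture of what "all possible pairings" versus "a subset equal to the size of the real data" could look like, the sketch below enumerates cross-dyad pairings and then samples from them. It is only an illustration under stated assumptions: the conversation labels are invented, pairing dyads within the same condition is an assumption of this example, and `surrogate_pairings` is a hypothetical helper, not the routine that `calculate_baseline_alignment()` runs.

```python
import itertools
import random

# Hypothetical (dyad, condition) labels for illustration only.
real_conversations = [
    ("dyad1", "cond1-0-2"), ("dyad2", "cond1-0-2"),
    ("dyad3", "cond1-0-2"), ("dyad4", "cond1-0-2"),
    ("dyad5", "cond2-1-3"), ("dyad6", "cond2-1-3"),
    ("dyad7", "cond2-1-3"),
]

def surrogate_pairings(conversations, all_surrogates=True, seed=0):
    """List cross-dyad pairings within each condition, optionally subsampled."""
    by_condition = {}
    for dyad, condition in conversations:
        by_condition.setdefault(condition, []).append(dyad)
    pairings = []
    for condition, dyads in by_condition.items():
        # Every unordered pair of different dyads that share a condition.
        for dyad_a, dyad_b in itertools.combinations(dyads, 2):
            pairings.append((dyad_a, dyad_b, condition))
    if not all_surrogates:
        # Keep a random subset no larger than the number of real conversations.
        rng = random.Random(seed)
        k = min(len(conversations), len(pairings))
        pairings = rng.sample(pairings, k)
    return pairings

every_pairing = surrogate_pairings(real_conversations, all_surrogates=True)
subset = surrogate_pairings(real_conversations, all_surrogates=False)
print(len(every_pairing), len(subset))
```

With these toy labels there are 9 possible pairings and 7 real conversations, so the subset contains 7 pairings. The number of possible pairings grows quickly with the number of dyads, which is the practical reason to subsample.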

## For surrogate data: Run alignment calculation

In [None]:
start_phase2surrogate = time.time()

In [None]:
[turn_surrogate,convo_surrogate] = align.calculate_baseline_alignment(
                                    input_files=INPUT_FILES, 
                                    maxngram=MAXNGRAM,
                                    use_pretrained_vectors=USE_PRETRAINED_VECTORS,
                                    pretrained_input_file=PRETRAINED_INPUT_FILE,
                                    semantic_model_input_file=SEMANTIC_MODEL_INPUT_FILE,
                                    output_file_directory=ANALYSIS_READY,
                                    add_stanford_tags=ADD_STANFORD_TAGS,
                                    ignore_duplicates=IGNORE_DUPLICATES,
                                    high_sd_cutoff=HIGH_SD_CUTOFF,
                                    low_n_cutoff=LOW_N_CUTOFF,
                                    surrogate_file_directory=SURROGATE_TRANSCRIPTS,
                                    all_surrogates=False,
                                    keep_original_turn_order=True,
                                    id_separator=r'\_',
                                    dyad_label='dyad',
                                    condition_label='cond')

In [None]:
convo_surrogate

In [None]:
end_phase2surrogate = time.time()

***

# ALIGN output overview

## Speed calculations

As promised, let's take a look at how long it takes to run each section. Time is given in seconds.

**Phase 1:**

In [None]:
end_phase1 - start_phase1

**Phase 2, real data:**

In [None]:
end_phase2real - start_phase2real

**Phase 2, surrogate data:**

In [None]:
end_phase2surrogate - start_phase2surrogate

**All phases:**

In [None]:
end_phase2surrogate - start_phase1
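
The "All phases" difference is wall-clock time from the start of Phase 1 to the end of the surrogate run, so it also counts any pauses between running cells. If only compute time is of interest, one option is to record each phase's duration and sum them. The sketch below shows that pattern; the `timed` helper is a hypothetical convenience for this example, and `time.sleep` stands in for the ALIGN calls.

```python
import time

def timed(label, func, durations):
    """Run func(), store its elapsed seconds under label, and return its result."""
    start = time.perf_counter()
    result = func()
    durations[label] = time.perf_counter() - start
    return result

durations = {}
# time.sleep stands in for the ALIGN calls so the sketch runs quickly.
timed("phase1", lambda: time.sleep(0.01), durations)
timed("phase2_real", lambda: time.sleep(0.01), durations)
timed("phase2_surrogate", lambda: time.sleep(0.01), durations)

total = sum(durations.values())
for label, seconds in durations.items():
    print(f"{label}: {seconds:.2f} s")
print(f"total compute time: {total:.2f} s")
```

`time.perf_counter()` is used here because it is designed for measuring short intervals; `time.time()` as used in the cells above works as well for runs that take minutes.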

## Printouts!

And that's it! Before we go, let's take a look at the output from the real data analyzed at the turn level for each conversation (`turn_real`) and at the conversation level for each dyad (`convo_real`). We'll then look at our surrogate data, analyzed both at the turn level (`turn_surrogate`) and at the conversation level (`convo_surrogate`). In our next step, we would then take these data and plug them into our statistical model of choice. As an example of how this was done for Duran, Paxton, and Fusaroli (2019, *Psychological Methods*, https://doi.org/10.1037/met0000206) please visit: https://osf.io/3TGUF/ 

In [None]:
turn_real.head(10)

In [None]:
convo_real.head(10)

In [None]:
turn_surrogate.head(10)

In [None]:
convo_surrogate.head(10)