# To-do list

**Things for Alex to do:**
* [ ] Handling requirements (after getting them)
* [ ] Dockerizing
* [ ] Jupyter app-ifying
* [ ] Getting Stanford tagger included automatically
* [ ] Clean up markdown text (when final notebooks are ready)
* [ ] See if I can implement w2v function (https://github.com/a-paxton/Gensim-LSI-Word-Similarities)
* [ ] Convert functions into library
* [ ] When run analysis, run syntax over token, lemma is weird

**Things for Nick to do:**
* [x] Implement surrogate to match by conversation order AND conversation type
* [x] Make file names more intuitive
* [ ] Identify condition/dyad/number flexibly (using regex) - SKIPPED
* [x] Allow surrogate baseline to be created using a smaller subset (permutations) — 2-3x?
* [**???**] Do pip freeze or conda list -e > req.txt
* [**???**] Redo analysis with new baseline + consider doing sample-wise shuffled baseline - Have questions on how to proceed
* [ ] Go over manuscript again with new baseline + review comments/edits
* [ ] Need to create a simple other_filler_list as a text file that can be modified by a user and imported to be used here - make note that we only catch 2-letter fillers at this point with the regular expression default 
* [ ] Note that align_concatenated_dataframe.txt takes the place of forSemantic.txt. Make updates accordingly. 
* [ ] We could/should probably make `convobyconvo` an optional add-on from `turnbyturn`.
* [ ] Consider other POS taggers: https://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag

***

# ALIGN

This notebook provides an introduction to **ALIGN**, a tool for quantifying multi-level linguistic similarity between speakers. 

***

**Table of Contents**:

* [Getting Started](#Getting-Started)
    * [Prerequisites](#Prerequisites)
    * [Preparing input data](#Preparing-input-data)
    * [Filename conventions](#Filename-conventions)
    * [User-specified parameters](#User-specified-parameters)
    * [Main calls](#Main-calls)
* [Setup](#Setup)
    * [Import libraries](#Import-libraries)
    * [User-specified settings](#User-specified-settings)
* [Phase 1: Generate "prepped" transcripts](#Phase-1:-Generate-"prepped"-transcripts)
* [Phase 2: Generate alignment scores](#Phase-2:-Generate-alignment-scores)

***

# Getting Started

### Prerequisites

* Jupyter Notebook with Python 2.7.13 kernel
* Packages in `requirements.txt`

*See notes in "DISTRIBUTION ISSUES" Notebook for suggestions on how to package effectively and accommodate Python 3 users*

### Preparing input data

* Each input text file needs to contain a single conversation organized in an `N x 2` matrix
    * Text file must be tab-delimited.
* Each row must correspond to a single conversational turn from a speaker.
    * Rows must be temporally ordered based on their occurrence in the conversation.
    * Rows must alternate between speakers.
* Speaker identifier and content for each turn are divided across two columns.
    * Column 1 must have the header `participant`.
        * Each cell specifies the speaker.
        * Each speaker must have a unique label (e.g., `P1` and `P2`, `0` and `1`).
    * Column 2 must have the header `content`.
        * Each cell corresponds to the transcribed utterance from the speaker.
        * Each cell must end with a newline character: `\n`
* See the folder `examples > toy_data-original` in the GitHub repository for an example.
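
The layout rules above can be checked before running ALIGN. The following is a minimal sketch, not part of ALIGN itself: `check_transcript` and the `SAMPLE` text are hypothetical names introduced only for illustration, and the check covers just the header, the two-column layout, and speaker alternation.

```python
import csv

# Hypothetical two-speaker transcript in the tab-delimited layout described above.
SAMPLE = (
    "participant\tcontent\n"
    "P1\thello how are you today\n"
    "P2\ti am doing well thanks\n"
    "P1\tglad to hear that\n"
)

def check_transcript(lines):
    """Return a list of format problems found in a tab-delimited transcript."""
    rows = list(csv.reader(lines, delimiter="\t"))
    problems = []
    if not rows or rows[0] != ["participant", "content"]:
        problems.append("header must be exactly: participant<TAB>content")
        return problems
    turns = rows[1:]
    # every turn needs exactly two cells: speaker label and utterance
    for number, row in enumerate(turns, start=1):
        if len(row) != 2:
            problems.append("turn %d does not have exactly 2 columns" % number)
    # adjacent turns should come from different speakers
    speakers = [row[0] for row in turns if len(row) == 2]
    for number in range(1, len(speakers)):
        if speakers[number] == speakers[number - 1]:
            problems.append("turns %d and %d share a speaker" % (number, number + 1))
    return problems

print(check_transcript(SAMPLE.splitlines()))
```

An empty list means no problems were found. Non-alternating turns are reported here only as a warning, since Phase 1 merges adjacent turns by the same participant.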

### Filename conventions

* Each conversation text file needs to be named in the format: `A_B.txt`
    * `A` corresponds to the dyad number for that conversation
    * `B` corresponds to a condition code for that conversation
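
As an illustration of the `A_B.txt` convention, the sketch below splits a filename into its dyad and condition parts. `parse_filename` and `FILENAME_PATTERN` are hypothetical helpers, not ALIGN functions, and the pattern assumes the dyad label itself contains no underscore.

```python
import os
import re

# Assumed pattern: everything before the first underscore is the dyad,
# everything after it (up to ".txt") is the condition code.
FILENAME_PATTERN = re.compile(r"^(?P<dyad>[^_]+)_(?P<condition>.+)\.txt$")

def parse_filename(path):
    """Return (dyad, condition) for a file named in the A_B.txt format."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    if match is None:
        raise ValueError("expected a name like A_B.txt, got: %s" % path)
    return match.group("dyad"), match.group("condition")

print(parse_filename("TRANSCRIPTS/10_2.txt"))
```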

### User-specified parameters

* Define input path
* Define input folder where original transcripts are located
* Define folder to save prepped transcripts 
* Define folder to save surrogate transcripts 
* Decide maximum size for n-gram chunking
    * Default: 3
* Decide the minimum number of words for each turn
    * Default: 3
    * CRITICAL: The minimum number of words must be at least as large as the maximum n-gram size; otherwise, an error will be generated.
* Decide on whether to run the Stanford tagger along with NLTK default tagger (slow) or NLTK tagger alone (fast)
    * Default: 0 (NLTK tagger alone)
    
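* Decide how to remove speech fillers (`USE_FILLER_LIST`)
    * Default: 0 (remove 2-letter fillers with regular expressions)
    * Alternative: 1 (remove the words in a user-generated list)
    * Replaces the earlier `remove_regex_fillers` and `remove_other_list` settings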
    
* Decide on max delay between partner's turns to generate alignment score 
    * Currently only option is for contiguous turns
    * Will be updated in a future version

### Main calls

`PHASE1RUN`

* Converts each conversation into standardized format.
* Each utterance is tokenized and lemmatized and has POS tags added.

`PHASE2RUN_REAL`

* Generates turn-level and conversation-level alignment scores (lexical, conceptual, and syntactic) across a range of n-gram sequences

`PHASE2RUN_SURROGATE`

* Generates a surrogate corpus.
* Runs identical analysis as PHASE2RUN_REAL on the surrogate corpus.

### Checking latest version of Python and 3rd-party packages

In [125]:
import pandas
import numpy
import scipy
import nltk
import gensim 

print("Pandas Version Info:\n{}".format(pandas.__version__))
print("Numpy Version Info:\n{}".format(numpy.__version__))
print("Scipy Version Info:\n{}".format(scipy.__version__))
print("NLTK Version Info:\n{}".format(nltk.__version__))
print("Gensim Version Info:\n{}".format(gensim.__version__))

import sys
print("Python and Conda Environment Info:\n{}".format(sys.version))


Pandas Version Info:
0.21.1
Numpy Version Info:
1.11.3
Scipy Version Info:
0.19.0
NLTK Version Info:
3.2.5
Gensim Version Info:
3.1.0
Python and Conda Environment Info:
2.7.13 |Anaconda 2.3.0 (x86_64)| (default, Dec 20 2016, 23:05:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]


***

# Setup

Here, we'll get ready to run ALIGN over our target dataset.

[To top](#ALIGN).

## Import libraries

### Standard libraries

In [126]:
import os,re,math,csv,string,random,logging,glob,ast,itertools,operator
from os import listdir 
from os.path import isfile, join 
from collections import Counter, defaultdict 
from itertools import chain, combinations

### Third-party libraries

For data analysis and data handling:

In [127]:
import pandas as pd
import numpy as np
from scipy import spatial 

For natural language processing:

In [128]:
import nltk
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet as wn 
from nltk.tag.stanford import StanfordPOSTagger
from nltk.util import ngrams

Specify the NLTK default POS tagger

In [129]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/nduran/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**Note:** With older versions of NLTK (pre-3.1), the maxent_treebank_pos_tagger is also available: `nltk.download('maxent_treebank_pos_tagger')`

Specify additional POS tagger: Stanford

**Note**: The `StanfordPOSTagger` is used in conjunction with the local folder `stanford-postagger-2015-04-20` and its `.jar` file. Both are called below if the analysis is run with the Stanford tagger.

For building semantic space:

In [130]:
from gensim.models import word2vec 

## User-specified settings

### Directories and folders

Set working directory, in which all notebook and supporting files are located.

Or "Pathname for the unzipped project folder" if going with `anaconda-project.yml` configuration file

In [131]:
INPUT_PATH=os.getcwd()+'/'

Set variable for folder name (as string) for relative location of folder containing the original transcript files.

In [132]:
# TRANSCRIPTS = 'toy_data-original/'
TRANSCRIPTS = "TRANSCRIPTS/"

Set variable for folder name (as string) for relative location of folder into which prepared transcript files will be saved.

In [133]:
# PREPPED_TRANSCRIPTS = 'toy_data-prepped/'
PREPPED_TRANSCRIPTS = "PREPPED_TRANSCRIPTS/"

Set variable for folder name (as string) for relative location of folder into which analysis-ready dataframe files will be saved.

In [134]:
# ANALYSIS_RESULTS = 'toy_data-analysis/'
ANALYSIS_RESULTS = "ANALYSIS_RESULTS/"

Set variable for folder name (as string) for relative location of folder into which all prepared surrogate transcript files will be saved.

In [135]:
# SURROGATE_TRANSCRIPTS = 'toy_data-surrogate/'
SURROGATE_TRANSCRIPTS = "SURROGATE_TRANSCRIPTS/"

### Analysis settings

Set maximum size for n-gram chunking. (Default: 3)

In [136]:
MAXNGRAM = 3

Set minimum number of words for each turn. (Default: 3)

CRITICAL: The minimum number of words must be at least as large as the maximum n-gram size.


In [137]:
MINWORDS = 3
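
The constraint between the two settings can be checked up front. This is a small sketch under the assumption that failing early is preferable to an error later in the pipeline; `check_ngram_settings` is a hypothetical helper, not an ALIGN function.

```python
def check_ngram_settings(maxngram, minwords):
    """Raise ValueError if turns could be shorter than the largest n-gram."""
    if minwords < maxngram:
        raise ValueError(
            "MINWORDS (%d) must be at least MAXNGRAM (%d)" % (minwords, maxngram))
    return True

# the default settings satisfy the constraint
check_ngram_settings(maxngram=3, minwords=3)
```

The reasoning behind the rule: a turn with fewer than `n` words contains no n-grams of size `n`, so it cannot contribute to a comparison at that size.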

Choose POS tagger. 

* DEFAULT: Enter `0` to run only the NLTK default POS tagger (`averaged_perceptron_tagger` in current NLTK versions; `maxent_treebank_pos_tagger` in versions before 3.1)
* Enter `1` to run both NLTK default POS tagger and Stanford POS tagger. Adding the Stanford POS tagger will lead to an increase in processing time. 

In [138]:
ADD_STANFORD_TAGS = 1

Set max delay between partner's turns when generating alignment score. Currently, the only acceptable value is 1 (i.e., contiguous turns).

In [139]:
DELAY = 1

Choose the method for removing speech fillers.

* DEFAULT: Enter `0` to remove 2-letter fillers via regular expressions.
* Enter `1` to remove the words listed in the user-supplied `fillers.txt` file instead.

In [140]:
USE_FILLER_LIST = 0

***

# Phase 1: Generate "prepped" transcripts

## Initial clean-up

* **[Clean up text](#Clean-up-text)** by removing:
    * numbers, punctuation, and other non-ASCII alphabet characters
    * common speech fillers (e.g., "um", "huh") and their derivations
    * empty turns that may have inadvertently been included
    * user-specified short turns
        * removes turns with fewer words than `MINWORDS`, which must be at least as large as the maximum n-gram size
* **[Merge adjacent turns by the same participant](#Merge-adjacent-turns-by-the-same-participant)** into a single utterance row.

[To top](#ALIGN).

### Clean up text

**ND** A problem in the initial function was that empty turns were throwing an error in Line 30. It is necessary to drop empty turns before removing fillers (see Line 27).

**ND** Corrected problem in how list comprehension on Line 40 was searching through filler_list.

**ALEX** Is there a better way of handling Line 19? Add it to the main function?

**ND** Also changed how we count the number of words. Before, I was using regular expressions that counted something like "let's" as one word, when it should be considered two words ("let us") because that is what the n-gram analysis is based on. To keep the definition of a word consistent across functions, words are now counted with the NLTK tokenizer (Line 51), which breaks something like "let's" into two words.

In [141]:
def InitialCleanup(dataframe,
                        MINWORDS=3,
                        USE_FILLER_LIST=0):
    
    """
    Perform basic text cleaning to prepare dataframe
    for analysis. Remove non-letter/-space characters
    (apostrophes are kept), empty turns, turns below a
    minimum length (default: 3 words), and fillers.
    
    By default, remove 2-letter fillers through regex
    (`USE_FILLER_LIST=0`).
    
    If desired, skip regex filtering and instead remove
    the words (e.g., fillers) listed in `fillers.txt`
    with `USE_FILLER_LIST=1`.
    """
    
    filler_list=ast.literal_eval(file(INPUT_PATH+'fillers.txt').read().lower())
    
    # only allow letters, apostrophes, and spaces to pass
    WHITELIST = string.letters + '\'' + ' '
    clean = []
    utteranceLen = []
     
    ## remove inadvertent empty turns 
    dataframe = dataframe[pd.notnull(dataframe['content'])]
    
    for value in dataframe['content'].values:            
        cleantext = ''.join(c for c in value if c in WHITELIST).lower() 

        ## remove typical speech fillers, examples: "um, mm, oh, hm, uh"
        if USE_FILLER_LIST == 0:
            cleantext = re.sub('^[uhmo]+[mh]+\s', ' ', cleantext) ##// at the start of a string
            cleantext = re.sub('\s[uhmo]+[mh]+\s', ' ', cleantext) ##// within a string 
            cleantext = re.sub('\s[uhmo]+[mh]$', ' ', cleantext) ##// end of a string 
            
        # OPTIONAL: remove speech fillers or other words specified by user in a list
        if USE_FILLER_LIST == 1:
            cleantext = [word for word in cleantext.split(" ") if word not in filler_list]
            cleantext = " ".join(cleantext)
                    
        # append cleaned lines
        clean.append(cleantext)        
                
    ## drop the old "content" column and add the clean "content" column
    dataframe = dataframe.iloc[:, [0,1]]
    dataframe['content'] = clean
        
    ## remove rows that are now blank or do not meet "MINWORDS" requirement, then drop length column    
    dataframe['utteranceLen'] = dataframe['content'].apply(lambda x: word_tokenize(x)).str.len()
    dataframe = dataframe.drop(dataframe[dataframe.utteranceLen < int(MINWORDS)].index)
    dataframe = dataframe.iloc[:, [0,1]]
        
    # return the cleaned dataframe    
    return dataframe
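
To see what the default filler patterns do in practice, the sketch below applies the same three regular expressions, in the same order, to a few made-up turns. `strip_fillers` is a hypothetical wrapper written only for this illustration.

```python
import re

def strip_fillers(text):
    """Apply the three default filler patterns in the same order as above."""
    text = re.sub(r'^[uhmo]+[mh]+\s', ' ', text)   # filler at the start of the turn
    text = re.sub(r'\s[uhmo]+[mh]+\s', ' ', text)  # filler in the middle of the turn
    text = re.sub(r'\s[uhmo]+[mh]$', ' ', text)    # filler at the end of the turn
    return text

print(strip_fillers("um i think uh we should go hm").split())
print(strip_fillers("i um uh think").split())
print(strip_fillers("my mom said hmm").split())
```

Two behaviors are worth knowing when choosing between the regex and a filler list. When two fillers are adjacent, only the first is removed, because the space between them is consumed by the first match. The character classes also match some longer strings and real words such as "mom".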

### Merge adjacent turns by the same participant

In [142]:
def AdjacentMerge(dataframe):
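    """
    Given a dataframe of conversation turns, merge adjacent
    turns by the same participant into a single utterance
    row. Repeat until no two adjacent rows share a participant.
    """    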
    
    repeat=1
    while repeat==1:
        l1=len(dataframe) 
        DfMerge = []
        k = 0
        if len(dataframe) > 0:
            while k < len(dataframe)-1: 
                if dataframe['participant'].iloc[k] != dataframe['participant'].iloc[k+1]:
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])         
                    k = k + 1
                elif dataframe['participant'].iloc[k] == dataframe['participant'].iloc[k+1]:                    
                    DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k] + " " + dataframe['content'].iloc[k+1]])           
                    k = k + 2   
            if k == len(dataframe)-1:
                DfMerge.append([dataframe['participant'].iloc[k], dataframe['content'].iloc[k]])      
        
        dataframe=pd.DataFrame(DfMerge,columns=('participant','content'))
        if l1==len(dataframe): 
            repeat=0 
                
    return dataframe
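
The effect of the merge can be illustrated without pandas. The following sketch uses `itertools.groupby` on plain `(participant, content)` tuples; it is an alternative formulation for illustration, not the implementation used above, and `merge_adjacent` is a hypothetical name.

```python
from itertools import groupby

def merge_adjacent(turns):
    """Merge consecutive (participant, content) pairs from the same speaker."""
    merged = []
    # groupby collects runs of adjacent turns that share a speaker label
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        merged.append((speaker, " ".join(content for _, content in group)))
    return merged

turns = [("P1", "hello there"), ("P1", "how are you"),
         ("P2", "fine thanks"),
         ("P1", "good"), ("P1", "really"), ("P1", "good")]
print(merge_adjacent(turns))
```

A run of three or more turns by one speaker collapses into a single row in one pass, which matches the end result of the repeated pairwise merging above.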

## Prepare transcript text

* **[Check spelling](#Check-spelling)** via a Bayesian spell-checking algorithm (http://norvig.com/spell-correct.html).
* **[Tokenize and apply spell correction](#Tokenize-and-apply-spell-correction)** to the original transcript text.
* **[Lemmatize](#Lemmatize)** using WordNet-derived categories.
* [**Part-of-speech tagging**](#Part-of-speech-tagging) with user-defined tagger(s) on both lemmatized and non-lemmatized tokens.
    * Users may choose to use the NLTK default POS tagger (default) and/or the Stanford POS tagger (optional). The NLTK default tagger is more time-efficient.

**ND** For tokenization, I noticed a problem: when you tokenize any contraction, like "let's," the tokenizer creates "let" and "'s", and when the spell checker sees "'s" it automatically changes it to "is" when it should be "us." To minimize such problems, Lines 29-37 now import a list of common contractions and their expanded forms and convert it to a dictionary, and Line 41 does the conversion. For contractions not in the list, see Line 44: the spell checker now skips any token containing an apostrophe (e.g., "'s"). I should note that this is only a real issue if the original transcription avoids contractions altogether, but this is a somewhat unrealistic expectation.

**ALEX** Should lines 29-37 be handled in main function?
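
The contraction handling described in the note can be sketched in isolation. The dictionary below is a hypothetical three-entry stand-in for the contents of `contractions.txt`, and escaping and length-sorting the keys is an extra precaution assumed here rather than something taken from the notebook.

```python
import re

# Hypothetical stand-in for the dictionary loaded from `contractions.txt`.
CONTRACTIONS = {"let's": "let us", "can't": "cannot", "i'm": "i am"}

# Longer keys first, so a short key can never pre-empt a longer one.
_pattern = re.compile("|".join(
    re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True)))

def expand_contractions(text):
    """Lowercase the text and replace every known contraction."""
    return _pattern.sub(lambda match: CONTRACTIONS[match.group(0)], text.lower())

print(expand_contractions("Let's go, I'm sure we can't lose"))
```

Expanding before tokenizing means the spell checker never sees a bare "'s" for any contraction that is in the list.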

### Tokenize and apply spell correction

In [143]:
def Tokenize(text,nwords):
    """
    Given list of text to be processed and a list 
    of known words, return a list of edited and 
    tokenized words.
    """
    
    # internal function: identify possible spelling errors for a given word
    def edits1(word): 
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
        replaces   = [a + c + b[1:] for a, b in splits for c in string.lowercase if b]
        inserts    = [a + c + b     for a, b in splits for c in string.lowercase]
        return set(deletes + transposes + replaces + inserts)

    # internal function: identify known edits
    def known_edits2(word,nwords):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in nwords)

    # internal function: identify known words
    def known(words,nwords): return set(w for w in words if w in nwords)

    # internal function: correct spelling
    def correct(word,nwords):
        candidates = known([word],nwords) or known(edits1(word),nwords) or known_edits2(word,nwords) or [word]
        return max(candidates, key=nwords.get)

    # expand out based on a fixed list of common contractions 
    contract_list=INPUT_PATH+'contractions.txt'
    contract_dict = ast.literal_eval(file(contract_list).read().lower())
    contractions_re = re.compile('(%s)' % '|'.join(contract_dict.keys()))      
    # internal function:    
    def expand_contractions(text, contractions_re=contractions_re):
        def replace(match):
            return contract_dict[match.group(0)]
        return contractions_re.sub(replace, text.lower())

    # process all words in the text
    cleantoken = []
    text = expand_contractions(text)
    token = word_tokenize(text)
    for word in token:        
        if "'" not in word:
            cleantoken.append(correct(word,nwords))
        else:
            cleantoken.append(word) 

#     cleantoken = []
#     token = word_tokenize(text)    
#     for word in token:
#         cleantoken.append(correct(word,nwords))

    return cleantoken
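
A compact, self-contained version of the spell-correction idea may help when reading `Tokenize`. This sketch assumes a tiny made-up corpus in place of `big.txt` and omits the two-edit step for brevity, so it is a simplification rather than the function above.

```python
import re
import string
from collections import Counter

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in letters if b]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known candidate, or the word itself."""
    known = lambda words: set(w for w in words if w in counts)
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=lambda w: counts[w])

# toy training text standing in for the real training dictionary
counts = Counter(re.findall(r"[a-z]+", "the cat saw the dog and the cat ran"))
print(correct("teh", counts), correct("cat", counts), correct("xyzzy", counts))
```

Known words pass through unchanged, a one-edit typo is mapped to the most frequent matching word, and a string with no known neighbors is left as is.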

### Lemmatize

In [144]:
def pos_to_wn(tag):
    """
    Convert NLTK default tagger output into a format that Wordnet
    can use in order to properly lemmatize the text.
    """
    
    # create some inner functions for simplicity
    def is_noun(tag):
        return tag in ['NN', 'NNS', 'NNP', 'NNPS']
    def is_verb(tag):
        return tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    def is_adverb(tag):
        return tag in ['RB', 'RBR', 'RBS']
    def is_adjective(tag):
        return tag in ['JJ', 'JJR', 'JJS']
    
    # check each tag against possible categories
    if is_noun(tag):
        return wn.NOUN
    elif is_verb(tag):
        return wn.VERB
    elif is_adverb(tag):
        return wn.ADV
    elif is_adjective(tag):
        return wn.ADJ
    else:
        return wn.NOUN
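
For the Penn Treebank tags listed above, the same mapping can be expressed as a lookup on the first two characters of the tag. This is an alternative sketch, not a replacement for `pos_to_wn`; it assumes the one-letter codes below match NLTK's `wn.NOUN`, `wn.VERB`, `wn.ADV`, and `wn.ADJ` constants, and it avoids importing NLTK so that it runs on its own.

```python
# Assumption: these one-letter codes match NLTK's wn.NOUN, wn.VERB,
# wn.ADV, and wn.ADJ constants.
NOUN, VERB, ADV, ADJ = "n", "v", "r", "a"

PREFIX_TO_WN = {"NN": NOUN, "VB": VERB, "RB": ADV, "JJ": ADJ}

def pos_to_wn_prefix(tag):
    """Map a Penn Treebank tag to a WordNet category, defaulting to noun."""
    return PREFIX_TO_WN.get(tag[:2], NOUN)

print([pos_to_wn_prefix(tag) for tag in ["NNS", "VBD", "RBR", "JJ", "DT"]])
```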

In [145]:
def Lemmatize(tokenlist):
    lemmatizer = WordNetLemmatizer() 
    defaultPos = nltk.pos_tag(tokenlist) # get the POS tags from NLTK default tagger
    words_lemma = []
    for item in defaultPos:  
        words_lemma.append(lemmatizer.lemmatize(item[0],pos_to_wn(item[1]))) # need to convert POS tags to a format (NOUN, VERB, ADV, ADJ) that wordnet uses to lemmatize
    return words_lemma

**PROBLEM** In the original version, there was a problem with ApplyTokenLemmatize at Line 7. When you run Line 6, which shows the output of the Tokenize function, everything looks correct. However, when appending to the dataframe within the loop (Line 7), it either fails to save or saves as a null list. This failure can be seen in the output of Line 8. However, for reasons I do not know, this problem only occurs when utterances are dropped for not meeting the minimum word requirement. That is why we did not see any issues when MINWORDS was set to 1 (in the toy dataset, there were no instances of single words).

In [146]:
# def ApplyTokenLemmatize(df,nwords):
#     df['token'] = ""
#     df['lemma'] = ""
#     for i in range(0,len(df)):
        
#         print("What SHOULD be returned from tokenization function:",Tokenize(df['content'].iloc[i],nwords))
#         df['token'].iloc[i]=Tokenize(df['content'].iloc[i],nwords) ### <<<<<< PROBLEM IS HERE! 
#         print("What IS actually being saved to the updated df:",df['token'].iloc[i])
        
        
#         df['lemma'].iloc[i]=Lemmatize(df['token'].iloc[i])  
#     return df

**ND** Corrected version of ApplyTokenLemmatize

In [147]:
def ApplyTokenLemmatize(df,nwords):
    token = []
    lemma = []
    for i in range(0,len(df)):
        tokens = Tokenize(df['content'].iloc[i],nwords) # tokenize once, then reuse for lemmatizing
        token.append(tokens)
        lemma.append(Lemmatize(tokens))
    df['token'] = token
    df['lemma'] = lemma     
    return df

### Part-of-speech tagging

**ALEX** Is this the best way of handling ADD_STANFORD_TAGS by putting it in the function? If I want to change to run analyses with or without Stanford I have to update it across multiple functions. It would be much easier to just declare it **once** as "0" or "1" early (as done in the "Analysis Settings" section) and it ensures that it is consistent across all functions that rely on it. Would help avoid mistakes, no?

**ND** Note that dropping "Penn" in tagger name when saving to file because it turns out this is the wrong conceptualization. What "Penn" really corresponds to is the default NLTK tagger, which happens to be `averaged_perceptron_tagger` for the updated NLTK version. 

In [148]:
def ApplyPOSTagging(df,filename,ADD_STANFORD_TAGS=0):

    """
    Apply part-of-speech tagging to a dataframe of conversation turns 
    (df). Pass filename as a string to create a new df variable. 
    By default, return only tags from the NLTK default POS tagger. Optionally,
    also return Stanford POS tagger results by setting  
    `ADD_STANFORD_TAGS=1`.
    """
    
    # create new columns in our dataframe
    df['tagged_token'] = ""
    df['tagged_lemma'] = ""
    df['file'] = ""
    if ADD_STANFORD_TAGS == 1:
        df['tagged_stan_token'] = ""
        df['tagged_stan_lemma'] = ""
        
    # if desired, import Stanford tagger
    if ADD_STANFORD_TAGS == 1:
        stanford_tagger = StanfordPOSTagger(INPUT_PATH + 'stanford-postagger-2015-04-20/models/english-bidirectional-distsim.tagger',
                                            INPUT_PATH + 'stanford-postagger-2015-04-20/stanford-postagger.jar')
    
    # cycle through each line in the dataframe
    for i in range(0,len(df)):
        df['file'].iloc[i]=filename

        # by default, tag with the NLTK default POS tagger
        pos_token=nltk.pos_tag(df['token'].iloc[i])
        df['tagged_token'].iloc[i]=pos_token 
        pos_lemma=nltk.pos_tag(df['lemma'].iloc[i])
        df['tagged_lemma'].iloc[i]=pos_lemma 

        # if desired, also tag with Stanford tagger
        if ADD_STANFORD_TAGS == 1:
            pos_stan_token=stanford_tagger.tag(df['token'].iloc[i])
            df['tagged_stan_token'].iloc[i]=pos_stan_token    
            pos_stan_lemma=stanford_tagger.tag(df['lemma'].iloc[i])
            df['tagged_stan_lemma'].iloc[i]=pos_stan_lemma  

    # return finished dataframe
    return df

## RUN Phase 1

* For each original transcript file, saves new file with columns for:
    * "Clean" text
    * Tokenized words
    * Tokenized lemmatized-words
    * NLTK default POS-tagging on tokenized words
    * NLTK default POS-tagging on lemmatized words
    * Stanford POS-tagging on tokenized words
    * Stanford POS-tagging on lemmatized-words
* Also saves a single datasheet with all tokenized lemmatized utterances from all transcripts as individual rows
    * called align_concatenated_dataframe.txt
    * to be used in building semantic space for Phase 2

In [152]:
def PHASE1RUN(input_file_directory, 
                   output_file_directory,
                   training_dictionary=INPUT_PATH+'big.txt',
                   ADD_STANFORD_TAGS=0):   

    """
    Given a directory of individual .txt files, 
    return a completely prepared dataframe of transcribed 
    conversations for later ALIGN analysis, including: text 
    cleaning, merging adjacent turns, spell-checking, 
    tokenization, lemmatization, and part-of-speech tagging. 
    By default, return only the NLTK default 
    POS tagger values; optionally, also return Stanford POS tagger
    values with `ADD_STANFORD_TAGS=1`.
    """
    
    # create an internal function to train the model
    def train(features): 
        model = defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model
        
    # train our spell-checking model
    nwords = train(re.findall('[a-z]+',(file(training_dictionary).read().lower())))
    
    # cycle through all files 
#     import glob
    file_list = glob.glob(input_file_directory+"*.txt")
#     file_list = glob.glob(input_file_directory+"dyad_10-condition_2.txt")
    
    main = []
    for fileName in file_list:      
        
        # let us know which file we're processing
        print("Processing: "+fileName)
        dataframe = pd.read_csv(fileName, sep='\t',encoding='utf-8')

        # clean up, merge, spellcheck, tokenize, lemmatize, and POS-tag
        dataframe = InitialCleanup(dataframe)
        dataframe = AdjacentMerge(dataframe)
        dataframe = ApplyTokenLemmatize(dataframe,nwords)
        
        dataframe = ApplyPOSTagging(dataframe, 
                                    os.path.basename(fileName), 
                                    ADD_STANFORD_TAGS)
        
        # export the conversation's dataframe as a CSV
        dataframe.to_csv(output_file_directory + os.path.basename(fileName), 
                         encoding='utf-8',index=False,sep='\t')
        main.append(dataframe)

    # save the concatenated dataframe
    main = pd.concat(main, axis=0)
    main.to_csv(output_file_directory +'../' + "align_concatenated_dataframe.txt",encoding='utf-8',
                index=False, sep='\t')
    
    # return the dataframe
    return main

In [156]:
# PHASE1RUN(input_file_directory=INPUT_PATH+TRANSCRIPTS,
#                       output_file_directory=INPUT_PATH+PREPPED_TRANSCRIPTS,
#                       training_dictionary=INPUT_PATH+'big.txt')

***

# Phase 2: Generate alignment scores

* [**Create helper functions**](#Create-helper-functions) for processing turn- and conversation-level data.
* **[Build semantic space](#Build-semantic-space)** from the `align_concatenated_dataframe.txt` file generated in Phase 1 and return a `word2vec` semantic space and vocabulary list.

[To top.](#ALIGN)

### Create helper functions

In [601]:
def ngram_pos(sequence1,sequence2,
                   ignore_duplicates=True,
                   ngramsize=2):
    """
    Remove mimicked lexical sequences from two interlocutors'
    sequences and return a dictionary of counts of ngrams
    of the desired size for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.
    
    By default, ignore duplicate lexical n-grams when
    processing these sequences. If desired, this may
    be changed with `ignore_duplicates=False`.
    """     

    # generate the set of unique ngrams for each sequence
    sequence1 = set(ngrams(sequence1,ngramsize))
    sequence2 = set(ngrams(sequence2,ngramsize))
 
    # if desired, remove ngrams that appear in both sequences
    if ignore_duplicates==True:
        new_sequence1 = [tuple([' '.join(pair) for pair in tup]) for tup in list(sequence1 - sequence2)]
        new_sequence2 = [tuple([' '.join(pair) for pair in tup]) for tup in list(sequence2 - sequence1)]
    else:
        new_sequence1 = [tuple([' '.join(pair) for pair in tup]) for tup in sequence1]
        new_sequence2 = [tuple([' '.join(pair) for pair in tup]) for tup in sequence2]
        
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)
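
The effect of `ignore_duplicates` is easiest to see on a small example. The sketch below mirrors the logic of `ngram_pos` in plain Python, assuming each input is a list of `(word, tag)` pairs; the local `ngrams` function is a minimal stand-in for `nltk.util.ngrams`, and `ngram_pos_sketch` is a hypothetical name.

```python
from collections import Counter

def ngrams(sequence, n):
    """Minimal stand-in for nltk.util.ngrams."""
    return zip(*[sequence[i:] for i in range(n)])

def ngram_pos_sketch(tagged1, tagged2, ngramsize=2, ignore_duplicates=True):
    # unique n-grams of (word, tag) pairs for each speaker
    grams1 = set(ngrams(tagged1, ngramsize))
    grams2 = set(ngrams(tagged2, ngramsize))
    if ignore_duplicates:
        # drop n-grams that both speakers produced identically
        grams1, grams2 = grams1 - grams2, grams2 - grams1
    join = lambda grams: Counter(
        tuple(" ".join(pair) for pair in gram) for gram in grams)
    return join(grams1), join(grams2)

turn1 = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")]
turn2 = [("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ")]
print(ngram_pos_sketch(turn1, turn2))
```

Here the shared bigram "the dog" is removed from both counters, leaving only the bigrams that differ between the two turns.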

In [602]:
def ngram_lexical(sequence1,sequence2,ngramsize=2):
    """
    Create ngrams of the desired size for each of two
    interlocutors' sequences and return a dictionary 
    of counts of ngrams for each sequence.
    
    By default, consider bigrams. If desired, this may be 
    changed by setting `ngramsize` to the appropriate 
    value.
    """   
    
    # generate ngrams
    sequence1 = list(ngrams(sequence1,ngramsize))
    sequence2 = list(ngrams(sequence2,ngramsize)) 

    # join for counters
    new_sequence1 = [' '.join(pair) for pair in sequence1]
    new_sequence2 = [' '.join(pair) for pair in sequence2]
    
    # return counters
    return Counter(new_sequence1), Counter(new_sequence2)

In [603]:
def get_cosine(vec1, vec2): 
    """
    Derive cosine similarity metric, standard measure.
    Adapted from <https://stackoverflow.com/a/33129724>.
    """     
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
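
A worked example ties the n-gram counters to the cosine score. The sketch below is self-contained and uses two made-up three-word turns; `bigram_counts` and `cosine` are simplified local versions written for illustration, not the notebook's functions.

```python
import math
from collections import Counter

def bigram_counts(tokens):
    """Count space-joined bigrams in a list of tokens."""
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

def cosine(vec1, vec2):
    """Cosine similarity between two count dictionaries."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[key] * vec2[key] for key in shared)
    denominator = (math.sqrt(sum(v ** 2 for v in vec1.values())) *
                   math.sqrt(sum(v ** 2 for v in vec2.values())))
    return float(numerator) / denominator if denominator else 0.0

a = bigram_counts(["i", "like", "cats"])
b = bigram_counts(["i", "like", "dogs"])
print(round(cosine(a, b), 3))
```

Each turn has two bigrams and they share one ("i like"), so the score is 1 / (√2 · √2) = 0.5.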

In [604]:
def build_composite_semantic_vector(lemma_seq,vocablist,highDimModel):
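    """
    Sum the semantic vectors of all words in `lemma_seq` that
    appear in `vocablist` and return the composite vector.
    The vocablist and model (`highDimModel`) are produced by a
    function called in the main loop.
    """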
    getComposite = [0] * len(highDimModel[vocablist[1]])    
    for w1 in lemma_seq:
        if w1 in vocablist:
            semvector = highDimModel[w1]
            getComposite = getComposite + semvector    
    return getComposite

# what we want to do here is find the union of the vocablist within the HDM and then sum over all of the columns.
# should be faster/easier than current instantiation
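
The composite vector is an element-wise sum over the vectors of known words. The following plain-Python sketch illustrates that computation with a hypothetical three-dimensional dictionary standing in for the `word2vec` model; it is not the faster implementation the comment above has in mind.

```python
def composite_vector(lemma_seq, model):
    """Sum the vectors of every lemma that has an entry in `model`."""
    size = len(next(iter(model.values())))
    total = [0.0] * size
    for word in lemma_seq:
        if word in model:
            # element-wise addition of the running total and this word's vector
            total = [t + v for t, v in zip(total, model[word])]
    return total

# Hypothetical toy "model"; a real word2vec space has many more dimensions.
toy_model = {"cat": [1.0, 0.0, 2.0], "dog": [0.5, 1.0, 0.0]}
print(composite_vector(["cat", "dog", "unicorn", "cat"], toy_model))
```

Unknown words such as "unicorn" are skipped, and repeated words are counted each time they occur.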

### Build semantic space

In [605]:
def BuildSemanticModel(semantic_model_input_file,
                            high_sd_cutoff=3,
                            low_n_cutoff=1):
    """
    Given an input file produced by the ALIGN Phase 1 functions, 
    build a semantic model from all transcripts in all conversations
    in target corpus after removing high- and low-frequency words.
    High-frequency words are determined by a user-defined number of
    SDs over the mean (by default, `high_sd_cutoff=3`). Low-frequency
    words must appear over a specified number of raw occurrences 
    (by default, `low_n_cutoff=1`).
    
    Frequency cutoffs can be removed by `high_sd_cutoff=None` and/or
    `low_n_cutoff=0`.
    """
    
    # read in the file
    data1 = pd.read_csv(semantic_model_input_file, sep='\t',encoding='utf-8')
    
    # get frequency count of all included words
    all_words = filter(str.isalpha,[word.strip() for word in str(data1['lemma']).split(',')])
    frequency = defaultdict(int)
    for word in all_words:
        frequency[word] += 1
        
    # keep only words that occur more frequently than our cutoff (defined in occurrences)
    frequency = {word: freq for word, freq in frequency.iteritems() if freq > low_n_cutoff}

    # if desired, remove high-frequency words (over user-defined SDs above mean) 
    if high_sd_cutoff == None:
        contentWords = [word for word in frequency.keys()] 
    else:
        getOut = np.mean(frequency.values())+(np.std(frequency.values())*(high_sd_cutoff))
        contentWords = {word: freq for word, freq in frequency.iteritems() if freq < getOut}.keys()
    
    # identify the sentences in the file, stripping out words we won't keep
    getSentences = [re.sub('[^\w\s]+','',str(row)).split(' ') for row in list(data1['lemma'])]
    keepSentences = [[word for word in row if word in contentWords] for row in getSentences]
    
    # build actual semantic space
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    semantic_model = word2vec.Word2Vec(keepSentences, min_count=low_n_cutoff)

    # return all the content words and the word2vec model space
    return contentWords, semantic_model
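
The two frequency cutoffs can be tried out on a toy frequency table. This sketch reproduces only the filtering step in plain Python, under the assumption that the standard deviation is the population form (the default for `np.std`); `filter_by_frequency` and the `toy` counts are hypothetical.

```python
import math

def filter_by_frequency(frequency, high_sd_cutoff=3, low_n_cutoff=1):
    """Return the sorted words that survive both frequency cutoffs."""
    # low cutoff: keep words occurring more often than low_n_cutoff
    kept = {w: f for w, f in frequency.items() if f > low_n_cutoff}
    if high_sd_cutoff is None or not kept:
        return sorted(kept)
    counts = list(kept.values())
    mean = sum(counts) / float(len(counts))
    # population standard deviation, the same default np.std uses
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))
    ceiling = mean + sd * high_sd_cutoff
    # high cutoff: keep words occurring less often than the ceiling
    return sorted(w for w, f in kept.items() if f < ceiling)

toy = {"the": 50, "cat": 3, "dog": 4, "run": 2, "zebra": 1}
print(filter_by_frequency(toy, high_sd_cutoff=1))
print(filter_by_frequency(toy, high_sd_cutoff=None))
```

With so few word types, the default of 3 SDs would not exclude "the" in this toy table, which is why the example uses a cutoff of 1 SD to show the effect.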

In [31]:
# BuildSemanticModel(semantic_model_input_file=INPUT_PATH + "align_concatenated_dataframe.txt")

### Calculate lexical and POS alignment scores for each n-gram length across two comparison vectors

In [606]:
def LexicalPOSAlignment(tok1,lem1,penn_tok1,penn_lem1,
                             tok2,lem2,penn_tok2,penn_lem2,
                             stan_tok1=None,stan_lem1=None,
                             stan_tok2=None,stan_lem2=None,
                             ngramsLength=3,
                             ignore_duplicates=True,
                             ADD_STANFORD_TAGS = 0):
    
    """
    Derive lexical and part-of-speech alignment scores
    between interlocutors (suffix `1` and `2` in arguments
    passed to function). 
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider bigrams and trigrams when calculating
    similarity (`ngramsLength=3`). If desired, this window may be 
    changed through the `ngramsLength` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # create empty dictionaries for syntactic similarity
    cosine_syntax_penn_tok = {}
    cosine_syntax_penn_lex = {}
    
    # if desired, generate Stanford-based scores
    if ADD_STANFORD_TAGS == 1:
        cosine_syntax_stan_tok = {}
        cosine_syntax_stan_lem = {}
    
    # create empty dictionaries for lexical similarity
    cosine_lexical_tok = {}
    cosine_lexical_lem = {}
    
    # cycle through all desired ngram lengths
    for ngram in range(2,ngramsLength+1):
         
        # calculate similarity for lexical ngrams (tokens and lemmas)
        [vectorT1, vectorT2] = ngram_lexical(tok1,tok2,ngramsize=ngram)
        [vectorL1, vectorL2] = ngram_lexical(lem1,lem2,ngramsize=ngram)
        cosine_lexical_tok['cosine_lexical_tok{0}'.format(ngram)] = get_cosine(vectorT1,vectorT2)
        cosine_lexical_lem['cosine_lexical_lem{0}'.format(ngram)] = get_cosine(vectorL1, vectorL2)

        # calculate similarity for Penn POS ngrams (tokens)
        [vector_penn_tok1, vector_penn_tok2] = ngram_pos(penn_tok1,penn_tok2,
                                                ngramsize=ngram,
                                                ignore_duplicates=ignore_duplicates) 
        cosine_syntax_penn_tok['cosine_syntax_penn_tok{0}'.format(ngram)] = get_cosine(vector_penn_tok1, 
                                                                                            vector_penn_tok2)
        
        # calculate similarity for Penn POS ngrams (lemmas)
        [vector_penn_lem1, vector_penn_lem2] = ngram_pos(penn_lem1,penn_lem2,
                                                              ngramsize=ngram,
                                                              ignore_duplicates=ignore_duplicates) 
        cosine_syntax_penn_lex['cosine_syntax_penn_lex{0}'.format(ngram)] = get_cosine(vector_penn_lem1, 
                                                                                            vector_penn_lem2) 

        # if desired, also calculate using Stanford POS
        if ADD_STANFORD_TAGS == 1:         
          
            # calculate similarity for Stanford POS ngrams (tokens)
            [vector_stan_tok1, vector_stan_tok2] = ngram_pos(stan_tok1,stan_tok2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            cosine_syntax_stan_tok['cosine_syntax_stan_tok{0}'.format(ngram)] = get_cosine(vector_stan_tok1,
                                                                                                vector_stan_tok2)

            # calculate similarity for Stanford POS ngrams (lemmas)
            [vector_stan_lem1, vector_stan_lem2] = ngram_pos(stan_lem1,stan_lem2,
                                                                  ngramsize=ngram,
                                                                  ignore_duplicates=ignore_duplicates) 
            cosine_syntax_stan_lem['cosine_syntax_stan_lem{0}'.format(ngram)] = get_cosine(vector_stan_lem1,
                                                                                                vector_stan_lem2)
        
    # return requested information
    if ADD_STANFORD_TAGS==1:
        dictionaries_list = [cosine_syntax_penn_tok, cosine_syntax_penn_lex,
                             cosine_syntax_stan_tok, cosine_syntax_stan_lem, 
                             cosine_lexical_tok, cosine_lexical_lem]      
    else:
        dictionaries_list = [cosine_syntax_penn_tok, cosine_syntax_penn_lex,
                             cosine_lexical_tok, cosine_lexical_lem]      
    return dictionaries_list
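
The helpers `ngram_lexical`, `ngram_pos`, and `get_cosine` are defined elsewhere in this notebook. As a rough illustration of what the loop above computes, the sketch below scores two toy turns with hypothetical stand-ins (`count_ngrams` and `cosine_from_counts`); these are assumptions written for this example and may differ in detail from the notebook's own helpers.

```python
import math
from collections import Counter

def count_ngrams(tokens, ngramsize=2):
    """Count contiguous n-grams in a token list (illustrative stand-in)."""
    return Counter(zip(*[tokens[i:] for i in range(ngramsize)]))

def cosine_from_counts(vec1, vec2):
    """Cosine similarity between two sparse count vectors (illustrative stand-in)."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[k] * vec2[k] for k in shared)
    denominator = (math.sqrt(sum(v ** 2 for v in vec1.values())) *
                   math.sqrt(sum(v ** 2 for v in vec2.values())))
    return numerator / denominator if denominator else 0.0

turn1 = ['i', 'like', 'the', 'red', 'car']
turn2 = ['you', 'like', 'the', 'blue', 'car']

scores = {}
for n in range(2, 3 + 1):  # mirrors range(2, ngramsLength+1) with ngramsLength=3
    scores['cosine_lexical_tok{0}'.format(n)] = cosine_from_counts(
        count_ngrams(turn1, n), count_ngrams(turn2, n))
```

The two turns share one of their four bigrams (`like the`) and none of their trigrams, so the bigram score is 0.25 and the trigram score is 0.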

## Generate turn-level analysis of alignment scores

In [607]:
def conceptualAlignment(lem1, lem2, vocablist, highDimModel):
    
    """
    Calculate conceptual alignment scores from lists of lemmas
    from two interlocutors (suffix `1` and `2` in arguments
    passed to function) using `word2vec`.
    """

    # aggregate composite high-dimensional vectors of all words in utterance
    W2Vec1 = build_composite_semantic_vector(lem1,vocablist,highDimModel)
    W2Vec2 = build_composite_semantic_vector(lem2,vocablist,highDimModel)

    # return cosine similarity (1 - cosine distance) as the alignment score
    return 1 - spatial.distance.cosine(W2Vec1, W2Vec2)  
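
`build_composite_semantic_vector` is defined elsewhere in this notebook. The sketch below illustrates the general idea with toy vectors, assuming the composite vector is the element-wise sum of the vectors of all in-vocabulary lemmas; `composite_vector`, `cosine_similarity`, and `toy_model` are hypothetical names used only for this example.

```python
import math

# toy 3-dimensional "embeddings"; real values would come from the word2vec model
toy_model = {
    'dog': [1.0, 0.0, 0.0],
    'cat': [0.0, 1.0, 0.0],
    'car': [0.0, 0.0, 1.0],
}

def composite_vector(lemmas, vocablist, model, dims=3):
    """Sum the vectors of all lemmas found in the vocabulary (illustrative stand-in)."""
    total = [0.0] * dims
    for lemma in lemmas:
        if lemma in vocablist and lemma in model:
            total = [a + b for a, b in zip(total, model[lemma])]
    return total

def cosine_similarity(v1, v2):
    """Cosine similarity; returns NaN in this sketch if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else float('nan')

vocab = ['dog', 'cat', 'car']
same = cosine_similarity(composite_vector(['dog', 'cat'], vocab, toy_model),
                         composite_vector(['cat', 'dog', 'um'], vocab, toy_model))
different = cosine_similarity(composite_vector(['dog'], vocab, toy_model),
                              composite_vector(['car'], vocab, toy_model))
```

Out-of-vocabulary items such as `um` contribute nothing to the composite vector, so the first pair of turns scores (approximately) 1 and the second pair scores 0.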

In [608]:
def returnMultilevelAlignment(cond_info,
                                   partnerA,tok1,lem1,penn_tok1,penn_lem1,
                                   partnerB,tok2,lem2,penn_tok2,penn_lem2,
                                   vocablist, highDimModel, 
                                   stan_tok1=None,stan_lem1=None,
                                   stan_tok2=None,stan_lem2=None,
                                   ADD_STANFORD_TAGS=0,
                                   ngramsLength=3, 
                                   ignore_duplicates=True):

    """
    Calculate lexical, syntactic, and conceptual alignment
    between a pair of turns by individual interlocutors 
    (suffix `1` and `2` in arguments passed to function), 
    including leading/following comparison directionality.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1` and by providing appropriate 
    values for `stan_tok1`, `stan_lem1`, `stan_tok2`, and 
    `stan_lem2`.
    
    By default, consider bigrams and trigrams when calculating
    similarity (`ngramsLength=3`). If desired, this window may be 
    changed through the `ngramsLength` argument value.
    
    By default, remove exact duplicates when calculating
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # create empty dictionaries 
    partner_direction = {}
    condition_info = {}
    cosine_semanticL = {}
    
    # calculate lexical and syntactic alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 ngramsLength=ngramsLength,
                                                 ignore_duplicates=ignore_duplicates,
                                                 ADD_STANFORD_TAGS=ADD_STANFORD_TAGS)
    
    # calculate conceptual alignment
    cosine_semanticL['cosine_semanticL'] = conceptualAlignment(lem1,lem2,vocablist,highDimModel)
    dictionaries_list.append(cosine_semanticL.copy())
    
    # determine directionality of leading/following comparison
    partner_direction['partner_direction'] = str(partnerA) + ">" + str(partnerB)
    dictionaries_list.append(partner_direction.copy())

    # add condition information
    condition_info['condition_info'] = cond_info    
    dictionaries_list.append(condition_info.copy())

    # return alignment scores
    return dictionaries_list

In [609]:
def TurnByTurnAnalysis(dataframe,
                            vocablist,
                            highDimModel, 
                            delay=1,
                            maxngram = 3,
                            ADD_STANFORD_TAGS=0,
                            ignore_duplicates=True):    

    """
    Calculate lexical, syntactic, and conceptual alignment
    between interlocutors over an entire conversation.
    Automatically detect individual speakers by unique
    speaker codes.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 3. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # if we don't want the Stanford tagger data, set defaults
    if ADD_STANFORD_TAGS == 0:
        stan_tok1=None
        stan_lem1=None
        stan_tok2=None
        stan_lem2=None
    
    # prepare the data to the appropriate type
    dataframe['token'] = dataframe['token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))    
    dataframe['lemma'] = dataframe['lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_token'] = dataframe['tagged_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
    dataframe['tagged_lemma'] = dataframe['tagged_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
    
    # if desired, prepare the Stanford tagger data
    if ADD_STANFORD_TAGS == 1:           
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_token'] = dataframe['tagged_stan_token'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: re.sub('[^\w\s]+','',x).split(' '))
        dataframe['tagged_stan_lemma'] = dataframe['tagged_stan_lemma'].apply(lambda x: zip(x[0::2],x[1::2])) # thanks to https://stackoverflow.com/a/4647086

    # create lagged version of the dataframe
    df_original = dataframe.drop(dataframe.tail(delay).index,inplace=False)
    df_lagged = dataframe.shift(-delay).drop(dataframe.tail(delay).index,inplace=False)
    
    # cycle through each pair of turns
    aggregated_df = pd.DataFrame()
    for i in range(0,df_original.shape[0]):

        # identify the condition for this dataframe
        cond_info = dataframe['file'].unique()
        if len(cond_info)==1: 
            cond_info = str(cond_info[0])
        
        # break and flag error if we have more than 1 condition per dataframe
        else: 
            raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+str(cond_info))

        # grab all of first participant's data
        first_row = df_original.iloc[i]
        first_partner = first_row['participant']
        # text1=first_row['content']
        tok1=first_row['token']
        lem1=first_row['lemma']
        penn_tok1=first_row['tagged_token']
        penn_lem1=first_row['tagged_lemma']

        # grab all of lagged participant's data
        lagged_row = df_lagged.iloc[i]
        lagged_partner = lagged_row['participant']
        # text2=lagged_row['content']
        tok2=lagged_row['token']
        lem2=lagged_row['lemma']
        penn_tok2=lagged_row['tagged_token']
        penn_lem2=lagged_row['tagged_lemma']
        
        # if desired, grab the Stanford tagger data for both participants
        if ADD_STANFORD_TAGS == 1:           
            stan_tok1=first_row['tagged_stan_token']
            stan_lem1=first_row['tagged_stan_lemma']
            stan_tok2=lagged_row['tagged_stan_token']
            stan_lem2=lagged_row['tagged_stan_lemma']
                
        # process multilevel alignment
        dictionaries_list=returnMultilevelAlignment(cond_info=cond_info,
                                                         partnerA=first_partner,
                                                         # text1=text1,
                                                         tok1=tok1,lem1=lem1,
                                                         penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                         partnerB=lagged_partner,
                                                         # text2=text2,
                                                         tok2=tok2,lem2=lem2,
                                                         penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                         vocablist=vocablist,
                                                         highDimModel=highDimModel,
                                                         stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                         stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                         ngramsLength = maxngram,
                                                         ignore_duplicates = ignore_duplicates,
                                                         ADD_STANFORD_TAGS = ADD_STANFORD_TAGS) 
        
        # append data to existing structures
        next_df_line = pd.DataFrame.from_dict(dict(j for i in dictionaries_list for j in i.items()),
                               orient='index').transpose()
        aggregated_df = aggregated_df.append(next_df_line)
            
    # reformat turn information and add index
    aggregated_df = aggregated_df.reset_index(drop=True).reset_index().rename(columns={"index":"time"})

    # give us our finished dataframe
    return aggregated_df
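
The lagged comparison above (dropping the last `delay` rows from one copy of the dataframe and shifting the other copy by `-delay`) can be pictured with plain lists. This is only a sketch of the pairing logic; `lagged_turn_pairs` and the toy conversation are hypothetical and not part of ALIGN.

```python
def lagged_turn_pairs(turns, delay=1):
    """Pair each turn with the turn `delay` positions later (illustrative helper)."""
    if delay < 1:
        raise ValueError('delay must be a positive integer')
    return list(zip(turns[:-delay], turns[delay:]))

conversation = [('A', 'hello there'),
                ('B', 'hi how are you'),
                ('A', 'fine thanks'),
                ('B', 'good to hear')]

pairs = lagged_turn_pairs(conversation, delay=1)
# same "leader>follower" label format as `partner_direction`
directions = [first[0] + '>' + second[0] for first, second in pairs]
```

With `delay=1`, four turns yield three comparisons labeled `A>B`, `B>A`, and `A>B`. In a strictly alternating conversation, an even `delay` would instead compare each speaker with their own later turns.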

Generate conversation-level analysis of alignment scores
-----------------------------------------------------

In [611]:
def ConvoByConvoAnalysis(dataframe,
                          ngramsLength = 3,
                          ignore_duplicates=True,
                          ADD_STANFORD_TAGS = 0):

    """
    Calculate analysis of multilevel similarity over
    a conversation between two interlocutors from a 
    transcript dataframe prepared by Phase 1
    of ALIGN. Automatically detect speakers by unique
    speaker codes.
    
    By default, include maximum n-gram comparison of 3. If
    desired, this may be changed by passing the appropriate
    value to the `ngramsLength` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """

    # identify the condition for this dataframe
    cond_info = dataframe['file'].unique()
    if len(cond_info)==1: 
        cond_info = str(cond_info[0])
    
    # break and flag error if we have more than 1 condition per dataframe
    else: 
        raise ValueError('Error! Dataframe contains multiple conditions. Split dataframe into multiple dataframes, one per condition: '+str(cond_info))
   
    # if we don't want the Stanford info, set defaults 
    if ADD_STANFORD_TAGS==0:
        stan_tok1 = None
        stan_lem1 = None
        stan_tok2 = None
        stan_lem2 = None

    # identify individual interlocutors
    df_A = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[0]]
    df_B = dataframe.loc[dataframe['participant'] == dataframe['participant'].unique()[1]]
   
    # concatenate the token, lemma, and POS information for participant A
    tok1 = [word for turn in df_A['token'] for word in turn]
    lem1 = [word for turn in df_A['lemma'] for word in turn]
    penn_tok1 = [POS for turn in df_A['tagged_token'] for POS in turn]    
    penn_lem1 = [POS for turn in df_A['tagged_lemma'] for POS in turn] 
    if ADD_STANFORD_TAGS == 1:
        stan_tok1 = [POS for turn in df_A['tagged_stan_token'] for POS in turn]    
        stan_lem1 = [POS for turn in df_A['tagged_stan_lemma'] for POS in turn] 

    # concatenate the token, lemma, and POS information for participant B
    tok2 = [word for turn in df_B['token'] for word in turn]
    lem2 = [word for turn in df_B['lemma'] for word in turn]
    penn_tok2 = [POS for turn in df_B['tagged_token'] for POS in turn]    
    penn_lem2 = [POS for turn in df_B['tagged_lemma'] for POS in turn] 
    if ADD_STANFORD_TAGS == 1:
        stan_tok2 = [POS for turn in df_B['tagged_stan_token'] for POS in turn]    
        stan_lem2 = [POS for turn in df_B['tagged_stan_lemma'] for POS in turn] 
        
    # process multilevel alignment
    dictionaries_list = LexicalPOSAlignment(tok1=tok1,lem1=lem1,
                                                 penn_tok1=penn_tok1,penn_lem1=penn_lem1,
                                                 tok2=tok2,lem2=lem2,
                                                 penn_tok2=penn_tok2,penn_lem2=penn_lem2,
                                                 stan_tok1=stan_tok1,stan_lem1=stan_lem1,
                                                 stan_tok2=stan_tok2,stan_lem2=stan_lem2,
                                                 ngramsLength=ngramsLength,
                                                 ignore_duplicates=ignore_duplicates,
                                                 ADD_STANFORD_TAGS=ADD_STANFORD_TAGS)
    
    # append data to existing structures
    dictionary_df = pd.DataFrame.from_dict(dict(j for i in dictionaries_list for j in i.items()),
                           orient='index').transpose()
    dictionary_df['condition_info'] = cond_info
            
    # return the dataframe
    return dictionary_df
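
At the conversation level, each speaker's turns are pooled into one long list before any n-grams are counted. A minimal sketch of that pooling step, using plain tuples instead of a dataframe (`pool_by_speaker` and the toy turns are hypothetical, for illustration only):

```python
turns = [('A', ['i', 'like', 'dog']),
         ('B', ['you', 'like', 'dog']),
         ('A', ['dog', 'be', 'good'])]

def pool_by_speaker(turns):
    """Concatenate every turn's lemmas into one list per speaker, in order of first appearance."""
    pooled = {}
    order = []
    for speaker, lemmas in turns:
        if speaker not in pooled:
            pooled[speaker] = []
            order.append(speaker)
        pooled[speaker].extend(lemmas)
    return [pooled[speaker] for speaker in order]

lem1, lem2 = pool_by_speaker(turns)
```

One consequence worth keeping in mind: because turns are joined end to end, n-grams counted on the pooled lists can span turn boundaries (here, `dog dog` for speaker A).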

## RUN Phase 2: Actual Partners

* For each prepped transcript file, calculates turn-level and conversation-level alignment scores
* Saves turn-level output to `AlignmentT2T.txt` and conversation-level output to `AlignmentC2C.txt` for use in statistical analysis

In [612]:
def PHASE2RUN_REAL(input_file_directory, 
                        output_file_directory,
                        semantic_model_input_file,
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        delay=1,
                        maxngram=3,
                        ignore_duplicates=True,
                        ADD_STANFORD_TAGS=0):   

    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which a word will be 
    removed if it occurs 1 or fewer times. If
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 3. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS taggers. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., does not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    """
    
    # grab the files in the list
    file_list = glob.glob(input_file_directory+"*.txt")
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                        high_sd_cutoff=high_sd_cutoff,
                                                        low_n_cutoff=low_n_cutoff)

    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through each prepared file
    for fileName in file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print("Processing: "+fileName)

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel,
                                         ADD_STANFORD_TAGS=ADD_STANFORD_TAGS,
                                         ignore_duplicates=ignore_duplicates)
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             ngramsLength = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             ADD_STANFORD_TAGS = ADD_STANFORD_TAGS)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print("Invalid file: "+fileName)
            
    # update final dataframes
    FINAL_TURN = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN.to_csv(output_file_directory+"AlignmentT2T.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO.to_csv(output_file_directory+"AlignmentC2C.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN, FINAL_CONVO

In [155]:
# PHASE2RUN_REAL(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
#                         output_file_directory = INPUT_PATH+ANALYSIS_READY,
#                         semantic_model_input_file = INPUT_PATH+'align_concatenated_dataframe.txt',
#                         high_sd_cutoff=3,
#                         low_n_cutoff=1,
#                         delay=1,
#                         maxngram=3,
#                         ignore_duplicates=True,
#                         ADD_STANFORD_TAGS=0)

Generate surrogate pairings
-------------------------
* Collects all possible pairs of participants across the dyads in each condition and creates surrogate pairings by combining their conversational turns. By default, each participant's turns are shuffled; their original turn order can be preserved with `keep_original_turn_order = True`. Output is saved as new, separate conversational transcripts.
* Main Function:
    * GenerateSurrogate 
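
The pairing idea can be sketched with plain lists before reading the full function. This is a simplified illustration, not the ALIGN implementation: `interleave_surrogate` and the toy `dyads` dictionary are hypothetical, and the sketch strictly alternates speakers, whereas `GenerateSurrogate` orders rows by their index values.

```python
import random
from itertools import combinations

def interleave_surrogate(turns_a, turns_b):
    """Alternate turns from two speakers, truncated to the shorter speaker (illustrative helper)."""
    n = min(len(turns_a), len(turns_b))
    surrogate = []
    for a, b in zip(turns_a[:n], turns_b[:n]):
        surrogate.extend([a, b])
    return surrogate

# hypothetical dyads: speaker A's and speaker B's turns, each in original order
dyads = {
    'dyad1': {'A': ['a1-1', 'a1-2', 'a1-3'], 'B': ['b1-1', 'b1-2']},
    'dyad2': {'A': ['a2-1', 'a2-2'], 'B': ['b2-1', 'b2-2', 'b2-3']},
    'dyad3': {'A': ['a3-1'], 'B': ['b3-1']},
}

# every pair of real conversations is a candidate source of surrogates
all_pairs = list(combinations(sorted(dyads), 2))
random.seed(0)  # only so that the sketch is repeatable
subset = random.sample(all_pairs, 2)

# each pair of conversations yields two surrogates: A1 with B2, and A2 with B1
first, second = all_pairs[0]
surrogate_x = interleave_surrogate(dyads[first]['A'], dyads[second]['B'])
surrogate_y = interleave_surrogate(dyads[second]['A'], dyads[first]['B'])
```

Each surrogate is truncated to its own shorter speaker, so `surrogate_x` has six turns and `surrogate_y` has four.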

In [None]:
def GenerateSurrogate(original_conversation_list,
                           surrogate_file_directory,
                           all_surrogates = False,
                           id_separator = '\-',
                           dyad_label='dyad',
                           condition_label='cond',
                           keep_original_turn_order = False):
    
    """
    Create transcripts for surrogate pairs of 
    participants (i.e., participants who did not 
    genuinely interact in the experiment), which
    will later be used to generate baseline levels 
    of alignment. Store surrogate files in a new
    folder each time the surrogate generation is run.
    
    Returns a list of all surrogate files created.

    By default, the separator between dyad ID and
    condition ID is a hyphen ('\-'). If desired,
    this may be changed in the `id_separator` 
    argument.

    By default, condition IDs will be identified as 
    any characters following `cond`. If desired,
    this may be changed with the `condition_label`
    argument.
    
    By default, dyad IDs will be identified as 
    any characters following `dyad`. If desired,
    this may be changed with the `dyad_label`
    argument.
    
    By default, generate surrogates only from a subset
    of all possible pairings. If desired, instead 
    generate surrogates from all possible pairings
    with `all_surrogates=True`.
    
    By default, create surrogates by shuffling all
    turns within each surrogate partner's data. If 
    desired, retain the original ordering of each
    surrogate partner's data with 
    `keep_original_turn_order = True`.
    """
        
    # create a subfolder for the new set of surrogates
    import time
    new_surrogate_path = surrogate_file_directory + 'surrogate_run-' + str(time.time()) +'/'
    if not os.path.exists(new_surrogate_path):
        os.makedirs(new_surrogate_path)
        
    # grab condition types from each file name
    file_info = [re.sub('\.txt','',os.path.basename(file_name)) for file_name in original_conversation_list]
    condition_ids = list(set([re.findall('[^'+id_separator+']*'+condition_label+'.*',metadata)[0] for metadata in file_info]))
    files_conditions = {}
    for unique_condition in condition_ids:
        next_condition_files = [add_file for add_file in original_conversation_list if unique_condition in add_file]
        files_conditions[unique_condition] = next_condition_files
    
    # cycle through conditions
    for condition in files_conditions.keys():
        
        # grab all possible pairs of conversations of this condition
        paired_surrogates = [pair for pair in combinations(files_conditions[condition],2)]
        
        # default: randomly pull from all pairs to get target surrogate sample
        if all_surrogates == False:
            import math
            paired_surrogates = random.sample(paired_surrogates, 
                                              int(math.ceil(len(files_conditions[condition])/2.0)))
            
        # cycle through surrogate pairings
        for next_surrogate in paired_surrogates:
            
            # read in the files
            original_file1 = os.path.basename(next_surrogate[0])
            original_file2 = os.path.basename(next_surrogate[1])
            original_df1=pd.read_csv(next_surrogate[0], sep='\t',encoding='utf-8')
            original_df2=pd.read_csv(next_surrogate[1], sep='\t',encoding='utf-8')
            
            # get participants A and B from df1
            participantA_1_code = min(original_df1['participant'].unique())
            participantB_1_code = max(original_df1['participant'].unique())
            participantA_1 = original_df1[original_df1['participant'] == participantA_1_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_1 = original_df1[original_df1['participant'] == participantB_1_code].reset_index().rename(columns={'file': 'original_file'})
            
            # get participants A and B from df2
            participantA_2_code = min(original_df2['participant'].unique())
            participantB_2_code = max(original_df2['participant'].unique())
            participantA_2 = original_df2[original_df2['participant'] == participantA_2_code].reset_index().rename(columns={'file': 'original_file'})
            participantB_2 = original_df2[original_df2['participant'] == participantB_2_code].reset_index().rename(columns={'file': 'original_file'})
            
            # identify truncation point for both surrogates (to have even number of turns)
            surrogateX_turns=min([participantA_1.shape[0],
                                  participantB_2.shape[0]])
            surrogateY_turns=min([participantA_2.shape[0],
                                  participantB_1.shape[0]])
            
            # if desired, preserve original turn order for surrogate pairs
            if keep_original_turn_order == True:
                surrogateX = participantA_1.truncate(after=surrogateX_turns-1,copy=False).append(
                                participantB_2.truncate(after=surrogateX_turns-1,copy=False)).sort_values(
                                ['index']).reset_index(drop=True).rename(columns={'index': 'original_index'})
                surrogateY = participantA_2.truncate(after=surrogateY_turns-1,copy=False).append(
                                participantB_1.truncate(after=surrogateY_turns-1,copy=False)).sort_values(
                                ['index']).reset_index(drop=True).rename(columns={'index': 'original_index'})
            
            # otherwise, just shuffle all turns within participants
            else:
                
                # shuffle for first surrogate pairing
                surrogateX_A1 = participantA_1.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX_B2 = participantB_2.truncate(after=surrogateX_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateX = surrogateX_A1.append(surrogateX_B2).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})
                
                # and for second surrogate pairing
                surrogateY_A2 = participantA_2.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY_B1 = participantB_1.truncate(after=surrogateY_turns-1,copy=False).sample(frac=1).reset_index(drop=True)
                surrogateY = surrogateY_A2.append(surrogateY_B1).sort_index().reset_index(drop=True).rename(columns={'index': 'original_index'})

            # create filename for our surrogate file
            original_dyad1 = re.findall(dyad_label+'[^'+id_separator+']*',original_file1)[0]
            original_dyad2 = re.findall(dyad_label+'[^'+id_separator+']*',original_file2)[0]
            surrogateX['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            surrogateY['file'] = condition + '-' + original_dyad1 + '-' + original_dyad2
            nameX='SurrogatePair-'+original_dyad1+'A'+'-'+original_dyad2+'B'+'-'+condition+'.txt'
            nameY='SurrogatePair-'+original_dyad2+'A'+'-'+original_dyad1+'B'+'-'+condition+'.txt'
            
            # save to file
            surrogateX.to_csv(new_surrogate_path + nameX, encoding='utf-8',index=False,sep='\t')
            surrogateY.to_csv(new_surrogate_path + nameY, encoding='utf-8',index=False,sep='\t')
            
    # return list of all surrogate files
    return glob.glob(new_surrogate_path+"*.txt")
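
As a rough illustration of the naming step above, the sketch below applies the same `re.findall` pattern to two made-up file names. The file names and the hand-set `condition` value are hypothetical; only the pattern and the `SurrogatePair-...` naming scheme are taken from the function.

```python
import re

# hypothetical inputs: these file names are invented for illustration
dyad_label = 'dyad'
id_separator = r'\-'
condition = 'cond1'
original_file1 = 'dyad3-cond1.txt'
original_file2 = 'dyad7-cond1.txt'

# grab the dyad label plus everything up to the next separator
pattern = dyad_label + '[^' + id_separator + ']*'
original_dyad1 = re.findall(pattern, original_file1)[0]
original_dyad2 = re.findall(pattern, original_file2)[0]

# build the two surrogate file names the same way the function does
nameX = 'SurrogatePair-' + original_dyad1 + 'A-' + original_dyad2 + 'B-' + condition + '.txt'
nameY = 'SurrogatePair-' + original_dyad2 + 'A-' + original_dyad1 + 'B-' + condition + '.txt'
print(nameX)
print(nameY)
```

Because the pattern stops at the first separator, a dyad ID may contain any characters except the separator itself.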

RUN Phase 2: Surrogate Partners
-------------------------------
* Runs the function that generates new surrogate transcript conversations (saved as separate files)
* For each surrogate transcript file, calculates turn-level and conversation-level alignment scores
* Saves the turn-level and conversation-level scores to one datasheet each (`AlignmentT2T_Surrogate.txt` and `AlignmentC2C_Surrogate.txt`) for use in statistical analysis
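
To make the surrogate idea concrete, here is a minimal sketch of one pairing: speaker A from one dyad is interleaved with speaker B from another, with turns shuffled within each speaker (the behavior when `keep_original_turn_order=False`). The toy data, the column names `participant` and `content`, and the helper `shuffled_turns` are invented for illustration, and `pd.concat` stands in for the `append` calls used in `GenerateSurrogate`.

```python
import pandas as pd

# toy transcripts (invented data): two dyads, speakers A and B alternate
dyad1 = pd.DataFrame({'participant': ['A', 'B', 'A', 'B'],
                      'content': ['a1-0', 'b1-0', 'a1-1', 'b1-1']})
dyad2 = pd.DataFrame({'participant': ['A', 'B', 'A', 'B'],
                      'content': ['a2-0', 'b2-0', 'a2-1', 'b2-1']})

def shuffled_turns(df, speaker, n_turns, seed):
    """Keep one speaker's first n_turns and shuffle their order."""
    turns = df[df['participant'] == speaker].reset_index(drop=True)
    turns = turns.iloc[:n_turns]
    return turns.sample(frac=1, random_state=seed).reset_index(drop=True)

# use the same number of turns from each speaker
n_turns = 2

# pair speaker A from dyad 1 with speaker B from dyad 2
surrogate_A = shuffled_turns(dyad1, 'A', n_turns, seed=0)
surrogate_B = shuffled_turns(dyad2, 'B', n_turns, seed=1)

# interleave: a stable sort on the shared 0..n-1 index alternates A, B, A, B
surrogate = (pd.concat([surrogate_A, surrogate_B])
               .sort_index(kind='mergesort')
               .reset_index(drop=True))
print(surrogate)
```

The resulting conversation keeps the A/B turn-taking structure, but no turn was ever produced in response to the one before it, which is what makes it usable as a baseline.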

In [None]:
def PHASE2RUN_SURROGATE(input_file_directory, 
                             surrogate_file_directory,
                             output_file_directory,
                             semantic_model_input_file,
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = r'\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=3,
                             ignore_duplicates=True,
                             ADD_STANFORD_TAGS=0):   
    """
    Given a directory of individual .txt files and the
    vocab list that have been generated by the `PHASE1RUN` 
    preparation stage, return multi-level alignment 
    scores with turn-by-turn and conversation-level metrics
    for surrogate baseline conversations.
    
    By default, create the semantic model with a 
    high-frequency cutoff of 3 SD over the mean. If 
    desired, this can be changed with the 
    `high_sd_cutoff` argument and can be removed with
    `high_sd_cutoff=None`.
    
    By default, create the semantic model with a 
    low-frequency cutoff in which words will be 
    removed if they occur 1 or fewer times. If
    desired, this can be changed with the 
    `low_n_cutoff` argument and can be removed with
    `low_n_cutoff=0`.
    
    By default, compare only adjacent turns. If desired,
    the comparison distance may be changed by increasing
    the `delay` argument.
    
    By default, include maximum n-gram comparison of 3. If
    desired, this may be changed by passing the appropriate
    value to the `maxngram` argument.
    
    By default, return scores based only on Penn POS tags. 
    If desired, also return scores using Stanford tagger with 
    `ADD_STANFORD_TAGS=1`.
    
    By default, remove exact duplicates when calculating POS
    similarity scores (i.e., do not consider perfectly
    mimicked lexical items between speakers). If desired, 
    duplicates may be included when calculating scores by 
    passing `ignore_duplicates=False`.
    
    By default, the separator between dyad ID and
    condition ID in each file name is a hyphen ('\-'). 
    If desired, this may be changed with the 
    `id_separator` argument.

    By default, condition IDs in each file name
    will be identified as any characters following 
    `cond`. If desired, this may be changed with the 
    `condition_label` argument.
    
    By default, dyad IDs in each file name
    will be identified as any characters following 
    `dyad`. If desired, this may be changed with the 
    `dyad_label` argument.
    
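    By default, generate surrogates only from a subset
    of all possible pairings. If desired, instead 
    generate surrogates from all possible pairings
    with `all_surrogates=True`.
    
    By default, shuffle all turns within each participant
    when creating surrogates. If desired, retain the 
    original turn order with 
    `keep_original_turn_order=True`.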
    """
    
    # grab the files in the input list
    file_list = glob.glob(input_file_directory+"*.txt")
    surrogate_file_list = GenerateSurrogate(original_conversation_list = file_list,
                                                   surrogate_file_directory = surrogate_file_directory,
                                                   all_surrogates = all_surrogates,
                                                   id_separator = id_separator,
                                                   condition_label = condition_label,
                                                   dyad_label = dyad_label,
                                                   keep_original_turn_order = keep_original_turn_order) 
    
    # build the semantic model to be used for all conversations
    [vocablist, highDimModel] = BuildSemanticModel(semantic_model_input_file=semantic_model_input_file,
                                                        high_sd_cutoff=high_sd_cutoff,
                                                        low_n_cutoff=low_n_cutoff)
    
    # create containers for alignment values
    AlignmentT2T = pd.DataFrame()
    AlignmentC2C = pd.DataFrame()
    
    # cycle through the files
    for fileName in surrogate_file_list:
        
        # process the file if it's got a valid conversation
        dataframe=pd.read_csv(fileName, sep='\t',encoding='utf-8')
        if len(dataframe) > 0:
            
            # let us know which filename we're processing
            print("Processing: "+fileName)

            # calculate turn-by-turn alignment scores
            xT2T=TurnByTurnAnalysis(dataframe=dataframe,
                                         delay=delay,
                                         maxngram=maxngram,
                                         vocablist=vocablist,
                                         highDimModel=highDimModel)
            AlignmentT2T=AlignmentT2T.append(xT2T)
            
            # calculate conversation-level alignment scores
            xC2C = ConvoByConvoAnalysis(dataframe=dataframe,
                                             ngramsLength = maxngram,
                                             ignore_duplicates=ignore_duplicates,
                                             ADD_STANFORD_TAGS = ADD_STANFORD_TAGS)
            AlignmentC2C=AlignmentC2C.append(xC2C)
        
        # if it's invalid, let us know
        else:
            print("Invalid file: "+fileName)
            
    # update final dataframes
    FINAL_TURN_SURROGATE = AlignmentT2T.reset_index(drop=True)
    FINAL_CONVO_SURROGATE = AlignmentC2C.reset_index(drop=True)
    
    # export the final files
    FINAL_TURN_SURROGATE.to_csv(output_file_directory+"AlignmentT2T_Surrogate.txt",
                      encoding='utf-8',index=False,sep='\t')   
    FINAL_CONVO_SURROGATE.to_csv(output_file_directory+"AlignmentC2C_Surrogate.txt",
                       encoding='utf-8',index=False,sep='\t') 

    # display the info, too
    return FINAL_TURN_SURROGATE, FINAL_CONVO_SURROGATE
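
The two frequency cutoffs described in the docstring can be sketched roughly as follows. This is not the `BuildSemanticModel` code: `filter_vocab` is a hypothetical helper, and it assumes the mean and SD are taken over raw word counts before any words are dropped.

```python
from collections import Counter
import statistics

def filter_vocab(tokens, high_sd_cutoff=3, low_n_cutoff=1):
    """Drop very frequent and very rare words from a token list (sketch)."""
    counts = Counter(tokens)
    freqs = list(counts.values())

    # remove words occurring low_n_cutoff or fewer times
    vocab = {w for w in counts if counts[w] > low_n_cutoff}

    # remove words whose count lies more than high_sd_cutoff SDs above the mean
    if high_sd_cutoff is not None and len(freqs) > 1:
        ceiling = statistics.mean(freqs) + high_sd_cutoff * statistics.pstdev(freqs)
        vocab = {w for w in vocab if counts[w] <= ceiling}
    return sorted(vocab)

# toy token list (invented): one very common word, one word seen only once
tokens = ['the'] * 20 + ['dog'] * 3 + ['cat'] * 2 + ['runs'] * 2 + ['once']

# a 1-SD cutoff is used here because a handful of word types
# cannot produce a 3-SD outlier
print(filter_vocab(tokens, high_sd_cutoff=1))
```

Passing `high_sd_cutoff=None` and `low_n_cutoff=0` to this helper switches both filters off, mirroring the options described in the docstring.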

In [None]:
[turn_surrogate,convo_surrogate] = PHASE2RUN_SURROGATE(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                             surrogate_file_directory= INPUT_PATH+SURROGATE_TRANSCRIPTS,
                             output_file_directory= INPUT_PATH+ANALYSIS_READY,
                             semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = r'\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=3,
                             ignore_duplicates=True,
                             ADD_STANFORD_TAGS=0)

# Run everything!

## Phase 1: Prep

In [None]:
import time
start_phase1 = time.time()

In [None]:
# model_store = PHASE1RUN(input_file_directory=INPUT_PATH+TRANSCRIPTS,
#                       output_file_directory=INPUT_PATH+PREPPED_TRANSCRIPTS,
#                       training_dictionary=INPUT_PATH+'big.txt')

## Phase 2: Real

In [None]:
start_phase2real = time.time()

In [None]:
[turn_real,convo_real]= PHASE2RUN_REAL(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                        output_file_directory = INPUT_PATH+ANALYSIS_READY,
                        semantic_model_input_file = INPUT_PATH+'align_concatenated_dataframe.txt',
                        high_sd_cutoff=3,
                        low_n_cutoff=1,
                        delay=1,
                        maxngram=3,
                        ignore_duplicates=True,
                        ADD_STANFORD_TAGS=0)

## Phase 2: Surrogate

In [None]:
start_phase2surrogate = time.time()

In [None]:
[turn_surrogate,convo_surrogate] = PHASE2RUN_SURROGATE(input_file_directory = INPUT_PATH+PREPPED_TRANSCRIPTS, 
                             surrogate_file_directory= INPUT_PATH+SURROGATE_TRANSCRIPTS,
                             output_file_directory= INPUT_PATH+ANALYSIS_READY,
                             semantic_model_input_file=INPUT_PATH+'align_concatenated_dataframe.txt',
                             high_sd_cutoff=3,
                             low_n_cutoff=1,
                             id_separator = r'\-',
                             condition_label='cond',
                             dyad_label='dyad',
                             all_surrogates=False,
                             keep_original_turn_order = False,
                             delay=1,
                             maxngram=3,
                             ignore_duplicates=True,
                             ADD_STANFORD_TAGS=0)

In [None]:
end=time.time()

## Speed calculations

Phase 1 time:

In [None]:
start_phase2real - start_phase1

Phase 2 real time:

In [None]:
start_phase2surrogate - start_phase2real

Phase 2 surrogate time:

In [None]:
end - start_phase2surrogate

All 3 phases:

In [None]:
end - start_phase1
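
The differences above are raw seconds. If a more readable figure is wanted, a small helper along these lines could be used; `format_elapsed` is a hypothetical name and not part of the notebook's functions.

```python
def format_elapsed(seconds):
    """Turn a duration in seconds into an 'Hh MMm SSs' string."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return '%dh %02dm %02ds' % (hours, minutes, secs)

# stand-in duration; in the notebook this would be, e.g., end - start_phase1
print(format_elapsed(3725.4))
```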

## Printouts!

In [None]:
turn_real.head(10)

In [None]:
convo_real.head(10)

In [None]:
turn_surrogate.head(10)

In [None]:
convo_surrogate.head(10)