In [8]:
import nltk
import re
import uuid

In [2]:
beckett_text = "Well, I prefer that, I must say I prefer that oh you know, oh you, oh I suppose the audience, well well, so there is an audience, it's a public show, you buy your seat and you wait, perhaps it's free, a free show, you take your seat and you wait for it to begin, or perhaps it's compulsory, a compulsory show. You wait for the compulsory show to begin, it takes time, you hear a voice, perhaps it is a recitation, that is the show, someone reciting, selected passages, old favourites, or someone improvising, you can barely hear him, that's the show, you can't leave, you are afraid to leave... you make the best of it, you try and be reasonable, you came too early, here we'd need latin, it's only beginning, it hasn't begun, he'll appear any moment, he'll begin any moment. He is only preluding, clearing his throat, alone in his dressing room, or it's the stage manager giving his instructions, his last recommendations before the curtain rises, that's the show waiting for the show, to the sound of a murmur, you try and be reasonable, perhaps it is not a voice at all, perhaps it's the air ascending, descending, flowing, eddying, seeking exit, finding none, and the spectators, where are they, you didn't notice, in the anguish of waiting, never noticed you were waiting alone, that is the show for the fools in the palace waiting, waiting alone, that is the show, waiting alone, in the restless air, for it to begin, for something to begin, for there to be something else but you, for the power to rise, the courage to leave. You try and be reasonable, perhaps you are blind, probably deaf, the show is over, all is over, but where then is the hand, the helping hand, or merely charitable, or the hired hand, it's a long time coming, to take yours and draw you away, that is the show, free, gratis and for nothing, waiting alone, blind, deaf, you don't know where, you don't know for what, for a hand to come and draw you away, somewhere else, where perhaps it's worse"

In [3]:
clean_tokens = re.split(r"[ \t\n,\.]+", beckett_text)
dirty_tokens = re.split(r" ", beckett_text)
assert len(dirty_tokens) == len(clean_tokens)

Dear David,

I want to create a web of relationships in the following passage. Each relationship has to link back to what has come before – it can only “rewind”, never “fast-forward”. For example, you can link the “you try and be reasonable” in green back to the “you try and be reasonable” in red, but not vice versa.

Relationships could be in the following categories:

1. repeated words
2. repeated phrases or fragments (e.g. “you wait for...” or “you try and be reasonable”)
3. words that sound similar (rhyme; half-rhyme; shared syllables or phonemes).
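
As a rough illustration of the first category under the “rewind only” rule, here is a minimal sketch that links every repeated word to its most recent earlier occurrence. The function name `backward_word_links` and the `min_len` cut-off (used to skip short words such as “you” or “a”) are assumptions made for this example and are not part of the notebook.

```python
import re

def backward_word_links(text, min_len=4):
    """Return (later_idx, earlier_idx) pairs for repeated words.

    Every link points backwards in the text, never forwards.
    min_len is a hypothetical threshold that skips very short words.
    """
    # same splitting rule as clean_tokens, lower-cased so "You" matches "you"
    tokens = [t.lower() for t in re.split(r"[ \t\n,\.]+", text) if t]
    last_seen = {}  # word -> index of its most recent occurrence
    links = []
    for idx, tok in enumerate(tokens):
        if len(tok) >= min_len and tok in last_seen:
            links.append((idx, last_seen[tok]))
        last_seen[tok] = idx
    return links

sample = "you wait, perhaps it's free, a free show, you wait for it to begin"
for later, earlier in backward_word_links(sample):
    print(later, "->", earlier)
```

Because a word is only ever looked up among positions already visited, a forward link cannot be produced.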

## Repeated phrases (no fragments yet)

In [12]:
# set the minimum size of a phrase to be connected in the text

def find_phrases(clean_tokens, min_phrase_size=5):
    """Find repeated phrases and return a set of (earlier_idx, later_idx) pairs."""

    visited_keys = dict()  # phrase -> start index of its most recent occurrence
    pairs = set()
    ignored_idxs = set() #is this the best way to do this?

    # +1 so that the last full window of min_phrase_size tokens is also checked
    for idx in range(len(clean_tokens) - min_phrase_size + 1):
        if idx in ignored_idxs:
            continue
        key = " ".join(clean_tokens[idx:idx+min_phrase_size])
        # if phrase was visited before
        if key in visited_keys:
            # add arrow from current idx to previous idx
            pairs.add((visited_keys[key], idx))
            # prevent phrases bigger than min size to be added more than once:
            # while both occurrences keep matching, skip the overlapping windows
            fut_lookup_a = min_phrase_size + visited_keys[key]
            fut_lookup_b = min_phrase_size + idx
            while fut_lookup_b < len(clean_tokens) and clean_tokens[fut_lookup_a] == clean_tokens[fut_lookup_b]:
                ignored_idxs.add(fut_lookup_b-min_phrase_size+1)
                fut_lookup_a += 1
                fut_lookup_b += 1

        # add new key and position to dictionary
        visited_keys[key] = idx

    return pairs
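
One possible alternative to the `ignored_idxs` bookkeeping, sketched here under assumptions: once two matching windows are found, measure how far the match extends and treat it as one longer phrase. The helper `match_length` is hypothetical and not part of the notebook. It assumes the first `min_size` tokens at both positions are already known to be equal, and it stops before the two occurrences would overlap.

```python
def match_length(tokens, earlier, later, min_size):
    """Return how many consecutive tokens the two occurrences share.

    Assumes tokens[earlier:earlier+min_size] == tokens[later:later+min_size].
    """
    length = min_size
    while (later + length < len(tokens)
           and earlier + length < later  # keep the two phrases from overlapping
           and tokens[earlier + length] == tokens[later + length]):
        length += 1
    return length

example_tokens = "that is the show waiting alone x that is the show waiting alone in".split()
print(match_length(example_tokens, 0, 7, 3))
```

The returned length could then be passed to `color_phrase` in place of a fixed phrase size, so that a long repetition is coloured as a single phrase.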




In [17]:
def color_phrase(tokens, init_pos, phrase_len, colour):
    tokens[init_pos] = r"{\color{" + colour + r"} " + tokens[init_pos]
    tokens[init_pos+phrase_len-1] = tokens[init_pos+phrase_len-1] + "}"
    return tokens

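def add_tikz_mark(tokens, pos):
    "Insert an empty TikZ node before the middle token of the phrase starting at pos"
    # note: relies on the notebook-level variable phrase_size to find the middle token
    pos_adj = pos+phrase_size//2
    orig_token = tokens[pos_adj]
    # short random label so each node can be referenced when drawing arrows
    label = str(uuid.uuid4())[:8]
    #update token inplace
    tokens[pos_adj] = "\n\\tikz[remember picture,inner sep=0pt] \\node (" + label + ")  {};%%\n" + orig_token
    return (tokens, label)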

def create_tikz_node(label1, label2):
    return r"""
\begin{tikzpicture}[remember picture,overlay]
  \draw[arrows=->,blue,line width=2pt,opacity=0.20] (""" + label1 + r""") ++ (-0.25em,1ex) .. controls ++ (2,0) and ++(0,3) .. ($(""" + label2 + r""")+(4pt,1.35ex)$);
\end{tikzpicture}"""

# def create_tikz_node(label1, label2):
#     return r"""
# \begin{tikzpicture}[remember picture,overlay]
#   \draw[arrows=->,blue,line width=2pt,opacity=0.20] (""" + label1 + r""") -- (""" + label2 + r""");
# \end{tikzpicture}"""

In [6]:
def build_text(tokens, nodes):
    preamble = r"""
\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{calc}
\linespread{1.5}


\begin{document} 
    """
    
    postamble = r"""

\end{document}   
    """

    main = " ".join(tokens)
    nodes = "\n".join(nodes)
    
    return "\n".join([preamble, main, nodes, postamble])
    
    

In [20]:
phrase_size = 3
nodes = []

# re-split the text first so that tokens decorated by a previous run are discarded
clean_tokens = re.split(r"[ \t\n,\.]+", beckett_text)
dirty_tokens = re.split(r" ", beckett_text)

pairs = find_phrases(clean_tokens, phrase_size)

# create pairs for each arrow
for (pos1, pos2) in pairs:

    #add color to phrases (dirty_tokens is modified in place)
    color_phrase(dirty_tokens, pos1, phrase_size, "red")
    color_phrase(dirty_tokens, pos2, phrase_size, "green")

    #mark tokens with tikz tags
    _, label1 = add_tikz_mark(dirty_tokens, pos1)
    _, label2 = add_tikz_mark(dirty_tokens, pos2)

    #store the node information: the arrow runs from the later phrase back to the earlier one
    nodes.append(create_tikz_node(label2, label1))

print(build_text(dirty_tokens, nodes))


\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{calc}
\linespread{1.5}


\begin{document} 
    
Well, {\color{red} I 
\tikz[remember picture,inner sep=0pt] \node (7cc445a6)  {};%%
prefer that,} I must say {\color{green} I 
\tikz[remember picture,inner sep=0pt] \node (90411839)  {};%%
prefer that} oh you know, oh you, oh I suppose the audience, well well, so there is an audience, it's a public show, you buy {\color{red} your 
\tikz[remember picture,inner sep=0pt] \node (e1636d03)  {};%%
seat and} you wait, perhaps it's free, a free show, you take {\color{green} your 
\tikz[remember picture,inner sep=0pt] \node (3127d716)  {};%%
seat and} you wait {\color{red} for 
\tikz[remember picture,inner sep=0pt] \node (08e03e5e)  {};%%
it to} begin, or perhaps it's compulsory, a compulsory show. You wait for the compulsory show to begin, it takes time, you hear a voice, {\color{red} perhaps 
\tikz[remember picture,inner sep=0pt] \node (07090a0b)  {};%%
it is} a recitation, {\color{red} t

In [10]:
dirty_tokens

['Well,',
 'I',
 'prefer',
 'that,',
 'I',
 'must',
 'say',
 'I',
 'prefer',
 'that',
 'oh',
 'you',
 'know,',
 'oh',
 'you,',
 'oh',
 'I',
 'suppose',
 'the',
 'audience,',
 'well',
 'well,',
 'so',
 'there',
 'is',
 'an',
 'audience,',
 "it's",
 'a',
 'public',
 'show,',
 'you',
 'buy',
 '{\\color{red} your',
 '{\\color{red} seat',
 '\n\\tikz[remember picture,inner sep=0pt] \\node (dfa39ac1)  {};%%\nand',
 '\n\\tikz[remember picture,inner sep=0pt] \\node (e6b46338)  {};%%\nwait,}',
 'perhaps}',
 'perhaps',
 "it's",
 'free,',
 'a',
 'free',
 'show,',
 'you',
 'take',
 '{\\color{green} your',
 '{\\color{green} seat',
 '\n\\tikz[remember picture,inner sep=0pt] \\node (9de66ff0)  {};%%\nand',
 '\n\\tikz[remember picture,inner sep=0pt] \\node (948ae8dd)  {};%%\nwait}',
 '{\\color{red} for}',
 '{\\color{red} for',
 'it',
 '\n\\tikz[remember picture,inner sep=0pt] \\node (5481c57b)  {};%%\nto',
 'or}',
 'or',
 'perhaps',
 "it's",
 'compulsory,',
 'a',
 'compulsory',
 'show.',
 'You',
 'wait

In [119]:
print(add_tikz_mark(dirty_tokens[13])[0])


\tikz[remember picture,inner sep=0pt] \node (3825d819)  {};%%
oh


In [56]:
p1 = 228
p2 = 241
min_phrase_size = 5  # same value as the default in find_phrases

r1 = " ".join(dirty_tokens[p1-5:p1+min_phrase_size+5])
r2 = " ".join(dirty_tokens[p2-5:p2+min_phrase_size+5])
print(r1)
print()
print(r2)

never noticed you were waiting alone, that is the show for the fools in 
the

in 
the palace waiting, waiting alone, that is the show, waiting alone, in the restless


In [58]:
"bla" ++ "ble"

TypeError: bad operand type for unary +: 'str'