# Text exploration and processing


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize import WhitespaceTokenizer
from nltk import FreqDist
from keras.callbacks import ModelCheckpoint
import numpy as np
import tensorflow as tf
import progressbar


## Import text data

The data is currently a single text file in which each line corresponds to one post. The method for extracting it from Reddit is detailed in the scraper file.

The data will first be explored.

In [None]:
num_posts = 5000
file_path = '../raw_data/relationships_{}.txt'.format(num_posts)
gram_len = 4

with open(file_path, 'r') as file:
    raw_relationship_data = file.read()
    print("file imported")

In [None]:
raw_relationship_data[:200]

This first post raises an interesting point about capital letters. I assumed that we wouldn't need to lowercase all the data, but if this Capitalised Every Word syntax is prevalent then it could be an issue. We will assume that lowercasing produces a more informative model, thanks to greater uniformity and a lower likelihood of out-of-vocab words.
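
As a rough illustration of why lowercasing helps, here is a minimal sketch on a made-up sentence (not taken from the data) that compares the vocabulary size before and after lowercasing. The variable names are only for this example.

```python
# Toy text standing in for the scraped posts (made up for illustration)
sample = "My Boyfriend And I broke up . my boyfriend and i are talking again ."

tokens = sample.split()
vocab_cased = set(tokens)                             # "My" and "my" count separately
vocab_lower = set(token.lower() for token in tokens)  # case variants collapse

print(len(vocab_cased), len(vocab_lower))
```

Every case variant that collapses is one fewer rare token for the model to learn.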

We are going to explore the punctuation in the text as a whole to see what may be insignificant.

In [None]:
print("Count of punctuations/special characters")

interesting_punctuation = [
    "\n", ".", ",", ":",";", "\t", "?","!", "-", "(", ")", ":(", ":)", "</3",
    "[", "]", "'", '"', "<","_"
]

for punctuation in interesting_punctuation:
    print(repr(punctuation), "\t", raw_relationship_data.count(punctuation))


This shows that we have exactly the right number of `\n` symbols (one per post). Most of the other punctuation may not be relevant: apart from full stops, question marks and (maybe?) commas, none of it appears in huge numbers.

## Clean data

We want the data to take into account certain grammatical and punctuation syntax. Therefore we are going to map certain symbols to others, and indicate where the end of a sentence is. There must be adequate spaces between relevant tokens or they won't be parsed properly.

The punctuation that is going to be kept in is:

* full stops
* question marks
* brackets (one type)

We are going to convert the text to lower case for all words in order to increase the uniformity of the text.

The newline `\n` symbol is going to be converted to ` <END> <START> ` to mark the end of one post and the start of the next (using the assumption that posts are one line per post).

We should probably be using regular expressions here for better performance, but alas this is a first run.
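
For comparison, a regular-expression version of the "pad punctuation with spaces" step could look roughly like the sketch below. The helper name `space_out_punctuation` is made up for this example and is not used elsewhere in the notebook.

```python
import re

def space_out_punctuation(text):
    """Pad the punctuation we keep with spaces, then collapse repeated spaces."""
    # one pass over the text instead of one str.replace call per symbol
    text = re.sub(r"([.?,])", r" \1 ", text)
    return re.sub(r" {2,}", " ", text).strip()

print(space_out_punctuation("are we okay?i think so, maybe."))
```

The capture group keeps whichever symbol was matched, so adding another symbol only means extending the character class.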

### Lowercase the data


In [None]:
raw_relationship_data = raw_relationship_data.lower()
print(raw_relationship_data[:100])

### Add spaces to the punctuation we want to keep

This should be replaced with a loop to look nicer.
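
A loop version might look something like this sketch. `REPLACEMENTS` and `apply_replacements` are hypothetical names, and only a handful of the rules are included. The table is an ordered list because later rules see the output of earlier ones.

```python
# Hypothetical replacement table; order matters because each rule
# runs on the output of the rules before it
REPLACEMENTS = [
    ("\n", " <END> <START> "),
    (".", " . "),
    ("?", " ? "),
    (",", " , "),
    ("[", " ("),
    ("]", ") "),
]

def apply_replacements(text, replacements=REPLACEMENTS):
    """Apply each (old, new) pair in turn with str.replace."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

print(apply_replacements("me [25f].\nhim [27m]?"))
```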

In [None]:
raw_relationship_data = raw_relationship_data.replace("<", " ")
raw_relationship_data = raw_relationship_data.replace(">", " ")


raw_relationship_data = raw_relationship_data.replace("\n", " <END> <START> ")
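# ellipses must be handled before full stops are padded, otherwise "..." never matches
raw_relationship_data = raw_relationship_data.replace("...", " , ")
raw_relationship_data = raw_relationship_data.replace(".", " . ")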
raw_relationship_data = raw_relationship_data.replace("?", " ? ")
raw_relationship_data = raw_relationship_data.replace(",", " , ")

raw_relationship_data = raw_relationship_data.replace("[", " (")
raw_relationship_data = raw_relationship_data.replace("]", ") ")

raw_relationship_data = raw_relationship_data.replace(":", " ")
raw_relationship_data = raw_relationship_data.replace(";", " ")
raw_relationship_data = raw_relationship_data.replace("-", " ")
raw_relationship_data = raw_relationship_data.replace("!", " ")
raw_relationship_data = raw_relationship_data.replace("_", " ")

raw_relationship_data = raw_relationship_data.replace('"', "")
raw_relationship_data = raw_relationship_data.replace("'", "")
raw_relationship_data = raw_relationship_data.replace("“", "")
raw_relationship_data = raw_relationship_data.replace('”', "")
raw_relationship_data = raw_relationship_data.replace('’', "")
raw_relationship_data = raw_relationship_data.replace('…', " ")
# a literal "..." cannot occur at this point: every "." has already been replaced above


I gave up on not using regular expressions. We can check which non-alphanumeric characters are still within the text.

Some of the emojis produced are minorly upsetting.

In [None]:
import re
set(re.sub(r'[A-Za-z0-9 ]', '', raw_relationship_data))

From this we can see there is a wide range of punctuation that is not covered by our replacing procedure. We will remove everything except:

* alphanumerics and spaces
* full stops, commas, question marks
* round brackets
* characters in the `<END>` and `<START>` symbols

In [None]:
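# keep only alphanumerics, spaces, angle/round brackets, commas, full stops and question marks
relationship_data = re.sub(r'[^A-Za-z0-9 <>(),.?]', '', raw_relationship_data)
# collapse runs of consecutive spaced-out full stops
relationship_data = re.sub(r"(  \.  \.)+", " ", relationship_data)
# collapse repeated spaces
relationship_data = re.sub(r"  +", " ", relationship_data)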

print(relationship_data[:200])

In [None]:
set(re.sub(r'[A-Za-z0-9 ]', '', relationship_data))

It may be useful to replace the (GENDER_AGE) syntax with a generic placeholder to prevent rare / out-of-vocab issues; otherwise the model will end up predicting some age based on language.
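
A minimal sketch of that placeholder idea is below. It assumes the tags look like `(25m)` or `(f23)` after cleaning. Both the pattern and the `<PERSON>` placeholder are guesses for illustration, not something checked against the real posts.

```python
import re

# Assumed tag shapes after cleaning: "(25m)", "(f23)", "(25 m)"
AGE_GENDER_TAG = re.compile(r"\(\s*(?:\d{1,2}\s*[mf]|[mf]\s*\d{1,2})\s*\)")

def replace_age_gender(text, placeholder="<PERSON>"):
    """Swap every age/gender tag for one generic token (hypothetical helper)."""
    return AGE_GENDER_TAG.sub(placeholder, text)

print(replace_age_gender("my boyfriend (25m) and i (f23) had a fight"))
```

Other bracketed text is left alone because the pattern only matches one or two digits next to an `m` or `f`.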

Not quite sure where to tokenise this data, definitely before creating the sequences but not sure if the data should be sentences first.

Will go with before creating sentences.


### Tokenization

Separate the string into words using spaces to determine a new token. This will make punctuation tokens which is what we want for sentence structure.

We could use one of NLTK's casual tokenizers, but as we have already preprocessed the strings for our own purpose the standard one may do fine. EDIT: as we have processed our words and punctuation to have whitespace where appropriate, the WhitespaceTokenizer is best here.
EDIT2: We don't use these tokenised words later, but they're interesting to look at / explore the WS tokeniser.
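
To see what whitespace tokenisation gives us, here is a tiny sketch using plain `str.split()` on a made-up cleaned string. As far as I know this behaves the same as NLTK's `WhitespaceTokenizer` for this kind of input.

```python
# Made-up cleaned text in the same shape the cleaning steps produce
cleaned = "my boyfriend (25m) and i broke up . <END> <START> what should i do ?"

# splitting on whitespace keeps punctuation and markers as their own tokens
tokens = cleaned.split()
print(tokens)
```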

In [None]:
ws_tk = WhitespaceTokenizer() 

relationships_word_tokened = ws_tk.tokenize(relationship_data)

print(relationships_word_tokened[:50])

print("Number of tokens total:", len(relationships_word_tokened))

Unsurprisingly many of our most common words are stop words, but these are important to our sentence structure so they will be kept in. 

We may choose to use the sentence structure of our data instead of a bag of words model, which will mean tokenising the sentences as well as the words. I've done this kind of backwards: the `\n` strings denoted new posts previously, but now we get a cleaned string for each post.

In [None]:
relationship_data_sents = relationship_data.split(" <END> <START> ")
relationship_data_sents[0] = relationship_data_sents[0].replace("<START>", "")


print(relationship_data_sents[:10])


Get some more information about our posts

In [None]:
# tokenise each non-empty post once, then reuse the list for the length statistics
relationship_data_sents_words = [ws_tk.tokenize(post) for post in relationship_data_sents if post]

MAX_SEQ_LENGTH = max(len(post) for post in relationship_data_sents_words)
MIN_SEQ_LENGTH = min(len(post) for post in relationship_data_sents_words)

print("Max post length: ", MAX_SEQ_LENGTH, "\n\n")
print("Min post length: ", MIN_SEQ_LENGTH, "\n\n")


We now have a list containing each post, within each post is a list of each token within the post. The longest post is given by `MAX_SEQ_LENGTH`

### Generate vocab
***Warning***

The step below will take on the order of minutes if the number of posts is above 10,000.

In [None]:

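# this has to be done after tokenisation, otherwise set() would collect characters instead of words
vocab = set(relationships_word_tokened)
# +1 leaves room for index 0, which the Keras Tokenizer never assigns to a word
len_vocab = len(vocab) + 1
print("Vocab length: ", len_vocab)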


We need to convert the word data into integers the model will be able to understand. It's a little bit cheating, but Keras has a nice way to do this.
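
For intuition, the word-to-integer mapping can be sketched by hand on toy posts. This is only an approximation of what the Keras `Tokenizer` does (frequency-ranked indices starting at 1, with 0 never assigned to a word). The exact tie-breaking between equally frequent words may differ.

```python
from collections import Counter

posts = ["i love him .", "i miss him .", "he left ."]  # toy posts, made up

counts = Counter(token for post in posts for token in post.split(" "))

# most frequent token gets index 1; index 0 is left free (e.g. for padding)
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

encoded = [[word_index[token] for token in post.split(" ")] for post in posts]
print(word_index)
print(encoded)
```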

In [None]:
from keras.preprocessing.text import Tokenizer

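# convert the posts to sequences of integer word indices
# filters="" disables the default punctuation stripping, since the text is already cleaned
keras_embedder = Tokenizer(num_words=None, filters="", lower=False, split=" ")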

keras_embedder.fit_on_texts(relationship_data_sents)

embedded_sents = keras_embedder.texts_to_sequences(relationship_data_sents)

print(len(embedded_sents))

Let's create a function so that we can keep only sequences with the most frequent words (by number of occurrences in the corpus). This should allow the model to generalise better.

Shouldn't convert from words -> embeddings -> words but oh well.

Total overall count in the corpus is possibly not the best metric to determine inclusion (tf-idf?), but it is simple to implement and should get the right result of reducing noise.

Possibly also drop very unique trigrams? That would reduce serendipity, but make for easier predictions.

Some embedding values will have no data but the model should be able to handle that *gestures wildly*.

The function below is implemented to speed up training and improve predictions by keeping only sequences that contain at least one word used more often than some (arbitrary?) threshold. The model is unlikely ever to predict a word that has been used only once in its corpus. This will alter the distribution of words and impact overfitting, but it seems sensible to introduce the ability to control for frequency.

In [None]:
from collections import Counter

TOKEN_FREQ_DICT = Counter(relationships_word_tokened)

# we have an embedder to go from words to numbers, this dict goes numbers to words
# there is probably an implementation w/in keras for this but..

INV_EMBEDDER = {index: word for word, index in keras_embedder.word_index.items()}

def check_freq(sequence, threshold):
    """Return True if at least one token in the sequence occurs more than `threshold` times."""
    # any() stops at the first sufficiently frequent token
    return any(TOKEN_FREQ_DICT[INV_EMBEDDER[index]] > threshold for index in sequence)
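
Note that `check_freq` keeps a sequence as soon as one of its tokens is frequent enough. A stricter filter would require every token to pass. The sketch below compares the two rules on made-up counts. `all_tokens_frequent` is a hypothetical variant, not something used in this notebook.

```python
from collections import Counter

# Toy counts and index mapping, made up for illustration
toy_freq = Counter({"i": 50, "love": 12, "him": 30, "defenestrate": 1})
toy_inverse = {1: "i", 2: "love", 3: "him", 4: "defenestrate"}

def any_token_frequent(sequence, threshold, inverse=toy_inverse, freq=toy_freq):
    """Same rule as check_freq: one frequent token is enough."""
    return any(freq[inverse[index]] > threshold for index in sequence)

def all_tokens_frequent(sequence, threshold, inverse=toy_inverse, freq=toy_freq):
    """Hypothetical stricter rule: every token must exceed the threshold."""
    return all(freq[inverse[index]] > threshold for index in sequence)

rare_window = [1, 2, 4]  # contains the rare word "defenestrate"
print(any_token_frequent(rare_window, 5), all_tokens_frequent(rare_window, 5))
```

The stricter rule would drop more sequences, so which one is better depends on how much data we can afford to lose.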


Quick exploration of the most frequent words.

In [None]:
import matplotlib.pyplot as plt

most_common_40 = TOKEN_FREQ_DICT.most_common(40)

words, counts = list(zip(*most_common_40))

plt.bar(words, list(counts))
plt.xticks(rotation=90);

These appear to be (unsurprisingly) mostly stop words plus topic-related nouns (boyfriend, girlfriend, friend).

In [None]:
from keras.preprocessing.sequence import pad_sequences
import functools
import operator
import itertools

keep_in_if_more_than = 5

sequences = []

# consider converting `sequences` to a generator format
embedded_sents_right_len = (post for post in embedded_sents if len(post) > gram_len)
for post_num, post in enumerate(embedded_sents_right_len):
    if post_num % 1000 == 0:
        print("Post number", post_num)
    for index in range(gram_len - 1, len(post)):
        # window of exactly gram_len tokens ending at `index`
        single_sequence = post[index - gram_len + 1:index + 1]
        if check_freq(single_sequence, keep_in_if_more_than):
            # other methods for appending tested for speed
            #sequences = sequences + single_sequence
            #sequences = itertools.chain(single_sequence, sequences)
            sequences.append(single_sequence)

print("loop end")


# flatten in linear time; reduce(operator.concat, ...) copies the growing list at every step
flattened_sequences = list(itertools.chain.from_iterable(sequences))
print("Total words: {}".format(len(flattened_sequences)))
print("Number unique words: {}".format(len(set(flattened_sequences))))
print("Number of sequences: {}".format(len(sequences)))
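
On the generator idea mentioned in the comment above: a sliding-window generator could look roughly like this sketch. `iter_windows` is a hypothetical helper that leaves out the frequency filter and the progress printing to keep the example short.

```python
def iter_windows(posts, window_len):
    """Yield every run of `window_len` consecutive tokens from each post."""
    for post in posts:
        # posts shorter than the window produce an empty range and are skipped
        for start in range(len(post) - window_len + 1):
            yield post[start:start + window_len]

toy_posts = [[1, 2, 3, 4, 5], [6, 7, 8]]  # made-up integer-encoded posts
windows = list(iter_windows(toy_posts, 4))
print(windows)
```

Because it yields one window at a time, the full list only has to exist in memory if we explicitly ask for it with `list(...)`.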


We now have an integer encoding for each post, so we can make the train/predict sequence pairs.

We want to ensure all our sequences are padded adequately. They should already be the right length after pruning the non-`gram_len` lengths, but (if you exclude time) it doesn't hurt to add this.

In [None]:
# pad_sequences accepts the list of lists directly
padded_sequences = pad_sequences(sequences, maxlen=gram_len, padding='pre')
        
print(padded_sequences.shape)

Convert the sequences to features + targets in order to train a model in a categorical manner.

Would be good to see if this is easier in a vectorised or functional approach.
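
As a sketch of the vectorised route, assuming the padded sequences form a 2D NumPy array with one row per sequence, the split is two slices. The toy array and the `X_toy` / `y_toy` names are made up for this example.

```python
import numpy as np

# Toy stand-in for the padded sequences: 3 sequences of length 4
toy_sequences = np.array([[1, 2, 3, 4],
                          [2, 3, 4, 5],
                          [0, 6, 7, 8]])

X_toy = toy_sequences[:, :-1]  # every column except the last is the input
y_toy = toy_sequences[:, -1]   # the last column is the word to predict

print(X_toy.shape, y_toy.shape)
```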

In [None]:
from sklearn.utils import shuffle

# split into input and output elements
X = []
y = []

# some wrongly formatted sequences were passing through; every padded row
# has length gram_len, so a simple length check is enough as a guard
padded_sequences = (each for each in padded_sequences if len(each) > 1)

# y is the final element of each sequence
for each_seq in padded_sequences:
    X_each_seq, y_each_seq = each_seq[:-1], each_seq[-1]
    X.append(X_each_seq)
    y.append(y_each_seq)

X = np.array(X)
y = np.array(y)

X, y = shuffle(X, y, random_state=0)

print(X.shape)
print(y.shape)
print(y[:20])

Save the processed data so it can be accessed later.

In [None]:
import pickle
import time

timestr = time.strftime("%Y%m%d-%H%M%S")

pickle_file = "../processed_data/{}-{}-{}-{}".format(
    timestr, keep_in_if_more_than, gram_len, num_posts)

# write the features, targets, tokenizer and settings together so they stay in sync
with open(pickle_file, "wb") as f:
    pickle.dump((X, y, keras_embedder, len_vocab, gram_len), f)

In [None]:
# Open file just written to check contents
with open(pickle_file, "rb") as f:
    X, y, keras_embedder, len_vocab, gram_len = pickle.load(f)