<a href="https://colab.research.google.com/github/michaelgfalk/fugitive-words/blob/master/english_sequence_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# English Language Model for Foreign Word Detection

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Lambda, LSTM, CuDNNLSTM, Dropout, TimeDistributed, Masking
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.text import Tokenizer # For one-hot encoding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from google.colab import drive # For saving
import pickle as p
import regex as re
import pandas as pd

In [None]:
# Link to Google drive for disk access
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# set random seed for notebook
random_seed = 425

## 1. Import and clean training data

For this task, we want a list of English words. The model will examine them to learn how letters follow one another in English.

Training data has been harvested from the three large English corpora included in the Natural Language Toolkit: the 'brown' corpus of contemporary American English, the 'reuters' corpus of recent news articles, and the 'gutenberg' corpus, which comprises 18 literary texts, mostly from the Romantic period, but including a few plays of Shakespeare and the King James Bible. It has also been sourced from the 'lexicon' files for Contemporary and Historical American corpora on the BYU Corpus site. These corpora have between them ~5-10 million tokens, which equates to about 150,000 unique types in practice.

Since no Australian or NZ corpus has been used, hopefully there are very few Papuan, Aboriginal and Austronesian words in the training set, and the model should learn to give a low probability to strings from those language families.

The words will be split into characters, and special characters and punctuation will be removed.
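As a rough sketch of what this cleaning does to a single word: the helper name `clean_word` is hypothetical, and the standard-library `re` module is used here in place of the `regex` package that the notebook imports.

```python
import re

# Any non-word character, digit or underscore is stripped
STRIP = re.compile(r'[\W\d_]')

def clean_word(word):
    """Lower-case a word, strip unwanted characters, and wrap it in S...E markers."""
    stripped = STRIP.sub('', word.lower())
    return 'S' + stripped + 'E'

print(clean_word("Don't"))      # SdontE
print(list(clean_word("Dog")))  # ['S', 'd', 'o', 'g', 'E']
```

The `S` and `E` markers are upper case so that they cannot collide with the lower-cased letters of the word itself.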

**NB:** In earlier versions of this notebook I neglected to save the preprocessed data & the dict that defines how the tokenizer converts strings into a numeric representation. Don't forget it this time!

In [None]:
# Get training data from NLTK
import nltk
nltk.download('brown')
nltk.download('reuters')
nltk.download('gutenberg')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [None]:
# Get training data from BYU corpora samples
byu_words = []
with open('/content/gdrive/My Drive/waves_of_words/byu_lexicons.txt', 'r') as file:
  for line in file:
    line = line.rstrip() # remove trailing whitespace
    byu_words.append(line) 

# Sanity check:
print(f'There are {len(byu_words)} unique types in the byu corpora. The first four are:\n {byu_words[0:4]}')

There are 114097 unique types in the byu corpora. The first four are:
 ['word', 'the', 'and', 'of']


In [None]:
# Combine into single wordlist
words = set(nltk.corpus.brown.words() + nltk.corpus.reuters.words() + nltk.corpus.gutenberg.words() + byu_words)
words = list(words)

# Regex for stripping special characters
regex = re.compile(r'[\W\d_]') # Match any non-word character, digit or underscore

# Clean

# Set all words to lower case, and add start and end characters:
words = set(['S' + word.lower() + 'E' for word in words])
# Get rid of digits and non-word characters:
words = set([regex.sub('', word) for word in words])
# Convert set to list:
words = list(words)
# Get rid of empty strings/junk
# NB: every word now carries the 'S' and 'E' markers, so an emptied word survives as 'SE'
words = [word for word in words if len(word) > 0]

In [None]:
# Have another look at the data
print(f'There are {len(words)} unique types in the training data.')

There are 149849 unique types in the training data.


In [None]:
# Format for tensorflow
tkzr = Tokenizer(lower = False, char_level = True, oov_token = "?", filters = None) # out-of-vocab character represented by ?
tkzr.fit_on_texts(words)
seq_list = tkzr.texts_to_sequences(words)
data = pad_sequences(seq_list, padding = 'pre')

In [None]:
char_to_int = tkzr.word_index
int_to_char = {value:key for key,value in tkzr.word_index.items()}

In [None]:
# Save training data and word_index to Google Drive
with open('/content/gdrive/My Drive/waves_of_words/20190215_seq_data.p', 'wb') as file:
  p.dump({"data":data,"tkzr":tkzr}, file)

In [None]:
with open('/content/gdrive/My Drive/waves_of_words/20190215_seq_data.p', 'rb') as file:
  saved = p.load(file)

data = saved['data']
tkzr = saved['tkzr']

In the previous iteration of this notebook, I set up the data incorrectly: the model only predicted the last letter in the sequence.

This model will work differently. Instead of predicting the last character, it will predict the next character at every timestep. This means that $X$ and $Y$ will have the same shape: $(m, t-1, n)$, where $m$ is the number of training examples, $t-1$ the maximum length of the sequences minus 1, and $n$ the number of features (in this case, characters in the alphabet, plus $0$ for no character, $1$ for an unknown character, and the two special characters $S$ and $E$, which represent the start and end of the word respectively).

$X$ will contain the characters from $0:t-1$, while $Y$ will contain the characters for $1:t$.
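As a minimal sketch of this shift, here is a toy integer-encoded array rather than the notebook's real data. The array `toy`, its values and `n_classes` are made up for illustration.

```python
import numpy as np

# Toy batch: 2 sequences, 5 timesteps, integer-encoded (0 = padding).
# These values are illustrative only; they are not from the training data.
toy = np.array([[0, 0, 3, 7, 4],
                [0, 3, 9, 9, 4]])

# One-hot encode by indexing an identity matrix: shape (m, t, n)
n_classes = 10
one_hot = np.eye(n_classes, dtype='float32')[toy]

# Inputs are every timestep except the last; targets are every timestep except the first
X_toy = one_hot[:, :-1, :]
Y_toy = one_hot[:, 1:, :]

print(X_toy.shape, Y_toy.shape)  # (2, 4, 10) (2, 4, 10)
```

At each timestep, the target row in `Y_toy` is simply the input row that `X_toy` holds one step later.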

In [None]:
# Get dimensions
# Convert data to one-hot encoding
print(f'Data dimensions before one-hot-encoding: {data.shape}')
data = tf.keras.utils.to_categorical(data, dtype = 'float32') # For some reason Tensorflow requires a float input
print(f'Data dimensions after one-hot-encoding: {data.shape}')
data = data[:,:,1:] # Drop padding (we don't want the LSTM to learn a feature for an empty timestep)
print(f'Data dimensions after dropping padding feature: {data.shape}')
m, t, n = data.shape
X = data[:,0:-1,:]
Y = data[:,1:,:]

# Shuffle and split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = random_seed)

# Sanity check
print(f'The shape of X_train is {X_train.shape}')
print(f'The shape of X_test is {X_test.shape}')
print(f'The shape of Y_train is {Y_train.shape}')
print(f'The shape of Y_test is {Y_test.shape}')

Data dimensions before one-hot-encoding: (149849, 30)
Data dimensions after one-hot-encoding: (149849, 30, 35)
Data dimensions after dropping padding feature: (149849, 30, 34)
The shape of X_train is (134864, 29, 34)
The shape of X_test is (14985, 29, 34)
The shape of Y_train is (134864, 29, 34)
The shape of Y_test is (14985, 29, 34)


## 2. Instantiate and train model

In [None]:
def init_lstm(max_time, num_features, hidden_state_dim = 10, rnn_layers = 3, drop_rate = 0.5):
  """
  Implementation of a deep LSTM for sequence learning.
  
  params:
    max_time (int): the maximum sequence length in the training data
    num_features (int): the number of individual characters in the training set
    hidden_state_dim (int): the size of the hidden state in the LSTM cells
    rnn_layers (int): the number of LSTM layers desired
    drop_rate (float: 0 <= x < 1): the fraction of inputs to randomly ignore in the Dropout layers
  
  returns:
    lstm_net: a Keras Model() object
  """
  
  # Define input to model
  # NB: Although an integer input might seem to make sense for one-hot encoding, tf requires a float input
  seq_in = Input(shape = (max_time, num_features), dtype = 'float32', name = "seq_in")
  
  # Hidden layers
  for i in range(rnn_layers):
    if i == 0:
      X = CuDNNLSTM(units = hidden_state_dim, return_sequences = True, name = "lstm_" + str(i))(seq_in)
    else:
      X = CuDNNLSTM(units = hidden_state_dim, return_sequences = True, name = "lstm_" + str(i))(X)
    X = Dropout(rate = drop_rate, name = "dropout_" + str(i))(X)
  
  # Activation layer (apply to each timestep)
  Yhat = TimeDistributed(Dense(num_features, activation = "softmax"))(X)
  
  # Create model
  lstm_net = Model(inputs = [seq_in], outputs = Yhat)
  
  return lstm_net

In [None]:
# Set hyperparameters
hd = 150
l = 4
dr = 0.3

# Initialise model
model = init_lstm(max_time = t-1, num_features = n, hidden_state_dim = hd, rnn_layers = l, drop_rate = dr)

# Sanity check
model.summary()

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
seq_in (InputLayer)          (None, 29, 34)            0         
_________________________________________________________________
lstm_0 (CuDNNLSTM)           (None, 29, 150)           111600    
_________________________________________________________________
dropout_0 (Dropout)          (None, 29, 150)           0         
_________________________________________________________________
lstm_1 (CuDNNLSTM)           (None, 29, 150)           181200    
_________________________________________________________________
dropout_1 (Dropout)          (None, 29, 150)           0         
_________________________________________________________________
lstm_2 (CuDNNLSTM)   

In [None]:
# Create optimizer and compile
opt = tf.keras.optimizers.Adam(clipnorm = 5)
model.compile(optimizer = opt, loss = 'categorical_crossentropy')

In [None]:
# Train new model and save
model.fit(x = X_train, y = Y_train,
          batch_size = 128, epochs = 20,
          validation_data = (X_test, Y_test),
          verbose = 1)


model.save('/content/gdrive/My Drive/waves_of_words/20190220_model_more_layers_20_epochs.h5')

Train on 134864 samples, validate on 14985 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## 3. Evaluate the model

Now the test. Does the model assign a higher or lower probability to Aboriginal words than to unseen English words?

**Notes:**

*20 Feb 2019:* Training a much deeper model doesn't seem to make much difference. The model made just the same kind of prediction, even when I increased the number of hidden layers from 2 to 4.

In [None]:
# Set paths for saved model and training/test data
model_path = '/content/gdrive/My Drive/waves_of_words/20190220_model_more_layers_20_epochs.h5'
data_path = '/content/gdrive/My Drive/waves_of_words/20190215_seq_data.p'

In [None]:
# RUN THIS CELL IF YOU NEED TO EVALUATE A SAVED MODEL

# Load the model
model = tf.keras.models.load_model(model_path)

# Load training and test data
with open(data_path,'rb') as file:
  save_dict = p.load(file)

# Unpack data
data = save_dict['data']
tkzr = save_dict['tkzr']

# Reshape data

data = tf.keras.utils.to_categorical(data, dtype = 'float32') # For some reason Tensorflow requires a float input
data = data[:,:,1:] # Drop padding (we don't want the LSTM to learn a feature for an empty timestep)
m, t, n = data.shape
X = data[:,0:-1,:]
Y = data[:,1:,:]

# Shuffle and split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, random_state = random_seed)

The following helper functions can be used to retrieve sequence probabilities, and also to prepare texts for processing by the model.

In [None]:
def get_seq_probs(data, model, padding_removed = True):
  """
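  Given data, get the model's predicted probabilities for each sequence.
  
  Args:
    data (np.array): a 3d-tensor of dimensions examples x time-steps x num_features
    model (keras.Model): a trained RNN
    padding_removed (bool): True if the padding feature has already been dropped
                            from the last axis of data; if False, it is dropped here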
    
  Returns:
    probs (np.array): a 1d numpy array of the probability of each sequence
  """
  
  # Split data into X and Y
  if padding_removed:
    X = data[:,:-1,:]
    Y = data[:,1:,:]
  else:
    X = data[:,:-1,1:]
    Y = data[:,1:,1:]
  
  # Get predictions for X
  pred = model.predict(X)
  
  # Compute product for each sequence
  pred = np.ma.masked_array(pred, Y == 0) # Get probs for the correct letters
  pred = np.sum(pred, axis = 2) # Sum over feature vectors
  pred = np.prod(pred, axis = 1) # Multiply over timesteps
  pred = np.ma.compressed(pred)
  
  return pred

In [None]:
def process_texts(words, tkzr, maxlen = 30):
  """
  Given a list of texts and a tokeniser, creates one-hot matrix.
  
  Arguments:
    words (list): the texts (in this case, words)
    tkzr (keras.preprocessing.Tokenizer): a Tokeniser that has been fit on the data
    maxlen (int): the sequence length to pad or truncate to
    
  Returns:
    data (np.array): a 3d numpy array of dimensions m x t x n, with the padding
                     category removed
  """
  ## CLEAN UP TEXT
  
  # Regex for stripping special characters
  regex = re.compile(r'[\W\d_]') # Match any non-word character, digit or underscore
  # Make sure all the words are strings
  words = [word for word in words if type(word) == str]
  # Get rid of digits and non-word characters:
  words = set([regex.sub('', word) for word in words])
  # Set all words to lower case, and add start and end characters:
  words = set(['S' + word.lower() + 'E' for word in words])
  # Convert set to list:
  words = list(words)
  # Get rid of empty strings/junk
  words = [word for word in words if len(word) > 0]
  print(f'Text data cleaned: there are {len(words)} texts in the corpus.')
  
  ## CONVERT TO BINARY REPRESENTATION
  # t and n are fixed by the model/tokenizer:
  t = maxlen
  n = len(tkzr.word_index) + 1
  
  seq_list = tkzr.texts_to_sequences(words) # Convert to list of feature vectors
  data = pad_sequences(seq_list, padding = 'pre', truncating = 'post', maxlen = t) # Convert to matrix of fixed width
  data = tf.keras.utils.to_categorical(data, dtype = 'float32', num_classes = n) # One-hot encode
  data = data[:,:,1:] # Remove padding feature from 3-tensor
  print(f'Data converted to binary representation. It has dimensions: {data.shape}')
  
  return(data)
  

In [None]:
def reconstruct_sequences(data, tkzr, padding_removed = True):
  """
  Reconstruct the sequences from the data matrix using the tokeniser.
  
  Arguments:
    data (np.ndarray): a 3d array of one-hot encoded sequences
    tkzr (keras_preprocessing.text.Tokenizer): the Tokeniser used to preprocess
          the data
  
  Returns:
    seqs (list): the words from the data
  """
  
  int_to_char = {value:key for key,value in tkzr.word_index.items()}
  int_to_char[0] = "" # Add padding variable to index
  
  if padding_removed:
    # Add padding feature back to data
    m, t, n = data.shape
    pad_slice = np.zeros(shape = (m, t, 1))
    data = np.concatenate([pad_slice, data], axis = 2)
  
  # Convert binary matrix to dense
  indices = np.argmax(data, axis = -1)
  
  # Convert indices to characters
  seqs = np.apply_along_axis(lambda row : [int_to_char[x] for x in row], axis = 1, arr = indices)
  # Concatenate characters
  seqs = [''.join(row) for row in seqs]
  
  return(seqs)

In [None]:
# Import Australian words, and see how the model does on them
gamilaraay = pd.read_excel('/content/gdrive/My Drive/waves_of_words/GamilaraayExport.xlsx')
gamilaraay = gamilaraay['OriginalForm'].tolist()
# Keep only string entries
gamilaraay = [word for word in gamilaraay if type(word) == str]
# Remove everything after the comma
comma = re.compile(r',.+')
gamilaraay = [comma.sub('', word) for word in gamilaraay]
print(gamilaraay[0:10])

# Predict
gam_data = process_texts(gamilaraay, tkzr)
gam_probs = get_seq_probs(gam_data, model)

['girran', '-bidi', 'yii-li', 'buluuy', 'yilaalu', 'galiya-y', 'yu-gi', 'garra-li', 'buruma', 'ngadaa']
Text data cleaned: there are 4879 texts in the corpus.
Data converted to binary representation. It has dimensions: (4879, 30, 34)


In [None]:
# Try with another Australian language
gunaikurnai = pd.read_excel('/content/gdrive/My Drive/waves_of_words/KurnaiExport.xlsx')
gunaikurnai = gunaikurnai['OriginalForm'].tolist()
# Keep only string entries
gunaikurnai = [word for word in gunaikurnai if type(word) == str]
# Remove everything after the comma
comma = re.compile(r',.+')
gunaikurnai = [comma.sub('', word) for word in gunaikurnai]
print(gunaikurnai[0:10])

# Predict
kur_data = process_texts(gunaikurnai, tkzr)
kur_probs = get_seq_probs(kur_data, model)

['jirrah', 'wadhan', 'baan', 'ngooran', 'miowera', 'wrang', 'jellangoong', 'booran', 'wokook', 'kooragan']
Text data cleaned: there are 3319 texts in the corpus.
Data converted to binary representation. It has dimensions: (3319, 30, 34)


In [None]:
test_data = np.concatenate([X_test, Y_test[:,[-1],:]], axis = 1)
eng_probs = get_seq_probs(test_data, model)

In [None]:
# NB: eng_probs is computed on the held-out test split, not the training split
print(f'The average probability of a held-out English word is {np.mean(eng_probs):.6f}')
print(f'The average probability of a Gamilaraay word is {np.mean(gam_probs):.6f}')
print(f'The average probability of a Gunaikurnai word is {np.mean(kur_probs):.6f}')

The average probability of a held-out English word is 0.000089
The average probability of a Gamilaraay word is 0.000257
The average probability of a Gunaikurnai word is 0.000008


*Sigh* Perhaps we can try normalising for length...
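One way to normalise for length, sketched here under assumptions, is to work with log-probabilities and divide by the number of scored timesteps, which gives a geometric mean probability per character. The function name `length_normalised_scores`, the `eps` floor and the toy arrays are hypothetical; the cells that follow take a different route and compare arithmetic means and per-length statistics.

```python
import numpy as np

def length_normalised_scores(pred, Y, eps=1e-12):
    """
    pred: (m, t, n) predicted distributions.
    Y:    (m, t, n) one-hot targets, all-zero at padded timesteps.
    Returns the total log-probability and the per-character geometric
    mean probability for each sequence.
    """
    # Probability assigned to the correct character at each timestep (0 at padding)
    correct = np.sum(pred * Y, axis=2)
    # Mask of real (non-padded) timesteps
    real = np.sum(Y, axis=2) > 0
    # Log-probabilities, with padded timesteps contributing 0
    log_p = np.where(real, np.log(np.maximum(correct, eps)), 0.0)
    total_log_prob = np.sum(log_p, axis=1)
    lengths = np.maximum(np.sum(real, axis=1), 1)  # avoid division by zero
    geo_mean = np.exp(total_log_prob / lengths)
    return total_log_prob, geo_mean

# Toy check: one sequence with two scored timesteps, probabilities 0.5 and 0.25
pred = np.array([[[0.9, 0.1], [0.5, 0.5], [0.25, 0.75]]])
Y = np.array([[[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]])
total, geo = length_normalised_scores(pred, Y)
print(total, geo)
```

Summing logs also sidesteps the underflow that a raw product of many small probabilities can hit in float32.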

In [None]:
gam_pred = model.predict(gam_data[:,:-1,:])
eng_pred = model.predict(X_test)

In [None]:
print(f'The mean probability in the prediction matrix for Gamilaraay is: {np.mean(gam_pred)}.')
print(f'The mean probability in the prediction matrix for English is: {np.mean(eng_pred)}.')
print(f'The mean probability in the masked matrix for Gamilaraay is: {np.mean(np.ma.masked_array(gam_pred, gam_data[:,1:,:] == 0))}.')
print(f'The mean probability in the masked matrix for English is: {np.mean(np.ma.masked_array(eng_pred, Y_test == 0))}.')

The mean probability in the prediction matrix for Gamilaraay is: 0.029411764815449715.
The mean probability in the prediction matrix for English is: 0.029411761090159416.
The mean probability in the masked matrix for Gamilaraay is: 0.2770350196576344.
The mean probability in the masked matrix for English is: 0.420858431607581.


In [None]:
# Let's see what effect the length of the sequence is having ...
eng_len = np.sum(test_data, axis = 2) # Sum over third dimension - now each sequence looks like [0,0,0,0, ... 0,1,1,1 ... 1]
eng_len = np.sum(eng_len, axis = 1) # Sum over second dimension to get lengths of each sequence

gam_len = np.sum(gam_data, axis = 2)
gam_len = np.sum(gam_len, axis = 1)

kur_len = np.sum(kur_data, axis = 2)
kur_len = np.sum(kur_len, axis = 1)

# Reconstruct the words
eng_words = reconstruct_sequences(test_data, tkzr)
gam_words = reconstruct_sequences(gam_data, tkzr)
kur_words = reconstruct_sequences(kur_data, tkzr)

# Put into data frame
eng_df = pd.DataFrame.from_dict({'eng_word':eng_words, 'seq_len':eng_len, 'eng_prob':eng_probs})
gam_df = pd.DataFrame.from_dict({'gam_word':gam_words, 'seq_len':gam_len, 'gam_prob':gam_probs})
kur_df = pd.DataFrame.from_dict({'kur_word':kur_words, 'seq_len':kur_len, 'kur_prob':kur_probs})

In [None]:
eng_agg_probs = eng_df.groupby('seq_len').mean()
gam_agg_probs = gam_df.groupby('seq_len').mean()
kur_agg_probs = kur_df.groupby('seq_len').mean()

eng_n = eng_df.groupby('seq_len').count().drop(columns = 'eng_prob')
gam_n = gam_df.groupby('seq_len').count().drop(columns = 'gam_prob')
kur_n = kur_df.groupby('seq_len').count().drop(columns = 'kur_prob')

comb_probs = eng_agg_probs.join([eng_n, gam_agg_probs, gam_n, kur_agg_probs, kur_n])
comb_probs['eng_gt_gam'] = comb_probs.eng_prob > comb_probs.gam_prob
comb_probs['eng_gr_kur'] = comb_probs.eng_prob > comb_probs.kur_prob

In [None]:
comb_probs

seq_len,eng_prob,eng_word,gam_prob,gam_word,kur_prob,kur_word,eng_gt_gam,eng_gr_kur
2.0,0.9989445,1,0.9989445,1,,,False,False
3.0,0.04217488,2,0.04099197,4,,,True,False
4.0,0.001842734,58,0.001880533,32,0.00226414,2.0,False,False
5.0,0.0001264541,401,0.0001628201,111,0.0001771153,71.0,False,False
6.0,2.97899e-05,852,2.228185e-05,384,3.381302e-05,233.0,True,False
7.0,1.342543e-05,1296,3.452653e-06,757,5.293638e-06,404.0,True,True
8.0,7.389273e-06,1837,6.047894e-07,856,8.653188e-07,525.0,True,True
9.0,4.096907e-06,2066,5.155606e-08,787,1.652746e-07,488.0,True,True
10.0,3.135183e-06,2024,5.867082e-09,571,2.359774e-08,395.0,True,True
11.0,2.505302e-06,1749,3.456087e-10,368,1.632114e-10,274.0,True,True


In [None]:
# There is no linear correlation between the probability of the sequence and the length, but how about
# a rank correlation?
print(eng_df.corr(method = 'spearman'))
print(gam_df.corr(method = 'spearman'))
print(kur_df.corr(method = 'spearman'))

          eng_prob  seq_len
eng_prob   1.00000 -0.58779
seq_len   -0.58779  1.00000
          gam_prob   seq_len
gam_prob  1.000000 -0.802803
seq_len  -0.802803  1.000000
          kur_prob   seq_len
kur_prob  1.000000 -0.775826
seq_len  -0.775826  1.000000


In [None]:
eng_stats = (eng_df.
             groupby('seq_len').std().fillna(0).rename(columns = {'eng_prob':'eng_std'}).
             join(
                 eng_df.groupby('seq_len').mean().rename(columns = {'eng_prob':'eng_mean'})
             ).
             reset_index())

In [None]:
# What if we try benchmarking using the probabilities of the training set?
train_data = np.concatenate([X_train, Y_train[:,[-1],:]], axis = 1)
train_probs = get_seq_probs(train_data, model)
train_len = np.sum(train_data, axis = 2)
train_len = np.sum(train_len, axis = 1)
train_df = pd.DataFrame({'train_prob':train_probs, 'seq_len':train_len})
eng_stats = (train_df.
             groupby('seq_len').std().fillna(0).rename(columns = {'train_prob':'train_std'}).
             join(
                 train_df.groupby('seq_len').mean().rename(columns = {'train_prob':'train_mean'})
             ).
             reset_index())

In [None]:
# Can we use the standard deviation to benchmark the Australian words?
kur_df_merged = kur_df.merge(eng_stats, how = 'left', on = 'seq_len')
gam_df_merged = gam_df.merge(eng_stats, how = 'left', on = 'seq_len')
eng_df_merged = eng_df.merge(eng_stats, how = 'left', on = 'seq_len')

In [None]:
# How many words fall more than s_factor standard deviations below the mean of same-length training words?
s_factor = 0.2
l_factor = 0
r_factor = 30

k = (kur_df_merged.train_mean - s_factor * kur_df_merged.train_std) > kur_df_merged.kur_prob
g = (gam_df_merged.train_mean - s_factor * gam_df_merged.train_std) > gam_df_merged.gam_prob
e = (eng_df_merged.train_mean - s_factor * eng_df_merged.train_std) > eng_df_merged.eng_prob

k_f = (l_factor < kur_df_merged.seq_len) & (kur_df_merged.seq_len < r_factor)
g_f = (l_factor < gam_df_merged.seq_len) & (gam_df_merged.seq_len < r_factor)
e_f = (l_factor < eng_df_merged.seq_len) & (eng_df_merged.seq_len < r_factor)

print(f'Only words with between {l_factor} and {r_factor} characters were considered.')
print(f'{(k & k_f).sum()/k_f.sum():.2f} of the Gunaikurnai words are {s_factor} std from the mean of same-length English words')
print(f'{(g & g_f).sum()/g_f.sum():.2f} of the Gamilaraay words have a probability {s_factor} std from the mean of same-length English words')
print(f'{(e & e_f).sum()/e_f.sum():.2f} of the test English words have a probability {s_factor} std from the mean of same-length English words')

Only words with between 0 and 30 characters were considered.
0.83 of the Gunaikurnai words are 0.2 std from the mean of same-length English words
0.87 of the Gamilaraay words have a probability 0.2 std from the mean of same-length English words
0.64 of the test English words have a probability 0.2 std from the mean of same-length English words


In [None]:
# What if we try comparing the means of the probabilities?
gam_correct_probs = np.sum(np.ma.masked_array(gam_pred, gam_data[:,1:,:] == 0), axis = 2)
gam_means = np.mean(gam_correct_probs, axis = 1)
gam_mean_df = pd.DataFrame({'seq_len':gam_len, 'mean_prob':gam_means})

eng_correct_probs = np.sum(np.ma.masked_array(eng_pred, Y_test == 0), axis = 2)
eng_means = np.mean(eng_correct_probs, axis = 1)
eng_mean_df = pd.DataFrame({'seq_len':eng_len, 'mean_prob':eng_means})

In [None]:
gm_quart = gam_mean_df.quantile([0.,0.25,0.5,0.75,1.])
gm_quart

quantile,mean_prob,seq_len
0.0,0.034828,2.0
0.25,0.259699,7.0
0.5,0.302355,9.0
0.75,0.342579,11.0
1.0,0.999472,30.0


In [None]:
em_quart = eng_mean_df.quantile([0.,0.25,0.5,0.75,1.])
em_quart

quantile,mean_prob,seq_len
0.0,0.133283,2.0
0.25,0.350209,8.0
0.5,0.398812,10.0
0.75,0.466562,12.0
1.0,0.999472,30.0


In [None]:
gp_quart = gam_df.quantile([0.,0.25,0.5,0.75,1.])
gp_quart

quantile,gam_prob,seq_len
0.0,0.0,2.0
0.25,2.060078e-14,7.0
0.5,6.177382e-10,9.0
0.75,2.199242e-07,11.0
1.0,0.9989445,30.0


In [None]:
ep_quart = eng_df.quantile([0.,0.25,0.5,0.75,1.])
ep_quart

quantile,eng_prob,seq_len
0.0,4.972087e-41,2.0
0.25,8.643084e-09,8.0
0.5,2.846334e-07,10.0
0.75,3.94436e-06,12.0
1.0,0.9989445,30.0


In [None]:
# What if we set the threshold at the third quartile for Gamilaraay?
# Select by label rather than position, so the result does not depend on column order
thresh = gm_quart.loc[0.75, 'mean_prob']

def predict(x):
  if x < thresh:
    return 1
  else:
    return 0

In [None]:
thresh

0.3425790203942193

In [None]:
y = [1 for x in range(len(gam_means))] + [0 for x in range(len(eng_means))]
y = np.array(y, dtype = 'int32')
y_hat = [predict(x) for x in gam_means] + [predict(x) for x in eng_means]
y_hat = np.array(y_hat, dtype = 'int32')

In [None]:
# Calculate precision and recall:
true_positive = ((y == 1) & (y_hat == 1)).sum()
false_positive = ((y == 0) & (y_hat == 1)).sum()
false_negative = ((y == 1) & (y_hat == 0)).sum()

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f'When the threshold is set at {thresh:.8f}, the model can discriminate English from Gamilaraay with precision {precision:.2f} and recall {recall:.2f}.')

When the threshold is set at 0.34257902, the model can discriminate English from Gamilaraay with precision 0.53 and recall 0.75.
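Because the threshold is the third quartile of the Gamilaraay scores themselves, a recall of about 0.75 follows by construction. Here is a sketch of how one might sweep several thresholds instead. The function `precision_recall_at` and the toy scores are hypothetical and are not results from this notebook.

```python
import numpy as np

def precision_recall_at(thresh, pos_scores, neg_scores):
    """Flag a word as foreign (positive) when its score is below thresh."""
    tp = np.sum(pos_scores < thresh)   # foreign words correctly flagged
    fn = np.sum(pos_scores >= thresh)  # foreign words missed
    fp = np.sum(neg_scores < thresh)   # English words wrongly flagged
    precision = tp / (tp + fp) if (tp + fp) > 0 else float('nan')
    recall = tp / (tp + fn) if (tp + fn) > 0 else float('nan')
    return precision, recall

# Made-up scores for illustration only
foreign = np.array([0.10, 0.20, 0.30, 0.40])
english = np.array([0.25, 0.45, 0.50, 0.60])

for thresh in (0.2, 0.35, 0.5):
    p, r = precision_recall_at(thresh, foreign, english)
    print(f'threshold {thresh:.2f}: precision {p:.2f}, recall {r:.2f}')
```

Sweeping the threshold in this way shows the trade-off between precision and recall, rather than fixing recall in advance.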
