# A very fast and loose first attempt with the Universal Sentence Encoder
#### Load packages

In [1]:
import pandas as pd
import numpy as np
import nltk
import tensorflow_hub as hub
from sklearn.cluster import MiniBatchKMeans

#### Load the pre-trained model from TensorFlow Hub

In [2]:
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

#### Load our data

Here I load the full corpus. Later cells work with only a subset of it.

In [3]:
data = pd.read_csv('./data/trackGBV_xls_all.csv')

In [4]:
data.head()

| | docid | contents |
|---|---|---|
| 0 | 73794 | Home \| Databases \| WorldLII \| Search \| Feedbac... |
| 1 | 238749 | Kanoanie v Peter Kum Kee & Sons [2007] KIHC... |
| 2 | 303537 | Bade v Regina [2014] SBCA 13; SICOA-CRAC 31... |
| 3 | 299802 | Ngirametuker v Oikull Village [2013] PWSC 1... |
| 4 | 240504 | Bank of Kiribati v Corbett [2004] KIHC 48; ... |


In [5]:
data.shape

(48983, 2)

#### Split the documents into sentences using a custom function based on `punkt`

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gregorytozzi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
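def split_sent(doc):
    """Split a document into sentences, one line at a time."""
    # Break on newlines first and drop empty lines, so that headings and
    # other stand-alone lines are not merged into a neighbouring sentence.
    lines = [line for line in doc.split('\n') if line != '']
    sentences = []
    for line in lines:
        # The punkt-based tokenizer handles the splitting within each line.
        sentences.extend(nltk.tokenize.sent_tokenize(line))
    return sentences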
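
To illustrate why `split_sent` breaks on newlines before tokenizing, here is a minimal sketch that runs without any NLTK data. `naive_sent_tokenize` is a hypothetical regex stand-in for `nltk.tokenize.sent_tokenize` (an assumption made only for this illustration; punkt copes with abbreviations and citations far better), and the sample text is invented.

```python
import re

def naive_sent_tokenize(line):
    # Hypothetical stand-in for nltk.tokenize.sent_tokenize:
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', line.strip()) if s]

def split_sent_sketch(doc, tokenize=naive_sent_tokenize):
    # Same shape as split_sent: newline split first, then per-line tokenizing.
    sentences = []
    for line in doc.split('\n'):
        if line != '':
            sentences.extend(tokenize(line))
    return sentences

# Invented sample: a heading without punctuation followed by two sentences.
sample = "REASONS FOR DECISION\n\nI now give my reasons. The appeal is dismissed."
result = split_sent_sketch(sample)
print(result)
```

Because the heading sits on its own line, it comes out as its own item instead of being glued to the sentence that follows it.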

#### Build embeddings from a subset of the documents

In [9]:
docs = list(data.contents)

In [10]:
sentences = []
for doc in docs[:1000]:
    for sent in split_sent(doc):
        sentences.append(sent)

In [11]:
len(sentences)

153596

In [12]:
embeddings = embed(sentences[:30000]).numpy()

In [13]:
embeddings.shape

(30000, 512)
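
The cell above embeds only the first 30,000 of the 153,596 sentences in a single call. One possible way to cover more of them is to embed in chunks and stack the results. The sketch below assumes the model can be called on a list of strings, as `embed` is above; `embed_in_batches`, `fake_embed`, and the batch size are hypothetical names and values, and `fake_embed` is only a stand-in so the example runs without the real model.

```python
import numpy as np

def embed_in_batches(embed_fn, sentences, batch_size=1000):
    # Call embed_fn on consecutive slices and stack the results row-wise.
    chunks = []
    for start in range(0, len(sentences), batch_size):
        out = embed_fn(sentences[start:start + batch_size])
        # TF Hub models return tensors with .numpy(); plain arrays pass through.
        chunks.append(out.numpy() if hasattr(out, "numpy") else np.asarray(out))
    return np.vstack(chunks)

def fake_embed(batch):
    # Hypothetical stand-in for the real model: one 4-dimensional row per sentence.
    return np.array([[len(s), 0.0, 0.0, 1.0] for s in batch], dtype=np.float32)

demo = embed_in_batches(fake_embed, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
print(demo.shape)
```

With the real model this would be called as `embed_in_batches(embed, sentences)`. Whether batching is actually needed depends on available memory, which this notebook does not measure.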

#### Cluster the embeddings

In [14]:
clust = MiniBatchKMeans(n_clusters=500, verbose=1)
clust.fit(embeddings)
y_hat = clust.predict(embeddings)

Init 1/3 with method: k-means++
Inertia for init 1/3: 59.294983
Init 2/3 with method: k-means++
Inertia for init 2/3: 63.396748
Init 3/3 with method: k-means++
Inertia for init 3/3: 62.964703
Minibatch iteration 1/30000: mean batch inertia: 0.885546, ewa inertia: 0.885546 
Minibatch iteration 2/30000: mean batch inertia: 0.807968, ewa inertia: 0.885029 
Minibatch iteration 3/30000: mean batch inertia: 0.720702, ewa inertia: 0.883933 
Minibatch iteration 4/30000: mean batch inertia: 0.752522, ewa inertia: 0.883057 
Minibatch iteration 5/30000: mean batch inertia: 0.760959, ewa inertia: 0.882243 
Minibatch iteration 6/30000: mean batch inertia: 0.693615, ewa inertia: 0.880986 
Minibatch iteration 7/30000: mean batch inertia: 0.732939, ewa inertia: 0.879999 
Minibatch iteration 8/30000: mean batch inertia: 0.757549, ewa inertia: 0.879183 
Minibatch iteration 9/30000: mean batch inertia: 0.710965, ewa inertia: 0.878061 
[MiniBatchKMeans] Reassigning 50 cluster centers.
Minibatch iteration 10/30000: mean batch inertia: 0.690426, ewa inertia: 0.876810 
Minibatch iteration 11/30000: mean batch inertia: 0.723001, ewa inertia: 0.875785 
...

Minibatch iteration 107/30000: mean batch inertia: 0.614800, ewa inertia: 0.774065 
Minibatch iteration 108/30000: mean batch inertia: 0.642040, ewa inertia: 0.773185 
Minibatch iteration 109/30000: mean batch inertia: 0.664158, ewa inertia: 0.772458 
[MiniBatchKMeans] Reassigning 50 cluster centers.
Minibatch iteration 110/30000: mean batch inertia: 0.651874, ewa inertia: 0.771654 
Minibatch iteration 111/30000: mean batch inertia: 0.671493, ewa inertia: 0.770987 
Minibatch iteration 112/30000: mean batch inertia: 0.617403, ewa inertia: 0.769963 
Minibatch iteration 113/30000: mean batch inertia: 0.700054, ewa inertia: 0.769497 
Minibatch iteration 114/30000: mean batch inertia: 0.647671, ewa inertia: 0.768684 
Minibatch iteration 115/30000: mean batch inertia: 0.648915, ewa inertia: 0.767886 
Minibatch iteration 116/30000: mean batch inertia: 0.647763, ewa inertia: 0.767085 
Minibatch iteration 117/30000: mean batch inertia: 0.664548, ewa inertia: 0.766402 
...

Minibatch iteration 207/30000: mean batch inertia: 0.630258, ewa inertia: 0.705872 
Minibatch iteration 208/30000: mean batch inertia: 0.665808, ewa inertia: 0.705605 
[MiniBatchKMeans] Reassigning 50 cluster centers.
Minibatch iteration 209/30000: mean batch inertia: 0.647986, ewa inertia: 0.705220 
Minibatch iteration 210/30000: mean batch inertia: 0.658534, ewa inertia: 0.704909 
Minibatch iteration 211/30000: mean batch inertia: 0.644956, ewa inertia: 0.704510 
Minibatch iteration 212/30000: mean batch inertia: 0.613345, ewa inertia: 0.703902 
Minibatch iteration 213/30000: mean batch inertia: 0.611648, ewa inertia: 0.703287 
Minibatch iteration 214/30000: mean batch inertia: 0.606957, ewa inertia: 0.702645 
Minibatch iteration 215/30000: mean batch inertia: 0.645227, ewa inertia: 0.702262 
Minibatch iteration 216/30000: mean batch inertia: 0.688816, ewa inertia: 0.702172 
Minibatch iteration 217/30000: mean batch inertia: 0.632561, ewa inertia: 0.701708 
...

Minibatch iteration 314/30000: mean batch inertia: 0.605536, ewa inertia: 0.663344 
Minibatch iteration 315/30000: mean batch inertia: 0.524008, ewa inertia: 0.662415 
Minibatch iteration 316/30000: mean batch inertia: 0.596677, ewa inertia: 0.661977 
Minibatch iteration 317/30000: mean batch inertia: 0.654299, ewa inertia: 0.661925 
Minibatch iteration 318/30000: mean batch inertia: 0.620665, ewa inertia: 0.661650 
Minibatch iteration 319/30000: mean batch inertia: 0.635932, ewa inertia: 0.661479 
Minibatch iteration 320/30000: mean batch inertia: 0.621105, ewa inertia: 0.661210 
Minibatch iteration 321/30000: mean batch inertia: 0.648196, ewa inertia: 0.661123 
Minibatch iteration 322/30000: mean batch inertia: 0.642345, ewa inertia: 0.660998 
Minibatch iteration 323/30000: mean batch inertia: 0.643415, ewa inertia: 0.660881 
[MiniBatchKMeans] Reassigning 50 cluster centers.
Minibatch iteration 324/30000: mean batch inertia: 0.639582, ewa inertia: 0.660739 
...

Minibatch iteration 419/30000: mean batch inertia: 0.622395, ewa inertia: 0.640487 
Minibatch iteration 420/30000: mean batch inertia: 0.622710, ewa inertia: 0.640368 
Minibatch iteration 421/30000: mean batch inertia: 0.557889, ewa inertia: 0.639819 
Minibatch iteration 422/30000: mean batch inertia: 0.604551, ewa inertia: 0.639583 
Minibatch iteration 423/30000: mean batch inertia: 0.588175, ewa inertia: 0.639241 
Minibatch iteration 424/30000: mean batch inertia: 0.595606, ewa inertia: 0.638950 
Minibatch iteration 425/30000: mean batch inertia: 0.589373, ewa inertia: 0.638619 
Minibatch iteration 426/30000: mean batch inertia: 0.559690, ewa inertia: 0.638093 
Minibatch iteration 427/30000: mean batch inertia: 0.545711, ewa inertia: 0.637477 
Minibatch iteration 428/30000: mean batch inertia: 0.615368, ewa inertia: 0.637330 
[MiniBatchKMeans] Reassigning 50 cluster centers.
Minibatch iteration 429/30000: mean batch inertia: 0.657224, ewa inertia: 0.637463 
...

Computing label assignment and total inertia
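
With 500 clusters over 30,000 sentences, a quick look at cluster sizes can help decide which cluster indices are worth reading. This is a minimal sketch using `np.bincount`; `cluster_sizes` is a hypothetical helper and the toy labels are invented stand-ins for `y_hat`.

```python
import numpy as np

def cluster_sizes(labels, n_clusters):
    # Count how many points were assigned to each cluster id.
    return np.bincount(labels, minlength=n_clusters)

# Toy labels standing in for y_hat (invented values).
toy_labels = np.array([0, 2, 2, 1, 2, 0])
sizes = cluster_sizes(toy_labels, n_clusters=4)

# Cluster ids ordered from largest to smallest.
largest_first = np.argsort(sizes)[::-1]
print(sizes, largest_first[:2])
```

On the real run this would be `cluster_sizes(y_hat, 500)`. Clusters with zero or very few members may point to `n_clusters` being too high for this subset.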


#### Examine some results

In [23]:
index = 8
sentence_index = np.where(y_hat==index)[0]
for i in sentence_index:
    print('-----------------------------------------------------------------')
    print(sentences[i])
    print('\n')

-----------------------------------------------------------------
Counsel for Appellant: Pro Se


-----------------------------------------------------------------
Counsel for Appellee: Oldiais Ngirakelau


-----------------------------------------------------------------
Counsel for the 1st Respondent: E. Veretawatini: Veretawatini Esq.


-----------------------------------------------------------------
Counsel: Appellants in Person


-----------------------------------------------------------------
Counsel: Appellants in Person


-----------------------------------------------------------------
Counsel: Appellant in person


-----------------------------------------------------------------
Counsel: S. Maharaj for the Appellant


-----------------------------------------------------------------
Counsel: Appellant in Person


-----------------------------------------------------------------
Counsel for Appellant No.


-----------------------------------------------------------------
Co

In [37]:
index = 33
sentence_index = np.where(y_hat==index)[0]
for i in sentence_index:
    print('-----------------------------------------------------------------')
    print(sentences[i])
    print('\n')

-----------------------------------------------------------------
For the following reasons, the decision of the Land Court is AFFIRMED.


-----------------------------------------------------------------
He gave two principal reasons for his decision:


-----------------------------------------------------------------
The learned trial Magistrate gave the following principal reasons for his decision:-


-----------------------------------------------------------------
The court's reasons follow.


-----------------------------------------------------------------
VERDICT AND REASONS


-----------------------------------------------------------------
[4] These are my reasons.


-----------------------------------------------------------------
I now give my reasons.


-----------------------------------------------------------------
REASONS FOR JUDGMENT OF THE COURT


-----------------------------------------------------------------
 REASONS FOR DECISION


-------------------------------

In [38]:
index = 34
sentence_index = np.where(y_hat==index)[0]
for i in sentence_index:
    print('-----------------------------------------------------------------')
    print(sentences[i])
    print('\n')

-----------------------------------------------------------------
The courts are not intended by s. 10(1) to be independent of the law but independent within it.


-----------------------------------------------------------------
accepted self-serving testimonies and irrelevant documents presented by Appellee."


-----------------------------------------------------------------
One customary law was proven while the other was not.


-----------------------------------------------------------------
The Land Court's decision was not clearly erroneous and must be AFFIRMED.


-----------------------------------------------------------------
It is not capable of raising the material facts."


-----------------------------------------------------------------
Whether or not termination lawful


-----------------------------------------------------------------
In my view, the assessors' decision was not perverse.


-----------------------------------------------------------------
However, on c

In [43]:
index = 54
sentence_index = np.where(y_hat==index)[0]
for i in sentence_index:
    print('-----------------------------------------------------------------')
    print(sentences[i])
    print('\n')

-----------------------------------------------------------------
The competing affidavit material demonstrates the existence of strongly contested issues of fact.


-----------------------------------------------------------------
I am of the view that the overall balance of convenience requires that a stay be granted.


-----------------------------------------------------------------
We review the Land Court's factual findings for clear error and its conclusions of law de novo.


-----------------------------------------------------------------
is warranted only if the findings so lack evidentiary support in the record that no reasonable trier of fact could have reached the same conclusion."


-----------------------------------------------------------------
[5] As a general matter, "[t]he Tochi Daicho is presumed to be accurate, and a party seeking to rebut it must present clear and convincing evidence."


-----------------------------------------------------------------
The expres
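
The four inspection cells above repeat the same loop and print every member of a cluster in corpus order. A possible refinement is to wrap the lookup in a function and list the members closest to the cluster center first, since those tend to be the most typical examples. `cluster_members` and the toy arrays are hypothetical; in the notebook the inputs would be `y_hat`, `embeddings`, and `clust.cluster_centers_`.

```python
import numpy as np

def cluster_members(labels, vectors, centers, index, limit=10):
    # Positions of the rows assigned to cluster `index`.
    members = np.where(labels == index)[0]
    # Euclidean distance from each member to that cluster's center.
    dists = np.linalg.norm(vectors[members] - centers[index], axis=1)
    # Closest members first, at most `limit` of them.
    return members[np.argsort(dists)][:limit]

# Toy data standing in for embeddings, y_hat and clust.cluster_centers_.
toy_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
toy_labels = np.array([0, 0, 0, 1])
toy_centers = np.array([[0.3, 0.0], [5.0, 5.0]])

order = cluster_members(toy_labels, toy_vectors, toy_centers, index=0)
print(order)
```

In the notebook, `for i in cluster_members(y_hat, embeddings, clust.cluster_centers_, 8): print(sentences[i])` would print the ten sentences nearest the center of cluster 8.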