# Selecting samples of 10k articles from CC_News with 5 diversity metrics:
1. [Setup / Removing non-English articles from the dataset](#setup)
2. [Random](#random)
3. [Mean TFIDF](#meanidf)
4. [Jaccard distance](#jaccard) -- ERROR: the Jaccard computation was not successfully scaled to the full CC_News dataset
5. [Trigram language model entropy](#3entropy)
6. [Measures of embedding diversity:](#embedding)
    1. [Outlier](#outlier)
    2. [Word Embedding Diversity](#wed)
<br>
*All metrics were computed, and models subsequently trained, on CPU using a MacBook with an M1 Pro chip.

In [1]:
__author__ = "Jon Ball"
__version__ = "CS224U, Stanford, Spring 2022"

Setup modules:

In [2]:
from datasets import load_dataset
from tqdm import tqdm
import numpy as np
import utils
import json
import time
import os

For reproducibility:

In [3]:
utils.fix_random_seeds()

Data:

In [4]:
ccn = load_dataset("cc_news", split="train")["text"]

Reusing dataset cc_news (/Users/jball/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/ae469e556251e6e7e20a789f93803c7de19d0c4311b6854ab072fecb4e401bd6)


In [5]:
print(f"Number of articles: {len(ccn)}\n")
print(ccn[0])

Number of articles: 708241

There's a surprising twist to Regina Willoughby's last season with Columbia City Ballet: It's also her 18-year-old daughter Melina's first season with the company. Regina, 40, will retire from the stage in March, just as her daughter starts her own career as a trainee. But for this one season, they're sharing the stage together.
Performing Side-By-Side In The Nutcracker
Regina and Melina are not only dancing in the same Nutcracker this month, they're onstage at the same time: Regina is doing Snow Queen, while Melina is in the snow corps, and they're both in the Arabian divertissement. "It's very surreal to be dancing it together," says Regina. "I don't know that I ever thought Melina would take ballet this far."
Left: Regina and Melina with another company member post-snow scene in 2003. Right: The pair post-snow scene in 2017 (in the same theater)
Keep reading at dancemagazine.com.


### Check that the CC_News articles are actually in English using <a href="https://github.com/google/cld3">gcld3</a>: <a id="setup"></a>

In [6]:
import gcld3

In [7]:
classifier = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=100000)

In [8]:
for idx, article in enumerate(ccn):
    result = classifier.FindLanguage(text=article)
    if result.language != "en":
        print(idx, "\n", article)
        break

89 
 ০ = 0, ১ = 1, ২ = 2, ৩ = 3, ৪ = 4, ৫ = 5, ৬ = 6, ৭ = 7, ৮ = 8, ৯ = 9
্ = See example (Hasant/Viram) ় = * (Nukta) ʼ = ' (Urdhacomma) ঽ = & (Avagrah) ৺ = ~ (Isshar) ৹ = a~ (Bengali ana sign) ৲ = Rs~ (Bengali Rupee sign) ৳ = T~ (Taka sign) । = | (Devanagari danda) ॥ = || (Devanagari double danda) ₹ = Rs (Indian Rupee sign) 卐 = +~ (Swastika sign) Zero Width Joiner = ^ Zero Width Non Joiner = ^^
These symbols will type Bengali characters first but if "~" will be followed, it will remove previously typed Bengali character and then type the symbol.
Symbols & ~ * : ^ | ' have special meaning. You can type this way & = &~ ~ = ~~ * = *~ : = :~ ^ = ^~ | = |~ ' = '~
The English symbols [ ] { } ( ) < > - + / = ; . , " ? ! % \ _ $ @ # translate into the same symbols.
Example নমস্কার can be written by typing "namaskaar"
As per Rule # 3, ligature will be rendered. ZWJ and ZWNJ characters are used to produce alternate rendering of ligature.
A consonant followed by ZWJ character will produce half-

#### Use gcld3 to filter out articles like the one above, which are not fully English and would otherwise likely be selected in each sample:

In [9]:
disinclusion_index = []
for idx, article in enumerate(tqdm(ccn)):
    result = classifier.FindLanguage(text=article)
    if result.language != "en":
        disinclusion_index.append(idx)

100%|█████████████████████████████████| 708241/708241 [06:19<00:00, 1867.32it/s]


In [10]:
%%time

cc_news = [
    article for idx, article in enumerate(ccn) if idx not in disinclusion_index
]

CPU times: user 21 s, sys: 33.1 ms, total: 21.1 s
Wall time: 21.1 s
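
The filter above tests each index against a Python list, which scans the list on every lookup. A minimal sketch of the same filter with a set-based lookup is shown below. The helper name `drop_indices` and the toy data are hypothetical and not part of the notebook.

```python
def drop_indices(items, excluded):
    """Return items whose position is not listed in `excluded`."""
    excluded = set(excluded)  # constant-time membership checks instead of a list scan
    return [item for idx, item in enumerate(items) if idx not in excluded]


# Toy stand-in for the article list and the indices flagged as non-English
articles = ["a", "b", "c", "d", "e"]
kept = drop_indices(articles, [1, 3])
print(kept)
```

With roughly 4,700 flagged indices the list version is still workable, but the set version should scale better when the exclusion list is large.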


In [11]:
print(f"Number of clean articles: {len(cc_news)}")

Number of clean articles: 703532


In [12]:
with open(os.path.join("data", "cc_news.json"), "w") as outfile:
    json.dump(cc_news, outfile)

## Mean-TFIDF (variation on Baeza-Yates et al. 1999) <a id="meanidf"></a>

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from operator import itemgetter
import numpy as np
import json
import time
import os

In [2]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [3]:
tfidf_vec = TfidfVectorizer()

In [4]:
%time tfidf_matrix = tfidf_vec.fit_transform(cc_news)

CPU times: user 1min 36s, sys: 2.57 s, total: 1min 39s
Wall time: 1min 39s


In [5]:
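def mean_idf(doc_idx, tfidf_matrix=tfidf_matrix, tfidf_vectorizer=tfidf_vec):
    # Column indices of the terms that occur in this document
    feature_index = tfidf_matrix[doc_idx,:].nonzero()[1]
    # IDF weight of each of those terms (one value per unique term)
    word_idf_scores = tfidf_vectorizer.idf_[feature_index]
    return np.mean(word_idf_scores)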

In [6]:
%time cc_idf = [mean_idf(doc_idx) for doc_idx in range(len(cc_news))]

CPU times: user 19min 37s, sys: 5min 57s, total: 25min 34s
Wall time: 25min 35s
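
To make the score concrete, here is a small pure-Python sketch of what `mean_idf` measures: the average IDF over the unique terms in a document. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and its default lowercasing and two-or-more-character word pattern. The helper names and toy documents are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def unique_terms(doc):
    """Lowercase the document and return its set of distinct terms."""
    return set(TOKEN.findall(doc.lower()))


def smoothed_idf(docs):
    """IDF per term: ln((1 + n_docs) / (1 + document_frequency)) + 1."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in unique_terms(doc))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def mean_idf_score(doc, idf):
    """Average IDF over the document's unique terms (assumes a non-empty document)."""
    terms = unique_terms(doc)
    return sum(idf[term] for term in terms) / len(terms)


toy_docs = ["the cat sat", "the dog sat", "the zebra yodels"]
idf = smoothed_idf(toy_docs)
scores = [mean_idf_score(doc, idf) for doc in toy_docs]
print(scores)
```

The third toy document contains two terms that appear nowhere else, so it receives the highest score; documents made of rare vocabulary rank highest under this metric.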


In [7]:
print(f"""Max IDF: {max(cc_idf)}
Min IDF: {min(cc_idf)}
Mean: {np.mean(cc_idf)}
Std Dev: {np.std(cc_idf)}""")

Max IDF: 12.365534427447304
Min IDF: 2.2256180368294363
Mean: 4.42180964113847
Std Dev: 0.5235009566338252


In [8]:
idf_indexed = list(enumerate(cc_idf))
idf_indexed10k = sorted(idf_indexed, key=itemgetter(1))[-10000:] #indices and top 10,000 Mean IDF scores
idf_indices10k = [tup[0] for tup in idf_indexed10k]
idf_scores10k = [tup[1] for tup in idf_indexed10k]

In [9]:
print(f"""Max IDF in subsample: {max(idf_scores10k)}
Min IDF in subsample: {min(idf_scores10k)}
Mean: {np.mean(idf_scores10k)}
Std Dev: {np.std(idf_scores10k)}""")

Max IDF in subsample: 12.365534427447304
Min IDF in subsample: 5.776707236449172
Mean: 6.319554792407199
Std Dev: 0.5852270273612017


In [10]:
%%time

cc_idf10k = [
    article for idx, article in enumerate(cc_news) if idx in idf_indices10k
]

CPU times: user 57 s, sys: 544 ms, total: 57.5 s
Wall time: 57.5 s
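
The top-10,000 selection and the lookup of the matching articles can also be sketched with `heapq.nlargest` and direct indexing, which avoids sorting the full score list and scanning the corpus. The function names and toy data below are hypothetical, and ties at the cutoff may be broken differently than in the sort-based cells above.

```python
import heapq


def top_k_indices(scores, k):
    """Indices of the k largest scores, highest score first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def select_by_index(items, indices):
    """Pick items by position, returned in their original corpus order."""
    return [items[idx] for idx in sorted(indices)]


toy_scores = [0.2, 0.9, 0.5, 0.7]
toy_articles = ["a", "b", "c", "d"]
top = top_k_indices(toy_scores, 2)
picked = select_by_index(toy_articles, top)
print(top, picked)
```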


In [11]:
with open(os.path.join("data", "cc_idf10k.json"), "w") as outfile:
    json.dump(cc_idf10k, outfile)

## Jaccard distance <a id="jaccard"></a>

In [1]:
from tokenizers import BertWordPieceTokenizer
from operator import itemgetter
from tqdm import tqdm

import multiprocess as mp
import numpy as np
import time
import json
import os

In [2]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [3]:
tokenizer = BertWordPieceTokenizer(os.path.join("bert-base-uncased-vocab", "vocab.txt"))

In [4]:
cc_ids = tuple(
    [
        (article_idx, set(tokenizer.encode(cc_news[article_idx]).ids[:512]))\
        for article_idx in tqdm(range(len(cc_news)))
    ]
)

100%|█████████████████████████████████| 703532/703532 [09:11<00:00, 1275.09it/s]


In [5]:
del cc_news, tokenizer, BertWordPieceTokenizer

In [6]:
def mean_jaccard_dist(article_idx, token_set):
    
    jdistances = [
        1.0 - len(token_set.intersection(ref_tuple[1])) / len(token_set.union(ref_tuple[1])) \
        for ref_tuple in cc_ids if article_idx != ref_tuple[0]
    ]
    
    return (article_idx, np.mean(jdistances))

In [7]:
%%time

if __name__ == "__main__":
    
    with mp.Pool(8) as pool:
        cc_jaccard = pool.starmap(mean_jaccard_dist, cc_ids)

KeyboardInterrupt: 
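
As written, `mean_jaccard_dist` compares every article with every other article, which for 703,532 articles is roughly 4.9 × 10^11 set comparisons. One possible workaround, which is not what the notebook does, is to estimate each article's mean distance against a fixed random sample of reference articles. The sketch below illustrates that idea on toy token sets; the function names, the sample size, and the seed are assumptions.

```python
import random


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; defined as 0.0 when both are empty."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def approx_mean_jaccard(token_sets, n_refs, seed=0):
    """Estimate each set's mean Jaccard distance against a random reference sample."""
    rng = random.Random(seed)
    ref_ids = rng.sample(range(len(token_sets)), n_refs)
    estimates = []
    for idx, tokens in enumerate(token_sets):
        # Skip the self-comparison, as the notebook's function does
        dists = [jaccard_distance(tokens, token_sets[r]) for r in ref_ids if r != idx]
        estimates.append(sum(dists) / len(dists) if dists else 0.0)
    return estimates


toy_sets = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 2, 3}]
d01 = jaccard_distance(toy_sets[0], toy_sets[1])
# With n_refs equal to the corpus size the estimate equals the exact mean
exact = approx_mean_jaccard(toy_sets, n_refs=len(toy_sets))
print(d01, exact)
```

With `n_refs` references the cost drops from quadratic to `len(token_sets) * n_refs` comparisons, at the price of sampling noise in each estimate.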

## Trigram language model entropy <a id="3entropy"></a>

In [1]:
from tokenizers import BertWordPieceTokenizer
from nltk.lm.preprocessing import pad_both_ends, flatten
from nltk.util import ngrams
from nltk.lm import MLE
from operator import itemgetter
from tqdm import tqdm
import numpy as np
import time
import json
import os

In [2]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [3]:
tokenizer = BertWordPieceTokenizer(os.path.join("bert-base-uncased-vocab", "vocab.txt"))

In [4]:
cc_toks = [
    tokenizer.encode(cc_news[article_idx]).tokens[:512] for article_idx in tqdm(range(len(cc_news)))
]

100%|█████████████████████████████████| 703532/703532 [10:15<00:00, 1143.80it/s]


In [5]:
%%time

cc_padded = [
    list(pad_both_ends(article_toks, n=3)) for article_toks in cc_toks
]

CPU times: user 10.3 s, sys: 57.3 s, total: 1min 7s
Wall time: 1min 45s


In [6]:
cc_vocab = list(flatten(cc_padded))
print(len(cc_vocab))

239677179


In [7]:
%%time

cc_grams = [
    list(ngrams(article_padded, n=3)) for article_padded in cc_padded
]

CPU times: user 26.6 s, sys: 2min 19s, total: 2min 46s
Wall time: 5min 59s


In [8]:
trigram_model = MLE(3)

In [9]:
%time trigram_model.fit(cc_grams, cc_vocab)

CPU times: user 16min 42s, sys: 19min, total: 35min 43s
Wall time: 1h 32s


In [10]:
cc_entropy = [
    trigram_model.entropy(cc_grams[article_idx]) for article_idx in tqdm(range(len(cc_grams)))
]

100%|██████████████████████████████████| 703532/703532 [56:06<00:00, 208.96it/s]
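
For intuition about the score computed above, the following pure-Python sketch fits maximum-likelihood trigram counts and returns the average negative log2 probability of a document's trigrams, which is, to my understanding, what NLTK's `entropy` reports for an `MLE` model. It only handles documents that were part of the fitted corpus, as in the notebook; unseen trigrams would raise an error. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def trigrams(tokens):
    """Pad with two start and two end symbols, then slide a window of three."""
    padded = ["<s>"] * 2 + list(tokens) + ["</s>"] * 2
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def fit_counts(docs):
    """Count every trigram and every two-token context in the corpus."""
    tri, ctx = Counter(), Counter()
    for doc in docs:
        for gram in trigrams(doc):
            tri[gram] += 1
            ctx[gram[:2]] += 1
    return tri, ctx


def doc_entropy(doc, tri, ctx):
    """Mean negative log2 of count(w1, w2, w3) / count(w1, w2) over the document."""
    log_probs = [math.log2(tri[g] / ctx[g[:2]]) for g in trigrams(doc)]
    return -sum(log_probs) / len(log_probs)


toy_docs = [["a"], ["b"]]
tri, ctx = fit_counts(toy_docs)
entropies = [doc_entropy(doc, tri, ctx) for doc in toy_docs]
print(entropies)
```

In the toy corpus only the first trigram of each document is uncertain (probability 1/2 after the start context), so each document scores one bit spread over three trigrams.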


In [11]:
print(f"""Max entropy: {max(cc_entropy)}
Min entropy: {min(cc_entropy)}
Mean: {np.mean(cc_entropy)}
Std Dev: {np.std(cc_entropy)}""")

Max entropy: 6.441736870435621
Min entropy: 0.6607880940607269
Mean: 4.531580789229363
Std Dev: 0.9035417963006608


In [12]:
entropy_indexed = list(enumerate(cc_entropy))
entropy_indexed10k = sorted(entropy_indexed, key=itemgetter(1))[-10000:]
entropy_indices10k = [tup[0] for tup in entropy_indexed10k]
entropy_scores10k = [tup[1] for tup in entropy_indexed10k]

In [13]:
print(f"""Max entropy in subsample: {max(entropy_scores10k)}
Min entropy in subsample: {min(entropy_scores10k)}
Mean: {np.mean(entropy_scores10k)}
Std Dev: {np.std(entropy_scores10k)}""")

Max entropy in subsample: 6.441736870435621
Min entropy in subsample: 5.589578462962402
Mean: 5.680277063155953
Std Dev: 0.08761192165393417


In [14]:
%%time

cc_entropy10k = [
    article for idx, article in enumerate(cc_news) if idx in entropy_indices10k
]

CPU times: user 57.7 s, sys: 2.08 s, total: 59.8 s
Wall time: 1min 4s


In [15]:
with open(os.path.join("data", "cc_entropy10k.json"), "w") as outfile:
    json.dump(cc_entropy10k, outfile)

In [16]:
with open(os.path.join("data", "cc_entropy10k.txt"), "w") as outfile:
    for article in cc_entropy10k:
        # Newline-terminate each article so consecutive articles do not run together
        outfile.write(article + "\n")

## Get embeddings <a id="embedding"></a>

In [1]:
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from tqdm import tqdm
import numpy as np
import json
import time
import os

In [2]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [4]:
%%time
#using BertTokenizer rather than BertTokenizerFast to avoid forking parallelism issue with for-loop below
tokenized_inputs = [
    tokenizer(cc_news[doc_idx], truncation=True, max_length=512, padding="max_length", return_tensors="pt")\
    for doc_idx in range(len(cc_news))
]

CPU times: user 50min 53s, sys: 15 s, total: 51min 8s
Wall time: 51min 9s


In [5]:
del cc_news

First pass (interrupted due to power outage):

In [6]:
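def get_mean_pooled_array(article_idx, tokenized_inputs=tokenized_inputs):
    
    input_ = tokenized_inputs[article_idx]
    
    output_ = model(**input_, output_hidden_states=True)
    
    # Average the final-layer hidden states over all 512 token positions
    # (padding positions included), giving a (1, 768) array per article
    out_tensor = torch.mean(output_.last_hidden_state, axis=1)
    
    return out_tensor.detach().numpy()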

In [7]:
%%time

for article_idx in tqdm(range(len(tokenized_inputs))):
    
    arr = get_mean_pooled_array(article_idx)
    filename = "cc" + str(article_idx) + ".npy"
    np.save(os.path.join("data", "cc_arrays", filename), arr)

 50%|██████████████▌              | 352567/703532 [76:22:02<76:01:12,  1.28it/s]


KeyboardInterrupt: 

Second pass:

In [8]:
def get_mean_pooled_array(article_toks, model=model):
    
    output_ = model(**article_toks, output_hidden_states=True)
    
    out_tensor = torch.mean(output_.last_hidden_state, axis=1)
    
    return out_tensor.detach().numpy()

In [9]:
%%time

for new_idx, article_toks in tqdm(enumerate(tokenized_inputs[352567:])):
    
    arr = get_mean_pooled_array(article_toks)
    filename = "cc" + str(new_idx + 352567) + ".npy"
    np.save(os.path.join("data", "cc_arrays", filename), arr)

350965it [85:28:24,  1.14it/s]

CPU times: user 26d 4min 32s, sys: 4d 6h 18min 12s, total: 30d 6h 22min 44s
Wall time: 3d 13h 28min 24s
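
Because the inputs are padded to 512 tokens, the pooling functions above average over padding positions as well as real tokens. A common alternative, which is not what this notebook uses, is to weight the average by the attention mask. The NumPy sketch below shows the difference on toy arrays; `masked_mean_pool` and the data are hypothetical.

```python
import numpy as np


def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over real (non-padding) positions only.

    hidden_states: float array of shape (batch, seq_len, dim)
    attention_mask: 0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against all-padding rows
    return summed / counts


# One sequence: two real tokens followed by two padding positions
hidden = np.array([[[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)  # ignores the padding vectors
plain = hidden.mean(axis=1)              # averages over all four positions
print(pooled, plain)
```

Switching pooling strategies would change every downstream distance, so the two variants should not be mixed within one set of saved arrays.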





## Euclidean distance from mean corpus vector (Stasaski et al., 2020; Larson et al., 2019) <a id="outlier"></a>

In [1]:
from scipy.spatial.distance import euclidean
from operator import itemgetter
from tqdm import tqdm
import numpy as np
import json
import time
import os

In [2]:
arr_path = os.path.join("data", "cc_arrays")

In [3]:
for roots, dirs, files in os.walk(arr_path):
    
    print(len(files))

    indices = [
        int(filename[2:-4]) for filename in files
    ]
    
    filepaths = [
        os.path.join(arr_path, filename) for filename in files
    ]
    
    idx2path = sorted(
        list(zip(indices, filepaths)),
        key = itemgetter(0)
    )
    
    print(idx2path[0])

703532
(0, 'data/cc_arrays/cc0.npy')


In [4]:
cc_array = np.zeros((len(idx2path), 768))

for idx, filepath in tqdm(idx2path):
    
    arr = np.load(filepath)
    cc_array[idx] = arr
        
print(cc_array.shape)

100%|█████████████████████████████████| 703532/703532 [01:44<00:00, 6734.86it/s]

(703532, 768)





In [5]:
mean_arr = np.mean(cc_array, axis=0)

In [6]:
cc_eudist = []

for idx in tqdm(range(cc_array.shape[0])):
    eudist = euclidean(cc_array[idx], mean_arr)
    cc_eudist.append(eudist)

100%|███████████████████████████████| 703532/703532 [00:02<00:00, 288114.68it/s]
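
The per-row loop above can also be written as a single vectorized NumPy expression. This is a minimal sketch on a toy array; `distance_from_mean` is a hypothetical name. Note that the subtraction allocates a temporary array the size of the input, which for a 703,532 × 768 float64 matrix is about 4.3 GB.

```python
import numpy as np


def distance_from_mean(embeddings):
    """Euclidean distance of each row from the mean row."""
    centroid = embeddings.mean(axis=0)
    return np.linalg.norm(embeddings - centroid, axis=1)


toy = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
dists = distance_from_mean(toy)  # centroid is (1, 1)
print(dists)
```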


In [7]:
print(f"""Max euclidean distance from corpus mean: {max(cc_eudist)}
Min: {min(cc_eudist)}
Mean: {np.mean(cc_eudist)}
Std Dev: {np.std(cc_eudist)}""")

Max euclidean distance from corpus mean: 13.397324553882964
Min: 1.960511518807967
Mean: 4.003239536662265
Std Dev: 0.8837525446080264


In [8]:
eudist_indexed = list(enumerate(cc_eudist))
eudist_indexed10k = sorted(eudist_indexed, key=itemgetter(1))[-10000:]
eudist_indices10k = [tup[0] for tup in eudist_indexed10k]
eudist_scores10k = [tup[1] for tup in eudist_indexed10k]

In [9]:
print(f"""Max euclidean distance in subsample: {max(eudist_scores10k)}
Min: {min(eudist_scores10k)}
Mean: {np.mean(eudist_scores10k)}
Std Dev: {np.std(eudist_scores10k)}""")

Max euclidean distance in subsample: 13.397324553882964
Min: 6.509962991455852
Mean: 6.921735291458649
Std Dev: 0.5148770026302417


In [10]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [11]:
%%time

cc_eudist10k = [
    article for idx, article in enumerate(cc_news) if idx in eudist_indices10k
]

CPU times: user 48.1 s, sys: 902 ms, total: 49 s
Wall time: 49 s


In [12]:
with open(os.path.join("data", "cc_eudist10k.json"), "w") as outfile:
    json.dump(cc_eudist10k, outfile)

In [13]:
with open(os.path.join("data", "cc_eudist10k.txt"), "w") as outfile:
    json.dump(cc_eudist10k, outfile)

## Word Embedding Diversity / Mean pairwise cosine distance (Palumbo et al. 2020) <a id="wed"></a>

In [1]:
from sklearn.metrics import pairwise_distances_chunked
from operator import itemgetter
from itertools import chain
from tqdm import tqdm
import numpy as np
import json
import time
import os

In [2]:
arr_path = os.path.join("data", "cc_arrays")

In [3]:
for roots, dirs, files in os.walk(arr_path):
    
    print(len(files))

    indices = [
        int(filename[2:-4]) for filename in files
    ]
    
    filepaths = [
        os.path.join(arr_path, filename) for filename in files
    ]
    
    idx2path = sorted(
        list(zip(indices, filepaths)),
        key = itemgetter(0)
    )
    
    print(idx2path[-1])

703532
(703531, 'data/cc_arrays/cc703531.npy')


In [4]:
cc_array = np.zeros((len(idx2path), 768))

for idx, filepath in tqdm(idx2path):
    
    arr = np.load(filepath)
    cc_array[idx] = arr

100%|█████████████████████████████████| 703532/703532 [01:41<00:00, 6928.59it/s]
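
The mean cosine distance from one row to all rows does not require the full pairwise matrix: after normalizing each row to unit length, the mean over j of `1 - u_i · u_j` equals `1 - u_i · mean(u)`. The sketch below checks this identity against a brute-force computation on random toy data. It is an alternative formulation rather than the notebook's method, and the function name is hypothetical. Like a row mean over the full distance matrix, it includes each row's zero distance to itself.

```python
import numpy as np


def mean_cosine_distance(embeddings):
    """Mean cosine distance from each row to every row (itself included)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms       # assumes no all-zero rows
    mean_unit = unit.mean(axis=0)   # average of the unit vectors
    return 1.0 - unit @ mean_unit


rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8))
fast = mean_cosine_distance(toy)

# Brute-force check: full pairwise cosine distance matrix, then row means
unit = toy / np.linalg.norm(toy, axis=1, keepdims=True)
brute = (1.0 - unit @ unit.T).mean(axis=1)
print(np.allclose(fast, brute))
```

This needs one pass over the array plus a matrix-vector product, so its cost grows linearly with the number of articles instead of quadratically.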


In [5]:
%%time

cc_wed_chunked = []
it = 0
tock = time.time()
m_start = 0
m_end = 0

for chunk in pairwise_distances_chunked(cc_array, metric="cosine"):
    
    cc_wed_chunked.append(np.mean(chunk, axis=1).tolist())
    
    it += 1
    tick = time.time()
    m_start = m_end
    m_end += chunk.shape[0]
    
    print(f"Iter {it}  :  Rows {m_start}-{m_end}  :  {tick - tock} seconds elapsed")
    tock = time.time()

Iter 1  :  Rows 0-190  :  2.7233941555023193 seconds elapsed
Iter 2  :  Rows 190-380  :  3.185492992401123 seconds elapsed
Iter 3  :  Rows 380-570  :  2.486224889755249 seconds elapsed
Iter 4  :  Rows 570-760  :  2.200418710708618 seconds elapsed
Iter 5  :  Rows 760-950  :  2.2093570232391357 seconds elapsed
Iter 6  :  Rows 950-1140  :  2.2308738231658936 seconds elapsed
Iter 7  :  Rows 1140-1330  :  2.2043309211730957 seconds elapsed
Iter 8  :  Rows 1330-1520  :  2.2553868293762207 seconds elapsed
Iter 9  :  Rows 1520-1710  :  2.2104079723358154 seconds elapsed
Iter 10  :  Rows 1710-1900  :  2.191317081451416 seconds elapsed
Iter 11  :  Rows 1900-2090  :  2.2452123165130615 seconds elapsed
Iter 12  :  Rows 2090-2280  :  2.214040994644165 seconds elapsed
Iter 13  :  Rows 2280-2470  :  2.2247989177703857 seconds elapsed
Iter 14  :  Rows 2470-2660  :  2.2463490962982178 seconds elapsed
Iter 15  :  Rows 2660-2850  :  2.229434013366699 seconds elapsed
Iter 16  :  Rows 2850-3040  :  2.22113

Iter 124  :  Rows 23370-23560  :  2.2567250728607178 seconds elapsed
Iter 125  :  Rows 23560-23750  :  2.230362892150879 seconds elapsed
Iter 126  :  Rows 23750-23940  :  2.227579116821289 seconds elapsed
Iter 127  :  Rows 23940-24130  :  2.211789846420288 seconds elapsed
Iter 128  :  Rows 24130-24320  :  2.211935043334961 seconds elapsed
Iter 129  :  Rows 24320-24510  :  2.222219944000244 seconds elapsed
Iter 130  :  Rows 24510-24700  :  2.2193939685821533 seconds elapsed
Iter 131  :  Rows 24700-24890  :  2.190221071243286 seconds elapsed
Iter 132  :  Rows 24890-25080  :  2.2092318534851074 seconds elapsed
Iter 133  :  Rows 25080-25270  :  2.199831962585449 seconds elapsed
Iter 134  :  Rows 25270-25460  :  2.2603700160980225 seconds elapsed
Iter 135  :  Rows 25460-25650  :  2.2467548847198486 seconds elapsed
Iter 136  :  Rows 25650-25840  :  2.2310068607330322 seconds elapsed
Iter 137  :  Rows 25840-26030  :  2.2088911533355713 seconds elapsed
Iter 138  :  Rows 26030-26220  :  2.20877

Iter 244  :  Rows 46170-46360  :  2.2400572299957275 seconds elapsed
Iter 245  :  Rows 46360-46550  :  2.2135531902313232 seconds elapsed
Iter 246  :  Rows 46550-46740  :  2.218379020690918 seconds elapsed
Iter 247  :  Rows 46740-46930  :  2.227048873901367 seconds elapsed
Iter 248  :  Rows 46930-47120  :  2.214615821838379 seconds elapsed
Iter 249  :  Rows 47120-47310  :  2.2035562992095947 seconds elapsed
Iter 250  :  Rows 47310-47500  :  2.221041202545166 seconds elapsed
Iter 251  :  Rows 47500-47690  :  2.212592840194702 seconds elapsed
Iter 252  :  Rows 47690-47880  :  2.2212719917297363 seconds elapsed
Iter 253  :  Rows 47880-48070  :  2.2128329277038574 seconds elapsed
Iter 254  :  Rows 48070-48260  :  2.2214131355285645 seconds elapsed
Iter 255  :  Rows 48260-48450  :  2.2418320178985596 seconds elapsed
Iter 256  :  Rows 48450-48640  :  2.207494020462036 seconds elapsed
Iter 257  :  Rows 48640-48830  :  2.2243571281433105 seconds elapsed
Iter 258  :  Rows 48830-49020  :  2.2237

Iter 364  :  Rows 68970-69160  :  2.2157440185546875 seconds elapsed
Iter 365  :  Rows 69160-69350  :  2.206223964691162 seconds elapsed
Iter 366  :  Rows 69350-69540  :  2.2147817611694336 seconds elapsed
Iter 367  :  Rows 69540-69730  :  2.205334186553955 seconds elapsed
Iter 368  :  Rows 69730-69920  :  2.2416298389434814 seconds elapsed
Iter 369  :  Rows 69920-70110  :  2.208853006362915 seconds elapsed
Iter 370  :  Rows 70110-70300  :  2.234790086746216 seconds elapsed
Iter 371  :  Rows 70300-70490  :  2.2193338871002197 seconds elapsed
Iter 372  :  Rows 70490-70680  :  2.2389068603515625 seconds elapsed
Iter 373  :  Rows 70680-70870  :  2.2203521728515625 seconds elapsed
Iter 374  :  Rows 70870-71060  :  2.22337007522583 seconds elapsed
Iter 375  :  Rows 71060-71250  :  2.22879695892334 seconds elapsed
Iter 376  :  Rows 71250-71440  :  2.2092580795288086 seconds elapsed
Iter 377  :  Rows 71440-71630  :  2.2148258686065674 seconds elapsed
Iter 378  :  Rows 71630-71820  :  2.213398

Iter 484  :  Rows 91770-91960  :  2.199388027191162 seconds elapsed
Iter 485  :  Rows 91960-92150  :  2.234147071838379 seconds elapsed
Iter 486  :  Rows 92150-92340  :  2.2596731185913086 seconds elapsed
Iter 487  :  Rows 92340-92530  :  2.2020750045776367 seconds elapsed
Iter 488  :  Rows 92530-92720  :  2.240078926086426 seconds elapsed
Iter 489  :  Rows 92720-92910  :  2.2175557613372803 seconds elapsed
Iter 490  :  Rows 92910-93100  :  2.222337007522583 seconds elapsed
Iter 491  :  Rows 93100-93290  :  2.2415549755096436 seconds elapsed
Iter 492  :  Rows 93290-93480  :  2.2156970500946045 seconds elapsed
Iter 493  :  Rows 93480-93670  :  2.2221100330352783 seconds elapsed
Iter 494  :  Rows 93670-93860  :  2.22360897064209 seconds elapsed
Iter 495  :  Rows 93860-94050  :  2.2169930934906006 seconds elapsed
Iter 496  :  Rows 94050-94240  :  2.233288049697876 seconds elapsed
Iter 497  :  Rows 94240-94430  :  2.225677251815796 seconds elapsed
Iter 498  :  Rows 94430-94620  :  2.186295

Iter 602  :  Rows 114190-114380  :  2.2343389987945557 seconds elapsed
Iter 603  :  Rows 114380-114570  :  2.2012526988983154 seconds elapsed
Iter 604  :  Rows 114570-114760  :  2.2260098457336426 seconds elapsed
Iter 605  :  Rows 114760-114950  :  2.203031063079834 seconds elapsed
Iter 606  :  Rows 114950-115140  :  2.228433847427368 seconds elapsed
Iter 607  :  Rows 115140-115330  :  2.213196039199829 seconds elapsed
Iter 608  :  Rows 115330-115520  :  2.217625141143799 seconds elapsed
Iter 609  :  Rows 115520-115710  :  2.237542152404785 seconds elapsed
Iter 610  :  Rows 115710-115900  :  2.1864922046661377 seconds elapsed
Iter 611  :  Rows 115900-116090  :  2.2266931533813477 seconds elapsed
Iter 612  :  Rows 116090-116280  :  2.221708059310913 seconds elapsed
Iter 613  :  Rows 116280-116470  :  2.212054967880249 seconds elapsed
Iter 614  :  Rows 116470-116660  :  2.2252390384674072 seconds elapsed
Iter 615  :  Rows 116660-116850  :  2.2173640727996826 seconds elapsed
Iter 616  :  

Iter 719  :  Rows 136420-136610  :  2.230185031890869 seconds elapsed
Iter 720  :  Rows 136610-136800  :  2.2270147800445557 seconds elapsed
Iter 721  :  Rows 136800-136990  :  2.2250261306762695 seconds elapsed
Iter 722  :  Rows 136990-137180  :  2.228605031967163 seconds elapsed
Iter 723  :  Rows 137180-137370  :  2.232684850692749 seconds elapsed
Iter 724  :  Rows 137370-137560  :  2.223909854888916 seconds elapsed
Iter 725  :  Rows 137560-137750  :  2.211688995361328 seconds elapsed
Iter 726  :  Rows 137750-137940  :  2.229233980178833 seconds elapsed
Iter 727  :  Rows 137940-138130  :  2.2029502391815186 seconds elapsed
Iter 728  :  Rows 138130-138320  :  2.212584972381592 seconds elapsed
Iter 729  :  Rows 138320-138510  :  2.2195992469787598 seconds elapsed
Iter 730  :  Rows 138510-138700  :  2.2247180938720703 seconds elapsed
Iter 731  :  Rows 138700-138890  :  2.22019100189209 seconds elapsed
Iter 732  :  Rows 138890-139080  :  2.2325968742370605 seconds elapsed
Iter 733  :  Ro

Iter 836  :  Rows 158650-158840  :  2.2328193187713623 seconds elapsed
Iter 837  :  Rows 158840-159030  :  2.2383368015289307 seconds elapsed
Iter 838  :  Rows 159030-159220  :  2.196439027786255 seconds elapsed
Iter 839  :  Rows 159220-159410  :  2.2219319343566895 seconds elapsed
Iter 840  :  Rows 159410-159600  :  2.243903875350952 seconds elapsed
Iter 841  :  Rows 159600-159790  :  2.2207140922546387 seconds elapsed
Iter 842  :  Rows 159790-159980  :  2.2424662113189697 seconds elapsed
Iter 843  :  Rows 159980-160170  :  2.218514919281006 seconds elapsed
Iter 844  :  Rows 160170-160360  :  2.224591016769409 seconds elapsed
Iter 845  :  Rows 160360-160550  :  2.241450071334839 seconds elapsed
Iter 846  :  Rows 160550-160740  :  2.2316999435424805 seconds elapsed
Iter 847  :  Rows 160740-160930  :  2.2495410442352295 seconds elapsed
Iter 848  :  Rows 160930-161120  :  2.2268521785736084 seconds elapsed
Iter 849  :  Rows 161120-161310  :  2.22724986076355 seconds elapsed
Iter 850  :  

Iter 953  :  Rows 180880-181070  :  2.2075860500335693 seconds elapsed
Iter 954  :  Rows 181070-181260  :  2.2128710746765137 seconds elapsed
Iter 955  :  Rows 181260-181450  :  2.2329790592193604 seconds elapsed
Iter 956  :  Rows 181450-181640  :  2.2108981609344482 seconds elapsed
Iter 957  :  Rows 181640-181830  :  2.2139058113098145 seconds elapsed
Iter 958  :  Rows 181830-182020  :  2.223437786102295 seconds elapsed
Iter 959  :  Rows 182020-182210  :  2.201338052749634 seconds elapsed
Iter 960  :  Rows 182210-182400  :  2.2134671211242676 seconds elapsed
Iter 961  :  Rows 182400-182590  :  2.2331340312957764 seconds elapsed
Iter 962  :  Rows 182590-182780  :  2.2056379318237305 seconds elapsed
Iter 963  :  Rows 182780-182970  :  2.2219200134277344 seconds elapsed
Iter 964  :  Rows 182970-183160  :  2.226151943206787 seconds elapsed
Iter 965  :  Rows 183160-183350  :  2.2181012630462646 seconds elapsed
Iter 966  :  Rows 183350-183540  :  2.2426669597625732 seconds elapsed
Iter 967 

Iter 1069  :  Rows 202920-203110  :  2.2334799766540527 seconds elapsed
Iter 1070  :  Rows 203110-203300  :  2.2325081825256348 seconds elapsed
Iter 1071  :  Rows 203300-203490  :  2.211881399154663 seconds elapsed
Iter 1072  :  Rows 203490-203680  :  2.207576036453247 seconds elapsed
Iter 1073  :  Rows 203680-203870  :  2.230830192565918 seconds elapsed
Iter 1074  :  Rows 203870-204060  :  2.2257800102233887 seconds elapsed
Iter 1075  :  Rows 204060-204250  :  2.2147116661071777 seconds elapsed
Iter 1076  :  Rows 204250-204440  :  2.2052841186523438 seconds elapsed
Iter 1077  :  Rows 204440-204630  :  2.2122950553894043 seconds elapsed
Iter 1078  :  Rows 204630-204820  :  2.21455717086792 seconds elapsed
Iter 1079  :  Rows 204820-205010  :  2.220224142074585 seconds elapsed
Iter 1080  :  Rows 205010-205200  :  2.2306768894195557 seconds elapsed
Iter 1081  :  Rows 205200-205390  :  2.217360019683838 seconds elapsed
Iter 1082  :  Rows 205390-205580  :  2.2175559997558594 seconds elapsed

Iter 1184  :  Rows 224770-224960  :  2.211031913757324 seconds elapsed
Iter 1185  :  Rows 224960-225150  :  2.2112112045288086 seconds elapsed
Iter 1186  :  Rows 225150-225340  :  2.2145862579345703 seconds elapsed
Iter 1187  :  Rows 225340-225530  :  2.229710817337036 seconds elapsed
Iter 1188  :  Rows 225530-225720  :  2.222614049911499 seconds elapsed
Iter 1189  :  Rows 225720-225910  :  2.249789237976074 seconds elapsed
Iter 1190  :  Rows 225910-226100  :  2.200927972793579 seconds elapsed
Iter 1191  :  Rows 226100-226290  :  2.1947481632232666 seconds elapsed
Iter 1192  :  Rows 226290-226480  :  2.218440055847168 seconds elapsed
Iter 1193  :  Rows 226480-226670  :  2.2064971923828125 seconds elapsed
Iter 1194  :  Rows 226670-226860  :  2.2281100749969482 seconds elapsed
Iter 1195  :  Rows 226860-227050  :  2.21662974357605 seconds elapsed
Iter 1196  :  Rows 227050-227240  :  2.222968101501465 seconds elapsed
Iter 1197  :  Rows 227240-227430  :  2.230988025665283 seconds elapsed
It

Iter 1299  :  Rows 246620-246810  :  2.2256648540496826 seconds elapsed
Iter 1300  :  Rows 246810-247000  :  2.246767997741699 seconds elapsed
Iter 1301  :  Rows 247000-247190  :  2.2213330268859863 seconds elapsed
Iter 1302  :  Rows 247190-247380  :  2.2118682861328125 seconds elapsed
Iter 1303  :  Rows 247380-247570  :  2.198997974395752 seconds elapsed
Iter 1304  :  Rows 247570-247760  :  2.2205240726470947 seconds elapsed
Iter 1305  :  Rows 247760-247950  :  2.2133779525756836 seconds elapsed
Iter 1306  :  Rows 247950-248140  :  2.20646071434021 seconds elapsed
Iter 1307  :  Rows 248140-248330  :  2.2128942012786865 seconds elapsed
Iter 1308  :  Rows 248330-248520  :  2.209981918334961 seconds elapsed
Iter 1309  :  Rows 248520-248710  :  2.2160871028900146 seconds elapsed
Iter 1310  :  Rows 248710-248900  :  2.2033050060272217 seconds elapsed
Iter 1311  :  Rows 248900-249090  :  2.22330904006958 seconds elapsed
Iter 1312  :  Rows 249090-249280  :  2.199699878692627 seconds elapsed


Iter 1414  :  Rows 268470-268660  :  2.210921049118042 seconds elapsed
Iter 1415  :  Rows 268660-268850  :  2.2216289043426514 seconds elapsed
Iter 1416  :  Rows 268850-269040  :  2.2158830165863037 seconds elapsed
Iter 1417  :  Rows 269040-269230  :  2.202625036239624 seconds elapsed
Iter 1418  :  Rows 269230-269420  :  2.2211198806762695 seconds elapsed
Iter 1419  :  Rows 269420-269610  :  2.215120315551758 seconds elapsed
Iter 1420  :  Rows 269610-269800  :  2.197505235671997 seconds elapsed
Iter 1421  :  Rows 269800-269990  :  2.2505087852478027 seconds elapsed
Iter 1422  :  Rows 269990-270180  :  2.193404197692871 seconds elapsed
Iter 1423  :  Rows 270180-270370  :  2.202120304107666 seconds elapsed
Iter 1424  :  Rows 270370-270560  :  2.2073168754577637 seconds elapsed
Iter 1425  :  Rows 270560-270750  :  2.2196238040924072 seconds elapsed
Iter 1426  :  Rows 270750-270940  :  2.2204301357269287 seconds elapsed
Iter 1427  :  Rows 270940-271130  :  2.203591823577881 seconds elapsed

Iter 1529  :  Rows 290320-290510  :  2.2110378742218018 seconds elapsed
Iter 1530  :  Rows 290510-290700  :  2.184769868850708 seconds elapsed
Iter 1531  :  Rows 290700-290890  :  2.184591054916382 seconds elapsed
Iter 1532  :  Rows 290890-291080  :  2.1576011180877686 seconds elapsed
Iter 1533  :  Rows 291080-291270  :  2.1980412006378174 seconds elapsed
Iter 1534  :  Rows 291270-291460  :  2.1848011016845703 seconds elapsed
Iter 1535  :  Rows 291460-291650  :  2.2068519592285156 seconds elapsed
Iter 1536  :  Rows 291650-291840  :  2.193223714828491 seconds elapsed
Iter 1537  :  Rows 291840-292030  :  2.212441921234131 seconds elapsed
Iter 1538  :  Rows 292030-292220  :  2.1675310134887695 seconds elapsed
Iter 1539  :  Rows 292220-292410  :  2.187201738357544 seconds elapsed
Iter 1540  :  Rows 292410-292600  :  2.1721699237823486 seconds elapsed
Iter 1541  :  Rows 292600-292790  :  2.2165658473968506 seconds elapsed
Iter 1542  :  Rows 292790-292980  :  2.184058904647827 seconds elapse

Iter 1644  :  Rows 312170-312360  :  2.180311918258667 seconds elapsed
Iter 1645  :  Rows 312360-312550  :  2.2114439010620117 seconds elapsed
Iter 1646  :  Rows 312550-312740  :  2.1815619468688965 seconds elapsed
Iter 1647  :  Rows 312740-312930  :  2.2269320487976074 seconds elapsed
Iter 1648  :  Rows 312930-313120  :  2.203005075454712 seconds elapsed
Iter 1649  :  Rows 313120-313310  :  2.193474054336548 seconds elapsed
Iter 1650  :  Rows 313310-313500  :  2.1584219932556152 seconds elapsed
Iter 1651  :  Rows 313500-313690  :  2.2291419506073 seconds elapsed
Iter 1652  :  Rows 313690-313880  :  2.1553781032562256 seconds elapsed
Iter 1653  :  Rows 313880-314070  :  2.21583890914917 seconds elapsed
Iter 1654  :  Rows 314070-314260  :  2.1718389987945557 seconds elapsed
Iter 1655  :  Rows 314260-314450  :  2.2270829677581787 seconds elapsed
Iter 1656  :  Rows 314450-314640  :  2.164322853088379 seconds elapsed
Iter 1657  :  Rows 314640-314830  :  2.2289371490478516 seconds elapsed

Iter 1759  :  Rows 334020-334210  :  2.2446882724761963 seconds elapsed
Iter 1760  :  Rows 334210-334400  :  2.2192342281341553 seconds elapsed
Iter 1761  :  Rows 334400-334590  :  2.2060186862945557 seconds elapsed
Iter 1762  :  Rows 334590-334780  :  2.223633050918579 seconds elapsed
Iter 1763  :  Rows 334780-334970  :  2.2144551277160645 seconds elapsed
Iter 1764  :  Rows 334970-335160  :  2.222707986831665 seconds elapsed
Iter 1765  :  Rows 335160-335350  :  2.2126197814941406 seconds elapsed
Iter 1766  :  Rows 335350-335540  :  2.211031913757324 seconds elapsed
Iter 1767  :  Rows 335540-335730  :  2.2158308029174805 seconds elapsed
Iter 1768  :  Rows 335730-335920  :  2.222075939178467 seconds elapsed
Iter 1769  :  Rows 335920-336110  :  2.2212469577789307 seconds elapsed
Iter 1770  :  Rows 336110-336300  :  2.2119710445404053 seconds elapsed
Iter 1771  :  Rows 336300-336490  :  2.2217652797698975 seconds elapsed
Iter 1772  :  Rows 336490-336680  :  2.2133188247680664 seconds elapsed

Iter 1874  :  Rows 355870-356060  :  2.2367606163024902 seconds elapsed
Iter 1875  :  Rows 356060-356250  :  2.2026150226593018 seconds elapsed
Iter 1876  :  Rows 356250-356440  :  2.2151191234588623 seconds elapsed
Iter 1877  :  Rows 356440-356630  :  2.205454111099243 seconds elapsed
Iter 1878  :  Rows 356630-356820  :  2.2164199352264404 seconds elapsed
Iter 1879  :  Rows 356820-357010  :  2.216411828994751 seconds elapsed
Iter 1880  :  Rows 357010-357200  :  2.223597764968872 seconds elapsed
Iter 1881  :  Rows 357200-357390  :  2.2251179218292236 seconds elapsed
Iter 1882  :  Rows 357390-357580  :  2.2207329273223877 seconds elapsed
Iter 1883  :  Rows 357580-357770  :  2.1920692920684814 seconds elapsed
Iter 1884  :  Rows 357770-357960  :  2.2238762378692627 seconds elapsed
Iter 1885  :  Rows 357960-358150  :  2.2138919830322266 seconds elapsed
Iter 1886  :  Rows 358150-358340  :  2.2283780574798584 seconds elapsed
Iter 1887  :  Rows 358340-358530  :  2.221029043197632 seconds elapsed

Iter 1989  :  Rows 377720-377910  :  2.2137959003448486 seconds elapsed
Iter 1990  :  Rows 377910-378100  :  2.2343008518218994 seconds elapsed
Iter 1991  :  Rows 378100-378290  :  2.215453863143921 seconds elapsed
Iter 1992  :  Rows 378290-378480  :  2.225612163543701 seconds elapsed
Iter 1993  :  Rows 378480-378670  :  2.1995668411254883 seconds elapsed
Iter 1994  :  Rows 378670-378860  :  2.2018308639526367 seconds elapsed
Iter 1995  :  Rows 378860-379050  :  2.2287070751190186 seconds elapsed
Iter 1996  :  Rows 379050-379240  :  2.2201666831970215 seconds elapsed
Iter 1997  :  Rows 379240-379430  :  2.218024969100952 seconds elapsed
Iter 1998  :  Rows 379430-379620  :  2.259341239929199 seconds elapsed
Iter 1999  :  Rows 379620-379810  :  2.1893391609191895 seconds elapsed
Iter 2000  :  Rows 379810-380000  :  2.1956379413604736 seconds elapsed
Iter 2001  :  Rows 380000-380190  :  2.192436933517456 seconds elapsed
Iter 2002  :  Rows 380190-380380  :  2.2283613681793213 seconds elapsed

Iter 2104  :  Rows 399570-399760  :  2.1962130069732666 seconds elapsed
Iter 2105  :  Rows 399760-399950  :  2.2218310832977295 seconds elapsed
Iter 2106  :  Rows 399950-400140  :  2.206170082092285 seconds elapsed
Iter 2107  :  Rows 400140-400330  :  2.22172212600708 seconds elapsed
Iter 2108  :  Rows 400330-400520  :  2.2122130393981934 seconds elapsed
Iter 2109  :  Rows 400520-400710  :  2.2271242141723633 seconds elapsed
Iter 2110  :  Rows 400710-400900  :  2.2296230792999268 seconds elapsed
Iter 2111  :  Rows 400900-401090  :  2.2081501483917236 seconds elapsed
Iter 2112  :  Rows 401090-401280  :  2.215512275695801 seconds elapsed
Iter 2113  :  Rows 401280-401470  :  2.209902048110962 seconds elapsed
Iter 2114  :  Rows 401470-401660  :  2.2067067623138428 seconds elapsed
Iter 2115  :  Rows 401660-401850  :  2.20393705368042 seconds elapsed
Iter 2116  :  Rows 401850-402040  :  2.253713846206665 seconds elapsed
Iter 2117  :  Rows 402040-402230  :  2.5752360820770264 seconds elapsed


Iter 2219  :  Rows 421420-421610  :  2.355026960372925 seconds elapsed
Iter 2220  :  Rows 421610-421800  :  2.2030868530273438 seconds elapsed
Iter 2221  :  Rows 421800-421990  :  2.2648940086364746 seconds elapsed
Iter 2222  :  Rows 421990-422180  :  2.1915640830993652 seconds elapsed
Iter 2223  :  Rows 422180-422370  :  2.500980854034424 seconds elapsed
Iter 2224  :  Rows 422370-422560  :  2.3261051177978516 seconds elapsed
Iter 2225  :  Rows 422560-422750  :  2.199522018432617 seconds elapsed
Iter 2226  :  Rows 422750-422940  :  2.2526562213897705 seconds elapsed
Iter 2227  :  Rows 422940-423130  :  2.2113559246063232 seconds elapsed
Iter 2228  :  Rows 423130-423320  :  2.2034692764282227 seconds elapsed
Iter 2229  :  Rows 423320-423510  :  2.2225260734558105 seconds elapsed
Iter 2230  :  Rows 423510-423700  :  2.2230899333953857 seconds elapsed
Iter 2231  :  Rows 423700-423890  :  2.238265037536621 seconds elapsed
Iter 2232  :  Rows 423890-424080  :  2.197679281234741 seconds elapsed

Iter 2334  :  Rows 443270-443460  :  2.2194759845733643 seconds elapsed
Iter 2335  :  Rows 443460-443650  :  2.240509033203125 seconds elapsed
Iter 2336  :  Rows 443650-443840  :  2.2202601432800293 seconds elapsed
Iter 2337  :  Rows 443840-444030  :  2.1960201263427734 seconds elapsed
Iter 2338  :  Rows 444030-444220  :  2.2413101196289062 seconds elapsed
Iter 2339  :  Rows 444220-444410  :  2.232254981994629 seconds elapsed
Iter 2340  :  Rows 444410-444600  :  2.445511817932129 seconds elapsed
Iter 2341  :  Rows 444600-444790  :  2.583728075027466 seconds elapsed
Iter 2342  :  Rows 444790-444980  :  2.3563787937164307 seconds elapsed
Iter 2343  :  Rows 444980-445170  :  2.491295099258423 seconds elapsed
Iter 2344  :  Rows 445170-445360  :  2.533223867416382 seconds elapsed
Iter 2345  :  Rows 445360-445550  :  2.345484972000122 seconds elapsed
Iter 2346  :  Rows 445550-445740  :  2.1642632484436035 seconds elapsed
Iter 2347  :  Rows 445740-445930  :  2.2428882122039795 seconds elapsed

Iter 2449  :  Rows 465120-465310  :  2.2036681175231934 seconds elapsed
Iter 2450  :  Rows 465310-465500  :  2.248353958129883 seconds elapsed
Iter 2451  :  Rows 465500-465690  :  2.2179131507873535 seconds elapsed
Iter 2452  :  Rows 465690-465880  :  2.2106308937072754 seconds elapsed
Iter 2453  :  Rows 465880-466070  :  2.221956729888916 seconds elapsed
Iter 2454  :  Rows 466070-466260  :  2.221727132797241 seconds elapsed
Iter 2455  :  Rows 466260-466450  :  2.211311101913452 seconds elapsed
Iter 2456  :  Rows 466450-466640  :  2.2321012020111084 seconds elapsed
Iter 2457  :  Rows 466640-466830  :  2.214468002319336 seconds elapsed
Iter 2458  :  Rows 466830-467020  :  2.2088520526885986 seconds elapsed
Iter 2459  :  Rows 467020-467210  :  2.212455987930298 seconds elapsed
Iter 2460  :  Rows 467210-467400  :  2.2108728885650635 seconds elapsed
Iter 2461  :  Rows 467400-467590  :  2.234970808029175 seconds elapsed
Iter 2462  :  Rows 467590-467780  :  2.2041690349578857 seconds elapsed

Iter 2564  :  Rows 486970-487160  :  2.1281120777130127 seconds elapsed
Iter 2565  :  Rows 487160-487350  :  2.1309778690338135 seconds elapsed
Iter 2566  :  Rows 487350-487540  :  2.121778964996338 seconds elapsed
Iter 2567  :  Rows 487540-487730  :  2.1295390129089355 seconds elapsed
Iter 2568  :  Rows 487730-487920  :  2.1055140495300293 seconds elapsed
Iter 2569  :  Rows 487920-488110  :  2.132938861846924 seconds elapsed
Iter 2570  :  Rows 488110-488300  :  2.1254701614379883 seconds elapsed
Iter 2571  :  Rows 488300-488490  :  2.110623836517334 seconds elapsed
Iter 2572  :  Rows 488490-488680  :  2.1228549480438232 seconds elapsed
Iter 2573  :  Rows 488680-488870  :  2.1067540645599365 seconds elapsed
Iter 2574  :  Rows 488870-489060  :  2.1259589195251465 seconds elapsed
Iter 2575  :  Rows 489060-489250  :  2.127351760864258 seconds elapsed
Iter 2576  :  Rows 489250-489440  :  2.106437921524048 seconds elapsed
Iter 2577  :  Rows 489440-489630  :  2.115186929702759 seconds elapsed

Iter 2679  :  Rows 508820-509010  :  2.1227591037750244 seconds elapsed
Iter 2680  :  Rows 509010-509200  :  2.1102242469787598 seconds elapsed
Iter 2681  :  Rows 509200-509390  :  2.134925127029419 seconds elapsed
Iter 2682  :  Rows 509390-509580  :  2.1118478775024414 seconds elapsed
Iter 2683  :  Rows 509580-509770  :  2.1387088298797607 seconds elapsed
Iter 2684  :  Rows 509770-509960  :  2.13063383102417 seconds elapsed
Iter 2685  :  Rows 509960-510150  :  2.126631259918213 seconds elapsed
Iter 2686  :  Rows 510150-510340  :  2.1229071617126465 seconds elapsed
Iter 2687  :  Rows 510340-510530  :  2.132646083831787 seconds elapsed
Iter 2688  :  Rows 510530-510720  :  2.124408006668091 seconds elapsed
Iter 2689  :  Rows 510720-510910  :  2.1148688793182373 seconds elapsed
Iter 2690  :  Rows 510910-511100  :  2.1095058917999268 seconds elapsed
Iter 2691  :  Rows 511100-511290  :  2.1183969974517822 seconds elapsed
Iter 2692  :  Rows 511290-511480  :  2.0963730812072754 seconds elapsed

Iter 2794  :  Rows 530670-530860  :  2.1370279788970947 seconds elapsed
Iter 2795  :  Rows 530860-531050  :  2.1344540119171143 seconds elapsed
Iter 2796  :  Rows 531050-531240  :  2.121623992919922 seconds elapsed
Iter 2797  :  Rows 531240-531430  :  2.189502000808716 seconds elapsed
Iter 2798  :  Rows 531430-531620  :  2.1405210494995117 seconds elapsed
Iter 2799  :  Rows 531620-531810  :  2.1191201210021973 seconds elapsed
Iter 2800  :  Rows 531810-532000  :  2.122507095336914 seconds elapsed
Iter 2801  :  Rows 532000-532190  :  2.1115121841430664 seconds elapsed
Iter 2802  :  Rows 532190-532380  :  2.126065731048584 seconds elapsed
Iter 2803  :  Rows 532380-532570  :  2.118553876876831 seconds elapsed
Iter 2804  :  Rows 532570-532760  :  2.116818904876709 seconds elapsed
Iter 2805  :  Rows 532760-532950  :  2.125925302505493 seconds elapsed
Iter 2806  :  Rows 532950-533140  :  2.112169027328491 seconds elapsed
Iter 2807  :  Rows 533140-533330  :  2.139744997024536 seconds elapsed

Iter 2909  :  Rows 552520-552710  :  2.2178728580474854 seconds elapsed
Iter 2910  :  Rows 552710-552900  :  2.2262539863586426 seconds elapsed
Iter 2911  :  Rows 552900-553090  :  2.219669818878174 seconds elapsed
Iter 2912  :  Rows 553090-553280  :  2.213273763656616 seconds elapsed
Iter 2913  :  Rows 553280-553470  :  2.2389261722564697 seconds elapsed
Iter 2914  :  Rows 553470-553660  :  2.2172560691833496 seconds elapsed
Iter 2915  :  Rows 553660-553850  :  2.226027011871338 seconds elapsed
Iter 2916  :  Rows 553850-554040  :  2.2191476821899414 seconds elapsed
Iter 2917  :  Rows 554040-554230  :  2.186371088027954 seconds elapsed
Iter 2918  :  Rows 554230-554420  :  2.2069878578186035 seconds elapsed
Iter 2919  :  Rows 554420-554610  :  2.2469661235809326 seconds elapsed
Iter 2920  :  Rows 554610-554800  :  2.2044129371643066 seconds elapsed
Iter 2921  :  Rows 554800-554990  :  2.2272472381591797 seconds elapsed
Iter 2922  :  Rows 554990-555180  :  2.220003128051758 seconds elapsed

Iter 3024  :  Rows 574370-574560  :  2.214901924133301 seconds elapsed
Iter 3025  :  Rows 574560-574750  :  2.204425811767578 seconds elapsed
Iter 3026  :  Rows 574750-574940  :  2.1914069652557373 seconds elapsed
Iter 3027  :  Rows 574940-575130  :  2.2339859008789062 seconds elapsed
Iter 3028  :  Rows 575130-575320  :  2.183777093887329 seconds elapsed
Iter 3029  :  Rows 575320-575510  :  2.202185869216919 seconds elapsed
Iter 3030  :  Rows 575510-575700  :  2.1889190673828125 seconds elapsed
Iter 3031  :  Rows 575700-575890  :  2.2024238109588623 seconds elapsed
Iter 3032  :  Rows 575890-576080  :  2.1874632835388184 seconds elapsed
Iter 3033  :  Rows 576080-576270  :  2.2282609939575195 seconds elapsed
Iter 3034  :  Rows 576270-576460  :  2.1901330947875977 seconds elapsed
Iter 3035  :  Rows 576460-576650  :  2.2119171619415283 seconds elapsed
Iter 3036  :  Rows 576650-576840  :  2.214120864868164 seconds elapsed
Iter 3037  :  Rows 576840-577030  :  2.220118999481201 seconds elapsed

Iter 3139  :  Rows 596220-596410  :  2.1941099166870117 seconds elapsed
Iter 3140  :  Rows 596410-596600  :  2.1683480739593506 seconds elapsed
Iter 3141  :  Rows 596600-596790  :  2.204949140548706 seconds elapsed
Iter 3142  :  Rows 596790-596980  :  2.2020559310913086 seconds elapsed
Iter 3143  :  Rows 596980-597170  :  2.2040371894836426 seconds elapsed
Iter 3144  :  Rows 597170-597360  :  2.2152328491210938 seconds elapsed
Iter 3145  :  Rows 597360-597550  :  2.2247607707977295 seconds elapsed
Iter 3146  :  Rows 597550-597740  :  2.1970818042755127 seconds elapsed
Iter 3147  :  Rows 597740-597930  :  2.2194628715515137 seconds elapsed
Iter 3148  :  Rows 597930-598120  :  2.197882890701294 seconds elapsed
Iter 3149  :  Rows 598120-598310  :  2.2116098403930664 seconds elapsed
Iter 3150  :  Rows 598310-598500  :  2.199924945831299 seconds elapsed
Iter 3151  :  Rows 598500-598690  :  2.207667112350464 seconds elapsed
Iter 3152  :  Rows 598690-598880  :  2.1882309913635254 seconds elapsed

Iter 3254  :  Rows 618070-618260  :  2.197087049484253 seconds elapsed
Iter 3255  :  Rows 618260-618450  :  2.1925032138824463 seconds elapsed
Iter 3256  :  Rows 618450-618640  :  2.2242579460144043 seconds elapsed
Iter 3257  :  Rows 618640-618830  :  2.20284104347229 seconds elapsed
Iter 3258  :  Rows 618830-619020  :  2.227349042892456 seconds elapsed
Iter 3259  :  Rows 619020-619210  :  2.2196412086486816 seconds elapsed
Iter 3260  :  Rows 619210-619400  :  2.2145609855651855 seconds elapsed
Iter 3261  :  Rows 619400-619590  :  2.210469961166382 seconds elapsed
Iter 3262  :  Rows 619590-619780  :  2.2171850204467773 seconds elapsed
Iter 3263  :  Rows 619780-619970  :  2.1994869709014893 seconds elapsed
Iter 3264  :  Rows 619970-620160  :  2.2175540924072266 seconds elapsed
Iter 3265  :  Rows 620160-620350  :  2.20202374458313 seconds elapsed
Iter 3266  :  Rows 620350-620540  :  2.2026920318603516 seconds elapsed
Iter 3267  :  Rows 620540-620730  :  2.2014338970184326 seconds elapsed

Iter 3369  :  Rows 639920-640110  :  2.2196309566497803 seconds elapsed
Iter 3370  :  Rows 640110-640300  :  2.2074596881866455 seconds elapsed
Iter 3371  :  Rows 640300-640490  :  2.194992780685425 seconds elapsed
Iter 3372  :  Rows 640490-640680  :  2.2052550315856934 seconds elapsed
Iter 3373  :  Rows 640680-640870  :  2.212114095687866 seconds elapsed
Iter 3374  :  Rows 640870-641060  :  2.2077279090881348 seconds elapsed
Iter 3375  :  Rows 641060-641250  :  2.1966350078582764 seconds elapsed
Iter 3376  :  Rows 641250-641440  :  2.22259521484375 seconds elapsed
Iter 3377  :  Rows 641440-641630  :  2.2153420448303223 seconds elapsed
Iter 3378  :  Rows 641630-641820  :  2.20792818069458 seconds elapsed
Iter 3379  :  Rows 641820-642010  :  2.191051959991455 seconds elapsed
Iter 3380  :  Rows 642010-642200  :  2.182781219482422 seconds elapsed
Iter 3381  :  Rows 642200-642390  :  2.2281270027160645 seconds elapsed
Iter 3382  :  Rows 642390-642580  :  2.1979172229766846 seconds elapsed


Iter 3484  :  Rows 661770-661960  :  2.227602958679199 seconds elapsed
Iter 3485  :  Rows 661960-662150  :  2.369614839553833 seconds elapsed
Iter 3486  :  Rows 662150-662340  :  2.2668330669403076 seconds elapsed
Iter 3487  :  Rows 662340-662530  :  2.200220823287964 seconds elapsed
Iter 3488  :  Rows 662530-662720  :  2.220158100128174 seconds elapsed
Iter 3489  :  Rows 662720-662910  :  2.2051148414611816 seconds elapsed
Iter 3490  :  Rows 662910-663100  :  2.1809821128845215 seconds elapsed
Iter 3491  :  Rows 663100-663290  :  2.18871808052063 seconds elapsed
Iter 3492  :  Rows 663290-663480  :  2.1904211044311523 seconds elapsed
Iter 3493  :  Rows 663480-663670  :  2.186706066131592 seconds elapsed
Iter 3494  :  Rows 663670-663860  :  2.1840600967407227 seconds elapsed
Iter 3495  :  Rows 663860-664050  :  2.2012507915496826 seconds elapsed
Iter 3496  :  Rows 664050-664240  :  2.216677188873291 seconds elapsed
Iter 3497  :  Rows 664240-664430  :  2.2183148860931396 seconds elapsed


Iter 3599  :  Rows 683620-683810  :  2.218035936355591 seconds elapsed
Iter 3600  :  Rows 683810-684000  :  2.214689254760742 seconds elapsed
Iter 3601  :  Rows 684000-684190  :  2.2127299308776855 seconds elapsed
Iter 3602  :  Rows 684190-684380  :  2.21683406829834 seconds elapsed
Iter 3603  :  Rows 684380-684570  :  2.2432961463928223 seconds elapsed
Iter 3604  :  Rows 684570-684760  :  2.216130256652832 seconds elapsed
Iter 3605  :  Rows 684760-684950  :  2.225377082824707 seconds elapsed
Iter 3606  :  Rows 684950-685140  :  2.2185637950897217 seconds elapsed
Iter 3607  :  Rows 685140-685330  :  2.226839065551758 seconds elapsed
Iter 3608  :  Rows 685330-685520  :  2.2242050170898438 seconds elapsed
Iter 3609  :  Rows 685520-685710  :  2.226043939590454 seconds elapsed
Iter 3610  :  Rows 685710-685900  :  2.2092480659484863 seconds elapsed
Iter 3611  :  Rows 685900-686090  :  2.228442907333374 seconds elapsed
Iter 3612  :  Rows 686090-686280  :  2.2139830589294434 seconds elapsed

In [6]:
del cc_array

In [7]:
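from itertools import chain
# Flatten the per-chunk WED scores into a single list, one score per article
cc_wed = list(chain.from_iterable(cc_wed_chunked))
print(len(cc_wed))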

703532


In [8]:
print(f"""Max WED: {max(cc_wed)}
Min WED: {min(cc_wed)}
Mean: {np.mean(cc_wed)}
Std Dev: {np.std(cc_wed)}""")

Max WED: 0.8484386464904401
Min WED: 0.1815071659402093
Mean: 0.27598083037616433
Std Dev: 0.05075569724079787


In [9]:
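from operator import itemgetter
# Pair each score with its article index, then keep the 10,000 highest-scoring articles
wed_indexed = list(enumerate(cc_wed))
wed_indexed10k = sorted(wed_indexed, key=itemgetter(1))[-10000:]
wed_indices10k = [tup[0] for tup in wed_indexed10k]
wed_scores10k = [tup[1] for tup in wed_indexed10k]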
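
The cell above sorts every (index, score) pair to keep the 10,000 highest scores. As a hedged alternative, the sketch below does the same top-k selection with NumPy's `argpartition`, which avoids a full sort. The helper name `top_k_indices` and the toy scores are illustrative assumptions, not part of the original notebook, and when scores are tied it may choose different articles than the stable sort does.

```python
import numpy as np

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, ordered from lowest to highest score.

    Hypothetical helper for illustration; assumes k >= 1.
    """
    scores = np.asarray(scores)
    if k >= len(scores):
        return np.argsort(scores)
    # argpartition moves the k largest values into the last k slots, in no particular order
    top = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates by score
    return top[np.argsort(scores[top])]

# Toy scores standing in for cc_wed
toy_scores = [0.27, 0.85, 0.18, 0.43, 0.45]
print(top_k_indices(toy_scores, 2).tolist())  # [4, 1]
```

On the full score list, the call would presumably be `top_k_indices(cc_wed, 10000)`, returning the indices in ascending score order, as the sorted slice above does.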

In [10]:
print(f"""Max WED in subsample: {max(wed_scores10k)}
Min WED: {min(wed_scores10k)}
Mean: {np.mean(wed_scores10k)}
Std Dev: {np.std(wed_scores10k)}""")

Max WED in subsample: 0.8484386464904401
Min WED: 0.4253332667618559
Mean: 0.4456780084688563
Std Dev: 0.028223730462232817
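
To put the two summaries side by side, the arithmetic below uses only the figures printed above to show how far into the tail of the corpus distribution the subsample sits. The variable names are illustrative.

```python
# Figures copied from the printed summaries above
corpus_mean = 0.27598083037616433
corpus_std = 0.05075569724079787
subsample_min = 0.4253332667618559
n_corpus = 703532
n_subsample = 10000

# Distance of the subsample's lowest score from the corpus mean, in standard deviations
z_cutoff = (subsample_min - corpus_mean) / corpus_std
# Share of the corpus retained by the subsample
share = n_subsample / n_corpus

print(f"Cutoff is {z_cutoff:.2f} standard deviations above the corpus mean")
print(f"Subsample keeps {share:.2%} of the corpus")
```

Every article in the 10k subsample scores roughly 2.9 standard deviations or more above the corpus mean, and the subsample makes up about 1.4% of the corpus.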


In [11]:
with open(os.path.join("data","cc_news.json"), "r") as infile:
    cc_news = json.load(infile)

print(len(cc_news))

703532


In [12]:
%%time

cc_wed10k = [
    article for idx, article in enumerate(cc_news) if idx in wed_indices10k
]

CPU times: user 48.2 s, sys: 147 ms, total: 48.4 s
Wall time: 48.4 s
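
Much of the 48 seconds above likely comes from `idx in wed_indices10k`, which scans a 10,000-element list for each of the 703,532 articles. A minimal sketch of the same selection with a set lookup follows. The helper name `select_by_index` and the toy data are assumptions for illustration, and no timing is claimed for the real corpus.

```python
def select_by_index(items, indices):
    """Keep the items whose position is in `indices`, preserving the original order."""
    wanted = set(indices)  # set membership is O(1) on average; list membership is O(len(indices))
    return [item for idx, item in enumerate(items) if idx in wanted]

# Toy data standing in for cc_news and wed_indices10k
toy_articles = ["a", "b", "c", "d", "e"]
print(select_by_index(toy_articles, [4, 1]))  # ['b', 'e']
```

As in the original cell, the selected articles come out in corpus order rather than in score order.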


In [13]:
with open(os.path.join("data", "cc_wed10k.json"), "w") as outfile:
    json.dump(cc_wed10k, outfile)

In [14]:
with open(os.path.join("data", "cc_wed10k.txt"), "w") as outfile:
    # Articles are written back to back; no separator is added between them
    for article in cc_wed10k:
        outfile.write(article)