# Calculating limits for sampling

A random sample gave the following counts and percentages per register:

HI: 225753 (4.9%)
    HI, re: 49016 (1.1%)
    HI, other: 176737 (3.8%)
ID: 220351 (4.8%)
    ID, other: 220351 (4.8%)
IN: 2252366 (48.8%)
    IN, dtp: 1116067 (24.2%)
    IN, en: 42425 (0.9%)
    IN, fi: 2709 (0.1%)
    IN, lt: 26150 (0.6%)
    IN, ra: 49788 (1.1%)
    IN, other: 1015227 (22.0%)
IP: 739136 (16.0%)
    IP, ds: 560168 (12.1%)
    IP, other: 178968 (3.9%)
LY: 17925 (0.4%)
    LY, other: 17925 (0.4%)
MT: 138730 (3.0%)
    MT, other: 138730 (3.0%)
NA: 1541299 (33.4%)
    NA, nb: 289080 (6.3%)
    NA, ne: 836345 (18.1%)
    NA, sr: 164895 (3.6%)
    NA, other: 250979 (5.4%)
OP: 651281 (14.1%)
    OP, av: 48197 (1.0%)
    OP, ob: 141351 (3.1%)
    OP, rs: 50705 (1.1%)
    OP, rv: 164476 (3.6%)
    OP, other: 246552 (5.3%)
SP: 29393 (0.6%)
    SP, it: 22948 (0.5%)
    SP, other: 6445 (0.1%)
no register: 66038 (1.4%)
TOTAL: 4613048

In [2]:
# what we want to sample

registers = ["dtp","HI","ID","IN","IP","MT","NA","ne","OP","SP","LY", "no-label"]

percentages = {"HI": 4.9,
               "ID": 4.8,
               "IN": 48.8-24.2,  # since dtp is under IN
               "dtp": 24.2,
               "IP": 16.0,
               "LY": 0.4,
               "MT": 3.0,
               "NA": 33.4-18.1,  # since ne is under NA
               "ne": 18.1,
               "OP": 14.1,
               "SP": 0.6,
               "no-label": 1.4,
               }


From the HPLT website:

eng_Latn DEDUPLICATED:

Docs: 7.72B
Words: 3.75T
Size: 23T

We want 150-160B tokens, which is roughly 140-150B words, per register.

We do not know the document length distribution for each register, and short docs will be dropped, so I'm taking a bit more than needed.

In [4]:
total_words = 3.75e12
wanted_words = 170e9   # making this larger than needed

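import math

millnames = ['', ' k', ' M', ' B', ' T']

def millify(n):
    # format a number with a thousands suffix, e.g. 170e9 -> '170 B'
    n = float(n)
    # index of the thousands group, clamped to the available suffixes
    millidx = max(0, min(len(millnames)-1,
                         int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))

    return '{:.0f}{}'.format(n / 10**(3 * millidx), millnames[millidx])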

print("limits = {")
for r in registers:
    # estimated total words for this register, from its share in the sample
    how_many_words_for_this_register_in_total = total_words*(percentages[r]/100)
    #print(f'{r} total number of words: {millify(how_many_words_for_this_register_in_total)}')
    # fraction to keep; capped at 1 when the register has fewer words than wanted
    limit = min(1, wanted_words/how_many_words_for_this_register_in_total)
    #print(f'\tlimit needed: {limit}')
    #print(f'\tAfter limit: {millify(how_many_words_for_this_register_in_total*limit)}')
    print(f'\t"{r}": {limit},')
print('\t"HI-IN": 1,')   # set manually to 1 since there is probably very little of it
print("}")

limits = {
	"dtp": 0.18732782369146006,
	"HI": 0.9251700680272109,
	"ID": 0.9444444444444444,
	"IN": 0.18428184281842822,
	"IP": 0.2833333333333333,
	"MT": 1,
	"NA": 0.29629629629629634,
	"ne": 0.2504604051565377,
	"OP": 0.3215130023640662,
	"SP": 1,
	"LY": 1,
	"no-label": 1,
	"HI-IN": 1,
}
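
As a sanity check on the limits above, the sketch below estimates how many words each register would yield after sampling. It reuses the notebook's numbers and makes the same assumption as the limit calculation, namely that a register's share of words equals its share in the random sample. The names `expected_words` and `below_target` are introduced here for illustration only.

```python
total_words = 3.75e12   # eng_Latn deduplicated word count used above
wanted_words = 170e9    # per-register target used above

# register shares in percent, same values as in the percentages dict above
percentages = {
    "HI": 4.9, "ID": 4.8, "IN": 48.8 - 24.2, "dtp": 24.2, "IP": 16.0,
    "LY": 0.4, "MT": 3.0, "NA": 33.4 - 18.1, "ne": 18.1, "OP": 14.1,
    "SP": 0.6, "no-label": 1.4,
}

expected_words = {}
for register, pct in percentages.items():
    available = total_words * pct / 100          # estimated words in this register
    limit = min(1, wanted_words / available)     # same formula as the limits cell
    expected_words[register] = available * limit

# registers that stay under the target even when everything is kept (limit == 1)
below_target = [r for r, w in expected_words.items()
                if w < wanted_words * (1 - 1e-9)]

for register, words in expected_words.items():
    print(f"{register}: {words / 1e9:.1f}B words")
print("below target:", below_target)
```

Under that assumption, every register with a limit below 1 lands on the 170B target, while MT, SP, LY and no-label stay below it even at limit 1 (about 112.5B, 22.5B, 15B and 52.5B words respectively).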
