rank alpha3 language sentences dataset_percentage mean_len
1 eng English 1479733 19.83% 39.3277
2 rus Russian 849653 11.39% 33.4655
3 ita Italian 787053 10.55% 33.4897
4 tur Turkish 709573 9.51% 34.7355
5 deu German 553727 7.42% 47.4774
6 fra French 466192 6.25% 41.3866
7 por Portuguese 385737 5.17% 38.2929
8 spa Spanish 338781 4.54% 38.8894
9 hun Hungarian 323048 4.33% 34.0299
10 jpn Japanese 208761 2.80% 18.2659
11 heb Hebrew 197226 2.64% 25.5678
12 ukr Ukrainian 171674 2.30% 27.8153
13 nld Dutch 144340 1.93% 34.7853
14 fin Finnish 128011 1.72% 35.7946
15 pol Polish 109662 1.47% 33.2333
16 mkd Macedonian 77938 1.04% 27.3793
17 mar Marathi 64126 0.86% 27.587
18 lit Lithuanian 59659 0.80% 30.1439
19 ces Czech 57030 0.76% 28.3683
20 dan Danish 49399 0.66% 33.7159
21 swe Swedish 41677 0.56% 30.1428
22 ara Arabic 35991 0.48% 26.7817
23 ell Greek 34071 0.46% 30.3915
24 ron Romanian 24943 0.33% 34.4097
25 bul Bulgarian 24503 0.33% 31.7201
26 vie Vietnamese 19234 0.26% 38.7891
27 fil Filipino 16649 0.22% 36.8098
28 slk Slovak 14660 0.20% 25.7422
29 ind Indonesian 14542 0.19% 37.4785
30 hin Hindi 14230 0.19% 27.6058
31 nob Norwegian Bokmål 14223 0.19% 37.4732
32 cat Catalan 7971 0.11% 37.334
33 kor Korean 7570 0.10% 16.8085
34 hrv Croatian 5204 0.07% 30.058
35 ben Bangla 4714 0.06% 23.7809
36 afr Afrikaans 4031 0.05% 29.676
37 est Estonian 3637 0.05% 27.6646
38 tha Thai 3528 0.05% 20.5697
39 sqi Albanian 2526 0.03% 32.2743
40 urd Urdu 2008 0.03% 30.7495
41 cym Welsh 1344 0.02% 29.3058
42 slv Slovenian 1093 0.01% 28.4282
43 mal Malayalam 827 0.01% 36.8222
44 tam Tamil 334 0.00% 35.2784
45 tel Telugu 254 0.00% 28.0157
46 pan Punjabi 196 0.00% 32.8622
47 kan Kannada 176 0.00% 35.3636
48 guj Gujarati 168 0.00% 24.244
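
The `dataset_percentage` column appears to be each language's share of the total sentence count, rounded to two decimals. The sketch below recomputes it under that assumption. The helper name `percentage` is hypothetical, and `TOTAL` is the sum of the `sentences` column over the 48 rows listed above, not a figure stated in the source.

```python
# Assumption: dataset_percentage = sentences / total sentences, shown with 2 decimals.
# TOTAL is the sum of the `sentences` column over the 48 rows in the table above.
TOTAL = 7_461_627


def percentage(sentences: int, total: int = TOTAL) -> str:
    """Return the share of `sentences` in `total`, formatted like the table column."""
    return f"{100 * sentences / total:.2f}%"


if __name__ == "__main__":
    # A few rows copied from the table: (alpha3, sentences)
    for code, count in [("eng", 1479733), ("rus", 849653), ("guj", 168)]:
        print(code, percentage(count))
```

Languages with only a few hundred sentences round to `0.00%` at this precision, so the raw `sentences` count is the more useful column for the tail of the table.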