| | alpha3 | language | sentences | dataset_percentage | mean_len |
|---|---|---|---|---|---|
| 1 | eng | English | 1479733 | 19.83% | 39.3277 |
| 2 | rus | Russian | 849653 | 11.39% | 33.4655 |
| 3 | ita | Italian | 787053 | 10.55% | 33.4897 |
| 4 | tur | Turkish | 709573 | 9.51% | 34.7355 |
| 5 | deu | German | 553727 | 7.42% | 47.4774 |
| 6 | fra | French | 466192 | 6.25% | 41.3866 |
| 7 | por | Portuguese | 385737 | 5.17% | 38.2929 |
| 8 | spa | Spanish | 338781 | 4.54% | 38.8894 |
| 9 | hun | Hungarian | 323048 | 4.33% | 34.0299 |
| 10 | jpn | Japanese | 208761 | 2.80% | 18.2659 |
| 11 | heb | Hebrew | 197226 | 2.64% | 25.5678 |
| 12 | ukr | Ukrainian | 171674 | 2.30% | 27.8153 |
| 13 | nld | Dutch | 144340 | 1.93% | 34.7853 |
| 14 | fin | Finnish | 128011 | 1.72% | 35.7946 |
| 15 | pol | Polish | 109662 | 1.47% | 33.2333 |
| 16 | mkd | Macedonian | 77938 | 1.04% | 27.3793 |
| 17 | mar | Marathi | 64126 | 0.86% | 27.587 |
| 18 | lit | Lithuanian | 59659 | 0.80% | 30.1439 |
| 19 | ces | Czech | 57030 | 0.76% | 28.3683 |
| 20 | dan | Danish | 49399 | 0.66% | 33.7159 |
| 21 | swe | Swedish | 41677 | 0.56% | 30.1428 |
| 22 | ara | Arabic | 35991 | 0.48% | 26.7817 |
| 23 | ell | Greek | 34071 | 0.46% | 30.3915 |
| 24 | ron | Romanian | 24943 | 0.33% | 34.4097 |
| 25 | bul | Bulgarian | 24503 | 0.33% | 31.7201 |
| 26 | vie | Vietnamese | 19234 | 0.26% | 38.7891 |
| 27 | fil | Filipino | 16649 | 0.22% | 36.8098 |
| 28 | slk | Slovak | 14660 | 0.20% | 25.7422 |
| 29 | ind | Indonesian | 14542 | 0.19% | 37.4785 |
| 30 | hin | Hindi | 14230 | 0.19% | 27.6058 |
| 31 | nob | Norwegian Bokmål | 14223 | 0.19% | 37.4732 |
| 32 | cat | Catalan | 7971 | 0.11% | 37.334 |
| 33 | kor | Korean | 7570 | 0.10% | 16.8085 |
| 34 | hrv | Croatian | 5204 | 0.07% | 30.058 |
| 35 | ben | Bangla | 4714 | 0.06% | 23.7809 |
| 36 | afr | Afrikaans | 4031 | 0.05% | 29.676 |
| 37 | est | Estonian | 3637 | 0.05% | 27.6646 |
| 38 | tha | Thai | 3528 | 0.05% | 20.5697 |
| 39 | sqi | Albanian | 2526 | 0.03% | 32.2743 |
| 40 | urd | Urdu | 2008 | 0.03% | 30.7495 |
| 41 | cym | Welsh | 1344 | 0.02% | 29.3058 |
| 42 | slv | Slovenian | 1093 | 0.01% | 28.4282 |
| 43 | mal | Malayalam | 827 | 0.01% | 36.8222 |
| 44 | tam | Tamil | 334 | 0.00% | 35.2784 |
| 45 | tel | Telugu | 254 | 0.00% | 28.0157 |
| 46 | pan | Punjabi | 196 | 0.00% | 32.8622 |
| 47 | kan | Kannada | 176 | 0.00% | 35.3636 |
| 48 | guj | Gujarati | 168 | 0.00% | 24.244 |