# Download the English Wikipedia Dump

In [10]:
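# Create a working directory
# Note: `!cd` runs in a subshell, so it does not change the notebook's working
# directory; the files below are saved next to the notebook (use `%cd` to switch).
!mkdir fasttext_language_id
!cd fasttext_language_id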

# Download a sample of the English Wikipedia dump
!wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2 -O enwiki_sample.xml.bz2

--2024-10-09 16:37:55--  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p1p41242.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.71
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 283332977 (270M) [application/octet-stream]
Saving to: 'enwiki_sample.xml.bz2'


2024-10-09 16:39:03 (4.09 MB/s) - 'enwiki_sample.xml.bz2' saved [283332977/283332977]



# Install WikiExtractor

In [11]:
!git clone https://github.com/attardi/wikiextractor.git

Cloning into 'wikiextractor'...
remote: Enumerating objects: 771, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 771 (delta 17), reused 21 (delta 14), pack-reused 741 (from 1)
Receiving objects: 100% (771/771), 1.31 MiB | 4.92 MiB/s, done.
Resolving deltas: 100% (450/450), done.


# Extract Text from the Wikipedia Dump

In [12]:
!cd wikiextractor && python3 -m wikiextractor.WikiExtractor ../enwiki_sample.xml.bz2 -o ../extracted -b 100M --processes 4

INFO: Preprocessing '../enwiki_sample.xml.bz2' to collect template definitions: this may take some time.
INFO: Loaded 0 templates in 26.7s
INFO: Starting page extraction from ../enwiki_sample.xml.bz2.
INFO: Using 4 extract processes.
INFO: Finished 4-process extraction of 27140 articles in 59.3s (457.5 art/s)


# Prepare the Data for fastText

In [17]:
# Combine all text files into one
!cd extracted && find . -name 'wiki_*' -exec cat {} + > ../en_text.txt

In [18]:
# Add a label to each line (BSD/macOS sed syntax; GNU sed takes `-i` without the empty '')
!sed -i '' 's/^/__label__eng /' en_text.txt
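
The `sed` call above depends on the platform's `sed` flavour. A minimal portable sketch of the same labeling step in Python is shown below; the function name `add_label` is illustrative and not part of the original notebook.

```python
def add_label(lines, label):
    """Prefix every line with a fastText label such as __label__eng."""
    prefix = f"__label__{label} "
    return [prefix + line for line in lines]


# Like the sed command, empty lines also receive the prefix.
sample = ["United States Military Academy", ""]
labeled = add_label(sample, "eng")
print(labeled)
```

Empty lines keep their label here, matching what `head en_text.txt` shows; they are only removed later by the cleaning step.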

In [19]:
# Let's verify the Data Format
!head en_text.txt

__label__eng <doc id="32173" url="https://en.wikipedia.org/wiki?curid=32173" title="United States Military Academy">
__label__eng United States Military Academy
__label__eng 
__label__eng The United States Military Academy (USMA), also referred to metonymically as West Point or simply as Army, is a United States service academy in West Point, New York. It was originally established as a fort during the American Revolutionary War, as it sits on strategic high ground overlooking the Hudson River north of New York City. Founded in 1802, it is the oldest of the five American service academies and educates cadets for commissioning into the United States Army. The academic program grants the Bachelor of Science degree with a curriculum that grades cadets' performance upon a broad academic program, military leadership performance, and mandatory participation in competitive athletics. 
__label__eng Candidates for admission must apply directly to the academy and receive a nomination, usually fr

In [20]:
# Clean the text data with the clean_text.py utility script before training the model
!python3 clean_text.py en_text.txt final_cleaned_text.txt

Cleaning completed. Processed data saved to final_cleaned_text.txt
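
The notebook does not show `clean_text.py` itself. Judging only from the before and after samples, it appears to drop the `<doc ...>` wrapper lines, drop lines with fewer than five words, and keep the first ten words of each remaining line. The sketch below is a guess at that behaviour, not the actual script; `clean_line`, `min_words` and `max_words` are hypothetical names.

```python
import re

# WikiExtractor wraps each article in <doc ...> ... </doc> lines.
DOC_TAG = re.compile(r"^</?doc\b")


def clean_line(line, min_words=5, max_words=10):
    """Return a cleaned training line, or None if the line should be dropped."""
    parts = line.strip().split(maxsplit=1)
    if len(parts) < 2 or not parts[0].startswith("__label__"):
        return None  # no label, or nothing after the label
    label, text = parts
    if DOC_TAG.match(text):
        return None  # <doc ...> and </doc> wrapper lines
    words = text.split()
    if len(words) < min_words:
        return None  # titles and other very short lines
    return f"{label} {' '.join(words[:max_words])}"


print(clean_line("__label__eng United States Military Academy"))
print(clean_line("__label__eng a b c d e f g h i j k l"))
```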


In [21]:
# Let's verify the Data Format
!head final_cleaned_text.txt

__label__eng The United States Military Academy (USMA), also referred to metonymically
__label__eng Candidates for admission must apply directly to the academy and
__label__eng The academy's traditions have influenced other institutions because of its
__label__eng Colonial period, founding, and early years.
__label__eng The Continental Army first occupied West Point, New York, on
__label__eng "Cadets" underwent training in artillery and engineering studies at the
__label__eng In 1817, Colonel Sylvanus Thayer became the Superintendent and established
__label__eng In 1835, during the Army's first year of the Second
__label__eng The Mexican–American War brought the academy to prominence as graduates
__label__eng Immediately following the Civil War, the academy enjoyed unprecedented fame


In [22]:
# Shuffle the data with the shuf utility before training the model
!shuf final_cleaned_text.txt > shuffled_cleaned_text.txt

# Verify the shuffled data by showing the first few lines
!head shuffled_cleaned_text.txt

__label__eng The earliest systems employed a spinning disk to create and
__label__eng "And Sousakim gave to Jeroboam Ano the eldest sister of
__label__eng According to Leveritt, "Police records were a mess. To call
__label__eng The governorship of the Tendilla-Mondéjar family came to an end
__label__eng Euthanasia opponent Ian Dowbiggin argues that the early membership of
__label__eng κ Aquarii, also called "Situla", has an apparent
__label__eng As a non-signatory of the Treaty on Nuclear Non-Proliferation, Pakistan
__label__eng Nearby villages and settlements include St. Johnston. McKinnon's Pond is
__label__eng Competition keirin races are conducted over several rounds with one
__label__eng Tivara, the fourth son of Ashoka and Karuvaki, is the
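
`shuf` produces a different order on every run and is not installed by default on every system. If a reproducible shuffle is needed, a seeded version can be sketched in Python as follows; `shuffle_lines` and the seed value are illustrative choices, not part of the original workflow.

```python
import random


def shuffle_lines(lines, seed=42):
    """Return a shuffled copy of lines; the same seed gives the same order."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled


lines = [f"__label__eng sentence {i}" for i in range(10)]
first = shuffle_lines(lines)
second = shuffle_lines(lines)
print(first == second)  # the two runs agree because the seed is fixed
```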


In [25]:
# Splitting the data into training and testing sets
# Get the total number of lines and split the dataset
!total_lines=$(wc -l < shuffled_cleaned_text.txt) && \
train_lines=$(echo "$total_lines * 0.8 / 1" | bc) && \
test_lines=$(echo "$total_lines - $train_lines" | bc) && \
head -n $train_lines shuffled_cleaned_text.txt > train.txt && \
tail -n $test_lines shuffled_cleaned_text.txt > test.txt

# Verify the split
!echo "Train set: $(wc -l < train.txt) lines"
!echo "Test set: $(wc -l < test.txt) lines"

Train set:   622781 lines
Test set:   155696 lines
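
The shell pipeline above takes the first 80% of the shuffled lines for training and the remainder for testing, truncating the training count to a whole number. A rough Python equivalent, with the hypothetical helper name `split_lines`, might look like this:

```python
def split_lines(lines, train_fraction=0.8):
    """Split lines into (train, test), truncating the train size to an integer."""
    train_count = int(len(lines) * train_fraction)
    return lines[:train_count], lines[train_count:]


# Toy input: 1,100 placeholder items standing in for labeled lines.
train, test = split_lines(list(range(1100)))
print(len(train), len(test))
```

Because the split is positional, it only gives a fair test set if the input has been shuffled first.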


# Download and Build fastText

In [5]:
# Downloading fastText
!wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
!unzip v0.9.2.zip

--2024-10-09 16:33:09--  https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
Resolving github.com (github.com)... 4.237.22.38
Connecting to github.com (github.com)|4.237.22.38|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/refs/tags/v0.9.2 [following]
--2024-10-09 16:33:09--  https://codeload.github.com/facebookresearch/fastText/zip/refs/tags/v0.9.2
Resolving codeload.github.com (codeload.github.com)... 4.237.22.35
Connecting to codeload.github.com (codeload.github.com)|4.237.22.35|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'v0.9.2.zip.3'

v0.9.2.zip.3            [     <=>            ]   4.17M  4.12MB/s    in 1.0s    

2024-10-09 16:33:11 (4.12 MB/s) - 'v0.9.2.zip.3' saved [4369852]

Archive:  v0.9.2.zip
5b5943c118b0ec5fb9cd8d20587de2b2d3966dfe
   creating: fastText-0.9.2/
   creating: fastText-0.9.2/.circleci/
  i

In [7]:
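# Move into the fastText directory and build it
# (%cd, not !cd, so the directory change persists across cells)
%cd fastText-0.9.2
!make

/Users/ruslankhissamiyev/Desktop/ADS Final Presentation/ADSFinal/fastText-0.9.2
make: Nothing to be done for `opt'.


In [8]:
# Check that the fastText binary runs from inside fastText-0.9.2
# (later cells call fastText-0.9.2/fasttext from the project root, so run `%cd ..` before them)
!./fasttext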

usage: fasttext <command> <args>

The commands supported by fasttext are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  test-label              print labels with precision and recall scores
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  print-ngrams            print ngrams given a trained model and word
  nn                      query for nearest neighbors
  analogies               query for analogies
  dump                    dump arguments,dictionary,input/output vectors



# Training the Initial Model

In [26]:
# Training the model
!fastText-0.9.2/fasttext supervised -input train.txt -output langdetect -dim 16

Read 7M words
Number of words:  377291
Number of labels: 1
Progress: 100.0% words/sec/thread: 3271998 lr:  0.000000 avg.loss:  0.000000 ETA:   0h 0m 0s


In [27]:
# Testing the model
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	1480
P@1	1
R@1	1
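
With `Number of labels: 1`, the only prediction the model can make is `__label__eng`, so a perfect score here says nothing yet about language detection. The sketch below shows how precision at 1 is computed when every example has exactly one label; in that case recall at 1 is the same number. The function name `precision_at_1` is illustrative and is not a fastText API.

```python
def precision_at_1(true_labels, predicted_labels):
    """Fraction of examples whose top prediction equals the true label."""
    if not true_labels:
        raise ValueError("no examples to evaluate")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# A single-label dataset: any model that always answers "eng" is perfect.
truth = ["__label__eng"] * 4
always_eng = ["__label__eng"] * 4
single_label_score = precision_at_1(truth, always_eng)

# A two-label dataset exposes the same constant predictor.
mixed_truth = ["__label__eng", "__label__spa", "__label__eng", "__label__spa"]
mixed_score = precision_at_1(mixed_truth, always_eng)
print(single_label_score, mixed_score)
```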


# Loop Over Each Language

In [None]:
# Do not run this cell: downloading and processing the full dumps for all languages takes around 4 days. Use the next cell instead.
# Create the combined output file
!touch all_languages_text.txt

import os
import subprocess
import time
import shlex

def process_language(lang):
    print(f"Processing language: {lang}")
    
    parent_dir = os.getcwd()
    working_dir = os.path.join(parent_dir, f"{lang}_fasttext_language_id")
    os.makedirs(working_dir, exist_ok=True)
    
    dump_file = f"{lang}wiki-latest-pages-articles.xml.bz2"
    dump_url = f"https://dumps.wikimedia.org/{lang}wiki/latest/{dump_file}"
    
    # Download with retry
    max_retries = 3
    for attempt in range(max_retries):
        try:
            print(f"Downloading dump for {lang} (Attempt {attempt + 1}/{max_retries})")
            subprocess.run(f"wget {dump_url} -O {dump_file}", shell=True, cwd=working_dir, check=True)
            break
        except subprocess.CalledProcessError as e:
            if attempt == max_retries - 1:
                print(f"Failed to download dump for {lang} after {max_retries} attempts. Skipping.")
                return
            print("Download failed, retrying in 5 seconds...")
            time.sleep(5)
    
    clean_script_path = "/Users/ruslankhissamiyev/Desktop/ADS Final Presentation/ADSFinal/clean_text.py"
    
    steps = [
        ("extraction", f"python3 -m wikiextractor.WikiExtractor {dump_file} -o extracted --processes 4"),
        ("combination", "find extracted -name 'wiki_*' -exec cat {} + > text.txt"),
        ("labeling", f"sed -i '' 's/^/__label__{lang} /' text.txt"),
        ("cleaning", f"python3 {shlex.quote(clean_script_path)} text.txt cleaned_text.txt"),
        ("appending", f"cat cleaned_text.txt >> {shlex.quote(os.path.join(parent_dir, 'all_languages_text.txt'))}")
    ]
    
    for step_name, command in steps:
        print(f"Starting {step_name} for {lang}")
        try:
            subprocess.run(command, shell=True, cwd=working_dir, check=True)
        except subprocess.CalledProcessError as e:
            print(f"Failed to {step_name} for {lang}. Error: {e}")
            return
    
    print(f"Completed processing for {lang}")

# List of languages to process
languages = ["en", "es", "zh", "ar", "kk"] 

# Ensure the main output file exists
open('all_languages_text.txt', 'a').close()

# Process each language
for lang in languages:
    process_language(lang)

print("All languages processed.")

In [67]:
# Run this cell instead: it samples random articles for each language through the Wikipedia API, which yields less data but is much faster
# Create the combined output file
!touch all_languages_text.txt

languages = ["en", "zh", "es", "ar", "fr", "ru", "pt", "de", "ja", "hi", "kk"]

num_articles_per_language = 100  # Desired number of articles per language

import requests
import os
import time

def download_random_articles(lang, num_articles, output_file):
    base_url = f"https://{lang}.wikipedia.org/w/api.php"
    session = requests.Session()
    headers = {'User-Agent': 'LanguageDetectionBot/1.0 (khissamiyev@gmail.com)'}
    articles_per_request = 5  # Adjusted to the actual limit imposed by the API

    articles_collected = 0  # Counter for the number of articles collected

    while articles_collected < num_articles:
        rnlimit = min(articles_per_request, num_articles - articles_collected)

        # Step 1: Get random page IDs
        params = {
            'action': 'query',
            'list': 'random',
            'rnnamespace': 0,
            'rnlimit': rnlimit,
            'format': 'json'
        }

        response = session.get(url=base_url, params=params, headers=headers)
        data = response.json()
        random_pages = data.get('query', {}).get('random', [])

        if not random_pages:
            print(f"Failed to get random pages for language {lang}.")
            time.sleep(1)
            continue

        page_ids = [str(page['id']) for page in random_pages]

        # Step 2: Get content of pages
        params = {
            'action': 'query',
            'prop': 'extracts',
            'explaintext': True,
            'exlimit': 'max',
            'pageids': '|'.join(page_ids),
            'format': 'json'
        }

        response = session.get(url=base_url, params=params, headers=headers)
        data = response.json()

        if 'query' not in data or 'pages' not in data['query']:
            print(f"Failed to get page extracts for language {lang}.")
            time.sleep(1)
            continue

        pages = data['query']['pages']
        for page_id, page_content in pages.items():
            text = page_content.get('extract', '').strip()
            if text:
                # Collapse newlines into spaces
                cleaned_text = text.replace('\n', ' ').strip()
                # Ensure the line has at least 5 words
                if len(cleaned_text.split()) >= 5:
                    # Limit the line to 50 words
                    limited_text = ' '.join(cleaned_text.split()[:50])
                    # Write to file with label
                    with open(output_file, 'a', encoding='utf-8') as f_out:
                        f_out.write(f"__label__{lang} {limited_text}\n")
                    articles_collected += 1  # Increment the counter

                    # Break the loop if we've collected enough articles
                    if articles_collected >= num_articles:
                        break

        time.sleep(1)  # Be polite and don't overload the server

    print(f"Collected {articles_collected} articles for language {lang}.")

output_file = 'all_languages_text.txt'

for lang in languages:
    print(f"Downloading articles for language: {lang}")
    download_random_articles(lang, num_articles_per_language, output_file)

print("All languages processed.")

Downloading articles for language: en
Collected 100 articles for language en.
Downloading articles for language: zh
Collected 100 articles for language zh.
Downloading articles for language: es
Collected 100 articles for language es.
Downloading articles for language: ar
Collected 100 articles for language ar.
Downloading articles for language: fr
Collected 100 articles for language fr.
Downloading articles for language: ru
Collected 100 articles for language ru.
Downloading articles for language: pt
Collected 100 articles for language pt.
Downloading articles for language: de
Collected 100 articles for language de.
Downloading articles for language: ja
Collected 100 articles for language ja.
Downloading articles for language: hi
Collected 100 articles for language hi.
Downloading articles for language: kk
Collected 100 articles for language kk.
All languages processed.
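
Each language reports 100 collected articles, but the cleaning step that follows may drop some lines. A small sanity check of the label balance in the combined file can be sketched like this; `count_labels` is a hypothetical helper and the sample lines are made up for illustration.

```python
from collections import Counter


def count_labels(lines):
    """Count how many lines carry each __label__ prefix."""
    counts = Counter()
    for line in lines:
        # Guard against blank lines, which have no first token.
        first = line.split(maxsplit=1)[0] if line.strip() else ""
        if first.startswith("__label__"):
            counts[first[len("__label__"):]] += 1
    return counts


sample_lines = [
    "__label__en This is an English sentence for testing",
    "__label__es Esta es una frase de ejemplo",
    "__label__en Another English example line here",
    "",
]
label_counts = count_labels(sample_lines)
print(label_counts)
```

To run it on the real data, pass an open file handle, for example `count_labels(open('all_languages_text.txt', encoding='utf-8'))`.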


In [68]:
# Applying cleaning function
!python3 clean_text.py all_languages_text.txt final_all_languages_text.txt

# Overwrite the original file with the cleaned data
!mv final_all_languages_text.txt all_languages_text.txt

Cleaning completed. Processed data saved to final_all_languages_text.txt


# Shuffle Data

In [69]:
# Shuffle the data with the shuf utility before training the model
!shuf all_languages_text.txt > shuffled_all_languages_text.txt

# Splitting data into Training and Testing datasets

In [70]:
# Splitting the data into training and testing sets
# Get the total number of lines and split the dataset
!total_lines=$(wc -l < shuffled_all_languages_text.txt) && \
train_lines=$(echo "$total_lines * 0.8 / 1" | bc) && \
test_lines=$(echo "$total_lines - $train_lines" | bc) && \
head -n $train_lines shuffled_all_languages_text.txt > train.txt && \
tail -n $test_lines shuffled_all_languages_text.txt > test.txt

# Verify the split
!echo "Train set: $(wc -l < train.txt) lines"
!echo "Test set: $(wc -l < test.txt) lines"

Train set:      880 lines
Test set:      220 lines


# Training the Initial Multilingual Model

In [71]:
# Training the model
!fastText-0.9.2/fasttext supervised -input train.txt -output langdetect -dim 16

Read 0M words
Number of words:  20540
Number of labels: 11
Progress: 100.0% words/sec/thread:  169841 lr:  0.000000 avg.loss:  2.331928 ETA:   0h 0m 0s


In [72]:
# Testing the model
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	220
P@1	0.268
R@1	0.268


# Improving the model

In [73]:
# Retraining the model with all improvements applied
import subprocess

# Command to run FastText with all the improvements applied
command = [
    './fastText-0.9.2/fasttext', 'supervised',
    '-input', 'train.txt',        # Training data
    '-output', 'langdetect',      # Output model name prefix
    '-dim', '16',                 # Set vector dimension to 16
    '-epoch', '15',               # Increase the number of epochs to 15
    '-lr', '1.0',                 # Increase the learning rate to 1.0
    '-loss', 'hs',                # Use hierarchical softmax loss function
    '-minn', '2', '-maxn', '4',   # Use character n-grams from 2 to 4 characters (subword features)
    '-wordNgrams', '2'            # Include word bigrams (word n-grams of length 2)
]

# Run the command using subprocess
subprocess.run(command)

Read 0M words
Number of words:  20540
Number of labels: 11
Progress: 100.0% words/sec/thread:  485883 lr:  0.000000 avg.loss:  0.711034 ETA:   0h 0m 0s


CompletedProcess(args=['./fastText-0.9.2/fasttext', 'supervised', '-input', 'train.txt', '-output', 'langdetect', '-dim', '16', '-epoch', '15', '-lr', '1.0', '-loss', 'hs', '-minn', '2', '-maxn', '4', '-wordNgrams', '2'], returncode=0)
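
The `-minn 2 -maxn 4` and `-wordNgrams 2` options add subword and word-pair features. As a simplified illustration (fastText itself hashes these features into buckets and may enumerate them in a different order), the sketch below lists the character n-grams of a word wrapped in `<` and `>` boundary markers, and the word bigrams of a token list. The helper names `char_ngrams` and `word_bigrams` are assumptions for this example.

```python
def char_ngrams(word, minn=2, maxn=4):
    """List character n-grams of '<word>' with lengths minn..maxn."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def word_bigrams(tokens):
    """List adjacent word pairs, joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


the_grams = char_ngrams("the")
pairs = word_bigrams(["the", "quick", "fox"])
print(the_grams)
print(pairs)
```

Short character n-grams such as `th` or `he>` are strong language cues, which may explain why these options help so much on short inputs.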

In [74]:
# Testing the improved model on the test set
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	220
P@1	0.95
R@1	0.95


# Working with OSCAR Data

In [25]:
# Install the libraries used to download the dataset
!pip install datasets
!pip install tqdm

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
Collecting requests>=2.32.2
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting fsspec[http]<=2024.6.1,>=2023.1.0
  Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.17-py39-none-any.whl (133 kB)
Collecting pyarrow>=15.0.0
  Downloading pyarrow-17.0.0-cp39-cp39-macosx_10_15_x86_64.whl (29.0 MB)

In [79]:
from datasets import load_dataset
import re
import os
from tqdm import tqdm

def process_language(lang, max_size_bytes=2 * 1024 * 1024):
    print(f"Processing language: {lang}")
    try:
        # Load the dataset for the language and specify the split
        dataset = load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split='train', streaming=True)
    except Exception as e:
        print(f"Language {lang} is not available in the OSCAR dataset or an error occurred: {e}")
        return

    output_file = f"{lang}_data.txt"
    bytes_written = 0
    max_bytes = max_size_bytes  # size cap in bytes (2 MB by default)

    with open(output_file, 'w', encoding='utf-8') as f_out:
        # Use tqdm for progress bar
        for example in tqdm(dataset, desc=f"Processing {lang}"):
            text = example['text'].strip()
            # Clean the text (you can adjust the regex as needed)
            cleaned_text = re.sub(r'\s+', ' ', text)
            # Ensure the line has at least 5 words
            words = cleaned_text.split()
            if len(words) >= 5:
                # Limit the line to 10 words
                limited_text = ' '.join(words[:10])
                # Prepare the line with label
                line = f"__label__{lang} {limited_text}\n"
                # Write to file
                f_out.write(line)
                bytes_written += len(line.encode('utf-8'))
                # Check if the size limit is reached
                # (the message text is hard-coded; the actual cap is max_bytes, 2 MB by default)
                if bytes_written >= max_bytes:
                    print(f"Reached 100MB limit for {lang}")
                    break
    # Run your existing clean_text.py script
    os.system(f"python3 clean_text.py {output_file} cleaned_{output_file}")
    # Remove the uncleaned file to save space
    os.remove(output_file)

# Old list of languages (replaced by the shorter list in the next cell)
languages = [
    "af", "als", "am", "an", "ar", "arz", "as", "ast", "av", "az", "azb", "ba", "bar", "bcl",
    "be", "bg", "bh", "bn", "bo", "bpy", "br", "bs", "bxr", "ca", "cbk", "ce", "ceb", "ckb",
    "co", "cs", "cv", "cy", "da", "de", "diq", "dsb", "dty", "dv", "el", "eml", "en", "eo",
    "es", "et", "eu", "fa", "fi", "fr", "frr", "fy", "ga", "gd", "gl", "gn", "gom", "gu",
    "gv", "he", "hi", "hif", "hr", "hsb", "ht", "hu", "hy", "ia", "id", "ie", "ilo", "io",
    "is", "it", "ja", "jbo", "jv", "ka", "kk", "km", "kn", "ko", "krc", "ku", "kv", "kw",
    "ky", "la", "lb", "lez", "li", "lmo", "lo", "lrc", "lt", "lv", "mai", "mg", "mhr",
    "min", "mk", "ml", "mn", "mr", "mrj", "ms", "mt", "mwl", "my", "myv", "mzn", "nah",
    "nap", "nds", "ne", "new", "nl", "nn", "no", "oc", "or", "os", "pa", "pam", "pfl", "pl",
    "pms", "pnb", "ps", "pt", "qu", "rm", "ro", "ru", "rue", "sa", "sah", "sc", "scn",
    "sco", "sd", "sh", "si", "sk", "sl", "so", "sq", "sr", "su", "sv", "sw", "ta", "te",
    "tg", "th", "tk", "tl", "tr", "tt", "tyv", "ug", "uk", "ur", "uz", "vec", "vep", "vi",
    "vls", "vo", "wa", "war", "wuu", "xal", "xmf", "yi", "yo", "yue", "zh"
] 

In [80]:
languages = ["en", "zh", "es", "ar", "fr", "ru", "pt", "de", "ja", "hi", "kk"]

for lang in languages:
    process_language(lang)
    # Append cleaned data to the main file
    cleaned_file = f"cleaned_{lang}_data.txt"
    if os.path.exists(cleaned_file):
        os.system(f"cat {cleaned_file} >> all_languages_text.txt")
        # Remove the individual cleaned file to save space
        os.remove(cleaned_file)


Processing language: en


Processing en: 11010it [00:52, 1163.88it/s]Got disconnected from remote data host. Retrying in 5sec [1/20]
Processing en: 28730it [02:36, 183.27it/s] 


Reached 100MB limit for en
Cleaning completed. Processed data saved to cleaned_en_data.txt
Processing language: zh


Processing zh: 1705it [00:26, 65.09it/s] 

Reached 100MB limit for zh
Cleaning completed. Processed data saved to cleaned_zh_data.txt
Processing language: es



Processing es: 16670it [00:59, 163.41it/s] Got disconnected from remote data host. Retrying in 5sec [1/20]
Processing es: 27880it [01:41, 274.38it/s] 


Reached 100MB limit for es
Cleaning completed. Processed data saved to cleaned_es_data.txt
Processing language: ar


Processing ar: 17516it [00:43, 401.09it/s] 


Reached 100MB limit for ar
Cleaning completed. Processed data saved to cleaned_ar_data.txt
Processing language: fr


Processing fr: 27089it [00:18, 1478.09it/s]


Reached 100MB limit for fr
Cleaning completed. Processed data saved to cleaned_fr_data.txt
Processing language: ru


Processing ru: 14189it [00:18, 785.83it/s] 


Reached 100MB limit for ru
Cleaning completed. Processed data saved to cleaned_ru_data.txt
Processing language: pt


Processing pt: 27694it [00:20, 1351.02it/s]


Reached 100MB limit for pt
Cleaning completed. Processed data saved to cleaned_pt_data.txt
Processing language: de


Processing de: 24365it [00:21, 1141.36it/s]


Reached 100MB limit for de
Cleaning completed. Processed data saved to cleaned_de_data.txt
Processing language: ja


Processing ja: 2417it [00:15, 160.62it/s]


Reached 100MB limit for ja
Cleaning completed. Processed data saved to cleaned_ja_data.txt
Processing language: hi


Processing hi: 14300it [00:22, 648.70it/s] 

Reached 100MB limit for hi
Cleaning completed. Processed data saved to cleaned_hi_data.txt
Processing language: kk



Processing kk: 13100it [00:12, 1026.74it/s]

Reached 100MB limit for kk
Cleaning completed. Processed data saved to cleaned_kk_data.txt
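
The item counts in the progress bars above vary widely between languages (for example, 1,705 items for zh against 28,730 for en). One plausible contributor, offered as an assumption rather than a measured result, is that the size cap counts UTF-8 bytes while the 10-word limit counts whitespace-separated tokens, so unsegmented Chinese or Japanese text yields few but very long "words". The sketch below compares both measures; `line_cost` is a hypothetical helper.

```python
def line_cost(line):
    """Return (whitespace_tokens, utf8_bytes) for one line of text."""
    return len(line.split()), len(line.encode("utf-8"))


english = line_cost("the quick brown fox jumps")
chinese = line_cost("敏捷的棕色狐狸")
print(english, chinese)
```

The same effect means the `len(words) >= 5` filter discards any text that has fewer than five whitespace-separated chunks, regardless of how many characters it contains.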





# Retraining the Model with the Enhanced Dataset

In [81]:
# Shuffle the data with the shuf utility before training the model
!shuf all_languages_text.txt > shuffled_all_languages_text.txt

In [82]:
# Splitting the data into training and testing sets
# Get the total number of lines and split the dataset
!total_lines=$(wc -l < shuffled_all_languages_text.txt) && \
train_lines=$(echo "$total_lines * 0.8 / 1" | bc) && \
test_lines=$(echo "$total_lines - $train_lines" | bc) && \
head -n $train_lines shuffled_all_languages_text.txt > train.txt && \
tail -n $test_lines shuffled_all_languages_text.txt > test.txt

# Verify the split
!echo "Train set: $(wc -l < train.txt) lines"
!echo "Test set: $(wc -l < test.txt) lines"

Train set:   158942 lines
Test set:    39736 lines


In [83]:
# Training the model
!fastText-0.9.2/fasttext supervised -input train.txt -output langdetect -dim 16

Read 1M words
Number of words:  402140
Number of labels: 11
Progress: 100.0% words/sec/thread: 1531878 lr:  0.000000 avg.loss:  0.160834 ETA:   0h 0m 0s


In [84]:
# Testing the model
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	39736
P@1	0.96
R@1	0.96


# Training the Improved Model on the Enhanced Data

In [85]:
# Retraining the model with all improvements applied
import subprocess

# Command to run FastText with all the improvements applied
command = [
    './fastText-0.9.2/fasttext', 'supervised',
    '-input', 'train.txt',        # Training data
    '-output', 'langdetect',      # Output model name prefix
    '-dim', '16',                 # Set vector dimension to 16
    '-epoch', '15',               # Increase the number of epochs to 15
    '-lr', '1.0',                 # Increase the learning rate to 1.0
    '-loss', 'hs',                # Use hierarchical softmax loss function
    '-minn', '2', '-maxn', '4',   # Use character n-grams from 2 to 4 characters (subword features)
    '-wordNgrams', '2'            # Include word bigrams (word n-grams of length 2)
]

# Run the command using subprocess
subprocess.run(command)

Read 1M words
Number of words:  402140
Number of labels: 11
Progress: 100.0% words/sec/thread:  890205 lr:  0.000000 avg.loss:  0.030108 ETA:   0h 0m 0s


CompletedProcess(args=['./fastText-0.9.2/fasttext', 'supervised', '-input', 'train.txt', '-output', 'langdetect', '-dim', '16', '-epoch', '15', '-lr', '1.0', '-loss', 'hs', '-minn', '2', '-maxn', '4', '-wordNgrams', '2'], returncode=0)

In [86]:
# Testing the improved model on the test set
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	39736
P@1	0.982
R@1	0.982


# Adding Tatoeba Data

In [87]:
# Downloading files from Tatoeba
!wget http://downloads.tatoeba.org/exports/sentences.tar.bz2
!bunzip2 sentences.tar.bz2
!tar xvf sentences.tar

URL transformed to HTTPS due to an HSTS policy
--2024-10-11 14:39:03--  https://downloads.tatoeba.org/exports/sentences.tar.bz2
Resolving downloads.tatoeba.org (downloads.tatoeba.org)... 94.130.77.194
Connecting to downloads.tatoeba.org (downloads.tatoeba.org)|94.130.77.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 195102325 (186M) [application/octet-stream]
Saving to: 'sentences.tar.bz2'


2024-10-11 14:39:45 (4.63 MB/s) - 'sentences.tar.bz2' saved [195102325/195102325]

bunzip2: Output file sentences.tar already exists.
x sentences.csv


In [88]:
# Preparing the data for fastText
!awk -F"\t" '{print "__label__"$2" "$3}' sentences.csv | shuf > all.txt
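
The `awk` one-liner above reads tab-separated rows of the form id, language code, sentence and emits `__label__<code> <sentence>`. A Python sketch of the same per-row conversion is given below; `tatoeba_to_fasttext` is an illustrative name and the sample row is made up for the example.

```python
def tatoeba_to_fasttext(tsv_line):
    """Convert 'id<TAB>lang<TAB>text' into '__label__lang text'."""
    # Split on the first two tabs only, so tabs inside the sentence survive.
    fields = tsv_line.rstrip("\n").split("\t", 2)
    if len(fields) != 3:
        return None  # malformed row
    _, lang, text = fields
    return f"__label__{lang} {text}"


converted = tatoeba_to_fasttext("42\teng\tLet's try something.\n")
print(converted)
```

Note that the labels produced this way use Tatoeba's three-letter codes (such as `eng`), while the earlier API and OSCAR cells wrote two-letter codes (such as `en`).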

In [89]:
# Append the Tatoeba data to the existing all_languages_text.txt
!cat all.txt >> all_languages_text.txt

In [90]:
# Shuffle the data with the shuf utility before training the model
!shuf all_languages_text.txt > shuffled_all_languages_text.txt

In [91]:
# Splitting the data into training and testing sets
# Get the total number of lines and split the dataset
!total_lines=$(wc -l < shuffled_all_languages_text.txt) && \
train_lines=$(echo "$total_lines * 0.8 / 1" | bc) && \
test_lines=$(echo "$total_lines - $train_lines" | bc) && \
head -n $train_lines shuffled_all_languages_text.txt > train.txt && \
tail -n $test_lines shuffled_all_languages_text.txt > test.txt

# Verify the split
!echo "Train set: $(wc -l < train.txt) lines"
!echo "Test set: $(wc -l < test.txt) lines"

Train set:  10003348 lines
Test set:  2500838 lines


In [92]:
# Training the model
!fastText-0.9.2/fasttext supervised -input train.txt -output langdetect -dim 16

Read 82M words
Number of words:  4244081
Number of labels: 429
Progress: 100.0% words/sec/thread:  246743 lr:  0.000000 avg.loss:  0.195939 ETA:   0h 0m 0s


In [93]:
# Testing the model
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	2500834
P@1	0.946
R@1	0.946


In [94]:
# Retraining the model with all improvements applied
import subprocess

# Command to run FastText with all the improvements applied
command = [
    './fastText-0.9.2/fasttext', 'supervised',
    '-input', 'train.txt',        # Training data
    '-output', 'langdetect',      # Output model name prefix
    '-dim', '16',                 # Set vector dimension to 16
    '-epoch', '15',               # Increase the number of epochs to 15
    '-lr', '1.0',                 # Increase the learning rate to 1.0
    '-loss', 'hs',                # Use hierarchical softmax loss function
    '-minn', '2', '-maxn', '4',   # Use character n-grams from 2 to 4 characters (subword features)
    '-wordNgrams', '2'            # Include word bigrams (word n-grams of length 2)
]

# Run the command using subprocess
subprocess.run(command)

Read 82M words
Number of words:  4244081
Number of labels: 429
Progress: 100.0% words/sec/thread: 1078336 lr:  0.000000 avg.loss:  0.066381 ETA:   0h 0m 0s


CompletedProcess(args=['./fastText-0.9.2/fasttext', 'supervised', '-input', 'train.txt', '-output', 'langdetect', '-dim', '16', '-epoch', '15', '-lr', '1.0', '-loss', 'hs', '-minn', '2', '-maxn', '4', '-wordNgrams', '2'], returncode=0)

In [95]:
# Testing the improved model on the test set
!fastText-0.9.2/fasttext test langdetect.bin test.txt

N	2500834
P@1	0.972
R@1	0.972


# Comparing results with official model

In [96]:
# Downloading the Pre-trained Models
# Download the larger model
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

# Download the compressed model
!wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz

--2024-10-11 14:50:01--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.158.20.111, 108.158.20.43, 108.158.20.21, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.158.20.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 131266198 (125M) [application/octet-stream]
Saving to: 'lid.176.bin.1'


2024-10-11 14:50:14 (10.3 MB/s) - 'lid.176.bin.1' saved [131266198/131266198]

--2024-10-11 14:50:14--  https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 108.158.20.111, 108.158.20.43, 108.158.20.21, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|108.158.20.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 938013 (916K) [binary/octet-stream]
Saving to: 'lid.176.ftz.1'


2024-10-11 14:50:15 (12.5 MB/s) - 'lid.176.ftz.1' saved [938013/93801

In [97]:
# Testing the Pre-trained Models
# Testing the larger model
!fastText-0.9.2/fasttext test lid.176.bin test.txt

N	57651
P@1	0.915
R@1	0.915


In [98]:
# Testing the compressed model
!fastText-0.9.2/fasttext test lid.176.ftz test.txt

N	57651
P@1	0.871
R@1	0.871
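
N drops from 2,500,834 to 57,651 in these two runs. A likely reason, stated here as an assumption that was not verified against this run, is that the Tatoeba lines carry three-letter codes such as `__label__eng`, while the pre-trained models use mostly two-letter codes such as `__label__en`, so lines with labels unknown to the model are not counted. The sketch below shows one way to normalize labels before comparing; the mapping covers only the eleven languages used earlier and `normalize_label` is a hypothetical helper.

```python
# ISO 639-3 codes (as used by Tatoeba) mapped to the two-letter codes
# used in the earlier cells. Only a small illustrative subset.
ISO3_TO_ISO1 = {
    "eng": "en", "cmn": "zh", "spa": "es", "ara": "ar", "fra": "fr",
    "rus": "ru", "por": "pt", "deu": "de", "jpn": "ja", "hin": "hi",
    "kaz": "kk",
}


def normalize_label(line, mapping=ISO3_TO_ISO1):
    """Rewrite a three-letter label to its two-letter form when it is mapped."""
    label, sep, text = line.partition(" ")
    code = label[len("__label__"):] if label.startswith("__label__") else None
    if code in mapping:
        label = f"__label__{mapping[code]}"
    return f"{label}{sep}{text}"


normalized = normalize_label("__label__eng Hello there")
print(normalized)
```

Applying the same normalization before training would also avoid having both `en` and `eng` as separate labels in the combined dataset.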
