# Kazakh CEFR Exploration

This notebook walks through the core components of the Kazakh↔Russian CEFR project: data preparation, alignment diagnostics, text-level prediction, and a tabular sentence classifier.

In [1]:
import pathlib
import platform
import torch

PROJECT_ROOT = pathlib.Path.cwd()
print(f"Project root: {PROJECT_ROOT}")
print(f"Python version: {platform.python_version()}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Project root: /Users/zhantore/Documents/cefr-classification-kk
Python version: 3.10.19
PyTorch version: 2.7.1
CUDA available: False


## Prepare Shared Resources

We cache the parallel KazParC slice and the derived silver labels. The helper functions regenerate artifacts only when they are missing.
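
The "regenerate only when missing" pattern can be factored into one small helper. This is a minimal sketch: `ensure_artifact` and `fake_build` are hypothetical names for illustration and are not part of the `cefr` package.

```python
from pathlib import Path
from typing import Callable
import tempfile


def ensure_artifact(path: Path, build: Callable[[Path], None]) -> Path:
    """Run `build` only when `path` does not exist yet, then return the path."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        build(path)
    return path


calls = []


def fake_build(target: Path) -> None:
    # Hypothetical builder standing in for an expensive download or alignment step.
    calls.append(target)
    target.write_text("kaz,rus\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "parallel" / "demo.csv"
    ensure_artifact(target, fake_build)  # file is missing, so the builder runs
    ensure_artifact(target, fake_build)  # file exists, so the builder is skipped
    build_count = len(calls)

print(f"Builder ran {build_count} time(s)")
```

Passing the builder as a callable keeps the existence check in one place, so each artifact can supply its own builder.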

In [2]:
from pathlib import Path
import importlib

from cefr import load_config
import cefr.alignment as alignment_mod
from cefr.data.download import save_kz_ru
from cefr.data.silver import build_silver_labels

alignment_mod = importlib.reload(alignment_mod)
EmbeddingAligner = alignment_mod.EmbeddingAligner

cfg = load_config()

PARALLEL_PATH = Path('data/parallel/kazparc_kz_ru.csv')
if not PARALLEL_PATH.exists():
    PARALLEL_PATH = Path(
        save_kz_ru(
            split='train[:2000]',
            out_dir='data/parallel',
            out_name='kazparc_kz_ru.csv',
        )
    )
else:
    print(f"Using existing parallel corpus: {PARALLEL_PATH}")

custom_aligner = EmbeddingAligner(cfg.pipeline.alignment)

SILVER_PATH = Path('data/labels/silver_word_labels.csv')
if not SILVER_PATH.exists():
    SILVER_PATH = build_silver_labels(
        parallel_csv=PARALLEL_PATH,
        rus_cefr=cfg.pipeline.russian_cefr_path,
        out_csv=SILVER_PATH,
        aligner=custom_aligner,
    )
print(f"Silver labels: {SILVER_PATH}")



Using existing parallel corpus: data/parallel/kazparc_kz_ru.csv


AttributeError: 'dict' object has no attribute 'alignment'
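
The traceback shows that `cfg.pipeline` is a plain `dict`, so attribute access such as `cfg.pipeline.alignment` fails. One possible workaround, sketched below, is to convert nested dictionaries to attribute-style namespaces before use. `to_namespace` is a hypothetical helper, and the keys and values in `raw_pipeline` are made up for illustration; what `load_config` actually returns is not shown in this notebook.

```python
from types import SimpleNamespace
from typing import Any


def to_namespace(value: Any) -> Any:
    """Recursively convert dicts (with string keys) into SimpleNamespace objects."""
    if isinstance(value, dict):
        return SimpleNamespace(**{key: to_namespace(item) for key, item in value.items()})
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


# Hypothetical config fragment; the real layout and values may differ.
raw_pipeline = {"alignment": {"layer": 8, "threshold": 0.001}}
pipeline = to_namespace(raw_pipeline)

print(pipeline.alignment.layer, pipeline.alignment.threshold)
```

The alternative is to keep the dict and index it directly (`cfg.pipeline["alignment"]`). Which one fits depends on how the rest of the `cefr` package reads its configuration.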

## Silver Label Overview

Inspect the automatically generated token-level labels and basic statistics.

In [None]:
import pandas as pd

silver_df = pd.read_csv(SILVER_PATH)
print(f"Rows: {len(silver_df):,}")
silver_df.head()

Rows: 18,485


Unnamed: 0,kaz_item,rus_item,cefr,kaz_sent,rus_sent
0,кезінде,При,B1,Қауіпті қалдықтар трансшекаралық тасымалдау ке...,При трансграничной перевозке опасные отходы до...
1,трансшекаралық,трансграничной,Unknown,Қауіпті қалдықтар трансшекаралық тасымалдау ке...,При трансграничной перевозке опасные отходы до...
2,тасымалдау,перевозке,Unknown,Қауіпті қалдықтар трансшекаралық тасымалдау ке...,При трансграничной перевозке опасные отходы до...
3,Қауіпті,опасные,B2,Қауіпті қалдықтар трансшекаралық тасымалдау ке...,При трансграничной перевозке опасные отходы до...
4,қалдықтар,отходы,Unknown,Қауіпті қалдықтар трансшекаралық тасымалдау ке...,При трансграничной перевозке опасные отходы до...


In [None]:
silver_df['cefr'].value_counts().sort_index()

cefr
A1         2028
A2         2104
B1         2795
B2         2039
C1         1618
C2         1209
Unknown    6692
Name: count, dtype: int64
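
The share of tokens that received no CEFR level follows from the counts above. The sketch below re-enters the printed counts by hand instead of reading `silver_df`, so it runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"A1": 2028, "A2": 2104, "B1": 2795, "B2": 2039, "C1": 1618, "C2": 1209, "Unknown": 6692},
    name="count",
)

total = int(counts.sum())
unknown_share = counts["Unknown"] / total
labeled_share = 1 - unknown_share

print(f"Total tokens: {total:,}")
print(f"Unknown share: {unknown_share:.3f}")
print(f"Labeled share: {labeled_share:.3f}")
```

Roughly 36% of the aligned tokens are `Unknown`, which is worth keeping in mind when reading any statistic computed over the labeled subset only.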

## Alignment Diagnostics

Grab a sample sentence pair and review the informative alignments along with probability-based heuristics.

In [11]:
from cefr.alignment import (
    fraction_above_threshold,
    informative_link_share,
    is_informative,
)

sample = silver_df.sample(random_state=42).iloc[0]
kz_words = tuple(sample['kaz_sent'].split())
ru_words = tuple(sample['rus_sent'].split())

details = custom_aligner.diagnostics(
    kz_words,
    ru_words,
    layer=cfg.pipeline.alignment.layer,
    threshold=cfg.pipeline.alignment.threshold,
)
details_matrix = details.to_dataframe(kz_words, ru_words)

true_links = (
    details_matrix[details_matrix['is_link']]
    .copy()
    .assign(
        is_informative_pair=lambda df: df['kaz_token'].apply(is_informative)
        & df['rus_token'].apply(is_informative)
    )
    .sort_values(['kaz_index', 'rus_index'])
    .reset_index(drop=True)
)
true_links.drop(columns=['is_link', 'is_informative_pair', 'p_ru_given_kz', 'p_kz_given_ru'])


Unnamed: 0,kaz_index,kaz_token,rus_index,rus_token,joint_prob
0,0,Айнымас,2,верные,1.0
1,1,принциптерге,3,принципы,1.0
2,3,және,4,и,1.0
3,4,өзіңнің,7,своей,1.0
4,5,жеке,8,личной,1.0
5,6,миссияңнан,9,миссии,1.0
6,7,ауытқымау,5,сосредоточенность,1.0
7,9,таңдау,16,выбор,1.0
8,10,жасауға,14,сделать,1.0
9,11,көмектесетін,13,помогающей,1.0


In [None]:
share_correct = informative_link_share(details, kz_words, ru_words)
prob_threshold = 0.8
population_size = len(silver_df)
sample_size = min(200, population_size)
if sample_size == 0:
    sample_records = []
else:
    sample_records = silver_df.sample(n=sample_size, random_state=1)[['kaz_sent', 'rus_sent']].to_dict('records')

coverage = fraction_above_threshold(
    sample_records,
    custom_aligner,
    layer=cfg.pipeline.alignment.layer,
    thresh=cfg.pipeline.alignment.threshold,
    prob_threshold=prob_threshold,
)

print(f"Informative link share: {share_correct:.3f}")
print(f"Fraction above {prob_threshold:.2f} threshold on sample: {coverage:.3f}")

# TODO: Generate the table after predictions


Informative link share: 1.000
Fraction above 0.80 threshold on sample: 0.929


## Tabular CEFR Classifier

Load the sklearn pipeline trained on `data/text/kazparc_kz_ru_cefr_estimated.csv` and review sample predictions.

In [None]:
pd.read_csv('data/text/cefr_leveled_texts.csv')

# TODO: Find sources to justify the feature extraction process


# score = (
#     0.13 * nf["nf_avg_sent_len"]
#     + 0.82 * nf["nf_avg_word_len"]
#     + 0.01 * nf["nf_ttr"]
#     + 0.20 * nf["nf_long_ratio"]
#     + 0.05 * nf["nf_char_len"]
#     + 0.05 * nf["nf_align_diff"]
# ).astype(float)

# Source
# The texts are taken from free resources found online including: The British Council, ESLFast, and the cnn-dailymail dataset.
# Texts that were found without a label were labelled using Text Inspector.


Unnamed: 0,text,label
0,Hi!\nI've been meaning to write for ages and f...,B2
1,﻿It was not so much how hard people found the ...,B2
2,Keith recently came back from a trip to Chicag...,B2
3,"The Griffith Observatory is a planetarium, and...",B2
4,-LRB- The Hollywood Reporter -RRB- It's offici...,B2
...,...,...
1489,Light propagating in the vicinity of astrophys...,C2
1490,Future of dentistry has become one of the most...,C2
1491,﻿The forests – and suburbs – of Europe are ech...,C2
1492,Hedge funds are turning bullish on oil once ag...,C2
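
The commented-out `score` expression in the cell above is a fixed-weight linear combination of normalized features. A runnable sketch follows. The weights are copied from the comment, while `difficulty_score` and the feature values in `nf` are hypothetical placeholders. How the `nf_*` columns are normalized is not shown in the notebook.

```python
import pandas as pd

# Weights copied from the commented-out formula.
WEIGHTS = {
    "nf_avg_sent_len": 0.13,
    "nf_avg_word_len": 0.82,
    "nf_ttr": 0.01,
    "nf_long_ratio": 0.20,
    "nf_char_len": 0.05,
    "nf_align_diff": 0.05,
}


def difficulty_score(nf: pd.DataFrame) -> pd.Series:
    """Weighted sum of the normalized feature columns, one score per row."""
    weights = pd.Series(WEIGHTS)
    return nf[weights.index].mul(weights, axis=1).sum(axis=1).astype(float)


# Placeholder rows: every feature at 0.0, then every feature at 1.0.
nf = pd.DataFrame([
    dict.fromkeys(WEIGHTS, 0.0),
    dict.fromkeys(WEIGHTS, 1.0),
])
scores = difficulty_score(nf)
print(scores)
```

The weights sum to 1.26, not 1, so a row with every feature at 1.0 scores 1.26. Any thresholds that map the score to CEFR levels need to account for that range.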


In [None]:
from pathlib import Path
import json
from joblib import load
import pandas as pd
from cefr.training import TabularTrainingConfig, train_tabular_model

MODEL_DIR = Path('models/kazparc_tabular_cefr')
MODEL_PATH = MODEL_DIR / 'model.joblib'
METRICS_PATH = MODEL_DIR / 'metrics.json'
DATA_PATH = Path('data/text/kazparc_kz_ru_cefr_estimated.csv')

if not DATA_PATH.exists():
    print(f"Tabular training dataset not found at {DATA_PATH}. Skipping tabular classifier section.")
    tabular_result = None
else:
    if not MODEL_PATH.exists():
        tabular_cfg = TabularTrainingConfig(train_path=DATA_PATH, output_dir=MODEL_DIR)
        train_tabular_model(tabular_cfg)

    classifier = load(MODEL_PATH)
    metrics = json.loads(METRICS_PATH.read_text()) if METRICS_PATH.exists() else None
    kazparc_df = pd.read_csv(DATA_PATH)
    feature_columns = [
        col for col in kazparc_df.columns if col not in {'predicted_cefr', 'predicted_cefr_int'}
    ]

    if metrics:
        print(f"Validation accuracy: {metrics['accuracy']:.3f}")

    sample = kazparc_df.sample(n=5, random_state=0)
    proba = classifier.predict_proba(sample[feature_columns])
    preds = classifier.predict(sample[feature_columns])
    confidence = proba.max(axis=1)

    tabular_result = pd.DataFrame({
        'kaz': sample['kaz'].values,
        'rus': sample['rus'].values,
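        # Reference label comes from the dataset's estimated `predicted_cefr` column, not a gold annotation.
        'true_cefr': sample['predicted_cefr'].values,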
        'pred_cefr': preds,
        'confidence': confidence,
    })

# A bare expression inside an `if` block is not rendered by Jupyter, so display it explicitly.
if tabular_result is not None:
    display(tabular_result)


Validation accuracy: 0.860


Unnamed: 0,kaz,rus,true_cefr,pred_cefr,confidence
0,Маған бұдан былай жақсартудың қажеті жоқ.,Мне больше не нужно совершенствоваться.,A2,A2,0.798754
1,Бірақ басшылардың бірде-бірі жұмыс орнында жоқ.,Но ни одного из руководителей нет на рабочем м...,A2,A2,0.778494
2,Мұндай жағдайға әлемдік тарихтағы екі қасіретт...,Первые подобные случаи являются следствием дву...,C1,C1,0.728524
3,Бұл әрекеті қазір әлеуметтік желіде желдей есі...,Эта акция сейчас распространяется в социальных...,B2,B2,0.614697
4,Бірқатар өңірлер оның зардабын қатты тартуда.,Целый ряд регионов испытывает в ней острую пот...,B2,B2,0.760007
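
The `confidence` column is the largest class probability in each row. A simple follow-up is to flag predictions that fall below a chosen floor for manual review. The sketch below uses a made-up probability matrix in place of `classifier.predict_proba`, and `CONFIDENCE_FLOOR` is a hypothetical cutoff, not a value used by the project.

```python
import numpy as np

# Made-up class order and probabilities standing in for classifier output.
classes = np.array(["A2", "B2", "C1"])
proba = np.array([
    [0.80, 0.15, 0.05],
    [0.30, 0.61, 0.09],
    [0.10, 0.17, 0.73],
])

confidence = proba.max(axis=1)          # highest class probability per row
preds = classes[proba.argmax(axis=1)]   # label of that highest-probability class

CONFIDENCE_FLOOR = 0.7  # hypothetical cutoff for manual review
needs_review = confidence < CONFIDENCE_FLOOR

for label, conf, flag in zip(preds, confidence, needs_review):
    print(f"{label}: confidence={conf:.2f} review={bool(flag)}")
```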



## Train CEFR Text Classifier

With the bilingual corpus prepared, we can train the CEFR text classifier that combines TF-IDF features with the linguistic statistics computed earlier. The helper below wraps the `cefr.training.text_classification` module so training can run directly from this notebook.
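
For orientation, combining TF-IDF features with numeric statistics in scikit-learn usually looks like the sketch below. It is not the code in `cefr.training.text_classification`: the toy data, the single `avg_word_len` feature, and the choice of `LogisticRegression` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only.
toy = pd.DataFrame({
    "text_en": [
        "I like my cat.",
        "We go to school by bus.",
        "Epistemological considerations notwithstanding, the hypothesis remains contentious.",
        "Macroeconomic volatility exacerbates structural unemployment considerably.",
    ],
    "label": ["A1", "A1", "C1", "C1"],
})
# One simple linguistic statistic: mean word length per text.
toy["avg_word_len"] = toy["text_en"].str.split().apply(
    lambda words: sum(len(word) for word in words) / len(words)
)

features = ColumnTransformer([
    # A bare column name hands the vectorizer the 1-D text input it expects.
    ("tfidf", TfidfVectorizer(), "text_en"),
    # A list of names hands the scaler a 2-D numeric block.
    ("stats", StandardScaler(), ["avg_word_len"]),
])
sketch_pipeline = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = toy[["text_en", "avg_word_len"]]
sketch_pipeline.fit(X, toy["label"])
sketch_preds = sketch_pipeline.predict(X)
print(list(sketch_preds))
```

Keeping both feature families inside one `Pipeline` means the fitted vocabulary and scaling statistics are saved with the model and reused at inference time.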


In [None]:

from cefr.training.text_classification import parse_args, train_text_classifier

TEXT_CLASSIFIER_DIR = Path("models/en_ru_text_classifier")
TEXT_CLASSIFIER_DIR.mkdir(parents=True, exist_ok=True)

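# NOTE: OUTPUT_CSV is not defined earlier in this notebook; set it to the prepared dataset CSV before running.
classifier_config = parse_args([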
    "--dataset-path", str(OUTPUT_CSV),
    "--output-dir", str(TEXT_CLASSIFIER_DIR),
    "--test-size", "0.2",
    "--random-state", "42",
])

training_result = train_text_classifier(classifier_config)
training_result



### Inspect Metrics

The training routine stores metrics alongside the fitted pipeline. Loading the JSON report provides overall accuracy, per-level precision/recall, and the confusion matrix so we can inspect model performance without leaving the notebook.


In [None]:

import json

metrics_path = Path(training_result["metrics_path"])
with metrics_path.open(encoding="utf-8") as handle:
    metrics = json.load(handle)

display({
    "accuracy": metrics.get("accuracy"),
    "labels": metrics.get("labels"),
})
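
Accuracy, per-level precision/recall, and a confusion matrix can be recomputed from true and predicted labels with `sklearn.metrics`. The labels below are made up, and the layout of the project's JSON report is not assumed beyond the `accuracy` and `labels` keys read above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Made-up labels, for illustration only.
labels = ["A1", "B1", "C1"]
y_true = ["A1", "A1", "B1", "B1", "C1", "C1"]
y_pred = ["A1", "B1", "B1", "B1", "C1", "A1"]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true level, columns: predicted level
precision, recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)

print(f"accuracy={acc:.3f}")
for level, p, r in zip(labels, precision, recall):
    print(f"{level}: precision={p:.2f} recall={r:.2f}")
print(cm)
```

For an ordinal scale such as CEFR, the off-diagonal cells show whether errors land on adjacent levels or further away, which a single accuracy figure hides.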



### Quick Inference Helper

Load the persisted classifier and run predictions on fresh samples. Provide the English text and, optionally, its Russian translation; the model outputs the predicted CEFR level together with the class probabilities.


In [None]:

import pandas as pd
from joblib import load

model_path = Path(training_result["model_path"])
text_classifier = load(model_path)

sample_rows = pd.DataFrame([
    {
        "text_en": "The industrial revolution transformed many European countries in the 19th century.",
        "text_ru": "Промышленная революция преобразила многие европейские страны в XIX веке.",
    },
])

predicted_levels = text_classifier.predict(sample_rows)
probabilities = text_classifier.predict_proba(sample_rows)
list(zip(predicted_levels, probabilities))
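
The zipped output pairs each prediction with a bare probability array, which is hard to read without the class order. Fitted scikit-learn classifiers and pipelines expose that order as `classes_`. The sketch below attaches labels to probabilities, using a hypothetical helper `label_probabilities` and made-up numbers in place of the real model output.

```python
import numpy as np


def label_probabilities(classes, probabilities):
    """Turn each probability row into a {label: probability} dict."""
    return [
        {str(label): float(p) for label, p in zip(classes, row)}
        for row in np.asarray(probabilities)
    ]


# Made-up values; with a fitted model these would come from
# `text_classifier.classes_` and `text_classifier.predict_proba(sample_rows)`.
classes = ["A1", "A2", "B1", "B2", "C1", "C2"]
probabilities = [[0.02, 0.05, 0.18, 0.45, 0.25, 0.05]]

readable = label_probabilities(classes, probabilities)
best = max(readable[0], key=readable[0].get)
print(best, readable[0])
```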
