# Kazakh CEFR Exploration

This notebook walks through the core components of the Kazakh↔Russian CEFR project: data preparation, alignment diagnostics, text-level prediction, and a tabular sentence classifier.

In [20]:
import pathlib
import platform
import torch

PROJECT_ROOT = pathlib.Path.cwd()
print(f"Project root: {PROJECT_ROOT}")
print(f"Python version: {platform.python_version()}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Project root: /Users/galymzhantore/cefr-classification-kk
Python version: 3.10.18
PyTorch version: 2.5.1
CUDA available: False


## Prepare Shared Resources

We cache the parallel KazParC slice and the derived silver labels. The helper functions regenerate artifacts only when they are missing.

In [21]:
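from pathlib import Path

from src.data.download_parallel import save_kz_ru
from src.pipeline.build_silver_labels import main as build_silver_labels
from cefr.alignment import EmbeddingAligner

# Location of the silver labels, as reported in the output below; later cells read SILVER_PATH.
SILVER_PATH = Path('data/labels/silver_word_labels.csv')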

Using existing parallel corpus: data/parallel/kazparc_kz_ru.csv
Silver labels: data/labels/silver_word_labels.csv
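
The regenerate-only-when-missing behaviour described above can be sketched as a small guard around a builder function. `ensure_artifact` and `build_demo_csv` are hypothetical names used for illustration only; the project's `save_kz_ru` and `build_silver_labels` helpers may be organised differently.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_artifact(path: Path, builder: Callable[[Path], None]) -> bool:
    """Run `builder(path)` only if `path` is missing; return True if it ran."""
    if path.exists():
        print(f"Using existing artifact: {path}")
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    builder(path)
    return True


def build_demo_csv(path: Path) -> None:
    # Hypothetical stand-in for an expensive download or labelling step.
    path.write_text("kaz_item,rus_item,cefr\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "labels" / "demo.csv"
    first_run = ensure_artifact(demo_path, build_demo_csv)   # file is built
    second_run = ensure_artifact(demo_path, build_demo_csv)  # file is reused
```

Returning a flag makes it easy to log whether a cell actually rebuilt anything, which helps when a stale artifact is suspected.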


## Silver Label Overview

Inspect the automatically generated token-level labels and basic statistics.

In [22]:
import pandas as pd

silver_df = pd.read_csv(SILVER_PATH)
print(f"Rows: {len(silver_df):,}")
silver_df.head()

Rows: 18,485


| | kaz_item | rus_item | cefr | kaz_sent | rus_sent |
|---|---|---|---|---|---|
| 0 | кезінде | При | B1 | Қауіпті қалдықтар трансшекаралық тасымалдау ке... | При трансграничной перевозке опасные отходы до... |
| 1 | трансшекаралық | трансграничной | Unknown | Қауіпті қалдықтар трансшекаралық тасымалдау ке... | При трансграничной перевозке опасные отходы до... |
| 2 | тасымалдау | перевозке | Unknown | Қауіпті қалдықтар трансшекаралық тасымалдау ке... | При трансграничной перевозке опасные отходы до... |
| 3 | Қауіпті | опасные | B2 | Қауіпті қалдықтар трансшекаралық тасымалдау ке... | При трансграничной перевозке опасные отходы до... |
| 4 | қалдықтар | отходы | Unknown | Қауіпті қалдықтар трансшекаралық тасымалдау ке... | При трансграничной перевозке опасные отходы до... |


In [23]:
silver_df['cefr'].value_counts().sort_index()

cefr
A1         2028
A2         2104
B1         2795
B2         2039
C1         1618
C2         1209
Unknown    6692
Name: count, dtype: int64
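
The counts above imply that a sizeable share of tokens has no CEFR level. A minimal sketch of that arithmetic follows, with the counts copied by hand from the output into a plain dictionary so the snippet runs on its own. With the dataframe loaded, `silver_df['cefr'].value_counts(normalize=True)` should give the same proportions.

```python
# Counts copied from the value_counts() output above.
counts = {
    "A1": 2028,
    "A2": 2104,
    "B1": 2795,
    "B2": 2039,
    "C1": 1618,
    "C2": 1209,
    "Unknown": 6692,
}

total = sum(counts.values())
labeled = total - counts["Unknown"]
unknown_share = counts["Unknown"] / total

# Print each level with its count and its share of all rows.
for level, n in counts.items():
    print(f"{level:<8}{n:>6}  {n / total:6.1%}")

print(f"Labeled tokens: {labeled:,} of {total:,}")
print(f"Unknown share: {unknown_share:.1%}")
```

The total matches the 18,485 rows reported earlier, and roughly 36% of the rows carry the `Unknown` label.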

## Alignment Diagnostics

Grab a sample sentence pair and review the informative alignments along with probability-based heuristics.

In [24]:
from cefr.alignment import (
    EmbeddingAligner,
    AlignmentDiagnostics,
    fraction_above_threshold,
    informative_link_share,
    is_informative,
)

sample = silver_df.sample(random_state=0).iloc[0]
kz_words = tuple(sample['kaz_sent'].split())
ru_words = tuple(sample['rus_sent'].split())

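# custom_aligner is expected to be an EmbeddingAligner instance created earlier;
# its construction is not shown in this notebook.
details = custom_aligner.diagnostics(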
    kz_words,
    ru_words,
    layer=8,
    threshold=0.05,
)
details_matrix = details.to_dataframe(kz_words, ru_words)

informative_matrix = details_matrix[details_matrix['rus_token'].apply(is_informative)].copy()
informative_matrix.head()

| | kaz_index | kaz_token | rus_index | rus_token | p_ru_given_kz | p_kz_given_ru | joint_prob | is_link |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Мен | 0 | И | 3.587796e-05 | 1.0 | 3.587796e-05 | False |
| 1 | 0 | Мен | 1 | я | 0.9999641 | 1.0 | 0.9999641 | True |
| 2 | 0 | Мен | 2 | подумал | 3.228321e-28 | 1.193797e-13 | 3.228321e-28 | False |
| 3 | 0 | Мен | 3 | об | 6.146913e-39 | 3.557704e-05 | 6.146913e-39 | False |
| 4 | 0 | Мен | 4 | этом, | 4.209424e-32 | 2.08569e-07 | 4.209424e-32 | False |
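
The columns `p_ru_given_kz` and `p_kz_given_ru` suggest a bidirectional softmax over a token similarity matrix, as used in embedding-based word aligners. The sketch below illustrates that general technique under stated assumptions; it is not the implementation of `EmbeddingAligner`, which this notebook does not show. The toy similarity scores are made up, and `bidirectional_links` is a hypothetical name. Taking `joint_prob` as the smaller of the two conditionals matches the five rows displayed above, but that is an inference from the output only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_links(sim, threshold=0.05):
    """sim[i][j] is the similarity of Kazakh token i and Russian token j."""
    n_kz, n_ru = len(sim), len(sim[0])
    # Row-wise softmax: distribution over Russian tokens for each Kazakh token.
    ru_given_kz = [softmax(row) for row in sim]
    # Column-wise softmax: distribution over Kazakh tokens for each Russian token.
    kz_given_ru = [softmax([sim[i][j] for i in range(n_kz)]) for j in range(n_ru)]

    records = []
    for i in range(n_kz):
        for j in range(n_ru):
            forward = ru_given_kz[i][j]
            backward = kz_given_ru[j][i]
            records.append({
                "kaz_index": i,
                "rus_index": j,
                "p_ru_given_kz": forward,
                "p_kz_given_ru": backward,
                # Assumption: joint score is the weaker of the two directions.
                "joint_prob": min(forward, backward),
                # A pair is linked only if both directions clear the threshold.
                "is_link": forward > threshold and backward > threshold,
            })
    return records


# Made-up similarities for 2 Kazakh tokens x 3 Russian tokens.
toy_sim = [
    [9.0, 1.0, 0.5],
    [0.8, 1.2, 8.5],
]
records = bidirectional_links(toy_sim)
links = [(r["kaz_index"], r["rus_index"]) for r in records if r["is_link"]]
print(links)
```

Requiring both directions to exceed the threshold keeps a link only when each token prefers the other, which is why a pair can have `p_kz_given_ru` equal to 1.0 and still not be linked.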


In [25]:
share_correct = informative_link_share(details, kz_words, ru_words)
prob_threshold = 0.3
sample_records = silver_df.sample(n=200, random_state=1)[['kaz_sent', 'rus_sent']].to_dict('records')
coverage = fraction_above_threshold(
    sample_records,
    custom_aligner,
    layer=8,
    thresh=0.05,
    prob_threshold=prob_threshold,
)
print(f"Informative link share: {share_correct:.3f}")
print(f"Fraction above {prob_threshold:.2f} threshold on sample: {coverage:.3f}")

Informative link share: 1.000
Fraction above 0.30 threshold on sample: 0.964
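
The notebook does not show what `fraction_above_threshold` counts internally. One plausible reading is the share of alignment links whose probability exceeds `prob_threshold`. A minimal sketch of that reading follows; `fraction_above` is a hypothetical helper and the probabilities are illustrative values, not project output.

```python
def fraction_above(probabilities, prob_threshold):
    """Share of values strictly greater than prob_threshold (0.0 if empty)."""
    probs = list(probabilities)
    if not probs:
        return 0.0
    return sum(p > prob_threshold for p in probs) / len(probs)


# Illustrative link probabilities; these are not taken from the corpus.
link_probs = [0.99, 0.82, 0.41, 0.12]
coverage_demo = fraction_above(link_probs, prob_threshold=0.3)
print(f"Fraction above 0.30: {coverage_demo:.3f}")
```

Under this reading, a value near 1.0 means that almost all retained links are high-probability ones on the sampled sentence pairs.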


## Text-Level Prediction Helper

Use the lightweight notebook wrapper to classify custom Kazakh text and inspect token alignments.

In [26]:
from app import predict_notebook_view, rows_to_dataframe

kazakh_example = "Ол кітап оқып жатыр"
russian_override = None  # provide manual translation if available

prediction = predict_notebook_view(
    kazakh_example,
    russian_text=russian_override,
    use_ensemble=False,
)

print(f"Predicted CEFR level: {prediction.cefr_level}")
print(f"Translation: {prediction.translation}")
rows_to_dataframe(prediction.rows)

Predicted CEFR level: A1
Translation: Он читает книгу


| | id_kazakh_word | id_russian_word | kazakh_word | russian_word | cefr_level |
|---|---|---|---|---|---|
| 0 | 0 | 0 | Ол | Он | A1 |
| 1 | 2 | 1 | оқып | читает | Unknown |
| 2 | 1 | 2 | кітап | книгу | Unknown |
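
How `predict_notebook_view` turns token-level labels into a single text-level label is not shown here. As a purely hypothetical illustration, one simple rule is to ignore `Unknown` tokens and report the highest remaining level. The name `aggregate_level` and the rule itself are assumptions, not the project's method.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def aggregate_level(token_levels, default="Unknown"):
    """Return the highest known CEFR level among the tokens, or `default`."""
    known = [level for level in token_levels if level in CEFR_ORDER]
    if not known:
        return default
    # Levels are ordinal, so rank them by their position in CEFR_ORDER.
    return max(known, key=CEFR_ORDER.index)


# Token labels from the table above: one A1 token and two Unknown tokens.
text_level = aggregate_level(["A1", "Unknown", "Unknown"])
print(text_level)
```

With only one labelled token, any rule that skips `Unknown` entries returns A1 here, so this example cannot distinguish between aggregation strategies.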


## Tabular CEFR Classifier

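Load the scikit-learn pipeline trained on `data/text/kazparc_kz_ru_cefr_estimated.csv`, training it first if no saved model exists, and review sample predictions. The `true_cefr` column in the preview is the file's `predicted_cefr` column, so it holds estimated labels rather than human annotations.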

In [27]:
from pathlib import Path
import json
from joblib import load
import pandas as pd
from cefr.training import TabularTrainingConfig, train_tabular_model

MODEL_DIR = Path('models/kazparc_tabular_cefr')
MODEL_PATH = MODEL_DIR / 'model.joblib'
METRICS_PATH = MODEL_DIR / 'metrics.json'
DATA_PATH = Path('data/text/kazparc_kz_ru_cefr_estimated.csv')

if not MODEL_PATH.exists():
    cfg = TabularTrainingConfig(train_path=DATA_PATH, output_dir=MODEL_DIR)
    train_tabular_model(cfg)

classifier = load(MODEL_PATH)
metrics = json.loads(METRICS_PATH.read_text()) if METRICS_PATH.exists() else None
kazparc_df = pd.read_csv(DATA_PATH)
feature_columns = [
    col for col in kazparc_df.columns if col not in {'predicted_cefr', 'predicted_cefr_int'}
]

if metrics:
    print(f"Validation accuracy: {metrics['accuracy']:.3f}")

sample = kazparc_df.sample(n=5, random_state=0)
proba = classifier.predict_proba(sample[feature_columns])
preds = classifier.predict(sample[feature_columns])
confidence = proba.max(axis=1)

pd.DataFrame({
    'kaz': sample['kaz'].values,
    'rus': sample['rus'].values,
    'true_cefr': sample['predicted_cefr'].values,
    'pred_cefr': preds,
    'confidence': confidence,
})

Validation accuracy: 0.860


| | kaz | rus | true_cefr | pred_cefr | confidence |
|---|---|---|---|---|---|
| 0 | Маған бұдан былай жақсартудың қажеті жоқ. | Мне больше не нужно совершенствоваться. | A2 | A2 | 0.798754 |
| 1 | Бірақ басшылардың бірде-бірі жұмыс орнында жоқ. | Но ни одного из руководителей нет на рабочем м... | A2 | A2 | 0.778494 |
| 2 | Мұндай жағдайға әлемдік тарихтағы екі қасіретт... | Первые подобные случаи являются следствием дву... | C1 | C1 | 0.728524 |
| 3 | Бұл әрекеті қазір әлеуметтік желіде желдей есі... | Эта акция сейчас распространяется в социальных... | B2 | B2 | 0.614697 |
| 4 | Бірқатар өңірлер оның зардабын қатты тартуда. | Целый ряд регионов испытывает в ней острую пот... | B2 | B2 | 0.760007 |


Change `random_state` or slice the dataframe differently to inspect additional examples.
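
Because `confidence` is the largest class probability for each row, it can also be used to flag predictions worth a manual look. The sketch below copies the five confidence values and predicted levels from the preview table. The 0.7 cut-off is an arbitrary value chosen for illustration, not a project setting.

```python
# Predicted levels and confidences copied from the preview table.
pred_cefr = ["A2", "A2", "C1", "B2", "B2"]
confidence = [0.798754, 0.778494, 0.728524, 0.614697, 0.760007]

REVIEW_THRESHOLD = 0.7  # arbitrary cut-off, for illustration only

# Collect (row index, predicted level, confidence) for low-confidence rows.
needs_review = [
    (index, level, score)
    for index, (level, score) in enumerate(zip(pred_cefr, confidence))
    if score < REVIEW_THRESHOLD
]
mean_confidence = sum(confidence) / len(confidence)

print(f"Mean confidence: {mean_confidence:.3f}")
print(f"Rows below {REVIEW_THRESHOLD}: {needs_review}")
```

Only row 3 falls below the cut-off in this sample. On the full dataframe the same filter can be written as a boolean mask over `proba.max(axis=1)`.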