# Project for 'Wikishop' with BERT

**Project Description**

The online store "Wikishop" is launching a new service: users can now edit and expand product descriptions, as in wiki communities. Customers can suggest their own edits and comment on changes made by others. The store needs a tool that detects toxic comments and sends them for moderation.

**Task**

Train a model to classify comments as toxic (negative) or non-toxic (positive). The dataset provides a toxicity label for each comment.
The model must reach an F1 score of at least 0.75.

In [None]:
!pip -q install pymorphy3 pymorphy3-dicts-ru
!pip -q install langdetect


In [None]:

import re

import nltk
import numpy as np
import pandas as pd
import torch

from langdetect import detect, DetectorFactory
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from tqdm import tqdm
from transformers import BertModel, BertTokenizer

DetectorFactory.seed = 42  # make langdetect results reproducible
wnl = WordNetLemmatizer()

In [42]:
# Language detection helper

def detect_lang_safe(s: str) -> str:
    # Treat missing values (NaN/None) as empty text
    s = "" if pd.isna(s) else str(s)
    # Very short or noisy strings often make langdetect fail, so mark them as 'unk'
    if len(s.strip()) < 5:
        return "unk"
    try:
        return detect(s)
    except Exception:  # detection failed, e.g. no usable features in the text
        return "unk"


# Text preprocessing

def clean_text_en(s: str) -> str:
    s = s.lower()
    s = re.sub(r"https?://\S+|www\.\S+", " ", s)            # URLs
    s = re.sub(r"<.*?>", " ", s)                             # HTML tags
    s = re.sub(r"[@#]\w+", " ", s)                           # @mentions / #hashtags
    s = re.sub(r"&[a-z]+;", " ", s)                          # HTML entities
    # Keep only Latin letters, digits, and basic punctuation
    s = re.sub(r"[^a-z0-9'\"!?.,:;()\[\]\-\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s


# POS-aware lemmatization (EN-only)

def penn2morphy(tag: str) -> str:
    # Map a Penn Treebank tag (by its two-letter prefix) to a WordNet POS code; default to noun
    m = {'NN':'n','JJ':'a','VB':'v','RB':'r'}
    return m.get(tag[:2], 'n')

def lemmatize_en(text: str) -> str:
    # Expects already-cleaned text: tokenize → POS-tag → lemmatize each token
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    lemmas = [wnl.lemmatize(w, penn2morphy(t)) for w, t in tagged]
    return " ".join(lemmas)

# [CLS] embeddings from a frozen BERT
# (uses the globals `tokenizer`, `bert_model`, and `device` defined in the BERT section)

def get_bert_embeddings(texts, batch_size=8, max_length=256):
    embs = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch_texts = texts[i:i+batch_size]
        enc = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
        input_ids = enc['input_ids'].to(device)
        attention_mask = enc['attention_mask'].to(device)
        with torch.no_grad():
            outputs = bert_model(input_ids, attention_mask=attention_mask)
        # Take the hidden state of the first token ([CLS]) of each text
        batch_embs = outputs.last_hidden_state[:, 0, :].detach().cpu().numpy()
        embs.append(batch_embs)
    return np.vstack(embs)
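
As a small illustration of the POS mapping used by `penn2morphy`, the sketch below repeats the same two-letter-prefix lookup on a few Penn Treebank tags. The function name `penn_to_wordnet` and the sample tags are assumptions made for this example only; any tag outside the four mapped prefixes falls back to noun.

```python
# Same lookup as penn2morphy, repeated here so the snippet runs on its own.
# `penn_to_wordnet` is a hypothetical name used only in this illustration.
def penn_to_wordnet(tag: str) -> str:
    mapping = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}
    return mapping.get(tag[:2], 'n')

# Illustrative tags: plural noun, past-tense verb, comparative adjective,
# comparative adverb, determiner (not mapped, so it falls back to 'n')
sample_tags = ['NNS', 'VBD', 'JJR', 'RBR', 'DT']
wordnet_pos = {tag: penn_to_wordnet(tag) for tag in sample_tags}
print(wordnet_pos)
```

The noun fallback means that function words are passed to the lemmatizer as nouns, which usually leaves them unchanged.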

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### 1. Loading data

In [None]:
data = pd.read_csv('https://code.s3.yandex.net/datasets/toxic_comments.csv')

display("Data size:", data.shape)


Data size: (159292, 3)


In [None]:
# Detect the language of each raw comment
data['lang'] = data['text'].apply(detect_lang_safe)

# Language summary: counts and shares
lang_counts = data['lang'].value_counts().sort_values(ascending=False)
lang_share  = (lang_counts / len(data)).round(4)

display("Top languages (counts):")
display(lang_counts.head(10))
display("\nTop languages (share):")
display(lang_share.head(10))


Top languages (counts):
lang
en    154922
de       582
fr       386
af       338
et       267
id       266
so       266
nl       220
cy       208
sv       170
Name: count, dtype: int64

Top languages (share):
lang
en    0.9726
de    0.0037
fr    0.0024
af    0.0021
et    0.0017
id    0.0017
so    0.0017
nl    0.0014
cy    0.0013
sv    0.0011
Name: count, dtype: float64


In [None]:
data.head()

| | Unnamed: 0 | text | toxic | lang |
|---|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 | en |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 | en |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 | en |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 | en |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 | en |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
 3   lang        159292 non-null  object
dtypes: int64(2), object(2)
memory usage: 4.9+ MB


**Conclusion**
- Rows: 159292
- Columns: 3 in the source file (`Unnamed: 0`, `text`, `toxic`); `lang` was added above
- There are no missing values
- Data types match the column contents

Language detection showed that about 97% of the texts in the corpus are English, so the standard pretrained `bert-base-uncased` model is a reasonable choice for the BERT experiments.

### 2. Text preprocessing

In [None]:
# application to data
data['text_clean'] = data['text'].astype(str).apply(clean_text_en)
# tqdm.pandas()
data['text_lemma'] = data['text_clean'].apply(lemmatize_en)
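
To make the cleaning steps concrete, here is a minimal standalone sketch that applies the same sequence of regular expressions as `clean_text_en` to one made-up comment. The function name `clean_text_demo` and the sample string are assumptions for illustration and are not taken from the dataset.

```python
import re

# Standalone copy of the cleaning steps, so the snippet runs without the notebook state.
# `clean_text_demo` is a hypothetical name used only here.
def clean_text_demo(s: str) -> str:
    s = s.lower()
    s = re.sub(r"https?://\S+|www\.\S+", " ", s)          # URLs
    s = re.sub(r"<.*?>", " ", s)                           # HTML tags
    s = re.sub(r"[@#]\w+", " ", s)                         # @mentions / #hashtags
    s = re.sub(r"&[a-z]+;", " ", s)                        # HTML entities
    s = re.sub(r"[^a-z0-9'\"!?.,:;()\[\]\-\s]", " ", s)    # anything else non-Latin
    return re.sub(r"\s+", " ", s).strip()                  # collapse whitespace

# Made-up example comment
sample = "Check <b>THIS</b> out: https://example.com &amp; @user #tag it's GREAT!!! ☺"
cleaned = clean_text_demo(sample)
print(cleaned)  # check this out: it's great!!!
```

Punctuation such as `!!!` is kept on purpose, since repeated punctuation can itself be a signal in toxic comments.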


In [None]:
# Class balance (for reporting)
data["toxic"].value_counts(normalize=True).round(4)



| toxic | proportion |
|---|---|
| 0 | 0.8985 |
| 1 | 0.1015 |
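
With roughly 10% toxic comments, the classes are clearly imbalanced, which is why the models below use `class_weight="balanced"`. As a rough sketch of what that option implies, scikit-learn weights each class by `n_samples / (n_classes * n_samples_in_class)`. The snippet applies this formula to the shares reported above; the exact weights in the experiments will differ slightly, because duplicates are dropped before training.

```python
# Class shares reported above (before deduplication)
shares = {0: 0.8985, 1: 0.1015}
n_classes = len(shares)

# "balanced" weight for class c:
#   n_samples / (n_classes * n_samples_c) = 1 / (n_classes * share_c)
weights = {c: 1.0 / (n_classes * share) for c, share in shares.items()}
print({c: round(w, 3) for c, w in weights.items()})  # {0: 0.556, 1: 4.926}
```

In other words, an error on a toxic comment costs the classifier roughly nine times as much as an error on a non-toxic one.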


### 3. Model training


#### TF-IDF

In [None]:
# 1) Clean up before splitting: drop missing values and duplicate (text, label) pairs
data = data.dropna(subset=['text_lemma', 'toxic']).copy()
data = data.drop_duplicates(subset=['text_lemma', 'toxic']).copy()

# 2) Features and target
X = data["text_lemma"].values
y = data["toxic"].astype(int).values

# 3) Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, random_state=42, stratify=y
)

In [None]:
# 1) Baseline: TF-IDF by WORDS + LinearSVC
pipe_word = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="word",
        ngram_range=(1, 2),          # unigrams + bigrams make a good baseline
        min_df=2,                    # lower threshold than 10, to keep rare but informative terms
        max_df=0.9,
        sublinear_tf=True,
        lowercase=True,
        token_pattern=r"(?u)\b[\w']+\b",
        stop_words='english'
    )),
    ("clf", LinearSVC(C=1.0, class_weight="balanced", dual=True))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_word = cross_val_score(pipe_word, X_train, y_train, scoring="f1", cv=cv, n_jobs=-1)
display("WORD TF-IDF + LinearSVC — CV F1:", np.mean(scores_word).round(4), "±", np.std(scores_word).round(4))

WORD TF-IDF + LinearSVC — CV F1: 0.7777 ± 0.0065
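
Both TF-IDF pipelines set `sublinear_tf=True`, which replaces the raw term count `tf` with `1 + ln(tf)`. The sketch below shows only this scaling step (not the IDF part or the final normalization) on a few illustrative counts; the helper name `sublinear_tf` is an assumption for this example.

```python
import math

# Hypothetical helper mirroring the sublinear scaling: 1 + ln(tf) for tf > 0
def sublinear_tf(count: int) -> float:
    return 1.0 + math.log(count) if count > 0 else 0.0

raw_counts = [1, 10, 100]  # illustrative term counts within one comment
scaled = [round(sublinear_tf(c), 2) for c in raw_counts]
print(scaled)  # [1.0, 3.3, 5.61]
```

A word repeated 100 times therefore weighs only about 5.6 times more than a word used once, which dampens the effect of spammy repetition.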


In [None]:
# 2) Model: TF-IDF (char 3–4) + LinearSVC
pipe_tf = Pipeline([
    ("tfidf", TfidfVectorizer(
        analyzer="char",
        ngram_range=(3, 4),
        min_df=10,            # drop very rare n-grams
        max_features=80000,   # cap the vocabulary size for RAM/speed
        sublinear_tf=True
    )),
    ("clf", LinearSVC(C=1.0, class_weight="balanced", dual=True))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipe_tf, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
cv_score_mean = np.mean(cv_scores).round(4)
cv_score_std = np.std(cv_scores).round(4)
cv_score_mean, cv_score_std


(np.float64(0.776), np.float64(0.0017))



#### BERT

In [None]:
X = data["text_clean"].astype(str).tolist()
y = data["toxic"].astype(int).tolist()

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Use a small subsample for a quick test (frozen BERT inference is slow here)
X_train_small, y_train_small = X_train[:4000], y_train[:4000]
X_test_small,  y_test_small  = X_test[:1000],  y_test[:1000]

# Load the pretrained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
bert_model.to(device)
bert_model.eval();


In [None]:
# Compute [CLS] embeddings for the reduced sample
X_train_vec = get_bert_embeddings(X_train_small, batch_size=8, max_length=512)
X_test_vec  = get_bert_embeddings(X_test_small,  batch_size=8, max_length=512)

100%|██████████| 500/500 [1:03:46<00:00,  7.65s/it]
100%|██████████| 125/125 [14:48<00:00,  7.11s/it]


We will train two classifiers on the BERT embeddings:
- Logistic regression
- LinearSVC

In [None]:

pipe_l = make_pipeline(
    Normalizer(norm="l2"),
    LogisticRegression(
        solver="lbfgs",      # fixed solver, not part of the grid
        max_iter=2000
    )
)

param_grid = {
    'logisticregression__C': [0.1, 0.5, 1, 2, 5, 10],
    'logisticregression__class_weight': ['balanced', None]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_LR = GridSearchCV(pipe_l, param_grid, scoring='f1', cv=cv, n_jobs=-1)
grid_LR.fit(X_train_vec, y_train_small)

display("Best params:", grid_LR.best_params_)
display("Best CV F1:", round(grid_LR.best_score_, 4))

'Best params:'

{'logisticregression__C': 10, 'logisticregression__class_weight': 'balanced'}

'Best CV F1:'

np.float64(0.6205)

In [None]:
pipe_s = Pipeline([
    ("norm", Normalizer(norm = 'l2')),
    ("clf", LinearSVC())
])

param_grid = [
    # Branch for hinge: only dual=True
    {
        "clf__C": [0.01, 0.1, 0.5, 1, 2, 5, 10],
        "clf__class_weight": [None, "balanced"],
        "clf__loss": ["hinge"],
        "clf__dual": [True],
        "clf__max_iter": [5000, 10000]
        # "clf__tol": [1e-4, 1e-3],
    },
    # Branch for squared_hinge: dual can be either True or False
    {
        "clf__C": [0.01, 0.1, 0.5, 1, 2, 5, 10],
        "clf__class_weight": [None, "balanced"],
        "clf__loss": ["squared_hinge"],
        "clf__dual": [True, False],
        "clf__max_iter": [5000, 10000]
        # "clf__tol": [1e-4, 1e-3],
    },
]


grid_SVC = GridSearchCV(
    pipe_s,
    param_grid,
    scoring='f1',
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_SVC.fit(X_train_vec, y_train_small)

display("Best params:", grid_SVC.best_params_)
display("Best CV F1:", round(grid_SVC.best_score_, 4))

Fitting 5 folds for each of 84 candidates, totalling 420 fits
Best params: {'clf__C': 5, 'clf__class_weight': None, 'clf__dual': True, 'clf__loss': 'squared_hinge', 'clf__max_iter': 5000}
Best CV F1: 0.6655
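
The figure of 84 candidates in the log follows directly from the two branches of the parameter grid. As a quick sanity check, the sketch below enumerates the same value lists with `itertools.product`; the variable names are chosen here for illustration.

```python
from itertools import product

# Value lists copied from the LinearSVC parameter grid
C_values = [0.01, 0.1, 0.5, 1, 2, 5, 10]
class_weights = [None, "balanced"]
max_iters = [5000, 10000]

# Branch 1: hinge loss, dual=True only
hinge = list(product(C_values, class_weights, ["hinge"], [True], max_iters))
# Branch 2: squared hinge loss, dual=True or dual=False
squared = list(product(C_values, class_weights, ["squared_hinge"], [True, False], max_iters))

n_candidates = len(hinge) + len(squared)
n_fits = n_candidates * 5  # 5-fold cross-validation
print(n_candidates, n_fits)  # 84 420
```

Listing `max_iter` in the grid doubles the number of fits; if the solver already converges at 5000 iterations, fixing it would halve the search time.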


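**The two TF-IDF models are effectively tied on cross-validation (word: 0.7777 ± 0.0065, char: 0.776 ± 0.0017) and both are well ahead of frozen BERT. We evaluate the character n-gram TF-IDF model on the test set.**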


In [None]:
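# Note: X_train/X_test were reassigned in the BERT section, so here they hold
# `text_clean` (not `text_lemma`), split with the same seed and stratification
pipe_tf.fit(X_train, y_train)
y_pred = pipe_tf.predict(X_test)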
display("Test F1:", f1_score(y_test, y_pred))
display("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
display("\nReport:\n", classification_report(y_test, y_pred, digits=4))

Test F1: 0.7858597614198429
Confusion matrix:
 [[27598   947]
 [  525  2701]]

Report:
               precision    recall  f1-score   support

           0     0.9813    0.9668    0.9740     28545
           1     0.7404    0.8373    0.7859      3226

    accuracy                         0.9537     31771
   macro avg     0.8609    0.9020    0.8799     31771
weighted avg     0.9569    0.9537    0.9549     31771
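
The toxic-class metrics in the report can be reproduced by hand from the confusion matrix, where rows are true classes and columns are predicted classes. A minimal sketch of that arithmetic:

```python
# Confusion matrix from the test run: rows = true class, columns = predicted class
tn, fp = 27598, 947
fn, tp = 525, 2701

precision = tp / (tp + fp)   # share of flagged comments that are truly toxic
recall = tp / (tp + fn)      # share of toxic comments that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.7404 0.8373 0.7859
```

So about 947 clean comments would go to moderation unnecessarily, while 525 toxic ones would be missed.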



### 4. Conclusion to the models

- WORD TF-IDF + LinearSVC (baseline): CV F1 ≈ 0.7777 ± 0.0065.

- CHAR TF-IDF (3–4) + LinearSVC: CV F1 ≈ 0.776 ± 0.0017 (on par with the word baseline), test F1 = 0.7859.

  Metrics for class "1" (toxic): precision 0.740, recall 0.837, F1 0.786. The model catches most toxic comments (high recall) at the cost of somewhat more false positives.

- BERT (frozen, [CLS] embedding) + a classical classifier on top, fitted on a 4000-comment subsample:
  - [CLS] + LR: CV F1 ≈ 0.621
  - [CLS] + LinearSVC: CV F1 ≈ 0.665

  Both lag noticeably behind TF-IDF. Likely reasons: no fine-tuning, the small training subsample, and reliance on the raw [CLS] vector.


For production: keep TF-IDF (char 3–4) + LinearSVC.

Without fine-tuning, BERT loses to character n-grams on tasks where typos and deliberately masked words are common.

**Conclusion**

On the current dataset, the character-level model (char 3–4 TF-IDF + LinearSVC) reaches F1 ≈ 0.786 on the test set and is on par with the word baseline on CV. Its advantage is robustness to typos and obfuscation, which are typical of toxic comments.
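
To illustrate why character n-grams tolerate typos, the sketch below compares the 3- and 4-grams of a word with those of a misspelled variant. It is a simplified stand-in for `TfidfVectorizer(analyzer="char")`, which also forms n-grams across spaces; the helper `char_ngrams` and the example words are assumptions for illustration.

```python
# Hypothetical helper: all character n-grams of length n_min..n_max in a string
def char_ngrams(text: str, n_min: int = 3, n_max: int = 4) -> set:
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

# A word-level model sees two unrelated tokens; a char-level model sees shared pieces
shared = char_ngrams("stupid") & char_ngrams("stuupid")
print(sorted(shared))  # ['pid', 'stu', 'upi', 'upid']
```

The misspelled form still activates four of the seven n-grams of the original word, whereas a word-level vocabulary would treat it as an unseen token.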

WORD TF-IDF is a fair and strong baseline (CV ≈ 0.778). It is worth evaluating on the test set as well: it will probably be close, although character-level models often do better on toxicity tasks.

Frozen BERT without fine-tuning is weaker. A fair comparison with transformers would require fine-tuning and/or a multilingual model (xlm-roberta-base or bert-base-multilingual-cased); mean pooling with L2-normalized embeddings may also work better than the raw [CLS] vector.
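
As a rough sketch of the mean-pooling idea mentioned above, the snippet below averages token vectors over non-padding positions and then L2-normalizes the result. It uses NumPy with tiny made-up arrays in place of real BERT outputs; in the notebook the inputs would be `outputs.last_hidden_state` and `attention_mask` converted to arrays. The function names are assumptions for illustration, and this variant was not evaluated in the experiments.

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy batch: 2 texts, 3 token positions, hidden size 2 (made-up numbers)
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],   # last position is padding
                   [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = mean_pool(hidden, mask)   # padding vector [100, 100] is ignored
unit = l2_normalize(pooled)
print(pooled)
```

Unlike the [CLS] vector, the pooled vector uses every real token, which may help when the model has not been fine-tuned to summarize the text in [CLS].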