<h1><center>How to train a Huggingface Tokenizer + TFIDF + RIDGE</center></h1>     

<center><img src = "https://i.imgur.com/iRX7hwu.png" width = "1000" height = "400"/></center>           

This notebook was inspired by the following two notebooks:
* https://www.kaggle.com/vitaleey/tfidf-ridge
* https://www.kaggle.com/pablorosa01/naive-bayes-modeling-base-line

<h3 style='background:orange; color:black'><center>Consider upvoting this notebook if you found it helpful.</center></h3>

# Imports

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup

from tqdm.auto import tqdm

# Load Datasets

In [None]:
TRAIN_DATA_PATH = "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv"
VALID_DATA_PATH = "/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv"
TEST_DATA_PATH = "/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv"

In [None]:
df_train = pd.read_csv(TRAIN_DATA_PATH)
df_valid = pd.read_csv(VALID_DATA_PATH)
df_test = pd.read_csv(TEST_DATA_PATH)

# Scoring training data

In [None]:
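# Per-category weights used to turn the binary labels into a single severity score.
cat_mtpl = {'obscene': 0.16, 'toxic': 0.32, 'threat': 1.5, 
            'insult': 0.64, 'severe_toxic': 1.5, 'identity_hate': 1.5}

# Weight each label column and average across the six categories.
# The label columns are not overwritten, so this cell is safe to re-run.
label_cols = list(cat_mtpl)
df_train['score'] = (df_train[label_cols] * pd.Series(cat_mtpl)).mean(axis=1)

df_train['y'] = df_train['score']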

min_len = (df_train['y'] > 0).sum()  # len of toxic comments
df_y0_undersample = df_train[df_train['y'] == 0].sample(n=min_len, random_state=41)  # take non toxic comments
df_train_new = pd.concat([df_train[df_train['y'] > 0], df_y0_undersample])  # make new df
df_train_new
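
To make the scoring and undersampling steps above concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are hypothetical; only the weights are taken from the cell above.

```python
import pandas as pd

# Hypothetical toy labels; the real labels come from train.csv.
toy = pd.DataFrame({
    "toxic":         [0, 1, 1, 0, 0, 0],
    "severe_toxic":  [0, 0, 1, 0, 0, 0],
    "obscene":       [0, 1, 1, 0, 0, 0],
    "threat":        [0, 0, 0, 0, 0, 0],
    "insult":        [0, 0, 1, 0, 0, 0],
    "identity_hate": [0, 0, 0, 0, 0, 0],
})
weights = {"obscene": 0.16, "toxic": 0.32, "threat": 1.5,
           "insult": 0.64, "severe_toxic": 1.5, "identity_hate": 1.5}

# Weighted mean over the six label columns (Series index aligns with columns).
label_cols = list(weights)
toy["y"] = toy[label_cols].mul(pd.Series(weights)).mean(axis=1)

# Undersample the non-toxic rows so both groups have the same size.
n_toxic = int((toy["y"] > 0).sum())
non_toxic = toy[toy["y"] == 0].sample(n=n_toxic, random_state=41)
balanced = pd.concat([toy[toy["y"] > 0], non_toxic])

print(toy["y"].round(3).tolist())
print(len(balanced))
```

Row 1 (toxic and obscene only) scores (0.32 + 0.16) / 6 = 0.08, while row 2, which also carries `severe_toxic` and `insult`, scores much higher.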

# Train the tokenizer

In [None]:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

raw_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
raw_tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
raw_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

In [None]:
from datasets import Dataset

dataset = Dataset.from_pandas(df_train_new[['comment_text']])

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["comment_text"]

In [None]:
raw_tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
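
As a quick sanity check of the training step, the sketch below trains the same kind of WordPiece tokenizer on a tiny, hypothetical corpus and encodes one sentence. The names `corpus`, `batch_iterator`, `toy_tokenizer` and `toy_trainer` are illustrative only, and the small `vocab_size` is an assumption chosen to keep the example fast.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Hypothetical toy corpus; the notebook trains on df_train_new['comment_text'].
corpus = [
    "you are a wonderful person",
    "you are a terrible person",
    "what a wonderful day",
    "what a terrible, terrible idea",
] * 50

def batch_iterator(texts, batch_size=100):
    # Yield lists of strings, mirroring get_training_corpus above.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]

toy_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
toy_tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
toy_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
toy_trainer = trainers.WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(batch_iterator(corpus), trainer=toy_trainer)

# The normalizer lowercases the input before it is split into word pieces.
encoding = toy_tokenizer.encode("You are TERRIBLE")
print(encoding.tokens)
print(encoding.ids)
```

Because no post-processor is attached, the encoding contains no `[CLS]` or `[SEP]` tokens, which suits a bag-of-tokens model such as TF-IDF.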

In [None]:
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# Train the Model

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

In [None]:
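def dummy_fun(doc):
    # Identity function: the comments are already tokenized into ids,
    # so TfidfVectorizer must pass each document through unchanged.
    return doc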

In [None]:
labels = df_train_new['y']
comments = df_train_new['comment_text']
tokenized_comments = tokenizer(comments.to_list())['input_ids']

vectorizer = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,
    preprocessor = dummy_fun,
    token_pattern = None)

comments_tr = vectorizer.fit_transform(tokenized_comments)
comments_tr
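
The pass-through trick above can be checked in isolation. This is a minimal sketch with hypothetical token ids (`docs`) standing in for the tokenizer output; `identity` plays the role of `dummy_fun`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # The documents are already lists of token ids, so nothing to do.
    return doc

# Hypothetical token ids, one list per document.
docs = [[5, 7, 7, 9], [5, 11], [7, 9, 9]]

toy_vectorizer = TfidfVectorizer(
    analyzer="word",
    tokenizer=identity,
    preprocessor=identity,
    token_pattern=None,
)
X = toy_vectorizer.fit_transform(docs)

# One column per distinct token id; rows are L2-normalized by default.
print(X.shape)
print(sorted(toy_vectorizer.vocabulary_))
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
```

Each distinct id becomes one TF-IDF feature, so the vocabulary of the vectorizer is exactly the set of token ids seen during `fit_transform`.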

In [None]:
regressor = Ridge(random_state=42, alpha=0.8)
regressor.fit(comments_tr, labels)

# Validation

In [None]:
# preprocess val data
less_toxic_comments = df_valid['less_toxic']
more_toxic_comments = df_valid['more_toxic']

less_toxic_comments = tokenizer(less_toxic_comments.to_list())['input_ids']
more_toxic_comments = tokenizer(more_toxic_comments.to_list())['input_ids']

less_toxic = vectorizer.transform(less_toxic_comments)
more_toxic = vectorizer.transform(more_toxic_comments)

# make predictions
y_pred_less = regressor.predict(less_toxic)
y_pred_more = regressor.predict(more_toxic)

(y_pred_less < y_pred_more).mean()
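
The last expression is a pairwise accuracy: the fraction of validation pairs in which the comment annotated as more toxic receives the strictly higher score. A minimal sketch with hypothetical scores (the helper name `pairwise_accuracy` is illustrative):

```python
import numpy as np

def pairwise_accuracy(less_scores, more_scores):
    # Fraction of pairs ranked correctly; ties count as misses
    # because the comparison is strict.
    less = np.asarray(less_scores, dtype=float)
    more = np.asarray(more_scores, dtype=float)
    return float((less < more).mean())

# Hypothetical predictions for four (less_toxic, more_toxic) pairs.
acc = pairwise_accuracy([0.1, 0.5, 0.2, 0.3], [0.4, 0.3, 0.9, 0.3])
print(acc)  # 0.5: pairs 1 and 3 are correct, pair 2 is inverted, pair 4 is a tie
```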


Pairwise validation accuracy by tokenizer:

* Tokenizer (deberta-v3): 0.6699880430450379
* Tokenizer (trained): 0.6674970107612594
* Tokenizer (trained + dirty): 0.6716819449980072

**Be careful: these results suggest that the 0.86 LB score is not reliable! Use at your own risk!**

# Predictions and writing submission.csv

In [None]:
texts = df_test['text']
texts = tokenizer(texts.to_list())['input_ids']
texts = vectorizer.transform(texts)

In [None]:
# Store the predictions directly as 'score' and keep only the submission columns.
df_test['score'] = regressor.predict(texts)
df_test = df_test[['comment_id', 'score']]

In [None]:
df_test.to_csv('./submission.csv', index=False)
df_test