# Creating the condBERT vocabulary
The next hypothesis builds on the previous one. I researched various models and their applications and decided that BERT is well suited to the problem of text detoxification.

BERT is designed for a wide range of natural language understanding tasks, including sentiment analysis, question answering, and text classification.
It is pre-trained on a massive amount of text data and learns contextual representations of words in a bidirectional manner, capturing rich semantic relationships. This pre-training enables it to understand the context of a word in a sentence or document.
The **"cond"** prefix suggests a model designed with a particular condition or constraint in mind.
In our case, it means that this BERT will be used in the text-detoxification context.
## Create tokenizer

In [26]:
VOCAB_DIRNAME = '../data/interm/vocab' 

In [19]:
from transformers import BertTokenizer
import os

**'bert-base-uncased'** is the base-size BERT model trained on lowercased text, so its tokenizer lowercases the input by default. This model is commonly used in natural language processing tasks such as text classification, information extraction, and text generation.

In [21]:
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

## Preparing vocabularies
In this part we will create these files:
- negative-words.txt
- positive-words.txt
- tox_coef.pkl
- token_tox.txt

In [31]:
tox_corpus_path = '../data/interm/toxic_train.csv'
norm_corpus_path = '../data/interm/normal_train.csv'

In [32]:
if not os.path.exists(VOCAB_DIRNAME):
    os.makedirs(VOCAB_DIRNAME)

### Preparing toxic and normal vocabularies
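Words with high "toxic salience" are saved in negative-words.txt, and words with high "polite salience" are saved in positive-words.txt.
The `NgramSalienceCalculator` class below calculates the salience of n-grams (combinations of adjacent words) across two corpora: a toxic corpus (`tox_corpus`) and a normal corpus (`norm_corpus`). Salience is a smoothed count ratio: for the toxic attribute it is (toxic count + `lmbda`) / (normal count + `lmbda`), and for the normal attribute it is the inverse.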

In [30]:
import numpy as np
from tqdm import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

class NgramSalienceCalculator():
    def __init__(self, tox_corpus, norm_corpus, use_ngrams=False):
        ngrams = (1, 3) if use_ngrams else (1, 1)
        self.vectorizer = CountVectorizer(ngram_range=ngrams)

        tox_count_matrix = self.vectorizer.fit_transform(tox_corpus)
        self.tox_vocab = self.vectorizer.vocabulary_
        self.tox_counts = np.sum(tox_count_matrix, axis=0)

        norm_count_matrix = self.vectorizer.fit_transform(norm_corpus)
        self.norm_vocab = self.vectorizer.vocabulary_
        self.norm_counts = np.sum(norm_count_matrix, axis=0)

    def salience(self, feature, attribute='tox', lmbda=0.5):
        assert attribute in ['tox', 'norm']
        if feature not in self.tox_vocab:
            tox_count = 0.0
        else:
            tox_count = self.tox_counts[0, self.tox_vocab[feature]]

        if feature not in self.norm_vocab:
            norm_count = 0.0
        else:
            norm_count = self.norm_counts[0, self.norm_vocab[feature]]

        if attribute == 'tox':
            return (tox_count + lmbda) / (norm_count + lmbda)
        else:
            return (norm_count + lmbda) / (tox_count + lmbda)
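
As a rough illustration of what `salience` returns, the sketch below reproduces the same smoothed ratio with plain `Counter` objects instead of `CountVectorizer`. The two toy corpora and the helper name `toy_salience` are made up for this example and are not part of the notebook's data.

```python
from collections import Counter

# Toy corpora (hypothetical, for illustration only).
toy_tox = ["you idiot", "stupid idiot"]
toy_norm = ["you are nice", "nice day"]

# Whitespace tokenization stands in for CountVectorizer here.
tox_counts = Counter(tok for line in toy_tox for tok in line.split())
norm_counts = Counter(tok for line in toy_norm for tok in line.split())


def toy_salience(word, attribute="tox", lmbda=0.5):
    """Smoothed count ratio, mirroring NgramSalienceCalculator.salience."""
    tox = tox_counts[word]    # Counter returns 0 for unseen words
    norm = norm_counts[word]
    if attribute == "tox":
        return (tox + lmbda) / (norm + lmbda)
    return (norm + lmbda) / (tox + lmbda)


print(toy_salience("idiot", "tox"))   # (2 + 0.5) / (0 + 0.5) = 5.0
print(toy_salience("nice", "norm"))   # (2 + 0.5) / (0 + 0.5) = 5.0
print(toy_salience("you", "tox"))     # (1 + 0.5) / (1 + 0.5) = 1.0
```

A word that appears equally often in both corpora gets a salience of 1.0, and the `lmbda` term keeps the ratio finite when a word is missing from one corpus.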


In [33]:
from collections import Counter
c = Counter()

for fn in [tox_corpus_path, norm_corpus_path]:
    with open(fn, 'r') as corpus:
        for line in corpus.readlines():
            for tok in line.strip().split():
                c[tok] += 1

print(len(c))

393697


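Next, filter the vocabulary to retain only those words that occur at least once (count greater than 0). With this cutoff nothing is removed, so the vocabulary size stays the same; raising the cutoff would drop rare words.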

In [34]:
vocab = {w for w, count in c.most_common() if count > 0}
print(len(vocab))

393697


Next, load both corpora, replacing any word that is not in the vocabulary with `<unk>`, and set the output paths for the word lists.

In [36]:
with open(tox_corpus_path, 'r') as tox_corpus, open(norm_corpus_path, 'r') as norm_corpus:
    corpus_tox = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in tox_corpus.readlines()]
    corpus_norm = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in norm_corpus.readlines()]
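
The masking step above can be sketched on a single line of text. The toy vocabulary and the sample line are hypothetical; in this notebook every word passes the `count > 0` cutoff, so no word is actually replaced.

```python
# Hypothetical toy vocabulary and input line, for illustration only.
toy_vocab = {"you", "are"}
toy_line = "you are silly\n"

# Keep known words, replace everything else with the <unk> placeholder.
masked = " ".join(
    w if w in toy_vocab else "<unk>" for w in toy_line.strip().split()
)
print(masked)  # you are <unk>
```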

In [37]:
neg_out_name = VOCAB_DIRNAME + '/negative-words.txt'
pos_out_name = VOCAB_DIRNAME + '/positive-words.txt'

**threshold** is the cutoff for the salience ratio: a word whose toxic salience exceeds it is written to negative-words.txt, and a word whose polite salience exceeds it is written to positive-words.txt. Words below the threshold on both measures are written to neither file.

In [38]:
threshold = 4

In [39]:
sc = NgramSalienceCalculator(corpus_tox, corpus_norm, False)
seen_grams = set()

with open(neg_out_name, 'w') as neg_out, open(pos_out_name, 'w') as pos_out:
    for gram in set(sc.tox_vocab.keys()).union(set(sc.norm_vocab.keys())):
        if gram not in seen_grams:
            seen_grams.add(gram)
            toxic_salience = sc.salience(gram, attribute='tox')
            polite_salience = sc.salience(gram, attribute='norm')
            if toxic_salience > threshold:
                neg_out.write(f'{gram}\n')
            elif polite_salience > threshold:
                pos_out.write(f'{gram}\n')
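
To get a feel for what `threshold = 4` means in terms of raw counts, the sketch below computes the smallest toxic count that pushes a word over the threshold for a given normal count, assuming the default `lmbda=0.5`. The helper name `min_toxic_count` is hypothetical and not part of the notebook.

```python
import math


def min_toxic_count(norm_count, threshold=4, lmbda=0.5):
    """Smallest integer n with (n + lmbda) / (norm_count + lmbda) > threshold."""
    # Rearranging the inequality: n must be strictly greater than this bound.
    bound = threshold * (norm_count + lmbda) - lmbda
    return max(0, math.floor(bound) + 1)


for norm_count in (0, 1, 10):
    print(norm_count, min_toxic_count(norm_count))
# 0 2
# 1 6
# 10 42
```

So a word that never occurs in the normal corpus needs only two toxic occurrences to be listed as negative, which is worth keeping in mind when reading the output files.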

## Evaluating word toxicities with a logistic regression
tox_coef.pkl: This file stores a mapping from each word to its logistic regression coefficient. The model is trained to separate the toxic corpus (label 1) from the normal corpus (label 0), so a large positive coefficient means the word pushes a text toward the toxic class, and a negative coefficient pushes it toward the normal class.

In [40]:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

In [41]:
X_train = corpus_tox + corpus_norm
y_train = [1] * len(corpus_tox) + [0] * len(corpus_norm)
pipe.fit(X_train, y_train);

In [42]:
coefs = pipe[1].coef_[0]
coefs.shape

(393698,)

In [43]:
tox_coef = {w: coefs[idx] for w, idx in pipe[0].vocabulary_.items()}
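
One simple sanity check on such a mapping is to rank words by coefficient and look at both ends of the list. The sketch below does this on a toy dictionary; the words and values in `toy_tox_coef` are invented for illustration and are not results from the fitted pipeline.

```python
# Hypothetical coefficients; the real tox_coef comes from the fitted pipeline.
toy_tox_coef = {
    "idiot": 2.1,
    "stupid": 1.7,
    "the": 0.02,
    "day": -0.4,
    "thanks": -1.3,
}

# Sort from the largest (most toxic) to the smallest (most polite) coefficient.
ranked = sorted(toy_tox_coef.items(), key=lambda item: item[1], reverse=True)

most_toxic = [word for word, _ in ranked[:2]]
least_toxic = [word for word, _ in ranked[-2:]]
print(most_toxic)   # ['idiot', 'stupid']
print(least_toxic)  # ['day', 'thanks']
```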

The ".pkl" file extension conventionally marks a file written by Python's `pickle` module, which serializes objects into a binary format.
Serialization is the process of converting data structures or objects into a format that can be easily stored or transmitted and later reconstructed into their original form.

In [44]:
import pickle
with open(VOCAB_DIRNAME + '/tox_coef.pkl', 'wb') as f:
    pickle.dump(tox_coef, f)
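
A minimal sketch of the full round trip, writing a dictionary and reading it back, is shown below. It uses a temporary directory and a toy dictionary with invented values rather than the notebook's `VOCAB_DIRNAME` and `tox_coef`.

```python
import os
import pickle
import tempfile

toy_tox_coef = {"idiot": 2.1, "thanks": -1.3}  # hypothetical values

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tox_coef.pkl")

    # Serialize: pickle files must be opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(toy_tox_coef, f)

    # Deserialize: the object is reconstructed with its original contents.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_tox_coef)  # True
```

Since unpickling can execute arbitrary code, `pickle.load` should only be used on files from a trusted source.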

## Labelling BERT tokens by toxicity

token_tox.txt: This file contains the calculated toxicities for BERT tokens.

In [45]:
from collections import defaultdict
toxic_counter = defaultdict(lambda: 1)
nontoxic_counter = defaultdict(lambda: 1)

for text in tqdm(corpus_tox):
    for token in tokenizer.encode(text):
        toxic_counter[token] += 1
for text in tqdm(corpus_norm):
    for token in tokenizer.encode(text):
        nontoxic_counter[token] += 1

100%|██████████| 289975/289975 [02:18<00:00, 2097.95it/s]
100%|██████████| 241340/241340 [02:09<00:00, 1867.09it/s]


After counting the occurrences of BERT tokens in both "toxic" and "normal" texts, we calculate the toxicity of each BERT token as the ratio of its count in "toxic" texts to its total count in both "toxic" and "normal" texts. Both counters start at 1 (add-one smoothing), so a token that never occurs in either corpus gets a neutral score of 0.5 instead of causing a division by zero.

In [46]:
token_tox = [toxic_counter[i] / (nontoxic_counter[i] + toxic_counter[i]) for i in range(len(tokenizer.vocab))]
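
The effect of starting the counters at 1 is easiest to see on a toy example. In the sketch below the token ids are arbitrary integers chosen for illustration, not real BERT vocabulary ids, and `toy_token_tox` is a hypothetical helper.

```python
from collections import defaultdict

# Counters start at 1, as in the notebook (add-one smoothing).
toy_toxic = defaultdict(lambda: 1)
toy_nontoxic = defaultdict(lambda: 1)

for token_id in [5, 5, 7]:   # ids seen in "toxic" texts
    toy_toxic[token_id] += 1
for token_id in [7, 9]:      # ids seen in "normal" texts
    toy_nontoxic[token_id] += 1


def toy_token_tox(token_id):
    """Share of (smoothed) occurrences that come from toxic texts."""
    t, n = toy_toxic[token_id], toy_nontoxic[token_id]
    return t / (t + n)


print(toy_token_tox(5))  # 3 / (3 + 1) = 0.75
print(toy_token_tox(7))  # 2 / (2 + 2) = 0.5
print(toy_token_tox(0))  # never seen: 1 / (1 + 1) = 0.5
```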

In [47]:
with open(VOCAB_DIRNAME + '/token_tox.txt', 'w') as f:
    for t in token_tox:
        f.write(str(t))
        f.write('\n')

Together, these files form the basis for identifying toxic content in a text and replacing it with non-toxic alternatives. They let the detoxification model make informed decisions about which words to replace, which should make it more effective.
Now that the vocabulary is ready, the next step is to *feed* it to a new model.