# 1. Information about the submission

## 1.1 Name and number of the assignment 

Maria Lysyuk, HW2: Text Detoxification


## 1.2 Student name

Maria Lysyuk

## 1.3 Codalab user ID

Nickname at Codalab: zlatamaria

# 2. Technical Report

## 2.1 Methodology 

The baseline solution is taken from https://github.com/yandex/mlcup/blob/main/nlp/offline_baseline.ipynb

The main idea of the solution is as follows:

1) Use a pretrained RoBERTa model to predict the toxicity of each word.

2) If a word's toxicity exceeds a threshold, remove it. The original baseline does not remove the word but replaces it with a non-toxic synonym found through a KDTree. That approach does not work well on our train data, partly because the algorithm for obtaining synonyms is weak, and partly because the train data makes it clear that in many cases no substitution is needed.
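
The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `TOY_TOXICITY` and `remove_toxic_words` are hypothetical names, and the scores are made up rather than produced by the classifier.

```python
# Toy word-level toxicity scores (made up for illustration);
# in the notebook these come from the RoBERTa classifier.
TOY_TOXICITY = {"this": 0.02, "stupid": 0.97, "idea": 0.05, "works": 0.01}

def remove_toxic_words(sentence, toxicity, threshold=0.5):
    """Drop every word whose toxicity score exceeds the threshold."""
    kept = []
    for word in sentence.lower().split():
        # Words missing from the dictionary are treated as toxic (score 1.0),
        # mirroring the `toxicity.get(word, 1.0)` default used in the notebook.
        if toxicity.get(word, 1.0) <= threshold:
            kept.append(word)
    return " ".join(kept)

cleaned = remove_toxic_words("This stupid idea works", TOY_TOXICITY)
print(cleaned)
```

Lowering `threshold` makes the filter stricter, at the cost of removing more harmless words.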

Improvements over the baseline:

1) An improved toxicity dictionary

2) Custom substitutions for some specific words

3) A change of approach: in most cases toxic words are simply removed rather than substituted


## 2.2 Discussion of results


***Analysis of the baseline results on the train dataset revealed three problems with the solution***:

1) The toxicity dictionary is far from perfect: as you can see below, many non-toxic words receive a high toxicity score and vice versa

2) In a large share of the examples it is better to simply remove the toxic word than to substitute it

3) The KDTree algorithm that replaces toxic words with non-toxic ones works reasonably well on its own, but the words it picks differ greatly from those in the reference answers, which badly hurts the ChrF1 score

***To address these problems, I did the following:***

1) I overrode the toxicity dictionary scores for words that contain certain predefined toxic roots, matched with regular expressions

2) I changed the algorithm to simple removal

3) I identified the most frequent patterns and, for those, substituted predefined words instead of simply removing the toxic word

The evolution of the scores can be seen below. 

| Corpus / model | ChrF1 | Style transfer accuracy |
| --- | --- | --- |
| Dev: baseline | 0.057989 | 0.924321 |
| Dev: baseline + removal | 0.063858 | 0.977072 |
| Dev: baseline + removal + new dictionary + new substitution | 0.063987 | 0.987444 |
| Final test (Codalab) | 0.064574 | 0.988619 |

As you can see, the style transfer accuracy is one of the best scores and is almost perfect. The ChrF1 score, however, is low, which is understandable: no machine translation models were used, so there is definitely room for improvement.
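
To make the ChrF1 discussion concrete, here is a simplified character n-gram F1 score. It is only a sketch: the official evaluation script may differ in details such as whitespace handling, n-gram order, or scaling, and the example sentences are invented. The function names `char_ngrams` and `chrf1` are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of a string with whitespace removed."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf1(hypothesis, reference, max_n=6):
    """Simplified ChrF with beta=1: F1 over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Invented example: the reference simply omits the toxic word.
reference = "he is a person"
removed = chrf1("he is a person", reference)              # toxic word dropped
substituted = chrf1("he is a strange person", reference)  # synonym inserted instead
print(removed, substituted)
```

When the reference answer just drops the toxic word, removal reproduces it exactly, while any inserted synonym adds n-grams that lower precision. This matches the observation that substitution hurt the score.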


# 3. Code

### Download packages

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import sys
# Upload the files from my submission to Google Drive in order to run the code
sys.path.append('/content/drive/MyDrive/data_detoxification/')

In [3]:
! pip install transformers



In [4]:
!wget http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz
!tar -xvf mystem-3.0-linux3.1-64bit.tar.gz
!cp mystem /root/.local/bin/mystem

import gensim
from pymystem3 import Mystem

stemmer = Mystem()


In [5]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import softmax, sigmoid
from functools import lru_cache

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

### Load the pretrained toxicity classifier


In [6]:
%%time
tokenizer = AutoTokenizer.from_pretrained("./drive/MyDrive/data_detoxification/trained_roberta/")

CPU times: user 552 ms, sys: 166 ms, total: 718 ms
Wall time: 736 ms


In [7]:
%%time
model = AutoModelForSequenceClassification.from_pretrained("./drive/MyDrive/data_detoxification/trained_roberta/").cuda()

CPU times: user 5.17 s, sys: 2.88 s, total: 8.05 s
Wall time: 9.53 s


In [15]:
TOXIC_CLASS=-1
TOKENIZATION_TYPE='sentencepiece'

In [8]:
ALLOWED_ALPHABET=list(map(chr, range(ord('а'), ord('я') + 1)))
ALLOWED_ALPHABET.extend(map(chr, range(ord('a'), ord('z') + 1)))
ALLOWED_ALPHABET.extend(list(map(str.upper, ALLOWED_ALPHABET)))
ALLOWED_ALPHABET = set(ALLOWED_ALPHABET)

### Functions to obtain classification

In [9]:
def logits_to_toxic_probas(logits):
    if logits.shape[-1] > 1:
        activation = lambda x: softmax(x, -1)
    else:
        activation = sigmoid
    return activation(logits)[:, TOXIC_CLASS].cpu().detach().numpy()
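
The branching in `logits_to_toxic_probas` can be illustrated in plain Python for a single row of logits. This is a sketch with a hypothetical helper `toxic_probability`, not the notebook's tensor code. It also shows that, for two classes, the softmax probability of the last class equals the sigmoid of the logit difference.

```python
import math

def toxic_probability(logits):
    """Probability of the last class from one row of raw logits."""
    if len(logits) > 1:
        # Multi-class head: softmax, then take the last class.
        shifted = [x - max(logits) for x in logits]  # for numerical stability
        exps = [math.exp(x) for x in shifted]
        return exps[-1] / sum(exps)
    # Single-logit head: sigmoid.
    return 1.0 / (1.0 + math.exp(-logits[0]))

two_class = toxic_probability([0.3, 2.1])
single = toxic_probability([2.1 - 0.3])
print(two_class, single)
```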

In [10]:
def is_word_start(token):
    if TOKENIZATION_TYPE == 'sentencepiece':
        return token.startswith('▁')
    if TOKENIZATION_TYPE == 'bert':
        return not token.startswith('##')
    raise ValueError("Unknown tokenization type")

In [11]:
def normalize(sentence, max_tokens_per_word=20):
    sentence = ''.join(map(lambda c: c if c.isalpha() else ' ', sentence.lower()))
    ids = tokenizer(sentence)['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
    
    result = []
    num_continuation_tokens = 0
    for token in tokens:
        if not is_word_start(token):
            num_continuation_tokens += 1
            if num_continuation_tokens < max_tokens_per_word:
                result.append(token.lstrip('#▁'))
        else:
            num_continuation_tokens = 0
            result.extend([' ', token.lstrip('▁#')])
    
    return ''.join(result).strip()
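
The token-merging loop inside `normalize` can be tried in isolation on a hand-written token list. The sketch below assumes SentencePiece-style tokens, where a leading `▁` marks the start of a word. The tokens and the helper name `merge_subword_tokens` are hypothetical and do not come from the real tokenizer.

```python
def merge_subword_tokens(tokens, max_tokens_per_word=20):
    """Rebuild whitespace-separated words from SentencePiece-style tokens."""
    pieces = []
    continuation_count = 0
    for token in tokens:
        if token.startswith("▁"):
            # A leading '▁' marks the first piece of a new word.
            continuation_count = 0
            pieces.extend([" ", token.lstrip("▁")])
        else:
            # Continuation piece: glue it to the current word, up to the cap.
            continuation_count += 1
            if continuation_count < max_tokens_per_word:
                pieces.append(token)
    return "".join(pieces).strip()

tokens = ["▁de", "tox", "ification", "▁works"]  # hand-written example tokens
merged = merge_subword_tokens(tokens)
truncated = merge_subword_tokens(tokens, max_tokens_per_word=2)
print(merged)
print(truncated)
```

The cap on continuation pieces shortens very long words, which keeps single-word inputs to the classifier small.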

In [12]:
def iterate_batches(data, batch_size=40):
    batch = []
    for x in data:
        batch.append(x)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if len(batch) > 0:
        yield batch

In [13]:
def predict_toxicity(sentences, batch_size=5, threshold=0.5, return_scores=False, verbose=True, device='cuda'):
    """Return toxicity scores (or boolean labels if return_scores is False) for each sentence."""
    results = []
    tqdm_fn = tqdm if verbose else lambda x, total: x
    for batch in tqdm_fn(iterate_batches(sentences, batch_size), total=np.ceil(len(sentences) / batch_size)):
        normalized = [normalize(sent, max_tokens_per_word=5) for sent in batch]
        tokenized = tokenizer(normalized, return_tensors='pt', padding=True, max_length=512, truncation=True)

        logits = model.to(device)(**{key: val.to(device) for key, val in tokenized.items()}).logits
        preds = logits_to_toxic_probas(logits)
        if not return_scores:
            preds = preds >= threshold
        results.extend(preds)
    return results

### Read train/test data

In [17]:
# train dataset (with reference answers)
df_train = pd.read_csv('train_dataset.csv')
df_toxic = df_train[['toxic']]
df_neutral = df_train[['neutral']]
texts = []
for i in range(len(df_toxic)):
  texts.append(normalize(df_toxic.loc[i, 'toxic'])) 

In [16]:
#development stage
texts = []
with open('dev_dataset.txt', 'rt') as f:
    for line in f:
        texts.append(normalize(line)) 

In [None]:
#final test stage
texts = []
with open('test_dataset.txt', 'rt') as f:
    for line in f:
        texts.append(normalize(line)) 

### Calculate the toxicity of separate words

In [29]:
words = set()
for text in texts:
    words.update(text.split())
words = sorted(words)

with torch.inference_mode():
    word_toxicities = predict_toxicity(words, batch_size=100, return_scores=True)
    
toxicity = dict(zip(words, word_toxicities))

  0%|          | 0/37.0 [00:00<?, ?it/s]

### Improvement over the pretrained dictionary

In [30]:
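import re
# Word roots that are always treated as toxic, regardless of the classifier score
mat = ['черт', 'трах', 'ебл', 'бля', 'ебан', 'член',  'сучк',  'хрень', 'тупо', 'блин', 'толст', 'пизд', 'дерьм', 'хуй']
# One compiled pattern that matches any of the roots anywhere in a word
toxic_root_re = re.compile('|'.join(map(re.escape, mat)))
for elem in toxicity:
    if toxic_root_re.search(elem):
        toxicity[elem] = 0.9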

### Look visually at the calculated toxicity

In [20]:
{k: v for k, v in sorted(toxicity.items(), key=lambda item: item[1], reverse = True)}

{'ублюдок': 0.9906267,
 'ублюдке': 0.9905998,
 'ублюдка': 0.99058,
 'ублюдков': 0.9905771,
 'выродок': 0.99057066,
 'педики': 0.99055076,
 'имбецилы': 0.99045753,
 'дебилов': 0.9904559,
 'ублюдком': 0.9901224,
 'ублюдки': 0.9889051,
 'педика': 0.9882355,
 'дебилы': 0.9881556,
 'пидор': 0.98812765,
 'дебила': 0.987356,
 'fucka': 0.9865441,
 'анальный': 0.9850927,
 'нандос': 0.98493713,
 'высасывая': 0.9835842,
 'чокнутые': 0.98275924,
 'дебиловский': 0.9781121,
 'дебилу': 0.97583413,
 'хохочешь': 0.97550654,
 'наебал': 0.97550154,
 'чокнутый': 0.97547805,
 'придурок': 0.9746688,
 'дебил': 0.9727328,
 'niggas': 0.9718959,
 'дурак': 0.9634208,
 'мразь': 0.9608082,
 'трепло': 0.96066064,
 'придурка': 0.9600687,
 'донн': 0.9585486,
 'кровососущие': 0.958254,
 'biebs': 0.95782626,
 'придурки': 0.9573734,
 'параноидальный': 0.95709866,
 'суды': 0.95651805,
 'придуркам': 0.95586455,
 'отъебись': 0.9529583,
 'дегенеративных': 0.949836,
 'мудаке': 0.94735897,
 'niggahs': 0.9468954,
 'nigga': 0.9

### Load the embeddings and define the functions that process them

In [31]:
embs_file = np.load('./drive/MyDrive/data_detoxification/embeddings_with_lemmas.npz', allow_pickle=True)
embs_vectors = embs_file['vectors']
embs_vectors_normed = embs_vectors / np.linalg.norm(embs_vectors, axis=1, keepdims=True)
embs_voc = embs_file['voc'].item()

embs_voc_by_id = [None for i in range(len(embs_vectors))]
for word, idx in embs_voc.items():
    if embs_voc_by_id[idx] is None:
        embs_voc_by_id[idx] = word

In [19]:
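def get_w2v_indicies(a):
    """Map each word to its embedding index, falling back to the lemma; None if not found."""
    res = []
    if isinstance(a, str):
        a = a.split()
    for w in a:
        if w in embs_voc:
            res.append(embs_voc[w])
        else:
            # Reuse the global Mystem instance instead of creating a new one per word
            lemma = stemmer.lemmatize(w)[0]
            res.append(embs_voc.get(lemma, None))
    return res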

def calc_embs(words):
    words = ' '.join(map(normalize, words))
    inds = get_w2v_indicies(words)
    return [None if i is None else embs_vectors[i] for i in inds]

### Put the embeddings into a KDTree to find the nearest neighbours

In [32]:
nontoxic_emb_inds = [ind for word, ind in embs_voc.items() if toxicity.get(word, 1.0) <= 0.5]
embs_vectors_normed_nontoxic = embs_vectors_normed[nontoxic_emb_inds]

In [33]:
from sklearn.neighbors import KDTree
embs_tree = KDTree(embs_vectors_normed_nontoxic, leaf_size = 20)
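
A note on why the vectors are normalized before they go into the tree: scikit-learn's `KDTree` uses Euclidean distance by default, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`, so the nearest Euclidean neighbour is also the most cosine-similar one. The pure-Python sketch below checks this on made-up 3-dimensional vectors. All names and values in it are hypothetical.

```python
import math

def normalize_vec(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity of two unit vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
query = normalize_vec([1.0, 2.0, 0.5])
candidates = {
    "near": normalize_vec([1.1, 1.9, 0.4]),
    "far": normalize_vec([-1.0, 0.3, 2.0]),
}

by_distance = min(candidates, key=lambda w: euclidean(query, candidates[w]))
by_cosine = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(by_distance, by_cosine)
```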

### Function that returns a non-toxic replacement, the word itself, or nothing for a given word

In [38]:
@lru_cache()
def find_closest_nontoxic(word, threshold=0.5, allow_self=False):
    # Custom substitutions for the most frequent toxic plural nouns
    if word in ['идиоты', 'ублюдки', 'придурки', 'педики', 'имбецилы', 'дебилы', 'выродки']:
        return 'люди'
    if word in ['клоунов', 'ублюдков', 'уродов']:
        return 'людей'
    # Non-toxic words are kept unchanged
    if toxicity.get(word, 1.0) <= threshold:
        return word

    if word not in toxicity and word not in embs_voc:
        return None

    threshold = min(toxicity.get(word, threshold), threshold)
    word = normalize(word)
    word_emb = calc_embs([word])
    if word_emb is None or word_emb[0] is None:
        return None
    # The KDTree synonym lookup from the baseline is disabled on purpose:
    # toxic words are removed (None is returned) instead of substituted.
    #for i in embs_tree.query(word_emb)[1][0]:
    #    other_word = embs_voc_by_id[nontoxic_emb_inds[i]]
    #    if (other_word != word or allow_self) and toxicity.get(other_word, 1.0) <= threshold:
    #        return other_word

    return None

### Function to detox the whole line

In [35]:
def detox(line):
    words = normalize(line).split()
    fixed_words = [find_closest_nontoxic(word, allow_self=True) or '' for word in words]
    return ' '.join(fixed_words)

### Compare initial text, our detoxed version and the proposed detoxed version

In [30]:
ids = 1
tried = texts[ids]
print(tried)
print(detox(tried))
print(df_neutral.loc[ids, 'neutral'])

мусор создаваемый cnn и другими информационными агентствами возмутителен
 создаваемый cnn и другими информационными агентствами возмутителен
новости, создаваемые CNN и другими информационными агентствами, возмутительны.


### Generate predictions for the whole dataset

In [39]:
%%time
new_texts = []
for i in tqdm(range(len(texts))):
  new_texts.append(detox(texts[i]))

  0%|          | 0/1298 [00:00<?, ?it/s]

CPU times: user 6.44 s, sys: 30 s, total: 36.4 s
Wall time: 8min 56s


In [40]:
df = pd.DataFrame(new_texts, columns=["colummn"])
df.to_csv('dataset_dev_remove_dictionary_add.csv', sep='\t', encoding='utf-8')