# Introduction

**In this kernel I will try to share my understanding of, and findings about, cross-lingual models. Feel free to correct me if I have made any mistakes.**

# Part 1 : Understanding cross lingual models

The paper [**Cross-lingual Language Model Pretraining**](https://arxiv.org/abs/1901.07291) by Facebook AI introduces XLM, an improved version of BERT that achieves state-of-the-art results in both classification and translation tasks. XLM uses a known pre-processing technique (BPE) and a dual-language training mechanism with BERT in order to learn relations between words in different languages. The model outperforms other models in a cross-lingual classification task (sentence entailment in 15 languages) and significantly improves machine translation when a pre-trained model is used to initialize the translation model.

# XLM is based on several key concepts:

**Transformers** : The Transformer architecture is at the core of almost all the recent major developments in NLP. It introduced an attention mechanism that processes the entire text input simultaneously to learn contextual relations between words (or sub-words). A Transformer includes two parts: an encoder that reads the text input and generates a latent representation of it (e.g. a vector for each word), and a decoder that produces the translated text from that representation.

**A High-Level Look**

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

![](https://jalammar.github.io/images/t/the_transformer_3.png)

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

![](https://jalammar.github.io/images/t/The_transformer_encoders_decoders.png)

The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

![](https://jalammar.github.io/images/t/The_transformer_encoder_decoder_stack.png)

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

![](https://jalammar.github.io/images/t/Transformer_encoder.png)

The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
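The two encoder sub-layers described above can be sketched in a few lines of NumPy. This is only a minimal illustration under simplifying assumptions: a single attention head, random weights, and no residual connections or layer normalization, all of which a real Transformer encoder has. The names `self_attention` and `feed_forward` are illustrative, not from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # x has shape (seq_len, d_model); every position attends to every position.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

def feed_forward(x, w1, b1, w2, b2):
    # The same two-layer network is applied to each position independently.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
x = rng.normal(size=(seq_len, d_model))       # one vector per input word
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

attended, weights = self_attention(x, w_q, w_k, w_v)
out = feed_forward(attended,
                   rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                   rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(weights.shape, out.shape)
```

Row `i` of `weights` shows how strongly word `i` looks at every other word while it is being encoded.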

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

![](https://jalammar.github.io/images/t/Transformer_decoder.png)

This Transformer architecture outperformed both RNNs and CNNs (convolutional neural networks). The computational resources required to train models were reduced as well. A win-win for everyone in NLP. Check out the below comparison:
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/03/transformercomparison.png)

The below animation wonderfully illustrates how Transformer works on a machine translation task:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/03/transform20fps.gif)

A vanilla Transformer language model has only limited context for each word, i.e. only its predecessors. In 2018, BERT used the Transformer’s encoder to learn a language model by masking (dropping) some of the words and then trying to predict them, which allows it to use the entire context, i.e. the words to both the left and the right of a masked word.
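The masking idea can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: BERT selects about 15% of the tokens and replaces only most of them with the mask token (some are swapped for random tokens or left unchanged), whereas this sketch always uses `[MASK]`. The function name `mask_tokens` is hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask token.

    Returns the masked sequence and a dict {position: original token},
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Because the model sees the unmasked words on both sides of each `[MASK]`, it can use left and right context at once.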

**How does XLM work?**

The paper [**Cross-lingual Language Model Pretraining**](https://arxiv.org/abs/1901.07291) presents two innovative ideas — a new training technique of BERT for multilingual classification tasks and the use of BERT as initialization of machine translation models.

These are the languages the XLM model supports: en-es-fr-de-zh-ru-pt-it-ar-ja-id-tr-nl-pl-simple-fa-vi-sv-ko-he-ro-no-hi-uk-cs-fi-hu-th-da-ca-el-bg-sr-ms-bn-hr-sl-zh_yue-az-sk-eo-ta-sh-lt-et-ml-la-bs-sq-arz-af-ka-mr-eu-tl-ang-gl-nn-ur-kk-be-hy-te-lv-mk-zh_classical-als-is-wuu-my-sco-mn-ceb-ast-cy-kn-br-an-gu-bar-uz-lb-ne-si-war-jv-ga-zh_min_nan-oc-ku-sw-nds-ckb-ia-yi-fy-scn-gan-tt-am.



The figures below illustrate the process of cross-lingual sentiment classification. We assume that the opinion units have already been determined. The English train set is used to train a classifier. The Spanish test set is mapped accordingly and the classifier is tested on this cross-lingual test set:
![](https://www.researchgate.net/profile/Jeremy_Barnes5/publication/309312650/figure/fig1/AS:669424235323406@1536614583578/The-process-of-cross-lingual-sentiment-classification-We-assume-that-the-opinion-units_W640.jpg)
Here L1 means language 1 and L2 means language 2.
![](https://slideplayer.com/slide/12311059/73/images/13/Cross-lingual+Document+Classification.jpg)

* First, instead of using words or characters as the input of the model, it uses Byte-Pair Encoding (BPE), which splits the input into the most common sub-words across all languages, thereby increasing the shared vocabulary between languages.

* Second, it upgrades the BERT architecture in two manners:

     1.  Each training sample consists of the same text in two languages, whereas in BERT each sample is built from a single language. As in BERT, the goal of the model is to predict the masked tokens. With the new architecture, however, the model can use the context from one language to predict tokens in the other, as different words are masked in each language (they are chosen randomly).   
     
     2. The model also receives the language ID and the order of the tokens in each language, i.e. the Positional Encoding, separately. The new metadata helps the model learn the relationship between related tokens in different languages.
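To make the BPE step from the first point above concrete, here is a toy sketch of how BPE merges are learned: repeatedly find the most frequent adjacent symbol pair and merge it into one symbol. The word frequencies are made up, and this is not the tooling used to build the XLM vocabulary; it only illustrates why related words in different languages end up sharing sub-words.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing each occurrence of `pair` into one symbol.
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Hypothetical counts for English and Spanish words with a shared stem.
word_freqs = {"national": 5, "nation": 6, "nacional": 4, "internacional": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=6)
print(merges)
print(list(vocab))
```

Pairs such as `n` + `a` occur in both languages, so the learned sub-words are shared between them.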

**The upgraded BERT is denoted as Translation Language Modeling (TLM) while the “vanilla” BERT with BPE inputs is denoted as Masked Language Modeling (MLM).**

The complete XLM model was trained by training both MLM and TLM and alternating between them.

![](https://miro.medium.com/max/1400/0*lBYVNRe1esIXn1qE.png)
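A TLM training example can be sketched as below, assuming the layout shown in the paper's figure: two parallel sentences are concatenated, positions restart at 0 for the second sentence, each token carries a language ID, and tokens are masked at random in both sentences. The function name and the choice of `</s>` as a delimiter are illustrative; real XLM inputs are BPE sub-word IDs, not whole words.

```python
import random

def build_tlm_example(src_tokens, tgt_tokens, src_lang, tgt_lang,
                      mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    src = ["</s>"] + src_tokens + ["</s>"]
    tgt = ["</s>"] + tgt_tokens + ["</s>"]
    tokens = src + tgt
    # Positions restart for the second sentence; language IDs mark each half.
    positions = list(range(len(src))) + list(range(len(tgt)))
    languages = [src_lang] * len(src) + [tgt_lang] * len(tgt)
    inputs, labels = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)     # not a prediction target
    return inputs, positions, languages, labels

inputs, positions, languages, labels = build_tlm_example(
    "the curtains were blue".split(),
    "les rideaux étaient bleus".split(),
    "en", "fr", mask_prob=0.3, seed=0,
)
for row in zip(inputs, positions, languages, labels):
    print(row)
```

When an English word is masked, its French translation is usually still visible in the same sample, which is what lets the model align the two languages.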

To assess the contribution of the model, the paper presents its results on a sentence entailment task (classifying the relationship between sentences) using the XNLI dataset, which includes sentences in 15 languages. The model significantly outperforms other prominent models.

# Unsupervised Cross-lingual Representation Learning at Scale

Abstract : 
This paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

# Part 2: Read Data

In [None]:
import pandas as pd
train1 = pd.read_csv("train1.csv")
train2 = pd.read_csv("train2.csv")
train2.toxic = train2.toxic.round().astype(int)

valid = pd.read_csv('validation.csv')
test = pd.read_csv('test.csv')
sub = pd.read_csv('sample_submission.csv')

In [None]:
#forgive my memory
import gc
train = pd.concat([
    train1[['comment_text', 'toxic']],
    train2[['comment_text', 'toxic']].query('toxic==1'),
    train2[['comment_text', 'toxic']].query('toxic==0').sample(n=100000, random_state=0)
   
])
del train1,train2
gc.collect()

# Part 3 : Clean Data

In [None]:
import re  # needed by clean_text below
import nltk
nltk.download('punkt')
from nltk import sent_tokenize

LANGS = {
    'en': 'english',
    'it': 'italian', 
    'fr': 'french', 
    'es': 'spanish',
    'tr': 'turkish', 
    'ru': 'russian',
    'pt': 'portuguese'
}

def get_sentences(text, lang='en'):
    return sent_tokenize(text, LANGS.get(lang,'english'))

def exclude_duplicate_sentences(text, lang='en'):
    sentences = []
    for sentence in get_sentences(text, lang):
        sentence = sentence.strip()
        if sentence not in sentences:
            sentences.append(sentence)
    return ' '.join(sentences)

def clean_text(text, lang='en'):
    text = str(text)
    text = re.sub(r'[0-9"]', '', text)
    text = re.sub(r'#[\S]+\b', '', text)
    text = re.sub(r'@[\S]+\b', '', text)
    text = re.sub(r'https?\S+', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = exclude_duplicate_sentences(text, lang)
    return text.strip()

!pip install pandarallel
import re
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=2, progress_bar=True)
train['lang']='en'
train['comment_text'] = train.parallel_apply(lambda x: clean_text(x['comment_text'], x['lang']), axis=1)
valid['comment_text'] = valid.parallel_apply(lambda x: clean_text(x['comment_text'], x['lang']), axis=1)
test['comment_text'] = test.parallel_apply(lambda x: clean_text(x['content'], x['lang']), axis=1)
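To see what the regular-expression steps inside `clean_text` do, the sketch below applies the same five substitutions to a made-up sample string. It leaves out the NLTK duplicate-sentence step so that it runs without any downloads; `strip_noise` is a name introduced here for illustration.

```python
import re

def strip_noise(text):
    text = str(text)
    text = re.sub(r'[0-9"]', '', text)      # drop digits and double quotes
    text = re.sub(r'#[\S]+\b', '', text)    # drop hashtags
    text = re.sub(r'@[\S]+\b', '', text)    # drop @mentions
    text = re.sub(r'https?\S+', '', text)   # drop URLs
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
    return text.strip()

sample = 'Check https://example.com now!! @user123 said #topic1 is "great" in 2020.'
print(strip_noise(sample))
# prints: Check now!! said is great in .
```

Note that digits are removed before anything else, so a year such as `2020` disappears entirely and leaves a stray full stop behind.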

**If you want a deeper clean**

**Apply text cleaning functions such as clean_my_text, replace_typical_misspell, handle_contractions and fix_quote to the train, validation and test sets**

In [None]:
# https://www.kaggle.com/chenshengabc/from-quest-encoding-ensemble-a-little-bit-differen

puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£',
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '\xa0', '\t',
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '\u3000', '\u202f',
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '«',
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

mispell_dict = {"aren't" : "are not",
"can't" : "cannot",
"couldn't" : "could not",
"couldnt" : "could not",
"didn't" : "did not",
"doesn't" : "does not",
"doesnt" : "does not",
"don't" : "do not",
"hadn't" : "had not",
"hasn't" : "has not",
"haven't" : "have not",
"havent" : "have not",
"he'd" : "he would",
"he'll" : "he will",
"he's" : "he is",
"i'd" : "I would",
"i'll" : "I will",
"i'm" : "I am",
"isn't" : "is not",
"it's" : "it is",
"it'll":"it will",
"i've" : "I have",
"let's" : "let us",
"mightn't" : "might not",
"mustn't" : "must not",
"shan't" : "shall not",
"she'd" : "she would",
"she'll" : "she will",
"she's" : "she is",
"shouldn't" : "should not",
"shouldnt" : "should not",
"that's" : "that is",
"thats" : "that is",
"there's" : "there is",
"theres" : "there is",
"they'd" : "they would",
"they'll" : "they will",
"they're" : "they are",
"theyre":  "they are",
"they've" : "they have",
"we'd" : "we would",
"we're" : "we are",
"weren't" : "were not",
"we've" : "we have",
"what'll" : "what will",
"what're" : "what are",
"what's" : "what is",
"what've" : "what have",
"where's" : "where is",
"who'd" : "who would",
"who'll" : "who will",
"who're" : "who are",
"who's" : "who is",
"who've" : "who have",
"won't" : "will not",
"wouldn't" : "would not",
"you'd" : "you would",
"you'll" : "you will",
"you're" : "you are",
"you've" : "you have",
"'re": " are",
"wasn't": "was not",
"we'll":"we will",
"tryin'":"trying"}


def clean_my_text(x):
    x = str(x).replace("\n","")
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x


def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

from nltk.tokenize.treebank import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()

def handle_contractions(x):
    x = tokenizer.tokenize(x)
    return x

def fix_quote(x):
    x = [x_[1:] if x_.startswith("'") else x_ for x_ in x]
    x = ' '.join(x)
    return x

def _get_mispell(mispell_dict):
    mispell_re = re.compile('(%s)' % '|'.join(mispell_dict.keys()))
    return mispell_dict, mispell_re


def replace_typical_misspell(text):
    mispellings, mispellings_re = _get_mispell(mispell_dict)

    def replace(match):
        return mispellings[match.group(0)]

    return mispellings_re.sub(replace, text)


def clean_data(df, columns: list):
    for col in columns:
#         df[col] = df[col].apply(lambda x: clean_numbers(x))
        # expand contractions first: clean_my_text pads "'" with spaces, after which "don't" would no longer match
        df[col] = df[col].apply(lambda x: replace_typical_misspell(str(x).lower()))
        df[col] = df[col].apply(lambda x: clean_my_text(x))
        df[col] = df[col].apply(lambda x: handle_contractions(x))  
        df[col] = df[col].apply(lambda x: fix_quote(x))   
    
    return df

input_columns = ['comment_text']
train = clean_data(train, input_columns)
valid = clean_data(valid, input_columns)
# the raw test text lives in the 'content' column
input_columns = ['content']
test = clean_data(test, input_columns)

del tokenizer
gc.collect()

**As we can see, this takes 15+ minutes, so it is a good idea to save the results**
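One way to avoid repeating a slow step is to cache its result on disk. The helper below is a minimal sketch using only the standard library; `cache_to_disk` and the file name in the usage note are hypothetical, and for DataFrames `to_csv`/`read_csv` would work just as well.

```python
import os
import pickle

def cache_to_disk(path, compute):
    """Return the pickled result at `path`, computing and saving it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()          # the slow step runs only on a cache miss
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In this kernel it could be used as `train = cache_to_disk('train_clean.pkl', lambda: clean_data(train, ['comment_text']))`, so that a second run loads the cleaned frame instead of recomputing it.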

# Part 4 : Know your data and do EDA (Easy Data Augmentation)

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
def nonan(x):
    if type(x) == str:
        return x.replace("\n", "")
    else:
        return ""

text = ' '.join([nonan(abstract) for abstract in train["comment_text"]])
wordcloud = WordCloud(max_font_size=None, background_color='black', collocations=False,
                      width=1200, height=1000).generate(text)
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Common words in comments')
plt.axis('off')
plt.show()

**You can do this for all of the train data or only for the toxic==1 rows**

In [None]:
#random drop: remove a random 20% of the words in each comment
new=[]
import random
for text in train['comment_text']:
    right=str(text).strip().split()
    #sample positions, not words, so repeated words are dropped at the sampled places
    nums=set(random.sample(range(0,len(right)),int(len(right)*0.2)))
    right=' '.join(w for k,w in enumerate(right) if k not in nums)
    new.append(right)
df1=pd.DataFrame({'comment_text':new})
df1['lang']='en'
df1['toxic']=train['toxic'].tolist()

In [None]:
#random swap: exchange two randomly chosen words, repeated for 20% of the comment length
new=[]
import random
for text in train['comment_text']:
    right=str(text).strip().split()
    for _ in range(int(len(right)*0.2)):
        a=random.randrange(len(right))
        b=random.randrange(len(right))
        right[a],right[b]=right[b],right[a]
    new.append(' '.join(right))
df2=pd.DataFrame({'comment_text':new})
df2['lang']='en'
df2['toxic']=train['toxic'].tolist()

In [None]:
#replace important words
import random
#You know why I choose this word :)
REPLACEMENTS={
    'Trump':['T','Donald','president','TD'],
    'people':['someone','somebody','person','man','women'],
    'will':['would','hope','want','like','plan','wish'],
    'one':['two','girl','boy','it','him','her'],
    'FUCK':['Fxxx','Fxxk','Fxk','fk','FK','fuck'],
}
new=[]
for text in train['comment_text']:
    right=str(text).strip().split()
    for k,word in enumerate(right):
        if word in REPLACEMENTS:
            #swap in a random alternative followed by a random number below 10000
            right[k]=random.choice(REPLACEMENTS[word])+' '+str(random.randrange(10000))
    new.append(' '.join(right))
df3=pd.DataFrame({'comment_text':new})
df3['lang']='en'
df3['toxic']=train['toxic'].tolist()

In [None]:
#DataFrame.append was removed in pandas 2.0, so concatenate instead
train=pd.concat([train,df1,df2,df3],ignore_index=True)

In [None]:
#shuffle
from sklearn.utils import shuffle
train=shuffle(train)
train=train.reset_index(drop=True)

**Obviously, the augmented data makes training take longer; if you have limited time, skip it**

# Part 5 : Let's play

**Thanks to huggingface and jplu**
https://huggingface.co/transformers/model_doc/xlmroberta.html#xlmrobertatokenizer

In [None]:
#encoder
MODEL = 'jplu/tf-xlm-roberta-large'
from transformers import TFAutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)


import numpy as np
def regular_encode(texts, tokenizer, maxlen=512):
    # argument names below are those of transformers >= 3.0
    enc_di = tokenizer.batch_encode_plus(
        list(texts), 
        return_attention_mask=False, 
        return_token_type_ids=False,
        padding='max_length',
        truncation=True,
        max_length=maxlen
    )
    
    return np.array(enc_di['input_ids'])
MAX_LEN = 192
x_train= regular_encode(train.comment_text.values, tokenizer, maxlen=MAX_LEN)
x_valid= regular_encode(valid.comment_text.values, tokenizer, maxlen=MAX_LEN)
x_test= regular_encode(test.comment_text.values, tokenizer, maxlen=MAX_LEN)

y_train = train.toxic.values
y_valid = valid.toxic.values
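What `regular_encode` returns is a rectangular array: every comment becomes exactly `MAX_LEN` token IDs, cut off if too long and padded if too short. The sketch below shows that shaping step alone on made-up ID lists. The pad ID of 1 is an assumption matching XLM-R's `<pad>` token, and unlike the real tokenizer this sketch does not keep the closing special token when truncating.

```python
def pad_or_truncate(ids, maxlen, pad_id=1):
    # Cut sequences that are too long, then right-pad the short ones.
    ids = ids[:maxlen]
    return ids + [pad_id] * (maxlen - len(ids))

# Hypothetical token IDs for a short and a long comment.
batch = [[0, 581, 7515, 2], [0, 10, 20, 30, 40, 50, 60, 2]]
encoded = [pad_or_truncate(ids, maxlen=6) for ids in batch]
print(encoded)
# prints: [[0, 581, 7515, 2, 1, 1], [0, 10, 20, 30, 40, 50]]
```

A smaller `MAX_LEN` means less memory and faster epochs, at the cost of dropping the tail of long comments.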

In [None]:
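#if you want it faster
#note: this expects a tokenizer from the `tokenizers` library (such as the BertWordPieceTokenizer built in Part 7)
from tqdm import tqdm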
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    """
    https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    tokenizer.enable_truncation(max_length=maxlen)
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)

**Encoding takes ~20 minutes here, which is almost one epoch of training time..!! :(**

In [None]:
# Detect hardware, return appropriate distribution strategy
import tensorflow as tf
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

**Oh yeah, we have 8 TPU cores ~~~**

In [None]:
#build dataset
AUTO = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 16 * strategy.num_replicas_in_sync 
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(438411)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)
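The batch arithmetic used above is worth spelling out: the per-replica batch of 16 is multiplied by the number of replicas, and the number of steps per epoch follows from the dataset size. The 100,000 examples below are a made-up figure for illustration, and the 8 replicas assume the TPU setup reported earlier.

```python
def global_batch_size(per_replica, num_replicas):
    # Each replica processes `per_replica` examples per step.
    return per_replica * num_replicas

def steps_per_epoch(num_examples, batch_size):
    # Integer division: a trailing partial batch is not counted.
    return num_examples // batch_size

batch = global_batch_size(16, 8)
steps = steps_per_epoch(100_000, batch)
print(batch, steps)
# prints: 128 781
```

On a single GPU or CPU `strategy.num_replicas_in_sync` is 1, so the same code trains with a batch of 16 and eight times as many steps.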

In [None]:
def label_smoothing(y_true,y_pred):
    return tf.keras.losses.binary_crossentropy(y_true,y_pred,label_smoothing=0)
#label_smoothing is in [0,1]; 0 (used here) disables it, larger values pull the 0/1 targets towards 0.5
#the softened targets are similar in spirit to the temperature in knowledge distillation
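To see what a non-zero `label_smoothing` would change, the sketch below reproduces the target transformation that Keras applies for binary cross-entropy, `y * (1 - s) + 0.5 * s`, in plain Python. The predictions are made-up numbers and the function names are illustrative.

```python
import math

def smooth_labels(y_true, label_smoothing):
    # Pull hard 0/1 targets towards 0.5 by the smoothing amount.
    return [y * (1.0 - label_smoothing) + 0.5 * label_smoothing for y in y_true]

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean binary cross-entropy with clipping to avoid log(0).
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

y_true = [1.0, 0.0]
y_pred = [0.99, 0.01]          # a very confident model
smoothed = smooth_labels(y_true, 0.2)
print(smoothed)                # targets move to roughly 0.9 and 0.1
print(binary_crossentropy(y_true, y_pred))
print(binary_crossentropy(smoothed, y_pred))
```

With smoothing, the very confident predictions are penalized more than without it, which discourages the model from pushing its outputs all the way to 0 or 1.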

In [None]:
#build model
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Lambda, concatenate, Activation
def build_model(transformer, max_len=512):
    """
    https://www.kaggle.com/xhlulu/jigsaw-tpu-distilbert-with-huggingface-and-keras
    """
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
#     cls_token = sequence_output[:, 0, :]
    cls_token = sequence_output
    
    x1 = tf.keras.layers.Conv1D(128, 2,padding='same')(cls_token)
    
    x1 = tf.keras.layers.Dropout(0.15)(x1)
    x1 = tf.keras.layers.LeakyReLU()(x1)
    x1 = tf.keras.layers.Dropout(0.15)(x1)
   
    
    x1 = tf.keras.layers.Conv1D(64, 2,padding='same')(x1)
    x1 = tf.keras.layers.Dropout(0.15)(x1)
    x1 = tf.keras.layers.Dense(1)(x1)
    x1 = tf.keras.layers.Dropout(0.15)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    

    out = Dense(1, activation='sigmoid')(x1)
    model = Model(inputs=input_word_ids, outputs=out)     
    model.compile(Adam(learning_rate=1e-5), loss=label_smoothing, metrics=['accuracy'])
    
    
    return model

In [None]:
#load model to tpu
with strategy.scope():
    transformer_layer = TFAutoModel.from_pretrained(MODEL)
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
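As a sanity check on `model.summary()`, the size of the small head that `build_model` adds on top of the transformer can be worked out by hand. The sketch assumes the hidden size of XLM-R large is 1024 and `MAX_LEN` is 192 as set above; the dropout and LeakyReLU layers have no weights.

```python
def conv1d_params(in_channels, filters, kernel_size):
    # One kernel of shape (kernel_size, in_channels) per filter, plus a bias each.
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    return in_features * units + units

HIDDEN = 1024   # assumed hidden size of XLM-R large
MAX_LEN = 192

head = [
    ("Conv1D(128, 2)", conv1d_params(HIDDEN, 128, 2)),
    ("Conv1D(64, 2)", conv1d_params(128, 64, 2)),
    ("Dense(1) applied per position", dense_params(64, 1)),
    ("Dense(1) after Flatten", dense_params(MAX_LEN, 1)),
]
for name, n in head:
    print(f"{name:32s}{n:>10,d}")
total = sum(n for _, n in head)
print(f"{'total':32s}{total:>10,d}")
```

The head has under 300 thousand weights, which is tiny next to the transformer itself, so almost all of the training cost comes from fine-tuning the pretrained layers.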

In [None]:
#train1
EPOCHS=2
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

In [None]:
#train2
n_steps = x_valid.shape[0] // BATCH_SIZE
train_history_2 = model.fit(
    valid_dataset.repeat(),
    steps_per_epoch=n_steps,
    epochs=EPOCHS
)

In [None]:
#predict
sub['toxic'] = model.predict(test_dataset, verbose=1)
sub.to_csv('submission.csv', index=False)

# Part 6 : Do more

 **Learning rate schedule**

In [None]:
def build_lrfn(lr_start=0.000001, lr_max=0.000002, 
               lr_min=0.0000001, lr_rampup_epochs=7, 
               lr_sustain_epochs=0, lr_exp_decay=.87):
    lr_max = lr_max * strategy.num_replicas_in_sync

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_exp_decay**(epoch - lr_rampup_epochs - lr_sustain_epochs) + lr_min
        return lr
    
    return lrfn

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 7))
lrfn = build_lrfn()
plt.plot([i for i in range(35)], [lrfn(i) for i in range(35)]);
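For readers who prefer numbers to a plot, here is a standalone copy of the schedule that prints a few values. It assumes 8 replicas (passed in as `num_replicas` instead of reading `strategy`), so the peak is 8 × 2e-6 = 1.6e-5.

```python
def build_lrfn(lr_start=1e-6, lr_max=2e-6, lr_min=1e-7, lr_rampup_epochs=7,
               lr_sustain_epochs=0, lr_exp_decay=0.87, num_replicas=8):
    lr_max = lr_max * num_replicas   # scale the peak with the number of replicas

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            # linear warm-up from lr_start to lr_max
            return (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        if epoch < lr_rampup_epochs + lr_sustain_epochs:
            return lr_max
        # exponential decay towards lr_min
        steps = epoch - lr_rampup_epochs - lr_sustain_epochs
        return (lr_max - lr_min) * lr_exp_decay ** steps + lr_min

    return lrfn

lrfn = build_lrfn()
for epoch in (0, 1, 7, 8, 34):
    print(epoch, f"{lrfn(epoch):.3e}")
```

With `EPOCHS=2` as set earlier, only epochs 0 and 1 of the warm-up are ever reached, so the learning rate stays well below its peak unless the number of epochs or `lr_rampup_epochs` is changed.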

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler
model_path = 'weights_{epoch:03d}_{val_loss:.4f}.hdf5'
checkpoint = ModelCheckpoint(model_path, monitor='val_accuracy', mode='max', save_best_only=True)
es = EarlyStopping(monitor='val_accuracy', mode='max', patience=50, 
                   restore_best_weights=True, verbose=1)
lr_callback = LearningRateScheduler(lrfn, verbose=1)
callback_list = [checkpoint,  lr_callback]

In [None]:
n_steps = x_train.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS,
    callbacks=callback_list
)

In [None]:
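import glob
# model_path is a filename template, so look for the checkpoints that were actually written
checkpoints = sorted(glob.glob('weights_*.hdf5'))
if checkpoints:
    # with save_best_only=True the checkpoint from the latest epoch is the best one so far
    model.load_weights(checkpoints[-1])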

# Part 7 : Maybe you are a BERT fan

**Multilingual DistilBERT**: DistilBERT is **2 times faster and 25% lighter** than multilingual BERT base, all while retaining **92% of its performance**. This model lets you quickly experiment with different ideas, and when you are ready for the real thing, just change two lines of code to use `bert-base-multilingual-cased`.

In [None]:
import transformers
from tokenizers import BertWordPieceTokenizer
# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
fast_tokenizer

**Encode the text with regular_encode or fast_encode and carry on**

In [None]:
%%time
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()

# Part 8 : Pseudo labels

In [None]:
sub=pd.read_csv('submission.csv')
sub=sub.merge(test[['id','lang','comment_text']],on='id',how='left')
#keep only the confident toxic predictions and use them as extra training labels
sub=sub[sub.toxic>0.95].copy()
sub['toxic']=1
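The selection rule in the cell above can be sketched on plain Python data: keep the test rows whose predicted probability exceeds a threshold and treat the prediction as a hard label. The rows below are made up, and dictionaries stand in for DataFrame rows.

```python
def select_pseudo_labels(rows, threshold=0.95):
    # Keep confident toxic predictions and turn them into hard labels.
    return [dict(row, toxic=1) for row in rows if row["toxic"] > threshold]

predictions = [
    {"id": 0, "comment_text": "first test comment", "toxic": 0.99},
    {"id": 1, "comment_text": "second test comment", "toxic": 0.40},
    {"id": 2, "comment_text": "third test comment", "toxic": 0.96},
]
pseudo = select_pseudo_labels(predictions)
print([row["id"] for row in pseudo])
# prints: [0, 2]
```

The selected rows can then be appended to the training set before another round of training. Only confident positives are kept here; confident negatives could be added in the same way with a low threshold.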

# Part 9 : Merge

In [None]:
def scale_min_max_submission(submission):
    min_, max_ = submission['toxic'].min(), submission['toxic'].max()
    submission['toxic'] = (submission['toxic'] - min_) / (max_ - min_)
    return submission
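#sub1 is assumed to be a second submission (another model's predictions) loaded the same way as sub
sub['toxic'] = (scale_min_max_submission(sub)['toxic'] + scale_min_max_submission(sub1)['toxic']) / 2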
sub['toxic'].hist(bins=100)
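The blending step can be checked on a tiny example: each model's scores are stretched to the range [0, 1] and then averaged. The two score lists below are made up, and plain lists stand in for the submission columns.

```python
def scale_min_max(values):
    # Map the smallest score to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def blend(a, b):
    # Average two sets of predictions after putting them on the same scale.
    return [(x + y) / 2 for x, y in zip(scale_min_max(a), scale_min_max(b))]

model_a = [0.25, 0.5, 0.75]
model_b = [0.0, 1.0, 0.5]
print(blend(model_a, model_b))
# prints: [0.0, 0.75, 0.75]
```

Min-max scaling keeps the ranking of each model's predictions, so it does not change either model's own AUC; it only stops a model with a wider score range from dominating the average. The sketch would divide by zero if all scores of one model were identical.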

# Part 10 : Adversarial training

In [None]:
from tensorflow.keras import backend as K
def adversarial_training(model, embedding_name, epsilon=1):
   
    if model.train_function is None:  
        model._make_train_function()  
    old_train_function = model.train_function  

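    # lookup embedding layer
    # (search_layer is not defined in this kernel; it is assumed to be the helper from bert4keras.backend)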
    for output in model.outputs:
        embedding_layer = search_layer(output, embedding_name)
        if embedding_layer is not None:
            break
    if embedding_layer is None:
        raise Exception('Embedding layer not found')

    #find the gradients of embedding
    embeddings = embedding_layer.embeddings  
    gradients = K.gradients(model.total_loss, [embeddings])  
    gradients = K.zeros_like(embeddings) + gradients[0]  

    
    inputs = (
        model._feed_inputs + model._feed_targets + model._feed_sample_weights
    )  
    embedding_gradients = K.function(
        inputs=inputs,
        outputs=[gradients],
        name='embedding_gradients',
    )  

    #epsilon is the size (L2 norm) of the adversarial perturbation
    def train_function(inputs): 
        grads = embedding_gradients(inputs)[0]  
        delta = epsilon * grads / (np.sqrt((grads**2).sum()) + 1e-8)  
        K.set_value(embeddings, K.eval(embeddings) + delta) 
        outputs = old_train_function(inputs)  
        K.set_value(embeddings, K.eval(embeddings) - delta)  
        return outputs

    model.train_function = train_function  


#start
adversarial_training(model, 'Embedding-Token', 0.5)
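The core of the function above is the perturbation it adds to the embeddings before each training step: the gradient rescaled to length `epsilon`, in the style of the Fast Gradient Method. The sketch below isolates that computation on a made-up gradient vector and mirrors the add, train, restore order; it is an illustration in plain Python, not a replacement for the Keras code.

```python
import math

def fgm_delta(grads, epsilon=0.5, tiny=1e-8):
    # Rescale the gradient so that its L2 norm is (almost exactly) epsilon.
    norm = math.sqrt(sum(g * g for g in grads))
    return [epsilon * g / (norm + tiny) for g in grads]

embeddings = [1.0, -2.0]
grads = [3.0, 4.0]                   # made-up gradient with L2 norm 5
delta = fgm_delta(grads, epsilon=0.5)

perturbed = [e + d for e, d in zip(embeddings, delta)]   # 1. add the perturbation
# 2. a normal training step would run here on the perturbed embeddings
restored = [p - d for p, d in zip(perturbed, delta)]     # 3. remove it again

print(delta)
print(math.sqrt(sum(d * d for d in delta)))
```

The perturbation points in the direction that increases the loss fastest, so the model is trained on slightly harder inputs; a larger `epsilon` makes the perturbation stronger.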