# Text Features In CatBoost

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/catboost/tutorials/blob/master/events/2020_06_04_catboost_tutorial/text_features.ipynb)

**Set GPU as hardware accelerator**

First, you need to select a GPU as the hardware accelerator. This takes two simple steps:
Step 1. Open the **Runtime** menu and select **Change runtime type**.
Step 2. Choose **GPU** as the hardware accelerator.
That's all!

Let's install CatBoost.

In [None]:
!pip install catboost

In this tutorial we use the **IMDB** movie review dataset from [Kaggle](https://www.kaggle.com) for our experiments. The data can be downloaded [here](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

In [4]:
import os
import pandas as pd
import numpy as np
np.set_printoptions(precision=4)

import catboost
print(catboost.__version__)

0.24.2


## Preparing data

In [112]:
!wget https://transfersh.com/ou7jB/imdb.csv -O imdb.csv
df = pd.read_csv('imdb.csv')
df['label'] = (df['sentiment'] == 'positive').astype(int)
df.drop(['sentiment'], axis=1, inplace=True)
df.head()

--2020-11-17 21:11:35--  https://transfersh.com/ou7jB/imdb.csv
Resolving transfersh.com (transfersh.com)... 64:ff9b::6bb2:6ca6
Connecting to transfersh.com (transfersh.com)|64:ff9b::6bb2:6ca6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 66212309 (63M) [text/csv]
Saving to: ‘imdb.csv’


2020-11-17 21:11:41 (13.4 MB/s) - ‘imdb.csv’ saved [66212309/66212309]



Unnamed: 0,review,label
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [6]:
from catboost import Pool
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, train_size=0.8, random_state=0)
y_train, X_train = train_df['label'], train_df.drop(['label'], axis=1)
y_test, X_test = test_df['label'], test_df.drop(['label'], axis=1)

train_pool = Pool(data=X_train, label=y_train, text_features=['review'])
test_pool = Pool(data=X_test, label=y_test, text_features=['review'])

print('Train dataset shape: {}\n'.format(train_pool.shape))



Train dataset shape: (40000, 1)



In [None]:
train_df.to_csv('imdb_train.')

In [59]:
from catboost import CatBoostClassifier

def fit_model(train_pool, test_pool, **kwargs):
    model = CatBoostClassifier(
        iterations=1000,
        learning_rate=0.05,
        eval_metric='AUC',
#         task_type='GPU',
        **kwargs
    )

    return model.fit(
        train_pool,
        eval_set=test_pool,
        verbose=100,
    )

model = fit_model(train_pool, test_pool)

0:	test: 0.8742963	best: 0.8742963 (0)	total: 224ms	remaining: 3m 43s
100:	test: 0.9385780	best: 0.9385780 (100)	total: 23.5s	remaining: 3m 28s
200:	test: 0.9460053	best: 0.9460053 (200)	total: 46.2s	remaining: 3m 3s
300:	test: 0.9505063	best: 0.9505063 (300)	total: 1m 9s	remaining: 2m 41s
400:	test: 0.9529535	best: 0.9529535 (400)	total: 1m 31s	remaining: 2m 17s
500:	test: 0.9546771	best: 0.9546771 (500)	total: 1m 56s	remaining: 1m 55s
600:	test: 0.9556983	best: 0.9556983 (600)	total: 2m 21s	remaining: 1m 33s
700:	test: 0.9564580	best: 0.9564580 (700)	total: 2m 45s	remaining: 1m 10s
800:	test: 0.9570723	best: 0.9570723 (800)	total: 3m 7s	remaining: 46.6s
900:	test: 0.9575270	best: 0.9575386 (892)	total: 3m 29s	remaining: 23.1s
999:	test: 0.9579518	best: 0.9579518 (999)	total: 3m 51s	remaining: 0us

bestTest = 0.9579518196
bestIteration = 999



## How does it work?

1. **Text Tokenization**
2. **Dictionary Creation**
3. **Feature Calculation**

## Text Tokenization

Text usually arrives as a sequence of Unicode symbols. Unless the task is something like DNA classification, we don't need that level of granularity; instead, we need to extract larger units such as words. The process of extracting tokens (words, numbers, punctuation marks, or special symbols such as emoji) from a sequence is called **tokenization**.<br>

Tokenization is the first step of text preprocessing in CatBoost. It is performed by simply splitting the sequence on a string pattern (e.g. a space).
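
As a minimal sketch of this idea, independent of CatBoost, splitting on a delimiter can be written in plain Python. The helper name `split_by_delimiter` is made up for this illustration and is not part of the CatBoost API.

```python
def split_by_delimiter(text, delimiter=" "):
    """Split text on a delimiter and drop empty tokens (e.g. from double spaces)."""
    return [token for token in text.split(delimiter) if token]


print(split_by_delimiter("Cats are so cute :)"))
print(split_by_delimiter("Mouse scare..."))
```

Note that punctuation stays attached to the neighbouring word, because only the delimiter is used to cut the string.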

In [8]:
text_small = [
    "Cats are so cute :)",
    "Mouse scare...",
    "The cat defeated the mouse",
    "Cute: Mice gather an army!",
    "Army of mice defeated the cat :(",
    "Cat offers peace",
    "Cat is scared :(",
    "Cat and mouse live in peace :)"
]

target_small = [1, 0, 1, 1, 0, 1, 0, 1]

In [9]:
from catboost.text_processing import Tokenizer

simple_tokenizer = Tokenizer()

def tokenize_texts(texts):
    return [simple_tokenizer.tokenize(text) for text in texts]

simple_tokenized_text = tokenize_texts(text_small)
simple_tokenized_text

[['Cats', 'are', 'so', 'cute', ':)'],
 ['Mouse', 'scare...'],
 ['The', 'cat', 'defeated', 'the', 'mouse'],
 ['Cute:', 'Mice', 'gather', 'an', 'army!'],
 ['Army', 'of', 'mice', 'defeated', 'the', 'cat', ':('],
 ['Cat', 'offers', 'peace'],
 ['Cat', 'is', 'scared', ':('],
 ['Cat', 'and', 'mouse', 'live', 'in', 'peace', ':)']]

### More preprocessing!

Let's take a closer look at the tokenization result for the small text example. The tokens have several problems:

1. Some are glued to punctuation: 'Cute:', 'army!', 'scare...'.
2. The words 'Cat' and 'cat', or 'Mice' and 'mice', have the same meaning, so they should probably be the same token.
3. The same goes for the tokens 'are' and 'is': they are inflected forms of the same word, 'be'.

**Punctuation handling**, **lowercasing**, and **lemmatization** help to overcome these problems.
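
The first two fixes can be approximated with a short regular expression. This is only a rough sketch for ASCII text: the helper name `normalize_and_tokenize` is hypothetical, and CatBoost's own tokenizer may treat characters such as apostrophes or non-Latin letters differently.

```python
import re


def normalize_and_tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(normalize_and_tokenize("Cute: Mice gather an army!"))
print(normalize_and_tokenize("Cat is scared :("))
```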

### Punctuation handling and lowercasing

In [10]:
tokenizer = Tokenizer(
    lowercasing=True,
    separator_type='BySense',
    token_types=['Word', 'Number']
)

tokenized_text = [tokenizer.tokenize(text) for text in text_small]
tokenized_text

[['cats', 'are', 'so', 'cute'],
 ['mouse', 'scare'],
 ['the', 'cat', 'defeated', 'the', 'mouse'],
 ['cute', 'mice', 'gather', 'an', 'army'],
 ['army', 'of', 'mice', 'defeated', 'the', 'cat'],
 ['cat', 'offers', 'peace'],
 ['cat', 'is', 'scared'],
 ['cat', 'and', 'mouse', 'live', 'in', 'peace']]

### Removing stop words

**Stop words** are words considered uninformative for the task, e.g. function words such as *the, is, at, which, on*.
Stop words are usually removed during text preprocessing to reduce the amount of information that later algorithms have to consider.
They are collected manually (as a dictionary) or automatically, for example by taking the most frequent words.
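
The automatic variant can be sketched with `collections.Counter`. The corpus below is made up for illustration, and `most_frequent_tokens` is a hypothetical helper. On a tiny or topic-specific corpus the most frequent tokens may well be content words, so the result should be inspected before use.

```python
from collections import Counter


def most_frequent_tokens(tokenized_texts, top_k):
    """Return the top_k most frequent tokens as a candidate stop-word set."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    return {token for token, _ in counts.most_common(top_k)}


# Made-up toy corpus, not taken from the dataset.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "the", "race"],
]

candidate_stop_words = most_frequent_tokens(toy_corpus, top_k=1)
filtered = [[t for t in tokens if t not in candidate_stop_words] for tokens in toy_corpus]
print(candidate_stop_words)
print(filtered)
```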

In [11]:
stop_words = set(('be', 'is', 'are', 'the', 'an', 'of', 'and', 'in'))

def filter_stop_words(tokens):
    return list(filter(lambda x: x not in stop_words, tokens))
    
tokenized_text_no_stop = [filter_stop_words(tokens) for tokens in tokenized_text]
tokenized_text_no_stop

[['cats', 'so', 'cute'],
 ['mouse', 'scare'],
 ['cat', 'defeated', 'mouse'],
 ['cute', 'mice', 'gather', 'army'],
 ['army', 'mice', 'defeated', 'cat'],
 ['cat', 'offers', 'peace'],
 ['cat', 'scared'],
 ['cat', 'mouse', 'live', 'peace']]

### Lemmatization

A lemma (Wikipedia) is the canonical form, dictionary form, or citation form of a set of words.<br>
For example, the lemma "go" represents the inflected forms "go", "goes", "going", "went", and "gone".<br>
The process of converting a word to its lemma is called **lemmatization**.
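
To make the definition concrete, here is a toy lookup-table lemmatizer built from the "go" example. It is a deliberately simplified sketch: `LEMMA_TABLE` and `lemmatize_with_table` are hypothetical names, and real lemmatizers rely on large lexical resources rather than a hand-written table.

```python
# Hand-written table covering only the "go" example from the text.
LEMMA_TABLE = {"goes": "go", "going": "go", "went": "go", "gone": "go"}


def lemmatize_with_table(tokens, table=LEMMA_TABLE):
    """Replace each token with its lemma if known, otherwise keep it unchanged."""
    return [table.get(token, token) for token in tokens]


print(lemmatize_with_table(["she", "went", "home"]))
```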


In [12]:
import nltk

nltk_data_path = os.path.join(os.path.dirname(nltk.__file__), 'nltk_data')
nltk.data.path.append(nltk_data_path)
nltk.download('wordnet', nltk_data_path)

lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_tokens_nltk(tokens):
    return list(map(lambda t: lemmatizer.lemmatize(t), tokens))

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/nikitxskv/.local/lib/python3.5/site-
[nltk_data]     packages/nltk/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [13]:
text_small_lemmatized_nltk = [lemmatize_tokens_nltk(tokens) for tokens in tokenized_text_no_stop]
text_small_lemmatized_nltk

[['cat', 'so', 'cute'],
 ['mouse', 'scare'],
 ['cat', 'defeated', 'mouse'],
 ['cute', 'mouse', 'gather', 'army'],
 ['army', 'mouse', 'defeated', 'cat'],
 ['cat', 'offer', 'peace'],
 ['cat', 'scared'],
 ['cat', 'mouse', 'live', 'peace']]

Now words with the same meaning are represented by the same token, and tokens are no longer glued to punctuation.

<span style="color:red">Be careful.</span> You should verify for your own task:<br>
Is it really necessary to remove punctuation, lowercase sentences, lemmatize, and/or tokenize by word?<br>

### Let's check the accuracy with the new text preprocessing

Since CatBoost by default doesn't separate punctuation, lowercase letters, or lemmatize, we need to preprocess the text manually and then pass it to the learning algorithm.

Since the only natural-text feature in this dataset is `review`, we preprocess only that column.

In [14]:
%%time

def preprocess_data(X):
    X_preprocessed = X.copy()
    X_preprocessed['review'] = X['review'].apply(lambda x: ' '.join(lemmatize_tokens_nltk(tokenizer.tokenize(x))))
    return X_preprocessed

X_preprocessed_train = preprocess_data(X_train)
X_preprocessed_test = preprocess_data(X_test)

train_processed_pool = Pool(
    X_preprocessed_train, y_train, 
    text_features=['review'],
)

test_processed_pool = Pool(
    X_preprocessed_test, y_test, 
    text_features=['review'],
)

CPU times: user 3min 49s, sys: 652 ms, total: 3min 50s
Wall time: 3min 51s


In [17]:
model_on_processed_data = fit_model(train_processed_pool, test_processed_pool)

0:	test: 0.8772679	best: 0.8772679 (0)	total: 114ms	remaining: 1m 54s
100:	test: 0.9422074	best: 0.9422074 (100)	total: 12.2s	remaining: 1m 48s
200:	test: 0.9513095	best: 0.9513095 (200)	total: 24.2s	remaining: 1m 36s
300:	test: 0.9566328	best: 0.9566328 (300)	total: 36.5s	remaining: 1m 24s
400:	test: 0.9593129	best: 0.9593129 (400)	total: 48.9s	remaining: 1m 13s
500:	test: 0.9610368	best: 0.9610368 (500)	total: 1m 1s	remaining: 1m
600:	test: 0.9622586	best: 0.9622586 (600)	total: 1m 13s	remaining: 48.6s
700:	test: 0.9631853	best: 0.9631853 (700)	total: 1m 25s	remaining: 36.5s
800:	test: 0.9637900	best: 0.9637900 (800)	total: 1m 37s	remaining: 24.2s
900:	test: 0.9642351	best: 0.9642422 (898)	total: 1m 49s	remaining: 12s
999:	test: 0.9646141	best: 0.9646180 (998)	total: 2m 1s	remaining: 0us

bestTest = 0.9646179863
bestIteration = 998

Shrink model to first 999 iterations.


In [60]:
def print_score_diff(first_model, second_model):
    first_accuracy = first_model.best_score_['validation']['AUC']
    second_accuracy = second_model.best_score_['validation']['AUC']

    gap = (second_accuracy - first_accuracy) / first_accuracy * 100

    print('{} vs {} ({:+.2f}%)'.format(first_accuracy, second_accuracy, gap))
    
print_score_diff(model, model_on_processed_data)

0.9579518196391623 vs 0.9646179862813278 (+0.70%)


## Dictionary Creation

After the first stage (text preprocessing and tokenization), the second stage begins. It uses the prepared text to select a set of units that will be used to build new numerical features.

The set of selected units is called a dictionary. It may contain words, word bigrams, or character n-grams.
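
Conceptually, building a word dictionary amounts to counting tokens and assigning ids to the ones that are kept. The sketch below is a plain-Python illustration, not CatBoost's implementation: `build_dictionary` is a hypothetical helper, and the ordering rule (by descending count, then alphabetically) is an assumption chosen to make the result deterministic.

```python
from collections import Counter


def build_dictionary(tokenized_texts, max_size=None, min_count=1):
    """Map each kept token to an integer id, most frequent tokens first."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    kept = [(token, count) for token, count in counts.items() if count >= min_count]
    # Assumed tie-breaking: higher count first, then alphabetical order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_size is not None:
        kept = kept[:max_size]
    return {token: token_id for token_id, (token, _) in enumerate(kept)}


docs = [["cat", "so", "cute"], ["mouse", "scare"], ["cat", "defeated", "mouse"]]
print(build_dictionary(docs, max_size=4))
```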

In [11]:
from catboost.text_processing import Dictionary

In [18]:
dictionary = Dictionary(occurence_lower_bound=0, max_dictionary_size=10)

dictionary.fit(text_small_lemmatized_nltk);
#dictionary.fit(text_small, tokenizer)

In [19]:
dictionary.save('dictionary.tsv')
!cat dictionary.tsv

{"end_of_word_token_policy":"Insert","skip_step":"0","start_token_id":"0","token_level_type":"Word","dictionary_format":"id_count_token","end_of_sentence_token_policy":"Skip","gram_order":"1"}
10
0	6	cat
1	5	mouse
2	2	army
3	2	cute
4	2	defeated
5	2	peace
6	1	gather
7	1	live
8	1	offer
9	1	scare


## Feature Calculation

### Conversion into fixed-size vectors

Most classic ML algorithms train and make predictions on a fixed number of features $F$.<br>
That means the learning set $X = \{x_i\}$ contains vectors $x_i = (a_0, a_1, ..., a_F)$ where $F$ is constant.

Since a text object $x$ is not a fixed-length vector, we need to preprocess the original set $D$.<br>
One of the simplest text-to-vector encoding techniques is **Bag of words (BoW)**.

### Bag of words algorithm

The algorithm takes a dictionary and a text as input.<br>
The text $x = (a_0, a_1, ..., a_k)$ is converted into a vector $\tilde x = (b_0, b_1, ..., b_F)$,<br> where $b_i$ is 0 or 1, depending on whether the word with id=$i$ from the dictionary occurs in the text $x$.
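
The definition above can be sketched in a few lines of plain Python, with the dictionary represented as an ordinary `dict` from token to id. The names `bag_of_words_vector` and `token_to_id` are illustrative assumptions, and the tiny dictionary is made up.

```python
def bag_of_words_vector(tokens, token_to_id):
    """Return a 0/1 vector with a 1 at every dictionary id present in the text."""
    vector = [0] * len(token_to_id)
    for token in tokens:
        token_id = token_to_id.get(token)
        if token_id is not None:  # tokens outside the dictionary are ignored
            vector[token_id] = 1
    return vector


token_to_id = {"cat": 0, "mouse": 1, "cute": 2}
print(bag_of_words_vector(["cat", "so", "cute"], token_to_id))
```

Texts of any length map to a vector whose length equals the dictionary size, which is what makes the representation usable by fixed-width models.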

In [14]:
text_small_lemmatized_nltk

[['cat', 'so', 'cute'],
 ['mouse', 'scare'],
 ['cat', 'defeated', 'mouse'],
 ['cute', 'mouse', 'gather', 'army'],
 ['army', 'mouse', 'defeated', 'cat'],
 ['cat', 'offer', 'peace'],
 ['cat', 'scared'],
 ['cat', 'mouse', 'live', 'peace']]

In [15]:
dictionary.apply([text_small_lemmatized_nltk[0]])

[[0, 3]]

In [20]:
def bag_of_words(tokenized_text, dictionary):
    features = np.zeros((len(tokenized_text), dictionary.size))
    for i, tokenized_sentence in enumerate(tokenized_text):
        indices = np.array(dictionary.apply([tokenized_sentence])[0])
        features[i, indices] = 1
    return features

bow_features = bag_of_words(text_small_lemmatized_nltk, dictionary)
bow_features

array([[1., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 1., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 1., 1., 0., 0., 1., 0., 0., 0.],
       [1., 1., 1., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 0., 1., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 1., 0., 1., 0., 0.]])

In [88]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import csr_matrix
from sklearn.metrics import log_loss

def fit_linear_model(X, c):
    model = LogisticRegression()
    model.fit(X, c)
    return model

def fit_naive_bayes(X, c):
    clf = MultinomialNB()
    if isinstance(X, csr_matrix):
        X.eliminate_zeros()
    clf.fit(X, c)
    return clf

def evaluate_model_logloss(model, X, y):
    y_pred = model.predict_proba(X)[:,1]
    metric = log_loss(y, y_pred)
    print('Logloss: ' + str(metric))

In [22]:
def evaluate_models(X, y):
    # Fit on the data passed in, not on the global bow_features / target_small.
    linear_model = fit_linear_model(X, y)
    naive_bayes = fit_naive_bayes(X, y)

    print('Linear model')
    evaluate_model_logloss(linear_model, X, y)
    print('Naive bayes')
    evaluate_model_logloss(naive_bayes, X, y)
    print('Comparing to constant prediction')
    # Baseline: predict probability 0.5 for both classes on every object.
    logloss_constant_prediction = log_loss(y, np.ones(shape=(len(y), 2)) * 0.5)
    print('Logloss: ' + str(logloss_constant_prediction))
    
evaluate_models(bow_features, target_small)

Linear model
Logloss: 0.49830123911420365
Naive bayes
Logloss: 0.4528488772318392
Comparing to constant prediction
Logloss: 0.6931471805599453
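
The constant-prediction baseline can be checked by hand: predicting a probability of 0.5 for every object gives a log loss of $-\ln 0.5 = \ln 2 \approx 0.6931$, whatever the labels are. A small self-contained check, where `binary_logloss` is a hypothetical helper written for this illustration:

```python
import math


def binary_logloss(labels, probabilities):
    """Average negative log-likelihood of binary labels under predicted P(y=1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 1, 1, 0, 1, 0, 1]
baseline = binary_logloss(labels, [0.5] * len(labels))
print(baseline, math.log(2))
```

Any model whose log loss is below this value is doing better than an uninformed guess.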


In [23]:
dictionary = Dictionary(occurence_lower_bound=0)
dictionary.fit(text_small_lemmatized_nltk)

bow_features = bag_of_words(text_small_lemmatized_nltk, dictionary)
evaluate_models(bow_features, target_small)

Linear model
Logloss: 0.46346010998469667
Naive bayes
Logloss: 0.3680393546716464
Comparing to constant prediction
Logloss: 0.6931471805599453


### Looking at sequences of letters / words

Let's look at an example: the texts 'The cat defeated the mouse' and 'Army of mice defeated the cat :('<br>
After simplification, each sentence has three tokens: 'cat defeat mouse' and 'mouse defeat cat'.<br>
After applying BoW we get two identical vectors for sentences with opposite meanings:

| cat | mouse | defeat |
|-----|-------|--------|
| 1   | 1     | 1      |
| 1   | 1     | 1      |

How can we distinguish them?
Let's add sequences of words as single tokens to our dictionary:

| cat | mouse | defeat | cat_defeat | mouse_defeat | defeat_cat | defeat_mouse |
|-----|-------|--------|------------|--------------|------------|--------------|
| 1   | 1     | 1      | 1          | 0            | 0          | 1            |
| 1   | 1     | 1      | 0          | 1            | 1          | 0            |

An **n-gram** is a contiguous sequence of $n$ items from a given sample of text or speech (Wikipedia).<br>
In the example above we used bigrams, i.e. 2-grams of words.

N-grams add more information about text structure to the vectors. Moreover, some n-grams lose their meaning when split into separate words, for example 'Mickey Mouse company'.
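
Word n-grams are easy to extract with a sliding window. The sketch below uses a hypothetical helper, `word_ngrams`, and shows that the two simplified sentences from the table share the same unigrams but have different bigrams.

```python
def word_ngrams(tokens, n):
    """Return all contiguous word n-grams, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


first = ["cat", "defeat", "mouse"]
second = ["mouse", "defeat", "cat"]

print(set(first) == set(second))   # same bag of unigrams
print(word_ngrams(first, 2))
print(word_ngrams(second, 2))
```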

In [24]:
dictionary = Dictionary(occurence_lower_bound=0, gram_order=2)
dictionary.fit(text_small_lemmatized_nltk)

dictionary.save('dictionary.tsv')
!cat dictionary.tsv

{"end_of_word_token_policy":"Insert","skip_step":"0","start_token_id":"0","token_level_type":"Word","dictionary_format":"id_count_token","end_of_sentence_token_policy":"Skip","gram_order":"2"}
17
0	1	army mouse
1	1	cat defeated
2	1	cat mouse
3	1	cat offer
4	1	cat scared
5	1	cat so
6	1	cute mouse
7	1	defeated cat
8	1	defeated mouse
9	1	gather army
10	1	live peace
11	1	mouse defeated
12	1	mouse gather
13	1	mouse live
14	1	mouse scare
15	1	offer peace
16	1	so cute


In [25]:
bow_features = bag_of_words(text_small_lemmatized_nltk, dictionary)
evaluate_models(bow_features, target_small)

Linear model
Logloss: 0.4084388990666391
Naive bayes
Logloss: 0.25985126095069233
Comparing to constant prediction
Logloss: 0.6931471805599453


### Unigram + Bigram

In [26]:
dictionary1 = Dictionary(occurence_lower_bound=0)
dictionary1.fit(text_small_lemmatized_nltk)

bow_features1 = bag_of_words(text_small_lemmatized_nltk, dictionary1)

dictionary2 = Dictionary(occurence_lower_bound=0, gram_order=2)
dictionary2.fit(text_small_lemmatized_nltk)

bow_features2 = bag_of_words(text_small_lemmatized_nltk, dictionary2)

bow_features = np.concatenate((bow_features1, bow_features2), axis=1)
evaluate_models(bow_features, target_small)

Linear model
Logloss: 0.32129697521334455
Naive bayes
Logloss: 0.13103685350656918
Comparing to constant prediction
Logloss: 0.6931471805599453


## CatBoost Configuration

Parameter names:

1. **Text Tokenization** - `tokenizers`
2. **Dictionary Creation** - `dictionaries`
3. **Feature Calculation** - `feature_calcers`

\* More complex configuration with `text_processing` parameter

### `tokenizers`

Tokenizers are used to preprocess Text type feature columns before the dictionary is created.

[Documentation](https://catboost.ai/docs/references/tokenizer_options.html).

```
tokenizers = [{
	'tokenizer_id': 'Space',
	'delimiter': ' ',
	'separator_type': 'ByDelimiter',
},{
	'tokenizer_id': 'Sense',
	'separator_type': 'BySense',
}]
```

### `dictionaries`

Dictionaries are used to preprocess Text type feature columns.

[Documentation](https://catboost.ai/docs/references/dictionaries_options.html).

```
dictionaries = [{
	'dictionary_id': 'Unigram',
	'max_dictionary_size': '50000',
	'gram_order': '1',
},{
	'dictionary_id': 'Bigram',
	'max_dictionary_size': '50000',
	'gram_order': '2',
},{
	'dictionary_id': 'Trigram',
	'token_level_type': 'Letter',
	'max_dictionary_size': '50000',
	'gram_order': '3',
}]
```

### `feature_calcers`

Feature calcers are used to calculate new features based on preprocessed Text type feature columns.

1. **`BoW`**<br>
Bag of words: 0/1 features (whether the text sample contains the token_id or not).<br>
Number of produced numeric features = dictionary size.<br>
Parameters: `top_tokens_count` - the maximum number of tokens used for vectorization in bag of words; the $n$ most frequent tokens are taken (**strongly affects both CPU and GPU RAM usage**).

2. **`NaiveBayes`**<br>
NaiveBayes: a [Multinomial naive bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes) model. It adds as many new features as there are classes. The feature is calculated by analogy with counters in CatBoost, using a permutation ([estimation of CTRs](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html)). In other words, a random permutation is generated, and then we go through the dataset from top to bottom and, for each object, calculate the probability that it belongs to each class.

3. **`BM25`**<br>
[BM25](https://en.wikipedia.org/wiki/Okapi_BM25). It adds as many new features as there are classes. The idea is the same as in Naive Bayes, but for each class we calculate not the conditional probability but a relevance score similar to tf-idf, where tokens play the role of words and classes play the role of documents (or rather, the union of all texts of that class). The tf multiplier in BM25 is replaced with another multiplier, which favors classes that contain rare tokens.

```
feature_calcers = [
	'BoW:top_tokens_count=1000',
	'NaiveBayes',
	'BM25',
]
```
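
The permutation-based computation described for `NaiveBayes` can be illustrated with a simplified plain-Python sketch. This is not CatBoost's implementation: the function name, the Laplace smoothing with `alpha`, and the use of the full vocabulary size are assumptions made for the illustration. The point it shows is that each object's feature is computed only from objects that precede it in a random permutation, so the object's own label never leaks into its feature.

```python
import math
import random
from collections import defaultdict


def ordered_naive_bayes_features(docs, labels, n_classes, seed=0, alpha=1.0):
    """Per-document class probabilities from statistics of preceding documents only."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)  # the random permutation

    vocab_size = len({token for doc in docs for token in doc})
    token_counts = [defaultdict(int) for _ in range(n_classes)]
    total_tokens = [0] * n_classes
    class_counts = [0] * n_classes

    features = [None] * len(docs)
    for seen, idx in enumerate(order):
        # Smoothed log-posterior of each class, using only what was seen so far.
        log_scores = []
        for c in range(n_classes):
            score = math.log((class_counts[c] + alpha) / (seen + alpha * n_classes))
            for token in docs[idx]:
                score += math.log(
                    (token_counts[c][token] + alpha) / (total_tokens[c] + alpha * vocab_size)
                )
            log_scores.append(score)
        # Normalise to probabilities (numerically stable softmax).
        top = max(log_scores)
        exps = [math.exp(s - top) for s in log_scores]
        features[idx] = [e / sum(exps) for e in exps]

        # Only now reveal this document's label and update the statistics.
        label = labels[idx]
        class_counts[label] += 1
        for token in docs[idx]:
            token_counts[label][token] += 1
            total_tokens[label] += 1
    return features


toy_docs = [["cat", "cute"], ["mouse", "scare"], ["cat", "peace"], ["cat", "scared"]]
toy_labels = [1, 0, 1, 0]
for row in ordered_naive_bayes_features(toy_docs, toy_labels, n_classes=2):
    print(row)
```

The first document in the permutation has no history, so it receives the uninformed prior; later documents get progressively better estimates.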

### `text_processing`

```
text_processing = {
    "tokenizers" : [{
        "tokenizer_id" : "Space",
        "separator_type" : "ByDelimiter",
        "delimiter" : " "
    }],

    "dictionaries" : [{
        "dictionary_id" : "BiGram",
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "2"
    }, {
        "dictionary_id" : "Word",
        "max_dictionary_size" : "50000",
        "occurrence_lower_bound" : "3",
        "gram_order" : "1"
    }],

    "feature_processing" : {
        "default" : [{
            "dictionaries_names" : ["BiGram", "Word"],
            "feature_calcers" : ["BoW"],
            "tokenizers_names" : ["Space"]
        }, {
            "dictionaries_names" : ["Word"],
            "feature_calcers" : ["NaiveBayes"],
            "tokenizers_names" : ["Space"]
        }],
    }
}
```
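
Because `feature_processing` refers to tokenizers and dictionaries by name, a typo in a name is easy to make. The helper below is a hypothetical pre-flight check written for this tutorial, not part of CatBoost. It assumes the key names used in the configuration above and reports every referenced name that is not declared.

```python
def find_undeclared_names(config):
    """List (feature, kind, name) for names used in feature_processing but not declared."""
    tokenizer_ids = {t["tokenizer_id"] for t in config.get("tokenizers", [])}
    dictionary_ids = {d["dictionary_id"] for d in config.get("dictionaries", [])}

    missing = []
    for feature, rules in config.get("feature_processing", {}).items():
        for rule in rules:
            for name in rule.get("tokenizers_names", []):
                if name not in tokenizer_ids:
                    missing.append((feature, "tokenizer", name))
            for name in rule.get("dictionaries_names", []):
                if name not in dictionary_ids:
                    missing.append((feature, "dictionary", name))
    return missing


example_config = {
    "tokenizers": [{"tokenizer_id": "Space", "separator_type": "ByDelimiter", "delimiter": " "}],
    "dictionaries": [
        {"dictionary_id": "BiGram", "gram_order": "2"},
        {"dictionary_id": "Word", "gram_order": "1"},
    ],
    "feature_processing": {
        "default": [
            {"dictionaries_names": ["BiGram", "Word"], "feature_calcers": ["BoW"], "tokenizers_names": ["Space"]},
            {"dictionaries_names": ["Word"], "feature_calcers": ["NaiveBayes"], "tokenizers_names": ["Space"]},
        ]
    },
}

print(find_undeclared_names(example_config))
```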

In [56]:
model_on_processed_data_2 = fit_model(
    train_processed_pool,
    test_processed_pool,
    text_processing = {
        "tokenizers" : [{
            "tokenizer_id" : "Space",
            "separator_type" : "ByDelimiter",
            "delimiter" : " "
        }],
    
        "dictionaries" : [{
            "dictionary_id" : "BiGram",
            "max_dictionary_size" : "50000",
            "occurrence_lower_bound" : "3",
            "gram_order" : "2"
        }, {
            "dictionary_id" : "Word",
            "max_dictionary_size" : "50000",
            "occurrence_lower_bound" : "3",
            "gram_order" : "1"
        }],
    
        "feature_processing" : {
            "default" : [{
                "dictionaries_names" : ["BiGram", "Word"],
                "feature_calcers" : ["BoW"],
                "tokenizers_names" : ["Space"]
            }, {
                "dictionaries_names" : ["Word"],
                "feature_calcers" : ["NaiveBayes"],
                "tokenizers_names" : ["Space"]
            }],
        }
    }
)

0:	test: 0.8772679	best: 0.8772679 (0)	total: 172ms	remaining: 2m 51s
100:	test: 0.9422074	best: 0.9422074 (100)	total: 19.1s	remaining: 2m 49s
200:	test: 0.9513095	best: 0.9513095 (200)	total: 37.3s	remaining: 2m 28s
300:	test: 0.9566328	best: 0.9566328 (300)	total: 56.2s	remaining: 2m 10s
400:	test: 0.9593129	best: 0.9593129 (400)	total: 1m 14s	remaining: 1m 50s
500:	test: 0.9610368	best: 0.9610368 (500)	total: 1m 32s	remaining: 1m 32s
600:	test: 0.9622586	best: 0.9622586 (600)	total: 1m 51s	remaining: 1m 13s
700:	test: 0.9631853	best: 0.9631853 (700)	total: 2m 9s	remaining: 55.1s
800:	test: 0.9637900	best: 0.9637900 (800)	total: 2m 27s	remaining: 36.6s
900:	test: 0.9642351	best: 0.9642422 (898)	total: 2m 46s	remaining: 18.3s
999:	test: 0.9646141	best: 0.9646180 (998)	total: 3m 5s	remaining: 0us

bestTest = 0.9646179863
bestIteration = 998

Shrink model to first 999 iterations.


In [None]:
print_score_diff(model, model_on_processed_data_2)

## Summary: Text features in CatBoost

### The algorithm:
1. Input text is loaded as a usual column. ``text_column: [string]``.
2. Each text sample is tokenized by splitting on spaces. ``tokenized_column: [[string]]``.
3. The dictionary is estimated.
4. Each string in the tokenized column is converted into a token_id from the dictionary. ``text: [[token_id]]``.
5. Depending on the parameters, CatBoost produces features based on the resulting text column: Bag of words, Multinomial naive bayes, or BM25.
6. The computed float features are passed into the usual CatBoost learning algorithm.

# Embeddings In CatBoost

### Get Embeddings

In [68]:
# from sentence_transformers import SentenceTransformer
# big_model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
# X_embed_train = big_model.encode(X_train['review'].to_list())
# X_embed_test = big_model.encode(X_test['review'].to_list())

!wget https://transfersh.com/HDHxy/embedded_train.npy -O embedded_train.npy
X_embed_train = np.load('embedded_train.npy')

!wget https://transfersh.com/whOm3/embedded_test.npy -O embedded_test.npy
X_embed_test = np.load('embedded_test.npy')

--2020-11-17 17:22:02--  https://transfersh.com/HDHxy/embedded_train.npy
Resolving transfersh.com (transfersh.com)... 64:ff9b::6bb2:6ca6
Connecting to transfersh.com (transfersh.com)|64:ff9b::6bb2:6ca6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163840128 (156M) []
Saving to: ‘embedded_train.npy’


2020-11-17 17:22:13 (16.0 MB/s) - ‘embedded_train.npy’ saved [163840128/163840128]

--2020-11-17 17:22:18--  https://transfersh.com/whOm3/embedded_test.npy
Resolving transfersh.com (transfersh.com)... 64:ff9b::6bb2:6ca6
Connecting to transfersh.com (transfersh.com)|64:ff9b::6bb2:6ca6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40960128 (39M) []
Saving to: ‘embedded_test.npy’


2020-11-17 17:22:21 (11.9 MB/s) - ‘embedded_test.npy’ saved [40960128/40960128]



### Experiments

In [95]:
X_embed_train_small, y_train_small = X_embed_train[:1000], y_train[:1000]
X_embed_test_small, y_test_small = X_embed_test[:1000], y_test[:1000]

#### Pure embeddings

In [97]:
linmodel = fit_linear_model(X_embed_train_small, y_train_small)
evaluate_model_logloss(linmodel, X_embed_test_small, y_test_small)

Logloss: 0.7683870896879872


#### Linear Discriminant Analysis

In [108]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

clf = LinearDiscriminantAnalysis()
clf.fit(X_embed_train_small[:500], y_train_small[:500])

X_lda_train_small = clf.transform(X_embed_train_small[500:])
X_embed_lda_train_small = np.concatenate([X_embed_train_small[500:], X_lda_train_small], axis=1)

X_lda_test_small = clf.transform(X_embed_test_small)
X_embed_lda_test_small = np.concatenate([X_embed_test_small, X_lda_test_small], axis=1)


linmodel = fit_linear_model(X_embed_lda_train_small, y_train_small[500:])
evaluate_model_logloss(linmodel, X_embed_lda_test_small, y_test_small)

Logloss: 0.812982481413373


### Embeddings in CatBoost

In [109]:
import csv
with open('train_embed_text.tsv', 'w') as f:
    writer = csv.writer(f, delimiter='\t', quotechar='"')
    for y, text, row in zip(y_train, X_preprocessed_train['review'].to_list(), X_embed_train):
        writer.writerow((str(y), text, ';'.join(map(str, row))))

with open('test_embed_text.tsv', 'w') as f:
    writer = csv.writer(f, delimiter='\t', quotechar='"')
    for y, text, row in zip(y_test, X_preprocessed_test['review'].to_list(), X_embed_test):
        writer.writerow((str(y), text, ';'.join(map(str, row))))
        
with open('pool_text.cd', 'w') as f:
    f.write(
        '0\tLabel\n'\
        '1\tText\n'\
        '2\tNumVector'
    )

In [111]:
from catboost import Pool
train_embed_pool = Pool('train_embed_text.tsv', column_description='pool_text.cd')
test_embed_pool = Pool('test_embed_text.tsv', column_description='pool_text.cd')

In [74]:
model_text_embeddings = fit_model(train_embed_pool, test_embed_pool)

0:	test: 0.9189093	best: 0.9189093 (0)	total: 92.2ms	remaining: 1m 32s
100:	test: 0.9573114	best: 0.9573114 (100)	total: 9.45s	remaining: 1m 24s
200:	test: 0.9623914	best: 0.9623914 (200)	total: 18.8s	remaining: 1m 14s
300:	test: 0.9650703	best: 0.9650703 (300)	total: 28.2s	remaining: 1m 5s
400:	test: 0.9665432	best: 0.9665432 (400)	total: 37.6s	remaining: 56.2s
500:	test: 0.9675662	best: 0.9675662 (500)	total: 47.4s	remaining: 47.2s
600:	test: 0.9683025	best: 0.9683025 (600)	total: 57s	remaining: 37.8s
700:	test: 0.9688488	best: 0.9688488 (700)	total: 1m 6s	remaining: 28.2s
800:	test: 0.9691871	best: 0.9691871 (800)	total: 1m 15s	remaining: 18.8s
900:	test: 0.9695822	best: 0.9695822 (900)	total: 1m 24s	remaining: 9.33s
999:	test: 0.9697098	best: 0.9697265 (992)	total: 1m 34s	remaining: 0us

bestTest = 0.9697265166
bestIteration = 992

Shrink model to first 993 iterations.


In [75]:
print_score_diff(model, model_text_embeddings)

0.9579518196391623 vs 0.9697265165993134 (+1.23%)
