## Practical assignment for lesson 2: "User profiling. Segmentation."

1. *Review tfidf on your own (documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
2. Modify the get_user_embedding function so that it computes the median instead of the mean (np.mean in the example). Apply this transformation to the data, train a churn prediction model, then compute and save the quality metrics: roc auc, precision/recall/f_score (for the last three, select the optimal threshold)
3. Repeat item 2 using max instead of the median
4. *Using what you learned in item 1, repeat item 2, this time weighting the news items by tfidf (taking the user's list of news items)
    - hint 1: you need a weight coefficient for each document. Not all documents are equally informative or carry a positive signal
    - hint 2: the weight should be the idf specifically.
5. Produce a single table comparing the quality of the 2/3 different methods of building user embeddings (median, max, idf_mean) by the metrics roc_auc, precision, recall, f_score
6. Draw your own conclusions and hypotheses about why one method or another turned out more effective than the rest
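
For item 1 and hint 2, it may help to see what an idf weight looks like on a tiny corpus. The sketch below is not part of the assignment: it computes the smoothed formula that the scikit-learn documentation gives for `TfidfVectorizer` with default settings, `idf = ln((1 + n) / (1 + df)) + 1`. The helper name `smoothed_idf` and the toy documents are made up for illustration.

```python
import math

def smoothed_idf(docs):
    """Return {term: idf} using idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term once per document (document frequency)
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

# Toy tokenized "news items" (hypothetical data)
docs = [
    ["bank", "rate", "growth"],
    ["bank", "rocket", "launch"],
    ["bank", "rate", "budget"],
]
idf = smoothed_idf(docs)

# A term found in every document gets the minimum weight,
# a term found in a single document gets the largest one
for term in ("bank", "rate", "rocket"):
    print(term, round(idf[term], 3))
```

The rarer the term, the larger its weight, which is why idf is a natural candidate for "how informative is this document".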

In [1]:
from gensim.corpora.dictionary import Dictionary
from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel
from gensim.test.utils import datapath
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import numpy as np
import pandas as pd
import pymorphy2
from razdel import tokenize
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (f1_score, roc_auc_score, precision_score, average_precision_score,
                             classification_report, precision_recall_curve, confusion_matrix)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

In [2]:
news = pd.read_csv("articles.csv")
print(news.shape)
news.head(3)

(27000, 2)


Unnamed: 0,doc_id,title
0,6,Заместитель председателяnправительства РФnСерг...
1,4896,Матч 1/16 финала Кубка России по футболу был п...
2,4897,Форвард «Авангарда» Томаш Заборский прокоммент...


In [3]:
users = pd.read_csv("users_articles.csv")
users.head(3)

Unnamed: 0,uid,articles
0,u105138,"[293672, 293328, 293001, 293622, 293126, 1852]"
1,u108690,"[3405, 1739, 2972, 1158, 1599, 322665]"
2,u108339,"[1845, 2009, 2356, 1424, 2939, 323389]"


In [4]:
nltk.download('stopwords')
stopword_ru = stopwords.words('russian')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Shkin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
len(stopword_ru)

151

In [6]:
with open('stopwords.txt', encoding='utf-8') as f:
    additional_stopwords = [w.strip() for w in f.readlines() if w]
    
stopword_ru += additional_stopwords
len(stopword_ru)

776

In [7]:
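def clean_text(text):
    '''
    Clean the text.

    Returns the cleaned text.
    '''
    if not isinstance(text, str):
        text = str(text)
    
    text = text.lower()
    text = text.strip('\n').strip('\r').strip('\t')
    text = re.sub(r"-\s\r\n\|-\s\r\n|\r\n", '', str(text))

    # Brackets are escaped instead of written as `[[]`, which makes `re`
    # emit a "possible nested set" FutureWarning
    text = re.sub(r"[0-9]|[-—.,:;_%©«»?*!@#№$^•·&()]|[+=]|\[|\]|[/]", '', text)
    text = re.sub(r"\r\n\t|\n|\\s|\r\t|\\n", ' ', text)
    text = re.sub(r'[\xad]|[\s+]', ' ', text.strip())
    # The source data appears to contain a literal 'n' where line breaks used to be;
    # note that this also removes the Latin letter 'n' inside any word
    text = re.sub('n', ' ', text)
    
    return text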

cache = {}
morph = pymorphy2.MorphAnalyzer()

def lemmatization(text):    
    '''
    Lemmatization:
        [0] if the input is not a `str`, convert it to `str`
        [1] tokenize the sentence with razdel
        [2] check whether the word starts with '-'
        [3] skip single-character tokens
        [4] check whether the word is already in the cache
        [5] lemmatize the word
        [6] filter out stop words

    Returns a list of lemmatized tokens.
    '''

    # [0]
    if not isinstance(text, str):
        text = str(text)
    
    # [1]
    tokens = list(tokenize(text))
    words = [_.text for _ in tokens]

    words_lem = []
    for w in words:
        if w[0] == '-': # [2]
            w = w[1:]
        if len(w) > 1: # [3]
            if w in cache: # [4]
                words_lem.append(cache[w])
            else: # [5]
                temp_cach = cache[w] = morph.parse(w)[0].normal_form
                words_lem.append(temp_cach)
    
    words_lem_without_stopwords = [i for i in words_lem if i not in stopword_ru] # [6]
    
    return words_lem_without_stopwords

In [8]:
%%time
tqdm.pandas()
# Run text cleaning
news['title'] = news['title'].progress_apply(lambda x: clean_text(x))

100%|██████████| 27000/27000 [00:19<00:00, 1359.21it/s]

CPU times: total: 19.9 s
Wall time: 19.9 s





In [9]:
news['title'].iloc[:10]

0    заместитель председателя правительства рф серг...
1    матч  финала кубка россии по футболу был приос...
2    форвард авангарда томаш заборский прокомментир...
3    главный тренер кубани юрий красножан прокоммен...
4    решением попечительского совета владивостокско...
5    ио главного тренера вячеслав буцаев прокоммент...
6    запорожский металлург дома потерпел разгромное...
7    сборная сша одержала победу над австрией со сч...
8    бывший защитник сборной россии дарюс каспарайт...
9    полузащитник цска зоран тошич после победы над...
Name: title, dtype: object

In [10]:
%%time
# Run text lemmatization
news['title'] = news['title'].progress_apply(lambda x: lemmatization(x))

100%|██████████| 27000/27000 [02:15<00:00, 199.71it/s]

CPU times: total: 2min 15s
Wall time: 2min 15s





In [11]:
# Build the list of our texts
texts = list(news['title'].values)

# Build the dictionary and the bag-of-words corpus from the list of texts
common_dictionary = Dictionary(texts)
common_corpus = [common_dictionary.doc2bow(text) for text in texts]

Start training

In [12]:
N_topic = 20

In [13]:
%%time

# Train the model on the corpus
lda = LdaModel(common_corpus, num_topics=N_topic, id2word=common_dictionary, random_state=29)#, passes=10)

CPU times: total: 35.6 s
Wall time: 32.4 s


In [14]:
# Save the model to disk
temp_file = datapath("model.lda")
lda.save(temp_file)

In [15]:
# Load the trained model from disk
lda = LdaModel.load(temp_file)

In [16]:
x = lda.show_topics(num_topics=N_topic, num_words=7, formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

# Print the words only
for topic, words in topics_words:
    print(f"topic_{topic}: " + " ".join(words))

topic_0: млрд рубль составить млн бюджет ставка рост
topic_1: тело обнаружить взрыв выяснить сша газ космос
topic_2: автор свидетель лётчик выделить диапазон умереть фаза
topic_3: ракета снижение запуск энергия испытание израиль брюссель
topic_4: статья кровь земля свет способность кость вход
topic_5: предприниматель египетский гражданство sa концепция пилотировать ведение
topic_6: поверхность восток германия египет иран европа франция
topic_7: место рейтинг температура россиянин первый вода млн
topic_8: россия сша рынок российский цена рост всё
topic_9: исследование журнал всё женщина газета день писать
topic_10: гражданин остров конкурс фронт народный памятник супруг
topic_11: ребёнок жизнь смерть возраст автор организм советский
topic_12: газ участок торговый площадь глава москва задержать
topic_13: погибнуть фонд выяснить доллар дыра вирус воздух
topic_14: военный сша эксперимент земля рак армия северный
topic_15: российский россия банк население решение сторона объём
topic_16: рос

In [17]:
def get_lda_vector(lda, text):

    unseen_doc = common_dictionary.doc2bow(text)
    lda_tuple = lda[unseen_doc]

    not_null_topics = dict(lda_tuple)

    output_vector = []
    for i in range(N_topic):
        output_vector.append(not_null_topics[i] if i in not_null_topics else 0)
        
    return np.array(output_vector)

In [18]:
%%time
topic_matrix = pd.DataFrame([get_lda_vector(lda, text) for text in news['title'].values])
topic_matrix.columns = [f'topic_{i}' for i in range(N_topic)]
topic_matrix['doc_id'] = news['doc_id'].values
topic_matrix = topic_matrix[['doc_id']+[f'topic_{i}' for i in range(N_topic)]]

CPU times: total: 27.2 s
Wall time: 24.4 s


In [19]:
topic_matrix.head()

Unnamed: 0,doc_id,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.042943,0.0,...,0.0,0.0,0.056732,0.0,0.0,0.0,0.0,0.891777,0.0,0.0
1,4896,0.0,0.585589,0.0,0.0,0.0,0.0,0.390715,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4897,0.0,0.095284,0.136507,0.030956,0.199992,0.0,0.193319,0.116941,0.104652,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4898,0.0,0.0,0.0,0.0,0.0,0.0,0.086819,0.11057,0.221759,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063256,0.0,0.507816
4,4899,0.0,0.0,0.0,0.058675,0.0,0.0,0.0,0.0,0.0,...,0.0,0.077323,0.244675,0.0,0.053,0.0,0.0,0.428206,0.0,0.0


In [20]:
doc_dict = dict(zip(topic_matrix['doc_id'].values, topic_matrix[[f'topic_{i}' for i in range(N_topic)]].values))

In [21]:
doc_dict[293672]

array([0.        , 0.        , 0.13186674, 0.        , 0.09756994,
       0.        , 0.        , 0.        , 0.29824179, 0.        ,
       0.        , 0.07639583, 0.37508217, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [22]:
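import ast

def get_user_embedding(user_articles_list, doc_dict, func='mean'):
    # The article list is stored as a string such as "[293672, 293328, ...]";
    # ast.literal_eval parses it without executing arbitrary code (unlike eval)
    user_articles_list = ast.literal_eval(user_articles_list)
    user_vector = np.array([doc_dict[doc_id] for doc_id in user_articles_list])
    
    # match/case requires Python 3.10+
    match func:
        case 'mean':
            user_vector = np.mean(user_vector, axis=0)
        case 'median':
            user_vector = np.median(user_vector, axis=0)
        case 'max':
            user_vector = np.max(user_vector, axis=0)
        case _:
            # Fail loudly instead of silently returning the unaggregated matrix
            raise ValueError(f"unknown aggregation: {func!r}")
            
    return user_vector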

In [23]:
target = pd.read_csv("users_churn.csv")
target.head(3)

Unnamed: 0,uid,churn
0,u107120,0
1,u102277,0
2,u102444,0


In [24]:
embds = []
for func in ('mean', 'median', 'max'):
    user_embeddings = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding(x, doc_dict, func=func))])
    user_embeddings.columns = [f'topic_{i}' for i in range(N_topic)]
    user_embeddings['uid'] = users['uid'].values
    user_embeddings = user_embeddings[['uid']+[f'topic_{i}' for i in range(N_topic)]]
    embds.append(user_embeddings)
    
user_embeddings_mean = embds[0]
user_embeddings_median = embds[1]
user_embeddings_max = embds[2]

In [25]:
user_embeddings_mean.head()

Unnamed: 0,uid,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,u105138,0.002669,0.025017,0.021978,0.0,0.038701,0.0,0.017695,0.026245,0.135578,...,0.03998,0.155588,0.081138,0.005453,0.020918,0.125869,0.027897,0.151842,0.0,0.009785
1,u108690,0.037412,0.003313,0.005821,0.00353,0.014091,0.0,0.0,0.008299,0.197037,...,0.008439,0.060942,0.050219,0.001755,0.051332,0.203402,0.059888,0.171751,0.0,0.007435
2,u108339,0.014135,0.024985,0.007149,0.0,0.004763,0.00204,0.003104,0.034605,0.056011,...,0.014784,0.024339,0.077272,0.025605,0.058922,0.227227,0.111113,0.199366,0.0,0.02061
3,u101138,0.0,0.036439,0.033607,0.014503,0.0,0.0,0.068481,0.215124,0.135114,...,0.0,0.031991,0.017394,0.0,0.035167,0.021153,0.037008,0.114231,0.0,0.22026
4,u108248,0.026361,0.032143,0.006323,0.0,0.014334,0.0,0.004212,0.059714,0.161864,...,0.005953,0.052074,0.022068,0.009924,0.057653,0.11316,0.079171,0.157355,0.0,0.021711


In [26]:
def get_scores(user_embeddings, target, scaled=False, plot=False):

    X = pd.merge(user_embeddings, target, 'left')
    X_train, X_test, y_train, y_test = train_test_split(X.drop(['uid', 'churn'], axis=1), 
                                                    X['churn'], random_state=29)
    
    if scaled:
        scaler = StandardScaler()
        X_train = scaler.fit_transform(X_train)
        X_test = scaler.transform(X_test)
    
    model = LogisticRegression()
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_test)[:, 1]
    
    if plot:
        n = 50
        plt.figure(figsize=(10, 6))
        plt.plot(preds[:n], label='predict')
        plt.plot(y_test.values[:n], label='true')
        plt.title('model output')
        plt.xlabel('example #')
        plt.ylabel('output')
        plt.legend()
        plt.grid()
        plt.show()

    precision, recall, thresholds = precision_recall_curve(y_test, preds)
    fscore = []
    for i in range(len(precision)):
        if (precision[i] + recall[i]) != 0:
            fscore.append((2 * precision[i] * recall[i]) / (precision[i] + recall[i]))
        else:
            fscore.append(0)
            
    ix = np.argmax(np.array(fscore))
    f_score_ = round(fscore[ix], 3)
    precision_ = round(precision[ix], 3)
    recall_ = round(recall[ix], 3)
    roc_auc_score_ = round(roc_auc_score(y_test, preds), 3)
    ap_score_ = round(average_precision_score(y_test, preds), 3)
    print(f'F-score:\t{f_score_}\n'
          f'Precision:\t{precision_}\n'
          f'Recall:\t\t{recall_}\n'
          f'ROC-AUC score:\t{roc_auc_score_}\n'
          f'AP-score:\t{ap_score_}')
        
    return f_score_, precision_, recall_, roc_auc_score_, ap_score_
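
`get_scores` reports the best F-score but not the threshold that produces it, although the assignment asks for the optimal threshold. A minimal sketch of how that threshold could be recovered is shown below. The helper `best_f_threshold` is hypothetical, and the input arrays are made up; they only mimic the shape of what `precision_recall_curve` returns, where `precision` and `recall` have one more element than `thresholds`.

```python
import numpy as np

def best_f_threshold(precision, recall, thresholds):
    """Return (threshold, f_score, precision, recall) at the F-score maximum."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    # Vectorized F-score; points with precision + recall == 0 get F = 0
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    # The final (precision=1, recall=0) point has no threshold, so it is skipped
    ix = int(np.argmax(fscore[:len(thresholds)]))
    return float(thresholds[ix]), float(fscore[ix]), float(precision[ix]), float(recall[ix])

# Made-up curve with the same layout as precision_recall_curve output
precision = [0.5, 0.6, 0.75, 1.0, 1.0]
recall = [1.0, 0.9, 0.6, 0.2, 0.0]
thresholds = [0.1, 0.3, 0.5, 0.8]

threshold, f, p, r = best_f_threshold(precision, recall, thresholds)
print(f"threshold={threshold}, F={f:.3f}, precision={p}, recall={r}")
```

Predictions at or above the returned threshold would then be labelled as churn.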

#### Mean, no tfidf

In [27]:
f_score_mean, precision_mean, recall_mean, roc_auc_score_mean, ap_score_mean = get_scores(user_embeddings_mean, target)

F-score:	0.634
Precision:	0.559
Recall:		0.733
ROC-AUC score:	0.923
AP-score:	0.626


#### Median, no tfidf

In [28]:
f_score_median, precision_median, recall_median, roc_auc_score_median, ap_score_median = get_scores(user_embeddings_median, target)

F-score:	0.71
Precision:	0.614
Recall:		0.842
ROC-AUC score:	0.958
AP-score:	0.77


#### Max, no tfidf

In [29]:
f_score_max, precision_max, recall_max, roc_auc_score_max, ap_score_max = get_scores(user_embeddings_max, target)

F-score:	0.745
Precision:	0.718
Recall:		0.773
ROC-AUC score:	0.962
AP-score:	0.816


### tfidf:

In [30]:
texts_tfidf = [' '.join(text) for text in texts]

Fit on the already preprocessed text, using the existing dictionary:

In [31]:
tf = TfidfVectorizer(vocabulary=common_dictionary.token2id)
tf.fit(texts_tfidf)

TfidfVectorizer(vocabulary={'aa': 60397, 'aaa': 102346, 'aaas': 135467,
                            'aabar': 68677, 'aacsb': 97861, 'aad': 51893,
                            'aamal': 102225, 'aamaq': 71366, 'aami': 76620,
                            'aaplo': 96784, 'aaq': 59120, 'aar': 90219,
                            'aarata': 103513, 'aarhus': 134368, 'aaro': 64137,
                            'aatip': 128768, 'aatsa': 98251, 'ab': 51755,
                            'aba': 38521, 'ababeel': 60163, 'abalsslv': 73069,
                            'abba': 29215, 'abbey': 32026, 'abbott': 105400,
                            'abbraccio': 75134, 'abbvie': 113001, 'abby': 99106,
                            'abbyy': 38966, 'abc': 16452, 'abccomlive': 95641, ...})

Assign weights to the words, and then to the topics:

In [32]:
weights_dict = {}
for word, idx in common_dictionary.token2id.items():
    weights_dict[word] = tf.idf_[idx]

In [33]:
topics_weights = []
for topic in topics_words:
    weight = np.mean([weights_dict[word] for word in topic[1]])
    topics_weights.append(weight)

In [34]:
topics_weights[:5]

[3.426062221135475,
 3.976909277630117,
 4.811976917613817,
 4.576623806870413,
 4.566009741992158]

Compute the document weights by multiplying the topic weights by the topic probabilities (shares) in each document:

In [35]:
topic_matrix_tfidf = topic_matrix.copy()
doc_weights = sum(topic_matrix_tfidf[f'topic_{i}'] * topics_weights[i] for i in range(N_topic))

In [36]:
doc_weights.shape

(27000,)

Obtain the weighted document vectors by multiplying them by the weights:

In [37]:
for i in range(N_topic):
    topic_matrix_tfidf[f'topic_{i}'] *= doc_weights
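
The two steps above (one scalar weight per document, then rescaling the document's topic vector by it) can be checked on a toy example. All numbers here are made up and the variable names are illustrative; only the arithmetic mirrors the cells above.

```python
import numpy as np

# Two toy documents over three topics (rows are topic shares)
topic_shares = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.1, 0.9],
])
# Hypothetical per-topic weights (mean idf of each topic's top words)
topic_weights = np.array([3.0, 5.0, 4.0])

# Document weight = sum over topics of share * topic weight
doc_weights = topic_shares @ topic_weights

# Each document vector is scaled by its own scalar weight
weighted = topic_shares * doc_weights[:, None]

print(doc_weights)
print(weighted)
```

Because the weight is a single scalar per document, it changes the length of the vector but not its direction. The effect on user embeddings comes later, when vectors of different documents are aggregated and the more heavily weighted documents contribute more.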

In [38]:
topic_matrix_tfidf.head()

Unnamed: 0,doc_id,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.130596,0.0,...,0.0,0.0,0.172528,0.0,0.0,0.0,0.0,2.711991,0.0,0.0
1,4896,0.0,2.32616,0.0,0.0,0.0,0.0,1.552054,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4897,0.0,0.370395,0.530643,0.120335,0.777428,0.0,0.751486,0.454583,0.406811,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4898,0.0,0.0,0.0,0.0,0.0,0.0,0.319524,0.406938,0.816152,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.232804,0.0,1.868945
4,4899,0.0,0.0,0.0,0.192758,0.0,0.0,0.0,0.0,0.0,...,0.0,0.254019,0.803796,0.0,0.174114,0.0,0.0,1.406721,0.0,0.0


Repeat the same steps for the new dataset:

In [39]:
doc_dict_tfidf = dict(zip(topic_matrix_tfidf['doc_id'].values, topic_matrix_tfidf[[f'topic_{i}' for i in range(N_topic)]].values))

In [40]:
embds_tfidf = []
for func in ('mean', 'median', 'max'):
    user_embeddings = pd.DataFrame([i for i in users['articles'].apply(lambda x: get_user_embedding(x, doc_dict_tfidf, func=func))])
    user_embeddings.columns = [f'topic_{i}' for i in range(N_topic)]
    user_embeddings['uid'] = users['uid'].values
    user_embeddings = user_embeddings[['uid']+[f'topic_{i}' for i in range(N_topic)]]
    embds_tfidf.append(user_embeddings)
    
user_embeddings_mean_tfidf = embds_tfidf[0]
user_embeddings_median_tfidf = embds_tfidf[1]
user_embeddings_max_tfidf = embds_tfidf[2]

#### Mean, tfidf

In [41]:
f_score_mean_tfidf, precision_mean_tfidf, recall_mean_tfidf, roc_auc_score_mean_tfidf, ap_score_mean_tfidf = get_scores(user_embeddings_mean_tfidf, target)

F-score:	0.665
Precision:	0.648
Recall:		0.684
ROC-AUC score:	0.936
AP-score:	0.68


#### Median, tfidf

In [42]:
f_score_median_tfidf, precision_median_tfidf, recall_median_tfidf, roc_auc_score_median_tfidf, ap_score_median_tfidf = get_scores(user_embeddings_median_tfidf, target)

F-score:	0.742
Precision:	0.666
Recall:		0.838
ROC-AUC score:	0.968
AP-score:	0.813


#### Max, tfidf

In [43]:
f_score_max_tfidf, precision_max_tfidf, recall_max_tfidf, roc_auc_score_max_tfidf, ap_score_max_tfidf = get_scores(user_embeddings_max_tfidf, target)

F-score:	0.754
Precision:	0.712
Recall:		0.802
ROC-AUC score:	0.965
AP-score:	0.823


Collect all metrics into a table:

In [44]:
indices = ['F-score', 'Precision', 'Recall', 'ROC-AUC', 'AP-score']
columns = ['Mean', 'Median', 'Max', 'Mean_IDF', 'Median_IDF', 'Max_IDF']

data = np.array([[f_score_mean, f_score_median, f_score_max, f_score_mean_tfidf, f_score_median_tfidf, f_score_max_tfidf],
                 [precision_mean, precision_median, precision_max, precision_mean_tfidf, precision_median_tfidf, precision_max_tfidf],
                 [recall_mean, recall_median, recall_max, recall_mean_tfidf, recall_median_tfidf, recall_max_tfidf],
                 [roc_auc_score_mean, roc_auc_score_median, roc_auc_score_max, roc_auc_score_mean_tfidf, roc_auc_score_median_tfidf, roc_auc_score_max_tfidf],
                 [ap_score_mean, ap_score_median, ap_score_max, ap_score_mean_tfidf, ap_score_median_tfidf, ap_score_max_tfidf]
                ])
df = pd.DataFrame(data, index=indices, columns=columns)
df

Unnamed: 0,Mean,Median,Max,Mean_IDF,Median_IDF,Max_IDF
F-score,0.634,0.71,0.745,0.665,0.742,0.754
Precision,0.559,0.614,0.718,0.648,0.666,0.712
Recall,0.733,0.842,0.773,0.684,0.838,0.802
ROC-AUC,0.923,0.958,0.962,0.936,0.968,0.965
AP-score,0.626,0.77,0.816,0.68,0.813,0.823


Check the class imbalance of the target variable:

In [45]:
target['churn'].value_counts(normalize=True)

0    0.875
1    0.125
Name: churn, dtype: float64

<u>Conclusions:</u>  
In this task ROC-AUC proved to be a poor guide: it reports a sky-high score  
even though the other metrics are mediocre and the model is simple. The reason is that  
the positive class is much rarer than the negative one (12.5% vs 87.5%), and the metric  
is built on FPR (false positives as a share of all negative objects).  
With a large share of negative objects FPR often stays close to zero, which pushes ROC-AUC  
toward one. For the analysis I therefore used the average precision score (AP-score), which is  
roughly an analogue of PR-AUC. ROC-AUC is kept to satisfy the homework requirements.  
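
A short worked example may make this argument concrete. The rates and counts below are illustrative assumptions, not values measured in this notebook; only the 12.5% positive share matches the target distribution above. The same operating point on the ROC curve (same TPR and FPR) gives very different precision depending on class balance.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by a ROC operating point and the class counts."""
    tp = tpr * n_pos   # true positives
    fp = fpr * n_neg   # false positives
    return tp / (tp + fp)

# Same ROC point (TPR = 0.8, FPR = 0.05), 1000 users in both cases
imbalanced = precision_at(0.8, 0.05, n_pos=125, n_neg=875)  # 12.5% churn
balanced = precision_at(0.8, 0.05, n_pos=500, n_neg=500)    # 50% churn

print(f"precision with 12.5% positives: {imbalanced:.3f}")
print(f"precision with 50% positives:   {balanced:.3f}")
```

ROC-AUC does not see this difference because both of its axes are normalized within a class, whereas precision, and therefore AP-score, depends on the class ratio.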
<br>
The table above shows that the best scores (F-score and AP-score are treated as the main metrics) are obtained when  
the user's topic weights are the maximum over the 6 documents, and noticeably worse scores when  
they are the mean over the documents. This happens because the topic weights of documents  
contain many zeros, so a user's interest in a topic is diluted whenever the user later reads a document  
unrelated to that topic. The median partly solves this by not letting the weights drift toward zero,  
but it has a problem of its own: if 4 of the 6 weights are zero, the user appears to have no interest  
in the topic at all. With only 6 documents per user, taking the maximum turned out to be the best option,  
although it is not ideal either: a single news item with a strong slant toward one topic is enough  
for the user to be treated as a fan of that topic.  
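
The behaviour of the three aggregations on sparse topic vectors can be seen on a toy user. The six "documents" below are made up: the first topic is strong in two documents and absent from the other four, the second is weak but present almost everywhere.

```python
import numpy as np

# 6 documents x 2 topics (hypothetical topic shares)
docs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 0.3],
    [0.0, 0.0],
    [0.0, 0.1],
    [0.0, 0.2],
])

mean_vec = docs.mean(axis=0)          # strong interest is diluted by zeros
median_vec = np.median(docs, axis=0)  # 4 zeros out of 6 erase the first topic
max_vec = docs.max(axis=0)            # one strong document is enough

print("mean:  ", mean_vec.round(3))
print("median:", median_vec.round(3))
print("max:   ", max_vec.round(3))
```

For the first topic the mean keeps roughly a third of the peak value, the median drops it to zero, and the max keeps it in full, which matches the trade-offs described above.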
<br>
Using tfidf improved model quality: F-score, AP-score and ROC-AUC increased for all three aggregations. The weighting works as follows:  
the topic weights (shares) of each document are scaled up depending on the topic shares in the document and on the weights of those topics (that is, on  
how rare or common their top words are). As a result, documents with a large share of rare topics receive the largest boost.  
It is hard to say why this raised the metrics, since we do not know which factors drive customer churn,  
but the model evidently found some dependencies.