## Data

The data is in this [archive](https://drive.google.com/file/d/15o7fdxTgndoy6K-e7g8g1M2-bOOwqZPl/view?usp=sharing). It contains two files:
- `news_train.txt`: the training set
- `news_test.txt`: the test set

News texts covering a period of several years were downloaded from a number of news sites. Each article belongs to one of the following sections: `science`, `style`, `culture`, `life`, `economics`, `business`, `travel`, `forces`, `media`, `sport`.

Each line of a file contains the section label, the article headline, and the article text, for example:

>    **sport**&nbsp;&lt;tab&gt;&nbsp;**Сборная Канады по хоккею разгромила чехов**&nbsp;&lt;tab&gt;&nbsp;**Сборная Канады по хоккею крупно об...**
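
The line format described above can be sketched as follows. This is only an illustration: `parse_line` is a hypothetical helper, and the sample string reuses the truncated example line as-is.

```python
def parse_line(line):
    """Split one raw line into (category, title, text)."""
    # maxsplit=2 keeps any stray tab characters inside the article body.
    category, title, text = line.rstrip("\n").split("\t", 2)
    return category, title, text


sample = "sport\tСборная Канады по хоккею разгромила чехов\tСборная Канады по хоккею крупно об..."
category, title, text = parse_line(sample)
print(category)
print(title)
```

Limiting the split to two tabs is a defensive choice: if an article body happened to contain a tab, a plain `split('\t')` would yield more than three fields and the unpacking would fail.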

In [45]:
import re
from tqdm.notebook import tqdm
from collections import Counter

import numpy as np 
import pandas as pd
import torch

from transformers import BertTokenizer, get_linear_schedule_with_warmup
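from transformers import BertForSequenceClassification, BertConfig
# transformers.AdamW is deprecated; torch.optim.AdamW is the suggested replacement
# (note: its default weight_decay is 0.01, while the transformers version defaulted to 0.0)
from torch.optim import AdamW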
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

from gensim.models import Word2Vec
import pymorphy2

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

# Task

1. Process the data to obtain a set of tokens for each text.
Normalize the tokens with one of the following three options:
    - pymorphy2
    - the Russian [Snowball stemmer](https://www.nltk.org/howto/stem.html)
    - [SentencePiece](https://github.com/google/sentencepiece) or [Huggingface Tokenizers](https://github.com/huggingface/tokenizers)

In [6]:
def preprocessing(text):
    # Lowercase once, then keep only runs of word characters (drops punctuation)
    return re.findall(r'\b\w+\b', text.lower())


def get_data(path):

    morph = pymorphy2.MorphAnalyzer()
    data = []

    with open(path, 'r', encoding='utf-8') as f:

        for line in f:
            # maxsplit=2 keeps stray tabs inside the article body from breaking the unpacking
            category, title, text = line.strip().split('\t', 2)

            # Lemmatize the headline using the first (highest-scoring) pymorphy2 parse of each word
            title = preprocessing(title)
            title = [morph.parse(word)[0].normal_form for word in title]

            # Split the body into sentences, skipping fragments of 10 characters or fewer
            text = [preprocessing(sentence) for sentence in re.split(r'[.!?]', text) if len(sentence) > 10]
            text = [[morph.parse(word)[0].normal_form for word in sentence] for sentence in text]

            data.append([category, title, text])

    return data
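
To see the shape of the sentence-level output without installing pymorphy2, here is a minimal sketch of the same splitting and tokenizing steps. The names `tokenize` and `split_article` are hypothetical, the sample text is invented, and lemmatization is replaced by an identity function, so the tokens stay in their inflected forms.

```python
import re


def tokenize(text):
    # Same regular expression as in preprocessing(): lowercase, keep word characters.
    return re.findall(r'\b\w+\b', text.lower())


def split_article(text, normalize=lambda w: w, min_len=10):
    """Return one token list per sentence longer than min_len characters."""
    sentences = [s for s in re.split(r'[.!?]', text) if len(s) > min_len]
    return [[normalize(w) for w in tokenize(s)] for s in sentences]


# Invented sample text; the short fragment " Ура" is filtered out.
body = "Сборная Канады выиграла матч. Ура! Следующая игра пройдет в субботу."
for sentence in split_article(body):
    print(sentence)
```

Passing a real lemmatizer as `normalize` (for example a wrapper around `morph.parse(word)[0].normal_form`) would reproduce what `get_data` does for the article body.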

In [7]:
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


In [8]:
data_train = get_data('/content/gdrive/My Drive/session/documents/news_train.txt')
data_test = get_data('/content/gdrive/My Drive/session/documents/news_test.txt')

2. Train word embeddings (fastText, word2vec, GloVe) on the training data. You can use [gensim](https://radimrehurek.com/gensim/models/word2vec.html). Demonstrate semantic associations.

In [9]:
sentences = [note[1] for note in data_train]
sentences.extend([sentence for note in data_train for sentence in note[2]])

In [10]:
w2v = Word2Vec(sentences, workers=4)

In [11]:
w2v.wv.most_similar(positive=['яндекс'])

[('google', 0.8647419214248657),
 ('поисковик', 0.8639674186706543),
 ('сервис', 0.8248758316040039),
 ('microsoft', 0.8012833595275879),
 ('вконтакте', 0.7774312496185303),
 ('facebook', 0.7768967151641846),
 ('yahoo', 0.7766843438148499),
 ('соцсеть', 0.7736133337020874),
 ('мессенджер', 0.7731637358665466),
 ('amazon', 0.7580886483192444)]
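
The scores in this list are cosine similarities between word vectors. A minimal pure-Python sketch of that ranking follows; the three-dimensional vectors are invented for illustration and are not taken from the trained model, and this `most_similar` function is a simplified stand-in, not the gensim implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, made up for this example only.
toy = {
    "яндекс": [0.9, 0.1, 0.0],
    "google": [0.8, 0.2, 0.1],
    "футбол": [0.0, 0.1, 0.9],
}
print(most_similar("яндекс", toy))
```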

3. Implement a classification algorithm, compute the accuracy on the test data, and tune the hyperparameters. The vectorization method is up to you: tf-idf with dimensionality reduction (see scikit-learn), the vector representations trained in the previous step, or [pretrained models](https://rusvectores.org/ru/models/). Keep in mind that simply averaging the token vectors of a text will most likely not give good results. Implement two of the following three algorithms:
     - SVM
     - naive Bayes classifier
     - logistic regression
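
As a rough illustration of the tf-idf option, here is a pure-Python sketch. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


# Toy documents, invented for illustration.
docs = [["спартак", "футболист", "матч"], ["банк", "ставка", "матч"]]
for row in tfidf(docs):
    print(row)
```

A term that occurs in every document ("матч" here) gets a lower weight than a term that occurs in only one, which is the property that makes tf-idf useful for telling sections apart.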

In [28]:
train_labels = [note[0] for note in data_train]
test_labels = [note[0] for note in data_test]

train_data = []
for note in data_train:
    train_data.append(' '.join(note[1]) + ' ' + ' '.join(note[2][0]))
    
test_data = []
for note in data_test:
    test_data.append(' '.join(note[1]) + ' ' + ' '.join(note[2][0]))

In [13]:
print(train_data[3])
print(train_labels[3])

с футболист спартак снять четырехматчевой дисквалификация контрольный дисциплинарный комитет кдк рфс снять дисквалификация с полузащитник московский спартак эйден макгидти
sport


In [14]:
tfidf = TfidfVectorizer()

X_train_idf = tfidf.fit_transform(train_data)
X_test_idf = tfidf.transform(test_data)

In [15]:
params = {'C': [0.05, 0.1, 1], 
          'multi_class': ['ovr', 'multinomial']}

log_reg = GridSearchCV(LogisticRegression(random_state=42), 
                       params, n_jobs=-1)

log_reg.fit(X_train_idf, train_labels)

preds = log_reg.predict(X_test_idf)
score = round(f1_score(test_labels, preds, average='weighted'), 3)

print(f'best params: {log_reg.best_params_}')
print(f'score = {score}')

best params: {'C': 1, 'multi_class': 'multinomial'}
score = 0.808
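
The reported score is a weighted F1: the F1 of each class averaged with weights proportional to the number of true examples of that class. A minimal sketch of that computation follows; `weighted_f1` is a hypothetical helper and the labels are invented toy data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy labels: one "sport" article is misclassified as "style".
y_true = ["sport", "sport", "sport", "style"]
y_pred = ["sport", "sport", "style", "style"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because the weights follow class sizes, large sections dominate this number, so a weak result on a small section such as `style` or `travel` moves it only slightly.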


In [16]:
params = {'alpha': [0, 0.5, 1]}

naive = GridSearchCV(MultinomialNB(),
                     params, n_jobs=-1)

naive.fit(X_train_idf, train_labels)

preds = naive.predict(X_test_idf)
score = round(f1_score(test_labels, preds, average='weighted'), 3)

print(f'best params: {naive.best_params_}')
print(f'score = {score}')

best params: {'alpha': 0.5}
score = 0.773
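
The `alpha` parameter tuned here is the additive (Lidstone) smoothing constant of the multinomial model. A sketch of the smoothed word likelihoods for one class, with an invented vocabulary and token list; `smoothed_likelihoods` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha):
    """P(word | class) = (count + alpha) / (total count + alpha * |V|)."""
    counts = Counter(class_tokens)
    denom = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


# Toy data: "банк" never occurs in the sport class.
vocab = ["матч", "гол", "банк"]
sport_tokens = ["матч", "гол", "матч"]
for alpha in (0.5, 1):
    print(alpha, smoothed_likelihoods(sport_tokens, vocab, alpha))
```

With any positive `alpha`, a word unseen in a class still gets a small non-zero probability; in this plain formula `alpha = 0` would assign it zero and wipe out the whole product for any text containing that word.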


4.* Implement classification with neural network models, for example [RuBERT](http://docs.deeppavlov.ai/en/master/features/models/bert.html) or [ELMo](https://rusvectores.org/ru/models/).

In [49]:
tokenizer = BertTokenizer.from_pretrained('DeepPavlov/rubert-base-cased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('DeepPavlov/rubert-base-cased', num_labels=10, 
                                                      output_attentions=False, output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at DeepPavlov/rubert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [29]:
def get_tokenized_data(data):
    new_data = tokenizer.batch_encode_plus(data, add_special_tokens=True, padding='max_length', 
                                           truncation=True, max_length=256, return_tensors='pt', 
                                           return_attention_mask=True)
    return new_data


train_data = get_tokenized_data(train_data)
test_data = get_tokenized_data(test_data)
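
What `padding='max_length'`, `truncation=True`, and `return_attention_mask=True` produce can be sketched in plain Python. This is a simplification under stated assumptions: the token ids are invented, the padding id is assumed to be 0, and the real tokenizer additionally keeps the closing special token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Cut or pad a sequence to max_length and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)          # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Made-up ids for a five-token sequence.
ids, mask = pad_and_mask([101, 7, 8, 9, 102], max_length=8)
print(ids)
print(mask)
```

The mask is what lets the model ignore the padded positions, so every sequence in a batch can share the same length of 256.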

In [30]:
counter = Counter(train_labels)
mapping = dict(zip(counter, list(range(10))))

train_labels = [mapping[key] for key in train_labels]
test_labels = [mapping[key] for key in test_labels]
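
The label encoding above assigns integer ids in the order in which sections first appear in the training labels. A small sketch with invented labels, including the inverse mapping needed to turn predicted ids back into section names (`inverse` is a name introduced here for illustration):

```python
labels = ["sport", "culture", "sport", "science", "culture"]

# dict.fromkeys keeps first-seen order, the same order a Counter iterates in.
mapping = {label: idx for idx, label in enumerate(dict.fromkeys(labels))}
inverse = {idx: label for label, idx in mapping.items()}

encoded = [mapping[label] for label in labels]
decoded = [inverse[idx] for idx in encoded]
print(mapping)
print(encoded)
```

Note that a section present only in the test set would raise a `KeyError` with this scheme, so it relies on all ten sections occurring in the training data.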

In [31]:
train_input = train_data['input_ids']
test_input = test_data['input_ids']

train_masks = train_data['attention_mask']
test_masks = test_data['attention_mask']

train_labels = torch.LongTensor(train_labels)
test_labels = torch.LongTensor(test_labels)

In [50]:
batch_size = 16

train_data = TensorDataset(train_input, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(test_input, test_masks, test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

In [51]:
optimizer = AdamW(model.parameters(), lr=1e-5)
n_epochs = 3
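
The notebook imports `get_linear_schedule_with_warmup` but trains with a constant learning rate. For reference, this is a sketch of the multiplier such a schedule applies to the base learning rate: a linear ramp from 0 to 1 during warmup, then a linear decay to 0. The step counts below are made-up numbers, not values from this experiment.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5
total_steps = 3 * 1000   # n_epochs * batches per epoch; 1000 is hypothetical
warmup_steps = 300       # hypothetical choice
for step in (0, 150, 300, 1650, 3000):
    print(step, base_lr * linear_warmup_factor(step, warmup_steps, total_steps))
```

To use the real scheduler, it would be created from the optimizer and stepped once after every `optimizer.step()` call.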

In [52]:
# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

losses = []

for epoch in range(n_epochs):
    
    mean_loss = 0
    model.train()
    
    for batch in train_dataloader:
        
        torch.cuda.empty_cache()
        
        data = batch[0].to(device)
        masks = batch[1].to(device)
        labels = batch[2].to(device)
        
        model.zero_grad()
        
        torch.cuda.empty_cache()
        
        outputs = model(data, attention_mask=masks, labels=labels)
        loss = outputs[0]
        
        mean_loss += loss.item()
        loss.backward()
        optimizer.step()
        
    mean_loss = mean_loss / len(train_dataloader)
    losses.append(mean_loss)
    
    print(f'loss {mean_loss}')

loss 0.7854018273876547
loss 0.3870078748795015
loss 0.254989714681435


In [53]:
model.eval()
predictions = torch.empty(0, dtype=torch.long)  # argmax returns int64 class indices

for batch in test_dataloader:
        
    data = batch[0].to(device)
    masks = batch[1].to(device)
        
    with torch.no_grad():
        outputs = model(data, attention_mask=masks, 
                        output_hidden_states=False, 
                        output_attentions=False)
        
    predictions = torch.cat((predictions, outputs.logits.cpu().argmax(axis=1)))

In [54]:
print(classification_report(test_labels, predictions, target_names=mapping.keys()))

              precision    recall  f1-score   support

       sport       0.95      0.98      0.97       423
     culture       0.92      0.89      0.90       426
     science       0.93      0.74      0.83       466
       media       0.79      0.88      0.83       403
   economics       0.82      0.85      0.84       426
        life       0.88      0.79      0.83       415
      forces       0.72      0.92      0.81       245
      travel       0.68      0.70      0.69        54
       style       0.65      0.77      0.70        52
    business       0.39      0.33      0.36        90

    accuracy                           0.84      3000
   macro avg       0.77      0.79      0.78      3000
weighted avg       0.85      0.84      0.84      3000
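
The two averaged F1 values in this report can be recomputed from its rows, which shows why they differ: the macro average treats all ten sections equally, while the weighted average follows the support column. The figures below are copied from the report; since they are already rounded to two decimals, the result only matches up to rounding.

```python
# (f1-score, support) per section, copied from the classification report
report = {
    "sport": (0.97, 423),
    "culture": (0.90, 426),
    "science": (0.83, 466),
    "media": (0.83, 403),
    "economics": (0.84, 426),
    "life": (0.83, 415),
    "forces": (0.81, 245),
    "travel": (0.69, 54),
    "style": (0.70, 52),
    "business": (0.36, 90),
}

total = sum(n for _, n in report.values())
macro = sum(f1 for f1, _ in report.values()) / len(report)
weighted = sum(f1 * n for f1, n in report.values()) / total
print(total, round(macro, 2), round(weighted, 2))
```

The small sections (`travel`, `style`, `business`) have the lowest F1, so they pull the macro average down more than the weighted one.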

