### Task description

Using the tweet dataset [fakenews](https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv), classify the texts with several methods.

The trained classifiers should reach an F1 score above 0.91 for the sklearn-based methods and above 0.52 for the PyTorch-based ones.

In [1]:
import numpy as np
import pandas as pd

from collections import Counter
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from tqdm import tqdm

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import torch
import torch.nn as nn
import torch.optim as optim

### Helper functions and classes

In [2]:
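def get_embedding(text, max_len=None):
    '''Build embeddings from the trained Word2Vec model `model_tweets`.

    Without max_len: `text` is a raw string; returns the sum of its word vectors.
    With max_len: `text` is a token list; returns max_len vectors, zero-padded.
    '''
    result = []
    if not max_len:
        for word in word_tokenize(text.lower()):
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])

        if len(result):
            result = np.sum(result, axis=0)
        else:
            # No known words: fall back to a zero vector
            result = np.zeros(300)
    else:
        for i in range(max_len):
            if i < len(text):
                word = text[i]
                if word in model_tweets.wv:
                    result.append(model_tweets.wv[word])
                else:
                    # Out-of-vocabulary token
                    result.append(np.zeros(300))
            else:
                # Padding up to max_len
                result.append(np.zeros(300))
    return result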


def train_one_epoch(in_data, targets, batch_size=16):
    '''Train the LSTM for one epoch; uses the global net, optimizer and criterion.'''
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size].cuda()
        batch_y = targets[i:i + batch_size].cuda()
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    # Loss of the last batch only, not the epoch average
    print(loss)


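class BiLSTM(nn.Module):
    '''LSTM classifier (unidirectional, despite the name)'''
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        # shortterm is the final hidden state h_n, longterm the final cell state c_n
        prediction = torch.sigmoid(self.out(longterm))
        return prediction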

### Loading the data

In [3]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2023-06-19 20:44:25--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv.17’


2023-06-19 20:44:26 (21.7 MB/s) - ‘Constraint_Train.csv.17’ saved [1253562/1253562]



In [4]:
df = pd.read_csv('./Constraint_Train.csv')

In [5]:
df.head()

|   | id | tweet | label |
|---|----|-------|-------|
| 0 | 1 | The CDC currently reports 99031 deaths. In gen... | real |
| 1 | 2 | States reported 1121 deaths a small rise from ... | real |
| 2 | 3 | Politically Correct Woman (Almost) Uses Pandem... | fake |
| 3 | 4 | #IndiaFightsCorona: We have 1524 #COVID testin... | real |
| 4 | 5 | Populous states can generate large case counts... | real |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6420 entries, 0 to 6419
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      6420 non-null   int64 
 1   tweet   6420 non-null   object
 2   label   6420 non-null   object
dtypes: int64(1), object(2)
memory usage: 150.6+ KB


### Classification with a vectorizer

In [7]:
vec = CountVectorizer()
bow = vec.fit_transform(df.tweet)

Training a model on the vectorizer output.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.3, random_state=21)
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

In [9]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.90      0.93      0.91       887
        real       0.94      0.91      0.92      1039

    accuracy                           0.92      1926
   macro avg       0.92      0.92      0.92      1926
weighted avg       0.92      0.92      0.92      1926



**Conclusion:** `LogisticRegression` on top of `CountVectorizer` features reached an accuracy of **0.92** (F1 of 0.91 for `fake` and 0.92 for `real`).
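
In the cells above the vectorizer is fitted on all tweets before the split, so its vocabulary also sees the test tweets. A minimal sketch of one alternative, assuming a scikit-learn pipeline so that `CountVectorizer` is fitted on the training part only. The toy texts and labels are invented for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only
texts = [
    "officials report new cases today",
    "health ministry confirms testing numbers",
    "states reported deaths and recoveries",
    "daily update on hospital admissions",
    "miracle cure kills the virus instantly",
    "secret plot behind the pandemic revealed",
    "drinking bleach cures the infection",
    "celebrity claims vaccine contains chips",
]
labels = ["real"] * 4 + ["fake"] * 4

# Split the raw texts first, then let the pipeline fit the vocabulary
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=21, stratify=labels
)

pipeline = make_pipeline(CountVectorizer(), LogisticRegression(solver="lbfgs"))
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
print(predicted)
```

With this layout, words that occur only in the test tweets are simply ignored at prediction time.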

### Classification with embeddings

In [10]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
sentences = [word_tokenize(text.lower()) for text in df.tweet]

In [12]:
%time model_tweets = Word2Vec(sentences, workers=4, vector_size=300, min_count=3, window=5, epochs=15)

CPU times: user 8.3 s, sys: 68 ms, total: 8.37 s
Wall time: 6.98 s


Let's check whether the model was trained sensibly by looking at the words closest to `covid`.

In [13]:
model_tweets.wv.most_similar('covid')

[('coronavirus', 0.6735660433769226),
 ('covid-19', 0.6397594213485718),
 ('covid19', 0.6177120208740234),
 ('corona', 0.5911555886268616),
 ('virus', 0.5405569672584534),
 ('future', 0.5210791826248169),
 ('coronavirus._', 0.5153530240058899),
 ('flu', 0.5134640336036682),
 ('hydroxychloroquine', 0.49073326587677),
 ('caring', 0.4851606786251068)]
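
The scores returned by `most_similar` are cosine similarities between word vectors. A small sketch of that computation with NumPy, using made-up 3-dimensional vectors rather than the trained 300-dimensional ones:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, for illustration only
vec_a = np.array([1.0, 0.0, 1.0])
vec_b = np.array([1.0, 0.0, 0.0])
vec_c = np.array([0.0, 1.0, 0.0])

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab)  # about 0.707
print(sim_ac)  # 0.0
```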

Computing tweet embeddings from the trained Word2Vec model.

In [14]:
features = [get_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:06<00:00, 1044.38it/s]
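
Without `max_len`, `get_embedding` turns a tweet into a single vector by summing the vectors of its known words. A sketch of that pooling step on a hypothetical lookup table (`toy_vectors` stands in for `model_tweets.wv`), shown next to mean pooling, a common alternative that keeps the vector scale independent of tweet length:

```python
import numpy as np

# Hypothetical stand-in for model_tweets.wv, with 4-dimensional vectors
toy_vectors = {
    "covid": np.array([1.0, 0.0, 0.0, 2.0]),
    "cases": np.array([0.0, 1.0, 0.0, 2.0]),
    "rise": np.array([2.0, 1.0, 0.0, 0.0]),
}
DIM = 4

def pool(tokens, how="sum"):
    """Combine the vectors of known tokens; zero vector if none is known."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros(DIM)
    stacked = np.stack(known)
    return stacked.sum(axis=0) if how == "sum" else stacked.mean(axis=0)

tokens = ["covid", "cases", "unknownword", "rise"]
summed = pool(tokens, "sum")
averaged = pool(tokens, "mean")
print(summed)    # [3. 2. 0. 4.]
print(averaged)
```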


In [15]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.3, random_state=21)

Training a model on the embeddings.

In [16]:
model = LogisticRegression(solver='newton-cg')
model.fit(X_train, y_train)

In [17]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.89      0.94      0.91       887
        real       0.95      0.90      0.92      1039

    accuracy                           0.92      1926
   macro avg       0.92      0.92      0.92      1926
weighted avg       0.92      0.92      0.92      1926



**Conclusion:** `LogisticRegression` on Word2Vec embeddings reached an accuracy of **0.92** (F1 of 0.91 for `fake` and 0.92 for `real`).

### Classification with PyTorch + LSTM
Data preparation

In [18]:
labels = (df.label == 'real').astype(int).to_list()

In [19]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]

We choose `max_len` based on `most_common`.

In [20]:
max_len = 250
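
The cell that inspected the tweet lengths is not shown. One way it could look, as a sketch: count how many tweets have each token length with `Counter` and inspect `most_common`. The token lists below are invented; in the notebook the input would presumably be `token_lists`.

```python
from collections import Counter

# Invented token lists standing in for `token_lists`
toy_token_lists = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "b", "c"],
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c"],
    ["a", "b"],
]

# Map each length (in tokens) to the number of tweets with that length
length_counts = Counter(len(tokens) for tokens in toy_token_lists)
print(length_counts.most_common(2))  # [(3, 3), (2, 2)]

# The longest tweet gives an upper bound for max_len
longest = max(length_counts)
print(longest)  # 5
```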

In [21]:
features = [get_embedding(text, max_len) for text in token_lists]

In [22]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=21)

In [23]:
net = BiLSTM()
print(net)

BiLSTM(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)
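
The printed module shows a plain `LSTM(300, 100)`, so despite its name the network is unidirectional. A sketch of what a bidirectional variant might look like, assuming `bidirectional=True`, `batch_first=True`, and concatenation of the final hidden states of both directions. The class name `BiLSTMSketch` is hypothetical, and this variant was not trained in the notebook.

```python
import torch
import torch.nn as nn

class BiLSTMSketch(nn.Module):
    """Hypothetical bidirectional variant, not the model trained here."""

    def __init__(self, input_size=300, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=True)
        # Forward and backward final states are concatenated
        self.out = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        _, (h_n, _) = self.lstm(x)
        # h_n: (2, batch, hidden_size); index 0 is forward, 1 is backward
        final = torch.cat([h_n[0], h_n[1]], dim=1)
        return torch.sigmoid(self.out(final)).reshape(-1)

sketch = BiLSTMSketch()
dummy_batch = torch.zeros(4, 10, 300)  # (batch, seq_len, embedding size)
probs = sketch(dummy_batch)
print(probs.shape)  # torch.Size([4])
```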


In [24]:
# Convert via np.array first: building a tensor from a list of arrays is slow
in_data = torch.tensor(np.array(X_train)).float()
targets = torch.tensor(y_train).float()
in_data_test = torch.tensor(np.array(X_test)).float()
targets_test = torch.tensor(y_test).float()


In [25]:
in_data.shape

torch.Size([4494, 250, 300])

In [26]:
targets.shape

torch.Size([4494])

In [27]:
optimizer = optim.SGD(net.parameters(), lr=0.1)
criterion = nn.BCELoss()

The computations will run on a GPU accelerator.

In [28]:
net.cuda()

BiLSTM(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)

Train the model for 20 epochs.

In [29]:
for i in range(20):
  train_one_epoch(in_data, targets)

100%|██████████| 281/281 [00:01<00:00, 228.94it/s]


tensor(0.6875, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 263.44it/s]


tensor(0.6868, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 242.54it/s]


tensor(0.6862, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 226.89it/s]


tensor(0.6857, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 255.82it/s]


tensor(0.6852, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 266.08it/s]


tensor(0.6848, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 262.33it/s]


tensor(0.6844, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 267.76it/s]


tensor(0.6841, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 261.70it/s]


tensor(0.6838, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 265.20it/s]


tensor(0.6836, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 267.32it/s]


tensor(0.6833, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 263.71it/s]


tensor(0.6831, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 264.58it/s]


tensor(0.6829, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 244.81it/s]


tensor(0.6828, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 227.92it/s]


tensor(0.6826, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 241.48it/s]


tensor(0.6825, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 266.76it/s]


tensor(0.6823, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 264.60it/s]


tensor(0.6822, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 266.07it/s]


tensor(0.6821, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)


100%|██████████| 281/281 [00:01<00:00, 267.28it/s]

tensor(0.6820, device='cuda:0', grad_fn=<BinaryCrossEntropyBackward0>)





In [30]:
with torch.no_grad():
    output = net(in_data_test.cuda()).reshape(-1)

In [31]:
result = (output.cpu() > 0.5) == targets_test
result.sum().item() / len(result)

0.5399792315680166

**Conclusion:** The PyTorch + LSTM classifier, with no changes to the network architecture, reached an accuracy of **0.54**.
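
The task states its target as F1, while the cell above computes accuracy. A sketch of how both could be computed from thresholded probabilities, assuming `accuracy_score` and `f1_score` from `sklearn.metrics`. The probabilities and targets below are invented and are not the model's outputs.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Invented values for illustration; 1 = real, 0 = fake
probabilities = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.55])
true_labels = np.array([1, 1, 0, 1, 0, 0])

# Same 0.5 threshold as in the notebook
predicted_labels = (probabilities > 0.5).astype(int)

accuracy = accuracy_score(true_labels, predicted_labels)
f1 = f1_score(true_labels, predicted_labels)  # F1 for the positive class
print(accuracy, f1)
```

In the notebook, `probabilities` would correspond to `output.cpu().numpy()` and `true_labels` to `y_test`.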

### Conclusions

Naive classification approaches based on vectorization and on embeddings were applied, along with a PyTorch + LSTM model. The scores required by the task were reached.
The naive models were highly accurate and needed no additional tuning.
The baseline neural network was much less accurate: its training loss barely moved over 20 epochs (0.6875 to 0.6820), so it needs further tuning.