<a href="https://colab.research.google.com/github/nerobite/neural_networks/blob/main/Classification_in_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PyTorch background reading:
* [Tensors in PyTorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Automatic differentiation and what `.backward()` does](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [A very simple neural network in PyTorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

## **Task**

Using three different approaches, reach an F1 score above 0.91 with sklearn methods and above 0.52 with PyTorch methods on the classification task.

# Text classification

## Fakenews

1. We will work with the fake news data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize.
4. Train a classification algorithm on the resulting vectors.

We have already seen how this task is solved with Word2vec. Let's recap.

In [8]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2024-05-02 07:59:29--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv.1’


2024-05-02 07:59:29 (36.8 MB/s) - ‘Constraint_Train.csv.1’ saved [1253562/1253562]



In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('Constraint_Train.csv')

In [11]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [12]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [13]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:02<00:00, 2484.42it/s]


In [15]:
from gensim.models.word2vec import Word2Vec
%time model_tweets = Word2Vec(sentences, workers=4, vector_size=300, min_count=3, window=5, epochs=25)

CPU times: user 13.4 s, sys: 111 ms, total: 13.5 s
Wall time: 7.34 s


In [16]:
model_tweets.wv.most_similar('france')

[('2015', 0.7835791707038879),
 ('arrest', 0.7409564852714539),
 ('bags', 0.735721230506897),
 ('streets', 0.7251313328742981),
 ('floor', 0.7208883762359619),
 ('spain', 0.7077752351760864),
 ('front', 0.7068606019020081),
 ('stranded', 0.7029460668563843),
 ('victims', 0.7025680541992188),
 ('wenliang', 0.6912217736244202)]

In [17]:
# Deprecated in gensim 4.x, where this call is no longer necessary
model_tweets.init_sims()

  model_tweets.init_sims()


In [18]:
import numpy as np

In [19]:
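def get_text_embedding(text):
    # Collect the vectors of all tokens that are in the Word2Vec vocabulary
    result = []
    for word in word_tokenize(text.lower()):
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        # Sum the word vectors into a single 300-dimensional text vector
        result = np.sum(result, axis=0)
    else:
        # No known words: fall back to a zero vector
        result = np.zeros(300)
    return result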
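
The function above sums the word vectors, so a longer tweet gets an embedding with a larger norm even when its content is the same. Averaging is a common alternative that removes this length effect. The sketch below illustrates the difference; `toy_vectors` is a hypothetical 3-dimensional lookup standing in for `model_tweets.wv`, and `pool_embedding` is a helper introduced only for this example.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors standing in for model_tweets.wv
toy_vectors = {
    "covid": np.array([1.0, 0.0, 0.0]),
    "cases": np.array([0.0, 1.0, 0.0]),
    "rise": np.array([0.0, 0.0, 1.0]),
}

def pool_embedding(tokens, vectors, dim, mode="mean"):
    """Combine the vectors of known tokens into one fixed-size vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim)
    stacked = np.stack(found)
    return stacked.mean(axis=0) if mode == "mean" else stacked.sum(axis=0)

short = ["covid", "cases"]
long = ["covid", "cases"] * 5  # same content, five times longer

sum_short = pool_embedding(short, toy_vectors, 3, mode="sum")
sum_long = pool_embedding(long, toy_vectors, 3, mode="sum")
mean_short = pool_embedding(short, toy_vectors, 3, mode="mean")
mean_long = pool_embedding(long, toy_vectors, 3, mode="mean")

# Summed vectors grow with text length; averaged vectors do not
print(np.linalg.norm(sum_short), np.linalg.norm(sum_long))
print(np.allclose(mean_short, mean_long))  # True
```

Whether mean or sum pooling works better on this dataset is an empirical question that the notebook does not test.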

In [20]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:03<00:00, 2008.47it/s]


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [30]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.2, random_state=42)

In [101]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
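
The warning above means the solver stopped at its iteration limit before converging. As the message itself suggests, two usual remedies are raising `max_iter` and scaling the features. A minimal sketch on synthetic data follows; `X_demo` and `y_demo` are random stand-ins invented for this example, not the tweet embeddings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: two classes shifted apart, on a large scale
X_demo = rng.normal(size=(400, 20)) * 50.0
y_demo = np.array([0] * 200 + [1] * 200)
X_demo[y_demo == 1] += 60.0

# Scale the features, then fit with a higher iteration limit than the default of 100
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_demo, y_demo)

train_accuracy = clf.score(X_demo, y_demo)
print(train_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only when the pipeline is used with `train_test_split` or cross-validation.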


In [32]:
from sklearn.metrics import classification_report

In [102]:
predicted = model.predict(X_test)

In [103]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.92      0.91      0.91       596
           1       0.92      0.93      0.93       688

    accuracy                           0.92      1284
   macro avg       0.92      0.92      0.92      1284
weighted avg       0.92      0.92      0.92      1284



In [133]:
# K-nearest neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier

In [134]:
knn_model = KNeighborsClassifier()

In [135]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [136]:
X_train, X_test, y_train, y_test = train_test_split(features, le.fit_transform(df.label), test_size=0.2, random_state=42)

In [137]:
knn_model.fit(X_train, y_train)

In [138]:
predicted = knn_model.predict(X_test)

In [139]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.92      0.93      0.92       596
           1       0.94      0.93      0.93       688

    accuracy                           0.93      1284
   macro avg       0.93      0.93      0.93      1284
weighted avg       0.93      0.93      0.93      1284



In [140]:
# Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()

In [141]:
gb_model.fit(X_train, y_train)

In [142]:
predicted = gb_model.predict(X_test)

In [143]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.92      0.92      0.92       596
           1       0.93      0.93      0.93       688

    accuracy                           0.93      1284
   macro avg       0.93      0.93      0.93      1284
weighted avg       0.93      0.93      0.93      1284



### What happens if we use the most naive method?

In [144]:
from sklearn.feature_extraction.text import CountVectorizer

In [145]:
vec = CountVectorizer()

In [146]:
bow = vec.fit_transform(df.tweet)

In [147]:
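# Split the bag-of-words matrix `bow`. Splitting `features` instead would just refit the
# Word2Vec model; the report recorded below is identical to the Word2Vec one for that reason.
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)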

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [148]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.92      0.91      0.91       596
        real       0.92      0.93      0.93       688

    accuracy                           0.92      1284
   macro avg       0.92      0.92      0.92      1284
weighted avg       0.92      0.92      0.92      1284



Of course, we can always experiment with the preprocessing.
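
As one illustration of such preprocessing, the sketch below lowercases a tweet, drops URLs and @-mentions, and removes the `#` sign while keeping the hashtag word. The function name `clean_tweet`, the rules, and the example string are all assumptions made for this example, not part of the notebook's pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_tweet(text):
    """Lowercase a tweet and drop URLs, @-mentions and the '#' sign."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    # Collapse the whitespace left behind by the substitutions
    return " ".join(text.split())

# Invented example tweet, not taken from the dataset
example = "Breaking: #COVID cases rise, see https://example.com via @someone"
cleaned = clean_tweet(example)
print(cleaned)  # breaking: covid cases rise, see via
```

The cleaned string could then be passed to `word_tokenize` or `CountVectorizer` in place of the raw tweet; whether this helps the scores would need to be measured.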

### PyTorch + LSTM

In [21]:
labels = (df.label == 'real').astype(int).to_list()

We need to fix the maximum sentence length in advance.

In [22]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [23]:
max_len

1592

That is too long. What is the typical length?

In [24]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [25]:
fd.most_common(10)

[(20, 178),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 168),
 (21, 168),
 (16, 163),
 (17, 162),
 (15, 160),
 (23, 156)]

Let's cap the length at 200.
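
The cutoff of 200 is picked by eye from the most common lengths. One way to sanity-check such a choice is to look at percentiles of the length distribution and at the share of texts that fit under the cutoff. The sketch below uses a made-up array `toy_lengths` in place of `[len(tokens) for tokens in token_lists]`.

```python
import numpy as np

# Made-up token counts standing in for the real length list, with one extreme outlier
toy_lengths = np.array([15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 40, 60, 1592])

# Median and 95th percentile of the text lengths
p50, p95 = np.percentile(toy_lengths, [50, 95])

# Share of texts that fit into 200 tokens without truncation
coverage_at_200 = float(np.mean(toy_lengths <= 200))

print(p50, p95)
print(coverage_at_200)
```

A cutoff close to a high percentile keeps almost all texts intact while avoiding padding every sequence out to the length of a single outlier.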

We'll reuse the same w2v embeddings.

In [26]:
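def get_word_embedding(tokens, max_len):
    # Build exactly max_len vectors: truncate long texts, zero-pad short ones
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                # Out-of-vocabulary word: use a zero vector
                result.append(np.zeros(300))
        else:
            # Padding position past the end of the text
            result.append(np.zeros(300))
    return result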
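
The same padding and truncation can be written with one preallocated array per text, which keeps every row `float32` rather than mixing `float32` word vectors with `float64` zero vectors. The sketch below shows the idea; `toy_vectors` is a hypothetical 2-dimensional lookup standing in for `model_tweets.wv`, and `pad_embeddings` is a helper named for this example only.

```python
import numpy as np

# Hypothetical 2-dimensional vectors standing in for model_tweets.wv
toy_vectors = {
    "stay": np.array([1.0, 2.0], dtype=np.float32),
    "home": np.array([3.0, 4.0], dtype=np.float32),
}

def pad_embeddings(tokens, vectors, max_len, dim):
    """Return a (max_len, dim) array: known tokens get their vector,
    unknown tokens and padding positions stay zero."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, token in enumerate(tokens[:max_len]):  # truncate long inputs
        if token in vectors:
            out[i] = vectors[token]
    return out

padded = pad_embeddings(["stay", "unknown", "home"], toy_vectors, max_len=5, dim=2)
truncated = pad_embeddings(["stay", "home", "stay"], toy_vectors, max_len=2, dim=2)
print(padded.shape, truncated.shape)  # (5, 2) (2, 2)
```

Stacking such arrays with `np.stack` gives a single `(n_texts, max_len, dim)` array that converts to a tensor in one step.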

In [27]:
features = [get_word_embedding(text, 200) for text in tqdm(token_lists)]

100%|██████████| 6420/6420 [00:02<00:00, 2541.84it/s]


In [31]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

In [32]:
import torch
import torch.nn as nn
import torch.optim as optim

In [33]:
len(features[0][0])

300

In [34]:
len(X_train)

5136

In [35]:
len(X_train[0])

200

In [36]:
len(X_train[0][0])

300

In [37]:
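class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 300-dimensional word vectors in, 100-dimensional LSTM state out
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default, so swap the first two axes
        # shortterm is the final hidden state h_n, longterm is the final cell state c_n
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        prediction = torch.sigmoid(self.out(longterm))
        return prediction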


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)
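
`nn.LSTM` returns the per-step outputs together with a tuple of the final hidden state and the final cell state, so in the model above `shortterm` is the hidden state and `longterm` is the cell state that the classifier reads. The sketch below only checks shapes on random input. The class name `NetBatchFirst`, the use of `batch_first=True`, and classifying from the hidden state are assumptions shown as a common alternative, not the notebook's model, and nothing here claims it would train better.

```python
import torch
import torch.nn as nn

class NetBatchFirst(nn.Module):
    """Hypothetical variant: batch-first input, prediction from the final hidden state."""

    def __init__(self, input_size=300, hidden_size=100):
        super().__init__()
        # batch_first=True lets the LSTM take (batch, seq_len, features) directly
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, x):
        outputs, (hidden, cell) = self.lstm(x)
        # hidden has shape (num_layers, batch, hidden_size); take the last layer
        return torch.sigmoid(self.out(hidden[-1])).squeeze(-1)

demo_net = NetBatchFirst()
demo_batch = torch.randn(4, 200, 300)  # 4 random "tweets", 200 steps, 300 features
with torch.no_grad():
    demo_pred = demo_net(demo_batch)

print(demo_pred.shape)  # torch.Size([4])
```

Returning a tensor of shape `(batch,)` means the loss can be computed against the targets without an extra `reshape(-1)`.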


In [38]:
# Stack into a single ndarray first: building a tensor from a list of arrays is very slow
in_data = torch.tensor(np.array(X_train)).float()
targets = torch.tensor(y_train).float()


In [39]:
in_data.shape

torch.Size([5136, 200, 300])

In [32]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [33]:
def train_one_epoch(in_data, targets, batch_size=16):
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()
        output = net(batch_x)
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    print(loss)

In [34]:
train_one_epoch(in_data, targets)

100%|██████████| 321/321 [03:09<00:00,  1.69it/s]

tensor(0.6882, grad_fn=<BinaryCrossEntropyBackward0>)





What did we get?

In [35]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [36]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [37]:
result = (output > 0.5) == targets_test

In [38]:
result.sum().item() / len(result)

0.5358255451713395

But a model like this needs to be trained for longer.

In [40]:

# Define the loss function
criterion = nn.MSELoss()

# Define the optimizer
optimizer = optim.Adam(net.parameters(), lr=0.001)


In [44]:
def train_model(in_data, targets, num_epochs=10, batch_size=16):
    for epoch in range(num_epochs):
        print(f"Epoch {epoch + 1}/{num_epochs}")
        for i in tqdm(range(0, in_data.shape[0], batch_size)):
            batch_x = in_data[i:i + batch_size]
            batch_y = targets[i:i + batch_size]
            optimizer.zero_grad()
            output = net(batch_x)
            loss = criterion(output.reshape(-1), batch_y)
            loss.backward()
            optimizer.step()
        print(f"Loss for epoch {epoch + 1}: {loss.item()}")



In [45]:
train_model(in_data, targets)

Epoch 1/10


100%|██████████| 321/321 [04:01<00:00,  1.33it/s]


Loss for epoch 1: 0.2497052103281021
Epoch 2/10


100%|██████████| 321/321 [04:09<00:00,  1.29it/s]


Loss for epoch 2: 0.24914513528347015
Epoch 3/10


100%|██████████| 321/321 [04:19<00:00,  1.24it/s]


Loss for epoch 3: 0.24841521680355072
Epoch 4/10


100%|██████████| 321/321 [04:24<00:00,  1.22it/s]


Loss for epoch 4: 0.24784429371356964
Epoch 5/10


100%|██████████| 321/321 [04:28<00:00,  1.19it/s]


Loss for epoch 5: 0.2474225014448166
Epoch 6/10


100%|██████████| 321/321 [04:32<00:00,  1.18it/s]


Loss for epoch 6: 0.24710671603679657
Epoch 7/10


100%|██████████| 321/321 [04:36<00:00,  1.16it/s]


Loss for epoch 7: 0.24688002467155457
Epoch 8/10


100%|██████████| 321/321 [04:38<00:00,  1.15it/s]


Loss for epoch 8: 0.2466840147972107
Epoch 9/10


100%|██████████| 321/321 [04:41<00:00,  1.14it/s]


Loss for epoch 9: 0.24657943844795227
Epoch 10/10


100%|██████████| 321/321 [04:39<00:00,  1.15it/s]

Loss for epoch 10: 0.24643635749816895





In [47]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [48]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [49]:
result = (output > 0.5) == targets_test

In [50]:
result.sum().item() / len(result)

0.5358255451713395
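
The accuracy did not change after ten more epochs, and 0.5358 equals 688/1284, the share of the larger class in the test split according to the `support` column of the reports above. That is exactly what a model predicting one class for every tweet would score, although the notebook does not check the predictions directly. Since the task is stated in terms of F1, the sketch below computes F1 for such a constant predictor from those support counts; `f1_binary` is a helper written for this example.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 score for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Class sizes taken from the `support` column: 596 of class 0, 688 of class 1
y_true_demo = [0] * 596 + [1] * 688
y_pred_demo = [1] * len(y_true_demo)  # constant predictor: always class 1

accuracy = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
f1_pos = f1_binary(y_true_demo, y_pred_demo, positive=1)
f1_neg = f1_binary(y_true_demo, y_pred_demo, positive=0)

print(round(accuracy, 4))               # 0.5358
print(round(f1_pos, 4), f1_neg)         # 0.6978 0.0
print(round((f1_pos + f1_neg) / 2, 4))  # macro F1: 0.3489
```

On the real test set, `classification_report(y_test, (output > 0.5).int().numpy())` would give the per-class and macro F1 in the same format as the sklearn sections.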