Preliminaries on PyTorch:
* [Tensors in PyTorch](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensor_tutorial.ipynb)
* [Automatic differentiation and what `.backward()` does](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/autograd_tutorial.ipynb)
* [A very simple neural network in PyTorch](https://colab.research.google.com/drive/1RsZvw4KBGn5U5Aj5Ak7OG2pHx6z1OSlF)

# Text classification

## Fakenews

1. We will work with the fake news data from here: https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
2. Preprocess the text. Split the data into train and test sets for the classification task.
3. Vectorize the texts.
4. Train a classification algorithm on the resulting vectors.

We have already seen how this task is solved with Word2vec. Let's recap.

In [None]:
!wget https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv

--2021-10-22 18:12:21--  https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1253562 (1.2M) [text/plain]
Saving to: ‘Constraint_Train.csv’


2021-10-22 18:12:21 (21.7 MB/s) - ‘Constraint_Train.csv’ saved [1253562/1253562]



In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('Constraint_Train.csv')

In [None]:
df.head()

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [None]:
from nltk.tokenize import word_tokenize
from tqdm import tqdm

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
sentences = [word_tokenize(text.lower()) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:02<00:00, 2964.92it/s]


In [None]:
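from gensim.models.word2vec import Word2Vec
# Argument names as in gensim 3.x; gensim 4+ renamed `size` to `vector_size` and `iter` to `epochs`.
%time model_tweets = Word2Vec(sentences, workers=4, size=300, min_count=3, window=5, iter=15)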

CPU times: user 11.1 s, sys: 84.4 ms, total: 11.2 s
Wall time: 6.53 s


In [None]:
model_tweets.wv.most_similar('france')

[('tower', 0.9468374848365784),
 ('named', 0.9362770318984985),
 ('front', 0.9360196590423584),
 ('deceased', 0.933924674987793),
 ('representative', 0.9337502717971802),
 ('selling', 0.9319757223129272),
 ('film', 0.9301034212112427),
 ('bags', 0.9269586205482483),
 ('throwing', 0.9267023205757141),
 ('jamaat', 0.9254723787307739)]
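
`most_similar` ranks words by the cosine similarity between their vectors. Below is a minimal pure-Python sketch of that measure. The 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, so similarity is close to 1
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a, so similarity is 0

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

The neighbours of `'france'` above all score above 0.9 although they are not obviously related. With only 6420 tweets the vectors are probably not well separated yet.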

In [None]:
# Precompute L2-normalized vectors (gensim 3.x API, deprecated in gensim 4).
model_tweets.init_sims()

In [None]:
import numpy as np

In [None]:
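def get_text_embedding(text):
    """Return one 300-dimensional vector per text: the sum of its word vectors."""
    result = []
    for word in word_tokenize(text.lower()):
        # Skip tokens that did not make it into the Word2Vec vocabulary (min_count=3).
        if word in model_tweets.wv:
            result.append(model_tweets.wv[word])

    if len(result):
        result = np.sum(result, axis=0)
    else:
        # No known words at all: fall back to a zero vector.
        result = np.zeros(300)
    return result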

In [None]:
features = [get_text_embedding(text) for text in tqdm(df.tweet)]

100%|██████████| 6420/6420 [00:02<00:00, 2170.57it/s]
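
`get_text_embedding` sums the word vectors, so the norm of a text vector grows with the number of tokens. A common alternative is to average them. The sketch below contrasts the two. `pool_vectors` is a hypothetical helper and the 2-dimensional vectors are made up for illustration.

```python
import numpy as np

def pool_vectors(vectors, dim, mode="sum"):
    """Combine word vectors into one text vector by summing or averaging."""
    if len(vectors) == 0:
        return np.zeros(dim)
    stacked = np.stack(vectors)
    return stacked.sum(axis=0) if mode == "sum" else stacked.mean(axis=0)

# Hypothetical 2-dimensional "word vectors", for illustration only.
short_text = [np.array([1.0, 0.0])]
long_text = [np.array([1.0, 0.0])] * 4   # the same word repeated four times

# Sum pooling: the long text gets a vector four times larger.
print(pool_vectors(short_text, 2, "sum"), pool_vectors(long_text, 2, "sum"))
# Mean pooling: both texts map to the same vector.
print(pool_vectors(short_text, 2, "mean"), pool_vectors(long_text, 2, "mean"))
```

Whether sum or mean works better for this dataset is an empirical question. The notebook itself only uses the sum.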


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, df.label, test_size=0.33)

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
from sklearn.metrics import classification_report

In [None]:
predicted = model.predict(X_test)

In [None]:
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.89      0.92      0.90       992
        real       0.92      0.90      0.91      1127

    accuracy                           0.91      2119
   macro avg       0.91      0.91      0.91      2119
weighted avg       0.91      0.91      0.91      2119
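
The columns of the report can be reproduced from per-class counts of true positives, false positives and false negatives. The sketch below shows the formulas. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the confusion matrix of this model.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for one class, chosen only to illustrate the formulas.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.9 0.75 0.82
```

`support` is the number of true examples of each class in the test set. `macro avg` is the unweighted mean over classes, and `weighted avg` weights each class by its support.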



### What happens if we use the most naive method?

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vec = CountVectorizer()

In [None]:
bow = vec.fit_transform(df.tweet)
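
`CountVectorizer` turns each text into a vector of token counts over a shared vocabulary. Below is a simplified pure-Python sketch of that idea. `tokenize`, `fit_vocabulary` and `transform` are hypothetical helper names. The regular expression follows scikit-learn's documented default `token_pattern` (lowercased tokens of two or more word characters). Unlike scikit-learn, the sketch returns dense lists rather than a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(texts):
    """Map every distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(texts, vocabulary):
    """Return one dense row of token counts per text."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

# Made-up example texts, not taken from the dataset.
texts = ["masks work", "masks do not work do they"]
vocab = fit_vocabulary(texts)
print(vocab)                    # {'do': 0, 'masks': 1, 'not': 2, 'they': 3, 'work': 4}
print(transform(texts, vocab))  # [[0, 1, 0, 0, 1], [2, 1, 1, 1, 1]]
```

Word order is discarded entirely, which is why this is the "most naive" representation.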

In [None]:
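# Split the bag-of-words matrix built above, not the Word2Vec `features`.
# The report shown below came from a run that split `features` here,
# so it does not yet reflect the bag-of-words model.
X_train, X_test, y_train, y_test = train_test_split(bow, df.label, test_size=0.33)
model = LogisticRegression()
model.fit(X_train, y_train)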

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
predicted = model.predict(X_test)
print(classification_report(y_test, predicted))

              precision    recall  f1-score   support

        fake       0.90      0.91      0.91      1016
        real       0.92      0.91      0.92      1103

    accuracy                           0.91      2119
   macro avg       0.91      0.91      0.91      2119
weighted avg       0.91      0.91      0.91      2119



Of course, we can always experiment with the preprocessing.

### PyTorch + LSTM

In [None]:
labels = (df.label == 'real').astype(int).to_list()

We need to fix the maximum sentence length in advance.

In [None]:
token_lists = [word_tokenize(text.lower()) for text in df.tweet]
max_len = len(max(token_lists, key=len))

In [None]:
max_len

1592

That is too long. But what length is typical?

In [None]:
from collections import Counter
fd = Counter([len(tokens) for tokens in token_lists])

In [None]:
fd.most_common(10)

[(20, 179),
 (25, 174),
 (22, 170),
 (18, 170),
 (19, 167),
 (21, 167),
 (16, 162),
 (15, 161),
 (17, 161),
 (23, 157)]
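
The most common lengths alone do not show how many tweets a given cutoff would truncate. One way to check this is to compute the share of texts that fit under a cutoff, or the smallest cutoff that reaches a target share. In the sketch below, `coverage` and `length_for_coverage` are hypothetical helpers and the lengths are made up.

```python
import math

def coverage(lengths, max_len):
    """Fraction of sequences whose length does not exceed max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

def length_for_coverage(lengths, target):
    """Smallest observed length such that at least `target` of the sequences fit."""
    ordered = sorted(lengths)
    index = max(math.ceil(target * len(ordered)) - 1, 0)
    return ordered[index]

# Made-up token counts; in the notebook one would pass
# [len(tokens) for tokens in token_lists] instead.
lengths = [15, 18, 20, 20, 22, 25, 31, 40, 64, 1592]
print(coverage(lengths, 200))             # share of texts that fit into 200 tokens
print(length_for_coverage(lengths, 0.9))  # cutoff that keeps 90% of texts intact
```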

Let's set the maximum to 200.

We'll use the same w2v embeddings.

In [None]:
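def get_word_embedding(tokens, max_len):
    """Return exactly max_len vectors: one per token, zero-padded or truncated."""
    result = []
    for i in range(max_len):
        if i < len(tokens):
            word = tokens[i]
            if word in model_tweets.wv:
                result.append(model_tweets.wv[word])
            else:
                # Out-of-vocabulary token: use a zero vector.
                result.append(np.zeros(300))
        else:
            # Past the end of the text: pad with zeros.
            result.append(np.zeros(300))
    return result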

In [None]:
features = [get_word_embedding(tokens, 200) for tokens in tqdm(token_lists)]

100%|██████████| 6420/6420 [00:03<00:00, 1660.20it/s]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

In [None]:
len(features[0][0])

300

In [None]:
len(X_train)

4301

In [None]:
len(X_train[0])

200

In [None]:
len(X_train[0][0])

300

In [None]:
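class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 300-dimensional word vectors in, 100-dimensional LSTM state out
        self.lstm = nn.LSTM(300, 100)
        self.out = nn.Linear(100, 1)

    def forward(self, x):
        # nn.LSTM expects (seq_len, batch, features) by default, so swap the first two axes
        embeddings, (shortterm, longterm) = self.lstm(x.transpose(0, 1))
        # shortterm is the final hidden state h_n, longterm the final cell state c_n;
        # the prediction here is made from the cell state
        prediction = torch.sigmoid(self.out(longterm))
        return prediction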


net = Net()
print(net)

Net(
  (lstm): LSTM(300, 100)
  (out): Linear(in_features=100, out_features=1, bias=True)
)
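
In `Net`, every tweet is padded with zero vectors up to 200 steps, and the prediction is read from the state after the final step, which for most tweets is padding. A common alternative is to pack the batch with `pack_padded_sequence` so that the LSTM stops at the last real token of each sequence. The sketch below assumes the true lengths are available. `PackedNet` and `lengths` are hypothetical names that do not appear in the notebook, and the sketch reads the final hidden state rather than the cell state.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedNet(nn.Module):
    """LSTM classifier that ignores padded time steps (illustrative sketch)."""

    def __init__(self, emb_dim=300, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, lengths):
        # x: (batch, max_len, emb_dim); lengths: true token counts, each >= 1
        packed = pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the hidden state at the last real token of every sequence
        return torch.sigmoid(self.out(h_n[-1])).squeeze(-1)

torch.manual_seed(0)
batch = torch.randn(3, 200, 300)         # random stand-in for three padded tweets
lengths = torch.tensor([20, 5, 200])     # hypothetical true lengths
model = PackedNet()
with torch.no_grad():
    probs = model(batch, lengths)
print(probs.shape)
```

Empty token lists would need to be filtered out or given length 1, because packing does not accept zero-length sequences.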


In [None]:
in_data = torch.tensor(X_train).float()
targets = torch.tensor(y_train).float()

In [None]:
in_data.shape

torch.Size([4301, 200, 300])
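
A tensor of this shape is large, because every tweet stores 200 × 300 numbers regardless of its real length. The arithmetic below is a back-of-the-envelope estimate. `tensor_bytes` is a hypothetical helper; 4 bytes per element corresponds to the `.float()` tensor, and 8 bytes to data held as float64, which is the default dtype of `np.zeros`.

```python
def tensor_bytes(shape, bytes_per_element):
    """Size in bytes of a dense array with the given shape."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total

shape = (4301, 200, 300)              # shape printed above
as_float32 = tensor_bytes(shape, 4)   # after .float()
as_float64 = tensor_bytes(shape, 8)   # if held as float64
print(as_float32 / 1e9, "GB as float32")
print(as_float64 / 1e9, "GB as float64")
```

That is roughly 1 GB for the training set alone. A smaller `max_len`, or storing token indices and looking the vectors up per batch, would reduce this considerably.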

In [None]:
optimizer = optim.SGD(net.parameters(), lr=0.01)
criterion = nn.BCELoss()

In [None]:
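def train_one_epoch(in_data, targets, batch_size=16):
    """Run one pass over the data in fixed order, one optimizer step per batch."""
    for i in tqdm(range(0, in_data.shape[0], batch_size)):
        batch_x = in_data[i:i + batch_size]
        batch_y = targets[i:i + batch_size]
        optimizer.zero_grad()  # clear gradients left over from the previous batch
        output = net(batch_x)
        # output has shape (1, batch, 1); flatten it to match batch_y
        loss = criterion(output.reshape(-1), batch_y)
        loss.backward()
        optimizer.step()
    # Loss of the last batch only, not an average over the epoch
    print(loss)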
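
`train_one_epoch` walks through the data in the same order on every call. Mini-batch training usually reshuffles the samples each epoch. The sketch below generates shuffled batch indices in pure Python; `minibatch_indices` is a hypothetical helper that is not part of the notebook.

```python
import math
import random

def minibatch_indices(n_samples, batch_size, seed=None):
    """Yield lists of sample indices covering one shuffled pass over the data."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

batches = list(minibatch_indices(4301, 16, seed=0))
print(len(batches), math.ceil(4301 / 16))  # 269 269
print(len(batches[-1]))                    # 13 leftover samples in the last batch
```

Each index list could then select a batch, for example `in_data[idx]` and `targets[idx]`, in place of the contiguous slices.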

In [None]:
train_one_epoch(in_data, targets)

100%|██████████| 269/269 [03:21<00:00,  1.34it/s]

tensor(0.6970, grad_fn=<BinaryCrossEntropyBackward>)





What did we get?

In [None]:
in_data_test = torch.tensor(X_test).float()
targets_test = torch.tensor(y_test).float()

In [None]:
with torch.no_grad():
    output = net(in_data_test).reshape(-1)

In [None]:
result = (output > 0.5) == targets_test

In [None]:
result.sum().item() / len(result)

0.5280792826805096
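
To put this number in context, it helps to compare it with the accuracy of always predicting the most frequent class. The sketch below uses the class counts from the first classification report (992 fake, 1127 real). The LSTM section uses a different random split, so the figure is only indicative. `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter

def accuracy(predictions, targets):
    """Share of positions where prediction and target agree."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def majority_baseline(targets):
    """Accuracy of always predicting the most frequent class."""
    return Counter(targets).most_common(1)[0][1] / len(targets)

# Class counts from the first classification report (0 = fake, 1 = real).
targets = [0] * 992 + [1] * 1127
always_real = [1] * len(targets)
print(majority_baseline(targets))      # about 0.53
print(accuracy(always_real, targets))  # same value, computed the long way
```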

An accuracy of about 0.53 is close to chance for two roughly balanced classes, so a model like this needs to be trained for longer.