## Seminar 2

### Intro to PyTorch

based on official [PyTorch Blitz Tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)

## To install PyTorch please follow instructions from official [website](https://pytorch.org/get-started/locally/).

### What is PyTorch?

* It's a package for scientific computations, basically, a replacement for NumPy, that supports GPUs.
* It's a deep learning research platform

### Tensors

Tensors are similar to NumPy's ndarrays, with the exception of being able to be operated with using GPUs.

In [123]:
import torch

To construct a randomly initialized matrix:

In [125]:
x = torch.rand(5, 3)
print(x)

tensor([[0.6372, 0.1697, 0.4210],
        [0.0900, 0.1915, 0.7062],
        [0.4440, 0.2796, 0.2975],
        [0.9916, 0.2178, 0.3695],
        [0.7804, 0.1118, 0.3410]])


To construct a matrix, filled with zeros and data-type long:

In [126]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])


A tensor may be initialized directly from data:

In [127]:
x = torch.tensor([5.5, 3])
print(x)

tensor([5.5000, 3.0000])


A tensor may be created using an existing tensor. The new one will inherit all the properties of the one, that was passed as a parameter, apart from those, that were parametrized explicitly:

In [128]:
x = x.new_ones(5, 3)      # new_* methods take in sizes
print(x)

x = torch.randn_like(x, dtype=torch.float)    # override dtype!
print(x)   

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
tensor([[-0.6322, -0.9671, -0.1247],
        [ 2.0053,  0.4415,  2.3298],
        [ 1.4352,  0.5937, -1.1980],
        [ 0.5494, -0.4754, -0.2074],
        [ 0.1042,  1.0216, -1.3735]])


To check the size of a tensor we use:

In [129]:
x.size()

torch.Size([5, 3])

Another way:

In [130]:
x.shape

torch.Size([5, 3])

NB! The type torch.Size is an abstraction from a mere tuple, so it supports all the tuple operations

### Operations

PyTorch is so pythonic, that it implements operations on tensors in many different syntaxes to match everyones needs and tastes. Let us take a look at the addition operation:

In [131]:
y = torch.rand(5, 3)
print(x + y)

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])


In [132]:
print(torch.add(x, y))

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])


In case you need it, you can pass an out variable as a parameter to any operation like add:

In [133]:
result = torch.empty(5, 3)
torch.add(x, y, out=result)
print(result)

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])


Tensor objects support all the operations as methods:

In [134]:
x.add(y)

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])

In case you need to perform an operation in-place, you use the operation_ syntax:

In [135]:
x.add_(y)

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])

The result of an in-place operation is stored in the left operand object, in this particular case in x

In [136]:
x

tensor([[-0.6134, -0.8136,  0.6153],
        [ 2.7688,  0.4961,  3.0683],
        [ 1.8372,  1.2829, -0.8196],
        [ 1.2146,  0.5219,  0.5046],
        [ 1.0140,  1.4486, -0.6107]])

The sugarish NumPy indexing syntax is also supported:

In [137]:
print(x[:, 1])

tensor([-0.8136,  0.4961,  1.2829,  0.5219,  1.4486])


In case there is a need to resize (*reshape*) a tensor, the ``` view ``` method comes into action:

In [138]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)  # the size -1 denotes the original dimension size
print(x.size(), y.size(), z.size())

torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])


To get the number out of the tensor use:

In [139]:
x = torch.randn(1)
print(x)
print(x.item())

tensor([-0.1627])
-0.16270415484905243


In [140]:
y[1].item()

-1.1159268617630005

In case we need to check, if CUDA is available, we use:

In [141]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
    x = x.to(device)                       # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))

tensor([0.8373], device='cuda:0')
tensor([0.8373], dtype=torch.float64)


### Autograd

The next thing that is worth looking at is the automatic gradient computation module of pyTorch. It is called
*torch.autograd* . This module does all the *magic* that is connected with gradient computations, using a sofisticated computation graph architecture, that is going to be covered later. For now we will get to know only basic concepts of it.

To include a `Tensor` into the computation graph, its `.requires_grad` attribute should be set to `True`

In [142]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


After any operation is applied (in this particular case - addition), a `Function` object is assigned to the `.grad_fn` attribute of the tensor `y` and added to the computation graph for backward propagation of the gradient.

In [143]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [144]:
print(y.grad_fn)

<AddBackward0 object at 0x7f4d6aef3640>


In [145]:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


This `.grad_fn` attribute can be changed on the fly. See the difference: if a tensor does not require gradient, it is not included into the computation graph, hence it does not store any backward function. However, once `.grad_fn` changed to `True`, all the operations start to be tracked.

In [146]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x7f4d621ee190>


One of the most important things in the torch framework is the `.backward()` method. It triggers the calculation of the gradients for all the nodes (e.g. neural net parameters) in the computation graph that are chained to the callee node. 

NB! `.backward()` when called on a \[1, 1\] tensor, requires no arguments

In [147]:
out.backward()

In [148]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


In [149]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

tensor([-1226.9231,   916.6340,   181.5013], grad_fn=<MulBackward0>)


If there is a need to stop autograd from tracking history on Tensors you can use either context manager:

In [150]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


or `.detach()` method:

In [151]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())

True
False
tensor(True)


## Logistic Regression Using PyTorch
### based on [this](https://blog.goodaudience.com/awesome-introduction-to-logistic-regression-in-pytorch-d13883ceaa90) blogpost

Basically, most of pyTorch modeling can be broken down into these steps:
* loading the dataset
* making the dataset iterable
* instantiating the **model** class
* instantiating the **loss** class
* instantiating the **optimizer** class
* training the model

#### Load Dataset

In [152]:
%pip install torchtext

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [153]:
from torchtext import data
from torch.nn import functional as F
import torch

In [154]:
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
else:
    DEVICE = torch.device("cpu")

In [155]:
SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [156]:
import nltk

In [157]:
nltk.download("movie_reviews")

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [158]:
import re
import os

In [159]:
POS = "pos"
NEG = "neg"

In [160]:
text_sentiments = (POS, NEG)

train_data_list = []
test_data_list = []

examples = []

for sentiment in text_sentiments:
    for filename in os.listdir(os.path.join(nltk.corpus.movie_reviews.root.path, sentiment)):
        with open(os.path.join(nltk.corpus.movie_reviews.root.path, sentiment, filename), "r", encoding="utf-8") as file:
            examples.append({"text": file.read().strip(),
                             "sentiment": int(sentiment == POS)})

In [161]:
%pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [162]:
import pandas as pd

In [163]:
examples_df = pd.DataFrame(examples)

In [164]:
examples_df.head()

Unnamed: 0,text,sentiment
0,"first , i am not a big fan of the x-files tv s...",1
1,the thought-provoking question of tradition ov...,1
2,"the small-scale film , in limited release , "" ...",1
3,metro i've seen san francisco in movies many t...,1
4,"three things i learned from "" being john malko...",1


In [165]:
examples_df = examples_df.sample(frac=1)
train_df = examples_df.sample(frac=0.7)
test_df = examples_df.drop(index=train_df.index)
train_texts, train_labels = train_df.text.values, train_df.sentiment.values
test_texts, test_labels = test_df.text.values, test_df.sentiment.values

In [166]:
test_labels

array([0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0,

In [167]:
len(test_df.text.values), len(test_df.sentiment.values), len(test_labels)

(600, 600, 600)

In [168]:
from typing import List, Dict, Any, Iterable
from collections import Counter, OrderedDict
import math
from itertools import islice
import torch.nn.functional as F

In [169]:
class TfIdfVectorizer:

            
    def __init__(self, lower=True, tokenizer_pattern=r"(?i)\b[a-z]{2,}\b"):  # ?i for case insensitive match
        # What are the drawbacks of this tokenization?
        self.lower = lower  
        self.tokenizer_pattern = re.compile(tokenizer_pattern)
        self.vocab_df = OrderedDict()
        
    def __tokenize(self, text: str) -> List[str]:
        return self.tokenizer_pattern.findall(text.lower() if self.lower else text)
    
    def fit(self, texts: Iterable[str]):
        term_id = 0
        for doc_idx, doc in enumerate(texts):
            tokenized = self.__tokenize(doc)
            for term in tokenized:
                if term not in self.vocab_df:
                    self.vocab_df[term] = {}  # Creating term-based dict
                    self.vocab_df[term]["doc_ids"] = {doc_idx}  # For each term adding documents where it is found
                    self.vocab_df[term]["doc_count"] = 1  # Initialising doc count
                    self.vocab_df[term]["id"] = term_id  # Adding term id in our vector
                    term_id += 1
                elif doc_idx not in self.vocab_df[term]["doc_ids"]:
                    self.vocab_df[term]["doc_ids"].add(doc_idx)  # Adding new documents for existing terms
                    self.vocab_df[term]["doc_count"] += 1  # Incrementing count
        texts_len = len(texts)  # Number of texts
        for term in self.vocab_df:
            # Calculating idf
            self.vocab_df[term]["idf"] = math.log(texts_len / self.vocab_df[term]["doc_count"])
        
        
    def transform(self, texts: Iterable[str]) -> torch.sparse.LongTensor:
        values = []
        doc_indices = []
        term_indices = []
        for doc_idx, raw_doc in enumerate(texts):
            term_counter = {}
            for token in self.__tokenize(raw_doc):
                if token in self.vocab_df:
                    term = self.vocab_df[token]
                    term_idx = term["id"]
                    term_idf = term["idf"]
                    if term_idx not in term_counter:
                        term_counter[term_idx] = term_idf
                    else:
                        term_counter[term_idx] += term_idf
            term_indices.extend(term_counter.keys())
            values.extend(term_counter.values())
            doc_indices.extend([doc_idx] * len(term_counter))
        # Transferring dict and encoded texts to cuda
        indices = torch.LongTensor([doc_indices, term_indices]).to(DEVICE)
        values_tensor = torch.LongTensor(values).to(DEVICE)
        # To optimise calculations we make it sparse
        tf_idf = torch.sparse.LongTensor(indices, values_tensor, torch.Size([len(texts), len(self.vocab_df)])).to(DEVICE)
        return tf_idf

In [170]:
%%time
vectorizer = TfIdfVectorizer()
vectorizer.fit(train_texts)

CPU times: user 1.55 s, sys: 33.2 ms, total: 1.58 s
Wall time: 1.75 s


In [171]:
%%time
train_data = vectorizer.transform(train_texts)
test_data = vectorizer.transform(test_texts)

  values_tensor = torch.LongTensor(values).to(DEVICE)


CPU times: user 1.84 s, sys: 20.4 ms, total: 1.86 s
Wall time: 1.88 s


In [172]:
train_texts[1]

"in 1912 , a ship set sail on her maiden voyage across the atlantic for america . \nthis ship was built to be the largest ship in the world , and she was . \nshe was also build to be one of the most luxurious , and that she was . \nfinally , she was built to be unsinkable and that unfortunately she was not . \nto get a ticket for this voyage you either : spent a life's savings to get to america to start life anew , were part of the upper class and had the money to spare , or finally were lucky enough to have a full house in a poker match by the docks like jack dawson . \njack dawson makes the trip , and happens to be at the right place at the right time . \nrose dewitt bukater , a first class passenger , climbs the railings at the aft of the ship with thoughts of jumping . \nthus is started a tale of romance and intrigue , and a tale of death and tragedy . . . \nthis movie is about a tragic event that took place a great many years ago , an even that should not be taken lightly as any o

In [173]:
train_data[1]

tensor(indices=tensor([[ 23, 395, 396, 397, 241, 246, 398, 399, 321,  20, 400,
                        149, 401,  97,  93, 402,  40,  52, 403, 109,  14, 404,
                        232, 405,  48,   9, 406, 407,   5, 408, 409, 410,  51,
                        135, 411, 105, 210, 412, 413, 414, 415, 416, 322, 417,
                        418, 419, 420, 421, 422, 163, 423, 193, 254, 424, 425,
                        426, 427, 134, 428, 146, 429, 430, 319, 237, 431, 304,
                        432, 372, 389, 433, 434, 435, 371, 436, 437, 438, 439,
                          7, 440, 441, 442,   1, 443, 444, 445, 446, 447, 448,
                        327, 206, 449, 450, 451, 452, 453, 150, 454,   2, 291,
                        332, 455, 456,  90, 207, 160, 457, 458, 459, 460, 461,
                        462,  59, 324, 463, 464,  54,  58,  94, 229, 465, 166,
                        213,  98, 466, 186, 467, 468, 469, 470, 471, 472, 473,
                        474,  92, 475, 476, 477,  55

#### Make the dataset iterable

In [174]:
from torch.utils.data import DataLoader, Dataset

In [175]:
train_data_loader = DataLoader(train_texts, batch_size=64)
test_data_loader = DataLoader(test_texts, batch_size=64)

In [176]:
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

#### Build the model

In [177]:
from torch import nn  # nn layers
from torch.nn import functional as F  # loss functions

class LogisticRegressionModel(nn.Module):

    def __init__(self, input_dim, output_dim):
        super(LogisticRegressionModel, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)  # What is our input dim?

    def forward(self, x):
        out = F.softmax(self.linear(x))
        return out

In [178]:
model = LogisticRegressionModel(len(vectorizer.vocab_df), 2)

In [179]:
criterion = nn.CrossEntropyLoss()

In [180]:
learning_rate = 0.001

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [181]:
# Type of parameter object
print(model.parameters())

# Length of parameters
print(len(list(model.parameters())))

# FC 1 Parameters
print(list(model.parameters())[0].size())

# FC 1 Bias Parameters
print(list(model.parameters())[1].size())

<generator object Module.parameters at 0x7f4d110eb5f0>
2
torch.Size([2, 33804])
torch.Size([2])


In [182]:
model.to(DEVICE)

LogisticRegressionModel(
  (linear): Linear(in_features=33804, out_features=2, bias=True)
)

In [183]:
num_epochs = 5

In [184]:
iteration = 0
for epoch in range(num_epochs):
    print(f"Epoch #{epoch}")
    for i, (texts, labels) in enumerate(zip(train_data_loader, batch(train_labels, 64))):
        labels = torch.LongTensor(labels).to(DEVICE)
        # To take document length into consideration
        texts = F.normalize(vectorizer.transform(texts).to(torch.float).to_dense()).requires_grad_()
#         print(texts.size(), labels.size(0))

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model(texts)

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, labels)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # Counting epochs
        iteration += 1

        if iteration % 50 == 0:
            # Calculate Accuracy         
            correct = 0
            total = 0
            # Iterate through test dataset
            for test_texts_batch, test_labels_batch in zip(test_data_loader, batch(test_labels, 64)):
                # Load value to a Torch Variable
                test_texts_tensor = F.normalize(vectorizer.transform(test_texts_batch).to(torch.float).to_dense())
                test_labels_batch = torch.Tensor(test_labels_batch).to(torch.long)
                # Forward pass only to get logits/output
                outputs = model(test_texts_tensor)

                # Get predictions from the maximum value
                _, predicted = torch.max(outputs.data, 1)

                # Total number of labels
                total += test_labels_batch.size(0)

                # Total correct predictions
                correct += (predicted.detach().cpu() == test_labels_batch).sum()

            accuracy = 100 * correct / total

            # Print Loss
            print('Iteration: {}. Loss: {}. Accuracy: {}'.format(iteration, loss.item(), accuracy))

Epoch #0


  values_tensor = torch.LongTensor(values).to(DEVICE)
  out = F.softmax(self.linear(x))


Epoch #1
Epoch #2
Iteration: 50. Loss: 0.6932342648506165. Accuracy: 52.5
Epoch #3
Epoch #4
Iteration: 100. Loss: 0.6932410001754761. Accuracy: 52.66666793823242


## Logistic Regression Using Scikit-learn

This is more simple way to vectorize documents and train Logistic regression model.

In [185]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

We call *fit_transform* to train tfidf vocabulary and vectorize train dataset:

In [186]:
%%time
vectorizer = TfidfVectorizer()
train_data = vectorizer.fit_transform(train_texts)

CPU times: user 1.08 s, sys: 6.24 ms, total: 1.09 s
Wall time: 1.41 s


For test dataset we call only transform method:

In [187]:
%%time
test_data = vectorizer.transform(test_texts)

CPU times: user 409 ms, sys: 1.03 ms, total: 410 ms
Wall time: 478 ms


The list of words in vocabulary:

In [188]:
vectorizer.get_feature_names_out()[1000:1020]

array(['advertised', 'advertisement', 'advertisements', 'advertiser',
       'advertising', 'advertisment', 'advice', 'advil', 'advisable',
       'advise', 'advised', 'adviser', 'advisers', 'advises', 'advising',
       'advisor', 'advisors', 'advocate', 'advocated', 'advocates'],
      dtype=object)

The weights of pre-trained model:

Initializing and training Logistic regression model:

In [189]:
clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)

LogisticRegression(random_state=0)

In [190]:
clf.coef_.shape

(1, 34471)

In [191]:
pred_data = clf.predict(test_data)

In [192]:
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.82      0.82      0.82       301
           1       0.82      0.82      0.82       299

    accuracy                           0.82       600
   macro avg       0.82      0.82      0.82       600
weighted avg       0.82      0.82      0.82       600



Logistic reression with bag of words:

In [193]:
vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)
pred_data = clf.predict(test_data)
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       301
           1       0.84      0.81      0.82       299

    accuracy                           0.83       600
   macro avg       0.83      0.83      0.83       600
weighted avg       0.83      0.83      0.83       600



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Tasks:
1. Get the most important feature for pre-trained Logistic regression model
2. Add lemmatisation or stemming as text preprocessing
3. Remove stopwords from CountVectorizer or TfidfVectorizer with stop_words parameter
4. Add bigrams and threegrams to CountVectorizer or TfidfVectorizer with ngram_range parameter

In [194]:
import numpy as np

In [195]:
np.argmax(abs(clf.coef_))

2579

In [196]:
vectorizer.get_feature_names_out()[2529]

'babysit'

In [198]:
vectorizer.get_feature_names_out()[1000:1020]

array(['advertised', 'advertisement', 'advertisements', 'advertiser',
       'advertising', 'advertisment', 'advice', 'advil', 'advisable',
       'advise', 'advised', 'adviser', 'advisers', 'advises', 'advising',
       'advisor', 'advisors', 'advocate', 'advocated', 'advocates'],
      dtype=object)

In [199]:
%pip install pymorphy2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/55.5 KB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.5/55.5 KB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting docopt>=0.6
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dawg-python>=0.7.1
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Collecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.2/8.2 MB[0m [31m58.6 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... [?25l[?25hdone
  Created wheel for do

In [200]:
from pymorphy2 import MorphAnalyzer

morph = MorphAnalyzer()

In [201]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [202]:
stemmer = nltk.stem.PorterStemmer()

def my_tokenize(text):
  words = nltk.word_tokenize(text)
  res_words = [stemmer.stem(word) for word in words]
  return ' '.join(res_words)

In [203]:
my_tokenize("writing text")

'write text'

In [204]:
vectorizer = CountVectorizer(analyzer=my_tokenize)
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

clf = LogisticRegression(random_state=0)
clf.fit(train_data, train_labels)
pred_data = clf.predict(test_data)
print(classification_report(test_labels, pred_data))

              precision    recall  f1-score   support

           0       0.64      0.58      0.61       301
           1       0.61      0.67      0.64       299

    accuracy                           0.62       600
   macro avg       0.62      0.62      0.62       600
weighted avg       0.62      0.62      0.62       600



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
