
Import the libraries

In [1]:
import re
import nltk
import string
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

Download the required NLTK resources

In [14]:
nltk.download("punkt", force=True)
nltk.download("stopwords", force=True)
nltk.download("wordnet", force=True)
nltk.download('omw-1.4', force=True)
nltk.download('punkt_tab', force=True)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Convert the text to lowercase and remove punctuation.

In [15]:
def preprocess_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    return text
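
A quick check of what the regular expression keeps and drops may help here. The sketch below repeats `preprocess_text` so that it runs on its own; the sample strings are illustrative. Note that `\w` also matches digits and underscores, so those survive, while a hyphen is removed without inserting a space.

```python
import re


def preprocess_text(text: str) -> str:
    # Same logic as the cell above: lowercase, then drop every character
    # that is neither a word character (\w) nor whitespace (\s).
    text = text.lower()
    return re.sub(r"[^\w\s]", "", text)


sample = preprocess_text("To find our long-forgotten gold.")
print(sample)  # to find our longforgotten gold

# Digits and underscores count as word characters, so they are kept.
kept = preprocess_text("snake_case, 42!")
print(kept)  # snake_case 42
```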

Tokenization, stop-word removal, and lemmatization.

In [16]:
def tokenize_and_lemmatize(text: str) -> list[str]:
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return words

Build a vocabulary that maps each unique word to an index.

In [17]:
def build_vocabulary(corpus: list[str]) -> dict:
    vocab = {}
    for text in corpus:
        for word in tokenize_and_lemmatize(text):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab
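
To see how indices are assigned, here is a minimal sketch that swaps the NLTK pipeline for plain `str.split` (an assumption made only so the example runs without NLTK data). `build_vocabulary_simple` is a hypothetical name, not part of the notebook.

```python
def build_vocabulary_simple(corpus: list[str]) -> dict[str, int]:
    # Assumption: whitespace tokenization stands in for tokenize_and_lemmatize.
    vocab: dict[str, int] = {}
    for text in corpus:
        for word in text.split():
            if word not in vocab:
                # The next free index equals the current vocabulary size.
                vocab[word] = len(vocab)
    return vocab


vocab = build_vocabulary_simple(["misty mountain cold", "cold cavern old"])
print(vocab)
# {'misty': 0, 'mountain': 1, 'cold': 2, 'cavern': 3, 'old': 4}
```

Because indices follow first-seen order, the same corpus in a different order would yield a different column layout in the matrices built from this vocabulary.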

Bag of Words (BoW) implementation.

In [18]:
def bag_of_words(corpus: list[str], vocab: dict) -> np.ndarray:
    bow_matrix = np.zeros((len(corpus), len(vocab)))
    for i, text in enumerate(corpus):
        for word in tokenize_and_lemmatize(text):
            if word in vocab:
                bow_matrix[i][vocab[word]] += 1
    return bow_matrix
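
In the sample corpus used in this notebook no word repeats within a document, so every count is 0 or 1. The sketch below shows a case with a repeated word and an out-of-vocabulary word. It is a dependency-free variant that uses nested lists instead of a NumPy array and `str.split` instead of the NLTK tokenizer; both substitutions, and the name `bag_of_words_simple`, are assumptions for illustration.

```python
def bag_of_words_simple(corpus: list[str], vocab: dict[str, int]) -> list[list[int]]:
    # One row per document, one column per vocabulary entry.
    matrix = [[0] * len(vocab) for _ in corpus]
    for i, text in enumerate(corpus):
        for word in text.split():
            if word in vocab:  # words outside the vocabulary are ignored
                matrix[i][vocab[word]] += 1
    return matrix


vocab = {"wind": 0, "night": 1, "fire": 2}
bow = bag_of_words_simple(["wind wind night", "fire dragon"], vocab)
print(bow)  # [[2, 1, 0], [0, 0, 1]]
```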

Compute TF (term frequency) from the BoW matrix.

In [19]:
def compute_tf(bow_matrix: np.ndarray) -> np.ndarray:
    return bow_matrix / np.maximum(bow_matrix.sum(axis=1, keepdims=True), 1)

Compute IDF (inverse document frequency).

In [20]:
def compute_idf(bow_matrix: np.ndarray) -> np.ndarray:
    num_docs = bow_matrix.shape[0]
    doc_freq = np.count_nonzero(bow_matrix, axis=0)
    return np.log10(num_docs / np.maximum(doc_freq, 1))

Compute the TF-IDF matrix from TF and IDF.

In [21]:
def compute_tfidf(bow_matrix: np.ndarray) -> np.ndarray:
    tf = compute_tf(bow_matrix)
    idf = compute_idf(bow_matrix)
    return tf * idf
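
As a sanity check on the formulas, a single TF-IDF entry can be reproduced by hand. In the run recorded in this notebook the first document has 4 tokens, the corpus has 7 documents, and each word occurs in exactly one of them. The variable names below are illustrative.

```python
import math

tokens_in_doc = 4   # 'far', 'misty', 'mountain', 'cold'
num_docs = 7        # documents in the recorded run
doc_freq = 1        # each word occurs in exactly one document

tf = 1 / tokens_in_doc                 # term frequency: count / document length
idf = math.log10(num_docs / doc_freq)  # inverse document frequency, base 10
tfidf = tf * idf
print(round(tfidf, 3))  # 0.211, displayed as 0.2 with precision=1

# A word present in every document gets idf = log10(1) = 0, so its TF-IDF is 0.
idf_everywhere = math.log10(num_docs / num_docs)
print(idf_everywhere)  # 0.0
```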

Character-level tokenization that keeps only printable ASCII characters.

In [22]:
def ascii_tokenizer(text: str) -> list[str]:
    return [char for char in text if char in string.printable]

Vectorize ASCII characters (convert characters to their ASCII codes).

In [23]:
def ascii_vectorizer(text: str) -> list[int]:
    return [ord(char) for char in text if char in string.printable]
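
A short demonstration of the filtering behaviour: characters outside `string.printable` (for example accented letters) are silently dropped, so the result can be shorter than the input. The two functions are repeated so the block runs on its own; the sample string is illustrative.

```python
import string


def ascii_tokenizer(text: str) -> list[str]:
    return [char for char in text if char in string.printable]


def ascii_vectorizer(text: str) -> list[int]:
    return [ord(char) for char in text if char in string.printable]


text = "Caf\u00e9!"  # "Café!" with a precomposed, non-ASCII e-acute
tokens = ascii_tokenizer(text)
codes = ascii_vectorizer(text)
print(tokens)  # ['C', 'a', 'f', '!']
print(codes)   # [67, 97, 102, 33]
```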

Vectorize text by indexing unique tokens.

In [24]:
def vectorize(tokens: list[str]) -> list[int]:
    dict_vectors = {}
    result = []
    for word in tokens:
        if word in dict_vectors:
            result.append(dict_vectors[word])
        else:
            dict_vectors[word] = len(dict_vectors)
            result.append(dict_vectors[word])
    return result
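
For illustration, a repeated token maps back to the index it received on first sight. The function below is a slightly condensed copy of the one above, and the token list is made up.

```python
def vectorize(tokens: list[str]) -> list[int]:
    dict_vectors: dict[str, int] = {}
    result = []
    for word in tokens:
        if word not in dict_vectors:
            # First occurrence: assign the next free index.
            dict_vectors[word] = len(dict_vectors)
        result.append(dict_vectors[word])
    return result


ids = vectorize(["the", "pines", "were", "roaring", "on", "the", "height"])
print(ids)  # [0, 1, 2, 3, 4, 0, 5]
```

The mapping is rebuilt on every call, so indices from two separate calls are not comparable with each other.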

In [28]:
def main():
    """
    Main function of the program.
    """
    documents = [
        "Far over the misty mountains cold",
        "To dungeons deep and caverns old",
        "We must away ere break of day",
        "To find our long-forgotten gold."
        # NOTE: there is no comma after the string above, so Python joins it
        # with the next literal; the corpus holds 7 documents, not 8.
        "The pines were roaring on the height",
        "The winds were moaning in the night",
        "The fire was red, it flaming spread",
        "The trees like torches blazed with light."
    ]

    preprocessed_docs = [preprocess_text(doc) for doc in documents]
    tokenized_docs = [" ".join(tokenize_and_lemmatize(doc)) for doc in preprocessed_docs]

    print("\nPreprocessed Documents:", tokenized_docs)

    vocab = build_vocabulary(tokenized_docs)

    bow_matrix = bag_of_words(tokenized_docs, vocab)
    tfidf_matrix = compute_tfidf(bow_matrix)

    np.set_printoptions(precision=1, suppress=True)
    print("\nBag of Words Matrix:")
    print(bow_matrix)
    print("\nTF-IDF Matrix:")
    print(tfidf_matrix)

    print("\nASCII Tokenization:", ascii_tokenizer(documents[0]))
    print("\nASCII Vectorization:", ascii_vectorizer(documents[0]))

    print("\nVectorized Tokens (Lemmatized):", vectorize(tokenize_and_lemmatize(documents[0])))
    print("\nVectorized Tokens (Preprocessed):", vectorize(preprocessed_docs[0].split()))


if __name__ == "__main__":
    main()



Preprocessed Documents: ['far misty mountain cold', 'dungeon deep cavern old', 'must away ere break day', 'find longforgotten goldthe pine roaring height', 'wind moaning night', 'fire red flaming spread', 'tree like torch blazed light']

Bag of Words Matrix:
[[1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.
  1. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1. 1. 1. 1. 1.]]

TF-IDF Matrix:
[[0.2 0.2 0.2 0.2 0.  0.  0.  0.  0. 