## Count-based vectors: Bag-of-Words and TF-IDF

**Bag-of-Words (BoW)**: count how many times each token occurs in a document.  
**TF-IDF**: downweight words that are frequent across the corpus and upweight words that are specific to a document.
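
To make the reweighting concrete, here is a minimal sketch that computes TF-IDF by hand. It assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 with L2 normalization of each document vector, which is the variant scikit-learn's `TfidfVectorizer` uses by default (`smooth_idf=True`, `norm='l2'`); other libraries use different formulas. The three-document corpus and the helper name `tfidf_by_hand` are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (not taken from the notebook).
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tfidf_by_hand(docs):
    """Return one {token: weight} dict per document (smoothed idf, L2 norm)."""
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        raw = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({tok: v / norm for tok, v in raw.items()})
    return vectors

vectors = tfidf_by_hand(docs)
for i, vec in enumerate(vectors):
    print(i, {tok: round(w, 3) for tok, w in sorted(vec.items())})
```

In the second document the rare token `bad` ends up with a larger weight than `movie`, which occurs in two of the three documents.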

Pros:
- fast, simple, and often works “surprisingly well” for classification.

Cons:
- word order is lost (it is a bag of words)
- very high sparsity
- weak transfer to new domains
- poor handling of semantics (synonyms, paraphrases)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

toy_docs = [
    "I love this movie, it is fantastic and fun",
    "This film was terrible, boring and bad",
    "What a great and wonderful experience",
    "Awful plot, bad acting, I hate it",
    "Fantastic acting and great story",
    "Boring movie, not good",
]
toy_y = np.array([1,0,1,0,1,0])  # 1=positive, 0=negative

bow = CountVectorizer(lowercase=True)
X_bow = bow.fit_transform(toy_docs)

tfidf = TfidfVectorizer(lowercase=True)
X_tfidf = tfidf.fit_transform(toy_docs)

print("BoW shape:", X_bow.shape, "nnz:", X_bow.nnz)
print("TF-IDF shape:", X_tfidf.shape, "nnz:", X_tfidf.nnz)
print("\nVocabulary:", list(bow.vocabulary_.keys())[:10], "...")

BoW shape: (6, 24) nnz: 35
TF-IDF shape: (6, 24) nnz: 35

Vocabulary: ['love', 'this', 'movie', 'it', 'is', 'fantastic', 'and', 'fun', 'film', 'was'] ...
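
The `nnz` value printed above makes the sparsity easy to quantify. A small sketch using the numbers from that output (6 documents, 24 vocabulary entries, 35 non-zero cells):

```python
# Numbers copied from the output above.
n_docs, vocab_size, nnz = 6, 24, 35

density = nnz / (n_docs * vocab_size)  # share of non-zero cells
print(f"density: {density:.1%}, zeros: {1 - density:.1%}")
```

Even on this toy corpus roughly three quarters of the cells are zero. On real corpora, where the vocabulary has tens of thousands of entries, the share of non-zero cells is usually far smaller, which is why the vectorizers return SciPy sparse matrices.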


In [3]:
# Look at the top TF-IDF words in a single document
doc_id = 0
row = X_tfidf[doc_id].toarray().ravel()
top = row.argsort()[::-1][:8]
inv_vocab = {i:w for w,i in tfidf.vocabulary_.items()}
[(inv_vocab[i], row[i]) for i in top if row[i] > 0]

[('is', np.float64(0.4068386546370098)),
 ('fun', np.float64(0.4068386546370098)),
 ('love', np.float64(0.4068386546370098)),
 ('this', np.float64(0.3336135167099814)),
 ('movie', np.float64(0.3336135167099814)),
 ('it', np.float64(0.3336135167099814)),
 ('fantastic', np.float64(0.3336135167099814)),
 ('and', np.float64(0.24136075313322858))]

### Limitations of count-based approaches (talking points)

1) **No word order**: “dog bites man” vs “man bites dog” → identical vectors (in BoW).  
2) **Synonyms**: “great” and “excellent” are separate coordinates.  
3) **Sparsity**: as the vocabulary grows, so do memory and compute time.  
4) **Domain shift**: new words or genres → degraded quality.  
5) **Morphology**: without lemmatization, inflected forms such as Russian “кошки/кошку/кошкой” (forms of “кошка”, cat) bloat the vocabulary.
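
The first two points can be checked in a few lines. The sketch below uses a hypothetical helper `bow` built on `collections.Counter` rather than `CountVectorizer`, so it runs without any dependencies. It also shows that adding bigrams is one common partial remedy for the loss of order.

```python
from collections import Counter

def bow(text, n=1):
    """Count n-grams of whitespace tokens (n=1 gives plain Bag-of-Words)."""
    tokens = text.lower().split()
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

a, b = "dog bites man", "man bites dog"

# Unigram counts are identical: word order is lost.
print(bow(a) == bow(b))

# Bigrams keep local order, so the two sentences differ again.
print(bow(a, n=2) == bow(b, n=2))

# Synonyms share no coordinates: the two vocabularies do not overlap.
great, excellent = bow("great"), bow("excellent")
print(set(great) & set(excellent))
```

The price of bigrams is an even larger and sparser vocabulary, so point 3 gets worse while point 1 gets better.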

## Embeddings: Word2Vec, GloVe, “dense” representations (7–9 min)

**Embedding**: a mapping from a token to a dense vector in `R^d`, where closeness of vectors ≈ semantic similarity.
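
Closeness between embeddings is usually measured with cosine similarity. A minimal sketch, assuming three hand-made 3-dimensional vectors that were invented for illustration (real embeddings are learned and have tens to hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not produced by any model.
emb = {
    "great":     [0.9, 0.8, 0.1],
    "excellent": [0.8, 0.9, 0.2],
    "terrible":  [-0.7, -0.9, 0.1],
}

print(round(cosine(emb["great"], emb["excellent"]), 3))
print(round(cosine(emb["great"], emb["terrible"]), 3))
```

Unlike in BoW, the two synonyms now have a similarity close to 1 instead of being orthogonal coordinates.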

- **Word2Vec (CBOW/Skip-gram)**: CBOW predicts a word from its context; Skip-gram predicts the context from a word.
- **GloVe**: uses global co-occurrence matrices and a weighted factorization.
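
The difference between the two Word2Vec objectives is easiest to see in the training examples they generate. The sketch below only shows how examples are formed from a sliding window; the function names `skipgram_pairs` and `cbow_examples` are hypothetical, and real implementations add details such as negative sampling and subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: Skip-gram predicts context from center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window=2):
    """Yield (context_list, center) examples: CBOW predicts center from context."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center

tokens = "fantastic acting and great story".split()
print(list(skipgram_pairs(tokens, window=1)))
print(list(cbow_examples(tokens, window=1)))
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position with the whole window as input.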

In [11]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [26]:
def simple_word_tokenize(text):
    return text.split()


def try_train_word2vec(sentences):
    try:
        from gensim.models import Word2Vec
        model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=200, workers=1)
        return model
    except Exception as e:
        print("gensim Word2Vec not available:", type(e).__name__, "-", e)
        return None

sentences = [simple_word_tokenize(t.lower()) for t in toy_docs]
w2v = try_train_word2vec(sentences)

if w2v is not None:
    print(w2v.wv.most_similar("fun", topn=5))

[('boring', 0.23871083557605743), ('this', 0.19837909936904907), ('film', 0.19324013590812683), ('good', 0.15575988590717316), ('i', 0.15475760400295258)]


## Basic text classification
We assemble a simple pipeline:
- text → TF-IDF
- classifier: Logistic Regression / Linear SVM / Naive Bayes
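
Because the vectorizer and the classifier are separate pipeline steps, swapping the classifier is a one-line change. A sketch of the wiring, assuming a hypothetical four-document corpus that is far too small for any real evaluation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus, only to show the wiring.
texts = [
    "great fantastic movie", "wonderful story and acting",
    "boring terrible film", "awful plot, bad acting",
]
labels = [1, 1, 0, 0]  # 1=positive, 0=negative

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "naive_bayes": MultinomialNB(),
}

predictions = {}
for name, model in candidates.items():
    # Same vectorizer step every time; only the final estimator changes.
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    pipe.fit(texts, labels)
    predictions[name] = pipe.predict(["fantastic acting", "boring plot"])
    print(name, predictions[name])
```

Keeping the vectorizer inside the pipeline also means it is fitted on the training split only, so no vocabulary or IDF statistics leak from the test data.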

In [23]:
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.space', 'rec.sport.hockey']

data = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes')
)

for class_id, class_name in enumerate(data.target_names):
    idx = list(data.target).index(class_id)
    print("="*80)
    print("CLASS:", class_name)
    print("="*80)
    print(data.data[idx][:1000])
    print("\n\n")


CLASS: rec.sport.hockey




	I think that you are incorrect, Roger.  Patrick,
Smythe and Adams all played or coached in the league before becoming
front office types.  Hence, they did help build the league, although
they were not great players themselves.  

	I agree that a name is a name is a name, and if some people
have trouble with names that are not easily processed by the fans,
then changing them to names that are more easily processed seems like
a reasonable idea.  If we can get people in the (arena) door by being
uncomplicated, then let's do so.  Once we have them, they will realize
what a great game hockey is, and we can then teach them something
abotu the history of the game.  
 

	No, I would not want to see a Ballard division.  But to say
that these owners are assholes, hence all NHL management people are
assholes would be fallacious.  Conn Smythe, for example, was a classy
individual (from what I have heard). 

	Also, isn't the point of "professional" hockey to make money


In [27]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix

categories = ['sci.space', 'rec.sport.hockey']

data = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes')
)

X = data.data
y = data.target  # 0 or 1

print("Total samples:", len(X))
print("Class names:", data.target_names)
print()

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

# Pipeline: TF-IDF + Logistic Regression
clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", ngram_range=(1,2))),
    ("lr", LogisticRegression(max_iter=2000))
])

clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print("Classification report:\n")
print(classification_report(y_test, pred, target_names=data.target_names))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


Total samples: 1986
Class names: ['rec.sport.hockey', 'sci.space']

Classification report:

                  precision    recall  f1-score   support

rec.sport.hockey       0.99      0.93      0.96       300
       sci.space       0.93      0.99      0.96       296

        accuracy                           0.96       596
       macro avg       0.96      0.96      0.96       596
    weighted avg       0.96      0.96      0.96       596

Confusion matrix:
 [[278  22]
 [  4 292]]
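
The numbers in the classification report can be recomputed from the confusion matrix alone. A short sketch, using the matrix printed above and scikit-learn's convention that rows are true classes and columns are predicted classes:

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = [[278, 22],   # true rec.sport.hockey
      [4, 292]]    # true sci.space
names = ["rec.sport.hockey", "sci.space"]

metrics = {}
for k, name in enumerate(names):
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (precision, recall, f1)
    print(f"{name:<17} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

accuracy = sum(cm[k][k] for k in range(2)) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

The asymmetry is visible directly in the matrix: 22 hockey posts were classified as sci.space, while only 4 space posts went the other way.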


### Interpretation: which features matter?

For linear models on TF-IDF, we can inspect the weights of individual words/n-grams.

In [28]:
vec = clf.named_steps["tfidf"]
lr = clf.named_steps["lr"]

feature_names = np.array(vec.get_feature_names_out())
coef = lr.coef_[0]  # weights toward class 1 (sci.space); negative values favor rec.sport.hockey
top_pos = coef.argsort()[::-1][:10]
top_neg = coef.argsort()[:10]

print("Top positive features:")
for i in top_pos:
    print(f"{feature_names[i]:<20} {coef[i]:.3f}")

print("\nTop negative features:")
for i in top_neg:
    print(f"{feature_names[i]:<20} {coef[i]:.3f}")

Top positive features:
space                3.534
orbit                1.490
nasa                 1.481
shuttle              1.302
launch               1.298
earth                1.293
moon                 1.230
program              1.055
sky                  1.050
spacecraft           1.049

Top negative features:
game                 -3.524
hockey               -2.637
team                 -2.437
games                -2.385
season               -1.722
players              -1.522
play                 -1.475
espn                 -1.426
nhl                  -1.358
player               -1.218
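
To read these weights, recall that logistic regression turns the linear score w·x + b into a probability with the sigmoid. The sketch below plugs in four weights from the output above. The TF-IDF values of the two documents are hypothetical, and the intercept is assumed to be zero because `lr.intercept_` is not shown.

```python
import math

# Weights copied from the output above (positive values favor sci.space).
weights = {"space": 3.534, "nasa": 1.481, "game": -3.524, "hockey": -2.637}
intercept = 0.0  # assumption: the real value is lr.intercept_[0], not shown above

def prob_sci_space(tfidf_values):
    """Sigmoid of the linear score w·x + b over the listed features only."""
    score = intercept + sum(weights[t] * v for t, v in tfidf_values.items())
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical TF-IDF values for two made-up documents.
p_space = prob_sci_space({"space": 0.5, "nasa": 0.4})
p_hockey = prob_sci_space({"game": 0.5, "hockey": 0.4})
print(round(p_space, 3), round(p_hockey, 3))
```

Each feature contributes its weight times its TF-IDF value to the score, so a large weight matters only for documents in which the word actually occurs.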
