# Probabilistic methods

First, install and import the required packages and libraries.

In [1]:
# !pip install numpy scipy pandas matplotlib scikit-learn pyarrow

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

We're going to perform text topic classification using the [AG News Classification dataset](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset). It is derived from the [full AG News corpus](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html) and was used in the article:
> Zhang, Xiang, Junbo Zhao, and Yann LeCun. *"Character-level convolutional networks for text classification."* Advances in neural information processing systems 28 (2015). [link](https://proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf)

**Task 3 (0.5 points)**

1. Load the training and test sets from the files `ag_news_data_train.parquet` and `ag_news_data_test.parquet`.
2. Combine the title and the description, joining them with a space. Extract the texts into separate variables `texts_train` and `texts_test`.
3. Extract the classes into separate variables. They are numbered from 1 to 4; remap them so that they run from 0 to 3.
4. Check the sizes of the training and test sets.

In [3]:
train = pd.read_parquet("ag_news_data_train.parquet")
test = pd.read_parquet("ag_news_data_test.parquet")

train.head()

| Unnamed: 0 | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |


The classes in the training set are perfectly balanced: each of the 4 classes has 30000 examples.

In [4]:
train["Class Index"].value_counts()

3    30000
4    30000
2    30000
1    30000
Name: Class Index, dtype: int64

In [5]:
texts_train = train["Title"].str.cat(train["Description"], sep=" ")
texts_test = test["Title"].str.cat(test["Description"], sep=" ")

y_train = train["Class Index"] - 1
y_test = test["Class Index"] - 1

print(y_train.shape[0])
print(y_test.shape[0])

120000
7600


We're going to train and compare 3 classifiers:
   - `CountVectorizer` with binary features (word present or absent) + `BernoulliNB`
   - `CountVectorizer` with count features (number of word occurrences) + `MultinomialNB`
   - `TfidfVectorizer` + `MultinomialNB`
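
To make the difference between the three feature representations concrete, here is a minimal sketch on a toy corpus. The documents and the variable names (`toy_docs`, `X_binary`, `X_count`, `X_tfidf`) are made up for illustration and are not part of AG News or of the task.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from AG News.
toy_docs = ["oil prices soar oil", "stocks fall"]

binary_vec = CountVectorizer(binary=True)   # 1 if the word occurs, else 0
count_vec = CountVectorizer(binary=False)   # number of occurrences
tfidf_vec = TfidfVectorizer()               # counts reweighted by IDF, rows L2-normalized

X_binary = binary_vec.fit_transform(toy_docs).toarray()
X_count = count_vec.fit_transform(toy_docs).toarray()
X_tfidf = tfidf_vec.fit_transform(toy_docs).toarray()

# Column index of the word "oil" (all three vectorizers learn the same vocabulary here).
oil = count_vec.vocabulary_["oil"]

print("binary:", X_binary[0, oil])
print("count: ", X_count[0, oil])
print("tf-idf:", round(X_tfidf[0, oil], 3))
```

The word "oil" occurs twice in the first document, so its binary feature is 1 and its count feature is 2. The TF-IDF value is a real-valued weight; with the default settings each row is scaled to unit L2 norm.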

In [6]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

In [7]:
binomialPipe = Pipeline(
    [
        (
            "vectorizer",
            CountVectorizer(stop_words="english", binary=True, max_features=50000),
        ),
        # The vectorizer already outputs 0/1 features, so BernoulliNB skips binarization.
        ("nb", BernoulliNB(binarize=None)),
    ]
)
binomialPipe.fit(texts_train, y_train)

In [8]:


multinomialPipe = Pipeline(
    [
        (
            "vectorizer",
            CountVectorizer(stop_words="english", binary=False, max_features=50000),
        ),
        ("nb", MultinomialNB()),
    ]
)
multinomialPipe.fit(texts_train, y_train)

In [9]:
tfidfPipe = Pipeline(
    [
        (
            "vectorizer",
            TfidfVectorizer(stop_words="english", binary=False, max_features=50000),
        ),
        ("nb", MultinomialNB()),
    ]
)
tfidfPipe.fit(texts_train, y_train)

In [10]:
from sklearn.metrics import accuracy_score

y_pred_b = binomialPipe.predict(texts_test)
y_pred_m = multinomialPipe.predict(texts_test)
y_pred_t = tfidfPipe.predict(texts_test)


acc_b = accuracy_score(y_test, y_pred_b)

acc_m = accuracy_score(y_test, y_pred_m)

acc_t = accuracy_score(y_test, y_pred_t)

print(f"Accuracy b: {100 * acc_b:.2f}%")

print(f"Accuracy m: {100 * acc_m:.2f}%")

print(f"Accuracy t: {100 * acc_t:.2f}%")

Accuracy b: 90.00%
Accuracy m: 90.41%
Accuracy t: 90.43%


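These scores can still be improved. The models above use only single words (unigrams), so they ignore word order. Including longer n-grams lets the model capture some local context.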
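
As a small illustration of what n-grams add, the sketch below compares the vocabulary learned with unigrams only against unigrams plus bigrams. The sentence in `toy_docs` is a made-up example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus, used only to show the learned vocabulary.
toy_docs = ["oil prices soar"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(toy_docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

unigram_vocab = sorted(unigrams.vocabulary_)
uni_bi_vocab = sorted(uni_bi.vocabulary_)

print(unigram_vocab)  # ['oil', 'prices', 'soar']
print(uni_bi_vocab)   # ['oil', 'oil prices', 'prices', 'prices soar', 'soar']
```

With `ngram_range=(1, 2)` the vocabulary also contains pairs of adjacent words, which is why it grows quickly on a real corpus.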

We're going to tune the hyperparameters of the TF-IDF pipeline with multinomial Naive Bayes:
   - `ngram_range`: values `[(1, 1), (1, 2), (1, 3)]`
   - `max_features` (the maximum number of features, i.e. the vocabulary size): values `[50000, None]`
   - use a single validation set made of 25% of the training data, created with `ShuffleSplit`; remember to set `random_state=0`
   - select the model with the highest accuracy
   - set `n_jobs=1`, otherwise you may easily run out of memory

Then check the results on the test set for the best hyperparameters. Is the result better than the ones reported in the article for n-grams (see the table on page 6 of the article)?
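
The single validation split can be sketched on toy data as follows. The array `X_toy` is a hypothetical stand-in for the training texts; only the `ShuffleSplit` settings mirror the ones described above.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Hypothetical stand-in for the training data: 20 samples.
X_toy = np.arange(20).reshape(-1, 1)

# n_splits=1 yields exactly one train/validation split instead of k folds.
split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
splits = list(split.split(X_toy))
train_idx, val_idx = splits[0]

print(len(splits), len(train_idx), len(val_idx))  # 1 15 5
```

Using one split instead of k-fold cross-validation means each hyperparameter combination is fitted only once, which keeps the run time and memory use manageable on a dataset of this size.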

In [11]:
from sklearn.model_selection import ShuffleSplit, GridSearchCV

tfidfPipe = Pipeline(
    [
        ("vectorizer", TfidfVectorizer(stop_words="english", binary=False)),
        ("nb", MultinomialNB()),
    ]
)

param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vectorizer__max_features": [50000, None],
}

split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

cv = GridSearchCV(
    estimator=tfidfPipe,
    param_grid=param_grid,
    scoring="accuracy",
    cv=split,
    n_jobs=1,
)
cv.fit(texts_train, y_train)

y_pred = cv.predict(texts_test)
acc = accuracy_score(y_test, y_pred)

print(f"Optimal hyperparameters: {cv.best_params_}")
print(f"Accuracy: {100 * acc:.2f}%")

Optimal hyperparameters: {'vectorizer__max_features': None, 'vectorizer__ngram_range': (1, 2)}
Accuracy: 91.22%
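
To see how every combination in the grid scored, and not only the best one, the `cv_results_` attribute of `GridSearchCV` can be loaded into a DataFrame. The sketch below does this on a made-up toy corpus so that it runs quickly on its own. The names `toy_texts`, `toy_labels`, `toy_cv` and the reduced grid are assumptions for illustration, not the setup used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (0 = business, 1 = sport), not taken from AG News.
toy_texts = [
    "oil prices soar", "stocks fall sharply", "markets rally again",
    "bank profits rise", "team wins final", "striker scores twice",
    "coach praises team", "club signs striker",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([("vectorizer", TfidfVectorizer()), ("nb", MultinomialNB())])
toy_grid = {"vectorizer__ngram_range": [(1, 1), (1, 2)]}
toy_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

toy_cv = GridSearchCV(toy_pipe, toy_grid, scoring="accuracy", cv=toy_split, n_jobs=1)
toy_cv.fit(toy_texts, toy_labels)

# One row per hyperparameter combination, with its validation accuracy and rank.
results = pd.DataFrame(toy_cv.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results)
```

With the default `refit=True`, `GridSearchCV` refits the best combination on the whole training set after the search, so calling `predict` on the fitted search object uses that refitted model.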
