# **Text Classification**

## **Supervised Classification**
Learn a **classification model** on properties ("features") and their importance ("weights") from labeled instances.
* $X$: Set of attributes or features $\{x_1, x_2, ..., x_n\}$
* $y$: A "class" label from the label set $Y = \{y_1, y_2, ..., y_k\}$

#### **Classification Paradigms**
* **Binary Classification**: When there are only two possible classes; $|Y| = 2$
* **Multi-class Classification**: When there are more than two possible classes; $|Y| > 2$

#### **Questions to ask in Supervised Learning**
* Training phase:
    * What are the features? How to represent them?
    * What is the classification model/algorithm?
    * What are the model parameters?
* Inference phase:
    * What is the expected performance? What is a good measure?
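
As one possible answer to "what is a good measure?", the sketch below computes precision, recall, F1 and accuracy by hand for a binary task. The function name `precision_recall_f1` and the toy label lists are illustrative assumptions, not part of the original notes.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 with one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy (made-up) gold labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```

Accuracy is often enough when classes are balanced; precision, recall and F1 tend to be more informative when one class is rare.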

#### **Types of textual features**
* Words
    * By far the most common class of features
    * Handling commonly-occurring words: Stop words
    * Normalization: Make lower case vs. leave as-is
    * Stemming / Lemmatization
* Characteristics of words: Capitalization
* Parts of speech of words in a sentence
* Grammatical structure, sentence parsing
* Grouping words of similar meaning, semantics
* Depending on the classification task, features may come from inside words and word sequences
    * bigrams, trigrams, n-grams (e.g. White House)
    * character sub-sequences in words (e.g. "ing", "ion", ...)
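
The word n-gram and character sub-sequence features listed above can be sketched in a few lines of plain Python. The helper names `word_ngrams` and `char_ngrams` and the example sentence are assumptions made for illustration; whitespace tokenization with lowercasing is a deliberately simple choice.

```python
def word_ngrams(text, n):
    """Return all contiguous word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """Return all contiguous character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


sentence = "The White House issued a statement"
unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)            # includes 'white house'
char_trigrams = char_ngrams("running", 3)     # includes the suffix 'ing'

print(bigrams)
print(char_trigrams)
```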

---
## **Naive Bayes Classifiers**
Naive Bayes applies Bayes' rule to update the probability of a class given new information (the observed features):
* $Posterior\ probability = {{Prior\ probability \times Likelihood} \over Evidence}$

* $Pr(y|X) = {{Pr(y) \times Pr(X|y)}\over Pr(X)}$

* $y^* = argmax_y\ Pr(y|X) = argmax_y\ Pr(y) \times Pr(X|y)$

* **Naive assumption**: Given the class label, features are assumed to be independent of each other

* $\displaystyle y^* = argmax_y\ Pr(y|X) = argmax_y\ Pr(y) \times \prod_{i=1}^{n} Pr(x_i|y)$

* The predicted label $y^*$ is the $y$ that maximizes $Pr(y|X)$. By Bayes' rule this is proportional to the prior $Pr(y)$ times the likelihood $Pr(X|y)$; the evidence $Pr(X)$ is the same for every $y$, so it can be dropped from the $argmax$. Under the naive assumption, the likelihood factorizes into the product of the $n$ individual feature probabilities $Pr(x_i|y)$.
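
The decision rule above can be sketched as follows. The class names, priors and likelihood values are made-up numbers chosen only for illustration, and the `predict` helper is hypothetical. Summing logarithms instead of multiplying probabilities is a common way to avoid numerical underflow and does not change the $argmax$.

```python
import math

# Hypothetical parameters: Pr(y) and Pr(x_i | y) for two classes
priors = {"business": 0.5, "scitech": 0.5}
likelihoods = {
    "business": {"stocks": 0.30, "software": 0.05, "market": 0.40},
    "scitech": {"stocks": 0.05, "software": 0.40, "market": 0.10},
}


def predict(features, priors, likelihoods):
    """Return the label maximizing log Pr(y) + sum_i log Pr(x_i | y), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            score += math.log(likelihoods[label][feature])
        scores[label] = score
    return max(scores, key=scores.get), scores


label, scores = predict(["stocks", "market"], priors, likelihoods)
print(label, scores)
```

Exponentiating a score recovers the unnormalized posterior $Pr(y) \times \prod_i Pr(x_i|y)$; for the example above that is $0.5 \times 0.30 \times 0.40 = 0.06$ for `business`.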

#### **What are the parameters?**
* **Prior probabilities**: $Pr(y)$ for all $y$ in $Y$
    * Count the number of instances of each class
    * If there are $N$ instances in all, and $n$ of those are labeled as class $y$, then $Pr(y) = {n \over N}$

* **Likelihood**: $Pr(x_i|y)$ for all features $x_i$ and labels $y$ in $Y$
    * Count how many times feature $x_i$ appears in instances labeled as class $y$
    * If there are $p$ instances of class $y$, and $x_i$ appears in $k$ of those then $Pr(x_i|y) = {k \over p}$
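
The two counting rules above can be sketched on a toy corpus. The four documents and their labels are invented for illustration, and the helper `likelihood` is a hypothetical name; a feature is counted at most once per instance, matching the definition of $k$ above.

```python
from collections import Counter, defaultdict

# Toy labeled corpus (made up for illustration)
docs = [
    ("stocks rally as market opens", "business"),
    ("market falls on oil prices", "business"),
    ("new software update released", "scitech"),
    ("software market grows", "scitech"),
]

# Priors: Pr(y) = n / N
class_counts = Counter(label for _, label in docs)
priors = {label: count / len(docs) for label, count in class_counts.items()}

# Likelihoods: Pr(x_i | y) = k / p, where k is the number of class-y instances containing x_i
doc_freq = defaultdict(Counter)
for text, label in docs:
    for word in set(text.split()):   # set(): count each word once per instance
        doc_freq[label][word] += 1


def likelihood(word, label):
    """Unsmoothed estimate of Pr(word | label) for a label seen in training."""
    return doc_freq[label][word] / class_counts[label]


print(priors)
print(likelihood("market", "business"), likelihood("market", "scitech"))
print(likelihood("software", "business"))   # 0.0: never seen in this class
```

The zero estimate in the last line is the situation that the Smoothing section addresses.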

#### **Smoothing**
* What happens if $Pr(x_i|y) = 0$ ?
    * If feature $x_i$ never occurs in documents labeled as $y$, then $Pr(x_i|y) = 0$ and the whole product, and with it the posterior probability $Pr(y|X)$, becomes $0$ for any instance that contains $x_i$
* **Laplace smoothing** or **Additive smoothing**: Add a dummy count so that every word starts with a count of $1$ for every class
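    * $Pr(x_i|y) = {{k + 1}\over{p + n}}$ where $n$ is the number of features. For the smoothed values to sum to $1$ over the features, $k$ and $p$ are read here as occurrence counts: $x_i$ occurs $k$ times among the $p$ feature occurrences in class $y$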
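
A minimal sketch of add-one smoothing, assuming the multinomial reading of the formula: $k$ is the number of times a word occurs in class $y$, $p$ is the total number of word occurrences in class $y$, and $n$ is the vocabulary size. The documents, the vocabulary and the `smoothed` helper (with its `alpha` parameter generalizing "add one") are illustrative assumptions.

```python
from collections import Counter

# Toy documents of a single class, plus a vocabulary that also contains an unseen word
class_docs = ["stocks rally as market opens", "market falls on oil prices"]
vocabulary = sorted({"stocks", "rally", "as", "market", "opens",
                     "falls", "on", "oil", "prices", "software"})

word_counts = Counter(word for doc in class_docs for word in doc.split())
p = sum(word_counts.values())   # total word occurrences in the class
n = len(vocabulary)             # number of features


def smoothed(word, alpha=1):
    """Additive smoothing: (k + alpha) / (p + alpha * n); alpha=1 is Laplace smoothing."""
    return (word_counts[word] + alpha) / (p + alpha * n)


print(smoothed("market"))     # seen three times in this class
print(smoothed("software"))   # unseen, but no longer zero
print(sum(smoothed(word) for word in vocabulary))
```

With these counts, an unseen word receives a small non-zero probability and the smoothed estimates still form a proper distribution over the vocabulary.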

#### **Important Concepts**
* Naive Bayes is a probabilistic model
* It is called naive because it assumes features are independent of each other, given the class label
* For text classification problems, Naive Bayes models typically provide very strong baselines 
* Simple model, easy to learn parameters

#### **Naive Bayes variations**
* **Multinomial Naive Bayes**
    * Data follows a multinomial distribution
    * Each feature value is a count or a count-like weight (e.g. word occurrence count; TF-IDF weights are not counts, but are often used in practice)
* **Bernoulli Naive Bayes**
    * Data follows a multivariate Bernoulli distribution
    * Each feature is binary (e.g. word is present / absent)
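
The difference between the two variants shows up in how the same document is encoded. The sketch below builds both representations by hand and then evaluates a Bernoulli likelihood, in which absent words also contribute a factor $1 - Pr(x_i = 1|y)$. The vocabulary, the document and the `presence_prob` values are made up for illustration.

```python
import math
from collections import Counter

vocabulary = ["market", "software", "stocks", "update"]
document = "market stocks market market update"

counts = Counter(document.split())

# Multinomial view: how many times each vocabulary word occurs
multinomial_features = [counts[word] for word in vocabulary]

# Bernoulli view: whether each vocabulary word occurs at all
bernoulli_features = [1 if counts[word] > 0 else 0 for word in vocabulary]

# Hypothetical Pr(x_i = 1 | y) for one class
presence_prob = {"market": 0.8, "software": 0.1, "stocks": 0.6, "update": 0.2}

# Bernoulli log-likelihood: present words contribute p, absent words contribute 1 - p
bernoulli_log_likelihood = sum(
    math.log(presence_prob[word] if present else 1 - presence_prob[word])
    for word, present in zip(vocabulary, bernoulli_features)
)

print(multinomial_features)
print(bernoulli_features)
print(math.exp(bernoulli_log_likelihood))
```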


In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.metrics import f1_score

train_df = pd.read_csv('../datasets/News_Classification/train.csv')
test_df = pd.read_csv('../datasets/News_Classification/test.csv')

# keep only the Business (3) and Sci/Tech (4) news; .copy() avoids modifying a view of the original frame
train_df = train_df[train_df['Class Index'].isin([3, 4])].copy()
test_df = test_df[test_df['Class Index'].isin([3, 4])].copy()

# map labels 3 -> 0 and 4 -> 1 (assign the result instead of using inplace=True on a column selection)
replace_labels = {3: 0, 4: 1}
train_df['Class Index'] = train_df['Class Index'].replace(replace_labels)
test_df['Class Index'] = test_df['Class Index'].replace(replace_labels)

# concatenate title and description into a single text field
train_df['Description'] = train_df['Title'].astype(str) + ' ' + train_df['Description'].astype(str)
test_df['Description'] = test_df['Title'].astype(str) + ' ' + test_df['Description'].astype(str)

# gets X and y data
X_train, y_train = train_df['Description'].to_list(), train_df['Class Index'].to_list()
X_test, y_test = test_df['Description'].to_list(), test_df['Class Index'].to_list()

print('TRAIN DATA:')
train_df.info()  # info() prints its summary itself and returns None

print('\nTEST DATA:')
test_df.info()

TRAIN DATA:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60000 entries, 0 to 119981
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Class Index  60000 non-null  int64 
 1   Title        60000 non-null  object
 2   Description  60000 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.8+ MB

TEST DATA:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3800 entries, 0 to 7599
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Class Index  3800 non-null   int64 
 1   Title        3800 non-null   object
 2   Description  3800 non-null   object
dtypes: int64(1), object(2)
memory usage: 118.8+ KB

In [2]:
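# bag of word uni-, bi- and trigrams, keeping the 1500 most frequent ones
count_vectorizer = CountVectorizer(stop_words='english', max_features=1500, ngram_range=(1, 3))
X_train_count = count_vectorizer.fit_transform(X_train)  # learn the vocabulary on the training set only
X_test_count = count_vectorizer.transform(X_test)        # reuse that vocabulary for the test set

multNB = MultinomialNB()
multNB.fit(X_train_count, y_train)

y_pred_mult = multNB.predict(X_test_count)
# for single-label classification, micro-averaged F1 equals accuracy
print(f1_score(y_test, y_pred_mult, average='micro'))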

0.8721052631578947


In [3]:
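# HashingVectorizer is stateless, so there is nothing to fit; norm=None and
# alternate_sign=False keep the hashed features as plain 0/1 presence indicators
hash_vectorizer = HashingVectorizer(stop_words='english', binary=True, n_features=1500,
                                    ngram_range=(1, 3), norm=None, alternate_sign=False)
X_train_hash = hash_vectorizer.transform(X_train)
X_test_hash = hash_vectorizer.transform(X_test)

berNB = BernoulliNB()
berNB.fit(X_train_hash, y_train)

y_pred_ber = berNB.predict(X_test_hash)
# the score recorded below came from the original cell, which reused count_vectorizer
# for both transforms; a rerun with the hashed features may print a different value
print(f1_score(y_test, y_pred_ber, average='micro'))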

0.8623684210526316
